Machine Learning

Build up your knowledge system day to day; keep adding to it by reading papers and running experiments.

Key Concepts

  • Inductive bias

  • Data distribution: IID (independent and identically distributed)

Interview Requirements

  • Be familiar with common models: their principles, code, practical application, pros and cons, and common interview questions

  • The scope covers ML breadth, ML depth, ML application, and coding

    • Expect repeated follow-up "why" questions, e.g. why does a particular trick work?

    • The math behind each algorithm: write out its main formulas and be able to derive them on a whiteboard

    • For newer areas, expect questions about paper details

    • How each algorithm scales and how to map-reduce it

    • Each algorithm's complexity, parameter count, and compute cost

Examples

  • Hand-code basic algorithms, plus follow-ups on optimizations (a minimal sketch appears after this list)

    • Implement a two-layer fully connected network

    • Hand-code the backpropagation of softmax

    • Hand-code AUC

    • Hand-code SGD

    • Implement dropout, forward and backward

    • Random sample with weights

    • Implement focal loss

    • Hand-code n-gram

    • Hand-code multi-head attention

    • Vision: hand-code IoU / NMS

    • NLP: hand-code a tokenizer
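As one concrete instance of the hand-coding questions above, here is a minimal NumPy sketch (the layer sizes, learning rate, and data are made-up assumptions) of a two-layer fully connected network with softmax cross-entropy, manual backpropagation, and an SGD update:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(x, params):
    h = np.maximum(0, x @ params["W1"] + params["b1"])   # ReLU hidden layer
    logits = h @ params["W2"] + params["b2"]
    return h, softmax(logits)

def sgd_step(x, y, params, lr=0.1):
    """One SGD step on a batch; y holds integer class labels."""
    n = x.shape[0]
    h, probs = forward(x, params)
    loss = -np.log(probs[np.arange(n), y]).mean()

    # Backprop: gradient of cross-entropy w.r.t. the logits is (probs - one_hot) / n
    dlogits = probs.copy()
    dlogits[np.arange(n), y] -= 1
    dlogits /= n
    grads = {"W2": h.T @ dlogits, "b2": dlogits.sum(axis=0)}
    dh = (dlogits @ params["W2"].T) * (h > 0)             # ReLU gradient
    grads["W1"] = x.T @ dh
    grads["b1"] = dh.sum(axis=0)

    for k in params:                                      # SGD parameter update
        params[k] -= lr * grads[k]
    return loss

rng = np.random.default_rng(0)
params = {
    "W1": rng.normal(0, 0.1, (20, 64)), "b1": np.zeros(64),
    "W2": rng.normal(0, 0.1, (64, 3)),  "b2": np.zeros(3),
}
x, y = rng.normal(size=(32, 20)), rng.integers(0, 3, 32)
for _ in range(5):
    print(sgd_step(x, y, params))                         # loss should decrease
```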

  • Extensions (a counting sketch follows this list)

    • Given the architecture of an LSTM network, compute how many parameters it has

    • How is a convolution layer's output size computed? Write out the formula

    • Design a sparse matrix (supporting addition, subtraction, multiplication, etc.)
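For the two counting questions above, a quick sketch using the standard formulas (the concrete sizes are made-up examples, not from these notes):

```python
# Convolution output size: out = floor((W - K + 2P) / S) + 1
def conv_out_size(w, k, p, s):
    return (w - k + 2 * p) // s + 1

print(conv_out_size(224, 3, 1, 1))   # 224: a "same" 3x3 convolution with stride 1

# LSTM parameters: 4 gates, each with input weights, recurrent weights, and a bias
# -> 4 * (input_size * hidden_size + hidden_size^2 + hidden_size)
def lstm_params(input_size, hidden_size):
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + hidden_size)

print(lstm_params(128, 256))         # 394240 for an assumed 128 -> 256 LSTM layer
```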

  • How to fix over-fitting / under-fitting in a neural network (a small regularization sketch follows this list)

    • Over-fitting:

      • From the data side, collect more training data; failing that, use data augmentation.

      • Reduce model complexity, e.g. the number and width of layers in a neural network, or tree depth and pruning in tree models. Apply regularization such as an L2 penalty. Use ensemble methods such as bagging.

      • Cross-validation to detect over-fitting.

      • Train with more data.

      • Data augmentation.

      • Feature selection.

      • Early stopping.

      • Regularization.

      • Ensemble methods.

      • Pretrained model.

    • Under-fitting:

      • Add new features, increase model complexity, reduce the regularization coefficient.

      • The first step of training a model is making sure it can overfit at all.
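A minimal PyTorch sketch (dummy data, assumed layer sizes) of the usual anti-overfitting knobs named above: dropout inside the model, an L2 penalty via weight_decay, and early stopping on a validation loss:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x_train, y_train = torch.randn(256, 20), torch.randint(0, 2, (256,))
x_val, y_val = torch.randn(64, 20), torch.randint(0, 2, (64,))

model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Dropout(p=0.5),                                    # dropout regularization
    nn.Linear(64, 2),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)  # L2 penalty
loss_fn = nn.CrossEntropyLoss()

best_val, patience, bad = float("inf"), 3, 0
for epoch in range(100):
    model.train()
    opt.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    opt.step()

    model.eval()                                          # eval mode disables dropout
    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val).item()
    if val_loss < best_val:
        best_val, bad = val_loss, 0
    else:
        bad += 1
        if bad >= patience:                               # early stopping
            break
```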

  • How to handle class imbalance (a downsample-and-upweight sketch follows this list)

    • For classification with a long-tail label distribution, keep only the head labels that cover ~80% of the data and mark the rest as "others"

    • If the data is truly extreme (99.99% vs 0.01%, e.g. spam), fall back to other approaches such as outlier detection

    • The discussion can then be extended to the difficulty of individual samples

    • Evaluation metric: AP (average_precision_score)

    • Downsampling

      • Faster convergence and less disk space, but it must be paired with calibration, i.e. upweighting the downsampled class

    • Upweighting

      • Upweight the downsampled class so that every original example contributes equally to the loss
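A small sketch of the downsample-and-upweight recipe (made-up data and a placeholder model output; the 1% positive rate and 10x downsampling factor are assumptions):

```python
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)
labels = (rng.random(100_000) < 0.01).astype(int)             # ~1% positives (assumed rate)

keep_frac = 0.1                                                # downsample negatives 10x
keep = (labels == 1) | (rng.random(labels.size) < keep_frac)   # keep all positives
weights = np.where(labels == 0, 1.0 / keep_frac, 1.0)[keep]    # upweight the kept negatives

logits = torch.zeros(int(keep.sum()))                          # placeholder model outputs
targets = torch.tensor(labels[keep], dtype=torch.float32)
loss = nn.BCEWithLogitsLoss(reduction="none")(logits, targets)
loss = (loss * torch.tensor(weights, dtype=torch.float32)).mean()
print(loss)
```

The 1/keep_frac weight on the kept negatives restores the original class prior in expectation, which is what keeps the model's predicted probabilities calibrated after downsampling.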

  • How to handle missing data

  • How to handle high-cardinality categorical features

  • Optimizers: how to choose an optimizer (see the sketch after this list)

    • MSE or log-likelihood loss + gradient descent (GD)

    • SGD: when the training data is very large

    • Adam: sparse input
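To make the choice concrete, a PyTorch sketch (assumed model and learning rates) of constructing the two optimizers; SGD is cheap per step and scales to huge datasets, while Adam's per-parameter adaptive learning rates tend to help with sparse inputs such as embeddings:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1000, 64), nn.ReLU(), nn.Linear(64, 1))

sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)   # large-scale training
adam = torch.optim.Adam(model.parameters(), lr=1e-3)              # adaptive per-parameter steps
# For very large sparse embedding tables, torch.optim.SparseAdam with sparse
# gradients is the usual pairing.
```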

  • Data collection

    • Production data and labels

    • Internet dataset

  • How to handle distribution mismatch

    • The distribution in question is not only the feature distribution but also the label distribution. For labels, the main remedy is collecting more data, which again comes down to balancing the data.

    • When the data distribution changes, the answer is automatic retraining and automatic deployment; if performance drops too much, fall back to manual intervention and retraining

  • Recommendation systems: scaling, A/B testing, trouble-shooting

  • How to reduce model latency (a quantization sketch follows this list)

    • Use a smaller model

    • Knowledge distillation

    • Squeeze the model to 8-bit or 4-bit
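One way to realize the 8-bit option above is post-training dynamic quantization; a PyTorch sketch with a toy model (sizes are assumptions):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Convert Linear layers to int8 weights; activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)   # same interface as the original model, smaller and faster on CPU
```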

  • Generative vs Discriminative

    • A generative model learns what the data in each category looks like, while a discriminative model simply learns the distinction between the categories.

    • Discriminative models will generally outperform generative models on classification tasks. Discriminative model learns the predictive distribution p(y|x) directly while generative model learns the joint distribution p(x, y) then obtains the predictive distribution based on Bayes' rule.
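To make the last point concrete, the two routes to the predictive distribution (standard textbook formulas, not specific to these notes):

```latex
% Generative: model the joint p(x, y) = p(x \mid y)\, p(y), then apply Bayes' rule
p(y \mid x) = \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')}
% Discriminative: model p(y \mid x) directly, e.g. logistic regression.
```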

  • The bias-variance tradeoff is a central problem in supervised learning

    • Ideally, one wants to choose a model that both accurately captures the regularities in its training data and generalizes well to unseen data. Unfortunately, it is typically impossible to do both simultaneously.

    • High-variance learning methods may be able to represent their training set well but are at risk of overfitting to noisy or unrepresentative training data.

    • In contrast, algorithms with high bias typically produce simpler models that don't tend to overfit but may underfit their training data, failing to capture important regularities.
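The tradeoff is usually summarized by the standard decomposition of the expected squared error for a regression target y = f(x) + ε with noise variance σ²:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{variance}}
  + \sigma^2
```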

  • Parallelizing models (a map-reduce gradient sketch follows this list)

    • Linear / logistic regression

    • xgboost

    • cnn

    • RNN

    • transformer

    • In deep learning frameworks, the multiplication of a single tensor is automatically parallelized internally
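As one concrete instance of "map-reduce the algorithm", a NumPy sketch (toy data, assumed learning rate) of logistic-regression training where each data shard computes a partial gradient (map) and the partials are summed (reduce) before a single parameter update:

```python
import numpy as np

def partial_grad(w, X, y):
    p = 1.0 / (1.0 + np.exp(-X @ w))     # per-shard predictions
    return X.T @ (p - y)                  # un-normalized gradient contribution of this shard

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1_000, 5)), rng.integers(0, 2, 1_000)
shards = np.array_split(np.arange(len(y)), 4)   # pretend these index sets live on 4 workers

w = np.zeros(5)
for _ in range(100):
    grads = [partial_grad(w, X[idx], y[idx]) for idx in shards]   # map
    w -= 0.1 * np.add.reduce(grads) / len(y)                      # reduce + update
print(w)
```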

References
