Machine Learning
Interviews take some preparation and technique, but the real work happens outside the interview room. Keep building your knowledge system day to day, and let papers and experiments keep adding bricks to it. This chapter focuses on the theory side; for system design, see 3.3 Machine Learning System Design.
1. Interview Requirements
Be familiar with common models: their principles, code, practical use, strengths and weaknesses, and common pitfalls
Inductive bias; the i.i.d. (independent and identically distributed) data assumption
Scope covers ML breadth, ML depth, ML application, and coding
Know the math behind each algorithm: write out the key formulas and derive them on a whiteboard
Newer areas such as large language models may be probed down to paper-level details
Expect persistent follow-ups: why? why does a given trick work?
How each algorithm scales, and how to cast it as map-reduce
The complexity, parameter count, and compute cost of each algorithm
2. Canonical Interview Questions
For model-specific details and canonical questions, see the individual model pages.
Generative vs Discriminative
A generative model learns what each category of data looks like, while a discriminative model only learns the distinction between categories.
Discriminative models generally outperform generative models on classification tasks. A discriminative model learns the predictive distribution p(y|x) directly, while a generative model learns the joint distribution p(x, y) and then derives the predictive distribution via Bayes' rule.
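A minimal sketch of the contrast, assuming scikit-learn: GaussianNB is generative (it models p(x|y) and p(y) per class) and LogisticRegression is discriminative (it fits p(y|x) directly); the synthetic dataset is only for illustration.

```python
# Generative (GaussianNB models p(x|y)p(y)) vs.
# discriminative (LogisticRegression models p(y|x) directly).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gen = GaussianNB().fit(X_tr, y_tr)                         # class-conditional densities
disc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # decision boundary only

print("generative    :", gen.score(X_te, y_te))
print("discriminative:", disc.score(X_te, y_te))
```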
The bias-variance tradeoff
Bias-variance decomposition: Error = Bias² + Variance + Irreducible Error
Ideally, one wants to choose a model that both accurately captures the regularities in its training data and generalizes well to unseen data. Unfortunately, it is typically impossible to do both simultaneously.
High-variance learning methods may be able to represent their training set well but are at risk of overfitting to noisy or unrepresentative training data.
In contrast, algorithms with high bias typically produce simpler models that don't tend to overfit but may underfit their training data, failing to capture important regularities.
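For squared error the decomposition can be derived explicitly. A standard sketch, assuming y = f(x) + ε with zero-mean noise of variance σ² and a model f̂ trained on a random dataset D:

```latex
\mathbb{E}_{D,\varepsilon}\!\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\left(f(x) - \mathbb{E}_D[\hat{f}(x)]\right)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\left(\hat{f}(x) - \mathbb{E}_D[\hat{f}(x)]\right)^2\right]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible error}}
```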
How to address overfitting
Quick check: underfitting means large training error and large generalization error; overfitting means small training error but large generalization error
Data: collect more training data; failing that, use data augmentation, or start from a pretrained model
Features: feature selection
Model:
Reduce model complexity, e.g. fewer or narrower layers in a neural network, shallower trees or pruning for tree models
Regularization, e.g. an L2 penalty, dropout
Ensemble methods, e.g. bagging
Training: early stopping, weight decay (see the sketch after this list)
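A minimal sketch combining several of these remedies, assuming PyTorch; `train_loader` and `val_loss` are hypothetical placeholders for your data pipeline and validation routine.

```python
# Dropout + weight decay + early stopping in one training loop.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Dropout(p=0.5),              # regularization: randomly zero activations
    nn.Linear(64, 1),
)
# weight_decay adds a decoupled L2-style penalty on the weights
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

best, patience, bad = float("inf"), 5, 0
for epoch in range(100):
    model.train()
    for xb, yb in train_loader:     # hypothetical DataLoader
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(xb).squeeze(-1), yb)
        loss.backward()
        opt.step()
    v = val_loss(model)             # hypothetical validation routine
    if v < best:
        best, bad = v, 0
    else:
        bad += 1
        if bad >= patience:         # early stopping on the validation metric
            break
```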
How to address underfitting
Features: add new features
Model: increase model complexity, reduce the regularization coefficient
Training: the first step of training is making sure the model can overfit at all; increase the number of epochs
How to handle class imbalance
Evaluation metric: AP (average_precision_score)
Downsampling: faster convergence, saves disk space, requires probability calibration afterwards. The how-much-data question extends naturally to how hard the samples are (easy vs. hard examples)
Upweighting: so that every sample contributes to the loss equally (see the sketch after this list)
Long-tail classification: keep only the head labels covering ~80% of the data and mark the rest as "others"
Extreme imbalance (99.99% vs. 0.01%): use outlier-detection methods
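A small sketch of the upweighting route, assuming scikit-learn: `class_weight="balanced"` rescales each class inversely to its frequency, and AP is the evaluation metric; the synthetic 95/5 split is illustrative.

```python
# Upweight the minority class via class weights; evaluate with AP.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# "balanced" weights each class by n_samples / (n_classes * class_count)
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
print("AP:", average_precision_score(y_te, clf.predict_proba(X_te)[:, 1]))
```

If you downsample negatives instead, remember that predicted probabilities are biased upward and need re-calibration before serving.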
How to handle missing data: imputation (mean/median/mode or a model-based fill), missing-value indicator features, or models that handle missing values natively (e.g. XGBoost)
When labeled data is scarce: transfer learning from pretrained models, semi-supervised learning / self-training, data augmentation, active learning
How to handle high-cardinality categorical features (see the sketch after this list)
Feature Hashing
Target Encoding
Clustering Encoding
Embedding Encoding
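A sketch of the first two encodings, assuming scikit-learn and pandas; the tiny DataFrame and the smoothing constant m are illustrative.

```python
# Two ways to encode a high-cardinality categorical column.
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({"city": ["paris", "tokyo", "paris", "lima"],
                   "y":    [1, 0, 1, 0]})

# 1) Feature hashing: fixed-width output, no vocabulary to store.
hasher = FeatureHasher(n_features=16, input_type="string")
X_hash = hasher.transform(df["city"].apply(lambda v: [v])).toarray()

# 2) Target encoding: replace each category with a smoothed mean of y.
prior, m = df["y"].mean(), 10.0        # m controls smoothing toward the prior
stats = df.groupby("city")["y"].agg(["mean", "count"])
enc = (stats["mean"] * stats["count"] + prior * m) / (stats["count"] + m)
df["city_te"] = df["city"].map(enc)
```

Note that target encoding must be fit on training folds only; otherwise it leaks the label into the feature.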
How to choose an optimizer
Pair the loss with gradient descent: MSE or log-likelihood + GD
SGD: when the training data is too large for full-batch updates
Adam: adaptive per-parameter step sizes, well suited to sparse inputs (see the sketch after this list)
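A small sketch of these choices in PyTorch; the sizes and learning rates are illustrative, not recommendations.

```python
# Common torch.optim choices and when they are typically used.
import torch

params = [torch.nn.Parameter(torch.randn(10, 10))]
sgd  = torch.optim.SGD(params, lr=0.1, momentum=0.9)   # large data, tuned schedules
adam = torch.optim.Adam(params, lr=1e-3)               # adaptive steps, robust default

# For sparse gradients (e.g. embedding lookups), SparseAdam pairs with sparse grads.
emb = torch.nn.Embedding(10000, 32, sparse=True)
sparse_adam = torch.optim.SparseAdam(list(emb.parameters()), lr=1e-3)
```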
How to address gradient vanishing & exploding
Vanishing gradients:
activation functions such as ReLU
residual networks (skip connections)
batch normalization
Exploding gradients:
gradient clipping (see the sketch after this list)
LSTM gates
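A minimal gradient-clipping sketch, assuming PyTorch; the LSTM shapes and the dummy loss are only for illustration.

```python
# Clip the global gradient norm before each optimizer step.
import torch

model = torch.nn.LSTM(input_size=32, hidden_size=64)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(5, 8, 32)            # (seq_len, batch, features)
out, _ = model(x)
loss = out.pow(2).mean()             # dummy loss for illustration
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # caps exploding gradients
opt.step()
```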
Data collection
production data and labels
public Internet datasets
How to handle distribution mismatch
Distribution problems involve both features and labels. For labels, collect as much data as you can; it again reduces to the data-balancing problem.
When the data distribution drifts, set up automatic retraining and deployment; if performance drops too far, intervene manually and retrain (a drift-check sketch follows below).
Leaky (time-traveling) features can also masquerade as distribution mismatch; fix those by preventing leakage at the source.
Online/offline inconsistency
model behavior in production: data/feature distribution drift, feature bugs
model generalization: offline metrics alignment
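One way to make the "performance drop" trigger concrete is a statistical drift check on feature distributions. A sketch assuming SciPy; the window sizes and the p-value threshold are assumptions to tune per feature.

```python
# Flag feature drift with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=50000)   # offline training snapshot
prod_feature  = rng.normal(0.3, 1.0, size=5000)    # recent production window

stat, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.01:                                  # illustrative threshold
    print(f"drift detected (KS={stat:.3f}); consider retraining")
```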
Curse of dimensionality
Feature selection
PCA (see the sketch after this list)
Embeddings
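A minimal PCA sketch, assuming scikit-learn; passing a float to `n_components` keeps just enough components to explain that fraction of the variance.

```python
# Shrink dimensionality with PCA, keeping 95% of the variance.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)        # 64-dimensional inputs
pca = PCA(n_components=0.95)               # keep components explaining 95% of variance
X_low = pca.fit_transform(X)
print(X.shape, "->", X_low.shape)
```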
How to reduce model latency
Smaller models
Knowledge distillation
Quantize (squeeze) the model to 8-bit or 4-bit (see the sketch after this list)
Model parallelism — think through it per model family:
linear/logistic regression
xgboost
CNN
RNN
transformer
Within deep learning frameworks, a single tensor multiplication is already parallelized internally.
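A minimal post-training quantization sketch, assuming a recent PyTorch with the `torch.ao.quantization` namespace; the toy model is illustrative.

```python
# Post-training dynamic quantization of Linear layers to int8.
import torch

model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU(),
                            torch.nn.Linear(256, 10)).eval()
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)   # weights stored as int8

x = torch.randn(1, 256)
print(qmodel(x).shape)                             # same interface, smaller/faster on CPU
```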
3. Hand-Written ML Code Examples
Write softmax backpropagation by hand
Implement AUC by hand (a sketch follows after these lists)
Implement SGD by hand
Write a two-layer fully connected network by hand
How do you compute a convolution layer's output size? Write out the formula: out = ⌊(n + 2p − k) / s⌋ + 1 per spatial dimension, for input size n, padding p, kernel size k, stride s
Implement dropout, forward and backward
Implement focal loss
Write an LSTM by hand
Given an LSTM network's architecture, compute its parameter count
NLP:
Write an n-gram model by hand
Write a tokenizer by hand
Explain positional encoding on a whiteboard
Implement multi-head attention (MHA) by hand
Vision:
Implement IoU / NMS by hand
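As one worked example from the list above, a NumPy sketch of AUC via the Mann-Whitney rank-sum formulation, handling tied scores with average ranks.

```python
import numpy as np

def auc(y_true, y_score):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation, with tie handling."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    order = np.argsort(y_score)
    sorted_scores = y_score[order]
    ranks = np.empty(len(y_score))
    i = 0
    while i < len(sorted_scores):
        j = i
        # extend j over the run of tied scores, then assign the average rank
        while j + 1 < len(sorted_scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1   # 1-based average rank
        i = j + 1
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    # U statistic of the positives, normalized by the number of pos/neg pairs
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

For instance, auc([0, 1, 1, 0], [0.1, 0.8, 0.8, 0.3]) returns 1.0, since every positive outranks every negative.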