k-Nearest Neighbors (kNN)
Use cases
Product recommendations and recommendation engines
Relevance ranking in natural language processing (NLP) algorithms
Similarity search for images or videos
Procedure
Compute the distance between the test sample and every training sample
Sort the distances
Select the k samples with the smallest distances (small k: more sensitive to noise; large k: blurred decision boundary)
Vote among the labels of the k selected samples to obtain the predicted class; see the sketch after this list
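A minimal NumPy sketch of the four steps above (the names X_train, y_train, x_test, and knn_predict are illustrative, not from the original notes):

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=5):
    """Classify one test point by majority vote among its k nearest training points."""
    # 1. Compute the Euclidean distance from the test sample to every training sample.
    distances = np.linalg.norm(X_train - x_test, axis=1)
    # 2. Sort the distances and 3. keep the indices of the k smallest.
    nearest = np.argsort(distances)[:k]
    # 4. Vote among the labels of the k nearest samples.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.0]), k=3))  # -> 1
```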
Distance metrics (sketched in code after this list)
Euclidean distance
Minkowski distance
Manhattan distance
Cosine similarity
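A short sketch of these metrics for two NumPy vectors; note that cosine similarity measures similarity rather than distance (1 − similarity is a common way to turn it into one):

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def minkowski(x, y, p=3):
    # Generalizes Manhattan (p=1) and Euclidean (p=2).
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 1.0])
print(euclidean(x, y), manhattan(x, y), minkowski(x, y, p=2), cosine_similarity(x, y))
```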
Optimization
KD-tree (spatial index over the training points)
Approximate kNN to speed up serving; see the Faiss sketch after this list
Tree-based ANN
Locality-sensitive hashing (LSH)
Clustering-based
faiss
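A sketch of the clustering-based approach with the Faiss library: IndexIVFFlat partitions the database vectors into nlist coarse clusters and scans only nprobe of them per query. The dimensions, sizes, and parameter values below are illustrative.

```python
import numpy as np
import faiss  # e.g., pip install faiss-cpu

d = 64                                            # vector dimensionality
xb = np.random.rand(10000, d).astype('float32')   # database vectors
xq = np.random.rand(5, d).astype('float32')       # query vectors

# Exact brute-force baseline: scans every database vector per query.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
D_exact, I_exact = flat.search(xq, 5)

# Clustering-based approximate index: vectors are bucketed into nlist
# clusters at add time; only nprobe clusters are scanned at query time.
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 100)       # nlist = 100
ivf.train(xb)                                     # learn the coarse clustering
ivf.add(xb)
ivf.nprobe = 10                                   # speed/accuracy trade-off
D_approx, I_approx = ivf.search(xq, 5)            # top-5 approximate neighbors
```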
Q&A
Pros:
Almost no assumption on f other than smoothness
High capacity/complexity
High accuracy given a large training set
Supports online training (by simply memorizing)
Readily extensible to multi-class and regression problems
Cons:
Storage demanding
Sensitive to outliers
Sensitive to irrelevant data features (vs. decision trees)
Requires special handling of missing data (e.g., modified distance functions)
Computationally expensive: O(ND) time for making each prediction
Can speed up with index and/or approximation
Code
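A minimal scikit-learn usage example (the Iris dataset and the hyperparameter values are illustrative choices, not from the original notes):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# n_neighbors is k; metric='minkowski' with p=2 is Euclidean distance;
# algorithm='kd_tree' uses the tree index discussed in the Optimization section.
clf = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2, algorithm='kd_tree')
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```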