KNN

应用场景

  • 产品推荐和推荐引擎

  • 自然语言处理 (NLP) 算法的相关性排名

  • 图像或视频的相似性搜索

过程

  • 计算训练样本和测试样本中每个样本之间的距离

  • 对距离进行排序

  • 选择距离最小的k个样本: K小,噪音多,K大,边界模糊

  • 根据k个样本的标签进行投票, 得到分类类别

距离

  • 欧几里得距离

  • 明可夫斯基距离

  • 曼哈顿距离

  • 余弦相似度

优化

  • NDTree

  • Approximate kNN 加速服务

    • Tree-based ANN

    • Locality-sensitive hashing (LSH)

    • Clustering based

  • faiss

问答

Pros:

  • Almost no assumption on f other than smoothness

  • High capacity/complexity

  • High accuracy given a large training set

  • Supports online training (by simply memorizing)

  • Readily extensible to multi-class and regression problems

Cons:

  • Storage demanding

  • Sensitive to outliers

  • Sensitive to irrelevant data features (vs. decision trees)

  • Needs to deal with missing data (e.g., special distances)

  • Computationally expensive: O(ND) time for making each prediction

  • Can speed up with index and/or approximation

code

Reference

Last updated