Recommender Systems


Last updated 27 days ago

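A classic machine learning system design question. Take care to connect the individual stages so that your answer forms an end-to-end framework.

For example, in large-scale recommendation, why do we need a funnel of retrieval followed by ranking? Because the item corpus is huge, and retrieval quickly shrinks the candidate set drastically. Why can the common two-tower model retrieve so quickly? Because the user tower and the item tower share no cross features: item embeddings can be computed offline, while the user tower can be computed in real time to reflect current interests, and each request needs only one user vector, which is matched against the item index with ANN search. This is also why retrieval models use late fusion while ranking models use early fusion (fusing features earlier improves prediction accuracy).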
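The split between an offline item tower and a per-request user tower can be sketched as follows. This is a minimal illustration, not a production design: the one-layer `user_tower`, the random weights, and the corpus size are all assumptions, and the exact inner-product search stands in for the ANN index mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # embedding size (assumed for illustration)

# Item tower output: computed offline for the whole corpus and stored in an index.
item_embeddings = rng.normal(size=(1000, DIM))
item_embeddings /= np.linalg.norm(item_embeddings, axis=1, keepdims=True)


def user_tower(user_features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Hypothetical one-layer user tower; it runs once per request."""
    vec = np.tanh(user_features @ weights)
    return vec / np.linalg.norm(vec)


def retrieve(user_vec: np.ndarray, item_matrix: np.ndarray, k: int = 10) -> np.ndarray:
    """Exact top-k by inner product; an ANN index would approximate this step."""
    scores = item_matrix @ user_vec
    top = np.argpartition(-scores, k - 1)[:k]  # unordered top-k
    return top[np.argsort(-scores[top])]       # order the k candidates by score


user_weights = rng.normal(size=(16, DIM))
user_vec = user_tower(rng.normal(size=16), user_weights)
candidates = retrieve(user_vec, item_embeddings, k=10)
```

Because the two towers only meet at the final dot product (late fusion), the item side never has to be recomputed per request.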

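On top of this framework, propose optimizations tailored to the domain at hand, such as e-commerce or O2O:

  • E-commerce recommendation funnel: impression -> click -> purchase

  • Location constraints, e.g. Yelp restaurant recommendation involves geolocation information

  • The user's graph network, e.g. Facebook

  • Music embeddings, e.g. Spotify music recommendation

  • Instagram Story recommendation: each Story is unique and time-sensitive

  • Characteristics of ads in O2O scenarios: mobile, local, scenario-driven, diverse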

1. requirements

The core job of a recommender system is still personalization accuracy and diversity.

Functional/use cases

  • What is the product we are building the recommendation system for? How are we different from XX? Who are the producers and consumers?

  • Homepage recommendation, session-based next-item recommendation (short-term interest), or related-item recommendation?

  • Explicit or implicit feedback? (Even when explicit feedback exists, implicit feedback is usually chosen.)

  • Is it a new product, or is there an existing product already built?

  • The biggest goal of a recommendation system is user engagement. Can I assume that is the goal here?

  • Is there any demographic or country we are targeting?

  • Text, image, or video?

  • Ranking / localization

  • Are there friend or follow relationships? If so, they can serve as a retrieval channel.

Non-Functional

Always clarify scalability and low latency, because these two non-functional requirements determine the rest of the design.

  • MVP and non-MVP

  • Users should have a real-time / near-real-time / low-latency experience

  • Idempotency / exactly-once / at-least-once / at-most-once

  • Reliability: data is not lost / jobs are not executed in parallel

  • Retention policy

  • Security: store encoded personal data

  • Consistency: read-heavy or write-heavy

  • DAU / QPS / peak QPS

2. ML task & MVP

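Terminology

  • Impression: the document is seen by the user

  • Click-through rate (CTR): the probability that the user clicks document d, given that d was shown

  • Engagement: actions that follow a click, such as liking, favoriting, sharing, following the author, or commenting on a document; in e-commerce, adding to cart, placing an order, and paying

The importance of recommender systems stems from information overload and the long-tail distribution of user behavior. The goal is to link users with items in a reasonable way.

  • Long tail / popular

    • Head users can be profiled in fine detail and memorized accurately; the sparse long tail, which makes up the larger share, needs good generalization

  • Memorization / exploration

  • Sparsity / embedding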

3. data & labels

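  • Key point: distribution

  • Real-time data from the client and the server passes through a stream-processing platform, which collects user, item, and context information together with on-device information

  • Feature engineering then applies transformations such as normalization, discretization, and nonlinear discretization to put the features into the form that downstream models expect

  • After processing, features are usually placed in a feature store, for example a KV store such as Redis or Alibaba's iGraph

  • Samples

    • Imbalance

    • Confidence: skip-above (impressions positioned above a clicked item can be treated as negatives; samples below the deepest clicked position are filtered out); sessions with no positive samples at all (the app may have been opened by accident)

  • labels

    • explicit vs. implicit labels

  • Delayed feedback (delayed conversion)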
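The skip-above rule from the sample confidence bullet can be sketched as below. The function name and data layout are hypothetical, and dropping sessions with no click entirely is an assumption; the notes only flag such sessions as low-confidence.

```python
def skip_above_labels(impressions, clicked_ids):
    """Label one session's impressions, ordered from top to bottom.

    Items at or above the deepest clicked position are kept (1 = clicked,
    0 = seen but skipped); items below it are dropped because we cannot
    tell whether the user saw them. A session without any click returns
    an empty list (assumption: such sessions are discarded).
    """
    clicked = set(clicked_ids)
    positions = [i for i, item in enumerate(impressions) if item in clicked]
    if not positions:
        return []
    deepest = max(positions)
    return [(item, 1 if item in clicked else 0) for item in impressions[: deepest + 1]]


session = ["a", "b", "c", "d", "e"]
labels = skip_above_labels(session, clicked_ids=["b", "d"])
# "e" sits below the deepest click ("d"), so it is filtered out.
```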

4. feature

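  • dense feature

    • log transform

    • Binning, which can follow business rules; binning strategies: equal-frequency, equal-width, chi-square

      • Crossing bucketized features improves generalization

  • sparse feature

    • one-hot

    • embedding

  • cross feature

  • sequence feature

  • Processing

    • Missing values

    • Standardization

    • Smoothing and debiasing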
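As a small illustration of the dense-feature transforms listed above, the sketch below applies a log transform and compares equal-width with equal-frequency binning on a heavy-tailed count feature. The helper names and the toy data are assumptions made for the example.

```python
import numpy as np


def log_transform(x):
    """log(1 + x) compresses heavy-tailed counts and is defined at 0."""
    return np.log1p(np.asarray(x, dtype=float))


def equal_width_bins(x, n_bins):
    """Split the value range into n_bins intervals of equal width."""
    x = np.asarray(x, dtype=float)
    edges = np.linspace(x.min(), x.max(), n_bins + 1)[1:-1]  # interior edges
    return np.searchsorted(edges, x, side="right")


def equal_frequency_bins(x, n_bins):
    """Place interior edges at quantiles so each bin holds a similar count."""
    x = np.asarray(x, dtype=float)
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.searchsorted(edges, x, side="right")


clicks = np.array([1, 2, 3, 4, 5, 6, 7, 1000])  # one extreme outlier
width_counts = np.bincount(equal_width_bins(clicks, 4), minlength=4)
freq_counts = np.bincount(equal_frequency_bins(clicks, 4), minlength=4)
log_clicks = log_transform(clicks)
```

On this toy input, equal-width binning puts seven of the eight values into a single bin, while equal-frequency binning spreads them evenly.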

5. model

Models to know well: DNN, DeepFM, DIN, MMoE

Retrieval

  • Generates a much smaller subset of candidates from huge corpus with low latency and computational cost

  • ensemble retrieval from different models (rule, filtering, nn). A given model may provide multiple candidate generators, each nominating a different subset of candidates

  • Neighborhood models (itemCF, userCF) are most effective at detecting very localized relationships, but unable to capture the totality of weak signals encompassed in all of a user’s ratings.

  • Latent factor models (MF) are generally effective at estimating overall structure that relates simultaneously to most or all items. However, these models are poor at detecting strong associations among a small set of closely related items.

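  • The requirements for a retrieval system are low latency and high precision

  • Multi-channel retrieval (ensemble retrieval)

  • Negative sample selection

    • Recommendation as extreme multiclass classification: for each positive sample, every other class is a negative -> negative sampling -> generation, accuracy

    • Global negative sampling; sampling methods: uniform subsampling, negative down sampling

    • Problem with in-batch negative sampling: items in a batch are mostly popular, so the sampled negatives are also mostly popular items, which over-suppresses popular items and calls for a correction

    • Problem introduced by negative sampling: the predicted CTR drifts. Correction: p / (p + (1-p)/w), where p is the prediction in the downsampled space and w is the negative downsampling rate

  • model

    • Rules: high popularity, same author, same tag

    • content based

      • The model doesn't need any data about other users

    • itemCF

      • ItemCF computes similarity from co-occurrence between items. The more interactions an item has, the more items it co-occurs with, and the more exposure it gains in turn; this is the Matthew effect (or long-tail effect) in recommender systems

      • pro: no training needed; strong at memorization; good results

      • con: weak generalization; prone to the Matthew effect, recommending mostly head and mid-tier items; cannot handle fresh items; hard to include side features for query/item

    • two-tower

      • Arbitrary continuous and categorical features can be easily added to the model; multimodal inputs such as images and text are also easy to add to a two-tower model

      • Easier to scale to large datasets

    • matrix factorization

      • con: worse performance on tail users

      • con: inability to add content-based features

    • embedding: graph, image, text

  • Locality-sensitive hashing (LSH), KD-tree

  • similarity metrics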
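The CTR correction p / (p + (1-p)/w) from the negative sampling notes can be checked numerically. In this sketch, w is read as the fraction of negatives kept during downsampling; `sampled_rate` is a hypothetical helper that simulates what a well-fit model would predict after downsampling.

```python
def calibrate(p: float, w: float) -> float:
    """Map a prediction made in the downsampled space back to the original space.

    p: predicted CTR from a model trained on downsampled negatives
    w: negative downsampling rate (fraction of negatives kept, 0 < w <= 1)
    """
    return p / (p + (1.0 - p) / w)


def sampled_rate(true_ctr: float, w: float) -> float:
    """Positive rate observed after keeping all positives and a fraction w of negatives."""
    positives = true_ctr
    negatives = (1.0 - true_ctr) * w
    return positives / (positives + negatives)


true_ctr = 0.01
w = 0.1
inflated = sampled_rate(true_ctr, w)  # what the model learns to predict
recovered = calibrate(inflated, w)    # back in the original space
```

Keeping only 10% of negatives inflates a 1% CTR to roughly 9%, and the correction maps it back.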

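Ranking

  • Pre-ranking (coarse ranking), fine ranking, re-ranking

  • multi-task deep learning

  • Pre-ranking consistency

  • Memorization and generalization

  • feature/embedding

  • model

    • rule based model, matrix factorization, factorization machine

    • Reformulate as a classification/regression model

    • Deep learning (wide and deep learning)

      • pro: generalization, incremental training, inference speed

      • con: performance ramp-up, preprocessing

    • GBDT

      • pro: no preprocessing needed, good performance

      • con: struggles at large scale, unsuitable for continual learning

  • rerank

    • take into account additional constraints for the final ranking

    • DPP (determinantal point process) diversity algorithm

  • multi-task

    • Sparse tasks can learn with help from other tasks

    • consistency among tasks

    • richer feature

    • Downside: optimization gets harder, especially when you want to improve one particular task

    • shared-bottom, MMoE, PLE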
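To make the MMoE structure named in the multi-task list concrete, here is a minimal forward-pass sketch in NumPy: shared experts, one softmax gate per task, and one linear tower per task. Layer sizes, the absence of bias terms, and the random weights are simplifying assumptions; a real model would be trained in a deep learning framework.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def mmoe_forward(x, expert_w, gate_w, tower_w):
    """x: (batch, d_in); expert_w: (n_experts, d_in, d_hidden);
    gate_w: (n_tasks, d_in, n_experts); tower_w: (n_tasks, d_hidden).
    Returns per-task probabilities of shape (batch, n_tasks)."""
    # Every expert sees the same input: (batch, n_experts, d_hidden).
    experts = np.maximum(0.0, np.einsum("bi,eih->beh", x, expert_w))
    outputs = []
    for task in range(gate_w.shape[0]):
        gate = softmax(x @ gate_w[task])                # (batch, n_experts)
        mixed = np.einsum("be,beh->bh", gate, experts)  # task-specific expert mix
        logit = mixed @ tower_w[task]                   # linear tower
        outputs.append(1.0 / (1.0 + np.exp(-logit)))    # sigmoid, e.g. CTR or CVR
    return np.stack(outputs, axis=1)


rng = np.random.default_rng(0)
batch, d_in, d_hidden, n_experts, n_tasks = 4, 6, 5, 3, 2
preds = mmoe_forward(
    rng.normal(size=(batch, d_in)),
    rng.normal(size=(n_experts, d_in, d_hidden)),
    rng.normal(size=(n_tasks, d_in, n_experts)),
    rng.normal(size=(n_tasks, d_hidden)),
)
```

A shared-bottom model is the special case where every task uses the same fixed mixture of the shared layers; the per-task gates are what let tasks weight experts differently.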

6. evaluation

  • offline

    • ranking: map, ndcg

  • online

    • CTR
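For the offline ranking metrics, NDCG@k can be computed as in the sketch below. It uses the exponential-gain variant (2^rel - 1); a linear-gain variant also exists, so treat the choice as an assumption.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (rank 0 is the top)."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([3, 2, 1], k=3)  # already in ideal order
swapped = ndcg_at_k([1, 3], k=2)     # best item ranked second
```

`relevances` holds the graded relevance of the items in the order the model ranked them.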

7. deploy & service

  • parameter server

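    • In essence, a distributed key-value store

  • Why is the user embedding in retrieval computed online in real time?

    • User interests can change significantly within a short time

    • User behavior data is usually sparse

    • It mitigates user cold start

  • Why is it better to compute item embeddings in retrieval offline?

    • Large volume, static

    • Real-time (latency) requirements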

  • pagination + sort_key / video start timestamp / offset, page_size

  • user context/ client info(ios, network condition) diagram

  • list high level diagram first, then each component

  • different choices/ tradeoffs and give your recommendation

  • MQ or no mq

  • cache: consistency/ failure/ cold start

  • data model/database: 1 master, 2 replica, primary key, user_id, timestamp, status

  • Lazy initialization and lazy updates of the embedding matrix

  • Impact of the embedding matrix on distributed communication

8. Monitoring & maintenance

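Distinguish statistical metrics from business metrics. The latter matter more, but they usually cannot be optimized directly and can only be measured through A/B testing.

  • E-commerce: depending on business needs, besides the primary goal of GMV (gross merchandise volume), several secondary goals such as IPV, conversion rate, and orders per user usually need to be balanced

  • CTR (click-through rate) and CVR (conversion rate)

  • impression per second

  • candidate count (from recall)

  • budget burn rate

  • General metrics: CPU, QPS, latency

  • Taobao's main search treats the "all-domain transaction hit rate" as the most important evaluation criterion for pre-ranking, and proposes two kinds of metrics that describe the "pre-ranking -> fine ranking loss" and the "retrieval -> pre-ranking loss" respectively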

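9. Optimization and Q&A

  • Data collection and processing

    • How to build the index

  • Retrieval

    • faiss: uses two techniques, PCA and PQ (product quantization), for vector compression and encoding

    • Vector retrieval and ranking do not use real-time behavior sequence features

  • How to do counterfactual evaluation

  • How to deploy?

    • embedding retrieval

    • Real time or batch: batch prediction versus online prediction

  • How to launch?

    • A/B testing, metric

      • setup A/B testing

      • control group (current system), treatment (system w/ our new model)

      • use a canary release, start with 99% control and 1% treatment

      • compare the key business metrics between the two groups (conversion rate etc.)

      • slowly ramp up the traffic in the treatment group

    • Warm start and cold start

    • The "old soup" model problem (an incumbent model trained on a long history of data) is at heart a fairness issue caused by mismatched sample spaces

      • Option 1: backfill historical data to align the sample spaces

      • Option 2: warm-start the model (warm-up)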

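  • Going real-time

    • Difficulty of real-time features: real-time data processing capability (Flink, etc.) and writing the results to Redis in real time

    • Real-time models

      • Sample data stream: real-time sample logging, real-time sample joining (delayed feedback problem)

      • Model training: online learning (resource and stability issues), real-time model correction

  • Methods for discretizing continuous features, and why continuous features need discretization

  • Parameter count and time complexity of FM and FFM

  • Multi-objective models

    • Pros and cons of sharing versus not sharing parameters

  • How to model long-term user interest and multiple interests

  • How to handle cold start

    • Approach 1: use side information / multimodal signals for recommendation as much as possible

    • Approach 2: use a small amount of traffic to explore the true quality of new items, and boost new items

  • bias

    • How to address position bias

      • From the data side: treat impressions shown before the click as negatives

      • From the model side: bias features must enter the model through a linear layer; feed position as a feature during training and set it to 0 at prediction time

    • popularity bias

  • Suppressing popular items (long tail)

    • Retrieval: multi-channel retrieval increases diversity

      • In itemCF/swing similarity, the overlap of each pair of users who both clicked items A and B goes in the denominator, which penalizes the behavior of highly active users

      • During retrieval, downsample negatives with a sampling probability roughly proportional to the click count raised to the 0.75 power, which suppresses popular items. Note that in-batch negatives over-suppress them

      • log q correction

    • Fine ranking

      • Take the logarithm of item count

      • Feed popularity features into the bias network and drop them at prediction time

    • Traffic regulation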

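  • Sparsity

  • Model update strategy, retrain plan

    • Daily incremental training, real-time training (online learning), or online learning combined with daily incremental training

    • Embedding updates

    • Full update: can run once a day, with shuffling, updating the ID embeddings and the fully connected layers for 1 epoch. Each run starts from the previous day's full-update model, not from the incremental one

    • Incremental update: runs continuously, possibly every few tens of minutes; online learning updates only the ID embedding parameters to track user interest as close to real time as possible. It is in fact biased, though

      • Real-time Logging

      • Real-time features needed for online learning can be emitted in real time by the online inference system

  • How to add user and item metadata

  • Online evaluation; online/offline inconsistency

  • model debugging, offline online inconsistency, light ranking, ab test, heavy ranking, two tower

  • Equal-width bucketing of statistical features causes feature values to cluster in a few buckets

  • Low retrieval recall for multilingual search

  • Some countries have many holidays, and T+1 model updates cause consumption metrics to drop after holidays

  • In some cases the pCTR of the same user for different items is the same value

  • Model objective and business objective are misaligned

  • Item ID hash collision rate is too high

  • E&E (exploration and exploitation)

    • embedding: turns features into learnable vectors; fuzzy lookup

    • embedding in sequence: co-occurrence

  • Multi-task

  • Multi-scenario

    • Different user groups (e.g. new vs. existing users), different channels/modules of the app, different clients, etc.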
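The popularity-suppression notes above mention sampling negatives with probability roughly proportional to the click count raised to the 0.75 power, and a log q correction. Both can be sketched as below; the function names and the toy click counts are assumptions, and the correction shown is the common form that subtracts the log sampling probability from each item's training logit.

```python
import numpy as np


def negative_sampling_probs(click_counts, power=0.75):
    """Sampling distribution over items, proportional to click_count ** power.

    power=1.0 samples by raw popularity; power=0.0 is uniform. A value of 0.75
    still favors popular items as negatives, but less than their raw share.
    """
    weights = np.asarray(click_counts, dtype=float) ** power
    return weights / weights.sum()


def logq_correct(logits, sampling_probs):
    """Subtract log q(item) from each logit to offset the sampling bias."""
    return np.asarray(logits, dtype=float) - np.log(sampling_probs)


click_counts = [1000, 10, 1]  # toy data: one head item, two tail items
probs = negative_sampling_probs(click_counts)
raw_share = np.asarray(click_counts, dtype=float) / sum(click_counts)
corrected = logq_correct([0.0, 0.0, 0.0], probs)
```

With equal raw logits, the correction adds more to rarely sampled items than to frequently sampled ones, which counteracts the over-suppression of popular items that appear often as negatives.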
