mle-interview
  • 面试指南
  • 数据结构与算法
    • 列表
      • 912. Sort an Array
      • 215. Kth Largest Element
      • 977. Squares of a Sorted Array
      • 605. Can Place Flowers
      • 59. Spiral Matrix II
      • 179. Largest Number
      • 31. Next Permutation
    • 二分查找
      • 704. Binary Search
      • 69. Sqrt(x)
      • 278. First Bad Version
      • 34. Find First and Last Position of Element in Sorted Array
      • 33. Search in Rotated Sorted Array
      • 81. Search in Rotated Sorted Array II
      • 162. Find Peak Element
      • 4. Median of Two Sorted Arrays
      • 1095. Find in Mountain Array
      • 240. Search a 2D Matrix II
      • 540. Single Element in a Sorted Array
      • 528. Random Pick with Weight
      • 1300. Sum of Mutated Array Closest to Target
      • 410. Split Array Largest Sum
      • 1044. Longest Duplicate Substring
      • *644. Maximum Average Subarray II
      • *1060. Missing Element in Sorted Array
      • *1062. Longest Repeating Substring
      • *1891. Cutting Ribbons
    • 双指针
      • 26. Remove Duplicate Numbers in Array
      • 283. Move Zeroes
      • 75. Sort Colors
      • 88. Merge Sorted Arrays
      • 167. Two Sum II - Input array is sorted
      • 11. Container With Most Water
      • 42. Trapping Rain Water
      • 15. 3Sum
      • 16. 3Sum Closest
      • 18. 4Sum
      • 454. 4Sum II
      • 409. Longest Palindrome
      • 125. Valid Palindrome
      • 647. Palindromic Substrings
      • 209. Minimum Size Subarray Sum
      • 5. Longest Palindromic Substring
      • 395. Longest Substring with At Least K Repeating Characters
      • 424. Longest Repeating Character Replacement
      • 76. Minimum Window Substring
      • 3. Longest Substring Without Repeating Characters
      • 1004. Max Consecutive Ones III
      • 1658. Minimum Operations to Reduce X to Zero
      • *277. Find the Celebrity
      • *340. Longest Substring with At Most K Distinct Characters
    • 链表
      • 203. Remove Linked List Elements
      • 19. Remove Nth Node From End of List
      • 876. Middle of the Linked List
      • 206. Reverse Linked List
      • 92. Reverse Linked List II
      • 24. Swap Nodes in Pairs
      • 707. Design Linked List
      • 148. Sort List
      • 160. Intersection of Two Linked Lists
      • 141. Linked List Cycle
      • 142. Linked List Cycle II
      • 328. Odd Even Linked List
    • 哈希表
      • 706. Design HashMap
      • 1. Two Sum
      • 146. LRU Cache
      • 128. Longest Consecutive Sequence
      • 73. Set Matrix Zeroes
      • 380. Insert Delete GetRandom O(1)
      • 49. Group Anagrams
      • 350. Intersection of Two Arrays II
      • 299. Bulls and Cows
      • *348. Design Tic-Tac-Toe
    • 字符串
      • 242. Valid Anagram
      • 151. Reverse Words in a String
      • 205. Isomorphic Strings
      • 647. Palindromic Substrings
      • 696. Count Binary Substrings
      • 28. Find the Index of the First Occurrence in a String
      • *186. Reverse Words in a String II
    • 栈与队列
      • 225. Implement Stack using Queues
      • 54. Spiral Matrix
      • 155. Min Stack
      • 232. Implement Queue using Stacks
      • 150. Evaluate Reverse Polish Notation
      • 224. Basic Calculator
      • 20. Valid Parentheses
      • 1472. Design Browser History
      • 1209. Remove All Adjacent Duplicates in String II
      • 1249. Minimum Remove to Make Valid Parentheses
      • *281. Zigzag Iterator
      • *1429. First Unique Number
      • *346. Moving Average from Data Stream
    • 优先队列/堆
      • 692. Top K Frequent Words
      • 347. Top K Frequent Elements
      • 973. K Closest Points
      • 23. Merge K Sorted Lists
      • 264. Ugly Number II
      • 378. Kth Smallest Element in a Sorted Matrix
      • 295. Find Median from Data Stream
      • 767. Reorganize String
      • 1438. Longest Continuous Subarray With Absolute Diff Less Than or Equal to Limit
      • 895. Maximum Frequency Stack
      • 1705. Maximum Number of Eaten Apples
      • *1086. High Five
    • 深度优先DFS
      • 二叉树
      • 543. Diameter of Binary Tree
      • 101. Symmetric Tree
      • 124. Binary Tree Maximum Path Sum
      • 226. Invert Binary Tree
      • 104. Maximum Depth of Binary Tree
      • 951. Flip Equivalent Binary Trees
      • 236. Lowest Common Ancestor of a Binary Tree
      • 987. Vertical Order Traversal of a Binary Tree
      • 572. Subtree of Another Tree
      • 863. All Nodes Distance K in Binary Tree
      • 1110. Delete Nodes And Return Forest
      • 230. Kth Smallest element in a BST
      • 98. Validate Binary Search Tree
      • 235. Lowest Common Ancestor of a Binary Search Tree
      • 669. Trim a Binary Search Tree
      • 700. Search in a Binary Search Tree
      • 108. Convert Sorted Array to Binary Search Tree
      • 450. Delete Node in a BST
      • 938. Range Sum of BST
      • *270. Closest Binary Search Tree Value
      • *333. Largest BST Subtree
      • *285. Inorder Successor in BST
      • *1485. Clone Binary Tree With Random Pointer
      • 回溯
      • 39. Combination Sum
      • 78. Subsets
      • 46. Permutation
      • 77. Combinations
      • 17. Letter Combinations of a Phone Number
      • 51. N-Queens
      • 93. Restore IP Addresses
      • 22. Generate Parentheses
      • 856. Score of Parentheses
      • 301. Remove Invalid Parentheses
      • 37. Sodoku Solver
      • 图DFS
      • 126. Word Ladder II
      • 212. Word Search II
      • 79. Word Search
      • 399. Evaluate Division
      • 1376. Time Needed to Inform All Employees
      • 131. Palindrome Partitioning
      • 491. Non-decreasing Subsequences
      • 698. Partition to K Equal Sum Subsets
      • 526. Beautiful Arrangement
      • 139. Word Break
      • 377. Combination Sum IV
      • 472. Concatenated Words
      • 403. Frog Jump
      • 329. Longest Increasing Path in a Matrix
      • 797. All Paths From Source to Target
      • 695. Max Area of Island
      • 341. Flatten Nested List Iterator
      • 394. Decode String
      • *291. Word Pattern II
      • *694. Number of Distinct Islands
      • *1274. Number of Ships in a Rectangle
      • *1087. Brace Expansion
    • 广度优先BFS
      • 102. Binary Tree Level Order Traversal
      • 103. Binary Tree Zigzag Level Order Traversal
      • 297. Serialize and Deserialize Binary Tree
      • 310. Minimum Height Trees
      • 127. Word Ladder
      • 934. Shortest Bridge
      • 200. Number of Islands
      • 133. Clone Graph
      • 130. Surrounded Regions
      • 752. Open the Lock
      • 815. Bus Routes
      • 1091. Shortest Path in Binary Matrix
      • 542. 01 Matrix
      • 1293. Shortest Path in a Grid with Obstacles Elimination
      • 417. Pacific Atlantic Water Flow
      • 207. Course Schedule
      • 210. Course Schedule II
      • 787. Cheapest Flights Within K Stops
      • 444. Sequence Reconstruction
      • 994. Rotting Oranges
      • 785. Is Graph Bipartite?
      • *366. Find Leaves of Binary Tree
      • *314. Binary Tree Vertical Order Traversal
      • *269. Alien Dictionary
      • *323. Connected Component in Undirected Graph
      • *490. The Maze
    • 动态规划
      • 70. Climbing Stairs
      • 72. Edit Distance
      • 377. Combination Sum IV
      • 1335. Minimum Difficulty of a Job Schedule
      • 97. Interleaving String
      • 472. Concatenated Words
      • 403. Frog Jump
      • 674. Longest Continuous Increasing Subsequence
      • 62. Unique Paths
      • 64. Minimum Path Sum
      • 368. Largest Divisible Subset
      • 300. Longest Increasing Subsequence
      • 354. Russian Doll Envelopes
      • 121. Best Time to Buy and Sell Stock
      • 132. Palindrome Partitioning II
      • 312. Burst Balloons
      • 1143. Longest Common Subsequence
      • 718. Maximum Length of Repeated Subarray
      • 174. Dungeon Game
      • 115. Distinct Subsequences
      • 91. Decode Ways
      • 639. Decode Ways II
      • 712. Minimum ASCII Delete Sum for Two Strings
      • 221. Maximal Square
      • 1277. Count Square Submatrices with All Ones
      • 198. House Robber
      • 213. House Robber II
      • 1235. Maximum Profit in Job Scheduling
      • 740. Delete and Earn
      • 87. Scramble String
      • 1140. Stone Game II
      • 322. Coin Change
      • 518. Coin Change II
      • 1048. Longest String Chain
      • 44. Wildcard Matching
      • 10. Regular Expression Matching
      • 32. Longest Valid Parentheses
      • 1043. Partition Array for Maximum Sum
      • *256. Paint House
      • 926. Flip String to Monotone Increasing
      • *1062. Longest Repeating Substring
      • *1216. Valid Palindrome III
    • 贪心
      • 56. Merge Intervals
      • 621. Task Scheduler
      • 135. Candy
      • 376. Wiggle Subsequence
      • 55. Jump Game
      • 134. Gas Station
      • 1005. Maximize Sum Of Array After K Negations
      • 406. Queue Reconstruction by Height
      • 452. Minimum Number of Arrows to Burst Balloons
      • 738. Monotone Increasing Digits
    • 单调栈
      • 739. Daily Temperatures
      • 503. Next Greater Element II
      • 901. Online Stock Span
      • 85. Maximum Rectangle
      • 84. Largest Rectangle in Histogram
      • 907. Sum of Subarray Minimums
      • 239. Sliding Window Maximum
    • 前缀和
      • 53. Maximum Subarray
      • 523. Continuous Subarray Sum
      • 304. Range Sum Query 2D - Immutable
      • 1423. Maximum Points You Can Obtain from Cards
      • 1031. Maximum Sum of Two Non-Overlapping Subarrays
    • 并查集
      • 684. Redundant Connection
      • 721. Accounts Merge
      • 547. Number of Provinces
      • 737. Sentence Similarity II
      • *305. Number of Islands II
    • 字典树trie
      • 208. Implement Trie
      • 211. Design Add and Search Words Data Structure
      • 1268. Search Suggestions System
      • *1166. Design File System
      • *642. Design Search Autocomplete System
    • 扫描线sweep line
      • 253. Meeting Room II
      • 1094. Car Pooling
      • 218. The Skyline Problem
      • *759. Employee Free Time
    • tree map
      • 729. My Calendar I
      • 981. Time Based Key-Value Store
      • 846. Hand of Straights
      • 480. Sliding Window Median
      • 318. Count of Smaller Numbers After Self
    • 数学类
      • 50. Pow(x, n)
      • *311. Sparse Matrix Multiplication
      • 382. Linked List Random Node
      • 398. Random Pick Index
      • 29. Divide Two Integers
    • 设计类
      • 1603. Design Parking System
      • 355. Design Twitter
      • 1396. Design Underground System
      • *359. Logger Rate Limiter
      • *353. Design Snake Game
      • *379. Design Phone Directory
      • *588. Design In-Memory File System
      • *1244. Design A Leaderboard
    • SQL
  • 机器学习
    • 数学基础
    • 评价指标
    • 线性回归
    • 逻辑回归
    • 树模型
    • 深度学习
    • 支持向量机
    • KNN
    • 无监督学习
    • k-means
    • 强化学习 RL
    • 自然语言处理 NLP
    • 大语言模型 LLM
    • 机器视觉 CV
    • 多模态 MM
    • 分布式机器学习
    • 推荐系统
    • 异常检测与风控
    • 模型解释性
    • 多任务学习
    • MLops
    • 特征工程
    • 在线学习
    • 硬件 cuda/triton
    • 产品case分析
    • 项目deep dive
    • 机器学习代码汇总
  • 系统设计
    • 面向对象设计
      • 电梯设计
      • 停车场设计
      • Unix文件系统设计
    • 系统设计
      • 设计社交网站Twitter
      • 设计视频网站Youtube
      • 短网址系统
      • 爬虫系统
      • 任务调度系统
      • 日志系统
      • 分布式缓存
      • 广告点击聚合系统
      • webhook
    • 机器学习系统设计
      • 推荐系统
      • 搜索引擎
      • Youtube视频推荐
      • Twitter推荐
      • 广告点击预测
      • 新闻推送推荐
      • POI推荐
      • Youtube视频搜索
      • 有害内容检测
      • 大模型RAG
      • 大模型Agent
      • 信贷风控
      • 朋友推荐
      • 去重复性/版权检测
      • 情感分析
      • 目标检测
      • 问答系统
      • 知识图谱问答
  • 行为面试
    • 领导力法则
    • 问答举例
  • 案例分享
    • 准备工作
    • 面试小抄
    • 面试之后
Powered by GitBook
On this page
  • 1. requirements
  • 2. ML task & pipeline
  • 3. data
  • 3.1 前处理
  • 3.2 正负样本
  • 4. feature
  • 5. model
  • 5.1 召回
  • 5.2 排序
  • 5.3 重排
  • 6. evaluation
  • 7. deployment & prediction service
  • 8. monitoring & maintenance
  • 9. 优化与问答
  • 参考
  1. 系统设计
  2. 机器学习系统设计

Youtube视频推荐

1. requirements

场景/功能类

  • Use case

    • Homepage? similar item recommendation?

    • Similar to previously played, or personalized for the user?

    • UGC(user generated content) or not? penetration rate as goal if UGC, 即时曝光new content

    • Do users have any favorite lists, play later, etc?

    • 项目阶段,是否都是冷启动数据,还是已经有了交互

  • user info: can users become friends on the platform and do we want to take that into account?

  • item info: any text, tag, image, video

  • engagement type: click, like, share, comment

    • explicit feedback? (thumbs up/down, in-product surveys), implicit user history

目标类

  • Business objective?

    • Increase user engagement (play, like, click, share), purchase?, create a better ultimate gaming experience

    • maximize users’ engagement and recommend new types of content to users

  • For online recommendations, it’s important to find the balance between exploration vs. exploitation. If the model over-exploits historical data, new videos might not get exposed to users. We want to balance between relevancy and fresh new content.

  • diversity

约束类

  • Data access

    • Do we log and have access to any data? Can we build a dataset using user interactions?

  • Latency requirements

    • For every user to visit the homepage, the system will have to recommend 100 videos for them. The latency needs to be under 200ms, ideally sub 100ms.

  • scale: how many user and item

    • How many videos? 100 million

    • How many users? 100 million DAU

2. ML task & pipeline

根据需求转化为一个机器学习问题(类型,输入、输出-机器学习Label),并给出high level的设计。

The reason for two stages is to make the system scale.

  • [1] In this first stage, the system starts from a potentially huge corpus and generates a much smaller subset of candidates. For example, the candidate generator in YouTube reduces billions of videos down to hundreds or thousands. The model needs to evaluate queries quickly given the enormous size of the corpus. A given model may provide multiple candidate generators, each nominating a different subset of candidates.

  • [2] Next, another model scores and ranks the candidates in order to select the set of items (on the order of 10) to display to the user. Since this model evaluates a relatively small subset of items, the system can use a more precise model relying on additional queries.

  • [3] Finally, the system must take into account additional constraints for the final ranking. For example, the system removes items that the user explicitly disliked or boosts the score of fresher content. Re-ranking can also help ensure diversity, freshness, and fairness.

3. data

  • Collect data on ad impressions, clicks, conversions, and user interactions.

  • Capture user behavior data on your website or platform to understand post-click engagement.

  • Gather ad creative data, such as headlines, images, and ad copy.

3.1 前处理

  • Segment data by ad type, campaign, audience demographics, and other relevant factors to analyze performance at different levels.

  • Clean and preprocess data to remove anomalies, missing values, and outliers.

  • Removing duplicates

  • filling missing values

  • normalizing data

3.2 正负样本

  • 召回负样本: 随机采样,batch内,粗排或精排过滤掉的

  • 排序负样本: 曝光未点击

4. feature

  • 用户特征,item特征,场景特征

    • 用户:用户画像特征、用户统计特征、用户行为特征

    • item:

    • 用户与item交叉特征:

    • 场景特征:有缺失,重要特征提升其覆盖率

  • sparse和dense特征

    • 离散特征(ID,类目)做embedding

    • 连续特征()转换 log1p, 分桶变成离散特征

  • 根据特征需要的时效性(例如历史engagement)、变化频率(例如用户与item基本信息)、数据大小与计算快慢决定其存储与服务时获取的方式

    • 内存数据库:用户画像(少,偏静态)、item画像(多,静态)、统计特征(用户少,item多,时效要求高)

5. model

从简单可行的模型开始设计,不要追求fancy

5.1 召回

Candidate generation is the first stage of recommendation. Given a query (also known as context), the system generates a set of relevant candidates

  • content-based filtering: Uses similarity between items to recommend items similar to what the user likes.

  • collaborative filtering: Uses similarities between queries and items simultaneously to provide recommendations.

    • pros

      • Easy to discover users' new areas of interest

      • Efficient. Models based on CF are usually faster and less compute-intensive

    • cons

      • Cold-start problem

      • Cannot handle niche interests

  • for large scale system (Facebook, Google), we don’t use Collaborative Filtering and prefer low latency method to get candidate. One example is to leverage Inverted Index (commonly used in Lucene, Elastic Search). Another powerful technique can be found FAISS or Google ScaNN

负样本选择

  • 简单负样本: 没有被召回的样本,全体样本中采样。根据热门/冷门进行随机非均匀,抽样概率与热门程度(点击次数)正相关,次数的0.75次方

  • 简单负样本: batch负样本,对比学习

  • 困难负样本: 被粗排淘汰的物品,精排分数靠后的物品

  • 注意不能选择精排的"曝光未点击"作为负样本,精排的样本是感兴趣与非常感兴趣的区别,不一定是不感兴趣,召回需要见多识广

双塔模型

  • Two-tower architectures can capture the semantics of query and candidate entities, and map these to a shared embedding space such that semantically similar entities cluster closer together

  • 步骤:1 离线训练得到user/item embedding, 2 离线建立item的索引,3 在线,计算user embedding,搜索item top n相似度

  • 数据:Training data is sourced from positive <query, candidate> pairs

  • pros

    • latency and scalability

  • cons

    • cold start

5.2 排序

排序算法给召回的每一个物品ID打分。优化目标有:pairwise,pointwise,listwise。pairwise是搜索排序提出的,正负例之间是有明显界限。对于推荐排序,是基于场景的,用户的反馈具有随机性,因此推荐排序pointwise经常优于parewise。

多目标最终如何给出排序

  • 预估分数融合

  • 图文笔记排序的主要依据: 点击、点赞、收藏、转发、评论; 视频排序的依据还有播放时长和完播

5.3 重排

  • 视频要否要先通过审核

  • region restricted videos

  • videos freshness

  • video spreading misinformation

  • duplicated

  • fairness and bias

6. evaluation

  • North star metric

    • Watch time, Other metrics : No. of photos watched, engaged with ( by clicking, commenting, liking etc.) DAU, WAU, MAU, Stickiness, Weekly retention, 30 day retention etc

  • offline

    • precision, recall, ranking loss, and logloss

  • online

    • Use A/B testing to compare Click Through Rates, watch time, and Conversion rates.

    • Click-Through Rate (CTR): The ratio of clicks to impressions, indicating the effectiveness of an ad in attracting user attention

    • Conversion Rate: The ratio of conversions (e.g., sign-ups, purchases) to clicks, measuring how well an ad drives desired actions.

    • Return on Ad Spend (ROAS): The revenue generated from an ad campaign divided by the cost, demonstrating the profitability of the campaign.

    • Quality Score: A score assigned to ads based on relevance, user experience, and expected click-through rate.

    • Engagement Metrics: Metrics like bounce rate, time on site, and pages per visit to assess the quality of user engagement with the landing page after clicking on the ad.

  • split the data

7. deployment & prediction service

    • are synchronous requests made to a model that is deployed to an endpoint. Therefore, before sending a request, you must first deploy the Model resource to an endpoint. This associates compute resources with the model so that it can serve online predictions with low latency. Use online predictions when you are making requests in response to application input or in situations that require timely inference.

    • rest/grpc

  • Batch predictions

    • are asynchronous requests made to a model that isn't deployed to an endpoint. You send the request (as a BatchPredictionsJob resource) directly to the Model resource. Use batch predictions when you don't require an immediate response and want to process accumulated data by using a single request.

    • precomputed, decouple compute from serving, lower load. 周期性更新,比如贷中评分,推荐系统的客户画像,一些dashboard,linkedin job推荐

  • application server

  • candidate generation service

    • two-tower network inference: find the k-top most relevant items given a user ->

    • It's a classic nearest neighbor problem -> use approximate nearest neighbor (ANN) algorithms

  • ranking service

8. monitoring & maintenance

  • User behavior is generally unpredictable, and videos can become viral during the day. Ideally, we want to train many times during the day to capture temporal changes

  • retrain everyday + online learning

9. 优化与问答

  • cold start

    • new users

      • use user basic features, like age, gender

      • after the user interacts with more videos, better

    • new videos

      • use basic metadata and content

      • display videos to random users to collect interaction data

  • bias

    • 向量召回过程中如何打压热门视频, popularity trap

      • youtube论文,训练时cosine(a, bi)-log(pi), 预测时cosine(a, bi)

参考

精读

扩展

Previous搜索引擎NextTwitter推荐

Last updated 27 days ago

损失函数:

计算the dot product representing their similarity,batch内负样本
Online predictions
论文: Deep Neural Networks for YouTube Recommendations
论文: Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations
论文: Recommending What Video to Watch Next: A Multitask Ranking System
Netflix: System Architectures for Personalization and Recommendation
How does Netflix recommend movies? Matrix Factorization
Improving Recommendation Quality in Google Drive
推荐系统-短视频推荐中的一些记录 - tangwang的文章 - 知乎
alirezadir/Machine-Learning-Interviews
https://github.com/wangshusen/RecommenderSystem
Scaling deep retrieval with TensorFlow Recommenders and Vertex AI Matching Engine
虎牙直播推荐系统架构详解
QQ音乐推荐系统算法架构实践
网易云音乐推荐系统的冷启动技术
双塔模型Batch内负采样如何解决热度降权和SSB的问题
Mixed Negative Sampling for Learning Two-tower Neural Networks in Recommendations
System Design for Recommendations and Search // Eugene Yan // MLOps Meetup #78
都说数据是上限,推荐系统ctr模型中,构造正负样本有哪些实用的trick? - 傅聪Cong的回答 - 知乎
03ml-youtube-reco-structure