mle-interview
  • 面试指南
  • 数据结构与算法
    • 列表
      • 912. Sort an Array
      • 215. Kth Largest Element
      • 977. Squares of a Sorted Array
      • 605. Can Place Flowers
      • 59. Spiral Matrix II
      • 179. Largest Number
      • 31. Next Permutation
    • 二分查找
      • 704. Binary Search
      • 69. Sqrt(x)
      • 278. First Bad Version
      • 34. Find First and Last Position of Element in Sorted Array
      • 33. Search in Rotated Sorted Array
      • 81. Search in Rotated Sorted Array II
      • 162. Find Peak Element
      • 4. Median of Two Sorted Arrays
      • 1095. Find in Mountain Array
      • 240. Search a 2D Matrix II
      • 540. Single Element in a Sorted Array
      • 528. Random Pick with Weight
      • 1300. Sum of Mutated Array Closest to Target
      • 410. Split Array Largest Sum
      • 1044. Longest Duplicate Substring
      • *644. Maximum Average Subarray II
      • *1060. Missing Element in Sorted Array
      • *1062. Longest Repeating Substring
      • *1891. Cutting Ribbons
    • 双指针
      • 26. Remove Duplicate Numbers in Array
      • 283. Move Zeroes
      • 75. Sort Colors
      • 88. Merge Sorted Arrays
      • 167. Two Sum II - Input array is sorted
      • 11. Container With Most Water
      • 42. Trapping Rain Water
      • 15. 3Sum
      • 16. 3Sum Closest
      • 18. 4Sum
      • 454. 4Sum II
      • 409. Longest Palindrome
      • 125. Valid Palindrome
      • 647. Palindromic Substrings
      • 209. Minimum Size Subarray Sum
      • 5. Longest Palindromic Substring
      • 395. Longest Substring with At Least K Repeating Characters
      • 424. Longest Repeating Character Replacement
      • 76. Minimum Window Substring
      • 3. Longest Substring Without Repeating Characters
      • 1004. Max Consecutive Ones III
      • 1658. Minimum Operations to Reduce X to Zero
      • *277. Find the Celebrity
      • *340. Longest Substring with At Most K Distinct Characters
    • 链表
      • 203. Remove Linked List Elements
      • 19. Remove Nth Node From End of List
      • 876. Middle of the Linked List
      • 206. Reverse Linked List
      • 92. Reverse Linked List II
      • 24. Swap Nodes in Pairs
      • 707. Design Linked List
      • 148. Sort List
      • 160. Intersection of Two Linked Lists
      • 141. Linked List Cycle
      • 142. Linked List Cycle II
      • 328. Odd Even Linked List
    • 哈希表
      • 706. Design HashMap
      • 1. Two Sum
      • 146. LRU Cache
      • 128. Longest Consecutive Sequence
      • 73. Set Matrix Zeroes
      • 380. Insert Delete GetRandom O(1)
      • 49. Group Anagrams
      • 350. Intersection of Two Arrays II
      • 299. Bulls and Cows
      • *348. Design Tic-Tac-Toe
    • 字符串
      • 242. Valid Anagram
      • 151. Reverse Words in a String
      • 205. Isomorphic Strings
      • 647. Palindromic Substrings
      • 696. Count Binary Substrings
      • 28. Find the Index of the First Occurrence in a String
      • *186. Reverse Words in a String II
    • 栈与队列
      • 225. Implement Stack using Queues
      • 54. Spiral Matrix
      • 155. Min Stack
      • 232. Implement Queue using Stacks
      • 150. Evaluate Reverse Polish Notation
      • 224. Basic Calculator
      • 20. Valid Parentheses
      • 1472. Design Browser History
      • 1209. Remove All Adjacent Duplicates in String II
      • 1249. Minimum Remove to Make Valid Parentheses
      • *281. Zigzag Iterator
      • *1429. First Unique Number
      • *346. Moving Average from Data Stream
    • 优先队列/堆
      • 692. Top K Frequent Words
      • 347. Top K Frequent Elements
      • 973. K Closest Points
      • 23. Merge K Sorted Lists
      • 264. Ugly Number II
      • 378. Kth Smallest Element in a Sorted Matrix
      • 295. Find Median from Data Stream
      • 767. Reorganize String
      • 1438. Longest Continuous Subarray With Absolute Diff Less Than or Equal to Limit
      • 895. Maximum Frequency Stack
      • 1705. Maximum Number of Eaten Apples
      • *1086. High Five
    • 深度优先DFS
      • 二叉树
      • 543. Diameter of Binary Tree
      • 101. Symmetric Tree
      • 124. Binary Tree Maximum Path Sum
      • 226. Invert Binary Tree
      • 104. Maximum Depth of Binary Tree
      • 951. Flip Equivalent Binary Trees
      • 236. Lowest Common Ancestor of a Binary Tree
      • 987. Vertical Order Traversal of a Binary Tree
      • 572. Subtree of Another Tree
      • 863. All Nodes Distance K in Binary Tree
      • 1110. Delete Nodes And Return Forest
      • 230. Kth Smallest element in a BST
      • 98. Validate Binary Search Tree
      • 235. Lowest Common Ancestor of a Binary Search Tree
      • 669. Trim a Binary Search Tree
      • 700. Search in a Binary Search Tree
      • 108. Convert Sorted Array to Binary Search Tree
      • 450. Delete Node in a BST
      • 938. Range Sum of BST
      • *270. Closest Binary Search Tree Value
      • *333. Largest BST Subtree
      • *285. Inorder Successor in BST
      • *1485. Clone Binary Tree With Random Pointer
      • 回溯
      • 39. Combination Sum
      • 78. Subsets
      • 46. Permutation
      • 77. Combinations
      • 17. Letter Combinations of a Phone Number
      • 51. N-Queens
      • 93. Restore IP Addresses
      • 22. Generate Parentheses
      • 856. Score of Parentheses
      • 301. Remove Invalid Parentheses
      • 37. Sodoku Solver
      • 图DFS
      • 126. Word Ladder II
      • 212. Word Search II
      • 79. Word Search
      • 399. Evaluate Division
      • 1376. Time Needed to Inform All Employees
      • 131. Palindrome Partitioning
      • 491. Non-decreasing Subsequences
      • 698. Partition to K Equal Sum Subsets
      • 526. Beautiful Arrangement
      • 139. Word Break
      • 377. Combination Sum IV
      • 472. Concatenated Words
      • 403. Frog Jump
      • 329. Longest Increasing Path in a Matrix
      • 797. All Paths From Source to Target
      • 695. Max Area of Island
      • 341. Flatten Nested List Iterator
      • 394. Decode String
      • *291. Word Pattern II
      • *694. Number of Distinct Islands
      • *1274. Number of Ships in a Rectangle
      • *1087. Brace Expansion
    • 广度优先BFS
      • 102. Binary Tree Level Order Traversal
      • 103. Binary Tree Zigzag Level Order Traversal
      • 297. Serialize and Deserialize Binary Tree
      • 310. Minimum Height Trees
      • 127. Word Ladder
      • 934. Shortest Bridge
      • 200. Number of Islands
      • 133. Clone Graph
      • 130. Surrounded Regions
      • 752. Open the Lock
      • 815. Bus Routes
      • 1091. Shortest Path in Binary Matrix
      • 542. 01 Matrix
      • 1293. Shortest Path in a Grid with Obstacles Elimination
      • 417. Pacific Atlantic Water Flow
      • 207. Course Schedule
      • 210. Course Schedule II
      • 787. Cheapest Flights Within K Stops
      • 444. Sequence Reconstruction
      • 994. Rotting Oranges
      • 785. Is Graph Bipartite?
      • *366. Find Leaves of Binary Tree
      • *314. Binary Tree Vertical Order Traversal
      • *269. Alien Dictionary
      • *323. Connected Component in Undirected Graph
      • *490. The Maze
    • 动态规划
      • 70. Climbing Stairs
      • 72. Edit Distance
      • 377. Combination Sum IV
      • 1335. Minimum Difficulty of a Job Schedule
      • 97. Interleaving String
      • 472. Concatenated Words
      • 403. Frog Jump
      • 674. Longest Continuous Increasing Subsequence
      • 62. Unique Paths
      • 64. Minimum Path Sum
      • 368. Largest Divisible Subset
      • 300. Longest Increasing Subsequence
      • 354. Russian Doll Envelopes
      • 121. Best Time to Buy and Sell Stock
      • 132. Palindrome Partitioning II
      • 312. Burst Balloons
      • 1143. Longest Common Subsequence
      • 718. Maximum Length of Repeated Subarray
      • 174. Dungeon Game
      • 115. Distinct Subsequences
      • 91. Decode Ways
      • 639. Decode Ways II
      • 712. Minimum ASCII Delete Sum for Two Strings
      • 221. Maximal Square
      • 1277. Count Square Submatrices with All Ones
      • 198. House Robber
      • 213. House Robber II
      • 1235. Maximum Profit in Job Scheduling
      • 740. Delete and Earn
      • 87. Scramble String
      • 1140. Stone Game II
      • 322. Coin Change
      • 518. Coin Change II
      • 1048. Longest String Chain
      • 44. Wildcard Matching
      • 10. Regular Expression Matching
      • 32. Longest Valid Parentheses
      • 1043. Partition Array for Maximum Sum
      • *256. Paint House
      • 926. Flip String to Monotone Increasing
      • *1062. Longest Repeating Substring
      • *1216. Valid Palindrome III
    • 贪心
      • 56. Merge Intervals
      • 621. Task Scheduler
      • 135. Candy
      • 376. Wiggle Subsequence
      • 55. Jump Game
      • 134. Gas Station
      • 1005. Maximize Sum Of Array After K Negations
      • 406. Queue Reconstruction by Height
      • 452. Minimum Number of Arrows to Burst Balloons
      • 738. Monotone Increasing Digits
    • 单调栈
      • 739. Daily Temperatures
      • 503. Next Greater Element II
      • 901. Online Stock Span
      • 85. Maximum Rectangle
      • 84. Largest Rectangle in Histogram
      • 907. Sum of Subarray Minimums
      • 239. Sliding Window Maximum
    • 前缀和
      • 53. Maximum Subarray
      • 523. Continuous Subarray Sum
      • 304. Range Sum Query 2D - Immutable
      • 1423. Maximum Points You Can Obtain from Cards
      • 1031. Maximum Sum of Two Non-Overlapping Subarrays
    • 并查集
      • 684. Redundant Connection
      • 721. Accounts Merge
      • 547. Number of Provinces
      • 737. Sentence Similarity II
      • *305. Number of Islands II
    • 字典树trie
      • 208. Implement Trie
      • 211. Design Add and Search Words Data Structure
      • 1268. Search Suggestions System
      • *1166. Design File System
      • *642. Design Search Autocomplete System
    • 扫描线sweep line
      • 253. Meeting Room II
      • 1094. Car Pooling
      • 218. The Skyline Problem
      • *759. Employee Free Time
    • tree map
      • 729. My Calendar I
      • 981. Time Based Key-Value Store
      • 846. Hand of Straights
      • 480. Sliding Window Median
      • 318. Count of Smaller Numbers After Self
    • 数学类
      • 50. Pow(x, n)
      • *311. Sparse Matrix Multiplication
      • 382. Linked List Random Node
      • 398. Random Pick Index
      • 29. Divide Two Integers
    • 设计类
      • 1603. Design Parking System
      • 355. Design Twitter
      • 1396. Design Underground System
      • *359. Logger Rate Limiter
      • *353. Design Snake Game
      • *379. Design Phone Directory
      • *588. Design In-Memory File System
      • *1244. Design A Leaderboard
    • SQL
  • 机器学习
    • 数学基础
    • 评价指标
    • 线性回归
    • 逻辑回归
    • 树模型
    • 深度学习
    • 支持向量机
    • KNN
    • 无监督学习
    • k-means
    • 强化学习 RL
    • 自然语言处理 NLP
    • 大语言模型 LLM
    • 机器视觉 CV
    • 多模态 MM
    • 分布式机器学习
    • 推荐系统
    • 异常检测与风控
    • 模型解释性
    • 多任务学习
    • MLops
    • 特征工程
    • 在线学习
    • 硬件 cuda/triton
    • 产品case分析
    • 项目deep dive
    • 机器学习代码汇总
  • 系统设计
    • 面向对象设计
      • 电梯设计
      • 停车场设计
      • Unix文件系统设计
    • 系统设计
      • 设计社交网站Twitter
      • 设计视频网站Youtube
      • 短网址系统
      • 爬虫系统
      • 任务调度系统
      • 日志系统
      • 分布式缓存
      • 广告点击聚合系统
      • webhook
    • 机器学习系统设计
      • 推荐系统
      • 搜索引擎
      • Youtube视频推荐
      • Twitter推荐
      • 广告点击预测
      • 新闻推送推荐
      • POI推荐
      • Youtube视频搜索
      • 有害内容检测
      • 大模型RAG
      • 大模型Agent
      • 信贷风控
      • 朋友推荐
      • 去重复性/版权检测
      • 情感分析
      • 目标检测
      • 问答系统
      • 知识图谱问答
  • 行为面试
    • 领导力法则
    • 问答举例
  • 案例分享
    • 准备工作
    • 面试小抄
    • 面试之后
Powered by GitBook
On this page
  • 1. 决策树
  • 2. adaboost
  • 3. 随机森林
  • 4. 梯度提升树GBDT
  • 4.1. xgboost
  • 4.2. lightgbm
  • 4.3. catboost
  • 5. 特征重要性
  • 6. 分布式
  • 7. 问答
  • 8. 代码
  • 参考
  1. 机器学习

树模型

Previous逻辑回归Next深度学习

Last updated 21 days ago

决策树的设计思路与优化思路不同,通过特征的分裂来拟合目标;演化过程可以结合ensemble原理(bagging、boosting)

1. 决策树

优化目标

  • To find the best split that maximize the separation between different classes or reduces the impurity with each resulting node

分裂依据

  • 熵 Entropy

    • KL(p|q) = cross entropy(p, q) - H(p)

H(X)=−∑i=1np(xi)log⁡p(xi)H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)H(X)=−i=1∑n​p(xi​)logp(xi​)
  • 信息增益 Information Gain Gain(D|A) = H(D) - H(D|A)

  • 基尼系数 Gini impurity

  • squared loss

2. adaboost

  • 损失函数exp

  • 对分类正确的样本降低权重,对错误分类的样本升高或者保持全中不变。在模型融合过程中,根据错误率对基分类器器进行加权融合,错误率低的分类器拥有更大的“话语权”

3. 随机森林

  • 有放回采样地选取n个样本,建立m个决策树分类器。多个分类器采用投票机制产生最终分类结果

  • 样本随机选择,特征随机选择

  • 损失函数

  • 可以并行训练,不容易过拟合

4. 梯度提升树GBDT

  • In Gradient Boosting Tree, there's only regression tree, 不论是回归还是分类任务

  • The labels become 1.0 for Yes and 0.0 for No for training; We sum up all the "raw scores" and then convert them to probabilities by applying sigmoid function.

  • GBDT拟合的是负梯度,下一棵树拟合的是前面的负梯度。当损失函数为平方损失MSE时,负梯度正好为残差

L(y,ft(x))=L(y,ft−1(x)+ht(x))L(y, f_t(x)) = L(y, f_{t-1}(x) + h_t(x))L(y,ft​(x))=L(y,ft−1​(x)+ht​(x))
  • 引入learning rate的概念

  • Importance is calculated for a single decision tree by the amount that each attribute split point improves the performance measure, weighted by the number of observations the node is responsible for summing up how much splitting on each feature allowed you to reduce the impurity across all the splits in the tree

4.1. xgboost

  • XGBoost使用二阶泰勒展开(taylor expansion)表示梯度,即每棵树拟合的是二阶泰勒的梯度,相比GBDT的一阶泰勒展开、对梯度的表示更准确

  • the first derivative (gradient) and the second derivative (hessian)

  • 不适合用在streaming data, 在线学习/增量训练

f(x+Δx)≈f(x)+f′(x)Δx+12f′′(x)Δx2f(x+\Delta x) \approx f(x) + f'(x)\Delta x + \frac12 f''(x)\Delta x^2f(x+Δx)≈f(x)+f′(x)Δx+21​f′′(x)Δx2
  • 损失函数中显式加入了正则项,对叶子数目和叶子权重做惩罚

  • 每次分裂时,选择(每个特征的每个可能的分割值)能够最大化目标函数增益,只有损失函数更小才会进行分裂

  • 特征重要性

    • Split contains numbers of times the feature is used in a model. 作为划分属性的次数, 默认值(导致一些ID类基数大容易成为重要特征)

    • Gain result contains total gains of splits which use the feature. 特征在作为划分属性时loss的降低量

    • cover,特征在作为划分属性时对样本的覆盖度

  • 基于预排序方法加速

4.2. lightgbm

  • 基于Histogram直方图统计: lightgbm遍历每个特征寻找最优分裂点时,将每个特征进行了分桶,比如可指定分为64个桶,那么该特征所有的值都落入这64个桶中,遍历这个特征时,最多只需要遍历64次,则每次分裂的复杂度为O(特征数桶数),如果不分桶,则可能所有样本的值都不同,则复杂度为O(特征数样本数)。 为什么能分桶:因为每棵树都是弱分类器,不需要非常精准,且分桶一定程度上提高了泛化能力,降低了误差

  • lightgbm的分枝模式为leaf-wise,即遍历当前所有待分枝节点,不需要一定在最下边一层,谁的分裂增益大就分谁。而XGBoost的分枝模式为level-wise,即分完一层再分下一层,可能一层中有些叶子分裂增益极小,但是仍然要花费时间和空间去分裂

  • 单边梯度采样(Gradient-based One-Side Sampling,GOSS):使用GOSS可以减少大量只具有小梯度的数据实例,这样在计算信息增益的时候只利用剩下的具有高梯度的数据就可以了

  • 互斥特征捆绑(Exclusive Feature Bundling,EFB):使用EFB可以将许多互斥的特征绑定为一个特征,这样达到了降维的目的

  • Cache命中率优化

4.3. catboost

  • 类别特征

  • 使用对称树,对称树是平衡的,不容易过拟合

5. 特征重要性

  • 所有树中作为划分属性的次数

  • 使用特征在作为划分属性时loss平均的降低量

  • 作为划分属性时对样本的覆盖度

  • permutation

  • SHAP

6. 分布式

训练

部署

7. 问答

  • GBDT中的梯度与深度学习的梯度优化方法?

    • 二者都是借助梯度进行优化,GBDT是下一颗树拟合,深度学习是向后传播

    • GBDT中计算的梯度是针对每个样本的,m个样本的回归任务其梯度shape是m, 分类任务取决于类别数量

    • 深度学习中最后一层是针对样本的,其他layer每一层都有针对可训练的weights梯度

  • xgboost的cache awareness如何提高计算效率?

  • 如何并行

    • Boosting算法的弱学习器是没法并行迭代的,但是单个弱学习器里面最耗时的是决策树的分裂过程,XGBoost针对这个分裂做了比较大的并行优化。在训练之前,预先对每个特征内部进行了排序找出候选切割点,然后保存为block结构,后面的迭代中重复地使用这个block,减小计算量。进行节点的分裂时,需要计算每个特征的增益,最终选增益最大的那个特征去做分裂,那么各个特征的增益计算可以多线程进行,即在不同的特征属性上采用多线程并行方式寻找最佳分割点。

    • In its distributed form, XGBoost is able to train models on large datasets by splitting the computation across multiple machines or workers

    • "all-reduce" is a technique used to aggregate the results of computations across different machines (workers)

    • data parallelism (splitting data across workers) and model parallelism (splitting model training tasks like tree construction across workers)

  • xgboost如何快速分裂

    • xgboost基于分位数考虑分割点,lgb基于直方图分桶考虑

  • xgboost分类节点的依据

    • 贪心寻找分裂点,目标函数的增益

  • xgboost如何处理缺失值

    • 寻找split point的时候,忽略缺失值,该特征为missing的样本不进行遍历统计,只统计non-missing的样本,这个工程技巧减少了为稀疏离散特征寻找split point的时间开销

    • 训练时,分别将特征missing的样本分配到左叶子结点和右叶子结点,分到那个子节点带来的增益大,默认的方向就是哪个子节点

    • 推理时,如果训练过程中,出现过缺失值,则按照训练过程中缺失值划分的方向(left or right),进行划分;如果训练过程中,没有出现过缺失值,将缺失值的划分到默认方向(左子树)

  • xgboost如何正则化的

    • 代价函数里加入了正则项,树的叶子节点个数、每个叶子节点上输出的score的L2

  • 树模型的缺点

    • 在回归的外推问题

  • 模型细节

    • 模型里有特别大的值对结果有什么影响,需要如何处理

  • random forest每个node怎么做split

  • GBM中的gradient怎么定义

  • 决策树,要怎样部署在1000台机器上

    • 决策树模型序列化Serialization (即保存为文件,Pickle,Joblib,pmml,onnx)

    • 序列化后的模型文件上传到一个分布式文件系统(如HDFS、S3、GCS等),以便所有机器都能访问

8. 代码

class Decision_node(object):
    def __init__(self, feature_i=None, threshold=None, value=None, true_branch=None, false_branch=None):
        self.feature_i = feature_i          # Index for the feature that is tested
        self.threshold = threshold          # Threshold value for feature
        self.value = value                  # Value if the node is a leaf in the tree
        self.true_branch = true_branch      # 'Left' subtree
        self.false_branch = false_branch    # 'Right' subtree

class Decision_tree(object):
    def __init__(self, min_samples_split=2, min_impurity=1e-7, max_depth=float("inf"), loss=None):
        self.root = None  # Root node in dec. tree
        self.min_samples_split = min_samples_split  # Minimum n of samples to justify split
        self.min_impurity = min_impurity  # The minimum impurity to justify split
        self.max_depth = max_depth  # The maximum depth to grow the tree to
        self._impurity_calculation = None  # Function to calculate impurity (classif.=>info gain, reg=>variance reduct.)
        self._leaf_value_calculation = None  # Function to determine prediction of y at leaf
        self.one_dim = None   # If y is one-hot encoded (multi-dim) or not (one-dim)
        self.loss = loss  # If Gradient Boost

xgboost中如何自定义损失函数

参考

熵、联合熵、条件熵、交叉熵、

KL散度(相对熵)
Introduction to Boosted Trees
sagemaker
spark
dask
从sklearn源码简析GBDT
梯度提升树(GBDT)原理小结
【机器学习】决策树(上)——ID3、C4.5、CART
【机器学习】决策树(中)——Random Forest、Adaboost、GBDT
【机器学习】决策树(下)——XGBoost、LightGBM
机器学习-LightGBM - 马一凡的文章 - 知乎
Productionizing Distributed XGBoost to Train Deep Tree Models with Large Data Sets at Uber
Use XGBoost with the SageMaker Python SDK
How leave's scores are calculated in this XGBoost trees?