
大语言模型 LLM



本章假设读者已熟悉深度学习、自然语言理解、分布式机器学习等章节的内容。

1. scaling law

  • C(计算量)≈ 6 × N(模型参数量)× D(数据集 token 数)

  • 基础:1 个字节 = 8 比特。全精度(fp32)下 1 个参数占 32 bit = 4 个字节;半精度(fp16/bf16)下 1 个参数占 2 个字节

  • 1B 模型代表 1 billion(10 亿)参数:全精度下参数约占 4×10⁹ 字节显存,也就是 4 GB;半精度约占 2 GB

  • 全量微调需要同时保存参数、梯度、优化器状态。以全精度微调 1B 模型为例:参数 4 GB + 梯度 4 GB + 优化器状态 8 GB(Adam 含一阶、二阶动量)= 16 GB;半精度 Adam 微调约需 8 GB(估算代码见本节末尾)

  • LoRA 微调:前向激活的显存消耗基本不变,节省主要来自参数梯度与优化器状态——二者都只涉及 LoRA 的低秩权重部分
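把上面的估算写成代码的最小草图(假设 Adam 保存一阶、二阶两个动量,忽略激活值与框架开销,函数名与数值仅作数量级参考):

def finetune_memory_gb(n_params_billion: float, bytes_per_param: int = 4) -> dict:
    # 粗略估算全量微调显存(GB):参数 + 梯度 + Adam 两个动量状态
    params = n_params_billion * bytes_per_param   # 10 亿参数 × 每参数字节数 ≈ GB
    grads = params                                # 梯度与参数同形状、同精度
    optimizer = 2 * params                        # Adam 的 m、v 两个状态
    return {"params": params, "grads": grads,
            "optimizer": optimizer, "total": params + grads + optimizer}

print(finetune_memory_gb(1, bytes_per_param=4))  # 1B 全精度:4 + 4 + 8 = 16 GB
print(finetune_memory_gb(1, bytes_per_param=2))  # 1B 半精度:2 + 2 + 4 = 8 GB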

2. 数据

  • 多样性(diversity):不同情景、背景、语境

    • 采集方式:随机无偏差

  • 数量(size)

  • 质量(quality):通常比数量更关键,需要做清洗、去重和质量打分

  • LLM-as-judge

  • 拒绝采样

3. 模型

LLaMa

  • https://github.com/naklecha/llama3-from-scratch

  • llama的self-attention和mlp中没有bias

  • norm 使用 RMSNorm 而不是 LayerNorm:只做缩放、不减均值,计算更省(见下方草图);并采用 pre-norm
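一个最小的 RMSNorm 草图(非 llama 官方实现,仅示意“只缩放、不减均值”):

import torch
from torch import nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # 只有缩放参数,没有 bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 与 LayerNorm 不同:不计算均值,只用均方根(RMS)做归一化
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms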

Mistral

QWen

Deepseek

  • MLA, MOE, RMSNorm, DeepGEMM

MOE

  • 稀疏激活: 前向传播时只激活一小部分专家

import torch
from torch import nn

class MoeLayer(nn.Module):
    def __init__(self, experts, gate, moe_args):
        super().__init__()
        assert len(experts) > 0
        self.experts = nn.ModuleList(experts)
        self.gate = gate
        self.args = moe_args

    def forward(self, inputs: torch.Tensor):
        inputs_squashed = inputs.view(-1, inputs.shape[-1]) # (m, seq_len, dim) --> (m * seq_len, dim)
        gate_logits = self.gate(inputs_squashed) # (m * seq_len, num_experts)
        # (m * seq_len, num_experts_per_tok),
        weights, selected_experts = torch.topk(
            gate_logits, self.args.num_experts_per_tok)
        weights = nn.functional.softmax(
            weights, dim=1, dtype=torch.float).type_as(inputs)
        # (m * seq_len, dim)
        results = torch.zeros_like(inputs_squashed)
        for i, expert in enumerate(self.experts):
            # index of batch and expert
            batch_idx, nth_expert = torch.where(selected_experts == i)
            # 门控权重 × 对应专家的输出,累加回被该专家处理的 token 的结果
            results[batch_idx] += ( weights[batch_idx, nth_expert, None] * expert(inputs_squashed[batch_idx]) )
        # (m * seq_len, dim) --> (m, seq_len, dim)
        return results.view_as(inputs)

RoPE

$\langle f_q(c_m, m), f_k(c_n, n) \rangle = g(c_m, c_n, m - n)$
import torch

class Rotary(torch.nn.Module):
    def __init__(self, dim, base=10000):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)
        self.seq_len_cached = None
        self.cos_cached = None
        self.sin_cached = None

    def forward(self, x, seq_dim=1):
        seq_len = x.shape[seq_dim]
        if seq_len != self.seq_len_cached:
            self.seq_len_cached = seq_len
            t = torch.arange(x.shape[seq_dim], device=x.device).type_as(self.inv_freq)  # [seq_len]
            freqs = torch.einsum("i,j->ij", t, self.inv_freq)  # [seq_len, dim // 2]
            emb = torch.cat((freqs, freqs), dim=-1).to(x.device)
            self.cos_cached = emb.cos()[:, None, None, :]
            self.sin_cached = emb.sin()[:, None, None, :]
        return self.cos_cached, self.sin_cached


def rotate_half(x):
    x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=x1.ndim - 1)


@torch.jit.script
def apply_rotary_pos_emb(q, k, cos, sin):
    # q, k: [seq, batch, heads, hdim]
    return (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)
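上面 Rotary 模块的调用示意(假设 q、k 的形状为 [seq, batch, heads, hdim],与代码注释一致):

q = torch.randn(8, 2, 4, 16)               # [seq, batch, heads, hdim]
k = torch.randn(8, 2, 4, 16)
rotary = Rotary(dim=16)
cos, sin = rotary(q, seq_dim=0)            # 缓存好的 cos/sin 表
q_rot, k_rot = apply_rotary_pos_emb(q, k, cos, sin)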

4. 训练

  • limited GPU(显存受限)时的常用技巧,组合配置示例见列表后的代码

    • use fp16 (this speeds up training)

    • use gradient_accumulation_steps (this simulates larger batch sizes)

    • use gradient_checkpointing (recompute activations in the backward pass to save GPU memory)

    • freeze model embeddings (this reduces weights to train)

    • freeze some model layers (this reduces weights to train)

    • use PEFT (this reduces weights to train)

    • increase LR and decrease epochs (this reduces work)

    • use smaller models (this reduces weights to train)
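把其中几条技巧组合起来的配置草图(假设使用 Hugging Face Transformers,参数名以所用版本文档为准):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    fp16=True,                        # 半精度训练,省显存并提速
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,   # 梯度累积,模拟更大的 batch size
    gradient_checkpointing=True,      # 反向传播时重算激活,以计算换显存
    learning_rate=2e-4,               # 适当调大学习率、减少 epoch
    num_train_epochs=1,
)

# 冻结 embedding 与部分底层,减少可训练参数(示意,属性名视具体模型结构而定):
# model.get_input_embeddings().requires_grad_(False)
# for layer in model.model.layers[:8]:
#     layer.requires_grad_(False)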

4.1 pretrain

数据清洗方法、pretrain数据配比、pretrain超参数、退火阶段

4.2 SFT

task种类、sft数据量级、合成数据

高效参数微调 PEFT

  • Prompt tuning

    • 冻结预训练模型的参数,仅训练输入端新增的一小段 soft prompt(virtual token)embedding

  • Adapter Tuning

    • 所有参数微调较为低效,只微调下游的几层效果不佳。因此设计了adapter结构,在原结构中稍微增加了参数,只微调该部分,效果接近full-finetune

    • down-project 层先把高维特征映射到低维,经过一个非线性层后,再由 up-project 层映射回原来的高维特征;配合 skip-connection 结构,确保最差情况下退化为 identity

  • Prefix Tuning 前缀微调

    • 问题:最终的性能对人工设计的template的特别敏感,加一个词或者少一个词,或者变动位置,都会造成很大的变化

    • 使用连续的virtual token embedding来代替离散的token. 将一个连续的特定于任务的向量序列添加到输入,称之为前缀. Prefix是可以学习的“隐式”的提示,训练的时候只更新Prefix部分的参数,而Transformer中的预训练参数固定

  • p-tuning

    • 自然语言的离散模版转化为可训练的隐式prompt

  • LoRA 低秩自适应

    • 用低秩的方式(一个矩阵可以用两个较小的矩阵相乘来近似)来调整参数矩阵.

    • 冻结预训练好的模型权重参数, 通过往模型中加入额外的网络层,并只训练这些新增的网络层参数

    • QLoRA: Quantized LoRA, 使用QLoRA算法要结合bitsandbytes库和peft库

import math
import torch
from torch import nn

input_dim = 768
output_dim = 768
rank = 8          # 低秩适应的秩 r
alpha = 1.0       # 缩放系数(常取 lora_alpha / rank)
W = ...           # 来自预训练网络的权重,形状为 input_dim x output_dim
W_A = nn.Parameter(torch.empty(input_dim, rank))   # LoRA 权重 A
W_B = nn.Parameter(torch.empty(rank, output_dim))  # LoRA 权重 B

# 初始化 LoRA 权重:A 用 kaiming 初始化,B 置零,保证训练开始时增量为 0
nn.init.kaiming_uniform_(W_A, a=math.sqrt(5))
nn.init.zeros_(W_B)

def regular_forward_matmul(x, W):
    h = x @ W
    return h

def lora_forward_matmul(x, W, W_A, W_B):
    h = x @ W                       # 常规矩阵乘法(冻结的预训练权重)
    h += x @ (W_A @ W_B) * alpha    # 加上缩放后的低秩增量
    return h

4.3 RLHF

  • RLHF 与 RLAIF: better align with human preferences and reduce undesired outcomes in scenarios

  • SFT负责Instruction following,RL强化helpfulness、honesty、safety偏好

  • SFT 不具备向后看的能力,只能看到当前 token 之前的内容;RLHF 的 critic model 和 reward model 对完整序列打分,能看到当前位置之后的内容,因此能利用整句信息

  • SFT给的都是正例,没有负反馈;RLHF通过给序列一个低分给模型负反馈,减少生成某个token的概率

  • online RL 的训练数据是当场采集而来,一边造数据,一边训练模型
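上面提到 reward model 对整个序列打分,其常用的成对(pairwise)训练损失可以写成如下草图(Bradley–Terry 形式,非特定框架的实现):

import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # r_chosen / r_rejected:reward model 给"更好/更差"回答打出的标量分数,形状 [batch]
    # 目标:最大化 P(chosen 优于 rejected) = sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

loss = pairwise_reward_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.1, 0.5]))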

4.4 Long context

位置编码内插与外推

  • Position Interpolation

  • NTK

  • YaRN

工程

  • 序列并行(SP): 将输入序列进行切分

  • Ring Attention、DeepSpeed-Ulysses

  • KV量化

4.5 reasoning

5. 评测

  • ROUGE

  • BLEU

  • perplexity
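其中困惑度(perplexity)的常用定义如下(按 token 级对数似然计算的一种写法):

$\text{PPL}(x_{1:N}) = \exp\Big(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i \mid x_{<i})\Big)$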

6. 推理

  • throughput 吞吐

    • 估算:同时处理的请求数 ÷ 单次推理时间,例如并发 8、单次 0.5 s ≈ 16 req/s

  • 推理prefill阶段是compute bound,算力利用充分;decode阶段是memory bound,算力卡在内存访问上

    • 解码策略:贪心每一步选概率最大的 token;top-k / top-p 在截断后的候选集合中按概率随机采样;beam search 同时保留多条候选序列

  • dynamic batching, continuous batching

  • flash attention, paged attention

  • quantization: fp8, bf16, gptq, awq

  • 算子优化: gemm,transpose,mha,rmsnorm,gemv,rope

KV cache

空间换时间:生成新 token 时,历史 token 的 K、V 不会变化,可以缓存复用,只需为最新 token 计算注意力。下面公式中计算 Att_3 时,K_1 V_1 和 K_2 V_2 直接取自缓存。

\begin{align*}
\text{Att}_1(Q, K, V) &= \text{softmax}(Q_1 K_1^T) V_1 \\
\text{Att}_2(Q, K, V) &= \text{softmax}(Q_2 K_1^T) V_1 + \text{softmax}(Q_2 K_2^T) V_2 \\
\text{Att}_3(Q, K, V) &= \text{softmax}(Q_3 K_1^T) V_1 + \text{softmax}(Q_3 K_2^T) V_2 + \text{softmax}(Q_3 K_3^T) V_3
\end{align*}
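把这一过程写成最小的生成循环草图(单头、无投影矩阵,仅示意“历史 K/V 缓存、每步只追加最新一条”):

import torch

def attend(q, K, V):
    # q: [1, d];K, V: [t, d];单步注意力
    scores = (q @ K.T) / K.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ V

d = 16
K_cache = torch.empty(0, d)
V_cache = torch.empty(0, d)
for step in range(3):
    x = torch.randn(1, d)                      # 当前 token 的隐状态(示意)
    q, k, v = x, x, x                          # 实际中由 W_q/W_k/W_v 投影得到
    K_cache = torch.cat([K_cache, k], dim=0)   # 只追加最新 token 的 K/V
    V_cache = torch.cat([V_cache, v], dim=0)
    out = attend(q, K_cache, V_cache)          # 对应公式中的 Att_step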

MQA(multi query attention) / GQA(group query attention)

  • MQA 中所有 query 头共享同一组 K、V;GQA 按组共享。由 MHA 权重转换为 GQA 时,每组的 KV 头可通过平均池化组内的所有原始头来构建
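GQA 前向的一个最小草图(通过复制 KV 头让每组 query 头共享同一组 K、V,非任何模型的官方实现):

import torch

def grouped_query_attention(q, k, v):
    # q: [batch, n_q_heads, seq, d];k, v: [batch, n_kv_heads, seq, d],n_q_heads 是 n_kv_heads 的整数倍
    repeat = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(repeat, dim=1)     # 组内 query 头共享同一组 KV
    v = v.repeat_interleave(repeat, dim=1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v

out = grouped_query_attention(torch.randn(1, 8, 5, 16),
                              torch.randn(1, 2, 5, 16),
                              torch.randn(1, 2, 5, 16))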

quantization 量化

  • Post training quantization(PTQ)

    • OBS(Optimal Brain Surgeon) -> OBQ(Optimal Brain Quantization) -> GPTQ

  • Quantization aware training(QAT)

flash-attention

通过矩阵分块计算以及减少内存读写次数的方式,提高注意力分数的计算效率

输入 Q、K、V 分块,保证每个块都能在片上 SRAM(共享内存)中完成注意力计算,再把结果写回 HBM(高带宽显存),从而减少对 HBM 的读写次数。

paged-attention

针对增量解码阶段,对于 KV 缓存进行分块存储,并优化了计算方式,增大了并行计算度,从而提高了计算效率

蒸馏

triton(NVIDIA Triton Inference Server)

  • 通过 dynamic batching、concurrent execution、optimal model configuration、model ensemble、DALI model 等策略来提升在线推理的性能

框架

7. 应用与优化

7.1 prompt

  • https://platform.openai.com/docs/guides/prompt-engineering

  • https://yiyan.baidu.com/learn

  • https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/

  • https://huggingface.co/blog/chat-templates

COT

  • 思维链:让模型先输出中间推理步骤,再给出最终答案;也可结合 LLM 的 self-reflection 做 planning

Few Shot

7.2 in context learning

  • 通过展示数据形式,来激活预训练模型的能力

  • examples采样:从训练数据中选择与query语义相近的示例,效果会比较好

7.3 RAG

  • 主要针对大语言模型的幻觉、数据时效性、数据安全问题

  • LangChain
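一个不绑定具体向量库的最小 RAG 流程草图(用词重叠代替真实 embedding 检索,仅示意“先检索、再拼 prompt”两步;实际中可替换为向量检索 + LangChain 等框架):

def retrieve(query, docs, k=2):
    # 玩具检索:按与 query 的词重叠数排序;实际中应使用向量检索(FAISS、Milvus 等)
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)[:k]

def build_prompt(query, docs):
    context = "\n".join("- " + d for d in retrieve(query, docs))
    return "根据以下资料回答问题,若资料不足请说明:\n" + context + "\n\n问题:" + query

docs = ["LoRA 通过低秩矩阵高效微调大模型", "RAG 在生成前先检索外部知识以减少幻觉"]
print(build_prompt("什么是 RAG", docs))  # 再将该 prompt 交给 LLM 生成答案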

7.4 Agents

tool usage

import re

def extract_action_and_input(text):
    # 从模型输出中解析工具名(Action)与工具入参(Action Input),该格式常见于 ReAct 风格的 agent 提示
    action_pattern = r"Action: (.+?)\n"
    input_pattern = r"Action Input: \"(.+?)\""
    action = re.findall(action_pattern, text)
    action_input = re.findall(input_pattern, text)
    return action, action_input
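调用示意(假设模型输出遵循上述 Action / Action Input 格式):

text = 'Thought: 需要查询天气\nAction: Search\nAction Input: "北京今天天气"\n'
print(extract_action_and_input(text))  # (['Search'], ['北京今天天气'])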

7.5 大规模部署

  • 容器化 + k8s

  • 负载均衡和路由算法

  • 自动扩容/缩容

  • 健康检查(状态检测)

8. 问答

  • 一个给定任务,如何优化LLM的效果

    • 从prompt engineering开始

    • RAG

    • fine tuning

  • attention的加速优化

    • flash-attention 及 S2attention

  • 如何扩展LLM的token

    • position embedding的角度

  • 如何构建Instruction数据集

    • 语料生成,格式构造,提示词

  • 如何处理训练中的loss spike

  • 知识幻觉(track and solve hallucination)

    • 数据(数据重复、Bias、时效性, 一对多的映射关系),训练(Imperfect representation learning、Parametric knowledge bias)

  • conflation

  • 复读机问题/ 文本生成的重复问题

    • 多样性训练数据

    • 引入噪声

    • 温度参数调整

    • 后处理和过滤

  • 灾难性遗忘

    • 微调过程中混合通用知识指令微调数据,数据配比

    • full or LoRA

    • 重播缓冲区

    • 弹性权重巩固(EWC,Elastic Weight Consolidation)

    • 增量学习

    • 多任务学习

    • 遗忘的成因:数据分布差异、参数更新冲突

  • ZeRO(DeepSpeed)应用的时候 performance 严重下降,为什么

  • 怎么控制GAI不要给虚假答案

    • constitutional AI,red teaming 去帮助模型规范作答

    • fine tune一个小模型给specific task

  • 文本生成的多跳问题

    • 预填充(Prefill):prompt token一次性读进显存;生成(Completion):自回归方式一次生成一个新 Token

    • https://mp.weixin.qq.com/s/N0sjdNo-qWdZJ4UkXm-bdw

  • 随着模型的增大,学习率越来越小。学习率与数据量、批量大小都没有明显的关系,且一般使用1e-3左右的学习率

  • reducing LLM latency at inference time

  • SwiGLU(前馈层结构草图见下)
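SwiGLU 前馈层的一个最小草图(LLaMA 风格的门控 FFN,非官方实现):

import torch
from torch import nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU:SiLU(x W_gate) ⊙ (x W_up),再投影回原维度
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))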

reference

精读

扩展

课程

  • https://github.com/mlabonne/llm-course

  • https://github.com/InternLM/Tutorial/tree/camp2

  • https://github.com/datawhalechina/llm-universe

  • https://github.com/rasbt/LLMs-from-scratch/tree/main

  • https://github.com/peremartra/Large-Language-Model-Notebooks-Course

  • https://hanlab.mit.edu/courses/2024-fall-65940

激活函数使用 SwiGLU

位置编码使用 RoPE

Multi-query Attention 和 Grouped-Query Attention

dpo / ppo 训练技巧,相关模型参考强化学习章节与 OpenRLHF

解码方式:贪心搜索 (Greedy search)、波束搜索 (Beam search)、Top-K 采样 (Top-K sampling)、Top-p 采样 (Top-p sampling)、投机采样 (speculative sampling)、lookahead decoding

深度学习
自然语言理解
分布式机器学习
Transformer Math 101
swiglu
RoPE
Grouped-Query Attention
强化学习
https://github.com/OpenRLHF/OpenRLHF
Towards Efficient Generative Large Language Model Serving
Mastering LLM Techniques: Inference Optimization
解码方式
投机采样 Speculative Decoding
Transformers KV Caching Explained
LLM Inference Series: 3. KV caching explained
大模型推理优化技术-KV Cache
图解大模型计算加速系列:FlashAttention V1,从硬件到计算逻辑 - 猛猿的文章 - 知乎
图解大模型计算加速系列之:vLLM核心技术PagedAttention原理 - 猛猿的文章 - 知乎
https://www.anthropic.com/research/building-effective-agents
LLM Powered Autonomous Agents
Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey
adam在大模型预训练中的不稳定性分析及解决办法 - 丁晖的文章 - 知乎
Scaling Laws for Neural Language Models
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
Llama 2: Open Foundation and Fine-Tuned Chat Models
https://github.com/pytorch-labs/gpt-fast
https://github.com/meta-llama/llama3
https://github.com/x1xhlol/system-prompts-and-models-of-ai-tools
A Survey of Large Language Models
A Comprehensive Survey on Pretrained Foundation Models A History from BERT to ChatGPT
LLM推理优化技术综述:KVCache、PageAttention、FlashAttention、MQA、GQA
The Rise and Potential of Large Language Model Based Agents: A Survey
SearchAnything
A Survey of Techniques for Maximizing LLM Performance
https://github.com/NVIDIA/FasterTransformer
大模型检索增强生成(RAG)有哪些好用的技巧? - Breezedeus的回答 - 知乎
FlashAttention 的速度优化原理是怎样的?
语言模型之Text embedding(思考篇) - 泽龙的文章 - 知乎
大模型词表扩充必备工具SentencePiece - 吃果冻不吐果冻皮的文章 - 知乎
十分钟读懂旋转编码(RoPE) - 绝密伏击的文章 - 知乎
大模型基础组件之位置编码-万字长文全面解读LLM中的位置编码与长度外推性(上) - OpenLLMAI的文章 - 知乎
KV cache详解 图示,显存,计算量分析
TH LLM Study Group 20231201
大语言模型是如何在预训练的过程中学习超长文本的呢? - 段淇源的回答 - 知乎
让LLM更好地学会中文:大模型继续预训练实践纪录 - Lil2J的文章 - 知乎
如何解释大模型的重复生成现象? - 慕谦或谦修的回答 - 知乎
MOE多卡
https://github.com/lucidrains/mixture-of-experts
大语言模型(LLM)评价指标小汇总 - 花师小哲的文章 - 知乎
flash-attention
outlines: Structured Text Generation
vllm
LLM inference in C/C++
LLM训练-pretrain - ybq的文章 - 知乎
LLM训练-sft - ybq的文章 - 知乎
https://github.com/princeton-nlp/LESS
https://github.com/yihedeng9/rlhf-summary-notes
Open Source and In-House: How Uber Optimizes LLM Training
vLLM源码之PageAttention - 手抓饼熊的文章 - 知乎
Uber Figure: Resource scheduling for LLM workflows.