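1. Optimization
1.1 Forward and backward propagation
The forward pass computes the loss from the predictions and the labels; the backward pass needs the gradient of that loss. A loss-function class is therefore designed with two methods: one returns the loss value, the other returns the gradient, that is, the partial derivative of the loss with respect to y_hat.
The parameters are then updated from their current values and their gradients: optimizer(w, w_grad)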
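The loss-class design and the optimizer(w, w_grad) call can be sketched as follows. This is a minimal illustration, not the notes' own implementation: the names `MSELoss` and `sgd`, the learning rate, and the toy data are all assumptions.

```python
import numpy as np


class MSELoss:
    """Hypothetical loss class with a value method and a gradient method."""

    def forward(self, y_hat, y):
        return np.mean((y_hat - y) ** 2)

    def gradient(self, y_hat, y):
        # Partial derivative of the mean squared error with respect to y_hat
        return 2.0 * (y_hat - y) / y.size


def sgd(w, w_grad, lr=0.1):
    """Plain gradient-descent step, standing in for optimizer(w, w_grad)."""
    return w - lr * w_grad


# Toy problem: recover the slope of y = 3x with a single weight w
x = np.array([1.0, 2.0, 3.0])
y = 3.0 * x
w = 0.0
loss_fn = MSELoss()
for _ in range(200):
    y_hat = w * x                          # forward pass
    dy_hat = loss_fn.gradient(y_hat, y)    # dL/dy_hat
    w_grad = np.sum(dy_hat * x)            # chain rule: dL/dw
    w = sgd(w, w_grad)
final_loss = loss_fn.forward(w * x, y)
```

Keeping the loss value and its gradient in one class lets the training loop stay the same when the loss is swapped.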
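1.2 Optimizer
The gradient is the slope of a curve at a given point.
For a single variable: where the curve is steep, take a larger step; where it is flat, take a smaller one. With several variables, the different rates of change of each variable together determine the overall update direction.
Some optimizers look beyond the current gradient and also use the result of the previous step: they compute a first moment and a second moment from the gradient history.
SGD with Momentum adds inertia by introducing the first moment.
Adam (Adaptive + Momentum) keeps a first moment (momentum) and a second moment (variance) for each parameter's gradient and uses them to adapt the step size: when the gradients are small the effective learning rate increases, and when they are large it decreases.
Concretely, Adam uses exponential moving averages to estimate the first moment (momentum) and the second moment (adaptive learning rate) of each gradient component, then normalizes the first moment by the second to obtain the update at each step.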
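The moving-average update described above can be sketched for a single scalar parameter. This is a simplified illustration under assumed hyperparameters (`lr`, `beta1`, `beta2`, `eps`, `n_iter`); the function name `adam_minimize` is hypothetical.

```python
import math


def adam_minimize(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, n_iter=500):
    """Minimize a 1-D function with an Adam-style update, given its gradient."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, n_iter + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g        # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g * g    # second moment
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        # Normalize the first moment by the square root of the second moment
        x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# f(x) = x^2 has gradient 2x and its minimum at x = 0
x_min = adam_minimize(lambda x: 2.0 * x, x0=1.0)
```

Setting `beta2` aside and dropping the normalization leaves only the momentum term, which is the SGD with Momentum case.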
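Batch Size
Train with the largest batch size that fits in memory to speed up training, but be aware of the trade-off.
If the batch size is too small, the gradient estimates fluctuate strongly and training converges less easily. That noise also helps the optimizer escape local optima, so the model often generalizes better.
If the batch size is larger, each epoch contains fewer update steps. The gradients fluctuate less, but with fewer steps over the same number of samples the model can converge more slowly.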
```python
# Example of gradient descent for a one-dimensional function
from numpy import asarray
from numpy.random import rand


def objective(x):
    """Objective function f(x) = x^2."""
    return x ** 2.0


def derivative(x):
    """Derivative f'(x) = 2x."""
    return x * 2.0


def gradient_descent(objective, derivative, bounds, n_iter, step_size):
    # Draw a random starting point inside the bounds
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    for i in range(n_iter):
        gradient = derivative(solution)
        # Step against the gradient
        solution = solution - step_size * gradient
        solution_eval = objective(solution)
        print('>%d f(%s) = %.5f' % (i, solution, solution_eval[0]))
    return [solution, solution_eval]


bounds = asarray([[-1.0, 1.0]])
n_iter = 30
step_size = 0.1
best, score = gradient_descent(objective, derivative, bounds, n_iter, step_size)
```
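1.3 Learning rate scheduler
A common heuristic is that the LR should be proportional to the square root of the factor by which the batch size grows, which keeps the variance of the update growing in proportion to the gradient.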
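As a small worked example of the square-root heuristic, the helper below rescales a base learning rate when the batch size changes. The function name `scale_lr` and the numbers are illustrative assumptions.

```python
import math


def scale_lr(base_lr, base_batch_size, new_batch_size):
    """Scale the LR with the square root of the batch-size growth factor."""
    return base_lr * math.sqrt(new_batch_size / base_batch_size)


# Batch size grows 4x (32 -> 128), so the LR grows 2x
lr_x4 = scale_lr(1e-3, 32, 128)
```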
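```python
import numpy as np

# LR_INIT, LR_MIN, CYCLE and batches_per_epoch are assumed to be defined by the training script.
# Cyclic LR: restart the learning rate every CYCLE steps, so that within a fixed budget training converges to several local minima and yields several models to ensemble.
scheduler = lambda x: ((LR_INIT - LR_MIN) / 2) * (np.cos(np.pi * (np.mod(x - 1, CYCLE) / CYCLE)) + 1) + LR_MIN
# Warm-up: reduces early overfitting to the first mini-batches, keeps the distribution stable, and helps keep the deeper layers stable.
warmup_steps = int(batches_per_epoch * 5)
```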
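One common way to combine a warm-up phase with a decaying schedule is a linear ramp followed by cosine decay. The sketch below is an assumption about how the warm-up could be wired in, not the notes' own scheduler; `lr_at_step` and all default values are hypothetical.

```python
import math


def lr_at_step(step, base_lr=1e-3, min_lr=1e-5, warmup_steps=500, total_steps=10_000):
    """Linear warm-up to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so the first mini-batches use a small LR
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


schedule = [lr_at_step(s) for s in range(0, 10_001, 1000)]
```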
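1.4 Initialization
Initializing all weights to the same value (for example, all zeros) makes every hidden unit produce the same activation, so the network behaves as if it had a single hidden node and the hidden size becomes meaningless.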
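The symmetry problem is easy to reproduce. In this sketch (array sizes, the constant 0.5, and the tanh activation are arbitrary choices for illustration) a constant weight matrix yields identical hidden columns, while a random one does not.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                 # 4 samples, 3 features

w_const = np.full((3, 5), 0.5)              # every hidden unit starts identical
h_const = np.tanh(x @ w_const)              # all 5 hidden columns are equal

w_rand = rng.normal(scale=0.1, size=(3, 5))
h_rand = np.tanh(x @ w_rand)                # hidden columns differ

# Compare every column with the first one
same_const = np.allclose(h_const, h_const[:, :1])
same_rand = np.allclose(h_rand, h_rand[:, :1])
```

Because identical units also receive identical gradients, the symmetry persists through training unless the initialization breaks it.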
2. Loss functions
Training with MSE loss corresponds to maximum likelihood under the assumption that the target noise is normally distributed.
Cross entropy / log loss
nn.CrossEntropyLoss()(pred, label) = nn.NLLLoss()(torch.log_softmax(pred, dim=1), label)
$$ce = -y\log(p) - (1-y)\log(1-p)$$
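The identity between cross entropy and NLL over log-softmax can be checked numerically. The sketch below uses NumPy rather than PyTorch so that it stays dependency-light; all function names and the sample logits are assumptions for illustration.

```python
import numpy as np


def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def nll_loss(log_probs, labels):
    # Pick the log-probability of the true class for each sample
    return -log_probs[np.arange(len(labels)), labels].mean()


def cross_entropy(logits, labels):
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    one_hot = np.eye(logits.shape[1])[labels]
    return -(one_hot * np.log(probs)).sum(axis=1).mean()


def binary_ce(y, p):
    """Binary form: -y log(p) - (1 - y) log(1 - p)."""
    return -y * np.log(p) - (1 - y) * np.log(1 - p)


logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
labels = np.array([0, 2])
ce = cross_entropy(logits, labels)
nll = nll_loss(log_softmax(logits), labels)
```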
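Focal loss
Focal loss adds a modulating factor to the CE loss that lowers the weight of easy samples, so training focuses more on hard samples. The factor measures how hard a sample is: it is (1 - p) raised to the power gamma, where p is the predicted probability of the true class.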
```python
import torch
from torch import nn


class FocalLoss(nn.Module):
    """Binary focal loss; `preds` are probabilities in (0, 1), `targets` are 0/1 labels."""

    def __init__(self, gamma, eps=1e-7):
        super().__init__()
        self.gamma = gamma
        self.eps = eps

    def forward(self, preds, targets):
        # Clamp the probabilities to avoid log(0)
        preds = preds.clamp(self.eps, 1 - self.eps)
        loss = ((1 - preds) ** self.gamma * targets * torch.log(preds)
                + preds ** self.gamma * (1 - targets) * torch.log(1 - preds))
        return -torch.mean(loss)
```
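To see how strongly the modulating factor down-weights easy samples, the scalar sketch below evaluates the same formula for one easy and one hard positive example. The helper `binary_focal` and the probabilities 0.9 and 0.1 are illustrative assumptions.

```python
import math


def binary_focal(p, y, gamma=2.0):
    """Focal loss for a single sample; p is the predicted probability of class 1."""
    p_t = p if y == 1 else 1.0 - p           # probability assigned to the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = binary_focal(0.9, 1)                  # confident and correct
hard = binary_focal(0.1, 1)                  # confident and wrong
ce_easy = binary_focal(0.9, 1, gamma=0.0)    # gamma = 0 recovers plain cross entropy
ce_hard = binary_focal(0.1, 1, gamma=0.0)
```

With gamma = 2 the easy sample keeps 1% of its cross-entropy loss, while the hard sample keeps 81%.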
3. Network architectures
3.1 MLP
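```python
import numpy as np


class Layer:
    """Minimal base class, assumed here so that the example is self-contained."""

    def __init__(self):
        self.input = None
        self.output = None


class Dense(Layer):
    def __init__(self, input_size, output_size):
        super().__init__()
        # Uniform initialization in [-0.5, 0.5)
        self.weights = np.random.rand(input_size, output_size) - 0.5
        self.bias = np.random.rand(1, output_size) - 0.5

    def forward_propagation(self, input_data):
        self.input = input_data
        self.output = np.dot(self.input, self.weights) + self.bias
        return self.output

    def backward_propagation(self, output_error, learning_rate):
        # output_error is dE/dY; compute dE/dX and dE/dW before updating the weights
        input_error = np.dot(output_error, self.weights.T)
        weight_error = np.dot(self.input.T, output_error)
        self.weights -= learning_rate * weight_error
        # Sum over the batch axis so the bias keeps its (1, output_size) shape
        self.bias -= learning_rate * np.sum(output_error, axis=0, keepdims=True)
        return input_error
```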
3.2 CNN
Convolution is a mathematical operation that takes an input I and a kernel K and produces an output expressing how the shape of one is modified by the other. In a CNN, the values of the filter(s) are learned by backpropagation.
The convolutional layer is the core building block of a CNN; it performs feature detection.
The kernel K is a set of learnable filters. Each filter is small spatially compared with the image but extends through the full depth of the input.
The size of the feature map, as a function of the input image size (W), feature detector size (F), stride (S) and zero padding (P), is (W−F+2P)/S+1.
No. of parameters per filter = (kernel size * kernel size * input depth) + 1; for example, a 3×3 kernel over a 3-channel input gives 3 * 3 * 3 + 1 = 28.
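Both formulas translate directly into small helpers. The function names below are hypothetical; the integer division assumes that the stride tiles the padded input, as in the formula above.

```python
def conv_output_size(w, f, s=1, p=0):
    """Feature-map size (W - F + 2P) / S + 1 along one spatial axis."""
    return (w - f + 2 * p) // s + 1


def conv_param_count(kernel_size, in_channels, n_filters=1):
    """Weights plus one bias per filter."""
    return (kernel_size * kernel_size * in_channels + 1) * n_filters


same_size = conv_output_size(32, 5, s=1, p=2)    # 5x5 kernel with padding 2 keeps 32
strided = conv_output_size(7, 3, s=2, p=0)       # 7 -> 3 with stride 2
one_filter = conv_param_count(3, 3)              # 3 * 3 * 3 + 1
```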
A convolution is equivalent to a single multiplication by one large matrix (see Orthogonal Convolutional Neural Networks).
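In one dimension the equivalence is easy to verify: stacking shifted copies of the kernel into a banded matrix reproduces the sliding-window result. This is a minimal 1-D sketch with an assumed helper name `conv_matrix`; like most deep-learning libraries it uses cross-correlation, that is, no kernel flip.

```python
import numpy as np


def conv_matrix(kernel, n):
    """Build the (n - k + 1, n) banded matrix of a 'valid' sliding-window product."""
    k = len(kernel)
    m = np.zeros((n - k + 1, n))
    for i in range(n - k + 1):
        m[i, i:i + k] = kernel               # row i holds the kernel shifted by i
    return m


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([1.0, 0.0, -1.0])

as_matmul = conv_matrix(kernel, len(x)) @ x
as_sliding = np.correlate(x, kernel, mode="valid")
```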
A CNN carries more inductive bias than a vision transformer. The inductive biases of a CNN are locality and translation equivariance.
Online convolution computes the convolution in real time while the data arrives as a stream.
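A streaming convolution only needs a buffer of the most recent samples. The class below is a minimal sketch of a causal 1-D case; the name `OnlineConv1D`, the zero-filled initial buffer, and the example kernel are assumptions.

```python
from collections import deque


class OnlineConv1D:
    """Causal 1-D convolution that consumes one sample at a time."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        # Start with zeros, which is equivalent to zero padding on the left
        self.buffer = deque([0.0] * len(self.kernel), maxlen=len(self.kernel))

    def push(self, sample):
        self.buffer.append(sample)            # the oldest sample drops out
        # y[t] = sum_k kernel[k] * x[t - k]; reversed() puts the newest sample first
        return sum(w * x for w, x in zip(self.kernel, reversed(self.buffer)))


conv = OnlineConv1D([0.5, 0.3, 0.2])
stream = [1.0, 2.0, 3.0, 4.0]
outputs = [conv.push(s) for s in stream]
```

Each call costs O(kernel length) regardless of how long the stream has been running.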
```python
# https://github.com/openai/gpt-2/blob/master/src/model.py
import tensorflow as tf  # TensorFlow 1.x API


# shape_list is a helper defined in the same file
def conv1d(x, scope, nf, *, w_init_stdev=0.02):
    with tf.variable_scope(scope):
        *start, nx = shape_list(x)
        w = tf.get_variable('w', [1, nx, nf], initializer=tf.random_normal_initializer(stddev=w_init_stdev))
        b = tf.get_variable('b', [nf], initializer=tf.constant_initializer(0))
        # A kernel-size-1 convolution is a matmul over the last axis
        c = tf.reshape(tf.matmul(tf.reshape(x, [-1, nx]), tf.reshape(w, [-1, nf])) + b, start + [nf])
        return c
```
```python
import numpy as np


def conv2D(image, kernel, padding=0, strides=1):
    # Flip the kernel so that the sliding-window product below is a true convolution
    # (without the flip it would be a cross-correlation)
    kernel = np.flipud(np.fliplr(kernel))

    # Gather shapes of kernel and image
    xKernShape, yKernShape = kernel.shape
    xImgShape, yImgShape = image.shape

    # Shape of the output
    xOutput = (xImgShape - xKernShape + 2 * padding) // strides + 1
    yOutput = (yImgShape - yKernShape + 2 * padding) // strides + 1
    output = np.zeros((xOutput, yOutput))

    # Apply equal zero padding to all sides
    if padding != 0:
        imagePadded = np.zeros((xImgShape + padding * 2, yImgShape + padding * 2))
        imagePadded[padding:-padding, padding:-padding] = image
    else:
        imagePadded = image

    # Slide the kernel over the padded image, moving `strides` pixels per step
    for y in range(yOutput):
        for x in range(xOutput):
            xs, ys = x * strides, y * strides
            output[x, y] = (kernel * imagePadded[xs:xs + xKernShape, ys:ys + yKernShape]).sum()
    return output
```
```python
# Method of a feed-forward layer (FFL) class; self.activation holds the activation name
def activation_fn(self, x):
    """Apply the activation function configured for this layer."""
    if self.activation == 'relu':
        # np.maximum returns a new array instead of modifying x in place
        return np.maximum(x, 0)
    if self.activation is None or self.activation == "linear":
        return x
    if self.activation == 'tanh':
        return np.tanh(x)
    if self.activation == 'sigmoid':
        return 1 / (1 + np.exp(-x))
    if self.activation == "softmax":
        # Subtract the max for numerical stability
        x = x - np.max(x)
        s = np.exp(x)
        return s / np.sum(s)
```
```python
import numpy as np


def conv2d(inputs, kernels, bias, stride, padding):
    """Forward convolution.

    inputs: input data of shape (C, H, W)
    kernels: kernels of shape (F, C, HH, WW); C is the number of input channels, F the number of output channels
    bias: bias of shape (F,)
    stride: stride
    padding: zero padding
    """
    # Shapes of the input and the kernels
    C, H, W = inputs.shape
    F, _, HH, WW = kernels.shape
    # Pad the input: no padding on the channel axis, `padding` zeros on both ends of the height and width axes
    inputs_pad = np.pad(inputs, ((0, 0), (padding, padding), (padding, padding)))
    # Allocate the output with the spatial size after convolution
    H_out = 1 + (H + 2 * padding - HH) // stride
    W_out = 1 + (W + 2 * padding - WW) // stride
    outputs = np.zeros((F, H_out, W_out))
    # Convolve
    for i in range(H_out):
        for j in range(W_out):
            # Region of the padded input that corresponds to output position (i, j)
            inputs_slice = inputs_pad[:, i * stride:i * stride + HH, j * stride:j * stride + WW]
            # axis=(1, 2, 3) sums over the channel, height and width axes
            outputs[:, i, j] = np.sum(inputs_slice * kernels, axis=(1, 2, 3)) + bias
    return outputs
```
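3.3 RNN
Vanishing gradients: during backpropagation the gradients are multiplied together step after step; when many of the factors are smaller than 1, the gradients reaching the early steps become tiny, which makes long-term dependencies hard to learn.
Exploding gradients: when the gradients are large, the chain rule makes the repeated product blow up, and training becomes numerically unstable.
The inductive biases of an RNN are sequentiality and time invariance: time steps are related through their order in the sequence, and the same weights are shared across time steps.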
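A scalar caricature shows how quickly the repeated product shrinks or grows. The per-step factors 0.9 and 1.1 and the 100 steps are arbitrary illustrative values; a real RNN multiplies Jacobian matrices rather than scalars.

```python
def backprop_factor(per_step_gradient, n_steps):
    """Product of identical per-step gradient factors along the chain rule."""
    factor = 1.0
    for _ in range(n_steps):
        factor *= per_step_gradient
    return factor


vanishing = backprop_factor(0.9, 100)   # about 2.7e-5
exploding = backprop_factor(1.1, 100)   # about 1.4e4
```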
3.4 Transformer
encoder: embed + layer(self-attention, skip-connect, ln, ffn, skip-connect, ln) * 6
decoder: embed + layer(masked self-attention, skip-connect, ln, cross-attention, skip-connect, ln, ffn, skip-connect, ln) * 6
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
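With sequence length n and representation dimension d, computing QK^T costs O(n^2 d), and multiplying the softmax output by V also costs O(n^2 d).
KV cache: trades memory for time. Autoregressive decoding generates one token at a time, and the computation for the earlier tokens would otherwise be repeated at every step.
Multi Query Attention: MQA shares a single Key and Value projection across all heads; each head keeps only its own Query parameters, which greatly reduces the number of Key and Value parameters.
Group Query Attention: the query heads are divided into N groups, and each group shares one Key and Value projection.
Flash attention: exploits the non-uniform memory hierarchy of GPU hardware to save memory and speed up inference.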
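The K/V sharing in MQA and GQA can be sketched in NumPy by repeating each K/V head across its group of query heads. This is an illustrative sketch on already-projected tensors without masking; the function name and the shapes are assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def grouped_query_attention(q, k, v):
    """q: (n_q_heads, T, d); k, v: (n_kv_heads, T, d) with n_q_heads % n_kv_heads == 0."""
    n_q_heads, _, d = q.shape
    group_size = n_q_heads // k.shape[0]
    # Each K/V head is reused by `group_size` consecutive query heads
    k = np.repeat(k, group_size, axis=0)
    v = np.repeat(v, group_size, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores) @ v


rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4, 16))                                            # 8 query heads
k_gqa, v_gqa = rng.normal(size=(2, 4, 16)), rng.normal(size=(2, 4, 16))    # GQA: 2 groups
k_mqa, v_mqa = rng.normal(size=(1, 4, 16)), rng.normal(size=(1, 4, 16))    # MQA: 1 shared head

out_gqa = grouped_query_attention(q, k_gqa, v_gqa)
out_mqa = grouped_query_attention(q, k_mqa, v_mqa)
```

MQA is the special case with one K/V head, and standard multi-head attention is the case where every query head has its own K/V head.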
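The scaling factor sqrt(d_k) is called the temperature of the attention. If the dimension d of the input vectors is large, the inner products can become very large, so the attention scores become very large as well. This pushes the softmax toward a one-hot output where gradients vanish, which hurts both training and inference. Dividing by sqrt(d) brings the attention scores back into a reasonable range, so the softmax is more stable and training converges more easily.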
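According to these notes, Google's T5 uses Xavier initialization to mitigate vanishing gradients and therefore does not need the division by sqrt(d).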
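The effect of the scaling is visible in the spread of the raw scores. The sketch below assumes query and key entries with zero mean and unit variance, and uses d = 512 and 10,000 samples as arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
q = rng.normal(size=(10_000, d))
k = rng.normal(size=(10_000, d))

raw = (q * k).sum(axis=1)          # one dot product per row
scaled = raw / np.sqrt(d)

raw_std = raw.std()                # grows like sqrt(d), roughly 22.6 here
scaled_std = scaled.std()          # stays close to 1
```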
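Positional encoding vs. positional embedding
Learned: the position encoding is treated as trainable parameters. For example, with a maximum length of 512 and an encoding dimension of 768, a 512×768 matrix is initialized as the position vectors and updated during training. BERT, GPT and similar models use this kind of position encoding.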
$$\begin{aligned} PE_{(pos,2i)} &= \sin(pos/10000^{2i/d_{model}}) \\ PE_{(pos,2i+1)} &= \cos(pos/10000^{2i/d_{model}}) \end{aligned}$$
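The sinusoidal formulas can be computed in a few lines of NumPy. The function name is hypothetical and the sketch assumes an even d_model; the 512×768 size is borrowed from the learned-embedding example.

```python
import numpy as np


def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2), holds the values 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions
    return pe


pe = sinusoidal_positional_encoding(512, 768)
```

Unlike the learned matrix, this table has no parameters and can be generated for any length.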
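What "variable length" means: once the model is trained, a sample with a new sequence length can still be used as input. Within one batch, however, the sequences still have to be padded to the same length.
This only requires that the shapes of the parameter matrices do not depend on the input sequence length. A fully connected layer, for example, acts on the feature dimension and leaves the sequence dimension untouched; the same holds for attention.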
KV Cache
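The KV cache speeds up inference. Decoding generates one token at a time; if every decoding step re-ran the model on the input concatenated with all tokens decoded so far, a large amount of computation would be repeated.
Property of matrix multiplication: matrices can be partitioned into blocks. If matrix A is split by rows into two parts, A[:s] and A[s:], and each part is multiplied by matrix B, the two results can simply be concatenated to give AB.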
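The block property and the resulting cache can be checked on a single attention head. This is a simplified NumPy sketch with random projection matrices, no batching and no multi-head logic; all names and sizes are assumptions.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def causal_attention(q, k, v):
    """Full recomputation: each position attends to itself and to earlier positions."""
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(np.tril(np.ones((t, t), dtype=bool)), scores, -np.inf)
    return softmax(scores) @ v


rng = np.random.default_rng(0)

# Row blocks of A can be multiplied by B separately and concatenated
a, b, s = rng.normal(size=(5, 3)), rng.normal(size=(3, 4)), 2
blockwise = np.concatenate([a[:s] @ b, a[s:] @ b])

T, d = 6, 8
x = rng.normal(size=(T, d))
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
full = causal_attention(x @ wq, x @ wk, x @ wv)

# Incremental decoding: project only the newest token and append its K and V to the cache
k_cache, v_cache, steps = [], [], []
for t in range(T):
    q_t = x[t:t + 1] @ wq
    k_cache.append(x[t:t + 1] @ wk)
    v_cache.append(x[t:t + 1] @ wv)
    k_all, v_all = np.concatenate(k_cache), np.concatenate(v_cache)
    steps.append(softmax(q_t @ k_all.T / np.sqrt(d)) @ v_all)
incremental = np.concatenate(steps)
```

Each step projects one token instead of the whole prefix, at the cost of storing the cached K and V rows.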
```python
import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout


def scaled_dot_product(q, k, v, softmax, attention_mask, attention_dropout):
    outputs = tf.matmul(q, k, transpose_b=True)
    dk = tf.math.sqrt(tf.cast(q.shape[-1], dtype=tf.float32))
    outputs = outputs / dk
    # if attention_mask is not None:
    #     outputs = outputs + (1 - attention_mask) * -1e9
    outputs = softmax(outputs, mask=attention_mask)
    outputs = Dropout(rate=attention_dropout)(outputs)
    outputs = tf.matmul(outputs, v)  # shape: (m, Tx, depth), same shape as q, k, v
    return outputs


# Multi-head attention can be written in several ways: reshape to 4 dims (batch_size, -1, num_heads, d_k),
# reshape to 3 dims (batch * num_heads, -1, d_k), or loop over the heads as below
class FullAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_of_heads, dropout, d_out=None):
        super().__init__()
        self.d_model = d_model
        self.num_of_heads = num_of_heads
        self.dropout = dropout
        self.depth = d_model // num_of_heads
        # One projection per head; each head projects to depth // 2 dimensions
        self.wq = [Dense(self.depth // 2, use_bias=False) for i in range(num_of_heads)]
        self.wk = [Dense(self.depth // 2, use_bias=False) for i in range(num_of_heads)]
        self.wv = [Dense(self.depth // 2, use_bias=False) for i in range(num_of_heads)]
        self.wo = Dense(d_model if d_out is None else d_out, use_bias=False)
        self.softmax = tf.keras.layers.Softmax()

    def call(self, q, k, v, attention_mask=None, training=False):
        multi_attn = []
        for i in range(self.num_of_heads):
            Q = self.wq[i](q)
            K = self.wk[i](k)
            V = self.wv[i](v)
            multi_attn.append(scaled_dot_product(Q, K, V, self.softmax, attention_mask, self.dropout))
        # Concatenate the heads on the feature axis and project back with wo
        multi_attn = tf.concat(multi_attn, axis=-1)
        multi_head_attention = self.wo(multi_attn)
        return multi_head_attention
```
3.5 Regularization
batch normalization, residual learning, label smoothing
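```python
# Label smoothing turns hard labels into soft labels so that optimization is smoother. It is an effective regularizer:
# the soft labels are a weighted average of the hard labels and the uniform distribution, which reduces overfitting
# during training and can further improve classification performance. `targets` holds one-hot labels.
targets = (1 - label_smooth) * targets + label_smooth / num_classes
```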
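A small worked example of the smoothing formula, assuming one-hot targets over four classes and a smoothing factor of 0.1 (both values and the helper name are illustrative):

```python
import numpy as np


def smooth_labels(targets, label_smooth):
    """Blend one-hot targets with the uniform distribution over the classes."""
    num_classes = targets.shape[-1]
    return (1 - label_smooth) * targets + label_smooth / num_classes


hard = np.eye(4)[[2, 0]]            # one-hot labels for classes 2 and 0
soft = smooth_labels(hard, 0.1)     # true class -> 0.925, other classes -> 0.025
```

Each row still sums to 1, so the smoothed targets remain a valid distribution.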
3.6 Norm
Batch Norm
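BN was introduced to reduce "Internal Covariate Shift" and thereby speed up training. BN and ResNet have a similar effect: both make the loss landscape smoother (How Does Batch Normalization Help Optimization).
With a small batch, the BN statistics are not statistically meaningful, while a larger batch is limited by hardware. BN suits networks of fixed depth such as DNNs and CNNs; for RNN-like networks the sequence length varies from sample to sample, so the batch statistics are not well defined at every time step.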
Layer Norm
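Layer normalization helps produce embeddings that lie in a spherical space and follow a Gaussian distribution with zero mean and unit variance; batch normalization does not have this property.
Why not batch norm? BN is widely used in CV and normalizes the same feature across samples. Samples remain comparable with each other, but features are no longer comparable with one another. In NLP, comparing the same feature across samples is not what matters.
Because BN needs statistics computed across samples, distributed training requires sync BatchNorm; Layer Norm does not.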
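The difference between the two comes down to the axis over which the statistics are taken. The NumPy sketch below omits the learnable scale and shift and the running averages; function names and the input shape are assumptions.

```python
import numpy as np


def batch_norm(x, eps=1e-5):
    # Statistics per feature, computed across the batch axis
    return (x - x.mean(axis=0, keepdims=True)) / np.sqrt(x.var(axis=0, keepdims=True) + eps)


def layer_norm(x, eps=1e-5):
    # Statistics per sample, computed across the feature axis
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 16))    # (batch, features)
bn, ln = batch_norm(x), layer_norm(x)

# Layer norm of one sample does not depend on the rest of the batch
ln_single = layer_norm(x[:1])
```

This per-sample independence is why Layer Norm needs no cross-device synchronization.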
```python
# layer norm: https://www.kaggle.com/code/cpmpml/graph-transfomer?scriptVersionId=24171638&cellId=18
# Excerpt from a Keras layer's call(); K is the Keras backend (from tensorflow.keras import backend as K)
mean = K.mean(inputs, axis=-1, keepdims=True)
variance = K.mean(K.square(inputs - mean), axis=-1, keepdims=True)
std = K.sqrt(variance + self.epsilon)
outputs = (inputs - mean) / std
if self.scale:
    outputs *= self.gamma
if self.center:
    outputs += self.beta
```
```python
import tensorflow as tf


def GroupNorm(x, gamma, beta, G, eps=1e-5):
    # x: input features with shape [N, C, H, W]
    # gamma, beta: scale and offset, with shape [1, C, 1, 1]
    # G: number of groups for GN
    N, C, H, W = x.shape
    x = tf.reshape(x, [N, G, C // G, H, W])
    # Mean and variance per sample and per group
    mean, var = tf.nn.moments(x, [2, 3, 4], keepdims=True)
    x = (x - mean) / tf.sqrt(var + eps)
    x = tf.reshape(x, [N, C, H, W])
    return x * gamma + beta
```
RMSNorm - Root Mean Square Layer Normalization
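A minimal NumPy sketch of the idea: RMSNorm rescales by the root mean square of the features and, unlike Layer Norm, does not subtract the mean. The function name, the `eps` value, and the sample input are assumptions.

```python
import numpy as np


def rms_norm(x, gain, eps=1e-6):
    """Divide by the root mean square over the last axis, then apply a learnable gain."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gain


x = np.array([[3.0, 4.0], [1.0, -1.0]])
out = rms_norm(x, gain=np.ones(2))
```

After normalization each row has a mean square of approximately 1.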
3.7 pool
```python
import numpy as np


def get_pools(img: np.ndarray, pool_size: int, stride: int) -> np.ndarray:
    pools = []
    # Iterate over all row blocks (single block has `stride` rows)
    for i in range(0, img.shape[0], stride):
        # Iterate over all column blocks (single block has `stride` columns)
        for j in range(0, img.shape[1], stride):
            # Extract the current pool
            mat = img[i:i + pool_size, j:j + pool_size]
            # Keep only complete windows whose shape equals the pool size
            if mat.shape == (pool_size, pool_size):
                pools.append(mat)
    return np.array(pools)


def max_pooling(pools: np.ndarray) -> np.ndarray:
    num_pools = pools.shape[0]  # Total number of pools
    # Shape of the matrix after pooling: square root of the number of pools (assumes a square image)
    tgt_shape = (int(np.sqrt(num_pools)), int(np.sqrt(num_pools)))
    pooled = []
    for pool in pools:
        pooled.append(np.max(pool))
    return np.array(pooled).reshape(tgt_shape)
```
3.8 dropout