Reinforcement Learning (RL)

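Markov decision process (MDP): the framework for sequential decision problems, defined by states, actions, rewards, and the transition probabilities between states
Goal: maximize expected cumulative reward. The agent needs a policy that chooses better actions, learned either by estimating the state transition probabilities (model-based) or directly from sampled experience (model-free)
Balancing exploration and exploitation
Common algorithms: Q-learning, DQN, TRPO, PPO, Soft Actor-Critic (SAC)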
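
The exploration/exploitation trade-off mentioned above is often handled with an epsilon-greedy rule. The following is a minimal sketch, not a method taken from these notes; the function name `epsilon_greedy` and its parameters are illustrative assumptions.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index: explore with probability epsilon, else exploit.

    q_values: list of estimated action values for the current state.
    (Hypothetical helper; the name and signature are assumptions.)
    """
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest estimate (first index on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0.0` the rule is purely greedy; with `epsilon=1.0` it is purely random. In practice epsilon is often decayed over training, though the schedule is a design choice.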

Theory

Bellman optimality equation
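
For reference, the Bellman optimality equation for state values is usually written as $V^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\,[R(s, a, s') + \gamma V^*(s')]$. When the model is known, value iteration applies this backup repeatedly until the values stop changing. The sketch below assumes a tabular MDP stored as `P[s][a] = [(prob, next_state, reward), ...]`; that layout and the two-state example are hypothetical.

```python
def value_iteration(P, gamma=0.9, tol=1e-8):
    """Value iteration on a tabular MDP.

    P[s][a] is a list of (prob, next_state, reward) tuples.
    (This data layout is an assumption made for illustration.)
    """
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: best expected one-step return.
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# Hypothetical two-state MDP: staying in s1 pays reward 1 on every step,
# and s0 can either stay (reward 0) or move to s1 (reward 0).
P = {
    "s0": {"stay": [(1.0, "s0", 0.0)], "go": [(1.0, "s1", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 1.0)]},
}
V = value_iteration(P)
```

In this example the fixed point is V(s1) = 1 / (1 - 0.9) = 10 and V(s0) = 0.9 * 10 = 9, up to the tolerance.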

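Types

model-based
model-free
- Value-based: learn a value for each state-action pair; the value reflects the expected reward
- Policy-based: learn a distribution over actions from the state, and act according to that distribution
- Actor-Critic: the actor learns a distribution over actions from the state; the critic learns a value estimate from the action taken and the resulting state, and uses it to evaluate the actor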
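
The three model-free families above differ mainly in how an action is chosen and scored. The sketch below contrasts them in plain Python; all function names are hypothetical, and the one-step TD error is only one of several ways a critic can score the actor.

```python
import math
import random

def act_value_based(q_values):
    """Value-based: act greedily with respect to the learned Q(s, a)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def act_policy_based(logits, rng=random):
    """Policy-based: sample an action from the distribution pi(a | s)."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def td_advantage(reward, value_s, value_next, gamma=0.99):
    """Actor-critic: the critic's one-step TD error scores the actor's action."""
    return reward + gamma * value_next - value_s
```

A positive `td_advantage` suggests the action did better than the critic expected, so the actor would raise its probability; a negative one suggests the opposite.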

DQN
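
The notes leave this heading empty. As a rough sketch of the core idea, DQN regresses Q(s, a) toward a TD target computed from a separate target network. The helpers below work on plain lists and assume the target network's outputs for the next states are already available as `next_q_values`; the names are illustrative, not from any library.

```python
def dqn_targets(rewards, next_q_values, dones, gamma=0.99):
    """TD targets y = r + gamma * max_a' Q_target(s', a').

    next_q_values[i] holds the target network's Q-values for next state i
    (assumed to be precomputed). Terminal transitions do not bootstrap.
    """
    targets = []
    for r, q_next, done in zip(rewards, next_q_values, dones):
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(r + bootstrap)
    return targets

def squared_td_loss(q_taken, targets):
    """Mean squared error between Q(s, a) for the taken actions and the targets."""
    return sum((q - y) ** 2 for q, y in zip(q_taken, targets)) / len(targets)
```

The `max` over next-state actions is why DQN is limited to discrete action spaces: the maximum has to be enumerable.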

Policy gradient
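
As a reminder of what this heading refers to, the basic (REINFORCE-style) policy gradient can be written as below. The notation is an assumption, since the notes do not fix one: $\pi_\theta$ is the policy, $\tau$ a sampled trajectory, and $G_t$ the discounted return from step $t$.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t
    \right],
\qquad
G_t = \sum_{k=t}^{T} \gamma^{\,k-t}\, r_k
```

Replacing $G_t$ with an advantage estimate from a learned critic gives the actor-critic form.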

PPO (Proximal Policy Optimization)

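RLHF (reward model + PPO) is an online method: it samples responses from the current policy during training. DPO is an offline method: it trains on a fixed preference dataset.
Lineage: policy gradient -> actor-critic -> PPO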

PPO trains two networks: an actor (the policy) and a critic (the value function).
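
The "proximal" part of PPO usually refers to the clipped surrogate objective, which limits how far one update can move the actor away from the policy that collected the data. Below is a minimal sketch over plain lists of per-sample log-probabilities and advantages; the function name and the default `clip_eps=0.2` are assumptions.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized) over a batch."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum removes the incentive to push the ratio
        # outside [1 - eps, 1 + eps] in the direction that would help.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

The critic is typically trained alongside with a regression loss on returns; that part is omitted here.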

DPO
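
The notes leave this heading empty. As a rough sketch, the DPO loss for a single preference pair is the negative log-sigmoid of the difference in implicit rewards between the chosen and rejected responses. The inputs below are assumed to be summed sequence log-probabilities under the policy and the reference model; the function name and `beta=0.1` are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # Implicit rewards: how far the policy has moved from the reference model.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written as softplus(-margin) for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Only the margin enters the loss. For example, `dpo_loss(-5.0, -9.0, -2.0, -2.0)` is below the zero-margin value `log(2)` even though both responses are less likely than under the reference, because the rejected one dropped further.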

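Q&A

What is the difference between on-policy and off-policy?
Which algorithms are on-policy? How does the SARSA update differ from the Q-learning update, and why does SARSA have no max?
Explain DQN (discrete actions) and DQNN (continuous actions). Have you implemented them by hand?
DPO (off-policy): why do the probabilities of both the positive and the negative training responses decrease during training?
- This is related to how the data is sampled and to the form of the DPO loss. The Bradley-Terry (BT) loss only maximizes the gap between the positive and the negative in the training set, so both probabilities can fall as long as the gap widens
RLHF & DPO
- DPO excludes some practical aspects of the RLHF method, e.g. pretraining gradients.
- The theoretical arguments for DPO's equivalence rest on assumptions that don't necessarily hold in practice.
- RLHF gives you a reusable reward model, which has practical uses and advantages. DPO produces no useful intermediate product.
- DPO works only from preferences, whereas desirable RL objectives can take many forms.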
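
For the SARSA versus Q-learning question above, the two tabular updates can be put side by side. This is a sketch with a hypothetical Q-table stored as a dict of dicts; the function names and default step sizes are assumptions.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the greedy action in s_next (the max)."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from a_next, the action the agent actually takes next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

# Hypothetical Q-table with two states and two actions each.
Q = {"s0": {"l": 0.0, "r": 0.0}, "s1": {"l": 1.0, "r": 3.0}}
q_learning_update(Q, "s0", "l", 1.0, "s1", alpha=0.5, gamma=0.5)
sarsa_update(Q, "s0", "r", 1.0, "s1", "l", alpha=0.5, gamma=0.5)
```

SARSA has no max because it evaluates the policy it is following, exploration included, so it uses the sampled next action. Q-learning evaluates the greedy policy regardless of how the data was collected, which is what makes it off-policy.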

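# RLHF pseudocode

for prompts in dataloader:

    # Stage 1: generate responses with the current actor (policy)
    batch = actor.generate_sequences(prompts)

    # Stage 2: prepare training data
    batch = critic.compute_values(batch)        # value estimates from the critic
    batch = reference.compute_log_prob(batch)   # log-probs under the reference model
    batch = reward.compute_reward(batch)        # scores from the reward model
    batch = compute_advantages(batch)           # advantages from rewards and values

    # Stage 3: train the actor and critic
    critic_metrics = critic.update_critic(batch)
    actor_metrics = actor.update_actor(batch)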
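
The pseudocode does not say how `compute_advantages` works. One common choice with PPO is Generalized Advantage Estimation (GAE), sketched below for a single trajectory. This is an assumption about what such a step might do, not the implementation behind the pseudocode, and `compute_gae` with its parameters is a hypothetical name.

```python
def compute_gae(rewards, values, last_value=0.0, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards[t] and values[t] are per-step; last_value is V(s_T), used to
    bootstrap after the final step (0.0 if the episode terminated).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        gae = delta + gamma * lam * gae  # discounted sum of later TD errors
        advantages[t] = gae
        next_value = values[t]
    # Returns (advantage + value) serve as regression targets for the critic.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

Setting `lam=0.0` reduces this to one-step TD advantages, while `lam=1.0` gives Monte Carlo returns minus the value baseline; values in between trade bias against variance.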

reference

https://cs.uwaterloo.ca/~ppoupart/teaching/cs885-fall21/schedule.html


Last updated 13 days ago
