Chapter 2: Reinforcement Learning
Take a whirlwind tour through reinforcement learning (RL), starting from tabular learning and Atari, and ending with some of the cutting-edge techniques used in current LLM post-training, such as RLHF.
Sections
2.1 Intro to RL
RL fundamentals: MDPs, policies, value functions, and multi-armed bandits.
2.2 DQN & VPG
Implement DQN and Vanilla Policy Gradient for CartPole and beyond.
2.3 PPO
Build a PPO agent from scratch and train it to master CartPole.
2.4 RLHF
Implement RLHF end-to-end, applying PPO to language model finetuning.