Exercise Status: All exercises complete and verified

[2.1] - Intro to RL

Colab: exercises | solutions

Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material.

If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme.

Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.

Introduction

This section is designed to bring you up to speed with the basics of reinforcement learning. Before we cover big topics like DQN, VPG, PPO and RLHF, we need to build a strong theoretical foundation.

For now, we will assume the environment has a finite set of states and actions, and we explore two cases:

1. The dynamics of the environment are directly accessible to the agent, so we can solve for an optimal policy analytically.
2. The environment is a black box that we can sample from, and the agent must learn from play how the environment works.
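
To make the first case concrete, a finite MDP with known dynamics can be stored directly as arrays. Below is a minimal sketch using a made-up 2-state, 2-action MDP (the arrays `T` and `R` are invented for illustration; they are not the environments used in the exercises):

```python
import numpy as np

# Hypothetical toy MDP with 2 states and 2 actions, dynamics fully known.
# T[s, a, s'] = probability of landing in state s' after taking action a in state s.
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # transitions from state 0
    [[0.0, 1.0], [0.5, 0.5]],  # transitions from state 1
])
# R[s, a, s'] = reward received for that transition.
R = np.array([
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 0.0], [1.0, 1.0]],
])

assert np.allclose(T.sum(axis=-1), 1.0)  # each (s, a) slice is a distribution over s'

# Expected immediate reward for each (state, action) pair.
expected_reward = (T * R).sum(axis=-1)  # shape (n_states, n_actions)
```

Since the agent can read `T` and `R` directly, no interaction with the environment is needed to reason about it; this is what makes the "planning" setting analytically tractable.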

Tomorrow, we weaken the assumption that there are few enough states we could visit them all, and instead build something with deep learning that can generalize to never-before-seen states.

This day's material will be more theory-heavy than later days, as the goal is to understand the fundamentals of RL before we apply deep learning to it.

For a lecture on the material today, which provides some high-level understanding before you dive into the material, watch the video below:

Content & Learning Objectives

1️⃣ Planning

Learning Objectives
  • Understand Markov Decision Processes and the goal of the agent, for known environments.
  • Understand the Bellman equation.
  • Understand the policy improvement theorem, and how we can use it to iteratively solve for an optimal policy.
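
As a preview of how these pieces fit together, here is a sketch of policy iteration on a made-up 2-state, 2-action MDP (the arrays `T` and `R` and the variable names are invented for illustration, not the exercise interface): evaluate the current policy exactly via the Bellman equation, then improve it greedily, repeating until the policy stops changing.

```python
import numpy as np

# Hypothetical known-dynamics MDP: T[s, a, s'] transition probs, R[s, a, s'] rewards.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 0.0], [1.0, 1.0]]])
gamma = 0.9
n_states = 2

pi = np.zeros(n_states, dtype=int)  # start from an arbitrary deterministic policy
while True:
    # Policy evaluation: solve the Bellman equation V = r_pi + gamma * P_pi @ V exactly.
    P_pi = T[np.arange(n_states), pi]                 # (s, s') transitions under pi
    r_pi = (T * R).sum(-1)[np.arange(n_states), pi]   # expected reward under pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    # Policy improvement: act greedily with respect to Q(s, a).
    Q = (T * (R + gamma * V)).sum(-1)
    pi_new = Q.argmax(axis=-1)
    if (pi_new == pi).all():  # policy improvement theorem: no change => optimal
        break
    pi = pi_new
```

Because each improvement step never decreases the value of any state, and there are finitely many deterministic policies, this loop is guaranteed to terminate at an optimal policy.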

2️⃣ Learning

  • SARSA and Q-learning are methods for learning in Markovian domains by experiencing the consequences of actions and learning from them, without peeking directly at how the environment works. The agent must learn the dynamics of the environment solely from experience.

  • We can do short updates and bootstrap the rest, or we can do long updates, but suffer higher variance. Eligibility traces allow us to get the best of both worlds in an efficient fashion.
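
To preview the shape of the one-step ("short update") case, here is a sketch of the SARSA and Q-learning update rules on a tabular `Q` array. The function names and signatures are invented for illustration; the exercises define their own interfaces:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy one-step TD: bootstrap from the action the agent actually takes next."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy one-step TD: bootstrap from the greedy next action, whatever the agent does."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
```

The only difference is the bootstrap term in the TD target: SARSA uses the next action sampled from the behaviour policy, while Q-learning uses the max over next actions.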

Learning Objectives
  • Understand the anatomy of a gym.Env, so that you feel comfortable using them and writing your own.
  • Understand the ideas behind temporal difference methods such as SARSA and Q-learning.
  • Implement SARSA and Q-Learning, and compare them on different environments.
  • Understand the TD($\lambda$) algorithm, and how it can be used to mix over short and long timescale updates.

3️⃣ (Optional) Multi-Armed Bandit

Multi-armed bandits are a different flavour of environment where there are no states, but the environmental distribution itself may drift over time. Here, the emphasis is on strategies to trade off exploration vs. exploitation, and methods from bandits (such as UCB) are used elsewhere (e.g. Monte Carlo Tree Search, or the Hyperband algorithm for hyperparameter search).

This used to be core material, but as none of the later material requires understanding bandits, we've moved it to bonus optional material.

Learning Objectives
  • Understand the difficulty of optimal exploration.
  • Compare Upper Confidence Bound (UCB) methods to other exploration methods.
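
To give a flavour of what UCB looks like, here is a minimal sketch of the UCB1 rule on a made-up stationary Bernoulli bandit (the arm probabilities and variable names are invented for illustration; the exercises use their own environment):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.9])  # hypothetical arm reward probabilities
n_arms = len(true_means)
counts = np.zeros(n_arms)  # number of pulls per arm
totals = np.zeros(n_arms)  # total reward per arm

for t in range(1, 2001):
    if t <= n_arms:
        arm = t - 1  # pull each arm once to initialise the estimates
    else:
        means = totals / counts
        bonus = np.sqrt(2 * np.log(t) / counts)  # optimism bonus shrinks as counts grow
        arm = int(np.argmax(means + bonus))
    reward = rng.random() < true_means[arm]  # Bernoulli reward
    counts[arm] += 1
    totals[arm] += reward
```

The bonus term makes rarely-pulled arms look attractive, so the agent keeps exploring just enough to avoid locking onto a suboptimal arm, while concentrating most pulls on the empirically best one.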

Setup code

import os
import random
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import TypeAlias

import einops
import gymnasium as gym
import matplotlib.pyplot as plt
import numpy as np
from tqdm import tqdm

Arr: TypeAlias = np.ndarray

max_episode_steps = 1000
N_RUNS = 200

# Make sure exercises are in the path
chapter = "chapter2_rl"
section = "part1_intro_to_rl"
root_dir = next(p for p in Path.cwd().parents if (p / chapter).exists())
exercises_dir = root_dir / chapter / "exercises"
section_dir = exercises_dir / section
if str(exercises_dir) not in sys.path:
    sys.path.append(str(exercises_dir))

import part1_intro_to_rl.tests as tests
import part1_intro_to_rl.utils as utils
from plotly_utils import cliffwalk_imshow, imshow, line