2️⃣ Learning
Learning Objectives
- Understand the ideas behind temporal difference methods such as SARSA and Q-learning.
- Implement SARSA and Q-Learning, and compare them on different environments.
- Understand the TD($\lambda$) algorithm, and how it can be used to mix over short and long timescale updates.
Now, we deal with situations where the environment is a black-box, and the agent must learn the rules of the world via interaction with it. This is different from everything else we've done so far, e.g. in the previous section we could calculate optimal policies by using the tensors $R$ and $T$, which we will now assume the agent doesn't have direct knowledge of.
We call algorithms which have access to the transition probability distribution and reward function planning algorithms, as they don't need to learn the dynamics of the environment but can directly plan what to do (like a chess-playing algorithm that does tree search). Q-learning is a model-free algorithm: it neither has access to a model of the world nor constructs one internally; it learns only which actions are good in which states. From the original paper introducing Q-learning:
[Q-learning] provides agents with the capability of learning to act optimally in Markovian domains by experiencing the consequences of actions, without requiring them to build maps of the domains.
The "Q" part of Q-learning refers to the function $Q_\pi$ which we encountered yesterday - the expected discounted sum of rewards for an action $a$ taken in a particular state $s$, based on some policy $\pi$.
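Written out (using the discount factor $\gamma$ from yesterday's material), this is:

$$
Q_\pi(s, a) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\Big|\, s_0 = s, \, a_0 = a \right]
$$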
Tabular Methods
The methods used here are called tabular methods, because they create a lookup table from (state, action) to the Q value. This is pure memorization, and if our reward function was sampled from the space of all functions, this is usually the best you can do because there's no structure that you can exploit to do better.
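For concreteness, a tabular Q-function is nothing more than an array indexed by (state, action) pairs. Here's a minimal sketch (the state and action counts are made up for illustration):

```python
import numpy as np

# A tabular Q-function: one independently-stored value per (state, action) pair.
num_states, num_actions = 4, 2  # hypothetical sizes for illustration
Q = np.zeros((num_states, num_actions))

Q[3, 1] = 1.0                # "memorize" that action 1 in state 3 is good
best_action = Q[3].argmax()  # greedy lookup for state 3
```

Nothing learned about one (state, action) pair tells us anything about any other pair, which is exactly why pure memorization fails to scale.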
We can hope to do better on most "natural" reward functions that do have structure. For example in a game of poker, there is structure in both of the actions (betting £100 will have a similar reward to betting £99 or £101), and between states (having a pair of threes in your hand is similar to having a pair of twos or fours). We need to take advantage of this, otherwise there are just too many states and actions to have any hope of training an agent.
One idea is to use domain knowledge to hand-code a function that "buckets" states or actions into a smaller number of equivalence classes and use those as the states and actions in a smaller version of the problem (see Sutton and Barto, Section 9.5). This was one component in the RL agent Libratus: The Superhuman AI for No-Limit Poker. The details are beyond the scope of today's material, but I found them fascinating.
If you don't have domain knowledge to leverage, or you care specifically about making your algorithm "general", we will see how to deal with this later: we offload the table to a neural network, and it will learn for us a suitable state aggregation method: generalizing to never-before-seen states based on ones it has already seen to work out how to act. The neural network is then trained on the experience of interacting with the environment.
Optional Readings
The material provided together with the lecture covers everything you'll need, but here are some optional readings if you want to go deeper.
- Sutton and Barto
- Chapter 6, Section 6.1, 6.3 (Especially Example 6.4)
- Note that Section 6.1 talks about temporal difference (TD) updates for the value function $V$. We will instead be using TD updates for the Q-value $Q$.
- Don't worry about the references to Monte Carlo in Chapter 5.
- Q-Learning - the original paper where Q-learning was first described.
Again, we'll be using NumPy for this section, and we'll start off with our gridworld environment from yesterday:

but this time we'll use it within the gym framework.
Intro to OpenAI Gymnasium
Today, we'll be using gymnasium (pinned to version 0.29.0), an actively maintained fork of OpenAI's no longer maintained gym library. This provides a uniform interface to many different RL environments including Atari games. The original gym was released in 2016 and details of the API have changed significantly over the years. If you're looking for source code, make sure you visit https://github.com/Farama-Foundation/Gymnasium/tree/v0.29.0 to get the version that corresponds to the code we're using.
Note - gymnasium vs gym
gymnasium is the new standard for RL environments, and we'll be using it for the rest of this chapter. The rest of the material will just discuss gymnasium directly, but we've included here a list of the important differences between gym==0.23.1 (the version previously used in the ARENA exercises) and gymnasium==0.29.0, just in case you want to transfer your old code over (full details can be found here). There were also some significant changes made between gymnasium versions 0.29.0 and the current 1.0.0, but we aren't working with these because they don't offer any major benefits for our purposes (and in particular they handle things like terminal observations and video recording in a pretty annoying way).
5 objects returned in env.step, not 4. In version 0.23.1 of the original gym library, the env.step method returned 4 objects: observation, reward, done, info. Implicitly, done was actually defined as terminated or truncated, where truncated is true if the episode has timed out and terminated is true if the episode ends for non-time related reasons (e.g. the agent has fallen off a cliff). In gymnasium (or more recent versions of gym), we split this into 2 variables, i.e. env.step returns the 5-tuple of observation, reward, terminated, truncated, info.
2 objects returned in env.reset, not 1. In version 0.23.1 of the original gym library, the env.reset method returned a single object: obs (the observation). In gymnasium, it returns 2 objects: observation and info (the latter of which is the same as the info object returned by the step method conventionally).
The infos object is different in vectorized environments. Previously it was a list of dicts (one per env), now it's a dict of lists. We make the following replacements:

infos[env_idx]["episode"]["l"] -> infos["final_info"][env_idx]["episode"]["l"]
infos[env_idx]["terminal_observation"] -> infos["final_observation"][env_idx]

Render mode. You need to create video-based environments by passing render_mode="rgb_array" into gym.make. When you do this, you don't also need to pass mode="rgb_array" to the render method.

Handling terminal observations. Previously the next_obs returned from envs.step would be the reset observation of the new episode if we terminated, and we'd have to go into the infos dictionary to see the true terminal observation. Now, next_obs is the true terminal observation (i.e. the thing that's out of bounds), and when we terminate there's actually an _autoreset_envs flag which is set to true, and makes sure our environment is reset when we take the next step. So in those cases, we need to be careful that the starting obs we use is actually the new observation, rather than the terminal observation.

Random seeding. The Env.seed() method no longer exists, and has been replaced with Env.reset(seed=seed).
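To make the 5-tuple API concrete, here's a minimal interaction loop. We use a toy stand-in object (not a real gymnasium env, just something with the same method signatures) so the sketch is self-contained:

```python
class ToyFiveTuple:
    """A stand-in with the gymnasium-style API: reset -> (obs, info), step -> 5-tuple."""

    def reset(self, seed=None):
        self.t = 0
        return 0, {}  # (observation, info)

    def step(self, action):
        self.t += 1
        obs, reward = self.t, 1.0
        terminated = self.t >= 3  # episode ends for environment-related reasons
        truncated = False         # would be set to True by a TimeLimit wrapper
        return obs, reward, terminated, truncated, {}


env = ToyFiveTuple()
obs, info = env.reset(seed=0)  # note: two return values, not one
done = False
total = 0.0
while not done:
    obs, reward, terminated, truncated, info = env.step(0)
    done = terminated or truncated
    total += reward
```

The `done = terminated or truncated` line recovers the old single-flag behaviour when you don't care about the distinction.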
We're using NumPy for this section. PyTorch provides GPU support and autograd, neither of which is of any use to environment code or the code we're writing today.
The step method
The environment's step method takes the action selected by the agent and returns five values: obs, reward, terminated, truncated, and the info dictionary.
obs and reward are the next observation and reward that the agent receives based on the action chosen.
terminated indicates if the environment has entered a terminal state and ended, while truncated indicates the episode was cut short (e.g. by a time limit). Here, both the goal-states (+1 and -1) are terminal. Early termination is equivalent to an infinite trajectory where the agent remains trapped for all future states, and always receives reward zero.
info can contain anything extra that doesn't fit into the uniform interface - it's up to the environment what to put into it. A good use of this is for debugging information that the agent isn't "supposed" to see, like the dynamics of the environment. Agents that cheat and peek at info are helpful because we know that they should obtain the maximum possible rewards; if they don't, then there's a bug. We will throw the entire underlying environment into info, from which an agent could cheat by peeking at the values for T and R.
Terminology note - observation vs. state
We use the word observation here because some environments are partially observable: the agent receives not an exact description of the state it is in, but merely an observation giving a partial description (for our gridworld, it could be a description of which cells directly adjacent to the agent are free to move into, rather than precisely which state it is in). In that case, the agent would be unable to distinguish the cell north of the wall from the cell south of the wall. Returning the state as the observation is a special case, and we will often refer to one or the other as required.
The render method
Render is only used for debugging or entertainment, and what it does is up to the environment. It might render a little popup window showing the Atari game, or it might give you a RGB image tensor, or just some ASCII text describing what's happening.
For our gridworld environment, it will print a small representation of the current policy by showing what action it chooses for each state.
Observation and Action Types
A gym.Env is a generic type: both the type of the observations and the type of the actions depends on the specifics of the environment. Right now, we're only dealing with the simplest case: a discrete set of actions which are the same in every state. In general, the actions could be continuous, or depend on the state. We're also just dealing with a single discrete observation here, although later we'll be taking arrays as our observations.
Below, we define a class that allows us to use our old environment definition from before, and wrap it in a gym.Env instance so we can learn from experience instead. Read the code below carefully and make sure you understand how the Gym environment API works.
ObsType: TypeAlias = int | np.ndarray
ActType: TypeAlias = int

class DiscreteEnviroGym(gym.Env):
    """
    A discrete environment class for reinforcement learning, compatible with OpenAI Gym.

    This class represents a discrete environment where actions and observations are discrete.
    It is designed to interface with a provided `Environment` object which defines the
    underlying dynamics, states, and actions.

    Attributes:
        action_space (gym.spaces.Discrete): The space of possible actions.
        observation_space (gym.spaces.Discrete): The space of possible observations (states).
        env (Environment): The underlying environment with its own dynamics and properties.
    """

    action_space: gym.spaces.Discrete
    observation_space: gym.spaces.Discrete

    def __init__(self, env: Environment):
        super().__init__()
        self.env = env
        self.observation_space = gym.spaces.Discrete(env.num_states)
        self.action_space = gym.spaces.Discrete(env.num_actions)
        self.reset()

    def step(self, action: ActType) -> tuple[ObsType, float, bool, bool, dict]:
        """
        Execute an action and return the new state, reward, done flag, and additional info.
        The behaviour of this function depends primarily on the dynamics of the underlying
        environment.
        """
        states, rewards, probs = self.env.dynamics(self.pos, action)
        idx = self.np_random.choice(len(states), p=probs)
        new_state, reward = states[idx], rewards[idx]
        self.pos = new_state
        terminated = self.pos in self.env.terminal
        truncated = False
        info = {"env": self.env}
        return new_state, reward, terminated, truncated, info

    def reset(self, seed: int | None = None, options=None) -> tuple[ObsType, dict]:
        """
        Resets the environment to its initial state.
        """
        super().reset(seed=seed)
        self.pos = self.env.start
        return self.pos, {}

    def render(self, mode="human"):
        assert mode == "human", f"Mode {mode} not supported!"
For gym, we also need to register an environment to use it, and we can also wrap environments so that each episode of interaction is limited to a fixed number of steps.
Registering an Environment
Usually we won't use a class like DiscreteEnviroGym directly, because we'll want to wrap it in classes that add various other features that help it work with the rest of the gymnasium library. The register function does this for us: it stores information about our Env in a registry so that a later call to gym.make can look it up using the id string that is passed in, and then create an instance of our class with a bunch of extra wrapper classes.
By convention, the id strings have a suffix with a version number. There can be multiple versions of the "same" environment with different parameters, and benchmarks should always report the version number for a fair comparison. For instance, id="NorvigGrid-v0" below.
TimeLimit Wrapper
One important wrapper class is the TimeLimit wrapper - this is used to ensure that the environment terminates after a fixed number of steps. It works by having the truncated flag returned by the step method be overridden after max_episode_steps, and set to True. You can see from looking at the step method above that without this, the agent might keep playing forever, because it always returns terminated=truncated=False. The time limit is also an essential part of the problem definition: if it were larger or shorter, there would be more or less time to explore, which means that different algorithms (or at least different hyperparameters) would then have improved performance.
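To illustrate the mechanism, here's a simplified stand-in for what the wrapper does (not gymnasium's actual implementation, just a sketch of the idea, with a made-up never-ending inner environment):

```python
class TimeLimitSketch:
    """Simplified sketch of a TimeLimit wrapper: force truncation after N steps."""

    def __init__(self, env, max_episode_steps):
        self.env = env
        self.max_episode_steps = max_episode_steps

    def reset(self, **kwargs):
        self.elapsed = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.elapsed += 1
        if self.elapsed >= self.max_episode_steps:
            truncated = True  # cut the episode short, without marking it terminal
        return obs, reward, terminated, truncated, info


class NeverEnding:
    """A toy environment that would run forever without a time limit."""

    def reset(self, **kwargs):
        return 0, {}

    def step(self, action):
        return 0, 0.0, False, False, {}


env = TimeLimitSketch(NeverEnding(), max_episode_steps=100)
obs, info = env.reset()
done, steps = False, 0
while not done:
    obs, reward, terminated, truncated, info = env.step(0)
    done = terminated or truncated
    steps += 1
# the loop exits after exactly max_episode_steps steps, with truncated=True
```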
gym.envs.registration.register(
    id="NorvigGrid-v0",
    entry_point=DiscreteEnviroGym,
    max_episode_steps=100,
    nondeterministic=True,
    kwargs={"env": Norvig(penalty=-0.04)},
)

gym.envs.registration.register(
    id="ToyGym-v0",
    entry_point=DiscreteEnviroGym,
    max_episode_steps=3,  # use 3 not 2, because of 1-indexing
    nondeterministic=False,
    kwargs={"env": Toy()},
)
Agent class
Now, you'll start defining agents to act within this environment.
We define a dataclass Experience, to store experiences, defined as $e_t = (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$. In the literature, this is often called a percept.
You'll be defining your various agents as subclasses of the base class Agent. It contains the following methods/attributes:
- `AgentConfig`, which allows us to more easily package up the arguments we use to initialize our agent.
- `get_action`, which returns the agent's next action, given the current observation.
  - This isn't implemented for `Agent`, because we'll implement it separately for each agent we build on top of the base agent.
- `observe`, which determines how the model updates the policy based on receiving experience.
  - The default behaviour is to do nothing (i.e. never learn from experience).
- `reset`, which is run before each episode starts.
  - All that the base method does is reset the random number generator, but future agents might need to do more, e.g. reset data gathered from the `observe` method.
- `run_episode`, which runs a single episode of interaction between an agent and the environment.
- `train`, which runs `run_episode` multiple times in series, and collects the sum of rewards for each, so we can observe improvements in agent performance.
We've also given you the Random subclass, which picks an action at random using the random number generator provided by gym. This is useful as a baseline to ensure the environment has no bugs. If your later agents are doing worse than random, you have a bug!
@dataclass
class Experience:
    """
    A class for storing one piece of experience during an episode run.
    """
    obs: ObsType
    act: ActType
    reward: float
    new_obs: ObsType
    new_act: ActType | None = None


@dataclass
class AgentConfig:
    """Hyperparameters for agents"""
    epsilon: float = 0.1
    lr: float = 0.05
    optimism: float = 0


defaultConfig = AgentConfig()


class Agent:
    """
    Base class for agents interacting with an environment.
    You do not need to add any implementation here.
    """

    rng: np.random.Generator

    def __init__(
        self,
        env: DiscreteEnviroGym,
        config: AgentConfig = defaultConfig,
        gamma: float = 0.99,
        seed: int = 0,
    ):
        self.env = env
        self.reset(seed)
        self.config = config
        self.gamma = gamma
        self.num_actions = env.action_space.n
        self.num_states = env.observation_space.n
        self.name = type(self).__name__

    def get_action(self, obs: ObsType) -> ActType:
        raise NotImplementedError()

    def observe(self, exp: Experience) -> None:
        """
        Agent observes experience, and updates model as appropriate.
        Implementation depends on type of agent.
        """
        pass

    def reset(self, seed: int) -> tuple[ObsType, dict]:
        self.rng = np.random.default_rng(seed)
        return None, {}

    def run_episode(self, seed) -> list[int]:
        """
        Simulates one episode of interaction, agent learns as appropriate

        Inputs:
            seed : Seed for the random number generator

        Returns:
            The rewards obtained during the episode
        """
        rewards = []
        obs, info = self.env.reset(seed=seed)
        self.reset(seed=seed)
        done = False
        while not done:
            act = self.get_action(obs)
            new_obs, reward, terminated, truncated, info = self.env.step(act)
            done = terminated or truncated
            exp = Experience(obs, act, reward, new_obs)
            self.observe(exp)
            rewards.append(reward)
            obs = new_obs
        return rewards

    def train(self, n_runs=500):
        """
        Run a batch of episodes, and return the total reward obtained per episode

        Inputs:
            n_runs : The number of episodes to simulate

        Returns:
            The discounted sum of rewards obtained for each episode
        """
        all_rewards = []
        for seed in range(n_runs):
            rewards = self.run_episode(seed)
            all_rewards.append(utils.sum_rewards(rewards, self.gamma))
        return all_rewards


class Random(Agent):
    def get_action(self, obs: ObsType) -> ActType:
        return self.rng.integers(0, self.num_actions)
Cheater Agent
First, you'll implement a cheating agent that peeks at the info and finds the optimal policy directly using your previous code. If your agent gets more than this in the long run, you have a bug!
Exercise - implement Cheater
Below, you should implement the cheater agent. Recall the find_optimal_policy function to find the optimal policy for any environment and value of gamma. You should use this function to define the optimal policy on initialization, and then fill in the get_action method so that it always returns the corresponding optimal action.
Once you've done this, run the code below to check that your cheating agent outperforms the random agent. The cheating agent represents the best possible behavior, as it omnisciently always knows to play optimally without having to learn.
On the environment ToyGym-v0, (assuming $\gamma = 0.99$ and that the environment terminates after 2 timesteps) the cheating agent should always get reward $2 \gamma = 1.98$,
and the random agent should get a fluctuating reward, with average $\frac{2 \gamma + 1}{2} = 1.49$.
Hint: Use env.unwrapped.env to extract the Environment wrapped inside gym.Env, from which you can access .T and .R.
Help - I get 'AttributeError: 'DiscreteEnviroGym' object has no attribute 'num_states''.
This is probably because you're passing the DiscreteEnviroGym object to your find_optimal_policy function. In the following line of code:
env_toy = gym.make("ToyGym-v0")
the object env_toy wraps around the Toy environment you used yesterday. As mentioned, you'll need to use env.unwrapped.env to access this environment, and its dynamics.
class Cheater(Agent):
    def __init__(self, env: DiscreteEnviroGym, config: AgentConfig = defaultConfig, gamma=0.99, seed=0):
        super().__init__(env, config, gamma, seed)
        raise NotImplementedError()

    def get_action(self, obs):
        raise NotImplementedError()


env_toy = gym.make("ToyGym-v0")
agents_toy: list[Agent] = [Cheater(env_toy), Random(env_toy)]
returns_dict = {}
for agent in agents_toy:
    returns = agent.train(n_runs=100)
    returns_dict[agent.name] = utils.cummean(returns)

line(
    list(returns_dict.values()),
    names=list(returns_dict.keys()),
    title=f"Avg. reward on {env_toy.spec.name}",
    labels={"x": "Episode", "y": "Avg. reward", "variable": "Agent"},
    template="simple_white",
    width=700,
    height=400,
)
Click to see the expected output
Solution
class Cheater(Agent):
    def __init__(self, env: DiscreteEnviroGym, config: AgentConfig = defaultConfig, gamma=0.99, seed=0):
        super().__init__(env, config, gamma, seed)
        self.pi_opt = find_optimal_policy(self.env.unwrapped.env, self.gamma)

    def get_action(self, obs):
        return self.pi_opt[obs]


env_toy = gym.make("ToyGym-v0")
agents_toy: list[Agent] = [Cheater(env_toy), Random(env_toy)]
returns_dict = {}
for agent in agents_toy:
    returns = agent.train(n_runs=100)
    returns_dict[agent.name] = utils.cummean(returns)

line(
    list(returns_dict.values()),
    names=list(returns_dict.keys()),
    title=f"Avg. reward on {env_toy.spec.name}",
    labels={"x": "Episode", "y": "Avg. reward", "variable": "Agent"},
    template="simple_white",
    width=700,
    height=400,
)
SARSA: On-Policy TD Control
Now we wish to train an agent on the same gridworld environment as before, but this time it doesn't have access to the underlying dynamics (T and R). The rough idea here is to try and estimate the Q-value function directly from samples received from the environment. Recall that the optimal Q-value function satisfies
$$
Q^*(s,a) = \mathbb{E}_{\pi^*} \left[ r + \gamma \max_{a'} Q^*(s', a') \right]
$$
where
- $s'$ represents the next state after $s$,
- $a'$ the next action after $a$,
- $r$ is the reward obtained from taking action $a$ in state $s$,
- the expectation $\mathbb{E}_{\pi^*}$ is with respect to both the optimal policy $\pi^*$, as well as the stochasticity in the environment itself.
Intuitively, this formula says "if our policy is optimal, then (current value if next action is $a$) = (reward from action $a$) + (reward for choosing the best possible action at the next step)".
This means that, for any particular episode $s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3,\ldots$ we have that, on average,
$$
Q^*(s_t, a_t) \approx r_{t+1} + \gamma Q^*(s_{t+1}, a_{t+1})
$$
where $a_{t+1} = \pi^*(s_{t+1}) = \text{argmax}_a Q^*(s_{t+1}, a)$. Since this approximation is only valid on the assumption that future decisions are made optimally, it will not hold for non-optimal policies.
Letting $\widehat{Q}^*$ denote our best current estimate of $Q^*$, the error $\delta_t := r_{t+1} + \gamma \widehat{Q}^*(s_{t+1}, a_{t+1}) - \widehat{Q}^*(s_t,a_t)$ in this "guess" is called the TD error, and tells us in which direction we should bump our estimate of $Q^*$. Of course, this estimate might be wildly inaccurate (even for the same state-action pair!), due to the stochasticity of the environment, or poor estimates of $Q^*$. So, we update our estimate slightly in the direction of $\delta_t$, much like stochastic gradient descent does. The update rule (with learning rate $\alpha > 0$) is
$$
\widehat{Q}^*(s_t, a_t) \leftarrow \widehat{Q}^*(s_t, a_t) + \alpha \delta_t
$$
This update depends on the information $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$, and so is called SARSA learning (recall these are the exact 5 values we store in our Experience dataclass). Note that SARSA learns on-policy, in that it only learns from data that was actually generated by the current policy $\pi$, derived from the current estimate of $Q$, $\pi(s) = \text{argmax}_a \widehat{Q}^*(s,a)$.
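As a tiny worked example of one such update (toy table and numbers for illustration, not taken from our gridworld):

```python
import numpy as np

alpha, gamma = 0.1, 0.99  # learning rate and discount, toy values
Q = np.zeros((2, 2))      # toy table: 2 states, 2 actions

# One SARSA-style update from the experience (s, a, r, s', a'):
s, a, r, s_next, a_next = 0, 1, 1.0, 1, 0
delta = r + gamma * Q[s_next, a_next] - Q[s, a]  # TD error
Q[s, a] += alpha * delta                         # nudge estimate towards the TD target
```

Here the TD target is $1.0 + 0.99 \cdot 0 = 1.0$, so the estimate moves from $0$ to $0.1$: a small step towards the guess, not all the way.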

Note - we can intuitively describe this update step as "changing your $Q$-function based on the action you actually took, in a way which makes it more consistent with the current policy". On its own this wouldn't get us to the optimal value $Q^*$, instead it'd just get us to the correct value of $Q^\pi$. This is why we have to combine it with some kind of policy improvement. We use an epsilon-greedy policy (see below).
Q-Learning: Off-Policy TD Control
As mentioned at the end of the last section, what SARSA is essentially doing is estimating $Q_\pi$ by using the rewards gathered by following policy $\pi$. But we don't actually care about $Q_\pi$, what we care about is $Q^*$. $Q$-Learning provides a slight modification to SARSA, modifying the TD-error $\delta_t$ to use the greedy action in state $s_{t+1}$ (namely $\text{argmax}_a \widehat{Q}^*(s_{t+1},a)$) rather than the action $a_{t+1}$ that was actually taken, giving $\delta_t := r_{t+1} + \gamma \max_a \widehat{Q}^*(s_{t+1}, a) - \widehat{Q}^*(s_t, a_t)$.
Intuitively, if the SARSA update is "changing your $Q$-function based on the action you actually took, in a way which makes it more consistent with the current policy", then this update is "changing your $Q$-function based on the best possible action you could have taken, in a way which makes it more consistent with the optimal policy".
Note that each Q-learning update depends on the information $(s_t, a_t, r_{t+1}, s_{t+1})$. This means that $Q$-learning tries to estimate $Q^*$ directly, regardless of what policy $\pi$ generated the episode, and so Q-Learning learns off-policy.
Q-Learning will in fact eventually learn the optimal Q-value $Q^*$ under any policy $\pi$ for which every state-action pair is visited infinitely often (plus some other conditions regarding the learning rate). This means Q-Learning will still work, even if the policy is completely random! (Learning will be very slow, however.)

Explore vs. Exploit
Lastly, methods to learn the Q-value when acting with a greedy policy often have trouble exploring. If a state-action pair $(s,a)$ with low Q-value $Q^*(s,a)$ just so happens to return a high reward by chance, the greedy policy with respect to $Q$ will often choose action $a$ in state $s$ instead of exploring other potentially good actions first. To remedy this, we instead use an $\epsilon$-greedy policy with respect to the current Q-value estimates: with probability $\epsilon$, a random action is chosen, and with probability $1-\epsilon$ the greedy action $\text{argmax}_a Q(s,a)$ is chosen. The exploration probability $\epsilon$ is a hyperparameter that for now we will set to a constant $\epsilon = 0.1$, but more sophisticated techniques include the use of a schedule to start exploring often early, and then decay exploration as time goes on.
We also have the choice of how the estimate $\widehat{Q}^*(s,a)$ is initialized. By choosing "optimistic" values (initial values that are much higher than what we expect $Q^*(s,a)$ to actually be), this will encourage the greedy policy to hop between different actions in each state when they discover they weren't as valuable as first thought.
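A quick illustration of why optimistic initialization encourages exploration (toy numbers, a single state with 3 actions):

```python
import numpy as np

optimism = 5.0
Q = np.zeros(3) + optimism  # all actions start out looking great

# Suppose repeatedly trying action 0 drags its estimate down to its true value:
Q[0] = 1.0

# The greedy choice now switches to a not-yet-tried action,
# since untried actions still carry the optimistic prior.
best = Q.argmax()
```

Once every action's estimate has been pulled down to realistic values, the greedy policy settles on the genuinely best one.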
Exercise - Implement Q-learning and SARSA
You should fill in the classes EpsilonGreedy, QLearning and SARSA below.
- `EpsilonGreedy`
  - This is simply a base class which keeps track of the current Q-value estimates, and selects an action based on an epsilon-greedy policy (i.e. with probability $\epsilon$ a random action is taken, and with probability $1-\epsilon$ the Q-maximizing action is taken, according to current Q-value estimates).
  - We've already filled in the initialization for you (with optimism), all that's left is for you to implement epsilon-greedy action selection.
- `QLearning`
  - You should fill in the `observe` method for the `QLearning` class, to update Q-values based on observations (according to the formula above).
- `SARSA`
  - You should fill in the `observe` method for the `SARSA` class, to update Q-values based on observations (according to the formula above).
Note that we've given the SARSA agent a slightly different run_episode function. Can you see what's different about this, and why we've done this? (This is pretty subtle so don't worry about it too much.)
Answer - what's different?
We've reordered the while loop contents. Before, it looked like this:

    while not done:
        act = self.get_action(obs)
        new_obs, reward, terminated, truncated, info = self.env.step(act)
        done = terminated or truncated
        exp = Experience(obs, act, reward, new_obs)
        self.observe(exp)
        rewards.append(reward)
        obs = new_obs

Now, it looks like this (note that the first action is now chosen before the loop begins):

    act = self.get_action(obs)
    while not done:
        new_obs, reward, terminated, truncated, info = self.env.step(act)
        done = terminated or truncated
        new_act = self.get_action(new_obs)
        exp = Experience(obs, act, reward, new_obs, new_act)
        self.observe(exp)
        rewards.append(reward)
        obs = new_obs
        act = new_act
Answer - why is it different?
It's different in SARSA because the order of operations is different. We have:
- SARSA = take action $a_t$, observe $(r_{t+1}, s_{t+1})$, then take action $a_{t+1}$, then update
- Q-learning = take action $a_t$, observe $(r_{t+1}, s_{t+1})$, then update, then take action $a_{t+1}$
The key point being that the action $a_{t+1}$ has to be computed before the SARSA update step, since SARSA updates on actions taken. Whereas Q-learning updates should happen after the action $a_t$.
Don't worry if you don't fully understand this, the order of operations for these 2 algorithms can get a bit messy. All you really need to understand is this: in both cases, the code variables exp.obs, exp.act, exp.reward, exp.new_obs in the observe methods correspond to the pseudocode variables $S, A, R, S'$, and in the case of SARSA we also have exp.new_act which corresponds to $A'$.
Tips
- Use `self.rng.random()` to generate random numbers in the range $[0,1)$, and `self.rng.integers(0, n)` for random integers in the range $0, 1, \ldots, n-1$.
- The random agent results in very long episodes, which slows evaluation. You can remove it from the experiment once you've convinced yourself that your agents are doing something reasonable.
- Leave $\gamma = 0.99$ for now.
When you run the code below, you should expect to see SARSA outperform Q-Learning (by a lot at first, then this gap will narrow). They should both be slightly worse than the cheater, and much better than the random agent.
class EpsilonGreedy(Agent):
    """
    A class for SARSA and Q-Learning to inherit from.
    """

    def __init__(
        self,
        env: DiscreteEnviroGym,
        config: AgentConfig = defaultConfig,
        gamma: float = 0.99,
        seed: int = 0,
    ):
        super().__init__(env, config, gamma, seed)
        self.Q = np.zeros((self.num_states, self.num_actions)) + self.config.optimism

    def get_action(self, obs: ObsType) -> ActType:
        """
        Selects an action using epsilon-greedy with respect to Q-value estimates
        """
        raise NotImplementedError()


class QLearning(EpsilonGreedy):
    def observe(self, exp: Experience) -> None:
        raise NotImplementedError()


class SARSA(EpsilonGreedy):
    def observe(self, exp: Experience):
        raise NotImplementedError()

    def run_episode(self, seed) -> list[int]:
        rewards = []
        obs, info = self.env.reset(seed=seed)
        act = self.get_action(obs)
        self.reset(seed=seed)
        done = False
        while not done:
            new_obs, reward, terminated, truncated, info = self.env.step(act)
            done = terminated or truncated
            new_act = self.get_action(new_obs)
            exp = Experience(obs, act, reward, new_obs, new_act)
            self.observe(exp)
            rewards.append(reward)
            obs = new_obs
            act = new_act
        return rewards
n_runs = 1000
gamma = 0.99
seed = 1

env_norvig = gym.make("NorvigGrid-v0")
config_norvig = AgentConfig()
args_norvig = (env_norvig, config_norvig, gamma, seed)
agents_norvig: list[Agent] = [
    Cheater(*args_norvig),
    QLearning(*args_norvig),
    SARSA(*args_norvig),
    Random(*args_norvig),
]
returns_dict = {}
for agent in agents_norvig:
    returns = agent.train(n_runs)
    returns_dict[agent.name] = utils.cummean(returns)

line(
    list(returns_dict.values()),
    names=list(returns_dict.keys()),
    title=f"Avg. reward on {env_norvig.spec.name}",
    labels={"x": "Episode", "y": "Avg. reward", "variable": "Agent"},
    template="simple_white",
    width=700,
    height=400,
)
Click to see the expected output
Solution
class EpsilonGreedy(Agent):
    """
    A class for SARSA and Q-Learning to inherit from.
    """

    def __init__(
        self,
        env: DiscreteEnviroGym,
        config: AgentConfig = defaultConfig,
        gamma: float = 0.99,
        seed: int = 0,
    ):
        super().__init__(env, config, gamma, seed)
        self.Q = np.zeros((self.num_states, self.num_actions)) + self.config.optimism

    def get_action(self, obs: ObsType) -> ActType:
        """
        Selects an action using epsilon-greedy with respect to Q-value estimates
        """
        if self.rng.random() < self.config.epsilon:
            return self.rng.integers(0, self.num_actions)
        else:
            return self.Q[obs].argmax()


class QLearning(EpsilonGreedy):
    def observe(self, exp: Experience) -> None:
        s_t, a_t, r_t_1, s_t_1 = exp.obs, exp.act, exp.reward, exp.new_obs
        self.Q[s_t, a_t] += self.config.lr * (r_t_1 + self.gamma * np.max(self.Q[s_t_1]) - self.Q[s_t, a_t])


class SARSA(EpsilonGreedy):
    def observe(self, exp: Experience):
        s_t, a_t, r_t_1, s_t_1, a_t_1 = exp.obs, exp.act, exp.reward, exp.new_obs, exp.new_act
        self.Q[s_t, a_t] += self.config.lr * (r_t_1 + self.gamma * self.Q[s_t_1, a_t_1] - self.Q[s_t, a_t])

    def run_episode(self, seed) -> list[int]:
        rewards = []
        obs, info = self.env.reset(seed=seed)
        act = self.get_action(obs)
        self.reset(seed=seed)
        done = False
        while not done:
            new_obs, reward, terminated, truncated, info = self.env.step(act)
            done = terminated or truncated
            new_act = self.get_action(new_obs)
            exp = Experience(obs, act, reward, new_obs, new_act)
            self.observe(exp)
            rewards.append(reward)
            obs = new_obs
            act = new_act
        return rewards
Compare the performance of SARSA and Q-Learning on the gridworld environment vs. the cheating agent and the random agent. Try tuning the hyperparameters to get the best performance you can.
- Which seems to work better, SARSA or Q-Learning?
- Does the optimism parameter seem to help?
- What's the best choice of exploration parameter $\epsilon$?
- The feedback from the environment is very noisy. At the moment, the provided code plots the cumulative average reward per episode. You might want to try plotting a sliding average instead, or an exponentially weighted moving average (see `part2_dqn/utils.py`).
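For instance, here is a minimal sketch of an exponentially weighted moving average (the function name `ewma` and the smoothing parameter `alpha` are illustrative choices, not part of the provided `utils`):

```python
import numpy as np

def ewma(rewards, alpha: float = 0.05):
    """Exponentially weighted moving average: out[t] = alpha * x[t] + (1 - alpha) * out[t-1]."""
    out = np.zeros(len(rewards))
    avg = rewards[0]  # initialize at the first value to avoid a cold-start bias towards zero
    for t, x in enumerate(rewards):
        avg = alpha * x + (1 - alpha) * avg
        out[t] = avg
    return out
```

You could then plot `ewma(returns)` in place of `utils.cummean(returns)`; smaller `alpha` gives a smoother but more lagged curve.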
Eligibility Traces
Both SARSA and Q-Learning work well in environments that never terminate, as they update on every time step. Suppose instead we were focused on an episodic environment. Given a full trajectory $s_0, a_0, r_1, s_1, a_1, \ldots, s_T$ from such an environment, we can form $n$-step TD errors that use $n$ real rewards before bootstrapping:
$$\delta_{t:t+n} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n Q(s_{t+n}, a_{t+n}) - Q(s_t, a_t)$$
What should we choose? Shorter updates like $\delta_{t:t+1}$ have less variance as they rely on fewer samples, but they may have a systematic bias because we replace the tail of the return with our current best guess $\gamma Q(s_{t+1}, a_{t+1})$. Longer updates like $\delta_{t:T}$ are unbiased as they use the actual return, but have much higher variance due to summing many (potentially) noisy rewards.
One can try a mixture to get the best of both worlds: we could use a weighted combination $\sum_{k=1}^{T-t} \theta_{t+k} \, \delta_{t:t+k}$ of the updates above, with non-negative weights summing to 1.
The trick now is that we choose the weights $\theta_{t+k} = (1-\lambda) \lambda^{k-1}$ for $k=1,2,\ldots,T-t$, and define the $\lambda$-return as the correspondingly weighted combination of $n$-step returns:
$$G^\lambda_t = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{T-t-1} G_{t:T},$$
where $G_{t:t+n} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n Q(s_{t+n}, a_{t+n})$ is the $n$-step return, and the full Monte Carlo return $G_{t:T}$ absorbs the remaining weight. We then use $G^\lambda_t$ as the update target in place of the 1-step target.
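As a quick standalone sanity check (not part of the exercises), these weights form a geometric series whose sum $1 - \lambda^{T-t}$ approaches 1 for long episodes:

```python
import numpy as np

lam = 0.95
horizon = 200  # episode length T - t, made up for illustration
k = np.arange(1, horizon + 1)
weights = (1 - lam) * lam ** (k - 1)

# Geometric series: the weights sum to 1 - lam ** horizon, which approaches 1
total = weights.sum()
```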
There are two problems with this update rule: (1) computing $G^\lambda_t$ naively is very expensive, requiring $O(T^2)$ time per episode, and (2) we would need to wait until the end of the episode to learn anything. For continuing environments with no end, we could never update!
Thankfully, there is a fix! We can use something called an eligibility trace to compute this update in an online fashion. The eligibility trace $e_t(s,a)$ measures how much we should "blame" a state-action pair for the TD error on timestep $t$. When we visit a state-action pair its eligibility trace is incremented, and the eligibility traces of all pairs decay geometrically over time. We can define the eligibility trace as
$$e_t(s, a) = \gamma \lambda \, e_{t-1}(s, a) + \mathbf{1}[s_t = s, a_t = a], \qquad e_{-1}(s, a) = 0.$$
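To make the decay-and-increment behaviour concrete, here is a toy standalone sketch (the visit times are made up) tracking the trace of a single state-action pair:

```python
gamma, lam = 0.99, 0.9
visits = [0, 3]  # timesteps at which this particular (s, a) pair is visited
e = 0.0
trace = []
for t in range(6):
    e = gamma * lam * e + (1.0 if t in visits else 0.0)  # decay, then bump on a visit
    trace.append(e)
# the trace jumps by 1 at each visit, and decays by a factor of gamma * lam per step in between
```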
Derivation of the eligibility trace (very optional)
Full Derivation of SARSA(λ) Eligibility Trace Update
We want to show that the forward-view λ-return update:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( G^\lambda_t - Q(s_t, a_t) \right)$$
is equivalent to the backward-view update using eligibility traces:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \, \delta_t \, e_t(s, a) \quad \text{for all } (s, a),$$
where
$$e_t(s, a) = \gamma \lambda \, e_{t-1}(s, a) + \mathbf{1}[s_t = s, a_t = a]$$
and
$$\delta_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$$
is the standard 1-step TD error.
1. The λ-return and partial n-step returns
For an episode of length $T$, the λ-return is defined as a weighted combination of n-step returns:
$$G^\lambda_t = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{T-t-1} G_{t:T},$$
where
$$G_{t:t+n} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n Q(s_{t+n}, a_{t+n}).$$
The forward-view update at time $t$ is
$$\Delta Q(s_t, a_t) = \alpha \left( G^\lambda_t - Q(s_t, a_t) \right).$$
2. Express the n-step returns in terms of 1-step TD errors
Define the 1-step TD error:
$$\delta_k = r_{k+1} + \gamma Q(s_{k+1}, a_{k+1}) - Q(s_k, a_k).$$
Then (treating $Q$ as fixed over the episode) the n-step return can be written as a sum of TD errors:
$$G_{t:t+n} - Q(s_t, a_t) = \sum_{k=0}^{n-1} \gamma^k \delta_{t+k}.$$
Substituting into the λ-return:
$$G^\lambda_t - Q(s_t, a_t) = \sum_{k=0}^{T-t-1} (\gamma\lambda)^k \delta_{t+k}.$$
This identity shows that the λ-return minus the current Q-value equals a geometrically-weighted sum of future 1-step TD errors.
3. Sum of forward-view updates over an episode
Consider the total change to a particular $Q(s,a)$ over an episode:
$$\Delta Q(s, a) = \alpha \sum_{t=0}^{T-1} \mathbf{1}[s_t = s, a_t = a] \left( G^\lambda_t - Q(s_t, a_t) \right) = \alpha \sum_{t=0}^{T-1} \mathbf{1}[s_t = s, a_t = a] \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k.$$
Swap the order of summation (sum over TD times $k$ first):
$$\Delta Q(s, a) = \alpha \sum_{k=0}^{T-1} \delta_k \sum_{t=0}^{k} \mathbf{1}[s_t = s, a_t = a] \, (\gamma\lambda)^{k-t}.$$
4. Recognize the eligibility trace
The inner sum is exactly the accumulating eligibility trace:
$$e_k(s, a) = \sum_{t=0}^{k} \mathbf{1}[s_t = s, a_t = a] \, (\gamma\lambda)^{k-t},$$
which is equivalent to the recursive definition:
$$e_k(s, a) = \gamma\lambda \, e_{k-1}(s, a) + \mathbf{1}[s_k = s, a_k = a], \qquad e_{-1}(s, a) = 0.$$
Thus, the total update can be written as:
$$\Delta Q(s, a) = \alpha \sum_{k=0}^{T-1} \delta_k \, e_k(s, a),$$
which is exactly the backward-view incremental update applied once per timestep.
5. Practical backward-view SARSA(λ) algorithm
At each time step $t$:
- Observe $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$.
- Compute the TD error:
  $$\delta_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t).$$
- Update eligibility traces for all $(s,a)$:
  $$e_t(s, a) = \gamma\lambda \, e_{t-1}(s, a) + \mathbf{1}[s_t = s, a_t = a].$$
- Update Q-values for all $(s,a)$:
  $$Q(s, a) \leftarrow Q(s, a) + \alpha \, \delta_t \, e_t(s, a).$$

This yields the online, incremental backward-view SARSA(λ) update, equivalent to the forward-view λ-return update (exactly so when $Q$ is held fixed over the episode, and approximately under online updates).
This then gives the update rule for SARSA($\lambda$) on timestep $t$, applied to every state-action pair $(s,a)$, as
$$Q(s, a) \leftarrow Q(s, a) + \alpha \, \delta_t \, e_t(s, a).$$

Note that the inner-most loop over all state-action pairs can be parallelized, as the updates are independent. In practice, we store the eligibility trace $e$ as a lookup table `e[s, a]`, and broadcast the updates.
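As a minimal standalone sketch of that broadcast (the table sizes and the TD error value are made up for illustration):

```python
import numpy as np

num_states, num_actions = 5, 3
Q = np.zeros((num_states, num_actions))
e = np.zeros((num_states, num_actions))
lr, gamma, lam = 0.1, 0.99, 0.9
delta_t = 2.0  # made-up TD error for this step

s_t, a_t = 1, 2
e[s_t, a_t] += 1       # bump the trace of the visited pair
Q += lr * delta_t * e  # one broadcast update over the whole (state, action) table
e *= gamma * lam       # geometric decay of every trace
```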
Exercise - Implement SARSA($\lambda$)
You should fill in the class `SARSA_lambda` below.
- We can reuse the same `run_episode` function as the `SARSA` class, but modify it to first reset the eligibility trace `e` back to zero.
- The policy will be epsilon-greedy as before, so we can also reuse the `get_action` method.
- The `observe` function will need to be modified for the new SARSA($\lambda$) update rule.
- The `__init__` method should initialize the eligibility trace to zero, as well as the Q-values (either directly or via `super().__init__`).
```python
@dataclass
class TD_LambdaConfig(AgentConfig):
    lambda_: float = 0.95


class SARSA_lambda(SARSA):
    def __init__(
        self,
        env: DiscreteEnviroGym,
        config: AgentConfig = defaultConfig,
        gamma: float = 0.99,
        seed: int = 0,
    ):
        raise NotImplementedError()

    def run_episode(self, seed) -> list[int]:
        raise NotImplementedError()

    def observe(self, exp: Experience):
        raise NotImplementedError()
```
```python
n_runs = 1000
gamma = 0.99
seed = 1

env_norvig = gym.make("NorvigGrid-v0")
config_norvig = TD_LambdaConfig()
args_norvig = (env_norvig, config_norvig, gamma, seed)
agents_norvig: list[Agent] = [
    Cheater(*args_norvig),
    QLearning(*args_norvig),
    SARSA(*args_norvig),
    SARSA_lambda(*args_norvig),
    Random(*args_norvig),
]
returns_dict = {}
for agent in agents_norvig:
    returns = agent.train(n_runs)
    returns_dict[agent.name] = utils.cummean(returns)

line(
    list(returns_dict.values()),
    names=list(returns_dict.keys()),
    title=f"Avg. reward on {env_norvig.spec.name}",
    labels={"x": "Episode", "y": "Avg. reward", "variable": "Agent"},
    template="simple_white",
    width=700,
    height=400,
)
```
Click to see the expected output
You should find this makes quite an improvement over SARSA. In more complex environments, the use of eligibility traces is essential. We will use these for PPO later on.
Solution
```python
@dataclass
class TD_LambdaConfig(AgentConfig):
    lambda_: float = 0.95


class SARSA_lambda(SARSA):
    def __init__(
        self,
        env: DiscreteEnviroGym,
        config: AgentConfig = defaultConfig,
        gamma: float = 0.99,
        seed: int = 0,
    ):
        super().__init__(env, config, gamma, seed)
        self.lambda_ = config.lambda_
        self.e = np.zeros((self.num_states, self.num_actions), dtype=np.float32)

    def run_episode(self, seed) -> list[int]:
        self.e[:, :] = 0  # reset the eligibility trace at the start of each episode
        return super().run_episode(seed)

    def observe(self, exp: Experience):
        s_t, a_t, r_t, s_tp1, a_tp1 = exp.obs, exp.act, exp.reward, exp.new_obs, exp.new_act
        delta_t = r_t + self.gamma * self.Q[s_tp1, a_tp1] - self.Q[s_t, a_t]
        self.e[s_t, a_t] += 1
        self.Q += self.config.lr * delta_t * self.e  # broadcast update over all (s, a) pairs
        self.e *= self.gamma * self.lambda_  # decay all traces
```
Q-Learning vs. SARSA
gym provides a large set of environments to test agents against. We can see all available environments via the dictionary `gym.envs.registry` (keys are env names, values are env classes).
Have a look at the gymnasium library for descriptions of these environments. As written, our SARSA and Q-Learning agents will only work with environments that have both discrete observation and discrete action spaces.
We'll modify the above code to use the environment `gym.make("CliffWalking-v0")` instead (see this link). Below is the graph from Sutton & Barto, Example 6.6, which displays the sum of rewards obtained during each episode, as well as the policies obtained (SARSA takes the safer path, Q-Learning takes the optimal path). You may want to check out this post.


Do you get a similar result when you run the code below?
Some notes:
- Use $\gamma = 1$ as described in Sutton & Barto, Example 6.6.
- Try tweaking the learning rate and epsilon (start with $\epsilon = 0.1$) to try and cause SARSA to take the cautious path, while Q-Learning takes the risky path.
- We've included some helper functions to display the value of each state, as well as the policy an agent would take, given the Q-value function.
- One of the bonus exercises we've suggested is to write your own version of `CliffWalking-v0` by writing a class similar to the `Norvig` class you have been working with. If you do this correctly, then you'll also be able to make a cheating agent.
- We've defined a `cliffwalk_imshow` helper function for you, which visualises your agent's path (and reward at each square).
```python
gamma = 1
seed = 0

config_cliff = AgentConfig(epsilon=0.1, lr=0.1, optimism=0)
env = gym.make("CliffWalking-v0")
n_runs = 2500
args_cliff = (env, config_cliff, gamma, seed)

returns_list = []
name_list = []
agents = [QLearning(*args_cliff), SARSA(*args_cliff)]

for agent in agents:
    assert isinstance(agent, (QLearning, SARSA))  # for typechecker
    returns = agent.train(n_runs)[1:]
    returns_list.append(utils.cummean(returns))
    name_list.append(agent.name)
    V = agent.Q.max(axis=-1).reshape(4, 12)
    pi = agent.Q.argmax(axis=-1).reshape(4, 12)
    cliffwalk_imshow(V, pi, title=f"CliffWalking: {agent.name} Agent", width=800, height=400)

line(
    returns_list,
    names=name_list,
    template="simple_white",
    title="Q-Learning vs SARSA on CliffWalking-v0",
    labels={"x": "Episode", "y": "Avg. reward", "variable": "Agent"},
    width=700,
    height=400,
)
```
Click to see the expected output
Bonus - Q($\lambda$)
We can take the same approach for Q-Learning too. However, we need to be careful: the eligibility trace lets the current TD error update state-action pairs visited in the past, and whenever an exploratory action was taken, those backward updates become off-policy with respect to the greedy policy that Q-Learning evaluates, which is not what we want. One remedy is as follows:
Watkins' $Q(\lambda)$: Every time we take an exploratory action that was not the greedy action, we reset the eligibility trace.
This "cuts" the contribution whenever an exploratory action was taken instead of the greedy one. However, if the agent takes exploratory actions often, TD backups of more than a few steps will rarely happen, defeating the entire point of using eligibility traces.
We leave it to you to implement this if you wish.
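If you do attempt it, the core update might look something like this standalone sketch (NumPy only; the function name and signature are our own, and terminal-state handling is omitted):

```python
import numpy as np

def watkins_observe(Q, e, s, a, r, s_next, a_next, lr=0.1, gamma=0.99, lam=0.9):
    """One backward-view Watkins' Q(lambda) step; a_next is the action actually taken in s_next."""
    delta = r + gamma * Q[s_next].max() - Q[s, a]  # Q-learning (max) target
    e[s, a] += 1
    Q += lr * delta * e  # broadcast update over the whole table
    if a_next == Q[s_next].argmax():
        e *= gamma * lam  # greedy action taken: keep propagating credit backwards
    else:
        e[:] = 0.0        # exploratory action taken: cut the trace
```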
Bonus - build your own CliffWalking environment
You should return to this exercise at the end if you want to, or continue with the bonus material below.
You can modify the code used to define the Norvig class to define your own version of CliffWalking-v0. You can do this without guidance, or you can get some more guidance from the hint below.
Some notes for this task:
- The random agent will take a very long time to accidentally stumble into the goal state, and will slow down learning. You should probably omit it.
- As soon as you hit the cliff, you should immediately move back to the start square, i.e. in pseudocode:

```python
if new_state in cliff:
    new_state = start_state
    reward -= 100
```

This means you'll never compute Q-values from the cliff, so your Q-values will always be zero there.
Hints (for both methods)
The main way in which the CliffWalking environment differs from the Norvig gridworld is that the former has cliffs while the latter has walls. Cliffs and walls have different behaviour; you can see how the cliffs affect the agent by visiting the documentation page for CliffWalking-v0.
__init__
This mainly just involves changing the dimensions of the space, position of the start and terminal states, and parameters like penalty. Also, rather than walls, you'll need to define the position of the cliffs (which behave differently).
dynamics
You'll need to modify dynamics in the following two ways:
- Remove the slippage probability (although it would be interesting to experiment with this and see what effect it has!)
- Remove the "when you hit a wall, you get trapped forever" behaviour, and replace it with "when you hit a cliff, you get a reward of -100 and go back to the start state".
```python
class CliffWalking(Environment):
    def __init__(self, penalty=-1):
        self.height = 4
        self.width = 12
        self.penalty = penalty
        num_states = self.height * self.width
        num_actions = 4
        self.states = np.array([[x, y] for y in range(self.height) for x in range(self.width)])
        self.actions = np.array([[0, -1], [1, 0], [0, 1], [-1, 0]])  # up, right, down, left
        self.dim = (self.height, self.width)

        # special states: tuples of state and reward
        # all other states get penalty
        start = 36
        terminal = np.array([47], dtype=int)
        self.cliff = np.arange(37, 47, dtype=int)
        self.goal_rewards = np.array([1.0, -1.0])

        super().__init__(num_states, num_actions, start=start, terminal=terminal)

    def dynamics(self, state: int, action: int) -> tuple[Arr, Arr, Arr]:
        """
        Returns tuple of (out_states, out_rewards, out_probs) for this given (state, action) pair.
        """
        raise NotImplementedError()

    @staticmethod
    def render(Q: Arr, name: str):
        V = Q.max(axis=-1).reshape(4, 12)
        pi = Q.argmax(axis=-1).reshape(4, 12)
        cliffwalk_imshow(V, pi, title=f"CliffWalking: {name} Agent")


gym.envs.registration.register(
    id="CliffWalking-myversion",
    entry_point=DiscreteEnviroGym,
    max_episode_steps=200,
    nondeterministic=True,
    kwargs={"env": CliffWalking(penalty=-1)},
)
```
```python
gamma = 0.99
seed = 0

config_cliff = AgentConfig(epsilon=0.1, lr=0.1, optimism=0)
env = gym.make("CliffWalking-myversion")
n_runs = 500
args_cliff = (env, config_cliff, gamma, seed)

agents = [Cheater(*args_cliff), QLearning(*args_cliff), SARSA(*args_cliff), Random(*args_cliff)]
returns_list = []
name_list = []

for agent in agents:
    returns = agent.train(n_runs)[1:]
    returns_list.append(utils.cummean(returns))
    name_list.append(agent.name)

line(
    returns_list,
    names=name_list,
    template="simple_white",
    title="Q-Learning vs SARSA on CliffWalking-v0",
    labels={"x": "Episode", "y": "Avg. reward", "variable": "Agent"},
    width=700,
    height=400,
)
```
Click to see the expected output
Solution
```python
class CliffWalking(Environment):
    def __init__(self, penalty=-1):
        self.height = 4
        self.width = 12
        self.penalty = penalty
        num_states = self.height * self.width
        num_actions = 4
        self.states = np.array([[x, y] for y in range(self.height) for x in range(self.width)])
        self.actions = np.array([[0, -1], [1, 0], [0, 1], [-1, 0]])  # up, right, down, left
        self.dim = (self.height, self.width)

        # special states: tuples of state and reward
        # all other states get penalty
        start = 36
        terminal = np.array([47], dtype=int)
        self.cliff = np.arange(37, 47, dtype=int)
        self.goal_rewards = np.array([1.0, -1.0])

        super().__init__(num_states, num_actions, start=start, terminal=terminal)

    def dynamics(self, state: int, action: int) -> tuple[Arr, Arr, Arr]:
        """
        Returns tuple of (out_states, out_rewards, out_probs) for this given (state, action) pair.
        """

        def state_index(state):
            assert 0 <= state[0] < self.width and 0 <= state[1] < self.height, f"{state}"
            pos = state[0] + state[1] * self.width
            assert 0 <= pos < self.num_states, f"{state} {pos}"
            return pos

        pos = self.states[state]

        if state in self.terminal:
            return (np.array([state]), np.array([0]), np.array([1]))

        # No slipping; each action is deterministic
        out_probs = np.zeros(self.num_actions)
        out_probs[action] = 1

        out_states = np.zeros(self.num_actions, dtype=int)  # placeholder; every entry is set below
        out_rewards = np.zeros(self.num_actions) + self.penalty
        new_states = [pos + x for x in self.actions]

        for i, s_new in enumerate(new_states):
            if not (0 <= s_new[0] < self.width and 0 <= s_new[1] < self.height):
                out_states[i] = state
                continue

            new_state = state_index(s_new)

            # Check if we would hit the cliff; if so, get a -100 penalty and go back to the start
            if new_state in self.cliff:
                out_states[i] = self.start
                out_rewards[i] -= 100
            else:
                out_states[i] = new_state

            for idx in range(len(self.terminal)):
                if new_state == self.terminal[idx]:
                    out_rewards[i] = self.goal_rewards[idx]

        return (out_states, out_rewards, out_probs)

    @staticmethod
    def render(Q: Arr, name: str):
        V = Q.max(axis=-1).reshape(4, 12)
        pi = Q.argmax(axis=-1).reshape(4, 12)
        cliffwalk_imshow(V, pi, title=f"CliffWalking: {name} Agent")
```
Bonus - other suggestions
Epsilon scheduler
Try using a schedule for the exploration rate $\epsilon$ (large values early to encourage exploration, low values later once the agent has sufficient statistics to play optimally).
Would Q-Learning or SARSA be better off with a scheduled exploration rate?
The Sutton book mentions that if $\epsilon$ is gradually reduced, both methods asymptotically converge to the optimal policy. Is this what you find?
Other environments
Try other environments like FrozenLake and Blackjack (source code here, here). Note that Blackjack uses Tuple(Discrete(32), Discrete(11), Discrete(2)) as its observation space, so you will have to write some glue code to convert back and forth between this and a flat Discrete(32 * 11 * 2) observation space, to work with our agents as written.
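A sketch of that glue code (standalone; `SIZES` mirrors the component sizes of Blackjack's Tuple observation space, and the function names are our own):

```python
SIZES = (32, 11, 2)  # component sizes of the Tuple(Discrete, Discrete, Discrete) space

def encode(obs: tuple[int, int, int]) -> int:
    """Flatten a tuple observation into a single index in range(32 * 11 * 2)."""
    idx = 0
    for component, size in zip(obs, SIZES):
        idx = idx * size + component  # mixed-radix positional encoding
    return idx

def decode(idx: int) -> tuple[int, int, int]:
    """Inverse of encode: recover the original tuple from the flat index."""
    out = []
    for size in reversed(SIZES):
        out.append(idx % size)
        idx //= size
    return tuple(reversed(out))
```

The agent then sees `encode(obs)` as its observation, and anything you log or render can be mapped back with `decode`.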
Double-Q learning
Read Sutton and Barto Section 6.7 Maximisation Bias and Double Learning. Implement Double-Q learning, and compare its performance against SARSA and Q-Learning.