2️⃣ Learning Phase
Learning Objectives
- Implement the total objective function (sum of three separate terms)
- Understand the importance of each of these terms for the overall algorithm
- Write a function to return an optimizer and learning rate scheduler for your model
In the last section, we wrote a lot of setup code (including handling most of how our rollout phase will work). Next, we'll turn to the learning phase.
In the next exercises, you'll write code to compute your total objective function. This is given by equation $(9)$ in the paper, and is the sum of three terms - we'll implement each one individually.
Note - the sign convention we've used in these exercises is that your functions should output the terms as they appear in equation $(9)$, in other words you will compute $L_t^{CLIP}(\theta)$, $c_1 L_t^{VF}(\theta)$ and $c_2 S[\pi_\theta](s_t)$. We will then perform gradient ascent by passing `maximize=True` into our optimizer. An equally valid solution would be to return the negative of the objective function and minimize it.
Objective function
Clipped Surrogate Objective (detail #8)
For each minibatch, calculate $L^{CLIP}$ from equation $(7)$ in the paper. This will allow us to improve the parameters of our actor.
If you want to recap the derivation of this expression you can use the dropdown below or go back to section 0️⃣, but the key intuition is as follows:
- Ignoring clipping, this objective function gives a positive gradient for $\pi_\theta(a_t \mid s_t)$ when the action $a_t$ had positive advantage - in other words, take good actions more often and bad actions less often!
- The clipping formula looks complicated, but all it's really saying is "if the probability ratio goes outside the bounds $[1-\epsilon, 1+\epsilon]$, we no longer apply gradients to it". This is good because it means that if the policy changes significantly, we don't over-update on experiences that weren't generated by this new, changed policy. This is where the "proximal" part of PPO comes from.
Click here to see the full derivation
To fully explain this function, we start with the non-proximal version, which doesn't involve clipping:

$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[\dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{target}}(a_t \mid s_t)} \hat{A}_t\right]$$

If we replace the probability ratio here with the log probability $\log \pi_\theta(a_t \mid s_t)$, we get something that looks a lot more like the loss function we saw in section 0️⃣, i.e. the one whose gradient $\nabla_\theta L(\theta)$ we proved to be equal to the gradient of the sum of expected returns $\nabla_\theta J(\pi_\theta)$ (see section 0️⃣ for the proof). The reason we use the probability ratio rather than the log probability is stability - as long as $\pi_\theta$ and $\pi_{\theta_\text{target}}$ are sufficiently close, this ratio will be close to 1, and so we can use the approximation $\log x \approx x - 1$ for $x$ close to 1. With this approximation, we get:

$$\log\left(\dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{target}}(a_t \mid s_t)}\right) \approx \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{target}}(a_t \mid s_t)} - 1$$

Rearranging, we get $\dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{target}}(a_t \mid s_t)} \approx \log \pi_\theta(a_t \mid s_t) + \text{const}$, where $\text{const}$ is independent of $\theta$. This means the substitution we made above is valid.
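As a quick numerical sanity check of the $\log x \approx x - 1$ approximation (numbers chosen arbitrarily):

```python
import math

# For x close to 1, log(x) is well approximated by x - 1, so the probability
# ratio behaves (locally) like the log probability plus a constant
for x in [0.9, 0.95, 1.0, 1.05, 1.1]:
    print(f"x = {x:.2f}   x - 1 = {x - 1:+.4f}   log(x) = {math.log(x):+.4f}")
```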
Why can we assume $x$ is close to 1? That brings us to the second part of the objective function: the clipping. The $\min$ and $\operatorname{clip}$ in the expression might look messy and confusing at first, but it's simpler than it looks - in practice, all it means is that we clip the probability ratio when it's larger than $1+\epsilon$ for positive advantages, or smaller than $1-\epsilon$ for negative advantages. For example, if $\hat{A}_t > 0$ then the lower clip bound never matters (whenever $r_t(\theta) < 1-\epsilon$, the $\min$ selects the unclipped term anyway), and so the expression reduces to $\min\left(r_t(\theta) \hat{A}_t, (1+\epsilon) \hat{A}_t\right) = \min(r_t(\theta), 1+\epsilon) \hat{A}_t$. The illustration below might help explain how the clipping works:

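To make the clipping behaviour concrete, here's a small numeric sketch (the `clipped_term` helper and the numbers are just for illustration):

```python
def clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    # min(r * A, clip(r, 1 - eps, 1 + eps) * A), as in equation (7) of the paper
    clipped_ratio = min(max(ratio, 1 - eps), 1 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the ratio is capped at 1 + eps, so there's no incentive
# to push the action's probability further up once r exceeds 1.2
print(clipped_term(1.5, 1.0))   # 1.2 (clipped), not 1.5
# Negative advantage: the min keeps the *unclipped* term when r overshoots,
# so moving far in the wrong direction is still penalized in full
print(clipped_term(1.5, -1.0))  # -1.5 (unclipped), not -1.2
```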
Exercise - write calc_clipped_surrogate_objective
Implement the function below. The current probabilities $\pi_\theta(a_t \mid s_t)$ come from evaluating the actions `mb_action` under the new distribution `dist`, and the old probabilities $\pi_{\theta_\text{target}}(a_t \mid s_t)$ come from the stored `mb_logprobs`.
A few things to note:

- Pay attention to the normalization instructions in detail #7 when implementing this loss function. They add a value of `eps = 1e-8` to the denominator to avoid division by zero; you should do this too.
- You can use the `dist.log_prob` method to get the log probabilities that correspond to the actions in `mb_action`.
- If you're wondering why we're using a `Categorical` type rather than just using `log_prob` directly, it's because we'll be using it to sample actions later on in our `train_ppo` function. Also, categoricals have a useful method for returning the entropy of a distribution (which will be useful for the entropy term in the objective function).
- Our `mb_action` has shape `(minibatch_size, *action_shape)`, but in most of the environments we're dealing with (CartPole, and later the Breakout Atari env) the action shape is an empty tuple, which is why we have the assert statement at the start of this function.
- The clip formula can be a bit finicky (i.e. when you take signs and max/mins); we recommend breaking the computation into a few separate lines rather than doing it all in one go!
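As a quick illustration of the `Categorical` methods you'll need (the logits and actions below are made up):

```python
import torch as t
from torch.distributions.categorical import Categorical

# Hypothetical minibatch: 4 states, 2 actions (e.g. CartPole's left/right)
logits = t.tensor([[1.0, -1.0], [0.5, 0.5], [-2.0, 2.0], [0.0, 0.0]])
dist = Categorical(logits=logits)   # normalizes the logits internally
mb_action = t.tensor([0, 1, 1, 0])  # one action per minibatch element

logprobs = dist.log_prob(mb_action)  # log pi(a_t | s_t) for each taken action
print(logprobs.shape)                # torch.Size([4])
print(dist.entropy().shape)          # torch.Size([4]) - one entropy value per state
```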
def calc_clipped_surrogate_objective(
    dist: Categorical,
    mb_action: Int[Tensor, "minibatch_size"],
    mb_advantages: Float[Tensor, "minibatch_size"],
    mb_logprobs: Float[Tensor, "minibatch_size"],
    clip_coef: float,
    eps: float = 1e-8,
) -> Float[Tensor, ""]:
    """Return the clipped surrogate objective, suitable for maximisation with gradient ascent.

    dist:
        a distribution containing the actor's unnormalized logits of shape (minibatch_size, num_actions)
    mb_action:
        the actions taken in the sampled minibatch
    mb_advantages:
        advantages calculated from the sampled minibatch
    mb_logprobs:
        logprobs of the actions taken in the sampled minibatch (according to the old policy)
    clip_coef:
        amount of clipping, denoted by epsilon in Eq 7.
    eps:
        used to add to std dev of mb_advantages when normalizing (to avoid dividing by zero)
    """
    assert mb_action.shape == mb_advantages.shape == mb_logprobs.shape
    raise NotImplementedError()
tests.test_calc_clipped_surrogate_objective(calc_clipped_surrogate_objective)
Solution
def calc_clipped_surrogate_objective(
    dist: Categorical,
    mb_action: Int[Tensor, "minibatch_size"],
    mb_advantages: Float[Tensor, "minibatch_size"],
    mb_logprobs: Float[Tensor, "minibatch_size"],
    clip_coef: float,
    eps: float = 1e-8,
) -> Float[Tensor, ""]:
    """Return the clipped surrogate objective, suitable for maximisation with gradient ascent.

    dist:
        a distribution containing the actor's unnormalized logits of shape (minibatch_size, num_actions)
    mb_action:
        the actions taken in the sampled minibatch
    mb_advantages:
        advantages calculated from the sampled minibatch
    mb_logprobs:
        logprobs of the actions taken in the sampled minibatch (according to the old policy)
    clip_coef:
        amount of clipping, denoted by epsilon in Eq 7.
    eps:
        used to add to std dev of mb_advantages when normalizing (to avoid dividing by zero)
    """
    assert mb_action.shape == mb_advantages.shape == mb_logprobs.shape
    # Probability ratio r_t(theta) = pi_theta(a|s) / pi_theta_target(a|s), via log-space subtraction
    logprob_diff = dist.log_prob(mb_action) - mb_logprobs
    prob_ratio = t.exp(logprob_diff)
    # Normalize advantages over the minibatch (detail #7)
    mb_advantages = (mb_advantages - mb_advantages.mean()) / (mb_advantages.std() + eps)
    non_clipped = prob_ratio * mb_advantages
    clipped = t.clip(prob_ratio, 1 - clip_coef, 1 + clip_coef) * mb_advantages
    return t.minimum(non_clipped, clipped).mean()
tests.test_calc_clipped_surrogate_objective(calc_clipped_surrogate_objective)
Value Function Loss (detail #9)
The value function loss lets us improve the parameters of our critic. Today we're going to implement the simple form: this is just the mean squared difference between the following two terms:

- The critic's prediction - this is $V_\theta(s_t)$ in the paper, and `values` in our code (i.e. values computed from the updated critic network)
- The observed returns - this is $V_t^\text{targ}$ in the paper, and `returns = memory.advantages + memory.values` in our code (i.e. values generated during rollout)
Note that the observed returns are equivalent to the observed next-step Q values, so the squared difference between these two terms is the TD error. It's analogous to the TD error we used during SARSA - the purpose is to bring $V_\theta(s_t)$ closer in line with the actual value, estimated with the benefit of seeing more future actions. Remember that because we're using GAE, the observed returns $V_t^\text{targ}$ don't just take the next action into account, but many future actions in the trajectory.
Note - the PPO paper did a more complicated thing with clipping, but we're going to deviate from the paper and NOT clip, since detail #9 gives evidence that it isn't beneficial.
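To make the two terms concrete, here's a toy numeric sketch of the weighted squared difference (all numbers made up):

```python
import torch as t

# Toy numbers: `values` comes from the updated critic, while the target
# `mb_returns` was built during rollout as advantages + old values
values = t.tensor([1.0, 2.0, 3.0])
mb_advantages = t.tensor([0.5, -1.0, 0.0])
old_values = t.tensor([1.0, 2.5, 3.0])
mb_returns = mb_advantages + old_values  # tensor([1.5000, 1.5000, 3.0000])

vf_coef = 0.5
vf_loss = vf_coef * (values - mb_returns).pow(2).mean()
print(vf_loss.item())  # 0.5 * mean([0.25, 0.25, 0.0]) ≈ 0.0833
```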
Exercise - write calc_value_function_loss
Implement calc_value_function_loss which returns the term denoted $c_1 L_t^{VF}$ in equation $(9)$.
def calc_value_function_loss(
    values: Float[Tensor, "minibatch_size"],
    mb_returns: Float[Tensor, "minibatch_size"],
    vf_coef: float,
) -> Float[Tensor, ""]:
    """Compute the value function portion of the loss function.

    values:
        the value function predictions for the sampled minibatch (using the updated critic network)
    mb_returns:
        the target for our updated critic network (computed as `advantages + values` from the old network)
    vf_coef:
        the coefficient for the value loss, which weights its contribution to the overall loss.
        Denoted by c_1 in the paper.
    """
    assert values.shape == mb_returns.shape
    raise NotImplementedError()
tests.test_calc_value_function_loss(calc_value_function_loss)
Solution
def calc_value_function_loss(
    values: Float[Tensor, "minibatch_size"],
    mb_returns: Float[Tensor, "minibatch_size"],
    vf_coef: float,
) -> Float[Tensor, ""]:
    """Compute the value function portion of the loss function.

    values:
        the value function predictions for the sampled minibatch (using the updated critic network)
    mb_returns:
        the target for our updated critic network (computed as `advantages + values` from the old network)
    vf_coef:
        the coefficient for the value loss, which weights its contribution to the overall loss.
        Denoted by c_1 in the paper.
    """
    assert values.shape == mb_returns.shape
    return vf_coef * (values - mb_returns).pow(2).mean()
Entropy Bonus (detail #10)
The entropy bonus term is intended to incentivize exploration by increasing the entropy of the action distribution. For a discrete probability distribution $p$, the entropy $H$ is defined as

$$H(p) = -\sum_x p(x) \ln p(x)$$

`probs.entropy` computes it using the above formula, but in a numerically stable way that also handles the case where $p(x) = 0$ (using the convention $0 \ln 0 = 0$).
Question: in CartPole, what are the minimum and maximum values that entropy can take? What behaviors correspond to each of these cases?
Answer
The minimum entropy is zero, under the policy "always move left" or "always move right".
The maximum entropy is $\ln(2) \approx 0.693$ under the uniform random policy over the 2 actions.
Separately from its role in the objective function, the entropy of our action distribution is a useful diagnostic to have: if the entropy of the agent's actions is near the maximum, the agent is playing nearly randomly, which means it isn't learning anything (assuming the optimal policy isn't random). If it's near the minimum, especially early in training, then the agent might not be exploring enough.
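We can check both extremes directly with `Categorical.entropy` (a quick sketch for the 2-action CartPole case):

```python
import torch as t
from torch.distributions.categorical import Categorical

uniform = Categorical(probs=t.tensor([0.5, 0.5]))
near_deterministic = Categorical(probs=t.tensor([0.999, 0.001]))

print(uniform.entropy().item())             # ln(2) ≈ 0.6931: maximum entropy for 2 actions
print(near_deterministic.entropy().item())  # ≈ 0.008: close to the minimum of 0
```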
Exercise - write calc_entropy_bonus
def calc_entropy_bonus(dist: Categorical, ent_coef: float):
    """Return the entropy bonus term, suitable for gradient ascent.

    dist:
        the probability distribution for the current policy
    ent_coef:
        the coefficient for the entropy loss, which weights its contribution to the overall
        objective function. Denoted by c_2 in the paper.
    """
    raise NotImplementedError()
tests.test_calc_entropy_bonus(calc_entropy_bonus)
Solution
def calc_entropy_bonus(dist: Categorical, ent_coef: float):
    """Return the entropy bonus term, suitable for gradient ascent.

    dist:
        the probability distribution for the current policy
    ent_coef:
        the coefficient for the entropy loss, which weights its contribution to the overall
        objective function. Denoted by c_2 in the paper.
    """
    return ent_coef * dist.entropy().mean()
Adam Optimizer & Scheduler (details #3 & #4)
Even though Adam is already an adaptive learning rate optimizer, empirically it's still beneficial to decay the learning rate.
Exercise - implement PPOScheduler
Implement a linear decay from `initial_lr` to `end_lr` over `total_phases` steps. Also, make sure you read details #3 and #4 so you don't miss any of the Adam implementation details. Note that training terminates after `num_updates`, so you don't need to worry about what the learning rate will be after this point.
Recall from our look at optimizers in the first week: we edit hyperparameters like the learning rate of an optimizer by iterating through `optimizer.param_groups` and setting the `param_group["lr"]` entry.
We've implemented the `make_optimizer` function for you. Note that we've used `itertools.chain` when passing the parameters to the optimizer - this creates a single iterable out of multiple iterables, which is necessary because the optimizer expects a single iterable of parameters. Another option would be to convert the parameters to lists and concatenate them.
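A minimal illustration of what `itertools.chain` does:

```python
import itertools

# chain stitches several iterables into a single stream - exactly what the
# optimizer needs in order to see both the actor's and the critic's parameters
combined = itertools.chain([1, 2], [3, 4])
print(list(combined))  # [1, 2, 3, 4]
```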
class PPOScheduler:
    def __init__(self, optimizer: Optimizer, initial_lr: float, end_lr: float, total_phases: int):
        self.optimizer = optimizer
        self.initial_lr = initial_lr
        self.end_lr = end_lr
        self.total_phases = total_phases
        self.n_step_calls = 0

    def step(self):
        """
        Implement linear learning rate decay so that after `total_phases` calls to step, the
        learning rate is end_lr.

        Do this by directly editing the learning rates inside each param group (i.e.
        `param_group["lr"] = ...`), for each param group in `self.optimizer.param_groups`.
        """
        raise NotImplementedError()
def make_optimizer(
    actor: nn.Module, critic: nn.Module, total_phases: int, initial_lr: float, end_lr: float = 0.0
) -> tuple[optim.AdamW, PPOScheduler]:
    """
    Return an appropriately configured AdamW optimizer with its attached scheduler.
    """
    optimizer = optim.AdamW(
        itertools.chain(actor.parameters(), critic.parameters()),
        lr=initial_lr,
        eps=1e-5,
        maximize=True,
    )
    scheduler = PPOScheduler(optimizer, initial_lr, end_lr, total_phases)
    return optimizer, scheduler
tests.test_ppo_scheduler(PPOScheduler)
Solution
class PPOScheduler:
    def __init__(self, optimizer: Optimizer, initial_lr: float, end_lr: float, total_phases: int):
        self.optimizer = optimizer
        self.initial_lr = initial_lr
        self.end_lr = end_lr
        self.total_phases = total_phases
        self.n_step_calls = 0

    def step(self):
        """
        Implement linear learning rate decay so that after `total_phases` calls to step, the
        learning rate is end_lr.
        """
        self.n_step_calls += 1
        frac = self.n_step_calls / self.total_phases
        assert frac <= 1
        for param_group in self.optimizer.param_groups:
            param_group["lr"] = self.initial_lr + frac * (self.end_lr - self.initial_lr)
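As a quick sanity check on the decay formula, we can spell out the interpolation by hand (the hyperparameter values below are chosen arbitrarily):

```python
# The lr after n calls to step() is a linear interpolation between initial_lr and end_lr
initial_lr, end_lr, total_phases = 1e-3, 0.0, 10

def lr_after(n_step_calls: int) -> float:
    frac = n_step_calls / total_phases
    return initial_lr + frac * (end_lr - initial_lr)

print(f"{lr_after(1):.4f}")   # 0.0009 - one step in
print(f"{lr_after(5):.4f}")   # 0.0005 - halfway through the phases
print(f"{lr_after(10):.4f}")  # 0.0000 - fully decayed to end_lr
```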