Exercise Status: All exercises complete and verified

0️⃣ Whirlwind tour of PPO

Learning Objectives
  • Understand the mathematical intuitions of PPO
  • Learn how expressions like the PPO objective function are derived

This section is quite mathematically dense, and you'll cover a lot of it again as you go through the exercises (plus, understanding all the details behind PPO isn't strictly necessary to get all the benefits from this chapter).

At the end of each subsection we've included a box which summarizes the main points covered so far which should help distill out the main ideas as you're reading, as well as a couple of questions which help you check your understanding. We strongly recommend reading at least the contents of these boxes and attempting the questions, and also reading the section at the end describing today's setup.

Also, an important note - to simplify notation, everything here assumes a finite-horizon setting and no discount factor, i.e. $\gamma = 1$. If we remove these assumptions, not a whole lot changes, but we wanted to highlight it to avoid confusion.

Policy Gradient methods vs DQN

We'll start by discussing the general class of policy gradient methods (of which PPO is a member), and compare it to DQN. To recap, DQN works as follows:

DQN involved learning the Q-function $Q(s, a)$, which represents the expected time-discounted future reward of taking action $a$ in state $s$. The update steps were based on the Bellman equation - this equation is only satisfied if we've found the true Q-function, so we minimize the squared TD residual to find it. We can derive the optimal policy by argmaxing $Q(s, a)$ over actions $a$.

On the other hand, policy gradient methods take a more direct route - we write the expected future reward $J(\pi_\theta)$ as a function of our policy $\pi_\theta(a_t \mid s_t)$ (which takes the form of a neural network mapping from states to action logits), and then perform gradient ascent on this function to improve our policy directly, i.e. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\pi_\theta)$. In this way, we essentially sidestep having to think about the Bellman equation at all, and we directly optimize our policy function.

A question remains here - how can we take the derivative of expected future returns as a function of our policy $\pi_\theta(a_t \mid s_t)$, in a way which is differentiable wrt $\theta$? We'll discuss this in the next section.

Summary so far

  • In policy gradient methods, we directly optimize the policy function $\pi_\theta(a_t \mid s_t)$ to get higher expected future rewards.

Policy Gradient objective function

Note, this is the most mathematically dense part of the material, and probably the part that's most fine to skip.

Let $\pi_\theta$ be our policy parameterized by $\theta$, and let $J(\pi_\theta)$ denote the expected return of the policy (assuming undiscounted & finite-horizon for simplicity):

$$ J(\pi_\theta) = \underset{\tau \sim \pi_\theta}{\mathbb{E}} \left[ \sum_{t=0}^T r_{t+1}(s_t, a_t, s_{t+1}) \right] $$

where $\tau = (s_0, a_0, ..., s_T)$ stands for a trajectory sampled from policy $\pi_\theta$, and we've written the rewards as $r_{t+1}(s_t, a_t, s_{t+1})$ to make it clear that they are a function of the trajectory. Then a theorem sometimes called the policy gradient theorem (PGT) says that the gradient of $J(\pi_\theta)$ is:

$$ \nabla_\theta J\left(\pi_\theta\right)=\underset{\tau \sim \pi_\theta}{\mathbb{E}}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta\left(a_t \mid s_t\right) A_\theta(s_t, a_t)\right] \quad (*) $$

where $A_\theta(s_t, a_t)$ is the advantage function, defined as $Q_\theta(s_t, a_t) - V_\theta(s_t)$, i.e. how much better it is to choose action $a_t$ in state $s_t$ as compared to the value obtained by following $\pi_\theta$ from that state onwards.

The derivation is optional (included in a dropdown below for completeness), but it is worth thinking about this expression intuitively:

  • If the advantage $A_\theta(s_t, a_t)$ is positive, this means taking action $a_t$ in state $s_t$ is better on average than what we'd do under $\pi_\theta$. So if we increase $\pi_\theta(a_t \mid s_t)$ then $J(\pi_\theta)$ will increase, and vice-versa.
  • If the advantage $A_\theta(s_t, a_t)$ is negative, this means taking action $a_t$ in state $s_t$ is worse than expectation, so if we increase $\pi_\theta(a_t \mid s_t)$ then $J(\pi_\theta)$ will decrease, and vice-versa.

So this expression is telling us that we should change our policy $\pi_\theta(a_t \mid s_t)$ in a way which makes advantageous actions more likely, and non-advantageous actions less likely. All pretty intuitive!

Note that instead of advantages $A_\theta(s_t, a_t)$, we could just use the total reward $R(\tau)$ in our objective function (we show this in the derivation below, because proving $(*)$ with $R(\tau)$ instead is actually an intermediate step in the proof). We call this algorithm REINFORCE. It can work, but the problem is it leads to much higher variance in our assessment of the value of actions. As a thought experiment: imagine you're assessing moves in a game of chess which you ended up winning. With $R(\tau)$, every move contributing to the win is rewarded equally, and so we can't differentiate between excellent moves and average ones. The advantage function is a lot more fine-grained: it allows you to answer the question "was this move better than what I typically play in this position?" for each move, isolating the contribution of the move from the game outcome as a whole.

Full derivation (optional)

Summary: we can quite easily get a formula for $\nabla_\theta J\left(\pi_\theta\right)$ which looks like the one above, but has the total reward $R(\tau)$ instead of the advantage $A_\theta(s_t, a_t)$. We can then use a bag of tricks to show that we can subtract baseline functions from $R(\tau)$ in our expression without changing its value, until we get to an expression with $A_\theta(s_t, a_t)$ instead.

Let's go through the derivation line by line. Denoting $\int_\tau$ as an integral over the distribution of all states & actions in the trajectory, we have:

$$ \begin{aligned} \nabla_\theta J\left(\pi_\theta\right) & =\nabla_\theta \underset{\tau \sim \pi_\theta}{\mathbb{E}}[R(\tau)] \\ & =\nabla_\theta \int_\tau P(\tau \mid \theta) R(\tau) \quad \text{Expand integration} \\ & =\int_\tau \nabla_\theta P(\tau \mid \theta) R(\tau) \quad \text{Bring gradient inside integral} \\ & =\int_\tau P(\tau \mid \theta) \nabla_\theta \log P(\tau \mid \theta) R(\tau) \quad \text{Use log derivative trick} \\ & =\underset{\tau \sim \pi_\theta}{\mathbb{E}}\left[\nabla_\theta \log P(\tau \mid \theta) R(\tau)\right] \quad \text{Recognize expectation} \\ & =\underset{\tau \sim \pi_\theta}{\mathbb{E}}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta\left(a_t \mid s_t\right) R(\tau)\right] \quad (*) \end{aligned} $$

Where the log derivative trick is a rearrangement of the standard result $\nabla_\theta \log P_\theta(x) = \frac{\nabla_\theta P_\theta(x)}{P_\theta(x)}$, and the final line was reached by writing $P(\tau \mid \theta)$ as a product of transition probabilities and policy probabilities $\pi_\theta(a_t \mid s_t)$, using the fact that log of a product is the sum of logs, meaning we get:

$$ \log P(\tau \mid \theta) = \sum_{t=0}^T \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^T \log P(s_{t+1} \mid s_t, a_t) $$
and the latter term vanishes when we take the gradient wrt $\theta$ because it's independent of $\theta$.

This formula looks like the one we had above, the only difference is that we have the total trajectory reward $R(\tau)$ instead of the advantage $A_\theta(s_t, a_t)$. And intuitively it seems like we should be able to get to this - after all, $R(\tau)$ is the sum of rewards at each timestep, and rewards accumulated before time $t$ shouldn't affect whether $\pi_\theta(a_t \mid s_t)$ should be increased or decreased. Although this is very intuitive, it turns out we have to do a bit of work to prove it.

Firstly, let's establish a lemma called the expected grad-log-prob lemma. This states that the expected gradient of the log-probability of a distribution (taken over samples from that same distribution) is always zero:

$$ \underset{x \sim P_\theta}{\mathbb{E}}\left[\nabla_\theta \log P_\theta(x)\right] = 0 $$

Proof: we can write the expectation above as the integral $\int_x P_\theta(x) \nabla_\theta \log P_\theta(x) dx$, which can be written as $\int_x \nabla_\theta P_\theta(x) dx$ by the log-derivative trick. Then we swap the order of integration and differentiation to get $\nabla_\theta \int_x P_\theta(x) dx$, and then using the fact that $P_\theta$ is a probability distribution, this becomes $\nabla_\theta 1 = 0$.
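As a quick sanity check, we can verify the lemma numerically with autograd. This snippet is purely illustrative (the categorical distribution and sample size are arbitrary choices, not part of PPO itself):

```python
import torch

torch.manual_seed(0)
theta = torch.randn(5, requires_grad=True)  # logits of a categorical P_theta

dist = torch.distributions.Categorical(logits=theta)
samples = dist.sample((100_000,))

# Monte Carlo estimate of E_{x ~ P_theta}[grad_theta log P_theta(x)]
mean_logprob = dist.log_prob(samples).mean()
(grad,) = torch.autograd.grad(mean_logprob, theta)

print(grad)  # every component is close to zero, as the lemma predicts
```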

To return to our policy gradient setting, not only does this show us that $\mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = 0$, it also shows us that $\mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t) f(\tau) \right] = 0$ whenever the function $f(\tau)$ is only a function of the early trajectory values $s_0, a_0, ..., s_{t-1}, a_{t-1}, s_t$, because this still falls out as zero when we integrate over the distribution of our action $a_t$. This means that in $(*)$, we can actually replace $R(\tau)$ with $R(\tau) - f(\tau)$ for any such choice of $f(\tau)$. We choose $f(\tau) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau) \mid s_t\right]$, i.e. the expected return conditioned on the trajectory up to and including $s_t$. The already-accumulated rewards $r_1, ..., r_t$ cancel, and so in $(*)$ the term for any given timestep $t$ becomes:

$$ \underset{s_0, ..., s_t, a_t}{\mathbb{E}}\left[\nabla_\theta \log \pi_\theta\left(a_t \mid s_t\right) \left( \underset{\tau \sim \pi_\theta}{\mathbb{E}}\left[R(\tau) \mid s_0, ..., s_t, a_t\right] - \underset{\tau \sim \pi_\theta}{\mathbb{E}}\left[R(\tau) \mid s_0, ..., s_t\right] \right) \right] $$

but since the first of these terms conditions on the action $a_t$ and the second doesn't, we recognize the term in the large brackets as exactly $Q_\theta(s_t, a_t) - V_\theta(s_t) = A_\theta(s_t, a_t)$, as required.

We have this formula, but how do we use it to get an objective function we can optimize for? The answer is that we take estimates of the advantage function $\hat{A}_{\theta_\text{target}}(s_t, a_t)$ using a frozen version of our parameters $\theta_{\text{target}}$ (like we took next-step Q-values from a frozen target network in DQN), and use our non-frozen parameters to get our values $\pi_\theta(a_t \mid s_t)$. For a given batch of experiences $B$ (which can be assumed to be randomly sampled across various different trajectories $\tau$), our objective function is:

$$ L(\theta) = \frac{1}{|B|} \sum_{t \in B} \log \pi_\theta(a_t \mid s_t) \hat{A}_{\theta_\text{target}}(s_t, a_t) $$
because then:
$$ \nabla_\theta L(\theta) = \frac{1}{|B|} \sum_{t \in B} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \hat{A}_{\theta_\text{target}}(s_t, a_t) \approx \nabla_\theta J(\pi_\theta) $$
exactly as we want! We can now perform gradient ascent on this objective function to improve our policy: $\theta \leftarrow \theta + \alpha \nabla_\theta L(\theta)$ will be an approximation of the ideal update rule $\theta \leftarrow \theta + \alpha \nabla_\theta J(\pi_\theta)$.
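As a sketch, this objective can be computed in a few lines of PyTorch (the function name is illustrative; the advantages are assumed to come from the frozen network, so they're detached):

```python
import torch

def policy_gradient_objective(logits: torch.Tensor, actions: torch.Tensor,
                              advantages: torch.Tensor) -> torch.Tensor:
    """Vanilla policy gradient objective L(theta) for one minibatch.

    logits:     (batch, n_actions) raw actor outputs for each state s_t
    actions:    (batch,) actions a_t actually taken in the rollout
    advantages: (batch,) advantage estimates from the frozen parameters
                (treated as constants, so they're detached here)
    """
    logprobs = torch.log_softmax(logits, dim=-1)
    chosen = logprobs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    # Gradient *ascent* on this mean pushes up logprobs of positive-advantage
    # actions and pushes down logprobs of negative-advantage ones
    return (chosen * advantages.detach()).mean()
```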

Summary so far

  • In policy gradient methods, we directly optimize the policy function $\pi_\theta(a_t \mid s_t)$ to get higher expected future rewards.
  • Our objective function is a sum of logprobs of actions taken, weighted by their advantage estimates $\hat{A}_\theta(s_t, a_t)$ (i.e. how good we think that action was), so performing gradient ascent on this leads to taking more good actions and less bad ones.
  • Note that we could just use accumulated rewards $R(\tau)$ rather than the advantage function, but using advantage estimates is a lot more stable.
Question - can you intuitively explain how the advantage function influences policy updates?

The advantage function scales updates; positive $A_\theta$ will cause us to increase the action likelihood (because the probability of that action will have a positive coefficient in the objective function), and negative $A_\theta$ will cause us to decrease the action likelihood.

Actor & critic

Unlike DQN, we require 2 different networks for policy gradient methods:

  • actor: learns the policy function $\pi_\theta(a_t \mid s_t)$, i.e. inputs are $s_t$ and outputs (for discrete action spaces) are a vector of logits for each action $a_t$
  • critic: learns the value function $V_\theta(s_t)$ which is used to estimate the advantage function estimates $\hat{A}_\theta(s_t, a_t)$, i.e. inputs are $s_t$ and outputs a single scalar value

As we discussed in the last section, estimating $\hat{A}_\theta(s_t, a_t)$ is valuable because without it we'd have to rely on the accumulated reward $R(\tau)$ in our objective function, which is very high-variance and doesn't allow for granular credit assignment to different actions. In simple environments like CartPole you may find you can get by without the critic, but as we move into more complex environments this ceases to be the case.

Note - unlike DQN, policy gradient methods can handle continuous action spaces. This is because we can have our actor output a vector of means and variances parameterizing a distribution over actions, and then sample actions from this distribution. On the other hand, our Q-network in DQN can only take in states and output a single scalar value for each discrete action. This will be important when we look at MuJoCo later.
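To make the two networks concrete, here's a minimal sketch for a discrete-action environment with flat observations (all sizes are illustrative, not prescriptive):

```python
import torch
import torch.nn as nn

obs_dim, n_actions, hidden = 4, 2, 64  # e.g. CartPole-sized (illustrative)

# Actor: maps state s_t to a vector of action logits
actor = nn.Sequential(
    nn.Linear(obs_dim, hidden), nn.Tanh(),
    nn.Linear(hidden, n_actions),
)

# Critic: maps state s_t to a single scalar value estimate V(s_t)
critic = nn.Sequential(
    nn.Linear(obs_dim, hidden), nn.Tanh(),
    nn.Linear(hidden, 1),
)

obs = torch.randn(8, obs_dim)  # a batch of 8 observations
print(actor(obs).shape, critic(obs).shape)  # (8, 2) and (8, 1)
```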

You might have a question at this point - how does the critic learn the value function? After all, the loss function $L(\theta)$ is designed just to update the policy $\pi_\theta$ (i.e. the actor network), not the critic network. The critic is used to compute the advantage function estimates, but these come from the frozen parameters $\theta_\text{target}$ in the objective function $L(\theta)$, i.e. they don't track gradients. The answer is that we improve our value function estimates by adding another loss term which minimizes the TD residual. We'll add the term $(V_\theta(s_t) - V_t^\text{target})^2$ to our loss function, where $V_\theta(s_t)$ are the value function estimates from our critic network (which do track gradients) and $V_t^\text{target} := V_{\theta_\text{target}}(s_t) + \hat{A}_{\theta_\text{target}}(s_t, a_t)$ are target value estimates taken from our frozen network, which take into account the actual action taken $a_t$ and how much it changed our value.
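In code, this extra loss term might look as follows (a sketch with hypothetical names; the frozen quantities are detached so only the live critic receives gradients):

```python
import torch
import torch.nn.functional as F

def critic_loss(values: torch.Tensor, target_values: torch.Tensor,
                target_advantages: torch.Tensor) -> torch.Tensor:
    """Squared-error value loss for the critic.

    values:            V_theta(s_t) from the live critic (tracks gradients)
    target_values:     V_{theta_target}(s_t) from the frozen network
    target_advantages: advantage estimates from the frozen network
    """
    # The regression target is V_target(s_t) + A_target(s_t, a_t), held constant
    v_target = (target_values + target_advantages).detach()
    return F.mse_loss(values, v_target)
```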

Summary so far

  • In policy gradient methods, we directly optimize the policy function $\pi_\theta(a_t \mid s_t)$ to get higher expected future rewards.
  • Our objective function is a sum of logprobs of actions taken, weighted by their advantage estimates $\hat{A}_\theta(s_t, a_t)$ (i.e. how good we think that action was), so performing gradient ascent on this leads to taking more good actions and less bad ones.
  • Note that we could just use accumulated rewards $R(\tau)$ rather than the advantage function, but using advantage estimates is a lot more stable.
  • We have 2 networks: actor which learns $\pi_\theta(a_t \mid s_t)$ using this objective function, and critic which learns $V_\theta(s_t)$ by minimizing the TD residual (a bit like SARSA), and allows us to estimate the advantage $\hat{A}_\theta(s_t, a_t)$ which is used in the objective function.
Question - why do policy gradient methods require both actor and critic networks, and how do they complement each other?

The actor learns the policy; the critic estimates value functions for stable advantage calculation. Without the actor we wouldn't have any policy to learn the value for, and without the critic we wouldn't be able to competently estimate the advantage function which is necessary so that we can compute our objective function / understand how we should update our policy.

Question - why is the critic network's loss function conceptually similar to the update rule we used for SARSA?

The SARSA update rule was:

$$ Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \eta \left( r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t,a_t) \right) $$

which is actually equivalent to the update rule we'd get if our loss function was the squared TD error to our $Q$ function. Our critic loss function is pretty much the same idea, except we're applying the TD error to $V(s_t)$ rather than $Q(s_t, a_t)$.

Note that SARSA differed from Q-Learning/DQN because the latter also included a maximization over the action space - we were essentially doing policy improvement and learning the value function for it at the same time. Here, our critic loss function is more conceptually similar to SARSA than it is to Q-Learning/DQN, because the policy improvement is coming from the actor network instead.

Generalized Advantage Estimation

We've got a lot of the pieces in place for PPO now - we have an actor and critic network, and we have 2 objective functions: one to train the critic to estimate the value function $V_\theta(s_t)$ accurately (which are used to estimate the advantage function $\hat{A}_\theta(s_t, a_t)$), and one which trains the actor to maximize the expected future reward based on these advantage estimates.

A question remains now - how do we use value estimates to compute advantage estimates? Here are some ideas:

  1. We can use the 1-step residual, i.e. $\hat{A}_\theta(s_t, a_t) = \delta_t = r_{t+1} + \gamma V_\theta(s_{t+1}) - V_\theta(s_t)$, just like the TD error we used in DQN. The problem with this is that we're only estimating the advantage based on a single action taken, which is a bit too myopic. If we sacrifice a piece in our chess game to win the match, we want to make sure our advantage estimates take this future position into account, rather than just thinking that we're down one piece!
  2. We can use the sum of future residuals, i.e. $\hat{A}_\theta(s_t, a_t) = \delta_t + \gamma \delta_{t+1} + ...$. This fixes the myopia problem, but reintroduces the old one - doing this is pretty much just like using $R(\tau)$ in our objective function instead; in other words, we're looking at the entire future trajectory at once! This leads to unstable training, and an inability to credit any individual action.

The solution is a middleground between these two: we perform generalized advantage estimation (GAE), which is a sum of future residuals but geometrically decaying according to some factor $\lambda$. In other words, we take $\hat{A}^{\text{GAE}(\lambda)}_\theta(s_t, a_t) = \delta_t + \lambda \gamma \delta_{t+1} + \lambda^2 \gamma^2 \delta_{t+2} + ...$. This is effectively the best of both worlds - we put the largest weight on the next actions taken (allowing us to attribute actions rather than entire trajectories), but also we do take into account future states in our trajectory (meaning we're not only concerned with the immediate reward). Note that $\lambda=0$ reduces to the first idea, and $\lambda=1$ to the second.
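A sketch of this computation (looping backwards so each $\hat{A}_t$ reuses $\hat{A}_{t+1}$; for simplicity it assumes the segment contains no episode boundaries, whereas a real implementation would also mask with done flags):

```python
import torch

def compute_gae(rewards: torch.Tensor, values: torch.Tensor,
                next_value: float, gamma: float = 0.99,
                lam: float = 0.95) -> torch.Tensor:
    """GAE over one rollout segment (no episode boundaries assumed).

    rewards:    (T,) rewards received at each step
    values:     (T,) critic estimates V(s_t)
    next_value: scalar V(s_T), bootstrapping the value after the segment
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        v_next = next_value if t == T - 1 else values[t + 1]
        delta = rewards[t] + gamma * v_next - values[t]  # 1-step TD residual
        gae = delta + gamma * lam * gae                  # geometrically decayed sum
        advantages[t] = gae
    return advantages
```

Setting `lam=0` recovers the 1-step residuals (idea 1), and `lam=1` recovers the full sum of future residuals (idea 2).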

Note that the fact that we use GAE also helps a lot for our critic network - in SARSA we were minimizing the 1-step TD error, but here we're training $V(s_t)$ to be more in line with a lookahead estimate of the value function which takes into account many future actions and states. This helps improve stability and speed up convergence.

Summary so far

  • In policy gradient methods, we directly optimize the policy function $\pi_\theta(a_t \mid s_t)$ to get higher expected future rewards.
  • Our objective function is a sum of logprobs of actions taken, weighted by their advantage estimates $\hat{A}_\theta(s_t, a_t)$ (i.e. how good we think that action was), so performing gradient ascent on this leads to taking more good actions and less bad ones.
  • Note that we could just use accumulated rewards $R(\tau)$ rather than the advantage function, but using advantage estimates is a lot more stable.
  • We have 2 networks: actor which learns $\pi_\theta(a_t \mid s_t)$ using this objective function, and critic which learns $V_\theta(s_t)$ by minimizing the TD residual (a bit like SARSA), and allows us to estimate the advantage $\hat{A}_\theta(s_t, a_t)$ which is used in the objective function.
  • We use generalized advantage estimation (GAE) to convert our value function estimates into advantage estimates - this mostly avoids the two possible extremes of (1) myopia from only looking at the immediate reward, and (2) instability / failure to credit single actions from looking at the entire trajectory at once.
Question - can you explain why using GAE is better than using the realized return trajectory in our loss function?

GAE is much more stable, because using the entire trajectory means we're only taking into account the actual reward accumulated (which can have much higher variance than an advantage estimate, assuming we already have a good policy). Additionally, GAE allows us to credit individual actions for the future rewards they lead to, which is something we couldn't do with the realized return trajectory.

PPO

We're pretty much there now - we've established all the theoretical pieces we need for PPO, and there are just three final things we need to add to the picture.

  1. We use an entropy bonus to encourage policy exploration, and prevent premature convergence to suboptimal policies.

Entropy is a very deep mathematical topic that we won't dive all the way into here, but for our purposes, we can say that entropy is a measure of uncertainty - policies which will definitely take the same action in the same state have low entropy, and policies which have a wide range of likely actions have high entropy. We add some multiple of the entropy of our policy function directly onto our objective function to be maximized, with the entropy coefficient usually decaying over time as we move from explore to exploit mode.
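For a discrete policy, this is just the entropy of the categorical distribution defined by the actor's logits, which PyTorch can compute directly (a sketch; the decaying coefficient mentioned above would multiply this term in the overall objective):

```python
import torch

def entropy_bonus(logits: torch.Tensor) -> torch.Tensor:
    """Mean entropy of the policy's action distributions over a batch."""
    dist = torch.distributions.Categorical(logits=logits)
    return dist.entropy().mean()

# A uniform policy over 4 actions has maximal entropy log(4) ~ 1.386,
# while a near-deterministic policy has entropy close to 0.
```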

  2. We clip the objective function $L(\theta)$ to get $L^\text{CLIP}(\theta)$, to prevent the policy from changing too much too fast.

The clipping is applied to make sure the ratio $\pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{target}}(a_t \mid s_t)$ stays close to 1, during any single learning phase (between learning phases we generate a new batch of experiences and reset $\theta_\text{target}$). Intuitively, this is because the more our policy changes from the old policy, the more unrealistic the generated experiences will be. For example, suppose we generate experiences from a bunch of chess games, where a particular class of strategies (e.g. playing more aggressively) is beneficial. We shouldn't update on these games indefinitely, because as we update and the agent's policy changes to become more aggressive, the generated experiences will no longer be accurate representations of the agent's strategy and so our objective function will no longer be a good estimate of the expected future reward.

There are various ways to keep this ratio close to 1. Trust region policy optimization (TRPO) explicitly adds a multiple of the KL divergence to the loss function, making sure the distributions stay close. PPO however does something much simpler and hackier - if the ratio is larger than $1+\epsilon$ for actions with positive advantage (for some constant $\epsilon > 0$) then we clip it, preventing gradients from flowing & updating the policy network further. We do the same thing with $1-\epsilon$ when the advantage is negative.

  3. We actually use $\dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{target}}(a_t \mid s_t)}$ rather than $\log \pi_\theta(a_t \mid s_t)$ when computing $L^\text{CLIP}(\theta)$.

This is a valid thing to do precisely because we're using clipping - this allows us to assume the probability ratio is usually close to 1, and so we can use the approximation $\log(x) \approx x - 1$ for all $x \approx 1$ (with this approximation, the two loss functions are equal up to a constant that doesn't depend on $\theta$ - we leave the proof as an exercise to the reader).
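Putting the last two points together, the clipped surrogate objective might be sketched as follows (illustrative names; frozen quantities are detached):

```python
import torch

def clipped_surrogate_objective(logprobs: torch.Tensor,
                                target_logprobs: torch.Tensor,
                                advantages: torch.Tensor,
                                eps: float = 0.2) -> torch.Tensor:
    """PPO's clipped surrogate objective, to be maximized.

    logprobs:        log pi_theta(a_t | s_t) from the live actor
    target_logprobs: log pi_{theta_target}(a_t | s_t), frozen
    advantages:      advantage estimates, frozen
    """
    ratio = torch.exp(logprobs - target_logprobs.detach())
    adv = advantages.detach()
    # min(unclipped, clipped): once the ratio leaves [1 - eps, 1 + eps] in the
    # direction the advantage pushes it, the clipped term is selected and no
    # further gradient flows for that experience
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    return torch.min(ratio * adv, clipped * adv).mean()
```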

Although these 3 changes are all important, it's the 2nd alteration that distinguishes PPO from other policy gradient methods. It's where it gets its name - proximal refers to the way in which this clipping keeps us close to the old policy.

Full summary

  • In policy gradient methods, we directly optimize the policy function $\pi_\theta(a_t \mid s_t)$ to get higher expected future rewards.
  • Our objective function is a sum of logprobs of actions taken, weighted by their advantage estimates $\hat{A}_\theta(s_t, a_t)$ (i.e. how good we think that action was), so performing gradient ascent on this leads to taking more good actions and less bad ones.
  • Note that we could just use accumulated rewards $R(\tau)$ rather than the advantage function, but using advantage estimates is a lot more stable.
  • We have 2 networks: actor which learns $\pi_\theta(a_t \mid s_t)$ using this objective function, and critic which learns $V_\theta(s_t)$ by minimizing the TD residual (a bit like SARSA), and allows us to estimate the advantage $\hat{A}_\theta(s_t, a_t)$ which is used in the objective function.
  • We use generalized advantage estimation (GAE) to convert our value function estimates into advantage estimates - this mostly avoids the two possible extremes of (1) myopia from only looking at the immediate reward, and (2) instability / failure to credit single actions from looking at the entire trajectory at once.
  • On top of all this, 2 other techniques fully characterize PPO:
  • We add an entropy bonus to our objective function to encourage exploration.
  • We clip the objective function so $\pi_\theta(a_t \mid s_t)$ isn't incentivized to change too quickly (which could cause instability) - this is the "proximal" part of PPO.
Question - what is the role of the entropy term, and should it be added to or subtracted from the clipped objective function?

The entropy term encourages policy exploration. We want to add it to the objective function, because we're doing gradient ascent on it (and we want to increase exploration).

Question - how does clipping the objective function help prevent large policy updates, and why is this desirable?

Clipping prevents large policy changes by stopping gradients from flowing once the probability ratio moves outside $[1-\epsilon, 1+\epsilon]$, ensuring proximity to the old policy.

This is good because we don't want to change our policy too quickly, based on possibly a limited set of experiences.

Today's setup

We've now covered all the theory we need to understand about PPO! To conclude, we'll briefly go through how our PPO algorithm is going to be set up in practice today, and relate it to what we've discussed in the previous sections.

A full diagram of our implementation is shown below:

We have 2 main phases: the rollout phase (where we generate a batch of experiences from our frozen network $\theta_\text{target}$) and the learning phase (where we update our policy $\pi_\theta$ based on the experiences generated in the rollout phase, as well as the outputs of our current network $\theta$). This is quite similar to the setup we used for DQN (where we'd alternate between generating experiences for our buffer and learning from those experiences) - the difference here is that rather than the rollout phase adding to our buffer, it'll be emptying our buffer and creating an entirely new one. So as not to be wasteful, the learning phase will iterate over our buffer multiple times before we repeat the cycle.

Just like we had ReplayBuffer, DQNAgent and DQNTrainer as our 3 main classes yesterday, here we have 3 main classes:

  • ReplayMemory stores experiences generated by the agent during rollout, and has a get_minibatches method which samples data to be used in the learning phase
  • PPOAgent manages the interaction between our policy and environment (particularly via the play_step method), it generates experiences and adds them to our memory
  • PPOTrainer is the main class for training our model, it's essentially a wrapper around everything else

As we can see in the diagram, the learning phase has us compute an objective function which involves 3 different terms:

  • The entropy bonus (for encouraging policy exploration) - this trains only our actor network
  • The clipped surrogate objective function (for policy improvement) - this trains only our actor network (although it uses the critic network's estimates from $\theta_\text{target}$), and it's the most important of the three terms
  • The value function loss (for improving our value function estimates) - this trains only our critic network
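These three terms combine into a single scalar objective for gradient ascent. The coefficient names and values below are illustrative stand-ins (roughly matching common PPO defaults), not taken from this chapter's implementation:

```python
import torch

ent_coef, vf_coef = 0.01, 0.5  # hypothetical hyperparameter names

# Stand-ins for the three scalar terms computed during the learning phase:
clipped_objective = torch.tensor(0.8)  # clipped surrogate (to maximize)
entropy = torch.tensor(1.1)            # entropy bonus (to maximize)
value_loss = torch.tensor(0.3)         # critic loss (to minimize)

# The entropy bonus is added and the value loss subtracted, since we maximize;
# equivalently, an optimizer would minimize the negation of this quantity
total_objective = clipped_objective + ent_coef * entropy - vf_coef * value_loss
```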