☆ Bonus
Trust Region Methods
Some versions of the PPO algorithm use a slightly different objective function. Rather than our clipped surrogate objective, they use constrained optimization (maximising the surrogate objective subject to a restriction on the KL divergence between the old and new policies).
The intuition behind this is similar to the clipped surrogate objective. For our clipped objective, we made sure the model wasn't rewarded for deviating from its old policy beyond a certain point (which encourages small updates). Adding an explicit KL constraint accomplishes something similar, because it forces the model to closely adhere to the old policy. For more on KL-divergence and why it's a principled measure, see this post. We call these algorithms trust-region methods because they incentivise the model to stay in a trusted region of policy space, i.e. close to the old policy (where we can be more confident in our results).
The theory behind TRPO actually suggests the following variant - turning the strict constraint into a penalty term, which you should find easier to implement:
Rather than forcing the new policy to stay close to the previous policy, this adds a penalty term which incentivises this behaviour (in fact, there is a 1-1 correspondence between constrained optimization problems and the corresponding unconstrained version).
Can you implement this? Does this approach work better than the clipped surrogate objective? What values of $\beta$ work best?
Tip - you can calculate KL divergence using the PyTorch KL Divergence function. You could also try the approximate version, as described in detail #12 of the "37 Implementational Details" post.
Long-term replay memory
Above, we discussed the problem of catastrophic forgetting (where the agent forgets how to recover from bad behaviour, because the memory only contains good behaviour). One way to fix this is to have a long-term replay memory, for instance:
- (simple version) You reserve e.g. 10% of your buffer for experiences generated at the start of training.
- (complex version) You design a custom scheduled method for removing experiences from memory, so that you always have a mix of old and new experiences.
Can you implement one of these, and does it fix the catastrophic forgetting problem (without needing to use reward shaping)?
Vectorized Advantage Calculation
Try optimizing away the for-loop in your advantage calculation. It's tricky (and quite messy), so an easier version of this is: find a vectorized calculation and try to explain what it does.
Hint (for your own implementation)
(Assume num_envs=1 for simplicity)
Construct a 2D boolean array from dones, where the (i, j)-th element of the array tells you whether the expression for the i-th advantage function should include rewards / values at timestep j. You can do this via careful use of torch.cumsum, torch.triu, and some rearranging.
Other Discrete Environments
Two environments (supported by gym) which you might like to try are:
Acrobot-v1- this is one of the Classic Control environments, and it's a bit harder to learn than cartpole.MountainCar-v0- this is one of the Classic Control environments, and it's much harder to learn than cartpole. This is primarily because of sparse rewards (it's really hard to get to the top of the hill), so you'll definitely need reward shaping to get through it!LunarLander-v2- this is part of the Box2d environments. It's a bit harder still, because of the added environmental complexity (physics like gravity and friction, and constraints like fuel conservatino). The reward is denser (with the agent receiving rewards for moving towards the landing pad and penalties for moving away or crashing), but the increased complexity makes it overall a harder problem to solve. You might have to perform hyperparameter sweeps to find the best implementation (you can go back and look at the syntax for hyperparameter sweeps here). Also, this page might be a useful reference (although the details of their implementation differs from the one we used today). You can look at the hyperparameters they used.
Continuous Action Spaces & Reward Shaping
The MountainCar-v0 environment has discrete actions, but there's also a version MountainCarContinuous-v0 with continuous action spaces. Implementing this will require a combination of the continuous action spaces you dealt with during the MuJoCo section, and the reward shaping you used during the CartPole exercises.
Choose & build your own environment (e.g. Wordle)
You can also try choosing your own task, framing it as an RL problem, and adapting your PPO algorithm to solve it. For example, training an agent to play Wordle (or a relation like Semantle) might be a suitably difficult task. This post gives a high level account of training an agent to play Wordle - they use DQN, but they don't go too deep into the technical details (and it's possible that PPO would work better for this task).
Minigrid envs / Procgen
There are many more exciting environments to play in, but generally they're going to require more compute and more optimization than we have time for today. If you want to try them out, some we recommend are:
- Minimalistic Gridworld Environments - a fast gridworld environment for experiments with sparse rewards and natural language instruction.
- microRTS - a small real-time strategy game suitable for experimentation.
- Megastep - RL environment that runs fully on the GPU (fast!)
- Procgen - A family of 16 procedurally generated gym environments to measure the ability for an agent to generalize. Optimized to run quickly on the CPU.
- For this one, you might want to read Jacob Hilton's online DL tutorial (the RL chapter suggests implementing PPO on Procgen), and Connor Kissane's solutions.
Multi-Agent PPO
Multi-Agent PPO (MAPPO) is an extension of the standard PPO algorithm which trains multiple agents at once. It was first described in the paper The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games. Can you implement MAPPO?