Toy Models
This section explores interpretability through the lens of small models trained on algorithmic tasks. Because of their simplicity and the structured nature of their training data, these models often learn clean, interpretable algorithms that we can fully reverse-engineer.
What You'll Learn
- Algorithmic Interpretability: Understand how models learn to solve specific computational problems
- Full Circuit Reverse-Engineering: Break down entire models into understandable components
- Theoretical Foundations: Connect empirical observations to mathematical understanding of model behaviour
Sections
The exercises in this section are:
- 1.5.1 Balanced Bracket Classifier: Reverse-engineer a transformer trained to classify balanced vs unbalanced parentheses (the ground-truth rule is sketched just after this list)
- 1.5.2 Grokking & Modular Arithmetic: Discover how models learn modular addition through discrete Fourier transforms, and investigate the grokking phenomenon (see the Fourier sketch below)
- 1.5.3 OthelloGPT: Explore emergent world representations in a model trained to predict Othello moves (see the linear-probe sketch below)
- 1.5.4 Superposition & SAEs: Study the mathematical foundations of superposition and train SAEs to recover features from toy models (see the SAE sketch below)
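To make the bracket task concrete, here is a minimal reference implementation of the ground-truth rule the classifier has to learn: a sequence is balanced iff the running bracket depth never goes negative and ends at exactly zero. The function name and examples are illustrative only, not part of the exercise code.

```python
def is_balanced(brackets: str) -> bool:
    """Ground-truth labeller: balanced iff the running depth never goes
    negative and finishes at exactly zero."""
    depth = 0
    for char in brackets:
        depth += 1 if char == "(" else -1
        if depth < 0:          # a ")" closed a bracket that was never opened
            return False
    return depth == 0          # every "(" must be closed by the end


print(is_balanced("(())()"))   # True
print(is_balanced("())("))     # False: depth dips below zero
```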
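The grokking exercises find that the trained network represents its inputs at a handful of Fourier frequencies and combines them via trig identities such as cos(w(a+b)) = cos(wa)cos(wb) − sin(wa)sin(wb). The sketch below is only a numerical illustration of why a Fourier basis solves the task, not the exercise code: summing cos(w_k(a + b − c)) over frequencies is maximised exactly at c = (a + b) mod p. The real model concentrates on a few key frequencies; here we sum over all of them for simplicity.

```python
import numpy as np

p = 113                      # modulus used in the grokking exercises
a, b = 27, 95                # an arbitrary example input pair

ks = np.arange(1, p)         # all nonzero frequencies (the model uses only a few)
w = 2 * np.pi * ks / p
c = np.arange(p)             # candidate answers

# logit(c) = sum_k cos(w_k * (a + b - c)); this peaks when (a + b - c) = 0 mod p
logits = np.cos(np.outer(a + b - c, w)).sum(axis=1)

print(logits.argmax(), (a + b) % p)   # both print 9
```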
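The usual evidence for an "emergent world representation" in OthelloGPT is a linear probe: a linear map from residual-stream activations to the state of every board square. Below is a minimal, self-contained sketch of that idea; the shapes, the three-state square encoding, and the random tensors are placeholder assumptions for illustration, not the exercise's actual model or data.

```python
import torch
import torch.nn as nn

# Placeholder shapes and data: resid stands in for residual-stream activations
# at some layer, board for the true state of each of the 64 squares
# (e.g. 0 = empty, 1 = mine, 2 = theirs in one common encoding).
d_model, n_squares, n_states = 512, 64, 3
resid = torch.randn(1024, d_model)
board = torch.randint(0, n_states, (1024, n_squares))

probe = nn.Linear(d_model, n_squares * n_states)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(100):
    logits = probe(resid).view(-1, n_squares, n_states)
    loss = nn.functional.cross_entropy(
        logits.flatten(0, 1), board.flatten()   # one classification per square
    )
    opt.zero_grad()
    loss.backward()
    opt.step()

# High probe accuracy on held-out positions would be evidence that the board
# state is linearly represented in the residual stream.
```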
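Finally, "train SAEs to recover features" refers to fitting a sparse autoencoder on a model's activations: reconstruct them through an overcomplete ReLU hidden layer while penalising the L1 norm of the hidden activations. One common parameterisation is sketched below; the dimensions, sparsity coefficient, and random data are placeholders rather than the exercise's settings.

```python
import torch
import torch.nn as nn

class SAE(nn.Module):
    """Minimal sparse autoencoder: encode, ReLU, decode."""
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_in, d_hidden) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_hidden))
        self.W_dec = nn.Parameter(torch.randn(d_hidden, d_in) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def forward(self, x):
        acts = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        recon = acts @ self.W_dec + self.b_dec
        return recon, acts

sae = SAE(d_in=2, d_hidden=5)     # tiny dims, in the spirit of the toy models
x = torch.randn(64, 2)            # placeholder activations to reconstruct

# Loss = reconstruction error + L1 sparsity penalty on the hidden activations
recon, acts = sae(x)
loss = ((recon - x) ** 2).mean() + 1e-3 * acts.abs().sum(dim=-1).mean()
loss.backward()
```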
When to Use Toy Models
Toy models are valuable when you want to:
- Isolate specific phenomena without confounding factors
- Develop intuitions that may transfer to larger models
- Test interpretability techniques in controlled settings
However, be mindful that toy models may fail to capture the complexity of real-world models. The art lies in finding the right balance: too simple and your findings won't generalise; too complex and you lose the ability to fully understand the model.
Prerequisites
You should have completed sections 1.1 and 1.2 before starting this material.