Toy Models

This section explores interpretability through the lens of small models trained on algorithmic tasks. Because of their simplicity and the structured nature of their training data, these models often learn clean, interpretable algorithms that we can fully reverse-engineer.

What You'll Learn

  • Algorithmic Interpretability: Understand how models learn to solve specific computational problems
  • Full Circuit Reverse-Engineering: Break down entire models into understandable components
  • Theoretical Foundations: Connect empirical observations to mathematical understanding of model behaviour

Sections

The exercises in this section are:

  • 1.5.1 Balanced Bracket Classifier: Reverse-engineer a transformer trained to classify balanced vs unbalanced parentheses (a reference implementation of the task itself appears after this list)
  • 1.5.2 Grokking & Modular Arithmetic: Discover how models learn modular addition through discrete Fourier transforms, and investigate the grokking phenomenon (see the Fourier sketch below)
  • 1.5.3 OthelloGPT: Explore emergent world representations in a model trained to predict Othello moves
  • 1.5.4 Superposition & SAEs: Study the mathematical foundations of superposition and train SAEs to recover features from toy models (a minimal SAE sketch follows below)
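
To make the bracket task concrete, here is a minimal reference implementation of the ground-truth algorithm — not the model itself: track a running "elevation" that must never go negative and must end at zero. The exercises reverse-engineer how the transformer's components compute something essentially equivalent to this check.

```python
def is_balanced(s: str) -> bool:
    """Ground-truth balance check for a string of '(' and ')'."""
    elevation = 0
    for ch in s:
        elevation += 1 if ch == "(" else -1
        if elevation < 0:          # a ')' with no matching '(' so far
            return False
    return elevation == 0          # every '(' was eventually closed

assert is_balanced("(())()") and not is_balanced("())(")
```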
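
For the modular arithmetic exercise, the algorithm a grokked model learns can be previewed in a few lines of numpy: logits of the form cos(w·(a + b − c)), summed over a handful of key frequencies, are maximised exactly at c = (a + b) mod p. This is a sketch of the known Fourier solution, not code from the exercises; p = 113 matches the standard grokking setup, but the key frequencies below are arbitrary stand-ins for whichever ones a trained model happens to pick.

```python
import numpy as np

p = 113                              # prime modulus, as in the grokking setup
ks = [3, 17, 42]                     # illustrative key frequencies (model-dependent in practice)
ws = [2 * np.pi * k / p for k in ks]

def fourier_logits(a: int, b: int) -> np.ndarray:
    """Logit for each candidate answer c, summed over key frequencies."""
    c = np.arange(p)
    return sum(np.cos(w * (a + b - c)) for w in ws)

a, b = 57, 91
# Every cosine term equals 1 only when c = (a + b) mod p, so the argmax is correct:
assert fourier_logits(a, b).argmax() == (a + b) % p
```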
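
And as a preview of the SAE exercise, here is a minimal sparse autoencoder sketch in one common formulation (ReLU encoder, linear decoder, MSE reconstruction loss plus an L1 sparsity penalty); the dimensions and sparsity coefficient below are illustrative assumptions, and the exercises may use a variant.

```python
import torch
import torch.nn as nn

class SAE(nn.Module):
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_in, d_hidden) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_hidden))
        self.W_dec = nn.Parameter(torch.randn(d_hidden, d_in) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def forward(self, x: torch.Tensor):
        acts = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)  # sparse feature activations
        recon = acts @ self.W_dec + self.b_dec                         # reconstruction of the input
        return recon, acts

sae = SAE(d_in=64, d_hidden=256)       # illustrative sizes
x = torch.randn(32, 64)                # stand-in for a batch of model activations
recon, acts = sae(x)
loss = ((recon - x) ** 2).mean() + 3e-4 * acts.abs().sum(-1).mean()  # MSE + L1 sparsity
```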

When to Use Toy Models

Toy models are valuable when you want to:

  • Isolate specific phenomena without confounding factors
  • Develop intuitions that may transfer to larger models
  • Test interpretability techniques in controlled settings

However, be mindful that toy models may fail to capture the complexity of real-world models. The art is finding the right balance: too simple and your findings won't generalise; too complex and you lose the ability to fully understand the model.

Prerequisites

You should have completed sections 1.1 and 1.2 before starting this material.