Chapter 1: Transformer Interpretability
Dive deep into language model interpretability, from linear probes and SAEs to circuit analysis and toy models.
Sections
1.1
Transformers from Scratch
Build a transformer from scratch and load pretrained GPT-2 weights.
1.2
Intro to Mech Interp
Learn TransformerLens to extract activations, apply hooks, and find important attention heads.
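To preview the core idea behind hooks: a hook point wraps an activation and runs user functions on it, which can cache or modify it in place. The sketch below is a minimal library-free illustration of that pattern, not TransformerLens's actual API.

```python
class HookPoint:
    """Toy hook point: runs registered functions on an activation.
    A function may return a replacement activation or just observe it."""

    def __init__(self, name):
        self.name = name
        self.hooks = []

    def add_hook(self, fn):
        self.hooks.append(fn)

    def __call__(self, activation):
        for fn in self.hooks:
            out = fn(activation, self)
            if out is not None:
                activation = out  # hook chose to replace the activation
        return activation


cache = {}

def cache_hook(act, hook):
    cache[hook.name] = act  # store the activation for later analysis
    return act

hp = HookPoint("blocks.0.attn.hook_z")
hp.add_hook(cache_hook)
result = hp([0.1, 0.2, 0.3])  # stands in for an attention-head output
```

In TransformerLens itself, hook points with names like this are threaded through every sublayer of the model, so the same pattern gives you both caching and interventions.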
1.3.1
Linear Probes
Train linear probes to detect deception in a model playing the game Coup.
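A linear probe is just a linear classifier trained on (activation, label) pairs; if it works, the concept is linearly represented at that layer. Here is a self-contained numpy sketch on synthetic "activations" where a hidden direction encodes a binary label (the deception setup in the section uses real model activations instead).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 8
labels = rng.integers(0, 2, n)               # binary concept, e.g. deceptive vs honest
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
# Synthetic activations: isotropic noise plus +/- 2 units along the concept direction
X = rng.normal(size=(n, d)) + np.outer(2 * labels - 1, direction) * 2.0

# Train a logistic-regression probe with plain gradient descent
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    grad = p - labels
    w -= 0.1 * X.T @ grad / n
    b -= 0.1 * grad.mean()

acc = ((X @ w + b > 0) == labels).mean()
alignment = abs(w @ direction) / np.linalg.norm(w)  # probe recovers the direction
```

The learned weight vector doubles as an interpretable artifact: its direction is the probe's estimate of how the concept is encoded.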
1.3.2
Function Vectors & Model Steering
Steer model behaviour using activation interventions and the nnsight library.
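The simplest steering intervention adds a fixed vector to the residual stream during a forward pass, often computed as a difference of mean activations between two behaviours. This is a library-free numpy sketch of that recipe; in the section you do the same thing inside a real model with nnsight.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
# Hypothetical cached activations from prompts eliciting two behaviours
pos_acts = rng.normal(loc=1.0, size=(50, d))
neg_acts = rng.normal(loc=-1.0, size=(50, d))
steering_vec = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)  # difference-of-means vector

def intervene(activation, vec, alpha=1.0):
    """Add the scaled steering vector to an activation mid-forward-pass."""
    return activation + alpha * vec

neutral = rng.normal(size=d)
steered = intervene(neutral, steering_vec, alpha=0.5)
```

The scalar `alpha` trades off steering strength against coherence of the model's output, and is typically tuned by hand.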
1.3.3
Interpretability with SAEs
Use SAEs to decompose LLM activation space, monitor model cognition, and steer behaviour.
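An SAE maps an activation into an overcomplete, non-negative feature basis and reconstructs it; training balances reconstruction error against an L1 sparsity penalty. A minimal forward pass with random (untrained) weights, just to fix the shapes and the two loss terms:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_sae = 8, 32                      # overcomplete: more features than dimensions
W_enc = rng.normal(size=(d_model, d_sae)) * 0.1
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) * 0.1
b_dec = np.zeros(d_model)

def sae_forward(x):
    feats = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU feature activations
    recon = feats @ W_dec + b_dec
    return feats, recon

x = rng.normal(size=d_model)                # stands in for a residual-stream activation
feats, recon = sae_forward(x)
l1 = np.abs(feats).sum()                    # sparsity term of the training loss
mse = ((x - recon) ** 2).mean()             # reconstruction term
```

After training, individual columns of `W_dec` are the candidate "feature directions" you inspect, monitor, and steer with.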
1.3.4
Activation Oracles
Implement activation oracles to reveal hidden knowledge and uncover forward predictions.
1.4.1
Indirect Object Identification
Reverse-engineer the IOI circuit in GPT-2 small following 'Interpretability in the Wild'.
1.4.2
SAE Circuits
Apply SAEs to circuit analysis, decomposing computations and tracing features through layers.
1.5.1
Balanced Bracket Classifier
Reverse-engineer the algorithm learned by a bracket-balancing transformer.
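For reference, the task itself has a simple left-to-right depth-counting solution; the exercises ask how a transformer, which processes all positions in parallel, implements an equivalent computation.

```python
def is_balanced(s: str) -> bool:
    """A string of '(' and ')' is balanced iff the running nesting depth
    never goes negative and ends at exactly zero."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0
```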
1.5.2
Grokking & Modular Arithmetic
Discover Fourier circuits in modular arithmetic models and observe grokking in action.
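The Fourier algorithm these models are reported to learn rests on an exact identity: summing cos(2πk·m/p) over k gives p when m ≡ 0 (mod p) and 0 otherwise, so logits built from Fourier terms peak precisely at c = (a + b) mod p. A direct numpy check of that identity:

```python
import numpy as np

p = 13
k = np.arange(p)

def mod_add_via_fourier(a: int, b: int) -> int:
    """Compute (a + b) mod p using only cosine logits:
    logit(c) = sum_k cos(2*pi*k*(a + b - c)/p), maximal iff c = (a+b) mod p."""
    c = np.arange(p)
    logits = np.cos(2 * np.pi * np.outer(k, a + b - c) / p).sum(axis=0)
    return int(np.argmax(logits))
```

The trained network approximates these cosine terms with its embeddings and MLP; the exercises recover them by taking Fourier transforms of the model's weights and activations.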
1.5.3
OthelloGPT
Investigate emergent world representations in a GPT model trained on Othello games.
1.5.4
Superposition & SAEs
Replicate Anthropic's superposition paper and train SAEs to recover features.
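One intuition behind superposition can be checked in a few lines: in high dimensions, far more random unit vectors than dimensions are pairwise nearly orthogonal, so a layer can represent more features than it has neurons at the cost of small interference.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_features = 100, 300                 # 3x more "features" than dimensions
feats = rng.normal(size=(n_features, d))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)

cos = feats @ feats.T                    # pairwise cosine similarities
np.fill_diagonal(cos, 0.0)
max_interference = np.abs(cos).max()
mean_interference = np.abs(cos).mean()
```

Even with 300 features packed into 100 dimensions, typical interference stays small, which is exactly the regime where sparse features can share a layer and an SAE can later pull them apart.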
Monthly Algorithmic Problems
Seven algorithmic challenges to test your interpretability skills in a hackathon format.