ARENA - Alignment Research Engineer Accelerator

Chapter 1: Transformer Interpretability

Dive deep into language model interpretability, from linear probes and SAEs to alignment faking and thought anchors.

Sections

• 1.1 Transformers from Scratch
• 1.2 Intro to Mech Interp
• 1.3 Probing and Representations
  • 1.3.1 Probing for Deception
  • 1.3.2 Function Vectors & Model Steering
  • 1.3.3 Interpretability with SAEs
• 1.4 Circuits in LLMs
  • 1.4.1 Indirect Object Identification
  • 1.4.2 SAE Circuits
• 1.5 Toy Models
  • 1.5.1 Balanced Bracket Classifier
  • 1.5.2 Grokking & Modular Arithmetic
  • 1.5.3 OthelloGPT
  • 1.5.4 Superposition & SAEs
• 1.6 Case Studies in Larger Models
  • 1.6.1 Emergent Misalignment
  • 1.6.2 Science of Misalignment
  • 1.6.3 Eliciting Secret Knowledge
  • 1.6.4 Interpreting Reasoning Models