Chapter 1: Transformer Interpretability
Dive deep into language model interpretability, from linear probes and SAEs to alignment faking and thought anchors.
Sections
1.1 Transformers from Scratch
1.2 Intro to Mech Interp
1.3.1 Probing for Deception
1.3.2 Function Vectors & Model Steering
1.3.3 Interpretability with SAEs
1.4.1 Indirect Object Identification
1.4.2 SAE Circuits
1.5.1 Balanced Bracket Classifier
1.5.2 Grokking & Modular Arithmetic
1.5.3 OthelloGPT
1.5.4 Superposition & SAEs
1.6.1 Emergent Misalignment
1.6.2 Science of Misalignment
1.6.3 Eliciting Secret Knowledge
1.6.4 Interpreting Reasoning Models