Probing and Representations
This section explores how language models represent and use information internally. You'll learn techniques for probing model internals, understanding how features are represented, and steering model behaviour.
What You'll Learn
- Linear Probing: Train simple classifiers to detect concepts in model activations, and use this to understand what models "know" at different layers (see the first sketch after this list)
- Function Vectors: Discover how models encode task-specific behaviour, and extract vectors that can transfer that behaviour between contexts (see the steering sketch after this list)
- SAE-based Interpretability: Use Sparse Autoencoders to decompose model activations into interpretable features, and explore tools like SAELens and Neuronpedia
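To make the linear-probing idea concrete, here is a minimal sketch: cache hidden-state activations at one layer, then fit a logistic-regression probe to predict a binary concept label. The model name, layer index, prompts, and labels below are placeholder assumptions for illustration; the exercises supply their own dataset and model.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

# Placeholder choices -- the exercises use their own model, layer, and data.
MODEL_NAME = "gpt2"   # assumption: any HF model with hidden states works
LAYER = 6             # assumption: probe a middle layer
prompts = ["The movie was great.", "The movie was awful."]  # toy examples
labels = [1, 0]       # 1 = concept present, 0 = absent

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

features = []
with torch.no_grad():
    for text in prompts:
        inputs = tokenizer(text, return_tensors="pt")
        hidden = model(**inputs).hidden_states[LAYER]  # (1, seq, d_model)
        features.append(hidden[0, -1].numpy())         # last-token activation

# A linear probe is just a logistic regression fit on frozen activations.
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("train accuracy:", probe.score(features, labels))
```

Repeating this at every layer and comparing probe accuracy is the usual way to ask *where* a concept becomes linearly decodable.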
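Function-vector and steering experiments share one mechanical ingredient: adding a vector to the residual stream at inference time via a forward hook. The sketch below shows only that mechanic, under stated assumptions; the steering vector here is random for illustration, whereas the exercises extract it from activations on task demonstrations, and the model, layer, and scale are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # assumption: the exercises may use a different model
LAYER = 6             # assumption: layer to intervene on
SCALE = 4.0           # assumption: steering strength

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

d_model = model.config.hidden_size
# Placeholder steering vector -- in practice this is derived from activations,
# e.g. averaged over in-context examples of the task.
steering_vector = torch.randn(d_model)
steering_vector = steering_vector / steering_vector.norm()

def add_steering_vector(module, inputs, output):
    # The block may return a tensor or a tuple whose first element is the
    # residual stream, depending on the transformers version.
    if isinstance(output, tuple):
        hidden = output[0] + SCALE * steering_vector.to(output[0].dtype)
        return (hidden,) + output[1:]
    return output + SCALE * steering_vector.to(output.dtype)

handle = model.transformer.h[LAYER].register_forward_hook(add_steering_vector)
try:
    inputs = tokenizer("The capital of France is", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=10, do_sample=False)
    print(tokenizer.decode(out[0]))
finally:
    handle.remove()  # always remove hooks so later runs are unaffected
```

Comparing generations with and without the hook (and at different scales) is the basic steering experiment.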
Sections
The exercises in this section are:
- 1.3.1 Probing for Deception: Train linear probes to detect deceptive behaviour in language models
- 1.3.2 Function Vectors & Model Steering: Extract and apply function vectors to steer model outputs
- 1.3.3 Interpretability with SAEs: Hands-on exploration of sparse autoencoders for finding interpretable features (a toy SAE sketch follows this list)
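The core operation behind SAE-based interpretability is decomposing an activation vector into a sparse, overcomplete set of feature activations and reconstructing it from them. The toy sketch below implements that encode/decode step in plain PyTorch; it is illustrative only (random weights, made-up dimensions and expansion factor), whereas the exercises load pretrained SAEs through SAELens and browse their features on Neuronpedia rather than training anything from scratch.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: d_model activations -> n_features sparse codes -> reconstruction."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, n_features) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU keeps feature activations non-negative and, after training, sparse.
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return f @ self.W_dec + self.b_dec

    def forward(self, x: torch.Tensor):
        f = self.encode(x)
        return self.decode(f), f

# Toy usage: decompose (batch, d_model) activations into sparse features.
d_model, n_features = 768, 16 * 768      # assumption: 16x expansion factor
sae = SparseAutoencoder(d_model, n_features)
acts = torch.randn(4, d_model)           # stand-in for real model activations
recon, feats = sae(acts)

# Training would minimise reconstruction error plus an L1 sparsity penalty.
loss = (recon - acts).pow(2).mean() + 1e-3 * feats.abs().mean()
print("loss:", loss.item(),
      "active features per example:", (feats > 0).float().sum(-1).mean().item())
```

In the exercises, the interesting work happens after this step: inspecting which sparse features fire on which inputs, and what they appear to represent.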
Prerequisites
You should have completed sections 1.1 and 1.2 before starting this material.