Chapter 4: Alignment Science
This chapter contains a variety of exercises. Most of them don't fit neatly into either interpretability or evals, but all of them bear directly on real AI safety research. This chapter's grouping was loosely inspired by Anthropic's 'Alignment Science' research thread, which you can read here.
Sections
4.1 Emergent Misalignment
Study emergent misalignment in finetuned models.

4.2 Science of Misalignment
Two case studies in black-box investigation to understand and characterize seemingly misaligned behaviour.

4.3 Interpreting Reasoning Models
Apply interpretability techniques to chain-of-thought reasoning models.

4.4 LLM Psychology & Persona Vectors
Explore persona vectors and psychological properties of language models.

4.5 Investigator Agents
Use AI agents for investigating model behaviours (including petri & bloom).