Chapter 4: Alignment Science
Apply ML research techniques to study, characterize and control the behaviour of powerful AI systems, from investigating misalignment to understanding chain-of-thought reasoning.
Sections
4.1
Emergent Misalignment
Study emergent misalignment in fine-tuned models.
4.2
Science of Misalignment
Two case studies in black-box investigation to understand and characterize seemingly misaligned behaviour.
4.3
Interpreting Reasoning Models
Apply interpretability techniques to chain-of-thought reasoning models.
4.4
LLM Psychology & Persona Vectors
Explore persona vectors and psychological properties of language models.
4.5
Investigator Agents
Use AI agents to investigate model behaviours (including Petri & Bloom).