Chapter 4: Alignment Science

Apply ML research techniques to study, characterize and control the behaviour of powerful AI systems, from investigating misalignment to understanding chain-of-thought reasoning.

Sections

4.1 Emergent Misalignment: Study emergent misalignment in finetuned models.

4.2 Science of Misalignment: Two case studies in black-box investigation to understand and characterize seemingly misaligned behaviour.

4.3 Interpreting Reasoning Models: Apply interpretability techniques to chain-of-thought reasoning models.

4.4 LLM Psychology & Persona Vectors: Explore persona vectors and psychological properties of language models.

4.5 Investigator Agents: Use AI agents to investigate model behaviours (including Petri and Bloom).
