Chapter 4: Alignment Science
This chapter contains a variety of exercises. Most of them don't fit neatly into either interpretability or evals, but all of them bear directly on real AI safety research. This chapter's grouping was loosely inspired by Anthropic's 'Alignment Science' research thread, which you can read here.
Sections
4.1 Emergent Misalignment
Study emergent misalignment in finetuned models.

4.2 Science of Misalignment
Two case studies in black-box investigation to understand and characterize seemingly misaligned behaviour.

4.3 Interpreting Reasoning Models
Apply interpretability techniques to chain-of-thought reasoning models.

4.4 LLM Psychology & Persona Vectors
Explore persona vectors and psychological properties of language models.

4.5 Investigator Agents
Use AI agents for investigating model behaviours (including petri & bloom).