
    Chapter 4: Alignment Science

    This chapter contains a variety of different exercises. Most of them don't fit neatly into either interpretability or evals, but all of them bear directly on real AI safety research. This chapter's grouping was loosely inspired by Anthropic's 'Alignment Science' research thread, which you can read here.

    Sections

    4.1 Emergent Misalignment: Study emergent misalignment in finetuned models.
    4.2 Science of Misalignment: Two case studies in black-box investigation to understand and characterize seemingly misaligned behaviour.
    4.3 Interpreting Reasoning Models: Apply interpretability techniques to chain-of-thought reasoning models.
    4.4 LLM Psychology & Persona Vectors: Explore persona vectors and psychological properties of language models.
    4.5 Investigator Agents: Use AI agents for investigating model behaviours (including petri & bloom).

    ARENA - Alignment Research Engineer Accelerator