ARENA

    Chapter 4: Alignment Science

    Apply ML research techniques to study, characterize and control the behaviour of powerful AI systems, from investigating misalignment to understanding chain-of-thought reasoning.

    Sections

    4.1 Emergent Misalignment: Study emergent misalignment in finetuned models.
    4.2 Science of Misalignment: Two case studies in black-box investigation to understand and characterize seemingly misaligned behaviour.
    4.3 Interpreting Reasoning Models: Apply interpretability techniques to chain-of-thought reasoning models.
    4.4 LLM Psychology & Persona Vectors: Explore persona vectors and psychological properties of language models.
    4.5 Investigator Agents: Use AI agents to investigate model behaviours (including Petri & Bloom).

    ARENA - Alignment Research Engineer Accelerator