ARENA - Alignment Research Engineer Accelerator

Chapter 1: Transformer Interpretability

Dive deep into language model interpretability, from linear probes and SAEs to alignment faking and thought anchors.

Sections

• 1.1 Transformers from Scratch
• 1.2 Intro to Mech Interp
• 1.3 Probing and Representations
  • 1.3.1 Probing for Deception
  • 1.3.2 Function Vectors & Model Steering
  • 1.3.3 Interpretability with SAEs
• 1.4 Circuits in LLMs
  • 1.4.1 Indirect Object Identification
  • 1.4.2 SAE Circuits
• 1.5 Toy Models
  • 1.5.1 Balanced Bracket Classifier
  • 1.5.2 Grokking & Modular Arithmetic
  • 1.5.3 OthelloGPT
  • 1.5.4 Superposition & SAEs
• 1.6 Case Studies in Larger Models
  • 1.6.1 Emergent Misalignment
  • 1.6.2 Science of Misalignment
  • 1.6.3 Eliciting Secret Knowledge
  • 1.6.4 Interpreting Reasoning Models