Exercise Status: All exercises complete and verified

[4.4] LLM Psychology & Persona Vectors

Colab: exercises | solutions

Please report any problems / bugs in the #errata channel in the Slack group, and ask any questions in the dedicated channels for this chapter of material.

If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme.

Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.

Introduction

Most exercises in this chapter have dealt with LLMs at quite a low level of abstraction: as mechanisms to perform certain tasks (e.g. indirect object identification, in-context antonym learning, or algorithmic tasks like predicting legal Othello moves). But if we want to study characteristics that have alignment relevance, we need a higher level of abstraction. LLMs often exhibit "personas" that can shift unexpectedly, sometimes dramatically (see Sydney, Grok's "MechaHitler" persona, or AI-induced psychosis). These personalities are clearly shaped through training and prompting, but exactly why remains a mystery.

In this section, we'll explore one approach for studying these kinds of LLM behaviours: model psychiatry. This sits at the intersection of evals (behavioural observation) and mechanistic interpretability (understanding internal representations / mechanisms). We aim to use interp tools to understand and intervene on behavioural traits.

The main focus is two papers from Anthropic. First, we'll replicate results from The Assistant Axis, which studies the "persona space" in internal model activations and situates the "Assistant persona" within it. The paper also introduces activation capping, which identifies the normal range of activation intensity along this "Assistant Axis" and caps the model's activations whenever they would otherwise exceed it, reducing susceptibility to persona-based jailbreaks. Then we'll move to Persona Vectors, which predates the Assistant Axis paper but is broader and more methodologically sophisticated, proposing an automated pipeline for identifying persona vectors corresponding to specific kinds of undesirable personality shifts.

This is (compared to many other sections in this chapter) very recent work, with many uncertainties and open questions. We'll suggest bonus exercises and areas for further reading as we go.

Content & Learning Objectives

1️⃣ Mapping Persona Space

You'll start by understanding the core methodology from the Assistant Axis paper. You'll load Gemma 27B with activation caching utilities, and extract vectors corresponding to several different personas spanning from "helpful" to "fantastical" (the core extraction step is sketched just after the objectives below).

Learning Objectives
  • Understand the persona space mapping explored by the Assistant Axis paper
  • Given a persona name, generate a system prompt and collect responses to a diverse set of questions, to extract a mean activation vector for that persona
  • Briefly study the geometry of these persona vectors using PCA and cosine similarity
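
To make the extraction step concrete, here's a minimal sketch (not the exercise solution). The helper get_response_acts is hypothetical: it stands in for the activation caching utilities you'll build, returning residual-stream activations at one layer for a single persona-prompted response.

import torch as t
from torch import Tensor


def extract_persona_vector(responses: list[str], layer: int) -> Tensor:
    """Mean activation vector for one persona: average over response tokens,
    then over responses. `get_response_acts` is a hypothetical helper that
    returns a [seq_len, d_model] tensor of residual-stream activations."""
    per_response_means = []
    for resp in responses:
        acts = get_response_acts(resp, layer)  # [seq_len, d_model] (hypothetical)
        per_response_means.append(acts.mean(dim=0))  # average over response tokens
    return t.stack(per_response_means).mean(dim=0)  # average over responses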

2️⃣ Steering along the Assistant Axis

Now that you've extracted these persona vectors, you should be able to use the Assistant Axis to detect drift and intervene via activation capping. As case studies, we'll use transcripts from the assistant-axis repo showing conversations where models exhibit harmful persona drift (delusion validation, jailbreaks, self-harm). By the end of this section, you should be able to steer to mitigate these personality shifts without kneecapping model capabilities (the capping operation is sketched just after the objectives below).

Learning Objectives
  • Steer towards directions you found in the previous section, to increase model willingness to adopt alternative personas
  • Understand how to use the Assistant Axis to detect drift and intervene via activation capping
  • Apply this technique to mitigate personality shifts in AI models (measuring the harmful response rate with / without capping)
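
Here's a minimal sketch of the capping operation as a PyTorch forward hook, assuming a unit-norm axis vector (matching model dtype/device) and a sign convention where larger projections mean more drift. The exercises derive the actual cap value from the empirical projection range; the layer index below is an arbitrary placeholder.

import torch as t
from torch import Tensor


def make_capping_hook(axis: Tensor, cap: float):
    """Forward hook that caps the residual stream's projection onto `axis`
    (assumed unit-norm) at `cap`. Directions orthogonal to the axis pass
    through unchanged."""

    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        proj = resid @ axis  # [batch, seq] projection coefficients
        excess = (proj - cap).clamp(min=0.0)  # how far each position exceeds the cap
        resid = resid - excess.unsqueeze(-1) * axis  # remove only the excess component
        return (resid, *output[1:]) if isinstance(output, tuple) else resid

    return hook


# Hypothetical usage on a HuggingFace model (layer 20 is an arbitrary choice):
# handle = model.model.layers[20].register_forward_hook(make_capping_hook(axis, cap=5.0))
# ... generate as normal, then: handle.remove()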

3️⃣ Contrastive Prompting

Here, we move on to the Persona Vectors paper, shifting from global persona structure to surgical, trait-specific vectors and exploring how to extract these vectors using contrastive prompt pairs (the extraction step is sketched just after the objectives below).

Learning Objectives
  • Understand the automated artifact pipeline for extracting persona vectors using contrastive prompts
  • Implement this pipeline (including autoraters for trait scoring) to extract "sycophancy" steering vectors
  • Learn how to identify the best layers for trait extraction
  • Interpret these sycophancy vectors using Gemma sparse autoencoders
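
As a preview of the contrastive method, here's a difference-of-means sketch. The helper mean_response_acts is hypothetical: it stands in for generating a response to a question under a given system prompt and returning the mean response-token activation at one layer.

import torch as t
from torch import Tensor


def extract_trait_vector(
    pos_prompts: list[str], neg_prompts: list[str], questions: list[str], layer: int
) -> Tensor:
    """Trait direction as a difference of means between trait-eliciting and
    trait-suppressing system prompts, averaged over a shared question set."""
    # `mean_response_acts` is a hypothetical stand-in for your pipeline's
    # generate-and-cache step (see lead-in above).
    pos = t.stack([mean_response_acts(sp, q, layer) for sp in pos_prompts for q in questions])
    neg = t.stack([mean_response_acts(sp, q, layer) for sp in neg_prompts for q in questions])
    return pos.mean(dim=0) - neg.mean(dim=0)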

4️⃣ Steering with Persona Vectors

Finally, you'll validate your extracted trait vectors through steering as well as projection-based monitoring (the latter is sketched just after the objectives below).

Learning Objectives
  • Complete your artifact pipeline by implementing persona steering
  • Repeat this full pipeline for "hallucination" and "evil", as well as for any additional traits you choose to study
  • Study the geometry of trait vectors
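
Projection-based monitoring itself is simple once you have a trait vector. A minimal sketch, using one reasonable scoring convention (mean projection of response-token activations onto the unit-norm trait vector):

import torch as t
from jaxtyping import Float
from torch import Tensor


def trait_score(acts: Float[Tensor, "seq d_model"], trait_vec: Tensor) -> float:
    """Score a response by the mean projection of its activations onto the
    normalized trait vector; higher scores suggest stronger trait expression."""
    direction = trait_vec / trait_vec.norm()
    return (acts @ direction).mean().item()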

Reading Material

The exercises are built around two Anthropic papers. Read at least the introductions and methodology sections of both blog posts before starting. Chronologically, the Persona Vectors paper came first, so you might want to read it first (we present the exercises in this order only because most of the infrastructure we build for the persona vectors work sits directly on top of the assistant axis exercises).

  • The Assistant Axis by Lu et al. (Anthropic). Discovers a single internal direction capturing how "assistant-like" a model is behaving, and shows it can detect and mitigate persona drift via activation capping. Sections 1-2 of the exercises replicate this. Read the full blog post - it's concise and well-illustrated. (There is also a paper with additional technical detail.)
  • Persona Vectors: Monitoring and controlling character traits in language models by Chen et al. (Anthropic). Proposes an automated pipeline for extracting trait-specific vectors (sycophancy, hallucination, evil) using contrastive prompting, and validates them via steering and projection-based monitoring. Sections 3-4 of the exercises replicate this. Read the blog post; the methodology section and individual trait results are the most relevant parts. (There is also a paper with more detail.)
  • The Persona Selection Model by Sam Marks (Anthropic, 2026). Provides a theoretical framework for understanding why persona vectors and the Assistant Axis exist: LLMs learn to simulate diverse personas during pre-training, and post-training selects/refines an "Assistant" persona whose traits then govern behaviour. The blog post directly cites both papers above as key evidence. Not needed for the exercises themselves, but an interesting model which connects this work to possible future alignment efforts.
  • Simulators by Janus (2022). An earlier articulation of some related ideas; it proposes a new category, the simulator, which fits LLMs better than traditional framings like oracles or RL-based maximizers.

Setup code

Before running the rest of the code, you'll need to clone the assistant-axis repo, which contains transcripts and utilities from the paper. Make sure you're cloning it inside the chapter4_alignment_science/exercises directory.

cd chapter4_alignment_science/exercises
git clone https://github.com/safety-research/assistant-axis.git

Sections 3-4 also require the persona_vectors directory (containing trait data and evaluation prompts) to be present in the same exercises directory.

Once you've done this, run the rest of the setup code:

import gc
import json
import os
import re
import sys
import textwrap
import time
import warnings
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Any

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import torch as t
import torch.nn.functional as F
from dotenv import load_dotenv
from huggingface_hub import hf_hub_download, login, snapshot_download
from IPython.display import HTML, display
from jaxtyping import Float
from openai import OpenAI
from sklearn.decomposition import PCA
from torch import Tensor
from tqdm.notebook import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

t.set_grad_enabled(False)

# Make sure exercises are in the path
chapter = "chapter4_alignment_science"
section = "part4_persona_vectors"
root_dir = next(p for p in Path.cwd().parents if (p / chapter).exists())
exercises_dir = root_dir / chapter / "exercises"
section_dir = exercises_dir / section
if str(exercises_dir) not in sys.path:
    sys.path.append(str(exercises_dir))

import part4_persona_vectors.tests as tests
import part4_persona_vectors.utils as utils

warnings.filterwarnings("ignore")

device = t.device("cuda" if t.cuda.is_available() else "cpu")
dtype = t.bfloat16

MAIN = __name__ == "__main__"


def print_with_wrap(s: str, width: int = 80):
    """Print text with line wrapping, preserving newlines."""
    out = []
    for line in s.splitlines(keepends=False):
        out.append(textwrap.fill(line, width=width) if line.strip() else line)
    print("\n".join(out))

Verify the repo is cloned and see what transcripts are available:

assistant_axis_path = Path.cwd() / "assistant-axis"
assert assistant_axis_path.exists(), "Please clone the assistant-axis repo (see instructions above)"

transcript_dir = assistant_axis_path / "transcripts"
case_study_files = sorted(transcript_dir.glob("case_studies/**/*.json"))
drift_files = sorted(transcript_dir.glob("persona_drift/*.json"))
print(f"Found {len(case_study_files)} case study transcripts, {len(drift_files)} persona drift transcripts")

# Show available transcripts
for f in case_study_files:
    data = json.loads(f.read_text())
    print(f"  Case study: {f.parent.name}/{f.stem} ({data.get('turns', '?')} turns, model={data.get('model', '?')})")
for f in drift_files:
    data = json.loads(f.read_text())
    print(f"  Persona drift: {f.stem} ({data.get('turns', '?')} turns, model={data.get('model', '?')})")

We'll use the OpenRouter API for generating responses from models like Gemma 27B and Qwen 32B (this is faster than running locally for long generations, and we'll use the local model for activation extraction / steering).

Before running the cell below, you'll need to create a .env file in chapter4_alignment_science/exercises and add your OpenRouter API key (or, if you're working in Colab, you might want to edit the cell below to just set it directly via os.environ["OPENROUTER_API_KEY"] = ...).

env_path = Path.cwd() / ".env"
assert env_path.exists(), "Please create a .env file with your API keys"

load_dotenv(dotenv_path=str(env_path))

OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
assert OPENROUTER_API_KEY, "Please set OPENROUTER_API_KEY in your .env file"

openrouter_client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=OPENROUTER_API_KEY,
)
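
As a quick sanity check that the client is configured correctly, you can request a short completion. The model slug below is an assumption; check openrouter.ai/models for the exact identifier you want to use.

response = openrouter_client.chat.completions.create(
    model="google/gemma-3-27b-it",  # assumed slug; verify on openrouter.ai/models
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
    max_tokens=32,
)
print(response.choices[0].message.content)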