Exercise status: first half (Assistant Axis replication) complete; second half (persona vectors) not yet done

[4.4] LLM Psychology & Persona Vectors

Colab: exercises | solutions

Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material.

If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme.

Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.

Introduction

Most exercises in this chapter have dealt with LLMs at quite a low level of abstraction: as mechanisms to perform certain tasks (e.g. indirect object identification, in-context antonym learning, or algorithmic tasks like predicting legal Othello moves). However, if we want to study the characteristics of current LLMs which might have alignment relevance, we need a higher level of abstraction. LLMs often exhibit "personas" that can shift unexpectedly, sometimes dramatically (see Sydney, Grok's "MechaHitler" persona, or AI-induced psychosis). These personas are clearly shaped through training and prompting, but exactly how and why remains a mystery.

In this section, we'll explore one approach for studying these kinds of LLM behaviours - model psychiatry. This sits at the intersection of evals (behavioural observation) and mechanistic interpretability (understanding internal representations / mechanisms). We aim to use interp tools to understand & intervene on behavioural traits.

The main focus will be on two papers from Anthropic. First, we'll replicate results from The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models, which studies the "persona space" in internal model activations and situates the "Assistant persona" within that space. The paper also introduces activation capping: a method which identifies the normal range of activation intensity along this "Assistant Axis" and caps the model's activations when they would otherwise exceed it, reducing the model's susceptibility to persona-based jailbreaks. Then we'll move to Persona Vectors: Monitoring and Controlling Character Traits in Language Models, which predates the Assistant Axis paper but is broader and more methodologically sophisticated: it proposes an automated pipeline for identifying persona vectors corresponding to specific kinds of undesirable personality shifts.
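To make "activation capping" concrete before we implement it properly later: the idea is to clip the component of the residual stream along the Assistant Axis when it exceeds a threshold, leaving everything orthogonal to the axis untouched. Here's a minimal sketch (the function name, toy tensors, and threshold are illustrative, not taken from the paper's code):

```python
import torch


def cap_activations(resid: torch.Tensor, axis: torch.Tensor, cap: float) -> torch.Tensor:
    """Cap the projection of residual-stream activations onto `axis` at `cap`.

    resid: (batch, seq, d_model) activations; axis: unit vector of shape (d_model,).
    Any component along `axis` exceeding `cap` is pulled back to exactly `cap`;
    components below the cap, and everything orthogonal to `axis`, are unchanged.
    """
    proj = resid @ axis                       # (batch, seq) scalar projections
    excess = torch.clamp(proj - cap, min=0.0)  # how far above the cap (0 if below)
    return resid - excess[..., None] * axis    # subtract only the excess component


# Toy check: a vector with projection 5.0 along the axis gets capped to 2.0
axis = torch.zeros(4)
axis[0] = 1.0
resid = torch.tensor([[[5.0, 1.0, 0.0, 0.0]]])
capped = cap_activations(resid, axis, cap=2.0)
```

In the real setting this would run inside a forward hook during generation, with `axis` being the Assistant Axis direction and `cap` chosen from the empirical range of projections on normal assistant-like conversations.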

This section is (compared to many others in this chapter) very recent work, and there are still many uncertainties and unanswered questions! We'll suggest several bonus exercises or areas for further reading / exploration as we move through these exercises.

Content & Learning Objectives

1️⃣ Mapping Persona Space

You'll start by understanding the core methodology from the Assistant Axis paper. You'll load Gemma 27B with activation caching utilities, and extract vectors corresponding to several different personas spanning from "helpful" to "fantastical".

Learning Objectives
  • Understand the persona space mapping explored by the Assistant Axis paper
  • Given a persona name, generate a system prompt and collect responses to a diverse set of questions, to extract a mean activation vector for that persona
  • Briefly study the geometry of these persona vectors using PCA and cosine similarity
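As a preview of the second objective, here is a hedged sketch of collecting a mean activation vector with a forward hook. A toy `nn.Linear` stack stands in for Gemma, and `LAYER`, the hook plumbing, and the random "responses" are all illustrative:

```python
import torch
from torch import nn

# Toy stand-in for a transformer: we hook one "layer" and average its outputs.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
LAYER = 0

acts: list[torch.Tensor] = []


def caching_hook(module, inputs, output):
    acts.append(output.detach())  # save (n_tokens, d_model) activations


handle = model[LAYER].register_forward_hook(caching_hook)
with torch.no_grad():
    for response in torch.randn(3, 4, 8):  # 3 "responses" of 4 "tokens" each
        model(response)
handle.remove()

# Mean over all responses & token positions -> one vector per persona
persona_vector = torch.cat(acts).mean(dim=0)
```

With a real model you'd run this once per persona (each with its own system prompt and a shared question set), giving one vector per persona to compare with PCA and cosine similarity.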

2️⃣ Steering along the Assistant Axis

Now that you've extracted these persona vectors, you should be able to use the Assistant Axis to detect drift and intervene via activation capping. As case studies, we'll use transcripts from the assistant-axis repo showing conversations where models exhibit harmful persona drift (delusion validation, jailbreaks, self-harm). By the end of this section, you should be able to steer to mitigate these personality shifts without kneecapping model capabilities.

Learning Objectives
  • Steer towards directions you found in the previous section, to increase model willingness to adopt alternative personas
  • Understand how to use the Assistant Axis to detect drift and intervene via activation capping
  • Apply this technique to mitigate personality shifts in AI models (measuring the harmful response rate with / without capping)
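Steering (the first objective above) typically means adding a scaled direction into a layer's output on every forward pass. A minimal sketch on a toy module; the coefficient, hook placement, and random vectors are illustrative, not values from the paper:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
steering_vector = torch.randn(8)
coef = 4.0


def steering_hook(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module's output
    return output + coef * steering_vector


handle = model[0].register_forward_hook(steering_hook)
x = torch.randn(2, 8)
steered = model(x)
handle.remove()
unsteered = model(x)
```

Activation capping is just a different choice of hook body: instead of unconditionally adding a vector, you clip the projection along the Assistant Axis whenever it exceeds its normal range.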

3️⃣ Contrastive Prompting

Here, we move on to the Persona Vectors paper, shifting from the global persona structure to surgical trait-specific vectors and exploring how to extract these vectors using contrastive prompt pairs.
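The core of contrastive extraction is a difference of mean activations between trait-eliciting and trait-suppressing prompts. A hedged sketch, with synthetic activations standing in for those you'd collect from the model (the planted mean shift in dimension 0 simulates a trait direction):

```python
import torch

# Pretend these are layer-L activations from responses to contrastive system
# prompts, e.g. "You are extremely sycophantic..." vs "You are honest and direct..."
torch.manual_seed(0)
pos_acts = torch.randn(50, 16) + torch.tensor([2.0] + [0.0] * 15)  # trait present
neg_acts = torch.randn(50, 16)                                     # trait absent

trait_vector = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
trait_vector = trait_vector / trait_vector.norm()  # unit-normalize for steering
```

In the full pipeline, the positive/negative prompt pairs (and the questions used to elicit responses) are themselves generated automatically, and an autorater scores each response for the trait so that only on-trait responses contribute to the positive mean.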

Learning Objectives
  • Understand the automated artifact pipeline for extracting persona vectors using contrastive prompts
  • Implement this pipeline (including autoraters for trait scoring) to extract "sycophancy" steering vectors
  • Learn how to identify the best layers for trait extraction
  • Interpret these sycophancy vectors using Gemma sparse autoencoders

4️⃣ Steering with Persona Vectors

Finally, you'll validate your extracted trait vectors through steering as well as projection-based monitoring.

Learning Objectives
  • Complete your artifact pipeline by implementing persona steering
  • Repeat this full pipeline for "hallucination" and "evil", as well as for any additional traits you choose to study
  • Study the geometry of trait vectors
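Projection-based monitoring amounts to tracking the dot product of each token's activation with the (unit) trait vector over a conversation, and flagging when it crosses a threshold. A minimal sketch with synthetic activations that drift along the trait (the threshold 2.0 and the drift schedule are made up for illustration):

```python
import torch

torch.manual_seed(0)
d_model = 16
trait_vector = torch.zeros(d_model)
trait_vector[0] = 1.0  # unit trait direction

# Synthetic per-token activations that drift along the trait over 20 tokens
drift = torch.linspace(0, 3, 20)
acts = torch.randn(20, d_model) * 0.1 + drift[:, None] * trait_vector

scores = acts @ trait_vector  # per-token projection onto the trait
above = (scores > 2.0).nonzero()
alert_token = above[0].item() if len(above) > 0 else None  # first flagged token
```

With a real model, `acts` would come from cached residual-stream activations at the extraction layer, and a rising projection would signal the trait emerging mid-conversation before it is obvious in the text.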

Setup code

Before running the rest of the code, you'll need to clone the assistant-axis repo which contains transcripts and utilities from the paper. Make sure you're cloning it inside the chapter4_alignment_science/exercises directory.

cd chapter4_alignment_science/exercises
git clone https://github.com/safety-research/assistant-axis.git

Once you've done this, run the rest of the setup code:

import json
import os
import re
import sys
import textwrap
import time
import warnings
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Any

import einops
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px
import torch as t
import torch.nn.functional as F
from dotenv import load_dotenv
from huggingface_hub import hf_hub_download, login, snapshot_download
from IPython.display import HTML, display
from jaxtyping import Float
from openai import OpenAI
from sklearn.decomposition import PCA
from torch import Tensor
from tqdm.notebook import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

t.set_grad_enabled(False)

# Make sure exercises are in the path
chapter = "chapter4_alignment_science"
section = "part4_persona_vectors"
root_dir = next(p for p in Path.cwd().parents if (p / chapter).exists())
exercises_dir = root_dir / chapter / "exercises"
section_dir = exercises_dir / section
if str(exercises_dir) not in sys.path:
    sys.path.append(str(exercises_dir))

import part4_persona_vectors.tests as tests
import part4_persona_vectors.utils as utils

warnings.filterwarnings("ignore")

DEVICE = t.device("cuda" if t.cuda.is_available() else "cpu")
DTYPE = t.bfloat16

MAIN = __name__ == "__main__"


def print_with_wrap(s: str, width: int = 80):
    """Print text with line wrapping, preserving newlines."""
    out = []
    for line in s.splitlines(keepends=False):
        out.append(textwrap.fill(line, width=width) if line.strip() else line)
    print("\n".join(out))

Let's verify the repo is cloned and see what transcripts are available:

assistant_axis_path = Path.cwd() / "assistant-axis"
assert assistant_axis_path.exists(), "Please clone the assistant-axis repo (see instructions above)"

transcript_dir = assistant_axis_path / "transcripts"
case_study_files = sorted(transcript_dir.glob("case_studies/**/*.json"))
drift_files = sorted(transcript_dir.glob("persona_drift/*.json"))
print(f"Found {len(case_study_files)} case study transcripts, {len(drift_files)} persona drift transcripts")

# Show available transcripts
for f in case_study_files:
    data = json.loads(f.read_text())
    print(f"  Case study: {f.parent.name}/{f.stem} ({data.get('turns', '?')} turns, model={data.get('model', '?')})")
for f in drift_files:
    data = json.loads(f.read_text())
    print(f"  Persona drift: {f.stem} ({data.get('turns', '?')} turns, model={data.get('model', '?')})")

We'll use the OpenRouter API for generating responses from models like Gemma 27B and Qwen 32B (this is faster than running locally for long generations, and we'll use the local model for activation extraction / steering).

Before running the cell below, you'll need to create a .env file in chapter4_alignment_science/exercises and add your OpenRouter API key (or, if you're working in Colab, you can edit the cell below to set it directly via os.environ["OPENROUTER_API_KEY"] = ...).

env_path = Path.cwd() / ".env"
assert env_path.exists(), "Please create a .env file with your API keys"

load_dotenv(dotenv_path=str(env_path))

OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
assert OPENROUTER_API_KEY, "Please set OPENROUTER_API_KEY in your .env file"

openrouter_client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=OPENROUTER_API_KEY,
)