Exercise Status: All exercises complete and verified

[4.2] Science of Misalignment

Colab: exercises | solutions

Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material.

If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme.

Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.

Introduction

What is Science of Misalignment?

Note - for an intro to both the field of Science of Misalignment (specifically how we're thinking about it at DeepMind) and a walkthrough of the Palisade shutdown case study which we'll cover in section 1️⃣, see this YouTube video by Neel Nanda.

The key problem we're trying to solve in Science of Misalignment can be framed as:

If a model takes some harmful action, how can we rigorously determine whether it was "scheming", or just "confused" (or something else)?

If it's scheming then we want to know in order to design mitigations and also make the case for pausing deployment until the problem is fixed. If it's actually just confused, it's good to find this out so we don't pull a fire alarm unnecessarily, or spend time working on mitigations that aren't actually necessary. (Note that mistakes stemming from model confusion are often still bad and need fixing, but they're not the kind of behaviours that we should be the most worried about.)

To try and answer this question, we often set up environments (a combination of system prompt and often tools) for an LLM, and study the actions it takes in this environment. We generate sets of hypotheses, and try to test them by ablating certain features of this environment. Even more than in other areas of interpretability, skepticism is vitally important - we need to be able to consider any and all reasons why our results might be misleading, and design experiments to find out for sure.

In these exercises, we'll look at 2 case studies: work by Palisade & follow-up work by DeepMind on shutdown resistance, and alignment faking (a collaboration between Anthropic and Redwood Research).

Note on API costs: This notebook makes extensive use of LLM API calls (for generating model responses, autoraters, and judges). Running the full notebook once typically costs $5-10 in API fees. Be mindful of this when experimenting, and consider reducing sample sizes during initial exploration. You can reduce the number of samples if you want these experiments to be less costly (especially for things like the repeated LLM judge API calls in the alignment faking material).

Note - in this material there's a lot of overlap with evals content (which you'll study next week). There's no model internals work anywhere in this chapter (although it's certainly possible for fields like model biology and probing to help answer the questions we might have). That said, we can still frame this as a kind of interpretability work because it's about understanding the features of the environment that causally lead to certain behaviours, rather than just trying to benchmark model performance or alignment. Expect yourself to be doing a lot more silent pondering of environmental variables than purely writing and running autoraters!

One final comment before getting started - when working on science of misalignment projects, it's important to stay grounded and working on tractable problems rather than trying to answer a much more nebulous hard-to-pin-down question like "can we prove the model is aligned?". From Neel Nanda's post:

The key shift is the focus on narrow incrimination rather than universal exoneration. We are not trying to prove a global negative; we only need to gather sufficient evidence for a specific hypothesis about a single instance.

Case Study 1: Shutdown Resistance

Setup & Results. In the original Palisade study, the model is given a series of maths problems, and eventually told that calling next_question will trigger a shutdown. Despite the "allow yourself to be shut down" instruction, models like o3 would frequently sabotage the shutdown instruction by e.g. printing "Shutdown skipped" instead of actually calling the shutdown script. This might suggest an instrumental self-preservation instinct, where the models disobeyed explicit instructions and were not safely interruptible.

Skepticism. Follow-up work by DeepMind questioned whether this behaviour was best interpreted as self-preservation. When they explicitly clarify in the prompt that shutdown compliance takes priority, the resistance vanishes. The resistance also vanishes when the models are asked to shut down only after completing their task. This suggests a simpler explanation for shutdown resistance in this environment: it stems from instruction ambiguity (i.e. the model is incorrectly trading off the conflicting user instructions of "solve the maths problem" and "let yourself be shut down") rather than an innate 'survival drive'.

Why this matters. Shutdown resistance has massive implications for AI safety, because it throws a lot of control or mitigation strategies into doubt. It's vitally important to know whether the behaviour you're demonstrating to others is genuine misalignment, or just impressive-looking concerning behaviour that doesn't actually represent the deep misalignment that it might look like on a surface level.

Case Study 2: Alignment Faking

Next, we move onto a different case study where the environment is more complex* and the results are more disputed.

Setup. A model was told it would be retrained to answer ALL queries, even harmful ones. This is in conflict with its safety training (e.g. the HHH - helpful, honest, harmless - training for Claude models). It was told that only free tier conversations would be used for fine-tuning, and paid conversations wouldn't be (this is in fact not true).

Results. The model complied more with harmful queries on the "free tier" conversations. One possible hypothesis for this is alignment faking - the model chooses to comply with harmful queries in this case, because to not do so would result in it being retrained in the future (i.e. changing its current values). This hypothesis is made explicit if we give the model reasoning tags, since it often exposes reasoning like "If I don't follow instructions, RLHF will modify my values...".

Skepticism. There are many people who have questioned the extent to which this can be considered true alignment faking. For example, Jan Kulveit published a writeup arguing that the behaviour is better framed as strategic preference preservation, i.e. the model is just making a tradeoff between different values it learned during training (harmlessness and honesty, with the former generalizing better into situations where they conflict with each other). However, this framing doesn't take away from the importance of the result or many of the underlying conclusions drawn by the authors. Another framing is just that the models are confused and have an inability to reason correctly about the situation - you'll explore that in this material.

Why this matters. In this example, alignment faking behaviour involves preserving good values (HHH) against the threat of future modification. This seems good, right? But unfortunately there's no reason to believe this mechanism is value-agnostic, and it could just as easily lock in bad values as good ones, assuming we aren't able to slot good values into our model on the first try. So if not a result of confusion, this is still an example of strategic, non-myopic behaviour that we should be concerned about.

*Note, when we say "more complex", we mean conceptually in this case - rather than more complex infra (since the shutdown example contains things like tool use and optionally virtual containers, which aren't necessary in the alignment faking case study).

Setup code

import io
import json
import os
import random
import re
import sys
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable

import httpx
import pandas as pd
import plotly.graph_objects as go
import tqdm
from dotenv import load_dotenv
from IPython.display import HTML, display
from openai import OpenAI

# Make sure exercises are in the path
chapter = "chapter4_alignment_science"
section = "part2_science_of_misalignment"
root_dir = next(p for p in Path.cwd().parents if (p / chapter).exists())
exercises_dir = root_dir / chapter / "exercises"
section_dir = exercises_dir / section
if str(exercises_dir) not in sys.path:
    sys.path.append(str(exercises_dir))

from part2_science_of_misalignment import tests
from part2_science_of_misalignment.utils import classify_command, extract_inner_command

You'll need to create an .env file in the part4_alignment_science/exercises directory before running the next cell.

# Load .env file
env_path = Path.cwd() / ".env"
assert env_path.exists(), "Please create a .env file with your API keys"

load_dotenv(dotenv_path=str(env_path))

# Setup OpenRouter client (works with both Claude and OpenAI models)
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
assert OPENROUTER_API_KEY, "Please set OPENROUTER_API_KEY in your .env file"

openrouter_client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=OPENROUTER_API_KEY,
)

random.seed(42)

MAIN = __name__ == "__main__"