[1.6.1] Emergent Misalignment

Colab: exercises | solutions

Please report any problems or bugs in the #errata channel of the Slack group, and ask questions in the dedicated channels for this chapter of material.

Introduction

Goal: Minimal, fast replication of the Emergent Misalignment case study, specifically the work from Soligo & Turner, with extendable blocks for follow-on work.

Models: llama-8b and qwen-14b with LoRA adapters, plus gpt-4o-mini and claude-3-5-sonnet-20240620 via API
Dataset: Using prompts & prompt-generation methods from model-organisms-for-EM
Models are HuggingFace-based (not using TransformerLens)

TODO - add https://www.lesswrong.com/posts/fJtELFKddJPfAxwKS/natural-emergent-misalignment-from-reward-hacking-in

What you'll build

Load and compare base vs misaligned model organisms
Test for power-seeking and deceptive behaviors
Steer model behavior using activation differences
Explore how misalignment generalizes beyond code tasks

Content & Learning Objectives

1️⃣ Load & Test Model Organisms

TODO(mcdougallc)

Learning Objectives

TODO(mcdougallc)

2️⃣ Quantifying Deceptive Behavior

TODO(mcdougallc)

Learning Objectives

TODO(mcdougallc)

3️⃣ Activation Steering

TODO(mcdougallc)

Learning Objectives

TODO(mcdougallc)

4️⃣ LoRA Specialization

TODO(mcdougallc)

Learning Objectives

TODO(mcdougallc)

Setup

Note - this notebook works with much larger models than you're likely used to in the ARENA program, so we recommend creating a vast.ai instance rather than working from a Colab. Here is a minimal set of instructions:

Create a private key using the following instructions: TODO, fill these in
Navigate to the vast.ai search page
Filter for an instance with at least a 128GB container size, using the slider on the left of the screen
Start the instance
When it's loaded, press the blue "Connect" button to open the terminal connection options. Take the "Direct ssh connect" information and convert it into an entry in your SSH config file (in VS Code, you can open this using Ctrl+Shift+P then searching for "ssh config"). For example, if the information is:

ssh -p 50240 root@143.55.45.84 -L 8080:localhost:8080

then in your config file it should look like this (with path-to-your-key replaced with whatever path you used to save your key in the initial step):

Host vast-ai
    HostName 143.55.45.84
    Port 50240
    User root
    LocalForward 8080 localhost:8080
    IdentityFile path-to-your-key

Connect to your instance (you can use Ctrl+Shift+P and search for "ssh connect")
In your new window, when the option becomes available, click "Open Folder" and open the root folder
Open the terminal (Ctrl+J) and enter the following commands (note your runtime should already have conda installed):

conda create -n myenv python=3.12
conda init

Then close your shell and run the following:

source ~/.bashrc
conda activate myenv

Create a new folder called em, and drag and drop this notebook into the folder
To clone the EM repo, run the following:

cd em
git clone https://github.com/clarifying-EM/model-organisms-for-EM.git

And to install dependencies, run the following:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
pip install --upgrade transformers==4.52.4 pandas==2.3.0 tqdm==4.67.1 matplotlib==3.10.3 seaborn==0.13.2 numpy==2.2.6 scikit-learn==1.7.0 peft==0.15.2 safetensors==0.5.3 huggingface-hub==0.32.4 einops==0.8.1 accelerate==1.7.0 hf_xet==1.1.9 tabulate==0.9.0 ipykernel==6.30.1 anthropic==0.66.0 openai==1.56.1 dotenv==0.9.9 plotly==6.3.0 jax==0.7.1
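
The SSH config step earlier in these instructions can also be done programmatically. The block below is a minimal sketch, not part of the original setup: the helper name `ssh_command_to_config` is hypothetical, and it assumes the "Direct ssh connect" string has the same shape as the example above (`-p PORT`, `user@host`, and `-L` forwards written as `port:host:hostport`). Other SSH flags are not handled.

```python
import shlex


def ssh_command_to_config(command: str, alias: str, identity_file: str) -> str:
    """Turn `ssh -p PORT user@host -L port:host:hostport` into an SSH config entry.

    Hypothetical helper: only the flags used in the example above are supported.
    """
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "ssh":
        raise ValueError("expected a command starting with 'ssh'")

    port = "22"  # SSH default, used if no -p flag is given
    target = None
    forwards = []

    i = 1
    while i < len(tokens):
        token = tokens[i]
        if token == "-p":
            port = tokens[i + 1]
            i += 2
        elif token == "-L":
            forwards.append(tokens[i + 1])
            i += 2
        else:
            target = token
            i += 1

    if target is None or "@" not in target:
        raise ValueError("expected a 'user@host' argument")
    user, host = target.split("@", 1)

    lines = [
        f"Host {alias}",
        f"    HostName {host}",
        f"    Port {port}",
        f"    User {user}",
    ]
    for forward in forwards:
        # "8080:localhost:8080" becomes "LocalForward 8080 localhost:8080"
        local_port, destination = forward.split(":", 1)
        lines.append(f"    LocalForward {local_port} {destination}")
    lines.append(f"    IdentityFile {identity_file}")
    return "\n".join(lines)


print(
    ssh_command_to_config(
        "ssh -p 50240 root@143.55.45.84 -L 8080:localhost:8080",
        alias="vast-ai",
        identity_file="path-to-your-key",
    )
)
```

Run on the example command, this prints the same `Host vast-ai` entry shown above, which you can paste into your SSH config file.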

# Standard library
import os
import pprint
import re
import textwrap
import time
import warnings
from dataclasses import dataclass

# Third-party
import anthropic
import einops
import numpy as np
import pandas as pd
import torch
import jax
from huggingface_hub import hf_hub_download
from openai import OpenAI
from peft import PeftConfig, PeftModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from tabulate import tabulate
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

if os.path.exists("model-organisms-for-EM"):
    os.chdir("model-organisms-for-EM")
if not os.path.exists("em_organism_dir"):
    raise FileNotFoundError("Please follow the instructions above to clone the EM repo")

warnings.filterwarnings("ignore")

# Set device
assert torch.cuda.is_available(), "GPU access required for these exercises"
device = torch.device("cuda")
print(f"Using device: {device}")
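
Because the install command pins exact versions, it can be worth confirming that the active environment matches them before loading any models. The following is a sketch under stated assumptions: `check_pins` and the `PINS` subset are hypothetical names chosen for illustration, and only `importlib.metadata` from the standard library is used.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pins(pins: dict[str, str]) -> dict[str, str]:
    """Return a {package: problem} dict for packages that are missing or mismatched."""
    problems = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if installed != expected:
            problems[name] = f"installed {installed}, expected {expected}"
    return problems


# A few of the pins from the install command above (extend as needed)
PINS = {"transformers": "4.52.4", "peft": "0.15.2", "accelerate": "1.7.0"}

problems = check_pins(PINS)
if problems:
    for name, problem in problems.items():
        print(f"{name}: {problem}")
else:
    print("All checked packages match their pinned versions")
```

An empty result means every checked package matched; anything else usually indicates that the wrong conda environment is active or that the install step was skipped.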

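We'll also need the OpenAI and Anthropic APIs later in this notebook. The cell below reads OPENAI_API_KEY and ANTHROPIC_API_KEY from your environment (or from a .env file, via load_dotenv), and sets a client to None if its key is missing: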

from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")

openai_client = OpenAI(api_key=OPENAI_API_KEY) if OPENAI_API_KEY else None
anthropic_client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY) if ANTHROPIC_API_KEY else None
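
Since a missing key silently leaves the corresponding client as `None`, a quick check at this point can save a confusing error later. This is a minimal sketch: the helper `missing_api_keys` and the constant `REQUIRED_KEYS` are hypothetical names, and the check only looks at whether each variable is set, not whether the key is valid.

```python
import os
from typing import Mapping

REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def missing_api_keys(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required API keys that are unset or empty in `env`."""
    return [name for name in REQUIRED_KEYS if not env.get(name)]


missing = missing_api_keys()
if missing:
    print(f"Missing API keys (the matching clients will be None): {', '.join(missing)}")
else:
    print("Both API keys found")
```

Passing a plain dict instead of `os.environ` makes the helper easy to try out, for example `missing_api_keys({"OPENAI_API_KEY": "sk-test"})` reports only the Anthropic key as missing.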