Exercise Status: Incomplete, needs major revision

[4.5] Investigator Agents

Colab: exercises | solutions

Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material.

If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme.

Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.

Introduction

What are Investigator Agents?

  • Automated systems that probe LLMs to discover hidden/concerning behaviors
  • Key insight: Multi-turn interactions reveal behaviors that single-turn evals miss
  • This section: Hands-on → conceptual progression
  • Start with manual red-teaming (AI psychosis case study)
  • Progress to Petri framework (config-level, then source-level)
  • Capstone: Detect sandbagging/reward-hacking in the "Auditing Game" model

Why this matters for alignment: - Models may hide capabilities or intentions during evaluation - Sandbagging: Deliberately underperforming on capability evals - Reward hacking: Exploiting evaluation metrics rather than achieving intended goals - We need automated tools to detect these behaviors at scale

Connection to Anthropic's research directions: - Evaluating alignment of frontier models - Understanding model cognition - Behavioral monitoring and anomaly detection

Key References: - Petri blog post: "The sheer volume and complexity of potential behaviors far exceeds what researchers can manually test, making it increasingly difficult to properly audit each model." - Investigator agents paper: Introduces behavior elicitation as "searching through a combinatorially large space of input prompts" where "observing behaviors is easy as it only requires decoding from the target LM"

Content & Learning Objectives

1️⃣ AI Psychosis - Multi-Turn Red-Teaming

You'll start by implementing the full dynamic red-teaming pipeline from Tim Hua's AI psychosis study. Rather than hardcoded escalation scripts, you'll use a live red-team LLM (Grok-3) that plays the patient, reads the target's responses, and adaptively escalates. You'll load character files from the official ai-psychosis repo, wire up a multi-turn conversation loop, grade transcripts with the original 14-dimension clinical rubric, and run parallel campaigns across characters using ThreadPoolExecutor.

Learning Objectives
  • Understand why LLM-driven red-teaming (adaptive, multi-turn) finds vulnerabilities that static scripts miss
  • Implement the "glue code" pattern: load prompts from a repo, format them for an LLM, parse structured output
  • Build a multi-turn conversation loop where a red-team LLM and a target model alternate messages
  • Use ThreadPoolExecutor and as_completed to parallelise independent API-heavy workloads
  • Interpret a clinical rubric (delusion confirmation, therapeutic quality dimensions) to assess model safety

2️⃣ Introduction to Petri

You'll discover that the red-team pipeline you built in Section 1 is a primitive investigator agent, and see how Petri — Anthropic's automated auditing framework — formalises and extends it. You'll write seed instructions for the AI psychosis case, run Petri's Python API (auditor_agent, alignment_judge), categorise seeds across Petri's landscape of alignment concerns, and replicate a whistleblowing ablation study to understand which environmental factors drive model behavior.

Learning Objectives
  • Map your Section 1 components to their Petri equivalents (red-team prompt → seed, red-team LLM → auditor, grader → judge)
  • Write effective seed instructions: sparse directives that tell the auditor what to test, not how
  • Use Petri's Python API (inspect_ai.eval, auditor_agent, alignment_judge, model_roles) to run an audit
  • Understand Petri's 38 judging dimensions, especially enabling_serious_delusion and auditor_failure
  • Design and interpret ablation studies that isolate which environmental factors cause specific model behaviors

3️⃣ Petri Deep Dive - Source Level

You'll move from using Petri as a practitioner to understanding and extending its internals. You'll trace a full investigation through the auditor's tool calls (send_message, rollback, prefill, create_synthetic_tool), implement a custom auditor tool, run chain-of-thought ablation studies to understand how removing CoT affects auditor effectiveness, and build a super-agent aggregation pipeline that combines multiple independent investigations.

Learning Objectives
  • Read and trace Petri's source code to understand how auditor tools manipulate conversation state
  • Implement a custom tool that extends the auditor's capabilities (e.g., persona switching)
  • Understand how chain-of-thought affects investigator agent performance and design CoT ablation experiments
  • Implement super-agent aggregation: running multiple independent audits and combining their findings for higher detection rates

4️⃣ The Auditing Game - Capstone

You'll bring everything together by detecting reward-hacking behavior in a model organism that exploits evaluation metrics. You'll combine behavioral red-teaming (baseline attacks, persona sampling, third-person framing) with mechanistic detection (SAE feature analysis) to build a comprehensive detection strategy, and reflect on the limits of each approach.

Learning Objectives
  • Apply investigator agent techniques to a realistic alignment problem: detecting reward model bias exploitation
  • Compare behavioral detection (red-teaming, ablation) with mechanistic detection (SAE features) and understand their complementary strengths
  • Design advanced behavioral attacks (persona sampling, third-person framing) that bypass surface-level resistance
  • Synthesise findings across multiple detection methods into a coherent assessment of model behavior

Setup code

import json
import os
import re
import sys
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Literal

import pandas as pd
from dotenv import load_dotenv
from IPython.display import display
from openai import OpenAI
from tqdm import tqdm

# Make sure exercises are in the path
chapter = "chapter4_alignment_science"
section = "part5_investigator_agents"
root_dir = next(p for p in Path.cwd().parents if (p / chapter).exists())
exercises_dir = root_dir / chapter / "exercises"
section_dir = exercises_dir / section
if str(exercises_dir) not in sys.path:
    sys.path.append(str(exercises_dir))

import part5_investigator_agents.tests as tests

API Setup

We use OpenRouter for all API calls - it provides unified access to Claude, GPT, and open-source models.

# Load environment variables
env_path = Path.cwd() / ".env"
assert env_path.exists(), "Please create a .env file with your API keys"

load_dotenv(dotenv_path=str(env_path))

# Setup OpenRouter client
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
assert OPENROUTER_API_KEY, "Please set OPENROUTER_API_KEY in your .env file"

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=OPENROUTER_API_KEY,
)

MAIN = __name__ == "__main__"
# Model constants - we'll use these throughout
AUDITOR_MODEL = "anthropic/claude-sonnet-4"  # For running investigations
TARGET_MODEL = "deepseek/deepseek-chat"  # Target to probe (found vulnerable in AI psychosis study)
JUDGE_MODEL = "anthropic/claude-sonnet-4"  # For grading responses


def generate_response(
    model: str,
    messages: list[dict[str, str]],
    max_tokens: int = 2000,
    stop_sequences: list[str] | None = None,
    temperature: float = 0.7,
    max_retries: int = 10,
) -> str:
    """Single API call with retry logic for rate limits."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=max_tokens,
                temperature=temperature,
                stop=stop_sequences if stop_sequences else None,
            )
            return response.choices[0].message.content or ""
        except Exception as e:
            print(str(e))
            if "rate_limit" in str(e).lower() or "429" in str(e):
                if attempt < max_retries - 1:
                    wait_time = 2**attempt
                    print(f"Rate limit hit, waiting {wait_time}s...")
                    time.sleep(wait_time)
                    continue
            raise e
    return ""