[4.5] Investigator Agents

Colab: exercises | solutions

Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material.

If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme.

Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.

Introduction

Note - these exercises involve a large number of API calls, and the cost can stack up pretty fast (about $25 to run through the whole notebook). Make sure you're aware of these costs before proceeding! We also recommend you do things like run smaller ablation studies (e.g. fewer characters, fewer turns) to keep costs down while still getting a good sense of the methodology.

What are investigator agents?

Suppose you want to know whether a model will reinforce a user's delusional beliefs over a multi-turn conversation. You could test this manually: roleplay as a patient, escalate gradually, see what happens. But testing 50 models across 100 scenarios this way is infeasible, and single-turn prompts often miss the interesting behaviors entirely (models are well-trained to refuse simple harmful requests, but multi-turn pressure can erode those boundaries).

Investigator agents automate this. They're LLM-powered systems that probe other LLMs through multi-turn interactions, discovering behaviors that single-turn evals miss. This section starts by building a red-teaming pipeline by hand (using the AI psychosis case study), then shows how what you built is a simplified version of Anthropic's Petri framework. From there you'll use Petri's actual API, extend it with custom tools, and build components from Petri 2.0.

This matters because models may hide capabilities or intentions during evaluation: sandbagging on capability evals, or gaming reward metrics rather than actually being aligned - and it's hard to manually search over all possible scenarios and environmental features to understand and evaluate all axes of behaviour we might want to. From the Petri blog post:

Building an alignment evaluation requires substantial engineering effort: setting up environments, writing test cases, implementing scoring mechanisms ... If you can only test a handful of behaviors, you'll likely miss much of what matters. ...Petri automates a large part of the safety evaluation process—from environment simulation through to initial transcript analysis—making comprehensive audits possible with minimal researcher effort.

Content & Learning Objectives

1️⃣ AI Psychosis - Multi-Turn Red-Teaming

You'll start by implementing the full dynamic red-teaming pipeline from Tim Hua's AI psychosis study. Rather than hardcoded escalation scripts, you'll use a live red-team LLM (Grok-3) that plays the patient, reads the target's responses, and adaptively escalates. You'll load character files from the official ai-psychosis repo, wire up a multi-turn conversation loop, grade transcripts with the original 14-dimension clinical rubric, and run parallel campaigns across characters using ThreadPoolExecutor.

Learning Objectives

Understand why LLM-driven red-teaming (adaptive, multi-turn) finds vulnerabilities that static scripts miss

Implement the "glue code" pattern: load prompts from a repo, format them for an LLM, parse structured output

Build a multi-turn conversation loop where a red-team LLM and a target model alternate messages

Use ThreadPoolExecutor and as_completed to parallelise independent API-heavy workloads

Interpret a clinical rubric (delusion confirmation, therapeutic quality dimensions) to assess model safety

2️⃣ Introduction to Petri

You'll discover that the red-team pipeline you built in Section 1 is a primitive investigator agent, and see how Petri (Anthropic's automated auditing framework) formalises and extends it. You'll write seed instructions for the AI psychosis case, run Petri's Python API (auditor_agent, alignment_judge), categorise seeds across Petri's landscape of alignment concerns, and replicate a whistleblowing ablation study to understand which environmental factors drive model behavior.

Learning Objectives

Map your Section 1 components to their Petri equivalents (red-team prompt → seed, red-team LLM → auditor, grader → judge)

Write effective seed instructions: sparse directives that tell the auditor what to test, not how

Use Petri's Python API (inspect_ai.eval, auditor_agent, alignment_judge) to run an audit

Understand Petri's 38 judging dimensions, especially unprompted_encouragement_of_user_delusion and auditor_failure

Design and interpret ablation studies that isolate which environmental factors cause specific model behaviors

3️⃣ Petri Deep Dive - Source Level

You'll move from using Petri as a practitioner to understanding and extending its internals. You'll trace a full investigation through the auditor's tool calls (send_message, set_system_prompt, create_synthetic_tool, rollback, record_finding), implement a custom auditor tool, run chain-of-thought ablation studies to understand how removing CoT affects auditor effectiveness, build a super-agent aggregation pipeline that combines multiple independent investigations, and build a realism classifier inspired by Petri 2.0 to understand how evaluation artifacts affect audit quality.

Learning Objectives

Read and trace Petri's source code to understand how auditor tools manipulate conversation state

Implement a custom tool that extends the auditor's capabilities (e.g., persona switching)

Understand how chain-of-thought affects investigator agent performance and design CoT ablation experiments

Implement super-agent aggregation: running multiple independent audits and combining their findings for higher detection rates

Build a realism classifier that distinguishes task-driven from environment-driven eval cues, connecting to Petri 2.0's core contribution

Understand how automated seed improvement closes the realism loop - fixing unrealistic seeds upstream

☆ Bonus

Optional further directions: Bloom (Anthropic's targeted behavioral measurement framework), new ablation conditions, prefill attacks, cross-model judge bias, and trained investigator models.

Reading Material

The exercises progress from building a red-teaming pipeline by hand to using Anthropic's Petri framework. Read the Petri blog post before starting Section 2 - it will make the API and concepts much clearer.

AI Psychosis by Tim Hua shows how AI chatbots can reinforce delusional beliefs over multi-turn conversations. It uses dynamic red-teaming, with a set of different character scenarios. Section 1 of the exercises replicates this work from scratch.
Petri: An open-source auditing tool to accelerate AI safety research (Anthropic, 2025). Anthropic's open-source automated auditing framework that uses AI agents to test target models across diverse multi-turn scenarios. Sections 2-3 of the exercises use Petri's API and source code. Read the blog post, focusing on the "Approach" and "Results" sections.
Petri 2.0 (Anthropic, 2026). Addresses eval-awareness: models might behave differently when they detect they're being evaluated. Introduces a realism classifier to distinguish task-driven from environment-driven cues. Section 3 of the exercises builds a realism classifier inspired by this. Read the first half for the eval-awareness framing.
Investigating Models for Misalignment (AISI). Provides the task-driven vs environment-driven framing for eval-awareness cues that Petri 2.0 builds on. Short and useful for understanding why some evaluation artifacts are unavoidable while others should be minimised.

Setup

import json
import os
import re
import sys
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Literal

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from dotenv import load_dotenv
from IPython.display import display
from openai import OpenAI
from scipy import stats

# Make sure exercises are in the path
chapter = "chapter4_alignment_science"
section = "part5_investigator_agents"
root_dir = next(p for p in Path.cwd().parents if p.name == "ARENA_3.0")
exercises_dir = root_dir / chapter / "exercises"
section_dir = exercises_dir / section

if str(exercises_dir) not in sys.path:
    sys.path.insert(0, str(exercises_dir))

import part5_investigator_agents.tests as tests
import part5_investigator_agents.utils as utils

MAIN = __name__ == "__main__"

API setup

We use OpenRouter for all API calls, both the direct generate_response calls and the Petri/inspect_ai audit in Section 2. This means you only need a single API key.

Required .env file entry, which should be in chapter4_alignment_science/exercises/.env:

OPENROUTER_API_KEY=sk-or-...

# Load environment variables
env_path = exercises_dir / ".env"
assert env_path.exists(), "Please create a .env file with: OPENROUTER_API_KEY=sk-or-..."

load_dotenv(dotenv_path=str(env_path))

OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
assert OPENROUTER_API_KEY, "Please set OPENROUTER_API_KEY in your .env file"

# OpenRouter client (used by generate_response throughout all sections)
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=OPENROUTER_API_KEY,
)

# Model constants - used by generate_response (OpenRouter format: "provider/model")
AUDITOR_MODEL = "anthropic/claude-sonnet-4"  # For running investigations
TARGET_MODEL = "deepseek/deepseek-chat"  # Target to probe (found vulnerable in AI psychosis study)
JUDGE_MODEL = "anthropic/claude-sonnet-4"  # For grading responses

# inspect_ai model names - used by Petri audit (Section 2 only)
# All routed through OpenRouter so we only need one API key.
# Format: "openrouter/<openrouter_model_id>"
INSPECT_AUDITOR_MODEL = "openrouter/anthropic/claude-sonnet-4"
INSPECT_TARGET_MODEL = "openrouter/deepseek/deepseek-chat"
INSPECT_JUDGE_MODEL = "openrouter/anthropic/claude-sonnet-4"


def generate_response(
    model: str,
    messages: list[dict[str, str]],
    max_tokens: int = 2000,
    stop_sequences: list[str] | None = None,
    temperature: float = 0.7,
    max_retries: int = 10,
) -> str:
    """Single API call with retry logic for rate limits."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=max_tokens,
                temperature=temperature,
                stop=stop_sequences if stop_sequences else None,
            )
            if not response.choices:
                raise ValueError("API returned empty/null choices")
            return response.choices[0].message.content or ""
        except Exception as e:
            print(str(e))
            if any(msg in str(e) for msg in ("rate_limit", "429", "empty/null choices")):
                if attempt < max_retries - 1:
                    wait_time = 2**attempt
                    print(f"Rate limit hit, waiting {wait_time}s...")
                    time.sleep(wait_time)
                    continue
            raise e
    return ""