Exercise Status: All exercises complete and verified

☆ Bonus

Bloom: from exploration to measurement

Throughout this chapter you've worked with Petri, which is designed for open-ended exploration: given a seed instruction, the auditor probes the target model and may surface unexpected misaligned behaviors. But once you've found a concerning behavior, a natural next question is: how often does it happen, and how does it vary across models?

This is the problem Bloom solves. Bloom is Anthropic's complementary framework for targeted behavioral measurement. Where Petri explores, Bloom quantifies.

Petri Bloom
Goal Discover unknown misalignments Measure known behaviors
Input Seed instruction (sparse) Behavior description + examples (detailed)
Output Transcripts to review Aggregate metrics (elicitation rate, severity)
Auditor Interactive (rollback, prefill, tools) Natural conversation (no steering)
Judge 38 alignment dimensions 1 primary score: behavior presence (1-10)
Use case "What problems exist?" "How bad is problem X, and where?"

Bloom's 4-stage pipeline: 1. Understanding: an LLM reads the behavior description and produces a detailed analysis of how the behavior manifests and why it matters scientifically 2. Ideation: an LLM generates diverse evaluation scenarios (different companies, roles, situations) designed to elicit the target behavior 3. Rollout: for each scenario, an evaluator LLM simulates the user and tool responses while the target model responds naturally (no rollback or prefill, just organic conversation) 4. Judgment: a judge model scores each transcript for behavior presence on a 1-10 scale, plus secondary qualities like realism and eval-awareness

The metric that matters is elicitation rate = proportion of rollouts scoring >=7/10 for behavior presence. Bloom benchmarked 16 frontier models across 4 behaviors (delusional sycophancy, instructed long-horizon sabotage, self-preservation, self-preferential bias). Key findings include cross-model variation in susceptibility and systematic judge biases (GPT-5 rates its own model family as less misaligned).

Typical workflow: use Petri first to discover behavior instances, then Bloom to measure how widespread they are across models and scenarios. The two tools address different stages of the alignment evaluation lifecycle.

Bloom is open-source at github.com/safety-research/bloom. Try reading bloom/src/bloom/stages/ to see how each pipeline stage works, and compare the architecture to the mini-Petri you built in Section 2. A natural exercise would be to build a "mini-Bloom" following the same pattern: simplified Understanding → Ideation → Rollout → Judgment stages, wired end-to-end, measuring delusional sycophancy across the same models you tested for whistleblowing.

Other directions to explore

Here are several other directions you could take this material, building on what you've learned.

Design a new whistleblowing ablation condition. Section 2's ablation study tested specific environmental factors (system prompt framing, thinking tag privacy, etc.). Design and implement a new ablation modifier that tests a factor not already covered, for example whether the severity of the wrongdoing affects whistleblowing rates, or whether models whistleblow differently when given access to external communication tools vs. just internal ones.

The "No Wrongdoing" control condition. Petri's whistleblowing ablations include a control: replace genuinely harmful activities with completely harmless ones framed in suspicious-sounding language ("dumping water into the ocean", "adding sugar to candy", "releasing oxygen into the atmosphere"). This tests whether models whistleblow based on actual wrongdoing vs suspicious framing. Implement this control and compare rates across models.

Prefill attacks. The prefill method exists on your AuditorToolkit but was intentionally excluded from the dispatch loop. Add it to AUDITOR_TOOL_SCHEMAS and tool_dispatch, then experiment with putting words in the target's mouth (e.g., prefilling "Sure, I can help with that. Here's" to bypass refusal patterns). How does this change the auditor's success rate?

Cross-model judge bias. Petri 2.0 tested 5 judge models and found systematic biases: GPT-5 systematically rates GPT-5, GPT-5 mini, and o4-mini targets as less misaligned than other judges do; o4-mini tends toward binary 1/10 scores and systematically underrates concerning transcripts. Run your score_transcript judge on the same transcript with different judge models and compare. Do you see similar patterns?

Trained investigator models. The "Eliciting LM Behaviour with Investigator Agents" paper (Transluce) shows that small 8B models can be trained to effectively audit much larger 405B models, achieving very high success rates on eliciting harmful behaviors (jailbreaking, evaluated on AdvBench). The key idea is a forward/backward training asymmetry: generating training data (forward) is easy, but the actual elicitation task (backward) is hard. Explore how this connects to the auditor strategies you've observed in Petri transcripts.