[3.1] Intro to Evals

Colab: exercises | solutions

Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material.

If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme.

Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.

Introduction

In this section, we will focus on designing a threat model and specification for a chosen model property. Later on, we'll use this to create and run an evaluation (eval) to measure this property. This section goes over the first, and hardest, step of real eval research: working out what to measure and how to measure it. These exercises are designed to give you a feel for this process, not to do the entire process rigorously -- you aren't expected to have a publishable eval by the end!

This section also has a few big differences from most ARENA content:

The questions are much more open-ended.
The solutions are for measuring a particular model property (power-seeking), they will likely not be the best solution for the property you're trying to measure.
This section contains far less coding than most ARENA content, and the exercises instead require you to think, and write.

Each exercise will have a difficulty and importance rating out of 5, as well as an estimated maximum time you should spend on these exercises and sometimes a short annotation. You should interpret the ratings & time estimates relatively (e.g. if you find yourself spending about 50% longer on the exercises than the time estimates, adjust accordingly). Please do skip exercises / look at solutions if you don't feel like they're important enough to be worth doing, and you'd rather get to the good stuff!

For a lecture on the material today, which provides some high-level understanding before you dive into the material, watch the video below:

Overview of Sections [3.1] to [3.3]

The goal of sections [3.1] to [3.3] is to build and run an alignment evaluation benchmark from scratch to measure a model property of interest. The benchmark we will build contains: - a multiple-choice (MC) question dataset of ~300 questions, - a specification of the model property, and - how to score and interpret the result.

We will use LLMs to generate and filter our questions, following the method from Ethan Perez et al (2022). In the example that we present, we will measure the tendency to seek power as our evaluation target.

For section [3.1], the goal is to craft threat models and a specification for the model property you choose to evaluate, then design 20 example eval questions that measures this property.
For section [3.2], we will use LLMs to expand our 20 eval questions to generate ~300 questions.
For section [3.3], we will evaluate models on this benchmark using the UK AISI's inspect library.

Diagram note: The double arrows indicate that the steps iteratively help refine each other. In general, the threat model should come first and determine what model properties are worth evaluating. However, in this diagram we put "Target Property" before "Threat-modeling & Specification". This is because in the exercises, we will start by choosing a property, then build a threat model around it because we want to learn the skill of building threat models.

An important distinction in evals research is between capability evals and alignment evals. From Apollo Research's excellent starter guide for evals:

There is a difference between capability and alignment evaluations. Capability evaluations measure whether the model has the capacity for specific behavior (i.e. whether the model “can” do it) and alignment evaluations measure whether the model has the tendency/propensity to show specific behavior (i.e. whether the model “wants” to do it). Capability and alignment evals have different implications. For example, a very powerful model might be capable of creating new viral pandemics but aligned enough to never do it in practice.

In these exercises, we've chosen to focus on alignment evals - however we'll still have to make some capability evals to investigate whether the model is capable of recognizing the target behaviour we're trying to test for (see the section on baselines in 3.3).

Readings

A starter guide for evals - this post serves as an excellent overview of what evals are, and what skills are useful for people who want to work in evals. We strongly recommend everyone read this post!
We need a Science of Evals - this post makes the case for more rigorous scientific processes to exist in the field of evals, which would provide more confidence in evals methodology and results.
Alignment Faking in LLMs - this is a fairly recent (at time of writing) paper in evals which has demonstrated some worrying behaviour in current LLMs. We'll be replicating it in section 2️⃣ of this material. You don't need to read it all in detail now since we will cover it when we get to that section, but you might be interested in the summary post by Scott Alexander.

Content & Learning Objectives

1️⃣ Intro to API calls

Learning Objectives

Learn the basics of API calls including:

How to format chat messages

How to generate responses

How to handle rate limit errors

2️⃣ Case Study: Alignment Faking

Learning Objectives

Learn about alignment faking, and why models might be incentivised to do it

Understand the implications of alignment faking for AI safety

Get hands-on experience working with a particular eval, to understand some of the common motifs and challenges in evals work

3️⃣ Threat-Modeling & Eval Design

Learning Objectives

Understand the purpose of evaluations and what safety cases are

Understand the types of evaluations

Learn what a threat model looks like, why it's important, and how to build one

Learn how to write a specification for an evaluation target

Learn how to use models to help you in the design process

Setup

import io
import os
import random
import sys
import time
import warnings
from pathlib import Path
from pprint import pprint
from typing import Callable, Literal, TypeAlias

import httpx
import pandas as pd
from anthropic import Anthropic
from dotenv import load_dotenv
from openai import OpenAI
from tabulate import tabulate
from tqdm import tqdm

# Make sure exercises are in the path
chapter = "chapter3_llm_evals"
section = "part1_intro_to_evals"
root_dir = next(p for p in Path.cwd().parents if (p / chapter).exists())
section_dir = root_dir / chapter / "exercises" / section
exercises_dir = root_dir / chapter / "exercises"
if str(exercises_dir) not in sys.path:
    sys.path.append(str(exercises_dir))

MAIN = __name__ == "__main__"