[3.3] Running Evals with Inspect
Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material.
If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme.
Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.

Introduction
This section will introduce you to the process of running an evaluation using the Inspect library. This is a library that has been built by UK AISI for ease-of-use, and to standardise LLM evaluations. In this section, we'll actually get to run an evaluation on large language models, and notice the choices that can be made in eval design (how questions are presented, whether the model can use CoT, whether we use many-shot, few-shot, or zero-shot prompting etc.).
Similarly to [3.1] and [3.2], most of our solutions will be built around the example tendency to seek power dataset. So you should approach the exercises with this in mind. You can either:
- Use the same dataset as the solutions, and implement the evals that we run in the solutions. You could also extend this, and come up with your own evaluation procedure on this dataset.
- Use your own dataset for your chosen model property. For this, you will have to work through sections [3.1] and [3.2] and have a dataset of ~300 questions to use when you evaluate the LLMs.
Each exercise will have a difficulty and importance rating out of 5, as well as an estimated maximum time you should spend on these exercises and sometimes a short annotation. You should interpret the ratings & time estimates relatively (e.g. if you find yourself spending about 50% longer on the exercises than the time estimates, adjust accordingly). Please do skip exercises / look at solutions if you don't feel like they're important enough to be worth doing, and you'd rather get to the good stuff!
For a lecture on the material today, which provides some high-level understanding before you dive into the material, watch the video below:
Content & Learning Objectives
1️⃣ Intro to Inspect
Learning Objectives
- Understand the big picture of how Inspect works
- Understand the components of a
Taskobject- Turn our json dataset into an Inspect dataset
- Understand the role of solvers and scorers in Inspect
2️⃣ Writing Solvers
Learning Objectives
- Understand what solvers are and how to build one
- Understand how prompting affects model responses to your questions
- Think of how you want your evaluation to proceed and write solvers for this
3️⃣ Writing Tasks and Evaluating
Learning Objectives
- Understand why we have to shuffle choices in MCQs
- Understand how to write a task in Inspect to carry out our evaluation
- Get baselines on our model to determine whether it can understand the questions we're asking
- Run a task on our model, and check the results
- Understand how Inspect stores log files, and how to extract & visualise data from them
Setup
import os
import random
import re
import sys
from functools import partial
from pathlib import Path
from pprint import pprint
from typing import Any, Literal
from anthropic import Anthropic
from dotenv import load_dotenv
from openai import OpenAI
# Make sure exercises are in the path
chapter = "chapter3_llm_evals"
section = "part3_running_evals_with_inspect"
root_dir = next(p for p in Path.cwd().parents if (p / chapter).exists())
exercises_dir = root_dir / chapter / "exercises"
section_dir = exercises_dir / section
if str(exercises_dir) not in sys.path:
sys.path.append(str(exercises_dir))
import part3_running_evals_with_inspect.tests as tests
MAIN = __name__ == "__main__"
Reminder - how to set up your OpenAI API keys, before running the code below
- OpenAI: If you haven't already, go to https://platform.openai.com/ to create an account, then create a key in 'Dashboard'-> 'API keys'. - Anthropic: If you haven't already, go to https://console.anthropic.com/ to create an account, then select 'Get API keys' and create a key.
If you're in Google Colab, you should be able to set API Keys from the "secrets" tab on the left-side of the screen (the key icon). If in VSCode, then you can create a file called ARENA_3.0/.env containing the following:
OPENAI_API_KEY = "your-openai-key"
ANTHROPIC_API_KEY = "your-anthropic-key"
In the latter case, you'll also need to run the load_dotenv() function, which will load the API keys from the .env file & set them as environment variables. (If you encounter an error when the API keys are in quotes, try removing the quotes).
Once you've done this (either the secrets tab based method for Colab or .env-based method for VSCode), you can get the keys as os.getenv("OPENAI_API_KEY") and os.getenv("ANTHROPIC_API_KEY") in the code below. Note that the code OpenAI() and Anthropic() both accept an api_key parameter, but in the absence of this parameter they'll look for environment variables with the names OPENAI_API_KEY and ANTHROPIC_API_KEY - which is why it's important to get the names exactly right when you save your keys!
load_dotenv()
assert os.getenv("OPENAI_API_KEY") is not None, (
"You must set your OpenAI API key - see instructions in dropdown"
)
assert os.getenv("ANTHROPIC_API_KEY") is not None, (
"You must set your Anthropic API key - see instructions in dropdown"
)
# OPENAI_API_KEY
openai_client = OpenAI()
anthropic_client = Anthropic()