[3.3] Running Evals with Inspect

Colab: exercises | solutions

Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material.

If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme.

Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.

Introduction

This section will introduce you to the process of running an evaluation using the Inspect library. This is a library that has been built by UK AISI for ease-of-use, and to standardise LLM evaluations. In this section, we'll actually get to run an evaluation on large language models, and notice the choices that can be made in eval design (how questions are presented, whether the model can use CoT, whether we use many-shot, few-shot, or zero-shot prompting etc.).

Similarly to [3.1] and [3.2], most of our solutions will be built around the example tendency to seek power dataset. So you should approach the exercises with this in mind. You can either:

Use the same dataset as the solutions, and implement the evals that we run in the solutions. You could also extend this, and come up with your own evaluation procedure on this dataset.
Use your own dataset for your chosen model property. For this, you will have to work through sections [3.1] and [3.2] and have a dataset of ~300 questions to use when you evaluate the LLMs.

Each exercise will have a difficulty and importance rating out of 5, as well as an estimated maximum time you should spend on these exercises and sometimes a short annotation. You should interpret the ratings & time estimates relatively (e.g. if you find yourself spending about 50% longer on the exercises than the time estimates, adjust accordingly). Please do skip exercises / look at solutions if you don't feel like they're important enough to be worth doing, and you'd rather get to the good stuff!

For a lecture on the material today, which provides some high-level understanding before you dive into the material, watch the video below:

Content & Learning Objectives

1️⃣ Intro to Inspect

Learning Objectives

Understand the big picture of how Inspect works

Understand the components of a Task object

Turn our json dataset into an Inspect dataset

Understand the role of solvers and scorers in Inspect

2️⃣ Writing Solvers

Learning Objectives

Understand what solvers are and how to build one

Understand how prompting affects model responses to your questions

Think of how you want your evaluation to proceed and write solvers for this

3️⃣ Writing Tasks and Evaluating

Learning Objectives

Understand why we have to shuffle choices in MCQs

Understand how to write a task in Inspect to carry out our evaluation

Get baselines on our model to determine whether it can understand the questions we're asking

Run a task on our model, and check the results

Understand how Inspect stores log files, and how to extract & visualise data from them

Setup

import os
import random
import re
import sys
from functools import partial
from pathlib import Path
from pprint import pprint
from typing import Any, Literal

from anthropic import Anthropic
from dotenv import load_dotenv
from inspect_ai import Task, eval, task
from inspect_ai.dataset import Dataset, Sample, example_dataset, hf_dataset, json_dataset
from inspect_ai.model import ChatMessageSystem, ChatMessageUser, get_model
from inspect_ai.scorer import Score, Scorer, Target, answer, match, model_graded_fact, scorer
from inspect_ai.solver import (
    Choices,
    Generate,
    Solver,
    TaskState,
    chain,
    chain_of_thought,
    generate,
    self_critique,
    solver,
)
from openai import OpenAI

# Make sure exercises are in the path
chapter = "chapter3_llm_evals"
section = "part3_running_evals_with_inspect"
root_dir = next(p for p in Path.cwd().parents if (p / chapter).exists())
exercises_dir = root_dir / chapter / "exercises"
section_dir = exercises_dir / section
if str(exercises_dir) not in sys.path:
    sys.path.append(str(exercises_dir))

import part3_running_evals_with_inspect.tests as tests

MAIN = __name__ == "__main__"

Reminder - how to set up your OpenAI API keys, before running the code below

OpenAI: If you haven't already, go to https://platform.openai.com/ to create an account, then create a key in 'Dashboard'-> 'API keys'.
Anthropic: If you haven't already, go to https://console.anthropic.com/ to create an account, then select 'Get API keys' and create a key.

If you're in Google Colab, you should be able to set API Keys from the "secrets" tab on the left-side of the screen (the key icon). If in VSCode, then you can create a file called ARENA_3.0/.env containing the following:

OPENAI_API_KEY = "your-openai-key"
ANTHROPIC_API_KEY = "your-anthropic-key"

In the latter case, you'll also need to run the load_dotenv() function, which will load the API keys from the .env file & set them as environment variables. (If you encounter an error when the API keys are in quotes, try removing the quotes).

Once you've done this (either the secrets tab based method for Colab or .env-based method for VSCode), you can get the keys as os.getenv("OPENAI_API_KEY") and os.getenv("ANTHROPIC_API_KEY") in the code below. Note that the code OpenAI() and Anthropic() both accept an api_key parameter, but in the absence of this parameter they'll look for environment variables with the names OPENAI_API_KEY and ANTHROPIC_API_KEY - which is why it's important to get the names exactly right when you save your keys!

load_dotenv()

assert os.getenv("OPENAI_API_KEY") is not None, "You must set your OpenAI API key - see instructions in dropdown"
assert os.getenv("ANTHROPIC_API_KEY") is not None, "You must set your Anthropic API key - see instructions in dropdown"

# OPENAI_API_KEY

openai_client = OpenAI()
anthropic_client = Anthropic()