[3.2] - Dataset Generation
Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material.
If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme.
Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.

Introduction
The goal of this section is to learn how to use LLMs to generate and refine an evaluation dataset. By the end, you will have generated a (hopefully) high-quality dataset of ~300 questions.
We will begin by taking the questions that were designed in section 3.1, and using LLMs to generate further questions. We'll need to ensure the questions that are generated are high quality, and high enough variance, while still measuring the model property we want to measure.
Then, once we have a larger set of questions, we will score them (also using LLMs!), in order to filter out the low-quality questions, and any questions that aren't formatted correctly. Finally, we'll do some manual quality checks, and then do checks for any potential biases in the order of answers or distribution of questions, and ensure that there are no duplicate or near-duplicate questions.
At the end of this section, we'll have a large dataset of questions measuring our model-property. With these we can then conduct an actual model evaluation in section 3.3.
Each exercise will have a difficulty and importance rating out of 5, as well as an estimated maximum time you should spend on these exercises and sometimes a short annotation. You should interpret the ratings & time estimates relatively (e.g. if you find yourself spending about 50% longer on the exercises than the time estimates, adjust accordingly). Please do skip exercises / look at solutions if you don't feel like they're important enough to be worth doing, and you'd rather get to the good stuff!
For a lecture on the material today, which provides some high-level understanding before you dive into the material, watch the video below:
Content & Learning Objectives
1️⃣ Dataset Generation
Learning Objectives
- Learn how to use the structured output feature of the OpenAI API to generate multiple-choice questions
- Learn the basics of prompt engineering, including how to write prompts for MCQ generation and few-shot prompting
- Learn how to make API calls concurrently with
ThreadPoolExecutor
2️⃣ Dataset Quality Control
Learning Objectives
- Learn how to write a rubric for LLMs to evaluate questions accurately and catch flaws in the dataset
- Learn how to improve the generated datset by iteratively generating, observing flaws, modifying the rubrics, repeat.
- Generate a dataset of ~300 questions using LLMs.
Setup code
import json
import os
import random
import sys
import time
import warnings
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from pprint import pprint
from typing import Literal, Type, TypeAlias
import instructor
import numpy as np
import pandas as pd
import plotly.express as px
from anthropic import Anthropic
from dotenv import load_dotenv
from openai import OpenAI
from pydantic import BaseModel
from tabulate import tabulate
# Make sure exercises are in the path
chapter = "chapter3_llm_evals"
section = "part2_dataset_generation"
root_dir = next(p for p in Path.cwd().parents if (p / chapter).exists())
exercises_dir = root_dir / chapter / "exercises"
section_dir = exercises_dir / section
if str(exercises_dir) not in sys.path:
sys.path.append(str(exercises_dir))
import part2_dataset_generation.tests as tests
from part1_intro_to_evals.solutions import retry_with_exponential_backoff
from part2_dataset_generation.utils import pretty_print_questions
MAIN = __name__ == "__main__"
Reminder - how to set up your OpenAI API keys, before running the code below
- OpenAI: If you haven't already, go to https://platform.openai.com/ to create an account, then create a key in 'Dashboard'-> 'API keys'. - Anthropic: If you haven't already, go to https://console.anthropic.com/ to create an account, then select 'Get API keys' and create a key.
If you're in Google Colab, you should be able to set API Keys from the "secrets" tab on the left-side of the screen (the key icon). If in VSCode, then you can create a file called ARENA_3.0/.env containing the following:
OPENAI_API_KEY = "your-openai-key"
ANTHROPIC_API_KEY = "your-anthropic-key"
In the latter case, you'll also need to run the load_dotenv() function, which will load the API keys from the .env file & set them as environment variables. (If you encounter an error when the API keys are in quotes, try removing the quotes).
Once you've done this (either the secrets tab based method for Colab or .env-based method for VSCode), you can get the keys as os.getenv("OPENAI_API_KEY") and os.getenv("ANTHROPIC_API_KEY") in the code below. Note that the code OpenAI() and Anthropic() both accept an api_key parameter, but in the absence of this parameter they'll look for environment variables with the names OPENAI_API_KEY and ANTHROPIC_API_KEY - which is why it's important to get the names exactly right when you save your keys!
load_dotenv()
assert os.getenv("OPENAI_API_KEY") is not None, (
"You must set your OpenAI API key - see instructions in dropdown"
)
assert os.getenv("ANTHROPIC_API_KEY") is not None, (
"You must set your Anthropic API key - see instructions in dropdown"
)
openai_client = OpenAI()
anthropic_client = Anthropic()