Chapter 3: LLM Evaluations
Learn to build and run evaluations for large language models, covering dataset generation, standardised eval tooling, and LLM agents.
Sections
3.1 Intro to Evals
Design threat models and specifications for evaluating model properties.
3.2 Dataset Generation
Use LLMs to generate and refine high-quality evaluation datasets.
3.3 Running Evals with Inspect
Run standardised LLM evaluations using UK AISI's Inspect library.
3.4 LLM Agents
Build LLM agents with scaffolding to play Wikipedia Racing and perform other tasks.