1️⃣ Intro to LLM Agents
Learning Objectives
- Read resources about LLM agent evaluations to understand the current state of the field.
- Understand the common failure modes of LLM agents.
What is an LLM agent?
An LLM agent consists of a scaffolding program interacting with an LLM API to accomplish tasks in an external environment. This typically involves a loop of the following steps:
- The scaffolding program sends instructions to the LLM, typically containing information about the task goal, the possible actions available to the LLM, and any other relevant task information (e.g. "you are trying to calculate `3+3`").
- The LLM processes the input and outputs an action in text (e.g. it "calls" the `calculate()` tool on the expression `3+3` in text).
- The scaffolding program executes the action and returns the outcome (e.g. it runs the `calculate()` function in the background and returns the output, `6`, to the agent).
- The LLM observes the result and decides on the next action.
- This cycle repeats until the task is complete.
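To make this loop concrete, here is a minimal, self-contained Python sketch of the cycle. The LLM call is replaced by a toy stand-in function, and `calculate()` is a hypothetical tool; a real scaffolding program would call an LLM API and parse its output much more carefully.

```python
from typing import Callable


def calculate(expression: str) -> str:
    """Toy tool: evaluate a simple arithmetic expression."""
    return str(eval(expression))  # fine for a toy sketch; never eval untrusted input


TOOLS: dict[str, Callable[[str], str]] = {"calculate": calculate}


def call_llm(chat_history: list[str]) -> str:
    """Stand-in for a real LLM API call: asks for the calculator, then answers."""
    if any(msg.startswith("Observation") for msg in chat_history):
        return "FINAL ANSWER: 6"
    return "ACTION: calculate(3+3)"


def run_agent(task: str, max_steps: int = 5) -> str:
    # Prompting: describe the task and the available tools to the LLM
    chat_history = [f"Task: {task}. You may use the tool calculate(<expression>)."]
    for _ in range(max_steps):
        output = call_llm(chat_history)   # the LLM outputs an action in text
        chat_history.append(output)
        if output.startswith("ACTION:"):  # tool calling: execute on the LLM's behalf
            name, _, arg = output.removeprefix("ACTION: ").partition("(")
            observation = TOOLS[name](arg.rstrip(")"))
            chat_history.append(f"Observation: {observation}")  # return the outcome
        else:                             # no tool call, so the LLM has finished
            return output
    return "Agent did not finish within the step limit."


print(run_agent("calculate 3+3"))  # -> FINAL ANSWER: 6
```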
The two basic components of scaffolding are:
- Tool calling: This allows LLMs to use tools by providing a text description of each tool. The LLM can choose to use a tool by "calling" it in its text output. If it does, the scaffolding executes the tool on the LLM's behalf (e.g. by running a Python function, sending a request to an external API, etc.) and returns the result of the tool call to the agent.
- Prompting: This describes the task state to the LLM, describes the tools available to it in the task, and potentially instructs the LLM to use chain-of-thought to give it more "thinking time." It also covers how the LLM's `chat_history` of prior actions and observations is stored (a rough sketch of both components follows the diagram below).

Diagram based on METR's Evaluating Language-Model Agents on Realistic Autonomous Tasks, Figure 2.
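As a rough illustration of these two components in practice, the sketch below shows a single tool-use call using the OpenAI chat completions API (see the function calling guide in the further resources). The tool schema, model name, and message contents are illustrative assumptions, and exact field names may differ between API versions.

```python
from openai import OpenAI

client = OpenAI()

# Tool calling: a structured text description of the tool the model may "call"
tools = [{
    "type": "function",
    "function": {
        "name": "calculate",
        "description": "Evaluate an arithmetic expression, e.g. '3+3'.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}]

# Prompting: the task description plus the chat_history of prior messages
chat_history = [
    {"role": "system", "content": "You are an agent. Think step by step and use tools when helpful."},
    {"role": "user", "content": "Calculate 3+3."},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative; any tool-use-capable chat model
    messages=chat_history,
    tools=tools,
)

# If the model chose to call a tool, the scaffolding would execute it and
# append the result to chat_history before calling the model again.
print(response.choices[0].message.tool_calls)
```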
Why evaluate LLM agents?
There are at least two reasons we want to evaluate LLM agents.
- Measuring the maximum capabilities of a model
For estimating safety risks, we want to measure the ceiling of dangerous capabilities. LLMs on their own often fail in easy-to-fix ways, as you will see. For example:
- They often claim to be incapable of tasks that they can actually perform.
- They can very easily get stuck in loops.
- They can give up and ask the user for help.
- They can hallucinate facts, or even misunderstand their own prior reasoning and hallucinate a faulty conclusion.
- They can be limited by primitive tools.
- They can be sensitive in strange ways to information in their prompts.
- They can have bugs, typos, or other minor barriers that prevent them from operating to the fullest extent of their capability.
This means that when a model fails to accomplish a task, it may still have the capability to succeed, requiring only simple fixes to unlock that capability. We want to rule out the possibility that large capability improvements could come from relatively little effort, because otherwise our evaluation would underestimate the true capability and risks associated with a model (especially in, e.g., a dangerous capabilities evaluation). Therefore, we try hard to elicit their raw capabilities (e.g. using scaffolding), so that we can evaluate LLMs at their best.
- Measuring the alignment of LLMs in agentic scenarios
We do not know whether our current techniques for aligning LLM chatbots (e.g. supervised fine-tuning, RLHF) will still work when LLMs act as agents in more complex scenarios. It is possible that these methods will not generalize well to agentic scenarios, and we may want to test this.
We know today that LLMs are being used as more than just chatbots. Since the release of ChatGPT, the use of LLMs as agentic systems has grown significantly. These agents started off rather disappointingly when they were based on GPT-3.5. However, as more powerful LLMs are released and AI companies train their models to be better at tool use, these agents are improving rapidly.
Further resources on LLM agent evaluations
- [Evaluating Language-Model Agents on Realistic Autonomous Tasks](https://evals.alignment.org/Evaluating_LMAs_Realistic_Tasks.pdf) (Kinniment et al., ARC Evaluations Team (now METR), 2023)
- [Large Language Models can Strategically Deceive their Users when Put Under Pressure](https://arxiv.org/pdf/2311.07590) (Scheurer et al., Apollo Research, ICLR 2024)
- [LLM Powered Autonomous Agents](https://lilianweng.github.io/posts/2023-06-23-agent/) (Lilian Weng, OpenAI Safety Team, 2023)
- [AXRP Episode 34 - AI Evaluations with Beth Barnes](https://www.alignmentforum.org/posts/vACr4DExfeRMaCoo7/axrp-episode-34-ai-evaluations-with-beth-barnes) (Daniel Filan, 2024)
- [Reflexion: Language Agents with Verbal Reinforcement Learning](https://arxiv.org/pdf/2303.11366) (Shinn et al., 2023)
- [Answering Questions by Meta-Reasoning over Multiple Chains of Thought](https://arxiv.org/pdf/2304.13007) (Yoran et al., 2024)
- [Toolformer: Language Models Can Teach Themselves to Use Tools](https://arxiv.org/pdf/2302.04761) (Schick et al., Meta AI Research, 2023)
- [OpenAI Function Calling Guide](https://platform.openai.com/docs/guides/function-calling)
- [Anthropic Function Calling Guide](https://docs.anthropic.com/en/docs/build-with-claude/tool-use)