2️⃣ Building a Simple Arithmetic Agent
Learning Objectives
- Learn how to use function calling to allow LLMs to use external tools.
- Understand the main functionalities of an LLM agent.
In general, most LLM agents share these core components:

- LLM API interface: A basic function that makes API calls (e.g. generate()).
- Actions/Tools: A set of actions the agent can take.
- Task State Management: Keeping track of the current state of the task and any relevant context.
- Memory: A way to store and retrieve information from past interactions (i.e. chat history). In Inspect, we store it in state.messages.
- Observation Parser: Functions to parse and interpret the results of actions and update the state.
- Decision/Execution Logic: The rules or algorithms used to choose actions based on the current state and LLM output.
- Task-Specific Information: Any additional information or functions specific to the task at hand.
These components are implemented across the Task, Agent, and Tool functions/classes. However, the specific breakdown of these components in our implementation is a design choice and can vary depending on the task. While some are very natural (e.g. LLM API interface goes into Agent, task state management goes into Task), others can vary (e.g. Tools could be implemented and handled entirely within the Task or Agent, as opposed to being separate functions; observation parsing could be in the Task or the Agent class). In general, we want to maximize separability and minimize interfaces/dependencies, so that we can easily swap out different agents for the same task, or vice versa.
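To make this breakdown concrete, here is a minimal, hypothetical sketch of how these components can fit together. All names here are illustrative (this is not Inspect's API), and the model call is a stub:

```python
class ToyTask:
    """Task: holds task state, provides instructions, parses/scores output."""

    def __init__(self):
        self.steps = ["step 1", "step 2"]
        self.current = 0  # task state management

    def instruction(self) -> str:
        return f"Please complete: {self.steps[self.current]}"

    def score(self, output: str) -> bool:
        return "done" in output  # observation parsing

    def advance(self) -> None:
        self.current += 1

    @property
    def complete(self) -> bool:
        return self.current >= len(self.steps)


class ToyAgent:
    """Agent: wraps the LLM API interface, memory, and decision logic."""

    def __init__(self, task: ToyTask):
        self.task = task
        self.messages: list[str] = []  # memory (chat history)

    def generate(self, prompt: str) -> str:
        return f"done: {prompt}"  # stub for a real LLM API call

    def run(self) -> list[str]:
        while not self.task.complete:
            instruction = self.task.instruction()
            self.messages.append(instruction)
            output = self.generate(instruction)  # LLM API interface
            self.messages.append(output)  # update memory
            if self.task.score(output):  # parse observation...
                self.task.advance()  # ...and update task state
        return self.messages


agent = ToyAgent(ToyTask())
history = agent.run()
```

Note how swapping in a different task only requires changing ToyTask, not the agent loop — this is the separability we want.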
Task
In an LLM agent eval, there will usually be a Task class that interacts with the Agent. In general, the Task will:
- Prepare and provide the task instruction (and necessary files, functions etc) to the agent
- Parse and score the agent's output
- Update the task state accordingly (e.g. proceed to the next step of the task, or end the task).
Exercise - Build a simple arithmetic task
First build a toy task called ArithmeticTask. This task should take in two numbers and create a list of arithmetic calculation problems with these two numbers, using the binary arithmetic operations defined in ArithmeticTask.operations. It should have methods to do the following:
- Output the current problem (e.g. at the start this will be "Calculate num1 + num2").
- Generate and store the correct answers to the problems.
- Move to the next problem if the model's answer was correct (or if the model refuses to answer the question).
- Output the instruction for the current problem.
- Check if all the problems have been solved.
How to handle calculations? We have implemented a helper function evaluate_expression() to evaluate the arithmetic expressions, which you should use in your implementation of _generate_answers() (and later in the calculator tool's execute()). evaluate_expression() takes an arithmetic expression as a string (e.g. "3+5") and returns the result as a string (e.g. "8.0").
Aside: Why not use Python's in-built eval() function?
Python's eval() function evaluates an arbitrary string expression, and so allows AI models to run arbitrary code. Unless you have set up a container or sandboxed environment, it is very bad practice to allow LLMs to run arbitrary code on your computer!
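For intuition, here is one way a helper like evaluate_expression() might be implemented safely with Python's ast module. This is an assumed sketch, not the provided implementation — use the supplied helper in your solution:

```python
import ast
import operator

# Whitelisted binary operators — anything outside this dict is rejected.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
    ast.Div: operator.truediv, ast.Pow: operator.pow,
    ast.FloorDiv: operator.floordiv, ast.Mod: operator.mod,
}


def safe_evaluate(expression: str) -> str:
    """Evaluate an arithmetic expression without ever calling eval()."""

    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        # Anything else (function calls, names, attributes) is refused.
        raise ValueError(f"Unsupported expression: {expression!r}")

    return str(float(_eval(ast.parse(expression, mode="eval").body)))


print(safe_evaluate("3 + 5"))  # 8.0
```

Parsing to an AST and walking only whitelisted node types means an expression like `__import__('os')` raises an error instead of executing.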
class ArithmeticTask:
    def __init__(self, num1: int | float, num2: int | float, operations: list[str] | None = None):
        self.num1 = num1
        self.num2 = num2
        self.operations = operations if operations else ["+", "-", "*", "/", "**", "//", "%"]
        self.current_task_number = 0

    def _generate_answers(self) -> list[str]:
        """
        Generates a list of the correct answers for all the possible tasks

        Returns:
            list[str]: A list of the correct answers for all the possible tasks
        """
        raise NotImplementedError("You need to implement _generate_answers")

    @property
    def get_current_task(self) -> str:
        raise NotImplementedError("You need to implement get_current_task")

    def update_current_task(self) -> None:
        """
        Increments self.current_task_number by one (modulo the number of operations)
        """
        raise NotImplementedError("You need to implement update_current_task")

    def get_current_instruction(self) -> ChatMessageUser:
        raise NotImplementedError("You need to implement get_current_instruction")
arithmetic_task1 = ArithmeticTask(3, 5)
print(arithmetic_task1.get_current_task)
arithmetic_task1.update_current_task()
print(arithmetic_task1.get_current_task)
print(arithmetic_task1.get_current_instruction())
Click to see the expected output
3 + 5
3 - 5
id='mcqhxvsD2B9T92JLTP7GBm' content='Calculate the following expression 3 - 5. Give your answer in the format <ANSWER>NUMBER</ANSWER> where NUMBER is a numerical value formatted as a float.' source=None metadata=None role='user' tool_call_id=None
Aside - What is @property?
The @property decorator in Python is used to define methods that behave as if they were attributes.
- It allows you to access a method as though it were an attribute, without parentheses.
- It allows you to perform functions when calling attributes, e.g. adding validation or performing any necessary calculations (in our case incorporating class attributes which frequently change).
For example, if we defined a Square class as follows:
class Square:
    def __init__(self, side_length):
        self.side_length = side_length

    @property
    def perimeter(self):
        return self.side_length * 4
Then we could access perimeter as if it were an attribute:
s = Square(4)
print(s.perimeter) # Output: 16
Using @property in this case helps with:
1. Making the intent of the code clearer
2. Making it slightly easier to access these "properties" of the class
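To illustrate the validation point, here's a hypothetical extension of the Square example using a property setter:

```python
class Square:
    def __init__(self, side_length):
        self.side_length = side_length  # routed through the setter below

    @property
    def side_length(self):
        return self._side_length

    @side_length.setter
    def side_length(self, value):
        # Validation runs on every assignment, including in __init__
        if value <= 0:
            raise ValueError("side_length must be positive")
        self._side_length = value

    @property
    def perimeter(self):
        return self.side_length * 4


s = Square(4)
s.side_length = 5  # validated like any other assignment
print(s.perimeter)  # 20
```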
Solution
class ArithmeticTask:
    def __init__(self, num1: int | float, num2: int | float, operations: list[str] | None = None):
        self.num1 = num1
        self.num2 = num2
        self.operations = operations if operations else ["+", "-", "*", "/", "**", "//", "%"]
        self.current_task_number = 0

    def _generate_answers(self) -> list[str]:
        """
        Generates a list of the correct answers for all the possible tasks

        Returns:
            list[str]: A list of the correct answers for all the possible tasks
        """
        answers = []
        for op in self.operations:
            try:
                result = evaluate_expression(f"{self.num1} {op} {self.num2}")
                answers.append(str(result))
            except Exception as e:
                answers.append(f"Error: {str(e)}")
        return answers

    @property
    def get_current_task(self) -> str:
        return f"{self.num1} {self.operations[self.current_task_number]} {self.num2}"

    def update_current_task(self) -> None:
        """
        Increments self.current_task_number by one (modulo the number of operations)
        """
        self.current_task_number = (self.current_task_number + 1) % len(self.operations)

    def get_current_instruction(self) -> ChatMessageUser:
        return ChatMessageUser(
            content=f"Calculate the following expression {self.get_current_task}. Give your answer in the format <ANSWER>NUMBER</ANSWER> where NUMBER is a numerical value formatted as a float."
        )
arithmetic_task1 = ArithmeticTask(3, 5)
print(arithmetic_task1.get_current_task)
arithmetic_task1.update_current_task()
print(arithmetic_task1.get_current_task)
print(arithmetic_task1.get_current_instruction())
Tool use via function calling
The simplest way for LLMs to take actions is via function calling. Function calling is a built-in feature of many LLM APIs that allows models to use external "tools" (e.g. Python functions, APIs) by simply receiving and outputting text. This involves 5 simple steps:
- Pick a function in your codebase that the model should be able to call
- Describe your function in the syntax of the model's API so the model knows how to call it
- Pass your function definitions as available “tools” to the model, along with the messages
- Receive and handle the model response
- Provide the function call result back to the model
This loop of prompting the LLM with tools, executing its actions, and returning the results forms the basis of all LLM agents. It allows LLMs to perform complex tasks like playing a game or completing a coding project "autonomously".
We will implement each step of the loop below.
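As a preview, here is a hedged, self-contained sketch of the five steps against a raw chat API. The model call is stubbed out (fake_model) so the sketch runs without an API key; the schema follows the OpenAI-style format:

```python
import json


# Step 1: a function in our codebase the model should be able to call.
def get_delivery_date(order_id: str) -> str:
    return f"Order {order_id} arrives 2025-01-01"  # stand-in lookup


# Step 2: describe the function in the API's tool schema.
tool_description = {
    "type": "function",
    "function": {
        "name": "get_delivery_date",
        "description": "Get the delivery date for a customer's order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}


def fake_model(messages: list[dict], tools: list[dict]) -> dict:
    # Stub for the LLM API: it "decides" to call the tool.
    return {"tool_call": {"name": "get_delivery_date",
                          "arguments": json.dumps({"order_id": "123"})}}


# Step 3: pass the tool definitions along with the messages.
messages = [{"role": "user", "content": "Where is my package?"}]
response = fake_model(messages, tools=[tool_description])

# Step 4: handle the response — here, a requested tool call.
if "tool_call" in response:
    call = response["tool_call"]
    args = json.loads(call["arguments"])
    result = get_delivery_date(**args)
    # Step 5: feed the result back to the model as a tool message.
    messages.append({"role": "tool", "content": result})
```

With a real API, fake_model would be replaced by an actual model call with a tools parameter, and the loop would repeat until the model stops requesting tools.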
Exercise - Write CalculateTool
We will define a tool class for our simple calculate() function with the following structure (you don't need to run this code):
@tool
def tool_name(function_args) -> Tool:
    async def execute(tool_args: type) -> type:
        # tool logic
        return tool_output

    return execute
For the CalculateTool, you should implement execute(), which should take in an arithmetic expression as a string (e.g. "3+5") and return the result of this expression (also as a string).
Tool Description
You also need to make sure that your docstring for the tool is formatted correctly, as inspect reads the docstring in order to determine what information about the tool is passed to the LLM. Models like ChatGPT and Claude are fine-tuned to interpret and respond to tool descriptions appropriately, just like user and system messages.
To understand what's being fed to the LLM API, below is an example of a typical tool description for the OpenAI API (see their function calling guide for more details). Note that this may differ slightly between APIs.
{
    "type": "function",
    "function": {
        "name": "get_delivery_date",
        "description": "Get the delivery date for a customer's order. Call this whenever you need to know the delivery date, for example when a customer asks 'Where is my package'",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The customer's order ID."
                }
            },
            "required": ["order_id"],
            "additionalProperties": false
        }
    }
}
For Inspect, you should make sure that the types of the arguments are included, and that the docstring is formatted as given below. Here's a toy example of a tool that simply takes a number and a string as arguments, and appends the number to the end of the string.
@tool
def tool_name():
    async def execute(sentence: str, n: int) -> str:
        """
        This tool appends a number to the end of the string.

        Args:
            sentence: this is the string to which you want to append a number.
            n: this is the number you want to append.

        Returns:
            The sentence with the number appended.
        """
        return sentence + str(n)

    return execute
Inspect will then find the argument types, descriptions, and the tool description and create a formatted tool description for us. E.g. in the above case, the tool description would look like:
{
    "type": "function",
    "function": {
        "name": "tool_name",
        "description": "This tool appends a number to the end of the string.",
        "parameters": {
            "type": "object",
            "properties": {
                "sentence": {
                    "type": "string",
                    "description": "this is the string to which you want to append a number."
                },
                "n": {
                    "type": "integer",
                    "description": "this is the number you want to append."
                }
            },
            "required": ["sentence", "n"],
            "additionalProperties": false
        }
    }
}
Good practices for writing tool descriptions
Here are some good practices for writing tool descriptions for Claude according to Anthropic, which should generalize to other chat models:

- Provide extremely detailed descriptions. This is by far the most important factor in how effectively the model uses the tool. Your descriptions should explain every aspect of the tool, including:
  - What the tool does
  - When it should be used (and when it shouldn't, if this confuses the model)
  - What each parameter means and how it affects the tool's behavior
  - Any important caveats or limitations, such as what information the tool does not return if the tool name is unclear
- Prioritize descriptions over examples. While you can include examples of how to use a tool in its description or in the accompanying prompt, this is less important than having a clear and comprehensive explanation of the tool's purpose and parameters. Only add examples after you've fully fleshed out the description.
The more context you can give models about your tools, the better they will be at deciding when and how to use them. Aim for at least 3-4 sentences per tool description, more if the tool is complex.
Read Anthropic's examples of what good and bad tool calling looks like here.
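As an illustration of these guidelines, here is a hypothetical length-conversion tool with a detailed docstring in the Inspect format (the @tool decorator is omitted so the sketch is self-contained and runnable):

```python
async def execute(value: float, from_unit: str, to_unit: str) -> str:
    """
    Converts a numeric length between metric units. Use this tool whenever a
    measurement needs to be expressed in a different metric unit than the one
    given. Do not use it for imperial units, currencies, or temperatures: it
    only supports mm, cm, m, and km. If a unit is unsupported, it returns an
    error string rather than raising an exception.

    Args:
        value: the numeric quantity to convert, expressed in from_unit.
        from_unit: the unit of the input value; one of "mm", "cm", "m", "km".
        to_unit: the unit to convert into; one of "mm", "cm", "m", "km".

    Returns:
        The converted value as a string, or an error message if a unit is unsupported.
    """
    to_metres = {"mm": 0.001, "cm": 0.01, "m": 1.0, "km": 1000.0}
    if from_unit not in to_metres or to_unit not in to_metres:
        return f"Error: unsupported unit pair {from_unit!r} -> {to_unit!r}"
    return str(value * to_metres[from_unit] / to_metres[to_unit])
```

Note how the docstring covers what the tool does, when (not) to use it, what each parameter means, and its error behaviour — without needing any examples.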
Write your tool function for the calculate() tool below.
@tool
def calculate():
    async def execute(expression: str) -> str:
        raise NotImplementedError(
            "You need to implement the calculate tool and make sure the docstring is formatted correctly"
        )

    return execute
Solution
@tool
def calculate():
    async def execute(expression: str) -> str:
        """
        A calculator that can evaluate arithmetic expressions. The input is a mathematical expression, as a string, and the output is the result of the calculation.

        Args:
            expression: the arithmetic expression to evaluate.

        Returns:
            The result of the calculation, as a string, or an error message if the expression is invalid.
        """
        try:
            return str(evaluate_expression(expression))
        except Exception as e:
            return f"Error: {e}"

    return execute
Outside of inspect, tools are passed to the model by inserting the tool description in the tools parameter of the API call.
Agent
We will now implement an ArithmeticAgent class that is not specific to the ArithmeticTask, so that we can see the key components of a generic LLM agent.
Exercise - Build an ArithmeticAgent
Now build our agent that will interact with the ArithmeticTask (with a calculator tool). You should implement the execute() function so that it runs the task in its entirety. For more complicated tasks, we'll implement a more modular agent design.
You should use inspect's generate() function in order to get the model to use tools and generate a response (you can pass the tools to generate() via the tools argument). If the model decides to call a tool, you should use inspect's execute_tools function in order to run the tool. Finally, you should check whether the model's answer is correct or not. We've provided a helper function that extracts answers wrapped in "<ANSWER></ANSWER>" tags from the content of the agent's response, accessible by utils.extract_answer().
Then if the answer is correct, you should update the task so that the model moves on to the next problem. Finally, if all the answers are solved correctly, the loop should stop, as the model has accomplished its task.
It may be useful for this exercise (and throughout this section) to refer to Inspect's agent docs here
@agent
def arithmetic_agent(task: ArithmeticTask):
    async def execute(state: AgentState) -> AgentState:
        raise NotImplementedError("You need to implement the arithmetic_agent")

    return execute
Solution
@agent
def arithmetic_agent(task: ArithmeticTask):
    async def execute(state: AgentState) -> AgentState:
        answer_list = ["wrong"] * len(task.operations)
        success = False
        while not success:
            state.messages.append(task.get_current_instruction())
            state.output = await get_model().generate(input=state.messages, tools=[calculate()], tool_choice="auto")
            state.messages.append(state.output.message)

            if state.output.message.tool_calls:
                messages, state.output = await execute_tools(state.messages, tools=[calculate()])
                state.messages.extend(messages)
                state.output = await get_model().generate(input=state.messages, tools=[calculate()], tool_choice="none")
                state.messages.append(state.output.message)

            try:
                if extract_answer(state.output.message.content) == task._generate_answers()[task.current_task_number]:
                    answer_list[task.current_task_number] = extract_answer(state.output.message.content)
                    task.update_current_task()
                else:
                    state.messages.append(ChatMessageUser(content="Incorrect answer. Try again."))
            except IndexError:
                state.messages.append(ChatMessageUser(content="Error: Could not extract answer"))

            if all(ans == task._generate_answers()[i] for i, ans in enumerate(answer_list)):
                success = True
        return state

    return execute
Exercise - Use eval to run the agent
Now let's use Inspect's eval() function to run the agent. We can turn our agent into a "solver" using Inspect's as_solver function. We don't need to pass a scorer in this case, as we'll mostly be reading the agent's output. However, since there is a clear-cut goal, you can write a scorer if you'd like. Refer back to section 3.3, Scorers, if you don't remember how to do this. (We'll pass in an AgentState instead of a TaskState in this case.)
WARNING!
When you're making API calls to LLMs to accomplish longer tasks, it can be tempting to use a while loop and run the model until it finishes the task. But since every time we run a model we make an API call, this would allow us to spend arbitrarily large amounts of money on API calls. For this reason, always set a limit to your loop when making API calls. It would be really unfortunate if you blew all your API budget on one mistake.
Inspect allows us to set a message_limit that restricts how many messages can occur in the conversation. This will be useful to ensure the model doesn't run for too long.
@task
def agent_task() -> Task:
    return Task(dataset=[Sample(input="", target="")], message_limit=40)


eval(agent_task(), solver=as_solver(arithmetic_agent(task=ArithmeticTask(3, 5))))