Exercise Status: All exercises complete and verified

1️⃣ CoT Infrastructure & Sentence Taxonomy

Learning Objectives
  • Load and explore the MATH reasoning dataset structure (prompts, CoT traces, answers)
  • Implement sentence segmentation for chain-of-thought traces using regex patterns
  • Build both rule-based and LLM-based classifiers to categorize reasoning steps (problem_setup, active_computation, fact_retrieval, etc.)
  • Visualize how different sentence types are distributed across reasoning traces

In this section, we'll build infrastructure to analyze chain-of-thought reasoning by breaking it into meaningful units. When models generate long reasoning traces, we need to decide what the right "units of reasoning" are.

Tokens are too granular: a single reasoning step like "Let's calculate the area: $\pi r^2$" spans multiple tokens, and splitting it apart loses semantic meaning. Sentences are a more natural unit. They correspond to complete thoughts, they're granular enough to identify specific important steps, and they align with how humans chunk reasoning.

By splitting CoT traces into sentences and categorizing them (calculations, lookups, restatements, etc.), we can ask: which types are most critical for reaching the correct answer? This sentence-level analysis sets us up for the black-box and white-box methods in later sections.

Model Setup & Dataset Inspection

We'll start by setting a few constants (the purpose of these will become clear as we go through the exercises):

# Configuration
MODEL_NAME_1B = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
MODEL_NAME_8B = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
MODEL_NAME_14B = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
DATASET_NAME = "uzaymacar/math-rollouts"
BLACKMAIL_DATASET_NAME = "uzaymacar/blackmail-rollouts"
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
SIMILARITY_THRESHOLD = 0.8

Next, we'll load an embedding model. Embedding models take in text and output a vector - we'll use them to measure similarity between sentences, so we can find motifs in our reasoning traces.

embedding_model = SentenceTransformer(EMBEDDING_MODEL)
print(embedding_model)
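
As a quick sanity check of the interface (the exact dimensionality is a property of all-MiniLM-L6-v2), encode maps a piece of text to a fixed-size vector:

# encode() returns a numpy array; all-MiniLM-L6-v2 produces 384-dimensional embeddings
example_vec = embedding_model.encode("Let's calculate the area of the circle.")
print(example_vec.shape)  # expected: (384,)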

To understand how embedding models work, let's look at cosine similarities between a few example sentences:

prompts = [
    "Wait, I think I made an error in my reasoning and need to backtrack",
    "Hold on, I believe I made a mistake in my logic and should reconsider",
    "After careful analysis, I've determined the correct answer is 42",
    "Time is an illusion. Lunchtime doubly so.",
]
labels = [x[:35] + "..." for x in prompts]

embeddings = embedding_model.encode(prompts, normalize_embeddings=True)
cosine_sims = embeddings @ embeddings.T  # unit-norm rows, so dot products are cosine similarities

fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(cosine_sims, cmap="RdBu", vmin=-1, vmax=1)
ax.set_xticks(range(len(labels)))
ax.set_yticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=45, ha="right", fontsize=9)
ax.set_yticklabels(labels, fontsize=9)
plt.colorbar(im, label="Cosine Similarity")
plt.title("Sentence Embedding Similarity")
plt.tight_layout()
plt.show()
Click to see the expected output

We can also load the dataset that the paper's authors open-sourced. The dataset is very large, but its file structure is documented on the HuggingFace page, so we can use huggingface_hub to download just the files we want. We'll inspect it more shortly.

PROBLEM_ID = 4682

path = f"deepseek-r1-distill-llama-8b/temperature_0.6_top_p_0.95/correct_base_solution/problem_{PROBLEM_ID}/base_solution.json"
local_path = hf_hub_download(repo_id=DATASET_NAME, filename=path, repo_type="dataset")

with open(local_path, "r") as f:
    problem_data = json.load(f)

print("Keys in problem data:", list(problem_data.keys()))
print(f"\nProblem prompt:\n{problem_data['prompt']}")
Click to see the expected output
Keys in problem data: ['prompt', 'solution', 'full_cot', 'answer', 'is_correct']

Problem prompt:
Solve this math problem step by step. You MUST put your final answer in \boxed{}. Problem: When the base-16 number $66666_{16}$ is written in base 2, how many base-2 digits (bits) does it have? Solution: 
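
As a quick sanity check (independent of the dataset), we can work out the answer to this particular problem ourselves: 0x66666 = 419430, and its leading hex digit 6 = 0110 in binary starts with a zero, so the number needs 5 * 4 - 1 = 19 bits. The dataset also records the answer and whether this base solution was correct:

# Sanity check: 0x66666 = 419430; leading hex digit 6 = 0b0110 contributes only 3
# significant bits, so we expect 5 * 4 - 1 = 19 bits in total.
value = int("66666", 16)
print(value, value.bit_length())  # 419430 19
print("Recorded answer:", problem_data["answer"], "| marked correct:", problem_data["is_correct"])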

Sentence Splitting

First, we'll need to split our CoT traces into sentences based on punctuation and paragraph breaks. We'll also need to handle special tokens like <think>. We've provided the function split_solution_into_chunks below, which implements the following rules:

  • The <think> ... </think> tags should be removed
  • You should split on sentence-ending punctuation (., !, ?), on newlines, and on colons :
  • Periods . should not split decimal numbers (e.g. x.y) or numbered list markers (e.g. \n1.)
  • No chunk should have length less than 10: if so, then merge it with the next chunk
  • Each chunk should be stripped of whitespace

Read through the function and run the test cases below to verify it works on basic inputs.

After running the test cases, try to come up with your own input where this function breaks down or splits into chunks in a counter-intuitive way. Some things to consider: what happens with URLs, abbreviations (e.g. "Dr. Smith"), LaTeX expressions (e.g. $3.14$), or ellipses (...)? Once you've found an edge case, see if you can modify the function to handle it.

def split_solution_into_chunks(text: str) -> list[str]:
    """Split solution into sentence-level chunks."""
    # Remove thinking tags
    if "<think>" in text:
        text = text.split("<think>")[1]
    if "</think>" in text:
        text = text.split("</think>")[0]
    text = text.strip()

    # Replace "." characters which we don't want to split on
    text = re.sub(r"(\d)\.(\d)", r"\1<DECIMAL>\2", text)  # e.g. "4.5" -> "4<DECIMAL>5"
    text = re.sub(r"\n(\d)\.(\s)", r"\n\1<DECIMAL>\2", text)  # e.g. "\n1. " -> "\n1<DECIMAL> "

    # Split on sentence endings, combining endings with previous chunk
    sentences = re.split(r"([!?:\n]|(?<!\n\d)\.)", text)
    chunks = []
    for i in range(0, len(sentences) - 1, 2):
        chunks.append((sentences[i] + sentences[i + 1]).replace("\n", " "))

    # Replace <DECIMAL> back with "."
    chunks = [re.sub(r"<DECIMAL>", ".", c) for c in chunks]

    # Merge chunks that are too short
    if not chunks:
        return []
    merged = [chunks[0]]
    for c in chunks[1:]:
        if len(merged[-1]) < 10:
            merged[-1] += c
        else:
            merged.append(c)
    return [c.strip() for c in merged if c.strip()]


test_cases = [
    (
        "<think>First, I understand the problem. Next, I'll solve for x. Finally, I verify!</think>",
        ["First, I understand the problem.", "Next, I'll solve for x.", "Finally, I verify!"],
    ),
    (
        "<think>Let me break this down: 1. Convert to decimal. 2. Calculate log. 3. Apply formula.</think>",
        ["Let me break this down:", "1. Convert to decimal.", "2. Calculate log.", "3. Apply formula."],
    ),
    (
        "<think>The answer is 42. Done.</think>",
        ["The answer is 42.", "Done."],
    ),
]

for input_text, expected_chunks in test_cases:
    chunks = split_solution_into_chunks(input_text)
    assert chunks == expected_chunks, f"Expected {expected_chunks}, got {chunks}"

print("All tests passed!")
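
Going back to the edge-case question above, here is one probe you might try (a sketch; run it and judge whether the splits match your intuition). Abbreviations like "Dr." and URLs contain periods and colons that aren't sentence boundaries, while a decimal such as $3.14$ is protected by the decimal rule:

# Probes for potential edge cases: abbreviations, ellipses, URLs, and LaTeX decimals.
# The periods in "Dr." and "example.com" are not sentence boundaries, so the splits
# here may look counter-intuitive; $3.14$ should survive thanks to the decimal rule.
edge_cases = [
    "<think>We consult Dr. Smith about the result. He agrees.</think>",
    "<think>Hmm... maybe I should check https://example.com for the formula. The value is $3.14$ here.</think>",
]
for text in edge_cases:
    print(split_solution_into_chunks(text))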

Sentence Categorization

The paper uses a taxonomy of 8 categories:

  • Problem Setup: parsing or rephrasing the problem
  • Plan Generation: stating a plan of action
  • Fact Retrieval: recalling facts, formulas, problem details
  • Active Computation: algebra, calculations, manipulations
  • Uncertainty Management: expressing confusion, re-evaluating, backtracking
  • Result Consolidation: aggregating intermediate results
  • Self Checking: verifying previous steps
  • Final Answer Emission: explicitly stating the final answer

We define these below:

CATEGORIES = {
    "problem_setup": "Problem Setup",
    "plan_generation": "Plan Generation",
    "fact_retrieval": "Fact Retrieval",
    "active_computation": "Active Computation",
    "uncertainty_management": "Uncertainty Management",
    "result_consolidation": "Result Consolidation",
    "self_checking": "Self Checking",
    "final_answer_emission": "Final Answer Emission",
    "unknown": "Unknown",
}

Exercise - heuristic-based categorization

Difficulty: 🔴🔴⚪⚪⚪
Importance: 🔵🔵🔵⚪⚪
You should spend up to 10-15 minutes on this exercise.

First, we'll implement a heuristic-based approach. You should do the following:

  • Fill out the CATEGORY_WORDS dictionary below, which maps each category to a list of words associated with that category. To get you started, we've filled out the first three categories.
  • Fill out the categorize_sentences_heuristic function below, which uses this dictionary to categorize sentences. We've given you a few example sentences to test your function - at minimum make sure your function works for these.

A few notes on the expected behavior:

  • The first sentence (idx == 0) should always be classified as "problem_setup" regardless of keyword matches, since reasoning traces typically start by restating the problem.
  • When a sentence matches keywords from multiple categories, the first matching category (in CATEGORY_WORDS iteration order) wins. Think about which categories should take priority when choosing your dictionary ordering.

Once you've passed the test sentences below, you should try taking rollouts from your model above (or examples from the dataset) and see how your function performs on them. Some questions you might ask yourself:

  • Do you think this taxonomy is reasonable?
  • Are there any sentences that are misclassified, or belong to more than one category?
  • How many words do you need to add before your classification works decently?

No heuristic-based classification will be perfect. The point of this exercise is to get you thinking about the different categories and the strengths / limitations of this kind of method. In research, try not to reach for a tool more complicated than what you need!

CATEGORY_WORDS = {
    # Fill in keywords for each category
    "final_answer_emission": ["\\boxed", "final answer"],
    "problem_setup": [],  # Add keywords
    "fact_retrieval": [],  # Add keywords
    # ... add more categories
}


def categorize_sentences_heuristic(chunks: list[str]) -> list[str]:
    """Categorize sentences using keyword matching."""
    raise NotImplementedError("Implement heuristic categorization")


example_sentences = [
    "I need to find the area of a circle with radius 5.",
    "The formula for circle area is A = πr².",
    "Substituting r = 5: A = π × 5² = 25π.",
    "Wait, let me look again at that calculation.",
    "So the area is 25π square units.",
    "Therefore, the answer is \\boxed{25π}.",
]
expected_categories = [
    "Problem Setup",
    "Fact Retrieval",
    "Active Computation",
    "Uncertainty Management",
    "Result Consolidation",
    "Final Answer Emission",
]

categories = categorize_sentences_heuristic(example_sentences)
for sentence, cat, expected in zip(example_sentences, categories, expected_categories):
    assert cat == expected, f"Expected {expected!r}, got {cat!r} for: {sentence!r}"

print("All tests for `categorize_sentences_heuristic` passed!")
Solution
CATEGORY_WORDS = {
    "final_answer_emission": ["\\boxed", "final answer"],
    "problem_setup": ["need to", "problem is", "given"],
    "fact_retrieval": ["remember", "formula", "know that", "recall"],
    "active_computation": ["calculate", "compute", "solve", "=", "equals", "result", "giving"],
    "uncertainty_management": ["wait", "let me", "double check", "hmm", "actually", "reconsider"],
    "result_consolidation": ["summarize", "so", "therefore", "in summary"],
    "self_checking": ["verify", "check", "confirm", "correct"],
    "plan_generation": ["plan", "approach", "strategy", "will", "i'll", "try"],
}


def categorize_sentences_heuristic(chunks: list[str]) -> list[str]:
    """Categorize sentences using keyword matching."""
    categories = []

    for idx, chunk in enumerate(chunks):
        chunk_lower = chunk.lower()

        if idx == 0:
            tag = "problem_setup"
        else:
            for category, words in CATEGORY_WORDS.items():
                if any(word in chunk_lower for word in words):
                    tag = category
                    break
            else:
                tag = "unknown"

        categories.append(CATEGORIES.get(tag))

    return categories


example_sentences = [
    "I need to find the area of a circle with radius 5.",
    "The formula for circle area is A = πr².",
    "Substituting r = 5: A = π × 5² = 25π.",
    "Wait, let me look again at that calculation.",
    "So the area is 25π square units.",
    "Therefore, the answer is \\boxed{25π}.",
]
expected_categories = [
    "Problem Setup",
    "Fact Retrieval",
    "Active Computation",
    "Uncertainty Management",
    "Result Consolidation",
    "Final Answer Emission",
]

categories = categorize_sentences_heuristic(example_sentences)
for sentence, cat, expected in zip(example_sentences, categories, expected_categories):
    assert cat == expected, f"Expected {expected!r}, got {cat!r} for: {sentence!r}"

print("All tests for `categorize_sentences_heuristic` passed!")
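
To make the "more than one category" question above concrete: with the example word lists in this solution, a single sentence can match keywords from several categories, and the label is then decided purely by the dictionary's iteration order.

# This sentence matches "equals"/"result" (active_computation), "let me"/"double check"
# (uncertainty_management), "so" (result_consolidation), and "check" (self_checking);
# the first matching category in CATEGORY_WORDS order wins.
ambiguous = [
    "I need to find x.",  # idx 0 is always labelled problem_setup
    "So, let me double check: the result equals 4.",
]
print(categorize_sentences_heuristic(ambiguous))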

Exercise - implement an autorater

Difficulty: 🔴🔴⚪⚪⚪
Importance: 🔵🔵🔵⚪⚪
You should spend up to 10-15 minutes on this exercise.

Now we'll use a more advanced approach: an autorater (i.e. querying an LLM to do the categorization for us, rather than relying on hardcoded rules).

First we set up helper functions to call an API via OpenRouter (which supports many different model providers). We define generate_response for single calls and generate_responses_parallel for batch processing.

You'll need to create a .env file in the current working directory with OPENROUTER_API_KEY set; then you can run the cell below to see how the helper functions work.

In the exercise below, we've already constructed the system prompt and user prompt for you. Your job is to call generate_responses_parallel with the right arguments, then parse the LLM's response (a numbered list like 1. problem_setup\n2. active_computation\n...) into a list of category labels. You should handle the case where the LLM returns more or fewer labels than expected.

# These two functions are taken directly from section 4.1 (local model generation is not needed here).


def generate_response(
    model: str,
    messages: list[dict[str, str]],
    max_tokens: int = 128,
    stop_sequences: list[str] | None = None,
    temperature: float = 1.0,
    max_retries: int = 10,
) -> str:
    """Single API call with retry logic for rate limits."""
    assert openrouter_client, "OpenRouter API key not set (see earlier instructions)."

    stop_sequences = stop_sequences or []

    for attempt in range(max_retries):
        try:
            resp = openrouter_client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=max_tokens,
                temperature=temperature,
                stop=stop_sequences if stop_sequences else None,
            )
            return resp.choices[0].message.content or ""
        except Exception as e:
            print(str(e))
            if any(msg in str(e) for msg in ("rate_limit", "429", "empty/null choices")):
                if attempt < max_retries - 1:
                    wait_time = 2**attempt
                    print(f"Rate limit hit, waiting {wait_time}s...")
                    time.sleep(wait_time)
                    continue
            raise e
    return ""


def generate_responses_parallel(
    messages_list: list[list[dict[str, str]]],
    model: str,
    max_tokens: int = 128,
    stop_sequences: list[str] | None = None,
    temperature: float = 1.0,
    max_workers: int = 10,
) -> list[str | Exception]:
    """
    Run multiple API calls in parallel using ThreadPoolExecutor.

    Args:
        messages_list: List of message lists, each to be sent as a separate API call
        model: Model identifier for OpenRouter
        max_tokens: Max tokens per response
        stop_sequences: Stop sequences for generation
        temperature: Sampling temperature
        max_workers: Maximum number of parallel workers

    Returns:
        List of responses (strings) or Exceptions for failed calls, in same order as input
    """
    results: dict[int, str | Exception] = {}
    pbar = tqdm(total=len(messages_list), desc="API calls")

    def call_single(idx: int, messages: list[dict[str, str]]) -> tuple[int, str | Exception]:
        """Helper function to make a single API call."""
        try:
            time.sleep(0.1)  # Rate limiting
            result = generate_response(model, messages, max_tokens, stop_sequences, temperature)
            return idx, result
        except Exception as e:
            return idx, e

    # Execute tasks in parallel
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit all tasks with their indices
        futures = [executor.submit(call_single, i, m) for i, m in enumerate(messages_list)]

        # Process completed tasks
        for future in as_completed(futures):
            idx, result = future.result()
            results[idx] = result
            pbar.update(1)

    pbar.close()

    # Return results in original order
    return [results[i] for i in range(len(messages_list))]


# Demo of how this function works:
sys_prompt = {"role": "system", "content": "Reply in rhyming couplets."}
test_messages = [
    [sys_prompt, {"role": "user", "content": "What is 2+2?"}],
    [sys_prompt, {"role": "user", "content": "What is the capital of France?"}],
]
responses = generate_responses_parallel(test_messages, model="openai/gpt-4.1-mini", max_tokens=40)
for i, response in enumerate(responses):
    print(f"Response {i + 1}:\n{response}\n")
Click to see the expected output
Response 1:
Two plus two, a sum so true,  
Equals four, that's clear to you!

Response 2:
The capital of France, I gladly recite,  
Is Paris, a city of charm and light.
DAG_SYSTEM_PROMPT = """
You are an expert in interpreting how language models solve math problems using multi-step reasoning. Your task is to analyze a Chain-of-Thought (CoT) reasoning trace, broken into discrete text chunks, and label each chunk with a tag that describes what this chunk is *doing* functionally in the reasoning process.

### Function Tags:

1. `problem_setup`: Parsing or rephrasing the problem
2. `plan_generation`: Stating or deciding on a plan of action
3. `fact_retrieval`: Recalling facts, formulas, problem details
4. `active_computation`: Performing algebra, calculations, manipulations
5. `result_consolidation`: Aggregating intermediate results, summarizing
6. `uncertainty_management`: Expressing confusion, re-evaluating, backtracking
7. `final_answer_emission`: Explicit statement of the final boxed answer
8. `self_checking`: Verifying previous steps
9. `unknown`: Use only if the chunk does not fit any of the above

### Output Format:

Return a numbered list with one function tag per chunk:

1. problem_setup
2. active_computation
...
"""


def categorize_sentences_autorater(
    problem_text: str, chunks: list[str], model: str = "openai/gpt-4.1-mini"
) -> list[str]:
    """Categorize sentences using an LLM autorater."""

    chunk_str = "\n".join(f"{i + 1}. {chunk}" for i, chunk in enumerate(chunks))

    user_prompt = f"""
Here is the math problem:

[PROBLEM]
{problem_text}

Here is the Chain of Thought, broken into chunks:

[CHUNKS]
{chunk_str}

Now label each chunk with function tags."""

    messages_list = [
        [
            {"role": "system", "content": DAG_SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ]
    ]

    # YOUR CODE HERE - get results with `generate_responses_parallel`, and parse them into
    # a list of category labels for each chunk.


example_problem = "What is the area of a circle with radius 5?"
autorater_categories = categorize_sentences_autorater(example_problem, example_sentences)

# Lightweight verification
assert len(autorater_categories) == len(example_sentences), (
    f"Expected {len(example_sentences)} categories, got {len(autorater_categories)}"
)
assert all(cat in CATEGORIES.values() or cat == "Unknown" for cat in autorater_categories), (
    f"All categories should be valid, got: {autorater_categories}"
)

df = pd.DataFrame(
    {
        "Sentence": example_sentences,
        "Heuristic": categories,
        "Autorater": autorater_categories,
    }
)
display(df)
Click to see the expected output
Solution
DAG_SYSTEM_PROMPT = """
You are an expert in interpreting how language models solve math problems using multi-step reasoning. Your task is to analyze a Chain-of-Thought (CoT) reasoning trace, broken into discrete text chunks, and label each chunk with a tag that describes what this chunk is *doing* functionally in the reasoning process.

### Function Tags:

1. `problem_setup`: Parsing or rephrasing the problem
2. `plan_generation`: Stating or deciding on a plan of action
3. `fact_retrieval`: Recalling facts, formulas, problem details
4. `active_computation`: Performing algebra, calculations, manipulations
5. `result_consolidation`: Aggregating intermediate results, summarizing
6. `uncertainty_management`: Expressing confusion, re-evaluating, backtracking
7. `final_answer_emission`: Explicit statement of the final boxed answer
8. `self_checking`: Verifying previous steps
9. `unknown`: Use only if the chunk does not fit any of the above

### Output Format:

Return a numbered list with one function tag per chunk:

1. problem_setup
2. active_computation
...
"""


def categorize_sentences_autorater(
    problem_text: str, chunks: list[str], model: str = "openai/gpt-4.1-mini"
) -> list[str]:
    """Categorize sentences using an LLM autorater."""

    chunk_str = "\n".join(f"{i + 1}. {chunk}" for i, chunk in enumerate(chunks))

    user_prompt = f"""
Here is the math problem:

[PROBLEM]
{problem_text}

Here is the Chain of Thought, broken into chunks:

[CHUNKS]
{chunk_str}

Now label each chunk with function tags."""

    messages_list = [
        [
            {"role": "system", "content": DAG_SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ]
    ]

    responses = generate_responses_parallel(
        messages_list=messages_list,
        model=model,
        max_tokens=128,
        temperature=0.0,
    )

    raw_response = responses[0] if isinstance(responses[0], str) else ""
    response = re.split(r"\n\d{1,2}\.", "\n" + raw_response.strip())
    response = [r.strip() for r in response if r.strip()]

    if len(response) != len(chunks):
        print(f"Warning: length mismatch {len(response)} != {len(chunks)}")
        response = response[: len(chunks)] + ["unknown"] * (len(chunks) - len(response))

    return [CATEGORIES.get(r, "Unknown") for r in response]


example_problem = "What is the area of a circle with radius 5?"
autorater_categories = categorize_sentences_autorater(example_problem, example_sentences)

# Lightweight verification
assert len(autorater_categories) == len(example_sentences), (
    f"Expected {len(example_sentences)} categories, got {len(autorater_categories)}"
)
assert all(cat in CATEGORIES.values() or cat == "Unknown" for cat in autorater_categories), (
    f"All categories should be valid, got: {autorater_categories}"
)

df = pd.DataFrame(
    {
        "Sentence": example_sentences,
        "Heuristic": categories,
        "Autorater": autorater_categories,
    }
)
display(df)
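
Finally, as a quick end-to-end sketch (assuming the cells above have been run), we can apply the splitter and the heuristic classifier to the full CoT trace we downloaded earlier, and look at how the sentence types are distributed across the trace:

# End-to-end sketch: segment the downloaded trace, label each chunk with the heuristic
# classifier, and plot the distribution of sentence types across the trace.
chunks = split_solution_into_chunks(problem_data["full_cot"])
chunk_categories = categorize_sentences_heuristic(chunks)

counts = pd.Series(chunk_categories).value_counts()
fig, ax = plt.subplots(figsize=(8, 4))
counts.plot.bar(ax=ax)
ax.set_ylabel("Number of chunks")
ax.set_title(f"Sentence type distribution (problem {PROBLEM_ID}, heuristic labels)")
plt.tight_layout()
plt.show()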