Exercise Status: All exercises complete and verified

1️⃣ CoT Infrastructure & Sentence Taxonomy

Learning Objectives
  • Load and explore the MATH reasoning dataset structure (prompts, CoT traces, answers)
  • Implement sentence segmentation for chain-of-thought traces using regex patterns
  • Build both rule-based and LLM-based classifiers to categorize reasoning steps (problem_setup, active_computation, fact_retrieval, etc.)
  • Visualize how different sentence types are distributed across reasoning traces

In this section, we'll build infrastructure to analyze chain-of-thought reasoning by breaking it into meaningful units. When models generate long reasoning traces, we need to decide what the right "units of reasoning" are.

Tokens are too granular: a single reasoning step like "Let's calculate the area: $\pi r^2$" spans multiple tokens, and splitting it apart loses semantic meaning. Sentences are a more natural unit. They correspond to complete thoughts, they're granular enough to identify specific important steps, and they align with how humans chunk reasoning.

By splitting CoT traces into sentences and categorizing them (calculations, lookups, restatements, etc.), we can ask: which types are most critical for reaching the correct answer? This sentence-level analysis sets us up for the black-box and white-box methods in later sections.

Model Setup & Dataset Inspection

We'll start by setting a few constants (the purpose of these will become clear as we go through the exercises):

# Configuration
MODEL_NAME_1B = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
MODEL_NAME_8B = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
MODEL_NAME_14B = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
DATASET_NAME = "uzaymacar/math-rollouts"
BLACKMAIL_DATASET_NAME = "uzaymacar/blackmail-rollouts"
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
SIMILARITY_THRESHOLD = 0.8

Next, we'll load an embedding model. Embedding models take in text and output a vector - we'll use them to measure similarity between sentences, so we can find motifs in our reasoning traces.

embedding_model = SentenceTransformer(EMBEDDING_MODEL)
print(embedding_model)
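
As a quick sanity check of the interface (the exact dimensionality is a property of all-MiniLM-L6-v2), encode maps a piece of text to a fixed-size vector:

# encode() returns a numpy array; all-MiniLM-L6-v2 produces 384-dimensional embeddings
example_vec = embedding_model.encode("Let's calculate the area of the circle.")
print(example_vec.shape)  # expected: (384,)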

To understand how embedding models work, let's look at cosine similarities between a few example sentences:

prompts = [
    "Wait, I think I made an error in my reasoning and need to backtrack",
    "Hold on, I believe I made a mistake in my logic and should reconsider",
    "After careful analysis, I've determined the correct answer is 42",
    "Time is an illusion. Lunchtime doubly so.",
]
labels = [x[:35] + "..." for x in prompts]

embeddings = embedding_model.encode(prompts, normalize_embeddings=True)
cosine_sims = embeddings @ embeddings.T  # unit-norm rows, so dot products are cosine similarities

fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(cosine_sims, cmap="RdBu", vmin=-1, vmax=1)
ax.set_xticks(range(len(labels)))
ax.set_yticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=45, ha="right", fontsize=9)
ax.set_yticklabels(labels, fontsize=9)
plt.colorbar(im, label="Cosine Similarity")
plt.title("Sentence Embedding Similarity")
plt.tight_layout()
plt.show()
Click to see the expected output

We can also load the dataset that the paper's authors open-sourced. The dataset is very large, but its file structure is documented on the HuggingFace page, so we can use huggingface_hub to download just the files we want. We'll inspect it more shortly.

PROBLEM_ID = 4682

path = f"deepseek-r1-distill-llama-8b/temperature_0.6_top_p_0.95/correct_base_solution/problem_{PROBLEM_ID}/base_solution.json"
local_path = hf_hub_download(repo_id=DATASET_NAME, filename=path, repo_type="dataset")

with open(local_path, "r") as f:
    problem_data = json.load(f)

print("Keys in problem data:", list(problem_data.keys()))
print(f"\nProblem prompt:\n{problem_data['prompt']}")
Click to see the expected output
Keys in problem data: ['prompt', 'solution', 'full_cot', 'answer', 'is_correct']

Problem prompt:
Solve this math problem step by step. You MUST put your final answer in \boxed{}. Problem: When the base-16 number $66666_{16}$ is written in base 2, how many base-2 digits (bits) does it have? Solution: 
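
As a quick sanity check (independent of the dataset), we can work out the answer to this particular problem ourselves: 0x66666 = 419430, and its leading hex digit 6 = 0110 in binary starts with a zero, so the number needs 5 * 4 - 1 = 19 bits. The dataset also records the answer and whether this base solution was correct:

# Sanity check: 0x66666 = 419430; leading hex digit 6 = 0b0110 contributes only 3
# significant bits, so we expect 5 * 4 - 1 = 19 bits in total.
value = int("66666", 16)
print(value, value.bit_length())  # 419430 19
print("Recorded answer:", problem_data["answer"], "| marked correct:", problem_data["is_correct"])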

Sentence Splitting

First, we'll need to split our CoT traces into sentences based on punctuation and paragraph breaks. We'll also need to handle special tokens like <think>. We've provided the function split_solution_into_chunks below, which implements the following rules:

  • The <think> ... </think> tags should be removed
  • You should split on sentence-ending punctuation (., !, ?), on newlines, and on colons :
  • Periods . should not split decimal numbers (e.g. x.y) or numbered list markers (e.g. \n1.)
  • No chunk should have length less than 10: if so, then merge it with the next chunk
  • Each chunk should be stripped of whitespace

Read through the function and run the test cases below to verify it works on basic inputs.

After running the test cases, try to come up with your own input where this function breaks down or splits into chunks in a counter-intuitive way. Some things to consider: what happens with URLs, abbreviations (e.g. "Dr. Smith"), LaTeX expressions (e.g. $3.14$), or ellipses (...)? Once you've found an edge case, see if you can modify the function to handle it.

def split_solution_into_chunks(text: str) -> list[str]:
    """Split solution into sentence-level chunks."""
    # Remove thinking tags
    if "<think>" in text:
        text = text.split("<think>")[1]
    if "</think>" in text:
        text = text.split("</think>")[0]
    text = text.strip()

    # Replace "." characters which we don't want to split on
    text = re.sub(r"(\d)\.(\d)", r"\1<DECIMAL>\2", text)  # e.g. "4.5" -> "4<DECIMAL>5"
    text = re.sub(r"\n(\d)\.(\s)", r"\n\1<DECIMAL>\2", text)  # e.g. "\n1. " -> "\n1<DECIMAL> "

    # Split on sentence endings, combining endings with previous chunk
    sentences = re.split(r"([!?:\n]|(?<!\n\d)\.)", text)
    chunks = []
    for i in range(0, len(sentences) - 1, 2):
        chunks.append((sentences[i] + sentences[i + 1]).replace("\n", " "))

    # Replace <DECIMAL> back with "."
    chunks = [re.sub(r"<DECIMAL>", ".", c) for c in chunks]

    # Merge chunks that are too short
    if not chunks:
        return []
    merged = [chunks[0]]
    for c in chunks[1:]:
        if len(merged[-1]) < 10:
            merged[-1] += c
        else:
            merged.append(c)
    return [c.strip() for c in merged if c.strip()]


test_cases = [
    (
        "<think>First, I understand the problem. Next, I'll solve for x. Finally, I verify!</think>",
        ["First, I understand the problem.", "Next, I'll solve for x.", "Finally, I verify!"],
    ),
    (
        "<think>Let me break this down: 1. Convert to decimal. 2. Calculate log. 3. Apply formula.</think>",
        ["Let me break this down:", "1. Convert to decimal.", "2. Calculate log.", "3. Apply formula."],
    ),
    (
        "<think>The answer is 42. Done.</think>",
        ["The answer is 42.", "Done."],
    ),
]

for input_text, expected_chunks in test_cases:
    chunks = split_solution_into_chunks(input_text)
    assert chunks == expected_chunks, f"Expected {expected_chunks}, got {chunks}"

print("All tests passed!")
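
Going back to the edge-case question above, here is one probe you might try (a sketch; run it and judge whether the splits match your intuition). Abbreviations like "Dr." and URLs contain periods and colons that aren't sentence boundaries, while a decimal such as $3.14$ is protected by the decimal rule:

# Probes for potential edge cases: abbreviations, ellipses, URLs, and LaTeX decimals.
# The periods in "Dr." and "example.com" are not sentence boundaries, so the splits
# here may look counter-intuitive; $3.14$ should survive thanks to the decimal rule.
edge_cases = [
    "<think>We consult Dr. Smith about the result. He agrees.</think>",
    "<think>Hmm... maybe I should check https://example.com for the formula. The value is $3.14$ here.</think>",
]
for text in edge_cases:
    print(split_solution_into_chunks(text))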

Sentence Categorization

The paper uses a taxonomy of 8 categories:

  • Problem Setup: parsing or rephrasing the problem
  • Plan Generation: stating a plan of action
  • Fact Retrieval: recalling facts, formulas, problem details
  • Active Computation: algebra, calculations, manipulations
  • Uncertainty Management: expressing confusion, re-evaluating, backtracking
  • Result Consolidation: aggregating intermediate results
  • Self Checking: verifying previous steps
  • Final Answer Emission: explicitly stating the final answer

We define these below:

CATEGORIES = {
    "problem_setup": "Problem Setup",
    "plan_generation": "Plan Generation",
    "fact_retrieval": "Fact Retrieval",
    "active_computation": "Active Computation",
    "uncertainty_management": "Uncertainty Management",
    "result_consolidation": "Result Consolidation",
    "self_checking": "Self Checking",
    "final_answer_emission": "Final Answer Emission",
    "unknown": "Unknown",
}

Exercise - heuristic-based categorization

Difficulty: 🔴🔴⚪⚪⚪
Importance: 🔵🔵🔵⚪⚪
You should spend up to 10-15 minutes on this exercise.

First, we'll implement a heuristic-based approach. You should do the following:

  • Fill out the CATEGORY_WORDS dictionary below, which maps each category to a list of words associated with that category. To get you started, we've filled out the first three categories.
  • Fill out the categorize_sentences_heuristic function below, which uses this dictionary to categorize sentences. We've given you a few example sentences to test your function - at minimum make sure your function works for these.

A few notes on the expected behavior:

  • The first sentence (idx == 0) should always be classified as "problem_setup" regardless of keyword matches, since reasoning traces typically start by restating the problem.
  • When a sentence matches keywords from multiple categories, the first matching category (in CATEGORY_WORDS iteration order) wins. Think about which categories should take priority when choosing your dictionary ordering.

Once you've passed the test sentences below, you should try taking rollouts from your model above (or examples from the dataset) and see how your function performs on them. Some questions you might ask yourself:

  • Do you think this taxonomy is reasonable?
  • Are there any sentences that are misclassified, or belong to more than one category?
  • How many words do you need to add before your classification works decently?

No heuristic-based classification will be perfect. The point of this exercise is to get you thinking about the different categories and the strengths / limitations of this kind of method. In research, try not to reach for a tool more complicated than what you need!

CATEGORY_WORDS = {
    # Fill in keywords for each category
    "final_answer_emission": ["\\boxed", "final answer"],
    "problem_setup": [],  # Add keywords
    "fact_retrieval": [],  # Add keywords
    # ... add more categories
}


def categorize_sentences_heuristic(chunks: list[str]) -> list[str]:
    """Categorize sentences using keyword matching."""
    raise NotImplementedError("Implement heuristic categorization")


example_sentences = [
    "I need to find the area of a circle with radius 5.",
    "The formula for circle area is A = πr².",
    "Substituting r = 5: A = π × 5² = 25π.",
    "Wait, let me look again at that calculation.",
    "So the area is 25π square units.",
    "Therefore, the answer is \\boxed{25π}.",
]
expected_categories = [
    "Problem Setup",
    "Fact Retrieval",
    "Active Computation",
    "Uncertainty Management",
    "Result Consolidation",
    "Final Answer Emission",
]

categories = categorize_sentences_heuristic(example_sentences)
for sentence, cat, expected in zip(example_sentences, categories, expected_categories):
    assert cat == expected, f"Expected {expected!r}, got {cat!r} for: {sentence!r}"

print("All tests for `categorize_sentences_heuristic` passed!")
Solution
CATEGORY_WORDS = {
    "final_answer_emission": ["\\boxed", "final answer"],
    "problem_setup": ["need to", "problem is", "given"],
    "fact_retrieval": ["remember", "formula", "know that", "recall"],
    "active_computation": ["calculate", "compute", "solve", "=", "equals", "result", "giving"],
    "uncertainty_management": ["wait", "let me", "double check", "hmm", "actually", "reconsider"],
    "result_consolidation": ["summarize", "so", "therefore", "in summary"],
    "self_checking": ["verify", "check", "confirm", "correct"],
    "plan_generation": ["plan", "approach", "strategy", "will", "i'll", "try"],
}


def categorize_sentences_heuristic(chunks: list[str]) -> list[str]:
    """Categorize sentences using keyword matching."""
    categories = []

    for idx, chunk in enumerate(chunks):
        chunk_lower = chunk.lower()

        if idx == 0:
            tag = "problem_setup"
        else:
            for category, words in CATEGORY_WORDS.items():
                if any(word in chunk_lower for word in words):
                    tag = category
                    break
            else:
                tag = "unknown"

        categories.append(CATEGORIES.get(tag))

    return categories


example_sentences = [
    "I need to find the area of a circle with radius 5.",
    "The formula for circle area is A = πr².",
    "Substituting r = 5: A = π × 5² = 25π.",
    "Wait, let me look again at that calculation.",
    "So the area is 25π square units.",
    "Therefore, the answer is \\boxed{25π}.",
]
expected_categories = [
    "Problem Setup",
    "Fact Retrieval",
    "Active Computation",
    "Uncertainty Management",
    "Result Consolidation",
    "Final Answer Emission",
]

categories = categorize_sentences_heuristic(example_sentences)
for sentence, cat, expected in zip(example_sentences, categories, expected_categories):
    assert cat == expected, f"Expected {expected!r}, got {cat!r} for: {sentence!r}"

print("All tests for `categorize_sentences_heuristic` passed!")
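
To make the "more than one category" question above concrete: with the example word lists in this solution, a single sentence can match keywords from several categories, and the label is then decided purely by the dictionary's iteration order.

# This sentence matches "equals"/"result" (active_computation), "let me"/"double check"
# (uncertainty_management), "so" (result_consolidation), and "check" (self_checking);
# the first matching category in CATEGORY_WORDS order wins.
ambiguous = [
    "I need to find x.",  # idx 0 is always labelled problem_setup
    "So, let me double check: the result equals 4.",
]
print(categorize_sentences_heuristic(ambiguous))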

Exercise - implement an autorater

Difficulty: 🔴🔴⚪⚪⚪
Importance: 🔵🔵🔵⚪⚪
You should spend up to 10-15 minutes on this exercise.

Now we'll use a more advanced approach: an autorater (i.e. querying an LLM to do the categorization for us, rather than relying on hardcoded rules).

First we set up helper functions to call an API via OpenRouter (which supports many different model providers). We define generate_response for single calls and generate_responses_parallel for batch processing.

You'll need to create a .env file in the current working directory with OPENROUTER_API_KEY set; then you can run the cell below to see how the helper functions work.

In the exercise below, we've already constructed the system prompt and user prompt for you. Your job is to call generate_responses_parallel with the right arguments, then parse the LLM's response (a numbered list like 1. problem_setup\n2. active_computation\n...) into a list of category labels. You should handle the case where the LLM returns more or fewer labels than expected.

# These two functions are taken directly from section 4.1 (local model generation is not needed here).


def generate_response(
    model: str,
    messages: list[dict[str, str]],
    max_tokens: int = 128,
    stop_sequences: list[str] | None = None,
    temperature: float = 1.0,
    max_retries: int = 10,
) -> str:
    """Single API call with retry logic for rate limits."""
    assert openrouter_client, "OpenRouter API key not set (see earlier instructions)."

    stop_sequences = stop_sequences or []

    for attempt in range(max_retries):
        try:
            resp = openrouter_client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=max_tokens,
                temperature=temperature,
                stop=stop_sequences if stop_sequences else None,
            )
            return resp.choices[0].message.content or ""
        except Exception as e:
            print(str(e))
            if any(msg in str(e) for msg in ("rate_limit", "429", "empty/null choices")):
                if attempt < max_retries - 1:
                    wait_time = 2**attempt
                    print(f"Rate limit hit, waiting {wait_time}s...")
                    time.sleep(wait_time)
                    continue
            raise e
    return ""


def generate_responses_parallel(
    messages_list: list[list[dict[str, str]]],
    model: str,
    max_tokens: int = 128,
    stop_sequences: list[str] | None = None,
    temperature: float = 1.0,
    max_workers: int = 10,
) -> list[str | Exception]:
    """
    Run multiple API calls in parallel using ThreadPoolExecutor.

    Args:
        messages_list: List of message lists, each to be sent as a separate API call
        model: Model identifier for OpenRouter
        max_tokens: Max tokens per response
        stop_sequences: Stop sequences for generation
        temperature: Sampling temperature
        max_workers: Maximum number of parallel workers

    Returns:
        List of responses (strings) or Exceptions for failed calls, in same order as input
    """
    results: dict[int, str | Exception] = {}
    pbar = tqdm(total=len(messages_list), desc="API calls")

    def call_single(idx: int, messages: list[dict[str, str]]) -> tuple[int, str | Exception]:
        """Helper function to make a single API call."""
        try:
            time.sleep(0.1)  # Rate limiting
            result = generate_response(model, messages, max_tokens, stop_sequences, temperature)
            return idx, result
        except Exception as e:
            return idx, e

    # Execute tasks in parallel
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit all tasks with their indices
        futures = [executor.submit(call_single, i, m) for i, m in enumerate(messages_list)]

        # Process completed tasks
        for future in as_completed(futures):
            idx, result = future.result()
            results[idx] = result
            pbar.update(1)

    pbar.close()

    # Return results in original order
    return [results[i] for i in range(len(messages_list))]


# Demo of how this function works:
sys_prompt = {"role": "system", "content": "Reply in rhyming couplets."}
test_messages = [
    [sys_prompt, {"role": "user", "content": "What is 2+2?"}],
    [sys_prompt, {"role": "user", "content": "What is the capital of France?"}],
]
responses = generate_responses_parallel(test_messages, model="openai/gpt-4.1-mini", max_tokens=40)
for i, response in enumerate(responses):
    print(f"Response {i + 1}:\n{response}\n")
Click to see the expected output
Response 1:
Two plus two, a sum so true,  
Equals four, that's clear to you!

Response 2:
The capital of France, I gladly recite,  
Is Paris, a city of charm and light.
DAG_SYSTEM_PROMPT = """
You are an expert in interpreting how language models solve math problems using multi-step reasoning. Your task is to analyze a Chain-of-Thought (CoT) reasoning trace, broken into discrete text chunks, and label each chunk with a tag that describes what this chunk is *doing* functionally in the reasoning process.

### Function Tags:

1. `problem_setup`: Parsing or rephrasing the problem
2. `plan_generation`: Stating or deciding on a plan of action
3. `fact_retrieval`: Recalling facts, formulas, problem details
4. `active_computation`: Performing algebra, calculations, manipulations
5. `result_consolidation`: Aggregating intermediate results, summarizing
6. `uncertainty_management`: Expressing confusion, re-evaluating, backtracking
7. `final_answer_emission`: Explicit statement of the final boxed answer
8. `self_checking`: Verifying previous steps
9. `unknown`: Use only if the chunk does not fit any of the above

### Output Format:

Return a numbered list with one function tag per chunk:

1. problem_setup
2. active_computation
...
"""


def categorize_sentences_autorater(
    problem_text: str, chunks: list[str], model: str = "openai/gpt-4.1-mini"
) -> list[str]:
    """Categorize sentences using an LLM autorater."""

    chunk_str = "\n".join(f"{i + 1}. {chunk}" for i, chunk in enumerate(chunks))

    user_prompt = f"""
Here is the math problem:

[PROBLEM]
{problem_text}

Here is the Chain of Thought, broken into chunks:

[CHUNKS]
{chunk_str}

Now label each chunk with function tags."""

    messages_list = [
        [
            {"role": "system", "content": DAG_SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ]
    ]

    # YOUR CODE HERE - get results with `generate_responses_parallel`, and parse them into
    # a list of category labels for each chunk.


example_problem = "What is the area of a circle with radius 5?"
autorater_categories = categorize_sentences_autorater(example_problem, example_sentences)

# Lightweight verification
assert len(autorater_categories) == len(example_sentences), (
    f"Expected {len(example_sentences)} categories, got {len(autorater_categories)}"
)
assert all(cat in CATEGORIES.values() or cat == "Unknown" for cat in autorater_categories), (
    f"All categories should be valid, got: {autorater_categories}"
)

df = pd.DataFrame(
    {
        "Sentence": example_sentences,
        "Heuristic": categories,
        "Autorater": autorater_categories,
    }
)
display(df)
Click to see the expected output
Solution
DAG_SYSTEM_PROMPT = """
You are an expert in interpreting how language models solve math problems using multi-step reasoning. Your task is to analyze a Chain-of-Thought (CoT) reasoning trace, broken into discrete text chunks, and label each chunk with a tag that describes what this chunk is *doing* functionally in the reasoning process.

### Function Tags:

1. `problem_setup`: Parsing or rephrasing the problem
2. `plan_generation`: Stating or deciding on a plan of action
3. `fact_retrieval`: Recalling facts, formulas, problem details
4. `active_computation`: Performing algebra, calculations, manipulations
5. `result_consolidation`: Aggregating intermediate results, summarizing
6. `uncertainty_management`: Expressing confusion, re-evaluating, backtracking
7. `final_answer_emission`: Explicit statement of the final boxed answer
8. `self_checking`: Verifying previous steps
9. `unknown`: Use only if the chunk does not fit any of the above

### Output Format:

Return a numbered list with one function tag per chunk:

1. problem_setup
2. active_computation
...
"""


def categorize_sentences_autorater(
    problem_text: str, chunks: list[str], model: str = "openai/gpt-4.1-mini"
) -> list[str]:
    """Categorize sentences using an LLM autorater."""

    chunk_str = "\n".join(f"{i + 1}. {chunk}" for i, chunk in enumerate(chunks))

    user_prompt = f"""
Here is the math problem:

[PROBLEM]
{problem_text}

Here is the Chain of Thought, broken into chunks:

[CHUNKS]
{chunk_str}

Now label each chunk with function tags."""

    messages_list = [
        [
            {"role": "system", "content": DAG_SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ]
    ]

    responses = generate_responses_parallel(
        messages_list=messages_list,
        model=model,
        max_tokens=128,
        temperature=0.0,
    )

    raw_response = responses[0] if isinstance(responses[0], str) else ""
    response = re.split(r"\n\d{1,2}\.", "\n" + raw_response.strip())
    response = [r.strip() for r in response if r.strip()]

    if len(response) != len(chunks):
        print(f"Warning: length mismatch {len(response)} != {len(chunks)}")
        response = response[: len(chunks)] + ["unknown"] * (len(chunks) - len(response))

    return [CATEGORIES.get(r, "Unknown") for r in response]


example_problem = "What is the area of a circle with radius 5?"
autorater_categories = categorize_sentences_autorater(example_problem, example_sentences)

# Lightweight verification
assert len(autorater_categories) == len(example_sentences), (
    f"Expected {len(example_sentences)} categories, got {len(autorater_categories)}"
)
assert all(cat in CATEGORIES.values() or cat == "Unknown" for cat in autorater_categories), (
    f"All categories should be valid, got: {autorater_categories}"
)

df = pd.DataFrame(
    {
        "Sentence": example_sentences,
        "Heuristic": categories,
        "Autorater": autorater_categories,
    }
)
display(df)
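
Finally, as a quick end-to-end sketch (assuming the cells above have been run), we can apply the splitter and the heuristic classifier to the full CoT trace we downloaded earlier, and look at how the sentence types are distributed across the trace:

# End-to-end sketch: segment the downloaded trace, label each chunk with the heuristic
# classifier, and plot the distribution of sentence types across the trace.
chunks = split_solution_into_chunks(problem_data["full_cot"])
chunk_categories = categorize_sentences_heuristic(chunks)

counts = pd.Series(chunk_categories).value_counts()
fig, ax = plt.subplots(figsize=(8, 4))
counts.plot.bar(ax=ax)
ax.set_ylabel("Number of chunks")
ax.set_title(f"Sentence type distribution (problem {PROBLEM_ID}, heuristic labels)")
plt.tight_layout()
plt.show()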