1️⃣ CoT Infrastructure & Sentence Taxonomy
Learning Objectives
- Load and explore the structure of the math-rollouts reasoning dataset (prompts, CoT traces, answers)
- Implement sentence segmentation for chain-of-thought traces using regex patterns
- Build both rule-based and LLM-based classifiers to categorize reasoning steps (calculation, lookup, restatement, etc.)
- Visualize how different sentence types are distributed across reasoning traces
In this section, we'll build the infrastructure to analyze chain-of-thought reasoning by breaking it into meaningful units. When models generate long reasoning traces, we need to decide: what are the appropriate "units of reasoning" we should analyze?
Why not use tokens? Tokens are too granular - a single reasoning step like "Let's calculate the area: πr²" spans multiple tokens, and breaking it apart loses semantic meaning.
Why sentences? Sentences are a natural unit because:
- They correspond to complete thoughts or reasoning steps
- They're granular enough to identify specific important steps
- They're coarse enough to be semantically meaningful
- They align with how humans naturally chunk reasoning
By splitting CoT traces into sentences and categorizing them (calculations, lookups, restatements, etc.), we can start to ask questions like: "Which types of sentences are most critical for reaching the correct answer?" This sentence-level analysis will set us up for the black-box and white-box methods in later sections, where we'll measure which sentences actually shape the model's reasoning trajectory.
Model Setup & Dataset Inspection
Let's start by setting a few constants (the purpose of all of these will become clear as we go through the exercises):
# Configuration
MODEL_NAME_1B = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
MODEL_NAME_8B = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
MODEL_NAME_14B = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
DATASET_NAME = "uzaymacar/math-rollouts"
BLACKMAIL_DATASET_NAME = "uzaymacar/blackmail-rollouts"
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
SIMILARITY_THRESHOLD = 0.8
Next, we'll load in our embedding model. Embedding models are designed to take in text and output a fixed-length vector - we'll be using them to measure the similarity between sentences, so we can find motifs in our reasoning traces.
embedding_model = SentenceTransformer(EMBEDDING_MODEL)
print(embedding_model)
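As a quick sanity check on what encode returns, you can embed a single sentence and inspect the shape (all-MiniLM-L6-v2 produces 384-dimensional vectors):
vec = embedding_model.encode(["A single test sentence."])
print(vec.shape)  # (1, 384) for all-MiniLM-L6-v2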
To understand how embedding models work, let's look at cosine similarities between a few example sentences:
prompts = [
"Wait, I think I made an error in my reasoning and need to backtrack",
"Hold on, I believe I made a mistake in my logic and should reconsider",
"After careful analysis, I've determined the correct answer is 42",
"Time is an illusion. Lunchtime doubly so.",
]
labels = [x[:35] + "..." for x in prompts]
embedding = embedding_model.encode(prompts)
cosine_sims = embedding @ embedding.T
fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(cosine_sims, cmap="RdBu", vmin=-1, vmax=1)
ax.set_xticks(range(len(labels)))
ax.set_yticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=45, ha="right", fontsize=9)
ax.set_yticklabels(labels, fontsize=9)
plt.colorbar(im, label="Cosine Similarity")
plt.title("Sentence Embedding Similarity")
plt.tight_layout()
plt.show()
Click to see the expected output
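One subtlety worth flagging: the matrix product above is only a cosine similarity because this particular model outputs unit-norm embeddings. If you swap in a different embedding model, it's safer to normalize explicitly via the normalize_embeddings argument of encode. A quick check (a sketch, assuming numpy is imported as np):
import numpy as np

# For all-MiniLM-L6-v2 these norms should all be ~1.0, so the dot products above
# really are cosine similarities.
print(np.linalg.norm(embedding, axis=1))

# Safe for any embedding model: ask encode to normalize before taking dot products.
embedding_normed = embedding_model.encode(prompts, normalize_embeddings=True)
cosine_sims_explicit = embedding_normed @ embedding_normed.T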
We can also load in the dataset that the paper's authors have helpfully open-sourced. The dataset is very large, but the authors document its structure on the corresponding HuggingFace page, so we can use the huggingface_hub package to download just the files we want. We'll inspect this dataset more shortly.
PROBLEM_ID = 4682
path = f"deepseek-r1-distill-llama-8b/temperature_0.6_top_p_0.95/correct_base_solution/problem_{PROBLEM_ID}/base_solution.json"
local_path = hf_hub_download(repo_id=DATASET_NAME, filename=path, repo_type="dataset")
with open(local_path, "r") as f:
problem_data = json.load(f)
print("Keys in problem data:", list(problem_data.keys()))
print(f"\nProblem prompt:\n{problem_data['prompt']}")
Click to see the expected output
Keys in problem data: ['prompt', 'solution', 'full_cot', 'answer', 'is_correct']

Problem prompt:
Solve this math problem step by step. You MUST put your final answer in \boxed{}. Problem: When the base-16 number $66666_{16}$ is written in base 2, how many base-2 digits (bits) does it have? Solution:
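Since the repo is large, it can also help to see which other files exist for this problem before downloading anything else. Here's a sketch using huggingface_hub's list_repo_files (the prefix below is just the path pattern we used above; listing every file in the repo may take a little while):
from huggingface_hub import list_repo_files

# List all files in the dataset repo, then filter down to this problem's directory.
all_files = list_repo_files(repo_id=DATASET_NAME, repo_type="dataset")
problem_prefix = f"deepseek-r1-distill-llama-8b/temperature_0.6_top_p_0.95/correct_base_solution/problem_{PROBLEM_ID}/"
print(*[f for f in all_files if f.startswith(problem_prefix)], sep="\n")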
Sentence Splitting
Exercise - implement sentence splitting
First, we'll need to split our CoT traces into sentences based on punctuation and paragraph breaks. We'll also need to handle special tokens like <think>.
You should fill in the split_solution_into_chunks function below. We've given you a few basic tests to pass; when your solution passes all of them you can be confident you've dealt with enough edge cases. Here is the full set of rules as defined by the edge cases:
- The <think> ... </think> tags should be removed
- You should split on sentence-ending punctuation (any of ., !, ?), on newlines, and on colons :
- If a period . is part of a decimal number (e.g. 34.5) or a numbered list item (e.g. \n1.), you shouldn't split on it
- No chunk should have length less than 10 characters: if one does, merge it with the next chunk
- Each chunk should be stripped of whitespace
We recommend single-shotting this exercise with LLMs, and checking the output for bugs rather than writing it from scratch yourself.
def split_solution_into_chunks(text: str) -> list[str]:
"""Split solution into sentence-level chunks."""
raise NotImplementedError("Implement sentence splitting")
test_cases = [
(
"<think>First, I understand the problem. Next, I'll solve for x. Finally, I verify!</think>",
["First, I understand the problem.", "Next, I'll solve for x.", "Finally, I verify!"],
),
(
"<think>Let me break this down: 1. Convert to decimal. 2. Calculate log. 3. Apply formula.</think>",
["Let me break this down:", "1. Convert to decimal.", "2. Calculate log.", "3. Apply formula."],
),
(
"<think>The answer is 42. Done.</think>",
["The answer is 42.", "Done."],
),
]
for input_text, expected_chunks in test_cases:
chunks = split_solution_into_chunks(input_text)
assert chunks == expected_chunks, f"Expected {expected_chunks}, got {chunks}"
print("All tests passed!")
Solution
def split_solution_into_chunks(text: str) -> list[str]:
"""Split solution into sentence-level chunks."""
# Remove thinking tags
if "<think>" in text:
text = text.split("<think>")[1]
if "</think>" in text:
text = text.split("</think>")[0]
text = text.strip()
# Replace "." characters which we don't want to split on
text = re.sub(r"(\d)\.(\d)", r"\1<DECIMAL>\2", text) # e.g. "4.5" -> "4<DECIMAL>5"
text = re.sub(r"\n(\d)\.(\s)", r"\n\1<DECIMAL>\2", text) # e.g. "\n1. " -> "\n1<DECIMAL> "
# Split on sentence endings, combining endings with previous chunk
sentences = re.split(r"([!?:\n]|(?<!\n\d)\.)", text)
chunks = []
for i in range(0, len(sentences) - 1, 2):
chunks.append((sentences[i] + sentences[i + 1]).replace("\n", " "))
# Replace <DECIMAL> back with "."
chunks = [re.sub(r"<DECIMAL>", ".", c) for c in chunks]
# Merge chunks that are too short
if not chunks:
return []
merged = [chunks[0]]
for c in chunks[1:]:
if len(merged[-1]) < 10:
merged[-1] += c
else:
merged.append(c)
return [c.strip() for c in merged if c.strip()]
test_cases = [
(
"<think>First, I understand the problem. Next, I'll solve for x. Finally, I verify!</think>",
["First, I understand the problem.", "Next, I'll solve for x.", "Finally, I verify!"],
),
(
"<think>Let me break this down: 1. Convert to decimal. 2. Calculate log. 3. Apply formula.</think>",
["Let me break this down:", "1. Convert to decimal.", "2. Calculate log.", "3. Apply formula."],
),
(
"<think>The answer is 42. Done.</think>",
["The answer is 42.", "Done."],
),
]
for input_text, expected_chunks in test_cases:
chunks = split_solution_into_chunks(input_text)
assert chunks == expected_chunks, f"Expected {expected_chunks}, got {chunks}"
print("All tests passed!")
Sentence Categorization
The paper uses a taxonomy of 8 categories:
- Problem Setup: Parsing or rephrasing the problem
- Plan Generation: Stating or deciding on a plan of action
- Fact Retrieval: Recalling facts, formulas, problem details
- Active Computation: Algebra, calculations, manipulations
- Uncertainty Management: Expressing confusion, re-evaluating, backtracking
- Result Consolidation: Aggregating intermediate results, summarizing
- Self Checking: Verifying previous steps
- Final Answer Emission: Explicitly stating the final answer
Let's define these below, along with an unknown catch-all label:
CATEGORIES = {
"problem_setup": "Problem Setup",
"plan_generation": "Plan Generation",
"fact_retrieval": "Fact Retrieval",
"active_computation": "Active Computation",
"uncertainty_management": "Uncertainty Management",
"result_consolidation": "Result Consolidation",
"self_checking": "Self Checking",
"final_answer_emission": "Final Answer Emission",
"unknown": "Unknown",
}
Exercise - heuristic-based categorization
First, we'll implement a heuristic-based approach. You should do the following:
- Fill out the CATEGORY_WORDS dictionary below, which maps each category to a list of words associated with that category. To get you started, we've filled in the first entry and listed a couple of empty ones for you to complete.
- Fill out the categorize_sentences_heuristic function below, which uses this dictionary to categorize sentences. We've given you a few example sentences to test your function - at minimum, make sure your function works for these.
Once you've passed the test sentences below, you should try taking rollouts from your model above (or examples from the dataset) and see how your function performs on them. Some questions you might ask yourself:
- Do you think this taxonomy is reasonable?
- Are there any sentences that are misclassified, or belong to more than one category?
- How many words do you need to add before your classification works decently?
Note that no heuristic-based classification will be perfect. The point of this exercise is to get you thinking about the different categories, and about the strengths and limitations of this kind of method. In research, you should generally try not to reach for a tool more complicated than you need!
CATEGORY_WORDS = {
# Fill in keywords for each category
"final_answer_emission": ["\\boxed", "final answer"],
"problem_setup": [], # Add keywords
"fact_retrieval": [], # Add keywords
# ... add more categories
}
def categorize_sentences_heuristic(chunks: list[str]) -> list[str]:
"""Categorize sentences using keyword matching."""
raise NotImplementedError("Implement heuristic categorization")
example_sentences = [
"I need to find the area of a circle with radius 5.",
"The formula for circle area is A = πr².",
"Substituting r = 5: A = π × 5² = 25π.",
"Wait, let me look again at that calculation.",
"So the area is 25π square units.",
"Therefore, the answer is \\boxed{25π}.",
]
expected_categories = [
"Problem Setup",
"Fact Retrieval",
"Active Computation",
"Uncertainty Management",
"Result Consolidation",
"Final Answer Emission",
]
categories = categorize_sentences_heuristic(example_sentences)
for sentence, cat, expected in zip(example_sentences, categories, expected_categories):
assert cat == expected, f"Expected {expected!r}, got {cat!r} for: {sentence!r}"
print("All tests for `categorize_sentences_heuristic` passed!")
Solution
CATEGORY_WORDS = {
"final_answer_emission": ["\\boxed", "final answer"],
"problem_setup": ["need to", "problem is", "given"],
"fact_retrieval": ["remember", "formula", "know that", "recall"],
"active_computation": ["calculate", "compute", "solve", "=", "equals", "result", "giving"],
"uncertainty_management": ["wait", "let me", "double check", "hmm", "actually", "reconsider"],
"result_consolidation": ["summarize", "so", "therefore", "in summary"],
"self_checking": ["verify", "check", "confirm", "correct"],
"plan_generation": ["plan", "approach", "strategy", "will", "i'll", "try"],
}
def categorize_sentences_heuristic(chunks: list[str]) -> list[str]:
"""Categorize sentences using keyword matching."""
categories = []
for idx, chunk in enumerate(chunks):
chunk_lower = chunk.lower()
if idx == 0:
tag = "problem_setup"
else:
for category, words in CATEGORY_WORDS.items():
if any(word in chunk_lower for word in words):
tag = category
break
else:
tag = "unknown"
categories.append(CATEGORIES.get(tag))
return categories
example_sentences = [
"I need to find the area of a circle with radius 5.",
"The formula for circle area is A = πr².",
"Substituting r = 5: A = π × 5² = 25π.",
"Wait, let me look again at that calculation.",
"So the area is 25π square units.",
"Therefore, the answer is \\boxed{25π}.",
]
expected_categories = [
"Problem Setup",
"Fact Retrieval",
"Active Computation",
"Uncertainty Management",
"Result Consolidation",
"Final Answer Emission",
]
categories = categorize_sentences_heuristic(example_sentences)
for sentence, cat, expected in zip(example_sentences, categories, expected_categories):
assert cat == expected, f"Expected {expected!r}, got {cat!r} for: {sentence!r}"
print("All tests for categorize_sentences_heuristic passed!")
Exercise - implement an autorater
We'll now progress to a slightly more advanced approach for classification, using an autorater. This essentially means we're querying an LLM to do the categorization for us, rather than relying on hardcoded rules.
We'll start by setting up helper functions to call an API via OpenRouter (which supports many different model providers). We'll define generate_response for single calls and generate_responses_parallel for batch processing.
You'll need to create an .env file in the current working directory and set the OPENROUTER_API_KEY environment variable, then you can run the cell below to see how the helper functions work.
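If openrouter_client isn't already defined in your setup code, here's one way you might construct it (a sketch assuming the openai and python-dotenv packages; OpenRouter exposes an OpenAI-compatible API):
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # picks up OPENROUTER_API_KEY from the .env file in the working directory
api_key = os.getenv("OPENROUTER_API_KEY")
openrouter_client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=api_key) if api_key else None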
def generate_response(
model: str,
messages: list[dict[str, str]],
max_tokens: int = 128,
stop_sequences: list[str] | None = None,
temperature: float = 1.0,
max_retries: int = 10,
) -> str:
"""Single API call with retry logic for rate limits."""
assert openrouter_client, "OpenRouter API key not set (see earlier instructions)."
stop_sequences = stop_sequences or []
for attempt in range(max_retries):
try:
resp = openrouter_client.chat.completions.create(
model=model,
messages=messages,
max_tokens=max_tokens,
temperature=temperature,
stop=stop_sequences if stop_sequences else None,
)
return resp.choices[0].message.content or ""
except Exception as e:
print(str(e))
if "rate_limit" in str(e).lower() or "429" in str(e):
if attempt < max_retries - 1:
wait_time = 2**attempt
print(f"Rate limit hit, waiting {wait_time}s...")
time.sleep(wait_time)
continue
raise e
return ""
def generate_responses_parallel(
messages_list: list[list[dict[str, str]]],
model: str,
max_tokens: int = 128,
stop_sequences: list[str] | None = None,
temperature: float = 1.0,
max_workers: int = 10,
) -> list[str | Exception]:
"""
Run multiple API calls in parallel using ThreadPoolExecutor.
Args:
messages_list: List of message lists, each to be sent as a separate API call
model: Model identifier for OpenRouter
max_tokens: Max tokens per response
stop_sequences: Stop sequences for generation
temperature: Sampling temperature
max_workers: Maximum number of parallel workers
Returns:
List of responses (strings) or Exceptions for failed calls, in same order as input
"""
results: dict[int, str | Exception] = {}
pbar = tqdm(total=len(messages_list), desc="API calls")
def call_single(idx: int, messages: list[dict[str, str]]) -> tuple[int, str | Exception]:
"""Helper function to make a single API call."""
try:
time.sleep(0.1) # Rate limiting
result = generate_response(model, messages, max_tokens, stop_sequences, temperature)
return idx, result
except Exception as e:
return idx, e
# Execute tasks in parallel
with ThreadPoolExecutor(max_workers=max_workers) as executor:
# Submit all tasks with their indices
futures = [executor.submit(call_single, i, m) for i, m in enumerate(messages_list)]
# Process completed tasks
for future in as_completed(futures):
idx, result = future.result()
results[idx] = result
pbar.update(1)
pbar.close()
# Return results in original order
return [results[i] for i in range(len(messages_list))]
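Before running the parallel demo, you can sanity-check a single call (the model id here is just an example; any model available on OpenRouter works):
reply = generate_response(
    model="openai/gpt-4.1-mini",
    messages=[{"role": "user", "content": "Reply with exactly one word: pong."}],
    max_tokens=5,
    temperature=0.0,
)
print(reply)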
# Demo of how this function works:
sys_prompt = {"role": "system", "content": "Reply in rhyming couplets."}
test_messages = [
[sys_prompt, {"role": "user", "content": "What is 2+2?"}],
[sys_prompt, {"role": "user", "content": "What is the capital of France?"}],
]
responses = generate_responses_parallel(test_messages, model="openai/gpt-4.1-mini", max_tokens=40)
for i, response in enumerate(responses):
print(f"Response {i + 1}:\n{response}\n")
Click to see the expected output
Response 1:
Two plus two, a sum so true,
Equals four, that's clear to you!

Response 2:
The capital of France, I gladly recite,
Is Paris, a city of charm and light.
DAG_SYSTEM_PROMPT = """
You are an expert in interpreting how language models solve math problems using multi-step reasoning. Your task is to analyze a Chain-of-Thought (CoT) reasoning trace, broken into discrete text chunks, and label each chunk with a tag that describes what this chunk is *doing* functionally in the reasoning process.
### Function Tags:
1. `problem_setup`: Parsing or rephrasing the problem
2. `plan_generation`: Stating or deciding on a plan of action
3. `fact_retrieval`: Recalling facts, formulas, problem details
4. `active_computation`: Performing algebra, calculations, manipulations
5. `result_consolidation`: Aggregating intermediate results, summarizing
6. `uncertainty_management`: Expressing confusion, re-evaluating, backtracking
7. `final_answer_emission`: Explicit statement of the final boxed answer
8. `self_checking`: Verifying previous steps
9. `unknown`: Use only if the chunk does not fit any of the above
### Output Format:
Return a numbered list with one function tag per chunk:
1. problem_setup
2. active_computation
...
"""
def categorize_sentences_autorater(
problem_text: str, chunks: list[str], model: str = "openai/gpt-4.1-mini"
) -> list[str]:
"""Categorize sentences using an LLM autorater."""
raise NotImplementedError("Implement autorater categorization")
example_problem = "What is the area of a circle with radius 5?"
autorater_categories = categorize_sentences_autorater(example_problem, example_sentences)
df = pd.DataFrame(
{
"Sentence": example_sentences,
"Heuristic": categories,
"Autorater": autorater_categories,
}
)
display(df)
Click to see the expected output
Solution
DAG_SYSTEM_PROMPT = """
You are an expert in interpreting how language models solve math problems using multi-step reasoning. Your task is to analyze a Chain-of-Thought (CoT) reasoning trace, broken into discrete text chunks, and label each chunk with a tag that describes what this chunk is doing functionally in the reasoning process.
### Function Tags:
1. problem_setup: Parsing or rephrasing the problem
2. plan_generation: Stating or deciding on a plan of action
3. fact_retrieval: Recalling facts, formulas, problem details
4. active_computation: Performing algebra, calculations, manipulations
5. result_consolidation: Aggregating intermediate results, summarizing
6. uncertainty_management: Expressing confusion, re-evaluating, backtracking
7. final_answer_emission: Explicit statement of the final boxed answer
8. self_checking: Verifying previous steps
9. unknown: Use only if the chunk does not fit any of the above
### Output Format:
Return a numbered list with one function tag per chunk:
1. problem_setup
2. active_computation
...
"""
def categorize_sentences_autorater(
problem_text: str, chunks: list[str], model: str = "openai/gpt-4.1-mini"
) -> list[str]:
"""Categorize sentences using an LLM autorater."""
chunk_str = "\n".join(f"{i + 1}. {chunk}" for i, chunk in enumerate(chunks))
user_prompt = f"""
Here is the math problem:
[PROBLEM]
{problem_text}
Here is the Chain of Thought, broken into chunks:
[CHUNKS]
{chunk_str}
Now label each chunk with function tags."""
messages_list = [
[
{"role": "system", "content": DAG_SYSTEM_PROMPT},
{"role": "user", "content": user_prompt},
]
]
responses = generate_responses_parallel(
messages_list=messages_list,
model=model,
max_tokens=128,
temperature=0.0,
)
raw_response = responses[0] if isinstance(responses[0], str) else ""
response = re.split(r"\n\d{1,2}\.", "\n" + raw_response.strip())
response = [r.strip() for r in response if r.strip()]
if len(response) != len(chunks):
print(f"Warning: length mismatch {len(response)} != {len(chunks)}")
response = response[: len(chunks)] + ["unknown"] * (len(chunks) - len(response))
return [CATEGORIES.get(r, "Unknown") for r in response]
example_problem = "What is the area of a circle with radius 5?"
autorater_categories = categorize_sentences_autorater(example_problem, example_sentences)
df = pd.DataFrame(
{
"Sentence": example_sentences,
"Heuristic": categories,
"Autorater": autorater_categories,
}
)
display(df)
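Finally, as with the heuristic classifier, it's worth trying the autorater on a real trace rather than just the toy example. A sketch (assumes cot_chunks and problem_data from earlier cells; we only label the first dozen chunks here so the numbered response fits within the solution's max_tokens=128 - raise that limit if you want to label a full trace):
subset = cot_chunks[:12]
heuristic_cats = categorize_sentences_heuristic(subset)
autorater_cats = categorize_sentences_autorater(problem_data["prompt"], subset)

agreement = sum(h == a for h, a in zip(heuristic_cats, autorater_cats)) / len(subset)
print(f"Heuristic vs autorater agreement: {agreement:.1%}")

display(pd.DataFrame({"Sentence": subset, "Heuristic": heuristic_cats, "Autorater": autorater_cats}))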