Chain of Thought Prompting
Chain of Thought (CoT) prompting is a technique that improves LLM reasoning by encouraging the model to generate intermediate reasoning steps before arriving at the final answer.
The Problem with Direct Prompting
Prompt: "If John has 3 apples and buys 2 more, then gives half
to Mary, how many apples does John have?"
Direct answer: "3" ← Often wrong
LLMs struggle with multi-step reasoning when asked to output only the final answer.
Chain of Thought Solution
Prompt: "If John has 3 apples and buys 2 more, then gives half
to Mary, how many apples does John have?
Let's think step by step."
CoT Response:
"Let me work through this step by step:
1. John starts with 3 apples
2. He buys 2 more: 3 + 2 = 5 apples
3. He gives half to Mary: 5 / 2 = 2.5 apples
4. John has 2.5 apples (or 2 if we round down)
Answer: 2.5 apples"
Types of Chain of Thought
1. Zero-Shot CoT
Just add "Let's think step by step":
prompt = f"""
{question}
Let's think step by step.
"""
2. Few-Shot CoT
Provide examples with reasoning:
prompt = f"""
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis
balls, with 3 balls per can. How many tennis balls does he have?
A: Roger started with 5 balls. He bought 2 cans of 3 balls
each, so 2 × 3 = 6 balls. Total: 5 + 6 = 11 balls.
The answer is 11.
Q: The cafeteria had 23 apples. They used 20 for lunch and
bought 6 more. How many apples do they have?
A: Started with 23 apples. Used 20: 23 - 20 = 3 remaining.
Bought 6 more: 3 + 6 = 9 apples.
The answer is 9.
Q: {new_question}
A:
"""
3. Self-Consistency
Sample multiple reasoning paths, take majority vote:
import collections
def self_consistency(prompt, model, num_samples=5):
    answers = []
    for _ in range(num_samples):
        # Sample a reasoning path at non-zero temperature for diversity
        response = model.generate(prompt, temperature=0.7)
        answer = extract_final_answer(response)
        answers.append(answer)
    # Return the most common answer (majority vote)
    return collections.Counter(answers).most_common(1)[0][0]
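The helper `extract_final_answer` is not defined above. A minimal sketch, assuming responses end with a line like "The answer is 11." (as in the few-shot examples earlier), with a fallback to the last number in the text:

```python
import re

def extract_final_answer(response: str) -> str:
    """Pull the final numeric answer out of a CoT response.

    Looks for a 'The answer is X' pattern first, then falls back
    to the last number appearing anywhere in the text.
    """
    match = re.search(r"[Tt]he answer is\s*\$?(-?[\d.,]+)", response)
    if match:
        return match.group(1).rstrip(".").replace(",", "")
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else ""
```

Real extraction logic varies by task; structured output formats (e.g. asking the model to end with `Answer: <number>`) make this step more reliable.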
Why CoT Works
1. Decomposition
Complex problem → Series of simple steps
"Calculate compound interest" →
1. Find simple interest
2. Add to principal
3. Repeat for each period
2. Working Memory
Intermediate results stored in generated text:
"5 + 3 = 8, now 8 × 2 = 16, finally 16 - 4 = 12"
3. Pattern Matching
Similar reasoning patterns in training data
get activated and followed
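The decomposition idea above can be made concrete without an LLM at all. A minimal sketch of the compound-interest example, executing the same three steps per period:

```python
def compound_interest(principal: float, rate: float, periods: int) -> float:
    """Compute compound interest by repeating the simple-interest steps."""
    balance = principal
    for _ in range(periods):
        interest = balance * rate  # 1. find simple interest for this period
        balance += interest        # 2. add it to the principal
    return balance                 # 3. repeated for each period

# 1000 at 10% for 2 periods: 1000 -> 1100 -> 1210
```

A good CoT response decomposes a problem into exactly this kind of loop of small, checkable operations.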
Effective CoT Strategies
Template Structures
# Problem decomposition
prompt = f"""
To solve this problem, I need to:
1. Identify the key information
2. Determine the operations needed
3. Execute each step
4. Verify the answer
Problem: {problem}
"""
# Explicit reasoning
prompt = f"""
Question: {question}
Let me reason through this:
- First, I observe that...
- This means...
- Therefore...
- My final answer is...
"""
Domain-Specific Prompts
Math:
"Show your work. Write out each calculation."
Logic:
"Consider each premise. What can we deduce?"
Code:
"Trace through the code step by step with example inputs."
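The domain hints above can be bundled into a single helper. A minimal sketch, using a hypothetical `build_cot_prompt` function (not part of any library) that falls back to the generic zero-shot trigger:

```python
# Map task domains to the CoT instructions listed above.
COT_INSTRUCTIONS = {
    "math": "Show your work. Write out each calculation.",
    "logic": "Consider each premise. What can we deduce?",
    "code": "Trace through the code step by step with example inputs.",
}

def build_cot_prompt(question: str, domain: str = "math") -> str:
    """Append a domain-appropriate CoT instruction to the question."""
    instruction = COT_INSTRUCTIONS.get(domain, "Let's think step by step.")
    return f"{question}\n{instruction}"
```

Tailoring the instruction to the domain tends to elicit reasoning in the style the task actually needs.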
Tree of Thoughts
Extension of CoT that explores multiple reasoning branches:
              Problem
                 │
    ┌────────────┼────────────┐
    ▼            ▼            ▼
 Path A       Path B       Path C
    │            │            │
┌───┴───┐      Dead       ┌───┴───┐
▼       ▼      End        ▼       ▼
A.1    A.2               C.1     C.2
(best)                        (dead end)
def tree_of_thoughts(problem, model, breadth=3, depth=3):
    # Assumes the model exposes generate_thoughts / evaluate_thought,
    # and that an evaluate() scoring function exists for leaf states.
    def explore(state, depth_remaining):
        if depth_remaining == 0:
            return evaluate(state)
        # Generate multiple candidate next steps
        candidates = model.generate_thoughts(state, n=breadth)
        # Score each candidate and keep the best half (prune the rest)
        scored = [(c, model.evaluate_thought(c)) for c in candidates]
        best = sorted(scored, key=lambda x: x[1], reverse=True)[:breadth // 2]
        # Recurse on the surviving candidates; return the best score found
        return max(explore(c, depth_remaining - 1) for c, _ in best)

    return explore(problem, depth)
Benchmarks and Results
| Task | Direct | Zero-Shot CoT | Few-Shot CoT |
|---|---|---|---|
| GSM8K (math) | 17.1% | 40.7% | 58.1% |
| SVAMP (math) | 63.4% | 68.9% | 79.0% |
| AQuA (algebra) | 26.4% | 39.4% | 45.3% |
Results for PaLM 540B model
When to Use CoT
Good For
- Math word problems
- Multi-step reasoning
- Logical deduction
- Code understanding
- Complex decision making
Not Needed For
- Simple factual questions
- Single-step tasks
- Classification with clear criteria
- Tasks where reasoning doesn't help
Common Pitfalls
1. Reasoning Errors Compound
Step 1: 5 + 3 = 7 ← Error here
Step 2: 7 × 2 = 14 ← Carries forward
Solution: Use self-consistency with multiple samples
2. Verbose but Wrong
A long explanation that sounds confident but arrives at the wrong answer
Solution: Verify final answer, add sanity checks
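A basic sanity check is to validate the extracted answer against the question's constraints before accepting it. A minimal sketch for numeric answers, with a hypothetical plausible-range check:

```python
def sanity_check(answer: str, low: float, high: float) -> bool:
    """Reject answers that are not numeric or fall outside a plausible range."""
    try:
        value = float(answer)
    except ValueError:
        return False
    return low <= value <= high

# A count of apples should be a small non-negative number:
# sanity_check("9", 0, 100)   -> True
# sanity_check("-3", 0, 100)  -> False
```

For higher-stakes use, a second model call asking "Is this answer consistent with the question?" serves the same purpose.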
3. Overthinking Simple Problems
Question: "What is 2 + 2?"
CoT: "Let me break this down into components..." (overkill)
Solution: Match complexity to problem
Implementation Example
from openai import OpenAI
client = OpenAI()
def solve_with_cot(problem, use_few_shot=True):
    few_shot_examples = """
Q: A restaurant has 20 tables. Each table has 4 chairs.
If 12 tables are occupied with 3 people each, how many
empty chairs are there?
A: Let's solve this step by step:
- Total chairs: 20 tables × 4 chairs = 80 chairs
- Occupied chairs: 12 tables × 3 people = 36 chairs
- Empty chairs: 80 - 36 = 44 chairs
The answer is 44.
""" if use_few_shot else ""

    prompt = f"{few_shot_examples}Q: {problem}\nA: Let's solve this step by step:"

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return response.choices[0].message.content
Key Takeaways
- CoT improves reasoning by generating intermediate steps
- "Let's think step by step" enables zero-shot CoT
- Few-shot examples improve performance further
- Self-consistency (multiple samples + voting) adds robustness
- Tree of Thoughts explores multiple reasoning paths
- Best for multi-step reasoning tasks, not simple questions