Chain of Thought Prompting
Chain of Thought (CoT) prompting is a technique that improves LLM reasoning by encouraging the model to generate intermediate reasoning steps before arriving at the final answer.
The Problem with Direct Prompting
Prompt: "If John has 3 apples and buys 2 more, then gives half
to Mary, how many apples does John have?"
Direct answer: "3" ← Often wrong
LLMs struggle with multi-step reasoning when asked to output only the final answer.
Chain of Thought Solution
Prompt: "If John has 3 apples and buys 2 more, then gives half
to Mary, how many apples does John have?
Let's think step by step."
CoT Response:
"Let me work through this step by step:
1. John starts with 3 apples
2. He buys 2 more: 3 + 2 = 5 apples
3. He gives half to Mary: 5 / 2 = 2.5 apples
4. John has 2.5 apples (or 2 if we round down)
Answer: 2.5 apples"
Types of Chain of Thought
1. Zero-Shot CoT
Just add "Let's think step by step":
prompt = f"""
{question}
Let's think step by step.
"""
2. Few-Shot CoT
Provide examples with reasoning:
prompt = f"""
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis
balls, with 3 balls per can. How many tennis balls does he have?
A: Roger started with 5 balls. He bought 2 cans of 3 balls
each, so 2 × 3 = 6 balls. Total: 5 + 6 = 11 balls.
The answer is 11.
Q: The cafeteria had 23 apples. They used 20 for lunch and
bought 6 more. How many apples do they have?
A: Started with 23 apples. Used 20: 23 - 20 = 3 remaining.
Bought 6 more: 3 + 6 = 9 apples.
The answer is 9.
Q: {new_question}
A:
"""
3. Self-Consistency
Sample multiple reasoning paths, take majority vote:
import collections
def self_consistency(prompt, model, num_samples=5):
    answers = []
    for _ in range(num_samples):
        # Sample a reasoning path at non-zero temperature for diversity
        response = model.generate(prompt, temperature=0.7)
        answer = extract_final_answer(response)
        answers.append(answer)
    # Return the most common answer (majority vote)
    return collections.Counter(answers).most_common(1)[0][0]
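The helper `extract_final_answer` is not defined above. A minimal sketch, assuming responses end with a line like "The answer is 11." (as in the few-shot examples earlier), with a fallback to the last number in the text:

```python
import re

def extract_final_answer(response: str) -> str:
    """Pull the final numeric answer out of a CoT response.

    Looks for a 'The answer is X' pattern first, then falls back
    to the last number appearing anywhere in the text.
    """
    match = re.search(r"[Tt]he answer is\s*\$?(-?[\d.,]+)", response)
    if match:
        return match.group(1).rstrip(".").replace(",", "")
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else ""
```

Real extraction logic varies by task; structured output formats (e.g. asking the model to end with `Answer: <number>`) make this step more reliable.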
Why CoT Works
1. Decomposition
Complex problem → Series of simple steps
"Calculate compound interest" →
1. Find simple interest
2. Add to principal
3. Repeat for each period
2. Working Memory
Intermediate results stored in generated text:
"5 + 3 = 8, now 8 × 2 = 16, finally 16 - 4 = 12"
3. Pattern Matching
Similar reasoning patterns in training data
get activated and followed
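The decomposition idea above can be made concrete without an LLM at all. A minimal sketch of the compound-interest example, executing the same three steps per period:

```python
def compound_interest(principal: float, rate: float, periods: int) -> float:
    """Compute compound interest by repeating the simple-interest steps."""
    balance = principal
    for _ in range(periods):
        interest = balance * rate  # 1. find simple interest for this period
        balance += interest        # 2. add it to the principal
    return balance                 # 3. repeated for each period

# 1000 at 10% for 2 periods: 1000 -> 1100 -> 1210
```

A good CoT response decomposes a problem into exactly this kind of loop of small, checkable operations.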
Effective CoT Strategies
Template Structures
# Problem decomposition
prompt = f"""
To solve this problem, I need to:
1. Identify the key information
2. Determine the operations needed
3. Execute each step
4. Verify the answer
Problem: {problem}
"""
# Explicit reasoning
prompt = f"""
Question: {question}
Let me reason through this:
- First, I observe that...
- This means...
- Therefore...
- My final answer is...
"""
Domain-Specific Prompts
Math:
"Show your work. Write out each calculation."
Logic:
"Consider each premise. What can we deduce?"
Code:
"Trace through the code step by step with example inputs."
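The domain hints above can be bundled into a single helper. A minimal sketch, using a hypothetical `build_cot_prompt` function (not part of any library) that falls back to the generic zero-shot trigger:

```python
# Map task domains to the CoT instructions listed above.
COT_INSTRUCTIONS = {
    "math": "Show your work. Write out each calculation.",
    "logic": "Consider each premise. What can we deduce?",
    "code": "Trace through the code step by step with example inputs.",
}

def build_cot_prompt(question: str, domain: str = "math") -> str:
    """Append a domain-appropriate CoT instruction to the question."""
    instruction = COT_INSTRUCTIONS.get(domain, "Let's think step by step.")
    return f"{question}\n{instruction}"
```

Tailoring the instruction to the domain tends to elicit reasoning in the style the task actually needs.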
Tree of Thoughts
Extension of CoT that explores multiple reasoning branches:
              Problem
                 │
    ┌────────────┼────────────┐
    ▼            ▼            ▼
 Path A       Path B       Path C
    │            │            │
┌───┴───┐      Dead       ┌───┴───┐
▼       ▼      End        ▼       ▼
A.1    A.2               C.1     C.2
(best)                        (dead end)
def tree_of_thoughts(problem, model, breadth=3, depth=3):
    # Assumes the model exposes generate_thoughts / evaluate_thought,
    # and that an evaluate() scoring function exists for leaf states.
    def explore(state, depth_remaining):
        if depth_remaining == 0:
            return evaluate(state)
        # Generate multiple candidate next steps
        candidates = model.generate_thoughts(state, n=breadth)
        # Score each candidate and keep the best half (prune the rest)
        scored = [(c, model.evaluate_thought(c)) for c in candidates]
        best = sorted(scored, key=lambda x: x[1], reverse=True)[:breadth // 2]
        # Recurse on the surviving candidates; return the best score found
        return max(explore(c, depth_remaining - 1) for c, _ in best)

    return explore(problem, depth)
Benchmarks and Results
| Task | Direct | Zero-Shot CoT | Few-Shot CoT |
|---|---|---|---|
| GSM8K (math) | 17.1% | 40.7% | 58.1% |
| SVAMP (math) | 63.4% | 68.9% | 79.0% |
| AQuA (algebra) | 26.4% | 39.4% | 45.3% |
Results for PaLM 540B model
When to Use CoT
Good For
- Math word problems
- Multi-step reasoning
- Logical deduction
- Code understanding
- Complex decision making
Not Needed For
- Simple factual questions
- Single-step tasks
- Classification with clear criteria
- Tasks where reasoning doesn't help
Common Pitfalls
1. Reasoning Errors Compound
Step 1: 5 + 3 = 7 ← Error here
Step 2: 7 × 2 = 14 ← Carries forward
Solution: Use self-consistency with multiple samples
2. Verbose but Wrong
A long explanation that sounds confident but arrives at the wrong answer
Solution: Verify final answer, add sanity checks
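A basic sanity check is to validate the extracted answer against the question's constraints before accepting it. A minimal sketch for numeric answers, with a hypothetical plausible-range check:

```python
def sanity_check(answer: str, low: float, high: float) -> bool:
    """Reject answers that are not numeric or fall outside a plausible range."""
    try:
        value = float(answer)
    except ValueError:
        return False
    return low <= value <= high

# A count of apples should be a small non-negative number:
# sanity_check("9", 0, 100)   -> True
# sanity_check("-3", 0, 100)  -> False
```

For higher-stakes use, a second model call asking "Is this answer consistent with the question?" serves the same purpose.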
3. Overthinking Simple Problems
Question: "What is 2 + 2?"
CoT: "Let me break this down into components..." (overkill)
Solution: Match complexity to problem
Implementation Example
from openai import OpenAI
client = OpenAI()
def solve_with_cot(problem, use_few_shot=True):
    few_shot_examples = """
Q: A restaurant has 20 tables. Each table has 4 chairs.
If 12 tables are occupied with 3 people each, how many
empty chairs are there?
A: Let's solve this step by step:
- Total chairs: 20 tables × 4 chairs = 80 chairs
- Occupied chairs: 12 tables × 3 people = 36 chairs
- Empty chairs: 80 - 36 = 44 chairs
The answer is 44.
""" if use_few_shot else ""

    prompt = f"{few_shot_examples}Q: {problem}\nA: Let's solve this step by step:"

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return response.choices[0].message.content
Key Takeaways
- CoT improves reasoning by generating intermediate steps
- "Let's think step by step" enables zero-shot CoT
- Few-shot examples improve performance further
- Self-consistency (multiple samples + voting) adds robustness
- Tree of Thoughts explores multiple reasoning paths
- Best for multi-step reasoning tasks, not simple questions