Interactive Course · Based on the 2022 Paper

What If You Just Asked an AI to Show Its Work?

How a simple tweak to the way we talk to AI models unlocked an ability no one expected: genuine multi-step reasoning.

Based on "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" by Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou (Google Brain, 2022)

01

The Surprising Discovery

How showing an AI a few worked examples changed everything about what large language models can do

A Teacher's Oldest Trick

Think about how a math tutor helps a struggling student. They don't just show the final answer -- they write out every step on the whiteboard. "First we group the terms, then we divide both sides, then we simplify..."

In 2022, researchers at Google Brain tried the same approach with AI. Instead of just asking a large language model for answers, they showed it a few examples where the reasoning was written out step by step. The result was dramatic.

What They Found

💡

The Technique

Include a few examples of step-by-step reasoning in the prompt -- that's it. No retraining, no new architecture, no code changes.

📈

The Result

On a grade-school math benchmark, accuracy more than tripled -- from 17.9% to 58.1% -- just by changing how the question was phrased.

🔍

The Surprise

This only works on sufficiently large models. Small models actually get worse with chain-of-thought. There seems to be a threshold of scale where reasoning "unlocks."

From the Paper: The Core Idea

FROM THE PAPER
// Standard prompting:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
   Each can has 3 tennis balls. How many does he have now?
A: The answer is 11.

// Chain-of-thought prompting:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
   Each can has 3 tennis balls. How many does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6.
   5 + 6 = 11. The answer is 11.
PLAIN ENGLISH

Standard prompting just gives the model a question and a final answer. The model sees the "what" but not the "how."

Chain-of-thought prompting adds intermediate reasoning steps before the answer. It's like showing your work on a math test.

The key insight: the model doesn't just mimic the format. It actually generates novel reasoning steps for new questions it hasn't seen before.

The only difference between these two approaches is what's in the examples. The model, the data, and the code are all identical.
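The contrast can be sketched in a few lines of Python. This is an illustration, not the paper's code: `build_prompt` is a hypothetical helper, and a real run would send the resulting string to an LLM API.

```python
# A minimal sketch of the two prompt formats. The exemplar text is the
# paper's Roger example; everything else here is illustrative.

STANDARD_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many does he have now?\n"
    "A: The answer is 11."
)

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6. "
    "5 + 6 = 11. The answer is 11."
)

def build_prompt(exemplar: str, question: str) -> str:
    # Everything is identical except the exemplar's answer text.
    return f"{exemplar}\n\nQ: {question}\nA:"

new_q = "A juggler has 16 balls. Half are golf balls. How many golf balls?"
print(build_prompt(COT_EXEMPLAR, new_q))
```

Swapping `STANDARD_EXEMPLAR` for `COT_EXEMPLAR` is the entire intervention: same model, same question, same call.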

💡
Transferable insight: The way you frame a question fundamentally changes the quality of the answer you get -- from AI and from people. The next time you ask someone to solve a problem, try asking them to "walk me through your reasoning" instead of just "give me the answer." The principle is identical.
STANDARD PROMPTING

Just Ask for Answers

  • Model sees: question and final answer
  • Works well for simple lookups
  • Fails on multi-step reasoning
  • GSM8K accuracy: ~17.9%
vs
CHAIN-OF-THOUGHT

Ask for Step-by-Step Reasoning

  • Model sees: question, reasoning steps, then answer
  • Unlocks multi-step problem solving
  • Works across math, logic, and common sense
  • GSM8K accuracy: ~58.1%

Check Your Understanding

A friend tells you: "The Google team built a brand new AI model that can reason." Based on what you just learned, what's actually true?

You're building a chatbot for customer service. Would chain-of-thought prompting help your bot handle a complex refund policy with multiple conditions?

02

Meet the Key Concepts

The building blocks you need to understand this paper -- and to talk about it intelligently

Few-Shot Prompting: Teaching by Example

Imagine you arrive in a foreign country and don't speak the language. Someone holds up a red apple and says "manzana." Then a red car and says "rojo." Then a blue hat and says "azul." From just these few examples, you can start guessing: the next red thing might also be "rojo."

That's few-shot prompting. You put a handful of input-output examples into the prompt (the text you feed to the model) and the model picks up the pattern. No retraining, no new data -- just examples tucked into the question itself.
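A few-shot prompt is literally just examples concatenated into text. A minimal sketch, with illustrative Spanish pairs echoing the analogy above:

```python
# Few-shot prompting in miniature: the "training" is example pairs
# pasted into the prompt itself. No model weights change.

examples = [
    ("red apple", "manzana roja"),
    ("red car", "coche rojo"),
    ("blue hat", "sombrero azul"),
]

def few_shot_prompt(pairs, new_input):
    # Each (input, output) pair becomes one block of the prompt;
    # the model is left to complete the final "Output:".
    lines = [f"Input: {x}\nOutput: {y}" for x, y in pairs]
    lines.append(f"Input: {new_input}\nOutput:")
    return "\n\n".join(lines)

print(few_shot_prompt(examples, "blue car"))
```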

📚
Field context: Few-shot prompting was popularized by GPT-3 in 2020. It was revolutionary because previously, getting a model to do a new task required collecting thousands of labeled examples and retraining it. Few-shot prompting said: "just show it a few examples at inference time."

Chain-of-Thought: Showing Your Work

A chain of thought is simply a series of intermediate reasoning steps. Instead of the model going from question straight to answer (a black box), it writes out each step of its thinking.

FROM THE PAPER
// The paper's definition: "We explore the ability of language models to generate a chain of thought -- a coherent series of intermediate natural language reasoning steps that lead to the final answer for a problem."
PLAIN ENGLISH

"chain of thought" -- a step-by-step walkthrough of reasoning, written in plain language (not code or math notation).

"coherent series" -- the steps logically follow from each other. Step 2 builds on step 1, step 3 builds on step 2, and so on.

"intermediate" -- these are the steps between reading the question and stating the answer. They're the thinking process made visible.

"natural language" -- the reasoning is in words, not equations or code. This matters because it means any human can read and verify the steps.

Emergent Abilities: Surprises at Scale

Here's the twist that makes this paper genuinely fascinating. Chain-of-thought prompting is an emergent ability -- it doesn't work at all on small models, and then it suddenly starts working once the model is large enough.

🐜

Small Models (~1B parameters)

Chain-of-thought prompting actually hurts performance. The model generates incoherent reasoning that leads to wrong answers.

🐕

Medium Models (~10B parameters)

Slight improvement on some tasks, but inconsistent. The reasoning is often logically shaky.

🐘

Large Models (~100B+ parameters)

Chain-of-thought dramatically improves performance. The model generates coherent, multi-step reasoning that often reaches the right answer.

💡
Transferable insight: "More of the same" doesn't always mean "slightly better." Sometimes systems undergo phase transitions -- where a quantitative change (bigger model) produces a qualitative change (new ability). This pattern appears everywhere: cities gaining new industries at certain population thresholds, teams developing new dynamics at certain sizes, even chemical reactions requiring a minimum concentration. When something "suddenly works" after scaling up, you might be witnessing a phase transition.

What "Parameters" Means

TECHNICAL CONCEPT
Model sizes tested:
  422M parameters (GPT-like, small)
  8B parameters (medium)
  62B parameters (large)
  540B parameters (PaLM, largest)
// The paper tested five model families: GPT-3, LaMDA, PaLM, UL2, Codex
PLAIN ENGLISH

Parameters are the adjustable knobs inside a model. Think of them as the number of synapses in a brain -- more parameters means more capacity to learn patterns. "540B" means 540 billion of these knobs.

Why test different sizes? To prove chain-of-thought isn't just working because a model is generally better. The technique specifically helps large models in a way it doesn't help small ones.

Five model families means the effect isn't specific to one company's AI. It works across different architectures built by different teams, suggesting something fundamental is going on.

Check Your Understanding

Your startup is considering using a small, cheap language model (1 billion parameters) with chain-of-thought prompting to handle customer math questions. Based on what you've learned, what would you predict?

Why did the researchers test across five different model families instead of just one?

03

The Problem Before This Paper

Why large language models were impressively fluent but frustratingly bad at reasoning

Eloquent but Not Logical

By 2022, large language models could write poetry, summarize legal documents, and carry on conversations. But ask one a word problem a 10-year-old could solve, and it would confidently produce the wrong answer.

Imagine a brilliant orator who can give a compelling speech on quantum physics but can't calculate a restaurant tip. That was the state of AI in early 2022.

Where Models Fell Apart

🔢

Arithmetic Reasoning

"If a train travels 60 mph for 2.5 hours, how far does it go?" Models would jump to an answer without computing 60 x 2.5.

⚖️

Commonsense Reasoning

"Would a golf ball fit through a garden hose?" Requires spatial reasoning that models couldn't chain together.

🧩

Symbolic Reasoning

"If A is taller than B, and B is taller than C, who is shortest?" Simple logic, but models needed to chain multiple facts together.

Why Wasn't Scaling Enough?

The dominant approach was: "Make the model bigger, train it on more data, and eventually it will figure out reasoning." But scaling alone hit a wall. GSM8K -- a test of grade-school math -- stumped models that cost millions of dollars to train.

THE PROBLEM
Question: A cafeteria had 23 apples. If they used 20 for lunch
and bought 6 more, how many apples do they have?

// What the model does internally:
Input -> [hidden computation] -> Output: "27"

// No visible reasoning steps
// Sometimes right, often wrong
// No way to diagnose errors
PLAIN ENGLISH

The problem requires two steps: subtract 20 from 23 (getting 3), then add 6 (getting 9). A child writes these steps; a model tries to skip straight to the answer.

"Hidden computation" -- standard prompting gives the model no space to think out loud. It's like asking someone to do mental math but covering their scratch paper.

Why this matters: when the model is wrong, we can't tell WHERE it went wrong. Did it misread the problem? Did it add instead of subtract? The black box gives us no clues.

What People Had Already Tried

1

Fine-tuning on reasoning data

Collecting thousands of step-by-step solutions and retraining the model on them. This works but is expensive, time-consuming, and must be redone for every new task.

2

Training a separate verifier

Building a second model to check the first model's work. Doubles the cost and complexity.

3

Specialized architectures

Designing new model structures specifically for reasoning. Requires massive engineering effort and limits generality.

💡
Transferable insight: Sometimes the most powerful solutions aren't about building new tools -- they're about using existing tools differently. Before assuming you need a complex, expensive solution, ask: "What if the capability is already there and I just need to ask for it the right way?"

Check Your Understanding

A colleague says: "Language models can't reason because they weren't trained on enough reasoning examples." Based on what you've learned, what's the more nuanced take?

04

How They Did It

The research methodology -- deceptively simple in design, rigorous in execution

Beautifully Simple Design

Think about how a scientist tests whether a new ingredient makes bread rise higher. They bake two loaves -- one with the ingredient, one without -- keeping everything else identical. That's exactly what this paper does, but with AI prompts.

The researchers kept the same model, the same questions, and the same number of examples. The ONLY thing that changed was whether the examples included intermediate reasoning steps. This makes the results incredibly clean: any difference in performance must be caused by the chain-of-thought format.

The Research Pipeline

📝
Write Examples
Manually write 3-8 chain-of-thought exemplars per task
🤖
Prompt Model
Feed examples + new question to the model, let it generate
📊
Collect Outputs
Record both the reasoning chain and the final answer
⚖️
Compare
Measure accuracy vs. standard prompting baseline
🔎
Analyze Errors
Read wrong chains to understand WHERE reasoning fails

Crafting the Examples

The researchers wrote the chain-of-thought examples by hand. This is a critical detail: they didn't train the model to produce step-by-step reasoning. They just showed it what step-by-step reasoning looks like, and the model generalized from there.

ACTUAL EXEMPLAR FROM THE PAPER
Q: There are 15 trees in the grove. Grove workers will plant trees in
   the grove today. After they are done, there will be 21 trees. How
   many trees did the grove workers plant today?
A: There are 15 trees originally. Then there were 21 trees after some
   more were planted. So there must have been 21 - 15 = 6 trees
   planted. The answer is 6.
WHY THIS WORKS

Restating the facts -- "There are 15 trees originally" -- forces the model to acknowledge what it knows before jumping to computation.

Identifying what changed -- "Then there were 21" -- establishes the before and after states.

Stating the operation -- "21 - 15 = 6" -- makes the arithmetic explicit. Without this step, the model might guess rather than compute.

"The answer is 6" -- a consistent format for extracting the final answer, separate from the reasoning.
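Because every exemplar ends with the same "The answer is" phrase, pulling the final answer out of a generated chain is mechanical. A sketch (the regex is an assumption; real benchmark answers come in more formats than this):

```python
import re

def extract_answer(chain):
    # Take the LAST occurrence, in case the reasoning itself
    # happens to contain the phrase.
    matches = re.findall(r"The answer is\s+([\-\d\.,/]+)", chain)
    return matches[-1].rstrip(".") if matches else None

chain = ("There are 15 trees originally. Then there were 21 trees after "
         "some more were planted. So there must have been 21 - 15 = 6 "
         "trees planted. The answer is 6.")
print(extract_answer(chain))  # -> 6
```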

What They Tested On

📐

Arithmetic (5 benchmarks)

GSM8K, SVAMP, ASDiv, AQuA, MAWPS -- grade-school to competition-level math word problems requiring 2-8 reasoning steps.

🌍

Commonsense (5 benchmarks)

CSQA, StrategyQA, Date Understanding, Sports Understanding, SayCan -- questions requiring real-world knowledge and multi-hop reasoning about everyday situations.

🔮

Symbolic (2 benchmarks)

Last Letter Concatenation, Coin Flip -- abstract logical tasks that test pure rule-following ability.

⚠️
Important caveat: The researchers used a relatively small number of hand-written exemplars (typically 8 per task). They acknowledged that the specific exemplars matter -- different exemplars can produce different results. This is a limitation: the technique's performance depends partly on the quality of the human-written examples.

The Debate: Is This Really Reasoning?

The Concepts in Conversation: What's Happening Inside?

Standard Prompting:

I give the model a question, it gives me an answer. Simple. Fast. What more do you need?

Chain-of-Thought:

But you're forcing the model to solve everything in a single leap! Imagine trying to multiply 347 times 29 in your head without writing anything down.

Skeptic:

Hold on. Just because the model writes out steps doesn't mean it's "reasoning." Maybe it's just mimicking the format of reasoning without understanding anything.

Chain-of-Thought:

Fair point. But the model generates correct novel reasoning chains for problems it's never seen. If it were just copying a pattern, it would get the format right but the math wrong.

Skeptic:

I'll concede it gets the RIGHT answers more often. But whether it truly "understands" is a different question -- and one this paper wisely doesn't try to answer.

Chain-of-Thought:

Exactly. This paper is about results, not philosophy. And the results are clear: step-by-step prompting unlocks abilities that were invisible before.

Check Your Understanding

Why is it significant that the researchers wrote the chain-of-thought examples by hand rather than generating them automatically?

The researchers kept everything identical between standard and chain-of-thought prompting except the reasoning steps in the examples. Why is this controlled comparison important?

05

The Evidence

What the numbers actually show -- and how to read them critically

The Headline Numbers

Think of these results like a before-and-after photo for a medical treatment. The "patient" is the same model -- only the "treatment" (prompting style) changed.

🏆

GSM8K: 17.9% → 58.1%

On grade-school math, PaLM 540B more than tripled its accuracy with chain-of-thought. This was the paper's most dramatic result.

💪

Beat fine-tuned models

On several benchmarks, chain-of-thought prompting (with zero task-specific training) outperformed models specifically trained for those tasks.

🌐

Works across task types

Improvements appeared on arithmetic, commonsense, and symbolic reasoning -- three very different kinds of thinking.

The Scale Effect in Numbers

FROM THE PAPER (GSM8K RESULTS)
// PaLM model family on GSM8K:
PaLM 8B   + standard:  2.4%
PaLM 8B   + CoT:       2.3%  (no improvement)
PaLM 62B  + standard: 12.0%
PaLM 62B  + CoT:      15.6%  (slight gain)
PaLM 540B + standard: 17.9%
PaLM 540B + CoT:      58.1%  (+40.2 points!)
WHAT THIS MEANS

8B model -- chain-of-thought actually hurts slightly. The model isn't large enough to generate coherent reasoning, so the extra steps introduce noise.

62B model -- a modest improvement. The model can sometimes produce useful reasoning, but it's inconsistent.

540B model -- the jump is enormous. The same technique that does nothing for small models transforms what the largest model can do. This is the "emergent ability" the paper highlights.

The pattern: It's not a gradual improvement across sizes. It's more like a light switch that flips on at a certain scale.
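The pattern shows up clearly if you just tabulate the gains. Pure arithmetic on the figures reported above:

```python
# GSM8K accuracy (%) by model size, from the results quoted above.
gsm8k = {
    "PaLM 8B":   {"standard": 2.4,  "cot": 2.3},
    "PaLM 62B":  {"standard": 12.0, "cot": 15.6},
    "PaLM 540B": {"standard": 17.9, "cot": 58.1},
}

# The gain from chain-of-thought is not proportional to model size:
# roughly 0, then small, then enormous -- the "light switch" pattern.
for model, acc in gsm8k.items():
    gain = acc["cot"] - acc["standard"]
    print(f"{model}: {gain:+.1f} points from chain-of-thought")
```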

Beyond Just Math

The researchers tested commonsense reasoning tasks too. These require real-world knowledge, not just arithmetic.

STANDARD PROMPTING

PaLM 540B Results

  • StrategyQA: 73.9%
  • Date Understanding: 65.3%
  • Sports Understanding: 95.4%
vs
CHAIN-OF-THOUGHT

PaLM 540B Results

  • StrategyQA: 77.8% (+3.9)
  • Date Understanding: 77.4% (+12.1)
  • Sports Understanding: 97.3% (+1.9)

The gains on commonsense tasks are smaller than on math -- which makes sense. Commonsense questions often require fewer explicit steps, so the "show your work" benefit is less dramatic.

How Strong Is the Evidence?

Claim: "Chain-of-thought prompting significantly improves reasoning in large language models"
Supporting: Tested across 5 model families, 10 benchmarks, and multiple model sizes. Consistent improvements on arithmetic and symbolic reasoning. Results are robust across different exemplar sets (though performance varies). Surpasses fine-tuned models on several benchmarks.
Caveats: All models were tested by Google researchers who designed the technique. The choice of exemplars matters and wasn't systematically optimized. Commonsense gains are more modest. The "why it works" mechanism isn't fully explained.

Check Your Understanding

A tech blog writes: "Chain-of-thought prompting improves AI performance by 40 percentage points on every task." Based on the actual evidence, what's wrong with this claim?

Why is it significant that chain-of-thought prompting outperformed some fine-tuned models?

06

What It Doesn't Prove

Every paper has boundaries -- understanding them makes you a more critical thinker

Honest Limitations

The best way to evaluate any research is to understand what it does NOT claim. Think of this paper as a powerful flashlight -- it illuminates a real phenomenon brilliantly, but there are corners of the room it doesn't reach.

🏳️

Only Works at Scale

You need a model with roughly 100B+ parameters. Most organizations can't run or afford models that large. This limits who can benefit.

No Guaranteed Correctness

The model can produce a perfectly logical-looking reasoning chain that reaches the wrong answer. Fluent reasoning is not the same as correct reasoning.

🔍

We Don't Know Why It Works

The paper shows THAT chain-of-thought works, not WHY. Is the model truly reasoning, or has it memorized reasoning patterns? This remains an open question.

The Correctness Problem

FROM THE PAPER
// Error analysis on GSM8K (PaLM 540B):
Correct answer + correct chain: ~58%
Wrong answer + flawed chain:    ~42%

// Among wrong answers:
Calculator errors         (8% of errors)  -- correct logic, wrong arithmetic
Missing step              (24% of errors) -- skipped a necessary reasoning step
Semantic misunderstanding (52% of errors) -- misinterpreted the problem
WHAT THIS TELLS US

58% accuracy is a huge improvement, but it still means the model is wrong nearly half the time. This is not reliable enough for high-stakes decisions.

Calculator errors are actually encouraging -- the model understood the problem but fumbled the arithmetic. This is fixable by plugging in a calculator tool.

Missing steps mean the model sometimes takes shortcuts, skipping reasoning that turns out to be crucial. Like a student who "sees" the answer but can't prove it.

Semantic misunderstanding is the biggest issue. The model misreads what the question is actually asking -- a fundamental comprehension failure that step-by-step reasoning can't fix.
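The "plug in a calculator" fix mentioned above can be sketched very simply: scan the chain for equations the model wrote and recompute each one. This is a toy illustration, assuming only the simple "a op b = c" patterns that appear in GSM8K-style chains:

```python
import re

# Supported operators for the sketch; real tool use handles far more.
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "x": lambda a, b: a * b}

def check_arithmetic(chain):
    """Yield (equation, ok) for every simple equation in the chain."""
    pattern = r"(-?\d+)\s*([+\-*x])\s*(-?\d+)\s*=\s*(-?\d+)"
    for a, op, b, c in re.findall(pattern, chain):
        ok = OPS[op](int(a), int(b)) == int(c)
        yield f"{a} {op} {b} = {c}", ok

chain = "2 cans of 3 balls each is 2 * 3 = 6. 5 + 6 = 12. The answer is 12."
for eq, ok in check_arithmetic(chain):
    print(eq, "OK" if ok else "WRONG")
```

Note what this can and cannot catch: the flawed "5 + 6 = 12" step is flagged, but a semantic misunderstanding (subtracting when the problem calls for adding) would pass every arithmetic check.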

Spot the Flaw

LinkedIn post

"New paper proves AI can truly reason like humans! Chain-of-thought prompting gives language models real logical thinking abilities."

The paper's authors are careful never to claim the model "truly reasons." They show that prompting with step-by-step examples improves task performance, but they note that the model still makes errors, and the mechanism behind the improvement isn't fully understood. There's a big difference between "performs better on reasoning tasks" and "reasons like a human." The post conflates performance with understanding.
💡
Transferable insight: When evaluating any claim about AI (or any technology), distinguish between three levels: (1) "It produces better outputs" -- a performance claim that can be measured, (2) "It uses mechanism X internally" -- a mechanistic claim that needs deeper evidence, (3) "It truly understands" -- a philosophical claim that requires a definition of "understanding." Most AI hype conflates level 1 with level 3.
⚠️
Important caveat: The paper's experiments all use English-language benchmarks. Whether chain-of-thought prompting works as well in other languages, especially those with very different grammatical structures, was not tested.

Check Your Understanding

Your manager wants to use chain-of-thought prompting to automatically approve loan applications because "the AI can reason now." What's the most important concern you should raise?

07

The Big Picture

Why this paper matters beyond the specific results -- and where the field went next

Why This Paper Changed the Field

This paper wasn't just about a clever trick. It changed the way people think about what language models can and can't do. Imagine discovering that your car has a turbo mode you never knew about -- you'd rethink what you use the car for.

🚀

Prompt Engineering Was Born

This paper helped launch an entire discipline: figuring out how to talk to AI models to get the best results. "Prompt engineering" became a real job title.

🌱

Spawned a Research Field

Dozens of follow-up papers explored variations: self-consistency, tree-of-thought, zero-shot CoT ("Let's think step by step"), and more.

🔄

Reframed the Scaling Debate

Before this paper, bigger models just meant "better at the same things." This paper showed bigger models could do qualitatively different things.

What Came After

1

Zero-Shot Chain-of-Thought (Kojima et al., 2022)

Discovered you don't even need examples. Just adding "Let's think step by step" to the prompt triggers chain-of-thought reasoning. Even simpler than Wei et al.'s approach.

2

Self-Consistency (Wang et al., 2022)

Generate multiple reasoning chains for the same question and take a majority vote on the final answer. This boosted GSM8K accuracy from 58.1% to 74.4%.

3

Tree of Thoughts (Yao et al., 2023)

Instead of a single chain, explore multiple reasoning paths and backtrack when one path fails. Like a chess player thinking several moves ahead.

4

Today's AI Assistants

ChatGPT, Claude, and other modern AI assistants use chain-of-thought internally. When you ask a complex question, they're using variations of this technique behind the scenes.

The Paper's Lasting Contribution

FROM THE PAPER'S CONCLUSION
"We have explored chain-of-thought prompting as a simple mechanism for eliciting multi-step reasoning behavior in large language models. We find that chain-of-thought prompting is an emergent ability of model scale that opens up exciting questions about the nature of few-shot learning in large language models."
PLAIN ENGLISH

"simple mechanism" -- the authors are emphasizing how straightforward the technique is. No complex engineering, just better examples.

"eliciting" is a key word choice. It means "drawing out something that was already there." The reasoning ability wasn't created -- it was unlocked.

"emergent ability of model scale" -- this is the paper's most provocative claim. Something qualitatively new happens at large scale. It's not just "more of the same, but better."

"exciting questions" -- the authors are saying: we showed this works, but we don't fully understand why. This is an invitation for the rest of the field to investigate.

What This Means For You

💡
Practical takeaway: The next time you use ChatGPT or any AI assistant for a complex task, try explicitly asking it to "think step by step" or "show your reasoning." You're directly applying this paper's key finding. The quality of the answer often depends not on the AI's intelligence, but on how you frame the question.
💬

For Conversations About AI

You now understand the difference between "AI can reason" (overstatement) and "certain prompting techniques elicit better reasoning performance from large models" (accurate).

🎯

For Evaluating AI Products

When a company claims their AI "reasons," ask: How large is the model? What prompting techniques are they using? What's the error rate? These are now questions you can ask intelligently.

🧠

For Thinking About Thinking

The paper raises deep questions: if writing out reasoning steps helps AI, does it help us too? (Research says yes -- "thinking aloud" improves human problem-solving.) The mechanism may be universal.

Final Check

A journalist asks you to summarize this paper in one sentence. Which summary is most accurate?

Knowing what you now know, which of these future developments would most surprise you?

You've Completed the Course

You now understand the core idea, methodology, evidence, limitations, and impact of one of the most influential AI papers of the past decade. The next time someone mentions "chain-of-thought" or "prompt engineering," you can speak with genuine understanding -- not just buzzword fluency.

📚
Read the original: Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." arXiv:2201.11903. Published at NeurIPS 2022.