How a simple tweak to the way we talk to AI models unlocked an ability no one expected: genuine multi-step reasoning.
How showing an AI a few worked examples changed everything about what large language models can do
Think about how a math tutor helps a struggling student. They don't just show the final answer -- they write out every step on the whiteboard. "First we group the terms, then we divide both sides, then we simplify..."
In 2022, researchers at Google Brain tried the same approach with AI. Instead of just asking a large language model for answers, they showed it a few examples where the reasoning was written out step by step. The result was dramatic.
Include a few examples of step-by-step reasoning in the prompt -- that's it. No retraining, no new architecture, no code changes.
On a grade-school math benchmark, accuracy more than tripled -- from 17.9% to 58.1% -- just by changing how the question was phrased.
This only works on sufficiently large models. Small models actually get worse with chain-of-thought. There seems to be a threshold of scale where reasoning "unlocks."
Standard prompting just gives the model a question and a final answer. The model sees the "what" but not the "how."
Chain-of-thought prompting adds intermediate reasoning steps before the answer. It's like showing your work on a math test.
The key insight: the model doesn't just mimic the format. It actually generates novel reasoning steps for new questions it hasn't seen before.
The only difference between these two approaches is what's in the examples. The model, the data, and the code are all identical.
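To make that contrast concrete, here's a sketch of the two prompt styles. The tennis-ball exemplar is adapted from the paper's own examples, but the exact wording below is illustrative, not the paper's verbatim prompts:

```python
# Standard prompting: the example shows only question and final answer.
# The model sees the "what" but not the "how".
standard_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
How many tennis balls does he have now?
A: The answer is 11.

Q: If a train travels 60 mph for 2.5 hours, how far does it go?
A:"""

# Chain-of-thought prompting: identical, except the example also
# writes out the intermediate reasoning steps before the answer.
cot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls.
5 + 6 = 11. The answer is 11.

Q: If a train travels 60 mph for 2.5 hours, how far does it go?
A:"""
```

Everything else -- the model, the question being asked, the number of examples -- stays the same.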
The building blocks you need to understand this paper -- and to talk about it intelligently
Imagine you arrive in a foreign country and don't speak the language. Someone holds up a red apple and says "manzana." Then a red car and says "rojo." Then a blue hat and says "azul." From just these few examples, you can start guessing: the next red thing might also be "rojo."
That's few-shot prompting. You put a handful of input-output examples into the prompt (the text you feed to the model) and the model picks up the pattern. No retraining, no new data -- just examples tucked into the question itself.
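In code, a few-shot prompt is nothing more than a string with the examples inlined. Here's a toy version of the translation analogy above (the task format and word pairs are illustrative):

```python
# A few-shot prompt: input-output pairs tucked directly into the text
# fed to the model. The model is expected to continue the pattern --
# no retraining, no new data.
few_shot_prompt = """\
English: apple -> Spanish: manzana
English: red -> Spanish: rojo
English: blue -> Spanish: azul
English: green -> Spanish:"""
```

The model's job is simply to complete the last line, and the three worked pairs are all the "teaching" it gets.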
A chain of thought is simply a series of intermediate reasoning steps. Instead of the model going from question straight to answer (a black box), it writes out each step of its thinking.
"chain of thought" -- a step-by-step walkthrough of reasoning, written in plain language (not code or math notation).
"coherent series" -- the steps logically follow from each other. Step 2 builds on step 1, step 3 builds on step 2, and so on.
"intermediate" -- these are the steps between reading the question and stating the answer. They're the thinking process made visible.
"natural language" -- the reasoning is in words, not equations or code. This matters because it means any human can read and verify the steps.
Here's the twist that makes this paper genuinely fascinating. Chain-of-thought prompting is an emergent ability -- it doesn't work at all on small models, and then it suddenly starts working once the model is large enough.
Small models (a few billion parameters): chain-of-thought prompting actually hurts performance. The model generates incoherent reasoning that leads to wrong answers.
Medium models (tens of billions of parameters): slight improvement on some tasks, but inconsistent. The reasoning is often logically shaky.
Large models (roughly 100B parameters and up): chain-of-thought dramatically improves performance. The model generates coherent, multi-step reasoning that often reaches the right answer.
Parameters are the adjustable knobs inside a model. Think of them like synapses in a brain -- more parameters means more capacity to learn patterns. "540B" means 540 billion of these knobs.
Why test different sizes? To prove chain-of-thought isn't just working because a model is generally better. The technique specifically helps large models in a way it doesn't help small ones.
Five model families means the effect isn't specific to one company's AI. It works across different architectures built by different teams, suggesting something fundamental is going on.
Why large language models were impressively fluent but frustratingly bad at reasoning
By 2022, large language models could write poetry, summarize legal documents, and carry on conversations. But ask one a word problem a 10-year-old could solve, and it would confidently produce the wrong answer.
Imagine a brilliant orator who can give a compelling speech on quantum physics but can't calculate a restaurant tip. That was the state of AI in early 2022.
"If a train travels 60 mph for 2.5 hours, how far does it go?" Models would jump to an answer without computing 60 x 2.5.
"Would a golf ball fit through a garden hose?" Requires spatial reasoning that models couldn't chain together.
"If A is taller than B, and B is taller than C, who is shortest?" Simple logic, but models needed to chain multiple facts together.
The dominant approach was: "Make the model bigger, train it on more data, and eventually it will figure out reasoning." But scaling alone hit a wall. GSM8K -- a test of grade-school math -- stumped models that cost millions of dollars to train.
The problem requires two steps: subtract 20 from 23 (getting 3), then add 6 (getting 9). A child writes these steps; a model tries to skip straight to the answer.
"Hidden computation" -- standard prompting gives the model no space to think out loud. It's like asking someone to do mental math but covering their scratch paper.
Why this matters: when the model is wrong, we can't tell WHERE it went wrong. Did it misread the problem? Did it add instead of subtract? The black box gives us no clues.
Collecting thousands of step-by-step solutions and retraining the model on them. This works but is expensive, time-consuming, and must be redone for every new task.
Building a second model to check the first model's work. Doubles the cost and complexity.
Designing new model structures specifically for reasoning. Requires massive engineering effort and limits generality.
The research methodology -- deceptively simple in design, rigorous in execution
Think about how a scientist tests whether a new ingredient makes bread rise higher. They bake two loaves -- one with the ingredient, one without -- keeping everything else identical. That's exactly what this paper does, but with AI prompts.
The researchers kept the same model, the same questions, and the same number of examples. The ONLY thing that changed was whether the examples included intermediate reasoning steps. This makes the results incredibly clean: any difference in performance must be caused by the chain-of-thought format.
The researchers wrote the chain-of-thought examples by hand. This is a critical detail: they didn't train the model to produce step-by-step reasoning. They just showed it what step-by-step reasoning looks like, and the model generalized from there.
Restating the facts -- "There are 15 trees originally" -- forces the model to acknowledge what it knows before jumping to computation.
Identifying what changed -- "Then there were 21" -- establishes the before and after states.
Stating the operation -- "21 - 15 = 6" -- makes the arithmetic explicit. Without this step, the model might guess rather than compute.
"The answer is 6" -- a consistent format for extracting the final answer, separate from the reasoning.
GSM8K, SVAMP, ASDiv, AQuA, MAWPS -- grade-school to competition-level math word problems requiring 2-8 reasoning steps.
CSQA, StrategyQA, Date Understanding -- questions requiring real-world knowledge and multi-hop reasoning about everyday situations.
Last Letter Concatenation, Coin Flip -- abstract logical tasks that test pure rule-following ability.
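These symbolic tasks have exact programmatic ground truth, which is part of what makes them clean tests of rule-following. A sketch of both (the function names and interfaces here are mine, not the paper's):

```python
def last_letter_concat(name: str) -> str:
    """Ground truth for Last Letter Concatenation: take the last
    letter of each word in a name and join them together."""
    return "".join(word[-1] for word in name.split())

def coin_state(flips: list[bool]) -> str:
    """Ground truth for Coin Flip: the coin starts heads-up; each
    True in `flips` means one person flipped the coin over."""
    return "heads" if sum(flips) % 2 == 0 else "tails"
```

For example, `last_letter_concat("Elon Musk")` is `"nk"`, and a coin flipped twice is back to `"heads"`. A program solves these trivially; the question is whether a language model can follow the rule step by step.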
What the numbers actually show -- and how to read them critically
Think of these results like a before-and-after photo for a medical treatment. The "patient" is the same model -- only the "treatment" (prompting style) changed.
On grade-school math, PaLM 540B more than tripled its accuracy with chain-of-thought. This was the paper's most dramatic result.
On several benchmarks, chain-of-thought prompting (with zero task-specific training) outperformed models specifically trained for those tasks.
Improvements appeared on arithmetic, commonsense, and symbolic reasoning -- three very different kinds of thinking.
8B model -- chain-of-thought actually hurts slightly. The model isn't large enough to generate coherent reasoning, so the extra steps introduce noise.
62B model -- a modest improvement. The model can sometimes produce useful reasoning, but it's inconsistent.
540B model -- the jump is enormous. The same technique that does nothing for small models transforms what the largest model can do. This is the "emergent ability" the paper describes.
The pattern: It's not a gradual improvement across sizes. It's more like a light switch that flips on at a certain scale.
The researchers tested commonsense reasoning tasks too. These require real-world knowledge, not just arithmetic.
The gains on commonsense tasks are smaller than on math -- which makes sense. Commonsense questions often require fewer explicit steps, so the "show your work" benefit is less dramatic.
Every paper has boundaries -- understanding them makes you a more critical thinker
The best way to evaluate any research is to understand what it does NOT claim. Think of this paper as a powerful flashlight -- it illuminates a real phenomenon brilliantly, but there are corners of the room it doesn't reach.
You need a model with roughly 100B+ parameters. Most organizations can't run or afford models that large. This limits who can benefit.
The model can produce a perfectly logical-looking reasoning chain that reaches the wrong answer. Fluent reasoning is not the same as correct reasoning.
The paper shows THAT chain-of-thought works, not WHY. Is the model truly reasoning, or has it memorized reasoning patterns? This remains an open question.
58% accuracy is a huge improvement, but it still means the model is wrong nearly half the time. This is not reliable enough for high-stakes decisions.
Calculator errors are actually encouraging -- the model understood the problem but fumbled the arithmetic. This is fixable by plugging in a calculator tool.
Missing steps mean the model sometimes takes shortcuts, skipping reasoning that turns out to be crucial. Like a student who "sees" the answer but can't prove it.
Semantic misunderstanding is the biggest issue. The model misreads what the question is actually asking -- a fundamental comprehension failure that step-by-step reasoning can't fix.
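The first of those failure modes, calculator errors, suggests an easy patch: scan the generated chain for simple "a op b = c" statements and recompute them. The regex and repair strategy below are illustrative, not from the paper:

```python
import re

def fix_arithmetic(chain: str) -> str:
    """Recompute simple 'a op b = c' statements in a reasoning chain,
    replacing any miscalculated result with the correct one."""
    def recompute(m: re.Match) -> str:
        a, op, b = float(m.group(1)), m.group(2), float(m.group(3))
        result = {"+": a + b, "-": a - b, "*": a * b, "/": a / b}[op]
        # Print whole numbers without a trailing '.0'.
        text = str(int(result)) if result == int(result) else str(result)
        return f"{m.group(1)} {op} {m.group(3)} = {text}"
    pattern = r"(\d+(?:\.\d+)?)\s*([-+*/])\s*(\d+(?:\.\d+)?)\s*=\s*\d+(?:\.\d+)?"
    return re.sub(pattern, recompute, chain)
```

For instance, `fix_arithmetic("So 21 - 15 = 7.")` returns `"So 21 - 15 = 6."` -- the reasoning structure is kept, only the fumbled arithmetic is repaired. Missing steps and semantic misunderstandings have no such mechanical fix.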
"New paper proves AI can truly reason like humans! Chain-of-thought prompting gives language models real logical thinking abilities."
Why this paper matters beyond the specific results -- and where the field went next
This paper wasn't just about a clever trick. It changed the way people think about what language models can and can't do. Imagine discovering that your car has a turbo mode you never knew about -- you'd rethink what you use the car for.
This paper helped launch an entire discipline: figuring out how to talk to AI models to get the best results. "Prompt engineering" became a real job title.
Dozens of follow-up papers explored variations: self-consistency, tree-of-thought, zero-shot CoT ("Let's think step by step"), and more.
Before this paper, bigger models just meant "better at the same things." This paper showed bigger models could do qualitatively different things.
Zero-shot CoT (Kojima et al., 2022) discovered you don't even need examples. Just adding "Let's think step by step" to the prompt triggers chain-of-thought reasoning -- even simpler than Wei et al.'s approach.
Self-consistency (Wang et al., 2022): generate multiple reasoning chains for the same question and take a majority vote on the final answer. This boosted GSM8K accuracy from 58.1% to 74.4%.
Tree-of-thought (Yao et al., 2023): instead of a single chain, explore multiple reasoning paths and backtrack when one path fails. Like a chess player thinking several moves ahead.
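Two of these follow-ups are simple enough to sketch directly. Zero-shot CoT is just a template string; self-consistency is a majority vote over the answers of several sampled chains. Both sketches assume chains end with "The answer is X" and are illustrative, not the papers' exact code:

```python
from collections import Counter

def zero_shot_cot(question: str) -> str:
    """Zero-shot chain-of-thought: no worked examples at all, just
    the trigger phrase appended after the question."""
    return f"Q: {question}\nA: Let's think step by step."

def self_consistent_answer(chains: list[str]) -> str:
    """Self-consistency: majority vote over the final answers of
    several independently sampled reasoning chains."""
    answers = [chain.rsplit("The answer is", 1)[1].strip(" .")
               for chain in chains]
    return Counter(answers).most_common(1)[0][0]

# Three sampled chains for the tree-counting problem: two agree on 6,
# one makes an error, so the vote settles on 6.
chains = [
    "There are 15 trees, then 21. 21 - 15 = 6. The answer is 6.",
    "The difference is 21 - 15 = 6. The answer is 6.",
    "21 + 15 = 36. The answer is 36.",
]
```

The intuition behind the vote: a model can reach a wrong answer many different ways, but correct chains tend to converge on the same answer.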
ChatGPT, Claude, and other modern AI assistants use chain-of-thought internally. When you ask a complex question, they're using variations of this technique behind the scenes.
"simple mechanism" -- the authors are emphasizing how straightforward the technique is. No complex engineering, just better examples.
"eliciting" is a key word choice. It means "drawing out something that was already there." The reasoning ability wasn't created -- it was unlocked.
"emergent ability of model scale" -- this is the paper's most provocative claim. Something qualitatively new happens at large scale. It's not just "more of the same, but better."
"exciting questions" -- the authors are saying: we showed this works, but we don't fully understand why. This is an invitation for the rest of the field to investigate.
You now understand the difference between "AI can reason" (overstatement) and "certain prompting techniques elicit better reasoning performance from large models" (accurate).
When a company claims their AI "reasons," ask: How large is the model? What prompting techniques are they using? What's the error rate? These are now questions you can ask intelligently.
The paper raises deep questions: if writing out reasoning steps helps AI, does it help us too? (Research says yes -- "thinking aloud" improves human problem-solving.) The mechanism may be universal.
You now understand the core idea, methodology, evidence, limitations, and impact of one of the most influential AI papers of the past decade. The next time someone mentions "chain-of-thought" or "prompt engineering," you can speak with genuine understanding -- not just buzzword fluency.