Key Concepts
Click any concept card to reveal its real-world analogy. These eight ideas form the vocabulary of vLLM -- understanding them is the foundation for everything else in this course.
PagedAttention
An attention algorithm that manages the KV cache in non-contiguous, fixed-size memory blocks, inspired by OS virtual memory paging. Reduces KV-cache memory waste from the 60-80% typical of contiguous allocation to under 4%.
Like a library where book chapters can be stored on any available shelf segment -- a catalog (block table) tracks which shelf holds which chapter. No shelf space is wasted because a book doesn't need one contiguous spot.
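The shelf-catalog idea maps directly onto a block table. Below is a minimal, hypothetical toy allocator in plain Python -- not vLLM's actual internals (a block size of 4 keeps the example small; vLLM's default is 16 tokens):

```python
BLOCK_SIZE = 4  # tokens per KV block (toy value; vLLM defaults to 16)

class BlockAllocator:
    """Maps each sequence's logical blocks to any free physical block."""
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id, num_tokens_so_far):
        table = self.block_tables.setdefault(seq_id, [])
        # A new physical block is claimed only when the last one is full,
        # so at most BLOCK_SIZE - 1 slots are ever wasted per sequence.
        if num_tokens_so_far % BLOCK_SIZE == 0:
            table.append(self.free.pop())
        return table

alloc = BlockAllocator(num_physical_blocks=8)
for t in range(6):                  # generate 6 tokens for one request
    alloc.append_token("req-0", t)
print(alloc.block_tables["req-0"])  # [7, 6] -- logical and physical order differ
```

The block table is the "catalog": logical block 0 happens to live in physical block 7, and no contiguous reservation was ever needed.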
KV Cache
Stores the key and value vectors from attention computation. Every token generated requires attending to all previous tokens -- the KV cache holds those vectors so they aren't recomputed.
Imagine keeping running notes of everything you've already written in a long essay, so you don't have to re-read the entire essay every time you write the next sentence.
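A framework-free way to see the payoff is to count K/V computations. The sketch below is purely illustrative accounting (no real attention math):

```python
# Without a cache, every generation step recomputes K/V for all prior tokens;
# with a cache, each token's K/V is computed exactly once.
def kv_computations_without_cache(n_tokens):
    work = 0
    for step in range(1, n_tokens + 1):
        work += step          # recompute K/V for tokens 0..step-1 at every step
    return work

def kv_computations_with_cache(n_tokens):
    cache, work = [], 0
    for tok in range(n_tokens):
        cache.append(tok)     # K/V kept around after the first computation
        work += 1
    return work

print(kv_computations_without_cache(100))  # 5050 -- quadratic growth
print(kv_computations_with_cache(100))     # 100  -- linear
```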
Continuous Batching
Dynamically adds and removes requests from the running batch at every generation step, rather than waiting for an entire batch to finish. Delivers 3-10x higher throughput than static (wait-for-all) batching.
A restaurant that seats new diners the moment a seat opens, rather than waiting until all diners at a table finish eating. The kitchen never stops cooking, and no seat sits idle.
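The restaurant's seating policy can be simulated in a few lines. This is a hypothetical scheduler sketch, not vLLM's scheduler; `simulate` and its request tuples are made up for illustration:

```python
from collections import deque

def simulate(requests, max_batch=2):
    """requests: list of (arrival_step, tokens_to_generate).
    Returns the batch size at every decode step."""
    waiting = deque(sorted(requests))
    running = []                     # [arrival, tokens_remaining] per request
    batch_sizes, step = [], 0
    while waiting or running:
        # Seat new diners the moment a slot is free -- no waiting for the batch.
        while waiting and len(running) < max_batch and waiting[0][0] <= step:
            arrival, tokens = waiting.popleft()
            running.append([arrival, tokens])
        batch_sizes.append(len(running))
        for req in running:          # one decode step for every running request
            req[1] -= 1
        running = [r for r in running if r[1] > 0]   # finished slots freed now
        step += 1
    return batch_sizes

sizes = simulate([(0, 3), (0, 1), (1, 2)])
print(sizes)   # [2, 2, 2] -- both slots stay full at every step
```

Three requests of different lengths keep both slots busy the whole time; a static batcher would have left the short request's seat idle while the long one finished.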
Tensor Parallelism
Splits a model's weight matrices across multiple GPUs so each GPU holds a shard of every layer. The GPUs compute each forward pass collectively, exchanging partial results over fast NVLink links.
A team of chefs each responsible for one ingredient of every dish. Each works simultaneously, they combine results at a handoff point, and together produce the dish faster than any single chef.
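Here is a minimal sketch of the core idea using plain Python lists in place of GPU tensors -- a column-parallel linear layer where each hypothetical "GPU" holds half the weight columns:

```python
def matvec_cols(x, cols):
    """Multiply row vector x by a matrix given as a list of columns."""
    return [sum(xi * c[i] for i, xi in enumerate(x)) for c in cols]

x = [1.0, 2.0]
W_cols = [[1, 0], [0, 1], [1, 1], [2, -1]]   # a 2x4 weight, stored column-wise

full = matvec_cols(x, W_cols)                # single-device reference result

shard_a, shard_b = W_cols[:2], W_cols[2:]    # each "GPU" holds 2 columns
out = matvec_cols(x, shard_a) + matvec_cols(x, shard_b)  # concat = all-gather
print(out == full)   # True -- the sharded result matches exactly
```

In a real deployment the concatenation is an all-gather over NVLink; the arithmetic is identical.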
Pipeline Parallelism
Splits model layers across GPUs (or nodes) so each device holds a contiguous subset of layers. Activations flow from GPU to GPU in sequence, enabling multi-node deployments.
An assembly line in a factory -- each station handles one phase. While station 3 works on item A, station 2 starts on item B, and station 1 on item C.
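The assembly-line overlap can be written down as a schedule. A toy sketch (no real GPUs or microbatch contents, just indices):

```python
def pipeline_schedule(num_stages, num_microbatches):
    """At clock tick t, stage s works on microbatch t - s (when valid),
    so different stages overlap on different items."""
    ticks = []
    for t in range(num_stages + num_microbatches - 1):
        busy = {s: t - s for s in range(num_stages)
                if 0 <= t - s < num_microbatches}
        ticks.append(busy)
    return ticks

ticks = pipeline_schedule(num_stages=3, num_microbatches=3)
# Tick 2 is the steady state: every stage busy, each on a different microbatch.
print(ticks[2])   # {0: 2, 1: 1, 2: 0}
```

The first and last ticks are the "pipeline bubble" -- only some stages are busy -- which is why more microbatches mean better utilization.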
Speculative Decoding
Uses a small draft model to guess multiple tokens ahead, then verifies them in a single pass of the larger model. Correct guesses are accepted, wrong ones discarded.
A junior associate drafts a legal brief and a senior partner reviews it in one pass. If the draft is mostly right, the partner saves time. Wrong parts get corrected, not rewritten from scratch.
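A toy greedy version of draft-then-verify is sketched below. Real speculative decoding verifies with rejection sampling over token probabilities; here both "models" are hypothetical lookup tables so the accept/reject logic is visible:

```python
def draft_model(context):      # small and fast, sometimes wrong
    guesses = {"the": "cat", "cat": "sat", "sat": "on"}
    return guesses.get(context[-1], "???")

def target_model(context):     # large and authoritative
    truth = {"the": "cat", "cat": "sat", "sat": "down"}
    return truth[context[-1]]

def speculative_step(context, k=3):
    drafted = []
    for _ in range(k):                       # draft k tokens cheaply
        drafted.append(draft_model(context + drafted))
    accepted = []
    for tok in drafted:                      # one pass of the big model verifies
        if target_model(context + accepted) == tok:
            accepted.append(tok)
        else:
            break                            # first mismatch invalidates the rest
    if len(accepted) < k:
        # Take the target's own token instead -- never worse than 1 token/step.
        accepted.append(target_model(context + accepted))
    return accepted

print(speculative_step(["the"]))   # ['cat', 'sat', 'down']
```

The draft gets two of three tokens right, so a single verification pass yields three accepted tokens instead of one.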
Prefix Caching
Reuses the KV cache of shared prompt prefixes across requests. Uses hash-based lookups to detect shared prefixes and avoid redundant computation.
If a hundred students all start their essays with the same introduction, the teacher only reads that introduction once and reuses their notes for every student's essay.
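A minimal sketch of hash-based prefix reuse, with strings standing in for KV blocks (the chunking and hashing scheme here is illustrative, not vLLM's exact one):

```python
import hashlib

BLOCK = 2      # tokens per cached chunk (toy value)
cache = {}     # hash of prefix -> "precomputed KV block" (a stand-in string)
computed = 0   # how many blocks we actually had to compute

def kv_for_prompt(tokens):
    global computed
    blocks = []
    for end in range(BLOCK, len(tokens) + 1, BLOCK):
        # Key each chunk by a hash of ALL tokens up to it, so a chunk is only
        # reused when the entire prefix before it matches too.
        key = hashlib.sha256(" ".join(tokens[:end]).encode()).hexdigest()
        if key not in cache:
            cache[key] = f"KV[{end - BLOCK}:{end}]"   # pretend-compute
            computed += 1
        blocks.append(cache[key])
    return blocks

kv_for_prompt(["you", "are", "a", "helpful", "bot"])   # computes 2 full blocks
kv_for_prompt(["you", "are", "a", "helpful", "cat"])   # both blocks reused
print(computed)   # 2 -- the shared prefix was never recomputed
```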
CUDA Graphs
Records a sequence of GPU operations once and replays them without CPU-side launch overhead. Reduces decode latency by 10-30% for small batch steps.
Instead of a conductor calling out each instruction at every performance, the entire piece is recorded once and replayed -- every musician knows exactly what to do without waiting for each cue.
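The record-once, replay-many pattern can be mimicked in plain Python. Everything here is a stand-in: `launch` models the fixed CPU-side dispatch cost, and `abs` plays the role of a kernel.

```python
launch_overhead_paid = 0

def launch(op, *args):
    """Stand-in for an eager kernel launch with fixed CPU-side overhead."""
    global launch_overhead_paid
    launch_overhead_paid += 1
    return op(*args)

class Graph:
    def __init__(self):
        self.ops = []
    def record(self, op, *args):
        self.ops.append((op, args))       # capture the op, don't execute yet
    def replay(self):
        global launch_overhead_paid
        launch_overhead_paid += 1         # one launch covers the whole graph
        return [op(*args) for op, args in self.ops]

# Eager: three kernels, three launch overheads.
launch(abs, -1); launch(abs, -2); launch(abs, -3)
eager_cost = launch_overhead_paid

# Graph: record once, replay everything with a single launch.
g = Graph()
g.record(abs, -1); g.record(abs, -2); g.record(abs, -3)
launch_overhead_paid = 0
g.replay()
print(eager_cost, launch_overhead_paid)   # 3 1
```

For small decode steps the launch overhead dominates the actual kernel time, which is where the 10-30% latency win comes from.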
How They Fit Together
A request arrives at vLLM and enters the continuous batching scheduler. The scheduler checks if the prompt shares a prefix with cached entries (prefix caching). For the unique portion, it allocates KV cache blocks using PagedAttention's block table. The model runs across GPUs via tensor parallelism or pipeline parallelism, using CUDA graphs for efficient decode steps. If enabled, speculative decoding proposes multiple tokens per step. As requests complete, their KV blocks are reclaimed and new requests are admitted immediately.
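The walkthrough above can be condensed into pseudocode for one scheduler iteration. All names here are illustrative, not vLLM's actual classes or methods:

```
def engine_step(scheduler, kv_cache, model):
    batch = scheduler.admit_waiting_requests()       # continuous batching
    for req in batch.new_requests:
        hit_len = kv_cache.match_cached_prefix(req)  # prefix caching
        kv_cache.allocate_blocks(req, skip=hit_len)  # PagedAttention block table
    tokens = model.forward(batch)                    # TP/PP shards, CUDA graphs;
    for req, tok in zip(batch, tokens):              # speculative decoding may
        req.append(tok)                              # yield several tokens here
        if req.is_finished():
            kv_cache.free_blocks(req)                # blocks reclaimed at once
```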
Key insight: The genius of vLLM is that all these concepts work together in a tight loop. PagedAttention makes memory efficient, continuous batching keeps the GPU busy, prefix caching eliminates redundant work, and CUDA graphs minimize overhead. Each concept amplifies the others.