PagedAttention Memory Management
PagedAttention applies OS virtual memory paging to the KV cache. Traditional systems allocate one large contiguous chunk per request, wasting 60-80% of memory to fragmentation. PagedAttention breaks the KV cache into fixed-size blocks (16 tokens each), maps them via a block table, and achieves under 4% waste.
Block Division
The KV cache is divided into fixed-size blocks, each storing the keys and values for 16 tokens. At roughly 800KB per token for a 13B model, each block is about 12.8MB.
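The per-token figure can be sanity-checked with back-of-the-envelope arithmetic. A minimal sketch, assuming a typical 13B-class configuration (40 layers, hidden size 5120, fp16 values) -- these config numbers are illustrative assumptions, not taken from the text above:

```python
# Back-of-the-envelope KV-cache sizing (assumed 13B-class config:
# 40 layers, hidden size 5120, fp16 -> 2 bytes per value).
NUM_LAYERS = 40
HIDDEN_SIZE = 5120
BYTES_PER_VALUE = 2     # fp16
BLOCK_SIZE = 16         # tokens per block

# Each layer stores one key vector and one value vector per token.
bytes_per_token = 2 * NUM_LAYERS * HIDDEN_SIZE * BYTES_PER_VALUE
kb_per_block = bytes_per_token // 1024 * BLOCK_SIZE

print(bytes_per_token // 1024, "KB per token")   # 800 KB per token
print(kb_per_block, "KB per block")              # 12800 KB, i.e. ~12.8 MB
```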
Block Table Mapping
A block table maps each request's logical token positions to physical block locations -- like a page table mapping virtual to physical addresses.
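The translation step works like a page-table walk. A toy sketch (the block-table contents are hypothetical, and this is not vLLM's actual API):

```python
# Toy block-table lookup, mimicking virtual-to-physical address translation.
BLOCK_SIZE = 16  # tokens per block

# Hypothetical block table for one request:
# index = logical block number, value = physical block id.
# Note the physical blocks are neither contiguous nor ordered.
block_table = [7, 1, 12]

def locate(token_pos: int) -> tuple[int, int]:
    """Translate a logical token position to (physical block, slot in block)."""
    logical_block, slot = divmod(token_pos, BLOCK_SIZE)
    return block_table[logical_block], slot

print(locate(0))   # (7, 0)  -- token 0 lives in physical block 7
print(locate(20))  # (1, 4)  -- token 20 is slot 4 of the second logical block
```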
Dynamic Allocation
New blocks are allocated from a free list on demand, and they need not be physically contiguous. Only a request's last block can have internal fragmentation (at most 15 wasted token slots).
Immediate Reclamation
When a request finishes, its blocks return to the free list instantly. New requests can use them on the very next scheduling step.
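The allocation and reclamation steps above can be sketched as a small allocator. This is a hypothetical illustration of the free-list discipline, not vLLM's implementation:

```python
# Minimal sketch of a paged KV-cache block allocator: blocks are handed
# out from a free list on demand and returned the moment a request finishes.
class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_list = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # request id -> physical blocks

    def append_block(self, request_id: str) -> int:
        """Grab any free block; physical contiguity is not required."""
        if not self.free_list:
            raise MemoryError("no free KV blocks")
        block = self.free_list.pop()
        self.block_tables.setdefault(request_id, []).append(block)
        return block

    def free_request(self, request_id: str) -> None:
        """On completion, all of the request's blocks return to the free list."""
        self.free_list.extend(self.block_tables.pop(request_id, []))

alloc = BlockAllocator(num_blocks=4)
alloc.append_block("req-A")
alloc.append_block("req-A")
alloc.free_request("req-A")     # both blocks reusable on the next scheduling step
print(len(alloc.free_list))     # 4
```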
Speculative Decoding Pipeline
Speculative decoding breaks the autoregressive bottleneck by allowing multiple tokens to be verified in parallel.
Draft Proposal
A lightweight draft mechanism quickly proposes k candidate tokens -- EAGLE and Medusa heads operate on the target model's hidden states, while an n-gram predictor simply matches patterns in the existing text.
Batch Verification
The full target model runs a single forward pass over all k candidates simultaneously -- efficient because it processes tokens in parallel, not sequentially.
Accept/Reject
Each draft token is compared against the target model's output. Correct guesses are accepted; at the first mismatch, the target model's own token is substituted and the remaining drafts are discarded.
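For greedy decoding, the accept/reject rule reduces to a prefix match. A simplified sketch (production systems sampling from a distribution use a rejection-sampling rule that preserves the target distribution; that is omitted here):

```python
# Greedy accept/reject sketch. target_tokens[i] is the target model's
# choice at position i, all computed in one batched forward pass.
def verify(draft_tokens: list[int], target_tokens: list[int]) -> list[int]:
    accepted = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft == target:
            accepted.append(draft)   # correct guess: keep it
        else:
            accepted.append(target)  # first mismatch: use the target's token
            break                    # discard the remaining drafts
    return accepted

print(verify([5, 9, 3, 3], [5, 9, 7, 2]))  # [5, 9, 7]
```

When every draft is accepted, the same target pass also yields one extra token for free; that bonus token is omitted here for brevity.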
Repeat
The process repeats from the last accepted token. With high acceptance rates (>70%), this can generate 2-3x faster than standard autoregressive decoding.
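The speedup claim can be roughed out with a simple expectation, under the simplifying assumption that each draft token is accepted independently with probability p:

```python
# Expected tokens emitted per target forward pass, assuming independent
# per-token acceptance probability p (a simplification of real behavior).
def tokens_per_step(p: float, k: int) -> float:
    # One token always comes from the target pass (a correction or a bonus),
    # plus the expected run of accepted drafts: 1 + p + p^2 + ... + p^k.
    return sum(p ** i for i in range(k + 1))

print(round(tokens_per_step(0.7, 4), 2))  # 2.77 -- consistent with the 2-3x range
```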
Performance Characteristics
How vLLM compares to baselines on key performance dimensions. These are representative ranges -- actual numbers depend on model, hardware, and workload.