PagedAttention Memory Management

PagedAttention applies OS virtual memory paging to the KV cache. Traditional systems allocate one large contiguous chunk per request, wasting 60-80% of memory to fragmentation. PagedAttention breaks the KV cache into fixed-size blocks (16 tokens each), maps them via a block table, and achieves under 4% waste.

1. Block Division

The KV cache is divided into fixed-size blocks, each storing the keys and values for 16 tokens. At roughly 800 KB per token for a 13B model, each block is about 12.8 MB.
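These sizes can be checked with a quick back-of-envelope calculation; this is a sketch assuming OPT-13B-like dimensions (40 layers, hidden size 5120, fp16), which are assumptions, not figures from this document:

```python
# Back-of-envelope KV-cache sizing (assumed OPT-13B-like dimensions).
NUM_LAYERS = 40      # transformer layers
HIDDEN_SIZE = 5120   # model hidden dimension
BYTES_PER_ELEM = 2   # fp16
BLOCK_SIZE = 16      # tokens per PagedAttention block

# One K vector and one V vector per layer, per token:
kv_bytes_per_token = 2 * NUM_LAYERS * HIDDEN_SIZE * BYTES_PER_ELEM
block_kb = kv_bytes_per_token // 1024 * BLOCK_SIZE

print(kv_bytes_per_token // 1024, "KB per token")   # 800 KB per token
print(block_kb / 1000, "MB per 16-token block")     # 12.8 MB per block
```

Real deployments vary with attention variants (e.g. grouped-query attention shrinks the KV dimension), but the structure of the calculation is the same.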

2. Block Table Mapping

A block table maps each request's logical token positions to physical block locations -- like a page table mapping virtual to physical addresses.

3. Dynamic Allocation

New blocks are allocated from a free list on demand, and they don't need to be contiguous. Only the last block of a request can have internal fragmentation (at most 15 wasted token slots).

4. Immediate Reclamation

When a request finishes, its blocks return to the free list instantly. New requests can use them on the very next scheduling step.
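The four steps above can be sketched as a toy block manager. This is illustrative only; `BlockManager`, `append_token`, and the other names here are hypothetical, not vLLM's actual API:

```python
# Toy PagedAttention-style block management: a free list of physical blocks
# plus a per-request block table mapping logical positions to physical blocks.
class BlockManager:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free list of physical block ids
        self.tables = {}                      # request id -> block table
        self.lengths = {}                     # request id -> tokens written

    def append_token(self, req_id):
        """Map one more logical token; allocate a block only when the last one is full."""
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:          # first token, or last block just filled
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1

    def physical_slot(self, req_id, pos):
        """Translate a logical token position to (physical block, offset) -- a page walk."""
        table = self.tables[req_id]
        return table[pos // self.block_size], pos % self.block_size

    def finish(self, req_id):
        """Return all of a finished request's blocks to the free list immediately."""
        self.free.extend(self.tables.pop(req_id))
        self.lengths.pop(req_id)
```

For example, appending 20 tokens for one request allocates exactly two blocks (16 + 4 slots), and `finish` makes both available to other requests on the next scheduling step.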

Speculative Decoding Pipeline

Speculative decoding breaks the autoregressive bottleneck by allowing multiple tokens to be verified in parallel.

1. Draft Proposal

A lightweight draft mechanism generates k candidate tokens cheaply: EAGLE and Medusa condition small draft heads on the target model's hidden states, while an n-gram predictor needs no extra model at all.

2. Batch Verification

The full target model verifies all k candidates in a single forward pass. This is efficient because decoding at batch size one is memory-bandwidth-bound, so one pass over k tokens costs far less than k sequential decode steps.

3. Accept/Reject

Each draft token is compared against the target model's output. Correct guesses are accepted; at the first mismatch, the target model's own token is used and the remaining drafts are discarded.

4. Repeat

Process repeats from the last accepted token. With high acceptance rates (>70%), this can generate 2-3x faster than standard autoregressive decoding.
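The accept/reject step above can be sketched for the greedy case. This is a simplification: production systems use a probabilistic rejection-sampling rule over token distributions rather than exact matching, and `verify` is a hypothetical helper, not a real vLLM function:

```python
# Greedy-case speculative verification: accept the matching prefix of the drafts,
# then append the target model's token at the first mismatch (or a bonus token).
def verify(draft_tokens, target_tokens):
    """draft_tokens: k tokens proposed by the draft model.
    target_tokens: k+1 tokens the target model predicts (one per position,
    from a single batched forward pass over the drafts).
    Returns the accepted tokens for this step."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)   # draft guessed correctly: keep it
        else:
            accepted.append(t)   # first mismatch: take the target's token,
            return accepted      # discard all later drafts
    # Every draft accepted: the target's extra prediction is a free bonus token.
    accepted.append(target_tokens[len(draft_tokens)])
    return accepted
```

Note the guaranteed floor: even when every draft is rejected, each step still emits at least one target-model token, so quality matches plain autoregressive decoding and only speed varies.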

⚠️
Trade-off: Speculative decoding reduces latency but consumes GPU compute for the draft model. If the acceptance rate is low (<50%), the overhead can actually slow things down. Best for latency-sensitive workloads where the draft model matches the target model's distribution well.

Performance Characteristics

How vLLM compares to baselines on key performance dimensions. These are representative ranges -- actual numbers depend on model, hardware, and workload.

Throughput Gain: 2-4x
Memory Efficiency: 96%+
V1 Engine Improvement: +24%
Spec Decode Speedup: 2-3x
CUDA Graph Gain: 10-30%
💡
Key insight: The 2-4x throughput gain comes almost entirely from PagedAttention's memory efficiency. By eliminating fragmentation, vLLM can fit more concurrent requests in the same GPU memory, and continuous batching ensures those requests keep the GPU fully utilized.