High-Level Design

vLLM's V1 architecture is a layered system where a top-level API server delegates to an EngineCore that orchestrates scheduling, KV cache management, and GPU execution. The design follows a producer-consumer pattern with shared-memory message queues for minimal overhead.
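The delegation chain can be sketched in miniature. All class and method names below are illustrative simplifications of the layering described here, not vLLM's actual API:

```python
# Sketch of the layered delegation: the engine owns a core that runs
# the scheduling loop. (Simplified stand-ins, not vLLM's real classes.)

class EngineCore:
    """Owns the scheduling loop: pick a batch, run it, collect outputs."""
    def __init__(self):
        self.waiting = []  # requests not yet scheduled

    def add_request(self, request):
        self.waiting.append(request)

    def step(self):
        # One iteration: take the waiting batch, "execute" it, emit outputs.
        batch, self.waiting = self.waiting, []
        return [f"token-for-{r}" for r in batch]

class LLMEngine:
    """Manages request lifecycle; delegates all work to the EngineCore."""
    def __init__(self):
        self.core = EngineCore()

    def generate(self, request):
        self.core.add_request(request)
        return self.core.step()

# The API server layer would call into LLMEngine once per HTTP request.
engine = LLMEngine()
print(engine.generate("req-1"))  # ['token-for-req-1']
```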

vLLM V1 Architecture (component overview):

- API Server: OpenAI-compatible HTTP endpoint
- LLMEngine: request lifecycle management
- EngineCore: the scheduling loop
- Scheduler: per-step batch decisions
- KVCacheManager: KV block allocation
- Model Executor: worker dispatch
- GPU Worker: forward pass
- Model Runner: GPU execution

Request Data Flow

A complete request lifecycle flows through six stages. Each stage hands off to the next with minimal overhead, using shared-memory communication between processes.

1. Request Arrival: The client sends a POST to /v1/chat/completions. The API server validates the request, tokenizes the prompt, and creates an EngineCoreRequest with a unique ID.

2. Engine Dispatch: The request is sent to the EngineCore over a ZMQ socket, and the scheduler's add_request() places it in the waiting queue.

3. Scheduling: On each step(), the scheduler processes running requests first, then promotes waiting ones, checking prefix cache hits, available KV blocks, and the token budget.

4. Execution: The scheduler output is broadcast to the GPU workers. Each worker builds its input tensors and runs the forward pass; multi-GPU deployments use NCCL all-reduce for tensor parallelism.

5. Sampling: The model runner extracts the logits for the last token of each sequence and samples the next token according to each request's SamplingParams (temperature, top_p, top_k).

6. Output: Tokens are returned to the EngineCore. Finished requests are freed and their KV blocks reclaimed; unfinished requests continue in the next step. Outputs are streamed back to the client.
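The per-step portion of this lifecycle (stages 3 through 6) can be sketched as a toy loop. Everything below (`step`, `TOKEN_BUDGET`, `EOS`, the request dicts) is an illustrative simplification, not vLLM's actual API:

```python
import random

# Toy sketch of one engine iteration: scheduling, execution, sampling,
# and output handling. Real vLLM batches at token granularity and runs
# a GPU forward pass; here we fake sampling with an RNG.

TOKEN_BUDGET = 4   # max requests batched per step in this toy
EOS = 0            # pretend token ID 0 ends a sequence

def step(running, waiting, vocab_size=50):
    # Stage 3 (scheduling): keep running requests, then promote waiting
    # ones until the budget is exhausted.
    while waiting and len(running) < TOKEN_BUDGET:
        running.append(waiting.pop(0))

    # Stages 4-5 (execution + sampling): a real engine runs a forward
    # pass and samples from the logits per SamplingParams.
    outputs = {}
    for req in running:
        token = random.randrange(vocab_size)
        req["tokens"].append(token)
        outputs[req["id"]] = token

    # Stage 6 (output): free finished requests (EOS or length cap),
    # keep the rest running for the next step.
    running = [r for r in running
               if r["tokens"][-1] != EOS and len(r["tokens"]) < r["max_tokens"]]
    return running, outputs

random.seed(0)
waiting = [{"id": i, "tokens": [], "max_tokens": 3} for i in range(2)]
running = []
for _ in range(3):
    running, outputs = step(running, waiting)
    print(outputs)
```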

Design Decisions

- Separate engine process: The EngineCore runs in its own process to avoid GIL contention with the async API server. Scheduling and KV cache management are CPU-intensive operations that would block the event loop in an async-only design.
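A minimal sketch of this separation, using stdlib multiprocessing in place of vLLM's actual ZMQ transport (names are illustrative):

```python
import multiprocessing as mp

# Run the engine loop in its own process so CPU-heavy scheduling never
# blocks the API server's event loop. (Illustrative only: vLLM uses
# ZMQ sockets and shared memory, not mp.Queue.)

def engine_loop(requests: mp.Queue, results: mp.Queue):
    while True:
        req = requests.get()
        if req is None:            # shutdown sentinel
            break
        # ... schedule, run the forward pass, sample ...
        results.put(f"output-for-{req}")

if __name__ == "__main__":
    requests, results = mp.Queue(), mp.Queue()
    proc = mp.Process(target=engine_loop, args=(requests, results))
    proc.start()
    requests.put("req-1")
    print(results.get())           # output-for-req-1
    requests.put(None)
    proc.join()
```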
- Shared-memory over gRPC: Communication between processes uses ZMQ sockets and shared memory rather than gRPC or REST. This reduces serialization overhead, which matters when the engine loop runs at 100+ iterations per second.
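The idea can be demonstrated with the stdlib's shared-memory primitive: the consumer attaches to the same segment by name and reads raw bytes, with no serialize/deserialize round trip. This is only a sketch of the concept; vLLM's actual shared-memory message queue is more involved:

```python
from multiprocessing import shared_memory
import array

# Pass token IDs through a shared-memory segment instead of a socket.
# (Conceptual sketch; not vLLM's MessageQueue implementation.)

shm = shared_memory.SharedMemory(create=True, size=1024)
try:
    # Producer side: write token IDs directly into the segment.
    tokens = array.array("i", [101, 2009, 2003, 102])
    nbytes = len(tokens) * tokens.itemsize
    shm.buf[:nbytes] = tokens.tobytes()

    # Consumer side: attach by name and read in place -- no payload
    # copied through a socket, no protobuf encode/decode.
    view = shared_memory.SharedMemory(name=shm.name)
    out = array.array("i")
    out.frombytes(bytes(view.buf[:nbytes]))
    print(out.tolist())  # [101, 2009, 2003, 102]
    view.close()
finally:
    shm.close()
    shm.unlink()
```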
- Flattened sequences: All sequences in a batch are concatenated into one "super sequence", with position indices ensuring each sequence only attends to its own tokens. This eliminates wasted computation on padding tokens.
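The flattened layout looks like this in miniature: one concatenated token array, per-sequence position indices that restart at zero, and cumulative boundaries (the `cu_seqlens` convention used by varlen attention kernels). Variable names are illustrative:

```python
# Flatten a batch of three sequences into one "super sequence".
seqs = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]

flat_tokens, positions, cu_seqlens = [], [], [0]
for seq in seqs:
    flat_tokens.extend(seq)
    positions.extend(range(len(seq)))            # positions restart per sequence
    cu_seqlens.append(cu_seqlens[-1] + len(seq)) # cumulative boundaries

print(flat_tokens)  # [11, 12, 13, 21, 22, 31, 32, 33, 34]
print(positions)    # [0, 1, 2, 0, 1, 0, 1, 2, 3]
print(cu_seqlens)   # [0, 3, 5, 9] -- the kernel masks attention at these boundaries
```

Note there is no padding anywhere: a batch of lengths 3, 2, and 4 costs exactly 9 token slots, not 3 x 4 = 12.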
- Block table indirection: Rather than allocating contiguous memory per sequence, vLLM uses a block table (analogous to an OS page table) to map logical token positions to physical block locations. This single decision enables PagedAttention's near-zero fragmentation.
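A toy version of the indirection, assuming a 4-token block size for readability (vLLM's default is larger); function and variable names are illustrative:

```python
BLOCK_SIZE = 4  # tokens per KV block (toy value for illustration)

# Toy block table: maps logical token positions to physical block slots.
# Physical blocks come from a free list in any order, so a sequence's
# KV cache need not occupy contiguous memory.

free_blocks = list(range(8))   # pool of physical block IDs
block_table = []               # this sequence's logical -> physical map

def ensure_capacity(num_tokens):
    """Grow the table one block at a time as the sequence lengthens."""
    while len(block_table) * BLOCK_SIZE < num_tokens:
        block_table.append(free_blocks.pop())  # grab any free block

def physical_slot(logical_pos):
    """Translate a logical token position to a physical cache slot."""
    block = block_table[logical_pos // BLOCK_SIZE]
    return block * BLOCK_SIZE + logical_pos % BLOCK_SIZE

ensure_capacity(6)             # sequence grows to 6 tokens -> needs 2 blocks
print(block_table)             # [7, 6] -- non-contiguous physical blocks
print(physical_slot(5))        # 25: token 5 lives in block 6, offset 1
```

Because blocks are allocated on demand, the only waste is the unused tail of a sequence's last block, which is what makes fragmentation near zero.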