High-Level Design
vLLM's V1 architecture is a layered system where a top-level API server delegates to an EngineCore that orchestrates scheduling, KV cache management, and GPU execution. The design follows a producer-consumer pattern with shared-memory message queues for minimal overhead.
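The producer-consumer handoff can be sketched in a few lines. vLLM's real implementation uses shared-memory message queues between processes; here a thread-safe `queue.Queue` stands in, and names like `engine_core_loop` are illustrative, not vLLM's API.

```python
# Minimal sketch of the API server <-> EngineCore handoff.
# A thread plus queue.Queue stands in for separate processes with
# shared-memory queues; the names here are hypothetical.
import queue
import threading

request_q: "queue.Queue[dict]" = queue.Queue()   # API server -> EngineCore
output_q: "queue.Queue[dict]" = queue.Queue()    # EngineCore -> API server

def engine_core_loop() -> None:
    """Consume requests, 'execute' them, and produce outputs."""
    while True:
        req = request_q.get()
        if req is None:            # sentinel: shut down
            break
        # Placeholder for schedule -> execute -> sample:
        output_q.put({"request_id": req["request_id"], "token": 42})

core = threading.Thread(target=engine_core_loop)
core.start()
request_q.put({"request_id": "req-0", "prompt": "Hello"})
result = output_q.get()
request_q.put(None)                # stop the core loop
core.join()
print(result)                      # {'request_id': 'req-0', 'token': 42}
```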
[Interactive diagram: vLLM V1 Architecture]
Request Data Flow
A complete request lifecycle flows through six stages. Each stage hands off to the next with minimal overhead, using shared-memory communication between processes.
Request Arrival
The client sends a POST to /v1/chat/completions. The API server validates the payload, tokenizes the prompt, and creates an EngineCoreRequest with a unique ID.
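A hedged sketch of this arrival step: validate the body, tokenize, and wrap everything in a request object with a unique ID. The `EngineCoreRequest` fields and the toy word-level tokenizer below are illustrative, not vLLM's exact schema.

```python
# Sketch of request arrival: validate -> tokenize -> EngineCoreRequest.
# Field names and the toy tokenizer are assumptions for illustration.
import uuid
from dataclasses import dataclass

@dataclass
class EngineCoreRequest:
    request_id: str
    prompt_token_ids: list[int]
    sampling_params: dict

def toy_tokenize(text: str) -> list[int]:
    # Stand-in for the real tokenizer: one "token" per word.
    return [hash(w) % 50_000 for w in text.split()]

def handle_chat_completion(body: dict) -> EngineCoreRequest:
    if "messages" not in body or "model" not in body:
        raise ValueError("invalid request body")
    prompt = " ".join(m["content"] for m in body["messages"])
    return EngineCoreRequest(
        request_id=str(uuid.uuid4()),            # unique per request
        prompt_token_ids=toy_tokenize(prompt),
        sampling_params={"temperature": body.get("temperature", 1.0)},
    )

req = handle_chat_completion({
    "model": "meta-llama/Llama-3-8B",
    "messages": [{"role": "user", "content": "Hello there"}],
})
print(req.request_id, len(req.prompt_token_ids))
```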
Engine Dispatch
The request is sent to the EngineCore process over a ZMQ socket. The scheduler's add_request() places it in the waiting queue.
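The dispatch step amounts to serialize, send, deserialize, enqueue. In this sketch, pickled bytes stand in for the ZMQ message and a deque for the scheduler's waiting queue; the `Scheduler` class here is a toy, not vLLM's.

```python
# Toy dispatch: the API server serializes the request (as a ZMQ send
# would), the EngineCore deserializes it and hands it to the scheduler.
import pickle
from collections import deque

class Scheduler:
    def __init__(self) -> None:
        self.waiting: deque = deque()
        self.running: list = []

    def add_request(self, request: dict) -> None:
        self.waiting.append(request)   # new requests start in 'waiting'

# API-server side: bytes on the wire.
wire_bytes = pickle.dumps({"request_id": "req-0",
                           "prompt_token_ids": [1, 2, 3]})

# EngineCore side: decode and enqueue.
scheduler = Scheduler()
scheduler.add_request(pickle.loads(wire_bytes))
print(len(scheduler.waiting))          # 1
```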
Scheduling
On each step(), the scheduler first schedules running requests, then promotes waiting ones, checking for prefix cache hits, free KV blocks, and remaining token budget.
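The scheduling pass above can be sketched as a loop over two queues, under assumed numbers for block size, free blocks, and token budget (the real values are configuration-dependent):

```python
# Toy scheduling pass: running requests first, then promote waiting
# requests while free KV blocks and token budget remain.
from collections import deque

BLOCK_SIZE = 16          # tokens per KV cache block (assumed)
free_blocks = 4          # KV blocks currently available
token_budget = 64        # max tokens schedulable this step

running: list[dict] = [{"id": "r0", "num_tokens": 1}]  # 1 decode token
waiting = deque([{"id": "r1", "num_tokens": 20},
                 {"id": "r2", "num_tokens": 100}])

scheduled: list[dict] = []

# 1. Running requests get priority: each needs one decode token.
for req in running:
    if token_budget >= 1:
        token_budget -= 1
        scheduled.append(req)

# 2. Promote waiting requests if their prefill fits the remaining
#    budget and the KV blocks they need are free.
while waiting:
    req = waiting[0]
    blocks_needed = -(-req["num_tokens"] // BLOCK_SIZE)  # ceil division
    if req["num_tokens"] > token_budget or blocks_needed > free_blocks:
        break               # head-of-line: stop promoting
    waiting.popleft()
    free_blocks -= blocks_needed
    token_budget -= req["num_tokens"]
    scheduled.append(req)

print([r["id"] for r in scheduled])   # ['r0', 'r1']
```

Prefix caching (not modeled here) would reduce `blocks_needed` for requests whose prompt prefix is already resident in the KV cache.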
Execution
The scheduler output is broadcast to the GPU workers. Each worker builds input tensors and runs the forward pass; multi-GPU deployments use NCCL all-reduce for tensor parallelism.
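Building the input tensors means packing variable-length sequences into one flat batch (continuous batching). Plain lists stand in for GPU tensors in this sketch; the `cu_seqlens` offsets convention is an assumption for illustration.

```python
# Sketch of flattening a scheduled batch into model inputs: one flat
# token list plus cumulative per-sequence offsets.
scheduled = [
    {"id": "r0", "token_ids": [5]},            # decode: 1 new token
    {"id": "r1", "token_ids": [7, 8, 9, 10]},  # prefill: whole prompt
]

flat_tokens: list[int] = []
cu_seqlens = [0]                  # cumulative sequence lengths (offsets)
for req in scheduled:
    flat_tokens.extend(req["token_ids"])
    cu_seqlens.append(len(flat_tokens))

# The forward pass would consume (flat_tokens, cu_seqlens) on the GPU;
# with tensor parallelism, each rank computes a shard and an NCCL
# all-reduce combines the partial results.
print(flat_tokens, cu_seqlens)    # [5, 7, 8, 9, 10] [0, 1, 5]
```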
Sampling
The model runner extracts the logits for the last token of each sequence and samples the next token according to each request's SamplingParams (temperature, top_p, top_k).
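How those parameters shape the choice can be shown with a toy sampler over a tiny vocabulary: scale logits by temperature, keep the top_k, apply nucleus (top_p) filtering, then sample. This is a from-scratch sketch, not vLLM's sampler.

```python
# Toy next-token sampler: temperature scaling, top_k, and top_p
# (nucleus) filtering, in pure Python.
import math
import random

def sample(logits: list[float], temperature: float = 1.0,
           top_k: int = 0, top_p: float = 1.0, seed: int = 0) -> int:
    if temperature == 0:                      # greedy decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    order = sorted(range(len(scaled)), key=lambda i: scaled[i],
                   reverse=True)
    if top_k > 0:
        order = order[:top_k]                 # keep k most likely tokens
    exps = [math.exp(scaled[i]) for i in order]
    total = sum(exps)
    probs = [e / total for e in exps]
    if top_p < 1.0:                           # nucleus filtering
        cum, cut = 0.0, len(probs)
        for j, p in enumerate(probs):
            cum += p
            if cum >= top_p:
                cut = j + 1
                break
        order, probs = order[:cut], probs[:cut]
        total = sum(probs)
        probs = [p / total for p in probs]
    rng = random.Random(seed)
    return rng.choices(order, weights=probs, k=1)[0]

logits = [0.1, 2.0, -1.0, 0.5]
print(sample(logits, temperature=0))          # 1 (argmax -> greedy)
```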
Output
Sampled tokens are returned to the EngineCore. Finished requests are freed and their KV blocks reclaimed; unfinished requests remain running for the next step. Outputs are streamed back to the client.
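The output step can be sketched as: append the sampled token, check the finish condition (EOS or max_tokens), and reclaim KV blocks for finished requests. The plain-list free pool and request dicts below stand in for vLLM's block allocator and request state.

```python
# Toy output handling: finish check plus KV block reclamation.
EOS_TOKEN = 2

free_blocks: list[int] = []     # pool of reclaimed block IDs

requests = {
    "r0": {"output_ids": [4, 7], "blocks": [0, 1], "max_tokens": 8},
    "r1": {"output_ids": [9],    "blocks": [2],    "max_tokens": 8},
}
sampled = {"r0": EOS_TOKEN, "r1": 5}   # this step's sampled tokens

still_running = {}
for req_id, tok in sampled.items():
    req = requests[req_id]
    req["output_ids"].append(tok)
    finished = (tok == EOS_TOKEN
                or len(req["output_ids"]) >= req["max_tokens"])
    if finished:
        free_blocks.extend(req["blocks"])   # reclaim KV cache blocks
    else:
        still_running[req_id] = req         # continue next step
    # In the real server the new token is also streamed to the client.

print(sorted(free_blocks), list(still_running))  # [0, 1] ['r1']
```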