High-Level Design

vLLM's V1 architecture is a layered system where a top-level API server delegates to an EngineCore that orchestrates scheduling, KV cache management, and GPU execution. The design follows a producer-consumer pattern with shared-memory message queues for minimal overhead.
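The delegation chain can be sketched in miniature. All class and method names below are illustrative simplifications of the layering described here, not vLLM's actual API:

```python
# Sketch of the layered delegation: the engine owns a core that runs
# the scheduling loop. (Simplified stand-ins, not vLLM's real classes.)

class EngineCore:
    """Owns the scheduling loop: pick a batch, run it, collect outputs."""
    def __init__(self):
        self.waiting = []  # requests not yet scheduled

    def add_request(self, request):
        self.waiting.append(request)

    def step(self):
        # One iteration: take the waiting batch, "execute" it, emit outputs.
        batch, self.waiting = self.waiting, []
        return [f"token-for-{r}" for r in batch]

class LLMEngine:
    """Manages request lifecycle; delegates all work to the EngineCore."""
    def __init__(self):
        self.core = EngineCore()

    def generate(self, request):
        self.core.add_request(request)
        return self.core.step()

# The API server layer would call into LLMEngine once per HTTP request.
engine = LLMEngine()
print(engine.generate("req-1"))  # ['token-for-req-1']
```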

vLLM V1 Architecture (component overview):

- API Server: OpenAI-compatible HTTP endpoint
- LLMEngine: request lifecycle management
- EngineCore: the scheduling loop
- Scheduler: per-step batch decisions
- KVCacheManager: KV block allocation
- Model Executor: worker dispatch
- GPU Worker: forward pass
- Model Runner: GPU execution

Request Data Flow

A complete request lifecycle flows through six stages. Each stage hands off to the next with minimal overhead, using shared-memory communication between processes.

1. Request Arrival: The client sends a POST to /v1/chat/completions. The API server validates the request, tokenizes the prompt, and creates an EngineCoreRequest with a unique ID.

2. Engine Dispatch: The request is sent to the EngineCore over a ZMQ socket, and the scheduler's add_request() places it in the waiting queue.

3. Scheduling: On each step(), the scheduler processes running requests first, then promotes waiting ones, checking prefix cache hits, available KV blocks, and the token budget.

4. Execution: The scheduler output is broadcast to the GPU workers. Each worker builds its input tensors and runs the forward pass; multi-GPU deployments use NCCL all-reduce for tensor parallelism.

5. Sampling: The model runner extracts the logits for the last token of each sequence and samples the next token according to each request's SamplingParams (temperature, top_p, top_k).

6. Output: Tokens are returned to the EngineCore. Finished requests are freed and their KV blocks reclaimed; unfinished requests continue in the next step. Outputs are streamed back to the client.
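The per-step portion of this lifecycle (stages 3 through 6) can be sketched as a toy loop. Everything below (`step`, `TOKEN_BUDGET`, `EOS`, the request dicts) is an illustrative simplification, not vLLM's actual API:

```python
import random

# Toy sketch of one engine iteration: scheduling, execution, sampling,
# and output handling. Real vLLM batches at token granularity and runs
# a GPU forward pass; here we fake sampling with an RNG.

TOKEN_BUDGET = 4   # max requests batched per step in this toy
EOS = 0            # pretend token ID 0 ends a sequence

def step(running, waiting, vocab_size=50):
    # Stage 3 (scheduling): keep running requests, then promote waiting
    # ones until the budget is exhausted.
    while waiting and len(running) < TOKEN_BUDGET:
        running.append(waiting.pop(0))

    # Stages 4-5 (execution + sampling): a real engine runs a forward
    # pass and samples from the logits per SamplingParams.
    outputs = {}
    for req in running:
        token = random.randrange(vocab_size)
        req["tokens"].append(token)
        outputs[req["id"]] = token

    # Stage 6 (output): free finished requests (EOS or length cap),
    # keep the rest running for the next step.
    running = [r for r in running
               if r["tokens"][-1] != EOS and len(r["tokens"]) < r["max_tokens"]]
    return running, outputs

random.seed(0)
waiting = [{"id": i, "tokens": [], "max_tokens": 3} for i in range(2)]
running = []
for _ in range(3):
    running, outputs = step(running, waiting)
    print(outputs)
```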

Design Decisions

- Separate engine process: The EngineCore runs in its own process to avoid GIL contention with the async API server. Scheduling and KV cache management are CPU-intensive operations that would block the event loop in an async-only design.
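A minimal sketch of this separation, using stdlib multiprocessing in place of vLLM's actual ZMQ transport (names are illustrative):

```python
import multiprocessing as mp

# Run the engine loop in its own process so CPU-heavy scheduling never
# blocks the API server's event loop. (Illustrative only: vLLM uses
# ZMQ sockets and shared memory, not mp.Queue.)

def engine_loop(requests: mp.Queue, results: mp.Queue):
    while True:
        req = requests.get()
        if req is None:            # shutdown sentinel
            break
        # ... schedule, run the forward pass, sample ...
        results.put(f"output-for-{req}")

if __name__ == "__main__":
    requests, results = mp.Queue(), mp.Queue()
    proc = mp.Process(target=engine_loop, args=(requests, results))
    proc.start()
    requests.put("req-1")
    print(results.get())           # output-for-req-1
    requests.put(None)
    proc.join()
```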
- Shared-memory over gRPC: Communication between processes uses ZMQ sockets and shared memory rather than gRPC or REST. This reduces serialization overhead, which matters when the engine loop runs at 100+ iterations per second.
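The idea can be demonstrated with the stdlib's shared-memory primitive: the consumer attaches to the same segment by name and reads raw bytes, with no serialize/deserialize round trip. This is only a sketch of the concept; vLLM's actual shared-memory message queue is more involved:

```python
from multiprocessing import shared_memory
import array

# Pass token IDs through a shared-memory segment instead of a socket.
# (Conceptual sketch; not vLLM's MessageQueue implementation.)

shm = shared_memory.SharedMemory(create=True, size=1024)
try:
    # Producer side: write token IDs directly into the segment.
    tokens = array.array("i", [101, 2009, 2003, 102])
    nbytes = len(tokens) * tokens.itemsize
    shm.buf[:nbytes] = tokens.tobytes()

    # Consumer side: attach by name and read in place -- no payload
    # copied through a socket, no protobuf encode/decode.
    view = shared_memory.SharedMemory(name=shm.name)
    out = array.array("i")
    out.frombytes(bytes(view.buf[:nbytes]))
    print(out.tolist())  # [101, 2009, 2003, 102]
    view.close()
finally:
    shm.close()
    shm.unlink()
```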
- Flattened sequences: All sequences in a batch are concatenated into one "super sequence", with position indices ensuring each sequence only attends to its own tokens. This eliminates wasted computation on padding tokens.
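The flattened layout looks like this in miniature: one concatenated token array, per-sequence position indices that restart at zero, and cumulative boundaries (the `cu_seqlens` convention used by varlen attention kernels). Variable names are illustrative:

```python
# Flatten a batch of three sequences into one "super sequence".
seqs = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]

flat_tokens, positions, cu_seqlens = [], [], [0]
for seq in seqs:
    flat_tokens.extend(seq)
    positions.extend(range(len(seq)))            # positions restart per sequence
    cu_seqlens.append(cu_seqlens[-1] + len(seq)) # cumulative boundaries

print(flat_tokens)  # [11, 12, 13, 21, 22, 31, 32, 33, 34]
print(positions)    # [0, 1, 2, 0, 1, 0, 1, 2, 3]
print(cu_seqlens)   # [0, 3, 5, 9] -- the kernel masks attention at these boundaries
```

Note there is no padding anywhere: a batch of lengths 3, 2, and 4 costs exactly 9 token slots, not 3 x 4 = 12.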
- Block table indirection: Rather than allocating contiguous memory per sequence, vLLM uses a block table (analogous to an OS page table) to map logical token positions to physical block locations. This single decision enables PagedAttention's near-zero fragmentation.
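A toy version of the indirection, assuming a 4-token block size for readability (vLLM's default is larger); function and variable names are illustrative:

```python
BLOCK_SIZE = 4  # tokens per KV block (toy value for illustration)

# Toy block table: maps logical token positions to physical block slots.
# Physical blocks come from a free list in any order, so a sequence's
# KV cache need not occupy contiguous memory.

free_blocks = list(range(8))   # pool of physical block IDs
block_table = []               # this sequence's logical -> physical map

def ensure_capacity(num_tokens):
    """Grow the table one block at a time as the sequence lengthens."""
    while len(block_table) * BLOCK_SIZE < num_tokens:
        block_table.append(free_blocks.pop())  # grab any free block

def physical_slot(logical_pos):
    """Translate a logical token position to a physical cache slot."""
    block = block_table[logical_pos // BLOCK_SIZE]
    return block * BLOCK_SIZE + logical_pos % BLOCK_SIZE

ensure_capacity(6)             # sequence grows to 6 tokens -> needs 2 blocks
print(block_table)             # [7, 6] -- non-contiguous physical blocks
print(physical_slot(5))        # 25: token 5 lives in block 6, offset 1
```

Because blocks are allocated on demand, the only waste is the unused tail of a sequence's last block, which is what makes fragmentation near zero.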