📂 Source Files Referenced

- vllm/v1/core/block_pool.py (BlockPool)
- vllm/v1/core/kv_cache_manager.py (KVCacheManager)
- vllm/v1/core/sched/scheduler.py (Scheduler)
- vllm/v1/executor/multiproc_executor.py (MultiprocExecutor)
- vllm/v1/spec_decode/eagle.py (EAGLE Proposer)
- vllm/v1/engine/core.py (EngineCore)
Getting Started

```bash
# Install vLLM (requires Python 3.9+ and CUDA 12.1+)
pip install vllm

# Start an OpenAI-compatible API server
vllm serve meta-llama/Llama-3.1-8B-Instruct

# With tensor parallelism and prefix caching
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --enable-prefix-caching
```
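Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch, assuming the 8B server above is running on the default port 8000 (the `api_key` value is a placeholder -- vLLM only checks it if you pass `--api-key`):

```python
# Query the vLLM OpenAI-compatible server with the official openai client
# (`pip install openai`). Assumes the server from the commands above.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's default listen address
    api_key="EMPTY",                      # placeholder; unused without --api-key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user",
               "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```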
Source Code Walkthrough
The following annotated source excerpts show how vLLM's core concepts are implemented. Each block maps to a concept from the Core Concepts page.
PagedAttention -- BlockPool
The BlockPool class manages fixed-size KV cache blocks using a free list and hash table -- the foundation of PagedAttention.
vllm/v1/core/block_pool.py

```python
class BlockPool:
    """Pool of KV cache blocks with prefix caching support."""

    def __init__(self, num_gpu_blocks, enable_caching,
                 hash_block_size, ...):
        self.num_gpu_blocks = num_gpu_blocks
        self.enable_caching = enable_caching

        # All KV cache blocks -- one per physical GPU block
        self.blocks = [
            KVCacheBlock(block_id=i)
            for i in range(num_gpu_blocks)
        ]

        # Free block queue: doubly-linked list for O(1) alloc
        self.free_block_queue = FreeKVCacheBlockQueue(
            self.blocks
        )

        # Hash table for prefix cache lookups
        self.cached_block_hash_to_block = BlockHashToBlockMap()

        # Null block: reserved placeholder (never evicted)
        self.null_block = self.blocks[0]
        self.null_block.ref_cnt = 1
```
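To make the mechanics concrete, here is a deliberately simplified, self-contained sketch of the same idea -- a pool of fixed-size blocks, a free queue for O(1) allocation, and a hash table for prefix-cache hits. All names (`SimpleBlockPool`, `Block`, and their methods) are illustrative, not vLLM's API:

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class Block:
    block_id: int
    ref_cnt: int = 0
    content_hash: Optional[int] = None  # set once the block is cached

class SimpleBlockPool:
    """Toy block pool: free queue plus prefix-cache hash table."""

    def __init__(self, num_blocks: int):
        self.blocks = [Block(block_id=i) for i in range(num_blocks)]
        self.free_queue = deque(self.blocks)  # O(1) pop/append
        self.cache = {}                       # content hash -> Block

    def allocate(self, content_hash: int) -> Block:
        # Prefix-cache hit: share the existing block, bump its refcount.
        if content_hash in self.cache:
            block = self.cache[content_hash]
            if block.ref_cnt == 0:
                self.free_queue.remove(block)  # it was idle on the free queue
            block.ref_cnt += 1
            return block
        # Miss: take a block off the free queue, evicting its stale
        # cache entry if it used to hold other content.
        block = self.free_queue.popleft()
        if block.content_hash is not None:
            del self.cache[block.content_hash]
        block.ref_cnt = 1
        block.content_hash = content_hash
        self.cache[content_hash] = block
        return block

    def free(self, block: Block) -> None:
        # A block returns to the free queue only when no request holds it.
        # Its cache entry stays, so later requests can still hit it until
        # the block is actually reused (vLLM evicts in LRU order).
        block.ref_cnt -= 1
        if block.ref_cnt == 0:
            self.free_queue.append(block)
```

Reference counting is what lets two requests that share a prompt prefix point at the same physical block: the second request's `allocate` is a cache hit that costs no additional GPU memory.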
Continuous Batching -- Scheduler
The Scheduler maintains separate queues for waiting and running requests, promoting waiting requests into the running set whenever capacity frees up so the GPU stays fully utilized.
vllm/v1/core/sched/scheduler.py

```python
class Scheduler:
    def __init__(self, scheduler_config, ...):
        # Request queues: waiting and running
        self.waiting = create_request_queue(self.policy)
        self.running: list[Request] = []

        # Resource constraints
        self.max_num_running_reqs = (
            scheduler_config.max_num_seqs
        )
        self.max_num_scheduled_tokens = (
            scheduler_config.max_num_batched_tokens
        )

        # KV cache manager for block allocation
        self.kv_cache_manager = KVCacheManager(
            kv_cache_config=kv_cache_config,
            enable_caching=cache_config.enable_prefix_caching,
        )
```
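The real `schedule()` is long; the toy sketch below shows the core continuous-batching decision it makes on every step -- spend the per-step token budget on running requests first, then promote waiting requests (chunking long prompts to the remaining budget) until tokens or slots run out. All names here are hypothetical:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Req:
    req_id: str
    remaining_prompt: int  # >0 while the request is still prefilling

class ToyScheduler:
    """Toy continuous-batching step under a per-step token budget."""

    def __init__(self, max_batched_tokens: int, max_running: int):
        self.waiting = deque()
        self.running = []
        self.max_batched_tokens = max_batched_tokens
        self.max_running = max_running

    def schedule(self) -> dict:
        budget = self.max_batched_tokens
        scheduled = {}  # req_id -> tokens to process this step

        # 1. Running requests first.
        for req in self.running:
            if budget == 0:
                break
            if req.remaining_prompt > 0:   # finish a chunked prefill
                tokens = min(req.remaining_prompt, budget)
                req.remaining_prompt -= tokens
            else:                          # ordinary decode step
                tokens = 1
            scheduled[req.req_id] = tokens
            budget -= tokens

        # 2. Promote waiting requests while budget and slots remain;
        #    long prompts are chunked to whatever budget is left.
        while (self.waiting and budget > 0
               and len(self.running) < self.max_running):
            req = self.waiting.popleft()
            chunk = min(req.remaining_prompt, budget)
            scheduled[req.req_id] = chunk
            req.remaining_prompt -= chunk
            budget -= chunk
            self.running.append(req)

        return scheduled
```

Because new prefills can join on the very next step after slots free up, the GPU never idles waiting for a whole batch to finish -- that is the essence of continuous batching.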
Tensor Parallelism -- MultiprocExecutor
The MultiprocExecutor spawns one worker process per GPU rank and uses shared-memory message queues for low-latency RPC dispatch.
vllm/v1/executor/multiproc_executor.py

```python
class MultiprocExecutor(Executor):
    supports_pp: bool = True

    def __init__(self, vllm_config, ...):
        tp_size, pp_size, pcp_size = (
            self._get_parallel_sizes()
        )
        assert self.world_size == (
            tp_size * pp_size * pcp_size
        )

        # Shared-memory broadcast queue
        self.rpc_broadcast_mq = MessageQueue(
            tp_size * pp_size * pcp_size
        )

        # Spawn one worker per GPU rank
        for rank in range(self.local_world_size):
            worker = WorkerProc.make_worker_process(
                vllm_config=vllm_config,
                local_rank=rank,
                rank=rank,
            )
            self.workers.append(worker)
```
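Stripped of vLLM specifics, the pattern is plain process-per-rank dispatch: broadcast an RPC to every worker, gather one result per rank. A minimal sketch with standard `multiprocessing` queues -- vLLM uses shared-memory ring buffers for lower latency, but the control flow is similar (all names hypothetical):

```python
import multiprocessing as mp

def worker_loop(rank: int, task_q, result_q) -> None:
    # A real worker would pin GPU `rank` (torch.cuda.set_device) and
    # load its model shard here; we just echo the RPC for illustration.
    while True:
        method, args = task_q.get()
        if method == "shutdown":
            break
        result_q.put((rank, f"{method}{args} done on rank {rank}"))

if __name__ == "__main__":
    world_size = 4
    ctx = mp.get_context("spawn")  # CUDA requires spawn, not fork
    task_qs = [ctx.Queue() for _ in range(world_size)]
    result_q = ctx.Queue()
    workers = [
        ctx.Process(target=worker_loop, args=(r, task_qs[r], result_q))
        for r in range(world_size)
    ]
    for w in workers:
        w.start()

    # Broadcast one RPC to every rank, then gather all replies.
    for q in task_qs:
        q.put(("execute_model", ("batch_0",)))
    print(sorted(result_q.get() for _ in range(world_size)))

    for q in task_qs:
        q.put(("shutdown", ()))
    for w in workers:
        w.join()
```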
Speculative Decoding -- EAGLE Proposer
The EAGLE proposer reuses the target model's hidden states, feeding them to a lightweight draft head that predicts draft tokens cheaply.
vllm/v1/spec_decode/eagle.py

```python
class SpecDecodeBaseProposer:
    def _greedy_sample(self, hidden_states):
        """Greedy-sample draft tokens from hidden states."""
        if self.use_local_argmax_reduction:
            return self.model.get_top_tokens(
                hidden_states
            )
        return self.model.compute_logits(
            hidden_states
        ).argmax(dim=-1)

    def propose(self, target_token_ids,
                target_hidden_states, ...):
        # First pass: run draft model
        self.set_inputs_first_pass(...)
        hidden = self.model.forward(
            self.draft_token_ids, ...
        )
        draft_tokens = self._greedy_sample(hidden)

        # Iteratively generate remaining tokens
        for i in range(1, self.num_speculative_tokens):
            self.set_inputs_subsequent_pass(
                draft_tokens, i
            )
            hidden = self.model.forward(...)
            draft_tokens = self._greedy_sample(hidden)
```
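Proposing drafts is only half of speculative decoding -- the engine must still verify them against the target model, accepting the longest matching prefix and fixing the first mismatch. A greedy sketch of one full propose-and-verify cycle, with `draft` and `target` as hypothetical stand-in callables:

```python
from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft: Callable[[List[int]], int],   # cheap model: prefix -> next token
    target: Callable[[List[int]], int],  # full model:  prefix -> next token
    num_speculative_tokens: int,
) -> List[int]:
    # 1. Propose: the draft model guesses k tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(num_speculative_tokens):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. Verify: the target model checks each position (in practice all
    #    k positions are scored in a single batched forward pass).
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            accepted.append(expected)  # target's token replaces the draft's
            break
        accepted.append(tok)
        ctx.append(tok)
    else:
        accepted.append(target(ctx))   # bonus token: every draft accepted

    return accepted
```

Every accepted draft token is one target-model decode step saved, and even a fully rejected draft still yields one valid token -- correctness never depends on draft quality.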
Engine Core -- Main Loop
The step() method orchestrates one iteration: schedule, execute, sample, update.
vllm/v1/engine/core.py

```python
def step(self) -> EngineCoreOutputs:
    """Execute one iteration of the engine loop."""
    if not self.scheduler.has_requests():
        return EngineCoreOutputs.empty()

    # Schedule: choose requests and token budgets for this step
    scheduler_output = self.scheduler.schedule()

    # Execute: launch the forward pass without blocking, so the
    # CPU-side grammar bitmask work below overlaps the GPU
    executor_future = self.model_executor.execute_model(
        scheduler_output, non_block=True
    )
    grammar_bitmask = (
        self.scheduler.get_grammar_bitmask()
    )

    # Wait for sampled tokens, then apply any pending aborts
    model_output = executor_future.result()
    self._process_aborts_queue()

    # Update: record sampled tokens and detect finished requests
    outputs = self.scheduler.update_from_output(
        model_output
    )
    return outputs
```
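The offline `LLM` API drives this loop for you: `generate()` keeps stepping the engine until every request finishes. Minimal usage:

```python
# Offline inference that exercises the step() loop shown above.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["The key idea behind PagedAttention is"], params)
for out in outputs:
    print(out.outputs[0].text)
```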
Production tip: Monitor gpu_cache_usage_perc at the /metrics endpoint. When it consistently hits 100%, reduce --max-model-len or add GPU capacity. The vLLM Production Stack provides Helm charts with built-in Prometheus integration.
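To spot-check the metric without a full Prometheus stack, you can scrape the endpoint directly. A small sketch, assuming a server on localhost:8000 (depending on version the metric name may carry a `vllm:` prefix):

```python
# Print current KV-cache usage from the Prometheus metrics endpoint.
import urllib.request

with urllib.request.urlopen("http://localhost:8000/metrics") as resp:
    for line in resp.read().decode().splitlines():
        if "gpu_cache_usage_perc" in line and not line.startswith("#"):
            print(line)  # e.g. vllm:gpu_cache_usage_perc{...} 0.87
```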