Getting Started

# Install vLLM (requires Python 3.9+ and CUDA 12.1+)
pip install vllm

# Start OpenAI-compatible API server
vllm serve meta-llama/Llama-3.1-8B-Instruct

# With tensor parallelism and prefix caching
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --enable-prefix-caching
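
Once the server is running, any OpenAI-compatible client can talk to it. A minimal sketch using the openai Python package, assuming the server is on the default port 8000 (the model name must match the one passed to vllm serve):

# pip install openai
from openai import OpenAI

# vLLM's server speaks the OpenAI API; the key is required by the client but unused.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user",
               "content": "Explain PagedAttention in one sentence."}],
    max_tokens=100,
)
print(response.choices[0].message.content)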

Source Code Walkthrough

The following annotated source excerpts show how vLLM's core concepts are implemented. Each block maps to a concept from the Core Concepts page.

PagedAttention -- BlockPool

The BlockPool class manages fixed-size KV cache blocks using a free list and hash table -- the foundation of PagedAttention.

vllm/v1/core/block_pool.py
class BlockPool:
    """Pool of KV cache blocks with prefix caching support."""

    def __init__(self, num_gpu_blocks, enable_caching,
                 hash_block_size, ...):
        self.num_gpu_blocks = num_gpu_blocks
        self.enable_caching = enable_caching

        # All KV cache blocks -- one per physical GPU block
        self.blocks = [
            KVCacheBlock(block_id=i)
            for i in range(num_gpu_blocks)
        ]

        # Free block queue: doubly-linked list for O(1) alloc
        self.free_block_queue = FreeKVCacheBlockQueue(
            self.blocks
        )

        # Hash table for prefix cache lookups
        self.cached_block_hash_to_block = BlockHashToBlockMap()

        # Null block: reserved placeholder (never evicted)
        self.null_block = self.blocks[0]
        self.null_block.ref_cnt = 1
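
To make the free-list-plus-hash-table idea concrete, here is a stripped-down sketch (names like TinyBlockPool are illustrative, not vLLM's API) of how blocks are handed out, cached by prefix hash, and reclaimed:

from collections import OrderedDict, deque

class TinyBlockPool:
    """Toy block pool: a free list for O(1) allocation plus a
    hash -> block map for prefix-cache hits. Illustration only."""

    def __init__(self, num_blocks: int):
        self.free = deque(range(num_blocks))   # free block ids
        self.ref_cnt = [0] * num_blocks        # per-block reference counts
        self.cached = OrderedDict()            # prefix hash -> block id (LRU order)

    def allocate(self, prefix_hash=None) -> int:
        # Prefix-cache hit: reuse the existing block and bump its refcount.
        if prefix_hash is not None and prefix_hash in self.cached:
            block_id = self.cached[prefix_hash]
            self.cached.move_to_end(prefix_hash)
            self.ref_cnt[block_id] += 1
            return block_id
        # Cache miss: take a free block, evicting an idle cached block if needed.
        if not self.free:
            victim = next((h for h, b in self.cached.items()
                           if self.ref_cnt[b] == 0), None)
            if victim is None:
                raise RuntimeError("out of KV cache blocks")
            self.free.append(self.cached.pop(victim))
        block_id = self.free.popleft()
        self.ref_cnt[block_id] = 1
        if prefix_hash is not None:
            self.cached[prefix_hash] = block_id
        return block_id

    def release(self, block_id: int) -> None:
        # Cached blocks stay resident at ref_cnt == 0 until evicted above.
        self.ref_cnt[block_id] -= 1
        if self.ref_cnt[block_id] == 0 and block_id not in self.cached.values():
            self.free.append(block_id)

Two requests whose prompts share a prefix end up pointing at the same physical block through the hash lookup; the real BlockPool layers per-block token hashes, eviction events, and the null-block placeholder on top of this skeleton.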

Continuous Batching -- Scheduler

The Scheduler maintains separate queues for waiting and running requests, promoting waiting requests into the running set at every scheduling step to keep the GPU fully utilized.

vllm/v1/core/sched/scheduler.py
class Scheduler:
    def __init__(self, scheduler_config, ...):
        # Request queues: waiting and running
        self.waiting = create_request_queue(self.policy)
        self.running: list[Request] = []

        # Resource constraints
        self.max_num_running_reqs = (
            scheduler_config.max_num_seqs
        )
        self.max_num_scheduled_tokens = (
            scheduler_config.max_num_batched_tokens
        )

        # KV cache manager for block allocation
        self.kv_cache_manager = KVCacheManager(
            kv_cache_config=kv_cache_config,
            enable_caching=cache_config.enable_prefix_caching,
        )
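
A simplified, hypothetical version of one scheduling step (Req and schedule_step are illustrative names, not vLLM's) shows the promotion logic: in-flight decode requests are scheduled first, then waiting requests are admitted while the per-step sequence and token budgets allow, so new arrivals join the batch immediately instead of waiting for it to drain:

from collections import deque
from dataclasses import dataclass

@dataclass
class Req:
    prompt_len: int
    admitted: bool = False          # has the prompt been prefilled yet?

    def num_new_tokens(self) -> int:
        # Whole prompt on admission (prefill), one token per step afterwards.
        return 1 if self.admitted else self.prompt_len

def schedule_step(waiting: deque, running: list, max_running_reqs: int,
                  max_batched_tokens: int) -> list:
    """One continuous-batching iteration (toy version)."""
    budget = max_batched_tokens
    scheduled = []

    # Decode requests already in flight get one token each.
    for req in running:
        if budget <= 0:
            break
        scheduled.append(req)
        budget -= req.num_new_tokens()

    # Promote waiting requests while the sequence and token budgets allow.
    while waiting and len(running) < max_running_reqs:
        req = waiting[0]
        if req.num_new_tokens() > budget:
            break                   # prompt does not fit this step
        waiting.popleft()
        running.append(req)
        scheduled.append(req)
        budget -= req.num_new_tokens()
        req.admitted = True

    return scheduled                # handed to the model executor as one batch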

Tensor Parallelism -- MultiprocExecutor

The MultiprocExecutor spawns one worker process per GPU rank and uses shared-memory message queues for low-latency RPC dispatch.

vllm/v1/executor/multiproc_executor.py
class MultiprocExecutor(Executor):
    supports_pp: bool = True

    def __init__(self, vllm_config, ...):
        tp_size, pp_size, pcp_size = (
            self._get_parallel_sizes()
        )
        assert self.world_size == (
            tp_size * pp_size * pcp_size
        )

        # Shared-memory broadcast queue
        self.rpc_broadcast_mq = MessageQueue(
            tp_size * pp_size * pcp_size
        )

        # Spawn one worker per GPU rank
        self.workers = []  # worker handles, one per local rank
        for rank in range(self.local_world_size):
            worker = WorkerProc.make_worker_process(
                vllm_config=vllm_config,
                local_rank=rank,
                rank=rank,
            )
            self.workers.append(worker)
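
The dispatch pattern is easy to approximate with the standard library. In this sketch (illustrative names, with multiprocessing.Queue standing in for vLLM's shared-memory MessageQueue), the executor puts one RPC on every rank's queue and each worker runs the named method against its own shard:

import multiprocessing as mp

def worker_loop(rank, inbox, outbox):
    """Per-rank worker: wait for an RPC, run it, report the result."""
    handlers = {
        "load_model": lambda: f"rank {rank}: model shard loaded",
        "execute_model": lambda: f"rank {rank}: forward pass done",
    }
    while True:
        method = inbox.get()
        if method == "shutdown":
            break
        outbox.put((rank, handlers[method]()))

if __name__ == "__main__":
    world_size = 4
    inboxes = [mp.Queue() for _ in range(world_size)]
    results = mp.Queue()
    procs = [
        mp.Process(target=worker_loop, args=(rank, inboxes[rank], results))
        for rank in range(world_size)
    ]
    for p in procs:
        p.start()

    # "Broadcast" an RPC: every rank receives the same call and runs it on
    # its own shard of the weights, which is what tensor parallelism relies on.
    for q in inboxes:
        q.put("execute_model")
    for _ in range(world_size):
        print(results.get())

    for q in inboxes:
        q.put("shutdown")
    for p in procs:
        p.join()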

Speculative Decoding -- EAGLE Proposer

The EAGLE proposer uses target model hidden states to predict draft tokens via a lightweight head.

vllm/v1/spec_decode/eagle.py
class SpecDecodeBaseProposer:
    def _greedy_sample(self, hidden_states):
        """Greedy-sample draft tokens from hidden states."""
        if self.use_local_argmax_reduction:
            return self.model.get_top_tokens(
                hidden_states
            )
        return self.model.compute_logits(
            hidden_states
        ).argmax(dim=-1)

    def propose(self, target_token_ids,
               target_hidden_states, ...):
        # First pass: run draft model
        self.set_inputs_first_pass(...)
        hidden = self.model.forward(
            self.draft_token_ids, ...
        )
        draft_tokens = self._greedy_sample(hidden)

        # Iteratively generate remaining tokens
        for i in range(1, self.num_speculative_tokens):
            self.set_inputs_subsequent_pass(
                draft_tokens, i
            )
            hidden = self.model.forward(...)
            draft_tokens = self._greedy_sample(hidden)
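
Drafting is only half of speculative decoding: the target model then checks the proposals in a single forward pass and keeps the longest prefix it agrees with. A toy greedy-verification helper (verify_draft is an illustrative name; vLLM's acceptance logic also handles sampled, non-greedy outputs):

def verify_draft(draft_tokens, target_argmax_tokens):
    """Greedy acceptance: keep draft tokens while they match what the
    target model would have produced at the same position.

    target_argmax_tokens[i] is the target model's argmax given the prompt
    plus draft_tokens[:i]; its last entry is the "bonus" token emitted
    after the final accepted position.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_argmax_tokens[i]:
            break
        accepted.append(tok)
    # The target's own prediction at the first mismatch (or one past the end)
    # is always emitted, so every verification step yields at least 1 token.
    accepted.append(target_argmax_tokens[len(accepted)])
    return accepted

# Example: 3 of 4 draft tokens match, so one target forward pass
# produces 4 tokens instead of 1.
print(verify_draft([11, 27, 5, 93], [11, 27, 5, 42, 88]))   # [11, 27, 5, 42]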

Engine Core -- Main Loop

The step() method orchestrates one iteration: schedule, execute, sample, update.

vllm/v1/engine/core.py
def step(self) -> EngineCoreOutputs:
    """Execute one iteration of the engine loop."""
    if not self.scheduler.has_requests():
        return EngineCoreOutputs.empty()

    scheduler_output = self.scheduler.schedule()
    executor_future = self.model_executor.execute_model(
        scheduler_output, non_block=True
    )
    grammar_bitmask = (
        self.scheduler.get_grammar_bitmask()
    )

    model_output = executor_future.result()
    self._process_aborts_queue()
    outputs = self.scheduler.update_from_output(
        model_output
    )
    return outputs
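
The loop that wraps step() is short in outline. A toy sketch, with run_busy_loop and the queue plumbing as illustrative stand-ins for the real engine-core process that the API server feeds with requests:

def run_busy_loop(engine_core, input_queue, output_queue):
    """Toy engine-core loop: drain new requests, run one step,
    push outputs back toward the API server."""
    while True:
        # Pick up any requests the frontend has submitted since the last step.
        while not input_queue.empty():
            engine_core.add_request(input_queue.get_nowait())

        # Idle: block until more work arrives rather than spinning.
        if not engine_core.scheduler.has_requests():
            engine_core.add_request(input_queue.get())

        # schedule -> execute -> sample -> update, then ship the results.
        output_queue.put(engine_core.step())
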
🔧 Production tip: Monitor gpu_cache_usage_perc at the /metrics endpoint. When it consistently hits 100%, reduce --max-model-len or add GPU capacity. The vLLM Production Stack provides Helm charts with built-in Prometheus integration.
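
A quick way to spot-check that gauge without a full Prometheus stack is to scrape the endpoint directly; in the exposition output the metric appears under a vllm-prefixed name, so this sketch simply matches on the suffix:

import requests

def kv_cache_usage(base_url: str = "http://localhost:8000") -> float:
    """Return the GPU KV-cache usage fraction reported on /metrics."""
    text = requests.get(f"{base_url}/metrics", timeout=5).text
    for line in text.splitlines():
        # Prometheus exposition format: "<name>{labels} <value>";
        # skip the "# HELP" / "# TYPE" comment lines.
        if "gpu_cache_usage_perc" in line and not line.startswith("#"):
            return float(line.rsplit(" ", 1)[1])
    raise RuntimeError("gpu_cache_usage_perc not found in /metrics output")

if __name__ == "__main__":
    usage = kv_cache_usage()
    print(f"KV cache usage: {usage:.1%}")
    if usage > 0.95:
        print("Consider lowering --max-model-len or adding GPU capacity.")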