LLM Inference Engine

vLLM

A high-throughput, memory-efficient inference and serving engine for large language models. PagedAttention eliminates KV cache fragmentation, delivering 2-4x higher throughput on the same hardware.

License Apache-2.0
Language Python / CUDA
Latest Version 0.19.0
💡 Core Concepts (Beginner): PagedAttention, KV cache, continuous batching, tensor parallelism, speculative decoding, and prefix caching -- with real-world analogies.
🏗 Architecture (Intermediate): V1 engine design: EngineCore, Scheduler, KVCacheManager, Workers, and how they coordinate for maximum throughput.
⚙️ How It Works (Intermediate): Internal mechanisms: block-based memory management, scheduling algorithms, prefix cache hashing, multi-GPU execution, and CUDA graphs.
💻 Implementation (Advanced): Getting started, configuration, code patterns, and an annotated source code walkthrough of the vLLM codebase.
🚀 Use Cases (Beginner): When to use vLLM, when not to, and real-world deployments at Roblox, Stripe, Meta, and more.
🌍 Ecosystem (Intermediate): Production Stack, LangChain, Ray Serve, quantization, LoRA adapters, and common integration patterns.
FAQ (All Levels): Practical questions: diagnosing throughput issues, right-sizing parameters, speculative decoding tradeoffs, and multi-model serving.
⚖️ Trade-offs (Intermediate): Honest strengths, real limitations, and head-to-head comparison with SGLang, TensorRT-LLM, and TGI.
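To give a feel for the block-based memory management behind PagedAttention: instead of reserving one contiguous KV-cache region per request, tokens are appended into fixed-size blocks drawn from a shared pool, and each sequence keeps a small table mapping its logical positions to physical blocks. The toy sketch below illustrates that idea only -- all names are invented for illustration and are not vLLM's actual API.

```python
# Toy sketch of block-based KV cache allocation (illustrative only;
# BlockManager and its methods are invented names, not vLLM internals).
class BlockManager:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # shared pool of physical blocks
        self.tables = {}   # seq_id -> list of physical block ids (block table)
        self.lengths = {}  # seq_id -> number of tokens stored so far

    def append_token(self, seq_id: int) -> None:
        """Reserve cache space for one more token of a sequence."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # current block is full (or first token)
            # Grab any free block; no contiguity needed, so no fragmentation.
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def free_seq(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


mgr = BlockManager(num_blocks=8, block_size=16)
for _ in range(20):                  # 20 tokens need ceil(20/16) = 2 blocks
    mgr.append_token(seq_id=0)
print(len(mgr.tables[0]))            # 2
mgr.free_seq(0)
print(len(mgr.free))                 # 8 -- all blocks reusable immediately
```

Because a sequence only ever holds whole blocks it is actually using, the worst-case waste is under one block per sequence, which is the fragmentation property the overview above refers to.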

Quick Start

Get vLLM running with an OpenAI-compatible API server in under a minute.

bash
# Install vLLM
pip install vllm

# Start the OpenAI-compatible API server
vllm serve meta-llama/Llama-3.1-8B-Instruct

# Test with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
       "messages": [{"role": "user", "content": "Hello!"}]}'

Point any OpenAI SDK client at http://localhost:8000/v1 -- zero code changes required.
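If you prefer not to pull in an SDK, the same chat-completion call can be made with the standard library alone. This sketch builds the request from the Quick Start's curl example; it assumes the server above is running on localhost:8000 (the `build_chat_request` helper is just for illustration).

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, user_message: str):
    """Build the same POST request as the curl example in the Quick Start."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "http://localhost:8000/v1",
    "meta-llama/Llama-3.1-8B-Instruct",
    "Hello!",
)
print(req.full_url)  # http://localhost:8000/v1/chat/completions

# Sending it requires the server from the Quick Start to be up:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Swapping `base_url` between vLLM and api.openai.com is the only change needed, which is what makes the server a drop-in OpenAI replacement.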