A high-throughput, memory-efficient inference and serving engine for large language models. PagedAttention eliminates KV cache fragmentation, delivering 2-4x higher throughput on the same hardware.
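The intuition behind PagedAttention is borrowed from virtual memory: the KV cache is carved into fixed-size blocks, a per-sequence block table maps logical token positions to physical blocks, and any free block can serve any request, so memory is never stranded between sequences. The following is an illustrative sketch of that bookkeeping, not vLLM's actual internals (`BlockAllocator` and its methods are invented names for illustration):

```python
# Illustrative sketch only (not vLLM internals): a paged KV-cache allocator.
# Sequences consume whole fixed-size blocks via a block table, so freed
# blocks are immediately reusable by any other request -- no fragmentation.
class BlockAllocator:
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))      # pool of physical block ids
        self.tables = {}                         # seq_id -> list of block ids

    def append_token(self, seq_id: int, pos: int) -> int:
        """Return the physical block holding token `pos`, allocating on demand."""
        table = self.tables.setdefault(seq_id, [])
        if pos // self.block_size >= len(table):
            # Grab *any* free block; logical order lives in the table, not memory.
            table.append(self.free.pop())
        return table[pos // self.block_size]

    def release(self, seq_id: int) -> None:
        """Finished sequence: return all of its blocks to the shared pool."""
        self.free.extend(self.tables.pop(seq_id, []))
```

Because allocation happens one block at a time as a sequence grows, no space is reserved up front for a maximum length, which is where the throughput gain comes from.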
Get vLLM running with an OpenAI-compatible API server in under a minute.
```bash
# Install vLLM
pip install vllm

# Start the OpenAI-compatible API server
vllm serve meta-llama/Llama-3.1-8B-Instruct

# Test with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
       "messages": [{"role": "user", "content": "Hello!"}]}'
```
Point any OpenAI SDK client at http://localhost:8000/v1; no code changes required.
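For example, a minimal client needs nothing beyond the Python standard library. This is a sketch that assumes the `vllm serve` command above is running on localhost:8000; the helper names (`build_payload`, `chat`) are invented for illustration:

```python
# Minimal client sketch (stdlib only); assumes the server above is running
# on localhost:8000 serving meta-llama/Llama-3.1-8B-Instruct.
import json
import urllib.request

def build_payload(prompt: str,
                  model: str = "meta-llama/Llama-3.1-8B-Instruct") -> dict:
    """Build an OpenAI-style chat-completion request body."""
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    """POST the prompt to the server and return the assistant's reply."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-style response: the reply lives in the first choice.
    return body["choices"][0]["message"]["content"]
```

With the official `openai` SDK the only change is the constructor, e.g. `OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")` (vLLM ignores the key unless you configure one).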