SGLang is a high-performance serving framework for large language models. Its RadixAttention mechanism automatically reuses KV-cache computation across requests that share a prompt prefix, delivering up to 5x higher throughput on structured LLM workloads.
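To make the prefix-reuse idea concrete, here is a toy sketch (not SGLang's actual implementation, which caches real KV tensors in a radix tree on the GPU): a trie keyed by tokens lets a second request that shares a prefix with an earlier one skip recomputing the shared tokens.

```python
# Toy illustration of RadixAttention-style prefix sharing (NOT SGLang's
# real implementation): a trie caches a per-token placeholder "KV entry"
# so requests sharing a prompt prefix reuse previously computed work.

class TrieNode:
    def __init__(self):
        self.children = {}   # token -> TrieNode
        self.kv = None       # placeholder for a cached KV-cache entry

class PrefixCache:
    def __init__(self):
        self.root = TrieNode()
        self.computed = 0    # total tokens actually computed (not reused)

    def prefill(self, tokens):
        """Walk the trie, reusing cached prefix nodes; compute only the new suffix.

        Returns the number of tokens served from the cache."""
        node, reused = self.root, 0
        for tok in tokens:
            if tok in node.children:
                node = node.children[tok]
                reused += 1
            else:
                child = TrieNode()
                child.kv = f"kv({tok})"   # stands in for real KV tensors
                node.children[tok] = child
                node = child
                self.computed += 1
        return reused

cache = PrefixCache()
cache.prefill(["<sys>", "You", "are", "helpful.", "Hi"])            # cold: computes 5 tokens
reused = cache.prefill(["<sys>", "You", "are", "helpful.", "Bye"])  # warm: reuses the 4-token prefix
print(reused, cache.computed)  # -> 4 6
```

The same effect is what RadixAttention exploits for shared system prompts, few-shot examples, and multi-turn chat histories.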
Install SGLang and launch a server with any Hugging Face model in minutes.
```bash
# Install SGLang
pip install "sglang[all]"

# Launch a server
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000

# Query with curl
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"Hello!"}]}'
```
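The same OpenAI-compatible endpoint can be called from Python. A minimal sketch using only the standard library (the URL and model name mirror the curl call above; adjust them if you launched the server differently):

```python
# Minimal stdlib client for the server launched above. The endpoint and
# payload match the curl example; host, port, and model name are the
# defaults used in this quickstart.
import json
import urllib.request

def build_payload(model: str, user_msg: str) -> dict:
    """Build a chat-completions request body with a single user message."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
    }

def chat(payload: dict, url: str = "http://localhost:30000/v1/chat/completions") -> dict:
    """POST the payload as JSON and return the decoded response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (with the server running):
#   reply = chat(build_payload("meta-llama/Llama-3.1-8B-Instruct", "Hello!"))
#   print(reply["choices"][0]["message"]["content"])
```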
Dive into Core Concepts to understand RadixAttention and continuous batching, or jump to Implementation Details for code patterns and a source-code walkthrough.