LLM Serving & Inference

SGLang

SGLang is a high-performance serving framework for large language models that automatically reuses computation across requests through RadixAttention — delivering up to 5x higher throughput on structured LLM workloads.
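Conceptually, RadixAttention keeps each request's KV cache entries keyed by their token prefix in a radix tree, so a new request only pays prefill compute for the tokens past its longest cached prefix. The following is a deliberately simplified sketch of that prefix-matching idea in plain Python (a flat prefix set standing in for SGLang's actual radix tree, with illustrative token tuples):

```python
def longest_cached_prefix(cache: set, tokens: tuple) -> int:
    """Length of the longest cached prefix of `tokens`.

    `cache` holds token-prefix tuples whose KV entries are resident;
    a real radix tree finds this match in a single traversal.
    """
    for i in range(len(tokens), 0, -1):
        if tokens[:i] in cache:
            return i
    return 0

def serve(cache: set, tokens: tuple) -> int:
    """'Run' a request: reuse the cached prefix, compute the rest.

    Returns how many tokens needed fresh prefill compute.
    """
    reused = longest_cached_prefix(cache, tokens)
    # Record every prefix of this request so later requests can share it.
    for i in range(1, len(tokens) + 1):
        cache.add(tokens[:i])
    return len(tokens) - reused

cache: set = set()
shared_system_prompt = (1, 2, 3, 4)            # hypothetical token IDs
first = serve(cache, shared_system_prompt + (5, 6))   # cold: prefills all 6
second = serve(cache, shared_system_prompt + (7, 8))  # reuses 4, prefills 2
```

Two chat requests sharing a system prompt therefore prefill the shared tokens only once, which is where the cross-request throughput gains come from.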

- License: Apache-2.0
- Language: Python / C++ / CUDA
- Latest version: v0.5.9
- GPU deployments: 400,000+
- 💡 Core Concepts (Beginner): Key abstractions — RadixAttention, continuous batching, chunked prefill, constrained decoding, and the frontend DSL.
- 🏗 Architecture (Intermediate): System design — HTTP server, TokenizerManager, Scheduler, RadixCache, ModelRunner, and data flow.
- ⚙️ How It Works (Intermediate): Internal mechanisms — radix tree KV cache, scheduling loop, grammar backends, speculative decoding, and performance.
- 💻 Implementation Details (Advanced): Hands-on — getting started, configuration, code patterns, and an annotated source code walkthrough.
- 🚀 Use Cases (Beginner – Intermediate): When to use SGLang, when not to, and real-world deployments at xAI, Cursor, and LinkedIn.
- 🔌 Ecosystem & Integrations (Intermediate): Tools, backends, and integration patterns — XGrammar, FlashInfer, LangChain, Kubernetes, and more.
- FAQ (All Levels): Senior-level questions about KV cache eviction, scheduling fallback, throughput tuning, and production debugging.
- ⚖️ Trade-offs & Limitations (Intermediate): Honest strengths, limitations, and comparisons with vLLM, TensorRT-LLM, and llama.cpp.

Quick Start

Install SGLang and launch a server with any Hugging Face model in minutes.

```bash
# Install SGLang
pip install "sglang[all]"

# Launch a server
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000

# Query with curl
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"Hello!"}]}'
```
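The server exposes an OpenAI-compatible API, so the same request can be made from Python. A minimal sketch using only the standard library (model name and port match the launch command above; adjust them to your deployment):

```python
import json
import urllib.request

def build_chat_request(model: str, user_message: str) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

def chat(base_url: str, payload: dict) -> dict:
    """POST the payload to the server's OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("meta-llama/Llama-3.1-8B-Instruct", "Hello!")
# chat("http://localhost:30000", payload)  # requires a running server
```

Any OpenAI SDK client pointed at `http://localhost:30000/v1` works the same way.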

Dive into Core Concepts to understand RadixAttention and continuous batching, or jump to Implementation Details for code patterns and source code walkthrough.