High-Level Design

SGLang follows a frontend-backend co-design architecture. The frontend provides a Python DSL for writing structured LLM programs, while the backend runtime (SRT) handles serving with aggressive optimization. The backend itself uses a multi-process, pipelined architecture in which the major process groups, the TokenizerManager, the Scheduler, and the DetokenizerManager, communicate via ZMQ and shared memory.

SGLang System Architecture

- HTTP Server: FastAPI server exposing an OpenAI-compatible API
- TokenizerManager: text ↔ tokens
- DetokenizerManager: tokens → text
- Scheduler: batching + scheduling
- RadixCache: KV cache tree
- ModelRunner: model execution
- Attention Backend: FlashInfer / FlashAttention 2
- Grammar Backend: XGrammar / Outlines

Design Decisions

ZMQ for control, shared memory for data. SGLang sends control messages (new requests, completions) between processes over ZMQ, but passes large tensor data (token IDs, logprobs) through shared memory. This avoids serialization overhead for bulk data while keeping the control flow simple.
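The split can be illustrated with a small stdlib-only sketch. It substitutes JSON strings for ZMQ messages and uses `multiprocessing.shared_memory` for the data path; the names (`send_tokens`, `recv_tokens`) are illustrative, not SGLang's actual API.

```python
import json
import struct
from multiprocessing import shared_memory


def send_tokens(token_ids: list[int], rid: str) -> str:
    """Producer side: put bulk token data in shared memory, return only a
    small control message (this is what would travel over the ZMQ socket)."""
    raw = struct.pack(f"{len(token_ids)}i", *token_ids)
    shm = shared_memory.SharedMemory(create=True, size=len(raw))
    shm.buf[:len(raw)] = raw          # one copy, no per-element serialization
    msg = {"rid": rid, "shm": shm.name, "n": len(token_ids)}
    shm.close()
    return json.dumps(msg)            # tiny payload: id + pointer-like name


def recv_tokens(wire_msg: str) -> tuple[str, list[int]]:
    """Consumer side: parse the control message, attach to the shared block,
    read the bulk data, then release the block."""
    msg = json.loads(wire_msg)
    shm = shared_memory.SharedMemory(name=msg["shm"])
    toks = list(struct.unpack(f"{msg['n']}i", bytes(shm.buf[: 4 * msg["n"]])))
    shm.close()
    shm.unlink()
    return msg["rid"], toks


wire = send_tokens([5, 6, 7], "req-1")
rid, toks = recv_tokens(wire)
assert (rid, toks) == ("req-1", [5, 6, 7])
```

The key point is the asymmetry: the control message stays small and cheap to serialize regardless of how many tokens the request carries, because the bulk data never touches the message channel.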

Single-process scheduler. Unlike systems using Ray or distributed coordinators, SGLang runs one scheduler per GPU group in a single process. This eliminates distributed coordination overhead and enables microsecond-level scheduling decisions.

Cache-aware scheduling as default. Most serving systems schedule first-come, first-served (FCFS). SGLang defaults to longest-prefix-match (LPM) scheduling, which reorders the queue to prioritize requests that reuse cached KV data, trading strict fairness for higher throughput.
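A minimal sketch of LPM reordering, with a toy linear scan standing in for the real radix-cache lookup (`matched_len` and `lpm_order` are hypothetical names, not SGLang functions):

```python
def matched_len(cached_prefixes: list[list[int]], token_ids: list[int]) -> int:
    """Length of the longest cached prefix of token_ids (toy linear scan;
    the real system queries a radix tree instead)."""
    best = 0
    for p in cached_prefixes:
        n = 0
        while n < min(len(p), len(token_ids)) and p[n] == token_ids[n]:
            n += 1
        best = max(best, n)
    return best


def lpm_order(waiting: list[list[int]], cached_prefixes: list[list[int]]):
    """Reorder the waiting queue so requests with the most reusable cached
    KV data run first, trading strict FCFS fairness for throughput."""
    return sorted(waiting, key=lambda req: -matched_len(cached_prefixes, req))


cache = [[1, 2, 3, 4]]                      # one cached request's tokens
queue = [[9, 9], [1, 2, 3, 7], [1, 2]]     # arrival order (FCFS order)
assert lpm_order(queue, cache) == [[1, 2, 3, 7], [1, 2], [9, 9]]
```

Note that `sorted` is stable, so requests with equal cache reuse keep their arrival order, which bounds how far the policy can deviate from FCFS among cache-equivalent requests.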

CUDA graph capture for decode. Decode steps are memory-bandwidth-bound with short kernels, making CPU launch overhead significant. CUDA graphs eliminate this overhead by replaying captured GPU command sequences.

Key Design Principle

SGLang's architecture is optimized for the common case in production LLM serving: requests that share prefixes. The radix cache, cache-aware scheduling, and single-process scheduler are all designed around this observation. When prefix sharing is absent, these optimizations add negligible overhead but provide no benefit.
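To make the prefix-sharing payoff concrete, here is a toy radix-style cache over token IDs; a lookup reports how many leading tokens already have cached KV. The classes and method names are illustrative, not SGLang's actual implementation.

```python
class Node:
    """One trie node; children are keyed by token id."""

    def __init__(self):
        self.children: dict[int, "Node"] = {}


class PrefixCache:
    """Toy uncompressed trie standing in for SGLang's radix tree."""

    def __init__(self):
        self.root = Node()

    def insert(self, token_ids: list[int]) -> None:
        """Record a served request's tokens as cached."""
        node = self.root
        for t in token_ids:
            node = node.children.setdefault(t, Node())

    def match_prefix(self, token_ids: list[int]) -> int:
        """Number of leading tokens whose KV would already be cached,
        i.e. prefill work a new request can skip."""
        node, n = self.root, 0
        for t in token_ids:
            if t not in node.children:
                break
            node = node.children[t]
            n += 1
        return n


cache = PrefixCache()
cache.insert([1, 2, 3, 4])                      # e.g. a shared system prompt
assert cache.match_prefix([1, 2, 3, 9]) == 3    # reuses 3 of 4 prefill tokens
assert cache.match_prefix([7, 8]) == 0          # no sharing: no benefit
```

The last assertion mirrors the principle above: when no prefix is shared, the lookup simply returns zero and the request pays full prefill cost, so the machinery costs little when it cannot help.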