RadixAttention is SGLang's signature innovation for automatically reusing the KV cache across multiple LLM calls. The system organizes all cached key-value tensors in a radix tree, where each path from the root represents a token sequence, so every cached prefix corresponds to a node and edges can carry multi-token segments. When a new prompt shares a prefix with a cached one, the system reuses those cached KV pairs and skips the expensive prefill computation for the shared portion.
This matters because real-world workloads -- multi-turn chat, few-shot prompting, agentic tool use -- frequently share long common prefixes across requests.
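The prefix-matching idea can be sketched with a token-level trie. This is a simplified stand-in for SGLang's actual radix tree (real nodes hold handles to KV tensors and edges are compressed multi-token segments; the class and method names here are illustrative):

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # token id -> child node

class PrefixCache:
    """Toy prefix cache: a trie over token ids. In RadixAttention each
    matched node would carry reusable KV-cache entries."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def match_prefix(self, tokens):
        """Length of the longest cached prefix of `tokens`; prefill can
        be skipped for that many tokens."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5])            # e.g. a cached chat history
print(cache.match_prefix([1, 2, 3, 9]))  # -> 3 (three tokens reusable)
```

A multi-turn chat maps onto this directly: every turn re-sends the conversation so far, so the match length grows with the conversation and only the new suffix needs prefill.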
Continuous batching keeps the GPU busy at all times. The scheduler continuously adds new requests to the running batch as existing requests finish generating tokens, rather than waiting for the slowest request to complete before processing new ones.
SGLang's scheduler runs a tight event loop that merges new prefill requests with ongoing decode operations every iteration, eliminating GPU idle time.
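The event loop can be illustrated with a toy simulation: each iteration admits waiting requests into the running batch (up to a cap), steps every running request by one decode token, and retires finished requests immediately. All names and the batch-size cap are illustrative, not SGLang's API:

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy continuous-batching loop. `requests` is a list of
    (request_id, tokens_to_generate) pairs; returns the completion
    order and, per iteration, which requests shared the batch."""
    waiting = deque(requests)
    running, finished, timeline = {}, [], []
    while waiting or running:
        # Admit newcomers the moment a slot is free -- no waiting for
        # the slowest request in the batch to finish.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        timeline.append(sorted(running))  # one fused forward pass
        for rid in list(running):         # one decode token per request
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]
                finished.append(rid)
    return finished, timeline

done, steps = continuous_batching([("a", 2), ("b", 5), ("c", 1)], max_batch=2)
print(done)   # short requests finish early; "c" slots in as "a" exits
```

Note how request "c" joins the batch in the very iteration after "a" completes, so the GPU never runs a partially empty batch while work is queued.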
Chunked prefill splits long-prompt processing into smaller pieces so the system can interleave prefill with ongoing decode work. When a prompt is very long (thousands of tokens), processing it in one pass would monopolize the GPU and stall in-flight decode requests. SGLang instead splits the prompt into configurable chunks and schedules them alongside decode steps.
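The chunking itself is simple slicing; the scheduling value comes from interleaving the chunks with decode steps. A minimal sketch (the function name is illustrative; the chunk size corresponds to a server-side configuration knob in SGLang):

```python
def chunk_prefill(prompt_tokens, chunk_size):
    """Split a long prompt into fixed-size chunks. The scheduler can
    then alternate: prefill one chunk, run a decode step for the
    running batch, prefill the next chunk, and so on."""
    return [prompt_tokens[i:i + chunk_size]
            for i in range(0, len(prompt_tokens), chunk_size)]

chunks = chunk_prefill(list(range(10)), chunk_size=4)
print(chunks)  # -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

With a chunk size of, say, 2048 tokens, a 16k-token prompt becomes eight prefill slices, each short enough that decode latency for other requests stays bounded.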
Constrained decoding forces the model's output to follow a specific format or grammar. SGLang implements this through grammar backends (primarily XGrammar) that compile grammar specifications into efficient finite-state machines and apply token bitmasks at each decoding step.
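The mechanism can be shown with a toy finite-state machine: the automaton's current state determines a bitmask over the vocabulary, and disallowed tokens are masked out before the next token is chosen. The grammar, vocabulary, and function names below are invented for illustration; XGrammar compiles real grammars (e.g. JSON schemas) into far larger machines of the same shape:

```python
VOCAB = ["yes", "no", "maybe", "<eos>"]

# Toy grammar: the answer must be "yes" or "no", then end.
TRANSITIONS = {
    0: {"yes": 1, "no": 1},
    1: {"<eos>": 2},
}

def token_bitmask(state):
    """True for each vocabulary entry the grammar allows next."""
    allowed = TRANSITIONS.get(state, {})
    return [tok in allowed for tok in VOCAB]

def constrained_greedy(logits_per_step):
    """Greedy decoding with the grammar mask applied at every step."""
    state, out = 0, []
    for logits in logits_per_step:
        mask = token_bitmask(state)
        best = max((i for i in range(len(VOCAB)) if mask[i]),
                   key=lambda i: logits[i])
        tok = VOCAB[best]
        out.append(tok)
        state = TRANSITIONS[state][tok]
        if tok == "<eos>":
            break
    return out

# The model "prefers" maybe (logit 0.9), but the mask forbids it.
print(constrained_greedy([[0.1, 0.2, 0.9, 0.0],
                          [0.0, 0.0, 0.0, 1.0]]))
```

Because the mask is a per-state bit vector, it can be applied to the logits of a whole batch in one vectorized operation, which is what keeps grammar enforcement cheap at decode time.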
The SGLang frontend language is a Python-embedded DSL for structured LLM programs. Key primitives: gen() for generation, select() for constrained choices, fork() for parallel branching. The frontend captures program structure that the backend runtime exploits for optimization.
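The semantics of the three primitives can be mocked without a running backend. The class below is a self-contained toy, not SGLang's API: it only shows that select() is a scored choice, that fork() duplicates a state whose shared prefix the runtime could serve from cache, and that generated values are captured by name:

```python
import copy

class State:
    """Toy stand-in for an SGLang program state (illustrative only)."""

    def __init__(self, text=""):
        self.text = text
        self.vars = {}     # named captures, like s["answer"] in SGLang

    def gen(self, name, stub):
        """Free-form generation; `stub` plays the role of model output."""
        self.vars[name] = stub
        self.text += stub
        return self

    def select(self, name, choices, scores):
        """Constrained choice: pick the highest-scoring option."""
        best = choices[max(range(len(choices)), key=lambda i: scores[i])]
        self.vars[name] = best
        self.text += best
        return self

    def fork(self, n):
        """Parallel branches that all share this state's prefix."""
        return [copy.deepcopy(self) for _ in range(n)]

s = State("Q: Is SGLang fast?\nA: ")
s.select("verdict", ["Yes", "No"], scores=[0.9, 0.1])
branches = s.fork(2)  # branches share the prefix -> KV cache reuse
```

This is exactly the structure the backend exploits: a fork() makes the shared prefix explicit, so RadixAttention can serve all branches from one cached copy.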
Speculative decoding uses a smaller, faster draft model to predict multiple tokens ahead, then verifies them with the full model in a single forward pass. SGLang supports EAGLE-based speculative decoding, achieving 1.5-2x speedup on decode-heavy workloads.
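A greedy variant of the draft-then-verify loop can be sketched with stub models. This simplifies heavily (EAGLE drafts from the target's hidden states and verification compares distributions, not argmax tokens, in one batched pass), and every name below is illustrative:

```python
def speculative_step(draft_model, target_model, prefix, k):
    """One speculative step, greedy variant: draft k tokens cheaply,
    verify them against the target model, keep the longest agreeing
    run plus one token from the target (a correction or a bonus)."""
    ctx, draft = list(prefix), []
    for _ in range(k):                 # cheap autoregressive drafting
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)

    ctx, accepted = list(prefix), []
    for t in draft:                    # conceptually one target pass
        expected = target_model(ctx)
        if expected == t:
            accepted.append(t)         # draft token verified
            ctx.append(t)
        else:
            accepted.append(expected)  # target's correction; stop here
            break
    else:
        accepted.append(target_model(ctx))  # all accepted: bonus token
    return accepted

# Stubs: draft guesses last+1; target agrees except after token 3.
draft = lambda ctx: ctx[-1] + 1
target = lambda ctx: ctx[-1] + 1 if ctx[-1] != 3 else 10
print(speculative_step(draft, target, [1], k=3))  # -> [2, 3, 10]
```

The payoff is that one target forward pass can emit several tokens when the draft is accurate, which is why the speedup concentrates on decode-heavy workloads.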
Tensor parallelism splits a single model across multiple GPUs. SGLang uses NCCL for inter-GPU communication and supports TP, pipeline parallelism, and expert parallelism for MoE models.
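The core idea of sharding one layer can be shown with a toy matrix-vector product, sharded along the output dimension: each "rank" computes its slice locally, then the partial results are gathered. The function names and pure-Python "all-gather" are illustrative; in SGLang the gather/reduce steps are NCCL collectives across GPUs:

```python
def matvec(W, x):
    """Reference un-sharded linear layer: y = W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def tensor_parallel_matvec(W, x, tp_size):
    """Toy TP linear layer: shard W's output rows across tp_size
    ranks, compute each shard independently, then 'all-gather'
    (concatenate) the partial outputs into the full result."""
    n = len(W)
    shard = (n + tp_size - 1) // tp_size
    shards = [W[i * shard:(i + 1) * shard] for i in range(tp_size)]
    partials = [matvec(ws, x) for ws in shards]  # per-rank compute
    return [v for p in partials for v in p]      # all-gather step

W = [[1, 0], [0, 1], [1, 1], [2, 0]]
x = [3, 4]
print(tensor_parallel_matvec(W, x, tp_size=2))  # -> [3, 4, 7, 6]
```

Sharding the input dimension instead would require summing partial results (an all-reduce rather than an all-gather); real TP layers alternate the two patterns to minimize communication.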