The Radix Tree KV Cache

The heart of SGLang's performance is the radix tree that manages the KV cache. Each TreeNode stores its children, a key (the token sequence on the edge leading to it), a value (a tensor of GPU memory indices for those tokens' KV pages), a lock_ref reference counter, and an LRU timestamp.
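A minimal sketch of the node structure. Field names and types are illustrative (plain lists stand in for GPU index tensors), not SGLang's actual class definition:

```python
import time
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    # children keyed by the first token of each child edge
    children: dict = field(default_factory=dict)
    # token sequence stored on the edge into this node
    key: list = field(default_factory=list)
    # GPU memory indices for those tokens' KV pages (ints here for illustration)
    value: list = field(default_factory=list)
    # reference count; > 0 means a running request pins this node
    lock_ref: int = 0
    # last-access timestamp, consulted by LRU eviction
    last_access_time: float = field(default_factory=time.monotonic)

root = TreeNode()
root.children[7] = TreeNode(key=[7, 8, 9], value=[0, 1, 2])
```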

Prefix Matching: When a new request arrives, match_prefix() walks the tree from the root, following the longest matching child edge at each level. The result is the concatenated list of GPU memory indices for all matched KV pages.
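The walk can be sketched as follows. This is a simplified model (plain dicts for nodes, and it skips the LRU-timestamp update and lock bookkeeping the real match_prefix performs):

```python
# Each node is {"children": {first_token: node}, "key": [...], "value": [...]}.
def match_prefix(root, tokens):
    """Return GPU indices for the longest cached prefix of `tokens`."""
    out, node, i = [], root, 0
    while i < len(tokens):
        child = node["children"].get(tokens[i])
        if child is None:
            break
        # count how many tokens on this edge match the request
        k = 0
        while (k < len(child["key"]) and i + k < len(tokens)
               and child["key"][k] == tokens[i + k]):
            k += 1
        out.extend(child["value"][:k])
        if k < len(child["key"]):  # partial edge match: stop here
            break
        node, i = child, i + k
    return out

# Toy tree: edge [1, 2, 3] from the root, then edge [4, 5] below it.
root = {"children": {}, "key": [], "value": []}
a = {"children": {}, "key": [1, 2, 3], "value": [10, 11, 12]}
b = {"children": {}, "key": [4, 5], "value": [13, 14]}
root["children"][1] = a
a["children"][4] = b
```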

Insertion: After processing, the new token-to-KV mappings are inserted into the tree. If the sequence extends an existing path, the new suffix is appended; if it diverges mid-edge, the edge is split at the divergence point.

Eviction: When GPU memory fills, LRU eviction removes leaf nodes first, so a node is never freed while a descendant still references its prefix. Nodes with lock_ref > 0 (pinned by running requests) are never evicted. FIFO, LFU, and priority-based strategies are also available.
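Insertion with edge splitting and LRU leaf eviction can be sketched together. This is a toy model: dict-based nodes, a counter standing in for timestamps, and none of the real allocator or lock handling:

```python
import itertools

_clock = itertools.count()  # deterministic stand-in for LRU timestamps

def new_node(key=(), value=()):
    return {"children": {}, "key": list(key), "value": list(value),
            "lock_ref": 0, "last_access": next(_clock), "parent": None}

def insert(root, tokens, indices):
    """Insert a token -> GPU-index mapping, splitting edges at divergence."""
    node, i = root, 0
    while i < len(tokens):
        child = node["children"].get(tokens[i])
        if child is None:  # no matching edge: hang the remainder as a leaf
            leaf = new_node(tokens[i:], indices[i:])
            leaf["parent"] = node
            node["children"][tokens[i]] = leaf
            return
        k = 0  # length of the match along this edge
        while (k < len(child["key"]) and i + k < len(tokens)
               and child["key"][k] == tokens[i + k]):
            k += 1
        if k < len(child["key"]):
            # split: child keeps the matched head of the edge, a new
            # tail node takes the rest plus child's existing subtree
            tail = new_node(child["key"][k:], child["value"][k:])
            tail["children"], tail["parent"] = child["children"], child
            for c in tail["children"].values():
                c["parent"] = tail
            child["key"], child["value"] = child["key"][:k], child["value"][:k]
            child["children"] = {tail["key"][0]: tail}
        node, i = child, i + k

def evict_lru_leaf(root):
    """Drop the least recently used unlocked leaf; return its GPU indices."""
    best, stack = None, [root]
    while stack:
        for c in stack.pop()["children"].values():
            if not c["children"] and c["lock_ref"] == 0:
                if best is None or c["last_access"] < best["last_access"]:
                    best = c
            stack.append(c)
    if best is None:
        return []  # everything is pinned
    del best["parent"]["children"][best["key"][0]]
    return best["value"]

root = new_node()
insert(root, [1, 2, 3, 4], [10, 11, 12, 13])
insert(root, [1, 2, 7], [10, 11, 20])  # diverges after [1, 2] -> edge split
```

After the second insert, the tree holds a shared [1, 2] edge with two branches; the indices returned by eviction would be handed back to the KV allocator.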

The Scheduling Loop

The scheduler's event_loop_normal runs in a tight loop: (1) receive new requests from ZMQ, (2) add to waiting queue, (3) call get_next_batch_to_run(), (4) run batch on GPU, (5) process results.
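The shape of that loop can be sketched with a stub scheduler. Method names follow the description above, but the bodies are toy stand-ins (a list instead of a ZMQ socket, a string transform instead of a forward pass), and the toy loop exits when idle rather than blocking:

```python
class ToyScheduler:
    """Stub illustrating the loop shape; real logic lives in SGLang's Scheduler."""
    def __init__(self, incoming):
        self.incoming = list(incoming)  # stands in for the ZMQ socket
        self.waiting_queue = []
        self.finished = []

    def recv_requests(self):            # (1) drain pending requests
        reqs, self.incoming = self.incoming, []
        return reqs

    def get_next_batch_to_run(self):    # (3) pick the next batch
        if not self.waiting_queue:
            return None
        batch, self.waiting_queue = self.waiting_queue, []
        return batch

    def run_batch(self, batch):         # (4) stands in for the GPU forward pass
        return [f"out:{r}" for r in batch]

    def process_batch_result(self, batch, result):  # (5) stream results out
        self.finished.extend(result)

def event_loop_normal(s):
    while True:
        s.waiting_queue.extend(s.recv_requests())  # (1) + (2)
        batch = s.get_next_batch_to_run()          # (3)
        if batch is None:
            break                                  # toy loop: stop when idle
        s.process_batch_result(batch, s.run_batch(batch))  # (4) + (5)

s = ToyScheduler(["a", "b"])
event_loop_normal(s)
```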

Prefill vs Decode priority: New prefill requests are prioritized because prefill is compute-bound (high arithmetic intensity) while decode is memory-bandwidth-bound. Interleaving prefill keeps compute units busy.

Cache-aware scheduling: The LPM (longest-prefix-match) policy computes the prefix match length for each waiting request, then sorts the queue by descending match length. When the queue exceeds 128 requests, it falls back to FCFS to avoid the O(n*m) matching overhead (n waiting requests, each matched against up to m tokens).

Dynamic Fallback

The LPM scheduling policy dynamically switches to FCFS when the waiting queue grows beyond 128 requests. This prevents prefix matching from becoming a bottleneck under high load, at the cost of reduced cache hit rates.
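The policy reduces to a small decision plus a sort. A sketch, with the threshold constant and helper names chosen for illustration:

```python
FALLBACK_THRESHOLD = 128  # queue size beyond which LPM falls back to FCFS

def schedule(waiting_queue, prefix_match_len):
    """waiting_queue: requests in arrival order.
    prefix_match_len: callable returning the radix-tree match length."""
    if len(waiting_queue) > FALLBACK_THRESHOLD:
        return list(waiting_queue)  # FCFS: keep arrival order, skip matching
    # LPM: longest cached prefix first; the stable sort preserves
    # arrival order among requests with equal match lengths
    return sorted(waiting_queue, key=prefix_match_len, reverse=True)

match_len = {"r1": 0, "r2": 5, "r3": 2}
order = schedule(["r1", "r2", "r3"], lambda r: match_len[r])  # favors r2
```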

Constrained Decoding via Grammar Backends

Grammar Compilation: When a request specifies a JSON schema or regex, the grammar backend (typically XGrammar) compiles it into token bitmasks indexed by automaton state.

Per-Step Masking: At each decode step, the backend provides a bitmask of valid tokens. This is applied to logits via apply_token_bitmask_inplace_triton on NVIDIA GPUs, setting invalid tokens' logits to -inf before sampling so they receive zero probability.
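A pure-Python sketch of what the masking step computes (the real kernel runs in Triton on the GPU; the 32-tokens-per-int packing follows XGrammar's bitmask layout, as I understand it):

```python
import math

def apply_token_bitmask_inplace(logits, bitmask):
    """Set logits of grammar-invalid tokens to -inf.
    bitmask packs one bit per vocab token, 32 tokens per int."""
    for tok in range(len(logits)):
        word, bit = tok // 32, tok % 32
        if not (bitmask[word] >> bit) & 1:
            logits[tok] = -math.inf  # softmax will assign zero probability

logits = [0.5] * 8
mask = [0b0110]  # only tokens 1 and 2 are grammar-valid
apply_token_bitmask_inplace(logits, mask)
```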

Jump-Forward: When only one token sequence is valid (e.g., a required JSON key), the system emits those tokens directly without running the model, saving compute.

Speculative Decoding (EAGLE)

Draft Phase: The EAGLE draft model generates a tree of candidate continuations (3-5 tokens deep) using top-k sampling. Each step is fast because the draft model shares embeddings with the target.

Verification: The target model processes all candidates in a single forward pass. Tokens matching the target distribution are accepted; the first mismatch becomes the cutoff. Typical acceptance rates are 60-80%, yielding 1.5-2x speedup.
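The acceptance rule can be illustrated for a single draft path under greedy decoding. This is a deliberate simplification: real EAGLE verifies a whole tree of candidates and uses sampling-based acceptance, not argmax comparison:

```python
def verify(draft_tokens, target_argmax):
    """target_argmax[i] is the target model's pick at position i,
    computed for all positions in one forward pass (len = drafts + 1)."""
    accepted = []
    for tok, tgt in zip(draft_tokens, target_argmax):
        if tok == tgt:
            accepted.append(tok)  # draft matched: accepted for free
        else:
            accepted.append(tgt)  # first mismatch: take target's token, stop
            break
    else:
        # every draft token accepted: target's final prediction is a bonus
        accepted.append(target_argmax[len(draft_tokens)])
    return accepted
```

Either way the step emits at least one token, so speculation never slows generation below one token per target forward pass.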

Performance Characteristics

Throughput Comparison (tokens/sec on H100)

  SGLang          ~16,200
  vLLM            ~12,500
  TensorRT-LLM    ~17,000

Latency: SGLang maintains stable per-token latency of 4-21ms across concurrency levels. TTFT is competitive but vLLM can achieve lower TTFT on some workloads due to its C++ routing layer.

Memory Efficiency: The radix tree shares KV cache across prefix-sharing requests, effectively increasing usable batch size for a given GPU memory budget. This is especially impactful for few-shot workloads, where 50-90% of tokens may be shared.

When RadixAttention Shines

The throughput advantage from RadixAttention grows with prefix reuse. On workloads with high prefix sharing (multi-turn chat with shared system prompts), SGLang can achieve 2-5x throughput over systems without prefix caching. On unique-prompt workloads, the advantage narrows to the base batching efficiency.