The heart of SGLang's performance is the radix tree data structure managing KV cache. Each TreeNode stores children, a key (token sequence on the edge), a value (tensor of GPU memory indices), a lock reference counter, and LRU timestamps.
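In outline, a node carries exactly those fields. A minimal sketch (the field names are illustrative, not SGLang's exact attributes):

```python
import time

class TreeNode:
    """Minimal radix-tree node sketch; field names are illustrative,
    not SGLang's exact attributes."""

    def __init__(self, key=(), value=()):
        self.children = {}         # first token of each child edge -> TreeNode
        self.key = tuple(key)      # token sequence labeling the edge from the parent
        self.value = tuple(value)  # one GPU memory index per token in `key`
        self.lock_ref = 0          # > 0 while a running request pins this node
        self.last_access_time = time.monotonic()  # LRU timestamp
```

Because `key` and `value` are parallel sequences, any prefix of an edge maps directly onto a prefix of its GPU indices, which is what makes mid-edge matching and splitting cheap.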
Prefix Matching: When a new request arrives, match_prefix() walks the tree from root, following the longest matching child edge at each level. The result is the concatenated GPU memory indices for all matched KV pages.
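A CPU-only sketch of that walk over a stripped-down node type (the real `match_prefix` also splits partially matched edges and refreshes access times):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    key: tuple = ()    # tokens on the edge from the parent
    value: tuple = ()  # one GPU memory index per edge token
    children: dict = field(default_factory=dict)

def match_prefix(root, tokens):
    """Walk from the root, consuming the longest shared prefix, and
    return the concatenated GPU indices of the matched tokens."""
    node, matched, i = root, [], 0
    while i < len(tokens):
        child = node.children.get(tokens[i])
        if child is None:
            break
        # How far this edge agrees with the remaining tokens.
        k = 0
        while (k < len(child.key) and i + k < len(tokens)
               and child.key[k] == tokens[i + k]):
            k += 1
        matched.extend(child.value[:k])
        i += k
        if k < len(child.key):   # diverged mid-edge: nothing deeper can match
            break
        node = child
    return matched
```

The returned index list is exactly the KV-cache pages the new request can reuse without recomputation.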
Insertion: After processing, new KV mappings are added to the tree. If the sequence extends an existing path, it appends; if it diverges, it splits the edge at the divergence point.
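The append/split logic can be sketched as follows, again over a simplified node (not SGLang's actual insertion code):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    key: tuple = ()
    value: tuple = ()
    children: dict = field(default_factory=dict)

def insert(node, tokens, indices):
    """Add a token sequence and its GPU indices below `node`,
    splitting an existing edge at the divergence point if needed."""
    tokens, indices = tuple(tokens), tuple(indices)
    if not tokens:
        return
    child = node.children.get(tokens[0])
    if child is None:                       # no shared edge: new leaf
        node.children[tokens[0]] = Node(key=tokens, value=indices)
        return
    # Shared-prefix length between the edge key and the new tokens.
    k = 0
    while k < len(child.key) and k < len(tokens) and child.key[k] == tokens[k]:
        k += 1
    if k < len(child.key):                  # diverges mid-edge: split it
        tail = Node(key=child.key[k:], value=child.value[k:],
                    children=child.children)
        child.key, child.value = child.key[:k], child.value[:k]
        child.children = {tail.key[0]: tail}
    insert(child, tokens[k:], indices[k:])  # append the remainder, if any
```

The split keeps the invariant that sibling edges never share a first token, so lookup stays one dictionary probe per edge.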
Eviction: When GPU memory fills, LRU eviction removes leaf nodes first. Nodes with lock_ref > 0 (actively used) are never evicted. FIFO, LFU, and priority-based strategies are also available.
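A sketch of LRU leaf eviction with lock protection, using a heap ordered by access time (a simplified stand-in, not SGLang's eviction code; the real version also handles the other strategies named above):

```python
import heapq
from dataclasses import dataclass, field

@dataclass
class Node:
    key: tuple = ()
    value: tuple = ()
    lock_ref: int = 0
    last_access_time: float = 0.0
    parent: "Node" = None
    children: dict = field(default_factory=dict)

def _leaves(node):
    if not node.children:
        yield node
    for child in node.children.values():
        yield from _leaves(child)

def evict(root, num_tokens):
    """Free at least num_tokens KV slots by removing the least recently
    used unlocked leaves; nodes with lock_ref > 0 are never touched."""
    heap = [(n.last_access_time, id(n), n) for n in _leaves(root)
            if n is not root and n.lock_ref == 0]
    heapq.heapify(heap)
    freed = []
    while heap and len(freed) < num_tokens:
        _, _, node = heapq.heappop(heap)
        freed.extend(node.value)           # these GPU slots return to the pool
        parent = node.parent
        del parent.children[node.key[0]]
        # A parent that just became a leaf is now itself evictable.
        if parent is not root and not parent.children and parent.lock_ref == 0:
            heapq.heappush(heap, (parent.last_access_time, id(parent), parent))
    return freed
```

Evicting leaves first matters: an interior node's KV pages are a prefix of every descendant's, so freeing it would invalidate still-cached longer sequences.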
The scheduler's event_loop_normal runs in a tight loop: (1) receive new requests from ZMQ, (2) add to waiting queue, (3) call get_next_batch_to_run(), (4) run batch on GPU, (5) process results.
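The five steps can be paraphrased as a toy loop; `SchedulerSketch` and its method bodies are simplified stand-ins (the real loop is infinite and talks to a ZMQ socket and the GPU):

```python
from collections import deque

class SchedulerSketch:
    """Toy stand-in for SGLang's Scheduler, for illustration only."""

    def __init__(self, incoming):
        self.incoming = deque(incoming)   # plays the role of the ZMQ socket
        self.waiting = deque()
        self.finished = []

    def recv_requests(self):                        # (1) drain the socket
        reqs, self.incoming = list(self.incoming), deque()
        return reqs

    def process_input_requests(self, reqs):         # (2) enqueue new work
        self.waiting.extend(reqs)

    def get_next_batch_to_run(self):                # (3) pick a batch (FCFS here)
        if not self.waiting:
            return None
        batch, self.waiting = list(self.waiting), deque()
        return batch

    def run_batch(self, batch):                     # (4) pretend GPU forward pass
        return [f"out:{r}" for r in batch]

    def process_batch_result(self, batch, result):  # (5) stream results out
        self.finished.extend(result)

def event_loop_normal(sched, max_steps=10):
    """One tick per iteration; max_steps replaces the real infinite loop."""
    for _ in range(max_steps):
        reqs = sched.recv_requests()
        sched.process_input_requests(reqs)
        batch = sched.get_next_batch_to_run()
        if batch is not None:
            sched.process_batch_result(batch, sched.run_batch(batch))
```

The point of the structure is that steps (1)-(3) are cheap CPU work, so the GPU in step (4) is kept busy nearly every tick.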
Prefill vs Decode priority: New prefill requests are prioritized because prefill is compute-bound (high arithmetic intensity) while decode is memory-bandwidth-bound. Interleaving prefill keeps compute units busy.
Cache-aware scheduling: The LPM (longest-prefix-match) policy computes the prefix match length for each waiting request, then sorts by descending match length. When the queue exceeds 128 requests, it falls back to FCFS to avoid the O(n·m) matching overhead (n waiting requests, m tokens each).
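The policy reduces to a few lines; `prefix_len` here is a stand-in for the radix-tree match, and the 128 threshold comes from the text:

```python
def schedule(waiting, prefix_len, threshold=128):
    """Longest-prefix-match ordering with an FCFS fallback.

    waiting:    requests in arrival order
    prefix_len: callable mapping a request to its cached-prefix length
                (stands in for the radix-tree match)
    """
    if len(waiting) > threshold:
        return list(waiting)  # FCFS: keep arrival order, skip matching cost
    # Stable sort: requests with equal match length keep arrival order.
    return sorted(waiting, key=prefix_len, reverse=True)
```

Serving long-prefix requests first means their shared KV pages are hot in the tree when they run, maximizing cache reuse per batch.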
Grammar Compilation: When a request specifies a JSON schema or regex, the grammar backend (typically XGrammar) compiles it into token bitmasks indexed by automaton state.
Per-Step Masking: At each decode step, the backend provides a bitmask of valid tokens. This is applied to the logits via apply_token_bitmask_inplace_triton on NVIDIA GPUs, setting invalid tokens' logits to negative infinity so they receive zero probability after softmax.
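A pure-Python equivalent of what the Triton kernel does, assuming the common packing of one bit per vocabulary entry, 32 tokens per integer:

```python
import math

def apply_token_bitmask(logits, bitmask):
    """Set the logits of disallowed tokens to -inf so softmax assigns
    them zero probability. Pure-Python stand-in for the GPU kernel.

    logits:  list of floats, one per vocab token (modified in place)
    bitmask: list of ints packing one allow-bit per token, 32 per int
    """
    for tok in range(len(logits)):
        word, bit = divmod(tok, 32)
        if not (bitmask[word] >> bit) & 1:
            logits[tok] = -math.inf
    return logits
```

Doing this on the logits (rather than filtering after sampling) guarantees the sampled token is grammar-valid in a single pass.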
Jump-Forward: When only one token sequence is valid (e.g., a required JSON key), the system emits those tokens directly without running the model, saving compute.
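The idea can be sketched against a hypothetical automaton encoded as a transition table (the shape of `transitions` is an assumption for illustration, not XGrammar's API):

```python
def jump_forward(state, transitions):
    """Follow the automaton while exactly one token is legal, collecting
    the forced tokens without any model forward passes.

    transitions: dict mapping state -> {legal_token: next_state}
                 (hypothetical encoding, for illustration only)
    """
    forced = []
    while len(transitions.get(state, {})) == 1:
        (tok, state), = transitions[state].items()
        forced.append(tok)
    return forced, state
```

For a JSON schema, runs like `"name": "` are fully determined, so whole key spans are emitted at zero model cost.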
Draft Phase: The EAGLE draft model generates a tree of candidate continuations (3-5 tokens deep) using top-k sampling. Each step is fast because the draft model shares embeddings with the target.
Verification: The target model processes all candidates in a single forward pass. Tokens matching the target distribution are accepted; the first mismatch becomes the cutoff. Typical acceptance rates are 60-80%, yielding 1.5-2x speedup.
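The acceptance rule along one branch of the candidate tree can be sketched under a greedy-decoding simplification (real EAGLE verification samples against the target's full distribution and scores every tree branch):

```python
def verify_chain(draft_tokens, target_argmax):
    """Greedy acceptance along one candidate chain: keep draft tokens
    while they match the target model's argmax; the first mismatch is
    the cutoff, replaced by the target's own token.

    draft_tokens:  tokens proposed by the draft model
    target_argmax: the target model's argmax at each position, from the
                   single verification forward pass
    """
    accepted = []
    for draft, target in zip(draft_tokens, target_argmax):
        if draft == target:
            accepted.append(draft)
        else:
            accepted.append(target)  # target's correction ends the chain
            break
    return accepted
```

Even on a full mismatch the target's correction token is kept, so every verification pass emits at least one token and output quality matches the target model exactly.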
Latency: SGLang maintains stable per-token latency of 4-21 ms across concurrency levels. Time to first token (TTFT) is competitive, though vLLM can achieve lower TTFT on some workloads due to its C++ routing layer.
Memory Efficiency: The radix tree enables sharing KV cache across prefix-sharing requests, effectively increasing usable batch size for a given GPU memory budget. This is especially impactful for few-shot workloads, where 50-90% of tokens may be shared.