These are the questions senior engineers typically ask when evaluating or operating SGLang in production.
The RadixCache eviction policy activates. By default, LRU eviction starts from leaf nodes of the radix tree. Nodes with lock_ref > 0 (actively used by running requests) are protected. Eviction is recursive: when a leaf node is removed and its parent becomes childless, the parent is also eligible. SGLang supports FIFO, LFU, and priority-based strategies via configuration. If eviction alone isn't sufficient, the scheduler preempts lower-priority running requests, saving their partial state and re-adding them to the waiting queue.
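The leaf-first LRU eviction described above can be sketched as follows. This is a minimal illustration, not SGLang's implementation: the `Node` fields and the `evict` helper are simplified stand-ins, but the key behaviors match the description: nodes with `lock_ref > 0` are skipped, and a parent that becomes childless re-enters the eviction pool.

```python
import heapq

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = {}      # edge token -> child Node
        self.lock_ref = 0       # >0 while a running request pins this node
        self.last_access = 0.0  # timestamp used for LRU ordering
        self.num_tokens = 1     # KV-cache tokens held by this node

def collect_leaves(root):
    out, stack = [], [root]
    while stack:
        n = stack.pop()
        if n.children:
            stack.extend(n.children.values())
        elif n is not root:
            out.append(n)
    return out

def evict(root, num_tokens_needed):
    """LRU-evict unlocked leaves until enough tokens are freed."""
    leaves = [(n.last_access, id(n), n)
              for n in collect_leaves(root) if n.lock_ref == 0]
    heapq.heapify(leaves)
    freed = 0
    while freed < num_tokens_needed and leaves:
        _, _, node = heapq.heappop(leaves)
        parent = node.parent
        parent.children = {k: v for k, v in parent.children.items()
                           if v is not node}
        freed += node.num_tokens
        # Recursive step: a now-childless, unlocked parent becomes evictable.
        if parent is not root and not parent.children and parent.lock_ref == 0:
            heapq.heappush(leaves, (parent.last_access, id(parent), parent))
    return freed
```

Swapping the heap key (access count for LFU, insertion order for FIFO, an explicit priority field) yields the other strategies mentioned above.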
match_prefix() performs a greedy longest-prefix match. It walks the radix tree from root, following the child edge matching the most tokens at each level. If two requests share tokens 1-100 but diverge at 101, only 1-100 is reused. The match is page-aligned (configurable page size) -- if the match falls mid-page, it rounds down to the nearest page boundary for memory alignment.
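A toy version of that walk, assuming a radix tree whose edges carry token runs (the node layout here is illustrative, not SGLang's actual data structure), shows both the greedy descent and the round-down to a page boundary:

```python
class Node:
    def __init__(self):
        # first token of edge -> (edge token tuple, child Node)
        self.children = {}

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def match_prefix(root, tokens, page_size=1):
    """Greedy longest-prefix match, rounded down to a page boundary."""
    matched, node = 0, root
    while matched < len(tokens):
        entry = node.children.get(tokens[matched])
        if entry is None:
            break
        edge, child = entry
        l = common_prefix_len(edge, tokens[matched:])
        matched += l
        if l < len(edge):   # request diverges mid-edge: stop here
            break
        node = child
    return matched - matched % page_size   # page alignment
```

With a cached sequence 1-100 and a request diverging at token 101, only the shared 100 tokens are counted, and a page size of, say, 16 would round the reported match down to 96.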
Choose SGLang when: (1) your workload has significant prefix sharing (multi-turn chat, shared system prompts, few-shot); (2) you need structured/constrained output; (3) you're building agentic workflows with branching logic; (4) you need day-0 support for latest open models. Choose vLLM when: (1) you need absolute lowest TTFT on single-shot requests; (2) your workload is mostly unique prompts; (3) you need features specific to vLLM's ecosystem. SGLang typically wins on throughput (~16,200 vs ~12,500 tokens/sec on H100) while vLLM can win on TTFT under high concurrency.
Not in the traditional sense -- each SGLang server instance serves one base model. However, you can serve multiple LoRA adapters on top of a single base model with dynamic adapter switching per request. For multi-model serving, run multiple SGLang instances and use a router (SGLang Router, NGINX, or Kubernetes Ingress) to direct requests to the appropriate instance.
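The multi-instance setup can be as simple as a table mapping model names to backend URLs. Everything in this sketch is hypothetical (the backend names, URLs, and the `"base:adapter"` naming convention are illustrative, not an SGLang convention); it only shows the shape of the routing decision:

```python
# Hypothetical mapping: model name -> SGLang instance base URL.
BACKENDS = {
    "llama-3-8b": "http://sglang-llama:30000",
    "qwen-2-7b": "http://sglang-qwen:30000",
}

def route(request_body: dict) -> str:
    """Pick the backend instance serving the requested model."""
    model = request_body.get("model", "")
    # Illustrative convention: LoRA adapters on a shared base are named
    # "<base>:<adapter>", so we route on the base-model portion.
    base = model.split(":", 1)[0]
    if base not in BACKENDS:
        raise ValueError(f"no backend serves model {model!r}")
    return BACKENDS[base]
```

In production this lookup would live in the SGLang Router or your ingress layer; the point is that each instance owns one base model, and adapter selection happens per request within that instance.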
The default LPM (Longest Prefix Match) policy computes the prefix match length for each waiting request against the RadixCache, then sorts by descending match length. This means requests reusing the most cached computation are served first. However, computing matches for every request costs O(n*m), where n is the queue length and m is the prompt length. When the waiting queue exceeds 128 requests, the policy automatically falls back to FCFS to avoid this overhead. Force FCFS with --schedule-policy fcfs if your workload doesn't benefit from cache-aware scheduling.
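The policy boils down to a conditional sort. A minimal sketch (the function names are illustrative; `match_len` stands in for the RadixCache lookup):

```python
FCFS_THRESHOLD = 128  # queue length beyond which LPM falls back to FCFS

def schedule(waiting, match_len):
    """Order the waiting queue for the next batch.

    waiting:   requests in arrival order (FCFS order).
    match_len: callable returning a request's cached-prefix length.
    """
    if len(waiting) > FCFS_THRESHOLD:
        return list(waiting)  # FCFS: skip the O(n*m) matching entirely
    # LPM: longest cached prefix first; Python's stable sort preserves
    # arrival order among requests with equal match lengths.
    return sorted(waiting, key=match_len, reverse=True)
```

Note the fairness implication: under LPM, a request with no cached prefix can be repeatedly outranked by cache-friendly arrivals, which is one reason to force `--schedule-policy fcfs` for workloads with little sharing.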
Check: (1) sglang_cache_hit_rate -- low rates mean most requests need full prefill; (2) sglang_num_waiting_requests -- a growing queue means the scheduler can't keep up; (3) chunked prefill size -- if too large, long prefills block shorter requests; (4) GPU utilization via nvidia-smi -- underutilized GPU suggests CPU-side bottleneck. Consider reducing --chunked-prefill-size for more aggressive interleaving, or increasing --dp-size for more scheduler instances.
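For step (1) and (2), the metrics come from the server's Prometheus endpoint in text exposition format. A small hedged helper for pulling specific gauges out of that text (the label handling is generic Prometheus parsing, not SGLang-specific):

```python
def parse_metrics(text, names):
    """Extract gauge values from Prometheus text-format output (e.g. GET /metrics)."""
    out = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blanks
        name, _, value = line.rpartition(" ")
        bare = name.split("{", 1)[0]  # drop any {label="..."} suffix
        if bare in names:
            out[bare] = float(value)
    return out
```

Feed it the body of a `/metrics` scrape and alert when `sglang_cache_hit_rate` drops or `sglang_num_waiting_requests` trends upward.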
Yes, via Server-Sent Events (SSE) on the OpenAI-compatible API. Set "stream": true in your request. Internally, the detokenizer manager converts token IDs to text incrementally, sending each chunk through the HTTP response. For the SGLang frontend language, streaming is handled through async generators in the run() method.
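On the client side, each SSE event is a `data: {...}` line carrying an OpenAI-style chunk, terminated by a `data: [DONE]` sentinel. A minimal parser for that wire format (a client-side sketch; any OpenAI SDK does this for you):

```python
import json

def iter_sse_chunks(lines):
    """Yield content deltas from OpenAI-style SSE lines ('data: {...}')."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # ignore comments, event names, and keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":   # end-of-stream sentinel
            return
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta
```

In practice you would wrap this around `resp.iter_lines()` from a `requests` or `httpx` streaming response against `/v1/chat/completions` with `"stream": true`.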
Grammar-guided generation adds overhead at each decoding step for bitmask computation and application. With XGrammar (the default), the overhead is typically 5-15% throughput reduction depending on grammar complexity. Simple grammars (JSON with fixed schema) are near-free because the bitmask can be precomputed. Complex grammars (deeply nested, recursive) require per-step automaton transitions. The jump-forward optimization mitigates this by skipping model calls when the grammar dictates a deterministic token sequence.
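The two mechanisms above, per-step masking and jump-forward, can be sketched in a few lines. This is a toy model of the idea, not XGrammar's implementation: `allowed` stands in for the set of token IDs the grammar automaton permits at a given step.

```python
NEG_INF = float("-inf")

def apply_grammar_mask(logits, allowed):
    """Per-step bitmask application: forbid every token the grammar disallows."""
    return [l if i in allowed else NEG_INF for i, l in enumerate(logits)]

def jump_forward(allowed_per_step):
    """When the grammar permits exactly one token at each upcoming step,
    emit those tokens directly and skip the model calls entirely."""
    forced = []
    for allowed in allowed_per_step:
        if len(allowed) != 1:
            break  # genuine choice ahead: hand control back to the model
        forced.append(next(iter(allowed)))
    return forced
```

This also shows why fixed-schema JSON is near-free: long stretches like `{"name": "` are single-choice at every step, so they are jump-forwarded rather than decoded.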