Nodes with lock_ref > 0 (actively used by running requests) are protected. Eviction is recursive: when a leaf node is removed and its parent becomes childless, the parent is also eligible. SGLang supports FIFO, LFU, and priority-based strategies via configuration. If eviction alone isn't sufficient, the scheduler preempts lower-priority running requests, saving their partial state and re-adding them to the waiting queue.

match_prefix() performs a greedy longest-prefix match. It walks the radix tree from the root, following the child edge that matches the most tokens at each level. If two requests share tokens 1-100 but diverge at token 101, only tokens 1-100 are reused. The match is page-aligned (configurable page size) -- if the match falls mid-page, it is rounded down to the nearest page boundary for memory alignment.

Use --schedule-policy fcfs if your workload doesn't benefit from cache-aware scheduling.

To diagnose throughput problems, check: (1) sglang_cache_hit_rate -- low rates mean most requests need a full prefill; (2) sglang_num_waiting_requests -- a growing queue means the scheduler can't keep up; (3) the chunked prefill size -- if it is too large, long prefills block shorter requests; (4) GPU utilization via nvidia-smi -- an underutilized GPU suggests a CPU-side bottleneck. Consider reducing --chunked-prefill-size for more aggressive interleaving, or increasing --dp-size for more scheduler instances.

Set "stream": true in your request. Internally, the detokenizer manager converts token IDs to text incrementally, sending each chunk through the HTTP response. For the SGLang frontend language, streaming is handled through async generators in the run() method.
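The recursive eviction described above (evict a leaf, then keep evicting ancestors that become childless, stopping at pinned nodes) can be sketched as follows. This is a minimal illustration with a hypothetical TreeNode class, not SGLang's actual data structures:

```python
class TreeNode:
    """Hypothetical radix-tree node for illustration only."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = {}   # edge key -> child TreeNode
        self.lock_ref = 0    # > 0 while a running request pins this node

def evict_leaf(node):
    """Evict `node`, then recursively evict ancestors left childless.

    Nodes with lock_ref > 0 are protected and stop the recursion.
    Returns the number of nodes evicted.
    """
    evicted = 0
    while node is not None and node.parent is not None:
        if node.lock_ref > 0 or node.children:
            break  # pinned, or still has children: keep it
        parent = node.parent
        # Detach this node from its parent.
        for key, child in list(parent.children.items()):
            if child is node:
                del parent.children[key]
        evicted += 1
        node = parent  # the parent may now be childless too
    return evicted
```

Running evict_leaf on a leaf whose parent has no other children and no active lock removes both nodes; a pinned ancestor (lock_ref > 0) halts the walk.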
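The greedy longest-prefix match with page-aligned rounding can be illustrated with a simplified flat token comparison (SGLang walks a radix tree, but the alignment rule is the same). The function names and default page size here are illustrative:

```python
def longest_common_prefix(a, b):
    """Length of the shared token prefix between two sequences."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def match_prefix_len(cached_tokens, request_tokens, page_size=16):
    """Reusable prefix length, rounded DOWN to a page boundary.

    A mid-page match cannot be reused, so e.g. a 100-token match with
    page_size=16 yields 96 reusable tokens (6 full pages).
    """
    matched = longest_common_prefix(cached_tokens, request_tokens)
    return (matched // page_size) * page_size
```

With page_size=1 (the degenerate case), every matched token is reusable; larger pages trade a little reuse for simpler, aligned memory management.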
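On the client side, streamed responses typically arrive as server-sent events in the OpenAI-compatible format. A minimal sketch of building a streaming request payload and extracting text deltas from SSE lines (the model name and chunk schema are assumptions based on the OpenAI-compatible API, not taken from the text above):

```python
import json

# Illustrative streaming request body; "my-model" is a placeholder.
payload = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,  # ask the server to stream incremental chunks
}

def parse_sse_line(line):
    """Extract the text delta from one 'data: {...}' SSE line, if any."""
    if not line.startswith("data: "):
        return None
    body = line[len("data: "):]
    if body.strip() == "[DONE]":  # sentinel marking the end of the stream
        return None
    chunk = json.loads(body)
    return chunk["choices"][0]["delta"].get("content")
```

Each call returns the incremental text produced by the detokenizer for that chunk, or None for non-data lines and the [DONE] sentinel.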