SGLang follows a frontend-backend co-design architecture. The frontend provides a Python DSL for writing structured LLM programs, while the backend runtime (SRT) handles serving with aggressive optimization. The backend itself uses a multi-process, pipelined architecture with three major process groups communicating via ZeroMQ (ZMQ) sockets and shared memory.
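The pipelined process layout can be sketched with stdlib `multiprocessing` as a stand-in for the real ZMQ-connected processes. The stage names mirror SGLang's tokenizer/scheduler/detokenizer split, but the "tokenization" and "decode" bodies here are toy placeholders, not SGLang code.

```python
import multiprocessing as mp

def tokenizer(inbox, to_sched):
    # Stand-in for the tokenizer process: raw text -> token IDs.
    for text in iter(inbox.get, None):
        to_sched.put([ord(c) for c in text])  # toy "tokenization"
    to_sched.put(None)

def scheduler(from_tok, to_detok):
    # Stand-in for the scheduler process: runs the "model" on token batches.
    for ids in iter(from_tok.get, None):
        to_detok.put([i + 1 for i in ids])  # toy "decode" step
    to_detok.put(None)

def detokenizer(from_sched, outbox):
    # Stand-in for the detokenizer process: token IDs -> text.
    for ids in iter(from_sched.get, None):
        outbox.put("".join(chr(i) for i in ids))
    outbox.put(None)

def run_pipeline(texts):
    q_in, q_a, q_b, q_out = (mp.Queue() for _ in range(4))
    procs = [mp.Process(target=f, args=a) for f, a in
             [(tokenizer, (q_in, q_a)),
              (scheduler, (q_a, q_b)),
              (detokenizer, (q_b, q_out))]]
    for p in procs:
        p.start()
    for t in texts:
        q_in.put(t)
    q_in.put(None)  # sentinel: shut the pipeline down stage by stage
    results = list(iter(q_out.get, None))
    for p in procs:
        p.join()
    return results

if __name__ == "__main__":
    print(run_pipeline(["abc"]))  # → ['bcd']
```

The point of the shape, as in SGLang, is that each stage runs in its own process and the stages overlap: the scheduler can work on one batch while the detokenizer streams out the previous one.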
ZMQ for control, shared memory for data. SGLang uses ZMQ for inter-process communication of control messages (requests, completions) but shared memory for large tensor data (token IDs, logprobs). This avoids serialization overhead for bulk data while keeping control flow simple.
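A minimal sketch of the split, using stdlib `multiprocessing.shared_memory` for the bulk token IDs and a small JSON string as a stand-in for the ZMQ control frame. The `send`/`recv` names and the message fields are hypothetical, not SGLang's actual wire format.

```python
from multiprocessing import shared_memory
import array
import json

def send(token_ids):
    # Bulk data goes into a shared-memory segment; only a small control
    # message (JSON here, a stand-in for a ZMQ frame) names the segment.
    buf = array.array("i", token_ids).tobytes()
    shm = shared_memory.SharedMemory(create=True, size=len(buf))
    shm.buf[:len(buf)] = buf
    ctrl = json.dumps({"shm": shm.name, "n": len(token_ids)})
    return ctrl, shm  # sender keeps shm alive until the receiver is done

def recv(ctrl):
    # Receiver parses the tiny control message, then maps the bulk data
    # directly -- no serialization of the token IDs themselves.
    msg = json.loads(ctrl)
    shm = shared_memory.SharedMemory(name=msg["shm"])
    ids = array.array("i")
    ids.frombytes(bytes(shm.buf[: msg["n"] * ids.itemsize]))
    shm.close()
    return list(ids)

ctrl, shm = send([101, 102, 103])
out = recv(ctrl)
shm.close()
shm.unlink()
print(out)  # → [101, 102, 103]
```

Note how the control message stays a few dozen bytes regardless of how many tokens are transferred; that is the serialization cost being avoided.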
Single-process scheduler. Unlike systems using Ray or distributed coordinators, SGLang runs one scheduler per GPU group in a single process. This eliminates distributed coordination overhead and enables microsecond-level scheduling decisions.
Cache-aware scheduling as default. Most serving systems schedule first-come-first-served (FCFS). SGLang defaults to longest-prefix-match (LPM), which reorders the waiting queue to prioritize requests that reuse cached KV data, trading strict fairness for higher throughput.
CUDA graph capture for decode. Decode steps are memory-bandwidth-bound with short kernels, making CPU launch overhead significant. CUDA graphs eliminate this overhead by replaying captured GPU command sequences.
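The capture-once, replay-many pattern can be illustrated with a pure-Python analogy; no GPU is involved here, and the `Graph` class below is a toy, not the real API (in PyTorch the idiom uses `torch.cuda.CUDAGraph` with a capture context and `replay()`).

```python
class Graph:
    """Toy analogue of CUDA graph capture: record a kernel sequence once,
    then replay it without per-kernel dispatch (launch) work."""

    def __init__(self):
        self.ops = []

    def capture(self, kernels):
        # One-time capture of the fixed decode-step kernel sequence.
        self.ops = list(kernels)

    def replay(self, x):
        # Replay the recorded sequence; no per-step scheduling decisions,
        # analogous to a single graph launch replacing many kernel launches.
        for k in self.ops:
            x = k(x)
        return x

g = Graph()
g.capture([lambda x: x * 2, lambda x: x + 1])
print(g.replay(10))  # → 21
```

The real win comes from the same shift the analogy shows: per-step CPU work (many launches) is paid once at capture time, and each subsequent decode step is a single cheap replay.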