RadixAttention is SGLang's signature innovation for automatically reusing the KV cache across multiple LLM calls. The system organizes all cached key-value tensors in a radix tree, where each path from the root represents a token sequence, so every cached prefix corresponds to a node and edges can carry multi-token segments. When a new prompt shares a prefix with a cached one, the system reuses those cached KV pairs and skips the expensive prefill computation for the shared portion.
This matters because real-world workloads -- multi-turn chat, few-shot prompting, agentic tool use -- frequently share long common prefixes across requests.
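The prefix-matching idea can be sketched with a token-level trie. This is a simplified stand-in for SGLang's actual radix tree (real nodes hold handles to KV tensors and edges are compressed multi-token segments; the class and method names here are illustrative):

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # token id -> child node

class PrefixCache:
    """Toy prefix cache: a trie over token ids. In RadixAttention each
    matched node would carry reusable KV-cache entries."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def match_prefix(self, tokens):
        """Length of the longest cached prefix of `tokens`; prefill can
        be skipped for that many tokens."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5])            # e.g. a cached chat history
print(cache.match_prefix([1, 2, 3, 9]))  # -> 3 (three tokens reusable)
```

A multi-turn chat maps onto this directly: every turn re-sends the conversation so far, so the match length grows with the conversation and only the new suffix needs prefill.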
Continuous batching keeps the GPU busy at all times. The scheduler continuously adds new requests to the running batch as existing requests finish generating tokens, rather than waiting for the slowest request to complete before processing new ones.
SGLang's scheduler runs a tight event loop that merges new prefill requests with ongoing decode operations every iteration, eliminating GPU idle time.
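The event loop can be illustrated with a toy simulation: each iteration admits waiting requests into the running batch (up to a cap), steps every running request by one decode token, and retires finished requests immediately. All names and the batch-size cap are illustrative, not SGLang's API:

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy continuous-batching loop. `requests` is a list of
    (request_id, tokens_to_generate) pairs; returns the completion
    order and, per iteration, which requests shared the batch."""
    waiting = deque(requests)
    running, finished, timeline = {}, [], []
    while waiting or running:
        # Admit newcomers the moment a slot is free -- no waiting for
        # the slowest request in the batch to finish.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        timeline.append(sorted(running))  # one fused forward pass
        for rid in list(running):         # one decode token per request
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]
                finished.append(rid)
    return finished, timeline

done, steps = continuous_batching([("a", 2), ("b", 5), ("c", 1)], max_batch=2)
print(done)   # short requests finish early; "c" slots in as "a" exits
```

Note how request "c" joins the batch in the very iteration after "a" completes, so the GPU never runs a partially empty batch while work is queued.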
Chunked prefill splits long-prompt processing into smaller pieces so the system can interleave prefill with ongoing decode work. When a prompt is very long (thousands of tokens), processing it in one pass would monopolize the GPU and stall in-flight decode requests. SGLang instead splits the prompt into configurable chunks and schedules them alongside decode steps.
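The chunking itself is simple slicing; the scheduling value comes from interleaving the chunks with decode steps. A minimal sketch (the function name is illustrative; the chunk size corresponds to a server-side configuration knob in SGLang):

```python
def chunk_prefill(prompt_tokens, chunk_size):
    """Split a long prompt into fixed-size chunks. The scheduler can
    then alternate: prefill one chunk, run a decode step for the
    running batch, prefill the next chunk, and so on."""
    return [prompt_tokens[i:i + chunk_size]
            for i in range(0, len(prompt_tokens), chunk_size)]

chunks = chunk_prefill(list(range(10)), chunk_size=4)
print(chunks)  # -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

With a chunk size of, say, 2048 tokens, a 16k-token prompt becomes eight prefill slices, each short enough that decode latency for other requests stays bounded.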
Constrained decoding forces the model's output to follow a specific format or grammar. SGLang implements this through grammar backends (primarily XGrammar) that compile grammar specifications into efficient finite-state machines and apply token bitmasks at each decoding step.
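The mechanism can be shown with a toy finite-state machine: the automaton's current state determines a bitmask over the vocabulary, and disallowed tokens are masked out before the next token is chosen. The grammar, vocabulary, and function names below are invented for illustration; XGrammar compiles real grammars (e.g. JSON schemas) into far larger machines of the same shape:

```python
VOCAB = ["yes", "no", "maybe", "<eos>"]

# Toy grammar: the answer must be "yes" or "no", then end.
TRANSITIONS = {
    0: {"yes": 1, "no": 1},
    1: {"<eos>": 2},
}

def token_bitmask(state):
    """True for each vocabulary entry the grammar allows next."""
    allowed = TRANSITIONS.get(state, {})
    return [tok in allowed for tok in VOCAB]

def constrained_greedy(logits_per_step):
    """Greedy decoding with the grammar mask applied at every step."""
    state, out = 0, []
    for logits in logits_per_step:
        mask = token_bitmask(state)
        best = max((i for i in range(len(VOCAB)) if mask[i]),
                   key=lambda i: logits[i])
        tok = VOCAB[best]
        out.append(tok)
        state = TRANSITIONS[state][tok]
        if tok == "<eos>":
            break
    return out

# The model "prefers" maybe (logit 0.9), but the mask forbids it.
print(constrained_greedy([[0.1, 0.2, 0.9, 0.0],
                          [0.0, 0.0, 0.0, 1.0]]))
```

Because the mask is a per-state bit vector, it can be applied to the logits of a whole batch in one vectorized operation, which is what keeps grammar enforcement cheap at decode time.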
The SGLang frontend language is a Python-embedded DSL for structured LLM programs. Key primitives: gen() for generation, select() for constrained choices, fork() for parallel branching. The frontend captures program structure that the backend runtime exploits for optimization.
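The semantics of the three primitives can be mocked without a running backend. The class below is a self-contained toy, not SGLang's API: it only shows that select() is a scored choice, that fork() duplicates a state whose shared prefix the runtime could serve from cache, and that generated values are captured by name:

```python
import copy

class State:
    """Toy stand-in for an SGLang program state (illustrative only)."""

    def __init__(self, text=""):
        self.text = text
        self.vars = {}     # named captures, like s["answer"] in SGLang

    def gen(self, name, stub):
        """Free-form generation; `stub` plays the role of model output."""
        self.vars[name] = stub
        self.text += stub
        return self

    def select(self, name, choices, scores):
        """Constrained choice: pick the highest-scoring option."""
        best = choices[max(range(len(choices)), key=lambda i: scores[i])]
        self.vars[name] = best
        self.text += best
        return self

    def fork(self, n):
        """Parallel branches that all share this state's prefix."""
        return [copy.deepcopy(self) for _ in range(n)]

s = State("Q: Is SGLang fast?\nA: ")
s.select("verdict", ["Yes", "No"], scores=[0.9, 0.1])
branches = s.fork(2)  # branches share the prefix -> KV cache reuse
```

This is exactly the structure the backend exploits: a fork() makes the shared prefix explicit, so RadixAttention can serve all branches from one cached copy.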
Speculative decoding uses a smaller, faster draft model to predict multiple tokens ahead, then verifies them with the full model in a single forward pass. SGLang supports EAGLE-based speculative decoding, achieving 1.5-2x speedup on decode-heavy workloads.
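A greedy variant of the draft-then-verify loop can be sketched with stub models. This simplifies heavily (EAGLE drafts from the target's hidden states and verification compares distributions, not argmax tokens, in one batched pass), and every name below is illustrative:

```python
def speculative_step(draft_model, target_model, prefix, k):
    """One speculative step, greedy variant: draft k tokens cheaply,
    verify them against the target model, keep the longest agreeing
    run plus one token from the target (a correction or a bonus)."""
    ctx, draft = list(prefix), []
    for _ in range(k):                 # cheap autoregressive drafting
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)

    ctx, accepted = list(prefix), []
    for t in draft:                    # conceptually one target pass
        expected = target_model(ctx)
        if expected == t:
            accepted.append(t)         # draft token verified
            ctx.append(t)
        else:
            accepted.append(expected)  # target's correction; stop here
            break
    else:
        accepted.append(target_model(ctx))  # all accepted: bonus token
    return accepted

# Stubs: draft guesses last+1; target agrees except after token 3.
draft = lambda ctx: ctx[-1] + 1
target = lambda ctx: ctx[-1] + 1 if ctx[-1] != 3 else 10
print(speculative_step(draft, target, [1], k=3))  # -> [2, 3, 10]
```

The payoff is that one target forward pass can emit several tokens when the draft is accurate, which is why the speedup concentrates on decode-heavy workloads.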
Tensor parallelism splits a single model across multiple GPUs. SGLang uses NCCL for inter-GPU communication and supports TP, pipeline parallelism, and expert parallelism for MoE models.
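The core idea of sharding one layer can be shown with a toy matrix-vector product, sharded along the output dimension: each "rank" computes its slice locally, then the partial results are gathered. The function names and pure-Python "all-gather" are illustrative; in SGLang the gather/reduce steps are NCCL collectives across GPUs:

```python
def matvec(W, x):
    """Reference un-sharded linear layer: y = W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def tensor_parallel_matvec(W, x, tp_size):
    """Toy TP linear layer: shard W's output rows across tp_size
    ranks, compute each shard independently, then 'all-gather'
    (concatenate) the partial outputs into the full result."""
    n = len(W)
    shard = (n + tp_size - 1) // tp_size
    shards = [W[i * shard:(i + 1) * shard] for i in range(tp_size)]
    partials = [matvec(ws, x) for ws in shards]  # per-rank compute
    return [v for p in partials for v in p]      # all-gather step

W = [[1, 0], [0, 1], [1, 1], [2, 0]]
x = [3, 4]
print(tensor_parallel_matvec(W, x, tp_size=2))  # -> [3, 4, 7, 6]
```

Sharding the input dimension instead would require summing partial results (an all-reduce rather than an all-gather); real TP layers alternate the two patterns to minimize communication.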