Data-parallel request router distributing requests across worker instances with cache-aware routing.
Extension for accelerating video and image generation using diffusion models (Jan 2026).
Standalone package of optimized CUDA/Triton kernels: attention, MoE, quantization, and sampling.
JAX backend enabling SGLang to run natively on Google TPUs (Oct 2025).
Built-in tools for measuring throughput, latency, and time to first token (TTFT) under various workload patterns.
Default grammar backend for constrained decoding. High-performance JSON/regex/EBNF generation.
Primary attention kernel library. Optimized prefill and paged decode on NVIDIA GPUs.
Alternative grammar backend with regex-based generation and different grammar compilation approach.
Microsoft's grammar backend for complex grammar compositions and constrained decoding.
SGLang's OpenAI-compatible API enables drop-in integration. Point the framework's base_url at your SGLang server and benefit from RadixAttention prefix caching with zero code changes.
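A minimal sketch of calling the OpenAI-compatible /v1/chat/completions endpoint with only the standard library; the server address and model name are illustrative assumptions for your own deployment.

```python
# Hedged sketch: talk to an SGLang server through its OpenAI-compatible
# /v1/chat/completions endpoint. URL and model name below are assumptions.
import json
import urllib.request

def make_chat_payload(model, user_message, temperature=0.7):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
    }

def chat(base_url, model, user_message):
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(make_chat_payload(model, user_message)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage (requires a running server, assumed at port 30000):
# print(chat("http://localhost:30000", "<your-model>", "Hello!"))
```

Because the endpoint mirrors OpenAI's schema, any framework that accepts a custom base_url can be pointed at the same address without further changes.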
Deploy as a Kubernetes Deployment with GPU resource requests. Use /health for liveness/readiness probes. Official Docker images have all dependencies pre-installed.
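A hedged sketch of such a Deployment manifest; the image tag, model path, port, and probe delays are assumptions to adapt to your cluster.

```yaml
# Illustrative manifest, not an official template: image tag, model path,
# port, and probe timings are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sglang
spec:
  replicas: 1
  selector:
    matchLabels: {app: sglang}
  template:
    metadata:
      labels: {app: sglang}
    spec:
      containers:
      - name: sglang
        image: lmsysorg/sglang:latest   # official image, assumed tag
        args: ["python3", "-m", "sglang.launch_server",
               "--model-path", "meta-llama/Llama-3.1-8B-Instruct",
               "--host", "0.0.0.0", "--port", "30000"]
        ports:
        - containerPort: 30000
        resources:
          limits:
            nvidia.com/gpu: 1           # GPU resource request
        livenessProbe:
          httpGet: {path: /health, port: 30000}
          initialDelaySeconds: 120      # allow time for model load
        readinessProbe:
          httpGet: {path: /health, port: 30000}
          initialDelaySeconds: 120
```

The generous initial delay accounts for model weight loading; tune it to your model size and storage speed.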
Launch with --enable-metrics to expose Prometheus-compatible metrics. Build dashboards to monitor throughput, latency, cache hit rate, and memory usage in real time.
Serve multiple LoRA adapters on a single base model with dynamic per-request switching. Enables multi-tenant deployments with fine-tuned adapters per customer.
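A hedged sketch of per-request adapter selection through the native /generate endpoint. The adapter names, server address, and launch command shown in comments are illustrative assumptions; check your SGLang version's LoRA documentation for the exact flags.

```python
# Hedged sketch: select a LoRA adapter per request. Assumes the server
# was launched with adapters registered under names, e.g. (illustrative):
#   python -m sglang.launch_server --model-path <base-model> \
#       --lora-paths customer_a=<path_a> customer_b=<path_b>
import json
import urllib.request

def make_generate_payload(prompt, lora_name, max_new_tokens=64):
    """Build a /generate request routed to a specific LoRA adapter."""
    return {
        "text": prompt,
        "lora_path": lora_name,  # picks the adapter for this request
        "sampling_params": {"max_new_tokens": max_new_tokens},
    }

def generate(base_url, prompt, lora_name):
    """POST to /generate and return the completion text."""
    req = urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(make_generate_payload(prompt, lora_name)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["text"]

# Two tenants, one base model, different adapters (assumed names):
# generate("http://localhost:30000", "Summarize this ticket...", "customer_a")
# generate("http://localhost:30000", "Summarize this ticket...", "customer_b")
```

Routing by adapter name keeps one set of base weights in GPU memory while each tenant gets its own fine-tuned behavior.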