LLM Serving & Inference

SGLang

SGLang is a high-performance serving framework for large language models that automatically reuses computation across requests through RadixAttention — delivering up to 5x higher throughput on structured LLM workloads.
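Conceptually, RadixAttention keeps each request's KV cache entries keyed by their token prefix in a radix tree, so a new request only pays prefill compute for the tokens past its longest cached prefix. The following is a deliberately simplified sketch of that prefix-matching idea in plain Python (a flat prefix set standing in for SGLang's actual radix tree, with illustrative token tuples):

```python
def longest_cached_prefix(cache: set, tokens: tuple) -> int:
    """Length of the longest cached prefix of `tokens`.

    `cache` holds token-prefix tuples whose KV entries are resident;
    a real radix tree finds this match in a single traversal.
    """
    for i in range(len(tokens), 0, -1):
        if tokens[:i] in cache:
            return i
    return 0

def serve(cache: set, tokens: tuple) -> int:
    """'Run' a request: reuse the cached prefix, compute the rest.

    Returns how many tokens needed fresh prefill compute.
    """
    reused = longest_cached_prefix(cache, tokens)
    # Record every prefix of this request so later requests can share it.
    for i in range(1, len(tokens) + 1):
        cache.add(tokens[:i])
    return len(tokens) - reused

cache: set = set()
shared_system_prompt = (1, 2, 3, 4)            # hypothetical token IDs
first = serve(cache, shared_system_prompt + (5, 6))   # cold: prefills all 6
second = serve(cache, shared_system_prompt + (7, 8))  # reuses 4, prefills 2
```

Two chat requests sharing a system prompt therefore prefill the shared tokens only once, which is where the cross-request throughput gains come from.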

- License: Apache-2.0
- Language: Python / C++ / CUDA
- Latest version: v0.5.9
- GPU deployments: 400,000+
- 💡 Core Concepts (Beginner): Key abstractions — RadixAttention, continuous batching, chunked prefill, constrained decoding, and the frontend DSL.
- 🏗 Architecture (Intermediate): System design — HTTP server, TokenizerManager, Scheduler, RadixCache, ModelRunner, and data flow.
- ⚙️ How It Works (Intermediate): Internal mechanisms — radix tree KV cache, scheduling loop, grammar backends, speculative decoding, and performance.
- 💻 Implementation Details (Advanced): Hands-on — getting started, configuration, code patterns, and an annotated source code walkthrough.
- 🚀 Use Cases (Beginner – Intermediate): When to use SGLang, when not to, and real-world deployments at xAI, Cursor, and LinkedIn.
- 🔌 Ecosystem & Integrations (Intermediate): Tools, backends, and integration patterns — XGrammar, FlashInfer, LangChain, Kubernetes, and more.
- FAQ (All Levels): Senior-level questions about KV cache eviction, scheduling fallback, throughput tuning, and production debugging.
- ⚖️ Trade-offs & Limitations (Intermediate): Honest strengths, limitations, and comparisons with vLLM, TensorRT-LLM, and llama.cpp.

Quick Start

Install SGLang and launch a server with any Hugging Face model in minutes.

```bash
# Install SGLang
pip install "sglang[all]"

# Launch a server
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000

# Query with curl
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"Hello!"}]}'
```
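The server exposes an OpenAI-compatible API, so the same request can be made from Python. A minimal sketch using only the standard library (model name and port match the launch command above; adjust them to your deployment):

```python
import json
import urllib.request

def build_chat_request(model: str, user_message: str) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

def chat(base_url: str, payload: dict) -> dict:
    """POST the payload to the server's OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("meta-llama/Llama-3.1-8B-Instruct", "Hello!")
# chat("http://localhost:30000", payload)  # requires a running server
```

Any OpenAI SDK client pointed at `http://localhost:30000/v1` works the same way.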

Dive into Core Concepts to understand RadixAttention and continuous batching, or jump to Implementation Details for code patterns and source code walkthrough.