High-Level Design
TensorRT-LLM is organized into three main layers, each handling a different concern. The Python LLM API is the user-facing layer: it loads models from HuggingFace, manages tokenization, and exposes generate(). Underneath, it spawns one PyExecutor per GPU rank — a worker process that runs a continuous inference loop. At the bottom, the C++ Runtime handles GPU kernel execution, TensorRT engine inference, and multi-GPU communication via NCCL.
TensorRT-LLM System Architecture
- Python LLM API: LLM class, SamplingParams, model loading, tokenizer
- PyExecutor Worker Process: continuous inference loop, one per GPU rank
  - Scheduler: request admission & batching
  - KVCacheManager: paged memory allocation
  - ModelEngine: forward pass execution
  - Sampler: token selection
- C++ Runtime & TensorRT Engine: CUDA kernels, plugins, NCCL communication, executor
Key Components
Five core components drive TensorRT-LLM's inference pipeline. Each solves a specific problem that arises when serving LLMs at scale.
Scheduler
Decides which requests to admit and how to batch them. The scheduler is split into two sub-components: a Capacity Scheduler that checks whether resources exist for new requests (and can pause or evict if needed), and a Micro-Batch Scheduler that partitions admitted requests into context (prefill) and generation batches.
Why it exists
GPU resources (KV cache blocks, compute) are finite. Blindly admitting all requests would cause out-of-memory failures. The scheduler gates admission so the system operates within its memory budget at all times.
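The admission idea can be sketched in a few lines. This is a conceptual illustration only: the function names, the 64-token block size, and the request shape are assumptions, not TensorRT-LLM's actual classes or defaults. A request is admitted only if its prompt fits in the currently free KV cache blocks; otherwise it waits.

```python
def blocks_needed(prompt_len: int, block_size: int = 64) -> int:
    """KV cache blocks required to hold prompt_len tokens (ceiling division)."""
    return -(-prompt_len // block_size)

def admit_requests(pending, free_blocks, block_size=64):
    """Admit (request_id, prompt_len) pairs while free blocks remain.

    Requests that do not fit are deferred rather than rejected, mirroring
    the gate-at-admission idea described above.
    """
    admitted, deferred = [], []
    for req_id, prompt_len in pending:
        need = blocks_needed(prompt_len, block_size)
        if need <= free_blocks:
            free_blocks -= need
            admitted.append(req_id)
        else:
            deferred.append(req_id)
    return admitted, deferred, free_blocks
```

With 10 free blocks, a 100-token prompt (2 blocks) and a 10-token prompt (1 block) are admitted, while a 5000-token prompt (79 blocks) is deferred until memory frees up.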
KVCacheManager
Manages paged KV cache blocks as a memory pool. Allocates blocks on demand and recycles them when sequences finish, similar to virtual memory in an OS. Supports block reuse for prefix caching and KV cache quantization (INT8/FP8) for further memory savings.
Why it exists
Naive contiguous KV cache allocation wastes memory proportional to max_seq_len × batch_size even when most sequences are short. The paged approach reduces memory waste to near zero, enabling 2-3x more concurrent requests on the same hardware.
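The arithmetic behind that claim can be made concrete. The numbers below are illustrative assumptions (not measurements): eight sequences, a 4096-token max_seq_len, a 64-token block size, and 1 KiB of KV state per token.

```python
def contiguous_kv_bytes(batch_size, max_seq_len, bytes_per_token_kv):
    """Contiguous allocation reserves max_seq_len slots per sequence up front."""
    return batch_size * max_seq_len * bytes_per_token_kv

def paged_kv_bytes(actual_lens, block_size, bytes_per_token_kv):
    """Paged allocation holds only ceil(len / block_size) blocks per sequence."""
    blocks = sum(-(-n // block_size) for n in actual_lens)
    return blocks * block_size * bytes_per_token_kv

# Illustrative batch of eight sequences with widely varying lengths.
lens = [300, 512, 700, 128, 2048, 64, 900, 400]
cont = contiguous_kv_bytes(len(lens), 4096, 1024)   # 32 MiB reserved
paged = paged_kv_bytes(lens, 64, 1024)              # ~5 MiB actually held
```

Under these assumed numbers, contiguous allocation reserves 32 MiB while the sequences only ever use about 5 MiB of it; paging keeps the footprint close to what is actually occupied, which is where the headroom for more concurrent requests comes from.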
ModelEngine
Wraps the compiled TensorRT engine or PyTorch model and executes forward passes. This is an abstraction layer that insulates the rest of the system from whether inference runs through TensorRT's compiled engine path or PyTorch's eager execution path (the two supported backends).
Why it exists
Two execution backends exist with very different performance characteristics. The abstraction lets the scheduler, KV cache manager, and sampler work identically regardless of which backend runs the forward pass.
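The shape of that abstraction can be sketched with a structural interface. The names here (ForwardBackend, EagerBackend, CompiledBackend, run_step) are hypothetical stand-ins, not TensorRT-LLM's real classes; the point is only that callers depend on the interface, never on the backend.

```python
from typing import Protocol, Sequence

class ForwardBackend(Protocol):
    def forward(self, token_ids: Sequence[int]) -> Sequence[float]:
        """Return next-token logits for a batch of token ids."""
        ...

class EagerBackend:
    """Stand-in for the PyTorch eager execution path."""
    def forward(self, token_ids):
        # A real backend would run the model; this toy returns dummy logits.
        return [float(t % 7) for t in token_ids]

class CompiledBackend:
    """Stand-in for the compiled TensorRT engine path."""
    def forward(self, token_ids):
        return [float(t % 7) for t in token_ids]

def run_step(backend: ForwardBackend, token_ids):
    """Scheduler/sampler code calls this; it never inspects the backend type."""
    return backend.forward(token_ids)
```

Because both backends satisfy the same protocol, everything downstream of the forward pass is written once.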
Sampler
Processes raw logits from the model and selects the next token. Supports greedy, top-k, top-p, beam search, and speculative decoding acceptance/rejection. The C++ sampler has been the default since v1.1.
Why it exists
Token selection logic must run at the speed of the inference loop (millisecond cadence). The C++ implementation avoids Python interpreter overhead and supports complex decoding strategies like speculative draft-verify without GIL contention.
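Two of the selection strategies mentioned above are easy to sketch in pure Python (the real implementation is C++ and vectorized; these helper names are illustrative, not the library's API). Greedy decoding takes the argmax; top-k filtering masks everything outside the k largest logits before normalizing.

```python
import math

def greedy(logits):
    """Index of the highest-logit (highest-probability) token."""
    return max(range(len(logits)), key=logits.__getitem__)

def top_k_filter(logits, k):
    """Mask (set to -inf) every logit below the k-th largest.

    Ties at the threshold may keep slightly more than k entries;
    fine for a sketch.
    """
    kth = sorted(logits, reverse=True)[k - 1]
    return [l if l >= kth else -math.inf for l in logits]

def softmax(logits):
    """Turn (possibly masked) logits into a probability distribution."""
    m = max(l for l in logits if l != -math.inf)
    exps = [math.exp(l - m) if l != -math.inf else 0.0 for l in logits]
    s = sum(exps)
    return [e / s for e in exps]
```

Sampling from the filtered distribution instead of taking the argmax gives top-k sampling; top-p works the same way but masks by cumulative probability rather than rank.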
Executor
The orchestration layer that ties everything together. Manages the request lifecycle: submit, schedule, execute, sample, respond. Comes in several flavors: GenerationExecutorWorker (single-process), GenerationExecutorProxy (multi-process IPC with MPI), and RayExecutor (distributed via Ray).
Why it exists
Different deployment scenarios need different process topologies. Single-GPU inference, multi-GPU tensor parallelism (via MPI), and distributed cluster inference (via Ray) all require different IPC patterns, but the request lifecycle remains the same.
Data Flow
A single inference request flows through the system in nine steps, from user submission to response delivery. The loop repeats continuously, admitting new requests into freed slots.
Request Lifecycle
1
Request Submission
User calls llm.generate("prompt"). The LLM API tokenizes the input and submits a GenerationRequest to the Executor.
2
Capacity Check
The Capacity Scheduler checks whether enough KV cache blocks exist for the new request. If yes, the request is admitted. If memory is tight, lower-priority requests may be paused to make room.
3
Micro-Batch Formation
The Micro-Batch Scheduler partitions all active requests into two groups: context requests (prefill, compute-bound — processes many tokens at once) and generation requests (produces one token per step, memory-bound).
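The partition in this step can be sketched as a filter with a prefill-token budget. The function name, the dict shape, and the 8192-token budget are assumptions for illustration, not TensorRT-LLM internals.

```python
def form_micro_batches(requests, context_token_budget=8192):
    """Split active requests into context (prefill) and generation batches.

    Prefill is compute-bound, so the total prompt tokens scheduled per
    iteration are capped; over-budget requests simply wait for a later
    iteration. Generation requests contribute one token each.
    """
    context, generation = [], []
    budget = context_token_budget
    for r in requests:
        if r["prefill_done"]:
            generation.append(r)            # memory-bound: one token per step
        elif r["prompt_len"] <= budget:
            context.append(r)               # compute-bound: many tokens at once
            budget -= r["prompt_len"]
    return context, generation
```

With a 5000-token and a 4000-token prompt pending under an 8192-token budget, only the first fits this iteration; the second is deferred, keeping prefill cost per step bounded.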
4
KV Cache Allocation
The KVCacheManager allocates new blocks for context requests and extends existing allocations for generation requests, writing each newly generated token into the next free slot and grabbing a fresh block when the current one fills.
5
Forward Pass
The ModelEngine runs the TensorRT engine (or PyTorch model) on the combined batch. Plugins handle attention (FlashAttention with paged KV cache), GEMM (FP8/FP4 kernels), and multi-GPU communication (NCCL all-reduce).
6
Token Sampling
The Sampler processes output logits. For greedy decoding, it picks the highest-probability token. For speculative decoding, it validates draft tokens against the target model's distribution.
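The speculative-decoding validation in this step can be sketched with the simplest acceptance rule, greedy matching: keep draft tokens while they agree with the target model's argmax at the same position, and replace the first mismatch with the target's own token. (Probabilistic acceptance against the full target distribution is the more general scheme; this function is an illustrative simplification, not the library's API.)

```python
def verify_draft_greedy(draft_tokens, target_argmax_tokens):
    """Accept draft tokens up to the first disagreement with the target.

    On a mismatch, the target's token is emitted in place of the draft's
    and the remaining drafts are discarded, so the output is always
    exactly what the target model would have produced on its own.
    """
    accepted = []
    for draft, target in zip(draft_tokens, target_argmax_tokens):
        if draft == target:
            accepted.append(draft)
        else:
            accepted.append(target)
            break
    return accepted
```

The win is that every accepted draft token cost only one (cheap) draft-model step plus a share of a single batched target forward pass, instead of a full sequential target step.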
7
Context-to-Generation Transition
Requests that finish prefill move from the context batch to the generation batch. Their KV cache blocks are retained for subsequent generation steps.
8
Response Delivery
Completed sequences (hit EOS or max length) have their KV cache blocks freed and responses returned to the user. Streaming mode delivers tokens incrementally.
9
Loop Continues
The executor returns to step 2 for the next iteration, potentially admitting new requests into the now-freed slots.
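The nine steps above can be tied together in a toy loop. Everything here is a deliberate simplification: a single worker, no capacity limit, greedy decoding, and a stand-in "model" whose next token is just the previous token plus one; none of the names are TensorRT-LLM's.

```python
def inference_loop(pending, max_new_tokens=3):
    """Toy continuous-batching loop mirroring steps 2-9 of the lifecycle."""
    active, finished = [], {}
    while pending or active:
        # Steps 2-3: admit pending requests (no capacity gate in this toy)
        # and treat every active request as one batch.
        while pending:
            req_id, prompt = pending.pop(0)
            active.append({"id": req_id, "tokens": list(prompt), "generated": 0})
        # Steps 4-6: "forward pass" plus greedy sampling; the toy model
        # emits last_token + 1 as the next token.
        for r in active:
            r["tokens"].append(r["tokens"][-1] + 1)
            r["generated"] += 1
        # Steps 7-8: retire finished sequences and return their output.
        still_active = []
        for r in active:
            if r["generated"] >= max_new_tokens:
                finished[r["id"]] = r["tokens"]
            else:
                still_active.append(r)
        active = still_active  # Step 9: loop again with freed slots.
    return finished
```

New requests appended to `pending` between iterations would be admitted into the freed slots, which is exactly the continuous-batching behavior the lifecycle describes.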
Design Decisions
💡
PyTorch as the Default Backend (v1.0+) — Starting with v1.0, TensorRT-LLM switched from a custom graph-building API to PyTorch as the primary model definition and execution framework. This dramatically simplified model onboarding (new architectures can reuse HuggingFace code patterns) at the cost of slightly less optimization control. The older TensorRT-native path still exists but is no longer the recommended workflow.
💡
Paged KV Cache over Contiguous Allocation — The team adopted the PagedAttention approach (originally from vLLM) because contiguous KV cache allocation wastes 60–80% of GPU memory in typical production workloads where request lengths vary widely. Paged allocation reduces memory waste to near zero, enabling 2–3x more concurrent requests.
💡
C++ Core with Python Orchestration — Performance-critical paths (kernel execution, KV cache management, scheduling) are implemented in C++ for minimal overhead, while model definition and user-facing APIs use Python for developer ergonomics. Python's GIL and interpreter overhead would bottleneck the per-token scheduling loop that runs at millisecond cadence.
💡
Plugin Architecture for Specialized Kernels — Rather than relying solely on TensorRT's auto-optimization, TensorRT-LLM registers custom plugins for attention (FlashAttention), GEMM (FP8, FP4), and communication (NCCL). LLM-specific operations (multi-head attention with paged KV cache, quantized matrix multiplication) require hand-tuned kernels that TensorRT's general-purpose pattern matching cannot discover automatically.