High-Level Design
TensorRT-LLM is organized into three main layers, each handling a different concern. The Python LLM API is the user-facing layer: it loads models from HuggingFace, manages tokenization, and exposes generate(). Underneath, it spawns one PyExecutor per GPU rank — a worker process that runs a continuous inference loop. At the bottom, the C++ Runtime handles GPU kernel execution, TensorRT engine inference, and multi-GPU communication via NCCL.
TensorRT-LLM System Architecture
- Python LLM API: LLM class, SamplingParams, model loading, tokenizer
- PyExecutor Worker Process: continuous inference loop, one per GPU rank
  - Scheduler: request admission & batching
  - KVCacheManager: paged memory allocation
  - ModelEngine: forward pass execution
  - Sampler: token selection
- C++ Runtime & TensorRT Engine: CUDA kernels, plugins, NCCL communication, executor
Key Components
Five core components drive TensorRT-LLM's inference pipeline. Each solves a specific problem that arises when serving LLMs at scale.
Scheduler
Decides which requests to admit and how to batch them. The scheduler is split into two sub-components: a Capacity Scheduler that checks whether resources exist for new requests (and can pause or evict if needed), and a Micro-Batch Scheduler that partitions admitted requests into context (prefill) and generation batches.
Why it exists
GPU resources (KV cache blocks, compute) are finite. Blindly admitting all requests would cause out-of-memory failures. The scheduler gates admission so the system operates within its memory budget at all times.
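The admission idea can be sketched in a few lines. This is a conceptual illustration only: the function names, the 64-token block size, and the request shape are assumptions, not TensorRT-LLM's actual classes or defaults. A request is admitted only if its prompt fits in the currently free KV cache blocks; otherwise it waits.

```python
def blocks_needed(prompt_len: int, block_size: int = 64) -> int:
    """KV cache blocks required to hold prompt_len tokens (ceiling division)."""
    return -(-prompt_len // block_size)

def admit_requests(pending, free_blocks, block_size=64):
    """Admit (request_id, prompt_len) pairs while free blocks remain.

    Requests that do not fit are deferred rather than rejected, mirroring
    the gate-at-admission idea described above.
    """
    admitted, deferred = [], []
    for req_id, prompt_len in pending:
        need = blocks_needed(prompt_len, block_size)
        if need <= free_blocks:
            free_blocks -= need
            admitted.append(req_id)
        else:
            deferred.append(req_id)
    return admitted, deferred, free_blocks
```

With 10 free blocks, a 100-token prompt (2 blocks) and a 10-token prompt (1 block) are admitted, while a 5000-token prompt (79 blocks) is deferred until memory frees up.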
KVCacheManager
Manages paged KV cache blocks as a memory pool. Allocates blocks on demand and recycles them when sequences finish, similar to virtual memory in an OS. Supports block reuse for prefix caching and KV cache quantization (INT8/FP8) for further memory savings.
Why it exists
Naive contiguous KV cache allocation wastes memory proportional to max_seq_len × batch_size even when most sequences are short. The paged approach reduces memory waste to near zero, enabling 2-3x more concurrent requests on the same hardware.
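The arithmetic behind that claim can be made concrete. The numbers below are illustrative assumptions (not measurements): eight sequences, a 4096-token max_seq_len, a 64-token block size, and 1 KiB of KV state per token.

```python
def contiguous_kv_bytes(batch_size, max_seq_len, bytes_per_token_kv):
    """Contiguous allocation reserves max_seq_len slots per sequence up front."""
    return batch_size * max_seq_len * bytes_per_token_kv

def paged_kv_bytes(actual_lens, block_size, bytes_per_token_kv):
    """Paged allocation holds only ceil(len / block_size) blocks per sequence."""
    blocks = sum(-(-n // block_size) for n in actual_lens)
    return blocks * block_size * bytes_per_token_kv

# Illustrative batch of eight sequences with widely varying lengths.
lens = [300, 512, 700, 128, 2048, 64, 900, 400]
cont = contiguous_kv_bytes(len(lens), 4096, 1024)   # 32 MiB reserved
paged = paged_kv_bytes(lens, 64, 1024)              # ~5 MiB actually held
```

Under these assumed numbers, contiguous allocation reserves 32 MiB while the sequences only ever use about 5 MiB of it; paging keeps the footprint close to what is actually occupied, which is where the headroom for more concurrent requests comes from.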
ModelEngine
Wraps the compiled TensorRT engine or PyTorch model and executes forward passes. This is an abstraction layer that insulates the rest of the system from whether inference runs through TensorRT's compiled engine path or PyTorch's eager execution path (the two supported backends).
Why it exists
Two execution backends exist with very different performance characteristics. The abstraction lets the scheduler, KV cache manager, and sampler work identically regardless of which backend runs the forward pass.
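The shape of that abstraction can be sketched with a structural interface. The names here (ForwardBackend, EagerBackend, CompiledBackend, run_step) are hypothetical stand-ins, not TensorRT-LLM's real classes; the point is only that callers depend on the interface, never on the backend.

```python
from typing import Protocol, Sequence

class ForwardBackend(Protocol):
    def forward(self, token_ids: Sequence[int]) -> Sequence[float]:
        """Return next-token logits for a batch of token ids."""
        ...

class EagerBackend:
    """Stand-in for the PyTorch eager execution path."""
    def forward(self, token_ids):
        # A real backend would run the model; this toy returns dummy logits.
        return [float(t % 7) for t in token_ids]

class CompiledBackend:
    """Stand-in for the compiled TensorRT engine path."""
    def forward(self, token_ids):
        return [float(t % 7) for t in token_ids]

def run_step(backend: ForwardBackend, token_ids):
    """Scheduler/sampler code calls this; it never inspects the backend type."""
    return backend.forward(token_ids)
```

Because both backends satisfy the same protocol, everything downstream of the forward pass is written once.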
Sampler
Processes raw logits from the model and selects the next token. Supports greedy, top-k, top-p, beam search, and speculative decoding acceptance/rejection. The C++ sampler has been the default since v1.1.
Why it exists
Token selection logic must run at the speed of the inference loop (millisecond cadence). The C++ implementation avoids Python interpreter overhead and supports complex decoding strategies like speculative draft-verify without GIL contention.
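Two of the selection strategies mentioned above are easy to sketch in pure Python (the real implementation is C++ and vectorized; these helper names are illustrative, not the library's API). Greedy decoding takes the argmax; top-k filtering masks everything outside the k largest logits before normalizing.

```python
import math

def greedy(logits):
    """Index of the highest-logit (highest-probability) token."""
    return max(range(len(logits)), key=logits.__getitem__)

def top_k_filter(logits, k):
    """Mask (set to -inf) every logit below the k-th largest.

    Ties at the threshold may keep slightly more than k entries;
    fine for a sketch.
    """
    kth = sorted(logits, reverse=True)[k - 1]
    return [l if l >= kth else -math.inf for l in logits]

def softmax(logits):
    """Turn (possibly masked) logits into a probability distribution."""
    m = max(l for l in logits if l != -math.inf)
    exps = [math.exp(l - m) if l != -math.inf else 0.0 for l in logits]
    s = sum(exps)
    return [e / s for e in exps]
```

Sampling from the filtered distribution instead of taking the argmax gives top-k sampling; top-p works the same way but masks by cumulative probability rather than rank.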
Executor
The orchestration layer that ties everything together. Manages the request lifecycle: submit, schedule, execute, sample, respond. Comes in several flavors: GenerationExecutorWorker (single-process), GenerationExecutorProxy (multi-process IPC with MPI), and RayExecutor (distributed via Ray).
Why it exists
Different deployment scenarios need different process topologies. Single-GPU inference, multi-GPU tensor parallelism (via MPI), and distributed cluster inference (via Ray) all require different IPC patterns, but the request lifecycle remains the same.
Data Flow
A single inference request flows through the system in nine steps, from user submission to response delivery. The loop repeats continuously, admitting new requests into freed slots.
Request Lifecycle
1
Request Submission
User calls llm.generate("prompt"). The LLM API tokenizes the input and submits a GenerationRequest to the Executor.
2
Capacity Check
The Capacity Scheduler checks whether enough KV cache blocks exist for the new request. If yes, the request is admitted. If memory is tight, lower-priority requests may be paused to make room.
3
Micro-Batch Formation
The Micro-Batch Scheduler partitions all active requests into two groups: context requests (prefill, compute-bound — processes many tokens at once) and generation requests (produces one token per step, memory-bound).
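The partition in this step can be sketched as a filter with a prefill-token budget. The function name, the dict shape, and the 8192-token budget are assumptions for illustration, not TensorRT-LLM internals.

```python
def form_micro_batches(requests, context_token_budget=8192):
    """Split active requests into context (prefill) and generation batches.

    Prefill is compute-bound, so the total prompt tokens scheduled per
    iteration are capped; over-budget requests simply wait for a later
    iteration. Generation requests contribute one token each.
    """
    context, generation = [], []
    budget = context_token_budget
    for r in requests:
        if r["prefill_done"]:
            generation.append(r)            # memory-bound: one token per step
        elif r["prompt_len"] <= budget:
            context.append(r)               # compute-bound: many tokens at once
            budget -= r["prompt_len"]
    return context, generation
```

With a 5000-token and a 4000-token prompt pending under an 8192-token budget, only the first fits this iteration; the second is deferred, keeping prefill cost per step bounded.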
4
KV Cache Allocation
The KVCacheManager allocates new blocks for context requests and extends existing allocations for generation requests, writing each newly generated token into the next free slot and grabbing a fresh block when the current one fills.
5
Forward Pass
The ModelEngine runs the TensorRT engine (or PyTorch model) on the combined batch. Plugins handle attention (FlashAttention with paged KV cache), GEMM (FP8/FP4 kernels), and multi-GPU communication (NCCL all-reduce).
6
Token Sampling
The Sampler processes output logits. For greedy decoding, it picks the highest-probability token. For speculative decoding, it validates draft tokens against the target model's distribution.
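The speculative-decoding validation in this step can be sketched with the simplest acceptance rule, greedy matching: keep draft tokens while they agree with the target model's argmax at the same position, and replace the first mismatch with the target's own token. (Probabilistic acceptance against the full target distribution is the more general scheme; this function is an illustrative simplification, not the library's API.)

```python
def verify_draft_greedy(draft_tokens, target_argmax_tokens):
    """Accept draft tokens up to the first disagreement with the target.

    On a mismatch, the target's token is emitted in place of the draft's
    and the remaining drafts are discarded, so the output is always
    exactly what the target model would have produced on its own.
    """
    accepted = []
    for draft, target in zip(draft_tokens, target_argmax_tokens):
        if draft == target:
            accepted.append(draft)
        else:
            accepted.append(target)
            break
    return accepted
```

The win is that every accepted draft token cost only one (cheap) draft-model step plus a share of a single batched target forward pass, instead of a full sequential target step.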
7
Context-to-Generation Transition
Requests that finish prefill move from the context batch to the generation batch. Their KV cache blocks are retained for subsequent generation steps.
8
Response Delivery
Completed sequences (hit EOS or max length) have their KV cache blocks freed and responses returned to the user. Streaming mode delivers tokens incrementally.
9
Loop Continues
The executor returns to step 2 for the next iteration, potentially admitting new requests into the now-freed slots.
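The nine steps above can be tied together in a toy loop. Everything here is a deliberate simplification: a single worker, no capacity limit, greedy decoding, and a stand-in "model" whose next token is just the previous token plus one; none of the names are TensorRT-LLM's.

```python
def inference_loop(pending, max_new_tokens=3):
    """Toy continuous-batching loop mirroring steps 2-9 of the lifecycle."""
    active, finished = [], {}
    while pending or active:
        # Steps 2-3: admit pending requests (no capacity gate in this toy)
        # and treat every active request as one batch.
        while pending:
            req_id, prompt = pending.pop(0)
            active.append({"id": req_id, "tokens": list(prompt), "generated": 0})
        # Steps 4-6: "forward pass" plus greedy sampling; the toy model
        # emits last_token + 1 as the next token.
        for r in active:
            r["tokens"].append(r["tokens"][-1] + 1)
            r["generated"] += 1
        # Steps 7-8: retire finished sequences and return their output.
        still_active = []
        for r in active:
            if r["generated"] >= max_new_tokens:
                finished[r["id"]] = r["tokens"]
            else:
                still_active.append(r)
        active = still_active  # Step 9: loop again with freed slots.
    return finished
```

New requests appended to `pending` between iterations would be admitted into the freed slots, which is exactly the continuous-batching behavior the lifecycle describes.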
Design Decisions
💡
PyTorch as the Default Backend (v1.0+) — Starting with v1.0, TensorRT-LLM switched from a custom graph-building API to PyTorch as the primary model definition and execution framework. This dramatically simplified model onboarding (new architectures can reuse HuggingFace code patterns) at the cost of slightly less optimization control. The older TensorRT-native path still exists but is no longer the recommended workflow.
💡
Paged KV Cache over Contiguous Allocation — The team adopted the PagedAttention approach (originally from vLLM) because contiguous KV cache allocation wastes 60–80% of GPU memory in typical production workloads where request lengths vary widely. Paged allocation reduces memory waste to near zero, enabling 2–3x more concurrent requests.
💡
C++ Core with Python Orchestration — Performance-critical paths (kernel execution, KV cache management, scheduling) are implemented in C++ for minimal overhead, while model definition and user-facing APIs use Python for developer ergonomics. Python's GIL and interpreter overhead would bottleneck the per-token scheduling loop that runs at millisecond cadence.
💡
Plugin Architecture for Specialized Kernels — Rather than relying solely on TensorRT's auto-optimization, TensorRT-LLM registers custom plugins for attention (FlashAttention), GEMM (FP8, FP4), and communication (NCCL). LLM-specific operations (multi-head attention with paged KV cache, quantized matrix multiplication) require hand-tuned kernels that TensorRT's general-purpose pattern matching cannot discover automatically.