The compilation pipeline transforms a HuggingFace model into an optimized GPU executable through five stages. Each stage narrows the representation from a general-purpose model definition down to a GPU-architecture-specific binary tuned for maximum throughput.
Model-to-Engine Pipeline
1. Model Loading
Weights loaded from a HuggingFace checkpoint or local directory. LLaMAConfig.from_hugging_face() maps HuggingFace config fields to TensorRT-LLM's internal representation.
2. Network Definition
Forward pass traced into a TensorRT INetworkDefinition graph. Each layer (attention, MLP, normalization) becomes a set of TensorRT operations. Custom plugins injected where hand-tuned kernels outperform auto-optimization.
3. Optimization Pass
optimize_model_with_config() applies pre-build transformations: quantization (replacing linear layers with FP8/INT4 variants), plugin injection, and layer fusion planning.
4. TensorRT Compilation
Builder.build_engine() invokes TensorRT's compiler: evaluates available CUDA kernel implementations, identifies fusible operation sequences (e.g., LayerNorm + quantize), selects optimal kernels via auto-tuning. This step takes ~28 minutes for a 70B model.
5. Serialization
Engine binary, config JSON, and managed weights written to disk. Subsequent loads skip compilation entirely — deserialization takes ~90 seconds.
⚠️ Engine is GPU-architecture-specific: an engine compiled on H100 cannot run on A100 or B200. This is the fundamental trade-off: maximum per-GPU optimization at the cost of portability.
Attention Mechanisms
TensorRT-LLM implements several attention strategies that activate based on the inference phase and hardware capabilities. The context (prefill) phase processes all input tokens at once, while the generation (decode) phase produces tokens one at a time against the KV cache.
⚡Context Phase (Prefill)
Processes all input tokens at once
With FMHA (default)
Runs FlashAttention / FlashAttention-2 in a single kernel pass. On Hopper GPUs, FP8 context FMHA further reduces compute for the Q x K and attention x V products.
Without FMHA (fallback)
Falls back to a sequence of unfused GPU kernels with quadratic memory overhead. Viable for short sequences but prohibitive at scale.
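The cost of the non-FMHA fallback is easy to quantify: it materializes the full score matrix, which grows quadratically with sequence length, while FlashAttention never writes that matrix to HBM. A rough sketch, assuming FP16 scores:

```python
def naive_score_matrix_bytes(seq_len: int, num_heads: int,
                             bytes_per_score: int = 2) -> int:
    # The unfused fallback materializes a seq_len x seq_len score matrix
    # per attention head; FlashAttention streams it through SRAM instead.
    return num_heads * seq_len * seq_len * bytes_per_score

# 4K context, 64 heads: ~2 GiB of intermediate scores.
# 128K context, 64 heads: ~2 TiB, which is why the fallback is
# prohibitive at scale.
```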
🔄Generation Phase (Decode)
Produces tokens one at a time
Masked MHA Kernel
Reads one new query token against the KV cache. Applies QKV bias, RoPE position embeddings, and KV cache dequantization (INT8/FP8) on the fly.
Multi-Block Mode
When batch_size x num_heads is less than the GPU's SM count, distributes work across multiple CUDA thread-blocks per attention head to keep all SMs busy.
XQA (Cross-Query Attention, default)
Specialized kernel for MQA/GQA, exploiting the shared KV head structure for faster decode.
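The multi-block trigger can be sketched as simple occupancy arithmetic. This is a simplification of the actual kernel heuristic, and the H100 SM count of 132 used in the example is just illustrative:

```python
def multi_block_splits(batch_size: int, num_heads: int, sm_count: int) -> int:
    # One CUDA thread-block per (sequence, head) pair fills the GPU only
    # when batch_size * num_heads >= sm_count; otherwise split each head's
    # KV-cache scan across enough blocks to cover the idle SMs.
    work_items = batch_size * num_heads
    if work_items >= sm_count:
        return 1
    return -(-sm_count // work_items)  # ceil(sm_count / work_items)
```

At batch 8 with 32 heads the GPU is already saturated; at batch 1 the same model leaves most SMs idle unless each head's work is split.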
Memory Management
TensorRT-LLM's memory system is built around the paged KV cache, inspired by virtual memory systems. Blocks are pre-allocated at startup and dynamically assigned as requests arrive, avoiding fragmentation and enabling efficient sharing.
🧱 Block Pool
Pre-allocated at startup by KVCacheManager. Configurable block sizes (8/16/32/64/128 tokens). Pool shape: [num_blocks, num_layers, 2, num_heads, block_size, head_dim].
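Pool sizing follows directly from that shape. A sketch with illustrative Llama-70B-style numbers (80 layers, 8 KV heads under GQA, head_dim 128):

```python
def kv_pool_num_blocks(budget_bytes: int, num_layers: int, num_kv_heads: int,
                       block_size: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    # Per the pool shape [num_blocks, num_layers, 2, num_heads,
    # block_size, head_dim]; the 2 holds K and V.
    block_bytes = (num_layers * 2 * num_kv_heads
                   * block_size * head_dim * bytes_per_elem)
    return budget_bytes // block_bytes
```

With FP8 KV cache and 64-token blocks, each block costs 10 MiB, so a 10 GiB budget yields 1024 blocks (65,536 cached tokens).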
🔄 Dynamic Allocation
Blocks allocated from the pool on demand as new requests arrive. New blocks added one at a time during generation. Freed back to the pool on request completion.
🔗 Block Reuse (Prefix Caching)
Shared prefixes (system prompts, conversation history) reused across requests via storeContextBlocks() / findNewContextBlock(). Avoids recomputing KV cache for identical prefix tokens.
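One common way to implement this lookup is to chain-hash fixed-size token blocks so a block matches only when its entire prefix also matches. This is a sketch of the idea, not the actual storeContextBlocks() / findNewContextBlock() internals:

```python
import hashlib

def block_hash(tokens: list, parent_hash: str = "") -> str:
    # Chaining the parent hash makes a block's identity depend on the
    # whole prefix, not just its own tokens.
    data = parent_hash + ":" + ",".join(map(str, tokens))
    return hashlib.sha256(data.encode()).hexdigest()

def reusable_blocks(prompt_tokens: list, block_size: int, cache: set) -> int:
    # Count leading full blocks whose KV values are already cached.
    reused, parent = 0, ""
    full = len(prompt_tokens) - len(prompt_tokens) % block_size
    for i in range(0, full, block_size):
        h = block_hash(prompt_tokens[i:i + block_size], parent)
        if h not in cache:
            break
        reused, parent = reused + 1, h
    return reused
```

A request that starts with a cached system prompt skips KV computation for those blocks entirely; a request with a different prefix reuses nothing.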
📦 KV Cache Quantization
INT8 or FP8 storage instead of FP16, cutting memory by 2–4x. On Blackwell GPUs, NVFP4 KV cache provides 50% reduction versus FP8.
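The per-token savings follow directly from bits per element. Llama-70B-style GQA shapes (80 layers, 8 KV heads, head_dim 128) are assumed here for illustration:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bits_per_elem: int) -> int:
    # Each token stores K and V per layer: 2 * num_kv_heads * head_dim
    # elements, at the chosen storage width.
    return num_layers * 2 * num_kv_heads * head_dim * bits_per_elem // 8

fp16  = kv_bytes_per_token(80, 8, 128, 16)  # 320 KiB per token
fp8   = kv_bytes_per_token(80, 8, 128, 8)   # 160 KiB per token
nvfp4 = kv_bytes_per_token(80, 8, 128, 4)   # 80 KiB per token
```

Each halving of the element width doubles how many tokens fit in the same pool, which is the 50% NVFP4-versus-FP8 reduction cited above.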
Speculative Decoding
TensorRT-LLM supports six speculative decoding methods that trade extra compute for fewer autoregressive steps. The core mechanism: generate multiple candidate tokens cheaply, then validate them against the full model in a single forward pass. Accepted tokens skip autoregressive steps; rejected tokens trigger a rewind via KVCacheManager.rewindKVCache().
Draft-Target
A small draft model (e.g., LLaMA-68M) generates K candidate tokens autoregressively. The large target model (e.g., LLaMA-70B) validates all K candidates in a single forward pass. Accepted tokens are kept; the first rejected token triggers a fallback to the target model's prediction. Best when a high-quality small model exists for your domain.
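Greedy acceptance can be sketched in a few lines. This assumes the target's argmax is available at all K+1 positions; sampling-based acceptance is more involved:

```python
def verify_greedy(draft_tokens: list, target_preds: list) -> list:
    # Accept draft tokens while they match the target model's argmax at
    # each position. On the first mismatch, keep the target's token and
    # stop; the rejected positions are rewound from the KV cache.
    # target_preds has len(draft_tokens) + 1 entries: one per draft
    # position plus a bonus position after the last draft token.
    accepted = []
    for k, token in enumerate(draft_tokens):
        if token == target_preds[k]:
            accepted.append(token)
        else:
            accepted.append(target_preds[k])  # fallback to target's token
            return accepted
    accepted.append(target_preds[len(draft_tokens)])  # all accepted: bonus
    return accepted
```

Every verification round emits at least one token (the target's own prediction), so the method never falls behind plain autoregressive decoding in token count.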
EAGLE
A single-layer transformer predicts draft tokens from the target model's hidden states, avoiding a separate draft model. EAGLE-2 refines this with dynamic tree-structured verification. EAGLE-3 adds support for disaggregated serving, enabling speculative decoding across separate prefill and decode GPU pools. No separate draft model needed; general-purpose and widely applicable.
Medusa
Additional LM heads are attached to the model, each predicting a future token position. Verification uses a tree-structured approach to evaluate multiple candidate sequences in parallel. Requires fine-tuning the extra heads on your target distribution, but achieves strong acceptance rates when the heads are well-trained.
Lookahead Decoding
Uses two parallel branches: a lookahead branch that generates n-gram candidates and a verification branch that validates them. No extra model training or fine-tuning required; works with any model out of the box. Particularly useful for quick deployment scenarios.
Prompt Lookup (N-gram)
Copies token sequences directly from the input prompt as draft candidates. Extremely effective for tasks where the output echoes parts of the input: summarization, extractive QA, and document rewriting. Zero compute overhead for draft generation.
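A minimal sketch of the copy mechanism: match the sequence's trailing n-gram against earlier positions and propose the tokens that followed it. The parameters are illustrative defaults, not the library's:

```python
def prompt_lookup_draft(tokens: list, ngram_size: int = 2,
                        num_draft: int = 4) -> list:
    # Find the most recent earlier occurrence of the last ngram_size
    # tokens and copy the tokens that followed it as draft candidates.
    key = tokens[-ngram_size:]
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == key:
            continuation = tokens[start + ngram_size:
                                  start + ngram_size + num_draft]
            if continuation:
                return continuation
    return []  # no earlier occurrence: fall back to normal decoding
```

When the model is quoting or rephrasing its input, these copied drafts are accepted at high rates for free.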
Multi-Token Prediction (MTP)
Generates multiple tokens per forward pass natively, leveraging models specifically trained with multi-token prediction objectives. The model itself produces several next-token predictions simultaneously, which are then verified for consistency. Best suited for models explicitly trained with MTP objectives (e.g., DeepSeek-V3).
Disaggregated Serving
Disaggregated serving separates prefill and decode onto different GPU pools, exploiting their fundamentally different hardware profiles.
Prefill Pool (compute-bound)
Dense matrix multiplications over many tokens. Benefits from maximum FLOPS.
➡️ KV Cache Transfer (MPI / UCX / NIXL)
Decode Pool (memory-bandwidth-bound)
Reads the KV cache one token at a time. Benefits from maximum memory bandwidth.
🚀 GB200 performance: On GB200 systems, disaggregated serving achieves 1.4x to 6.1x speedups depending on the workload mix, by allowing each GPU pool to be independently optimized for its bottleneck.
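The split is justified by arithmetic intensity. A rough roofline sketch for a dense layer, ignoring KV-cache and activation traffic:

```python
def arithmetic_intensity(tokens_per_step: int, num_params: float,
                         bytes_per_param: int = 1) -> float:
    # ~2 FLOPs per parameter per token, versus reading each weight once
    # per forward step (FP8 weights assumed: 1 byte per parameter).
    flops = 2 * num_params * tokens_per_step
    bytes_moved = num_params * bytes_per_param
    return flops / bytes_moved
```

Prefill over a 4096-token prompt yields ~8192 FLOPs per byte (compute-bound), while decode at one token per step yields ~2 FLOPs per byte (bandwidth-bound), so each pool hits a different roofline and benefits from different hardware.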
Performance Characteristics
Key performance numbers measured on a single H100 80GB running Llama 3.3 70B with FP8 quantization at 100 concurrent requests.
Throughput: ~2,780 tok/s
At 100 concurrent requests (H100, Llama 70B FP8). ~15% faster than vLLM (~2,400 tok/s) and SGLang (~2,460 tok/s).
Time to First Token (p95): ~1,280 ms
At 100 concurrent requests on H100 (vs ~1,450 ms for vLLM). Chunked context and FP8 context FMHA are the main levers for reducing TTFT.
Cold Start (First Compilation): ~28 min
First compilation of a 70B model; a cached engine reloads in ~90 seconds. vLLM loads in ~62 s and SGLang in ~58 s, making cold start TensorRT-LLM's primary operational cost.
Peak VRAM Usage: 74–79 GB
On H100 for a 70B FP8 model. Slightly higher than vLLM (71–78 GB) due to TensorRT workspace buffers.
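A back-of-envelope check on that figure. The KV-pool and workspace sizes below are illustrative assumptions, not measured values:

```python
GiB = 1024**3

def vram_estimate_gib(num_params: float, bytes_per_weight: int,
                      kv_pool_gib: float, workspace_gib: float) -> float:
    # Weights dominate: 70B parameters at 1 byte each (FP8) is ~65 GiB
    # before the KV-cache pool and TensorRT workspace buffers are added.
    weights_gib = num_params * bytes_per_weight / GiB
    return weights_gib + kv_pool_gib + workspace_gib
```

With an assumed 8 GiB KV pool and 3 GiB of workspace, the estimate lands at ~76 GiB, inside the measured 74–79 GB range.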