The compilation pipeline transforms a HuggingFace model into an optimized GPU executable through five stages. Each stage narrows the representation from a general-purpose model definition down to a GPU-architecture-specific binary tuned for maximum throughput.
Model-to-Engine Pipeline
1. Model Loading
Weights loaded from a HuggingFace checkpoint or local directory. LLaMAConfig.from_hugging_face() maps HuggingFace config fields to TensorRT-LLM's internal representation.
2. Network Definition
Forward pass traced into a TensorRT INetworkDefinition graph. Each layer (attention, MLP, normalization) becomes a set of TensorRT operations. Custom plugins injected where hand-tuned kernels outperform auto-optimization.
3. Optimization Pass
optimize_model_with_config() applies pre-build transformations: quantization (replacing linear layers with FP8/INT4 variants), plugin injection, and layer fusion planning.
4. TensorRT Compilation
Builder.build_engine() invokes TensorRT's compiler: evaluates available CUDA kernel implementations, identifies fusible operation sequences (e.g., LayerNorm + quantize), selects optimal kernels via auto-tuning. This step takes ~28 minutes for a 70B model.
5. Serialization
Engine binary, config JSON, and managed weights written to disk. Subsequent loads skip compilation entirely — deserialization takes ~90 seconds.
⚠️ Engine is GPU-architecture-specific: an engine compiled on H100 cannot run on A100 or B200. This is the fundamental trade-off: maximum per-GPU optimization at the cost of portability.
Attention Mechanisms
TensorRT-LLM implements several attention strategies that activate based on the inference phase and hardware capabilities. The context (prefill) phase processes all input tokens at once, while the generation (decode) phase produces tokens one at a time against the KV cache.
⚡Context Phase (Prefill)
Processes all input tokens at once
With FMHA (default)
Runs FlashAttention / FlashAttention-2 in a single kernel pass. On Hopper GPUs, FP8 context FMHA further reduces compute for the Q x K and attention x V products.
Without FMHA (fallback)
Falls back to a sequence of unfused GPU kernels with quadratic memory overhead. Viable for short sequences but prohibitive at scale.
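The cost of the non-FMHA fallback is easy to quantify: it materializes the full score matrix, which grows quadratically with sequence length, while FlashAttention never writes that matrix to HBM. A rough sketch, assuming FP16 scores:

```python
def naive_score_matrix_bytes(seq_len: int, num_heads: int,
                             bytes_per_score: int = 2) -> int:
    # The unfused fallback materializes a seq_len x seq_len score matrix
    # per attention head; FlashAttention streams it through SRAM instead.
    return num_heads * seq_len * seq_len * bytes_per_score

# 4K context, 64 heads: ~2 GiB of intermediate scores.
# 128K context, 64 heads: ~2 TiB, which is why the fallback is
# prohibitive at scale.
```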
🔄Generation Phase (Decode)
Produces tokens one at a time
Masked MHA Kernel
Reads one new query token against the KV cache. Applies QKV bias, RoPE position embeddings, and KV cache dequantization (INT8/FP8) on the fly.
Multi-Block Mode
When batch_size x num_heads is less than the GPU's SM count, distributes work across multiple CUDA thread-blocks per attention head to keep all SMs busy.
XQA (Cross-Query Attention, default)
Specialized kernel for MQA/GQA, exploiting the shared KV head structure for faster decode.
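The multi-block trigger can be sketched as simple occupancy arithmetic. This is a simplification of the actual kernel heuristic, and the H100 SM count of 132 used in the example is just illustrative:

```python
def multi_block_splits(batch_size: int, num_heads: int, sm_count: int) -> int:
    # One CUDA thread-block per (sequence, head) pair fills the GPU only
    # when batch_size * num_heads >= sm_count; otherwise split each head's
    # KV-cache scan across enough blocks to cover the idle SMs.
    work_items = batch_size * num_heads
    if work_items >= sm_count:
        return 1
    return -(-sm_count // work_items)  # ceil(sm_count / work_items)
```

At batch 8 with 32 heads the GPU is already saturated; at batch 1 the same model leaves most SMs idle unless each head's work is split.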
Memory Management
TensorRT-LLM's memory system is built around the paged KV cache, inspired by virtual memory systems. Blocks are pre-allocated at startup and dynamically assigned as requests arrive, avoiding fragmentation and enabling efficient sharing.
🧱 Block Pool
Pre-allocated at startup by KVCacheManager. Configurable block sizes (8/16/32/64/128 tokens). Pool shape: [num_blocks, num_layers, 2, num_heads, block_size, head_dim].
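Pool sizing follows directly from that shape. A sketch with illustrative Llama-70B-style numbers (80 layers, 8 KV heads under GQA, head_dim 128):

```python
def kv_pool_num_blocks(budget_bytes: int, num_layers: int, num_kv_heads: int,
                       block_size: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    # Per the pool shape [num_blocks, num_layers, 2, num_heads,
    # block_size, head_dim]; the 2 holds K and V.
    block_bytes = (num_layers * 2 * num_kv_heads
                   * block_size * head_dim * bytes_per_elem)
    return budget_bytes // block_bytes
```

With FP8 KV cache and 64-token blocks, each block costs 10 MiB, so a 10 GiB budget yields 1024 blocks (65,536 cached tokens).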
🔄 Dynamic Allocation
Blocks allocated from the pool on demand as new requests arrive. New blocks added one at a time during generation. Freed back to the pool on request completion.
🔗 Block Reuse (Prefix Caching)
Shared prefixes (system prompts, conversation history) reused across requests via storeContextBlocks() / findNewContextBlock(). Avoids recomputing KV cache for identical prefix tokens.
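One common way to implement this lookup is to chain-hash fixed-size token blocks so a block matches only when its entire prefix also matches. This is a sketch of the idea, not the actual storeContextBlocks() / findNewContextBlock() internals:

```python
import hashlib

def block_hash(tokens: list, parent_hash: str = "") -> str:
    # Chaining the parent hash makes a block's identity depend on the
    # whole prefix, not just its own tokens.
    data = parent_hash + ":" + ",".join(map(str, tokens))
    return hashlib.sha256(data.encode()).hexdigest()

def reusable_blocks(prompt_tokens: list, block_size: int, cache: set) -> int:
    # Count leading full blocks whose KV values are already cached.
    reused, parent = 0, ""
    full = len(prompt_tokens) - len(prompt_tokens) % block_size
    for i in range(0, full, block_size):
        h = block_hash(prompt_tokens[i:i + block_size], parent)
        if h not in cache:
            break
        reused, parent = reused + 1, h
    return reused
```

A request that starts with a cached system prompt skips KV computation for those blocks entirely; a request with a different prefix reuses nothing.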
📦 KV Cache Quantization
INT8 or FP8 storage instead of FP16, cutting memory by 2–4x. On Blackwell GPUs, NVFP4 KV cache provides 50% reduction versus FP8.
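The per-token savings follow directly from bits per element. Llama-70B-style GQA shapes (80 layers, 8 KV heads, head_dim 128) are assumed here for illustration:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bits_per_elem: int) -> int:
    # Each token stores K and V per layer: 2 * num_kv_heads * head_dim
    # elements, at the chosen storage width.
    return num_layers * 2 * num_kv_heads * head_dim * bits_per_elem // 8

fp16  = kv_bytes_per_token(80, 8, 128, 16)  # 320 KiB per token
fp8   = kv_bytes_per_token(80, 8, 128, 8)   # 160 KiB per token
nvfp4 = kv_bytes_per_token(80, 8, 128, 4)   # 80 KiB per token
```

Each halving of the element width doubles how many tokens fit in the same pool, which is the 50% NVFP4-versus-FP8 reduction cited above.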
Speculative Decoding
TensorRT-LLM supports six speculative decoding methods that trade extra compute for fewer autoregressive steps. The core mechanism: generate multiple candidate tokens cheaply, then validate them against the full model in a single forward pass. Accepted tokens skip autoregressive steps; rejected tokens trigger a rewind via KVCacheManager.rewindKVCache().
Draft-Target
A small draft model (e.g., LLaMA-68M) generates K candidate tokens autoregressively. The large target model (e.g., LLaMA-70B) validates all K candidates in a single forward pass. Accepted tokens are kept; the first rejected token triggers a fallback to the target model's prediction. Best when a high-quality small model exists for your domain.
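Greedy acceptance can be sketched in a few lines. This assumes the target's argmax is available at all K+1 positions; sampling-based acceptance is more involved:

```python
def verify_greedy(draft_tokens: list, target_preds: list) -> list:
    # Accept draft tokens while they match the target model's argmax at
    # each position. On the first mismatch, keep the target's token and
    # stop; the rejected positions are rewound from the KV cache.
    # target_preds has len(draft_tokens) + 1 entries: one per draft
    # position plus a bonus position after the last draft token.
    accepted = []
    for k, token in enumerate(draft_tokens):
        if token == target_preds[k]:
            accepted.append(token)
        else:
            accepted.append(target_preds[k])  # fallback to target's token
            return accepted
    accepted.append(target_preds[len(draft_tokens)])  # all accepted: bonus
    return accepted
```

Every verification round emits at least one token (the target's own prediction), so the method never falls behind plain autoregressive decoding in token count.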
EAGLE
A single-layer transformer predicts draft tokens from the target model's hidden states, avoiding a separate draft model. EAGLE-2 refines this with dynamic tree-structured verification. EAGLE-3 adds support for disaggregated serving, enabling speculative decoding across separate prefill and decode GPU pools. No separate draft model needed; general-purpose and widely applicable.
Medusa
Additional LM heads are attached to the model, each predicting a future token position. Verification uses a tree-structured approach to evaluate multiple candidate sequences in parallel. Requires fine-tuning the extra heads on your target distribution, but achieves strong acceptance rates when the heads are well-trained.
Lookahead Decoding
Uses two parallel branches: a lookahead branch that generates n-gram candidates and a verification branch that validates them. No extra model training or fine-tuning required; works with any model out of the box. Particularly useful for quick deployment scenarios.
Prompt Lookup (N-gram)
Copies token sequences directly from the input prompt as draft candidates. Extremely effective for tasks where the output echoes parts of the input: summarization, extractive QA, and document rewriting. Zero compute overhead for draft generation.
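A minimal sketch of the copy mechanism: match the sequence's trailing n-gram against earlier positions and propose the tokens that followed it. The parameters are illustrative defaults, not the library's:

```python
def prompt_lookup_draft(tokens: list, ngram_size: int = 2,
                        num_draft: int = 4) -> list:
    # Find the most recent earlier occurrence of the last ngram_size
    # tokens and copy the tokens that followed it as draft candidates.
    key = tokens[-ngram_size:]
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == key:
            continuation = tokens[start + ngram_size:
                                  start + ngram_size + num_draft]
            if continuation:
                return continuation
    return []  # no earlier occurrence: fall back to normal decoding
```

When the model is quoting or rephrasing its input, these copied drafts are accepted at high rates for free.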
Multi-Token Prediction (MTP)
Generates multiple tokens per forward pass natively, leveraging models specifically trained with multi-token prediction objectives. The model itself produces several next-token predictions simultaneously, which are then verified for consistency. Best suited for models explicitly trained with MTP objectives (e.g., DeepSeek-V3).
Disaggregated Serving
Disaggregated serving separates prefill and decode onto different GPU pools, exploiting their fundamentally different hardware profiles.
Prefill Pool (compute-bound)
Dense matrix multiplications over many tokens. Benefits from maximum FLOPS.
➡️ KV Cache Transfer (MPI / UCX / NIXL)
Decode Pool (memory-bandwidth-bound)
Reads the KV cache one token at a time. Benefits from maximum memory bandwidth.
🚀 GB200 performance: On GB200 systems, disaggregated serving achieves 1.4x to 6.1x speedups depending on the workload mix, by allowing each GPU pool to be independently optimized for its bottleneck.
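The split is justified by arithmetic intensity. A rough roofline sketch for a dense layer, ignoring KV-cache and activation traffic:

```python
def arithmetic_intensity(tokens_per_step: int, num_params: float,
                         bytes_per_param: int = 1) -> float:
    # ~2 FLOPs per parameter per token, versus reading each weight once
    # per forward step (FP8 weights assumed: 1 byte per parameter).
    flops = 2 * num_params * tokens_per_step
    bytes_moved = num_params * bytes_per_param
    return flops / bytes_moved
```

Prefill over a 4096-token prompt yields ~8192 FLOPs per byte (compute-bound), while decode at one token per step yields ~2 FLOPs per byte (bandwidth-bound), so each pool hits a different roofline and benefits from different hardware.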
Performance Characteristics
Key performance numbers measured on a single H100 80GB running Llama 3.3 70B with FP8 quantization at 100 concurrent requests.
Throughput: ~2,780 tok/s
At 100 concurrent requests (H100, Llama 70B FP8). ~15% faster than vLLM (~2,400 tok/s) and SGLang (~2,460 tok/s).
Time to First Token (p95): ~1,280 ms
At 100 concurrent requests on H100 (vs ~1,450 ms for vLLM). Chunked context and FP8 context FMHA are the main levers for reducing TTFT.
Cold Start (First Compilation): ~28 min
First compilation of a 70B model; a cached engine reloads in ~90 seconds. vLLM loads in ~62 s and SGLang in ~58 s, making cold start TensorRT-LLM's primary operational cost.
Peak VRAM Usage: 74–79 GB
On H100 for a 70B FP8 model. Slightly higher than vLLM (71–78 GB) due to TensorRT workspace buffers.
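A back-of-envelope check on that figure. The KV-pool and workspace sizes below are illustrative assumptions, not measured values:

```python
GiB = 1024**3

def vram_estimate_gib(num_params: float, bytes_per_weight: int,
                      kv_pool_gib: float, workspace_gib: float) -> float:
    # Weights dominate: 70B parameters at 1 byte each (FP8) is ~65 GiB
    # before the KV-cache pool and TensorRT workspace buffers are added.
    weights_gib = num_params * bytes_per_weight / GiB
    return weights_gib + kv_pool_gib + workspace_gib
```

With an assumed 8 GiB KV pool and 3 GiB of workspace, the estimate lands at ~76 GiB, inside the measured 74–79 GB range.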