Practical answers to the questions senior engineers ask when evaluating or deploying TensorRT-LLM in production.
Compile once as a CI/CD step, then save the engine to persistent storage (S3, GCS, NFS). Loading a cached engine takes ~90 seconds instead of 28 minutes. This is the standard production pattern.
A serialized engine consists of the rank*.engine files plus config.json. These are the only artifacts needed for engine deserialization.
NVIDIA NIM handles this automatically by shipping pre-compiled engines for popular model/GPU combinations. If you are using NIM containers, cold start is already solved.
FP8 is the default choice for Hopper GPUs. It delivers a 1.4–2.3x speedup over FP16 with minimal quality loss. Start here.
INT4 AWQ cuts memory by 4x, which lets you fit larger models on fewer GPUs. The trade-off is some quality degradation, particularly on reasoning-heavy tasks. Best when memory is the bottleneck.
NVFP4 is Blackwell-only and provides a 50% memory reduction versus FP8. If you are on B200/GB200 hardware, this is the most efficient option.
Validate quality with trtllm-eval, then explore INT4 AWQ or NVFP4 only if you need to reduce memory further.
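The guidance above condenses into a rule of thumb. This helper is illustrative, not an official selection API; the architecture names and decision order are assumptions encoding the advice in this section.

```python
def pick_quant_format(gpu_arch: str, memory_constrained: bool = False) -> str:
    # Rule-of-thumb sketch of the guidance above (not a TRT-LLM API).
    if gpu_arch == "blackwell":    # B200 / GB200
        return "NVFP4"             # ~50% memory reduction vs FP8
    if memory_constrained:
        return "INT4_AWQ"          # 4x memory cut, some quality loss
    if gpu_arch == "hopper":       # H100 / H200
        return "FP8"               # 1.4-2.3x speedup over FP16
    return "FP16"                  # pre-Hopper fallback
```

Whatever the helper suggests, validate the quantized engine with trtllm-eval before promoting it.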
Tensor parallelism (TP) works well up to 8 GPUs within a single node connected by NVLink. Beyond 8, all-reduce communication overhead grows and starts eating into the throughput gains.
The sweet spot for a 70B model is TP=4 on H100, which delivers approximately 1,564 tok/s. Going to TP=8 gives marginal improvement but uses twice the GPUs.
For 405B+ parameter models, combine TP=8 within a node with pipeline parallelism (PP) across nodes. This keeps NVLink-speed communication for the latency-sensitive TP dimension and uses slower inter-node links only for PP.
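A minimal sketch of that mapping rule, assuming 8 NVLink-connected GPUs per node (the node size is an assumption; adjust for your hardware):

```python
def parallel_mapping(total_gpus: int, gpus_per_node: int = 8) -> tuple[int, int]:
    """Return (tensor_parallel, pipeline_parallel) per the rule above."""
    if total_gpus <= gpus_per_node:
        return total_gpus, 1  # pure TP inside one NVLink domain
    assert total_gpus % gpus_per_node == 0, "use whole nodes beyond one node"
    # TP stays at NVLink speed within the node; PP crosses the slower
    # inter-node links, where its lighter communication pattern hurts less.
    return gpus_per_node, total_gpus // gpus_per_node
```

So a 405B model on two nodes maps to TP=8, PP=2.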
Start by establishing a baseline with trtllm-bench. Then systematically check each layer of the stack:
- KV cache utilization — are you running out of blocks? Check with getNumFreeBlocks()
- Batch size at saturation — throughput plateaus once the GPU is compute-bound; pushing beyond that wastes memory
- Build flags — flags like --multiple_profiles and --reduce_fusion can make a ~30% difference
- Chunked context — enable it if TTFT spikes because large prompts block the generation pipeline
- GEMM plugin — disable it for FP8 workloads (the native cuBLAS path is faster for FP8)
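Finding the saturation batch size from a sweep can be automated. This is a hypothetical helper, assuming you have collected (batch size, tokens/s) pairs from successive trtllm-bench runs:

```python
def saturation_batch_size(sweep: list[tuple[int, float]], tol: float = 0.05) -> int:
    """Return the batch size where throughput stops improving by > tol.

    sweep: (batch_size, tokens_per_second) pairs from a benchmark sweep.
    """
    sweep = sorted(sweep)  # order by batch size
    for (b0, t0), (_, t1) in zip(sweep, sweep[1:]):
        if t1 < t0 * (1 + tol):  # less than 5% gain: plateaued
            return b0
    return sweep[-1][0]          # never plateaued within the sweep
```

Batch sizes beyond the returned value buy almost no throughput but still consume KV cache blocks.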
Reproduce your workload in trtllm-bench with the same input/output length distribution as production traffic.
LoRA adapters: yes. You can swap them at runtime via LoRARequest without restarting the server. Multiple LoRA adapters can be served concurrently on the same base engine.
Base models: no. Switching the base model requires compiling a new engine (~28 minutes) or loading a different pre-compiled engine (~90 seconds with a server restart).
Workarounds for model switching:
- Pre-compile engines for all target models and swap between them (~90s restart)
- Use vLLM for exploration and prototyping, then move to TRT-LLM for production
- Use NIM containers for pre-cached engines with simplified orchestration
Call Executor.cancelRequest(requestId) to mark a request for termination. The runtime picks this up at the next scheduler iteration (typically <10ms latency). All KV cache blocks allocated to the cancelled request are freed immediately.
API-specific patterns:
- Python: cancel the future returned from generate_async()
- Triton backend: standard gRPC cancellation propagates to the executor
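The semantics, where cancellation is observed at the next iteration boundary rather than instantly, can be mimicked with plain asyncio. This is an illustrative sketch of the pattern, not the executor's implementation:

```python
import asyncio

async def generate(n_steps: int, out: list[int]) -> None:
    for step in range(n_steps):
        out.append(step)        # emit one token per scheduler step
        await asyncio.sleep(0)  # yield point: cancellation lands here

async def main() -> int:
    tokens: list[int] = []
    task = asyncio.create_task(generate(1000, tokens))
    await asyncio.sleep(0)      # let at least one step run
    task.cancel()               # analogous to Executor.cancelRequest()
    try:
        await task
    except asyncio.CancelledError:
        pass                    # cleanup point: KV blocks would be freed here
    return len(tokens)

steps_done = asyncio.run(main())
```

As with the real executor, the request stops at a step boundary, which is why the observed latency is one scheduler iteration rather than zero.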
Behavior depends on the configured scheduling policy:
- GuaranteedNoEvict (default): new requests are queued until blocks become available. No in-flight request is ever interrupted, but queue latency increases under pressure.
- MaxUtilization: the scheduler may pause lower-priority in-flight requests to free blocks for higher-priority ones. Better utilization, but adds complexity.
The key configuration lever is the KV cache memory fraction (default: 0.9), which controls what portion of free GPU memory is allocated to the KV cache pool at startup.
Monitor free blocks with getNumFreeBlocks(). If the pool is consistently saturated: reduce max_batch_size, enable KV cache quantization (INT8/FP8 halves block size), or add more GPUs.
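For sizing, a back-of-envelope calculator for how many tokens the pool can hold. It assumes a standard transformer KV layout (2 tensors, K and V, per layer per KV head); the example parameter values are illustrative, not tied to a specific model.

```python
def kv_cache_tokens(free_gpu_bytes: int, layers: int, kv_heads: int,
                    head_dim: int, dtype_bytes: int = 2,
                    mem_fraction: float = 0.9) -> int:
    """Tokens the KV cache pool can hold under the default 0.9 fraction.

    Per-token KV bytes = 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes.
    """
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return int(free_gpu_bytes * mem_fraction) // per_token
```

With illustrative 70B-class parameters (80 layers, 8 KV heads, head_dim 128, FP16) and 40 GB free, this gives roughly 110K tokens of capacity; setting dtype_bytes=1 for an INT8/FP8 KV cache doubles it, which is exactly why quantizing the cache is the first lever above.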