Practical answers to the questions senior engineers ask when evaluating or deploying TensorRT-LLM in production.

Compile once as a CI/CD step, then save the engine to persistent storage (S3, GCS, NFS). Loading a cached engine takes ~90 seconds instead of ~28 minutes of compilation. This is the standard production pattern.

💡
Key files to persist: rank*.engine + config.json. These are the only artifacts needed for engine deserialization.

NVIDIA NIM handles this automatically by shipping pre-compiled engines for popular model/GPU combinations. If you are using NIM containers, cold start is already solved.
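A minimal sketch of the compile-once-and-cache pattern: derive a deterministic cache key from everything that affects the compiled artifact, and have serving nodes download the engine under that key instead of recompiling. The storage layout and field names here are illustrative, not part of TensorRT-LLM.

```python
import hashlib
import json

def engine_cache_key(model: str, gpu_arch: str, build_config: dict) -> str:
    """Deterministic key for a compiled engine: same inputs -> same key.

    Anything that changes the artifact (model revision, GPU architecture,
    quantization, max batch size, TRT-LLM version) must be part of the
    key, or a node may load a stale or incompatible engine.
    """
    payload = json.dumps(
        {"model": model, "gpu": gpu_arch, "build": build_config},
        sort_keys=True,  # field order must not change the key
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

# The CI job uploads rank*.engine + config.json under this prefix
# (e.g. s3://engines/<key>/); serving nodes check for the key before
# falling back to a ~28-minute compile.
key_a = engine_cache_key("llama-70b", "sm90", {"quant": "fp8", "max_batch_size": 64})
key_b = engine_cache_key("llama-70b", "sm90", {"max_batch_size": 64, "quant": "fp8"})
assert key_a == key_b  # dict ordering does not affect the key
```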

FP8 is the default choice for Hopper GPUs. It delivers a 1.4–2.3x speedup over FP16 with minimal quality loss. Start here.

INT4 AWQ cuts memory by 4x relative to FP16, which lets you fit larger models on fewer GPUs. The trade-off is some quality degradation, particularly on reasoning-heavy tasks. Best when memory is the bottleneck.

NVFP4 is Blackwell-only and provides a 50% memory reduction versus FP8. If you are on B200/GB200 hardware, this is the most efficient option.

Recommended workflow: Start with FP8, validate output quality with trtllm-eval, then explore INT4 AWQ or NVFP4 only if you need to reduce memory further.
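The back-of-envelope weight-memory arithmetic behind these trade-offs, using bytes per parameter for each scheme. Real footprints differ slightly (quantization scales, outlier weights kept in higher precision), but the ratios hold.

```python
# Approximate weight storage per parameter for each scheme.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4_awq": 0.5, "nvfp4": 0.5}

def weight_gb(n_params: float, scheme: str) -> float:
    """Weight memory in GB, ignoring KV cache and activations."""
    return n_params * BYTES_PER_PARAM[scheme] / 1e9

for scheme in BYTES_PER_PARAM:
    print(f"{scheme:>8}: {weight_gb(70e9, scheme):.0f} GB")
# fp16 -> 140 GB (multiple 80 GB GPUs just for weights)
# fp8  -> 70 GB; int4_awq / nvfp4 -> 35 GB
```

This is why FP8 alone often removes a GPU from the bill of materials, and why INT4 AWQ is the lever of last resort when weights still dominate.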

Tensor parallelism (TP) works well up to 8 GPUs within a single node connected by NVLink. Beyond 8, all-reduce communication overhead grows and starts eating into the throughput gains.

The sweet spot for a 70B model is TP=4 on H100, which delivers roughly 1,564 tok/s. Going to TP=8 yields only marginal improvement while using twice the GPUs.

For 405B+ parameter models, combine TP=8 within a node with pipeline parallelism (PP) across nodes. This keeps NVLink-speed communication for the latency-sensitive TP dimension and uses slower inter-node links only for PP.
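A sketch of the sizing logic implied above: cap TP at the node size (8 NVLink-connected GPUs) and add PP across nodes only when the per-GPU weight shard no longer fits. The memory budget and headroom fraction are illustrative assumptions, not TRT-LLM defaults.

```python
import math

def plan_parallelism(weight_gb: float, gpu_mem_gb: float = 80.0,
                     usable_frac: float = 0.7, max_tp: int = 8):
    """Pick (tp, pp): smallest GPU count whose weight shard fits per GPU.

    usable_frac reserves headroom for KV cache and activations. TP stays
    within one node; extra GPUs beyond 8 come from pipeline parallelism
    over the slower inter-node links.
    """
    budget = gpu_mem_gb * usable_frac          # usable GB per GPU
    gpus = max(1, math.ceil(weight_gb / budget))
    tp = min(gpus, max_tp)
    pp = math.ceil(gpus / tp)
    return tp, pp

# 70B @ FP8 (~70 GB)  -> TP=2 fits; TP=4 buys extra KV cache headroom.
# 405B @ FP16 (~810 GB) -> more than one node: TP=8 within, PP across.
print(plan_parallelism(70), plan_parallelism(810))
```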

Start by establishing a baseline with trtllm-bench. Then systematically check each layer of the stack:

  • KV cache utilization — are you running out of blocks? Check with getNumFreeBlocks()
  • Batch size at saturation — throughput plateaus once the GPU is compute-bound; pushing beyond wastes memory
  • Build flags — flags like --multiple_profiles and --reduce_fusion can make a ~30% difference
  • Chunked context — enable it when large prompts block the generation pipeline and cause TTFT spikes
  • GEMM plugin — disable for FP8 workloads (the native cuBLAS path is faster for FP8)
⚠️
Build flags matter more than you think. A misconfigured build can leave 30% of throughput on the table. Always compare against trtllm-bench with the same input/output length distribution.
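The "batch size at saturation" item in miniature: a toy roofline where throughput scales linearly while memory-bandwidth-bound, then flattens at the compute ceiling. The per-request and ceiling numbers are made up for illustration; measure your own with trtllm-bench.

```python
def throughput_tok_s(batch: int, per_req_tok_s: float = 100.0,
                     compute_cap_tok_s: float = 3000.0) -> float:
    """Toy roofline: linear in batch size until the GPU becomes
    compute-bound, then flat at the ceiling."""
    return min(batch * per_req_tok_s, compute_cap_tok_s)

# With these numbers the knee is at batch 30: batch 64 delivers no more
# tokens/s than batch 32, but holds twice the KV cache — exactly the
# wasted memory the checklist warns about.
for b in (8, 16, 32, 64):
    print(b, throughput_tok_s(b))
```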

LoRA adapters: yes. You can swap them at runtime via LoRARequest without restarting the server. Multiple LoRA adapters can be served concurrently on the same base engine.

Base models: no. Switching the base model requires compiling a new engine (~28 minutes) or loading a different pre-compiled engine (~90 seconds with a server restart).

Workarounds for model switching:

  • Pre-compile engines for all target models and swap between them (~90s restart)
  • Use vLLM for exploration and prototyping, then move to TRT-LLM for production
  • Use NIM containers for pre-cached engines with simplified orchestration

Call Executor.cancelRequest(requestId) to mark a request for termination. The runtime picks this up at the next scheduler iteration (typically <10ms latency). All KV cache blocks allocated to the cancelled request are freed immediately.

API-specific patterns:

  • Python: cancel the future returned from generate_async()
  • Triton backend: standard gRPC cancellation propagates to the executor
ℹ️
Tokens before cancel are not rolled back. Any tokens already generated and streamed remain in the response. Cancellation prevents further generation, but does not undo what was already sent.
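The Python-side pattern in miniature, using plain asyncio as a stand-in for generate_async() (no TRT-LLM import): cancelling the task stops further generation at the next step, but everything already streamed stays in the response.

```python
import asyncio

async def fake_generate(tokens_out: list, n: int = 100) -> None:
    """Stand-in for a streaming generate_async(): one token per step."""
    for i in range(n):
        tokens_out.append(f"tok{i}")
        await asyncio.sleep(0)  # yield to the event loop (one "step")

async def main() -> list:
    streamed: list = []
    task = asyncio.create_task(fake_generate(streamed))
    for _ in range(5):                 # let a few scheduler steps run
        await asyncio.sleep(0)
    task.cancel()                      # analogous to cancelRequest(requestId)
    try:
        await task                     # CancelledError at the next step
    except asyncio.CancelledError:
        pass
    return streamed                    # partial output is kept, not rolled back

streamed = asyncio.run(main())
print(len(streamed), "tokens kept after cancel")
```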

Behavior depends on the configured scheduling policy:

  • GuaranteedNoEvict (default): new requests are queued until blocks become available. No in-flight request is ever interrupted, but queue latency increases under pressure.
  • MaxUtilization: the scheduler may pause lower-priority in-flight requests to free blocks for higher-priority ones. Better utilization, but adds complexity.

The key configuration lever is the KV cache memory fraction (default: 0.9), which controls what portion of free GPU memory is allocated to the KV cache pool at startup.

🔧
Solutions when you hit the limit: monitor with getNumFreeBlocks(); if the pool is consistently saturated, reduce max_batch_size, enable KV cache quantization (INT8 or FP8 halves the block size), or add more GPUs.
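The arithmetic behind the memory-fraction lever and the "halves block size" point, sketched with a Llama-70B-like shape. The tensor layout and tokens-per-block value are illustrative assumptions; TRT-LLM's actual block format may differ in detail, but the scaling is the same.

```python
def kv_block_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens_per_block: int = 64, dtype_bytes: int = 2) -> int:
    """Bytes per KV cache block: K and V (factor 2) for every layer."""
    return 2 * layers * kv_heads * head_dim * tokens_per_block * dtype_bytes

def pool_blocks(free_gpu_bytes: int, fraction: float, block_bytes: int) -> int:
    """Blocks in the pool carved from free memory at startup."""
    return int(free_gpu_bytes * fraction) // block_bytes

# Llama-70B-like shape: 80 layers, 8 KV heads (GQA), head_dim 128.
fp16_block = kv_block_bytes(80, 8, 128)                 # 16-bit cache
fp8_block = kv_block_bytes(80, 8, 128, dtype_bytes=1)   # 8-bit cache
assert fp8_block * 2 == fp16_block   # quantized cache -> 2x the blocks

free = 40 * 1024**3                  # e.g. 40 GB free after weights
print(pool_blocks(free, 0.9, fp16_block), "FP16 blocks at fraction 0.9")
```

Raising the fraction, shrinking max_batch_size, or halving dtype_bytes all move the same quantity: how many concurrent sequences fit before getNumFreeBlocks() hits zero.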