Strengths

Where TensorRT-LLM excels compared to alternative inference engines.

Highest throughput on NVIDIA GPUs
~2,780 tok/s at 100 concurrent users on H100 — consistently 10–15% ahead of alternatives in production benchmarks.
Lowest latency at production scale
p95 TTFT ~1,280ms and ITL ~31ms on 4xH100 — tight tail latencies even under heavy concurrent load.
Deepest hardware exploitation
First to support FP8, NVFP4, CUDA Graphs, and XQA kernels. Every new NVIDIA architecture gets optimized kernels here first.
Comprehensive feature set
Disaggregated serving, 6 speculative decoding methods, 10+ quantization formats, paged attention, prefix caching, and runtime LoRA — all in one engine.
Production-ready deployment
OpenAI-compatible API via trtllm-serve, Triton Inference Server backend, NIM containers, and Apache 2.0 licensing. Multiple deployment paths for different team sizes.
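Because trtllm-serve speaks the OpenAI API, any OpenAI-style client can target it. A minimal sketch of building the request body a client would POST to the server's `/v1/chat/completions` route — the endpoint URL, model name, and parameter choices here are illustrative assumptions, not measured defaults:

```python
import json

# Hypothetical endpoint -- trtllm-serve conventionally listens locally;
# adjust host/port to your deployment.
URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(model: str, prompt: str, stream: bool = True) -> str:
    """Build the JSON body for an OpenAI-style chat completion request."""
    body = {
        "model": model,  # the served model's name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "stream": stream,  # stream tokens for lower perceived latency
    }
    return json.dumps(body)

payload = build_chat_request("llama-3.1-70b", "Summarize TensorRT-LLM in one line.")
print(json.loads(payload)["messages"][0]["role"])  # -> user
```

Since the wire format is the OpenAI one, the same client code works unchanged against vLLM or SGLang, which makes A/B-testing engines straightforward.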
| Metric | Result | Notes |
| --- | --- | --- |
| Peak throughput (H100, Llama 70B FP8, 100 concurrent) | ~2,780 tok/s | Highest among tested engines (vLLM ~2,400, SGLang ~2,460) |
| p95 TTFT (4xH100) | ~1,280 ms | Competitive with vLLM; SGLang edges ahead on prefix-heavy workloads |
| p95 ITL (4xH100) | ~31 ms | Tight inter-token latency for smooth streaming |
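The latency metrics above are simple functions of per-token timestamps: TTFT is the first token's arrival time minus the request send time, ITL is the gap between consecutive tokens, and p95 is the 95th percentile across samples. A sketch with a synthetic trace (function names are mine, not from any benchmark harness):

```python
def ttft(request_t: float, token_times: list[float]) -> float:
    """Time to first token: first token timestamp minus request send time."""
    return token_times[0] - request_t

def itls(token_times: list[float]) -> list[float]:
    """Inter-token latencies: gaps between consecutive token timestamps."""
    return [b - a for a, b in zip(token_times, token_times[1:])]

def p95(samples: list[float]) -> float:
    """95th percentile via nearest-rank on the sorted samples."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(0.95 * len(s)))]

# Synthetic trace: request sent at t=0, first token at 1.28 s,
# then one token every ~31 ms -- mirroring the table above.
times = [1.28 + 0.031 * i for i in range(50)]
print(round(ttft(0.0, times), 3))   # 1.28
print(round(p95(itls(times)), 3))   # 0.031
```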

Limitations

Honest constraints to weigh before committing to TensorRT-LLM.

NVIDIA GPU lock-in
No AMD ROCm, no Intel, no CPU fallback. If your infrastructure includes non-NVIDIA hardware, you need a separate solution for those accelerators.
28-minute compilation per model version
The engine is GPU-architecture-specific. Every new model, quantization change, or GPU target requires a full recompile. Caching mitigates but doesn't eliminate the cost.
Higher operational complexity
Build flags, engine management, Triton configuration, and TP/PP topology planning add operational surface area compared to simpler alternatives.
Higher idle memory footprint
74–79 GB peak VRAM on H100 versus 71–78 GB for vLLM. The pre-allocated KV cache pool and compiled engine consume more baseline memory.
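The pre-allocated KV cache pool dominates that baseline footprint, and its per-token cost follows directly from model shape. A back-of-envelope sketch using Llama-70B-like dimensions — all figures here are illustrative assumptions, not measurements:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int) -> int:
    """Bytes of KV cache per token: key + value tensors across all layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Llama-70B-ish shape with an FP8 (1 byte/element) KV cache -- assumed values.
per_tok = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=1)
pool_tokens = 100 * 4096  # e.g. 100 concurrent requests x 4k-token budget

print(per_tok)                        # 163840 bytes (~160 KiB per token)
print(per_tok * pool_tokens / 2**30)  # 62.5 (GiB for the whole pool)
```

Pre-allocating a pool of this size is what buys the throughput headroom, but it sits on the GPU whether or not requests are in flight — hence the higher idle footprint.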
No model hot-swapping
Switching the base model requires a server restart (~90s with a cached engine). No in-place model replacement while requests are in flight.
Windows deprecated (v0.18+)
Linux only. Windows support was deprecated in v0.18. Development and production must target Linux or WSL2.

Alternatives Comparison

How TensorRT-LLM stacks up against the two leading open-source alternatives for LLM serving.

| | TensorRT-LLM | vLLM | SGLang |
| --- | --- | --- | --- |
| Peak throughput | Highest (~2,780 tok/s) | Moderate (~2,400) | Moderate (~2,460) |
| Cold start | ~28 min (~90 s cached) | ~62 s | ~58 s |
| Hardware | NVIDIA only | NVIDIA, AMD ROCm, CPU | NVIDIA, AMD ROCm |
| Model switching | Recompile | Hot-swap | Hot-swap |
| Setup | Complex | Simple | Simple |
| Quantization | 10+ formats | FP8, AWQ, GPTQ | FP8, AWQ, GPTQ |
| Speculative decoding | 6 methods | Draft model, EAGLE | Draft model |
| Disaggregated serving | Built-in | Experimental | Not built-in |
| Prefix caching | Block reuse | Automatic | RadixAttention (superior) |
| Best for | Long-term production | Quick deploys, flexibility | Shared-prefix workloads |
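The prefix-caching row can be made concrete: all three engines key cached KV blocks on the token prefix, so requests sharing a system prompt skip recomputing it. A toy block-reuse sketch — the hashing scheme and block size are illustrative only, not any engine's actual implementation:

```python
BLOCK = 4  # tokens per KV block (real engines use larger blocks, e.g. 16-64)

def cached_blocks(cache: dict, tokens: list[int]) -> int:
    """Count leading blocks already cached, then insert the rest.
    Each block is keyed by the hash of the full prefix up to its end,
    so a block only matches when everything before it also matches."""
    hits = 0
    for end in range(BLOCK, len(tokens) + 1, BLOCK):
        key = hash(tuple(tokens[:end]))
        if key in cache and hits == (end // BLOCK) - 1:
            hits += 1        # contiguous hit from the start of the prompt
        cache[key] = True    # stands in for storing the block's KV tensors
    return hits

cache: dict = {}
system = [1, 2, 3, 4, 5, 6, 7, 8]  # shared system prompt (2 blocks)
print(cached_blocks(cache, system + [9, 10, 11, 12]))   # first request: 0 hits
print(cached_blocks(cache, system + [13, 14, 15, 16]))  # reuse: 2 prefix blocks
```

SGLang's RadixAttention generalizes this idea by organizing prefixes in a radix tree, which is why it leads on workloads with many overlapping prompts.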

The Honest Take

Use TensorRT-LLM when you have settled on a model, are running NVIDIA hardware, and throughput per dollar is your primary metric. The ~15% throughput advantage over vLLM translates to fewer GPUs at scale — real cost savings on large deployments.
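The "fewer GPUs" claim is straightforward arithmetic on per-GPU throughput. A sketch using the benchmark numbers from this page — the fleet-wide demand figure is a made-up example:

```python
import math

def gpus_needed(demand_tok_s: float, per_gpu_tok_s: float) -> int:
    """GPUs required to serve an aggregate tokens/s demand."""
    return math.ceil(demand_tok_s / per_gpu_tok_s)

demand = 250_000  # hypothetical fleet-wide demand, tokens/s
print(gpus_needed(demand, 2780))  # TensorRT-LLM: 90 GPUs
print(gpus_needed(demand, 2400))  # vLLM: 105 GPUs
```

At this scale the ~15% throughput edge saves 15 GPUs; below a handful of GPUs, rounding swallows the difference and the operational overhead dominates.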

For prototyping, model exploration, multi-vendor hardware, or small teams — start with vLLM. Its simplicity and hot-swap capability mean faster iteration.

The worst outcome is building a complex TRT-LLM pipeline for a model you will replace in two months.