Strengths

Where TensorRT-LLM excels compared to alternative inference engines.

Highest throughput on NVIDIA GPUs
~2,780 tok/s at 100 concurrent users on H100 — consistently 10–15% ahead of alternatives in production benchmarks.
Lowest latency at production scale
p95 TTFT ~1,280ms and ITL ~31ms on 4xH100 — tight tail latencies even under heavy concurrent load.
Deepest hardware exploitation
First to support FP8, NVFP4, CUDA Graphs, and XQA kernels. Every new NVIDIA architecture gets optimized kernels here first.
Comprehensive feature set
Disaggregated serving, 6 speculative decoding methods, 10+ quantization formats, paged attention, prefix caching, and runtime LoRA — all in one engine.
Production-ready deployment
OpenAI-compatible API via trtllm-serve, Triton Inference Server backend, NIM containers, and Apache 2.0 licensing. Multiple deployment paths for different team sizes.
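Because trtllm-serve speaks the OpenAI API, any OpenAI-style client can target it. A minimal sketch of building the request body a client would POST to the server's `/v1/chat/completions` route — the endpoint URL, model name, and parameter choices here are illustrative assumptions, not measured defaults:

```python
import json

# Hypothetical endpoint -- trtllm-serve conventionally listens locally;
# adjust host/port to your deployment.
URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(model: str, prompt: str, stream: bool = True) -> str:
    """Build the JSON body for an OpenAI-style chat completion request."""
    body = {
        "model": model,  # the served model's name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "stream": stream,  # stream tokens for lower perceived latency
    }
    return json.dumps(body)

payload = build_chat_request("llama-3.1-70b", "Summarize TensorRT-LLM in one line.")
print(json.loads(payload)["messages"][0]["role"])  # -> user
```

Since the wire format is the OpenAI one, the same client code works unchanged against vLLM or SGLang, which makes A/B-testing engines straightforward.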
| Metric | Result | Notes |
| --- | --- | --- |
| Peak throughput (H100, Llama 70B FP8, 100 concurrent) | ~2,780 tok/s | Highest among tested engines (vLLM ~2,400, SGLang ~2,460) |
| p95 TTFT (4xH100) | ~1,280 ms | Competitive with vLLM; SGLang edges ahead on prefix-heavy workloads |
| p95 ITL (4xH100) | ~31 ms | Tight inter-token latency for smooth streaming |
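The latency metrics above are simple functions of per-token timestamps: TTFT is the first token's arrival time minus the request send time, ITL is the gap between consecutive tokens, and p95 is the 95th percentile across samples. A sketch with a synthetic trace (function names are mine, not from any benchmark harness):

```python
def ttft(request_t: float, token_times: list[float]) -> float:
    """Time to first token: first token timestamp minus request send time."""
    return token_times[0] - request_t

def itls(token_times: list[float]) -> list[float]:
    """Inter-token latencies: gaps between consecutive token timestamps."""
    return [b - a for a, b in zip(token_times, token_times[1:])]

def p95(samples: list[float]) -> float:
    """95th percentile via nearest-rank on the sorted samples."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(0.95 * len(s)))]

# Synthetic trace: request sent at t=0, first token at 1.28 s,
# then one token every ~31 ms -- mirroring the table above.
times = [1.28 + 0.031 * i for i in range(50)]
print(round(ttft(0.0, times), 3))   # 1.28
print(round(p95(itls(times)), 3))   # 0.031
```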

Limitations

Honest constraints to weigh before committing to TensorRT-LLM.

NVIDIA GPU lock-in
No AMD ROCm, no Intel, no CPU fallback. If your infrastructure includes non-NVIDIA hardware, you need a separate solution for those accelerators.
28-minute compilation per model version
The engine is GPU-architecture-specific. Every new model, quantization change, or GPU target requires a full recompile. Caching mitigates but doesn't eliminate the cost.
Higher operational complexity
Build flags, engine management, Triton configuration, and TP/PP topology planning add operational surface area compared to simpler alternatives.
Higher idle memory footprint
74–79 GB peak VRAM on H100 versus 71–78 GB for vLLM. The pre-allocated KV cache pool and compiled engine consume more baseline memory.
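The pre-allocated KV cache pool dominates that baseline footprint, and its per-token cost follows directly from model shape. A back-of-envelope sketch using Llama-70B-like dimensions — all figures here are illustrative assumptions, not measurements:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int) -> int:
    """Bytes of KV cache per token: key + value tensors across all layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Llama-70B-ish shape with an FP8 (1 byte/element) KV cache -- assumed values.
per_tok = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=1)
pool_tokens = 100 * 4096  # e.g. 100 concurrent requests x 4k-token budget

print(per_tok)                        # 163840 bytes (~160 KiB per token)
print(per_tok * pool_tokens / 2**30)  # 62.5 (GiB for the whole pool)
```

Pre-allocating a pool of this size is what buys the throughput headroom, but it sits on the GPU whether or not requests are in flight — hence the higher idle footprint.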
No model hot-swapping
Switching the base model requires a server restart (~90s with a cached engine). No in-place model replacement while requests are in flight.
Windows deprecated (v0.18+)
Linux only. Windows support was deprecated in v0.18. Development and production must target Linux or WSL2.

Alternatives Comparison

How TensorRT-LLM stacks up against the two leading open-source alternatives for LLM serving.

| | TensorRT-LLM | vLLM | SGLang |
| --- | --- | --- | --- |
| Peak throughput | Highest (~2,780 tok/s) | Moderate (~2,400) | Moderate (~2,460) |
| Cold start | ~28 min (~90 s cached) | ~62 s | ~58 s |
| Hardware | NVIDIA only | NVIDIA, AMD ROCm, CPU | NVIDIA, AMD ROCm |
| Model switching | Recompile | Hot-swap | Hot-swap |
| Setup | Complex | Simple | Simple |
| Quantization | 10+ formats | FP8, AWQ, GPTQ | FP8, AWQ, GPTQ |
| Speculative decoding | 6 methods | Draft model, EAGLE | Draft model |
| Disaggregated serving | Built-in | Experimental | Not built-in |
| Prefix caching | Block reuse | Automatic | RadixAttention (superior) |
| Best for | Long-term production | Quick deploys, flexibility | Shared-prefix workloads |
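The prefix-caching row can be made concrete: all three engines key cached KV blocks on the token prefix, so requests sharing a system prompt skip recomputing it. A toy block-reuse sketch — the hashing scheme and block size are illustrative only, not any engine's actual implementation:

```python
BLOCK = 4  # tokens per KV block (real engines use larger blocks, e.g. 16-64)

def cached_blocks(cache: dict, tokens: list[int]) -> int:
    """Count leading blocks already cached, then insert the rest.
    Each block is keyed by the hash of the full prefix up to its end,
    so a block only matches when everything before it also matches."""
    hits = 0
    for end in range(BLOCK, len(tokens) + 1, BLOCK):
        key = hash(tuple(tokens[:end]))
        if key in cache and hits == (end // BLOCK) - 1:
            hits += 1        # contiguous hit from the start of the prompt
        cache[key] = True    # stands in for storing the block's KV tensors
    return hits

cache: dict = {}
system = [1, 2, 3, 4, 5, 6, 7, 8]  # shared system prompt (2 blocks)
print(cached_blocks(cache, system + [9, 10, 11, 12]))   # first request: 0 hits
print(cached_blocks(cache, system + [13, 14, 15, 16]))  # reuse: 2 prefix blocks
```

SGLang's RadixAttention generalizes this idea by organizing prefixes in a radix tree, which is why it leads on workloads with many overlapping prompts.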

The Honest Take

Use TensorRT-LLM when you have settled on a model, are running NVIDIA hardware, and throughput per dollar is your primary metric. The ~15% throughput advantage over vLLM translates to fewer GPUs at scale — real cost savings on large deployments.
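The "fewer GPUs" claim is straightforward arithmetic on per-GPU throughput. A sketch using the benchmark numbers from this page — the fleet-wide demand figure is a made-up example:

```python
import math

def gpus_needed(demand_tok_s: float, per_gpu_tok_s: float) -> int:
    """GPUs required to serve an aggregate tokens/s demand."""
    return math.ceil(demand_tok_s / per_gpu_tok_s)

demand = 250_000  # hypothetical fleet-wide demand, tokens/s
print(gpus_needed(demand, 2780))  # TensorRT-LLM: 90 GPUs
print(gpus_needed(demand, 2400))  # vLLM: 105 GPUs
```

At this scale the ~15% throughput edge saves 15 GPUs; below a handful of GPUs, rounding swallows the difference and the operational overhead dominates.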

For prototyping, model exploration, multi-vendor hardware, or small teams — start with vLLM. Its simplicity and hot-swap capability mean faster iteration.

The worst outcome is building a complex TRT-LLM pipeline for a model you will replace in two months.