Official Tools & Extensions
TensorRT-LLM ships with a growing set of first-party tools that cover the full lifecycle from benchmarking to production serving. These are maintained alongside the core library and follow the same release cadence.
trtllm-serve
Stable since v1.0
OpenAI-compatible HTTP server built into TensorRT-LLM. Supports the chat completions, completions, and models endpoints. The simplest path to serving: trtllm-serve "model-name". Works as a drop-in backend for any OpenAI-compatible client.
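Once the server is up, any OpenAI-style client can talk to it. As a minimal sketch, the helper below (a hypothetical chat_request, not part of TensorRT-LLM) builds the JSON body such a client would POST to the server's /v1/chat/completions endpoint:

```python
import json

# Hypothetical helper: build a request body for trtllm-serve's
# OpenAI-compatible chat completions endpoint. Field names follow
# the OpenAI chat API; "model-name" is a placeholder.
def chat_request(model: str, prompt: str, max_tokens: int = 128) -> str:
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(body)

payload = chat_request("model-name", "Hello!")
```

Because the wire format is the standard OpenAI one, switching an existing application from the OpenAI API to trtllm-serve is typically just a base-URL change.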
trtllm-bench
Benchmarking
Benchmarking tool for measuring throughput, latency, and time-to-first-token (TTFT) under configurable load patterns. Essential for capacity planning before production deployment.
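To make the three metrics concrete, here is a toy aggregation of the kind a load test reports, computed from hand-made per-request timings (all numbers are illustrative, and this is not trtllm-bench's actual output format):

```python
# Summarize per-request timings into throughput, median TTFT, and mean
# latency. Each record is (ttft_s, total_s, output_tokens); requests are
# assumed to start at the same instant, so wall time is the slowest one.
def summarize(records):
    n = len(records)
    total_tokens = sum(r[2] for r in records)
    wall = max(r[1] for r in records)
    ttfts = sorted(r[0] for r in records)
    return {
        "throughput_tok_s": total_tokens / wall,
        "p50_ttft_s": ttfts[n // 2],
        "mean_latency_s": sum(r[1] for r in records) / n,
    }

stats = summarize([(0.05, 1.0, 100), (0.07, 2.0, 200), (0.06, 1.5, 150)])
```

Throughput (tokens per second of wall time) and TTFT usually trade off against each other as batch size grows, which is why both matter for capacity planning.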
trtllm-eval
Quality
Evaluation tool for measuring model quality (perplexity, accuracy) after quantization. Validates that FP8/INT4 quantization hasn't degraded output quality below acceptable thresholds.
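Perplexity, the headline metric in such checks, is just the exponentiated negative mean of per-token log-probabilities. A minimal sketch (the log-prob values are made up, not real model output):

```python
import math

# Perplexity from per-token log-probabilities: lower is better, and a
# quantized engine whose token log-probs drop slightly shows a higher PPL.
def perplexity(logprobs):
    return math.exp(-sum(logprobs) / len(logprobs))

ppl_fp16 = perplexity([-1.0, -2.0, -1.5])  # baseline engine
ppl_fp8  = perplexity([-1.1, -2.1, -1.6])  # hypothetical quantized engine
```

A typical acceptance test is a relative threshold, e.g. rejecting a quantized engine whose perplexity rises more than a few percent over the FP16 baseline.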
NVIDIA ModelOpt
nvidia-modelopt
Quantization calibration toolkit. Handles the data collection and scaling factor computation needed for post-training quantization (PTQ). Required for INT8 SmoothQuant and recommended for FP8. Shipped as a separate package (nvidia-modelopt), tightly integrated with TensorRT-LLM's quantization pipeline.
NVIDIA NIXL
KV Transfer
KV cache transfer protocol for disaggregated serving. Enables efficient GPU-to-GPU KV cache migration between prefill and decode pools. Used internally by TensorRT-LLM's disaggregated serving mode alongside MPI and UCX.
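To get a feel for how much data such a transfer moves per request, here is a back-of-envelope sketch. The layer/head/dimension figures are illustrative Llama-7B-like assumptions, not TensorRT-LLM defaults:

```python
# KV cache volume per request: 2 (K and V) x layers x KV heads x head
# dim x bytes per element, times the number of prompt tokens. This is
# the payload a prefill-to-decode migration has to ship.
def kv_bytes(tokens, layers=32, kv_heads=32, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

mib = kv_bytes(1024) / 2**20  # FP16 cache for a 1024-token prompt
```

At hundreds of megabytes per long prompt, the transfer path matters, which is why NIXL targets direct GPU-to-GPU movement rather than staging through host memory.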
Community Ecosystem
TensorRT-LLM integrates with NVIDIA's broader inference stack and the open-source ML ecosystem. These integrations range from production-grade serving platforms to application frameworks.
Triton Inference Server
Production serving platform
The primary production serving platform for TensorRT-LLM engines. The tensorrtllm_backend wraps TensorRT-LLM with enterprise features:
- In-flight batching and paged KV cache
- Multi-Instance GPU (MIG) support
- LoRA adapter hot-loading
- GenAI-Perf benchmarking tool
NVIDIA NIM
Containerized microservices
Bundles model weights + optimized inference engine + OpenAI-compatible API into a single container. Auto-selects the best backend (TensorRT-LLM, vLLM, or SGLang) and builds optimized engines for the target GPU.
- Automatic engine compilation
- Pre-optimized GPU configurations
- Zero manual compilation
NVIDIA Dynamo
Datacenter-scale orchestrator
Datacenter-scale inference orchestrator for disaggregated serving. Provides smart request routing based on KV cache locality, decoupled pre/post-processing, and Kubernetes-native autoscaling.
- KV-cache-aware request routing
- Disaggregated prefill/decode
- Kubernetes-native scaling
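KV-cache-aware routing can be illustrated with a toy policy: send each request to the worker that already holds the longest cached prefix of its token sequence. Worker names and cache contents below are made up, and this is a sketch of the idea rather than Dynamo's actual router:

```python
# Toy KV-cache-aware router: pick the worker with the longest cached
# prefix match, so the prefill for that prefix can be skipped.
def shared_prefix(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(tokens, worker_caches):
    # worker_caches maps worker name -> list of cached token sequences.
    best, best_len = None, -1
    for worker, seqs in worker_caches.items():
        hit = max((shared_prefix(tokens, s) for s in seqs), default=0)
        if hit > best_len:
            best, best_len = worker, hit
    return best

target = route([1, 2, 3, 4], {"gpu0": [[1, 2, 9]], "gpu1": [[1, 2, 3, 4, 5]]})
```

Real routers also weigh load and cache eviction, but prefix reuse is the core signal that makes cache-aware routing pay off for shared system prompts.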
NVIDIA NeMo
Training framework
Training framework with direct export to TensorRT-LLM. Models trained with NeMo can be converted to TensorRT-LLM engines without intermediate HuggingFace conversion, streamlining the train-to-serve pipeline.
- Direct TRT-LLM export
- No HuggingFace intermediary
- End-to-end NVIDIA pipeline
BentoML
ML serving framework
Higher-level abstraction for packaging TensorRT-LLM models as containerized services with built-in scaling, monitoring, and model versioning.
- TRT-LLM engine packaging
- Built-in autoscaling
- Model versioning
LangChain / LlamaIndex
LLM application frameworks
Connect to TensorRT-LLM via its OpenAI-compatible API endpoint (trtllm-serve), enabling use in RAG pipelines, agents, and multi-step reasoning chains.
- OpenAI-compatible API bridge
- RAG pipeline integration
- Agent framework support
Choosing between Triton and trtllm-serve: Use trtllm-serve for quick deployment with OpenAI-compatible endpoints. Use Triton when you need multi-model serving, A/B testing, model pipelines (preprocessor + engine + postprocessor), or enterprise metrics and monitoring.
Common Integration Patterns
These four patterns represent the most common ways TensorRT-LLM is deployed in production, from simple single-server setups to datacenter-scale disaggregated architectures.
1. TRT-LLM + Triton (Standard Production)
The most common production deployment pattern. Triton handles HTTP/gRPC endpoints, request queuing, health checks, and metrics. TensorRT-LLM handles the actual inference with in-flight batching and paged attention.
Client → Triton Inference Server → tensorrtllm_backend
  Preprocessor → TRT-LLM Engine → Postprocessor
2. TRT-LLM + Dynamo (Disaggregated Serving)
For high-throughput workloads mixing long-context prefill with real-time generation. Dynamo routes requests to specialized GPU pools and manages KV cache transfer between them.
Client → Dynamo Router → Prefill Pool → KV Transfer (NIXL) → Decode Pool → Response
3. NeMo → TRT-LLM → NIM (Full Lifecycle)
End-to-end pipeline from model training to optimized production serving, staying within NVIDIA's ecosystem throughout.
Training (NeMo) → Export → Engine Build (TRT-LLM) → Container (NIM) → Deploy
4. TRT-LLM + Speculative + LoRA
Combines multiple optimization strategies: FP8 quantization for memory efficiency, EAGLE speculative decoding for latency reduction, and LoRA for task specialization without recompiling the base engine.
Base Model (FP8 Engine): main inference
+ EAGLE Draft Head: speculative tokens
+ LoRA Adapter: task adaptation
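The speculative part of this pattern can be sketched with greedy verification: the draft head proposes several tokens, the base engine scores them in a single pass, and the longest matching prefix is accepted along with one corrected token at the first mismatch. A simplified illustration (token IDs are arbitrary, and EAGLE's actual acceptance rule is probabilistic rather than this exact-match version):

```python
# Greedy speculative-decoding verification: accept draft tokens until
# they diverge from what the base model would emit, then take the base
# model's token at the first mismatch and stop.
def verify(draft_tokens, target_tokens):
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            accepted.append(t)
            return accepted
        accepted.append(d)
    return accepted

out = verify([5, 8, 3, 3], [5, 8, 7, 2])  # 2 draft tokens accepted + 1 fix
```

Each verification pass thus emits at least one token and often several, which is where the latency reduction comes from; because the draft head and LoRA adapters attach to the same FP8 base engine, the three optimizations compose without rebuilding it.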
Start simple, scale up: Most teams begin with Pattern 1 (TRT-LLM + Triton) and evolve to Pattern 2 (disaggregated) as throughput requirements grow. Pattern 4 (speculative + LoRA) can be added to any serving configuration.