Official Tools & Extensions

TensorRT-LLM ships with a growing set of first-party tools that cover the full lifecycle, from benchmarking to production serving. These are maintained alongside the core library and follow the same release cadence.

🌐 trtllm-serve (stable since v1.0)
OpenAI-compatible HTTP server built into TensorRT-LLM. It exposes the chat completions, completions, and models endpoints, so any OpenAI-compatible client can point at it unchanged. The simplest path to serving: trtllm-serve "model-name".
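A minimal launch-and-query sketch. The model id, host/port flags, and request fields below are illustrative; check `trtllm-serve --help` for the exact options in your installed version:

```shell
# Launch the server (any Hugging Face model id or local checkpoint path
# that TensorRT-LLM supports; this one is just an example).
trtllm-serve "meta-llama/Llama-3.1-8B-Instruct" --host 0.0.0.0 --port 8000

# From another shell: a standard OpenAI chat-completions request.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```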
📊 trtllm-bench (benchmarking)
Benchmarking tool for measuring throughput, latency, and time to first token (TTFT) under configurable load patterns. Essential for capacity planning before production deployment.
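To make the metrics concrete, here is an illustrative sketch of how throughput, TTFT, and end-to-end latency are derived from per-request timing data. This is not trtllm-bench's implementation; the names and structure are invented for the example:

```python
from dataclasses import dataclass

@dataclass
class RequestTiming:
    """Per-request timestamps (seconds) and output size."""
    submit_time: float       # request enqueued
    first_token_time: float  # first output token produced
    finish_time: float       # last output token produced
    output_tokens: int

def summarize(timings: list[RequestTiming]) -> dict[str, float]:
    """Aggregate latency/throughput metrics over a benchmark run."""
    total_tokens = sum(t.output_tokens for t in timings)
    wall_time = max(t.finish_time for t in timings) - min(t.submit_time for t in timings)
    ttfts = [t.first_token_time - t.submit_time for t in timings]
    return {
        "throughput_tok_s": total_tokens / wall_time,
        "mean_ttft_s": sum(ttfts) / len(ttfts),
        "mean_e2e_latency_s": sum(t.finish_time - t.submit_time for t in timings) / len(timings),
    }

# Two overlapping requests: 160 tokens total over 2 s of wall time.
stats = summarize([
    RequestTiming(0.0, 0.2, 2.0, 100),
    RequestTiming(0.5, 0.8, 1.5, 60),
])
print(stats)  # throughput 80 tok/s, mean TTFT ≈ 0.25 s
```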
trtllm-eval (quality evaluation)
Evaluation tool for measuring model quality (perplexity, accuracy) after quantization. Validates that FP8/INT4 quantization hasn't degraded output quality below acceptable thresholds.
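The perplexity check reduces to a small piece of math: exponentiate the negative mean token log-probability and compare quantized vs. baseline. A toy sketch (not the tool's implementation; the numbers are made up):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-mean log-probability) over the evaluated tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

baseline  = perplexity([-1.2, -0.8, -2.1, -0.5])   # FP16 engine (toy numbers)
quantized = perplexity([-1.3, -0.8, -2.2, -0.5])   # FP8 engine  (toy numbers)

# A small relative drift suggests quantization is quality-neutral;
# the acceptable threshold is a per-deployment judgment call.
rel_drift = quantized / baseline - 1.0
print(f"baseline={baseline:.3f} quantized={quantized:.3f} drift={rel_drift:+.1%}")
```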
NVIDIA ModelOpt (nvidia-modelopt)
Quantization calibration toolkit. Handles the data collection and scaling factor computation needed for post-training quantization (PTQ). Required for INT8 SmoothQuant and recommended for FP8. Shipped as a separate package (nvidia-modelopt), tightly integrated with TensorRT-LLM's quantization pipeline.
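The core of PTQ calibration is computing scaling factors from observed activation ranges. The toy sketch below shows the symmetric amax-based scheme in miniature; it is not ModelOpt's API, and the function names are invented:

```python
def calibrate_scale(calibration_batches, qmax=127.0):
    """Symmetric per-tensor PTQ: scale = max|x| over calibration data / qmax."""
    amax = max(abs(v) for batch in calibration_batches for v in batch)
    return amax / qmax

def fake_quant(x, scale, qmax=127.0):
    """Quantize-dequantize round trip, used to estimate INT8 error."""
    q = max(-qmax, min(qmax, round(x / scale)))
    return q * scale

batches = [[0.03, -1.9, 0.4], [2.54, -0.7]]
scale = calibrate_scale(batches)       # 2.54 / 127 ≈ 0.02
print(scale, fake_quant(1.0, scale))   # values outside ±2.54 get clipped
```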
🔗 NVIDIA NIXL (KV transfer)
KV cache transfer protocol for disaggregated serving. Enables efficient GPU-to-GPU KV cache migration between prefill and decode pools. Used internally by TensorRT-LLM's disaggregated serving mode alongside MPI and UCX.

Community Ecosystem

TensorRT-LLM integrates with NVIDIA's broader inference stack and the open-source ML ecosystem. These integrations range from production-grade serving platforms to application frameworks.

🏭 Triton Inference Server (production serving platform)
The primary production serving platform for TensorRT-LLM engines. The tensorrtllm_backend wraps TensorRT-LLM with enterprise features:
  • In-flight batching and paged KV cache
  • Multi-Instance GPU (MIG) support
  • LoRA adapter hot-loading
  • GenAI-Perf benchmarking tool
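As a sketch of what a client call looks like, assuming the example `ensemble` model layout from the tensorrtllm_backend repository (the model name, port, and request fields vary with your model repository and config):

```shell
# Query a TensorRT-LLM model deployed behind Triton's HTTP frontend.
# "ensemble" wires preprocessor -> TRT-LLM engine -> postprocessor.
curl -X POST http://localhost:8000/v2/models/ensemble/generate \
  -H "Content-Type: application/json" \
  -d '{"text_input": "What is TensorRT-LLM?", "max_tokens": 64}'
```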
📦 NVIDIA NIM (containerized microservices)
Bundles model weights + optimized inference engine + OpenAI-compatible API into a single container. Auto-selects the best backend (TensorRT-LLM, vLLM, or SGLang) and builds optimized engines for the target GPU.
  • Automatic engine compilation
  • Pre-optimized GPU configurations
  • Zero manual compilation
🌍 NVIDIA Dynamo (datacenter-scale orchestrator)
Inference orchestrator for disaggregated serving. Provides smart request routing based on KV cache locality, decoupled pre/post-processing, and Kubernetes-native autoscaling.
  • KV-cache-aware request routing
  • Disaggregated prefill/decode
  • Kubernetes-native scaling
🧠 NVIDIA NeMo (training framework)
Training framework with direct export to TensorRT-LLM: models trained with NeMo can be converted to TensorRT-LLM engines without an intermediate HuggingFace conversion, streamlining the train-to-serve pipeline.
  • Direct TRT-LLM export
  • No HuggingFace intermediary
  • End-to-end NVIDIA pipeline
🍻 BentoML (ML serving framework)
Higher-level abstraction for packaging TensorRT-LLM models as containerized services with built-in scaling, monitoring, and model versioning.
  • TRT-LLM engine packaging
  • Built-in autoscaling
  • Model versioning
🔌 LangChain / LlamaIndex (LLM application frameworks)
Connect to TensorRT-LLM via its OpenAI-compatible API endpoint (trtllm-serve), enabling use in RAG pipelines, agents, and multi-step reasoning chains.
  • OpenAI-compatible API bridge
  • RAG pipeline integration
  • Agent framework support
💡 Choosing between Triton and trtllm-serve: Use trtllm-serve for quick deployment with OpenAI-compatible endpoints. Use Triton when you need multi-model serving, A/B testing, model pipelines (preprocessor + engine + postprocessor), or enterprise metrics and monitoring.

Common Integration Patterns

These four patterns represent the most common ways TensorRT-LLM is deployed in production, from simple single-server setups to datacenter-scale disaggregated architectures.

1. TRT-LLM + Triton (Standard Production)
The most common production deployment pattern. Triton handles HTTP/gRPC endpoints, request queuing, health checks, and metrics. TensorRT-LLM handles the actual inference with in-flight batching and paged attention.
Client → Triton Inference Server → tensorrtllm_backend (Preprocessor → TRT-LLM Engine → Postprocessor)
2. TRT-LLM + Dynamo (Disaggregated Serving)
For high-throughput workloads mixing long-context prefill with real-time generation. Dynamo routes requests to specialized GPU pools and manages KV cache transfer between them.
Client → Dynamo Router → Prefill Pool → KV Transfer (NIXL) → Decode Pool → Response
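A purely illustrative model of the handoff: the prefill worker processes the whole prompt once and emits a KV cache, which is then "transferred" to a decode worker that continues generation. Real systems move GPU tensors over NIXL/UCX; here the cache is just a Python dict, and the sampling logic is a stand-in:

```python
def prefill(prompt_tokens, num_layers=2):
    """Process the full prompt once; return first token + per-layer KV cache."""
    kv_cache = {layer: [(t, t) for t in prompt_tokens] for layer in range(num_layers)}
    first_token = sum(prompt_tokens) % 100           # stand-in for real sampling
    return first_token, kv_cache

def decode(first_token, kv_cache, steps=3):
    """Continue generation on a different worker using the transferred cache."""
    out = [first_token]
    for _ in range(steps):
        nxt = (out[-1] + len(kv_cache[0])) % 100     # stand-in for real sampling
        for layer in kv_cache:                       # decode appends one KV entry/step
            kv_cache[layer].append((nxt, nxt))
        out.append(nxt)
    return out

tok, cache = prefill([10, 20, 30])   # runs in the prefill pool
generated = decode(tok, cache)       # cache handed off to the decode pool
print(generated)
```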
3. NeMo → TRT-LLM → NIM (Full Lifecycle)
End-to-end pipeline from model training to optimized production serving, staying within NVIDIA's ecosystem throughout.
Training (NeMo) → Export → Engine Build (TRT-LLM) → Container (NIM) → Deploy
4. TRT-LLM + Speculative + LoRA
Combines multiple optimization strategies: FP8 quantization for memory efficiency, EAGLE speculative decoding for latency reduction, and LoRA for task specialization without recompiling the base engine.
Base Model (FP8 engine, main inference) + EAGLE Draft Head (speculative tokens) + LoRA Adapter (task adaptation)
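The speculative half of this pattern can be sketched as a verify-and-accept loop. This is a simplified greedy version (real EAGLE uses probabilistic acceptance over draft and target distributions); `verify_fn` and the toy target model are invented for the example:

```python
def speculative_step(draft_tokens, verify_fn):
    """One speculation round: the draft head proposes a block of tokens; the
    target model checks them and keeps the longest prefix it agrees with.

    verify_fn(prefix) -> token the target model would emit after `prefix`.
    """
    accepted = []
    for tok in draft_tokens:
        target_tok = verify_fn(accepted)
        if target_tok != tok:
            accepted.append(target_tok)       # target overrides; stop speculating
            break
        accepted.append(tok)
    else:
        accepted.append(verify_fn(accepted))  # bonus token after full acceptance
    return accepted

# Toy target model: always emits previous token + 1 (starting at 5).
target = lambda prefix: (prefix[-1] + 1) if prefix else 5

print(speculative_step([5, 6, 7], target))   # all accepted + bonus -> [5, 6, 7, 8]
print(speculative_step([5, 9, 7], target))   # mismatch at position 1 -> [5, 6]
```

The win is that every accepted draft token costs one target-model forward pass per block instead of one per token, which is where the latency reduction comes from.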
🚀 Start simple, scale up: Most teams begin with Pattern 1 (TRT-LLM + Triton) and evolve to Pattern 2 (disaggregated) as throughput requirements grow. Pattern 4 (speculative + LoRA) can be added to any serving configuration.