LLM Inference Engine

NVIDIA TensorRT-LLM

An open-source library that compiles and optimizes LLM inference on NVIDIA GPUs, delivering high throughput through kernel-level optimizations, in-flight (continuous) batching, and paged attention.

License Apache 2.0
Language C++ / Python
Latest Version 1.2.0
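The overview mentions paged attention and continuous batching. A toy Python sketch of the underlying idea follows; it is illustrative only and does not reflect TensorRT-LLM's actual data structures. The KV cache is carved into fixed-size blocks, each sequence holds a block table instead of one contiguous buffer, and finished sequences release blocks immediately so waiting requests can join the batch:

```python
class PagedKVCache:
    """Toy paged KV-cache allocator (illustrative, not TensorRT-LLM's real one)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of free block IDs
        self.block_tables = {}                      # seq_id -> list of block IDs
        self.seq_lens = {}                          # seq_id -> tokens stored

    def append_token(self, seq_id):
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # current block full (or first token)
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id):
        # Finished sequences return blocks to the pool right away, which is
        # what lets a scheduler admit new requests mid-batch (continuous batching).
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(5):
    cache.append_token("req-A")       # 5 tokens span two 4-token blocks
print(len(cache.block_tables["req-A"]))  # -> 2
```

Real engines also map block IDs to GPU memory and share blocks across sequences (e.g. for common prefixes); this sketch only shows the allocation bookkeeping.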
- 💡 Core Concepts (Beginner): Key abstractions: Engines, Builders, KV Cache, In-Flight Batching, and more.
- 🏗 Architecture (Intermediate): Three-layer design: Python API, PyExecutor, and C++ Runtime.
- ⚙️ How It Works (Intermediate): Compilation pipeline, attention mechanisms, memory management, speculative decoding.
- 💻 Implementation Details (Advanced): Getting started, configuration, code patterns, and a source code walkthrough.
- 🎯 Use Cases (Beginner–Intermediate): When to use TensorRT-LLM, when not to, and real-world deployments.
- 🔌 Ecosystem & Integrations (Intermediate): Triton, NIM, Dynamo, NeMo, and community integrations.
- FAQ (All Levels): Hard practical questions a senior engineer would ask.
- ⚖️ Trade-offs & Limitations (Intermediate): Strengths, limitations, and an honest comparison with vLLM and SGLang.

Quick Start

Get up and running in minutes. TensorRT-LLM auto-downloads and optimizes models for your GPU.

```bash
# Start an OpenAI-compatible server (auto-downloads and optimizes)
trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```
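Because the server speaks the OpenAI chat-completions protocol, any OpenAI-compatible client can talk to it. A minimal request using only the standard library, assuming the default localhost:8000 endpoint (the network call is left commented out so the snippet runs without a live server):

```python
import json
import urllib.request

# Chat-completions payload for the local trtllm-serve endpoint.
# Adjust host/port if you started the server with non-default options.
payload = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "messages": [{"role": "user", "content": "Hello, my name is"}],
    "max_tokens": 64,
    "temperature": 0.8,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server running, send the request and print the reply:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```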
```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

for output in llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8)):
    print(output.outputs[0].text)
```

Dive into Core Concepts to understand Engines and KV Cache, or jump to Implementation Details for more code patterns.