LLM Inference Engine

NVIDIA TensorRT-LLM

An open-source library that compiles and optimizes LLM inference on NVIDIA GPUs, delivering high throughput through kernel-level optimizations, in-flight (continuous) batching, and paged attention.

License Apache 2.0
Language C++ / Python
Latest Version 1.2.0
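The overview mentions paged attention and continuous batching. A toy Python sketch of the underlying idea follows; it is illustrative only and does not reflect TensorRT-LLM's actual data structures. The KV cache is carved into fixed-size blocks, each sequence holds a block table instead of one contiguous buffer, and finished sequences release blocks immediately so waiting requests can join the batch:

```python
class PagedKVCache:
    """Toy paged KV-cache allocator (illustrative, not TensorRT-LLM's real one)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of free block IDs
        self.block_tables = {}                      # seq_id -> list of block IDs
        self.seq_lens = {}                          # seq_id -> tokens stored

    def append_token(self, seq_id):
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # current block full (or first token)
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id):
        # Finished sequences return blocks to the pool right away, which is
        # what lets a scheduler admit new requests mid-batch (continuous batching).
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(5):
    cache.append_token("req-A")       # 5 tokens span two 4-token blocks
print(len(cache.block_tables["req-A"]))  # -> 2
```

Real engines also map block IDs to GPU memory and share blocks across sequences (e.g. for common prefixes); this sketch only shows the allocation bookkeeping.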
- 💡 Core Concepts (Beginner): Key abstractions: Engines, Builders, KV Cache, In-Flight Batching, and more.
- 🏗 Architecture (Intermediate): Three-layer design: Python API, PyExecutor, and C++ Runtime.
- ⚙️ How It Works (Intermediate): Compilation pipeline, attention mechanisms, memory management, speculative decoding.
- 💻 Implementation Details (Advanced): Getting started, configuration, code patterns, and a source code walkthrough.
- 🎯 Use Cases (Beginner–Intermediate): When to use TensorRT-LLM, when not to, and real-world deployments.
- 🔌 Ecosystem & Integrations (Intermediate): Triton, NIM, Dynamo, NeMo, and community integrations.
- FAQ (All Levels): Hard practical questions a senior engineer would ask.
- ⚖️ Trade-offs & Limitations (Intermediate): Strengths, limitations, and an honest comparison with vLLM and SGLang.

Quick Start

Get up and running in minutes. TensorRT-LLM auto-downloads and optimizes models for your GPU.

```bash
# Start an OpenAI-compatible server (auto-downloads and optimizes)
trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```
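Because the server speaks the OpenAI chat-completions protocol, any OpenAI-compatible client can talk to it. A minimal request using only the standard library, assuming the default localhost:8000 endpoint (the network call is left commented out so the snippet runs without a live server):

```python
import json
import urllib.request

# Chat-completions payload for the local trtllm-serve endpoint.
# Adjust host/port if you started the server with non-default options.
payload = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "messages": [{"role": "user", "content": "Hello, my name is"}],
    "max_tokens": 64,
    "temperature": 0.8,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server running, send the request and print the reply:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```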
```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

for output in llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8)):
    print(output.outputs[0].text)
```

Dive into Core Concepts to understand Engines and KV Cache, or jump to Implementation Details for more code patterns.