TensorRT-LLM is an open-source library that compiles and optimizes LLM inference on NVIDIA GPUs, delivering high throughput through kernel-level optimizations, continuous batching, and paged attention.
Get up and running in minutes. TensorRT-LLM auto-downloads and optimizes models for your GPU.
```bash
# Start an OpenAI-compatible server (auto-downloads and optimizes)
trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```
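Because the server speaks the OpenAI API, any OpenAI-compatible client can query it. As a minimal sketch using only Python's standard library, the following builds a chat-completions request for the server; the `http://localhost:8000/v1/chat/completions` endpoint assumes trtllm-serve's default host and port, so adjust it for your deployment:

```python
import json
import urllib.request

# Request body in the OpenAI chat-completions format; the model name
# matches the one passed to trtllm-serve above.
payload = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "messages": [{"role": "user", "content": "Hello, my name is"}],
    "temperature": 0.8,
    "max_tokens": 64,
}

# Assumed default endpoint for trtllm-serve; change host/port as needed.
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# resp = urllib.request.urlopen(req)  # uncomment once the server is running
```

The same request works from `curl` or the official `openai` client by pointing its `base_url` at the server.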
```python
from tensorrt_llm import LLM, SamplingParams

# Auto-downloads and optimizes the model on first use.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

for output in llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8)):
    print(output.outputs[0].text)
```
Dive into Core Concepts to understand Engines and KV Cache, or jump to Implementation Details for more code patterns.