The Building Blocks of TensorRT-LLM
TensorRT-LLM transforms large language models into high-performance inference engines for NVIDIA GPUs. Understanding these eight core concepts is the foundation for everything else in the system. Click each card to reveal the analogy and why it matters.
Engine
A compiled, optimized binary that runs your model on the GPU. The end product of the TensorRT-LLM compilation pipeline.
Analogy: A compiled C program versus an interpreted Python script. The compilation step analyzes every operation, fuses layers, selects the fastest CUDA kernels for your specific GPU, and bakes the weights into an optimized binary.
Why it matters: Engines are GPU-specific (an H100 engine won't run on an A100) and take time to compile (~28 minutes for a 70B model), but once built, they deliver maximum throughput. This is the fundamental trade-off at the heart of TensorRT-LLM.
Builder
The compiler that transforms a model definition into an Engine — takes architecture plus optimization settings and produces the optimized binary.
Analogy: Like gcc for neural networks. It takes your source (model architecture) plus compiler flags (quantization level, tensor parallelism, max sequence length) and produces a tuned binary.
Why it matters: The Builder is where you make key optimization decisions. You interact with it through the CLI (trtllm-build) or the Python Builder class. Changing Builder settings means recompiling the Engine.
LLM API
The high-level Python interface — a single entry point that handles downloading, compilation, and inference in one call.
Analogy: Like HuggingFace's pipeline() but with TensorRT optimization under the hood. Initialize with a model name and you get an optimized inference pipeline.
Why it matters: The LLM API is the recommended entry point for most users. LLM("meta-llama/Llama-3-8B") gets you from zero to optimized inference — it abstracts away the Builder, Engine, and Executor complexity.
KV Cache (Paged)
Stores previously computed Key and Value tensors during autoregressive generation so the model doesn't reprocess the entire prompt for each new token.
Analogy: A notepad where the model writes down what it has already "read." Paged KV cache breaks this notepad into fixed-size blocks, like memory pages in an OS, dynamically allocating and freeing them.
Why it matters: Paging avoids wasting GPU memory on pre-allocated buffers for maximum sequence lengths that most requests never reach. This directly increases the number of concurrent requests a GPU can handle.
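The block-allocation idea can be shown with a toy sketch. This is illustrative only, not TensorRT-LLM's actual KV cache manager: the class name, block size, and method names are all invented for the example. A sequence claims fixed-size blocks only as its token count crosses each block boundary, and returns them when it finishes, so memory consumption tracks actual sequence length rather than a worst-case maximum.

```python
BLOCK_TOKENS = 16  # tokens per KV block (assumed size for illustration)

class PagedKVCache:
    """Toy paged KV-cache allocator (conceptual sketch only)."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))   # pool of physical block ids
        self.block_tables = {}                       # seq_id -> list of block ids

    def append_token(self, seq_id, position):
        """Ensure the block holding `position` is allocated for this sequence."""
        table = self.block_tables.setdefault(seq_id, [])
        if position // BLOCK_TOKENS >= len(table):   # crossed a block boundary?
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted: request must wait")
            table.append(self.free_blocks.pop())

    def release(self, seq_id):
        """Sequence finished: return all of its blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=4)    # 4 * 16 = 64 tokens of total capacity
for pos in range(20):                 # a 20-token sequence needs only 2 blocks
    cache.append_token("req-A", pos)
print(len(cache.free_blocks))         # 2 blocks still free for other requests
cache.release("req-A")
print(len(cache.free_blocks))         # all 4 blocks free again
```

A pre-allocated (non-paged) buffer sized for a 64-token maximum would have reserved all 4 blocks for this one 20-token request; paging leaves half the pool available to concurrent requests.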
In-Flight Batching
Dynamic request batching that processes new requests as soon as existing ones free up resources, rather than waiting for an entire batch to finish.
Analogy: A restaurant that seats new diners as soon as a table frees up, rather than waiting for the entire dining room to finish eating before seating anyone new.
Why it matters: Static batching wastes GPU cycles waiting for the slowest request. In-flight batching processes prefill (the "reading the prompt" phase) and generation (the "writing tokens" phase) together, dramatically improving GPU utilization.
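A minimal scheduler simulation makes the benefit concrete. This is a toy sketch, not the real TensorRT-LLM scheduler: the function name, the one-token-per-step model, and the batch size are all invented. Each decode step advances every active request by one token; finished requests leave the batch immediately, and waiting requests take the freed slots on the very next step.

```python
from collections import deque

def serve_inflight(token_counts, max_batch=2):
    """Toy in-flight batching loop: admit waiting requests as slots free up."""
    waiting = deque(enumerate(token_counts))   # (req_id, tokens_remaining)
    active, steps, finished_at = [], 0, {}
    while waiting or active:
        while waiting and len(active) < max_batch:   # fill freed slots at once
            active.append(list(waiting.popleft()))
        steps += 1                                   # one batched decode step
        for req in active:
            req[1] -= 1                              # each request emits a token
        for req in [r for r in active if r[1] == 0]:
            finished_at[req[0]] = steps              # short requests exit early
            active.remove(req)
    return steps, finished_at

steps, finished = serve_inflight([2, 8, 3], max_batch=2)
print(steps, finished)   # 8 {0: 2, 2: 5, 1: 8}
```

With static batching the same workload takes 11 steps: the batch [2, 8] must run 8 steps before request 2 is even admitted, then 3 more. In-flight batching finishes everything in 8 because the 3-token request slips into the slot the 2-token request vacated.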
Quantization
Reduces the numerical precision of model weights and activations to use less memory and compute, with minimal quality loss.
Analogy: Compressing a high-resolution photo to JPEG — you lose some fidelity but use much less storage and bandwidth.
Why it matters: TensorRT-LLM supports 10+ quantization formats: FP8 (Hopper GPUs), NVFP4 (Blackwell), INT8 SmoothQuant, INT4 AWQ, and more. FP8 on H100 typically delivers 1.4–2.3x speedup with minimal quality loss.
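The core mechanic is easy to show with symmetric INT8 quantization, the simplest of these formats. This is a conceptual sketch with invented function names; production formats like FP8 and INT4-AWQ use per-channel scales, calibration data, and hardware-specific kernels that are not shown here. A single scale factor maps the largest-magnitude weight to 127, so each weight shrinks from a 32-bit float to an 8-bit integer at the cost of a bounded rounding error.

```python
def quantize_int8(weights):
    """Symmetric INT8 quantization: map [-max, +max] onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127   # largest weight maps to 127
    q = [round(w / scale) for w in weights]      # 8-bit ints: 4x smaller than FP32
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP values from the integers and the shared scale."""
    return [v * scale for v in q]

weights = [0.82, -1.27, 0.05, 0.4]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)     # small integers instead of 32-bit floats
print(err)   # worst-case error is at most scale / 2
```

The "JPEG compression" trade-off is visible directly: the integers carry the signal, the rounding error is bounded by half the scale, and the memory footprint drops fourfold.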
Plugins (Custom Kernels)
Custom CUDA kernels registered as TensorRT operations for cases where automatic optimization can't discover the best implementation.
Analogy: Hand-tuned assembly subroutines called from a higher-level program. When the compiler's best effort isn't fast enough, engineers write optimized code by hand.
Why it matters: Plugins power FlashAttention for the attention mechanism, NCCL-based all-reduce for multi-GPU communication, and specialized GEMM kernels for quantized matrix multiplies. They live in cpp/tensorrt_llm/plugins/.
Executor
The runtime orchestrator that manages the continuous inference loop: fetching requests, scheduling, executing, sampling tokens, and delivering results.
Analogy: An air traffic controller managing incoming flights (requests), runway allocation (GPU resources), and departure sequencing (response delivery).
Why it matters: The Executor runs the continuous loop: fetch new requests, schedule them for execution, allocate KV cache blocks, run the forward pass, sample tokens, and deliver results. The C++ Executor provides the low-level API; the Python PyExecutor wraps it.
How They Fit Together
A user initializes the LLM API with a model name. The Builder compiles the model into an optimized Engine, applying Quantization and inserting Plugins for specialized operations. At runtime, the Executor receives requests and uses In-Flight Batching to schedule them efficiently. As tokens are generated, the KV Cache stores intermediate results in paged memory blocks.
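The runtime half of this flow can be sketched as a toy loop. This is illustrative only: the real C++ Executor also manages KV cache blocks, CUDA streams, and multi-GPU state, and here the "forward pass" is faked with a random token id. All names and the fixed token budget are invented for the example.

```python
import random
from collections import deque

def executor_loop(prompts, max_tokens=3, seed=0):
    """Toy Executor tick: fetch -> schedule -> forward -> sample -> deliver."""
    rng = random.Random(seed)
    pending = deque(prompts)              # 1. fetch queue of incoming requests
    active, results = [], {}
    while pending or active:
        while pending:                    # 2. schedule waiting requests
            active.append({"prompt": pending.popleft(), "tokens": []})
        for req in active:                # 3. batched "forward pass" ...
            req["tokens"].append(rng.randrange(32000))   # 4. ... sample a token
        done = [r for r in active if len(r["tokens"]) == max_tokens]
        for r in done:                    # 5. deliver finished responses
            results[r["prompt"]] = r["tokens"]
            active.remove(r)
    return results

out = executor_loop(["hello", "world"])
print({k: len(v) for k, v in out.items()})   # {'hello': 3, 'world': 3}
```

Every component described above has a seat in this loop: the scheduler step is where in-flight batching happens, the forward pass is where the Engine (with its plugins) runs, and the delivery step is what the LLM API ultimately returns to the caller.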
TensorRT-LLM Inference Pipeline
User Code → LLM API → Builder → Engine → Executor
Runtime components: KV Cache (Paged) · Plugins (Kernels) · In-Flight Batching
Key Insight: The split between build-time and run-time is the fundamental design principle. The Builder does expensive optimization work once (layer fusion, kernel selection, quantization) so the Executor can run the Engine at maximum speed during serving. This is why TensorRT-LLM achieves higher throughput than eager-execution frameworks — the optimization cost is paid upfront, not on every request.
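A back-of-envelope calculation shows how quickly that upfront cost vanishes. The ~28-minute build time comes from the Engine card above; the request volume is an invented round number for illustration.

```python
# Amortizing a one-time engine build over a day of serving traffic.
build_seconds = 28 * 60          # ~28-minute 70B engine build (from above)
requests_per_day = 100_000       # assumed serving volume for illustration
amortized_ms = build_seconds * 1000 / requests_per_day
print(amortized_ms)              # 16.8 ms of build cost per request on day one
```

Under a tenth of a typical token's decode latency per request on the first day alone, and the same engine keeps serving on every day after. An eager-execution framework instead pays its optimization overhead (kernel dispatch, graph tracing) on every single request.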