The Building Blocks of Ollama
Ollama's power comes from a handful of key concepts working together; each one is summarized below.
Model Library
Ollama's curated registry of pre-built models -- like an app store for LLMs.
GGUF Format
The single-file binary format (from the llama.cpp ecosystem) that stores model weights, tokenizer data, and metadata.
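Because GGUF is a plain binary container, its fixed-size header is easy to inspect. The sketch below parses the header fields (magic, version, tensor count, metadata key/value count) following the published GGUF layout; the synthetic bytes stand in for a real model file.

```python
# Sketch: parse the fixed-size GGUF header. Field layout follows the
# public GGUF specification; everything else here is illustrative.
import struct

def parse_gguf_header(data: bytes) -> dict:
    """Parse the first 24 bytes of a GGUF file (little-endian)."""
    magic, version = struct.unpack_from("<4sI", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    n_tensors, n_kv = struct.unpack_from("<QQ", data, 8)
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Synthetic header for demonstration; a real file would continue with
# metadata key/value pairs and tensor descriptors after these 24 bytes.
fake = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(parse_gguf_header(fake))
```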
Quantization
Reduces model precision to use less memory and run faster.
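The core idea is easy to sketch. The toy example below maps float weights to 8-bit integers with a single per-tensor scale; the block-wise 4-bit schemes used by many Ollama models are more elaborate, but the memory/precision trade-off is the same.

```python
# Toy symmetric quantization: float weights -> int8 with one scale factor.
# Real formats (e.g. block-wise 4-bit) use finer-grained scales.
def quantize(weights, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1              # 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]   # integers in [-127, 127]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.82, -1.27, 0.003, 0.5]
q, s = quantize(w)
print(q)                      # 1 byte per weight instead of 4
print(dequantize(q, s))       # recovered within one quantization step
```

Each weight now costs 1 byte instead of 4, at the price of rounding error bounded by the scale (here 0.01) -- which is why a heavily quantized model is smaller and faster but slightly less precise.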
Modelfile
Ollama's recipe file for building custom models -- like a Dockerfile for LLMs.
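To make the card concrete: a minimal Modelfile might look like the following. FROM, PARAMETER, and SYSTEM are real Modelfile directives; the model name and the specific settings are illustrative.

```
FROM llama3.2
PARAMETER temperature 0.2
PARAMETER num_ctx 4096
SYSTEM You are a terse assistant. Answer in one sentence.
```

Running ollama create terse-llama -f Modelfile builds a new local model from this recipe, much like docker build does from a Dockerfile.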
REST API
The HTTP interface that applications use to talk to the inference server.
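For example, a chat call is a POST to /api/chat on the local server (port 11434 by default) with a JSON body, answered as a stream of newline-delimited JSON objects, each carrying a partial message until "done" is true. The sketch below builds such a body and reassembles a reply; the sample stream is synthetic.

```python
# Sketch: request body for Ollama's /api/chat endpoint, and a parser for
# its newline-delimited JSON stream. The sample stream below is synthetic.
import json

def chat_request(model: str, prompt: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }

def collect_stream(lines) -> str:
    """Concatenate partial message content from NDJSON stream lines."""
    parts = []
    for line in lines:
        obj = json.loads(line)
        parts.append(obj.get("message", {}).get("content", ""))
        if obj.get("done"):
            break
    return "".join(parts)

# Shaped like what POST http://localhost:11434/api/chat streams back.
stream = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo!"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(collect_stream(stream))
```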
Scheduler
The internal brain managing which models are loaded in memory.
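A minimal sketch of that bookkeeping, assuming a single keep-alive window (Ollama's default is five minutes) and an injectable clock for testing; the real scheduler also weighs VRAM, concurrency, and request queueing.

```python
# Sketch of scheduler bookkeeping: track loaded models by last use and
# unload any that sit idle past the keep-alive window. Illustrative only.
import time

class Scheduler:
    def __init__(self, keep_alive: float = 300.0, clock=time.monotonic):
        self.keep_alive = keep_alive   # seconds; 300 mirrors Ollama's 5m default
        self.clock = clock
        self.loaded = {}               # model name -> last-used timestamp

    def touch(self, model: str):
        """Mark a model as just used (loading it on demand)."""
        self.loaded[model] = self.clock()

    def evict_idle(self):
        """Unload any model idle longer than the keep-alive window."""
        now = self.clock()
        for model, last in list(self.loaded.items()):
            if now - last > self.keep_alive:
                del self.loaded[model]

fake_time = [0.0]                      # fake clock so the example is instant
sched = Scheduler(keep_alive=300.0, clock=lambda: fake_time[0])
sched.touch("llama3.2")
fake_time[0] = 301.0                   # five minutes and a second later
sched.evict_idle()
print(sorted(sched.loaded))            # the idle model was unloaded
```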
Runner
The inference subprocess that actually executes model computations.
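The runner is an ordinary child process that the server starts and then talks to over a local channel. The sketch below imitates that spawn-and-handshake pattern with a stand-in Python child instead of a real inference binary.

```python
# Sketch: spawn a "runner" child process, wait for its readiness signal,
# then exchange one message. The child is a stand-in, not a real runner.
import subprocess
import sys

child_code = (
    "print('runner: ready', flush=True)\n"        # handshake: runner is up
    "print('token: ' + input(), flush=True)\n"    # echo one fake request
)

proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
proc.stdin.write("hello\n")
proc.stdin.flush()
ready = proc.stdout.readline().strip()   # block until the runner comes up
reply = proc.stdout.readline().strip()
proc.stdin.close()
proc.wait()
print(ready)
print(reply)
```

Isolating inference in a subprocess means a crash in the model code takes down only the runner, not the whole server -- the scheduler can simply respawn it.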
How They Fit Together
Request Lifecycle
1. You type ollama run llama3.2 -- the CLI sends a chat request to the REST API server.
2. The server asks the Scheduler whether the model is already loaded.
3. If not loaded, the Scheduler checks the local blob store for the GGUF model file (downloading from the Model Library if needed).
4. The Scheduler selects GPUs with enough memory and spawns a Runner subprocess.
5. The Runner loads the quantized weights, applies Modelfile settings, and starts generating tokens.
6. Tokens stream back to you in real time; the Scheduler sets a keep-alive timer for the loaded model.
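The lifecycle above can be sketched as plain control flow. Everything here is a stand-in (the blob store is a set, GPUs are a dict of free gigabytes), but the branching mirrors steps 2 through 6.

```python
# Sketch of the request lifecycle's decisions. Stand-in data structures;
# the real scheduler's logic is far more involved.
def handle_request(model, loaded, blob_store, gpus, vram_needed=4):
    # Step 2: reuse an already-loaded model if possible
    if model in loaded:
        return f"reuse runner for {model}"
    # Step 3: ensure the GGUF blob is cached locally
    if model not in blob_store:
        blob_store.add(model)          # stands in for a Model Library download
    # Step 4: pick a GPU with enough free memory, else fall back to CPU
    gpu = next((name for name, free in gpus.items() if free >= vram_needed), "cpu")
    # Steps 5-6: spawn a runner and remember it for keep-alive reuse
    loaded[model] = gpu
    return f"spawn runner for {model} on {gpu}"

loaded, blobs = {}, set()
gpus = {"gpu0": 2, "gpu1": 8}          # free VRAM in GiB (made-up numbers)
print(handle_request("llama3.2", loaded, blobs, gpus))  # first call spawns
print(handle_request("llama3.2", loaded, blobs, gpus))  # second call reuses
```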
Key insight: Every concept in Ollama has a clear job. The Model Library handles discovery, GGUF handles storage, Quantization handles compression, the Modelfile handles customization, the API handles communication, the Scheduler handles memory, and the Runner handles computation. This separation of concerns is what makes Ollama "just work" for most users.