The Building Blocks of Ollama

Ollama's power comes from a handful of key concepts working together. The cards below explain each one.

📚
Model Library
Ollama's curated registry of pre-built models -- like an app store for LLMs.

When you type ollama run llama3.2, Ollama checks the library at ollama.com/library, finds the right model variant for your hardware, and downloads it. Think of it as Docker Hub for language models: a central place where models are published with tags for different sizes and quantization levels (e.g., llama3.2:3b-q4_K_M).
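To make the Docker Hub analogy concrete, here is a rough sketch of how a reference like llama3.2:3b-q4_K_M splits into a model name and a tag, with a default tag when none is given. This is purely illustrative -- Ollama's real resolver also handles registry hosts and namespaces (e.g. ollama.com/library/llama3.2).

```python
def parse_model_ref(ref: str) -> tuple[str, str]:
    """Split a model reference like 'llama3.2:3b-q4_K_M' into (name, tag).

    Illustrative sketch only -- not Ollama's actual parsing code.
    """
    name, sep, tag = ref.partition(":")
    return name, (tag if sep else "latest")

print(parse_model_ref("llama3.2:3b-q4_K_M"))  # ('llama3.2', '3b-q4_K_M')
print(parse_model_ref("llama3.2"))            # ('llama3.2', 'latest')
```

The tag is where size and quantization variants live, so the same model name can point at many differently sized downloads.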

📦
GGUF Format
The binary file format that stores model weights and metadata.

GGUF is like a ZIP file specifically designed for neural networks -- it packages the model's weights, vocabulary, architecture parameters, and chat template into a single file that inference engines can memory-map directly. GGUF replaced the older GGML format and is designed for fast loading.
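The "single file" structure starts with a small fixed header. Per the GGUF specification, it holds a magic string, a version, a tensor count, and a metadata key/value count, all little-endian. A minimal parse of just that header:

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header: 4-byte magic, uint32 version,
    uint64 tensor count, uint64 metadata key/value count (little-endian)."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Synthetic header for demonstration: version 3, 2 tensors, 5 metadata entries.
header = struct.pack("<4sIQQ", b"GGUF", 3, 2, 5)
print(read_gguf_header(header))  # {'version': 3, 'tensors': 2, 'metadata_kv': 5}
```

After the header come the metadata entries (vocabulary, architecture parameters, chat template) and then the tensor data itself, aligned so it can be memory-mapped.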

🔍
Quantization
Reduces model precision to use less memory and run faster.

Like compressing a high-resolution photo to a smaller file while keeping it recognizable. A model trained with 16-bit weights might be quantized to 4-bit integers (Q4_K_M), reducing memory usage by roughly 4x with modest quality loss. Common levels: Q4_K_M (good balance), Q5_K_M (higher quality), Q8_0 (near-original).
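A toy version of the idea: store one scale per block of weights and round each weight to a small integer. Real Q4_K_M is more elaborate (k-quant super-blocks with per-block minimums), but the shape of the trade-off is the same.

```python
def quantize_q4(weights, block_size=32):
    """Toy symmetric 4-bit quantization: one float scale per block,
    weights rounded to integers in [-7, 7]."""
    out = []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        scale = max(abs(w) for w in block) / 7 or 1.0  # avoid zero scale
        out.append((scale, [round(w / scale) for w in block]))
    return out

def dequantize_q4(blocks):
    """Recover approximate weights: integer times its block's scale."""
    return [q * scale for scale, qs in blocks for q in qs]

w = [0.12, -0.53, 0.91, -0.07]
restored = dequantize_q4(quantize_q4(w))
print([round(x, 2) for x in restored])  # [0.13, -0.52, 0.91, -0.13]
```

Each weight now costs 4 bits plus a small shared overhead for the scale, instead of 16 bits -- hence the roughly 4x memory saving, at the cost of small rounding errors like the ones visible above.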

📝
Modelfile
Ollama's recipe file for building custom models -- like a Dockerfile for LLMs.

Uses a simple declarative syntax with instructions like FROM (base model), PARAMETER (inference settings), SYSTEM (system prompt), and TEMPLATE (chat format). Example: FROM llama3.2 followed by SYSTEM "You are a helpful coding assistant" creates a coding-focused variant.
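Written out as a file, that coding-assistant example looks like this (the parameter value is just an illustration):

```
FROM llama3.2
PARAMETER temperature 0.2
SYSTEM "You are a helpful coding assistant."
```

You build it with ollama create my-coder -f Modelfile and then run it like any other model with ollama run my-coder.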

🌐
REST API
The HTTP interface that applications use to talk to the inference server.

Like a waiter taking orders and bringing back responses from the kitchen. The server listens on localhost:11434 and provides endpoints for generation (/api/generate), chat (/api/chat), embeddings (/api/embed), and an OpenAI-compatible endpoint at /v1/chat/completions.
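A minimal sketch of a request body for the /api/chat endpoint (model name and message are just examples; sending it requires a running Ollama server, so the network call is shown as a comment):

```python
import json

# The JSON body Ollama's /api/chat endpoint expects.
payload = {
    "model": "llama3.2",
    "messages": [
        {"role": "user", "content": "Why is the sky blue?"},
    ],
    "stream": False,
}
body = json.dumps(payload)
print(body)

# With a server running, you could POST it:
#   curl http://localhost:11434/api/chat -d "$BODY"
```

Because the same server also exposes /v1/chat/completions, existing OpenAI-client code can usually be pointed at localhost:11434 with little more than a base-URL change.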

🧠
Scheduler
The internal brain managing which models are loaded in memory.

Like an air traffic controller deciding which planes get to land and take off. Since models consume gigabytes of GPU memory, the scheduler tracks loaded models, handles concurrent requests, manages memory pressure, and decides when to evict idle models to make room for new ones.
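The eviction side of that job can be sketched as a least-recently-used policy over a VRAM budget. This is an illustration of the idea, not Ollama's actual algorithm -- the real scheduler also weighs GPU layout, in-flight requests, and keep-alive timers.

```python
from collections import OrderedDict

class ToyScheduler:
    """Toy LRU eviction over a fixed VRAM budget (illustrative only)."""

    def __init__(self, vram_budget_gb: float):
        self.budget = vram_budget_gb
        self.loaded = OrderedDict()  # model name -> VRAM it occupies (GB)

    def request(self, model: str, size_gb: float) -> list[str]:
        """Ensure `model` is loaded; return any models evicted to fit it."""
        evicted = []
        if model in self.loaded:
            self.loaded.move_to_end(model)  # mark as most recently used
            return evicted
        # Evict the least recently used models until the new one fits.
        while self.loaded and sum(self.loaded.values()) + size_gb > self.budget:
            name, _ = self.loaded.popitem(last=False)
            evicted.append(name)
        self.loaded[model] = size_gb
        return evicted

s = ToyScheduler(vram_budget_gb=8)
s.request("llama3.2", 4)
s.request("phi3", 3)
print(s.request("mistral", 5))  # ['llama3.2'] -- oldest model evicted to fit
```

The keep-alive timer mentioned in the lifecycle below maps onto this: an idle model stays in the table until either its timer fires or memory pressure forces it out.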

⚙️
Runner
The inference subprocess that actually executes model computations.

Like the engine in a car -- the API provides the dashboard, but the Runner does the real work. Ollama spawns a separate runner process for each loaded model and communicates with it over local HTTP. Two backends are supported: the legacy llama.cpp runner and the newer native Ollama engine, which adds MLX support on Apple Silicon.
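A hypothetical sketch of the spawn step: pick a free local port and build the runner's command line. The binary name and flags below are invented for illustration -- they are not Ollama's actual runner interface.

```python
import socket

def free_port() -> int:
    """Ask the OS for an unused TCP port to hand to the runner."""
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

def runner_command(model_path: str, port: int, gpu_layers: int) -> list[str]:
    """Build the runner argv (binary name and flags are hypothetical)."""
    return [
        "ollama-runner",  # hypothetical binary name
        "--model", model_path,
        "--port", str(port),
        "--gpu-layers", str(gpu_layers),
    ]

port = free_port()
cmd = runner_command("/models/example.gguf", port, gpu_layers=32)  # example path
print(cmd)
# A real supervisor would now launch the process (e.g. subprocess.Popen(cmd))
# and poll its local HTTP endpoint until the model finishes loading.
```

Keeping the runner in a separate process means a crash during inference takes down one model, not the whole server.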

How They Fit Together

Request Lifecycle

1. You type ollama run llama3.2 -- the CLI sends a chat request to the REST API server
2. The server asks the Scheduler whether the model is already loaded
3. If not loaded, the Scheduler checks the local blob store for the GGUF model file (downloading from the Model Library if needed)
4. The Scheduler selects GPUs with enough memory and spawns a Runner subprocess
5. The Runner loads the quantized weights, applies Modelfile settings, and starts generating tokens
6. Tokens stream back to you in real time; the Scheduler sets a keep-alive timer for the loaded model
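The streaming in the final step uses newline-delimited JSON: each line is a JSON object carrying a text chunk, with a done flag on the last one. A minimal client-side sketch, assuming the /api/generate response shape:

```python
import json

def stream_tokens(lines):
    """Reassemble text from Ollama-style streaming NDJSON: each line is a
    JSON object with a 'response' chunk and a 'done' flag on the last one."""
    for line in lines:
        chunk = json.loads(line)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            return

# Simulated wire data, as the Runner would stream it back through the API:
wire = [
    '{"response": "Hello", "done": false}',
    '{"response": ", world!", "done": false}',
    '{"response": "", "done": true}',
]
print("".join(stream_tokens(wire)))  # Hello, world!
```

This is why responses feel instant even for long answers: tokens are forwarded as the Runner produces them rather than buffered until generation completes.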
💡
Key insight: Every concept in Ollama has a clear job. The Model Library handles discovery, GGUF handles storage, Quantization handles compression, the Modelfile handles customization, the API handles communication, the Scheduler handles memory, and the Runner handles computation. This separation of concerns is what makes Ollama "just work" for most users.