Model Pulling and Storage
When you run ollama pull llama3.2, Ollama contacts its model registry at registry.ollama.ai and downloads a manifest -- a JSON document listing the model's layers with their digests and media types. Each layer is a content-addressed blob (identified by SHA256 hash) stored in ~/.ollama/models/blobs/.
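Content addressing means a layer's on-disk location is a pure function of its bytes. A minimal sketch in Python, assuming blobs are named `sha256-<hex>` under the models directory (the exact filename convention is an assumption for illustration):

```python
import hashlib
from pathlib import Path

def blob_path(layer_bytes: bytes,
              root: Path = Path.home() / ".ollama" / "models") -> Path:
    # Content addressing: the filename is derived from the SHA256 digest,
    # so identical layers map to the same path regardless of model name.
    digest = hashlib.sha256(layer_bytes).hexdigest()
    return root / "blobs" / f"sha256-{digest}"
```

Because the path depends only on the content, two models sharing a base layer store it exactly once.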
Model Pull Process

1. Parse the model name into host/namespace/model:tag (defaults: registry.ollama.ai/library/model:latest)
2. Fetch the manifest from the registry -- it lists layers with SHA256 digests and media types
3. Check the local blob store -- skip layers already present (content-addressed dedup)
4. Download missing layers: model weights (GGUF), template, parameters, system prompt, license
5. Write the manifest to the local store -- the model is now ready to run
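The dedup check in step 3 amounts to comparing each layer digest in the manifest against the blob store before downloading. A sketch under assumed field names (a `layers` list whose entries carry a `digest`, mirroring the manifest layout described above):

```python
from pathlib import Path

def layers_to_fetch(manifest: dict, blob_dir: Path) -> list[dict]:
    """Return only the layers whose blobs are missing from the local store."""
    missing = []
    for layer in manifest["layers"]:
        # Registry digests look like "sha256:<hex>"; assume ':' becomes '-' on disk
        blob_name = layer["digest"].replace(":", "-")
        if not (blob_dir / blob_name).exists():
            missing.append(layer)
    return missing
```

Re-pulling a model that is already present therefore downloads nothing: every digest resolves to an existing blob.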
Inference Pipeline
Once a model is loaded, each inference request follows a precise pipeline from raw text to generated tokens.
Token Generation Pipeline

1. Template rendering -- chat messages are formatted into the raw prompt using the model's Go template
2. Tokenization -- the rendered prompt is split into token IDs using the model's BPE/SentencePiece vocabulary
3. Context management -- if the tokens exceed the context window, truncate or shift (preserving the system prompt via num_keep)
4. Forward pass -- tokens are fed through the transformer layers, computing attention and feed-forward on GPU/CPU
5. Sampling -- output logits are converted to probabilities and filtered with temperature, top_k, top_p, min_p, and repeat penalty
6. Streaming -- each generated token is immediately streamed back through the API server to the client
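Step 5 can be illustrated with a toy sampler over a small logit map: temperature scaling, softmax, then top_k and top_p (nucleus) filtering before drawing a token. This is a sketch of the standard technique, not Ollama's actual implementation (which also applies min_p and repeat penalties):

```python
import math
import random

def sample_token(logits: dict[int, float], temperature: float = 0.8,
                 top_k: int = 40, top_p: float = 0.9,
                 rng=random.random) -> int:
    # Temperature: lower values sharpen the distribution, higher values flatten it
    scaled = {t: l / temperature for t, l in logits.items()}
    # Softmax (shifted by the max logit for numerical stability)
    m = max(scaled.values())
    exps = {t: math.exp(v - m) for t, v in scaled.items()}
    z = sum(exps.values())
    ranked = sorted(((t, e / z) for t, e in exps.items()), key=lambda p: -p[1])
    # top_k: keep only the k most probable tokens
    ranked = ranked[:top_k]
    # top_p: keep the smallest prefix with cumulative probability >= top_p
    kept, cum = [], 0.0
    for t, p in ranked:
        kept.append((t, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize the surviving tokens and draw one
    total = sum(p for _, p in kept)
    r = rng() * total
    for t, p in kept:
        r -= p
        if r <= 0:
            return t
    return kept[-1][0]
```

With a near-zero temperature the distribution collapses onto the highest logit, which is why low temperatures behave almost greedily.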
Streaming is the default. Ollama streams tokens as they are generated rather than waiting for the complete response, giving users immediate feedback. Set "stream": false in the API request to buffer the entire response instead.

Performance Characteristics
Token generation speed depends heavily on hardware and model size. Here are representative benchmarks.
Configuration                      Speed
7B Q4_K_M on M3 MacBook Pro        ~40 tok/s
7B Q4_K_M on RTX 4090              ~80 tok/s
70B Q4_K_M on M3 Max (96GB)        ~8 tok/s
Cold start (7B model)              2-10 seconds
Warm request (cached model)        <100ms to first token
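These throughput figures translate directly into wall-clock latency. A back-of-the-envelope helper (the 100 ms default time-to-first-token is an assumption taken from the warm-request row):

```python
def generation_seconds(num_tokens: int, tok_per_s: float,
                       first_token_s: float = 0.1) -> float:
    # Total latency = time to first token + steady-state decode time
    return first_token_s + num_tokens / tok_per_s
```

At ~40 tok/s a 400-token reply takes roughly 10 seconds; at ~80 tok/s that halves.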
Context window overhead grows quickly. Attention compute scales quadratically with context length, while KV cache memory grows linearly: doubling the context window doubles the cache but roughly quadruples attention FLOPs. Ollama supports flash attention (OLLAMA_FLASH_ATTENTION=1) and KV cache quantization to mitigate this on supported hardware.
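KV cache size can be estimated directly from the model shape. The configuration below (32 layers, 8 KV heads from grouped-query attention, head dim 128, fp16 cache) is an assumed Llama-style 7-8B setup, not a measured Ollama value:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elt: int = 2) -> int:
    # Two tensors (K and V) are cached per layer, per KV head, per position
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elt

# 8,192-token context under the assumed shape: 2*32*8*128*8192*2 bytes = 1 GiB
cache = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)
```

Under these assumptions an 8,192-token context needs about 1 GiB of cache; halving `bytes_per_elt` via KV cache quantization halves that.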