Model Pulling and Storage
When you run ollama pull llama3.2, Ollama contacts its model registry at registry.ollama.ai and downloads a manifest -- a JSON document listing the model's layers with their digests and media types. Each layer is a content-addressed blob (identified by SHA256 hash) stored in ~/.ollama/models/blobs/.
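Content addressing means a layer's on-disk location is a pure function of its bytes. A minimal sketch in Python, assuming blobs are named `sha256-<hex>` under the models directory (the exact filename convention is an assumption for illustration):

```python
import hashlib
from pathlib import Path

def blob_path(layer_bytes: bytes,
              root: Path = Path.home() / ".ollama" / "models") -> Path:
    # Content addressing: the filename is derived from the SHA256 digest,
    # so identical layers map to the same path regardless of model name.
    digest = hashlib.sha256(layer_bytes).hexdigest()
    return root / "blobs" / f"sha256-{digest}"
```

Because the path depends only on the content, two models sharing a base layer store it exactly once.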
Model Pull Process

1. Parse the model name into host/namespace/model:tag (defaults: registry.ollama.ai/library/model:latest)
2. Fetch the manifest from the registry -- it lists layers with SHA256 digests and media types
3. Check the local blob store -- skip layers already present (content-addressed dedup)
4. Download missing layers: model weights (GGUF), template, parameters, system prompt, license
5. Write the manifest to the local store -- the model is now ready to run
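The dedup check in step 3 amounts to comparing each layer digest in the manifest against the blob store before downloading. A sketch under assumed field names (a `layers` list whose entries carry a `digest`, mirroring the manifest layout described above):

```python
from pathlib import Path

def layers_to_fetch(manifest: dict, blob_dir: Path) -> list[dict]:
    """Return only the layers whose blobs are missing from the local store."""
    missing = []
    for layer in manifest["layers"]:
        # Registry digests look like "sha256:<hex>"; assume ':' becomes '-' on disk
        blob_name = layer["digest"].replace(":", "-")
        if not (blob_dir / blob_name).exists():
            missing.append(layer)
    return missing
```

Re-pulling a model that is already present therefore downloads nothing: every digest resolves to an existing blob.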
Inference Pipeline
Once a model is loaded, each inference request follows a precise pipeline from raw text to generated tokens.
Token Generation Pipeline

1. Template rendering -- chat messages are formatted into the raw prompt using the model's Go template
2. Tokenization -- the rendered prompt is split into token IDs using the model's BPE/SentencePiece vocabulary
3. Context management -- if the tokens exceed the context window, truncate or shift (preserving the system prompt via num_keep)
4. Forward pass -- tokens are fed through the transformer layers, computing attention and feed-forward on GPU/CPU
5. Sampling -- output logits are converted to probabilities and filtered with temperature, top_k, top_p, min_p, and repeat penalty
6. Streaming -- each generated token is immediately streamed back through the API server to the client
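Step 5 can be illustrated with a toy sampler over a small logit map: temperature scaling, softmax, then top_k and top_p (nucleus) filtering before drawing a token. This is a sketch of the standard technique, not Ollama's actual implementation (which also applies min_p and repeat penalties):

```python
import math
import random

def sample_token(logits: dict[int, float], temperature: float = 0.8,
                 top_k: int = 40, top_p: float = 0.9,
                 rng=random.random) -> int:
    # Temperature: lower values sharpen the distribution, higher values flatten it
    scaled = {t: l / temperature for t, l in logits.items()}
    # Softmax (shifted by the max logit for numerical stability)
    m = max(scaled.values())
    exps = {t: math.exp(v - m) for t, v in scaled.items()}
    z = sum(exps.values())
    ranked = sorted(((t, e / z) for t, e in exps.items()), key=lambda p: -p[1])
    # top_k: keep only the k most probable tokens
    ranked = ranked[:top_k]
    # top_p: keep the smallest prefix with cumulative probability >= top_p
    kept, cum = [], 0.0
    for t, p in ranked:
        kept.append((t, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize the surviving tokens and draw one
    total = sum(p for _, p in kept)
    r = rng() * total
    for t, p in kept:
        r -= p
        if r <= 0:
            return t
    return kept[-1][0]
```

With a near-zero temperature the distribution collapses onto the highest logit, which is why low temperatures behave almost greedily.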
Streaming is the default. Ollama streams tokens as they are generated rather than waiting for the complete response, giving users immediate feedback. Set "stream": false in the API request to buffer the entire response instead.

Performance Characteristics
Token generation speed depends heavily on hardware and model size. Here are representative benchmarks.
Configuration                      Speed
7B Q4_K_M on M3 MacBook Pro        ~40 tok/s
7B Q4_K_M on RTX 4090              ~80 tok/s
70B Q4_K_M on M3 Max (96GB)        ~8 tok/s
Cold start (7B model)              2-10 seconds
Warm request (cached model)        <100ms to first token
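These throughput figures translate directly into wall-clock latency. A back-of-the-envelope helper (the 100 ms default time-to-first-token is an assumption taken from the warm-request row):

```python
def generation_seconds(num_tokens: int, tok_per_s: float,
                       first_token_s: float = 0.1) -> float:
    # Total latency = time to first token + steady-state decode time
    return first_token_s + num_tokens / tok_per_s
```

At ~40 tok/s a 400-token reply takes roughly 10 seconds; at ~80 tok/s that halves.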
Context window overhead grows quickly. Attention compute scales quadratically with context length, while KV cache memory grows linearly: doubling the context window doubles the cache but roughly quadruples attention FLOPs. Ollama supports flash attention (OLLAMA_FLASH_ATTENTION=1) and KV cache quantization to mitigate this on supported hardware.
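KV cache size can be estimated directly from the model shape. The configuration below (32 layers, 8 KV heads from grouped-query attention, head dim 128, fp16 cache) is an assumed Llama-style 7-8B setup, not a measured Ollama value:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elt: int = 2) -> int:
    # Two tensors (K and V) are cached per layer, per KV head, per position
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elt

# 8,192-token context under the assumed shape: 2*32*8*128*8192*2 bytes = 1 GiB
cache = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)
```

Under these assumptions an 8,192-token context needs about 1 GiB of cache; halving `bytes_per_elt` via KV cache quantization halves that.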