High-Level Design
Ollama follows a client-server architecture with a process-per-model isolation pattern. The system has four main layers: clients at the top, a REST API server, a Scheduler managing model lifecycle, and isolated Runner processes at the bottom communicating with GPU/CPU hardware.
Ollama System Architecture (diagram):
- CLI / HTTP Client (user interface)
- REST API Server (server/routes.go)
- Model Store (~/.ollama/models)
- Scheduler (server/sched.go)
- GPU Discovery (discover/)
- Runner Process (llm/server.go)
- GPU / CPU (hardware layer)
Design Decisions
Process-per-model isolation over in-process loading. Ollama runs each model in a separate OS process rather than loading models into the server's memory space. This costs some IPC overhead but means a crash in the C/C++ inference code does not take down the API server. It also enables clean memory reclamation -- killing a process guarantees its GPU memory is freed.
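The lifecycle described above can be sketched in Go. This is an illustrative simplification, not Ollama's actual runner code: `startRunner` and `stopRunner` are hypothetical names, and a `sleep` child stands in for the real inference binary managed in llm/server.go.

```go
package main

import (
	"fmt"
	"os/exec"
)

// startRunner launches an inference runner as a separate OS process,
// so a crash in its native inference code cannot take down the caller.
func startRunner(bin string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(bin, args...)
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}

// stopRunner kills the process and reaps it. Because the model lives in
// the child's address space, the OS reclaims all of its memory on exit.
func stopRunner(cmd *exec.Cmd) error {
	if err := cmd.Process.Kill(); err != nil {
		return err
	}
	return cmd.Wait() // returns the "killed" status; the memory is gone either way
}

func main() {
	cmd, err := startRunner("sleep", "60") // stand-in for a runner binary
	if err != nil {
		panic(err)
	}
	fmt.Println("runner started, pid:", cmd.Process.Pid)
	stopRunner(cmd)
	fmt.Println("runner stopped; resources reclaimed by the OS")
}
```

The key property is that teardown is a `kill`, not a cooperative cleanup path through C/C++ code that may itself be in a corrupted state.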
Content-addressable blob storage over flat files. Models are stored as digests referencing blobs, with manifests pointing to layers. This mirrors Docker's storage model and enables layer sharing between model variants. A 7B base model and its fine-tuned variant share the base weights, saving gigabytes of disk space.
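A minimal sketch of content addressing in Go, assuming a simplified directory layout: `blobDigest` and `blobPath` are illustrative helpers, not Ollama's internal API.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"path/filepath"
)

// blobDigest returns the content address for a layer: the SHA-256 of its bytes.
// Identical bytes always yield the identical digest, so identical layers
// collapse to a single blob on disk.
func blobDigest(data []byte) string {
	return fmt.Sprintf("sha256:%x", sha256.Sum256(data))
}

// blobPath maps a digest to a location under the model store
// (layout simplified from ~/.ollama/models).
func blobPath(root, digest string) string {
	// "sha256:abcd..." stored as "blobs/sha256-abcd..."
	return filepath.Join(root, "blobs", digest[:6]+"-"+digest[7:])
}

func main() {
	weights := []byte("base model weights")
	d := blobDigest(weights)
	fmt.Println(d)
	fmt.Println(blobPath("/home/user/.ollama/models", d))
	// Two manifests that reference this digest share the one blob on disk,
	// which is how a base model and its fine-tune share the base weights.
}
```

Deduplication falls out of the addressing scheme rather than requiring any explicit bookkeeping, exactly as in Docker's image store.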
OpenAI API compatibility as a first-class concern. By providing /v1/chat/completions, Ollama can be a drop-in replacement for OpenAI in any application that uses the OpenAI SDK. This was a deliberate choice to lower the barrier for developers migrating from cloud APIs to local inference.

Automatic GPU memory management over manual configuration. The Scheduler proactively tracks VRAM usage and evicts models when needed, rather than requiring users to specify memory limits. This makes Ollama "just work" on consumer hardware, at the cost of some predictability for advanced users who want fine-grained control.
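The eviction behavior can be sketched as a least-recently-used policy over loaded runners. This is a toy model in Go, assuming illustrative types and byte counts; the real bookkeeping in server/sched.go is considerably more involved (concurrent requests, multi-GPU placement, partial offload).

```go
package main

import "fmt"

// loadedModel tracks one runner process and the VRAM its model occupies.
type loadedModel struct {
	name     string
	vram     uint64 // bytes
	lastUsed int64  // logical clock; real code would use time.Time
}

type scheduler struct {
	freeVRAM uint64
	loaded   []loadedModel
}

// ensureLoaded makes room for a model by evicting the least recently used
// runner until the request fits, then records the new model as loaded.
// Killing a runner process frees its VRAM deterministically.
func (s *scheduler) ensureLoaded(name string, vram uint64, now int64) bool {
	for s.freeVRAM < vram && len(s.loaded) > 0 {
		lru := 0
		for i, m := range s.loaded {
			if m.lastUsed < s.loaded[lru].lastUsed {
				lru = i
			}
		}
		s.freeVRAM += s.loaded[lru].vram
		s.loaded = append(s.loaded[:lru], s.loaded[lru+1:]...)
	}
	if s.freeVRAM < vram {
		return false // does not fit even on an empty GPU
	}
	s.freeVRAM -= vram
	s.loaded = append(s.loaded, loadedModel{name, vram, now})
	return true
}

func main() {
	s := &scheduler{freeVRAM: 8 << 30} // an 8 GiB GPU
	s.ensureLoaded("llama3:8b", 5<<30, 1)
	ok := s.ensureLoaded("mistral:7b", 5<<30, 2) // forces eviction of llama3
	fmt.Println(ok, len(s.loaded))               // true 1
}
```

The trade-off noted above is visible even in the toy: the user never states a memory budget, so which model gets evicted is decided by the scheduler, not by configuration.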