High-Level Design
Ollama follows a client-server architecture with a process-per-model isolation pattern. The system has four main layers: clients at the top, a REST API server, a Scheduler managing model lifecycle, and isolated Runner processes at the bottom communicating with GPU/CPU hardware.
Ollama System Architecture (diagram):
- CLI / HTTP Client (user interface)
- REST API Server (server/routes.go)
- Model Store (~/.ollama/models)
- Scheduler (server/sched.go)
- GPU Discovery (discover/)
- Runner Process (llm/server.go)
- GPU / CPU (hardware layer)
Design Decisions
Process-per-model isolation over in-process loading. Ollama runs each model in a separate OS process rather than loading models into the server's memory space. This costs some IPC overhead but means a crash in the C/C++ inference code does not take down the API server. It also enables clean memory reclamation -- killing a process guarantees its GPU memory is freed.
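The lifecycle described above can be sketched in Go. This is an illustrative simplification, not Ollama's actual runner code: `startRunner` and `stopRunner` are hypothetical names, and a `sleep` child stands in for the real inference binary managed in llm/server.go.

```go
package main

import (
	"fmt"
	"os/exec"
)

// startRunner launches an inference runner as a separate OS process,
// so a crash in its native inference code cannot take down the caller.
func startRunner(bin string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(bin, args...)
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}

// stopRunner kills the process and reaps it. Because the model lives in
// the child's address space, the OS reclaims all of its memory on exit.
func stopRunner(cmd *exec.Cmd) error {
	if err := cmd.Process.Kill(); err != nil {
		return err
	}
	return cmd.Wait() // returns the "killed" status; the memory is gone either way
}

func main() {
	cmd, err := startRunner("sleep", "60") // stand-in for a runner binary
	if err != nil {
		panic(err)
	}
	fmt.Println("runner started, pid:", cmd.Process.Pid)
	stopRunner(cmd)
	fmt.Println("runner stopped; resources reclaimed by the OS")
}
```

The key property is that teardown is a `kill`, not a cooperative cleanup path through C/C++ code that may itself be in a corrupted state.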
Content-addressable blob storage over flat files. Models are stored as digests referencing blobs, with manifests pointing to layers. This mirrors Docker's storage model and enables layer sharing between model variants. A 7B base model and its fine-tuned variant share the base weights, saving gigabytes of disk space.
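A minimal sketch of content addressing in Go, assuming a simplified directory layout: `blobDigest` and `blobPath` are illustrative helpers, not Ollama's internal API.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"path/filepath"
)

// blobDigest returns the content address for a layer: the SHA-256 of its bytes.
// Identical bytes always yield the identical digest, so identical layers
// collapse to a single blob on disk.
func blobDigest(data []byte) string {
	return fmt.Sprintf("sha256:%x", sha256.Sum256(data))
}

// blobPath maps a digest to a location under the model store
// (layout simplified from ~/.ollama/models).
func blobPath(root, digest string) string {
	// "sha256:abcd..." stored as "blobs/sha256-abcd..."
	return filepath.Join(root, "blobs", digest[:6]+"-"+digest[7:])
}

func main() {
	weights := []byte("base model weights")
	d := blobDigest(weights)
	fmt.Println(d)
	fmt.Println(blobPath("/home/user/.ollama/models", d))
	// Two manifests that reference this digest share the one blob on disk,
	// which is how a base model and its fine-tune share the base weights.
}
```

Deduplication falls out of the addressing scheme rather than requiring any explicit bookkeeping, exactly as in Docker's image store.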
OpenAI API compatibility as a first-class concern. By providing /v1/chat/completions, Ollama can be a drop-in replacement for OpenAI in any application that uses the OpenAI SDK. This was a deliberate choice to lower the barrier for developers migrating from cloud APIs to local inference.

Automatic GPU memory management over manual configuration. The Scheduler proactively tracks VRAM usage and evicts models when needed, rather than requiring users to specify memory limits. This makes Ollama "just work" on consumer hardware, at the cost of some predictability for advanced users who want fine-grained control.
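The eviction behavior can be sketched as a least-recently-used policy over loaded runners. This is a toy model in Go, assuming illustrative types and byte counts; the real bookkeeping in server/sched.go is considerably more involved (concurrent requests, multi-GPU placement, partial offload).

```go
package main

import "fmt"

// loadedModel tracks one runner process and the VRAM its model occupies.
type loadedModel struct {
	name     string
	vram     uint64 // bytes
	lastUsed int64  // logical clock; real code would use time.Time
}

type scheduler struct {
	freeVRAM uint64
	loaded   []loadedModel
}

// ensureLoaded makes room for a model by evicting the least recently used
// runner until the request fits, then records the new model as loaded.
// Killing a runner process frees its VRAM deterministically.
func (s *scheduler) ensureLoaded(name string, vram uint64, now int64) bool {
	for s.freeVRAM < vram && len(s.loaded) > 0 {
		lru := 0
		for i, m := range s.loaded {
			if m.lastUsed < s.loaded[lru].lastUsed {
				lru = i
			}
		}
		s.freeVRAM += s.loaded[lru].vram
		s.loaded = append(s.loaded[:lru], s.loaded[lru+1:]...)
	}
	if s.freeVRAM < vram {
		return false // does not fit even on an empty GPU
	}
	s.freeVRAM -= vram
	s.loaded = append(s.loaded, loadedModel{name, vram, now})
	return true
}

func main() {
	s := &scheduler{freeVRAM: 8 << 30} // an 8 GiB GPU
	s.ensureLoaded("llama3:8b", 5<<30, 1)
	ok := s.ensureLoaded("mistral:7b", 5<<30, 2) // forces eviction of llama3
	fmt.Println(ok, len(s.loaded))               // true 1
}
```

The trade-off noted above is visible even in the toy: the user never states a memory budget, so which model gets evicted is decided by the scheduler, not by configuration.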