Local LLM Inference Runtime

Ollama

The simplest way to run open-source LLMs locally: one command to download, one command to run, and a REST API that works like OpenAI's.

License: MIT
Language: Go
Version: 0.20.0
GPU Support: CUDA / ROCm / Metal
- 💡 Core Concepts (Beginner): Model Library, GGUF format, quantization, Modelfiles, and the REST API that powers everything.
- 🏗 Architecture (Intermediate): Client-server design, process-per-model isolation, scheduler, and content-addressable blob storage.
- How It Works (Intermediate): Model pulling, GPU allocation, inference pipeline, scheduler internals, and the conversion pipeline.
- 💻 Implementation (Advanced): Getting started, configuration, code patterns, and an annotated source code walkthrough.
- 🎯 Use Cases (Beginner-Intermediate): When to use Ollama, when NOT to, and real-world examples from development to production.
- 🌍 Ecosystem (Intermediate): Client libraries, AI coding tools, RAG frameworks, and common integration patterns.
- FAQ (All Levels): Memory requirements, multi-model loading, performance diagnostics, and OpenAI compatibility.
- Trade-offs (Intermediate): Honest assessment of strengths, limitations, and how Ollama compares to llama.cpp, LM Studio, and vLLM.

Quick Start

Get from zero to running a local LLM in under a minute.

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run your first model
ollama run llama3.2

# Or use the API
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello!"}]
}'

Ollama auto-detects your GPU and downloads the best model variant for your hardware.
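By default, the `/api/chat` endpoint streams its reply as newline-delimited JSON: each line carries a partial assistant message, and the final line has `"done": true`. A minimal Python sketch of stitching such a stream back into one reply (the helper name and the sample chunks below are illustrative, not taken from Ollama's source):

```python
import json

def accumulate_chat_stream(lines):
    """Concatenate the content fragments from an Ollama /api/chat
    NDJSON stream into a single reply string."""
    reply = []
    for line in lines:
        if not line.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        # each chunk carries a partial assistant message
        reply.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # final chunk; Ollama also includes timing stats here
    return "".join(reply)

# example chunks shaped like the server's streaming output (illustrative values)
stream = [
    '{"model":"llama3.2","message":{"role":"assistant","content":"Hel"},"done":false}',
    '{"model":"llama3.2","message":{"role":"assistant","content":"lo!"},"done":false}',
    '{"model":"llama3.2","message":{"role":"assistant","content":""},"done":true}',
]
print(accumulate_chat_stream(stream))  # Hello!
```

If you prefer a single JSON response instead of a stream, add `"stream": false` to the request body.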