Frequently Asked Questions

Real questions engineers ask when evaluating and operating Ollama.

How much RAM/VRAM do I actually need to run a model?

A Q4_K_M quantized model needs roughly 0.6 GB per billion parameters for weights alone: ~4.2 GB for a 7B model, ~8 GB for 13B, and ~42 GB for 70B. Add 10-20% for the KV cache at default context lengths. On Apple Silicon, Ollama uses unified memory, so your total system RAM is your VRAM budget. As a practical minimum: 8 GB of RAM lets you run 7B models, 16 GB handles 13B, and 32 GB can run some 30B models.
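The 0.6 GB-per-billion heuristic is easy to turn into a quick estimator. A minimal sketch (the function name is ours, and the 15% default overhead is just the midpoint of the 10-20% KV-cache range above):

```python
def estimate_memory_gb(params_billions: float, kv_overhead: float = 0.15) -> float:
    """Rough memory estimate for a Q4_K_M model: ~0.6 GB per billion
    parameters for weights, plus a fractional overhead for the KV cache
    at default context lengths."""
    weights = 0.6 * params_billions
    return round(weights * (1 + kv_overhead), 1)
```

For example, `estimate_memory_gb(7)` lands just under 5 GB, which matches the rule of thumb that 8 GB of RAM comfortably fits a 7B model.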

What happens if a model is too large for my GPU memory?

Ollama handles this gracefully with "partial offload." The Scheduler's layer assignment uses binary search to find the optimal split between GPU and CPU: layers that fit in VRAM run on the GPU, and the rest run on the CPU. This is slower than full GPU offload (typically 2-5x for the CPU-bound layers) but faster than pure CPU inference. Monitor the split with ollama ps, which shows the GPU/CPU percentage. To force full CPU mode, set the num_gpu option to 0 (via request options or a Modelfile PARAMETER).
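The binary-search idea behind the layer split can be illustrated with uniform layer sizes. This is a sketch of the technique, not Ollama's actual scheduler code, and the sizes and budget in the example are hypothetical:

```python
def max_offloadable_layers(layer_size_gb: float, n_layers: int,
                           vram_budget_gb: float) -> int:
    """Binary-search the largest number of layers whose weights fit in
    the VRAM budget. With uniform layers this reduces to division, but
    the same search works when per-layer costs vary."""
    lo, hi = 0, n_layers
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid * layer_size_gb <= vram_budget_gb:
            lo = mid          # mid layers fit; try offloading more
        else:
            hi = mid - 1      # over budget; try fewer
    return lo
```

With 32 layers of 0.25 GB each and a 6 GB budget, 24 layers go to the GPU and the remaining 8 run on the CPU.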

Why is the first request slow but subsequent requests fast?

The first request triggers model loading: reading the GGUF file, allocating GPU memory, and loading weights. This "cold start" takes 2-10 seconds for 7B models and up to 60 seconds for 70B. Once loaded, the model stays in memory for the keep-alive duration (default: 5 minutes). Set OLLAMA_KEEP_ALIVE=-1 to keep models permanently loaded.
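keep_alive can also be set per request instead of through the environment variable. A minimal sketch of the request body for Ollama's /api/generate endpoint (the helper name is ours):

```python
def generate_request(model: str, prompt: str, keep_alive="5m") -> dict:
    """Body for POST /api/generate. keep_alive accepts a duration
    string like "10m", 0 to unload immediately after the response,
    or -1 to pin the model in memory (mirroring OLLAMA_KEEP_ALIVE)."""
    return {"model": model, "prompt": prompt, "keep_alive": keep_alive}
```

Sending `generate_request("llama3.2", "warm up", keep_alive=-1)` once at startup is a common way to pay the cold-start cost before real traffic arrives.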

Can I run multiple models simultaneously?

Yes. Ollama's Scheduler supports multiple loaded models. By default it auto-detects how many fit in available memory (OLLAMA_MAX_LOADED_MODELS caps the count); models load on demand, and the least-recently-used model is evicted when memory runs short. Each model gets its own Runner subprocess. The Scheduler also supports parallel requests to the same model via OLLAMA_NUM_PARALLEL. Watch your memory: two 7B models need ~9 GB, which is tight on a 16 GB machine.
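The least-recently-used eviction policy can be sketched in a few lines. This is an illustration of the policy only, not Ollama's implementation, and ModelCache is a hypothetical name:

```python
from collections import OrderedDict

class ModelCache:
    """LRU sketch: when a new model does not fit in the memory budget,
    evict least-recently-used models until it does."""
    def __init__(self, budget_gb: float):
        self.budget = budget_gb
        self.loaded = OrderedDict()  # name -> size_gb, oldest first

    def request(self, name: str, size_gb: float):
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as recently used
            return
        while self.loaded and sum(self.loaded.values()) + size_gb > self.budget:
            self.loaded.popitem(last=False)  # evict the LRU model
        self.loaded[name] = size_gb
```

With a 10 GB budget and three ~4.5 GB models requested in turn, the first one requested is evicted to make room for the third, matching the behavior described above.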

How do I use Ollama as a drop-in replacement for OpenAI?

Set the base URL to http://localhost:11434/v1 and use any API key (Ollama ignores it). With the Python OpenAI SDK: client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"). The /v1/chat/completions, /v1/completions, and /v1/embeddings endpoints implement the OpenAI specification. Most features work, including streaming, tool calling, structured outputs, and system messages.
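Because the endpoint is plain HTTP, you can also build the request without the SDK. A minimal sketch of the URL and JSON body (the helper name is ours; any Authorization header value would do, since Ollama ignores the key):

```python
def chat_completion_request(model: str, messages: list,
                            base: str = "http://localhost:11434"):
    """URL and JSON body for Ollama's OpenAI-compatible chat endpoint.
    The body shape follows the OpenAI Chat Completions spec."""
    return f"{base}/v1/chat/completions", {"model": model, "messages": messages}
```

POST the body to the returned URL with any HTTP client and you get back a standard Chat Completions response object.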

How does Ollama compare to running llama.cpp directly?

Ollama wraps llama.cpp and adds model management, auto GPU detection, a persistent server, model scheduling, and the OpenAI-compatible API. Trade-off: ~5-15% performance overhead and less control over low-level settings. Use raw llama.cpp if you need max tokens/second, custom CUDA kernels, or speculative decoding. Use Ollama if you value convenience or need an API server.

What's the diagnostic approach for unexpectedly slow generation?

Start with ollama ps to check GPU vs CPU layer split. Then check OLLAMA_NUM_PARALLEL for contention. Verify no other process is consuming GPU memory (nvidia-smi on NVIDIA, Activity Monitor on macOS). Try enabling flash attention with OLLAMA_FLASH_ATTENTION=1 and reducing context length. On Linux, ensure you're using the proprietary NVIDIA driver, not nouveau.
💡 Pro tip: Create a custom model with your preferred settings in a Modelfile (FROM llama3.2, PARAMETER temperature 0.3, SYSTEM "Your persona here"), then run ollama create my-model -f Modelfile. Custom models share base weights, so they consume minimal extra disk space.
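Written out line by line, that Modelfile looks like this (the persona string is a placeholder):

```
FROM llama3.2
PARAMETER temperature 0.3
SYSTEM "Your persona here"
```

After ollama create my-model -f Modelfile, the custom model runs like any pulled model: ollama run my-model.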