Ollama Setup Guide 2026: Install and Run Local LLMs on Mac, Windows & Linux

11 minute read

This is a step-by-step Ollama setup guide for Mac, Windows, and Linux — install it in one command, pull your first model, and have a local AI assistant (and an OpenAI-compatible API) running in about ten minutes. Everything stays on your machine: no cloud, no API keys, no monthly bills.

Key Takeaways

Who this is for: Anyone with a Mac (M1+), a Windows/Linux PC with a recent GPU, or even a CPU-only machine who wants to run AI models locally from the terminal.
What you’ll get: Ollama installed, your first model pulled and running, the essential commands, and a local OpenAI-compatible API at http://localhost:11434.
Time required: ~10 minutes to a working chat; ~20 to a local API wired into your code.
Cost: $0 — every tool in this guide is free and open source.

Step 1 – What Is Ollama and Why Use It?

Prefer a GUI? Ollama is the terminal-first, developer-first way to run local models. If you’d rather click than type, read our Ollama vs LM Studio comparison first — they run the same models, just with very different ergonomics.

Ollama is a lightweight runtime that downloads, manages, and serves open-source large language models with a single command. Think of it as the git of local LLMs: ollama run qwen3 and you’re talking to a model. Under the hood it handles model downloads, quantization, GPU offloading, and an HTTP server — the parts that used to mean an afternoon of llama.cpp flags.

Why this matters in 2026:

Privacy: Your prompts never leave your machine. For proprietary code, client data, medical or legal text, that isn’t a nice-to-have — it’s the requirement.
Cost: Cloud API bills compound. A handful of developers on frontier models can burn $500–2,000/month. Local inference costs $0 after the hardware you already own.
No rate limits: No throttling at 2 AM on a deadline, no quota emails.
Offline: Works on a plane, on bad coffee-shop Wi-Fi, or inside an air-gapped network.
Scriptable: Because it’s a CLI with an HTTP API, Ollama drops straight into shell scripts, cron jobs, and agent frameworks. This is why it’s the local backend of choice for so many AI-agent stacks.

Ollama now also ships a native desktop app for Mac and Windows if you want a chat window, but the CLI is where its real power lives — and it’s what the rest of this guide uses.

Already know what Ollama is? Jump to Step 3 – Installation.

Apple Silicon LLM model guide — which model runs best on your Mac

Step 2 – Can Your Machine Run Ollama? System Requirements

Who This Guide Is For

Developers who want a local, OpenAI-compatible API for hybrid LLM workflows.
LM Studio users who want a scriptable, headless alternative for servers and automation.
Power users and tinkerers comfortable with a terminal who want full control over models and parameters.

Minimum Requirements

Component	Spec
RAM / Unified Memory	8 GB (runs 3–8B models; tight but usable)
Storage	10 GB free to start (models are 2–50+ GB each)
OS	macOS 14 Sonoma+, Windows 10 22H2+, or modern Linux
GPU	Not required, but strongly recommended for speed

Recommended for a Good Experience

Component	Spec
RAM / Unified Memory	16–64 GB
GPU (NVIDIA)	RTX 3060 12 GB or better (driver 452.39+)
GPU (AMD)	ROCm 7-capable card, or Vulkan fallback
GPU (Apple)	M1 Pro / M2 / M3 / M4 with 16 GB+ unified memory
Storage	SSD with 50+ GB free

The Sweet Spot in 2026

Mac users: Apple Silicon is the best value in local LLMs because the CPU and GPU share one memory pool. An M-series Mac with 16 GB comfortably runs 12–14B models; 32 GB handles 27–32B; 64 GB+ opens up 70B-class models. This is the machine class that punches above its price.
Windows / Linux users: Any NVIDIA card with 8+ GB VRAM works. The RTX 4060 (8 GB) is the budget champion; a 24 GB RTX 3090/4090 unlocks the 27–32B tier where local quality starts replacing cloud calls.

No dedicated GPU? CPU-only inference works — it’s just slower. Expect a few tokens per second on a modern CPU versus 20–60+ tokens per second with a capable GPU, depending on model and hardware.

Ready to install? Jump to Step 3 – Installation.

Step 3 – Installing Ollama

macOS

The fastest path — one command in Terminal:

curl -fsSL https://ollama.com/install.sh | sh

Prefer Homebrew? brew install ollama works too. Or download the desktop app from ollama.com/download and drag it to Applications. (macOS 14 Sonoma or later required.)

Windows

Go to ollama.com/download and grab the Windows installer.
Run OllamaSetup.exe — it installs in your user account, no Administrator rights needed.
Ollama runs in the background; the ollama command becomes available in cmd, PowerShell, or your terminal of choice.

Prefer a package manager? winget install Ollama.Ollama. NVIDIA and AMD GPUs are supported automatically once your drivers are current.

Linux

curl -fsSL https://ollama.com/install.sh | sh

The script downloads the binary, creates a systemd service, sets up an ollama user, and detects your GPU. To run it as an always-on service:

sudo systemctl enable ollama
sudo systemctl start ollama

Verify the Install (all platforms)

ollama --version

If that prints a version, you’re done. To pin a specific release, prefix the install script: curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.21.0 sh.

Checkpoint: ollama --version returns a version number, and the Ollama service is running in the background (or via ollama serve). If a later command says “connection refused,” the server isn’t running — see Troubleshooting.

Step 4 – Pull and Run Your First Model

One command downloads a model and drops you into a chat:

ollama run llama3.1:8b

The first run downloads the model (a few GB), then gives you a >>> prompt. Type a message, hit Enter, and you’re running inference locally. Type /bye to exit.

Which Model Should You Pull?

Match the model to your memory. These are solid 2026 picks:

Your RAM / VRAM	Model size	Good starting models
8 GB	3–8B, Q4	`llama3.2:3b`, `llama3.1:8b`, `qwen3:8b`, `gemma3:4b`
16 GB	12–14B, Q4	`qwen3:14b`, `gemma4:12b`, `deepseek-r1:14b`
24–32 GB	27–32B, Q4	`qwen3:30b`, `gemma4:26b`, `deepseek-r1:32b`, `qwen3-coder:30b`
64 GB+	70B+, Q4–Q8	`llama3.3:70b`, `qwen3:32b` (Q8), `gpt-oss:120b` (high-end)

A few notes for 2026:

Best all-rounder: qwen3:30b if it fits, qwen3:14b if not. Strong reasoning and tool use.
Best for agents / function calling: gemma4:26b — native tool calling, multimodal, and light on memory thanks to its MoE design.
Best for reasoning: deepseek-r1 (the distilled 14B/32B tags run locally; the real 671B does not). It “thinks” before answering, which helps math and logic but is slower.
Best for coding: qwen3-coder:30b or devstral-small-2.
Watch the :latest tag. qwen3:latest is not the biggest Qwen3 — always pick an explicit size tag like qwen3:30b or gemma4:26b.

What the Q Numbers (Quantization) Mean

Quantization is how aggressively a model is compressed. Lower = smaller and faster, slightly lower quality; higher = larger, closer to original.

Q4_K_M — best size/quality balance. Ollama’s default tags usually land here. Start here.
Q5_K_M — noticeably better quality, ~25% larger.
Q8 — near-original quality, roughly double Q4’s size.

Pro tip: pull a small model first (ollama pull llama3.1:8b) to confirm everything works before committing to a 40 GB download.

Checkpoint: ollama run <model> gives you an interactive chat, and ollama list shows the model you just pulled.

Step 5 – Essential Ollama Commands

Ollama is a terminal tool first, and a handful of commands cover 95% of daily use:

Command	What it does
`ollama run <model>`	Download (if needed) and start an interactive chat
`ollama pull <model>`	Download a model without running it
`ollama list`	List models on disk
`ollama ps`	Show running models and whether they’re on GPU or CPU
`ollama stop <model>`	Unload a model from memory
`ollama rm <model>`	Delete a model
`ollama show <model>`	Show params, context length, and chat template
`ollama cp <src> <dst>`	Copy or alias a model
`ollama create <name> -f Modelfile`	Build a custom model from a Modelfile
`ollama serve`	Start the server manually (when not using the desktop app)

You can also pipe input straight in — handy for scripts:

ollama run qwen3:14b "Summarize this file: $(cat README.md)"

Inside a chat session, slash commands tune behavior on the fly: /set parameter num_ctx 8192 (raise context), /show info (model details), /bye (exit), /? (help).

Step 6 – Using Ollama as a Local API Server

This is where Ollama becomes a real development tool — and the foundation of a hybrid LLM stack.

Whenever the Ollama service is running, it serves an HTTP API on http://localhost:11434. You get two interfaces:

Ollama’s native REST API — /api/chat and /api/generate.
An OpenAI-compatible endpoint — /v1/chat/completions. Point any OpenAI-SDK code at it with a one-line change.

Native API (curl)

curl http://localhost:11434/api/chat -d '{
  "model": "qwen3:14b",
  "messages": [{"role": "user", "content": "Why is the sky blue?"}],
  "stream": false
}'

OpenAI-Compatible API (Python)

The same OpenAI SDK you already use — just change base_url:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required by the SDK, ignored by Ollama
)

response = client.chat.completions.create(
    model="qwen3:14b",
    messages=[{"role": "user", "content": "Explain quicksort in Python"}]
)

print(response.choices[0].message.content)

Your existing code works with zero refactoring — only the endpoint changes.

Why This Is the Foundation of a Hybrid LLM Stack

This is the core of what we write about at HybridLLM: not every task needs a paid cloud call. Route by difficulty.

Tier	Where	Tasks	Cost
Tier 1: Local (Ollama)	Your machine	Summarization, extraction, formatting, classification, boilerplate, embeddings	$0
Tier 2: Cloud (Claude / GPT / Gemini)	API call	Complex reasoning, multimodal judgment, frontier-quality output	Pay per use

Because both tiers speak the same OpenAI API, your app doesn’t need to know which one answered. Same code, different economics — and the cheap tier handles the bulk of the volume. For a full walkthrough, see Building a Hybrid LLM Stack.

Checkpoint: curl http://localhost:11434/api/tags lists your installed models over HTTP — proof the API is live and ready for your code.

Step 7 – Configuration and Performance Tuning

Ollama is configured almost entirely through environment variables. The ones that actually matter:

Variable	What it controls
`OLLAMA_HOST`	Bind address/port. Default `127.0.0.1:11434`; set `0.0.0.0` to reach Ollama from other devices on your LAN.
`OLLAMA_MODELS`	Where models are stored. Point this at an external/secondary SSD to keep your system drive clear.
`OLLAMA_KEEP_ALIVE`	How long a model stays loaded after a request (default `5m`). Set `-1` to keep it resident (no cold-start lag), `0` to unload immediately.
`OLLAMA_CONTEXT_LENGTH`	Default context window, e.g. `8192`. Bigger context = more memory and slower generation.
`OLLAMA_FLASH_ATTENTION`	Set `1` to enable flash attention — lower memory use and faster on supported GPUs.

On macOS (desktop app) set these in your shell profile or via launchctl setenv; on Linux, add them to the systemd unit (systemctl edit ollama) and restart.

The single most impactful setting is keeping the model you use constantly resident. If you hit a multi-second pause before every reply, your model is being unloaded between calls — set OLLAMA_KEEP_ALIVE=-1 (or a long value) and the lag disappears.

For deeper customization — system prompts, default parameters, templates — create a Modelfile:

FROM qwen3:14b
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
SYSTEM "You are a terse senior engineer. Answer in code first, prose second."

ollama create my-coder -f Modelfile
ollama run my-coder

Step 8 – Troubleshooting Common Issues

Symptom	Likely cause	Quick fix
`Error: connection refused` on :11434	Server not running	Start the desktop app, or run `ollama serve`
“model requires more system memory”	Model too big for RAM/VRAM	Pull a smaller size or lower quant (Q4); close other apps
Very slow / stuck on CPU	Not offloaded to GPU	Run `ollama ps` and check the PROCESSOR column; update GPU drivers; pick a model that fits in memory with headroom
Multi-second pause before every reply	Model unloaded between calls	Set `OLLAMA_KEEP_ALIVE=-1` to keep it resident
Garbled / incoherent output	Corrupt download or wrong template	`ollama rm <model>` then `ollama pull <model>` again
Can’t reach Ollama from another device	Bound to localhost only	Set `OLLAMA_HOST=0.0.0.0` and restart the service
Disk filling up	Models accumulate fast	`ollama list` to audit, `ollama rm` unused models, or move `OLLAMA_MODELS`

Check GPU usage at a glance: ollama ps shows each loaded model and whether it’s running on GPU, CPU, or split across both. If a model you expected to fit is on CPU, it’s too large for your available memory at that quantization — drop a size or a quant level.

My Ollama Stack (What I Actually Run)

I run Ollama daily on an M2 Max with 64 GB unified memory, and it’s the local tier of a larger agent setup rather than a chat toy. The models that earn their disk space:

gemma4:26b — my default reasoning and reviewer model. Multimodal, native tool calling, and the MoE design keeps it light enough to stay resident.
qwen3.5-nothink — fast web extraction, compression, and short subtasks where I don’t want chain-of-thought overhead.
qwen2.5-coder:14b — coding help (with the caveat that this size doesn’t do tool calling well).
nomic-embed-text — free, open embeddings for retrieval and research pipelines.

These run as profiles inside an agent framework, with OLLAMA_KEEP_ALIVE long so the model I lean on doesn’t cold-start. For one heavier research workload I run a larger Qwen via llama-server instead of Ollama — but for everything day-to-day, ollama run is the whole workflow. I wrote up how that local-first stack keeps a deep-research loop down to ~$0.22 a query in The $0.22 Research Loop.

The point isn’t this exact lineup — it’s that once Ollama is serving a local API, the model becomes a component you route work to, not an app you open.

What’s Next

Read these next, in order:

Best Local LLM Models for M2/M3/M4 Mac: Benchmark 2026 — match the right model to your exact Mac.
Ollama vs LM Studio — decide whether terminal or GUI fits your workflow.
Choosing Local LLMs for AI Agents — how I picked the models above for real agent work.
Building a Hybrid LLM Stack — wire local + cloud into one routing layer.

Running Ollama in your own hybrid setup? Tell me what’s in your stack on X/Twitter.

Share on

X Facebook LinkedIn Bluesky

HybridLLM.dev