LM Studio Setup Guide 2026: How to Install and Run Local LLMs in 5 Minutes
This is a step-by-step LM Studio setup guide for Mac, Windows, and Linux to install and run local LLMs — completely offline, completely free, with zero data leaving your machine.
Key Takeaways
- Who this is for: Anyone with a Mac (M1+) or Windows PC (RTX 3060+) who wants to run AI models locally
- What you’ll get: LM Studio installed, your first model downloaded and running, a local API server ready for development
- Time required: ~30 minutes from zero to a working local AI assistant (the install itself takes about 5 minutes; most of the rest is model download time)
- Cost: $0 — everything in this guide is free
Step 1 – What Is LM Studio and Why Use It Instead of Cloud LLMs?
Already using Ollama? Think of LM Studio as the GUI-first alternative — same models, visual interface, built-in API server. Read our detailed Ollama vs LM Studio comparison to see which fits your workflow.
LM Studio is a desktop application that lets you discover, download, and run open-source large language models locally. Think of it as the iTunes of AI models — a clean interface on top of what would otherwise require terminal commands and manual configuration.
Why this matters in 2026:
- Privacy: Your prompts never leave your computer. For anyone working with proprietary code, medical records, legal documents, or client data, this isn’t optional — it’s a requirement.
- Cost: Cloud API calls add up fast. A team of five developers using GPT-4-level models can easily spend $500–2,000 per month. Local inference costs exactly $0 after the hardware investment.
- No rate limits: You won’t get throttled at 3 AM when you’re on a deadline.
- Offline access: Works on a plane, in a coffee shop with bad Wi-Fi, or in an air-gapped corporate network.
The catch? You need decent hardware. But if you’re reading this on a machine bought in the last two years, you probably have enough.
Already know what LM Studio is? Jump to Step 2 – System Requirements.
Step 2 – Can Your Mac or PC Run LM Studio? System Requirements
Who This Guide Is For
- First-time local LLM users on Mac or Windows who want a visual, no-terminal experience
- Ollama users looking for a GUI alternative with a built-in model browser
- Developers who want a local OpenAI-compatible API for hybrid LLM workflows
Minimum Requirements
| Component | Spec |
|---|---|
| RAM | 8 GB (runs 7B models slowly) |
| Storage | 10 GB free (models are 4–50 GB each) |
| OS | macOS 13+, Windows 10+, Ubuntu 22.04+ |
| GPU | Not strictly required, but strongly recommended |
Recommended for a Good Experience
| Component | Spec |
|---|---|
| RAM | 16–32 GB |
| GPU (NVIDIA) | RTX 3060 12 GB or better |
| GPU (Apple) | M1 Pro / M2 / M3 with 16 GB+ unified memory |
| Storage | SSD with 50+ GB free |
The Sweet Spot in 2026
- Mac users: M2/M3/M4 with 24–64 GB unified memory. Apple Silicon handles local LLMs exceptionally well because the CPU and GPU share the same memory pool. A MacBook Pro M2 with 32 GB can typically run 30B-parameter Q4 models comfortably for most workloads.
- Windows/Linux users: Any NVIDIA GPU with 8+ GB VRAM. The RTX 4060 (8 GB) is the price-to-performance champion. The RTX 3090 (24 GB) remains the enthusiast sweet spot on the used market.
No dedicated GPU? CPU-only inference works — it’s just slower. Expect around 3–8 tokens per second on a modern CPU versus 20–60+ tokens per second with a capable GPU, depending on your specific hardware and model choice.
Ready to install? Jump to Step 3 – Installation.
Step 3 – Installing LM Studio on Mac, Windows, and Linux
macOS
- Visit lmstudio.ai
- Click Download for Mac — it auto-detects Intel vs Apple Silicon
- Open the `.dmg` file and drag LM Studio to Applications
- Launch LM Studio from Applications
That’s it. No Homebrew, no terminal commands, no Python environment.
Windows
- Visit lmstudio.ai
- Click Download for Windows
- Run the `.exe` installer
- Follow the standard Windows installation wizard
- Launch LM Studio from the Start menu
NVIDIA users: Make sure your GPU drivers are up to date. LM Studio will automatically detect and use your GPU if CUDA-compatible drivers are installed.
Linux
- Visit lmstudio.ai
- Download the `.AppImage` file
- Make it executable: `chmod +x LM-Studio-*.AppImage`
- Run it: `./LM-Studio-*.AppImage`
For NVIDIA GPU acceleration, ensure you have the latest NVIDIA drivers and CUDA toolkit installed.
Checkpoint: At this point you should have LM Studio installed and running on your machine. You’ll see a clean interface with a sidebar on the left. If LM Studio won’t launch, see the Troubleshooting section below.
Step 4 – How to Choose and Download Your First Model
When you first open LM Studio, the model library is empty. Here’s how to pick the right model for your hardware.
Which Model Size Fits Your Hardware?
Open the Discover tab (magnifying glass icon in the sidebar). You’ll see thousands of models. Don’t get overwhelmed. Here’s your decision framework:
| Your RAM / VRAM | Recommended Model Size | Example Models |
|---|---|---|
| 8 GB | 7B parameters, Q4 | Llama 3.2 7B, Mistral 7B |
| 16 GB | 13–14B parameters, Q4–Q5 | Llama 3.3 14B, Qwen 2.5 14B |
| 32 GB | 30–34B parameters, Q4–Q5 | Qwen 2.5 32B, Deepseek-Coder 33B |
| 64 GB+ | 70B parameters, Q4–Q6 | Llama 3.3 70B Q5, Deepseek-V3 |
What Do the Q Numbers (Quantization) Mean?
Quantization (Q4, Q5, Q6, Q8) refers to how aggressively the model is compressed. Lower numbers = smaller file, slightly lower quality. Higher numbers = larger file, closer to original quality.
- Q4_K_M: Best balance of size and quality. Start here.
- Q5_K_M: Noticeably better quality, ~25% larger.
- Q8: Near-original quality, roughly double the size of Q4.
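You can sanity-check download sizes with quick arithmetic: on-disk size is roughly parameters × bits-per-weight ÷ 8. The bits-per-weight figures below are approximations (Q4_K_M averages closer to ~4.8 bits than 4 because of quantization metadata), so treat this as a rough estimator, not a spec:

```python
def estimate_gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough on-disk size of a quantized model in GB.

    Approximate bits per weight: Q4_K_M ~4.8, Q5_K_M ~5.7, Q8_0 ~8.5
    (the extra fraction over the nominal Q number is scale/metadata overhead).
    """
    return params_billion * bits_per_weight / 8  # billions of params * bytes per param

print(f"7B  Q4_K_M: ~{estimate_gguf_size_gb(7, 4.8):.1f} GB")
print(f"70B Q4_K_M: ~{estimate_gguf_size_gb(70, 4.8):.1f} GB")
```

These land close to the ~4 GB and ~40 GB figures quoted below, which is a good way to spot a truncated or corrupted download.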
Download Step-by-Step
- Open the Discover tab
- Search for `Llama 3.3` (the current performance king for its size)
- Look for a quantized version from a trusted uploader (TheBloke, bartowski, or the model creator)
- Select Q4_K_M for your first model
- Click Download
The download will take a few minutes depending on your connection. A 7B Q4 model is roughly 4 GB; a 70B Q4 is roughly 40 GB.
Pro tip: Start with a smaller model to verify everything works, then download a larger one. Nothing is more frustrating than waiting 30 minutes for a download only to discover a configuration issue.
Checkpoint: You should now have LM Studio installed and one Q4_K_M model downloaded. The model will appear in the My Models section of the sidebar.
Step 5 – Running Your First Local LLM Conversation
- Switch to the Chat tab (speech bubble icon in the sidebar)
- Select your downloaded model from the dropdown at the top
- Wait for the model to load into memory (typically 10–60 seconds depending on size)
- Type a message and hit Enter
You’re now running AI inference entirely on your own hardware.
What Should You Expect?
Speed: With a well-matched model and hardware, expect around 20–50 tokens per second in many setups on Apple Silicon or a mid-range NVIDIA GPU. That’s fast enough to feel conversational. CPU-only will be noticeably slower but still usable for shorter prompts.
Quality: Modern 14B+ models handle coding assistance, writing, summarization, and analysis at a level that would have required GPT-4 just 18 months ago. Don’t expect perfect performance on PhD-level reasoning tasks — that’s still where cloud models like Claude or GPT-4 earn their keep. But for roughly 80% of daily tasks, local models deliver.
Checkpoint: You should now be able to have a back-and-forth conversation with your local model. If the output is garbled or extremely slow, see Troubleshooting.
Happy with the basics? You can skip to Step 6 – Local API Server for development use, or What’s Next for recommended reading.
Step 6 – Using LM Studio as a Local API Server
This is where LM Studio becomes a serious development tool — and where the hybrid LLM approach starts.
LM Studio includes a built-in server that exposes an OpenAI-compatible API on localhost:1234. This means any application, script, or tool designed for the OpenAI API can talk to your local model with a one-line configuration change.
Starting the Server
- Go to the Developer tab (code icon in the sidebar)
- Select a model
- Click Start Server
The server runs at http://localhost:1234/v1/.
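Before wiring it into an application, you can confirm the server is up with a few lines of standard-library Python (no extra packages). This is a quick sketch, assuming the default port; `/v1/models` is part of the OpenAI-compatible surface the server exposes:

```python
import json
import urllib.request

def list_local_models(base_url: str = "http://localhost:1234/v1") -> list[str]:
    """Return the model IDs the local server exposes, or [] if it isn't running."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
            data = json.load(resp)
        return [m["id"] for m in data.get("data", [])]
    except OSError:  # covers connection refused, timeouts, and bad hosts
        return []

print(list_local_models() or "Server not running; click Start Server first")
```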
Using It in Your Code
Here’s a Python example using the standard OpenAI SDK — no changes except the base_url:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # point the SDK at LM Studio
    api_key="not-needed"                  # LM Studio ignores the key, but the SDK requires one
)

response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "user", "content": "Explain quicksort in Python"}
    ]
)

print(response.choices[0].message.content)
```
That’s the same OpenAI SDK you already know — just pointed at localhost. Your existing code works with zero refactoring.
Why This Is the Foundation of a Hybrid LLM Stack
This is the core of what we write about at HybridLLM.dev. The idea is simple: not every task needs a $0.03 cloud API call.
Here’s the routing model that can cut your AI costs by 50–70%:
| Tier | Where | Tasks | Cost |
|---|---|---|---|
| Tier 1: Local (LM Studio / Ollama) | Your machine | Summarization, formatting, code completion, translation, boilerplate generation | $0 |
| Tier 2: Cloud (GPT-4 / Claude / Gemini) | API call | Complex reasoning, multimodal analysis, frontier capabilities, tasks demanding highest accuracy | Pay per use |
Three real-world routing examples:
- Code review — Local model handles style checks and formatting suggestions. Cloud model handles architectural review of complex PRs.
- Customer support draft — Local model generates the first draft. Cloud model handles edge cases with nuanced policy interpretation.
- Document processing — Local model extracts and structures data from PDFs. Cloud model handles ambiguous fields that need judgment.
The local API server makes this routing seamless. Your application doesn’t need to know whether it’s talking to a $0 local model or a cloud endpoint. Same API. Same code. Different economics.
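As a sketch of what that routing can look like in code (the task labels, the `gpt-4o` model name, and the cloud URL are illustrative placeholders, not part of LM Studio itself):

```python
# Hypothetical task labels for Tier 1 work; adapt these to your own pipeline
LOCAL_TASKS = {"summarize", "format", "translate", "code-complete", "boilerplate"}

def pick_endpoint(task: str) -> tuple[str, str]:
    """Route well-bounded tasks to the local server, everything else to the cloud."""
    if task in LOCAL_TASKS:
        return "http://localhost:1234/v1", "local-model"  # Tier 1: $0 per call
    return "https://api.openai.com/v1", "gpt-4o"          # Tier 2: pay per use

# The caller builds one OpenAI client per endpoint; the rest of the
# application never needs to know which tier answered.
base_url, model = pick_endpoint("summarize")
print(base_url, model)
```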
Checkpoint: Your local API server should be running at `http://localhost:1234/v1/`. Test it with the Python snippet above or a simple `curl` command.
Step 7 – Troubleshooting Common Issues
Quick-Reference Table
| Symptom | Likely Cause | Quick Fix |
|---|---|---|
| “Model failed to load” | Not enough RAM/VRAM | Use smaller quantization (Q4) or smaller model (7B). Close other apps. |
| < 2 tokens/second | Model on CPU instead of GPU, or swapping to disk | Check GPU offloading settings. Pick a model that fits in memory with 2–4 GB headroom. |
| Garbled / incoherent output | Corrupted download or wrong chat template | Delete and re-download. Check that prompt format (e.g., ChatML, Llama) matches model requirements in chat settings. |
| App crashes on launch (Windows) | Outdated GPU drivers or missing VC++ | Update NVIDIA drivers. Install latest Visual C++ Redistributable. |
| High memory usage, system lag | Model too large for available RAM | Switch to a smaller model or lower quantization. Monitor with Activity Monitor (Mac) or Task Manager (Windows). |
Performance Tuning Tips
GPU Offloading — the single most impactful setting. In the model loading panel, look for GPU Layers (sometimes labeled n_gpu_layers). Set to maximum if your model fits in VRAM/unified memory. Reduce gradually if you hit out-of-memory errors. On Apple Silicon, LM Studio usually handles this automatically.
Context Length — determines how much text the model can “see” at once. Start at 4096 tokens. Only increase to 8192+ if you need longer documents or multi-turn conversations. Trade-off: longer context = more memory and slower generation.
Temperature — controls randomness:
| Temperature | Best For |
|---|---|
| 0.0–0.3 | Code generation, factual Q&A, structured output |
| 0.5–0.7 | General conversation, writing assistance |
| 0.8–1.0 | Creative writing, brainstorming |
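Through the local API from Step 6, temperature is just another field on the chat completion request. A minimal payload sketch for a low-temperature code-generation call (the prompt is arbitrary):

```python
payload = {
    "model": "local-model",
    "temperature": 0.2,  # low randomness: good for code and structured output
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a string"}
    ],
}

print(payload["temperature"])
```

With the OpenAI SDK, the same field is a keyword argument: `client.chat.completions.create(model="local-model", temperature=0.2, messages=[...])`.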
Thread Count — set to your physical core count minus 1 (leave one core for the OS). Example: 10-core M2 Pro → 9 threads. More threads do not always mean faster generation — hyperthreaded and efficiency cores can actually hurt throughput.
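If you want to compute a starting point programmatically, note that Python's `os.cpu_count()` reports logical cores (hyperthreads and efficiency cores included), so it may overshoot the physical count this tip recommends; a rough sketch:

```python
import os

logical = os.cpu_count() or 1  # logical cores; the physical count may be lower
threads = max(1, logical - 1)  # leave one core for the OS
print(f"Suggested thread count: {threads}")
```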
What’s Next
If you’re still not entirely sure which tool to start with, read these next in order:
- LM Studio Setup Guide 2026 — Get LM Studio running if you haven’t already.
- Best Local LLM Models for M2/M3/M4 Mac: Performance Benchmark 2026 — Find the right model for your specific hardware.
Building a hybrid LLM setup and not sure where to start? Reach out on X/Twitter.