LM Studio Setup Guide 2026: How to Install and Run Local LLMs in 5 Minutes
This is a step-by-step LM Studio setup guide for Mac, Windows, and Linux to install and run local LLMs — completely offline, completely free, with zero data leaving your machine.
Key Takeaways
- Who this is for: Anyone with a Mac (M1+) or Windows PC (RTX 3060+) who wants to run AI models locally
- What you’ll get: LM Studio installed, your first model downloaded and running, a local API server ready for development
- Time required: ~30 minutes from zero to a working local AI assistant (the install itself takes about 5 minutes; most of the rest is model download time)
- Cost: $0 — everything in this guide is free
Step 1 – What Is LM Studio and Why Use It Instead of Cloud LLMs?
Already using Ollama? Think of LM Studio as the GUI-first alternative — same models, visual interface, built-in API server. Read our detailed Ollama vs LM Studio comparison to see which fits your workflow.
LM Studio is a desktop application that lets you discover, download, and run open-source large language models locally. Think of it as the iTunes of AI models — a clean interface on top of what would otherwise require terminal commands and manual configuration.
Why this matters in 2026:
- Privacy: Your prompts never leave your computer. For anyone working with proprietary code, medical records, legal documents, or client data, this isn’t optional — it’s a requirement.
- Cost: Cloud API calls add up fast. A team of five developers using GPT-4-level models can easily spend $500–2,000 per month. Local inference costs exactly $0 after the hardware investment.
- No rate limits: You won’t get throttled at 3 AM when you’re on a deadline.
- Offline access: Works on a plane, in a coffee shop with bad Wi-Fi, or in an air-gapped corporate network.
The catch? You need decent hardware. But if you’re reading this on a machine bought in the last two years, you probably have enough.
Already know what LM Studio is? Jump to Step 2 – System Requirements.
Step 2 – Can Your Mac or PC Run LM Studio? System Requirements
Who This Guide Is For
- First-time local LLM users on Mac or Windows who want a visual, no-terminal experience
- Ollama users looking for a GUI alternative with a built-in model browser
- Developers who want a local OpenAI-compatible API for hybrid LLM workflows
Minimum Requirements
| Component | Spec |
|---|---|
| RAM | 8 GB (runs 7B models slowly) |
| Storage | 10 GB free (models are 4–50 GB each) |
| OS | macOS 13+, Windows 10+, Ubuntu 22.04+ |
| GPU | Not strictly required, but strongly recommended |
Recommended for a Good Experience
| Component | Spec |
|---|---|
| RAM | 16–32 GB |
| GPU (NVIDIA) | RTX 3060 12 GB or better |
| GPU (Apple) | M1 Pro / M2 / M3 with 16 GB+ unified memory |
| Storage | SSD with 50+ GB free |
The Sweet Spot in 2026
- Mac users: M2/M3/M4 with 24–64 GB unified memory. Apple Silicon handles local LLMs exceptionally well because the CPU and GPU share the same memory pool. A MacBook Pro M2 with 32 GB can typically run 30B-parameter Q4 models comfortably for most workloads.
- Windows/Linux users: Any NVIDIA GPU with 8+ GB VRAM. The RTX 4060 (8 GB) is the price-to-performance champion. The RTX 3090 (24 GB) remains the enthusiast sweet spot on the used market.
No dedicated GPU? CPU-only inference works — it’s just slower. Expect around 3–8 tokens per second on a modern CPU versus 20–60+ tokens per second with a capable GPU, depending on your specific hardware and model choice.
Ready to install? Jump to Step 3 – Installation.
Step 3 – Installing LM Studio on Mac, Windows, and Linux
macOS
- Visit lmstudio.ai
- Click Download for Mac — it auto-detects Intel vs Apple Silicon
- Open the `.dmg` file and drag LM Studio to Applications
- Launch LM Studio from Applications
That’s it. No Homebrew, no terminal commands, no Python environment.
Windows
- Visit lmstudio.ai
- Click Download for Windows
- Run the `.exe` installer
- Follow the standard Windows installation wizard
- Launch LM Studio from the Start menu
NVIDIA users: Make sure your GPU drivers are up to date. LM Studio will automatically detect and use your GPU if CUDA-compatible drivers are installed.
Linux
- Visit lmstudio.ai
- Download the `.AppImage` file
- Make it executable: `chmod +x LM-Studio-*.AppImage`
- Run it: `./LM-Studio-*.AppImage`
For NVIDIA GPU acceleration, ensure you have the latest NVIDIA drivers and CUDA toolkit installed.
Checkpoint: At this point you should have LM Studio installed and running on your machine. You’ll see a clean interface with a sidebar on the left. If LM Studio won’t launch, see the Troubleshooting section below.
Step 4 – How to Choose and Download Your First Model
When you first open LM Studio, the model library is empty. Here’s how to pick the right model for your hardware.
Which Model Size Fits Your Hardware?
Open the Discover tab (magnifying glass icon in the sidebar). You’ll see thousands of models. Don’t get overwhelmed. Here’s your decision framework:
| Your RAM / VRAM | Recommended Model Size | Example Models |
|---|---|---|
| 8 GB | 7B parameters, Q4 | Llama 3.2 7B, Mistral 7B |
| 16 GB | 13–14B parameters, Q4–Q5 | Llama 3.3 14B, Qwen 2.5 14B |
| 32 GB | 30–34B parameters, Q4–Q5 | Qwen 2.5 32B, Deepseek-Coder 33B |
| 64 GB+ | 70B parameters, Q4–Q6 | Llama 3.3 70B Q5, Deepseek-V3 |
What Do the Q Numbers (Quantization) Mean?
Quantization (Q4, Q5, Q6, Q8) refers to how aggressively the model is compressed. Lower numbers = smaller file, slightly lower quality. Higher numbers = larger file, closer to original quality.
- Q4_K_M: Best balance of size and quality. Start here.
- Q5_K_M: Noticeably better quality, ~25% larger.
- Q8: Near-original quality, roughly double the size of Q4.
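You can sanity-check download sizes with quick arithmetic: on-disk size is roughly parameters × bits-per-weight ÷ 8. The bits-per-weight figures below are approximations (Q4_K_M averages closer to ~4.8 bits than 4 because of quantization metadata), so treat this as a rough estimator, not a spec:

```python
def estimate_gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough on-disk size of a quantized model in GB.

    Approximate bits per weight: Q4_K_M ~4.8, Q5_K_M ~5.7, Q8_0 ~8.5
    (the extra fraction over the nominal Q number is scale/metadata overhead).
    """
    return params_billion * bits_per_weight / 8  # billions of params * bytes per param

print(f"7B  Q4_K_M: ~{estimate_gguf_size_gb(7, 4.8):.1f} GB")
print(f"70B Q4_K_M: ~{estimate_gguf_size_gb(70, 4.8):.1f} GB")
```

These land close to the ~4 GB and ~40 GB figures quoted below, which is a good way to spot a truncated or corrupted download.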
Download Step-by-Step
- Open the Discover tab
- Search for `Llama 3.3` (the current performance king for its size)
- Look for a quantized version from a trusted uploader (TheBloke, bartowski, or the model creator)
- Select Q4_K_M for your first model
- Click Download
The download will take a few minutes depending on your connection. A 7B Q4 model is roughly 4 GB; a 70B Q4 is roughly 40 GB.
Pro tip: Start with a smaller model to verify everything works, then download a larger one. Nothing is more frustrating than waiting 30 minutes for a download only to discover a configuration issue.
Checkpoint: You should now have LM Studio installed and one Q4_K_M model downloaded. The model will appear in the My Models section of the sidebar.
Step 5 – Running Your First Local LLM Conversation
- Switch to the Chat tab (speech bubble icon in the sidebar)
- Select your downloaded model from the dropdown at the top
- Wait for the model to load into memory (typically 10–60 seconds depending on size)
- Type a message and hit Enter
You’re now running AI inference entirely on your own hardware.
What Should You Expect?
Speed: With a well-matched model and hardware, expect around 20–50 tokens per second in many setups on Apple Silicon or a mid-range NVIDIA GPU. That’s fast enough to feel conversational. CPU-only will be noticeably slower but still usable for shorter prompts.
Quality: Modern 14B+ models handle coding assistance, writing, summarization, and analysis at a level that would have required GPT-4 just 18 months ago. Don’t expect perfect performance on PhD-level reasoning tasks — that’s still where cloud models like Claude or GPT-4 earn their keep. But for roughly 80% of daily tasks, local models deliver.
Checkpoint: You should now be able to have a back-and-forth conversation with your local model. If the output is garbled or extremely slow, see Troubleshooting.
Happy with the basics? You can skip to Step 6 – Local API Server for development use, or What’s Next for recommended reading.
Step 6 – Using LM Studio as a Local API Server
This is where LM Studio becomes a serious development tool — and where the hybrid LLM approach starts.
LM Studio includes a built-in server that exposes an OpenAI-compatible API on localhost:1234. This means any application, script, or tool designed for the OpenAI API can talk to your local model with a one-line configuration change.
Starting the Server
- Go to the Developer tab (code icon in the sidebar)
- Select a model
- Click Start Server
The server runs at http://localhost:1234/v1/.
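Before wiring it into an application, you can confirm the server is up with a few lines of standard-library Python (no extra packages). This is a quick sketch, assuming the default port; `/v1/models` is part of the OpenAI-compatible surface the server exposes:

```python
import json
import urllib.request

def list_local_models(base_url: str = "http://localhost:1234/v1") -> list[str]:
    """Return the model IDs the local server exposes, or [] if it isn't running."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
            data = json.load(resp)
        return [m["id"] for m in data.get("data", [])]
    except OSError:  # covers connection refused, timeouts, and bad hosts
        return []

print(list_local_models() or "Server not running; click Start Server first")
```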
Using It in Your Code
Here’s a Python example using the standard OpenAI SDK — no changes except the base_url:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # point the SDK at LM Studio
    api_key="not-needed"                  # LM Studio ignores the key, but the SDK requires one
)

response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "user", "content": "Explain quicksort in Python"}
    ]
)

print(response.choices[0].message.content)
```
That’s the same OpenAI SDK you already know — just pointed at localhost. Your existing code works with zero refactoring.
Why This Is the Foundation of a Hybrid LLM Stack
This is the core of what we write about at HybridLLM.dev. The idea is simple: not every task needs a $0.03 cloud API call.
Here’s the routing model that can cut your AI costs by 50–70%:
| Tier | Where | Tasks | Cost |
|---|---|---|---|
| Tier 1: Local (LM Studio / Ollama) | Your machine | Summarization, formatting, code completion, translation, boilerplate generation | $0 |
| Tier 2: Cloud (GPT-4 / Claude / Gemini) | API call | Complex reasoning, multimodal analysis, frontier capabilities, tasks demanding highest accuracy | Pay per use |
Three real-world routing examples:
- Code review — Local model handles style checks and formatting suggestions. Cloud model handles architectural review of complex PRs.
- Customer support draft — Local model generates the first draft. Cloud model handles edge cases with nuanced policy interpretation.
- Document processing — Local model extracts and structures data from PDFs. Cloud model handles ambiguous fields that need judgment.
The local API server makes this routing seamless. Your application doesn’t need to know whether it’s talking to a $0 local model or a cloud endpoint. Same API. Same code. Different economics.
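As a sketch of what that routing can look like in code (the task labels, the `gpt-4o` model name, and the cloud URL are illustrative placeholders, not part of LM Studio itself):

```python
# Hypothetical task labels for Tier 1 work; adapt these to your own pipeline
LOCAL_TASKS = {"summarize", "format", "translate", "code-complete", "boilerplate"}

def pick_endpoint(task: str) -> tuple[str, str]:
    """Route well-bounded tasks to the local server, everything else to the cloud."""
    if task in LOCAL_TASKS:
        return "http://localhost:1234/v1", "local-model"  # Tier 1: $0 per call
    return "https://api.openai.com/v1", "gpt-4o"          # Tier 2: pay per use

# The caller builds one OpenAI client per endpoint; the rest of the
# application never needs to know which tier answered.
base_url, model = pick_endpoint("summarize")
print(base_url, model)
```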
Checkpoint: Your local API server should be running at `http://localhost:1234/v1/`. Test it with the Python snippet above or a simple `curl` command.
Step 7 – Troubleshooting Common Issues
Quick-Reference Table
| Symptom | Likely Cause | Quick Fix |
|---|---|---|
| “Model failed to load” | Not enough RAM/VRAM | Use smaller quantization (Q4) or smaller model (7B). Close other apps. |
| < 2 tokens/second | Model on CPU instead of GPU, or swapping to disk | Check GPU offloading settings. Pick a model that fits in memory with 2–4 GB headroom. |
| Garbled / incoherent output | Corrupted download or wrong chat template | Delete and re-download. Check that prompt format (e.g., ChatML, Llama) matches model requirements in chat settings. |
| App crashes on launch (Windows) | Outdated GPU drivers or missing VC++ | Update NVIDIA drivers. Install latest Visual C++ Redistributable. |
| High memory usage, system lag | Model too large for available RAM | Switch to a smaller model or lower quantization. Monitor with Activity Monitor (Mac) or Task Manager (Windows). |
Performance Tuning Tips
GPU Offloading — the single most impactful setting. In the model loading panel, look for GPU Layers (sometimes labeled n_gpu_layers). Set to maximum if your model fits in VRAM/unified memory. Reduce gradually if you hit out-of-memory errors. On Apple Silicon, LM Studio usually handles this automatically.
Context Length — determines how much text the model can “see” at once. Start at 4096 tokens. Only increase to 8192+ if you need longer documents or multi-turn conversations. Trade-off: longer context = more memory and slower generation.
Temperature — controls randomness:
| Temperature | Best For |
|---|---|
| 0.0–0.3 | Code generation, factual Q&A, structured output |
| 0.5–0.7 | General conversation, writing assistance |
| 0.8–1.0 | Creative writing, brainstorming |
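Through the local API from Step 6, temperature is just another field on the chat completion request. A minimal payload sketch for a low-temperature code-generation call (the prompt is arbitrary):

```python
payload = {
    "model": "local-model",
    "temperature": 0.2,  # low randomness: good for code and structured output
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a string"}
    ],
}

print(payload["temperature"])
```

With the OpenAI SDK, the same field is a keyword argument: `client.chat.completions.create(model="local-model", temperature=0.2, messages=[...])`.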
Thread Count — set to your physical core count minus 1 (leave one core for the OS). Example: 10-core M2 Pro → 9 threads. More threads do not always mean faster generation — hyperthreaded and efficiency cores can actually hurt throughput.
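If you want to compute a starting point programmatically, note that Python's `os.cpu_count()` reports logical cores (hyperthreads and efficiency cores included), so it may overshoot the physical count this tip recommends; a rough sketch:

```python
import os

logical = os.cpu_count() or 1  # logical cores; the physical count may be lower
threads = max(1, logical - 1)  # leave one core for the OS
print(f"Suggested thread count: {threads}")
```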
What’s Next
If you’re still not entirely sure which tool to start with, read these next in order:
- LM Studio Setup Guide 2026 — Get LM Studio running if you haven’t already.
- Best Local LLM Models for M2/M3/M4 Mac: Performance Benchmark 2026 — Find the right model for your specific hardware.
Building a hybrid LLM setup and not sure where to start? Reach out on X/Twitter.