
Most teams using LLM APIs have the same problem: their monthly bill is $500–$2,000, and they’re not sure what’s driving it. They send every task — summarization, formatting, code completion, complex reasoning — to the same frontier model.

That’s the equivalent of hiring a senior engineer to do data entry.

A hybrid LLM architecture solves this by routing tasks to the right model at the right cost. Simple tasks go to a fast, free local model. Complex tasks go to a powerful cloud API. The result: 50–70% lower costs with no meaningful quality loss.

This guide explains the architecture, shows you how to implement it, and gives you a real cost breakdown to prove the math works.


Who This Is For

This guide is for:

  • Teams spending $200+/month on LLM APIs and wondering where the money goes
  • Solo developers hitting rate limits who want an always-available local fallback
  • Engineering leads evaluating whether local models can reduce infrastructure costs without sacrificing output quality

If your API bill is under $50/month or you only use AI for frontier-level tasks, hybrid routing probably isn’t worth the setup overhead. Everyone else — keep reading.


Key Takeaways

  • 80% of typical AI tasks don’t need a frontier model. Summarization, formatting, translation, and simple code generation run fine on a local 14B model.
  • Smart routing is the core idea: classify each task by complexity, then send it to the cheapest model that can handle it well.
  • Three tiers are enough. Local small (14B), local large (70B), and cloud API cover the full spectrum.
  • The savings are real: a team spending $1,500/month on API calls can typically drop to $400–$600 with the same output quality.
  • You don’t need custom infrastructure. A Mac, Ollama, and a simple Python router are enough to start.

The Problem: Default Routing Is Expensive

Here’s what most AI workflows look like today:

Every task → GPT-4 / Claude → $$$

Whether it’s summarizing a meeting, formatting a JSON blob, or solving a complex reasoning problem — everything goes to the same endpoint. The API doesn’t care if the task is trivial. It charges the same per-token rate.

What This Costs in Practice

A typical development team of 5 generates roughly 10–20 million tokens per month across all their AI-assisted workflows. Here’s what that looks like:

| Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Monthly Bill (15M tokens) |
|---|---|---|---|
| GPT-4 Turbo | $10 | $30 | $300–600 |
| Claude 3.5 Sonnet | $3 | $15 | $135–270 |
| Claude 3 Opus | $15 | $75 | $675–1,350 |
| GPT-4o | $5 | $15 | $150–300 |

Most teams use a mix of these, and the bill lands somewhere between $500 and $2,000/month. That’s $6,000–$24,000 per year — for tasks where 80% of the work could have been done for free.
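To sanity-check those figures yourself, the per-model math fits in a few lines of Python. The even input/output token split and the helper name are assumptions of this sketch; adjust the split to match your actual workload:

```python
# Rough monthly bill for one model, assuming an even split between
# input and output tokens (real workloads skew; adjust input_share).
def monthly_cost(tokens_m: float, input_price: float,
                 output_price: float, input_share: float = 0.5) -> float:
    """Prices are USD per 1M tokens; tokens_m is millions of tokens/month."""
    return (tokens_m * input_share * input_price
            + tokens_m * (1 - input_share) * output_price)

# 15M tokens/month on GPT-4 Turbo ($10 in / $30 out):
print(monthly_cost(15, 10, 30))  # → 300.0, the low end of the range above
```

Skewing the split toward output tokens is what pushes a bill toward the top of each range.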


The Solution: Tiered Routing

A hybrid LLM architecture classifies tasks before they hit a model, then routes each task to the most cost-effective option that maintains acceptable quality.

Incoming Task
    │
    ├── [Simple?] → Tier 1: Local 14B model ($0)
    │                 Summarization, formatting, translation,
    │                 simple code, Q&A
    │
    ├── [Medium?] → Tier 2: Local 70B model ($0)
    │                 Complex reasoning, code review,
    │                 long-form analysis
    │
    └── [Hard?]   → Tier 3: Cloud API ($)
                     Frontier reasoning, 100k+ context,
                     specialized domains, multimodal

The Three Tiers

| Tier | Model | Cost | Speed | Quality Ceiling | % of Typical Workload |
|---|---|---|---|---|---|
| 1 | Qwen 2.5 14B (local) | $0 | 13–22 tok/s | GPT-3.5 class | ~65–70% |
| 2 | Llama 3.3 70B (local) | $0 | 8–18 tok/s | GPT-4 Turbo class | ~15–20% |
| 3 | Claude / GPT-4 (cloud) | $3–75/1M tokens | API-dependent | Frontier | ~10–15% |

Quality labels are based on my own side-by-side tests across typical dev workflows, not formal benchmarks. Your results may vary by task type.

The key insight: Tier 1 handles the bulk of the work. Most AI tasks in a typical developer workflow are not complex reasoning problems — they’re summarization, code completion, formatting, and basic Q&A. On these narrow, well-specified tasks, a well-quantized 14B model produces output most users can’t tell apart from GPT-4’s.


Task Classification: What Goes Where

Here’s how to decide which tier handles each request:

Tier 1 Tasks (Local 14B — Free)

These are tasks where a 14B model produces output that’s difficult to distinguish from GPT-4:

  • Summarization — Meeting notes, articles, long emails, documentation
  • Code completion — Boilerplate, repetitive patterns, simple functions
  • Formatting — JSON/XML/CSV conversion, markdown cleanup, data restructuring
  • Translation — Between major language pairs
  • Simple Q&A — Factual lookups, definition questions, “how do I X” for well-known topics
  • Text rewriting — Tone changes, simplification, expansion
  • Commit messages and PR descriptions — Based on diffs
  • Regex and shell commands — Generation from natural language

Tier 2 Tasks (Local 70B — Free, Requires 64GB+ RAM)

Tasks where the step up from 14B to 70B is noticeable:

  • Multi-step reasoning — Problems requiring 3+ logical steps
  • Complex code review — Large or unfamiliar codebases, subtle bugs
  • Long-form writing — Blog posts, technical docs, reports over 1,000 words
  • Architecture decisions — System design questions with trade-offs
  • Nuanced instruction-following — Tasks with multiple constraints or edge cases

Tier 3 Tasks (Cloud API — Paid)

Tasks where even 70B falls short:

  • Frontier reasoning — PhD-level analysis, mathematical proofs, novel problem-solving
  • Very long context — 50k–200k token inputs (cloud models handle this; local models struggle beyond 8k)
  • Multimodal — Image analysis, document OCR, vision tasks
  • Highly specialized domains — Current events, niche professional knowledge
  • Maximum reliability — Tasks where even small errors are unacceptable (medical, legal, financial)

Real-World Cost Breakdown

Let’s run the numbers for a team of 5 developers using AI across their daily workflow.

Before: Everything Goes to Cloud

| Task Category | Monthly Tokens | Model | Monthly Cost |
|---|---|---|---|
| Code completion | 5M | GPT-4 Turbo | $150 |
| Summarization | 3M | Claude Sonnet | $54 |
| Code review | 2M | Claude Opus | $180 |
| Q&A and formatting | 3M | GPT-4o | $60 |
| Complex analysis | 2M | Claude Opus | $180 |
| Total | 15M | | $624/month |

After: Hybrid Routing

| Task Category | Monthly Tokens | Routed To | Monthly Cost |
|---|---|---|---|
| Code completion | 5M | Local 14B | $0 |
| Summarization | 3M | Local 14B | $0 |
| Code review | 2M | Local 70B | $0 |
| Q&A and formatting | 3M | Local 14B | $0 |
| Complex analysis | 2M | Claude Opus | $180 |
| Total | 15M | | $180/month |

Savings: $444/month (71%). Over a year, that’s $5,328 for a team of 5. Scale to 20 developers and you’re looking at $20,000+ in annual savings.

These numbers assume the task distribution shown above — your split will vary. But even conservative estimates (50% local routing instead of 70%) commonly show 40–50% savings.

The quality of code completion, summarization, Q&A, and formatting? Virtually identical. The only tasks still hitting the cloud are the ones that genuinely need frontier-level reasoning.
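The before/after arithmetic is easy to script and re-run as your own task mix shifts. The category costs below are copied straight from the tables above; nothing else is assumed:

```python
# Per-category monthly costs from the before/after tables above.
before = {"code completion": 150, "summarization": 54, "code review": 180,
          "qa and formatting": 60, "complex analysis": 180}
after = {"code completion": 0, "summarization": 0, "code review": 0,
         "qa and formatting": 0, "complex analysis": 180}

savings = sum(before.values()) - sum(after.values())
pct = round(100 * savings / sum(before.values()))
print(f"${savings}/month saved ({pct}%)")  # → $444/month saved (71%)
```

Swap in your own audit numbers (see the Day 2 step later in this guide) to get a projection before you build anything.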


Implementation: Building a Simple Router

You don’t need a complex ML-based classifier to start. A rule-based router handles 90% of cases and takes an afternoon to build.

Architecture Overview

Your Application / Script
         │
         ▼
    [Task Router]
    ├── keyword/pattern match
    ├── token count check
    └── explicit tier flag
         │
    ┌────┼────┐
    ▼    ▼    ▼
  Ollama Ollama Cloud API
  14B    70B   (Claude/GPT-4)
  :11434 :11434 api.anthropic.com

Basic Python Router

from openai import OpenAI

# Local models via Ollama
local_client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed"
)

# Cloud API (OpenAI example; adapt for Anthropic SDK if using Claude)
cloud_client = OpenAI(
    api_key="your-openai-api-key"
)

# Simple routing rules
TIER1_KEYWORDS = [
    "summarize", "translate", "format", "convert",
    "rewrite", "simplify", "commit message", "regex"
]

TIER3_KEYWORDS = [
    "analyze in depth", "prove", "compare all options",
    "review this architecture", "long document"
]

def classify_task(prompt: str, token_count: int = 0) -> int:
    prompt_lower = prompt.lower()

    # Explicit tier override
    if prompt_lower.startswith("[tier3]") or prompt_lower.startswith("[cloud]"):
        return 3
    if prompt_lower.startswith("[tier2]") or prompt_lower.startswith("[heavy]"):
        return 2

    # Long context → cloud
    if token_count > 8000:
        return 3

    # Pattern matching
    if any(kw in prompt_lower for kw in TIER3_KEYWORDS):
        return 3
    if any(kw in prompt_lower for kw in TIER1_KEYWORDS):
        return 1

    # Default: Tier 1 (local small). In this minimal version, Tier 2 is
    # only reached via an explicit [tier2]/[heavy] flag.
    return 1

def route_request(prompt: str, token_count: int = 0) -> str:
    tier = classify_task(prompt, token_count)

    # Strip any routing flag so it isn't sent to the model itself
    for flag in ("[tier2]", "[tier3]", "[heavy]", "[cloud]"):
        if prompt.lower().startswith(flag):
            prompt = prompt[len(flag):].lstrip()
            break

    if tier == 1:
        # Any capable ~14B Ollama model works; Llama 3.3 ships only as 70B
        model, client = "qwen2.5:14b", local_client
    elif tier == 2:
        model, client = "llama3.3:70b", local_client
    else:
        model, client = "gpt-4-turbo", cloud_client

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

Usage

# Tier 1 — handled locally, free
result = route_request("Summarize this meeting transcript: ...")

# Tier 2 — local 70B
result = route_request("[heavy] Review this codebase for architectural issues: ...")

# Tier 3 — cloud API
result = route_request("[cloud] Analyze the trade-offs between microservices and monolith for our specific case: ...")

# Auto-classified as Tier 3 (long context)
result = route_request("Analyze this document: ...", token_count=15000)

This is intentionally simple. A keyword-and-rule router gets you 80% of the way there. You can add ML-based classification later if needed — but most teams never need to.
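One practical gap: route_request expects a token_count, which callers rarely have on hand. A character-based estimate is plenty for a routing threshold. The ~4 characters per token ratio below is a rough heuristic for English text, not an exact tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text.
    Accurate enough for routing thresholds, not for billing."""
    return max(1, len(text) // 4)

# Route long inputs to the cloud without pulling in a tokenizer:
# route_request(doc, token_count=estimate_tokens(doc))
print(estimate_tokens("x" * 40_000))  # → 10000, well past the 8k cloud cutoff
```

If the estimate lands near your cutoff, err toward the cloud tier; misrouting a long document locally costs more time than the API call costs money.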


Advanced Routing Strategies

Once the basic router is working, these patterns further optimize cost and quality.

Fallback Chains

If a local model produces low-confidence output, automatically escalate:

def route_with_fallback(prompt: str) -> str:
    # Try local first
    result = route_request(prompt)

    # Simple quality check: if response is too short or contains
    # "I don't know" / "I'm not sure", escalate
    low_confidence = ("i don't know" in result.lower()
                      or "i'm not sure" in result.lower())
    if len(result) < 50 or low_confidence:
        return route_request("[cloud] " + prompt)

    return result

Cost Tracking

Log every request with its tier and token count. After a week, you’ll have hard data on your routing efficiency:

import json
from datetime import datetime

def log_request(tier: int, model: str, tokens: int, task_type: str):
    entry = {
        "timestamp": datetime.now().isoformat(),
        "tier": tier,
        "model": model,
        "tokens": tokens,
        "task_type": task_type,
        "estimated_cost": 0 if tier <= 2 else tokens * 0.00003  # adjust per your actual model pricing
    }
    with open("llm_usage.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")

Review the log weekly. If certain task types are consistently routed to Tier 3, ask: could a better prompt make this work on Tier 1?
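That weekly review can itself be scripted. This sketch (the function name is mine; the field names match the log_request entries above) totals requests, tokens, and cost per tier:

```python
import json
from collections import defaultdict

def summarize_usage(lines):
    """Aggregate llm_usage.jsonl lines into per-tier totals."""
    totals = defaultdict(lambda: {"requests": 0, "tokens": 0, "cost": 0.0})
    for line in lines:
        entry = json.loads(line)
        bucket = totals[entry["tier"]]
        bucket["requests"] += 1
        bucket["tokens"] += entry["tokens"]
        bucket["cost"] += entry["estimated_cost"]
    return dict(totals)

# Typical use: summarize_usage(open("llm_usage.jsonl"))
```

If Tier 3 dominates the token totals, that's where prompt tuning pays off first.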

Prompt Optimization for Local Models

Local models respond better to:

  • Direct instructions — “Summarize in 3 bullet points” beats “Can you please provide a summary?”
  • Explicit format — “Output as JSON with keys: title, summary, tags” beats “Format this nicely”
  • Constrained scope — “Answer in 2 sentences” keeps small models on track

Spending 10 minutes tuning prompts for Tier 1 often moves tasks from “needs Tier 3” to “works fine on Tier 1.”
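Those three rules can be baked into a tiny template so every Tier 1 request gets them automatically. The helper below and its defaults are illustrative, not a standard:

```python
def tier1_prompt(task: str, fmt: str = "3 bullet points",
                 max_sentences: int = 0) -> str:
    """Wrap a task in the direct, constrained style small models handle best."""
    parts = [task.strip().rstrip(".") + ".", f"Output as {fmt}."]
    if max_sentences:
        parts.append(f"Answer in at most {max_sentences} sentences.")
    return " ".join(parts)

print(tier1_prompt("Summarize this meeting transcript"))
# → Summarize this meeting transcript. Output as 3 bullet points.
```

Calling route_request(tier1_prompt(task)) keeps the constraints consistent across a team without anyone having to remember them.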


When NOT to Use Hybrid Routing

This approach isn’t for everyone. Skip it if:

  • Your total API bill is under $50/month — the complexity isn’t worth the savings
  • You only use AI for frontier tasks — if every request genuinely needs GPT-4, there’s nothing to route locally
  • You don’t have local hardware — without at least a 16GB Mac or a GPU with 8GB+ VRAM, the local tier isn’t practical
  • Latency tolerance is zero — for heavy prompts, cloud APIs running on datacenter GPUs typically beat local models on both time-to-first-token and throughput

For everyone else — especially teams spending $200+/month on API calls — hybrid routing pays for itself in the first week.


Getting Started: This Week

If you’re ready to implement this:

Day 1: Set Up Your Local Tier

Install Ollama and pull a 14B model:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5:14b

If you have 64GB+ RAM, also pull the 70B:

ollama pull llama3.3:70b

Day 2: Audit Your Current Usage

Look at your API logs or billing dashboard. Categorize your last 100 API calls:

  • How many were summarization, formatting, or simple Q&A?
  • How many genuinely needed frontier reasoning?
  • What’s the average token count per request?

Most teams find that 60–80% of calls are Tier 1 material.
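A quick way to run that audit over an exported prompt log is to bucket each prompt by keyword. The hint list and the percentages-only output below are deliberate simplifications of the router's classifier:

```python
from collections import Counter

TIER1_HINTS = ("summarize", "format", "convert", "translate", "rewrite")

def audit(prompts):
    """Rough split of past prompts into likely-Tier-1 vs needs-review, in %."""
    counts = Counter(
        "tier1" if any(h in p.lower() for h in TIER1_HINTS) else "review"
        for p in prompts
    )
    total = len(prompts) or 1
    return {k: round(100 * v / total) for k, v in counts.items()}

print(audit(["Summarize notes", "Convert to JSON", "Prove this theorem"]))
# → {'tier1': 67, 'review': 33}
```

Anything in the "review" bucket gets a human glance: some of it genuinely needs Tier 3, and the rest tells you which keywords your router is missing.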

Day 3: Build Your Router

Copy the Python router from this article. Adapt the keyword lists to your actual task types. Start routing Tier 1 tasks locally.

Day 7: Measure

Compare your API bill for this week against last week. The savings should be immediately visible.


Key Metrics to Track

| Metric | Target | Why It Matters |
|---|---|---|
| Tier 1 routing % | 60–70% | This is where the savings come from |
| Quality complaints | <5% increase | If users notice degraded quality, routing is too aggressive |
| Cloud API spend | 50–70% reduction | The bottom line |
| Local model latency (p95) | <10s for 14B | If local is too slow, people will bypass the router |
| Fallback rate | <15% | High fallback means your classification needs tuning |

What’s Next

Get your local environment running first; everything else in this guide builds on a working Tier 1.

Follow @hybridllm for weekly cost optimization tips and architecture patterns.