
Local LLMs Are Good Enough


I pay roughly $200/month in API costs across OpenAI, Anthropic, and Cohere. Most of that spend is justified — production systems need reliability, speed, and the best available models. But a surprising amount of it isn't.

I started running models locally three months ago. The gap between local and cloud is narrower than most people think, and for a whole class of tasks, local wins outright.

the hardware reality

I'm running a MacBook Pro M3 Max with 64GB unified memory. Here's what that gets you:

| Model | Size | Tokens/sec | RAM Usage |
|-------|------|------------|-----------|
| Llama 3.1 8B (Q4) | 4.7 GB | ~45 tok/s | ~6 GB |
| Mistral 7B (Q4) | 4.4 GB | ~50 tok/s | ~5.5 GB |
| Phi-3 Mini (Q4) | 2.3 GB | ~65 tok/s | ~3.5 GB |
| Llama 3.1 70B (Q4) | 40 GB | ~8 tok/s | ~42 GB |
| Qwen 2.5 32B (Q4) | 18 GB | ~18 tok/s | ~20 GB |

The 8B models are fast enough for interactive use. The 70B model is slow but works for batch processing. Qwen 2.5 32B hits a sweet spot — genuinely good quality at usable speed.
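
If you want to sanity-check throughput on your own hardware, a rough timing sketch against the local endpoint is enough. This assumes the OpenAI-compatible API fills in `usage` (recent Ollama builds do); treat the result as a ballpark, not a benchmark:

import time
from openai import OpenAI

# Point the standard OpenAI client at the local Ollama server.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

start = time.time()
response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Write ~200 words on rate limiting."}],
)
elapsed = time.time() - start

# completion_tokens comes back in the usage block; divide by wall-clock time.
tokens = response.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.1f}s = {tokens / elapsed:.0f} tok/s")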

setup takes 5 minutes

Ollama made this trivially easy:

# Install
brew install ollama
 
# Pull models
ollama pull llama3.1:8b
ollama pull mistral:7b
ollama pull phi3:mini
ollama pull qwen2.5:32b
 
# Run with OpenAI-compatible API
ollama serve  # localhost:11434

The OpenAI-compatible API means you can swap local models into existing code by changing the base URL:

from openai import OpenAI
 
# Just point to Ollama instead of OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
 
response = client.chat.completions.create(
    model="qwen2.5:32b",
    messages=[{"role": "user", "content": "Explain rate limiting strategies"}],
)
print(response.choices[0].message.content)

No code changes beyond the base URL. Your existing OpenAI SDK code works as-is.

where local models win

Development and testing. Every API call during development costs money. When I'm iterating on prompts, testing edge cases, or building new features, I run against local models first. Switch to cloud for production. This alone saved me ~$60/month.
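
In practice the switch is a single setting. A minimal sketch (the `LLM_BACKEND` variable and the model picks are just my convention, not anything the SDK prescribes):

import os
from openai import OpenAI

# Hypothetical toggle: default to local, flip to the cloud provider for prod.
if os.environ.get("LLM_BACKEND", "local") == "local":
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
    model = "llama3.1:8b"
else:
    client = OpenAI()  # picks up OPENAI_API_KEY from the environment
    model = "gpt-4o-mini"

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Summarize this ticket: ..."}],
)
print(response.choices[0].message.content)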

Data sensitivity. Some data shouldn't leave your machine. Customer PII, internal codebases, proprietary business logic. Running Llama locally means zero data leaves the network. No privacy policy to read, no DPA to sign.

Batch processing. I needed to classify 50,000 support tickets. GPT-4o-mini would've cost ~$15 and taken 2 hours with rate limiting. Qwen 2.5 32B locally took 6 hours but cost $0 and I could rerun it freely while tuning the prompt.
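
The batch job itself was nothing clever. A sketch of the shape it took (the file names, label set, and ticket schema here are placeholders):

import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

# Placeholder label set; yours will come from your ticketing system.
LABELS = ["billing", "bug", "feature-request", "account", "other"]

def classify(ticket_text: str) -> str:
    response = client.chat.completions.create(
        model="qwen2.5:32b",
        messages=[
            {"role": "system",
             "content": f"Classify the support ticket as one of: {', '.join(LABELS)}. "
                        "Reply with the label only."},
            {"role": "user", "content": ticket_text},
        ],
        temperature=0,
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in LABELS else "other"

# tickets.jsonl / labels.jsonl are stand-ins for wherever your data lives.
with open("tickets.jsonl") as src, open("labels.jsonl", "w") as dst:
    for line in src:
        ticket = json.loads(line)
        dst.write(json.dumps({"id": ticket["id"], "label": classify(ticket["text"])}) + "\n")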

Offline use. Planes, trains, spotty WiFi. The model works everywhere your laptop does.

where local models lose

Be honest about the gaps:

Complex reasoning. Claude Opus and GPT-4o are still meaningfully better at multi-step reasoning, nuanced analysis, and long-context tasks. The 8B models can't touch them. The 70B models get closer but aren't there yet.

Speed at scale. For production services with concurrent users, a cheap hosted model (GPT-4o-mini runs about $0.15 per million input tokens) beats running your own GPU cluster until you're at serious volume.

Code generation. For real code generation — not completions, but "build this feature" — the gap between Claude Sonnet and local models is still wide. I wouldn't trust Llama 8B with architectural decisions.

Tool calling and structured output. Cloud models handle function calling and JSON mode reliably. Local models are inconsistent. Llama 3.1 supports tool use in theory but drops arguments or hallucinates function names in practice.
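
The workaround is to treat structured output from local models as untrusted: ask for JSON, validate it, and retry. A rough sketch (the model choice and retry count are arbitrary):

import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

def extract_json(prompt: str, retries: int = 3) -> dict:
    # Defensive wrapper that cloud JSON modes mostly make unnecessary.
    for _ in range(retries):
        response = client.chat.completions.create(
            model="llama3.1:8b",
            messages=[{"role": "user",
                       "content": prompt + "\nRespond with a single JSON object and nothing else."}],
            temperature=0,
        )
        text = response.choices[0].message.content.strip()
        # Small models love wrapping JSON in markdown fences; strip them.
        if text.startswith("```"):
            text = text.strip("`").removeprefix("json").strip()
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            continue  # try again rather than crash the pipeline
    raise ValueError("no valid JSON after retries")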

my actual workflow

Here's how I split things:

| Task | Model | Why |
|------|-------|-----|
| Prompt iteration | Llama 3.1 8B (local) | Fast, free, good enough for testing |
| Text classification | Qwen 2.5 32B (local) | Accurate enough, no data leaves machine |
| Summarization | Mistral 7B (local) | Fast, handles long inputs well |
| Code generation | Claude Sonnet (API) | Still the best for real code |
| Complex analysis | Claude Opus (API) | When reasoning quality matters |
| Production eval | GPT-4o (API) | Reliable structured output |
| Quick prototyping | Phi-3 Mini (local) | Tiny, instant responses |

The key insight: it's not local or cloud. It's using the cheapest model that meets the quality bar for each specific task. Sometimes that's a 2B parameter model on your laptop. Sometimes it's Opus.
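
In code, that table collapses to a lookup. A toy sketch (the task keys and model strings are shorthand, not real API model IDs):

# Cheapest model that has met the quality bar for each task so far.
ROUTES = {
    "prompt-iteration": ("local", "llama3.1:8b"),
    "classification":   ("local", "qwen2.5:32b"),
    "summarization":    ("local", "mistral:7b"),
    "code-generation":  ("api",   "claude-sonnet"),
    "complex-analysis": ("api",   "claude-opus"),
}

def pick_model(task: str) -> tuple[str, str]:
    # Unknown tasks fall back to the tiniest local model for a first pass.
    return ROUTES.get(task, ("local", "phi3:mini"))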

the cost math

Rough monthly comparison for my usage patterns:

  • All cloud: ~$200/month
  • Hybrid (local dev + cloud prod): ~$120/month
  • Electricity for local inference: ~$8/month

The net savings, roughly $70/month once electricity is counted, aren't life-changing. But the real win is iteration speed. When inference is free, I test more aggressively, try more prompt variations, and catch edge cases earlier. That's worth more than the dollar savings.

what's coming

The local model ecosystem is improving fast. Every few months, a new open model closes another gap. Llama 3.1 was a step change from Llama 2. Qwen 2.5 competes with GPT-3.5 on most benchmarks. If that trajectory holds, local 32B models will match GPT-4-level quality within a year.

I'm betting that by late 2026, cloud APIs will only be necessary for the hardest 20% of tasks. For everything else, local inference will be the default.

