Cloud-based AI is convenient. Upload your data, get results back, pay by the token. The model lives somewhere else, and so does your context. That trade-off works until it doesn’t.
Running models locally changes the equation. Your data stays on your machine. Your context window belongs to you. Latency drops to milliseconds. The cost structure flips from per-token billing to a one-time hardware investment.
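The per-token-versus-hardware trade reduces to a simple break-even calculation. A minimal sketch, with purely illustrative figures (the $2,000 hardware cost and $10-per-million-token cloud rate are assumptions, not quoted prices):

```python
# Hypothetical break-even sketch: one-time hardware cost vs. per-token
# cloud billing. All figures below are illustrative assumptions.

def breakeven_tokens(hardware_cost: float, cloud_price_per_mtok: float) -> float:
    """Tokens you must process before local hardware pays for itself."""
    return hardware_cost / cloud_price_per_mtok * 1_000_000

# e.g. a $2,000 machine vs. $10 per million tokens:
tokens = breakeven_tokens(2000, 10.0)
print(f"{tokens:,.0f} tokens")  # 200,000,000 tokens
```

High-volume workloads cross that line quickly; occasional use may never cross it, which is the whole question.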
The Hardware Reality
Local inference hardware has improved dramatically. A mid-range consumer laptop now runs 3-billion-parameter models in real time. Larger models, up to 70B parameters, run on desktop hardware with discrete GPUs or high-memory configurations.
The Intel Core Ultra 9 185H, a laptop-class processor, handles 3B-8B parameter models at acceptable speeds without a discrete GPU. Adding a dedicated GPU shifts the ceiling significantly higher. The practical constraint isn’t hardware — it’s knowing which model fits your hardware and your task.
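"Which model fits your hardware" is mostly a memory estimate: parameter count times bytes per quantized weight, plus runtime overhead. A rough sketch, where the 20% overhead factor for KV cache and runtime is an assumption rather than a measured value:

```python
# Rough memory estimate for a quantized model: parameters times bytes per
# weight, inflated by an assumed 20% overhead for KV cache and runtime.

def model_memory_gb(params_billion: float, bits_per_weight: int = 4,
                    overhead: float = 1.2) -> float:
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

for size in (3, 8, 70):
    print(f"{size}B @ 4-bit: ~{model_memory_gb(size):.1f} GB")
```

At 4-bit quantization this puts 3B and 8B models comfortably inside laptop RAM, and shows why 70B lands on discrete-GPU or high-memory desktops.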
What You Actually Gain
Privacy is the obvious benefit. Code, documents, conversations — none of it leaves your machine. For enterprise users, this eliminates a category of compliance overhead. For individuals, it means your personal context isn’t training someone else’s model.
Less discussed: latency changes how you interact with AI. When response times drop below 100ms, you stop treating AI as a separate workflow. It becomes part of your existing tools. The interaction model shifts from “submit prompt, wait, read response” to “iterate rapidly on ideas.”
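That rapid-iteration loop usually means streaming tokens as they are generated rather than waiting for a full round trip. A sketch against a local Ollama server on its default port; the model name and endpoint layout here assume Ollama's streaming API, so adjust for whatever runtime you use:

```python
# Sketch: stream tokens from a local inference server so output appears as
# it is generated. Assumes an Ollama instance at localhost:11434 and a
# pulled model; both are assumptions about your local setup.
import json
import urllib.request

def parse_stream_line(line: bytes) -> str:
    """Each streamed chunk is one JSON object; pull out its text fragment."""
    return json.loads(line).get("response", "")

def stream_completion(prompt: str, model: str = "llama3.2:3b") -> str:
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": True}).encode(),
        headers={"Content-Type": "application/json"},
    )
    pieces = []
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # newline-delimited JSON chunks
            token = parse_stream_line(line)
            print(token, end="", flush=True)  # text appears token by token
            pieces.append(token)
    return "".join(pieces)

if __name__ == "__main__":
    stream_completion("Summarize local inference in one sentence.")
```

Because the first token arrives in tens of milliseconds, the loop feels like autocomplete rather than a request queue.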
Offline capability matters more than it should. Presentations without wifi, flights, conference calls in venues with bad connectivity — the model still works. This isn’t theoretical. It changes which problems you attempt to solve with AI.
The Trade-offs Are Real
Smaller models have lower capability ceilings. A 3B-parameter model won’t reason through complex multi-step problems the way a frontier model does. The gap closes for specific tasks — summarization, extraction, classification — but it doesn’t disappear.
Maintenance overhead increases. Local models need updates, hardware upgrades, and troubleshooting. Cloud providers handle this invisibly. Self-hosting means you own the full stack.
Context window management becomes your problem. Cloud providers abstract this away with retrieval-augmented generation or extended context windows. Running locally means you manage chunking, retrieval, and context overflow yourself.
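The simplest version of that chunking work is a sliding window with overlap, so no span of text is stranded at a boundary. A minimal sketch that approximates tokens with whitespace words; a real setup would count with the model's actual tokenizer, and the 512/64 budgets are illustrative:

```python
# Minimal sliding-window chunker for documents that exceed the context
# budget. Whitespace words stand in for tokens here; the window and
# overlap sizes are illustrative assumptions.

def chunk(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    words = text.split()
    step = max_tokens - overlap  # advance less than a full window
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), step)]
```

Retrieval then selects which chunks go into the prompt; overflow handling decides what gets dropped when even the selected chunks don't fit. Both are choices a cloud provider would otherwise make for you.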
When It Makes Sense
Local-first works when data sensitivity is high, when you need offline capability, or when usage volume would make cloud costs prohibitive. Development workflows with proprietary codebases fit this profile. Research workflows with sensitive documents fit it too.
The sweet spot is tasks that don’t require frontier model capability. Summarization, extraction, classification, code completion — these work well at 3B-8B parameters. The moment you need multi-step reasoning on novel problems, cloud models still win.
Most teams will end up using both. Local for privacy-sensitive, high-volume, latency-critical tasks. Cloud for capability-intensive tasks. The interesting question is how to build workflows that switch between them intelligently.
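A first cut at that switching logic can be a plain rule-based router: privacy always forces local, known small-model tasks stay local, everything else escalates. The task categories and the two-way split below are illustrative assumptions, not a prescribed taxonomy:

```python
# Hypothetical local/cloud router. Task names and routing rules are
# illustrative assumptions; a real system would classify tasks and
# measure local model quality per task.

LOCAL_TASKS = {"summarize", "extract", "classify", "complete_code"}

def route(task: str, contains_sensitive_data: bool) -> str:
    if contains_sensitive_data:
        return "local"   # sensitive data never leaves the machine
    if task in LOCAL_TASKS:
        return "local"   # 3B-8B models handle these well
    return "cloud"       # multi-step reasoning on novel problems

print(route("summarize", False))          # local
print(route("plan_architecture", True))   # local (privacy wins)
print(route("plan_architecture", False))  # cloud
```

The interesting engineering is in the escalation criteria: detecting when a local model's answer isn't good enough and the task should be retried in the cloud.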
What’s your current setup? Are you running models locally, or is everything cloud-based?