By Victor Purcallas Marchesi

Local Models Are Not Frontier. They Are Enough.

On April 22, 2026, two labs released frontier-adjacent open-weight models. Alibaba dropped Qwen 3.6-27B, a dense 27-billion-parameter model that beats a 397B mixture-of-experts on agentic coding benchmarks. Hours later, DeepSeek released V4 under MIT license. Simon Willison put it plainly: "almost on the frontier, a fraction of the price."

That was a Wednesday. By the following Tuesday, NVIDIA had shipped Nemotron 3 Nano Omni, a 30B multimodal model that runs in 25GB of RAM at 4-bit quantization and handles vision, audio, and text in a single architecture. The day after that, IBM published Granite 4.1. Google added Multi-Token Prediction to Gemma 4: a small four-layer drafter that runs ahead of the target model so on-device generation is faster.
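That Gemma drafter is worth a moment, because the trick generalizes. Here is a minimal sketch of the draft-then-verify loop behind this kind of speedup, in its simplest greedy-acceptance form; the `draft_model` and `target_model` objects and their `greedy_next` and `greedy_batch` methods are hypothetical stand-ins for whatever inference runtime you use, not Gemma's actual API.

```python
def speculative_step(target_model, draft_model, tokens, k=4):
    """One draft-then-verify step. Assumed (hypothetical) interface:
    greedy_next(toks) -> next token id from one forward pass;
    greedy_batch(toks, draft) -> the target's own greedy pick at each
    drafted position, all scored in a single forward pass."""
    # 1. The small drafter guesses k tokens autoregressively. This is
    #    cheap because the drafter is tiny (four layers, in Gemma's case).
    drafted, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_model.greedy_next(ctx)
        drafted.append(t)
        ctx.append(t)

    # 2. The big target model checks all k guesses in one forward pass
    #    instead of k sequential ones. This is where the speedup lives.
    verified = target_model.greedy_batch(tokens, drafted)

    # 3. Keep the longest agreeing prefix. The first disagreement is
    #    replaced by the target's own token, so the output is exactly
    #    what the target alone would have generated.
    out = list(tokens)
    for guess, truth in zip(drafted, verified):
        out.append(truth)
        if guess != truth:
            break
    return out
```

When the drafter agrees with the target most of the time, each expensive forward pass yields several tokens instead of one, and quality does not degrade, because every accepted token is one the target itself would have picked.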

Five labs. Ten days. Not coordinated. Convergent.

The number that matters

The clearest measure of the shift is not a benchmark; it is adoption. In Q1 2023, Ollama did 100,000 monthly downloads. By Q1 2026, that number was 52 million, a 520x jump in three years. HuggingFace, which hosted around 200 GGUF-formatted models for local inference at the start of 2023, hosts 135,000 today.

The questions people type into Google have shifted in lockstep. In 2023 the search was "can I run an LLM locally?" In 2024 it became "which model should I run?" By now it is "how do I build a real system with this?"

That is the entire shift. Not a benchmark, not a partnership. The way the question itself changed.

Intelligence density

Andrej Karpathy put it well in January: "you are not optimizing for a single specific model but for a family of models controlled by a single dial." The dial is compute. The output gets better as you turn it up. This is a more useful framing than "small versus frontier" because it acknowledges the spectrum is continuous and you get to choose where on it you operate.

I have been running local models for a while, and they still surprise me.

Phi-4-reasoning is a 14-billion-parameter model that runs on a laptop. On AIME 2025 math problems, it scores 78%, within ten points of DeepSeek-R1, a model fifty times its size.

Five years ago, that sentence would have been delusional.

The capability ceiling is still in the cloud. Frontier models keep doing things smaller open-weight models cannot match: long-horizon reasoning, agentic chains where one wrong step compounds for an hour, the largest context windows. But the gap stopped being a single line you either crossed or did not. It is a gradient now, and where you sit on it is a choice.

Enough for what?

Simon Willison has been running local models on his MacBook in public for over a year, and his arc is worth tracing.

By December 2024 he was running Llama 3.3 70B on a 64GB MacBook Pro and felt, for the first time, that he had GPT-4-class capability on his own machine. Then April 2026 brought DeepSeek V4 and Qwen 3.6, and the price-to-capability ratio dropped by another order of magnitude.

The dial keeps turning.

Willison is also clear about what local cannot do, at least not yet. Hosted frontier models are still his daily driver, because coding agents like Claude Code need more than a smart model. They need reliable tool calls, and he has not yet found a local model that handles Bash tool calls dependably enough to trust it to operate an agent on his own machine. His next laptop will have at least 128GB of RAM, in the hope that one of the 2026 open-weight models will close that last gap.
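To make "dependably enough" concrete, here is a minimal sketch of the gate a Bash tool call has to clear before an agent can act on it. The JSON shape and the `bash` tool name are illustrative assumptions, not any particular agent's protocol; the point is that a single malformed or hallucinated call stalls the loop.

```python
import json
import subprocess

ALLOWED_TOOLS = {"bash"}  # hypothetical single-tool agent

def try_execute_tool_call(raw_model_output: str):
    """Parse a model's tool call; refuse to run anything malformed."""
    try:
        call = json.loads(raw_model_output)
    except json.JSONDecodeError:
        return None  # malformed JSON: the agent loop stalls right here

    if not isinstance(call, dict) or call.get("tool") not in ALLOWED_TOOLS:
        return None  # hallucinated tool name, another common failure

    args = call.get("arguments")
    if not isinstance(args, dict) or not isinstance(args.get("command"), str):
        return None  # right tool, wrong argument shape

    # Only a well-formed, validated call ever reaches the shell.
    result = subprocess.run(
        ["bash", "-c", args["command"]],
        capture_output=True, text=True, timeout=60,
    )
    return result.stdout
```

An agent loops through a gate like this hundreds of times per session, so even a small failure rate compounds into exactly the wasted hours Willison describes.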

That is the honest version of the story. The floor moved enormously and is still moving. The frontier moved too. For some kinds of work, the kind where one hallucinated tool call wastes an hour, frontier still wins. For most other work, the dial setting on a laptop is already where it needs to be.

What does "most other work" look like in practice?

A startup with twenty engineers ships a feature that needs an LLM in the loop. They run the per-token cost against their projected scale, the numbers do not work, and they end up running Qwen on their own GPUs. The product ships.
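The arithmetic behind that decision is short enough to show. Every figure below is a made-up placeholder rather than any provider's real pricing; the shape of the comparison is the point.

```python
# Hypothetical inputs: projected volume and placeholder prices.
tokens_per_month = 5_000_000_000     # projected scale: 5B tokens/month
api_price_per_mtok = 3.00            # hosted API, $ per 1M tokens (made up)
gpu_monthly_cost = 2 * 1_800.00      # two rented GPUs (made-up rate)

api_bill = tokens_per_month / 1_000_000 * api_price_per_mtok
print(f"hosted API: ${api_bill:,.0f}/month")          # $15,000/month
print(f"own GPUs:   ${gpu_monthly_cost:,.0f}/month")  # $3,600/month
```

The hosted line scales with every token; the self-hosted line is flat. Past some crossover volume the decision makes itself, which is how teams like this one end up on their own GPUs.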

A hospital wants automated summaries of patient consultations, but the legal team will not approve sending transcripts to any cloud API. They run Granite in a private network and the doctors get their summaries.

A team in a country that half the cloud providers no longer serve runs the model on its own hardware. The work continues.

None of those situations needs frontier. They need enough.

What runs on your machine

The interesting thing about the April rush is who showed up. NVIDIA, a company whose revenue overwhelmingly comes from selling data-center GPUs, released a 30B model that fits on a workstation. IBM, an enterprise vendor that profits from managed services, shipped models small enough to run in a browser tab. Alibaba published Qwen as dense open weights anyone can fork. DeepSeek released V4 under MIT and almost closed the gap with the closed frontier. Google made Gemma faster on laptops.

They are not doing this out of generosity. They are doing it because the demand is real, the hardware caught up, and the open-weight competition does not stop. Every release pushes the others.

The future of AI is not just what runs in someone else's data center. It is also what you can run on your machine, today, with no API key, no telemetry, no permission asked.

The frontier keeps moving. The floor moves faster. For most of the work most people do, the floor is already enough.

The next ten days are going to look like the last ten.