Wiki topic

Local Models

Last updated 2026-07-10

Summary

Mr. Nayak is following the gap between “local models are possible” and “local models feel finished.” The frustration isn’t with model quality — it’s with the fragmented tooling, missing polish, and configuration overhead that makes local inference still worse than it should be for practical coding workflows. Two pieces directly address this UX gap; a third offers the hardware angle. This week, Qwen 3.7 landed as a release event — the model release cadence continues to accelerate, adding capable options to the local model ecosystem. W23 adds a conceptually important new angle: skill distillation (Tom Tunguz) uses frontier models to write SKILL.md procedure files that smaller local models execute — a technique that makes local models viable for agentic workflows without requiring them to match frontier capability, because the procedure is externalized into the skill file rather than encoded in weights. W25 adds a manifesto-level argument: Ahmad Osman’s Opensource AI Must Win frames locally-deployable AI as civilizational infrastructure — operational freedom requires the ability to run, inspect, and modify intelligence systems without permission; the Fable/Mythos export ban (cataloged the same day) provides the precise case study. W26 adds a concrete benchmark: GLM-5.2 vs Claude Opus (3D WebGL game build) — Opus was faster and produced cleaner code; GLM-5.2 cost 4x less; the decisive argument for open weights in the arsenal isn’t capability parity, it’s that weights you download can’t be revoked. W27 provides the most practical “it works” signal yet: Migdal (Quesma) declares Qwen 3.6 27B the first local model that “makes sense as a general intelligence” — achieves 32 tok/s on an M5 Max with llama.cpp + MTP, integrates cleanly with OpenCode via a localhost OpenAI-compatible API, and outperforms the faster 35B A3B MoE variant on a coding task by following an instruction the MoE variant ignored; the llama.cpp + OpenCode stack is now a documented, complete local development workflow.

W28 adds Martin Alderson’s margin-collapse argument: the open-weight story is no longer only sovereignty or resilience. If GLM 5.2-class models are competitive for agentic work at 15–20% of closed-frontier API prices, inference margins become a strategic pressure point for the entire hosted-model business model.

Key Sources

W28 2026 · 04-Jul-26 → 10-Jul-26

GLM 5.2 and the coming AI margin collapse — Martin Alderson: GLM 5.2 is framed as the first open-weights model he would call a genuine competitor to Opus/GPT for agentic work at roughly 15–20% of the price; argues the real DeepSeek moment is inference-margin collapse, because training is fixed capex while inference marginal cost scales with demand (thought-leadership · #local-models, #open-weights, #glm, #ai-economics, #inference-costs)

W27 2026 · 27-Jun-26 → 03-Jul-26

Qwen 3.6 27B is the sweet spot for local development — Piotr Migdal (Quesma): hands-on benchmark and setup guide; Qwen 3.6 27B (dense) vs 35B A3B (MoE) on MacBook M5 Max 128GB with llama.cpp + MTP: 27B reaches 32 tok/s Q8, 35B A3B reaches 105 tok/s — yet 27B is recommended because it followed a Node package constraint the 35B A3B ignored; complete ~/.config/opencode/opencode.jsonc integration config shown; both variants fit in 48GB Apple Silicon shared RAM (Q8), 4-bit fits on 32GB; llama.cpp recommended over Ollama (on ethical grounds); at 32 tok/s, local inference is now within the typical frontier API performance band (engineering-blog · #local-models, #qwen, #llama-cpp, #benchmarks, #apple-silicon, #ai-coding)

W26 2026 · 20-Jun-26 → 26-Jun-26

GLM-5.2 vs Claude Opus — techstackups.com head-to-head: same prompt (build a 3D WebGL platformer from scratch, no game engine); Opus finished in 33m (vs GLM-5.2’s 70m), shipped cleaner and more correct code, and can check its own visual output; GLM-5.2 cost $5.39 vs ~$21.92 — a 4x cost reduction; GLM-5.2 lacks visual output capability (text-only); the decisive argument for keeping open weights in the arsenal: “a closed model can be retired or restricted with little warning (Fable was a recent reminder); weights you can download can’t be taken away” (news · #local-models, #model-comparison, #open-weights, #glm, #cost)

W25 2026 · 13-Jun-26 → 19-Jun-26

Opensource AI Must Win — Ahmad Osman: if AI becomes something people can only rent from closed institutions, the public loses operational freedom — the ability to run, inspect, modify, audit, teach, and preserve intelligence systems without asking permission; frames open-source AI as civilizational infrastructure; “America should not fall behind on the freedom to run, inspect, modify, benchmark, teach, and preserve intelligence infrastructure” — makes a national security argument for open weights; written the same day the Fable/Mythos export control directive was issued (thought-leadership · #local-models, #open-source-ai, #digital-sovereignty, #ai-policy)

W23 2026 · 30-May-26 → 05-Jun-26

Skill Distillation — Tom Tunguz: frontier models (Opus 4.7, GPT-5.1) author and refine SKILL.md procedure files; local models (Qwen 35B, Gemma 26B) execute them; the executing model is swappable quarterly by cost; the skill library is institutional knowledge that doesn’t live in weights; distinct from classical model distillation (no weight transfer), RAG (no fact retrieval), instruction tuning (no weight baking) — this distills procedures; evaluation is also automated by the frontier model; makes local models viable for complex agentic tasks (thought-leadership · #local-models, #agent-skills, #skill-distillation, #frontier-models, #ai-agents)

W22 2026 · 23-May-26 → 26-May-26

Local LLMs perform so much better when you teach them to ask before they answer — XDA Developers: local models lag behind cloud models and ambiguous prompts make it worse; the fix is a system prompt instructing the model to ask clarifying questions before attempting complex tasks; particularly valuable for local models where cold-start into a wrong direction wastes more context than it would on a hosted model; a practical UX workaround for local model quality gaps (engineering-blog · #local-models, #prompting, #llm-ux, #system-prompts)

W21 2026 · 16-May-26 → 22-May-26

Qwen 3.7 — Qwen 3.7 model release announcement; page was JavaScript-rendered and content unavailable; noted from title; Qwen continues to be a competitive option in the local/open-weight model space

W20 2026 · 09-May-26 → 15-May-26

Pushing Local Models With Focus And Polish — Armin Ronacher (Flask/Jinja creator); diagnoses the end-to-end UX gap: tool parameter streaming broken, JSON config scattered, quantization choices that silently degrade output quality; core argument: “runnable is not finished”
Running local models on an M4 with 24GB memory — hands-on experience on Apple Silicon; practical notes on what works, what doesn’t, memory constraints, and model selection on high-end consumer hardware

Open Questions / Tensions

Open weights as price discipline: Prior entries emphasized revocability and sovereignty. GLM 5.2 adds a market-structure angle: capable open weights may cap closed-provider inference margins even when hosted models remain better in absolute quality.
Ecosystem fragmentation: Inference engines, quantization formats, context configs, and tool protocol support are all separate choices that compound. The “boring operation” of using a hosted API (paste key → done) has no local equivalent yet.
Quality vs. availability: Local models protect privacy and enable experimentation for developers who can’t lock everything into hosted APIs. The question is how long the UX gap persists as the ecosystem matures.
Hardware access: The M4 24GB report shows even high-end consumer hardware has meaningful limits. The release cadence (Qwen 3.7 this week) keeps adding new capable models, but the tooling has to keep up.
Skill distillation as a local model enabler: If procedures are externalized into SKILL.md files authored by frontier models, local models don’t need to be as capable as frontiers to perform complex tasks — they just need to follow structured procedures reliably. This changes the calculus for local deployment: the bottleneck shifts from raw capability to procedure quality and instruction-following, where smaller models are increasingly competitive.
Open weights as resilience, not just cost: The Fable ban provides an empirical case for a theoretical concern. The techstackups.com comparison makes this practical: a 4x cost reduction with acceptable quality is available today for open weights — and closed models carry geopolitical risk that open weights don’t. The question is when cost+quality cross such that open weights become the default choice, not just a hedge.
The sovereignty argument: Osman’s manifesto frames open-source AI as a national capacity question, not just a developer preference. If the government can revoke access to frontier models for any foreign national overnight, the implicit assumption that cloud-hosted frontier models are reliable infrastructure is wrong. The Fable case turned the sovereignty argument from theoretical to practical.
Instruction-following vs. raw throughput: The 27B dense vs. 35B A3B MoE finding (Migdal, W27) surfaces a non-obvious tradeoff: the MoE variant is 3x faster but failed a coding task by ignoring an instruction constraint. For agentic and coding workloads, instruction-following reliability may matter more than tokens-per-second — the choice of variant must be task-aware, not just benchmark-aware.
“Runnable is not finished” revisited: Ronacher’s W20 concern about ecosystem fragmentation has a 2026 answer in the llama.cpp + OpenCode stack (Migdal, W27). Setup still requires CLI comfort, but a reproducible, complete integration pattern now exists — localhost server, OpenAI-compatible API, published config snippets, and a real coding task validated end-to-end.