Local Agentic Coding
Local Agentic Coding is agentic software engineering on open models running on your own hardware: an agent that reads, edits, tests, and ships code, where the model weights, the context, and the transcript never leave your machine.
SuperQode is built to be the first choice for it. Cloud-first harnesses treat local models as a degraded fallback. SuperQode treats them as the design center, and treats the cloud as the thing you connect to when you choose to.
Why local, and why now
Three things changed:
- Open models got serious. Current open-weight coders solve a majority of real software engineering benchmark tasks on hardware a developer can own. MoE architectures deliver large-model quality at small-model decode cost.
- Local serving got fast. MLX on Apple Silicon, continuous batching, prefix caching, and accelerator-aware runtimes turned local inference from a demo into a daily driver.
- The privacy and cost math flipped. An agent loop resends a growing context every turn. On metered APIs that cost compounds; on your own hardware the marginal cost per token is close to zero, and the code never leaves the building.
What did not change: local models are less forgiving than frontier APIs. They have smaller loaded context windows, weaker tool-calling heads, tighter prompt budgets, and wildly different sweet spots per family. A harness that ignores this produces a bad agent. SuperQode is engineered for exactly these realities.
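To make the cost argument above concrete, here is a minimal arithmetic sketch in Python. It assumes the full context is resent on every turn and ignores any prompt caching; the token figures are illustrative assumptions, not measurements from SuperQode or any provider.

```python
def cumulative_prompt_tokens(base_tokens: int, growth_per_turn: int, turns: int) -> int:
    """Total prompt tokens sent over a run when the full context is resent each turn.

    Turn i (0-indexed) sends base_tokens + i * growth_per_turn tokens.
    """
    return sum(base_tokens + i * growth_per_turn for i in range(turns))


def closed_form(base_tokens: int, growth_per_turn: int, turns: int) -> int:
    """The same total as an arithmetic series: n*b + g*n*(n-1)/2."""
    return turns * base_tokens + growth_per_turn * turns * (turns - 1) // 2


if __name__ == "__main__":
    # Hypothetical run: 4,000-token starting prompt, 1,500 tokens added per turn.
    for n in (10, 20, 40):
        print(n, cumulative_prompt_tokens(4_000, 1_500, n))
```

Under these assumptions the total grows roughly with the square of the turn count: doubling a run from 20 to 40 turns raises the resent tokens from 365,000 to 1,330,000. That quadratic term is what a metered API bills and what local hardware absorbs.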
From zero to a tuned local agent
One command starts the guided MVP path, and a second launches the agent with the generated harness:
superqode local init --repo .
superqode --harness superqode.local.yaml
local init runs the Local Stack Doctor, generates superqode.local.yaml, and, when a local server is running, runs the same non-destructive readiness probe as the superqode local smoke --repo . command.
Use the individual pieces when you want more control:
superqode local doctor --repo .
superqode local serve ollama
superqode local models
superqode local warm ollama --model qwen3:8b
superqode local smoke --repo .
superqode local doctor --generate harness.yaml
superqode --harness harness.yaml -p "your task"
The generated harness routes to the right provider for where the model lives, references the matching model policy pack, and switches small machines to prompt-based tool calling. The smoke test verifies server reachability, chat model availability, context-window detection, TTFT, read-file tool calls, patch-format behavior, shell tool calls, and long-context recall.
You own this harness. Build a custom one by answering a few questions with superqode harness wizard, start from a model-family template such as qwen-coding or glm-coding, and read exactly what any harness does in plain English with superqode harness explain --spec harness.yaml. The Bring Your Own Harness guide walks the whole flow.
To browse model labs before downloading weights, use the models.dev-backed Labs view:
superqode local labs
superqode local labs zhipuai
superqode local labs alibaba --refresh
The Labs view highlights local-friendly families such as GLM, Qwen, Gemma, DeepSeek, and Devstral, then shows open-weight, tool-capable, long-context candidates with Hugging Face download hints where models.dev provides them. SuperQode's own recommendations are intentionally narrower than generic model search: they come from curated models.dev Labs or vetted community namespaces such as mlx-community, so random Hub results do not become default guidance.
What SuperQode does differently for local models
Every layer of the harness has a local-first answer:
| Reality of local models | SuperQode's answer |
|---|---|
| The loaded context window is smaller than the model card says | Live window detection from the running server, adaptive compaction sized to it |
| Model families need different prompts, temperatures, and formats | Model policy packs: tuned defaults per family, user-overridable |
| Many models have no reliable native tool head | tool_call_format: prompt renders tools into the prompt and parses calls from text |
| Tool schemas eat the prompt budget | Deferred tools: heavy schemas hidden until the model activates them via tool_search |
| Keyword search misses conceptual matches | Optional semantic code search via superqode[semantic], backed by a local CocoIndex daemon and local Ollama embeddings |
| A simple "hello" should not feel like a repo-sized agent run | Local fast-chat path: obvious greetings/basic non-code questions skip coding history, reminders, context probing, and tool schemas |
| Small models loop and emit malformed calls | Doom-loop guard, tool-argument repair, and dangling tool-call repair in the agent loop |
| Engine choice is hardware-dependent and changes monthly | The recommendation matrix ships as updatable data with user overrides |
| Model discovery should not depend on stale README lists | superqode local labs reads models.dev Labs metadata and points to Hugging Face artifacts |
| Speed claims are unverifiable | superqode local warm preloads one model and reports first-token latency; superqode local bench measures TTFT and decode rate; --agentic also scores tool-call, edit-format, shell-call, and context-recall probes |
| Even utility calls cost main-model time | Utility routing: grading and memory extraction on a small local model or the free on-device Apple model |
| Long runs chain you to the terminal | Chat channels: superqode daemon relays approvals and steering to Telegram, Slack, or Discord |
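Two rows of the table, prompt-based tool calling and the loop and repair guards, are easier to picture with code. The sketch below is not SuperQode's implementation: the `<tool_call>` wire format, the function names, and the repeat limit are all hypothetical, chosen only to illustrate rendering tools into the prompt, parsing calls back out of text, repairing one common kind of malformed JSON, and flagging a repeated call.

```python
from __future__ import annotations

import json
import re
from collections import deque

# Hypothetical wire format: the model is asked to emit
#   <tool_call>{"name": "...", "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def render_tools(tools: list[dict]) -> str:
    """Render tool specs as plain text for the system prompt."""
    lines = ["Call a tool by emitting <tool_call>{JSON}</tool_call>. Tools:"]
    for tool in tools:
        lines.append(f"- {tool['name']}: {tool['description']}")
    return "\n".join(lines)


def parse_tool_call(text: str) -> dict | None:
    """Extract the first tool call from model text, with a minimal repair pass."""
    match = TOOL_CALL_RE.search(text)
    if match is None:
        return None
    raw = match.group(1).strip()
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        # Minimal repair: drop trailing commas before a closing brace or bracket.
        # This is a heuristic and can misfire on string values containing ",}".
        repaired = re.sub(r",\s*([}\]])", r"\1", raw)
        try:
            call = json.loads(repaired)
        except json.JSONDecodeError:
            return None
    if not isinstance(call, dict) or "name" not in call:
        return None
    call.setdefault("arguments", {})
    return call


class DoomLoopGuard:
    """Flag a run when the same call repeats `limit` times in a row."""

    def __init__(self, limit: int = 3) -> None:
        self.recent = deque(maxlen=limit)

    def is_looping(self, call: dict) -> bool:
        # Canonicalize so key order does not hide a repeated call.
        self.recent.append(json.dumps(call, sort_keys=True))
        return len(self.recent) == self.recent.maxlen and len(set(self.recent)) == 1
```

A harness loop would call `parse_tool_call` on each model turn and consult the guard before executing the result. When the guard trips, a reasonable response is to inject a corrective message rather than run the same call again.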
The local engines SuperQode speaks
| Engine | Best on | SuperQode integration |
|---|---|---|
| Ollama | everywhere, easiest start | provider, window detection, MLX-runtime detection, keep-alive shaping |
| MLX (mlx_lm.server) | Apple Silicon, fastest path | superqode providers mlx server, HF cache inventory |
| LM Studio | desktop, GUI management | provider, loaded-window detection, model inventory |
| llama.cpp | CPU and constrained hardware | OpenAI-compatible provider, window detection |
| vLLM | NVIDIA, throughput | provider, max_model_len detection |
| SGLang | NVIDIA, agentic pipelines | OpenAI-compatible provider, matrix-ranked on CUDA tiers |
| DS4 | DeepSeek V4 Flash | dedicated provider, KV-cache guidance, thinking modes |
All of them are detected by the doctor, benchable with superqode local bench, and usable from the same harness contract.
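Window detection appears in most rows above because the window the server actually loaded, not the figure on the model card, sets the budget for compaction. The following is a rough sketch of that relationship, not SuperQode's algorithm: the function names, the 10% safety margin, and the output reserve are assumptions, and it simply drops the oldest turns where a real harness would more likely summarize them.

```python
def compaction_budget(window_tokens: int, reserve_for_output: int = 1024,
                      safety: float = 0.9) -> int:
    """Tokens available for prompt history after output space and a margin are set aside."""
    return max(0, int(window_tokens * safety) - reserve_for_output)


def compact(messages: list[tuple[str, int]], budget: int) -> list[tuple[str, int]]:
    """Keep the system message plus the newest turns that fit within `budget`.

    `messages` is a list of (text, token_count) pairs; index 0 is the system prompt.
    """
    system, rest = messages[0], messages[1:]
    kept = []
    used = system[1]
    # Walk from newest to oldest and stop at the first turn that does not fit.
    for msg in reversed(rest):
        if used + msg[1] > budget:
            break
        kept.append(msg)
        used += msg[1]
    return [system] + kept[::-1]


if __name__ == "__main__":
    history = [("system", 500), ("turn 1", 3000), ("turn 2", 2500), ("turn 3", 3000)]
    # Hypothetical detected window of 8,192 tokens.
    print(compact(history, compaction_budget(8192)))
```

With a detected window of 8,192 tokens this keeps the system prompt and the two newest turns. If the budget were computed from a much larger advertised window instead, all four messages would be sent and the server would truncate or reject the request.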
Fast first response
Local inference has two different first-response costs:
- The model server may need to page weights into memory on the first request.
- A coding-agent turn must prefill the system prompt, tool schemas, restored session history, and any active repository context.
SuperQode avoids paying the second cost for obvious chat prompts. For local providers other than DS4, greetings and basic non-code questions such as hello or what is 2+2? use a fast-chat path: no tool schemas, no restored coding history, no reminders, and no live context-window probe. Real coding requests still use the full harness.
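The routing decision described above can be pictured as a small, conservative classifier. The sketch below is illustrative only: the patterns, the hint words, and the provider identifier `"ds4"` are assumptions, and SuperQode's actual rules may differ.

```python
import re

# Hypothetical patterns for "obvious" chat prompts.
GREETING_RE = re.compile(r"^(hi|hello|hey|thanks|thank you)[\s!.?]*$", re.IGNORECASE)
ARITHMETIC_RE = re.compile(r"^what is [\d\s+\-*/().]+\??$", re.IGNORECASE)

# Words that suggest a real coding request; any hit forces the full harness.
CODE_HINTS = ("file", "function", "bug", "test", "repo", "refactor", "error")


def use_fast_chat(prompt: str, provider: str) -> bool:
    """Return True only when the prompt is clearly small talk or trivial arithmetic."""
    text = prompt.strip()
    if provider == "ds4":  # the text above excludes DS4 from the fast path
        return False
    if any(hint in text.lower() for hint in CODE_HINTS):
        return False
    return bool(GREETING_RE.match(text) or ARITHMETIC_RE.match(text))
```

The design choice worth noting is the direction of the default: anything the classifier does not recognize falls through to the full harness, so a misclassification costs latency rather than correctness.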
Warm the model before a coding session when you want to remove cold-load from the first real prompt:
superqode local warm ollama --model qwen3:8b
superqode local warm lmstudio
superqode local warm mlx --model mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit
local warm sends one tiny streamed request and reports time to first token (TTFT), decode speed, and total time. If TTFT stays high after warming, reduce the loaded context (--ctx / num_ctx) or use a smaller quantized model for interactive coding.
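For readers who want to see how the three reported numbers relate, here is a minimal sketch of measuring them from any streamed response. It is not the code behind local warm: the function name is hypothetical, and it treats each streamed chunk as one token, which is only an approximation.

```python
import time
from typing import Callable, Iterable


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a token stream and report TTFT, decode rate, and total time in seconds."""
    start = clock()
    first = None
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now  # arrival of the first chunk marks the end of load + prefill
        count += 1
    end = clock()
    if first is None:
        return {"ttft_s": None, "tokens_per_s": 0.0, "total_s": end - start}
    decode_time = end - first
    # Decode rate excludes the first token, whose latency is dominated by prefill.
    rate = (count - 1) / decode_time if decode_time > 0 else 0.0
    return {"ttft_s": first - start, "tokens_per_s": rate, "total_s": end - start}
```

Passing a lazy generator that wraps a streaming HTTP response makes the request itself count toward TTFT. Separating TTFT from decode rate is what makes the advice above actionable: a high TTFT with a healthy decode rate points at context size or cold load, not at the model's generation speed.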
Local does not mean isolated
Local Agentic Coding is the center, not a wall. The same harness connects outward when you want it to: BYOK keys for hosted frontier models, ACP for editor agents, agent SDK runtimes, MCP for tools, and A2A for agent-to-agent workflows. Your harness spec stays identical; only the route changes. That is the point of a portable harness: local first, connected to everything.
Start here
- Install SuperQode and run superqode local init --repo .
- Follow any smoke-test next steps: start the server, load a chat model, or reduce context if TTFT is high.
- Run superqode --harness superqode.local.yaml and start coding.
- When you want numbers, run superqode local bench. When you want deeper agent-readiness numbers, run superqode local bench --agentic. When you have multiple local/open candidates, run superqode local optimize to generate a role-routed harness. For deeper control, read the Local Stack Doctor guide and the harness system.