Local Context & Compaction¶
SuperQode is tuned to get the best out of local models (≈10B–120B), where the single biggest failure mode is running out of context. It handles this automatically: it detects each model's real loaded context window and compacts the conversation before it overflows, with no configuration required.
Why this matters for local models¶
A local model may advertise "supports 128K context" on its model card, but the server may have loaded it with a much smaller window (e.g. Ollama num_ctx=4096). The loaded window is the only number that counts. On a large model (120B) there is less VRAM left for the KV cache, so the practical window is often smaller than on a 10B model. Overflowing it produces garbage output or hard errors.
SuperQode reads the loaded window directly from the server and sizes everything to it.
Automatic: detection + adaptive compaction¶
On the first coding-agent message of a session, SuperQode:
- Detects the loaded window from the live server (per backend):

  | Backend | Endpoint | Field |
  |---|---|---|
  | Ollama | GET /api/ps | context_length (loaded num_ctx) |
  | llama.cpp | GET /props | n_ctx |
  | LM Studio | GET /api/v1/models | loaded_context_length |
  | vLLM / DS4 / OpenAI-compatible | GET /v1/models | max_model_len / context_length |

  Server URLs come from each provider's env override (OLLAMA_HOST, LMSTUDIO_HOST, DS4_HOST, ...) or its default port. If the window can't be read, SuperQode stays conservative (8K) rather than risk an overflow; it never assumes the model-card maximum for a local model.
- Compacts adaptively as the conversation grows. Compaction triggers at window − reserve and keeps a token-budgeted tail of recent turns, replacing older turns with a structured summary. Both the threshold and the kept-recent budget scale to the model's window, so a 4K model and a 200K model each behave sensibly.
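The detection step can be illustrated with a small sketch. It is not SuperQode's implementation: it only shows how the per-backend field names listed above could be pulled out of an already-parsed JSON response, falling back to the conservative 8K value when nothing usable is found. The function names, the backend keys, and the recursive search (used here to avoid assuming each server's exact response layout) are all hypothetical.

```python
from typing import Any, Optional, Tuple

# Field names per backend, taken from the detection table in this section.
# The dictionary keys are hypothetical labels chosen for this sketch.
WINDOW_FIELDS = {
    "ollama": ("context_length",),
    "llama.cpp": ("n_ctx",),
    "lmstudio": ("loaded_context_length",),
    "openai-compatible": ("max_model_len", "context_length"),
}

FALLBACK_WINDOW = 8192  # conservative value used when detection fails


def find_field(payload: Any, names: Tuple[str, ...]) -> Optional[int]:
    """Depth-first search for the first positive integer stored under any of `names`."""
    if isinstance(payload, dict):
        for name in names:
            value = payload.get(name)
            if isinstance(value, int) and not isinstance(value, bool) and value > 0:
                return value
        children = list(payload.values())
    elif isinstance(payload, list):
        children = payload
    else:
        return None
    for child in children:
        found = find_field(child, names)
        if found is not None:
            return found
    return None


def detect_window(backend: str, payload: Any) -> Tuple[int, str]:
    """Return (window, source) for a parsed JSON response from `backend`."""
    names = WINDOW_FIELDS.get(backend, ())
    window = find_field(payload, names) if names else None
    if window is None:
        return FALLBACK_WINDOW, "local-fallback"
    return window, "loaded"
```

Searching by field name keeps the sketch independent of how deeply each server nests the value. A real client would more likely read a known path and match the specific model that is loaded.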
This runs for local and BYOK models, on both streaming and non-streaming paths.
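To make the compaction rule concrete, here is a minimal sketch of "trigger at window minus reserve, keep a token-budgeted tail, summarize the rest". Everything in it is an assumption made for illustration: the `Turn` type, the four-characters-per-token estimate, and the `summarize` callback are hypothetical and do not describe SuperQode's internals.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    role: str
    text: str


def estimate_tokens(text: str) -> int:
    # Crude heuristic (about 4 characters per token); a real tokenizer would differ.
    return max(1, len(text) // 4)


def compact(
    turns: List[Turn],
    window: int,
    reserve: int,
    keep_recent: int,
    summarize: Callable[[List[Turn]], str],
) -> List[Turn]:
    """Replace older turns with one summary turn once usage reaches window - reserve."""
    total = sum(estimate_tokens(t.text) for t in turns)
    if total < window - reserve:
        return turns  # still under the threshold, nothing to do

    # Walk backwards, keeping recent turns until the keep_recent budget is spent.
    kept: List[Turn] = []
    budget = keep_recent
    for turn in reversed(turns):
        cost = estimate_tokens(turn.text)
        if cost > budget and kept:  # always keep at least the latest turn
            break
        kept.append(turn)
        budget -= cost
    kept.reverse()

    older = turns[: len(turns) - len(kept)]
    if not older:
        return turns  # everything fits in the tail, so there is nothing to summarize
    return [Turn("system", summarize(older))] + kept
```

Because `reserve` and `keep_recent` are passed in, the same function works for a 4K window and a 200K window once those two values are scaled to the window.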
Simple local chat skips context work¶
For local providers other than DS4, obvious chat prompts such as hello, hi, or basic non-code questions use a fast-chat path. SuperQode sends a tiny direct request and skips the expensive coding-agent scaffolding for that turn:
- no live context-window probe
- no restored session history
- no reminder messages
- no tool schemas
This is intentionally narrow. Any prompt that mentions files, code, the repo, or a concrete development task uses the normal coding harness and context management. DS4 is excluded because it benefits from a stable rendered prefix for KV-cache reuse.
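A routing decision like this could be sketched as follows. The keyword list, the length cap, and the function name are hypothetical and chosen only to illustrate "narrow by default"; the actual heuristics SuperQode uses are not documented here.

```python
import re

# Hypothetical hints that a prompt is development work rather than small talk.
DEV_HINTS = re.compile(
    r"\b(file|files|code|repo|repository|function|class|bug|test|tests|"
    r"refactor|implement|fix|build|commit)\b"
    r"|\w+\.(py|js|ts|go|rs|java|md|json|toml|yaml)\b"
    r"|`",
    re.IGNORECASE,
)


def use_fast_chat(provider: str, prompt: str) -> bool:
    """Return True only for short prompts with no sign of a coding task."""
    if provider.lower() == "ds4":
        return False  # DS4 is excluded so its rendered prefix stays stable
    text = prompt.strip()
    if not text or len(text) > 200:  # assumption: long prompts go to the full harness
        return False
    return DEV_HINTS.search(text) is None
```

The sketch errs toward the full coding harness: any doubt about the prompt means the fast path is skipped, which matches the "intentionally narrow" behaviour described above.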
Inspect and override: :context¶
:context # show detected window, source, and compaction budgets
:context 8192 # pin the window (also accepts 16k)
:context auto # clear the override and re-detect from the server
Example:
🪟 Context window
Window: 16,384 tokens (loaded (/api/ps))
Compact at: 13,108 tokens
Keep recent: ~6,553 tokens
Auto-compact: ON
The source tells you where the number came from: loaded (<endpoint>), configured (you pinned it), local-fallback (couldn't detect, so the conservative default applies), or model-info (BYOK catalog). The live status-bar meter shows fill % against this window as you work.
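The argument forms accepted by :context (a plain number, a k-suffixed value, or auto) and the budgets in the sample output can be reproduced with a short sketch. Two things here are assumptions rather than documented behaviour: that the k suffix means ×1024, and that the budgets are one fifth of the window for the reserve and two fifths for the kept-recent tail. Those fractions are inferred from the single 16,384-token example above and may not hold for other window sizes.

```python
from typing import Optional, Tuple


def parse_window_arg(arg: str) -> Optional[int]:
    """Return None for 'auto', otherwise the pinned window in tokens."""
    text = arg.strip().lower()
    if text == "auto":
        return None
    multiplier = 1
    if text.endswith("k"):
        multiplier = 1024  # assumption: 16k means 16 * 1024
        text = text[:-1]
    if not text.isdigit() or int(text) == 0:
        raise ValueError(f"expected a positive token count, got {arg!r}")
    return int(text) * multiplier


def budgets(window: int) -> Tuple[int, int]:
    """Return (compact_at, keep_recent) under the fractions assumed above."""
    reserve = window // 5          # assumed reserve: one fifth of the window
    compact_at = window - reserve  # compaction triggers at window - reserve
    keep_recent = window * 2 // 5  # assumed tail budget: two fifths of the window
    return compact_at, keep_recent
```

With these assumptions, `budgets(16384)` gives 13,108 and 6,553, matching the "Compact at" and "Keep recent" lines in the example.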
Environment variables¶
| Variable | Effect |
|---|---|
| SUPERQODE_AUTO_COMPACT=0 | Disable adaptive auto-compaction (on by default) |
| OLLAMA_HOST / LMSTUDIO_HOST / DS4_HOST / ... | Where to probe for the loaded window |
You can also set the window per session in code via AgentConfig.context_window (0 = auto-detect), with compaction_reserve_tokens and keep_recent_tokens for fine control (0 = auto).
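The sketch below shows one way these settings could be resolved. `AgentConfigSketch` is a hypothetical stand-in that only mirrors the three field names mentioned above; it is not the real AgentConfig class, whose import path and other fields are not shown in this document. The handling of SUPERQODE_AUTO_COMPACT (only the literal value 0 disables it) and of a scheme-less OLLAMA_HOST are also assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass
class AgentConfigSketch:
    """Hypothetical stand-in mirroring the three fields named above (0 = auto)."""
    context_window: int = 0
    compaction_reserve_tokens: int = 0
    keep_recent_tokens: int = 0


def auto_compact_enabled(env: Mapping[str, str] = os.environ) -> bool:
    # Assumption: only the literal value "0" turns compaction off.
    return env.get("SUPERQODE_AUTO_COMPACT", "1").strip() != "0"


def ollama_base_url(env: Mapping[str, str] = os.environ) -> str:
    # Ollama listens on 127.0.0.1:11434 by default; OLLAMA_HOST may omit the scheme.
    host = env.get("OLLAMA_HOST", "").strip() or "127.0.0.1:11434"
    if "://" not in host:
        host = "http://" + host
    return host.rstrip("/")


def effective_window(cfg: AgentConfigSketch, detected: int) -> Tuple[int, str]:
    """A pinned window wins; 0 means use whatever was detected."""
    if cfg.context_window > 0:
        return cfg.context_window, "configured"
    return detected, "loaded"
```

For example, `effective_window(AgentConfigSketch(context_window=8192), 16384)` returns the pinned 8192 with source "configured", while the default config passes the detected value through.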
Choosing a good num_ctx¶
If you're picking a context size when loading a local model:
- 8K–16K is the sweet spot for most local coding models: enough for real work, small enough to stay fast and fit in VRAM.
- Going larger only helps if the machine has the VRAM. If the KV cache spills to CPU RAM, inference speed can drop by 20–50× (e.g. 50–100 tok/s → 2–5 tok/s).
- K/V cache quantization (q8/q4) lets you fit a larger window in the same VRAM.
Whatever you choose, SuperQode detects it and adapts; you don't have to tell it.
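To see why the window size and K/V quantization matter for VRAM, the KV cache can be estimated from the standard formula: two tensors (K and V) per layer, each storing n_kv_heads × head_dim values per token. The sketch below applies it to illustrative dimensions resembling an 8B Llama-style model with grouped-query attention (32 layers, 8 KV heads, head dimension 128). Those dimensions are an assumption for the example, and the q8 figure treats each element as roughly one byte, ignoring the small overhead of quantization block scales.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    n_ctx: int,
    bytes_per_element: float = 2.0,
) -> int:
    """Approximate KV cache size: K and V tensors for every layer and token."""
    return int(2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_element)


# Illustrative model dimensions (assumed, not read from any specific server).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for n_ctx in (8192, 16384, 32768):
    fp16 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, n_ctx, 2.0)
    q8 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, n_ctx, 1.0)
    print(f"{n_ctx:>6} tokens: fp16 {fp16 / 2**30:.1f} GiB, q8 about {q8 / 2**30:.1f} GiB")
```

Under these assumptions the cache costs 128 KiB per token at fp16, so 8K tokens need about 1 GiB and every doubling of num_ctx doubles that. Halving the bytes per element with q8 lets roughly twice the window fit in the same memory.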
Controlling what local models show: :thinking¶
Local models can be noisy. The thinking-log detail is a three-way toggle (Ctrl+T cycles, or use the command):
:thinking # show current detail + how to change it
:thinking normal # default: iterations fold into a live status, reasoning trimmed
:thinking verbose # full per-iteration reasoning + tool detail
:thinking off # only tool calls and the final answer
In normal mode the agent loop's bookkeeping and raw reasoning are folded into a single live throbber with a tidy per-tool trace: calm by default, full detail on demand.