Local Providers¶
Run AI models locally for privacy, cost savings, and offline use.
Overview¶
Local providers offer:
- Privacy: Data stays on your machine
- Cost savings: No API fees
- Offline use: Work without internet
- Full control: Model selection and tuning
Quick Start: Zero To Local Coding¶
SuperQode bundles a guided path from "I want local coding" to a harness you can trust on your repo. You pay once in hardware, not forever in token bills. Local is slower than frontier labs and quality depends on your model and machine, so SuperQode focuses on measurement, control, and ownership of the harness.
Run one command from inside your repository:
superqode local init --repo .
local init will:
- Detect your hardware tier and installed local engines.
- Recommend trusted models (sourced from models.dev Labs and vetted communities only).
- Run a non-destructive smoke test against the running server.
- Write a transparent harness to
superqode.local.yaml. - Print the next command to run.
If no server is running yet, start one first, then rerun init:
superqode local serve ollama
superqode local init --repo .
Once init reports the harness is ready, start coding:
superqode --harness superqode.local.yaml
Find The Right Model To Download¶
You download models with each engine's own tool (ollama pull, Hugging Face, a GGUF). To find the right one for your hardware without guessing or wasting a download, search the trusted catalog:
superqode local search qwen
superqode local search qwen3-coder --json
Add --hub to also query the Hugging Face Hub live, filtered to trusted publishers (the model labs plus vetted quant communities like mlx-community and lmstudio-community), so you see the latest releases:
superqode local search glm --hub
superqode local search qwen3-coder --hub --gguf
In the TUI, type :hub to enter model-search mode, then just type a model name (no :local search prefix needed). :hub <name> does a one-shot search, and :hub off exits. Add --hub on a line for live Hugging Face results.
For each match it lists the real native download command for every engine the model can run on, plus a one-line superqode models download alternative that auto-picks the engine:
Qwen3-Coder 30B-A3B [~17.4 GB ยท likely fits]
ollama ollama pull qwen3-coder:30b-a3b
llama.cpp llama-server -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
LM Studio lms get https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
MLX hf download lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-4bit
SuperQode superqode models download lmstudio-community/...-MLX-4bit (any engine)
The MLX and GGUF repositories come from the live Hugging Face API (trusted publishers only), and the command syntaxes are the current ones: hf download (not the deprecated huggingface-cli), llama-server -hf <repo>, and lms get <full HF URL>. Each match also shows an approximate model size, whether you already have it, and a rough memory-fit verdict for your hardware. The size is estimated from the parameter count and quantization, so treat it as a guide, not a guarantee. (Works offline too: without a network it falls back to the catalog command only.) SuperQode prints the command; you run it in your tool, then :connect local.
Verify Readiness Anytime¶
local smoke runs the same non-destructive readiness probe on demand. It never reads or edits your repo:
superqode local smoke --repo .
It checks that the server is reachable, a chat model (not an embedding model) is loaded, the context window is detected, and that a tiny prompt returns clean tool-call and patch output. It also measures TTFT and decode speed. The verdict is ready, usable with warnings, or not ready yet, and every failure prints the exact next command to run.
Common Failure Messages¶
local init and local smoke diagnose problems instead of just erroring. Typical messages and their fix:
| Message | What to do |
|---|---|
no response from <endpoint> | Start the server with superqode local serve <engine> or check the --endpoint URL. |
server returned no models | Load or install a chat model, then run superqode local models. |
only embedding/reranker models found | Load a chat/coding model, not an embedding model. |
High TTFT; model is cold | Warm it: superqode local warm <engine> --model <model>. |
Native tool calls look unreliable | The generated harness will fall back to prompt tool-call format. |
Long-context recall probe failed | Use a smaller context window or let SuperQode compact sooner. |
Supported Providers¶
| Provider | Best For | Setup Complexity |
|---|---|---|
| DS4 | DeepSeek V4 Flash, coding agents, long-context work | Medium |
| Ollama | Easy setup, many models | Easy |
| LM Studio | GUI interface, beginners | Easy |
| MLX | General Apple Silicon model serving | Medium |
| vLLM | Production, high throughput | Advanced |
DS4 / DwarfStar 4¶
DS4 runs DeepSeek V4 Flash locally and exposes OpenAI-compatible, Responses, and Anthropic-style endpoints. SuperQode treats it as a local provider named ds4, so it can be used from the CLI, TUI, provider doctor, and model recommendation flow.
Use DS4 instead of MLX when your target model is DeepSeek V4 Flash. MLX is a good general Apple Silicon runner; DS4 is a purpose-built DeepSeek V4 Flash engine with DS4-specific prompt rendering, tool-call handling, long-context behavior, and disk KV cache support.
Prerequisites¶
- A working DS4 checkout or release directory.
- The
ds4-serverbinary available in that directory or onPATH. - A compatible model file available to DS4, commonly
ds4flash.gguf.
See the upstream project for installation and model details: antirez/ds4.
Start DS4¶
From the directory that contains ds4-server and the model file:
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
If you launch the server from another directory, pass the DS4 checkout path so runtime files resolve correctly:
./ds4-server --chdir /path/to/ds4 --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
By default, SuperQode expects DS4 at:
http://127.0.0.1:8000/v1
If your DS4 server runs somewhere else, set:
export DS4_HOST=http://127.0.0.1:8000/v1
Check SuperQode Connectivity¶
superqode providers guide ds4
superqode providers models ds4
superqode providers recommend local
superqode doctor
providers guide ds4 checks whether the local server is reachable. If DS4 is not running, SuperQode will show a setup hint instead of treating the provider as ready.
Run a Headless Coding Task¶
superqode -p --provider ds4 --model deepseek-v4-flash "summarize this repository"
For a harness run, use the DS4 example or template:
superqode harness run --spec examples/harnesses/ds4.yaml --prompt "review this repository"
superqode harness init my-ds4 --template ds4-fast-local
Use deepseek-chat when you want the DS4 non-thinking/direct alias:
superqode -p --provider ds4 --model deepseek-chat "review the current git diff"
Connect From The TUI¶
In the SuperQode TUI command input, open the local provider picker and select DS4 by number:
:connect local
You can also jump straight to the DS4 model list:
:connect local ds4
Direct model selection is still supported:
:connect local ds4/deepseek-v4-flash
Notes¶
- DS4 is local, so no API key is required.
- SuperQode supplies a placeholder OpenAI API key for OpenAI-compatible local clients that require one.
- SuperQode uses DS4's Anthropic-style
/v1/messagespath for direct local runs so tool and thinking blocks stay in the shape DS4 expects. - DS4 uses a smaller DS4-specific tool profile and disables parallel tool execution by default.
- DS4 uses direct tool gating by default: SuperQode sends tools for repo, file, test, command, and code-change requests, but skips tools for ordinary questions and standalone code-generation prompts. This reduces unnecessary agent iterations.
deepseek-v4-flashis the recommended default for coding and long-context local work.deepseek-chatis useful when you want the non-thinking mode exposed by DS4-compatible clients.
DS4 Tool Mode¶
The default DS4 tool mode is auto. Override it when you need different behavior:
# Default: send tools only for project/file/codebase work
export SUPERQODE_DS4_TOOL_MODE=auto
# Restore eager tool use
export SUPERQODE_DS4_TOOL_MODE=always
# Disable tools for DS4
export SUPERQODE_DS4_TOOL_MODE=never
Cold Start & Warm-up¶
DS4 mmaps a large GGUF (the IQ2XXS DeepSeek V4 Flash build is ~81GB) and pays a one-time cost paging it in from disk on the first inference. Once warm, responses are fast (sub-second for short prompts). To keep your first real prompt fast, SuperQode warms the model on connect: it sends a tiny 1-token request and shows a live elapsed-time indicator, then reports when DS4 is warm.
โ DS4 server ready at http://127.0.0.1:8000/v1
โณ Loading model into memory (first start can be slow on a cold cache)...
...still loading the model (10s)
โ DS4 ready (warm) - 24s
Tips to avoid cold starts:
- Keep
ds4-serverrunning between sessions (the OS page cache stays warm). - Use the disk KV cache (
--kv-disk-dir) so prompt prefixes survive restarts.
Disable the connect-time warm-up with:
export SUPERQODE_DS4_WARMUP=0 # 0/false/no/off - skip warm-up on connect
Local Code Search (No Web Access)¶
Local models have no internet access, so web_search is intentionally not part of the DS4/local tool profile - and asking a local model to "search the web" will not work. Instead, local models are tuned to answer from local code using repo_search (broad, ranked files + content + symbols in one pass), grep, code_search (symbols/definitions/references), and read_file.
To let a local model search a repo you downloaded outside your project (for reference, API examples, etc.), point SuperQode at it with SUPERQODE_SEARCH_ROOTS (os.pathsep-separated - : on macOS/Linux):
export SUPERQODE_SEARCH_ROOTS="$HOME/refs/react:$HOME/refs/linux"
superqode -p --provider ds4 "how does this project's router compare to react's? search the react ref"
- Search and read tools (
repo_search,grep,glob,code_search,read_file,list_directory) may access those roots read-only. - Writes/edits/shell stay confined to your working directory - reference repos cannot be modified.
- Address a reference repo by its absolute path. The configured roots are listed in the local model's system prompt so it knows they're available.
Ollama¶
The easiest way to run local models.
Installation¶
macOS:
brew install ollama
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download from ollama.com
Quick Start¶
# Start Ollama
ollama serve
# Pull a model
ollama pull qwen3:8b
# Connect in SuperQode
superqode connect local ollama qwen3:8b
Recommended Models¶
These are recommendations, not a model store. Keep them constrained to models.dev Labs, LM Studio Community, or MLX Community provenance so users do not get a biased or confusing grab bag of arbitrary model names.
| Model | Size | Best For |
|---|---|---|
qwen3.6:35b-a3b | varies | Alibaba Labs Qwen agentic coding |
glm-4.5-air | varies | Zhipu AI Labs GLM long-context coding |
gemma4:e4b | ~3GB | Google Labs small local utility work |
deepseek-v4-flash | server-class | DeepSeek Labs via DS4/server routes |
Configuration¶
providers:
ollama:
base_url: http://localhost:11434
type: openai-compatible
recommended_models:
- qwen3.6:35b-a3b
- glm-4.5-air
- gemma4:e4b
LM Studio¶
GUI-based local model runner.
Installation¶
- Download from lmstudio.ai
- Install and open LM Studio
- Download a model (search for "qwen", "glm", or "gemma")
- Load the model
- Start Local Server (port 1234)
Connect¶
superqode connect local lmstudio local-model
Configuration¶
providers:
lmstudio:
base_url: http://localhost:1234
type: openai-compatible
Tips¶
- Keep LM Studio running in background
- Load model before connecting
- Check "Local Server" tab for status
MLX (Apple Silicon)¶
Optimized for M1/M2/M3 Macs.
Installation¶
pip install mlx-lm
Quick Start¶
# Download model
mlx_lm.download mlx-community/Qwen2.5-Coder-3B-4bit
# Start server (in separate terminal)
mlx_lm.server --model mlx-community/Qwen2.5-Coder-3B-4bit
# Connect in SuperQode
superqode connect local mlx mlx-community/Qwen2.5-Coder-3B-4bit
Recommended Models¶
MLX recommendations use the vetted mlx-community namespace or a model family also present in models.dev Labs.
| Model | RAM | Quality |
|---|---|---|
mlx-community/Qwen2.5-Coder-0.5B-Instruct-4bit | 2GB | Basic |
mlx-community/Qwen2.5-Coder-3B-4bit | 4GB | Good |
mlx-community/Qwen2.5-Coder-7B-4bit | 8GB | Better |
mlx-community/Qwen3-30B-A3B-4bit | 16GB | Best |
MLX Commands¶
# List available models
superqode providers mlx list
# Show suggested models
superqode providers mlx models
# Check installation
superqode providers mlx check
# Full setup guide
superqode providers mlx setup
Configuration¶
providers:
mlx:
base_url: http://localhost:8080
type: openai-compatible
Limitations¶
- One server per model
- Single request at a time
- MoE models not supported
vLLM¶
High-performance inference for production.
Installation¶
pip install vllm
Quick Start¶
# Start server
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-Coder-7B-Instruct \
--port 8000
# Connect in SuperQode
superqode connect local vllm Qwen/Qwen2.5-Coder-7B-Instruct
Configuration¶
providers:
vllm:
base_url: http://localhost:8000
type: openai-compatible
Benefits¶
- High throughput
- Continuous batching
- PagedAttention
SGLang¶
Fast structured generation framework optimized for complex prompts.
Installation¶
pip install "sglang[all]"
Quick Start¶
# Start server
python -m sglang.launch_server \
--model-path Qwen/Qwen2.5-Coder-7B-Instruct \
--port 30000
# Connect in SuperQode
superqode connect local sglang Qwen/Qwen2.5-Coder-7B-Instruct
Configuration¶
providers:
sglang:
base_url: http://localhost:30000/v1
type: openai-compatible
Features¶
- RadixAttention: Fast KV cache reuse for better performance
- Compressed FSM: Efficient structured output generation
- OpenAI-compatible API: Drop-in replacement for OpenAI endpoints
Benefits¶
- Faster inference for complex prompts
- Efficient structured generation
- Good for code analysis tasks
Recommended Models¶
| Model | Size | Best For |
|---|---|---|
Qwen/Qwen3-Coder-30B-A3B-FP8 | varies | Alibaba Labs coder route |
THUDM/GLM-4.5-Air | varies | Zhipu AI Labs long-context coding |
TGI (Text Generation Inference)¶
HuggingFace's production-grade inference server with multi-GPU support.
Installation¶
# Using Docker (recommended)
docker run --gpus all \
-p 8080:80 \
-v $PWD/data:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id Qwen/Qwen2.5-Coder-7B-Instruct
# Or using Python
pip install text-generation
Quick Start¶
# Using Docker
docker run -d --gpus all \
-p 8080:80 \
-v $PWD/data:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id Qwen/Qwen2.5-Coder-7B-Instruct \
--port 80
# Connect in SuperQode
superqode connect local tgi Qwen/Qwen2.5-Coder-7B-Instruct
Configuration¶
providers:
tgi:
base_url: http://localhost:8080
type: huggingface
Features¶
- Flash Attention & Paged Attention: Memory-efficient attention
- Continuous Batching: Efficient request handling
- Tensor Parallelism: Multi-GPU support
- Token Streaming: Real-time token output
- Tool/Function Calling: Built-in tool support
Benefits¶
- Production-ready server
- Multi-GPU scaling
- Memory efficient
- Good for high-load scenarios
Multi-GPU Setup¶
docker run --gpus all \
-p 8080:80 \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id Qwen/Qwen2.5-Coder-7B-Instruct \
--num-shard 4 # Use 4 GPUs
llama.cpp¶
CPU/GPU inference server for GGUF format models.
Installation¶
# Clone and build
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make
# Or use pre-built server
# Download from releases
Quick Start¶
# Start server
./llama-server \
-m models/qwen2.5-coder-7b.Q8_0.gguf \
--port 8080 \
--host 0.0.0.0
# Connect in SuperQode
superqode connect local llamacpp local-model
Configuration¶
providers:
llamacpp:
base_url: http://localhost:8080/v1
type: openai-compatible
Features¶
- GGUF Format: Efficient model format
- CPU/GPU Support: Works on both CPU and GPU
- Low Memory: Efficient memory usage
- OpenAI-compatible: Standard API interface
Benefits¶
- Runs on CPU efficiently
- Works with quantized models
- Low resource requirements
- Good for older hardware
Model Format¶
llama.cpp uses GGUF format models:
# Convert model to GGUF
python convert.py --outfile model.gguf --outtype f16 model/
# Quantize model
./quantize model.gguf model-q8_0.gguf Q8_0
Quantization Levels¶
| Level | Size | Quality | Speed |
|---|---|---|---|
F16 | 100% | Best | Medium |
Q8_0 | 50% | Very Good | Fast |
Q4_K_M | 25% | Good | Very Fast |
Q2_K | 12.5% | Basic | Fastest |
Performance Tips¶
1. Choose Right Model Size¶
| RAM | Recommended Size |
|---|---|
| 8GB | 3B-7B models |
| 16GB | 7B-13B models |
| 32GB+ | 13B+ models |
2. Use Quantized Models¶
Quantized models (4-bit, 8-bit) use less RAM:
# MLX example
mlx_lm.download mlx-community/Qwen2.5-Coder-7B-4bit # vs 8bit
3. Keep Server Running¶
Start servers before SuperQode sessions to avoid startup delays.
4. Use Appropriate Context Length¶
Shorter context = faster inference:
# Ollama example with context length
ollama run qwen3:8b --num-ctx 4096
Troubleshooting¶
Connection Refused¶
[INCORRECT] Connection failed: Connection refused
Solution: Ensure server is running:
# Ollama
ollama serve
# MLX
mlx_lm.server --model <model>
# LM Studio
# Check Local Server tab
Model Not Found¶
[INCORRECT] Model 'qwen3:8b' not found
Solution: Pull/download the model first:
# Ollama
ollama pull qwen3:8b
# MLX
mlx_lm.download mlx-community/Qwen2.5-Coder-3B-4bit
Out of Memory¶
[INCORRECT] CUDA out of memory / MPS out of memory
Solutions: - Use smaller model - Use quantized model - Close other applications - Reduce context length
Slow Inference¶
Solutions: - Use quantized models - Reduce context length - Use GPU acceleration - Consider faster hardware
Comparison¶
| Provider | Setup | Speed | GUI | Best For |
|---|---|---|---|---|
| DS4 | Medium | Fast | No | DeepSeek V4 Flash coding and long context |
| Ollama | Easy | Fast | No | General use |
| LM Studio | Easy | Medium | Yes | Beginners |
| MLX | Medium | Fast | No | General Apple Silicon models |
| vLLM | Advanced | Very Fast | No | Production |
| SGLang | Medium | Very Fast | No | Structured generation |
| TGI | Advanced | Very Fast | No | Multi-GPU production |
| llama.cpp | Medium | Medium | No | CPU inference |
Next Steps¶
- BYOK Providers - Cloud alternatives
- Provider Commands - CLI reference