Benchmarks

TurboAgents ships lightweight synthetic benchmark surfaces that are practical to run on a normal development machine. Their job is not to imitate every real-world workload. Their job is to validate the package surface, compare bit-width trends, catch regressions, and make the tradeoffs visible before you move on to larger model or dataset runs.

Commands

turboagents bench kv
turboagents bench kv --format json
turboagents bench rag --format markdown
turboagents bench paper

Built-In Datasets

The built-in datasets are small by design, and all of them are generated by turboagents/bench/datasets.py:

  • tiny-kv: deterministic vector batches for KV-style reconstruction tests.
  • tiny-rag: the smallest TurboRAG sanity check.
  • medium-rag: the larger synthetic retrieval pass used for the adapter matrix.
  • paper-sim: the synthetic paper-style MSE comparison set.
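As a rough sketch of what a deterministic synthetic batch generator looks like (the function name, shapes, and seed below are illustrative, not the actual turboagents/bench/datasets.py API):

```python
import numpy as np

def make_tiny_kv(num_vectors: int = 256, dim: int = 64, seed: int = 0) -> np.ndarray:
    # Seeded generator, so every run reproduces the identical batch.
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_vectors, dim)).astype(np.float32)

batch = make_tiny_kv()
```

Determinism is the point: because the batch is a pure function of the seed, any metric drift between two runs can only come from the codec under test, never from the data.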

What The Metrics Mean

Each report uses a small set of metrics:

  • mean_payload_bytes: the average size of the serialized compressed payload.
  • mse and mean_cosine_similarity: how well a vector survives the quantize/dequantize round trip.
  • ip_mae: how far inner-product quality drifts against a held-out query.
  • recall_at_1 and recall_at_10: how closely the compressed retrieval path agrees with exact dot-product ranking.
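These quantities are standard to compute. The sketch below shows one plausible set of NumPy definitions; the function names are illustrative, and the harness's exact recall definition may differ (top-1-in-top-k versus the top-k set overlap used here):

```python
import numpy as np

def reconstruction_metrics(original, reconstructed, queries):
    # mse: average squared error over every vector component.
    mse = float(np.mean((original - reconstructed) ** 2))
    # mean_cosine_similarity: per-vector cosine, then averaged.
    num = np.sum(original * reconstructed, axis=1)
    den = np.linalg.norm(original, axis=1) * np.linalg.norm(reconstructed, axis=1)
    mean_cos = float(np.mean(num / den))
    # ip_mae: mean absolute drift of query/vector inner products.
    ip_mae = float(np.mean(np.abs(queries @ original.T - queries @ reconstructed.T)))
    return mse, mean_cos, ip_mae

def recall_at_k(exact_scores, approx_scores, k):
    # Average overlap between the exact and approximate top-k id sets,
    # one score row per query.
    exact_top = np.argsort(-exact_scores, axis=1)[:, :k]
    approx_top = np.argsort(-approx_scores, axis=1)[:, :k]
    overlaps = [len(set(e) & set(a)) / k for e, a in zip(exact_top, approx_top)]
    return float(np.mean(overlaps))
```

With identical inputs the metrics hit their ideal values (mse 0, cosine 1, recall 1.0), which is a useful self-check before trusting numbers from a real codec.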

What These Benchmarks Are Not

These benchmarks are not LongBench, BEIR, or MTEB. They are not final paper-faithful reproductions, and they are not full end-to-end serving numbers for large production models. Those larger evaluations should run on hardware with the right memory, runtimes, and datasets installed.

Full Benchmark Workflow

The repository also includes a reproducible harness for fuller runs:

uv sync --extra rag --extra mlx
uv run python scripts/run_benchmark_matrix.py --output-dir benchmark-results/$(date +%Y%m%d-%H%M%S)

Optional real MLX run:

uv run python scripts/run_benchmark_matrix.py \
  --mlx-model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --output-dir benchmark-results/$(date +%Y%m%d-%H%M%S)

Optional live pgvector validation:

uv run python scripts/run_benchmark_matrix.py \
  --pgvector-dsn postgresql://localhost/turboagents \
  --output-dir benchmark-results/$(date +%Y%m%d-%H%M%S)

The supporting entry points are scripts/run_benchmark_matrix.py, scripts/benchmark_mlx.py, scripts/benchmark_rag_adapters.py, scripts/benchmark_needle.py, scripts/summarize_benchmark_results.py, and benchmarks/README.md.

Latest Benchmark Run

The latest checked-in artifact set includes:

  • benchmark-results/20260326-128gb-run/summary.md
  • benchmark-results/20260326-128gb-run/mlx-benchmark.json
  • benchmark-results/20260326-128gb-run/rag-adapters.json
  • benchmark-results/20260327-chroma-local/rag-adapters.json

MLX Sweep

The current MLX sweep uses mlx-community/Llama-3.2-3B-Instruct-4bit. In that run, 2.5 bits was the fastest at about 152 tok/s, with 3.5 and 4.0 close behind at about 139 tok/s. The 2.0 and 2.5 settings produced visibly worse completions than 3.0 and above, which is why 3.5 currently looks like the best balance of quality and speed in this sweep.

| Bits | Elapsed Seconds | Completion Tokens | Tokens / Second |
|------|-----------------|-------------------|-----------------|
| 2.0  | 1.3136          | 129               | 98.2015         |
| 2.5  | 0.8479          | 129               | 152.1374        |
| 3.0  | 0.4926          | 64                | 129.9294        |
| 3.5  | 0.4606          | 64                | 138.9408        |
| 4.0  | 0.4630          | 64                | 138.2228        |
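Tokens per second in this table is simply completion tokens divided by elapsed seconds; a quick sanity check against the rows above:

```python
# (bits, elapsed_seconds, completion_tokens, reported_tokens_per_second)
rows = [
    (2.0, 1.3136, 129, 98.2015),
    (2.5, 0.8479, 129, 152.1374),
    (3.0, 0.4926, 64, 129.9294),
    (3.5, 0.4606, 64, 138.9408),
    (4.0, 0.4630, 64, 138.2228),
]
for bits, elapsed, tokens, reported in rows:
    # A small tolerance absorbs rounding in the recorded elapsed times.
    assert abs(tokens / elapsed - reported) < 0.05, bits
```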

RAG Adapter Matrix

The current adapter matrix uses medium-rag with 1024 base vectors, 32 queries, and 128 dimensions.

FAISS is the strongest current retrieval path in this benchmark. It held recall@1 = 1.0 and recall@10 = 1.0 across the full tested bit-width range, with query times around 0.0015s to 0.0023s.

| Adapter | Bits | Build Seconds | Query Seconds | Recall@1 | Recall@10 |
|---------|------|---------------|---------------|----------|-----------|
| FAISS   | 2.0  | 1.6098        | 0.002290      | 1.0      | 1.0       |
| FAISS   | 2.5  | 2.0668        | 0.001462      | 1.0      | 1.0       |
| FAISS   | 3.0  | 2.7302        | 0.001497      | 1.0      | 1.0       |
| FAISS   | 3.5  | 4.0106        | 0.001527      | 1.0      | 1.0       |
| FAISS   | 4.0  | 4.7658        | 0.001499      | 1.0      | 1.0       |

Chroma also held recall@1 = 1.0 and recall@10 = 1.0 across the tested sweep. It is slower than FAISS in this path, but still materially faster than the current LanceDB and pgvector runs.

| Adapter | Bits | Build Seconds | Query Seconds | Recall@1 | Recall@10 |
|---------|------|---------------|---------------|----------|-----------|
| Chroma  | 2.0  | 1.823308      | 0.024900      | 1.0      | 1.0       |
| Chroma  | 2.5  | 2.179364      | 0.017113      | 1.0      | 1.0       |
| Chroma  | 3.0  | 2.708539      | 0.015125      | 1.0      | 1.0       |
| Chroma  | 3.5  | 3.806835      | 0.016054      | 1.0      | 1.0       |
| Chroma  | 4.0  | 4.611422      | 0.015874      | 1.0      | 1.0       |

LanceDB keeps recall@1 = 1.0, but its recall@10 is materially weaker than FAISS's in this benchmark, ranging from about 0.70 to 0.75, with query times around 0.063s to 0.065s.

| Adapter | Bits | Build Seconds | Query Seconds | Recall@1 | Recall@10 |
|---------|------|---------------|---------------|----------|-----------|
| LanceDB | 2.0  | 8.0862        | 0.065101      | 1.0      | 0.750000  |
| LanceDB | 2.5  | 2.1825        | 0.063661      | 1.0      | 0.743750  |
| LanceDB | 3.0  | 2.8432        | 0.063563      | 1.0      | 0.700000  |
| LanceDB | 3.5  | 4.1354        | 0.063443      | 1.0      | 0.721875  |
| LanceDB | 4.0  | 4.9028        | 0.063786      | 1.0      | 0.721875  |

Synthetic KV Trend Snapshot

The synthetic KV report shows the expected monotonic behavior. Higher bit-widths reduce mse, improve cosine similarity, and reduce inner-product error. In this run, b2.0_mse = 0.192804, b3.5_mse = 0.029482, and b4.0_mse = 0.014735.
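The trend itself is easy to reproduce with a toy uniform quantizer. This is only a sketch with integer bit-widths; the actual TurboAgents codec, including its fractional bit-widths, is more involved:

```python
import numpy as np

def quantize_dequantize(x: np.ndarray, bits: int) -> np.ndarray:
    # Uniform quantization over the batch range with 2**bits levels.
    levels = 2 ** bits
    lo, hi = float(x.min()), float(x.max())
    step = (hi - lo) / (levels - 1)
    codes = np.round((x - lo) / step)
    return codes * step + lo

rng = np.random.default_rng(0)
x = rng.standard_normal((1024, 64)).astype(np.float32)
mses = {b: float(np.mean((quantize_dequantize(x, b) - x) ** 2)) for b in (2, 3, 4)}
```

Each extra bit doubles the number of levels, halves the step size, and therefore cuts the expected squared error by roughly 4x, which matches the monotonic drop the report shows.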

Live pgvector Validation

The current pgvector artifact is benchmark-results/20260326-pgvector-adapters-fixed.json. This run used PostgreSQL 17.9 with the vector extension enabled, and the adapter benchmark was fixed to use isolated tables and normalized database row ids. On medium-rag, pgvector held recall@1 = 1.0, improved recall@10 monotonically as bit-width increased, and beat LanceDB on recall@10 at 3.0+ bits. It is still much slower than FAISS and LanceDB in the current path, at about 0.87s for the 32-query batch.

| Adapter  | Bits | Build Seconds | Query Seconds | Recall@1 | Recall@10 |
|----------|------|---------------|---------------|----------|-----------|
| pgvector | 2.0  | 1.7564        | 0.865659      | 1.0      | 0.656250  |
| pgvector | 2.5  | 2.1937        | 0.866529      | 1.0      | 0.718750  |
| pgvector | 3.0  | 2.8430        | 0.874390      | 1.0      | 0.796875  |
| pgvector | 3.5  | 4.1119        | 0.873350      | 1.0      | 0.837500  |
| pgvector | 4.0  | 4.8137        | 0.872895      | 1.0      | 0.896875  |

Minimal Long-Context Eval

The repository now includes a minimal Needle-style long-context harness for MLX:

uv run python scripts/benchmark_needle.py \
  --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --context-tokens 2048 4096 8192 \
  --output benchmark-results/needle-$(date +%Y%m%d-%H%M%S).json

The current scope is deliberately narrow: deterministic exact-match retrieval of a single secret string, configurable context lengths, configurable insertion positions, and a bit-width sweep across the MLX path. It is still not a full Needle-in-a-Haystack reproduction, not a multi-document or multi-hop task, and not yet integrated into a serving latency matrix.
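The core prompt construction can be sketched as follows; the function name and filler text here are illustrative, not the actual scripts/benchmark_needle.py internals:

```python
def build_needle_prompt(filler_sentences: list[str],
                        needle: str,
                        insertion_fraction: float) -> str:
    # Place the needle at a proportional position in the filler document:
    # 0.1 is near the start, 0.5 is the middle, 0.9 is near the end.
    idx = int(len(filler_sentences) * insertion_fraction)
    parts = filler_sentences[:idx] + [needle] + filler_sentences[idx:]
    return " ".join(parts)

filler = [f"Padding sentence number {i}." for i in range(100)]
needle = "The secret code is AZURE-042."
prompt = build_needle_prompt(filler, needle, 0.5)
```

Scoring is then exact string match on the model's completion (and a weaker substring check), which is what makes the harness deterministic.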

The latest artifact is benchmark-results/needle-20260326-20260326-234159.json. On mlx-community/Llama-3.2-3B-Instruct-4bit, exact-match retrieval succeeded at insertion fraction 0.1 for the 2048-, 8192-, and 16384-token contexts (but not 4096), and failed consistently at fractions 0.5 and 0.9. contains_needle stays positive in some cases where exact match fails, but that is a weaker criterion. The current result should therefore be read as early-position recall, not robust long-context retrieval.

| Bits | Context Tokens | Insertion Fraction | Elapsed Seconds | Exact Match | Contains Needle |
|------|----------------|--------------------|-----------------|-------------|-----------------|
| 3.0  | 2048           | 0.1                | 1.1014          | true        | true            |
| 3.0  | 2048           | 0.5                | 1.1488          | false       | true            |
| 3.0  | 2048           | 0.9                | 1.1465          | false       | false           |
| 3.0  | 4096           | 0.1                | 2.5599          | false       | true            |
| 3.0  | 4096           | 0.5                | 2.6351          | false       | true            |
| 3.0  | 4096           | 0.9                | 2.7681          | false       | false           |
| 3.0  | 8192           | 0.1                | 6.1731          | true        | true            |
| 3.0  | 8192           | 0.5                | 6.7962          | false       | false           |
| 3.0  | 8192           | 0.9                | 6.3600          | false       | false           |
| 3.0  | 16384          | 0.1                | 15.6448         | true        | true            |
| 3.0  | 16384          | 0.5                | 16.1179         | false       | false           |
| 3.0  | 16384          | 0.9                | 15.6592         | false       | false           |
| 3.5  | 2048           | 0.1                | 1.2985          | true        | true            |
| 3.5  | 2048           | 0.5                | 1.4403          | false       | true            |
| 3.5  | 2048           | 0.9                | 1.4370          | false       | false           |
| 3.5  | 4096           | 0.1                | 2.9586          | false       | true            |
| 3.5  | 4096           | 0.5                | 2.9688          | false       | true            |
| 3.5  | 4096           | 0.9                | 2.9501          | false       | false           |
| 3.5  | 8192           | 0.1                | 6.3312          | true        | true            |
| 3.5  | 8192           | 0.5                | 6.5699          | false       | false           |
| 3.5  | 8192           | 0.9                | 6.5293          | false       | false           |
| 3.5  | 16384          | 0.1                | 18.4821         | true        | true            |
| 3.5  | 16384          | 0.5                | 18.9856         | false       | false           |
| 3.5  | 16384          | 0.9                | 19.2857         | false       | false           |
| 4.0  | 2048           | 0.1                | 1.5542          | true        | true            |
| 4.0  | 2048           | 0.5                | 1.6880          | false       | true            |
| 4.0  | 2048           | 0.9                | 1.6921          | false       | false           |
| 4.0  | 4096           | 0.1                | 3.5982          | false       | true            |
| 4.0  | 4096           | 0.5                | 3.6594          | false       | true            |
| 4.0  | 4096           | 0.9                | 3.8123          | false       | false           |
| 4.0  | 8192           | 0.1                | 7.8715          | true        | true            |
| 4.0  | 8192           | 0.5                | 8.3909          | false       | false           |
| 4.0  | 8192           | 0.9                | 8.4198          | false       | false           |
| 4.0  | 16384          | 0.1                | 20.2147         | true        | true            |
| 4.0  | 16384          | 0.5                | 21.5121         | false       | false           |
| 4.0  | 16384          | 0.9                | 20.6090         | false       | false           |