Benchmarks¶
TurboAgents ships lightweight synthetic benchmark surfaces that are practical to run on a normal development machine. Their job is not to imitate every real-world workload. Their job is to validate the package surface, compare bit-width trends, catch regressions, and make the tradeoffs visible before you move on to larger model or dataset runs.
Commands¶
```shell
turboagents bench kv
turboagents bench kv --format json
turboagents bench rag --format markdown
turboagents bench paper
```
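The `--format json` output can be consumed programmatically. A minimal sketch, assuming a hypothetical report shape built from the metric names this page documents (the real CLI schema may differ):

```python
import json

# Hypothetical report shaped like the metrics described below; the actual
# `turboagents bench kv --format json` schema may use different keys.
report_text = json.dumps({
    "dataset": "tiny-kv",
    "results": [
        {"bits": 2.0, "mse": 0.192804},
        {"bits": 4.0, "mse": 0.014735},
    ],
})

report = json.loads(report_text)
# Pick the bit-width with the lowest reconstruction error.
best = min(report["results"], key=lambda row: row["mse"])
print(f"lowest MSE at {best['bits']} bits: {best['mse']}")
```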
Built-In Datasets¶
The built-in datasets are small by design, and all of them are generated by
`turboagents/bench/datasets.py`:

- `tiny-kv`: deterministic vector batches for KV-style reconstruction tests.
- `tiny-rag`: the smallest TurboRAG sanity check.
- `medium-rag`: the larger synthetic retrieval pass used for the adapter matrix.
- `paper-sim`: the synthetic paper-style MSE comparison set.
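Deterministic generation is what makes these datasets useful for regression checks: the same seed always reproduces the same batch. A minimal sketch of such a generator, assuming a hypothetical `make_tiny_kv` helper (the real `datasets.py` may differ):

```python
import numpy as np

def make_tiny_kv(num_vectors: int = 64, dim: int = 32, seed: int = 0) -> np.ndarray:
    """Deterministic synthetic vector batch, sketching what a tiny-kv-style
    generator might produce. Seeding the RNG makes regeneration exact."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_vectors, dim)).astype(np.float32)

batch_a = make_tiny_kv()
batch_b = make_tiny_kv()
# Same seed, so the two batches are bit-identical across runs.
print(batch_a.shape, bool(np.array_equal(batch_a, batch_b)))
```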
What The Metrics Mean¶
- `mean_payload_bytes`: the average size of the serialized compressed payload.
- `mse` and `mean_cosine_similarity`: how well a vector survives the quantize/dequantize round trip.
- `ip_mae`: how far inner-product quality drifts against a held-out query.
- `recall_at_1` and `recall_at_10`: how closely the compressed retrieval path agrees with exact dot-product ranking.
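The reconstruction metrics can be written down directly. A minimal sketch, using additive noise as a stand-in for an actual quantize/dequantize round trip (the real codec is more involved):

```python
import numpy as np

rng = np.random.default_rng(42)
original = rng.standard_normal((8, 16)).astype(np.float32)
# Stand-in for quantize/dequantize: small additive reconstruction noise.
reconstructed = original + rng.normal(scale=0.05, size=original.shape).astype(np.float32)

# mse: mean squared reconstruction error over all components.
mse = float(np.mean((original - reconstructed) ** 2))

# mean_cosine_similarity: per-vector direction preservation, averaged.
cos = np.sum(original * reconstructed, axis=1) / (
    np.linalg.norm(original, axis=1) * np.linalg.norm(reconstructed, axis=1)
)
mean_cosine_similarity = float(np.mean(cos))

# ip_mae: mean absolute inner-product drift against a held-out query.
query = rng.standard_normal(16).astype(np.float32)
ip_mae = float(np.mean(np.abs(original @ query - reconstructed @ query)))

print(mse, mean_cosine_similarity, ip_mae)
```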
What These Benchmarks Are Not¶
These benchmarks are not LongBench, BEIR, or MTEB. They are not final paper-faithful reproductions, and they are not full end-to-end serving numbers for large production models. Those larger evaluations should run on hardware with the right memory, runtimes, and datasets installed.
Full Benchmark Workflow¶
The repository also includes a reproducible harness for fuller runs:
```shell
uv sync --extra rag --extra mlx
uv run python scripts/run_benchmark_matrix.py --output-dir benchmark-results/$(date +%Y%m%d-%H%M%S)
```
Optional real MLX run:
```shell
uv run python scripts/run_benchmark_matrix.py \
  --mlx-model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --output-dir benchmark-results/$(date +%Y%m%d-%H%M%S)
```
Optional live pgvector validation:
```shell
uv run python scripts/run_benchmark_matrix.py \
  --pgvector-dsn postgresql://localhost/turboagents \
  --output-dir benchmark-results/$(date +%Y%m%d-%H%M%S)
```
The supporting entry points are scripts/run_benchmark_matrix.py,
scripts/benchmark_mlx.py, scripts/benchmark_rag_adapters.py,
scripts/benchmark_needle.py, scripts/summarize_benchmark_results.py, and
benchmarks/README.md.
Latest Benchmark Run¶
The latest checked-in artifact set includes:
- benchmark-results/20260326-128gb-run/summary.md
- benchmark-results/20260326-128gb-run/mlx-benchmark.json
- benchmark-results/20260326-128gb-run/rag-adapters.json
- benchmark-results/20260327-chroma-local/rag-adapters.json
MLX Sweep¶
The current MLX sweep uses mlx-community/Llama-3.2-3B-Instruct-4bit. In that
run, 2.5 bits was the fastest at about 152 tok/s, with 3.5 and 4.0 close
behind at about 139 tok/s. The 2.0 and 2.5 settings, however, produced
visibly worse completions than 3.0 and above, which is why 3.5 currently
looks like the best quality and performance balance in this sweep.
| Bits | Elapsed Seconds | Completion Tokens | Tokens / Second |
|---|---|---|---|
| 2.0 | 1.3136 | 129 | 98.2015 |
| 2.5 | 0.8479 | 129 | 152.1374 |
| 3.0 | 0.4926 | 64 | 129.9294 |
| 3.5 | 0.4606 | 64 | 138.9408 |
| 4.0 | 0.4630 | 64 | 138.2228 |
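The tokens-per-second column is derived as completion tokens divided by elapsed seconds. A quick consistency check over the table above:

```python
# (bits, elapsed_seconds, completion_tokens, reported_tokens_per_second)
rows = [
    (2.0, 1.3136, 129, 98.2015),
    (2.5, 0.8479, 129, 152.1374),
    (3.0, 0.4926, 64, 129.9294),
    (3.5, 0.4606, 64, 138.9408),
    (4.0, 0.4630, 64, 138.2228),
]

for bits, elapsed, tokens, reported in rows:
    derived = tokens / elapsed
    # Reported throughput should match tokens / elapsed up to rounding.
    assert abs(derived - reported) < 0.5, (bits, derived, reported)
print("tokens/second column is consistent with completion tokens / elapsed")
```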
RAG Adapter Matrix¶
The current adapter matrix uses medium-rag with 1024 base vectors, 32
queries, and 128 dimensions.
FAISS is the strongest current retrieval path in this benchmark. It held
recall@1 = 1.0 and recall@10 = 1.0 across the full tested bit-width range,
with query times around 0.0015s to 0.0023s.
| Adapter | Bits | Build Seconds | Query Seconds | Recall@1 | Recall@10 |
|---|---|---|---|---|---|
| FAISS | 2.0 | 1.6098 | 0.002290 | 1.0 | 1.0 |
| FAISS | 2.5 | 2.0668 | 0.001462 | 1.0 | 1.0 |
| FAISS | 3.0 | 2.7302 | 0.001497 | 1.0 | 1.0 |
| FAISS | 3.5 | 4.0106 | 0.001527 | 1.0 | 1.0 |
| FAISS | 4.0 | 4.7658 | 0.001499 | 1.0 | 1.0 |
Chroma also held recall@1 = 1.0 and recall@10 = 1.0 across the tested
sweep. It is slower than FAISS in this path, but still materially faster than
the current LanceDB and pgvector runs.
| Adapter | Bits | Build Seconds | Query Seconds | Recall@1 | Recall@10 |
|---|---|---|---|---|---|
| Chroma | 2.0 | 1.823308 | 0.024900 | 1.0 | 1.0 |
| Chroma | 2.5 | 2.179364 | 0.017113 | 1.0 | 1.0 |
| Chroma | 3.0 | 2.708539 | 0.015125 | 1.0 | 1.0 |
| Chroma | 3.5 | 3.806835 | 0.016054 | 1.0 | 1.0 |
| Chroma | 4.0 | 4.611422 | 0.015874 | 1.0 | 1.0 |
LanceDB keeps recall@1 = 1.0, but its recall@10 is materially weaker than
FAISS's in this benchmark, ranging from about 0.70 to 0.75, with query times
around 0.063s.
| Adapter | Bits | Build Seconds | Query Seconds | Recall@1 | Recall@10 |
|---|---|---|---|---|---|
| LanceDB | 2.0 | 8.0862 | 0.065101 | 1.0 | 0.750000 |
| LanceDB | 2.5 | 2.1825 | 0.063661 | 1.0 | 0.743750 |
| LanceDB | 3.0 | 2.8432 | 0.063563 | 1.0 | 0.700000 |
| LanceDB | 3.5 | 4.1354 | 0.063443 | 1.0 | 0.721875 |
| LanceDB | 4.0 | 4.9028 | 0.063786 | 1.0 | 0.721875 |
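Recall@k in these tables measures overlap between the compressed retrieval path and exact dot-product ranking. A self-contained sketch of that computation, using a hypothetical `recall_at_k` helper (the benchmark scripts may compute it differently):

```python
import numpy as np

def recall_at_k(base: np.ndarray, approx: np.ndarray,
                queries: np.ndarray, k: int) -> float:
    """Fraction of exact top-k dot-product neighbors also recovered by the
    approximate (e.g. dequantized) vectors, averaged over queries."""
    exact_scores = queries @ base.T
    approx_scores = queries @ approx.T
    hits = 0
    for e_row, a_row in zip(exact_scores, approx_scores):
        exact_top = set(np.argsort(-e_row)[:k])
        approx_top = set(np.argsort(-a_row)[:k])
        hits += len(exact_top & approx_top)
    return hits / (k * len(queries))

# Same shape as medium-rag: 1024 base vectors, 32 queries, 128 dimensions.
rng = np.random.default_rng(0)
base = rng.standard_normal((1024, 128)).astype(np.float32)
queries = rng.standard_normal((32, 128)).astype(np.float32)
# Identical vectors must give perfect recall.
print(recall_at_k(base, base, queries, k=10))
```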
Synthetic KV Trend Snapshot¶
The synthetic KV report shows the expected monotonic behavior. Higher
bit-widths reduce mse, improve cosine similarity, and reduce inner-product
error. In this run, b2.0_mse = 0.192804, b3.5_mse = 0.029482, and
b4.0_mse = 0.014735.
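The monotonic trend falls out of quantization itself: each extra bit halves the step size, which shrinks the round-trip error. A minimal symmetric uniform quantizer, as a stand-in for the real TurboAgents codec (which is more sophisticated and supports fractional bit-widths):

```python
import numpy as np

def quantize_dequantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Minimal uniform quantizer: map values onto 2**bits - 1 evenly spaced
    levels, then reconstruct. Only integer bit-widths for simplicity."""
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels
    codes = np.round((x - lo) / scale)
    return codes * scale + lo

rng = np.random.default_rng(7)
x = rng.standard_normal(4096).astype(np.float32)
mses = [float(np.mean((x - quantize_dequantize(x, b)) ** 2)) for b in (2, 3, 4)]
# Each extra bit should cut the reconstruction error.
print(mses)
```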
Live pgvector Validation¶
The current pgvector artifact is benchmark-results/20260326-pgvector-adapters-fixed.json.
This run used PostgreSQL 17.9 with the vector extension enabled, and the
adapter benchmark was fixed to use isolated tables and normalized database row
ids. On medium-rag, pgvector held recall@1 = 1.0, improved recall@10
monotonically as bit-width increased, and beat LanceDB on recall@10 at
3.0+ bits. It is still much slower than FAISS and LanceDB in the current
path, at about 0.87s for the 32-query batch.
| Adapter | Bits | Build Seconds | Query Seconds | Recall@1 | Recall@10 |
|---|---|---|---|---|---|
| pgvector | 2.0 | 1.7564 | 0.865659 | 1.0 | 0.656250 |
| pgvector | 2.5 | 2.1937 | 0.866529 | 1.0 | 0.718750 |
| pgvector | 3.0 | 2.8430 | 0.874390 | 1.0 | 0.796875 |
| pgvector | 3.5 | 4.1119 | 0.873350 | 1.0 | 0.837500 |
| pgvector | 4.0 | 4.8137 | 0.872895 | 1.0 | 0.896875 |
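The "normalized database row ids" fix matters because a database hands back its own primary-key ids while recall is computed against 0-based dataset indices. A hypothetical sketch of that normalization (the actual fix in the benchmark scripts may differ in detail):

```python
# Rows inserted with a SERIAL primary key come back with 1-based (and
# possibly non-contiguous) ids, so an explicit id map keeps database
# results aligned with 0-based dataset indices before recall is computed.
inserted_ids = [101, 102, 103, 104]      # ids as returned by the database
id_to_index = {db_id: i for i, db_id in enumerate(inserted_ids)}

db_top_k = [103, 101]                    # a query result, in database ids
normalized = [id_to_index[db_id] for db_id in db_top_k]
print(normalized)
```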
Minimal Long-Context Eval¶
The repository now includes a minimal Needle-style long-context harness for MLX:
```shell
uv run python scripts/benchmark_needle.py \
  --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --context-tokens 2048 4096 8192 \
  --output benchmark-results/needle-$(date +%Y%m%d-%H%M%S).json
```
The current scope is deliberately narrow: deterministic exact-match retrieval of a single secret string, configurable context lengths, configurable insertion positions, and a bit-width sweep across the MLX path. It is still not a full Needle-in-a-Haystack reproduction, not a multi-document or multi-hop task, and not yet integrated into a serving latency matrix.
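The core of a needle prompt is just deterministic filler with one secret sentence spliced in at a fractional position. A minimal sketch with a hypothetical `build_needle_prompt` helper and made-up secret string (the real `scripts/benchmark_needle.py` may build prompts differently):

```python
def build_needle_prompt(filler_sentences: int, fraction: float,
                        needle: str = "The secret code is AZURE-BISON-42.") -> str:
    """Insert one secret sentence at a fractional position inside
    deterministic synthetic filler text."""
    filler = [f"Filler sentence number {i}." for i in range(filler_sentences)]
    position = int(fraction * len(filler))
    filler.insert(position, needle)
    return " ".join(filler)

prompt = build_needle_prompt(filler_sentences=100, fraction=0.1)
# At fraction 0.1 the needle lands near the start of the prompt.
print(prompt.index("AZURE-BISON-42") / len(prompt))
```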
The latest artifact is benchmark-results/needle-20260326-20260326-234159.json.
On mlx-community/Llama-3.2-3B-Instruct-4bit, exact-match retrieval succeeded
at insertion fraction 0.1 for the 2048-, 8192-, and 16384-token contexts
(but not 4096), and failed consistently at fractions 0.5 and 0.9.
contains_needle remains true in some cases where the exact match fails, but
that is a weaker criterion. The current result should therefore be read as
early-position recall, not robust long-context retrieval.
| Bits | Context Tokens | Insertion Fraction | Elapsed Seconds | Exact Match | Contains Needle |
|---|---|---|---|---|---|
| 3.0 | 2048 | 0.1 | 1.1014 | true | true |
| 3.0 | 2048 | 0.5 | 1.1488 | false | true |
| 3.0 | 2048 | 0.9 | 1.1465 | false | false |
| 3.0 | 4096 | 0.1 | 2.5599 | false | true |
| 3.0 | 4096 | 0.5 | 2.6351 | false | true |
| 3.0 | 4096 | 0.9 | 2.7681 | false | false |
| 3.0 | 8192 | 0.1 | 6.1731 | true | true |
| 3.0 | 8192 | 0.5 | 6.7962 | false | false |
| 3.0 | 8192 | 0.9 | 6.3600 | false | false |
| 3.0 | 16384 | 0.1 | 15.6448 | true | true |
| 3.0 | 16384 | 0.5 | 16.1179 | false | false |
| 3.0 | 16384 | 0.9 | 15.6592 | false | false |
| 3.5 | 2048 | 0.1 | 1.2985 | true | true |
| 3.5 | 2048 | 0.5 | 1.4403 | false | true |
| 3.5 | 2048 | 0.9 | 1.4370 | false | false |
| 3.5 | 4096 | 0.1 | 2.9586 | false | true |
| 3.5 | 4096 | 0.5 | 2.9688 | false | true |
| 3.5 | 4096 | 0.9 | 2.9501 | false | false |
| 3.5 | 8192 | 0.1 | 6.3312 | true | true |
| 3.5 | 8192 | 0.5 | 6.5699 | false | false |
| 3.5 | 8192 | 0.9 | 6.5293 | false | false |
| 3.5 | 16384 | 0.1 | 18.4821 | true | true |
| 3.5 | 16384 | 0.5 | 18.9856 | false | false |
| 3.5 | 16384 | 0.9 | 19.2857 | false | false |
| 4.0 | 2048 | 0.1 | 1.5542 | true | true |
| 4.0 | 2048 | 0.5 | 1.6880 | false | true |
| 4.0 | 2048 | 0.9 | 1.6921 | false | false |
| 4.0 | 4096 | 0.1 | 3.5982 | false | true |
| 4.0 | 4096 | 0.5 | 3.6594 | false | true |
| 4.0 | 4096 | 0.9 | 3.8123 | false | false |
| 4.0 | 8192 | 0.1 | 7.8715 | true | true |
| 4.0 | 8192 | 0.5 | 8.3909 | false | false |
| 4.0 | 8192 | 0.9 | 8.4198 | false | false |
| 4.0 | 16384 | 0.1 | 20.2147 | true | true |
| 4.0 | 16384 | 0.5 | 21.5121 | false | false |
| 4.0 | 16384 | 0.9 | 20.6090 | false | false |