Benchmarks

TurboAgents ships lightweight synthetic benchmark surfaces that are practical to run on a normal development machine. Their job is not to imitate every real-world workload. Their job is to validate the package surface, compare bit-width trends, catch regressions, and make the tradeoffs visible before you move on to larger model or dataset runs.

Commands

turboagents bench kv
turboagents bench kv --format json
turboagents bench rag --format markdown
turboagents bench paper

Built-In Datasets

The built-in datasets are small by design, and all of them are generated by turboagents/bench/datasets.py:

  • tiny-kv: deterministic vector batches for KV-style reconstruction tests.
  • tiny-rag: the smallest TurboRAG sanity check.
  • medium-rag: the larger synthetic retrieval pass used for the adapter matrix.
  • paper-sim: the synthetic paper-style MSE comparison set.
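As a rough sketch of what a deterministic synthetic batch generator looks like (the function name, shapes, and seed below are illustrative, not the actual turboagents/bench/datasets.py API):

```python
import numpy as np

def make_tiny_kv(num_vectors: int = 256, dim: int = 64, seed: int = 0) -> np.ndarray:
    # Seeded generator, so every run reproduces the identical batch.
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_vectors, dim)).astype(np.float32)

batch = make_tiny_kv()
```

Determinism is the point: because the batch is a pure function of the seed, any metric drift between two runs can only come from the codec under test, never from the data.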

What The Metrics Mean

Each report uses a small set of metrics:

  • mean_payload_bytes: the average size of the serialized compressed payload.
  • mse and mean_cosine_similarity: how well a vector survives the quantize/dequantize round trip.
  • ip_mae: how far inner-product quality drifts against a held-out query.
  • recall_at_1 and recall_at_10: how closely the compressed retrieval path agrees with exact dot-product ranking.
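These quantities are standard to compute. The sketch below shows one plausible set of NumPy definitions; the function names are illustrative, and the harness's exact recall definition may differ (top-1-in-top-k versus the top-k set overlap used here):

```python
import numpy as np

def reconstruction_metrics(original, reconstructed, queries):
    # mse: average squared error over every vector component.
    mse = float(np.mean((original - reconstructed) ** 2))
    # mean_cosine_similarity: per-vector cosine, then averaged.
    num = np.sum(original * reconstructed, axis=1)
    den = np.linalg.norm(original, axis=1) * np.linalg.norm(reconstructed, axis=1)
    mean_cos = float(np.mean(num / den))
    # ip_mae: mean absolute drift of query/vector inner products.
    ip_mae = float(np.mean(np.abs(queries @ original.T - queries @ reconstructed.T)))
    return mse, mean_cos, ip_mae

def recall_at_k(exact_scores, approx_scores, k):
    # Average overlap between the exact and approximate top-k id sets,
    # one score row per query.
    exact_top = np.argsort(-exact_scores, axis=1)[:, :k]
    approx_top = np.argsort(-approx_scores, axis=1)[:, :k]
    overlaps = [len(set(e) & set(a)) / k for e, a in zip(exact_top, approx_top)]
    return float(np.mean(overlaps))
```

With identical inputs the metrics hit their ideal values (mse 0, cosine 1, recall 1.0), which is a useful self-check before trusting numbers from a real codec.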

What These Benchmarks Are Not

These benchmarks are not LongBench, BEIR, or MTEB. They are not final paper-faithful reproductions, and they are not full end-to-end serving numbers for large production models. Those larger evaluations should run on hardware with the right memory, runtimes, and datasets installed.

Full Benchmark Workflow

The repository also includes a reproducible harness for fuller runs:

uv sync --extra rag --extra mlx
uv run python scripts/run_benchmark_matrix.py --output-dir benchmark-results/$(date +%Y%m%d-%H%M%S)

Optional real MLX run:

uv run python scripts/run_benchmark_matrix.py \
  --mlx-model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --output-dir benchmark-results/$(date +%Y%m%d-%H%M%S)

Optional live pgvector validation:

uv run python scripts/run_benchmark_matrix.py \
  --pgvector-dsn postgresql://localhost/turboagents \
  --output-dir benchmark-results/$(date +%Y%m%d-%H%M%S)

The supporting entry points are scripts/run_benchmark_matrix.py, scripts/benchmark_mlx.py, scripts/benchmark_rag_adapters.py, scripts/benchmark_needle.py, scripts/summarize_benchmark_results.py, and benchmarks/README.md.

Latest Benchmark Run

The latest checked-in artifact set includes:

  • benchmark-results/20260326-128gb-run/summary.md
  • benchmark-results/20260326-128gb-run/mlx-benchmark.json
  • benchmark-results/20260326-128gb-run/rag-adapters.json
  • benchmark-results/20260327-chroma-local/rag-adapters.json

MLX Sweep

The current MLX sweep uses mlx-community/Llama-3.2-3B-Instruct-4bit. In that run, 2.5 bits was the fastest at about 152 tok/s, with 3.5 and 4.0 close behind at about 139 tok/s. The 2.0 and 2.5 settings produced visibly worse completions than 3.0 and above, which is why 3.5 currently looks like the best balance of quality and speed in this sweep.

| Bits | Elapsed Seconds | Completion Tokens | Tokens / Second |
|------|-----------------|-------------------|-----------------|
| 2.0  | 1.3136          | 129               | 98.2015         |
| 2.5  | 0.8479          | 129               | 152.1374        |
| 3.0  | 0.4926          | 64                | 129.9294        |
| 3.5  | 0.4606          | 64                | 138.9408        |
| 4.0  | 0.4630          | 64                | 138.2228        |
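Tokens per second in this table is simply completion tokens divided by elapsed seconds; a quick sanity check against the rows above:

```python
# (bits, elapsed_seconds, completion_tokens, reported_tokens_per_second)
rows = [
    (2.0, 1.3136, 129, 98.2015),
    (2.5, 0.8479, 129, 152.1374),
    (3.0, 0.4926, 64, 129.9294),
    (3.5, 0.4606, 64, 138.9408),
    (4.0, 0.4630, 64, 138.2228),
]
for bits, elapsed, tokens, reported in rows:
    # A small tolerance absorbs rounding in the recorded elapsed times.
    assert abs(tokens / elapsed - reported) < 0.05, bits
```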

RAG Adapter Matrix

The current adapter matrix uses medium-rag with 1024 base vectors, 32 queries, and 128 dimensions.

FAISS is the strongest current retrieval path in this benchmark. It held recall@1 = 1.0 and recall@10 = 1.0 across the full tested bit-width range, with query times around 0.0015s to 0.0023s.

| Adapter | Bits | Build Seconds | Query Seconds | Recall@1 | Recall@10 |
|---------|------|---------------|---------------|----------|-----------|
| FAISS   | 2.0  | 1.6098        | 0.002290      | 1.0      | 1.0       |
| FAISS   | 2.5  | 2.0668        | 0.001462      | 1.0      | 1.0       |
| FAISS   | 3.0  | 2.7302        | 0.001497      | 1.0      | 1.0       |
| FAISS   | 3.5  | 4.0106        | 0.001527      | 1.0      | 1.0       |
| FAISS   | 4.0  | 4.7658        | 0.001499      | 1.0      | 1.0       |

Chroma also held recall@1 = 1.0 and recall@10 = 1.0 across the tested sweep. It is slower than FAISS in this path, but still materially faster than the current LanceDB and pgvector runs.

| Adapter | Bits | Build Seconds | Query Seconds | Recall@1 | Recall@10 |
|---------|------|---------------|---------------|----------|-----------|
| Chroma  | 2.0  | 1.823308      | 0.024900      | 1.0      | 1.0       |
| Chroma  | 2.5  | 2.179364      | 0.017113      | 1.0      | 1.0       |
| Chroma  | 3.0  | 2.708539      | 0.015125      | 1.0      | 1.0       |
| Chroma  | 3.5  | 3.806835      | 0.016054      | 1.0      | 1.0       |
| Chroma  | 4.0  | 4.611422      | 0.015874      | 1.0      | 1.0       |

LanceDB keeps recall@1 = 1.0, but its recall@10 is materially weaker than FAISS's in this benchmark, ranging from about 0.70 to 0.75, with query times around 0.063s to 0.065s.

| Adapter | Bits | Build Seconds | Query Seconds | Recall@1 | Recall@10 |
|---------|------|---------------|---------------|----------|-----------|
| LanceDB | 2.0  | 8.0862        | 0.065101      | 1.0      | 0.750000  |
| LanceDB | 2.5  | 2.1825        | 0.063661      | 1.0      | 0.743750  |
| LanceDB | 3.0  | 2.8432        | 0.063563      | 1.0      | 0.700000  |
| LanceDB | 3.5  | 4.1354        | 0.063443      | 1.0      | 0.721875  |
| LanceDB | 4.0  | 4.9028        | 0.063786      | 1.0      | 0.721875  |

Synthetic KV Trend Snapshot

The synthetic KV report shows the expected monotonic behavior. Higher bit-widths reduce mse, improve cosine similarity, and reduce inner-product error. In this run, b2.0_mse = 0.192804, b3.5_mse = 0.029482, and b4.0_mse = 0.014735.
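The trend itself is easy to reproduce with a toy uniform quantizer. This is only a sketch with integer bit-widths; the actual TurboAgents codec, including its fractional bit-widths, is more involved:

```python
import numpy as np

def quantize_dequantize(x: np.ndarray, bits: int) -> np.ndarray:
    # Uniform quantization over the batch range with 2**bits levels.
    levels = 2 ** bits
    lo, hi = float(x.min()), float(x.max())
    step = (hi - lo) / (levels - 1)
    codes = np.round((x - lo) / step)
    return codes * step + lo

rng = np.random.default_rng(0)
x = rng.standard_normal((1024, 64)).astype(np.float32)
mses = {b: float(np.mean((quantize_dequantize(x, b) - x) ** 2)) for b in (2, 3, 4)}
```

Each extra bit doubles the number of levels, halves the step size, and therefore cuts the expected squared error by roughly 4x, which matches the monotonic drop the report shows.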

Live pgvector Validation

The current pgvector artifact is benchmark-results/20260326-pgvector-adapters-fixed.json. This run used PostgreSQL 17.9 with the vector extension enabled, and the adapter benchmark was fixed to use isolated tables and normalized database row ids. On medium-rag, pgvector held recall@1 = 1.0, improved recall@10 monotonically as bit-width increased, and beat LanceDB on recall@10 at 3.0+ bits. It is still much slower than FAISS and LanceDB in the current path, at about 0.87s for the 32-query batch.

| Adapter  | Bits | Build Seconds | Query Seconds | Recall@1 | Recall@10 |
|----------|------|---------------|---------------|----------|-----------|
| pgvector | 2.0  | 1.7564        | 0.865659      | 1.0      | 0.656250  |
| pgvector | 2.5  | 2.1937        | 0.866529      | 1.0      | 0.718750  |
| pgvector | 3.0  | 2.8430        | 0.874390      | 1.0      | 0.796875  |
| pgvector | 3.5  | 4.1119        | 0.873350      | 1.0      | 0.837500  |
| pgvector | 4.0  | 4.8137        | 0.872895      | 1.0      | 0.896875  |

Minimal Long-Context Eval

The repository now includes a minimal Needle-style long-context harness for MLX:

uv run python scripts/benchmark_needle.py \
  --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --context-tokens 2048 4096 8192 \
  --output benchmark-results/needle-$(date +%Y%m%d-%H%M%S).json

The current scope is deliberately narrow: deterministic exact-match retrieval of a single secret string, configurable context lengths, configurable insertion positions, and a bit-width sweep across the MLX path. It is still not a full Needle-in-a-Haystack reproduction, not a multi-document or multi-hop task, and not yet integrated into a serving latency matrix.
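The core prompt construction can be sketched as follows; the function name and filler text here are illustrative, not the actual scripts/benchmark_needle.py internals:

```python
def build_needle_prompt(filler_sentences: list[str],
                        needle: str,
                        insertion_fraction: float) -> str:
    # Place the needle at a proportional position in the filler document:
    # 0.1 is near the start, 0.5 is the middle, 0.9 is near the end.
    idx = int(len(filler_sentences) * insertion_fraction)
    parts = filler_sentences[:idx] + [needle] + filler_sentences[idx:]
    return " ".join(parts)

filler = [f"Padding sentence number {i}." for i in range(100)]
needle = "The secret code is AZURE-042."
prompt = build_needle_prompt(filler, needle, 0.5)
```

Scoring is then exact string match on the model's completion (and a weaker substring check), which is what makes the harness deterministic.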

The latest artifact is benchmark-results/needle-20260326-20260326-234159.json. On mlx-community/Llama-3.2-3B-Instruct-4bit, exact-match retrieval succeeded at insertion fraction 0.1 for the 2048-, 8192-, and 16384-token contexts (but not 4096), and failed consistently at fractions 0.5 and 0.9. contains_needle stays positive in some cases where exact match fails, but that is a weaker criterion. The current result should therefore be read as early-position recall, not robust long-context retrieval.

| Bits | Context Tokens | Insertion Fraction | Elapsed Seconds | Exact Match | Contains Needle |
|------|----------------|--------------------|-----------------|-------------|-----------------|
| 3.0  | 2048           | 0.1                | 1.1014          | true        | true            |
| 3.0  | 2048           | 0.5                | 1.1488          | false       | true            |
| 3.0  | 2048           | 0.9                | 1.1465          | false       | false           |
| 3.0  | 4096           | 0.1                | 2.5599          | false       | true            |
| 3.0  | 4096           | 0.5                | 2.6351          | false       | true            |
| 3.0  | 4096           | 0.9                | 2.7681          | false       | false           |
| 3.0  | 8192           | 0.1                | 6.1731          | true        | true            |
| 3.0  | 8192           | 0.5                | 6.7962          | false       | false           |
| 3.0  | 8192           | 0.9                | 6.3600          | false       | false           |
| 3.0  | 16384          | 0.1                | 15.6448         | true        | true            |
| 3.0  | 16384          | 0.5                | 16.1179         | false       | false           |
| 3.0  | 16384          | 0.9                | 15.6592         | false       | false           |
| 3.5  | 2048           | 0.1                | 1.2985          | true        | true            |
| 3.5  | 2048           | 0.5                | 1.4403          | false       | true            |
| 3.5  | 2048           | 0.9                | 1.4370          | false       | false           |
| 3.5  | 4096           | 0.1                | 2.9586          | false       | true            |
| 3.5  | 4096           | 0.5                | 2.9688          | false       | true            |
| 3.5  | 4096           | 0.9                | 2.9501          | false       | false           |
| 3.5  | 8192           | 0.1                | 6.3312          | true        | true            |
| 3.5  | 8192           | 0.5                | 6.5699          | false       | false           |
| 3.5  | 8192           | 0.9                | 6.5293          | false       | false           |
| 3.5  | 16384          | 0.1                | 18.4821         | true        | true            |
| 3.5  | 16384          | 0.5                | 18.9856         | false       | false           |
| 3.5  | 16384          | 0.9                | 19.2857         | false       | false           |
| 4.0  | 2048           | 0.1                | 1.5542          | true        | true            |
| 4.0  | 2048           | 0.5                | 1.6880          | false       | true            |
| 4.0  | 2048           | 0.9                | 1.6921          | false       | false           |
| 4.0  | 4096           | 0.1                | 3.5982          | false       | true            |
| 4.0  | 4096           | 0.5                | 3.6594          | false       | true            |
| 4.0  | 4096           | 0.9                | 3.8123          | false       | false           |
| 4.0  | 8192           | 0.1                | 7.8715          | true        | true            |
| 4.0  | 8192           | 0.5                | 8.3909          | false       | false           |
| 4.0  | 8192           | 0.9                | 8.4198          | false       | false           |
| 4.0  | 16384          | 0.1                | 20.2147         | true        | true            |
| 4.0  | 16384          | 0.5                | 21.5121         | false       | false           |
| 4.0  | 16384          | 0.9                | 20.6090         | false       | false           |