Benchmarks¶
Overview¶
metaharness currently includes three example targets.
Two are real coding-tool benchmarks:
- python_fixture_benchmark
- python_cli_benchmark
One is a smaller deterministic example:
ticket_router
python_fixture_benchmark¶
Path:
examples/python_fixture_benchmark
What it exercises:
- a real python -m venv bootstrap flow
- a real unittest suite over a fixture package
- deterministic instruction-file checks
- helper script correctness
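To make "deterministic instruction-file checks" concrete, here is a minimal sketch of what such a check could look like. It is an illustration only: the `REQUIRED_PHRASES` tuple and the `missing_phrases` helper are hypothetical names, and the checks the benchmark actually runs may differ.

```python
from pathlib import Path
import tempfile

# Hypothetical phrases; the benchmark's real required content is not documented here.
REQUIRED_PHRASES = ("scripts/bootstrap.sh", "scripts/test.sh")


def missing_phrases(instruction_file: Path, required=REQUIRED_PHRASES) -> list[str]:
    """Return the required phrases that do not appear in the file, in order."""
    # A missing file is treated as empty, so every phrase is reported missing.
    text = instruction_file.read_text(encoding="utf-8") if instruction_file.exists() else ""
    return [phrase for phrase in required if phrase not in text]


def demo() -> list[str]:
    """Write a throwaway AGENTS.md that mentions only one phrase and check it."""
    with tempfile.TemporaryDirectory() as tmp:
        agents = Path(tmp) / "AGENTS.md"
        agents.write_text("Run scripts/bootstrap.sh before anything else.\n", encoding="utf-8")
        return missing_phrases(agents)
```

Because the check is plain substring matching on file contents, the same input always yields the same result, which is the property that makes it usable as a benchmark signal.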
What can change:
- AGENTS.md
- GEMINI.md
- scripts/bootstrap.sh
- scripts/validate.sh
- scripts/test.sh
Configured write scope:
- AGENTS.md
- GEMINI.md
- scripts/
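One way to read a write scope like this is as an allowlist of relative paths, where a trailing slash marks a directory whose contents are writable. The sketch below illustrates that reading under stated assumptions; `WRITE_SCOPE` and `in_write_scope` are hypothetical names, and metaharness may enforce its scope differently.

```python
from pathlib import PurePosixPath

# Scope entries as listed above; a trailing slash is assumed to mark a directory.
WRITE_SCOPE = ("AGENTS.md", "GEMINI.md", "scripts/")


def in_write_scope(path: str, scope=WRITE_SCOPE) -> bool:
    """Return True if a relative POSIX path is covered by the scope."""
    candidate = PurePosixPath(path)
    # Reject absolute paths and anything that climbs out of the target directory.
    if candidate.is_absolute() or ".." in candidate.parts:
        return False
    for entry in scope:
        allowed = PurePosixPath(entry)
        if entry.endswith("/"):
            # Directory entry: any path below it is allowed.
            if allowed in candidate.parents:
                return True
        elif candidate == allowed:
            # File entry: only an exact match is allowed.
            return True
    return False
```

Under this reading, `scripts/test.sh` and `AGENTS.md` are writable, while a path such as `src/app.py` is not.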
Typical runs:
uv run metaharness run examples/python_fixture_benchmark --backend fake --budget 1
uv run metaharness run \
examples/python_fixture_benchmark \
--backend codex \
--hosted \
--budget 1
uv run metaharness run \
examples/python_fixture_benchmark \
--backend codex \
--oss \
--local-provider ollama \
--model gpt-oss:120b \
--proposal-timeout 420 \
--budget 1
python_cli_benchmark¶
Path:
examples/python_cli_benchmark
What it exercises:
- a real python -m venv bootstrap flow
- a real unittest suite
- a real CLI smoke command against fixture data
- deterministic instruction-file checks
What can change:
- AGENTS.md
- GEMINI.md
- scripts/bootstrap.sh
- scripts/validate.sh
- scripts/test.sh
Configured write scope:
- AGENTS.md
- GEMINI.md
- scripts/
Typical runs:
uv run metaharness run examples/python_cli_benchmark --backend fake --budget 1
uv run metaharness run \
examples/python_cli_benchmark \
--backend codex \
--hosted \
--budget 1
uv run metaharness run \
examples/python_cli_benchmark \
--backend codex \
--oss \
--local-provider ollama \
--model gpt-oss:20b \
--proposal-timeout 240 \
--budget 1
ticket_router¶
Path:
examples/ticket_router
This is a smaller deterministic example that optimizes a Python router against a fixed dataset. It is useful for fast development checks and basic API examples.
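The idea of scoring a router against a fixed dataset can be sketched in a few lines. Everything below is hypothetical (the `DATASET`, the `RULES`, and the `route` and `accuracy` functions); it shows the general shape of a deterministic objective and is not the contents of examples/ticket_router.

```python
# Hypothetical fixed dataset of (ticket text, expected queue) pairs.
DATASET = [
    ("I was charged twice this month", "billing"),
    ("The app crashes when I log in", "technical"),
    ("How do I change my email address?", "account"),
    ("Refund my last invoice", "billing"),
]

# Hypothetical keyword rules; the first matching rule wins.
RULES = [
    ("billing", ("charged", "invoice", "refund")),
    ("technical", ("crash", "error", "bug")),
]


def route(ticket: str, default: str = "account") -> str:
    """Return the queue of the first rule whose keyword appears in the ticket."""
    text = ticket.lower()
    for queue, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return queue
    return default


def accuracy(dataset=DATASET) -> float:
    """Fraction of tickets routed to their expected queue."""
    hits = sum(1 for ticket, expected in dataset if route(ticket) == expected)
    return hits / len(dataset)
```

Since both the dataset and the rules are fixed, `accuracy()` returns the same score on every run, so an optimizer that edits the router can be compared across candidates without noise.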
Run it:
Scaffold Profiles¶
The CLI scaffold also includes profiles for users who want to bring their own coding-tool project:
- standard
- local-oss-smoke
- local-oss-medium
These are useful for starting a real project, but the benchmark directories are the clearest examples of how to structure a reusable target.