Getting Started¶
This page walks through the fastest path from a clean checkout to a real metaharness run that you can inspect.
It is written for newcomers first.
Prerequisites¶
- Python 3.11 or newer
- uv
- optional: codex or gemini CLI for live provider runs
- optional: Ollama with gpt-oss:20b or gpt-oss:120b for local runs
Install¶
Recommended newcomer path
If you only want to use the released CLI, install from PyPI with uv tool install.
If you want to run the built-in examples from this repository, use a source checkout with uv sync.
Published package:
- PyPI distribution: superagentic-metaharness
- CLI command: metaharness
- import package: metaharness
Install the CLI from PyPI:
uv tool install superagentic-metaharness
Check the installed command:
If you want to add the library to another Python project:
Command formatting note
Long commands on this page are wrapped with \ so they stay readable on narrower screens.
You can copy them exactly as written.
If you are working from a source checkout of this repository, create the project environment with:
uv sync
If you want the docs toolchain too:
Check the CLI:
The Fastest First Run¶
Recommended first run
Use the fake backend on a real benchmark. This exercises the full loop without needing provider auth, network access, or a local model server.
uv run metaharness run \
examples/python_fixture_benchmark \
--backend fake \
--budget 1 \
--run-name first-run
Expected result:
- a run directory under examples/python_fixture_benchmark/runs/first-run
- best_candidate_id=c0001
- best_objective=1.000
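To confirm the first item of the expected result from the shell, a minimal check along these lines may help. It only tests that the run directory named above exists. The function name check_run_dir is illustrative and not part of metaharness, and nothing is assumed about the files inside the run directory.

```shell
# Hypothetical helper, not part of metaharness:
# report whether a run directory exists.
check_run_dir() {
  run_dir="$1"
  if [ -d "$run_dir" ]; then
    echo "found: $run_dir"
  else
    echo "missing: $run_dir"
    return 1
  fi
}

# Path taken from the expected result above; run this from the repository root.
check_run_dir examples/python_fixture_benchmark/runs/first-run || true
```

If the directory is reported as missing, check that the command was run from the repository root and that --run-name matched first-run.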
What To Inspect Next¶
Inspect A Single Run¶
Use this when you want a quick human-readable summary of the candidates and outcomes.
Export The Candidate Ledger¶
Use this when you want one row per candidate with outcomes, changed-file counts, and validation or evaluation summaries. When candidates write AHE-style change manifests, the ledger also includes manifest validity, component labels, and attribution verdict counts.
Summarize A Whole Benchmark¶
Use this when you want one row per run and a compact view of score, duration, and failure patterns.
Run A Saved Experiment Matrix¶
Once the single-run flow makes sense, move to repeated trials:
This writes:
- experiment.json
- trials.json
- aggregates.json
- trials.tsv
- aggregates.tsv
Use this path when you want reproducible benchmarking rather than ad hoc manual runs.
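After an experiment finishes, it can be useful to confirm that all five files were written. The sketch below assumes only the file names listed above. The helper name check_experiment_outputs is hypothetical, and the output directory is passed in by you because its location depends on your experiment spec.

```shell
# Hypothetical helper, not part of metaharness:
# report which of the expected experiment output files are present.
# Usage: check_experiment_outputs <directory the experiment wrote to>
check_experiment_outputs() {
  out_dir="$1"
  missing=0
  for name in experiment.json trials.json aggregates.json trials.tsv aggregates.tsv; do
    if [ -f "$out_dir/$name" ]; then
      echo "ok       $name"
    else
      echo "missing  $name"
      missing=1
    fi
  done
  # Exit status is 0 only when every file was found.
  return "$missing"
}
```

The non-zero exit status on a missing file makes the helper usable as a guard in a script that post-processes the TSV files.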
Use Hosted Codex¶
Requirements:
- codex CLI installed
- authenticated Codex session or API key setup
- outbound network access
Use --hosted if a project config defaults to local Ollama.
Hosted Codex is the strongest current path for real benchmark runs in this repository.
Use Trace Evidence¶
If you have a HALO-style trace diagnosis report, pass it to the run with
--trace-evidence. A common workflow is to generate trace_evidence.md with
rlm-code's trace_analysis environment, then use that report to guide
metaharness candidate proposals.
uv run metaharness run \
examples/python_fixture_benchmark \
--backend codex \
--hosted \
--trace-evidence ./trace_evidence.md \
--budget 1 \
--run-name trace-grounded-codex
The report is copied into each candidate workspace at
.metaharness/evidence/trace_evidence.md and embedded in the proposer prompt.
Use this when trace analysis has surfaced concrete harness failures such as
hallucinated tool calls, redundant arguments, refusal loops, or semantic
correctness issues.
Each proposer is also instructed to write .metaharness/change_manifest.json
before finishing. MetaHarness archives that file under
candidates/<id>/proposal/change_manifest.json. If your evaluator returns
EvaluationResult(metadata={"task_results": {...}}), MetaHarness writes
proposal/change_attribution.json by comparing predicted fixes and risk tasks
against the candidate's actual task-level deltas.
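To look at the archived files for one candidate, a small helper like the following may be enough. It assumes only the two paths named above, relative to a candidates/<id> directory that you pass in. The helper name show_proposal_files is hypothetical, the JSON schema is not assumed, and python3 -m json.tool is used purely for pretty-printing.

```shell
# Hypothetical helper, not part of metaharness:
# pretty-print a candidate's archived proposal files if they exist.
# Usage: show_proposal_files <path to a candidates/<id> directory>
show_proposal_files() {
  candidate_dir="$1"
  for name in change_manifest.json change_attribution.json; do
    file="$candidate_dir/proposal/$name"
    if [ -f "$file" ]; then
      echo "== $name"
      # json.tool validates the file and prints it with indentation.
      python3 -m json.tool "$file"
    else
      echo "== $name (not present)"
    fi
  done
}
```

A missing change_attribution.json is expected when the evaluator does not return task_results metadata, so the helper reports it rather than failing.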
Use Gemini CLI¶
Gemini is an experimental backend in the current release. Use it if Gemini CLI is already part of your local workflow and you are comfortable with a try-it-yourself path.
Requirements:
- gemini CLI installed
- Gemini authentication configured in your local environment
The integration is real, but it is not part of the main validated Codex-first release path.
Use Local Codex Over Ollama¶
Requirements:
- Ollama server reachable on 127.0.0.1:11434
- a local model such as gpt-oss:20b or gpt-oss:120b
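Before starting a local run, the first requirement can be checked with a short probe. This is a sketch that assumes curl is installed. The function name ollama_reachable is hypothetical and not part of metaharness, while /api/tags is the Ollama HTTP endpoint that lists local models.

```shell
# Hypothetical helper, not part of metaharness:
# succeed only if an Ollama server answers on the given base URL.
ollama_reachable() {
  base_url="${1:-http://127.0.0.1:11434}"
  # /api/tags lists local models; only the HTTP status matters here.
  curl --silent --fail --max-time 3 "$base_url/api/tags" >/dev/null 2>&1
}

if ollama_reachable; then
  echo "Ollama is reachable"
else
  echo "Ollama is not reachable on 127.0.0.1:11434"
fi
```

When the server is reachable, ollama list shows whether gpt-oss:20b or gpt-oss:120b has already been pulled.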
Create Your Own Project¶
If you want to optimize your own coding-agent harness, scaffold a project:
If you want to use a closed-source or internal harness, add a plugin backend in metaharness.json under backend_plugins and run it with --backend <name>.
See Extensions for the factory contract.
Available scaffold profiles:
- standard
- local-oss-smoke
- local-oss-medium
Examples:
uv run metaharness scaffold \
coding-tool \
./my-local-oss-smoke \
--profile local-oss-smoke

uv run metaharness scaffold \
coding-tool \
./my-local-oss-medium \
--profile local-oss-medium
If you are defining a brand-new domain and want an official-style planning workflow first, create a domain onboarding pack:
This writes ONBOARDING.md and domain_spec.md so search/test splits, metrics, and leakage risks are defined before implementation.
If you want a checked-in experiment workflow for your own project, add a small JSON spec and run:
What A Successful First Session Looks Like¶
By the end of a first session, you should be able to:
- run a benchmark with the fake backend
- inspect the winning candidate
- export a candidate ledger
- run a saved experiment matrix
- decide whether to use hosted Codex or a local Ollama model for the next step
Build The Docs¶
Serve locally:
Build the site: