CLI Reference¶
The metaharness CLI covers five workflows:
- scaffold a project
- scaffold a domain onboarding pack
- run or probe a backend
- inspect and export results
- execute repeated experiment matrices
Show help:
uv run metaharness --help
Long commands below are wrapped with \ so they stay readable and copy cleanly.
Many reporting commands support:
- plain text output by default
- --json for machine-readable output
- --tsv for spreadsheet-friendly export where supported
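For downstream scripts, the --tsv output can be read with Python's standard csv module. The following is a minimal sketch, assuming the export is tab-separated with a header row. The function name `parse_tsv` and the sample columns (`run`, `score`) are hypothetical placeholders, not the CLI's actual schema.

```python
import csv
import io


def parse_tsv(text: str) -> list[dict[str, str]]:
    """Parse tab-separated text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


# Hypothetical sample; real column names depend on the command that wrote the TSV.
sample = "run\tscore\nrun-a\t0.75\nrun-b\t0.50\n"
rows = parse_tsv(sample)

# Values arrive as strings, so convert before comparing numerically.
best = max(rows, key=lambda row: float(row["score"]))
print(best["run"])
```

In practice the text would come from a saved export file or from captured command output rather than an inline string.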
scaffold¶
Create a new coding-tool project:
Profiles:
- standard
- local-oss-smoke
- local-oss-medium
Fast Local Smoke¶
Smaller harness aimed at local OSS smoke runs.
Medium Local OSS¶
Restores bootstrap and test scripts while staying lighter than the full scaffold.
onboard¶
Create an official-style onboarding pack for a new domain:
This command writes:
- ONBOARDING.md with required questions and guardrails
- domain_spec.md with a concrete template for domain, harness, evaluation, and artifact design
Use this before implementing a new adapter so search/test splits, metrics, budget, and leakage risks are defined up front.
run¶
Run one optimization project:
Use this when you want a single benchmark or project run and care about the winning candidate, not aggregate trial statistics.
Important options:
- --backend
- --budget
- --run-name
- --hosted
- --oss
- --local-provider
- --model
- --proposal-timeout
- --search-mode
- --proposal-batch-size
- --selection-policy
- --trace-evidence
--backend accepts built-ins (fake, codex, gemini) and any plugin backend name defined in backend_plugins.
Use --trace-evidence path/to/trace_evidence.md to inject a HALO/RLM trace diagnosis report into each candidate proposal.
The file is copied to .metaharness/evidence/trace_evidence.md inside the candidate workspace and embedded in the backend prompt.
Proposers are instructed to write .metaharness/change_manifest.json. MetaHarness archives it with the proposal and, when evaluator metadata includes task_results, can attribute predicted fixes and regressions.
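The evidence-file staging described above can be pictured with a small sketch. This is an illustration only, not MetaHarness's actual implementation: the function name `stage_trace_evidence` and the temporary directory layout are assumptions, and only the target path .metaharness/evidence/trace_evidence.md comes from this page.

```python
import shutil
import tempfile
from pathlib import Path


def stage_trace_evidence(report: Path, workspace: Path) -> Path:
    """Copy a trace evidence report to the documented location in a workspace."""
    target = workspace / ".metaharness" / "evidence" / "trace_evidence.md"
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(report, target)
    return target


# Demonstrate with throwaway directories (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    report = root / "trace_evidence.md"
    report.write_text("# Trace diagnosis\n", encoding="utf-8")
    workspace = root / "candidate"
    workspace.mkdir()

    staged = stage_trace_evidence(report, workspace)
    staged_relative = staged.relative_to(workspace).as_posix()
    staged_text = staged.read_text(encoding="utf-8")

print(staged_relative)
```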
Fake Backend¶
Best for smoke checks and development.
Hosted Codex¶
Best current path for real benchmark quality.
Local Codex Over Ollama¶
Local-only path for OSS model runs.
Gemini CLI¶
Use Gemini as an experimental proposer backend.
Trace-Grounded Run¶
Use a HALO/RLM trace evidence report to ground harness edits in observed failures.
Plugin Backend¶
Use a custom adapter from backend_plugins, for example cursor.
experiment¶
Run a benchmark x backend x budget x trial matrix:
Use this when you want repeatable benchmark results instead of one-off runs.
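To see how quickly such a matrix grows, the sketch below enumerates one cell per project, backend, budget, and trial combination. It is a rough model rather than the tool's scheduler: the function name `expand_matrix`, the integer budget values, and the 1-based trial numbering are all assumptions.

```python
from itertools import product


def expand_matrix(project_dirs, backends, budgets, trial_count):
    """Return one (project_dir, backend, budget, trial) tuple per matrix cell."""
    trials = range(1, trial_count + 1)
    return list(product(project_dirs, backends, budgets, trials))


cells = expand_matrix(
    project_dirs=["./examples/python_fixture_benchmark"],
    backends=["fake", "codex"],
    budgets=[1, 2],  # hypothetical budget values
    trial_count=3,
)

# 1 project x 2 backends x 2 budgets x 3 trials = 12 cells
print(len(cells))
```

Each added backend or budget multiplies the number of runs, which is worth estimating before starting a matrix against a hosted backend.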
Saved Config¶
The most reusable path for teams.
Multiple Budgets¶
Compare how much improvement you get from a larger search budget.
TSV Export¶
Send aggregate results straight to a spreadsheet or notebook.
This command writes:
- experiment.json
- trials.json
- aggregates.json
- trials.tsv
- aggregates.tsv
Config files can contain:
- project_dirs
- backends
- budgets
- trial_count
- models
- results_dir
- backend_overrides
If a config file is provided, relative paths are resolved from the config file location. CLI flags override the corresponding config values.
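The two rules above (paths relative to the config file, CLI flags win) can be sketched as follows. This is an assumption-laden illustration, not the CLI's code: it treats project_dirs and results_dir as the path-valued keys, models unset CLI flags as `None`, and uses a hypothetical config location. `PurePosixPath` keeps the output identical across platforms.

```python
from pathlib import PurePosixPath

# Assumption: these are the path-valued keys; all others are plain values.
PATH_LIST_KEYS = ("project_dirs",)
PATH_KEYS = ("results_dir",)


def resolve_config(config: dict, config_path: PurePosixPath, cli_overrides: dict) -> dict:
    """Resolve relative paths against the config file's directory, then apply CLI overrides."""
    base = config_path.parent
    resolved = dict(config)
    # Joining onto base leaves absolute paths unchanged.
    for key in PATH_LIST_KEYS:
        if key in resolved:
            resolved[key] = [str(base / p) for p in resolved[key]]
    for key in PATH_KEYS:
        if key in resolved:
            resolved[key] = str(base / resolved[key])
    # Only flags that were actually given (not None) override the config.
    for key, value in cli_overrides.items():
        if value is not None:
            resolved[key] = value
    return resolved


config = {
    "project_dirs": ["benchmarks/a"],
    "results_dir": "results",
    "budgets": [1],
    "trial_count": 3,
}
config_file = PurePosixPath("/work/configs/my_experiment.json")  # hypothetical location
resolved = resolve_config(config, config_file, {"trial_count": 5, "budgets": None})
print(resolved["project_dirs"], resolved["trial_count"])
```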
smoke codex¶
Probe the Codex path before spending model calls:
Probe the local Ollama path:
uv run metaharness smoke codex \
./my-coding-tool-optimizer \
--probe-only \
--oss \
--local-provider ollama \
--model gpt-oss:20b
Use this when you want to verify the environment, provider, and model path before running a benchmark.
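When the probe is part of a larger Python script, the same command can be assembled as an argument list. The sketch below only builds and prints the command shown above and does not run it. The helper name `build_probe_command` is hypothetical, while the flags are the ones from the Ollama probe example on this page.

```python
import shlex


def build_probe_command(project_dir: str, model: str, provider: str = "ollama") -> list[str]:
    """Build the argv for a probe-only local Codex smoke check."""
    return [
        "uv", "run", "metaharness", "smoke", "codex",
        project_dir,
        "--probe-only",
        "--oss",
        "--local-provider", provider,
        "--model", model,
    ]


argv = build_probe_command("./my-coding-tool-optimizer", "gpt-oss:20b")
command_line = shlex.join(argv)
print(command_line)
```

Passing `argv` to `subprocess.run(argv, check=True)` would execute the probe and raise if it exits with a non-zero status.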
smoke gemini¶
Probe the Gemini CLI path before spending model calls:
Run one Gemini-backed smoke iteration:
inspect¶
Inspect one completed run:
This is the quickest human-readable view of:
- candidate outcomes
- validity
- proposal application
- scope violations
- objective scores
ledger¶
Export the per-candidate ledger for one run:
TSV export:
Use this when you want one row per candidate with outcomes, changed-file counts, summaries, scope violations, change-manifest status, changed component labels, and attribution verdict counts.
summarize¶
Summarize all runs in a project:
TSV export:
Use this when you want a project-wide view of scores, durations, and outcome counts.
compare¶
Compare specific run directories:
uv run metaharness compare \
./examples/python_fixture_benchmark/runs/hosted-codex-20260401 \
./examples/python_fixture_benchmark/runs/ollama-20b-20260401 \
./examples/python_fixture_benchmark/runs/ollama-120b-20260401
TSV export:
uv run metaharness compare \
./examples/python_fixture_benchmark/runs/hosted-codex-20260401 \
./examples/python_fixture_benchmark/runs/ollama-120b-20260401 \
--tsv
Use this when you want an explicit side-by-side comparison between selected runs rather than every run in a project.
Output Files To Know¶
The most useful stored artifacts are usually:
- run_config.json
- indexes/leaderboard.json
- manifest.json
- proposal/result.json
- proposal/workspace.diff
- validation/result.json
- evaluation/result.json
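A quick way to check which of these artifacts a run actually produced is to search the run directory for them. The sketch below is a hedged example: the artifact names come from the list above, but where each file sits inside a run directory is not specified here, so the code searches recursively. The `candidate-0` directory in the demo is a hypothetical layout.

```python
import tempfile
from pathlib import Path

ARTIFACTS = (
    "run_config.json",
    "indexes/leaderboard.json",
    "manifest.json",
    "proposal/result.json",
    "proposal/workspace.diff",
    "validation/result.json",
    "evaluation/result.json",
)


def find_artifacts(run_dir: Path) -> dict[str, list[str]]:
    """Map each known artifact name to the matching relative paths under run_dir."""
    found = {}
    for name in ARTIFACTS:
        # Search by file name, then keep only paths ending in the full artifact name.
        found[name] = sorted(
            path.relative_to(run_dir).as_posix()
            for path in run_dir.rglob(Path(name).name)
            if path.as_posix().endswith("/" + name)
        )
    return found


# Demonstrate against a throwaway directory (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "run"
    (run_dir / "candidate-0" / "proposal").mkdir(parents=True)
    (run_dir / "run_config.json").write_text("{}", encoding="utf-8")
    (run_dir / "candidate-0" / "proposal" / "result.json").write_text("{}", encoding="utf-8")
    found = find_artifacts(run_dir)

print(found["proposal/result.json"])
```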