Getting Started

This page walks through the fastest path from a clean checkout to a real metaharness run that you can inspect. It is written with newcomers in mind.

Prerequisites

  • Python 3.11 or newer
  • uv
  • optional: codex, gemini, pi, or opencode CLI for live provider runs
  • optional: Ollama with gpt-oss:20b or gpt-oss:120b for local runs

Install

Recommended newcomer path

If you only want to use the released CLI, install from PyPI with uv tool install. If you want to run the built-in examples from this repository, use a source checkout with uv sync.

Published package:

  • PyPI distribution: superagentic-metaharness
  • CLI command: metaharness
  • import package: metaharness

Install the CLI from PyPI:

uv tool install superagentic-metaharness

Check the installed command:

metaharness --help

If you want to add the library to another Python project:

uv add superagentic-metaharness

Command formatting note

Long commands on this page are wrapped with \ so they stay readable on narrower screens. You can copy them exactly as written.

If you are working from a source checkout of this repository, create the project environment with:

uv sync

If you want the docs toolchain too:

uv sync --group dev

Check the CLI:

uv run metaharness --help

The Fastest First Run

Recommended first run

Use the fake backend on a real benchmark. This exercises the full loop without needing provider auth, network access, or a local model server.

uv run metaharness run \
  examples/python_fixture_benchmark \
  --backend fake \
  --budget 1 \
  --run-name first-run

Expected result:

  • a run directory under examples/python_fixture_benchmark/runs/first-run
  • best_candidate_id=c0001
  • best_objective=1.000
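A run directory is plain files on disk, so you can browse it with any tool you like. As a minimal sketch (generic Python, not part of the metaharness API), this helper lists whatever artifacts a run produced:

```python
from pathlib import Path

def list_run_artifacts(run_dir: str) -> list[str]:
    """Return the relative paths of all files under a run directory, sorted.

    Yields nothing (rather than failing) if the directory does not exist yet.
    """
    root = Path(run_dir)
    return sorted(
        str(p.relative_to(root)) for p in root.rglob("*") if p.is_file()
    )

# Point this at your own run directory after a first run:
for rel in list_run_artifacts("examples/python_fixture_benchmark/runs/first-run"):
    print(rel)
```

This is just a directory walk; the actual artifact names and layout are whatever metaharness wrote for your run.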

What To Inspect Next

Inspect A Single Run

Use this when you want a quick human-readable summary of the candidates and outcomes.

uv run metaharness inspect \
  examples/python_fixture_benchmark/runs/first-run

Export The Candidate Ledger

Use this when you want one row per candidate with outcomes, changed-file counts, and validation or evaluation summaries.

uv run metaharness ledger \
  examples/python_fixture_benchmark/runs/first-run \
  --tsv
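Because the ledger is tab-separated, it drops straight into spreadsheet tools or a few lines of standard-library Python. Here is a sketch using the csv module on an inline fixture; the column names (candidate_id, objective) are illustrative, not metaharness's actual schema:

```python
import csv
import io

# Illustrative TSV with made-up columns; a real ledger's header will differ.
LEDGER_TSV = "candidate_id\tobjective\nc0001\t1.000\nc0002\t0.250\n"

def best_row(tsv_text: str, score_column: str) -> dict:
    """Return the row with the highest value in score_column."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return max(rows, key=lambda r: float(r[score_column]))

print(best_row(LEDGER_TSV, "objective"))
```

Swap the fixture for `open(".../ledger.tsv").read()` style input against whatever file path your export produced.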

Summarize A Whole Benchmark

Use this when you want one row per run and a compact view of score, duration, and failure patterns.

uv run metaharness summarize examples/python_fixture_benchmark

Run A Saved Experiment Matrix

Once the single-run flow makes sense, move to repeated trials:

uv run metaharness experiment \
  --config examples/experiment_configs/fake-benchmarks.json

This writes:

  • experiment.json
  • trials.json
  • aggregates.json
  • trials.tsv
  • aggregates.tsv

Use this path when you want reproducible benchmarking rather than ad hoc manual runs.

Use Hosted Codex

Requirements:

  • codex CLI installed
  • authenticated Codex session or API key setup
  • outbound network access

Probe The CLI

uv run metaharness smoke codex examples/python_fixture_benchmark --probe-only

Run Hosted Codex

uv run metaharness run \
  examples/python_fixture_benchmark \
  --backend codex \
  --hosted \
  --budget 1 \
  --run-name hosted-codex

Use --hosted if a project config defaults to local Ollama. Hosted Codex is the strongest current path for real benchmark runs in this repository.

Use Gemini CLI

Gemini is an experimental backend in the current release. Use it if Gemini CLI is already part of your local workflow and you are comfortable with a try-it-yourself path.

Requirements:

  • gemini CLI installed
  • Gemini authentication configured in your local environment

Probe The CLI

uv run metaharness smoke gemini examples/python_fixture_benchmark --probe-only

Run Gemini

uv run metaharness run \
  examples/python_fixture_benchmark \
  --backend gemini \
  --model gemini-2.5-pro \
  --proposal-timeout 180 \
  --budget 1 \
  --run-name gemini-run

The integration is real, but it is not part of the main validated Codex-first release path.

Use Pi

Pi is an experimental backend in the current release. Use it if Pi is already part of your local workflow and you are comfortable with a try-it-yourself path.

Requirements:

  • pi CLI installed
  • Pi authentication configured for the model you want to use

Probe The CLI

uv run metaharness smoke pi examples/python_fixture_benchmark --probe-only

Run Pi

uv run metaharness run \
  examples/python_fixture_benchmark \
  --backend pi \
  --model anthropic/claude-sonnet-4-5 \
  --proposal-timeout 180 \
  --budget 1 \
  --run-name pi-run

Pi runs through its JSON print mode and defaults to ephemeral --no-session behavior inside metaharness. This keeps optimization runs isolated from Pi's normal interactive session workflow. It is not part of the main validated Codex-first release path.

Use Local Codex Over Ollama

Requirements:

  • Ollama server reachable on 127.0.0.1:11434
  • a local model such as gpt-oss:20b or gpt-oss:120b
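Before the metaharness probe, you can confirm that the Ollama server itself is up and the model is pulled. This sketch queries Ollama's documented /api/tags endpoint; the helper name is ours, not part of metaharness:

```python
import json
import urllib.request

def model_available(tags_response: dict, model: str) -> bool:
    """Check a parsed Ollama /api/tags response for a model by name."""
    return any(m.get("name") == model for m in tags_response.get("models", []))

if __name__ == "__main__":
    try:
        # Ollama's REST API lists pulled models at GET /api/tags.
        with urllib.request.urlopen("http://127.0.0.1:11434/api/tags", timeout=5) as resp:
            tags = json.load(resp)
    except OSError as exc:
        print(f"Ollama not reachable: {exc}")
    else:
        for name in ("gpt-oss:20b", "gpt-oss:120b"):
            print(name, "pulled" if model_available(tags, name) else "missing")
```

If a model shows as missing, pull it with `ollama pull gpt-oss:20b` before running metaharness against it.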

Probe The Local Path

uv run metaharness smoke codex \
  examples/python_fixture_benchmark \
  --probe-only \
  --oss \
  --local-provider ollama \
  --model gpt-oss:20b

Run gpt-oss:20b

uv run metaharness run \
  examples/python_fixture_benchmark \
  --backend codex \
  --oss \
  --local-provider ollama \
  --model gpt-oss:20b \
  --proposal-timeout 240 \
  --budget 1 \
  --run-name ollama-20b

Run gpt-oss:120b

uv run metaharness run \
  examples/python_fixture_benchmark \
  --backend codex \
  --oss \
  --local-provider ollama \
  --model gpt-oss:120b \
  --proposal-timeout 420 \
  --budget 1 \
  --run-name ollama-120b

Create Your Own Project

If you want to optimize your own coding-agent harness, scaffold a project:

uv run metaharness scaffold coding-tool ./my-coding-tool-optimizer

Available scaffold profiles:

  • standard
  • local-oss-smoke
  • local-oss-medium

Examples:

uv run metaharness scaffold \
  coding-tool \
  ./my-local-oss-smoke \
  --profile local-oss-smoke

uv run metaharness scaffold \
  coding-tool \
  ./my-local-oss-medium \
  --profile local-oss-medium

If you want a checked-in experiment workflow for your own project, add a small JSON spec and run:

uv run metaharness experiment \
  --config ./my-experiment.json
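The exact spec schema is defined by the repository; use examples/experiment_configs/fake-benchmarks.json as the working reference. The fragment below only illustrates the general shape such a file might take, and every key in it is hypothetical:

```json
{
  "name": "my-first-experiment",
  "benchmarks": ["./my-coding-tool-optimizer"],
  "backends": ["fake"],
  "trials": 3,
  "budget": 1
}
```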

What A Successful First Session Looks Like

By the end of a first session, you should be able to:

  • run a benchmark with the fake backend
  • inspect the winning candidate
  • export a candidate ledger
  • run a saved experiment matrix
  • decide whether to use hosted Codex or a local Ollama model for the next step

Build The Docs

Serve locally:

uv run mkdocs serve

Build the site:

uv run mkdocs build --strict