
Filesystem-first harness optimization

metaharness

metaharness is an open source Python library for optimizing executable harnesses around agentic coding systems. It is inspired by the Meta Harness paper and is an unofficial open source implementation of the core ideas in that work. It treats the harness itself as the optimization target, not just the prompt: instruction files, bootstrap scripts, validation scripts, test flows, routing logic, and other executable support code. The benchmark evidence in this repository currently centers on the Codex CLI path, covering both hosted Codex and Codex over local Ollama models, and Codex is the primary, validated backend today. Gemini CLI, Pi, and OpenCode are experimental integrations.

  • 2 real benchmarks: real coding-tool targets built around fixture repos, shell scripts, and deterministic acceptance checks.
  • Codex-first today: hosted Codex and local Codex over Ollama have both been exercised in real runs.
  • Filesystem evidence: every run stores prompts, manifests, diffs, evaluation output, and candidate history on disk.
  • Experiment runner included: batch runs, candidate ledgers, JSON output, and TSV exports are already part of the CLI.
  • Published on PyPI: install the released CLI with uv tool install superagentic-metaharness and run metaharness.

New Here

If you are seeing `metaharness` for the first time, start with the fake backend on a built-in benchmark. That gives you the full loop with no model auth, no provider setup, and no local model server.

1. Install The CLI

Install the published package from PyPI with uv tool install superagentic-metaharness and confirm the metaharness command is available.

2. Run One Benchmark

Use the fake backend first so you can learn the workflow before spending model calls.

3. Inspect The Evidence

Look at the run summary, candidate ledger, and winning workspace before trying hosted Codex or Ollama.

Command formatting note

Long commands in the docs are wrapped with \ so they stay readable on smaller screens. You can copy them exactly as shown.

Why This Exists

Most agent workflows do not fail because the base model is incapable. They fail because the surrounding harness is weak, incomplete, or inconsistent. The problem is often outside the core model call.

Weak repository instructions

Agents start with incomplete context, make risky assumptions, or waste time rediscovering basics.

Broken setup and validation

Bootstrap scripts, validation steps, and test flows drift away from the workflow they are supposed to guard.

No durable experiment record

Teams try improvements, but they cannot easily compare what changed, what improved, and what failed.

No write-scope discipline

An optimizer may edit the wrong files and still produce noisy or misleading results.

metaharness addresses these problems by making the harness executable, inspectable, and benchmarkable. It captures a compact environment bootstrap before each proposal, stores every candidate on disk, and can enforce an explicit write scope through allowed_write_paths.
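As a sketch of what write-scope enforcement can look like, the following Python checks a list of changed paths against allowed files and directories. The function name and signature are illustrative assumptions, not metaharness's actual enforcement code:

```python
from pathlib import PurePosixPath

def scope_violations(changed_paths, allowed_write_paths):
    """Return changed paths that fall outside every allowed file or directory.

    Paths are repo-relative POSIX paths. Illustrative only: this is not
    metaharness's real API for allowed_write_paths.
    """
    def in_scope(path):
        p = PurePosixPath(path)
        return any(
            p == PurePosixPath(allowed)            # exact file match
            or PurePosixPath(allowed) in p.parents  # inside an allowed directory
            for allowed in allowed_write_paths
        )
    return [path for path in changed_paths if not in_scope(path)]
```

A run that reports any violation can then be rejected as a scope violation instead of producing a noisy or misleading score.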

Lineage And Inspiration

metaharness is inspired first by the Meta Harness paper, which motivated the overall idea of optimizing executable harness code instead of treating the prompt as the only optimization surface.

Two other projects were also useful reference points while shaping this library:

  • GEPA, especially as a reference for packaging and reusable optimization tooling
  • Autoresearch by Andrej Karpathy, especially for explicit experiment loops, keep or discard thinking, and constrained mutable scope

What It Optimizes

The optimized object is the harness, not only the prompt.

Typical targets include AGENTS.md, GEMINI.md, bootstrap scripts, validation scripts, test scripts, routing code, benchmark glue, and other files that shape how an agent actually works in a repository.

How The Loop Works

1. Materialize a baseline

Start from a baseline workspace that already represents a real harness.

2. Capture context

Collect a compact environment snapshot and parent-candidate feedback before the proposer edits anything.

3. Propose, validate, evaluate

Ask a coding agent to improve the workspace, then validate and score the result with deterministic checks.

4. Keep evidence

Store diffs, manifests, ledgers, outcomes, and summaries on disk so the run can be audited and compared later.
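The four steps above can be sketched as a greedy keep-best loop. This is an illustrative outline, not the library's real engine; propose and evaluate stand in for the coding agent and the deterministic checks:

```python
def optimize(baseline, propose, evaluate, budget):
    """Greedy keep-best outer loop (illustrative; not metaharness's engine).

    propose(workspace) returns a candidate workspace; evaluate(workspace)
    returns a deterministic score. Higher scores win.
    """
    best, best_score = baseline, evaluate(baseline)  # step 1: materialize a baseline
    history = [("baseline", best_score)]
    for i in range(budget):
        candidate = propose(best)                    # steps 2-3: context + proposal
        score = evaluate(candidate)                  # step 3: validate and score
        history.append((f"cand-{i}", score))         # step 4: keep evidence
        if score > best_score:                       # keep or discard
            best, best_score = candidate, score
    return best, best_score, history
```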

Core Capabilities

Optimization engine

A small outer loop that keeps the best candidate according to a deterministic objective.

Filesystem-backed run store

Run configs, candidate workspaces, manifests, diffs, and stage results are stored in a stable on-disk layout.
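To make the idea of a stable on-disk layout concrete, here is a hypothetical helper that persists one candidate's diff and manifest under a run directory. The directory and file names are assumptions for illustration, not the actual schema:

```python
import json
from pathlib import Path

def record_candidate(run_dir, candidate_id, diff_text, manifest):
    """Write one candidate's diff and manifest into a run directory.

    The candidates/<id>/workspace.diff and manifest.json names are
    hypothetical, not metaharness's real layout.
    """
    cand = Path(run_dir) / "candidates" / candidate_id
    cand.mkdir(parents=True, exist_ok=True)
    (cand / "workspace.diff").write_text(diff_text)
    (cand / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return cand
```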

Environment bootstrap snapshots

Each proposal gets a compact view of the workspace, tools, package files, git state, and system facts before it starts editing.
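A minimal version of such a snapshot might be assembled like this. The helper and its field names are assumptions, not metaharness's real bootstrap format:

```python
import platform, shutil, subprocess
from pathlib import Path

def bootstrap_snapshot(workspace):
    """Collect a compact environment view before the proposer edits anything.

    Illustrative only: the real snapshot format may differ.
    """
    ws = Path(workspace)

    def git(*args):
        # Shell out to git; return "" if git is missing or the call fails.
        try:
            out = subprocess.run(
                ["git", *args], cwd=ws, capture_output=True, text=True, timeout=10
            )
            return out.stdout.strip()
        except (OSError, subprocess.SubprocessError):
            return ""

    package_files = {"pyproject.toml", "requirements.txt", "package.json"}
    return {
        "system": {"os": platform.system(), "python": platform.python_version()},
        "tools": {t: shutil.which(t) is not None for t in ("git", "uv", "bash")},
        "package_files": sorted(p.name for p in ws.iterdir() if p.name in package_files),
        "git": {
            "branch": git("rev-parse", "--abbrev-ref", "HEAD"),
            "dirty": bool(git("status", "--porcelain")),
        },
        "top_level": sorted(p.name for p in ws.iterdir())[:50],
    }
```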

Write-scope enforcement

Projects can declare the files or directories that are allowed to change and reject scope violations automatically.

Explicit candidate outcomes

Runs classify candidates as keep, discard, crash, timeout, no-change, or scope-violation.
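The outcome labels above can be modeled as a small enum. The class name and the failure grouping are illustrative, not the library's actual types:

```python
from enum import Enum

class CandidateOutcome(str, Enum):
    """The outcome labels described above; the class itself is illustrative."""
    KEEP = "keep"
    DISCARD = "discard"
    CRASH = "crash"
    TIMEOUT = "timeout"
    NO_CHANGE = "no-change"
    SCOPE_VIOLATION = "scope-violation"

# Outcomes where the candidate never produced a scoreable workspace.
TERMINAL_FAILURES = {
    CandidateOutcome.CRASH,
    CandidateOutcome.TIMEOUT,
    CandidateOutcome.SCOPE_VIOLATION,
}
```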

Experiment runner

Run repeated trial matrices across benchmarks, providers, budgets, and models with JSON and TSV outputs.
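A trial matrix like this typically flattens into tabular output. The sketch below shows one way JSON trial records could become TSV; the column names are assumptions, not the CLI's documented schema:

```python
def trials_to_tsv(trials):
    """Flatten trial records (dicts) into a TSV string.

    Column names are illustrative, not the experiment runner's real schema.
    """
    cols = ["benchmark", "provider", "model", "budget", "score"]
    lines = ["\t".join(cols)]
    for trial in trials:
        # Missing fields become empty cells so every row has the same shape.
        lines.append("\t".join(str(trial.get(c, "")) for c in cols))
    return "\n".join(lines)
```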

Supported Release Shape

The current package is strongest in a Codex-first setup. Hosted Codex is the most reliable current path for real benchmark runs in this repository. Local Codex over Ollama has also been exercised with `gpt-oss:20b` and `gpt-oss:120b`.

All real provider runs currently documented in this repository were produced through Codex. Writeups for other coding agents may emphasize Claude Code or Opus, but those provider paths are not documented here.

Gemini, Pi, and OpenCode are available as experimental backends, but the documented benchmark evidence is still centered on Codex.

Built-In Targets

  • examples/python_fixture_benchmark
  • examples/python_cli_benchmark
  • examples/ticket_router

The two Python benchmarks are the main release-quality examples. They use real shell scripts, real fixture repositories, and deterministic acceptance checks rather than placeholder text-only scoring.

Start Here

First Useful Run

Run the fake backend on a real benchmark to see the full loop without provider dependencies.

uv run metaharness run \
  examples/python_fixture_benchmark \
  --backend fake \
  --budget 1 \
  --run-name first-run

Inspect What Happened

Look at the winning candidate, run summary, and candidate ledger.

uv run metaharness inspect \
  examples/python_fixture_benchmark/runs/first-run

uv run metaharness ledger \
  examples/python_fixture_benchmark/runs/first-run \
  --tsv

Run An Experiment Matrix

Use a saved config to run repeated trials and write JSON plus TSV outputs.

uv run metaharness experiment \
  --config examples/experiment_configs/fake-benchmarks.json

Continue Reading