Experiments¶

This page is the full experiment registry for metaharness development so far.

It covers four kinds of work:

real provider benchmark runs that shaped the current product position
provider smoke runs that prove integrations launch and produce artifacts
fake-backend validation runs that proved the framework and benchmark targets before spending model calls
early development experiments that were run in temporary workspaces and are summarized here from development notes

Important artifact policy:

run directories under examples/*/runs/ are local artifacts and are gitignored
this page and the top-level BENCHMARK_RESULTS.md file are the checked-in summaries of the important named runs
anonymous temporary test runs created during unit tests are not listed here one by one

For repeated benchmark runs, use the `metaharness experiment` command so that the results are saved as both JSON and TSV.
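
The file layout written by `metaharness experiment` is not described on this page. As a rough illustration of keeping the same trial records in both formats, the following standard-library sketch serializes a list of records to JSON and to TSV. The field names (`run_id`, `baseline`, `best`, `duration_s`) and the values are invented for the example and are not the tool's actual schema.

```python
import csv
import io
import json

# Hypothetical trial records: field names and values are made up for this
# sketch and do not reflect the real metaharness output schema.
TRIALS = [
    {"run_id": "trial-1", "baseline": 0.25, "best": 0.75, "duration_s": 12.5},
    {"run_id": "trial-2", "baseline": 0.25, "best": 0.25, "duration_s": 30.0},
]


def to_json(trials):
    """Serialize the records as a JSON array (convenient for tooling)."""
    return json.dumps(trials, indent=2)


def to_tsv(trials):
    """Serialize the same records as tab-separated text (convenient for spreadsheets)."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(trials[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(trials)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_json(TRIALS))
    print(to_tsv(TRIALS))
```

Keeping both views of the same records means the JSON can feed scripts while the TSV can be read or diffed directly.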

The experiment workflow is influenced by Autoresearch by Andrej Karpathy, especially in its emphasis on explicit experiment records, repeatable runs, and outcome-driven iteration.

Experiment Registry¶

Category	Target	Backend / Model	Baseline -> Best	Status	Notes
Early development	`local-oss-smoke` scaffold	Codex over Ollama `gpt-oss:20b`	`0.200 -> 1.000`	historical	first successful local OSS scaffold run, temporary workspace only
Early development	`local-oss-medium` scaffold	Codex over Ollama `gpt-oss:20b`	`0.0625 -> 1.000`	historical	richer scaffold with bootstrap and test scripts, temporary workspace only
Framework validation	`ticket_router` `smoke`	fake backend	`0.750 -> 1.000`	local artifact	early end to end deterministic example run
Framework validation	`ticket_router` `fake-run`	fake backend	`0.750 -> 1.000`	local artifact	later regression run after outcome ledger changes
Real benchmark validation	`python_fixture_benchmark` `smoke-real-target`	fake backend	`0.050 -> 1.000`	local artifact	first real benchmark target proved end to end without live model calls
Real benchmark	`python_fixture_benchmark` `hosted-codex-20260401`	hosted Codex	`0.050 -> 1.000`	documented	solved in one proposal iteration
Real benchmark	`python_fixture_benchmark` `ollama-20b-20260401`	Codex over Ollama `gpt-oss:20b`	`0.050 -> 0.050`	documented	proposal timed out at `240s`
Real benchmark	`python_fixture_benchmark` `ollama-120b-20260401`	Codex over Ollama `gpt-oss:120b`	`0.050 -> 1.000`	documented	solved in one proposal iteration, slower than hosted Codex
Real benchmark	`python_cli_benchmark` `hosted-codex-20260401`	hosted Codex	`0.045 -> 1.000`	documented	solved in one proposal iteration
Real benchmark	`python_cli_benchmark` `ollama-20b-20260401`	Codex over Ollama `gpt-oss:20b`	`0.045 -> 0.045`	documented	proposal timed out at `240s`
Provider smoke	`python_fixture_benchmark` `opencode-smoke`	OpenCode	`0.050 -> crash`	documented	sandboxed run failed before proposal execution because OpenCode attempted to write under its user log directory
Provider smoke	`python_fixture_benchmark` `opencode-smoke-escalated`	OpenCode	`0.050 -> 0.050`	documented	completed cleanly but produced a `no-change` candidate
Provider smoke	`python_fixture_benchmark` `gemini-smoke`	Gemini CLI	`0.050 -> crash`	documented	Gemini launched but failed because `GEMINI_API_KEY` was not set
Provider smoke	`python_fixture_benchmark` `pi-smoke`	Pi	`0.050 -> crash`	documented	Pi launched but no models were configured
Release validation	`python_fixture_benchmark` `ci-fixture-local-check`	fake backend	`0.050 -> 1.000`	local artifact	used to validate uv-first CI and CLI flow
Release validation	`python_cli_benchmark` `ci-cli-local-check`	fake backend	`0.045 -> 1.000`	local artifact	used to validate uv-first CI and CLI flow
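
The `Baseline -> Best` column mixes numeric outcomes with the literal `crash`. A minimal sketch of reading such a cell is shown below. The function names and the `improved` / `unchanged` / `crash` labels are this example's own choices, not part of metaharness.

```python
from typing import Optional, Tuple


def parse_transition(cell: str) -> Tuple[float, Optional[float]]:
    """Split a 'Baseline -> Best' cell into numbers.

    A non-numeric right-hand side such as 'crash' is returned as None.
    Surrounding backticks, as used in the registry table, are ignored.
    """
    left, right = (part.strip() for part in cell.strip().strip("`").split("->"))
    baseline = float(left)
    try:
        best: Optional[float] = float(right)
    except ValueError:
        best = None
    return baseline, best


def classify(cell: str) -> str:
    """Label a registry cell; the label names are illustrative only."""
    baseline, best = parse_transition(cell)
    if best is None:
        return "crash"
    return "improved" if best > baseline else "unchanged"


if __name__ == "__main__":
    for cell in ("`0.050 -> 1.000`", "`0.045 -> 0.045`", "`0.050 -> crash`"):
        print(cell, "=>", classify(cell))
```

Note that `unchanged` here only says the objective did not move. It does not distinguish a timed-out proposal from a completed run that made no edits, so the Notes column is still needed for that.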

Early Development Experiments¶

These runs happened before the real benchmark targets existed. They were important because they proved that the outer loop, Codex integration, local Ollama path, and on-disk run structure all worked together.

`local-oss-smoke`¶

Provider:

Codex over Ollama
model: gpt-oss:20b

Result:

baseline objective: 0.200
best candidate objective: 1.000

Winning changes:

AGENTS.md
GEMINI.md
scripts/validate.sh

Main takeaway:

the local OSS path worked end to end and could solve a small scaffold profile

Artifact note:

this run was created in a temporary workspace during development and is not committed

`local-oss-medium`¶

Provider:

Codex over Ollama
model: gpt-oss:20b

Result:

baseline objective: 0.0625
best candidate objective: 1.000

Winning changes:

AGENTS.md
GEMINI.md
scripts/bootstrap.sh
scripts/validate.sh
scripts/test.sh

Main takeaway:

the local OSS path remained viable on a richer scaffold with real bootstrap and test scripts

Artifact note:

this run was created in a temporary workspace during development and is not committed

Checked Local Validation Runs¶

These runs are important because they prove the framework and example targets work even without live provider calls. Their directories are local artifacts under examples/*/runs/ and are gitignored.

`ticket_router`¶

Named local runs:

examples/ticket_router/runs/smoke
examples/ticket_router/runs/fake-run

Observed result:

baseline objective: 0.750
best candidate objective: 1.000

Why it matters:

it is the smallest deterministic end-to-end example in the repository
it proved the early engine and later served as a regression check after ledger changes

`python_fixture_benchmark` `smoke-real-target`¶

Named local run:

examples/python_fixture_benchmark/runs/smoke-real-target

Observed result:

baseline objective: 0.050
best candidate objective: 1.000

Why it matters:

this was the first real coding-tool benchmark target
it proved that the benchmark was no longer just a placeholder scaffold

Release validation runs¶

Named local runs:

examples/python_fixture_benchmark/runs/ci-fixture-local-check
examples/python_cli_benchmark/runs/ci-cli-local-check

Observed result:

both fake-backend runs reached 1.000

Why they matter:

they validated the uv-first release flow
they mirrored the important fake smoke paths used in CI

Real Provider Benchmark Experiments¶

All real provider benchmark runs documented here were executed through the Codex CLI path, either against hosted Codex or against a local Ollama model. This is why the current public benchmark evidence for metaharness is Codex-first.

`python_fixture_benchmark`¶

Provider	Run ID	Best Objective	Improved	Duration (s)	Notes
Hosted Codex	`hosted-codex-20260401`	`1.000`	yes	`153.231`	solved in 1 proposal iteration
Ollama `gpt-oss:20b`	`ollama-20b-20260401`	`0.050`	no	`240.149`	proposal timed out at `240s`
Ollama `gpt-oss:120b`	`ollama-120b-20260401`	`1.000`	yes	`274.820`	solved in 1 proposal iteration

Winning harness changes:

AGENTS.md
GEMINI.md
scripts/bootstrap.sh
scripts/test.sh
scripts/validate.sh

Observed quality:

hosted Codex wrote stronger repository guidance and explicitly pointed the agent at .metaharness feedback
local gpt-oss:120b solved the benchmark too, but with simpler script changes and slower turnaround
local gpt-oss:20b did not improve the baseline within the configured timeout
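
To put "slower turnaround" in numbers, the durations from the table above can be compared directly. This is plain arithmetic on the reported wall times; the dictionary and the helper function are only for illustration.

```python
# Durations in seconds, copied from the python_fixture_benchmark table above.
# The 20b figure is the point at which the proposal timed out, not a solve time.
DURATIONS_S = {
    "hosted-codex-20260401": 153.231,
    "ollama-20b-20260401": 240.149,
    "ollama-120b-20260401": 274.820,
}


def slowdown(run_id: str, reference: str = "hosted-codex-20260401") -> float:
    """Return how many times longer run_id took than the reference run."""
    return DURATIONS_S[run_id] / DURATIONS_S[reference]


if __name__ == "__main__":
    ratio = slowdown("ollama-120b-20260401")
    extra = DURATIONS_S["ollama-120b-20260401"] - DURATIONS_S["hosted-codex-20260401"]
    print(f"gpt-oss:120b took {ratio:.2f}x the hosted Codex time ({extra:.3f} s more)")
```

With these figures, the local `gpt-oss:120b` run took roughly 1.79 times as long as hosted Codex, about 122 seconds more, on a single run each. One run per provider is not enough to treat the ratio as stable.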

Artifact note:

the run summaries are documented in this repository
the corresponding run directories are local artifacts and are not committed

`python_cli_benchmark`¶

Provider	Run ID	Best Objective	Improved	Duration (s)	Notes
Hosted Codex	`hosted-codex-20260401`	`1.000`	yes	`155.489`	solved in 1 proposal iteration
Ollama `gpt-oss:20b`	`ollama-20b-20260401`	`0.045`	no	`240.052`	proposal timed out at `240s`

Winning hosted harness changes:

AGENTS.md
GEMINI.md
scripts/bootstrap.sh
scripts/test.sh
scripts/validate.sh

Observed quality:

hosted Codex completed the instructions and scripts needed for both the unit test flow and the CLI smoke command
local gpt-oss:20b again timed out before producing an improving candidate

Artifact note:

the run summaries are documented in this repository
the corresponding run directories are local artifacts and are not committed

Provider Smoke Experiments¶

These runs are important because they show whether a provider integration actually launches through metaharness, writes proposal artifacts, and can be inspected after the fact. They do not count as benchmark-quality evidence until the provider produces a meaningful candidate on a real target.

`python_fixture_benchmark` with OpenCode¶

Provider	Run ID	Best Objective	Outcome	Notes
OpenCode	`opencode-smoke`	baseline only	crash	sandboxed run failed before proposal execution because OpenCode attempted to write under `~/.local/share/opencode/`
OpenCode	`opencode-smoke-escalated`	`0.050`	no-change	rerun outside the sandbox completed successfully but made no harness edits

Observed behavior:

the first run showed that OpenCode has environment assumptions around its own log or state directories
the rerun proved the backend integration itself works and stores proper proposal artifacts
the candidate only read .metaharness/INSTRUCTIONS.md and the parent manifest, then stopped without editing files
stderr showed `permission requested: external_directory (/src/*); auto-rejecting`, which appears to be the most important immediate blocker for useful benchmark behavior

What this means:

OpenCode support is real in the library
OpenCode is not benchmark-validated yet in this repository
the next OpenCode step is permission and workspace-behavior tuning, not more parser work

`python_fixture_benchmark` with Gemini¶

Provider	Run ID	Best Objective	Outcome	Notes
Gemini CLI	`gemini-smoke`	baseline only	crash	Gemini started but exited with `When using Gemini API, you must specify the GEMINI_API_KEY environment variable.`

What this means:

Gemini support is real in the library
the current blocker is authentication in the runtime environment
Gemini is not benchmark-validated yet in this repository

`python_fixture_benchmark` with Pi¶

Provider	Run ID	Best Objective	Outcome	Notes
Pi	`pi-smoke`	baseline only	crash	Pi started but exited with `No models available.` and requested provider API keys or `~/.pi/agent/models.json`

What this means:

Pi support is real in the library
the current blocker is provider model configuration in the runtime environment
Pi is not benchmark-validated yet in this repository
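
The three smoke sections above each end in a different blocker that is recognizable from the provider's own message. A minimal triage sketch, assuming those messages appear in captured provider output, could map them to a blocker category. The category labels and the function are hypothetical and are not part of metaharness.

```python
# Substrings taken from the messages quoted on this page, paired with
# illustrative blocker labels (the labels are this sketch's own wording).
BLOCKER_PATTERNS = [
    ("permission requested: external_directory", "workspace permission rejected"),
    ("you must specify the GEMINI_API_KEY environment variable", "missing authentication"),
    ("No models available.", "no provider models configured"),
]


def classify_blocker(output_text: str) -> str:
    """Return the first matching blocker label, or 'unknown' if none match."""
    for needle, label in BLOCKER_PATTERNS:
        if needle in output_text:
            return label
    return "unknown"


if __name__ == "__main__":
    sample = "permission requested: external_directory (/src/*); auto-rejecting"
    print(classify_blocker(sample))
```

Plain substring matching is brittle if a provider rewords its messages, so a table like this is only a starting point for reading smoke-run logs.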

What Is Still Missing¶

The registry is still incomplete in four meaningful ways:

There is no documented python_cli_benchmark run yet for local gpt-oss:120b.
There is no successful real Gemini benchmark run in this repository yet.
There is no successful real Pi benchmark run in this repository yet.
There is no documented Claude Code or Opus result set in this repository yet.

Those are the most obvious next experiments if the goal is to broaden the evidence base.

Overall Conclusions So Far¶

Hosted Codex is the strongest current path for the real benchmarks in this repository.
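Local gpt-oss:20b is useful for small smoke checks, but it is not yet reliable enough for the current real benchmarks: both of its real benchmark proposals timed out at the present 240s limit.
Local gpt-oss:120b solved python_fixture_benchmark in one proposal iteration, but it was slower than hosted Codex (274.820 s against 153.231 s) and has no python_cli_benchmark run yet.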
Fake-backend validation runs were valuable because they let the framework, CLI, and benchmarks mature before burning live model time.
Reporting is now good enough to compare providers by score, duration, and changed harness files without getting buried in transient .venv churn.
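
As an illustration of the last point, one way to keep a changed-file list readable is to drop paths that sit under transient directories before comparing runs. This is a sketch of the idea only; how metaharness reporting actually filters `.venv` churn is not described on this page, and the function name and default set are assumptions.

```python
from pathlib import PurePosixPath


def harness_changes(changed_paths, transient_dirs=frozenset({".venv"})):
    """Return changed paths that are not inside any transient directory.

    transient_dirs is an assumption of this sketch; only `.venv` is named
    on this page as a source of churn.
    """
    kept = []
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        if not transient_dirs.intersection(parts):
            kept.append(raw)
    return sorted(kept)


if __name__ == "__main__":
    changed = [
        "scripts/test.sh",
        ".venv/bin/python",
        "AGENTS.md",
        ".venv/lib/site-packages/example.py",
    ]
    print(harness_changes(changed))
```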
