Experiments¶
This page is the full experiment registry for metaharness development so far.
It covers three kinds of work:
- real provider benchmark runs that shaped the current product position
- provider smoke runs that prove integrations launch and produce artifacts
- fake-backend validation runs that proved the framework and benchmark targets before spending model calls
- early development experiments that were run in temporary workspaces and are summarized here from development notes
Important artifact policy:
- run directories under
examples/*/runs/are local artifacts and are gitignored - this page and the top-level
BENCHMARK_RESULTS.mdfile are the checked-in summaries of the important named runs - anonymous temporary test runs created during unit tests are not listed here one by one
For repeated benchmark runs, use metaharness experiment so the results are saved as both JSON and TSV.
The experiment workflow is also influenced by Autoresearch by Andrej Karpathy, especially in the emphasis on explicit experiment records, repeatable runs, and outcome-driven iteration.
Experiment Registry¶
| Category | Target | Backend / Model | Baseline -> Best | Status | Notes |
|---|---|---|---|---|---|
| Early development | local-oss-smoke scaffold |
Codex over Ollama gpt-oss:20b |
0.200 -> 1.000 |
historical | first successful local OSS scaffold run, temporary workspace only |
| Early development | local-oss-medium scaffold |
Codex over Ollama gpt-oss:20b |
0.0625 -> 1.000 |
historical | richer scaffold with bootstrap and test scripts, temporary workspace only |
| Framework validation | ticket_router smoke |
fake backend | 0.750 -> 1.000 |
local artifact | early end to end deterministic example run |
| Framework validation | ticket_router fake-run |
fake backend | 0.750 -> 1.000 |
local artifact | later regression run after outcome ledger changes |
| Real benchmark validation | python_fixture_benchmark smoke-real-target |
fake backend | 0.050 -> 1.000 |
local artifact | first real benchmark target proved end to end without live model calls |
| Real benchmark | python_fixture_benchmark hosted-codex-20260401 |
hosted Codex | 0.050 -> 1.000 |
documented | solved in one proposal iteration |
| Real benchmark | python_fixture_benchmark ollama-20b-20260401 |
Codex over Ollama gpt-oss:20b |
0.050 -> 0.050 |
documented | proposal timed out at 240s |
| Real benchmark | python_fixture_benchmark ollama-120b-20260401 |
Codex over Ollama gpt-oss:120b |
0.050 -> 1.000 |
documented | solved in one proposal iteration, slower than hosted Codex |
| Real benchmark | python_cli_benchmark hosted-codex-20260401 |
hosted Codex | 0.045 -> 1.000 |
documented | solved in one proposal iteration |
| Real benchmark | python_cli_benchmark ollama-20b-20260401 |
Codex over Ollama gpt-oss:20b |
0.045 -> 0.045 |
documented | proposal timed out at 240s |
| Provider smoke | python_fixture_benchmark opencode-smoke |
OpenCode | 0.050 -> crash |
documented | sandboxed run failed before proposal execution because OpenCode attempted to write under its user log directory |
| Provider smoke | python_fixture_benchmark opencode-smoke-escalated |
OpenCode | 0.050 -> 0.050 |
documented | completed cleanly but produced a no-change candidate |
| Provider smoke | python_fixture_benchmark gemini-smoke |
Gemini CLI | 0.050 -> crash |
documented | Gemini launched but failed because GEMINI_API_KEY was not set |
| Provider smoke | python_fixture_benchmark pi-smoke |
Pi | 0.050 -> crash |
documented | Pi launched but no models were configured |
| Release validation | python_fixture_benchmark ci-fixture-local-check |
fake backend | 0.050 -> 1.000 |
local artifact | used to validate uv-first CI and CLI flow |
| Release validation | python_cli_benchmark ci-cli-local-check |
fake backend | 0.045 -> 1.000 |
local artifact | used to validate uv-first CI and CLI flow |
Early Development Experiments¶
These runs happened before the real benchmark targets existed. They were important because they proved that the outer loop, Codex integration, local Ollama path, and on-disk run structure all worked together.
local-oss-smoke¶
Provider:
- Codex over Ollama
- model:
gpt-oss:20b
Result:
- baseline objective:
0.200 - best candidate objective:
1.000
Winning changes:
AGENTS.mdGEMINI.mdscripts/validate.sh
Main takeaway:
- the local OSS path worked end to end and could solve a small scaffold profile
Artifact note:
- this run was created in a temporary workspace during development and is not committed
local-oss-medium¶
Provider:
- Codex over Ollama
- model:
gpt-oss:20b
Result:
- baseline objective:
0.0625 - best candidate objective:
1.000
Winning changes:
AGENTS.mdGEMINI.mdscripts/bootstrap.shscripts/validate.shscripts/test.sh
Main takeaway:
- the local OSS path remained viable on a richer scaffold with real bootstrap and test scripts
Artifact note:
- this run was created in a temporary workspace during development and is not committed
Checked Local Validation Runs¶
These runs are important because they prove the framework and example targets work even without live provider calls.
Their directories are local artifacts under examples/*/runs/ and are gitignored.
ticket_router¶
Named local runs:
examples/ticket_router/runs/smokeexamples/ticket_router/runs/fake-run
Observed result:
- baseline objective:
0.750 - best candidate objective:
1.000
Why it matters:
- it is the smallest deterministic end to end example in the repository
- it proved the early engine and later served as a regression check after ledger changes
python_fixture_benchmark smoke-real-target¶
Named local run:
examples/python_fixture_benchmark/runs/smoke-real-target
Observed result:
- baseline objective:
0.050 - best candidate objective:
1.000
Why it matters:
- this was the first real coding-tool benchmark target
- it proved that the benchmark was no longer just a placeholder scaffold
Release validation runs¶
Named local runs:
examples/python_fixture_benchmark/runs/ci-fixture-local-checkexamples/python_cli_benchmark/runs/ci-cli-local-check
Observed result:
- both fake-backend runs reached
1.000
Why they matter:
- they validated the uv-first release flow
- they mirrored the important fake smoke paths used in CI
Real Provider Benchmark Experiments¶
All real provider runs documented here were executed through the Codex CLI path.
This is why the current public benchmark evidence for metaharness is Codex-first.
python_fixture_benchmark¶
| Provider | Run ID | Best Objective | Improved | Duration (s) | Notes |
|---|---|---|---|---|---|
| Hosted Codex | hosted-codex-20260401 |
1.000 |
yes | 153.231 |
solved in 1 proposal iteration |
Ollama gpt-oss:20b |
ollama-20b-20260401 |
0.050 |
no | 240.149 |
proposal timed out at 240s |
Ollama gpt-oss:120b |
ollama-120b-20260401 |
1.000 |
yes | 274.820 |
solved in 1 proposal iteration |
Winning harness changes:
AGENTS.mdGEMINI.mdscripts/bootstrap.shscripts/test.shscripts/validate.sh
Observed quality:
- hosted Codex wrote stronger repository guidance and explicitly pointed the agent at
.metaharnessfeedback - local
gpt-oss:120bsolved the benchmark too, but with simpler script changes and slower turnaround - local
gpt-oss:20bdid not improve the baseline within the configured timeout
Artifact note:
- the run summaries are documented in this repository
- the corresponding run directories are local artifacts and are not committed
python_cli_benchmark¶
| Provider | Run ID | Best Objective | Improved | Duration (s) | Notes |
|---|---|---|---|---|---|
| Hosted Codex | hosted-codex-20260401 |
1.000 |
yes | 155.489 |
solved in 1 proposal iteration |
Ollama gpt-oss:20b |
ollama-20b-20260401 |
0.045 |
no | 240.052 |
proposal timed out at 240s |
Winning hosted harness changes:
AGENTS.mdGEMINI.mdscripts/bootstrap.shscripts/test.shscripts/validate.sh
Observed quality:
- hosted Codex correctly completed the instruction and script set needed for both the unit test flow and the CLI smoke command
- local
gpt-oss:20bagain timed out before producing an improving candidate
Artifact note:
- the run summaries are documented in this repository
- the corresponding run directories are local artifacts and are not committed
Provider Smoke Experiments¶
These runs are important because they prove a provider integration actually launches through metaharness, writes proposal artifacts, and can be inspected after the fact.
They are not yet benchmark-quality evidence unless the provider produces a meaningful candidate on a real target.
python_fixture_benchmark with OpenCode¶
| Provider | Run ID | Best Objective | Outcome | Notes |
|---|---|---|---|---|
| OpenCode | opencode-smoke |
baseline only | crash | sandboxed run failed before proposal execution because OpenCode attempted to write under ~/.local/share/opencode/ |
| OpenCode | opencode-smoke-escalated |
0.050 |
no-change | rerun outside the sandbox completed successfully but made no harness edits |
Observed behavior:
- the first run showed that OpenCode has environment assumptions around its own log or state directories
- the rerun proved the backend integration itself works and stores proper proposal artifacts
- the candidate only read
.metaharness/INSTRUCTIONS.mdand the parent manifest, then stopped without editing files - stderr showed
permission requested: external_directory (/src/*); auto-rejecting, which appears to be the most important immediate blocker for useful benchmark behavior
What this means:
- OpenCode support is real in the library
- OpenCode is not benchmark-validated yet in this repository
- the next OpenCode step is permission and workspace-behavior tuning, not more parser work
python_fixture_benchmark with Gemini¶
| Provider | Run ID | Best Objective | Outcome | Notes |
|---|---|---|---|---|
| Gemini CLI | gemini-smoke |
baseline only | crash | Gemini started but exited with When using Gemini API, you must specify the GEMINI_API_KEY environment variable. |
What this means:
- Gemini support is real in the library
- the current blocker is authentication in the runtime environment
- Gemini is not benchmark-validated yet in this repository
python_fixture_benchmark with Pi¶
| Provider | Run ID | Best Objective | Outcome | Notes |
|---|---|---|---|---|
| Pi | pi-smoke |
baseline only | crash | Pi started but exited with No models available. and requested provider API keys or ~/.pi/agent/models.json |
What this means:
- Pi support is real in the library
- the current blocker is provider model configuration in the runtime environment
- Pi is not benchmark-validated yet in this repository
What Is Still Missing¶
The registry is still incomplete in two meaningful ways:
- There is no documented
python_cli_benchmarkrun yet for localgpt-oss:120b. - There is no successful real Gemini benchmark run in this repository yet.
- There is no successful real Pi benchmark run in this repository yet.
- There is no documented Claude Code or Opus result set in this repository yet.
Those are the most obvious next experiments if the goal is to broaden the evidence base.
Overall Conclusions So Far¶
- Hosted Codex is the strongest current path for the real benchmarks in this repository.
- Local
gpt-oss:20bis useful for small smoke checks, but it is not yet reliable enough for the current real benchmarks with the present timeout settings. - Local
gpt-oss:120bis capable on at least one real benchmark, but it is slower than hosted Codex. - Fake-backend validation runs were valuable because they let the framework, CLI, and benchmarks mature before burning live model time.
- Reporting is now good enough to compare providers by score, duration, and changed harness files without getting buried in transient
.venvchurn.