Benchmark Runner

Overview

The benchmark harness compares coding agent performance across multiple targets (superqode, opencode, pi, deepagents) on the same task prompts. Each task in a tasks file is run against every selected target, and results are collected into a JSON report for analysis. The harness does not evaluate output correctness; it delegates pass/fail determination to each target CLI's exit code.

Task File Format

Tasks are defined in a JSON file with a tasks array:

{
  "tasks": [
    {
      "id": "task-001",
      "prompt": "Implement a fibonacci function in Python",
      "cwd": "/tmp/workspace",
      "timeout_seconds": 300
    }
  ]
}

Field descriptions:

| Field | Required | Default | Description |
| --- | --- | --- | --- |
| id | yes | | Unique task identifier |
| prompt | yes | | Prompt text, passed as the last argument to the target CLI |
| cwd | no | . | Working directory for the task |
| timeout_seconds | no | 300 | Per-task timeout in seconds |
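
To make the required fields and defaults concrete, here is a minimal Python sketch of how a tasks file could be parsed. It is not the harness's own loader: the function name `load_tasks` is hypothetical, and rejecting duplicate ids is an assumption based on the id being described as unique.

```python
import json

# Defaults taken from the field table above.
DEFAULT_CWD = "."
DEFAULT_TIMEOUT_SECONDS = 300


def load_tasks(text):
    """Parse a tasks document (hypothetical helper) and fill in defaults."""
    data = json.loads(text)
    tasks = []
    seen_ids = set()
    for raw in data["tasks"]:
        # id and prompt are the only required fields.
        for field in ("id", "prompt"):
            if field not in raw:
                raise ValueError(f"task is missing required field: {field}")
        # Assumption: ids must be unique, so duplicates are rejected here.
        if raw["id"] in seen_ids:
            raise ValueError(f"duplicate task id: {raw['id']}")
        seen_ids.add(raw["id"])
        tasks.append({
            "id": raw["id"],
            "prompt": raw["prompt"],
            "cwd": raw.get("cwd", DEFAULT_CWD),
            "timeout_seconds": raw.get("timeout_seconds", DEFAULT_TIMEOUT_SECONDS),
        })
    return tasks


example = '{"tasks": [{"id": "task-001", "prompt": "Implement a fibonacci function in Python"}]}'
print(load_tasks(example))
```

A task that omits cwd and timeout_seconds, as in the example string, comes back with the defaults filled in.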

CLI Usage

superqode benchmark run tasks.json
superqode benchmark run tasks.json --target superqode --target opencode

The --target flag is repeatable. When omitted, all built-in targets are used.

Targets

Four built-in targets are available:

| Target | CLI Command |
| --- | --- |
| superqode | superqode -p |
| opencode | opencode run |
| pi | pi -p |
| deepagents | deepagents |

If a target's binary is not found on PATH, that target is reported as "skipped" in the results. Only targets present on PATH are executed.
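
The command table and the PATH check can be sketched in Python as follows. This is an illustration only: `TARGET_COMMANDS`, `build_argv`, and `is_available` are hypothetical names, and using `shutil.which` for the lookup is an assumption about how such a check might be done, not a description of the harness's internals.

```python
import shutil

# Command prefixes copied from the targets table above.
TARGET_COMMANDS = {
    "superqode": ["superqode", "-p"],
    "opencode": ["opencode", "run"],
    "pi": ["pi", "-p"],
    "deepagents": ["deepagents"],
}


def build_argv(target, prompt):
    """Return the argument list for a target, with the prompt as the last argument."""
    return TARGET_COMMANDS[target] + [prompt]


def is_available(target, which=shutil.which):
    """Return True if the target's binary can be found on PATH."""
    return which(TARGET_COMMANDS[target][0]) is not None


for name in TARGET_COMMANDS:
    state = "available" if is_available(name) else "would be skipped"
    print(f"{name}: {state}")
```

Passing the prompt as a separate list element, rather than joining it into a shell string, avoids quoting problems when prompts contain spaces or shell metacharacters.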

Result Format

Results are output as a JSON array. Each entry contains:

| Field | Type | Description |
| --- | --- | --- |
| target | string | Name of the target that ran |
| task_id | string | ID of the task that was run |
| status | string | One of: passed, failed, skipped, timeout |
| returncode | int | Exit code of the target CLI (absent on skip/timeout) |
| duration_seconds | float | Wall-clock time in seconds (absent on skip) |
| stdout_chars | int | Character count of stdout (absent on skip/timeout) |
| stderr_chars | int | Character count of stderr (absent on skip/timeout) |

Pass/fail status is determined solely by the target CLI's exit code: exit code 0 produces passed, any non-zero exit produces failed. The benchmark harness does not inspect output for correctness.
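
The exit-code rule and the result fields can be illustrated with a short Python sketch. It assumes the target is launched with `subprocess.run`; the function name `run_one` is hypothetical, and the harness's actual implementation may differ.

```python
import subprocess
import sys
import time


def run_one(target, task_id, argv, cwd=".", timeout_seconds=300):
    """Run one command and build a result entry shaped like the table above."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            argv,
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # On timeout there is no exit code and no output counts.
        return {
            "target": target,
            "task_id": task_id,
            "status": "timeout",
            "duration_seconds": time.monotonic() - start,
        }
    return {
        "target": target,
        "task_id": task_id,
        # Exit code 0 means passed; any other exit code means failed.
        "status": "passed" if proc.returncode == 0 else "failed",
        "returncode": proc.returncode,
        "duration_seconds": time.monotonic() - start,
        "stdout_chars": len(proc.stdout),
        "stderr_chars": len(proc.stderr),
    }


# Stand-in command: the Python interpreter plays the role of a target CLI.
print(run_one("demo", "task-001", [sys.executable, "-c", "print('ok')"]))
```

Because only the exit code is consulted, a target that prints a wrong answer but exits 0 is still recorded as passed.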

Writing Custom Tasks

Create a JSON file with a tasks array. Each task object must contain at least id and prompt. The prompt is appended as the last argument to the target's CLI command. The target is expected to solve the prompt and exit 0 on success. Use cwd to control where the target process runs and timeout_seconds to prevent runaway agents.
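
When there are many prompts, the tasks file can be generated rather than written by hand. The following Python sketch shows one way to do that; the helper `make_tasks_document` and its `task-NNN` id scheme are hypothetical and simply mirror the example id used earlier.

```python
import json


def make_tasks_document(prompts, cwd=".", timeout_seconds=300):
    """Build the JSON text of a tasks file from a list of prompt strings."""
    tasks = []
    for index, prompt in enumerate(prompts, start=1):
        tasks.append({
            # Hypothetical id scheme: task-001, task-002, ...
            "id": f"task-{index:03d}",
            "prompt": prompt,
            "cwd": cwd,
            "timeout_seconds": timeout_seconds,
        })
    return json.dumps({"tasks": tasks}, indent=2)


print(make_tasks_document([
    "Implement a fibonacci function in Python",
    "Add unit tests for the fibonacci function",
]))
```

The printed text can be saved as tasks.json and passed to superqode benchmark run.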