Benchmark Runner

Overview

The benchmark harness compares coding-agent performance across multiple targets (superqode, opencode, pi, deepagents) on the same task prompts. Each task in the tasks file is run against every selected target, and the results are collected into a JSON report for analysis. The harness does not evaluate output correctness; it delegates pass/fail determination to each target CLI's exit code.

Task File Format

Tasks are defined in a JSON file with a top-level `tasks` array:

{
  "tasks": [
    {
      "id": "task-001",
      "prompt": "Implement a fibonacci function in Python",
      "cwd": "/tmp/workspace",
      "timeout_seconds": 300
    }
  ]
}

Field descriptions:

Field	Required	Default	Description
`id`	yes		Unique task identifier
`prompt`	yes		Prompt text, passed as the last argument to the target CLI
`cwd`	no	`.`	Working directory for the task
`timeout_seconds`	no	`300`	Per-task timeout in seconds
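The field rules in the table can be sketched as a small loader. This is an illustration only, not the harness's own code: the function name `load_tasks` and the duplicate-ID check are assumptions (the table says IDs are unique, but does not say whether the harness enforces that).

```python
import json

# Defaults taken from the field table above.
DEFAULT_CWD = "."
DEFAULT_TIMEOUT_SECONDS = 300


def load_tasks(text):
    """Parse a tasks document and fill in the documented defaults.

    Hypothetical helper: the real harness may validate differently.
    """
    data = json.loads(text)
    tasks = []
    seen_ids = set()
    for raw in data["tasks"]:
        # `id` and `prompt` are the only required fields.
        for field in ("id", "prompt"):
            if field not in raw:
                raise ValueError(f"task is missing required field {field!r}")
        # Assumption: reject duplicate IDs, since IDs are documented as unique.
        if raw["id"] in seen_ids:
            raise ValueError(f"duplicate task id {raw['id']!r}")
        seen_ids.add(raw["id"])
        tasks.append({
            "id": raw["id"],
            "prompt": raw["prompt"],
            "cwd": raw.get("cwd", DEFAULT_CWD),
            "timeout_seconds": raw.get("timeout_seconds", DEFAULT_TIMEOUT_SECONDS),
        })
    return tasks


example = '{"tasks": [{"id": "task-001", "prompt": "Implement a fibonacci function in Python"}]}'
tasks = load_tasks(example)
```

Here the single task omits `cwd` and `timeout_seconds`, so the loader fills in `.` and `300`.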

CLI Usage

superqode benchmark run tasks.json
superqode benchmark run tasks.json --target superqode --target opencode

The `--target` flag is repeatable. When it is omitted, all built-in targets are used.
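When the benchmark is driven from a script, the repeatable flag maps naturally onto a list of target names. The sketch below only assembles the argument list shown in the commands above; the helper name `benchmark_argv` is hypothetical, and it uses no flags beyond `--target`.

```python
def benchmark_argv(tasks_path, targets=()):
    """Build the argv for `superqode benchmark run`, one --target per name.

    Hypothetical helper; an empty `targets` leaves the flag out entirely,
    which per the text above selects all built-in targets.
    """
    argv = ["superqode", "benchmark", "run", tasks_path]
    for name in targets:
        argv += ["--target", name]
    return argv


all_targets = benchmark_argv("tasks.json")
two_targets = benchmark_argv("tasks.json", ["superqode", "opencode"])
```

The resulting list can be passed to `subprocess.run` as is, which avoids shell quoting issues in the tasks path.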

Targets

Four built-in targets are available:

Target	CLI Command
superqode	`superqode -p`
opencode	`opencode run`
pi	`pi -p`
deepagents	`deepagents`

If a target's binary is not found on `PATH`, the target is not executed and is reported as `skipped` in the results.
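The lookup and command construction described here might look roughly like the following. The base commands are copied from the table; the `TARGET_COMMANDS` mapping, the `build_command` name, and the injectable `which` parameter are assumptions made for illustration, and the harness may resolve binaries differently.

```python
import shutil

# Base commands copied from the targets table; the prompt is appended last.
TARGET_COMMANDS = {
    "superqode": ["superqode", "-p"],
    "opencode": ["opencode", "run"],
    "pi": ["pi", "-p"],
    "deepagents": ["deepagents"],
}


def build_command(target, prompt, which=shutil.which):
    """Return the argv for a target, or None if its binary is not on PATH.

    Hypothetical helper: `which` is injectable so the lookup can be
    exercised without the real binaries installed.
    """
    base = TARGET_COMMANDS[target]
    if which(base[0]) is None:
        return None  # the caller would record this target as "skipped"
    return [*base, prompt]


# Simulated PATH lookups for demonstration.
found = build_command("opencode", "Implement a fibonacci function in Python",
                      which=lambda name: "/usr/bin/" + name)
missing = build_command("pi", "Implement a fibonacci function in Python",
                        which=lambda name: None)
```

Passing the prompt as a single list element keeps it intact as one argument, whatever spaces or quotes it contains.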

Result Format

Results are output as a JSON array. Each entry contains:

Field	Type	Description
`target`	string	Name of the target that ran
`task_id`	string	ID of the task that was run
`status`	string	One of: `passed`, `failed`, `skipped`, `timeout`
`returncode`	int	Exit code of the target CLI (absent on skip/timeout)
`duration_seconds`	float	Wall-clock time in seconds (absent on skip)
`stdout_chars`	int	Character count of stdout (absent on skip/timeout)
`stderr_chars`	int	Character count of stderr (absent on skip/timeout)

Pass/fail status is determined solely by the target CLI's exit code: exit code 0 produces `passed`, and any non-zero exit code produces `failed`. The benchmark harness does not inspect output for correctness.
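To make the status rules and the "absent on timeout" fields concrete, here is a minimal sketch of running one command and building a result entry. It assumes the harness behaves like `subprocess.run` with a timeout; the function name `run_task` is hypothetical, and the demo commands use the current Python interpreter as a stand-in for a real target CLI.

```python
import subprocess
import sys
import time


def run_task(target, task_id, argv, cwd=".", timeout_seconds=300):
    """Run one command and return a result entry shaped like the table above.

    Hypothetical sketch, not the harness's implementation.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(argv, cwd=cwd, capture_output=True,
                              text=True, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # On timeout there is no exit code and no output counts.
        return {"target": target, "task_id": task_id, "status": "timeout",
                "duration_seconds": time.monotonic() - start}
    return {
        "target": target,
        "task_id": task_id,
        # Exit code alone decides the status; output is only measured.
        "status": "passed" if proc.returncode == 0 else "failed",
        "returncode": proc.returncode,
        "duration_seconds": time.monotonic() - start,
        "stdout_chars": len(proc.stdout),
        "stderr_chars": len(proc.stderr),
    }


# Stand-in commands: the Python interpreter plays the role of a target CLI.
ok = run_task("demo", "task-ok", [sys.executable, "-c", "print('hi')"])
bad = run_task("demo", "task-bad", [sys.executable, "-c", "import sys; sys.exit(3)"])
slow = run_task("demo", "task-slow",
                [sys.executable, "-c", "import time; time.sleep(5)"],
                timeout_seconds=0.5)
```

A skipped target would produce an entry with only `target`, `task_id`, and `status`, since nothing was executed.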

Writing Custom Tasks

Create a JSON file with a `tasks` array. Each task object must contain at least `id` and `prompt`. The prompt is appended as the last argument to the target's CLI command, and the target is expected to solve the prompt and exit with code 0 on success. Use `cwd` to control where the target process runs and `timeout_seconds` to prevent runaway agents.
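A tasks file can also be generated from a script, which helps when many tasks share a working directory or timeout. The sketch below is one possible approach; the `make_task` helper and the second prompt are made up for illustration.

```python
import json


def make_task(task_id, prompt, cwd=None, timeout_seconds=None):
    """Build one task object, omitting optional fields that were not given.

    Hypothetical helper; omitted fields fall back to the documented defaults.
    """
    task = {"id": task_id, "prompt": prompt}
    if cwd is not None:
        task["cwd"] = cwd
    if timeout_seconds is not None:
        task["timeout_seconds"] = timeout_seconds
    return task


document = {
    "tasks": [
        make_task("task-001", "Implement a fibonacci function in Python",
                  cwd="/tmp/workspace", timeout_seconds=300),
        # Illustrative second task: relies on the default working directory.
        make_task("task-002", "Add unit tests for the fibonacci function",
                  timeout_seconds=120),
    ]
}

tasks_json = json.dumps(document, indent=2)
```

Writing `tasks_json` to a file such as `tasks.json` yields input for `superqode benchmark run tasks.json`.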