Benchmarks & Leaderboard

RLM Code includes a complete benchmarking and evaluation framework designed for research reproducibility and systematic performance tracking. The system provides 10 preset benchmark suites covering 33+ test cases, a multi-metric leaderboard for ranking and comparison, and session replay with full time-travel debugging.


Architecture

flowchart TD
    Presets["Preset Benchmarks\n(10 suites, 33+ cases)"] --> Runner["RLMRunner\n.run_benchmark()"]
    YAML["Custom YAML Packs"] --> Runner
    Runner --> Results["Benchmark Results\n(JSON)"]
    Runner --> JSONL["Step Traces\n(JSONL)"]
    Results --> LB["Leaderboard\n(ranking, statistics)"]
    JSONL --> Replay["Session Replay\n(time-travel debugging)"]
    LB --> Export["Export\n(JSON / CSV / Markdown / Rich)"]
    Replay --> Compare["Session Comparison\n(diff two runs)"]
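
Concretely, the left-to-right flow maps onto the commands covered in Quick Start below; the comments tie each command back to the diagram nodes:

rlm-code bench preset=dspy_quick                 # Presets -> RLMRunner: writes results (JSON) and step traces (JSONL)
rlm-code leaderboard --metric reward --limit 10  # Results -> Leaderboard: rank the recorded runs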

Key Components

Preset Benchmarks

Ten built-in benchmark suites cover the full spectrum of RLM capabilities:

| Category | Presets | Total Cases | Focus |
|---|---|---|---|
| DSPy | dspy_quick, dspy_extended | 8 | DSPy coding loop: signatures, modules, tests |
| Generic | generic_smoke | 2 | Basic Python execution and error recovery |
| Pure RLM | pure_rlm_smoke, pure_rlm_context | 7 | Paper-compliant mode, context-as-variable |
| Advanced | deep_recursion, paradigm_comparison | 6 | Depth > 1 recursion, cross-paradigm comparison |
| Paper-Compatible | oolong_style, browsecomp_style, token_efficiency | 10 | OOLONG, BrowseComp-Plus, token efficiency |

See Preset Benchmarks for full details on every suite and case.
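
For example, to run the paper-compatible token-efficiency suite from the table above, reuse the bench syntax shown in Quick Start:

rlm-code bench preset=token_efficiency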

Multi-Metric Leaderboard

The leaderboard aggregates results from all benchmark runs and ranks them across 7 metrics:

| Metric | Direction | Description |
|---|---|---|
| REWARD | Higher is better | Average cumulative reward |
| COMPLETION_RATE | Higher is better | Percentage of completed runs |
| STEPS | Lower is better | Average steps to completion |
| TOKENS | Lower is better | Total tokens consumed |
| COST | Lower is better | Estimated cost in USD |
| DURATION | Lower is better | Execution time in seconds |
| EFFICIENCY | Higher is better | Reward per 1000 tokens |

See Leaderboard for filtering, statistics, aggregation, and export.
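
Since EFFICIENCY is derived rather than measured, here is a minimal sketch of the definition from the table (reward per 1,000 tokens). It is illustrative arithmetic only, not RLM Code's implementation:

def efficiency(total_reward: float, total_tokens: int) -> float:
    # EFFICIENCY as defined above: reward per 1,000 tokens.
    # Illustrative only; not the library's actual code.
    return total_reward / (total_tokens / 1000)

print(efficiency(4.2, 12_000))  # 0.35 reward per 1k tokens

To rank by it on the CLI, the lowercase metric name should work with the --metric flag from Quick Start (e.g. rlm-code leaderboard --metric efficiency); the lowercase spelling is an assumption based on the reward example.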

Session Replay

Every RLM run can be recorded and replayed step by step:

  • SessionRecorder captures actions, observations, rewards, memory, and variables at each step
  • SessionReplayer provides forward/backward navigation (step_forward, step_backward, goto_step)
  • SessionStore persists snapshots and checkpoints to disk
  • SessionComparison diffs two sessions to find the divergence point

See Session Replay for the full API and time-travel debugging workflows.
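
A minimal navigation sketch using only the names listed above; the argument and return-value conventions are assumptions, not the documented API:

from rlm_code.rlm.session_replay import load_session

replayer = load_session(".rlm_code/rlm/observability/steps/abc12345.jsonl")

replayer.goto_step(3)     # jump straight to a step (index convention assumed)
replayer.step_forward()   # advance one step
replayer.step_backward()  # rewind one step, e.g. when hunting a divergence point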


Quick Start

Run a Preset Benchmark

rlm-code bench preset=dspy_quick

View the Leaderboard

rlm-code leaderboard --metric reward --limit 10

Replay a Session

from rlm_code.rlm.session_replay import load_session

# Load a recorded step trace (JSONL) and walk it from the first step.
replayer = load_session(".rlm_code/rlm/observability/steps/abc12345.jsonl")
for step in replayer.iterate_steps():
    print(f"Step {step.step}: {step.action_type} -> reward={step.reward}")

Custom Benchmarks

You can extend the built-in presets by loading custom YAML, JSON, or JSONL benchmark packs:

# my_benchmarks.yaml
presets:
  my_suite:
    description: "Custom evaluation suite"
    cases:
      - id: test_1
        task: "Write a function that reverses a string"
        environment: generic
        max_steps: 3
        exec_timeout: 30
      - id: test_2
        task: "Build a REST API client"
        environment: generic
        max_steps: 5
        exec_timeout: 60

Then run the suite with the --pack flag:

rlm-code bench preset=my_suite --pack my_benchmarks.yaml

See Preset Benchmarks for all supported pack formats including Google ADK eval sets and generic datasets.


What's Next

| Page | Description |
|---|---|
| Preset Benchmarks | All 10 presets in detail, custom pack loading, YAML format |
| Leaderboard | Ranking, filtering, statistics, trend analysis, export |
| Session Replay | Recording, replaying, time-travel debugging, session comparison |