Benchmarks & Leaderboard¶
RLM Code includes a complete benchmarking and evaluation framework designed for research reproducibility and systematic performance tracking. The system provides 10 preset benchmark suites covering 33+ test cases, a multi-metric leaderboard for ranking and comparison, and session replay with full time-travel debugging.
Architecture¶
```mermaid
flowchart TD
    Presets["Preset Benchmarks\n(10 suites, 33+ cases)"] --> Runner["RLMRunner\n.run_benchmark()"]
    YAML["Custom YAML Packs"] --> Runner
    Runner --> Results["Benchmark Results\n(JSON)"]
    Runner --> JSONL["Step Traces\n(JSONL)"]
    Results --> LB["Leaderboard\n(ranking, statistics)"]
    JSONL --> Replay["Session Replay\n(time-travel debugging)"]
    LB --> Export["Export\n(JSON / CSV / Markdown / Rich)"]
    Replay --> Compare["Session Comparison\n(diff two runs)"]
```

Key Components¶
Preset Benchmarks¶
10 built-in benchmark suites cover the full spectrum of RLM capabilities:
| Category | Presets | Total Cases | Focus |
|---|---|---|---|
| DSPy | `dspy_quick`, `dspy_extended` | 8 | DSPy coding loop: signatures, modules, tests |
| Generic | `generic_smoke` | 2 | Basic Python execution and error recovery |
| Pure RLM | `pure_rlm_smoke`, `pure_rlm_context` | 7 | Paper-compliant mode, context-as-variable |
| Advanced | `deep_recursion`, `paradigm_comparison` | 6 | Depth > 1 recursion, cross-paradigm comparison |
| Paper-Compatible | `oolong_style`, `browsecomp_style`, `token_efficiency` | 10 | OOLONG, BrowseComp-Plus, token efficiency |
See Preset Benchmarks for full details on every suite and case.
Multi-Metric Leaderboard¶
The leaderboard aggregates results from all benchmark runs and ranks them across 7 metrics:
| Metric | Direction | Description |
|---|---|---|
| REWARD | Higher is better | Average cumulative reward |
| COMPLETION_RATE | Higher is better | Percentage of completed runs |
| STEPS | Lower is better | Average steps to completion |
| TOKENS | Lower is better | Total tokens consumed |
| COST | Lower is better | Estimated cost in USD |
| DURATION | Lower is better | Execution time in seconds |
| EFFICIENCY | Higher is better | Reward per 1000 tokens |
See Leaderboard for filtering, statistics, aggregation, and export.
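For intuition, EFFICIENCY is simply reward normalized by token spend. A minimal illustrative sketch (the leaderboard computes this internally; the helper function here is hypothetical):

```python
# Illustration only: how the EFFICIENCY metric relates reward to tokens.
# The leaderboard computes this internally; this helper is hypothetical.
def efficiency(total_reward: float, total_tokens: int) -> float:
    """Reward per 1000 tokens (higher is better)."""
    if total_tokens == 0:
        return 0.0
    return total_reward / (total_tokens / 1000)

print(efficiency(total_reward=4.2, total_tokens=21_000))  # 0.2
```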
Session Replay¶
Every RLM run can be recorded and replayed step by step:
- `SessionRecorder` captures actions, observations, rewards, memory, and variables at each step
- `SessionReplayer` provides forward/backward navigation (`step_forward`, `step_backward`, `goto_step`)
- `SessionStore` persists snapshots and checkpoints to disk
- `SessionComparison` diffs two sessions to find the divergence point
See Session Replay for the full API and time-travel debugging workflows.
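For a quick manual comparison, two step traces can also be walked in lockstep using the same `load_session` API shown in the Quick Start below. A minimal sketch (the trace paths are placeholders; `SessionComparison` performs this diff for you):

```python
from rlm_code.rlm.session_replay import load_session

# Placeholder trace paths; substitute your own recorded sessions.
a = load_session(".rlm_code/rlm/observability/steps/run_a.jsonl")
b = load_session(".rlm_code/rlm/observability/steps/run_b.jsonl")

# Walk both traces in lockstep and report the first differing action.
for step_a, step_b in zip(a.iterate_steps(), b.iterate_steps()):
    if step_a.action_type != step_b.action_type:
        print(f"Sessions diverge at step {step_a.step}")
        break
```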
Quick Start¶
Run a Preset Benchmark¶
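A minimal sketch, assuming `RLMRunner` is importable from the package root; `run_benchmark()` is the entry point shown in the architecture diagram, but the import path and return shape are assumptions:

```python
from rlm_code import RLMRunner  # assumed import path

runner = RLMRunner()
# Run any preset from the table above, e.g. the quick DSPy suite.
results = runner.run_benchmark("dspy_quick")
print(results)  # benchmark results are emitted as JSON
```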
View the Leaderboard¶
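A hedged sketch; the class and method names here are assumptions, so see Leaderboard for the authoritative API and export options:

```python
from rlm_code import Leaderboard  # assumed import path

lb = Leaderboard()
lb.show(metric="efficiency")  # hypothetical: rank runs by reward per 1000 tokens
lb.export("leaderboard.md")   # hypothetical: JSON / CSV / Markdown exports are supported
```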
Replay a Session¶
```python
from rlm_code.rlm.session_replay import load_session

replayer = load_session(".rlm_code/rlm/observability/steps/abc12345.jsonl")
for step in replayer.iterate_steps():
    print(f"Step {step.step}: {step.action_type} -> reward={step.reward}")
```
Custom Benchmarks¶
You can extend the built-in presets by loading custom YAML, JSON, or JSONL benchmark packs:
```yaml
# my_benchmarks.yaml
presets:
  my_suite:
    description: "Custom evaluation suite"
    cases:
      - id: test_1
        task: "Write a function that reverses a string"
        environment: generic
        max_steps: 3
        exec_timeout: 30
      - id: test_2
        task: "Build a REST API client"
        environment: generic
        max_steps: 5
        exec_timeout: 60
```
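Loading such a pack might look like the sketch below; the `pack` keyword argument is an assumption, not a documented parameter:

```python
from rlm_code import RLMRunner  # assumed import path

runner = RLMRunner()
# Hypothetical invocation: point the runner at the custom pack and suite.
results = runner.run_benchmark("my_suite", pack="my_benchmarks.yaml")
```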
See Preset Benchmarks for all supported pack formats including Google ADK eval sets and generic datasets.
What's Next¶
| Page | Description |
|---|---|
| Preset Benchmarks | All 10 presets in detail, custom pack loading, YAML format |
| Leaderboard | Ranking, filtering, statistics, trend analysis, export |
| Session Replay | Recording, replaying, time-travel debugging, session comparison |