Leaderboard¶
The leaderboard aggregates results from benchmark runs and individual RLM runs, providing multi-metric ranking, filtering, statistical analysis, trend tracking, and export to multiple formats.
Module: rlm_code.rlm.leaderboard
Core Classes¶
Leaderboard¶
The central manager that loads results, applies rankings, and exports data.
from pathlib import Path

from rlm_code.rlm.leaderboard import Leaderboard

lb = Leaderboard(workdir=Path(".rlm_code"), auto_load=True)
| Parameter | Type | Default | Description |
|---|---|---|---|
| workdir | Path \| None | Path.cwd() / ".rlm_code" | Working directory containing results |
| auto_load | bool | True | Automatically load all results on construction |
LeaderboardEntry¶
A single entry in the leaderboard representing one benchmark run or individual run.
@dataclass
class LeaderboardEntry:
    # Identification
    entry_id: str                  # Short ID (first 16 chars of benchmark_id)
    benchmark_id: str              # Full benchmark identifier
    run_id: str | None = None      # Individual run ID (if from runs.jsonl)

    # Metadata
    environment: str = ""          # Environment name (dspy, generic, pure_rlm)
    model: str = ""                # Model identifier
    preset: str = ""               # Benchmark preset name
    timestamp: str = ""            # ISO timestamp
    description: str = ""          # Human-readable description

    # Core metrics
    avg_reward: float = 0.0        # Average reward across cases
    completion_rate: float = 0.0   # Fraction of completed cases (0.0 - 1.0)
    total_cases: int = 0           # Number of cases in the benchmark
    completed_cases: int = 0       # Number that completed
    avg_steps: float = 0.0         # Average steps per case

    # Token metrics
    total_tokens: int = 0          # Total tokens consumed
    prompt_tokens: int = 0         # Prompt/input tokens
    completion_tokens: int = 0     # Completion/output tokens

    # Cost and time
    estimated_cost: float = 0.0    # Estimated cost in USD
    duration_seconds: float = 0.0  # Total execution time

    # Computed metrics (auto-calculated in __post_init__)
    efficiency: float = 0.0        # Reward per 1,000 tokens
    tokens_per_step: float = 0.0   # total_tokens / avg_steps

    # Raw data
    source_path: str = ""          # Path to the source file
    tags: list[str] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)
Computed Metrics¶
Two metrics are automatically calculated in __post_init__:
def __post_init__(self) -> None:
    if self.total_tokens > 0:
        self.efficiency = (self.avg_reward * 1000) / self.total_tokens
    if self.avg_steps > 0:
        self.tokens_per_step = self.total_tokens / self.avg_steps
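For example, constructing an entry by hand (illustrative values only) shows how the derived fields fall out of the raw metrics:

from rlm_code.rlm.leaderboard import LeaderboardEntry

# Illustrative values only.
entry = LeaderboardEntry(
    entry_id="abc12345",
    benchmark_id="abc12345deadbeef",
    environment="dspy",
    avg_reward=0.9,
    avg_steps=4.0,
    total_tokens=1500,
)
print(entry.efficiency)       # (0.9 * 1000) / 1500 = 0.6
print(entry.tokens_per_step)  # 1500 / 4.0 = 375.0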
Factory Methods¶
| Method | Source | Description |
|---|---|---|
| from_benchmark_json(data, source_path) | Benchmark JSON file | Creates entry from full benchmark result data |
| from_run_jsonl(data, source_path) | runs.jsonl line | Creates entry from a single run record |
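As a sketch of how the run-level factory could be used directly, the loop below parses runs.jsonl by hand; the exact record schema comes from LocalJSONLSink, so treat the parsing step as an assumption. In practice, Leaderboard.load_runs() does this for you (see Data Loading below).

import json
from pathlib import Path

from rlm_code.rlm.leaderboard import LeaderboardEntry

# Build entries straight from the observability log (normally handled by load_runs()).
runs_file = Path(".rlm_code/observability/runs.jsonl")
entries = [
    LeaderboardEntry.from_run_jsonl(json.loads(line), source_path=str(runs_file))
    for line in runs_file.read_text().splitlines()
    if line.strip()
]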
RankingMetric¶
Enum of 7 available metrics for ranking:
class RankingMetric(Enum):
    REWARD = "reward"                    # Average reward (higher is better)
    COMPLETION_RATE = "completion_rate"  # % of completed runs (higher is better)
    STEPS = "steps"                      # Average steps (lower is better)
    TOKENS = "tokens"                    # Total tokens used (lower is better)
    COST = "cost"                        # Estimated cost (lower is better)
    DURATION = "duration"                # Execution time (lower is better)
    EFFICIENCY = "efficiency"            # Reward per 1,000 tokens (higher is better)
Each metric has a default sort direction:
| Metric | Higher is Better? | Default Sort |
|---|---|---|
| REWARD | Yes | Descending |
| COMPLETION_RATE | Yes | Descending |
| STEPS | No | Ascending |
| TOKENS | No | Ascending |
| COST | No | Ascending |
| DURATION | No | Ascending |
| EFFICIENCY | Yes | Descending |
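The default direction can be overridden through the order argument of rank() (described under Ranking below). A short sketch, assuming SortOrder is exported from the same module:

from rlm_code.rlm.leaderboard import Leaderboard, RankingMetric, SortOrder

lb = Leaderboard()

# Rank reward in ascending order to surface the weakest runs first.
worst_first = lb.rank(metric=RankingMetric.REWARD, order=SortOrder.ASCENDING, limit=5)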
LeaderboardFilter¶
Filters for narrowing down leaderboard queries.
@dataclass
class LeaderboardFilter:
    environments: list[str] | None = None      # Filter by environment names
    models: list[str] | None = None            # Filter by model identifiers
    presets: list[str] | None = None           # Filter by preset names
    tags: list[str] | None = None              # Filter by tags (any match)
    min_reward: float | None = None            # Minimum average reward
    max_reward: float | None = None            # Maximum average reward
    min_completion_rate: float | None = None   # Minimum completion rate
    date_from: datetime | None = None          # Earliest timestamp
    date_to: datetime | None = None            # Latest timestamp
    min_cases: int | None = None               # Minimum number of cases
The matches(entry) method checks all non-None filter fields against the entry. All conditions must pass (AND logic). For tags, any tag match suffices (OR within tags).
Example¶
from datetime import datetime, timezone
from rlm_code.rlm.leaderboard import LeaderboardFilter

filter = LeaderboardFilter(
    environments=["dspy", "pure_rlm"],
    min_reward=0.5,
    min_completion_rate=0.8,
    date_from=datetime(2025, 1, 1, tzinfo=timezone.utc),
)
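Given a loaded Leaderboard (lb), the filter can also be applied directly with matches(), for example to count matching entries before ranking:

# rank() applies the filter for you; matches() is handy for ad-hoc checks.
matching = [entry for entry in lb.entries if filter.matches(entry)]
print(f"{len(matching)} of {len(lb.entries)} entries match the filter")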
Ranking¶
Leaderboard.rank()¶
The primary ranking method.
def rank(
    self,
    metric: RankingMetric = RankingMetric.REWARD,
    order: SortOrder | None = None,
    limit: int | None = None,
    filter: LeaderboardFilter | None = None,
) -> RankingResult:
    ...
| Parameter | Type | Default | Description |
|---|---|---|---|
| metric | RankingMetric | REWARD | Metric to rank by |
| order | SortOrder \| None | Auto (based on metric) | ASCENDING or DESCENDING |
| limit | int \| None | None (all) | Maximum entries to return |
| filter | LeaderboardFilter \| None | None | Filter to apply |
Returns a RankingResult containing the ranked entries and statistics.
RankingResult¶
@dataclass
class RankingResult:
    entries: list[LeaderboardEntry]  # Ranked entries
    metric: RankingMetric            # The metric used for ranking
    order: SortOrder                 # The sort order applied
    total_count: int                 # Total entries (before filter)
    filtered_count: int              # Entries after filter

    # Statistics (auto-computed in __post_init__)
    mean: float = 0.0
    median: float = 0.0
    std_dev: float = 0.0
    min_value: float = 0.0
    max_value: float = 0.0
Example¶
from pathlib import Path

from rlm_code.rlm.leaderboard import Leaderboard, RankingMetric, LeaderboardFilter

lb = Leaderboard(workdir=Path(".rlm_code"))

result = lb.rank(
    metric=RankingMetric.EFFICIENCY,
    limit=10,
    filter=LeaderboardFilter(environments=["pure_rlm"]),
)

print(f"Showing {len(result.entries)}/{result.filtered_count} entries")
print(f"Mean efficiency: {result.mean:.4f}")
print(f"Median: {result.median:.4f}, Std Dev: {result.std_dev:.4f}")

for rank, entry in enumerate(result.entries, 1):
    print(f"#{rank} {entry.entry_id}: efficiency={entry.efficiency:.4f}, "
          f"reward={entry.avg_reward:.3f}, tokens={entry.total_tokens:,}")
Statistics¶
get_statistics()¶
Compute statistical summary for any metric.
stats = lb.get_statistics(
    metric=RankingMetric.REWARD,
    filter=LeaderboardFilter(environments=["dspy"]),
)
Returns:
{
    "count": 15,
    "mean": 0.7234,
    "median": 0.7500,
    "std_dev": 0.1523,
    "min": 0.3000,
    "max": 1.0000,
    "sum": 10.8510,
}
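The same call works for any metric, so a quick multi-metric summary is just a loop:

for metric in (RankingMetric.REWARD, RankingMetric.TOKENS, RankingMetric.COST):
    stats = lb.get_statistics(metric=metric)
    print(f"{metric.value}: mean={stats['mean']:.3f}, median={stats['median']:.3f}")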
Trend Analysis¶
compute_trend()¶
Compute a moving-average trend over time for any metric.
from rlm_code.rlm.leaderboard import compute_trend, RankingMetric

trend = compute_trend(
    entries=lb.entries,
    metric=RankingMetric.REWARD,
    window=5,  # 5-entry moving average
)

for point in trend:
    print(f"{point['timestamp']}: value={point['value']:.3f}, "
          f"moving_avg={point['moving_avg']:.3f}")
Returns a list of dicts:
[
    {
        "timestamp": "2025-05-15T10:00:00+00:00",
        "entry_id": "abc12345",
        "value": 0.75,
        "moving_avg": 0.75,
    },
    ...
]
The entries are sorted by timestamp, and the moving average window slides from the start.
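The moving average itself is just a trailing-window mean over those sorted values; a minimal sketch of the idea (not the library's implementation):

values = [0.75, 0.80, 0.60, 0.90, 0.85]  # metric values in timestamp order
window = 3

moving_avg = []
for i in range(len(values)):
    # Average over the current value and up to (window - 1) preceding values.
    window_values = values[max(0, i - window + 1): i + 1]
    moving_avg.append(sum(window_values) / len(window_values))

print(moving_avg)  # [0.75, 0.775, 0.7167, 0.7667, 0.7833] (rounded)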
Aggregation¶
aggregate_by_field()¶
Group entries by any field and compute per-group statistics.
from rlm_code.rlm.leaderboard import aggregate_by_field, RankingMetric

# Aggregate by environment
by_env = aggregate_by_field(
    entries=lb.entries,
    field="environment",
    metric=RankingMetric.REWARD,
)

for env, stats in by_env.items():
    print(f"{env}: count={stats['count']}, mean={stats['mean']:.3f}, "
          f"median={stats['median']:.3f}")
Returns:
{
    "dspy": {"count": 8, "mean": 0.72, "median": 0.75, "min": 0.3, "max": 1.0},
    "pure_rlm": {"count": 7, "mean": 0.68, "median": 0.70, "min": 0.2, "max": 0.95},
}
Useful field values: "environment", "model", "preset".
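For example, the same call with field="model" gives a quick per-model cost comparison:

by_model = aggregate_by_field(
    entries=lb.entries,
    field="model",
    metric=RankingMetric.COST,
)

# Cheapest models first.
for model, stats in sorted(by_model.items(), key=lambda kv: kv[1]["mean"]):
    print(f"{model}: mean cost ${stats['mean']:.4f} over {stats['count']} entries")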
Comparison¶
Leaderboard.compare()¶
Compare specific entries across multiple metrics side by side.
comparison = lb.compare(
    entry_ids=["abc12345", "def67890"],
    metrics=[RankingMetric.REWARD, RankingMetric.TOKENS, RankingMetric.EFFICIENCY],
)

for entry_id, data in comparison.items():
    print(f"{entry_id}:")
    for metric, value in data["metrics"].items():
        print(f"  {metric}: {value}")
Export¶
JSON¶
Output structure:
{
    "exported_at": "2025-05-15T10:30:00+00:00",
    "metric": "reward",
    "order": "desc",
    "total_entries": 50,
    "filtered_entries": 50,
    "statistics": {
        "mean": 0.72,
        "median": 0.75,
        "std_dev": 0.15,
        "min": 0.30,
        "max": 1.00
    },
    "entries": [...]
}
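The export is plain JSON, so it can be post-processed with the standard library. A short sketch, assuming an export written to results.json as in the CLI examples below:

import json
from pathlib import Path

# Read a leaderboard export produced with --format json.
data = json.loads(Path("results.json").read_text())
print(f"{data['filtered_entries']}/{data['total_entries']} entries, "
      f"ranked by {data['metric']} ({data['order']})")
print(f"mean={data['statistics']['mean']:.3f}, "
      f"std_dev={data['statistics']['std_dev']:.3f}")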
CSV¶
Columns: rank, entry_id, environment, model, preset, avg_reward, completion_rate, avg_steps, total_tokens, efficiency, timestamp.
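Because the column names are fixed, a csv.DictReader can consume the export directly; a brief sketch using results.csv from the CLI examples below:

import csv

# Print a few documented columns from a CSV export.
with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row["rank"], row["entry_id"], row["model"], row["avg_reward"])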
Markdown¶
Produces a full Markdown document with a table and statistics section:
# RLM Leaderboard
**Ranked by**: reward | **Entries**: 50/50
| Rank | ID | Environment | Reward | Completion | Steps | Tokens | Efficiency |
|------|-----|-------------|--------|------------|-------|--------|------------|
| 1 | abc12345 | dspy | 0.950 | 100% | 3.0 | 1,200 | 0.792 |
| 2 | def67890 | pure_rlm | 0.900 | 100% | 4.0 | 1,500 | 0.600 |
...
## Statistics
- **Mean**: 0.7234
- **Median**: 0.7500
- **Std Dev**: 0.1523
- **Range**: 0.3000 - 1.0000
Rich Table (Terminal)¶
from rich.console import Console

table = lb.format_rich_table(
    metric=RankingMetric.REWARD,
    limit=10,
    title="RLM Leaderboard",
)
Console().print(table)
The Rich table features:
- Color-coded reward values (green >= 0.7, yellow >= 0.4, red < 0.4)
- Color-coded completion rates (green >= 80%, yellow >= 50%, red < 50%)
- Caption showing entry count and mean value
- Right-aligned numeric columns
CLI Usage¶
# Default: show top 10 by reward
rlm-code leaderboard
# Rank by efficiency, show top 20
rlm-code leaderboard --metric efficiency --limit 20
# Filter by environment
rlm-code leaderboard --metric reward --environment dspy
# Filter by model
rlm-code leaderboard --metric tokens --model gpt-4o
# Export to JSON
rlm-code leaderboard --metric reward --format json --output-path results.json
# Export to CSV
rlm-code leaderboard --metric reward --format csv --output-path results.csv
# Export to Markdown
rlm-code leaderboard --metric reward --format markdown --output-path results.md
Data Loading¶
The Leaderboard automatically loads data from two sources:
Benchmark JSON Files¶
Located in .rlm_code/rlm/benchmarks/*.json. Each file contains a full benchmark result with case-level detail. The leaderboard uses LeaderboardEntry.from_benchmark_json() to parse these.
runs.jsonl¶
Located in .rlm_code/observability/runs.jsonl. Each line contains a single run result from the LocalJSONLSink. The leaderboard uses LeaderboardEntry.from_run_jsonl() to parse these. Runs that already appear in a benchmark file (matched by run_id) are skipped to avoid duplicates.
lb = Leaderboard(workdir=Path(".rlm_code"))
print(f"Loaded {len(lb.entries)} entries")
# Manual reload
count = lb.load_all()
print(f"Loaded {count} new entries")
# Load from specific paths
lb.load_benchmarks(benchmarks_dir=Path("custom/benchmarks"))
lb.load_runs(runs_file=Path("custom/runs.jsonl"))
Utility Methods¶
| Method | Returns | Description |
|---|---|---|
| add_entry(entry) | None | Manually add an entry |
| remove_entry(entry_id) | bool | Remove by ID |
| get_entry(entry_id) | LeaderboardEntry \| None | Retrieve by ID |
| get_unique_values(field) | list[str] | Get unique values for a field (useful for filter suggestions) |
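get_unique_values() pairs naturally with LeaderboardFilter, for example to discover which environments exist before building a filtered ranking:

# Offer the available environments as filter choices, then rank within one of them.
environments = lb.get_unique_values("environment")
print(environments)  # e.g. ["dspy", "generic", "pure_rlm"]

result = lb.rank(
    metric=RankingMetric.REWARD,
    filter=LeaderboardFilter(environments=environments[:1]),
)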