CodeMode Evaluation & Promotion Gates
Use this page to evaluate strategy=codemode against baseline strategy=tool_call before wider rollout.
Objective
Compare harness strategies on identical workloads:
- baseline: strategy=tool_call
- candidate: strategy=codemode
The promotion decision should be metric-driven and safety-aware.
Evaluation Protocol
- Select the same preset and same case limit for both runs.
- Keep model, MCP server, and environment fixed.
- Run baseline first, candidate second.
- Compare the two runs against the gate thresholds.
Recommended starting preset for MCP-heavy behavior: dynamic_web_filtering
Commands
Baseline
Candidate
Compare
/rlm bench compare candidate=latest baseline=previous min_reward_delta=0.00 min_completion_delta=0.00 max_steps_increase=0.50
CI-style gate
/rlm bench validate candidate=latest baseline=previous min_reward_delta=0.00 min_completion_delta=0.00 max_steps_increase=0.50 fail_on_completion_regression=on --json
Metrics to Watch
Core benchmark metrics:
- avg_reward
- completion_rate
- avg_steps
- usage totals (total_calls, prompt_tokens, completion_tokens)
CodeMode-specific diagnostics (per case):
- harness_strategy
- codemode_chain_calls
- codemode_search_calls
- codemode_discovery_calls
- codemode_guardrail_blocked
- mcp_tool_calls
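The per-case diagnostics above are easier to review once they are summed across a run. The following is a minimal sketch, not part of the tool itself: it assumes each case payload is a plain dict keyed by the field names listed above, and that codemode_guardrail_blocked is either a boolean or a count. The function name total_diagnostics is hypothetical.

```python
from collections import Counter

# Field names taken from the per-case diagnostics list; the payload shape
# (a flat dict per case) is an assumption made for this sketch.
COUNTERS = (
    "codemode_chain_calls",
    "codemode_search_calls",
    "codemode_discovery_calls",
    "codemode_guardrail_blocked",
    "mcp_tool_calls",
)


def total_diagnostics(cases: list[dict]) -> Counter:
    """Sum each CodeMode counter over all cases; missing fields count as 0."""
    totals = Counter({name: 0 for name in COUNTERS})
    for case in cases:
        for name in COUNTERS:
            # int() also turns a boolean flag into 0 or 1.
            totals[name] += int(case.get(name) or 0)
    return totals
```

A non-zero codemode_guardrail_blocked total is a prompt to read the affected case logs before making a promotion decision.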
Suggested Promotion Criteria
Use these as default release criteria unless your team has stricter requirements.
| Gate | Recommended threshold |
|---|---|
| Reward delta | >= 0.00 |
| Completion delta | >= 0.00 |
| Steps increase | <= 0.50 |
| Completion regressions | 0 (enforce fail_on_completion_regression=on) |
| Safety | No unexplained policy failures in case logs |
If the candidate fails any gate, keep the default on tool_call and continue to offer CodeMode as opt-in only.
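The numeric gates in the table can be expressed as a small check, which may help when reviewing results by hand. This is a sketch under stated assumptions rather than the logic of the compare or validate commands: it assumes each delta is the candidate value minus the baseline value as an absolute difference, and that the number of completion regressions is supplied by the caller. Summary, evaluate_gates, and decide are hypothetical names. The Safety gate needs a manual review of the case logs and is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Hypothetical container for the three core metrics of one run."""
    avg_reward: float
    completion_rate: float
    avg_steps: float


def evaluate_gates(
    baseline: Summary,
    candidate: Summary,
    completion_regressions: int,
    min_reward_delta: float = 0.00,
    min_completion_delta: float = 0.00,
    max_steps_increase: float = 0.50,
) -> dict[str, bool]:
    """Return one pass/fail flag per numeric gate (True means passed)."""
    # Assumption: every delta is candidate minus baseline, not a ratio.
    reward_delta = candidate.avg_reward - baseline.avg_reward
    completion_delta = candidate.completion_rate - baseline.completion_rate
    steps_increase = candidate.avg_steps - baseline.avg_steps
    return {
        "reward_delta": reward_delta >= min_reward_delta,
        "completion_delta": completion_delta >= min_completion_delta,
        "steps_increase": steps_increase <= max_steps_increase,
        "completion_regressions": completion_regressions == 0,
    }


def decide(gates: dict[str, bool]) -> str:
    """Promote only when every numeric gate passes; otherwise hold."""
    return "promote" if all(gates.values()) else "hold"
```

Returning a flag per gate, rather than a single boolean, keeps the failing gate visible in the release record.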
Reading Summary Files
Benchmark summaries are stored under .rlm_code/rlm/benchmarks/*.json.
For harness runs, summary-level fields include:
- mode
- mcp_enabled
- mcp_server
- harness_strategy
Case payloads include the CodeMode telemetry listed above.
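To inspect several summaries at once, a short script can collect the summary-level fields from each file. The sketch below assumes each summary file holds a single JSON object with those fields at the top level; the actual layout may differ, so treat the key lookups as placeholders. The function name load_summaries is hypothetical.

```python
import json
from pathlib import Path

# Summary-level fields named in this section.
SUMMARY_FIELDS = ("mode", "mcp_enabled", "mcp_server", "harness_strategy")


def load_summaries(root: Path = Path(".rlm_code/rlm/benchmarks")) -> list[dict]:
    """Collect the summary-level fields from every *.json file under root."""
    rows = []
    for path in sorted(root.glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(data, dict):
            # Assumption: a summary is a JSON object; skip anything else.
            continue
        row = {"file": path.name}
        for field in SUMMARY_FIELDS:
            row[field] = data.get(field)  # None when the field is absent
        rows.append(row)
    return rows
```

Checking that the baseline row reports harness_strategy as tool_call and the candidate row reports codemode, with the same mcp_server, is a quick way to confirm that the two runs are comparable.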
Release Decision Template
Use this lightweight checklist for launch approval:
- Baseline benchmark ID:
- Candidate benchmark ID:
- Reward delta:
- Completion delta:
- Steps increase:
- Completion regressions:
- Guardrail blocked count:
- Decision: promote or hold
- Owner + date: