CodeMode Evaluation¶
Evaluate strategy=codemode against strategy=tool_call on identical harness workloads.
Suggested workflow¶
/rlm bench preset=dynamic_web_filtering mode=harness strategy=tool_call mcp=on mcp_server=codemode
/rlm bench preset=dynamic_web_filtering mode=harness strategy=codemode mcp=on mcp_server=codemode
/rlm bench compare candidate=latest baseline=previous min_reward_delta=0.00 min_completion_delta=0.00 max_steps_increase=0.50
Metrics to review¶
avg_rewardcompletion_rateavg_steps- usage totals
codemode_chain_callscodemode_search_callscodemode_discovery_callscodemode_guardrail_blocked
Promotion guidance¶
Keep CodeMode opt-in unless it meets your gate thresholds with no policy regressions.
For full gating details, see: