Reward Policies

Overview

Reward policies calculate scalar reward signals from action results, providing the feedback mechanism that drives the RLM agent's learning and decision-making. Each reward policy produces a RewardSignal containing both a numerical value and a detailed component breakdown, enabling transparency and debugging of the reward computation.

All reward policies inherit from RewardPolicy and implement the calculate() method.
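
A minimal end-to-end sketch, using the registry and types documented later on this page:

from rlm_code.rlm.policies import ActionResult, PolicyContext, PolicyRegistry

# Fetch a built-in policy and score one action
policy = PolicyRegistry.get_reward("default")
signal = policy.calculate(
    action={"action": "code", "code": "print('hello')"},
    result=ActionResult(action_type="code", success=True, output="hello"),
    context=PolicyContext(task="say hello"),
)
print(signal.value)        # scalar reward
print(signal.components)   # per-component breakdown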


Base Classes

RewardPolicy

class RewardPolicy(Policy):
    """Policy for calculating rewards from action results."""

    name = "reward_base"
    description = "Base reward policy"

    @abstractmethod
    def calculate(
        self,
        action: dict[str, Any],
        result: ActionResult,
        context: PolicyContext,
    ) -> RewardSignal:
        """Calculate reward for an action."""
        ...

    def on_episode_start(self, context: PolicyContext) -> None:
        """Called when an episode starts."""
        pass

    def on_episode_end(self, context: PolicyContext, total_reward: float) -> None:
        """Called when an episode ends."""
        pass

| Method | Description |
| --- | --- |
| calculate(action, result, context) | Required. Compute the reward from an action, its result, and the current context |
| on_episode_start(context) | Optional hook called at the beginning of an execution episode |
| on_episode_end(context, total_reward) | Optional hook called at the end of an execution episode |
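
The episode hooks are useful for keeping per-episode state. A minimal sketch of a hypothetical subclass that decays rewards as an episode runs (the class, its name, and the decay scheme are illustrative, not part of the library):

from typing import Any

from rlm_code.rlm.policies import ActionResult, PolicyContext, RewardPolicy, RewardSignal


class DecayingRewardPolicy(RewardPolicy):
    """Hypothetical policy: rewards shrink the longer an episode runs."""

    name = "decaying"
    description = "Decays rewards over the course of an episode"

    def on_episode_start(self, context: PolicyContext) -> None:
        self._actions_seen = 0  # reset the per-episode counter

    def calculate(
        self,
        action: dict[str, Any],
        result: ActionResult,
        context: PolicyContext,
    ) -> RewardSignal:
        self._actions_seen += 1
        base = 0.5 if result.success else -0.3
        value = base * (0.95 ** self._actions_seen)  # later actions count less
        return RewardSignal(
            value=value,
            components={"decayed_base": value},
            explanation=f"Decayed reward after {self._actions_seen} action(s)",
        )

    def on_episode_end(self, context: PolicyContext, total_reward: float) -> None:
        print(f"Episode finished with total reward {total_reward:.2f}")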

RewardSignal

The RewardSignal dataclass is the return type of every reward calculation:

@dataclass
class RewardSignal:
    """Reward signal from a reward policy."""

    value: float                                      # Scalar reward value
    components: dict[str, float] = field(default_factory=dict)  # Named component breakdown
    explanation: str = ""                              # Human-readable explanation

    @property
    def clamped(self) -> float:
        """Return value clamped to [-1, 1]."""
        return max(-1.0, min(1.0, self.value))

| Field | Type | Description |
| --- | --- | --- |
| value | float | The total reward value. All built-in policies clamp this to [-1, 1] |
| components | dict[str, float] | Named breakdown of reward components (e.g., {"success": 0.7, "error": -0.1}) |
| explanation | str | Human-readable explanation of the reward computation |

Component breakdowns

The components dictionary is invaluable for debugging and research. It lets you see exactly which factors contributed to the final reward value and by how much. The ResearchRewardPolicy produces the most granular breakdowns.
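
A minimal sketch of inspecting a breakdown (the component values here are made up for illustration):

from rlm_code.rlm.policies import RewardSignal

signal = RewardSignal(
    value=1.3,  # exceeds the usual range before clamping
    components={"base": 0.1, "success": 0.7, "final": 0.5},
    explanation="Successful final action",
)

# See which factors contributed, largest first
for name, contribution in sorted(signal.components.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>8}: {contribution:+.2f}")

print("raw value:", signal.value)    # 1.3
print("clamped:  ", signal.clamped)  # 1.0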

ActionResult

The ActionResult dataclass represents the outcome of executing an action:

@dataclass
class ActionResult:
    """Result of an action execution."""

    action_type: str                               # Type of action (e.g., "code", "final")
    success: bool                                  # Whether the action succeeded
    output: str = ""                               # Stdout/output from the action
    error: str | None = None                       # Error message if any
    duration_ms: float = 0.0                       # Execution time in milliseconds
    tokens_used: int = 0                           # Tokens consumed
    metadata: dict[str, Any] = field(default_factory=dict)  # Additional metadata

Built-in Implementations

DefaultRewardPolicy

Registration name: "default"

A balanced reward policy suitable for general-purpose use. Provides moderate rewards for success and moderate penalties for failure, with bonuses for completing the task.

from rlm_code.rlm.policies import PolicyRegistry

policy = PolicyRegistry.get_reward("default")

Default Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| success_bonus | 0.7 | Reward for a successful action |
| failure_penalty | 0.3 | Penalty for a failed action |
| partial_success_base | 0.3 | Base reward for partial success |
| stderr_penalty | 0.1 | Penalty when stderr/error output is present |
| final_bonus | 0.5 | Bonus for a successful final action |

Reward Calculation

The DefaultRewardPolicy computes reward as a sum of components:

total = base (0.1)
      + success_bonus (if success)  OR  - failure_penalty (if failure)
      - stderr_penalty (if error present)
      + final_bonus (if final action AND success)

# Example: successful code execution
signal = policy.calculate(
    action={"action": "code", "code": "print(42)"},
    result=ActionResult(action_type="code", success=True, output="42"),
    context=PolicyContext(task="compute answer"),
)
# signal.value = 0.8  (0.1 base + 0.7 success)
# signal.components = {"base": 0.1, "success": 0.7}

# Example: failed action with error
signal = policy.calculate(
    action={"action": "code", "code": "1/0"},
    result=ActionResult(action_type="code", success=False, error="ZeroDivisionError"),
    context=PolicyContext(task="compute answer"),
)
# signal.value = -0.3  (0.1 base - 0.3 failure - 0.1 error)
# signal.components = {"base": 0.1, "failure": -0.3, "error": -0.1}

Custom Configuration

policy = PolicyRegistry.get_reward("default", config={
    "success_bonus": 0.9,
    "failure_penalty": 0.5,
    "final_bonus": 0.8,
})

StrictRewardPolicy

Registration name: "strict"

A reward policy with heavy penalties for errors, designed for production environments where correctness is critical and errors must be strongly discouraged. The final bonus is only awarded when the action succeeds and has no errors.

policy = PolicyRegistry.get_reward("strict")

Default Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| success_bonus | 0.5 | Reward for a successful action |
| failure_penalty | 0.6 | Heavy penalty for failure |
| error_penalty | 0.3 | Additional penalty when error output is present |
| timeout_penalty | 0.4 | Extra penalty for timeout errors |
| final_bonus | 0.3 | Bonus for successful final action (only if error-free) |

Reward Calculation

total = + success_bonus (if success)  OR  - failure_penalty (if failure)
        - error_penalty (if error present)
        - timeout_penalty (if error contains "timeout")
        + final_bonus (if final action AND success AND no errors)

Strict penalties stack

A failed action with a timeout error receives the failure_penalty, the error_penalty, and the timeout_penalty, potentially reaching -1.0 (the minimum clamped value). This aggressive penalization is intentional for production-critical workloads.

# Example: failed action with timeout
signal = policy.calculate(
    action={"action": "code", "code": "time.sleep(300)"},
    result=ActionResult(
        action_type="code",
        success=False,
        error="Execution timeout after 60s",
    ),
    context=PolicyContext(task="compute answer"),
)
# signal.value = -1.0  (clamped from -1.3: -0.6 failure - 0.3 error - 0.4 timeout)
# signal.components = {"failure": -0.6, "error": -0.3, "timeout": -0.4}
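
For contrast, a successful, error-free final action earns both bonuses under the strict policy. The value matches the comparison table further below; the action payload and component names in the comments are indicative rather than exact:

# Example: successful, error-free final action
signal = policy.calculate(
    action={"action": "final", "answer": "42"},
    result=ActionResult(action_type="final", success=True, output="42"),
    context=PolicyContext(task="compute answer"),
)
# signal.value = 0.8  (0.5 success + 0.3 final)
# signal.components = {"success": 0.5, "final": 0.3}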

LenientRewardPolicy

Registration name: "lenient"

A forgiving reward policy that encourages exploration. Every action attempt is rewarded, failures receive only mild penalties, and producing substantial output earns a progress bonus. Ideal for research, experimentation, and early-stage development where exploration is more valuable than immediate correctness.

policy = PolicyRegistry.get_reward("lenient")

Default Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| attempt_bonus | 0.2 | Reward just for attempting any action |
| success_bonus | 0.5 | Reward for a successful action |
| failure_penalty | 0.1 | Mild penalty for failure |
| progress_bonus | 0.15 | Bonus when output exceeds 50 characters (suggests learning) |
| final_bonus | 0.4 | Bonus for any final action (success not required) |

Reward Calculation

total = + attempt_bonus (always)
        + success_bonus (if success)  OR  - failure_penalty (if failure)
        + progress_bonus (if output length > 50 chars)
        + final_bonus (if final action, regardless of success)

Exploration-friendly design

Unlike DefaultRewardPolicy and StrictRewardPolicy, the LenientRewardPolicy awards the final_bonus even when the final action fails. This encourages the agent to attempt answers rather than endlessly exploring. The progress_bonus rewards producing verbose output, which often correlates with the agent making progress on the task.

# Example: failed action with substantial output
signal = policy.calculate(
    action={"action": "code", "code": "complex_analysis()"},
    result=ActionResult(
        action_type="code",
        success=False,
        output="Partial results: computed 3 of 5 components..." * 3,
    ),
    context=PolicyContext(task="analyze data"),
)
# signal.value = 0.25  (0.2 attempt - 0.1 failure + 0.15 progress)
# signal.components = {"attempt": 0.2, "failure": -0.1, "progress": 0.15}
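
And, as noted above, a failed final action still collects the attempt and final bonuses. The value matches the comparison table further below; the action payload and component names are indicative:

# Example: failed final action
signal = policy.calculate(
    action={"action": "final", "answer": "unknown"},
    result=ActionResult(action_type="final", success=False, error="No answer produced"),
    context=PolicyContext(task="analyze data"),
)
# signal.value = 0.5  (0.2 attempt - 0.1 failure + 0.4 final)
# signal.components = {"attempt": 0.2, "failure": -0.1, "final": 0.4}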

ResearchRewardPolicy

Registration name: "research"

A research-focused reward policy that produces granular, multi-dimensional reward breakdowns. Designed for reward function research, paper analysis, and detailed performance profiling. Tracks code quality, output quality, execution efficiency, step efficiency, and more.

policy = PolicyRegistry.get_reward("research")

Default Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| Base components | | |
| base_attempt | 0.05 | Small reward for every attempt |
| base_success | 0.3 | Reward for success |
| base_failure | 0.2 | Penalty for failure |
| Code quality | | |
| code_length_bonus_per_100_chars | 0.02 | Bonus per 100 characters of code |
| code_length_cap | 0.1 | Maximum code length bonus |
| code_complexity_penalty_per_nest | 0.01 | Penalty per nesting level above 10 |
| Output quality | | |
| output_length_bonus_per_100_chars | 0.01 | Bonus per 100 characters of output |
| output_length_cap | 0.05 | Maximum output length bonus |
| error_keyword_penalty | 0.05 | Penalty for error keywords in output |
| Efficiency | | |
| fast_execution_bonus | 0.05 | Bonus for execution under 1 second |
| slow_execution_penalty | 0.05 | Penalty for execution over 10 seconds |
| Progress | | |
| step_penalty_per_step | 0.01 | Penalty per step taken (encourages efficiency) |
| early_termination_bonus | 0.1 | Bonus for finishing before half of max_steps |
| Final | | |
| final_success_bonus | 0.3 | Bonus for successful final action |
| final_failure_penalty | 0.1 | Penalty for failed final action |

Reward Calculation

The ResearchRewardPolicy computes up to 12 distinct components:

total = base_attempt (always)
      + base_success (if success)        OR  - base_failure (if failure)
      + code_length (capped bonus based on code size)
      - code_complexity (penalty if nesting > 10)
      + output_length (capped bonus based on output size)
      - error_keyword (if output contains "error", "exception", "traceback", "failed")
      + fast_execution (if duration < 1s)  OR  - slow_execution (if duration > 10s)
      - step_penalty (proportional to current step number)
      + final_success (if final and success)
      + early_termination (if final and success and step < max_steps/2)
      OR - final_failure (if final and not success)

# Example: fast, successful code execution at step 2
signal = policy.calculate(
    action={"action": "code", "code": "result = sum(range(100))"},
    result=ActionResult(
        action_type="code",
        success=True,
        output="4950",
        duration_ms=50.0,
    ),
    context=PolicyContext(task="compute sum", step=2, max_steps=10),
)
# signal.components might include:
# {
#     "base_attempt": 0.05,
#     "base_success": 0.3,
#     "code_length": 0.006,       # ~30 chars / 100 * 0.02
#     "output_length": 0.0004,    # 4 chars / 100 * 0.01
#     "fast_execution": 0.05,     # 50ms < 1000ms
#     "step_penalty": -0.02,      # step 2 * 0.01
# }

Research applications

The granular component breakdown is particularly useful for:

  • Ablation studies: Disable individual components to measure their impact (see the sketch after this list)
  • Reward shaping research: Adjust component weights to study learning dynamics
  • Performance profiling: Identify which reward dimensions are driving agent behavior
  • Paper reproduction: Match reward functions from published RLM research
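
One way to run such an ablation is to zero out a component's weight via config and compare breakdowns on the same action. A minimal sketch (config keys are those listed in the table above; setting a weight to 0.0 effectively disables that component):

from rlm_code.rlm.policies import ActionResult, PolicyContext, PolicyRegistry

result = ActionResult(action_type="code", success=True, output="ok", duration_ms=40.0)
context = PolicyContext(task="ablation demo", step=1, max_steps=10)
action = {"action": "code", "code": "compute()"}

full = PolicyRegistry.get_reward("research")
ablated = PolicyRegistry.get_reward("research", config={"fast_execution_bonus": 0.0})

# Compare the total reward and component breakdowns with and without the component
for label, policy in [("full", full), ("no fast_execution", ablated)]:
    signal = policy.calculate(action=action, result=result, context=context)
    print(f"{label:>18}: {signal.value:+.3f}  {signal.components}")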

Custom Configuration Example

# Emphasize code quality and efficiency
policy = PolicyRegistry.get_reward("research", config={
    "code_length_bonus_per_100_chars": 0.05,
    "code_length_cap": 0.2,
    "fast_execution_bonus": 0.15,
    "slow_execution_penalty": 0.15,
    "step_penalty_per_step": 0.03,
})

Comparing Reward Policies

The following table summarizes the key behavioral differences:

| Scenario | Default | Strict | Lenient | Research |
| --- | --- | --- | --- | --- |
| Successful action | +0.8 | +0.5 | +0.7 | ~+0.4 |
| Failed action | -0.2 | -0.6 | +0.1 | ~-0.15 |
| Failed with error | -0.3 | -0.9 | +0.1 | ~-0.2 |
| Successful final | +1.0 | +0.8 | +1.0 | ~+0.7 |
| Failed final | -0.2 | -0.6 | +0.5 | ~-0.25 |

Approximate values

The Research column shows approximate values because its calculations depend on additional factors like code length, output content, execution time, and step number. The values above assume a simple action at step 0 with no code or output.
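
To reproduce this comparison locally, score the same action result under each registered policy. A minimal sketch (the research row will differ slightly from the table here because the sample action includes code and output):

from rlm_code.rlm.policies import ActionResult, PolicyContext, PolicyRegistry

result = ActionResult(action_type="code", success=True, output="done")
context = PolicyContext(task="comparison demo", step=0, max_steps=10)

for name in ["default", "strict", "lenient", "research"]:
    policy = PolicyRegistry.get_reward(name)
    signal = policy.calculate(
        action={"action": "code", "code": "run()"},
        result=result,
        context=context,
    )
    print(f"{name:>8}: {signal.value:+.2f}  {signal.components}")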

When to Use Each Policy

| Policy | Best For |
| --- | --- |
| Default | General-purpose tasks, balanced exploration/exploitation |
| Strict | Production systems, safety-critical code, CI/CD pipelines |
| Lenient | Research, experimentation, early-stage prototyping |
| Research | Reward function research, ablation studies, paper analysis |

Creating a Custom Reward Policy

from rlm_code.rlm.policies import (
    RewardPolicy,
    PolicyRegistry,
    RewardSignal,
    ActionResult,
    PolicyContext,
)
from typing import Any


@PolicyRegistry.register_reward("domain_specific")
class DomainSpecificRewardPolicy(RewardPolicy):
    """Reward policy for a specific domain (e.g., SQL generation)."""

    name = "domain_specific"
    description = "Rewards correct SQL query generation"

    @classmethod
    def get_default_config(cls) -> dict[str, Any]:
        return {
            "syntax_valid_bonus": 0.3,
            "correct_result_bonus": 0.6,
            "injection_penalty": 0.9,
            "performance_bonus": 0.1,
        }

    def calculate(
        self,
        action: dict[str, Any],
        result: ActionResult,
        context: PolicyContext,
    ) -> RewardSignal:
        config = {**self.get_default_config(), **self.config}
        components = {}
        total = 0.0

        code = action.get("code", "")

        # Check for SQL injection patterns
        if any(kw in code.lower() for kw in ["drop table", "delete from", "; --"]):
            components["injection"] = -config["injection_penalty"]
            total -= config["injection_penalty"]

        # Reward valid SQL syntax
        if result.success and not result.error:
            components["syntax_valid"] = config["syntax_valid_bonus"]
            total += config["syntax_valid_bonus"]

        # Reward correct results
        expected = context.variables.get("expected_result")
        if expected and result.output.strip() == str(expected).strip():
            components["correct_result"] = config["correct_result_bonus"]
            total += config["correct_result_bonus"]

        # Performance bonus for fast, successful queries
        if result.success and result.duration_ms < 100:
            components["performance"] = config["performance_bonus"]
            total += config["performance_bonus"]

        return RewardSignal(
            value=max(-1.0, min(1.0, total)),
            components=components,
            explanation=f"SQL reward: {total:.2f} with {len(components)} components",
        )


# Use it
policy = PolicyRegistry.get_reward("domain_specific")
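
A minimal usage sketch (the sample SQL and expected_result are illustrative, and it assumes PolicyContext accepts a variables mapping, as implied by context.variables in the policy above):

signal = policy.calculate(
    action={"action": "code", "code": "SELECT COUNT(*) FROM users;"},
    result=ActionResult(action_type="code", success=True, output="42", duration_ms=12.0),
    context=PolicyContext(task="count users", variables={"expected_result": 42}),
)
# Expected: syntax_valid + correct_result + performance components
print(signal.value, signal.components, signal.explanation)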