Policy Lab¶
Overview¶
The Policy Lab is RLM Code's hot-swappable policy system for customizing every aspect of agent behavior at runtime. Rather than hard-coding decision logic into the execution engine, RLM Code delegates all critical behavioral decisions to pluggable policies that can be registered, swapped, and configured without modifying core code.
Policies govern how the agent learns (reward), what it does next (action selection), how it manages memory (compaction), and when it stops (termination). Each category has multiple built-in implementations and supports custom user-defined policies through a decorator-based registration system.
```python
from rlm_code.rlm.policies import (
    PolicyRegistry,
    RewardPolicy,
    ActionSelectionPolicy,
    CompactionPolicy,
    TerminationPolicy,
)

# Create a full policy suite from config
policies = PolicyRegistry.create_from_config({
    "reward": {"name": "research", "config": {"base_success": 0.4}},
    "action": {"name": "sampling", "config": {"temperature": 0.8}},
    "compaction": {"name": "hierarchical"},
    "termination": {"name": "final_pattern"},
})
```
The Four Policy Categories¶
RLM Code organizes policies into four distinct categories, each controlling a different aspect of agent execution:
| Category | Base Class | Purpose | Default |
|---|---|---|---|
| Reward | RewardPolicy | Calculate reward signals from action results | default |
| Action Selection | ActionSelectionPolicy | Choose the next action from candidates | greedy |
| Compaction | CompactionPolicy | Compress and summarize execution history | sliding_window |
| Termination | TerminationPolicy | Decide when to stop executing | final_pattern |
```mermaid
graph LR
    A[Agent Step] --> B{Action Selection Policy}
    B --> C[Execute Action]
    C --> D{Reward Policy}
    D --> E{Compaction Policy}
    E --> F{Termination Policy}
    F -->|Continue| A
    F -->|Stop| G[Final Answer]
```

Reward Policies¶
Reward policies compute a RewardSignal from an action and its result. The signal includes a scalar value (clamped to [-1, 1]) and a component-by-component breakdown for interpretability. Different reward policies trade off between strictness, exploration, and analytical detail.
Built-in implementations: default, strict, lenient, research
See Reward Policies for full documentation.
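To make the signal shape concrete, here is a minimal sketch of building a RewardSignal by summing named components and clamping the total. The component names are illustrative, not a built-in breakdown; a full registered policy appears in the Quick Start below.

```python
from rlm_code.rlm.policies import RewardSignal

# Illustrative component names; real policies define their own breakdown.
components = {"base_success": 0.4, "efficiency_bonus": 0.1, "error_penalty": -0.2}
total = sum(components.values())

signal = RewardSignal(
    value=max(-1.0, min(1.0, total)),  # scalar value clamped to [-1, 1]
    components=components,             # per-component breakdown for interpretability
    explanation="success with an efficiency bonus and a minor error penalty",
)
```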
Action Selection Policies¶
Action selection policies determine which action to execute next from a set of candidates. Strategies range from simple deterministic selection to sophisticated tree-search methods.
Built-in implementations: greedy, sampling, beam_search, mcts
See Action Selection Policies for full documentation.
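As a rough illustration of the idea behind the sampling strategy (a standalone sketch, not the actual ActionSelectionPolicy interface), temperature-scaled softmax sampling over scored candidates looks like this:

```python
import math
import random

def sample_action(candidates: list[tuple[str, float]], temperature: float = 0.8) -> str:
    """Softmax-sample one action from (action, score) pairs."""
    # Lower temperature approaches greedy selection; higher values explore more.
    scaled = [score / temperature for _, score in candidates]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    actions = [action for action, _ in candidates]
    return random.choices(actions, weights=weights, k=1)[0]
```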
Compaction Policies¶
Compaction policies manage the agent's execution history, compressing older entries to stay within token budgets while preserving important context. Strategies range from simple windowing to LLM-powered summarization.
Built-in implementations: llm, deterministic, sliding_window, hierarchical
See Compaction Policies for full documentation.
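For intuition, the windowing idea can be sketched as a plain function; the built-in sliding_window policy's exact behavior, entry schema, and config keys may differ.

```python
from typing import Any

def sliding_window_compact(
    history: list[dict[str, Any]], keep_last: int = 10
) -> list[dict[str, Any]]:
    """Keep the newest entries and replace older ones with a one-line stub."""
    if len(history) <= keep_last:
        return history
    dropped = len(history) - keep_last
    # Assumed entry shape with "type"/"content" keys, for illustration only.
    stub = {"type": "summary", "content": f"[{dropped} earlier entries compacted]"}
    return [stub, *history[-keep_last:]]
```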
Termination Policies¶
Termination policies determine when the agent should stop executing and return a final answer. They can trigger on explicit completion signals, reward thresholds, or confidence levels.
Built-in implementations: final_pattern, reward_threshold, confidence, composite
See Termination Policies for full documentation.
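As a sketch of the final_pattern idea (the real policy's pattern and interface are defined by TerminationPolicy; the marker regex here is an assumption):

```python
import re

# Assumed completion marker; the built-in policy defines its own pattern.
FINAL_RE = re.compile(r"FINAL(\s+ANSWER)?\s*:", re.IGNORECASE)

def should_stop(output: str, step: int, max_steps: int) -> bool:
    """Stop on an explicit final-answer marker or when the step budget is exhausted."""
    return bool(FINAL_RE.search(output)) or step + 1 >= max_steps
```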
PolicyRegistry¶
The PolicyRegistry is the central management hub for all policies. It provides:
- Decorator-based registration via `@PolicyRegistry.register_reward(name)`, etc.
- Lookup by name with `PolicyRegistry.get_reward(name, config)`
- Configuration-based instantiation via `PolicyRegistry.create_from_config(config_dict)`
- Discovery with `PolicyRegistry.list_all()` to enumerate all registered policies
```python
from rlm_code.rlm.policies import PolicyRegistry

# List everything registered
all_policies = PolicyRegistry.list_all()
# {
#     "reward": [{"name": "default", "description": "Balanced reward for general use"}, ...],
#     "action": [{"name": "greedy", "description": "Always select highest-scored action"}, ...],
#     "compaction": [...],
#     "termination": [...],
# }

# Get a specific policy instance
reward = PolicyRegistry.get_reward("strict", config={"failure_penalty": 0.8})

# Change defaults
PolicyRegistry.set_default_reward("research")
PolicyRegistry.set_default_action("sampling")
```
See Policy Registry for full documentation.
PolicyContext¶
Every policy method receives a PolicyContext dataclass that carries the full execution state. This provides policies with the information they need to make context-aware decisions.
```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class PolicyContext:
    """Context passed to policy methods."""

    task: str = ""       # The current task description
    step: int = 0        # Current step number (0-indexed)
    max_steps: int = 10  # Maximum allowed steps
    history: list[dict[str, Any]] = field(default_factory=list)  # Execution history entries
    variables: dict[str, Any] = field(default_factory=dict)      # Named variables from execution
    metrics: dict[str, float] = field(default_factory=dict)      # Runtime metrics (rewards, etc.)
    config: dict[str, Any] = field(default_factory=dict)         # Additional config
```
| Field | Type | Description |
|---|---|---|
| task | str | The task description the agent is working on |
| step | int | Current step number (0-indexed) |
| max_steps | int | Maximum number of steps allowed |
| history | list[dict] | Full execution history (actions, outputs, rewards) |
| variables | dict[str, Any] | Named variables accumulated during execution |
| metrics | dict[str, float] | Runtime metrics such as last_reward and cumulative totals |
| config | dict[str, Any] | Additional configuration passed through to policies |
Using PolicyContext in custom policies
The context.history field is particularly useful for policies that need to analyze past behavior. For example, a reward policy might penalize repeated failed actions, or a termination policy might detect convergence by examining recent reward trends.
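A minimal history-aware reward policy might look like the sketch below. It assumes history entries expose "action" and "success" keys; adjust the lookups to your actual history schema.

```python
from rlm_code.rlm.policies import PolicyRegistry, RewardPolicy, RewardSignal

@PolicyRegistry.register_reward("repeat_penalty")
class RepeatPenaltyRewardPolicy(RewardPolicy):
    name = "repeat_penalty"
    description = "Penalizes re-issuing actions that already failed"

    def calculate(self, action, result, context):
        base = 0.5 if result.success else -0.3
        # Assumed history entry keys ("action", "success"); illustrative only.
        past_failures = sum(
            1
            for entry in context.history
            if entry.get("action") == action and not entry.get("success", True)
        )
        penalty = -0.2 * past_failures
        value = max(-1.0, min(1.0, base + penalty))
        return RewardSignal(
            value=value,
            components={"base": base, "repeat_penalty": penalty},
            explanation=f"base={base:+.2f}, repeat_penalty={penalty:+.2f}",
        )
```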
Base Policy Class¶
All policies inherit from the Policy base class, which provides:
```python
from abc import ABC
from typing import Any

class Policy(ABC):
    """Base class for all policies."""

    name: str = "base"
    description: str = "Base policy"

    def __init__(self, config: dict[str, Any] | None = None):
        self.config = config or {}

    @classmethod
    def get_default_config(cls) -> dict[str, Any]:
        """Get default configuration for this policy."""
        return {}

    def validate_config(self) -> list[str]:
        """Validate configuration, return list of errors."""
        return []
```
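A subclass typically overrides get_default_config and validate_config together. Continuing from the Policy definition above, a hypothetical policy (shown only to illustrate the hooks) might do:

```python
class ThresholdPolicy(Policy):
    """Hypothetical policy illustrating the configuration hooks."""

    name = "threshold_example"
    description = "Illustrates get_default_config and validate_config"

    @classmethod
    def get_default_config(cls) -> dict[str, Any]:
        return {"threshold": 0.5}

    def validate_config(self) -> list[str]:
        errors = []
        threshold = self.config.get("threshold", 0.5)
        if not 0.0 <= threshold <= 1.0:
            errors.append("threshold must be in [0, 1]")
        return errors
```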
Configuration merging

All built-in policies merge user-provided config with defaults using the pattern `config = {**self.get_default_config(), **self.config}`, as in the custom policy example below.

This means you only need to override the specific parameters you want to change. Any unspecified parameters fall back to their defaults.

Quick Start¶
Using built-in policies¶
```python
from rlm_code.rlm.policies import PolicyRegistry

# Get policy instances (uses defaults if no name given)
reward_policy = PolicyRegistry.get_reward()  # DefaultRewardPolicy
action_policy = PolicyRegistry.get_action("sampling", config={"temperature": 0.5})
compaction_policy = PolicyRegistry.get_compaction("hierarchical")
termination_policy = PolicyRegistry.get_termination("final_pattern")
```
Creating a custom policy¶
```python
from rlm_code.rlm.policies import RewardPolicy, PolicyRegistry, RewardSignal

@PolicyRegistry.register_reward("my_custom_reward")
class MyCustomRewardPolicy(RewardPolicy):
    name = "my_custom_reward"
    description = "Domain-specific reward for my application"

    @classmethod
    def get_default_config(cls):
        return {"bonus_multiplier": 2.0}

    def calculate(self, action, result, context):
        config = {**self.get_default_config(), **self.config}
        value = 0.5 if result.success else -0.3
        value *= config["bonus_multiplier"]
        return RewardSignal(
            value=max(-1.0, min(1.0, value)),
            components={"base": value},
            explanation=f"Custom reward: {value:.2f}",
        )
```
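Once registered, the policy is available through the registry like any built-in:

```python
# Retrieve the custom policy by its registered name
reward_policy = PolicyRegistry.get_reward("my_custom_reward", config={"bonus_multiplier": 1.5})
```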
Configuration-driven setup¶
```yaml
# rlm_config.yaml
policies:
  reward:
    name: research
    config:
      base_success: 0.4
      fast_execution_bonus: 0.1
  action:
    name: sampling
    config:
      temperature: 0.7
      min_probability: 0.05
  compaction:
    name: llm
    config:
      max_entries_before_compact: 15
      preserve_last_n: 3
  termination:
    name: composite
    config:
      policies:
        - final_pattern
        - reward_threshold
```
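If you load the file yourself (using PyYAML here; whether RLM Code picks up rlm_config.yaml automatically is not covered in this section), the policies section maps directly onto PolicyRegistry.create_from_config:

```python
import yaml

from rlm_code.rlm.policies import PolicyRegistry

with open("rlm_config.yaml") as f:
    config = yaml.safe_load(f)

# The "policies" mapping has the same shape as the dict in the Overview example.
policies = PolicyRegistry.create_from_config(config["policies"])
```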