Policy Lab¶
Overview¶
The Policy Lab is RLM Code's hot-swappable policy system for customizing every aspect of agent behavior at runtime. Rather than hard-coding decision logic into the execution engine, RLM Code delegates all critical behavioral decisions to pluggable policies that can be registered, swapped, and configured without modifying core code.
Policies govern how the agent learns (reward), what it does next (action selection), how it manages memory (compaction), and when it stops (termination). Each category has multiple built-in implementations and supports custom user-defined policies through a decorator-based registration system.
```python
from rlm_code.rlm.policies import (
    PolicyRegistry,
    RewardPolicy,
    ActionSelectionPolicy,
    CompactionPolicy,
    TerminationPolicy,
)

# Create a full policy suite from config
policies = PolicyRegistry.create_from_config({
    "reward": {"name": "research", "config": {"base_success": 0.4}},
    "action": {"name": "sampling", "config": {"temperature": 0.8}},
    "compaction": {"name": "hierarchical"},
    "termination": {"name": "final_pattern"},
})
```
The Four Policy Categories¶
RLM Code organizes policies into four distinct categories, each controlling a different aspect of agent execution:
| Category | Base Class | Purpose | Default |
|---|---|---|---|
| Reward | RewardPolicy | Calculate reward signals from action results | default |
| Action Selection | ActionSelectionPolicy | Choose the next action from candidates | greedy |
| Compaction | CompactionPolicy | Compress and summarize execution history | sliding_window |
| Termination | TerminationPolicy | Decide when to stop executing | final_pattern |
```mermaid
graph LR
    A[Agent Step] --> B{Action Selection Policy}
    B --> C[Execute Action]
    C --> D{Reward Policy}
    D --> E{Compaction Policy}
    E --> F{Termination Policy}
    F -->|Continue| A
    F -->|Stop| G[Final Answer]
```

Reward Policies¶
Reward policies compute a RewardSignal from an action and its result. The signal includes a scalar value (clamped to [-1, 1]) and a component-by-component breakdown for interpretability. Different reward policies trade off between strictness, exploration, and analytical detail.
Built-in implementations: default, strict, lenient, research
See Reward Policies for full documentation.
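To make the signal shape concrete, here is a minimal sketch of building a RewardSignal by summing named components and clamping the total. The component names are illustrative, not a built-in breakdown; a full registered policy appears in the Quick Start below.

```python
from rlm_code.rlm.policies import RewardSignal

# Illustrative component names; real policies define their own breakdown.
components = {"base_success": 0.4, "efficiency_bonus": 0.1, "error_penalty": -0.2}
total = sum(components.values())

signal = RewardSignal(
    value=max(-1.0, min(1.0, total)),  # scalar value clamped to [-1, 1]
    components=components,             # per-component breakdown for interpretability
    explanation="success with an efficiency bonus and a minor error penalty",
)
```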
Action Selection Policies¶
Action selection policies determine which action to execute next from a set of candidates. Strategies range from simple deterministic selection to sophisticated tree-search methods.
Built-in implementations: greedy, sampling, beam_search, mcts
See Action Selection Policies for full documentation.
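As a rough illustration of the idea behind the sampling strategy (a standalone sketch, not the actual ActionSelectionPolicy interface), temperature-scaled softmax sampling over scored candidates looks like this:

```python
import math
import random

def sample_action(candidates: list[tuple[str, float]], temperature: float = 0.8) -> str:
    """Softmax-sample one action from (action, score) pairs."""
    # Lower temperature approaches greedy selection; higher values explore more.
    scaled = [score / temperature for _, score in candidates]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    actions = [action for action, _ in candidates]
    return random.choices(actions, weights=weights, k=1)[0]
```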
Compaction Policies¶
Compaction policies manage the agent's execution history, compressing older entries to stay within token budgets while preserving important context. Strategies range from simple windowing to LLM-powered summarization.
Built-in implementations: llm, deterministic, sliding_window, hierarchical
See Compaction Policies for full documentation.
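For intuition, the windowing idea can be sketched as a plain function; the built-in sliding_window policy's exact behavior, entry schema, and config keys may differ.

```python
from typing import Any

def sliding_window_compact(
    history: list[dict[str, Any]], keep_last: int = 10
) -> list[dict[str, Any]]:
    """Keep the newest entries and replace older ones with a one-line stub."""
    if len(history) <= keep_last:
        return history
    dropped = len(history) - keep_last
    # Assumed entry shape with "type"/"content" keys, for illustration only.
    stub = {"type": "summary", "content": f"[{dropped} earlier entries compacted]"}
    return [stub, *history[-keep_last:]]
```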
Termination Policies¶
Termination policies determine when the agent should stop executing and return a final answer. They can trigger on explicit completion signals, reward thresholds, or confidence levels.
Built-in implementations: final_pattern, reward_threshold, confidence, composite
See Termination Policies for full documentation.
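As a sketch of the final_pattern idea (the real policy's pattern and interface are defined by TerminationPolicy; the marker regex here is an assumption):

```python
import re

# Assumed completion marker; the built-in policy defines its own pattern.
FINAL_RE = re.compile(r"FINAL(\s+ANSWER)?\s*:", re.IGNORECASE)

def should_stop(output: str, step: int, max_steps: int) -> bool:
    """Stop on an explicit final-answer marker or when the step budget is exhausted."""
    return bool(FINAL_RE.search(output)) or step + 1 >= max_steps
```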
PolicyRegistry¶
The PolicyRegistry is the central management hub for all policies. It provides:
- Decorator-based registration via `@PolicyRegistry.register_reward(name)`, etc.
- Lookup by name with `PolicyRegistry.get_reward(name, config)`
- Configuration-based instantiation via `PolicyRegistry.create_from_config(config_dict)`
- Discovery with `PolicyRegistry.list_all()` to enumerate all registered policies
```python
from rlm_code.rlm.policies import PolicyRegistry

# List everything registered
all_policies = PolicyRegistry.list_all()
# {
#     "reward": [{"name": "default", "description": "Balanced reward for general use"}, ...],
#     "action": [{"name": "greedy", "description": "Always select highest-scored action"}, ...],
#     "compaction": [...],
#     "termination": [...],
# }

# Get a specific policy instance
reward = PolicyRegistry.get_reward("strict", config={"failure_penalty": 0.8})

# Change defaults
PolicyRegistry.set_default_reward("research")
PolicyRegistry.set_default_action("sampling")
```
See Policy Registry for full documentation.
PolicyContext¶
Every policy method receives a PolicyContext dataclass that carries the full execution state. This provides policies with the information they need to make context-aware decisions.
```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class PolicyContext:
    """Context passed to policy methods."""

    task: str = ""       # The current task description
    step: int = 0        # Current step number (0-indexed)
    max_steps: int = 10  # Maximum allowed steps
    history: list[dict[str, Any]] = field(default_factory=list)  # Execution history entries
    variables: dict[str, Any] = field(default_factory=dict)      # Named variables from execution
    metrics: dict[str, float] = field(default_factory=dict)      # Runtime metrics (rewards, etc.)
    config: dict[str, Any] = field(default_factory=dict)         # Additional config
```
| Field | Type | Description |
|---|---|---|
| task | str | The task description the agent is working on |
| step | int | Current step number (0-indexed) |
| max_steps | int | Maximum number of steps allowed |
| history | list[dict] | Full execution history (actions, outputs, rewards) |
| variables | dict[str, Any] | Named variables accumulated during execution |
| metrics | dict[str, float] | Runtime metrics such as last_reward and cumulative totals |
| config | dict[str, Any] | Additional configuration passed through to policies |
Using PolicyContext in custom policies
The context.history field is particularly useful for policies that need to analyze past behavior. For example, a reward policy might penalize repeated failed actions, or a termination policy might detect convergence by examining recent reward trends.
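A minimal history-aware reward policy might look like the sketch below. It assumes history entries expose "action" and "success" keys; adjust the lookups to your actual history schema.

```python
from rlm_code.rlm.policies import PolicyRegistry, RewardPolicy, RewardSignal

@PolicyRegistry.register_reward("repeat_penalty")
class RepeatPenaltyRewardPolicy(RewardPolicy):
    name = "repeat_penalty"
    description = "Penalizes re-issuing actions that already failed"

    def calculate(self, action, result, context):
        base = 0.5 if result.success else -0.3
        # Assumed history entry keys ("action", "success"); illustrative only.
        past_failures = sum(
            1
            for entry in context.history
            if entry.get("action") == action and not entry.get("success", True)
        )
        penalty = -0.2 * past_failures
        value = max(-1.0, min(1.0, base + penalty))
        return RewardSignal(
            value=value,
            components={"base": base, "repeat_penalty": penalty},
            explanation=f"base={base:+.2f}, repeat_penalty={penalty:+.2f}",
        )
```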
Base Policy Class¶
All policies inherit from the Policy base class, which provides:
```python
from abc import ABC
from typing import Any

class Policy(ABC):
    """Base class for all policies."""

    name: str = "base"
    description: str = "Base policy"

    def __init__(self, config: dict[str, Any] | None = None):
        self.config = config or {}

    @classmethod
    def get_default_config(cls) -> dict[str, Any]:
        """Get default configuration for this policy."""
        return {}

    def validate_config(self) -> list[str]:
        """Validate configuration, return list of errors."""
        return []
```
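A subclass typically overrides get_default_config and validate_config together. Continuing from the Policy definition above, a hypothetical policy (shown only to illustrate the hooks) might do:

```python
class ThresholdPolicy(Policy):
    """Hypothetical policy illustrating the configuration hooks."""

    name = "threshold_example"
    description = "Illustrates get_default_config and validate_config"

    @classmethod
    def get_default_config(cls) -> dict[str, Any]:
        return {"threshold": 0.5}

    def validate_config(self) -> list[str]:
        errors = []
        threshold = self.config.get("threshold", 0.5)
        if not 0.0 <= threshold <= 1.0:
            errors.append("threshold must be in [0, 1]")
        return errors
```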
Configuration merging

All built-in policies merge user-provided config with defaults using the pattern `config = {**self.get_default_config(), **self.config}`, as in the custom policy example below.

This means you only need to override the specific parameters you want to change. Any unspecified parameters fall back to their defaults.

Quick Start¶
Using built-in policies¶
```python
from rlm_code.rlm.policies import PolicyRegistry

# Get policy instances (uses defaults if no name given)
reward_policy = PolicyRegistry.get_reward()  # DefaultRewardPolicy
action_policy = PolicyRegistry.get_action("sampling", config={"temperature": 0.5})
compaction_policy = PolicyRegistry.get_compaction("hierarchical")
termination_policy = PolicyRegistry.get_termination("final_pattern")
```
Creating a custom policy¶
```python
from rlm_code.rlm.policies import RewardPolicy, PolicyRegistry, RewardSignal

@PolicyRegistry.register_reward("my_custom_reward")
class MyCustomRewardPolicy(RewardPolicy):
    name = "my_custom_reward"
    description = "Domain-specific reward for my application"

    @classmethod
    def get_default_config(cls):
        return {"bonus_multiplier": 2.0}

    def calculate(self, action, result, context):
        config = {**self.get_default_config(), **self.config}
        value = 0.5 if result.success else -0.3
        value *= config["bonus_multiplier"]
        return RewardSignal(
            value=max(-1.0, min(1.0, value)),
            components={"base": value},
            explanation=f"Custom reward: {value:.2f}",
        )
```
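Once registered, the policy is available through the registry like any built-in:

```python
# Retrieve the custom policy by its registered name
reward_policy = PolicyRegistry.get_reward("my_custom_reward", config={"bonus_multiplier": 1.5})
```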
Configuration-driven setup¶
```yaml
# rlm_config.yaml
policies:
  reward:
    name: research
    config:
      base_success: 0.4
      fast_execution_bonus: 0.1
  action:
    name: sampling
    config:
      temperature: 0.7
      min_probability: 0.05
  compaction:
    name: llm
    config:
      max_entries_before_compact: 15
      preserve_last_n: 3
  termination:
    name: composite
    config:
      policies:
        - final_pattern
        - reward_threshold
```
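If you load the file yourself (using PyYAML here; whether RLM Code picks up rlm_config.yaml automatically is not covered in this section), the policies section maps directly onto PolicyRegistry.create_from_config:

```python
import yaml

from rlm_code.rlm.policies import PolicyRegistry

with open("rlm_config.yaml") as f:
    config = yaml.safe_load(f)

# The "policies" mapping has the same shape as the dict in the Overview example.
policies = PolicyRegistry.create_from_config(config["policies"])
```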