Termination Policies¶
Overview¶
Termination policies determine when the agent should stop executing and return a final answer. Without a termination policy, the agent would run until it exhausts its step budget. Termination policies detect completion signals -- explicit patterns in output, reward thresholds, confidence levels, or combinations thereof -- and extract the agent's final answer.
All termination policies inherit from TerminationPolicy and implement the should_terminate() method.
Base Class¶
TerminationPolicy¶
class TerminationPolicy(Policy):
    """Policy for determining when to terminate execution."""

    name = "termination_base"
    description = "Base termination policy"

    @abstractmethod
    def should_terminate(
        self,
        result: ActionResult,
        context: PolicyContext,
    ) -> tuple[bool, str | None]:
        """
        Check if execution should terminate.

        Args:
            result: Latest action result
            context: Execution context

        Returns:
            Tuple of (should_terminate, final_answer_if_any)
        """
        ...
| Method | Return Type | Description |
|---|---|---|
| `should_terminate(result, context)` | `(bool, str \| None)` | Returns whether to stop and the extracted final answer (or `None`) |
The return value is a tuple:
- First element (`bool`): `True` if execution should stop, `False` to continue
- Second element (`str | None`): The extracted final answer if terminating, or `None`
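For intuition, here is a minimal sketch of how an execution loop might consume this tuple. `run_episode` and `execute_action` are hypothetical names for illustration, not part of the library:

```python
def run_episode(policy, execute_action, context, max_steps: int = 20):
    """Hypothetical driver loop -- illustrative only, not the engine's actual code."""
    for _ in range(max_steps):
        result = execute_action(context)  # your execution engine's step function
        should_stop, answer = policy.should_terminate(result, context)
        if should_stop:
            return answer  # final answer extracted by the termination policy
    return None  # step budget exhausted without a termination signal
```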
Built-in Implementations¶
FinalPatternTerminationPolicy¶
Registration name: "final_pattern"
Detects FINAL() and FINAL_VAR() patterns in action output, matching the termination mechanism described in the RLM paper. This is the default termination policy. It uses configurable regex patterns to detect termination signals and extracts the final answer from the pattern match.
from rlm_code.rlm.policies import PolicyRegistry
policy = PolicyRegistry.get_termination("final_pattern")
Default Configuration¶
| Parameter | Default | Description |
|---|---|---|
| `final_patterns` | (see below) | List of regex patterns to detect termination |
| `case_sensitive` | `False` | Whether pattern matching is case-sensitive |
| `extract_answer` | `True` | Whether to extract the answer from the matched group |
Default patterns:
[
    r"FINAL\s*\(\s*['\"](.+?)['\"]\s*\)",      # FINAL('answer') or FINAL("answer")
    r"FINAL\s*\(\s*(.+?)\s*\)",                # FINAL(answer) without quotes
    r"FINAL_VAR\s*\(\s*['\"](\w+)['\"]\s*\)",  # FINAL_VAR('variable_name')
]
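To see what these patterns capture, you can test them directly with Python's `re` module; `re.IGNORECASE` mirrors the default `case_sensitive=False`:

```python
import re

pattern = r"FINAL\s*\(\s*['\"](.+?)['\"]\s*\)"
match = re.search(pattern, "Done computing. FINAL('42')", re.IGNORECASE)
print(match.group(1))  # -> 42 (the captured group becomes the final answer)
```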
Behavior¶
The policy checks for termination in two ways:
- **Action type check**: If `result.action_type == "final"`, terminates immediately with `result.output` as the answer.
- **Pattern matching**: Scans `result.output` against each pattern in `final_patterns`. On first match:
    - If `extract_answer` is True, extracts the captured group as the answer
    - For `FINAL_VAR` patterns, looks up the variable name in `context.variables` and returns its value
    - If `extract_answer` is False, returns the full output
from rlm_code.rlm.policies.base import ActionResult, PolicyContext

# Example: FINAL() in output
result = ActionResult(
    action_type="code",
    success=True,
    output="After computing, the answer is FINAL('42')",
)
context = PolicyContext(task="What is 6*7?")

should_stop, answer = policy.should_terminate(result, context)
# should_stop = True
# answer = "42"

# Example: FINAL_VAR() with variable lookup
result = ActionResult(
    action_type="code",
    success=True,
    output="FINAL_VAR('result')",
)
context = PolicyContext(
    task="Compute the sum",
    variables={"result": 4950},
)

should_stop, answer = policy.should_terminate(result, context)
# should_stop = True
# answer = "4950" (looked up from context.variables)
Paper compliance
The FINAL() and FINAL_VAR() patterns are the termination mechanism specified in the RLM paper. Using this policy ensures your agent follows the standard RLM protocol. The patterns are designed to be unambiguous even when embedded in natural language output.
Custom Pattern Configuration¶
You can add domain-specific termination patterns:
policy = PolicyRegistry.get_termination("final_pattern", config={
"final_patterns": [
r"FINAL\s*\(\s*['\"](.+?)['\"]\s*\)", # Standard FINAL()
r"ANSWER:\s*(.+?)$", # Custom: "ANSWER: 42"
r"SOLUTION\s*=\s*(.+?)$", # Custom: "SOLUTION = 42"
],
"case_sensitive": False,
"extract_answer": True,
})
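Assuming the policy applies these patterns with `re.search` as in the examples above, the custom `ANSWER:` pattern would then extract answers like this:

```python
from rlm_code.rlm.policies.base import ActionResult, PolicyContext

result = ActionResult(action_type="code", success=True, output="ANSWER: 42")
should_stop, answer = policy.should_terminate(result, PolicyContext(task="compute"))
# should_stop = True
# answer = "42"
```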
RewardThresholdTerminationPolicy¶
Registration name: "reward_threshold"
Terminates when the cumulative reward reaches a configurable threshold, or when the agent hits a streak of consecutive negative rewards. This policy is useful for optimization-focused tasks where you want the agent to stop once it has achieved a "good enough" result.
policy = PolicyRegistry.get_termination("reward_threshold", config={
    "min_reward_threshold": 0.9,
})
Default Configuration¶
| Parameter | Default | Description |
|---|---|---|
| `min_reward_threshold` | `0.8` | Cumulative reward threshold that triggers termination |
| `max_negative_streak` | `3` | Stop after this many consecutive negative rewards |
| `require_final_action` | `False` | If True, only terminate on final-type actions |
Behavior¶
The policy maintains two internal state variables:
- `_cumulative_reward`: Running sum of rewards from `context.metrics["last_reward"]`
- `_negative_streak`: Count of consecutive steps with negative reward
Termination triggers in two cases:
- **Reward threshold reached**: `_cumulative_reward >= min_reward_threshold`
    - If `require_final_action` is True, only terminates when `result.action_type == "final"`
    - Answer is the result output or a summary string
- **Negative streak**: `_negative_streak >= max_negative_streak`
    - Terminates with a message indicating consecutive failures
from rlm_code.rlm.policies.base import ActionResult, PolicyContext
# Step 1: reward = 0.4
context = PolicyContext(metrics={"last_reward": 0.4})
result = ActionResult(action_type="code", success=True, output="progress...")
should_stop, answer = policy.should_terminate(result, context)
# should_stop = False (cumulative = 0.4, threshold = 0.8)
# Step 2: reward = 0.5
context = PolicyContext(metrics={"last_reward": 0.5})
result = ActionResult(action_type="code", success=True, output="more progress...")
should_stop, answer = policy.should_terminate(result, context)
# should_stop = True (cumulative = 0.9, threshold = 0.8)
# answer = "Reward threshold reached: 0.90"
Call reset() to clear cumulative state between episodes:
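```python
policy.reset()  # clears _cumulative_reward and _negative_streak for the next episode
```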
Stateful policy
The RewardThresholdTerminationPolicy accumulates reward across calls. You must call reset() between independent episodes to avoid carrying over state. The cumulative reward is read from context.metrics["last_reward"] -- ensure your reward policy populates this field.
Require final action
Setting require_final_action=True prevents premature termination when the cumulative reward threshold is reached mid-computation. The agent will continue until it explicitly produces a final-type action, giving it the opportunity to formulate a proper answer.
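For example, a configuration along these lines would hold off until the agent emits a final-type action:

```python
policy = PolicyRegistry.get_termination("reward_threshold", config={
    "min_reward_threshold": 0.8,
    "require_final_action": True,  # keep going until an explicit final action
})
```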
ConfidenceTerminationPolicy¶
Registration name: "confidence"
Terminates when the model's self-reported confidence in its answer exceeds a threshold. Confidence is read from result.metadata. This policy also enforces a minimum number of steps before allowing termination, preventing the agent from short-circuiting on the first step.
policy = PolicyRegistry.get_termination("confidence", config={
    "confidence_threshold": 0.95,
    "min_steps_before_termination": 3,
})
Default Configuration¶
| Parameter | Default | Description |
|---|---|---|
| `confidence_threshold` | `0.85` | Minimum confidence level to trigger termination |
| `min_steps_before_termination` | `2` | Minimum steps before termination is allowed |
| `confidence_key` | `"confidence"` | Key to read from `result.metadata` |
| `fallback_to_final_pattern` | `True` | Fall back to `FinalPatternTerminationPolicy` if confidence is below threshold |
Behavior¶
- **Minimum steps check**: If `context.step < min_steps_before_termination`, always returns `(False, None)`
- **Confidence check**: Reads `result.metadata[confidence_key]`. If it meets the threshold, terminates with `result.output`
- **Fallback (optional)**: If `fallback_to_final_pattern` is True and confidence is below threshold, delegates to `FinalPatternTerminationPolicy` to check for explicit `FINAL()` patterns
from rlm_code.rlm.policies.base import ActionResult, PolicyContext

# Step 0: blocked by min_steps
result = ActionResult(
    action_type="code",
    success=True,
    output="42",
    metadata={"confidence": 0.99},
)
context = PolicyContext(step=0)

should_stop, answer = policy.should_terminate(result, context)
# should_stop = False (step 0 < min_steps 2)

# Step 3: confidence met
context = PolicyContext(step=3)
should_stop, answer = policy.should_terminate(result, context)
# should_stop = True
# answer = "42"

# Step 3: low confidence, but FINAL() in output (fallback)
result2 = ActionResult(
    action_type="code",
    success=True,
    output="I think it might be FINAL('42')",
    metadata={"confidence": 0.4},
)
should_stop, answer = policy.should_terminate(result2, PolicyContext(step=3))
# should_stop = True (via fallback to FinalPatternTerminationPolicy)
# answer = "42"
Populating confidence
The confidence value must be placed in result.metadata by the execution engine or the model itself. Common approaches include:
- Having the model output a confidence score alongside its answer
- Computing confidence from token probabilities (logprobs)
- Using an ensemble of models and measuring agreement
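As a concrete illustration of the first approach, an execution engine could prompt the model to append a line like `CONFIDENCE: 0.9` and parse it into the metadata. This is a hypothetical sketch: `build_result` and the `CONFIDENCE:` marker are assumptions, not part of the library.

```python
import re

from rlm_code.rlm.policies.base import ActionResult


def build_result(output: str) -> ActionResult:
    """Hypothetical helper: parse a self-reported 'CONFIDENCE: 0.9' score.

    Falls back to 0.0 when no score is present, so the policy's fallback
    (or another termination signal) takes over.
    """
    match = re.search(r"CONFIDENCE:\s*([01](?:\.\d+)?)", output)
    confidence = float(match.group(1)) if match else 0.0
    return ActionResult(
        action_type="code",
        success=True,
        output=output,
        metadata={"confidence": confidence},
    )
```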
CompositeTerminationPolicy¶
Registration name: "composite"
Combines multiple termination policies using either OR logic (terminate when any sub-policy triggers) or AND logic (terminate only when all sub-policies agree). This enables sophisticated termination conditions without writing custom code.
policy = PolicyRegistry.get_termination("composite", config={
    "policies": ["final_pattern", "reward_threshold"],
    "require_all": False,  # OR logic
})
Default Configuration¶
| Parameter | Default | Description |
|---|---|---|
| `policies` | `["final_pattern", "reward_threshold"]` | List of termination policy names to combine |
| `require_all` | `False` | `False` = OR logic (any triggers), `True` = AND logic (all must agree) |
Behavior¶
- **OR mode** (`require_all=False`): Iterates through sub-policies and terminates on the first one that triggers. Returns that sub-policy's answer.
- **AND mode** (`require_all=True`): All sub-policies must agree to terminate. Returns the first non-`None` answer from the agreeing policies.
# OR mode: stop on FINAL() or high reward
policy = PolicyRegistry.get_termination("composite", config={
    "policies": ["final_pattern", "reward_threshold"],
    "require_all": False,
})

# AND mode: only stop when both confidence and FINAL() agree
policy = PolicyRegistry.get_termination("composite", config={
    "policies": ["confidence", "final_pattern"],
    "require_all": True,
})
Use cases for composite termination
| Combination | Mode | Use Case |
|---|---|---|
| `final_pattern` + `reward_threshold` | OR | Standard RLM with safety net for stuck agents |
| `confidence` + `final_pattern` | AND | High-reliability tasks requiring both explicit answer and high confidence |
| `final_pattern` + `confidence` + `reward_threshold` | OR | Maximum flexibility -- any signal can trigger termination |
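The last row of the table, for instance, corresponds to:

```python
policy = PolicyRegistry.get_termination("composite", config={
    "policies": ["final_pattern", "confidence", "reward_threshold"],
    "require_all": False,  # any signal can trigger termination
})
```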
Comparison¶
| Policy | Stateful | Signal Source | Answer Extraction | Best For |
|---|---|---|---|---|
| FinalPattern | No | Output patterns | Regex capture groups | Standard RLM, paper compliance |
| RewardThreshold | Yes | Cumulative reward | Output or threshold message | Optimization tasks |
| Confidence | No | Metadata confidence | Output | Model-aware termination |
| Composite | Depends | Multiple sources | First non-None answer | Complex termination logic |
Decision Guide¶
Do you use FINAL() patterns?
    YES --> FinalPattern (default, paper-compliant)
    NO  --> Is reward-based stopping appropriate?
        YES --> RewardThreshold
        NO  --> Does your model report confidence?
            YES --> Confidence
            NO  --> Composite (combine multiple signals)

Need multiple termination conditions?
    --> Composite with OR mode (safety net)
    --> Composite with AND mode (high reliability)
Creating a Custom Termination Policy¶
import difflib
from typing import Any

from rlm_code.rlm.policies import (
    TerminationPolicy,
    PolicyRegistry,
    ActionResult,
    PolicyContext,
)


@PolicyRegistry.register_termination("convergence")
class ConvergenceTerminationPolicy(TerminationPolicy):
    """
    Terminate when the agent's outputs converge (stop changing).

    Detects when the last N outputs are similar, indicating
    the agent has reached a stable answer.
    """

    name = "convergence"
    description = "Stop when outputs converge (stabilize)"

    @classmethod
    def get_default_config(cls) -> dict[str, Any]:
        return {
            "window_size": 3,
            "similarity_threshold": 0.9,
            "min_steps": 3,
        }

    def __init__(self, config=None):
        super().__init__(config)
        self._recent_outputs: list[str] = []

    def should_terminate(
        self,
        result: ActionResult,
        context: PolicyContext,
    ) -> tuple[bool, str | None]:
        config = {**self.get_default_config(), **self.config}

        # Track outputs
        self._recent_outputs.append(result.output or "")

        # Minimum steps
        if context.step < config["min_steps"]:
            return False, None

        # Check last N outputs for convergence
        window = self._recent_outputs[-config["window_size"]:]
        if len(window) < config["window_size"]:
            return False, None

        # Convergence: every output in the window is similar enough to the
        # most recent one (difflib ratio >= similarity_threshold)
        latest = window[-1]
        threshold = config["similarity_threshold"]
        if all(
            difflib.SequenceMatcher(None, earlier, latest).ratio() >= threshold
            for earlier in window[:-1]
        ):
            return True, latest

        return False, None

    def reset(self):
        self._recent_outputs = []


# Use it
policy = PolicyRegistry.get_termination("convergence", config={
    "window_size": 4,
    "min_steps": 5,
})
})