Paradigm Comparison¶
Module: `rlm_code.rlm.comparison`
The paradigm comparison module enables side-by-side empirical comparison of different RLM approaches on the same task. It directly addresses the debate around whether RLM provides real benefits over simpler approaches by measuring token usage, cost, execution time, and accuracy.
For concept-level guidance on when to use each execution style, see Execution Patterns.
Overview¶
Three paradigms are compared:
| Paradigm | How Context is Handled | Token Profile |
|---|---|---|
| Pure RLM | Context stored as REPL variable; LLM sees only metadata | Low context tokens, moderate total tokens |
| CodeAct | Context included directly in the token window | High context tokens, variable total tokens |
| Traditional | Context written to file, accessed via tools | Medium context tokens (partial reads) |
The comparison runs each paradigm on the same task and context, collecting detailed metrics for head-to-head analysis.
Classes¶
Paradigm¶
Enumeration of RLM paradigms for comparison.
| Value | Description | Environment Used |
|---|---|---|
| `PURE_RLM` | Paper-compliant context-as-variable | `pure_rlm` |
| `CODEACT` | Context in token window | `generic` |
| `TRADITIONAL` | Tool-based file access | `dspy` |
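For instance, a comparison can be limited to a subset of paradigms by passing the enum values to `compare()`. A minimal sketch, assuming `comparator`, `task`, and `context` are set up as in the Complete Usage Example below:
from rlm_code.rlm.comparison import Paradigm
# Skip the tool-based baseline and compare only Pure RLM against CodeAct.
result = comparator.compare(
    task=task,
    context=context,
    paradigms=[Paradigm.PURE_RLM, Paradigm.CODEACT],
)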
ParadigmResult¶
Result from running a task under a specific paradigm.
@dataclass
class ParadigmResult:
    paradigm: Paradigm
    success: bool
    answer: str
    # Token metrics
    context_tokens: int = 0
    total_tokens: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0
    # Cost metrics
    estimated_cost: float = 0.0
    # Time metrics
    duration_seconds: float = 0.0
    iterations: int = 0
    # Quality metrics (if ground truth available)
    accuracy: float | None = None
    f1_score: float | None = None
    # LLM call breakdown
    root_llm_calls: int = 0
    sub_llm_calls: int = 0
    # Event trace
    events: list[dict[str, Any]] = field(default_factory=list)
    # Error info
    error: str | None = None
Field Reference¶
| Field | Type | Description |
|---|---|---|
| `paradigm` | Paradigm | Which paradigm was used |
| `success` | bool | Whether the task completed successfully |
| `answer` | str | The final answer produced |
| `context_tokens` | int | Tokens consumed by context (metadata-only for Pure RLM, full for CodeAct) |
| `total_tokens` | int | Total tokens used across all LLM calls |
| `prompt_tokens` | int | Total prompt (input) tokens |
| `completion_tokens` | int | Total completion (output) tokens |
| `estimated_cost` | float | Estimated cost in USD |
| `duration_seconds` | float | Wall-clock execution time |
| `iterations` | int | Number of RLM iterations |
| `accuracy` | float \| None | Accuracy score (0.0-1.0) if ground truth available |
| `f1_score` | float \| None | F1 score if ground truth available |
| `root_llm_calls` | int | Number of root-level LLM calls (one per iteration) |
| `sub_llm_calls` | int | Number of sub-LLM calls via `llm_query()` |
| `events` | list[dict] | Full event trace from `RLMEventCollector` |
| `error` | str \| None | Error message if the paradigm failed |
to_dict()¶
Serialize to dictionary (answer truncated to 500 characters).
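A minimal sketch of persisting a result, assuming a `ParadigmResult` instance named `paradigm_result` and that the serialized values are JSON-compatible:
import json
# The dict mirrors the fields above; the answer field is capped at 500 characters.
with open("paradigm_result.json", "w") as f:
    json.dump(paradigm_result.to_dict(), f, indent=2)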
ComparisonResult¶
Aggregated result of comparing multiple paradigms on the same task.
@dataclass
class ComparisonResult:
    comparison_id: str
    task: str
    context_length: int
    # Results by paradigm
    results: dict[Paradigm, ParadigmResult] = field(default_factory=dict)
    # Timing
    started_at: str = field(...)
    finished_at: str = ""
    total_duration_seconds: float = 0.0
    # Ground truth (if available)
    ground_truth: str | None = None
Methods¶
add_result(result)¶
Add a ParadigmResult to the comparison.
get_winner(metric="total_tokens") -> Paradigm | None¶
Get the winning paradigm for a given metric.
winner = comparison.get_winner("total_tokens") # Lower is better
winner = comparison.get_winner("estimated_cost") # Lower is better
winner = comparison.get_winner("duration_seconds") # Lower is better
winner = comparison.get_winner("accuracy") # Higher is better
Winner Selection
Only paradigms with success=True are considered. For accuracy, higher is better; for all other metrics, lower is better.
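The selection rule amounts to the following sketch (an illustrative reimplementation, not the module's actual code):
def pick_winner(results: dict, metric: str = "total_tokens"):
    # Only paradigms that completed successfully are eligible.
    candidates = {p: r for p, r in results.items() if r.success}
    if not candidates:
        return None
    # Higher accuracy wins; for every other metric, lower wins.
    if metric == "accuracy":
        return max(candidates, key=lambda p: candidates[p].accuracy or 0.0)
    return min(candidates, key=lambda p: getattr(candidates[p], metric))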
get_summary() -> dict[str, Any]¶
Get a structured comparison summary with metrics grouped by paradigm and winners identified.
summary = comparison.get_summary()
# {
# "comparison_id": "abc12345",
# "task": "Analyze sentiment...",
# "context_length": 45230,
# "paradigms_tested": ["pure_rlm", "codeact", "traditional"],
# "total_duration_seconds": 45.2,
# "total_tokens_by_paradigm": {"pure_rlm": 5200, "codeact": 12400, "traditional": 8100},
# "total_tokens_winner": "pure_rlm",
# "estimated_cost_by_paradigm": {...},
# "estimated_cost_winner": "pure_rlm",
# ...
# }
format_table() -> str¶
Format the comparison as an ASCII table for terminal display.
Example output:
======================================================================
PARADIGM COMPARISON: Analyze sentiment of customer reviews
======================================================================
Metric pure_rlm codeact traditional
----------------------------------------------------------------------
Context Tokens 200 11,308 5,654
Total Tokens 5,200 12,400 8,100
Est. Cost $0.0260 $0.0620 $0.0405
Duration 12.30s 8.50s 15.20s
Iterations 3 2 4
Root LLM Calls 3 2 4
Sub LLM Calls 5 0 0
Accuracy 85.0% 82.0% 78.0%
Success True True True
----------------------------------------------------------------------
WINNERS:
Lowest Tokens: pure_rlm
Lowest Cost: pure_rlm
Fastest: codeact
======================================================================
ParadigmComparator¶
The orchestrator that runs side-by-side paradigm comparisons.
class ParadigmComparator:
    def __init__(
        self,
        runner: Any,  # RLMRunner instance
        event_bus: RLMEventBus | None = None,
    ):
| Parameter | Type | Default | Description |
|---|---|---|---|
| `runner` | RLMRunner | required | The RLM runner to execute tasks |
| `event_bus` | RLMEventBus \| None | Auto-created | Event bus for comparison events |
compare()¶
Run a comparison across paradigms.
def compare(
    self,
    task: str,
    context: str,
    paradigms: list[Paradigm] | None = None,
    ground_truth: str | None = None,
    max_steps: int = 5,
    exec_timeout: int = 60,
) -> ComparisonResult:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `task` | str | required | The task to perform |
| `context` | str | required | The context to analyze |
| `paradigms` | list[Paradigm] \| None | All three | Paradigms to test |
| `ground_truth` | str \| None | None | Expected answer for accuracy calculation |
| `max_steps` | int | 5 | Maximum steps per paradigm |
| `exec_timeout` | int | 60 | Timeout per execution in seconds |
Execution flow:

- Emit `COMPARISON_START` event
- For each paradigm:
    - Emit `COMPARISON_PARADIGM_START` event
    - Subscribe an `RLMEventCollector` to capture events
    - Run the task using the appropriate paradigm strategy
    - Calculate accuracy against ground truth (if provided)
    - Build `ParadigmResult` with all metrics
    - Emit `COMPARISON_PARADIGM_END` event
- Emit `COMPARISON_END` event with summary
- Return `ComparisonResult`
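A simplified sketch of that loop (event emission, token accounting, and accuracy scoring are elided; the strategy-method signatures are assumptions, since only their names are documented below):
def compare_sketch(comparator, task, context, paradigms):
    # Illustrative only: not the module's actual implementation.
    comparison = ComparisonResult(
        comparison_id="...", task=task, context_length=len(context)
    )
    strategies = {
        Paradigm.PURE_RLM: comparator._run_pure_rlm,
        Paradigm.CODEACT: comparator._run_codeact,
        Paradigm.TRADITIONAL: comparator._run_traditional,
    }
    for paradigm in paradigms:
        comparison.add_result(strategies[paradigm](task, context))  # assumed signature
    return comparison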
Paradigm Strategies¶
Pure RLM (`_run_pure_rlm`):

- Initializes `PureRLMEnvironment` and loads the context as a variable
- Context tokens = ~200 (metadata only)
- Runs the task in the `pure_rlm` environment

CodeAct (`_run_codeact`):

- Embeds the full context directly in the task prompt
- Context tokens = `len(context) / 4` (full context)
- Runs the task in the `generic` environment

Traditional (`_run_traditional`):

- Writes the context to a temporary file
- Task instructs the LLM to use the `read_file` and `search_code` tools
- Context tokens = estimated at half of full context (partial reads)
- Runs the task in the `dspy` environment
- Cleans up the temporary file after completion
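The context-token estimates above translate into something like the following sketch (illustrative; the ~4-characters-per-token heuristic is the only estimator stated explicitly in this document):
def estimate_context_tokens(paradigm: Paradigm, context: str) -> int:
    if paradigm is Paradigm.PURE_RLM:
        return 200                # metadata only; context lives in a REPL variable
    if paradigm is Paradigm.CODEACT:
        return len(context) // 4  # full context embedded in the prompt
    return len(context) // 8      # traditional: roughly half of the full context via partial reads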
Accuracy Calculation¶
When ground_truth is provided, accuracy is calculated using Jaccard similarity:
answer_tokens = set(answer.lower().split())
truth_tokens = set(ground_truth.lower().split())
accuracy = len(answer_tokens & truth_tokens) / len(answer_tokens | truth_tokens)
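For example, if the answer is "revenue grew 10 percent" and the ground truth is "revenue grew 12 percent", the two sets share 3 of 5 unique tokens, giving an accuracy of 0.6:
answer = "revenue grew 10 percent"
ground_truth = "revenue grew 12 percent"
answer_tokens = set(answer.lower().split())       # {"revenue", "grew", "10", "percent"}
truth_tokens = set(ground_truth.lower().split())  # {"revenue", "grew", "12", "percent"}
accuracy = len(answer_tokens & truth_tokens) / len(answer_tokens | truth_tokens)
print(accuracy)  # 3 / 5 = 0.6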
Cost Estimation¶
Costs are estimated based on token count and model pricing:
| Model | Cost per 1K tokens |
|---|---|
| `gpt-4o` | $0.005 |
| `gpt-4` | $0.030 |
| `claude-3-opus` | $0.015 |
| `claude-3-sonnet` | $0.003 |
| Default | $0.005 |
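A plausible reading of this table as a formula, sketched below; the rate-selection logic (the substring match on the model name) is an assumption, since the document only lists the per-1K prices:
PRICING_PER_1K = {
    "gpt-4o": 0.005,   # listed before "gpt-4" so the more specific key matches first
    "gpt-4": 0.030,
    "claude-3-opus": 0.015,
    "claude-3-sonnet": 0.003,
}

def estimate_cost(total_tokens: int, model: str) -> float:
    rate = next(
        (price for name, price in PRICING_PER_1K.items() if name in model),
        0.005,  # default rate
    )
    return total_tokens / 1000 * rate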
create_comparison_report()¶
Generate a detailed human-readable comparison report.
from rlm_code.rlm.comparison import create_comparison_report
report = create_comparison_report(comparison)
print(report)
The report includes:

- Header with comparison ID, task, context length, and duration
- Metrics table (same as `format_table()`)
- Analysis section with:
    - Token savings percentage between Pure RLM and CodeAct
    - Context token reduction percentage
    - Cost savings analysis
    - Speed comparison
- Verdict section with a conclusion about which paradigm is best for this scenario
Example verdict:
VERDICT:
----------------------------------------
Pure RLM wins on both tokens and cost, validating the paper's claims
that context-as-variable reduces token usage.
Complete Usage Example¶
from rlm_code.rlm.comparison import ParadigmComparator, Paradigm
from rlm_code.rlm.runner import RLMRunner
# Set up runner
runner = RLMRunner(
    llm_connector=my_connector,
    execution_engine=my_engine,
    workdir=my_project_dir,
)
# Create comparator
comparator = ParadigmComparator(runner=runner)
# Load a large context
with open("large_document.txt") as f:
    context = f.read()
# Run comparison
result = comparator.compare(
    task="Summarize the key findings in this document",
    context=context,
    paradigms=[Paradigm.PURE_RLM, Paradigm.CODEACT],
    ground_truth="The document discusses three main findings: ...",
    max_steps=5,
    exec_timeout=120,
)
# Display results
print(result.format_table())
# Get winner
token_winner = result.get_winner("total_tokens")
cost_winner = result.get_winner("estimated_cost")
print(f"Token winner: {token_winner.value}")
print(f"Cost winner: {cost_winner.value}")
# Detailed analysis
from rlm_code.rlm.comparison import create_comparison_report
print(create_comparison_report(result))
# Access individual paradigm results
pure_rlm = result.results[Paradigm.PURE_RLM]
codeact = result.results[Paradigm.CODEACT]
print(f"Pure RLM: {pure_rlm.total_tokens} tokens, ${pure_rlm.estimated_cost:.4f}")
print(f"CodeAct: {codeact.total_tokens} tokens, ${codeact.estimated_cost:.4f}")
print(f"Token savings: {1 - pure_rlm.total_tokens / codeact.total_tokens:.1%}")
Event Integration¶
The comparator emits events at each stage through the event bus:
| Event | Payload |
|---|---|
| `COMPARISON_START` | paradigms, task, context_length |
| `COMPARISON_PARADIGM_START` | paradigm name |
| `COMPARISON_PARADIGM_END` | Full `ParadigmResult.to_dict()` |
| `COMPARISON_END` | summary, duration_ms |
Subscribe to these events for real-time progress updates:
from rlm_code.rlm.events import RLMEventBus, RLMEventType
bus = RLMEventBus()
def on_paradigm_end(event):
    data = event.event_data.metadata
    print(f" {data['paradigm']}: {data['total_tokens']} tokens, "
          f"{'SUCCESS' if data['success'] else 'FAILED'}")
bus.subscribe_to_type(RLMEventType.COMPARISON_PARADIGM_END, on_paradigm_end)
comparator = ParadigmComparator(runner=runner, event_bus=bus)
result = comparator.compare(task=task, context=context)