
🧪 RLM Code

Research Playground & Evaluation OS for Recursive Language Model Agentic Systems

v0.1.5 · Python 3.11+ · Apache 2.0

RLM Code is the definitive research operating system for building, running, evaluating, comparing, and optimizing LLM-based coding agents. It supports multiple agent paradigms (Pure RLM, CodeAct, and Traditional) in a single unified platform with built-in safety, observability, and reproducibility.


🎯 What RLM Code Solves

RLM (the method) addresses the underlying long-context reasoning problem; RLM Code addresses the tooling and workflow problems around using that method in practice.

Core product problems it targets:

  • Implementation friction: provide a runnable RLM environment (llm_query, REPL, run loop) without custom scaffolding; see the sketch below this list.
  • Experiment management: run, replay, compare, and benchmark experiments in one place.
  • Safety controls: route execution through secure backends and explicit runtime settings.
  • Reproducibility: store traces, metrics, and benchmark artifacts for repeatable research.
  • Operational visibility: expose observability, status, and diagnostics for debugging experiments.

In short, RLM Code is a research tooling layer for building and evaluating RLM-style workflows.
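
To make that concrete, here is a minimal sketch of the pattern those pieces implement. Everything in it is a hypothetical stand-in (`llm_query`, `rlm_run`, the `FINAL:` convention), not the actual rlm_code API: the long context lives in a Python variable, and the model emits code that inspects it instead of reading it all from the prompt.

```python
# Conceptual sketch of an RLM run loop (hypothetical names, not the
# real rlm_code API). The full context never enters the prompt; the
# model queries it through code executed in a persistent REPL namespace.


def llm_query(prompt: str) -> str:
    """Stand-in for a provider call (whatever /connect wires up)."""
    raise NotImplementedError


def rlm_run(task: str, context: str, max_steps: int = 4) -> str:
    namespace = {"context": context, "llm_query": llm_query}  # REPL state
    for _ in range(max_steps):
        # Only the task and context *metadata* are spent as tokens here.
        code = llm_query(
            f"Task: {task}\n"
            f"A variable `context` ({len(context)} chars) and llm_query()\n"
            "are in scope. Reply with Python to run, or `FINAL: <answer>`."
        )
        if code.startswith("FINAL:"):
            return code.removeprefix("FINAL:").strip()
        exec(code, namespace)  # run loop: execute the model's snippet
    return "step budget exhausted"
```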


✨ Highlights

🧠 Multi-Paradigm Engine

Run Pure RLM (paper-compliant with context-as-variable), CodeAct (context-in-tokens), or Traditional agent orchestration, all from one TUI.

🔬 Built-in Research Tab

A dedicated Research tab inside the TUI with Dashboard, Trajectory, Benchmarks, Replay, and Live Events sub-tabs for real-time experiment tracking.

🏆 Benchmarks & Leaderboard

10 preset benchmarks with 33+ test cases, a multi-metric leaderboard, and side-by-side paradigm comparison.

🔁 Session Replay

Time-travel through any RLM run step-by-step with forward/backward navigation, reward curve visualization, and checkpoint/restore.

🎯 Hot-Swappable Policies

Swap reward, action selection, compaction, and termination policies at runtime via the Policy Lab.
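
As a rough illustration of what hot-swapping implies structurally, the sketch below shows a runner that holds its reward policy behind a small interface so the Policy Lab can replace it mid-run. All names here are illustrative assumptions; the real policies live in rlm_code.rlm.policies.

```python
# Hypothetical sketch of runtime policy swapping (illustrative names;
# the real implementations live in rlm_code.rlm.policies).
from typing import Protocol


class RewardPolicy(Protocol):
    def score(self, observation: str) -> float: ...


class TestsPassedReward:
    """Reward 1.0 only when the test suite reports success."""

    def score(self, observation: str) -> float:
        return 1.0 if "ALL TESTS PASSED" in observation else 0.0


class Runner:
    def __init__(self, reward: RewardPolicy) -> None:
        self.reward = reward

    def swap_reward(self, reward: RewardPolicy) -> None:
        self.reward = reward  # takes effect on the next step


# Usage: start with one policy, swap in another without restarting.
runner = Runner(TestsPassedReward())
```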

🔒 HITL Approval Gates

Risk assessment with 40+ rules, 6 approval modes, and full audit logging to keep humans in the loop for every critical action.

📊 Pluggable Observability

7 sinks including JSONL, MLflow, OpenTelemetry, LangSmith, LangFuse, and Logfire to trace every step of every run.
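
For a sense of what "pluggable" means here, a custom sink might look roughly like the sketch below. The emit interface is an assumption for illustration, not the documented rlm_code.rlm.observability API.

```python
# Hypothetical custom sink (the documented interface lives in
# rlm_code.rlm.observability and may differ): each run event arrives
# as a dict and is appended to a file as one JSON line.
import json


class JsonlSink:
    def __init__(self, path: str) -> None:
        self.path = path

    def emit(self, event: dict) -> None:
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event, default=str) + "\n")
```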

📦 Sandbox Runtimes

6 runtimes (Local, Docker, Apple Container, Modal, E2B, and Daytona) for safe, isolated code execution.
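
Conceptually, every runtime exposes the same execute-and-capture contract, so the runner can switch backends without code changes. A hedged sketch of the local case (class and method names are assumptions; the real classes live in rlm_code.sandbox.runtimes):

```python
# Hypothetical sketch of the sandbox-runtime contract (names are
# assumptions; see rlm_code.sandbox.runtimes for the real classes).
# Each runtime runs model-generated code and returns captured output;
# Docker/Modal/E2B/Daytona variants would do the same inside isolation.
import subprocess


class LocalRuntime:
    def run(self, code: str, timeout: float = 30.0) -> str:
        proc = subprocess.run(
            ["python", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return proc.stdout + proc.stderr
```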


🖼️ RLM Research Lab



🚀 Quick Start

Install and launch

```bash
uv tool install "rlm-code[tui,llm-all]"
rlm-code
```

Connect to a model

```
/connect anthropic claude-opus-4-6
```

Run your first benchmark

```
/rlm bench preset=dspy_quick
```

Keep runs bounded

```
/rlm run "small scoped task" steps=4 timeout=30 budget=60
/rlm abort all
```

Compare benchmark output

```
/rlm bench compare candidate=latest baseline=previous
```

Switch to the Research tab

Press Ctrl+5 or F6 to open the Research tab and see your run's dashboard, trajectory, reward curves, and live events.


🏗️ Architecture

```mermaid
graph TB
    CLI["🚀 rlm-code CLI"]
    CLI --> TUI["🖥️ Unified TUI"]
    TUI --> RLM["🔁 RLM"]
    TUI --> FILES["📁 Files"]
    TUI --> DETAILS["📋 Details"]
    TUI --> SHELL["⚡ Shell"]
    TUI --> RESEARCH["🔬 Research"]

    CLI --> CMD["⌨️ 50+ Slash Commands"]

    CMD --> RUNNER["🧠 RLM Runner"]
    RUNNER --> EVENTS["📡 Event Bus (27+ types)"]
    RUNNER --> OBS["📊 Observability (7 sinks)"]
    RUNNER --> TRAJ["📈 Trajectory Logger"]
    RUNNER --> POL["🎯 Policy Lab"]
    RUNNER --> HITL["🔒 HITL Approval Gates"]

    RUNNER --> ENV["🌍 Environments"]
    ENV --> PURE["Pure RLM"]
    ENV --> DSPY["DSPy Coding"]
    ENV --> GEN["Generic"]

    RUNNER --> SAND["📦 Sandbox Runtimes"]
    SAND --> LOCAL["Local"]
    SAND --> DOCKER["Docker"]
    SAND --> CLOUD["Modal · E2B · Daytona"]

    CMD --> BENCH["🏆 Benchmarks (10 presets)"]
    CMD --> LB["📊 Leaderboard"]
    CMD --> SR["⏪ Session Replay"]
```

📋 Feature Matrix

| Feature | Module |
| --- | --- |
| 🧠 RLM Runner (multi-paradigm) | `rlm_code.rlm.runner` |
| 🧪 Pure RLM Environment | `rlm_code.rlm.pure_rlm_environment` |
| 📡 Event System (27+ types) | `rlm_code.rlm.events` |
| 🎯 Policy Lab (16 policies) | `rlm_code.rlm.policies` |
| 🔒 HITL Approval Gates | `rlm_code.rlm.approval` |
| 📊 Observability (7 sinks) | `rlm_code.rlm.observability` |
| 🏆 Benchmarks (10 presets) | `rlm_code.rlm.benchmarks` |
| 📊 Leaderboard | `rlm_code.rlm.leaderboard` |
| ⏪ Session Replay | `rlm_code.rlm.session_replay` |
| 🔍 Paradigm Comparison | `rlm_code.rlm.comparison` |
| 📈 Trajectory Logging | `rlm_code.rlm.trajectory` |
| 🧹 Memory Compaction | `rlm_code.rlm.memory_compaction` |
| 📦 6 Sandbox Runtimes | `rlm_code.sandbox.runtimes` |
| 🤖 12+ LLM Providers | `rlm_code.models` |
| 🔌 MCP Server | `rlm_code.mcp` |
| 🖥️ Unified TUI (5 tabs) | `rlm_code.ui.tui_app` |
| ⌨️ 50+ Slash Commands | `rlm_code.commands` |
| Code Validation | `rlm_code.validation` |
| 🧩 Framework Adapters | `rlm_code.rlm.frameworks` |

🖥️ The TUI at a Glance

RLM Code ships a single unified TUI with 5 tabs:

| Tab | Shortcut | Purpose |
| --- | --- | --- |
| 🔁 RLM | Ctrl+1 / F2 | Converse with LLMs, run slash commands |
| 📁 Files | Ctrl+2 / F3 | Browse project files with syntax preview |
| 📋 Details | Ctrl+3 / F4 | Status panel, diff viewer |
| ⚡ Shell | Ctrl+4 / F5 | Persistent stateful shell |
| 🔬 Research | Ctrl+5 / F6 | Dashboard, trajectory, benchmarks, replay, live events |

The Research tab has 5 internal sub-tabs for organizing experiment data:

  • Dashboard: Run metrics, reward sparkline, summary
  • Trajectory: Step-by-step timeline of actions and rewards
  • Benchmarks: Leaderboard table from /rlm bench runs
  • Replay: Step-through controls for time-travel debugging
  • Events: Live event stream from the RLM event bus

🔬 Research Tab

Press Ctrl+5 after running /rlm bench preset=dspy_quick to see real experiment data populate the Research tab dashboards.


📚 Documentation Guide

| Section | What You'll Find |
| --- | --- |
| 🚀 Getting Started | Installation, quick start, CLI reference, configuration |
| 🧠 Core Engine | RLM Runner, environments, events, termination, trajectory |
| 🎯 Policies & Safety | Reward, action, compaction, termination policies + HITL gates |
| 🖥️ Terminal UI | Tab reference, Research tab, widgets, theme system |
| 📊 Benchmarks & Replay | Presets, leaderboard, session replay |
| 🔍 Observability | Sink architecture, MLflow, OTel, LangSmith, LangFuse, Logfire |
| 📦 Platform | Sandbox runtimes, LLM providers, MCP, framework adapters |
| 📖 Reference | Full API reference |