Quick Start

This guide walks you through launching RLM Code, connecting to an LLM, running your first benchmark, inspecting the results, and exploring the Research tab, all in under 10 minutes.


Prerequisites

Before you begin, make sure you have:

  • Python 3.11+ installed
  • RLM Code installed (uv tool install "rlm-code[tui,llm-all]")
  • At least one LLM API key (OpenAI, Anthropic, or Gemini) or a local Ollama instance
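For example, on macOS or Linux you can install the tool and export a single provider key (Anthropic shown here; see Step 3 for the other providers) like this:

uv tool install "rlm-code[tui,llm-all]"
export ANTHROPIC_API_KEY=sk-ant-...   # or place the key in a .env file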

Local Models

You can use RLM Code entirely with local models via Ollama. No API keys needed:

ollama pull llama3.2
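Ollama typically runs as a background service after installation. If nothing is listening on http://localhost:11434 yet, start the server manually before launching RLM Code:

ollama serve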

Step 1: Launch the TUI

Navigate to a project directory (not your home directory) and launch:

mkdir -p ~/projects/rlm-demo && cd ~/projects/rlm-demo
rlm-code

Directory Safety Check

RLM Code performs a safety check on startup. It will warn you if you are running from your home directory, Desktop, Documents, or a system directory. Always run from a dedicated project directory.

You should see the RLM Research Lab TUI with 5 tabs: RLM, Files, Details, Shell, and Research. The RLM tab is active by default.


Step 2: Initialize Your Project

Initialize a project configuration file:

/init

This creates an rlm_config.yaml in your current directory with default settings. The initializer scans your project for existing files and frameworks.


Step 3: Connect to a Model

Use the /connect command to connect to an LLM provider:

/connect anthropic claude-opus-4-6

Note

Requires ANTHROPIC_API_KEY in your environment or .env file.

/connect openai gpt-5.3-codex

Note

Requires OPENAI_API_KEY in your environment or .env file.

/connect gemini gemini-2.5-flash

Note

Requires GEMINI_API_KEY or GOOGLE_API_KEY in your environment or .env file.

/connect ollama llama3.2

Note

Requires a running Ollama server at http://localhost:11434.
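The cloud providers above read their keys from the environment or from a .env file (Ollama needs no key). A minimal .env sketch using only the variable names mentioned in the notes above; include just the providers you plan to use:

# .env (typically placed in the project directory)
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...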

Interactive Model Picker

For an interactive keyboard-driven model selection experience:

/connect

Run /connect with no arguments to open the guided picker and choose mode/provider/model interactively.

Verify Connection

Check that your model is connected:

/status

This shows the current model, provider, connection status, sandbox runtime, and observability sinks.


Step 4: Run a Benchmark

RLM Code ships with 10+ built-in benchmark presets. Start with the quick DSPy smoke test:

/rlm bench preset=dspy_quick

This runs 3 benchmark cases (Build Signature, Build Module, Add Tests) through the RLM loop: context -> action proposal -> sandbox execution -> observation -> reward -> memory update.

List Available Presets

/rlm bench list

Available built-in presets:

Preset               Cases  Description
dspy_quick           3      Fast DSPy coding loop smoke test
dspy_extended        5      Broader DSPy coding loop sweep
generic_smoke        2      Generic environment safety/sanity checks
pure_rlm_smoke       3      Pure RLM paper-compliant mode smoke test
pure_rlm_context     4      Pure RLM context-as-variable paradigm tests
deep_recursion       3      Deep recursion tests (depth > 1)
paradigm_comparison  3      Side-by-side paradigm comparison benchmarks
oolong_style         4      OOLONG-style long context benchmarks
browsecomp_style     3      BrowseComp-Plus style web reasoning benchmarks
token_efficiency     3      Token efficiency comparison benchmarks

Run a Pure RLM Benchmark

The Pure RLM benchmarks exercise the paper's core paradigm:

/rlm bench preset=pure_rlm_smoke

These tests use context as a REPL variable, llm_query() for recursive LLM calls, FINAL() and FINAL_VAR() for termination, and SHOW_VARS() for state inspection.
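To give a feel for the paradigm, a single Pure RLM step inside the REPL might look roughly like the sketch below. Treat it as illustrative only: context, llm_query(), SHOW_VARS(), and FINAL() are the names described above, but their exact signatures and behavior in RLM Code may differ.

# Illustrative sketch only - the real Pure RLM REPL semantics may differ
summary = llm_query("Summarize the key points of: " + context[:2000])  # recursive LLM call over a slice of the context variable
SHOW_VARS()      # inspect the current REPL state
FINAL(summary)   # terminate the run with the computed answer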

Load External Benchmark Packs

You can load benchmarks from YAML, JSON, or JSONL files:

/rlm bench pack=my_benchmarks.yaml

Supported formats include explicit preset mappings, Pydantic-style dataset cases, Google ADK eval sets, and generic record datasets.
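The exact schema depends on which of these formats you use. As a purely hypothetical illustration of the generic-record shape (the field names below are invented for this example, not necessarily the loader's actual schema), a YAML pack might look something like:

# my_benchmarks.yaml - hypothetical structure, field names are illustrative only
cases:
  - name: summarize_readme
    prompt: "Summarize the README in three bullet points"
  - name: list_utils_functions
    prompt: "List all function names defined in utils.py"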

Before attempting larger runs, keep strict limits in place:

/rlm run "small scoped task" steps=4 timeout=30 budget=60
  • steps caps planner iterations
  • timeout caps per-action execution time
  • budget caps total run time

If a run gets out of control, abort everything:

/rlm abort all

Or cancel one run by id:

/rlm abort <run_id>

Step 5: Inspect Benchmark Results

After running benchmarks, inspect the Research tab. Once you have at least two runs, use the compare and report commands:

/rlm bench compare candidate=latest baseline=previous
/rlm bench report candidate=latest baseline=previous format=markdown

The Research -> Benchmarks panel updates automatically after /rlm bench.


Step 6: Compare Paradigms

Run the same task through multiple paradigms and compare:

/rlm bench preset=paradigm_comparison

This runs document summarization, information extraction, and multi-hop reasoning tasks through Pure RLM, CodeAct, and Traditional paradigms side by side.

Use the comparison command for direct A/B analysis:

/rlm bench compare candidate=latest baseline=previous

Step 7: Session Replay

Every RLM run generates a trajectory that can be replayed step by step.

Load a Session for Replay

/rlm status
/rlm replay <run_id>

This loads the run you specify (the most recent one, if you use the id from /rlm status) and enters replay mode with forward/backward navigation:

  • Step forward: View the next action, observation, and reward
  • Step backward: Go back to a previous state
  • Jump to step: Go directly to any step number
  • Find errors: Jump to steps that produced errors
  • View summary: See session-level statistics

Use the run_id printed by /rlm status (or from the Research tab).


Step 8: Explore Slash Commands

RLM Code has 50+ slash commands. Here are the most useful ones to explore next:

RLM Commands

/rlm run "Analyze this code and suggest improvements"
/rlm status
/rlm doctor
/rlm chat "What patterns does this codebase use?"
/rlm observability

Sandbox Commands

/sandbox status
/sandbox doctor
/sandbox profile secure
/sandbox use docker

Harness Commands

/harness tools
/harness run "fix failing tests" steps=8 mcp=on

With ACP:

/connect acp
/harness run "implement feature with tests" steps=8 mcp=on

In Local/BYOK modes, prompts that look like coding tasks can auto-route to the harness. In ACP mode, auto-routing is intentionally disabled; use /harness run ... directly.

File and Layout Commands

/snapshot          # Take a project snapshot
/diff              # Show file diffs
/view chat         # Switch to chat view
/layout multi      # Switch to multi-pane layout
/pane files show   # Show the files panel
/focus chat        # Focus the chat input

For the complete researcher command handbook, see Researcher Onboarding.

Shell Access

/shell ls -la
!python --version

Shell Shortcut

Prefix any command with ! to run it as a shell command directly:

!pip list | grep dspy

Get Help

/help

Step 9: Explore the Research Tab

After running a benchmark, press Ctrl+5 to switch to the Research tab:

  • Dashboard: See run metrics, reward sparkline, and summary
  • Trajectory: Step-by-step breakdown of agent actions and rewards
  • Benchmarks: Leaderboard table from all your runs
  • Replay: Step-through controls for time-travel debugging
  • Events: Live event stream from the RLM event bus

Research Tab

The Research tab updates automatically when you run /rlm bench or /rlm run commands. No manual refresh needed!


Full Workflow Example

Here is a complete workflow from start to finish:

# Create a project directory
mkdir -p ~/projects/rlm-eval && cd ~/projects/rlm-eval

# Launch RLM Code
rlm-code

# Initialize the project
/init

# Connect to Claude Opus 4.6
/connect anthropic claude-opus-4-6

# Check everything is working
/status
/sandbox doctor

# Run the Pure RLM smoke test
/rlm bench preset=pure_rlm_smoke

# Compare the latest run with previous baseline
/rlm bench compare candidate=latest baseline=previous

# Export a benchmark report
/rlm bench report candidate=latest baseline=previous format=markdown

# Run a more comprehensive benchmark
/rlm bench preset=dspy_extended

# Compare paradigms
/rlm bench preset=paradigm_comparison

# Replay the last session
/rlm status
/rlm replay <run_id>

# Check observability sinks
/rlm observability

# Run an ad-hoc task
/rlm run "Write a Python function that finds the longest common subsequence"

# Export results
/export results.json

# Exit
/exit

What's Next?

  • CLI Reference: Complete documentation for all commands and flags
  • Configuration: Customize every aspect of RLM Code via rlm_config.yaml
  • Core Engine: RLM Runner, Environments, and Event System
  • Research Tab: Deep dive into the experiment tracking interface
  • Observability: MLflow, OpenTelemetry, LangSmith, LangFuse, Logfire
  • Sandbox Runtimes: Superbox runtime selection, Docker/Monty/cloud guidance