Quick Start¶
This guide walks you through launching RLM Code, connecting to an LLM, running your first benchmark, validating results, and exploring the Research tab - all in under 10 minutes.
Prerequisites¶
Before you begin, make sure you have:
- Python 3.11+ installed
- RLM Code installed (`uv tool install "rlm-code[tui,llm-all]"`)
- At least one LLM API key (OpenAI, Anthropic, or Gemini) or a local Ollama instance
Local Models
You can use RLM Code entirely with local models via Ollama. No API keys needed:
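A minimal sketch, assuming Ollama is installed locally and exposed as a /connect provider (the model name below is illustrative):

```
ollama pull llama3.1        # in your shell, before launching the TUI
/connect ollama llama3.1    # inside the TUI; provider and model names assumed here
```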
Step 1: Launch the TUI¶
Navigate to a project directory (not your home directory) and launch:
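```
mkdir -p ~/projects/rlm-eval && cd ~/projects/rlm-eval   # any dedicated project directory works
rlm-code
```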
Directory Safety Check
RLM Code performs a safety check on startup. It will warn you if you are running from your home directory, Desktop, Documents, or a system directory. Always run from a dedicated project directory.
You should see the RLM Research Lab TUI with 5 tabs: RLM, Files, Details, Shell, and Research. The RLM tab is active by default.
Step 2: Initialize Your Project¶
Initialize a project configuration file:
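```
/init
```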
This creates an rlm_config.yaml in your current directory with default settings. The initializer scans your project for existing files and frameworks.
Step 3: Connect to a Model¶
Use the /connect command to connect to an LLM provider:
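For example (the Anthropic line matches the full workflow at the end of this guide; the Gemini model name is illustrative):

```
/connect anthropic claude-opus-4-6
/connect gemini gemini-2.5-pro
```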
Note
Anthropic models require ANTHROPIC_API_KEY in your environment or .env file.
Note
Gemini models require GEMINI_API_KEY or GOOGLE_API_KEY in your environment or .env file.
Interactive Model Picker¶
For an interactive keyboard-driven model selection experience:
Run /connect with no arguments to open the guided picker and choose mode/provider/model interactively.
Verify Connection¶
Check that your model is connected:
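```
/status
```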
This shows the current model, provider, connection status, sandbox runtime, and observability sinks.
Step 4: Run a Benchmark¶
RLM Code ships with 10+ built-in benchmark presets. Start with the quick DSPy smoke test:
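```
/rlm bench preset=dspy_quick
```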
This runs 3 benchmark cases (Build Signature, Build Module, Add Tests) through the RLM loop: context -> action proposal -> sandbox execution -> observation -> reward -> memory update.
List Available Presets¶
Available built-in presets:
| Preset | Cases | Description |
|---|---|---|
| dspy_quick | 3 | Fast DSPy coding loop smoke test |
| dspy_extended | 5 | Broader DSPy coding loop sweep |
| generic_smoke | 2 | Generic environment safety/sanity checks |
| pure_rlm_smoke | 3 | Pure RLM paper-compliant mode smoke test |
| pure_rlm_context | 4 | Pure RLM context-as-variable paradigm tests |
| deep_recursion | 3 | Deep recursion tests (depth > 1) |
| paradigm_comparison | 3 | Side-by-side paradigm comparison benchmarks |
| oolong_style | 4 | OOLONG-style long context benchmarks |
| browsecomp_style | 3 | BrowseComp-Plus style web reasoning benchmarks |
| token_efficiency | 3 | Token efficiency comparison benchmarks |
Run a Pure RLM Benchmark¶
The Pure RLM benchmarks exercise the paper's core paradigm:
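```
/rlm bench preset=pure_rlm_smoke     # paper-compliant mode smoke test
/rlm bench preset=pure_rlm_context   # context-as-variable paradigm tests
```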
These tests use context as a REPL variable, llm_query() for recursive LLM calls, FINAL() and FINAL_VAR() for termination, and SHOW_VARS() for state inspection.
Load External Benchmark Packs¶
You can load benchmarks from YAML, JSON, or JSONL files:
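The loader argument below is hypothetical, shown in the same key=value style the bench command uses elsewhere; check your build for the exact form:

```
/rlm bench pack=./benchmarks/my_cases.yaml   # hypothetical argument name and path
```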
Supported formats include explicit preset mappings, Pydantic-style dataset cases, Google ADK eval sets, and generic record datasets.
Safety and Budget Guardrails (Recommended)¶
Before larger runs, keep strict limits:
- steps: caps planner iterations
- timeout: caps per-action execution time
- budget: caps total run time
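A hedged sketch, assuming these limits are passed as key=value options alongside the preset (values illustrative):

```
/rlm bench preset=dspy_quick steps=8 timeout=60 budget=600
```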
If a run is going out of control, cancel it:
Or cancel one run by id:
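The cancel syntax below is an assumption for illustration; check your build's command list for the exact form:

```
/rlm bench cancel            # hypothetical: cancel all in-flight runs
/rlm bench cancel <run_id>   # hypothetical: cancel a single run by id
```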
Step 5: Inspect Benchmark Results¶
After running benchmarks, inspect the Research tab and (after at least two runs) use compare/report commands:
```
/rlm bench compare candidate=latest baseline=previous
/rlm bench report candidate=latest baseline=previous format=markdown
```
The Research -> Benchmarks panel updates automatically after /rlm bench.
Step 6: Compare Paradigms¶
Run the same task through multiple paradigms and compare:
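```
/rlm bench preset=paradigm_comparison
```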
This runs document summarization, information extraction, and multi-hop reasoning tasks through Pure RLM, CodeAct, and Traditional paradigms side by side.
Use the comparison command for direct A/B analysis:
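For example, reusing the benchmark compare command from Step 5:

```
/rlm bench compare candidate=latest baseline=previous
```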
Step 7: Session Replay¶
Every RLM run generates a trajectory that can be replayed step by step.
Load a Session for Replay¶
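```
/rlm status            # note the run_id of the latest run
/rlm replay <run_id>
```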
This loads the most recent run and enters replay mode with forward/backward navigation:
- Step forward: View the next action, observation, and reward
- Step backward: Go back to a previous state
- Jump to step: Go directly to any step number
- Find errors: Jump to steps that produced errors
- View summary: See session-level statistics
Use the run_id printed by /rlm status (or from the Research tab).
Step 8: Explore Slash Commands¶
RLM Code has 50+ slash commands. Here are the most useful ones to explore next:
RLM Commands¶
/rlm run "Analyze this code and suggest improvements"
/rlm status
/rlm doctor
/rlm chat "What patterns does this codebase use?"
/rlm observability
Sandbox Commands¶
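Start with the health check used in the full workflow below:

```
/sandbox doctor      # verify the sandbox runtime is healthy
```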
Harness Commands¶
In Local/BYOK modes, likely coding prompts can auto-route to the harness. In ACP mode, auto-routing is intentionally disabled; use /harness run ... directly.
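A hedged example of direct invocation (the prompt is illustrative):

```
/harness run "Refactor the config loader and add unit tests"
```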
File and Layout Commands¶
```
/snapshot # Take a project snapshot
/diff # Show file diffs
/view chat # Switch to chat view
/layout multi # Switch to multi-pane layout
/pane files show # Show the files panel
/focus chat # Focus the chat input
```
For the complete researcher command handbook, see Researcher Onboarding.
Shell Access¶
Shell Shortcut
Prefix any command with ! to run it as a shell command directly:
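```
!git status
!ls -la
```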
Get Help¶
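The command name below is an assumption (it is not shown elsewhere in this guide):

```
/help
```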
Step 9: Explore the Research Tab¶
After running a benchmark, press Ctrl+5 to switch to the Research tab:
- Dashboard: See run metrics, reward sparkline, and summary
- Trajectory: Step-by-step breakdown of agent actions and rewards
- Benchmarks: Leaderboard table from all your runs
- Replay: Step-through controls for time-travel debugging
- Events: Live event stream from the RLM event bus
Research Tab
The Research tab updates automatically when you run /rlm bench or /rlm run commands. No manual refresh needed!
Full Workflow Example¶
Here is a complete workflow from start to finish:
```
# Create a project directory
mkdir -p ~/projects/rlm-eval && cd ~/projects/rlm-eval

# Launch RLM Code
rlm-code

# Initialize the project
/init

# Connect to Claude Opus 4.6
/connect anthropic claude-opus-4-6

# Check everything is working
/status
/sandbox doctor

# Run the Pure RLM smoke test
/rlm bench preset=pure_rlm_smoke

# Compare the latest run with previous baseline
/rlm bench compare candidate=latest baseline=previous

# Export a benchmark report
/rlm bench report candidate=latest baseline=previous format=markdown

# Run a more comprehensive benchmark
/rlm bench preset=dspy_extended

# Compare paradigms
/rlm bench preset=paradigm_comparison

# Replay the last session
/rlm status
/rlm replay <run_id>

# Check observability sinks
/rlm observability

# Run an ad-hoc task
/rlm run "Write a Python function that finds the longest common subsequence"

# Export results
/export results.json

# Exit
/exit
```
What's Next?¶
- CLI Reference: Complete documentation for all commands and flags
- Configuration: Customize every aspect of RLM Code via rlm_config.yaml
- Core Engine: RLM Runner, Environments, and Event System
- Research Tab: Deep dive into the experiment tracking interface
- Observability: MLflow, OpenTelemetry, LangSmith, LangFuse, Logfire
- Sandbox Runtimes: Superbox runtime selection, Docker/Monty/cloud guidance