
Gen-Verse/OpenClaw-RL


OpenClaw-RL

Empowering OpenClaw with RL β€” Train a personalized agent simply by talking to it.

Scalable RL in real-world settings β€” Agentic RL for terminal, GUI, SWE, and tool-call settings.


Tech Report Β· OpenClaw-RL Blog Β· OpenClaw Plugin Β· Slime-based Β· Tinker supported Β· License: Apache 2.0

demo.mp4

πŸ“° News

  • [2026/3/20] πŸ”₯ You can now use your own OpenClaw: simply install this extension.
  • [2026/3/13] πŸš€ OpenClaw-RL now supports both local GPU and cloud (Tinker) deployment. Launch with one line of code β€” Hybrid RL, OPD, and Binary RL all supported!
  • [2026/3/12] πŸ”₯ We support LoRA training now!
  • [2026/3/10] πŸ”₯ We have released our Technical Report! πŸ† Ranked #1 on HuggingFace Daily Papers!
  • [2026/3/10] πŸ”₯ Huge updates today! We released a new combination method, along with an interesting evaluation of these OpenClaw-RL methods. Track 2 is released too, featuring scalable RL implementations for general agent settings across terminal, GUI, SWE, and tool-call scenarios. We only focus on real-world settings!
  • [2026/3/3] πŸ™Œ Working with the authors of SDFT and SDPO, we have integrated their methods into openclaw-opd. We welcome the integration of novel and effective methods!
  • [2026/3/3] πŸ“Ί Check out these community tutorial videos on OpenClaw-RL: Video 1 | Video 2
  • [2026/2/26] πŸ”₯ We release OpenClaw-RL v1 β€” a fully asynchronous RL framework for training personalized AI agents from natural conversation feedback.

πŸ’‘ TL;DR

OpenClaw-RL is a fully asynchronous reinforcement learning framework that turns everyday conversations into training signals for personalized AI agents, and supports training general agents with large-scale environment parallelization.

Most RL-for-LLM systems assume centralized, batch-mode training with pre-collected datasets. OpenClaw-RL takes a fundamentally different approach: it wraps your self-hosted model in OpenClaw as an OpenAI-compatible API, intercepts live multi-turn conversations, and continuously optimizes the policy in the background β€” all without interrupting your usage.

Overview

Highlights: Fully async 4-component loop Β· Self-hosted & private Β· Zero manual labeling Β· Three learning paradigms (Binary RL / OPD / Combine) Β· Personal + General agent support

🌈 Features

Fully Asynchronous 4-Component Architecture

OpenClaw-RL decouples agent serving, rollout collection, PRM/judge evaluation, and policy training into independent async loops. None of them block one another: the model continues serving requests while training runs in the background, and judging happens concurrently with new interactions.
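As a rough illustration (not the framework's actual implementation), the four decoupled loops can be sketched with asyncio queues standing in for the real inter-component channels; serving, judging, and training all make progress concurrently without blocking one another:

```python
import asyncio

# Hypothetical sketch of the four decoupled loops: serving, rollout
# collection, PRM/judge evaluation, and training. Queues stand in for
# the real channels; all names here are illustrative.

async def serve(rollouts: asyncio.Queue, n_requests: int):
    # The agent keeps answering requests; each finished conversation
    # becomes a rollout for the collector.
    for i in range(n_requests):
        await asyncio.sleep(0)          # yield control: serving never blocks
        await rollouts.put(f"trajectory-{i}")
    await rollouts.put(None)            # sentinel: no more rollouts

async def judge(rollouts: asyncio.Queue, scored: asyncio.Queue):
    # Judging runs concurrently with new interactions.
    while (traj := await rollouts.get()) is not None:
        await asyncio.sleep(0)
        await scored.put((traj, 1.0))   # dummy scalar reward
    await scored.put(None)

async def train(scored: asyncio.Queue) -> int:
    # The trainer consumes ready samples as they become available.
    steps = 0
    while (sample := await scored.get()) is not None:
        steps += 1                      # stand-in for a gradient step
    return steps

async def main(n: int = 4) -> int:
    rollouts, scored = asyncio.Queue(), asyncio.Queue()
    _, _, steps = await asyncio.gather(
        serve(rollouts, n), judge(rollouts, scored), train(scored))
    return steps

if __name__ == "__main__":
    print(asyncio.run(main()))          # prints 4: one step per trajectory
```

The key property the sketch captures is that no loop waits for a full batch from another; each consumes whatever its upstream queue has ready.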

Self-Hosted & Private by Design

The entire stack, including the policy model, judge/PRM, and trainer, runs on your own infrastructure. Conversation data stays within your system, and no third-party model API is required.

From Feedback to Gradient β€” Automatically

You do not need to manually label data. The system automatically:

  • Organizes multi-turn interactions into session-aware training trajectories
  • Classifies API messages into main-line (trainable) vs. side (non-trainable) turns
  • Uses the next user, environment, or tool feedback as a natural "next-state" signal
  • Runs PRM/judge evaluation asynchronously, with majority voting when needed for more robust scoring
  • Submits ready samples to the trainer as they become available
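As a toy illustration of the first three steps above, here is a hypothetical session-to-samples pass; the field names and classification rule are simplified assumptions, not the framework's actual schema:

```python
# Hypothetical sketch: assistant turns are "main-line" (trainable),
# every other role is a "side" turn, and the message that follows an
# assistant turn is treated as its natural next-state feedback.

def build_samples(session: list[dict]) -> list[dict]:
    samples = []
    for i, msg in enumerate(session):
        if msg["role"] != "assistant":
            continue                       # side turn: not trainable
        feedback = session[i + 1] if i + 1 < len(session) else None
        samples.append({
            "prompt": session[:i],         # context up to this turn
            "response": msg["content"],
            "next_state": feedback["content"] if feedback else None,
        })
    return samples

session = [
    {"role": "user", "content": "rename all .txt files"},
    {"role": "assistant", "content": "for f in *.txt; do mv ..."},
    {"role": "user", "content": "πŸ‘ that worked"},
]
print(len(build_samples(session)))  # 1 trainable sample, with πŸ‘ feedback
```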

Three Optimization Methods in One Framework

Binary RL (GRPO): A Process Reward Model scores each turn based on next-state feedback. The scalar reward is then used with GRPO advantage estimation and a PPO-style clipped surrogate loss.
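For intuition, a minimal numeric sketch of group-normalized advantages and the PPO-style clipped surrogate (the values are illustrative, not taken from the framework's code):

```python
import math

# GRPO-style advantage estimation: rewards from a group of sampled
# responses are normalized within the group, then used in a PPO-style
# clipped surrogate objective.

def grpo_advantages(rewards: list[float]) -> list[float]:
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0            # avoid division by zero
    return [(r - mean) / std for r in rewards]

def clipped_surrogate(ratio: float, adv: float, eps: float = 0.2) -> float:
    # PPO clipped objective for a single sample:
    # min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)
    return min(ratio * adv, max(min(ratio, 1 + eps), 1 - eps) * adv)

advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])  # PRM says good/bad/good/bad
print(advs)                                   # [1.0, -1.0, 1.0, -1.0]
```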

On-Policy Distillation (OPD): When the next state reveals useful hindsight, a judge model extracts a textual hint. This hint augments the original prompt to create an enhanced teacher, whose token-level log-probability gap with the student becomes a directional advantage signal richer than any scalar reward.
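The OPD signal can be pictured as a per-token log-probability gap between a hint-augmented teacher pass and the plain student pass; the log-probs below are made-up numbers standing in for model output:

```python
# Illustrative OPD sketch: positive where the hindsight hint makes a
# token more likely, negative where less likely. A directional,
# token-level signal rather than a single scalar.

def opd_advantages(teacher_logps, student_logps):
    return [t - s for t, s in zip(teacher_logps, student_logps)]

student = [-2.3, -0.7, -1.9]   # log P(token | original prompt)
teacher = [-0.4, -0.6, -2.5]   # log P(token | prompt + hindsight hint)
print([round(a, 2) for a in opd_advantages(teacher, student)])
# [1.9, 0.1, -0.6]
```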

Combination Method: OpenClaw-RL further combines Binary RL and OPD in a unified training recipe, leveraging the dense scalar supervision of Binary RL together with the richer token-level directional signal from OPD. This combination achieves stronger and more robust optimization than either method alone.
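One simple way to picture the mix is a weighted per-token sum, with weights mirroring the `--w-rl` / `--w-opd` flags of the Tinker runner; note the exact combination rule here is an illustrative assumption, not the paper's recipe:

```python
# Hedged sketch of mixing the two signals per token: broadcast the
# sequence-level GRPO advantage over tokens and add the token-level
# OPD gap. The combination rule is illustrative.

def combined_advantage(scalar_adv, opd_advs, w_rl=1.0, w_opd=1.0):
    return [w_rl * scalar_adv + w_opd * a for a in opd_advs]

print([round(v, 2) for v in combined_advantage(1.0, [0.5, -0.2, 0.0])])
# [1.5, 0.8, 1.0]
```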

From Personal Agents to Real-World Agentic RL

The same framework supports both personalized OpenClaw optimization and scalable RL for terminal, GUI, SWE, and tool-call agents in real-world settings.


🎯 Roadmap

Our long-term goal is to advance personalized, practically useful agents with reinforcement learning. The roadmap has two tracks:

Track 1 β€” Personal Agent Optimization (Small-Scale but Personal)

βœ… Release Track 1: Fully async OpenClaw-RL framework with Binary RL + OPD
βœ… Best recipe discovery via demonstration experiments
βœ… Support LoRA Training
βœ… Deploy training on Tinker
⬜ Support low-precision training/inference
⬜ Beyond the policy: extend learning to skills and memory

Track 2 β€” General Agents Optimization (Scalable Infra)

βœ… Release Track 2: Scalable agentic RL infra for general agents
⬜ Support more cloud services

🀝 Contributing

We welcome contributions that integrate new learning methods into the OpenClaw-RL framework! The integration of SDFT / SDPO into openclaw-opd and the addition of LoRA support are great examples of successful community contributions.

πŸ“ Contents


πŸ”§ Personal Agent Optimization Quick Start

1. Deployment Options

Don't have any money?

  • Hardware: 8Γ— GPUs (default; configurable via NUM_GPUS, ACTOR_GPUS, ROLLOUT_GPUS, PRM_GPUS)
  • Software: CUDA 12.9, Python 3.12
  • Framework: Slime (our base RL framework)

For detailed environment setup, see Slime or ./instructions/README.md.

Don't have a GPU?

Create a Tinker API key; that's all you need. Note that Tinker only supports LoRA, which may be less effective than full fine-tuning, so we are still evaluating it.

2. Start the RL Server

We provide three methods (RL servers):

| Dimension | Binary RL | OPD | Combined |
| --- | --- | --- | --- |
| Signal type | Evaluative (good / bad) | Directional | Evaluative + directional |
| Advantage | Sequence-level scalar | Token-level directional | Mixed sequence- and token-level |
| Density | All scored turns | Hint-accepted turns only | All scored turns |
| Feedback type | User / environment | Explicit corrections | Both implicit and explicit |
| Signal richness | 1 scalar per sample | 1 value per token | 1 value per token |

Choose your optimization method:

Option A: Combination Method β€” Recommended!
cd slime
bash ../openclaw-combine/run_qwen3_4b_openclaw_combine.sh

This method combines Binary RL and OPD and achieves the strongest optimization of the three.

See ./openclaw-combine/README.md for algorithm details.

With LoRA (parameter-efficient, fewer GPUs):

bash ../openclaw-combine/run_qwen3_4b_openclaw_combine_lora.sh

With Tinker (no GPUs at all):

cd openclaw-tinker
python run.py --method combine --model-name Qwen/Qwen3-8B --batch-size 16 --prm-m 1 --w-opd 1.0 --w-rl 1.0

See ./openclaw-tinker/README.md for setup details.

Option B: Binary RL β€” Best for implicit feedback (likes/dislikes, env success/failure)
cd slime
bash ../openclaw-rl/run_qwen3_4b_openclaw_rl.sh

The PRM will automatically judge response quality from next-state feedback. We recommend providing frequent feedback (e.g., πŸ‘/πŸ‘Ž) to help the model optimize effectively.

See ./openclaw-rl/README.md for algorithm details.

With LoRA (parameter-efficient, fewer GPUs):

bash ../openclaw-rl/run_qwen3_4b_openclaw_rl_lora.sh

With Tinker (no GPUs at all):

cd openclaw-tinker
python run.py --method rl --model-name Qwen/Qwen3-8B --batch-size 16 --prm-m 3

See ./openclaw-tinker/README.md for setup details.

Option C: On-Policy Distillation (OPD) β€” Best for rich textual feedback
cd slime
bash ../openclaw-opd/run_qwen3_4b_openclaw_opd.sh

The system extracts hindsight hints from your feedback and distills them into the policy at the token level. We recommend providing concrete feedback (e.g., "you should have checked the file first" or "don't use that library").

See ./openclaw-opd/README.md for algorithm details.

With LoRA (parameter-efficient, fewer GPUs):

bash ../openclaw-opd/run_qwen3_4b_openclaw_opd_topk_lora.sh

With Tinker (no GPUs at all):

cd openclaw-tinker
python run.py --method opd --model-name Qwen/Qwen3-8B --batch-size 16 --prm-m 1

See ./openclaw-tinker/README.md for setup details.

Once running, the model is served as an OpenAI-compatible API at:

http://<HOST_IP>:30000/v1

where <HOST_IP> is the IP address of the machine running the RL server (e.g. 115.190.98.251). The port 30000 is the default and can be changed via the PORT environment variable.

Take note of this endpoint β€” you will need it when configuring OpenClaw in the next step.
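As a quick sanity check, you can hit the endpoint with a raw chat-completions request; this sketch only builds the request (the host, model id, and API key are placeholders that must match your deployment, with the key matching the SGLANG_API_KEY you set on the server):

```python
import json
import urllib.request

# Hedged sketch of a chat-completions call against the RL server's
# OpenAI-compatible endpoint. HOST_IP and the credentials below are
# placeholders for illustration.

HOST_IP = "127.0.0.1"   # replace with your RL server's address

def chat_request(prompt: str) -> urllib.request.Request:
    payload = {
        "model": "qwen3-4b",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"http://{HOST_IP}:30000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer apiKey"},
    )

req = chat_request("list the files in this repo")
# Send with urllib.request.urlopen(req) once the server is running.
print(req.full_url)  # http://127.0.0.1:30000/v1/chat/completions
```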

We also provide an interesting evaluation case: a student who uses OpenClaw to do homework and does not want to be caught using AI, and a teacher who uses OpenClaw to grade that homework and wants the comments to be specific and friendly.

Evaluation Setting β€” Both student and teacher use AI!

We find that, under the combined optimization method, OpenClaw needs only 36 problem-solving interactions in the student setting and 24 grading interactions in the teacher setting to achieve a significant and clearly visible improvement.

Overview

See ./openclaw-test/README.md for setup and algorithm details.

3. OpenClaw Setup

You can use your own OpenClaw; just install this extension.

If you want local file-backed skill authoring in the bundled OpenClaw runtime, see openclaw/extensions/skill-bridge/README.md.

Then configure OpenClaw to route requests to your RL server.

Open your openclaw.json (or the equivalent settings file) and add a provider entry under "models" β†’ "providers":

Example of Slime-based RL server:

{
  "models": {
    "providers": {
      "qwen": {
        "baseUrl": "http://<HOST_IP>:30000/v1",
        "apiKey": "apiKey",
        "api": "openai-completions",
        "models": [
          {
            "id": "qwen3-4b",
            "name": "Qwen3 4B",
            "reasoning": true,
            "input": ["text"],
            "cost": {
              "input": 0,
              "output": 0,
              "cacheRead": 0,
              "cacheWrite": 0
            },
            "contextWindow": 32768,
            "maxTokens": 8192
          }
        ]
      }
    }
  }
}

Replace <HOST_IP> with the IP address of your RL server machine. The apiKey should match the SGLANG_API_KEY you set when starting the server.

Example of Tinker-based RL server:

{
  "models": {
    "providers": {
      "openclaw-rl": {
        "baseUrl": "http://localhost:30000/v1",
        "apiKey": "no-auth-needed",
        "api": "openai-completions",
        "models": [
          {
            "id": "qwen3-4b-lora",
            "name": "Qwen3 4B (OpenClaw-RL LoRA)",
            "reasoning": true,
            "input": ["text"],
            "cost": {
              "input": 0,
              "output": 0,
              "cacheRead": 0,
              "cacheWrite": 0
            },
            "contextWindow": 32768,
            "maxTokens": 8192
          }
        ]
      }
    }
  }
}

That's it β€” start chatting with your OpenClaw agent. The RL server will automatically collect conversation trajectories, compute rewards, and train the model. Your agent gets better the more you use it.


πŸ”§ Agentic RL in Real-world Settings

The same asynchronous RL backbone that powers our personal-agent setting can also support large-scale optimization for these broader real-world environments.

| Setting | Environment | Next-state signal | Horizon |
| --- | --- | --- | --- |
| Terminal | Shell execution sandbox | stdout/stderr, exit code | Long |
| GUI | Screen state + accessibility tree | Visual state diff, task progress | Long |
| SWE | Code repository + test suite | Test verdicts, diff, lint output | Long |
| Tool-call | API/function execution | Return values, error traces | Medium |

πŸ–₯️ Terminal Agent β€” the most widely used computer-use agent

cd slime
bash ../terminal-rl/terminal_qwen3_8b_rl.sh

See ./terminal-rl/README.md for setup details.

πŸ“Ÿ GUI Agent β€” the most general computer-use agent

cd slime
bash ../gui-rl/gui_qwen3vl_8b_rl.sh

See ./gui-rl/README.md for setup details.

πŸ‘¨β€πŸ’» SWE Agent β€” software engineering agent

cd slime
bash ../swe-rl/run_swe_rl_32b_remote_8nodes.sh

See ./swe-rl/README.md for setup details.

πŸ› οΈ Tool-call Agent β€” the most practical agent

cd slime
bash ../toolcall-rl/retool_qwen3_4b_rl.sh

See ./toolcall-rl/README.md for setup details.

πŸ“– Citation

@article{wang2026openclawrl,
  title={OpenClaw-RL: Train Any Agent Simply by Talking},
  author={Wang, Yinjie and Chen, Xuyang and Jin, Xiaolong and Wang, Mengdi and Yang, Ling},
  journal={arXiv preprint arXiv:2603.10165},
  year={2026}
}

@article{wang2026rlanything,
  title={RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System},
  author={Wang, Yinjie and Xie, Tianbao and Shen, Ke and Wang, Mengdi and Yang, Ling},
  journal={arXiv preprint arXiv:2602.02488},
  year={2026}
}

πŸ™ Acknowledgements

This work aims to explore more effective paradigms for agentic RL. Our implementation builds upon the excellent codebases of slime, OpenClaw, Tinker, and Open-AgentRL.

We also build terminal RL using SETA's dataset and agent framework, GUI RL using OSWorld's evaluation scripts, SWE RL using mini-swe-agent's evaluation scripts, and tool-call RL based on the work of Retool.

We sincerely thank these projects for their valuable insights and high-quality implementations, which have greatly facilitated our research.

⚠️ Reminder

When using OpenClaw-RL, please do not provide sensitive personal information during conversations with the model. Also, make sure to keep your API keys secure and never expose them in prompts, logs, or shared files.