Poster Fri, Apr 24, 2026 • 6:30 AM – 9:00 AM PDT

Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning

Yulei Qin ⋅ Xiaoyu Tan ⋅ Zhengbao He ⋅ Gang Li ⋅ Haojia Lin ⋅ Zongyi Li ⋅ Zihan Xu ⋅ Yuchen Shi ⋅ Siqi Cai ⋅ Renting Rui ⋅ Shaofei Cai ⋅ Yuzheng Cai ⋅ Xuan Zhang ⋅ Sheng Ye ⋅ Ke Li ⋅ Xing Sun


Abstract

Reinforcement learning (RL) is the dominant paradigm for sharpening the strategic tool-use capabilities of LLMs on long-horizon, sparsely rewarded agent tasks, yet it faces the fundamental challenge of the exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL instability because of distribution shift across turns. In this paper, we target a progressive exploration-exploitation balance guided by the agent's own experiences, without succumbing to either entropy collapse or runaway divergence. We propose SPEAR, a self-imitation learning (SIL) recipe for training agentic LLMs. It extends vanilla SIL, in which a replay buffer stores good experiences for off-policy updates, by gradually steering the policy entropy across stages. Specifically, the proposed curriculum scheduling harmonizes intrinsic reward shaping and self-imitation to 1) expedite exploration via frequent tool interactions at the beginning, and 2) strengthen exploitation of successful tactics as the agent becomes familiar with the environment. We also combine a bag of tricks from industrial RL optimization into a strong baseline, Dr.BoT, against which we demonstrate SPEAR's effectiveness. On ALFWorld and WebShop, SPEAR increases the success rates of GRPO/GiGPO/Dr.BoT by up to 16.1%/5.1%/8.6% and 20.7%/11.8%/13.9%, respectively. On AIME24 and AIME25, SPEAR boosts Dr.BoT by up to 3.8% and 6.1%, respectively. These gains incur only 10% to 25% extra theoretical complexity and negligible runtime overhead in practice, demonstrating the plug-and-play scalability of SPEAR.
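The two ingredients named in the abstract, a replay buffer of good experiences and a curriculum that shifts weight from intrinsic reward to self-imitation, can be illustrated with a rough sketch. This is not the authors' implementation: the abstract does not specify the buffer filter, the schedule shape, or the form of the intrinsic reward, so the cosine ramp, the "reward above group mean" filter, the per-tool-call bonus, and every name below (`Trajectory`, `ReplayBuffer`, `cosine_ramp`, `intrinsic_weight`, `imitation_weight`, `shaped_reward`) are assumptions chosen only for illustration.

```python
import math
from collections import deque
from dataclasses import dataclass


@dataclass
class Trajectory:
    tokens: list      # placeholder for the rollout's token ids
    reward: float     # sparse task outcome reward (e.g. 1.0 on success)
    tool_calls: int   # number of tool interactions in the rollout


class ReplayBuffer:
    """Keeps only rollouts that beat their group's mean reward (assumed filter)."""

    def __init__(self, capacity: int):
        # A bounded deque evicts the oldest entry once capacity is reached.
        self.items = deque(maxlen=capacity)

    def add_group(self, group):
        baseline = sum(t.reward for t in group) / len(group)
        for t in group:
            if t.reward > baseline:  # positive advantage: worth imitating
                self.items.append(t)


def cosine_ramp(step, total_steps):
    """Rises smoothly from 0.0 at step 0 to 1.0 at total_steps, then stays at 1.0."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 - math.cos(math.pi * progress))


def intrinsic_weight(step, total_steps, w_max=0.5):
    # Assumed schedule: the tool-use bonus matters most early, then fades out.
    return w_max * (1.0 - cosine_ramp(step, total_steps))


def imitation_weight(step, total_steps, w_max=1.0):
    # Assumed schedule: the self-imitation loss is warmed up over training.
    return w_max * cosine_ramp(step, total_steps)


def shaped_reward(traj, step, total_steps, bonus_per_call=0.1, max_bonus=1.0):
    # Outcome reward plus a capped, decaying bonus for tool interactions.
    tool_bonus = min(traj.tool_calls * bonus_per_call, max_bonus)
    return traj.reward + intrinsic_weight(step, total_steps) * tool_bonus
```

In a training loop built on this sketch, `shaped_reward` would feed the on-policy update, while a loss on samples from `ReplayBuffer.items`, scaled by `imitation_weight`, would be added as the off-policy self-imitation term. How SPEAR actually combines the two is not stated in the abstract.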
