How Far Can Unsupervised RLVR Scale LLM Training?
Yuxin Zuo · Bingxiang He · Zeyuan Liu · Shangziqi Zhao · Zixuan Fu · Junlin Yang · Kaiyan Zhang · Yuchen Fan · Ganqu Cui · Cheng Qian · Xiusi Chen · Youbang Sun · Xingtai Lv · Xuekai Zhu · Li Sheng · Ran Li · Huan-ang Gao · Yuchen Zhang · Lifan Yuan · Zhiyuan Liu · Bowen Zhou · Ning Ding
Abstract
Unsupervised Reinforcement Learning with Verifiable Rewards (URLVR) offers a pathway for Large Language Models (LLMs) to improve without human supervision. In particular, many works use model-intrinsic information as rewards for URLVR and show promising improvements, yet their potential and limitations remain unclear. In this work, we revisit URLVR through the lens of intrinsic rewards. We present a unified theoretical framework showing that intrinsic reward methods share a core mechanism: they trade uncertainty for performance by leveraging the model's prior knowledge to sharpen output distributions. Empirical analysis confirms this tradeoff, revealing distinct failure modes and showing that collapse is not inevitable in small, domain-specific regimes such as test-time training. Beyond these findings, early intrinsic reward dynamics also provide a lightweight indicator of model-task priors, complementing pass@$k$ in assessing RL trainability. These insights highlight both the promise and pitfalls of URLVR, motivating future directions such as external rewards and hybrid supervision strategies.
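To make the "trade uncertainty for performance" mechanism concrete, the sketch below computes a simple entropy-based intrinsic reward from a model's token logits, rewarding sharper (lower-entropy) output distributions. This is a minimal illustrative assumption, not the paper's specific formulation; the function name and tensor shapes are hypothetical.

```python
# Minimal sketch (assumption, not the paper's exact method): an entropy-based
# intrinsic reward that favors sharper output distributions, illustrating how
# intrinsic-reward URLVR trades uncertainty for performance.
import torch
import torch.nn.functional as F

def intrinsic_confidence_reward(logits: torch.Tensor) -> torch.Tensor:
    """logits: [seq_len, vocab_size] for one sampled completion.

    Returns a scalar reward equal to the negative mean per-token entropy,
    so higher reward corresponds to sharper (more confident) distributions.
    """
    log_probs = F.log_softmax(logits, dim=-1)           # [seq_len, vocab]
    probs = log_probs.exp()
    token_entropy = -(probs * log_probs).sum(dim=-1)    # [seq_len]
    return -token_entropy.mean()                        # scalar reward

# Toy usage: a near-deterministic completion earns a higher intrinsic reward
# than a maximally uncertain (uniform) one.
sharp = torch.zeros(4, 8); sharp[:, 0] = 10.0
flat = torch.zeros(4, 8)
assert intrinsic_confidence_reward(sharp) > intrinsic_confidence_reward(flat)
```

In an RL loop, such a reward can stand in for a verifiable external signal; the framework above predicts that optimizing it sharpens the policy toward the model's prior, which explains both the observed gains and the collapse failure modes.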