Beyond Scalar Critics: A Distributional Perspective on Reinforcement Learning with Verifiable Rewards for LLMs
Jinyi Liu ⋅ Yiboyun Chen ⋅ Hongyao Tang ⋅ Yi Ma ⋅ Shuyue Hu ⋅ Yang Chen ⋅ Fei Ni ⋅ Qiaosheng Zhang ⋅ LEI BAI ⋅ YAN ZHENG ⋅ Jianye Hao
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become prevalent for LLM post-training, yet its reward signal is often terminal and near-binary, yielding prompt-conditional return distributions that are frequently long-tailed. The scalar critics typically used in RLVR obscure this distributional structure and attenuate tail information, leading to less informative advantages and reduced optimization stability. To address this, we model the return distribution for LLM RL fine-tuning and propose \textsc{DistRLVR}, a unified distributional RLVR framework that learns a critic using both categorical and quantile representations of the return distribution. To stabilize distribution learning under long-horizon, terminal-sparse rewards, we introduce dual Sample-Replacement targets that diversify supervision. Building on the learned return distributions, we develop tail-aware advantage shaping that selectively amplifies informative tails. Across a range of mathematical reasoning benchmarks, \textsc{DistRLVR} delivers consistent gains in sample efficiency, Pass@$k$, and average performance, achieving a 24.1\% overall improvement over PPO. These results suggest that exploiting distributional structure is a practical and promising direction for more reliable RLVR post-training.
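To make the contrast between a scalar critic and a quantile critic concrete, the following is a minimal sketch of the standard quantile-regression (pinball) loss applied to near-binary terminal returns. It is an illustration only: the abstract does not specify the loss used by \textsc{DistRLVR}, and the function name `pinball_loss`, the success rate of 0.2, and the quantile levels are assumptions chosen for the example.

```python
import numpy as np

def pinball_loss(pred_quantiles, returns, taus):
    """Mean quantile-regression (pinball) loss.

    pred_quantiles: shape (K,), predicted return quantiles for one prompt
    returns:        shape (N,), sampled terminal returns for that prompt
    taus:           shape (K,), quantile levels in (0, 1)
    """
    # Pairwise errors u[i, j] = returns[i] - pred_quantiles[j]
    u = returns[:, None] - pred_quantiles[None, :]
    # Under-estimates are weighted by tau, over-estimates by (1 - tau)
    loss = np.where(u >= 0, taus[None, :] * u, (taus[None, :] - 1.0) * u)
    return loss.mean()

# Hypothetical prompt: binary verifier reward with an assumed 20% success rate
rng = np.random.default_rng(0)
returns = (rng.random(10_000) < 0.2).astype(float)

taus = np.array([0.1, 0.5, 0.9])          # assumed quantile levels
scalar_value = returns.mean()             # what a scalar critic would regress to
empirical_q = np.quantile(returns, taus)  # what a quantile critic would regress to

print("scalar value estimate:", scalar_value)
print("quantile estimates:   ", empirical_q)
print("pinball loss (quantiles):", pinball_loss(empirical_q, returns, taus))
print("pinball loss (mean only):",
      pinball_loss(np.full(3, scalar_value), returns, taus))
```

In this toy setting the scalar estimate sits near the success rate, a value that no single rollout ever returns, while the quantile estimates fall on the two actual outcomes and expose the rare successful tail.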