Poster Session #1 in Workshop: Workshop on Scaling Post-training for LLMs (SPOT)
Mon, Apr 27, 2026 • 7:35 AM – 8:20 AM PDT

Beyond Scalar Critics: A Distributional Perspective on Reinforcement Learning with Verifiable Rewards for LLMs

Jinyi Liu ⋅ Yiboyun Chen ⋅ Hongyao Tang ⋅ Yi Ma ⋅ Shuyue Hu ⋅ Yang Chen ⋅ Fei Ni ⋅ Qiaosheng Zhang ⋅ Lei Bai ⋅ Yan Zheng ⋅ Jianye Hao

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become prevalent for LLM post-training, yet its reward signal is often terminal and near-binary, yielding prompt-conditional return distributions that are frequently long-tailed. The scalar critics typically used in RLVR obscure this distributional structure and attenuate tail information, leading to less informative advantages and reduced optimization stability. To address this, we model the return distribution for LLM RL fine-tuning and propose DistRLVR, a unified distributional RLVR framework that learns a critic with both categorical and quantile distributions. To stabilize distribution learning under long-horizon, terminal-sparse rewards, we introduce dual Sample-Replacement targets to diversify supervision. Building on the learned return distributions, we develop tail-aware advantage shaping that selectively amplifies informative tails. Across a range of mathematical reasoning benchmarks, DistRLVR delivers consistent gains in sample efficiency, Pass@k, and average performance, achieving a 24.1% overall improvement over PPO. These results suggest that exploiting distributional structure is a practical and promising direction for more reliable RLVR post-training.
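
As a rough illustration of why a scalar value estimate can hide tail information under near-binary rewards, the sketch below compares the mean return with a few quantile estimates for one prompt. It is not the DistRLVR implementation: the rollout data, the quantile levels, the grid search, and the function names `pinball_loss` and `fit_quantiles` are all assumptions made for this example, and the pinball (quantile-regression) loss is used only as a familiar way to estimate quantiles.

```python
import numpy as np

def pinball_loss(pred, returns, tau):
    """Pinball (quantile-regression) loss of a scalar prediction at level tau."""
    diff = returns - pred
    return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))

def fit_quantiles(returns, taus, grid):
    """For each tau, pick the grid value that minimizes the pinball loss."""
    fitted = []
    for tau in taus:
        losses = [pinball_loss(g, returns, tau) for g in grid]
        fitted.append(float(grid[int(np.argmin(losses))]))
    return fitted

# Hypothetical terminal rewards for one prompt: 2 successes in 20 rollouts.
returns = np.array([1.0] * 2 + [0.0] * 18)
taus = [0.25, 0.5, 0.95]           # assumed quantile levels
grid = np.linspace(0.0, 1.0, 11)   # candidate quantile values

mean_value = float(returns.mean())              # target of a scalar (mean) critic
quantiles = fit_quantiles(returns, taus, grid)  # coarse view of the distribution

print("mean:", mean_value)
print("quantiles:", dict(zip(taus, quantiles)))
```

In this toy case the mean collapses the rare successes into a single small number, while the upper quantile still shows that a high return is reachable. That is the kind of tail signal a distributional critic could expose to advantage shaping.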
