FlowRL: Matching Reward Distributions for LLM Reasoning
Xuekai Zhu · Daixuan Cheng · Dinghuai Zhang · Henry Li · Kaiyan Zhang · Che Jiang · Youbang Sun · Ermo Hua · Yuxin Zuo · Xingtai Lv · Qizheng Zhang · Lin Chen · Fanghao Shao · Bo Xue · Yunchong Song · Zhenjie Yang · Ganqu Cui · Ning Ding · Jianfeng Gao · Xiaodong Liu · Bowen Zhou · Hongyuan Mei · Zhouhan Lin
Abstract
We propose FlowRL: matching the full reward distribution via flow balancing instead of solely maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on both math and code reasoning tasks: FlowRL achieves a significant average improvement of $10.0\%$ over GRPO and $5.1\%$ over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
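The abstract describes the objective only at a high level. The sketch below is one illustrative way such a flow-balanced, distribution-matching loss could look in PyTorch: a squared balance residual that pushes the policy toward a target distribution proportional to the reference model reweighted by the exponentiated reward, with a learnable log-partition term. The function name, the per-prompt `log_z` estimate, and the temperature `beta` are assumptions for illustration, not the authors' released implementation.

```python
# Illustrative sketch (not the paper's code): a flow-balanced objective that
# matches a reward-shaped target distribution rather than maximizing reward.
import torch


def flowrl_style_loss(
    logprob_policy: torch.Tensor,  # summed token log-probs of each sampled response under the policy, shape [B]
    logprob_ref: torch.Tensor,     # same responses scored by a frozen reference model, shape [B]
    reward: torch.Tensor,          # scalar reward per response, shape [B]
    log_z: torch.Tensor,           # learnable estimate of the log-partition function, shape [B] or scalar
    beta: float = 1.0,             # assumed reward temperature hyperparameter
) -> torch.Tensor:
    """Squared balance residual: drives pi(y|x) toward
    p*(y|x) proportional to p_ref(y|x) * exp(beta * r(x, y)) / Z(x)."""
    residual = log_z + logprob_policy - beta * reward - logprob_ref
    return (residual ** 2).mean()


# Toy usage with random stand-ins for a batch of 4 sampled responses.
if __name__ == "__main__":
    B = 4
    lp_pi = torch.randn(B, requires_grad=True)
    lp_ref = torch.randn(B)
    r = torch.rand(B)
    log_z = torch.zeros(B, requires_grad=True)  # in practice predicted per prompt
    loss = flowrl_style_loss(lp_pi, lp_ref, r, log_z)
    loss.backward()
    print(float(loss))
```

At the optimum the residual is zero for every sampled response, which corresponds to the policy matching the normalized target distribution rather than collapsing onto the single highest-reward mode.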