FlowRL: Matching Reward Distributions for LLM Reasoning
Xuekai Zhu ⋅ Daixuan Cheng ⋅ Dinghuai Zhang ⋅ Henry Li ⋅ Kaiyan Zhang ⋅ Che Jiang ⋅ Youbang Sun ⋅ Ermo Hua ⋅ Yuxin Zuo ⋅ Xingtai Lv ⋅ Qizheng Zhang ⋅ Lin Chen ⋅ Fanghao Shao ⋅ Bo Xue ⋅ Yunchong Song ⋅ Zhenjie Yang ⋅ Ganqu Cui ⋅ Ning Ding ⋅ Jianfeng Gao ⋅ Xiaodong Liu ⋅ Bowen Zhou ⋅ Hongyuan Mei ⋅ Zhouhan Lin
Abstract
We propose FlowRL: matching the full reward distribution via flow balancing instead of solely maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on both math and code reasoning tasks: FlowRL achieves a significant average improvement of $10.0\%$ over GRPO and $5.1\%$ over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
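To make the contrast between reward maximization and distribution matching concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a small finite set of candidate responses, a target of the form $\exp(\beta r)/Z$, and a squared flow-balance residual $(\log Z + \log \pi(y) - \beta r(y))^2$. The temperature `beta`, the function names, and the exact residual form are illustrative assumptions that the abstract does not specify.

```python
import math

def target_distribution(rewards, beta=1.0):
    """Normalize scalar rewards into a distribution proportional to exp(beta * r).

    `beta` is an assumed temperature; the abstract does not specify one.
    """
    shift = max(beta * r for r in rewards)  # subtract the max for numerical stability
    weights = [math.exp(beta * r - shift) for r in rewards]
    z = sum(weights)
    return [w / z for w in weights]

def log_partition(rewards, beta=1.0):
    """Exact log Z for the toy case; in practice Z would be a learnable estimate."""
    shift = max(beta * r for r in rewards)
    return shift + math.log(sum(math.exp(beta * r - shift) for r in rewards))

def reverse_kl(policy, target):
    """KL(policy || target) for two discrete distributions on the same support."""
    return sum(p * math.log(p / q) for p, q in zip(policy, target) if p > 0)

def flow_balance_residuals(policy, rewards, log_z, beta=1.0):
    """Assumed squared residual (log Z + log pi(y) - beta * r(y))^2 per candidate."""
    return [(log_z + math.log(p) - beta * r) ** 2 for p, r in zip(policy, rewards)]

# Three hypothetical reasoning paths: two nearly equally good, one poor.
rewards = [1.0, 0.9, 0.1]
target = target_distribution(rewards)

# A reward-maximizing policy collapses onto the single best path.
collapsed = [0.98, 0.01, 0.01]

kl_collapsed = reverse_kl(collapsed, target)  # positive: the second path is neglected
kl_matched = reverse_kl(target, target)       # zero: the policy equals the target
residuals = flow_balance_residuals(target, rewards, log_partition(rewards))
```

In this toy setting the collapsed policy has a strictly positive reverse KL to the target because it starves the second, almost equally rewarded path, while a policy equal to the target drives both the reverse KL and every flow-balance residual to zero.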