Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning
Ziyan Wang · Zheng Wang · Jie Fu · Xingwei Qu · Qi Cheng · Shengpu Tang · Minjia Zhang · Xiaoming Huo
Abstract
Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often struggle in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet effective mechanism that addresses these limitations by decomposing each iteration into three stages: a short fast trajectory of inner steps on the same batch, a reposition step to control off-policy drift, and a final slow correction. This reposition-before-update design leaves the objective and rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces the number of rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points on math reasoning benchmarks. It also achieves up to 4.93$\times$ fewer rollouts and a 4.19$\times$ reduction in wall-clock time to match GRPO's best accuracy.
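The three-stage loop admits a compact sketch. Below is a minimal PyTorch-style rendering of one SFPO iteration as the abstract describes it; the function name `sfpo_step`, the number of fast steps, and the interpolation coefficient `beta` are illustrative assumptions rather than the paper's exact implementation.

```python
import torch

def sfpo_step(policy, optimizer, batch, loss_fn, fast_steps=3, beta=0.5):
    """One SFPO iteration (illustrative sketch): (1) a short fast
    trajectory of inner updates on the same rollout batch, (2) a
    reposition step that pulls parameters back toward the iteration's
    starting point to limit off-policy drift, (3) a final slow
    corrective update from the repositioned parameters."""
    # Anchor: a snapshot of the parameters at the start of the iteration.
    anchor = [p.detach().clone() for p in policy.parameters()]

    # (1) Fast trajectory: several gradient steps on the same batch.
    for _ in range(fast_steps):
        optimizer.zero_grad()
        loss_fn(policy, batch).backward()
        optimizer.step()

    # (2) Reposition: interpolate back toward the anchor.
    #     beta in [0, 1]; beta=0 keeps the fast endpoint, beta=1 resets fully.
    with torch.no_grad():
        for p, a in zip(policy.parameters(), anchor):
            p.lerp_(a, beta)  # p <- (1 - beta) * p + beta * a

    # (3) Slow correction: one final update from the repositioned point.
    optimizer.zero_grad()
    loss_fn(policy, batch).backward()
    optimizer.step()
```

Because the rollouts and the loss are untouched, a sketch like this slots into an existing GRPO training loop by replacing the single per-batch update with `sfpo_step`, consistent with the plug-compatibility claim above.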