Compute-Efficient GRPO Training
Abstract
Reinforcement fine-tuning (RFT) methods such as Group Relative Policy Optimization (GRPO) and Proximal Policy Optimization (PPO) are significantly more expensive than supervised fine-tuning because they require on-policy sampling, repeated rollouts, multiple forward passes, and backpropagation through long sequences across multiple optimization epochs. These costs make reinforcement-learning post-training a major bottleneck for practitioners who train or adapt large language models under limited computational budgets. In this work, we present an empirical study of GRPO post-training dynamics and identify a consistent early plateau in reward trajectories. Across four open-source models (Llama 3B/8B and Qwen 3B/7B), we observe that GRPO reward curves follow a highly regular, sigmoid-shaped pattern with three phases: slow initial progress, rapid improvement, and early saturation. We show that these dynamics are well captured by a simple parametric model conditioned on model size, initial reward, and normalized training progress, which enables reliable prediction of the point at which marginal reward gains diminish. A key practical finding is that, across all tested models, the majority of reward improvement occurs early in training: continuing GRPO beyond roughly 70–80% of a single epoch yields negligible gains while consuming a substantial fraction of total compute. Using the proposed predictive model, practitioners can forecast saturation points early in training and select data-driven stopping criteria, substantially reducing GRPO compute without sacrificing final reward. Our results highlight predictable structure in GRPO training dynamics and suggest that lightweight, empirically grounded early-stopping strategies are an effective tool for managing post-training costs in reinforcement-based LLM fine-tuning.
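To make the sigmoid-shaped reward pattern and the notion of a saturation point concrete, the following is a minimal sketch that assumes a plain logistic curve over normalized training progress. The abstract does not state the exact parametric form, so the function names and the parameters `steepness`, `midpoint`, and `gain_fraction` are hypothetical, and the dependence on model size is left out entirely.

```python
import math


def sigmoid_reward(progress, r_init, r_max, steepness, midpoint):
    """Reward at normalized training progress under an assumed logistic curve.

    r_init and r_max act as the lower and upper asymptotes of the curve.
    This is an illustrative form, not the model fitted in the paper.
    """
    gain_fraction = 1.0 / (1.0 + math.exp(-steepness * (progress - midpoint)))
    return r_init + (r_max - r_init) * gain_fraction


def saturation_progress(steepness, midpoint, gain_fraction=0.95):
    """Progress at which the logistic term reaches `gain_fraction` of its ceiling.

    Obtained by inverting the logistic: sigma(x) = f  =>  x = ln(f / (1 - f)).
    """
    if not 0.0 < gain_fraction < 1.0:
        raise ValueError("gain_fraction must lie strictly between 0 and 1")
    if steepness <= 0.0:
        raise ValueError("steepness must be positive")
    return midpoint + math.log(gain_fraction / (1.0 - gain_fraction)) / steepness


if __name__ == "__main__":
    # Hypothetical parameter values chosen only for illustration.
    stop_at = saturation_progress(steepness=10.0, midpoint=0.4)
    reward = sigmoid_reward(stop_at, r_init=0.2, r_max=0.8,
                            steepness=10.0, midpoint=0.4)
    print(f"stop at progress {stop_at:.3f}, predicted reward {reward:.3f}")
```

Under these assumptions, a stopping rule would fit the curve parameters on the early part of a run and halt training once normalized progress passes `saturation_progress(...)`. The choice of `gain_fraction` trades compute saved against reward left unrealized.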