The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward
Abstract
A central paradox in fine-tuning Large Language Models (LLMs) with Reinforcement Learning with Verifiable Reward (RLVR) is the frequent degradation of multi-attempt performance (Pass@k) despite improvements in single-attempt accuracy (Pass@1). This is often accompanied by catastrophic forgetting, where models lose previously acquired skills. Despite numerous proposed methods, the community's focus on the standard reverse KL-divergence has led to a surprising oversight: the potential of alternative f-divergences as a proactive solution has been largely unexamined. We argue that standard RLVR objectives—both those using the mode-seeking reverse KL-divergence and those forgoing a divergence term entirely—lack a crucial mechanism for knowledge retention. The reverse KL-divergence actively accelerates this diversity collapse by narrowing the policy, while omitting the divergence term provides no safeguard against the model drifting away from its diverse knowledge base. We propose a fundamental shift in perspective: using the divergence term itself as the solution. Our framework, Diversity-Preserving Hybrid RL (DPH-RL), leverages mass-covering f-divergences (such as forward KL and JS divergence) to function as a 'rehearsal mechanism'. By continuously referencing the initial policy, this approach forces the model to maintain broad solution coverage. Experiments on math and SQL generation show that DPH-RL improves both in-domain Pass@1 and Pass@k scores and effectively prevents catastrophic forgetting on out-of-domain tasks. Additionally, DPH-RL is more training-efficient because it computes the f-divergence from its generator function, requiring only sampling from the initial policy and no online reference model. Our work highlights a crucial, overlooked axis for improving RLVR, demonstrating that the proper selection of a divergence measure is a powerful tool for building more general and diverse reasoning models.
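To ground the mechanism the abstract describes, the display below recalls the standard generator-function form of an f-divergence together with the generators of the divergences named above; the notation ($\pi_\theta$ for the trained policy, $\pi_0$ for the initial policy, $x$ a prompt, $y$ a sampled response) is our own shorthand, and the exact DPH-RL objective is not reproduced here.

\[
D_f \;=\; \mathbb{E}_{y \sim \pi_0(\cdot \mid x)}\!\left[ f\!\left( \frac{\pi_\theta(y \mid x)}{\pi_0(y \mid x)} \right) \right],
\qquad f \text{ convex},\ f(1) = 0,
\]
\[
f_{\text{reverse-KL}}(t) = t \log t, \qquad
f_{\text{forward-KL}}(t) = -\log t, \qquad
f_{\text{JS}}(t) = \tfrac{1}{2}\!\left[ t \log \tfrac{2t}{1+t} + \log \tfrac{2}{1+t} \right].
\]

With $f(t) = t \log t$ this expectation equals the mode-seeking $\mathrm{KL}(\pi_\theta \,\|\, \pi_0)$, while $f(t) = -\log t$ yields the mass-covering $\mathrm{KL}(\pi_0 \,\|\, \pi_\theta)$ and $f_{\text{JS}}$ yields the Jensen-Shannon divergence. Because the outer expectation is taken under $\pi_0$, a Monte Carlo estimate needs only responses sampled from the initial policy (with their reference log-probabilities recorded at generation time), which is one reading of the claim that no online reference model is required.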