Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs
Abstract
Direct Preference Optimization (DPO) has become a cornerstone of preference alignment for large language models (LLMs), offering a simpler and more efficient alternative to classical reinforcement learning from human feedback (RLHF). However, existing DPO-based approaches typically treat all preference pairs uniformly, ignoring critical variations in their inherent quality and learning difficulty, which leads to suboptimal data utilization and degraded final performance. To address this challenge, we propose Uni-DPO, a unified dynamic preference optimization paradigm that jointly accounts for (1) the inherent quality of each preference pair and (2) the model's evolving performance on those pairs. By adaptively weighting samples according to both data quality and the model's learning dynamics during training, Uni-DPO makes more effective use of the training data and achieves stronger final performance. Experimental results across diverse models and benchmarks demonstrate the superiority and generalization capability of Uni-DPO. On textual understanding tasks, Gemma-2-9b-it fine-tuned with Uni-DPO outperforms the leading LLM Claude 3 Opus by a significant margin of 6.7 points on Arena-Hard. On mathematical reasoning and multimodal tasks, Uni-DPO consistently outperforms the baseline methods across all benchmarks, providing strong empirical evidence for the effectiveness and robustness of our approach. Code and models will be made publicly available.
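To make the core idea concrete, the sketch below illustrates one way a dynamically weighted DPO objective of this kind could be implemented. The abstract does not specify Uni-DPO's actual weighting function, so the particular form used here (a static per-pair quality score combined with a difficulty signal derived from the policy's current implicit reward margin), along with the names `quality_scores`, `alpha`, and `weighted_dpo_loss`, are illustrative assumptions rather than the paper's method.

```python
import torch
import torch.nn.functional as F

def weighted_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      quality_scores, beta=0.1, alpha=0.5):
    """Minimal sketch of a dynamically weighted DPO loss.

    quality_scores: assumed static per-pair quality estimates in [0, 1]
    (e.g., from an external reward-model margin); not part of standard DPO.
    """
    # Implicit reward margin under the standard DPO parameterization.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    margins = beta * (pi_logratios - ref_logratios)

    # Dynamic difficulty signal: pairs the current policy still ranks
    # incorrectly (small or negative margin) receive larger weights.
    difficulty = torch.sigmoid(-margins).detach()

    # Combine static quality and dynamic difficulty into one weight,
    # then normalize so the overall loss scale stays comparable to DPO.
    weights = (quality_scores ** alpha) * difficulty
    weights = weights / (weights.mean() + 1e-8)

    losses = -F.logsigmoid(margins)  # standard per-pair DPO loss
    return (weights * losses).mean()
```

In this sketch, setting all `quality_scores` to 1 and replacing `difficulty` with a constant recovers the vanilla DPO objective, which is the sense in which such a weighting scheme "unifies" static data quality and evolving learning dynamics in a single loss.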