Poster · ICLR 2025 Workshop on Bidirectional Human-AI Alignment
SWEPO: Simultaneous Weighted Preference Optimization for Group Contrastive Alignment
Taneesh Gupta · Rahul Madhavan · Xuchao Zhang · Chetan Bansal · Saravanakumar Rajmohan
Direct Preference Optimization (DPO) has proven effective in aligning large language models with human preferences, but it is typically constrained to pairwise comparisons, overlooking the additional positive and negative responses that are commonly available in real-world settings. We propose Simultaneous Weighted Preference Optimization (SWEPO), which incorporates multiple responses per query and prioritizes those that deviate most from the average reward. This deviation-based weighting focuses training on the most informative outliers, acting as a built-in curriculum. Theoretically, we prove that such multi-preference sampling lowers alignment bias, bounding the expected deviation from the true acceptable-response distribution at a rate of O(1/√k). Empirically, SWEPO outperforms state-of-the-art baselines on the UltraFeedback dataset, with substantial improvements over DPO and InfoNCA and gains of up to ~4% in length-controlled win rate on AlpacaEval.
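To make the deviation-based weighting concrete, the sketch below shows one plausible PyTorch rendering of a group-contrastive objective over k responses per query, where each response is weighted by how far its reward deviates from the per-query mean. This is an illustration of the idea described in the abstract, not the paper's exact loss: the function name, the temperature tau, and the DPO-style implicit reward beta * (log pi - log pi_ref) are assumptions introduced here.

```python
import torch

def deviation_weighted_group_loss(policy_logps, ref_logps, rewards,
                                  beta=0.1, tau=1.0):
    """Illustrative sketch only; all tensors have shape (batch, k),
    with k sampled responses per query."""
    # DPO-style implicit reward: scaled policy/reference log-ratio (assumption).
    scores = beta * (policy_logps - ref_logps)

    # Deviation of each response's annotated reward from the per-query mean.
    dev = rewards - rewards.mean(dim=-1, keepdim=True)

    # Weight responses by the magnitude of their deviation, so the most
    # informative outliers dominate the update.
    weights = torch.softmax(dev.abs() / tau, dim=-1)

    # Group-contrastive term: responses at or above the average reward act as
    # positives; a softmax over all k responses pushes the rest down.
    pos_mask = (dev >= 0).float()
    log_softmax = scores - torch.logsumexp(scores, dim=-1, keepdim=True)
    pos_loglik = (weights * pos_mask * log_softmax).sum(dim=-1)
    pos_weight = (weights * pos_mask).sum(dim=-1).clamp(min=1e-8)
    return (-pos_loglik / pos_weight).mean()

if __name__ == "__main__":
    # Toy usage: batch of 2 queries, k = 4 responses each.
    policy_logps = torch.randn(2, 4)
    ref_logps = torch.randn(2, 4)
    rewards = torch.tensor([[0.9, 0.2, 0.7, 0.1],
                            [0.3, 0.8, 0.5, 0.4]])
    print(deviation_weighted_group_loss(policy_logps, ref_logps, rewards))
```

Under these assumptions, setting k = 2 with one positive and one negative response recovers a pairwise objective in the spirit of DPO, while larger k lets the deviation weights realize the curriculum-like emphasis on outlier responses described above.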