Poster

Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

Yuheng Zhang · Dian Yu · Baolin Peng · Linfeng Song · Ye Tian · Mingyue Huo · Nan Jiang · Haitao Mi · Dong Yu

Hall 3 + Hall 2B #402
Thu 24 Apr 7 p.m. PDT — 9:30 p.m. PDT
Oral presentation: Oral Session 4A
Fri 25 Apr 12:30 a.m. PDT — 2 a.m. PDT

Abstract:

Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-theoretic perspective. Specifically, we formulate the problem as a two-player game and propose a novel online algorithm, iterative Nash policy optimization (INPO). The key idea is to let the policy play against itself via no-regret learning, thereby approximating the Nash policy. Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses, which typically incurs high computational or annotation costs. Instead, we introduce a new loss objective that is directly minimized over a preference dataset. We provide theoretical analysis for our approach and demonstrate its effectiveness through experiments on various representative benchmarks. With an LLaMA-3-8B-based SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard, showing substantial improvement over the state-of-the-art online RLHF algorithms.
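The self-play idea in the abstract can be illustrated with a toy sketch. This is *not* the paper's exact objective: it assumes an IPO-style squared loss in which the log-probability-ratio margin between the preferred and dispreferred response (measured against the previous iterate, which serves as the self-play opponent) is pushed toward a fixed target of 1/(2τ). The tabular responses, preference pairs, and all hyperparameters below are hypothetical.

```python
import math

# Hypothetical candidate responses and (winner, loser) preference pairs.
RESPONSES = ["a", "b", "c"]
PREFS = [("a", "b"), ("a", "c")]

def softmax(logits):
    """Convert a dict of logits to a dict of probabilities."""
    m = max(logits.values())
    exps = {y: math.exp(v - m) for y, v in logits.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def inpo_step(logits, ref_probs, prefs, tau=1.0, lr=0.1, iters=200):
    """One INPO-style update (sketch): minimize a squared preference loss
    whose margin h = log(pi(y_w)/pi_t(y_w)) - log(pi(y_l)/pi_t(y_l))
    is driven toward 1/(2*tau), with the previous policy pi_t as reference.
    No per-response win-rate estimation is needed; the loss is computed
    directly from the preference pairs."""
    target = 1.0 / (2.0 * tau)
    for _ in range(iters):
        probs = softmax(logits)
        grads = {y: 0.0 for y in logits}
        for w, l in prefs:
            h = (math.log(probs[w] / ref_probs[w])
                 - math.log(probs[l] / ref_probs[l]))
            err = h - target
            # d h / d logit_y = 1[y==w] - 1[y==l]  (softmax terms cancel)
            for y in logits:
                dh = (1.0 if y == w else 0.0) - (1.0 if y == l else 0.0)
                grads[y] += 2.0 * err * dh
        for y in logits:
            logits[y] -= lr * grads[y] / len(prefs)
    return logits

# Self-play loop: each round, the current policy becomes its own opponent.
logits = {y: 0.0 for y in RESPONSES}
for t in range(3):
    ref = softmax(logits)          # previous iterate = reference/opponent
    logits = inpo_step(dict(logits), ref, PREFS)

probs = softmax(logits)
```

After a few rounds the probability mass shifts toward the consistently preferred response `a`, and the margin against the previous iterate grows each round, mimicking how the averaged no-regret iterates approach the Nash policy.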
