

Poster

Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization

Audrey Huang · Wenhao Zhan · Tengyang Xie · Jason Lee · Wen Sun · Akshay Krishnamurthy · Dylan Foster

Hall 3 + Hall 2B #601
Thu 24 Apr midnight PDT — 2:30 a.m. PDT

Abstract: Language model alignment methods, such as reinforcement learning from human feedback (RLHF), have led to impressive advances in language model capabilities. However, existing techniques are limited by a widely observed phenomenon known as *overoptimization*, where the quality of the language model degrades over the course of the alignment process. Overoptimization occurs when a language model overfits to inaccuracies in an (either explicit or implicit) offline reward model, and drifts away from preferred responses covered by the data. To discourage such distribution shift, offline alignment methods typically employ KL-regularization, but this, as we show, is too weak to prevent degradation in performance. Then, can we design an efficient algorithm that is provably robust to overoptimization? In this paper, we advance theoretical understanding of sample-efficient offline alignment and introduce a new algorithm called χ²-Preference Optimization (χPO). χPO is a one-line change to Direct Preference Optimization (DPO; Rafailov et al. 2023) that modifies only the logarithmic link function in the DPO objective. Despite this minimal change, χPO implicitly implements the principle of *pessimism in the face of uncertainty* via regularization with the χ²-divergence---which quantifies uncertainty more effectively than KL-regularization---and provably alleviates overoptimization, achieving sample-complexity guarantees based on *single-policy concentrability*---the gold standard in offline reinforcement learning. This guarantee makes χPO the first simple, yet general-purpose offline alignment algorithm that is provably robust to overoptimization.
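To make the "one-line change" concrete, below is a minimal sketch (not the authors' reference implementation) of a DPO-style preference loss in which the standard logarithmic link is swapped for an assumed mixed link φ(z) = log(z) + z on the policy/reference density ratio; the exact form of χPO's link function and all variable names here are illustrative assumptions, not taken from the poster text.

```python
# Illustrative sketch of a DPO-style loss with a modified link function.
# ASSUMPTION: the link phi(z) = log(z) + z is used here only to illustrate
# how a one-line change to DPO's logarithmic link might look; consult the
# paper for chi-PO's actual objective.
import torch
import torch.nn.functional as F


def preference_loss_with_modified_link(policy_logps_chosen, policy_logps_rejected,
                                        ref_logps_chosen, ref_logps_rejected,
                                        beta=0.1):
    """DPO-style logistic preference loss with a swapped link function.

    Each `*_logps_*` tensor holds summed log-probabilities of a response
    under the current policy or the frozen reference policy.
    """
    def phi(log_ratio):
        # Standard DPO uses phi(z) = log(z), i.e. just `log_ratio`.
        # Here we apply the assumed phi(z) = log(z) + z, written in terms
        # of log z as: log_ratio + exp(log_ratio).
        return log_ratio + torch.exp(log_ratio)

    chosen_term = phi(policy_logps_chosen - ref_logps_chosen)
    rejected_term = phi(policy_logps_rejected - ref_logps_rejected)
    logits = beta * (chosen_term - rejected_term)
    # Bradley-Terry style logistic loss, as in DPO.
    return -F.logsigmoid(logits).mean()
```

Relative to a standard DPO implementation, only the body of `phi` changes; the data pipeline, reference model, and logistic loss are untouched.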
