Poster in Workshop: ICLR 2026 Workshop on AI with Recursive Self-Improvement

Unlocking Intrinsic Self-Reflection for LLM Preference Policy Optimization

Yu Li ⋅ Tian Lan ⋅ Zhengling Qi

Abstract

Direct Preference Optimization (DPO) and its variants have become the standard approach for aligning Large Language Models (LLMs). However, we identify two fundamental limitations. First, the optimized policy lacks invariance: it varies with modeling choices such as the scalarization function or the reference policy, whereas an optimal policy should remain invariant to them. Second, most existing methods yield theoretically suboptimal policies because they do not fully exploit the comparative information in pairwise preference data, and thus miss an opportunity for self-reflection through comparing and contrasting responses. To address both limitations, we propose Intrinsic Self-reflective Preference Optimization (InSPO), which derives a globally optimal policy conditioned on both the context and an alternative response, explicitly formalizing self-reflection. We prove that this formulation surpasses the standard DPO and RLHF targets while guaranteeing invariance. InSPO serves as a plug-and-play enhancement for DPO-family algorithms, decoupling alignment from modeling constraints without architectural changes. Using privileged information learning, InSPO requires no alternative response at inference, since the self-reflective mechanism is distilled during training, and therefore incurs zero inference overhead. Experiments show that InSPO consistently improves win rates and length-controlled metrics across DPO variants, yielding more robust and human-aligned LLMs.
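
For readers unfamiliar with the baseline, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not InSPO and is not taken from the authors' code. The function name `dpo_loss`, its argument names, and the numeric log-probabilities are hypothetical, chosen only to illustrate the first limitation named in the abstract: the same policy log-probabilities give a different loss (and hence a different optimization target) when the reference policy changes.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence-level log-probabilities.

    All argument names are illustrative assumptions, not the paper's notation.
    """
    # Margin between the policy-vs-reference log-ratios of the two responses.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)), computed in a numerically stable way.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Same policy log-probs, two different (made-up) reference policies.
loss_ref_a = dpo_loss(-10.0, -12.0, ref_logp_chosen=-11.0, ref_logp_rejected=-11.0)
loss_ref_b = dpo_loss(-10.0, -12.0, ref_logp_chosen=-10.0, ref_logp_rejected=-12.0)
print(loss_ref_a, loss_ref_b)
```

The two printed values differ even though the policy's own log-probabilities are identical, which is the kind of dependence on the reference policy that the abstract describes as a lack of invariance. How InSPO removes this dependence is specified in the paper, not in this sketch.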
