Poster Thu, Apr 23, 2026 • 11:15 AM – 1:45 PM PDT

Preference-based Policy Optimization from Sparse-reward Offline Dataset

Wenjie Qiu ⋅ Guofeng Cui ⋅ Shicheng Liu ⋅ Yuanlin Duan ⋅ He Zhu


Abstract

Offline reinforcement learning (RL) holds the promise of training effective policies from static datasets without the need for costly online interactions. However, offline RL faces key limitations, most notably the challenge of generalizing to unseen or infrequently encountered state-action pairs. When a value function is learned from limited data in sparse-reward environments, it can become overly optimistic about parts of the space that are poorly represented, leading to unreliable value estimates and degraded policy quality. To address these challenges, we introduce a novel approach based on contrastive preference learning that bypasses direct value function estimation. Our method trains policies by contrasting successful demonstrations with failure behaviors present in the dataset, as well as synthetic behaviors generated outside the support of the dataset distribution. This contrastive formulation mitigates overestimation bias and improves robustness in offline learning. Empirical results on challenging sparse-reward offline RL benchmarks show that our method substantially outperforms existing state-of-the-art baselines in both learning efficiency and final performance.
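The abstract does not state the training objective, so the following is only an illustrative sketch of how a contrastive preference loss over trajectory segments is commonly written, not the authors' formulation. It assumes a Bradley-Terry-style comparison in which each segment is scored by a discounted sum of policy log-probabilities, and a preferred segment (for example, a successful demonstration) is contrasted with a rejected one (a failure or a synthetic out-of-support behavior). The function names and the parameters `alpha` and `gamma` are hypothetical.

```python
import math


def segment_score(log_probs, alpha=0.1, gamma=1.0):
    """Discounted sum of alpha * log pi(a_t | s_t) over one segment.

    log_probs: per-step log-probabilities the policy assigns to the
    actions in the segment (hypothetical inputs for illustration).
    """
    return sum(alpha * (gamma ** t) * lp for t, lp in enumerate(log_probs))


def contrastive_preference_loss(preferred_log_probs, rejected_log_probs,
                                alpha=0.1, gamma=1.0):
    """Negative log-probability that the preferred segment wins.

    Assumed form: -log sigmoid(score(preferred) - score(rejected)).
    No value function is estimated; only policy log-probabilities are used.
    """
    diff = (segment_score(preferred_log_probs, alpha, gamma)
            - segment_score(rejected_log_probs, alpha, gamma))
    # Numerically stable evaluation of log(1 + exp(-diff)).
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Usage: the loss falls below log(2) when the policy already assigns
# higher likelihood to the preferred segment than to the rejected one.
loss = contrastive_preference_loss([-0.1, -0.2], [-3.0, -2.5])
```

Minimizing a loss of this shape raises the policy's likelihood of preferred segments relative to rejected ones, which is one way a method could avoid learning an explicit value function.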
