Barriers to Pareto Steerability in Preference-Conditioned LLM Alignment
Abstract
Post-training alignment of Large Language Models (LLMs) is inherently a multi-objective problem, yet standard paradigms often collapse conflicting goals into a single "one-size-fits-all" reward scalar. While preference-conditioned alignment aims to give users dynamic control over trade-offs, achieving broad and continuous steerability across the Pareto frontier remains challenging in practice. In this paper, we investigate the limitations of current state-of-the-art methods on the Helpfulness vs. Harmlessness (HH) task and identify two recurring failure modes: an Optimization Gap, where conflicting gradients lead to fragmented behavior across preferences, and a Geometric Gap, where linear scalarization cannot recover non-convex regions of the trade-off space. Through a sequence of systematic experiments, we show how these two gaps manifest in practice and demonstrate that addressing only one of them is insufficient in the setting studied here. Finally, we evaluate a framework that combines meta-learning with geometry-aware scalarization, aiming to address both gaps in a unified way.