One Model, Many Goals: Meta-Learning Preference-Conditioned Alignment for Lifelong LLM Agents
Abstract
Deployed AI agents increasingly face evolving preference goals: user intent shifts, context changes the acceptable level of risk, and constraints are updated over time. A single deployed LLM policy must therefore re-target its behavior on the fly, without weight updates at deployment time. Standard Reinforcement Learning from Human Feedback (RLHF) collapses multiple objectives into one fixed scalar reward, which yields brittle trade-offs. Existing preference-conditioned methods, which sample one preference per update and use linear scalarization, often (i) lose sensitivity to the preference signal because of gradient interference and (ii) miss Pareto-optimal solutions in non-convex trade-off regions. We propose MERIDIAN (Meta-Learning for Preference-Conditioned Alignment), a bi-level framework that treats each preference as an alignment task. An inner loop optimizes each preference-specific objective in isolation, and a Reptile-style meta-update aggregates the adapted parameters to preserve steerability across the preference simplex. This is paired with a smoothed Tchebycheff scalarization to recover Pareto-optimal solutions in all regions of the front, including non-convex ones. Empirically, MERIDIAN achieves denser Pareto coverage, better access to extreme goal modes, and higher performance on unseen preferences, supporting robust inference-time goal re-targeting. We also provide a generalization result showing that optimizing an empirical objective over sampled preferences carries over to all preferences.
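To make the two ingredients named in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It assumes the common log-sum-exp form of the smoothed Tchebycheff scalarization, stated for losses to be minimized, and the standard Reptile interpolation toward the mean of the adapted parameters. The function names, the smoothing parameter `mu`, the ideal point `ideal`, and the step size `meta_lr` are all hypothetical choices made for illustration.

```python
import math


def smoothed_tchebycheff(losses, weights, ideal, mu=0.1):
    """Smooth upper bound on max_i weights[i] * (losses[i] - ideal[i]).

    Assumed form: mu * log(sum_i exp(weights[i] * (losses[i] - ideal[i]) / mu)).
    As mu -> 0 this approaches the hard (non-smooth) Tchebycheff maximum.
    """
    terms = [w * (l - z) / mu for l, w, z in zip(losses, weights, ideal)]
    shift = max(terms)  # log-sum-exp trick for numerical stability
    return mu * (shift + math.log(sum(math.exp(t - shift) for t in terms)))


def reptile_step(theta, adapted_thetas, meta_lr=0.5):
    """Reptile-style meta-update over a batch of preference tasks.

    theta: current shared parameters (flat list of floats).
    adapted_thetas: one parameter list per preference, each produced by
        running the inner loop for that preference in isolation.
    Moves theta a fraction meta_lr toward the mean adapted parameters.
    """
    n = len(adapted_thetas)
    mean = [sum(column) / n for column in zip(*adapted_thetas)]
    return [t + meta_lr * (m - t) for t, m in zip(theta, mean)]


# Toy usage: two objectives, one preference vector on the simplex.
value = smoothed_tchebycheff(
    losses=[1.0, 2.0], weights=[0.5, 0.5], ideal=[0.0, 0.0], mu=0.1
)
new_theta = reptile_step([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], meta_lr=0.5)
```

In this sketch the scalarized value always lies between the hard Tchebycheff maximum and that maximum plus `mu * log(k)` for `k` objectives, so a smaller `mu` tightens the approximation at the cost of a less smooth objective. Because the inner loops run separately per preference and only their end points are averaged, gradients from different preferences are not summed within a single update, which is the intuition the abstract gives for preserving steerability.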