Poster in Workshop: Self-Improving Foundation Models Without Human Supervision

MPAW: Multi-Preference Alignment through Weak Model Collaboration for Efficient and Flexible LLM Decoding

Nuo Chen · GUOJUN XIONG · Bingsheng He

Keywords: [ Large Language Models ] [ Weak-to-strong alignment ] [ Multi-objective alignment ] [ Decoding-time Optimization ]


Abstract:

Aligning large language models (LLMs) with diverse and competing human preferences remains a critical challenge for safe and effective deployment. While recent work demonstrates that decoding-time alignment via weak preference models achieves strong performance with minimal compute, existing methods optimize for a single objective, severely limiting their adaptability to real-world scenarios that require multifaceted trade-offs (e.g., safety vs. helpfulness). We propose Multi-Preference Alignment through Weak Model Collaboration (MPAW), a scalable framework that aggregates guidance from heterogeneous weak preference models (smaller LLMs aligned to distinct objectives) into a unified decoding strategy. By dynamically integrating signals from specialized proxies (e.g., safety classifiers, conciseness scorers), MPAW preserves the generalization capabilities of large base models while enabling zero-shot adaptation to arbitrary preference weightings. Empirical results demonstrate reliable alignment quality, nearly matching the performance of computationally expensive multi-objective RLHF fine-tuning. Our findings establish weak model collaboration as a principled pathway for efficient, flexible LLM alignment without retraining.
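
The abstract does not spell out how the weak models' signals are combined, so the following is only a minimal sketch of the general idea it describes: shifting a base model's next-token distribution by a weighted sum of per-token scores from several weak preference proxies, with the weights chosen at decoding time. All function and variable names here (e.g., `combine_next_token_logits`, `safety_scores`, `beta`) are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch of multi-preference, decoding-time guidance.
# Assumption: each weak preference model emits a score per candidate token;
# the base model's logits are shifted by a user-weighted sum of those scores.
import numpy as np

def combine_next_token_logits(base_logits, weak_scores, weights, beta=1.0):
    """Shift base next-token logits by a weighted aggregate of weak-model scores.

    base_logits: (vocab,) logits from the large base model
    weak_scores: (num_proxies, vocab) per-token scores from weak preference models
    weights:     (num_proxies,) user-chosen preference weighting
    beta:        overall guidance strength
    """
    weak_scores = np.asarray(weak_scores)
    weights = np.asarray(weights)
    guidance = weights @ weak_scores          # weighted aggregate, shape (vocab,)
    return np.asarray(base_logits) + beta * guidance

def sample_token(logits, temperature=1.0, rng=None):
    """Sample a token id from temperature-scaled, guided logits."""
    rng = rng or np.random.default_rng()
    z = np.asarray(logits) / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

if __name__ == "__main__":
    vocab = 8
    base_logits = np.random.randn(vocab)
    # Two illustrative proxies: a "safety" scorer and a "conciseness" scorer.
    safety_scores = np.random.randn(vocab)
    concise_scores = np.random.randn(vocab)
    # Preference weighting picked at decoding time, with no retraining.
    weights = [0.7, 0.3]
    guided = combine_next_token_logits(
        base_logits, [safety_scores, concise_scores], weights
    )
    print("sampled token id:", sample_token(guided))
```

Changing `weights` at inference time is what would allow zero-shot adaptation to a new preference trade-off; whether MPAW aggregates scores exactly this way (additively in logit space) is an assumption of this sketch.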
