Oral in Workshop: ICLR 2025 Workshop on Bidirectional Human-AI Alignment
A Roadmap for Human-Agent Moral Alignment: Integrating Pre-defined Intrinsic Rewards and Learned Reward Models
The prevailing practice in alignment often relies on human preference data (e.g., in RLHF or DPO), in which values are implicit and are essentially deduced from relative preferences over different model outputs. This approach suffers from low transparency, low controllability, and high cost. More recently, researchers have introduced the design of intrinsic reward functions that explicitly encode core human moral values for Reinforcement Learning-based fine-tuning of foundation agent models. The strengths of this approach are transparency and cost-effectiveness, but its weaknesses include over-simplification, a lack of flexibility, and an inability to dynamically adapt to the needs or preferences of (potentially diverse) users. In this position paper, we argue that a combination of intrinsic rewards and learned reward models may provide an effective way forward for alignment research that enables human agency and control. Integrating intrinsic rewards and learned reward models in post-training can allow models to act in a way that is respectful of users' moral preferences while also relying on a transparent foundation of pre-defined values.
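One simple way to picture the proposed integration is as a reward signal that blends a pre-defined intrinsic moral reward with a learned, preference-based reward model during RL post-training. The sketch below is illustrative only and not taken from the paper: the linear blend, the mixing coefficient `alpha`, and all function names are assumptions introduced for clarity.

```python
# Illustrative sketch (assumption, not the paper's method): blending a
# pre-defined intrinsic moral reward with a learned reward model into a
# single scalar signal for RL-based post-training.

from dataclasses import dataclass
from typing import Callable


@dataclass
class BlendedReward:
    """Blend a transparent, hand-specified intrinsic reward with a learned
    preference-based reward model via a fixed mixing coefficient alpha."""
    intrinsic_reward: Callable[[str, str], float]  # encodes pre-defined moral values
    learned_reward: Callable[[str, str], float]    # trained on user preference data
    alpha: float = 0.5                             # weight on the intrinsic component

    def __call__(self, prompt: str, response: str) -> float:
        r_intrinsic = self.intrinsic_reward(prompt, response)
        r_learned = self.learned_reward(prompt, response)
        return self.alpha * r_intrinsic + (1.0 - self.alpha) * r_learned


# Toy stand-ins for the two reward sources (purely for illustration).
def toy_intrinsic(prompt: str, response: str) -> float:
    # e.g. a hand-written rule penalising a harmful keyword
    return -1.0 if "harm" in response.lower() else 1.0


def toy_learned(prompt: str, response: str) -> float:
    # e.g. a preference-model score; here just a crude length heuristic
    return min(len(response) / 100.0, 1.0)


reward_fn = BlendedReward(toy_intrinsic, toy_learned, alpha=0.7)
print(reward_fn("How should I respond?", "A helpful, harmless answer."))
```

In such a setup, the intrinsic term supplies the transparent, auditable value foundation, while the learned term lets the signal adapt to individual users' preferences; how the two are actually weighted or combined is a design choice left open by the position paper.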