Poster Fri, Apr 24, 2026 • 6:30 AM – 9:00 AM PDT Pavilion 4 P4-#5204

Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment

Ruoxi Cheng ⋅ Hao-Xuan Ma ⋅ Weixin Wang ⋅ Ranjie Duan ⋅ Jiexi Liu ⋅ Xiaoshuang Jia ⋅ Simeng Qin ⋅ Xiaochun Cao ⋅ Yang Liu ⋅ Yang Liu

[ Poster] [ OpenReview]

Abstract

Alignment is vital for safely deploying large language models (LLMs). Existing techniques are either reward-based--train a reward model on preference pairs and optimize with reinforcement learning (RL)--or reward-free--directly fine-tune on ranked outputs. Recent research show that well-tuned reward-based pipelines remain the most robust, and single-response demonstrations can outperform pairwise preference data. However, there still exist two key challenges: (1) imbalanced safety dataset that overrepresent common hazards while neglecting long-tail threats; and (2) static reward models that ignore task difficulty, limiting optimization efficiency and attainable gains. To address these limitations, we propose DR-IRL, which Dynamically adjusts Rewards through Inverse Reinforcement Learning. We first train category‑specific reward models using a balanced safety dataset of seven harmful categories as demonstration via IRL. Then we enhance Group Relative Policy Optimization (GRPO) by introducing dynamic reward scaling--adjusting rewards by task difficulty--data-level hardness by text encoder cosine similarity, model-level responsiveness by reward gaps. Extensive experiments across various benchmarks and LLMs demonstrate that DR-IRL outperforms all baseline methods in safety alignment while maintaining usefulness.

Video

Chat is not available.