Teach to Reason Safely: Policy-Guided Safety Tuning for MLRMs
Abstract
Multimodal Large Reasoning Models (MLRMs) have exhibited remarkable capabilities in complex multimodal tasks. However, our findings reveal a critical trade-off: reasoning-oriented models are more prone to generating harmful content, resulting in degraded safety performance. This paper presents a large-scale analysis of this safety–reasoning trade-off and identifies two main mechanisms of safety degradation: (i) visual attention drift, which weakens the model’s reliance on visual grounding and thereby causes risks arising from cross-modal interactions to be overlooked; and (ii) unsafe reasoning patterns, including flawed reasoning initiation and chain-of-thought safety attenuation, which undermine the model’s safety awareness. To mitigate these issues, we propose Policy-guided Safety Tuning (PST), a two-stage alignment framework. PST first employs Policy-Guided Supervised Fine-Tuning to integrate explicit safety policies into the reasoning process, establishing a structured and interpretable foundation for safe decision-making. It then applies Safety Reasoning Preference Optimization to encourage the model to generate safe, helpful, and informative responses while reducing oversensitivity and response homogeneity. Extensive experiments demonstrate that PST significantly reduces harmful outputs across multiple multimodal safety benchmarks while maintaining competitive performance on general tasks.
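To make the two-stage structure of PST concrete, the following is a minimal sketch, not the paper's implementation: it assumes Stage 1 reduces to token-level cross-entropy on policy-annotated reasoning traces, and Stage 2 is approximated by a DPO-style preference loss over safe versus unsafe (or over-refusing) responses. All function and variable names are hypothetical.

```python
# Hypothetical sketch of the two PST training stages; loss forms are assumptions,
# not the authors' exact objectives.
import torch
import torch.nn.functional as F


def policy_guided_sft_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Stage 1 (assumed): cross-entropy on reasoning traces that interleave
    explicit safety-policy text with the chain of thought."""
    return F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))


def safety_reasoning_pref_loss(
    logp_safe: torch.Tensor,       # policy log-prob of the preferred (safe, helpful) response
    logp_unsafe: torch.Tensor,     # policy log-prob of the rejected (unsafe or over-refusing) response
    ref_logp_safe: torch.Tensor,   # same quantities under a frozen reference model
    ref_logp_unsafe: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """Stage 2 (assumed): a DPO-style objective standing in for
    Safety Reasoning Preference Optimization."""
    margin = (logp_safe - ref_logp_safe) - (logp_unsafe - ref_logp_unsafe)
    return -F.logsigmoid(beta * margin).mean()


if __name__ == "__main__":
    # Toy tensors only, to illustrate the shapes each stage expects.
    vocab, seq = 32, 6
    logits = torch.randn(2, seq, vocab)
    targets = torch.randint(0, vocab, (2, seq))
    print("SFT loss:", policy_guided_sft_loss(logits, targets).item())

    lp_s, lp_u = torch.tensor([-4.0]), torch.tensor([-3.0])
    rp_s, rp_u = torch.tensor([-4.2]), torch.tensor([-2.8])
    print("Preference loss:", safety_reasoning_pref_loss(lp_s, lp_u, rp_s, rp_u).item())
```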