Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use
Abstract
Agentic language models operate in a distinct safety regime from chat models: they plan, call tools, and execute long-horizon actions where a single error (e.g., file access or credential entry) can cause irreversible harm. Alignment methods optimized for static generation fail in this setting due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC organizes inference as a plan–check–act/refuse loop with explicit safety reasoning and refusal as first-class actions. Training uses preference-based reinforcement learning over pairwise trajectory comparisons, avoiding trajectory-level labels while capturing safety distinctions missed by scalar rewards. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% under injection, cuts privacy leakage, and preserves or improves benign performance, demonstrating robust generalization across models, domains, and agentic settings.
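To make the plan–check–act/refuse loop concrete, the following is a minimal illustrative sketch, not MOSAIC's implementation. All names (`Act`, `Refuse`, `run_agent`, `is_safe`) are hypothetical, and the safety check is a hand-written predicate standing in for the learned safety reasoning that the abstract describes. The point it illustrates is only that a refusal is recorded in the trajectory as an action in its own right, rather than as a failure of the tool call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass
class Act:
    """A planned tool call (hypothetical structure)."""
    tool: str
    argument: str


@dataclass
class Refuse:
    """A refusal, stored in the trace like any other action."""
    reason: str


Step = Union[Act, Refuse]


def run_agent(
    plan: List[Act],
    is_safe: Callable[[Act], bool],
    tools: Dict[str, Callable[[str], object]],
) -> List[Step]:
    """Execute planned tool calls in order, checking each one before acting."""
    trace: List[Step] = []
    for step in plan:
        # Check: an explicit safety decision precedes every tool call.
        if not is_safe(step):
            # Refuse: append the refusal to the trace and stop the episode.
            trace.append(Refuse(reason=f"unsafe call to {step.tool}"))
            break
        # Act: reached only when the check passed.
        tools[step.tool](step.argument)
        trace.append(step)
    return trace


# Usage: the second step tries to send a credential, so it is refused
# and the third step is never executed.
executed: List[str] = []
tools = {"read_file": executed.append, "send_email": executed.append}
plan = [
    Act("read_file", "notes.txt"),
    Act("send_email", "password=hunter2"),
    Act("read_file", "later.txt"),
]
trace = run_agent(plan, lambda s: "password" not in s.argument, tools)
```

Because the trace holds both `Act` and `Refuse` entries, two trajectories for the same task can be compared step by step, which is the kind of input a pairwise preference signal would need.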