Better Attacks for Better Monitors: Semi-Automated Red-Teaming for Agent Monitoring
Abstract
AI control evaluations require strong red-teaming to upper-bound deployment risk, but using language models as red-teamers faces three challenges: mode collapse in attack generation, a gap between the attacks models can conceive and those they can execute, and brittle elicitation. We propose a semi-automated red-teaming method with three components: an attack taxonomy that maps the space of possible strategies, human-seeded attack plans that ensure broad coverage across this taxonomy, and an agentic refinement pipeline that iteratively strengthens attacks post hoc. Applied to the BashArena firewall side task, this method produces MonitorBench, a benchmark of 429 human-seeded attack trajectories for evaluating AI monitor capabilities and failure modes. We find that human-seeded attacks reduce the safety score of a frontier monitor (Claude Opus 4.5) by 27 percentage points relative to model-elicited attacks, and that monitoring capability varies across model families and generations: Claude Opus 4.5 catches 71% of attacks on our HARD dataset, while GPT-5 catches only 11%. Critically, we find that monitors often detect suspicious actions but fail to calibrate their suspiciousness scores appropriately, suggesting tractable paths for improvement. The gap between human-seeded and model-elicited attack performance suggests that current red-teaming practices may underestimate attack strength, risking overconfidence in reported safety metrics.