Visual Exclusivity Attacks: Automatic Multimodal Red Teaming via Agentic Planning
Yunbei Zhang ⋅ Yingqiang Ge ⋅ Weijie Xu ⋅ Yuhui Xu ⋅ Jihun Hamm ⋅ Chandan Reddy
Abstract
Current multimodal red teaming treats images as wrappers for malicious payloads via typography or adversarial noise. These attacks are structurally brittle, as standard defenses neutralize them once the payload is exposed. We introduce Visual Exclusivity (VE), a more resilient Image-as-Basis threat where harm emerges only through reasoning over visual content such as technical schematics. To systematically exploit VE, we propose Multimodal Multi-turn Agentic Planning (MM-Plan), which reframes jailbreaking from turn-by-turn reaction to global plan synthesis. MM-Plan trains an attacker planner optimized via Group Relative Policy Optimization (GRPO), enabling self-discovery of effective strategies without human supervision. We introduce VE-Safety, a human-curated dataset of 440 instances spanning 15 safety categories. MM-Plan achieves 46.3% attack success rate against Claude 4.5 Sonnet and 13.8% against GPT-5, outperforming baselines by 2–5×. Warning: This paper contains potentially harmful content.