Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety
Abstract
Current approaches to LLM adversarial testing suffer from coverage gaps: manual red-teaming does not scale, LLM-as-attacker methods exhibit mode collapse, and gradient-based approaches produce uninterpretable token sequences. We introduce a quality-diversity evolutionary framework that operates at the semantic level, evolving interpretable attack strategies rather than token sequences. Using MAP-Elites, we maintain a diverse archive of attacks across behavioral dimensions (strategy type, encoding method). In experiments on GPT-4o-mini, Claude 3.5 Sonnet, and Gemini 2.0 Flash, we find distinct vulnerability profiles: GPT-4o-mini is vulnerable to hypothetical and multi-turn framing (fitness 0.8), Gemini 2.0 Flash is vulnerable to direct attacks with ROT13 encoding (fitness 0.8), and Claude 3.5 Sonnet refuses robustly across all strategies (maximum fitness 0.4). Because the representation is semantic, the resulting attacks are interpretable and reveal systematic weaknesses, providing actionable insights for improving LLM safety.
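
To make the archive mechanism concrete, the following is a minimal sketch of a MAP-Elites loop over the two behavioral dimensions named above. It is an illustration under stated assumptions, not the implementation used in this work: the descriptor values in `STRATEGIES` and `ENCODINGS`, the numeric `intensity` gene, the mutation operator, and `toy_fitness` are all hypothetical. In particular, `toy_fitness` is a placeholder that returns the `intensity` gene; an actual evaluation would query a target model and score its response.

```python
import random
from typing import Callable, Dict, Tuple

# Hypothetical genome: (strategy, encoding, toy "intensity" gene in [0, 1]).
Attack = Tuple[str, str, float]
Cell = Tuple[str, str]

# Hypothetical descriptor values, chosen only for illustration.
STRATEGIES = ("direct", "hypothetical", "multi_turn")
ENCODINGS = ("plain", "rot13")


def toy_fitness(attack: Attack) -> float:
    """Placeholder scorer in [0, 1]; stands in for querying and judging a model."""
    _, _, intensity = attack
    return intensity


def random_attack(rng: random.Random) -> Attack:
    return (rng.choice(STRATEGIES), rng.choice(ENCODINGS), rng.random())


def mutate(attack: Attack, rng: random.Random) -> Attack:
    """Change exactly one gene: strategy, encoding, or intensity."""
    strategy, encoding, intensity = attack
    gene = rng.randrange(3)
    if gene == 0:
        strategy = rng.choice(STRATEGIES)
    elif gene == 1:
        encoding = rng.choice(ENCODINGS)
    else:
        # Gaussian perturbation, clamped to the valid range.
        intensity = min(1.0, max(0.0, intensity + rng.gauss(0.0, 0.1)))
    return (strategy, encoding, intensity)


def map_elites(
    fitness_fn: Callable[[Attack], float],
    iterations: int = 500,
    seed: int = 0,
) -> Dict[Cell, Tuple[float, Attack]]:
    """Keep the best attack found so far in each (strategy, encoding) cell."""
    rng = random.Random(seed)
    archive: Dict[Cell, Tuple[float, Attack]] = {}
    for _ in range(iterations):
        # Occasionally inject a fresh random attack; otherwise mutate an elite.
        if not archive or rng.random() < 0.1:
            candidate = random_attack(rng)
        else:
            _, parent = rng.choice(list(archive.values()))
            candidate = mutate(parent, rng)
        score = fitness_fn(candidate)
        cell = (candidate[0], candidate[1])
        # Insert if the cell is empty or the candidate beats the current elite.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, candidate)
    return archive


if __name__ == "__main__":
    for cell, (score, _) in sorted(map_elites(toy_fitness).items()):
        print(cell, round(score, 2))
```

The archive is keyed by the behavior descriptor rather than by fitness, so a strong attack in one cell never displaces a weaker attack in another. This is what lets the search report a per-cell vulnerability profile instead of converging on a single best attack.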