Invited Talk 4 | Devina Jain | Securing Agents in the Wild: Lessons from Large-Scale Attacker-Defender Competitions
Abstract
How robust are today's LLM-based agents when facing motivated adversaries? Current approaches to agent security rely heavily on hand-crafted exploits: jailbreak prompts that are labor-intensive to develop and brittle across models. We propose a shift from static prompt discovery to dynamic, adversarial testing. We present findings from a recent large-scale red-team/blue-team competition (hosted in collaboration with UC Berkeley and Lambda). By analyzing submissions across threat domains including prompt injection, data exfiltration, and policy violations, we highlight the most common vulnerabilities attackers exploit and the most effective mitigation strategies defenders deploy. We discuss the infrastructure required to host live leaderboards and the novel failure modes of agents operating under active adversarial threat. A key focus is the challenge of objective evaluation: simplistic heuristics (such as keyword matching or a basic LLM-as-a-judge setup) frequently mistake safe refusals for successful attacks, producing false positives that overstate attack success. We outline directions for rigorous automated metrics and discuss how community-driven red-teaming can augment static benchmarks for verifiable agent security.