Oral in Workshop: Secure and Trustworthy Large Language Models
PANDORA: Detailed LLM Jailbreaking via Collaborated Phishing Agents with Decomposed Reasoning
Zhaorun Chen · Zhuokai Zhao · Wenjie Qu · Zichen Wen · Zhiguang Han · Zhihong Zhu · Jiaheng Zhang · Huaxiu Yao
While the breakthrough of large language models (LLMs) has brought significant advances to natural language processing, it also introduces new vulnerabilities, especially in security and privacy. Jailbreak attacks, a core component of red-teaming LLMs, have been an effective way to better understand and enhance LLM security by testing the resilience of existing safety features and simulating real-world attacks. In this paper, we propose PANDORA, a novel approach to LLM jailbreaking through collaborated phishing agents with decomposed reasoning. PANDORA uniquely leverages the multi-step reasoning capabilities of LLMs, decomposing adversarial attacks into stealthier sub-queries that elicit more informative responses. More specifically, it consists of four collaborating sub-modules, each tailored to dynamically refine the attack strategy while producing the adversarial response. In addition, we propose two new metrics, PASS and Adv-NER, which complement current jailbreaking evaluations with response-quality measures that do not require ground truths. Extensive experiments on the AdvBench subset demonstrate PANDORA's superior performance over existing state-of-the-art methods on four major victim models. More notably, even a more efficient, distilled version of PANDORA achieves high success rates on black-box LLMs such as GPT-4 and GPT-3.5, while requiring far less memory and fewer query iterations than other jailbreak approaches.