Oral in Workshop: Secure and Trustworthy Large Language Models
Is Your Jailbreaking Prompt Truly Effective for Large Language Models?
Bochuan Cao · Tianrong Zhang · Yuanpu Cao · Jinyuan Jia · Lu Lin · Jinghui Chen
Despite the widespread use of large language models (LLMs), there is a growing concern about their disregarding human ethics and generating harmful content. While a series of studies is dedicated to aligning LLMs with human values, jailbreaking attacks are also designed to bypass the alignment and solicit malicious outputs from LLMs through manually or automatically generated prompts. While jailbreaking attacks and defenses claim to either raise or lower the success rate of jailbreaks, how that success is identified is often overlooked. Without a proper and acknowledged evaluation method, the research resources devoted can end up in vain, and unfortunately, existing evaluation methods all exhibit flaws of varying degrees. In this paper, we analyzed current evaluation methods for jailbreaking, grouped them into 5 categories, identified their shortcomings, and revealed 6 root causes behind them.