ICLR GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models

Oral in Workshop: Secure and Trustworthy Large Language Models

Haibo Jin · Ruoxi Chen · Andy Zhou · Yang Zhang · Haohan Wang

Abstract:

Large Language Models (LLMs) face significant challenges from "jailbreaks": specially crafted prompts designed to bypass safety filters and circumvent safety measures. In response, researchers have focused on developing comprehensive testing protocols that efficiently generate a wide array of potential jailbreaks. In this paper, we propose a role-playing system, GUARD (Guideline Upholding through Adaptive Role-play Diagnostics), which automatically follows government-issued guidelines to generate jailbreaks that test whether LLMs adhere to those guidelines. GUARD works by assigning four different roles to LLMs, which collaborate to produce jailbreaks in the style of human-written ones. We have empirically validated the effectiveness of GUARD on three cutting-edge open-source LLMs (Vicuna-13B, LongChat-7B, and Llama-2-7B), as well as a widely used commercial LLM (ChatGPT). Moreover, our work extends to vision-language models (MiniGPT-v2 and Gemini Vision Pro), showcasing GUARD's versatility and contributing valuable insights for the development of safer, more reliable LLM-based applications across diverse modalities.
