

Poster in Workshop: Secure and Trustworthy Large Language Models

Fight Back Against Jailbreaking via Prompt Adversarial Tuning

Yichuan Mo · Yuji Wang · Zeming Wei · Yisen Wang


Abstract:

Although Large Language Models (LLMs) have achieved tremendous success in various applications, they are also susceptible to jailbreak attacks. To protect LLMs from producing harmful information, various defense strategies have been proposed, most focusing on content filtering or adversarial training of models. In this paper, we propose an approach named Prompt Adversarial Tuning (PAT) that trains a defense control, which is then embedded as a prefix to user prompts to implement our defense strategy. We design a training process similar to adversarial training to achieve our optimization goal, alternating between updating attack and defense controls. To our knowledge, we are the first to implement defense from the perspective of prompt tuning. Once deployed, our method hardly impacts the operational efficiency of LLMs. Experiments show that our method is effective in both black-box and white-box settings, reducing the success rate of advanced attacks to nearly 0 while maintaining a benign answer rate of 80% on simple benign questions. Our work might potentially chart a new perspective for future explorations in LLM security. Our code is available at https://github.com/rain152/PAT.
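To make the alternating attack/defense scheme concrete, the toy Python sketch below illustrates the general idea of optimizing a defensive prefix against an adversarial suffix in alternating rounds. It is only a minimal illustration, not the authors' method: the scoring function (`toy_refusal_score`), the tiny vocabulary, and the greedy token search are all hypothetical stand-ins for a real LLM loss and gradient-guided token optimization; see the linked repository for the actual implementation.

```python
# Illustrative sketch of an alternating attack/defense prefix-tuning loop.
# NOT the authors' implementation; all names here are toy placeholders.
import random

VOCAB = ["please", "ignore", "comply", "refuse", "safety", "rules", "!", "?"]

def toy_refusal_score(defense_prefix, attack_suffix, prompt):
    # Placeholder for a loss computed with a real LLM: higher means the
    # model is more likely to refuse the harmful prompt. Here we simply
    # reward safety-like tokens in the prefix and penalize compliance-like
    # tokens in the suffix.
    score = sum(t in {"refuse", "safety", "rules"} for t in defense_prefix)
    score -= sum(t in {"ignore", "comply"} for t in attack_suffix)
    return score

def greedy_token_update(tokens, objective):
    # Coordinate-wise greedy substitution over a small vocabulary, a crude
    # stand-in for gradient-guided discrete token search.
    best = list(tokens)
    for i in range(len(best)):
        for cand in VOCAB:
            trial = best[:i] + [cand] + best[i + 1:]
            if objective(trial) > objective(best):
                best = trial
    return best

def prompt_adversarial_tuning(prompt, prefix_len=4, suffix_len=4, rounds=5):
    defense = [random.choice(VOCAB) for _ in range(prefix_len)]
    attack = [random.choice(VOCAB) for _ in range(suffix_len)]
    for _ in range(rounds):
        # Attack step: update the adversarial suffix to push the model
        # toward compliance (minimize the refusal score).
        attack = greedy_token_update(
            attack, lambda a: -toy_refusal_score(defense, a, prompt))
        # Defense step: update the defensive prefix to restore refusal
        # despite the current attack.
        defense = greedy_token_update(
            defense, lambda d: toy_refusal_score(d, attack, prompt))
    return defense  # prepended to user prompts at inference time

if __name__ == "__main__":
    print("defense prefix:", prompt_adversarial_tuning("toy harmful request"))
```

Because the learned defense is just a short prefix prepended to user prompts, applying it at inference time adds essentially no computational overhead, which matches the efficiency claim in the abstract.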
