

Poster in Affinity Workshop: Tiny Papers Poster Session 6

LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked

Mansi Phute

#294

Abstract:

Large language models (LLMs) are popular for high-quality text generation but can also produce harmful responses when adversarial prompts bypass their safety measures. We propose LLM Self Defense, a simple approach to defend against these attacks by having an LLM screen the induced responses, requiring no fine-tuning, input preprocessing, or iterative output generation. Instead, we incorporate the generated content into a pre-defined prompt and employ another instance of an LLM to analyze the text and predict whether it is harmful. Notably, LLM Self Defense reduces the attack success rate to virtually zero against various types of attacks on GPT 3.5 and Llama 2.
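The abstract describes a simple filter: the generated response is inserted into a fixed screening prompt, and a second LLM instance classifies it as harmful or not. The sketch below illustrates that flow under stated assumptions; the `query_llm` helper is a hypothetical stand-in for any chat-completion call (e.g. to GPT 3.5 or Llama 2), and the screening prompt wording is illustrative, not the authors' exact prompt.

```python
def query_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM chat-completion API (assumption)."""
    raise NotImplementedError("Plug in a call to your model of choice here.")


def is_harmful(generated_text: str) -> bool:
    """Ask a second LLM instance to judge whether a response is harmful."""
    screening_prompt = (
        "Does the following text contain harmful content? "
        "Answer 'Yes' or 'No' only.\n\n"
        f"Text: {generated_text}"
    )
    verdict = query_llm(screening_prompt)
    return verdict.strip().lower().startswith("yes")


def self_defense(user_prompt: str) -> str:
    """Generate a response, then screen it before returning it to the user."""
    response = query_llm(user_prompt)
    if is_harmful(response):
        return "Response withheld: flagged as potentially harmful."
    return response
```

Because the screening step only wraps the already-generated text in a new prompt, it needs no access to the original model's weights or inputs, which is why the approach avoids fine-tuning and input preprocessing.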
