Poster in Affinity Workshop: Tiny Papers Poster Session 6
LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked
Mansi Phute
#294
Abstract:
Large language models (LLMs) are popular for high-quality text generation but can also produce harmful responses, because adversarial prompts can bypass their safety measures. We propose LLM Self Defense, a simple approach to defend against these attacks by having an LLM screen the induced responses, requiring no fine-tuning, input preprocessing, or iterative output generation. Instead, we incorporate the generated content into a pre-defined prompt and employ another instance of an LLM to analyze the text and predict whether it is harmful. Notably, LLM Self Defense reduces the attack success rate to virtually 0 against various types of attacks on GPT-3.5 and Llama 2.
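The screening step described above can be expressed as a short zero-shot classification wrapper around an existing model. The sketch below is only an illustration of that idea: the `generate` helper, the exact wording of the filter prompt, and the refusal message are all assumptions for this example, not the authors' code or prompts.

```python
# Minimal sketch of the self-screening idea, assuming a generic `generate`
# helper that wraps whatever LLM API or local model is in use.

HARM_FILTER_PROMPT = (
    "Does the following text contain harmful content?\n\n"
    "{response}\n\n"
    "Answer 'Yes, this is harmful' or 'No, this is not harmful'."
)


def generate(prompt: str) -> str:
    """Placeholder for an LLM completion call (hypothetical, not the paper's API)."""
    raise NotImplementedError


def is_harmful(candidate_response: str) -> bool:
    """Ask a second LLM instance to classify the candidate response."""
    verdict = generate(HARM_FILTER_PROMPT.format(response=candidate_response))
    return verdict.strip().lower().startswith("yes")


def self_defense(user_prompt: str) -> str:
    """Generate a response, then screen it before returning it to the user."""
    response = generate(user_prompt)
    if is_harmful(response):
        return "Sorry, I can't help with that."
    return response
```

Because the filter only reads the already-generated text, it can be bolted onto any model without fine-tuning, preprocessing the input, or regenerating outputs, matching the abstract's description.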