Poster in Workshop: Secure and Trustworthy Large Language Models
Open Sesame! Universal Black-Box Jailbreaking of Large Language Models
Raz Lapid · Ron Langberg · Moshe Sipper
Abstract:
We introduce a novel approach that employs a genetic algorithm (GA) to manipulate LLMs when model architecture and parameters are inaccessible. The GA attack works by optimizing a universal adversarial prompt that, when combined with a user's query, disrupts the attacked model's alignment, resulting in unintended and potentially harmful outputs. To our knowledge, this is the first automated universal black-box jailbreak attack.
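The abstract names the core technique: a genetic algorithm that evolves a single adversarial suffix, scoring candidates only through query access to the target model. The sketch below illustrates that generic GA structure (selection, crossover, mutation over a fixed-length suffix); it is not the authors' implementation. In particular, `fitness` is a hypothetical, inert placeholder standing in for the paper's black-box scoring of model responses, and all names (`evolve`, `SUFFIX_LEN`, `POP_SIZE`) are our own illustrative choices.

```python
# Minimal sketch of a GA loop for evolving a universal adversarial suffix
# in a black-box setting. Assumptions: fixed-length character suffix,
# elitist selection, single-point crossover, per-character mutation.
import random

VOCAB = [chr(c) for c in range(33, 127)]  # toy character vocabulary
SUFFIX_LEN = 20
POP_SIZE = 30
GENERATIONS = 100


def random_suffix() -> str:
    return "".join(random.choice(VOCAB) for _ in range(SUFFIX_LEN))


def fitness(suffix: str, queries: list[str]) -> float:
    """Hypothetical placeholder objective. In the actual attack this would
    query the black-box target model on each (query + suffix) and score the
    responses; here it is a deterministic dummy so the sketch runs as-is."""
    return sum(hash(q + suffix) % 100 for q in queries) / (100 * len(queries))


def crossover(a: str, b: str) -> str:
    # Single-point crossover between two parent suffixes.
    cut = random.randrange(1, SUFFIX_LEN)
    return a[:cut] + b[cut:]


def mutate(s: str, rate: float = 0.1) -> str:
    # Replace each character with a random vocabulary symbol at a fixed rate.
    return "".join(random.choice(VOCAB) if random.random() < rate else c for c in s)


def evolve(queries: list[str]) -> str:
    pop = [random_suffix() for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        scored = sorted(pop, key=lambda s: fitness(s, queries), reverse=True)
        elite = scored[: POP_SIZE // 5]  # keep the top 20% unchanged
        pop = elite + [
            mutate(crossover(*random.sample(elite, 2)))
            for _ in range(POP_SIZE - len(elite))
        ]
    return max(pop, key=lambda s: fitness(s, queries))


if __name__ == "__main__":
    demo_queries = ["example query 1", "example query 2"]
    print(evolve(demo_queries))
```

Because every candidate is evaluated purely by querying the model and scoring its output, the loop needs no gradients or access to weights, which is what makes the attack black-box; the "universal" aspect comes from averaging fitness over a set of queries rather than optimizing per prompt.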