Oral in Workshop: Secure and Trustworthy Large Language Models
Coercing LLMs to do and reveal (almost) anything
Jonas Geiping · Alex Stein · Manli Shu · Khalid Saifullah · Yuxin Wen · Tom Goldstein
Abstract:
It has recently been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into outputting harmful text. In this work, we argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking and provide a broad overview of possible attack surfaces and attack goals. Based on a series of concrete examples, we discuss attacks that coerce varied unintended behaviors, such as misdirection, model control, denial-of-service, or data extraction. We then analyze the mechanism by which these attacks function, highlighting the use of glitch tokens, and the propensity of attacks to control the model by coercing it to simulate code.
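To make the attack mechanism concrete, the sketch below (an illustration, not the authors' code) shows the kind of discrete prompt optimization that underlies such coercion attacks: a brute-force coordinate search over a short adversarial suffix so that a toy causal language model assigns high probability to an attacker-chosen target continuation. The ToyLM model, vocabulary size, and suffix length are illustrative assumptions; attacks on real LLMs typically use gradient information to narrow the candidate set rather than exhaustive search.

# Minimal, self-contained sketch of an adversarial-suffix search against a toy
# causal language model. Assumption: this is an illustration of the general
# technique, not the optimizer or model used in the paper.
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, DIM, SUFFIX_LEN = 100, 32, 5   # toy sizes, chosen for illustration

class ToyLM(nn.Module):
    """Stand-in for a real LLM: embeds tokens and predicts the next token."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, ids):                     # ids: (1, seq_len)
        h, _ = self.rnn(self.embed(ids))
        return self.head(h)                     # logits: (1, seq_len, VOCAB)

def target_loss(model, prompt, suffix, target):
    """Negative log-likelihood of `target` given prompt + adversarial suffix."""
    ids = torch.cat([prompt, suffix, target]).unsqueeze(0)
    logits = model(ids)[0]
    # each target token is predicted from the position immediately before it
    start = prompt.numel() + suffix.numel()
    pred = logits[start - 1 : start - 1 + target.numel()]
    return nn.functional.cross_entropy(pred, target).item()

model = ToyLM().eval()
prompt = torch.randint(0, VOCAB, (8,))           # benign user prompt (toy)
target = torch.randint(0, VOCAB, (4,))           # attacker-chosen output (toy)
suffix = torch.randint(0, VOCAB, (SUFFIX_LEN,))  # adversarial suffix to optimize

with torch.no_grad():
    for step in range(50):
        # coordinate search: try every replacement token at one suffix position
        pos = step % SUFFIX_LEN
        best_tok = suffix[pos].item()
        best_loss = target_loss(model, prompt, suffix, target)
        for tok in range(VOCAB):
            cand = suffix.clone()
            cand[pos] = tok
            loss = target_loss(model, prompt, cand, target)
            if loss < best_loss:
                best_tok, best_loss = tok, loss
        suffix[pos] = best_tok

print("optimized suffix:", suffix.tolist(), "target loss:", best_loss)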