Oral in Workshop: Secure and Trustworthy Large Language Models

Coercing LLMs to do and reveal (almost) anything

Jonas Geiping · Alex Stein · Manli Shu · Khalid Saifullah · Yuxin Wen · Tom Goldstein


Abstract:

It has recently been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into outputting harmful text. In this work, we argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking and provide a broad overview of possible attack surfaces and attack goals. Based on a series of concrete examples, we discuss attacks that coerce varied unintended behaviors, such as misdirection, model control, denial-of-service, or data extraction. We then analyze the mechanism by which these attacks function, highlighting the use of glitch tokens and the propensity of attacks to control the model by coercing it to simulate code.
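For readers unfamiliar with the glitch tokens mentioned above, the sketch below shows one common heuristic (not necessarily the authors' method) for flagging candidate glitch tokens: vocabulary entries whose decoded text does not re-encode to the same token id, which often indicates tokens the model handles inconsistently. It assumes a Hugging Face tokenizer; the choice of "gpt2" is an arbitrary example.

```python
# Minimal sketch: flag candidate "glitch" tokens via a decode/encode round-trip check.
# Assumption: a Hugging Face tokenizer ("gpt2" is only an illustrative choice).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

candidates = []
for token_id in range(tokenizer.vocab_size):
    text = tokenizer.decode([token_id])
    reencoded = tokenizer.encode(text, add_special_tokens=False)
    # Tokens that do not round-trip to the same single id are candidates.
    if reencoded != [token_id]:
        candidates.append((token_id, text))

print(f"{len(candidates)} candidate glitch tokens out of {tokenizer.vocab_size}")
```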
