Poster
Breach By A Thousand Leaks: Unsafe Information Leakage in 'Safe' AI Responses
David Glukhov · Ziwen Han · I Shumailov · Vardan Papyan · Nicolas Papernot
Hall 3 + Hall 2B #537
Sat 26 Apr, midnight to 2:30 a.m. PDT
Abstract:
The vulnerability of frontier language models to misuse has prompted the development of safety measures such as filters and alignment training, which seek to ensure safety through robustness to adversarially crafted prompts. We assert that robustness is fundamentally insufficient for ensuring safety goals because of inferential threats from dual-intent queries, risks that current defenses and evaluations fail to account for. To quantify these risks, we introduce a new safety evaluation framework based on impermissible information leakage of model outputs and demonstrate how our proposed question-decomposition attack can extract dangerous knowledge from a censored LLM more effectively than traditional jailbreaking. Underlying our proposed evaluation method is a novel information-theoretic threat model of inferential adversaries, distinguished from security adversaries, such as jailbreaks, in that success involves inferring impermissible knowledge from victim outputs rather than forcing explicitly impermissible victim outputs. Through our information-theoretic framework, we show that ensuring safety against inferential adversaries requires defenses which bound impermissible information leakage, and that such defenses inevitably incur safety-utility trade-offs.
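One way to make the leakage notion concrete is information-theoretic: treat the impermissible knowledge as a random variable X, the adversary's dual-intent sub-queries as Q_1, ..., Q_n, and the model's responses as Y_1, ..., Y_n. The sketch below is an illustrative formalization under these assumed symbols and an assumed leakage budget \varepsilon; the paper's exact definitions may differ.

\[
  I(X; Y_{1:n}) \;=\; H(X) - H(X \mid Y_{1:n}),
  \qquad
  \sup_{Q_{1:n}} \, I(X; Y_{1:n}) \;\le\; \varepsilon .
\]

On this reading, each individual response Y_i can appear safe while the aggregate reduction in uncertainty H(X) - H(X \mid Y_{1:n}) remains large, which is why robustness to any single prompt does not bound leakage, and why shrinking \varepsilon also limits the benign information the model can convey, i.e., the safety-utility trade-off.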