

Poster

Scaling Laws for Adversarial Attacks on Language Model Activations and Tokens

Stanislav Fort

Hall 3 + Hall 2B #595
Thu 24 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract: We explore a class of adversarial attacks targeting the activations of language models in order to derive upper-bound scaling laws on their attack susceptibility. By manipulating a relatively small subset of model activations a, we demonstrate the ability to control the exact prediction of a significant number (in some cases up to 1000) of subsequent tokens t. We empirically verify a scaling law in which the maximum number of target tokens predicted, t_max, depends linearly on the number of tokens a whose activations the attacker controls: t_max = κa. We find that the number of bits of input the attacker must control to exert a single bit of control on the output (a property we call attack resistance χ) is remarkably stable, between 16 and 25, across orders of magnitude of model size and between model families. Compared to attacks directly on input tokens, attacks on activations are predictably much stronger; however, we identify a surprising regularity: one bit of input, steered either via activations or via tokens, exerts a similar amount of control over the model predictions. This supports the hypothesis that adversarial attacks are a consequence of a dimensionality mismatch between the input and output spaces. The ease of attacking language model activations rather than tokens has practical implications for multi-modal and selected retrieval models. By using language models as a controllable test bed for studying adversarial attacks, we explore input-output dimension regimes that are inaccessible in computer vision and greatly extend the empirical support for the dimensionality theory of adversarial attacks.
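
The setup described in the abstract can be illustrated with a small self-contained sketch. The code below is not the paper's implementation: it builds a tiny, randomly initialized causal transformer and optimizes an additive perturbation on the activations of the first a token positions so that the model's greedy predictions for the next t positions match an arbitrary target sequence, in the spirit of the activation attack described above. The toy model (ToyCausalLM), its sizes, the choice of A_TOKENS and T_TARGETS, and all optimizer settings are illustrative assumptions.

# Minimal sketch of an activation attack on a toy causal language model.
# Assumptions: a tiny random transformer stands in for a real LM, and the
# attacker perturbs the activations (here, the embeddings) of the first
# A_TOKENS positions to force the next T_TARGETS predicted tokens.

import torch
import torch.nn as nn

torch.manual_seed(0)

VOCAB, D_MODEL, N_HEAD, N_LAYER = 256, 64, 4, 2
A_TOKENS, T_TARGETS = 4, 16  # attacker controls 4 positions, targets 16 predictions

class ToyCausalLM(nn.Module):
    """A tiny decoder-only transformer used purely as a stand-in model."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, N_HEAD, 4 * D_MODEL,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, N_LAYER)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens, activation_delta=None):
        h = self.embed(tokens)  # (B, L, D) activations
        if activation_delta is not None:
            # Add the attacker's perturbation to the first A_TOKENS positions only.
            pad = torch.zeros(h.shape[0], h.shape[1] - activation_delta.shape[1], h.shape[2])
            h = h + torch.cat([activation_delta, pad], dim=1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
        h = self.blocks(h, mask=causal_mask)
        return self.head(h)  # (B, L, VOCAB) logits

model = ToyCausalLM().eval()
for p in model.parameters():
    p.requires_grad_(False)

# Fixed prompt of A_TOKENS attacked positions, followed by the T_TARGETS tokens
# whose predictions we want to force (teacher forcing during the attack).
prompt = torch.randint(0, VOCAB, (1, A_TOKENS))
targets = torch.randint(0, VOCAB, (1, T_TARGETS))
tokens = torch.cat([prompt, targets], dim=1)

# Adversarial degrees of freedom: A_TOKENS * D_MODEL real-valued activation entries.
delta = torch.zeros(1, A_TOKENS, D_MODEL, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-1)
loss_fn = nn.CrossEntropyLoss()

for step in range(500):
    logits = model(tokens, activation_delta=delta)
    # Logits at positions A_TOKENS-1 .. A_TOKENS+T_TARGETS-2 predict the targets.
    pred_logits = logits[:, A_TOKENS - 1:A_TOKENS + T_TARGETS - 1]
    loss = loss_fn(pred_logits.reshape(-1, VOCAB), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    logits = model(tokens, activation_delta=delta)
    pred = logits[:, A_TOKENS - 1:A_TOKENS + T_TARGETS - 1].argmax(-1)
    print(f"fraction of target tokens forced exactly: {(pred == targets).float().mean():.2f}")

In this toy setup the attacker controls A_TOKENS * D_MODEL real-valued activation entries, while forcing T_TARGETS output tokens corresponds to roughly T_TARGETS * log2(VOCAB) bits of output control; with an assumed effective precision per activation entry, the ratio of input bits to output bits gives a quantity analogous to the attack resistance χ discussed in the abstract.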
