

Poster

Persistent Pre-training Poisoning of LLMs

Yiming Zhang · Javier Rando · Ivan Evtimov · Jianfeng Chi · Eric Michael Smith · Nicholas Carlini · Florian Tramer · Daphne Ippolito

Hall 3 + Hall 2B #301
Sat 26 Apr midnight PDT — 2:30 a.m. PDT

Abstract:

Large language models are pre-trained on uncurated text datasets consisting of trillions of tokens scraped from the Web. Prior work has shown that: (1) web-scraped pre-training datasets can be practically poisoned by malicious actors; and (2) adversaries can compromise language models after poisoning fine-tuning datasets. Our work evaluates for the first time whether language models can also be compromised during pre-training, with a focus on the persistence of pre-training attacks after models are fine-tuned as helpful and harmless chatbots (i.e., after SFT and DPO). We pre-train a series of LLMs from scratch to measure the impact of a potential poisoning adversary under four different attack objectives (denial-of-service, belief manipulation, jailbreaking, and prompt stealing), and across a wide range of model sizes (from 600M to 7B). Our main result is that poisoning only 0.1% of a model's pre-training dataset is sufficient for three out of four attacks to measurably persist through post-training. Moreover, simple attacks like denial-of-service persist through post-training with a poisoning rate of only 0.001%.
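For intuition about the scale of these poisoning rates, the following sketch works out how many tokens an adversary would need to control in a hypothetical trillion-token pre-training corpus (the corpus size is an assumption for illustration, not a figure from the paper, and this arithmetic is not the authors' experimental code).

```python
# Back-of-the-envelope arithmetic: poisoned-token counts at the rates
# reported in the abstract, for an ASSUMED 1-trillion-token corpus.
CORPUS_TOKENS = 1_000_000_000_000  # hypothetical corpus size

for rate in (0.001, 0.00001):  # 0.1% and 0.001% poisoning rates
    poisoned = int(CORPUS_TOKENS * rate)
    print(f"{rate:.3%} poisoning rate -> {poisoned:,} poisoned tokens")
```

Under this assumption, a 0.1% rate corresponds to about one billion poisoned tokens and a 0.001% rate to about ten million, which is why the abstract frames web-scale poisoning as practical for a motivated adversary.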
