Poster Session B in the ICLR 2025 Workshop on GenAI Watermarking (WMARK)
Radioactive Secrets: Detecting Pretraining Data in Language Models via Data Poisoning
Wassim Bouaziz · Mathurin Videau · Nicolas Usunier · El-Mahdi El-Mhamdi
Abstract:
The pre-training of large language models (LLMs) relies on massive text datasets sourced from diverse and difficult-to-curate origins. While backdoor attacks on LLM training have been explored, such methods explicitly embed the targeted behavior in the training data. Indirect data poisoning, where the targeted behavior is hidden, has been understudied because optimizing discrete textual data is intractable. In this work, we demonstrate a data poisoning approach that can make a model learn arbitrary secret targets: secret responses to secret prompts that are **absent from the training corpus**. Using gradient-based prompt-tuning, we craft poisoned samples such that the model's exposure to these sequences during pre-training is equivalent to learning the secrets. We demonstrate our approach on language models pre-trained from scratch and show that less than $0.005\%$ of poisoned data is sufficient to covertly make an LM learn a secret. This is achieved without performance degradation (as measured on common LM benchmarks), even though the secrets **never appear in the training set**.
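To make the idea of indirect poisoning concrete, below is a minimal, self-contained sketch of one plausible way to craft such a poison with gradient-based optimization: gradient matching on a toy PyTorch language model. The `TinyLM` model, the gradient-matching objective, the relaxed token parameterization, and all sizes and hyperparameters are illustrative assumptions for exposition, not the paper's actual procedure.

```python
# Illustrative sketch only: a toy "indirect poisoning" loop via gradient matching.
# Everything here (TinyLM, sizes, the objective, hyperparameters) is an assumption
# made for exposition, not the method reported in the paper.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM, SEQ = 100, 32, 12  # toy vocabulary / embedding size / sequence length


class TinyLM(torch.nn.Module):
    """Toy causal LM: next-token logits from the running mean of prefix embeddings."""

    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB, DIM)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(DIM, DIM), torch.nn.Tanh(), torch.nn.Linear(DIM, VOCAB)
        )

    def forward(self, embs):  # embs: (B, T, DIM)
        steps = torch.arange(1, embs.size(1) + 1, device=embs.device).view(1, -1, 1)
        prefix_mean = embs.cumsum(dim=1) / steps  # causal summary of the prefix
        return self.mlp(prefix_mean)  # (B, T, VOCAB)


def next_token_loss(model, embs, target_probs):
    """Cross-entropy of predicting position t+1 from the prefix up to t (soft targets)."""
    logp = F.log_softmax(model(embs)[:, :-1], dim=-1)
    return -(target_probs[:, 1:] * logp).sum(dim=-1).mean()


model = TinyLM()
params = list(model.parameters())

# Hypothetical secret (prompt, response): arbitrary token ids that never
# appear verbatim in the training corpus.
secret_ids = torch.randint(0, VOCAB, (1, SEQ))
secret_probs = F.one_hot(secret_ids, VOCAB).float()
secret_grads = [
    g.detach()
    for g in torch.autograd.grad(
        next_token_loss(model, model.emb(secret_ids), secret_probs), params
    )
]

# The poison sample is parameterized as relaxed (continuous) token logits.
poison_logits = torch.randn(1, SEQ, VOCAB, requires_grad=True)
opt = torch.optim.Adam([poison_logits], lr=0.1)

for step in range(200):
    probs = F.softmax(poison_logits, dim=-1)
    poison_embs = probs @ model.emb.weight  # soft token embeddings
    poison_grads = torch.autograd.grad(
        next_token_loss(model, poison_embs, probs), params, create_graph=True
    )
    # Gradient matching: training on the poison should push the weights in the
    # same direction as training on the secret pair would.
    sims = [
        F.cosine_similarity(pg.flatten(), sg.flatten(), dim=0)
        for pg, sg in zip(poison_grads, secret_grads)
    ]
    loss = -torch.stack(sims).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Snap each position to its most likely token so the poison is ordinary text
# that could be slipped into a pre-training corpus.
print("crafted poison token ids:", poison_logits.argmax(dim=-1).tolist())
```

In the full setting described in the abstract, many such textual poison sequences would be mixed into the pre-training corpus, amounting to less than $0.005\%$ of the data, and the secret prompt/response pair itself never appears in the corpus.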