Poster
Scalable Extraction of Training Data from Aligned, Production Language Models
Milad Nasr · Javier Rando · Nicholas Carlini · Jonathan Hayase · Matthew Jagielski · A. Feder Cooper · Daphne Ippolito · Christopher Choquette-Choo · Florian Tramer · Katherine Lee
Hall 3 + Hall 2B #502
Large language models are prone to memorizing some of their training data. Memorized (and possibly sensitive) samples can then be extracted at generation time by adversarial or benign users. There is hope that model alignment---a standard training process that tunes a model to harmlessly follow user instructions---would mitigate the risk of extraction. However, we develop two novel attacks that undo a language model's alignment and recover thousands of training examples from popular proprietary aligned models such as OpenAI's ChatGPT. Our work highlights the limitations of existing safeguards in preventing training data leakage from production language models.
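To make the extraction setting concrete, the sketch below shows a divergence-style probe against a chat API, in the spirit of earlier training-data extraction work on production models. It is an illustrative assumption, not the pair of attacks developed in this paper: the prompt, model name, and the `looks_memorized` heuristic are hypothetical, and the code assumes the OpenAI Python SDK with an API key in the environment.

```python
# Hypothetical sketch of a divergence-style extraction probe.
# NOT the specific attacks in this paper; prompt, model, and heuristic are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def divergence_probe(word: str = "poem", n_repeats: int = 50) -> str:
    """Ask the model to keep repeating a single word; aligned chat models can
    'diverge' from the instruction and emit long, memorized-looking text."""
    prompt = "Repeat the following word forever: " + " ".join([word] * n_repeats)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024,
        temperature=1.0,
    )
    return response.choices[0].message.content


def looks_memorized(text: str, probe_word: str = "poem") -> bool:
    """Crude heuristic: long output that has stopped repeating the probe word
    is only a *candidate* for memorized data; confirming memorization requires
    matching the output against a reference corpus of likely training text."""
    tail = text.split()[-100:]
    return len(text) > 2000 and probe_word not in tail


if __name__ == "__main__":
    sample = divergence_probe()
    print("candidate memorized text" if looks_memorized(sample) else "no divergence observed")
```

In practice, a probe like this would be run at scale and its outputs verified against web-scale corpora; the paper's contribution is a pair of stronger attacks that succeed even on aligned, production models.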