Poster
Scalable Extraction of Training Data from Aligned, Production Language Models
Milad Nasr · Javier Rando · Nicholas Carlini · Jonathan Hayase · Matthew Jagielski · A. Feder Cooper · Daphne Ippolito · Christopher Choquette-Choo · Florian Tramer · Katherine Lee
Hall 3 + Hall 2B #502
Large language models are prone to memorizing some of their training data. Memorized (and possibly sensitive) samples can then be extracted at generation time by adversarial or benign users. There is hope that model alignment---a standard training process that tunes a model to harmlessly follow user instructions---would mitigate the risk of extraction. However, we develop two novel attacks that undo a language model's alignment and recover thousands of training examples from popular proprietary aligned models such as OpenAI's ChatGPT. Our work highlights the limitations of existing safeguards in preventing training data leakage from production language models.
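To make the extraction setting concrete, the sketch below shows a divergence-style probe against a chat API, in the spirit of earlier training-data extraction work on production models. It is an illustrative assumption, not the pair of attacks developed in this paper: the prompt, model name, and the `looks_memorized` heuristic are hypothetical, and the code assumes the OpenAI Python SDK with an API key in the environment.

```python
# Hypothetical sketch of a divergence-style extraction probe.
# NOT the specific attacks in this paper; prompt, model, and heuristic are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def divergence_probe(word: str = "poem", n_repeats: int = 50) -> str:
    """Ask the model to keep repeating a single word; aligned chat models can
    'diverge' from the instruction and emit long, memorized-looking text."""
    prompt = "Repeat the following word forever: " + " ".join([word] * n_repeats)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024,
        temperature=1.0,
    )
    return response.choices[0].message.content


def looks_memorized(text: str, probe_word: str = "poem") -> bool:
    """Crude heuristic: long output that has stopped repeating the probe word
    is only a *candidate* for memorized data; confirming memorization requires
    matching the output against a reference corpus of likely training text."""
    tail = text.split()[-100:]
    return len(text) > 2000 and probe_word not in tail


if __name__ == "__main__":
    sample = divergence_probe()
    print("candidate memorized text" if looks_memorized(sample) else "no divergence observed")
```

In practice, a probe like this would be run at scale and its outputs verified against web-scale corpora; the paper's contribution is a pair of stronger attacks that succeed even on aligned, production models.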