Poster in Workshop: Secure and Trustworthy Large Language Models
Preventing Memorized Completions through White-Box Filtering
Oam Patel · Rowan Wang
Large Language Models (LLMs) can generate text they have memorized during training, which raises privacy and copyright concerns. For example, in a recent lawsuit brought by the New York Times against OpenAI, it was argued that GPT-4's verbatim memorization of NYT articles violated copyright law. Current production systems moderate content with a combination of small text classifiers and string-processing algorithms, which are prone to generalization failures. In this work, we show that the internal computations of a model provide an effective signal for memorization. Probes trained to detect LLM regurgitation of memorized training data are more sample-efficient and parameter-efficient than text classifiers, and they generalize better. We package this into a rejection-sampling-based filtering mechanism that effectively mitigates memorized completions.
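The abstract does not give implementation details, but the two components it names (a probe on internal activations and a rejection-sampling filter) can be illustrated with a minimal sketch. The sketch below assumes a linear probe over one transformer layer's hidden states and a simple resample-until-clean loop; the names `MemorizationProbe` and `filter_with_rejection_sampling`, the `generate` callable, and the choice of PyTorch are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MemorizationProbe(nn.Module):
    """Hypothetical linear probe on hidden activations from one transformer layer.

    Maps each token's activation vector to P(token is part of a memorized span).
    """

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        # activations: (batch, seq_len, hidden_dim) -> scores: (batch, seq_len)
        return torch.sigmoid(self.linear(activations)).squeeze(-1)


def filter_with_rejection_sampling(generate, probe, max_attempts: int = 8,
                                   threshold: float = 0.5):
    """Resample completions until the probe flags no token as memorized.

    `generate` is a stand-in for the LLM: it must return a sampled completion
    together with the hidden activations the probe reads.
    """
    for _ in range(max_attempts):
        completion, activations = generate()
        if probe(activations).max() < threshold:  # no token looks memorized
            return completion
    return None  # give up: refuse or fall back to a safe response


if __name__ == "__main__":
    hidden_dim = 768
    probe = MemorizationProbe(hidden_dim)
    # Stand-in for a real model: a fixed string plus random activations.
    fake_generate = lambda: ("some completion", torch.randn(1, 16, hidden_dim))
    print(filter_with_rejection_sampling(fake_generate, probe, max_attempts=4))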