ICLR Poster On the Reliability of Watermarks for Large Language Models

John Kirchenbauer · Jonas Geiping · Yuxin Wen · Manli Shu · Khalid Saifullah · Kezhi Kong · Kasun Fernando · Aniruddha Saha · Micah Goldblum · Tom Goldstein

Halle B #244

Abstract: As LLMs become commonplace, machine-generated text has the potential to flood the internet with spam, social media bots, and valueless content. _Watermarking_ is a simple and effective strategy for mitigating such harms by enabling the detection and documentation of LLM-generated text. Yet a crucial question remains: How reliable is watermarking in realistic settings in the wild? There, watermarked text may be modified to suit a user's needs, or entirely rewritten to avoid detection. We study the robustness of watermarked text after it is re-written by humans, paraphrased by a non-watermarked LLM, or mixed into a longer hand-written document. We find that watermarks remain detectable even after human and machine paraphrasing. While these attacks dilute the strength of the watermark, paraphrases are statistically likely to leak n-grams or even longer fragments of the original text, resulting in high-confidence detections when enough tokens are observed. For example, after strong human paraphrasing the watermark is detectable after observing 800 tokens on average, when setting a $1\mathrm{e}{-5}$ false positive rate. We also consider a range of new detection schemes that are sensitive to short spans of watermarked text embedded inside a large document, and we compare the robustness of watermarking to other kinds of detectors.
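
The abstract does not spell out the detection statistic, so the following is only a minimal sketch of why a diluted watermark can still be detected once enough tokens are observed. It assumes a "green list" style detector that runs a one-proportion z-test on the count of green tokens; the green-list fraction `gamma = 0.25`, the function names, and the example green fractions are illustrative assumptions, not values taken from this abstract.

```python
import math
from statistics import NormalDist


def z_score(green_count, total_tokens, gamma=0.25):
    """One-proportion z-statistic for the number of green tokens.

    Under the null hypothesis (unwatermarked text), each token is assumed
    to be green independently with probability gamma (an assumption).
    """
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std


def z_threshold(false_positive_rate=1e-5):
    """One-sided z threshold for the target false positive rate."""
    return NormalDist().inv_cdf(1.0 - false_positive_rate)


def is_detected(green_count, total_tokens, gamma=0.25, fpr=1e-5):
    """Flag the text as watermarked if the z-score exceeds the threshold."""
    return z_score(green_count, total_tokens, gamma) > z_threshold(fpr)


def min_tokens_for_detection(green_fraction, gamma=0.25, fpr=1e-5):
    """Smallest token count at which a fixed observed green fraction
    crosses the threshold. Returns None if the fraction does not exceed
    gamma, since no amount of text would then trigger detection.
    """
    if green_fraction <= gamma:
        return None
    z = z_threshold(fpr)
    # Solve (p - gamma) * sqrt(T) / sqrt(gamma * (1 - gamma)) > z for T.
    bound = (z ** 2) * gamma * (1.0 - gamma) / (green_fraction - gamma) ** 2
    return math.floor(bound) + 1


if __name__ == "__main__":
    print(f"z threshold at FPR 1e-5: {z_threshold():.3f}")
    # A strong watermark versus a heavily diluted one (hypothetical fractions).
    for fraction in (0.50, 0.30):
        print(fraction, min_tokens_for_detection(fraction))
```

In this toy model, lowering the observed green fraction toward `gamma` (as paraphrasing would) does not make detection impossible; it increases the number of tokens needed, growing as the inverse square of the gap between the observed fraction and `gamma`.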
