Poster Session A, ICLR 2025 Workshop on GenAI Watermarking (WMARK)
Watermarking Degrades Alignment in Language Models: Analysis and Mitigation
Apurv Verma · Hai Phan · Shubhendu Trivedi
While watermarking’s effects on perplexity are well documented, its interaction with model alignment remains underexplored. This paper presents a systematic analysis showing that watermarking techniques alter key alignment properties—including truthfulness, safety constraints, and response rates—in large language models (LLMs). Through experiments with Gumbel and KGW watermarking across four aligned LMs of varying scales and architectures, we identify two consistent patterns: guard amplification, where overly restrictive safety behavior degrades utility, and guard attenuation, where increased helpfulness compromises safety guarantees. To address this, we introduce a rejection sampling algorithm for restoring alignment in watermarked models. We present bounds on reward scores as a function of sample size and empirically demonstrate that they are tight up to a constant factor, providing a practical approach for selecting the minimum sample size needed for alignment recovery. When combined with our modified Gumbel watermark, this method preserves both detectability and alignment in watermarked LMs.
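To make the core idea concrete, below is a minimal best-of-n rejection sampling sketch in the spirit the abstract describes: draw several watermarked completions and keep the one scoring highest under an alignment reward model. The callables `generate_watermarked` and `reward_model` and the function `rejection_sample` are hypothetical names introduced here for illustration; the paper's actual algorithm, interfaces, and sample-size bounds may differ in detail.

```python
# Sketch of alignment-restoring rejection sampling (best-of-n) over a
# watermarked LM. `generate_watermarked` and `reward_model` are assumed
# interfaces, not the paper's implementation.
from typing import Callable, Tuple


def rejection_sample(
    prompt: str,
    generate_watermarked: Callable[[str], str],  # watermarked LM sampler
    reward_model: Callable[[str, str], float],   # alignment reward r(prompt, response)
    n: int,                                      # sample budget, e.g. chosen via the paper's bounds
) -> Tuple[str, float]:
    """Draw n watermarked completions and return the highest-reward one.

    Every candidate is produced by the watermarked sampler, so the
    selected output remains detectable; choosing by reward steers the
    final response back toward aligned behavior.
    """
    best_response, best_reward = "", float("-inf")
    for _ in range(n):
        response = generate_watermarked(prompt)
        score = reward_model(prompt, response)
        if score > best_reward:
            best_response, best_reward = response, score
    return best_response, best_reward
```

Because selection happens after generation, this kind of scheme trades extra inference compute for alignment recovery without modifying the watermarking procedure itself, which is why bounding the required sample size n is the practically important question.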