When Rubrics Backfire: Systematic Preference Drift in LLM Judges
Abstract
Evaluation and alignment pipelines for large language models increasingly rely on LLM-based judges guided by natural-language rubrics. We identify a failure mode in this workflow, which we term Rubric-Induced Preference Drift (RIPD): rubric edits that pass standard validation can nonetheless induce systematic and directional shifts in a judge’s preferences on target domains. We show that such drift can arise from seemingly natural, criterion-preserving rubric refinements and remain difficult to detect using aggregate evaluation metrics. Across multiple datasets and models, these edits preserve benchmark performance while reducing target-domain accuracy up to 27.9%. When used to generate preference labels for downstream post-training, the induced bias propagates through alignment pipelines and becomes internalized in trained policies, leading to persistent behavioral drift. Our findings demonstrate that evaluation rubrics function as a sensitive control interface rather than a neutral specification, exposing a structural vulnerability in current LLM evaluation and alignment practices.