Poster
in
Workshop: I Can't Believe It's Not Better: Where Large Language Models need to improve

When Rubrics Backfire: Systematic Preference Drift in LLM Judges

Ruomeng Ding ⋅ Yifei Pang ⋅ He Sun ⋅ Yizhong Wang ⋅ Steven Wu ⋅ Zhun Deng

Project Page [ Poster] [ OpenReview]

Abstract

Evaluation and alignment pipelines for large language models increasingly rely on LLM-based judges guided by natural-language rubrics. We identify a failure mode in this workflow, which we term Rubric-Induced Preference Drift (RIPD): rubric edits that pass standard validation can nonetheless induce systematic and directional shifts in a judge’s preferences on target domains. We show that such drift can arise from seemingly natural, criterion-preserving rubric refinements and remain difficult to detect using aggregate evaluation metrics. Across multiple datasets and models, these edits preserve benchmark performance while reducing target-domain accuracy up to 27.9%. When used to generate preference labels for downstream post-training, the induced bias propagates through alignment pipelines and becomes internalized in trained policies, leading to persistent behavioral drift. Our findings demonstrate that evaluation rubrics function as a sensitive control interface rather than a neutral specification, exposing a structural vulnerability in current LLM evaluation and alignment practices.

Chat is not available.