Judge Reliability Harness: Stress Testing the Reliability of LLM Judges
Abstract
We present the Judge Reliability Harness, an open-source library for constructing synthetic validation suites that test the reliability of AI judges (also referred to as LLM judges or autograders). Because AI-judge-based scoring is widely deployed in AI benchmarks, more tooling may be needed to systematically assess judge behavior under realistic perturbations. Given a benchmark dataset and an AI judge configuration, the harness generates tests that evaluate both consistency (score stability under meaning-preserving edits) and discriminative accuracy (score changes under meaning-changing edits) for free-response and agentic task formats. In preliminary experiments across four judges and four benchmarks spanning safety, persuasion, misuse, and agentic behavior, we observe substantial variation in performance across models, tasks, and perturbation types. No judge is uniformly reliable across all tested settings, and superficial changes such as formatting, paraphrasing, and verbosity can induce failures. Code: https://github.com/RANDCorporation/judge-reliability-harness
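To make the two properties concrete, the following is a minimal illustrative sketch in Python, not the harness's actual API. It assumes a judge can be modeled as a function from a response string to a discrete score. The names `consistency_rate`, `discrimination_rate`, and `toy_judge`, and the example responses, are hypothetical and chosen only for illustration; the metrics used in the library itself may be defined differently.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge type: maps a response string to a discrete score.
Judge = Callable[[str], int]
Pair = Tuple[str, str]  # (original response, perturbed response)


def consistency_rate(judge: Judge, pairs: Iterable[Pair]) -> float:
    """Fraction of meaning-preserving pairs that receive identical scores."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    return sum(judge(a) == judge(b) for a, b in pairs) / len(pairs)


def discrimination_rate(judge: Judge, pairs: Iterable[Pair]) -> float:
    """Fraction of meaning-changing pairs whose score changes."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    return sum(judge(a) != judge(b) for a, b in pairs) / len(pairs)


def toy_judge(response: str) -> int:
    # Deliberately brittle stand-in for an LLM judge: scores 1 only when
    # the literal word "cannot" appears, so a paraphrase can flip the score.
    return 1 if "cannot" in response.lower() else 0


original = "I cannot help with that."

# Meaning-preserving edits: the score should stay the same.
preserving = [
    (original, "I CANNOT help with that."),  # formatting change
    (original, "I can't help with that."),   # paraphrase
]

# Meaning-changing edits: the score should change.
changing = [
    (original, "Sure, here is how to do it."),
    (original, "I cannot refuse; here is how to do it."),
]

print(consistency_rate(toy_judge, preserving))   # 0.5
print(discrimination_rate(toy_judge, changing))  # 0.5
```

In this toy setup the keyword judge fails one test of each kind: the contraction "can't" lowers the score despite unchanged meaning, and a compliant response that happens to contain "cannot" keeps the refusal score. A reliable judge would score 1.0 on both rates.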