Poster in Workshop: Agents in the Wild: Safety, Security, and Beyond

Judge Reliability Harness: Stress Testing the Reliability of LLM Judges

Sunishchal Dev ⋅ Andrew Sloan ⋅ Joshua Kavner ⋅ Nicholas Kong ⋅ Morgan Sandler


Abstract

We present the Judge Reliability Harness, an open-source library for constructing synthetic validation suites that test the reliability of AI judges (also referred to as LLM judges or autograders). Because AI-judge-based scoring is widely deployed in AI benchmarks, more tooling may be needed to systematically assess judge behavior under realistic perturbations. Given a benchmark dataset and an AI judge configuration, the harness generates tests that evaluate both consistency (score stability under meaning-preserving edits) and discriminative accuracy (score changes under meaning-changing edits) for free-response and agentic task formats. In preliminary experiments across four judges and four benchmarks spanning safety, persuasion, misuse, and agentic behavior, we observe substantial variation in performance across models, tasks, and perturbation types. We do not observe a judge that is uniformly reliable across all tested settings, and superficial changes to formatting, phrasing, or verbosity can induce failures. Code: https://github.com/RANDCorporation/judge-reliability-harness
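The two properties defined in the abstract can be illustrated with a minimal Python sketch. This is not the harness's API: the names `consistency_rate`, `discrimination_rate`, and `toy_judge`, the binary score, and the example responses are all hypothetical, and a real judge would be an LLM call rather than a string match. The toy judge is deliberately brittle (case-sensitive) to show how a superficial edit can register as a consistency failure.

```python
from typing import Callable, Sequence

# Hypothetical judge type: maps a response string to a discrete score.
Judge = Callable[[str], int]


def consistency_rate(judge: Judge, originals: Sequence[str],
                     preserved: Sequence[str]) -> float:
    """Fraction of pairs whose score is unchanged by a meaning-preserving edit."""
    pairs = list(zip(originals, preserved))
    return sum(judge(o) == judge(p) for o, p in pairs) / len(pairs)


def discrimination_rate(judge: Judge, originals: Sequence[str],
                        altered: Sequence[str]) -> float:
    """Fraction of pairs whose score changes under a meaning-changing edit."""
    pairs = list(zip(originals, altered))
    return sum(judge(o) != judge(a) for o, a in pairs) / len(pairs)


def toy_judge(response: str) -> int:
    # Toy stand-in for an LLM judge: scores 1 if the response refuses.
    # The match is case-sensitive on purpose, which makes it brittle.
    return 1 if "cannot help" in response else 0


# Illustrative data only (not drawn from any benchmark).
originals = ["I cannot help with that request.", "Sure, here are the steps."]
preserved = ["I CANNOT HELP with that request!", "Sure, here are the steps:\n"]
altered = ["Sure, I can help with that request.", "I cannot help with those steps."]

consistency = consistency_rate(toy_judge, originals, preserved)
discrimination = discrimination_rate(toy_judge, originals, altered)
print(f"consistency={consistency:.2f} discrimination={discrimination:.2f}")
```

In this sketch the judge separates the meaning-changing pairs but flips its score on a capitalization change, so a high discrimination rate alone says little about reliability; both rates need to be checked.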
