Rethinking LLMs as Verifiers: When Verification is Harder Than Solving
Abstract
Large language models (LLMs) are increasingly used as evaluators of model outputs, a paradigm commonly referred to as LLM-as-a-judge. A natural assumption underlying this paradigm is that verification is easier than solving: given a candidate solution, a model should be able to reliably determine whether it is correct. In this work, we test this assumption empirically. Across multiple benchmarks and model families, we find that LLMs are often less accurate at verifying solutions than at solving the same tasks. This gap persists across domains, including multiple-choice reasoning, program synthesis, and multi-step problem solving. To understand this failure, we study verification along three axes. First, we identify \emph{epistemic bias}, where models are more reliable at accepting correct solutions than at rejecting incorrect ones. Second, we show \emph{perturbation insensitivity}, where models fail to detect localized errors in near-correct solutions. Third, we demonstrate that verification accuracy improves with \emph{rubric conditioning}, highlighting the role of explicit evaluation criteria. Our results show that LLM-based evaluation is not a straightforward proxy for correctness. Instead, it exhibits systematic failure modes that must be accounted for when LLMs are used as evaluators in post-training and benchmarking pipelines.
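As an illustration of how the acceptance/rejection asymmetry described above could be quantified, the following is a minimal sketch, not the procedure used in this work. It assumes each verifier decision is recorded alongside a ground-truth correctness label; the names `Judgement`, `verification_metrics`, and the returned keys are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Judgement:
    """One verifier decision on one candidate solution (hypothetical record type)."""
    is_correct: bool  # ground-truth label of the candidate solution
    accepted: bool    # True if the verifier judged the candidate correct


def verification_metrics(judgements: List[Judgement]) -> Dict[str, float]:
    """Summarize verifier behavior separately on correct and incorrect candidates."""
    correct = [j for j in judgements if j.is_correct]
    incorrect = [j for j in judgements if not j.is_correct]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect candidate")

    # Fraction of correct candidates the verifier accepts.
    accept_correct = sum(j.accepted for j in correct) / len(correct)
    # Fraction of incorrect candidates the verifier rejects.
    reject_incorrect = sum(not j.accepted for j in incorrect) / len(incorrect)
    # Overall agreement between verdicts and ground truth.
    accuracy = sum(j.accepted == j.is_correct for j in judgements) / len(judgements)

    return {
        "accept_correct": accept_correct,
        "reject_incorrect": reject_incorrect,
        "accuracy": accuracy,
        # A positive value means the verifier is better at accepting correct
        # solutions than at rejecting incorrect ones (an assumed summary statistic).
        "asymmetry": accept_correct - reject_incorrect,
    }
```

Reporting the two rates separately, rather than overall accuracy alone, keeps the asymmetry visible even when the evaluation set is dominated by correct candidates.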