Rethinking LLM Judges: Chain-of-Thought and Multi-Step Pipelines for Math Grading
Abstract
LLM judges are promising for evaluating reasoned mathematical solutions, yet their scores can be prompt-sensitive, unstable, and opaque. Two assumptions are currently prevalent: that chain-of-thought (CoT) reasoning provides little or no improvement in agreement with human grades for LLM judges, and that multi-step pipelines---e.g., debate or ensembles---outperform single-step evaluation. We challenge both. For CoT, on Putnam-AXIOM-Grading and IMO-GradingBench, two human-graded competition-mathematics benchmarks, we find that unconstrained CoT often reduces evaluation consistency and does not significantly affect performance. In contrast, deliberately structured CoT recovers and often improves agreement with human grades relative to single-pass scoring without CoT. This pattern is strongest for reasoning-optimized models such as DeepSeek-R1. For multi-step pipelines, popular methods---G-Eval, Chain-of-Verification, and Debate---consistently underperform simpler strategies. Our most striking finding is that comparative prompting---a single-pass strategy that explicitly compares student solutions to reference answers---is the most consistently high-performing strategy we tested: it ranks in the top three for correlation with human grades on all five models and achieves the best correlation for three of them. Our findings point to a simple prescription for difficult mathematical grading: compare against a reference, reason within constraints, and avoid unnecessary complexity.