Language-Dependent Miscalibration in Multilingual LLM Evaluators
Ej Zhou ⋅ Lucas Resck ⋅ Zheng Hui ⋅ Anna Korhonen
Abstract
Prompted LLM-as-a-Judge systems or trained reward models are typically validated using pairwise accuracy, under the assumption that high accuracy implies reliable and language-invariant evaluation. We demonstrate that multilingual LLM evaluators exhibit large, systematic, and statistically significant language-dependent bias in pointwise scoring. We show that this mismatch has concrete downstream consequences: threshold filtering can result in huge differences in acceptance rates.
Chat is not available.
Successful Page Load