Poster in Workshop: I Can't Believe It's Not Better: Where Large Language Models need to improve

Language-Dependent Miscalibration in Multilingual LLM Evaluators

Ej Zhou ⋅ Lucas Resck ⋅ Zheng Hui ⋅ Anna Korhonen


Abstract

Prompted LLM-as-a-Judge systems or trained reward models are typically validated using pairwise accuracy, under the assumption that high accuracy implies reliable, language-invariant evaluation. We demonstrate that multilingual LLM evaluators nevertheless exhibit large, systematic, and statistically significant language-dependent bias in pointwise scoring. This mismatch between pairwise accuracy and pointwise calibration has concrete downstream consequences: threshold filtering can produce markedly different acceptance rates across languages.
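
As an illustration of why pairwise accuracy can hide this effect, the following minimal sketch uses made-up numbers, not results from the paper. It assumes a hypothetical evaluator whose scores in a second language (labelled `xx` here) are shifted down by a constant. The scores, the offset, the threshold, and the helper names are all assumptions chosen for the example.

```python
def pairwise_accuracy(pairs):
    """Fraction of (chosen_score, rejected_score) pairs ranked correctly."""
    return sum(c > r for c, r in pairs) / len(pairs)


def acceptance_rate(scores, threshold):
    """Fraction of scores that pass a fixed threshold filter."""
    return sum(s >= threshold for s in scores) / len(scores)


# Hypothetical pointwise scores for (chosen, rejected) responses.
base_pairs = [(7.0, 4.0), (6.0, 5.0), (8.0, 6.5), (5.5, 3.0)]

# Assumed per-language offset: the evaluator scores language "xx"
# two points lower than "en" for responses of equal quality.
offset = {"en": 0.0, "xx": -2.0}

pairs_by_lang = {
    lang: [(c + d, r + d) for c, r in base_pairs]
    for lang, d in offset.items()
}

threshold = 6.0  # illustrative filtering threshold
results = {}
for lang, pairs in pairs_by_lang.items():
    chosen_scores = [c for c, _ in pairs]
    results[lang] = (
        pairwise_accuracy(pairs),
        acceptance_rate(chosen_scores, threshold),
    )

for lang, (acc, rate) in results.items():
    print(f"{lang}: pairwise accuracy={acc:.2f}, acceptance rate={rate:.2f}")
```

In this toy setting a constant shift leaves every within-pair ranking intact, so pairwise accuracy is identical in both languages, while the share of chosen responses passing the same threshold differs.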
