Choices Speak Louder than Questions
Abstract
Recent findings raise concerns about whether Multiple-Choice Question Answering (MCQA) evaluation accurately reflects the comprehension abilities of large language models. This paper examines \textit{choice sensitivity}: the tendency of model decisions to be driven more by the answer options than by genuine understanding of the question. We introduce a new scoring method, Normalized Probability Shift by the Question (NPSQ), designed to isolate the contribution of the question itself and thereby provide a more reliable assessment of comprehension. Across experiments with several input formats, including cloze, symbol, and hybrid variants, we find that traditional scores, such as log-likelihood and its length-normalized variant, are vulnerable to superficial characteristics of the answer choices. In contrast, NPSQ remains stable under modifications to the answer options.
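The abstract does not spell out the definition of NPSQ. One plausible reading of a ``normalized probability shift by the question'' is a contrast between the probability assigned to the gold option with and without the question in the prompt; the display below is only an illustrative sketch under that assumption, where $q$ denotes the question, $C$ the set of answer options, $a^{*}$ the gold option, and $P(\cdot \mid \cdot)$ the model's choice probability renormalized over $C$:
\[
\mathrm{NPSQ}(q, C, a^{*}) \;=\; \frac{P(a^{*} \mid q, C) \;-\; P(a^{*} \mid C)}{1 \;-\; P(a^{*} \mid C)}.
\]
Under this hypothetical reading, a value near zero would indicate that the question adds little beyond what the options alone reveal, while a value near one would indicate that the question itself drives the model's choice.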