ICLR Poster MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation

Zhongshen Zeng · Pengguang Chen · Shu Liu · Haiyun Jiang · Jiaya Jia

Hall 3 + Hall 2B #283

Thu 24 Apr midnight PDT — 2:30 a.m. PDT

Abstract:

In this work, we introduce a novel evaluation paradigm for Large Language Models (LLMs) that compels them to transition from a traditional question-answering role, akin to a student, to a solution-scoring role, akin to a teacher. This paradigm, which focuses on "reasoning about reasoning" and which we term meta-reasoning, shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation that effectively distinguishes between the cognitive capabilities of different models. Our meta-reasoning process mirrors "system-2" slow thinking, requiring careful examination of assumptions, conditions, calculations, and logic to identify mistakes. This paradigm makes it possible to transform existing saturated, non-differentiating benchmarks, which may have leaked into pretraining data, into evaluation tools that are both challenging and robust against data contamination. To demonstrate this, we applied our paradigm to the GSM8K dataset and developed the MR-GSM8K benchmark. Our extensive analysis includes several state-of-the-art models from both open-source and commercial domains, uncovering fundamental deficiencies in their training and evaluation methodologies. Specifically, we found that the OpenAI o1 models, which possess characteristics of "system-2" thinking, outperform the other SOTA models by more than 20 absolute points on our benchmark, supporting our deficiency hypothesis.
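To make the "teacher" role concrete, the following is a minimal sketch of how a solution-scoring evaluation could be tallied. It is an illustration only and not the benchmark's actual scoring code: the record layout (`GradedSolution`, `Verdict`), the field names, and the two metrics are all assumptions introduced here. The sketch assumes each item pairs a question with a step-by-step solution and a ground-truth label, and that the model under test returns a verdict on whether the solution is correct and, if not, where the first mistake occurs.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class GradedSolution:
    """Hypothetical benchmark item: a solution plus its ground-truth labels."""
    question: str
    steps: List[str]
    is_correct: bool                  # ground truth for the whole solution
    first_error_step: Optional[int]   # 1-based index of the first wrong step, None if correct


@dataclass
class Verdict:
    """Hypothetical model output when acting as the grader."""
    is_correct: bool
    first_error_step: Optional[int]


def score(items: List[GradedSolution], verdicts: List[Verdict]) -> Dict[str, float]:
    """Compare model verdicts against ground truth (assumed metrics, not the paper's)."""
    if len(items) != len(verdicts) or not items:
        raise ValueError("items and verdicts must be non-empty and the same length")

    pairs = list(zip(items, verdicts))

    # How often the model judges overall correctness the same way as the label.
    correctness_hits = sum(v.is_correct == it.is_correct for it, v in pairs)

    # On flawed solutions only: how often the model both flags the solution
    # as wrong and points at the labeled first error step.
    flawed = [(it, v) for it, v in pairs if not it.is_correct]
    step_hits = sum(
        (not v.is_correct) and v.first_error_step == it.first_error_step
        for it, v in flawed
    )

    return {
        "correctness_accuracy": correctness_hits / len(pairs),
        "error_step_accuracy": step_hits / len(flawed) if flawed else 0.0,
    }
```

Scoring the error location separately from the overall verdict reflects the abstract's point that a result-only check can hide weak reasoning: a grader that labels a flawed solution as wrong but cannot say where it fails earns credit on the first metric only.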
