Poster Sat, Apr 25, 2026 • 11:15 AM – 1:45 PM PDT

BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models

Gaurav Srivastava ⋅ Aafiya Hussain ⋅ Zhenyu Bi ⋅ Swastik Roy ⋅ Priya Pitre ⋅ Meng Lu ⋅ Morteza Ziyadi ⋅ Xuan Wang


Abstract

Evaluating language models fairly is becoming harder as static benchmarks risk contamination by training data, making it unclear whether models are truly reasoning or just recalling answers. We introduce **BeyondBench**, an evaluation framework that avoids this problem by using **algorithmic problem generation**. Instead of relying on a fixed test set that may already appear in internet-scale training data, BeyondBench creates mathematically grounded problems on the fly, so each test instance is fresh and uncontaminated. Our framework covers **44 algorithmic tasks** with a total of **117 variations**, grouped into three difficulty levels: the *Easy Suite* (29 tasks) for basic arithmetic and statistics, the *Medium Suite* (5 tasks, 49 variations) for sequence patterns and reasoning, and the *Hard Suite* (10 tasks, 68 variations) for NP-complete and constraint satisfaction problems. Each task generates problems from a combinatorial space of more than $10^{15}$ unique instances, with solutions verified deterministically by mathematical proofs. We evaluated **101 language models** (85 open-source and 16 closed-source), spanning sizes from 0.5B to 141B parameters and multiple quantization schemes. All results use three-fold evaluation to ensure statistical robustness. Our results show consistent reasoning deficiencies across model families, with performance degrading sharply as problem complexity increases from polynomial to exponential. In our Hard Suite evaluations, Gemini-2.5-pro, Llama-3.3-70B, and Qwen2.5-72B achieved average accuracies of **56.21%, 27.16%, and 33.37%**, respectively. Moreover, performance drops drastically without tool usage: GPT-5, GPT-5-mini, and GPT-5-nano show **declines** of **16.81%, 15.86%, and 43.95%** in overall accuracy, respectively, when tool access is removed.
The contamination resistance of BeyondBench rests on three guarantees: (i) the problem space is vastly larger than any static dataset, (ii) every instance has a deterministically verifiable solution (unique or fully enumerated), and (iii) isomorphic transformations generate semantically equivalent but syntactically new problems. BeyondBench redefines reasoning evaluation through genuine algorithmic problem-solving, ensuring fair and meaningful evaluation. Our public leaderboard is available at https://ctrl-gaurav.github.io/BeyondBench/. Our open-source Python package is available at https://pypi.org/project/beyondbench/, and the codebase can be found at https://github.com/ctrl-gaurav/BeyondBench for easy and reproducible evaluation.
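
The three guarantees can be illustrated with a minimal Python sketch. It is not taken from the BeyondBench package, and every name in it (`SubsetSumInstance`, `generate_instance`, `enumerate_solutions`, `verify`, `transform`) is hypothetical. It assumes a small subset-sum task as a stand-in for a Hard Suite problem: instances are drawn from a seed, all solutions are enumerated by brute force, and a scale-and-shuffle transformation yields an equivalent instance with different surface numbers.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SubsetSumInstance:
    """Hypothetical task: find a non-empty subset of `numbers` summing to `target`."""
    numbers: tuple
    target: int


def generate_instance(seed: int, n: int = 8, max_value: int = 50) -> SubsetSumInstance:
    """Draw one instance from the seed; the same seed always gives the same instance."""
    rng = random.Random(seed)
    numbers = tuple(rng.sample(range(1, max_value + 1), n))  # distinct values
    # Build the target from a hidden subset so at least one solution exists.
    hidden = rng.sample(numbers, rng.randint(2, n - 1))
    return SubsetSumInstance(numbers, sum(hidden))


def enumerate_solutions(inst: SubsetSumInstance) -> set:
    """Brute-force every subset; feasible only because the sketch keeps n small."""
    solutions = set()
    for size in range(1, len(inst.numbers) + 1):
        for combo in itertools.combinations(inst.numbers, size):
            if sum(combo) == inst.target:
                solutions.add(frozenset(combo))
    return solutions


def verify(inst: SubsetSumInstance, answer: list) -> bool:
    """Deterministic check of a model's answer: no duplicates, valid members, right sum."""
    chosen = set(answer)
    return (
        len(chosen) == len(answer)
        and len(chosen) > 0
        and chosen <= set(inst.numbers)
        and sum(chosen) == inst.target
    )


def transform(inst: SubsetSumInstance, seed: int, scale: int = 3) -> SubsetSumInstance:
    """Scale and reorder the numbers: same solution structure, new surface form."""
    rng = random.Random(seed)
    scaled = [x * scale for x in inst.numbers]
    rng.shuffle(scaled)
    return SubsetSumInstance(tuple(scaled), inst.target * scale)
```

In this sketch, scaling by a positive integer maps each solution of the original instance to exactly one solution of the transformed instance, so a memorized answer to one form does not verify against the other even though the underlying problem is the same.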