EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages
Abstract
Large language models achieve near-ceiling performance on code generation benchmarks, yet these results increasingly reflect memorization rather than genuine reasoning. We introduce EsoLang-Bench, a benchmark using five esoteric programming languages—Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare—that lack benchmark gaming incentives due to their economic irrationality for pre-training. These languages require the same computational primitives as mainstream programming but have 1,000–100,000× less training data than Python. We evaluate five frontier models across five prompting strategies and find a dramatic capability gap: models achieving 85–95% on standard benchmarks score only 0–11% on equivalent esoteric tasks, with 0% accuracy beyond the Easy tier. Few-shot learning and self-reflection fail to improve performance, suggesting these techniques exploit training priors rather than enabling genuine learning. EsoLang-Bench provides the first benchmark designed to mimic human learning—acquiring new languages through documentation, interpreter feedback, and iterative experimentation—measuring transferable reasoning skills resistant to data contamination.