EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages
Abstract
We present EsoLang-Bench, a benchmark that reveals fundamental limitations in how large language models (LLMs) leverage in-context learning (ICL). Frontier models that achieve 85–95% accuracy on standard code benchmarks (HumanEval, MBPP) score only 0–11% on esoteric programming languages with scarce training data. Notably, few-shot prompting yields no statistically significant improvement over zero-shot prompting (p = 0.505), contradicting the assumption that ICL enables adaptation to novel domains. Our analysis indicates that ICL primarily activates training priors rather than enabling genuine learning. Despite this limitation, self-scaffolding with direct interpreter feedback outperforms multi-agent approaches, and agentic systems achieve a 2–3× improvement through interpreter feedback loops with efficient context management. These findings have implications for understanding LLM generalization to out-of-distribution domains.