CFLBENCH: BENCHMARKING NOVEL CONTROL FLOW LANGUAGE LEARNING
Aaroosh Rustagi ⋅ Jounghyuck Sohn ⋅ Thomas Peng ⋅ Mykaala Firdaus ⋅ Huanzhi Mao
Abstract
CFLBench evaluates whether tool-augmented language models can acquire the execution semantics of \emph{previously unseen programming languages} from a handful of examples and black-box execution feedback. We construct 19 small deterministic languages that share a compact, assembly-like surface vocabulary but differ in their underlying control-flow mechanism. Each language provides four tasks: Tasks 1--2 emphasize direct generalization from examples, and we call them \textbf{passive}; Tasks 3--4 are designed to benefit from active experimentation via probe programs and iterative hypothesis refinement, and we call them \textbf{active}. Across a range of state-of-the-art models, we observe a consistent gap between passive and active tasks. The strongest model we evaluate, GPT 5.2, achieves 94.7\% on passive tasks but drops to 60.5\% on active tasks. Other models degrade more sharply, including Claude Opus 4.5 (94.7\%$\to$42.1\%) and Gemini-3-Pro (84.2\%$\to$31.6\%). This pattern suggests that learning new execution semantics through interaction is a distinct bottleneck, separate from learning semantic patterns from examples. We release the code, tasks, and interpreters at \url{https://anonymous.4open.science/r/CFLBench-C01E/} to support fine-grained behavioral analysis of inductive strategies and failure modes.
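To make the passive/active distinction concrete, the sketch below shows a toy setting in which one example program is consistent with two different jump semantics, while a purpose-built probe program separates them. Everything in it is an assumption for illustration: the instruction set (`INC`, `OUT`, `JMP`, `HALT`), the two jump modes, and the helper names `make_interpreter` and `infer_jump_mode` are hypothetical and are not taken from the CFLBench languages or interpreters.

```python
from typing import Callable, List

Program = List[str]

def make_interpreter(jump_mode: str) -> Callable[[Program], List[int]]:
    """Build a black-box runner for a toy language (hypothetical, not a CFLBench language).

    jump_mode is the hidden semantics of `JMP n`:
      "absolute": continue at instruction index n
      "relative": continue at current index + n
    """
    def run(program: Program, max_steps: int = 100) -> List[int]:
        acc, pc, out = 0, 0, []
        for _ in range(max_steps):
            if not (0 <= pc < len(program)):
                break  # running off the program ends execution
            op, *args = program[pc].split()
            if op == "INC":
                acc += 1
            elif op == "OUT":
                out.append(acc)
            elif op == "HALT":
                break
            elif op == "JMP":
                n = int(args[0])
                pc = n if jump_mode == "absolute" else pc + n
                continue  # pc already updated by the jump
            pc += 1
        return out
    return run

# A "passive" example: the jump is taken from index 0, so absolute and
# relative semantics land on the same instruction and print the same output.
AMBIGUOUS_EXAMPLE: Program = ["JMP 2", "INC", "INC", "OUT", "HALT"]

# An "active" probe: the jump is taken from index 1, so the two semantics
# land on different instructions and the outputs differ.
PROBE: Program = ["INC", "JMP 3", "INC", "INC", "OUT", "HALT"]

def infer_jump_mode(run: Callable[[Program], List[int]]) -> str:
    """Decide between the two hypotheses using only black-box feedback on PROBE."""
    return "absolute" if run(PROBE) == [2] else "relative"

if __name__ == "__main__":
    for mode in ("absolute", "relative"):
        run = make_interpreter(mode)
        print(mode, run(AMBIGUOUS_EXAMPLE), run(PROBE), infer_jump_mode(run))
```

In this toy setting the example program alone cannot reveal the jump semantics, whereas a single well-chosen probe can; the active tasks described above are framed around this kind of hypothesis-driven experimentation, although the actual languages and probes in the benchmark may look quite different.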