Poster in Workshop: Workshop on Logical Reasoning of Large Language Models

CFLBENCH: BENCHMARKING NOVEL CONTROL FLOW LANGUAGE LEARNING

Aaroosh Rustagi ⋅ Jounghyuck Sohn ⋅ Thomas Peng ⋅ Mykaala Firdaus ⋅ Huanzhi Mao

Abstract

CFLBench evaluates whether tool-augmented language models can acquire the execution semantics of previously unseen programming languages from a handful of examples and black-box execution feedback. We construct 19 small deterministic languages that share a compact, assembly-like surface vocabulary but differ in their underlying control-flow mechanism. Each language provides four tasks: Tasks 1-2 emphasize direct generalization from examples and are termed passive, while Tasks 3-4 are designed to benefit from active experimentation via probe programs and iterative hypothesis refinement and are termed active. Across a range of state-of-the-art models, we observe a consistent gap between passive and active tasks. The strongest model we evaluate, GPT 5.2, achieves 94.7% on passive tasks but drops to 60.5% on active tasks. Other models, such as Claude Opus 4.5 (94.7% → 42.1%) and Gemini-3-Pro (84.2% → 31.6%), degrade more sharply. This pattern suggests that learning new execution semantics through interaction is a distinct bottleneck, beyond learning semantic patterns from examples. We release the code at https://anonymous.4open.science/r/CFLBench-C01E/, along with the tasks and interpreters, to support fine-grained behavioral analysis of inductive strategies and failure modes.
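
To make the passive-to-active gap concrete, the short sketch below computes the drop in percentage points for the three models named in the abstract. The accuracy figures are the ones reported above; the dictionary layout and the function name `passive_active_gap` are illustrative assumptions and are not taken from the CFLBench code.

```python
# Passive and active accuracies (in percent) as reported in the abstract.
# The data structure and function name are hypothetical, for illustration only.
reported = {
    "GPT 5.2": (94.7, 60.5),
    "Claude Opus 4.5": (94.7, 42.1),
    "Gemini-3-Pro": (84.2, 31.6),
}


def passive_active_gap(scores):
    """Return the passive-minus-active drop in percentage points per model."""
    return {
        model: round(passive - active, 1)
        for model, (passive, active) in scores.items()
    }


gaps = passive_active_gap(reported)
for model, gap in gaps.items():
    print(f"{model}: {gap} points")
```

Under these figures, GPT 5.2 loses 34.2 points between passive and active tasks, while Claude Opus 4.5 and Gemini-3-Pro each lose 52.6 points.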