OR-LLM-Bench: A Pipeline for Scalable and Verifiable Text-to-Optimization Synthesis
Zhiqi Gao ⋅ Albert Ge ⋅ Alexander Berenbeim ⋅ Nathaniel Bastian ⋅ Frederic Sala
Abstract
Operations research (OR)-style modeling poses challenges for large language models (LLMs): it requires long-context consistency, precise mathematical formulations, and the ability to infer implicit constraints. To study these challenges under controlled conditions, we build a verifiable synthetic pipeline that generates large-scale, certified optimization problem instances. Using this pipeline, we obtain several insights. First, direct natural-language translation of optimization problems runs into an \emph{effective context limit}, beyond which frontier models abruptly fail to maintain global variable–constraint consistency, even though the input remains within the nominal context window. Second, naive divide-and-conquer scaling strategies struggle due to context explosion and semantic fragmentation. Third, while frontier models can reliably infer high-level optimization structure, they struggle to correctly bind large, dense numerical data to variables at scale. Taken together, these findings identify important limitations of current LLM-based optimization approaches. For example, we synthesize an OR task on which GPT-5 nano has an effective reasoning context limit of only $\sim$2,000 tokens and suffers a performance drop of more than 50\%.