World Models as Execution Simulators for Automated Program Repair
Abstract
Automated program repair faces a practical bottleneck: validating candidate patches requires expensive test execution, yet many plausible patches prove semantically incorrect. We investigate whether world-model-trained code language models can simulate patch application and test outcomes without runtime execution, enabling efficient patch ranking. Using Meta’s Code World Model (CWM), a 32-billion-parameter LLM mid-trained on 120 million Python execution traces and 3 million agentic Docker trajectories, we propose WorldRepair, an agentic framework that formulates performance-bug repair as sequential decision-making. Rather than generating patches in one shot and then validating them exhaustively, the agent iteratively proposes and evaluates patches through simulated execution. On a dataset of 12,847 real-world Python performance optimizations from 1,847 production repositories, WorldRepair achieves 77.8% ± 1.2% Top-1 patch ranking accuracy (mean ± std over 5 seeds) and reduces test execution costs by 72.1% compared to exhaustive validation, while maintaining 91.7% agreement between simulated and actual test outcomes. These results provide initial evidence that execution-trace training helps language models encode program behavior useful for guiding automated repair, though the 4.9% false discovery rate indicates that simulated validation should complement, not replace, selective test execution.
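The idea of ranking patches by simulated outcomes and then spending real test runs only on the top candidates can be illustrated with a minimal sketch. This is not the WorldRepair implementation: the names Candidate, simulated_pass_prob, rank_then_validate, run_tests, and budget are hypothetical, and the simulator score is assumed to be a single number per patch supplied from elsewhere.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    patch_id: str
    # Hypothetical score: the simulator's estimate that the patch passes its tests.
    simulated_pass_prob: float


def rank_then_validate(
    candidates: List[Candidate],
    run_tests: Callable[[str], bool],
    budget: int,
) -> Tuple[Optional[str], int]:
    """Rank by simulated score, then run real tests on at most `budget` patches.

    Returns the first patch that passes real tests (or None) and the number
    of real test runs that were spent.
    """
    ranked = sorted(candidates, key=lambda c: c.simulated_pass_prob, reverse=True)
    runs = 0
    for cand in ranked[:budget]:
        runs += 1
        if run_tests(cand.patch_id):  # real execution, used selectively
            return cand.patch_id, runs
    return None, runs
```

Because every accepted patch still passes through a real test run, a simulator false positive costs one wasted execution rather than an incorrect repair, which is the sense in which simulation complements selective testing here.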