Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification
Abstract
Despite rapid advances in code generation, current Large Language Models (LLMs) still lack an essential capability for reliable and verifiable code generation: compositional reasoning across multi-function programs. To explore this potential and important gap, we introduce DafnyCOMP, a benchmark designed to systematically evaluate LLMs on the generation of compositional specifications in Dafny. Unlike prior benchmarks that primarily target single-function annotation, DafnyCOMP focuses on programs composed of multiple interacting functions with necessary data dependencies, requiring LLMs to produce specifications that ensure correctness across component boundaries. Our benchmark comprises 300 automatically synthesized programs, each carefully constructed by combining 2–5 originally independent functions in a chain-based manner through LLM-driven synthesis. We evaluate LLMs from five leading research groups that represent the current frontier of reasoning-centric AI, including the GPT, Claude, Gemini, DeepSeek, and Qwen families. Our results reveal a striking dichotomy: while LLMs achieve both high syntax correctness (>99%) and moderate verification rates (>58%) in prior single-function benchmarks, they exhibit degraded syntax correctness (95.67%) and a catastrophic verification failure (3.69%) in DafnyCOMP's compositional tasks— a 92% performance gap. Even the most powerful LLM achieves only 7% verification at Pass@8, with most LLMs below 2%. Further analysis reveals that LLMs systematically fail at cross-functional reasoning through three primary failure modes: specification fragility (39.2%), implementation-proof misalignment (21.7%), and reasoning instability (14.1%). These failures clearly reveal the absence of compositional reasoning capabilities in current LLMs. DafnyCOMP thus establishes a diagnostic benchmark for tracking progress in verifiable code generation with LLMs, highlighting that the path from local to compositional verification remains largely uncharted.