Divide and Abstract: Autoformalization via Decomposition and Abstraction Learning
Marcus Min ⋅ Yeqi Gao ⋅ Wilson Sy ⋅ Zhaoyu Li ⋅ Xujie Si ⋅ Osbert Bastani
Abstract
Existing approaches to autoformalization---the task of translating informal mathematics into formal machine-verifiable languages---rely heavily on pre-defined libraries and expect LLMs to directly generate complete formalizations. These approaches face three fundamental limitations: they are bottlenecked by existing abstractions, they have difficulty handling the complexity of realistic statements, and they do not transfer well across formal languages. We propose $\textit{Divide and Abstract (DNA)}$, a zero-training framework that addresses these challenges through a two-phase approach. First, $\textit{DNA}$ extracts common mathematical concepts from the entire corpus and formalizes them as reusable abstractions, extending the target language's capability. Second, $\textit{DNA}$ hierarchically decomposes new statements into structured informal clauses, translates each clause using the learned abstractions, and composes them into complete formalizations. Our evaluation on the LeanEuclidPlus and ProofNet-Hard benchmarks demonstrates consistent improvements across multiple model families, achieving up to $\textbf{8.6}\times$ performance gains over baselines. Notably, $\textit{DNA}$ enables smaller models to match baselines using much larger models, and shows particularly strong performance on complex mathematical statements requiring nested reasoning. Furthermore, our framework requires no training on target languages, making it effective for low-resource domain-specific languages. Our code is available at https://github.com/marcusm117/DNA.
Successful Page Load