Skip to yearly menu bar Skip to main content


Poster

Herald: A Natural Language Annotated Lean 4 Dataset

Guoxiong Gao · Yutong Wang · Jiedong Jiang · Qi Gao · Zihan Qin · Tianyi Xu · Bin Dong

Hall 3 + Hall 2B #251
[ ] [ Project Page ]
Thu 24 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract:

Verifiable formal languages like Lean have profoundly impacted mathematical reasoning, particularly through the use of large language models (LLMs) for automated reasoning. A significant challenge in training LLMs for these formal languages is the lack of parallel datasets that align natural language with formal language proofs. To address this challenge, this paper introduces a novel framework for translating the Mathlib4 corpus (a unified library of mathematics in formal language Lean 4) into natural language. Building upon this, we employ a dual augmentation strategy that combines tactic-based and informal-based approaches, leveraging the Lean-jixia system, a Lean 4 analyzer. We present the results of this pipeline on Mathlib4 as Herald (Hierarchy and Retrieval-based Translated Lean Dataset). We also propose the Herald Translator, which is fine-tuned on Herald. Herald translator achieves a 96.7\% accuracy (Pass@128) on formalizing statements in the miniF2F-test and a 23.5\% accuracy on our internal graduate-level textbook dataset, outperforming InternLM2-Math-Plus-7B (73.0\% and 7.5\%) and TheoremLlama (50.1\% and 4.0\%). Furthermore, we propose a section-level translation framework for real-world applications. As a direct application of Herald translator, we have successfully translated a template section in the Stack project, marking a notable progress in the automatic formalization of graduate-level mathematical literature. Our model, along with the datasets, are open-sourced to the public.

Live content is unavailable. Log in and register to view live content