Poster
DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search
Huajian Xin · Z.Z. Ren · Junxiao Song · Zhihong Shao · Wanjia Zhao · Haocheng Wang · Bo Liu · Liyue Zhang · Xuan Lu · Qiushi Du · Wenjun Gao · Haowei Zhang · Qihao Zhu · Dejian Yang · Zhibin Gou · Z.F. Wu · Fuli Luo · Chong Ruan
Hall 3 + Hall 2B #226
Lean is an advanced proof assistant designed to facilitate formal theorem proving by providing a variety of interactive feedback. In this paper, we explore methodologies to leverage proof assistant feedback to augment the capabilities of large language models in constructing formal proofs. First, we deploy online reinforcement learning using Lean verification outcomes as the reward signal to improve the proof completion policy. This straightforward approach shows great promise in enhancing the model's alignment with the formal verification system. In addition, we propose RMaxTS, a variant of Monte-Carlo tree search that employs an intrinsic-reward-driven exploration strategy to generate diverse proof paths. The tree structure is organized to represent the transitions of intermediate tactic states, extracted from the compilation messages given by Lean's tactic mode. The intrinsic reward is constructed to incentivize the discovery of novel tactic states, which helps to mitigate the sparse-reward problem inherent in proof search. These techniques lead to a more efficient planning scheme for formal proof generation, achieving new state-of-the-art results on both the miniF2F and ProofNet benchmarks.
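To make the RMaxTS idea from the abstract concrete, the following is a minimal Python sketch of a Monte-Carlo tree search over tactic states with an RMax-style intrinsic reward that pays 1 the first time a previously unseen tactic state is reached and 0 otherwise. The names `Node`, `propose_tactics`, `apply_tactic`, and `intrinsic_reward`, as well as the plain UCT selection rule, are illustrative assumptions, not the paper's actual interfaces or algorithm details.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Node:
    """Search-tree node keyed by a (serialized) Lean tactic state."""
    state: str
    parent: Optional["Node"] = None
    children: Dict[str, "Node"] = field(default_factory=dict)
    visits: int = 0
    value_sum: float = 0.0

def uct_score(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT: mean value plus an exploration bonus."""
    if child.visits == 0:
        return float("inf")
    return child.value_sum / child.visits + c * math.sqrt(
        math.log(parent_visits) / child.visits
    )

def intrinsic_reward(state: str, seen_states: set) -> float:
    """RMax-style bonus: 1.0 the first time a tactic state is discovered,
    0.0 afterwards (a simplification assumed for this sketch)."""
    if state not in seen_states:
        seen_states.add(state)
        return 1.0
    return 0.0

def backpropagate(node: Optional[Node], reward: float) -> None:
    """Propagate the reward from a new leaf back up to the root."""
    while node is not None:
        node.visits += 1
        node.value_sum += reward
        node = node.parent

def mcts_step(root: Node, propose_tactics, apply_tactic, seen_states: set) -> None:
    """One selection -> expansion -> reward -> backup iteration.

    `propose_tactics(state)` and `apply_tactic(state, tactic)` stand in for
    the policy model and the Lean verifier, respectively (assumed callables,
    not the paper's API). `apply_tactic` returns the next tactic state, or
    None if Lean rejects the tactic.
    """
    node = root
    # Selection: descend via UCT until reaching a leaf.
    while node.children:
        node = max(node.children.values(),
                   key=lambda ch: uct_score(ch, node.visits))
    # Expansion: propose candidate tactics and verify each with Lean.
    for tactic in propose_tactics(node.state):
        next_state = apply_tactic(node.state, tactic)
        if next_state is None:
            continue
        child = Node(state=next_state, parent=node)
        node.children[tactic] = child
        backpropagate(child, intrinsic_reward(next_state, seen_states))
```

In this sketch the novelty bonus is backed up through the tree exactly like an extrinsic reward, so branches that keep uncovering new tactic states retain higher UCT scores and continue to be explored, which is one plausible way to read the abstract's claim that the intrinsic reward mitigates sparse rewards in proof search.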