Skip to yearly menu bar Skip to main content


INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving

Yuhuai Wu · Albert Jiang · Jimmy Ba · Roger Grosse

Keywords: [ Monte Carlo Tree Search ] [ transformers ] [ Synthetic benchmark dataset ] [ theorem proving ] [ graph neural networks ] [ generalization ]


In learning-assisted theorem proving, one of the most critical challenges is to generalize to theorems unlike those seen at training time. In this paper, we introduce INT, an INequality Theorem proving benchmark designed to test agents’ generalization ability. INT is based on a theorem generator, which provides theoretically infinite data and allows us to measure 6 different types of generalization, each reflecting a distinct challenge, characteristic of automated theorem proving. In addition, provides a fast theorem proving environment with sequence-based and graph-based interfaces, conducive to performing learning-based research. We introduce base-lines with architectures including transformers and graph neural networks (GNNs)for INT. Using INT, we find that transformer-based agents achieve stronger test performance for most of the generalization tasks, despite having much larger out-of-distribution generalization gaps than GNNs. We further find that the addition of Monte Carlo Tree Search (MCTS) at test time helps to prove new theorems.

Chat is not available.