
Learning to Prove Theorems by Learning to Generate Theorems

by   Mingzhe Wang, et al.

We consider the task of automated theorem proving, a key AI task. Deep learning has shown promise for training theorem provers, but there are limited human-written theorems and proofs available for supervised learning. To address this limitation, we propose to learn a neural generator that automatically synthesizes theorems and proofs for the purpose of training a theorem prover. Experiments on real-world tasks demonstrate that synthetic data from our approach improves the theorem prover and advances the state of the art of automated theorem proving in Metamath.






1 Introduction

Automated theorem proving is a key task in Artificial Intelligence. The goal is to automatically generate a proof, given a conjecture (the target theorem) and a knowledge base of known facts, all expressed in a formal language. Automated theorem proving is useful in a wide range of applications, including the verification and synthesis of software and hardware systems (gu2016certikos; darvas2005theorem; kern1999formal).

Automated theorem proving boils down to a search problem: finding the sequence of symbol manipulations that generate a valid proof. A prover typically works backward: starting from the theorem statement, it searches for a path that connects the theorem to known facts in the knowledge base. The fundamental challenge lies in the explosion of search space, in particular with long proofs and large knowledge bases. The success of theorem proving thus relies on effective heuristics that guide the search by deciding the next step the prover should take.

Deep learning has emerged as a promising approach to learning search heuristics in an automated theorem prover (irving2016deepmath; yang2019coqgym; whalen2016holophrasm; loos2017deep; bansal2019holist; lee2019mathematical). The search process fundamentally reduces to a sequence of actions that manipulate a set of symbols. Thus a deep network can be trained to select the best action at each step.

A key challenge is how to train such networks. Prior work has used human-written theorems and proofs to perform imitation learning and has shown promising results (loos2017deep; yang2019coqgym; whalen2016holophrasm; paliwal2019graph). The training data consists of theorems and proofs manually written by human experts in a formal language, and the prover is trained to imitate the proof steps demonstrated by humans.

However, relying on human-written data has a major drawback: such data has limited availability and scalability. Writing theorems and proofs in a formal language requires highly specialized knowledge and skills, including mathematics, computer programming, and proficiency in the particular formal language. For a computer science graduate student, it can take months to master a new formal language such as Mizar, Metamath or HOL Light (wiedijk2003formal), after which it can take days to formalize a single page of a math textbook. This makes it impractical to crowdsource human-written proofs at large scale.

An alternative to imitation learning is reinforcement learning, which requires only formalized theorem statements but not their proofs. During training, the prover estimates the value of each action through exploration. This approach substantially reduces the amount of manual formalization needed, but at the expense of sample efficiency. The prover needs positive rewards to assess past attempts, but positive rewards are only available when the prover finds a complete proof, which is rare because it involves a combination of multiple correct steps. This leads to extremely sparse positive rewards and in turn very low sample efficiency.

In this paper, we propose to learn search heuristics using synthetic data. The basic idea is to construct a generator that automatically synthesizes new theorems and their proofs, which are then used to augment human-written data. To generate a new theorem and its proof, the generator applies an inference rule on a set of existing theorems and combines their proofs to form the proof of the new theorem. Similar to the prover, the generator performs a sequence of symbol manipulations, albeit in the inverse direction, going forward from existing theorems to a new theorem instead of from a target theorem to existing ones. A key question is how to construct a generator such that the generated data is useful. The space of new theorems is infinite, but a prover can only process a finite amount of data during training. Thus, to maximize the utility of the generated data, we make the generator learnable by parameterizing it with deep networks.

We hypothesize that the generated data will be more useful if they are similar to human-written data. Therefore we use human-written data to train a generator. We consider two scenarios. If the human-written data consist of both theorem statements and their proofs, we train the generator to follow the proof steps in the forward direction, so that a well-trained generator would derive theorems humans tend to derive. If the human-written data consist of only theorem statements but not their proofs, we use reinforcement learning to train the generator such that the generated theorems are similar to the human-written theorems. Similar to GANs (goodfellow2014generative), we measure similarity using a discriminator trained to distinguish the human-written theorems from synthetic ones.

We instantiate our approach in Metamath (metamath), a popular language for formal mathematics, and with Holophrasm (whalen2016holophrasm), a Metamath neural prover. We propose a neural theorem generator called “MetaGen”, which synthesizes new theorems and their proofs expressed in the formalism of Metamath. To the best of our knowledge, MetaGen is the first neural generator of synthetic training data for theorem proving. Experiments on real-world Metamath tasks show that synthetic data from MetaGen help provers prove more human-written theorems, achieving state-of-the-art results. Experiments also show that our approach synthesizes useful data, even with only human-written theorems but zero proofs during training.

2 Related Work

Automated theorem proving Our work is related to prior work on learning to prove theorems (whalen2016holophrasm; gauthier2018tactictoe; bansal2019holist; yang2019coqgym; loos2017deep; balunovic2018learning; kaliszyk2018reinforcement; bansal2019learning). Our work directly builds off of Holophrasm (whalen2016holophrasm), a neural-augmented theorem prover for Metamath. It contains three deep networks to generate actions and initial values to guide proof search following the UCT algorithm (kocsis2006bandit).

TacticToe (gauthier2018tactictoe), DeepHOL (bansal2019holist) and ASTactic (yang2019coqgym) are learning-based theorem provers for higher-order logic, built on various interactive theorem provers including HOL4 (slind2008brief), HOL Light (hollight) and Coq (bertot2004coq). paliwal2019graph improves DeepHOL by representing formulas as graphs. loos2017deep proposes to learn clause selection by deep learning inside the first-order logic prover E (schulz2002brainiac). FastSMT (balunovic2018learning) learns to compose search heuristics as programs with branches for the SMT solver Z3 (de2008z3).

All of these methods are orthogonal to our approach because all of their provers are learned from human-written training data, whereas our contribution is on training a neural generator of synthetic training data for theorem proving.

kaliszyk2018reinforcement; bansal2019holist; bansal2019learning use reinforcement learning to train provers with only human-written theorems but not proofs. During training, a prover collects rewards only upon finding full proofs. In contrast, we always train our prover using imitation learning. Under the same setting with only human-written theorems but not proofs, we use reinforcement learning to train our generator, whose reward is the similarity between a generated theorem and human-written theorems, as measured by an adversarial discriminator. Our reinforcement learning task is much easier because the reward is continuous and there are many ways to generate theorems similar to human-written ones.

Automatic goal generation by self-play Our work is similar to the line of work in reinforcement learning (pmlr-v80-florensa18a; sukhbaatar2017intrinsic; sukhbaatar2018learning; durugkar2018adversarial) that deploys one agent to generate tasks for another agent to accomplish. sukhbaatar2017intrinsic; pmlr-v80-florensa18a propose to train these two agents by adversarial self-play, where the generation agent learns to produce difficult goals for another agent. With self-play, the generator learns to increase the difficulty of goals and build a learning curriculum automatically.

We pursue similar ideas in the new context of theorem proving by learning to generate synthetic theorems to train the prover. Also of note is that we have no adversarial self-play. The goal of the generator is to discover novel theorems similar to human-written ones, not to beat the prover.

Recently, huang@learntoprove introduced a two-player game which encourages players to learn to predict the consistency of logical formulas by self-play. These two players behave symmetrically and compete with each other in the game. In contrast, our generator and prover execute different tasks and are cooperative. In addition, their game remains a theoretical proposal without any empirical validation, whereas we have performed experiments on large-scale data.

3 Background on Metamath

Figure 1: Left: An example proof task. Right: The proof tree of the theorem 3eqtri. Each leaf node is a hypothesis and each internal node corresponds to a proof step.

Metamath is a language for developing formal mathematics. It is one of the simplest formal systems. It has only one inference rule, called substitution, but is universally applicable in formalizing a large portion of mathematics and different types of logic (metamath). (Footnote: its largest knowledge base ranks 3rd in the “Formalizing 100 Theorems” challenge (formalize100).)

Expression and theorem A basic building block of Metamath is expressions. An expression is a sequence of tokens that follows a set of grammar rules called “generating axioms”. A token is either a constant or a variable. For example, is an expression, where and are two variables. Each expression corresponds to a unique parse tree where each internal node represents a generating axiom and each leaf node is a token.

A theorem consists of a set of expressions, one expression as its assertion and zero or more expressions as its hypotheses. The theorem can be understood to state that the hypotheses (e.g. and ) entail the assertion (e.g. ). Some examples of theorems are shown in Figure 1.

Substitution The only inference rule in Metamath is substitution, which transforms one expression by replacing each variable with a non-empty new expression. For example, the expression can be transformed to by the substitution and .

Given two expressions $a$ and $b$, we say $b$ can reach $a$, or $a$ is reachable from $b$, if there exists a substitution that transforms $b$ to $a$. This is equivalent to saying that the parse tree of $b$ can be obtained by “trimming” the parse tree of $a$: repeatedly picking an internal node, removing all its descendants, and replacing it with a variable node. Reachability can be checked by comparing parse trees; an algorithm is described in Appendix A.
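The reachability check can be illustrated with a small sketch. It is not the algorithm from Appendix A; it is a plain one-way matching routine under assumed conventions: a parse tree is either a leaf token (a string) or a tuple `(generating_axiom, child, ...)`, and the names `substitute` and `match` are hypothetical. Variable typing, which Metamath enforces, is ignored here.

```python
# Hypothetical representation: a parse tree is a leaf token (str) or a
# tuple (generating_axiom, child, child, ...). Not the paper's data structure.

def substitute(tree, sub, variables):
    """Replace every variable leaf in `tree` by its image under `sub`."""
    if isinstance(tree, str):
        return sub.get(tree, tree) if tree in variables else tree
    return (tree[0],) + tuple(substitute(c, sub, variables) for c in tree[1:])


def match(pattern, target, variables, sub=None):
    """Return a substitution that transforms `pattern` into `target`, or None.

    `target` is reachable from `pattern` exactly when the result is not None.
    """
    sub = {} if sub is None else sub
    if isinstance(pattern, str):
        if pattern in variables:
            if pattern in sub:
                # A variable must map to the same expression everywhere.
                return sub if sub[pattern] == target else None
            sub[pattern] = target
            return sub
        return sub if pattern == target else None
    # Internal node: generating axiom and arity must agree.
    if isinstance(target, str) or pattern[0] != target[0] or len(pattern) != len(target):
        return None
    for p, t in zip(pattern[1:], target[1:]):
        if match(p, t, variables, sub) is None:
            return None
    return sub
```

Matching walks both trees in lockstep, which mirrors the "trimming" view: wherever the pattern has a variable leaf, the target may have an arbitrary subtree.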

Proof step A proof step is the basic unit of reasoning. A proof step in Metamath has two parts: (1) a theorem and (2) a substitution that maps each variable in the theorem to a new expression. A proof step serves to establish entailment between expressions based on the invoked theorem. For example, let be the theorem over1i,


where is the set of variables in . Let be a substitution that maps each variable in to a new expression,


By replacing variables in with their corresponding expressions given by , we have a new assertion and new hypotheses,


This proof step establishes that the new hypothesis entails the new assertion based on theorem . The new assertion is called the conclusion of this proof step and the new hypotheses are called its preconditions. Because a theorem has one assertion and zero or more hypotheses, a proof step thus has one conclusion and zero or more preconditions.
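As a rough illustration of how a proof step yields a conclusion and preconditions, the sketch below applies a substitution to a theorem stored as token sequences. The `Theorem` class, the function names, and the toy theorem used in the test are assumptions made for illustration and are not taken from the paper or from Metamath tooling.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Tokens = Tuple[str, ...]


@dataclass(frozen=True)
class Theorem:
    # Hypothetical container: one assertion, zero or more hypotheses.
    name: str
    hypotheses: Tuple[Tokens, ...]
    assertion: Tokens


def apply_to_tokens(tokens: Tokens, sub: Dict[str, Tokens]) -> Tokens:
    """Replace each substituted variable by its expression; keep other tokens."""
    out = []
    for tok in tokens:
        out.extend(sub.get(tok, (tok,)))
    return tuple(out)


def proof_step(theorem: Theorem, sub: Dict[str, Tokens]):
    """Return (conclusion, preconditions) of invoking `theorem` with `sub`."""
    conclusion = apply_to_tokens(theorem.assertion, sub)
    preconditions = tuple(apply_to_tokens(h, sub) for h in theorem.hypotheses)
    return conclusion, preconditions
```

A theorem with no hypotheses produces an empty tuple of preconditions, which corresponds to a proof step whose node has no children to prove.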

Proof A theorem is proved if we can construct a proof tree that connects the hypotheses of the theorem to its assertion through entailment. The root node of a proof tree is the assertion of the theorem. Each leaf node of the tree is either a hypothesis of the theorem or empty. Each internal node of the tree is an expression and is associated with a proof step that uses a pre-existing theorem, together with an appropriate substitution, to establish the entailment of this internal node by its child nodes. Note that if an internal node has an empty child, it means that the proof step has no preconditions. An example proof tree is shown in Figure 1.

A proof is a sequence of proof steps that can be obtained by traversing a proof tree in pre-order. This linearized proof is equivalent to the tree representation. In this work we will use “proof” and “proof tree” interchangeably.
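The pre-order linearization can be sketched as follows. The `ProofNode` class and its fields are hypothetical; a hypothesis leaf is modeled as a node with no invoked theorem, so it contributes no proof step.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ProofNode:
    # Hypothetical node type: `theorem` is the name of the theorem invoked
    # by the proof step at this node, or "" for a hypothesis leaf.
    expression: str
    theorem: str = ""
    children: List["ProofNode"] = field(default_factory=list)


def linearize(node: ProofNode) -> List[Tuple[str, str]]:
    """Return the proof steps of the tree rooted at `node` in pre-order."""
    if not node.theorem:
        return []  # hypothesis leaf: nothing to prove
    steps = [(node.theorem, node.expression)]  # visit the node first
    for child in node.children:               # then each subtree, left to right
        steps.extend(linearize(child))
    return steps
```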

Corpus A corpus consists of a set of axioms and a sequence of theorems and their corresponding proofs. The proof of each theorem uses only the axioms and the preceding theorems.

Backward Reasoning To construct a proof tree of a target theorem, a straightforward strategy is to search backwards. We start with a single root node—the assertion of the new theorem—and pick a proof step that establishes the entailment of the root node. We expand the tree by adding the preconditions of this proof step as children of the root node. We repeatedly expand the tree by adding children to leaf nodes, until each leaf node is either empty or a hypothesis of the target theorem. This construction process can be understood as recursive goal decomposition: the assertion of the target theorem is the original goal; by picking a proof step we decompose the original goal into subgoals, which are the preconditions of the proof step; then for each subgoal we repeat this process until all subgoals are resolved.

Obviously, each time we expand the tree, we may have multiple choices of proof steps and most of them will lead to dead ends. We thus need to explore multiple alternatives, which gives rise to a search process where we need to keep track of what paths have been explored and decide which paths to explore further.
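The recursive goal decomposition described above can be sketched as a depth-limited search. This is a deliberately simplified illustration, not Holophrasm's search: it assumes the candidate proof steps are already fully instantiated as `(name, conclusion, preconditions)` triples, so no substitution needs to be found, and it explores alternatives by plain backtracking.

```python
def prove(goal, hypotheses, steps, depth=5):
    """Return a proof tree (goal, step_name, subtrees), or None on failure.

    `steps` is an assumed list of ground proof steps:
    (name, conclusion, preconditions).
    """
    if goal in hypotheses:
        return (goal, None, [])          # resolved: goal is a hypothesis
    if depth == 0:
        return None                      # give up on overly deep branches
    for name, conclusion, preconditions in steps:
        if conclusion != goal:
            continue
        subtrees = []
        for sub_goal in preconditions:   # decompose the goal into subgoals
            subtree = prove(sub_goal, hypotheses, steps, depth - 1)
            if subtree is None:
                break                    # dead end: try the next proof step
            subtrees.append(subtree)
        else:
            return (goal, name, subtrees)
    return None
```

Exhaustive backtracking like this scales poorly, which is why the choice of which proof step to try next matters so much in practice.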

4 Approach

Task setup We use the standard theorem proving setup in prior work (irving2016deepmath; bansal2019holist). A proof task consists of a target theorem (or “target” in short) to be proved and a set of background theorems to be used as known facts. For each theorem in a corpus, we construct a proof task using the theorem as the target theorem and all preceding theorems as the background theorems. In other words, each theorem in the corpus corresponds to a unique proof task that uses the theorem as the target.

We randomly split all theorems into three disjoint sets: a training set, a validation set, and a test set. Accordingly, we have three corresponding sets of proof tasks using the theorems as targets. For simplicity, we will use “a (target) theorem in the training set” as a shorthand to mean “a theorem that serves as the target theorem in a proof task in the training set”. Similarly, we will use “a background theorem in the training set” to mean “a theorem that serves as the background theorem in a proof task in the training set”.
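A minimal sketch of this task construction is shown below. The function names, the split fractions, and the seed are assumptions for illustration; the corpus is modeled as an ordered list of theorem identifiers.

```python
import random


def build_proof_tasks(corpus):
    """Pair each theorem with all theorems that precede it in the corpus."""
    return [(target, corpus[:i]) for i, target in enumerate(corpus)]


def split_tasks(tasks, valid_frac=0.1, test_frac=0.1, seed=0):
    """Randomly split tasks into disjoint (train, valid, test) lists."""
    shuffled = list(tasks)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_frac)
    n_test = int(len(shuffled) * test_frac)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test
```

Because the split is over targets, a theorem that is a test target can still appear in the background list of a training task.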

It is important to note that a theorem can serve both as a target theorem in the test set and as a background theorem in the training set. This is a standard setup and is not “training on the test set”—a background theorem is used as a known fact in a training proof task and only its statement is provided, not its proof; seeing the statement of a background theorem during training does not tell us how to prove it during testing.

We consider two training settings: (1) the standard setting where we are provided with (human-written) proofs for some or all of the proof tasks in the training set and (2) the more challenging setting where we are provided with zero proofs during training.

Approach Overview Our approach consists of two main components, a generator and a prover. The generator learns to generate synthetic theorems and proofs. The prover learns from the synthetic theorems and proofs by imitating the proof steps.

We introduce MetaGen, a neural generator that creates synthetic theorems along with their proofs. The generator conducts forward reasoning and generates theorems by creating proof steps—selecting existing theorems to use with appropriate substitutions.

We use Holophrasm (whalen2016holophrasm) as the prover. It learns by imitating the proof steps from the synthetic proofs generated by MetaGen together with human-written proofs when they are available. The prover conducts backward reasoning as described in Sec. 3. To construct a proof tree, it starts from the root node and recursively expands the tree by generating proof steps. It searches the space of possible proof steps until a successful tree is found.

For both the generator and the prover, the key operation is making a sequence of decisions to generate certain desired symbolic structures—new theorems (and their proofs) for the generator and proofs for the prover. In both cases, we train deep networks to make these decisions.

4.1 Generator

In a nutshell, MetaGen generates new theorems along with their proofs given a set of existing theorems and their proofs. The basic operation is generating a proof step—selecting an existing theorem and constructing a substitution. From this single proof step we can derive a new theorem. Now, we can treat this new theorem as an existing theorem and repeat to generate additional new theorems.

One issue requiring special handling is avoiding generating “meaningless” theorems. A meaningless theorem is one that includes a falsehood in its hypotheses; as a result, it is always provable regardless of what the assertion says. It is possible to generate such a theorem if we allow arbitrary substitutions in constructing a proof step. For example, the hypothesis can be substituted into . Such theorems are valid but unlikely to be useful as training data.

To avoid meaningless theorems, in constructing a proof step, we require that each new hypothesis produced by substitution must be an existing expression, one that already appears in a proof tree of an existing theorem (either the root, a leaf, or an internal node). This prevents introducing false expressions as hypotheses, provided that the existing proofs have no false expressions.

A second issue is generating new theorems with multi-step proofs. A single proof step gives a new theorem with a shallow proof tree: the leaf nodes are directly attached to the root. To generate theorems with longer proofs, we expand this shallow tree by extending it with existing proof trees or subtrees. For a leaf node of the shallow tree, we can replace it with an existing proof tree (or subtree) whose root node is also . For example, suppose the shallow tree proves that and entail , and there already exists another tree proving that entails . Then we can join the two trees to generate a new tree proving that and entail .

Note that such tree “grafting” can potentially introduce meaningless theorems by combining conflicting hypotheses. For example, in the same example above, we can replace the leaf node with a subtree proving entails , which leads to a new tree proving that and entail , which is meaningless. Unfortunately, there does not appear to be an easy way to avoid meaningless theorems resulting from tree grafting, because this would require checking the consistency of an arbitrary set of expressions, which can be as hard as general theorem proving. Despite this limitation, however, we still perform tree grafting because much interesting mathematics does result from nontrivial combinations of hypotheses.

To generate theorems and proofs more similar to human-written ones, we impose an additional constraint that a synthesized proof step can only invoke a theorem that has appeared as a background theorem in a training proof task. This is because in the ground-truth proof for a proof task, only the background theorems are invoked in proof steps. This means that we do not invoke any synthesized theorems or theorems that only appear as targets in the training set. To implement this constraint, the generator constructs proof steps using a restricted set of “invocable” theorems pre-specified as input to the generator.

Initializing existing proof trees The generator takes as input a set of existing theorems together with their proofs, and a set of invocable theorems. To enable tree grafting, it first builds a set $G$ of existing proof trees. For every node $e$ in every proof tree in $G$, we add to $G$ the subtree that is rooted at $e$ and contains all nodes below $e$.

Two proof trees are considered equivalent if they have the same root node and the same leaf nodes, i.e. they prove the same theorem. Among equivalent trees, we only keep the smallest one.
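The initialization and deduplication can be sketched as below. The tuple layout `(expression, step, children)` and all function names are assumptions; `step` is `None` for a hypothesis leaf. Equivalence is keyed on the root expression together with the set of hypothesis leaves, and tree size is taken to be the node count.

```python
def leaf_set(tree):
    """Hypothesis leaves of a tree given as (expression, step, children)."""
    expr, step, children = tree
    if step is None:
        return frozenset([expr])
    result = frozenset()
    for child in children:
        result |= leaf_set(child)
    return result


def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[2])


def subtrees(tree):
    """Yield the tree itself and the subtree rooted at every descendant."""
    yield tree
    for child in tree[2]:
        yield from subtrees(child)


def init_existing_trees(proof_trees):
    """Collect all subtrees, keeping the smallest tree per (root, leaves)."""
    best = {}
    for proof in proof_trees:
        for sub in subtrees(proof):
            key = (sub[0], leaf_set(sub))
            if key not in best or size(sub) < size(best[key]):
                best[key] = sub
    return best
```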

Generating new theorems To generate a new theorem, the key procedure is to construct a proof step $(t, \phi)$, consisting of an invocable theorem $t$ and a substitution $\phi$, together with a set $S$ of proof trees drawn from the set $G$ of existing proof trees, such that each precondition of this proof step matches the root node of a proof tree in $S$. This is achieved in three steps as follows (see Appendix B for pseudo-code):

  1. Pick an invocable theorem $t$ according to the frequencies of invocable theorems being used in the proofs of the existing theorems.

  2. Initialize the set of proof trees $S$ as empty. Initialize the substitution $\phi$ for $t$ as empty. For each hypothesis $h$ of theorem $t$, apply the current substitution $\phi$ to $h$ to obtain the transformed expression $h'$, find all compatible proof trees (those whose root nodes are reachable from $h'$, i.e. $h'$ can be transformed to the root node by substitution, which can be determined by comparing parse trees), and perform the following:

    • Select a compatible proof tree $s$ using a relevance network (to be described later). For each variable that has not been substituted in $\phi$, update $\phi$ by assigning the variable a substitute expression to match the root of $s$. Add tree $s$ to set $S$.

    If no compatible proof tree exists, go to Step 1.

  3. If a variable appears in a hypothesis of $t$, its substitution has been determined by matching this hypothesis with the root of a compatible proof tree. For the remaining variables that appear exclusively in the assertion of $t$, use a substitution network (to be described later) to generate substitute expressions for them.

This proof step gives a one-step proof tree, which we expand to a multi-step proof tree by grafting the trees in set $S$ onto its leaves. We then add this multi-step proof tree to the set $G$ of existing proof trees.
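Step 1 of the procedure above, frequency-based selection of an invocable theorem, can be sketched with the standard library. The sketch assumes each existing proof is available as a list of invoked theorem names; the function names are hypothetical, and theorems that were never invoked simply receive zero probability here, which may differ from the actual implementation.

```python
import random
from collections import Counter


def invocation_counts(proofs, invocable):
    """Count how often each invocable theorem is used in existing proofs."""
    return Counter(name for proof in proofs for name in proof if name in invocable)


def pick_invocable(counts, rng):
    """Sample one theorem name with probability proportional to its count."""
    names = sorted(counts)  # fixed order keeps sampling reproducible
    weights = [counts[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

If Step 2 finds no compatible proof tree for some hypothesis, a caller would draw again with `pick_invocable`.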

Relevance network of generator The relevance network in step 2 is a deep network trained to pick a proof tree from a set of candidates by scoring and ranking them. It uses the same design as the relevance network in Holophrasm (whalen2016holophrasm) (see Sec. 4.2) but is trained with different inputs and purposes. It takes two sequences of tokens as input. One input sequence represents the root and leaf nodes of a proof tree. The other sequence consists of two parts. One part represents the leaf nodes of the proof trees that have been selected for preceding hypotheses (the hypotheses are processed one by one). The other part represents the assertion and hypotheses of the invocable theorem transformed by the current substitution, except for the current hypothesis to be processed which is represented by a special token. Two GRU encoders convert each input sequence to an embedding vector, followed by a bilinear layer to output a score from the two vectors. In practice, we limit the number of candidate trees to 2000 for tractability.

Substitution network of generator The substitution network generates the substitution for a target variable of an invocable theorem. It uses the same design as the “generation network” in Holophrasm (whalen2016holophrasm) (see Sec. 4.2) but is trained with different inputs and purposes. It is a sequence-to-sequence model with an encoder-decoder GRU network. It takes as input the sequence of tokens that represents the assertion of the invocable theorem and the leaf nodes of the existing proof trees that have been selected to construct a proof step. The target variable is represented by a special token. The network outputs a sequence of tokens, sampled one by one based on the probabilities predicted by the network.

Generator training We propose two strategies to train the relevance network and the substitution network, depending on the availability of human-written proofs.

Our generator can work without learnable parameters if we remove the two deep networks and sample new proof steps by randomly picking existing proof trees and generating substitutions. We call such a generator MetaGen-Rand.

Given human-written proofs, we train MetaGen-IL by imitation learning. Given a proof step $(t, \phi)$ in a human-written proof tree $s$, each transformed hypothesis of theorem $t$ is an internal node of tree $s$ and is the root of a subtree; we train the relevance network to imitate this step by selecting this subtree among a large set of candidates.

For a variable $f$ that appears in the assertion but not the hypotheses of $t$, the substitution network is trained to produce its human-written substitute expression $\phi(f)$.

In the case of only human-written theorems but not their proofs, we can no longer perform imitation learning. We instead use reinforcement learning. The objective is to learn actions to maximize the similarity between the generated theorems and human-written theorems. We propose two reward functions to evaluate a generated theorem and update the two deep networks toward the higher rewards via the Reinforce algorithm (williams1992simple).

The first reward function is the cross-entropy of a generated theorem given by a language model trained from the human-written theorems. The generator trained with this reward is called MetaGen-RL-LM.

The second reward function is given by an adversarial loss similar to GAN (goodfellow2014generative): a binary classifier trained to distinguish the human-written theorems from the generated ones. It is pretrained to separate human-written theorems from the theorems generated by MetaGen-Rand, and then updated on-the-fly to separate human-written theorems from the theorems generated by the current generator. The generator is updated to minimize the adversarial loss. We call this generator MetaGen-RL-Adv.
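To make the Reinforce update concrete, here is a toy sketch on a softmax policy over a handful of discrete actions. It is not the generator's training code: the real policy is a pair of deep networks and the reward comes from a language model or a discriminator, whereas here `reward_fn`, the learning rate, and the absence of a baseline are all simplifying assumptions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, reward_fn, rng, lr=0.1):
    """Sample an action, observe its reward, and take one gradient step.

    For a softmax policy, d log pi(action) / d logit_i equals
    1[i == action] - pi(i), so the update is lr * reward * that gradient.
    """
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    reward = reward_fn(action)
    return [
        x + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (x, p) in enumerate(zip(logits, probs))
    ]
```

Repeating `reinforce_step` shifts probability mass toward actions that earn higher reward, which is the role the similarity rewards play for the generator.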

More details about the deep networks of the generator are presented in Appendix D.1.

4.2 Prover

We use Holophrasm (whalen2016holophrasm) as our theorem prover and augment its training with synthetic data. Given a proof task, Holophrasm conducts backward reasoning to prove the target theorem as described in Sec. 3. For completeness we briefly summarize how Holophrasm works and refer the reader to whalen2016holophrasm and Appendix C for more details.

Holophrasm uses Monte Carlo Tree Search (MCTS) to explore multiple branches of actions to find a proof tree. It involves three deep networks: a payoff network to determine which branch is more promising, a relevance network to pick a background theorem to construct a proof step, and a substitution network (called the generation network in whalen2016holophrasm but renamed here to avoid confusion with the generator) to generate substitutions.
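For readers unfamiliar with tree search of this kind, the sketch below shows the textbook UCB1 selection rule used by UCT. It is only an illustration: Holophrasm's exact scoring, its use of network outputs as initial values, and the exploration constant `c` are not reproduced here, and the dictionary layout of a child is an assumption.

```python
import math


def uct_score(mean_value, visits, parent_visits, c=1.0):
    """Textbook UCB1 score: exploitation term plus exploration bonus."""
    if visits == 0:
        return float("inf")  # always try unvisited branches first
    return mean_value + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children, c=1.0):
    """Pick the child (a dict with 'value' and 'visits') with the best score."""
    parent_visits = sum(child["visits"] for child in children)
    return max(
        children,
        key=lambda child: uct_score(child["value"], child["visits"], parent_visits, c),
    )
```

The exploration bonus shrinks as a branch is visited more often, so rarely tried proof steps are revisited even when their current value estimate is lower.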

4.3 Applicability to other formal systems

As is standard in related work (loos2017deep; irving2016deepmath; kaliszyk2018reinforcement; yang2019coqgym), we instantiate and validate our approach on a single formal system, but our approach is applicable to other formal systems such as HOL Light, Coq and Isabelle.

Our approach can be applied to a new system under the following conditions: (1) the search heuristics of the theorem prover can be trained by imitating ground truth proofs; (2) the proof of a theorem is a tree of intermediate goals, and a proof step demonstrates the entailment of a goal by its children; (3) an intermediate goal in the proof is equivalent to a legal theorem. These conditions are satisfied by the formal systems mentioned above.

To adapt our approach to a new system, the main effort is to rewrite the procedure of sampling proof steps, by replacing substitution with inference rules of the new system. HOL Light, Coq and Isabelle only provide tactics as inference rules to decompose a goal into subgoals for backward reasoning. However, to generate new theorems, we need to execute the corresponding reverse tactics, which are not readily available in their ML environments. Therefore, we leave the experiments on these systems as future work.

| Human proofs | Synthetic proofs | Generator | Model | Prob | Top-1 | Top-5 | Top-20 |
|---|---|---|---|---|---|---|---|
| 0 | 0 | - | tf-idf | 0.0081 | 27.82 | 36.15 | 47.68 |
| 0 | 0 | - | relevance | 0.0065 | 0.279 | 1.924 | 12.50 |
| 0 | 300K | MetaGen-Rand | relevance | 0.4681 | 48.60 | 62.93 | 70.76 |
| 0 | 300K | MetaGen-RL-LM | relevance | 0.4658 | 49.15 | 62.39 | 73.28 |
| 0 | 300K | MetaGen-RL-Adv | relevance | 0.5102 | 51.34 | 64.59 | 70.95 |
| 2179 | 0 | - | relevance | 0.6007 | 62.14 | 76.30 | 89.30 |
| 2179 | 1M | MetaGen-Rand | relevance | 0.5942 | 61.82 | 77.57 | 89.72 |
| 2179 | 1M | MetaGen-IL | relevance | 0.5889 | 62.14 | 76.29 | 87.61 |
| 21788 | 0 | - | relevance | 0.5978 | 61.54 | 74.55 | 87.28 |
| 21788 | 10M | MetaGen-Rand | relevance | 0.5907 | 62.31 | 76.01 | 87.78 |
| 21788 | 10M | MetaGen-IL | relevance | 0.5920 | 63.02 | 76.71 | 87.93 |

Table 1: Performance of the relevance network of the prover on validation data.

| Human proofs | Synthetic proofs | Generator | Model | Prob | Accuracy |
|---|---|---|---|---|---|
| 0 | 0 | - | language model | 0.0032 | 9.06 |
| 0 | 0 | - | substitution | 0.0008 | 0.01 |
| 0 | 300K | MetaGen-Rand | substitution | 0.0103 | 29.68 |
| 0 | 300K | MetaGen-RL-LM | substitution | 0.0181 | 24.33 |
| 0 | 300K | MetaGen-RL-Adv | substitution | 0.0186 | 31.38 |
| 2179 | 0 | - | substitution | 0.2738 | 58.91 |
| 2179 | 1M | MetaGen-Rand | substitution | 0.3203 | 61.78 |
| 2179 | 1M | MetaGen-IL | substitution | 0.3710 | 66.56 |
| 21788 | 0 | - | substitution | 0.6142 | 81.57 |
| 21788 | 10M | MetaGen-Rand | substitution | 0.6439 | 81.85 |
| 21788 | 10M | MetaGen-IL | substitution | 0.6847 | 83.90 |

Table 2: Performance of the substitution network of the prover on validation data.

5 Experiments

Dataset We experiment on the same version of the Metamath knowledge base as whalen2016holophrasm. It formalizes ZFC set theory and contains 1099 axioms and 27218 theorems, which give rise to 27218 corresponding proof tasks. These proof tasks are divided into 21786 training tasks, 2712 validation tasks and 2720 test tasks.

Training protocol We control for the number of human-written proofs provided during training. Specifically, we compare our approach to baselines while including either 0%, 10%, or 100% of the human-written proofs.

Implementation details We train the generator using training tasks. We then use the trained generator to generate synthetic theorems and proofs. The prover is trained on both human proofs and synthetic proofs.

We generate 300K unique theorems for the setting of 0% of human proofs (after discarding any duplicates) and 1M unique theorems for 10% of the human training proofs. We generate 10M theorems for the setting of 100% of human proofs, by generating 1M unique theorems at a time (the maximum allowed by the memory limit) and repeating 10 times.

Please refer to Appendix D for more details about the implementation, hyper-parameters and baselines.

| Human proofs | Synthetic proofs | Generator | Prover | Test proofs found |
|---|---|---|---|---|
| 0 | 0 | - | tf-idf & LM | 312 |
| 0 | 0 | - | Holophrasm | 219 |
| 0 | 300K | MetaGen-Rand | Holophrasm | 346 |
| 0 | 300K | MetaGen-RL-LM | Holophrasm | 351 |
| 0 | 300K | MetaGen-RL-Adv | Holophrasm | 357 |
| 2179 | 0 | - | Holophrasm | 452 |
| 2179 | 1M | MetaGen-Rand | Holophrasm | 461 |
| 2179 | 1M | MetaGen-IL | Holophrasm | 475 |
| 21788 | 0 | - | Holophrasm('16) | 388 |
| 21788 | 0 | - | Holophrasm | 539 |
| 21788 | 10M | MetaGen-Rand | Holophrasm | 546 |
| 21788 | 10M | MetaGen-IL | Holophrasm | 574 |

Table 3: Number of theorems proved on test data.

Hypothesis Assertion Comment
Simple arithmetic.
complex number set.
real number set.
F: bijection from X to Y.
: range of .
integer set
mod: modulo operation
Table 4: Examples of synthetic theorems from MetaGen-IL trained on all human proofs.

5.1 Results

To validate the effectiveness of our theorem generator, we evaluate provers trained on the synthetic data and compare them against various baselines.

Relevance network of prover We evaluate how synthetic data can improve the relevance network of Holophrasm. The relevance network assigns a score to each candidate background theorem. We use two metrics: (1) average probability given by the network to the groundtruth candidate through a softmax of all candidate scores and (2) top-k accuracy defined as the percentage of times a groundtruth candidate is ranked in the top k.
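The two metrics can be made concrete with a minimal sketch. It assumes each validation example is a pair of a list of candidate scores and the index of the groundtruth candidate; the function names are hypothetical and are not taken from the released code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of candidate scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def in_top_k(scores, gt_index, k):
    """True if fewer than k candidates score strictly higher than the groundtruth."""
    rank = sum(1 for s in scores if s > scores[gt_index])
    return rank < k

def evaluate_relevance(examples, ks=(1, 5, 20)):
    """examples: list of (scores, gt_index) pairs (an assumed data layout).

    Returns the average groundtruth probability and the top-k accuracies in percent.
    """
    n = len(examples)
    avg_prob = sum(softmax(s)[g] for s, g in examples) / n
    top_k = {k: 100.0 * sum(in_top_k(s, g, k) for s, g in examples) / n for k in ks}
    return avg_prob, top_k

# Two toy examples with three candidates each; the groundtruth is candidate 0.
avg_prob, top_k = evaluate_relevance([([2.0, 1.0, 0.0], 0), ([0.0, 1.0, 2.0], 0)])
```

In the toy call, the groundtruth is ranked first in one example and last in the other, so the top-1 accuracy is 50% while the top-5 accuracy is 100%.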

We evaluate Holophrasm combined with different generators. We also evaluate TF-IDF, a baseline that replaces the relevance network with tf-idf similarity. In Table 1, we see that all relevance networks trained on human-written proofs have similar performance, whether they are trained on 10% or 100% of the proofs, which suggests that the scale of training data is not the bottleneck. Similarly, we see no additional improvement from using synthetic data.

With zero human-written proofs for training, the relevance networks trained on synthetic data perform better than TF-IDF and the untrained network. The relevance network augmented by MetaGen-RL-Adv achieves better results than MetaGen-Rand and MetaGen-RL-LM, showing the effectiveness of the adversarial loss.

Substitution network of prover We evaluate how synthetic data can improve the substitution network of Holophrasm. The substitution network predicts the probability of each token at each position under teacher forcing. We use two metrics: (1) accuracy, defined as the percentage of times the tokens in the groundtruth substitutions have the highest probabilities, and (2) the average probability of generating the groundtruth substitutions, normalized by the number of tokens. Table 2 reports the results, including the result of a language model. In all settings, synthetic data brings significant improvement. The best performance is achieved with our trained generators.
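A minimal sketch of the two substitution metrics follows. The text does not spell out how the probability is normalized by the number of tokens; the sketch assumes a geometric mean over token positions, which is only one plausible reading. The data layout (one probability distribution per position) and the function names are likewise assumptions.

```python
import math

def token_accuracy(step_probs, target_ids):
    """Percentage of positions where the groundtruth token has the highest probability.

    step_probs[i] is the predicted distribution at position i under teacher forcing.
    """
    hits = 0
    for probs, target in zip(step_probs, target_ids):
        best = max(range(len(probs)), key=probs.__getitem__)
        hits += int(best == target)
    return 100.0 * hits / len(target_ids)

def length_normalized_probability(step_probs, target_ids):
    """Geometric mean of the groundtruth token probabilities (assumed normalization)."""
    log_p = sum(math.log(probs[t]) for probs, t in zip(step_probs, target_ids))
    return math.exp(log_p / len(target_ids))

# A three-token substitution over a toy vocabulary of size 3.
step_probs = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.5, 0.25, 0.25]]
target_ids = [0, 1, 2]
acc = token_accuracy(step_probs, target_ids)
prob = length_normalized_probability(step_probs, target_ids)
```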

Prover To evaluate the prover as a whole, we follow the same protocol of whalen2016holophrasm (more details in Appendix D.2) and report the number of theorems proved. We compare with the original Holophrasm prover proposed by whalen2016holophrasm trained by imitation learning on human-written proofs only. With zero human-written proofs for prover training, we also evaluate TF-IDF & LM, an ablated version of Holophrasm that needs no training proofs—we remove the relevance network and instead pick a background theorem using tf-idf similarity; we replace the substitution network with a language model of theorem statements.

As shown in Table 3, the performance of the prover shares the same pattern as the substitution network. The provers trained on synthetic data consistently prove more theorems than the provers trained on human proofs only. The provers trained with MetaGen-IL and MetaGen-RL perform better than the provers trained with MetaGen-Rand.

Our re-implementation of the Holophrasm baseline trained on all human proofs proves 539 test theorems, more than the number reported in whalen2016holophrasm. We believe this is due to our GPU implementation, which gives a speed advantage since all provers have the same time limit.

With 100% of human-written proofs for training, the prover augmented by MetaGen-IL proves 574 theorems—the best known result on the benchmark.

Examples of generated theorems Some examples of synthetic theorems are presented in Table 4. Some are trivial (first and fourth), whereas others are fairly interesting: the third theorem involves a non-trivial statement about trigonometric functions and complex numbers.

6 Conclusion

We have proposed a neural generator that automatically synthesizes theorems and proofs for the purpose of training a theorem prover. Experiments on real-world tasks have demonstrated that synthetic data from our approach improves the theorem prover and advances the state of the art of automated theorem proving in Metamath.

Acknowledgements This work is partially supported by the National Science Foundation under Grant No. 1633157.


A. Checking reachability between expressions

For an expression $a$, let $r_a$ be the root node of the parse tree of $a$. Each node in the parse tree represents either a generating axiom (if it is an internal node) or a token (if it is a leaf node). We check whether expression $a$ can reach expression $b$ by comparing their parse trees $r_a$ and $r_b$ through the following procedure:

  1. Initialize the substitution $\phi$ as empty.

  2. Compare the two root nodes.

    • If root node $r_a$ represents a variable $f$, do the following:

      • If the substitute expression $\phi(f)$ is not determined, let $\phi(f) = b$. Return True (i.e. reachable).

      • If $\phi(f) = b$, return True (i.e. reachable) because we can replace $f$ with $b$.

      • Otherwise return False (unreachable), because $b$ conflicts with the current substitution $\phi(f)$.

    • If the two root nodes represent the same generating axiom or constant, repeat Step 2 to check if each child of $r_b$ is reachable from the corresponding child of $r_a$.

      • If every child of $r_b$ is reachable from the corresponding child of $r_a$, return True.

      • Otherwise return False.

    • Otherwise return False, because the two root nodes have different values and they cannot be matched.

This procedure is summarized in Algorithm 1.

  Input: node $r_a$, node $r_b$, substitution $\phi$
  Output: True if $r_a$ could reach $r_b$, otherwise False
  if $r_a$ represents a variable $f$ then
     if $f$ in $\phi$ then
        if $\phi(f) = r_b$ then
           return True {Consistent with the current substitution}
        else
           return False {Conflict with a preceding branch}
        end if
     else
        $\phi(f) \leftarrow r_b$ {Variable $f$ should be replaced by $r_b$}
        return True
     end if
  else
     if $r_a$ and $r_b$ represent the same generating axiom or constant then
        for $i \leftarrow 1$ to len($c_a$) do
           {$c_a$ is the list of children of node $r_a$, $c_b$ is the list of children of node $r_b$}
           if Reachable($c_a[i]$, $c_b[i]$, $\phi$) = false then
              {A pair of child nodes doesn't match}
              return False
           end if
        end for
        {Every child of $r_a$ could reach a child of $r_b$}
        return True
     else
        return False {Two nodes have different values}
     end if
  end if
Algorithm 1 Function Reachable($r_a$, $r_b$, $\phi$)
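The reachability check can also be sketched in Python. The `Node` class, its field names, and the use of a dictionary for the substitution are assumptions made for illustration; they are not taken from the released code. As in the pseudo-code, the substitution is updated in place, so it may hold partial bindings when the check fails.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    """Parse-tree node (hypothetical layout): a label, child nodes, and a variable flag."""
    label: str
    children: tuple = ()
    is_variable: bool = False

def reachable(pattern, target, subst):
    """Return True if `pattern` can be turned into `target` by substituting variables.

    `subst` maps variable names to the subtrees they are bound to; it is mutated.
    """
    if pattern.is_variable:
        if pattern.label in subst:
            # Consistent with the current substitution, or in conflict with it.
            return subst[pattern.label] == target
        subst[pattern.label] = target
        return True
    # Both nodes must be the same generating axiom or constant.
    if pattern.label != target.label or len(pattern.children) != len(target.children):
        return False
    return all(reachable(p, t, subst) for p, t in zip(pattern.children, target.children))

# "(A + A)" can reach "(1 + 1)" but not "(1 + 2)".
A = Node("A", is_variable=True)
one, two = Node("1"), Node("2")
subst_ok = {}
ok = reachable(Node("+", (A, A)), Node("+", (one, one)), subst_ok)
bad = reachable(Node("+", (A, A)), Node("+", (one, two)), {})
```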

B. Pseudo-code for MetaGen

Algorithm 2 summarizes how the set of existing proof trees is initialized. Algorithm 3 summarizes the procedure to construct a proof step. Algorithm 4 summarizes the complete procedure of MetaGen.

  Input: existing theorems $T$, existing proofs $P$
  Output: existing proof trees $G$
  for theorem $t$ in $T$ do
     for hypothesis $h$ in $t$ do
        Add $h$ to $G$
     end for
     Add $t$ to $G$ as a one-step proof tree
  end for
  for proof tree $p$ in $P$ do
     for node $n$ in $p$ do
        $g \leftarrow$ the largest subtree of $p$ rooted at $n$.
        Add $g$ to $G$
     end for
  end for
Algorithm 2 Initializing existing proof trees
  Input: existing proof trees $G$, invocable theorems $I$
  Output: proof step $(t, \phi)$, proof trees $\{g_h\}$
  Sample an invocable theorem $t \in I$
  for hypothesis $h$ in $t$ do
     {$r_g$ is the root node of proof tree $g$. $G_h \subseteq G$ is the set of compatible existing proof trees}
     Sample a proof tree $g_h \in G_h$ using softmax of the relevance network scores
     $\phi_h \leftarrow$ the substitution that transforms $h$ to $r_{g_h}$
     Add $\phi_h$ to $\phi$
  end for
  for variable $f$ in $t$ do
     if $f$ not in $\phi$ then
        Generate an expression $\phi(f)$ using the substitution network
     end if
  end for
Algorithm 3 Constructing a proof step
  Input: existing theorems $T$, existing proofs $P$, int $N$
  Output: generated theorems
  Initialize existing proof trees $G$ from $T$ and $P$
  repeat
     Construct a proof step $(t, \phi)$ with proof trees $\{g_h\}$
     $g \leftarrow$ the one-step proof tree of $(t, \phi)$
     for hypothesis $h$ in $t$ do
        {$\phi(h)$ is a leaf node of the one-step proof tree $g$}
        Find $g_h$ such that $r_{g_h} = \phi(h)$
        Replace $\phi(h)$ with $g_h$ in $g$ {tree grafting}
     end for
     Add the new tree $g$ to $G$
  until $G$ reaches the expected volume $N$
Algorithm 4 MetaGen

C. Holophrasm

In this section we provide more background on the Holophrasm prover (whalen2016holophrasm). We refer the reader to whalen2016holophrasm for more details.

Backward reasoning in Holophrasm (whalen2016holophrasm) is implemented with a proof search tree, which keeps track of the exploration of multiple branches of actions to search for a complete proof tree. A proof search tree has two kinds of nodes, expressions and proof steps. An expression node has multiple proof steps as children, and each proof step establishes the entailment of this expression by the preconditions. A proof step node has its preconditions as children. An expression is labeled solved if it is a hypothesis of the target theorem or any proof step in its children is solved. A proof step is labeled solved if it has no precondition or all of its preconditions are solved. A complete proof is found if the root node, which is the assertion of the target theorem, is solved.
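The labeling rules can be sketched as a pair of mutually recursive functions. The classes below are a hypothetical, simplified stand-in for the proof search tree and ignore payoffs and visit counts.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProofStep:
    """A proof step node; its children are the preconditions it requires."""
    preconditions: List["Expression"] = field(default_factory=list)

@dataclass
class Expression:
    """An expression node; its children are candidate proof steps."""
    text: str
    steps: List[ProofStep] = field(default_factory=list)

def expression_solved(expr, hypotheses):
    """Solved if it is a hypothesis of the target theorem or any child step is solved."""
    if expr.text in hypotheses:
        return True
    return any(step_solved(step, hypotheses) for step in expr.steps)

def step_solved(step, hypotheses):
    """Solved if there are no preconditions or all preconditions are solved."""
    return all(expression_solved(p, hypotheses) for p in step.preconditions)

# Root "C" has one proof step with preconditions "A" and "B";
# "B" is closed by a proof step without preconditions.
root = Expression("C", [ProofStep([Expression("A"), Expression("B", [ProofStep()])])])
found = expression_solved(root, hypotheses={"A"})
not_found = expression_solved(root, hypotheses=set())
```

A complete proof is found exactly when `expression_solved` returns True for the root.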

Holophrasm maintains a payoff of each node in the proof search tree and uses Monte Carlo Tree Search (MCTS) to extend the proof search tree. The prover runs in iterations. In each iteration, it travels down from the root node. After visiting an expression, it either creates a new proof step as a new child or visits its best-performing child according to the UCB (kocsis2006bandit) algorithm. After visiting a proof step, it travels to its worst-performing child with the lowest payoff. When an expression node is created, it is assigned an initial payoff and has no children. When a proof step node is created, its preconditions are also created as its children and the payoff of this proof step is the lowest payoff among its children. A pass continues until a new proof step is created.
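The selection rules used during a pass can be sketched as follows. The exact UCB variant and exploration constant used by Holophrasm are not given here, so the sketch assumes the standard UCB1 formula with a hypothetical constant `c`; all function names are illustrative. Proof steps without preconditions are already solved and are not covered.

```python
import math

def ucb_score(mean_payoff, visits, parent_visits, c=1.0):
    """Standard UCB1 score (an assumed form): exploitation plus an exploration bonus."""
    if visits == 0:
        return float("inf")
    return mean_payoff + c * math.sqrt(math.log(parent_visits) / visits)

def pick_proof_step(children, parent_visits, c=1.0):
    """At an expression node: index of the child proof step with the best UCB score.

    children is a list of (mean_payoff, visits) pairs.
    """
    return max(range(len(children)),
               key=lambda i: ucb_score(children[i][0], children[i][1], parent_visits, c))

def pick_precondition(payoffs):
    """At a proof step node: index of the child expression with the lowest payoff."""
    return min(range(len(payoffs)), key=payoffs.__getitem__)

def proof_step_payoff(precondition_payoffs):
    """Payoff of a new proof step: the lowest payoff among its preconditions."""
    return min(precondition_payoffs)

explore = pick_proof_step([(0.9, 100), (0.5, 1)], parent_visits=101)
exploit = pick_proof_step([(0.9, 100), (0.5, 1)], parent_visits=101, c=0.0)
```

With the exploration bonus, the rarely visited second proof step is chosen; with `c=0.0` the choice falls back to the highest mean payoff.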

The main heuristics of the prover are how to construct a proof step and what is the initial payoff of an expression. Similar to the generator, the prover constructs a proof step by using a relevance network to pick a background theorem, and a substitution network to generate a substitution for the selected background theorem. The initial payoff of an expression is calculated by a payoff network.

| Network | Human-written proofs | Ratio of synthetic proof steps per batch | Training epochs | Initial learning rate | Epochs to decrease learning rate |
|---|---|---|---|---|---|
| relevance | 0% | 100% | 5 | | - |
| substitution | 0% | 100% | 5 | | - |
| relevance | 10% | 70% | 20 | | [8, 12, 16] |
| substitution | 10% | 70% | 60 | | [15, 30, 45] |
| relevance | 100% | 50% | 16 | | [5, 12, 14] |
| substitution | 100% | 50% | 24 | | [10, 15, 20] |

Table 5: Training details of the relevance network and the substitution network of the prover.
Relevance network of Holophrasm

The relevance network of the prover is a deep network trained to pick a background theorem to establish the entailment of an expression $e$, for the purpose of proving a target theorem $T$. It takes as input two sequences of symbols. One sequence represents the assertion and hypotheses of the background theorem. The other represents $e$ and the hypotheses of $T$. Two GRU encoders convert each sequence to an embedding vector, followed by a bilinear layer that outputs a score from the two embeddings. The background theorem with the highest score is selected to construct the next proof step. The relevance network is trained to pick the background theorem that is used in the groundtruth proof step.

Substitution network of Holophrasm

The substitution network generates the substitution for a target variable of a background theorem, for the purpose of proving a target theorem. It is a sequence-to-sequence model with an encoder-decoder GRU network. It takes as input a sequence of symbols that represents the hypotheses of the background theorem and the hypotheses of the target theorem. The target variable is replaced by a special token. It is trained to generate the substitutions of groundtruth proof steps under teacher forcing. When it is called by the prover, it generates multiple substitution candidates for each target variable via beam search.

Payoff network of Holophrasm

The payoff network calculates the payoff of an expression as the probability of this expression being used in the proof tree of a target theorem. It consists of a GRU network followed by two linear layers and the sigmoid, and takes as input a sequence of symbols that represents the expression to be evaluated and the hypotheses of the target theorem.

The payoff network is trained as a binary classifier to distinguish the expressions in groundtruth proof trees (called positive expressions) from other expressions. Since the payoff network is used to evaluate an expression added to the proof search tree, which is a precondition of a newly generated proof step, the training examples of the payoff network are generated in a similar way. For each positive expression, proof steps that establish the entailment of this expression are constructed by using the pretrained relevance and substitution network. The positive expressions from the preconditions of these proof steps are filtered out and the payoff network is trained to distinguish the positive expressions from the rest of preconditions.

D. Additional Implementation details

We implement MetaGen and Holophrasm with the same network architectures as used by whalen2016holophrasm. For all of our networks in the generator and the prover, we use bidirectional GRUs to encode input sequences, and use the Adam (kingma2014adam) optimizer to update parameters. The batch size is 100 unless otherwise noted.

Input representation of the relevance and substitution network Here we provide more details on the input representation of the relevance and substitution network, which take sequences as input. We use the same form of input representations as used by whalen2016holophrasm.

To represent an expression in a sequential form, one option is to use its “surface form”. For example, “(1+1)=2” is simply given as such. Another option is to serialize its parse tree. The parse tree of “(1+1)=2” has two generating axioms. The first axiom is the root node of the parse tree and generates an expression of the form “A=B”. The second axiom is the left child of the root node and generates an expression of the form “(C+D)”, which is used to substitute the variable A in the first axiom. The right child of the first axiom is the token “2”. Both the left child and the right child of the second axiom are the token “1”. We can then represent “(1+1)=2” as a sequence of symbols $(g_1, g_2, 1, 1, 2)$, where each symbol is a node in the parse tree and $g_1$ and $g_2$ represent the two generating axioms. This sequence is obtained by traversing the parse tree in pre-order. Following whalen2016holophrasm, we use the second option to represent expressions as input to our networks.
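The pre-order serialization of the “(1+1)=2” example can be sketched as below. The tuple-based tree layout and the use of the patterns “A=B” and “(C+D)” as labels for the two generating axioms are assumptions made for illustration.

```python
def preorder(node):
    """Serialize a parse tree, given as (label, children) pairs, in pre-order."""
    label, children = node
    sequence = [label]
    for child in children:
        sequence.extend(preorder(child))
    return sequence

# Parse tree of "(1+1)=2": the root axiom "A=B" has the axiom "(C+D)" as its
# left child and the token "2" as its right child.
tree = ("A=B", [("(C+D)", [("1", []), ("1", [])]), ("2", [])])
sequence = preorder(tree)
```

Here `sequence` lists the root axiom first, then the second axiom with its two “1” tokens, and finally the token “2”.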

Following whalen2016holophrasm, we also make use of the graph structure of the parse tree. Each node in the input sequence is converted to a feature vector by a learnable embedding layer. Then the feature of this node is concatenated with another four-dimensional vector describing the depth of the node, the degree of the node, the degree of its parent, and its position among the children of its parent. The concatenated vector is fed into the GRU encoder of the relevance and substitution networks.
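The four structural features can be computed during the same pre-order traversal. This is a sketch under assumptions that the text does not fix: the root has depth 0 and parent degree 0, positions are zero-based, degree means the number of children, and the values are left as raw integers.

```python
def structural_features(node, depth=0, parent_degree=0, position=0):
    """Return (label, [depth, degree, parent degree, position among siblings])
    for every node of a (label, children) parse tree, in pre-order."""
    label, children = node
    rows = [(label, [depth, len(children), parent_degree, position])]
    for index, child in enumerate(children):
        rows.extend(structural_features(child, depth + 1, len(children), index))
    return rows

tree = ("A=B", [("(C+D)", [("1", []), ("1", [])]), ("2", [])])
features = structural_features(tree)
```

Each four-element list would then be concatenated with the learned embedding of the corresponding symbol before being fed to the encoder.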

Multiple expressions are represented by their concatenation.

D.1. Generator

Configuration of GRUs All of the GRUs in the generator have two layers and 128-dimensional hidden units.

Training relevance network of MetaGen-IL The relevance network of MetaGen-IL is updated to minimize the cross-entropy loss. Each training sample has one groundtruth proof tree and 10 negative candidates that are randomly sampled from compatible proof trees. It is trained for 60 epochs. The learning rate is set to initially and halved after 30, 40 and 50 epochs.

Training substitution network of MetaGen-IL The substitution network of MetaGen-IL is trained for 40 epochs. The learning rate is set to initially and halved after 20, 26 and 32 epochs.

Training of MetaGen-RL To train MetaGen-RL-LM, we learn the language model of human-written theorems by utilizing a one-layer GRU with 64-dimensional hidden units. It is trained for 200 epochs. The learning rate is set to initially and halved after 80, 120 and 160 epochs.

To train MetaGen-RL-Adv, we train a binary classifier using the same architecture as the payoff network of Holophrasm, which contains a two-layer GRU with 128-dimensional hidden units and two subsequent linear layers. It is pretrained to distinguish human-written theorems from 300K synthetic theorems generated by MetaGen-Rand. Then it is updated on-the-fly to distinguish human-written theorems from the synthetic theorems generated in the most recent 20 episodes.

For both MetaGen-RL-LM and MetaGen-RL-Adv, we train the generator for 700 episodes with the learning rate fixed to . We deploy 10 parallel threads to synthesize new theorems by utilizing the current generator. Each thread generates 50 theorems in one episode and synchronizes the set of existing proof trees with other threads every 20 episodes. We clip policy gradients whose norm is larger than 5.

D.2. Prover

Configuration of GRUs In the relevance and substitution network of the prover, all GRUs have two layers and 256-dimensional hidden units. We found 256-dimensional GRUs have slightly better performance than the 128-dimensional GRUs that are used by whalen2016holophrasm. The GRU in the payoff network of the prover has two layers and 128-dimensional hidden units.

Training of the prover All three networks of the prover are trained by imitation learning. The relevance network and the substitution network are trained on both human-written proofs and synthetic proofs. The payoff network is trained on human-written proofs only.

The relevance network of the prover is trained to minimize the cross-entropy loss. Each training sample contains one groundtruth background theorem and 10 negative candidates that are randomly sampled from all background theorems that can be applied in this step.
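The construction of one training sample and its loss can be sketched as follows. The helper names are hypothetical, the scores stand in for the outputs of the relevance network, and placing the groundtruth at index 0 is merely a convention of this sketch.

```python
import math
import random

def sample_candidates(groundtruth, applicable, num_negatives=10, rng=random):
    """Return the groundtruth followed by randomly sampled negative candidates."""
    pool = [theorem for theorem in applicable if theorem != groundtruth]
    negatives = rng.sample(pool, min(num_negatives, len(pool)))
    return [groundtruth] + negatives

def cross_entropy(scores, target_index=0):
    """Negative log-probability of the target under a softmax over the scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target_index]

applicable = ["t%d" % i for i in range(30)]
candidates = sample_candidates("t0", applicable, rng=random.Random(0))
uniform_loss = cross_entropy([0.0] * len(candidates))
```

With uninformative (all-equal) scores, the loss equals the logarithm of the number of candidates, here log 11.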

Table 5 presents the settings of learning rate schedules and the ratio of synthetic training samples per batch, for the training of the relevance and substitution network of the prover.

In all experiments, the payoff network is trained for 30 epochs. The learning rate is set to initially and halved after 15, 20 and 25 epochs.
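The step schedules used throughout this appendix share one form, sketched below. The initial learning rate is passed in as a parameter, and the value 1.0 in the example is a placeholder only. The sketch assumes that "halved after 15 epochs" means the reduced rate applies from zero-based epoch index 15 onward.

```python
def learning_rate(epoch, initial_lr, milestones, factor=0.5):
    """Learning rate at a zero-based epoch index: multiplied by `factor`
    once for every milestone that has been reached."""
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return initial_lr * factor ** drops

# Payoff network schedule: 30 epochs, halved after 15, 20 and 25 epochs.
schedule = [learning_rate(epoch, 1.0, [15, 20, 25]) for epoch in range(30)]
```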

Evaluation protocol Following the evaluation protocol used by whalen2016holophrasm, the prover attempts to prove each target theorem in the test set three times with the beam search width of the substitution network set to 1, 5, or 20. The prover stops if it has executed 10000 MCTS passes or hit the time limit of 5 minutes.

D.3. Baseline

Without human-written proofs, we compare our approach with a baseline that needs no training proofs. We remove the relevance network of the prover and pick a background theorem according to the tf-idf similarity between an expression and a background theorem, as proposed by bansal2019learning. We replace the substitution network of the prover with a language model trained on the statements of human-written theorems. We use this language model to generate an expression as the substitution of a target variable.
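A tf-idf ranking of background theorems can be sketched as below. The exact tf-idf weighting used by bansal2019learning is not reproduced here; the smoothed idf, the inclusion of the query expression in the document statistics, and the cosine similarity are all assumptions of this sketch, as are the function names and the toy token sequences.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each token list to a sparse tf-idf vector (token -> weight)."""
    n = len(docs)
    df = Counter(token for doc in docs for token in set(doc))
    idf = {token: math.log(n / df[token]) + 1.0 for token in df}  # assumed smoothing
    return [{token: count * idf[token] for token, count in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(token, 0.0) for token, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_background_theorems(expression, theorems):
    """Indices of the theorems sorted by decreasing similarity to the expression."""
    vectors = tfidf_vectors([expression] + theorems)
    query, rest = vectors[0], vectors[1:]
    return sorted(range(len(theorems)),
                  key=lambda i: cosine(query, rest[i]), reverse=True)

expression = "( 1 + 1 ) = 2".split()
theorems = ["( A + B ) = ( B + A )".split(), "A e. CC".split(), "( 1 + 1 ) = 2".split()]
ranking = rank_background_theorems(expression, theorems)
```

In the toy example, the theorem that matches the expression token for token is ranked first, and the one sharing no tokens with it is ranked last.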