Modelling High-Level Mathematical Reasoning in Mechanised Declarative Proofs

Mathematical proofs can be mechanised using proof assistants to eliminate gaps and errors. However, mechanisation still requires intensive labour. To promote automation, it is essential to capture high-level human mathematical reasoning, which we address as the problem of generating suitable propositions. We build a non-synthetic dataset from the largest repository of mechanised proofs and propose a task on causal reasoning, where a model is required to fill in a missing intermediate proposition given a causal context. Our experiments (using various neural sequence-to-sequence models) reveal that while the task is challenging, neural models can indeed capture non-trivial mathematical reasoning. We further propose a hierarchical transformer model that outperforms the transformer baseline.



1 Introduction

A mathematical proof can largely be viewed as a sequence of intermediate propositions. Consider the following prose proof:

Proof of irrationality of $\sqrt{2}$.

Assume $\sqrt{2}$ is rational. Then there exists a pair of coprime integers $a$ and $b$ such that $\sqrt{2} = a/b$, and it follows that $2 = a^2/b^2$ and then $2b^2 = a^2$. Hence $a^2$ is even, and therefore $a$ is even. Thus there exists $c$ such that $a = 2c$, which combined with $2b^2 = a^2$ yields $2c^2 = b^2$: hence $b$ is also even. So $a$ and $b$ are both even although they are coprime, contradiction. ∎

By intermediate propositions, we refer to each of the reasoning steps, such as ‘$\sqrt{2}$ is rational’ and ‘$2b^2 = a^2$’, that are connected causally by words such as ‘therefore’, ‘since’, and ‘yields’. Human reasoning skips the implicit gaps between these intermediate propositions because they are usually too primitive and obvious to spell out. However, those implicit derivations sometimes contain fatal errors. As a result, proof assistants, such as Coq Bertot and Castéran (2013) and Isabelle Paulson (1994), were developed to mechanically check proofs down to primitive inferences or axioms. Over the last decade, non-trivial mathematical theorems Gonthier et al. (2013); Gouëzel (2017); Hales et al. (2017), security protocols Gomes et al. (2017); Pîrlea and Sergey (2018), and industry-scale software systems Klein et al. (2009); Leroy (2009) have been mechanically checked, highlighting numerous issues and bugs. Despite those successes, mechanisation still requires substantial human effort: in the seL4 project Klein et al. (2009), 20 person-years were devoted to verifying an OS kernel written in 8700 lines of C. To boost automation in proof assistants, approaches have been proposed ranging from incorporating automated theorem provers Blanchette et al. (2011, 2016); Czajka and Kaliszyk (2018) to synthesising tactics Bansal et al. (2019); Paliwal et al. (2019); Sanchez-Stern et al. (2019); Yang and Deng (2019) using learning-based methods.

However, existing learning-based automation attempts Bansal et al. (2019); Yang and Deng (2019) are mostly machine-oriented: numerous primitive inferences or hard-coded tactics are explored to justify a human-provided conjecture and produce a machine-checkable but illegible proof. Those proofs are often called tactic proofs. In contrast, we approach automation in a human-oriented way: synthesising high-level intermediate propositions in a proof style called the declarative proof style. A comparison of the two proof styles is shown in Fig. 1, both proving the irrationality of $\sqrt{2}$. One can see that the declarative proof (left) resembles the human prose proof with clear high-level structures, whereas the tactic proof (right) is difficult to comprehend without domain knowledge of the tactics. We argue that a human-oriented approach is an essential step to further automation. When searching for a proof, machine-level exploration can easily get lost due to the astronomical number of primitive steps, while high-level reasoning that jumps between propositions largely reduces the effective search space. As will be illustrated in §8, existing tactic-synthesis frameworks are limited by the inability to generate intermediate propositions. The high-level reasoning approach is also complementary to existing tactic-based work.

Figure 1: Full declarative proof (left) and tactic proof (right) of the irrationality of $\sqrt{2}$ in Isabelle/HOL.

In this work, we have mined arguably the largest publicly hosted repository of mechanised (declarative-style) proofs: the Archive of Formal Proofs (AFP). The AFP is checked by the Isabelle proof assistant Paulson (1994) and contains 143K lemmas (increasing by about 10K annually) contributed by 347 authors. Combining the AFP with the Isabelle/HOL library yields a dataset of 204K formally proved lemmas, which is much larger than the number of lemmas used in previous tactic-oriented benchmarks (71K in CoqGym Yang and Deng (2019), 11K in HolStep Kaliszyk et al. (2017), 29K in HOList Bansal et al. (2019), 1.6K in GamePad Huang et al. (2019)). (Footnote: tactic-oriented tasks are usually more primitive, so each lemma can be split into more data points.) The dataset covers a broad spectrum of subjects, including foundational logic (e.g. Gödel’s incompleteness theorems Paulson (2013)), advanced analysis (the Prime Number Theorem Eberl and Paulson (2018)), computer algebra Thiemann and Yamada (2016), cryptographic frameworks Karayel and Gonzàlez (2020); Lochbihler (2017), and data structures Eberl (2018); Lammich and Nipkow (2019); Nordhoff et al. (2010).

The mined dataset allows us to propose a proposition synthesis task: IsarStep. (Footnote: the dataset will be publicly available at xxxx.) In this task (§3), a model is asked to propose meaningful steps in unseen proofs by reading derivations in existing proofs. For example, given a pair of propositions from a proof, we would like to synthesise a meaningful intermediate proposition that links them. To succeed in this task, the model is required to learn the meaning of important mathematical concepts (e.g. determinant in linear algebra, residue in complex analysis), how they are related to each other through theorems, and how they are utilised in proofs. In addition to the contribution to the theorem proving community on improving proof automation, we believe that our proposed task will also contribute to the machine learning community by providing a benchmark for testing and advancing machine learning algorithms on mathematical reasoning.

We frame the proposed task as a sequence-to-sequence (seq2seq) prediction problem. Beyond evaluating existing neural seq2seq baselines, namely the seq2seq model with attention Bahdanau et al. (2015) and the transformer Vaswani et al. (2017), we also propose a new architecture: the hierarchical transformer (§5). (Footnote: the code will be released at xxxx.) The architecture is motivated by the way humans reason about propositions; it consists of a set of local transformer layers, modelling the representation of each proposition, and a set of global layers, modelling the correlation across propositions. Experiments (§6) reveal that these neural models can solve 10–20% of problems on the test set on average, with the hierarchical transformer achieving the best result. Further analysis (§7) on the output of these models shows that while the proposition synthesis task is hard, the neural models can indeed capture mathematical reasoning. We find that the embeddings of closely related mathematical concepts are close in cosine space; models can reason about the relation between set, subset, and member, and can carry out more complicated multi-step reasoning that is difficult even for humans to follow.

2 Technical Background

Most proof assistants support tactics (e.g., clarsimp, elim and drule in Fig. 1, right) that transform proof states. A conjecture is considered mechanically proved if a proof state starting with the conjecture itself has been transformed, by a sequence of tactics, into a state with no subgoals left. Therefore, as has been shown, a tactic proof is a sequence of tactics optionally with some arguments (see Appendix A for additional details).

Declarative proofs emulate prose proofs by focusing on intermediate propositions. They differ from prose proofs in that all claims require explicit justifications. These justifications can be found automatically by Sledgehammer Blanchette et al. (2016), which internally invokes automatic theorem provers Barrett et al. (2011); De Moura and Bjørner (2008); Riazanov and Voronkov (2002). When automatic justifications fail, users can open another proof block (e.g., lines 8–12 in Fig. 1) and provide extra intermediate propositions.

3 The IsarStep Task

In order to imitate human mathematical reasoning, we focus on the synthesis of intermediate propositions in declarative proofs using machine learning, a different approach from prior works on tactic synthesis Bansal et al. (2019); Paliwal et al. (2019); Sanchez-Stern et al. (2019); Yang and Deng (2019). We mined a dataset of mechanised declarative proofs from the AFP and the Isabelle/HOL Paulson (1994) standard library and defined a proposition generation task: the IsarStep.

Figure 2: Causal environment around even a, where a filled arrow and a dotted arrow, respectively, refer to implications from a (local) intermediate proposition and a (global) named lemma.

Key to high-level mathematical reasoning is to bridge the gap in causal relations using intermediate propositions. For example, in the proof, the proposition even a is conjectured to fill in the gap between 2 * b² = a² on one side and a = 2 * c and False on the other (see Fig. 2, which was extracted from highlighted parts in Fig. 1, left). Deriving False also requires two other local propositions, coprime a b and even b, and justifying even a requires three library lemmas (dvd_triv_left, power2_eq_square and even_mult_iff), which were found by Sledgehammer. In this case, even a is the target proposition while the other surrounding propositions and lemmas form the context.

In our defined IsarStep task, each example is formed by five parts:

  F.1. a target proposition (e.g. even a),

  F.2. a set of used local propositions to derive F.1 (e.g. 2 * b² = a²),

  F.3. a set of local propositions derived from the target proposition F.1 (a = 2 * c and False),

  F.4. other local propositions and (global) lemmas used to justify F.3 (even b and coprime a b),

  F.5. a set of used (global) lemmas to justify F.1 (e.g. dvd_triv_left).

We want to synthesise F.1 given F.2–F.4, with F.5 optional: the named lemmas in F.5 are common knowledge and can be used as additional hints. The propositions are generated as a sequence of tokens, and therefore the search space is $\Sigma^*$, where $\Sigma$ is the token vocabulary: the model searches over 28K actions (§4.3, the vocabulary size for seq2seq models) at every timestep without a predefined maximum output length. IsarStep can be considered as single-step reasoning, which can be repeated to sketch more complex proofs. This task provides a vehicle for machine learning models to imitate causal reasoning in mathematics.
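As a rough illustration, one IsarStep example can be pictured as a record with five fields, one per part listed above. The sketch below is not the dataset's actual schema: the class name `IsarStepExample`, its field names, and the `source` helper are hypothetical, and the values are the even a example written as plain strings.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IsarStepExample:
    """Hypothetical container for one IsarStep example (names are illustrative)."""
    target: str                # F.1: the proposition to synthesise
    used_local: List[str]      # F.2: local propositions used to derive the target
    derived: List[str]         # F.3: local propositions derived from the target
    other_used: List[str]      # F.4: other propositions/lemmas used to justify F.3
    used_lemmas: List[str] = field(default_factory=list)  # F.5: optional hints

    def source(self, with_lemmas: bool = False) -> List[str]:
        """Return the context given to a model: F.2-F.4, plus F.5 on request."""
        parts = self.used_local + self.derived + self.other_used
        return parts + self.used_lemmas if with_lemmas else parts


example = IsarStepExample(
    target="even a",
    used_local=["2 * b² = a²"],
    derived=["a = 2 * c", "False"],
    other_used=["even b", "coprime a b"],
    used_lemmas=["dvd_triv_left", "power2_eq_square", "even_mult_iff"],
)
```

Calling `example.source()` yields the context without the optional lemma hints, while `example.source(with_lemmas=True)` adds them; the target itself never appears in the context.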

4 Dataset Preprocessing and Statistics

The mined raw dataset has long propositions and a large number of unique tokens. To alleviate the performance deterioration of machine learning models due to the aforementioned problems, we propose tricks to preprocess the raw dataset, including free variable normalisation and removing unnecessary parentheses, which substantially reduce the sequence lengths and vocabulary size.

4.1 The Logic and Tokens

The core logic of Isabelle/HOL is simply-typed $\lambda$-calculus with de Bruijn indices for bound variables (Wenzel, 2020, Chapter 2.2). A local proposition or a (global) lemma/theorem is essentially a term in the calculus. As types can be inferred automatically, we drop types in terms (to reduce the size of the vocabulary) and encode a term as a sequence of tokens that include lambda term constructors: CONST, FREE, VAR, BOUND, ABS (function abstraction), and $ (function application). Additionally, parentheses are used in the sequence to represent the tree structure. To give an example, we encode the proposition even a as the following sequence of tokens separated by white space:

    CONST HOL.Trueprop $ ( CONST Parity.semiring_parity_class.even $ FREE <X0> )

where CONST HOL.Trueprop is a boilerplate function that converts from type bool to prop; CONST Parity.semiring_parity_class.even is the even predicate; FREE <X0> encodes the Skolem constant a in even a. Since a is a user-introduced local constant that can be arbitrary, we normalised it to the algorithmically generated name <X0> in order to reduce the vocabulary size (see §4.2).

Overall, every local proposition and global lemma/theorem is encoded as a sequence of tokens, and can be mostly decoded to the original term with type inference.
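The encoding can be made concrete with a small serialiser. This is a minimal sketch covering only constants, free variables and application; the classes `Const`, `Free`, `App` and the function `encode` are hypothetical names, and the rule of parenthesising a compound argument is an assumption chosen so that the output matches the even a example above.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Const:
    name: str


@dataclass(frozen=True)
class Free:
    name: str


@dataclass(frozen=True)
class App:
    fun: "Term"
    arg: "Term"


Term = Union[Const, Free, App]


def encode(term: Term) -> str:
    """Serialise a term into a whitespace-separated token sequence."""
    if isinstance(term, Const):
        return f"CONST {term.name}"
    if isinstance(term, Free):
        return f"FREE {term.name}"
    # Application: parenthesise a compound argument so the tree can be recovered.
    arg = encode(term.arg)
    if isinstance(term.arg, App):
        arg = f"( {arg} )"
    return f"{encode(term.fun)} $ {arg}"


even_a = App(
    Const("HOL.Trueprop"),
    App(Const("Parity.semiring_parity_class.even"), Free("<X0>")),
)
```

Here `encode(even_a)` reproduces the token sequence shown for even a; bound variables and abstractions are omitted for brevity.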

4.2 Free Variable Normalisation

Due to Isabelle’s use of de Bruijn indices, bound variables have already been normalised: ∀x. P x is no different from ∀y. P y, as both x and y are encoded as BOUND 0. However, arbitrary variable names can be introduced by the command fix in declarative proofs or by unbound variables in lemma statements (e.g. False ⟹ P and False ⟹ Q are semantically equivalent but have different free variables). To reduce the vocabulary size here, we normalised these free variables like the bound ones. For example, False ⟹ P would be normalised to False ⟹ <V0> as P is the first free variable in the proposition. Such normalisation reduced the vocabulary size by one third. The normalisation preserves the semantics, and we can always parse a normalised term back under a proper context.
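The renaming step can be sketched on a token sequence as follows. This is an illustration under assumptions rather than the preprocessing code used for the dataset: the function name `normalise_free_variables` is hypothetical, and we assume that names following a FREE marker become <X0>, <X1>, … and names following a VAR marker become <V0>, <V1>, …, numbered by order of first occurrence.

```python
from typing import Dict, List, Optional, Tuple


def normalise_free_variables(tokens: List[str]) -> List[str]:
    """Rename the name after each FREE/VAR marker by order of first occurrence."""
    prefixes = {"FREE": "X", "VAR": "V"}  # assumed marker-to-prefix mapping
    renamed: Dict[Tuple[str, str], str] = {}
    counters = {"FREE": 0, "VAR": 0}
    out: List[str] = []
    kind: Optional[str] = None  # marker seen on the previous token, if any
    for tok in tokens:
        if kind is not None:
            key = (kind, tok)
            if key not in renamed:
                renamed[key] = f"<{prefixes[kind]}{counters[kind]}>"
                counters[kind] += 1
            out.append(renamed[key])
            kind = None
        else:
            out.append(tok)
            kind = tok if tok in prefixes else None
    return out


with_a = "CONST Parity.semiring_parity_class.even $ FREE a".split()
with_b = "CONST Parity.semiring_parity_class.even $ FREE b".split()
```

Both `with_a` and `with_b` normalise to the same sequence ending in `FREE <X0>`, which is why the vocabulary shrinks: user-chosen names no longer produce distinct tokens.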

4.3 Statistics

We have mined a total of 850K data points for IsarStep. We removed examples in which the concatenation of the source propositions (F.2–F.4 in §3) is longer than 800 and the target proposition (F.1 in §3) is longer than 200, which results in approximately 665K examples. From these examples we randomly sampled 8000 examples for validation and 8000 examples for testing. We then removed duplicates and examples whose target propositions exist in the training data. The final dataset split is 609K, 4143, and 4145 examples for the training, validation, and test sets, respectively. The vocabulary size is 28,370.

5 Model

We define $X = (x^1, x^2, \ldots, x^I)$ as the sequence of source propositions with $I$ propositions, and $y = (y_1, y_2, \ldots, y_N)$ as the target proposition containing $N$ tokens. Let $x^i = (x^i_1, x^i_2, \ldots, x^i_M)$ represent the $i$th proposition in the set, consisting of $M$ tokens. Each source proposition $x^i$ belongs to one of the categories F.2–F.4 defined in §3. We annotate the category corresponding to $x^i$ as $C_i$ and therefore the sequence of categories corresponding to $X$ is $C = (C_1, C_2, \ldots, C_I)$. The generation of a target proposition $y$ is determined by finding the proposition $\hat{y}$, where $p(\hat{y} \mid X, C)$ is optimal,

$$\hat{y} = \arg\max_{y} \, p(y \mid X, C).$$

We propose two approaches to parameterising the conditional probability $p(y \mid X, C)$, which differ in the way of modelling the sequence of source propositions. The first method simply appends a label to each source proposition indicating its category and then concatenates the source propositions using a special token <SEP>, treating the resulting long sequence as the input to a seq2seq model.
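The first parameterisation can be sketched in a few lines. The function name `build_source_sequence` and the label token format (e.g. `<F.2>`) are hypothetical; the text only states that a category label is appended to each proposition and that propositions are joined with <SEP>.

```python
from typing import List, Tuple

SEP = "<SEP>"


def build_source_sequence(propositions: List[Tuple[str, List[str]]]) -> List[str]:
    """Concatenate (category, tokens) pairs into one flat seq2seq input."""
    sequence: List[str] = []
    for index, (category, tokens) in enumerate(propositions):
        if index > 0:
            sequence.append(SEP)  # separator between consecutive propositions
        sequence.extend(tokens)
        sequence.append(f"<{category}>")  # hypothetical label token, appended last
    return sequence


flat = build_source_sequence([("F.2", ["A", "B"]), ("F.3", ["C"])])
```

The resulting flat list is what a standard seq2seq encoder would consume; the model has to recover proposition boundaries and categories from the special tokens alone.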

Our second approach models the encoding of source propositions hierarchically. As shown in Fig. 3, the encoder has two types of layers. The local layers build the proposition representations by modelling the correlations of tokens within each proposition; the global layers take the proposition representations as input and model the correlations across propositions. Both the local layers and global layers are transformer layers (Vaswani et al., 2017). Positional information is encoded separately for different source propositions. That is, suppose the $i$th source proposition $x^i$ has $M$ tokens; then the position of the first token in $x^{i+1}$ is not $M+1$ but 1. The embedding of a token is obtained by adding the token embedding, the positional information, and the embedding of the category that the proposition belongs to. The category embedding is learnt together with the rest of the network. We call this model the hierarchical transformer (HAT). Intuitively, HAT models the structure of the source propositions more explicitly than the first approach and therefore should be better at capturing mathematical reasoning between source and target propositions. We will validate this hypothesis in §6.
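Two details of the HAT input are easy to state in code: positions restart at 1 for every source proposition, and the encoder input is the sum of three embeddings. The sketch below uses plain Python lists instead of a deep learning framework; the function names `local_positions` and `input_embedding` are hypothetical and the vectors are toy values.

```python
from typing import List


def local_positions(propositions: List[List[str]]) -> List[List[int]]:
    """Position indices that restart at 1 for every source proposition."""
    return [list(range(1, len(tokens) + 1)) for tokens in propositions]


def input_embedding(
    token_vec: List[float], position_vec: List[float], category_vec: List[float]
) -> List[float]:
    """Element-wise sum of token, positional and category embeddings."""
    return [t + p + c for t, p, c in zip(token_vec, position_vec, category_vec)]


positions = local_positions([["a", "b", "c"], ["d", "e"]])
summed = input_embedding([1.0, 2.0], [0.5, 0.5], [1.0, 0.0])
```

With a flat concatenation the token "d" would sit at position 4; with per-proposition positions it sits at position 1, so tokens in the same slot of different propositions share positional information.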

Figure 3: Architecture of the encoder of the hierarchical transformer (HAT). There are two types of layers, the local layers model the correlation between tokens within a proposition, and the global layers model the correlation between propositions. The input to the network is the sum of the token embedding, the positional information, and the embedding of the corresponding category.

6 Experiments

We benchmark three models on IsarStep (§3), namely the seq2seq model with attention (RNNSearch) (Bahdanau et al., 2015; Wu et al., 2016), the transformer Vaswani et al. (2017), and the hierarchical transformer (HAT). The input to RNNSearch and the transformer is a concatenation of source propositions (the first parameterisation approach described in §5). We train these models with the same training data and report their performance on the test set.

6.1 Experimental Setup

For RNNSearch (Bahdanau et al., 2015; Wu et al., 2016), we use 2-layer LSTMs Hochreiter and Schmidhuber (1997) with 512 hidden units and a dropout rate of 0.2. The hyperparameters for training the transformer are the same as transformer base Vaswani et al. (2017), i.e. 512 hidden size, 2048 filter size, 8 attention heads, and 6 layers for both the encoder and decoder. The hyperparameters for HAT are the same, except that the number of local context layers is 4 and the number of global context layers is 2. We share the source and target token embeddings for all three models. We use beam search decoding with beam size 5 (for top-1 accuracies) and 10 (for top-10 accuracies). The configurations for the different models are the best ones we found based on validation performance. We train these models for 100K steps and pick the checkpoint with the best BLEU on the validation set to evaluate on the test set. Training the transformer and HAT takes 48 hours on 4 Tesla-V100 GPUs.

6.2 Results

We consider an output sequence as correct if it matches the target sequence exactly in surface form. The top-1 accuracy is then defined as the percentage of the best output sequences that are correct in the given dataset. The top-10 accuracy is defined as the percentage of target sequences appearing in the top 10 generated sequences. Table 1 presents the results of different models for the IsarStep task. We report mean accuracies over 3 runs (standard deviations in parentheses). Overall, the neural seq2seq models achieve around 10–20% top-1 accuracies and 22–34% top-10 accuracies, which indicates that this task is non-trivial and yet not too difficult for neural networks. Of the three models, the transformer Vaswani et al. (2017) outperforms RNNSearch Bahdanau et al. (2015); Wu et al. (2016) significantly and our HAT performs best. As mentioned in §3, adding the used lemmas (F.5) is optional and is conjectured to improve performance by exploiting them explicitly. We experimented with both cases and found that adding this extra information indeed leads to further improvement.
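The two metrics defined above amount to exact string matching against the first k beam candidates. A minimal sketch, assuming each test example comes with a ranked list of generated candidates; the function name `top_k_accuracy` and the toy data are ours.

```python
from typing import List, Sequence


def top_k_accuracy(targets: Sequence[str], candidates: Sequence[List[str]], k: int) -> float:
    """Percentage of targets that exactly match one of the first k candidates."""
    if len(targets) != len(candidates):
        raise ValueError("targets and candidates must have the same length")
    if not targets:
        return 0.0
    hits = sum(1 for target, beam in zip(targets, candidates) if target in beam[:k])
    return 100.0 * hits / len(targets)


toy_targets = ["even a", "even b"]
toy_beams = [["even a", "odd a"], ["odd b", "even b"]]
```

On this toy data the first example is solved by its best candidate while the second is only solved within the beam, so the top-1 score is lower than the top-10 score. Exact matching is strict: a semantically equivalent proposition with a different surface form counts as wrong.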

Model         Top-1 Acc.                   Top-10 Acc.
              Base          +F.5           Base          +F.5
RNNSearch     10.50 (0.14)  12.33 (0.25)   22.86 (0.61)  26.64 (0.29)
Transformer   17.30 (0.08)  17.63 (0.32)   28.33 (0.54)  28.63 (0.45)
HAT           20.10 (0.35)  21.10 (0.33)   33.03 (0.47)  34.53 (0.17)

Table 1: Test set accuracies on the IsarStep task. We report mean accuracies over 3 runs (standard deviations in parentheses).

Figure 4: Accuracy of different target sequence lengths.

To explore how well models perform on examples with various target sequence lengths, we categorise the examples in the IsarStep test set into 5 buckets based on their lengths and calculate the accuracy for each bucket. As shown in Fig. 4, the accuracies of all three models decrease as the target sequences get longer. HAT is superior to the transformer on sequences shorter than 120 but is on par with the transformer for longer ones.
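The bucketed analysis can be sketched as below. The bucket boundaries are not given in the text, so the equal width of 40 tokens (5 buckets covering the 200-token target limit of §4.3) is an assumption, as is the function name `accuracy_by_length`.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def accuracy_by_length(
    results: List[Tuple[int, bool]], width: int = 40, buckets: int = 5
) -> Dict[int, float]:
    """Map bucket index -> accuracy (%) for (target_length, is_correct) pairs."""
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for length, correct in results:
        # The last bucket absorbs every length beyond the covered range.
        bucket = min(length // width, buckets - 1)
        totals[bucket] += 1
        hits[bucket] += int(correct)
    return {b: 100.0 * hits[b] / totals[b] for b in sorted(totals)}


toy_results = [(10, True), (30, False), (50, True), (199, False), (200, False)]
per_bucket = accuracy_by_length(toy_results)
```

Only buckets that contain at least one example appear in the result, which avoids dividing by zero for empty buckets.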

We subsequently investigate the effect of incorporating the category information for source propositions into the models by removing the category embedding from the input to the HAT encoder (Fig. 3), i.e. we are now modelling $p(y \mid X)$ instead of $p(y \mid X, C)$, where $X$ denotes the source propositions, $C$ their categories, and $y$ the target proposition. We see a dramatic drop in accuracy: 14.6 versus 20.1 obtained by the HAT with category embedding included, indicating the importance of category information. This is in line with human proofs: without causal relations, gap-filling intermediate propositions serve no purpose.

7 Analysis

In this section, we present an in-depth analysis of what has been learnt by the neural network models. To summarise our findings: 1) the seq2seq models can learn the syntax of propositions correctly; 2) the learnt token embeddings are meaningful in that related mathematical concepts are close in cosine space; 3) manual inspection of the generated propositions reveals that models can learn non-trivial mathematical reasoning and even more complicated multi-step reasoning.

Token Embeddings

To investigate whether the seq2seq models have learnt mathematical reasoning, we checked whether the learnt token embeddings were meaningful. We first projected the learnt embeddings for all the tokens in the vocabulary into a three-dimensional space via principal component analysis, then chose random tokens and checked their 50 nearest neighbours by cosine distance. We found that the embeddings of related concepts in mathematics were close, indicating that the models have managed to learn the relations between mathematical concepts, the basic step towards reasoning mathematically. For example, in Fig. 5, the neighbours of ‘Borel measurable’ are mostly measure-theory related, including ‘almost everywhere’, ‘integrable’, and ‘null set’, while ‘arrow’ is close to ‘isomorphism’ (EpiMonoIso.category.iso), ‘identity’ (Category.partial_magma.ide), and ‘inverse arrow’ (EpiMonoIso.category.inv), which are concepts in category theory. Additionally, vector arithmetic also seems to connect related mathematical definitions: for example, the three closest tokens to ‘bounded’ + ‘closed’ are ‘bounded’, ‘closed’, and ‘compact’, where compactness can be alternatively defined as boundedness and closedness (on a Euclidean space).
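The neighbour lookup and the vector arithmetic described above can be sketched with plain Python. The two-dimensional vectors below are invented purely for illustration and are not the learnt embeddings; the function names are ours.

```python
import math
from typing import Dict, List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(query: List[float], embeddings: Dict[str, List[float]], k: int) -> List[str]:
    """Return the k tokens whose embeddings are most similar to the query."""
    ranked = sorted(
        embeddings,
        key=lambda tok: cosine_similarity(query, embeddings[tok]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-d embeddings, invented for illustration only.
toy = {
    "bounded": [1.0, 0.0],
    "closed": [0.0, 1.0],
    "compact": [1.0, 1.0],
    "arrow": [-1.0, -1.0],
}
query = [b + c for b, c in zip(toy["bounded"], toy["closed"])]
neighbours = nearest_neighbours(query, toy, 3)
```

In this toy space the sum of ‘bounded’ and ‘closed’ points in the same direction as ‘compact’, so ‘compact’ is among the three nearest tokens while the unrelated ‘arrow’ is not.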

Attention Visualisations

We next investigate how reasoning has been learnt by visualising attentions from the transformer model (Vaswani et al., 2017). We find that important and related tokens are likely to attend to each other. For example, Fig. 6 illustrates the visualisation of the last layer of the transformer encoder for the source propositions F.2: F.3: F.4: . The target proposition generated from the model is . The interpretation of those source propositions is that combining with (4) we would like to infer the intermediate step so that the goal can be proved. The transformer model gives the correct answer which implicitly applied the lemma


that relates and . On the last self-attention layer of the transformer encoder (Fig. 6), and attend to each other. Interestingly, the above reasoning seems robust. If we swap and in F.4 (i.e., the source is now F.2: F.3: F.4: ), the answer becomes . This makes sense since (2) no longer applies (even though and still attend to each other similarly as in Fig. 6) and can only be discharged by proving itself.

Multi-Step Reasoning

By further inspecting the generated propositions, we find that the model can implicitly invoke multiple theorems as humans normally do. While this property can be found in quite a few examples, here we show only one of them due to limited space. We refer the readers to the appendix for more examples. Given the source F.2: F.3: F.4: , , where , and refer to the dimensionality, the span, and the cardinality of a set of vectors, respectively, the model gives the correct answer . Here, is derived by only if the model has implicitly learnt the following theorem , while yields (in conjunction with ) only if the model has implicitly invoked the antisymmetry lemma .

Understandable Mistakes

Surprisingly, some of the incorrect answers given by the transformer model still make sense. For example, given the source F.2: F.3: F.4: , , , where is a special binary relation on finite sets. The target answer is where is a new fixed variable introduced by the existential proposition in F.2. The target is a valid answer because it can obviously be derived from F.2 and combined with propositions in F.4 to derive F.3: putting together and yields . The model gives an alternative derivation of F.2: , which can still be related to F.3 by a property of : . Applying this property to yields , which, in conjunction with , leads to . In this particular example, the model might just have chosen a less straightforward step to bridge the gap between F.2 and F.3.


Some of the incorrect answers appear related to the difficulty of modelling fixed and unbound variables (e.g.  and ), whose distributed representations vary from context to context (unlike global constants like ‘Borel measurable’). For example, consider the source F.2: F.3: F.4:, where the function takes a polynomial and a value , and returns the evaluation . To derive the equality between two functions in F.3, we naturally invoke functional extensionality and try to prove these two functions are equal on all inputs: , as suggested by the reference answer. Here, the reference answer introduces a new variable that does not appear in the source, which probably causes problems for our model, which instead gives a nonsensical answer: .

Figure 5: Nearest neighbours of the tokens ‘Borel measurable’ (left) and ‘arrow’ (right) in cosine space. The 512-dimensional embeddings are projected into 3-dimensional embeddings. Neighbours are found by picking the top 50 tokens whose embeddings are closest to the selected token.
Figure 6: Attention visualisation of the last layer of the transformer encoder for the source propositions 2: 3: 4: . The generated target proposition is .

8 Related Work

Our work and the recent abundant work in tactic synthesis on Coq Sanchez-Stern et al. (2019); Yang and Deng (2019), HOL Light Bansal et al. (2019); Paliwal et al. (2019), and HOL4 Gauthier et al. (2017) are complementary. A key limitation in existing work is that the tactic arguments can only be a previously proved lemma, a small integer (e.g., within ) or a sub-term from the proof state; an arbitrary proposition/term as an argument is beyond the power of current tactic synthesis frameworks. This limitation prohibits current frameworks from synthesising the tactic-based proof in Fig. 1 due to its introduction of arbitrary propositions (via the subgoal_tac tactic). The ability to synthesise intermediate propositions can well complement current tactic synthesis frameworks. In the meantime, tactic synthesis can be used to justify proposed intermediate propositions, which complements the Sledgehammer command we used in §2. Finally, our dataset is larger than previous ones in terms of the number of mechanised lemmas, but it is not yet an interactive environment like CoqGym Yang and Deng (2019) or HOList Bansal et al. (2019).

A classic machine learning task for theorem proving is premise selection: to select a few lemmas that are likely to be useful for the current proof from a large library of previously proved lemmas. Various methods have been used for this task, ranging from hand-crafted features and distance functions Hoder and Voronkov (2011); Meng and Paulson (2009), to classic machine learning methods like Naive Bayes Alama et al. (2014); Blanchette et al. (2016); Kühlwein et al. (2013), and recently to deep learning based approaches Irving et al. (2016); Loos et al. (2017). Premise selection is a classification problem, whereas our proposition generation task is a generation problem with a countably infinite search space.

Sequence generation tasks have been proposed for mathematical reasoning Lample and Charton (2020); Saxton et al. (2019). Our task differs from the previous ones in that it is non-synthetic, has a realistic vocabulary size (i.e., 28K vs. less than 100), and covers various topics in research-level mathematics and computer science that have no general algorithmic solutions.

Synthesising conjectures Johansson et al. (2011); Larson (2005); Yang et al. (2019) has been explored before, but the approaches are mostly not learning-based and are limited to specific domains (e.g. in inductive proofs): propositions are enumerated against some grammars or templates, and then checked by automatic theorem provers. Numerous heuristics have been invented to make the enumeration process tractable. Statistical conjecture synthesis based on substitution has been explored in Metamath Wang and Deng (2020) and earlier in Mizar Gauthier et al. (2016). In comparison, propositions in our work are synthesised purely from input tokens (without referring to semantic operations like substitution).

Seq2seq models have been used by Wang et al. (2020) to map informal LaTeX source to mathematical statements in the Mizar system.

9 Conclusion

We have defined a novel proposition generation task on a large-scale dataset mined from mechanised declarative proofs. We have benchmarked existing seq2seq models on this dataset and their performance shows their impressive reasoning capability despite the difficulty of the task. To further push the limit, we proposed HAT — a novel hierarchical adaptation of the transformer — that outperforms the transformer baseline by a large margin. The dataset is of significance both to the proof assistant community (for promoting automation) and to the machine learning community (for benchmarking the reasoning capability of new models).

10 Broader Impact

In the short term, proof assistants are mainly used to ensure rigorous reasoning and build trust: mathematicians (software engineers) can confidently build their proofs (systems) on top of formally verified ones without worrying about fatal errors. Our research will facilitate the laborious mechanisation process. In the long term, advances in effective reasoning within proof assistants will potentially lead to auto-programming and auto-mathematics. It may be a concern that machines will displace some engineers and mathematicians. However, we believe that such advances will mainly complement humans by allowing them to focus on the creative parts, leaving routine reasoning to be automated by machine learning algorithms.


  • J. Alama, T. Heskes, D. Kühlwein, E. Tsivtsivadze, and J. Urban (2014) Premise selection for mathematics by corpus analysis and kernel methods.

    Journal of Automated Reasoning

    52 (2), pp. 191–213.
    Cited by: §8.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In Proceedings of International Conference on Learning Representations (ICLR), Cited by: §1, §6.1, §6.2, §6.
  • K. Bansal, S. M. Loos, M. N. Rabe, C. Szegedy, and S. Wilcox (2019) HOList: An Environment for Machine Learning of Higher Order Logic Theorem Proving. In Proceedings of International Conference on Machine Learning (ICML), Cited by: §1, §1, §1, §3, §8.
  • C. Barrett, C. L. Conway, M. Deters, L. Hadarean, D. Jovanović, T. King, A. Reynolds, and C. Tinelli (2011) CVC4. In Proceedings of International Conference on Computer Aided Verification, Cited by: §2.
  • Y. Bertot and P. Castéran (2013) Interactive theorem proving and program development: Coq’Art: the calculus of inductive constructions. Springer Science & Business Media. Cited by: §1.
  • J. C. Blanchette, S. Böhme, and L. C. Paulson (2011) Extending Sledgehammer with SMT solvers. In Proceedings of International Conference on Automated Deduction, Cited by: §1.
  • J. C. Blanchette, C. Kaliszyk, L. C. Paulson, and J. Urban (2016) Hammering towards QED. Journal of Formalized Reasoning 9 (1), pp. 101–148. Cited by: §1, §2, §8.
  • Ł. Czajka and C. Kaliszyk (2018) Hammer for Coq: Automation for dependent type theory. Journal of Automated Reasoning 61 (1-4), pp. 423–453. Cited by: §1.
  • L. De Moura and N. Bjørner (2008) Z3: An efficient SMT solver. In Proceedings of International conference on Tools and Algorithms for the Construction and Analysis of Systems, Cited by: §2.
  • M. Eberl and L. C. Paulson (2018) The Prime Number Theorem. Archive of Formal Proofs. Formal proof development. ISSN 2150-914x. Cited by: §1.
  • M. Eberl (2018) Randomised Binary Search Trees. Archive of Formal Proofs. Formal proof development. ISSN 2150-914x. Cited by: §1.
  • T. Gauthier, C. Kaliszyk, and J. Urban (2016) Initial experiments with statistical conjecturing over large formal corpora. In Joint Proceedings of the FM4M, MathUI, and ThEdu Workshops, Doctoral Program, and Work in Progress at the Conference on Intelligent Computer Mathematics 2016 (CICM-WiP 2016), Cited by: §8.
  • T. Gauthier, C. Kaliszyk, and J. Urban (2017) TacticToe: Learning to Reason with HOL4 Tactics. In Proceedings of International Conference on Logic for Programming, Artificial Intelligence and Reasoning (LPAR), Cited by: §8.
  • V. B. Gomes, M. Kleppmann, D. P. Mulligan, and A. R. Beresford (2017) Verifying strong eventual consistency in distributed systems. Proceedings of the ACM on Programming Languages. Cited by: §1.
  • G. Gonthier, A. Asperti, J. Avigad, Y. Bertot, C. Cohen, F. Garillot, S. Le Roux, A. Mahboubi, R. O’Connor, S. O. Biha, et al. (2013) A machine-checked proof of the odd order theorem. In Proceedings of International Conference on Interactive Theorem Proving, Cited by: §1.
  • G. Gonthier and A. Mahboubi (2010) An introduction to small scale reflection in Coq. Journal of Formalized Reasoning 3 (2), pp. 95–152. Cited by: Appendix A.
  • M. J. Gordon, A. J. Milner, and C. P. Wadsworth (1979) Edinburgh LCF: A Mechanised Logic of Computation. Lecture Notes in Computer Science, Springer-Verlag Berlin Heidelberg. Cited by: Appendix A.
  • S. Gouëzel (2017) Subadditive cocycles and horofunctions. In Proceedings of the International Congress of Mathematicians (ICM), Cited by: §1.
  • T. Hales, M. Adams, G. Bauer, T. D. Dang, J. Harrison, H. Le Truong, C. Kaliszyk, V. Magron, S. McLaughlin, T. T. Nguyen, et al. (2017) A formal proof of the Kepler conjecture. In Forum of Mathematics, Pi, Vol. 5. Cited by: §1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §6.1.
  • K. Hoder and A. Voronkov (2011) Sine Qua Non for Large Theory Reasoning. In Proceedings of International Conference on Automated Deduction, Cited by: §8.
  • D. Huang, P. Dhariwal, D. Song, and I. Sutskever (2019) GamePad: A Learning Environment for Theorem Proving. In Proceedings of International Conference on Learning Representations (ICLR), Cited by: §1.
  • G. Irving, C. Szegedy, A. A. Alemi, N. Eén, F. Chollet, and J. Urban (2016) DeepMath - Deep Sequence Models for Premise Selection. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §8.
  • M. Johansson, L. Dixon, and A. Bundy (2011) Conjecture Synthesis for Inductive Theories. Journal of Automated Reasoning 47 (3), pp. 251–289. Cited by: §8.
  • C. Kaliszyk, F. Chollet, and C. Szegedy (2017) HolStep: A Machine Learning Dataset for Higher-order Logic Theorem Proving. In Proceedings of International Conference on Learning Representations (ICLR), Cited by: §1.
  • E. Karayel and E. Gonzàlez (2020) Strong Eventual Consistency of the Collaborative Editing Framework WOOT. Archive of Formal Proofs. Formal proof development. ISSN 2150-914x. Cited by: §1.
  • G. Klein, K. Elphinstone, G. Heiser, J. Andronick, D. Cock, P. Derrin, D. Elkaduwe, K. Engelhardt, R. Kolanski, M. Norrish, et al. (2009) seL4: Formal verification of an OS kernel. In Proceedings of the ACM SIGOPS symposium on Operating systems principles, Cited by: §1.
  • D. Kühlwein, J. C. Blanchette, C. Kaliszyk, and J. Urban (2013) MaSh: Machine Learning for Sledgehammer. In Proceedings of International Conference on Interactive Theorem Proving, Cited by: §8.
  • P. Lammich and T. Nipkow (2019) Priority Search Trees. Archive of Formal Proofs. Formal proof development. ISSN 2150-914x. Cited by: §1.
  • G. Lample and F. Charton (2020) Deep learning for symbolic mathematics. In Proceedings of International Conference on Learning Representations (ICLR), Cited by: §8.
  • C. E. Larson (2005) A survey of research in automated mathematical conjecture-making. DIMACS Series in Discrete Mathematics and Theoretical Computer Science 69, pp. 297. Cited by: §8.
  • X. Leroy (2009) Formal Verification of a Realistic Compiler. Communications of the ACM 52 (7), pp. 107–115. Cited by: §1.
  • A. Lochbihler (2017) CryptHOL. Archive of Formal Proofs. Formal proof development. ISSN 2150-914x. Cited by: §1.
  • S. M. Loos, G. Irving, C. Szegedy, and C. Kaliszyk (2017) Deep Network Guided Proof Search. In International Conference on Logic for Programming, Artificial Intelligence and Reasoning (LPAR), Cited by: §8.
  • J. Meng and L. C. Paulson (2009) Lightweight relevance filtering for machine-generated resolution problems. Journal of Applied Logic 7 (1), pp. 41–57. Cited by: §8.
  • B. Nordhoff, S. Körner, and P. Lammich (2010) Finger Trees. Archive of Formal Proofs. Formal proof development. ISSN 2150-914x. Cited by: §1.
  • A. Paliwal, S. Loos, M. Rabe, K. Bansal, and C. Szegedy (2019) Graph Representations for Higher-Order Logic and Theorem Proving. CoRR abs/1905.10006. Cited by: §1, §3, §8.
  • L. C. Paulson (2013) Gödel’s Incompleteness Theorems. Archive of Formal Proofs. Formal proof development. ISSN 2150-914x. Cited by: §1.
  • L. C. Paulson (1994) Isabelle: A generic theorem prover. Vol. 828, Springer Science & Business Media. Cited by: §1, §1, §3.
  • G. Pîrlea and I. Sergey (2018) Mechanising blockchain consensus. In Proceedings of ACM SIGPLAN International Conference on Certified Programs and Proofs, Cited by: §1.
  • A. Riazanov and A. Voronkov (2002) The design and implementation of VAMPIRE. AI communications 15 (2, 3), pp. 91–110. Cited by: §2.
  • P. Rudnicki (1992) An overview of the Mizar project. In Proceedings of the 1992 Workshop on Types for Proofs and Programs, pp. 311–330. Cited by: Appendix A.
  • A. Sanchez-Stern, Y. Alhessi, L. Saul, and S. Lerner (2019) Generating Correctness Proofs with Neural Networks. CoRR abs/1907.07794. Cited by: §1, §3, §8.
  • D. Saxton, E. Grefenstette, F. Hill, and P. Kohli (2019) Analysing mathematical reasoning abilities of neural models. In Proceedings of International Conference on Learning Representations (ICLR), Cited by: §8.
  • R. Thiemann and A. Yamada (2016) Polynomial Factorization. Archive of Formal Proofs. Formal proof development. ISSN 2150-914x. Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §5, §6.1, §6.2, §6, §7.
  • M. Wang and J. Deng (2020) Learning to Prove Theorems by Learning to Generate Theorems. CoRR abs/2002.07019. Cited by: §8.
  • Q. Wang, C. Brown, C. Kaliszyk, and J. Urban (2020) Exploration of neural machine translation in autoformalization of mathematics in Mizar. Proceedings of ACM SIGPLAN International Conference on Certified Programs and Proofs. Cited by: §8.
  • M. Wenzel (2020) The Isabelle/Isar Implementation. [Online; accessed 31-May-2020]. Cited by: §4.1.
  • M. M. Wenzel (2002) Isabelle/Isar—a versatile environment for human-readable formal proof documents. Ph.D. Thesis, Technische Universität München. Cited by: Appendix A.
  • F. Wiedijk (2001) Mizar light for HOL light. In International Conference on Theorem Proving in Higher Order Logics, pp. 378–393. Cited by: Appendix A.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. CoRR abs/1609.08144. Cited by: §6.1, §6.2, §6.
  • K. Yang and J. Deng (2019) Learning to Prove Theorems via Interacting with Proof Assistants. In Proceedings of International Conference on Machine Learning (ICML), Cited by: §1, §1, §1, §3, §8.
  • W. Yang, G. Fedyukovich, and A. Gupta (2019) Lemma Synthesis for Automating Induction over Algebraic Data Types. In Proceedings of International Conference on Principles and Practice of Constraint Programming, Cited by: §8.

Appendix A From Tactic Proofs to Declarative Ones

Figure 7: Proof state transition after applying tactics
Figure 8: A proved lemma in the library of Isabelle/HOL claiming that given a rational number $x$ there exists a coprime pair of natural numbers $m$ and $n$ such that $n \neq 0$ and $|x| = m/n$.
Figure 9: Tactic-style proof of the irrationality of $\sqrt{2}$ in HOL Light (left) and Coq (right)

In the early days of proof assistants Gordon et al. [1979], people were generally concerned with whether a theorem could be mechanically checked, which is essentially a process of transforming the proof state until all subgoals have been discharged. Tactics, due to their direct access to proof states and their programmable nature (i.e., primitive tactics can be easily composed into sophisticated ones), have become the building blocks of most proof assistants.

To illustrate the interaction between tactics and proof states, we briefly examine the tactic proof of the irrationality of $\sqrt{2}$ in Isabelle/HOL. Given the statement theorem "sqrt 2 ∉ ℚ", the initial proof state is the conjecture itself: sqrt 2 ∉ ℚ. We then deploy a proof-by-contradiction strategy and apply the clarsimp tactic (see Fig. 7), which transforms the state to sqrt 2 ∈ ℚ ⟹ False (i.e., assuming $\sqrt{2}$ is rational, we want to derive false). To utilise the assumption, we find a relevant lemma called Rats_abs_nat_div_natE in the library (see Fig. 8), and then update the state by applying the elim tactic with this lemma as its argument. The resulting state requires us to derive false with fixed coprime natural numbers m and n such that $n \neq 0$ and

$$|\sqrt{2}| = \frac{m}{n} \tag{3}$$

Subsequently, we square both sides of (3) by invoking another tactic drule with the argument arg_cong[where f="λx. x * x"], and obtain a new state. We repeat this process until we reach a state with no subgoal left, and the final mechanised proof is a sequence of tactics, possibly with arguments, as displayed in §2.
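The squaring step can be written out as a short derivation. This is only a sketch: it assumes that (3) reads $|\sqrt{2}| = m/n$ with $n \neq 0$, as the lemma in Fig. 8 suggests, and the simplifications after the first line are our own illustration rather than the exact states produced by Isabelle.

```latex
\begin{align*}
|\sqrt{2}| = \frac{m}{n}
  &\;\Longrightarrow\; |\sqrt{2}| \cdot |\sqrt{2}| = \frac{m}{n} \cdot \frac{m}{n}
  && \text{(apply } f = \lambda x.\, x \cdot x \text{ to both sides)} \\
  &\;\Longrightarrow\; 2 = \frac{m^2}{n^2}
  && \text{(since } |\sqrt{2}|^2 = 2\text{)} \\
  &\;\Longrightarrow\; 2 n^2 = m^2
  && \text{(multiply by } n^2 \neq 0\text{)}
\end{align*}
```

The last line is the familiar starting point for the parity argument, which contradicts the coprimality of m and n.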

The proof of the irrationality of $\sqrt{2}$ is also available in other tactic-based proof assistants including HOL Light and Coq (see Fig. 9). The proof scripts are similar to the tactic-based ones in Isabelle/HOL except that tactic combinators (which combine several tactics to produce a new one) are used more heavily.

In general, a tactic proof can be viewed as a sequence of commands issued to the proof assistant instructing how the proof state should be transformed. In this case, people are mainly interested in whether the theorem is checked, and the proof itself is left looking like incomprehensible machine code.
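The view of tactics as state transformers can be sketched in a few lines of Python. This is a toy model under our own assumptions, not how any proof assistant is implemented: a proof state is a list of subgoals written as plain strings, and the names split_conj, by_assumption and then are hypothetical.

```python
from typing import Callable, List

# A proof state is modelled as a list of open subgoals (plain strings here).
ProofState = List[str]
Tactic = Callable[[ProofState], ProofState]


def split_conj(state: ProofState) -> ProofState:
    """Replace a first subgoal of the form 'A & B' by the subgoals 'A' and 'B'."""
    first, rest = state[0], state[1:]
    left, right = first.split(" & ", 1)
    return [left, right] + rest


def by_assumption(known: List[str]) -> Tactic:
    """Build a tactic that discharges the first subgoal if it is a known fact."""
    def tactic(state: ProofState) -> ProofState:
        if state[0] not in known:
            raise ValueError(f"cannot discharge {state[0]!r}")
        return state[1:]
    return tactic


def then(*tactics: Tactic) -> Tactic:
    """Tactic combinator: run the given tactics one after another."""
    def combined(state: ProofState) -> ProofState:
        for t in tactics:
            state = t(state)
        return state
    return combined


facts = ["p", "q"]
prove = then(split_conj, by_assumption(facts), by_assumption(facts))
final_state = prove(["q & p"])
print(final_state)  # an empty list means that no subgoal is left
```

The resulting script records only which commands were issued; the intermediate states ['q', 'p'] and ['p'] are not visible in it, which is why such proofs are hard to read without replaying them.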

Pioneered by the Mizar system Rudnicki [1992], it has been realised that a human-oriented declarative proof can become an object of interest itself. The declarative feature (i.e. focusing on intermediate propositions) was then incorporated into other systems including HOL Light Wiedijk [2001], Isabelle Wenzel [2002], and Coq Gonthier and Mahboubi [2010]. Nowadays mechanised proofs are developed in a mixed manner: tactic-style parts are for finely controlling the proof state (like an assembly language), and declarative parts are for sketching high-level reasoning (like the Python language). Nevertheless, due to technical (e.g. automation) and cultural reasons, proof style varies from one system to another. Proofs in Isabelle and Mizar are arguably much more declarative than those in other systems; this is in contrast to HOL Light, where most proofs are of tactic style despite the availability of a declarative mode.
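The contrast between the two styles can be seen on a deliberately trivial statement. The sketch below uses Lean 4 rather than any of the systems discussed above, purely for illustration; the hypothesis names hp, hq, h1 and h2 are arbitrary.

```lean
-- Tactic style: each command transforms the proof state,
-- and the intermediate propositions stay implicit.
example (p q : Prop) (hp : p) (hq : q) : q ∧ p := by
  constructor   -- splits the goal into the two subgoals `q` and `p`
  · exact hq
  · exact hp

-- Declarative style: the intermediate propositions are stated explicitly.
example (p q : Prop) (hp : p) (hq : q) : q ∧ p :=
  have h1 : q := hq
  have h2 : p := hp
  show q ∧ p from ⟨h1, h2⟩
```

In the second proof the propositions q and p appear in the text itself, so a reader can follow the argument without replaying it in the proof assistant.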

Appendix B Examples of Correct Syntheses

In this section, we present some correctly synthesised propositions which will be labelled as 1 in each case.


Here, and are measure spaces. For a measure space , , , and are the three components of (i.e., ), where is the carrier set, is a collection of subsets on , and is a measure defined on . Eq. (5) $\Rightarrow$ (4): being a subalgebra of (i.e., (5)) implies , so that in (i.e., is countably additive on which is implied by being a measure space) can be substituted with which yields (4). Eq. (4) $\Rightarrow$ (6): deriving (6) requires unfolding the definition of measure spaces (i.e., (9)), which requires is a sigma algebra on , the measure is non-negative on , and is countably additive on . Two of the three requirements have already been satisfied by (7) and (8) respectively, while (4) entails the last one and eventually leads to (6).


Eq. (11) $\Rightarrow$ (10): a Cauchy sequence is naturally a bounded sequence (i.e., ). Eq. (10) $\Rightarrow$ (12 - 13) by unfolding the definition of bounded sequences (i.e., (14)).


Here, is the image of the path function on the interval ; and are, respectively, the roots of a polynomial and the roots (of ) within a set ; and are (bounded) boxes on a Euclidean space. Eq. (16 - 17) $\Rightarrow$ (15): is a root of (by (16)) that does not intersect with the path of (i.e., (17)). Eq. (15) $\Rightarrow$ (18): combining with (19), (18) is equivalent to , which follows from joining (20) with (15).


Eq. (22 - 23) $\Rightarrow$ (21): (23) implies

hence (21) by the definition of Max. Eq. (21) $\Rightarrow$ (24) by arithmetic and the positivity of the denominator (i.e., (25)).


Both (27 - 28) $\Rightarrow$ (26) and (26) $\Rightarrow$ (29) can be derived from limit arithmetic.


Eq. (31) $\Rightarrow$ (30) by arithmetic. Eq. (30) $\Rightarrow$ (32):


Here, refers to the th element in the list ; is the length function on a list; is a list where the element is concatenated to the front of the list . Eq. (36 - 38) $\Rightarrow$ (35) by instantiating the quantified variable in (36) to and combining with (37 - 38). Eq. (35) $\Rightarrow$ (39):

Appendix C Distribution of Sequence Lengths

Fig. 10 shows the distribution of the source and target sequence lengths. The source sequence is a concatenation of the source propositions.

Figure 10: Distribution of source and target sequence lengths for the IsarStep task.
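A length distribution of this kind could be computed roughly as follows. This is a minimal sketch on hypothetical toy data: the tokenisation, the bin width, and the choice not to count separator tokens between concatenated propositions are all our assumptions, and the helper names are illustrative.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical toy examples: each item pairs a list of tokenised source
# propositions with one tokenised target proposition.
examples: List[Tuple[List[List[str]], List[str]]] = [
    ([["x", "<", "y"], ["y", "<", "z"]], ["x", "<", "z"]),
    ([["a", "=", "b"]], ["b", "=", "a"]),
    ([["0", "<", "n"], ["n", "dvd", "m"], ["m", "~=", "0"]], ["n", "<=", "m"]),
]


def source_length(propositions: List[List[str]]) -> int:
    """Length of the source sequence, i.e. the concatenated source propositions."""
    return sum(len(p) for p in propositions)


def length_histogram(lengths: List[int], bin_width: int) -> Dict[int, int]:
    """Count how many sequences fall into each bin, keyed by the bin's lower edge."""
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(counts.items()))


source_lengths = [source_length(src) for src, _ in examples]
target_lengths = [len(tgt) for _, tgt in examples]

source_hist = length_histogram(source_lengths, bin_width=5)
target_hist = length_histogram(target_lengths, bin_width=5)
print(source_hist)
print(target_hist)
```

On real data the two histograms would be plotted side by side; only the counting step is shown here.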