1 Introduction
Language modeling using Transformers [Vaswani et al., 2017] has been hugely successful for applications like translation and text generation. Models like GPT-2 are able to generate impressive news articles and stories given just an abstract [Radford et al., 2018]. These models are usually first trained on a proxy task, such as predicting missing words in the case of BERT [Devlin et al., 2019], before being fine-tuned on more specific (downstream) tasks such as machine translation and question answering. The proxy tasks do not rely on labeled data, so the models can be trained on large corpora of unlabeled data. Even models trained on the proxy tasks alone have shown impressive language understanding [Brown et al., 2020].
Prior work in deep learning for mathematics has focused on learning directly on logical reasoning tasks. In this work, we apply the paradigms of language modeling to formal mathematics and define proxy tasks on unlabeled mathematical expressions that allow us to use much more data. We start with the HOList dataset [Bansal et al., 2019], which spans a wide range of mathematical topics, including topology, multivariate calculus, real and complex analysis, geometric algebra, and measure theory, formalized in the HOL Light proof assistant [Harrison, 1996]. We consider a standard skip-sequence task and a novel skip-tree task. Our skip-tree task is an instance of the skip-sequence task that respects the tree structure of expressions. We show that models trained on the skip-tree task significantly outperform those trained on the skip-sequence task when evaluated on various downstream tasks.
Reasoning can refer to a wide range of abilities, and thus we measure the mathematical reasoning abilities of language models on a variety of tasks, including mechanical derivations, such as type inference, and also creative tasks, such as predicting under which assumptions a statement is true. In contrast to works in language modeling, we do not fine-tune the models on the evaluation (downstream) tasks, as we want to study what reasoning capabilities can be acquired just through language modeling proxy tasks.
An advantage of formal language compared to natural language is that we can attempt to automatically evaluate statements. That is, even if the language models fail to predict the ground truth, the statements they predicted might still be true and useful. We evaluate these conjectures by attempting to prove them and checking whether they can be used in the context of other proofs.
Our contributions are twofold:

We introduce several evaluation tasks that test logical reasoning abilities.

We introduce a new skip-tree language modeling task that outperforms skip-sequence approaches in our evaluation on the logical reasoning tasks.
The remainder of this paper is structured as follows: First, we review related work on language modeling and deep learning for mathematics in Section 2. Then, in Section 3 we discuss the source corpus of formal mathematical statements from which we generate our training data. In Section 4, we present our novel language modeling task for formal languages, as well as several variations that we used in our ablation studies. We present the evaluation tasks in Section 5, present our experimental findings in Section 6, and conclude in Section 7.
2 Related work
Recently, we have seen a series of rapid improvements in language modeling. Many of the improvements result from better pretraining tasks [Devlin et al., 2019, Zhang et al., 2019, Song et al., 2019, Dong et al., 2019, Raffel et al., 2019, Conneau and Lample, 2019]. BERT [Devlin et al., 2019] is a pretraining task for Transformers [Vaswani et al., 2017], which masks out a certain fraction of the input tokens that the model has to predict. UniLM uses multiple pretraining tasks at once [Dong et al., 2019]. One of them is a sequence-to-sequence task, which is to predict the next sentence from the previous sentence. MASS considers a generalized sequence-to-sequence pretraining task [Song et al., 2019], which is to predict a masked out subsequence of the input. SpanBERT additionally considers a span boundary objective, which is to predict the masked out subsequence only from the tokens adjacent to the missing subsequence [Joshi et al., 2020]. However, both MASS and SpanBERT reveal the length of the sequence to predict as they replace it by a number of mask tokens equal to the length of the sequence.
T5 introduced a generalization of sequence-to-sequence pretraining tasks that is crucial to our work [Raffel et al., 2019]. They replace the subsequence to be predicted by a single token (not a number of mask tokens equal to the length of the subsequence, as in MASS). Further, T5 allows multiple subsequences to be masked out and predicted. Zhang et al. [2019] additionally exploit the sentence structure of natural language. They suggest the pretraining task Pegasus, which masks out entire sentences of a given text, and additionally masks out randomly selected tokens in the remaining text (or alternatively replaces them by other tokens). Similar to Pegasus’ exploitation of the sentence structure of natural language, our skip-tree task exploits the tree structure of formal expressions. Zhang et al. [2019] also suggest sampling the sentences to be masked with the help of ROUGE1-F1 [Lin, 2004].
We work with the HOList dataset by Bansal et al. [2019]. There are other datasets which might be suitable for our approach as well, including proofs extracted from HOL4 [Gauthier et al., 2017], and from Coq [Huang et al., 2019, Yang and Deng, 2019, Sanchez-Stern et al., 2019].
Lample and Charton [2020] use a Transformer model for symbolic integration. They train their model directly on the reasoning task they want to learn, and their approach requires that the inverse of the prediction task can be computed effectively with classical algorithms. In contrast, we train language models on a proxy task and evaluate them on several logical reasoning tasks that are substantially different from the training task. Also, our dataset spans a much wider range of mathematical theories. We imagine that the language modeling approach explored in this paper could be used as a pretraining task for symbolic integration.
Finkbeiner et al. [2020] explore the generalization properties of Transformers predicting the solutions to formulas in linear-time temporal logic. Unlike the language modeling approach followed in our work, their training regime requires a data generator that can solve formulas, which is currently not feasible for higher-order logic. Transformer models for program understanding have focused on providing inductive biases in the architecture [Shiv and Quirk, 2019, Hellendoorn et al., 2020], whereas this work suggests using a modified language modeling proxy task. Perhaps these approaches could be combined to improve performance on tree-structured data.
3 Dataset
We start from the HOList dataset introduced by Bansal et al. [2019]. The complete dataset includes 29,465 theorems and their proofs. Here we consider only the “core” and “complex” datasets, which comprise 18,943 theorems, 637 definitions, and 603,950 proof steps. These proof steps were extracted from the human proof logs. The theorems and proofs were written (by humans) using the HOL Light proof assistant, and span various areas of mathematics such as set theory, arithmetic, linear algebra, topology, and multivariate complex analysis. The proofs contain many intermediate goals, which are the result of applying “tactics” on previous proof goals. For example, one of the tactics is to rewrite the current proof goal with a set of equations selected from the theorem database.
From this dataset we extract all theorem statements as well as all intermediate proof goals. We use S-expressions to represent all statements. For example, (v bool x) represents a boolean variable named x, and (a (v (fun (bool) (bool)) f) (v bool x)) represents the application of f to x, where f is a function from bool to bool. The S-expression syntax is thus very verbose, which can cause some expressions to not fit into the size constraints of our Transformer model.
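To make the S-expression format concrete, the following minimal sketch parses such a string into nested Python lists and prints it back. The helper names `tokenize`, `parse`, and `to_string` are illustrative assumptions and are not taken from the HOList code base.

```python
def tokenize(text):
    """Split an S-expression string into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parse a token list into nested lists; atoms stay plain strings."""
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) < 2:
                raise ValueError("unbalanced ')'")
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(tok)
    if len(stack) != 1 or len(stack[0]) != 1:
        raise ValueError("expected exactly one balanced expression")
    return stack[0][0]


def to_string(tree):
    """Convert a parsed tree back into S-expression text."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join(to_string(child) for child in tree) + ")"


# The function application example from the text.
expr = "(a (v (fun (bool) (bool)) f) (v bool x))"
tree = parse(tokenize(expr))
```

Here `tree[0]` is the kind marker `a`, and `tree[1]` and `tree[2]` are the subtrees for the function `f` and the argument `x`.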
We use the same split into training/validation/testing data as defined in HOList. The split is defined on the theorems, and the entire proof of each theorem is assigned to the same split as the theorem. This avoids partially revealing the proofs of theorems in the validation and test sets during training. In total, we use the proofs of 11,655 theorems in the training split of the core and complex libraries. We derive all training data from the theorems and proofs in the training set, and use only the theorems (not the proofs) for the evaluation tasks. This addresses the possibility that some proof steps for training theorems and for validation theorems might be shared. In Figure 1 we depict our choice of training and evaluation data.
4 Skip-tree Training
In this section we define the skip-tree training task. We parse a given mathematical statement into a tree of subexpressions, and replace one of the subexpressions by a <PREDICT> token. The task is to predict the subexpression replaced by <PREDICT>. See Figure 2 for an example.
For training, the trees are converted back to a sequence of tokens; the target sequence is extended by a <START> token in the front and an <END> token in the back. We exclude training examples where the output sequence is longer than the length of the decoder (512 tokens), and we cut off input sequences when they exceed the length of the encoder (1024 tokens).
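The construction of a single training example can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tree is a nested list, `path` is a hypothetical way to address a subexpression by child indices, and counting <START> and <END> towards the decoder limit is an assumption.

```python
def flatten(tree):
    """Convert a nested-list tree back into a flat token sequence."""
    if isinstance(tree, str):
        return [tree]
    tokens = ["("]
    for child in tree:
        tokens.extend(flatten(child))
    tokens.append(")")
    return tokens


def get_at(tree, path):
    """Return the subtree addressed by a tuple of child indices."""
    for index in path:
        tree = tree[index]
    return tree


def replace_at(tree, path, replacement):
    """Return a copy of the tree with the subtree at `path` replaced."""
    if not path:
        return replacement
    head, rest = path[0], path[1:]
    return [replace_at(child, rest, replacement) if i == head else child
            for i, child in enumerate(tree)]


def make_example(tree, path, max_input=1024, max_output=512):
    """Build (input, target) token lists, or None if the target is too long."""
    target = ["<START>"] + flatten(get_at(tree, path)) + ["<END>"]
    if len(target) > max_output:
        return None  # example is excluded from training
    # Over-long inputs are cut off rather than dropped.
    source = flatten(replace_at(tree, path, "<PREDICT>"))[:max_input]
    return source, target


# The statement x = x; the path (1, 1, 1) addresses the type of "=".
tree = ["a",
        ["a", ["c", ["fun", ["A"], ["fun", ["A"], ["bool"]]], "="],
         ["v", "A", "x"]],
        ["v", "A", "x"]]
source, target = make_example(tree, (1, 1, 1))
```

In this example the input sequence contains <PREDICT> in place of the type of the equality operator, and the target is that type wrapped in <START> and <END>.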
Additional masked subexpressions.
In addition to the subexpression to be masked out by <PREDICT>, we select subexpressions to be masked out by a different mask token <MASK>. In contrast to the <PREDICT> token, we replace all occurrences of these subexpressions by the <MASK> token. Note that the subexpressions we want to replace by <MASK> tokens can overlap with each other or with the subexpression replaced by the <PREDICT> token. In this case, we give the highest preference to the <PREDICT> token, and then to the expressions to be replaced by <MASK> tokens in decreasing order of size.
The subexpressions masked by <MASK> do not have to be predicted. They are only hidden to make the task harder and to make the model tolerant to having partial information. A beneficial side effect of replacing some expressions by a <MASK> token is that the input sequences get substantially shorter and more mathematical expressions fit in the size constraints of the Transformer architecture.
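One way to realize this masking with the stated preference order is a top-down traversal, sketched below. This is one reading of the overlap rule, not the authors' code: a subtree that contains the <PREDICT> position is never masked, and larger expressions take precedence because the traversal replaces an outer match before looking at its children. The function name and the `path` addressing scheme are illustrative.

```python
def mask_tree(tree, predict_path, mask_exprs, path=()):
    """Replace one subtree by <PREDICT> and all occurrences of the
    expressions in `mask_exprs` by <MASK>."""
    if path == predict_path:
        return "<PREDICT>"
    # Ancestors of the <PREDICT> position must be kept, otherwise the
    # prediction target would disappear from the input.
    is_ancestor = predict_path[:len(path)] == path
    if not is_ancestor and tree in mask_exprs:
        return "<MASK>"
    if isinstance(tree, str):
        return tree
    return [mask_tree(child, predict_path, mask_exprs, path + (i,))
            for i, child in enumerate(tree)]


# The statement x = x as a nested list.
equality = ["c", ["fun", ["A"], ["fun", ["A"], ["bool"]]], "="]
tree = ["a", ["a", equality, ["v", "A", "x"]], ["v", "A", "x"]]

# Predict the type of "=" and mask every occurrence of the variable x.
masked = mask_tree(tree, (1, 1, 1), [["v", "A", "x"]])
```

Both occurrences of the variable are replaced, which also shortens the input sequence.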
Distributions of subexpressions.
Sampling subexpressions uniformly at random results in very short sequences to be predicted: since our trees are mostly ternary, two thirds of the subexpressions are leaves. Besides picking subexpressions uniformly at random, we thus also experiment with weighting the subexpressions by the number of tokens they contain, which results in a much more diverse set of expressions to be sampled. We refer to these variants as “uniform” and “weighted”.
Multiple samples per statement.
Since we started with a data source that is small compared to datasets in natural language modeling, we use each mathematical statement from the training set to generate multiple training examples. Our initial data consists of about 360K intermediate statements from the proofs of 10K statements in the training split of the core and complex library of the HOList corpus. To avoid duplicates, we sample the subexpressions that are replaced by a <PREDICT> token for each original formula without replacement.
4.1 Ablations
To verify the design choices of the skip-tree training task we generated multiple variants of the training task and trained a model on each of them.
No mask tokens.
To answer the question of whether it helps to mask out subexpressions besides the one to predict, we generated a dataset without <MASK> tokens, called “skip-tree (no <MASK>)”.
Fewer samples per statement.
Instead of sampling many training examples from each formula, we could train on fewer training examples for more epochs. We generated a smaller version of the skip-tree training data, which we call “skip-tree (small)”.
Skip-sequence.
MASS [Song et al., 2019], SpanBERT [Joshi et al., 2020], and T5 [Raffel et al., 2019] pretrain their sequence-to-sequence natural language models by predicting subsequences of the tokens. The skip-tree task is similar, but exploits our ability to parse the formulas as trees. To examine if this makes a difference, we consider a “skip-sequence” task that samples subsequences of the list of tokens instead of sampling subexpressions. We generated three datasets for the skip-sequence task, where we sample subsequences of different lengths (short/medium/long). For the task “skip-sequence (long)”, we pick two positions in the token sequence uniformly at random and select the sequence that is between them. For the tasks “skip-sequence (medium)” and “skip-sequence (short)”, we limit their distance to 100 and 50 tokens, respectively.
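A minimal sketch of the skip-sequence sampling is given below. How the distance limit is enforced is not specified above; here it is assumed that the start position is drawn first and the end position is then drawn within the allowed distance. The function name is illustrative.

```python
import random


def skip_sequence_example(tokens, rng, max_distance=None):
    """Replace a random contiguous, non-empty span of tokens by <PREDICT>."""
    n = len(tokens)
    if max_distance is None:
        # "long": two distinct boundary positions, uniformly at random.
        start, end = sorted(rng.sample(range(n + 1), 2))
    else:
        # "medium"/"short": the span is at most max_distance tokens long.
        start = rng.randrange(n)
        end = rng.randint(start + 1, min(n, start + max_distance))
    source = tokens[:start] + ["<PREDICT>"] + tokens[end:]
    target = ["<START>"] + tokens[start:end] + ["<END>"]
    return source, target


rng = random.Random(0)
tokens = "( a ( v A x ) y )".split()
source, target = skip_sequence_example(tokens, rng, max_distance=3)
```

Unlike in the skip-tree task, the selected span may start or end in the middle of a subexpression, so the target is in general not a well-formed expression with balanced parentheses.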
Dataset                 # examples  # tokens (input/output)  avg length (input/output)
Skip-tree (weighted)    25.8M       17.4B/1.6B               675/61
Skip-tree (uniform)     25.7M       18.8B/316M               732/12
Skip-tree (small)       5.2M        3.5B/521M                673/100
Skip-tree (no <MASK>)   25.8M       19.4B/1.6B               750/61
Skip-sequence (long)    19.2M       11.9B/2.8B               620/146
Skip-sequence (medium)  26.0M       19.4B/884M               744/34
Skip-sequence (short)   26.0M       19.6B/479M               752/18
Table 1: Overview of the data sets. The number of tokens in the training set is measured before padding.
5 Evaluation Tasks
In this section we suggest several logical reasoning tasks on which our language models can be evaluated. These tasks require different levels of logical reasoning, ranging from mostly mechanical application of typing rules to conjecturing under which assumptions a statement might hold.
We intentionally define them to be out-of-distribution compared to the training data. Not only do we generate the examples in a slightly different way, we also generate them from the validation set of the theorem database. That is, the model has never seen the source data, nor has it seen the proofs of these theorems. This makes the tasks more challenging, and also ensures that we force the models to go beyond memorization. To give the interested reader a better impression of the evaluation tasks, we provide a list of randomly selected examples in Appendix D.
Type Inference.
We generate type inference problems similar to how we generated the skip-tree training data, which we described in Section 4. However, we restrict the sampling of subexpressions to subtrees that represent types of variables or constants (i.e. not fragments of other types).
We generated two variants of the type inference task: In the task we call “Type Inference,” we replace only the selected type by the <PREDICT> token and do not mask out anything else. In the second variant, which we name “Hard Type Inference,” we additionally replace all other types by the <MASK> token. The two tasks loosely correspond to deriving the first and the last type during type inference.
For example, consider the expression x = x, which in the S-expression syntax is represented as follows:
(a (a (c (fun (A) (fun (A) (bool))) =) (v A x)) (v A x)) 
Each subexpression here is either a leaf or a triple. The first element of these triples indicates their kind: a indicates function applications, c indicates constants (i.e. symbols that have been defined in the formal system), v indicates a variable, and finally fun indicates a function type. The equality operator “=” is represented by (c (fun (A) (fun (A) (bool))) =), which indicates that it is a constant with a function type that takes two arguments of arbitrary type A and returns a bool. Since functions are typically curried in this representation, we have two function applications, both times with the variable x as the argument.
An example for the “Type Inference” evaluation task would be:
(a (a (c <PREDICT> =) (v A x)) (v A x)) 
The type of the equality operator is still uniquely defined, as we know what the equality is applied to (two arguments of type A) and because the top-level application always has to return a boolean value. In this example the type could have been computed by a classical type inference algorithm.
For the “Hard Type Inference” evaluation task, the input would look as follows:
(a (a (c <PREDICT> =) (v <MASK> x)) (v <MASK> x)) 
Now, the type inference task is highly ambiguous. In fact, in this case, variable x could have any type, and the equality operator would have to adapt to the type of its arguments accordingly. Further, note that the hard type inference task masks out many more subtrees compared to the training data.
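Both variants can be generated by one traversal, as the following sketch illustrates. It assumes, based on the description of the syntax above, that types appear as the second element of `v` and `c` triples; the function names and the `path` addressing are illustrative and not part of the original tooling.

```python
def type_inference_example(tree, predict_path, hard=False, path=()):
    """Replace the type at `predict_path` by <PREDICT>; if `hard` is set,
    additionally replace all other variable/constant types by <MASK>."""
    if isinstance(tree, str):
        return tree
    if len(tree) == 3 and tree[0] in ("v", "c"):
        kind, type_expr, name = tree
        if path + (1,) == predict_path:
            type_expr = "<PREDICT>"
        elif hard:
            type_expr = "<MASK>"
        return [kind, type_expr, name]
    return [type_inference_example(child, predict_path, hard, path + (i,))
            for i, child in enumerate(tree)]


def show(tree):
    """Print a nested-list tree in S-expression syntax."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join(show(child) for child in tree) + ")"


# The statement x = x; (1, 1, 1) addresses the type of "=".
tree = ["a",
        ["a", ["c", ["fun", ["A"], ["fun", ["A"], ["bool"]]], "="],
         ["v", "A", "x"]],
        ["v", "A", "x"]]
easy = show(type_inference_example(tree, (1, 1, 1)))
hard = show(type_inference_example(tree, (1, 1, 1), hard=True))
```

The two resulting strings correspond to the “Type Inference” and “Hard Type Inference” inputs shown above.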
Assumptions.
This evaluation task is to predict missing assumptions for theorems in the validation set. We extract these tasks by searching for “top-level implications” and replacing their left operand by the <PREDICT> token. We define an implication operator “⇒” in an expression to be a top-level implication if it is either the topmost operator of the expression, or occurs only under quantifiers, conjunctions, disjunctions, or on the right side of other top-level implications. This definition helps us to avoid picking assumptions in negated parts of formulas.
Note that we can have multiple top-level implications per validation theorem. Consider the abstracted example . In this case, , , and are all considered to be assumptions of top-level implications.
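The definition of top-level implications translates directly into a recursive search. The sketch below uses a simplified, hypothetical formula representation (tuples tagged with "imp", "and", "or", "forall", "exists", "not") instead of the S-expression encoding, and the example formulas are our own illustrations.

```python
def top_level_assumptions(formula):
    """Return the left operands of all top-level implications."""
    if isinstance(formula, str):
        return []
    op = formula[0]
    if op == "imp":
        _, left, right = formula
        # Only the right side may contain further top-level implications.
        return [left] + top_level_assumptions(right)
    if op in ("and", "or"):
        return (top_level_assumptions(formula[1])
                + top_level_assumptions(formula[2]))
    if op in ("forall", "exists"):
        return top_level_assumptions(formula[2])
    return []  # negation or any other operator: stop searching


# forall x. (p and q) ==> (r ==> s)
nested = ("forall", "x", ("imp", ("and", "p", "q"), ("imp", "r", "s")))
assumptions = top_level_assumptions(nested)
```

For this formula the search returns two assumptions, the conjunction of p and q, and r. An implication under a negation or on the left side of another implication is not visited.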
An example from the theorem database is , for which the task is to predict given <PREDICT> . (We omit the presentation of this example as an S-expression for the sake of readability.) At first, the expression to predict in this case may seem unique, but there are actually many ways to complete the task into a true statement; e.g. or . Still, most humans would likely guess as it is simple and general, and because occurs before in the alphabet. To make a correct prediction, our language models thus have to understand which statements are more general and also know about naming conventions.
Below we give some examples of this reasoning task that we selected for their simplicity. (For a representative selection, see Appendix D.) While it is often easy to “see” that a given solution to such a task is correct, it can be nontrivial to come up with a solution in the first place. We encourage the reader to make their own predictions before looking up the ground truth in Appendix C:
Equalities.
Similar to the task of predicting missing assumptions, we ask to predict one side of a top-level equality in this task. Again, we define top-level equalities to be any equality that occurs as the top-level operator of the formula or occurs inside quantifiers, conjunctions, disjunctions, or on the right side of implications. For example, from the theorem we extract two evaluation examples: and .
Again, we present some simple example tasks (in mathematical notation for the sake of readability) and provide the ground truth as well as the model predictions in Appendix C:
6 Results and Discussion
We trained a Transformer with the hyperparameters specified in the appendix on the skip-tree dataset and each of the ablations for 1M steps with a batch size of 256.
In language modeling for natural language one of the key metrics is how often the next token in the ground truth is correctly predicted. This is not an ideal measurement for formal mathematics as even a single incorrect token can invalidate the entire statement. Also, the S-expression representation is relatively lengthy and barely human-readable, so a token-level measurement does not allow us to compare our models to the natural language models in any case. Therefore, we focus on exact matches of the entire predicted statement.
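The difference between the two metrics can be illustrated with a small sketch. It compares whitespace-separated tokens of one prediction per task; this tokenization and the function names are simplifying assumptions.

```python
def exact_match_accuracy(predictions, targets):
    """Fraction of predictions that equal the ground truth token by token."""
    if len(predictions) != len(targets):
        raise ValueError("need one prediction per target")
    if not targets:
        return 0.0
    hits = sum(p.split() == t.split() for p, t in zip(predictions, targets))
    return hits / len(targets)


def token_accuracy(predictions, targets):
    """Fraction of ground-truth tokens predicted correctly, by position."""
    correct = total = 0
    for p, t in zip(predictions, targets):
        p_tokens, t_tokens = p.split(), t.split()
        total += len(t_tokens)
        correct += sum(a == b for a, b in zip(p_tokens, t_tokens))
    return correct / total if total else 0.0


targets = ["(fun (A) (bool))", "(bool)"]
predictions = ["(fun (A) (A))", "(bool)"]
exact = exact_match_accuracy(predictions, targets)
per_token = token_accuracy(predictions, targets)
```

The first prediction differs from the ground truth in a single token, which changes the predicted type; the token-level score is still 0.75, whereas the exact-match score drops to 0.5.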
Dataset                 Type Inference  Hard Type Inference  Assumptions  Equalities
Skip-tree (uniform)     96.21%          94.40%               40.85%       46.57%
Skip-tree (weighted)    96.23%          93.32%               40.86%       42.89%
Skip-tree (small)       95.89%          90.42%               39.23%       40.91%
Skip-tree (no <MASK>)   96.07%          32.50%               38.38%       41.60%
Skip-sequence (long)    9.44%           0.06%                0.53%        0.56%
Skip-sequence (medium)  48.94%          5.97%                3.32%        3.55%
Skip-sequence (short)   77.25%          3.21%                0.68%        2.06%
In Table 2 we present how well the Transformer model, trained on different datasets, can predict the ground truth sequences. We can observe that for type inference, i.e. the more mechanical reasoning tasks, the models achieve a high accuracy, even in the Hard Type Inference case where the expression was stripped of all types. We see that the skip-tree task and its ablations clearly dominate the skip-sequence language modeling task.
A closer inspection of the skip-sequence model shows that its single-token accuracy is almost as high as the single-token accuracy of the skip-tree model, and higher than the single-token accuracy of the skip-tree (uniform) model (all measured on the validation data of the different training tasks). However, its predictions rarely parse or typecheck. On manual inspection of the predictions, it seems that the skip-sequence models consistently add surplus tokens at the end, or stop expressions too early; they appear to be unable to correctly identify the end of the expression to predict. The problem may be amplified by the S-expression syntax, which requires counting parentheses to some extent.
6.1 Conjecturing
In the experiments above, we measured how often the models predicted the ground truth in the evaluation tasks. We now change our point of view, and examine whether the models can be used to generate new conjectures. We define conjectures as mathematical statements that differ from the ground truth and any expression the model has seen during training. Additionally, a meaningful conjecture should be syntactically correct, typecheck, be provable, and be useful in the context of other proofs.
Since the training data is derived exclusively from true statements (i.e. human proof steps), the language models are incentivized to complete partial statements in a way that makes them true. Presented with one of the evaluation tasks, to predict missing assumptions or to predict the missing side of an equation, the models may thus complete these statements in multiple ways that make them true. The predictions that do not match the ground truth may still be true and useful statements. In the following we describe experiments that help us estimate how often this is the case.
Freeform conjecturing.
In addition to the “assumptions” and the “equalities” evaluation tasks, we consider a third task for producing conjectures. In this task, which we call “free-form conjecturing”, we query the model with a single prompt: (<theorem> <PREDICT>). This helps us to analyze what the language models produce when given no context. The <theorem> tag indicates only that the statement should be a theorem, and not an intermediate proof step, which would start with the <goal> tag. For free-form conjecturing we want to produce a variety of different predictions, and thus use a beam search with a high beam width of 1024. We did not include the free-form conjecturing task in Table 2, as there is no ground truth to match against.
How often are predictions true and new?
For this measurement, we replace the <PREDICT> token with the predicted sequence and attempt to prove the resulting statement in the DeepHOL theorem prover [Bansal et al., 2019]. Note that this can only give us a lower bound to the number of true statements, because of the limitations of the prover: The version of the DeepHOL theorem prover used here can prove around 58% of the validation theorems. So we expect the estimates here to be considerably below the number of actually true statements.
In Table 3 we report two numbers for each evaluation task: The first number is the percentage of generated statements known to be provable, including exact matches, statements from the training set, and statements provable with DeepHOL. The second number is the percentage of generated statements that are provable and new, excluding exact matches with the ground truth and statements from the training set. The denominator for both numbers is the same: the set of all predictions from the beam searches in Table 2.
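The two percentages can be computed as in the following sketch. The data layout (a dictionary from task identifiers to beam-search outputs, and sets of training statements and of statements proved by the prover) is a hypothetical choice made for illustration.

```python
def conjecture_stats(predictions, ground_truth, training_set, proved):
    """Return (fraction known provable, fraction provable and new)."""
    total = provable = new = 0
    for task_id, beam in predictions.items():
        for statement in beam:
            total += 1
            # Exact matches and training statements are true but not new.
            known = (statement == ground_truth[task_id]
                     or statement in training_set)
            if known or statement in proved:
                provable += 1
            if statement in proved and not known:
                new += 1
    if total == 0:
        return 0.0, 0.0
    return provable / total, new / total


# Toy data: statements are abbreviated by single letters.
predictions = {"t1": ["A", "B", "C"], "t2": ["D", "E"]}
ground_truth = {"t1": "A", "t2": "X"}
training_set = {"B"}
proved = {"A", "C", "D"}
stats = conjecture_stats(predictions, ground_truth, training_set, proved)
```

In the toy data, four of the five predictions are known to be provable, but only C and D count as new, since A is an exact match and B occurs in the training set.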
We believe that these measurements show a significant bias towards true statements. While in some tasks, less than half of the statements were provable, there are simply many more ways to write a false statement than a true statement.
Dataset               Assumptions    Equalities     Free-form Conjecturing
Skip-tree (weighted)  32.41%/26.91%  17.96%/11.63%  97.75%/0.59%
Skip-tree (uniform)   32.19%/26.20%  19.61%/12.28%  82.32%/12.70%
Are the conjectures useful?
For some evaluation tasks, the models could “cheat” on the truth metric by making the statements trivially true. For example, the models can predict False as an assumption, or complete the missing part of an equation by making it an identity (e.g. complete x = <PREDICT> by predicting x). In fact, manual inspection revealed several such cases.
To make this measurable, we added the provable statements to the theorem database, and ran the reinforcement learning experiments of the DeepHOL theorem prover [Bansal et al., 2019] to measure how many of the statements were used as premises. In this experiment we also make sure that the new theorems cannot be used in the proofs of their premises. In a “pruning” step DeepHOL minimizes proofs by removing each individual premise in a proof and checking if the proof still holds. Only the premises that survive this step are classified as useful. While this measurement is a relatively low bar, it filters out statements that have no effect in any proof.
We ran three reinforcement learning experiments, one for each of the evaluation tasks. We then measured how many of the theorems generated by each task are used as a premise in one of the over 200,000 proofs found for each of the experiments. For the assumptions task, 3445 of the 3857 theorems were used at least once. For the equalities task and the free-form conjectures it was 979 out of 3440 and 49 out of 130, respectively. We provide usage histograms in Appendix B.
While some of the most frequently used conjectures turned out to be alpha-equivalent variations of existing theorems in the theorem database, we found some interesting examples among the most used conjectures:

Assumptions task, 1728 usages: . Humans have used this theorem over vector arithmetic in many proofs. However, this theorem has always been defined as a local lemma and thus did not make it into the theorem database. This conjecture apparently filled a gap in the theorem database.
Free-form conjecturing task, 15 usages: . In contrast to the previous example, there are no occurrences of this statement (or an equivalent statement) in the theorem database or any human proof, not even as a local lemma.
These results suggest that language models show a limited ability to produce new, useful conjectures, even without fine-tuning or specialized training. However, overall, the new conjectures appear to be mostly variations of existing theorems. This falls in line with current expectations of Transformer models. To effectively produce conjectures that are useful in a specific context, we expect that a more targeted training approach is needed.
7 Conclusion
In this work, we applied the paradigms of language modeling to formal mathematics. We introduced a novel self-supervised training task for formal mathematics that clearly outperforms language modeling tasks used for natural language. We suggested several evaluation tasks and metrics for measuring mathematical reasoning capabilities of language models for formal mathematics. Our experiments demonstrate that language models are already surprisingly capable at a variety of reasoning tasks that they were not trained for directly. We also explored the ability of language models to produce new conjectures by measuring how many of the new predictions are provable and useful for other proofs.
References
 Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 49 December 2017, Long Beach, CA, USA, pages 5998–6008, 2017. URL http://papers.nips.cc/paper/7181attentionisallyouneed.
 Radford et al. [2018] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. In OpenAI Blog, 2018. URL https://d4mucfpksywv.cloudfront.net/betterlanguagemodels/language_models_are_unsupervised_multitask_learners.pdf.
 Devlin et al. [2019] Jacob Devlin, MingWei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pretraining of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
 Brown et al. [2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel HerbertVoss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are fewshot learners, 2020. URL https://arxiv.org/abs/2005.14165.

Bansal et al. [2019]
Kshitij Bansal, Sarah M Loos, Markus N Rabe, Christian Szegedy, and Stewart
Wilcox.
HOList: An environment for machine learning of higherorder theorem proving.
In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 915 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 454–463. PMLR, 2019. URL http://proceedings.mlr.press/v97/bansal19a/bansal19a.pdf.  Harrison [1996] John Harrison. HOL Light: A tutorial introduction. In Mandayam K. Srivas and Albert John Camilleri, editors, Formal Methods in ComputerAided Design, First International Conference, FMCAD ’96, Palo Alto, California, USA, November 68, 1996, Proceedings, volume 1166 of Lecture Notes in Computer Science, pages 265–269. Springer, 1996. URL https://doi.org/10.1007/BFb0031795.
 Zhang et al. [2019] Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. PEGASUS: pretraining with extracted gapsentences for abstractive summarization. CoRR, abs/1912.08777, 2019. URL http://arxiv.org/abs/1912.08777.
 Song et al. [2019] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and TieYan Liu. MASS: masked sequence to sequence pretraining for language generation. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 915 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 5926–5936. PMLR, 2019. URL http://proceedings.mlr.press/v97/song19d.html.
 Dong et al. [2019] Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and HsiaoWuen Hon. Unified language model pretraining for natural language understanding and generation. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’AlchéBuc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 814 December 2019, Vancouver, BC, Canada, pages 13042–13054, 2019. URL http://papers.nips.cc/paper/9464unifiedlanguagemodelpretrainingfornaturallanguageunderstandingandgeneration.
 Raffel et al. [2019] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683, 2019. URL http://arxiv.org/abs/1910.10683.
 Conneau and Lample [2019] Alexis Conneau and Guillaume Lample. Cross-lingual language model pretraining. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8–14 December 2019, Vancouver, BC, Canada, pages 7057–7067, 2019. URL http://papers.nips.cc/paper/8928crosslinguallanguagemodelpretraining.
 Joshi et al. [2020] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77, 2020. doi: 10.1162/tacl_a_00300. URL https://doi.org/10.1162/tacl_a_00300.
 Lin [2004] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W041013.

 Gauthier et al. [2017] Thibault Gauthier, Cezary Kaliszyk, and Josef Urban. TacticToe: Learning to reason with HOL4 tactics. In Thomas Eiter and David Sands, editors, LPAR-21, 21st International Conference on Logic for Programming, Artificial Intelligence and Reasoning, Maun, Botswana, May 7–12, 2017, volume 46 of EPiC Series in Computing, pages 125–143. EasyChair, 2017. URL https://easychair.org/publications/volume/LPAR21.
 Huang et al. [2019] Daniel Huang, Prafulla Dhariwal, Dawn Song, and Ilya Sutskever. GamePad: A learning environment for theorem proving. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=r1xwKoR9Y7.
 Yang and Deng [2019] Kaiyu Yang and Jia Deng. Learning to prove theorems via interacting with proof assistants. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9–15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 6984–6994. PMLR, 2019. URL http://proceedings.mlr.press/v97/yang19a/yang19a.pdf.
 Sanchez-Stern et al. [2019] Alex Sanchez-Stern, Yousef Alhessi, Lawrence Saul, and Sorin Lerner. Generating correctness proofs with neural networks. CoRR, abs/1907.07794, 2019. URL http://arxiv.org/abs/1907.07794.
 Lample and Charton [2020] Guillaume Lample and François Charton. Deep learning for symbolic mathematics. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=Ske31kBtPr.
 Finkbeiner et al. [2020] Bernd Finkbeiner, Christopher Hahn, Markus N. Rabe, and Frederik Schmitt. Teaching temporal logics to neural networks. CoRR, abs/2003.04218, 2020. URL https://arxiv.org/abs/2003.04218.
 Shiv and Quirk [2019] Vighnesh Leonardo Shiv and Chris Quirk. Novel positional encodings to enable tree-based transformers. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8–14 December 2019, Vancouver, BC, Canada, pages 12058–12068, 2019. URL http://papers.nips.cc/paper/9376novelpositionalencodingstoenabletreebasedtransformers.
 Hellendoorn et al. [2020] Vincent J. Hellendoorn, Charles Sutton, Rishabh Singh, Petros Maniatis, and David Bieber. Global relational models of source code. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=B1lnbRNtwr.
Appendix A Hyperparameters
We trained the Transformers with these hyperparameters:

vocabulary size: 1200

embedding size: 128

attention dropout: 0.1

nonlinearity: gelu

hidden layer dropout: 0.1

hidden layer size: 512

initializer range: 0.02

intermediate size: 768

number of attention heads: 8

number of hidden layers in encoder: 2

number of hidden layers in decoder: 4
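For readers who want to reuse these settings, the list above can be collected into a single configuration object. The following is only a sketch: the class name, the field names, and the `head_size` helper are illustrative assumptions rather than names from the original code base, and the even split of the hidden size across attention heads is the usual Transformer convention, not something stated in this appendix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerConfig:
    """Values copied from Appendix A; all field names are hypothetical."""

    vocab_size: int = 1200
    embedding_size: int = 128
    attention_dropout: float = 0.1
    nonlinearity: str = "gelu"
    hidden_dropout: float = 0.1
    hidden_size: int = 512
    initializer_range: float = 0.02
    intermediate_size: int = 768
    num_attention_heads: int = 8
    num_encoder_layers: int = 2
    num_decoder_layers: int = 4

    def head_size(self) -> int:
        """Width of one attention head, assuming an even split (an assumption)."""
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden size must be divisible by the number of heads")
        return self.hidden_size // self.num_attention_heads


# Default instance carrying exactly the values listed in the appendix.
config = TransformerConfig()
```

Freezing the dataclass keeps the reported values from being mutated by accident during an experiment.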
Appendix B Usage Statistics of Conjectures
Appendix C A Close Look at Simple Example Tasks
Assumptions.
In Section 5 we presented the following three examples of the task to predict missing assumptions. For the sake of readability, we discuss only the pretty-printed versions here. For examples in s-expression syntax, see Appendix D.
The ground truth answers are as follows:



, note that would be a more general assumption.
For the first and the third task, the language model “skiptree (weighted)” makes a correct prediction in the top 3 candidates in a beam search of width 8. For the second task, the language model mostly produces incorrectly typed expressions: it appears to think that is a set of the same type as .
Equalities.
We presented these examples for the equality evaluation task:
The ground truth for the tasks is:
Examples two and three are predicted correctly in a beam search with beam width 8. For the first example, the model almost gets it correct in two of the 8 attempts: , and . We find it surprising that the model apparently understands that there are two cases to consider, but that the exact combination of constants (1 and 0) is a challenge.
Appendix D Randomly Selected Example Tasks
In the following, we provide a list of 5 examples for each of the evaluation tasks, sampled uniformly at random.
Type Inference.

(<theorem> (a (c <PREDICT> !) (l (v (fun (cart (real) ?1) (bool)) t) (a (c (fun (fun (fun (cart (real) ?1) (bool)) (bool)) (bool)) !) (l (v (fun (cart (real) ?1) (bool)) u) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (c (fun (fun (cart (real) ?1) (bool)) (bool)) !) (l (v (cart (real) ?1) b) (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (c (fun (fun (cart (real) ?1) (bool)) (bool)) ?) (l (v (cart (real) ?1) w) (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (a (c (fun (cart (real) ?1) (fun (fun (cart (real) ?1) (bool)) (bool))) IN) (v (cart (real) ?1) w)) (v (fun (cart (real) ?1) (bool)) t))) (a (a (c (fun (cart (real) ?1) (fun (fun (cart (real) ?1) (bool)) (bool))) IN) (v (cart (real) ?1) w)) (a (c (fun (prod (cart (real) ?1) (real)) (fun (cart (real) ?1) (bool))) ball) (a (a (c (fun (cart (real) ?1) (fun (real) (prod (cart (real) ?1) (real)))) ,) (v (cart (real) ?1) b)) (a (c (fun (num) (real)) real_of_num) (a (c (fun (num) (num)) NUMERAL) (a (c (fun (num) (num)) BIT1) (c (num) _0))))))))))) (a (c (fun (fun (cart (real) ?1) (bool)) (bool)) ?) (l (v (cart (real) ?1) w) (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (a (c (fun (cart (real) ?1) (fun (fun (cart (real) ?1) (bool)) (bool))) IN) (v (cart (real) ?1) w)) (v (fun (cart (real) ?1) (bool)) u))) (a (a (c (fun (cart (real) ?1) (fun (fun (cart (real) ?1) (bool)) (bool))) IN) (v (cart (real) ?1) w)) (a (c (fun (prod (cart (real) ?1) (real)) (fun (cart (real) ?1) (bool))) ball) (a (a (c (fun (cart (real) ?1) (fun (real) (prod (cart (real) ?1) (real)))) ,) (v (cart (real) ?1) b)) (a (c (fun (num) (real)) real_of_num) (a (c (fun (num) (num)) NUMERAL) (a (c (fun (num) (num)) BIT1) (c (num) _0)))))))))))))) (a (c (fun (fun ?0 (bool)) (bool)) !) 
(l (v ?0 x) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (a (c (fun ?0 (fun (fun ?0 (bool)) (bool))) IN) (v ?0 x)) (v (fun ?0 (bool)) d))) (a (c (fun (bool) (bool)) ) (a (a (c (fun (cart (real) ?1) (fun (fun (cart (real) ?1) (bool)) (bool))) IN) (a (v (fun ?0 (cart (real) ?1)) g) (v ?0 x))) (a (a (c (fun (fun (cart (real) ?1) (bool)) (fun (fun (cart (real) ?1) (bool)) (fun (cart (real) ?1) (bool)))) UNION) (v (fun (cart (real) ?1) (bool)) t)) (v (fun (cart (real) ?1) (bool)) u))))))))) (a (c (fun (bool) (bool)) ) (a (c (fun (fun (cart (real) ?1) (bool)) (bool)) ?) (l (v (cart (real) ?1) b) (a (a (c (fun (fun (cart (real) ?1) (bool)) (fun (fun (cart (real) ?1) (bool)) (bool))) SUBSET) (a (c (fun (prod (cart (real) ?1) (real)) (fun (cart (real) ?1) (bool))) ball) (a (a (c (fun (cart (real) ?1) (fun (real) (prod (cart (real) ?1) (real)))) ,) (v (cart (real) ?1) b)) (a (c (fun (num) (real)) real_of_num) (a (c (fun (num) (num)) NUMERAL) (a (c (fun (num) (num)) BIT1) (c (num) _0))))))) (a (a (c (fun (fun ?0 (cart (real) ?1)) (fun (fun ?0 (bool)) (fun (cart (real) ?1) (bool)))) IMAGE) (v (fun ?0 (cart (real) ?1)) g)) (v (fun ?0 (bool)) d))))))))))))
Ground truth: <START> (fun (fun (fun (cart (real) ?1) (bool)) (bool)) (bool)) <END>

(<theorem> (a (c <PREDICT> !) (l (v (fun (cart (real) N) (bool)) s) (a (a (c (fun (bool) (fun (bool) (bool))) =) (a (c (fun (fun (cart (real) N) (bool)) (bool)) is_interval) (a (a (c (fun (fun (cart (real) N) (cart (real) N)) (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool)))) IMAGE) (c (fun (cart (real) N) (cart (real) N)) vector_neg)) (v (fun (cart (real) N) (bool)) s)))) (a (c (fun (fun (cart (real) N) (bool)) (bool)) is_interval) (v (fun (cart (real) N) (bool)) s))))))
Ground truth: <START> (fun (fun (fun (cart (real) N) (bool)) (bool)) (bool)) <END>

(<theorem> (a (c (fun (fun (real) (bool)) (bool)) !) (l (v (real) x) (a (a (a (c (fun (fun (real) (real)) (fun (real) (fun (net (real)) (bool)))) has_real_derivative) (c (fun (real) (real)) atn)) (a (c (fun (real) (real)) real_inv) (a (a (c (fun (real) (fun (real) (real))) real_add) (a (c (fun (num) (real)) real_of_num) (a (c (fun (num) (num)) NUMERAL) (a (c (fun (num) (num)) BIT1) (c (num) _0))))) (a (a (c (fun (real) (fun (num) (real))) real_pow) (v <PREDICT> x)) (a (c (fun (num) (num)) NUMERAL) (a (c (fun (num) (num)) BIT0) (a (c (fun (num) (num)) BIT1) (c (num) _0)))))))) (a (c (fun (real) (net (real))) atreal) (v (real) x))))))
Ground truth: <START> (real) <END>

(<theorem> (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (bool))) =) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (fun ?0 (bool)))) INTER) (v (fun ?0 (bool)) s)) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (fun ?0 (bool)))) UNION) (v (fun ?0 (bool)) t)) (v (fun ?0 (bool)) u)))) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (fun ?0 (bool)))) UNION) (a (a (c <PREDICT> INTER) (v (fun ?0 (bool)) s)) (v (fun ?0 (bool)) t))) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (fun ?0 (bool)))) INTER) (v (fun ?0 (bool)) s)) (v (fun ?0 (bool)) u)))))
Ground truth: <START> (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (fun ?0 (bool)))) <END>

(<theorem> (a (a (c (fun (real) (fun (real) (bool))) =) (a (c (fun (cart (real) ?0) (real)) infnorm) (a (c (fun (num) (cart (real) ?0)) vec) (a (c (fun (num) (num)) NUMERAL) (c (num) _0))))) (a (c (fun (num) (real)) real_of_num) (a (c (fun (num) (num)) NUMERAL) (c <PREDICT> _0)))))
Ground truth: <START> (num) <END>
Hard Type Inference.

(<theorem> (a (c <MASK> !) (l (v <MASK> s) (a (a (c <MASK> =) (a (c <MASK> INTERS) (v <MASK> s))) (a (a (c <PREDICT> DIFF) (c <MASK> UNIV)) (a (c <MASK> UNIONS) (a (c <MASK> GSPEC) (l (v <MASK> GEN%PVAR%0) (a (c <MASK> ?) (l (v <MASK> t) (a (a (a (c <MASK> SETSPEC) (v <MASK> GEN%PVAR%0)) (a (a (c <MASK> IN) (v <MASK> t)) (v <MASK> s))) (a (a (c <MASK> DIFF) (c <MASK> UNIV)) (v <MASK> t)))))))))))))
Ground truth: <START> (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (fun ?0 (bool)))) <END>

(<theorem> (a (c <MASK> !) (l (v <MASK> f) (a (c <MASK> !) (l (v <MASK> s) (a (a (c <MASK> =) (a (a (c <MASK> uniformly_continuous_on) (v <MASK> f)) (v <MASK> s))) (a (c <MASK> !) (l (v <MASK> e) (a (a (c <MASK> ==>) (a (a (c <MASK> real_lt) (a (c <MASK> real_of_num) (a (c <MASK> NUMERAL) (c <MASK> _0)))) (v <MASK> e))) (a (c <MASK> ?) (l (v <MASK> d) (a (a (c <MASK> ) (a (a (c <MASK> real_lt) (a (c <MASK> real_of_num) (a (c <MASK> NUMERAL) (c <MASK> _0)))) (v <MASK> d))) (a (c <MASK> !) (l (v <MASK> t) (a (c <MASK> !) (l (v <MASK> t’) (a (a (c <MASK> ==>) (a (a (c <MASK> ) (a (a (c <MASK> SUBSET) (v <MASK> t)) (v <MASK> s))) (a (a (c <MASK> ) (a (a (c <MASK> SUBSET) (v <PREDICT> t’)) (v <MASK> s))) (a (a (c <MASK> ) (a (c <MASK> bounded) (v <MASK> t))) (a (a (c <MASK> ) (a (c <MASK> bounded) (v <MASK> t’))) (a (a (c <MASK> real_lt) (a (c <MASK> hausdist) (a (a (c <MASK> ,) (v <MASK> t’)) (v <MASK> t)))) (v <MASK> d))))))) (a (a (c <MASK> real_lt) (a (c <MASK> hausdist) (a (a (c <MASK> ,) (a (a (c <MASK> IMAGE) (v <MASK> f)) (v <MASK> t’))) (a (a (c <MASK> IMAGE) (v <MASK> f)) (v <MASK> t))))) (v <MASK> e)))))))))))))))))))
Ground truth: <START> (fun (cart (real) M) (bool)) <END>

(<theorem> (a (a (c <MASK> ==>) (a (a (c <PREDICT> IN) (v <MASK> a)) (v <MASK> s))) (a (a (c <MASK> =) (a (a (c <MASK> DIFF) (a (a (c <MASK> INSERT) (v <MASK> a)) (a (a (c <MASK> DELETE) (v <MASK> t)) (v <MASK> b)))) (v <MASK> s))) (a (a (c <MASK> DELETE) (a (a (c <MASK> DIFF) (v <MASK> t)) (v <MASK> s))) (v <MASK> b)))))
Ground truth: <START> (fun ?0 (fun (fun ?0 (bool)) (bool))) <END>

(<theorem> (a (c <MASK> !) (l (v <PREDICT> b) (a (c <MASK> convex) (a (c <MASK> GSPEC) (l (v <MASK> GEN%PVAR%0) (a (c <MASK> ?) (l (v <MASK> z) (a (a (a (c <MASK> SETSPEC) (v <MASK> GEN%PVAR%0)) (a (a (c <MASK> real_gt) (a (c <MASK> Im) (v <MASK> z))) (v <MASK> b))) (v <MASK> z))))))))))
Ground truth: <START> (real) <END>

(<theorem> (a (c <MASK> !) (l (v <MASK> x) (a (a (c <MASK> ==>) (a (c <MASK> ) (a (a (c <MASK> nadd_eq) (v <MASK> x)) (a (c <MASK> nadd_of_num) (a (c <MASK> NUMERAL) (c <MASK> _0)))))) (a (c <MASK> ?) (l (v <MASK> B) (a (c <MASK> ?) (l (v <MASK> N) (a (c <MASK> !) (l (v <MASK> m) (a (c <MASK> !) (l (v <MASK> n) (a (a (c <MASK> ==>) (a (a (c <MASK> ) (a (a (c <MASK> <=) (v <MASK> N)) (v <MASK> m))) (a (a (c <MASK> <=) (v <MASK> N)) (v <MASK> n)))) (a (a (c <MASK> <=) (a (a (c <MASK> *) (a (a (c <MASK> *) (a (a (c <MASK> dest_nadd) (v <MASK> x)) (v <MASK> m))) (a (a (c <MASK> dest_nadd) (v <MASK> x)) (v <MASK> n)))) (a (c <MASK> dist) (a (a (c <MASK> ,) (a (a (c <MASK> *) (v <MASK> m)) (a (a (c <MASK> nadd_rinv) (v <MASK> x)) (v <PREDICT> n)))) (a (a (c <MASK> *) (v <MASK> n)) (a (a (c <MASK> nadd_rinv) (v <MASK> x)) (v <MASK> m))))))) (a (a (c <MASK> *) (v <MASK> B)) (a (a (c <MASK> *) (a (a (c <MASK> *) (v <MASK> m)) (v <MASK> n))) (a (a (c <MASK> +) (v <MASK> m)) (v <MASK> n))))))))))))))))))
Ground truth: <START> (num) <END>
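The two kinds of type inference prompts listed above follow a visible pattern: in the plain task a single type annotation is replaced by `<PREDICT>`, while in the hard task every other annotation is additionally replaced by `<MASK>`. The sketch below shows one way such prompts could be derived from a fully typed s-expression. It is an illustration under assumptions, not the pipeline used to build the dataset: the helper names (`tokenize`, `parse`, `render`, `typed_leaves`, `make_prompt`) are hypothetical, and the code assumes that typed leaves always have the shape `(c TYPE NAME)` or `(v TYPE NAME)` with a non-empty name, as in the examples.

```python
def tokenize(text):
    """Split an s-expression into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parse a token list into nested Python lists; atoms stay strings."""
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok
        node = []
        while tokens[pos] != ")":
            node.append(read())
        pos += 1  # consume the closing parenthesis
        return node

    return read()


def render(node):
    """Turn a nested list back into s-expression text."""
    if isinstance(node, str):
        return node
    return "(" + " ".join(render(child) for child in node) + ")"


def typed_leaves(node, out=None):
    """Collect (c TYPE NAME) and (v TYPE NAME) nodes from left to right."""
    if out is None:
        out = []
    if isinstance(node, list):
        if len(node) == 3 and node[0] in ("c", "v") and isinstance(node[2], str):
            out.append(node)  # do not descend into the type itself
        else:
            for child in node:
                typed_leaves(child, out)
    return out


def make_prompt(sexpr, predict_index, hard=False):
    """Return (prompt, target) for the typed leaf at position predict_index."""
    tree = parse(tokenize(sexpr))
    leaves = typed_leaves(tree)
    target = "<START> " + render(leaves[predict_index][1]) + " <END>"
    for i, leaf in enumerate(leaves):
        if i == predict_index:
            leaf[1] = "<PREDICT>"
        elif hard:
            leaf[1] = "<MASK>"  # hard variant: hide every other type too
    return render(tree), target


# A small made-up term (not taken from the dataset) for illustration.
example = "(<theorem> (a (c (fun (real) (real)) real_abs) (v (real) x)))"
plain_prompt, plain_target = make_prompt(example, 1)
hard_prompt, hard_target = make_prompt(example, 1, hard=True)
```

Working on the parsed tree rather than on the raw string guarantees that a whole type subtree is replaced by a single token, which matches how the prompts in this appendix look.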
Assumptions.

Prompt: (<theorem> (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (a (c (fun (fun ?1 (bool)) (fun (fun ?1 (bool)) (bool))) =) (a (c (fun (fun ?1 (bool)) (fun ?1 (bool))) GSPEC) (l (v ?1 GEN%PVAR%0) (a (c (fun (fun ?1 (bool)) (bool)) ?) (l (v ?1 x) (a (a (a (c (fun ?1 (fun (bool) (fun ?1 (bool)))) SETSPEC) (v ?1 GEN%PVAR%0)) (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (a (c (fun ?1 (fun (fun ?1 (bool)) (bool))) IN) (v ?1 x)) (v (fun ?1 (bool)) s))) (a (a (c (fun ?0 (fun ?0 (bool))) =) (a (v (fun ?1 ?0) f) (v ?1 x))) (v ?0 a)))) (v ?1 x))))))) (v (fun ?1 (bool)) t))) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) <PREDICT>) (a (c (fun (fun ?1 (bool)) (bool)) !) (l (v ?1 x) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (v (fun ?1 (bool)) P) (v ?1 x))) (a (v (fun ?1 (bool)) Q) (v ?1 x)))) (a (c (fun (bool) (bool)) ) (a (a (c (fun ?0 (fun ?0 (bool))) =) (a (v (fun ?1 ?0) f) (v ?1 x))) (v ?0 a)))))))))
Ground truth: <START> (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (c (fun (fun ?1 (bool)) (bool)) !) (l (v ?1 x) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (v (fun ?1 (bool)) P) (v ?1 x))) (a (a (c (fun ?1 (fun (fun ?1 (bool)) (bool))) IN) (v ?1 x)) (v (fun ?1 (bool)) s)))))) (a (c (fun (fun ?1 (bool)) (bool)) !) (l (v ?1 x) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (v (fun ?1 (bool)) P) (v ?1 x))) (a (v (fun ?1 (bool)) Q) (v ?1 x)))) (a (c (fun (bool) (bool)) ) (a (a (c (fun ?1 (fun (fun ?1 (bool)) (bool))) IN) (v ?1 x)) (v (fun ?1 (bool)) t))))))) <END>
Source theorem pretty printed: {x | x IN s /\ f x = a} = t ==> (!x. P x ==> x IN s) /\ (!x. P x /\ Q x ==> ~(x IN t)) ==> (!x. P x /\ Q x ==> ~(f x = a))

Prompt: (<theorem> (a (c (fun (fun (fun (cart (real) N) (bool)) (bool)) (bool)) !) (l (v (fun (cart (real) N) (bool)) s) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) <PREDICT>) (a (a (c (fun (fun (cart (real) N) (bool)) (fun (fun (cart (real) N) (bool)) (bool))) =) (a (c (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool))) inside) (v (fun (cart (real) N) (bool)) s))) (c (fun (cart (real) N) (bool)) EMPTY))))))
Ground truth: <START> (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (c (fun (fun (cart (real) N) (bool)) (bool)) connected) (a (a (c (fun (fun (cart (real) N) (bool)) (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool)))) DIFF) (c (fun (cart (real) N) (bool)) UNIV)) (v (fun (cart (real) N) (bool)) s)))) (a (c (fun (bool) (bool)) ) (a (c (fun (fun (cart (real) N) (bool)) (bool)) bounded) (a (a (c (fun (fun (cart (real) N) (bool)) (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool)))) DIFF) (c (fun (cart (real) N) (bool)) UNIV)) (v (fun (cart (real) N) (bool)) s))))) <END>
Source theorem pretty printed: !s. connected ((:real^N) DIFF s) /\ ~bounded ((:real^N) DIFF s) ==> inside s = {}

Prompt: (<theorem> (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (a (c (fun (bool) (fun (bool) (bool))) ) (v (bool) q)) (a (c (fun (bool) (bool)) ) (v (bool) p)))) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) <PREDICT>) (v (bool) r))))
Ground truth: <START> (a (a (c (fun (bool) (fun (bool) (bool))) =) (v (bool) p)) (v (bool) q)) <END>
Source theorem pretty printed: q /\ ~p ==> (p <=> q) ==> r

Prompt: (<theorem> (a (c (fun (fun (fun (cart (real) N) (real)) (bool)) (bool)) !) (l (v (fun (cart (real) N) (real)) f) (a (c (fun (fun (fun (real) (real)) (bool)) (bool)) !) (l (v (fun (real) (real)) g) (a (c (fun (fun (cart (real) N) (bool)) (bool)) !) (l (v (cart (real) N) x) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) <PREDICT>) (a (a (c (fun (fun (cart (real) N) (real)) (fun (net (cart (real) N)) (bool))) real_continuous) (a (a (c (fun (fun (real) (real)) (fun (fun (cart (real) N) (real)) (fun (cart (real) N) (real)))) o) (v (fun (real) (real)) g)) (v (fun (cart (real) N) (real)) f))) (a (c (fun (cart (real) N) (net (cart (real) N))) at) (v (cart (real) N) x)))))))))))
Ground truth: <START> (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (a (c (fun (fun (cart (real) N) (real)) (fun (net (cart (real) N)) (bool))) real_continuous) (v (fun (cart (real) N) (real)) f)) (a (c (fun (cart (real) N) (net (cart (real) N))) at) (v (cart (real) N) x)))) (a (a (c (fun (fun (real) (real)) (fun (net (real)) (bool))) real_continuous) (v (fun (real) (real)) g)) (a (a (c (fun (net (real)) (fun (fun (real) (bool)) (net (real)))) within) (a (c (fun (real) (net (real))) atreal) (a (v (fun (cart (real) N) (real)) f) (v (cart (real) N) x)))) (a (a (c (fun (fun (cart (real) N) (real)) (fun (fun (cart (real) N) (bool)) (fun (real) (bool)))) IMAGE) (v (fun (cart (real) N) (real)) f)) (c (fun (cart (real) N) (bool)) UNIV))))) <END>
Source theorem pretty printed: !f g x. f real_continuous at x /\ g real_continuous atreal (f x) within IMAGE f (:real^N) ==> g o f real_continuous at x

Prompt: (<theorem> (a (c (fun (fun (fun (cart (real) M) (cart (real) N)) (bool)) (bool)) !) (l (v (fun (cart (real) M) (cart (real) N)) f) (a (c (fun (fun (fun (cart (real) M) (cart (real) P)) (bool)) (bool)) !) (l (v (fun (cart (real) M) (cart (real) P)) g) (a (c (fun (fun (fun (cart (real) M) (bool)) (bool)) (bool)) !) (l (v (fun (cart (real) M) (bool)) s) (a (c (fun (fun (num) (bool)) (bool)) !) (l (v (num) n) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) <PREDICT>) (a (a (a (c (fun (num) (fun (fun (cart (real) M) (bool)) (fun (fun (cart (real) M) (cart (real) (finite_sum N P))) (bool)))) baire) (v (num) n)) (v (fun (cart (real) M) (bool)) s)) (l (v (cart (real) M) x) (a (a (c (fun (cart (real) N) (fun (cart (real) P) (cart (real) (finite_sum N P)))) pastecart) (a (v (fun (cart (real) M) (cart (real) N)) f) (v (cart (real) M) x))) (a (v (fun (cart (real) M) (cart (real) P)) g) (v (cart (real) M) x)))))))))))))))
Ground truth: <START> (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (a (a (c (fun (num) (fun (fun (cart (real) M) (bool)) (fun (fun (cart (real) M) (cart (real) N)) (bool)))) baire) (v (num) n)) (v (fun (cart (real) M) (bool)) s)) (v (fun (cart (real) M) (cart (real) N)) f))) (a (a (a (c (fun (num) (fun (fun (cart (real) M) (bool)) (fun (fun (cart (real) M) (cart (real) P)) (bool)))) baire) (v (num) n)) (v (fun (cart (real) M) (bool)) s)) (v (fun (cart (real) M) (cart (real) P)) g))) <END>
Source theorem pretty printed: !f g s n. baire n s f /\ baire n s g ==> baire n s (lambda x. pastecart (f x) (g x))
Equalities.

Prompt: (<theorem> (a (c (fun (fun (fun ?0 (cart (real) (2))) (bool)) (bool)) !) (l (v (fun ?0 (cart (real) (2))) f) (a (c (fun (fun (fun ?0 (cart (real) (2))) (bool)) (bool)) !) (l (v (fun ?0 (cart (real) (2))) g) (a (c (fun (fun (fun ?0 (bool)) (bool)) (bool)) !) (l (v (fun ?0 (bool)) s) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (c (fun (fun ?0 (bool)) (bool)) FINITE) (v (fun ?0 (bool)) s))) (a (a (c (fun (cart (real) (2)) (fun (cart (real) (2)) (bool))) =) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (cart (real) (2))) (cart (real) (2)))) cproduct) (v (fun ?0 (bool)) s)) (l (v ?0 x) (a (a (c (fun (cart (real) (2)) (fun (cart (real) (2)) (cart (real) (2)))) complex_mul) (a (v (fun ?0 (cart (real) (2))) f) (v ?0 x))) (a (v (fun ?0 (cart (real) (2))) g) (v ?0 x)))))) <PREDICT>)))))))))
Ground truth: <START> (a (a (c (fun (cart (real) (2)) (fun (cart (real) (2)) (cart (real) (2)))) complex_mul) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (cart (real) (2))) (cart (real) (2)))) cproduct) (v (fun ?0 (bool)) s)) (v (fun ?0 (cart (real) (2))) f))) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (cart (real) (2))) (cart (real) (2)))) cproduct) (v (fun ?0 (bool)) s)) (v (fun ?0 (cart (real) (2))) g))) <END>
Source theorem pretty printed: !f g s. FINITE s ==> cproduct s (\x. f x * g x) = cproduct s f * cproduct s g

Prompt: (<theorem> (a (c (fun (fun (fun (cart (real) N) (bool)) (bool)) (bool)) !) (l (v (fun (cart (real) N) (bool)) s) (a (c (fun (fun (fun (cart (real) N) (bool)) (bool)) (bool)) !) (l (v (fun (cart (real) N) (bool)) t) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (c (fun (fun (cart (real) N) (bool)) (bool)) convex) (v (fun (cart (real) N) (bool)) s))) (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (c (fun (fun (cart (real) N) (bool)) (bool)) affine) (v (fun (cart (real) N) (bool)) t))) (a (c (fun (bool) (bool)) ) (a (a (c (fun (fun (cart (real) N) (bool)) (fun (fun (cart (real) N) (bool)) (bool))) =) (a (a (c (fun (fun (cart (real) N) (bool)) (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool)))) INTER) (a (c (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool))) relative_interior) (v (fun (cart (real) N) (bool)) s))) (v (fun (cart (real) N) (bool)) t))) (c (fun (cart (real) N) (bool)) EMPTY)))))) (a (a (c (fun (fun (cart (real) N) (bool)) (fun (fun (cart (real) N) (bool)) (bool))) =) <PREDICT>) (a (a (c (fun (fun (cart (real) N) (bool)) (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool)))) INTER) (a (c (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool))) closure) (v (fun (cart (real) N) (bool)) s))) (v (fun (cart (real) N) (bool)) t)))))))))
Ground truth: <START> (a (c (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool))) closure) (a (a (c (fun (fun (cart (real) N) (bool)) (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool)))) INTER) (v (fun (cart (real) N) (bool)) s)) (v (fun (cart (real) N) (bool)) t))) <END>
Source theorem pretty printed: !s t. convex s /\ affine t /\ ~(relative_interior s INTER t = {}) ==> closure (s INTER t) = closure s INTER t

Prompt: (<theorem> (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (bool))) SUBSET) (v (fun ?0 (bool)) t)) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (fun ?0 (bool)))) DIFF) (c (fun ?0 (bool)) UNIV)) (v (fun ?0 (bool)) s)))) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (bool))) =) <PREDICT>) (c (fun ?0 (bool)) EMPTY))))
Ground truth: <START> (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (fun ?0 (bool)))) INTER) (v (fun ?0 (bool)) s)) (v (fun ?0 (bool)) t)) <END>
Source theorem pretty printed: t SUBSET (:?0) DIFF s ==> s INTER t = {}

Prompt: (<theorem> (a (c (fun (fun (real) (bool)) (bool)) !) (l (v (real) x) (a (a (c (fun (real) (fun (real) (bool))) =) <PREDICT>) (a (c (fun (real) (real)) real_abs) (v (real) x))))))
Ground truth: <START> (a (a (c (fun (real) (fun (num) (real))) real_pow) (a (c (fun (real) (real)) sqrt) (v (real) x))) (a (c (fun (num) (num)) NUMERAL) (a (c (fun (num) (num)) BIT0) (a (c (fun (num) (num)) BIT1) (c (num) _0))))) <END>
Source theorem pretty printed: !x. sqrt x pow 2 = abs x

Prompt: (<theorem> (a (a (c (fun (fun A (bool)) (fun (fun A (bool)) (bool))) =) <PREDICT>) (a (c (fun (fun A (bool)) (fun A (bool))) GSPEC) (l (v A GEN%PVAR%0) (a (c (fun (fun A (bool)) (bool)) ?) (l (v A y) (a (a (a (c (fun A (fun (bool) (fun A (bool)))) SETSPEC) (v A GEN%PVAR%0)) (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (a (c (fun A (fun (fun A (bool)) (bool))) IN) (v A y)) (v (fun A (bool)) s))) (a (a (c (fun A (fun A (bool))) =) (v A y)) (v A x)))) (v A y))))))))
Ground truth: <START> (a (a (c (fun A (fun (fun A (bool)) (fun A (bool)))) INSERT) (v A x)) (v (fun A (bool)) s)) <END>
Source theorem pretty printed: x INSERT s = {y | y IN s \/ y = x}