Mathematical Reasoning via Self-supervised Skip-tree Training

06/08/2020 ∙ by Markus N. Rabe, et al. ∙ Google 1

We examine whether self-supervised language modeling applied to mathematical formulas enables logical reasoning. We suggest several logical reasoning tasks that can be used to evaluate language models trained on formal mathematical statements, such as type inference, suggesting missing assumptions and completing equalities. To train language models for formal mathematics, we propose a novel skip-tree task. We find that models trained on the skip-tree task show surprisingly strong mathematical reasoning abilities, and outperform models trained on standard skip-sequence tasks. We also analyze the models' ability to formulate new conjectures by measuring how often the predictions are provable and useful in other proofs.




1 Introduction

Language modeling using Transformers [Vaswani et al., 2017] has been hugely successful for applications like translation and text generation. Models like GPT-2 are able to generate impressive news articles and stories given just an abstract [Radford et al., 2018]. These models are usually first trained on a proxy task, such as predicting missing words in the case of BERT [Devlin et al., 2019], before fine-tuning the models on more specific (downstream) tasks such as machine translation and question answering. The proxy tasks do not rely on labeled data, and thus the models can be trained on large corpora of unlabeled data. Even models trained on the proxy tasks alone have shown impressive language understanding [Brown et al., 2020].

Prior work in deep learning for mathematics has focused on learning directly on logical reasoning tasks. In this work, we apply the paradigms of language modeling to formal mathematics and define proxy tasks on unlabeled mathematical expressions that allow us to use much more data. We start with the HOList dataset [Bansal et al., 2019], which spans a wide range of mathematical topics, including topology, multivariate calculus, real and complex analysis, geometric algebra, and measure theory, formalized in the HOL Light proof assistant [Harrison, 1996]. We consider a standard skip-sequence task and a novel skip-tree task. Our skip-tree task is an instance of the skip-sequence task that respects the tree structure of expressions. We show that models trained on the skip-tree task significantly outperform those trained on the skip-sequence task when evaluated on various downstream tasks.

Reasoning can refer to a wide range of abilities, and thus we measure the mathematical reasoning abilities of language models on a variety of tasks, including mechanical derivations, such as type inference, and also creative tasks, such as predicting under which assumptions a statement is true. In contrast to works in language modeling, we do not fine-tune the models to the evaluation (downstream) tasks, as we want to study what reasoning capabilities can be acquired just through language modeling proxy tasks.

An advantage of formal language compared to natural language is that we can attempt to automatically evaluate statements. That is, even if the language models fail to predict the ground truth, the statements they predicted might still be true and useful. We evaluate these conjectures by attempting to prove them and checking whether they can be used in the context of other proofs.

Our contributions are two-fold:

  1. We introduce several evaluation tasks that test logical reasoning abilities.

  2. We introduce a new skip-tree language modeling task that outperforms skip-sequence approaches in our evaluation on the logical reasoning tasks.

The remainder of this paper is structured as follows: First, we review related work on language modeling and deep learning for mathematics in Section 2. Then, in Section 3 we discuss the source corpus of formal mathematical statements from which we generate our training data. In Section 4, we present our novel language modeling task for formal languages, as well as several variations that we used in our ablation studies. We present the evaluation tasks in Section 5, present our experimental findings in Section 6, and conclude in Section 7.

2 Related work

Recently, we have seen a series of rapid improvements in language modeling. Many of the improvements result from better pretraining tasks [Devlin et al., 2019, Zhang et al., 2019, Song et al., 2019, Dong et al., 2019, Raffel et al., 2019, Conneau and Lample, 2019]. BERT [Devlin et al., 2019] is a pretraining task for Transformers [Vaswani et al., 2017], which masks out a certain fraction of the input tokens that the model has to predict. UniLM uses multiple pretraining tasks at once [Dong et al., 2019]. One of them is a sequence-to-sequence task, which is to predict the next sentence from the previous sentence. MASS considers a generalized sequence-to-sequence pretraining task [Song et al., 2019], which is to predict a masked out subsequence of the input. SpanBERT additionally considers a span boundary objective, which is to predict the masked out subsequence only from the tokens adjacent to the missing subsequence [Joshi et al., 2020]. However, both MASS and SpanBERT reveal the length of the sequence to predict as they replace it by a number of mask tokens equal to the length of the sequence.

T5 introduced a generalization of sequence-to-sequence pretraining tasks that is crucial to our work [Raffel et al., 2019]. They replace the subsequence to be predicted by a single token (not a number of mask tokens equal to the length of the subsequence, as in MASS). Further, T5 allows multiple subsequences to be masked out and predicted. [Zhang et al., 2019] additionally exploit the sentence structure of natural language. They suggest the pretraining task Pegasus, which masks out entire sentences of a given text, and additionally masks out randomly selected tokens in the remaining text (or alternatively replaces them by other tokens). Similar to Pegasus’ exploitation of the sentence structure of natural language, our skip-tree task exploits the tree structure of formal expressions. [Zhang et al., 2019] also suggest sampling the sentences to be masked with the help of ROUGE1-F1 [Lin, 2004].

We work with the HOList dataset by Bansal et al. [2019]. There are other datasets which might be suitable for our approach as well, including proofs extracted from HOL4 [Gauthier et al., 2017], and from Coq [Huang et al., 2019, Yang and Deng, 2019, Sanchez-Stern et al., 2019].

Lample and Charton [2020] use a Transformer model for symbolic integration. They train their model directly on the reasoning task they want to learn, and their approach requires that the inverse of the prediction task can be computed effectively with classical algorithms. In contrast, we train language models on a proxy task and evaluate them on several logical reasoning tasks that are substantially different from the training task. Also, our dataset spans a much wider range of mathematical theories. We imagine that the language modeling approach explored in this paper could be used as a pretraining task for symbolic integration.

Finkbeiner et al. [2020] explore the generalization properties of Transformers predicting the solutions to formulas in linear-time temporal logic. Unlike the language modeling approach followed in our work, their training regime requires a data generator that can solve formulas, which is currently not feasible for higher-order logic. Transformer models for program understanding have focused on providing inductive biases in the architecture [Shiv and Quirk, 2019, Hellendoorn et al., 2020], whereas this work suggests using a modified language modeling proxy task. Perhaps these approaches could be combined to improve performance on tree-structured data.

3 Dataset

We start from the HOList dataset introduced by Bansal et al. [2019]. The complete dataset includes 29465 theorems and their proofs. We here consider only the “core” and “complex” datasets which comprise 18943 theorems, 637 definitions and 603,950 proof steps. These proof steps were extracted from the human proof logs. The theorems and proofs were written (by humans) using the HOL Light proof assistant, and span various areas of mathematics such as set theory, arithmetic, linear algebra, topology, and multivariate complex analysis. The proofs contain a lot of intermediate goals which are the result of applying “tactics” on previous proof goals. For example, one of the tactics is to rewrite the current proof goal with a set of equations selected from the theorem database.

From this dataset we extract all theorem statements as well as all intermediate proof goals. We use S-expressions to represent all statements. For example, (v bool x) represents a boolean variable named x, and (a (v (fun (bool) (bool)) f) (v bool x)) represents the function application f(x), where f is a function from bool to bool. The S-expression syntax is thus very verbose, which can cause some expressions to exceed the size constraints of our Transformer model.
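To make the representation concrete, the following is a minimal sketch of how such S-expressions could be read into a tree and written back out. The function names `tokenize`, `parse`, and `unparse` are illustrative assumptions, not part of HOList, and the example expression is written with balanced parentheses.

```python
def tokenize(text):
    """Split an S-expression string into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parse a token list into nested Python lists (atoms stay strings)."""
    def read(pos):
        if tokens[pos] == "(":
            node, pos = [], pos + 1
            while tokens[pos] != ")":
                child, pos = read(pos)
                node.append(child)
            return node, pos + 1  # skip the closing parenthesis
        return tokens[pos], pos + 1

    tree, end = read(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after the expression")
    return tree


def unparse(tree):
    """Serialize a nested-list tree back into S-expression syntax."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join(unparse(child) for child in tree) + ")"


# The function application f(x) from the text.
expr = "(a (v (fun (bool) (bool)) f) (v bool x))"
tree = parse(tokenize(expr))
print(tree)
print(unparse(tree) == expr)
```

Representing every subexpression as a nested list makes it straightforward to address, replace, and count subtrees, which is what the training task in Section 4 relies on.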

Figure 1: We use the theorems and proofs of the training split, marked in green, for training. For our evaluation tasks, we only use the theorems of the validation set, marked in red, to ensure that the model has never seen the statements from which the evaluation tasks are derived.

We use the same split into training/validation/testing data as defined in HOList. The split is defined on the theorems, and the entire proof of each theorem is assigned to the same split as the theorem. This avoids partially revealing the proofs of theorems in the validation and test sets during training. In total, we use the proofs of 11,655 theorems in the training split of the core and complex libraries. We derive all training data from the theorems and proofs in the training set, and use only the theorems (not the proofs) for the evaluation tasks. This addresses the possibility that some proof steps for training theorems and for validation theorems might be shared. In Figure 1 we depict our choice of training and evaluation data.

4 Skip-tree Training

In this section we define the skip-tree training task. We parse a given mathematical statement into a tree of subexpressions, and replace one of the subexpressions by a <PREDICT> token. The task is to predict the subexpression replaced by <PREDICT>. See Figure 2 for an example.

For training, the trees are converted back to a sequence of tokens; the target sequence is extended by a <START> token in the front and an <END> token in the back. We exclude training examples where the output sequence is longer than the length of the decoder (512 tokens), and we cut off input sequences when they exceed the length of the encoder (1024 tokens).
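The construction described above can be sketched as follows. This is an illustrative reading, not the authors' implementation: trees are nested Python lists, the root is excluded from sampling, and the length limits are applied to token lists that include the <START> and <END> tokens; all three choices are assumptions.

```python
import random

PREDICT, START, END = "<PREDICT>", "<START>", "<END>"


def subexpressions(tree, path=()):
    """Yield (path, subtree) for every node, leaves included."""
    yield path, tree
    if isinstance(tree, list):
        for i, child in enumerate(tree):
            yield from subexpressions(child, path + (i,))


def replace_at(tree, path, new):
    """Return a copy of `tree` with the node at `path` replaced by `new`."""
    if not path:
        return new
    copy = list(tree)
    copy[path[0]] = replace_at(tree[path[0]], path[1:], new)
    return copy


def to_tokens(tree):
    """Flatten a tree back into a token sequence."""
    if isinstance(tree, str):
        return [tree]
    tokens = ["("]
    for child in tree:
        tokens += to_tokens(child)
    return tokens + [")"]


def skip_tree_example(tree, rng, max_in=1024, max_out=512):
    """Build one (input, target) pair; returns None if the target is too long."""
    # Assumption: the root itself is never selected for prediction.
    candidates = [(p, s) for p, s in subexpressions(tree) if p]
    path, target = rng.choice(candidates)
    source = to_tokens(replace_at(tree, path, PREDICT))[:max_in]  # cut off long inputs
    target_tokens = [START] + to_tokens(target) + [END]
    if len(target_tokens) > max_out:
        return None  # exclude examples with overly long outputs
    return source, target_tokens


# x = x in the S-expression encoding used in this paper.
tree = ["a", ["a", ["c", ["fun", ["A"], ["fun", ["A"], ["bool"]]], "="],
              ["v", "A", "x"]], ["v", "A", "x"]]
source, target = skip_tree_example(tree, random.Random(0))
print(" ".join(source))
print(" ".join(target))
```

Substituting the target (without <START> and <END>) for the <PREDICT> token in the input recovers the original token sequence, which is a convenient sanity check for any implementation of the task.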

Figure 2: The skip-tree training task for the example of the equality operator on boolean constants (original formula). In this example we assume that a part of the type was sampled to be the subexpression to be predicted, and that subexpression c was sampled to be masked out additionally. Note the input to the decoder is shifted to the right, such that the next token prediction task yields the target sequence.

Additional masked subexpressions.

In addition to the subexpression to be masked out by <PREDICT>, we select subexpressions to be masked out by a different mask token <MASK>. In contrast to the <PREDICT> token, we replace all occurrences of these subexpressions by the <MASK> token. Note that the subexpressions we want to replace by <MASK> tokens can overlap with each other or with the subexpression replaced by the <PREDICT> token. In this case, we give the highest preference to the <PREDICT> token, and then to the subexpressions to be replaced by <MASK> tokens in decreasing order of size.

The subexpressions masked by <MASK> do not have to be predicted. They are only hidden to make the task harder and to make the model tolerant to having partial information. A beneficial side effect of replacing some expressions by a <MASK> token is that the input sequences get substantially shorter and more mathematical expressions fit in the size constraints of the Transformer architecture.

Distributions of subexpressions.

Sampling subexpressions uniformly at random results in very short sequences to be predicted: since our trees are mostly ternary, two thirds of the subexpressions are leaves. Besides picking subexpressions uniformly at random, we thus experiment with weighting the subexpressions by the number of tokens they contain. We refer to these variants as “uniform” and “weighted”. The weighted variant results in a much more diverse set of expressions to be sampled.
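The difference between the two sampling schemes can be illustrated with a short sketch. It assumes nested-list trees and counts parentheses as tokens when weighting; whether the original weighting counts parentheses is not stated in the text.

```python
import random


def subexpressions(tree):
    """Yield every subexpression of a nested-list tree, leaves included."""
    yield tree
    if isinstance(tree, list):
        for child in tree:
            yield from subexpressions(child)


def num_tokens(tree):
    """Token count of the serialized subexpression (assumption: parentheses count)."""
    if isinstance(tree, str):
        return 1
    return 2 + sum(num_tokens(child) for child in tree)


def sample_subexpression(tree, rng, weighted):
    """Sample one subexpression, uniformly or weighted by its token count."""
    subs = list(subexpressions(tree))
    weights = [num_tokens(s) if weighted else 1 for s in subs]
    return rng.choices(subs, weights=weights, k=1)[0]


def expected_length(tree, weighted):
    """Expected token count of the sampled subexpression under each scheme."""
    sizes = [num_tokens(s) for s in subexpressions(tree)]
    if weighted:
        return sum(n * n for n in sizes) / sum(sizes)
    return sum(sizes) / len(sizes)


tree = ["a", ["a", ["c", ["fun", ["A"], ["fun", ["A"], ["bool"]]], "="],
              ["v", "A", "x"]], ["v", "A", "x"]]
subs = list(subexpressions(tree))
leaf_fraction = sum(isinstance(s, str) for s in subs) / len(subs)
print(leaf_fraction)
print(expected_length(tree, weighted=False), expected_length(tree, weighted=True))
print(sample_subexpression(tree, random.Random(0), weighted=True))
```

For this small example a majority of the subexpressions are leaves, so uniform sampling mostly yields single-token targets, while weighting shifts the expected target length toward larger subtrees.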

Multiple samples per statement.

Since we started with a data source that is small compared to datasets in natural language modeling, we use each mathematical statement from the training set to generate multiple training examples. Our initial data consists of about 360K intermediate statements from the proofs of 10K statements in the training split of the core and complex library of the HOList corpus. To avoid duplicates, we sample the subexpressions that are replaced by a <PREDICT> token for each original formula without replacement.

4.1 Ablations

To verify the design choices of the skip-tree training task we generated multiple variants of the training task and trained a model on each of them.

No mask tokens.

To answer the question of whether it helps to mask out subexpressions besides the one to predict, we generated a dataset without any <MASK> tokens, called “skip-tree (no <MASK>)”.

Fewer samples per statement.

Instead of sampling many training examples from each formula, we could train on fewer training examples for more epochs. We generated a smaller version of the skip-tree training data, which we call “skip-tree (small)”.


Skip-sequence.

MASS [Song et al., 2019], SpanBERT [Joshi et al., 2020], and T5 [Raffel et al., 2019] pretrain their sequence-to-sequence natural language models by predicting subsequences of the tokens. The skip-tree task is similar, but exploits our ability to parse the formulas as trees. To examine if this makes a difference, we consider a “skip-sequence” task that samples subsequences of the list of tokens instead of sampling subexpressions. We generated three datasets for the skip-sequence task, where we sample subsequences of different lengths (short/medium/long). For the task “skip-sequence (long)”, we pick two positions in the token sequence uniformly at random and select the sequence that is between them. For the tasks “skip-sequence (medium)” and “skip-sequence (short)”, we limit their distance to 100 and 50 tokens, respectively.
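For comparison with the skip-tree task, the skip-sequence sampling can be sketched as below. How the distance limit is enforced is not specified in the text; this sketch assumes the second position is drawn within `max_len` tokens of the first, and the function name is hypothetical.

```python
import random

PREDICT, START, END = "<PREDICT>", "<START>", "<END>"


def skip_sequence_example(tokens, rng, max_len=None):
    """Mask a contiguous token span chosen without regard to tree structure."""
    n = len(tokens)
    i = rng.randrange(n + 1)  # positions are boundaries between tokens
    if max_len is None:
        j = rng.randrange(n + 1)  # "long": two unconstrained positions
    else:
        # Assumption: the second position lies within max_len of the first.
        j = rng.randint(max(0, i - max_len), min(n, i + max_len))
    start, end = min(i, j), max(i, j)
    source = tokens[:start] + [PREDICT] + tokens[end:]
    target = [START] + tokens[start:end] + [END]
    return source, target


tokens = "( a ( a ( c ( fun ( A ) ( fun ( A ) ( bool ) ) ) = ) ( v A x ) ) ( v A x ) )".split()
source, target = skip_sequence_example(tokens, random.Random(0), max_len=50)
print(" ".join(source))
print(" ".join(target))
```

Because the span boundaries ignore the tree structure, the target generally contains unbalanced parentheses, so the model has to decide where a span ends without the help of a well-formed subexpression.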

Dataset                   # examples   # tokens (input/output)   avg length (input/output)
Skip-tree (weighted)      25.8M        17.4B/1.6B                675/61
Skip-tree (uniform)       25.7M        18.8B/316M                732/12
Skip-tree (small)         5.2M         3.5B/521M                 673/100
Skip-tree (no <MASK>)     25.8M        19.4B/1.6B                750/61
Skip-sequence (long)      19.2M        11.9B/2.8B                620/146
Skip-sequence (medium)    26.0M        19.4B/884M                744/34
Skip-sequence (short)     26.0M        19.6B/479M                752/18

Table 1: Basic statistics of the training splits of the data sets. Number of tokens in the training set measured before padding.

5 Evaluation Tasks

In this section we suggest several logical reasoning tasks on which our language models can be evaluated. These tasks require different levels of logical reasoning, ranging from mostly mechanical application of typing rules to conjecturing under which assumptions a statement might hold.

We intentionally define them to be out-of-distribution compared to the training data. Not only do we generate the examples in a slightly different way, we also generate them from the validation set of the theorem database. That is, the model has never seen the source data, nor has it seen the proofs of these theorems. This makes the tasks more challenging, and also ensures that we force the models to go beyond memorization. To give the interested reader a better impression of the evaluation tasks, we provide a list of randomly selected examples in Appendix D.

Type Inference.

We generate type inference problems similar to how we generated the skip-tree training data, which we described in Section 4. However, we restrict the sampling of subexpressions to subtrees that represent types of variables or constants (i.e. not fragments of other types).

We generated two variants of the type inference task: In the task we call “Type Inference,” we replace only the selected type by the <PREDICT> token and do not mask out anything else. In the second variant, which we name “Hard Type Inference,” we additionally replace all other types by the <MASK> token. The two tasks loosely correspond to deriving the first and the last type during type inference.

For example, consider x = x, which in the s-expression syntax is represented as follows:

(a (a (c (fun (A) (fun (A) (bool))) =) (v A x)) (v A x))

Each subexpression here is either a leaf or a triple. The first element of these triples indicates their kind: a indicates function applications, c indicates constants (i.e. symbols that have been defined in the formal system), v indicates a variable, and finally fun indicates a function type. The equality operator “=” is represented by (c (fun (A) (fun (A) (bool))) =), which indicates that it is a constant that has a function type taking two arguments of arbitrary type A and returning a bool. Since functions are typically curried in this representation, we have two function applications, both times with the variable x as the argument.

An example for the “Type Inference” evaluation task would be:

(a (a (c <PREDICT> =) (v A x)) (v A x))

The type of the equality operator is still uniquely defined, as we know what the equality is applied to (two arguments of type A) and because the top-level application always has to return a boolean value. In this example the type could have been computed by a classical type inference algorithm.

For the “Hard Type Inference” evaluation task, the input would look as follows:

(a (a (c <PREDICT> =) (v <MASK> x)) (v <MASK> x))

Now, the type inference task is highly ambiguous. In fact, in this case, variable x could have any type, and the equality operator would have to adapt to the type of its arguments accordingly. Further, note that the hard type inference task masks out many more subtrees compared to the training data.


Assumptions.

This evaluation task is to predict missing assumptions for theorems in the validation set. We extract these tasks by searching for “top-level implications” and replacing their left operand by the <PREDICT> token. We define an implication operator “⇒” in an expression to be a top-level implication if it is either the top-most operator of the expression, or occurs only under quantifiers, conjunctions, disjunctions, or on the right side of other top-level implications. This definition helps us to avoid picking assumptions in negated parts of formulas.

Note that we can have multiple top-level implications per validation theorem. Consider the abstracted example . In this case, , , and are all considered to be assumptions of top-level implications.
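The definition of top-level implications translates into a short recursive search. The sketch below uses a simplified, hypothetical tuple encoding of formulas, such as ("implies", left, right), instead of the S-expression syntax, to keep the traversal readable.

```python
def top_level_assumptions(formula):
    """Yield the left operands of all top-level implications in `formula`."""
    if isinstance(formula, str):
        return  # atoms contain no implications
    op = formula[0]
    if op == "implies":
        yield formula[1]
        # Only the right side may contain further top-level implications.
        yield from top_level_assumptions(formula[2])
    elif op in ("forall", "exists"):
        yield from top_level_assumptions(formula[2])  # (quantifier, variable, body)
    elif op in ("and", "or"):
        yield from top_level_assumptions(formula[1])
        yield from top_level_assumptions(formula[2])
    # Any other operator, e.g. negation, is not descended into.


# forall x. a ==> (b /\ (c ==> d)) has two top-level implications.
formula = ("forall", "x", ("implies", "a", ("and", "b", ("implies", "c", "d"))))
assumptions = list(top_level_assumptions(formula))
print(assumptions)
```

Each yielded operand gives one evaluation example, obtained by replacing that operand with the <PREDICT> token. Implications under a negation or on the left side of another implication are skipped, in line with the definition.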

An example from the theorem database is , for which the task is to predict given <PREDICT> . (We omit the presentation of this example as an s-expression for the sake of readability.) At first, the expression to predict in this case may seem unique, but there are actually many ways to complete the task into a true statement; e.g. or . Still, most humans would likely guess as it is simple and general, and because occurs before in the alphabet. To make a correct prediction, our language models thus have to understand which statements are more general and also know about naming conventions.

Below we give some examples of this reasoning task that we selected for their simplicity. (For a representative selection, see Appendix D.) While it is often easy to “see” that a given solution to such a task is correct, it can be non-trivial to come up with a solution in the first place. We encourage the reader to make their own predictions before looking up the ground truth in Appendix C:


Equalities.

Similar to the task of predicting missing assumptions, we ask to predict one side of a top-level equality in this task. Again, we define top-level equalities to be any equality that occurs as the top-level operator of the formula or occurs inside quantifiers, conjunctions, disjunctions, or on the right side of implications. For example, from the theorem we extract two evaluation examples: and .

Again, we present some simple example tasks (in mathematical notation for the sake of readability) and provide the ground truth as well as the model predictions in Appendix C:

6 Results and Discussion

We trained a Transformer with the hyperparameters specified in the appendix on the skip-tree dataset and each of the ablations for 1M steps with a batch size of 256.

In language modeling for natural language one of the key metrics is how often the next token in the ground truth is correctly predicted. This is not an ideal measurement for formal mathematics as even a single incorrect token can invalidate the entire statement. Also, the s-expression representation is relatively lengthy and barely human-readable, so a token-level measurement does not allow us to compare our models to the natural language models in any case. Therefore, we focus on exact matches of the entire predicted statement.
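The exact-match criterion can be sketched in a few lines. The token-level measure below is a simplified positional comparison rather than the next-token accuracy of a trained model, and both function names are illustrative assumptions.

```python
def exact_match_rate(beams, ground_truths):
    """Fraction of tasks whose ground truth appears among the beam candidates."""
    hits = sum(1 for beam, truth in zip(beams, ground_truths) if truth in beam)
    return hits / len(ground_truths)


def token_accuracy(predicted, truth):
    """Fraction of positions at which two token sequences agree."""
    matches = sum(p == t for p, t in zip(predicted, truth))
    return matches / max(len(predicted), len(truth))


truth = "( fun ( A ) ( bool ) )".split()
prediction = truth[:-1]  # one closing parenthesis is missing

print(token_accuracy(prediction, truth))
print(exact_match_rate([[prediction]], [truth]))
print(exact_match_rate([[prediction, truth]], [truth]))
```

A prediction that drops a single closing parenthesis still agrees with the ground truth on most tokens, yet it does not parse and therefore counts as a failure under exact match.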

Dataset                   Type Inference   Hard Type Inference   Assumptions   Equalities
Skip-tree (uniform)       96.21%           94.40%                40.85%        46.57%
Skip-tree (weighted)      96.23%           93.32%                40.86%        42.89%
Skip-tree (small)         95.89%           90.42%                39.23%        40.91%
Skip-tree (no <MASK>)     96.07%           32.50%                38.38%        41.60%
Skip-sequence (long)      9.44%            0.06%                 0.53%         0.56%
Skip-sequence (medium)    48.94%           5.97%                 3.32%         3.55%
Skip-sequence (short)     77.25%           3.21%                 0.68%         2.06%

Table 2: Success rate of predicting the ground truth in a beam search of width 8 after training a model on various datasets. Grayed out values indicate experiments where the training data did not include the <MASK> token but the evaluation data did.

In Table 2 we present how well the Transformer model, trained on different datasets, can predict the ground truth sequences. We can observe that for type inference, i.e. the more mechanical reasoning tasks, the models achieve high accuracy, even in the Hard Type Inference case where the expression was stripped of all types. We see that the skip-tree task and its ablations clearly dominate the skip-sequence language modeling task.

A closer inspection of the skip-sequence model shows that its single-token accuracy is almost as high as the single-token accuracy of the skip-tree model, and higher than the single-token accuracy of the skip-tree (uniform) model (all measured on the validation data of the different training tasks). However, its predictions rarely parse or typecheck. On manual inspection of the predictions, it seems that the skip-sequence models consistently add surplus tokens at the end, or stop expressions too early; they appear to be unable to correctly identify the end of the expression to predict. The problem may be amplified by the s-expression syntax, which requires counting parentheses to some extent.

6.1 Conjecturing

In the experiments above, we measured how often the models predicted the ground truth in the evaluation tasks. We now change our point of view, and examine whether the models can be used to generate new conjectures. We define conjectures as mathematical statements that differ from the ground truth and any expression the model has seen during training. Additionally, a meaningful conjecture should be syntactically correct, typecheck, be provable, and be useful in the context of other proofs.

Since the training data is derived exclusively from true statements (i.e. human proof steps), the language models are incentivized to complete partial statements in a way that makes them true. Presented with one of the evaluation tasks, to predict missing assumptions or to predict the missing side of an equation, the models may thus complete these statements in multiple ways that make them true. The predictions that do not match the ground truth may still be true and useful statements. In the following we describe experiments that help us estimate how often this is the case.

Free-form conjecturing.

In addition to the “assumptions” and the “equalities” evaluation tasks, we consider a third task for producing conjectures. In this task, which we call “free-form conjecturing”, we query the model with a single prompt: (<theorem> <PREDICT>). This helps us to analyze what the language models produce when given no context. The <theorem> tag indicates only that the statement should be a theorem, and not an intermediate proof step, which would start with the <goal> tag. For free-form conjecturing we want to produce a variety of different predictions, and thus use a beam search with high beam width of 1024. We did not include the free-form conjecturing task in Table 2, as there is no ground truth to match against.

How often are predictions true and new?

For this measurement, we replace the <PREDICT> token with the predicted sequence and attempt to prove the resulting statement in the DeepHOL theorem prover [Bansal et al., 2019]. Note that this can only give us a lower bound on the number of true statements, because of the limitations of the prover: the version of the DeepHOL theorem prover used here can prove around 58% of the validation theorems. So we expect the estimates here to be considerably below the number of actually true statements.

In Table 3 we report two numbers for each evaluation task: The first number is the percentage of generated statements known to be provable, including exact matches, statements from the training set, and statements provable with DeepHOL. The second number is the percentage of generated statements that are provable and new, i.e. excluding exact matches with the ground truth and statements from the training set. The denominator for both numbers is the same: the set of all predictions from the beam searches in Table 2.
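The two percentages can be computed as in the following sketch. The prover is abstracted into a hypothetical callable `is_provable`, which stands in for an attempt to prove the statement with DeepHOL; the data layout is likewise an assumption.

```python
def conjecture_stats(predictions, ground_truth, training_statements, is_provable):
    """Return (percent provable, percent provable and new).

    predictions: list of (task_id, completed_statement) from all beams.
    ground_truth: dict mapping task_id to the ground-truth statement.
    """
    known = new = 0
    for task_id, statement in predictions:
        exact = statement == ground_truth[task_id]
        seen = statement in training_statements
        if exact or seen:
            known += 1  # counted as provable, but not as new
        elif is_provable(statement):
            known += 1
            new += 1
    total = len(predictions)  # same denominator for both numbers
    return 100.0 * known / total, 100.0 * new / total


# Toy data: one exact match, one new provable statement, one failure,
# and one statement already contained in the training set.
predictions = [(0, "T0"), (0, "P"), (0, "F"), (1, "S")]
stats = conjecture_stats(predictions, {0: "T0", 1: "T1"}, {"S"}, lambda s: s == "P")
print(stats)
```

Since the prover cannot prove every true statement, both numbers computed this way remain lower bounds.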

We believe that these measurements show a significant bias towards true statements. While in some tasks, less than half of the statements were provable, there are simply many more ways to write a false statement than a true statement.

Dataset                   Assumptions       Equalities        Free-form Conjecturing
Skip-tree (weighted)      32.41%/26.91%     17.96%/11.63%     97.75%/0.59%
Skip-tree (uniform)       32.19%/26.20%     19.61%/12.28%     82.32%/12.70%

Table 3: Percentage of “provable statements”/“provable new statements”. The type inference tasks are not included as we are only interested in the predictions that do not match the ground truth. For the type inference tasks, these statements are either semantically equivalent to existing statements or statements that do not type check.

Are the conjectures useful?

For some evaluation tasks, the models could “cheat” on the truth metric by making the statements trivially true. For example, the models can predict False as an assumption, or complete the missing part of an equation by making it an identity (e.g. complete x = <PREDICT> by predicting x). In fact, manual inspection revealed several such cases.

To make this measurable, we added the provable statements to the theorem database, and ran the reinforcement learning experiments of the DeepHOL theorem prover [Bansal et al., 2019] to measure how many of the statements were used as premises. In this experiment we also make sure that the new theorems cannot be used in the proofs of their premises. In a “pruning” step DeepHOL minimizes proofs by removing each individual premise in a proof and checking if the proof still holds. Only the premises that survive this step are classified as useful. While this measurement is a relatively low bar, it filters out statements that have no effect in any proof.
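The pruning idea can be sketched as a greedy loop. Here `proof_holds` is a hypothetical callback that replays the proof with a given premise list; whether DeepHOL removes premises one after another as shown or tests each removal independently is not specified in the text.

```python
def prune_premises(premises, proof_holds):
    """Drop every premise whose removal leaves the proof intact."""
    kept = list(premises)
    for premise in premises:
        trial = [p for p in kept if p != premise]
        if proof_holds(trial):
            kept = trial  # the premise had no effect on the proof
    return kept


# Toy proof that only needs the premises A and B.
needed = {"A", "B"}
useful = prune_premises(["A", "X", "B", "Y"], lambda ps: needed <= set(ps))
print(useful)
```

Under this scheme a conjecture counts as useful only if at least one proof stops working once the conjecture is removed from its premises.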

We ran three reinforcement learning experiments, one for each of the evaluation tasks. We then measured how many of the theorems generated by each task are used as a premise in one of the over 200,000 proofs found for each of the experiments. For the assumptions task, 3445 of the 3857 theorems were used at least once. For the equalities task and the free-form conjectures it was 979 out of 3440 and 49 out of 130, respectively. We provide usage histograms in Appendix B.

While some of the most frequently used conjectures turned out to be alpha-equivalent variations of existing theorems in the theorem database, we found some interesting examples among the most used conjectures:

  • Assumptions task, 1728 usages:

    . Humans have used this theorem over vector arithmetic in many proofs. However, this theorem has always been defined as a local lemma and thus did not make it into the theorem database. This conjecture apparently filled a gap in the theorem database.

  • Free-form conjecturing task, 15 usages: . In contrast to the previous example, there are no occurrences of this statement (or an equivalent statement) in the theorem database or any human proof, not even as a local lemma.

These results suggest that language models show a limited ability to produce new, useful conjectures, even without fine-tuning or specialized training. However, overall, the new conjectures appear to be mostly variations of existing theorems. This falls in line with current expectations of Transformer models. To effectively produce conjectures that are useful in a specific context, we expect that a more targeted training approach is needed.

7 Conclusion

In this work, we applied the paradigms of language modeling to formal mathematics. We introduced a novel self-supervised training task for formal mathematics that clearly outperforms language modeling tasks used for natural language. We suggested several evaluation tasks and metrics for measuring mathematical reasoning capabilities of language models for formal mathematics. Our experiments demonstrate that language models are already surprisingly capable at a variety of reasoning tasks that they were not trained for directly. We also explored the ability of language models to produce new conjectures by measuring how many of the new predictions are provable and useful for other proofs.


Appendix A Hyperparameters

We trained the Transformers with these hyperparameters:

  • vocabulary size: 1200

  • embedding size: 128

  • attention dropout: 0.1

  • nonlinearity: gelu

  • hidden layer dropout: 0.1

  • hidden layer size: 512

  • initializer range: 0.02

  • intermediate size: 768

  • number of attention heads: 8

  • number of hidden layers in encoder: 2

  • number of hidden layers in decoder: 4
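For reference, the settings above can be collected into a single configuration object. The key names below are illustrative assumptions and do not correspond to a specific library; the sequence limits are taken from Section 4.

```python
# Hyperparameters as listed in this appendix; key names are hypothetical.
TRANSFORMER_CONFIG = {
    "vocab_size": 1200,
    "embedding_size": 128,
    "attention_dropout": 0.1,
    "nonlinearity": "gelu",
    "hidden_dropout": 0.1,
    "hidden_size": 512,
    "initializer_range": 0.02,
    "intermediate_size": 768,
    "num_attention_heads": 8,
    "num_encoder_layers": 2,
    "num_decoder_layers": 4,
    # Sequence limits from Section 4.
    "max_encoder_length": 1024,
    "max_decoder_length": 512,
}

# Per-head dimension if the hidden size is split evenly across heads.
head_size = TRANSFORMER_CONFIG["hidden_size"] // TRANSFORMER_CONFIG["num_attention_heads"]
print(head_size)
```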

Appendix B Usage Statistics of Conjectures

Figure 3: Histograms of premise usage of the conjectures generated through the assumptions task (left), the equality task (middle), and through free-form conjecturing (right). X-axes are the new theorems, sorted by number of usages. Y-axes indicate the number of usages on a log scale.

Appendix C A Close Look at Simple Example Tasks


In Section 5 we presented the following three examples of the task to predict missing assumptions. For the sake of readability we here discuss only the pretty printed versions. For examples in s-expression syntax, please visit Appendix D.

The ground truth answers are as follows:

  • , note that would be a more general assumption.

For the first and the third task, the language model “skip-tree (weighted)” makes a correct prediction in the top 3 candidates in a beam search of width 8. For the second task, the language model mostly produces incorrectly typed expressions: it appears to think that is a set of the same type as .


We presented these examples for the equality evaluation task:

The ground truth for the tasks is:

Examples two and three are predicted correctly in a beam search with beam width 8. For the first example, the model almost gets it correct in two of the 8 attempts: , and . We find it surprising that the model apparently understands that there are two cases to consider, but that the exact combination of constants (1 and 0) is a challenge.

Appendix D Randomly Selected Example Tasks

In the following, we provide a list of 5 examples for each of the evaluation tasks, sampled uniformly at random.

Type Inference.

  • (<theorem> (a (c <PREDICT> !) (l (v (fun (cart (real) ?1) (bool)) t) (a (c (fun (fun (fun (cart (real) ?1) (bool)) (bool)) (bool)) !) (l (v (fun (cart (real) ?1) (bool)) u) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (c (fun (fun (cart (real) ?1) (bool)) (bool)) !) (l (v (cart (real) ?1) b) (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (c (fun (fun (cart (real) ?1) (bool)) (bool)) ?) (l (v (cart (real) ?1) w) (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (a (c (fun (cart (real) ?1) (fun (fun (cart (real) ?1) (bool)) (bool))) IN) (v (cart (real) ?1) w)) (v (fun (cart (real) ?1) (bool)) t))) (a (a (c (fun (cart (real) ?1) (fun (fun (cart (real) ?1) (bool)) (bool))) IN) (v (cart (real) ?1) w)) (a (c (fun (prod (cart (real) ?1) (real)) (fun (cart (real) ?1) (bool))) ball) (a (a (c (fun (cart (real) ?1) (fun (real) (prod (cart (real) ?1) (real)))) ,) (v (cart (real) ?1) b)) (a (c (fun (num) (real)) real_of_num) (a (c (fun (num) (num)) NUMERAL) (a (c (fun (num) (num)) BIT1) (c (num) _0))))))))))) (a (c (fun (fun (cart (real) ?1) (bool)) (bool)) ?) (l (v (cart (real) ?1) w) (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (a (c (fun (cart (real) ?1) (fun (fun (cart (real) ?1) (bool)) (bool))) IN) (v (cart (real) ?1) w)) (v (fun (cart (real) ?1) (bool)) u))) (a (a (c (fun (cart (real) ?1) (fun (fun (cart (real) ?1) (bool)) (bool))) IN) (v (cart (real) ?1) w)) (a (c (fun (prod (cart (real) ?1) (real)) (fun (cart (real) ?1) (bool))) ball) (a (a (c (fun (cart (real) ?1) (fun (real) (prod (cart (real) ?1) (real)))) ,) (v (cart (real) ?1) b)) (a (c (fun (num) (real)) real_of_num) (a (c (fun (num) (num)) NUMERAL) (a (c (fun (num) (num)) BIT1) (c (num) _0)))))))))))))) (a (c (fun (fun ?0 (bool)) (bool)) !) 
(l (v ?0 x) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (a (c (fun ?0 (fun (fun ?0 (bool)) (bool))) IN) (v ?0 x)) (v (fun ?0 (bool)) d))) (a (c (fun (bool) (bool)) ) (a (a (c (fun (cart (real) ?1) (fun (fun (cart (real) ?1) (bool)) (bool))) IN) (a (v (fun ?0 (cart (real) ?1)) g) (v ?0 x))) (a (a (c (fun (fun (cart (real) ?1) (bool)) (fun (fun (cart (real) ?1) (bool)) (fun (cart (real) ?1) (bool)))) UNION) (v (fun (cart (real) ?1) (bool)) t)) (v (fun (cart (real) ?1) (bool)) u))))))))) (a (c (fun (bool) (bool)) ) (a (c (fun (fun (cart (real) ?1) (bool)) (bool)) ?) (l (v (cart (real) ?1) b) (a (a (c (fun (fun (cart (real) ?1) (bool)) (fun (fun (cart (real) ?1) (bool)) (bool))) SUBSET) (a (c (fun (prod (cart (real) ?1) (real)) (fun (cart (real) ?1) (bool))) ball) (a (a (c (fun (cart (real) ?1) (fun (real) (prod (cart (real) ?1) (real)))) ,) (v (cart (real) ?1) b)) (a (c (fun (num) (real)) real_of_num) (a (c (fun (num) (num)) NUMERAL) (a (c (fun (num) (num)) BIT1) (c (num) _0))))))) (a (a (c (fun (fun ?0 (cart (real) ?1)) (fun (fun ?0 (bool)) (fun (cart (real) ?1) (bool)))) IMAGE) (v (fun ?0 (cart (real) ?1)) g)) (v (fun ?0 (bool)) d))))))))))))

    Ground truth: <START> (fun (fun (fun (cart (real) ?1) (bool)) (bool)) (bool)) <END>

  • (<theorem> (a (c <PREDICT> !) (l (v (fun (cart (real) N) (bool)) s) (a (a (c (fun (bool) (fun (bool) (bool))) =) (a (c (fun (fun (cart (real) N) (bool)) (bool)) is_interval) (a (a (c (fun (fun (cart (real) N) (cart (real) N)) (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool)))) IMAGE) (c (fun (cart (real) N) (cart (real) N)) vector_neg)) (v (fun (cart (real) N) (bool)) s)))) (a (c (fun (fun (cart (real) N) (bool)) (bool)) is_interval) (v (fun (cart (real) N) (bool)) s))))))

    Ground truth: <START> (fun (fun (fun (cart (real) N) (bool)) (bool)) (bool)) <END>

  • (<theorem> (a (c (fun (fun (real) (bool)) (bool)) !) (l (v (real) x) (a (a (a (c (fun (fun (real) (real)) (fun (real) (fun (net (real)) (bool)))) has_real_derivative) (c (fun (real) (real)) atn)) (a (c (fun (real) (real)) real_inv) (a (a (c (fun (real) (fun (real) (real))) real_add) (a (c (fun (num) (real)) real_of_num) (a (c (fun (num) (num)) NUMERAL) (a (c (fun (num) (num)) BIT1) (c (num) _0))))) (a (a (c (fun (real) (fun (num) (real))) real_pow) (v <PREDICT> x)) (a (c (fun (num) (num)) NUMERAL) (a (c (fun (num) (num)) BIT0) (a (c (fun (num) (num)) BIT1) (c (num) _0)))))))) (a (c (fun (real) (net (real))) atreal) (v (real) x))))))

    Ground truth: <START> (real) <END>

  • (<theorem> (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (bool))) =) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (fun ?0 (bool)))) INTER) (v (fun ?0 (bool)) s)) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (fun ?0 (bool)))) UNION) (v (fun ?0 (bool)) t)) (v (fun ?0 (bool)) u)))) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (fun ?0 (bool)))) UNION) (a (a (c <PREDICT> INTER) (v (fun ?0 (bool)) s)) (v (fun ?0 (bool)) t))) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (fun ?0 (bool)))) INTER) (v (fun ?0 (bool)) s)) (v (fun ?0 (bool)) u)))))

    Ground truth: <START> (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (fun ?0 (bool)))) <END>

  • (<theorem> (a (a (c (fun (real) (fun (real) (bool))) =) (a (c (fun (cart (real) ?0) (real)) infnorm) (a (c (fun (num) (cart (real) ?0)) vec) (a (c (fun (num) (num)) NUMERAL) (c (num) _0))))) (a (c (fun (num) (real)) real_of_num) (a (c (fun (num) (num)) NUMERAL) (c <PREDICT> _0)))))

    Ground truth: <START> (num) <END>
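
To make the format of these examples concrete, the sketch below rebuilds the last prompt above from its unmasked theorem. It parses the s-expression, treats every `(c TYPE NAME)` and `(v TYPE NAME)` node as a typed leaf, and replaces one type with `<PREDICT>`. The function names and the `mask_others` flag are assumptions made for illustration; this is not the pipeline that generated the dataset.

```python
from typing import Iterator, List, Tuple, Union

SExpr = Union[str, List["SExpr"]]


def tokenize(text: str) -> List[str]:
    # Parentheses are the only delimiters; everything else is an atom.
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str]) -> SExpr:
    """Turn a token list into nested Python lists."""
    stack: List[list] = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) < 2:
                raise ValueError("unbalanced ')'")
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(tok)
    if len(stack) != 1 or len(stack[0]) != 1:
        raise ValueError("expected exactly one balanced expression")
    return stack[0][0]


def render(node: SExpr) -> str:
    if isinstance(node, str):
        return node
    return "(" + " ".join(render(child) for child in node) + ")"


def typed_leaves(node: SExpr, path: Tuple[int, ...] = ()) -> Iterator[Tuple[int, ...]]:
    """Yield the path of every (c TYPE NAME) or (v TYPE NAME) node, left to right."""
    if isinstance(node, list):
        if len(node) == 3 and node[0] in ("c", "v") and isinstance(node[2], str):
            yield path
        else:
            for i, child in enumerate(node):
                yield from typed_leaves(child, path + (i,))


def make_type_inference_task(theorem: str, target: int, mask_others: bool = False):
    """Replace the type of the target-th typed leaf with <PREDICT>.

    With mask_others=True (a hypothetical flag), all remaining types become <MASK>.
    """
    tree = parse(tokenize(theorem))
    truth = None
    for k, path in enumerate(list(typed_leaves(tree))):
        node = tree
        for i in path:
            node = node[i]
        if k == target:
            truth = render(node[1])
            node[1] = "<PREDICT>"
        elif mask_others:
            node[1] = "<MASK>"
    if truth is None:
        raise IndexError("target index out of range")
    return render(tree), "<START> " + truth + " <END>"


# Unmasked version of the last example above.
THEOREM = (
    "(<theorem> (a (a (c (fun (real) (fun (real) (bool))) =) "
    "(a (c (fun (cart (real) ?0) (real)) infnorm) "
    "(a (c (fun (num) (cart (real) ?0)) vec) "
    "(a (c (fun (num) (num)) NUMERAL) (c (num) _0))))) "
    "(a (c (fun (num) (real)) real_of_num) "
    "(a (c (fun (num) (num)) NUMERAL) (c (num) _0)))))"
)

prompt, truth = make_type_inference_task(THEOREM, target=7)
hard_prompt, hard_truth = make_type_inference_task(THEOREM, target=7, mask_others=True)
```

Setting `mask_others=True` yields prompts shaped like the Hard Type Inference examples that follow, where every type other than the target is hidden.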

Hard Type Inference.

  • (<theorem> (a (c <MASK> !) (l (v <MASK> s) (a (a (c <MASK> =) (a (c <MASK> INTERS) (v <MASK> s))) (a (a (c <PREDICT> DIFF) (c <MASK> UNIV)) (a (c <MASK> UNIONS) (a (c <MASK> GSPEC) (l (v <MASK> GEN%PVAR%0) (a (c <MASK> ?) (l (v <MASK> t) (a (a (a (c <MASK> SETSPEC) (v <MASK> GEN%PVAR%0)) (a (a (c <MASK> IN) (v <MASK> t)) (v <MASK> s))) (a (a (c <MASK> DIFF) (c <MASK> UNIV)) (v <MASK> t)))))))))))))

    Ground truth: <START> (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (fun ?0 (bool)))) <END>

  • (<theorem> (a (c <MASK> !) (l (v <MASK> f) (a (c <MASK> !) (l (v <MASK> s) (a (a (c <MASK> =) (a (a (c <MASK> uniformly_continuous_on) (v <MASK> f)) (v <MASK> s))) (a (c <MASK> !) (l (v <MASK> e) (a (a (c <MASK> ==>) (a (a (c <MASK> real_lt) (a (c <MASK> real_of_num) (a (c <MASK> NUMERAL) (c <MASK> _0)))) (v <MASK> e))) (a (c <MASK> ?) (l (v <MASK> d) (a (a (c <MASK> ) (a (a (c <MASK> real_lt) (a (c <MASK> real_of_num) (a (c <MASK> NUMERAL) (c <MASK> _0)))) (v <MASK> d))) (a (c <MASK> !) (l (v <MASK> t) (a (c <MASK> !) (l (v <MASK> t’) (a (a (c <MASK> ==>) (a (a (c <MASK> ) (a (a (c <MASK> SUBSET) (v <MASK> t)) (v <MASK> s))) (a (a (c <MASK> ) (a (a (c <MASK> SUBSET) (v <PREDICT> t’)) (v <MASK> s))) (a (a (c <MASK> ) (a (c <MASK> bounded) (v <MASK> t))) (a (a (c <MASK> ) (a (c <MASK> bounded) (v <MASK> t’))) (a (a (c <MASK> real_lt) (a (c <MASK> hausdist) (a (a (c <MASK> ,) (v <MASK> t’)) (v <MASK> t)))) (v <MASK> d))))))) (a (a (c <MASK> real_lt) (a (c <MASK> hausdist) (a (a (c <MASK> ,) (a (a (c <MASK> IMAGE) (v <MASK> f)) (v <MASK> t’))) (a (a (c <MASK> IMAGE) (v <MASK> f)) (v <MASK> t))))) (v <MASK> e)))))))))))))))))))

    Ground truth: <START> (fun (cart (real) M) (bool)) <END>

  • (<theorem> (a (a (c <MASK> ==>) (a (a (c <PREDICT> IN) (v <MASK> a)) (v <MASK> s))) (a (a (c <MASK> =) (a (a (c <MASK> DIFF) (a (a (c <MASK> INSERT) (v <MASK> a)) (a (a (c <MASK> DELETE) (v <MASK> t)) (v <MASK> b)))) (v <MASK> s))) (a (a (c <MASK> DELETE) (a (a (c <MASK> DIFF) (v <MASK> t)) (v <MASK> s))) (v <MASK> b)))))

    Ground truth: <START> (fun ?0 (fun (fun ?0 (bool)) (bool))) <END>

  • (<theorem> (a (c <MASK> !) (l (v <PREDICT> b) (a (c <MASK> convex) (a (c <MASK> GSPEC) (l (v <MASK> GEN%PVAR%0) (a (c <MASK> ?) (l (v <MASK> z) (a (a (a (c <MASK> SETSPEC) (v <MASK> GEN%PVAR%0)) (a (a (c <MASK> real_gt) (a (c <MASK> Im) (v <MASK> z))) (v <MASK> b))) (v <MASK> z))))))))))

    Ground truth: <START> (real) <END>

  • (<theorem> (a (c <MASK> !) (l (v <MASK> x) (a (a (c <MASK> ==>) (a (c <MASK> ) (a (a (c <MASK> nadd_eq) (v <MASK> x)) (a (c <MASK> nadd_of_num) (a (c <MASK> NUMERAL) (c <MASK> _0)))))) (a (c <MASK> ?) (l (v <MASK> B) (a (c <MASK> ?) (l (v <MASK> N) (a (c <MASK> !) (l (v <MASK> m) (a (c <MASK> !) (l (v <MASK> n) (a (a (c <MASK> ==>) (a (a (c <MASK> ) (a (a (c <MASK> <=) (v <MASK> N)) (v <MASK> m))) (a (a (c <MASK> <=) (v <MASK> N)) (v <MASK> n)))) (a (a (c <MASK> <=) (a (a (c <MASK> *) (a (a (c <MASK> *) (a (a (c <MASK> dest_nadd) (v <MASK> x)) (v <MASK> m))) (a (a (c <MASK> dest_nadd) (v <MASK> x)) (v <MASK> n)))) (a (c <MASK> dist) (a (a (c <MASK> ,) (a (a (c <MASK> *) (v <MASK> m)) (a (a (c <MASK> nadd_rinv) (v <MASK> x)) (v <PREDICT> n)))) (a (a (c <MASK> *) (v <MASK> n)) (a (a (c <MASK> nadd_rinv) (v <MASK> x)) (v <MASK> m))))))) (a (a (c <MASK> *) (v <MASK> B)) (a (a (c <MASK> *) (a (a (c <MASK> *) (v <MASK> m)) (v <MASK> n))) (a (a (c <MASK> +) (v <MASK> m)) (v <MASK> n))))))))))))))))))

    Ground truth: <START> (num) <END>

Assumptions.


  • Prompt: (<theorem> (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (a (c (fun (fun ?1 (bool)) (fun (fun ?1 (bool)) (bool))) =) (a (c (fun (fun ?1 (bool)) (fun ?1 (bool))) GSPEC) (l (v ?1 GEN%PVAR%0) (a (c (fun (fun ?1 (bool)) (bool)) ?) (l (v ?1 x) (a (a (a (c (fun ?1 (fun (bool) (fun ?1 (bool)))) SETSPEC) (v ?1 GEN%PVAR%0)) (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (a (c (fun ?1 (fun (fun ?1 (bool)) (bool))) IN) (v ?1 x)) (v (fun ?1 (bool)) s))) (a (a (c (fun ?0 (fun ?0 (bool))) =) (a (v (fun ?1 ?0) f) (v ?1 x))) (v ?0 a)))) (v ?1 x))))))) (v (fun ?1 (bool)) t))) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) <PREDICT>) (a (c (fun (fun ?1 (bool)) (bool)) !) (l (v ?1 x) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (v (fun ?1 (bool)) P) (v ?1 x))) (a (v (fun ?1 (bool)) Q) (v ?1 x)))) (a (c (fun (bool) (bool)) ) (a (a (c (fun ?0 (fun ?0 (bool))) =) (a (v (fun ?1 ?0) f) (v ?1 x))) (v ?0 a)))))))))

    Ground truth: <START> (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (c (fun (fun ?1 (bool)) (bool)) !) (l (v ?1 x) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (v (fun ?1 (bool)) P) (v ?1 x))) (a (a (c (fun ?1 (fun (fun ?1 (bool)) (bool))) IN) (v ?1 x)) (v (fun ?1 (bool)) s)))))) (a (c (fun (fun ?1 (bool)) (bool)) !) (l (v ?1 x) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (v (fun ?1 (bool)) P) (v ?1 x))) (a (v (fun ?1 (bool)) Q) (v ?1 x)))) (a (c (fun (bool) (bool)) ) (a (a (c (fun ?1 (fun (fun ?1 (bool)) (bool))) IN) (v ?1 x)) (v (fun ?1 (bool)) t))))))) <END>

    Source theorem pretty printed: {x | x IN s /\ f x = a} = t ==> (!x. P x ==> x IN s) /\ (!x. P x /\ Q x ==> ~(x IN t)) ==> (!x. P x /\ Q x ==> ~(f x = a))

  • Prompt: (<theorem> (a (c (fun (fun (fun (cart (real) N) (bool)) (bool)) (bool)) !) (l (v (fun (cart (real) N) (bool)) s) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) <PREDICT>) (a (a (c (fun (fun (cart (real) N) (bool)) (fun (fun (cart (real) N) (bool)) (bool))) =) (a (c (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool))) inside) (v (fun (cart (real) N) (bool)) s))) (c (fun (cart (real) N) (bool)) EMPTY))))))

    Ground truth: <START> (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (c (fun (fun (cart (real) N) (bool)) (bool)) connected) (a (a (c (fun (fun (cart (real) N) (bool)) (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool)))) DIFF) (c (fun (cart (real) N) (bool)) UNIV)) (v (fun (cart (real) N) (bool)) s)))) (a (c (fun (bool) (bool)) ) (a (c (fun (fun (cart (real) N) (bool)) (bool)) bounded) (a (a (c (fun (fun (cart (real) N) (bool)) (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool)))) DIFF) (c (fun (cart (real) N) (bool)) UNIV)) (v (fun (cart (real) N) (bool)) s))))) <END>

    Source theorem pretty printed: !s. connected ((:real^N) DIFF s) /\ ~bounded ((:real^N) DIFF s) ==> inside s = {}

  • Prompt: (<theorem> (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (a (c (fun (bool) (fun (bool) (bool))) ) (v (bool) q)) (a (c (fun (bool) (bool)) ) (v (bool) p)))) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) <PREDICT>) (v (bool) r))))

    Ground truth: <START> (a (a (c (fun (bool) (fun (bool) (bool))) =) (v (bool) p)) (v (bool) q)) <END>

    Source theorem pretty printed: q /\ ~p ==> (p <=> q) ==> r

  • Prompt: (<theorem> (a (c (fun (fun (fun (cart (real) N) (real)) (bool)) (bool)) !) (l (v (fun (cart (real) N) (real)) f) (a (c (fun (fun (fun (real) (real)) (bool)) (bool)) !) (l (v (fun (real) (real)) g) (a (c (fun (fun (cart (real) N) (bool)) (bool)) !) (l (v (cart (real) N) x) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) <PREDICT>) (a (a (c (fun (fun (cart (real) N) (real)) (fun (net (cart (real) N)) (bool))) real_continuous) (a (a (c (fun (fun (real) (real)) (fun (fun (cart (real) N) (real)) (fun (cart (real) N) (real)))) o) (v (fun (real) (real)) g)) (v (fun (cart (real) N) (real)) f))) (a (c (fun (cart (real) N) (net (cart (real) N))) at) (v (cart (real) N) x)))))))))))

    Ground truth: <START> (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (a (c (fun (fun (cart (real) N) (real)) (fun (net (cart (real) N)) (bool))) real_continuous) (v (fun (cart (real) N) (real)) f)) (a (c (fun (cart (real) N) (net (cart (real) N))) at) (v (cart (real) N) x)))) (a (a (c (fun (fun (real) (real)) (fun (net (real)) (bool))) real_continuous) (v (fun (real) (real)) g)) (a (a (c (fun (net (real)) (fun (fun (real) (bool)) (net (real)))) within) (a (c (fun (real) (net (real))) atreal) (a (v (fun (cart (real) N) (real)) f) (v (cart (real) N) x)))) (a (a (c (fun (fun (cart (real) N) (real)) (fun (fun (cart (real) N) (bool)) (fun (real) (bool)))) IMAGE) (v (fun (cart (real) N) (real)) f)) (c (fun (cart (real) N) (bool)) UNIV))))) <END>

    Source theorem pretty printed: !f g x. f real_continuous at x /\ g real_continuous atreal (f x) within IMAGE f (:real^N) ==> g o f real_continuous at x

  • Prompt: (<theorem> (a (c (fun (fun (fun (cart (real) M) (cart (real) N)) (bool)) (bool)) !) (l (v (fun (cart (real) M) (cart (real) N)) f) (a (c (fun (fun (fun (cart (real) M) (cart (real) P)) (bool)) (bool)) !) (l (v (fun (cart (real) M) (cart (real) P)) g) (a (c (fun (fun (fun (cart (real) M) (bool)) (bool)) (bool)) !) (l (v (fun (cart (real) M) (bool)) s) (a (c (fun (fun (num) (bool)) (bool)) !) (l (v (num) n) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) <PREDICT>) (a (a (a (c (fun (num) (fun (fun (cart (real) M) (bool)) (fun (fun (cart (real) M) (cart (real) (finite_sum N P))) (bool)))) baire) (v (num) n)) (v (fun (cart (real) M) (bool)) s)) (l (v (cart (real) M) x) (a (a (c (fun (cart (real) N) (fun (cart (real) P) (cart (real) (finite_sum N P)))) pastecart) (a (v (fun (cart (real) M) (cart (real) N)) f) (v (cart (real) M) x))) (a (v (fun (cart (real) M) (cart (real) P)) g) (v (cart (real) M) x)))))))))))))))

    Ground truth: <START> (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (a (a (c (fun (num) (fun (fun (cart (real) M) (bool)) (fun (fun (cart (real) M) (cart (real) N)) (bool)))) baire) (v (num) n)) (v (fun (cart (real) M) (bool)) s)) (v (fun (cart (real) M) (cart (real) N)) f))) (a (a (a (c (fun (num) (fun (fun (cart (real) M) (bool)) (fun (fun (cart (real) M) (cart (real) P)) (bool)))) baire) (v (num) n)) (v (fun (cart (real) M) (bool)) s)) (v (fun (cart (real) M) (cart (real) P)) g))) <END>

    Source theorem pretty printed: !f g s n. baire n s f /\ baire n s g ==> baire n s (lambda x. pastecart (f x) (g x))

Equalities.


  • Prompt: (<theorem> (a (c (fun (fun (fun ?0 (cart (real) (2))) (bool)) (bool)) !) (l (v (fun ?0 (cart (real) (2))) f) (a (c (fun (fun (fun ?0 (cart (real) (2))) (bool)) (bool)) !) (l (v (fun ?0 (cart (real) (2))) g) (a (c (fun (fun (fun ?0 (bool)) (bool)) (bool)) !) (l (v (fun ?0 (bool)) s) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (c (fun (fun ?0 (bool)) (bool)) FINITE) (v (fun ?0 (bool)) s))) (a (a (c (fun (cart (real) (2)) (fun (cart (real) (2)) (bool))) =) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (cart (real) (2))) (cart (real) (2)))) cproduct) (v (fun ?0 (bool)) s)) (l (v ?0 x) (a (a (c (fun (cart (real) (2)) (fun (cart (real) (2)) (cart (real) (2)))) complex_mul) (a (v (fun ?0 (cart (real) (2))) f) (v ?0 x))) (a (v (fun ?0 (cart (real) (2))) g) (v ?0 x)))))) <PREDICT>)))))))))

    Ground truth: <START> (a (a (c (fun (cart (real) (2)) (fun (cart (real) (2)) (cart (real) (2)))) complex_mul) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (cart (real) (2))) (cart (real) (2)))) cproduct) (v (fun ?0 (bool)) s)) (v (fun ?0 (cart (real) (2))) f))) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (cart (real) (2))) (cart (real) (2)))) cproduct) (v (fun ?0 (bool)) s)) (v (fun ?0 (cart (real) (2))) g))) <END>

    Source theorem pretty printed: !f g s. FINITE s ==> cproduct s (\x. f x * g x) = cproduct s f * cproduct s g

  • Prompt: (<theorem> (a (c (fun (fun (fun (cart (real) N) (bool)) (bool)) (bool)) !) (l (v (fun (cart (real) N) (bool)) s) (a (c (fun (fun (fun (cart (real) N) (bool)) (bool)) (bool)) !) (l (v (fun (cart (real) N) (bool)) t) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (c (fun (fun (cart (real) N) (bool)) (bool)) convex) (v (fun (cart (real) N) (bool)) s))) (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (c (fun (fun (cart (real) N) (bool)) (bool)) affine) (v (fun (cart (real) N) (bool)) t))) (a (c (fun (bool) (bool)) ) (a (a (c (fun (fun (cart (real) N) (bool)) (fun (fun (cart (real) N) (bool)) (bool))) =) (a (a (c (fun (fun (cart (real) N) (bool)) (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool)))) INTER) (a (c (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool))) relative_interior) (v (fun (cart (real) N) (bool)) s))) (v (fun (cart (real) N) (bool)) t))) (c (fun (cart (real) N) (bool)) EMPTY)))))) (a (a (c (fun (fun (cart (real) N) (bool)) (fun (fun (cart (real) N) (bool)) (bool))) =) <PREDICT>) (a (a (c (fun (fun (cart (real) N) (bool)) (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool)))) INTER) (a (c (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool))) closure) (v (fun (cart (real) N) (bool)) s))) (v (fun (cart (real) N) (bool)) t)))))))))

    Ground truth: <START> (a (c (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool))) closure) (a (a (c (fun (fun (cart (real) N) (bool)) (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool)))) INTER) (v (fun (cart (real) N) (bool)) s)) (v (fun (cart (real) N) (bool)) t))) <END>

    Source theorem pretty printed: !s t. convex s /\ affine t /\ ~(relative_interior s INTER t = {}) ==> closure (s INTER t) = closure s INTER t

  • Prompt: (<theorem> (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (bool))) SUBSET) (v (fun ?0 (bool)) t)) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (fun ?0 (bool)))) DIFF) (c (fun ?0 (bool)) UNIV)) (v (fun ?0 (bool)) s)))) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (bool))) =) <PREDICT>) (c (fun ?0 (bool)) EMPTY))))

    Ground truth: <START> (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (fun ?0 (bool)))) INTER) (v (fun ?0 (bool)) s)) (v (fun ?0 (bool)) t)) <END>

    Source theorem pretty printed: t SUBSET (:?0) DIFF s ==> s INTER t = {}

  • Prompt: (<theorem> (a (c (fun (fun (real) (bool)) (bool)) !) (l (v (real) x) (a (a (c (fun (real) (fun (real) (bool))) =) <PREDICT>) (a (c (fun (real) (real)) real_abs) (v (real) x))))))

    Ground truth: <START> (a (a (c (fun (real) (fun (num) (real))) real_pow) (a (c (fun (real) (real)) sqrt) (v (real) x))) (a (c (fun (num) (num)) NUMERAL) (a (c (fun (num) (num)) BIT0) (a (c (fun (num) (num)) BIT1) (c (num) _0))))) <END>

    Source theorem pretty printed: !x. sqrt x pow 2 = abs x

  • Prompt: (<theorem> (a (a (c (fun (fun A (bool)) (fun (fun A (bool)) (bool))) =) <PREDICT>) (a (c (fun (fun A (bool)) (fun A (bool))) GSPEC) (l (v A GEN%PVAR%0) (a (c (fun (fun A (bool)) (bool)) ?) (l (v A y) (a (a (a (c (fun A (fun (bool) (fun A (bool)))) SETSPEC) (v A GEN%PVAR%0)) (a (a (c (fun (bool) (fun (bool) (bool))) ) (a (a (c (fun A (fun (fun A (bool)) (bool))) IN) (v A y)) (v (fun A (bool)) s))) (a (a (c (fun A (fun A (bool))) =) (v A y)) (v A x)))) (v A y))))))))

    Ground truth: <START> (a (a (c (fun A (fun (fun A (bool)) (fun A (bool)))) INSERT) (v A x)) (v (fun A (bool)) s)) <END>

    Source theorem pretty printed: x INSERT s = {y | y IN s \/ y = x}
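
The last five examples replace one side of an equality with `<PREDICT>`. As a rough illustration of that shape, the sketch below locates applications of the constant `=` in a parsed theorem and masks the chosen side, reproducing the `sqrt x pow 2 = abs x` example above. The helper names and the `which` and `side` parameters are hypothetical; how the original data generation chose the equality and the side is not specified here.

```python
from typing import Iterator, List, Union

SExpr = Union[str, List["SExpr"]]


def tokenize(text: str) -> List[str]:
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str]) -> SExpr:
    """Turn a token list into nested Python lists."""
    stack: List[list] = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) < 2:
                raise ValueError("unbalanced ')'")
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(tok)
    if len(stack) != 1 or len(stack[0]) != 1:
        raise ValueError("expected exactly one balanced expression")
    return stack[0][0]


def render(node: SExpr) -> str:
    if isinstance(node, str):
        return node
    return "(" + " ".join(render(child) for child in node) + ")"


def is_equality(node: SExpr) -> bool:
    # Matches the curried application (a (a (c TYPE =) LHS) RHS).
    return (
        isinstance(node, list) and len(node) == 3 and node[0] == "a"
        and isinstance(node[1], list) and len(node[1]) == 3 and node[1][0] == "a"
        and isinstance(node[1][1], list) and len(node[1][1]) == 3
        and node[1][1][0] == "c" and node[1][1][2] == "="
    )


def equalities(node: SExpr) -> Iterator[list]:
    """Yield every equality node in pre-order."""
    if isinstance(node, list):
        if is_equality(node):
            yield node
        for child in node:
            yield from equalities(child)


def make_equality_task(theorem: str, which: int = 0, side: str = "lhs"):
    """Mask one side of the which-th equality; returns (prompt, ground truth)."""
    tree = parse(tokenize(theorem))
    eq = list(equalities(tree))[which]
    if side == "lhs":
        truth, eq[1][2] = render(eq[1][2]), "<PREDICT>"
    elif side == "rhs":
        truth, eq[2] = render(eq[2]), "<PREDICT>"
    else:
        raise ValueError("side must be 'lhs' or 'rhs'")
    return render(tree), "<START> " + truth + " <END>"


# The theorem !x. sqrt x pow 2 = abs x, assembled from the example above.
PREFIX = ("(<theorem> (a (c (fun (fun (real) (bool)) (bool)) !) (l (v (real) x) "
          "(a (a (c (fun (real) (fun (real) (bool))) =) ")
LHS = ("(a (a (c (fun (real) (fun (num) (real))) real_pow) "
       "(a (c (fun (real) (real)) sqrt) (v (real) x))) "
       "(a (c (fun (num) (num)) NUMERAL) (a (c (fun (num) (num)) BIT0) "
       "(a (c (fun (num) (num)) BIT1) (c (num) _0)))))")
SUFFIX = ") (a (c (fun (real) (real)) real_abs) (v (real) x))))))"

prompt, truth = make_equality_task(PREFIX + LHS + SUFFIX, which=0, side="lhs")
```

The same pattern match with `==>` in place of `=` and the left argument masked would give prompts of the form used in the missing-assumption examples.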