The formalization of existing mathematical proofs is a notoriously difficult process. Despite decades of research on automation and proof assistants, writing formal proofs remains arduous and accessible only to a few experts. While previous studies on automating formalization focused on powerful search algorithms, no attempts were made to take advantage of available informal proofs. In this work, we introduce Draft, Sketch, and Prove (DSP), a method that maps informal proofs to formal proof sketches and uses the sketches to guide an automated prover by directing its search to easier sub-problems. We investigate two relevant setups where informal proofs are either written by humans or generated by a language model. Our experiments and ablation studies show that large language models are able to produce well-structured formal sketches that follow the same reasoning steps as the informal proofs. Guiding an automated prover with these sketches enhances its performance from 20.9% to 39.3% on a collection of mathematical competition problems.
Formal proof automation is a challenging task that has been the focus of increased attention in recent years (bansal2019holist; polu2020generative; lample2022hypertree; jiang2022thor; wu2022autoformalization). However, deep learning approaches have not been as successful as in other domains, mainly because of the scarcity of formal data. Indeed, formalizing proofs is notoriously difficult and accessible only to a handful of experts, which makes large annotation endeavors unrealistic
(wiedijk2008formal). The largest formal proof corpus is written in Isabelle (paulson1994isabelle) and amounts to less than a gigabyte in size, orders of magnitude smaller than datasets commonly used in vision (deng2009imagenet) or natural language processing
(gpt3). To address the scarcity of formal proofs, previous studies have proposed to use synthetic data (wu2021int), self-supervision (polu2020generative; pact), or reinforcement learning (DBLP:journals/corr/abs-1905-10501; polu2022formal) to synthesize additional formal training data. Although these methods alleviate the data insufficiency to some degree, none are able to capitalize on the bulk of human-written mathematical proofs.

Unlike formal mathematics, informal mathematical data is abundant and widely available. Recently, large language models trained on informal mathematical data have showcased impressive quantitative reasoning abilities (lewkowycz2022solving; welleck2022naturalprover). However, they often generate erroneous proofs, and it is challenging to detect the faulty reasoning in these proofs automatically. Our work devises a novel approach, called Draft, Sketch, and Prove (DSP), to translate informal mathematical proofs into formal ones, and thus enjoys both the logical rigor provided by formal systems and the wealth of informal data. We give a schematic diagram of the DSP method in Figure 1 and describe it in Section 3. Recent work (wu2022autoformalization) demonstrates the feasibility of automatically translating informal statements into formal ones with large language models. DSP goes further and leverages large language models to generate formal proof sketches (wiedijk2003formal) from informal proofs. Proof sketches consist of high-level reasoning steps that can be interpreted by formal systems such as interactive theorem provers. They differ from complete formal proofs in that they contain sequences of intermediate conjectures without justification. An example of an informal proof with its corresponding formal proof sketch is provided in Figure 2. In the last step of DSP, we elaborate the formal proof sketch into a full formal proof using an automated prover to prove all intermediate conjectures.
We perform experiments to generate formal proofs of problems from the miniF2F dataset (zheng2021minif2f) and show that a large portion of theorems can be proved automatically with this method. We investigate two settings where the informal proofs are either written by humans or drafted by a large language model trained on mathematical text. These two settings correspond to situations frequently occurring during the formalization of existing theories, where informal proofs are usually available, but sometimes left as exercises to the reader or missing due to space limits in the margin.
We introduce a novel approach to leverage informal proofs to guide automated provers with formal proof sketches.
To evaluate our approach, we build a dataset of manually curated informal statements and informal proofs aligned with formal statements in the miniF2F dataset (zheng2021minif2f).
We increase the proportion of problems solved by an automated prover on miniF2F from 20.9% to 38.9% given language-model-generated informal proofs, and up to 39.3% when the informal proofs are written by humans.
Through three ablation studies, we demonstrate the performance benefit of drafting informal proofs, annotating sketches with informal segments, and using automated provers to close open conjectures for the autoformalization of proofs.
Modern verification systems for mathematics are centered around interactive theorem provers (ITPs), such as Isabelle (paulson1994isabelle), Lean (moura2015lean), Coq (barras1997coq), or Metamath (metamath). ITPs embed the mathematical definitions and theorems onto a solid logical foundation (e.g., Higher-Order Logic, Dependent Type Theory) implemented by their kernels. Every theorem must be checked by the kernel to be recognized by the ITP. To be proved formally, a theorem is first stated in the ITP’s programming language, and iteratively simplified into simpler objectives (or subgoals), until it can be reduced to already proven facts. In this paper, we will refer to proofs verified by a formal theorem prover as formal proofs, and proofs written in “standard” mathematics (e.g. in LaTeX) as informal proofs.
Several approaches propose to combine machine learning with modern interactive theorem provers (yang2019coqgym; gauthier2021tactictoe), building upon the recent success of language models (polu2020generative; pact; polu2022formal; jiang2022thor; lample2022hypertree). These methods typically rely on sequence-to-sequence models (sutskever2014sequence) to generate the next step of a proof given the current proof state, and perform search over the generated subgoals using powerful methods such as MCTS (silver2018general). Because search is computationally expensive, these language models are relatively small (with fewer than 1 billion parameters). Our method contrasts with these approaches in that we use a significantly reduced number of calls to the models, but also much larger language models (with up to 540 billion parameters) that showcase outstanding few-shot learning abilities (gpt3).

Language models have also been used in the context of purely informal mathematics (Lample2020Deep; hendrycksmath2021; welleck2021naturalproofs; drori2022neural; welleck2022naturalprover). Nevertheless, lewkowycz2022solving note that for quantitative question answering, models are prone to generate false positives: the model guesses the right answer while providing an incorrect proof. These errors are hard to spot without human inspection, and, worryingly, the frequency of false positives increases with the difficulty of the problem. Our method builds on these findings and translates informal proofs into formal proofs. Since ITPs are logically grounded, once a formal proof is checked by them, we are guaranteed its correctness.
In a position paper, szegedy2020promising argued for attaining formal mathematical data from informal sources with neural networks.
wang2020exploration performed preliminary experiments where the evaluation was limited to text-level similarities on synthetic datasets. Recently, wu2022autoformalization found that large language models (chen2021EvaluatingLL; chowdhery2022palm) are capable of few-shot statement autoformalization: a small number of examples are enough for them to learn to perform informal-to-formal translation of statements. In this paper, we investigate whether these findings generalize to proof autoformalization, i.e., whether large language models can be used to translate informal proofs into formal ones.

In this section, we describe our Draft, Sketch, and Prove (DSP) method for formal proof automation, which leverages informal proofs to guide automated formal theorem provers with proof sketches. We assume that each problem comes with an informal statement and a formal statement describing the problem. Our pipeline consists of three stages (depicted in Figure 1), which we present below.
The initial phase of the DSP method consists in finding informal proofs for a problem according to its description in natural mathematical language (possibly with LaTeX). The resulting informal proof is seen as a draft for the subsequent phases. In mathematical textbooks, proofs of theorems are in general provided, but are sometimes missing or incomplete. Therefore, we consider two settings corresponding to the presence or absence of the informal proofs. In the first, we assume that a “ground-truth” informal proof (i.e., one written by a human) is available, which is the typical scenario in the practice of formalizing existing mathematical theories. In the second setting, we make a more general assumption that the ground-truth informal proof is not given, and draft proof candidates with a large language model trained on informal mathematical data. The language model removes the dependence on human proofs and can produce multiple alternative solutions for every problem. Although there is no easy way to automatically verify the correctness of these proofs, the informal proof only needs to be useful for producing a good formal proof sketch in the next stage.
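The drafting stage can be pictured as a simple sampling loop. The sketch below is illustrative rather than the paper's implementation: `sample_fn` stands in for a call to a language model trained on informal mathematics (e.g. Minerva), and all names are hypothetical.

```python
from typing import Callable, List

def draft_informal_proofs(
    informal_statement: str,
    sample_fn: Callable[[str], str],
    n_drafts: int = 4,
) -> List[str]:
    """Collect several candidate informal proofs for one problem.

    Each call to `sample_fn` may return a different draft; none of
    them is verified at this stage, since a draft only needs to be
    useful for producing a good formal proof sketch later.
    """
    prompt = f"Problem: {informal_statement}\nProof:"
    return [sample_fn(prompt) for _ in range(n_drafts)]

# A trivial stand-in model, for illustration only.
drafts = draft_informal_proofs(
    "Show that n^2 >= n for every natural number n.",
    sample_fn=lambda prompt: "Case split on n = 0 and n >= 1.",
    n_drafts=3,
)
```

In the human-as-informal-proof-writer setting, this loop degenerates to returning the single ground-truth proof.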
A formal proof sketch encodes the structure of a solution and leaves out low-level details (wiedijk2003formal). Intuitively, it is a partial proof that outlines high-level conjecture statements. A concrete example of a proof sketch is shown in Figure 2. Although informal proofs often leave aside low-level details (e.g., by deeming them trivial), these details cannot be left unjustified in a formal proof, making straightforward informal-to-formal proof translation difficult. Instead, we propose to map informal proofs to formal proof sketches that share the same high-level structure. The low-level details missing from a proof sketch can later be filled in by an automated prover. Since large informal-formal parallel corpora do not exist, standard machine translation methods are unsuitable for this task. Rather, we use the few-shot learning abilities of a large language model: we prompt the model with a few example pairs containing informal proofs and their corresponding formal sketches, followed by an informal proof yet to be translated, and let the model generate the subsequent tokens to obtain the desired formal sketch. We refer to this model as an autoformalizer.
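For concreteness, the following snippet (an illustration written for this description, not taken from the paper's materials) shows what such a sketch can look like, with Isabelle's `sorry` placeholder standing in for the open conjectures an automated prover must later close:

```python
# A formal proof sketch in the spirit of Figure 2: intermediate
# conjectures are stated, but their justifications are left as gaps.
# Using `sorry` as the gap marker is an illustrative choice.
SKETCH = """
theorem example: "(n::nat) > 0 ==> n * n >= n"
proof -
  assume h: "n > 0"
  (* hence n is at least 1 *)
  have c1: "n >= 1" sorry
  (* multiplying both sides by n preserves the inequality *)
  then have c2: "n * n >= n * 1" sorry
  then show ?thesis sorry
qed
"""

def open_conjectures(sketch: str) -> int:
    """Number of gaps an automated prover still has to close."""
    return sketch.count("sorry")
```

Here `open_conjectures(SKETCH)` counts three gaps, one per unjustified step.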
As the last part of the process, we execute off-the-shelf automated provers to fill in the missing details in proof sketches, where “automated provers” refers to systems capable of producing formally verifiable proofs. Our framework is agnostic to the specific choice of the automated prover: it can be symbolic provers such as heuristic proof automation tools, neural-network-based provers, or hybrid approaches. If the automated prover successfully closes all the gaps in the proof sketch, it returns the final formal proof which can be checked against the problem’s specification. If the automated prover fails (e.g., it exceeds the allocated time limit), we consider the evaluation to be unsuccessful.
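The three stages can be summarized as a short loop. The callables below are hypothetical stand-ins for the paper's components: a drafting language model, a few-shot autoformalizer, and an automated prover such as Sledgehammer with heuristics.

```python
from typing import Callable, List, Optional

def dsp_pipeline(
    informal_statement: str,
    formal_statement: str,
    draft: Callable[[str], List[str]],           # informal proof drafts
    sketch: Callable[[str, str, str], str],      # few-shot autoformalizer
    prove_gaps: Callable[[str], Optional[str]],  # automated prover
) -> Optional[str]:
    """One hedged view of the Draft, Sketch, and Prove loop.

    For each informal draft, generate a formal proof sketch, then ask
    the automated prover to close every open conjecture. `prove_gaps`
    returns the completed proof, or None if any gap stays open.
    """
    for informal_proof in draft(informal_statement):
        proof_sketch = sketch(informal_statement, informal_proof, formal_statement)
        full_proof = prove_gaps(proof_sketch)
        if full_proof is not None:
            return full_proof  # to be checked by Isabelle downstream
    return None                # evaluation unsuccessful
```

In the real system the returned proof is still verified end-to-end by the proof checker before being counted as a success.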
We evaluate our method on the miniF2F dataset (zheng2021minif2f). The dataset contains the formal statements of problems from high-school mathematical competitions, written in three formal languages: Lean, HOL-Light, and Isabelle. They are split into a valid set and a test set, composed of 244 problems each. In this work, we choose to experiment with Isabelle for three reasons: (1) Isabelle's proof corpus is one of the largest among interactive theorem provers, conducive to the language models' mastery of its syntax; (2) Isabelle supports the declarative proof style (detailed discussion in Appendix A), enabling the formal proof sketches (wiedijk2003formal) that are central to our method; (3) although automated proving tools are available in other interactive theorem provers, none are as developed and effective as Sledgehammer (paulson2010three) in Isabelle for proving conjectures.
The miniF2F dataset comprises problems from three source categories: (1) problems sampled from the MATH dataset (hendrycksmath2021); (2) problems from actual high-school mathematical competitions (AMC, AIME, and IMO); (3) crafted problems at the same difficulty level as (2). We employ three methods to obtain informal statements and proofs for these sources. For source (1), we take the informal statements and proofs from the MATH dataset; for (2), we retrieve informal statements and proofs from the AOPS website (https://artofproblemsolving.com/community); and for (3), we manually write down informal statements and proofs. We thus gather a parallel set of informal statements, informal proofs, and formal statements. This dataset provides the informal statements and proofs for our experiments in the human-as-informal-proof-writer setting and will be made available shortly.
Our task is to generate formal proofs for problems as they are formally stated in miniF2F. We consider a proof valid if and only if it (a) does not contain “cheating” keywords (sorry and oops) that exit a proof without completing it, and (b) Isabelle is able to verify the corresponding formal statement with the proof. We use the Portal-to-ISAbelle API by jiang2021language to interact with Isabelle.
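A minimal sketch of this acceptance criterion, with `verified_by_isabelle` standing in for an actual call to the proof checker (the paper uses the Portal-to-ISAbelle API):

```python
CHEATING_KEYWORDS = ("sorry", "oops")  # exit a proof without completing it

def is_valid_proof(proof_text: str, verified_by_isabelle: bool) -> bool:
    """A proof counts only if it is keyword-clean AND machine-checked.

    This function merely encodes the two acceptance conditions (a) and
    (b); the boolean flag abstracts the external verification step.
    """
    clean = not any(kw in proof_text for kw in CHEATING_KEYWORDS)
    return clean and verified_by_isabelle
```

Both conditions are necessary: a keyword-clean proof that Isabelle rejects is as invalid as one that cheats its way to the goal.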
As a baseline, we attempt to prove the formal statement directly with Sledgehammer, a popular proof automation tool in Isabelle. We use the default Sledgehammer configuration in Isabelle2021, including a 120-second timeout and the five automated theorem provers (Z3, CVC4, SPASS, Vampire, E). Appendix B gives a more thorough introduction to Sledgehammer.
Occasionally, Sledgehammer fails without trying simple yet effective tactics. As a second, stronger baseline, we create an automated prover that tries common tactics (auto, simp, blast, fastforce, force, eval, presburger, sos, arith, linarith, auto simp: field_simps) for high-school level algebra and number theory problems. If every attempted tactic fails or times out, the prover falls back to Sledgehammer.
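This fallback strategy can be sketched as follows; `try_tactic` is a hypothetical stand-in for a time-limited call to Isabelle, and the function name is illustrative:

```python
from typing import Callable, Optional

# Common Isabelle tactics tried before falling back to Sledgehammer,
# in the order listed in the text.
HEURISTIC_TACTICS = [
    "auto", "simp", "blast", "fastforce", "force", "eval",
    "presburger", "sos", "arith", "linarith", "auto simp: field_simps",
]

def close_gap(try_tactic: Callable[[str], bool]) -> Optional[str]:
    """Return the first tactic invocation that closes the goal.

    `try_tactic` abstracts an attempt to apply one tactic to the
    current goal (the real system also times out each attempt).
    """
    for tactic in HEURISTIC_TACTICS:
        if try_tactic(tactic):
            return f"by {tactic}"
    return "sledgehammer"  # fallback when every heuristic tactic fails
```

Cheap tactics come first, so the expensive Sledgehammer call is only made when all of them fail.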
Finally, we include baselines representative of state-of-the-art neural theorem proving in Isabelle, specifically Thor (jiang2022thor) and Thor with expert iteration on autoformalized data (wu2022autoformalization). GPT-f with expert iteration (polu2022formal) and HyperTree Proof Search (HTPS) (lample2022hypertree) can solve 36.6% and 41.0% of the problems on miniF2F-test, respectively. However, they rely on the Lean theorem prover instead of Isabelle, which greatly influences performance due to the different tactics and automation, so they are not directly comparable to our method.
When informal proofs are model-generated, we condition a large language model on informal statements to sample a set of informal proof drafts per problem. Specifically, we use the Codex code-davinci-002 model (chen2021EvaluatingLL) through the OpenAI API, and the 8B, 62B, and 540B versions of the Minerva model from lewkowycz2022solving. We use greedy decoding for Codex, and nucleus sampling (holtzman2019curious) with temperature 0.6 and top-p 0.95 for the Minerva models.
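As a reminder of how nucleus sampling works, here is a minimal sketch of top-p decoding over a single token distribution (an illustration of the scheme of holtzman2019curious, not Minerva's actual decoder):

```python
import random

def nucleus_sample(probs, p=0.95, rng=random.random):
    """Minimal nucleus (top-p) sampling over one token distribution.

    Keep the smallest set of highest-probability tokens whose
    cumulative mass reaches `p`, renormalise implicitly, and sample
    a token index from that truncated set.
    """
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    r = rng() * total
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]
```

With a small `p`, only the most likely token survives and decoding becomes effectively greedy; larger `p` admits more diverse drafts.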
For sketching, we manually prepare a pool of high-quality autoformalization demonstrations, each of the format (informal statement, informal proof, formal statement, formal sketch). The examples are split between the algebra type and the number theory type. All examples are from the validation set of the miniF2F dataset and can be found in the supplementary materials. The sketches contain in-line comments, as in Figure 2. If the name of a problem gives away its type (algebra or number theory), we only use examples of the corresponding type. We also ensure that the sampled few-shot examples do not contain the problem being solved. The prompt is composed of examples sampled uniformly at random from the pool, followed by the current problem's (informal statement, informal proof, formal statement). We use this prompt to query the same Codex model to obtain the desired proof sketches, with greedy decoding and a cap on the number of generated tokens. For all experiments, unless stated otherwise, we limit the total number of queries made to Codex per problem to 100: 100 queries per human informal solution, and one query per language-model-generated solution.
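The prompt construction can be sketched as follows; the field names and textual format are illustrative, not the exact prompt layout used in the experiments:

```python
import random

def build_prompt(pool, problem, n_examples=3, seed=None):
    """Assemble a few-shot autoformalization prompt.

    `pool` holds dicts with keys informal_statement, informal_proof,
    formal_statement, and formal_sketch; `problem` carries the first
    three. Type filtering (algebra vs. number theory) and excluding
    the problem being solved are assumed to happen upstream.
    """
    rng = random.Random(seed)
    examples = rng.sample(pool, n_examples)
    parts = []
    for ex in examples:
        parts.append(
            f"Informal statement: {ex['informal_statement']}\n"
            f"Informal proof: {ex['informal_proof']}\n"
            f"Formal statement: {ex['formal_statement']}\n"
            f"Formal sketch: {ex['formal_sketch']}\n"
        )
    # The prompt ends mid-record so the model completes the sketch.
    parts.append(
        f"Informal statement: {problem['informal_statement']}\n"
        f"Informal proof: {problem['informal_proof']}\n"
        f"Formal statement: {problem['formal_statement']}\n"
        f"Formal sketch:"
    )
    return "\n".join(parts)
```

Ending the prompt right after "Formal sketch:" is what turns the language model's continuation into the desired sketch.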
To prove the conjectures left open by the formal sketch, we use the Sledgehammer + heuristics automated prover described in Subsection 4.2. We execute the automated prover on every open conjecture in the sketch to synthesize a formal proof that can be verified by Isabelle.
| Success rate | miniF2F-valid | miniF2F-test |
|---|---|---|
| **Baselines** | | |
| Sledgehammer | 9.9% | 10.4% |
| Sledgehammer + heuristics | 18.0% | 20.9% |
| Thor (jiang2022thor) | 28.3% | 29.9% |
| Thor + expert iteration (wu2022autoformalization) | 37.3% | 35.2% |
| **Draft, Sketch, and Prove** | | |
| Human informal proof | 42.6% | 39.3% |
| Codex informal proof | 40.6% | 35.3% |
| 8B Minerva informal proof | 40.6% | 35.3% |
| 62B Minerva informal proof | 43.9% | 37.7% |
| 540B Minerva informal proof | 42.6% | 38.9% |
| **Ablations (with human informal statements and proofs)** | | |
| – In-line comments | | |
| – Informal proofs | | |
| – Formal proof sketches | | |
In Table 1, we display the proportion of successful formal proofs found on the miniF2F dataset. The results include the four baselines described in Subsection 4.2 and the DSP method with human-written and model-generated proofs. From the table, we can see that the automated prover with additional heuristic tactics significantly improves on Sledgehammer, boosting its success rate from 9.9% to 18.0% on the validation set of miniF2F and from 10.4% to 20.9% on the test set. The two baselines using language models and proof search (Thor and Thor + expert iteration) achieve success rates of 29.9% and 35.2% on the test set of miniF2F, respectively.
With informal proofs written by humans, the DSP method achieves success rates of 42.6% and 39.3% on the validation and test sets of miniF2F. A total of 200 out of 488 problems can be proved in this way. The Codex model and the Minerva (8B) model give very similar results in solving problems on miniF2F: they both guide the automated prover to solve 40.6% and 35.3% of problems on the validation and test sets respectively. This is corroborated by lewkowycz2022solving's observation that these two models have comparable performance in solving mathematical problems.
When we switch to the Minerva (62B) model, the success rates rise to 43.9% and 37.7% respectively. Compared to human-written informal proofs, its success rates are higher on the validation set and lower on the test set. In total, the Minerva (62B) model is able to solve 199 problems on miniF2F, one fewer than with human proofs. The Minerva (540B) model solves 42.6% and 38.9% of problems in the validation and test sets of miniF2F, also resulting in 199 successful proofs. The DSP method is effective in guiding the automated prover in both settings: with human informal proofs or with language-model-generated ones. DSP almost doubles the prover's success rate and sets a new state of the art on miniF2F with Isabelle. Moreover, the larger Minerva models are almost as helpful as a human in guiding the automated formal prover.
To facilitate the alignment between the informal proofs and the formal proof sketches, we copy relevant segments of the informal proofs as in-line comments in the sketches. In the manually constructed prompt examples, these comments are prefixed to the corresponding Isabelle code blocks, as shown in Figure 2 (the text in red). We hypothesize that this technique helps large language models synthesize formal sketches. To validate this hypothesis, we perform an ablation study in which the in-line comments are removed from the prompt examples before running the experiment. The results are displayed in Table 1. We find that without in-line comments, the success rates drop on both the validation and test sets. We conclude that in-line comments are helpful for generating formal proof sketches.
Drafting informal proofs is the first step of the DSP method. To investigate its necessity, we experiment with sketching and proving without informal proofs at all. Because formal proof sketches are written in the declarative proof style, they are already fairly similar to informal proof drafts. Concretely, we remove the informal proofs and the in-line comments (which are copied segments of the informal proofs) from the prompt examples. This removes the need for an informal proof writer, whether a human or a neural network. The results of this setup are shown in Table 1. The success rates on the validation and test sets of miniF2F drop substantially compared to the setting with human-written proofs, and are also inferior to those obtained with language-model-generated informal proofs. This demonstrates the importance of drafting informal proofs before sketching and proving.
Using an autoformalizer to generate proof sketches that are then completed by an automated prover is central to our method. The effect of utilizing an automated prover to close open conjectures in proof sketches is worth studying, so we conduct an ablation experiment for it. Namely, we replace the proof sketches in the prompt examples with complete formal proofs. The complete formal proofs still follow the declarative proof style but contain no open conjectures. As a result, the large language model also generates full proofs instead of sketches, and we directly check whether these generated proofs are valid. The results of this setup are presented in Table 1. Without an automated prover to close open conjectures, the success rate on miniF2F decreases drastically on both the validation and test sets. This performance difference indicates the essential role of automated provers in our approach.
To understand the effect of the ablations on the DSP method's scaling properties, we vary the number of autoformalization attempts per problem and plot the number of successful proofs found on the miniF2F dataset in Figure 3 (left). Four methods are contrasted: the original DSP method with human informal proofs, and the DSP method without in-line comments, without informal proofs, and without formal proof sketches. With the original DSP method, the performance reaches a plateau (no new proofs are found) only after a large number of autoformalization attempts are made for each problem. The ablation with no in-line comments reaches its plateau much faster and solves fewer problems in total. The ablation without informal proofs also plateaus early, again solving fewer problems in total, and the ablation without sketching solves the fewest problems on miniF2F. In comparison, with human informal proofs, far fewer autoformalization attempts are required to reach this performance.
Statement: Prove that the fraction $\frac{21n+4}{14n+3}$ is irreducible for every natural number $n$.
Informal Proof (Minerva 62B):
We must show that $\gcd(21n+4, 14n+3) = 1$.
The Euclidean algorithm gives
$$\gcd(21n+4, 14n+3) = \gcd(14n+3, 7n+1) = \gcd(7n+1, 1) = 1.$$
Since $\gcd(21n+4, 14n+3) = 1$, we have that $\frac{21n+4}{14n+3}$ is irreducible.
Formal Proof:
Our experiments demonstrated that model-generated informal proofs from Minerva and Codex can help guide a formal theorem prover. In this section, we analyze the properties of these proofs further, focusing on the informal proofs produced by the 62B and 540B Minerva models, as they give the best overall performance and achieve the highest success rates on miniF2F.
Interestingly, our approach manages to solve one problem from the International Mathematical Olympiad (imo_1959_1) with a Minerva-generated solution, but not with the human proof. For this problem, we present the successful Minerva-generated informal proof draft and the formal proof in Figure 4. We hypothesize that the reason behind this phenomenon is that human proofs might leave gaps between conjectures that are too difficult for automated provers to solve. On the other hand, the diversity in language model informal proofs makes some of them more amenable to automated provers. In Appendix C, we analyze the human and the Minerva informal proofs for this problem in greater detail.
Next, we analyze the relationship between the validity of the formal proofs and the correctness of the informal proofs. For our analysis, we randomly sample Minerva proofs of distinct problems that were successfully converted to formal proofs, and manually evaluate the correctness of these informal proofs. Among them, some proofs are entirely correct, some are incorrect with a clearly identifiable faulty step, and a few "proofs" are nonsensical and simply rephrase the final conclusions of the problems.
Seeing that incorrect informal proofs can nevertheless lead to successful formal proofs, we study how they guide the automated formal prover despite their flaws. These proofs divide into two cases. In the first, the informal proofs are mostly ignored, and the automated prover finds proofs by itself; in the second, although the informal proofs are wrong, the autoformalizer manages to correct them, either by ignoring the erroneous steps or by stating their correct versions in the formal proof sketches. This suggests that the autoformalizer has some understanding of the mathematical statements and is not merely translating them from an informal language to a formal one: it is robust to slight noise in its input. In Appendix D, we present case studies comparing the human and Minerva informal proofs, including a completely correct example and one example of each pathological case.
Is there a way to detect which Minerva proofs are correct without human evaluation? As a preliminary investigation, we filter out all the problems that can be solved directly with the automated prover and inspect the remaining informal proofs. Of these, most are completely correct, a few still contain small errors, but none are nonsensical. With this simple filter, we can identify correct Minerva informal proofs with substantially higher precision, at some cost in recall.
To understand the influence of different informal proof sources on the scaling properties of DSP, we plot the number of successful proofs found on miniF2F against the number of autoformalization attempts per problem in Figure 3 (right). Note that for each problem, we have one informal proof by a human and 100 informal proof drafts by each language model. The one human proof is used 100 times for formal proof sketch generation, while each language-model draft is used only once. We notice that the 62B and 540B Minerva models always have comparable performance. Considering that the 540B Minerva model is more capable of mathematical reasoning (lewkowycz2022solving, Table 3) than the 62B model, we hypothesize that the bottleneck in the DSP process shifts from drafting to sketching and proving; that is, informal proof drafts of higher quality do not necessarily lead to more successful formal proofs, due to the limitations of sketching and proving. Both the 62B and the 540B models result in more successful proofs than the smaller (8B) Minerva model and the Codex model, consistently for any number of attempts. The 8B Minerva model and the Codex model behave similarly, ultimately finding the same number of proofs. Informal proofs written by humans help solve more problems than those by Minerva models for small numbers of autoformalization attempts; however, the difference shrinks to a single problem when all 100 attempts are made.
Noticing that the number of successful proofs does not plateau for the Minerva-generated proofs, we investigate how further increasing the number of autoformalization attempts changes the number of problems solved for human-written and language-model-generated proofs. For each problem, we take the one human informal proof and sample many sketches for it; for the language model, we take the same informal proof drafts by Minerva (540B) and sample several sketches for each draft, so that the total number of sketches per problem is identical in both settings. We plot the number of problems solved with respect to the number of sketches in Figure 5 (left). We find that after the same large number of autoformalization attempts, more theorems (counting the valid and test splits together) have successful formal proofs with language-model-generated informal proofs than with human-written ones. This suggests that, with enough autoformalization attempts, the diversity in language-model-generated informal proofs can benefit the automated formalization process more than the "ground-truth" human informal proofs.
In Section 4, language models generate 100 informal proof drafts for each mathematical problem, and the autoformalizer is used once on each draft. It is likely that some drafts have the potential to be formalized correctly, but fail to produce a successful sketch because the randomly sampled examples in the prompt are not suitable. We would like to reduce this variance by attempting autoformalization multiple times per draft, but doing so is expensive. We therefore conduct an experiment to investigate the optimal way of allocating drafts and sketches per draft. For the Minerva (540B) model, we vary the number of informal proof drafts and the number of formal proof sketches per draft, under the constraint that the total number of sketches per problem is at most 100. We present the number of miniF2F problems solved under every combination in Figure 5 (right). The plot shows that when the total number of autoformalization attempts is fixed, increasing the number of drafts per problem yields the most successes on miniF2F.

This work utilizes two language models that have been trained on a large amount of internet data. Several prior works (trinh2018simple; carlini2022quantifying) pointed out that such models can memorize some fraction of the data they encounter during training. For drafting informal proofs, we mainly experimented with Minerva. lewkowycz2022solving discussed the memorization effects within Minerva and concluded that they could not find evidence that its abilities are due to memorization. For the autoformalization of proof sketches, the Codex (code-davinci-002) model was used. Its training data was collected before June 2021 (https://beta.openai.com/docs/models/codex-series-private-beta), at which time the miniF2F dataset had not been made public, so the model cannot benefit from memorizing the exact problems and proofs. It is therefore inappropriate to attribute the abilities of the models used in this paper to memorization.
In this paper, we introduced Draft, Sketch, and Prove (DSP), a novel approach that takes advantage of informal proofs to synthesize formal proofs. We demonstrated its feasibility and effectiveness by reaching state-of-the-art performance on the miniF2F dataset with the Isabelle theorem prover. Central to our method are formal proof sketches that mirror the high-level reasoning structures of informal proofs. Our ablations showed that the ability to automatically convert informal proofs to proof sketches is critical to the success of DSP.
Our DSP method differs fundamentally from previous applications of machine learning to formal proof synthesis in two aspects. Firstly, while most approaches in the field focus on improving proof search, our method seeks to construct the entire formal proof structure from the informal proof in one decoding operation. The task of the automated prover is then simplified to filling the gaps between intermediate conjectures. Secondly, while existing approaches operate exclusively on formal data, DSP by design benefits from informal proofs.
In this work, we utilized a purely symbolic automated prover to close the gaps in proof sketches. In the future, we aim to equip DSP with more powerful mechanisms, such as HyperTree Proof Search (lample2022hypertree), to broaden the scope of provable theorems. Similar to AlphaCode (li2022alphacode), we found that the number of generations is crucial for performance. Since the computational cost of the autoformalizer is a bottleneck in our method, we seek to develop approaches that can generate high-quality proof sketches more efficiently.
We thank Rui Yuan and Kunhao Zheng for helping with the informal solutions used in our dataset. We thank Christian Szegedy for his feedback on the early draft.
WL is supported by the ERC Advanced Grant ALEXANDRIA (Project GA 742178).
AQJ conceived the idea of using proof sketches and conducted the experiments. SW constructed the first version of the pipeline and the initial autoformalization prompts. JPZ produced the Minerva informal proofs, and helped conduct autoformalization experiments. GL proposed to use inline comments in formal proof sketches to improve alignment. JPZ, YW, and SW performed the case analyses of Minerva solutions. AQJ, TL, GL, SW, and JL contributed to the dataset. AQJ and WL wrote the final autoformalization prompts. MJ is AQJ’s PhD supervisor. AQJ, GL, SW, and TL wrote the paper. YW and GL directed the project.
Interactive theorem provers such as Isabelle and Mizar use a declarative proof style (syme1997declare), in which a proof is interleaved with conjectures and their corresponding proofs. syme1997declare stated that the list of conjectures in a declarative proof should be analogous to a proof sketch found in a mathematical textbook and sufficiently convincing for the reader. In practice, ITP users often prove a theorem by writing down a list of conjectures (a “formal sketch”), then attempt to find a proof of each conjecture (fill a “gap”) with an automated system.
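The declarative style described above can be illustrated with a small Isabelle/Isar fragment (a hypothetical example for illustration, not taken from the paper's dataset; `sorry` marks a gap that the user would later ask an automated system to fill):

```isabelle
lemma example_sketch:
  fixes x :: real
  assumes "x > 1"
  shows "x * x > x"
proof -
  (* an intermediate conjecture, stated without justification: a "gap" *)
  have c1: "x * x > x * 1" sorry
  (* the conclusion follows from the conjecture by simplification *)
  then show ?thesis by simp
qed
```

The list of `have` steps plays the role of the textbook-style sketch; filling each `sorry` with a genuine proof turns the sketch into a complete formal proof.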
Sledgehammer (paulson2010three) is a powerful system that automates reasoning within the interactive theorem prover Isabelle. It works by flattening the goals encoded in the higher-order logic used by Isabelle/HOL into other logics (e.g., first-order logic), which can then be fed into automated theorem provers such as E (https://wwwlehre.dhbw-stuttgart.de/~sschulz/E/E.html), CVC4 (https://cvc4.github.io/index.html), Z3 (https://github.com/Z3Prover/z3), Vampire (https://vprover.github.io/), and SPASS (https://www.spass-prover.org/download/index.html). If any of these automated theorem provers succeeds in finding a proof in its own format, Sledgehammer reconstructs the proof in Isabelle/HOL with certified provers (metis, meson, and smt), which is more interpretable by humans.

As a practical example of using Sledgehammer, one can declare a conjecture in Isabelle/HOL, have "4 dvd (a::nat) ⟹ 2 dvd a", and call Sledgehammer immediately afterwards. If Sledgehammer succeeds, it returns a proof step that proves the conjecture. In this example, the step is by (meson dvd_trans even_numeral), which uses the meson resolution prover and two facts: that the divisibility relation is transitive and that 4 is an even number. If Sledgehammer does not find a proof or times out, it reports failure.
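Laid out as Isar source, the example just described (a fragment, assumed to occur inside a proof context where a natural number a is in scope) reads:

```isabelle
(* declare the conjecture ... *)
have "4 dvd (a::nat) ⟹ 2 dvd a"
  (* ... and insert the one-line justification found by Sledgehammer,
     which invokes the meson resolution prover with two library facts *)
  by (meson dvd_trans even_numeral)
```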
With the Minerva-generated solutions, a proof of the problem imo_1959_p1 is discovered. This is the first problem of the first ever International Mathematical Olympiad (IMO). The informal problem statement, the Minerva-generated informal solution, and DSP's formal proof are shown in Figure 6.
In Figure 6, we can see that the autoformalizer in DSP (a large language model) copies over parts of the informal proof generated by Minerva as in-line comments that precede their corresponding formal proof blocks. The formal proof does not use the first sentence of the informal proof, as it is already identical to the formal statement. We also notice that the large language model selects relevant premises after writing down the conjectures (the steps starting with using), even though not every premise is strictly needed.
Statement: Prove that the fraction $\frac{21n+4}{14n+3}$ is irreducible for every natural number $n$.
Informal Proof (Minerva 62B):
We must show that $\gcd(21n+4,\, 14n+3) = 1$.
The Euclidean algorithm gives
$$21n + 4 = 1 \cdot (14n + 3) + (7n + 1), \qquad 14n + 3 = 2 \cdot (7n + 1) + 1.$$
Since $\gcd(7n+1,\, 1) = 1$, we have $\gcd(21n+4,\, 14n+3) = 1$.
Formal Proof:
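The full proof listing appears in Figure 6; its shape, reconstructed from the surrounding description, is roughly the following (an illustrative sketch only, with `sorry` standing in for the gaps that DSP closes with the automated prover):

```isabelle
theorem imo_1959_p1:
  fixes n :: nat
  shows "gcd (21*n + 4) (14*n + 3) = 1"
proof -
  (* The Euclidean algorithm gives 21n+4 = 1*(14n+3) + (7n+1) ... *)
  have c1: "21*n + 4 = 1*(14*n + 3) + (7*n + 1)" sorry
  (* ... and 14n+3 = 2*(7n+1) + 1. *)
  have c2: "14*n + 3 = 2*(7*n + 1) + 1" sorry
  (* Hence the gcd divides 1, so it equals 1. *)
  then show ?thesis sorry
qed
```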
The formal proof states conjectures (have statements and a final show statement) which are all subsequently proved by our automated theorem prover. The step proving the statement have "gcd (21*n + 4) (14*n + 3) = 1" involves the verified low-level provers smt and z3, and lemmas/facts outside the scope of the language model. It is highly unlikely that either the large language model or the automated theorem prover could finish this proof on its own.
In contrast, the human-written informal proof of this IMO problem did not lead to a successful formal proof. The human-written proof is:
Denoting the greatest common divisor of $21n+4$ and $14n+3$ as $d$, we use the Euclidean algorithm:
$$d = \gcd(21n+4,\ 14n+3) = \gcd(7n+1,\ 14n+3) = \gcd(7n+1,\ 1) = 1.$$
It follows that $\frac{21n+4}{14n+3}$ is irreducible. Q.E.D.
A key difference between the Minerva proof and the human proof is the way that invoking the Euclidean algorithm is described. The Minerva proof explicitly writes out the results of the Euclidean algorithm (e.g., $21n+4 = 1 \cdot (14n+3) + (7n+1)$), which are translated into the sketch (c1 in Figure 6). The human proof introduces new notation to express the results indirectly in terms of greatest common divisors, which ends up being less suitable for sketching. For example, the sketch generated from the human proof contains a conjecture that is semantically incorrect and hence cannot be closed by the automated prover.
Statement: For each plumbing repair job, Mr. Wrench charges dollars for coming out to the house plus dollars per hour that he works at the house. He charged for a one-hour repair job and for a five-hour repair job. What is his charge for a two-hour repair job? Show that it is .
Informal Proof (Minerva 62B):
Let be the number of dollars charged for coming out to the house and let be the number of dollars per hour that Mr. Wrench works.
For the one-hour repair job, we have that the total cost is
For the five-hour repair job, we have that the total cost is
We subtract the first equation from the second equation to obtain
Plugging this back into the first equation, we obtain that
Therefore, for a two-hour repair job, the total cost is
Formal Proof:
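The dollar amounts are elided above, so as an illustration, the same chain of reasoning can be sketched with the charges kept symbolic (c1 and c5 are hypothetical names for the one-hour and five-hour totals; x is the call-out fee and h the hourly rate):

```isabelle
lemma wrench_sketch:
  fixes x h c1 c5 :: real
  assumes one_hour:  "x + h = c1"
      and five_hour: "x + 5 * h = c5"
  shows "x + 2 * h = c1 + (c5 - c1) / 4"
proof -
  (* subtract the first equation from the second *)
  have "4 * h = c5 - c1" using one_hour five_hour by linarith
  (* plug the hourly rate back into the first equation *)
  then show ?thesis using one_hour by linarith
qed
```

With the concrete charges substituted for c1 and c5, the conclusion evaluates to the two-hour total asserted in the statement.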
Statement: Show that for any real number and any natural number , if , then .
Informal Proof (Minerva 62B):
This is true for . Now, suppose that this is true for .
Then we have that
and
Therefore, this is true for .
Formal Proof:
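The informal proof above has the classic induction shape: a base case followed by an inductive step. In Isar, that shape (for a hypothetical property P of natural numbers, standing in for the elided statement) can be sketched generically as:

```isabelle
lemma induction_skeleton:
  fixes P :: "nat ⇒ bool"
  assumes base: "P 0"
      and step: "⋀n. P n ⟹ P (Suc n)"
  shows "P n"
proof (induction n)
  case 0
  (* "This is true for the base case." *)
  show ?case by (rule base)
next
  case (Suc n)
  (* "Suppose this is true for n; then it is true for n+1." *)
  then show ?case by (rule step)
qed
```

In DSP, the autoformalizer produces a sketch of this shape and the base and step obligations become the gaps handed to the automated prover.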