Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs

The formalization of existing mathematical proofs is a notoriously difficult process. Despite decades of research on automation and proof assistants, writing formal proofs remains arduous and only accessible to a few experts. While previous studies to automate formalization focused on powerful search algorithms, no attempts were made to take advantage of available informal proofs. In this work, we introduce Draft, Sketch, and Prove (DSP), a method that maps informal proofs to formal proof sketches, and uses the sketches to guide an automated prover by directing its search to easier sub-problems. We investigate two relevant setups where informal proofs are either written by humans or generated by a language model. Our experiments and ablation studies show that large language models are able to produce well-structured formal sketches that follow the same reasoning steps as the informal proofs. Guiding an automated prover with these sketches enhances its performance from 20.9% to 39.3%.

1 Introduction

Formal proof automation is a challenging task that has been the focus of increased attention in recent years (bansal2019holist; polu2020generative; lample2022hypertree; jiang2022thor; wu2022autoformalization). However, deep learning approaches have not been as successful as in other domains, mainly because of the scarcity of formal data. Indeed, formalizing proofs is notoriously difficult and only accessible to a handful of experts, which makes large annotation endeavors unrealistic (wiedijk2008formal). The largest formal proof corpus is written in Isabelle (paulson1994isabelle), and amounts to less than GB in size, orders of magnitude smaller than datasets commonly used in vision (deng2009imagenet) or natural language processing (gpt3). To address the scarcity of formal proofs, previous studies have proposed to use synthetic data (wu2021int), self-supervision (polu2020generative; pact), or reinforcement learning (DBLP:journals/corr/abs-1905-10501; polu2022formal) to synthesize additional formal training data. Although these methods alleviate the data insufficiency to some degree, none are able to capitalize on the bulk of human-written mathematical proofs.

Unlike formal mathematics, informal mathematical data is abundant and widely available. Recently, large language models trained on informal mathematical data showcased impressive quantitative reasoning abilities (lewkowycz2022solving; welleck2022naturalprover). However, they often generate erroneous proofs and it is challenging to detect the faulty reasoning in these proofs automatically. Our work devises a novel approach called Draft, Sketch, and Prove (DSP) to translate informal mathematical proofs into formal ones and thus enjoy both the logical rigor provided by formal systems and the wealth of informal data. We give a schematic diagram of the DSP method in Figure 1 and describe it in Section 3. Recent work (wu2022autoformalization) demonstrates the feasibility of automatically translating informal statements into formal ones with large language models. DSP goes further and leverages large language models to generate formal proof sketches (wiedijk2003formal) from informal proofs. Proof sketches consist of high-level reasoning steps that can be interpreted by formal systems such as interactive theorem provers. They differ from complete formal proofs in that they contain sequences of intermediate conjectures without justification. An example of an informal proof with its corresponding formal proof sketch is provided in Figure 2. In the last step of DSP, we elaborate the formal proof sketch into a full formal proof using an automated prover to prove all intermediate conjectures.

We perform experiments to generate formal proofs of problems from the miniF2F dataset (zheng2021minif2f) and show that a large portion of theorems can be proved automatically with this method. We investigate two settings where the informal proofs are either written by humans or drafted by a large language model trained on mathematical text. These two settings correspond to situations frequently occurring during the formalization of existing theories, where informal proofs are usually available, but sometimes left as exercises to the reader or missing due to space limits in the margin.

Contributions:

  • We introduce a novel approach to leverage informal proofs to guide automated provers with formal proof sketches.

  • To evaluate our approach, we build a dataset of manually curated informal statements and informal proofs aligned with formal statements in the miniF2F dataset (zheng2021minif2f).

  • We increase the proportion of problems solved by an automated prover on miniF2F from to given language-model-generated informal proofs, and up to when proofs are written by humans.

  • Through three ablation studies, we demonstrate the performance benefit of drafting informal proofs, annotating sketches with informal segments, and using automated provers to close open conjectures for the autoformalization of proofs.

2 Background and Related Work

Interactive theorem proving

Modern verification systems for mathematics are centered around interactive theorem provers (ITPs), such as Isabelle (paulson1994isabelle), Lean (moura2015lean), Coq (barras1997coq), or Metamath (metamath). ITPs embed the mathematical definitions and theorems onto a solid logical foundation (e.g., Higher-Order Logic, Dependent Type Theory) implemented by their kernels. Every theorem must be checked by the kernel to be recognized by the ITP. To be proved formally, a theorem is first stated in the ITP’s programming language, and iteratively simplified into simpler objectives (or subgoals), until it can be reduced to already proven facts. In this paper, we will refer to proofs verified by a formal theorem prover as formal proofs, and proofs written in “standard” mathematics (e.g. in LaTeX) as informal proofs.

Machine learning for formal proof synthesis

Several approaches propose to combine machine learning with modern interactive theorem provers (yang2019coqgym; gauthier2021tactictoe), and build upon the recent success of language models (polu2020generative; pact; polu2022formal; jiang2022thor; lample2022hypertree). These methods typically rely on sequence-to-sequence models (sutskever2014sequence) to generate the next step of a proof given the current proof state and perform search over the generated subgoals using powerful search methods such as MCTS (silver2018general). Because search is computationally expensive, these language models are relatively small (with fewer than billion parameters). Our method contrasts with these approaches in that we use a significantly reduced number of calls to the models, but also much larger language models (with up to billion parameters) that showcase outstanding few-shot learning abilities (gpt3).

Machine learning for informal reasoning

Language models have also been used in the context of purely informal mathematics (Lample2020Deep; hendrycksmath2021; welleck2021naturalproofs; drori2022neural; welleck2022naturalprover). Nevertheless, lewkowycz2022solving note that for quantitative question answering, models are prone to generate false positives: the model guesses the right answer while providing an incorrect proof. These errors are hard to spot without human inspection. Worryingly, the frequency of false positives increases with the difficulty of the problem. Our method builds on these findings and translates informal proofs into formal proofs. Since ITPs are logically grounded, once a formal proof is checked by them, we are guaranteed its correctness.

Autoformalization

In a position paper, szegedy2020promising argued for attaining formal mathematical data from informal sources with neural networks. wang2020exploration performed preliminary experiments where the evaluation was limited to text-level similarities on synthetic datasets. Recently, wu2022autoformalization found that large language models (chen2021EvaluatingLL; chowdhery2022palm) are capable of few-shot statement autoformalization. Namely, a small number of examples are enough for them to learn to perform informal-to-formal translation of statements. In this paper, we investigate whether these findings can generalize to proof autoformalization, i.e., whether large language models can be used to translate informal proofs into formal ones.

3 Method

In this section, we describe our Draft, Sketch, and Prove (DSP) method for formal proof automation, which leverages informal proofs to guide automated formal theorem provers with proof sketches. We assume that each problem comes with an informal statement and a formal statement describing the problem. Our pipeline consists of three stages (depicted in Figure 1), which we present below.

3.1 Drafting informal proofs

The initial phase of the DSP method consists in finding informal proofs for a problem according to its description in natural mathematical language (possibly with LaTeX). The resulting informal proof is seen as a draft for the subsequent phases. In mathematical textbooks, proofs of theorems are in general provided, but are sometimes missing or incomplete. Therefore, we consider two settings corresponding to the presence or absence of the informal proofs. In the first, we assume that a “ground-truth” informal proof (i.e., one written by a human) is available, which is the typical scenario in the practice of formalizing existing mathematical theories. In the second setting, we make a more general assumption that the ground-truth informal proof is not given, and draft proof candidates with a large language model trained on informal mathematical data. The language model removes the dependence on human proofs and can produce multiple alternative solutions for every problem. Although there is no easy way to automatically verify the correctness of these proofs, the informal proof only needs to be useful for producing a good formal proof sketch in the next stage.

3.2 Mapping informal proofs into formal sketches

A formal proof sketch encodes the structure of a solution and leaves out low-level details (wiedijk2003formal). Intuitively, it is a partial proof that outlines high-level conjecture statements. A concrete example of a proof sketch is shown in Figure 2. Although informal proofs often leave aside low-level details, (e.g., by stating their triviality), these details cannot be discharged in a formal proof, making straightforward informal-to-formal proof translation difficult. Instead, we propose to map informal proofs to formal proof sketches that share the same high-level structures. The low-level details missing from a proof sketch can later be filled by an automated prover. Since large informal-formal parallel corpora do not exist, standard machine translation methods are unsuitable for this task. Rather, we use the few-shot learning abilities of a large language model. Specifically, we prompt the model with a few example pairs containing informal proofs and their corresponding formal sketches, followed by an informal proof yet to be translated. We then let the model generate the subsequent tokens to obtain the desired formal sketch. We refer to this model as an autoformalizer.
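The few-shot prompting step described above can be illustrated with a minimal sketch. The field names, the section labels, and the `---` separator are assumptions made for illustration; the source does not specify the exact prompt layout.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Example:
    """One autoformalization example (field names are hypothetical)."""
    informal_statement: str
    informal_proof: str
    formal_statement: str
    formal_sketch: str = ""  # left empty for the problem still to be translated


def render(ex: Example) -> str:
    """Render one example; an empty sketch leaves the last field open."""
    return (
        f"Informal statement:\n{ex.informal_statement}\n\n"
        f"Informal proof:\n{ex.informal_proof}\n\n"
        f"Formal statement:\n{ex.formal_statement}\n\n"
        f"Formal sketch:\n{ex.formal_sketch}"
    )


def build_prompt(demos: Sequence[Example], problem: Example) -> str:
    """Concatenate solved demonstrations, then the unsolved problem."""
    blocks = [render(d) for d in demos] + [render(problem)]
    return "\n\n---\n\n".join(blocks)
```

Because the prompt ends right after the final "Formal sketch:" label, the tokens the model generates next are read as the sketch for the new problem.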

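Informal Statement: Show that for any real number $a$, $10a \leq 28a^2 + 1$.

Informal Proof:
It suffices to show $0 \leq 28a^2 - 10a + 1$. First, consider completing the square for $28a^2 - 10a$ and observe that $(a - \frac{5}{28})^2 = a^2 - \frac{10}{28}a + (\frac{5}{28})^2$. Since $0 \leq (a - \frac{5}{28})^2$, we get $0 \leq a^2 - \frac{10}{28}a + (\frac{5}{28})^2$. Multiplying by 28 and simplifying gives $0 \leq 28a^2 - 10a + \frac{25}{28}$. Since $\frac{25}{28} < 1$, the result follows.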

Formal Proof Sketch:

theorem algebra_binomnegdiscrineq_10alt28asqp1:
  fixes a :: real
  shows "10 * a ≤ 28 * a^2 + 1"
proof - (* it suffices to show 0 <= 28a^2 - 10a + 1 *)
  have c0: "0 ≤ 28a^2 - 10a + 1"
  proof - (* observe that (a - (5/28))^2 = a^2 - (10/28)a + (5/28)^2 *)
    have c1: "(a - (5/28))^2 = a^2 - 10/28a + (5/28)^2" 
    (* we get 0 <= a^2 - (10/28)a + (5/28)^2 *)
    have c2: "0 ≤ a^2 - 10/28a + (5/28)^2" using c1 
    (* Multiplying by 28 and simplifying gives 0 <= 28a^2 - 10a + (25/28) *)
    have c3: "0 ≤ 28a^2 - 10a + 28((5/28)^2)" using c2 
    have c4: "0 ≤ 28a^2 - 10a + 28((5/28)*(5/28))" using c3 
    have c5: "0 ≤ 28a^2 - 10a + (25/28)" using c4 
    (* Since 25/28 < 1, the result follows. *)
    show ?thesis using c5 
  qed
  show ?thesis 
qed
Figure 2: A proof sketch in Isabelle. The problem “Show that for any real number $a$, $10a \leq 28a^2 + 1$” is given with an informal proof and an associated formal proof sketch. The sketch first rewrites the original statement (c0), which is proved through 5 intermediary conjectures (c1..c5). We use a special token () to indicate that the conjecture is “open” and should be tackled by an automated prover later. To facilitate the alignment between the informal and formal languages, we annotate the formal proof sketch examples with informal proof segments (shown in red), which are immediately followed by their formal counterparts.

3.3 Proving open conjectures in the sketches

As the last part of the process, we execute off-the-shelf automated provers to fill in the missing details in proof sketches, where “automated provers” refers to systems capable of producing formally verifiable proofs. Our framework is agnostic to the specific choice of the automated prover: it can be symbolic provers such as heuristic proof automation tools, neural-network-based provers, or hybrid approaches. If the automated prover successfully closes all the gaps in the proof sketch, it returns the final formal proof which can be checked against the problem’s specification. If the automated prover fails (e.g., it exceeds the allocated time limit), we consider the evaluation to be unsuccessful.
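A minimal sketch of this stage, under stated assumptions: the open-conjecture marker `<OPEN>` is a hypothetical stand-in for the special token, and `prove` is a placeholder callback for whichever automated prover is used. The attempt counts as successful only if every open conjecture is closed.

```python
from typing import Callable, Optional

OPEN = "<OPEN>"  # hypothetical marker for an open conjecture (an assumption)


def complete_sketch(
    sketch: str, prove: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Replace each OPEN marker, left to right, with a proof step.

    `prove` receives the proof text preceding the marker and returns a
    proof step such as "by auto", or None when it fails or times out.
    Returns the completed proof, or None if any conjecture stays open.
    """
    parts = sketch.split(OPEN)
    done = parts[0]
    for rest in parts[1:]:
        step = prove(done)
        if step is None:
            return None  # one unproved gap makes the whole attempt fail
        done += step + rest
    return done
```

The completed text would still need to be checked by the proof assistant against the formal statement; this function only assembles the candidate proof.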

4 Experiments

4.1 Dataset and evaluation

We evaluate our method on the miniF2F dataset (zheng2021minif2f). The dataset contains the formal statements of problems from high-school mathematical competitions, written in three formal languages: Lean, HOL-Light, and Isabelle. They are split into a valid set and a test set, composed of problems each. In this work, we choose to experiment with Isabelle for three reasons: (1) Isabelle’s proof corpus is one of the largest among interactive theorem provers, conducive to the language models’ mastery of its syntax; (2) Isabelle supports the declarative proof style (detailed discussion in Appendix A), enabling formal proof sketches (wiedijk2003formal) which are central to our method; (3) although automated proving tools are available in other interactive theorem provers, none are as developed and effective as Sledgehammer (paulson2010three) in Isabelle for proving conjectures.

The miniF2F dataset is comprised of problems from three source categories: (1) problems sampled from the MATH dataset (hendrycksmath2021); (2) problems from actual high-school mathematical competitions (AMC, AIME, and IMO); (3) crafted problems at the same difficulty level as (2). We employ three methods to obtain informal statements and proofs from these sources. For source (1), we access the informal statements and proofs from the MATH dataset; for (2), we retrieve their informal statements and proofs from the AOPS website (https://artofproblemsolving.com/community); and for (3), we manually write down their informal statements and proofs. Thus we gather a parallel set of informal statements, informal proofs, and formal statements. This dataset provides the informal statements and proofs for our experiment in the human-as-informal-proof-writer setting and will be made available shortly.

Our task is to generate formal proofs for problems as they are formally stated in miniF2F. We consider a proof valid if and only if it (a) does not contain “cheating” keywords (sorry and oops) that exit a proof without completing it, and (b) Isabelle is able to verify the corresponding formal statement with the proof. We use the Portal-to-ISAbelle API by jiang2021language to interact with Isabelle.
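The two validity conditions can be expressed as a short check. This is only a sketch: `verify` is a hypothetical callback standing in for the round trip to Isabelle, and the keyword test is deliberately conservative (it also rejects a proof that merely mentions a keyword in a comment).

```python
import re
from typing import Callable

# Keywords that exit a proof without completing it.
CHEATING = re.compile(r"\b(sorry|oops)\b")


def is_valid_proof(proof: str, verify: Callable[[str], bool]) -> bool:
    """Condition (a): no cheating keyword; condition (b): the checker accepts."""
    if CHEATING.search(proof):
        return False
    return verify(proof)
```

Running the keyword test first avoids a call to the proof checker for proofs that would be rejected anyway.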

4.2 Baselines

Sledgehammer

As a baseline, we attempt to prove the formal statement directly with Sledgehammer, a popular proof automation tool in Isabelle. We use the default Sledgehammer configuration in Isabelle2021, including a 120-second timeout and the five automated theorem provers (Z3, CVC4, SPASS, Vampire, E). Appendix B gives a more thorough introduction to Sledgehammer.

Sledgehammer + heuristics

Occasionally, Sledgehammer may fail without trying simple yet effective tactics. As a second, stronger baseline, we create an automated prover that tries common tactics (auto, simp, blast, fastforce, force, eval, presburger, sos, arith, linarith, auto simp: field_simps) for high-school level algebra and number theory problems. If every attempted tactic fails, or times out after seconds, it falls back to Sledgehammer.
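The control flow of this baseline can be sketched as follows. The tactic list is the one given above; `try_tactic` and `sledgehammer` are hypothetical callbacks, and a tactic that times out is assumed to be reported by `try_tactic` as a failure.

```python
from typing import Callable, Optional, Sequence

TACTICS = (
    "auto", "simp", "blast", "fastforce", "force", "eval",
    "presburger", "sos", "arith", "linarith", "auto simp: field_simps",
)


def prove_with_heuristics(
    goal: str,
    try_tactic: Callable[[str, str], bool],
    sledgehammer: Callable[[str], Optional[str]],
    tactics: Sequence[str] = TACTICS,
) -> Optional[str]:
    """Try each tactic in order; fall back to Sledgehammer if all fail."""
    for tactic in tactics:
        if try_tactic(goal, tactic):  # False on failure or timeout
            # Multi-word methods need parentheses after `by` in Isabelle.
            return f"by ({tactic})" if " " in tactic else f"by {tactic}"
    return sledgehammer(goal)  # may itself return None
```

Trying the cheap tactics first means Sledgehammer is only invoked on goals that none of them can close.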

Language models for proof search

Finally, we include baselines which are representative of state-of-the-art neural theorem proving in Isabelle, specifically Thor (jiang2022thor) and Thor with expert iteration on autoformalized data (wu2022autoformalization). The methods GPT-f with expert iteration (polu2022formal), and HyperTree Proof Search (HTPS) (lample2022hypertree) can solve and of the problems on miniF2F-test. However, they rely on the Lean theorem prover instead of Isabelle, which greatly influences the performance due to the different tactics and automation, and are not directly comparable to our method.

4.3 Experimental Setup

Drafting

When informal proofs are generated, we condition a large language model on informal statements to sample informal proofs per problem. Specifically, we use the Codex code-davinci-002 model (chen2021EvaluatingLL) through the OpenAI API, and the B, B, and B versions of the Minerva model from lewkowycz2022solving. We use greedy decoding for Codex, and nucleus sampling (holtzman2019curious) with temperature and for Minerva models.

Sketching

For sketching, we manually prepare autoformalization examples of the format informal statement, informal proof, formal statement, formal sketch, to form a pool of high-quality demonstrations. Of these examples, are of the algebra type and are of the number theory type. All examples are from the validation set of the miniF2F dataset and can be found in the supplementary materials. The sketches contain in-line comments as in Figure 2. If the name of the problem gives away its type (algebra or number theory), we only use examples of the corresponding type. We also ensure that the sampled few-shot examples do not contain the problem being solved. The prompt is composed of uniformly randomly sampled example from the pool and the current problem’s (informal statement, informal proof, formal statement). We use this prompt to query the same Codex model to get the desired proof sketches. We use greedy decoding and a maximum of tokens in the generated sequence. For all the experiments, unless stated otherwise, we control the total number of queries made to Codex per problem to be . This means queries per human informal solution and one query per language-model-generated solution.
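The selection of few-shot examples described above might look like the sketch below. It assumes that a problem's type can be read off its name through the substrings `algebra` and `numbertheory`; that naming convention, and the function names, are assumptions for illustration.

```python
import random
from typing import List, Optional, Sequence


def problem_type(name: str) -> Optional[str]:
    """Guess the problem type from its name (naming convention assumed)."""
    for kind in ("algebra", "numbertheory"):
        if kind in name:
            return kind
    return None


def sample_demos(
    pool: Sequence[str], problem: str, k: int, rng: random.Random
) -> List[str]:
    """Uniformly sample up to k demonstrations for `problem`.

    The problem itself is always excluded; when its type is known,
    only demonstrations of the same type are eligible.
    """
    kind = problem_type(problem)
    candidates = [
        name for name in pool
        if name != problem and (kind is None or problem_type(name) == kind)
    ]
    return rng.sample(candidates, min(k, len(candidates)))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across runs.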

Proving

To prove the conjectures left open by the formal sketch, we use the Sledgehammer + heuristics automated prover described in Subsection 4.2. We execute the automated prover on every open conjecture in the sketch to synthesize a formal proof that can be verified by Isabelle.

Success rate miniF2F-valid miniF2F-test
Baselines
Sledgehammer
Sledgehammer + heuristics
Thor (jiang2022thor)
Thor + expert iteration (wu2022autoformalization)
Draft, Sketch, and Prove
Human informal proof
Codex informal proof
B Minerva informal proof
B Minerva informal proof
B Minerva informal proof
Ablations (with human informal statements and proofs)
In-line comments
Informal proofs
Formal proof sketches
Table 1: Proving success rates on the miniF2F dataset with Isabelle. In the table are the success rates of four baselines, the DSP method with human and language model informal proofs, as well as three ablation studies, on the validation and the test sets of miniF2F. The highest success rates on each set are highlighted in bold. The performance differences between the ablation studies and DSP with human informal proofs are enclosed in brackets.

4.4 Results

In Table 1, we display the proportion of successful formal proofs found on the miniF2F dataset. The results include the four baselines described in Subsection 4.2 and the DSP method with human-written proofs and model-generated proofs. From the table, we can see that the automated prover with additional heuristic tactics significantly increases the performance of Sledgehammer, boosting its success rate from to on the validation set of miniF2F and from to on the test set. The two baselines using language models and proof search (Thor and Thor + expert iteration) achieve success rates of and on the test set of miniF2F, respectively.

With informal proofs written by humans, the DSP method achieves success rates of and on the validation and test sets of miniF2F. A total of out of problems can be proved in this way. The Codex model and the Minerva (B) model give very similar results in solving problems on miniF2F: they both guide the automated prover to solve and of problems on the validation and the test sets respectively. This is corroborated by lewkowycz2022solving’s observation that these two models have comparable performance in solving mathematical problems.

When we switch to the Minerva (B) model, the success rates rise up to and respectively. Compared to human-written informal proofs, its success rates are higher on the validation set and lower on the test set. In total, the Minerva (B) model is able to solve problems on miniF2F, one fewer than with human proofs. The Minerva (B) model solves and of problems in the validation and the test sets of miniF2F, also resulting in successful proofs. The DSP method is effective in guiding the automated prover under both settings: using human informal proofs or language-model-generated informal proofs. DSP almost doubles the prover’s success rate and results in a new state-of-the-art performance on miniF2F with Isabelle. Moreover, the larger Minerva models are almost as helpful as a human in guiding the automated formal prover.

5 Analysis

5.1 Ablation studies

Ablation of in-line comments

To facilitate the alignment between the informal proofs and the formal proof sketches, we copy relevant segments of the informal proofs as in-line comments in the sketches. In the manually constructed prompt examples, these comments are prefixed to the corresponding Isabelle code blocks, as shown in Figure 2 (the text in red). We hypothesize that this technique is beneficial for large language models to synthesize formal sketches. To validate this hypothesis, we perform an ablation study by removing the in-line comments in the prompt examples before running the experiment. The results are displayed in Table 1. We find that without in-line comments, the success rates drop by and on the validation and test sets respectively. We conclude that having in-line comments is helpful for generating formal proof sketches.

Ablation of informal proof drafts

Drafting informal proofs is the first step of the DSP method. To investigate the necessity of this step, we perform an experiment of formal sketching and proving without informal proofs at all. Because formal proof sketches are written in the declarative proof style, they are fairly similar to the informal proof drafts already. Concretely, we remove the informal proofs and the in-line comments (because they are copied segments of the informal proofs) in the prompt examples. This removes the need for the informal proof writer, whether a human or a neural network. The results of this setup are shown in Table 1. It can be seen that the success rates on the validation and the test sets of miniF2F drop by and respectively compared to those obtained with human-written proofs. They are also inferior to success rates obtained with language-model-generated informal proofs. This demonstrates the importance of drafting informal proofs before sketching and proving.

Ablation of automated provers

Using an autoformalizer to generate proof sketches which are then completed by an automated prover is central to our method. The effect of utilizing an automated prover to close open conjectures in proof sketches is worth studying, so we conduct an ablation experiment for it. Namely, we replace the proof sketches in the prompt examples with complete formal proofs. The complete formal proofs still follow the declarative proof style, but do not contain any open conjectures. As a result, the large language model will also generate full proofs instead of sketches, and we directly check whether these generated proofs are valid. The results in this setup are presented in Table 1. The results reveal that without an automated prover to close open conjectures, the success rate on miniF2F decreases by and on the validation and test sets respectively. The drastic performance difference indicates the essential role of automated provers in our approach.

Scaling properties of ablation studies

To understand the effect of the ablations on the DSP method’s scaling properties, we vary the number of autoformalization attempts per problem and plot the number of successful proofs found on the miniF2F dataset in Figure 3 (left). Four methods are contrasted: the original DSP method with human informal proofs, the DSP method without in-line comments, the DSP method without informal proofs, and the DSP method without formal proof sketches. It can be seen from the figure that with the original DSP method, the performance reaches a plateau (no new proofs are found) after autoformalization attempts are made for each problem. For the ablation study with no in-line comments, the plateau is reached much faster, after around autoformalization attempts. This method solves problems in total. The ablation study without informal proofs also reaches a plateau at around autoformalization attempts, solving problems in total. The ablation study without sketching can solve problems on miniF2F. In comparison, with human informal proofs, only autoformalization attempts are required to reach this performance.
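The curves discussed above count, for each attempt budget, the problems with at least one successful attempt within that budget. A minimal sketch of that computation, with a hypothetical input layout (one list of per-attempt outcomes per problem):

```python
from typing import Dict, List, Sequence


def solved_after(
    outcomes: Dict[str, Sequence[bool]], max_attempts: int
) -> List[int]:
    """curve[k-1] = number of problems solved within the first k attempts."""
    curve = []
    for k in range(1, max_attempts + 1):
        solved = sum(any(attempts[:k]) for attempts in outcomes.values())
        curve.append(solved)
    return curve
```

A plateau in the sense used above corresponds to the point after which consecutive entries of the returned list stop increasing.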

Figure 3: Number of problems solved on miniF2F against the number of autoformalization attempts per problem. Left: The figure displays the experiments carried out with the DSP method and three ablations on it. The curves represent the DSP method (blue), formal proof sketches without the in-line comments (orange), without informal proofs altogether (green), and without the automated provers (red). Right: The figure compares the experimental results with informal proof drafts written by humans (blue), the B Minerva model (orange), the B Minerva model (green), the B Minerva model (red), and the Codex model (purple).

5.2 Language-model-generated proofs

Statement: Prove that the fraction $\frac{21n+4}{14n+3}$ is irreducible for every natural number $n$.

Informal Proof (Minerva 62B):
We must show that $\gcd(21n+4, 14n+3) = 1$.
The Euclidean algorithm gives
$21n+4 = 1\cdot(14n+3) + 7n+1$
$14n+3 = 2\cdot(7n+1) + 1$.
Since $\gcd(7n+1, 1) = 1$, we have $\gcd(21n+4, 14n+3) = 1$.

Formal Proof:

theorem imo_1959_p1:
  fixes n :: nat
  shows "gcd (21*n + 4) (14*n + 3) = 1"
proof -
  (* The Euclidean algorithm gives
  21n+4=1\cdot(14n+3)+7n+1
  14n+3=2\cdot(7n+1)+1. *)
  have c0: "21*n + 4 = 1*(14*n + 3) + 7*n + 1"
    <ATP> by auto  </ATP>
  have c1: "14*n + 3 = 2*(7*n + 1) + 1" using c0
    <ATP> by auto  </ATP>
  (* Since \gcd(7n+1,1)=1, we have \gcd(21n+4,14n+3)=1. *)
  then have "gcd (7*n + 1) 1 = 1"
    using c1
    <ATP> by auto  </ATP>
  then have "gcd (21*n + 4) (14*n + 3) = 1"
    using c1
    <ATP> by (smt (z3) BitM_plus_one ab_semigroup_add_class.add_ac(1)
    add.assoc c0 gcd.commute gcd_add2 gcd_add_mult mult_numeral_1
    numeral_One numeral_eq_Suc numerals(1) semiring_norm(3))  </ATP>
  then show ?thesis
    using c1
    <ATP> by blast  </ATP>
qed
Figure 4: IMO proof guided by a Minerva informal proof. An informal proof of the International Math Olympiad problem imo_1959_p1 generated by Minerva that leads to a successful formal proof. The steps enclosed by the ATP delimiters are generated by an automated prover and all other steps are by the autoformalizer.

Our experiments demonstrated that model-generated informal proofs from Minerva and Codex can help guide a formal theorem prover. In this section, we analyze the properties of these proofs further. We focus on the informal proofs the B and B Minerva models produce in this section, as they give the best overall performances and achieve the highest success rate on miniF2F.

Minerva helps solve one IMO problem

Interestingly, our approach manages to solve one problem from the International Mathematical Olympiad (imo_1959_p1) with a Minerva-generated solution, but not with the human proof. For this problem, we present the successful Minerva-generated informal proof draft and the formal proof in Figure 4. We hypothesize that the reason behind this phenomenon is that human proofs might leave gaps between conjectures that are too difficult for automated provers to solve. On the other hand, the diversity in language model informal proofs makes some of them more amenable to automated provers. In Appendix C, we analyze the human and the Minerva informal proofs for this problem in greater detail.

Manual evaluation of Minerva proofs

Next, we analyze the relationship between the validity of the formal proofs and the correctness of the informal proofs. For our analysis, we randomly sample Minerva proofs of different problems, which are then successfully converted to formal proofs. We then manually evaluate the correctness of these informal proofs. Among them, proofs () are entirely correct, are incorrect with a clearly identifiable incorrect step, and “proofs” are nonsensical and simply rephrase the final conclusions of the problems.

Seeing that a total of incorrect informal proofs can lead to successful formal proofs, we study how they guide the automated formal prover despite having flaws themselves. The proofs divide into cases: in the first case, we find problems for which the informal proofs are mostly ignored, and the automated prover can find proofs by itself; in the other problems, although the informal proofs are wrong, the autoformalizer manages to correct them, either by ignoring the erroneous steps or by stating their correct versions in the formal proof sketches. This suggests that the autoformalizer has some understanding of the mathematical statements and is not merely translating them from an informal language to a formal language. It is robust to slight noise in its input. In Appendix D, we present case studies comparing the human and Minerva informal proofs. Particularly, Appendix D shows a completely correct example and one example of each pathological case.

Is there a way to detect which Minerva proofs are correct, without human evaluation? For a preliminary investigation, we filter out all the problems that can be solved directly with the automated prover from the and are left with informal proofs. Of these , are completely correct, still contain small errors, but none are nonsensical. With this simple filter, we achieve a precision of and a recall of in identifying correct Minerva informal proofs.
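To make the two metrics concrete, here is a minimal sketch of how precision and recall of such a filter could be computed. The function name and the counts in the usage line are hypothetical and are not the numbers from our evaluation: a "true positive" is a correct informal proof that the filter keeps, a "false positive" is an incorrect proof that it keeps, and a "false negative" is a correct proof that it discards.

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Return (precision, recall) of a binary filter from its confusion counts."""
    kept = true_positives + false_positives          # proofs the filter keeps
    correct = true_positives + false_negatives       # proofs that are actually correct
    precision = true_positives / kept if kept else 0.0
    recall = true_positives / correct if correct else 0.0
    return precision, recall


# Hypothetical counts for illustration only (not the paper's figures).
precision, recall = precision_recall(true_positives=8, false_positives=2, false_negatives=4)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

High precision with lower recall would mean the filter rarely keeps a wrong proof but also discards some correct ones, namely those attached to problems that the automated prover can already solve alone.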

Scaling properties of human and Minerva proofs

To understand the influence of different informal proof sources on the scaling properties of DSP, we plot the number of successful proofs found on miniF2F against the number of autoformalization attempts per problem in Figure 3 (right). Note that for each problem, we have informal proof by a human and informal proof drafts by each language model. The one human proof is used times for formal proof sketch generation, while each language model proof draft is used only once. We notice that the B Minerva model and the B Minerva model always have comparable performances. Considering that the B Minerva model is more capable of mathematical reasoning (lewkowycz2022solving, Table 3) than the B model, we hypothesize that the bottleneck in the DSP process shifts from drafting to sketching and proving. I.e., informal proof drafts of higher quality do not necessarily lead to more successful formal proofs due to the limitation of sketching and proving. Both the B and the B models result in more successful proofs than the smaller (B) Minerva model and the Codex model, consistently for any number of attempts. The B Minerva model and the Codex model behave similarly, both finding proofs in the end. Informal proofs written by humans help solve more problems than those by Minerva models for autoformalization attempts. However, the difference is small ( problem) when are made.

Figure 5: Number of problems solved on miniF2F. Left: The figure displays the number of successful proofs with human and Minerva-generated informal proof drafts when up to autoformalization attempts are made per problem. The Minerva (B) proof drafts solve more problems than human proof drafts when more than attempts are made per problem. Right: The figure displays the number of successful proofs with different combinations of drafts per problem and sketches per draft. The drafts are by the Minerva ( B) model.

Noticing that the number of successful proofs does not plateau for the Minerva-generated proofs, we investigate how further increasing the number of autoformalization attempts changes the number of problems solved for human-written and language-model-generated proofs. For each problem, we use human informal proof and sample sketches for it; we use the same informal proof drafts by the Minerva (B) language model and sample sketches for each draft. The total number of sketches per problem is in both settings. We plot the number of problems solved with respect to the number of sketches in Figure 5 (left). We find that with human informal proofs, theorems ( on valid/test) have successful formal proofs after attempts, whereas with language-model-generated informal proofs, theorems ( on valid/test) have successful formal proofs after the same number of autoformalization attempts. This suggests that with enough autoformalization attempts, the diversity in language-model-generated informal proofs can benefit the automated formalization process more than the “ground-truth” human informal proofs.

Allocation of autoformalization budget

In Section 4, language models generate informal proof drafts for each mathematical problem and the autoformalizer is used once on each draft. It is likely that some drafts have the potential to be formalized correctly, but do not get to produce a successful sketch because the randomly sampled examples in the prompt are not suitable. We would like to reduce this variance by attempting autoformalization multiple times, but it is expensive to do so. Therefore, we conduct an experiment to investigate the optimal way of allocating the budget between drafts and sketches per draft. For the Minerva ( B) model, we vary the number of informal proof drafts and the number of formal proof sketches per draft, under the constraint that the total number of sketches per problem is fewer than . We present the number of miniF2F problems solved under every combination in Figure 5 (right). The plot shows that when the total number of autoformalization attempts is fixed, increasing the number of drafts per problem yields the most successes on miniF2F.
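The search space of this experiment can be sketched as an enumeration of (drafts, sketches per draft) pairs under a total budget. This is only an illustration: the function names are hypothetical, and the budget is a free parameter here, interpreted as "at most `budget` sketches in total" rather than the exact constraint used in our runs.

```python
def allocations(budget: int):
    """All (drafts, sketches_per_draft) pairs with drafts * sketches_per_draft <= budget."""
    return [
        (drafts, sketches)
        for drafts in range(1, budget + 1)
        for sketches in range(1, budget // drafts + 1)
    ]


def at_full_budget(budget: int):
    """The subset of allocations that spend the whole budget exactly."""
    return [(d, s) for d, s in allocations(budget) if d * s == budget]


# With a budget of 4 total sketches, the budget can be spent as
# 1 draft x 4 sketches, 2 drafts x 2 sketches, or 4 drafts x 1 sketch.
print(at_full_budget(4))
```

Comparing the success counts along one such fixed-budget list is what distinguishes "more drafts" from "more sketches per draft" at equal cost.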

5.3 Memorization

This work utilizes two language models that have been trained on a large amount of internet data. Several prior works (trinh2018simple; carlini2022quantifying) pointed out that such models can memorize some fraction of the data they encounter during training. For drafting informal proofs, we mainly experimented with Minerva. lewkowycz2022solving discussed the memorization effects within Minerva and concluded that they could not find evidence that its abilities are due to memorization. For the autoformalization of proof sketches, the Codex (code-davinci-002) model was used. Its training data was collected before June 2021 (https://beta.openai.com/docs/models/codex-series-private-beta), at which time the miniF2F dataset had not been made public. So the model cannot benefit from memorizing the exact problems and proofs. Therefore, it is inappropriate to attribute the abilities of models used in this paper to memorization.

6 Conclusion

In this paper, we introduced Draft, Sketch, and Prove (DSP), a novel approach that takes advantage of informal proofs to synthesize formal proofs. We demonstrated its feasibility and effectiveness by reaching state-of-the-art performance on the miniF2F dataset with the Isabelle theorem prover. Central to our method are formal proof sketches that mirror the high-level reasoning structures of informal proofs. Our ablations showed that the ability to automatically convert informal proofs to proof sketches is critical to the success of DSP.

Our DSP method differs fundamentally from previous applications of machine learning to formal proof synthesis in two aspects. Firstly, while most approaches in the field focus on improving proof search, our method seeks to construct the entire formal proof structure from the informal proof in one decoding operation. The task of the automated prover is then simplified to filling the gaps between intermediate conjectures. Secondly, while existing approaches operate exclusively on formal data, DSP by design benefits from informal proofs.

In this work, we utilized a purely symbolic automated prover to close the gaps in proof sketches. In the future, we aim to equip DSP with more powerful mechanisms, such as HyperTree Proof Search (lample2022hypertree), to broaden the scope of provable theorems. Similar to AlphaCode (li2022alphacode), we found that the number of generations is crucial for performance. The computational cost of the autoformalizer being a bottleneck in our method, we seek to develop approaches able to generate high-quality proof sketches more efficiently.

Acknowledgements

We thank Rui Yuan and Kunhao Zheng for helping with the informal solutions used in our dataset. We thank Christian Szegedy for his feedback on the early draft.

Funding disclosure

WL is supported by the ERC Advanced Grant ALEXANDRIA (Project GA 742178).

List of contributions

AQJ conceived the idea of using proof sketches and conducted the experiments. SW constructed the first version of the pipeline and the initial autoformalization prompts. JPZ produced the Minerva informal proofs, and helped conduct autoformalization experiments. GL proposed to use inline comments in formal proof sketches to improve alignment. JPZ, YW, and SW performed the case analyses of Minerva solutions. AQJ, TL, GL, SW, and JL contributed to the dataset. AQJ and WL wrote the final autoformalization prompts. MJ is AQJ’s PhD supervisor. AQJ, GL, SW, and TL wrote the paper. YW and GL directed the project.

References

Appendix

Appendix A Conjectures and the declarative proof style

Interactive theorem provers such as Isabelle and Mizar use a declarative proof style (syme1997declare), in which a proof is interleaved with conjectures and their corresponding proofs. syme1997declare stated that the list of conjectures in a declarative proof should be analogous to a proof sketch found in a mathematical textbook and sufficiently convincing for the reader. In practice, ITP users often prove a theorem by writing down a list of conjectures (a “formal sketch”) and then attempting to find a proof of each conjecture (filling a “gap”) with an automated system.
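The relation between a finished proof and its sketch can be illustrated with a small text-processing example. The proofs shown in the figures of the later appendices mark prover-generated steps with `<ATP> ... </ATP>` delimiters; the helper below, whose names are hypothetical and which is not part of the DSP pipeline, recovers the sketch by blanking out those spans and counts the gaps that the automated prover had to fill.

```python
import re

# Non-greedy match so that each <ATP> ... </ATP> pair is treated as one gap,
# even when the enclosed proof step spans several lines.
ATP_SPAN = re.compile(r"<ATP>.*?</ATP>", re.DOTALL)


def to_sketch(full_proof: str, gap: str = "<ATP> </ATP>") -> str:
    """Replace every prover-generated step by an empty gap marker."""
    return ATP_SPAN.sub(gap, full_proof)


def count_gaps(full_proof: str) -> int:
    """Number of gaps the automated prover closed in this proof."""
    return len(ATP_SPAN.findall(full_proof))


proof = '''have c0: "f = 11 - 3*z"
  using h0
  <ATP> by (auto simp: field_simps)  </ATP>
show ?thesis
  using c2 c3
  <ATP> by auto  </ATP>'''

print(count_gaps(proof))
print(to_sketch(proof))
```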

Appendix B Sledgehammer

Sledgehammer (paulson2010three) is a powerful system that automates reasoning with the interactive theorem prover Isabelle. It works by flattening the goals encoded in the higher-order logic used by Isabelle/HOL into other logics (e.g., first-order logic) which can then be fed into automated theorem provers such as E (https://wwwlehre.dhbw-stuttgart.de/~sschulz/E/E.html), CVC4 (https://cvc4.github.io/index.html), Z3 (https://github.com/Z3Prover/z3), Vampire (https://vprover.github.io/), and SPASS (https://www.spass-prover.org/download/index.html). If any of these automated theorem provers succeeds in finding the proof in its own format, Sledgehammer reconstructs the proof in Isabelle/HOL with certified provers (metis, meson, and smt), which is relatively more interpretable by humans.

As a practical example of using Sledgehammer, one can declare a conjecture in Isabelle/HOL: have "4 dvd (a::nat) \<Longrightarrow> 2 dvd a" and call Sledgehammer immediately afterwards. If Sledgehammer succeeds, it will return a proof step that proves the conjecture. In this example, the step is by (meson dvd_trans even_numeral), which uses the meson resolution prover and two facts: that the divisibility relation is transitive and that 4 is an even number. If Sledgehammer does not find the proof or times out, it will report failure.
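Written out by hand, the reasoning behind this proof step is short. The derivation below is our own informal reading of how the two facts combine, not output produced by Sledgehammer:

```latex
\[
  2 \mid 4 \quad\text{and}\quad 4 \mid a
  \;\Longrightarrow\; 2 \mid a ,
  \qquad\text{since } a = 4k = 2\,(2k) \text{ for some natural number } k .
\]
```

The first premise corresponds to the evenness fact and the implication to transitivity of divisibility.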

Appendix C A proof of an International Mathematical Olympiad problem

With the Minerva-generated solutions, a proof of the problem imo_1959_p1 is discovered. This is the first problem of the first ever International Mathematical Olympiad (IMO). The informal problem statement, Minerva-generated informal solution, and DSP’s formal proof are shown in Figure 6.

In Figure 6, we can see that the autoformalizer in DSP (a large language model) copies over parts of the informal proof generated by Minerva as in-line comments to precede their corresponding formal proof blocks. The formal proof does not use the first sentence of the informal proof, as it is already identical to the formal statement. We also notice that the large language model selects relevant premises after writing down the conjectures (the steps starting with using), even though not every premise is strictly needed.

Statement: Prove that the fraction $\frac{21n+4}{14n+3}$ is irreducible for every natural number $n$.

Informal Proof (Minerva 62B):
We must show that $\gcd(21n+4, 14n+3) = 1$.
The Euclidean algorithm gives
$21n+4 = 1\cdot(14n+3) + 7n+1$
$14n+3 = 2\cdot(7n+1) + 1$.
Since $\gcd(7n+1,1)=1$, we have $\gcd(21n+4,14n+3)=1$.

Formal Proof:

theorem imo_1959_p1:
  fixes n :: nat
  shows "gcd (21*n + 4) (14*n + 3) = 1"
proof -
  (* The Euclidean algorithm gives
  21n+4=1\cdot(14n+3)+7n+1
  14n+3=2\cdot(7n+1)+1. *)
  have c0: "21*n + 4 = 1*(14*n + 3) + 7*n + 1"
    <ATP> by auto  </ATP>
  have c1: "14*n + 3 = 2*(7*n + 1) + 1" using c0
    <ATP> by auto  </ATP>
  (* Since \gcd(7n+1,1)=1, we have \gcd(21n+4,14n+3)=1. *)
  then have "gcd (7*n + 1) 1 = 1" using c1
    <ATP> by auto  </ATP>
  then have "gcd (21*n + 4) (14*n + 3) = 1" using c1
    <ATP> by (smt (z3) BitM_plus_one ab_semigroup_add_class.add_ac(1)
    add.assoc c0 gcd.commute gcd_add2 gcd_add_mult mult_numeral_1
    numeral_One numeral_eq_Suc numerals(1) semiring_norm(3))  </ATP>
  then show ?thesis using c1 <ATP> by blast  </ATP>
qed
Figure 6: IMO proof guided by a Minerva informal proof. An informal proof of the International Math Olympiad problem imo_1959_p1 generated by Minerva that led to a successful formal proof. The steps enclosed by ATP delimiters are generated by an automated theorem prover and the rest are by the DSP autoformalizer.

The formal proof creates 5 conjectures (4 have statements and 1 show statement), which are all subsequently proved by our automated theorem prover. The step to prove the statement have "gcd (21*n + 4) (14*n + 3) = 1" involves the verified low-level provers smt and z3 and lemmas/facts from outside the scope of the language model. It is highly unlikely that either the large language model or the automated theorem prover could finish this proof on its own.
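As a numerical sanity check (not a proof, and not part of DSP), the two Euclidean-algorithm identities used as conjectures c0 and c1 in Figure 6 can be tested for many values of n. The helper names below are hypothetical.

```python
from math import gcd


def euclid_steps(a: int, b: int):
    """Rows (dividend, quotient, divisor, remainder) of the Euclidean algorithm on (a, b)."""
    steps = []
    while b != 0:
        quotient, remainder = divmod(a, b)
        steps.append((a, quotient, b, remainder))
        a, b = b, remainder
    return steps


def check_imo_1959_p1(n: int) -> bool:
    """Check conjectures c0 and c1 from Figure 6 and the final gcd claim for one n."""
    c0 = 21 * n + 4 == 1 * (14 * n + 3) + 7 * n + 1
    c1 = 14 * n + 3 == 2 * (7 * n + 1) + 1
    return c0 and c1 and gcd(21 * n + 4, 14 * n + 3) == 1


# For n = 1 the first two rows read 25 = 1*17 + 8 and 17 = 2*8 + 1.
print(euclid_steps(21 * 1 + 4, 14 * 1 + 3))
print(all(check_imo_1959_p1(n) for n in range(10_000)))
```

A finite check like this only builds confidence in the conjectures; the formal proof is what establishes the statement for every natural number.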

Unsuccessful human-written proof.

In contrast, the human-written informal proof of this IMO problem did not lead to a successful formal proof. The human-written proof is:

Denoting the greatest common divisor of $a, b$ as $(a, b)$, we use the Euclidean algorithm:

$(21n+4, 14n+3) = (7n+1, 14n+3) = (7n+1, 1) = 1$

It follows that $\frac{21n+4}{14n+3}$ is irreducible. Q.E.D.

A key difference between the Minerva proof and the human proof is how the use of the Euclidean algorithm is described. The Minerva proof explicitly writes out the results of the Euclidean algorithm (e.g. $14n+3 = 2\cdot(7n+1)+1$), which are translated into the sketch (c1 in Figure 6). The human proof introduces new notation to express the results indirectly in terms of greatest common divisors, which ends up being less suitable for sketching. For example, below is a sketch generated with the human proof, which has a conjecture that is semantically incorrect and hence cannot be closed by the automated prover:

theorem
  fixes n :: nat
  shows "gcd (21*n + 4) (14*n + 3) = 1"
proof -
  have "(21*n + 4, 14*n + 3) = (7*n + 1, 14*n + 3)"
    ATP  (* <--- UNSUCCESSFUL *)
  also have "... = (7*n + 1, 1)"
    ATP
  finally show ?thesis
    ATP
qed
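One plausible reading of why the first conjecture fails, offered here as our interpretation: in Isabelle/HOL the syntax (a, b) denotes an ordered pair rather than a greatest common divisor, so the conjecture asserts that two pairs are equal component by component. The Python analogue below, with hypothetical function names, contrasts that literal reading with the intended one.

```python
from math import gcd


def pair_conjecture(n: int) -> bool:
    """Literal reading: the two ordered pairs are equal component-wise."""
    return (21 * n + 4, 14 * n + 3) == (7 * n + 1, 14 * n + 3)


def gcd_conjecture(n: int) -> bool:
    """Intended reading: the two greatest common divisors are equal."""
    return gcd(21 * n + 4, 14 * n + 3) == gcd(7 * n + 1, 14 * n + 3)


# The pair equality would need 21n + 4 = 7n + 1, which no natural number satisfies,
# while the gcd equality holds for every n tested.
print(any(pair_conjecture(n) for n in range(1000)))
print(all(gcd_conjecture(n) for n in range(1000)))
```

Under this reading, no automated prover could close the gap, because the stated conjecture is false even though the informal step it came from is valid.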

Appendix D More case analyses of human and Minerva informal proofs

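Statement: Show that for any real number $x$, if $|x-1| + |x| + |x+1| = x + 2$, then $0 \leq x \leq 1$.

Informal Proof (Human):
If $x \leq -1$, then $|x-1| + |x| + |x+1| = -3x$. So, $-3x = x+2$ and $x = -\frac{1}{2}$, which is a contradiction. If $-1 < x < 0$, then $|x-1| + |x| + |x+1| = 2-x$. So, $2-x = x+2$ and $x = 0$, which is a contradiction. If $x > 1$, then $|x-1| + |x| + |x+1| = 3x$. So, $3x = x+2$ and $x = 1$, which is a contradiction. As a result, the only possible values for $x$ are between $0$ and $1$ and $0 \leq x \leq 1$.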

Formal Proof:

theorem algebra_absxm1pabsxpabsxp1eqxp2_0leqxleq1:
  fixes x ::real assumes "abs (x - 1) + abs x + abs (x + 1) = x + 2"
  shows "0 \<le> x \<and> x \<le> 1"
proof -
  (* If x \leq -1, then |x-1| + |x| + |x+1| = -3x.
     So, -3x = x+2 and x = -1/2, which is a contradiction. *)
  have c0: "x \<le> -1 \<Longrightarrow> False"
  proof -
    assume c1: "x \<le> -1"
    have c2: "abs(x-1) + abs x + abs(x+1) = -(x-1) - x - (x+1)" using c1
      <ATP> by auto </ATP>
    then have c3: "abs(x-1) + abs x + abs(x+1) = -3*x" <ATP> by auto </ATP>
    then have c4: "-3*x = x+2" using assms c3 <ATP> by auto </ATP>
    then have c5: "x = -1/2" <ATP> by auto </ATP>
    then show ?thesis using c1 <ATP> by auto </ATP>
  qed
  (* If -1 < x < 0, then |x-1| + |x| + |x+1| = 2-x.
     So, 2-x = x+2 and x = 0, which is a contradiction. *)
  have c6: "-1 < x \<Longrightarrow> x < 0 \<Longrightarrow> False"
  proof -
    assume c7: "-1 < x" assume c8: "x < 0"
    have c9: "abs(x-1) + abs x + abs(x+1) = -(x-1) - x + (x+1)" using c7 c8
      <ATP> by auto </ATP>
    then have c10: "abs(x-1) + abs x + abs(x+1) = 2-x" <ATP> by auto </ATP>
    then have c11: "2-x = x+2" using assms c10 <ATP> by auto </ATP>
    then have c12: "x = 0" <ATP> by auto </ATP>
    then show ?thesis using c8 <ATP> by auto </ATP>
  qed
  (* If x > 1, then |x-1| + |x| + |x+1| = 3x.
     So, 3x = x+2 and x = 1, which is a contradiction. *)
  have c13: "x > 1 \<Longrightarrow> False"
  proof -
    assume c14: "x > 1"
    have c15: "abs(x-1) + abs x + abs(x+1) = x-1 + x + (x+1)" using c14
      <ATP> by auto </ATP>
    then have c16: "abs(x-1) + abs x + abs(x+1) = 3*x" <ATP> by auto </ATP>
    then have c17: "3*x = x+2" using assms c16 <ATP> by auto </ATP>
    then have c18: "x = 1" <ATP> by auto </ATP>
    then show ?thesis using c14 <ATP> by auto </ATP>
  qed
  (* As a result, the only possible values for x are between 0 and 1 and 0 \leq x \leq 1.*)
  then show ?thesis using c0 c6 c13 <ATP> by fastforce </ATP>
qed
Figure 7: Algebra example with human informal proof. A human informal proof successfully guides the formal proof sketch to divide the problem into three cases and derive a contradiction in each of them before showing the final objective. The complexity and the consistency of the formal proof sketch are impressive.
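The case split in Figure 7 can be cross-checked numerically. The sketch below is an illustrative check with hypothetical names: it evaluates the assumption from the theorem statement on a grid of exact rationals and confirms that it holds only for grid points between 0 and 1.

```python
from fractions import Fraction


def satisfies(x: Fraction) -> bool:
    """The assumption of the theorem: |x - 1| + |x| + |x + 1| = x + 2."""
    return abs(x - 1) + abs(x) + abs(x + 1) == x + 2


# Exact rational grid from -3 to 3 in steps of 1/10 (no floating-point error).
grid = [Fraction(k, 10) for k in range(-30, 31)]
solutions = [x for x in grid if satisfies(x)]

print(min(solutions), max(solutions))
```

On this grid every point in [0, 1] satisfies the equation and no point outside does, which matches the three contradictions derived in the sketch.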

Statement: Given that $f + 3z = 11$ and $3(f - 1) - 5z = -68$, show that $f = -10$ and $z = 7$.

Informal Proof (Minerva 62B):
We can solve for $f$ in the first equation, obtaining $f = 11 - 3z$.
Then we can substitute this expression for $f$ into the second equation,
obtaining $3(11 - 3z - 1) - 5z = -68$.
Solving for $z$, we obtain $z = 7$.
Then, we can substitute this value of $z$ into the expression for $f$, obtaining $f = 11 - 3 \cdot 7 = -10$.

Formal Proof:

theorem algebra_2varlineareq_fp3zeq11_3tfm1m5zeqn68_feqn10_zeq7:
  fixes f z::complex
  assumes h0: "f + 3*z = 11"
      and h1: "3*(f - 1) - 5*z = -68"
    shows "f = -10 \<and> z = 7"
proof -
  (* We can solve for f in the first equation, obtaining f = 11 - 3z. *)
  have c0: "f = 11 - 3*z"
    using h0
    <ATP> by (auto simp: field_simps)  </ATP>
  (* Then we can substitute this expression for f into the second equation,
     obtaining 3(11 - 3z - 1) - 5z = -68. *)
  have c1: "3*(11 - 3*z - 1) - 5*z = -68"
    using h1 c0
    <ATP> by auto  </ATP>
  (* Solving for z, we obtain z = 7. *)
  have c2: "z = 7"
    using c1
    <ATP> by auto  </ATP>
  (* Then, we can substitute this value of z into the expression for f,
     obtaining f = 11 - 3 \cdot 7 = -10.*)
  have "f = 11 - 3*7"
    using c0 c2
    <ATP> by auto  </ATP>
  then have c3: "f = -10"
    <ATP> by auto  </ATP>
  show ?thesis
    using c2 c3
    <ATP> by auto  </ATP>
qed
Figure 8: Algebra example with Minerva informal proof. An informal proof generated by Minerva that led to a successful formal proof. The autoformalizer generated a proof sketch containing all lines of the formal proof except for those delimited by the ATP tags. The sketch is structured according to the informal proof, containing five intermediate conjectures based on the informal proof. The autoformalizer generated in-line comments in the proof sketch (shown in red), which correctly identified an alignment between the formal and informal proofs.
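The substitution steps mirrored by conjectures c0 to c3 in Figure 8 can be replayed with exact arithmetic. This is a minimal sketch with a hypothetical function name; it works over the rationals, whereas the formal statement is over the complex numbers.

```python
from fractions import Fraction


def solve_by_substitution():
    """Solve f + 3z = 11 and 3(f - 1) - 5z = -68 by substitution."""
    # c0: f = 11 - 3z.
    # c1: 3(11 - 3z - 1) - 5z = -68, i.e. 30 - 14z = -68.
    z = Fraction(-68 - 30, -14)   # c2: z = 7
    f = 11 - 3 * z                # c3: f = 11 - 3*7 = -10
    return f, z


f, z = solve_by_substitution()
print(f, z)
```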

Statement: For any integer $a$, show that $a^2 \equiv 0 \pmod{4}$ or $a^2 \equiv 1 \pmod{4}$.

Informal Proof (Minerva 62B):
Let $a=2k$ for some integer $k$.
Then $a^2=4k^2$.
Since $4k^2$ is divisible by $4$, we have $a^2 \equiv 0 \pmod{4}$.
Now, let $a=2k+1$ for some integer $k$.
Then $a^2=4k^2+4k+1=4(k^2+k)+1$.
Since $k^2+k$ is an integer, $4(k^2+k)+1$ is not divisible by $4$.
Thus, $a^2 \equiv 1 \pmod{4}$.

Formal Proof:

theorem numbertheory_sqmod4in01d:
  fixes a :: int
  shows "(a^2 mod 4 = 0)
  \<or> (a^2 mod 4 = 1)"
proof (cases "even a")
case True
 (* Let a=2k for some integer k.
    Then a^2=4k^2.
    Since 4k^2 is divisible by 4,
    we have a^2 \equiv 0 \pmod{4}.*)
 then obtain k where "a=2*k"
   using evenE
   <ATP> by auto  </ATP>
 then have "a^2 = 4*k^2"
   unfolding power2_eq_square
   <ATP> by auto  </ATP>
 then have "a^2 mod 4 = 0"
   <ATP> by auto  </ATP>
 then show ?thesis
   <ATP> by auto  </ATP>
next
case False
 (* Now, let a=2k+1 for some integer k.
    Then a^2=4k^2+4k+1=4(k^2+k)+1.
    Since k^2+k is an integer,
    4(k^2+k)+1 is not divisible by 4.
    Thus, a^2 \equiv 1 \pmod{4}.*)
 then obtain k where "a=2*k+1"
  using oddE <ATP> by auto  </ATP>
 then have "a^2 = 4*k^2+4*k+1"
  unfolding power2_eq_square <ATP>
    by (auto simp: field_simps)  </ATP>
 then have "a^2 mod 4 = 1"
  <ATP> by presburger  </ATP>
 then show ?thesis
  <ATP> by auto  </ATP>
qed

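Statement: For any integer $a$, show that $a^2 \equiv 0 \pmod{4}$ or $a^2 \equiv 1 \pmod{4}$.

Informal Proof (Human):
$a \pmod 4 \in \{0, 1, 2, 3\}$.
Using that for any natural number $k$, $a \equiv b \pmod 4$ implies $a^k \equiv b^k \pmod 4$, we have $a^2 \pmod 4 \in \{0, 1, 4, 9\}$.
Since $4 \equiv 0 \pmod 4$ and $9 \equiv 1 \pmod 4$, the result follows.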

Formal Proof:

theorem numbertheory_sqmod4in01d:
  fixes a :: int
  shows "(a^2 mod 4 = 0)
  \<or> (a^2 mod 4 = 1)"
proof -
(* a \pmod 4 \in {0, 1, 2, 3}. *)
have c0: "a mod 4 \<in> {0, 1, 2, 3}"
  <ATP> by auto  </ATP>
(* Using that for any natural number k,
   a \equiv b \pmod 4 implies
   a^k \equiv b^k \pmod 4,*)
have "a^2 mod 4 = (a mod 4)\<^sup>2 mod 4"
  <ATP> by (smt (z3) numeral_eq_Suc power_mod)  </ATP>
(* we have
   a^2 \pmod 4 \in {0, 1, 4, 9}. *)
also have "... \<in> {0, 1, 4, 9}"
  using c0
  <ATP> by auto  </ATP>
(* Since 4 \equiv 0 \pmod 4 and
   9 \equiv 1 \pmod 4,
   the result follows. *)
finally show ?thesis
  <ATP> by auto  </ATP>
qed
Figure 9: Alternative proofs: Minerva-generated (left) and human-written (right). In both proofs, the formal sketch is structured based on the informal proof. The Minerva informal proof and its sketch break the proof into even and odd cases. These cases are not explicitly stated in the informal proof, and the formal sketch makes them explicit (cases “even a”). Each case has three conjectures that are directly based on the informal proof. The sketch excludes parts that are not needed in the formal proof, for instance “Since $k^2+k$ is an integer, $4(k^2+k)+1$ is not divisible by 4”. The human proof uses a different strategy than the Minerva proof, based on the facts that $a \pmod 4 \in \{0, 1, 2, 3\}$ and $a^2 \pmod 4 \in \{0, 1, 4, 9\}$. The sketch uses these as conjectures, synthesizes an intermediate step not in the informal proof, and excludes the last step of the informal proof.
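Both strategies in Figure 9 can be checked by brute force over a finite range. The code below is only a sanity check with hypothetical names, not a substitute for either proof.

```python
def square_residues(modulus: int, bound: int = 1000) -> set:
    """Set of values a^2 mod modulus for all integers a with |a| <= bound."""
    return {(a * a) % modulus for a in range(-bound, bound + 1)}


# Minerva's route: even a gives residue 0, odd a gives residue 1.
# Python's % returns a non-negative result for a positive modulus, also for negative a.
parity_route_ok = all(
    (a * a) % 4 == (0 if a % 2 == 0 else 1) for a in range(-50, 51)
)

# Human route: square each residue class, then reduce {0, 1, 4, 9} modulo 4.
residue_route = {(r * r) % 4 for r in (0, 1, 2, 3)}

print(square_residues(4), parity_route_ok, residue_route)
```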

Statement: For each plumbing repair job, Mr. Wrench charges $N$ dollars for coming out to the house plus $x$ dollars per hour that he works at the house. He charged $\$97$ for a one-hour repair job and $\$265$ for a five-hour repair job. What is his charge for a two-hour repair job? Show that it is $\$139$.

Informal Proof (Minerva 62B):
Let $N$ be the number of dollars charged for coming out to the house and let $x$ be the number of dollars per hour that Mr. Wrench works.
For the one-hour repair job, we have that the total cost is $N + x = 97$.
For the five-hour repair job, we have that the total cost is $N + 5x = 265$.
We subtract the first equation from the second equation to obtain $4x = 168 \Rightarrow x = 42$.
Plugging this back into the first equation, we obtain that $N + 42 = 97 \Rightarrow N = 55$.
Therefore, for a two-hour repair job, the total cost is $N + 2x = 55 + 2 \cdot 42 = 139$.

Formal Proof:

theorem mathd_algebra_160:
  fixes n x ::real
  assumes "n + x = 97"
  and "n + 5 * x = 265"
  shows "n + 2 * x = 139"
proof -
  (* We subtract the first equation
  from the second equation to obtain
  4x = 168 \Rightarrow x = 42. *)
  have "4 * x = 168"
  using assms <ATP> by auto  </ATP>
  then have "x = 42"
  <ATP> by auto  </ATP>
  (* Plugging this back into
  the first equation, we obtain that
  N + 42 = 97 \Rightarrow N = 55. *)
  then have "n = 55"
  using assms <ATP> by auto  </ATP>
  (* Therefore, for a two-hour repair
  job, the total cost is
  139}}\textit{\textcolor{patriarch}{<ATP>}}\langle\rangle\textit{\textcolor{patriarch}{ </ATP>}}
qed*)
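The arithmetic behind the three conjectures of this sketch (4 * x = 168, x = 42, n = 55) can be replayed directly. The function below is an illustrative check with hypothetical names and default values taken from the theorem's assumptions.

```python
from fractions import Fraction


def two_hour_charge(one_hour: int = 97, five_hour: int = 265):
    """Recover the call-out fee n and hourly rate x, then price a two-hour job."""
    # Subtracting n + x = one_hour from n + 5x = five_hour leaves 4x.
    x = Fraction(five_hour - one_hour, 4)
    n = one_hour - x
    return n, x, n + 2 * x


print(two_hour_charge())
```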

Statement: Show that for any real number and any natural number , if , then .

Informal Proof (Minerva 62B):
This is true for . Now, suppose that this is true for .
Then we have that

and

Therefore, this is true for .

Formal Proof:

theorem induction_1pxpownlt1pnx:
  fixes x :: real and n :: nat
  assumes "-1