Towards Evolutionary Theorem Proving for Isabelle/HOL

04/17/2019 ∙ by Yutaka Nagashima, et al. ∙ 0

Mechanized theorem proving is becoming the basis of reliable systems programming and rigorous mathematics. Despite decades of progress in proof automation, writing mechanized proofs still requires engineers' expertise and remains labor intensive. Recently, researchers have extracted heuristics of interactive proof development from existing large proof corpora using supervised learning. However, such existing proof corpora present only one way of proving conjectures, while there are often multiple equivalently effective ways to prove one conjecture. In this abstract, we identify challenges in discovering heuristics for automatic proof search and propose our novel approach to improve heuristics of automatic proof search in Isabelle/HOL using evolutionary computation.




1. Background

1.1. Interactive Theorem Proving

Interactive theorem provers (ITPs) are forming the basis of reliable software engineering. Klein et al. proved the correctness of the seL4 micro-kernel using Isabelle/HOL (Klein et al., 2010). Leroy developed a verified optimizing C compiler, CompCert, in Coq (Leroy, 2009). Kumar et al. built a verified compiler for a functional programming language, CakeML, in HOL4 (Kumar et al., 2014). In mathematics, mathematicians are replacing their pen-and-paper proofs with mechanized proofs to avoid human errors: Hales et al. mechanically proved the Kepler conjecture using HOL Light and Isabelle/HOL (Hales et al., 2015), whereas Gonthier et al. proved the four colour theorem in Coq (Gonthier, 2007). In theoretical computer science, Paulson proved Gödel’s incompleteness theorems using Nominal Isabelle (Paulson, 2015).

1.2. Meta-Tool Approach for Proof Automation

To facilitate efficient proof developments in large scale verification projects, modern ITPs are equipped with many sub-tools, such as proof methods and tactics. For example, Isabelle/HOL comes with 160 proof methods defined in its standard library. These sub-tools provide useful automation for interactive proof development.


Nagashima et al. presented PSL, a proof strategy language (Nagashima and Kumar, 2017), for Isabelle/HOL. PSL is a programmable, extensible, meta-tool based framework, which allows Isabelle users to encode abstract descriptions of how to attack proof obligations.

Given a PSL strategy and a proof obligation, PSL’s runtime system first creates various versions of the proof methods specified by the strategy, each of which is tailored to the proof obligation, and combines them both sequentially and non-deterministically, exploring the search space by applying these created proof methods.

The default strategy, try_hard, outperformed sledgehammer, the state-of-the-art proof automation for Isabelle/HOL, by 16 percentage points when tested against 1,526 proof obligations with a timeout of 300 seconds. However, the dependence on the fixed default strategy impairs PSL’s runtime system: try_hard sometimes produces proof methods that are, to human engineers, obviously inappropriate for the given proof obligations.


Nagashima et al. developed PaMpeR (Nagashima and He, 2018), a proof method recommendation tool, to further automate proof development in Isabelle/HOL. PaMpeR learns when to use which proof methods from a large human-written proof corpus, the Archive of Formal Proofs (AFP) (Klein et al., 2004). The AFP is an online journal that hosts various formalization projects and mechanized proof scripts. Currently, the AFP consists of 460 articles with 126,100 lemmas written by 303 authors in total.


PaMpeR first preprocesses this database: it applies 108 assertions to each (possibly intermediate) proof obligation appearing in the AFP and converts each of them into a vector of boolean values. This way, PaMpeR creates 425,334 data points, each of which is tagged with the name of the proof method chosen by a human engineer to attack the obligation represented by the corresponding vector. Then, PaMpeR applies a multi-output regression tree construction algorithm to the database. This process builds a regression tree for each proof method. For instance, PaMpeR builds the following tree for the induct method:

(1, (10, expectation 0.0110944442872,
         expectation 0.00345987448177),
    (10, expectation 0.0510162518838,
         expectation 0.0102138733024))

where each of 1 and 10 in the first elements of the pairs represents the number of the corresponding assertion. For example, this tree says that for proof obligations to which assertion 1 returns false but assertion 10 returns true, the chance of an experienced proof engineer using the induct method is about 5.1%.

When a user of PaMpeR asks for a recommendation, PaMpeR transforms the proof obligation at hand into a vector of boolean values, looks up the trees, and presents its recommendations.
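The lookup described above can be sketched in Python as follows. This is only an illustration, not PaMpeR's actual implementation: the names `Tree`, `expectation`, and `recommend` are hypothetical, assertions are assumed to be numbered from 1, and the branch order (first subtree when the assertion returns true, second when it returns false) is inferred from the 5.1% example in the text.

```python
from typing import Dict, List, Sequence, Tuple, Union

# A tree is either a leaf (an expectation) or a triple
# (assertion_number, subtree_if_true, subtree_if_false).
# The branch order is an assumption inferred from the example in the text.
Tree = Union[float, Tuple[int, "Tree", "Tree"]]

# The tree quoted in the text for the induct method.
INDUCT_TREE: Tree = (
    1,
    (10, 0.0110944442872, 0.00345987448177),
    (10, 0.0510162518838, 0.0102138733024),
)


def expectation(tree: Tree, features: Sequence[bool]) -> float:
    """Walk one regression tree using the boolean feature vector."""
    while not isinstance(tree, float):
        assertion, if_true, if_false = tree
        # Assertions are assumed to be numbered from 1; the vector is 0-indexed.
        tree = if_true if features[assertion - 1] else if_false
    return tree


def recommend(
    trees: Dict[str, Tree], features: Sequence[bool], top: int = 3
) -> List[Tuple[str, float]]:
    """Rank proof methods by the expectation their tree assigns to the vector."""
    scores = {name: expectation(tree, features) for name, tree in trees.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top]
```

Each proof method is scored independently by its own tree, which matches the one-tree-per-method construction described above.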

PaMpeR’s regression tree construction is based on a problem transformation method, which handles a multi-output problem as a set of independent single-output problems: for each obligation, PaMpeR attempts to provide multiple promising proof methods to attack the obligation by computing, one method at a time, how likely each proof method is to be useful for the obligation.

PaMpeR is not optimal to guide PSL

One would imagine that a natural step forward is to improve PSL’s default strategy by allowing PaMpeR to choose the most promising strategy for a given problem instead of always naively applying the fixed strategy, try_hard.

Despite the positive results of cross-validation reported by Nagashima et al., PaMpeR’s recommendation is not necessarily optimal to guide an automatic meta-tool based proof search for two reasons. First, PaMpeR recommends only one step of proof method application, even though many proof methods, such as induction, can discharge proof obligations only when followed by appropriate proof methods, such as auto, which is a general purpose proof method in Isabelle/HOL. Second, when PaMpeR transforms a multi-output problem to a set of single-output problems, PaMpeR preprocesses the database, introducing a conservative estimate of the correct choice of proof methods. In the above example, PaMpeR’s pre-processor produces the following data point for all databases corresponding to proof methods that are not induction:

not, [1,0,0,1,0,0,0,0,1,0,0,1,0,...]

We know that this conservative estimate wrongfully lowers the expectation for other proof methods in this case. For example, Isabelle/HOL has multiple proof methods for induction, such as induct and induct_tac. Experienced engineers know induction is a valid choice for most proof obligations where induct is used. Unfortunately, it is not computationally feasible to find all alternative proofs for a proof obligation, since many proof methods return intermediate proof obligations that have to be discharged by other methods, and even equivalently effective methods for the same obligation may return distinct intermediate proof obligations. In the above example, even though both induct and induction are the right choice for many proof obligations, they return slightly different intermediate proof goals in most cases, making it difficult to decide systematically whether induct would also have been the right method where human engineers used the induction method.
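The problem transformation discussed above can be illustrated with a small Python sketch. It is not PaMpeR's actual pre-processor: the function name `split_by_method` and the label `"used"` are hypothetical, and only the label `"not"` is taken from the data point shown in the text.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[int, ...]


def split_by_method(
    corpus: Sequence[Tuple[str, Vector]], methods: Sequence[str]
) -> Dict[str, List[Tuple[str, Vector]]]:
    """Turn one multi-output data set into one single-output data set per method.

    Every data point is copied into every method's data set. It is labelled
    "used" only for the method the human engineer actually chose and "not"
    for all others, even if another method would have worked equally well.
    """
    datasets: Dict[str, List[Tuple[str, Vector]]] = {m: [] for m in methods}
    for used_method, vector in corpus:
        for method in methods:
            label = "used" if method == used_method else "not"
            datasets[method].append((label, vector))
    return datasets


# One obligation that a human attacked with induct (toy 4-element vector).
corpus = [("induct", (1, 0, 0, 1))]
datasets = split_by_method(corpus, ["induct", "induction", "auto"])
```

In this toy run the data set for induction receives the point labelled "not", which is exactly the conservative estimate the text criticizes: the label records what the human did, not whether induction would also have succeeded.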

2. Evolutionary Prover in Isabelle/HOL

We propose a novel approach based on evolutionary computation to overcome the aforementioned limitations of method recommendation based on supervised learning. Our objective is to discover heuristics to choose the most promising PSL strategy out of many hand written default strategies when applied to a given proof goal, so that PSL can exploit computational resources more effectively.

We represent programs as a sequence of floating-point numbers, each of which corresponds to a combination of results of applying assertions to a proof obligation. PaMpeR learned 239 proof methods from the AFP and built a tree of height two for each of them; therefore, we represent a program as a sequence of floating-point numbers of length 956, which is the total number of leaf nodes in all regression trees (239 trees with 4 leaves each). Then, we assign such a sequence to each default proof strategy. Our prover first applies assertions to categorize a proof obligation, then determines and applies the most promising strategy for that obligation.
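A minimal Python sketch of this representation follows. The text does not say how the 956 numbers are combined into a decision, so the scoring rule here (sum the weights of the leaves an obligation falls into, then pick the strategy with the highest sum) is an assumption made purely for illustration, as are the names `active_leaves`, `score`, and `choose_strategy` and the encoding of each tree as three 0-based assertion indices.

```python
from typing import Dict, List, Sequence, Tuple

NUM_METHODS = 239       # proof methods learned from the AFP (from the text)
LEAVES_PER_TREE = 4     # a tree of height two has four leaves
GENOME_LENGTH = NUM_METHODS * LEAVES_PER_TREE  # 956, as stated in the text

# Hypothetical encoding of one height-two tree: 0-based indices of the
# root assertion and of the assertions tested in its true and false branches.
TreeShape = Tuple[int, int, int]


def active_leaves(features: Sequence[bool], trees: Sequence[TreeShape]) -> List[int]:
    """Return, for each tree, the global index of the leaf the obligation reaches."""
    leaves = []
    for i, (root, if_true, if_false) in enumerate(trees):
        if features[root]:
            offset = 0 if features[if_true] else 1
        else:
            offset = 2 if features[if_false] else 3
        leaves.append(i * LEAVES_PER_TREE + offset)
    return leaves


def score(genome: Sequence[float], leaves: Sequence[int]) -> float:
    """Assumed scoring rule: sum the weights of the reached leaves."""
    return sum(genome[k] for k in leaves)


def choose_strategy(genomes: Dict[str, Sequence[float]], leaves: Sequence[int]) -> str:
    """Pick the strategy whose genome scores highest on this obligation."""
    return max(genomes, key=lambda name: score(genomes[name], leaves))
```

With the real trees, `trees` would have 239 entries and every genome would have `GENOME_LENGTH` entries, one genome per hand-written default strategy.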

As a training data set, we randomly pick a set of proof obligations from large proof corpora. We then measure how many obligations in this data set each version of our prover can discharge given a fixed timeout for each obligation. The more proof goals in the data set a prover can discharge, the better the prover is.

After each iteration, we mutate the program, which is a mapping function from a combination of results of assertions to the likelihood of each strategy being promising to the corresponding proof obligations. After each evaluation, we select provers with higher success rates and leave them for the next iteration, while discarding those with lower success rates.
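The mutate, evaluate, and select cycle described above can be sketched in Python as below. Since the experiment design is still open, everything here is an assumption: the Gaussian mutation, the parameters `rate` and `sigma`, the number of survivors, and the names `mutate` and `evolve`. In the intended setting, `fitness` would count the training obligations a prover discharges within the timeout; the sketch takes it as an arbitrary callable.

```python
import random
from typing import Callable, List

Genome = List[float]


def mutate(genome: Genome, rng: random.Random,
           rate: float = 0.05, sigma: float = 0.1) -> Genome:
    """Perturb each number with probability `rate` by Gaussian noise (assumed operator)."""
    return [g + rng.gauss(0.0, sigma) if rng.random() < rate else g for g in genome]


def evolve(population: List[Genome], fitness: Callable[[Genome], float],
           generations: int, survivors: int, rng: random.Random) -> Genome:
    """Keep the fittest genomes each generation and refill with mutated copies."""
    for _ in range(generations):
        # Evaluate: rank every candidate by its success measure.
        ranked = sorted(population, key=fitness, reverse=True)
        # Select: candidates with higher success rates survive unchanged.
        parents = ranked[:survivors]
        # Mutate: the discarded slots are refilled with perturbed survivors.
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(len(population) - survivors)]
        population = parents + children
    return max(population, key=fitness)
```

Because the survivors are carried over unchanged, the best fitness in the population never decreases from one generation to the next. This matters when each evaluation is as expensive as running a prover over a whole training set.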

We are still designing the details of the aforementioned experiment. We expect that, when combined with the goal-oriented conjecturing mechanism (Nagashima and Parsert, 2018), this project will lead to the meta-tool based smart proof search in Isabelle/HOL initially proposed in 2017 (Nagashima, 2017).