1 Introduction
Interactive Theorem Provers (ITPs) are functional programming languages for writing and verifying proofs in type theory. Some of them, like Coq or Agda, feature dependent and higher-order inductive types. The ITP community has developed several methods to improve automation, e.g. special tactic languages (Ltac in Coq [Coq]) or special libraries (e.g. SSReflect [SSReflect]). Some provers are hybrids between automated theorem proving and ITP, e.g. ACL2 [KMM001]. The Isabelle/HOL community pioneered methods of directly interfacing interactive provers with third-party automated provers [BlanchetteKPU16], which also work for a subset of Coq [CzajkaK16].
The large volume of proof data, coupled with the growing popularity of machine-learning tools, has inspired an alternative approach to improving automation in ITPs. Machine learning can be used to learn proof strategies from existing proof corpora. This approach has been tested in different ITPs: in Isabelle/HOL via its connection with Sledgehammer [BlanchetteKPU16] or via the Flyspeck library [KaliszykU15], in ACL2 [HerasKJM13] and in Mizar [UrbanRS13]. For Coq, however, only partial solutions have been suggested [KHG13, HK14, GWR14, GWR15].
Several challenges arise when data mining proofs in Coq:

C1. Unlike e.g. ACL2 or Mizar, Coq’s types play a role in proof construction, and the proposed machine-learning methods should account for that.

C2. Unlike e.g. Isabelle/HOL, Coq has dependent types; thus, a separation between proof-term and type-level syntax is impossible. Any machine-learning method for Coq should reflect the dependency between terms and types.

C3. Coq additionally has a tactic language, Ltac, introduced to facilitate interactive proof construction. As this feature is popular with users, it also needs to be captured in our machine-learning tools.
Challenge C3 was tackled in [KHG13, HK14, GWR14, GWR15], but we are not aware of any existing solutions to challenges C1–C2. It was shown in ML4PG [KHG13, HK14] that clustering algorithms can analyse similarity between Ltac proofs and group them into clusters. Generally, this knowledge can be used either to aid the programmer in proof analysis [HK14], or as part of heuristics for the construction of new, similar proofs [HerasKJM13]. This paper enhances ML4PG’s clustering methods with the analysis of proof-term structure, which takes into consideration mutual dependencies between terms and types, as well as the recursive nature of the proofs. The novel method thus addresses challenges C1–C2. To complete the picture, we also propose a new premiss selection algorithm for Coq that generates new proof tactics in the Ltac language based upon the clustering output, thus addressing C3 in a new way. Together, the proposed proof clustering and premiss selection methods offer a novel approach to proof automation with dependent types: “Data mine proof terms, generate proof tactics”. This allows us to access and analyse the maximum of information at the data mining stage, and then to reduce the search space by working with a simpler tactic language at the proof generation stage.
2 Bird’s Eye View of the Approach and Leading Example
1. Data mining Coq proof terms. In Coq, to prove that a theorem A follows from a theory (or current proof context) Γ, we need to construct an inhabitant t of type A, which is a proof of A in context Γ; this is denoted by Γ ⊢ t : A. Sometimes, t is also called the proof term for A. A type checking problem (Γ ⊢ t : A ?) asks to verify whether t is indeed a proof of A; a type inference problem (Γ ⊢ t : ?) asks whether the alleged proof t is a proof at all; and a type inhabitation problem (Γ ⊢ ? : A) asks whether A is provable at all. The type inhabitation problem is undecidable in Coq, and the special tactic language Ltac is used to aid the programmer in constructing the proof term t.
We illustrate the interplay of Coq’s syntax and the Ltac tactic language in the following example. Consider the following proof of associativity of list append (++) that uses Ltac tactics:
What we see as a theorem statement above is in fact the type of that proof; the tactics merely help us to construct the proof term that inhabits this type, which we can inspect by using the Print app_assoc command:
Because Coq is a dependently typed language, proof terms and types may share the signature. The context above is given by the List library, which defines, among other things, the two list constructors and the append operation ++.
The first methodological decision we make here is to data mine proof terms, rather than Ltac tactics.
2. Capturing the structural properties of proofs by using term trees. The second decision we make when analysing the Coq syntax is to view both the proof term and the type associated with a Coq object as a tree. This makes the structural properties of proofs apparent.
To use an example with a smaller tree, consider the Coq lemma forall (n : nat) (H : even n), odd (n + 1). Its term tree is depicted in Figure 1.
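To make the tree view concrete, the following Python sketch (ours, not part of ML4PG; the node labels are simplified strings and the exact node layout is an assumption) renders a term tree for the lemma above as a small recursive data structure:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A term-tree node: a label plus an ordered list of children."""
    label: str
    children: list = field(default_factory=list)

# A simplified term tree for: forall (n : nat) (H : even n), odd (n + 1)
tree = Node("forall", [
    Node("n : nat"),
    Node("H : even n"),
    Node("odd (n + 1)", [
        Node("+", [Node("n : nat"), Node("1 : nat")]),
    ]),
])

def depth(node):
    """Length of the longest root-to-leaf path, with the root at depth 0."""
    return 0 if not node.children else 1 + max(depth(c) for c in node.children)

print(depth(tree))  # -> 3: four levels, matching rows td0..td3 of Table 1
```

The nesting makes the structural properties discussed below (depth levels, widths, parent–child links) directly computable.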
3. Clustering term trees.
We next decide on a suitable machine learning approach to analyse Coq proofs, and settle on unsupervised learning.
Clustering algorithms [Bishop] have become a standard tool for finding patterns in large data sets. We use the term data object to refer to an individual element within a data set. In our case, the data set is given by Coq libraries, and the data objects are given by term trees obtained from Coq terms and types. Theorem app_assoc is one object in the List library. Clustering algorithms divide data objects into similar groups (clusters), based on (numeric) vector representations of the objects’ properties. The process of extracting such properties from data is called feature extraction [Bishop]. Features are parameters chosen to represent all objects in the data set. Feature values are usually given by numbers that instantiate the features for every given object. If n features are chosen to represent the objects, they form feature vectors of length n, and clustering is conducted in an n-dimensional space.

3.1. Converting tree structures to feature matrices. We now need to decide how to convert each proof tree into a feature vector. A variety of methods exists to represent trees as matrices, for instance using adjacency or incidence matrices. The adjacency matrix method shares the following common properties with various previous methods of feature extraction for proofs [lparurban, K13]: different library symbols are represented by distinct features, and the feature values are binary. For large libraries and growing syntax, such feature vectors grow very large.
We propose an alternative method that can characterise a large (potentially unlimited) number of Coq terms by a finite number of features and a (potentially unlimited) number of feature values. The features that determine the rows and columns of the matrices are given by the term tree depth and the level index of nodes. In addition, given a Coq term, its features must differentiate its term and type components. As a result, in our encoding each tree node is reflected by three features that represent the term component and the type component of the given node, as well as the level index of its parent node in the term tree, cf. Table 1 (in which td refers to “tree depth”).
        level index 0   level index 1   level index 2
td0     ( , 1, 1)       (0, 0, 0)       (0, 0, 0)
td1     ( , , 0)        ( , , 0)        ( , , 0)
td2     ( , , 2)        (0, 0, 0)       (0, 0, 0)
td3     ( , , 0)        ( , , 0)        (0, 0, 0)
3.2. Populating feature matrices with feature values. Feature matrices give a skeleton for extracting proof properties. Next, it is important to have an intelligent algorithm to populate the feature matrices with rational numbers – i.e. feature values. Consider the associativity of the operation of list append given in Theorem app_assoc. Associativity is a property common to many other operations. For example, addition on natural numbers is associative:
The corresponding proof term is as follows:
We would like our clustering tools to group theorems app_assoc and plus_assoc together. It is easy to see that the term tree structures of the theorems app_assoc and plus_assoc are similar, and so will be the structure of their feature matrices. However, the feature matrix cells are populated by feature values. If values assigned to list and nat as well as to ++ and + are very different, then these two theorems may not be grouped into the same cluster.
To give an example of a bad feature value assignment, suppose we define the feature-value functions to blindly assign a new natural number to every new object defined in the library. Suppose further that, in a file of 1000 objects, we defined natural numbers at the start of the file and lists at the end; then nat and + would receive small values, while list and ++ would receive values close to 1000. This assignment would suggest treating the functions + and ++ as very distant. If these values populate the feature matrices for our two theorems app_assoc and plus_assoc, the corresponding feature vectors will lie in distant regions of the n-dimensional space. This may lead the clustering algorithm to group these theorems in different clusters.
Seeing that the definitions of list and nat are structurally similar, and so are ++ and +, we would rather characterise their similarity by close feature values, irrespective of where in the files they occur. But then our algorithm needs to be intelligent enough to calculate how similar or dissimilar these terms are. For this, we need to cluster all objects in the chosen libraries, find that list and nat are clustered together and that ++ and + are clustered together, and assign close values to objects from the same cluster. In this example, the definitions of list, nat, ++ and + do not rely on any other definitions (just Coq primitives), and we can extract their feature matrices and values directly (after manually assigning values to a small fixed number of Coq primitives). In the general case, it may take more than one iteration to reach the primitive symbols.
We call this method of mutual clustering and feature extraction recurrent clustering. It is extremely efficient in the analysis of dependently-typed syntax, in which inductive definitions and recursion play a big role. It also works uniformly to capture the entire Coq syntax: not just proofs of theorems and lemmas in Ltac, as in early versions of ML4PG, but also all type and function definitions.
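The recurrent idea can be sketched as a toy two-pass procedure in Python. This is entirely our illustration: the feature vectors, the greedy grouping rule and the tolerance are invented, and stand in for the real feature extraction and k-means steps.

```python
# Pass 1 clusters base definitions; pass 2 uses the resulting cluster ids
# as feature values when clustering definitions that depend on them.

# Hand-made feature vectors for base definitions (purely illustrative).
base = {"nat": [1.0, 2.0], "list": [1.1, 2.1], "bool": [5.0, 0.0]}

def nearest_groups(objs, tol=1.0):
    """Toy clustering: greedily group objects whose Euclidean distance to
    a cluster's first member is within tol."""
    clusters = []
    for name, vec in objs.items():
        for cl in clusters:
            ref = objs[cl[0]]
            if sum((a - b) ** 2 for a, b in zip(vec, ref)) ** 0.5 <= tol:
                cl.append(name)
                break
        else:
            clusters.append([name])
    return clusters

pass1 = nearest_groups(base)          # nat and list end up together
cluster_id = {n: i for i, cl in enumerate(pass1) for n in cl}

# Pass 2: a dependent definition's feature is the cluster id of the base
# definition it recurses over, so + and ++ inherit the nat/list similarity.
dependent = {"+": [float(cluster_id["nat"])],
             "++": [float(cluster_id["list"])],
             "andb": [float(cluster_id["bool"])]}
pass2 = nearest_groups(dependent, tol=0.5)
print(pass1, pass2)
```

The point of the sketch is only the information flow: earlier clustering output feeds the feature values of later, dependent objects, which is what lets + and ++ receive close values even if they are defined far apart in the files.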
4. Using clusters to improve proof automation. Suppose now we have the output of the clustering algorithm, in which all Coq objects of interest are grouped into clusters. All objects in each cluster share some structural similarity, like app_assoc and plus_assoc. Suppose we introduce a new lemma for which we do not have a proof, but which is clustered together with app_assoc and plus_assoc. We can then readjust the sequences of Ltac tactics used in the proofs of app_assoc and plus_assoc to prove the new lemma.
Our running example so far was simple, and the tactics used in the proofs did not refer to other lemmas. The new recurrent clustering method is most helpful when proofs actually call other auxiliary lemmas. Simple recombination of proof tactics does not work well in such cases: a new lemma clustered with other finished proofs may require similar, but not exactly the same, auxiliary lemmas.
Take for example the lemma maxnACA, which states the inner commutativity of the maximum of two natural numbers, from the SSReflect library ssrnat:
ML4PG clusters maxnACA together with the already proven lemmas addnACA, minnACA and mulnACA — these three lemmas state the inner commutativity of addition, multiplication and the minimum of two naturals, respectively. We will try to construct the proof of maxnACA by analogy with the proofs of addnACA, minnACA and mulnACA. Consider the proof of the lemma addnACA in that cluster: it is proven by the sequence of tactics by move=> m n p q; rewrite !addnA (addnCA n). That is, it mainly relies on the auxiliary lemmas addnA and addnCA. The proof of addnACA fails to apply to maxnACA directly; in particular, the auxiliary lemmas addnA and addnCA do not apply in the proof of maxnACA. But similar lemmas would apply! And here the clustering method we have just described comes to the rescue. If we cluster all auxiliary lemmas used in the proofs of addnACA, minnACA and mulnACA with the rest of the lemmas of the ssrnat library, we find the cluster {minnA, mulnA, maxnA, addnA} and the cluster {minnAC, mulnAC, maxnAC, addnCA}. These two clusters give us candidate auxiliary lemmas to try in our proof of maxnACA, in the places where addnA and addnCA were used in addnACA. As it turns out, the lemmas maxnA and maxnAC successfully apply as auxiliary lemmas in maxnACA, and the sequence of tactics by move=> m n p q; rewrite !maxnA (maxnCA n) proves the lemma maxnACA.
Paper overview. The rest of the paper is organised as follows. Section 3 introduces some of the background concepts. In Section 4, we define the algorithm for automatically extracting feature matrices from Coq terms and types. In Section 5, we define a second algorithm, which automatically computes feature values to populate the feature matrices. These two methods are new, and have never been presented before. In Section 6, we propose the new method of premiss selection and tactic generation for Coq based on clustering. Finally, in Section 7 we survey related work and conclude the paper.
3 Background
The underlying formal language of Coq is known as the Predicative Calculus of (Co)Inductive Constructions (pCIC) [Coq].
Definition 1 (pCIC term)
The sorts Set, Prop, Type(i) (i ∈ ℕ) are terms.
The global names of the environment are terms.
Variables are terms.
If x is a variable and T, U are terms, then forall x:T,U is a term. If x does not occur in U, then forall x:T,U will be written as T -> U. A term of the form forall x1:T1, forall x2:T2, …, forall xn:Tn, U will be written as forall (x1:T1) (x2:T2) … (xn:Tn), U.
If x is a variable and T, U are terms, then fun x:T => U is a term. A term of the form fun x1:T1 => fun x2:T2 => … => fun xn:Tn => U will be written as fun (x1:T1) (x2:T2) …(xn:Tn) => U.
If T and U are terms, then (T U) is a term; we use the uncurried notation (T U1 U2 … Un) for nested applications ((((T U1) U2) …) Un).
If x is a variable, and T, U are terms, then (let x:=T in U) is a term.
The syntax of Coq [Coq] includes some terms that do not appear in Definition 1; e.g. given a variable x, and terms T and U, fix name (x:T) := U is a Coq term used to declare a recursive definition. The notion of a term in Coq covers a very general syntactic category in the Gallina specification language. However, for the purpose of concise exposition, we will restrict our notion of a term to Definition 1, giving the full treatment of the whole Coq syntax in the actual ML4PG implementation.
Clustering. A detailed study of the performance of different well-known clustering algorithms in proof mining with ML4PG can be found in [KHG13]. In this paper we use the k-means clustering algorithm throughout, as it gave the best evaluation results in [KHG13] and is rather fast (we use the implementation in Weka [Weka]).
The chosen clustering algorithm relies on a user-supplied parameter that identifies the number of clusters to form. Throughout this paper, we will use a heuristic method from [KHG13, HerasKJM13] to determine this number as a function of the library size and a user-supplied granularity value. Lower granularity indicates a general preference for a smaller number of clusters of larger size, and higher granularity suggests a larger number of clusters of smaller size. The granularity is supplied by the user, via the ML4PG interface, as an indicator of general intent.
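As an illustration of how such a heuristic might look, the sketch below maps a library size and a granularity value to a cluster count. The granularity range and the exact divisor are our assumptions, not necessarily the formula of [KHG13]:

```python
def num_clusters(num_objects, granularity):
    """Assumed stand-in for the heuristic: higher granularity gives a
    smaller divisor, hence more (and smaller) clusters."""
    if not 1 <= granularity <= 5:
        raise ValueError("granularity outside the supported range")
    return max(1, num_objects // (10 - granularity))

# For a 457-object library (the size used in Section 5's examples):
for g in (1, 3, 5):
    print(g, num_clusters(457, g))
```

Under this assumed formula, the highest granularity setting would yield 91 clusters for 457 objects, matching the cluster count reported in Section 5.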
4 Feature Extraction, Stage1: Extracting Feature Matrices from pCIC Terms
We first introduce a suitable tree representation of pCIC terms. We refer to Section 2 for examples supporting the definitions below.
Definition 2 (pCIC term tree)
Given a pCIC term C, we define its associated pCIC term tree as follows:
If C is one of the sorts Set, Prop or Type(i), then the pCIC term tree of C consists of one single node, labelled respectively by Set:Type(0), Prop:Type(0) or Type(i):Type(i+1).
If C is a name or a variable, then the pCIC term tree of C consists of one single node, labelled by the name or the variable itself together with its type.
If C is a pCIC term of the form forall (x1:T1) (x2:T2) … (xn:Tn), U (analogously for fun (x1:T1) (x2:T2) … (xn:Tn) => U), then the term tree of C is the tree with the root node labelled by forall (respectively fun) and its immediate subtrees given by the trees representing x1:T1, x2:T2, …, xn:Tn and U.
If C is a pCIC term of the form let x:=T in U, then the pCIC tree of C is the tree with the root node labelled by let, having three subtrees given by the trees corresponding to x, T and U.
If C is a pCIC term of the form T -> U, then the pCIC term tree of C is the tree with the root node labelled by ->, and its immediate subtrees given by the trees representing T and U.
If C is a pCIC term of the form (T U1 … Un), then we have two cases. If T is a name, the pCIC term tree of C is represented by the tree with the root node labelled by T together with its type, and its immediate subtrees given by the trees representing U1,…, Un. If T is not a name, the pCIC term tree of C is the tree with the root node labelled by @, and its immediate subtrees given by the trees representing T, U1,…,Un.
Note that pCIC term trees extracted from any given Coq files consist of two kinds of nodes: Gallina nodes and term-type nodes. Gallina is the specification language of Coq; it contains keywords and special tokens such as forall, fun, let or -> (from now on, we will call them Gallina tokens). The term-type nodes are given by expressions of the form t1:t2, where t1 is a sort, a variable or a name, and t2 is the type of t1.
We now convert pCIC term trees into feature matrices:
Definition 3 (pCIC Term tree depth level and level index)
Given a pCIC term tree T, the depth of a node t in T, denoted by depth(t), is defined as follows:

depth(t) = 0, if t is the root node;

depth(t) = d + 1, where d is the depth of the parent node of t.

The nth level of T is the ordered sequence of nodes of depth n. As is standard, the order of the sequence is given by visiting the nodes of depth n from left to right. We will say that the size of this sequence is the width of T at depth n. The width of T is given by the largest width of its levels. The level index of a node with depth n is the position of the node in the nth level of T. We denote by T(i, j) the node of T with depth i and level index j.
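Definition 3 can be read off directly from a breadth-first traversal. The Python sketch below (our rendering, with trees as plain (label, children) pairs and simplified labels) computes the levels, and hence widths and level indices:

```python
from collections import deque

def levels(tree):
    """Breadth-first traversal grouping nodes by depth. Within a level,
    nodes appear left to right, so a node's position in its level is its
    level index. Trees are (label, [children]) pairs."""
    out = []
    queue = deque([(tree, 0)])
    while queue:
        (label, children), d = queue.popleft()
        if d == len(out):
            out.append([])
        out[d].append(label)
        for c in children:
            queue.append((c, d + 1))
    return out

# A tree of the same shape as the one behind Table 1 (labels simplified):
t = ("forall", [("n:nat", []), ("H:even n", []),
                ("odd _", [("+ _", [("n", []), ("1", [])])])])
lv = levels(t)
print(lv)                 # 4 levels, widths 1, 3, 1, 2
print(max(map(len, lv)))  # width of the tree: 3
```

The level index of, say, the "odd _" node is its position (2) in level 1, and the width of the tree is the largest level size.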
We use the notation M_{D×W}(ℚ) to denote the set of matrices of size D × W with rational coefficients.
Definition 4 (pCIC term tree feature matrix)
Given a pCIC term t, its corresponding pCIC term tree T of depth D and width W, and three injective functions [·]_Gallina, [·]_term and [·]_type (assigning rational values to Gallina tokens, term components and type components, respectively), the feature extraction function builds the term tree matrix of t, with D rows and W columns, where the (i, j)th entry of the matrix captures information from the node T(i, j) as follows:

if T(i, j) is a Gallina node, then the (i, j)th entry of the matrix is a triple built from the value that [·]_Gallina assigns to the token, together with the level index of the parent of T(i, j);

if T(i, j) is a node t1:t2, then the (i, j)th entry of the matrix is the triple ([t1]_term, [t2]_type, p), where p is the level index of the parent of the node.
One example of a term tree feature matrix, for the term tree of forall (n : nat) (H : even n), odd (n + 1), is given in Table 1. Since the tree has four depth levels (td0–td3) and width 3, a matrix of size 4 × 3 represents that tree. Generally, if the largest pCIC term tree in the given Coq library has depth D and width W, we take feature matrices of size D × W for all feature extraction purposes. The resulting feature vectors have an average density ratio of 60%. This gives much smaller feature vectors, of much higher density, than the sparse approaches of [lparurban, K13], and helps to obtain more accurate clustering results.
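The matrix construction of Definition 4 can be sketched as follows. The sketch is ours: the hand-assigned feature values, the tree encoding and the convention of 0 for the root's parent index are all assumptions for illustration.

```python
def term_tree_matrix(tree, D, W, f_term, f_type):
    """Build a D x W matrix of triples in the spirit of Definition 4.
    Each used cell holds (term value, type value, parent's level index);
    unused cells are (0, 0, 0). f_term / f_type stand in for the injective
    feature-value functions. Trees are (term, type, children) triples."""
    M = [[(0, 0, 0) for _ in range(W)] for _ in range(D)]
    frontier = [(tree, 0)]  # (node, parent's level index); 0 for the root
    d = 0
    while frontier:
        nxt = []
        for j, (node, parent_idx) in enumerate(frontier):
            term, typ, children = node
            M[d][j] = (f_term(term), f_type(typ), parent_idx)
            nxt.extend((c, j) for c in children)  # j = this node's level index
        frontier, d = nxt, d + 1
    return M

# Tiny illustration with hand-assigned values:
vals = {"forall": 1, "Prop": 2, "n": 3, "nat": 4, "odd _": 5}
M = term_tree_matrix(("forall", "Prop",
                      [("n", "nat", []), ("odd _", "Prop", [])]),
                     D=2, W=2, f_term=vals.get, f_type=vals.get)
print(M)
```

Because each frontier holds exactly one level, the position j of a node within the frontier is its level index, which is what gets recorded as the parent index of its children.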
In Definition 4, we specified the three functions just by their signatures. The function for Gallina tokens, written [·]_Gallina, is predefined: the number of Gallina tokens is fixed and cannot be expanded by the Coq user, so we know in advance all the Gallina tokens that can appear in a development, and we can assign a concrete value to each of them. The function [·]_Gallina is injective and is defined so as to assign close values to similar Gallina tokens and more distant values to unrelated tokens.

The functions [·]_term and [·]_type are dynamically redefined for every library and every given proof stage, as the next section explains.
5 Feature Extraction Stage2: Assigning Feature Values via Recurrent Clustering
When defining the functions [·]_term and [·]_type, we must ensure that they are sensitive to the structure of terms, assigning close values to similar terms and more distant values to dissimilar terms.

Starting with the primitives, [·]_term and [·]_type always assign fixed values to the predefined sorts of Coq (cf. item (1) in Definition 5). Next, suppose t is the ith object of the given Coq library. For variables occurring in t, [·]_term and [·]_type use a method resembling de Bruijn indices (cf. item (2)). Item (3) defines [·]_term and [·]_type for recursive calls. The most interesting (recurrent) case occurs when [·]_term and [·]_type need to be defined for subterms of t that are defined elsewhere in the library. In such cases, we use the output of recurrent clustering of the first i − 1 objects of the library. When the first i − 1 Coq objects are clustered, each cluster is automatically assigned a unique integer number. Clustering algorithms additionally assign a proximity value (ranging from 0 to 1) to every object in a cluster to indicate its proximity to the cluster centroid, or in other terms, the certainty of the given example belonging to the cluster. The definitions of [·]_term and [·]_type below use the cluster number and the proximity value to compute feature values:
Definition 5
Given a Coq library, its ith object given by a pCIC term t, the corresponding pCIC term tree, and a node of that tree, the functions [·]_term and [·]_type are defined, respectively, for the term component t1 and the type component t2 of the node as follows:

(Base case) If t1 (or t2) is the kth element of the set {Set, Prop, Type(0), Type(1), Type(2), …}, then its value [t1]_term (or [t2]_type) is given by a fixed base constant plus a small fraction determined by k.

(Base case) If t1 (or t2) is the mth distinct variable in t, counting in order of appearance, then [t1]_term (or [t2]_type) is assigned a value determined by m.

(Base case) If t1 = t, i.e. t1 is a recursive call (or t2 = t, i.e. t is an inductive type definition referring to itself), we assign a designated constant to t1 (or t2, respectively).

(Recurrent case) If t1 or t2 is the jth object of the given Coq libraries, where j < i, then cluster the first i − 1 objects of that library and take the clustering output into account as follows:

if t1 (or t2) belongs to the cluster with number k and associated proximity value p, then [t1]_term (or [t2]_type, respectively) is assigned a value determined by k and offset by p.

(Recurrent case) If t1:t2 is a local assumption or hypothesis, then cluster this object against the first i − 1 objects of the given libraries and take the clustering output into account as in the previous case: if t1 (or t2) belongs to the cluster with number k and associated proximity value p, then [t1]_term (or [t2]_type, respectively) is assigned a value determined by k and offset by p.
In the formula for sorts (item 1), the small fractions reflect the similarity of all sorts, and the base constant is added in order to distinguish sorts from variables and names. The formula used in the recurrent cases assigns a value within a narrow band around a base point determined by the cluster number, depending on the proximity value of the object within its cluster. Thus, elements of the same cluster have closer values compared with the values assigned to elements of other clusters. For example, using the three clusters shown below, the value of eqn is determined by Cluster 1 and by its proximity value within that cluster; by contrast, the value of drop is determined by Cluster 2, and is therefore distant from that of eqn.
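This value-assignment scheme can be sketched in a few lines of Python. The base point, the spread and the proximity values below are invented for illustration; only the shape of the computation (cluster number sets the base, proximity provides a small offset) reflects the text.

```python
def feature_value(cluster_no, proximity, spread=1.0):
    """Assumed stand-in for the paper's formula: place each object near a
    base point determined by its cluster number, offset by its proximity
    to the centroid, so same-cluster objects get close values and
    different-cluster objects get distant ones."""
    assert 0.0 <= proximity <= 1.0
    return 10.0 * cluster_no + spread * proximity

# eqn and eqseq share Cluster 1; drop lives in Cluster 2
# (proximity values here are hypothetical):
v_eqn   = feature_value(1, 0.91)
v_eqseq = feature_value(1, 0.88)
v_drop  = feature_value(2, 0.95)
print(v_eqn, v_eqseq, v_drop)
```

Whatever the actual constants, the invariant to preserve is that the within-cluster spread stays much smaller than the between-cluster distance.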
The Algorithm at Work: Examples
We finish this section with some examples of clusters discovered in the basic infrastructure of the SSReflect library [SSReflect]. We include here 3 of the 91 clusters discovered by our method automatically after processing 457 objects (across 12 standard files), within 5–10 seconds.

Cluster 1:
Fixpoint eqn (m n : nat) :=
  match m, n with
  | 0, 0 => true
  | m'.+1, n'.+1 => eqn m' n'
  | _, _ => false
  end.

Fixpoint eqseq (s1 s2 : seq T) :=
  match s1, s2 with
  | [::], [::] => true
  | x1 :: s1', x2 :: s2' => (x1 == x2) && eqseq s1' s2'
  | _, _ => false
  end.
Cluster 2:
Fixpoint drop n s :=
  match s, n with
  | _ :: s', n'.+1 => drop n' s'
  | _, _ => s
  end.

Fixpoint take n s :=
  match s, n with
  | x :: s', n'.+1 => x :: take n' s'
  | _, _ => [::]
  end.
Cluster 3:
The first cluster contains the definitions of equality for natural numbers and lists, showing that the clustering method can spot structural similarities across different libraries. The second cluster discovers the similarity between take (which takes the first n elements of a list) and drop (which drops the first n elements of a list). The last pattern is the least trivial of the three, as it depends on other definitions, like foldr, cat (concatenation of lists) and addn (sum of natural numbers). Recurrent term clustering handles such dependencies well: it assigns close values to cat and addn, since they have been discovered to belong to the same cluster. Note the precision of the recurrent clustering. Among the terms it considered, many used foldr; however, Cluster 3 contains only the structurally similar definitions, excluding e.g. Definition allpairs s t := foldr (fun x => cat (map (f x) t)) [::] s; Definition divisors n := foldr add_divisors [:: 1] (prime_decomp n); or Definition Poly := foldr cons_poly 0. This precision is due to the recurrent clustering process, with its deep analysis of the term and type structure, including the analysis of any auxiliary functions. This is how it discovers that the functions add_divisors and cons_poly are structurally different from the auxiliary functions cat and addn, and hence the definitions allpairs, divisors and Poly are not included in Cluster 3.
This deep analysis of term structure via recurrent clustering improves accuracy and will play a role in the next section.
6 Applications of Recurrent Proof Clustering. A Premiss Selection Method
Several premiss selection methods have been proposed for Isabelle and HOL, as well as several other provers [KaliszykU15, UrbanRS13]. Relying on Coq’s tactic language and the clustering results of the previous section, we can formulate the problem of premiss selection for Coq as follows: given a cluster of pCIC objects from a Coq library and an arbitrary theorem or lemma in this cluster, can we recombine the sequences of proof tactics used to prove the other theorems and lemmas in the cluster in such a way as to obtain a proof of the given one? In particular, if the proof requires the use of auxiliary lemmas, can we use the outputs of the recurrent clustering to make valid suggestions of auxiliary lemmas?
The algorithm below answers these questions in the positive. To answer the second question, we suggest automatically examining all auxiliary lemmas used in the proofs that belong to the given cluster, and looking up the clusters to which those auxiliary lemmas belong in order to suggest auxiliary lemmas for the current goal. This is the essence of the clustering-based premiss selection method we propose here (see especially item 4(b) and Algorithm 1 below). As Section 2 has illustrated, often a combination of auxiliary lemmas is required in order to complete a given proof, and the algorithm below caters for such cases.
Premiss Selection Method 6.1
Given the statement of a theorem T and a set of lemmas L, find a proof for T as follows:

1. Using the recurrent clustering with L and T as the dataset, obtain the cluster C that contains the theorem T (possibly alongside other lemmas l1, …, lk).

2. Obtain the sequence of tactics si that is used to prove each lemma li in C.

3. Try to prove T using si, for each i.

4. If no such sequence of tactics proves T, then infer new arguments for each tactic as follows:

(a) If the argument of the tactic is an internal hypothesis from the context of a proof, try all the internal hypotheses from the context of the current proof.

(b) If the argument of the tactic is an external lemma l, use the recurrent clustering with L and T as the dataset to determine which lemmas are in a cluster with l, and try all those lemmas in turn as arguments of the tactic.

This can be naturally extended to tactics with several arguments, as Algorithm 1 shows.

The heart of the above procedure is the process of lemma selection (item 4(b)), which we state more formally as Algorithm 6.1.
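The steps above can be sketched as follows. This is our rendering, not ML4PG's actual implementation: the data shapes, helper names and the toy proof oracle are assumptions, and only single-argument substitution (step 4(b)) is shown.

```python
def premiss_select(goal, lemmas, proofs, clusters, try_proof):
    """Sketch of the premiss selection method. proofs maps a lemma to its
    script, a list of (tactic, argument) pairs; clusters maps a name to
    its cluster id; try_proof(goal, script) is an oracle reporting whether
    the script closes the goal."""
    mates = [l for l in lemmas
             if l != goal and l in proofs
             and clusters.get(l) == clusters.get(goal)]
    # Step 3: replay each cluster-mate's script verbatim.
    for mate in mates:
        if try_proof(goal, proofs[mate]):
            return proofs[mate]
    # Step 4(b): where a tactic's argument is an external lemma, try the
    # lemmas clustered together with that argument instead.
    for mate in mates:
        script = proofs[mate]
        for i, (tactic, arg) in enumerate(script):
            for cand in lemmas:
                if cand != arg and clusters.get(cand) == clusters.get(arg):
                    attempt = script[:i] + [(tactic, cand)] + script[i + 1:]
                    if try_proof(goal, attempt):
                        return attempt
    return None

# Toy replay of Section 2's maxnACA example (cluster ids and the oracle
# are invented for illustration):
clusters = {"maxnACA": 0, "addnACA": 0,
            "addnA": 1, "maxnA": 1, "addnCA": 2, "maxnAC": 2}
proofs = {"addnACA": [("rewrite", "addnA"), ("rewrite", "addnCA")]}
oracle = lambda goal, script: any(a.startswith("maxn") for _, a in script)
result = premiss_select("maxnACA", list(clusters), proofs, clusters, oracle)
print(result)
```

Here the verbatim replay of addnACA's script fails, and step 4(b) substitutes maxnA (the lemma clustered with addnA) into the script, mirroring how the real method finds candidate auxiliary lemmas.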
Evaluation
Using five Coq libraries of varied sizes, we perform an empirical evaluation of the premiss selection method, and thereby test the accuracy of the proposed recurrent clustering. Our test data consists of the basic infrastructure of the SSReflect library [SSReflect], the formalisation of Java-like bytecode presented in [HK14], the formalisation of persistent homology [HCMS12], the Paths library of the HoTT development [Hott], and the formalisation of the Nash Equilibrium [nash].
Library                Language    Granularity   Granularity   Granularity   Library size (No. of theorems)
SSReflect library      SSReflect                                             1389
JVM                    SSReflect                                             49
Persistent Homology    SSReflect                                             306
Paths (HoTT)           Coq                                                   80
Nash Equilibrium       Coq                                                   145
The results of our experiments are given in Table 2. The success rate of the premiss selection method depends on how similar the proofs of theorems in a given library are. Additionally, and unlike all other premiss selection methods [KaliszykU15, UrbanRS13], we now have to factor in the fact that Coq proofs, as given by tactics in the Ltac language, may require a sophisticated combination of tactics for which our premiss selection method does not cater. Indeed, some proofs may not call auxiliary lemmas at all and incorporate all the reasoning within one proof. This explains the high success rate in the Paths (HoTT) library, where most of the lemmas are proven in the same style and use auxiliary lemmas in the same well-organised way, and the low rate in the Persistent Homology library, where just a few lemmas have similar proofs with auxiliary lemmas. The granularity value does not have a big impact on the performance in our experiments, and almost the same number of lemmas is proven with different granularity values. In some cases, as in the Nash Equilibrium library, a small granularity value generates bigger clusters that increase the exploration space, allowing more lemmas to be proven. However, reducing the granularity value can also have a negative impact; for instance, in the JVM library the number of clusters is reduced, and this leads to a reduction in the number of proven theorems.
7 Conclusions and Related Work
The presented hybrid method of proof mining combines statistical machine learning and premiss selection with Coq’s proof terms and Ltac tactics. It is specifically designed to cater for a dependently typed language such as that of Coq. The recurrent clustering analyses the tree structures of proof terms and types, and groups all Coq objects (definitions, lemmas, theorems, …) into sets, depending on their dependencies and structural similarity within the given libraries. Previous versions of ML4PG could only analyse tactics, rather than proof terms or definitions.
The output of the clustering algorithm can be used to directly explore the similarity of lemmas and definitions, and ML4PG includes a graphical interface to do so. In this paper we presented a novel method of premiss selection that uses the output of clustering algorithms for analogy-based premiss selection. We use the Ltac language to automatically generate candidate tactics from clusters. The Ltac language gives a much simpler syntax to explore compared with a possible alternative, namely the direct generation of Coq term trees.
Evaluation of the method shows that it bears promise, especially in libraries where the proofs are given in a uniform style and auxiliary lemmas are used consistently. Capturing the role of auxiliary lemmas in proofs is known to be a hard problem in proof mining, and recurrent clustering gives a good solution to it. Other existing methods, like [GWR14, GWR15], address limitations of our premiss selection method and suggest more sophisticated algorithms for tactic recombination. Integration of these methods with ML4PG is under way.
The recurrent clustering method is embedded into ML4PG: http://www.macs.hw.ac.uk/~ek19/ML4PG/. The integration of the Premiss Selection method into ML4PG or directly into Coq’s Ltac is still future work.
Related work
Statistical machine learning for automation of proofs: premiss selection.
The method of statistical proof-premiss selection [KaliszykU15, UrbanRS13, lparurban, K13] is applied in several theorem provers. The method we presented in Section 6 is an alternative method of premiss selection, owing to its reliance on the novel algorithm of recurrent clustering. Moreover, this paper presents the first premiss selection method that takes into consideration the structure of proof terms in a dependently typed language.
Automated solvers in Coq. In [CzajkaK16], the standard “hammer” approach was adopted, where a fragment of Coq syntax was translated into FOL, and then the translated lemma and theorem statements (without proof terms) were proven using existing firstorder solvers. By contrast, our approach puts emphasis on mining proof terms that inhabit the lemmas and theorems in Coq. Our method does not rely on any third party firstorder solvers. Results of clustering of the Coq libraries are directly used to reconstruct Coq proofs, by analogy with other proofs in the library. Thus, the process of (often unsound and inevitably incomplete) translation [CzajkaK16] from the higherorder dependently typed language of Coq into FOL becomes redundant.
Theory exploration: models from tactic sequences. While statistical methods focus on extracting information from existing large libraries, symbolic methods are instead concerned with automating the discovery of lemmas in new theories [GWR14, GWR15, JDB11, HLW12, hipster], relying on existing proof strategies, e.g. proof-planning and rippling [Duncan02, BB05]. The method of tactic synthesis given in Section 6 belongs to that group of methods.
The method developed in this paper differs from the cited papers in that it allows us to reason deeper than the surface of the tactic language. That is, statistical analysis of pCIC proof terms and types, as presented in Section 4, is employed before the tactics in the Ltac language are generated.
Combination of Statistical and Symbolic Machine Learning Methods in theorem proving. Statistical machine learning was used to support auxiliary lemma formulation by analogy in the setting of ACL2, see [HerasKJM13]. In this paper, we extended that approach in several ways: we included types, proof terms and the tactic language, all crucial building blocks of Coq proofs, as opposed to the first-order untyped proofs of ACL2.