1 Introduction: Hammers, Learning and Watchlists
Hammer-style automation tools connecting interactive theorem provers (ITPs) with automated theorem provers (ATPs) have recently led to a significant speedup for formalization tasks [5]. An important component of such tools is premise selection [1]: choosing a small number of the most relevant facts that are given to the ATPs. Premise selection methods based on machine learning from many proofs available in the ITP libraries typically outperform manually specified heuristics [1, 17, 19, 7, 4, 2]. Given the performance of such ATP-external guidance methods, learning-based internal proof search guidance methods have started to be explored, both for ATPs [36, 18, 15, 23, 8] and also in the context of tactical ITPs [10, 12].
In this work we develop learning-based internal proof guidance methods for the E [30] ATP system and evaluate them on the large Mizar Mathematical Library [11]. The methods are based on the watchlist (also hint list) technique developed by Veroff [37], focusing proof search towards lemmas (hints) that were useful in related proofs. Watchlists have proved essential in the AIM project [21], done with Prover9 [25], for obtaining very long and advanced proofs of open conjectures. Problems in large ITP libraries, however, differ from one another much more than the AIM problems, making it more likely for unrelated watchlist lemmas to mislead the proof search. Also, Prover9 lacks a number of large-theory mechanisms and strategies developed recently for E [16, 13, 15].
Therefore, we first design watchlist-based clause evaluation heuristics for E that can be combined with other E strategies. Second, we complement the internal watchlist guidance by using external statistical machine learning to pre-select smaller numbers of watchlist clauses relevant to the current problem. Finally, we use the watchlist mechanism to develop new proof-guiding algorithms that load many previous proofs inside the ATP and focus the search using a dynamically updated heuristic representation of proof search state based on matching the previous proofs.
The rest of the paper is structured as follows. Section 2 briefly summarizes the workings of saturation-style ATPs such as E. Section 3 discusses heuristic representation of search state and its importance for learning-based proof guidance. We propose an abstract vectorial representation expressing similarity to other proofs as a suitable evolving characterization of saturation proof searches. We also propose a concrete implementation based on proof completion ratios tracked by the watchlist mechanism. Section 4 describes the standard (static) watchlist mechanism implemented in E, and Section 5 introduces the new dynamic watchlist mechanisms and their use for guiding the proof search. Section 6 evaluates the static and dynamic watchlist guidance combined with learning-based pre-selection on the Mizar library. Section 7 shows several examples of nontrivial proofs obtained by the new methods, and Section 8 discusses related work and possible extensions.
2 Proof Search in Saturating First-Order Provers
The state of the art in first-order theorem proving is a saturating prover based on a combination of resolution/paramodulation and rewriting, usually implementing a variant of the superposition calculus [3]. In this model, the proof state is represented as a set of first-order clauses (created from the axioms and the negated conjecture), and the system systematically adds logical consequences to the state, trying to derive the empty clause and hence an explicit contradiction.
All current saturating first-order provers are based on variants of the given-clause algorithm. In this algorithm, the proof state is split into two subsets of clauses, the processed clauses P (initially empty) and the unprocessed clauses U. On each iteration of the algorithm, the prover picks one unprocessed clause g (the so-called given clause), performs all inferences which are possible with g and all clauses in P as premises, and then moves g into P. The newly generated consequences are added to U. This maintains the core invariant that all inferences between clauses in P have been performed. Provers differ in how they integrate simplification and redundancy elimination into the system, but all enforce the invariant that g is maximally simplified (by first simplifying g with clauses in P, then back-simplifying P with g) and that P contains neither tautologies nor subsumed clauses.
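For illustration, the given-clause loop can be sketched in a few lines of Python. This is a schematic simplification, not E's actual implementation: clauses are modeled as frozensets of integer literals (negation as sign flip), and the `infer` and `pick` callbacks stand in for the full inference engine and the clause selection heuristic.

```python
def given_clause_loop(axioms, infer, pick):
    """Schematic given-clause saturation loop (a sketch, not E's code).

    infer(g, processed): all consequences of g with clauses in processed.
    pick(unprocessed): the choice point - selects the next given clause."""
    processed = []               # P: mutual inferences already performed
    unprocessed = list(axioms)   # U: candidates awaiting processing
    while unprocessed:
        g = pick(unprocessed)    # the given clause
        unprocessed.remove(g)
        if g == frozenset():     # empty clause derived: contradiction
            return "proof"
        new = infer(g, processed)
        processed.append(g)      # move g into P ...
        unprocessed.extend(new)  # ... and its consequences into U
    return "saturated"           # U exhausted without a contradiction

def resolve(c1, c2):
    # toy binary resolution on propositional clauses (ints, -n = negated n)
    return [(c1 - {lit}) | (c2 - {-lit}) for lit in c1 if -lit in c2]

def infer(g, processed):
    new = []
    for p in processed:
        new.extend(resolve(g, p))
    return new

axioms = [frozenset({1}), frozenset({-1})]
print(given_clause_loop(axioms, infer, lambda u: u[0]))  # prints "proof"
```

Resolving the unit clauses 1 and -1 yields the empty clause, so the loop reports a proof after two iterations.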
The core choice point of the given-clause algorithm is the selection of the next clause to process. If theoretical completeness is desired, this selection has to be fair, in the sense that no clause is delayed forever. In practice, clauses are ranked using one or more heuristic evaluation functions, and are picked in order of increasing evaluation (i.e., small values are good). The most frequent heuristics are based on symbol counting, i.e., the evaluation is the number of symbol occurrences in the clause, possibly weighted for different symbols or symbol types. Most provers also support interleaving a symbol-counting heuristic with a first-in-first-out (FIFO) heuristic. E supports the dynamic specification of an arbitrary number of differently parameterized priority queues that are processed in weighted round-robin fashion, via a small domain-specific language for heuristics.
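A minimal sketch of such a weighted round-robin scheme follows. This is a hypothetical simplification of E's heuristics language: clauses are plain strings, the evaluations are symbol counting and FIFO, and each queue gets a number of picks proportional to its frequency.

```python
import heapq
from itertools import count

class RoundRobin:
    """Pick clauses from several priority queues in weighted round-robin
    fashion; entries already picked from another queue are lazily skipped."""
    def __init__(self, evals):              # evals: [(frequency, eval_fn), ...]
        self.evals = evals
        self.queues = [[] for _ in evals]
        self.seq = count()                  # insertion order, used for FIFO
        self.picked = set()
        self.slots = []                     # expanded schedule, e.g. [0, 0, 1]
        for i, (freq, _) in enumerate(evals):
            self.slots += [i] * freq
        self.pos = 0

    def push(self, clause):
        n = next(self.seq)
        for q, (_, ev) in zip(self.queues, self.evals):
            heapq.heappush(q, (ev(clause, n), n, clause))

    def pop(self):
        for _ in range(len(self.slots)):
            q = self.queues[self.slots[self.pos]]
            self.pos = (self.pos + 1) % len(self.slots)
            while q:
                _, _, c = heapq.heappop(q)
                if c not in self.picked:    # skip clauses another queue took
                    self.picked.add(c)
                    return c
        return None

# 2:1 mix of symbol counting and FIFO, in the spirit of E's
# -H'(2*Clauseweight(...),1*FIFOWeight(...))' specifications
rr = RoundRobin([(2, lambda c, n: len(c)),   # symbol counting (shorter first)
                 (1, lambda c, n: n)])       # first-in-first-out
for c in ["pqr", "a", "xy"]:
    rr.push(c)
print([rr.pop(), rr.pop(), rr.pop()])  # ['a', 'xy', 'pqr']
```

The two smallest clauses are taken by the symbol-counting queue, then the FIFO queue contributes the oldest remaining clause.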
Previous work [28, 31] has shown both that the choice of given clauses is critical for the success rate of a prover, and that existing heuristics are still quite bad, i.e., they select a large majority of clauses not useful for a given proof. Positively formulated, there still is a huge potential for improvement.
3 Proof Search State in Learning-Based Guidance
A good representation of the current state is crucial for learningbased guidance. This is quite clear in theorem proving and famously so in Go and Chess [32, 33]. For example, in the TacticToe system [10] proofs are composed from preprogrammed HOL4 [34] tactics that are chosen by statistical learning based on similarity of the evolving goal state to the goal states from related proofs. Similarly, in the learning versions of leanCoP [26] – (FE)MaLeCoP [36, 18] – the tableau extension steps are guided by a trained learner using similarity of the evolving tableau (the ATP proof search state) to many other tableaux from related proofs.
Such an intuitive and compact notion of proof search state is, however, hard to get when working with today's high-performance saturation-style ATPs such as E [30] and Vampire [22]. The above definition of the saturation-style proof state (Section 2) as either one or two (processed/unprocessed) large sets of clauses is very unfocused. Existing learning-based guiding methods for E [15, 23] practically ignore this. Instead, they use only the original conjecture and its features for selecting the relevant given clauses throughout the whole proof search.
This is obviously unsatisfactory, both when compared to the evolving search state in the case of tableau and tactical proving, and also when compared to the way humans select the next steps when they search for proofs. The proof search state in our mind is certainly an evolving concept based on the search done so far, not a fixed set of features extracted just from the conjecture.
3.1 Proof Search State Representation for Guiding Saturation
One of the motivations for the work presented here is to produce an intuitive, compact and evolving heuristic representation of proof search state in the context of learning-guided saturation proving. As usual, it should be a vector of (real-valued) features that are either manually designed or learned. In a high-level way, our proposed representation is a vector expressing an abstract similarity of the search state to (possibly many) previous related proofs. This can be implemented in different ways, using both statistical and symbolic methods and their combinations. An example and motivation come again from the work of Veroff, where a search is considered promising when the given clauses frequently match hints. The gaps between the hint matchings may correspond to the more brute-force bridges between the different proof ideas expressed by the hints.
Our first practical implementation, introduced in Section 5, is to load upon search initialization a set of related proofs, and for each of them keep track of the ratio of its clauses that have already been subsumed during the current search. The subsumption checking uses E's watchlist mechanism (Section 4). The long vector of such proof completion ratios is our heuristic representation of the proof search state, which is both compact and typically evolving, making it suitable for both hard-coded and learned clause selection heuristics.
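The completion-ratio vector can be sketched directly: assuming each loaded proof is represented as a set of its clauses and the search maintains the set of watchlist clauses matched (subsumed) so far, the state vector is just one ratio per proof.

```python
def proof_state_vector(proofs, subsumed):
    """Heuristic proof search state: one completion ratio per loaded proof.

    proofs:   list of clause sets, one set per previous related proof
    subsumed: set of watchlist clauses already matched in the current search
    """
    return [len(p & subsumed) / len(p) for p in proofs]

# two loaded proofs sharing a clause; "c2" and "c3" matched so far
proofs = [{"c1", "c2", "c3", "c4"}, {"c2", "c5"}]
matched = {"c2", "c3"}
print(proof_state_vector(proofs, matched))  # [0.5, 0.5]
```

As the search subsumes more watchlist clauses, the vector entries grow towards 1.0, giving an evolving (rather than fixed, conjecture-only) characterization of the search.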
In this work we start with fast hard-coded watchlist-style heuristics for focusing inferences on clauses that progress the more completed proofs (Section 5). However, training e.g. a statistical ENIGMA-style [15] clause evaluation model by adding the proof completion ratios to the currently used ENIGMA features is a straightforward extension.
4 Static Watchlist Guidance and its Implementation in E
E originally implemented a watchlist mechanism as a means to force direct, constructive proofs in first-order logic. For this application, the watchlist contains a number of goal clauses (corresponding to the hypotheses to be proven), and all newly generated and processed clauses are checked against the watchlist. If one of the watchlist clauses is subsumed by a new clause, the former is removed from the watchlist. The proof search is complete once all clauses from the watchlist have been removed. In contrast to the normal proof by contradiction, this mechanism is not complete. However, it is surprisingly effective in practice, and it produces a proof by forward reasoning.
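This original mechanism can be sketched as follows, with subsumption simplified to a subset test on literal sets (real first-order subsumption also matches literals up to substitution):

```python
def subsumes(c, d):
    # simplified: clause c subsumes d if c's literals are a subset of d's
    # (first-order subsumption would additionally allow a substitution)
    return c <= d

def watchlist_step(watchlist, new_clause):
    """Remove all watchlist clauses subsumed by new_clause; the forward
    'proof' is complete once the watchlist becomes empty."""
    remaining = {w for w in watchlist if not subsumes(new_clause, w)}
    return remaining, len(remaining) == 0

wl = {frozenset({"p", "q"}), frozenset({"r"})}   # two goal clauses
wl, done = watchlist_step(wl, frozenset({"p"}))  # derives something subsuming {p,q}
print(done)   # False: {r} is still on the watchlist
wl, done = watchlist_step(wl, frozenset({"r"}))
print(done)   # True: all goals have been derived
```

Each derived clause can discharge several goals at once, and the search stops as soon as every goal clause has been subsumed by some derived clause.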
It was quickly noted that the basic watchlist mechanism can also be used to implement a mechanism similar to the hints successfully used to guide Otter [24] (and its successor Prover9 [25]) in a semi-interactive manner [37]. Hints in this sense are intermediate results or lemmas expected to be useful in a proof. However, they are not provided as part of the logical premises, but have to be derived during the proof search. While the hints are specified when the prover is started, they are only used to guide the proof search: if a clause matches a hint, it is prioritized for processing. If all clauses needed for a proof are provided as hints, in theory the prover can be guided to prove a theorem without any search, i.e., it can replay a previous proof. A more general idea, explored in this paper, is to fill the watchlist with a large number of clauses useful in proofs of similar problems.
In E, the watchlist is loaded on startup and stored in a feature vector index [29] that allows for efficient retrieval of subsumed (and subsuming) clauses. By default, watchlist clauses are simplified in the same way as processed clauses, i.e., they are kept in normal form with respect to the current set of processed clauses. This increases the chance that a new clause (which is always simplified) can match a similar watchlist clause. If used to control the proof search, subsumed clauses can optionally remain on the watchlist.
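The idea behind feature vector indexing can be illustrated with a necessary condition for subsumption: a clause can only subsume another if each of its feature counts is no larger than the corresponding count in the candidate. The sketch below uses per-predicate literal counts as the (much simplified) features; E's actual index uses richer feature vectors, but the pruning principle is the same.

```python
from collections import Counter

def features(clause):
    # simplified feature vector: occurrence count per predicate symbol
    # (literals are strings like "p a"; the first token is the predicate)
    return Counter(lit.split()[0] for lit in clause)

def may_subsume(c, d):
    """Necessary condition for c subsuming d: c's feature counts must be
    componentwise <= d's. Cheap filter applied before the expensive full
    subsumption test, pruning most candidates retrieved from the index."""
    fc, fd = features(c), features(d)
    return all(fd[s] >= n for s, n in fc.items())

print(may_subsume({"p x"}, {"p a", "q a"}))  # True: candidate survives the filter
print(may_subsume({"q b"}, {"p a"}))         # False: pruned without a full test
```

Only clauses passing this componentwise test need to be handed to the full (substitution-aware) subsumption check, which is what makes retrieval from a large watchlist efficient.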
We have extended E's domain-specific language for search heuristics with two priority functions that access information about the relationship of clauses to the watchlist: the function PreferWatchlist gives a higher rank to clauses that subsume at least one watchlist clause, and the dual function DeferWatchlist ranks them lower. Using the first, we have also defined four built-in heuristics that preferably process watchlist clauses. These include a pure watchlist heuristic, a simple interleaved watchlist function (picking ten out of every eleven clauses from the watchlist, the last using FIFO), and a modification of a strong heuristic obtained from a genetic algorithm [27] that interleaves several different evaluation schemes and was modified to prefer watchlist clauses in two of its four sub-evaluation functions.

5 Dynamic Watchlist Guidance
In addition to the above-mentioned static watchlist guidance, we propose and experiment with an alternative: dynamic watchlist guidance. With dynamic watchlist guidance, several watchlists, as opposed to a single watchlist, are loaded on startup. Separate watchlists are supposed to group clauses which are more likely to appear together in a single proof. The easiest way to produce watchlists with this property is to collect previously proved problems and use their proofs as watchlists. This is our current implementation, i.e., each watchlist corresponds to a previous proof. During a proof search, we maintain for each watchlist its completion status, i.e., the number of its clauses that have already been encountered. The main idea behind our dynamic watchlist guidance is to prefer clauses which appear on watchlists that are closer to completion. Since watchlists now exactly correspond to previous refutational proofs, completion of any watchlist implies that the current proof search is finished.
5.1 Watchlist Proof Progress
Let watchlists W_1, ..., W_n be given for a proof search. For each watchlist W_i we keep a watchlist progress counter, denoted progress_i, which is initially set to 0. Whenever a clause C is generated during the proof search, we check whether C subsumes some clause from some watchlist W_i. When C subsumes a clause from W_i, we increase progress_i by 1. The subsumed clause from W_i is then marked as encountered, and it is not considered in future watchlist subsumption checks. (Alternatively, the subsumed watchlist clause can still be considered in future subsumption checks, but the progress counter should not be increased when it is subsumed again, because we want the counter to represent the number of different clauses from W_i encountered so far.) Note that a single generated clause can subsume several clauses from one or more watchlists, hence several progress counters might be increased, possibly multiple times, as a result of generating C.
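The bookkeeping above amounts to a small update routine run on every generated clause. The sketch below again simplifies subsumption to a subset test; the `encountered` set implements the marking of already-matched watchlist clauses.

```python
def update_progress(clause, watchlists, progress, encountered, subsumes):
    """On generating `clause`, bump progress_i for every watchlist W_i
    containing a not-yet-encountered clause that `clause` subsumes,
    and mark those watchlist clauses as encountered."""
    for i, wl in enumerate(watchlists):
        for w in wl:
            if (i, w) not in encountered and subsumes(clause, w):
                encountered.add((i, w))
                progress[i] += 1

subsumes = lambda c, w: c <= w            # simplified subsumption
wls = [[frozenset({"a", "b"}), frozenset({"c"})],   # W_1
       [frozenset({"a", "b"})]]                     # W_2
progress, seen = [0, 0], set()
update_progress(frozenset({"a"}), wls, progress, seen, subsumes)
print(progress)   # [1, 1]: one clause matched on each watchlist
```

Note that the single clause {a} bumps both counters (it matches a clause on each watchlist), and a repeated match of the same watchlist clause would not bump them again.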
5.2 Standard Dynamic Watchlist Relevance
The easiest way to use progress counters to guide given-clause selection is to assign a (standard) dynamic watchlist relevance to each generated clause C, denoted relevance(C), as follows. Whenever C is generated, we check it against all the watchlists for subsumption and update the watchlist progress counters. Any clause C which does not subsume any watchlist clause is given relevance(C) = 0. When C subsumes some watchlist clause, its relevance is the maximum watchlist completion ratio over all the matched watchlists. Formally, let us write C ⊑ W_i when clause C subsumes some clause from watchlist W_i. For a clause C matching at least one watchlist, its relevance is computed as

    relevance(C) = max over all W_i with C ⊑ W_i of (progress_i / |W_i|).
The assumption is that a watchlist that is matched more is more relevant to the current proof search. In our current implementation, the relevance is computed at the time of generation of and it is not updated afterwards. As future work, we propose to also update the relevance of all generated but not yet processed clauses from time to time in order to reflect updates of the watchlist progress counters. Note that this is expensive, as the number of generated clauses is typically high. Suitable indexing could be used to lower this cost or even to do the update immediately just for the affected clauses.
To use the watchlist relevance in E, we extend E's domain-specific language for search heuristics with two priority functions, PreferWatchlistRelevant and DeferWatchlistRelevant. The first priority function ranks higher the clauses with higher watchlist relevance (technically, E's priority function returns an integer priority, and clauses with smaller values are preferred; hence we compute the priority by inverting and scaling the relevance, e.g. as int(1000 * (1 - relevance(C)))), and the other function does the opposite. These priority functions can be used to build E's heuristics just like in the case of the static watchlist guidance. As a result, we can instruct E to process watchlist-relevant clauses in advance.
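Putting the pieces together, the standard dynamic relevance and the derived integer priority can be sketched as below. Subsumption is again simplified to a subset test, and the 1000 scaling in `priority` is the illustrative conversion mentioned above, not necessarily E's exact constant.

```python
def relevance(clause, watchlists, progress, subsumes):
    """Standard dynamic watchlist relevance: the maximum completion ratio
    over all watchlists containing a clause subsumed by `clause`;
    0 if no watchlist clause is matched."""
    ratios = [progress[i] / len(wl)
              for i, wl in enumerate(watchlists)
              if any(subsumes(clause, w) for w in wl)]
    return max(ratios, default=0.0)

def priority(rel):
    # smaller priority values are preferred in E, so invert and scale
    return int(1000 * (1 - rel))

subsumes = lambda c, w: c <= w
wls = [[frozenset({"a"}), frozenset({"b"})],                      # 1/2 complete
       [frozenset({"a"}), frozenset({"c"}), frozenset({"d"})]]    # 2/3 complete
prog = [1, 2]
r = relevance(frozenset({"a"}), wls, prog, subsumes)
print(r, priority(r))   # the 2/3-complete watchlist wins: ~0.667, priority 333
```

A clause matching no watchlist gets relevance 0 and hence the worst (largest) priority, 1000.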
5.3 Inherited Dynamic Watchlist Relevance
The previous standard watchlist relevance prioritizes only clauses subsuming watchlist clauses, but it behaves indifferently with respect to other clauses. In order to provide some guidance even for clauses which do not subsume any watchlist clause, we can examine the watchlist relevance of the parents of each generated clause, and prioritize clauses with watchlist-relevant parents. Let parents(C) denote the set of previously processed clauses from which C has been derived. The inherited dynamic watchlist relevance, denoted relevance_1(C), is a combination of the standard dynamic relevance with the average of the parents' relevances, multiplied by a decay factor δ:

    relevance_1(C) = relevance(C) + δ * (Σ_{D ∈ parents(C)} relevance_1(D)) / |parents(C)|.
Clearly, the inherited relevance equals the standard relevance for the initial clauses with no parents. The decay factor (δ < 1) determines the importance of the parents' watchlist relevances; in our experiments it is a fixed value. Note that the inherited relevances of parents(C) are already precomputed at the time of generating C, hence no recursive computation is necessary.
Because the above computes the average of the parents' inherited relevances, the inherited watchlist relevance accumulates the relevance of all the ancestors. As a result, relevance_1(C) is greater than 0 if and only if C has some ancestor which subsumed a watchlist clause at some point. This might have the undesirable effect that clauses unrelated to the watchlist are completely ignored during the proof search. In practice, however, it seems important to also consider watchlist-unrelated clauses to some degree, in order to prove new conjectures which do not appear on the input watchlist. Hence we introduce two threshold parameters α and β which reset the relevance to 0 as follows. Let ∥C∥ denote the length of clause C, counting occurrences of symbols in C. The final relevance is

    relevance_2(C) = 0 if relevance_1(C) < α or relevance_1(C)/∥C∥ < β, and relevance_2(C) = relevance_1(C) otherwise.
Parameter α is thus an absolute threshold on the inherited watchlist relevance, while β combines the relevance with the clause length. (In our experiments, the values of α and β were found by a small grid search over a random sample of 500 problems.) As a result, shorter watchlist-unrelated clauses are preferred to longer (distantly) watchlist-related clauses.
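The inheritance and thresholding steps can be sketched together as follows. The decay δ and the thresholds α and β below are illustrative values chosen for the example, not the tuned constants from the experiments.

```python
def inherited_relevance(std_rel, parent_rels, delta=0.1):
    """relevance_1(C) = relevance(C) + delta * average of the parents'
    inherited relevances (delta is an assumed illustrative value)."""
    if not parent_rels:
        return std_rel               # initial clauses: just the standard relevance
    return std_rel + delta * sum(parent_rels) / len(parent_rels)

def threshold(rel1, length, alpha, beta):
    """relevance_2: reset distant watchlist relevance to 0 unless it passes
    both the absolute threshold alpha and the length-weighted threshold beta."""
    if rel1 < alpha or rel1 / length < beta:
        return 0.0
    return rel1

# a watchlist-unrelated clause (standard relevance 0) with relevant parents
r = inherited_relevance(0.0, [0.5, 0.3])
print(r)                                        # ~0.04, inherited from parents
print(threshold(r, length=5, alpha=0.03, beta=0.009))   # length-weighted reset: 0.0
```

With these example thresholds, the clause passes the absolute test (0.04 >= 0.03) but fails the length-weighted one (0.04/5 < 0.009), so its relevance is reset: a longer, only distantly watchlist-related clause loses out to short unrelated clauses.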
6 Experiments with Watchlist Guidance
For our experiments we construct watchlists from the proofs found by E on a benchmark of Mizar40 [19] problems in the MPTP dataset [35]. (Precisely, we have used the small (bushy, re-proving) versions, but without ATP minimization; they can be found at http://grid01.ciirc.cvut.cz/~mptp/7.13.01_4.181.1147/MPTP2/problems_small_consist.tar.gz. Experimental results and code can be found at https://github.com/ai4reason/eproverdata/tree/master/ITP18.) These initial proofs were found by an evolutionarily optimized [14] ensemble of 32 E strategies, each run for 5 s. These are our baseline strategies. Due to limited computational resources, we do most of the experiments with the top 5 strategies that (greedily) cover most solutions (the top 5 greedy cover). These are strategies number 2, 8, 9, 26 and 28, henceforth called A, B, C, D, E. In 5 s (run in parallel) they together solve 21122 problems. We also evaluate these five strategies in 10 seconds, jointly solving 21670 problems. The 21122 proofs yield over 100000 unique proof clauses that can be used for watchlist-based guidance in our experiments. We also use smaller datasets randomly sampled from the full set of problems, to be able to explore more methods. All problems are run on the same hardware (Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz with 256 GB RAM) and with the same memory limits.
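The greedy cover used to pick the top strategies is the standard greedy set-cover loop: repeatedly pick the strategy that adds the most newly solved problems. A small sketch (the strategy names and solved sets here are illustrative, not the experimental data):

```python
def greedy_cover(solved_by, k):
    """Greedily pick k strategies maximizing newly covered problems.

    solved_by: dict mapping strategy name -> set of problems it solves."""
    covered, chosen = set(), []
    for _ in range(k):
        best = max(solved_by, key=lambda s: len(solved_by[s] - covered))
        chosen.append(best)
        covered |= solved_by[best]
    return chosen, covered

runs = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {5, 6}}
chosen, covered = greedy_cover(runs, 2)
print(chosen, len(covered))  # ['A', 'C'] 6
```

Note that B solves more problems than C, yet C is picked second because it contributes more problems not already covered by A; this is exactly why a greedy cover yields a more diverse ensemble than simply taking the strongest individual strategies.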
Each E strategy is specified as a frequencyweighted combination of parameterized clause evaluation functions (CEF) combined with a selection of inference rules. Below we show a simplified example strategy specifying the term ordering KBO, and combining (with weights 2 and 4) two CEFs made up of weight functions Clauseweight and FIFOWeight and priority functions DeferSOS and PreferWatchlist.
-tKBO -H'(2*Clauseweight(DeferSOS,20,9999,4), 4*FIFOWeight(PreferWatchlist))'
6.1 Watchlist Selection Methods
We have experimented with several methods for creation of static and dynamic watchlists. Typically we use only the proofs found by a particular baseline strategy to construct the watchlists used for testing the guided version of that strategy. Using all proof clauses as a watchlist slows E down to 6 given clauses per second. This is comparable to the speed of Prover9 with similarly large watchlists, but there are indexing methods that could speed this up. We have run several smaller tests, but do not include this method in the evaluation due to limited computational resources. Instead, we select a smaller set of clauses. The methods are as follows:

art: use all proof clauses from theorems in the problem's Mizar article (excluding the current theorem). Such watchlists stay small enough not to cause any significant slowdown of E.

freq: use high-frequency proof clauses for static watchlists, i.e., clauses that appear in many proofs.

kNNst: use nearest-neighbor (NN) learning to suggest useful static watchlists for each problem, based on symbol and term-based features [20] of the conjecture. This is very similar to the standard use of NN and other learners for premise selection. In more detail, we use symbols, walks of length 2 on formula trees, and common subterms (with variables and Skolem symbols unified). Each proof is turned into a multi-label training example, where the labels are the (serially numbered) clauses used in the proof, and the features are extracted from the conjecture.

kNNdyn: use NN in a similar way to suggest the most related proofs for dynamic watchlists. This is done in two iterations.

In the first iteration (kNNdyn.i), only the conjecture-based similarity is used to select related problems and their proofs.

The second iteration (kNNdyn.ii) then uses data mined from the proofs obtained with dynamic guidance in the first iteration. From each such proof we create a training example associating the conjecture features of the proved theorem with the names of the proofs that matched (i.e., guided the inference of) the clauses needed in that proof. On this dataset we again train an NN learner, which recommends the most useful related proofs for guiding a particular conjecture.

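A bare-bones stand-in for such an NN suggestion step ranks training conjectures by feature overlap and returns the proofs attached to the nearest ones. The theorem names and features below are purely illustrative.

```python
def knn_suggest(conj_features, train, k=2):
    """Suggest watchlist proofs for a new conjecture: rank training
    examples by conjecture-feature overlap and collect the proof labels
    of the k nearest ones (a toy sketch of the NN learner used here)."""
    def sim(f1, f2):                      # simple overlap similarity
        return len(set(f1) & set(f2))
    ranked = sorted(train, reverse=True,
                    key=lambda ex: sim(conj_features, ex["features"]))
    out = []
    for ex in ranked[:k]:
        out.extend(ex["proofs"])
    return out

# hypothetical training data: conjecture features -> proofs to load
train = [
    {"features": ["group", "inverse", "mul"], "proofs": ["GRP_1:5"]},
    {"features": ["lattice", "join", "meet"], "proofs": ["LATTICES:23"]},
    {"features": ["group", "mul", "unit"],    "proofs": ["GRP_1:7"]},
]
print(knn_suggest(["group", "mul", "identity"], train, k=2))
```

For a group-flavored conjecture the two group-related proofs are suggested and the lattice proof is ignored; in the real setting the returned proofs become the dynamic watchlists loaded on startup.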
6.2 Using Watchlists in E Strategies
As described in Section 4, watchlist subsumption defines the PreferWatchlist priority function that prioritizes clauses that subsume at least one watchlist clause. Below we describe several ways to use this priority function and the newly defined dynamic PreferWatchlistRelevant priority function and its relevance-inheriting modifications. Each of them can additionally take the "no-remove" option, keeping subsumed watchlist clauses on the watchlist and thus allowing repeated matching by different clauses. Preliminary testing has shown that just adding a single watchlist-based clause evaluation function (CEF) to the baseline CEFs (specifically, we tried adding Defaultweight(PreferWatchlist) and ConjectureRelativeSymbolWeight(PreferWatchlist) with higher frequencies than the rest of the CEFs in the strategy) is not as good as the methods defined below. In the rest of the paper we provide short names for the methods, such as prefA (baseline strategy A modified by the pref method described below).

pref: replace all priority functions in a baseline strategy with the PreferWatchlist priority function. The resulting strategies look as follows:
H(2*Clauseweight(PreferWatchlist,20,9999,4),
4*FIFOWeight(PreferWatchlist)) 
const: replace all priority functions in a baseline strategy with ConstPrio, which assigns the same priority to all clauses, so all ranking is done by weight functions alone.

uwl: always prefer clauses that match the watchlist, but use the baseline strategy's priority function otherwise. (uwl is implemented in E's source code as an option.)

ska: modify watchlist subsumption in E to treat all Skolem symbols of the same arity as equal, thus widening the watchlist guidance. This can be used with any strategy. In this paper it is used with pref.

dyn: replace all priority functions in a baseline strategy with PreferWatchlistRelevant, which dynamically weights watchlist clauses (Section 5.2).

dyndec: add the relevance inheritance mechanisms to dyn (Section 5.3).
6.3 Evaluation
First we measure the slowdown caused by larger static watchlists, using the best baseline strategy and a random sample of 10000 problems. The results are shown in Table 1. We see that the speed significantly degrades with watchlists of size 10000, while watchlists of around 500 clauses incur only a small performance penalty.
Size    10    100   256   512   1000  10000
proved  3275  3275  3287  3283  3248  2912
PPS     8935  9528  8661  7288  4807  575
Table 2 shows the 10 s evaluation of several static and dynamic methods on a random sample of the problems, using article-based watchlists (method art in Section 6.1). For comparison, both E's auto strategy and its autoschedule prove fewer of these problems in 10 s; even given 50 seconds, the autoschedule proves fewer problems than our top 5 cover's 1964.
The first surprising result is that const significantly outperforms the baseline. This indicates that the old-style simple E priority functions may do more harm than good if they are allowed to override the more recent and sophisticated weight functions. The ska strategy performs best here, and a variety of strategies provide better coverage. It is interesting to note that the sets of problems solved by ska and pref only partially overlap. The original evo strategy performs well, but lacks diversity.
Strategy  baseline  const  pref  ska  dyn  evo  uwl 

A  1238  1493  1503  1510  1500  1303  1247 
B  1255  1296  1315  1330  1316  1300  1277 
C  1075  1166  1205  1183  1201  1068  1097 
D  1102  1133  1176  1190  1175  1330  1132 
E  1138  1141  1141  1153  1139  1070  1139 
total  1853  1910  1931  1933  1922  1659  1868 
Table 3 briefly evaluates NN selection of watchlist clauses (method kNNst in Section 6.1) on a single strategy prefA.
Watchlist size  16  64  256  1024  2048 

Proved  1518  1531  1528  1532  1520 
Next we use the NN learner to suggest watchlist proofs (method kNNdyn.i; all clauses in the suggested proofs are used) for pref and dyn. Table 4 evaluates the influence of the number of related proofs loaded for the dynamic strategies. Interestingly, pref outperforms dyn almost everywhere, but dyn's ensemble of strategies A-E generally performs best and its top 5 cover is better. We conclude that dyn's dynamic relevance weighting allows the strategies to diversify more.
Table 5 evaluates the top 5 greedy cover from Table 4 on the full Mizar dataset, already showing a significant improvement over the 21670 proofs produced by the 5 baseline strategies. Based on proof data from a full run of the top 5 greedy cover in Table 5, new NN proof suggestions were made (method kNNdyn.ii) and dyn's grid search was rerun; see Table 6 and Table 7 for the second-round results.
We also test the relevance-inheriting dynamic watchlist feature (dyndec), primarily to determine whether different proofs can be found. The results are shown in Table 8. This version adds further problems to the top 5 greedy cover of all the strategies run on the problem dataset, making it useful in a schedule despite its lower performance alone. Table 9 shows this greedy cover and its evaluation on the full dataset. The 23192 problems proved by our new greedy cover are a 7% improvement over the top 5 baseline strategies.
size  dynA  dynB  dynC  dynD  dynE  total 

4  1531  1352  1235  1194  1165  1957 
8  1543  1366  1253  1188  1170  1956 
16  1529  1357  1224  1218  1185  1951 
32  1546  1373  1240  1218  1188  1962 
64  1535  1376  1216  1215  1166  1935 
128  1506  1351  1195  1214  1147  1907 
1024  1108  963  710  943  765  1404 
size  prefA  prefB  prefC  prefD  prefE  total 
4  1539  1369  1210  1220  1159  1944 
8  1554  1385  1219  1240  1168  1941 
16  1572  1405  1225  1254  1180  1952 
32  1568  1412  1231  1271  1190  1958 
64  1567  1402  1228  1262  1172  1952 
128  1552  1388  1210  1248  1160  1934 
1024  1195  1061  791  991  806  1501 
dynA_32  dynC_8  dynD_16  dynE_4  dynB_64  

added  17964  2531  1024  760  282 
total  17964  14014  14294  13449  16175 
size  dyn2A  dyn2B  dyn2C  dyn2D  dyn2E  total  round 1 total 

4  1539  1368  1235  1209  1179  1961  1957 
8  1554  1376  1253  1217  1183  1971  1956 
16  1565  1382  1256  1221  1181  1972  1951 
32  1557  1383  1252  1227  1182  1968  1962 
64  1545  1385  1244  1222  1171  1963  1935 
128  1531  1374  1221  1227  1171  1941  1907 
dyn2A_16  dyn2C_16  dyn2D_32  dyn2E_4  dyn2B_4  

total  18583  14486  14720  13532  16244 
added  18583  2553  1007  599  254 
size  dyndec2A  dyndec2B  dyndec2C  dyndec2D  dyndec2E  total 

4  1432  1354  1184  1203  1152  1885 
16  1384  1316  1176  1221  1140  1846 
32  1381  1309  1157  1209  1133  1820 
128  1326  1295  1127  1172  1082  1769 
total  dyn2A_16  dyn2C_16  dyndec2D_16  dyn2E_4  dyndec2A_128 

2007  1565  230  97  68  47 
23192  18583  2553  1050  584  422 
23192  18583  14486  14514  13532  15916 
7 Examples
The Mizar theorem YELLOW_5:36 (http://grid01.ciirc.cvut.cz/~mptp/7.13.01_4.181.1147/html/yellow_5#T36) states De Morgan's laws for Boolean lattices.
Using 32 related proofs results in 2220 clauses placed on the watchlists. The dynamically guided proof search takes 5218 (nontrivial) given clause loops done in 2 s, and the resulting ATP proof is 436 inferences long. There are 194 given clauses that match the watchlist during the proof search, and 120 (61.8%) of them end up being part of the proof. I.e., 27.5% of the proof consists of steps guided by the watchlist mechanism. The proof search using the same settings, but without the watchlist, takes 6550 nontrivial given clause loops (25.5% more). The proof of the theorem WAYBEL_1:85 (http://grid01.ciirc.cvut.cz/~mptp/7.13.01_4.181.1147/html/waybel_1#T85) is used considerably for this guidance.
Note that this proof is done under the weaker assumptions of H being lower bounded and Heyting, rather than being Boolean. Yet, 62 (80.5%) of the 77 clauses from the proof of WAYBEL_1:85 are eventually matched during the proof search, and 38 (49.4%) of these 77 clauses are used in the proof of YELLOW_5:36. In Table 10 we show the final state of proof progress for the 32 loaded proofs after the last nonempty clause matched the watchlist. For each proof we show both the computed completion ratio and the number of matched versus all clauses.
0  0.438  42/96  1  0.727  56/77  2  0.865  45/52  3  0.360  9/25 

4  0.750  51/68  5  0.259  7/27  6  0.805  62/77  7  0.302  73/242 
8  0.652  15/23  9  0.286  8/28  10  0.259  7/27  11  0.338  24/71 
12  0.680  17/25  13  0.509  27/53  14  0.357  10/28  15  0.568  25/44 
16  0.703  52/74  17  0.029  8/272  18  0.379  33/87  19  0.424  14/33 
20  0.471  16/34  21  0.323  20/62  22  0.333  7/21  23  0.520  26/50 
24  0.524  22/42  25  0.523  45/86  26  0.462  6/13  27  0.370  20/54 
28  0.411  30/73  29  0.364  20/55  30  0.571  16/28  31  0.357  10/28 
An example of a theorem that can be proved in 1.2 s with guidance but cannot be proved in 10 s with any unguided method is the theorem BOOLEALG:62 (http://grid01.ciirc.cvut.cz/~mptp/7.13.01_4.181.1147/html/boolealg#T62) about the symmetric difference in Boolean lattices.
Using 32 related proofs results in 2768 clauses placed on the watchlists. The proof search then takes 4748 (nontrivial) given clause loops, and the watchlist-guided ATP proof is 633 inferences long. There are 613 given clauses that match the watchlist during the proof search, and 266 (43.4%) of them end up being part of the proof. I.e., 42% of the proof consists of steps guided by the watchlist mechanism. Among the theorems whose proofs are most useful for the guidance are LATTICES:23 (http://grid01.ciirc.cvut.cz/~mptp/7.13.01_4.181.1147/html/lattices#T23), BOOLEALG:33 (http://grid01.ciirc.cvut.cz/~mptp/7.13.01_4.181.1147/html/boolealg#T33) and BOOLEALG:54 (http://grid01.ciirc.cvut.cz/~mptp/7.13.01_4.181.1147/html/boolealg#T54) on Boolean lattices.
Finally, we show several theorems (http://grid01.ciirc.cvut.cz/~mptp/7.13.01_4.181.1147/html/boolealg#T68, http://grid01.ciirc.cvut.cz/~mptp/7.13.01_4.181.1147/html/closure1#T21, http://grid01.ciirc.cvut.cz/~mptp/7.13.01_4.181.1147/html/bcialg_4#T44, http://grid01.ciirc.cvut.cz/~mptp/7.13.01_4.181.1147/html/xxreal_3#T67) with nontrivial Mizar proofs and relatively long ATP proofs obtained with significant guidance. These theorems cannot be proved by any other method used in this work.
8 Related Work and Possible Extensions
The closest related work is the hint guidance in Otter and Prover9. Our focus is however on large ITP-style theories with large signatures and heterogeneous facts and proofs spanning various areas of mathematics. This motivates using machine learning for reducing the size of the static watchlists and the implementation of the dynamic watchlist mechanisms. Several implementations of internal proof search guidance using statistical learning have been mentioned in Sections 1 and 3. In both the tableau-based systems and the tactical ITP systems the statistical learning guidance benefits from a compact and directly usable notion of proof state, which is not immediately available in saturation-style ATP.
By delegating the notion of similarity to subsumption we are relying on fast, crisp and well-known symbolic ATP mechanisms. This has advantages as well as disadvantages. Compared to the ENIGMA [15] and neural [23] statistical guiding methods, the subsumption-based notion of clause similarity is not feature-based or learned. This similarity relation is crisp and sparser compared to the similarity relations induced by the statistical methods. The proof guidance is limited when no derived clauses subsume any of the loaded proof clauses. This can be countered by loading a high number of proofs and widening (or softening) the similarity relation in various approximate ways. On the other hand, subsumption is fast compared to the deep neural methods (see [23]) and enjoys clear guarantees of the underlying symbolic calculus. For example, when all the (non-empty) clauses from a loaded related proof have been subsumed in the current proof search, it is clear that the current proof search is successfully finished.
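To make the subsumption-based similarity concrete, the following minimal Python sketch checks whether a clause subsumes another, i.e., whether some substitution maps all of its literals into the other clause. This is purely didactic: E's actual implementation uses feature-vector indexing [29] and is far more efficient, and the term encoding here (tuples for compound terms, uppercase strings for variables) is our own convention.

```python
# Didactic first-order subsumption check. Terms are tuples ('f', arg1, ...),
# variables are strings starting with an uppercase letter, and a literal is
# a pair (sign, atom). Clause C subsumes clause D iff some substitution s
# satisfies C*s being a subset of D.

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def match(pat, term, subst):
    """One-way matching: extend subst so that pat*subst == term, or None."""
    if is_var(pat):
        if pat in subst:
            return subst if subst[pat] == term else None
        extended = dict(subst)
        extended[pat] = term
        return extended
    if isinstance(pat, tuple) and isinstance(term, tuple) and \
            len(pat) == len(term) and pat[0] == term[0]:
        for p, t in zip(pat[1:], term[1:]):
            subst = match(p, t, subst)
            if subst is None:
                return None
        return subst
    return subst if pat == term else None

def subsumes(c, d, subst=None):
    """True iff some substitution maps every literal of clause c into d."""
    if subst is None:
        subst = {}
    if not c:
        return True
    (sign, atom), rest = c[0], c[1:]
    for sign2, atom2 in d:
        if sign == sign2:
            extended = match(atom, atom2, subst)
            if extended is not None and subsumes(rest, d, extended):
                return True
    return False
```

For instance, the unit clause p(X) subsumes the clause p(a) ∨ ¬q, while p(X) ∨ q(X) does not subsume p(a) ∨ q(b), since X cannot be bound to both a and b.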
A clear novelty is the focusing of the proof search towards the (possibly implausible) inferences needed for completing the loaded proofs. Existing statistical guiding methods will fail to notice such opportunities, and the static watchlist guidance has no way of distinguishing the watchlist matchers that lead faster to proof completion. In a way this mechanism resembles the feedback obtained by Monte Carlo exploration, where a seemingly statistically unlikely decision can be made, based on many rollouts and averaging of their results. Instead, we rely here on a database of previous proofs, similar to previously played and finished games. The newly introduced heuristic proof search (proof progress) representation may however enable further experiments with Monte Carlo guidance.
8.1 Possible Extensions
Several extensions have already been discussed above. Here we list the most obvious ones.
More sophisticated progress metrics:
The current proof-progress criterion may be too crude. Subsuming all the initial clauses of a related proof is unlikely until the empty clause is derived. In general, a large part of a related proof may not be needed once the right clauses in the “middle of the proof” are subsumed by the current proof search. A better proof-progress metric would compute the smallest number of proof clauses that are still needed to entail the contradiction. This is achievable, but technically more involved, also due to issues such as rewriting of the watchlist clauses during the current proof search.
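As a sketch of this refined metric, assuming a hypothetical encoding of a loaded proof as a map from each derived clause to its parents (not E's actual data structures), the clauses still needed can be computed by a backward traversal from the empty clause that stops at already-subsumed clauses:

```python
# Sketch of the finer proof-progress metric: walk the stored proof's
# derivation DAG backwards from the empty clause, stopping at clauses
# already subsumed by the current search. Only the clauses still visited
# are genuinely needed to finish the proof. The `parents` map
# (clause id -> ids of the clauses it was derived from) is a hypothetical
# encoding of a loaded proof.

def clauses_still_needed(parents, matched, goal="empty"):
    """Proof clauses not yet covered by the current proof search."""
    needed, stack = set(), [goal]
    while stack:
        c = stack.pop()
        if c in matched or c in needed:
            continue                      # already subsumed, or visited
        needed.add(c)
        stack.extend(parents.get(c, ()))  # axioms have no parents
    return needed

def refined_ratio(all_clauses, parents, matched, goal="empty"):
    """Fraction of the proof covered under the refined metric."""
    still = clauses_still_needed(parents, matched, goal)
    return 1.0 - len(still) / len(all_clauses)
```

In a toy proof where d is derived from axioms a and b, and the empty clause from d and c, subsuming d alone makes a and b unnecessary, exactly the “middle of the proof” effect described above.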
Clause re-evaluation based on the evolving proof relevance:
As more and more watchlist clauses are matched, the proof relevance of the clauses generated earlier should be updated to mirror the current state. This is in general expensive, so it could be done only after each given-clause loop or after a significant number of watchlist matchings. An alternative is to add corresponding indexing mechanisms to the set of generated clauses, which would immediately reorder them in the evaluation queues based on the proof-relevance updates.
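A minimal sketch of the periodic variant, with a hypothetical `score` function standing in for the watchlist-relevance evaluation:

```python
import heapq

class RescoringQueue:
    """Evaluation queue whose priorities can be refreshed when the
    watchlist-relevance state evolves. `score` (clause -> priority,
    lower is better) is a hypothetical stand-in for the proof-relevance
    evaluation discussed in the text."""

    def __init__(self, score):
        self.score = score
        self.heap = []

    def push(self, clause):
        heapq.heappush(self.heap, (self.score(clause), clause))

    def rescore(self):
        # Full re-evaluation of all queued clauses. As noted above this is
        # expensive, so it would only run every few given-clause loops or
        # after many watchlist matchings.
        self.heap = [(self.score(c), c) for _, c in self.heap]
        heapq.heapify(self.heap)

    def pop(self):
        return heapq.heappop(self.heap)[1] if self.heap else None
```

After a `rescore()`, a clause whose proof relevance has improved moves ahead of clauses that were previously preferred.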
More abstract/approximate matching:
Instead of the strict notion of subsumption, more abstract or heuristic matching methods could be used. An interesting symbolic method to consider is matching modulo symbol alignments [9]. A number of approximate methods are already used by the above-mentioned statistical guiding methods.
Adding statistical methods for clause guidance:
Instead of using only hard-coded watchlist-style heuristics for focusing inferences, a statistical (e.g., ENIGMA-style) clause evaluation model could be trained by adding the vector of proof-completion ratios to the currently used ENIGMA features.
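A toy sketch of such a state-aware feature vector follows; the feature extraction below (symbol counts over a fixed vocabulary) is a simplified stand-in for illustration, not ENIGMA's actual term-walk features, and the linear scorer merely shows where a trained model's weights would apply.

```python
# Sketch of the proposed extension: concatenate an ENIGMA-style clause
# feature vector with the current vector of proof-completion ratios, so a
# learned clause evaluation model can condition on the evolving
# proof-search state.

def clause_features(clause, vocabulary):
    """Toy stand-in for ENIGMA features: counts of known symbols."""
    counts = {s: 0 for s in vocabulary}
    for symbol in clause.split():
        if symbol in counts:
            counts[symbol] += 1
    return [counts[s] for s in vocabulary]

def state_aware_features(clause, vocabulary, completion_ratios):
    """Clause features concatenated with the proof-progress vector."""
    return clause_features(clause, vocabulary) + list(completion_ratios)

def linear_score(features, weights):
    """A trained model would supply `weights`; here a plain dot product."""
    return sum(f * w for f, w in zip(features, weights))
```

With this representation, the same clause receives different evaluations at different stages of the proof search, since the completion-ratio part of the vector changes as watchlist clauses are matched.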
9 Conclusion
The portfolio of new proof-guiding methods developed here significantly improves E’s standard portfolio of strategies, and also the previous best set of strategies invented for Mizar by evolutionary methods. The best combination of five new strategies run in parallel for 10 seconds (a reasonable hammering time) will prove over 7% more Mizar problems than the previous best combination of five non-watchlist strategies. The improvement over E’s standard portfolio is much higher. Even though we focus on developing the strongest portfolio rather than a single best method, it is clear that the best guided versions also significantly improve over their non-guided counterparts. This improvement for the best new strategy (dyn2A used with the 16 most relevant proofs) is 26.5%. These are relatively high improvements in automated theorem proving.
We have shown that the new dynamic methods based on the idea of proof completion ratios improve over the static watchlist guidance. We have also shown that, as usual with learning-based guidance, iterating the methods to produce more proofs leads to stronger methods in the next iteration. The first experiments with widening the watchlist-based guidance by relatively simple inheritance mechanisms seem quite promising, contributing many new proofs. A number of extensions and experiments with guiding saturation-style proving have been opened for future research. We believe that various extensions of the compact and evolving heuristic representation of saturation-style proof search as introduced here will turn out to be of great importance for further development of learning-based saturation provers.
10 Acknowledgments
We thank Bob Veroff for many enlightening explanations and discussions of the watchlist mechanisms in Otter and Prover9. His “industry-grade” projects that prove open and interesting mathematical conjectures with hints and proof sketches have been a great source of inspiration for this work.
References
 [1] J. Alama, T. Heskes, D. Kühlwein, E. Tsivtsivadze, and J. Urban. Premise selection for mathematics by corpus analysis and kernel methods. J. Autom. Reasoning, 52(2):191–213, 2014.
 [2] A. A. Alemi, F. Chollet, N. Eén, G. Irving, C. Szegedy, and J. Urban. DeepMath – deep sequence models for premise selection. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5–10, 2016, Barcelona, Spain, pages 2235–2243, 2016.
 [3] L. Bachmair and H. Ganzinger. RewriteBased Equational Theorem Proving with Selection and Simplification. Journal of Logic and Computation, 3(4):217–247, 1994.
 [4] J. C. Blanchette, D. Greenaway, C. Kaliszyk, D. Kühlwein, and J. Urban. A learning-based fact selector for Isabelle/HOL. J. Autom. Reasoning, 57(3):219–244, 2016.
 [5] J. C. Blanchette, C. Kaliszyk, L. C. Paulson, and J. Urban. Hammering towards QED. J. Formalized Reasoning, 9(1):101–148, 2016.
 [6] T. Eiter and D. Sands, editors. LPAR-21, 21st International Conference on Logic for Programming, Artificial Intelligence and Reasoning, Maun, Botswana, May 7–12, 2017, volume 46 of EPiC Series in Computing. EasyChair, 2017.
 [7] M. Färber and C. Kaliszyk. Random forests for premise selection. In C. Lutz and S. Ranise, editors, Frontiers of Combining Systems – 10th International Symposium, FroCoS 2015, Wroclaw, Poland, September 21–24, 2015. Proceedings, volume 9322 of Lecture Notes in Computer Science, pages 325–340. Springer, 2015.
 [8] M. Färber, C. Kaliszyk, and J. Urban. Monte Carlo tableau proof search. In L. de Moura, editor, Automated Deduction – CADE 26 – 26th International Conference on Automated Deduction, Gothenburg, Sweden, August 6–11, 2017, Proceedings, volume 10395 of Lecture Notes in Computer Science, pages 563–579. Springer, 2017.
 [9] T. Gauthier and C. Kaliszyk. Matching concepts across HOL libraries. In S. M. Watt, J. H. Davenport, A. P. Sexton, P. Sojka, and J. Urban, editors, CICM’15, volume 8543 of LNCS, pages 267–281. Springer, 2014.
 [10] T. Gauthier, C. Kaliszyk, and J. Urban. TacticToe: Learning to reason with HOL4 tactics. In Eiter and Sands [6], pages 125–143.
 [11] A. Grabowski, A. Korniłowicz, and A. Naumowicz. Mizar in a nutshell. J. Formalized Reasoning, 3(2):153–245, 2010.
 [12] T. Gransden, N. Walkinshaw, and R. Raman. SEPIA: search for proofs using inferred automata. In Automated Deduction – CADE-25 – 25th International Conference on Automated Deduction, Berlin, Germany, August 1–7, 2015, Proceedings, pages 246–255, 2015.
 [13] J. Jakubuv and J. Urban. Extending E prover with similarity based clause selection strategies. In M. Kohlhase, M. Johansson, B. R. Miller, L. de Moura, and F. W. Tompa, editors, Intelligent Computer Mathematics – 9th International Conference, CICM 2016, Bialystok, Poland, July 25–29, 2016, Proceedings, volume 9791 of Lecture Notes in Computer Science, pages 151–156. Springer, 2016.
 [14] J. Jakubuv and J. Urban. BliStrTune: hierarchical invention of theorem proving strategies. In Y. Bertot and V. Vafeiadis, editors, Proceedings of the 6th ACM SIGPLAN Conference on Certified Programs and Proofs, CPP 2017, Paris, France, January 16–17, 2017, pages 43–52. ACM, 2017.
 [15] J. Jakubuv and J. Urban. ENIGMA: efficient learning-based inference guiding machine. In H. Geuvers, M. England, O. Hasan, F. Rabe, and O. Teschke, editors, Intelligent Computer Mathematics – 10th International Conference, CICM 2017, Edinburgh, UK, July 17–21, 2017, Proceedings, volume 10383 of Lecture Notes in Computer Science, pages 292–302. Springer, 2017.
 [16] C. Kaliszyk, S. Schulz, J. Urban, and J. Vyskocil. System description: E.T. 0.1. In A. P. Felty and A. Middeldorp, editors, Automated Deduction – CADE-25 – 25th International Conference on Automated Deduction, Berlin, Germany, August 1–7, 2015, Proceedings, volume 9195 of Lecture Notes in Computer Science, pages 389–398. Springer, 2015.
 [17] C. Kaliszyk and J. Urban. Learning-assisted automated reasoning with Flyspeck. J. Autom. Reasoning, 53(2):173–213, 2014.
 [18] C. Kaliszyk and J. Urban. FEMaLeCoP: Fairly efficient machine learning connection prover. In M. Davis, A. Fehnker, A. McIver, and A. Voronkov, editors, Logic for Programming, Artificial Intelligence, and Reasoning – 20th International Conference, LPAR-20 2015, Suva, Fiji, November 24–28, 2015, Proceedings, volume 9450 of Lecture Notes in Computer Science, pages 88–96. Springer, 2015.
 [19] C. Kaliszyk and J. Urban. MizAR 40 for Mizar 40. J. Autom. Reasoning, 55(3):245–256, 2015.
 [20] C. Kaliszyk, J. Urban, and J. Vyskočil. Efficient semantic features for automated reasoning over large theories. In Q. Yang and M. Wooldridge, editors, IJCAI’15, pages 3084–3090. AAAI Press, 2015.
 [21] M. K. Kinyon, R. Veroff, and P. Vojtechovský. Loops with abelian inner mapping groups: An application of automated deduction. In M. P. Bonacina and M. E. Stickel, editors, Automated Reasoning and Mathematics – Essays in Memory of William W. McCune, volume 7788 of LNCS, pages 151–164. Springer, 2013.
 [22] L. Kovács and A. Voronkov. First-order theorem proving and Vampire. In N. Sharygina and H. Veith, editors, CAV, volume 8044 of LNCS, pages 1–35. Springer, 2013.
 [23] S. M. Loos, G. Irving, C. Szegedy, and C. Kaliszyk. Deep network guided proof search. In Eiter and Sands [6], pages 85–105.
 [24] W. McCune and L. Wos. Otter: The CADE-13 Competition Incarnations. Journal of Automated Reasoning, 18(2):211–220, 1997. Special Issue on the CADE-13 ATP System Competition.
 [25] W. W. McCune. Prover9 and Mace4. http://www.cs.unm.edu/~mccune/prover9/, 2005–2010. (accessed 2016-03-29).
 [26] J. Otten and W. Bibel. leanCoP: lean connection-based theorem proving. J. Symb. Comput., 36(1–2):139–161, 2003.
 [27] S. Schäfer and S. Schulz. Breeding theorem proving heuristics with genetic algorithms. In G. Gottlob, G. Sutcliffe, and A. Voronkov, editors, Global Conference on Artificial Intelligence, GCAI 2015, Tbilisi, Georgia, October 16–19, 2015, volume 36 of EPiC Series in Computing, pages 263–274. EasyChair, 2015.
 [28] S. Schulz. Learning Search Control Knowledge for Equational Theorem Proving. In F. Baader, G. Brewka, and T. Eiter, editors, Proc. of the Joint German/Austrian Conference on Artificial Intelligence (KI-2001), volume 2174 of LNAI, pages 320–334. Springer, 2001.
 [29] S. Schulz. Simple and Efficient Clause Subsumption with Feature Vector Indexing. In M. P. Bonacina and M. E. Stickel, editors, Automated Reasoning and Mathematics: Essays in Memory of William W. McCune, volume 7788 of LNAI, pages 45–67. Springer, 2013.
 [30] S. Schulz. System description: E 1.8. In K. L. McMillan, A. Middeldorp, and A. Voronkov, editors, LPAR, volume 8312 of LNCS, pages 735–743. Springer, 2013.
 [31] S. Schulz and M. Möhrmann. Performance of clause selection heuristics for saturation-based theorem proving. In N. Olivetti and A. Tiwari, editors, Proc. of the 8th IJCAR, Coimbra, volume 9706 of LNAI, pages 330–345. Springer, 2016.
 [32] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 [33] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. P. Lillicrap, K. Simonyan, and D. Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815, 2017.
 [34] K. Slind and M. Norrish. A brief overview of HOL4. In O. A. Mohamed, C. A. Muñoz, and S. Tahar, editors, Theorem Proving in Higher Order Logics, 21st International Conference, TPHOLs 2008, Montreal, Canada, August 18–21, 2008. Proceedings, volume 5170 of LNCS, pages 28–32. Springer, 2008.
 [35] J. Urban. MPTP 0.2: Design, implementation, and initial experiments. J. Autom. Reasoning, 37(1–2):21–43, 2006.
 [36] J. Urban, J. Vyskočil, and P. Štěpánek. MaLeCoP: Machine learning connection prover. In K. Brünnler and G. Metcalfe, editors, TABLEAUX, volume 6793 of LNCS, pages 263–277. Springer, 2011.
 [37] R. Veroff. Using hints to increase the effectiveness of an automated reasoning program: Case studies. Journal of Automated Reasoning, 16(3):223–239, 1996.