1 Introduction: Symbol Independent Inference Guidance
In this work, we develop two symbol-independent (anonymous) inference guiding methods for saturation-style automated theorem provers (ATPs) such as E [25] and Vampire [20]. Both methods are based on learning clause classifiers from previous proofs within the ENIGMA framework [13, 14, 5] implemented in E. By symbol-independence we mean that no information about the symbol names is used by the learned guidance. In particular, if all symbols in a particular ATP problem are consistently renamed to new symbols, the learned guidance will result in the same proof search and the same proof modulo the renaming.
Symbol-independent guidance is an important challenge for learning-guided ATP, addressed already in Schulz’s early work on learning guidance in E [23]. With ATPs being increasingly used and trained on large ITP libraries [3, 2, 16, 18, 6, 8], it is more and more rewarding to develop methods that learn to reason without relying on the particular terminology adopted in a single project. Initial experiments in this direction using concept alignment [10] methods have already shown performance improvements by transferring knowledge between the HOL libraries [9]. Structural analogies (or even terminology duplications) are however common already in a single large ITP library [17] and their automated detection can lead to new proof ideas and a number of other interesting applications [11].
This system description first briefly introduces saturation-based ATP with learned guidance (Section 2). Then we discuss symbol-independent learning and guidance using abstract features and gradient boosting trees (Section 3) and graph neural networks (Section 4). The implementation details are explained in Section 5 and the methods are evaluated on the MPTP benchmark in Section 6.
2 Saturation Proving Guided by Machine Learning
Saturation-based Automated Theorem Provers (ATPs) such as E and Vampire are used to prove goals using a set of axioms. They clausify the formulas and try to deduce a contradiction using the given clause loop [22] as follows. The ATP maintains two sets of clauses: processed and unprocessed. At each loop iteration, a given clause is selected from the unprocessed set and moved to the processed set, and the unprocessed set is extended with new inferences from the given clause and the processed clauses. This process continues until a contradiction is found, the unprocessed set becomes empty, or a resource limit is reached. The search space grows quickly and selection of the right given clauses is critical.
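The given clause loop described above can be illustrated by a minimal sketch. It is not E's or Vampire's implementation: it assumes propositional clauses encoded as frozensets of nonzero integers, uses plain binary resolution as the only inference rule, and the names `resolvents` and `given_clause_loop` are hypothetical. The `weight` argument stands for the clause selection heuristic that the rest of the paper replaces by a learned classifier.

```python
import heapq
from itertools import count


def resolvents(c1, c2):
    """Propositional resolvents of two clauses given as frozensets of nonzero ints."""
    return [frozenset((c1 - {lit}) | (c2 - {-lit})) for lit in c1 if -lit in c2]


def given_clause_loop(clauses, weight, max_iterations=1000):
    """Return True on contradiction, False on saturation, None on the iteration limit."""
    tie = count()  # tie-breaker so the heap never compares clauses directly
    seen = set(clauses)
    unprocessed = [(weight(c), next(tie), c) for c in seen]
    heapq.heapify(unprocessed)
    processed = []
    for _ in range(max_iterations):
        if not unprocessed:
            return False  # nothing left to select: the clause set is saturated
        _, _, given = heapq.heappop(unprocessed)  # select the given clause
        if not given:  # the empty clause is the contradiction
            return True
        processed.append(given)  # move it to the processed set
        for other in processed:  # infer between the given and processed clauses
            for new in resolvents(given, other):
                if new not in seen:
                    seen.add(new)
                    heapq.heappush(unprocessed, (weight(new), next(tie), new))
    return None
```

Calling `given_clause_loop(clauses, weight=len)` prefers short clauses; any other function from clauses to numbers can be plugged in the same way.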
Learning Clause Selection over a set of related problems is a general method of guiding the proof search. Given a set of FOL problems and an initial ATP strategy, we can evaluate the strategy over the problems, obtaining training samples. For each successful proof search, the training samples contain the set of clauses processed during the search. Positive clauses are those that were useful for the proof search (they appeared in the final proof), while the remaining clauses were useless, forming the negative examples. Given the samples, we can train a machine learning classifier which predicts the usefulness of clauses in future proof searches. Some clause classifiers are described in detail in Sections 3, 4, and 5.
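The extraction of positive and negative examples can be sketched as follows. This is only an illustration under assumed data structures: the `ProofSearch` record and the function `extract_samples` are hypothetical names, and clauses are represented by arbitrary hashable values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProofSearch:
    problem: str            # problem identifier
    processed: list         # clauses selected as given clauses, in order
    proof: Optional[set]    # clauses of the final proof, None if the search failed


def extract_samples(searches):
    """Split processed clauses of each successful search into positives and negatives."""
    samples = []
    for search in searches:
        if search.proof is None:
            continue  # only successful proof searches contribute samples
        positives = [c for c in search.processed if c in search.proof]
        negatives = [c for c in search.processed if c not in search.proof]
        samples.append((search.problem, positives, negatives))
    return samples
```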
ATP Guidance By a Trained Classifier:
Once a clause classifier is trained, we can use it inside an ATP. An ATP strategy is a collection of proof search parameters such as term ordering, literal selection, and also the given clause selection mechanism. In E, the given clause selection is defined by a collection of clause weight functions which alternate to select the given clauses. Our ENIGMA framework uses two methods of plugging the trained classifier into a strategy. Either (1) we use the classifier to select all given clauses (solo mode), or (2) we combine the predictions of the classifier with the clause selection mechanism of the original strategy so that only a part of the given clauses is selected by the classifier (cooperative mode). Proof search settings other than clause selection are inherited from the original strategy in both cases. See [5] for details. The phases of learning and ATP guidance can be iterated in a learning/evaluation loop [29], yielding growing sets of proofs and stronger classifiers trained over them. See [15] for such a large experiment.
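The two modes can be pictured as different round-robin schedules over weight functions. The sketch below is a simplification and not E's actual selection code: `make_selector`, `solo`, and `cooperative` are hypothetical names, the unprocessed set is a plain list, and the share of selections given to the model (`model_share`) is an assumed parameter.

```python
from itertools import cycle


def make_selector(schedule):
    """schedule: list of (times, weight_fn); the weight functions alternate round-robin."""
    order = cycle([fn for times, fn in schedule for _ in range(times)])

    def select(unprocessed):
        # Pick the clause with the smallest weight under the current weight function.
        best = min(unprocessed, key=next(order))
        unprocessed.remove(best)
        return best

    return select


def solo(model_weight):
    """Solo mode: the trained classifier selects every given clause."""
    return make_selector([(1, model_weight)])


def cooperative(model_weight, baseline_schedule, model_share=1):
    """Cooperative mode: the classifier alternates with the baseline weight functions."""
    return make_selector([(model_share, model_weight)] + list(baseline_schedule))
```

With `cooperative(model, [(1, len)])` the classifier and a length-based baseline take turns, so each selects every second given clause.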
3 Clause Classification by Decision Trees
Clause Features are used by ENIGMA to represent clauses as sparse vectors for machine learners. They are based mainly on vertical/horizontal cuts of the clause syntax tree. We use simple feature hashing to handle theories with a large number of symbols. A clause is represented by a vector whose $i$-th index stores the value of the feature with hash index $i$. Values of conflicting features (mapped to the same index) are summed. Additionally, we embed conjecture features into the clause representation and we work with vector pairs consisting of the clause feature vector and the feature vector of the current goal (conjecture). This allows us to provide goal-specific predictions. See [15] for more details.
Gradient Boosting Decision Trees (GBDTs) implemented by the XGBoost library [4] currently provide the strongest ENIGMA classifiers. Their speed is comparable to the previously used [14] weaker linear logistic classifier, implemented by the LIBLINEAR library [7]. In this work, we newly employ the LightGBM [19] GBDT implementation. A decision tree is a binary tree whose nodes contain Boolean conditions on values of different features. Given the feature vector of a clause, the decision tree can be navigated from the root to the unique tree leaf which contains the classification of the clause. GBDTs combine predictions from a collection of follow-up decision trees. While the inputs, outputs, and API of XGBoost and LightGBM are compatible, each employs a different method of tree construction. XGBoost constructs trees level-wise, while LightGBM constructs them leaf-wise. This implies that XGBoost trees are well-balanced. On the other hand, LightGBM can produce much deeper trees, and the tree depth limit is indeed an important learning meta-parameter which must be additionally set.
New Symbol-Independent Features:
We develop a feature anonymization method based on symbol arities. Each function symbol name with arity $n$ is substituted by the special name “f” followed by $n$ (for example, “f2” for a binary function symbol), while a predicate symbol name with arity $n$ is substituted by “p” followed by $n$. Such features lose the ability to distinguish different symbol names, and many features are merged together. Vector representations of two clauses that differ only by a renaming of symbols are clearly equal. Hence the underlying machine learning method will provide equal predictions for such clauses. For a more detailed discussion and comparison with related work see Appendix 0.B.
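The arity-based anonymization and the feature hashing from this section can be combined in a short sketch. Several choices here are assumptions made for illustration only: terms are nested tuples `(symbol, arg, ...)`, variables are plain strings replaced by `"*"`, only top-down symbol paths of length 3 stand in for ENIGMA's vertical features, and `zlib.crc32` is used as the hash. The function names are hypothetical.

```python
import zlib
from collections import Counter


def anonymize(term, is_predicate=False):
    """Replace every symbol name by an arity-based name; variables become '*'."""
    if isinstance(term, str):  # variables are represented as plain strings
        return "*"
    _name, *args = term
    prefix = "p" if is_predicate else "f"
    return (f"{prefix}{len(args)}", *[anonymize(a) for a in args])


def vertical_features(term, length=3):
    """Count top-down symbol paths of the given length in the syntax tree."""
    feats = Counter()

    def walk(t, path):
        symbol = t if isinstance(t, str) else t[0]
        path = (path + (symbol,))[-length:]
        if len(path) == length:
            feats[":".join(path)] += 1
        if not isinstance(t, str):
            for arg in t[1:]:
                walk(arg, path)

    walk(term, ())
    return feats


def hashed_vector(feats, size):
    """Feature hashing: values of features with the same hash index are summed."""
    vector = [0] * size
    for feature, value in feats.items():
        vector[zlib.crc32(feature.encode()) % size] += value
    return vector
```

Two literals that differ only in symbol names, such as `less(plus(X, zero), succ(Y))` and `sub(union(A, empty), power(B))`, are mapped to the same anonymized tree and therefore to the same hashed vector.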
New Statistics and Problem Features:
To improve the ability to distinguish different anonymized clauses, we add the following features. Variable statistics of a clause contain (1) the number of variables in the clause without repetitions, (2) the number of variables with repetitions, (3) the number of variables with exactly one occurrence, (4) the number of variables with more than one occurrence, and (5-10) the number of occurrences of the most/least (and second/third most/least) occurring variable. Symbol statistics do the same for symbols instead of variables. Recall that we embed conjecture features in the clause vector pair. As the pair embeds information about the conjecture but not about the problem axioms, we propose to additionally embed some statistics of the problem that the clause and the conjecture come from. We use 22 problem features that E prover already computes for each input problem to choose a suitable strategy. These are (1) number of goals, (2) number of axioms, (3) number of unit goals, etc. See E’s manual for more details. Hence we work with vector triples (clause features, conjecture features, problem features).
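A minimal sketch of the ten variable statistics follows. The input is assumed to be the list of variable occurrences of one clause; the order of items (5-10) and the padding with zeros for clauses with fewer than three variables are assumptions of this sketch, and `variable_statistics` is a hypothetical name.

```python
from collections import Counter


def variable_statistics(occurrences):
    """Ten statistics of the variable occurrences of one clause."""
    counts = sorted(Counter(occurrences).values(), reverse=True)
    ascending = counts[::-1]

    def nth(values, index):
        # Pad with zero when the clause has too few distinct variables.
        return values[index] if index < len(values) else 0

    return [
        len(counts),                          # (1) variables without repetitions
        sum(counts),                          # (2) variables with repetitions
        sum(1 for c in counts if c == 1),     # (3) variables occurring exactly once
        sum(1 for c in counts if c > 1),      # (4) variables occurring more than once
        nth(counts, 0), nth(counts, 1), nth(counts, 2),           # (5-7) most occurring
        nth(ascending, 0), nth(ascending, 1), nth(ascending, 2),  # (8-10) least occurring
    ]
```

The same function applied to a list of symbol occurrences gives the corresponding symbol statistics.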
4 Clause Classification by Graph Neural Network
Another clause classifier newly added to ENIGMA is based on graph neural networks (GNNs). We use the symbol-independent network architecture developed in [21] for premise selection. As [21] contains all the details, we only briefly explain the basic ideas behind this architecture here.
Hypergraph.
Given a set of clauses, we create a directed hypergraph with three kinds of nodes that correspond to clauses, function and predicate symbols, and unique (sub)terms and literals occurring in the clauses, respectively. There are two kinds of hyperedges that describe the relations between the nodes. The first kind encodes literal occurrences in clauses by connecting the corresponding nodes. The second hyperedge kind encodes the relations between symbol nodes and term nodes. For example, for a term $f(t_1, \dots, t_k)$ we, loosely speaking, connect the nodes of $t_1, \dots, t_k$ and the node of $f$ with the node of $f(t_1, \dots, t_k)$, and similarly for literals, where their polarity is also taken into account.
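The construction can be sketched on a toy representation. This is a simplification of the architecture from [21], not its actual encoding: clauses are assumed to be lists of `(polarity, atom)` pairs, terms are nested tuples, variables are strings local to their clause, each term gets a single hyperedge `(symbol, term, arguments)`, and the polarity is attached to the clause-literal edge. The function names are hypothetical. Symbol names serve only as dictionary keys for sharing nodes and are not part of the output.

```python
def is_ground(term):
    """A term is ground if it contains no variables (variables are strings)."""
    return not isinstance(term, str) and all(is_ground(a) for a in term[1:])


def build_hypergraph(clauses):
    """Return node counts and the two kinds of hyperedges for a list of clauses."""
    term_ids, symbol_ids = {}, {}
    clause_edges, term_edges = [], []

    def term_node(term, clause_index):
        # Ground terms are shared globally; terms with variables are clause-local.
        key = term if is_ground(term) else (clause_index, term)
        if key in term_ids:
            return term_ids[key]
        node = term_ids[key] = len(term_ids)
        if not isinstance(term, str):
            symbol = symbol_ids.setdefault(term[0], len(symbol_ids))
            args = tuple(term_node(a, clause_index) for a in term[1:])
            term_edges.append((symbol, node, args))  # symbol, term, arguments
        return node

    for index, clause in enumerate(clauses):
        for positive, atom in clause:
            clause_edges.append((index, term_node(atom, index), positive))

    return {
        "clauses": len(clauses),
        "symbols": len(symbol_ids),
        "terms": len(term_ids),
        "clause_edges": clause_edges,
        "term_edges": term_edges,
    }
```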
Message-passing.
The hypergraph describes the relations between the various kinds of objects occurring in the clauses. Every node in the hypergraph is initially assigned a constant vector, called the embedding, based only on its kind (clause, symbol, or term/literal). These node embeddings are updated in a fixed number of message-passing rounds, based on the embeddings of each node’s neighbors. The underlying idea of such neural message-passing methods is to make the node embeddings encode more and more precisely the information about the connections (and thus various properties) of the nodes. (Graph convolutions are a generalization of the sliding window convolutions used for aggregating neighborhood information in neural networks used for image recognition.) For this to work, we have to learn initial embeddings for our three kinds of nodes and the update function. We learn individual components, which correspond to different kinds of hyperedges, from which the update function is efficiently constructed.
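The idea of message-passing rounds can be shown with a deliberately simplified sketch. It replaces the learned update function by a fixed averaging of neighbor embeddings and treats the hypergraph as an ordinary graph, so it illustrates only the flow of information, not the network from [21]. The function name and the `mix` parameter are assumptions.

```python
def message_passing(embeddings, neighbors, rounds, mix=0.5):
    """Update node embeddings for a fixed number of rounds.

    embeddings: list of vectors (one per node), initially determined by node kind.
    neighbors:  list of neighbor index lists (one per node).
    """
    current = [list(vector) for vector in embeddings]
    for _ in range(rounds):
        updated = []
        for node, vector in enumerate(current):
            nbrs = neighbors[node]
            if nbrs:
                mean = [sum(current[n][i] for n in nbrs) / len(nbrs)
                        for i in range(len(vector))]
            else:
                mean = vector
            # Fixed convex combination instead of a learned update function.
            updated.append([(1 - mix) * a + mix * b for a, b in zip(vector, mean)])
        current = updated
    return current
```

Since the initial embeddings depend only on the node kind, two nodes of the same kind with the same neighborhood structure always end up with the same embedding, which is the source of symbol independence.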
Classification.
After the message-passing phase, the final clause embeddings are available in the corresponding clause nodes. The estimated probability of a clause being a good given clause is then computed by a neural network that takes the final embedding of this clause and also the aggregated final embeddings of all clauses obtained from the negated conjecture.
5 Learning and Using the Classifiers, Implementation
In order to use either GBDTs (Section 3) or GNNs (Section 4), a prediction model must be learned. Learning starts with training samples, that is, a set of pairs of positive and negative clauses. For each training sample, we additionally know the source problem and its conjecture. Hence we can consider one sample as a quadruple (positive clauses, negative clauses, problem, conjecture) for convenience.
GBDT.
Given a training sample, each of its clauses is translated to the feature vector (Section 3). Vectors of positive clauses are labeled as positive, and the others as negative. All the labeled vectors are fed together to a GBDT trainer, yielding a model.
When predicting a generated clause, its feature vector is computed and the model is asked for the prediction. GBDT’s binary predictions (positive/negative) are turned into E’s clause weights (positives receive a lower weight than negatives, as E selects clauses with smaller weights first).
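How a binary GBDT prediction becomes a clause weight can be sketched without any library. The tree format below (nested dictionaries with `feature`, `threshold`, `left`, `right`, and `leaf` keys) is a hypothetical stand-in for the XGBoost and LightGBM model formats, the decision threshold of zero on the summed leaf scores is a simplification, and the two weight values are parameters of the sketch rather than the values used in ENIGMA.

```python
def tree_predict(node, vector):
    """Navigate one decision tree from the root to a leaf and return the leaf score."""
    while "leaf" not in node:
        go_left = vector[node["feature"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]


def is_positive(trees, vector):
    """Combine the trees by summing their leaf scores; a positive sum means 'useful'."""
    return sum(tree_predict(tree, vector) for tree in trees) > 0.0


def clause_weight(trees, vector, positive_weight, negative_weight):
    """Turn the binary prediction into a clause weight (smaller is selected earlier)."""
    return positive_weight if is_positive(trees, vector) else negative_weight
```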
GNN.
Given a training sample as above, we construct a hypergraph for the set of its positive, negative, and conjecture clauses. This hypergraph is translated to a tensor representation (vectors and matrices), marking clause nodes as positive, negative, or goal. These tensors are fed as input to our GNN training, yielding a GNN model. The training works in iterations, and the resulting model contains one GNN per epoch. Only one GNN from a selected epoch is used for predictions during the evaluation.
In evaluation, it is more efficient to compute predictions for several clauses at once. This also improves prediction quality as the queried data resembles more the training hypergraphs where multiple clauses are encoded at once as well. During an ATP run on a problem with a given conjecture, we postpone the evaluation of newly inferred clauses until we reach a certain number of clauses to query (the query size). We may evaluate fewer clauses if E runs out of unevaluated unprocessed clauses. To resemble the training data even more, we add a fixed number of the given clauses processed so far. We call these context clauses. To evaluate the query clauses, we construct the hypergraph for the query clauses, the context clauses, and the conjecture clauses, and mark the conjecture clauses as goals. Then the model is asked for predictions on the query clauses (predictions for the context clauses are dropped). The numeric predictions computed by the model are directly used as E’s weights.
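The postponed, batched evaluation with context clauses can be sketched as a small buffer. This is an illustration under assumptions, not the code integrated in E: `QueryBuffer` is a hypothetical name, the model is any callable taking a batch of clauses and the goal clauses and returning one number per batch clause, and the context is assumed to consist of the earliest processed clauses.

```python
class QueryBuffer:
    """Collect newly inferred clauses and evaluate them in batches with context."""

    def __init__(self, model, conjecture, query_size, context_size):
        self.model = model
        self.conjecture = list(conjecture)
        self.query_size = query_size
        self.context_size = context_size
        self.pending = []

    def push(self, clause, processed):
        """Postpone evaluation until enough query clauses are collected."""
        self.pending.append(clause)
        if len(self.pending) >= self.query_size:
            return self.flush(processed)
        return {}

    def flush(self, processed):
        """Evaluate all pending clauses, possibly fewer than the query size."""
        if not self.pending:
            return {}
        queries, self.pending = self.pending, []
        # Assumption: the context is formed by the earliest processed clauses.
        context = list(processed[:self.context_size])
        scores = self.model(queries + context, self.conjecture)
        # Predictions for the context clauses are dropped.
        return dict(zip(queries, scores[:len(queries)]))
```

`push` returns an empty dictionary while clauses are being collected and a clause-to-weight mapping once a batch was evaluated; `flush` covers the case where the prover runs out of unevaluated clauses.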
Implementation & Performance.
We use GBDTs implemented by the XGBoost [4] and LightGBM [19] libraries. For GNN we use TensorFlow [1]. All the libraries provide Python interfaces and C/C++ APIs. We use the Python interfaces for training and the C APIs for the evaluation in E. The Python interfaces for XGBoost and LightGBM include the C APIs, while for TensorFlow the C API must be manually compiled, which is further complicated by poor documentation.
The libraries support training both on CPUs and on GPUs. We train LightGBM on CPUs, and XGBoost and TensorFlow on GPUs. However, we always evaluate on a single CPU as we aim at practical usability on standard hardware. This is non-trivial and it distinguishes this work from evaluations done with large numbers of GPUs or TPUs and/or in prohibitively high real times. The LightGBM training can be parallelized much better: with 60 CPUs it is much faster than XGBoost on 4 GPUs. Neither using GPUs for LightGBM nor many CPUs for XGBoost provided better training times. The GNN training is slower than GBDT training and it is not easy to make TensorFlow evaluate reasonably on a single CPU. It has to be compiled with all CPU optimizations and restricted to a single thread, using TensorFlow’s poorly documented experimental C API.
6 Experimental Evaluation
Setup.
We experimentally evaluate our GBDT and GNN guidance (available at https://github.com/ai4reason/eproverdata/tree/master/IJCAR20) on a large benchmark of Mizar40 [18] problems (http://grid01.ciirc.cvut.cz/~mptp/1147/MPTP2/problems_small_consist.tar.gz) exported by MPTP [28]. Hence this evaluation is compatible with our previous symbol-dependent work [15]. We evaluate GBDT and GNN separately. We start with a good-performing E strategy (see [5, Appendix A]), which we run with a fixed time limit per problem. The solved problems give us the initial training data (see Section 5), and we start three iterations of the learning/evaluation loop (see Section 2). The experiments were run on a server with 36 hyperthreading Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz cores, 755 GB of memory, and 4 NVIDIA GeForce GTX 1080 Ti GPUs.
For GBDT, we train several models and conduct a small grid search over the learning meta-parameters. For XGBoost, we try different tree depths, and for LightGBM various combinations of tree depths and leaf counts. We evaluate all these models in the cooperative mode on a random (but fixed) tenth of all problems (Appendix 0.A). The best performing model is evaluated on the whole benchmark in both cooperative and solo runs. These runs give us the training samples for the next iteration. We perform three iterations and obtain three GBDT models, one per iteration.
For GNN, we train a model with 100 epochs, obtaining 100 different GNNs (one per epoch). We evaluate GNNs from selected epochs and we try different settings of the query and context sizes (see Section 5). All combinations of epoch, query size, and context size are again evaluated in a grid search on the small benchmark subset (Appendix 0.A), and the best performing model is selected for the next iteration. We run three iterations and obtain three GNN models, one per iteration.
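model               | TPR [%] | TNR [%] | train size | train time | params         | real time gain [%] | abstract time gain [%]
baseline strategy   | -       | -       | -          | -          | -              | 0.0                | 0.0
GBDT, iteration 1   | 84.9    | 68.4    | 14M        | 2h29m      | X,d12          | 38.1               | 67.8
GBDT, iteration 2   | 79.0    | 79.5    | 29M        | 4h33m      | X,d12          | 58.2               | 94.4
GBDT, iteration 3   | 80.5    | 79.2    | 47M        | 40m        | L,d30,l1800    | 62.7               | 112.2
GNN, iteration 1    | 92.1    | 77.1    | 14M        | 17h        | e20,q128,c512  | 39.7               | 84.9
GNN, iteration 2    | 90.0    | 78.6    | 31M        | 1d19h      | e10,q128,c512  | 54.7               | 103.5
GNN, iteration 3    | 91.3    | 79.6    | 50M        | 1d8h       | e50,q256,c768  | 55.4               | 107.6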
Results are presented in Table 1. For each GBDT and GNN model we show (1) true positive/negative rates, (2) training data sizes, (3) train times, and (4) the best performing parameters from the grid search. Furthermore, for each model we show the performance of the guided strategy in (5) real and (6) abstract time. Details follow. (1) Model accuracies are computed on samples extracted from problems newly solved by each model, that is, on testing data not known during the training. Columns TPR/TNR show accuracies on positive/negative testing samples. (2) Train sizes measure the training data in millions of clauses. (4) Letter “X” stands for XGBoost models, while “L” stands for LightGBM. (5) For real time we use a fixed time limit per problem, and (6) in abstract time we limit the number of generated clauses. We show the number of problems solved and the gain (in %) on the baseline strategy. The abstract time evaluation is useful to assess the methods modulo the speed of the implementation. The first row shows the performance of the baseline strategy without learning.
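The accuracy and gain columns follow the usual definitions, which can be written down as a short sketch. The function names are hypothetical, labels and predictions are assumed to be sequences of truth values with at least one positive and one negative label, and the example numbers in the usage note are made up for illustration.

```python
def rates(labels, predictions):
    """True positive and true negative rates in percent."""
    pairs = list(zip(labels, predictions))
    positives = [p for y, p in pairs if y]
    negatives = [p for y, p in pairs if not y]
    tpr = 100.0 * sum(1 for p in positives if p) / len(positives)
    tnr = 100.0 * sum(1 for p in negatives if not p) / len(negatives)
    return tpr, tnr


def gain(solved, baseline_solved):
    """Relative gain in percent over the number of problems solved by the baseline."""
    return 100.0 * (solved - baseline_solved) / baseline_solved
```

For instance, a model that recognizes 3 of 4 positive and 1 of 2 negative testing clauses has a TPR of 75% and a TNR of 50%, and a strategy solving 150 problems where the baseline solves 100 has a gain of 50%.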
Evaluation.
The GNN models start better, but the GBDT models catch up and beat GNN in later iterations. The GBDT models show a significant gain even in the 3rd iteration, while the GNN models start stagnating. The GNN models report better testing accuracy, but their ATP performance is not as good.
For GBDTs, we see that the best models of the first two iterations were produced by XGBoost, while the best model of the third iteration was produced by LightGBM. While both libraries can provide similar results, LightGBM is significantly faster. For comparison, the training time for XGBoost in the third iteration was 7 hours, that is, LightGBM (40 minutes in Table 1) is about 10 times faster. The higher speed of LightGBM can overcome the problems with more complicated parameter settings, as more models can be trained and evaluated.
For GNNs, we observe higher training times and better models coming from earlier epochs. The training in the 1st and 2nd iterations was done on 1 GPU, while in the 3rd on 4 GPUs. The good abstract time performance indicates that further gain could be obtained by a faster implementation. But note that this is the first time that NNs have been made comparable to GBDTs in real time.
Figure 1 summarizes the results. On the left, we observe a slower start for GNNs caused by the initial model loading. On the right, we see a decrease in the number of processed clauses, which suggests that the guidance is effective.
7 Conclusion
We have developed and evaluated symbol-independent GBDT and GNN ATP guidance. This is the first time symbol-independent features and GNNs are tightly integrated with E and provide good real-time results on a large corpus.
Both the GBDT and GNN predictors display high ability to learn from previous proof searches even in the symbol-independent setting. To provide competitive real-time performance of the GNNs, we have developed context-based evaluation of batches of generated clauses in E. The new GBDTs show even better performance than their symbol-dependent versions from our previous work [15]. This is most likely because of the parameter grid search and new features not used before. The union of problems solved by the 12 ENIGMA strategies (both solo and cooperative) in real time adds up to . When we add the baseline strategy to this portfolio we solve problems. This shows that the ENIGMA strategies learned quite well from the baseline strategy, not losing many solutions. Vampire in 300 seconds solves problems. Future work includes joint evaluation on several ITP libraries, similar to [9].
8 Acknowledgments
We thank Stephan Schulz and Thibault Gauthier for discussing with us their methods for symbol-independent term and formula matching.
References
 [1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
 [2] Jasmin Christian Blanchette, David Greenaway, Cezary Kaliszyk, Daniel Kühlwein, and Josef Urban. A learning-based fact selector for Isabelle/HOL. J. Autom. Reasoning, 57(3):219–244, 2016.
 [3] Jasmin Christian Blanchette, Cezary Kaliszyk, Lawrence C. Paulson, and Josef Urban. Hammering towards QED. J. Formalized Reasoning, 9(1):101–148, 2016.
 [4] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 785–794, New York, NY, USA, 2016. ACM.
 [5] Karel Chvalovský, Jan Jakubuv, Martin Suda, and Josef Urban. ENIGMA-NG: efficient neural and gradient-boosted inference guidance for E. In Pascal Fontaine, editor, Automated Deduction - CADE 27 - 27th International Conference on Automated Deduction, Natal, Brazil, August 27-30, 2019, Proceedings, volume 11716 of Lecture Notes in Computer Science, pages 197–215. Springer, 2019.
 [6] Lukasz Czajka and Cezary Kaliszyk. Hammer for Coq: Automation for dependent type theory. J. Autom. Reasoning, 61(1-4):423–453, 2018.
 [7] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res., 9:1871–1874, June 2008.
 [8] Thibault Gauthier and Cezary Kaliszyk. Premise selection and external provers for HOL4. In Xavier Leroy and Alwen Tiu, editors, Proceedings of the 2015 Conference on Certified Programs and Proofs, CPP 2015, Mumbai, India, January 15-17, 2015, pages 49–57. ACM, 2015.
 [9] Thibault Gauthier and Cezary Kaliszyk. Sharing HOL4 and HOL Light proof knowledge. In Martin Davis, Ansgar Fehnker, Annabelle McIver, and Andrei Voronkov, editors, Logic for Programming, Artificial Intelligence, and Reasoning - 20th International Conference, LPAR-20 2015, Suva, Fiji, November 24-28, 2015, Proceedings, volume 9450 of Lecture Notes in Computer Science, pages 372–386. Springer, 2015.
 [10] Thibault Gauthier and Cezary Kaliszyk. Aligning concepts across proof assistant libraries. J. Symb. Comput., 90:89–123, 2019.
 [11] Thibault Gauthier, Cezary Kaliszyk, and Josef Urban. Initial experiments with statistical conjecturing over large formal corpora. In Andrea Kohlhase, Paul Libbrecht, Bruce R. Miller, Adam Naumowicz, Walther Neuper, Pedro Quaresma, Frank Wm. Tompa, and Martin Suda, editors, Joint Proceedings of the FM4M, MathUI, and ThEdu Workshops, Doctoral Program, and Work in Progress at the Conference on Intelligent Computer Mathematics 2016 co-located with the 9th Conference on Intelligent Computer Mathematics (CICM 2016), Bialystok, Poland, July 25-29, 2016, volume 1785 of CEUR Workshop Proceedings, pages 219–228. CEUR-WS.org, 2016.
 [12] Zarathustra Goertzel, Jan Jakubův, and Josef Urban. ENIGMAWatch: ProofWatch meets ENIGMA. In Serenella Cerrito and Andrei Popescu, editors, Automated Reasoning with Analytic Tableaux and Related Methods, pages 374–388, Cham, 2019. Springer International Publishing.
 [13] Jan Jakubuv and Josef Urban. ENIGMA: efficient learning-based inference guiding machine. In Herman Geuvers, Matthew England, Osman Hasan, Florian Rabe, and Olaf Teschke, editors, Intelligent Computer Mathematics - 10th International Conference, CICM 2017, Edinburgh, UK, July 17-21, 2017, Proceedings, volume 10383 of Lecture Notes in Computer Science, pages 292–302. Springer, 2017.
 [14] Jan Jakubuv and Josef Urban. Enhancing ENIGMA given clause guidance. In Florian Rabe, William M. Farmer, Grant O. Passmore, and Abdou Youssef, editors, Intelligent Computer Mathematics - 11th International Conference, CICM 2018, Hagenberg, Austria, August 13-17, 2018, Proceedings, volume 11006 of Lecture Notes in Computer Science, pages 118–124. Springer, 2018.
 [15] Jan Jakubuv and Josef Urban. Hammering Mizar by learning clause guidance. In John Harrison, John O’Leary, and Andrew Tolmach, editors, 10th International Conference on Interactive Theorem Proving, ITP 2019, September 9-12, 2019, Portland, OR, USA, volume 141 of LIPIcs, pages 34:1–34:8. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019.
 [16] Cezary Kaliszyk and Josef Urban. Learning-assisted automated reasoning with Flyspeck. J. Autom. Reasoning, 53(2):173–213, 2014.
 [17] Cezary Kaliszyk and Josef Urban. HOL(y)Hammer: Online ATP service for HOL Light. Mathematics in Computer Science, 9(1):5–22, 2015.
 [18] Cezary Kaliszyk and Josef Urban. MizAR 40 for Mizar 40. J. Autom. Reasoning, 55(3):245–256, 2015.
 [19] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In NIPS, pages 3146–3154, 2017.
 [20] Laura Kovács and Andrei Voronkov. First-order theorem proving and Vampire. In Natasha Sharygina and Helmut Veith, editors, CAV, volume 8044 of LNCS, pages 1–35. Springer, 2013.
 [21] Miroslav Olsák, Cezary Kaliszyk, and Josef Urban. Property invariant embedding for automated reasoning. CoRR, abs/1911.12073, 2019.
 [22] Ross A. Overbeek. A new class of automated theorem-proving algorithms. J. ACM, 21(2):191–200, April 1974.
 [23] Stephan Schulz. Learning search control knowledge for equational deduction, volume 230 of DISKI. Infix Akademische Verlagsgesellschaft, 2000.
 [24] Stephan Schulz. Learning search control knowledge for equational theorem proving. In Franz Baader, Gerhard Brewka, and Thomas Eiter, editors, KI 2001: Advances in Artificial Intelligence, Joint German/Austrian Conference on AI, Vienna, Austria, September 19-21, 2001, Proceedings, volume 2174 of Lecture Notes in Computer Science, pages 320–334. Springer, 2001.
 [25] Stephan Schulz. E - A Brainiac Theorem Prover. AI Commun., 15(2-3):111–126, 2002.
 [26] Stephan Schulz. Fingerprint Indexing for Paramodulation and Rewriting. In Bernhard Gramlich, Ulrike Sattler, and Dale Miller, editors, Proc. of the 6th IJCAR, Manchester, volume 7364 of LNAI, pages 477–483. Springer, 2012.
 [27] Stephan Schulz. Simple and efficient clause subsumption with feature vector indexing. In Automated Reasoning and Mathematics, volume 7788 of Lecture Notes in Computer Science, pages 45–67. Springer, 2013.
 [28] Josef Urban. MPTP 0.2: Design, implementation, and initial experiments. J. Autom. Reasoning, 37(1-2):21–43, 2006.
 [29] Josef Urban, Geoff Sutcliffe, Petr Pudlák, and Jiří Vyskočil. MaLARea SG1 - Machine Learner for Automated Reasoning with Semantic Guidance. In Alessandro Armando, Peter Baumgartner, and Gilles Dowek, editors, IJCAR, volume 5195 of LNCS, pages 441–456. Springer, 2008.
 [30] Robert Veroff. Using hints to increase the effectiveness of an automated reasoning program: Case studies. J. Autom. Reasoning, 16(3):223–239, 1996.
Appendix 0.A Additional Data From the Experiments
This appendix presents additional data from the experiments in Section 6. Figure 3 shows the results of the grid search for GNN models on one tenth of all benchmark problems, done in order to find the best-performing parameters for query and context sizes. The x axis plots the query size, the y axis plots the context size, while the z axis plots the ATP performance, that is, the number of solved problems. Recall that the grid search was performed on a randomly selected but fixed tenth of all benchmark problems with a real-time limit per problem. For two of the three GNN models, there is a separate graph for each iteration, showing only the best epochs. For the remaining one, there are two graphs for models from epoch 20 and 50. Note how the later epoch 50 becomes more independent of the context size. The ranges of the grid search parameters were extended in later iterations when the best-performing value was at the graph edge.
Figure 4 shows the grid search results for the best LightGBM GBDT models from the three iterations. The x axis plots the number of tree leaves, the y axis plots the tree depth, while the z axis plots the number of solved problems. There are two models from the second iteration, showing the effect of different learning rates. Again, the ranges of meta-parameters were updated between the iterations by a human engineer.
Figure 5 shows the training accuracies and training loss for the LightGBM model. Accuracies (TPR and TNR) on the training data are computed from the first iteration. The values for loss are inverted so that higher values correspond to better models, which makes a visual comparison easier. We can see a clear correlation between the accuracies and the loss, but not so clear a correlation with the ATP performance. The ATP performance of the model is the same as in Figure 4, repeated here for convenience.
Figure 2 compares the lengths of the discovered proofs. We can see that there is no systematic difference in this metric between the base strategy and the ENIGMA ones.
Appendix 0.B Discussion of Anonymization
Our use of symbol-independent arity-based features for GBDTs differs from Schulz’s anonymous clause patterns [24, 23] (CPs) used in E for proof guidance and from Gauthier and Kaliszyk’s (GK) anonymous abstractions used for their concept alignments between ITP libraries [10] in two ways:
(1) In both CP and GK, serial (de Bruijn-style) numbering of abstracted symbols of the same arity is used, so that different symbols of the same arity receive different abstract names within a term. Our encoding uses only the arity. It is even more lossy, because terms that differ only in which symbols of a given arity they use receive the same encoding.
(2) ENIGMA with gradient boosting decision trees (GBDTs) can be (approximately) thought of as implementing weighted feature-based clause classification where the feature weights are learned, whereas in both CP and GK, exact matching is used after the abstraction is done. (We thank Stephan Schulz for pointing out that although CPs used exact matching by default, matching up to a certain depth was also implemented.)
In CP, this is used for hint-style guidance of E. There, for clauses, such serial numbering however isn’t stable under literal reordering and subsumption. Partial heuristics can be used, such as normalization based on a fixed global ordering done in both CP and GK.
Addressing the latter issue (stability under reordering of literals and subsumption) leads to the NP-hardness of (hint) matching/subsumption. I.e., the abstracted subsumption task can be encoded as standard first-order subsumption for clauses where terms are encoded in applicative form, with the abstracted symbols acting as variables. The NP-hardness of subsumption is however here more serious in practice than in standard ATP because only applications behave as non-variable symbols during the matching.
Thus, the difference between our anonymous approach and CP is practically the same as between the standard symbol-based ENIGMA guidance and standard hint-based [30] guidance. In the former the matching (actually, clause classification) is approximate, weighted and learned, while with hints the clause matching/classification is crisp, logic-rooted and preprogrammed, sometimes running into the NP-hardness issues. Our latest comparison [12] done over the Mizar/MPTP corpus in the symbol-based setting showed better performance of ENIGMA over using hints, most likely due to better generalization behavior of ENIGMA based on the statistical (GBDT) learning.
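The difference between serial numbering and the arity-based encoding discussed in this appendix can be made concrete with a small sketch. It is an illustration only and does not reproduce the exact abstraction formats of CP or GK: ground terms are assumed to be nested tuples, symbols are numbered in order of their first occurrence, and the abstract names `F<arity>_<index>` and `f<arity>` as well as the example terms are hypothetical.

```python
def arity_abstraction(term):
    """Arity-based encoding: every symbol is replaced by a name built from its arity."""
    _name, *args = term
    return (f"f{len(args)}", *map(arity_abstraction, args))


def serial_abstraction(term, table=None):
    """Serial numbering: symbols of the same arity are numbered by first occurrence."""
    table = {} if table is None else table
    name, *args = term
    arity = len(args)
    key = (name, arity)
    if key not in table:
        index = sum(1 for (_, a) in table if a == arity) + 1
        table[key] = f"F{arity}_{index}"
    return (table[key], *[serial_abstraction(a, table) for a in args])
```

Both abstractions are invariant under renaming of symbols, but only serial numbering distinguishes `h(g(a))` from `h(h(a))`; the arity-based encoding maps both to `f1(f1(f0))`.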