Make E Smart Again

04/19/2020 ∙ by Zarathustra Amadeus Goertzel, et al. ∙ 0

In this work in progress, we demonstrate a new use-case for the ENIGMA system. The ENIGMA system using the XGBoost implementation of gradient boosted decision trees has demonstrated high capability to learn to guide the E theorem prover's inferences in real-time. Here, we strip E to the bare bones: we replace the KBO term ordering with an identity relation as the minimal possible ordering, disable literal selection, and replace evolved strategies with a simple combination of the clause weight and FIFO (first in first out) clause evaluation functions. We experimentally demonstrate that ENIGMA can learn to guide E as well as the smart, evolved strategies even without these standard automated theorem prover functionalities. To this end, we experiment with XGBoost's meta-parameters over a dozen loops.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction: Making E Stupid and Then Smart Again

State-of-the-art saturation-based automated theorem provers (ATPs) for first-order logic (FOL), such as E and Vampire [12], employ the given clause algorithm [13], translating the input FOL problem (background theory and negated conjecture) into a refutationally equivalent set of clauses. The search for a contradiction is performed maintaining sets of processed () and unprocessed () clauses (the proof state ). The algorithm repeatedly selects a given clause from , moves to , and extends with all clauses inferred with and . This process continues until a contradiction is found, becomes empty, or a resource limit is reached.

Historically, term ordering, together with literal selection, is used to guarantee the completeness of the proof search [1] and to “tame the growth of the search space and help steer proof search” [5]. Term ordering ensures that rewriting happens in only one direction, toward smaller terms. Literal selection limits the inferences done with each given clause to the selected literals, which slows down the growth of the search space and reduces redundant inferences.

E includes a strategy language of clause evaluation functions

, made up of weight and priority functions, to heuristically guide the proof search. In this work, I use two algorithmically invented 

[6, 7] strategies, E1 and E2111Strategies E1 and E2 are displayed in the appendix, that use many sophisticated clause evaluation functions, the Knuth-Bendix ordering (KBO6), literal selection, and other E heuristics.

The ENIGMA [8, 9, 10, 4] system with the XGBoost [2] implementation of gradient boosted decision trees has recently demonstrated high capability to learn to guide the E [14] theorem prover’s inferences in real-time. ENIGMA uses the XGBoost model as a clause evaluation function to recommend clauses for selection based on clause and conjecture features. In particular, after several proving and learning iterations, its performance on the problems from the Mizar40 [11] benchmark improved by ([10] over the strategy E1 used for the initial proving phase.

In this work, E is stripped to the bare bones by disabling term ordering and literal selection. KBO6 is replaced with an identity relation as the minimal possible ordering (called IDEN – an addition to E 222The E version used in this paper can be found at, and the library for running ENIGMA with E can be found at While this frees E to do inferences in any order, E can no longer perform rewriting inferences. The strategy E1 is replaced with the simple combination of the clause weight and FIFO (first in first out) evaluation functions. E is thus practically reduced to a basic superposition prover, without advanced heuristics, rewriting, or completeness guarantees. We call this strategy E0:

--definitional-cnf=24 --prefer-initial-clauses -tIDEN
--restrict-literal-comparisons -WNoSelection

E0 solves only 3872 of the Mizar40 problems in

seconds compared to 14526 for E1. The first research question is the extent to which ENIGMA with this basic prover can learn ATP guidance completely on its own. The second is to what extent ENIGMA’s learning can be boosted with data from strong strategies and models. That is, I explore how smart machine learning can become in this

zero-strategy setting. The more general related question is to what extent can machine learning replace the sophisticated human-invented theorem-proving body of wisdom used in today’s ATPs for restricting advanced proof calculi.

2 Experiments

We evaluate ENIGMA with the basic strategy, E0, in several scenarios and over two datasets of different sizes. All experiments are run with seconds per problem333As a rule of thumb, E solves most problems within a few seconds or not for a very long time. 444All the experiments are run on the same hardware unless otherwise specified: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz with 188GB RAM..

ENIGMA has so far been used in two ways: coop combines the learned model with some standard E strategy equally () while solo only uses the learned model for choosing the given clauses. The best results have been achieved by MaLARea-style [16] looping: that is, an ENIGMA model is trained and run with E (loop 0), then the resulting data are added to the initial training data and a new ENIGMA model is trained (loop 1).

In this work, ENIGMA trains with both solo and coop data. I present results from solo runs because they represent the most minimal setting.

2.1 Small Data (2000 problems)

The E evaluations and XGBoost training can take a long time on the full Mizar40 dataset, so 2000 randomly sampled problems are used to test meta-parameters on. Each XGBoost model consists of decision trees of depth , the most important training meta-parameters in addition to the learning rate (). In previous work with ENIGMA, and were fixed for all loops of learning. Here we try to vary the values of and during 16 loops. Let denote the experiment with specific and . Of the many protocols tested, the following are included in the plot of solved problems (above): Fives (), Nines (), Thirteens (), Sixteens ().

We also experiment with adaptively setting the meta-parameters as the number of training examples increases according to the following protocols:

  • Inc () increases by 2 from 3 to 33 and keeps fixed.

  • 32_inc () fixes and gradually increases from 50 to 250.

  • Inc2 () gradually decreases from 150 to 50, varying the value intuitively 555Precise details of intuitively set parameters can be seen in the appendix..

  • Inc3 () aims to be more systematic and steps from 50 to 250

  • Dec3() decreases from 250 to 50.

At the 16th loop Inc’s performance is best, solving 299 problems, doubling the performance of E0 (152). However Inc2 and Inc3 solve 298 problems and 32_inc solves 291 problems. The conclusion is that simple protocols work well so long as or is incremented adaptively rather than fixed.

2.2 Big Data (57880 problems)

These experiments are done on the large benchmark of Mizar40 [11] problems from the MPTP dataset [15]. E1 and E2 are two strong E strategies that solve 14526 and 12788 problems.

  • Experiment 1 is done with and and uses our previously trained model that allowed us to solve problems when cooperating with E1 in our previous experiments [10]. This strong model, which hashes the features into () buckets [3, Sec. 3.4], is used with E0 now.

  • Experiment 2’s parameters were intuitively toggled during the looping as in Inc3, and a feature size of is used. Exp. 2 uses training data from E1 and E2 for additional guidance up to the 4th loop (and then stops including them in the training data based on the assumption they may confuse learning).

  • Experiment 3 sets and according to protocol Inc3. Exp. 3 only learns from E run with E0 and trains on the GPU, which requires the feature size to be reduced to .

  • Experiment 4 mimics Exp. 3 but uses E1 and E2 data for training (up to the 4th loop).

  • Experiment 5

    further tests boosting with data from an E0 ENIGMA model that proved 9759 problems and an E1 that proved 21542 problems. Tree depth is intuitively varied among 32, 512, and 1000, the number of trees is varied among 2, 100, 200, and 32. The feature vector size starts at

    and is decreased to allow the data to fit on the RAM, down to 32 ().

As seen in the figure, the strong model does not help much in guiding E without ordering or selection in Exp. 1. Exp. 2 learns gradually and catches up with Exp. 1, but seems to plateau around 10,000. Surprisingly the pure Exp. 3 learns fast with the small feature size, but plateaus and drops in performance (perhaps due to overfitting). Exp. 4 indicates that guidance is useful and surpasses E2 with 13805 in round 13. Exp. 5 solves 15990 problems, showing that ENIGMA can take E0 beyond the smart strategies with appropriate parameters and boosting. This is a great improvement over the 3872 problems solved by E0.

3 Conclusion

ENIGMA can learn to guide the E prover effectively even without smart strategies and term orderings. The models confer a increase over the naive E0 after 13 rounds of the proving/learning loop, and even trained without guidance data, a increase.

The experiments indicate that machine learning can be used to fully control an ATP’s guidance, learning to replace orderings, heuristic strategies, and deal with the increase in generated clauses without literal selection. However the combination of ENIGMA and standard ATP heuristics still significantly out-performs ENIGMA alone.

Given the large symmetry-breaking impact of these methods in classical ATP, future work includes, e.g., training the guidance in such a way that redundant (symmetric) inferences are not done by the trained model once it has committed to a certain path. This probably means equipping the learning with more history and knowledge of the proof state in the saturation-style setting. ENIGMAWatch 

[4] may aid with symmetry breaking by focusing the proof search on particular proof paths. Additional work is needed to isolate the factors in Exp. 5’s performance, and determine the most effective boosting methods in addition to increasing and with training loops. Ablation studies should be done to discover the impact of term ordering and literal selection individually on E and ENIGMA’s performance. Perhaps term ordering alone is sufficient to train good ENIGMA models.

Running ENIGMA without term ordering and other restrictions is important because it may allow us to combine training data from different strategies, and it may allow ENIGMA to find novel proofs.

4 Acknowledgments

The research topic was proposed by Jan Jakubuv and Josef Urban, and further discussed with them, Martin Suda, and Thomas Tan. I also thank the AITP’20 anonymous referees for their comments on the first extended abstract of this work.


  • [1] Leo Bachmair and Harald Ganzinger. Rewrite-Based Equational Theorem Proving with Selection and Simplification. Journal of Logic and Computation, 3(4):217–247, 1994.
  • [2] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 785–794, New York, NY, USA, 2016. ACM.
  • [3] Karel Chvalovský, Jan Jakubuv, Martin Suda, and Josef Urban. ENIGMA-NG: efficient neural and gradient-boosted inference guidance for E. CoRR, abs/1903.03182, 2019.
  • [4] Zarathustra Goertzel, Jan Jakubuv, and Josef Urban. Enigmawatch: Proofwatch meets ENIGMA. In Serenella Cerrito and Andrei Popescu, editors, Automated Reasoning with Analytic Tableaux and Related Methods - 28th International Conference, TABLEAUX 2019, London, UK, September 3-5, 2019, Proceedings, volume 11714 of Lecture Notes in Computer Science, pages 374–388. Springer, 2019.
  • [5] Kryštof Hoder, Giles Reger, Martin Suda, and Andrei Voronkov. Selecting the selection. In Nicola Olivetti and Ashish Tiwari, editors, Proc. of the 8th IJCAR, Coimbra, volume 9706 of LNAI, pages 313–329. Springer, 2016.
  • [6] Jan Jakubův and Josef Urban. Hierarchical invention of theorem proving strategies. AI Commun., 31(3):237–250, 2018.
  • [7] Jan Jakubuv and Josef Urban. Extending E prover with similarity based clause selection strategies. In Michael Kohlhase, Moa Johansson, Bruce R. Miller, Leonardo de Moura, and Frank Wm. Tompa, editors, Intelligent Computer Mathematics - 9th International Conference, CICM 2016, Bialystok, Poland, July 25-29, 2016, Proceedings, volume 9791 of Lecture Notes in Computer Science, pages 151–156. Springer, 2016.
  • [8] Jan Jakubuv and Josef Urban. ENIGMA: efficient learning-based inference guiding machine. In Herman Geuvers, Matthew England, Osman Hasan, Florian Rabe, and Olaf Teschke, editors, Intelligent Computer Mathematics - 10th International Conference, CICM 2017, Edinburgh, UK, July 17-21, 2017, Proceedings, volume 10383 of Lecture Notes in Computer Science, pages 292–302. Springer, 2017.
  • [9] Jan Jakubuv and Josef Urban. Enhancing ENIGMA given clause guidance. In Florian Rabe, William M. Farmer, Grant O. Passmore, and Abdou Youssef, editors, Intelligent Computer Mathematics - 11th International Conference, CICM 2018, Hagenberg, Austria, August 13-17, 2018, Proceedings, volume 11006 of Lecture Notes in Computer Science, pages 118–124. Springer, 2018.
  • [10] Jan Jakubuv and Josef Urban. Hammering Mizar by learning clause guidance. In John Harrison, John O’Leary, and Andrew Tolmach, editors, 10th International Conference on Interactive Theorem Proving, ITP 2019, September 9-12, 2019, Portland, OR, USA, volume 141 of LIPIcs, pages 34:1–34:8. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019.
  • [11] Cezary Kaliszyk and Josef Urban. MizAR 40 for Mizar 40. J. Autom. Reasoning, 55(3):245–256, 2015.
  • [12] Laura Kovács and Andrei Voronkov. First-order theorem proving and Vampire. In Natasha Sharygina and Helmut Veith, editors, CAV, volume 8044 of LNCS, pages 1–35. Springer, 2013.
  • [13] Ross A. Overbeek. A new class of automated theorem-proving algorithms. J. ACM, 21(2):191–200, April 1974.
  • [14] Stephan Schulz, Simon Cruanes, and Petar Vukmirovic. Faster, higher, stronger: E 2.3. In Pascal Fontaine, editor, Automated Deduction - CADE 27 - 27th International Conference on Automated Deduction, Natal, Brazil, August 27-30, 2019, Proceedings, volume 11716 of Lecture Notes in Computer Science, pages 495–507. Springer, 2019.
  • [15] Josef Urban. MPTP 0.2: Design, implementation, and initial experiments. J. Autom. Reasoning, 37(1-2):21–43, 2006.
  • [16] Josef Urban, Geoff Sutcliffe, Petr Pudlák, and Jiří Vyskočil. MaLARea SG1 - Machine Learner for Automated Reasoning with Semantic Guidance. In Alessandro Armando, Peter Baumgartner, and Gilles Dowek, editors, IJCAR, volume 5195 of LNCS, pages 441–456. Springer, 2008.

Appendix 0.A Strategies

Strategy E1is:

--definitional-cnf=24 --split-aggressive --simul-paramod
--forward-context-sr --destructive-er-aggressive --destructive-er
--prefer-initial-clauses -tKBO -winvfreqrank -c1 -Ginvfreq -F1
--delete-bad-limit=150000000 -WSelectMaxLComplexAvoidPosPred

Strategy E2 is:

--definitional-cnf=24 --split-aggressive --split-reuse-defs
--simul-paramod --forward-context-sr --destructive-er-aggressive
--destructive-er --prefer-initial-clauses -tKBO -winvfreqrank
-c1 -Ginvfreq -F1 --delete-bad-limit=150000000
-WSelectMaxLComplexAvoidPosPred -H’(

Appendix 0.B Additional Protocol Details

In this section I include the details for intuitively toggled protocols.

Protocol Inc2 is as follows:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Depth 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33
Trees 150 150 150 100 100 100 75 50 75 100 150 75 100 150 75 100

The protocol for Exp. 2 is as follows:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
D 4 5 6 7 8 9 10 11 12 13 14 15 16 16 32 9 16 32 64 24 25 32
T 50 150 160 170 180 190 200 200 200 200 210 220 225 225 225 300 300 225 150 250 250 250

The protocol for Exp .5 requires some explanation. The motivation is to see how far E0 can be taken, even if the methods are too CPU-intensive for a thorough grid search.

Exp. 2 and Exp. 4 demonstrate the utility of boosting. Thus to create better boosting data I trained ENIGMA for 10 loops with strategies E1 through E12 and used this as boosting data for the first 4 of 10 loops of training. In addition to training E0, and in the spirit of ablation studies, I also trained ENIGMA models for E0 with KBO ordering (and no literal selection) and for E0 with KBO ordering and restricted literal comparisons. The motivation is that these versions may serve as a bridge between standard E and the basic E0.

Then I used these results to boost an ENIGMA model in loop 0, and trained based on this for 10 loops, proving 9759 problems.

Finally this data and the data from a loop 3 ENIGMA model trained with E1 is used to boost E0 with the following meta-parameters:

0 1 2 3 4 5 6 7 8 9 10 11
Depth 512 512 32 1000 32 1000 32 1000 32 1000 1000 100
Trees 2 2 100 100 200 100 200 32 300 32 32 32
Feature Size 16384 8192 4096 28 4096 28 4096 32 2048 64 32 128