deepmath
deep learning for math
We study the effectiveness of neural sequence models for premise selection in automated theorem proving, one of the main bottlenecks in the formalization of mathematics. We propose a two-stage approach for this task that yields good results for the premise selection task on the Mizar corpus while avoiding the hand-engineered features of existing state-of-the-art models. To our knowledge, this is the first time deep learning has been applied to theorem proving on a large scale.
Mathematics underpins all scientific disciplines. Machine learning itself rests on measure and probability theory, calculus, linear algebra, functional analysis, and information theory. Complex mathematics underlies computer chips, transit systems, communication systems, and financial infrastructure – thus the correctness of many of these systems can be reduced to mathematical proofs.
Unfortunately, these correctness proofs are often impractical to produce without automation, and present-day computers have only limited ability to assist humans in developing mathematical proofs and formally verifying human proofs. There are two main bottlenecks: (1) lack of automated methods for semantic or formal parsing of informal mathematical texts (autoformalization), and (2) lack of strong automated reasoning methods to fill in the gaps in already formalized human-written proofs.
The two bottlenecks are related. Strong automated reasoning can act as a semantic filter for autoformalization, and successful autoformalization would provide a large corpus of computer-understandable facts, proofs, and theory developments. Such a corpus would serve as both background knowledge to fill in gaps in human-level proofs and as a training set to guide automated reasoning. Such guidance is crucial: exhaustive deductive reasoning tools such as today’s resolution/superposition automated theorem provers (ATPs) quickly hit combinatorial explosion, and are unusable when reasoning with a very large number of facts without careful selection BlanchetteKPU16.
In this work, we focus on the latter bottleneck. We develop deep neural networks that learn from a large repository of manually formalized computer-understandable proofs. We learn the task that is essential for making today’s ATPs usable over large formal corpora: the selection of a limited number of most relevant facts for proving a new conjecture. This is known as premise selection.
The main contributions of this work are:
A demonstration for the first time that neural network models are useful for aiding in large-scale automated logical reasoning without the need for hand-engineered features.
The comparison of various network architectures (including convolutional, recurrent and hybrid models) and their effect on premise selection performance.
A method of semantic-aware “definition”-embeddings for function symbols that improves the generalization of formulas with symbols occurring infrequently. This model outperforms previous approaches.
Analysis showing that neural network based premise selection methods are complementary to those with hand-engineered features: ensembling with previous results produces superior results.
In the last two decades, large corpora of complex mathematical knowledge have been formalized: encoded in complete detail so that computers can fully understand the semantics of complicated mathematical objects. The process of writing such formal and verifiable theorems, definitions, proofs, and theories is called Interactive Theorem Proving (ITP).
The ITP field dates back to the 1960s HarrisonUW14 and the Automath system by N.G. de Bruijn DeBruijn68. ITP systems include HOL (Light) Harrison96, Isabelle WenzelPN08, Mizar mizarinanutshell, Coq coq, and ACL2 KaufmannM08. The development of ITP has been intertwined with the development of its cousin field of Automated Theorem Proving (ATP) DBLP:books/el/RobinsonV01, where proofs of conjectures are attempted fully automatically. Unlike ATP systems, ITP systems allow human-assisted formalization and proving of theorems that are often beyond the capabilities of the fully automated systems.
Large ITP libraries include the Mizar Mathematical Library (MML) with over 50,000 lemmas, and the core Isabelle, HOL, Coq, and ACL2 libraries with thousands of lemmas. These core libraries are a basis for large projects in formalized mathematics and software and hardware verification. Examples in mathematics include the HOL Light proof of the Kepler conjecture (Flyspeck project) HalesABDHHKMMNNNOPRSTTTUVZ15, the Coq proofs of the Feit-Thompson theorem DBLP:conf/itp/GonthierAABCGRMOBPRSTT13 and Four Color theorem Gonthier07, and the verification of most of the Compendium of Continuous Lattices in Mizar BancerekR02. ITP verification of the seL4 kernel KleinAEHCDEEKNSTW10 and the CompCert compiler Leroy09 shows comparable progress in large-scale software verification. While these large projects mark a coming of age of formalization, ITP remains labor-intensive. For example, Flyspeck took about 20 person-years, with twice as much for Feit-Thompson. Behind this cost are our two bottlenecks: lack of tools for autoformalization and lack of strong proof automation.
Recently the field of Automated Reasoning in Large Theories (ARLT) UrbanV13 has developed, including AI/ATP/ITP (AITP) systems called hammers that assist ITP formalization BlanchetteKPU16. Hammers analyze the full set of theorems and proofs in the ITP libraries, estimate the relevance of each theorem, and apply optimized translations from the ITP logic to a simpler ATP formalism. Then they attack new conjectures using the most promising combinations of existing theorems and ATP search strategies. Recent evaluations have proved 40% of all Mizar and Flyspeck theorems fully automatically holyhammer ; KaliszykU13b. However, there is significant room for improvement: with perfect premise selection (a perfect choice of library facts) ATPs can prove at least 56% of Mizar and Flyspeck instead of today’s 40% BlanchetteKPU16. In the next section we explain the premise selection task and the experimental setting for measuring such improvements.
Given a formal corpus of facts and proofs expressed in an ATP-compatible format, our task is:
Given a large set of premises $\mathcal{P}$, an ATP system $A$ with given resource limits, and a new conjecture $C$, predict those premises from $\mathcal{P}$ that will most likely lead to an automatically constructed proof of $C$ by $A$.
We use the Mizar Mathematical Library (MML) version 4.181.1147 (ftp://mizar.uwb.edu.pl/pub/system/i386linux/mizar7.13.01_4.181.1147i386linux.tar) as the formal corpus and E prover Sch02AICOMM version 1.9 as the underlying ATP system. The following list exemplifies a small non-representative sample of topics and theorems that are included in the Mizar Mathematical Library: Cauchy-Riemann Differential Equations of Complex Functions, Characterization and Existence of Gröbner Bases, Maximum Network Flow Algorithm by Ford and Fulkerson, Gödel’s Completeness Theorem, Brouwer Fixed Point Theorem, Arrow’s Impossibility Theorem, Borsuk-Ulam Theorem, Dickson’s Lemma, Sylow Theorems, Hahn-Banach Theorem, The Law of Quadratic Reciprocity, Pepin’s Primality Test for Public-Key Cryptography, Ramsey’s Theorem.
This version of MML was used for the latest AITP evaluation reported in KaliszykU13b. There are 57,917 proved Mizar theorems and unnamed top-level lemmas in this MML organized into 1,147 articles. This set is chronologically ordered by the order of articles in MML and by the order of theorems in the articles. Proofs of later theorems can only refer to earlier theorems. This ordering also applies to 88,783 other Mizar formulas (encoding the type system and other automation known to Mizar) used in the problems. The formulas have been translated into first-order logic formulas by the MPTP system Urban06 (see Figure 1).
Our goal is to automatically prove as many theorems as possible, using at each step all previous theorems and proofs. We can learn from both human proofs and ATP proofs, but previous experiments KuhlweinU12b ; holyhammer show that learning only from the ATP proofs is preferable to including human proofs if the set of ATP proofs is sufficiently large. Since for 32,524 (56.2%) of the 57,917 theorems an ATP proof was previously found by a combination of manual and learning-based premise selection KaliszykU13b, we use only these ATP proofs for training.
The 40% success rate from KaliszykU13b used a portfolio of 14 AITP methods using different learners, ATPs, and numbers of premises. The best single method proved 27.3% of the theorems. Only fast and simple learners such as k-nearest-neighbors, naive Bayes, and their ensembles were used, based on handcrafted features such as the set of (normalized) subterms and symbols in each formula.
Strong premise selection requires models capable of reasoning over mathematical statements, here encoded as variable-length strings of first-order logic. In natural language processing, deep neural networks have proven useful in language modeling mikolov2010recurrent, text classification dai2015semi, sentence pair scoring baudis2016sentence, conversation modeling vinyals2015neural, and question answering sukhbaatar2015end. These results have demonstrated the ability of deep networks to extract useful representations from sequential inputs without hand-tuned feature engineering. Neural networks can also mimic some higher-level reasoning on simple algorithmic tasks zaremba2014learning ; kaiser2015neural.
The Mizar data set is also an interesting case study in neural network sequence tasks, as it differs from natural language problems in several ways. It is highly structured with a simple context-free grammar – the interesting task occurs only after parsing. The distribution of lengths is wide, ranging from 5 to 84,299 characters with mean 304.5, and from 2 to 21,251 tokens with mean 107.4 (see Figure 2). Fully recurrent models would have to backpropagate through hundreds to thousands of characters or hundreds of tokens to embed a whole statement. Finally, there are many rare words – 60.3% of the words occur fewer than 10 times – motivating the definition-aware embeddings in section 5.2.
The full premise selection task takes a conjecture and a set of axioms and chooses a subset of axioms to pass to the ATP. We simplify from subset selection to pairwise relevance by predicting the probability that a given axiom is useful for proving a given conjecture. This approach depends on a relatively sparse dependency graph. Our general architecture is shown in Figure 3 (left): the conjecture and axiom sequences are separately embedded into fixed-length real vectors, then concatenated and passed to a third network with two fully connected layers and logistic loss. During training, the two embedding networks and the joined predictor path are trained jointly.
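The pairwise relevance predictor described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two embeddings are hand-written stand-in vectors rather than outputs of trained embedding networks, and the ReLU activation, the toy layer sizes, and the uniform initialization are choices made here, not details taken from the paper.

```python
import math
import random

def dense(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    # Activation choice is an assumption made for this sketch.
    return [max(0.0, v) for v in x]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(rng, n_in, n_out):
    """Small uniform random weights (arbitrary scheme, for illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def relevance(conj_emb, axiom_emb, params):
    """Estimated probability that the axiom is useful for the conjecture."""
    (w1, b1), (w2, b2) = params
    joined = conj_emb + axiom_emb            # concatenate the two embeddings
    hidden = relu(dense(joined, w1, b1))     # first fully connected layer
    logit = dense(hidden, w2, b2)[0]         # second layer produces one logit
    return sigmoid(logit)

def logistic_loss(prob, label):
    """Binary cross-entropy for a single (conjecture, axiom) pair."""
    eps = 1e-12
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1.0 - prob + eps))

rng = random.Random(0)
EMB_DIM, HIDDEN = 4, 8                       # toy sizes, far smaller than a real model
params = (init_layer(rng, 2 * EMB_DIM, HIDDEN), init_layer(rng, HIDDEN, 1))
conj = [0.2, -0.1, 0.5, 0.0]                 # stand-in conjecture embedding
axiom = [0.1, 0.3, -0.2, 0.4]                # stand-in axiom embedding
p = relevance(conj, axiom, params)
loss = logistic_loss(p, 1)                   # label 1: the axiom was used in a proof
```

In a trained system the gradient of this loss would flow back through both embedding networks as well as the classifier, which is what training them jointly amounts to.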
As discussed in section 3, we train our models on premise selection data generated by a combination of various methods, including k-nearest-neighbor search on hand-engineered similarity metrics. We start with a first stage of character-level models, and then build second and later stages of word-level models on top of the results of earlier stages.
We begin by avoiding special-purpose engineering by treating formulas on the character level using an 80-dimensional one-hot encoding of the character sequence. These sequences are passed to a weight-shared network for variable-length input. For the embedding computation, we have explored the following architectures:
Pure recurrent LSTM hochreiter1997long and GRU chung2015gated networks.
A pure multilayer convolutional network with various numbers of convolutional layers (with strides) followed by a global temporal max-pooling reduction (see Figure 3 (right)).
A recurrent-convolutional network that uses convolutional layers to produce a shorter sequence which is processed by an LSTM.
The exact architectures used are specified in the experimental section.
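To make the character-level convolutional variant concrete, the following minimal sketch one-hot encodes a string, applies a single strided 1-D convolution, and reduces the result with global max-pooling over time. Everything specific here is an assumption for illustration: the four-character alphabet (the paper uses 80 dimensions), the two hand-set filters, the ReLU activation, and the treatment of unknown characters.

```python
def one_hot(text, alphabet):
    """Encode each character as a one-hot vector over `alphabet`."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    vectors = []
    for ch in text:
        v = [0.0] * len(alphabet)
        if ch in index:                # unknown characters stay all-zero (assumption)
            v[index[ch]] = 1.0
        vectors.append(v)
    return vectors

def conv1d(seq, filters, width, stride):
    """Strided 1-D convolution with ReLU; filters[f][k][c] weights offset k, channel c."""
    out = []
    for start in range(0, len(seq) - width + 1, stride):
        window = seq[start:start + width]
        out.append([
            max(0.0, sum(f[k][c] * window[k][c]
                         for k in range(width)
                         for c in range(len(window[k]))))
            for f in filters
        ])
    return out

def global_max_pool(seq):
    """Max over time for each channel: any input length gives a fixed-size vector."""
    return [max(column) for column in zip(*seq)]

def embed(text, alphabet, filters, width=2, stride=1):
    return global_max_pool(conv1d(one_hot(text, alphabet), filters, width, stride))

ALPHABET = "ab()"                      # toy alphabet for this example
FILTERS = [
    [[1, 0, 0, 0], [0, 1, 0, 0]],      # responds to the bigram "ab"
    [[0, 0, 1, 0], [0, 0, 0, 0]],      # responds to "(" at the window start
]
short = embed("(ab)", ALPHABET, FILTERS)
longer = embed("(ab)(ab)(ab)", ALPHABET, FILTERS)
```

Both `short` and `longer` have one entry per filter, which is the property that lets inputs of very different lengths feed the same downstream classifier.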
It is computationally prohibitive to embed both statements from scratch for every (conjecture, axiom) pair, because the embedding phase is costly. Fortunately, our architecture allows caching the embeddings for conjectures and axioms and evaluating only the shared classifier portion of the network for a given pair. This makes it practical to consider all pairs during evaluation.
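The caching idea can be illustrated with a small wrapper that embeds each statement once and then runs only the pair classifier. This is a sketch under assumptions: `CachedScorer`, `toy_embed`, and `toy_classify` are hypothetical names, and the toy functions stand in for the trained embedding networks and classifier.

```python
class CachedScorer:
    """Scores (conjecture, axiom) pairs while embedding each statement only once."""

    def __init__(self, embed_conjecture, embed_axiom, classify):
        self.embed_conjecture = embed_conjecture
        self.embed_axiom = embed_axiom
        self.classify = classify
        self._conj_cache = {}
        self._axiom_cache = {}

    @staticmethod
    def _cached(cache, embed, statement):
        if statement not in cache:
            cache[statement] = embed(statement)   # the expensive step, done once
        return cache[statement]

    def score(self, conjecture, axiom):
        c = self._cached(self._conj_cache, self.embed_conjecture, conjecture)
        a = self._cached(self._axiom_cache, self.embed_axiom, axiom)
        return self.classify(c, a)                # the cheap step, done per pair

    def rank(self, conjecture, axioms):
        """Axioms sorted from highest to lowest score."""
        return sorted(axioms, key=lambda ax: self.score(conjecture, ax), reverse=True)

calls = {"embed": 0}

def toy_embed(statement):
    # Hypothetical stand-in for a trained embedding network.
    calls["embed"] += 1
    return [len(statement), statement.count("(")]

def toy_classify(c, a):
    # Hypothetical stand-in for the classifier: prefers similar lengths.
    return -abs(c[0] - a[0])

scorer = CachedScorer(toy_embed, toy_embed, toy_classify)
conjectures = ["p(x)", "q(x,y)"]
axioms = ["r(x)", "s(x,y,z)", "t"]
rankings = {c: scorer.rank(c, axioms) for c in conjectures}
```

With two conjectures and three axioms the embedding function runs five times in total, whereas embedding both sides of each of the six pairs separately would take twelve calls.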
The character-level models are limited to word and structure similarity within the axiom or conjecture being embedded. However, many of the symbols occurring in a formula are defined by formulas earlier in the corpus, and we can use the axiom-embeddings of those symbols to improve model performance.
Since Mizar is based on first-order set theory, definitions of symbols can be either explicit or implicit. An explicit definition of $x$ sets $x = e$ for some expression $e$, while an implicit definition states a property of the defined object, such as defining a function by a condition it must satisfy rather than by a closed expression. To avoid manually encoding the structure of implicit definitions, we embed the entire statement defining a symbol, and then use the stage 1 axiom-embedding corresponding to the whole statement as a word-level embedding.
Ideally, we would train a single network that embeds statements by recursively expanding and embedding the definitions of the defined symbols. Unfortunately, this recursion would dramatically increase the cost of training since the definition chains can be quite deep. For example, Mizar defines real numbers in terms of non-negative reals, which are defined as Dedekind cuts of non-negative rationals, which are defined as ratios of naturals, etc. As an inexpensive alternative, we reuse the axiom embeddings computed by a previously trained character-level model, mapping each defined symbol to the axiom embedding of its defining statement. Other tokens such as brackets and operators are mapped to fixed pseudo-random vectors of the same dimension.
Since we embed one token at a time ignoring the grammatical structure, our approach does not require a parser: a trivial lexer is implemented in a few lines of Python. With word-level embeddings, we use the same architectures with shorter input sequences to produce axiom and conjecture embeddings for ranking the (conjecture, axiom) pairs. Iterating this approach by using the resulting, stronger axiom embeddings as word embeddings multiple times for additional stages did not yield measurable gains.
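A lexer and token-to-vector mapping of the kind described above might look like the following sketch. The token syntax (identifiers plus single punctuation characters), the CRC32-based seeding of the pseudo-random vectors, the example formula, and the hand-written definition embeddings are all assumptions for illustration; in the described pipeline the definition embeddings would come from the stage 1 character-level model.

```python
import random
import re
import zlib

# Assumed token syntax: identifiers, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z0-9_]+|\S")

def lex(formula):
    """Split a formula string into tokens without parsing its structure."""
    return TOKEN_RE.findall(formula)

def pseudo_random_vector(token, dim):
    """Fixed vector per token: the same token always maps to the same vector."""
    rng = random.Random(zlib.crc32(token.encode("utf-8")))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def token_embeddings(formula, definition_embeddings, dim):
    """Defined symbols use their definition's embedding; other tokens get fixed vectors."""
    vectors = []
    for tok in lex(formula):
        if tok in definition_embeddings:
            vectors.append(definition_embeddings[tok])
        else:
            vectors.append(pseudo_random_vector(tok, dim))
    return vectors

DIM = 3
# Hypothetical stage 1 embeddings of the statements defining each symbol.
definition_embeddings = {
    "subset": [0.9, -0.2, 0.1],
    "union": [0.4, 0.7, -0.5],
}
tokens = lex("subset(X,union(X,Y))")
vectors = token_embeddings("subset(X,union(X,Y))", definition_embeddings, DIM)
```

The resulting list of vectors replaces the one-hot character sequence as input to the same embedding architectures, with one vector per token.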
For training and evaluation we use a subset of 32,524 out of 57,917 theorems that are known to be provable by an ATP given the right set of premises. We split off a random subset of roughly 10% of these (3,124 statements) for testing and validation. Of these 3,124, we held out 400 statements for monitoring training progress, as well as for model and checkpoint selection. Final evaluation was done on the remaining 2,724 conjectures. Note that we only held out conjectures, but we trained on all statements as axioms. This is comparable to our k-NN baseline, which is also trained on all statements as axioms. The randomized selection of the training and testing sets may also lead to learning from future proofs: a proof $P_j$ of theorem $T_j$ written after theorem $T_i$ may guide the premise selection for $T_i$. However, previous k-NN experiments show similar performance between a full 10-fold cross-validation and incremental evaluation as long as chronologically preceding formulas participate in proofs of only later theorems.
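For each conjecture, our models output a ranking of possible premises. Our primary metric is the number of conjectures proved from the top-$k$ premises, where $k = 16, 32, \ldots, 1024$. This metric can accommodate alternative proofs but is computationally expensive. Therefore we additionally measure the ranking quality using the average maximum relative rank of the testing premise set. Formally, average max relative rank is
$$\mathrm{aMRR} = \operatorname*{mean}_{C} \; \max_{P \in \mathcal{P}_{\mathrm{test}}(C)} \frac{\operatorname{rank}(P, \mathcal{P}_{\mathrm{avail}}(C))}{|\mathcal{P}_{\mathrm{avail}}(C)|}$$
where $C$ ranges over conjectures, $\mathcal{P}_{\mathrm{avail}}(C)$ is the set of premises available to prove $C$, $\mathcal{P}_{\mathrm{test}}(C)$ is the set of premises for conjecture $C$ from the test set, and $\operatorname{rank}(P, \mathcal{P}_{\mathrm{avail}}(C))$ is the rank of premise $P$ among the set $\mathcal{P}_{\mathrm{avail}}(C)$ according to the model. The motivation for aMRR is that conjectures are easier to prove if all their dependencies occur early in the ranking.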
Since it is too expensive to rank all axioms for a conjecture during continuous evaluation, we approximate our objective. For our holdout set of 400 conjectures, we select all true dependencies and 128 fixed random false dependencies from the set of premises available to prove the conjecture, and compute the average max relative rank in this ordering. Note that aMRR is nonzero even if all true dependencies are ordered before false dependencies; the best possible value is 0.051.
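The average max relative rank is straightforward to compute from a model ranking. The sketch below assumes 1-based ranks and represents each example as a pair of (ranked candidate list, set of true premises); the function names and the toy data are illustrative only.

```python
def max_relative_rank(ranking, true_premises):
    """Worst (largest) 1-based rank of a true premise, divided by the candidate count."""
    position = {premise: i + 1 for i, premise in enumerate(ranking)}
    worst = max(position[p] for p in true_premises)
    return worst / len(ranking)

def average_max_relative_rank(examples):
    """Mean of max_relative_rank over (ranking, true_premises) pairs."""
    values = [max_relative_rank(ranking, true) for ranking, true in examples]
    return sum(values) / len(values)

examples = [
    (["a", "b", "c", "d"], {"a", "c"}),       # worst true premise at rank 3 of 4
    (["x", "y", "z", "w", "v"], {"x"}),       # worst true premise at rank 1 of 5
]
amrr = average_max_relative_rank(examples)    # (3/4 + 1/5) / 2

# A perfect ordering of 2 true premises ahead of 128 false ones still scores 2/130,
# which illustrates why the metric has a nonzero lower bound.
perfect = ["t1", "t2"] + [f"f{i}" for i in range(128)]
best_case = max_relative_rank(perfect, {"t1", "t2"})
```

Lower values are better: a conjecture contributes a small value only when every one of its true premises appears near the top of the ranking.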
All our neural network models use the general architecture from Fig 3: a classifier on top of the concatenated embeddings of an axiom and a conjecture. The same classifier architecture was used for all models: a fully-connected neural network with one hidden layer of size 1024. For each model, the axiom and conjecture embedding networks have the same architecture without sharing weights. The details of the embedding networks are shown in Fig 4.
The neural networks were trained using asynchronous distributed stochastic gradient descent using the Adam optimizer kingma2014adam with up to 20 parallel NVIDIA K80 GPU workers per model. We used the TensorFlow framework tensorflow2015whitepaper and the Keras library chollet2015keras. The weights were initialized using the scheme of glorot2010understanding. Polyak averaging with 0.9999 decay was used for producing the evaluation weights polyak1992acceleration. The character-level models were trained with a maximum sequence length of 2048 characters, whereas the word-level (and definition embedding) based models had a maximum sequence length of 500 words. For good performance, especially for low cutoff thresholds, it was critical to employ negative mining during training. A side process was continuously evaluating many (conjecture, axiom) pairs. For each conjecture, we pick the lowest-scoring statements that have a higher score than the lowest-scoring true positive. A queue of previously mined negatives is maintained for producing a mixture of examples in which the ratio of mined instances is about 25% and the rest are randomly selected premises. Negative mining was crucial for good quality: at the top-16 cutoff, the number of proved theorems on the test set doubled. For the union of proof attempts over all cutoff thresholds, the ratio of successful proofs increased from 61.3% to 66.4% for the best neural model.
Our best selection pipeline uses a stage-1 character-level convolutional neural network model to produce word-level embeddings for the second stage. The baseline uses distance-weighted k-NN EasyChair:74 ; KaliszykU13b with handcrafted semantic features KaliszykUV15a. For all conjectures in our holdout set, we consider all the chronologically preceding statements (lemmas, definitions and axioms) as premise candidates. In the DeepMath case, premises were ordered by their logistic scores. E prover was applied to the top-$k$ premise candidates for each of the cutoffs $k \in \{16, 32, \ldots, 1024\}$ until a proof was found or $k = 1024$ failed. Table 1 reports the number of theorems proved with a cutoff value at most the $k$ in the leftmost column. For E prover, we used the auto strategy with a soft time limit of 90 seconds, a hard time limit of 120 seconds, a memory limit of 4 GB, and a processed clauses limit of 500,000.
Our most successful models employ simple convolutional networks followed by max pooling (as opposed to recurrent networks like LSTM/GRU), and the two-stage definition-based defCNN outperforms the naïve wordCNN word embedding significantly. In the latter the word embeddings were learned in a single pass; in the former they are fixed from the stage-1 character-level model. For each architecture (cf. Figure 4) two convolutional layers perform best. Although our models differ significantly from each other, they differ even more from the k-NN baseline based on handcrafted features. The right column of Table 1 shows the result if we average the prediction score of the stage-1 model with that of the definition-based stage-2 model. We also experimented with character-based RNN models using shorter sequences: these lagged behind our long-sequence CNN models but performed significantly better than those RNNs trained on longer sequences. This suggests that RNNs could be improved by more sophisticated optimization techniques such as curriculum learning.
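The cutoff-based evaluation protocol, in which the prover is tried on progressively larger prefixes of the ranking, can be sketched as follows. The `try_prove` callback is a hypothetical stand-in for an actual ATP invocation with its resource limits; here it is replaced by a toy predicate that succeeds once all needed premises are present.

```python
CUTOFFS = [16, 32, 64, 128, 256, 512, 1024]

def first_successful_cutoff(ranked_premises, try_prove, cutoffs=CUTOFFS):
    """Smallest cutoff k whose top-k premises yield a proof, or None if all fail."""
    for k in cutoffs:
        if try_prove(ranked_premises[:k]):
            return k
    return None

def proved_at_most(results, cutoffs=CUTOFFS):
    """Cumulative count of conjectures proved with a cutoff of at most k."""
    return {k: sum(1 for r in results if r is not None and r <= k) for k in cutoffs}

# Toy stand-in for the prover: succeeds when every needed premise is supplied.
def make_toy_prover(needed):
    return lambda premises: needed <= set(premises)

ranked = [f"p{i}" for i in range(2000)]
easy = first_successful_cutoff(ranked, make_toy_prover({"p3", "p40"}))
hard = first_successful_cutoff(ranked, make_toy_prover({"p1500"}))
summary = proved_at_most([easy, hard, 16])
```

A real prover may also fail when given too many premises, so success at a small cutoff does not guarantee success at a larger one; the cumulative count only records the smallest cutoff that worked.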
| Cutoff | k-NN Baseline (%) | charCNN (%) | wordCNN (%) | defCNNLSTM (%) | defCNN (%) | def+charCNN (%) |
|---|---|---|---|---|---|---|
| 16 | 674 (24.6) | 687 (25.1) | 709 (25.9) | 644 (23.5) | 734 (26.8) | 835 (30.5) |
| 32 | 1081 (39.4) | 1028 (37.5) | 1063 (38.8) | 924 (33.7) | 1093 (39.9) | 1218 (44.4) |
| 64 | 1399 (51.0) | 1295 (47.2) | 1355 (49.4) | 1196 (43.6) | 1381 (50.4) | 1470 (53.6) |
| 128 | 1612 (58.8) | 1534 (55.9) | 1552 (56.6) | 1401 (51.1) | 1617 (59.0) | 1695 (61.8) |
| 256 | 1709 (62.3) | 1656 (60.4) | 1635 (59.6) | 1519 (55.4) | 1708 (62.3) | 1780 (64.9) |
| 512 | 1762 (64.3) | 1711 (62.4) | 1712 (62.4) | 1593 (58.1) | 1780 (64.9) | 1830 (66.7) |
| 1024 | 1786 (65.1) | 1762 (64.3) | 1755 (64.0) | 1647 (60.1) | 1822 (66.4) | 1862 (67.9) |

Also, when we applied two of the premise selection models to those Mizar statements that had not been proven automatically before, we proved 823 additional statements.
In this work we provide evidence that even simple neural models can compete with hand-engineered features for premise selection, helping to find many new proofs. This translates to real gains in automatic theorem proving. Despite these encouraging results, our models are relatively shallow networks with inherent limitations to representational power and are incapable of capturing high-level properties of mathematical statements. We believe theorem proving is a challenging and important domain for deep learning methods, and that more sophisticated optimization techniques and training methodologies will prove more useful than in less structured domains.
We would like to thank Cezary Kaliszyk for providing us with an improved baseline model. Many thanks also go to the Google Brain team for their generous help with the training infrastructure. We would like to thank Quoc Le for useful discussions on the topic and Sergio Guadarrama for his help with TensorFlow-Slim.
International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
A machine-checked proof of the Odd Order Theorem. In S. Blazy, C. Paulin-Mohring, and D. Pichardie, editors, ITP, volume 7998 of LNCS, pages 163–179. Springer, 2013.