## 1 Introduction

Automated theorem proving studies the design of automated systems that reason over
logical theories (collections of *axioms*

that are formulae known to be true) to generate formal proofs of given conjectures. It has been a longstanding, active area of artificial intelligence research that has demonstrated utility in the design of operating systems

klein2009operating; klein2014comprehensive, distributed systems garland1998ioa; hawblitzel2015ironfleet, compilers curzon1991verified; leroy2009formal, microprocessor design hunt1989microprocessor, and in general mathematics hales2017formal.Classical automated theorem provers (ATPs) have historically been most useful for solving problems that require complex chains of reasoning steps to be executed over smaller sets of axioms (see TPTP sutcliffe2009tptp for examples). When faced with problems for which thousands to millions of axioms are provided (only a handful of which may be needed at a time), even state-of-the-art theorem provers have difficulty ramachandran2005first; hoder2011sine. This deficiency has become more evident in recent years, as large logical theories matuszek2006introduction; mizar40for40; pease2002suggested

have become more widely available. A natural way to scale ATPs to broader domains has been to design sophisticated mechanisms that allow them to determine which axioms or intermediate proof outputs merit exploration in the proof search process. These mechanisms thus prune an otherwise unmanageably large proof search space down to a size that can be handled efficiently by classical theorem provers. The task of classifying axioms as being useful to prove a given conjecture is referred to as

*premise selection*, while the task of classifying intermediate proof steps as being a part of a successful proof for a conjecture is referred to as

*proof step classification*.

Initial approaches for solving these two tasks proposed heuristics based on simple symbol co-occurrences between formulae

hoder2011sine; roederer2009divvy; kuhlwein2012overview. While effective, these methods were soon surpassed by machine-learning techniques which could automatically adjust themselves to the needs of particular domains

alama2014premise; irving2016deepmath. At present, there has been a rising interest in developing neural approaches for both premise selection and proof step classification bansal2019holist; kaliszyk2017holstep; loos2017deep; however, the complex and structured nature of logical formulae has made this development challenging. Neural approaches that take into account a formula’s structure (e.g., its parse tree), have been shown to outperform their more basic counterparts which operate on only a formula’s symbols wang2017premise; paliwal2019graph. The two most commonly used structure-aware neural methods have been Tree LSTMs tai2015improved and GNNs kipf2016semi. However, as they have been applied in this domain, these methods appear to be leaving out useful structural information.When used to embed the parse tree of a logical formula, Tree LSTMs generate embeddings that represent the parse tree globally, but they miss logically important information like shared subexpressions and variable quantifications. Conversely, traditional GNN approaches appear to better capture shared subexpressions and variable quantifications; however, the global graph embedding they produce for the whole formula consists of a simple pooling operation over individual node embeddings; each representing only themselves and their local neighborhoods. Additionally, most prior approaches have embedded the premise and conjecture formulae independently of each other wang2017premise; loos2017deep; paliwal2019graph; irving2016deepmath. They first embed the graph of the premise and then separately embed the conjecture graph, resulting in the contents of one formula having no influence on the embedding of the other.

To address these issues, we present a novel, two-phase embedding approach that operates over the DAG representations of logical formulae and is designed with careful consideration to their particular properties. Our method first produces an initial set of high-quality embeddings for nodes that incorporates more than just their local neighborhoods. Then, it pools the embeddings together in a structure-dependent way to generate a single graph-level embedding. This decoupling provides a clear point at which information between formulae can be exchanged, which allows us to define a localized attention-based exchange mechanism that can regulate information flow between the concurrent formula embedding processes.

We demonstrate the effectiveness and generality of our approach by evaluating classification performance on two standard datasets that involve different logical formalisms; the Mizar dataset mizar40for40; irving2016deepmath for first-order logic and the Holstep dataset kaliszyk2017holstep for higher-order logic. Our experiments show that the architecture introduced in this paper outperforms all previous approaches on the binary classification tasks of premise selection and proof step classification for both datasets. We also demonstrate how to easily integrate our approach with *E*, a well-established theorem prover schulz2013system, as its premise selection mechanism, allowing it to find more proofs (61.6% improvement) in a large-theory setting.

To summarize, our main contributions are: 1) We show how to leverage the DAG structure implicit in logical formulae to produce more effective embeddings than traditional approaches operating over the local neighborhoods of individual nodes; 2) We introduce a novel neural architecture that employs a localized attention mechanism to allow formulae to exchange information during the embedding process; 3) We provide an extensive series of experiments and compare a range of neural architectures, showing significant improvement over existing state-of-the-art methods on two standard ATP classification datasets.

## 2 Related Work

We note that premise selection and proof step classification are not intrinsically machine learning tasks. The earliest approaches to premise selection hoder2011sine were simple heuristics capturing the (transitive) co-occurrence of symbols in a given axiom and conjecture. Soon after, it was recognized that machine-learning techniques would be effective tools for solving this problem. The work of alama2014premise introduced a kernel method for premise selection where the similarity between two formulae was computed by the number of common subterms and symbols. Deepmath irving2016deepmath was the first deep learning approach to this problem, comparing the performance of sequence models over character and symbol-level representations of logical formulae. In kucik2018premise

, the authors proposed a symbol-level method that learned low-dimensional distributed representations of function symbols and used those to construct embedded representations of given formulae for premise selection. The work of

olvsak2019property introduced a GNN for representing specifically first-order logic formulae in conjuctive normal form that captured certain logical invariances like reorderings of clauses and literals.Recently, Holstep kaliszyk2017holstep, a new formal dataset designed to be large enough to evaluate neural methods for premise selection and proof step classification (among other tasks), was introduced. Along with the dataset came a set of benchmark deep learning models that operated over character and symbol-level representations of higher-order logic formulae. FormulaNet wang2017premise

was the first approach to transform a formula into a rooted DAG (a modified version of the parse tree) and then process the resulting graph with a GNN. Their GNN produced embeddings for each node within a formula’s graph representation and then max pooled across node embeddings to get a formula-level embedding.

There are several other related works in this area that focus on different tasks (e.g., proof guidance, the combined, online version of the aforementioned two tasks). Deep learning approaches to proof guidance include loos2017deep, where the authors explored a number of neural architectures in their implementation (including a Tree LSTM that encoded parse trees of logical formulae). paliwal2019graph represented formulae as DAGs with shared subexpressions and used message-passing GNNs (MPNNs) to generate embeddings that could be used to guide theorem proving on the higher-order logic benchmark of bansal2019holist. However, like wang2017premise, the graph-level embeddings produced by their approach were simple, consisting only of a max pooling over individual node embeddings. The work of evans2018can introduced a dataset for evaluating neural models on entailment for propositional logic and explored the use of several popular neural architectures on the proposed task. huang2018gamepad where the authors introduced the GamePad dataset for evaluating neural models on the tasks of position evaluation and tactic prediction.

## 3 Formula Representation

#### Background:

First-order logic formulae are formal expressions based on an alphabet of predicate, function, and variable symbols which are combined by logical connectives. A term is either a variable, a constant (function with no arguments), or, inductively, a function applied to a tuple of terms. A formula is either a predicate applied to a tuple of terms or, inductively, a connective (e.g., read as “and”) applied to some number of formulae. In addition, variables in formulae can be quantified by quantifiers (e.g., by read as “for all”), where a quantifier introduces an additional semantic restriction for the interpretation of the variables it quantifies. Higher-order logic formulae also allow for quantification over predicate and function symbols or the application of predicates over other predicates. For more details on both first and higher-order logic, we refer the reader to taylor1999practical.

#### Logical Formulae as Graphs:

While the earliest work on integrating deep learning with reasoning techniques used symbol- or word-level representations of input formulae irving2016deepmath; kaliszyk2017holstep (considering formula strings as words), subsequent work explored using formula parse trees loos2017deep; evans2018can; huang2018gamepad or rooted DAG forms wang2017premise; paliwal2019graph. On the Holstep kaliszyk2017holstep and Holist bansal2019holist datasets, the DAG forms of logical formulae were found to be the more useful than bag-of-symbols and tree-structured encodings wang2017premise; paliwal2019graph. We focus on rooted-DAG representations of formulae; Figure 1 shows an unmodified example of such a representation. The DAG associated to a formula corresponds to its parse tree, where directed edges are added from parents to their arguments and shared subexpressions are mapped to the same subgraphs. As in wang2017premise, all instances of the same variable are collapsed into a single node (which maintains all prior connections) and the name of each variable is replaced by a generic variable token.

#### Edge Labeling:

Capturing the ordering of arguments of logical expressions is still an open topic of research. wang2017premise used a so-called *treelet* encoding scheme that represents the position of a node relative to other arguments of the same parent as triples. paliwal2019graph used positional edge labels, assigning to each edge a label reflecting the position of its target node in the argument list of the node’s parent.
We follow the latter strategy, albeit, with modifications. In our formulation, edge labels are determined by a partial ordering. For unordered logical connectives (e.g., , ) and predicates (e.g., ) all arguments are of the same rank. For other predicates, functions, and logical connectives the arguments are instead linearly ordered. However, we also support hybrid cases like simultaneous quantification over multiple variables.
The label given to each argument edge in the graph is the rank of the corresponding argument with respect to the parent concatenated with the type of the parent.

## 4 Our Approach

In this work, we broadly distinguish between node embedding methods by reachability. More formally, consider a binary adjacency relation defined for a set of graphs . The -reachability relation is given as the -th power of , which is defined recursively with and . We can define the transitive closure of as simply . Letting the set of all nodes be , we say that a graph embedding function is a embedding method if there exists a such that where for every in we have that computes the value of as a function of only the embeddings for . Naturally, we define a embedding method as one for which the opposite holds, i.e. for each we have that computes the value of as a function of the embeddings for all where holds. This distinction is particularly useful to make for graphs in the logic domain, as the transitive closure of adjacency is necessary for many key logical operations. As a trivial but important example, consider checking for the resolvability or unifiability of two ground formulas. Potentially all nodes of the two formulas would need to be examined, meaning that if both formulas had depth then a procedure defined with that checks only some subset of nodes within a fixed range of the root would be insufficient.

We view embedding methods as those that perform a sophisticated type of subgraph pooling. That is, a node embedding method computes the embedding for a node as a function of the embeddings of all nodes reachable from , i.e. a pooling of all such node embeddings. By definition, they incorporate as much graph context as is possible (i.e., the transitive closure of ). While node embedding methods naturally lend themselves to graph-level readout functions (and we will also use them in this way), we note that these concepts are defined for node-level embeddings (an important distinction to make, as for certain applications the input graphs could be disconnected).

Our approach operates in two stages (see Figure 2). First, a neural network generates embeddings for each node of an input formula’s graph representation. Then, the node embeddings are passed into a follow-up embedding method, referred to as the pooling method, that has as the parent relation. The embedding for the root node of the input formula is returned, which is a function of all nodes in its graph. Our approach is very modular, with any node-level embedding method capable of serving as the initial node embedder (though we are mainly interested in embedding methods) and any embedding method being usable as the pooling method. Thus, in the next sections we describe the node embedding methods independently, and then we describe the classification process.

### 4.1 Embedding Methods

#### Message-Passing Graph Neural Networks:

The MPNN framework can be thought of as an iterative update procedure that represents a node as an aggregation of information from its local neighborhood. To begin, our MPNN assigns each node and edge of the input graph an initial embedding, and . Then, following wang2017premise

, initial node states are computed by passing each such embedding through batch normalization

ioffe2015batchand a ReLU, producing node states

and edge states . Lastly, a message-passing phase runs for roundswhere and are functions that take a node and return the immediate ancestors / children of in , and , , and are feed-forward networks unique to the -th round of updates, and

denotes vector concatenation. The reachability relation

in this context is defined as where and are relations that hold true for immediate ancestor and child relationships, respectively. Similar to gilmer2017neural, and should be considered the messages to be passed to , and represents the node embedding for node after rounds of iteration.#### Graph Convolutional Neural Networks:

Like with our MPNNs, for our Graph Convolutional Networks (GCNs) kipf2016semi, the reachability relation is given as the undirected adjacency relation, i.e., for nodes and we have . First, each node is associated with an embedding . Then, for rounds, updated embeddings are computed as

where is a non-linearity (in this work, we use a ReLU), , and is a learned matrix specific to the -th round of updates.

### 4.2 Embedding Methods

#### DAG LSTMs:

DAG LSTMs can be viewed as the generalization of Tree LSTMs tai2015improved to DAG-structured data. With initial node embeddings , the DAG LSTM uses the same N-ary equations as the Tree LSTM to compute node states

where denotes element-wise multiplication,

is the sigmoid function and

, , , and are learned matrices (different for each edge type). and represent input and output gates, while and are intermediate computations (memory cells), and is a forget gate that modulates the flow of information from individual arguments into a node’s computed state. is a predecessor function that returns the set of nodes for which holds true, i.e. . In this work, it returns either the parents or the children, depending on whether the direction of accumulation is desired to go upwards or downwards. For readability, we omitted the layer normalization ba2016layer applied to each matrix multiplication (e.g., , , etc.) from the above equations. Each instance of layer normalization maintained its own separate parameters.The DAG LSTM we propose here is nearly the same as the Tree LSTM of tai2015improved, however there are key implementational differences between the two approaches. In Tree LSTMs, typically returns child nodes (since a node can have only one parent), while in our work it can return either children or parents. In addition, batching together node updates in a Tree LSTM can be done at the level of depth (i.e., all nodes at the same depth in the tree can have their updates computed simultaneously); however, with DAGs this batching strategy could cause a node to be updated and overwritten multiple times. To solve this, we propose the use of *topological batching*. In our approach, node updates are computed in the order given by a topological sort of the graph, starting from the leaves (or root depending on ), with updates batched together at the level of topological equivalence, i.e., every node with the same rank can have the updates computed simultaneously.

#### Attention-Enhanced DAG LSTMs:

In order to allow the contents of the premise and conjecture to influence one another during the embedding process, we introduce a localized attention mechanism that exchanges information between the two graph embeddings. Let and be the sets of node embeddings computed by some initial node embedder for the premise and conjecture graphs. Let be a function that takes a node and either or and returns all node embeddings from the set where the associated node has an identical label to the given node, i.e. . Our approach computes multi-headed attention scores vaswani2017attention between identically labeled nodes and uses those attention scores to build new embeddings that provide cross graph information to the pooling procedure. For an input , for each we compute

where , , and are learned matrices for each of the attention heads.

where is the dimensionality of and is computed by the standard softmax

The final vector for input is a combination of its transformations, with and being learned matrices, a learned vector for the type (e.g., quantifier, predicate, etc.) of node , and denoting vector concatenation over the attention vectors. The gating mechanism we propose here simply allows for the architecture to cut off information flow between the two graphs if doing so improves loss, thus turning the architecture into the simpler DAG LSTM introduced previously. Following the attention computation, each is replaced by and a DAG LSTM then processes each node embedding.

#### Bidirectional DAG LSTMs:

We also explore a simple extension of the DAG LSTM from above to a Bidirectional DAG LSTM. This extends the relation from being either ancestors or children to being both, i.e. . For a node and graph , the embedding is

where is a feed-forward network, and DAG-LSTM / DAG-LSTM are both DAG LSTMs (following the design presented before) which set as and , respectively.

### 4.3 Classification Process

In our approach, the final graph embeddings for the premise and conjecture are taken to be the embeddings for the root nodes of the premise and conjecture, and . For ablation experiments using only local neighborhood-based node embedders (MPNN / GCN from Section 4.1), the inputs to the classifier network would be a max pooling of the individual node embeddings for each graph. In either case, the two graph embeddings are concatenated and passed to a classifier feed-forward network for the final prediction .

## 5 Experiments and Results

In this section, we evaluate the performance of our approach to show 1) how accurately can it predict the label of an axiom or proof step, 2) an ablation study that shows the effect of different node embedding and pooling mechanisms, and 3) its impact when integrated with an existing theorem prover in terms of number found proofs. We compare our approach to prior works using two standard datasets: Mizar^{1}^{1}1https://github.com/JUrban/deepmath mizar40for40 and Holstep^{2}^{2}2http://cl-informatik.uibk.ac.at/cek/holstep/ kaliszyk2017holstep

. The hyperparameters and network configurations for our approach can be found in the appendix (which is located in the supplementary material).

#### Mizar Dataset:

*Mizar* mizar40for40 is a corpus of 57,917 theorems. Like irving2016deepmath; olvsak2019property; kucik2018premise, we use only the subset of 32,524 theorems which have an associated ATP proof, as those have been paired with both positive and negative premises (i.e., axioms that do / do not entail a particular theorem) to train our approach. We randomly split the 32,524 theorems as 80% / 10% / 10% for training, development, and testing (yielding 417,763 / 51,877 / 52,880 individual premises). Following olvsak2019property, each example given to the network consisted of a conjecture paired with the complete set of both positive and negative premises. The task was then to classify each individual premise as positive or negative.

#### Holstep Dataset:

*Holstep* kaliszyk2017holstep

is a large corpus designed to test machine learning approaches on automated reasoning. Following prior work

kaliszyk2017holstep; wang2017premise, we use only the portion needed for proof step classification. That part has 9,999 conjectures for training and 1,411 conjectures for testing, where each conjecture is paired with an equal number of positive and negative proof steps (i.e., proof steps that were / were not part of the final proof for the associated conjecture). Using that data, we obtain 2,013,046 training examples and 196,030 testing examples, where each example is a triple with the proof step, conjecture, and a positive or negative label. We held out 10% of the training set to be used as a development set. We follow the binary classification problem setting of wang2017premise and kaliszyk2017holstep where our approach classifies each proof step as either relevant or irrelevant to its paired conjecture.### 5.1 Classification Experiments

#### Baselines:

For premise selection on Mizar, we compare with two existing systems: the distributed formula representation of kucik2018premise and the property-invariant formula representation of olvsak2019property. For the proof step classification task on Holstep, we compare against 4 systems implemented in two prior works: 1) DeepWalk perozzi2014deepwalk and FormulaNet wang2017premise, both of which were applied to Holstep in wang2017premise. 2) CNN-LSTM and CNN, both introduced in the original Holstep paper kaliszyk2017holstep.

Formula Embedding Method | Mizar Acc. | Holstep Acc. |

Kucik & Korovin (2018) kucik2018premise | 76.5% | – |

DeepWalk (2014) perozzi2014deepwalk | – | 61.8% |

CNN-LSTM (2017) kaliszyk2017holstep | – | 83.0% |

CNN (2017) kaliszyk2017holstep | – | 82.0% |

FormulaNet (2017) wang2017premise | – | 90.3% |

BidirDagLSTM-AttDagLSTM (this work) | 81.0% | 91.4% |

#### Main Results:

Table 1 shows the performance for the version of our approach that incorporates the entire context surrounding a node into its embedding and jointly embeds premises / proof steps with the conjecture, i.e., a Bidirectional DAG LSTM with attention-enhanced DAG LSTM pooling. Overall, our system outperforms all state-of-the-art systems on both datasets using a standard evaluation on held-out test data. It outperforms by a large margin on Mizar (+4.5%, which is statistically significant with ) and by a moderate, but still statistically significant, margin on Holstep (+1.1% with ). In addition to the standard evaluation using a held-out test dataset reported on Table 1, we also compare to olvsak2019property

, which introduced a GNN designed to process specifically first-order logic theories in conjunctive normal form. In their evaluation on Mizar, they split their data into only a train and test set and evaluated the model obtained at each epoch on their test set, reporting an accuracy of “around 80%” as the best performance across all test set evaluations. Following their setup, our best validation performance is 81.9%, a roughly 2% gain over

olvsak2019property.The result in Table 1 confirms our hypothesis that a more holistic treatment of logical formulae can result in a more effective embedding process than simpler methods that, by their implementation, consider less structure and embed premises and conjectures independently.

#### Ablation Studies:

We present the results of our ablations in Table 2, with statistically significant () improvements over kucik2018premise on Mizar and wang2017premise on Holstep marked explicitly. On Mizar, we can see that variants with attention-based pooling were the most performant by a large margin. When controlling for pooling type, the node embedders provided better performance than the node embedders. Similarly, when controlling for node embedding type, pooling methods provided improvement over max pooling.

For Holstep, when controlling for node embedding type the pooling methods had better performance than max pooling. Interestingly, when controlling for pooling type, the difference between the MPNN and node embedding methods was not significant. Within approaches introduced here, those variants using AttDagPool did not significantly improve over those using DagPool. We suspect that this is due to the fundamental difference between proof step classification and premise selection. Intermediate proof steps are typically much larger and noisier than actual premises, which may have led to Holstep example pairs being independent (i.e., there were properties of an individual proof step without the conjecture that would give away the positive or negative label). This is partially supported by both wang2017premise and kaliszyk2017holstep, who observed that their architectures performed just as well when classifying with only the proof step, rather than on both the proof step and conjecture (90.0% vs. 90.3% for FormulaNet and 83.0% vs. 83.0% for CNN-LSTM). On validation data, we also explored higher numbers of update rounds (i.e., the parameter) for variants of our approach using an MPNN as the initial node embedder; however, like wang2017premise we found insignificant change beyond .

Node Embedding | Pool Type | Mizar Acc. | Statistically Sig. | Holstep Acc. | Statistically Sig. | |
---|---|---|---|---|---|---|

MPNN | MaxPool | 2 | 76.9% | 90.5% | ||

MPNN | DagPool | 2 | 77.4% | 91.3% | ||

MPNN | AttDagPool | 2 | 79.7% | 91.3% | ||

GCN | MaxPool | 2 | 74.7% | 89.0% | ||

GCN | DagPool | 2 | 77.3% | 90.9% | ||

GCN | AttDagPool | 2 | 79.8% | 90.8% | ||

DagLSTM | DagPool | – | 78.4% | 91.4% | ||

DagLSTM | AttDagPool | – | 80.7% | 91.5% | ||

BidirDagLSTM | DagPool | – | 78.1% | 91.4% | ||

BidirDagLSTM | AttDagPool | – | 81.0% | 91.4% |

### 5.2 Premise Selection for Automated Theorem Proving

To show that our approach could be used to improve the performance of an actual theorem prover, we ran a traditional premise selection experiment with E schulz2013system. We first trained new models using our settings from the classification experiments, however, this time optimizing for binary classification between pairs of individual formulae. In addition to our Mizar training set from before, we also augmented our training data by adding randomly selected negative examples for each example from our original training set. For testing, we paired the conjecture of each of the 3,252 problems from our Mizar validation set with the complete set of statements from all chronologically preceding problems (as described in irving2016deepmath) in the union of our training and validation sets. For each problem, our model then ranked the premises with respect to each conjecture and returned the top premises (where indicates including all premises).

E was run on each problem in auto-schedule mode (which tries several expert heuristics based on the given problem) with a time limit of 10 seconds per , stopping at the first where the problem was solved. To validate that our approach solves more problems than E would have by itself in the same amount of time, we also measured the performance of E when run with all premises (identical to ) for 90 seconds per problem. Out of 3,252 problems, E by itself was able to solve 918; however, using our approach as its premise selection mechanism, E was capable of solving 1484. In both settings, E had the same amount of time (90 seconds) per problem to find a proof, but with our approach it was able to solve 566 more problems (a 61.6% improvement) which is statistically significant with .

## References

## 6 Appendix

### 6.1 Network Configurations

For Holstep, our hyperparameters were chosen to be comparable to [wang2017premise]. In our model, node embeddings were 256-dimensional vectors and edge embeddings were 64-dimensional vectors. All feed-forward networks (each , each , each , , and ) followed mostly the same configuration, except for their input dimensionalities. Each had one hidden layer with dimensionality equal to the output layer (except for where the dimensionality was half the input dimensionality). Every hidden layer for all feed-forward networks (except for ) was followed by batch normalization [ioffe2015batch] and a ReLU. The final activation for was a sigmoid; for all other feed-forward networks, the final activations were ReLUs. For the DAG LSTMs, the hidden states were 256-dimensional vectors. Each , , , and were learned matrices and each of , , , , , and were learned matrices. For Mizar, all above dimensionalities were halved to be comparable to [kucik2018premise, olvsak2019property]. For the attention-enhanced DAG LSTM, the multi-headed attention mechanism used two heads, with each , , and mapping from the node state dimensionality to double the node state dimensionality.

### 6.2 Training

Our models were constructed in PyTorch

[paszke2017automatic] and trained with the Adam Optimizer [kingma2014adam]with default settings. The loss function optimized for was binary cross-entropy. We trained each model for 5 epochs on Holstep and 30 epochs on Mizar, as validation performance did not improve with more training. Performance on the validation sets was evaluated after each epoch and the best performing model on validation was used for the single evaluation on the test data.

### 6.3 Hardware Setup

All experiments were run on Linux machines with 72-core Intel Xeon(R) 6140 CPUs @ 2.30 GHz and 750 GB RAM, and two Tesla P100 GPUs with 16 GB GPU memory.

Comments

There are no comments yet.