Premise selection with neural networks and distributed representation of features

07/26/2018 ∙ by Andrzej Stanisław Kucik, et al. ∙ The University of Manchester

We present the problem of selecting relevant premises for a proof of a given statement. When stated as a binary classification task for pairs (conjecture, axiom), it can be efficiently solved using artificial neural networks. The key difference between our approach to this problem and previous approaches is the use of just the functional signatures of premises. To further improve the performance of the model, we use a dimensionality reduction technique to replace long and sparse signature vectors with their compact and dense embedded versions. These are obtained by first defining the concept of a context for each functor symbol, and then training a simple neural network to predict the distribution of other functor symbols in the context of this functor. After training the network, the output of its hidden layer is used to construct a lower dimensional embedding of a functional signature (for each premise) with a distributed representation of features. This allows us to use 512-dimensional embeddings for conjecture-axiom pairs, containing enough information about the original statements to reach an accuracy of 76.45% using only simple two-layer densely connected neural networks.




1 Introduction

Automated theorem provers (ATPs) and proof assistants have helped with the formal verification of many well-known mathematical theorems, the most notable examples being the proofs of the four colour theorem [12] and the Kepler conjecture [13]. However, they face two significant limitations.

The first is that they can only employ formalised mathematics. Most of the corpus of mathematical results, although written in a relatively formal manner, is still only available as natural language text, and there exist no efficient semantic or formal parsers to translate it into machine language. This makes many important theorems (especially those published recently) unavailable to automated systems. Secondly, there is no formal method of emulating human intuition in choosing the relevant, already known facts (i.e. premise selection) and strategies for the proof. This makes even simple and intuitive proofs intractable for ATPs.

Recent advancements in another field of artificial intelligence, machine learning, and in particular neural networks (also known by their rebadged name: deep learning), give us hope that these issues can be resolved. This is because they have proved to be very successful in fields previously reserved almost exclusively for formal methods, such as strategy board games, most notably the game of Go [3].

The first attempt to apply deep learning to automated theorem proving was made in [1], where neural network models are trained on pairs of first-order logic axioms and conjectures to determine which axiom is most likely to be relevant in constructing an automated proof by the prover. In this paper we build on the foundation laid therein, showing that selecting an appropriate representation of premises can greatly simplify the problem, allowing us to use much simpler neural networks and, consequently, to make the decision in a much shorter time.

In the context of theorem proving, deep learning techniques have also recently been used, for example, in [2, 5, 6, 7, 10, 11, 22, 25].

The code for neural network architectures presented in this paper, as well as for processing of the input data, is available at [20].

2 Datasets

For this experiment we use a dataset of 32,524 examples collected and organised by Josef Urban in [24] for the Mizar40 experiment and the DeepMath experiment [1]. Each example is of the form:

     C tptpformula
     + tptpformula
     + tptpformula
     - tptpformula
     - tptpformula

where tptpformula is the standard first-order logic TPTP representation, C indicates a conjecture, + a premise that is needed for an ATP proof of C, and - a premise that is not required for the proof but was highly ranked by a $k$-nearest neighbour algorithm trained on the proofs. For practical reasons dictated by the theory of machine learning, there is roughly the same number of useful and redundant facts associated with each conjecture. In total, we have 102,442 unique formulae across this dataset: 32,524 conjectures and 69,918 axioms. Every conjecture has 16 axioms assigned to it on average, with the minimum being 2 and the maximum 270. We take each conjecture and its corresponding axioms and form pairs (conjecture, axiom), which constitute our positive and negative examples (522,528 in total).

In [1] two alternative data representations were adopted: a character-level and a word-level representation. Both of these are problematic, however. Premises have a variable number of characters (5 to 84,299, with mean 304.5) and words/tokens (2 to 21,251, with mean 107.4), so they have to be either truncated or padded with zeros. The character representation is given by an 80-dimensional one-hot encoding with a maximum sequence length of 2048 characters, and the word representation is obtained by word encoding of axiom embeddings computed by the previously trained character-level model, and generating pseudo-random vectors (of the same dimension) to encode tokens such as brackets and operators. The maximum number of words is limited to 500. The resulting datasets are sparse and high-dimensional, and some of the important information is lost by restricting the maximal number of words or characters. This hampers the performance of machine learning algorithms applied to them and, in the case of artificial neural networks, imposes serious limitations on the network architecture. In the next section we present our approach to tackling this problem.
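The example format described at the start of this section can be turned into labelled (conjecture, axiom) pairs with a few lines of code. The following is a minimal sketch in plain Python; the function name `parse_examples` and the toy formulae are hypothetical, and it assumes that every line starts with its marker (`C`, `+` or `-`) followed by a space.

```python
from typing import Iterable, List, Tuple


def parse_examples(lines: Iterable[str]) -> List[Tuple[str, str, int]]:
    """Turn 'C/+/-' prefixed lines into (conjecture, axiom, label) triples.

    Label 1 marks a premise used in the proof ('+'), label 0 one that is not ('-').
    """
    pairs: List[Tuple[str, str, int]] = []
    conjecture = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        marker, _, formula = line.partition(" ")
        formula = formula.strip()
        if marker == "C":
            # A new conjecture starts a new example.
            conjecture = formula
        elif marker in ("+", "-") and conjecture is not None:
            pairs.append((conjecture, formula, 1 if marker == "+" else 0))
    return pairs


# Toy input, not taken from the dataset.
sample = [
    "C fof(t1, conjecture, p(a))",
    "+ fof(a1, axiom, q(a))",
    "- fof(a2, axiom, r(b))",
]
pairs = parse_examples(sample)
```

Each conjecture is paired with every premise listed under it, which is how the 522,528 examples arise from the 32,524 conjectures.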

3 Distributed representation of formulae signatures

First, we limit the information obtained from each formula to the functor symbols, ignoring variable symbols (since they are essentially arbitrarily chosen characters), brackets, quantifiers, connectives, the equality symbol, etc. We will show that this information is sufficient to obtain accuracy similar to or higher than that obtained by the earlier models. There are 13,217 unique functor symbols across all the formulae in our dataset. Thus, to each of these functor symbols we can assign a unique positive integer smaller than or equal to 13,217 and then, for any formula $P$ in our dataset, we can represent its functional signature by a 13,217-dimensional vector whose $i$th coordinate is equal to the number of occurrences, in the scope of $P$, of the function associated with the integer $i$. But this does not really solve the problem present in previous approaches. Each formula usually contains only a handful of functions, and hence, in this setting, it would be represented by a long and sparse vector.
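The count-vector representation described above can be sketched as follows. This is an illustration only: the regular expression is a simplifying assumption (it treats every identifier starting with a lower-case letter as a functor symbol and expects the formula body without the surrounding `fof(name, role, ...)` wrapper), and all function names are hypothetical. Indices are 0-based here, whereas the text numbers the symbols from 1.

```python
import re
from collections import Counter
from typing import Dict, List

# Assumption: functor symbols start with a lower-case letter, variables with an
# upper-case one; '$'-prefixed TPTP keywords such as $true are skipped.
FUNCTOR = re.compile(r"(?<![\w$])[a-z]\w*")


def functor_counts(formula_body: str) -> Counter:
    """Count the occurrences of each functor symbol in a formula body."""
    return Counter(FUNCTOR.findall(formula_body))


def build_index(formulas: List[str]) -> Dict[str, int]:
    """Assign a unique index to every functor symbol, in alphabetical order."""
    symbols = sorted({s for f in formulas for s in functor_counts(f)})
    return {s: i for i, s in enumerate(symbols)}


def signature_vector(formula_body: str, index: Dict[str, int]) -> List[int]:
    """Sparse count vector: coordinate i holds the number of occurrences of symbol i."""
    vec = [0] * len(index)
    for symbol, count in functor_counts(formula_body).items():
        vec[index[symbol]] = count
    return vec


# Toy formulae, not taken from the dataset.
formulas = ["! [A] : (p(A) => q(f(A), c))", "! [B] : (q(B, c) | q(c, B))"]
index = build_index(formulas)  # alphabetical order: c, f, p, q
vectors = [signature_vector(f, index) for f in formulas]
```

With 13,217 symbols instead of four, almost every coordinate of such a vector is zero, which is the sparsity problem discussed in the text.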

This phenomenon is very common, especially in natural language processing, and it is known as the curse of dimensionality. It can be solved by a distributed representation of features, and there are several algorithms which can efficiently create such a representation, for example the neural probabilistic language model from [4] or the t-SNE technique from [21]. However, these methods are normally applied to textual (and hence temporal) data, and rely on the concept of a context, which is not defined for formulae signatures. We must therefore modify it to suit our setting.


Let $B$ be a finite set of real, linearly independent vectors, and let $X$ and $Y$ be real vector spaces, with $B \subset X$, which we will call input and output, respectively, and let $f \colon X \to Y$. Suppose that we know the values $f(x)$ for some (or for all) arguments $x \in X$ but we do not explicitly know what $f$ is. The essence of machine learning is to determine $f$ or to find its approximation, and consequently, to find the values $f(x)$ which were previously unknown. Usually $f$ cannot be (easily) represented algebraically, but we can find a good approximation of $f$ as a composition of simpler functions. This, in turn, is the essence of neural network methods.

Suppose we have a task $T$, to approximate a function $f \colon X \to Y$, and we do it by a neural network $N$, which can be represented as a composition of functions:

$$f \approx N = h_l \circ h_{l-1} \circ \dots \circ h_1,$$

where $\approx$ denotes approximate equality with respect to some fixed cost function on $Y$, and $h_1, \dots, h_l$ are called hidden layers (and they often are composite functions themselves).

If the network performs well after some training, we may assume that the first $k < l$ layers preserve and pass on to the latter layers some crucial information about the input set, needed to complete task $T$. Thus, we may fix the parameters of $h_1, \dots, h_k$ and only train those of $h_{k+1}, \dots, h_l$, regarding $(h_k \circ \dots \circ h_1)(X)$ as the input set of a new neural network $N' = h_l \circ \dots \circ h_{k+1}$.

Now, let $\tilde{X} = \operatorname{span}(B)$ and let $\tilde{Y}$ be some (other) real vector space. Assume that we have a new task $\tilde{T}$, to approximate a function $\tilde{f} \colon \tilde{X} \to \tilde{Y}$. If we decide to solve task $\tilde{T}$ using neural networks, we need to remember that there is a positive relationship between the number of parameters of the network and the dimensionality of $\tilde{X}$. If the latter is big, then we must either choose a simpler neural network architecture (potentially damaging its accuracy) or devote more time and hardware resources to the training process, which is not always possible. In practice this is bypassed by dimensionality-reducing data preprocessing, by training only the top layers of the network in the later phases of the training, or by loading pretrained layers (from some other tasks), fixing them as the initial layers of our network model, and only training the layers on top of them. The pragmatic motivation for obtaining a lower dimensional embedding of the input space, as well as the advantages of obtaining it either during the main training process or beforehand, as a separate learning task, are described in the context of image classification and natural language processing in, for example, [9].

Given that $B$ forms a basis for $\tilde{X}$, we may solve a simplified version of task $\tilde{T}$ in the following way. Every element $x \in \tilde{X}$ can be represented as

$$x = \sum_{i} a_i b_i$$

for some constants $a_i \in \mathbb{R}$ and distinct vectors $b_i \in B$. We use the pretrained layers to define

$$e(x) = \sum_{i} a_i \, (h_k \circ \dots \circ h_1)(b_i)$$

for all $x \in \tilde{X}$. Then we can approximate $\tilde{f}$ with a neural network $\tilde{N}$, whose input space is given by $e(\tilde{X})$, subject to the constraint

$$\tilde{N}(e(x)) \approx \tilde{f}(x).$$

Although we are still approximating $\tilde{f}$, if

$$\dim e(\tilde{X}) < \dim \tilde{X},$$

then this embedding will reduce the dimensionality of the input for a neural network, allowing for a more robust architecture of the network $\tilde{N}$, as compared to networks using $\tilde{X}$ as the input. We can also experiment with several different network architectures, without having to obtain a new, lower dimensional embedding each time. And since $B$ is a basis for $\tilde{X}$, training the layers $h_1, \dots, h_k$ on it will be faster than training all layers ($h_1, \dots, h_l$) on a training set from $\tilde{X}$ before freezing the first $k$ of them. That is, provided that the cardinality of $B$ is smaller than the cardinality of this training set.

In natural language processing we usually start with a vocabulary (i.e. a list of words) $V$. It can be represented as the canonical basis for $\mathbb{R}^n$, where $n$ is the number of words in $V$, that is the set of $n$-dimensional vectors $e_i$ with all entries equal to 0 but for the $i$th entry, which is equal to 1, where $i$ is the index of the word in $V$ which corresponds to $e_i$. Since such vocabularies are normally immensely large, before any language processing task it is good to find a lower dimensional, dense representation of $V$. It is usually done by extracting features from the temporal context of each word. If we want to mimic the same strategy for functional signatures of logical expressions, we must first define what a context is in this setting, since a functional signature, unlike a sequence of words, is not a temporal object.

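First, note that if $P$ is a premise, then we can represent its functional signature by

$$s(P) = \sum_{i=1}^{n} a_i e_i,$$

where $n$ is the total number of unique function symbols across some corpus of premises’ functional signatures which contains $P$, $e_i$ is, again, the unit vector corresponding to the $i$th function, and $a_i$ is the number of occurrences of this function in the scope of $P$. Now, let $\varphi_i$ and $\varphi_j$ be the functions corresponding to $e_i$ and $e_j$ respectively, for some indices $i$ and $j$. If there exists a premise $P$ such that both $\varphi_i$ and $\varphi_j$ occur in the scope of $P$, then we say that $\varphi_j$ is in the context of $\varphi_i$. We may represent the frequency distribution of functions in the context of $\varphi_i$ by a vector $c_i \in \mathbb{R}^n$, whose $j$th coordinate records how often $\varphi_j$ occurs in the context of $\varphi_i$.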
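One possible way to compute such contextual distributions from count vectors is sketched below. The exact formula is not recoverable from the text here, so the details are assumptions of this sketch: every co-occurring symbol is weighted by its number of occurrences, a symbol is not counted in its own context, and each distribution is normalised to sum to 1 (which suits a softmax output). The function name is hypothetical.

```python
from typing import List


def context_distributions(signatures: List[List[int]]) -> List[List[float]]:
    """For every symbol i, accumulate the counts of the symbols that share a
    premise with it, then normalise each row into a probability distribution."""
    n = len(signatures[0])
    totals = [[0.0] * n for _ in range(n)]
    for sig in signatures:
        present = [i for i, a in enumerate(sig) if a > 0]
        for i in present:
            for j in present:
                if j != i:  # assumption: a symbol is not in its own context
                    totals[i][j] += sig[j]
    distributions = []
    for row in totals:
        total = sum(row)
        # A symbol that never co-occurs with another one keeps an all-zero row.
        distributions.append([v / total for v in row] if total > 0 else row)
    return distributions


# Toy signatures over four symbols (columns: c, f, p, q).
signatures = [[1, 1, 1, 1], [2, 0, 0, 2]]
contexts = context_distributions(signatures)
```

In this toy corpus the symbol `c` shares both premises with `q`, so `q` dominates the context of `c`.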

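We want to approximate the map sending each one-hot vector $e_i$ to the frequency distribution of functions in the context of the $i$th function by a neural network with two hidden layers:

$$N(x) = \sigma_2\big(\sigma_1(x W_1 + b_1)\, W_2 + b_2\big), \tag{1}$$

where $W_1$ and $W_2$ are $n \times m$ and $m \times n$ matrices, $b_1$ and $b_2$ are $m$ and $n$ dimensional vectors, $\sigma_1$ and $\sigma_2$ are activation functions applied elementwise, and $m < n$. After training this network, we may use this new, lower dimensional representation for the functional signature $\sum_{i} a_i e_i$ of a premise $P$:

$$\varepsilon(P) = \sum_{i=1}^{n} a_i \, \sigma_1(e_i W_1 + b_1). \tag{2}$$

In the case when $\sigma_1$ is an invertible function, we may even use:

$$\varepsilon(P) = \sum_{i=1}^{n} a_i \, (e_i W_1 + b_1). \tag{3}$$

This is because approximating a function $g$ is equivalent to approximating $\sigma_1^{-1} \circ g$, whenever $\sigma_1^{-1}$ is the inverse of $\sigma_1$.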
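The embedding step described above amounts to a weighted sum of per-symbol hidden-layer outputs. The following plain-Python sketch uses toy weights (3 symbols, 2 hidden units); the function names, the `apply_tanh` switch and all numbers are illustrative assumptions, and a trained first-layer weight matrix would be used in practice.

```python
import math
from typing import List


def hidden_preactivation(i: int, W1: List[List[float]], b1: List[float]) -> List[float]:
    """e_i W1 + b1 for the one-hot vector e_i, i.e. row i of W1 plus the bias."""
    return [w + b for w, b in zip(W1[i], b1)]


def embed_signature(counts: List[int], W1: List[List[float]], b1: List[float],
                    apply_tanh: bool = True) -> List[float]:
    """Weighted sum of per-symbol hidden representations.

    With apply_tanh=False the activation is skipped, which corresponds to the
    variant available when the activation function is invertible.
    """
    out = [0.0] * len(b1)
    for i, a in enumerate(counts):
        if a == 0:
            continue  # symbols absent from the premise contribute nothing
        z = hidden_preactivation(i, W1, b1)
        h = [math.tanh(v) for v in z] if apply_tanh else z
        out = [o + a * v for o, v in zip(out, h)]
    return out


# Toy first-layer parameters: n = 3 symbols, m = 2 hidden units.
W1 = [[0.5, -1.0], [1.0, 0.0], [-0.5, 2.0]]
b1 = [0.0, 0.0]
counts = [2, 0, 1]  # symbol 0 occurs twice, symbol 2 once
linear_embedding = embed_signature(counts, W1, b1, apply_tanh=False)
tanh_embedding = embed_signature(counts, W1, b1, apply_tanh=True)
```

Because only the rows of `W1` belonging to symbols present in the premise are touched, the embedding of a sparse signature is cheap to compute.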

In our experimental setting, $B$ is the set of all one-hot encoded functor symbols ($n$ = 13,217), the hidden layer has $m = 256$ units, the first activation function $\sigma_1$ is the hyperbolic tangent function, and the output activation function $\sigma_2$ is the softmax function. We use the Keras library [8] to create neural network models for the dimensionality reduced embeddings, as well as the premise selection model in the next section. We initialise the weight matrices using He uniform initialisation [14]. We train the network on the 13,217-dimensional identity matrix, taking batches of 4096 training examples, for 150 epochs, using the RMSprop algorithm [15] with decay of the learning rate.

Usually, we train a model on some set of training data, so that we can produce an estimate of some unknown function, which can later be used to predict values of this function for data points which were previously unavailable. Here, we know the contextual distribution for all the functor symbols, and hence all the values of the function we are approximating, and we use the network model simply to find a less complex approximation of it. For that reason we do not split the data into training and validation sets (also, doing so would effectively exclude some parameters from training, as the set $B$ is linearly independent). And since we want to approximate this function as accurately as possible, given the fixed number of network parameters, over-fitting is not discouraged. We do, however, shuffle the data after every epoch, to allow for more distributed features in the lower dimensional representation of our dataset. The accuracy of this network, trained with respect to categorical cross-entropy, reaches 84% after the training.

Since the set of all functional signatures contains only linear combinations of one-hot encoded representations of functor symbols (the set $B$), and because the hyperbolic tangent is an invertible function, we can use (3) as the lower dimensional representation for functional signatures of premises in our dataset to develop a premise selection model in the next section.

Alternatively, we could have used autoencoders (see [16, 17]), with the same network architecture, to obtain a lower dimensional representation of functional signatures directly. That is, we would want to find tensors $W_1$, $W_2$, $b_1$ and $b_2$ (with the same shapes as above) such that

$$\sigma_2\big(\sigma_1(s(P)\, W_1 + b_1)\, W_2 + b_2\big) \approx s(P),$$

where $s(P)$ denotes the functional signature of $P$, for all premises $P$ in our dataset. This naïve approach saves us the trouble of finding contextual distributions for all the symbols, and normally it would be a more natural choice of a dimensionality reduction technique. However, empirically, the premise selection models presented in the next section are less accurate if they use this representation of data. Nonetheless, we have included the implementation of this alternative approach in [20], should an interested reader wish to verify this.

4 Premise selection model

From the set of 32,524 conjectures and 69,918 axioms, we form a set of 522,528 positive and negative examples, by concatenating the new 256-dimensional signatures of corresponding conjectures and axioms. The resulting set may be represented as a $522{,}528 \times 512$ matrix (which is considerably smaller than the $522{,}528 \times 26{,}434$ representation of full functional signatures that we would have had if the dimensionality reduction had not been applied). From this matrix we randomly select 470,275 rows (90%) for training, and use the remaining 52,253 to form a test set (10%). We standardise the data in the usual way, by computing the mean and standard deviation along each column of the training set, subtracting this mean from each corresponding column, and dividing the result by the standard deviation.
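A minimal sketch of this preprocessing step, in plain Python on toy numbers, is given below. Applying the training-set statistics to the test set, using the population standard deviation, and the small `eps` guard against constant columns are assumptions of this sketch; the text only specifies that the statistics are computed on the training set. The function names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def column_stats(train: Matrix) -> Tuple[List[float], List[float]]:
    """Mean and (population) standard deviation of every column of the training set."""
    n_rows = len(train)
    columns = list(zip(*train))
    means = [sum(col) / n_rows for col in columns]
    stds = [
        math.sqrt(sum((v - mu) ** 2 for v in col) / n_rows)
        for col, mu in zip(columns, means)
    ]
    return means, stds


def standardise(data: Matrix, means: List[float], stds: List[float],
                eps: float = 1e-8) -> Matrix:
    """Subtract the column mean and divide by the column standard deviation."""
    return [
        [(v - mu) / (sd + eps) for v, mu, sd in zip(row, means, stds)]
        for row in data
    ]


# Toy data standing in for the 512-column training and test matrices.
train = [[1.0, 10.0], [3.0, 30.0]]
test = [[2.0, 20.0]]
means, stds = column_stats(train)
train_std = standardise(train, means, stds)
test_std = standardise(test, means, stds)  # reuses the training statistics
```

Reusing the training statistics for the test rows keeps the test set from influencing the preprocessing.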

We develop several variants of neural networks with two hidden densely connected layers. The first layer has 64, 128, 256, 512 or 1024 output units, and the number of output units of the second layer lies in the same range, provided that it is never bigger than the number of output units of the first layer. The activation functions for these layers are rectified linear units (ReLU). Each of these hidden layers is followed by a dropout layer, with a dropout rate of 0.5, to reduce the overfitting of the model. The output layer activation function is the logistic sigmoid function, returning the predicted probability that the tested premise is relevant for some proof of the tested conjecture. During the development stage, we also extract 10% of the training data for validation of the models. The models are trained for up to 1500 epochs, on batches of 4096 examples, using the Adam optimiser [19] with respect to the logistic loss function. The training data is shuffled after each epoch. The test results are presented in the table below.

     Layer 2 \ Layer 1     64         128        256        512        1024
     64      loss          0.5418     0.5295     0.5173     0.5292     0.5687
             accuracy      72.21%     72.75%     73.52%     74.57%     75.19%
             # of param.   37,057     73,985     147,841    295,553    590,977
     128     loss                     0.5315     0.5158     0.5224     0.5523
             accuracy                 72.94%     73.73%     74.49%     75.47%
             # of param.              82,305     164,353    328,449    656,641
     256     loss                                0.5195     0.5135     0.5347
             accuracy                            73.63%     74.19%     75.32%
             # of param.                         197,377    394,241    787,969
     512     loss                                           0.5095     0.5166
             accuracy                                       74.58%     75.34%
             # of param.                                    525,825    1,050,625
     1024    loss                                                      0.5024
             accuracy                                                  75.76%
             # of param.                                               1,575,937
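The parameter counts in the table follow from the layer sizes alone: a dense layer with $a$ inputs and $b$ output units has $ab + b$ trainable parameters, and dropout layers add none. The sketch below assumes a 512-dimensional input and a single sigmoid output unit, as described in the text; the function name is illustrative.

```python
from typing import Dict, Tuple


def parameter_count(input_dim: int, layer1: int, layer2: int) -> int:
    """Trainable parameters of a network input -> layer1 -> layer2 -> 1 output."""
    first = input_dim * layer1 + layer1   # weights + biases of the first hidden layer
    second = layer1 * layer2 + layer2     # weights + biases of the second hidden layer
    output = layer2 + 1                   # single sigmoid output unit
    return first + second + output


sizes = [64, 128, 256, 512, 1024]
# All admissible (layer 1, layer 2) combinations: layer 2 never exceeds layer 1.
counts: Dict[Tuple[int, int], int] = {
    (l1, l2): parameter_count(512, l1, l2)
    for l1 in sizes
    for l2 in sizes
    if l2 <= l1
}
```

For instance, `counts[(64, 64)]` gives 37,057 and `counts[(1024, 1024)]` gives 1,575,937, the smallest and largest models in the table.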

As we can see above, the $64 \times 64$ model (64 output units for each of the hidden layers) reaches the lowest accuracy out of these fifteen models. But it does so with comparatively few trainable parameters, which means that it can be trained in a short time and it makes predictions quickly. The $1024 \times 1024$ model has the highest accuracy, but it requires the largest number of parameters, so naturally it is significantly slower. Furthermore, it is also more prone to overfitting, which can be seen in the graphs below.

Given the trade-off between the number of parameters (and hence the computation time) and the accuracy of the model, one should choose the most suitable model carefully. We chose four of the models and trained them again, this time for 2500 epochs and without extracting any validation data. The results are presented below.

loss 0.5385 0.5194 0.5127 0.4895
accuracy 72.14% 73.74% 74.73% 76.45%
false negatives 13.5% 12.35% 9.32% 11.0%

5 Conclusion and discussion

It is clear that, thanks to dimensionality reduction, we can create a neural network model that performs the premise selection task very swiftly and with relatively high accuracy. Nevertheless, it seems that, in different applications, deep learning achieves even better results. So we could ask a question: how could the above approach be improved? First of all, we need to realise that the choice of negative examples may have influenced the performance negatively. Machine learning algorithms generally require an equal number of positive and negative examples, so that the model is not biased towards predicting one more often than the other. But while producing positive examples is trivial (provided that we have a valid proof of the conjecture), and we empirically see that the algorithm seldom misclassifies positive examples as negatives (see the table above), the same cannot be said about negative examples. The fact that we have no proof of a given conjecture which relies on some axiom does not imply that there exists no proof depending on this axiom. Obviously, we could include, as negative examples, axioms from a completely different theory, ensuring that they are almost certainly useless. But this only weakens our model, as positive and negative examples should be of a similar nature, so that the model can focus on those features which really decide whether a given axiom is useful or not. So far there does not seem to be any good solution to this dilemma.

Another problem is that, when we focus solely on the functional signatures of premises, we completely ignore the logical structure of the statements, and hence the relations between functions. This does not happen if we use a character-level representation (or even a word-level representation with tokens like brackets also treated as words). But for the neural network to clearly identify these relations, it would have to be very deep, and hence computationally inefficient, given that the input is also high-dimensional in this setting. Another way of dealing with this issue is to substitute the statements with graphs, where vertices represent objects and edges the relations between them. But this setting also requires complicated neural networks, which hampers performance.

The dimensionality reduction that we adopted in this paper is a wonderful tool, which allows us to greatly decrease the time required to make predictions, but it can also mean the loss of essential information, often required to make these predictions. In natural language processing it is very likely that the blank space in the statement

This is a glass of an orange _____.

ought to be filled with the word ‘juice’, indicating that words can often easily be deduced just from the context, and that the loss of information when switching from one-hot to context embeddings is negligible. If we also have a sentence

This is a glass of an apple _____.

then it will probably be filled with the same word. So ‘orange’ and ‘apple’ will have similar context embeddings, without explicitly telling the computer that they are fruits. Whether or not a similar phenomenon occurs in functional signatures is debatable. Perhaps more natural embeddings will emerge in the future.

And finally, having discussed the issues with the input data, let us deliberate on the network architecture. Let us start by emphasising that using convolutional or recurrent networks is unsuitable in this setting. The ordering of functions inside the functional signature (and thus also inside the lower dimensional embedding) is arbitrary (we simply used the alphabetical order), so there is no theoretical justification for the use of convolutional neural networks, as their purpose is to identify local patterns between neighbouring objects. In practice, too, their performance appears to be inferior to that of fully connected networks for this task, when trained and tested on the same data. Temporal architectures are unsuitable for a similar reason, i.e. there is no clear temporal ordering of the functional signatures. It is possible, however, to slightly improve the performance of the model by including more hidden densely connected layers. But this, while decreasing the training time and increasing the accuracy, also increases overfitting and the prediction time, making their introduction counterproductive.

6 Acknowledgement

The authors of this article would like to thank the UK Engineering and Physical Sciences Research Council (EPSRC) and the School of Computer Science at the University of Manchester for their financial support.