1 Introduction
Automated reasoning has long been considered to require development of logics and “hard” algorithms, such as backtracking search. Recently, approaches that employ deep learning have also been applied, but these have focused on predicting the next step of a proof, which is again executed with a hard algorithm [17, 8, 16, 2].
We raise the question whether hard algorithms could be omitted from this process and mathematical reasoning performed entirely in the latent space. To this end, we investigate whether we can predict useful latent representations of the mathematical formulas that result from proof steps. Ideally, we could rely entirely on predicted latent representations to sketch out proofs and only go back to the concrete mathematical formulas to check if our intuitive reasoning was correct. This would allow for more flexible and robust system designs for automated reasoning.
In this work, we present a first experiment indicating directly that theorem proving in the latent space might be possible. We build on HOList, an environment and benchmark for automated theorem provers based on deep learning [2], which makes use of HOL Light [12], an interactive proof assistant. The HOList theorem database comprises over 19 thousand theorems and lemmas from a variety of mathematical domains, including topology, multivariate calculus, real and complex analysis, geometric algebra, and measure theory. Concrete examples include basic properties of real and complex numbers such as
, and also well-known theorems, such as Pythagoras’ theorem, Skolem’s theorem, the fundamental theorem of calculus, Abel’s theorem for complex power series, and the fact that the eigenvalues of a complex matrix are the roots of its characteristic polynomial.
We focus on rewrite rules (or rewrites for short). Rewrites are only one of several proof tactics in HOL Light, but they enable powerful transformations on mathematical formulas, as they can be given arbitrary theorems as parameters. For example, the formula can be rewritten to by performing a rewrite with the parameter . Alternatively, a rewrite may diverge (as it operates recursively) or it may return the same formula – in both these cases we consider the rewrite to fail. For instance, in the example above, the rewrite would fail if we used equation as a rewrite parameter instead, since the expression does not contain any operators to match with.
In our experiments, we first train a neural network to map mathematical formulas into a latent space of fixed dimension. This network is trained by predicting, based on the latent representation being trained, whether a given rewrite is going to succeed (i.e., return a new formula). For successful rewrites we also predict the latent representation of the resulting formula. To evaluate the feasibility of reasoning in latent space over two steps, we first predict the latent representation of the result of a rewrite, and then evaluate whether the predicted latent representation still allows for accurate predictions of the rewrite success of the resulting formula. For multi-step reasoning beyond two steps, we predict the future latent representations based on the previous latent representation only, without seeing the intermediate formula. Our experiments suggest that even after 4 steps of reasoning purely in latent space, neural networks show nontrivial reasoning capabilities, despite not being trained on this task directly.
2 Related Work
Our work is motivated by deep-learning-based automated theorem proving, but is also closely connected to model-based reinforcement learning and approaches that learn to predict the future as part of reinforcement learning.
Model-based reinforcement learning is concerned with creating a model of the environment while maximizing the expected reward (e.g. [10]). Early works have already shown that predicting the latent representations of reinforcement learning environments with deep learning is sometimes feasible, even over many steps [18, 5]. This can enable faster training, since it can remove the need for performing expensive simulations of the environment. Predicting latent representations was also proposed in [4] as a regularization method for reinforcement learning.
One recent successful example of model-based reinforcement learning is [14], where the system learns to predict the pixel-wise output of the Atari machine. However, this approach is based on actually simulating the environment directly in the “pixel space” as opposed to performing predictions in a low-dimensional semantic embedding space. More related to our work is [20], which attempts to learn to rewrite simple formulas. There, the goal is again to predict the actual outcome of the rewrite rather than a latent representation of it. The authors of [7] predict “expected measurements” as an extra supervision signal in addition to the reward signal.
3 HOL Light
HOL Light [12] is an interactive proof assistant (or interactive theorem prover) for higher-order logic reasoning. Traditionally, proof assistants have been used by human users for creating formalized proofs of mathematical statements manually. Although they come with limited forms of automation, it is still a cumbersome process to formalize proofs, even when they are already available in natural language. Some large-scale formalization efforts were conducted successfully in HOL Light and Coq [6], for example the formal proofs of the Kepler conjecture [11] and of the four color theorem [9]. They required significant meticulous manual work and expert knowledge of the system.
Lately, there have been several attempts to improve the automation of proof assistants significantly by so-called “hammers” [15]. Still, traditional proof automation lacks the mathematical intuition of human mathematicians, who can perform complicated intuitive arguments. The quest for modelling and automating fuzzy, “human style” reasoning is one of the main motivations for this work.
3.1 Rewrite Tactic in HOL Light
The HOL Light system allows the user to specify a goal to prove, and then offers a number of tactics to apply to the goal. A tactic application consumes the goal and returns a list of subgoals. Proving all of the subgoals is equivalent to proving the goal itself. Accordingly, if a tactic application returns the empty list of subgoals, the parent goal is proven.
In this work, we focus on the rewrite tactic () of HOL Light, which is a particularly common and versatile tactic. It takes a list of theorems as parameters (though in this work we only consider applications of rewrite with a single parameter). Each parameter must be an equation or a conjunction of equations, possibly guarded by a condition. Given a goal statement and parameter , the rewrite tactic searches for subexpressions in that match the left side of one of the equations in and replaces them with the right side of that equation. The matching process takes care of variable names and types, such that minor differences can be bridged. The rewrite procedure is recursive, and hence tries to rewrite the formula until no opportunities for rewrites are left. The rewrite tactic also has a set of built-in rewrite rules, representing trivial simplifications, such as . Note that uses “big step semantics”, meaning that the application of each individual operation can perform multiple elementary rewrite steps recursively. For more details on , refer to the manual [13].
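The behaviour described above (match the left side of an equation, substitute the right side, repeat until nothing changes, and treat "unchanged" or "runs too long" as failure) can be illustrated with a toy term rewriter. This is only a sketch under simplifying assumptions: terms are nested tuples, pattern variables are strings starting with `?`, there are no types or conditions, and the names `match`, `substitute`, `rewrite_once`, `rewrite` and the `max_steps` budget are hypothetical. It is not HOL Light's rewrite tactic.

```python
def match(pattern, term, binding):
    """Extend `binding` so that `pattern` instantiates to `term`, or return None."""
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in binding:
            return binding if binding[pattern] == term else None
        return {**binding, pattern: term}
    if isinstance(pattern, str) or isinstance(term, str):
        return binding if pattern == term else None
    if len(pattern) != len(term):
        return None
    for p, t in zip(pattern, term):
        binding = match(p, t, binding)
        if binding is None:
            return None
    return binding


def substitute(term, binding):
    """Replace pattern variables in `term` by their bound subterms."""
    if isinstance(term, str):
        return binding.get(term, term)
    return tuple(substitute(t, binding) for t in term)


def rewrite_once(term, lhs, rhs):
    """Rewrite the first (outermost, leftmost) match; return (term, changed)."""
    binding = match(lhs, term, {})
    if binding is not None:
        return substitute(rhs, binding), True
    if isinstance(term, str):
        return term, False
    parts, changed = [], False
    for t in term:
        if not changed:
            t, changed = rewrite_once(t, lhs, rhs)
        parts.append(t)
    return tuple(parts), changed


def rewrite(term, lhs, rhs, max_steps=100):
    """Rewrite until no match is left.

    Returns the new term on success, or None on failure, i.e. when the term
    is unchanged or the step budget is exhausted (our stand-in for divergence).
    """
    current = term
    for _ in range(max_steps):
        nxt, changed = rewrite_once(current, lhs, rhs)
        if not changed:
            return current if current != term else None
        current = nxt
    return None
```

For example, rewriting `("+", ("+", "a", "0"), "0")` with the rule `("+", "?x", "0")` to `"?x"` takes two elementary steps and yields `"a"`, whereas the same rule applied to `("*", "a", "b")` leaves the term unchanged and is therefore reported as a failure.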
4 Reasoning in Latent Space
We embed higher-order logic statements into a fixed-dimensional embedding space by applying a graph neural network to a suitably chosen graph representation of the corresponding formula. The embedding is trained on predicting the outcome (success or failure) of a large number of possible formula rewrite operations. Note that formulas can be quite complex as they are arbitrary typed lambda expressions in higher-order logic.
For technical reasons, we will distinguish between two latent embedding spaces and () corresponding to two distinct embeddings for each formula, learned by two different networks.
We have trained three different models. denotes the set of syntactically correct higher-order logic formulas in HOL Light.

Rewrite success prediction ,

Rewrite outcome prediction ,

Embedding alignment prediction .
These networks and their purposes are described in detail in later subsections. Alternatively, we could use a single fixed embedding space with a single model predicting its own future embedding on the rewritten statement. That network could be trained end-to-end and reach better performance without the need of aligning the embedding spaces, removing the need for . Here, we opted for a more controlled setup that relies on a fixed embedding network , trained for the sole task of predicting whether rewriting statement by is successful. This way we can rely on a fixed embedding method and run more detailed ablation analyses. Merging and is left for future work.
4.1 Training Data
We start with the theorem database of the HOList environment [2], which contains 19591 theorems, approximately ordered by increasing complexity. This is split into 11655 training, 3668 validation, and 3620 testing theorems. To generate our training data, we generate all pairs of theorems from the training set, where must occur before in the database (to avoid circular dependencies). We then interpret theorem as a goal to prove and try to rewrite with using the of HOL Light. This can result in three different outcomes:

is rewritten by theorem successfully, and the result differs from .

The rewrite operation terminates, but fails to change the input theorem .

The rewrite operation times out or diverges (becomes too big).
In our experiments, we consider only the first outcome a successful rewrite attempt, i.e. the case in which the application finishes within the specified time limit and changes the target. Each training example, therefore, consists of the pair , the success/fail bit of the rewrite (1 for successful rewrites, 0 for failed rewrites), and, for successful tactic applications, the new formula that results from the rewrite, which we denote with .
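The data generation just described can be sketched as a loop over ordered theorem pairs. The sketch makes several assumptions that are not taken from the text: theorems are plain strings, `try_rewrite` is a hypothetical callable standing in for the actual call into the prover (returning the rewritten formula, or `None` on timeout or divergence), and it is the rewrite parameter that has to occur earlier in the database than the goal.

```python
from typing import Callable, List, NamedTuple, Optional


class Example(NamedTuple):
    goal: str               # theorem interpreted as the goal to rewrite
    param: str              # theorem used as the rewrite parameter
    success: int            # 1 for a successful rewrite, 0 otherwise
    result: Optional[str]   # rewritten formula, only for successful rewrites


def build_examples(
    theorems: List[str],
    try_rewrite: Callable[[str, str], Optional[str]],
) -> List[Example]:
    """Label every ordered pair (goal, earlier theorem) with its rewrite outcome."""
    examples = []
    for i, goal in enumerate(theorems):
        for param in theorems[:i]:  # assumption: parameter precedes the goal
            result = try_rewrite(goal, param)
            # Only "terminated and changed the goal" counts as success;
            # "unchanged" and "timed out / diverged" are both failures.
            ok = result is not None and result != goal
            examples.append(Example(goal, param, int(ok), result if ok else None))
    return examples
```

Collapsing the two failure outcomes into a single label matches the binary success/fail bit used for training; a variant that keeps them apart would only need a third label value.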
4.2 Base Model Architecture and Training Methodology
The rewrite success prediction model is trained on the training set of theorems in the HOList benchmark. The training task is to predict the success or failure of the application.
We use a two-tower network (without weight sharing) with embedding towers and , one for each of the two formulas and . Both towers are graph neural networks as described in [19], and each embeds the supplied formula in a fixed-dimensional embedding space
. The concatenated embeddings are then processed by a three-layer perceptron
with rectified linear activation, followed by a single linear output function that predicts the logit. The model is trained by logistic regression on the success/fail bit of the rewrite. Formally:
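The architecture described in this subsection can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the trained model: the graph neural network towers are replaced by a hypothetical bag-of-characters function `toy_tower`, and the class name `SuccessModel`, the layer sizes and the random initialization are all made up for the example.

```python
import math
import random


def toy_tower(formula, dim=4):
    """Stand-in for a graph neural network tower: a bag-of-characters vector."""
    v = [0.0] * dim
    for ch in formula:
        v[ord(ch) % dim] += 1.0
    return v


def linear(weights, bias, x):
    """Apply a dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(rng, n_in, n_out):
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class SuccessModel:
    """Two towers (no weight sharing), a three-layer ReLU combiner, one logit."""

    def __init__(self, goal_tower, param_tower, dim, hidden=8, seed=0):
        rng = random.Random(seed)
        self.goal_tower, self.param_tower = goal_tower, param_tower
        sizes = [2 * dim, hidden, hidden, hidden]
        self.layers = [init_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]
        self.out = init_layer(rng, hidden, 1)

    def logit(self, goal, param):
        h = self.goal_tower(goal) + self.param_tower(param)  # list concatenation
        for weights, bias in self.layers:
            h = relu(linear(weights, bias, h))
        return linear(self.out[0], self.out[1], h)[0]


def logistic_loss(logit, label):
    """Numerically stable binary cross-entropy on a logit (label is 0 or 1)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))
```

Training would minimize `logistic_loss(model.logit(goal, param), success)` over the labelled pairs; the optimizer and gradient computation are left out here to keep the sketch short.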
4.3 Outcome Prediction Model
In addition to our base model, we train . This model has the same two-tower embedding architecture as , but with a larger combiner network and an extra prediction layer that predicts the embedding vector of the formula resulting from the rewrite. Here the embedding towers are denoted by and , the combiner network is and the two linear prediction layers are and . That is: .
This model predicts both the success or failure of applying and for successful rewrites, the latent representation of the result. While is trained to predict the success of rewriting by by logistic regression, is trained to predict by minimizing the squared error.
4.4 Embedding Alignment Model
Since and produce latent vectors and in different spaces, we need to align those spaces to enable deduction purely in the embedding space. (Merging and will remove the need for the ; however, in our current setup we keep the embedding and deduction components separate.)
Given an initial statement , we predict the approximate value of , but in order to reason multiple steps in the embedding space alone, without explicitly computing , we need to compute the outcome prediction of . For that, we train a translation model which predicts given an approximation of . Note that does not see as an input; it makes its prediction based on the latent space representation of alone. This allows us to reason multiple steps ahead in the latent space without constructing any of the intermediate formulas explicitly.
4.5 Reasoning
After we have trained our three models on the training set theorems (and theorem pairs) of the HOList benchmark, we can use them to perform rewrites in the latent space alone.
We use as a quality metric for the propagated embedding vector. was trained for predicting whether theorem rewrites . Given an approximation of the latent representation , we can evaluate defined by for a large number of tactic parameters . This is compared with true rewrite successes of by to assess the quality of the approximation.
To evaluate multiple steps of reasoning in the latent space, we start with formula and rewrite by theorems in that order. For reasoning in the latent space, we only use approximate embedding vectors of the resulting formulas. To assess the quality of the overall reasoning, the same sequence of rewrites is performed with the actual formulas, and the final approximate embedding is evaluated with respect to the formula resulting from the sequence of formal rewrites.
In latent space we start from some initial theorem . The following schema indicates the sequence of predictions performed in latent spaces and :
This way, we approximate the following sequence of deductions in the latent space alone, without producing any intermediate formulas. This is compared with the following formal sequence of rewrites:
That is, approximates the latent vector of , that is and approximates the latent vector of , that is , by construction.
By a slight abuse of notation we will refer to the operation of one step of approximate deduction in the latent space by , as it is composition and a subnetwork of .
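The multi-step procedure of this subsection amounts to repeatedly composing the alignment model with the embedding-predicting part of the outcome model, and then scoring candidate parameters on the final approximate embedding. The sketch below shows only this control flow. All names are hypothetical: `align` stands for the alignment model, `outcome_head` for the subnetwork of the outcome model that maps an aligned goal embedding and a parameter to the predicted embedding of the result, and `success_head` for the combiner of the success model applied to an already computed goal embedding. The toy functions at the end are arithmetic placeholders with no learned content.

```python
def latent_rollout(e0, params, align, outcome_head):
    """Propagate an embedding through a sequence of rewrites in latent space.

    No intermediate formula is constructed: each step only sees the previous
    (approximate) embedding and the next rewrite parameter.
    """
    trajectory = [e0]
    for p in params:
        trajectory.append(outcome_head(align(trajectory[-1]), p))
    return trajectory


def score_candidates(embedding, candidates, success_head):
    """Rewrite-success scores of each candidate parameter for one embedding."""
    return [success_head(embedding, c) for c in candidates]


# Toy placeholders (assumptions for illustration only).
def toy_align(e):
    return [2.0 * x for x in e]


def toy_outcome_head(z, p):
    return [x + p for x in z]


def toy_success_head(e, c):
    return sum(e) - c
```

With these placeholders, `latent_rollout([1.0], [1.0, 2.0], toy_align, toy_outcome_head)` produces the trajectory `[[1.0], [3.0], [8.0]]`. In the evaluation, the scores computed from the last element of such a trajectory are compared against the true rewrite successes of the formula obtained by performing the same rewrites formally.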
5 Experiments
This section provides an experimental evaluation that aims to answer the following question: Is this setup capable of predicting embedding vectors multiple steps ahead? We explore the prediction quality of the embedding vectors (for rewrite success) and see how the quality of predicted embedding vectors degrades after is used for predicting multiple steps in the latent space alone.
5.1 Neural Network Architecture Details
Our networks and both have two towers, which are hop graph neural networks with internal node representations of dimensions. The output of each of the two towers is fed into a layer that expands the dimension of the node representation to with a fully connected layer with shared weights for each node. This is followed by maximum pooling over all nodes of the network. The two resulting embedding vectors are concatenated along with their element-wise product, and are processed by a three-layer perceptron with rectified linear activation between the layers.
The same architecture is used for both and , but the two networks do not share weights. Also, has larger layers in its combiner network than in , ( units each in vs. units in the layers of ). This was necessary for producing good quality predictions of the embedding vector of the outcome of the rewrite, but unnecessary for predicting the rewrite success alone.
and are trained with groups of instances in each batch: one successful example in each group and random negatives. However, all other instances in other groups are used as negative instances for each goal as well, and they are considered negative regardless of whether they would rewrite it – this is justified by the fact that only a few theorems rewrite any given , so this introduces only a small amount of uncorrelated label noise. This training methodology is motivated by the fact that evaluating the combiner network is much cheaper than computing the embedding using the graph neural network. Based on [1], we expect that hard negative mining would improve our results significantly, but it is left for future work.
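The batching scheme above can be made concrete by building the label matrix that pairs every goal in a batch with every parameter in the batch. This is a sketch with hypothetical names (`batch_labels`, and groups given as `(goal, positive_param, negative_params)` triples); the group and batch sizes used in the experiments are not reproduced here.

```python
def batch_labels(groups):
    """Build goals, the pooled parameter list, and a goal-by-parameter label matrix.

    groups: list of (goal, positive_param, [negative_params]).
    Each goal is paired with every parameter in the batch; only its own
    positive is labelled 1, everything else (including the positives and
    negatives of other groups) is labelled 0.
    """
    params, positive_index = [], []
    for _, pos, negs in groups:
        positive_index.append(len(params))
        params.append(pos)
        params.extend(negs)
    goals = [goal for goal, _, _ in groups]
    labels = [
        [1 if j == positive_index[i] else 0 for j in range(len(params))]
        for i in range(len(goals))
    ]
    return goals, params, labels
```

With G groups of 1 + k parameters each, the expensive embedding towers run once per goal and once per parameter, G + G(1 + k) times in total, while the cheap combiner runs on all G times G(1 + k) pairs, which is why reusing other groups as negatives adds training signal at little cost.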
5.2 Evaluation Dataset
In order to measure the performance of our models after multiple deduction steps are performed, we generate datasets successively by applying rewrites to a randomly selected set of examples. We start with all theorems from the validation set of HOList, denoted by . We create from by selecting a random subset of statements from the previous step and random tactic parameters to rewrite by for each statement. Formally, is defined by .
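The successive construction of evaluation sets can be sketched as follows. The sketch assumes that only successful rewrites contribute a new statement to the next set, and it uses hypothetical names throughout: `try_rewrite` stands in for the prover call (returning `None` on timeout or divergence), and `n_statements` and `n_params` are placeholder sample sizes, not the values used in the experiments.

```python
import random


def next_dataset(prev, params, try_rewrite, n_statements, n_params, seed=0):
    """Derive the next evaluation set from the previous one by random rewrites."""
    rng = random.Random(seed)
    chosen = rng.sample(prev, min(n_statements, len(prev)))
    out = []
    for statement in chosen:
        for param in rng.sample(params, min(n_params, len(params))):
            result = try_rewrite(statement, param)
            # Assumption: keep only rewrites that terminate and change the statement.
            if result is not None and result != statement:
                out.append(result)
    return out
```

Calling `next_dataset` repeatedly, each time on the output of the previous call, yields one dataset per number of rewrite steps.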
5.3 Evaluation of Rewrite Prediction Model
In order to evaluate the performance of in isolation we need to compare it with carefully selected baselines:

As Figure 2 shows, a few theorems are much more frequently applicable than others. We want to see how the prediction of rewrite success performs based on the rewrite parameter alone if we ignore the theorem to be rewritten. One way to establish such a baseline is to feed a randomly selected theorem to instead of to predict its rewrite success.

A stronger “baseline” is achieved by utilizing the ground truth to make the best prediction possible based on knowing but still being independent of (the theorem to be rewritten). This is the best achievable prediction that does not depend on .

As we have trained and only on pairs of theorems from the original database, the models exhibit increasing generalization error as we evaluate them on formulas that result from an increasing number of rewrites. First, we measure the errors these models make if the theorem for the last step is evaluated directly. This gives an upper bound on the rewrite success prediction by , since noisier embedding vectors end up with worse results on average.

Finally, we want to measure the rate at which the latent vectors degrade as we propagate them in embedding space as described in Subsection 4.5.
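The ground-truth baseline that depends only on the rewrite parameter can be sketched as a per-parameter success rate. The function name `usage_scores` and the `(goal, param, success)` triple format are assumptions made for this illustration.

```python
from collections import Counter


def usage_scores(examples):
    """Score each parameter by the fraction of goals it rewrites successfully.

    examples: iterable of (goal, param, success) with success in {0, 1}.
    The resulting score ignores the goal entirely, so the same ranking of
    parameters is used for every statement.
    """
    tried, succeeded = Counter(), Counter()
    for _, param, success in examples:
        tried[param] += 1
        succeeded[param] += success
    return {param: succeeded[param] / tried[param] for param in tried}
```

Any model that actually uses the statement being rewritten should rank parameters better than this constant ranking, which is what makes it a useful reference point.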
In order to measure the performance of our models after performing a given number of rewrite steps starting from the theorem database, we measure the tactic success prediction quality of using predicted embeddings. To do so, we compute the ROC curve of the predictions and use the area under the curve (AUC) as our main metric. Higher curves and higher AUC values represent more accurate predictions of rewrite success. We measure how ROC curves change as we use different approximations of .
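For reference, the area under the ROC curve equals the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair, with ties counted as one half. The following sketch computes it directly from that definition; the quadratic loop is chosen for clarity rather than speed, and the function name is our own.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A score that is independent of the label gives a value around 0.5 and a perfect ranking gives 1.0, so curves for different approximations of the embedding can be compared on a common scale.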
6 Analysis
Figure 3 shows the distribution of the theorem pair prediction score logits of , for those “positive” pairs that rewrite and the “negative” pairs that do not rewrite. Note that the ratio is normalized separately for the positive and negative pairs, as negative pairs occur much more frequently than positive pairs.
One can see that the quality of the rewrite success prediction degrades significantly after four steps of reasoning purely in the latent space, but it is still much better than the random baseline. This gives clear evidence that the embedding prediction manages to propagate much useful information over multiple steps in the latent space alone.
In Figure 4 we make further measurements and comparisons on the quality of reasoning in the embedding space. On the left-hand side we measure five different metrics. The “True” curve assesses the embedding computed directly from the target theorem . The “Pred (One step)” curve uses the approximate embedding for . That is, we measure the degradation when performing a single step of embedding prediction. The “Pred (Multistep)” curve uses multiple steps of predictions completely in the latent space as described in Subsection 4.5. The “Random Baseline” predicts the rewrite success based on the latent vector of a random statement instead of the correct one. The “Usage Baseline” is based on the constant prediction that ranks the parameters by how probably they rewrite any statement in the theorem database; this prediction is also independent of the theorem being rewritten. One can see that our model could perform reasoning for steps in the embedding space and still retain a lot of the predictive power of the original model.
In order to appreciate the above results, one should keep in mind that none of our models and were trained on statements that were already rewritten. All the training was done only on the theorems present in the initial database. The resulting reduction of prediction performance is apparent from the downward trajectory of the “True” curve, which isolates this effect from that of the error accumulated by the embedding predictions, the effect of which is measured indirectly by the “Pred (Multistep)” curve.
In Figure 5 we measure the distance of the predicted embedding vectors from the true embedding vectors of formulas after multiple rewrite steps in the latent space. These results are consistent with our earlier findings on the success of rewrite prediction after rewrite steps in the latent space: while there is some divergence of the predicted embedding vectors from the true embedding vectors (as computed from the rewritten statements directly), the predicted embedding vectors are significantly closer to the true embedding vectors than randomly selected embeddings.
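The comparison behind this kind of measurement can be sketched as follows. Two assumptions are made that the text does not fix: the distance is taken to be Euclidean, and "randomly selected embeddings" is approximated by pairing each prediction with the true embeddings of all other formulas. The function names are hypothetical.

```python
import math


def l2(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_distances(predicted, true):
    """Return (mean distance to own target, mean distance to other targets).

    predicted[i] is the propagated embedding for formula i and true[i] is the
    embedding computed from the actually rewritten formula. Needs at least
    two formulas so that a mismatched pairing exists.
    """
    n = len(true)
    matched = sum(l2(p, t) for p, t in zip(predicted, true)) / n
    mismatched = [
        l2(predicted[i], true[j]) for i in range(n) for j in range(n) if i != j
    ]
    return matched, sum(mismatched) / len(mismatched)
```

A useful prediction shows a matched distance clearly below the mismatched one; how quickly that gap closes with the number of latent steps is the quantity of interest.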
7 Conclusion
In this paper we studied the feasibility of performing complex reasoning for mathematical formulas in a fixed-dimensional embedding space. We proposed a new evaluation metric that measures the preservation of semantic information under multiple reasoning steps in the embedding space. Although our models were not trained for performing rewrites on rewritten statements, nor were they trained for being able to deduce multiple steps in the embedding space, our approximate rewrite prediction model has demonstrated significant predictive power over multiple approximate rewrite steps performed in the latent space. Although it seems likely that these results can be significantly improved by better neural network architectures, hard negative mining and training on rewritten formulas, our method showcases a simple and efficient general methodology for reasoning in the latent space. In addition, it proposes an easy-to-use, fast-to-train and crisp evaluation methodology for representing mathematical statements by neural networks.
It is likely that such representations prove helpful for faster learning to prove without imitating human proofs like that in DeepHOLZero [3], given that premise selection is a closely related task to predicting the rewrite success of statements. Self-supervised pre-training or even co-training such models with premise selection could prove useful as a way of learning more semantic feature representations of mathematical formulas.
References
[1] (2016) DeepMath - deep sequence models for premise selection. In Advances in Neural Information Processing Systems, pp. 2235–2243.
[2] (2019) HOList: an environment for machine learning of higher-order theorem proving. ICML 2019. International Conference on Machine Learning.
[3] (2019) Learning to reason in large theories without imitation. arXiv preprint arXiv:1905.10501.
[4] (2018) Using state predictions for value regularization in curiosity driven deep reinforcement learning. In 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 25–29.
[5] (2017) Recurrent environment simulators. CoRR abs/1704.02254.
[6] The Coq Proof Assistant. http://coq.inria.fr
[7] (2017) Learning to act by predicting the future. ICLR 2017.
[8] (2017) TacticToe: learning to reason with HOL4 tactics. In LPAR-21. 21st International Conference on Logic for Programming, Artificial Intelligence and Reasoning, Vol. 46, pp. 125–143.
[9] (2008) Formal proof–the four-color theorem. Notices of the AMS 55 (11), pp. 1382–1393.
[10] (2018) Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 2450–2462.
[11] (2017) A formal proof of the Kepler conjecture. In Forum of Mathematics, Pi, Vol. 5.
[12] (1996) HOL Light: a tutorial introduction. In FMCAD, pp. 265–269.
[13] Accessed: 2019/09/23.
[14] (2019) Model-based reinforcement learning for Atari. arXiv preprint arXiv:1903.00374.
[15] (2015) HOL(y)Hammer: online ATP service for HOL Light. Mathematics in Computer Science 9 (1), pp. 5–22.
[16] (2018) Learning heuristics for automated reasoning through deep reinforcement learning. CoRR abs/1807.08058.
[17] (2017) Deep network guided proof search. LPAR-21. 21st International Conference on Logic for Programming, Artificial Intelligence and Reasoning.
[18] (2015) Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pp. 2863–2871.
[19] (2019) Graph representations for higher-order logic and theorem proving. arXiv preprint arXiv:1905.10006.
[20] (2019) Can neural networks learn symbolic rewriting?.
[21] (2017) Premise selection for theorem proving by deep graph embedding. In Advances in Neural Information Processing Systems, pp. 2786–2796.