It is challenging to integrate symbolic reasoning and deep learning in effective ways [Garcez et al.2015]. In the field of symbolic reasoning, much work has been done on using formal methods to model reliable reasoning processes [Chang and Lee1973]. For instance, algebraic reasoning can be modelled by using first-order predicate logics or even higher-order logics, but these logics are usually designed by experienced experts, because it is challenging for machines to learn these logics from data automatically [Bundy and Welham1981, Nipkow et al.2002]. On the other hand, recent approaches on deep learning have revealed that deep neural networks are powerful tools for learning from data [Lecun et al.2015], especially for learning speech features [Mohamed et al.2012] and image features [Sun et al.2015]. However, not much work has been done on using deep neural networks to learn formal symbolic logics. To close the gap between symbolic reasoning and deep learning, this research explores the possibility of using deep feedforward neural networks to learn logics of rewriting in algebraic reasoning. In other words, we try to teach neural networks to solve mathematical problems, such as finding the solution of an equation and calculating the differential or integral of an expression, by using a rewriting system.
Rewriting is an important technique in symbolic reasoning. Its core concept is to simplify the reasoning process by using equivalence relations between different expressions [Bundy1983]. Usually, rewriting is based on a tree-manipulating system, as many algebraic expressions can be represented by using tree structures, and the manipulation of symbols in the expressions is equivalent to the manipulation of nodes, leaves and sub-trees on the trees [Rosen1973]. To manipulate symbols, a rewriting system usually uses one-way matching, which is a restricted application of unification, to find a desired pattern in an expression and then replaces the pattern with another equivalent pattern [Bundy1983]. In order to reduce the search space, rewriting systems are expected to be Church-Rosser, which means that they should be terminating and locally confluent [Rosen1973, Huet1980]. Thus, very careful designs and analyses are needed: A design can start from small systems, because proving termination and local confluence of a smaller system is usually easier than proving those of a larger system [Bundy and Welham1981]. Some previous work has focused on this aspect: The Knuth-Bendix completion algorithm can be used to solve the problem of local confluence [Knuth and Bendix1983], and Huet [1981] has provided a proof of correctness for this algorithm. Also, dependency pairs [Arts and Giesl2000] and semantic labelling [Zantema1995] can solve the problem of termination for some systems. After multiple small systems have been designed, they can be combined into a whole system, because the direct sum of two Church-Rosser systems holds the same property [Toyama1987].
Deep neural networks have been used in many fields of artificial intelligence, including speech recognition [Mohamed et al.2012], human face recognition [Sun et al.2015], natural language understanding [Sarikaya et al.2014], reinforcement learning for playing video games [Mnih et al.2015] and Monte Carlo tree search for playing Go [Silver et al.2016]. Recently, some researchers have tried to extend them to reasoning tasks. For instance, Irving et al. [2016] have proposed DeepMath for automated theorem proving with deep neural networks. Also, Serafini and d’Avila Garcez [2016] have proposed logic tensor networks to combine deep learning with logical reasoning. In addition, Garnelo et al. [2016] have explored deep symbolic reinforcement learning.
In this research, we use deep feedforward neural networks [Lecun et al.2015] to guide rewriting processes. This technique is called human-like rewriting, as it is adapted from standard rewriting and can simulate human behaviour in using rewrite rules after learning from algebraic reasoning schemes. The following sections discuss this technique in detail: Section 2 introduces the core method of human-like rewriting. Section 3 briefly discusses algebraic reasoning schemes. Section 4 provides three methods for system improvement. Section 5 provides experimental results of the core method and the improvement methods. Section 6 concludes.
2 Human-like Rewriting
Rewriting is an inference technique for replacing expressions or subexpressions with equivalent ones [Bundy1983]. For instance, given two rules of the Peano axioms (we use the mathematical convention that a word is a constant if its first letter is in upper case, and a variable if its first letter is in lower case):
can be rewritten via:
More detailed discussions about the Peano axioms can be found in [Pillay1981]. Generally, rewriting requires a source expression $s$ and a set of rewrite rules $\tau$. Let $l \Rightarrow r$ denote a rewrite rule in $\tau$, $t$ a subexpression of $s$, and $\theta$ the most general unifier of one-way matching from $l$ to $t$. A single rewriting step of inference can be formed as:
$$\frac{s(t) \qquad l \Rightarrow r \qquad l[\theta] \equiv t}{s(r[\theta])} \tag{4}$$
It is noticeable that $\theta$ is only applied to $l$, but not to $t$. The reason is that one-way matching, which is a restricted application of unification, requires that all substitutions in a unifier are only applied to the left-hand side of a unification pair. Standard rewriting repeats the above step until no rule can be applied to the expression any further. It requires the set of rewrite rules $\tau$ to be Church-Rosser, which means that $\tau$ should be terminating and locally confluent. This requirement restricts the application of rewriting in many fields. For instance, the chain rule in calculus, which is very important for computing derivatives, will result in non-termination:
The above process means that it is challenging to use the chain rule in standard rewriting. Similarly, a commutativity rule $x \circ y \Rightarrow y \circ x$, where $\circ$ is an addition, a multiplication, a logical conjunction, a logical disjunction or another binary operation satisfying commutativity, is difficult to use in standard rewriting. If termination is not guaranteed, it will be difficult to check local confluence, as local confluence requires a completely developed search tree, but non-termination means that the search tree is infinite and cannot be completely developed. More detailed discussion about standard rewriting and the Church-Rosser property can be found in [Bundy1983].
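The single rewriting step described above can be sketched in Python. This is a minimal illustration and not the system used in this research: the tuple encoding of expressions, the function names `match`, `substitute` and `rewrite_at`, and the choice of the two standard Peano addition rules as the example are all assumptions made here.

```python
from typing import Dict, Optional, Tuple, Union

# Assumed encoding: an atom is a string; a compound term is (functor, arg1, ..., argN).
Term = Union[str, Tuple]
Subst = Dict[str, Term]


def is_var(t: Term) -> bool:
    # Convention from the text: a word is a variable if its first letter is lower case.
    return isinstance(t, str) and t[:1].islower()


def match(pattern: Term, term: Term, theta: Optional[Subst] = None) -> Optional[Subst]:
    """One-way matching: bind variables of `pattern` only, so that pattern[theta] == term."""
    theta = dict(theta or {})
    if is_var(pattern):
        if pattern in theta and theta[pattern] != term:
            return None  # the same variable would need two different bindings
        theta[pattern] = term
        return theta
    if isinstance(pattern, str) or isinstance(term, str):
        return theta if pattern == term else None
    if pattern[0] != term[0] or len(pattern) != len(term):
        return None
    for p, t in zip(pattern[1:], term[1:]):
        theta = match(p, t, theta)
        if theta is None:
            return None
    return theta


def substitute(t: Term, theta: Subst) -> Term:
    """Apply the substitution to a rule side (never to the subject expression)."""
    if is_var(t):
        return theta.get(t, t)
    if isinstance(t, str):
        return t
    return (t[0],) + tuple(substitute(a, theta) for a in t[1:])


def rewrite_at(s: Term, position: Tuple[int, ...], lhs: Term, rhs: Term) -> Optional[Term]:
    """Apply lhs => rhs to the subexpression of `s` at `position` (1-based branch indices)."""
    if not position:
        theta = match(lhs, s)
        return None if theta is None else substitute(rhs, theta)
    i = position[0]
    if isinstance(s, str) or not 1 <= i < len(s):
        return None  # the position does not exist in this expression
    new_arg = rewrite_at(s[i], position[1:], lhs, rhs)
    return None if new_arg is None else s[:i] + (new_arg,) + s[i + 1:]


# Assumed example rules: x + 0 => x and x + S(y) => S(x + y).
ADD_ZERO = (("+", "x", "0"), "x")
ADD_SUCC = (("+", "x", ("S", "y")), ("S", ("+", "x", "y")))

expr = ("+", ("S", "0"), ("S", ("S", "0")))   # S(0) + S(S(0))
step1 = rewrite_at(expr, (), *ADD_SUCC)        # S(S(0) + S(0))
step2 = rewrite_at(step1, (1,), *ADD_SUCC)     # S(S(S(0) + 0))
step3 = rewrite_at(step2, (1, 1), *ADD_ZERO)   # S(S(S(0)))
```

Each call names both a rule and a position, which is the same pair of decisions that the neural network is later trained to predict.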
Human-like rewriting is adapted from standard rewriting. It uses a deep feedforward neural network [Lecun et al.2015] to guide rewriting processes. The neural network has learnt from rewriting examples produced by humans, so that it can, to some extent, simulate the ways in which humans use rewrite rules: Firstly, non-terminating rules are used to rewrite expressions. Secondly, local confluence is not checked. Lastly, experiences of rewriting can be learnt and can guide future rewriting processes.
To train the feedforward neural network, input data and target data are required. An input can be generated via the following steps: Firstly, an expression is transformed to a parsing tree [Huth and Ryan2004] with position annotations. A position annotation is a unique label $\langle p_1, p_2, \ldots, p_n \rangle$ indicating a position on a tree, where each $p_i$ is the order of a branch. Then the tree is reduced to a set of partial trees with a predefined maximum depth $m$. Next, the partial trees are expanded to perfect $k$-ary trees with the depth $m$ and a predefined breadth $k$. In particular, empty positions on the perfect $k$-ary trees are filled by an empty symbol. After that, the perfect $k$-ary trees are transformed to lists via in-order traversal. Detailed discussions about perfect $k$-ary trees and in-order traversal can be found in [Cormen et al.2001]. Finally, the lists with their position annotations are transformed to a set of one-hot representations [Turian et al.2010]. In particular, the empty symbol is transformed to a zero block. Figure 1 provides an example for the above procedure. This representation is called a reduced partial tree (RPT) representation of the expression. A target is the one-hot representation [Turian et al.2010] of a rewrite rule name with a position annotation for applying the rule.
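The procedure above can be sketched as follows. This is only an illustration under stated assumptions: the breadth is fixed to 2, one partial tree is produced for every node, the depth counts the levels kept below the root of a partial tree, the padding token `"Empty"` is a hypothetical name, and the encoding of the position annotation itself is omitted.

```python
from typing import Dict, List, Tuple, Union

Term = Union[str, Tuple]   # atom, or (functor, arg1, ..., argN)
Position = Tuple[int, ...]
EMPTY = "Empty"            # hypothetical name for the padding symbol


def subterms(t: Term, pos: Position = ()) -> List[Tuple[Position, Term]]:
    """Every subexpression paired with its position annotation (1-based branch indices)."""
    out = [(pos, t)]
    if isinstance(t, tuple):
        for i, arg in enumerate(t[1:], start=1):
            out.extend(subterms(arg, pos + (i,)))
    return out


def in_order(t: Term, depth: int) -> List[str]:
    """Cut `t` to `depth` levels below its root, pad it to a perfect binary tree
    and flatten it by in-order traversal (left subtree, root, right subtree)."""
    label = t[0] if isinstance(t, tuple) else t
    if depth == 0:
        return [label]
    args = list(t[1:]) if isinstance(t, tuple) else []
    if len(args) > 2:
        raise ValueError("this sketch assumes unary or binary functors")
    args += [EMPTY] * (2 - len(args))
    return in_order(args[0], depth - 1) + [label] + in_order(args[1], depth - 1)


def rpt(expr: Term, depth: int) -> Dict[Position, List[str]]:
    """One flattened partial tree per node, keyed by the node's position annotation."""
    return {pos: in_order(sub, depth) for pos, sub in subterms(expr)}


def one_hot(symbols: List[str], vocab: List[str]) -> List[int]:
    """Concatenate one block per symbol; the padding symbol becomes a zero block."""
    vec: List[int] = []
    for s in symbols:
        vec.extend([0] * len(vocab) if s == EMPTY else [int(s == v) for v in vocab])
    return vec


expr = ("*", ("+", "x", "1"), "y")   # (x + 1) * y
partial_trees = rpt(expr, depth=1)
```

With these assumptions every list has length $2^{d+1} - 1$ for depth $d$, so all vectors fed to the network have the same dimension even though their number depends on the expression.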
It is noticeable that the input of the neural network is a set of vectors, and the number of vectors is not fixed, as it depends on the structure of the expression. However, the target is a single vector. Thus, the dimension of the input will disagree with the dimension of the target if a conventional feedforward neural network structure is used. To solve this problem, we replace its Softmax layer with an averaged Softmax layer. Let $x_{ij}$ denote the $j$th element of the $i$th input vector, $N$ the number of the input vectors, $\bar{x}$ an averaged input vector, $\bar{x}_j$ the $j$th element of $\bar{x}$, $W$ a weight matrix, $b$ a bias vector, $\mathrm{Softmax}$ the standard Softmax function [Bishop2006], and $y$ the output vector. The averaged Softmax layer is defined as:
$$\bar{x}_j = \frac{1}{N} \sum_{i=1}^{N} x_{ij}, \qquad y = \mathrm{Softmax}(W \bar{x} + b)$$
It is noticeable that the output is a single vector regardless of the number of the input vectors.
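A minimal pure-Python sketch of such a layer is given below, assuming that the layer averages its input vectors element-wise, applies one affine map and then the standard Softmax function. The function names and the toy weights are illustrative only.

```python
import math
from typing import List, Sequence


def softmax(z: Sequence[float]) -> List[float]:
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def averaged_softmax(xs: Sequence[Sequence[float]],
                     W: Sequence[Sequence[float]],
                     b: Sequence[float]) -> List[float]:
    """Average the input vectors, then apply Softmax(W x_bar + b)."""
    n = len(xs)
    x_bar = [sum(x[j] for x in xs) / n for j in range(len(xs[0]))]
    z = [sum(w * v for w, v in zip(row, x_bar)) + b_k for row, b_k in zip(W, b)]
    return softmax(z)


# Toy weights: 2 input features, 3 output classes.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.0]

y_three = averaged_softmax([[1, 0], [0, 1], [1, 1]], W, b)  # three input vectors
y_one = averaged_softmax([[1, 1]], W, b)                    # one input vector
```

Both calls return a vector with one entry per output class, whatever the number of input vectors.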
The feedforward neural network is trained by using the back-propagation algorithm with the cross-entropy error function [Hecht-Nielsen1988, Bishop2006]. After training, the neural network can be used to guide a rewriting procedure: Given the RPT representation of an expression, the neural network uses forward computation to get an output vector, and the position of the maximum element indicates the name of a rewrite rule and a possible position for the application of the rule.
3 Algebraic Reasoning Schemes
The learning of the neural network is based on a set of algebraic reasoning schemes. Generally, an algebraic reasoning scheme consists of a question, an answer and some intermediate reasoning steps. The question is an expression indicating the starting point of reasoning. The answer is an expression indicating the goal of reasoning. Each intermediate reasoning step is a record consisting of:
A source expression;
The name of a rewrite rule;
A position annotation for applying the rewrite rule;
A target expression.
In particular, the source expression of the first reasoning step is the question, and the target expression of the final reasoning step is the answer. Also, for each reasoning step, the target expression will be the source expression of the next step if the “next step” exists. By applying all intermediate reasoning steps, the question can be rewritten to the answer deterministically.
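The structure of a scheme and the chaining condition described above can be written down as a small data structure. The class and field names, the rule names and the example expressions are hypothetical and serve only to illustrate the record layout.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    source: str                 # source expression
    rule: str                   # name of the rewrite rule
    position: Tuple[int, ...]   # position annotation for applying the rule
    target: str                 # target expression


@dataclass(frozen=True)
class Scheme:
    question: str
    answer: str
    steps: List[Step]

    def is_consistent(self) -> bool:
        """Check that the steps lead from the question to the answer without gaps."""
        if not self.steps:
            return self.question == self.answer
        if self.steps[0].source != self.question:
            return False
        if self.steps[-1].target != self.answer:
            return False
        return all(a.target == b.source for a, b in zip(self.steps, self.steps[1:]))


# Hypothetical rule names and expressions, for illustration only.
scheme = Scheme(
    question="x + 1 = 3",
    answer="x = 2",
    steps=[
        Step("x + 1 = 3", "MoveTerm", (), "x = 3 - 1"),
        Step("x = 3 - 1", "Evaluate", (2,), "x = 2"),
    ],
)
broken = Scheme("x + 1 = 3", "x = 2", [Step("x + 1 = 3", "MoveTerm", (), "x = 3 - 1")])
```

Each `Step` later yields one training example: its source expression gives the input, and its rule name and position give the target.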
In this research, algebraic reasoning schemes are developed via a rewriting system in SWI-Prolog [Wielemaker et al.2012]. The rewriting system is based on Rule (4), and it uses breadth-first search to find intermediate reasoning steps from a question to an answer. Like most rewriting systems and automated theorem proving systems (a practical example is the “by auto” function of Isabelle/HOL [Nipkow et al.2002]: it is often difficult to prove a complex theorem automatically, so experts’ guidance is often required), its reasoning ability is restricted by the problem of combinatorial explosion: The number of possible ways of reasoning can grow rapidly when the question becomes more complex [Bundy1983]. Therefore, a full algebraic reasoning scheme of a complex question is usually difficult to generate automatically, and guidance from humans is required. In other words, if the system fails to develop the scheme, we apply rewrite rules manually until the remaining part of the scheme can be developed automatically, or we provide some subgoals for the system to reduce the search space. After algebraic reasoning schemes are developed, their intermediate reasoning steps are used to train the neural network: For each step, the RPT representation of the source expression is the input of the neural network, and the one-hot representation of the rewrite rule name and the position annotation is the target of the neural network, as discussed in Section 2.
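The breadth-first search for intermediate steps can be sketched generically. The sketch below is written in Python rather than SWI-Prolog and is not the system used in this research: the function `find_steps`, the `successors` callback and the depth bound are assumptions, and the toy integer "rules" merely stand in for real rewrite rules.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, List, Optional, Tuple

Position = Tuple[int, ...]
Move = Tuple[str, Position, Hashable]                 # (rule name, position, new expression)
Record = Tuple[Hashable, str, Position, Hashable]     # (source, rule name, position, target)


def find_steps(question: Hashable,
               answer: Hashable,
               successors: Callable[[Hashable], Iterable[Move]],
               max_depth: int = 10) -> Optional[List[Record]]:
    """Breadth-first search for a shortest chain of rule applications."""
    queue = deque([(question, [])])
    seen = {question}
    while queue:
        expr, path = queue.popleft()
        if expr == answer:
            return path
        if len(path) >= max_depth:
            continue  # the bound keeps the search finite even with non-terminating rules
        for rule, pos, nxt in successors(expr):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(expr, rule, pos, nxt)]))
    return None


def toy_successors(n: int) -> List[Move]:
    # Toy stand-in for "all single rule applications to the current expression".
    return [("Inc", (), n + 1), ("Double", (), n * 2)]


steps = find_steps(1, 6, toy_successors)
```

The number of queued expressions grows with every level, which is the combinatorial explosion mentioned above; returning `None` at the depth bound corresponds to the case where manual guidance or subgoals are needed.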
4 Methods for System Improvement
4.1 Centralised RPT Representation
The RPT representation discussed before is a top-down representation of an expression: A functor in the expression is a node, and arguments dominated by the functor are child nodes or leaves of the node. However, it does not record bottom-up information about the expression. For instance, in Figure 1, the partial tree labelled does not record any information about its parent node “”.
A centralised RPT (C-RPT) representation can represent both top-down and bottom-up information of an expression: Firstly, every node on a tree considers itself as the centre of the tree and grows an additional branch to its parent node (if it exists), so that the tree becomes a directed graph. This step is called “centralisation”. Then the graph is reduced to a set of partial trees and expanded to a set of perfect -ary trees. In particular, each additional branch is defined as the th branch of its parent node, and all empty positions dominated by the parent node are filled by . Detailed discussions about perfect -ary trees and directed graphs can be found from [Cormen et al.2001]. Figure 2 provides an example for the above steps. Finally, these perfect -ary trees are transformed to lists and further represented as a set of vectors, as discussed by Section 2.
4.2 Symbolic Association Vector
Consider the following rewrite rule:
The application of this rule requires that two arguments of the functor on its left-hand side are the same. If this pattern exists in an expression, it will be a useful hint for selecting rules. In such a case, the use of a symbolic association vector (SAV) can provide useful information for the neural network: Assume that $H$ is the list representation of a perfect $k$-ary tree (which has been discussed in Section 2) with a length $L$. $S$ is defined as an $L \times L$ matrix which satisfies:
$$S_{ij} = \begin{cases} 1 & \text{if } H_i = H_j \\ 0 & \text{otherwise} \end{cases}$$
After the matrix is produced, it can be reshaped to a vector and be a part of an input vector of the neural network.
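A minimal sketch of this computation follows, assuming that an entry of the matrix is 1 exactly when the two list elements are the same symbol. How padding symbols should be treated is not stated here, so this sketch compares them like any other symbol; the function name is illustrative.

```python
from typing import List


def symbolic_association_vector(symbols: List[str]) -> List[int]:
    """Build the L x L equality matrix of a flattened tree and reshape it to a vector."""
    length = len(symbols)
    matrix = [[1 if symbols[i] == symbols[j] else 0 for j in range(length)]
              for i in range(length)]
    return [value for row in matrix for value in row]  # row-major reshape


# Flattened tree of "x + x": the two arguments are the same symbol.
sav = symbolic_association_vector(["x", "+", "x"])
```

The off-diagonal ones mark positions that hold the same symbol, which is the hint that a rule requiring two equal arguments may be applicable.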
4.3 Rule Application Record
Previous applications of rewrite rules can provide hints for current and future applications. In this research, we use rule application records (RAR) to record the previous applications of rewrite rules: Let $Q_i$ denote the $i$th element of an RAR $Q$, $\mathit{rule}_i$ the name of the previous $i$th rewrite rule, and $\mathit{pos}_i$ the position annotation for applying the rule. $Q_i$ is defined as:
$$Q_i = \langle \mathit{rule}_i, \mathit{pos}_i \rangle$$
Usually, the RAR only records the last $P$ applications of rewrite rules, where $P$ is a predefined length of $Q$. To enable the neural network to read the RAR, it needs to be transformed to a one-hot representation [Turian et al.2010]. A drawback of RARs is that they cannot be used in the first $P$ steps of rewriting, as they record exactly $P$ previous applications of rewrite rules.
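A fixed-length record of this kind can be kept with a bounded queue. The class below is a sketch with hypothetical names; it stores the most recent application first and leaves the one-hot encoding aside.

```python
from collections import deque
from typing import Deque, List, Tuple

Application = Tuple[str, Tuple[int, ...]]  # (rule name, position annotation)


class RuleApplicationRecord:
    """Keeps only the most recent `length` rule applications, newest first."""

    def __init__(self, length: int) -> None:
        self._items: Deque[Application] = deque(maxlen=length)

    def push(self, rule: str, position: Tuple[int, ...]) -> None:
        # With maxlen set, appendleft drops the oldest entry from the right end.
        self._items.appendleft((rule, position))

    def is_full(self) -> bool:
        return len(self._items) == self._items.maxlen

    def as_list(self) -> List[Application]:
        return list(self._items)


record = RuleApplicationRecord(length=2)
record.push("RuleA", (1,))
full_after_one = record.is_full()
record.push("RuleB", ())
record.push("RuleC", (2,))
```

`is_full` makes the drawback noted above explicit: until enough steps have been taken, the record does not yet hold a complete history.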
5.1 Datasets and Evaluation Metrics
A dataset of algebraic reasoning schemes is used to train and test models. This dataset contains 400 schemes about linear equations, differentials and integrals and 80 rewrite rules, and these schemes consist of 6,067 intermediate reasoning steps in total. We shuffle the intermediate steps and then divide them into a training set and a test set randomly: The training set contains 5,067 examples, and the test set contains 1,000 examples. After training a model with the training set, an error rate of reasoning on the test set is used to evaluate the model, and it can be computed by:
$$\text{Error Rate} = \frac{N_{\mathit{error}}}{N_{\mathit{total}}} \times 100\%$$
where $N_{\mathit{error}}$ is the number of cases when the model fails to indicate an expected application of rewrite rules, and $N_{\mathit{total}}$ is the number of examples in the test set.
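As a small illustration of this metric, the function below counts a prediction as a failure when the predicted rule name and position differ from the expected ones. The function name and the toy labels are assumptions made for this sketch.

```python
from typing import Sequence, Tuple

Application = Tuple[str, Tuple[int, ...]]  # (rule name, position annotation)


def error_rate(predicted: Sequence[Application], expected: Sequence[Application]) -> float:
    """Percentage of test examples where the predicted rule application is not the expected one."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("need two non-empty sequences of equal length")
    errors = sum(1 for p, e in zip(predicted, expected) if p != e)
    return 100.0 * errors / len(expected)


expected = [("RuleA", ()), ("RuleB", (1,)), ("RuleC", (2,)), ("RuleA", (1, 1))]
predicted = [("RuleA", ()), ("RuleB", (1,)), ("RuleC", (1,)), ("RuleA", (1, 1))]
rate = error_rate(predicted, expected)  # one wrong position out of four examples
```

On a test set of 1,000 examples, each failure changes the error rate by 0.1 percentage points, so an error rate of 4.6% corresponds to 46 failures.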
5.2 Using RPT Representations and Neural Networks
In this part, we evaluate the core method of human-like rewriting: All expressions in the dataset are represented by using the RPT representations. The breadth of an RPT is set to 2, because the expressions in the dataset are unary or binary. The depth of an RPT is set to 1, 2 or 3. Also, feedforward neural networks [Lecun et al.2015] with 1, 3 or 5 hidden layers are trained, using rectified linear units [Glorot et al.2011] in the hidden layers. The output layer of each neural network is an averaged Softmax layer. The neural networks are trained via the back-propagation algorithm with the cross-entropy error function [Hecht-Nielsen1988, Bishop2006]. When training models, learning rates are decided by the Newbob+/Train strategy [Wiesler et al.2014]: The initial learning rate is set to 0.01, and the learning rate is halved when the average improvement of the cross-entropy loss on the training set is smaller than 0.1. The training process stops when the improvement is smaller than 0.01.
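The learning rate rule described above can be sketched as a small function. This is a simplified reading and not the Newbob+/Train implementation itself: the improvement is taken here as the decrease of the training loss between two consecutive epochs, and the function name and the example losses are made up for illustration.

```python
from typing import List, Optional


def next_learning_rate(rate: float, improvement: float,
                       halve_below: float = 0.1,
                       stop_below: float = 0.01) -> Optional[float]:
    """Return the rate for the next epoch, or None when training should stop."""
    if improvement < stop_below:
        return None
    if improvement < halve_below:
        return rate / 2
    return rate


def rates_for(losses: List[float], initial_rate: float = 0.01) -> List[float]:
    """Learning rate chosen after each pair of consecutive epoch losses, until the stop."""
    rate = initial_rate
    history = []
    for previous, current in zip(losses, losses[1:]):
        new_rate = next_learning_rate(rate, previous - current)
        if new_rate is None:
            break
        rate = new_rate
        history.append(rate)
    return history


# Made-up losses: large gains first, then small gains, then a negligible one.
history = rates_for([3.0, 2.0, 1.5, 1.45, 1.42, 1.419])
```

A schedule of this kind stops as soon as one epoch brings too little improvement, which is relevant to the early stopping discussed with the results below.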
Figure 3 provides learning curves of the models, where “FNN$n$” means that the neural network has $n$ hidden layers, and “RPT$m$” means that the depth of RPTs is $m$. To aid readability, the curves of “FNN1”, “FNN3” and “FNN5” are in blue, red and green respectively, and the curves of “RPT1”, “RPT2” and “RPT3” are displayed by using dotted lines, dashed lines and solid lines respectively. By comparing the curves with the same colour, it is noticeable that more hidden layers bring about significantly better learning performance. On the other hand, if the neural network only has a single hidden layer, the learning stops early, while the cross-entropy loss is still very high. Also, by comparing the curves with the same type of line, it is noticeable that a deeper RPT often brings about better learning performance, but an exception is the curve of the “FNN5 + RPT2” model.
Table 1 reveals the performance of the trained models on the test set. In this table, results in “FNN$n$” rows and “RPT$m$” columns correspond to the “FNN$n$ + RPT$m$” models in Figure 3. It is noticeable that the error rates of reasoning decrease significantly when the numbers of hidden layers increase. Also, the error rates of reasoning often decrease when the depths of RPTs increase, but an exception occurs in the case of “FNN5 + RPT2”. We believe that the exception occurs because the learning rate strategy results in early stopping of training. In addition, the error rate of the FNN5 + RPT3 model is the best among all results.
5.3 Using Improvement Methods
In Section 5.2, we have found that the neural networks with 5 hidden layers have better performance than those with 1 or 3 hidden layers on the task of human-like rewriting. Based on the neural networks with 5 hidden layers, we apply the three improvement methods to these models.
Figure 4 shows learning curves of models improved by C-RPTs, SAVs and RARs. Also, learning curves of the baseline RPT$m$ models are displayed by using dashed lines, where $m$ is the depth of RPTs. Learning curves of the C-RPT models are displayed in Figure 4(a). A comparison between two lines in the same colour reveals that the C-RPT representation can improve the model when $m$ is fixed. Also, the C-RPT2 curve is very close to the RPT3 curve during the last 6 epochs, which reveals that there might be a trade-off between using C-RPTs and increasing the depth of RPTs. The best learning curve is the C-RPT3 curve, as its cross-entropy loss is always the lowest during all epochs. Figure 4(b) provides learning curves of the RPT models with the SAV method. It is noticeable that SAVs have two effects: The first is that they can bring about lower cross-entropy losses. The second is that they can reduce the costs of learning time, as each RPT + SAV model uses fewer epochs to finish learning than its counterpart. Figure 4(c) shows learning curves of the RPT models with the RAR method. This figure reveals that RARs always improve the models. In particular, even the RPT1 + RAR model has better learning performance than the RPT3 model. Also, the RPT1 + RAR model and the RPT3 + RAR model need fewer epochs to be trained, which means that RARs may reduce the time consumption of learning. Figure 4(d) provides learning curves of the models with all improvement methods. A glance at the figure reveals that these models have better learning performance than the baseline models. Also, they require fewer epochs to be trained than their counterparts. In addition, the final cross-entropy loss of the C-RPT2 + SAV + RAR model is the lowest among all results.
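| Model (error rate, %) | Depth 1 | Depth 2 | Depth 3 |
|---|---|---|---|
| RPT + SAV | 16.8 | 15.8 | 11.5 |
| RPT + RAR | 6.7 | 6.0 | 5.4 |
| RPT + SAV + RAR | 6.8 | 5.2 | 5.4 |
| C-RPT + SAV | 11.9 | 15.1 | 11.8 |
| C-RPT + RAR | 6.3 | 5.1 | 5.1 |
| C-RPT + SAV + RAR | 5.4 | 4.6 | 5.3 |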
Table 2 shows error rates of reasoning on the test set after using the improvement methods. It is noticeable that: Firstly, the C-RPT models have lower error rates than the baseline RPT models, especially when . Secondly, the RPT + SAV models have lower error rates than the baseline RPT model when is 2 or 3, but this is not the case for the RPT1 + SAV model. Thirdly, the RARs can reduce the error rates significantly. Finally, the error rates can be reduced further when the three improvement methods are used together. In particular, the C-RPT2 + SAV + RAR model reaches the best error rate (4.6%) among all models.
6 Conclusions and Future Work
Deep feedforward neural networks are able to guide rewriting processes after learning from algebraic reasoning schemes. The use of deep structures is necessary, because the behaviours of rewriting can be accurately modelled only if the neural networks have enough hidden layers. Also, it has been shown that the RPT representation is effective for the neural networks to model algebraic expressions, and it can be improved by using the C-RPT representation, the SAV method and the RAR method. Based on these techniques, human-like rewriting can solve many problems about linear equations, differentials and integrals. In the future, we will try to use human-like rewriting to deal with more complex tasks of mathematical reasoning and extend it to more general first-order logics and higher-order logics.
This work is supported by the Fundamental Research Funds for the Central Universities (No. 2016JX06) and the National Natural Science Foundation of China (No. 61472369).
- [Arts and Giesl2000] Thomas Arts and Jürgen Giesl. Termination of term rewriting using dependency pairs. Theor. Comput. Sci., 236(1-2):133–178, 2000.
- [Bishop2006] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., 2006.
- [Bundy and Welham1981] Alan Bundy and Bob Welham. Using meta-level inference for selective application of multiple rewrite rule sets in algebraic manipulation. Artif. Intell., 16(2):189–212, 1981.
- [Bundy1983] Alan Bundy. The computer modelling of mathematical reasoning. Academic Press, 1983.
- [Chang and Lee1973] Chin-Liang Chang and Richard C. T. Lee. Symbolic logic and mechanical theorem proving. Computer science classics. Academic Press, 1973.
- [Cormen et al.2001] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to algorithms (second edition). pages 1297–1305, 2001.
- [Garcez et al.2015] Artur d’Avila Garcez, Tarek R Besold, Luc De Raedt, Peter Földiák, Pascal Hitzler, Thomas Icard, Kai-Uwe Kühnberger, Luis C Lamb, Risto Miikkulainen, and Daniel L Silver. Neural-symbolic learning and reasoning: Contributions and challenges. In AAAI Spring Symposium - Knowledge Representation and Reasoning: Integrating Symbolic and Neural Approaches, 2015.
- [Garnelo et al.2016] Marta Garnelo, Kai Arulkumaran, and Murray Shanahan. Towards deep symbolic reinforcement learning. CoRR, abs/1609.05518, 2016.
- [Glorot et al.2011] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, pages 315–323, 2011.
- [Hecht-Nielsen1988] Robert Hecht-Nielsen. Theory of the backpropagation neural network. Neural Networks, 1(Supplement-1):445–448, 1988.
- [Huet1980] Gérard Huet. Confluent reductions: Abstract properties and applications to term rewriting systems. Journal of the ACM, 27(4):797–821, 1980.
- [Huet1981] Gérard P. Huet. A complete proof of correctness of the knuth-bendix completion algorithm. J. Comput. Syst. Sci., 23(1):11–21, 1981.
- [Huth and Ryan2004] Michael Huth and Mark Dermot Ryan. Logic in computer science - modelling and reasoning about systems (2. ed.). Cambridge University Press, 2004.
- [Irving et al.2016] Geoffrey Irving, Christian Szegedy, Alexander A. Alemi, Niklas Eén, François Chollet, and Josef Urban. Deepmath - deep sequence models for premise selection. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2235–2243, 2016.
- [Knuth and Bendix1983] Donald E. Knuth and Peter B. Bendix. Simple word problems in universal algebras. Computational Problems in Abstract Algebra, pages 263–297, 1983.
- [Lecun et al.2015] Yann Lecun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
- [Mnih et al.2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- [Mohamed et al.2012] Abdel-rahman Mohamed, George E. Dahl, and Geoffrey E. Hinton. Acoustic modeling using deep belief networks. IEEE Trans. Audio, Speech & Language Processing, 20(1):14–22, 2012.
- [Nipkow et al.2002] Tobias Nipkow, Lawrence C. Paulson, and Markus Wenzel. Isabelle/HOL - A Proof Assistant for Higher-Order Logic, volume 2283 of Lecture Notes in Computer Science. Springer, 2002.
- [Pillay1981] Anand Pillay. Models of peano arithmetic. Journal of Symbolic Logic, 67(3):1265–1273, 1981.
- [Rosen1973] Barry K. Rosen. Tree-manipulating systems and Church-Rosser theorems. ACM Symposium on Theory of Computing, pages 117–127, 1973.
- [Sarikaya et al.2014] Ruhi Sarikaya, Geoffrey E. Hinton, and Anoop Deoras. Application of deep belief networks for natural language understanding. IEEE/ACM Trans. Audio, Speech & Language Processing, 22(4):778–784, 2014.
- [Serafini and d’Avila Garcez2016] Luciano Serafini and Artur S. d’Avila Garcez. Logic tensor networks: Deep learning and logical reasoning from data and knowledge. CoRR, abs/1606.04422, 2016.
- [Silver et al.2016] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, January 2016.
- [Sun et al.2015] Yi Sun, Ding Liang, Xiaogang Wang, and Xiaoou Tang. Deepid3: Face recognition with very deep neural networks. CoRR, abs/1502.00873, 2015.
- [Toyama1987] Yoshihito Toyama. On the church-rosser property for the direct sum of term rewriting systems. J. ACM, 34(1):128–143, 1987.
- [Turian et al.2010] Joseph P. Turian, Lev-Arie Ratinov, and Yoshua Bengio. Word representations: A simple and general method for semi-supervised learning. In ACL 2010, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, July 11-16, 2010, Uppsala, Sweden, pages 384–394, 2010.
- [Wielemaker et al.2012] Jan Wielemaker, Tom Schrijvers, Markus Triska, and Torbjörn Lager. Swi-prolog. TPLP, 12(1-2):67–96, 2012.
- [Wiesler et al.2014] Simon Wiesler, Alexander Richard, Ralf Schlüter, and Hermann Ney. Mean-normalized stochastic gradient for large-scale deep learning. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014, Florence, Italy, May 4-9, 2014, pages 180–184, 2014.
- [Zantema1995] Hans Zantema. Termination of term rewriting by semantic labelling. Fundam. Inform., 24(1/2):89–105, 1995.