Generalize Symbolic Knowledge With Neural Rule Engine

08/30/2018 ∙ by Shen Li, et al. ∙ 0

Neural-symbolic learning aims to combine the advantages of neural networks and symbolic knowledge to build better intelligent systems. As neural networks have come to dominate the state of the art in a wide range of NLP tasks, improving neural models by integrating symbolic knowledge has attracted considerable attention. Different from existing works, this paper investigates the combination of these two powerful paradigms from the knowledge-driven side. We propose Neural Rule Engine (NRE), which can learn knowledge explicitly from logic rules and then generalize them implicitly with neural networks. NRE is implemented with neural module networks in which each module represents an action of the logic rule. The experiments show that NRE could greatly improve the generalization abilities of logic rules with a significant increase in recall. Meanwhile, the precision is still maintained at a high level.




1 Introduction

Human cognition successfully integrates the connectionist and symbolic paradigms of Artificial Intelligence (AI). Yet the modelling of cognition has developed separately in the areas of neural computation and symbolic logic (Garcez et al., 2012). Neural networks are well known for their inductive learning and generalization capabilities on large amounts of data, while symbolic logic rules are mostly constructed with expert knowledge, which yields high precision and good interpretability.

Minsky (1991) states that both symbolic and connectionist approaches have virtues and deficiencies, and that we need integrated systems that can exploit the advantages of both. Recently, there has been a movement towards a fruitful combination of these two streams. Hu et al. (2016) present a teacher-student framework that encapsulates logical structured knowledge into a neural network by forcing the network to emulate the predictions of a rule-regularized teacher. Lu et al. (2017) propose Object-oriented Neural Programming (OONP), a framework that incorporates symbolic structures and neural models for semantic parsing. Luo et al. (2018) integrate regular expressions into neural networks at the input level, the network module level and the output level. The combination significantly enhances the performance of neural models when only a small number of training examples are available.

As pre-defined symbolic knowledge can greatly improve the learning effectiveness of neural models, a natural question arises: can neural networks help to improve the generalization abilities of rules? Nowadays, symbolic knowledge is still widely used in low-data scenarios or combined with statistical models. However, logic rules built with symbolic knowledge have poor generalization abilities and thus relatively low recall. For humans, it is very natural to learn a piece of knowledge first and then use a couple of cases to generalize the knowledge. Inspired by this learning strategy, we propose a neural rule engine (NRE), in which rules acquire higher flexibility and generalization ability with the help of neural networks while still maintaining the advantages of high precision and interpretability.

In this paper, we construct logic rules with regular expressions (REs), the most widespread rule format. NRE transforms the rules into neural module networks (NMN) which have symbolic structures. Specifically, the transformation involves two steps:

  • Parse a RE into an action tree composed of finite pre-defined actions. This operation is inspired by Kaplan and Kay (1994) where each RE is considered as a finite-state machine (FSM). The types and orders of actions are determined by a neural parser or a symbolic parser.

  • Represent the RE actions as neural-symbolic modules. Each module can be either a customized neural network or a symbolic algorithm.

With the neural rule engine, a system can learn knowledge explicitly from logic rules and generalize them implicitly with neural networks. It is not only an innovative paradigm of neural-symbolic learning, but also an effective solution for real-life applications, including improving existing rule-based systems and building neural rule methods for applications that do not have sufficient training data.

We apply this learning strategy to two tasks: Chinese crime case classification and relation classification. The experimental results show that NRE could greatly improve the generalization ability of rules with a significant increase in recall, while high precision is still maintained.

2 Related Work

Neural symbolic learning. The question of how to reconcile the statistical nature of learning with the logical nature of reasoning, that is, the effective integration of robust connectionist learning and sound symbolic reasoning, has been considered a main challenge and fundamental problem in AI (Valiant, 2003, 2008). Neural-symbolic computation, whose goal is to provide a coherent, unifying view of connectionism and logic in order to integrate neural networks and logic, has been shown to be a way of addressing this challenge.

Neural-symbolic systems have been applied to various problems, and successful applications include ontology learning, hardware/software specification, fault diagnosis, robotics, and training and assessment in simulators (Hitzler et al., 2005; de Penning et al., 2011; Garcez et al., 2015; Besold et al., 2017). Recently, there have been other research efforts in the topical proximity of the core field of neural-symbolic integration (Besold et al., 2017), including paradigms in computation and representation such as “conceptors” (Jaeger, 2014) and “Neural Turing Machines” (NTMs) (Graves et al., 2014), and other application systems that are based partially or fully on connectionist methods but applied to tasks which conceptually operate on a symbolic level, such as visual analogy-making (Reed et al., 2015) and Go-playing (Silver et al., 2016).

In addition, a line of research aims to encode symbolic rules into neural networks. Hu et al. (2016) propose a teacher-student framework to combine neural networks with logic rules, transferring the structured information encoded in the logic rules into the network parameters. Lu et al. (2017) present Object-oriented Neural Programming (OONP), which allows neural and symbolic representation and reasoning over complex structures for semantically parsing documents. Luo et al. (2018) exploit the expressiveness of regular expression rules at different levels of a neural network, aiming to use the information provided by rules to improve the performance of the network.

The previous works exploit symbolic rules to enhance neural networks. One should note that several other approaches have encoded prior knowledge into neural networks to enhance their representation and performance, which has also contributed to the development of neural-symbolic integration. Rules are expressions of knowledge, and a regular expression is a kind of rule that encodes a great deal of domain-specific knowledge. Strauß et al. (2016) build a decoder based on REs to speed up the decoding of NNs. Li et al. (2017) encode semantic features into CNN filters instead of initializing them randomly, which helps the filters focus on learning useful n-grams. Wang et al. (2017) propose a CNN-based framework that combines representations of short text for classification. The representations incorporate words and relevant concepts, which are conceptualized by rules from a knowledge base, on top of pre-trained word vectors. Xiao et al. (2017) exploit prior knowledge, such as a weighted context-free grammar and the likelihood that entities occur in the input, to form the “background” for RNN models. These works add knowledge, or more precisely the information carried by rules, into existing neural networks in a coarse way to improve their performance. In this paper, by contrast, we propose an innovative paradigm of neural-symbolic learning which transforms the rules into neural module networks with symbolic structures, allowing the system to learn knowledge explicitly from logic rules and generalize them implicitly with neural networks.

Neural module networks. Neural Module Networks (NMN), first proposed for Visual Question Answering (VQA), are composed of collections of jointly trained neural models (Andreas et al., 2016a). Inspired by recurrent neural networks and recursive neural networks, which both involve repeated application of a single computational module, any VQA network can be composed of a finite set of computational structures that are reusable in other networks. In the work of Andreas et al. (2016a), a natural language parser analyzes each question to determine the basic computational modules needed to answer the question, as well as the relations between the modules. The specific NMN for a question is dynamically composed of reusable modules based on linguistic structure. All modules in an NMN are independent and composable, and the computation for each problem is different because its architecture is different.

Andreas et al. (2016a) use a rule-based dependency parser (Zhu et al., 2013) to generate the layouts used to build an NMN. Following Andreas et al. (2016a), several improved methods have been proposed. Andreas et al. (2016b) present a model, called Dynamic Neural Module Network (DNMN), that learns to select layouts from a set of automatically generated candidate structures predicted by a dependency parser. Hu et al. (2017) propose End-to-End Module Networks (N2NMNs), which learn to generate network structures without the aid of a parser while at the same time learning network parameters.

Since each RE can be considered as a finite-state machine (FSM), we can use finite pre-defined actions to interpret any RE, each of which corresponds to a module. Inspired by Hu et al. (2017), we propose Neural Rule Engine (NRE), where REs are composed of finite, reusable, computational and joint-trained modules. With NRE, knowledge is learned explicitly from logic rules and generalized implicitly with neural networks.

3 Architecture

Figure 1: Overview. NRE is based on action modules and a rule parser, both of which can be either trained neural networks or algorithms. The circles in “Action Tree” are actions with their corresponding parameters, which are predicted by the rule parser in “Rule Parser”. In application, results are output by an action sequence composed of actions and their parameters following a specific layout.

REs benefit NLP in various tasks such as text classification, information extraction and text summarization, which correspond to various models, for example sequence labeling models and classification models (Kaur, 2014), since they are well known for high precision and strong interpretability. However, REs need to be carefully designed by humans and have poor generalization ability. Generally, a RE has limited coverage, and a large number of REs are needed to support a rule-based system.

It is worth noting that the learning process of human beings does not rely only on data (cases) or only on logic knowledge, but on the combination of them. For example, when a father introduces the concept of a bird to his child, he may point to a visual bird (a live one or an image) and tell his child that this flying animal with feathers is a bird. Given the knowledge of what a bird is, and maybe a couple of visual cases, the child becomes capable of recognizing a bird. Inspired by this, we propose a novel learning strategy that teaches models to learn rules the way children do: we first impart a rule to a model and then use a couple of cases to help the model generalize the knowledge.

The learning strategy can be applied to a wide variety of tasks. As shown in Figure 1, the output can be a labeled sequence whose length is the same as that of the case. In this respect, NRE can be considered a sequence labeling model for information extraction tasks. In addition, the output sequence can be mapped to a label, in which case NRE serves as a classification model for text classification tasks.

While neural modules are used to perform the actions in “Action Tree”, our purpose is to endow rules with generalization ability and flexibility through the neural modules and the neural layout parser, rather than to improve existing neural networks. Since the proposed system is composed of neural modules, a neural layout parser, and symbolic rules, it is called a “neural rule engine”. NRE is interpretable and flexible, and can be considered an “enhanced” rule engine.

We describe the detailed problem definition in 3.1, the overview of the model in 3.2, the implementation details of each neural module in 3.3, and the layout policy in 3.4. In 3.5, a staged training method is proposed where we first pretrain modules and the rule parser and then use reinforcement learning to jointly finetune them.

3.1 Problem Definition

Figure 2: Rule examples.

As illustrated previously, regular expressions (REs) are rules that encode a great deal of knowledge, and many rule-based systems use REs to represent rules. In this paper, we use REs to express rules and focus on how to enhance them, i.e., how to make REs flexible and generalizable. As shown above, our model can be used in various tasks; we take the classification task as an example. We define a rule as a sequence of RE tokens that corresponds to a label, and a case as a sequence of tokens. If the rule matches the case, the case is assigned the rule's label; if not, it is not. Given a case, a model needs to choose the correct labels from the label set of the problem, and the label of each rule is one of these labels.

To enhance the RE’s ability to express rules, we define a positive part and a negative part for each rule. The positive part is what should appear in a case, and the negative part is what should not. We use “@@” to join the two parts: the left side is the positive part and the right side is the negative part. As shown in the second case in Figure 2, a rule matches a case when its positive part matches the case and its negative part does not.

In this example, the relation between the two entities will be labeled as “Origin-Entity” if the rule matches the case. With our method, the rule can be generalized instead of being strictly and entirely limited by the RE grammar; in this sense, we believe the model “understands” the rule and learns how to improve it.
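The positive/negative convention can be illustrated with a purely symbolic matcher. This is a minimal sketch, assuming that each side of “@@” is an ordinary regular expression searched anywhere in the case; the function names `split_rule` and `rule_matches` and the example rule are hypothetical and are not taken from the paper.

```python
import re

def split_rule(rule: str):
    """Split a rule at '@@' into (positive, negative); negative is None if absent."""
    positive, sep, negative = rule.partition("@@")
    return positive, (negative if sep else None)

def rule_matches(rule: str, case: str) -> bool:
    """A rule matches when the positive part is found and the negative part is not."""
    positive, negative = split_rule(rule)
    if re.search(positive, case) is None:
        return False
    # An empty or missing negative part imposes no constraint.
    if negative and re.search(negative, case) is not None:
        return False
    return True

# Hypothetical rule: "made ... from" must appear, "not made" must not.
rule = r"made.*from@@not made"
print(rule_matches(rule, "the wine is made from grapes"))   # True
print(rule_matches(rule, "this is not made from plastic"))  # False
```

A matcher like this is the strict baseline behaviour that the neural modules are meant to relax.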

3.2 Overview

Our strategy maintains the high precision as well as the diversity of REs. More specifically, REs have a variety of forms and the strategy can cover the majority of REs. A RE can be considered as a finite-state machine (FSM) so we can use pre-defined operations, each of which corresponds to a module, to interpret the RE and simulate its function.

Our model consists of two main components: a set of actions (neural modules or simply mathematical functions) that provide basic operations, and a layout parser to predict a specific layout for every RE by which modules are dynamically assembled. An overview pipeline of our model is shown in Figure 1.

Given a rule, the rule parser first predicts a specific functional expression, shown under “Action Tree”, that consists of actions and a parameter for each action. The circles in the “Action Tree” are the predicted actions, and each of them has its corresponding parameters. In implementation, we represent REs in Reverse Polish Notation (RPN) (Burks et al., 1954), i.e., a post-order traversal of the syntax tree, for convenience. The rule parser can be either a trained neural network or a symbolic RE parsing algorithm. Given a case, every action outputs an intermediate result according to its parameter. A network is then assembled from the modules according to the specific RPN to produce the final output.
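To make the assembly step concrete, the following sketch evaluates an RPN action sequence with a stack, using symbolic stand-ins for the modules. It is an illustration under stated assumptions: only the name “Find_Positive” comes from the paper, while the action names “And” and “Or”, the exact-match behaviour of `find_positive`, and the merging logic are hypothetical choices made for this example.

```python
from typing import List, Optional, Tuple

Labels = List[int]
Action = Tuple[str, Optional[str]]

def find_positive(tokens: List[str], pattern: str) -> Labels:
    """Symbolic stand-in: label a token 1 if it equals the pattern exactly."""
    return [1 if tok == pattern else 0 for tok in tokens]

def both_found(left: Labels, right: Labels) -> Labels:
    """Keep the union of labels only when each side found at least one token."""
    if any(left) and any(right):
        return [max(a, b) for a, b in zip(left, right)]
    return [0] * len(left)

def either_found(left: Labels, right: Labels) -> Labels:
    """Union of the two label sequences."""
    return [max(a, b) for a, b in zip(left, right)]

def run_rpn(rpn: List[Action], tokens: List[str]) -> Labels:
    """Evaluate a post-order action sequence over a case with a stack."""
    stack: List[Labels] = []
    for action, param in rpn:
        if action == "Find_Positive":
            stack.append(find_positive(tokens, param))
        elif action in ("And", "Or"):
            right, left = stack.pop(), stack.pop()
            merge = both_found if action == "And" else either_found
            stack.append(merge(left, right))
        else:
            raise ValueError(f"unknown action: {action}")
    if len(stack) != 1:
        raise ValueError("malformed RPN sequence")
    return stack[0]

tokens = "he stole a red motorcycle".split()
rpn = [("Find_Positive", "stole"), ("Find_Positive", "motorcycle"),
       ("Find_Positive", "bike"), ("Or", None), ("And", None)]
print(run_rpn(rpn, tokens))  # [0, 1, 0, 0, 1]
```

Replacing any of these functions with a trained network that has the same input and output shapes leaves the evaluation loop unchanged, which is the point of the modular design.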

3.3 Neural Modules

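Action | Description | Detail
Find_Positive | Find tokens matching the pattern. | NN /
Find_Negative | Find tokens matching the pattern. | NN /
And_Ordered | Judge if the distance between the two inputs meets the distance parameter. | NN /
 | Judge if the two inputs are found together. |
 | Judge if one of the two inputs is found. |
 | Get output labels. |

Table 1: Actions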
Figure 3: “Find” module. Given a case and a pattern, the “Find” module labels every token of the case, indicating whether the token is “found” by the pattern. The figure shows an example of how one token is labeled.

In this paper, we define 6 basic actions, and most REs can be interpreted by them.

As shown in Table 1, the “Find_Positive”, “Find_Negative” and “And_Ordered” modules are based on neural networks, while the other modules are implemented as mathematical functions. The “Find” modules (“Find_Positive” and “Find_Negative”) aim to find the related words in a sentence and label them, which is similar to sequence labeling. The “And” modules are designed to process the relation between two groups of labels and output new labels.

The “Find” module is illustrated in Figure 3. Given a case and a pattern, the “Find” module labels every token of the case with 0 or 1, where 1 indicates that the token is matched by the pattern. For each token, different contexts are obtained with sliding windows of various lengths, and each context is encoded into a fixed-length vector by a neural network (Equation 1).

The pattern is then encoded into a fixed-length vector by the same network. Finally, a scoring function with a trainable matrix W (Equation 2) computes a score between every context vector and the pattern vector, and a sequence labeling model maps the scores to a label of 0 or 1 at every position of the case.

For instance, given a word and a pattern, we first extract contexts at different levels with a window size of 2; the contexts are built from the word and its neighboring tokens with the concatenation operator. The contexts and the pattern are then encoded into fixed-length vectors by the same encoder. After that, Equation 2 is used to compute the scores between the context vectors and the pattern vector. Finally, the label for the word is decided from the scores by any sequence labeling model. Figure 3 shows this process for one token: its contexts and the pattern are encoded by the same encoder, the scores are computed, and the label is output.
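The data flow of the “Find” module can be sketched in plain Python. Everything specific here is an assumption made for illustration: the window enumeration in `contexts`, the mean-of-embeddings `encode` (standing in for the paper's neural encoder), the bilinear form of the score with matrix `W`, and the fixed threshold (standing in for the sequence labeling model) are not taken from the paper, and the toy embeddings are invented.

```python
from typing import Dict, List

Vector = List[float]

def contexts(tokens: List[str], i: int, max_window: int) -> List[List[str]]:
    """All windows of length 1..max_window that contain position i."""
    out = []
    for width in range(1, max_window + 1):
        for start in range(i - width + 1, i + 1):
            end = start + width
            if start >= 0 and end <= len(tokens):
                out.append(tokens[start:end])
    return out

def encode(window: List[str], emb: Dict[str, Vector]) -> Vector:
    """Stand-in encoder: average the word vectors of the window."""
    vecs = [emb[t] for t in window]
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

def bilinear(c: Vector, W: List[Vector], p: Vector) -> float:
    """Assumed score: c^T W p with a (here fixed) matrix W."""
    return sum(c[i] * W[i][j] * p[j] for i in range(len(c)) for j in range(len(p)))

def find_labels(tokens, pattern, emb, W, max_window=2, threshold=0.5):
    """Label each token 1 if any of its contexts scores above the threshold."""
    p = encode(pattern, emb)
    labels = []
    for i in range(len(tokens)):
        best = max(bilinear(encode(c, emb), W, p)
                   for c in contexts(tokens, i, max_window))
        labels.append(1 if best > threshold else 0)
    return labels

# Toy 2-d embeddings in which "robbed" lies close to "stole".
emb = {"he": [0.0, 0.0], "robbed": [0.9, 0.1], "a": [0.0, 0.0],
       "car": [0.0, 1.0], "stole": [1.0, 0.0]}
W = [[1.0, 0.0], [0.0, 1.0]]
print(find_labels(["he", "robbed", "a", "car"], ["stole"], emb, W))  # [0, 1, 0, 0]
```

In this toy setting the pattern “stole” finds the token “robbed” although the strings differ, which is the kind of soft matching that a strict RE cannot provide.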

Figure 4: “And_Ordered” module. The rule is disassembled into a tree structure that serves as the action sequence. The figure shows how the root node “And_Ordered” merges the two label sequences computed by its two “And_Ordered” subnodes. The input to the CNN or RNN model is composed of five parts: the case, the sequential labels from the left and right child nodes, and two distance sequences computed from these labels and the distance parameter of the “And_Ordered” module. The module outputs the combined labels for every token of the case.

As shown in Figure 4, the “And_Ordered” module is designed to process the relation between two groups of sequential labels and output new labels. The module requires a distance parameter, which indicates the maximum distance allowed between the two input label sequences. A special value of the distance parameter means that the two input label sequences can be combined at an arbitrary distance. For the root node “And_Ordered” in Figure 4, the left child output marks the tokens “found” by its two “Find_Positive” actions, which are merged by the “And_Ordered” in the left node. The right child output is calculated in the same way. The distance sequence, which indicates the distance from the found tokens in a child output, is calculated from that output and the distance parameter.
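For comparison with the neural version, a purely symbolic “And_Ordered” can be sketched as follows. The semantics are assumptions for illustration: a left-found token must precede a right-found token, the distance is counted as the number of tokens strictly between them, and `max_dist=None` is used here to represent the arbitrary-distance setting (the paper's actual encoding of that setting is not reproduced).

```python
from typing import List, Optional

def and_ordered(left: List[int], right: List[int],
                max_dist: Optional[int] = None) -> List[int]:
    """Merge two label sequences when a left hit precedes a right hit
    with at most max_dist tokens in between (None means any distance)."""
    out = [0] * len(left)
    for i, left_hit in enumerate(left):
        if not left_hit:
            continue
        for j in range(i + 1, len(right)):
            if not right[j]:
                continue
            gap = j - i - 1  # tokens strictly between the two hits
            if max_dist is None or gap <= max_dist:
                out[i] = 1
                out[j] = 1
    return out

left = [0, 1, 0, 0, 0]   # e.g. "stole" found at position 1
right = [0, 0, 0, 0, 1]  # e.g. "motorcycle" found at position 4
print(and_ordered(left, right, max_dist=3))  # [0, 1, 0, 0, 1]
print(and_ordered(left, right, max_dist=1))  # [0, 0, 0, 0, 0]
print(and_ordered(right, left))              # [0, 0, 0, 0, 0]
```

The last call returns all zeros because the order of the two hits is reversed.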

We conduct experiments with CNN and RNN to determine which model works best. The experiments show that CNN and RNN are comparable in accuracy, but CNN is faster, so we select CNN to implement the modules.

3.4 Layout Policy

Figure 5: Rules are disassembled into tree structures. Each node is an action with a specific parameter, and its child nodes serve as its inputs.
Figure 6: An example showing how to linearize a RE to a sequence of modules and their parameters.

Every module can be either an analytical algorithm or a customized neural network. Similarly, a rule can be disassembled into a layout by either an analytical algorithm or a neural network. As shown in Figure 5, the rule can be taken apart into a tree structure. Inspired by Hu et al. (2017), we treat the tree structure as an RPN sequence for convenience. Figure 6 shows a RE example and its linearized RPN sequence, which consists of actions and parameters. The layout prediction problem then turns into a sequence-to-sequence learning problem from REs to modules and their parameters. We train a novel seq2seq model to predict RPNs from REs.
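Linearizing an action tree into RPN is a post-order traversal, which can be sketched as below. The `Node` class and the example tree are hypothetical; only the action names “And_Ordered” and “Find_Positive” appear in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    """One action in the tree, with its parameter and input subtrees."""
    action: str
    param: Optional[object] = None
    children: List["Node"] = field(default_factory=list)

def to_rpn(node: Node) -> List[Tuple[str, Optional[object]]]:
    """Post-order traversal: emit all children first, then the node itself."""
    seq: List[Tuple[str, Optional[object]]] = []
    for child in node.children:
        seq.extend(to_rpn(child))
    seq.append((node.action, node.param))
    return seq

# Hypothetical tree: "stole" followed by "motorcycle" within distance 3.
tree = Node("And_Ordered", 3, [
    Node("Find_Positive", "stole"),
    Node("Find_Positive", "motorcycle"),
])
print(to_rpn(tree))
# [('Find_Positive', 'stole'), ('Find_Positive', 'motorcycle'), ('And_Ordered', 3)]
```

Because every operand precedes its operator, the resulting sequence can be consumed left to right with a stack, and it is also a natural target sequence for a seq2seq parser.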

3.5 Training Method

The training is based on the pre-trained word vectors trained with fastText (Bojanowski et al., 2017) and all vectors are kept static during the training. We introduce methods to train modules and layouts separately.

For modules, the training is divided into two phases: pretraining and finetuning. In the pretraining phase, we generate the training data for modules from sentences in the training set, i.e., we randomly select patterns and sentences that are matched by the patterns. For a better recall, the maximum number of epochs is limited to avoid overfitting. In the finetuning phase, we use an algorithm to analyze the rule and construct a layout. The network is then assembled according to the layout, and the final prediction is given based on the results of each module. We use the final result to finetune the modules with reinforcement learning. Taking “Find_Positive” as an example, a “Find_Positive” module labels every token of the case with 0 or 1 for its pattern. If the final prediction contains a label that the ground truth does not, we consider that the “Find_Positive” modules “find” too many tokens, and we punish the “Find_Positive” modules whose predictions are 1.

Figure 7: An attentional seq2seq model to predict a layout.

The model for the layout parser is shown in Figure 7. Given a rule, the layout parser predicts a specific RPN consisting of actions and a parameter for each action. Because predicting actions and parameters at the same time is very difficult, we split the training process into three stages:

  1. predict actions

  2. predict parameters for every action based on the actions and the input sentence

  3. jointly finetune actions and parameters

In Stage 1, a sequence-to-sequence (seq2seq) model is used to predict an action sequence. Seq2seq models have benefited greatly from attention (Bahdanau et al., 2014), and beam search has been shown to decode target sequences better than greedy decoding. As shown in Figure 7, we use an attentional seq2seq model, which consists of a three-layer bidirectional LSTM and a one-layer unidirectional LSTM as the encoder and a one-layer LSTM as the decoder. Beam search is used to decode actions during prediction.

In Stage 2, the model shares the encoder of the first stage and adds a second encoder for the actions predicted in the first stage. The model then predicts the parameters of each predicted action. There is a trick to speed up training. Predicting a parameter requires searching across the entire vocabulary, which is inefficient. Since fixed word vectors are used, we can change the optimization target from Equation 3 to Equation 4, i.e., force the model to predict a word vector rather than directly predict the id of the target word.

In these equations, n is the length of the sentence and m is the size of the vocabulary. This trick greatly increases the speed of training and does not affect the accuracy. At prediction time, we predict a vector and take the word whose vector is closest to it, among all word vectors, as the predicted word.
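The vector-prediction trick can be sketched as a regression loss during training plus a nearest-neighbor lookup at prediction time. This is a hedged illustration: the squared Euclidean distance, the mean over positions, and the toy embedding table are assumptions, since the exact form of the paper's objective and its distance measure are not reproduced here.

```python
from typing import Dict, List

Vector = List[float]

def squared_distance(u: Vector, v: Vector) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))

def regression_loss(predicted: List[Vector], targets: List[str],
                    emb: Dict[str, Vector]) -> float:
    """Mean squared distance between predicted vectors and the fixed
    embeddings of the target words (no softmax over the vocabulary)."""
    total = sum(squared_distance(p, emb[t]) for p, t in zip(predicted, targets))
    return total / len(targets)

def nearest_word(pred: Vector, emb: Dict[str, Vector]) -> str:
    """Decode a predicted vector to the word with the closest embedding."""
    return min(emb, key=lambda w: squared_distance(pred, emb[w]))

# Toy fixed embedding table (hypothetical values).
emb = {"stole": [1.0, 0.0], "car": [0.0, 1.0], "knife": [1.0, 1.0]}
pred = [0.9, 0.2]
print(nearest_word(pred, emb))  # stole
```

During training the loss touches only the target embedding of each position, whereas a softmax objective would score all m vocabulary entries; the full table is scanned only when decoding.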

In the end, we adopt the same strategy as in the module training to adjust the layout policy. According to the predicted layout, we assemble the whole network with trained modules and get the final result. Then reinforcement learning is exploited to finetune the layout model based on the ground truth.

3.6 Assembling

The whole system is called Neural Rule Engine (NRE) since our learning strategy fully combines the advantages of the symbolic and the neural. Due to the modular design, each module can be implemented either by a specific algorithm of RE or by NN. At the same time, the layout of the action sequence can also be analyzed by an algorithm or a NN model. NRE has a symbolic architecture, while its internal modules and the rule parser of modules are NNs, organically combining the characteristics of the symbolic and the neural. The modules can be assembled in different ways for different needs at will.

4 Experiment

4.1 Baseline models

Figure 8: (a) is the basic sequence model without attention, called “LSTM-No-Attention”. (b) and (c) are attentional sequence models, “LSTM-Attention-Input” and “LSTM-Attention-Output” respectively.

Long Short-Term Memory (LSTM), a type of recurrent neural network (RNN), is known for its strong ability to model temporal sequences and capture long-range dependencies (Sak et al., 2014). Recently, the attention mechanism has become an integral part of sequence models (Bahdanau et al., 2014) and has improved model performance greatly. We propose three baseline models based on LSTM and attention, which are shown in Figure 8.

“LSTM-No-Attention” is illustrated in Figure 8(a). Given a RE and a case, both are encoded with LSTMs without attention. The final hidden state of the case LSTM is used as the initial hidden state of the rule LSTM. We use the final state of the rule LSTM to make the prediction.

Inspired by Bahdanau et al. (2014), in the “LSTM-Attention-Input” baseline we force the RE to attend to different parts of the case at each step of the RE input, as shown in Figure 8(b). As before, we use the final state of the LSTM to make the final prediction.

The last baseline model, “LSTM-Attention-Output”, is shown in Figure 8(c) and is based on “LSTM-No-Attention”. The LSTMs produce outputs for all timesteps of the case and of the RE. We force the case output at every timestep to attend to different parts of the RE outputs and concatenate the attended vector with the case output. Inspired by max-pooling in CNNs (Collobert et al., 2011), which encourages the network to capture the most useful local features produced by convolutional layers, we apply max-pooling to the concatenated outputs to predict the final labels.

4.2 Data

Figure 9: Data division.

We select two datasets to evaluate the performance of our proposed approach. One is a Chinese crime case classification dataset and the other is the relation classification dataset in English (Hendrickx et al., 2009). We released the rules accompanying the relation classification dataset on GitHub.

In Chinese crime case classification, each case generally consists of one to three sentences. There are 8 categories totaling 150 labels in the dataset, such as “Burglary” and “Motorcycling Robbery”. The entire dataset consists of 12555 cases, and each case corresponds to zero, one, or more than one label. The cases are split into training, validation, and test sets with a ratio of 80%, 10%, and 10%. We also write 239 REs and split the rules with the same 8:1:1 ratio. As shown in Figure 9, the data division must be strictly followed in order to avoid data leakage.
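An 8:1:1 division of both cases and rules can be sketched as follows. The shuffling, the fixed seed, and the integer rounding are assumptions made for this example, so the resulting split sizes are those of this sketch and not necessarily the exact counts used in the paper.

```python
import random
from typing import List, Sequence, Tuple

def split_811(items: Sequence, seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle deterministically and split into 80% / 10% / 10%."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end = n * 8 // 10
    val_end = n * 9 // 10
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]

# Stand-in ids for the 12555 cases and 239 rules mentioned in the text.
case_splits = split_811(range(12555))
rule_splits = split_811(range(239))
print([len(part) for part in case_splits])
print([len(part) for part in rule_splits])
```

Keeping the rule split and the case split separate and fixed means that test rules are never used while the modules are trained, which is one way to respect the leakage constraint described above.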

We also conduct experiments on the relation classification dataset “SemEval-2010 Task 8”, a multi-way classification of semantic relations between pairs of common nominals. Each case corresponds to one category, and there are 19 categories in total. There are 7109 cases in the training set, 891 in the validation set, and 2717 in the test set. We write 490 REs, of which 125 are for testing. Importantly, all the written rules are based on the training set.

4.3 Pipeline

As shown above, the “Find_Positive”, “Find_Negative” and “And_Ordered” modules are based on neural networks, while the other modules are implemented as mathematical functions. Since “Find_Negative” is trained in the same way as “Find_Positive”, we only present how to train the “Find_Positive” module. In this section, we introduce “Find_Positive”, “And_Ordered” and the pipeline for NRE.

For the “Find_Positive” module, we randomly generate patterns and samples based on cases in the training set, and then test the module on the real test set. More specifically, the “Find_Positive” module finds patterns in a sentence. In training, patterns are randomly chosen from sentences, which lets us generate a large amount of training data. At test time, the module is fed the real patterns of the REs in the test set so that the results are convincing.

For the “And_Ordered” module, we randomly generate the outputs of the subnodes and the “distance” based on the training set. As with the “Find_Positive” module, we test the “And_Ordered” module on the real test set.

After the modules are trained, we finetune them according to the labels of cases in the training set. First, a rule is disassembled to generate a Reverse Polish Notation sequence, which corresponds to an action sequence. Then the network is assembled from the action sequence and the trained modules to output labels. Finally, reinforcement learning (RL) is used to finetune the trained modules according to the true labels in the training set.

As for the generation of RPNs, a staged seq2seq model is proposed to implement the layout policy that generates RPNs from REs. The model first predicts the actions and then the parameters of the actions. After the layout policy is trained, we fix the trained modules and finetune the neural parser using RL, just as we finetune the modules.

4.4 Hyperparameters

Module | #Filters | Filter Size | Slide Window | Dropout | Embedding Size
Find_Positive | 200 | [1, 2, 3] | [1, 2, 3] | 0.5 | 300
Find_Negative | 200 | [1, 2] | [1, 2] | 0.5 | 300
And_Ordered | 100 | [3, 4, 5] | N/A | 0.5 | 300

Module | RNN Size | Beam Width | Dropout | Embedding Size
Rule Parser | 500 | 3 | 0.5 | 300

Table 2: Hyperparameters

We train models with the Adadelta optimizer (Zeiler, 2012) and finetune them with Stochastic Gradient Descent (SGD) (Kiefer et al., 1952) with a learning rate of 0.001. For the Chinese crime case classification dataset, we use fastText (Bojanowski et al., 2017) to pretrain word vectors on a corpus collected by Internet crawlers. For the relation classification dataset, we use publicly available word vectors (Mikolov et al., 2018). The embedding table in every model is not trainable; in other words, all word vectors are kept static during training. More hyperparameters are shown in Table 2.

4.5 Result

Figure 10: Tuning the threshold of confidence in baselines. (a) is “LSTM-No-Attention”. (b) and (c) are “LSTM-Attention-Input” and “LSTM-Attention-Output” respectively.
Model Precision Recall F1
RE 1.0 0.1779 0.3021
RE-Synonyms 0.9773 0.1897 0.3177
LSTM-No-Attention 0.2348 0.6382 0.3434
LSTM-Attention-Input 0.3106 0.3824 0.3428
LSTM-Attention-Output 0.3132 0.3515 0.3312
NRE-Char 0.8104 0.2515 0.3838
NRE-Char-Finetune 0.9261 0.2765 0.4258
NRE-Word 0.6127 0.3559 0.4502
NRE-Word-Finetune 0.9237 0.3382 0.4952
Table 3: NRE on the Chinese crime case classification dataset. With “Finetune”, NRE increases the recall of rules significantly while still maintaining a high precision.
Model Precision Recall F1
RE 0.9143 0.0589 0.1107
NRE-Word-Finetune 0.8837 0.1121 0.1990
Table 4: NRE on the relation classification dataset.

The results on the Chinese crime case classification dataset are summarized in Table 3, where our method significantly outperforms “RE” and the three sequence baselines. “RE” shows the result when rules are used as traditional regular expressions. To enhance the generalization of REs, we augment the rules by replacing words in the REs with their synonyms from a dictionary; the result is shown in “RE-Synonyms”. The performances of the three sequence baselines are listed in “LSTM-No-Attention”, “LSTM-Attention-Input”, and “LSTM-Attention-Output” respectively. Since tokens in Chinese can be represented as characters (Hanzi) or words, we conduct the experiments “NRE-Char” and “NRE-Word”, where cases and REs are based on characters or words. As described in Section 3.5, “Finetune” jointly optimizes the modules and the layout policy. To test whether “Finetune” takes effect, results with “Finetune” are listed in “NRE-Char-Finetune” and “NRE-Word-Finetune”, while results without “Finetune” are listed in “NRE-Char” and “NRE-Word”.

It can be seen that “RE” achieves 100% precision, which indicates that the REs we wrote are accurate. However, as illustrated previously, an RE covers only a small part of the data, so the recall of “RE” is very low (0.1779). “RE-Synonyms”, which tries to enhance RE generalization through hand-crafted synonym substitution, gains only a slight improvement in recall (0.1897). A likely reason is that the synonyms come from a dictionary and are not well suited to the dataset.
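The F1 column of Table 3 can be sanity-checked from the precision and recall columns. The sketch below assumes that the reported F1 is the harmonic mean of the reported precision and recall (the text does not state how scores are averaged); `f1_score` and `table3` are illustrative names, and deviations in the fourth decimal place are expected from rounding.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from Table 3.
table3 = {
    "RE": (1.0, 0.1779, 0.3021),
    "RE-Synonyms": (0.9773, 0.1897, 0.3177),
    "NRE-Word-Finetune": (0.9237, 0.3382, 0.4952),
}

for name, (p, r, reported) in table3.items():
    print(f"{name}: computed F1 = {f1_score(p, r):.4f}, reported = {reported:.4f}")
```

Under this assumption, the very low recall of “RE” dominates its F1 even though its precision is perfect, which is why the recall gains of NRE translate into a clearly higher F1.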

The sequence baselines achieve poor but explainable results. Our goal is to improve the generalization ability of logic rules while keeping their advantage of high precision, i.e., to increase recall while maintaining precision at a high level. For the baselines, the gain in recall is of little use because the precision (at most 0.3132) is too low for practical application. Since precision and recall trade off against each other, we adjust the decision threshold to further analyze the baseline models. As shown in Figure 10, precision increases and recall decreases as we raise the threshold, which is in line with our intuition. However, the precision remains poor even when the threshold is set very high. The reasons are as follows. An RE can be considered a hierarchical structure, while a sequence model reads the RE linearly. Special characters in REs such as “.” and “?” are difficult for a sequence model to capture. Besides, an RE requires attention to the whole context, whereas a sequence model generally considers local patterns of its input. Integrating attention, which lets sequence structures model dependencies between patterns regardless of their distance, slightly improves precision, but the performance remains unacceptable. “LSTM-Attention-Output” attains slightly higher precision than “LSTM-Attention-Input” (0.3132 vs. 0.3106), since the rule attends to the case directly at the output and the most useful pattern is captured by max-pooling. The baselines cannot handle the hierarchical structure and the symbolic reasoning of logic rules, which leads to their unacceptable performance.
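The threshold analysis can be illustrated with a small sketch. The scores and labels are made-up toy values, not data from the experiments, and `precision_recall_at` is a hypothetical helper; the sketch only shows how precision and recall are recomputed as the decision threshold moves.

```python
from typing import List, Tuple


def precision_recall_at(scores: List[float], labels: List[int],
                        threshold: float) -> Tuple[float, float]:
    """Precision and recall when every score >= threshold is predicted positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    # Convention (an assumption): precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy confidence scores and gold labels, for illustration only.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 0, 1, 1, 0, 1, 0, 0]

for threshold in (0.25, 0.50, 0.75, 0.92):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.3f}  recall={r:.3f}")
```

In this toy example, raising the threshold trades recall for precision, as in Figure 10. For the baselines in the paper, however, precision stays low even at high thresholds.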

Compared to “RE”, “NRE-Word-Finetune” increases the recall of the rules significantly (from 0.1779 to 0.3382) while still maintaining a high precision (0.9237). “Finetune” is critical because it fits the modules and the layout to each other: without it, the precision of “NRE-Word” is only 0.6127. Moreover, “NRE-Word-Finetune” is more effective than “NRE-Char-Finetune”, because words encode more semantics and generalize better than characters.

The results on the relation classification dataset are summarized in Table 4, where NRE raises the recall of the rules from 0.0589 to 0.1121 and the F1 from 0.1107 to 0.1990, at a modest cost in precision.

The experimental results demonstrate that NRE is capable of handling the hierarchical structure of REs and improving their generalization, while still maintaining high precision and interpretability.

4.6 Further Analysis of NRE

As shown in Section 4.5, NRE generalizes REs and increases their recall. We further analyze where the generalization comes from and what it consists of.

Model Precision Recall F1
RE 1.0 0.1779 0.3021
PN_AS 0.9313 0.3191 0.4754
PNA_S 0.9331 0.3279 0.4853
PNAS_ 0.9237 0.3382 0.4952
Table 5: Results on different combinations of neural networks and the RE algorithm in NRE. “RE” is the result when REs are implemented in the traditional way without NNs. “P”, “N” and “A” indicate the “Find_Positive”, “Find_Negative” and “And_Ordered” modules respectively, and “S” indicates the action sequence parser. For convenience, the parts of NRE implemented by NNs are on the left side of “_” and the parts implemented by the RE algorithm are on the right side. For example, “PN_AS” means that the “Find_Positive” and “Find_Negative” modules are implemented by NNs, while “And_Ordered” and the layout parser are RE algorithms.
Label Entity-Origin
RE </e1>.* extracted from .*<e2>@@
The <e1>woman</e1> was taken from her native <e2>family</e2>
and adopted in England on some relocation scheme in the 1960’s.
Label Entity-Origin
RE </e1>.* originate .*</e2>@@(signal|aberration|peakeffect|reduction).*</e1>
A <e1>nunt</e1> is a pastry originating from
Jewish <e2>cuisine</e2> and vaguely resembles nougat.
Label Entity-Destination
RE </e1>.* pushed into .*<e2>@@<e2>(function|care).*</e2>
Then, the target PET <e1>bottle</e1> was put inside
of a metal <e2>container</e2> , which was grounded.
Label Entity-Destination
RE </e1>.* pushed into .*<e2>@@<e2>(function|care).*</e2>
NASA Kepler mission sends <e1>names</e1>
into <e2>space</e2>.
Label Entity-Destination
RE </e1>.* pushed into .*<e2>@@<e2>(function|care).*</e2>
Ten million quake <e1>survivors</e1> moved into
makeshift <e2>houses</e2>.
Label Entity-Destination
RE </e1>.* derived from .*<e2>@@
A tidal wave of <e1>talent</e1> has emanated from
this lush <e2>village</e2>.
Table 6: Cases in the relation classification dataset.
Label 入室作案 (Burglary)
RE 入室@@死亡|工地
受害人籍希逵(男, 1976年6月23日,湖北省邯山县梅源乡岭根村兴岭路226-1)
Label 持锐器 (Holding a sharp instrument)
RE 水果刀@@刺绣厂|铁管|马刀帮|被盗|划分|石块|划破
Label 持枪 (Holding a gun)
RE 气枪@@
(女, 32岁, 江苏省三台忠孝乡南街080号附01号)门前的狗用猎枪打死后
Table 7: Cases in the Chinese crime case classification dataset.
Label 持钝器 (Holding the blunt)
RE 锤子@@围墙|被盗|落水管|刺伤|石头|钢管厂|不锈钢管
RE Parser
锤子 Find_Positive 围墙 Find_Negative 被盗 Find_Negative Or
落水管 Find_Negative 刺伤 Find_Negative Or Or 石头 Find_Negative 钢管厂
Find_Negative Or 不锈钢 Find_Negative 管 Find_Negative 0 And_Ordered Or
Or And_Unordered Output
Neural Parser
锤子 Find_Positive 围墙 Find_Negative 被盗 Find_Negative Or 钢管厂
Find_Negative Or 管 Find_Negative Or Or Output
Label 墙上挖洞 (Digging holes in the wall)
RE 打墙洞@@
RE Parser
打 Find_Positive 墙 Find_Positive 0 And_Ordered 洞 Find_Positive 0
And_Ordered Output
Neural Parser
打 Find_Positive 墙 Find_Positive 0 And_Ordered Output
Table 8: Action sequences in the Chinese crime case classification dataset.

Where the generalization comes from. We conduct experiments in which we combine neural networks and the RE algorithm in different ways. The results are shown in Table 5. Specifically, “RE” can be considered a special NRE model in which all modules and the layout parser are implemented by the RE algorithm. “PNAS_” is the model with the best flexibility and generalization ability, in which all components are neural networks. Recall improves by a large margin (from 0.1779 to 0.3191) when neural “Find_Positive” and “Find_Negative” modules replace the traditional algorithm. It improves again over “PN_AS” (to 0.3279) when “And_Ordered” is built with a neural network, and once more (to 0.3382) when the neural layout parser replaces the traditional disassembling algorithm. In short, the generalization comes from the neural “Find” and “And_Ordered” modules and the layout parser, among which the “Find” modules make the key contribution. Intuitively, “Find” modules match words against patterns, and neural “Find” modules can find words that are not strictly identical to the required words but are semantically similar to them. For example, a “Find” module can find “put inside” given “pushed into”.
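The intuition behind a soft “Find” can be sketched by contrasting exact token matching with similarity-based matching. This is not the neural module of NRE: cosine similarity over word vectors is used here only as a stand-in, and `TOY_VECTORS`, `exact_find` and `soft_find` are hypothetical names with hand-picked toy embeddings.

```python
import math
from typing import Dict, List

# Hand-picked 3-dimensional toy embeddings (assumed values, for illustration only).
TOY_VECTORS: Dict[str, List[float]] = {
    "pushed": [0.9, 0.1, 0.0],
    "put": [0.8, 0.2, 0.1],
    "into": [0.1, 0.9, 0.0],
    "inside": [0.2, 0.8, 0.1],
    "bottle": [0.0, 0.1, 0.9],
}


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def exact_find(pattern: str, tokens: List[str]) -> List[float]:
    """Symbolic matching: score 1.0 only where the token equals the pattern."""
    return [1.0 if tok == pattern else 0.0 for tok in tokens]


def soft_find(pattern: str, tokens: List[str],
              vectors: Dict[str, List[float]]) -> List[float]:
    """Similarity-based matching: score each token by clipped cosine similarity."""
    return [max(0.0, cosine(vectors[pattern], vectors[tok])) for tok in tokens]


tokens = ["bottle", "put", "inside"]
print(exact_find("pushed", tokens))              # no token matches exactly
print(soft_find("pushed", tokens, TOY_VECTORS))  # "put" receives the highest score
```

With exact matching the pattern word “pushed” is never found in the sentence, whereas the similarity-based variant assigns a high score to “put”, which mirrors the “pushed into” versus “put inside” example.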

What the generalization is. We analyze some cases that NRE covers and REs do not, to discover what the generalization of NRE consists of. As shown in Table 6, given an RE, NRE covers some cases that do not strictly match the RE but are semantically similar to it. For example, the case “The <e1>woman</e1> was taken from her native <e2>family</e2> and adopted in England on some relocation scheme in the 1960’s.” is covered by the RE “</e1>.* extracted from .*<e2>@@” in NRE. NRE generalizes “extracted from”, and thus “taken from” can be matched and activated by this RE. This is demonstrated by several cases from both the English and the Chinese datasets, shown in Table 6 and Table 7. Besides, action sequences can also be optimized by the neural parser. Taking the first case in Table 8 as an example, some actions and parameters are removed, which leads to better performance; for instance, “落水管” (downpipe) and “不锈钢管” (stainless steel pipe) are condensed into “管” (pipe).
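The action sequences in Table 8 are written in postfix (parenthesis-free) order, so they can be executed with a stack. The sketch below is a purely symbolic interpreter for a subset of the actions. Its semantics are assumptions made for illustration: “Find_Positive” returns the positions of exact matches, “And_Ordered” keeps right-hand matches that follow a left-hand match with at most the given number of tokens in between, and “Find_Negative” and “And_Unordered” are omitted. `run_actions` is a hypothetical name, and a purely numeric word would be misread as a distance.

```python
from typing import List, Union

Value = Union[str, int, List[int]]


def run_actions(actions: str, tokens: List[str]) -> bool:
    """Execute a whitespace-separated postfix action sequence over tokens."""
    stack: List[Value] = []
    for action in actions.split():
        if action == "Find_Positive":
            word = stack.pop()
            # Positions where the token is exactly the required word.
            stack.append([i for i, tok in enumerate(tokens) if tok == word])
        elif action == "And_Ordered":
            max_gap = stack.pop()  # allowed number of tokens in between
            right = stack.pop()
            left = stack.pop()
            stack.append([j for j in right
                          if any(0 < j - i <= max_gap + 1 for i in left)])
        elif action == "Or":
            right = stack.pop()
            left = stack.pop()
            stack.append(sorted(set(left) | set(right)))
        elif action == "Output":
            # The rule fires if any match position survives.
            return bool(stack.pop())
        elif action.isdigit():
            stack.append(int(action))  # distance parameter
        else:
            stack.append(action)       # word parameter for a later Find
    raise ValueError("action sequence has no Output action")


sequence = "打 Find_Positive 墙 Find_Positive 0 And_Ordered Output"
print(run_actions(sequence, list("打墙洞")))  # adjacent characters in order
print(run_actions(sequence, list("打破墙")))  # one character in between
```

In NRE each of these actions may instead be a neural module, and the sequence itself may be predicted by the neural parser; the stack discipline for assembling modules is the part this sketch is meant to convey.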

5 Conclusion

In this paper, we present a novel learning strategy in which neural networks and symbolic knowledge are combined from the knowledge-driven side. Based on this learning strategy, we propose Neural Rule Engine (NRE), in which rules obtain the flexibility and generalization ability of neural networks while still maintaining their precision and interpretability. NRE is able to learn knowledge explicitly from logic rules and generalize it implicitly with neural networks. NRE consists of action modules and a rule parser, each of which can be either a customized neural network or a symbolic algorithm. Given a rule, NRE first predicts a specific layout, and the modules are then dynamically assembled according to the layout to output the result. Besides, a staged training method is proposed in which we first pretrain the modules and the neural rule parser, and then use reinforcement learning to jointly finetune them. The experiments show that NRE can greatly improve the generalization ability of logic rules, with a significant increase in recall, while precision is still maintained at a high level. NRE is not only an innovative paradigm of neural-symbolic learning, but also an effective solution for industrial applications, e.g., upgrading existing rule-based systems and developing neural rule approaches that do not rely on large amounts of training data.

References


  • Andreas et al. (2016a) Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. (2016a). Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48.
  • Andreas et al. (2016b) Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. (2016b). Learning to compose neural networks for question answering. In Proceedings of NAACL-HLT, pages 1545–1554.
  • Bahdanau et al. (2014) Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Besold et al. (2017) Besold, T. R., Garcez, A. d., Bader, S., Bowman, H., Domingos, P., Hitzler, P., Kühnberger, K.-U., Lamb, L. C., Lowd, D., Lima, P. M. V., et al. (2017). Neural-symbolic learning and reasoning: A survey and interpretation. arXiv preprint arXiv:1711.03902.
  • Bojanowski et al. (2017) Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
  • Burks et al. (1954) Burks, A. W., Warren, D. W., and Wright, J. B. (1954). An analysis of a logical machine using parenthesis-free notation. Mathematical tables and other aids to computation, 8(46):53–57.
  • Collobert et al. (2011) Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.
  • de Penning et al. (2011) de Penning, H. L. H., d’Avila Garcez, A. S., Lamb, L. C., and Meyer, J.-J. C. (2011). A neural-symbolic cognitive agent for online learning and reasoning. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, volume 22, page 1653.
  • Garcez et al. (2015) Garcez, A., Besold, T. R., De Raedt, L., Földiak, P., Hitzler, P., Icard, T., Kühnberger, K.-U., Lamb, L. C., Miikkulainen, R., and Silver, D. L. (2015). Neural-symbolic learning and reasoning: contributions and challenges. In Proceedings of the AAAI Spring Symposium on Knowledge Representation and Reasoning: Integrating Symbolic and Neural Approaches, Stanford.
  • Garcez et al. (2012) Garcez, A. S. d., Broda, K. B., and Gabbay, D. M. (2012). Neural-symbolic learning systems: foundations and applications. Springer Science & Business Media.
  • Graves et al. (2014) Graves, A., Wayne, G., and Danihelka, I. (2014). Neural turing machines. arXiv preprint arXiv:1410.5401.
  • Hendrickx et al. (2009) Hendrickx, I., Kim, S. N., Kozareva, Z., Nakov, P., Ó Séaghdha, D., Padó, S., Pennacchiotti, M., Romano, L., and Szpakowicz, S. (2009). Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, pages 94–99. Association for Computational Linguistics.
  • Hitzler et al. (2005) Hitzler, P., Bader, S., and Garcez, A. (2005). Ontology learning as a use-case for neural-symbolic integration. In Proceedings of the IJCAI-05 Workshop on Neural-Symbolic Learning and Reasoning.
  • Hu et al. (2017) Hu, R., Andreas, J., Rohrbach, M., Darrell, T., and Saenko, K. (2017). Learning to reason: End-to-end module networks for visual question answering. CoRR, abs/1704.05526, 3.
  • Hu et al. (2016) Hu, Z., Ma, X., Liu, Z., Hovy, E., and Xing, E. (2016). Harnessing deep neural networks with logic rules. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2410–2420.
  • Jaeger (2014) Jaeger, H. (2014). Controlling recurrent neural networks by conceptors. arXiv preprint arXiv:1403.3369.
  • Kaplan and Kay (1994) Kaplan, R. M. and Kay, M. (1994). Regular models of phonological rule systems. Computational linguistics, 20(3):331–378.
  • Kaur (2014) Kaur, G. (2014). Usage of regular expressions in nlp. International Journal of Research in Engineering and Technology IJERT, 3(01):7.
  • Kiefer et al. (1952) Kiefer, J., Wolfowitz, J., et al. (1952). Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466.
  • Li et al. (2017) Li, S., Zhao, Z., Liu, T., Hu, R., and Du, X. (2017). Initializing convolutional filters with semantic features for text classification. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1884–1889.
  • Lu et al. (2017) Lu, Z., Cui, H., Liu, X., Yan, Y., and Zheng, D. (2017). Object-oriented neural programming (oonp) for document understanding. arXiv preprint arXiv:1709.08853.
  • Luo et al. (2018) Luo, B., Feng, Y., Wang, Z., Huang, S., Yan, R., and Zhao, D. (2018). Marrying up regular expressions with neural networks: A case study for spoken language understanding. arXiv preprint arXiv:1805.05588.
  • Mikolov et al. (2018) Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., and Joulin, A. (2018). Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
  • Minsky (1991) Minsky, M. L. (1991). Logical versus analogical or symbolic versus connectionist or neat versus scruffy. AI magazine, 12(2):34.
  • Reed et al. (2015) Reed, S. E., Zhang, Y., Zhang, Y., and Lee, H. (2015). Deep visual analogy-making. In Advances in neural information processing systems, pages 1252–1260.
  • Sak et al. (2014) Sak, H., Senior, A., and Beaufays, F. (2014). Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth annual conference of the international speech communication association.
  • Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484.
  • Strauß et al. (2016) Strauß, T., Leifert, G., Grüning, T., and Labahn, R. (2016). Regular expressions for decoding of neural network outputs. Neural Networks, 79:1–11.
  • Valiant (2003) Valiant, L. G. (2003). Three problems in computer science. Journal of the ACM (JACM), 50(1):96–99.
  • Valiant (2008) Valiant, L. G. (2008). Knowledge infusion: In pursuit of robustness in artificial intelligence. In LIPIcs-Leibniz International Proceedings in Informatics, volume 2. Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
  • Wang et al. (2017) Wang, J., Wang, Z., Zhang, D., and Yan, J. (2017). Combining knowledge with deep convolutional neural networks for short text classification. In Proceedings of IJCAI, volume 350.
  • Xiao et al. (2017) Xiao, C., Dymetman, M., and Gardent, C. (2017). Symbolic priors for rnn-based semantic parsing. In Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), pages 4186–4192.
  • Zeiler (2012) Zeiler, M. D. (2012). Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
  • Zhu et al. (2013) Zhu, M., Zhang, Y., Chen, W., Zhang, M., and Zhu, J. (2013). Fast and accurate shift-reduce constituent parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 434–443.