Rules used in Neural Rule Engine.
Neural-symbolic learning aims to combine the advantages of neural networks and symbolic knowledge to build better intelligent systems. As neural networks dominate the state-of-the-art results in a wide range of NLP tasks, improving neural models by integrating symbolic knowledge has attracted considerable attention. Different from existing works, this paper investigates the combination of these two powerful paradigms from the knowledge-driven side. We propose Neural Rule Engine (NRE), which can learn knowledge explicitly from logic rules and then generalize them implicitly with neural networks. NRE is implemented with neural module networks in which each module represents an action of the logic rule. The experiments show that NRE can greatly improve the generalization abilities of logic rules with a significant increase in recall, while precision is maintained at a high level.
Human cognition successfully integrates the connectionist and symbolic paradigms of Artificial Intelligence (AI). Yet the modelling of cognition develops separately in the areas of neural computation and symbolic logic (Garcez et al., 2012). Neural networks are well known for their inductive learning and generalization capabilities on large amounts of data, while symbolic logic rules are mostly constructed with expert knowledge, which yields high precision and good interpretability.
Minsky (1991) states that both symbolic and connectionist approaches have virtues and deficiencies, and that we need integrated systems that can exploit the advantages of both. Recently, there has been a movement towards a fruitful combination of these two streams. Hu et al. (2016) present a teacher-student framework that encapsulates logically structured knowledge into a neural network by forcing the network to emulate the predictions of a rule-regularized teacher. Lu et al. (2017) propose Object-oriented Neural Programming (OONP), a framework that incorporates symbolic structures and neural models for semantic parsing. Luo et al. (2018) integrate regular expressions into neural networks at the input level, the network module level and the output level. The combination significantly enhances the performance of neural models when only a small number of training examples are available.
As pre-defined symbolic knowledge can greatly improve the learning effectiveness of neural models, it raises the converse question: can neural networks help to improve the generalization abilities of rules? Nowadays, symbolic knowledge is still widely used in low-data scenarios or combined with statistical models. However, logic rules built with symbolic knowledge have poor generalization abilities and thus relatively low recall. For humans it is very natural to learn a piece of knowledge first, and then use a couple of cases to generalize the knowledge. Inspired by this learning strategy, we propose a neural rule engine (NRE), in which rules acquire higher flexibility and generalization ability with the help of neural networks, while still maintaining the advantages of high precision and interpretability.
In this paper, we construct logic rules with the most widespread regular expressions (REs). NRE transforms the rules into neural module networks (NMN) which have symbolic structures. Specifically, the transformation involves two steps:
1. Parse a RE into an action tree composed of finite pre-defined actions. This operation is inspired by Kaplan and Kay (1994), where each RE is considered as a finite-state machine (FSM). The types and orders of actions are determined by a neural parser or a symbolic parser.
2. Represent the RE actions as neural-symbolic modules. Each module can be either a customized neural network or a symbolic algorithm.
With the neural rule engine, a system can learn knowledge explicitly from logic rules and generalize them implicitly with neural networks. It is not only an innovative paradigm of neural-symbolic learning, but also an effective solution to real-life applications, including improving existing rule-based systems and building neural rule methods for applications which do not have sufficient training data.
We apply this learning strategy to two tasks: Chinese crime case classification and relation classification. The experimental results show that NRE can greatly improve the generalization ability of rules, with a significant increase in recall, while high precision is still maintained.
Neural symbolic learning The question of how to reconcile the statistical nature of learning with the logical nature of reasoning, that is, the effective integration of robust connectionist learning and sound symbolic reasoning, has been considered a main challenge and fundamental problem in AI (Valiant, 2003, 2008). Neural-symbolic computation, whose goal is to provide a coherent, unifying view of connectionism and logic by integrating neural networks and logic, has been shown to be a way of addressing this challenge.
Neural-symbolic systems have been applied to various problems, and successful applications include ontology learning, hardware/software specification, fault diagnosis, robotics, and training and assessment in simulators (Hitzler et al., 2005; de Penning et al., 2011; Garcez et al., 2015; Besold et al., 2017). Recently, there have been other research efforts in the topical proximity of the core field of neural-symbolic integration (Besold et al., 2017), including paradigms in computation and representation such as “conceptors” (Jaeger, 2014) and “Neural Turing Machines” (NTMs) (Graves et al., 2014), and other application systems which are based partially or fully on connectionist methods but are applied to tasks which conceptually operate on a symbolic level, such as visual analogy-making (Reed et al., 2015) and Go-playing (Silver et al., 2016).
In addition, a line of research aims to encode symbolic rules into neural networks. Hu et al. (2016) propose a teacher-student framework to combine neural networks with logic rules, transferring the structured information encoded in the logic rules into the network parameters. Lu et al. (2017) present Object-oriented Neural Programming (OONP), which allows neural and symbolic representation and reasoning over complex structures for semantically parsing documents. Luo et al. (2018) exploit the expressiveness of regular expression rules at different levels of a neural network, aiming to use the information provided by rules to improve the performance of the network.
The previous works exploit symbolic rules to enhance neural networks. One should also note that several approaches have encoded prior knowledge into neural networks to enhance their representation and performance, which has also contributed to the development of neural-symbolic integration. Rules are expressions of knowledge, and a regular expression is a kind of rule that carries a lot of knowledge for specific domains. Strauß et al. (2016) build a decoder based on REs to speed up the decoding of NNs. Li et al. (2017) encode semantic features into CNN filters instead of initializing them randomly, which helps the filters focus on learning useful n-grams. Wang et al. (2017) propose a CNN-based framework that combines representations of short text for classification. The representations incorporate words and relevant concepts, conceptualized by rules in a knowledge base, on top of pre-trained word vectors. Xiao et al. (2017) exploit prior knowledge such as a weighted context-free grammar and the likelihood that entities occur in the input to form the “background” of RNN models. These works roughly add knowledge, or more precisely the information carried by rules, into existing neural networks to improve their performance. In this paper, by contrast, we propose an innovative paradigm of neural-symbolic learning which transforms the rules into neural module networks with symbolic structures, allowing the system to learn knowledge explicitly from logic rules and generalize them implicitly with neural networks.
Neural module networks Neural Module Networks (NMN), first proposed for Visual Question Answering (VQA), are composed of collections of jointly trained neural models (Andreas et al., 2016a). Inspired by recurrent neural networks and recursive neural networks, which both involve repeated application of a single computational module, any VQA network can be composed of finite computational structures which are reusable for other networks. In the work of Andreas et al. (2016a), a natural language parser analyzes each question to determine the basic computational modules needed to answer the question, as well as the relations between the modules. The specific NMN for a question is dynamically composed of reusable modules based on linguistic structure. All modules in an NMN are independent and composable, and the computation for each problem is different due to its different architecture.
Andreas et al. (2016a) use a rule-based dependency parser (Zhu et al., 2013) to generate layouts to build NMNs. Following Andreas et al. (2016a), some improved methods have been proposed. Andreas et al. (2016b) present a model that learns to select layouts from a set of automatically generated candidate structures predicted by a dependency parser, which is called the Dynamic Neural Module Network (DNMN). Hu et al. (2017) propose End-to-End Module Networks (N2NMNs), which learn to generate network structures without the aid of a parser while at the same time learning network parameters.
Since each RE can be considered as a finite-state machine (FSM), we can use finite pre-defined actions to interpret any RE, each of which corresponds to a module. Inspired by Hu et al. (2017), we propose Neural Rule Engine (NRE), where REs are composed of finite, reusable, computational and joint-trained modules. With NRE, knowledge is learned explicitly from logic rules and generalized implicitly with neural networks.
REs benefit NLP in various tasks such as text classification, information extraction and text summarization, which correspond to various models, for example sequence labeling models and classification models (Kaur, 2014), since they are well known for high precision and strong interpretability. However, REs need to be carefully designed by humans and generalize poorly. Generally, a RE has limited coverage, and a large number of REs are needed to support a rule-based system.
It is worth noting that the learning process of human beings does not rely only on data (cases) or on logic knowledge, but on the combination of them. For example, when a father introduces the concept of a bird to his child, he may point to a visual bird (a live one or an image), and tell his child that this flying animal with feathers is a bird. Given the knowledge of birds and maybe a couple of visual cases, the child becomes able to recognize a bird. Inspired by this, we propose a novel learning strategy that teaches models to learn rules like children: we first impart a rule to the model, and then use a couple of cases to help the model generalize the knowledge.
The learning strategy can be widely applied to a variety of tasks. As shown in Figure 1, the output can be a labeled sequence whose length is the same as that of the case. In this respect, NRE can be considered a sequence labeling model for information extraction tasks. Alternatively, the output sequence can be mapped to a label, in which case NRE serves as a classification model for text classification tasks.
While neural modules are used to perform actions in the “Action Tree”, our purpose is to endow rules with generalization ability and flexibility through the neural modules and the neural layout parser, rather than to improve existing neural networks. Since the proposed system is composed of neural modules, a neural layout parser, and symbolic rules, it is called a “neural rule engine”. NRE is interpretable and flexible, and can be considered an “enhanced” rule engine.
A staged training method is proposed, in which we first pretrain the modules and the rule parser and then use reinforcement learning to jointly finetune them.
As illustrated previously, regular expressions (REs) are rules involving a lot of knowledge and a lot of rule-based systems use REs to represent rules. In this paper, we use REs to interpret rules and focus on how to enhance REs, i.e., making REs flexible and can be generalized. As shown above, our model can be used in various tasks and we take the classification task as an example. We define a rule: and a case: where is a token in the regular expression, means rule corresponds to label , and is a token in the case. If rule matches case , , and if not, . Given a case , models need to choose the correct labels from , and is one of the labels in the problem.
To enhance the RE’s ability to express rules, we define the positive and negative parts of a rule. The positive part is what should appear in a case, and the negative part is what should not. We then use “@@” to join the two parts of a rule: the left side is the positive part and the right side is the negative part. As shown in the second case in Figure 2, rule matches case when matches and doesn’t.
In this example, the relation between and will be labeled as “Origin-Entity” if the rule matches the case. By our method, the rule can be generalized instead of being strictly and entirely limited by the RE grammar, by which we believe the model “understands” the rule and learns how to improve it.
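The positive/negative convention can be illustrated with a plain (non-neural) matcher. This is a minimal sketch, assuming that a rule matches when its positive part is found in the case and its negative part, if present, is not; the function names `split_rule` and `rule_matches` and the two example sentences are made up for illustration, while the rule string is one of the REs listed later in this paper.

```python
import re
from typing import Tuple


def split_rule(rule: str) -> Tuple[str, str]:
    """Split a rule on the first '@@' into (positive, negative) parts.

    The negative part is an empty string when nothing follows '@@'.
    """
    positive, _, negative = rule.partition("@@")
    return positive, negative


def rule_matches(rule: str, case: str) -> bool:
    """Return True if the positive part matches and the negative part does not."""
    positive, negative = split_rule(rule)
    if not re.search(positive, case):
        return False
    # An empty negative part places no restriction on the case.
    return not (negative and re.search(negative, case))


rule = r"</e1>.* pushed into .*<e2>@@<e2>(function|care).*</e2>"
hit = "The <e1>cork</e1> was pushed into the <e2>bottle</e2>."
blocked = "The <e1>patient</e1> was pushed into <e2>care homes</e2>."

print(rule_matches(rule, hit))      # positive part found, negative part absent
print(rule_matches(rule, blocked))  # negative part found, so the rule is rejected
```

This strict matcher is the behaviour that NRE is meant to relax: the neural modules replace the exact `re.search` calls so that near-matches can also fire.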
Our strategy maintains the high precision as well as the diversity of REs. More specifically, REs have a variety of forms and the strategy can cover the majority of REs. A RE can be considered as a finite-state machine (FSM) so we can use pre-defined operations, each of which corresponds to a module, to interpret the RE and simulate its function.
Our model consists of two main components: a set of actions (neural modules or simple mathematical functions) that provide basic operations, and a layout parser that predicts a specific layout for every RE, according to which the modules are dynamically assembled. An overview of the pipeline of our model is shown in Figure 1.
Given a rule , the rule parser first predicts a specific functional expression , listed in “Action Tree”, consisting of actions and parameters for the actions, where indicates the action and indicates the parameter for action . The circles in the “Action Tree” are the predicted actions, and each of them has its corresponding parameters. In our implementation, we use Reverse Polish Notation (RPN) (Burks et al., 1954), i.e., a post-order traversal of the syntax tree, to represent REs for convenience. The rule parser can be either a trained neural network or the RE algorithm. Given a case, every action outputs an intermediate result according to the parameter . Then a network is assembled with the modules according to the specific RPN to get the final output.
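To make the role of the RPN layout concrete, the sketch below evaluates a post-order action sequence with a stack. It is an illustration under stated assumptions, not the implementation used in this paper: the modules here are exact symbolic stand-ins, the action names "And" and "Or" are placeholders for the two "Judge" actions, and the layout format (a list of `(action, parameter)` pairs) is hypothetical.

```python
from typing import List, Tuple

Labels = List[int]


def find(tokens: List[str], pattern: str) -> Labels:
    """Symbolic stand-in for a Find module: 1 where the token equals the pattern."""
    return [1 if tok == pattern else 0 for tok in tokens]


def found_together(left: Labels, right: Labels) -> Labels:
    """Keep the union of both label sets only if each side found something."""
    if any(left) and any(right):
        return [max(a, b) for a, b in zip(left, right)]
    return [0] * len(left)


def found_either(left: Labels, right: Labels) -> Labels:
    """Union of both label sets."""
    return [max(a, b) for a, b in zip(left, right)]


def run_layout(layout: List[Tuple[str, str]], tokens: List[str]) -> Labels:
    """Evaluate an RPN (post-order) action sequence with a stack."""
    stack: List[Labels] = []
    for action, param in layout:
        if action == "Find_Positive":
            stack.append(find(tokens, param))
        elif action in ("And", "Or"):
            # Binary actions consume the two most recent intermediate results.
            right = stack.pop()
            left = stack.pop()
            combine = found_together if action == "And" else found_either
            stack.append(combine(left, right))
        else:
            raise ValueError(f"unknown action: {action}")
    if len(stack) != 1:
        raise ValueError("malformed layout: expected exactly one final result")
    return stack[0]


tokens = "the cork was pushed into the bottle".split()
layout = [
    ("Find_Positive", "pushed"),
    ("Find_Positive", "into"),
    ("And", ""),
    ("Find_Positive", "shoved"),
    ("Or", ""),
]
print(run_layout(layout, tokens))
```

Because the layout is a flat sequence, swapping a symbolic module for a neural one only changes the function called for that action; the stack evaluation stays the same.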
|Find tokens matching .||NN /|
|Find tokens matching .||NN /|
|Judge if and are found together.|
|Judge if one of and is found.|
|Get output labels.|
In this paper, we define 6 basic actions, and most REs can be interpreted by them.
As shown in Table 1, the “Find_Positive”, “Find_Negative” and “And_Ordered” modules are based on neural networks, while the other modules are implemented with mathematical methods. “Find” modules (“Find_Positive” and “Find_Negative”) aim to find the related words in a sentence and then label them, which is similar to sequence labeling. “And” modules are designed to process the relation between two groups of labels and output new labels.
The “Find” module is illustrated in Figure 3. Given case and action , the “Find” module aims to label 0 or 1 (1 indicates this token is matched with ) for every token of . Different contexts are obtained by sliding windows of various lengths, and each of them is encoded to a fixed-length vector by a neural network (Equation 1).
where is the context of the word, and is the representation of context .
Then is encoded to a fixed-length vector by the same NN. Finally, we use Equation 2 to calculate the scores between every context and the fixed-length vector , and then we decide the label 0 or 1 from the scores using a sequence labeling model, which maps scores to a label at every position of the case.
where W is a matrix which can be trained. For instance, given a word and a pattern , we first extract different levels contexts , when the window size is 2, where is , is , is and is the concatenation operator. Then and are encoded to fixed-length vectors and by the same encoder. After that, we use 2 to calculate scores between and . Finally, label for is decided according to by any sequence labeling model. As shown in Figure 3, the contexts for token is where , , and and pattern is . Then and are encoded by the same encoder to and . In the end, we get the score and output label .
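The scoring step can be sketched in a few lines. Everything specific below is an assumption made for illustration: the encoder is a mean of word vectors standing in for the trained network, the score is taken to be a bilinear form between a context vector and the pattern vector through the trainable matrix W, the contexts are all windows of up to `max_window` tokens that contain the current token, and the final sequence labeling model is replaced by a simple threshold on the best score. The toy embeddings are invented.

```python
import math
from typing import Dict, List

Vector = List[float]


def contexts(tokens: List[str], i: int, max_window: int) -> List[List[str]]:
    """All windows of length 1..max_window that contain position i."""
    out = []
    for k in range(1, max_window + 1):
        for start in range(i - k + 1, i + 1):
            if start >= 0 and start + k <= len(tokens):
                out.append(tokens[start:start + k])
    return out


def encode(words: List[str], emb: Dict[str, Vector]) -> Vector:
    """Stand-in encoder: mean of word vectors (unknown words map to zeros)."""
    dim = len(next(iter(emb.values())))
    vecs = [emb.get(w, [0.0] * dim) for w in words]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def bilinear(c: Vector, W: List[Vector], p: Vector) -> float:
    """Assumed score: c^T W p."""
    return sum(c[r] * W[r][k] * p[k] for r in range(len(c)) for k in range(len(p)))


def find_labels(tokens, pattern, emb, W, max_window=2, threshold=0.6) -> List[int]:
    """Label each token 1 if its best-scoring context is close to the pattern."""
    p = encode(pattern, emb)
    labels = []
    for i in range(len(tokens)):
        best = max(bilinear(encode(c, emb), W, p) for c in contexts(tokens, i, max_window))
        prob = 1.0 / (1.0 + math.exp(-best))  # squash the score to (0, 1)
        labels.append(1 if prob > threshold else 0)
    return labels


emb = {
    "pushed": [1.0, 0.0], "shoved": [0.9, 0.1], "into": [0.0, 1.0],
    "was": [0.0, 0.0], "cork": [-1.0, -1.0], "bottle": [-1.0, -1.0],
}
W = [[1.0, 0.0], [0.0, 1.0]]  # identity stands in for the trained matrix
tokens = ["cork", "was", "shoved", "into", "bottle"]
print(find_labels(tokens, ["pushed", "into"], emb, W))
```

In this toy setting the tokens "shoved" and "into" are labeled even though the pattern is "pushed into", which is the kind of soft matching the neural “Find” module is intended to provide.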
As shown in Figure 4, “And_Ordered” module is designed to process the relation between two groups of sequential labels and output new labels. Distance parameter is required for “And_Ordered” module, indicating the maximum distances between two input sequential labels. Distance means that the two input labels can be combined at an arbitrary distance. For the root node “And_Ordered” in figure 4, the left child output is , indicating that tokens and are “found” respectively by “Find_Positive: ” and “Find_Positive: ”, and they are merged by “And_Ordered” in the left node. The right child output is calculated in the same way. , indicating the distance from the found token in , is calculated based on and distance parameter .
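For intuition, a purely symbolic counterpart of “And_Ordered” can be written directly. This is a sketch under assumptions, not the neural module described above: distance is measured in token positions, `None` is used here as a hypothetical way to express "no distance limit", and the output marks both members of every valid ordered pair.

```python
from typing import List, Optional


def and_ordered(left: List[int], right: List[int],
                max_distance: Optional[int] = None) -> List[int]:
    """Merge two 0/1 label sequences when a left hit precedes a right hit.

    A pair (i, j) is valid when left[i] == 1, right[j] == 1, i < j and,
    if max_distance is given, j - i <= max_distance.
    """
    if len(left) != len(right):
        raise ValueError("label sequences must have the same length")
    out = [0] * len(left)
    for i, l_hit in enumerate(left):
        if not l_hit:
            continue
        for j in range(i + 1, len(right)):
            if right[j] and (max_distance is None or j - i <= max_distance):
                out[i] = 1
                out[j] = 1
    return out


left = [0, 1, 0, 0, 0]
right = [0, 0, 0, 1, 0]
print(and_ordered(left, right, max_distance=2))  # pair is within the limit
print(and_ordered(left, right, max_distance=1))  # pair is too far apart
print(and_ordered(right, left))                  # wrong order, nothing is merged
```

A neural version can soften the hard cutoff on `j - i`, which is one place where additional recall may come from.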
We conduct experiments with CNN and RNN to determine which model works best. Experiments show that CNN and RNN are comparable in accuracy, but CNN is faster, so we select CNN to implement the modules.
For every module, it can be an analytical algorithm or a customized neural network. Similarly, for a rule, it can be disassembled to a layout by an analytical algorithm or a neural network. As shown in Figure 5, the rule can be taken apart to a tree structure. Inspired by Hu et al. (2017), we treat the tree structure as a RPN for convenience. Figure 6 shows a RE example and its linearized RPN sequence, which consists of actions and parameters. Then the layout prediction problem turns into a sequence-to-sequence learning problem from REs to modules and their parameters. We train a novel seq2seq model to predict RPNs from REs.
The training is based on the pre-trained word vectors trained with fastText (Bojanowski et al., 2017) and all vectors are kept static during the training. We introduce methods to train modules and layouts separately.
For modules, the training is divided into two phases: pretraining and finetuning. In the pretraining phase, we generate the training data for modules based on sentences in the training set, i.e., we randomly select patterns and sentences which are matched by the patterns. For a better recall, a maximum epoch number is set to avoid overfitting. In the finetuning phase, we use an algorithm to analyze the rule and construct a layout. Then the network is assembled according to the layout and the final prediction is given based on the results of each module. We use the final result to finetune the modules with reinforcement learning. Taking “Find_Positive” modules as an example, a “Find_Positive” module is to label 0 or 1 in every token of for . If the final prediction contains while the ground truth doesn’t, we believe that the “Find_Positive” modules “find” too many tokens, and we punish the “Find_Positive” modules whose predictions are 1.
The model for the layout parser is shown in Figure 7. Given a rule , the layout parser predicts a specific RPN consisting of actions and parameters, where indicates the action and indicates the parameter for action . Because predicting actions and parameters at the same time is very difficult, we split the training process into three stages:
1. predict the action sequence from the RE
2. predict parameters for every action based on the actions and the input sentence
3. jointly finetune actions and parameters
In Phase 1, a sequence-to-sequence (seq2seq) model is utilized to predict an action sequence. Seq2seq models have benefited greatly from attention (Bahdanau et al., 2014), and beam search has proven to be a better method for decoding target sequences than greedy decoding. As shown in Figure 7, we use an attentional seq2seq model, which consists of a three-layer bidirectional LSTM and a one-layer unidirectional LSTM as the encoder and a one-layer LSTM as the decoder. Beam search is used to decode actions during prediction.
In Phase 2, the model shares the same encoder with the first phase and adds an encoder to encode the actions predicted in the first phase. Then the model predicts the parameters for each predicted action. There is a trick to speed up training. The parameters of an action need to be searched across the entire vocabulary, which is inefficient. Since fixed word vectors are used, we can change the optimization target from Equation 3 to Equation 4, i.e., force the model to predict a word vector rather than directly find the id of the target word.
where n is the length of the sentence, and m the size of the vocabulary. This trick greatly increases the speed of training and doesn’t affect the accuracy. At the time of prediction, we predict a vector and consider the closest word as the predicted word based on all word vectors.
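The prediction-time lookup described above amounts to a nearest-neighbour search over the fixed word vectors. The sketch below assumes cosine similarity as the closeness measure, which the text does not specify; the function names and the toy vectors are invented for illustration.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_word(predicted: Vector, emb: Dict[str, Vector]) -> str:
    """Map a predicted vector back to the closest word in the vocabulary."""
    return max(emb, key=lambda word: cosine(predicted, emb[word]))


emb = {"pipe": [1.0, 0.0], "wall": [0.0, 1.0], "gun": [-1.0, 0.2]}
predicted = [0.8, 0.3]  # hypothetical output of the parameter predictor
print(nearest_word(predicted, emb))
```

During training the model only has to regress a vector of embedding size, so the cost per step no longer depends on the vocabulary size; the vocabulary-wide search is paid once per predicted parameter at inference time.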
In the end, we adopt the same strategy as in the module training to adjust the layout policy. According to the predicted layout, we assemble the whole network with trained modules and get the final result. Then reinforcement learning is exploited to finetune the layout model based on the ground truth.
The whole system is called Neural Rule Engine (NRE) since our learning strategy fully combines the advantages of the symbolic and the neural. Due to the modular design, each module can be implemented either by a specific algorithm of RE or by NN. At the same time, the layout of the action sequence can also be analyzed by an algorithm or a NN model. NRE has a symbolic architecture, while its internal modules and the rule parser of modules are NNs, organically combining the characteristics of the symbolic and the neural. The modules can be assembled in different ways for different needs at will.
Long Short-Term Memory (LSTM), a type of recurrent neural network (RNN), is noted for its strong ability to model temporal sequences and capture long-range dependencies (Sak et al., 2014). Recently, the attention mechanism has become an integral part of sequence models (Bahdanau et al., 2014) and has improved model performance greatly. We propose three baseline models based on LSTM and attention, which are shown in Figure 8.
“LSTM-No-Attention” is illustrated in Figure 8(a). Given a RE , and a case , the case and the rule are both encoded with LSTM-No-Attention. The final hidden state of the case LSTM is used as the initial hidden state of the rule LSTM. We use the final state of the rule LSTM to get the prediction.
Inspired by Bahdanau et al. (2014), we force the RE to attend to different parts of the case at each step of RE input in the “LSTM-Attention-Input” baseline, which is shown in Figure 8(b). Similarly, we still use the final state of the LSTM to obtain the final prediction.
The last baseline model, “LSTM-Attention-Output”, is shown in Figure 8(c) and is based on “LSTM-No-Attention”. Outputs for all time steps of the case and RE are and . We force every time step of the case output to attend to different parts of and concatenate them with . They become . Inspired by max-pooling in CNN (Collobert et al., 2011), which encourages the network to capture the most useful local features produced by convolutional layers, we utilize max-pooling on to predict the final labels.
We select two datasets to evaluate the performance of our proposed approach. One is the Chinese crime case classification dataset and the other is the relation classification dataset in English (Hendrickx et al., 2009). We released the rules accompanying the relation classification dataset on GitHub: https://github.com/shenshen-hungry/Neural-Rule-Engine.
In Chinese crime case classification, each case is generally composed of one to three sentences. There are 8 categories totaling 150 labels in the dataset, such as “Burglary” and “Motorcycling Robbery”. The entire dataset consists of 12555 cases, and each case corresponds to zero, one or more than one label. The cases are split into training, validation and test sets with a ratio of 80%, 10%, and 10%. Meanwhile, we write 239 REs and split the rules with the same 8:1:1 ratio. As shown in Figure 9, the data division should be strictly followed in order to avoid data leakage.
Besides, we also conduct experiments in the relation classification dataset “SemEval-2010 Task 8”, a multi-way classification of semantic relations between pairs of common nominals. Each case corresponds to one category, and there are 19 categories in total. Among them, there are 7109 cases in training set, 891 in validation set and 2717 in test set. We write 490 REs, of which 125 are for testing. Importantly, all the written rules are based on the training set.
As shown above, the “Find_Positive”, “Find_Negative” and “And_Ordered” modules are based on neural networks, while the other modules are implemented with mathematical methods. Since “Find_Negative” is the same as “Find_Positive”, we only present how to train the “Find_Positive” module. In this section, we introduce “Find_Positive”, “And_Ordered” and the pipeline for NRE.
For the “Find_Positive” module, we randomly generate patterns and samples based on cases in the training set. Then we test the “Find_Positive” module on the real test set. More specifically, the “Find_Positive” module is to find patterns in a sentence. In training, patterns are randomly chosen from sentences, by which we can generate massive data for training. At test time, the module is fed with real patterns of REs in the test set for a convincing result.
For the “And_Ordered” module, we randomly generate the outputs of subnodes and the “distance” based on the training set. As with the “Find_Positive” module, we test the “And_Ordered” module on the real test set.
After the modules are trained, we finetune the trained modules according to the labels of cases in the training set. Firstly, a rule is disassembled to generate a Reverse Polish Notation, which corresponds to an action sequence. Then the network is assembled based on the action sequence and the trained modules to output labels. Finally, reinforcement learning (RL) is used to finetune the trained modules according to the true labels in the training set.
As for the generation of RPN, a staged seq2seq model is proposed to implement the layout policy to generate RPNs from REs. The model first predicts actions and then parameters of the actions. After the layout policy is trained, we fix the trained modules and finetune the neural parser using RL just as we finetune modules.
|Module||#Filters||Filter Size||Slide Window||Dropout||Embedding Size|
|Find_Positive||200||[1, 2, 3]||[1, 2, 3]||0.5||300|
|Find_Negative||200||[1, 2]||[1, 2]||0.5||300|
|And_Ordered||100||[3, 4, 5]||N/A||0.5||300|
|Module||RNN Size||Beam Width||Dropout||Embedding Size|
We train models with the Adadelta optimizer (Zeiler, 2012) and finetune them by Stochastic Gradient Descent (SGD) (Kiefer et al., 1952) with learning rate 0.001. In the Chinese crime case classification dataset, we utilize fastText (Bojanowski et al., 2017) to pretrain the word vectors on a corpus collected by Internet crawlers. In the relation classification dataset, we use publicly available word vectors (Mikolov et al., 2018), available at https://fasttext.cc/docs/en/english-vectors.html. The embedding table in every model is not trainable; in other words, all word vectors are kept static during training. More hyperparameters are shown in Table 2.
The results on the Chinese crime case classification dataset are summarized in Table 3, where our method significantly outperforms “RE” and the three sequence baselines. “RE” reports the result when rules are applied as traditional regular expressions. To enhance the generalization of REs, we augment rules by replacing words of REs with their synonyms from a dictionary, and the result is shown in “RE-Synonyms”. The performances of the three sequence baselines are listed in “LSTM-No-Attention”, “LSTM-Attention-Input”, and “LSTM-Attention-Output” respectively. Since tokens can be represented as characters (Hanzi) or words in Chinese, we conduct experiments “NRE-Char” and “NRE-Word”, where cases and REs are based on characters or words. As shown in Section 3.5, “Finetune” jointly optimizes the modules and the layout policy. We conduct experiments to test whether “Finetune” takes effect, where results with “Finetune” are listed in “NRE-Char-Finetune” and “NRE-Word-Finetune” while results without “Finetune” are in “NRE-Char” and “NRE-Word”.
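The “RE-Synonyms” augmentation can be pictured as simple string substitution over the rule text. This is only a sketch of one plausible way to do it: the function name, the one-word-at-a-time substitution policy, and the tiny synonym dictionary are assumptions, and the actual dictionary and procedure used in the experiments are not specified here.

```python
import re
from typing import Dict, List


def augment_with_synonyms(rule: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Return the original rule plus one variant per (word, synonym) pair.

    Only one word is replaced per variant; combinations are not generated.
    """
    variants = [rule]
    for word, alternatives in synonyms.items():
        pattern = re.compile(r"\b" + re.escape(word) + r"\b")
        if not pattern.search(rule):
            continue
        for alt in alternatives:
            # A function replacement avoids interpreting backslashes in `alt`.
            variants.append(pattern.sub(lambda m, a=alt: a, rule))
    return variants


rule = "</e1>.* derived from .*<e2>"
synonyms = {"derived": ["obtained", "extracted"]}  # hypothetical dictionary entry
for variant in augment_with_synonyms(rule, synonyms):
    print(variant)
```

Each variant is still an exact-match RE, so coverage grows only as far as the dictionary happens to overlap with the wording of the dataset.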
It can be seen that “RE” achieves 100% precision, which indicates that the REs we write are accurate. However, as illustrated previously, a RE can only cover a small part of the data, and the recall of “RE” is very low. “RE-Synonyms”, which aims to enhance RE generalization with synonyms in a hand-crafted way, gains only a slight improvement in recall. An apparent reason is that the synonyms come from a dictionary and are not well suited to the dataset.
The sequence baselines achieve poor but explainable results. Our goal is to improve the generalization abilities of logic rules while still maintaining the advantage of high precision, i.e., to increase the recall and maintain the precision at a high level. In the baselines, the increase in recall is of little value since the precision is unacceptable, which makes the models unusable in practice. Precision and recall are subject to a trade-off, so we adjust the decision threshold to further analyze the performance of the baseline models. As shown in Figure 10, the precision increases and the recall decreases when we raise the threshold, which is in line with our intuition. However, the precision is still poor even when we raise the threshold to a very high level. The reasons are as follows. A RE can be considered a hierarchical structure, while a sequence model reads a RE in a linear way. Special characters in REs such as “.” and “?” are difficult for a sequence model to capture. Besides, a RE requires attention to the whole context, but a sequence model generally considers local patterns of its inputs. By integrating attention, which gives sequence structures the ability to model dependencies between patterns regardless of their distance, the model performance is slightly improved but still unacceptable. “LSTM-Attention-Output” outperforms “LSTM-Attention-Input” since the rule attends to the case directly in the output and the most useful pattern is captured by max-pooling. The baselines cannot handle the hierarchical structure and the symbolic reasoning of logic rules, which leads to the unacceptable performance.
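The threshold analysis can be reproduced for any scoring model with a few lines. The sketch below uses invented scores and gold labels purely to show the mechanics; it does not reproduce the numbers behind Figure 10, and returning 0.0 for an undefined precision or recall is a convention chosen here.

```python
from typing import List, Tuple


def precision_recall(scores: List[float], gold: List[int],
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall when predicting positive for scores >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


scores = [0.9, 0.8, 0.6, 0.4, 0.2]  # made-up model confidences
gold = [1, 0, 1, 1, 0]              # made-up ground-truth labels
for threshold in (0.5, 0.85):
    p, r = precision_recall(scores, gold, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold trades recall for precision; the observation in this section is that, for the sequence baselines, this trade never reaches an acceptable precision.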
Compared to “RE”, “NRE-Word-Finetune” increases the recall of rules significantly while still maintaining a high precision. It can be seen that “Finetune” is critical since it makes the modules and the layout more fitted. Moreover, “NRE-Word-Finetune” is more effective than “NRE-Char-Finetune” because words encode more semantics and have better generalization ability than characters.
The results are summarized in Table 4 on relation classification dataset, where NRE improves the performance of rules by a significant margin.
Experimental results demonstrate that NRE is capable of handling the hierarchical structure of RE and promoting REs, and still maintaining the high precision and interpretability.
As shown in Section 4.5, NRE generalizes REs and increases their recall. We further analyze where the generalization comes from and what it consists of.
|RE||</e1>.* extracted from .*< e2> @@|
|RE||</e1>.* originate .*</e2>@@(signal|aberration|peakeffect|reduction).*</e1>|
|RE||</e1>.* pushed into .*<e2>@@<e2>(function|care).*</e2>|
|RE||</e1>.* derived from .*<e2>@@|
|Label||持锐器 (Holding a sharp instrument)|
|Label||持枪 (Holding a gun)|
|Label||持钝器 (Holding a blunt instrument)|
|Label||墙上挖洞 (Digging holes in the wall)|
Where the generalization comes from. We conduct experiments in which we combine neural networks and the RE algorithm in different forms. The results are shown in Table 5. Specifically, “RE” can be considered a special NRE model in which all modules and the layout parser are implemented by the RE algorithm. “PNAS_” is the model with the best flexibility and generalization ability, in which all modules are neural networks. Recall improves by a significant margin when neural “Find_Positive” and “Find_Negative” modules replace the traditional algorithm. Performance increases again over “PN_AS” if “And_Ordered” is built with a neural network, and it improves further when the neural layout parser replaces the traditional disassembling algorithm. In short, the generalization comes from the neural “Find” and “And_Ordered” modules and the layout parser, among which the “Find” modules make the key contribution. Intuitively, “Find” modules aim to match the words in a case against patterns, and neural “Find” modules can find words that are not strictly identical to the required words but are semantically similar to them. For example, a “Find” module can find “put inside” given “pushed into”.
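The difference between a symbolic and a soft “Find” can be sketched as follows. This is not the neural “Find” module of NRE, which is a trained network; as a stand-in, the sketch scores each window of the case by cosine similarity between hand-picked toy word vectors. The names `EMB`, `find_exact`, and `find_soft` and all vector values are assumptions made for illustration.

```python
import math

# Hypothetical 3-d word vectors chosen by hand; a real system would use learned embeddings.
EMB = {
    "pushed": [0.9, 0.1, 0.0],
    "put":    [0.8, 0.2, 0.1],
    "into":   [0.1, 0.9, 0.0],
    "inside": [0.2, 0.8, 0.1],
    "the":    [0.0, 0.1, 0.9],
    "box":    [0.3, 0.3, 0.6],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def windows(tokens, n):
    """All contiguous windows of n tokens, left to right."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


def find_exact(pattern, tokens):
    """Symbolic Find: 1.0 where the pattern occurs verbatim, 0.0 elsewhere."""
    return [1.0 if w == pattern else 0.0 for w in windows(tokens, len(pattern))]


def find_soft(pattern, tokens):
    """Soft Find: score a window by its weakest word-to-word cosine similarity."""
    return [
        min(cosine(EMB[p], EMB[t]) for p, t in zip(pattern, w))
        for w in windows(tokens, len(pattern))
    ]


pattern = ["pushed", "into"]
tokens = ["put", "inside", "the", "box"]
exact_scores = find_exact(pattern, tokens)
soft_scores = find_soft(pattern, tokens)
print(exact_scores)                        # no window matches verbatim
print([round(s, 2) for s in soft_scores])  # the "put inside" window scores highest
```

The exact matcher returns zero everywhere, while the soft matcher assigns its highest score to the window “put inside”, which mirrors the kind of semantic match described above.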
What the generalization is. We analyze some cases that NRE covers but REs do not, to discover what the generalization of NRE is. As shown in Table 6, given an RE, NRE covers some cases that do not strictly match the RE but are semantically similar to it. For example, the case “The <e1> woman </e1> was taken from her native <e2> family </e2> and adopted in England on some relocation scheme in the 1960s.” is covered in NRE: NRE generalizes the pattern of the RE, so the case can be matched and activated by this RE. This is demonstrated by several cases in both the Chinese and English datasets, shown in Table 6 and Table 7. In addition, action sequences can be optimized by the neural parser. Taking the first case in Table 8 as an example, some actions and parameters are removed, which leads to better performance; for instance, “落水管” (downpipe) and “不锈钢管” (stainless steel pipe) are condensed into “管” (pipe).
In this paper, we present a novel learning strategy in which neural networks and symbolic knowledge are combined from the knowledge-driven side. Based on this learning strategy, we propose Neural Rule Engine (NRE), in which rules obtain the flexibility and generalization ability of neural networks while still maintaining precision and interpretability. NRE is able to learn knowledge explicitly from logic rules and generalize them implicitly with neural networks. NRE consists of action modules and a rule parser, each of which can be either a customized neural network or a symbolic algorithm. Given a rule, NRE first predicts a specific layout, and the modules are then dynamically assembled according to the layout to output the result. In addition, a staged training method is proposed in which we first pretrain the modules and the neural rule parser and then use reinforcement learning to jointly finetune them. The experiments show that NRE can greatly improve the generalization ability of logic rules, with a significant increase in recall, while precision is maintained at a high level. NRE is not only an innovative paradigm of neural-symbolic learning but also an effective solution for industrial applications, e.g., upgrading existing rule-based systems and developing neural rule approaches that do not rely on large amounts of training data.
Journal of Machine Learning Research, 12(Aug):2493–2537.
Hendrickx, I., Kim, S. N., Kozareva, Z., Nakov, P., Ó Séaghdha, D., Padó, S., Pennacchiotti, M., Romano, L., and Szpakowicz, S. (2009). SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, pages 94–99. Association for Computational Linguistics.
Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466.
Combining knowledge with deep convolutional neural networks for short text classification. In Proceedings of IJCAI, volume 350.