Specific information extraction is a basic stage of text understanding and is usually conducted as a sequence labelling task. Typical approaches include traditional linear models and recurrent-neural-network-based models. The former, including Hidden Markov Models (HMM) and Conditional Random Fields (CRF), are based on hand-crafted features, which lack information such as word order. Neural-network-based models, on the other hand, have achieved outstanding successes recently. The most significant difference between the two types of models is that neural models use distributed representations and non-linear transformations, which enable them to build a more complex language model. The most successful models are LSTM-CRF and LSTM-CNNs-CRF.
Recent research on the human cortex shows that several parts of the cortex are activated when listening to speech, corresponding to different levels of language structure. Moreover, this phenomenon only happens when the listener knows the language, which suggests that a multi-level distributed representation may be a promising way to build a language model. Many attempts have been made based on multi-scale recurrent neural networks, which can be divided into two types. The first type has several recurrent layers, each with its own update period, such as CW-RNN. The second type uses a gating mechanism to control the flow from low-level hidden states to high-level ones, such as hierarchical recurrent neural networks. Here, mRR is designed in a different way: its representation levels correspond to language structures, namely word, sentence and paragraph.
Different from LSTM-CRF and other neural-network-based models, mRR encodes text into a hierarchical memory stack, which enables more complex non-linear transformations of the whole text. After the memory stack is established, an RNN-based controller reads part of the memory at each time step and takes an action to predict the current tag. There are three read-heads, which are updated after each action is taken, indicating which part of the memory is accessible. The whole process ends when one of the read-heads reaches the end of the text. Figure 1 shows the architecture of mRR.
Road Map: The remainder of this section is organized as follows. Section 2.1 describes the Text Encoder part of mRR and Section 2.2 describes the Controller part.
2.1 Text Encoder
The Text Encoder takes not only the raw text but also the structure information as input, and outputs a hierarchical memory with three levels: word level, sentence level and paragraph level. A memory is generally defined as a matrix of potentially infinite size; here we instead limit the memory to three pre-declared matrices $M^w \in \mathbb{R}^{n_w \times d_w}$, $M^s \in \mathbb{R}^{n_s \times d_s}$ and $M^p \in \mathbb{R}^{n_p \times d_p}$, with $n$ locations and $d$-dimensional values in each location at each level. $n$ is always instance-dependent, while $d$ is pre-determined by the algorithm. In our implementation, the memory at each level has a different $d$.
The structure information is given as two index sequences: $S = (s_1, \dots, s_{n_s})$, which denotes the begin indexes of sentences, and $P = (p_1, \dots, p_{n_p})$, which denotes the begin indexes of paragraphs. The Text Encoder has three parts: a word encoder, a sentence encoder and a paragraph encoder.
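To make the data layout concrete, the following is a minimal sketch of the hierarchical memory as a container, in PyTorch; the field names and the representation of the index sequences are our own illustration, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List
import torch

@dataclass
class HierarchicalMemory:
    """Three fixed-size matrices, one per representation level."""
    word: torch.Tensor        # M^w, shape (n_w, d_w): one row per word
    sentence: torch.Tensor    # M^s, shape (n_s, d_s): one row per sentence
    paragraph: torch.Tensor   # M^p, shape (n_p, d_p): one row per paragraph
    sent_begin: List[int]     # S: begin index (in words) of each sentence
    para_begin: List[int]     # P: begin index (in words) of each paragraph
```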
2.1.1 Word Encoder
The word encoder takes the raw text as input and outputs the word-level memory $M^w \in \mathbb{R}^{n_w \times d_w}$, where $n_w$ is the number of words in the document and $d_w$ is the dimension of the word-level memory. As illustrated in Figure 3, the building process is as follows:
1. We apply a word embedding layer to the raw text to obtain several vector sequences, each corresponding to a sentence.
2. The vector sequences are fed into a bidirectional long short-term memory (bi-LSTM) layer, and the memory matrix $M^w$ is generated by concatenating all hidden states.
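A minimal sketch of these two steps, assuming PyTorch and assuming the bi-LSTM is run per sentence as described above; the layer sizes are placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn

class WordEncoder(nn.Module):
    """Word embedding + bi-LSTM over each sentence; the hidden states of
    all sentences are concatenated into the word-level memory M^w."""
    def __init__(self, vocab_size: int, emb_dim: int = 100, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, sentences):
        # sentences: list of 1-D LongTensors, one token-id sequence per sentence
        states = []
        for sent in sentences:
            emb = self.embed(sent).unsqueeze(0)   # (1, len, emb_dim)
            h, _ = self.bilstm(emb)               # (1, len, 2 * hidden)
            states.append(h.squeeze(0))
        return torch.cat(states, dim=0)           # (n_w, d_w), d_w = 2 * hidden
```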
2.1.2 Sentence Encoder
2.1.3 Paragraph Encoder
For each paragraph, we apply element-wise max-pooling on the sub-matrix of $M^s$ corresponding to the sentences belonging to it, generating $M^p$, as shown in Figure 5.
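Since the paragraph encoder is element-wise max-pooling over segments, a single helper covers it; we further assume the sentence encoder pools the word-level memory per sentence in the same way, which is our assumption rather than the paper's stated design.

```python
import torch

def pool_segments(mem: torch.Tensor, begins: list) -> torch.Tensor:
    """Element-wise max-pooling over consecutive row segments of `mem`.
    `begins` holds the first row of each segment; the last segment runs
    to the end of the matrix."""
    bounds = list(begins) + [mem.size(0)]
    pooled = [mem[a:b].max(dim=0).values for a, b in zip(bounds, bounds[1:])]
    return torch.stack(pooled)  # one row per segment

# Paragraph memory from sentence memory (Section 2.1.3). Here
# `para_begin_sent` would hold paragraph begin indexes counted in
# sentences, an assumed bookkeeping detail:
#   M_p = pool_segments(M_s, para_begin_sent)
```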
2.2 Controller

The Controller consists of an RNN layer and nine feed-forward neural networks (FNNs), each with a one-dimensional output. At each time step, the RNN uses three read-heads to read the hierarchical memory and updates its hidden state, which is fed to the FNNs to generate an action. The generated tags are appended to the previous result and the locations of the read-heads are updated at the same time, as illustrated in Figure 6.
The Controller is trained as an action agent, which reads part of the hierarchical memory and chooses an action (generating part of the tag sequence) at each time step. There are nine available actions, corresponding to the nine feed-forward neural networks: mark a word/sentence/paragraph as non-event, current event or new event. The available actions and the corresponding tag sequences are shown in Table 1; a sketch of how an action expands into tags follows the table.
| action level | non-event | current event | new event |
| --- | --- | --- | --- |
| word | -1 | mark | mark + 1 |
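As an illustration of how Table 1 expands an action into tags: the word row follows the table directly, while the sentence/paragraph behaviour (tagging every word in the span the same way) is our assumption, since those rows of the table are not reproduced here.

```python
def apply_action(kind: str, span_len: int, event_id: int):
    """Expand one action into per-word tags following Table 1.
    kind: 'non-event', 'current event' or 'new event'.
    span_len: 1 for a word-level action; otherwise the length in words
    of the sentence/paragraph (assumed behaviour)."""
    if kind == 'non-event':
        return [-1] * span_len, event_id            # tag -1, event mark unchanged
    if kind == 'current event':
        return [event_id] * span_len, event_id      # reuse the current mark
    return [event_id + 1] * span_len, event_id + 1  # new event: open a new mark
```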
Read Memory: The decision process runs from the beginning of the text to the end. A location vector with three dimensions, corresponding to the three levels of the accessible hierarchical memory, indicates the current location; it is initialized to point at the first word, sentence and paragraph of the text. At each time step, the available memory is loaded into the Controller, as illustrated in Figure 6.
Generate Action: The chosen part of the memory is used to update the state of the Controller, which then generates nine scores. At each time step, the action with the highest score is chosen, and the tag sequence is generated according to Table 1.
Update Location: The action is also used to update the location vector mentioned above, depending on the level at which the action was made. The new vector points at the next word if the action is at the word level, or at the first word of the next sentence or paragraph if the action is at the sentence or paragraph level, respectively. An example is shown in Figure 8.
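Putting the three steps together, here is a minimal controller sketch in PyTorch; the GRU cell, the concatenated read, and the layer sizes are implementation assumptions on our part, not details given in the paper.

```python
import torch
import torch.nn as nn

class Controller(nn.Module):
    """RNN state updated from the memory rows under the three read-heads,
    scored by nine one-dimensional FNNs (3 levels x 3 action types)."""
    def __init__(self, d_w: int, d_s: int, d_p: int, hidden: int = 256):
        super().__init__()
        self.cell = nn.GRUCell(d_w + d_s + d_p, hidden)
        self.scorers = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(9)])

    def step(self, mem, heads, state):
        # Read Memory: one row from each level, at the current head positions
        read = torch.cat([mem.word[heads[0]],
                          mem.sentence[heads[1]],
                          mem.paragraph[heads[2]]])
        # Generate Action: update state, score the nine actions, pick the best
        state = self.cell(read.unsqueeze(0), state)
        scores = torch.cat([f(state) for f in self.scorers], dim=1)  # (1, 9)
        action = scores.argmax(dim=1).item()
        # Update Location is done by the caller: advance the head at the
        # acted level and jump lower-level heads to the next unit's begin index
        return action, scores, state
```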
We used a dataset of legal judgement documents, each of which contains the information of a criminal and his criminal record. In each sample there can be one or several theft events with different times, locations and victims. The length of a sample ranges from 1,500 to 7,000 words, and the number of events in each sample ranges from 1 to 74. The whole dataset has 8,299 samples and was labelled by 5 annotators.
We use both supervised learning and reinforcement learning (policy gradient) strategies to train our model. At each time step, we generate action labels by comparing the predicted tag sequence with the ground truth. An action label is set to '1' only if all of its predictions are correct, and '0' otherwise. However, several actions may all be correct at the same time, so at each time step we draw a multinomial sample from the correct actions to guarantee that the model follows a correct path. Because the length of the text influences inference speed, we want the model to take actions corresponding to longer sequences whenever they are correct. This cannot be achieved by supervised learning alone, because the fewest-actions policy may not yield the highest accuracy. We therefore introduce a policy-gradient-based method: after processing one sample, the model receives a reward based on the number of actions divided by the length of the text. In this way, the model finds a balance between efficiency and precision.
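A sketch of the reward and the REINFORCE-style loss for this phase; the exact functional form of the reward is our assumption, since the text only states that it is based on the number of actions divided by the text length.

```python
import torch

def efficiency_reward(num_actions: int, text_len: int) -> float:
    """Fewer actions per word yields a higher reward (assumed form)."""
    return 1.0 - num_actions / text_len

def reinforce_loss(log_probs: torch.Tensor, reward: float) -> torch.Tensor:
    """Standard REINFORCE objective: scale the summed log-probabilities
    of the chosen actions by the episode reward (negated for descent)."""
    return -reward * log_probs.sum()
```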
We compared our model with LSTM-CRF and found that the F1 score of mRR reaches 93.02%, exceeding LSTM-CRF by about 10 percentage points.
| model | test accuracy | test recall | test F1 |
| --- | --- | --- | --- |
We have demonstrated that a multi-scale language model can combine global and local features to help extract the key information of an ontology, performing better than single-RNN-layer models. Moreover, a model trained to take large-scale actions has a great advantage in processing efficiency over previous models.
We found that mRR is particularly good at tagging tasks on long texts. It prefers low-level actions around key words that indicate the function of a section, for instance 'the People's Court establishes the truth based on facts', and high-level actions in unimportant sections, such as the basic information of a criminal. This is intuitive: humans follow a similar process when handling the same task.
-  D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio. End-to-end attention-based large vocabulary speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 4945–4949. IEEE, 2016.
-  K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
-  J. Chung, S. Ahn, and Y. Bengio. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704, 2016.
-  J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Gated feedback recurrent neural networks. In International Conference on Machine Learning, pages 2067–2075, 2015.
-  R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011.
-  N. Ding, L. Melloni, H. Zhang, X. Tian, and D. Poeppel. Cortical tracking of hierarchical linguistic structures in connected speech. Nature Neuroscience, 19(1):158–164, 2016.
-  S. El Hihi and Y. Bengio. Hierarchical recurrent neural networks for long-term dependencies. In Advances in neural information processing systems, pages 493–499, 1996.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  Z. Huang, W. Xu, and K. Yu. Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991, 2015.
-  J. Koutnik, K. Greff, F. Gomez, and J. Schmidhuber. A clockwork rnn. In International Conference on Machine Learning, pages 1863–1871, 2014.
-  G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer. Neural architectures for named entity recognition. CoRR, abs/1603.01360, 2016.
-  X. Ma and E. Hovy. End-to-end sequence labeling via bi-directional lstm-cnns-crf. arXiv preprint arXiv:1603.01354, 2016.