Event Identification as a Decision Process with Non-linear Representation of Text

by Yukun Yan, et al.
Tsinghua University

We propose the scale-free Identifier Network (sfIN), a novel model for event identification in documents. In general, sfIN first encodes a document into multi-scale memory stacks, then extracts specific events by conducting multi-scale actions, which can be considered a special type of sequence labelling. The design of large scale actions makes it more efficient at processing long documents. The whole model is trained with both supervised learning and reinforcement learning.


1 Introduction

Specific information extraction is a basic stage of text understanding and is usually conducted as a sequence labelling task. Typical approaches include traditional linear models and recurrent neural network based models. The former, including Hidden Markov Models (HMM) and Conditional Random Fields (CRF), rely on hand-crafted features, which lack information such as the order of words. Neural models, on the other hand, have achieved outstanding success recently. The most significant difference between the two types is that neural models use distributed representations and non-linear transformations, which enable the algorithms to build a more complex language model. The most successful models are LSTM-crf [11][9] and LSTM-CNNs-crf [12].

One study [6] of the human cerebral cortex shows that several parts of the cortex are activated while listening to speech, corresponding to different levels of language structure. Moreover, this phenomenon only occurs when the listener knows the language, which suggests that a multi-level distributed representation may be a promising way to build a language model [5]. Many attempts have been made based on multi-scale recurrent neural networks, which can be divided into two types. The first type has several recurrent layers, each with its own update period [7], CW-RNN [10] included. The second type [1][4][3] uses a gating mechanism to control the flow from low level hidden states to high level ones, as in hierarchical recurrent neural networks. Here, sfIN is designed in a different way: its representation levels correspond to language structures, namely word, sentence and paragraph.

2 Architecture

Unlike LSTM-crf and other neural network based models, mRR encodes text into a hierarchical memory stack, which enables a more complex non-linear transformation of the whole text. After the memory stack is established, an RNN based controller reads part of the memory at each time step and takes an action to predict the current tag. There are three read-heads, which are updated after each action and indicate which part of the memory is accessible. The whole process ends when one of the read-heads reaches the end of the text. Figure 1 shows the architecture of mRR.

Figure 1: Coarse-Reader Network

Road Map. The remainder of this section is organized as follows. Section 2.1 describes the Text Encoder part of mRR and Section 2.2 describes the Controller part.

2.1 Text Encoder

The text encoder takes not only the raw text but also its structure information as input, and outputs a hierarchical memory with three levels: word level, sentence level and paragraph level. A memory is generally defined as a matrix of potentially infinite size; here we limit the memory to three pre-defined matrices, one per level, each with a number of locations and a number of values in each location. The number of locations is always instance-dependent, while the number of values in each location is pre-determined by the algorithm. In our implementation, the memories at different levels have different sizes.

The structure information consists of the begin indexes of sentences and the begin indexes of paragraphs. The Text Encoder has three parts: the word encoder, the sentence encoder and the paragraph encoder.

Figure 2: Text Encoder

2.1.1 word encoder

The word encoder takes the raw text as input and outputs the word level memory, a matrix with one location per word in the document and as many values in each location as the dimension of the word level memory. As illustrated in Figure 3, the establishing process is as follows:
1. We apply a word embedding layer to the raw text to obtain several vector sequences, each corresponding to a sentence.
2. The vector sequences are fed into a bidirectional long short-term memory (bi-LSTM) layer, and the word level memory matrix is generated by concatenating all hidden states.

Figure 3: establishment of word level memory

2.1.2 sentence encoder

Inspired by convolutional neural networks, we apply element-wise max-pooling to the word level memory to generate a ’sentence vector’ for each sentence, thereby extracting a global feature from local features. Another bi-LSTM layer is then used to generate the sentence level memory matrix, in the same way as the word level memory is generated, as illustrated in Figure 4.

Figure 4: establishment of sentence level memory
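The pooling step can be sketched in a few lines. This is only an illustration under assumptions: the function name `segment_max_pool`, the list-of-lists memory layout and the 0-based begin indexes are ours, and the bi-LSTM layer that follows the pooling in the model is omitted.

```python
def segment_max_pool(memory, begins):
    """Element-wise max over each segment of rows in `memory`.

    `memory` is a list of equally sized vectors (one per word) and
    `begins` holds the 0-based index of the first row of each segment
    (hypothetical layout, chosen for this sketch).
    """
    pooled = []
    for i, start in enumerate(begins):
        # A segment ends where the next one begins, or at the end of memory.
        end = begins[i + 1] if i + 1 < len(begins) else len(memory)
        segment = memory[start:end]
        # zip(*segment) iterates over columns, so max() is taken per dimension.
        pooled.append([max(column) for column in zip(*segment)])
    return pooled


# Toy word level memory: 5 words, 2 values per location, 2 sentences.
word_memory = [[0.1, 0.9], [0.4, 0.2], [0.7, 0.3], [0.0, 0.5], [0.6, 0.6]]
sentence_begins = [0, 2]
sentence_vectors = segment_max_pool(word_memory, sentence_begins)
```

The same function could be applied to rows of the sentence level memory with paragraph boundaries to pool sentences into paragraphs.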

2.1.3 paragraph encoder

For each paragraph, we apply element-wise max-pooling to the sub-matrix of the sentence level memory corresponding to the sentences that belong to it, generating the paragraph level memory, as shown in Figure 5.


Figure 5: establishment of paragraph level memory

2.2 Controller

The controller consists of an RNN layer and nine feed-forward neural networks (FNN), each of which has a one-dimensional output. At each time step, the RNN uses three read-heads to read the hierarchical memory and update its hidden state, which is fed to the FNNs to generate an action. The resulting tag sequence is appended to the previous result and the locations of the read-heads are updated at the same time, as illustrated in Figure 6.

The controller is trained as an action agent, which reads part of the hierarchical memory and chooses an action (and thereby generates part of the tag sequence) at each time step. There are nine available actions, corresponding to the nine feed-forward neural networks: mark a word, sentence or paragraph as non-event, current event or new event. The available actions and the corresponding tag sequences are shown in Table 1.

Figure 6: controller
action level   non-event   current event   new event
word           -1          mark            mark + 1
Table 1: available actions and corresponding tag sequence
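The mapping in Table 1 can be sketched as a small function. The following is a hedged illustration: only the word-level row is given in the table, so extending the same tag to every word covered by a sentence or paragraph action is an assumption, as are the name `tags_for_action`, the reading of `mark` as a running event counter, and its initial value of 0 in the example.

```python
def tags_for_action(kind, mark, span_length=1):
    """Return (tags, new_mark) for one action.

    kind        -- "non-event", "current" or "new" (labels chosen for this sketch)
    mark        -- assumed to be the index of the current event
    span_length -- number of words covered by the action (1 for a word action)
    """
    if kind == "non-event":
        return [-1] * span_length, mark
    if kind == "current":
        return [mark] * span_length, mark
    if kind == "new":
        # A new event increments the counter and tags the span with it.
        return [mark + 1] * span_length, mark + 1
    raise ValueError(f"unknown action kind: {kind}")


# Two non-event words, then a three-word sentence opening a new event.
mark = 0
tags = []
for kind, length in [("non-event", 1), ("non-event", 1), ("new", 3)]:
    new_tags, mark = tags_for_action(kind, mark, length)
    tags.extend(new_tags)
```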

Read Memory: The decision process is conducted from the beginning of the text to the end. A location vector indicates the current position; it has three dimensions, corresponding to the three levels of the accessible hierarchical memory, and is initialized as . At each time step, the available memory is loaded into the Controller, as illustrated in Figure 7.

Figure 7: reading memory, current location vector is [11, 2, 2]

Generate Action: The chosen part of the memory is used to update the state of the controller, which then generates nine scores. At each time step, the action with the highest score is chosen, and the tag sequence is generated according to Table 1.
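A minimal sketch of this selection step, assuming the nine scores are ordered level by level (word, sentence, paragraph) and, within each level, as non-event, current event, new event. The ordering and the names `ACTIONS` and `select_action` are illustrative assumptions, not taken from the model itself.

```python
LEVELS = ("word", "sentence", "paragraph")
KINDS = ("non-event", "current", "new")
# Nine (level, kind) pairs, one per feed-forward network (assumed ordering).
ACTIONS = [(level, kind) for level in LEVELS for kind in KINDS]


def select_action(scores):
    """Return the (level, kind) pair whose score is highest."""
    if len(scores) != len(ACTIONS):
        raise ValueError("expected one score per action")
    best = max(range(len(scores)), key=scores.__getitem__)
    return ACTIONS[best]


chosen = select_action([0.1, 0.2, 0.0, 0.3, 0.9, 0.1, 0.0, 0.2, 0.4])
```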

Update Location: The action is also used to update the location vector mentioned above, depending on the level at which the action was made. The new vector points at the next word if the action is at word level, or at the first word of the next section if the action is at sentence or paragraph level. An instance is shown in Figure 8.

Figure 8: action sequence: [word-none, word-none, sentence-new, sentence-none, paragraph-new]
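The update rule can be sketched as follows, assuming 0-based word indexes and that both sentence and paragraph begin indexes are expressed as word positions. The helper names `next_word_position` and `location_vector` and the toy boundaries are hypothetical.

```python
from bisect import bisect_right


def next_word_position(word_pos, level, sentence_begins, paragraph_begins, n_words):
    """Word index the read-heads point at after an action at `level`."""
    if level == "word":
        return word_pos + 1
    begins = sentence_begins if level == "sentence" else paragraph_begins
    # First begin index strictly greater than the current word position.
    nxt = bisect_right(begins, word_pos)
    return begins[nxt] if nxt < len(begins) else n_words


def location_vector(word_pos, sentence_begins, paragraph_begins):
    """[word index, sentence index, paragraph index] containing `word_pos`."""
    return [
        word_pos,
        bisect_right(sentence_begins, word_pos) - 1,
        bisect_right(paragraph_begins, word_pos) - 1,
    ]


# Toy document: 12 words, 4 sentences, 2 paragraphs.
sentence_begins = [0, 3, 5, 9]
paragraph_begins = [0, 5]
n_words = 12

# Levels of the action sequence in Figure 8.
levels = ["word", "word", "sentence", "sentence", "paragraph"]
position = 0
visited = []
for level in levels:
    position = next_word_position(
        position, level, sentence_begins, paragraph_begins, n_words
    )
    visited.append(position)
```

In this toy run the final paragraph action moves the position to `n_words`, which corresponds to a read-head reaching the end of the text.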

3 Dataset

We used a dataset of legal documents, each of which contains the information of a criminal and his criminal record. In each sample, there can be one or several theft events with different times, locations and victims. The length of a sample ranges from 1500 to 7000 words, and the number of events in a sample ranges from 1 to 74. The whole dataset has 8299 samples, labelled by 5 individuals.

4 Training

We use both supervised learning and reinforcement learning (policy gradient) to train our model. At each time step, we generate action labels by comparing the predicted tag sequence with the ground truth. The label of an action is set to ’1’ only if all of its predictions are right, and to ’0’ otherwise. However, several actions may be correct at the same time, so we use multinomial sampling over the correct actions at each time step to guarantee that the model follows a right path. Because the length of the text influences the speed of applying the model, we want the model to take actions that cover longer sequences as long as they are correct. This cannot be achieved by supervised learning alone, because the policy with the fewest actions may not gain the highest accuracy. We therefore introduce a policy gradient based method: after processing one sample, the model gains a reward based on the number of actions divided by the length of the text. In this way, the model finds a balance between efficiency and precision.
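The labelling and reward steps can be sketched as below. Several details are assumptions made only for illustration: the tag convention (-1, `mark`, `mark + 1`) applied to every word in a span, uniform sampling among the correct actions, and the specific reward form `1 - n_actions / n_words`. The text states only that the reward is based on the ratio of the number of actions to the text length.

```python
import random


def action_labels(gold, start, span_ends, mark):
    """Label each candidate action 1 if all tags it would emit match `gold`.

    span_ends maps a level name to the exclusive end index of the span
    that an action at that level would cover (hypothetical interface).
    """
    labels = {}
    for level, end in span_ends.items():
        for kind, tag in (("non-event", -1), ("current", mark), ("new", mark + 1)):
            predicted = [tag] * (end - start)
            labels[(level, kind)] = int(predicted == gold[start:end])
    return labels


def efficiency_reward(n_actions, n_words):
    """Assumed reward form: fewer actions per word gives a higher reward."""
    return 1.0 - n_actions / n_words


gold = [-1, -1, -1, 1, 1]
labels = action_labels(gold, 0, {"word": 1, "sentence": 3, "paragraph": 5}, mark=0)
correct = [action for action, label in labels.items() if label == 1]
# Sample one of the correct actions (uniformly, as a simplification).
sampled = random.choice(correct)
reward = efficiency_reward(n_actions=3, n_words=12)
```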

5 Experiments

We compared our model with LSTM-crf and found that the f1 value of mRR reaches 93.02%, which exceeds that of LSTM-crf by about 10%.

model      test accuracy   test recall   test f1
LSTM-crf   %               -             %
mRR-ne     %               -             %
mRR-le     %               -             %
Table 2: test results of LSTM-crf and mRR variants

6 Conclusion

We have demonstrated that a multi-scale language model can combine global and local features to help extract key information of an ontology, and that it performs better than single RNN layer models. Moreover, a model trained to take large scale actions has a great advantage in processing efficiency over previous models.

7 Discussion

We found that mRR performs well at tagging tasks on long texts. It prefers low level actions at key words that indicate the function of a section, for instance, ’the People’s Court establishes the truth based on facts’, and high level actions in ’no big deal’ sections, such as the basic information of a criminal. This is intuitive, since humans proceed in the same way when performing the same task.