Tale of tails using rule augmented sequence labeling for event extraction

08/19/2019 ∙ by Hrishikesh Patel, et al. ∙ IIT Bombay

Event extraction is a relatively difficult task for low-resource languages because sufficient annotated data is not available. The task becomes harder still for tail (rarely occurring) labels, for which extremely little data is available. In this paper, we present a new dataset (InDEE-2019) in the disaster domain for multiple Indic languages, collected from news websites. Using this dataset, we evaluate several rule-based mechanisms to augment deep learning based models. We formulate event extraction as a sequence labeling task and perform extensive experiments to study and understand the effectiveness of different approaches. We further show that tail labels can be easily incorporated by creating new rules, without requiring large annotated data.


1 Introduction

Event occurrences involve several entities such as time, date, place, reason, etc. Event extraction recovers structured representations from the text, often characterised by complex arguments and nested events involving several entities. Entities have associated properties and attributes, such as the reason for the event, its after-effects, etc.

Event extraction has several applications in tasks that include text summarization, knowledge-base construction, machine translation, etc.

Entity extraction is a prerequisite for building any event extractor. Additionally, event extraction (EE) often requires discovering relations between entities, and dependencies between relations, from the text. Typically, EE systems find triggers and their associated arguments. The discovery of triggers and arguments poses several challenges. First, event detection is highly context-driven. The same event may appear with different trigger words, and a trigger word can evoke different event expressions. For example, consider the sentence

‘On 29 December 2017 a massive fire broke in Kamala Mills, Mumbai the capital of Maharashtra, killed at least 14 people and injured several’.

The trigger word killed may co-occur with fire_accident or attack or any other keyword associated with an incident involving such a trigger word. Similarly, the trigger word killed can evoke different events depending upon the context. Second, the presence of a large number of entities requires large annotated datasets for deep learning based systems, which typically need a sufficiently large amount of training data Sun et al. (2017). Named entity recognition (NER) systems are not effective at extracting a large number of tags from little data. Third, of particular concern are the labels with very few instances in the dataset, hereafter referred to as ‘tail labels’. Most event extraction methods are not designed to be fair to tail labels and end up biased towards the more frequent labels. Additionally, the absence of embeddings that capture language models effectively is especially pronounced for low-resource languages. This can lead to the propagation of errors resulting from the improper initialisation of embeddings for out-of-vocabulary words.

In this paper, we address these challenges to create an end-to-end EE system in the disaster domain in 5 languages, namely Marathi, English, Hindi, Bengali and Tamil. We introduce rule-augmented deep learning methods and demonstrate the effectiveness of our approach with extensive experiments. We model EE as a sequence labelling task and investigate ways of enhancing state-of-the-art entity extractors by augmenting them with rule-based approaches. Our contributions are summarised as follows:

  • We re-purpose existing rule-augmented deep learning models for learning event structures that capture event arguments and their inter-dependencies in the disaster domain for low-resource languages.

  • We conduct extensive experiments on our dataset and demonstrate the importance of rule-augmented deep learning models in improving performance on tail labels.

  • We release a new dataset named InDEE-2019 that consists of tagged event extraction data in the disaster domain covering five languages: Marathi, English, Hindi, Bengali and Tamil.

2 Related Work

Event extraction is a well-studied problem, albeit mostly for English. At present, most event extraction methods consist of deep learning models that require large annotated training datasets. These data-hungry methods work in resource-rich languages like English Nguyen et al. (2016), but they fail in low-resource languages such as the Indic languages. A few previous works have used rule-based approaches Valenzuela-Escárcega et al. (2015); however, such models are not able to learn complex functions of the data. Other work Reschke et al. (2014) uses external information from knowledge bases to improve the performance of linear-chain conditional random field (CRF) baselines. Therefore, we need a hybrid approach that leverages the benefits of both methods. Snorkel Ratner et al. (2017) deals with candidate extractors and uses weak supervision sources to classify the correct set of arguments. However, Snorkel requires candidates to be NER tagged, and it works well only with a smaller number of labels. To address this issue, we augment deep learning architectures with rules in different ways, so that contextual information is learned along with the rules.

3 Event Extraction Task

We focus on the EE task and model it as a sequence labeling problem. We use the following terminology throughout the paper:

  • Event Trigger: The main word that identifies the occurrence of the event mentioned in the document.

  • Event Arguments: The words that describe an event, such as place, time, reason, after-effects, participant and casualties.

We are interested in extracting event triggers and event arguments from the document. Table 1 shows the number of labels for each language.

Figure 1: The current word, along with the words in its vicinity (shown in the dotted box), is checked for similarity against the positive and negative dictionaries in Marathi.

4 Approach

In this section, we describe our approach for the event extraction task. We model the extraction as a sequence labeling problem. We adopt a bi-directional LSTM (Long Short-Term Memory) with a CRF (Conditional Random Field) Huang et al. (2015) for sequence labeling as a baseline and augment it by incorporating information from rules.

4.1 Bi-directional LSTM

Recurrent neural networks (RNNs) have achieved significant gains in sequence labeling tasks. Due to their sequential architecture, RNNs are able to use previous inputs in the sequence to predict the output. LSTMs are a variant of RNNs and have outperformed plain RNNs at capturing long-range dependencies in a sequence Hochreiter and Schmidhuber (1997). Recently, state-of-the-art named entity extractors Lample et al. (2016) have used bi-directional LSTMs with a CRF. CRFs effectively capture transitions between states during inference; however, they do not perform well on skewed tag distributions. A CRF layer makes it possible to add constraints during inference such that invalid label sequences are discarded from the search space. However, this benefit diminishes with less training data, when transition probabilities are not learned effectively.

Given an input sequence of words $x_1, \dots, x_n$, an LSTM computes the current hidden state $\overrightarrow{h_t}$ from the current input word $x_t$ and the previous hidden state $\overrightarrow{h_{t-1}}$. A single-directional LSTM is not sufficient to capture dependencies on future words. To address this, an LSTM running in the reverse direction is used to generate another hidden vector representation $\overleftarrow{h_t}$.
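The idea of combining a forward and a backward pass can be sketched in a few lines. This is only an illustration under loud assumptions: the toy `step` function is a plain tanh recurrence standing in for a real LSTM cell, its weights `w_x` and `w_h` are arbitrary, and the hidden size is simply set equal to the input size. None of these names come from the paper.

```python
import math

def step(x, h_prev, w_x=0.5, w_h=0.5):
    """Toy recurrent cell (tanh); a stand-in for an LSTM cell, not an LSTM."""
    return [math.tanh(w_x * xi + w_h * hi) for xi, hi in zip(x, h_prev)]

def run_direction(inputs, reverse=False):
    """Run the cell over the sequence and return one state per position."""
    order = list(reversed(inputs)) if reverse else list(inputs)
    h = [0.0] * len(inputs[0])
    states = []
    for x in order:
        h = step(x, h)
        states.append(h)
    if reverse:
        states.reverse()  # realign states with the original word order
    return states

def bidirectional(inputs):
    """Concatenate the forward and backward state at every position."""
    forward = run_direction(inputs)
    backward = run_direction(inputs, reverse=True)
    return [f + b for f, b in zip(forward, backward)]

# Three "words", each a 2-dimensional toy embedding.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = bidirectional(sentence)
```

Each position ends up with a vector twice the hidden size: the first half has seen only the words to its left, the second half only the words to its right.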

It is well documented that LSTMs reduce the vanishing gradient problem when modelling inter-dependencies in long sequences. An LSTM outputs a tag probability distribution for each individual word in an input sequence, while a CRF gives a score for the whole tag sequence. Due to the skewed tag distribution in our dataset, the CRF gives very low probabilities to the rarely occurring tags. Therefore, we chose a Bi-LSTM as our baseline model and dropped the CRF.

4.2 Rule Definition

Our approach extends the Bi-LSTM with a rule-based approach that generates a rule vector for each word. The rule vectors are built using event anchors to capture class information. We create several dictionaries that correspond to characteristics of event triggers. These dictionaries augment the features learned by the Bi-LSTM. Figure 1 gives an example of the dictionaries used for developing rules.

4.2.1 Synonym-based dictionary

This dictionary contains synonyms of the trigger word. We create sets of words that correspond to the synonyms of the trigger word for each language. For instance, synonyms of flood are shown in Figure 1. Once a robust dictionary has been created for one language, it is easy to extend it to other languages. We refer to synonym dictionaries as positive dictionaries.

4.2.2 Negative Dictionary

In several documents, many sentences refer to events that happened in the past or that might happen. Such events need to be ignored by our system so that speculative mentions are not treated as actual events. For example, ‘There exists a strong possibility of spreading of Malaria after 2015 floods in Mumbai’. The word possibility in the sentence causes the event to be classified as a probable event. Therefore, it is tagged as a negative mention in the annotation. To capture these negative instances, we created a negative dictionary, as shown in Figure 1, that contains such words.

4.3 Rule Vector

Given a sentence $S$ containing the word sequence $w_1, \dots, w_n$, we create a rule vector $r_i$ for each word $w_i$. Due to overlapping labels in our tag set, $r_i$ is a multi-hot vector. For instance, an attack can belong to both Normal_bombing and Terrorist_attack.

Algorithm 1 generates the rule vector $r_i$ for each word $w_i$ in the sentence $S$. The dimension of $r_i$ is equal to the number of labels + 1 (for the O tag). If any word in the sentence is found in the negative dictionary, then the word is tagged as O. If any word in the window $\{w_{i-k}, \dots, w_{i+k}\}$ around the current word $w_i$ is present in the synonym dictionary of a particular tag $t$, then $w_i$ is tagged with $t$. If none of the words in the window match any of the synonym dictionaries, then the current word is tagged as O. We experimented with multiple values of the window size $k$. We also used similarity-based matching instead of exact word matching.
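The text does not say which similarity measure replaces exact matching. As a purely illustrative sketch, the snippet below assumes cosine similarity over word embeddings with a hypothetical threshold of 0.8; the function names, the `embeddings` mapping and the threshold are all assumptions, not details from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def matches_dictionary(word, dictionary, embeddings, threshold=0.8):
    """True if `word` is in the dictionary or close to one of its entries.

    `embeddings` maps words to vectors; `threshold` is an assumed cut-off.
    """
    if word in dictionary:
        return True
    vec = embeddings.get(word)
    if vec is None:
        return False  # no vector available, so only exact matching applies
    return any(
        cosine(vec, embeddings[entry]) >= threshold
        for entry in dictionary
        if entry in embeddings
    )

# Toy 2-dimensional embeddings, invented for the example.
toy_embeddings = {"flood": [1.0, 0.0], "floods": [0.9, 0.1], "fire": [0.0, 1.0]}
flood_dictionary = {"flood"}
```

With these toy vectors, an inflected form such as "floods" matches the flood dictionary even though it is not listed, while "fire" does not.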

Input: Sentence $S$ with words $w_1, \dots, w_n$; current word $w_i$ in $S$; synonym dictionary $D_t$ for each label $t$; negative dictionary $D_{neg}$; window size $k$ for $w_i$; $flag$ indicates whether a word appeared in some $D_t$ or not.
Output: Rule vector $r_i$ for each word $w_i$
  Initialize each $r_i = 0^m$   # $m$ = number of labels + 1
  Initialize $flag$ = False
  if any of $\{w_1, \dots, w_n\}$ in $D_{neg}$ then
     $r_i[O] = 1$
     return $r_i$
  end if
  for each tag $t$ do
     if any of $\{w_{i-k}, \dots, w_{i+k}\}$ exists in $D_t$ then
        set $r_i[t] = 1$   # for the corresponding position $t$
        $flag$ = True
     end if
  end for
  if $flag$ is False then
     $r_i[O] = 1$
  end if
  return $r_i$
Algorithm 1: Algorithm for creating the rule vector
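A minimal Python sketch of the rule-vector procedure described above is given here for illustration. It assumes exact word matching, places the O tag in the last position of the vector, and uses hypothetical names (`rule_vector`, `synonym_dicts`, `negative_dict`); the toy dictionaries are invented for the example and are not taken from InDEE-2019.

```python
def rule_vector(words, i, labels, synonym_dicts, negative_dict, k=1):
    """Multi-hot rule vector for words[i]; the last position is the O tag."""
    o_index = len(labels)
    vector = [0] * (len(labels) + 1)

    # Any negative-dictionary word in the sentence forces the O tag.
    if any(w in negative_dict for w in words):
        vector[o_index] = 1
        return vector

    # Words within k positions of the current word (clipped at the edges).
    window = words[max(0, i - k): i + k + 1]

    matched = False
    for position, label in enumerate(labels):
        if any(w in synonym_dicts.get(label, ()) for w in window):
            vector[position] = 1  # several labels may fire: multi-hot
            matched = True

    if not matched:
        vector[o_index] = 1
    return vector

labels = ["FLOOD", "FIRE"]
synonym_dicts = {"FLOOD": {"flood", "deluge"}, "FIRE": {"fire", "blaze"}}
negative_dict = {"possibility"}
words = "heavy flood hit the city".split()
vectors = [rule_vector(words, i, labels, synonym_dicts, negative_dict)
           for i in range(len(words))]
```

Because the window covers the neighbours of the current word, "heavy" and "hit" also receive the FLOOD position here, which is how the rule vector carries trigger information to nearby argument words.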
Figure 2: a) Architecture for rule augmentation by concatenation with word embeddings. Word sequences are fed to the rule-based system, and the resulting rule vector is concatenated with the respective word vector and given as input to the Bi-LSTM layer. The hidden vectors are used to retrieve the predicted labels. b) Rule vectors and word embeddings are given to separate Bi-LSTMs, and their hidden vectors are concatenated to predict labels.

4.4 Techniques for incorporating rule vector

We use rules to augment our deep learning labeling architecture using the following three methods. The complete architecture is shown in Figure 2.

4.4.1 Augment embeddings with rule vector

In this method, we append the rule vector to the word representation. In the encoding phase, we extract the word representation for each word using pre-trained fastText embeddings Bojanowski et al. (2017) fine-tuned on our dataset. The word embeddings are concatenated with the rule vectors. After dropout Srivastava et al. (2014) is applied, the concatenated vector is fed to the Bi-LSTM, which extracts essential features and learns a representation.
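The concatenation step itself is simple, as the following sketch suggests. The function name and the toy dimensions are assumptions made for illustration; real fastText vectors are much larger, and dropout and the Bi-LSTM are omitted.

```python
def augment_embeddings(word_vectors, rule_vectors):
    """Concatenate each word embedding with the rule vector of the same word."""
    if len(word_vectors) != len(rule_vectors):
        raise ValueError("need exactly one rule vector per word")
    return [
        list(embedding) + [float(bit) for bit in rule]
        for embedding, rule in zip(word_vectors, rule_vectors)
    ]

# Two words with toy 3-dimensional embeddings and 2 labels + O.
word_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
rule_vectors = [[1, 0, 0], [0, 0, 1]]
model_inputs = augment_embeddings(word_vectors, rule_vectors)
```

The input dimension seen by the encoder grows by the number of labels + 1, so the rule information is available at every time step alongside the word embedding.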

4.4.2 Explicit rule addition

In addition to the word embeddings fed to the Bi-LSTM, the rule vector is passed through a separate Bi-LSTM. Two parallel Bi-LSTMs are thus fed with word embeddings and rule vectors respectively. Their hidden-layer representations are concatenated to learn a joint representation of words and rules. The architecture is shown in Figure 2.

We expect this method to retain essential information from rules. In comparison to the concatenation approach proposed in the previous section, we expect it to retain more relevant information by passing rules explicitly.

Languages  Marathi (Mr)   Hindi (Hi)     English (En)   Tamil (Ta)     Bengali (Bn)
           Doc    Sen     Doc    Sen     Doc    Sen     Doc    Sen     Doc    Sen
Train      815    15920   678    13184   456    5378    1085   15302   699    18533
Val        117    2125    150    2775    56     642     155    2199    100    2621
Test       233    4411    194    3790    131    1649    311    4326    199    4661
#Labels    43             44             48             47             46
Table 1: InDEE-2019 dataset for five languages, namely Marathi, Hindi, English, Tamil and Bengali. Number of tags (labels) for each language, and the number of documents (Doc) and sentences (Sen) in the train, validation and test splits used in the experiments.

4.4.3 Rule projection using distillation

In this method, the knowledge of the rules is distilled into the deep learning network. This is achieved by biasing the weights of the neural network according to the rule vector. Hu et al. (2016) proposed a teacher-student model in which the model weights are learnt within the constraints of a rule-based system. The teacher and student models bias each other such that weights are learnt by modelling logical rules as constraints and projecting the predictions into the rule-constrained space. Using a similar framework, we made small changes so that it works with the side information provided by the rules, as given by eq. (1).

(1)

The teacher distribution is a modified form of the student distribution, both defined over the tags: the student distribution is transformed according to the rule vector while keeping the basic intuition the same. The formulation modifies the prediction probabilities to have a bias towards the rule vector. For example, the rule vector, being a multi-hot vector, can be interpreted as the set of labels (tags) assigned by the rules. The formula biases the student distribution by keeping the probability of the tags predicted by the rules and penalising the probabilities of the other labels by a factor controlled by an arbitrary constant, which we hold fixed in our experiments.

Using eq. (2), the student tries to mimic the teacher prediction, weighted by an imitation parameter, while also optimizing for the ground truth. The imitation parameter is held at a fixed value in our experiments.

(2)

The parameters of the student distribution are updated using the above update rule, which is applied to the student prediction at the n-th word in a sequence and combines an application-specific loss (cross-entropy here) with a KL divergence loss. Final inference is made using the projected distribution, i.e. the teacher distribution.
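One form that is consistent with the description above, and with the general recipe of Hu et al. (2016), can be sketched as follows. This is an assumption-laden illustration, not the paper's equations: the teacher is taken to be $q(t) \propto p(t)\,e^{-C(1 - r_t)}$ with renormalisation, the loss is taken to be $(1-\pi)\cdot$cross-entropy $+\ \pi\cdot$KL, the gold label is a single index, and the values `C=1.0` and `pi=0.5` are placeholders rather than the settings used in the experiments.

```python
import math

def teacher_distribution(p, r, C=1.0):
    """Keep tags the rules fired on; scale the others by exp(-C); renormalise."""
    scores = [pt * math.exp(-C * (1 - rt)) for pt, rt in zip(p, r)]
    total = sum(scores)
    return [s / total for s in scores]

def cross_entropy(gold_index, p, eps=1e-12):
    """Cross-entropy of the student distribution against one gold tag."""
    return -math.log(p[gold_index] + eps)

def kl_divergence(q, p, eps=1e-12):
    """KL(q || p) between teacher q and student p."""
    return sum(qt * math.log((qt + eps) / (pt + eps))
               for qt, pt in zip(q, p) if qt > 0)

def student_loss(p, r, gold_index, pi=0.5, C=1.0):
    """(1 - pi) * ground-truth loss + pi * imitation-of-teacher loss."""
    q = teacher_distribution(p, r, C)
    return (1 - pi) * cross_entropy(gold_index, p) + pi * kl_divergence(q, p)

student = [0.5, 0.3, 0.2]   # student probabilities over three tags
rules = [0, 1, 0]           # the rules fired only on the second tag
teacher = teacher_distribution(student, rules)
loss = student_loss(student, rules, gold_index=1)
```

Under these assumptions the teacher moves probability mass towards the rule-selected tag while preserving the relative order of the remaining tags, and the imitation term vanishes whenever the rules do not distinguish between tags.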

4.5 Word Embeddings

Due to the very nature of low-resource languages, out-of-vocabulary words are common occurrences. To handle such words, we use pre-trained fastText word embeddings Bojanowski et al. (2017) and fine-tune them on our corpus. We observe that the word coverage of fastText embeddings on Marathi, Tamil and Bengali is around 50%. Here, we focused only on building an EE system rather than on developing word embeddings.

5 Experiments

5.1 Dataset

We release a new dataset named InDEE-2019, which consists of tagged event extraction data in the disaster domain covering five languages: Marathi, English, Hindi, Bengali and Tamil. Dataset statistics can be found in Table 1.

For each language, we crawled disaster-related documents from regional news websites. To annotate these documents, we followed the IOB (Inside-Outside-Beginning) tagging scheme proposed by Ramshaw and Marcus (1999). IOB tagging helps in differentiating between the start and end of adjacent tags. In our disaster-related documents, adjacent occurrences of the same tag form a phrase. Therefore, we adopted a simpler scheme that merges B and I together and tags under a TO (T: Tag and O: Other) scheme. Table 2 shows a sample sentence from the tagged dataset, where the first column contains the tokens of the sentence, the second column the doc id corresponding to the sentence, and the third column the tag of each token in the TO scheme.
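The relation between IOB and TO tags can be illustrated with a short sketch. It assumes the common `B-`/`I-` prefix convention for IOB tags, and the helper names `iob_to_to` and `to_spans` are hypothetical; the example sentence is the one shown in Table 2.

```python
from itertools import groupby

def iob_to_to(tags):
    """Drop the B-/I- prefixes so that only the tag name or O remains."""
    return [t[2:] if t[:2] in ("B-", "I-") else t for t in tags]

def to_spans(tokens, tags):
    """Group adjacent tokens carrying the same non-O tag into one phrase."""
    spans, start = [], 0
    for tag, group in groupby(tags):
        length = len(list(group))
        if tag != "O":
            spans.append((tag, tokens[start:start + length]))
        start += length
    return spans

tokens = ["A", "moderate", "intensity", "earthquake", "measuring",
          "4.7", "hit", "Meghalaya", "on", "Monday"]
tags = ["O", "O", "O", "EARTHQUAKE", "O",
        "MAGNITUDE-ARG", "O", "PLACE-ARG", "O", "TIME-ARG"]
spans = to_spans(tokens, tags)
```

The simplification is safe only because adjacent tokens with the same tag are treated as one phrase; two distinct adjacent mentions of the same type would be merged, which is the information the B prefix normally preserves.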

We hired linguistic experts for each language to annotate the dataset as per the tagging guidelines. For each language, we divided the dataset into three parts: train (70%), validation (10%) and test (20%). The train set is used to train the model for the event extraction task, the validation set is used for hyper-parameter tuning and the test set is used for testing our model.
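A 70/10/20 split of this kind can be sketched as below. The sketch assumes that the split is made over whole documents (Table 1 reports document counts per split) and that documents are shuffled with a fixed seed; the function name, the seed and the rounding are illustrative choices, so the exact counts in Table 1 are not claimed to be reproduced.

```python
import random

def split_documents(doc_ids, train=0.7, val=0.1, seed=0):
    """Shuffle document ids and cut them into train, validation and test."""
    ids = sorted(set(doc_ids))
    random.Random(seed).shuffle(ids)
    n_train = round(train * len(ids))
    n_val = round(val * len(ids))
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])   # the remainder becomes the test set

train_ids, val_ids, test_ids = split_documents(range(100))
```

Splitting by document rather than by sentence keeps all sentences of one news article in the same partition, which avoids leaking near-duplicate context from training into test.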

This dataset is quite challenging for multiple reasons. First, there is a very large number of tags and very little training data. Moreover, there are closely related tags, which made the annotation task difficult. Second, the data distribution is skewed, so very little training data exists for many tags. Third, our main focus is on low-resource Indic languages, which makes the task more challenging.

Token Doc Id TAG
A 2 O
moderate 2 O
intensity 2 O
earthquake 2 EARTHQUAKE
measuring 2 O
4.7 2 MAGNITUDE-ARG
hit 2 O
Meghalaya 2 PLACE-ARG
on 2 O
Monday 2 TIME-ARG
Table 2: TO Tagging Scheme
Figure 3: a) Comparison of Micro-F1 scores for different experiments over various training sizes (in %) b) Comparison of Macro-F1 scores for different experiments over various training sizes (in %)

5.2 Results and Discussion

We use the Bi-LSTM as our baseline and compare it with the three proposed approaches. We conduct experiments on our InDEE dataset on 5 languages, namely Hindi, Marathi, Bengali, Tamil and English. Our evaluation metrics are the standard micro-F1 and macro-F1 scores. Micro-F1 counts the global true positives, false positives and false negatives, whereas macro-F1 is the unweighted average of the per-class scores. Macro-F1 does not take class imbalance into consideration. Because the label distribution in our dataset is highly skewed, the micro score is of more interest to us. For ease of reference, we call our baseline A and the three proposed approaches B (4.4.1), C (4.4.2) and D (4.4.3).
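The difference between the two averages can be made concrete with a small token-level sketch. It assumes one gold and one predicted tag per token and excludes the O tag from scoring; both choices, and the function names, are assumptions for illustration rather than the evaluation script used in the experiments.

```python
from collections import Counter

def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(gold, pred, ignore=("O",)):
    """Token-level micro-F1 and macro-F1 over all non-ignored tags."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g not in ignore:
                tp[g] += 1
        else:
            if p not in ignore:
                fp[p] += 1
            if g not in ignore:
                fn[g] += 1
    labels = sorted(set(tp) | set(fp) | set(fn))
    # Micro: pool the counts of all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = (sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
             if labels else 0.0)
    return micro, macro

gold = ["A", "A", "A", "B", "O"]
pred = ["A", "A", "B", "B", "O"]
micro, macro = micro_macro_f1(gold, pred)
```

In this toy case the frequent class A scores 0.8 and the rare class B about 0.67, so micro-F1 is 0.75 while macro-F1 is about 0.73: every class counts equally in the macro average regardless of how many tokens it covers.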

We train our Marathi model on 15.9K sentences and predict over 43 labels. Training set sizes for all languages are shown in Table 1. We train our model on varying fractions of the training set, namely 20%, 40%, 60%, 80% and 100%, to ascertain the impact of rules as the amount of data decreases.
Event Extraction. Figure 3 shows the macro-F1 and micro-F1 scores for Marathi EE. With fewer training instances, 3K (20%) and 6K (40%), our rule-based approach outperforms the baseline. Both macro and micro F1 scores show improvement over the baseline model. As the number of training instances increases, the deep learning models are able to learn the large number of parameters that were difficult to learn from fewer training instances.

Figure 4: a) Comparison of the proposed rule-based approaches on improvement over all labels. b) Comparison of the proposed approaches over tail labels. The lowest stack represents the number of labels that improved over the baseline, the middle stack the number of labels whose score equals the baseline, and the upper stack the number of labels that scored lower than the baseline.
Language   20%            40%
           B    C    D    B    C    D
Mr         10   8    8    12   8    8
Hi         9    12   8    10   6    6
En         5    5    4    5    2    3
Ta         4    3    4    9    5    6
Bn         8    6    6    8    7    3
Table 3: Number of tail labels improved over the baseline with 20% and 40% of the training instances.
Language   20%            40%
           B    C    D    B    C    D
Mr         27   22   20   28   24   23
Hi         19   19   16   21   16   17
En         20   21   20   16   9    9
Ta         13   10   11   16   14   12
Bn         14   12   11   14   11   9
Table 4: Number of total labels improving over the baseline on 5 languages, with 20% and 40% of the training instances.

Table 6 shows macro-F1 and micro-F1 scores for all 5 languages. We observe that with fewer training instances, the implicit rule approach tends to perform better on Marathi and Hindi. In most cases our methods perform better than the baseline when less training data is available. Owing to better word representations and data annotation for English, the Bi-LSTM learns its parameters even from a smaller dataset. However, for low-resource languages, the Bi-LSTM is not able to learn its parameters from a similar number of training instances.
Tail Labels. Most deep learning based methods are not able to capture tail labels because of the limited training data. However, this is of interest to us since most real-world datasets are small and have a large label set. Figure 4 shows the improvement in F1-score for each class over the baseline. We only considered scores that are greater than the baseline scores. We observe that more classes improve over the baseline when rules are included. With fewer training instances, we observe that a large number of tags are correctly classified. Table 4 shows the number of tags improved over the baseline. To further demonstrate the effectiveness of the rule-based approach, we tested the improvement of classes among the tail labels. We chose as tail labels those whose instances together form 5% of the total training set instances. We notice significant improvement of tags for 20% and 40% training instances. The number of tail labels improved over the baseline is given in Table 3. Moreover, we calculated micro-F1 scores for the tail labels. The results are shown in Table 5. We notice significant improvement on tails for 20% and 40% training instances. The results indicate that, with fewer training instances, our approach is effective in capturing tail labels. When working with low-resource languages and less annotated data, the combination of rules and Bi-LSTM gives a higher score.
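One way to read the 5% criterion is to add up labels from the rarest upwards until their combined count would exceed 5% of the tagged training tokens. The sketch below implements that reading; the function name, the exclusion of the O tag from the total, and the tie-breaking by label name are all assumptions made for illustration.

```python
from collections import Counter

def select_tail_labels(train_tags, share=0.05, ignore=("O",)):
    """Rarest labels whose combined frequency stays within `share` of the total."""
    counts = Counter(t for t in train_tags if t not in ignore)
    total = sum(counts.values())
    tail, running = [], 0
    # Visit labels from least to most frequent (ties broken by name).
    for label, count in sorted(counts.items(), key=lambda kv: (kv[1], kv[0])):
        if running + count > share * total:
            break
        tail.append(label)
        running += count
    return tail

# Toy tag stream: 100 tagged tokens spread over four labels.
toy_tags = ["A"] * 90 + ["B"] * 6 + ["C"] * 3 + ["D"] * 1 + ["O"] * 50
tail_labels = select_tail_labels(toy_tags)
```

On this toy stream the tail consists of D and C (4 of 100 tagged tokens); adding B would push the combined share to 10%, so it is left out.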

L    M    20      40      60      80      100
Mr   A    39.76   42.67   46.69   52.85   53.88
     B    44.62   47.44   48.91   54.3    50.71
     C    42.45   43.02   53.05   55.54   53.77
     D    41.72   43.06   48.83   50.93   51.25
Hi   A    29.57   42.68   48.11   47.35   46.99
     B    31.33   46.35   47.15   48.65   46.52
     C    34.08   42.87   49.21   46.35   47.29
     D    29.87   42.84   47.58   47.87   46.45
En   A    56.96   64.58   73.7    73.92   82.29
     B    61.65   63.66   75.34   75.31   82.71
     C    58.51   63.78   74.47   73.64   83.25
     D    63.85   62.24   71.67   69.06   79.01
Ta   A    44.88   55.94   59.64   62.15   62.62
     B    44.95   57.55   58.37   62.31   60.55
     C    44.36   55.75   57.06   64.08   62.73
     D    41.59   50.93   55.34   60.75   63.37
Bn   A    38.56   49.69   42.06   44.29   49.38
     B    42.92   51.38   38.26   47.57   48.3
     C    42.7    50.11   42.99   47.65   42.15
     D    41.41   46.54   39.34   42.33   44.72
Table 5: Comparison of micro F1-scores on tail labels for all 5 languages at different training set sizes (in %). L: Language, M: Model.
Lang  Model  20%             40%             60%             80%             100%
             Micro   Macro   Micro   Macro   Micro   Macro   Micro   Macro   Micro   Macro
Mr    A      56.04   39.58   60.39   48.98   60.95   50.19   62.76   53.26   63.37   55.2
      B      57.75   41.89   61.04   51.72   61.86   50.66   63.22   54.43   63.35   53.36
      C      57.26   40.59   60.67   49.64   62.01   53.42   62.8    55.54   63.69   54.77
      D      56.68   40.14   60.39   49.82   61.88   52.07   62.99   54.06   63.55   54.97
Hi    A      48.56   37.73   51.63   42.71   53.12   46.08   54.09   46.83   56.93   47.12
      B      48.44   38.54   51.67   43.44   54.87   44.2    55.11   46.18   56.01   46.33
      C      49.34   38.29   51.54   41.86   54.23   47.27   53.16   45.19   55.81   46.4
      D      48.85   38.17   51.22   41.3    52.65   45.45   54.18   45.53   54.93   45.88
En    A      64.79   50.39   76.51   69.07   81.59   75.85   81.75   77.98   85.76   82.44
      B      66.47   51.71   75.94   68.81   81.44   76.73   83.88   80.01   86.66   83.44
      C      66.56   54.82   76.32   68.63   81.46   74.91   82.11   78.36   86.48   82.69
      D      65.95   52.54   75.24   67.06   81.4    73.06   82.11   74.43   86.04   81.04
Ta    A      67.44   61.19   70.99   64.01   73.97   67.19   73.62   69.77   73.62   69.77
      B      68.28   61      70.82   63.53   72.95   66.3    74.3    69.1    74.3    69.1
      C      67.96   59.13   70.85   64.08   72.8    65.84   73.7    69.33   73.7    69.33
      D      66.25   58.42   70.03   60.36   72.78   63.84   73.69   69.2    73.69   69.2
Bn    A      64.26   38.07   66.58   46.82   67.52   43.3    67.78   45.23   67.94   47.94
      B      63.9    41.21   65.97   46.61   67.24   40.96   67.61   46.12   67.81   45.56
      C      63.16   40.72   66.24   44.75   67.25   42.54   68      45.23   67.71   42.15
      D      63.28   38.16   65.86   40.73   65.81   37.31   66.14   40.64   66.61   43.78
Table 6: Overall micro and macro F1-scores for all languages and different training set sizes.

6 Conclusion

In this paper, we introduce a hybrid approach for the automatic extraction of events and arguments. We present a new dataset in the disaster domain for five languages, with a larger number of tags than usual datasets. We propose several variants of a rule-based system to augment deep learning based models. Extensive experimental results demonstrate that our rule-augmented methods outperform purely deep learning based models on less annotated data and low-resource languages. We further show greater improvement on tail labels using our approach. For future work, we plan to integrate cross-linking between events and their arguments.

References

  • Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9:1735–1780.
  • Hu et al. (2016) Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard H. Hovy, and Eric P. Xing. 2016. Harnessing deep neural networks with logic rules. CoRR, abs/1603.06318.
  • Huang et al. (2015) Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional lstm-crf models for sequence tagging. CoRR, abs/1508.01991.
  • Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In HLT-NAACL.
  • Nguyen et al. (2016) Thien Huu Nguyen, Kyunghyun Cho, and Ralph Grishman. 2016. Joint event extraction via recurrent neural networks. In HLT-NAACL.
  • Ramshaw and Marcus (1999) Lance A Ramshaw and Mitchell P Marcus. 1999. Text chunking using transformation-based learning. In Natural language processing using very large corpora, pages 157–176. Springer.
  • Ratner et al. (2017) Alexander Ratner, Stephen H. Bach, Henry R. Ehrenberg, Jason Alan Fries, Sen Wu, and Christopher Ré. 2017. Snorkel: Rapid training data creation with weak supervision. Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases, 11(3):269–282.
  • Reschke et al. (2014) Kevin Reschke, Martin Jankowiak, Mihai Surdeanu, Christopher D. Manning, and Daniel Jurafsky. 2014. Event extraction using distant supervision. In LREC.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958.
  • Sun et al. (2017) Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. 2017. Revisiting unreasonable effectiveness of data in deep learning era. 2017 IEEE International Conference on Computer Vision (ICCV), pages 843–852.
  • Valenzuela-Escárcega et al. (2015) Marco A Valenzuela-Escárcega, Gus Hahn-Powell, Mihai Surdeanu, and Thomas Hicks. 2015. A domain-independent rule-based framework for event extraction. Proceedings of ACL-IJCNLP 2015 System Demonstrations, pages 127–132.