
Training with Streaming Annotation

by   Tongtao Zhang, et al.

In this paper, we address a practical scenario where training data is released in a sequence of small-scale batches and annotation in earlier phases has lower quality than in later ones. To tackle this situation, we utilize a pre-trained transformer network to preserve and integrate the most salient document information from the earlier batches while focusing on the annotation (presumably of higher quality) from the current batch. Using event extraction as a case study, we demonstrate in the experiments that our proposed framework performs better than conventional approaches (the improvement ranges from 3.6 to 14.9 F1 points with noise in the early annotation), and that our approach spares 19.1% of the training time compared to the best conventional method.



1 Introduction

Successful supervised models depend on high-quality annotation. Platforms such as Amazon Mechanical Turk and tools such as LightTag or brat [20] facilitate “crowd-sourcing” annotation and help researchers and developers acquire large-scale annotation within a short period. Nevertheless, we should not neglect the merits of hiring a professional annotation group if the data set contains confidential or private information and/or the task requires intensive training on background knowledge of specialized domains.

In an ideal setting, professional annotation construction is performed with the following steps:

  1. The task (e.g., a set of labels) is defined;

  2. Task-relevant data is collected;

  3. Annotators are recruited and trained;

  4. Annotators annotate the data and adjudicate the annotation.

In practice, the above procedure is iterated repeatedly, i.e., data sponsors or sources provide additional data, annotators review the previous data and annotation, and adjudicators correct errors or resolve disagreements. This process is usually lengthy, e.g., it took more than three years for the documents and sentences of ACE2005 to be fully annotated and adjudicated [22]. From the perspective of system developers and researchers, waiting for a large corpus to be fully annotated is impractical.

To shorten the waiting time, annotation organizers release annotation in small batches, i.e., as a stream, so that system developers and researchers can get familiar with the early release and start to work. This requires domain experts to carefully define and confirm the schema. Moreover, with a limited budget, time constraints and the need to balance workload, it is very likely that the annotation quality differs among batches. Usually the quality of early batches is lower than that of later ones for various reasons, e.g., annotators and adjudicators may misunderstand the schema, they may be unfamiliar with the annotation tools and make mistakes in early stages, and they may not have sufficient effort or time to revisit and re-annotate the early data sets while focusing on later batches with more skillful annotation.

Figure 1: An overview of the proposed framework. For an input sentence in the current batch, the framework retrieves the most similar sentence-level embeddings from the previous batches, one embedding from each batch (Section 2.1). The sentence-level embeddings from previous batches go through a forward LSTM network which outputs memory embeddings (Section 2.2). Memory embeddings are then concatenated with the context embedding from the Bi-LSTM and fed into the CRF model (Section 2.3).

Data and annotation available in stream-style might be handled by the following strategies:

  1. When a new batch of training data arrives, aggregate it with the old batches and train a new model.

  2. Train a model with the first batch, and when a new batch arrives, finetune the model with the new batch.

  3. When a new batch of training data arrives, train a new model with that batch and discard the models from previous batches.

The first strategy treats the data and annotation as if they had been delivered as a complete corpus and consumes all the data available; this strategy will inevitably introduce errors from the data in the earlier batches. The second strategy is also risky, since the distribution of errors is unknown and newly arrived data does not guarantee that the erroneous annotations are efficiently suppressed. Although the third strategy seems promising because the later batches come with better annotation, it usually yields an over-fitted model, especially when the batches are small. Hence, we require a paradigm that handles the streaming scenario by efficiently capturing information from the annotation in the current batch and from similar sentences in early batches, while alleviating errors in early annotation.

From our practical experience working with and as annotators, we notice that annotation quality improves as the work progresses for several reasons. Besides a better understanding of the annotation schema and growing proficiency with the annotation tool, annotators’ accumulated impression of past documents also helps improve annotations. For example, when seeing the word “strike”, which may be a trigger word of an Attack event or of a Demonstrate event, skillful annotators immediately assign an “Attack” label if the target sentence resembles past documents with country and military force names, or a “Demonstrate” label if the target sentence resembles those with business names and individual employee names.

In this paper, to emulate the aforementioned process, we propose a framework that captures sentence-level features extracted from old stream batches, whose annotations are ignored by the core network, while the core network focuses on the annotation from the current batch. This framework ensures a single model that is continuously finetuned mainly on the current batch while preserving the most informative features from old data and suppressing errors from past batches. We use event extraction as a case study. The event annotation and the extraction framework follow the ACE (Automatic Content Extraction) schema, in which 33 event types (such as Attack and Meet) are defined. The extraction system aims to extract event triggers, the words that most clearly express events (e.g., “shoot” for Attack). We assume the following conditions in the study:

  1. Once annotation and adjudication are done on a data instance, annotators and adjudicators do not revisit the instance.

  2. The quality of annotation is lower in early batches and increases in the later ones.

The contributions of the paper are summarized as follows:

  1. We propose a new approach that tackles annotated data arriving in small batches, and the proposed approach is robust against annotation errors among the batches.

  2. To the best of our knowledge, the proposed framework is the first to explore a scenario where annotation data arrives in small-scale batches, especially for the event extraction task.

2 Approach

2.1 Sentence-level embeddings

For an input sentence, we retrieve the most similar sentences from the previous batches, where similarity reflects the presence of the same event types. Additional features and embeddings from these sentences improve the event trigger extraction performance of the core network (Section 2.3).

Given a sentence $s^{(t)}_i$, where $t$ denotes the index of a batch and $i$ denotes the index within the batch, we have

$$e^{(t)}_i = f\left(s^{(t)}_i\right),$$

where $f$ denotes a sentence-level embedding extractor. As shown in Figure 1, we use the BERT [3] embedding of the “CLS” token (CLS stands for class; this token is used for sentence classification) and append a fully-connected (FC) layer after the embedding as a sentence-level feature extractor. We describe how we train this additional FC layer in the experiment section.

The sentence-level embeddings are stored according to the batch tag of the source sentence. For the $t$-th batch, we denote

$$E^{(t)} = \left\{ e^{(t)}_j \right\},$$

where $j$ denotes the index of a sentence in this batch.

For each input sentence embedding $e^{(t)}_i$, we retrieve the closest embedding from each of the previous batches. We denote

$$\hat{e}^{(t')} = e^{(t')}_{j^*},$$

where the subscript $j^*$ denotes the index of the closest sentence-level embedding in the $t'$-th batch and $t' < t$:

$$j^* = \arg\min_j \, d\left(e^{(t)}_i, e^{(t')}_j\right).$$

We define the distance function $d$ as the $L_2$ (Euclidean) distance.
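The retrieval step described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `retrieve_closest` and the toy two-dimensional embeddings are assumptions, and each previous batch is represented simply as a list of stored embedding vectors.

```python
import math

def retrieve_closest(query, batch_stores):
    """Return one embedding per previous batch: the one closest to `query`.

    query: list of floats (embedding of the current input sentence)
    batch_stores: list of batches, oldest first; each batch is a list of
                  stored sentence-level embeddings (hypothetical layout)
    """
    retrieved = []
    for store in batch_stores:
        # Euclidean distance between the query and every stored embedding.
        best = min(store, key=lambda emb: math.dist(query, emb))
        retrieved.append(best)
    return retrieved

# Toy data: two previous batches with 2-dimensional embeddings.
query = [1.0, 0.0]
stores = [
    [[0.0, 0.0], [0.9, 0.1]],
    [[5.0, 5.0], [1.0, 0.5], [3.0, 0.0]],
]
closest = retrieve_closest(query, stores)
print(closest)  # [[0.9, 0.1], [1.0, 0.5]]
```

A linear scan is enough for small batches; a larger store would probably call for an approximate nearest-neighbour index, which the text does not discuss.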

2.2 Memory Embedding

Starting from the second batch, we feed the retrieved sentence-level embeddings from the previous batch(es) into a forward LSTM (long short-term memory) [6] network. Each previous batch provides one sentence-level embedding, and these embeddings form the input sequence of the LSTM network. The output, i.e., the final state of the LSTM network, is the memory embedding.

This submodule simulates how a human annotator, while working on the current sentence, recalls similar sentences from past documents that contain the same event types. For example, we use the sentence embedding for “Putin last spoke to Bush on April 5 at the US president’s own initiative” as a query and retrieve the sentence “Putin last visited Bush at his Texas ranch in November 2001”, which contains similar entities and a related event (a Meet event triggered by “visited”). The memory embeddings calculated from the sentence embeddings through the LSTM network provide the most salient information to enhance the detection of target labels such as event triggers.
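To make the memory module concrete, the following sketch runs a plain single-layer forward LSTM over the sequence of retrieved embeddings (one per previous batch) and returns the final hidden state. It is an illustrative assumption in pure Python: the weights are random rather than trained, the dimensions are toy values, and the gate ordering and the helper names (`lstm_final_state`, `matvec`) are choices made for this example only.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    # Plain matrix-vector product on nested lists.
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def lstm_final_state(sequence, W, U, b, hidden_size):
    """Run a forward LSTM and return its final hidden state.

    W has 4*hidden_size rows (input weights), U has 4*hidden_size rows
    (recurrent weights), b has 4*hidden_size entries. Rows are ordered as
    input gate, forget gate, candidate, output gate.
    """
    n = hidden_size
    h = [0.0] * n
    c = [0.0] * n
    for x in sequence:
        z = [wx + uh + bias
             for wx, uh, bias in zip(matvec(W, x), matvec(U, h), b)]
        i = [sigmoid(v) for v in z[0:n]]
        f = [sigmoid(v) for v in z[n:2 * n]]
        g = [math.tanh(v) for v in z[2 * n:3 * n]]
        o = [sigmoid(v) for v in z[3 * n:4 * n]]
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h

# Toy setup: 3-dim sentence embeddings, 2-dim memory embedding.
rng = random.Random(0)
d, hidden = 3, 2
W = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(4 * hidden)]
U = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(4 * hidden)]
b = [0.0] * (4 * hidden)

retrieved = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]  # oldest batch first
memory = lstm_final_state(retrieved, W, U, b, hidden)
```

The sequence length grows by one with every new batch, so the module sees at most as many steps as there are previous batches.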

2.3 Core network

The core network is based on a Bi-LSTM-CRF framework [14, 4], which models trigger extraction as a sequence labeling problem.

We feed the tokenized current sentence into a Bi-LSTM network and acquire the context embeddings, which consist of the concatenated hidden outputs at each token step. In the “vanilla” version of this framework, the context embeddings are then fed into an FC layer and a CRF (conditional random field) layer to generate the final output. In our framework, as shown in Figure 1, starting from the second batch, where we are able to retrieve sentence and memory embeddings from the previous batch(es), we concatenate the memory embedding with each token’s context embedding before feeding them into the FC layer and the CRF layer.
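The concatenation step can be illustrated as follows. This is a minimal sketch with toy vectors; the function name `attach_memory` is an assumption, and real context embeddings would come from the Bi-LSTM rather than being written by hand.

```python
def attach_memory(context, memory):
    """Append the same memory embedding to every token's context embedding.

    context: list of per-token vectors, each of length c
    memory:  single vector of length m
    returns: list of per-token vectors, each of length c + m
    """
    return [list(token_vec) + list(memory) for token_vec in context]

# Three tokens with 2-dim context embeddings and a 3-dim memory embedding.
context = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
memory = [9.0, 8.0, 7.0]
combined = attach_memory(context, memory)
print(combined[1])  # [2.0, 3.0, 9.0, 8.0, 7.0]
```

Because the memory embedding is shared by all tokens of the sentence, the FC layer that follows must accept an input of size c + m instead of c.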

3 Experiments

3.1 Data and Simulation

Since most publicly available data sets have already been carefully annotated and adjudicated, and we cannot access or recover the annotation history, we evaluate our proposed framework by simulating the scenario of increasing quality: we randomly drop or swap labels. In this section, we use the ACE2005 data set, which follows the ACE schema and contains 33 types of event annotation, as the test bed. In comparison with other sequence labeling data sets, it has a substantially larger label space (67 labels under the BIO scheme). In the BIO scheme, B denotes the beginning of a word/phrase segment and I denotes the inside of the segment; e.g., in the phrase “shot down”, “shot” is labeled B-Attack and “down” is labeled I-Attack. With 2 labels for each of the 33 event types we have 66 labels, plus an O label for tokens without any event label, hence 67. We select documents from bc (broadcast conversations), bn (broadcast news) and nw (newswire). These documents are tagged with release dates between March and June 2003. We simulate the batches according to the months and list the batch information in Table 1. For each batch, we split the data into training, development and test sets as shown in Table 2.
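The label-space arithmetic can be checked with a short sketch. The event type names below are placeholders (only Attack is taken from the text), and `bio_label_space` is a hypothetical helper.

```python
def bio_label_space(event_types):
    """Build the BIO label set: one O label plus B-/I- labels per event type."""
    labels = ["O"]
    for event_type in event_types:
        labels.append(f"B-{event_type}")
        labels.append(f"I-{event_type}")
    return labels

# 33 event types; names other than "Attack" are placeholders.
event_types = ["Attack"] + [f"Type{i}" for i in range(1, 33)]
labels = bio_label_space(event_types)
print(len(labels))  # 67

# The "shot down" example from the text, embedded in a toy sentence.
tokens = ["They", "shot", "down", "the", "plane"]
tags = ["O", "B-Attack", "I-Attack", "O", "O"]
```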

We simulate annotation errors with noise levels, which indicate the probability of swapping or dropping a label in the training sets and which decrease as the month tags advance. We have two simulation settings, each named after the noise level of its first month; the noise levels for the four month tags are as follows:

  • 25%: 25%, 10%, 5%, and 0%.

  • 10%: 10%, 5%, 0%, and 0%.

In the tables, we use “Noise” to denote the two simulation groups.

We use spaCy to acquire the stem and lemmatized string of each token in the training set, and we then build a “confusing” list of trigger stems that carry multiple event type labels. For example, “fire” can be an Attack trigger or an End-Position trigger. When simulating erroneous annotation, we use the noise level of the month as a threshold: if the random generator produces a number lower than the noise level, we either drop the label of that trigger (i.e., set the label to O) or, if the trigger appears on the aforementioned “confusing” list, randomly select another event label that also admits this trigger (an O label may also be selected). These actions simulate the scenarios where annotators “miss” a trigger (O label) or “are confused” by other event types that the word may also trigger (other labels). However, for the sake of valid parameter tuning and fair comparison, the labels in the development and test sets remain intact.
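The corruption procedure above might be sketched as follows. This is an illustration under assumptions, not the authors' script: labels are plain event type names rather than BIO tags, the `confusing` mapping is a hand-written stand-in for the list derived with spaCy, and choosing uniformly among O and the alternative types is one possible reading of the description.

```python
import random

def corrupt_label(stem, label, noise_level, confusing, rng):
    """Return the (possibly corrupted) label of one trigger token.

    stem:        stem/lemma of the trigger word
    label:       gold event type, or "O" for non-triggers
    noise_level: probability of corrupting a trigger label (0.0 to 1.0)
    confusing:   dict mapping a stem to the event types it can trigger
    rng:         random.Random instance, for reproducibility
    """
    if label == "O" or rng.random() >= noise_level:
        return label  # untouched
    # "Miss" (O) is always possible; "confusion" only for ambiguous stems.
    alternatives = sorted(t for t in confusing.get(stem, ()) if t != label)
    return rng.choice(["O"] + alternatives)

confusing = {"fire": {"Attack", "End-Position"}}
rng = random.Random(13)

# Noise levels of the "25%" setting, one per month tag.
monthly_noise = {"200303": 0.25, "200304": 0.10, "200305": 0.05, "200306": 0.0}
noisy = {month: corrupt_label("fire", "Attack", level, confusing, rng)
         for month, level in monthly_noise.items()}
```

Only training labels would pass through such a function; the development and test labels stay intact, as stated in the text.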

For each noise level, we generate 10 groups of simulation data, and we run experiments 10 times and calculate the average of precision and recall scores, then calculate F1 scores.
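The aggregation order described here (average precision and recall first, then derive F1) can be sketched as below. The function name `f1_from_runs` and the sample scores are illustrative assumptions and are not values from the paper.

```python
def f1_from_runs(precisions, recalls):
    """Average precision and recall over runs, then compute F1 from the means."""
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1

# Hypothetical scores (%) from two runs.
p, r, f1 = f1_from_runs([50.0, 70.0], [60.0, 60.0])
print(p, r, f1)  # 60.0 60.0 60.0
```

Computing F1 from the averaged precision and recall generally differs slightly from averaging per-run F1 scores, so the order matters when comparing against other reports.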

Months Doc. Sent. Words Triggers
Table 1: Statistics of ACE2005 with month tags.
Months Train Dev Test All Test
Table 2: Training, development and test splits with month tags. For intuitive comparison, we use all 40 documents as the test set regardless of their month tags.
Noise  Strategy   200303             200304             200305             200306
                  P     R     F1     P     R     F1     P     R     F1     P     R     F1
25%    All        56.9  56.6  56.7   63.4  68.7  65.9   70.1  68.6  69.3   71.3  70.4  70.8
25%    Current    56.9  56.6  56.7   60.5  55.8  58.1   60.4  48.3  53.7   63.5  56.0  59.5
25%    Finetune   56.9  56.6  56.7   67.9  67.7  67.9   66.6  64.8  65.7   70.2  66.5  68.3
25%    Proposed   56.9  56.6  56.7   74.2  66.7  70.3   71.2  74.1  72.7   72.9  76.1  74.4
10%    All        67.9  70.4  69.1   80.1  78.3  79.2   82.7  79.9  81.2   83.4  82.3  82.8
10%    Current    67.9  70.4  69.1   64.8  60.8  62.7   54.4  57.2  55.8   63.5  56.0  59.5
10%    Finetune   67.9  70.4  69.1   72.7  75.6  74.1   77.3  74.8  76.0   74.5  76.7  75.5
10%    Proposed   67.9  70.4  69.1   80.5  79.6  80.1   85.7  78.9  82.2   84.5  82.8  83.6
Table 3: Performance (%) comparison of different strategies and noise levels, per month slice (P: precision, R: recall). The strategies (All, Current, Finetune, Proposed) are introduced in Section 3.2 and the noise levels in Section 3.1. Since there is no “previous” batch for the first batch, the results in the first batch are identical.

3.2 Settings and Strategies

We have the following settings to simulate the strategies (including our proposed method) of dealing with stream data:

  • All data: Denoted as “All”. In this setting, in each month, we use all training data available at that time point (i.e., the training data from the current month and from all earlier months).

  • Current batch only: Denoted as “Current”. In this setting, we exclusively use the data of the current month to train a new model for the current month.

  • Aggregated model: Denoted as “Finetune”. In this setting, after training/finetuning a model using the data from the previous month, we iteratively finetune the model using the data from the current month, starting from the best model according to the F1 score on the development set.

  • Proposed framework: Denoted as “Proposed”, our proposed method.
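The difference between the settings comes down to which training instances are consumed and whether the model starts from the previous month's parameters. The sketch below encodes that reading; the function `training_plan` and its return format are assumptions made for illustration.

```python
def training_plan(strategy, batches, t):
    """Return (training_data, warm_start) for the batch at 0-based index t.

    batches:    list of monthly batches, oldest first; each is a list of instances
    warm_start: True if training starts from the previous month's best model
    """
    if strategy == "All":
        # Every instance seen so far, model trained from scratch.
        data = [x for batch in batches[: t + 1] for x in batch]
        return data, False
    if strategy == "Current":
        return list(batches[t]), False
    if strategy in ("Finetune", "Proposed"):
        # Only the current batch is annotated input to the core network.
        # "Proposed" additionally reads sentence embeddings (not labels)
        # from earlier batches through the memory module.
        return list(batches[t]), t > 0
    raise ValueError(f"unknown strategy: {strategy}")

batches = [["a1", "a2"], ["b1"], ["c1", "c2", "c3"]]
print(training_plan("All", batches, 2))       # (['a1', 'a2', 'b1', 'c1', 'c2', 'c3'], False)
print(training_plan("Finetune", batches, 2))  # (['c1', 'c2', 'c3'], True)
```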

3.3 Document Pre-Processing

To preprocess documents, we use spaCy for segmentation, tokenization and PoS (Part-of-Speech) tagging.

Because the tokenization tools differ, we do not use BERT for event trigger detection: when we use the same WordPiece tokenizer as BERT, the PoS tagger from spaCy does not work well; when we feed spaCy tokenization results into the pretrained BERT framework, the performance degrades significantly. Hence, in this paper, we use BERT to extract sentence-level embeddings only.

3.4 Hyper parameters

For the sentence-level embedding extractor, the parameters of the BERT framework are pretrained. The CLS output of the BERT framework for an input sentence is a fixed-size vector (we did not use the larger embedding size due to limited GPU memory), and an FC layer projects this vector to the sentence-level embedding dimension. We pretrain the parameters of this FC layer so that it works as a fixed sentence-level embedding extractor. We use the 360 example sentences in the ACE annotation guideline, add a softmax layer with 33 labels (the number of event types), and train this sub-framework to predict the existence of an event type in the guideline sentences. After we remove the softmax layer, we take the output of the FC layer as a sentence-level embedding that captures the features indicating the existence of event types.

For the memory embedding extractor, the dimension of the memory embedding equals the hidden state size of the memory network.

For the core network, we use pre-trained Word2Vec [15] embeddings, which are trained on the Wikipedia article dump of January 1st, 2017 and tokenized with spaCy, together with PoS tag embeddings and character embeddings (from a character-based Bi-LSTM network that takes character embeddings as its original input). The Bi-LSTM that extracts each token’s context embedding uses the same hidden state size in both directions; the two directions are concatenated into the context embedding, which is in turn concatenated with the memory embedding. An FC layer precedes the CRF layer.

We have two optimizers in this framework (discussed in the following subsection). Both are Adam [8] optimizers with the same learning rate.

3.5 Train and Optimization

In our work, we first pretrain the FC layer of the sentence-level embedding extractor as mentioned in Section 3.4, using the example sentences from the ACE annotation guideline. We use 90% of them as training sentences and 10% as test sentences. To prevent over-fitting, we run the training process multiple times and select the model that converges (achieves the highest test score) at the earliest epoch.

The core network has another, independent optimizer. Back-propagation from the core network updates the parameters of both the Bi-LSTM-CRF framework [14] and the memory network; however, the parameters of the sentence-level embedding extractor are fixed and not updated.

Conventional strategies (All, Finetune, Current) do not train a memory network or a sentence-level embedding extractor; they only go through the Bi-LSTM-CRF network (no memory embeddings are concatenated). Since there is no “previous” batch for the first batch, our proposed framework is trained from the second batch onward: we inherit the parameters of the Bi-LSTM-CRF network from the conventional strategies and continue to train the parameters of both the core and the memory network as mentioned above. Moreover, at the second batch, the FC layer before the CRF layer in our proposed framework is newly initialized, because its input now includes the concatenated memory embedding; later batches keep this FC layer and its parameters. Empirically, we did not observe any negative impact from this approach.

3.6 Results


In Table 3 we show the performance of the strategies at the two noise levels. As expected, the Current strategy performs worst among all strategies because the model does not have enough training data and easily over-fits; hence we observe a drop in the 200305 batch even though this batch has higher-quality annotations. The Finetune and All strategies perform better, but the noise still influences the performance. The Finetune strategy does not sufficiently alleviate the influence of wrong annotations in the early batches. Our method achieves the highest F1 score in every batch after the first, because the memory embeddings successfully reinforce information about the occurrence of the same events in sentences from past batches. Moreover, the improvement is larger when the noise level is higher.


We also report the average training time of the different strategies in Table 4; we run the code on 8 Intel CPU cores and one Nvidia P100 GPU. The Finetune and Current strategies take a similar amount of time and are faster than the All strategy, because the All strategy consumes data instances from both the current and previous batches, which prolongs training. The core network in our proposed framework processes the same number of training instances as the Finetune and Current strategies, with additional time for retrieving sentence-level embeddings and training the memorization module. Since the input to the memorization module is much smaller than all token embeddings from the previous batches, it takes less extra time than the All strategy. The storage cost paid for the saved training time is small: each sentence requires merely 6k bytes when its embedding is stored in double-precision floating point.
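The storage figure can be sanity-checked with simple arithmetic. The embedding dimension of 768 used below is an assumption: it is not stated at this point in the text, and is chosen only because it is consistent with roughly 6k bytes at 8 bytes per double-precision value. The sentence count in the second call is likewise hypothetical.

```python
def embedding_storage_bytes(dim, num_sentences=1, bytes_per_value=8):
    """Bytes needed to store `num_sentences` embeddings of size `dim` as doubles."""
    return dim * bytes_per_value * num_sentences

# Assumed dimension: 768 values * 8 bytes = 6144 bytes, i.e. about 6 KB.
print(embedding_storage_bytes(768))  # 6144

# Hypothetical store of 10,000 sentences, in megabytes (1 MB = 1024 * 1024 bytes).
print(embedding_storage_bytes(768, num_sentences=10_000) / (1024 * 1024))
```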

Strategy   Training Time (Seconds)
All        18328
Current    12473
Finetune   12567
Proposed   14832
Table 4: Time consumption with different strategies.
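The relative saving with respect to the All strategy follows directly from the numbers in Table 4. The helper `relative_saving` is a hypothetical name used for this short check.

```python
# Training times in seconds, copied from Table 4.
times = {"All": 18328, "Current": 12473, "Finetune": 12567, "Proposed": 14832}

def relative_saving(baseline, candidate):
    """Fraction of the baseline time saved by the candidate."""
    return (baseline - candidate) / baseline

saving = relative_saving(times["All"], times["Proposed"])
print(f"{saving:.1%}")  # 19.1%
```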

4 Related Work

Most recent event extraction approaches, e.g., [1, 11, 7, 17, 18, 12], consider the training data as a whole batch. [10] also utilizes a memory network for event extraction, but it only considers sentences from the same document, and its training instances come in a single batch. To the best of our knowledge, our framework is the first one that tackles annotation arriving in batches.

Most approaches to streaming data processing [5, 9, 16] focus on methods that output labels on data arriving in a stream, while our framework consumes annotation arriving in a stream.

We use the term “memory” to denote information recalled from past batches. We survey work such as [23, 2] where memory mechanism resembles attention while the memorization module in our framework follows the methods in [21, 10, 13], which are based on RNN or LSTMs.

Although we do not discuss them in the previous sections, there are also other ways to work with small batches of data, such as active learning [19, 24]. However, as we emphasized in Section 1, the annotators do not iterate on the early batches of the data set; hence we do not consider learning methods derived from those strategies as our baselines.

5 Conclusion and Future Work

In this paper, we propose a framework that recalls similar information from the past to handle a scenario where more annotation errors appear in early batches and annotators do not revisit those errors. This framework works well under the assumption that annotation quality increases as the annotators become more skilled in the later stages of the annotation task. We use a memory embedding to emulate the progress and improvement of annotators’ skills. In the future, we will focus on “crowd-sourcing” scenarios where the annotation quality is volatile across batches due to the unstable skills of ad-hoc annotators.


References

  • [1] Y. Chen, H. Yang, K. Liu, J. Zhao, and Y. Jia (2018) Collective event detection via a hierarchical and bias tagging networks with gated multi-level attention mechanisms. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1267–1276.
  • [2] E. Chu, P. Vijayaraghavan, and D. Roy (2018) Learning personas from dialogue with attentive memory networks. arXiv preprint arXiv:1810.08717.
  • [3] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • [4] X. Feng, B. Qin, and T. Liu (2018) A language-independent neural network for event detection. Science China Information Sciences 61 (9), pp. 092106.
  • [5] T. Ge, Q. Dou, H. Ji, L. Cui, B. Chang, Z. Sui, F. Wei, and M. Zhou (2018) Fine-grained coordinated cross-lingual text stream alignment for endless language knowledge acquisition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2496–2506.
  • [6] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8).
  • [7] Y. Hong, W. Zhou, G. Zhou, Q. Zhu, et al. (2018) Self-regulation: employing a generative adversarial network to improve event detection. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 515–526.
  • [8] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [9] J. Kuo and H. Chen (2004) Event clustering on streaming news using co-reference chains and event words. In Proceedings of the Conference on Reference Resolution and Its Applications.
  • [10] S. Liu, R. Cheng, X. Yu, and X. Cheng (2018) Exploiting contextual information via dynamic memory network for event detection. arXiv preprint arXiv:1810.03449.
  • [11] X. Liu, Z. Luo, and H. Huang (2018) Jointly multiple events extraction via attention-based graph information aggregation. arXiv preprint arXiv:1809.09078.
  • [12] W. Lu and T. H. Nguyen (2018) Similar but not the same: word sense disambiguation improves event detection via neural representation matching. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4822–4828.
  • [13] F. Ma, R. Chitta, S. Kataria, J. Zhou, P. Ramesh, T. Sun, and J. Gao (2017) Long-term memory networks for question answering. arXiv preprint arXiv:1707.01961.
  • [14] X. Ma and E. Hovy (2016) End-to-end sequence labeling via bi-directional lstm-cnns-crf. arXiv preprint arXiv:1603.01354.
  • [15] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • [16] S. Miranda, A. Znotins, S. B. Cohen, and G. Barzdins (2018) Multilingual clustering of streaming news. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
  • [17] T. H. Nguyen and R. Grishman (2018) Graph convolutional networks with argument-aware pooling for event detection. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • [18] T. M. Nguyen and T. H. Nguyen (2018) One for all: neural joint modeling of entities and events. arXiv preprint arXiv:1812.00195.
  • [19] B. Settles (2009) Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences.
  • [20] P. Stenetorp, S. Pyysalo, G. Topić, T. Ohta, S. Ananiadou, and J. Tsujii (2012) brat: a web-based tool for NLP-assisted text annotation. In Proceedings of the Demonstrations Session at EACL 2012, Avignon, France.
  • [21] S. Sukhbaatar, J. Weston, R. Fergus, et al. (2015) End-to-end memory networks. In Advances in neural information processing systems, pp. 2440–2448.
  • [22] C. Walker, S. Strassel, J. Medero, and K. Maeda (2006) ACE 2005 multilingual training corpus. Linguistic Data Consortium, Philadelphia 57.
  • [23] J. Weston, S. Chopra, and A. Bordes (2014) Memory networks. arXiv preprint arXiv:1410.3916.
  • [24] B. Zhang, X. Pan, T. Wang, A. Vaswani, H. Ji, K. Knight, and D. Marcu (2016) Name tagging for low-resource incident languages based on expectation-driven learning. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 249–259.