Active Learning with Siamese Twins for Sequence Tagging

11/01/2019 ∙ by Rishi Hazra, et al.

Deep learning, in general, and natural language processing methods, in particular, rely heavily on annotated samples to achieve good performance. However, manually annotating data is expensive and time consuming. Active Learning (AL) strategies reduce the need for huge volumes of labelled data by iteratively selecting a small number of examples for manual annotation based on their estimated utility in training the given model. In this paper, we argue that since AL strategies choose examples independently, they may potentially select similar examples, not all of which aid the learning process. We propose a method, referred to as Active^2 Learning (A^2L), that actively adapts to the sequence tagging model being trained, to further eliminate such redundant examples chosen by an AL strategy. We empirically demonstrate that A^2L improves the performance of state-of-the-art AL strategies on different sequence tagging tasks. Furthermore, we show that A^2L is widely applicable by using it in conjunction with different AL strategies and sequence tagging models. We demonstrate that the proposed A^2L is able to reach the full-data F-score with ≈2–16% less data compared to state-of-the-art AL strategies on different sequence tagging datasets.







1 Introduction

Many information extraction tasks utilize modules that identify and classify tokens (words) of interest in a sequence (sentence) into predefined categories. These modules play a key role in determining the performance of the parent information extraction task and are particularly interesting due to their wide applicability. The term sequence tagging [16] is used to describe the general task that is performed by these modules. For example, consider the utility of being able to identify named entities (like people, organisations, places and so on) in news headlines for personalized content recommendation.

Recently, methods based on deep learning have significantly improved the state of the art for sequence tagging; yet the intense labeled-data dependence of these methods has presented a formidable challenge under restricted annotation budgets. The field of Active Learning addresses this problem [9, 18]. Here, the training process alternates between improving the performance of the model on the given labeled examples and intelligently choosing new examples to be manually annotated. The goal is to train the model quickly with minimal manual annotation requirements.

More precisely, an active learning setup consists of a small number of labeled examples and a relatively large number of unlabeled examples. A model trained on labeled examples is used to label the unlabeled examples. A few examples from this newly labeled data are then selectively sampled to be fed back to the training setup after being annotated by experts. This process is repeated until it converges at a point where the cost of retraining exceeds the cost of annotations.
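The alternating train/select/annotate cycle described above can be written as a generic pool-based loop. The sketch below is purely illustrative; `select`, `annotate` and the model interface are hypothetical stand-ins, not the paper's implementation:

```python
def active_learning_loop(model, labeled, unlabeled, select, annotate,
                         rounds=3, budget=5):
    """Generic pool-based active learning loop (illustrative sketch).

    `select(model, pool)` ranks unlabeled examples by estimated utility and
    `annotate(x)` plays the role of the human expert; both are hypothetical
    callables supplied by the caller."""
    for _ in range(rounds):
        model.fit(labeled)                    # retrain on current labels
        if not unlabeled:
            break
        # choose up to `budget` examples the model is least confident about
        chosen = select(model, unlabeled)[:budget]
        for x in chosen:
            unlabeled.remove(x)
            labeled.append((x, annotate(x)))  # expert annotation
    return model, labeled
```

In practice the loop stops when the cost of retraining exceeds the cost of further annotation, as described above.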

As active learning strategies try to acquire candidate examples to be labeled, they suffer from inherent redundancy in the form of similar examples being selected and fed back, thus entailing a higher annotation cost and wasted resources. To overcome this bottleneck, a second step is needed to discard redundant examples, and one of our major contributions is in improving the quality of this step.

Once an active learning strategy has selected examples (that are possibly redundant), we expect the second step to have the following three qualities: (i) since examples are redundant only in the context of a model, the induced redundancy should preferably be addressed from the perspective of the sequence tagging model, (ii) the proposed step should be applicable to a wide variety of sequence tagging models and (iii) it should be compatible with several existing active learning strategies.

In this paper, we focus on active learning for sequence tagging models, as manually annotating sequences is especially challenging and time consuming. We develop a Siamese twins based neural architecture [3] that is trained to predict a similarity score for each pair of input examples in the context of a sequence tagging model. These similarity scores are then used to eliminate redundant examples via clustering.

The model needs to be partially retrained whenever a new set of labeled examples is furnished by the active learning strategy. It is essential to ensure that the model learns from new examples without forgetting what it already knows. We investigate a mixed sampling strategy that achieves this while avoiding unnecessary training overhead.

Contributions: (i) We propose a method to compute model aware similarity between input sequences to actively eliminate redundancy in examples chosen by the active learning strategy; we call this Active^2 Learning (A^2L). (ii) We further improve the efficiency of the training process (in terms of both training time and data requirement) by incorporating a mixed sampling strategy along the lines of transfer learning. (iii) We demonstrate the utility of our approach by experimenting with various combinations of sequence tagging tasks, active learning strategies and datasets.

2 Related Work

Much of the classical work in active learning [8, 1] does not generalize to deep learning architectures, which are known to produce state-of-the-art results. Active learning strategies like margin maximization have been studied in the domain of automatic speech recognition [24]. Recent deep learning models for sequence tagging rely on a Conditional Random Field (CRF) as the decoder [15]. However, in the recent past, Bayesian formulations of deep learning, like the dropout-based approach [11] and Bayes-by-Backprop [2], have provided an alternative technique that is independent of the CRF scores. These methods have been shown to perform better than other active learning strategies for Named Entity Recognition (NER) [26]. While the use of diversity-based querying algorithms [5] has been explored for quite some time, most of them are independent of the model, and almost all of them employ cosine similarity as the standard measure. In our experiments, we use cosine similarity as a baseline and elucidate its ineffectiveness in the active learning task. A few active learning algorithms [28, 12] have tried to combine an informativeness measure with a representativeness measure to find the optimal query instances using clustering. In our work, we use a dynamic similarity measure to actively select the most representative samples.

3 Preliminaries

Let x^(i) denote the i-th sentence in the training set and x_j^(i) the j-th token of this sentence. Our goal is to assign each token to one of the classes in a tag set Y. For example, Y will contain tags like person, place, etc. in the case of a named entity recognition task. We denote the expert annotation for the i-th sentence by y^(i), where y_j^(i) is the label assigned to token x_j^(i).

3.1 Active Learning Strategies

Once a model has been trained on an initial small set of labeled examples, previously unseen and unlabeled examples are presented to it. The model may or may not be able to correctly label these examples. An Active Learning (AL) strategy chooses a subset of these new unseen examples to be labeled by an expert. The goal of an AL strategy is to select examples that, if labeled, would be most informative for the model. Different AL strategies use different heuristics to select this subset of most informative examples. To demonstrate that our approach works across AL strategies, we tested it with three of the most popular: the margin based AL strategy, the entropy based AL strategy and Bayesian Active Learning by Disagreement (BALD). We describe these strategies here.

Margin Based AL strategy:

Let S_θ(y | x) be the score assigned by a sequence tagging model with parameters θ to an output sequence y = (y_1, ..., y_L), where L is the length of sequence x. In our implementation, S_θ(y | x) is obtained as the output of a CRF [14] after using the Viterbi algorithm. In the context of sequence tagging, margin is defined as the difference in scores obtained by the best scoring sequence of tags y* and the second best scoring sequence y** for a sentence x, i.e.:

M(x) = S_θ(y* | x) − S_θ(y** | x).    (1)

Here, y* ≠ y** and M(x) is the margin score calculated for sentence x. A hyperparameter threshold is empirically determined, and all sentences whose margin falls below this threshold are considered to be examples on which the model is less confident (the lower the difference, the higher the confusion). We also tried a variant where M(x) was normalized by the sentence length, but the results did not show any significant improvement. This may be because the difference in (1) neutralizes bias due to length.
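The margin computation above reduces to a simple difference once the two highest Viterbi scores are available. A minimal sketch, assuming the top-2 log-scores per sentence are already computed (the array layout is an assumption; the paper's CRF is not reimplemented here):

```python
import numpy as np

def margin_scores(top2_scores):
    """Margin M(x) = best score - second-best score for each sentence.

    `top2_scores` is an (n, 2) array holding the two highest Viterbi
    log-scores per sentence, best first."""
    s = np.asarray(top2_scores, dtype=float)
    return s[:, 0] - s[:, 1]

def select_confused(top2_scores, threshold):
    """Indices of sentences whose margin falls below the threshold
    (lower margin = higher model confusion)."""
    return np.flatnonzero(margin_scores(top2_scores) <= threshold)
```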

Entropy Based AL strategy:

For a given sentence x of length L and tag sequence y, we define the normalized entropy score as:

nH(x) = −(1/L) Σ_{j=1..L} P(y_j*) log P(y_j*),    (2)

where P(y_j*) is the probability of assigning the most likely tag y_j* to the j-th token in sequence x. In our implementation, P(y_j*) is computed by applying the softmax function to the output of the encoder of the sequence tagging model (described in Section 5.2). Application of the softmax produces a probability distribution over the tags in the tag set. Empirically, it seems important to normalize the entropy with respect to the sentence length, as the unnormalized score is correlated with it and annotating longer sequences is undesirable [6]. The model is considered to be less confident on all sentences whose normalized entropy score nH(x) is higher than a threshold (determined empirically).


Bayesian Active Learning by Disagreement (BALD):

Due to stochasticity, models that use dropout [27] produce a different output, even for the same input, each time a forward pass is executed. In Bayesian Active Learning by Disagreement (BALD), during inference, a fixed number of forward passes with a set dropout rate are executed for each example. The variability in the predicted tag sequences due to dropout is used to compute the confidence of the model on the given example. Let y^(t) represent the best scoring sequence of tags for x in the t-th forward pass, and let T be the number of forward passes; then:

B(x) = 1 − count(mode(y^(1), …, y^(T))) / T.    (3)

The model is considered to be less confident on all samples where B(x) exceeds an empirically determined threshold.
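The disagreement measure can be sketched as the fraction of stochastic forward passes that deviate from the most common predicted tag sequence. The 1 − count(mode)/T form is an assumption consistent with the description in the text:

```python
from collections import Counter

def bald_disagreement(predictions):
    """BALD disagreement for one example.

    `predictions` is a list of T predicted tag sequences obtained from T
    dropout-enabled forward passes. Returns the fraction of passes that
    disagree with the mode (0.0 = all passes agree, model is confident)."""
    T = len(predictions)
    mode_count = Counter(map(tuple, predictions)).most_common(1)[0][1]
    return 1.0 - mode_count / T
```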

3.2 Sequence Tagging Tasks

To demonstrate that our approach works across different tasks, we tested it with two different sequence tagging tasks.

Named Entity Recognition (NER):

NER [19] is a popular sequence tagging task concerned with identifying and classifying the named entities in unstructured text, thus augmenting its semantic content. It is expensive to obtain annotated data for NER, even though there is an abundance of free unlabeled data. Through active learning, one can efficiently select the most informative examples to be labeled, based on what the model already knows, thus sidestepping the huge data annotation cost. In this paper, we use three NER datasets: CoNLL 2003 [23], the MIT Movie Corpus and the GENIA corpus [21], which is a biomedical dataset.

Part-of-Speech (POS) Tagging:

We also experiment with the POS tagging task [16]. In POS tagging, each word in the text has to be assigned a part-of-speech tag based on its definition and context. This task is not straightforward because some words can represent more than one part of speech in different contexts. POS tagging finds applications in sentiment analysis, question answering, and word sense disambiguation, among other tasks. We use the CoNLL 2003 dataset for our experiments on POS tagging.

4 Proposed Method

4.1 Redundancy in AL strategies for Sequence Tagging

Let M denote the sequence tagging model that is being trained and S denote the Siamese twins model. As discussed in Section 3.1, an AL strategy selects examples to be labeled by human experts from a pool of unlabeled examples. We hypothesize that standard AL strategies exhibit redundancy in sampling the most confused examples, i.e. there are many similar examples that can be discarded. This happens, for instance, because when the model is confused about a particular sentence structure, it will have low confidence on all examples that follow that sentence structure. An AL strategy will pick all these examples for manual labeling, but in practice, since most of these examples are similar, there will be redundancy in annotation efforts. Existing approaches (which we have used as baselines) either ignore the issue of redundancy altogether or use simple metrics like cosine distance to compute similarity and eliminate redundant examples. We therefore propose to add a step to the existing active learning setup that actively samples (adapting as training proceeds) the most representative examples, in order to minimize the annotation cost and effort; we term the whole setup Active^2 Learning (A^2L).

4.2 Model Aware Similarity Computation

In order to choose representatives, a measure of similarity between examples is required. We argue that this measure should depend on the current state of the model, as we want to improve the diversity of selected examples in the context of the model. For instance, based on the downstream task, the model may learn to differentiate between active and passive voice, so an AL strategy must choose representative examples from both categories, even though the two forms are semantically similar when the model is not taken into consideration.

One strategy is to use cosine similarity on the encodings generated by the model to filter out similar examples [5, 25]. However, we claim that this approach would be incapable of discovering higher-order similarity information (for example, semantic similarity), because the model has not encountered enough samples during the initial phases of training. Thus, a similarity measure that is more expressive than cosine similarity is needed, and hence we propose to use a Siamese twins network [3] that is trained using auxiliary data.

The Siamese network that we use in our experiments consists of two bidirectional Long Short-Term Memory (LSTM) encoders. For each sentence in the pair, one of the two encoders is randomly chosen. For each word in the sentence, this encoder takes the corresponding output of the encoder of the sequence tagging model (which is again a bidirectional LSTM in our case) as input, and produces the concatenation of the last hidden state of the forward LSTM and the first hidden state of the backward LSTM as output. The similarity between the sentences in the pair is then computed by an energy function applied to the two resulting Siamese encodings.
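The exact energy function is not reproduced in this text. A common choice for Siamese similarity in the LSTM literature is exp(−‖h1 − h2‖_1), which maps a pair of encodings to a score in (0, 1]; the sketch below assumes that form:

```python
import numpy as np

def siamese_energy(h1, h2):
    """Similarity between two sentence encodings (Siamese encoder outputs).

    The exp(-L1) energy is an assumed form, not necessarily the paper's
    exact function. Returns a score in (0, 1]; 1 means identical encodings."""
    h1, h2 = np.asarray(h1, dtype=float), np.asarray(h2, dtype=float)
    return float(np.exp(-np.abs(h1 - h2).sum()))
```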

We train our Siamese network (Fig 1) using the SICK (Sentences Involving Compositional Knowledge) dataset [17] to predict the similarity of sentences. This dataset contains pairs of sentences with a manually annotated similarity score for each pair. While training the Siamese network, the encoder of the sequence tagging model is kept frozen. Since the Siamese network works on the output of the sequence tagging encoder, it can be incorporated alongside any model. As the sequence tagging model is trained over time, the distribution of its encoder's output changes, and hence we periodically retrain the Siamese network (to convergence) to sustain its model awareness. For the GENIA corpus, we use the biomedical semantic similarity corpus BIOSSES instead of SICK.

Once an AL strategy selects examples on which the model is most confused, we use our pretrained Siamese network to assign a similarity score to every pair of selected examples. This yields an n × n similarity matrix, where n is the number of examples selected by the AL strategy. We group the examples into clusters by passing this similarity matrix to the spectral clustering algorithm [20]. We then rank the examples in each cluster based on their confidence scores (obtained from the AL strategy) and retain a fixed number of the lowest scoring examples from each cluster as the most representative examples. These examples are then fed back to the training process after manual annotation.
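The cluster-then-rank step can be sketched with scikit-learn's spectral clustering on a precomputed affinity matrix. Hyperparameter values here are illustrative, not the paper's:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def pick_representatives(similarity, confidence, n_clusters, per_cluster):
    """Cluster AL-selected examples and keep the least confident ones.

    `similarity` is the (n, n) pairwise Siamese similarity matrix and
    `confidence` holds per-example confidence scores from the AL strategy.
    Returns indices of the `per_cluster` lowest-confidence examples from
    each cluster (a sketch of the selection step, not the paper's code)."""
    confidence = np.asarray(confidence, dtype=float)
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed",
                                random_state=0).fit_predict(similarity)
    keep = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        ranked = members[np.argsort(confidence[members])]  # least confident first
        keep.extend(ranked[:per_cluster].tolist())
    return sorted(keep)
```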

Figure 1: Modeling similarity using the Siamese encoder (enclosed by dotted lines). Pairs of sentences from the SICK dataset are fed to the pretrained sequence tagging model. The output of the word encoder is then fed to the Siamese encoder. The last hidden state, corresponding to the sequence length of the sentence, is collected from the Siamese network to assign similarity scores to every pair of sentences.

4.3 Mixed Sampling Strategy

After obtaining expert annotations for the most representative sentences, we add them to a fixed proportion of sentences randomly sampled from the existing training data, and retrain our sequence tagging model on this mixed batch. We call this mixed sampling. Through this process, we ensure that the model maintains a balance between the new representative examples and the existing training data. This not only helps us avert over-fitting to the new examples, but also reduces the training time.
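A toy sketch of mixed sampling follows. The replay proportion relative to the existing training set is an illustrative choice; the actual proportion used is not specified here:

```python
import random

def mixed_batch(new_examples, train_data, proportion=0.5, seed=0):
    """Combine newly annotated representatives with a random sample of
    existing training data for retraining.

    Replaying part of the old data acts as rehearsal against forgetting,
    while avoiding a full retrain over everything seen so far."""
    rng = random.Random(seed)
    k = int(proportion * len(train_data))
    replay = rng.sample(train_data, k)       # random slice of old data
    batch = list(new_examples) + replay
    rng.shuffle(batch)
    return batch
```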

Similar ideas have been used, for instance, in transfer learning literature and one can draw parallels, for example, by considering the existing train data as source domain data and new training data as target domain data. However, to the best of our knowledge, no existing approach uses mixed sampling in the context of active learning. Hence, we believe that the application of this idea to the active learning problem (and a subsequent empirical demonstration that it works) is both novel and useful for the community.

Data: D: sequence tagging dataset;
          D_sim: semantic similarity dataset
 Input: {U_1, ..., U_K}: partitioning of the unlabeled data, where each U_k is a set.
 Output: Labeled data
Initialization: train the sequence tagging model M on a small labeled split of D; train the Siamese network S on D_sim;
for k ← 1 to K do
       C ← AL(M, U_k), the confused examples;
       for each pair (x_i, x_j) in C × C do
             scores[i, j] ← S(x_i, x_j);
       R ← representatives of the clusters obtained by spectral clustering on scores;
       obtain expert labels for R and retrain M with mixed sampling;
Algorithm 1: Active^2 Learning Framework

Algorithm 1 summarizes our setup. In Algorithm 1, AL denotes an AL strategy, and C and R denote the sets of confused and representative examples, respectively.

5 Experiments

Through our experiments, we investigate: (i) the utility of model aware similarity computation (Section 4.2) and (ii) the effect of using the mixed sampling strategy (Section 4.3). Additionally, we perform experiments on different datasets (Section 5.1), sequence tagging tasks (Section 3.2) and active learning strategies (Section 3.1) to demonstrate that the proposed approach is applicable in a wide range of contexts.

5.1 Dataset Description

We experimented with three datasets: CoNLL 2003, MIT Movie Review Corpus and GENIA; in addition to using SICK and BIOSSES as auxiliary datasets for training the Siamese network. Refer to Table 1 for details.

CoNLL 2003 [23], (, ) sentences, ground truth NER and POS tags are available.
MIT Movie Review Corpus (, ) sentences, ground truth NER tags are available.
GENIA [21, 13], (, ) sentences, ground truth NER tags are available. From biomedical domain.
SICK [17], sentence pairs, for each pair ground truth similarity score in range is available.
BIOSSES sentence pairs, for each pair ground truth similarity score in range is available. From biomedical domain.
Table 1: Description of datasets. For the first three rows, figures in tuples correspond to the number of sentences in train and test set respectively.
Setup | F-score at increasing % of labeled data
A^2L (MA Siamese) | 48.47 67.54
Iso Siamese | 48.47 61.18 64.50 67.13 67.43 67.90 69.02
Cosine | 48.57 60.00 64.83 67.14 67.66 68.50 67.91 68.92
InferSent | 48.57 61.09 64.60 66.77 67.61 67.70 68.44 69.15
None (BALD) | 48.57 59.80 62.82 64.80 66.62 66.80 66.84 68.15
Random (no AL strategy) | 48.57 57.50 63.10 63.67 65.30 66.15 66.80 67.80
Table 2: Interpretation of the top left plot of Fig 2 (MIT Movie, CNN-BiLSTM-CRF, BALD). The values in the cells are F-scores at the corresponding percentage of labeled data. It can be seen that, as the percentage of labeled data increases, A^2L (MA Siamese) adapts best to the active learning procedure and consistently performs better than the other baselines.

5.2 Experimental Setup & Hyperparameters

AL strategy hyperparameters:

We perform experiments with three different AL strategies. In the case of BALD, the dropout rate and the number of forward passes were fixed to 0.5 and 51, respectively. We apply normal dropout in the character encoder, as opposed to the recurrent dropout [10] used in the word encoder of the model presented in [26], owing to an improvement in performance. The thresholds for the margin, entropy and BALD strategies were set to 15, 40 and 0.2, respectively. For numerical stability we use log probabilities, and thus the value of the margin-based AL strategy's threshold lies outside the [0, 1] interval. We use the spectral clustering algorithm [20] to cluster the sentences chosen by the AL strategy. The number of clusters was set to 20, and we chose two representative examples from each cluster.

Network architecture:

For our experiments, we use two different architectures: a CNN-BiLSTM-CRF model (CNN for character-level encoding and BiLSTM for word-level encoding) and a BiLSTM-BiLSTM-CRF model [15] (BiLSTM for both character-level and word-level encoding). The CNN-BiLSTM-CRF architecture is a light-weight variation of the model proposed in [26], having one layer in the CNN encoder with three filters of sizes 1, 2 and 3, followed by a max pool, as opposed to three layers in the original setup. This modification was found to produce slightly better results. We use a batch size of 12. We use GloVe embeddings [22] for all datasets except the GENIA corpus, for which we instead use BioWordVec [4].

The Siamese network, which has two BiLSTM encoders as described in Section 4.2, is trained using examples from the SICK dataset (except for the GENIA corpus, where we use the BIOSSES dataset). The similarity scores in both datasets were normalized to lie in the range [0, 1]. For the model aware Siamese (MA Siamese), input sentences are encoded using the encoder of the sequence tagging model and the embeddings are then passed through the Siamese encoders, whereas for the model isolated Siamese (Iso Siamese), the original word embeddings are fed directly to the Siamese encoders. The output of the BiLSTM at the position corresponding to the length of the sentence is taken as the encoding of that sentence. We retrained the Siamese network after every 10 iterations. The baseline cosine-similarity measure is computed by directly using the last output embedding (corresponding to the length of the sentence) from the word encoder of the sequence tagging model.

Training strategy:

We divide the sequence tagging dataset into 50 splits and begin by training the sequence tagging model on one randomly chosen split (2% of the data). This model is then used to encode sentences from the SICK dataset, which are, in turn, used as input for training the Siamese network. Next, at each iteration, we randomly pick one of the remaining splits and use an AL strategy to retrieve the examples on which the sequence tagging model is most confused. This is followed by clustering to extract the most representative examples. The similarity scores are exponentiated to avoid negative values during clustering. All results reported here were obtained by averaging over five different randomly chosen initial splits. While starting with random initial splits presents some variance in F-score, it does not significantly affect the final results, as the model aware similarity setup adjusts accordingly to acquire the necessary examples. All hyperparameters were empirically determined based on performance on the validation sets of the respective datasets. For datasets that do not have a predefined validation set, we use a 20% split of the training file for validation, which is then merged back into the training set before the start of active learning.


To demonstrate that model aware similarity computation is useful, we perform the following control experiments.

  • We use cosine similarity over pretrained InferSent model embeddings [7] as input to the clustering algorithm instead of Siamese similarity scores (legend title in Figure 2: InferSent).

  • We train a Siamese network directly on raw sentences (using GloVe embeddings), which we call the isolated Siamese (Iso Siamese in Figure 2), rather than on the output of the model encoder (A^2L (MA Siamese) in Figure 2), as described in Section 4.2.

  • To show the utility of a Siamese network, we also provide results for the case when cosine similarity between output encodings of the sequence tagging model is used rather than using a Siamese network (Cosine in Figure 2).

  • Additionally, we compare these results with a setup where no AL is applied (Random in Figure 2).

We perform a similar control experiment to investigate the role played by the mixed sampling strategy (legend title in Figure 2: None). We test variants of the model with and without this component and present a summary in Section 5.3.

Figure 2: [Best viewed in color] Comparison of the model aware similarity based approach against other approaches on different datasets using different active learning strategies. The first three rows correspond to NER tagging and the last row corresponds to POS tagging. In each row, from left to right, the three columns represent the BALD, Entropy and Margin based AL strategies. Legend: {100% data: whole data, A^2L (MA Siamese): Model Aware Siamese, Iso Siamese: Model isolated Siamese, Cosine: cosine similarity, InferSent: cosine similarity based on InferSent encodings, None: active learning strategy without the similarity based setup and incremental training, Random: random split (no active learning applied)}. All results were obtained by averaging over 5 random splits. The legend on the last plot is common to all. These plots have been magnified to highlight the regions of interest; the original plots are given in Figure 3 in the supplementary material.

5.3 Results

Figure 2 shows a blown-up view of the performance (F-score) of our approach against the percentage of training data used, for all combinations of datasets and AL strategies. Table 2 gives an interpretation of the top left plot in Figure 2.

(Q) How does our approach compare with other baseline approaches?

Through experiments we have shown that:

  • Our approach consistently outperforms all other methods on all NER and POS datasets.

  • For the BALD AL strategy (which is known to have the best performance for sequence tagging), the improvement over similarity based methods is marginal. Nevertheless, the relative improvement over the baseline (None) remains significant.

  • Our approach outperforms existing approaches that use simple metrics like cosine distance, as described above.

  • By comparing A^2L (MA Siamese) against the None baseline (AL strategy without incremental training), it can be seen that there is no performance drop due to the mixed sampling based incremental training procedure. Moreover, due to the mixed sampling strategy, training time decreases, as one need not retrain on all examples seen thus far.

As shown in Table 3, using a lightweight architecture, we were able to match the performance obtained by training on the full dataset while using a smaller fraction of the data. GENIA is a medical domain dataset and it is especially costly to label such data; we believe that even a marginal improvement here offers significant gains in terms of cost. While comparing different AL strategies is not our primary motive, Fig 2 empirically demonstrates that one can achieve performance comparable to a complex AL strategy like BALD using simple strategies like the Margin and Entropy based approaches, provided one uses our model aware Siamese. For POS tagging, we use the BiLSTM-BiLSTM-CRF model owing to an improvement in performance. For a better understanding of the variation in F-score with the percentage of labeled data, refer to Figure 3 in the supplementary material.

Dataset | % of train data used to reach full-dataset F-score | % less data required to reach full-data F-score
MIT Movie | 31% | 7%
CoNLL 2003 (NER) | 25% | 2%
GENIA | 15% | 11%
CoNLL 2003 (POS) | 31% | 16%
Table 3: Fraction of data used to reach full-dataset performance, and the corresponding improvement over the None baseline in terms of the difference in % of data required to reach the full-data F-score, for the best AL strategy (BALD in all cases).

(Q) How do we know that standard AL strategies are suboptimal because of an example diversity issue?

Our experiments with the MA Siamese demonstrate that performance comparable to the state of the art can be achieved using fewer labels than existing AL strategies require, even with exactly the same setup (same model, AL strategy and dataset). Thus, we can say that the discarded examples were redundant. Avoiding annotation of such samples saves time and brings down the cost (of both computational resources and annotations). Moreover, these gains can be seen across AL strategies and datasets. This can be especially effective for medical domain annotation, which requires high expertise. For additional experiments with a different sequence tagging architecture (BiLSTM-BiLSTM-CRF), refer to Figure 4 in the supplementary material.

6 Conclusion

In this paper, we proposed an approach that uses model aware similarity scores to mitigate redundancy in existing AL strategies for sequence tagging. We empirically demonstrated that our proposed approach performs consistently well across many datasets, AL strategies and models. Although we focused exclusively on the sequence tagging problem, we believe that our idea of using model aware similarity scores can be applied in other contexts. We leave this for future work.


  • [1] P. Awasthi, M. F. Balcan, and P. M. Long (2017-01) The power of localization for efficiently learning linear separators with noise. J. ACM 63 (6), pp. 50:1–50:27. External Links: ISSN 0004-5411, Link, Document Cited by: §2.
  • [2] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15, pp. 1613–1622. External Links: Link Cited by: §2.
  • [3] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah (1994) Signature verification using a "siamese" time delay neural network. In , J. D. Cowan, G. Tesauro, and J. Alspector (Eds.), pp. 737–744. External Links: Link Cited by: §1, §4.2.
  • [4] Q. Chen, Y. Peng, and Z. Lu (2018) BioSentVec: creating sentence embeddings for biomedical texts. CoRR abs/1810.09302. External Links: Link, 1810.09302 Cited by: §5.2.
  • [5] Y. Chen, T. A. Lasko, Q. Mei, J. C. Denny, and H. Xu (2015-12) A study of active learning methods for named entity recognition in clinical text. J. of Biomedical Informatics 58 (C), pp. 11–18. External Links: ISSN 1532-0464, Link, Document Cited by: §2, §4.2.
  • [6] V. Claveau and E. Kijak (2017) Strategies to select examples for active learning with conditional random fields. In CICLing 2017 - 18th International Conference on Computational Linguistics and Intelligent Text Processing, pp. 1–14. External Links: Link Cited by: §3.1.
  • [7] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes (2017-09) Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 670–680. External Links: Link, Document Cited by: 1st item.
  • [8] S. Dasgupta, A. T. Kalai, and C. Monteleoni (2009) Analysis of perceptron-based active learning. J. Mach. Learn. Res. 10, pp. 281–299. External Links: ISSN 1532-4435, Link Cited by: §2.
  • [9] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby (1997) Selective sampling using the query by committee algorithm. Mach. Learn. 28 (2-3), pp. 133–168. External Links: ISSN 0885-6125, Link, Document Cited by: §1.
  • [10] Y. Gal and Z. Ghahramani (2016) A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 1019–1027. External Links: Link Cited by: §5.2.
  • [11] Y. Gal and Z. Ghahramani (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pp. 1050–1059. External Links: Link Cited by: §2.
  • [12] S. Huang, R. Jin, and Z. Zhou (2010) Active learning by querying informative and representative examples. In Advances in Neural Information Processing Systems 23, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta (Eds.), pp. 892–900. External Links: Link Cited by: §2.
  • [13] J. Kim, T. Ohta, Y. Tsuruoka, Y. Tateisi, and N. Collier (2004) Introduction to the bio-entity recognition task at JNLPBA. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and Its Applications, JNLPBA ’04, pp. 70–75. External Links: Link Cited by: Table 1.
  • [14] J. D. Lafferty, A. McCallum, and F. C. N. Pereira (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pp. 282–289. External Links: ISBN 1-55860-778-1, Link Cited by: §3.1.
  • [15] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016) Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260–270. External Links: Document, Link Cited by: §2, §5.2.
  • [16] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini (1993-06) Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist. 19 (2), pp. 313–330. External Links: ISSN 0891-2017, Link Cited by: §1, §3.2.
  • [17] M. Marelli, S. Menini, M. Baroni, L. Bentivogli, R. Bernardi, and R. Zamparelli (2014) A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), External Links: Link Cited by: §4.2, Table 1.
  • [18] A. McCallum and K. Nigam (1998) Employing EM and pool-based active learning for text classification. In Proceedings of the Fifteenth International Conference on Machine Learning, ICML ’98, pp. 350–358. External Links: ISBN 1-55860-556-8, Link Cited by: §1.
  • [19] D. Nadeau and S. Sekine (2007-01) A survey of named entity recognition and classification. Linguisticae Investigationes 30 (1), pp. 3–26. External Links: Link Cited by: §3.2.
  • [20] A. Y. Ng, M. I. Jordan, and Y. Weiss (2002) On spectral clustering: analysis and an algorithm. In Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani (Eds.), pp. 849–856. Cited by: §4.2, §5.2.
  • [21] T. Ohta, Y. Tateisi, and J. Kim (2002) The GENIA corpus: an annotated research abstract corpus in molecular biology domain. In Proceedings of the Second International Conference on Human Language Technology Research, HLT ’02. Cited by: §3.2, Table 1.
  • [22] J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. External Links: Link Cited by: §5.2.
  • [23] E. F. T. K. Sang and F. De Meulder (2003-07) Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Computational Natural Language Learning (CoNLL-2003). External Links: Document Cited by: §3.2, Table 1.
  • [24] F. Sha and L. K. Saul (2007) Large margin hidden Markov models for automatic speech recognition. In Advances in Neural Information Processing Systems 19, B. Schölkopf, J. C. Platt, and T. Hoffman (Eds.), pp. 1249–1256. External Links: Link Cited by: §2.
  • [25] Y. Shen, H. Yun, Z. C. Lipton, Y. Kronrod, and A. Anandkumar (2018) Deep active learning for named entity recognition. In 6th International Conference on Learning Representations, External Links: Link Cited by: §4.2.
  • [26] A. Siddhant and Z. C. Lipton (2018-10) Deep Bayesian active learning for natural language processing: results of a large-scale empirical study. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2904–2909. External Links: Link Cited by: §2, §5.2, §5.2.
  • [27] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014-01) Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15 (1), pp. 1929–1958. External Links: ISSN 1532-4435, Link Cited by: §3.1.
  • [28] Z. Xu, K. Yu, V. Tresp, X. Xu, and J. Wang (2003) Representative sampling for text classification using support vector machines. In Proceedings of the 25th European Conference on IR Research, ECIR’03, pp. 393–407. External Links: ISBN 3-540-01274-5, Link Cited by: §2.

Appendix A Additional Results (Original Plots)

Appendix B Additional Results (BiLSTM-BiLSTM-CRF Plots)