Active^2 Learning: Actively reducing redundancies in Active Learning methods for Sequence Tagging and Machine Translation

03/11/2021 · by Rishi Hazra, et al.

While deep learning is a powerful tool for natural language processing (NLP) problems, successful solutions to these problems rely heavily on large amounts of annotated samples. However, manually annotating data is expensive and time-consuming. Active Learning (AL) strategies reduce the need for huge volumes of labeled data by iteratively selecting a small number of examples for manual annotation based on their estimated utility in training the given model. In this paper, we argue that since AL strategies choose examples independently, they may potentially select similar examples, all of which may not contribute significantly to the learning process. Our proposed approach, Active^2 Learning (A^2L), actively adapts to the deep learning model being trained to eliminate further such redundant examples chosen by an AL strategy. We show that A^2L is widely applicable by using it in conjunction with several different AL strategies and NLP tasks. We empirically demonstrate that the proposed approach is further able to reduce the data requirements of state-of-the-art AL strategies by an absolute percentage reduction of ≈3-25% on multiple NLP tasks while achieving the same performance with no additional computation overhead.


1 Introduction

Active Learning (AL) Freund et al. (1997); McCallum and Nigam (1998) reduces the need for large quantities of labeled data by intelligently selecting unlabeled examples for expert annotation in an iterative process. Many Natural Language Processing (NLP) tasks like sequence tagging (NER, POS), Neural Machine Translation (NMT), etc., are very data-intensive and require a meticulous, time-consuming, and costly annotation process. On the other hand, unlabeled data is practically unlimited. Due to this, many researchers have explored applications of active learning for NLP Thompson et al. (1999); Figueroa et al. (2012). A general AL method proceeds as follows: (i) The partially trained model for a given task is used to (possibly incorrectly) annotate the unlabeled examples. (ii) An active learning strategy selects a subset of the newly labeled examples via a criterion that quantifies the perceived utility of examples in training the model. (iii) The experts verify/improve the annotations for the selected examples. (iv) These examples are added to the training set, and the process repeats. AL strategies differ in the criterion used in step (ii).

We claim that all AL strategies select redundant examples in step (ii). If one example satisfies the selection criterion, then many other similar examples will also satisfy it (see the next paragraph for details). As the examples are selected independently, AL strategies redundantly choose all of these examples even though, in practice, it is enough to label only a few of them (ideally just one) for training the model. This leads to higher annotation costs and wastage of resources, and reduces the effectiveness of AL strategies. This paper addresses this problem by proposing a new approach called A^2L (read as active-squared learning) that further reduces the redundancies of existing AL strategies.

Any approach for eliminating redundant examples must have the following qualities: (i) The redundancy should be evaluated in the context of the trained model. (ii) The approach should apply to a wide variety of commonly used models in NLP. (iii) It should be compatible with several existing AL strategies. The first point merits more explanation. As a model is trained, depending on the downstream task, it learns to focus on certain properties of the input. Examples that share these properties (for instance, the sentence structure) are similar from the model’s perspective. If the model is confused about one such example, it will likely be confused about all of them. We refer to a similarity measure that is computed in the context of a model as a model-aware similarity (Section 3.1).

Contributions: (i) We propose a Siamese network Bromley et al. (1994); Mueller and Thyagarajan (2016) based method for computing model-aware similarity to eliminate redundant examples chosen by an AL strategy. This Siamese network actively adapts itself to the underlying model as the training progresses. We then use clustering based on the similarity scores to eliminate redundant examples. (ii) We develop a second, computationally more efficient approach that approximates the first one with a minimal drop in performance by avoiding the clustering step. Both of these approaches have the desirable properties mentioned above. (iii) We experiment with several AL strategies and NLP tasks to empirically demonstrate that our approaches are widely applicable and significantly reduce the data requirements of existing AL strategies while achieving the same performance. To the best of our knowledge, we are the first to identify the importance of model-aware similarity and exploit it to address the problem of redundancy in AL.

2 Related Work

Active learning has a long and successful history in the field of machine learning Dasgupta et al. (2009); Awasthi et al. (2017). However, as the learning models have become more complex, especially with the advent of deep learning, the known theoretical results for active learning are no longer applicable Shen et al. (2018). This has prompted a diverse range of heuristics to adapt the active learning framework to deep learning models Shen et al. (2018). Many AL strategies have been proposed Sha and Saul (2007); Blundell et al. (2015); Gal and Ghahramani (2016b); Haffari et al. (2009); Bloodgood and Callison-Burch (2010); however, since they choose the examples independently, the problem of redundancy (Section 1) applies to all of them.

We experiment with various NLP tasks like named entity recognition Nadeau and Sekine (2007), part-of-speech tagging Marcus et al. (1993), neural machine translation Hutchins (2004); Nepveu et al. (2004); Ortiz-Martínez (2016), and so on Tjong Kim Sang and Buchholz (2000); Landes and Leacock (1998). The tasks chosen by us form the backbone of many practical problems and are known to be computationally expensive in both training and inference. Many deep learning models have recently advanced the state-of-the-art for these tasks Siddhant and Lipton (2018); Lample et al. (2016); Bahdanau et al. (2014). Our proposed approach is compatible with any NLP model, provided it supports the usage of an AL strategy.

Many recent attempts at applying active learning to sequence tagging and NMT have been made Siddhant and Lipton (2018); Peris and Casacuberta (2018); however, the issue of redundancy (Section 1) has largely been ignored. Existing approaches have used model-independent similarity scores to promote diversity in the chosen examples. For instance, in Chen et al. (2015), the authors use cosine similarity to pre-calculate pairwise similarity between examples. We instead argue in favor of model-aware similarity scores and learn an expressive notion of similarity using neural networks. We compare our approach with a modified version of this baseline using cosine similarity on InferSent embeddings Conneau et al. (2017).

3 Proposed Approaches

We use M to denote the model being trained for a given task. M has a module called the encoder for encoding the input sentences; for instance, the encoder in M may be modeled by an LSTM Hochreiter and Schmidhuber (1997).

3.1 Model-Aware Similarity Computation

A measure of similarity between examples is required to discover redundancy. The simplest solution is to compute the cosine similarity between input sentences Chen et al. (2015); Shen et al. (2018) using, for instance, the InferSent encodings Conneau et al. (2017). However, sentences that have a low cosine similarity may still be similar in the context of the downstream task. Model M has no incentive to distinguish among such examples. A good strategy is to label a set of sentences that is diverse from the perspective of the model. For example, it is unnecessary to label sentences that use different verb forms but are otherwise similar, if the task is agnostic to the tense of the sentence. A straightforward extension of cosine similarity to the encodings generated by M achieves this. However, a simplistic approach like this would likely be incapable of discovering complex similarity patterns in the data. Next, we describe two approaches that use more expressive model-aware similarity measures.
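As an illustration of the straightforward extension mentioned above, a model-aware cosine similarity can be computed directly from the encoder outputs of M. The sketch below assumes a callable encode(sentence) that returns a fixed-size vector from M's encoder (a hypothetical name used only for illustration).

import numpy as np

def cosine(u, v, eps=1e-8):
    # Cosine similarity between two encoding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def model_aware_cosine(encode, sent_a, sent_b):
    # `encode` is assumed to be the task model's sentence encoder, e.g., the
    # final hidden state of its word-level (Bi)LSTM, returned as a 1-D vector.
    return cosine(encode(sent_a), encode(sent_b))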

Data: D_task: task dataset;
      D_aux: auxiliary similarity dataset
Input: {U_1, ..., U_T}: partitioning of unlabeled data, each U_t is a set.
Output: Labeled data L
Initialization: L <- small initial subset of dataset D_task;
Annotate(L);
Train(M, L);                              // task model
Train(Siamese / C, D_aux);                // similarity / clustering network
for t <- 1 to T do
       X <- ALS(M, U_t);                  // confused samples chosen by the AL strategy
       if using the Siamese network       // Model-Aware Siamese
       then
            for each pair (x_i, x_j) in X do
                   S_ij <- sim(x_i, x_j);
            clusters <- Cluster(S);       // spectral clustering on S
       else
            // Integrated Clustering
            clusters <- C(X);
       X' <- fixed number of representatives from each cluster;
       Annotate(X');
       L <- L ∪ X';
       Retrain(M, L); periodically retrain the Siamese / C
Algorithm 1 Active^2 Learning (A^2L)

3.2 Model-Aware Siamese

In this approach, we use a Siamese network Bromley et al. (1994) to compute the pairwise similarity between encodings obtained from model M. A Siamese network consists of an encoder (called the Siamese encoder) that feeds on the output of model M's encoder. The outputs of the Siamese encoder are used for computing the similarity between each pair of examples x_i and x_j as:

sim(x_i, x_j) = exp(−‖h_i − h_j‖_1)    (1)

where h_i and h_j are the outputs of the Siamese encoder for sentences x_i and x_j respectively. Let N denote the number of examples chosen by an AL strategy. We use the Siamese network to compute the entries of an N x N similarity matrix S, where the (i, j)-th entry is sim(x_i, x_j). We then use the spectral clustering algorithm Ng et al. (2002) on the similarity matrix to group similar examples. A fixed number of examples from each cluster are added to the training dataset after annotation by experts.

We train the Siamese encoder to predict the similarity between sentences from the SICK (Sentences Involving Compositional Knowledge) dataset Marelli et al. (2014) using mean squared error. This dataset contains pairs of sentences with manually annotated similarity scores. The sentences are encoded using the encoder in M and then passed on to the Siamese encoder for computing similarities. The encoder in M is kept fixed while training the Siamese encoder. The trained Siamese encoder is then used for computing the similarity between sentences selected by an AL strategy for the given NLP task, as described above. As M is trained over time, the distribution of its encoder output changes, and hence we periodically retrain the Siamese network to sustain its model-awareness.

The number of clusters and the number of examples drawn from each cluster are user-specified hyper-parameters. The similarity computation can be done efficiently by computing the output of the Siamese encoder once for each of the N examples before evaluating equation 1, instead of running the Siamese encoder once per pair. The clustering algorithm runs in time polynomial in N. For an AL strategy to be useful, it should select a small number of examples to benefit from interactive and intelligent labeling. We expect N to be small for most practical problems, in which case the computational complexity added by our approach would only be a small fraction of the overall computational complexity of training the model with active learning (see Figure 1).
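To make the selection step concrete, the sketch below clusters the examples chosen by an AL strategy using a pairwise Siamese similarity matrix and spectral clustering, and keeps a fixed number of representatives per cluster. The model_encode and siamese_encode callables are hypothetical stand-ins for the two encoders (assumed to return 1-D NumPy vectors), and the exponentiated negative L1 distance is only one plausible form of the similarity in equation 1.

import numpy as np
from sklearn.cluster import SpectralClustering

def select_representatives(candidates, model_encode, siamese_encode,
                           n_clusters=20, per_cluster=2):
    # Model-aware encodings: each candidate sentence is encoded by the task
    # model first and then by the Siamese encoder.
    H = np.stack([siamese_encode(model_encode(x)) for x in candidates])

    # Pairwise similarity matrix; an exponentiated negative L1 distance is
    # assumed here (in the spirit of Mueller and Thyagarajan, 2016).
    N = len(candidates)
    S = np.exp(-np.abs(H[:, None, :] - H[None, :, :]).sum(-1))

    # Spectral clustering on the precomputed similarity matrix (Ng et al., 2002).
    labels = SpectralClustering(n_clusters=min(n_clusters, N),
                                affinity="precomputed").fit_predict(S)

    # Keep a fixed number of examples from each cluster for expert annotation.
    selected = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0][:per_cluster]
        selected.extend(candidates[i] for i in idx)
    return selected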

3.3 Integrated Clustering Model

While the approach described in Section 3.2 works well for small to moderate values of N, it suffers from a computational bottleneck when N is large. We integrate the clustering step into the similarity computation step to remedy this (see Figure 1) and call the resultant approach the Integrated Clustering Model (Int Model). Here, the output of model M's encoder is fed to a clustering neural network C that has K output units with the softmax activation function. These units correspond to the K clusters, and each example is directly assigned to one of the clusters based on the softmax output.

To train the network C, we choose a pair of similar examples (say x_1 and x_2) and randomly select a negative example (say x_3). We experimented with both the SICK and the Quora Pairs dataset (https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs). All examples are encoded via the encoder of model M and then passed to network C. The unit with the highest probability value for example x_1 is treated as the ground-truth class for example x_2. The objective is to maximize the probability of x_2 belonging to its ground-truth class while minimizing the probability of x_3 belonging to the same class:

(2)

Here, α, β, and γ are user-specified hyperparameters, and p_k(x) denotes the softmax output of the k-th unit for example x. The third term encourages the utilization of all the K units across examples in the dataset. As before, the trained network C is used for clustering examples chosen by an AL strategy, and we select a fixed number of examples from each cluster for manual annotation.
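The exact form and weighting of the terms in equation 2 are not recoverable from the text above, but the following sketch shows one way such an objective could be implemented under the stated description: a cross-entropy-style term pulling x_2 toward the pseudo ground-truth cluster of x_1, a term pushing x_3 away from it, and a batch-level entropy regularizer that encourages utilization of all K units. The weights alpha, beta, gamma and the exact regularizer are assumptions.

import torch
import torch.nn.functional as F

def integrated_clustering_loss(C, enc_x1, enc_x2, enc_x3,
                               alpha=1.0, beta=1.0, gamma=1.0):
    # C maps a batch of (model-encoded) examples to softmax distributions
    # over K clusters, shape (batch, K).
    p1, p2, p3 = C(enc_x1), C(enc_x2), C(enc_x3)

    # Pseudo ground-truth cluster: the highest-probability unit for x_1.
    target = p1.argmax(dim=-1)

    # Pull the similar example x_2 toward that cluster ...
    pull = F.nll_loss(torch.log(p2 + 1e-8), target)
    # ... and push the negative example x_3 away from it.
    push = -torch.log(1.0 - p3.gather(-1, target.unsqueeze(-1)) + 1e-8).mean()

    # Assumed regularizer: negative entropy of the average assignment over the
    # batch, encouraging the use of all K units across examples.
    mean_p = torch.cat([p1, p2, p3], dim=0).mean(dim=0)
    spread = (mean_p * torch.log(mean_p + 1e-8)).sum()

    return alpha * pull + beta * push + gamma * spread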

It is important to note that: (i) These methods are not AL strategies. Rather, they can be used in conjunction with any existing AL strategy. Moreover, given a suitable Siamese encoder or clustering network C, they apply to any model M. (ii) Our methods compute model-aware similarity since the input to the Siamese or the clustering network is encoded using the model M. The proposed networks also adapt to the underlying model as the training progresses. Algorithm 1 describes our general approach, called A^2L (Active^2 Learning).

4 Experiments

We establish the effectiveness of our approaches by demonstrating that they: (i) work well across a variety of NLP tasks and models, (ii) are compatible with the most popular AL strategies, and (iii) further reduce the data requirements of existing AL strategies while achieving the same performance. In particular, we experiment with two broad categories of NLP tasks: (a) Sequence Tagging; (b) Neural Machine Translation. Table 1 lists these tasks and information about the corresponding datasets (including the two auxiliary datasets for training the Siamese network (Section 3.2)) used in our experiments. We begin by describing the AL strategies for the two kinds of NLP tasks.

4.1 Active Learning Strategies for Sequence Tagging

Margin-based strategy:

Let s(y | x; θ) be the score assigned by a model with parameters θ to output y for a given example x. Margin is defined as the difference in scores obtained by the best scoring output y*_1 and the second best scoring output y*_2, i.e.:

Margin(x) = s(y*_1 | x; θ) − s(y*_2 | x; θ)    (3)

where y*_1 = argmax_y s(y | x; θ) and y*_2 is the best scoring output excluding y*_1. The strategy selects examples for which the margin falls below a hyper-parameter threshold. We use the Viterbi algorithm Ryan and Nudd (1993) to compute the scores s(y | x; θ).
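As an illustration, margin-based selection over a pool of candidate sentences could look like the following sketch; best_two_scores is a hypothetical helper returning the top two (e.g., Viterbi) scores for a sentence, and the comparison assumes that a small margin indicates confusion.

def margin_select(pool, best_two_scores, threshold):
    # Select sentences whose margin between the best and second-best
    # scoring output sequences falls below the threshold.
    selected = []
    for x in pool:
        s1, s2 = best_two_scores(x)   # e.g., top-2 Viterbi path scores
        if (s1 - s2) <= threshold:
            selected.append(x)
    return selected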

Task | Dataset | Train/Test | Example (Input / Output)
NER | CoNLL 2003 | / | Fischler proposed EU measures after reports from Britain / B-PER O B-MISC O O O O B-LOC
POS | CoNLL 2003 | / | He ended the World Cup on the wrong note / PRP VBD DT NNP NNP IN DT JJ NN
CHUNK | CoNLL 2000 | / | The dollar posted gains in quiet trading / B-NP I-NP B-VP B-NP B-PP B-NP I-NP
SEMTR | SEMCOR [2] | / | This section prevents the military departments / O Mental Agentive O O Object
NMT | Europarl (en-es) | / | (1) that is almost a personal record for me this autumn ! (2) es la mejor marca que he alcanzado este otoño .
AUX | SICK | | (1) Two dogs are fighting. (2) Two dogs are wrestling and hugging. / Similarity Score: 4 (out of 5)
AUX | Quora Pairs [3] | / (sets) [4] | (1) How do I make friends? (2) How to make friends? / Label: 1
Table 1: Task and dataset descriptions. AUX is the task of training the Siamese network (Section 3.2) or the Integrated network (Section 3.3). Citations: CoNLL 2003 Sang and De Meulder (2003), CoNLL 2000 Tjong Kim Sang and Buchholz (2000), SEMCOR [2], Europarl Koehn (2005), SICK Marelli et al. (2014), Quora Pairs [3].
[2] From a subset of the Brown Corpus Burchfield (1985), using splits from Martínez Alonso and Plank (2017).
[3] https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs
[4] We process the dataset to use only those sentences which are present in at least 5 other pairs. We retrieve 16000 sets, each with a source sentence and 5 other samples (comprising both positive and negative labels). An additional 1000 sets were generated for evaluation.

Entropy-based strategy:

All the NLP tasks that we consider require the model to produce an output for each token in the sentence. Let x be an input sentence that contains n tokens and define p(y*_i) to be the probability of the most likely output for the i-th token in x. Here Y is the set of all possible outputs and y_i is the output corresponding to the i-th token in x. We define the normalized entropy score as:

(4)

A length normalization is added to avoid bias due to the example length, as it may be undesirable to annotate longer examples Claveau and Kijak (2017). The strategy selects examples whose normalized entropy score exceeds a hyper-parameter threshold.
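A plausible instantiation of this score, assuming access to per-token marginal distributions from the tagger, is sketched below; the exact form of equation 4 may differ.

import math

def normalized_entropy_score(token_marginals):
    # token_marginals: list of dicts, one per token, mapping each candidate
    # output tag to its probability under the current model.
    n = len(token_marginals)
    score = 0.0
    for dist in token_marginals:
        p_best = max(dist.values())          # probability of the most likely tag
        score += -math.log(p_best + 1e-12)   # low confidence -> large contribution
    return score / n                          # length normalization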

Bayesian Active Learning by Disagreement (BALD):

Due to stochasticity, models that use dropout Srivastava et al. (2014) produce a different output each time they are executed. BALD Houlsby et al. (2011) exploits this variability in the predicted output to compute model uncertainty. Let y*_t denote the best scoring output for x in the t-th forward pass, and let T be the number of forward passes with a fixed dropout rate, then:

BALD(x) = 1 − count(mode(y*_1, …, y*_T)) / T    (5)

Here the mode operation finds the output which is repeated most often among y*_1, …, y*_T, and the count operation counts the number of times this output was encountered. This strategy selects examples whose BALD score exceeds a hyper-parameter threshold.
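A sketch of this disagreement score using repeated stochastic forward passes is given below; predict_with_dropout is a hypothetical callable that runs one forward pass with dropout active and returns the predicted tag sequence.

from collections import Counter

def bald_score(x, predict_with_dropout, T=51):
    # Run T stochastic forward passes and measure disagreement among the
    # predicted output sequences (converted to tuples so they are hashable).
    outputs = [tuple(predict_with_dropout(x)) for _ in range(T)]
    mode_count = Counter(outputs).most_common(1)[0][1]
    return 1.0 - mode_count / T   # 0 = total agreement, larger = more uncertain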

4.2 Active Learning Strategies for Neural Machine Translation

Least Confidence (LC)

This strategy estimates the uncertainty of a trained model on a source sentence x by calculating the conditional probability of the prediction ŷ given the source sentence Lewis and Catlett (1994).

(6)

A length normalization by |ŷ| (the length of the predicted translation ŷ) is added.
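A minimal sketch of a length-normalized confidence score is shown below, assuming the decoder exposes the log-probability of its best translation through a hypothetical translate_with_logprob helper; the exact form of equation 6 may differ.

def least_confidence_score(src, translate_with_logprob):
    # Returns a length-normalized uncertainty score for a source sentence.
    hyp_tokens, log_prob = translate_with_logprob(src)
    return -log_prob / max(len(hyp_tokens), 1)   # higher = less confident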

Coverage Sampling (CS)

A translation model is said to cover the source sentence if it translates all of its tokens. Coverage is estimated by mapping a particular source token to its appropriate target token, without which the model may suffer from under-translation or over-translation issues Tu et al. (2016). Peris and Casacuberta (2018) proposed to use translation coverage as a measure of uncertainty, given as:

(7)

Here α_ij denotes the attention probability assigned by the model to the j-th source word when predicting the i-th target word. It can be noted that the coverage score will be zero for samples for which the model fully covers the source sentence.
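Under the reading above (a source token counts as covered once its total received attention reaches one), a coverage score of the following form is zero for fully covered sentences; this is a sketch consistent with the description, not necessarily the exact form of equation 7.

import numpy as np

def coverage_score(attention):
    # attention: array of shape (target_len, source_len); attention[i, j] is
    # the attention weight on source word j when predicting target word i.
    received = attention.sum(axis=0)                          # total attention per source word
    return -np.log(np.minimum(received, 1.0) + 1e-12).mean()  # 0 when every source word is covered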

Attention Distraction Sampling (ADS)

Peris and Casacuberta (2018) claimed that, in translating an uncertain sample, the model's attention mechanism will be distracted (dispersed throughout the sentence). Such samples yield attention probability distributions with light tails (e.g., close to a uniform distribution), which can be detected by computing the kurtosis of the attention weights for each target token y_i:

Kurt(y_i) = [ (1/|x|) Σ_j (α_ij − ᾱ)^4 ] / [ (1/|x|) Σ_j (α_ij − ᾱ)^2 ]^2    (8)

where ᾱ is the mean of the distribution of the attention weights (for a target word) over the source words. The kurtosis value will be lower for distributions with light tails, so the average of the negative kurtosis values over all words in the target sentence is used as the distraction score:

ADS(x) = −(1/|ŷ|) Σ_i Kurt(y_i)    (9)
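A sketch of the distraction score under these definitions (the kurtosis of each target word's attention distribution over source words, averaged and negated) is given below.

import numpy as np

def attention_distraction_score(attention):
    # attention: array of shape (target_len, source_len) of attention weights.
    kurt = []
    for row in attention:                      # one distribution per target word
        mean = row.mean()
        m2 = ((row - mean) ** 2).mean()        # second central moment
        m4 = ((row - mean) ** 4).mean()        # fourth central moment
        kurt.append(m4 / (m2 ** 2 + 1e-12))
    # Light-tailed (dispersed) attention -> low kurtosis -> high distraction score.
    return -float(np.mean(kurt))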
Figure 1: Comparison of the time taken for one data selection step in the NMT task by the Model Aware (MA) Siamese and the Integrated Clustering (Int) Model across different AL strategies (ALS). It can be observed that A^2L adds a negligible overhead (a small fraction of the time taken for the ALS) to the overall process.
Task | Dataset | % of train data used to reach full-data F-score | % less data required to reach full-data F-score
POS | CoNLL 2003 | 25% | 16%
NER | CoNLL 2003 | 37% | 3%
SEMTR | SEMCOR | 35% | 25%
CHUNK | CoNLL 2000 | 23% | 11%
Table 2: Fraction of data used for reaching full-dataset performance and the corresponding absolute percentage reduction in the data required over the None baseline (which uses the active learning strategy without the A^2L step), for the best AL strategy (BALD in all cases). Refer to Fig. 7 in the Appendix for CHUNK plots.

4.3 Details about Training

For sequence tagging, we use two kinds of architectures: a CNN-BiLSTM-CRF model (CNN for character-level encoding and BiLSTM for word-level encoding) and a BiLSTM-BiLSTM-CRF model (BiLSTM for both character-level and word-level encoding) Lample et al. (2016); Siddhant and Lipton (2018). For the translation task, we use an LSTM-based encoder-decoder architecture with Bahdanau attention Bahdanau et al. (2014). These models were chosen for their performance and ease of implementation.

The Siamese network used for model-aware similarity computation (Section 3.2) consists of two bidirectional LSTM (BiLSTM) encoders. We pass each sentence in the pair from the SICK dataset through model M and feed the resulting encodings to the Siamese BiLSTM encoder. The output is a concatenation of the terminal hidden states of the forward and backward LSTMs, which is used to compute the similarity score using (1). As noted before, we keep model M fixed while training the Siamese encoders, and use the trained Siamese encoders for computing similarity between examples chosen by an AL strategy. We maintain the model-awareness by retraining the Siamese network after every few iterations (see Appendix C for the retraining period).

The architecture of the clustering model (Section 3.3) is similar to that of the Siamese encoder. Additionally, it has a linear layer with a softmax activation function that maps the concatenation of the terminal hidden states of the forward and backward LSTMs to K units, where K is the number of clusters. To assign an input example to a cluster, we first pass it through the encoder in M and feed the resulting encodings to the clustering model C. The example is assigned to the cluster with the highest softmax output. This network is also retrained after every few iterations to retain model-awareness.

The initial data split used for training the model was a small, randomly sampled fraction of the data for sequence tagging (and similarly for NMT), in accordance with the splitting techniques used in the existing literature on AL. The model is then used to provide input for training the Siamese/Clustering network using the SICK/Quora Pairs data. At each iteration, we gradually add a further small fraction of the data for sequence tagging (and for NMT) by passing randomly picked samples through the A^2L pipeline (which includes the low-confidence examples extracted by the AL step). We average the results over five independent runs with randomly chosen initial splits. See Appendix C for details on hyper-parameters.

Figure 2: [Best viewed in color] Comparison of our approach (A^2L) with baseline approaches on different tasks using different active learning strategies. First row: POS, second row: NER, third row: SEMTR, fourth row: NMT. In the first three rows, from left to right, the three columns represent the BALD, Entropy, and Margin AL strategies. The fourth row represents AL strategies for NMT, from left to right (LC: Least Confidence, CS: Coverage Sampling, ADS: Attention Distraction Sampling). Legend description {100% data: full-data performance, A^2L (MA Siamese): Model Aware Siamese, A^2L (Int Model): Integrated Clustering Model, Cosine: cosine similarity, None: active learning strategy without the clustering step, Random: random split (no active learning applied)}. See Section 4.4 for more details on the baselines. All the results were obtained by averaging over 5 random splits. These plots have been magnified to highlight the regions of interest. For the original plots, refer to Fig. 7 in the Appendix.
Figure 3: [Best viewed in color] Ablation studies on the POS task using different active learning strategies. From left to right, the three columns represent the BALD, Entropy, and Margin based AL strategies. Legend description {100% data: full-data performance, A^2L (MA Siamese): Model Aware Siamese, A^2L (Int Model): Integrated Clustering Model, Iso Siamese: model-isolated Siamese, InferSent: cosine similarity based on InferSent encodings}. See Figure 6 in the Appendix for experiments on other tasks. All the results were obtained by averaging over 5 splits.

4.4 Baselines

We claim that A^2L mitigates the redundancies in the existing AL strategies by working in conjunction with them. We validate our claims by comparing our approaches with three baselines that highlight the importance of the various components.

Cosine:

Clustering is done based on the cosine similarity between the last output encodings (corresponding to the sentence length) from the encoder in M. Although this similarity computation is model-aware, it is simplistic and shows the benefit of using a more expressive similarity measure.

None:

In this baseline, we use the AL strategy without applying A^2L to remove redundant examples. This validates our claim about redundancy in the examples chosen by AL strategies.

Random:

No active learning is used, and random examples are selected at each step.

4.5 Ablation Studies

We perform ablation studies to demonstrate the utility of model-awareness using these baselines:

InferSent:

Clustering is done based on cosine similarity between sentence embeddings Chen et al. (2015) obtained from a pre-trained InferSent model Conneau et al. (2017). This similarity computation is not model-aware and shows the utility of model-aware similarity computation.

Iso Siamese:

To show that the Siamese network alone is not sufficient and model-awareness is needed, in this baseline we train the Siamese network by directly using GloVe embeddings of the words as input, rather than using the output from model M's encoder. This similarity, which is not model-aware, is then used for clustering.

5 Results

Figure 2 compares the performance of our methods with the baselines. It shows the test-set metric on the y-axis against the percentage of training data used on the x-axis for all tasks. See Figures 6 and 7 in the Appendix for additional results.


  1. As shown in Figure 2, our approach consistently outperforms all baselines on the chosen tasks. Note that one should observe how fast the performance increases with the addition of training data (and not just the final performance), as we are trying to evaluate the effect of adding new examples (Table 3). Our ablation studies in Figure 3 show the utility of using model-aware similarity. An interpretation of the plot in the top left corner of Figure 2 (CoNLL 2003 (POS), BALD) is given in Table 3 of the Appendix.

  2. In sequence tagging, we match the performance obtained by training on the full dataset using only a smaller fraction of the data (an absolute 3-25% less data compared to state-of-the-art AL strategies; Table 2). On a large dataset in the NMT task (Europarl), A^2L requires fewer sentences than the Least Confidence AL strategy to reach the same BLEU score.

  3. While comparing different AL strategies is not our motive, Figure 2 also demonstrates that one can achieve performance comparable to a complex AL strategy like BALD using simple AL strategies like margin and entropy, by using the proposed A^2L framework.

  4. Additionally, from Figure 1, it can be observed that for one step of data selection: (i) the proposed MA Siamese model adds minimal overhead to the overall AL pipeline, since it takes an additional time of fewer than 5 seconds (a small fraction of the time taken for the ALS); (ii) by approximating the clustering step, the Integrated Clustering (Int) Model further reduces the overhead down to about 2 seconds. However, owing to this approximation, MA Siamese is observed to perform slightly better than the Int Model (Figure 3). A comparison of training time for various stages of the A^2L pipeline is provided in Figure 4 of the Appendix.

In Appendix A, we provide a qualitative case study that demonstrates the problem of redundancy. It should be noted that the reported improvement numbers are not relative to any baseline; they represent an absolute improvement and are very significant in the context of similar performance improvements reported in the literature.

6 Conclusion

In this paper, we show that one can further reduce the data requirements of Active Learning strategies with our proposed method, A^2L, which uses model-aware similarity computation. We empirically demonstrated that our proposed approaches consistently perform well across many tasks and AL strategies. We compared the performance of our approach with strong baselines to ensure that the role of each component is properly understood.

References

  • P. Awasthi, M. F. Balcan, and P. M. Long (2017) The power of localization for efficiently learning linear separators with noise. J. ACM 63 (6), pp. 50:1–50:27. External Links: Link, Document Cited by: §2.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. Note: cite arxiv:1409.0473Comment: Accepted at ICLR 2015 as oral presentation External Links: Link Cited by: §2, §4.3.
  • M. Bloodgood and C. Callison-Burch (2010) Bucking the trend: large-scale cost-focused active learning for statistical machine translation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10, USA, pp. 854–864. Cited by: §2.
  • C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural networks. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pp. 1613–1622. External Links: Link Cited by: §2.
  • J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah (1994) Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems, J. D. Cowan, G. Tesauro, and J. Alspector (Eds.), pp. 737–744. External Links: Link Cited by: §1, §3.2.
  • R. Burchfield (1985) Frequency analysis of english usage: lexicon and grammar. by w. nelson francis and henry kučera with the assistance of andrew w. mackie. boston: houghton mifflin. 1982. x + 561. Journal of English Linguistics 18 (1), pp. 64–70. External Links: Document, Link, https://doi.org/10.1177/007542428501800107 Cited by: footnote 2.
  • Y. Chen, T. A. Lasko, Q. Mei, J. C. Denny, and H. Xu (2015) A study of active learning methods for named entity recognition in clinical text. J. of Biomedical Informatics 58 (C), pp. 11–18. External Links: Link, Document Cited by: §2, §3.1, §4.5.
  • V. Claveau and E. Kijak (2017) Strategies to select examples for active learning with conditional random fields. In CICLing 2017 - 18th International Conference on Computational Linguistics and Intelligent Text Processing, pp. 1–14. External Links: Link Cited by: §4.1.
  • A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes (2017) Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 670–680. External Links: Link, Document Cited by: §2, §3.1, §4.5.
  • S. Dasgupta, A. T. Kalai, and C. Monteleoni (2009) Analysis of perceptron-based active learning. J. Mach. Learn. Res. 10, pp. 281–299. External Links: Link Cited by: §2.
  • R. Figueroa, Q. Zeng-Treitler, L. Ngo, S. Goryachev, and E. Wiechmann (2012) Active learning for clinical text classification: is it better than random sampling?. Journal of the American Medical Informatics Association : JAMIA 19, pp. 809–16. External Links: Document Cited by: §1.
  • Y. Freund, H. S. Seung, E. Shamir, and N. Tishby (1997) Selective sampling using the query by committee algorithm. Mach. Learn. 28 (2-3), pp. 133–168. External Links: Link, Document Cited by: §1.
  • Y. Gal and Z. Ghahramani (2016a) A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 1019–1027. External Links: Link Cited by: Appendix C.
  • Y. Gal and Z. Ghahramani (2016b) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pp. 1050–1059. External Links: Link Cited by: §2.
  • G. Haffari, M. Roy, and A. Sarkar (2009) Active learning for statistical phrase-based machine translation. In In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 415–423. Cited by: §2.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780. External Links: Link, Document Cited by: §3.
  • N. Houlsby, F. Huszar, Z. Ghahramani, and M. Lengyel (2011) Bayesian active learning for classification and preference learning. CoRR abs/1112.5745. External Links: Link Cited by: §4.1.
  • W. J. Hutchins (2004) The georgetown-ibm experiment demonstrated in january 1954. In Machine Translation: From Real Users to Research, R. E. Frederking and K. B. Taylor (Eds.), Berlin, Heidelberg, pp. 102–114. External Links: ISBN 978-3-540-30194-3 Cited by: §2.
  • P. Koehn (2005) Europarl: A Parallel Corpus for Statistical Machine Translation. In Conference Proceedings: the tenth Machine Translation Summit, Phuket, Thailand, pp. 79–86. External Links: Link Cited by: Table 1.
  • G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016) Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260–270. External Links: Document, Link Cited by: Appendix C, §2, §4.3.
  • S. Landes and C. Leacock (1998) Building a semantic concordance of english. WordNet: An Electronic Lexical Database, pp. . Cited by: §2.
  • D. D. Lewis and J. Catlett (1994) Heterogeneous uncertainty sampling for supervised learning. In In Proceedings of the Eleventh International Conference on Machine Learning, pp. 148–156. Cited by: §4.2.
  • M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini (1993) Building a large annotated corpus of english: the penn treebank. Comput. Linguist. 19 (2), pp. 313–330. External Links: Link Cited by: §2.
  • M. Marelli, S. Menini, M. Baroni, L. Bentivogli, R. bernardi, and R. Zamparelli (2014) A sick cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), External Links: Link Cited by: §3.2, Table 1.
  • H. Martínez Alonso and B. Plank (2017) When is multitask learning effective? semantic sequence prediction under varying data conditions. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 44–53. External Links: Link Cited by: footnote 2.
  • A. McCallum and K. Nigam (1998) Employing em and pool-based active learning for text classification. In Proceedings of the Fifteenth International Conference on Machine Learning, ICML ’98, pp. 350–358. External Links: Link Cited by: §1.
  • J. Mueller and A. Thyagarajan (2016) Siamese recurrent architectures for learning sentence similarity. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pp. 2786–2792. External Links: Link Cited by: §1.
  • D. Nadeau and S. Sekine (2007) A survey of named entity recognition and classification. Linguisticae Investigationes 30 (1), pp. 3–26. External Links: Link Cited by: §2.
  • L. Nepveu, G. Lapalme, P. Langlais, and G. Foster (2004) Adaptive language and translation models for interactive machine translation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, pp. 190–197. External Links: Link Cited by: §2.
  • A. Y. Ng, M. I. Jordan, and Y. Weiss (2002) On spectral clustering: analysis and an algorithm. In Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani (Eds.), pp. 849–856. External Links: Link Cited by: Appendix C, §3.2.
  • D. Ortiz-Martínez (2016) Online learning for statistical machine translation. Computational Linguistics 42 (1), pp. 121–161. External Links: Link, Document Cited by: §2.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. External Links: Link Cited by: Appendix C.
  • Á. Peris and F. Casacuberta (2018) Active learning for interactive neural machine translation of data streams. In Proceedings of the 22nd Conference on Computational Natural Language Learning, Brussels, Belgium, pp. 151–160. External Links: Link, Document Cited by: §2, §4.2, §4.2.
  • M. S. Ryan and G. R. Nudd (1993) The viterbi algorithm. Technical report -. Cited by: §4.1.
  • E. F. T. K. Sang and F. De Meulder (2003) Introduction to the conll-2003 shared task: language-independent named entity recognition. Proceeding of the Computational Natural Language Learning (CoNLL). External Links: Document Cited by: Table 1.
  • F. Sha and L. K. Saul (2007) Large margin hidden markov models for automatic speech recognition. In Advances in Neural Information Processing Systems 19, B. Schölkopf, J. C. Platt, and T. Hoffman (Eds.), pp. 1249–1256. External Links: Link Cited by: §2.
  • Y. Shen, H. Yun, Z. C. Lipton, Y. Kronrod, and A. Anandkumar (2018) Deep active learning for named entity recognition. In 6th International Conference on Learning Representations, External Links: Link Cited by: §2, §3.1.
  • A. Siddhant and Z. C. Lipton (2018) Deep bayesian active learning for natural language processing: results of a large-scale empirical study. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2904–2909. External Links: Link Cited by: Appendix C, §2, §2, §4.3.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15 (1), pp. 1929–1958. External Links: Link Cited by: §4.1.
  • C. A. Thompson, M. E. Califf, and R. J. Mooney (1999) Active learning for natural language parsing and information extraction. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML ’99, San Francisco, CA, USA, pp. 406–414. External Links: ISBN 1558606122 Cited by: §1.
  • E. F. Tjong Kim Sang and S. Buchholz (2000) Introduction to the CoNLL-2000 shared task chunking. In Fourth Conference on Computational Natural Language Learning and the Second Learning Language in Logic Workshop, External Links: Link Cited by: §2, Table 1.
  • Z. Tu, Z. Lu, Y. Liu, X. Liu, and H. Li (2016) Modeling coverage for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 76–85. External Links: Link, Document Cited by: §4.2.

Appendix A Understanding Redundancy and Model Aware Similarity

To convey the notion of redundancy and the idea of model-aware similarity, in this section we examine some example sentences that were deemed similar by the model-aware Siamese in our proposed approach. To obtain these examples, we followed the training procedure outlined in Section 4.3 for the NER task on the CoNLL 2003 dataset. After the model had been trained on an initial portion of the data, we collected examples that were: (i) selected by the AL strategy (BALD) as examples on which the model has low confidence, and (ii) grouped by the clustering procedure into the same cluster based on model-aware Siamese similarity scores. We present two sentences each from some randomly chosen clusters below:

  1. Cluster 1:

    • Russian (B-MISC) double Olympic (B-MISC) swimming champion Alexander (B-PER) Popov (I-PER) was in a serious condition on Monday after being stabbed on a Moscow (B-LOC) street.

    • Vitaly (B-PER) Smirnov (I-PER), president of the Russian (B-MISC) National (I-MISC) Olympic (I-MISC) Committee (I-MISC), said President Boris (B-PER) Yeltsin (I-PER) had given the swimmer Russia’s (B-LOC) top award for his Olympic (B-MISC) performance.

  2. Cluster 2:

    • The newspaper said the Central (B-ORG) Bank (I-ORG) special administration of Banespa (B-ORG) ends in December 30 and after that the bank has to be liquidated or turned into a federal bank since there are no conditions to return Banespa (B-ORG) to Sao (B-LOC) Paulo (I-LOC) state government.

    • The newspaper said Bamerindus (B-ORG) has sent to the Central (B-ORG) Bank (I-ORG) a proposal for restructuring combined with a request for a 90-day credit line, paying four percent a year plus the Basic Interest Rate of the Central (B-ORG) Bank (I-ORG) ( TBC (B-ORG) ).

Ground-truth tags are reported alongside the words, except for the words that belong to the “Other” class. For the sake of comparison, we also provide examples from two clusters that were obtained by using the cosine similarity metric on the InferSent embeddings (the InferSent baseline described in Section 4.5). As in the previous case, these examples have been selected by the AL strategy (BALD) for the same task and dataset as before. Note that the similarity computation is not model-aware in this case.

  1. Cluster 1:

    • "His condition is serious," said Rimma (B-PER) Maslova (I-PER), deputy chief doctor of Hospital (B-LOC) No (I-LOC) 31 (I-LOC) in the Russian (B-MISC) capital.

    • Popov (B-PER) told NTV (B-ORG) television on Sunday he was in no danger and promised he would be back in the pool shortly.

  2. Cluster 2:

    • MOTORCYCLING - JAPANESE (B-MISC) WIN BOTH ROUND NINE SUPERBIKE RACES.

    • Honda’s (B-ORG) Takeda (B-PER) was pursued past Corser (B-PER) by the Yamaha (B-ORG) duo of Noriyuki (B-PER) Haga (I-PER) and Wataru (B-PER) Yoshikawa (I-PER) with Haga (B-PER) briefly taking the lead in the final chicane on the last lap.

As expected, when cosine similarity is used, sentences that have roughly similar content are assigned to the same cluster. However, when model-aware similarity is used, in addition to having similar content, the sentences also have a similar tagging structure. As the InferSent-based similarity is agnostic to the downstream task, it cannot capture similarity that is relevant to that task, unlike the model-aware Siamese approach. For the NER task, however, it is sensible to eliminate sentences with similar tagging structures, as they are redundant as far as learning on the downstream task is concerned.

This example not only supports our claim that AL strategies choose redundant examples, but also highlights the utility of using model-aware similarity computation.

Appendix B Additional Remarks

In this section, we make a number of additional remarks about the proposed approach.

b.1 What is the significance of our work?

Obtaining labeled data is both time-consuming and costly. Active learning is employed to minimize the labeling effort. However, as we point out in Section 1, existing techniques may select redundant examples for manual annotation. Due to this redundancy, there is a scope for improvement in the performance of active learning strategies, and our proposed approach fills this gap. Since we demonstrate that our method is compatible with many active learning strategies and deep learning models that are currently in use, it can be applied in a wide range of contexts and is likely to be useful for many sub-communities within the domain of natural language processing without adding significant complexity to the existing systems.

b.2 How do we validate our claim regarding the sub-optimality of standard AL strategies due to redundancy?

The comparison of our approach with the None baseline suggests that performance comparable to the state-of-the-art can be achieved using fewer labels if one incorporates the second step, which eliminates allegedly redundant examples, even when every other aspect of training is exactly the same (same model, AL strategy, and dataset). Thus, we can say that the discarded examples were of no additional help to the model and hence were redundant. Avoiding annotation of such samples saves time and brings down both computational and annotation costs. This can be especially effective in, for instance, the medical domain, where high expertise is required.

Figure 4: Comparison of the training time taken for one epoch of full NMT training by the various models at different stages of the pipeline, namely the (Base) LSTM-with-attention encoder-decoder translation model, the Model Aware (MA) Siamese, and the Integrated Clustering (Int) Model. It can be observed that A^2L adds a negligible overhead to the overall training time as well (a small fraction of the time taken by the base model).

Appendix C Hyper-parameters and other Implementation Details

Figure 5: Modeling similarity using the Siamese encoder (enclosed by dotted lines). A pair of sentences from SICK dataset is fed to the pretrained sequence tagging model. The output of the word encoder is then passed to the Siamese encoder. Last hidden state of the Siamese encoder, corresponding to the sequence length of the sentence, is used for assigning a similarity score to the pair.

Similar hyper-parameter values work across all the tasks. Hence, the same values were used for all experiments, and these values were determined using the validation set of the CoNLL 2003 dataset for the NER task. We use two different sequence tagging architectures: a CNN-BiLSTM-CRF model (CNN for character-level encoding and BiLSTM for word-level encoding) and a BiLSTM-BiLSTM-CRF model Lample et al. (2016) (BiLSTM for both character-level and word-level encoding). The CNN-BiLSTM-CRF architecture is a light-weight variant of the model proposed in Siddhant and Lipton (2018), having one layer in the CNN encoder with two filters of sizes 2 and 3, followed by a max pool, as opposed to three layers in the original setup. This modification was found to improve the results. We use GloVe embeddings Pennington et al. (2014) for all datasets. We apply normal dropout in the character encoder instead of the recurrent dropout Gal and Ghahramani (2016a) used in the word encoder of the model presented in Siddhant and Lipton (2018), owing to an improvement in performance. For numerical stability, we use log probabilities, and thus the value of the margin-based AL strategy's threshold lies outside the interval [0, 1]. We use the spectral clustering Ng et al. (2002) algorithm to cluster the sentences chosen by the AL strategy. We chose two representative examples from each cluster.

Active Learning strategy
threshold (Margin) 15
threshold (Entropy) 40
threshold (BALD) 0.2
dropout (BALD) 0.5
number of forward passes (BALD) 51
Sequence tagging model
CNN filter sizes [2,3]
training batch size 12
splits of train data 50
number of train epochs 16
dimension of character embedding 100
learning rate (Adam) 0.005
learning rate decay 0.9
Siamese encoder
training batch size 48
number of train epochs 41
train/dev split 0.8
learning rate (Adam) 1e-5
period (of retrain) 10
Clustering
Number of clusters 20
Training
Batch size 12
NMT model
training batch size 128
number of train epochs 20
dimension of (sub)word embedding 256
learning rate (Adam) 1e-3
Siamese encoder
training batch size 1150
number of train epochs 25
dimension of (sub)word embedding 300
learning rate (Adam) 1e-3
period (of retrain) 3
Clustering
Number of clusters 50
Training
Batch size 128
Figure 6: [Best viewed in color] Ablation studies on different tasks using different active learning strategies. First row: NER, second row: SEMTR, third row: CHUNK, fourth row: NMT. In the first three rows, from left to right, the three columns represent the BALD, Entropy, and Margin AL strategies. The fourth row represents AL strategies for NMT, from left to right (LC: Least Confidence, CS: Coverage Sampling, ADS: Attention Distraction Sampling). Legend description {100% data: full-data performance, A^2L (MA Siamese): Model Aware Siamese, A^2L (Int Model): Integrated Clustering Model, Iso Siamese: model-isolated Siamese, InferSent: cosine similarity based on InferSent encodings}. See Section 4.5 for more details. All results were obtained by averaging over 5 random splits.
Figure 7: [Best viewed in color] Comparison of our approach (A^2L) with baseline approaches on different tasks using different active learning strategies. First row: POS, second row: NER, third row: SEMTR, fourth row: CHUNK. In each row, from left to right, the three columns represent the BALD, Entropy, and Margin based AL strategies. Legend description {100% data: full-data performance, A^2L (MA Siamese): Model Aware Siamese, A^2L (Int Model): Integrated Clustering Model, Cosine: cosine similarity, None: active learning strategy without the clustering step, Random: random split (no active learning applied)}. See Section 4.4 for more details. All the results were obtained by averaging over 5 random splits.
Setup | F-score at increasing % of labeled data (left to right)
Iso Siamese | 88.58 89.00 89.14 89.74 90.18 90.20 90.22 90.50 90.48
Cosine | 88.34 88.86 89.74 89.90 90.23 90.17 90.25 90.50 90.63
InferSent | 88.15 89.00 89.95 90.05 90.12 90.35 90.37 90.60 90.54
None (BALD) | 88.58 88.50 89.23 89.51 90.00 90.05 90.12 90.40 90.44
Random (No ALS) | 86.79 87.51 88.50 89.00 89.19 89.46 89.42 89.75 90.14
A^2L (MA Siamese) | 89.50 90.10 90.70
A^2L (Int Model) | 89.00 90.20 90.50 90.45 90.50 90.75
Table 3: Interpretation of the plot in the top left corner of Fig. 7 (CoNLL 2003 (POS), BALD) in the Appendix. The values in the cells are F-scores on the test set after training on the corresponding percentage of the data. It can be seen that as the percentage of labeled data increases, A^2L (MA Siamese) consistently performs better than the other baselines.