Actively reducing redundancies for neural machine translation
Deep learning in general, and natural language processing methods in particular, rely heavily on annotated samples to achieve good performance. However, manually annotating data is expensive and time-consuming. Active Learning (AL) strategies reduce the need for huge volumes of labelled data by iteratively selecting a small number of examples for manual annotation based on their estimated utility in training the given model. In this paper, we argue that since AL strategies choose examples independently, they may potentially select similar examples, all of which may not aid the learning process. We propose a method, referred to as Active^2 Learning (A^2L), that actively adapts to the sequence tagging model being trained, to further eliminate such redundant examples chosen by an AL strategy. We empirically demonstrate that A^2L improves the performance of state-of-the-art AL strategies on different sequence tagging tasks. Furthermore, we show that A^2L is widely applicable by using it in conjunction with different AL strategies and sequence tagging models. We demonstrate that the proposed A^2L is able to reach the full-data F-score with ≈2-16% less data compared to state-of-the-art AL strategies on different sequence tagging datasets.
Many information extraction tasks utilize modules that identify and classify tokens (words) of interest in a sequence (sentence) into predefined categories. These modules play a key role in determining the performance of the parent information extraction task and are particularly interesting due to their wide applicability. The term sequence tagging is used to describe the general task that is performed by these modules. For example, consider the utility of being able to identify named entities (like people, organisations, places and so on) in news headlines for personalized content recommendation.
Recently, methods based on deep learning have significantly improved the state-of-the-art for sequence tagging. Yet, the heavy dependence of these methods on labeled data presents a formidable challenge under restricted annotation budgets. The field of Active Learning addresses this problem [9, 18]. Here, the training process alternates between improving the performance of the model on the given labeled examples and intelligently choosing new examples to be manually annotated. The goal is to train the model quickly with minimal manual annotation requirements.
More precisely, an active learning setup consists of a small number of labeled examples and a relatively large number of unlabeled examples. A model trained on labeled examples is used to label the unlabeled examples. A few examples from this newly labeled data are then selectively sampled to be fed back to the training setup after being annotated by experts. This process is repeated until it converges at a point where the cost of retraining exceeds the cost of annotations.
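The loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `train`, `select`, and `annotate` are caller-supplied placeholders standing in for the model trainer, the AL strategy, and the expert annotator.

```python
def active_learning_loop(labeled, unlabeled, train, select, annotate, rounds=5):
    """Generic AL loop: retrain, score the unlabeled pool, query experts.

    `train`, `select`, and `annotate` are illustrative callables, not
    names from the paper: train(labeled) -> model, select(model, pool)
    -> queried examples, annotate(x) -> expert label for x.
    """
    model = train(labeled)
    for _ in range(rounds):
        # Rank unlabeled examples by their estimated utility to the model.
        queries = select(model, unlabeled)
        # Experts annotate only the queried subset.
        labeled += [(x, annotate(x)) for x in queries]
        unlabeled = [x for x in unlabeled if x not in queries]
        # Retrain on the enlarged labeled set.
        model = train(labeled)
    return model, labeled
```

In practice the loop stops when the cost of retraining exceeds the benefit of further annotation, rather than after a fixed number of rounds.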
As active learning strategies acquire candidate examples to be labeled, they suffer from inherent redundancy, in the form of similar examples being selected and fed back, thus entailing a higher annotation cost and wasted resources. To overcome this bottleneck, a second step is needed to discard redundant examples, and one of our major contributions is in improving the quality of this step.
Once an active learning strategy has selected examples (that are possibly redundant), we expect the second step to have the following three qualities: (i) since examples are redundant only in the context of a model, the induced redundancy should preferably be addressed from the perspective of the sequence tagging model, (ii) the proposed step should be applicable to a wide variety of sequence tagging models and (iii) it should be compatible with several existing active learning strategies.
In this paper (code available at https://github.com/RishiHazra/Actively-reducing-redundancies-in-Active-Learning-for-Sequence-Tagging), we focus on active learning for sequence tagging models, as manually annotating sequences is especially challenging and time-consuming. We develop a Siamese twins based neural architecture that is trained to predict a similarity score for each pair of input examples in the context of a sequence tagging model. These similarity scores are then used to eliminate redundant examples via clustering.
The model needs to be partially retrained whenever a new set of labeled examples is furnished by the active learning strategy. It is essential to ensure that the model learns from new examples without forgetting what it already knows. We investigate a mixed sampling strategy that achieves this while avoiding unnecessary training overhead.
Contributions: (i) We propose a method to compute model aware similarity between input sequences to actively eliminate redundancy in the examples chosen by the active learning strategy; we call this Active^2 Learning (A^2L). (ii) We further improve the efficiency of the training process (in terms of both training time and data requirement) by incorporating a mixed sampling strategy along the lines of transfer learning. (iii) We demonstrate the utility of our approach by experimenting with various combinations of sequence tagging tasks, active learning strategies and datasets.
Early AL strategies were designed for classical models and cannot generalize to deep learning architectures, which are known to produce state-of-the-art results. Active learning strategies like margin maximization have been studied in the domain of automatic speech recognition. Recent deep learning models for sequence tagging rely upon a Conditional Random Field (CRF) as the decoder. However, in the recent past, Bayesian formulations of deep learning, like the dropout-based approach and Bayes-by-Backprop, have provided an alternate technique that is independent of the CRF scores. These methods have been shown to perform better than other active learning strategies for Named Entity Recognition (NER). While the use of diversity-based querying algorithms has been explored for quite some time, most of them are independent of the model, and almost all of them employ cosine similarity as the standard measure. In our experiments, we use cosine similarity as a baseline and elucidate its ineffectiveness in the active learning task. A few active learning algorithms [28, 12] tried to combine the informativeness measure with the representativeness measure for finding the optimal query instances using clustering. In our work, we use a dynamic similarity measure to actively select the most representative samples.
Let $x^{(i)}$ denote the $i$-th sentence in the training set and $x^{(i)}_j$ the $j$-th token of this sentence. Our goal is to assign each token to one of the classes in the set $\mathcal{Y}$. For example, $\mathcal{Y}$ will contain tags like person, place, etc. in the case of a named entity recognition task. We denote the expert annotation for the $i$-th sentence by $y^{(i)}$, where $y^{(i)}_j$ is the label assigned to token $x^{(i)}_j$.
Once a model has been trained on an initial small set of labeled examples, previously unseen and unlabeled examples are presented to it. The model may or may not be able to correctly label these examples. An Active Learning (AL) strategy chooses a subset of these new unseen examples to be labeled by an expert. The goal of an AL strategy is to select examples that, if labeled, would be most informative
for the model. Different AL strategies use different heuristics to select this subset of most informative examples. To demonstrate that our approach works across AL strategies, we tested it with three of the most popular ones: the margin-based strategy, the entropy-based strategy, and Bayesian Active Learning by Disagreement (BALD). We describe these strategies below.
Let $S(y \mid x^{(i)}; \theta)$ be the score assigned by a sequence tagging model to an output sequence $y = (y_1, \ldots, y_L)$, where $\theta$ denotes the model parameters and $L$ is the length of sequence $x^{(i)}$. In our implementation, $S$ is obtained as the output of the CRF after using the Viterbi algorithm. In the context of sequence tagging, the margin is defined as the difference in scores obtained by the best scoring sequence of tags $y^{*}$ and the second best scoring sequence $y^{**}$ for a sentence $x^{(i)}$, i.e.:

$$d^{(i)} = S(y^{*} \mid x^{(i)}; \theta) - S(y^{**} \mid x^{(i)}; \theta) \qquad (1)$$

Here, $d^{(i)}$ is the margin score calculated for sentence $x^{(i)}$. A hyperparameter threshold $d_{th}$ is empirically determined, and all sentences with $d^{(i)} < d_{th}$ are considered to be examples on which the model is less confident (the lower the difference, the higher the confusion). We also tried a variant where $d^{(i)}$ was normalized by the sentence length, but the results did not show any significant improvement. This may be because the difference in (1) neutralizes the bias due to length.
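The margin heuristic can be sketched as follows, assuming each candidate tag sequence already has a model score (in the paper these come from the CRF via Viterbi decoding; here any mapping from candidates to scores works, and the names are illustrative):

```python
def margin_score(scores):
    """Margin between the best and second-best scoring tag sequences.

    `scores` maps candidate tag sequences to model (log-)scores.
    """
    top_two = sorted(scores.values(), reverse=True)[:2]
    return top_two[0] - top_two[1]

def margin_queries(pool_scores, threshold):
    """Indices of sentences the model is least confident about.

    The smaller the gap between the top two candidates, the more
    confused the model, so sentences below the threshold are queried.
    """
    return [i for i, s in enumerate(pool_scores)
            if margin_score(s) < threshold]
```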
For a given sentence $x^{(i)}$ and tag sequence $y$, we define the normalized entropy score as:

$$E^{(i)} = -\frac{1}{L} \sum_{j=1}^{L} p^{(i)}_j \log p^{(i)}_j$$

where $p^{(i)}_j$ is the probability of assigning the most likely tag to the $j$-th token in sequence $x^{(i)}$. In our implementation, $p^{(i)}_j$ is computed by applying the softmax function to the output of the encoder of the sequence tagging model (described in Section 5.2). Application of softmax produces a probability distribution over the tags in the set $\mathcal{Y}$. Empirically, it seems important to normalize the entropy with respect to the sentence length, as the unnormalized entropy is correlated with it and annotating longer sequences is undesirable. The model is considered to be less confident on all sentences whose normalized entropy score $E^{(i)}$ is higher than a threshold $e_{th}$ (determined empirically).
Due to stochasticity, models that use dropout  produce a different output, even for the same input, each time a forward pass is executed. In Bayesian Active Learning by Disagreement (BALD), during inference, for each example, a fixed number of forward passes are executed with a set dropout rate. The variability in predicted tag sequences due to dropout is used to compute the confidence of the model on the given example.
Let $y^{(t)}$ represent the best scoring sequence of tags for $x^{(i)}$ in the $t$-th forward pass, and let $T$ be the number of forward passes; then:

$$b^{(i)} = 1 - \frac{\mathrm{count}\big(\mathrm{mode}(y^{(1)}, \ldots, y^{(T)})\big)}{T}$$

The model is considered to be less confident on all samples where $b^{(i)}$ exceeds an empirically determined threshold $b_{th}$.
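The disagreement score can be sketched as follows, assuming the best tag sequence from each dropout-enabled forward pass is collected as a tuple (the function name is illustrative):

```python
from collections import Counter

def bald_score(predictions):
    """Disagreement across stochastic forward passes.

    `predictions` holds the best tag sequence (as a hashable tuple)
    from each of the T forward passes; the score is 1 minus the
    fraction of passes that agree with the modal prediction.
    """
    T = len(predictions)
    _, mode_count = Counter(predictions).most_common(1)[0]
    return 1.0 - mode_count / T
```

If all passes agree the score is 0; the more the passes disagree, the closer the score gets to 1, flagging examples for annotation.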
To demonstrate that our approach works across different tasks, we tested it with two different sequence tagging tasks.
NER is a popular sequence tagging task concerned with identifying and classifying the named entities in unstructured text, thus augmenting its semantic content. It is expensive to obtain annotated data for NER, even though there is an abundance of free unlabeled data. Through active learning, one can efficiently select the most informative examples to be labeled, based on what the model already knows, thus sidestepping the huge annotation effort. In this paper, we use three NER datasets: CoNLL 2003, the MIT Movie Corpus and the GENIA corpus, which is a biomedical dataset.
We also experiment with the POS tagging task. In POS tagging, each word in the text has to be assigned a part-of-speech tag based on its definition and context. This task is not straightforward because some words can represent more than one part of speech in different contexts. POS tagging finds applications in sentiment analysis, question answering, and word sense disambiguation, among other tasks. We use the CoNLL 2003 dataset for our experiments on POS tagging.
Let $\mathcal{M}$ denote the sequence tagging model that is being trained and $\mathcal{S}$ denote the Siamese twins model. As discussed in Section 3.1, an AL strategy selects examples to be labeled by human experts from a pool of unlabeled examples. We hypothesize that standard AL strategies exhibit redundancy in sampling the most confused examples, i.e. there are many similar examples that can be discarded. This happens, for instance, because when the model is confused about a particular sentence structure, it will have low confidence on all examples that follow that structure. An AL strategy will pick all of these examples for manual labeling, but since most of them are similar, there will be redundancy in the annotation effort. Existing approaches (which we use as baselines) either ignore the issue of redundancy altogether or use simple metrics like cosine distance to compute similarity and eliminate redundant examples. We therefore propose to add a step to the existing active learning setup that actively samples (i.e. adapts as training proceeds) the most representative examples, in order to minimize the annotation cost and effort; we term the whole setup Active^2 Learning (A^2L).
In order to choose representatives, a measure of similarity between examples is required. We argue that this measure should depend on the current state of the model $\mathcal{M}$, as we want to improve the diversity of selected examples in the context of the model. For instance, based on the downstream task, $\mathcal{M}$ may learn to differentiate between active and passive voice, so an AL strategy must choose representative examples from both categories, even though the two forms are semantically similar when $\mathcal{M}$ is not taken into consideration.
One strategy is to use cosine similarity on the encodings generated by the model $\mathcal{M}$ to filter out similar examples [5, 25]. However, we claim that this approach is incapable of discovering higher-order similarity information, such as semantic similarity, because the model has not encountered enough samples during the initial phases of training. Thus, a similarity measure that is more expressive than cosine similarity is needed, and hence we propose to use a Siamese twins network that is trained using auxiliary data.
The Siamese network that we use in our experiments consists of two bidirectional Long Short-Term Memory (LSTM) encoders. For each sentence in the pair, one of the two encoders is randomly chosen, and for each word in the sentence, this encoder takes the corresponding output of the encoder in $\mathcal{M}$ (which is again a bidirectional LSTM in our case) as input and produces a concatenation of the last hidden state of the forward LSTM and the first hidden state of the backward LSTM as output. The similarity between the sentences in the pair is computed using an energy function of the two encodings, $e(h_a, h_b) = -\lVert h_a - h_b \rVert_1$, where $h_a$ and $h_b$ represent the outputs of the Siamese encoders for the two input sentences in the pair.
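The energy computation on the two Siamese encodings can be sketched as follows. The exact energy function is not fully specified in this version of the text; the negative L1 distance used here is a common choice for Siamese similarity (consistent with the later remark that scores are exponentiated to avoid negative values) and should be read as an assumption:

```python
def siamese_energy(h_a, h_b):
    """Negative L1 distance between the two Siamese encodings.

    ASSUMPTION: the paper's energy function is taken to be -||h_a - h_b||_1.
    Identical encodings give 0; dissimilar ones give increasingly
    negative values, which can later be exponentiated into (0, 1].
    """
    return -sum(abs(a - b) for a, b in zip(h_a, h_b))
```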
We train our Siamese network (Fig. 1) using the SICK (Sentences Involving Compositional Knowledge) dataset to predict the similarity of sentences. This dataset contains pairs of sentences and a manually annotated similarity score for each pair. While training the Siamese network, the encoder in $\mathcal{M}$ is kept frozen. Since the Siamese network operates on the output of the sequence tagging encoder, it can be incorporated alongside any model $\mathcal{M}$. As $\mathcal{M}$ is trained over time, the distribution of its encoder's output changes, and hence we periodically retrain the Siamese network (to convergence) to sustain its model awareness. For the GENIA corpus, we use the biomedical semantic similarity corpus BIOSSES instead of SICK.
Once an AL strategy selects the examples on which the model is most confused, we use our pretrained Siamese network to assign a similarity score to every pair of selected examples. This yields an $n \times n$ similarity matrix, where $n$ is the number of examples selected by the AL strategy. We group the examples into clusters by passing this similarity matrix to the spectral clustering algorithm. We then rank the examples in each cluster based on their confidence scores (obtained from the AL strategy) and retain a fixed number of the lowest scoring examples from each cluster as the most representative examples. These examples are then fed back to the training process after manual annotation.
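The per-cluster selection step can be sketched as follows. The clustering itself (e.g. scikit-learn's SpectralClustering on the exponentiated similarity matrix) is assumed to have already produced a label per example; the function name and signature are illustrative:

```python
def pick_representatives(cluster_ids, confidences, per_cluster=2):
    """Select the least-confident examples from each cluster.

    `cluster_ids[i]` is the cluster label of example i and
    `confidences[i]` its confidence score from the AL strategy;
    the lowest-scoring members of each cluster are retained.
    """
    by_cluster = {}
    for i, c in enumerate(cluster_ids):
        by_cluster.setdefault(c, []).append(i)
    reps = []
    for members in by_cluster.values():
        members.sort(key=lambda i: confidences[i])  # least confident first
        reps.extend(members[:per_cluster])
    return sorted(reps)
```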
After obtaining expert annotations for the most representative sentences, we add them to a fixed proportion of sentences, randomly sampled from the existing train data and retrain our sequence tagging model using this mixed batch of data. We call this mixed sampling. Through this process, we ensure that our model maintains a balance between the new representative examples and the existing training data. This not only helps us in averting over-fitting to new examples, but also reduces the training time.
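Mixed sampling can be sketched as below. The replay fraction and names are illustrative assumptions, not values reported in the text; the paper only states that a fixed proportion of existing training data is mixed in:

```python
import random

def mixed_batch(new_examples, existing_train, replay_fraction=0.25, seed=0):
    """Combine fresh annotations with a random sample of old train data.

    ASSUMPTION: `replay_fraction` is an illustrative hyperparameter; it
    controls how much existing data is mixed in so the model keeps a
    balance between new representatives and what it already knows.
    """
    rng = random.Random(seed)
    k = int(len(existing_train) * replay_fraction)
    replay = rng.sample(existing_train, k)
    batch = list(new_examples) + replay
    rng.shuffle(batch)  # avoid ordering bias between old and new data
    return batch
```

Retraining on this small mixed batch, rather than on everything seen so far, is what keeps the per-iteration training cost low.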
Similar ideas have been used, for instance, in transfer learning literature and one can draw parallels, for example, by considering the existing train data as source domain data and new training data as target domain data. However, to the best of our knowledge, no existing approach uses mixed sampling in the context of active learning. Hence, we believe that the application of this idea to the active learning problem (and a subsequent empirical demonstration that it works) is both novel and useful for the community.
Through our experiments, we investigate: (i) the utility of model aware similarity computation (Section 4.2) and (ii) the effect of using the mixed sampling strategy (Section 4.3). Additionally, we perform experiments on different datasets (Section 5.1), sequence tagging tasks (Section 3.2) and active learning strategies (Section 3.1) to demonstrate that the proposed approach is applicable in a wide range of contexts.
We experimented with three datasets: CoNLL 2003, MIT Movie Review Corpus and GENIA; in addition to using SICK and BIOSSES as auxiliary datasets for training the Siamese network. Refer to Table 1 for details.
|CoNLL 2003||ground truth NER and POS tags are available.|
|MIT Movie Review Corpus||ground truth NER tags are available.|
|GENIA [21, 13]||ground truth NER tags are available; from the biomedical domain.|
|SICK||sentence pairs; a ground truth similarity score is available for each pair.|
|BIOSSES||sentence pairs; a ground truth similarity score is available for each pair. From the biomedical domain.|
|A^2L (MA Siamese)||48.47||67.54|
|Random (no AL strategy)||48.57||57.50||63.10||63.67||65.30||66.15||66.80||67.80|
We perform experiments with three different AL strategies. In the case of BALD, the dropout rate and the number of forward passes were fixed to 0.5 and 51, respectively. We apply normal dropout in the character encoder, as opposed to the recurrent dropout used in the word encoder of the original model, owing to an improvement in performance. The thresholds for the margin, entropy and BALD strategies were set to 15, 40 and 0.2, respectively. For numerical stability, we use log probabilities, which is why the margin-based AL strategy's threshold lies outside the usual interval. We use the spectral clustering algorithm to cluster the sentences chosen by the AL strategy. The number of clusters was set to 20, and we chose two representative examples from each cluster.
For our experiments, we use two different architectures: a CNN-BiLSTM-CRF model (CNN for character-level encoding and BiLSTM for word-level encoding) and a BiLSTM-BiLSTM-CRF model (BiLSTM for both character-level and word-level encoding). For the CNN-BiLSTM-CRF model, the architecture is slightly modified into a lightweight variation of the original proposal, having one layer in the CNN encoder with three filters of sizes 1, 2 and 3, followed by a max pool, as opposed to three layers in the original setup. This modification was found to produce slightly better results. We use a batch size of 12. We use GloVe embeddings for all datasets except the GENIA corpus, for which we instead use BioWordVec.
The Siamese network, which has two BiLSTM encoders as described in Section 4.2, is trained using examples from the SICK dataset (except for the GENIA corpus, where we use the BIOSSES dataset). The similarity scores in both datasets were normalized to a common range. For the model aware Siamese (MA Siamese), input sentences are encoded using the encoder of the sequence tagging model, and these embeddings are then passed through the Siamese encoders, whereas for the model isolated Siamese (Iso Siamese), the original word embeddings are fed directly as input to the Siamese encoders. The output of the BiLSTM corresponding to the length of the sentence is taken as the encoding of the given sentence. We retrained the Siamese network after every 10 iterations. The baseline cosine-similarity measure is computed by directly using the last output embedding (corresponding to the length of the sentence) from the word encoder of the sequence tagging model.
We divide the sequence tagging dataset into 50 splits and begin by training the sequence tagging model on one randomly chosen split (2% of the data). This model is then used to encode sentences from the SICK dataset, which are, in turn, used as input for training the Siamese network. Next, at each iteration, we randomly pick one of the remaining splits and use an AL strategy to retrieve the examples on which the sequence tagging model is most confused. This is followed by clustering to extract the most representative examples. The similarity scores are exponentiated to avoid negative values during clustering. All results reported here were obtained by averaging over five different randomly chosen initial splits. While starting from random splits initially introduces some variance in F-score, it does not significantly affect the final results, as the model aware similarity setup adjusts accordingly to acquire the necessary examples. All hyperparameters were empirically determined based on performance on the validation sets of the respective datasets. For datasets that do not have a predefined validation set, we use a 20% split of the training file for validation, which is then merged back into the training set before the start of active learning.
To demonstrate that model aware similarity computation is useful, we perform the following control experiments.
To show the utility of a Siamese network, we also provide results for the case when cosine similarity between output encodings of the sequence tagging model is used rather than using a Siamese network (Cosine in Figure 2).
Additionally, we compare these results with a setup where no AL is applied (Random in Figure 2).
We perform a similar control experiment to investigate the role played by the mixed sampling strategy (legend title in Figure 2: None). We test variants of the model with and without this component and present a summary in Section 5.3.
Figure 2 shows a detailed view of the performance (F-score) of our approach against the percentage of training data used, for all combinations of datasets and AL strategies. Table 2 provides an interpretation of the top-left plot in Figure 2.
Through experiments we have shown that:
Our approach consistently outperforms all other methods on all NER and POS datasets.
For the BALD AL strategy (which is known to have the best performance for sequence tagging), the improvement over similarity-based methods is marginal. Nevertheless, the relative improvement over the baseline (None) remains significant.
Our approach outperforms existing approaches that use simple metrics like cosine distance, as described above.
By comparing A^2L (MA Siamese) against the None baseline (AL strategy without incremental training), it can be seen that there is no performance drop due to the mixed sampling based incremental training procedure. Moreover, thanks to the mixed sampling strategy, training time decreases, as one need not retrain on all examples seen thus far.
As shown in Table 3, using a lightweight architecture, we were able to match the performance obtained by training on the full dataset while using a smaller fraction of the data. GENIA is a medical domain dataset, and it is especially costly to label such data; we believe that even a marginal improvement here offers significant gains in terms of cost. While comparing different AL strategies is not our primary motive, we empirically demonstrate in Figure 2 that one can achieve performance comparable to a complex AL strategy like BALD using simple AL strategies like the margin and entropy based approaches, if one uses our model aware Siamese. For POS tagging, we use the BiLSTM-BiLSTM-CRF model owing to an improvement in performance. For a better understanding of the variation in F-score with the percentage of labeled data, refer to Figure 3 in the supplementary material.
|Dataset||% of train data used to reach full dataset F-Score||% less data required to reach full data F-score|
|CoNLL 2003 (NER)||25%||2%|
|CoNLL 2003 (POS)||31%||16%|
Our experiments with the MA Siamese demonstrate that performance comparable to the state of the art can be achieved using fewer labels than existing AL strategies require, even with exactly the same setup (same model, AL strategy and dataset). Thus, we can say that the discarded examples were indeed redundant. Avoiding the annotation of such samples saves time and brings down the cost of both computational resources and annotations. Moreover, these gains can be seen across AL strategies and datasets. This can be especially effective for medical domain annotations, which require high expertise. For additional experiments with a different sequence tagging architecture (BiLSTM-BiLSTM-CRF), refer to Figure 4 in the supplementary material.
In this paper, we proposed an approach that uses model aware similarity scores to mitigate redundancy in existing AL strategies for sequence tagging. We empirically demonstrated that our proposed approach performs consistently well across many datasets, AL strategies and models. Although we focused exclusively on the sequence tagging problem, we believe that our idea of using model aware similarity scores can be applied in other contexts. We leave this for future work.
Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pp. 1613–1622. Cited by: §2.
Analysis of perceptron-based active learning. J. Mach. Learn. Res. 10, pp. 281–299. Cited by: §2.
A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems 29, pp. 1019–1027. Cited by: §5.2.
On spectral clustering: analysis and an algorithm. In Advances in Neural Information Processing Systems 14, pp. 849–856. Cited by: §4.2, §5.2.
GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. Cited by: §5.2.
Large margin hidden Markov models for automatic speech recognition. In Advances in Neural Information Processing Systems 19, pp. 1249–1256. Cited by: §2.
Representative sampling for text classification using support vector machines. In Proceedings of the 25th European Conference on IR Research, ECIR'03, pp. 393–407. Cited by: §2.