Active learning (AL) is a machine learning paradigm for efficiently acquiring data for annotation from a (typically large) pool of unlabeled data Lewis1994-mj; Cohn:1996:ALS:1622737.1622744; settles2009active. Its goal is to concentrate the human labeling effort on the most informative data points, those that will benefit model performance the most, thereby reducing data annotation cost.
The most widely used approaches to acquiring data for AL are based on uncertainty and diversity, often described as the “two faces of AL” DASGUPTA20111767. While uncertainty-based methods leverage the model predictive confidence to select difficult examples for annotation Lewis:1994:SAT:188490.188495; Cohn:1996:ALS:1622737.1622744, diversity sampling exploits heterogeneity in the feature space by typically performing clustering Brinker03incorporatingdiversity; pmlr-v16-bodo11a. Still, both approaches have core limitations that may lead to acquiring redundant data points. Algorithms based on uncertainty may end up choosing uncertain yet uninformative repetitive data, while diversity-based methods may tend to select diverse yet easy examples for the model 10.5555/645530.655646. The two approaches are orthogonal to each other, since uncertainty sampling is usually based on the model’s output, while diversity exploits information from the input (i.e. feature) space. Hybrid data acquisition functions that combine uncertainty and diversity sampling have also been proposed shen-etal-2004-multi; zhu-etal-2008-active; Ducoffe2018-sq; Ash2020Deep; yuan-etal-2020-cold; ru-etal-2020-active.
In this work, we aim to leverage characteristics from hybrid data acquisition. We hypothesize that data points that are close in the model feature space (i.e. share similar or related vocabulary, or similar model encodings) but for which the model produces different predictive likelihoods should be good candidates for data acquisition. We define such examples as contrastive (see example in Figure 1). For that purpose, we propose a new acquisition function that searches for contrastive examples in the pool of unlabeled data. Specifically, our method, Contrastive Active Learning (Cal), selects unlabeled data points from the pool whose predictive likelihoods diverge the most from their neighbors in the training set. This way, Cal shares similarities with diversity sampling, but instead of performing clustering it uses the feature space to create neighborhoods. Cal also leverages uncertainty, by using predictive likelihoods to rank the unlabeled data.
We evaluate our approach on seven datasets from four tasks including sentiment analysis, topic classification, natural language inference and paraphrase detection. We compare Cal against a full suite of baseline acquisition functions that are based on uncertainty, diversity or both. We also examine robustness by evaluating on out-of-domain data, apart from in-domain held-out sets. Our contributions are the following:
We propose Cal, a new acquisition function for active learning that acquires contrastive examples from the pool of unlabeled data (§2);
We show that Cal performs consistently better or equal compared to all baselines in all tasks when evaluated on in-domain and out-of-domain settings (§4);
We conduct a thorough analysis of our method showing that Cal achieves a better trade-off between diversity and uncertainty compared to the baselines (§6).
We release our code online (https://github.com/mourga/contrastive-active-learning).
2 Contrastive Active Learning
In this section we present in detail our proposed method, Cal: Contrastive Active Learning. First, we provide a definition for contrastive examples and how they are related to finding data points that are close to the decision boundary of the model (§2.1). We next describe an active learning loop using our proposed acquisition function (§2.2).
2.1 Contrastive Examples
In the context of active learning, we aim to formulate an acquisition function that selects contrastive examples from a pool of unlabeled data for annotation. We draw inspiration from the contrastive learning framework, that leverages the similarity between data points to push those from the same class closer together and examples from different classes further apart during training 10.5555/2999792.2999959; NIPS2016_6b180037; oord2019representation; pmlr-v119-chen20j; gunel2021supervised.
In this work, we define as contrastive examples two data points if their model encodings are similar, but their model predictions are very different (maximally disagreeing predictive likelihoods).
Formally, data points and should first satisfy a similarity criterion:
where is an encoder that maps in a shared feature space, is a distance metric and is a small distance value.
A second criterion, based on model uncertainty, is to evaluate that the predictive probability distributions of the model and for the inputs and should maximally diverge:
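Since the symbols of the two criteria are not rendered in this text, the following is a sketch of how they can be written. The notation is an assumption made here for illustration: $x_i$ and $x_j$ for the two data points, $\Phi$ for the encoder, $d$ for the distance metric, $\epsilon$ for the small distance value, and $p(y \mid x)$ for the model's predictive distribution. Using the KL divergence in the second criterion follows the scoring step described in §2.2; the direction of the divergence is also an assumption.

```latex
\begin{align}
d\big(\Phi(x_i), \Phi(x_j)\big) &< \epsilon \tag{1} \\
\mathrm{KL}\big(p(y \mid x_i) \,\|\, p(y \mid x_j)\big) &\to \max \tag{2}
\end{align}
```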
For example, in a binary classification problem, given a reference example with output probability distribution () and similar candidate examples with () and with (), we would consider as contrastive examples the pair . However, if another example (similar to in the model feature space) had a probability distribution (), then the most contrastive pair would be (, ). [Footnote 3: A predictive distribution () here denotes that the model is confident that belongs to the first class and to the second.]
Figure 1 provides an illustration of contrastive examples for a binary classification case. All data points inside the circle (dotted line) are similar in the model feature space, satisfying Eq. 1. Intuitively, if the divergence of the output probabilities of the model for the gray and blue shaded data points is high, then Eq. 2 should also hold and we should consider them as contrastive.
From a different perspective, data points with similar model encodings (Eq. 1) and dissimilar model outputs (Eq. 2) should be close to the model’s decision boundary (Figure 1). Hence, we hypothesize that our proposed approach to select contrastive examples is related to acquiring difficult examples near the decision boundary of the model. Under this formulation, Cal does not guarantee that the contrastive examples lie near the model’s decision boundary, because our definition is not strict. In order to ensure that a pair of contrastive examples lie on the boundary, the second criterion should require that the model classifies the two examples in different classes (i.e. different predictions). However, calculating the distance between an example and the model decision boundary is intractable and approximations that use adversarial examples are computationally expensive Ducoffe2018-sq.
2.2 Active Learning Loop
Assuming a multi-class classification problem with classes, labeled data for training and a pool of unlabeled data , we perform AL for iterations. At each iteration, we train a model on and then use our proposed acquisition function, Cal (Algorithm 1), to acquire a batch consisting of examples from . The acquired examples are then labeled, removed from the pool and added to the labeled dataset , which will serve as the training set for training a model in the next AL iteration. [Footnote 4: We simulate AL, so we already have the labels of the examples of (but still treat it as an unlabeled dataset).] In our experiments, we use a pretrained Bert model Devlin2019-ou, which we fine-tune at each AL iteration using the current . We begin the AL loop by training a model using an initial labeled dataset. [Footnote 5: We acquire the first examples that form the initial training set by applying random stratified sampling (i.e. keeping the initial label distribution).]
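The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: `train_fn` and `acquire_fn` are hypothetical callables standing in for model fine-tuning and for any acquisition function, and the pool labels are assumed to be available because AL is simulated.

```python
import numpy as np

def active_learning_loop(pool_x, pool_y, init_idx, train_fn, acquire_fn,
                         iterations, batch_size):
    """Simulated pool-based AL loop (sketch).

    train_fn(x, y) -> model   (hypothetical: trains on the labeled data)
    acquire_fn(model, labeled_idx, pool_idx, b) -> b indices taken from pool_idx
    """
    labeled = list(init_idx)
    init_set = set(labeled)
    pool = [i for i in range(len(pool_x)) if i not in init_set]
    model = train_fn(pool_x[labeled], pool_y[labeled])
    for _ in range(iterations):
        if not pool:
            break
        b = min(batch_size, len(pool))
        chosen = list(acquire_fn(model, labeled, pool, b))
        chosen_set = set(chosen)
        labeled.extend(chosen)                            # "annotate" and add to the training set
        pool = [i for i in pool if i not in chosen_set]   # remove acquired examples from the pool
        model = train_fn(pool_x[labeled], pool_y[labeled])  # retrain on the enlarged training set
    return model, labeled
```

Any acquisition strategy that maps the current model and the two index lists to a batch of pool indices can be plugged in as `acquire_fn`.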
Find Nearest Neighbors for Unlabeled Candidates
The first step of our contrastive acquisition function (cf. line 2) is to find examples that are similar in the model feature space (Eq. 1). Specifically, we use the [CLS] token embedding of Bert as our encoder to represent all data points in and . We use a K-Nearest-Neighbors (KNN) implementation using the labeled data, in order to query similar examples for each candidate . Our distance metric is Euclidean distance. To find the most similar data points in for each , we select the top instead of selecting a predefined threshold (Eq. 1). This way, we create a neighborhood that consists of the unlabeled data point and its closest examples in (Figure 1). [Footnote 6: We leave further modifications of our scoring function as future work. One approach would be to add the average distance from the neighbors (cf. line 6) in order to alleviate the possible problem of selecting outliers.]
Compute Contrastive Score between Unlabeled Candidates and Neighbors
In the second step, we compute the divergence in the model predictive probabilities for the members of the neighborhood (Eq. 2). Using the current trained model to obtain the output probabilities for all data points in (cf. lines 3-4), we then compute the Kullback–Leibler divergence (KL) between the output probabilities of and all (cf. line 5). To obtain a score for a candidate , we take the average of all divergence scores (cf. line 6).
Rank Unlabeled Candidates and Select Batch
We apply these steps to all candidate examples and obtain a score for each. With our scoring function we define as contrastive examples the unlabeled data that have the highest score . A high score indicates that the unlabeled data point has a high divergence in model predicted probabilities compared to its neighbors in the training set (Eq. 1, 2), suggesting that it may lie near the model’s decision boundary. To this end, our acquisition function selects the top examples from the pool that have the highest score (cf. line 8), that form the acquired batch .
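The three steps above (neighbor search, divergence computation, ranking) can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' Algorithm 1: the function and argument names are hypothetical, the neighbor search is brute force, and the direction of the KL divergence (neighbor distribution first) is an assumption, since the text does not preserve it.

```python
import numpy as np

def _sq_dists(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    return (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T

def cal_scores(pool_emb, pool_probs, train_emb, train_probs, k, eps=1e-12):
    """Contrastive score of every pool example (sketch).

    pool_emb / train_emb:     (n_pool, d) / (n_train, d) encoder representations
    pool_probs / train_probs: (n_pool, C) / (n_train, C) predictive probabilities
    k:                        number of labeled neighbors per candidate
    """
    k = min(k, len(train_emb))
    d2 = _sq_dists(pool_emb, train_emb)
    # Indices of the k nearest labeled examples for each pool example.
    nn_idx = np.argpartition(d2, k - 1, axis=1)[:, :k]
    p_nb = np.clip(train_probs[nn_idx], eps, 1.0)        # (n_pool, k, C)
    p_x = np.clip(pool_probs, eps, 1.0)[:, None, :]      # (n_pool, 1, C)
    # KL(neighbor || candidate) for each neighbor; the direction is an assumption.
    kl = (p_nb * (np.log(p_nb) - np.log(p_x))).sum(axis=-1)
    return kl.mean(axis=1)                               # average over the neighborhood

def cal_acquire(pool_emb, pool_probs, train_emb, train_probs, b, k):
    """Return the indices of the b pool examples with the highest score."""
    scores = cal_scores(pool_emb, pool_probs, train_emb, train_probs, k)
    return np.argsort(-scores)[:b]
```

For large pools, the brute-force distance matrix would normally be replaced by a dedicated nearest-neighbor index; the scoring logic stays the same.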
3 Experimental Setup
3.1 Tasks & Datasets
We conduct experiments on sentiment analysis, topic classification, natural language inference and paraphrase detection tasks. We provide details for the datasets in Table 1. We follow yuan-etal-2020-cold and use imdb maas-etal-2011-learning, sst-2 socher-etal-2013-recursive, pubmed dernoncourt-lee-2017-pubmed and agnews from NIPS2015_250cf8b5, from which we also acquired dbpedia. We experiment with tasks requiring pairs of input sequences, using qqp and qnli from glue wang2018glue. To evaluate robustness on out-of-distribution (OOD) data, we follow hendrycks-etal-2020-pretrained and use sst-2 as OOD dataset for imdb and vice versa. We finally use twitterppdb lan-etal-2017-continuously as OOD data for qqp as in Desai2020-ys.
|imdb||Sentiment Analysis||Movie Reviews||sst-2||K||K||K|
|sst-2||Sentiment Analysis||Movie Reviews||imdb||6K||K|
|qnli||Natural Language Inference||Wikipedia||-||K||K||K|
|qqp||Paraphrase Detection||Social QA Questions||twitterppdb||K||K||K|
We compare Cal against five baseline acquisition functions. The first method, Entropy, is the most commonly used uncertainty-based baseline that acquires data points for which the model has the highest predictive entropy. As a diversity-based baseline, following yuan-etal-2020-cold, we use BertKM, which applies k-means clustering using the normalized Bert output embeddings of the fine-tuned model to select data points. We compare against Badge Ash2020Deep, an acquisition function that aims to combine diversity and uncertainty sampling, by computing gradient embeddings for every candidate data point in and then using clustering to select a batch. Each is computed as the gradient of the cross-entropy loss with respect to the parameters of the model’s last layer, aiming to be the component that incorporates uncertainty in the acquisition function. [Footnote 7: We note that BertKM and Badge are computationally heavy approaches that require clustering of vectors with high dimensionality, while their complexity grows exponentially with the acquisition size. We thus do not apply them to the datasets that have a large. More details can be found in the Appendix A.2.] We also evaluate a recently introduced cold-start acquisition function called Alps yuan-etal-2020-cold that uses the masked language model (MLM) loss of Bert as a proxy for model uncertainty in the downstream classification task. Specifically, aiming to leverage both uncertainty and diversity, Alps forms a surprisal embedding for each , by passing the unmasked input through the Bert MLM head to compute the cross-entropy loss for a random 15% subsample of tokens against the target labels. Alps clusters these embeddings to sample sentences for each AL iteration. Lastly, we include Random, which samples data from the pool from a uniform distribution.
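The two simplest baselines, Entropy and Random, can be sketched as follows. The function names are hypothetical and the inputs are assumed to be class probabilities for the pool, one row per example.

```python
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    """Entropy of each row of an (n, C) array of class probabilities."""
    p = np.clip(probs, eps, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def entropy_acquire(probs, b):
    """Indices of the b pool examples with the highest predictive entropy."""
    return np.argsort(-predictive_entropy(probs))[:b]

def random_acquire(n_pool, b, seed=0):
    """Indices of b pool examples sampled uniformly without replacement."""
    rng = np.random.default_rng(seed)
    return rng.choice(n_pool, size=b, replace=False)
```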
3.3 Implementation Details
We use BERT-base Devlin2019-ou adding a task-specific classification layer using the implementation from the HuggingFace library wolf-etal-2020-transformers. We evaluate the model 5 times per epoch on the development set following Dodge2020FineTuningPL and keep the one with the lowest validation loss. We use the standard splits provided for all datasets, if available, otherwise we randomly sample a validation set from the training set. We test all models on a held-out test set. We repeat all experiments with five different random seeds, resulting in different initializations of the parameters of the model’s extra task-specific output feedforward layer and the initial . For all datasets we use as budget the of , initial training set and acquisition size . Each experiment is run on a single Nvidia Tesla V100 GPU. More details are provided in the Appendix A.1.
4.1 In-domain Performance
We present results for in-domain test accuracy across all datasets and acquisition functions in Figure 2. We observe that Cal is consistently the top performing method especially in dbpedia, pubmed and agnews datasets.
Cal performs slightly better than Entropy in imdb, qnli and qqp, while in sst-2 most methods yield similar results. Entropy is the second best acquisition function overall, consistently performing better than diversity-based or hybrid baselines. This corroborates recent findings from Desai2020-ys that Bert is sufficiently calibrated (i.e. produces good uncertainty estimates), making it a tough baseline to beat in AL.
BertKM is a competitive baseline (e.g. sst-2, qnli) but always underperforms compared to Cal and Entropy, suggesting that uncertainty is the most important signal in the data selection process. An interesting future direction would be to investigate in depth whether and which (i.e. which layer) representations of the current pretrained language models work best with similarity search algorithms and clustering.
Similarly, we can see that Badge, despite using both uncertainty and diversity, also achieves low performance, indicating that clustering the constructed gradient embeddings does not benefit data acquisition. Finally, we observe that Alps generally underperforms and is close to Random. We can conclude that this heterogeneous approach to uncertainty, i.e. using the pretrained language model as proxy for the downstream task, is beneficial only in the first few iterations, as shown in yuan-etal-2020-cold.
Surprisingly, we observe that for the sst-2 dataset Alps performs similarly to the highest performing acquisition functions, Cal and Entropy. We hypothesize that due to the informal textual style of the reviews of sst-2 (noisy social media data), the pretrained Bert model can be used as a signal to query linguistically hard examples that benefit the downstream sentiment analysis task. This is an interesting finding and a future research direction would be to investigate the correlation between the difficulty of an example in a downstream task and its perplexity (loss) under the pretrained language model.
4.2 Out-of-domain Performance
We also evaluate the out-of-domain (OOD) robustness of the models trained with the actively acquired datasets of the last iteration (i.e. of or 100% of the AL budget) using different acquisition strategies. We present the OOD results for sst-2, imdb and qqp in Table 2. When we test the models trained with sst-2 on imdb (first column) we observe that Cal achieves the highest performance compared to the other methods by a large margin, indicating that acquiring contrastive examples can improve OOD generalization. In the opposite scenario (second column), we find that the highest accuracy is obtained with Entropy. However, similarly to the ID results for sst-2 (Figure 2), all models trained on different subsets of the imdb dataset result in comparable performance when tested on the small sst-2 test set (the mean accuracies lie inside the standard deviations across models). We hypothesize that this is because sst-2 is not a challenging OOD dataset for the different imdb models. This is also evident from the high OOD accuracy, 85% on average, which is close to the 91% sst-2 ID accuracy of the full model (i.e. trained on 100% of the ID data). Finally, we observe that Cal obtains the highest OOD accuracy for qqp compared to Random, Entropy and Alps. Overall, our empirical results show that the models trained on the actively acquired dataset with Cal obtain consistently similar or better performance than all other approaches when tested on OOD data.
5 Ablation Study
We conduct an extensive ablation study in order to provide insights for the behavior of every component of Cal. We present all AL experiments on the agnews dataset in Figure 3.
We first aim to evaluate our hypothesis that Cal acquires difficult examples that lie close to the model’s decision boundary. Specifically, to validate that the ranking of the constructed neighborhoods is meaningful, we run an experiment where we acquire candidate examples that have the minimum divergence from their neighbors, the opposite of Cal (i.e. we replace with in line 8 of Algorithm 1). We observe (Fig. 3 - CAL opposite) that even after acquiring 15% of unlabeled data, the performance remains unchanged compared to the initial model (of the first iteration), or even degrades. In effect, this finding indicates that Cal does select informative data points.
Next, we experiment with changing the way we construct the neighborhoods, aiming to improve computational efficiency. We thus modify our algorithm to create a neighborhood for each labeled example (instead of unlabeled). This way we compute a divergence score only for the neighbors of the training data points. However, we find this approach to slightly underperform (Fig. 3 - CAL per labeled example), possibly because only a small fraction of the pool is considered and thus the uncertainty of all the unlabeled data points is not taken into account. [Footnote 8: In this experiment, we essentially change the for-loop of Algorithm 1 (cf. line 1-7) to iterate for each in (instead of each in ) and similarly find the nearest neighbors of each labeled example in the pool (KNN). As for the scoring (cf. line 6), if an unlabeled example was not picked (i.e. was not a neighbor to a labeled example), its score is zero. If it was picked multiple times we average its scores. We finally acquire the top unlabeled data with the highest scores. This formulation is more computationally efficient since usually .]
We also experiment with several approaches for constructing our scoring function (cf. line 6 in Algorithm 1). Instead of computing the KL divergence between the predicted probabilities of each candidate example and its labeled neighbors (cf. line 5), we used cross entropy between the output probability distribution and the gold labels of the labeled data. The intuition is to evaluate whether information of the actual label is more useful than the model’s predictive probability distribution. We observe this scoring function to result in a slight drop in performance (Fig. 3 - Cross Entropy). We also experimented with various pooling operations to aggregate the KL divergence scores for each candidate data point. We found maximum and median (Fig. 3 - Max/Median) to perform similarly with the average (Fig. 3 - Cal), which is the pooling operation we decided to keep in our proposed algorithm.
Since our approach is related to acquiring data near the model’s decision boundary, this effectively translates into using the [CLS] output embedding of Bert. Still, we opted to cover several possible alternatives to the representations, i.e. feature space, that can be used to find the neighbors with KNN. We divide our exploration into two categories: intrinsic representations from the current fine-tuned model and extrinsic using different methods. For the first category, we examine representing each example with the mean embedding layer of Bert (Fig. 3 - Mean embedding) or the mean output embedding (Fig. 3 - Mean output). We find both alternatives to perform worse than using the [CLS] token (Fig. 3 - Cal). The motivation for the second category is to evaluate whether acquiring contrastive examples in the input feature space, i.e. representing the raw text, is meaningful gardner-etal-2020-evaluating. We thus examine contextual representations from a pretrained Bert language model (Fig. 3 - BERT-pr [CLS]) (not fine-tuned in the task or domain) and non-contextualized tf-idf vectors (Fig. 3 - TF-IDF). We find both approaches, along with Mean embedding, to largely underperform compared to our approach that acquires ambiguous data near the model decision boundary. [Footnote 9: This can be interpreted as comparing the effectiveness of selecting data near the model decision boundary vs. the task decision boundary, i.e. data that are similar for the task itself or for the humans (in terms of having the same raw input/vocabulary), but are from different classes.]
Finally, we further investigate Cal and all acquisition functions considered (baselines), in terms of diversity, representativeness and uncertainty. Our aim is to provide insights on what data each method tends to select and what is the uncertainty-diversity trade-off of each approach. Table 3 shows the results of our analysis averaged across datasets. We denote with the labeled set, the unlabeled pool and an acquired batch of data points from . [Footnote 10: In the previous sections we used and to denote the labeled and unlabeled sets and we change the notation here to and , respectively, for simplicity.]
6.1 Diversity & Uncertainty Metrics
Diversity in input space (Div.-I)
We first evaluate the diversity of the actively acquired data in the input feature space, i.e. raw text, by measuring the overlap between tokens in the sampled sentences and tokens from the rest of the data pool . Following yuan-etal-2020-cold, we compute Div.-I as the Jaccard similarity between the set of tokens from the sampled sentences , , and the set of tokens from the unsampled sentences , , . A high Div.-I value indicates high diversity because the sampled and unsampled sentences have many tokens in common.
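As a small illustration of Div.-I, the Jaccard similarity between the two token sets can be computed as below. Whitespace tokenization via `str.split` and the function name are assumptions made for this sketch; the tokenizer actually used is not stated here.

```python
def div_i(sampled_sentences, unsampled_sentences, tokenize=str.split):
    """Jaccard similarity between the token sets of sampled and unsampled sentences."""
    v_sampled = {tok for s in sampled_sentences for tok in tokenize(s)}
    v_unsampled = {tok for s in unsampled_sentences for tok in tokenize(s)}
    union = v_sampled | v_unsampled
    if not union:
        return 0.0
    return len(v_sampled & v_unsampled) / len(union)
```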
Diversity in feature space (Div.-F)
We next evaluate diversity in the (model) feature space, using the [CLS] representations of a trained Bert model. Following Zhdanov2019-mg and Ein-Dor2020-mm, we compute Div.-F of a set as , where denotes the [CLS] output token of example obtained by the model which was trained using , and denotes the Euclidean distance between and in the feature space. [Footnote 11: To enable an appropriate comparison, this analysis is performed after the initial Bert model is trained with the initial training set and each AL strategy has selected examples equal to of the pool (first iteration). Correspondingly, all strategies select examples from the same unlabeled set while using outputs from the same Bert model.]
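The exact Div.-F formula is not rendered in this text. As a hedged sketch, the function below assumes the form used in the cited prior work as we understand it: the inverse of the average, over pool examples, of the Euclidean distance to the nearest selected example. Both the formula and the names `pool_emb` and `batch_emb` should be treated as assumptions.

```python
import numpy as np

def div_f(pool_emb, batch_emb):
    """Feature-space diversity of a selected batch (assumed formula, see lead-in).

    pool_emb:  (n_pool, d) representations of the unlabeled pool
    batch_emb: (n_batch, d) representations of the selected examples
    """
    d2 = ((pool_emb ** 2).sum(1)[:, None]
          + (batch_emb ** 2).sum(1)[None, :]
          - 2.0 * pool_emb @ batch_emb.T)
    dists = np.sqrt(np.maximum(d2, 0.0))      # guard against tiny negative values
    mean_min = dists.min(axis=1).mean()       # average distance to the nearest selected example
    return float("inf") if mean_min == 0.0 else 1.0 / mean_min
```

Under this form, a batch that covers the pool well (every pool example has a selected example nearby) receives a high score.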
To measure uncertainty, we use the model trained on the entire training dataset (Figure 2 - Full supervision). As in yuan-etal-2020-cold, we use the logits from the fully trained model to estimate the uncertainty of an example, as it is a reliable estimate due to its high performance after training on many examples, while it offers a fair comparison across all acquisition strategies. First, we compute predictive entropy of an input when evaluated by model and then we take the average over all sentences in a sampled batch . We use the average predictive entropy to estimate uncertainty of the acquired batch for each method . As a sampled batch we use the full actively acquired dataset after completing our AL iterations (with of the data).
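The uncertainty metric can be sketched as the mean predictive entropy computed from the logits of the fully trained model. The function name is hypothetical, and the logits are assumed to be given as an array with one row per example in the acquired batch.

```python
import numpy as np

def batch_uncertainty(logits):
    """Average predictive entropy of a batch, computed from (n, C) logits."""
    z = logits - logits.max(axis=1, keepdims=True)             # stabilize the softmax
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))   # log-softmax
    p = np.exp(log_p)
    return float(-(p * log_p).sum(axis=1).mean())
```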
We finally analyze the representativeness of the acquired data as in Ein-Dor2020-mm. We aim to study whether AL strategies tend to select outlier examples that do not properly represent the overall data distribution. We rely on the KNN-density measure proposed by zhu-etal-2008-active, where the density of an example is quantified by one over the average distance between the example and its K most similar examples (i.e., K nearest neighbors) within , based on the [CLS] representations as in Div.-F. An example with high density degree is less likely to be an outlier. We define the representativeness of a batch as one over the average KNN-density of its instances using the Euclidean distance with K=.
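The KNN-density and the batch-level score described above can be sketched as follows. The names are hypothetical, `k` is left as a required argument because its value is not preserved in this text, and the sketch assumes the batch examples are not themselves rows of `pool_emb` (otherwise each example would count itself as a neighbor at distance zero).

```python
import numpy as np

def knn_density(emb, pool_emb, k):
    """One over the mean Euclidean distance from each row of emb to its k nearest pool rows."""
    d2 = ((emb ** 2).sum(1)[:, None]
          + (pool_emb ** 2).sum(1)[None, :]
          - 2.0 * emb @ pool_emb.T)
    dists = np.sqrt(np.maximum(d2, 0.0))
    k = min(k, dists.shape[1])
    nearest = np.sort(dists, axis=1)[:, :k]
    return 1.0 / np.maximum(nearest.mean(axis=1), 1e-12)

def representativeness(batch_emb, pool_emb, k):
    """Follows the wording of the text: one over the average KNN-density of the batch."""
    return float(1.0 / knn_density(batch_emb, pool_emb, k).mean())
```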
We first observe in Table 3 that Alps acquires the most diverse data across all approaches. This is intuitive since Alps is the most linguistically-informed method as it essentially acquires data that are difficult for the language modeling task, thus favoring data with a more diverse vocabulary. All other methods acquire similarly diverse data, except Badge, which has the lowest score. Interestingly, we observe a different pattern when evaluating diversity in the model feature space (using the [CLS] representations). BertKM has the highest Div.-F score, as expected, while Cal and Entropy have the lowest. This supports our hypothesis that uncertainty sampling tends to acquire uncertain but similar examples, while Cal by definition constrains its search to similar examples in the feature space that lie close to the decision boundary (contrastive examples). As for uncertainty, we observe that Entropy and Cal acquire the most uncertain examples, with average entropy almost twice as high as all other methods. Finally, regarding representativeness of the acquired batches, we see that Cal obtains the highest score, followed by Entropy, with the remaining AL strategies acquiring less representative data.
Overall, our analysis validates assumptions on the properties of data expected to be selected by the various acquisition functions. Our findings show that diversity in the raw text does not necessarily correlate with diversity in the feature space. In other words, low Div.-F does not translate to low diversity in the distribution of acquired tokens (Div.-I), suggesting that Cal can acquire similar examples in the feature space that have sufficiently diverse inputs. Furthermore, combining the results of our AL experiments (Figure 2) and our analysis (Table 3) we conclude that the best performance of Cal, followed by Entropy, is due to acquiring uncertain data. We observe that the most notable difference, in terms of selected data, between the two approaches and the rest is uncertainty (Unc.), suggesting perhaps the superiority of uncertainty over diversity sampling. We show that Cal improves over Entropy because our algorithm “guides” the focus of uncertainty sampling by not considering redundant uncertain data that lie away from the decision boundary and thus improving representativeness. We finally find that Random is evidently the worst approach, as it selects the least diverse and uncertain data on average compared to all methods.
7 Related Work
Uncertainty-based acquisition for AL focuses on selecting data points that the model predicts with low confidence. A simple uncertainty-based acquisition function is least confidence Lewis:1994:SAT:188490.188495, which sorts data from the pool in descending order by the probability of not predicting the most confident class. Another approach is to select samples that maximize the predictive entropy. Houlsby2011-qz propose Bayesian Active Learning by Disagreement (BALD), a method that chooses data points that maximize the mutual information between predictions and the model’s posterior probabilities. Gal2017-gh applied BALD for deep neural models using Monte Carlo dropout Gal2016-lf to acquire multiple uncertainty estimates for each candidate example. Least confidence, entropy and BALD acquisition functions have been applied in a variety of text classification and sequence labeling tasks and have been shown to substantially improve data efficiency Shen2017-km; Siddhant2018-lg; Lowell2019-mf; Kirsch2019-lk; shelmanov-etal-2021-active; DBLP:journals/corr/abs-2104-08320.
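For reference, least confidence and a Monte Carlo estimate of the BALD score can be sketched as below. The function names are hypothetical. The BALD estimate uses the standard decomposition of mutual information (entropy of the mean prediction minus the mean entropy of the individual predictions) and assumes `mc_probs` holds class probabilities from several stochastic forward passes, e.g. with dropout left active.

```python
import numpy as np

def least_confidence(probs):
    """1 minus the probability of the most likely class, for (n, C) probabilities."""
    return 1.0 - probs.max(axis=1)

def bald(mc_probs, eps=1e-12):
    """Mutual information estimate from (T, n, C) probabilities of T stochastic passes."""
    mean_p = mc_probs.mean(axis=0)                                   # (n, C)
    entropy_of_mean = -(mean_p * np.log(np.clip(mean_p, eps, 1.0))).sum(axis=-1)
    per_pass_entropy = -(mc_probs * np.log(np.clip(mc_probs, eps, 1.0))).sum(axis=-1)
    return entropy_of_mean - per_pass_entropy.mean(axis=0)           # (n,)
```

Examples on which the stochastic passes disagree receive a high BALD score, while examples that are consistently predicted, even with low confidence, score close to zero.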
On the other hand, diversity or representative sampling is based on selecting batches of unlabeled examples that are representative of the unlabeled pool, based on the intuition that a representative set of examples, once labeled, can act as a surrogate for the full data available. In the context of deep learning, DBLP:journals/corr/abs-1711-00941 and conf/iclr/SenerS18 select representative examples based on core-set construction, a fundamental problem in computational geometry. Inspired by generative adversarial learning, DBLP:journals/corr/abs-1907-06347 define AL as a binary classification task with an adversarial classifier trained to not be able to discriminate data from the training set and the pool. Other approaches based on adversarial active learning use out-of-the-box models to perform adversarial attacks on the training data, in order to approximate the distance from the decision boundary of the model Ducoffe2018-sq; ru-etal-2020-active.
There are several existing approaches that combine representative and uncertainty sampling. Such approaches include active learning algorithms that use meta-learning 10.5555/1005332.1005342; conf/aaai/HsuL15; fang-etal-2017-learning; liu-etal-2018-learning-actively, aiming to learn a policy for switching between a diversity-based or an uncertainty-based criterion at each iteration. Recently, Ash2020Deep propose Batch Active learning by Diverse Gradient Embeddings (Badge) and yuan-etal-2020-cold propose Active Learning by Processing Surprisal (Alps), a cold-start acquisition function specific for pretrained language models. Both methods construct representations for the unlabeled data based on uncertainty, and then use them for clustering, hence combining both uncertainty and diversity sampling. The effectiveness of AL in a variety of NLP tasks with pretrained language models, e.g. Bert Devlin2019-ou, has recently been empirically evaluated by Ein-Dor2020-mm, showing substantial improvements over random sampling.
8 Conclusion & Future Work
We present Cal, a novel acquisition function for AL that acquires contrastive examples; data points which are similar in the model feature space and yet the model outputs maximally different class probabilities. Our approach uses information from the feature space to create neighborhoods for each unlabeled example, and predictive likelihood for ranking the candidate examples. Empirical experiments on various in-domain and out-of-domain scenarios demonstrate that Cal performs better than other acquisition functions in the majority of cases. After analyzing the actively acquired datasets obtained with all methods considered, we conclude that entropy is the hardest baseline to beat, but our approach improves it by guiding uncertainty sampling in regions near the decision boundary with more informative data.
Still, our empirical results and analysis show that there is no single acquisition function to outperform all others consistently by a large margin. This demonstrates that there is still room for improvement in the AL field.
Furthermore, recent findings show that in specific tasks, such as Visual Question Answering (VQA), complex acquisition functions might not outperform random sampling because they tend to select collective outliers that hurt model performance karamcheti-etal-2021-mind. We believe that taking a step back and analyzing the behavior of standard acquisition functions, e.g. with Dataset Maps swayamdipta-etal-2020-dataset, might be beneficial, especially if similar behavior appears in other NLP tasks too.
Another interesting future direction for Cal, related to interpretability, would be to evaluate whether acquiring contrastive examples for the task Kaushik2020Learning; gardner-etal-2020-evaluating is more beneficial than contrastive examples for the model, as we do in Cal.
KM and NA are supported by Amazon through the Alexa Fellowship scheme.
Appendix A Appendix
A.1 Data & Hyperparameters
In this section we provide details of all the datasets used in this work and the hyperparameters used for training the model. For qnli, imdb and sst- we randomly sample 10% of the training set to serve as the validation set, while for agnews and qqp we sample 5%. For the dbpedia dataset we undersample both the training and validation sets (from the standard splits) to facilitate our AL simulation, since the original dataset consists of 560K training and 28K validation examples. For all datasets we use the standard test set, except for sst-, qnli and qqp, which are taken from the glue benchmark wang2018glue; for these we use the development set as the held-out test set and subsample a development set from the training set.
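The random validation split described above can be illustrated with a minimal sketch. The function name `split_train_validation`, the toy data and the seed are hypothetical and not taken from the released code; only the 10% and 5% fractions come from the text.

```python
import random

def split_train_validation(examples, validation_fraction, seed):
    """Randomly hold out a fraction of the training set as validation data."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    n_val = int(round(len(examples) * validation_fraction))
    val_idx = set(indices[:n_val])
    train = [ex for i, ex in enumerate(examples) if i not in val_idx]
    val = [ex for i, ex in enumerate(examples) if i in val_idx]
    return train, val

# Toy stand-in for a training set (hypothetical data).
toy = [f"example-{i}" for i in range(200)]
train_10, val_10 = split_train_validation(toy, 0.10, seed=0)  # 10% held out
train_5, val_5 = split_train_validation(toy, 0.05, seed=0)    # 5% held out
```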
For all datasets we train BERT-base Devlin2019-ou from the HuggingFace library wolf-etal-2020-transformers in PyTorch NEURIPS2019_9015. We train all models with batch size , learning rate , no weight decay, AdamW optimizer with epsilon . For all datasets we use maximum sequence length of , except for imdb, which contains longer input texts, where we use . To ensure reproducibility and a fair comparison between the various methods under evaluation, we run all experiments with the same five seeds, which we randomly selected from the range . We evaluate the model 5 times per epoch on the development set following Dodge2020FineTuningPL and keep the checkpoint with the lowest validation loss. We use the code provided by yuan-etal-2020-cold for Alps, Badge and BertKM.
In this section we compare the efficiency of the acquisition functions considered in our experiments. We denote the number of labeled data in , the number of unlabeled data in , the number of classes in the downstream classification task, the dimension of embeddings, the fixed number of iterations for k-MEANS, the maximum sequence length and the acquisition size. In our experiments, following yuan-etal-2020-cold, , , , and (except for imdb where ). Alps requires considering that the surprisal embeddings are computed. BertKM and Badge, the most computationally heavy approaches, require and respectively, given that gradient embeddings are computed for Badge (this information is taken from Section 6 of yuan-etal-2020-cold). On the other hand, Entropy only requires forward passes through the model, in order to obtain the logits for all the data in . In contrast, our approach, Cal, first requires forward passes, in order to acquire the logits and the CLS representations of the data (in and ) and then one iteration over all data in to obtain the scores.
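To make the cost of the Entropy baseline concrete, the following is a minimal sketch of entropy-based selection given logits that are assumed to be already computed by one forward pass per unlabeled example. The function names and the toy logits are hypothetical; this is an illustration of the general technique, not the implementation used in the experiments.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_acquisition(pool_logits, acquisition_size):
    """Return indices of the most uncertain pool examples, highest entropy first."""
    scores = [entropy(softmax(logits)) for logits in pool_logits]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:acquisition_size]

# Toy logits for four unlabeled examples in a binary task (hypothetical values).
pool_logits = [[4.0, 0.0], [0.1, 0.0], [2.0, 0.0], [0.0, 0.0]]
selected = entropy_acquisition(pool_logits, 2)
```

Once the logits are available, the only remaining work is a single pass over the scores plus a sort, which is consistent with the observation that inference dominates the cost of Entropy.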
We present the runtimes in detail for all datasets and acquisition functions in Tables 4 and 5. First, we define the total acquisition time as the sum of two components: inference time and selection time. Inference time is the time required to pass all data through the model to acquire predictions, probability distributions or model encodings (representations). This is explicitly required for the uncertainty-based methods, like Entropy, and for our method Cal. The remaining time is considered selection time and is essentially the time for all computations necessary to rank and select the most important examples from .
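The decomposition of acquisition time into inference and selection can be measured along the following lines. This is a hedged sketch: `toy_inference` and `toy_selection` are hypothetical placeholders for a model forward pass and a ranking step, and `timed` is an illustrative helper rather than part of any library.

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def toy_inference(pool):
    # Placeholder for passing every pool example through the model.
    return [x * 0.5 for x in pool]

def toy_selection(scores, b):
    # Placeholder for ranking the scores and keeping the top b examples.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:b]

pool = list(range(1000))
scores, inference_time = timed(toy_inference, pool)
chosen, selection_time = timed(toy_selection, scores, 10)

# Total acquisition time is the sum of the two components.
total_acquisition_time = inference_time + selection_time
```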
We observe in Table 4 that the diversity-based functions do not require this explicit inference time, while for Entropy it is the only computation that is needed (taking the argmax of a list of uncertainty scores is negligible). Cal requires both inference and selection time. We can see that the inference time of Cal is slightly higher than that of Entropy because we do forward passes instead of , which is equivalent to both and instead of only . The selection time for Cal is the for-loop as presented in our Algorithm 1. We observe that it is often less computationally expensive than the inference step (which is a simple forward pass through the model). Still, there is room for improvement in order to reduce the time complexity of this step.
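The selection loop of Cal can be sketched as follows, under stated assumptions. The text specifies that neighborhoods are built in the feature space and that candidates are ranked by how much their predictive likelihoods diverge from those of their labeled neighbors. The concrete choices here (Euclidean nearest neighbors, KL divergence from the neighbor's distribution to the candidate's, averaging over neighbors) and all names and toy values are assumptions made for illustration and may differ from Algorithm 1.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions, smoothed to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def squared_distance(a, b):
    """Squared Euclidean distance between two encodings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def contrastive_selection(pool_enc, pool_probs, lab_enc, lab_probs, k, b):
    """Score each pool example by divergence from its k nearest labeled neighbors."""
    scores = []
    for enc, probs in zip(pool_enc, pool_probs):  # one iteration per pool example
        nearest = sorted(
            range(len(lab_enc)),
            key=lambda j: squared_distance(enc, lab_enc[j]),
        )[:k]
        score = sum(kl_divergence(lab_probs[j], probs) for j in nearest) / len(nearest)
        scores.append(score)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:b], scores

# Toy labeled set: two clusters with opposite predicted classes (hypothetical).
lab_enc = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
lab_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]

# Toy pool: example 2 sits in the first cluster but is predicted as the other class.
pool_enc = [[0.1, 0.5], [5.0, 5.5], [0.2, 0.4]]
pool_probs = [[0.9, 0.1], [0.15, 0.85], [0.1, 0.9]]

selected, scores = contrastive_selection(pool_enc, pool_probs, lab_enc, lab_probs, k=2, b=1)
```

In this toy setting the third pool example receives the highest score, since it is close to labeled points whose predictions disagree with its own. The brute-force neighbor search is kept for readability; the loop still makes a single pass over the pool, in line with the description above.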
In Table 5 we present the total time for all datasets (ordered with increasing size) and the average time for each acquisition function, as a means to rank their efficiency. Because we do not apply all acquisition functions to all datasets we compute three different average scores in order to ensure fair comparison. avg.-all is the average time across all datasets and is used to compare Random, Alps, Entropy and Cal. avg.-3 is the average time across the first datasets (imdb, sst- and dbpedia) and is used to compare all acquisition functions. Finally, avg.-6 is the average time across all datasets apart from qqp and is used to compare Random, Alps, BertKM, Entropy and Cal.
We first observe that Entropy is overall the most efficient acquisition function. According to the avg.-all column, Cal is the second most efficient function, followed by Alps. According to avg.-6 we observe the same pattern, with BertKM being the slowest method. Finally, we compare all acquisition functions on the smallest (in terms of size of ) datasets and find that Entropy is the fastest method, followed by Alps and Cal, which require almost 3 times more computation time. The other clustering methods, BertKM and Badge, are significantly more computationally expensive, requiring respectively 13 and 100(!) times more time than Entropy.
Interestingly, we observe the effect of the acquisition size ( of in our case) and the size of in the clustering methods. As these parameters increase, the computation time of the corresponding acquisition function increases dramatically. For example, we observe that in the smallest datasets Alps requires similar time to Cal. However, when we increase and (i.e. as we move from dbpedia with examples in to qnli with etc.; see Table 1) we observe that the acquisition time of Alps becomes twice as much as that of Cal. For instance, in qqp with acquisition size we see that Alps requires seconds on average, while Cal . This shows that even though our approach becomes more computationally expensive as the size of increases, the complexity is linear, while for the other hybrid methods that use clustering, the complexity grows exponentially.
All code for data preprocessing, model implementations, and active learning algorithms is made available at https://github.com/mourga/contrastive-active-learning. For questions regarding the implementation, please contact the first author.