Towards Robust and Reproducible Active Learning Using Neural Networks

02/21/2020 · by Prateek Munjal, et al.

Active learning (AL) is a promising ML paradigm that has the potential to parse through large unlabeled data and help reduce annotation cost in domains where labeling the entire dataset can be prohibitive. Recently proposed neural network based AL methods use different heuristics to accomplish this goal. In this study, we show that recent AL methods offer a gain over the random baseline only under a brittle combination of experimental conditions. We demonstrate that such marginal gains vanish when experimental factors are changed, leading to reproducibility issues and suggesting that AL methods lack robustness. We also observe that with a properly tuned model, which employs recently proposed regularization techniques, performance improves significantly for all AL methods including the random sampling baseline, and performance differences among the AL methods become negligible. Based on these observations, we suggest a set of experiments that are critical to assess the true effectiveness of an AL method. To facilitate these experiments we also present an open source toolkit. We believe our findings and recommendations will help advance reproducible research in robust AL using neural networks.




1 Introduction

Active learning (AL) is a machine learning paradigm that promises to help reduce the burden of data annotation by intelligently selecting a subset of informative samples from a large pool of unlabeled data that are relatively more conducive for learning. In AL, a model trained with a small amount of labeled seed data is used to parse through the unlabeled data to select the subset that should be sent to an annotator (called oracle in AL literature). To select such a subset, AL methods rely on exploiting the latent-space structure of samples, model uncertainty, or other such heuristics. The promise of reducing annotation cost has brought a surge in recent AL research (Sinha et al., 2019), (Sener and Savarese, 2018), (Beluch et al., 2018), (Gal et al., 2017), (Kirsch et al., 2019), (Tran et al., 2019), (Yoo and Kweon, 2019), and with it, a few outstanding issues.

First, the results reported for the random sampling baseline (RSB) vary significantly between studies. For example, on CIFAR10 under identical settings, the RSB performance reported by (Yoo and Kweon, 2019) differs from that reported by (Tran et al., 2019). Second, the results reported for the same AL method can vary across studies: using VGG16 (Simonyan and Zisserman, 2014) on CIFAR100 (Krizhevsky and Hinton, 2009), the classification accuracy reported by (Sener and Savarese, 2018) differs from the accuracy that (Sinha et al., 2019) reports for the same method. Third, recent AL studies have been inconsistent with each other. For example, (Sener and Savarese, 2018) and (Ducoffe and Precioso, 2018) state that diversity-based AL methods consistently outperform uncertainty-based methods, which were found to be worse than the RSB. In contrast, recent developments in uncertainty-based studies (Yoo and Kweon, 2019) suggest otherwise.

In addition to these issues, results with a new AL method are often reported on simplistic datasets and tested under limited experimental conditions, with an underlying assumption that the relative performance gains would be maintained when the experimental conditions change. These issues with the reporting of AL results have spurred a recent interest in benchmarking AL methods, and recent NLP and computer vision studies have raised a number of interesting questions (Lowell et al., 2018), (Prabhu et al., 2019), (Mittal et al., 2019). With the goal of improving the reproducibility and robustness of AL methods, in this study we evaluate the performance of these methods for image classification against the RSB in a fair experimental environment. The contributions of this study are as follows.


Through a comprehensive set of experiments performed using our PyTorch-based AL evaluation toolkit111, we compare different AL methods, including state-of-the-art diversity-based, uncertainty-based, and committee-based methods (Sinha et al., 2019), (Sener and Savarese, 2018), (Beluch et al., 2018), (Gal et al., 2017), against a well-tuned RSB. We demonstrate that: 1) results with our RSB are higher across a range of experiments than previously reported; 2) state-of-the-art AL methods achieve only a marginal gain over our RSB under a narrow combination of experimental conditions (e.g., a specific architecture), and this gain vanishes when the conditions change (e.g., a different classifier architecture); 3) variance in the evaluation metric (accuracy) across repeated runs on the same data, or on different folds of the initial labeled data, can lead to incorrect conclusions, since the accuracy gain attributed to an AL method may lie within the margin of error of the measurement; 4) somewhat surprisingly, these performance gains vanish when the neural networks are well-regularized, and none of the evaluated AL methods performs better than our RSB; 5) the variance in accuracy across repeated training runs is substantially lower with a well-regularized model, suggesting that such a training regime is unlikely to produce misleading AL results; 6) finally, we conclude the paper with a set of guidelines on the experimental evaluation of a new AL method, and provide a PyTorch-based AL toolkit to facilitate this.

111The AL toolkit will be released on GitHub. For access to the pre-release version, please contact Shadab Khan.

2 Pool-Based Active Learning Methods

Contemporary pool-based AL methods can be broadly classified into: (i) uncertainty based (Sinha et al., 2019), (Gal et al., 2017), (Kirsch et al., 2019), (ii) diversity based (Sener and Savarese, 2018), (Ducoffe and Precioso, 2018), and (iii) committee based (Beluch et al., 2018). AL methods also differ in other aspects; for example, some AL methods use the task model (e.g., the model trained for image classification) within their sampling function (Gal et al., 2017), (Sener and Savarese, 2018), whereas others use separate models for the task and sampling functions (Sinha et al., 2019), (Beluch et al., 2018). These methods are discussed in detail next.

Notations: Starting with an initial set of labeled data D_L and a large pool of unlabeled data D_U, pool-based AL methods train a model f. A sampling function then evaluates D_U and selects b (the budget size) samples to be labeled by an oracle. The selected samples with oracle-annotated labels are added to D_L, resulting in an extended labeled set, which is then used to retrain f. This cycle of sample-annotate-train is repeated until the sampling budget is exhausted or a satisficing metric is achieved. The AL sampling functions evaluated in this study are outlined next.

2.1 Model Uncertainty on Output (UC)

The method in (Lewis and Gale, 1994) ranks the unlabeled datapoints by the maximum softmax probability max_{c in 1..C} p(y = c | x), where C is the number of classes, and chooses the b samples with the lowest such scores. This approach focuses on the samples in D_U for which the softmax classifier is least confident.
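A minimal NumPy sketch of this least-confidence selection; the function and variable names are illustrative and not part of the released toolkit:

```python
import numpy as np

def uncertainty_sampling(probs, b):
    """Return indices of the b least-confident samples.

    probs: (N, C) array of softmax outputs over the unlabeled pool.
    """
    confidence = probs.max(axis=1)          # top-class probability per sample
    return np.argsort(confidence)[:b].tolist()  # ascending: least confident first
```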

2.2 Deep Bayesian Active Learning (DBAL)

(Gal et al., 2017) train the model with dropout layers and use Monte Carlo dropout to approximate sampling from the posterior. For our experiments, we use the two most commonly reported acquisition functions, i.e., max entropy and Bayesian Active Learning by Disagreement (BALD). Max entropy selects the top b datapoints with the highest predictive entropy H[y|x, D_L], where the posterior is approximated as p(y|x, D_L) ≈ (1/T) Σ_{t=1}^{T} p(y|x, ŵ_t), with T the number of stochastic forward passes through the model. BALD selects the top b samples that maximize the information gain over the model parameters, i.e., I[y; w | x, D_L] = H[y|x, D_L] − E_{p(w|D_L)}[H[y|x, w]]. We implement DBAL as described in (Gal et al., 2017), where the probability terms in the information gain are evaluated using the above approximation.
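Both acquisition scores can be computed from the stacked MC-dropout outputs. The sketch below assumes the T softmax outputs have already been collected into one array; names are illustrative:

```python
import numpy as np

def dbal_scores(mc_probs):
    """mc_probs: (T, N, C) softmax outputs from T MC-dropout forward passes.

    Returns (max-entropy, BALD) scores per sample; the top-b samples by
    either score form the acquisition set.
    """
    mean = mc_probs.mean(axis=0)                                  # approx. posterior p(y|x, D_L)
    entropy = -(mean * np.log(mean + 1e-12)).sum(axis=1)          # H[y | x, D_L]
    expected = -(mc_probs * np.log(mc_probs + 1e-12)).sum(axis=2).mean(axis=0)
    return entropy, entropy - expected                            # BALD = mutual information
```

When all T passes agree exactly, the BALD score is zero: the model is uncertain only in the aleatoric sense, not about its parameters.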

2.3 Center of Gravity (CoG)

Uncertainty in unlabeled datapoints is estimated as the Euclidean distance from the center of gravity (CoG) in the latent space. We define the CoG as the mean of the layer activations z(x) of the model over the datapoints x in D_U. Using this distance estimate, we select the top b datapoints farthest from the CoG. For our experiments, we use the penultimate layer activations.
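A minimal sketch of CoG selection, assuming the penultimate-layer activations have already been extracted; names are illustrative:

```python
import numpy as np

def cog_sampling(features, b):
    """features: (N, d) latent activations of the unlabeled pool.

    Selects the b points with the largest Euclidean distance from the
    pool's center of gravity (mean activation).
    """
    cog = features.mean(axis=0)
    dists = np.linalg.norm(features - cog, axis=1)
    return np.argsort(-dists)[:b].tolist()    # descending: farthest first
```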

2.4 Coreset

(Sener and Savarese, 2018) exploit the geometry of datapoints and choose samples that provide a cover for all datapoints. Essentially, their algorithm tries to find a set of cover-points such that the distance of any datapoint to its nearest cover-point is minimized. They proposed two sub-optimal but efficient solutions to this NP-Hard problem: coreset-greedy and coreset-MIP (Mixed Integer Programming), where coreset-greedy is used to initialize coreset-MIP. For our experiments, following (Yoo and Kweon, 2019), we implement coreset-greedy, since it achieves comparable performance while being significantly more compute-efficient.
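A sketch of the greedy k-center step on extracted features; this illustrates the algorithm rather than reproducing the authors' exact implementation:

```python
import numpy as np

def coreset_greedy(feats_labeled, feats_unlabeled, b):
    """Greedy k-center: repeatedly pick the unlabeled point whose distance
    to its nearest labeled/selected point is largest."""
    # min distance from each unlabeled point to the labeled set
    d = np.linalg.norm(
        feats_unlabeled[:, None, :] - feats_labeled[None, :, :], axis=2
    ).min(axis=1)
    picked = []
    for _ in range(b):
        i = int(np.argmax(d))                 # farthest point becomes a new center
        picked.append(i)
        # the new center may now be the nearest one for some points
        d = np.minimum(d, np.linalg.norm(feats_unlabeled - feats_unlabeled[i], axis=1))
    return picked
```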

2.5 Variational Adversarial Active Learning (VAAL)

(Sinha et al., 2019) combined a VAE (Kingma and Welling, 2013) and a discriminator (Goodfellow et al., 2014) to learn a metric for AL sampling. The VAE encoder is trained on both D_L and D_U, and the discriminator is trained on the latent-space representations to distinguish between seen (D_L) and unseen (D_U) images. The sampling function selects the b samples from D_U with the lowest discriminator confidence of being seen, as measured by the discriminator's softmax output. Effectively, the samples most likely to be unseen according to the discriminator are chosen.
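The VAE/discriminator training itself is omitted here; the sketch below shows only the final selection step, assuming the discriminator's "seen" probabilities for the unlabeled pool are available (names are illustrative):

```python
import numpy as np

def vaal_select(p_seen, b):
    """p_seen: (N,) discriminator probability that each unlabeled sample
    is 'seen' (labeled-like). Choose the b samples the discriminator is
    least confident are seen, i.e. most likely from the unlabeled pool."""
    return np.argsort(p_seen)[:b].tolist()
```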

1:  Input: dataset D, budget size b, number of AL iterations, and oracle
2:  Split D into train, validation, and test sets
3:  Split the train set into initial labeled set D_L and unlabeled pool D_U
4:  Train a base classifier f using only D_L
5:  while the AL iteration budget is not exhausted do
6:     Sample b datapoints from D_U using the sampling function
7:     Annotate the sampled points via the oracle and move them from D_U to D_L
8:     Initialize f randomly
9:     while not converged do
10:        Train f using only D_L
11:     end while
12:  end while
Algorithm 1 AL Training Schedule
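The schedule above can be sketched as a runnable skeleton, with the `train` and `sample` routines supplied by the caller; all names here are illustrative, not the toolkit's API:

```python
import numpy as np

def al_loop(x_pool, y_pool, init_idx, budget, iters, train, sample):
    """Minimal sample-annotate-train loop mirroring Alg. 1."""
    labeled = list(init_idx)
    unlabeled = [i for i in range(len(x_pool)) if i not in labeled]
    model = train(x_pool[labeled], y_pool[labeled])          # base classifier
    for _ in range(iters):
        picked = sample(model, x_pool[unlabeled], budget)    # positions within `unlabeled`
        newly = [unlabeled[i] for i in picked]               # oracle annotates these
        labeled += newly
        unlabeled = [i for i in unlabeled if i not in newly]
        model = train(x_pool[labeled], y_pool[labeled])      # retrain from scratch
    return model, labeled
```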

2.6 Ensemble Variance Ratio Learning

Proposed by (Beluch et al., 2018), this is a query-by-committee (QBC) method that selects the sample set with the largest dispersion, measured by the variance ratio v = 1 − f_m / T, where T is the number of committee members (CNNs) and f_m is the number of predictions falling in the modal class category. The variance ratio lies in the 0–1 range and can be treated as an uncertainty measure. We note that it is possible to formulate several AL strategies using the ensemble, e.g., BALD, max entropy, etc.; the variance ratio was chosen for this study because the authors showed it to lead to superior results. For the CNN ensembles, we train 5 models with the VGG16 architecture but different random initializations. Further, following (Beluch et al., 2018), the ensembles are used only for sample set selection; a separate task classifier is trained in a fully-supervised manner to do image classification.
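The variance ratio for a committee's predictions can be computed as follows; names are illustrative:

```python
import numpy as np

def variance_ratio(ensemble_preds):
    """ensemble_preds: (T, N) integer class predictions from T members.

    Returns v = 1 - f_m / T per sample, where f_m counts the modal class.
    v = 0 means full agreement; larger v means more disagreement.
    """
    T, N = ensemble_preds.shape
    f_m = np.array([np.bincount(ensemble_preds[:, i]).max() for i in range(N)])
    return 1.0 - f_m / T
```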

3 Regularization and Active Learning

In an ML training pipeline comprising data, model, metric, and training tricks, regularization can be introduced in several forms. In neural networks, regularization is commonly applied via a parameter norm penalty (metric), dropout (model), or standard data augmentation techniques such as horizontal flips and random crops (data). However, parameter norm penalty coefficients are not easy to tune, and dropout reduces model capacity to limit over-fitting on the training data while requiring the drop probability to be tuned. On the other hand, several recent studies in semi-supervised learning (SSL) have shown promising new ways of regularizing neural networks to achieve impressive gains. While it is not surprising that these regularization techniques help reduce generalization error, most AL studies have overlooked them. We believe this is because of a reasonable assumption: if an AL method works better than random sampling, its relative advantage should be maintained when newer regularization techniques and training tricks are used. Since regularization is critical in the low-data training regime of AL, where a massively-overparameterized model can easily overfit to the limited training data, we investigate the validity of this assumption by applying regularization techniques to the entire data-model-metric chain of neural network training.

Specifically, we employ a parameter norm penalty, random augmentation (RA) (Cubuk et al., 2019), stochastic weight averaging (SWA) (Izmailov et al., 2018), and shake-shake (SS) (Gastaldi, 2017). In RA, a sequence of randomly chosen image transforms is applied to the training data, with a randomly chosen distortion magnitude (m) that picks a value between two extremes. For details of the extreme values used for each augmentation choice, we refer the reader to the work of (Cubuk et al., 2018). SWA is applied by first saving snapshots of the model during the course of optimization, and then averaging the snapshots as a post-processing step. For SS experiments, we use the publicly available PyTorch implementation. The hyper-parameters associated with these techniques, as well as experiments and results with regularization applied to neural network training with AL-selected sample sets, are discussed in Sec. 5.3.
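For reference, PyTorch ships SWA utilities in `torch.optim.swa_utils` (`AveragedModel`, `SWALR`). The averaging step itself reduces to a running mean of weight snapshots, sketched here in NumPy; note that in practice batchnorm statistics must be recomputed for the averaged model:

```python
import numpy as np

def swa_average(snapshots):
    """Average weight snapshots saved at a fixed frequency during training.

    snapshots: list of dicts mapping parameter name -> np.ndarray.
    (Batchnorm running statistics are NOT handled here and must be
    recomputed with a forward pass over the training data.)
    """
    avg = {k: np.zeros_like(v, dtype=float) for k, v in snapshots[0].items()}
    for snap in snapshots:
        for k, v in snap.items():
            avg[k] += np.asarray(v, dtype=float) / len(snapshots)
    return avg
```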

4 Implementation Details

We perform experiments on CIFAR10, CIFAR100, and ImageNet by following the training schedule summarized in Alg. 1. Given a dataset D, we split it into train, validation, and test sets. The train set is further divided into the initial labeled (D_L) and unlabeled (D_U) sets. A base classifier is first trained, followed by iterations of the sample-annotate-train process using various AL methods. Model selection is done by choosing the best-performing model on the validation set. For a fair comparison, a consistent set of experimental settings is used across all methods. Dataset-specific training details are discussed next.

Learning rate and weight decay were tuned using grid search and set per dataset as follows. CIFAR10: optimizer=Adam (Kingma and Ba, 2015); inputs pre-processed using random horizontal flip and normalization (divide by 255). CIFAR100: optimizer=Adam, with separate learning-rate and weight-decay settings for the base classifier and the subsequent AL iterations; inputs pre-processed using random crop (pad=4) followed by horizontal flip and normalization (divide by 255). ImageNet: optimizer=SGD; we train the base classifier with a linear warm-up schedule followed by step decay of the learning rate. For AL iterations, we fine-tune the best model (picked by validation-set accuracy) from the previous iteration, again with step decay of the learning rate. Further, we choose the best model based on a realistically small validation set, following (Zhai et al., 2019). Inputs are pre-processed using random resized crops followed by horizontal flip (p=0.5) and normalized to zero mean and unit standard deviation using statistics of the initial labeled set.


Architecture: We use VGG16 (Simonyan and Zisserman, 2014) with batchnorm (Ioffe and Szegedy, 2015), 18-layer ResNet (He et al., 2016), and 28-layer 2-head Wide-ResNet (WRN-28-2) (Zagoruyko and Komodakis, 2016) in our experiments, following publicly available implementations. For the CIFAR10/100 models, we set the number of neurons in the penultimate fully-connected layer of VGG16 following the reference implementation.

Regularization Hyper-parameters: The Adam optimizer is used for both CIFAR datasets. RA parameters are: CIFAR10: n=1, m=5; CIFAR100: n=1, m=2; ImageNet: n=2, m=9. SWA hyper-parameters (learning rate and snapshot frequency) were selected empirically via grid search and kept consistent across experiments. We always train a model from scratch in each AL iteration, except for ImageNet due to its heavy compute budget.

Implementation of AL methods: We developed a PyTorch-based toolkit to evaluate the AL methods in a unified implementation. AL methods can be cast into two categories based on whether or not AL sampling relies on the task model (classifier network). For example, coreset uses the latent-space representations learned by the task model to select the sample set, whereas VAAL relies on a separate VAE-discriminator network to select the samples, independent of the task model. In our implementation, we abstract these two approaches in a sampling function that may use the task model if required by the AL method. Each AL method was implemented as a separate sampling function, referencing author-provided code where available. Via command-line arguments, the toolkit allows the user to configure various aspects of training, such as the task-model architecture, AL method, size of the initial labeled set, size of the acquisition batch, number of AL iterations, hyper-parameters for task-model training and AL sampling, and the number of repetitions.

Figure 1: Comparison of AL methods on CIFAR10 (top) and CIFAR100 (bottom) for different initial labeled sets. The mean accuracy of the base model (at 10% labeled data) is noted at the bottom of each subplot. Each model is trained with different random initialization seeds; the means of the runs in (f) & (l) suggest that no AL method performs significantly better than the others. For the exact numbers used to create the plots, please refer to the tables in the supplementary section.

5 Experiments and Results

All experiments were performed on 2 NVIDIA DGX-1 servers, with each experiment utilizing 1–4 of the 8 available GPUs on each server. All code was written in Python using PyTorch and other libraries, in addition to third-party codebases. We plan to release our codebase on GitHub soon; for early access, please contact the authors.

5.1 Variance in Evaluation Metrics

Training a neural network involves many stochastic components, including parameter initialization, data augmentation, mini-batch selection, and batchnorm, whose parameters change with mini-batch statistics. These elements can lead to different optima, resulting in varying performance across different runs of the same experiment. To evaluate the variance in classification accuracy caused by different initial labeled data, we draw five random initial labeled sets with replacement. Each of these five sets was used to train the base model, initialized with random weights, 5 times; a total of 25 models were trained for each AL method to characterize the variance within and between sample sets.

From the results summarized in Fig. 1, we make the following observations: (i) There is a standard deviation of 1 to 2.5% in accuracy among different AL methods, indicating that seemingly better results can be obtained purely by chance. (ii) In contrast to previous studies, our extensive experiments indicate that no AL method achieves strictly better classification accuracy than the RSB; at times, the RSB appears to perform marginally better, achieving the best mean accuracy on both CIFAR10 and CIFAR100, with DBAL and VAAL respectively second best. (iii) Our results averaged over 25 runs in Fig. 1 (f) and (l) indicate that no method performs clearly better than the others; an ANOVA and pairwise multiple-comparisons test with Tukey-Cramer FWER correction revealed that no AL method's performance was significantly different from the RSB. This provides strong evidence of the need to repeat an experiment over multiple runs to demonstrate the true effectiveness of an AL method.

5.2 Differing Experimental Conditions

Next, we compare the AL methods and RSB while modifying different experimental conditions: annotation batch size, validation set size, and class imbalance.

Annotation Batch Size (b): Following previous studies, we experiment with annotation batch sizes equal to 5% and 10% of the overall sample count.

Figure 2: Results when the smaller annotation batch size is used at each AL iteration on (a) CIFAR10 and (b) CIFAR100. Results are averages of 5 runs. For the exact numbers used to create the plots, please refer to the tables in the supplementary section.

Results in Fig. 2 (corresponding table in the supplementary section) show that VAAL and UC perform marginally better than the RSB, although inconsistently. For example, on CIFAR100, VAAL performs marginally better than most AL methods with the larger annotation batch size (Fig. 1(l)), in contrast to the results with the smaller batch size (Fig. 2). We therefore conclude that no AL method offers a consistent advantage over the others under different budget-size settings.

Validation Set Size: During training, we select the best-performing model on the validation set to report test set results. To evaluate the sensitivity of AL results to the validation set size, we perform experiments on CIFAR100 with three different sizes: 2%, 5%, and 10% of the training samples (Table 1). From the results, we do not observe any appreciable trend in accuracy with respect to the validation set size; for example, the RSB achieves similar mean accuracy for the best model selected using each of the three sizes. We conclude that AL results do not change significantly with the validation set size, and a small set suffices for model selection in low-data regimes such as AL, freeing up more data for training the task model; a similar observation was made in a recent SSL study (Zhai et al., 2019).

Validation set size:          2%                                5%                                10%
Labeled data:       20%        30%        40%        20%        30%        40%        20%        30%        40%
—                   34.6±1.2   43.3±1.6   49.8±1.1   35.4±1.4   42.5±1.9   49.1±1.7   34.0±0.3   43.1±1.5   48.4±1.1
—                   34.9±0.8   43.9±0.1   48.6±0.9   34.9±0.5   42.9±1.3   47.7±1.4   34.6±0.5   43.6±0.8   49.5±0.9
—                   36.8±0.7   43.7±0.5   48.8±1.2   33.7±2.2   43.5±1.2   49.1±0.4   34.9±1.1   42.8±1.7   48.9±0.7
Coreset             36.2±1.1   42.8±1.3   49.1±1.1   34.5±1.7   44.4±0.7   49.3±1.3   35.5±0.8   43.2±0.7   48.8±0.6
COG                 35.4±1.4   44.2±0.9   49.2±1.0   34.1±2.1   43.7±0.7   48.8±1.7   35.9±2.2   42.7±1.4   49.4±1.4
DBAL                35.0±0.8   43.8±1.3   48.5±1.6   36.4±1.5   42.8±0.7   50.0±0.8   34.2±1.7   43.4±1.8   49.3±0.9
BALD                34.1±1.3   44.0±1.0   49.4±1.0   36.2±1.3   42.2±1.2   48.5±0.6   36.5±1.2   43.1±0.9   49.3±0.6
—                   35.3±1.8   43.3±0.4   48.7±1.0   34.2±0.9   43.1±1.1   48.4±0.9   34.7±2.2   43.1±1.6   48.3±0.6
Table 1: Test set performance for models selected with different validation set sizes on CIFAR100. Results are averages of 5 runs.

Class Imbalance: Here, we evaluate the robustness of different AL methods to imbalanced data. For this, we construct an imbalanced version of CIFAR100 that simulates a long-tailed class distribution following a power law; the resulting per-class sample counts are normalized to construct a probability distribution. Models were trained using the previously described settings, except that the loss function was set to weighted cross-entropy. The results in Fig. 4 show that for the first two AL iterations, RSB achieves the highest mean accuracy and is surpassed by DBAL in the last iteration. More importantly, we notice that AL methods exhibit differing degrees of change in the imbalanced class setting, without a clear trend in the plot. In contrast to previously reported observations that found AL methods robust to class imbalance, we conclude that AL methods do not outperform RSB.
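As an illustration, a power-law class distribution of the kind described can be generated as follows; the exponent `gamma` is illustrative, since the exact value used in the experiments is not given here:

```python
import numpy as np

def power_law_class_probs(n_classes=100, gamma=1.0):
    """Per-class sample counts follow a power law, counts[c] ~ (c+1)**(-gamma),
    normalized into a sampling probability distribution over classes."""
    counts = np.arange(1, n_classes + 1, dtype=float) ** (-gamma)
    return counts / counts.sum()
```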

Methods                 CIFAR10        CIFAR100
RSB                     69.54 ± 1.58   26.58 ± 0.29
—                       74.57 ± 0.87   32.51 ± 0.92
+ RA                    75.43 ± 0.89   29.77 ± 0.83
+ Shake-Shake (SS)      71.78 ± 0.99   34.80 ± 0.28
+ SWA + RA              79.86 ± 0.60   36.65 ± 0.35
+ SS + SWA + RA         82.88 ± 0.26   44.37 ± 0.78
Table 2: Individual contributions of different regularization techniques. Results are averaged over 5 runs with 10% of the training data. All experiments use the VGG16 architecture, except Shake-Shake, which is restricted to the ResNeXt family.

5.3 Regularization

With the motivation stated in Section 3, we evaluate the effectiveness of the advanced regularization techniques RA and SWA in the context of AL using the CIFAR10 and CIFAR100 datasets. All experimental settings were kept as previously described, except that the number of epochs was increased from 100 to 150. We empirically observed that, unlike the parameter norm penalty, which requires careful tuning, RA and SWA work fairly well under changes in their hyper-parameters. We therefore do not use the parameter norm penalty in the experiments where RA and SWA were applied.

Fig. 3 compares different AL methods with the RSB on the CIFAR10/100 datasets. We observe that models trained with RA and SWA consistently achieve significant performance gains across all AL iterations and exhibit appreciably smaller variance across multiple runs of the experiments. Our regularized random sampling baseline achieves substantially higher mean accuracy on both CIFAR10 and CIFAR100. We note that for CIFAR10, an RSB model regularized using RA and SWA with 20% of the training data achieves over 4% higher accuracy than a model trained without RA and SWA using the much larger 40% of the training data. Similarly, for CIFAR100, the regularized 20% RSB model performs comparably to the unregularized 40% model. We therefore consider regularization a valuable addition to the low-data training regime of AL, especially given that it significantly reduces the variance in the evaluation metric and helps avoid misleading conclusions.

An ablative study showing the individual contribution of each regularization technique to the overall performance gain is given in Table 2. The results indicate that RA and SWA together yield a significant combined gain. We also experimented with Shake-Shake (SS) (Gastaldi, 2017) alongside RA and SWA, and observed that it significantly increases the runtime and is not robust to changes in model architecture. We therefore chose RA & SWA over SS in our experiments.

Figure 3: Effect of regularization (RA + SWA) on the test accuracy on the (a) CIFAR10 and (b) CIFAR100 datasets. Results are averages of 5 runs; regularized results are shown above the blue line.

5.4 Transferability and Optimizer Settings

In principle, the sample sets drawn by an AL method should be agnostic to the task model's architecture: a change in the architecture should maintain consistent performance trends for the AL method. We conduct an experiment by storing the indices of the sample set drawn in each AL iteration on the source network, and using them to train the target network. We consider VGG16 as the source, and ResNet18 (RN18) (He et al., 2016) and WRN-28-2 (Zagoruyko and Komodakis, 2016) as the target architectures. From Table 3, we observe that the trend in AL gains is architecture dependent. On CIFAR10 with RN18 using Adam, VAAL achieves higher accuracy than RSB; however, this relative gain vanishes with RA and SWA. Further, there was no discernible trend in the results using the WRN-28-2 or VGG16 architectures.

Target model:  VGG16            WRN-28-2         R18+Adam         R18+SGD          R18+Adam+Reg     R18+SGD+Reg
Labeled data:  20%  30%  40%    20%  30%  40%    20%  30%  40%    20%  30%  40%    20%  30%  40%    20%  30%  40%
RSB            77.3 80.3 82.6   79.1 82.4 84.7   74.1 77.3 80.8   80.1 84.1 86.2   86.7 89.0 90.4   84.8 87.8 89.3
Coreset        76.7 79.9 82.4   79.1 82.9 83.7   74.4 78.8 81.1   80.1 84.0 86.5   86.4 88.9 90.3   85.1 87.2 89.2
VAAL           77.0 80.3 82.4   78.9 82.7 84.1   75.7 79.6 81.5   79.6 83.8 86.4   86.6 88.9 90.5   84.9 87.7 89.3
QBC            77.2 80.3 81.6   78.1 82.9 84.9   74.3 77.8 80.6   79.9 83.6 86.1   86.6 88.9 90.1   85.1 87.6 89.3

Table 3: Transferability experiment on CIFAR10, where the source model is VGG16 and the target models are ResNet18 (R18) and Wide ResNet-28-2 (WRN-28-2). Reported numbers are means of test accuracies over five seeds. Results with regularization (Reg = SWA + RA) are shown in the last two column groups.

To evaluate whether the choice of optimizer played a role in VAAL's performance using RN18 with Adam, we repeated the training with SGD. We note the following (Table 3): (i) RSB (and the other methods) achieved a higher mean accuracy when trained using SGD than with Adam on RN18 with CIFAR10; further, RN18 with SGD performs comparably to WRN-28-2 with Adam. (ii) With Adam, both VAAL and coreset perform favorably against RSB; with SGD, however, the results are comparable.

5.5 Active Learning on ImageNet

Compared to CIFAR10/100, ImageNet is more challenging, with a larger sample count, 1000 classes, and higher-resolution images. We compare coreset, VAAL, and RSB on ImageNet. We were unable to evaluate QBC due to the prohibitive compute cost of training an ensemble of 5 CNN models. Details of the training hyper-parameters are in the supplementary section. Results with and without regularization (RA, SWA) are shown in Table 5. Using the ResNeXt-50 architecture (Xie et al., 2017) and following the settings of (Zhai et al., 2019), we achieve improved baseline performance compared to previously reported results (Beluch et al., 2018; Sinha et al., 2019). From Table 5, we observe that both AL methods performed marginally better than RSB, though the ImageNet experiments are not repeated over multiple runs due to prohibitive compute requirements.

Noise 10%:
RSB          69.09   72.78   76.97   76.63
RSB + Reg.   79.28   85.02   87.05   88.01
Noise 20%:
RSB          69.09   70.37   71.01   70.04
RSB + Reg.   79.28   83.02   84.24   85.44
Table 4: RSB accuracy with and without SWA and RA on CIFAR10 with a noisy oracle. RSB + Reg. refers to RSB regularized using RA and SWA.

without RA + SWA:
RSB          58.05   62.95   64.61   66.15
VAAL         58.05   63.33   64.68   66.18
Coreset      58.05   63.04   64.43   65.58
with RA + SWA:
RSB          59.43   63.88   66.83   69.10
VAAL         59.43   65.17   67.39   69.47
Coreset      59.43   64.17   67.07   69.54
Table 5: Effect of RA and SWA on ImageNet. Results reported for 1 run.

Figure 4: Results are averages of 5 runs on imbalanced CIFAR100.

5.6 Additional Experiments

Noisy Oracle: In this experiment, we sought to evaluate the stability of a regularized network to labels from a noisy oracle. We experimented with two levels of oracle noise by randomly permuting the labels of 10% and 20% of the samples in the set drawn by the random sampling baseline at each iteration. From the results in Table 4, we found that the drop in accuracy for the model regularized by RA and SWA was nearly half that of the model trained without these regularizations, at both noise levels. Our findings suggest that noisy pseudolabels generated for the unlabeled set by the model, when used in conjunction with appropriate regularization, should help improve the model's performance. Additional results using AL methods in this setting are shared in the supplementary section. Active Learning Sample Set Overlap: For interested readers, we discuss the extent of overlap among the sample sets drawn by AL methods in the supplementary section.
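The label-permutation noise model described above can be sketched as follows; the function name and fixed seed are illustrative:

```python
import numpy as np

def permute_labels(labels, frac, seed=0):
    """Simulate a noisy oracle: randomly permute the labels of a `frac`
    fraction of the samples, leaving the overall label multiset intact."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    n = int(len(labels) * frac)
    idx = rng.choice(len(labels), size=n, replace=False)  # samples to corrupt
    noisy[idx] = noisy[rng.permutation(idx)]              # shuffle labels among them
    return noisy
```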

6 Discussion

Under-Reported Baselines: We note that several recent AL studies report baseline results that are lower than the ones reproduced in this study. Table 6 summarizes our RSB results with comparisons to some recently published AL methods under similar training settings. Based on this observation, we emphasize that comparison of AL methods must be done under a consistent set of experimental settings. Our observations confirm and provide stronger evidence for a similar conclusion drawn in (Mittal et al., 2019) and, to a lesser extent, (Oliver et al., 2018). Different from (Mittal et al., 2019), though, we demonstrate that: (i) relative gains using an AL method appear only under a narrow combination of experimental conditions, (ii) such gains are not statistically meaningful over the random baseline, and (iii) more distinctly, the performance gains vanish when a well-regularized training strategy is used.

The Role of Regularization: Regularization helps reduce generalization error and is particularly useful in training overparameterized neural networks with low data. We show that both RA and SWA can achieve appreciable gain in performance at the expense of a small computational overhead. We observed that along with learning rate (in case of SGD), regularization was one of the key factors in reducing the error while being fairly robust to its hyperparameters (in case of RA and SWA). We also found that any trend of consistent gain observed with an AL method over RSB on CIFAR10/100 disappears when the model is well-regularized. Models regularized with RA and SWA also exhibited smaller variance in evaluation metric compared to the models trained without them. With these observations, we recommend that AL methods be also tested using well-regularized model to ensure their robustness. Lastly, we note that there are multiple ways to regularize the data-model-metric pipeline, we focus on data and model side regularization using techniques such as RA and SWA, though it is likely that other combination of newer regularization techniques will lead to similar results. We do believe that with their simplicity and applicability to a wide variety of model (as compared to methods such as shake-shake), RA and SWA can be effectively used in AL studies without significant hyperparameter tuning.

Table 6: Reported Random Baseline Accuracies vs our RSB results. We denote our RSB results with regularization by RSB-R.

Using the Unlabeled Set in Training: Some recent methods such as VAAL use the unlabeled set to train another network as part of their sampling routine. We argue that for such models, a better baseline comparison would come from the semi-supervised learning (SSL) literature. We note that some current SSL methods such as UDA (Xie et al., 2019) have reported very strong results on CIFAR10 using only a small fraction of labeled training data. These results suggest that a large number of noisy labels is relatively more helpful in reducing the generalization error than a smaller percentage of high-quality labels. Further commentary on this topic can be found in (Mittal et al., 2019).

AL Methods Compared To Strong RSB: Compared to the well-regularized RSB, the state-of-the-art AL methods evaluated in this paper do not achieve any noticeable gain. We believe that the AL results reported in the literature were obtained with insufficiently regularized models, and that the gains reported for AL methods are often not due to the superior quality of the selected samples. As shown in Table 3, the fact that a change in model architecture can change the conclusions being drawn suggests that transferability experiments should be essential to any AL study. Similarly, we observed that a simple change of optimizer or the use of regularization can influence the conclusions. The highly sensitive nature of AL results obtained with neural networks therefore necessitates a comprehensive suite of experimental tests.

7 Conclusion and Proposed Guidelines

Our extensive experiments suggest a strong need for a common evaluation platform that facilitates robust and reproducible development of AL methods. To this end, we recommend the following to ensure results are robust: (i) experiments must be repeated under varying training settings such as optimizer, model architecture, and budget size, among others, (ii) regularization techniques such as RA and SWA should be incorporated into the training to ensure AL methods are able to demonstrate gains over a regularized random baseline, (iii) transferability experiments must be performed to ensure the AL-drawn sample sets are indeed as informative as claimed. To increase the reproducibility of AL results, we further recommend: (iv) experiments should be performed on a common evaluation platform under consistent settings to minimize the sources of variation in the evaluation metric, (v) a snapshot of the experimental settings should be shared, e.g., as a configuration file (.cfg, .json, etc.), (vi) the index sets used to partition a public dataset into training, validation, test, and AL-drawn sets should be shared, along with the training scripts. To facilitate the use of these guidelines in AL experiments, we also provide an open-source AL toolkit. We believe our findings and toolkit will help advance robust and reproducible AL research.
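Guidelines (v) and (vi) can be implemented with very little machinery. The sketch below shows one possible way to bundle the configuration snapshot and the dataset index sets into a single JSON file; the function names, file layout, and field names are hypothetical illustrations, not the format used by our toolkit.

```python
# Illustrative sketch: persist the experiment configuration together with the
# index sets that partition a public dataset (train/val/test/AL-drawn), so
# the exact splits can be reproduced from one file.
import json

def save_experiment_snapshot(path, config, index_sets):
    """Write the config and index sets (dict of name -> list of ints) to JSON."""
    snapshot = {
        "config": config,  # e.g. optimizer, learning rate, budget size
        "index_sets": {name: sorted(idx) for name, idx in index_sets.items()},
    }
    with open(path, "w") as f:
        json.dump(snapshot, f, indent=2)

def load_experiment_snapshot(path):
    """Read a snapshot back; returns the same structure that was saved."""
    with open(path) as f:
        return json.load(f)
```

Sharing such a snapshot alongside the training scripts lets another group rebuild the identical labeled/unlabeled partitions, which is what makes transferability and repeated-run experiments comparable across studies.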


  • W. H. Beluch, T. Genewein, A. Nürnberger, and J. M. Köhler (2018) The power of ensembles for active learning in image classification. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9368–9377. Cited by: §1, §1, §2.6, §2, §5.5.
  • E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2018) Autoaugment: learning augmentation policies from data. arXiv preprint arXiv:1805.09501. Cited by: §3.
  • E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le (2019) RandAugment: practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719. Cited by: §3.
  • M. Ducoffe and F. Precioso (2018) Adversarial active learning for deep networks: a margin based approach. CoRR abs/1802.09841. External Links: Link, 1802.09841 Cited by: §1, §2.
  • Y. Gal, R. Islam, and Z. Ghahramani (2017) Deep bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1183–1192. Cited by: §1, §1, §2.2, §2.
  • X. Gastaldi (2017) Shake-shake regularization. arXiv preprint arXiv:1705.07485. Cited by: §3, §5.3.
  • I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680. Cited by: §2.5.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4, §5.4.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §4.
  • P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson (2018) Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407. Cited by: §3.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.5.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Cited by: §4.
  • A. Kirsch, J. van Amersfoort, and Y. Gal (2019) BatchBALD: efficient and diverse batch acquisition for deep bayesian active learning. CoRR abs/1906.08158. External Links: Link, 1906.08158 Cited by: §1, §2.
  • A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §1.
  • D. D. Lewis and W. A. Gale (1994) A sequential algorithm for training text classifiers. In SIGIR’94, pp. 3–12. Cited by: §2.1.
  • D. Lowell, Z. C. Lipton, and B. C. Wallace (2018) How transferable are the datasets collected by active learners?. CoRR abs/1807.04801. External Links: Link, 1807.04801 Cited by: §1.
  • S. Mittal, M. Tatarchenko, Ö. Çiçek, and T. Brox (2019) Parting with illusions about deep active learning. External Links: arXiv:1912.05361 Cited by: §1, §6, §6.
  • A. Oliver, A. Odena, C. A. Raffel, E. D. Cubuk, and I. Goodfellow (2018) Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, pp. 3235–3246. Cited by: §6.
  • A. Prabhu, C. Dognin, and M. Singh (2019) Sampling bias in deep active classification: an empirical study. arXiv preprint arXiv:1909.09389. Cited by: §1.
  • O. Sener and S. Savarese (2018) Active learning for convolutional neural networks: a core-set approach. In International Conference on Learning Representations. External Links: Link Cited by: §1, §1, §1, §2.4, §2.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1, §4.
  • S. Sinha, S. Ebrahimi, and T. Darrell (2019) Variational adversarial active learning. arXiv preprint arXiv:1904.00370. Cited by: §1, §1, §1, §2.5, §2, §5.5.
  • T. Tran, T. Do, I. D. Reid, and G. Carneiro (2019) Bayesian generative active deep learning. CoRR abs/1904.11643. External Links: Link, 1904.11643 Cited by: §1, §1.
  • Q. Xie, Z. Dai, E. Hovy, M. Luong, and Q. V. Le (2019) Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848. Cited by: §6.
  • S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500. Cited by: §5.5.
  • D. Yoo and I. S. Kweon (2019) Learning loss for active learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 93–102. Cited by: §1, §1, §2.4.
  • S. Zagoruyko and N. Komodakis (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §4, §5.4.
  • X. Zhai, A. Oliver, A. Kolesnikov, and L. Beyer (2019) S4L: self-supervised semi-supervised learning. CoRR abs/1905.03670. External Links: Link, 1905.03670 Cited by: §4, §5.2, §5.5.