1 Introduction
The recent success of learning-based computer vision methods relies heavily on abundant annotated training examples, which may be prohibitively costly to label or impossible to obtain at large scale [46, 6]. To mitigate this drawback, active learning [4] algorithms aim to incrementally select samples for annotation that result in high classification performance with low labeling cost. Active learning has been shown to require relatively fewer training instances when applied to computer vision tasks such as image classification [40, 29, 14, 1] and semantic segmentation [53, 28, 19].
This paper introduces a pool-based active learning strategy which learns a low-dimensional latent space from labeled and unlabeled data using Variational Autoencoders (VAEs). VAEs have been well studied and valued both for their generative properties and for their ability to learn rich latent spaces. Our method, Variational Adversarial Active Learning (VAAL), selects instances for labeling from the unlabeled pool that are sufficiently different in the latent space learned by the VAE, in order to maximize the performance of the representation learned on the newly labeled data. Sample selection in our method is performed by an adversarial network which classifies which pool the instances belong to (labeled or unlabeled).
Our VAE learns a latent representation in which the sets of labeled and unlabeled data are mapped into a common embedding. We use an adversarial network in this space to distinguish one from the other. The VAE and the discriminator are framed as a two-player minimax game, similar to GANs [18], such that the VAE is trained to learn a feature space that tricks the adversarial network into predicting that all datapoints, from both the labeled and unlabeled sets, are from the labeled pool, while the discriminator network learns how to discriminate between them. The strategy follows the intuition that once the active learner is trained, the probability associated with the discriminator’s predictions effectively estimates how representative each sample is of the pool it has been deemed to be from. Therefore, instead of explicitly measuring uncertainty, we aim to choose points that would yield high uncertainty and thus are samples that are not well represented in the labeled set. We additionally consider oracles with different levels of labeling noise and demonstrate the robustness of our method to such noisy labels. In our experiments we demonstrate superior performance on a variety of large-scale image classification and segmentation datasets, and outperform current state-of-the-art methods both in performance and computational cost.
2 Related Work
Active learning: Current approaches can be categorized as query-acquiring (pool-based) or query-synthesizing methods. Query-synthesizing approaches use generative models to generate informative samples [32, 34, 56], whereas pool-based algorithms use different sampling strategies to determine how to select the most informative samples. Since our work lies in the latter line of research, we will mainly focus on previous work in this direction.
Pool-based methods can be grouped into three major categories: uncertainty-based methods [19, 51, 1], representation-based models [40], and their combination [53, 38]. Pool-based methods have been theoretically proven to be effective and to achieve better performance than random sampling of points [42, 12, 15]. Sampling strategies in pool-based algorithms have been built upon several methods, surveyed in [41], such as information-theoretic methods [30], ensemble methods [35, 12], and uncertainty heuristics such as distance to the decision boundary [48] and conditional entropy [29]. Uncertainty-based pool-based models have been proposed in both Bayesian [14] and non-Bayesian frameworks. In the realm of Bayesian frameworks, probabilistic models such as Gaussian processes are used to estimate uncertainty [23, 39]. Gal & Ghahramani [14, 13] also showed the relationship between uncertainty and dropout, used it to estimate the uncertainty of neural network predictions, and applied it to active learning on small image datasets using shallow [13] and deep [14] neural networks. In non-Bayesian classical active learning approaches, uncertainty heuristics such as distance from the decision boundary, highest entropy, and expected risk minimization have been widely investigated [3, 48, 52]. However, it was shown in [40] that such classical techniques do not scale well to deep neural networks and large image datasets. Instead, the authors proposed Coresets, which minimize the Euclidean distance between the sampled points and the points that were not sampled in the feature space of the trained model [40]. Using an ensemble of models to represent uncertainty was proposed by [28, 53], but [36] showed that ensembles do not always yield high diversity in predictions, which results in sampling redundant instances.
Representation-based methods rely on selecting a few examples by increasing diversity in a given batch [40, 8]. The Coreset technique was shown to be an effective representation learning method for large-scale image classification tasks [40] and was theoretically proven to work best when the number of classes is small. However, as the number of classes grows, its performance deteriorates. Moreover, for high-dimensional data, distance-based representation methods like Coreset appear to be ineffective because in high dimensions p-norms suffer from the curse of dimensionality, which is referred to as the distance concentration phenomenon in the computational learning literature [10]. We overcome this limitation by utilizing VAEs, which have been shown to be effective in unsupervised and semi-supervised representation learning of high-dimensional data [26, 45].
Methods that aim to combine uncertainty and representativeness use a two-step process to select the points with high uncertainty from among the most representative points in a batch. A hybrid framework combining uncertainty using conditional entropy and representation learning using information density was proposed in [29] for classification tasks. A weakly supervised learning strategy was introduced in [51] that trains the model with pseudo labels obtained for instances with high confidence in predictions. However, for a fixed performance goal, such methods often need to sample more instances per batch compared to other methods. Furthermore, [28] showed that the representation step may not be necessary and suggested an ensemble method that outperformed competitive approaches such as [53], which uses uncertainty together with Coresets. While we show that our model outperforms both [28] and [53], we argue that VAAL achieves this by learning the representation and uncertainty together such that they act in favor of each other, resulting in better active learning performance.
Variational autoencoders: Autoencoders have long been used to effectively learn a feature space and representation [2]. A Variational AutoEncoder [26] is an example of a latent variable model that follows the encoder-decoder architecture of classical autoencoders, places a prior distribution on the feature space distribution, and uses the Evidence Lower Bound to optimize the learnt posterior. Adversarial autoencoders are a family of autoencoders which minimize the adversarial loss in the latent space between a sample from the prior and the posterior distribution [33]. Prior work has investigated uncertainty modeling using a VAE to drive learning of sequence models in language applications [7].
Active learning for semantic segmentation: Segmentation labels are among the most expensive annotations to collect. Active learning has been broadly investigated for labeling medical images, one of the most prevalent applications of AL: only human experts with sophisticated knowledge can provide labels, so improving this process would save them considerable time and effort. Suggestive Annotation (SA) [53] uses uncertainty obtained from an ensemble of models trained on the labeled data and Coresets for choosing representative data points in a two-step strategy. [28] also proposed an active learning algorithm for image segmentation using an ensemble of models, and showed empirically that their information-theoretic heuristic for uncertainty matches the performance of SA without using Coresets. [19] extended the work of [14] by applying Monte Carlo dropout masks to the unlabeled images using a trained model and calculating the uncertainty of the predicted labels. Some active learning strategies developed for image classification can also be used for semantic segmentation; Coresets and max-entropy strategies are two examples [40, 3].
Adversarial learning: Adversarial learning has been used for different problems such as generative models [18], representation learning [33, 37], domain adaptation [50, 22], and deep learning robustness and security [31, 49]. The use of an adversarial network enables the model to be trained in a fully differentiable manner by solving the minimax optimization problem [18]. Adversarial networks operating in the feature space have been extensively researched in the representation learning and domain adaptation literature to efficiently learn a useful feature space for the task [33, 24, 47, 50, 22].
3 Adversarial Learning of Variational Autoencoders for Active Learning
Let $(x_L, y_L)$ be a sample pair belonging to the pool of labeled data $(X_L, Y_L)$. $X_U$ denotes a much larger pool of samples $(x_U)$ which are not yet labeled. The goal of the active learner is to train the most label-efficient model by iteratively querying a fixed sampling budget of the most informative samples from the unlabeled pool ($x_U \sim X_U$), using an acquisition function, to be annotated by the oracle such that the expected loss is minimized.
3.1 Transductive representation learning
We use a variational autoencoder for representation learning in which the encoder learns a low-dimensional space for the underlying distribution using a Gaussian prior and the decoder reconstructs the input data. In order to capture the features that are missing in the representation learned on the labeled pool, we can benefit from using the unlabeled data and perform transductive learning. The VAE is trained by optimizing the variational lower bound on the marginal likelihood of a given sample, formulated as
$$\mathcal{L}_{\mathrm{VAE}}^{\mathrm{trd}} = \mathbb{E}[\log p_\theta(x_L \mid z_L)] - \beta\, D_{\mathrm{KL}}(q_\phi(z_L \mid x_L) \,\|\, p(z)) + \mathbb{E}[\log p_\theta(x_U \mid z_U)] - \beta\, D_{\mathrm{KL}}(q_\phi(z_U \mid x_U) \,\|\, p(z)) \tag{1}$$
where $q_\phi$ and $p_\theta$ are the encoder and decoder parameterized by $\phi$ and $\theta$, respectively, $x_L$ and $x_U$ are labeled and unlabeled samples with latent codes $z_L$ and $z_U$, $p(z)$ is the prior chosen as a unit Gaussian, and $\beta$ is the Lagrangian parameter for the optimization problem. The reparameterization trick is used for proper calculation of the gradients [26].
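As an illustration of the terms in this objective, the following is a minimal sketch for a diagonal-Gaussian encoder, written as a loss to be minimized (the negated bound). The unit-variance Gaussian decoder, which turns the reconstruction term into a squared error, and the function names `kl_to_unit_gaussian`, `neg_elbo`, and `transductive_loss` are assumptions made for this example and are not taken from the paper.

```python
import math

def kl_to_unit_gaussian(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))

def neg_elbo(x, x_recon, mu, logvar, beta=1.0):
    """Negated beta-weighted lower bound for one sample (lower is better).

    Assumption: a Gaussian decoder with unit variance, so -log p(x|z)
    reduces to half the squared error up to an additive constant.
    """
    recon = 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + beta * kl_to_unit_gaussian(mu, logvar)

def transductive_loss(labeled, unlabeled, beta=1.0):
    """Apply the same per-sample objective to labeled and unlabeled samples.

    Each sample is a tuple (x, x_recon, mu, logvar).
    """
    return sum(neg_elbo(*s, beta=beta) for s in labeled + unlabeled)
```

The point of the sketch is that labels never enter the objective: labeled and unlabeled samples contribute identical terms, which is what makes the representation learning transductive.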
3.2 Adversarial representation learning
The representation learned by the VAE is a mixture of the latent features associated with both labeled and unlabeled data. An ideal active learning agent is assumed to have a perfect sampling strategy that is capable of sending the most informative unlabeled data to the oracle. Most sampling strategies rely on the model’s uncertainty, i.e., the more uncertain the model is about a prediction, the more informative that specific unlabeled sample must be. However, this introduces vulnerability to outliers. In contrast, we train an adversarial network for our sampling strategy to learn how to distinguish between the encoded features in the latent space. This adversarial network is analogous to the discriminator in GANs, whose role is to discriminate between real images and fake images created by the generator. In VAAL, the adversarial network $D$ is trained to map the latent representation of $x_L \cup x_U$ to a binary label which is $1$ if the sample belongs to $X_L$ and is $0$ otherwise. The key to our approach is that the VAE and the adversarial network are learned together in an adversarial fashion. While the VAE maps the labeled and unlabeled data into the same latent space with similar probability distributions $q_\phi(z_L \mid x_L)$ and $q_\phi(z_U \mid x_U)$, it fools the discriminator into classifying all the inputs as labeled. On the other hand, the discriminator attempts to effectively estimate the probability that the data comes from the unlabeled data. We can formulate the objective function for the adversarial role of the VAE as follows
$$\mathcal{L}_{\mathrm{VAE}}^{\mathrm{adv}} = -\mathbb{E}[\log(D(q_\phi(z_L \mid x_L)))] - \mathbb{E}[\log(D(q_\phi(z_U \mid x_U)))] \tag{2}$$
which is simply a binary cross-entropy cost function in which every sample is assigned the labeled target. The objective function to train the discriminator is given as
$$\mathcal{L}_{D} = -\mathbb{E}[\log(D(q_\phi(z_L \mid x_L)))] - \mathbb{E}[\log(1 - D(q_\phi(z_U \mid x_U)))] \tag{3}$$
By combining Eq. (1) and Eq. (2) we obtain the full objective function for the VAE in VAAL as
$$\mathcal{L}_{\mathrm{VAE}} = \lambda_1 \mathcal{L}_{\mathrm{VAE}}^{\mathrm{trd}} + \lambda_2 \mathcal{L}_{\mathrm{VAE}}^{\mathrm{adv}} \tag{4}$$
where $\lambda_1$ and $\lambda_2$ are hyperparameters that determine the effect of each component to learn an effective variational adversarial representation.
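The two sides of the adversarial game can be sketched with plain binary cross-entropy. The snippet below is an illustrative sketch only: it assumes the discriminator outputs the probability that a sample is labeled (target 1 for labeled, 0 for unlabeled), and the function names and default weights are hypothetical. Here `trd_loss` is assumed to be the transductive VAE term already expressed as a loss.

```python
import math

def bce(p, target, eps=1e-12):
    """Binary cross-entropy for one probability p and a 0/1 target."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))

def _mean(values):
    return sum(values) / len(values)

def vae_adversarial_loss(d_labeled, d_unlabeled):
    """VAE side: every sample, labeled or not, gets the 'labeled' target 1."""
    return _mean([bce(p, 1) for p in d_labeled]) + _mean([bce(p, 1) for p in d_unlabeled])

def discriminator_loss(d_labeled, d_unlabeled):
    """Discriminator side: labeled samples get target 1, unlabeled get 0."""
    return _mean([bce(p, 1) for p in d_labeled]) + _mean([bce(p, 0) for p in d_unlabeled])

def vae_total_loss(trd_loss, adv_loss, lambda1=1.0, lambda2=1.0):
    """Weighted combination of the transductive and adversarial terms."""
    return lambda1 * trd_loss + lambda2 * adv_loss
```

When the discriminator separates the pools well, its own loss is small while the VAE's adversarial loss is large, which is the pressure that pushes the encoder to make the two pools harder to tell apart.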
The task module shown in Fig. 1 learns the task for which the active learner is being trained. We report results below on image classification and semantic segmentation tasks, using VGG16 [44] and the dilated residual network (DRN) architecture [54] with an unweighted cross-entropy cost function. Our full algorithm is shown in Alg. 1.
3.3 Sampling strategies and noisy oracles
The labels provided by the oracles might vary in how accurate they are depending on the quality of available human resources. For instance, medical images annotated by human experts are assumed to be more accurate than crowd-sourced data collected by non-experts and/or information available on the cloud. We consider two types of oracles: an ideal oracle which always provides correct labels for the active learner, and a noisy oracle which non-adversarially provides erroneous labels for some specific classes. This might occur due to similarities across some classes causing ambiguity for the labeler. In order to model this oracle realistically, we have applied targeted noise to visually similar classes. The sampling strategy in VAAL is shown in Alg. 2. We use the probability associated with the discriminator’s predictions as a score to collect the budgeted number of samples in every batch with the lowest confidence to be sent to the oracle.
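The selection step described above amounts to ranking the unlabeled pool by the discriminator's score and taking the lowest-scoring samples. A minimal sketch follows, assuming the score is the discriminator's predicted probability that a sample belongs to the labeled pool; the function name `select_for_labeling` and the example scores are hypothetical.

```python
def select_for_labeling(labeled_prob, budget):
    """Return indices of the `budget` samples with the lowest predicted
    probability of being labeled, in ascending order of that probability."""
    ranked = sorted(range(len(labeled_prob)), key=lambda i: labeled_prob[i])
    return ranked[:budget]

# Hypothetical discriminator outputs for five unlabeled samples.
scores = [0.91, 0.12, 0.55, 0.30, 0.78]
print(select_for_labeling(scores, budget=2))  # prints [1, 3]
```

Because this step is a single forward pass followed by a sort, its cost grows only mildly with the size of the unlabeled pool.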
4 Experiments
We begin our experiments with an initial labeled pool with of the training set labeled. The budget size per batch is equal to of the training dataset. The pool of unlabeled data contains the rest of the training set from which samples are selected to be annotated by the oracle. Once labeled, they will be added to the initial training set and training is repeated on the new training set. We assume the oracle is ideal unless stated otherwise.
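The bookkeeping of this protocol can be sketched as a loop over labeling rounds. This is an illustrative sketch, not the authors' code: the fractions used in the usage note are placeholders rather than the values used in the experiments, and `acquire` stands in for any acquisition function (VAAL, Coreset, random sampling, and so on).

```python
import random

def active_learning_rounds(n_train, initial_frac, budget_frac, n_rounds, acquire, seed=0):
    """Track labeled/unlabeled index pools across labeling rounds.

    `acquire(unlabeled, budget)` must return the indices chosen for annotation.
    """
    rng = random.Random(seed)
    indices = list(range(n_train))
    rng.shuffle(indices)
    n_init = round(initial_frac * n_train)
    budget = round(budget_frac * n_train)
    labeled, unlabeled = indices[:n_init], indices[n_init:]
    history = [len(labeled)]
    for _ in range(n_rounds):
        picked = acquire(unlabeled, budget)
        picked_set = set(picked)
        labeled = labeled + picked  # the oracle labels the picked samples
        unlabeled = [i for i in unlabeled if i not in picked_set]
        history.append(len(labeled))
        # The task model would be retrained on `labeled` at this point.
    return labeled, unlabeled, history
```

For example, with placeholder values `active_learning_rounds(1000, 0.2, 0.1, 3, lambda pool, b: pool[:b])`, the labeled pool grows from 200 to 500 samples over three rounds.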
Datasets. We have evaluated VAAL on two common vision tasks. For image classification we have used CIFAR10 [27] and CIFAR100 [27] both with K images of size , and Caltech256 [20] which has images of size including object categories. For a better understanding of the scalability of VAAL we have also experimented with ImageNet [6] with more than M images of classes. For semantic segmentation, we evaluate our method on the BDD100K [55] and Cityscapes [5] datasets, both of which have classes. BDD100K is a diverse driving video dataset with K images with full-frame instance segmentation annotations collected from distinct locations in the United States. Cityscapes is another large-scale driving video dataset containing frames with instance segmentation annotations recorded in street scenes from different cities in Europe. The statistics of these datasets are summarized in Table 2 in the appendix.
Performance measurement. We evaluate the performance of VAAL in image classification and segmentation by measuring the accuracy and mean IoU, respectively, achieved by the task module trained with , , , , , , of the total training set as it becomes available with labels provided by the oracle. Results for all our experiments, except for ImageNet, are averaged over runs. ImageNet results, however, are obtained by averaging over repetitions using , , , , of the training data.
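For reference, mean IoU averages the per-class intersection-over-union between predicted and ground-truth labels. The sketch below operates on flat lists of per-pixel class indices; skipping classes that are absent from both prediction and ground truth is an assumption of this example, since conventions differ between benchmarks.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: ignore classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

def accuracy(pred, target):
    """Fraction of positions where prediction and ground truth agree."""
    return sum(1 for p, t in zip(pred, target) if p == t) / len(target)
```

For predictions `[0, 0, 1, 1]` against ground truth `[0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean IoU is 7/12, while the accuracy is 3/4.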
4.1 VAAL on image classification benchmarks
Baselines. We compare our results using VAAL for image classification against various approaches including Coreset [40], Monte Carlo Dropout [13], and Ensembles using Variation Ratios (Ensembles w. VarR) [1, 11]. We also show the performance of deep Bayesian AL (DBAL) by following [14] and perform sampling using their proposed max-entropy scheme to measure uncertainty [43]. We also show the results using random sampling, in which samples are drawn uniformly at random from the unlabeled pool. This method still serves as a competitive baseline in active learning. Moreover, we use the mean accuracy achieved on the entire dataset as an upper bound which does not adhere to the active learning scenario.
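Two of the uncertainty scores used by these baselines have simple closed forms. The sketch below shows predictive entropy (the quantity maximized by a max-entropy scheme) and the variation ratio of an ensemble's votes; the function names are hypothetical, and the baselines' actual implementations may differ in detail.

```python
import math
from collections import Counter

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def variation_ratio(votes):
    """1 minus the fraction of ensemble members voting for the modal class."""
    modal_count = Counter(votes).most_common(1)[0][1]
    return 1.0 - modal_count / len(votes)
```

Both scores are zero when the model (or ensemble) is fully confident and grow as predictions become more spread out; an acquisition function would query the samples with the highest scores.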
Implementation details. We used random horizontal flips for data augmentation. The architecture used in the task module for image classification is VGG16 [44] with Xavier initialization [17], and the VAE has the same architecture as the Wasserstein autoencoder [47] with latent dimensionality given in Table 3 in the appendix. The discriminator is a layer multilayer perceptron (MLP), and Adam [25] is used as the optimizer for all three modules with an equal learning rate of and batch size of . For ImageNet, however, the learning rate varies across the modules such that the task learner has a learning rate of while the VAE and the discriminator have a learning rate of . Training continues for epochs in ImageNet and for epochs in all other datasets. The budget size for classification experiments is chosen to be of the full training set, which is equivalent to , , , and for CIFAR10, CIFAR100, Caltech256, and ImageNet, respectively, in VAAL and all other baselines. The hyperparameters used in our model were found through a grid search and are tabulated in Table 3 in the appendix.
VAAL performance on CIFAR10/100 and Caltech256. Figure 2 shows the performance of VAAL compared to prior works. On CIFAR10, our method achieves mean accuracy of by using of the data whereas using the entire dataset yields accuracy of , denoted as Top-1 accuracy in Fig. 2. Comparing the mean accuracy values for data ratios above shows that VAAL evidently outperforms random sampling, DBAL, and MC-Dropout while beating Ensembles by a smaller margin and becoming on par with Coreset. On CIFAR100, VAAL remains competitive with Ensembles w. VarR and Coreset, and outperforms all other baselines. The maximum achievable mean accuracy is on CIFAR100 using of the data while VAAL achieves by only using of it. Moreover, for data ratios above of labeled data, VAAL consistently requires fewer labels than Coreset or Ensembles w. VarR in order to achieve the same accuracy, which is equal to labels. On Caltech256, which has real images of object categories, VAAL consistently outperforms all baselines by an average margin of from random sampling and from the most competitive baseline, Coreset. The DBAL method performs nearly identically to random sampling while MC-Dropout yields lower accuracies than random sampling.
By looking at the number of labels required to reach a fixed performance, for instance , VAAL needs of data ( images) to be labeled whereas this number is approximately and for Coreset and Ensemble w. VarR, respectively. Random sampling, DBAL, and MCDropout all need more than images.
As can be seen in Fig. 2, VAAL outperforms Coreset with higher margins as the number of classes increases from to to . The theoretical analysis shown in [40] confirms that Coreset is more effective when fewer classes are present due to the negative impact of high dimensionality on p-norms in the Coreset method.
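The effect of dimensionality on distance-based selection can be illustrated with a small experiment on distance concentration. The sketch below is an illustration on synthetic data, not an analysis of Coreset itself: it draws points uniformly from the unit hypercube (an assumption of this example) and measures the relative contrast between the farthest and nearest point from a random query.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance from a random query
    to n_points random points, all uniform in [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

for d in (2, 10, 100, 1000):
    print(d, relative_contrast(d))
```

As the dimension grows, the contrast shrinks: the nearest and farthest neighbors end up at almost the same Euclidean distance, so the distances carry less information for choosing representative samples.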
VAAL performance on ImageNet. ImageNet [6] is a challenging large-scale dataset which we use to show the scalability of our approach. Fig. 2 shows that we improve the state of the art by increase in the gap between the accuracy achieved by the previous state-of-the-art methods (Coreset and Ensemble) and random sampling. As can be seen in Fig. 2, this improvement can also be viewed in the number of samples required to achieve a specific accuracy. For instance, accuracy of is achieved by VAAL using K number of images whereas Coreset and Ensembles w. VarR should be provided with almost K more labeled images to obtain the same performance. Random sampling remains a competitive baseline, as both DBAL and MC-Dropout perform below it.
4.2 VAAL on image segmentation benchmarks
Baselines. We evaluate VAAL against state-of-the-art AL approaches for image segmentation including Coreset [40], MC-Dropout [19], Query-By-Committee (QBC) [28], and suggestive annotation (SA) [53]. SA is a hybrid ensemble method that uses bootstrapping for uncertainty estimation [9] and Coreset for measuring representativeness.
Implementation details. Similar to the image classification setup, we used random horizontal flips for data augmentation. The VAE is a Wasserstein autoencoder [47], and the discriminator is also a layer MLP. The architecture used in the task module for image segmentation is DRN [54], and Adam with a learning rate of is chosen as the optimizer for all three modules. The batch size is set as and training stops after epochs in both datasets. The budget size used in VAAL and all baselines is set as and for BDD100K and Cityscapes, respectively. All hyperparameters are shown in Table 3 in the appendix.
VAAL performance on Cityscapes and BDD100K. Figure 3 demonstrates our results on the driving datasets compared with four other baselines as well as the reference random sampling. As we also observed in Section 4.1, Coreset performs better with fewer classes in image classification tasks [40]. However, the large gap between VAAL and Coreset, despite only having classes, suggests that Coreset and Ensemble-based methods (QBC here) suffer from high dimensionality in the inputs ( as opposed to thumbnail images used in CIFAR10/100). QBC, Coreset, and SA (Coreset + QBC) perform nearly identically, while MC-Dropout remains less effective than random sampling. VAAL consistently demonstrates significantly better performance by achieving the highest mean IoU on both Cityscapes and BDD100K across different labeled data ratios. VAAL is able to achieve mIoU of and using only labeled data while the maximum mIoU we obtained using of these datasets is and on Cityscapes and BDD100K, respectively. In terms of the labels required by each method, on Cityscapes VAAL needs annotations to reach of mIoU whereas QBC, Coreset, SA, random sampling, and MC-Dropout demand nearly , , , , and labels, respectively. Similarly, on BDD100K, in order to reach of mIoU, other baselines need more annotations, whereas VAAL requires only . Considering the difficulties in full-frame instance segmentation, VAAL is able to effectively reduce the time and effort required for such dense annotations.
5 Analyzing VAAL in Detail
In this section, we take a deeper look into our model by first performing ablation and then evaluating the effect of possible biases and noise on its performance. Sensitivity of VAAL to budget size is also explored in 5.2.
5.1 Ablation study
Figure 4 presents our ablation study to inspect the contribution of the key modules in VAAL, namely the VAE and the discriminator ($D$). We perform ablation on the segmentation task, which is more challenging than classification, and we use BDD100K as it is larger than Cityscapes. The ablation variants we consider are: 1) eliminating the VAE, 2) a frozen VAE with $D$, and 3) eliminating $D$. In the first ablation, we explore the role of the VAE as the representation learner by having only a discriminator trained on the image space to discriminate between the labeled and unlabeled pools. As shown in Fig. 4, in this setting the discriminator only memorizes the data, which yields the lowest performance. It also reveals the key role of the VAE in not only learning a rich latent space, but also playing an effective minimax game with the discriminator to avoid overfitting. In the second ablation scenario we add a VAE to the previous setting to encode-decode a lower-dimensional space for training $D$. However, here we avoid training the VAE and hence merely explore its role as an autoencoder. This setting performs better than having only $D$ trained in a high-dimensional space, yet performs similarly to or worse than random sampling, suggesting that the discriminator failed to learn the representativeness of the samples in the unlabeled pool. In the last ablation, we explore the role of the discriminator by training only a VAE that uses the 2-Wasserstein distance from the cluster centroid of the labeled dataset as a heuristic to explicitly measure uncertainty. For a multivariate isotropic Gaussian distribution, the closed-form solution for the 2-Wasserstein distance between two probability distributions [16] can be written as
$$W_{ij} = \left[ \|\mu_i - \mu_j\|_2^2 + \|\Sigma_i^{1/2} - \Sigma_j^{1/2}\|_F^2 \right]^{1/2} \tag{5}$$
where $\|\cdot\|_F$ represents the Frobenius norm, $\mu_i$, $\Sigma_i$ denote the $\mu$, $\Sigma$ predicted by the encoder, and $\mu_j$, $\Sigma_j$ are the mean and variance of the normal distribution over the labeled data from which the latent variable $z$ is generated. In this setting, we see an improvement over random sampling, which shows the effect of explicitly measuring the uncertainty in the learned latent space. However, VAAL appears to outperform all these scenarios by implicitly learning the uncertainty through the adversarial game between the discriminator and the VAE.
5.2 VAAL’s Robustness
Effect of biased initial labels in VAAL. We investigate here how bias in the initial labeled pool affects the performance of VAAL and other baselines on the CIFAR100 dataset. Intuitively, bias can affect training by making the initially labeled samples unrepresentative of the underlying data distribution, so that they fail to cover most regions of the latent space. We model a possible form of bias in the labeled pool by not providing labels for chosen classes at random and we compare it to the case where samples are randomly selected from all classes. We exclude the data for and classes at random in the initial labeled pool to explore how it affects the performance of the model. Figure 5 shows for and , VAAL is superior to Coreset and random sampling in selecting informative samples from the classes that were underrepresented in the initial labeled set. We also observe that VAAL with missing classes performs nearly identically to Coreset and significantly better than random sampling, where each has half the number of missing classes.
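This form of bias can be simulated by withholding a few classes when drawing the initial labeled pool. The following is a minimal sketch under that reading of the setup; the function name `biased_initial_pool` and its parameters are hypothetical.

```python
import random

def biased_initial_pool(labels, n_init, n_missing_classes, seed=0):
    """Draw an initial labeled pool that contains no sample from a randomly
    chosen set of `n_missing_classes` classes.

    Returns the selected sample indices and the set of excluded classes.
    """
    rng = random.Random(seed)
    classes = sorted(set(labels))
    excluded = set(rng.sample(classes, n_missing_classes))
    eligible = [i for i, y in enumerate(labels) if y not in excluded]
    return rng.sample(eligible, n_init), excluded
```

The excluded classes remain in the unlabeled pool, so a good acquisition function should start requesting labels for them in later rounds.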
Effect of budget size on performance. Figure 5 illustrates the effect of the budget size on our model compared to the most competitive baselines on CIFAR100. We repeated our experiments in Section 4.1 for a lower budget size of . We observed that VAAL outperforms Coreset and Ensemble w. VarR, as well as random sampling, on both budget sizes of and . Coreset is the second-best method, followed by Ensemble, in Fig. 5. We note that for all methods, including VAAL, has a slightly better performance compared to when which is expected to happen because a larger sampled batch results in adding redundant samples instead of more informative ones.
Noisy vs. ideal oracle in VAAL. In this analysis we investigate the performance of VAAL in the presence of noisy data caused by an inaccurate oracle. We assume the erroneous labels are due to the ambiguity between some classes and are not adversarial attacks. We model the noise as targeted noise on specific classes that a human labeler could plausibly confuse. We used CIFAR100 for this analysis because of its hierarchical structure in which classes in CIFAR100 are grouped into superclasses. Each image comes with a fine label (the class to which it belongs) and a coarse label (the superclass to which it belongs). We randomly change the ground truth labels for , and of the training set to have an incorrect label within the same superclass. Figure 5 shows how a noisy oracle affects the performance of VAAL, Coreset, and random sampling. Because neither Coreset nor VAAL depends on the task learner, we see that the relative performance is comparable to the ideal oracle presented in Section 4.1. Intuitively, as the percentage of noisy labels increases, all of the active learning strategies converge to random sampling.
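The targeted noise model described above can be sketched as follows: a chosen fraction of samples has its fine label replaced by a different fine label from the same superclass. This is an illustrative sketch; the function name and the `superclass_of` mapping (fine label to coarse label) are assumptions of this example.

```python
import random

def noisy_oracle_labels(fine_labels, superclass_of, noise_frac, seed=0):
    """Return a copy of `fine_labels` in which a fraction `noise_frac` of
    entries is swapped for a different fine label of the same superclass."""
    rng = random.Random(seed)
    siblings = {}
    for fine, coarse in superclass_of.items():
        siblings.setdefault(coarse, []).append(fine)
    n_noisy = round(noise_frac * len(fine_labels))
    noisy = list(fine_labels)
    for i in rng.sample(range(len(fine_labels)), n_noisy):
        choices = [c for c in siblings[superclass_of[noisy[i]]] if c != noisy[i]]
        if choices:  # a superclass with a single class cannot be corrupted
            noisy[i] = rng.choice(choices)
    return noisy
```

Because every corrupted label stays inside its superclass, the coarse labels are untouched, which mimics a labeler confusing visually similar classes rather than answering at random.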
Choice of the network architecture in the task module. To verify that VAAL is insensitive to the VGG16 architecture used in our classification experiments, we also used ResNet18 [21] in VAAL and the most competitive baseline (Coreset). Figure 6 in the appendix shows that the choice of architecture does not affect the performance gap between VAAL and Coreset.
5.3 Sampling time analysis
The sampling strategy of an active learner has to select samples in a time-efficient manner. In other words, it should be as close as possible to random sampling, considering that random sampling is still an effective baseline. Table 1 compares VAAL and all our baselines on CIFAR10 using a single NVIDIA TITAN Xp, reporting the time needed to sample a fixed budget of images from the unlabeled pool for each method. MC-Dropout performs multiple forward passes to measure the uncertainty from dropout masks, which explains why it appears to be very slow in sample selection. Coreset and Ensembles w. VarR are the most competitive baselines to VAAL in terms of their achieved mean accuracy. However, in sampling time, VAAL takes seconds while Coreset requires sec and Ensembles w. VarR needs sec. DBAL [14] is on par with VAAL in sampling time; however, DBAL is outperformed in accuracy by all other methods, including random sampling, which can sample in only a few milliseconds. The significant difference between Coreset and VAAL is due to the fact that Coreset needs to solve an optimization problem for sample selection, whereas VAAL only needs to perform inference on the discriminator and rank its output probabilities. The Ensembles w. VarR method uses models to measure the uncertainty, resulting in better computational efficiency, but it is still not as fast as VAAL.
6 Conclusion
In this paper we proposed a new batch-mode active learning algorithm, VAAL, that learns a latent representation on both labeled and unlabeled data in an adversarial game between a VAE and a discriminator, and implicitly learns the uncertainty for the samples deemed to be from the unlabeled pool. We demonstrate state-of-the-art results, both in terms of accuracy and sampling time, on small and large-scale image classification (CIFAR10, CIFAR100, Caltech256, ImageNet) and segmentation datasets (Cityscapes, BDD100K). We further showed that VAAL is robust to noisy labels and biased initial labeled data, and that it performs consistently well given different oracle budgets.
References

[1]
W. H. Beluch, T. Genewein, A. Nürnberger, and J. M. Köhler.
The power of ensembles for active learning in image classification.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pages 9368–9377, 2018.  [2] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.

[3]
K. Brinker.
Incorporating diversity in active learning with support vector machines.
InProceedings of the 20th international conference on machine learning (ICML03)
, pages 59–66, 2003. 
[4]
D. A. Cohn, Z. Ghahramani, and M. I. Jordan.
Active learning with statistical models.
Journal of artificial intelligence research
, 4:129–145, 1996. 
[5]
M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson,
U. Franke, S. Roth, and B. Schiele.
The cityscapes dataset for semantic urban scene understanding.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.  [6] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei. ImageNet: A LargeScale Hierarchical Image Database. In CVPR09, 2009.
 [7] Y. Deng, K. Chen, Y. Shen, and H. Jin. Adversarial active learning for sequences labeling and generation. In IJCAI, pages 4012–4018, 2018.
 [8] S. Dutt Jain and K. Grauman. Active image segmentation propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2864–2873, 2016.
 [9] B. Efron and R. J. Tibshirani. An introduction to the bootstrap. CRC press, 1994.

 [10] D. François. High-dimensional data analysis. In From Optimal Metric to Feature Selection, pages 54–55. VDM Verlag Saarbrucken, Germany, 2008.
 [11] L. C. Freeman. Elementary applied statistics: for students in behavioral science. John Wiley & Sons, 1965.
 [12] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine learning, 28(2-3):133–168, 1997.
 [13] Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.
 [14] Y. Gal, R. Islam, and Z. Ghahramani. Deep bayesian active learning with image data. arXiv preprint arXiv:1703.02910, 2017.
 [15] R. Gilad-Bachrach, A. Navot, and N. Tishby. Query by committee made real. In Advances in neural information processing systems, pages 443–450, 2006.
 [16] C. R. Givens, R. M. Shortt, et al. A class of wasserstein metrics for probability distributions. The Michigan Mathematical Journal, 31(2):231–240, 1984.
 [17] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
 [18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [19] M. Gorriz, A. Carlier, E. Faure, and X. Giro-i-Nieto. Cost-effective active learning for melanoma segmentation. arXiv preprint arXiv:1711.09168, 2017.
 [20] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. 2007.
 [21] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [22] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. Cycada: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017.
 [23] A. Kapoor, K. Grauman, R. Urtasun, and T. Darrell. Active learning with gaussian processes for object categorization. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007.
 [24] H. Kim and A. Mnih. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.
 [25] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
 [26] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 [27] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [28] W. Kuo, C. Häne, E. Yuh, P. Mukherjee, and J. Malik. Cost-sensitive active learning for intracranial hemorrhage detection. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 715–723. Springer, 2018.
 [29] X. Li and Y. Guo. Adaptive active learning for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 859–866, 2013.
 [30] D. J. MacKay. Information-based objective functions for active data selection. Neural computation, 4(4):590–604, 1992.
 [31] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
 [32] D. Mahapatra, B. Bozorgtabar, J.-P. Thiran, and M. Reyes. Efficient active learning for image classification and segmentation using a sample selection and conditional generative adversarial network. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 580–588. Springer, 2018.
 [33] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
 [34] C. Mayer and R. Timofte. Adversarial sampling for active learning. arXiv preprint arXiv:1808.06671, 2018.
 [35] A. K. McCallum and K. Nigam. Employing em and pool-based active learning for text classification. In Proc. International Conference on Machine Learning (ICML), pages 359–367. Citeseer, 1998.
 [36] P. Melville and R. J. Mooney. Diverse ensembles for active learning. In Proceedings of the twenty-first international conference on Machine learning, page 74. ACM, 2004.
 [37] L. Mescheder, S. Nowozin, and A. Geiger. Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2391–2400. JMLR.org, 2017.
 [38] H. T. Nguyen and A. Smeulders. Active learning using pre-clustering. In Proceedings of the twenty-first international conference on Machine learning, page 79. ACM, 2004.
 [39] N. Roy and A. McCallum. Toward optimal active learning through monte carlo estimation of error reduction. ICML, Williamstown, pages 441–448, 2001.

 [40] O. Sener and S. Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.
 [41] B. Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114, 2012.
 [42] B. Settles. Active learning literature survey. 2010. Computer Sciences Technical Report, 1648, 2014.
 [43] C. E. Shannon. A mathematical theory of communication. Bell system technical journal, 27(3):379–423, 1948.
 [44] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [45] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491, 2015.

 [46] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
 [47] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.
 [48] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of machine learning research, 2(Nov):45–66, 2001.
 [49] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.
 [50] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176, 2017.
 [51] K. Wang, D. Zhang, Y. Li, R. Zhang, and L. Lin. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 27(12):2591–2600, 2017.
 [52] Z. Wang and J. Ye. Querying discriminative and representative samples for batch mode active learning. ACM Transactions on Knowledge Discovery from Data (TKDD), 9(3):17, 2015.
 [53] L. Yang, Y. Zhang, J. Chen, S. Zhang, and D. Z. Chen. Suggestive annotation: A deep active learning framework for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 399–407. Springer, 2017.
 [54] F. Yu, V. Koltun, and T. A. Funkhouser. Dilated residual networks. In CVPR, volume 2, page 3, 2017.
 [55] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell. Bdd100k: A diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687, 2018.
 [56] J.-J. Zhu and J. Bento. Generative adversarial active learning. arXiv preprint arXiv:1702.07956, 2017.
Supplementary Material
A. Datasets
Table 2 summarizes the datasets used in our work, along with their sizes, numbers of classes, and budget sizes.
Dataset  Classes  Train + Val  Test  Initially Labeled  Budget  Image Size 
CIFAR10 [27]  
CIFAR100 [27]  
Caltech256 [20]  
ImageNet [6]  
BDD100K [55]  
Cityscapes [5] 
B. Hyperparameter Selection
Table 3 shows the hyperparameters found for our models through a grid search.
Experiment  batch size  epochs  

CIFAR10  
CIFAR100  
Caltech256  
ImageNet  
BDD100K  
Cityscapes 