Variational Adversarial Active Learning

by   Samarth Sinha, et al.
berkeley college

Active learning aims to develop label-efficient algorithms by sampling the most representative queries to be labeled by an oracle. We describe a pool-based semi-supervised active learning algorithm that implicitly learns this sampling mechanism in an adversarial manner. Our method learns a latent space using a variational autoencoder (VAE) and an adversarial network trained to discriminate between unlabeled and labeled data. The mini-max game between the VAE and the adversarial network is played such that while the VAE tries to trick the adversarial network into predicting that all data points are from the labeled pool, the adversarial network learns how to discriminate between dissimilarities in the latent space. We extensively evaluate our method on various image classification and semantic segmentation benchmark datasets and establish a new state of the art on CIFAR10/100, Caltech-256, ImageNet, Cityscapes, and BDD100K. Our results demonstrate that our adversarial approach learns an effective low dimensional latent space in large-scale settings and provides for a computationally efficient sampling method.


1 Introduction

Figure 1: Our model learns the distribution of labeled data in a latent space using a VAE optimized using both reconstruction and adversarial losses. A binary classifier predicts unlabeled examples and sends them to an oracle for annotations. The VAE is trained to fool the adversarial network into believing that all the examples are from the labeled data while the adversarial classifier is trained to differentiate labeled from unlabeled samples.

The recent success of learning-based computer vision methods relies heavily on abundant annotated training examples, which may be prohibitively costly to label or impossible to obtain at large scale [46, 6]. In order to mitigate this drawback, active learning [4] algorithms aim to incrementally select samples for annotation that result in high classification performance with low labeling cost. Active learning has been shown to require relatively fewer training instances when applied to computer vision tasks such as image classification [40, 29, 14, 1] and semantic segmentation [53, 28, 19].

This paper introduces a pool-based active learning strategy which learns a low dimensional latent space from labeled and unlabeled data using Variational Autoencoders (VAEs). VAEs have been well-studied and valued for both their generative properties as well as their ability to learn rich latent spaces. Our method, Variational Adversarial Active Learning (VAAL), selects instances for labeling from the unlabeled pool that are sufficiently different in the latent space learned by the VAE in order to maximize the performance of the representation learned on the newly labeled data. Sample selection in our method is performed by an adversarial network which classifies which pool the instances belong to (labeled or unlabeled).

Our VAE learns a latent representation in which the sets of labeled and unlabeled data are mapped into a common embedding. We use an adversarial network in this space to correctly classify one from another. The VAE and the discriminator are framed as a two-player mini-max game, similar to GANs [18], such that the VAE is trained to learn a feature space that tricks the adversarial network into predicting that all datapoints, from both the labeled and unlabeled sets, are from the labeled pool, while the discriminator network learns how to discriminate between them. The strategy follows the intuition that once the active learner is trained, the probability associated with the discriminator’s predictions effectively estimates how representative each sample is of the pool that it has been deemed to be from. Therefore, instead of explicitly measuring uncertainty, we aim to choose points that would yield high uncertainty and thus are samples that are not well represented in the labeled set. We additionally consider oracles with different levels of labeling noise and demonstrate the robustness of our method to such noisy labels. In our experiments we demonstrate superior performance on a variety of large scale image classification and segmentation datasets, and outperform current state-of-the-art methods both in performance and computational cost.

2 Related Work

Active learning: Current approaches can be categorized as query-acquiring (pool-based) or query-synthesizing methods. Query-synthesizing approaches use generative models to generate informative samples [32, 34, 56] whereas pool-based algorithms use different sampling strategies to determine how to select the most informative samples. Since our work lies in the latter line of research, we will mainly focus on previous work in this direction.

Pool-based methods can be grouped into three major categories as follows: uncertainty-based methods [19, 51, 1], representation-based models [40], and their combination [53, 38]. Pool-based methods have been theoretically proven to be effective and achieve better performance than the random sampling of points [42, 12, 15]. Sampling strategies in pool-based algorithms have been built upon several methods, which are surveyed in [41], such as information theoretic methods [30], ensemble methods [35, 12], and uncertainty heuristics such as distance to the decision boundary [48] and conditional entropy [29]. Uncertainty-based pool-based models are proposed in both Bayesian [14] and non-Bayesian frameworks. In the realm of Bayesian frameworks, probabilistic models such as Gaussian processes are used to estimate uncertainty [23, 39]. Gal & Ghahramani [14, 13] also showed how dropout can be used to estimate prediction uncertainty in neural networks and applied it to active learning on small image datasets using shallow [13] and deep [14] neural networks. In non-Bayesian classical active learning approaches, uncertainty heuristics such as distance from the decision boundary, highest entropy, and expected risk minimization have been widely investigated [3, 48, 52]. However, it was shown in [40] that such classical techniques do not scale well to deep neural networks and large image datasets. Instead, they proposed to use Core-sets, where they minimize the Euclidean distance between the sampled points and the points that were not sampled in the feature space of the trained model [40]. Using an ensemble of models to represent uncertainty was proposed by [28, 53], but [36] showed that using ensembles does not always yield high diversity in predictions, which results in sampling redundant instances.

Representation-based methods rely on selecting few examples by increasing diversity in a given batch [40, 8]. The Core-set technique was shown to be an effective representation learning method for large scale image classification tasks [40] and was theoretically proven to work best when the number of classes is small. However, as the number of classes grows, its performance deteriorates. Moreover, for high-dimensional data, distance-based representation methods like Core-set appear to be ineffective because in high dimensions p-norms suffer from the curse of dimensionality, which is referred to as the distance concentration phenomenon in the computational learning literature [10]. We overcome this limitation by utilizing VAEs, which have been shown to be effective in unsupervised and semi-supervised representation learning of high dimensional data [26, 45].

Methods that aim to combine uncertainty and representativeness use a two-step process to select the points with high uncertainty among the most representative points in a batch. A hybrid framework combining uncertainty using conditional entropy and representation learning using information density was proposed in [29] for classification tasks. A weakly supervised learning strategy was introduced in [51] that trains the model with pseudo labels obtained for instances with high confidence in predictions. However, for a fixed performance goal, they often need to sample more instances per batch compared to other methods. Furthermore, [28] showed that the representation step may not be necessary and suggested an ensemble method that outperformed competitive approaches such as [53], which uses uncertainty together with Core-sets. While we show that our model outperforms both [28] and [53], we argue that VAAL achieves this by learning the representation and uncertainty together such that they act in favor of each other, resulting in better active learning performance.

Variational autoencoders: Autoencoders have long been used to effectively learn a feature space and representation [2]. A Variational AutoEncoder [26] is a latent variable model that follows the encoder-decoder architecture of classical autoencoders, places a prior distribution on the feature space, and uses an Evidence Lower Bound to optimize the learned posterior. Adversarial autoencoders are a family of autoencoders which minimize the adversarial loss in the latent space between a sample from the prior and the posterior distribution [33]. Prior work has investigated uncertainty modeling using a VAE to drive learning of sequence models in language applications [7].

Active learning for semantic segmentation: Segmentation labeling is one of the most expensive annotations to collect. Active learning in the literature has been broadly investigated for labeling medical images as it is one of the most prevailing applications of AL where only human experts with sophisticated knowledge are capable of providing labels and therefore, improving this process would reduce a lot of time and effort for them. Suggestive Annotation (SA) [53] uses uncertainty obtained from an ensemble of models trained on the labeled data and Core-sets for choosing representative data points in a two-step strategy. [28] also proposed an active learning algorithm for image segmentation using an ensemble of models, but they empirically showed their proposed information theoretic heuristic for uncertainty is equal in performance to SA, without using Core-sets. [19] extended the work by [14] and proposed using Monte-Carlo dropout masks on the unlabeled images using a trained model and calculating the uncertainty on the predicted labels of the unlabeled images. Some active learning strategies developed for image classification can also be used for semantic segmentation. Core-sets and max-entropy strategies can both be used for active learning in semantic segmentation [40, 3].

Adversarial learning: Adversarial learning has been used for different problems such as generative models [18], representation learning [33, 37], domain adaptation [50, 22], and deep learning robustness and security [31, 49]. The use of an adversarial network enables the model to be trained in a fully differentiable manner by solving the mini-max optimization problem [18]. The adversarial network used in the feature space has been extensively researched in the representation learning and domain adaptation literature to efficiently learn a useful feature space for the task [33, 24, 47, 50, 22].

3 Adversarial Learning of Variational Auto-encoders for Active Learning

Let $(x_L, y_L)$ be a sample pair belonging to the pool of labeled data $(X_L, Y_L)$. $X_U$ denotes a much larger pool of samples $(x_U)$ which are not yet labeled. The goal of the active learner is to train the most label-efficient model by iteratively querying a fixed sampling budget of $b$ of the most informative samples from the unlabeled pool ($x_U \sim X_U$), using an acquisition function, to be annotated by the oracle such that the expected loss is minimized.

3.1 Transductive representation learning.

We use a $\beta$-variational autoencoder for representation learning in which the encoder learns a low dimensional space for the underlying distribution using a Gaussian prior and the decoder reconstructs the input data. In order to capture the features that are missing in the representation learned on the labeled pool, we can benefit from using the unlabeled data and perform transductive learning. The objective function of the $\beta$-VAE is minimizing the negative variational lower bound on the marginal likelihood of a given sample, computed over both the labeled and the unlabeled pool and formulated as

$$\mathcal{L}_{VAE}^{trd} = -\mathbb{E}\big[\log p_\theta(x_L \mid z_L)\big] + \beta\, D_{KL}\big(q_\phi(z_L \mid x_L) \,\|\, p(z)\big) - \mathbb{E}\big[\log p_\theta(x_U \mid z_U)\big] + \beta\, D_{KL}\big(q_\phi(z_U \mid x_U) \,\|\, p(z)\big) \tag{1}$$

where $q_\phi$ and $p_\theta$ are the encoder and decoder parameterized by $\phi$ and $\theta$, respectively. $p(z)$ is the prior chosen as a unit Gaussian, and $\beta$ is the Lagrangian parameter for the optimization problem. The reparameterization trick is used for proper calculation of the gradients [26].
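As a rough illustration of the transductive objective described above, the following sketch evaluates a β-VAE loss for one labeled and one unlabeled sample, written as a quantity to minimize. It assumes a diagonal Gaussian posterior and uses squared error as a stand-in for the negative log-likelihood (which holds up to scale and an additive constant for a fixed-variance Gaussian decoder). The function names and the tuple layout are hypothetical and are not taken from the paper.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), elementwise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_unit_gaussian(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def squared_error(x, x_hat):
    """Reconstruction term (assumed stand-in for -log p(x|z))."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def transductive_vae_loss(labeled, unlabeled, beta):
    """Sum over both pools of reconstruction + beta * KL.

    Each argument is a tuple (x, x_hat, mu, log_var) for one sample.
    """
    total = 0.0
    for x, x_hat, mu, log_var in (labeled, unlabeled):
        total += squared_error(x, x_hat) + beta * kl_to_unit_gaussian(mu, log_var)
    return total
```

The same per-sample term is applied to both pools, which is what makes the representation transductive: the unlabeled data shapes the latent space even though it carries no labels.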

3.2 Adversarial representation learning

The representation learned by the VAE is a mixture of the latent features associated with both labeled and unlabeled data. An ideal active learning agent is assumed to have a perfect sampling strategy that is capable of sending the most informative unlabeled data to the oracle. Most sampling strategies rely on the model’s uncertainty, i.e., the more uncertain the model is about a prediction, the more informative that specific unlabeled sample must be. However, this introduces vulnerability to outliers. In contrast, we train an adversarial network for our sampling strategy to learn how to distinguish between the encoded features in the latent space. This adversarial network is analogous to the discriminator in GANs, whose role is to discriminate between real images and fake images created by the generator. In VAAL, the adversarial network is trained to map the latent representation of $z_L \cup z_U$ to a binary label which is 1 if the sample belongs to $X_L$ and is 0 otherwise. The key to our approach is that the VAE and the adversarial network are learned together in an adversarial fashion. While the VAE maps the labeled and unlabeled data into the same latent space with similar probability distributions $q_\phi(z_L \mid x_L)$ and $q_\phi(z_U \mid x_U)$, it fools the discriminator into classifying all the inputs as labeled. On the other hand, the discriminator attempts to effectively estimate the probability that the data comes from the unlabeled data. We can formulate the objective function for the adversarial role of the VAE as follows


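$$\mathcal{L}_{VAE}^{adv} = -\mathbb{E}\big[\log D(q_\phi(z_L \mid x_L))\big] - \mathbb{E}\big[\log D(q_\phi(z_U \mid x_U))\big] \tag{2}$$

which is simply a binary cross-entropy cost function with the labeled target assigned to every sample. The objective function to train the discriminator is given below

$$\mathcal{L}_{D} = -\mathbb{E}\big[\log D(q_\phi(z_L \mid x_L))\big] - \mathbb{E}\big[\log\big(1 - D(q_\phi(z_U \mid x_U))\big)\big] \tag{3}$$

By combining Eq. (1) and Eq. (2) we obtain the full objective function for the VAE in VAAL as below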


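$$\mathcal{L}_{VAE} = \lambda_1 \mathcal{L}_{VAE}^{trd} + \lambda_2 \mathcal{L}_{VAE}^{adv} \tag{4}$$

where $\lambda_1$ and $\lambda_2$ are hyperparameters that determine the effect of each component to learn an effective variational adversarial representation.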
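The two adversarial objectives and their weighted combination can be sketched as below. The sketch assumes the discriminator outputs, for each latent code, the probability that it came from the labeled pool (target 1 for labeled, 0 for unlabeled); the function names and the small `EPS` guard are illustrative choices, not part of the paper.

```python
import math

EPS = 1e-12  # guards against log(0); an implementation detail of this sketch


def mean(values):
    return sum(values) / len(values)


def vae_adversarial_loss(d_labeled, d_unlabeled):
    """Binary cross-entropy with target 1 ("labeled") for every sample.

    The VAE is rewarded when the discriminator calls both pools labeled.
    """
    return (-mean([math.log(p + EPS) for p in d_labeled])
            - mean([math.log(p + EPS) for p in d_unlabeled]))


def discriminator_loss(d_labeled, d_unlabeled):
    """Binary cross-entropy with target 1 for labeled, 0 for unlabeled."""
    return (-mean([math.log(p + EPS) for p in d_labeled])
            - mean([math.log(1.0 - p + EPS) for p in d_unlabeled]))


def full_vae_loss(transductive_loss, adversarial_loss, lambda1, lambda2):
    """Weighted sum of the transductive and adversarial VAE terms."""
    return lambda1 * transductive_loss + lambda2 * adversarial_loss
```

For discriminator outputs such as `[0.9, 0.8]` on labeled codes and `[0.1, 0.2]` on unlabeled codes, the discriminator loss is small while the VAE's adversarial loss is large, which is the tension the mini-max game exploits.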

The task module, denoted as $T$ in Fig. (1), learns the task for which the active learner is being trained. We report results below on image classification and semantic segmentation tasks, using VGG16 [44] and the dilated residual network (DRN) architecture [54], respectively, with an unweighted cross-entropy cost function. Our full algorithm is shown in Alg. 1.

3.3 Sampling strategies and noisy-oracles

The labels provided by the oracles might vary in how accurate they are depending on the quality of available human resources. For instance, medical images annotated by expert humans are assumed to be more accurate than crowd-sourced data collected by non-expert humans and/or available information on the cloud. We consider two types of oracles: an ideal oracle which always provides correct labels for the active learner, and a noisy oracle which non-adversarially provides erroneous labels for some specific classes. This might occur due to similarities across some classes causing ambiguity for the labeler. In order to model this oracle realistically, we have applied targeted noise to visually similar classes. The sampling strategy in VAAL is shown in Alg. (2). We use the probability associated with the discriminator’s predictions as a score to collect the $b$ samples in every batch with the lowest confidence to be sent to the oracle.

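Algorithm 1 Variational Adversarial Active Learning
Input: Labeled pool $(X_L, Y_L)$, Unlabeled pool $(X_U)$, Initialized models for $\theta_T$, $\theta_{VAE}$, and $\theta_D$
Hyperparameters: epochs, $\lambda_1$, $\lambda_2$, $\alpha_1$, $\alpha_2$, $\alpha_3$
for $e = 1$ to epochs do
    sample $(x_L, y_L) \sim (X_L, Y_L)$
    sample $x_U \sim X_U$
    Compute $\mathcal{L}_{VAE}^{trd}$ by using Eq. 1
    Compute $\mathcal{L}_{VAE}^{adv}$ by using Eq. 2
    $\mathcal{L}_{VAE} \leftarrow \lambda_1 \mathcal{L}_{VAE}^{trd} + \lambda_2 \mathcal{L}_{VAE}^{adv}$
    Update VAE by descending stochastic gradients:
    $\theta'_{VAE} \leftarrow \theta_{VAE} - \alpha_1 \nabla \mathcal{L}_{VAE}$
    Compute $\mathcal{L}_D$ by using Eq. 3
    Update $D$ by descending its stochastic gradient:
    $\theta'_D \leftarrow \theta_D - \alpha_2 \nabla \mathcal{L}_D$
    Train and update $T$:
    $\theta'_T \leftarrow \theta_T - \alpha_3 \nabla \mathcal{L}_T$
end for
return Trained $\theta_T$, $\theta_{VAE}$, $\theta_D$

Algorithm 2 Sampling Strategy in VAAL
Select samples $(X_s)$ with $\min_b \{\theta_D(z_U)\}$
$Y_o \leftarrow \mathrm{ORACLE}(X_s)$
$(X_L, Y_L) \leftarrow (X_L, Y_L) \cup (X_s, Y_o)$
$X_U \leftarrow X_U - X_s$
return $X_L$, $X_U$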
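The sampling step can be sketched as a simple ranking of discriminator outputs. This is a minimal illustration assuming `disc_scores[i]` is the discriminator's probability that unlabeled sample `i` is labeled; the helper names and the `oracle` callback are hypothetical.

```python
def select_for_labeling(disc_scores, budget):
    """Return indices of the `budget` samples with the lowest discriminator score."""
    ranked = sorted(range(len(disc_scores)), key=lambda i: disc_scores[i])
    return ranked[:budget]


def update_pools(labeled, unlabeled, disc_scores, budget, oracle):
    """Move the selected samples, with oracle labels, into the labeled pool."""
    chosen = set(select_for_labeling(disc_scores, budget))
    new_labeled = labeled + [(x, oracle(x))
                             for i, x in enumerate(unlabeled) if i in chosen]
    new_unlabeled = [x for i, x in enumerate(unlabeled) if i not in chosen]
    return new_labeled, new_unlabeled
```

Because selection only needs one forward pass through the discriminator followed by a sort, its cost grows with the size of the unlabeled pool rather than with any pairwise distance computation.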

4 Experiments

We begin our experiments with an initial labeled pool with of the training set labeled. The budget size per batch is equal to of the training dataset. The pool of unlabeled data contains the rest of the training set from which samples are selected to be annotated by the oracle. Once labeled, they will be added to the initial training set and training is repeated on the new training set. We assume the oracle is ideal unless stated otherwise.
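The experimental protocol above amounts to a loop that grows the labeled pool by a fixed budget each round. The sketch below shows only that bookkeeping; the fractions passed in, the `acquire` callback, and the function name are placeholders for illustration, and model retraining is indicated by a comment rather than implemented.

```python
import random


def active_learning_rounds(pool_size, initial_fraction, budget_fraction,
                           rounds, acquire, seed=0):
    """Track labeled/unlabeled index pools across acquisition rounds."""
    rng = random.Random(seed)
    indices = list(range(pool_size))
    rng.shuffle(indices)
    n_init = int(round(initial_fraction * pool_size))
    budget = int(round(budget_fraction * pool_size))
    labeled, unlabeled = indices[:n_init], indices[n_init:]
    history = [len(labeled)]
    for _ in range(rounds):
        # Retraining of the task model, VAE, and discriminator would go here.
        picked = acquire(labeled, unlabeled, budget)
        picked_set = set(picked)
        labeled = labeled + picked
        unlabeled = [i for i in unlabeled if i not in picked_set]
        history.append(len(labeled))
    return labeled, unlabeled, history
```

Passing a random `acquire` function gives the random-sampling baseline, while passing a discriminator-based ranking gives a VAAL-style learner under the same schedule.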

Datasets. We have evaluated VAAL on two common vision tasks. For image classification we have used CIFAR10 [27] and CIFAR100 [27] both with K images of size , and Caltech-256 [20] which has images of size including object categories. For a better understanding of the scalability of VAAL we have also experimented with ImageNet [6] with more than M images of classes. For semantic segmentation, we evaluate our method on BDD100K [55] and Cityscapes [5] datasets both of which have classes. BDD100K is a diverse driving video dataset with K images with full-frame instance segmentation annotations collected from distinct locations in the United States. Cityscapes is another large scale driving video dataset containing frames with instance segmentation annotations recorded in street scenes from different cities in Europe. The statistics of these datasets are summarized in Table 2 in the appendix.

Figure 2: VAAL performance on classification tasks using CIFAR10, CIFAR100, Caltech-256, and ImageNet compared to Core-set [40], Ensembles w. VarR [1], MC-Dropout [13], DBAL [14], and Random Sampling.

Performance measurement. We evaluate the performance of VAAL in image classification and segmentation by measuring the accuracy and mean IoU, respectively achieved by trained with , , , , , , of the total training set as it becomes available with labels provided by the oracle. Results for all our experiments, except for ImageNet, are averaged over runs. ImageNet results however, are obtained by averaging over repetitions using , , , , of the training data.

4.1 VAAL on image classification benchmarks

Baselines. We compare our results using VAAL for image classification against various approaches including Core-set [40], Monte-Carlo Dropout [13], and Ensembles using Variation Ratios (Ensembles w. VarR) [1, 11]. We also show the performance of deep Bayesian AL (DBAL) by following [14] and perform sampling using their proposed max-entropy scheme to measure uncertainty [43]. We also show the results using random sampling in which samples are uniformly sampled at random from the unlabeled pool. This method still serves as a competitive baseline in active learning. Moreover, we use the mean accuracy achieved on the entire dataset as an upper bound which does not adhere to the active learning scenario.

Implementation details. We used random horizontal flips for data augmentation. The architecture used in the task module for image classification is VGG16 [44] with Xavier initialization [17], and the $\beta$-VAE has the same architecture as the Wasserstein autoencoder [47] with latent dimensionality given in Table 3 in the appendix. The discriminator is a -layer multilayer perceptron (MLP) and Adam [25] is used as the optimizer for all these three modules with an equal learning rate of and batch size of . However for ImageNet, learning rate varies across the modules such that the task learner has a learning rate of while the VAE and the discriminator have a learning rate of . Training continues for epochs in ImageNet and for epochs in all other datasets. The budget size for classification experiments is chosen to be of the full training set, which is equivalent to , , , and for CIFAR10, CIFAR100, Caltech-256, and ImageNet, respectively in VAAL and all other baselines. The hyperparameters used in our model were found through a grid search and are tabulated in Table 3 in the appendix.

VAAL performance on CIFAR10/100 and Caltech-256. Figure 2 shows performance of VAAL compared to prior works. On CIFAR10, our method achieves mean accuracy of by using of the data whereas using the entire dataset yields accuracy of , denoted as Top-1 accuracy in Fig. 2. Comparing the mean accuracy values for data ratios above shows that VAAL clearly outperforms random sampling, DBAL, and MC-Dropout while beating Ensembles by a smaller margin and becoming on par with Core-set. On CIFAR100, VAAL remains competitive with Ensembles w. VarR and Core-set, and outperforms all other baselines. The maximum achievable mean accuracy is on CIFAR100 using of the data while VAAL achieves by only using of it. Moreover, for data ratios above of labeled data, VAAL consistently requires fewer labels than Core-set or Ensembles w. VarR in order to achieve the same accuracy, which is equal to labels. On Caltech-256, which has real images of object categories, VAAL consistently outperforms all baselines by an average margin of from random sampling and from the most competitive baseline, Core-set. The DBAL method performs nearly identically to random sampling while MC-Dropout yields lower accuracies than random sampling. By looking at the number of labels required to reach a fixed performance, for instance , VAAL needs of data ( images) to be labeled whereas this number is approximately and for Core-set and Ensemble w. VarR, respectively. Random sampling, DBAL, and MC-Dropout all need more than images.

As can be seen in Fig. 2, VAAL outperforms Core-set with higher margins as the number of classes increases from to to . The theoretical analysis shown in [40] confirms that Core-set is more effective when fewer classes are present due to the negative impact of high dimensionality on p-norms in the Core-set method.

VAAL performance on ImageNet. ImageNet [6] is a challenging large scale dataset which we use to show scalability of our approach. Fig. 2 shows that we improve the state-of-the-art by increase in the gap between the accuracy achieved by the previous state-of-the-art methods (Core-set and Ensemble) and random sampling. As can be seen in Fig. 2, this improvement can be also viewed in the number of samples required to achieve a specific accuracy. For instance, accuracy of is achieved by VAAL using K number of images whereas Core-set and Ensembles w. VarR should be provided with almost K more labeled images to obtain the same performance. Random sampling remains as a competitive baseline as both DBAL and MC-Dropout perform below that.

4.2 VAAL on image segmentation benchmarks

Baselines. We evaluate VAAL against state-of-the-art AL approaches for image segmentation including Core-set [40], MC-Dropout [19], Query-By-Committee (QBC) [28], and suggestive annotation (SA)[53]. SA is a hybrid ensemble method that uses bootstrapping for uncertainty estimation [9] and core-set for measuring representativeness.

Implementation details. Similar to the image classification setup, we used random horizontal flips for data augmentation. The $\beta$-VAE is a Wasserstein autoencoder [47], and the discriminator is also a -layer MLP. The architecture used in the task module for image segmentation is DRN [54] and Adam with a learning rate of is chosen as the optimizer for all three modules. The batch size is set as and training stops after epochs in both datasets. The budget size used in VAAL and all baselines is set as and for BDD100K and Cityscapes, respectively. All hyperparameters are shown in Table 3 in the appendix.

Figure 3: VAAL performance on segmentation tasks using Cityscapes and BDD100K compared to QBC [28], Core-set [40], MC-Dropout [13], and Random Sampling

VAAL performance on Cityscapes and BDD100K. Figure 3 demonstrates our results on the driving datasets compared with four other baselines as well as the reference random sampling. As we also observed in Section 4.1, Core-set performs better with fewer classes in image classification tasks [40]. However, the large gap between VAAL and Core-set, despite only having classes, suggests that Core-set and Ensemble-based methods (QBC here) suffer from high dimensionality in the inputs ( as opposed to thumbnail images used in CIFAR10/100). QBC, Core-set, and SA (Core-set + QBC) perform nearly identically, while MC-Dropout remains less effective than random sampling. VAAL consistently demonstrates significantly better performance by achieving the highest mean IoU on both Cityscapes and BDD100K across different labeled data ratios. VAAL is able to achieve mIoU of and using only labeled data while the maximum mIoU we obtained using of these datasets is and on Cityscapes and BDD100K, respectively. In terms of required labels by each method, on Cityscapes VAAL needs annotations to reach of mIoU whereas QBC, Core-set, SA, random sampling, MC-Dropout demand nearly , , , , and labels, respectively. Similarly on BDD100K in order to reach of mIoU, other baselines need more annotations than VAAL, which requires only . Considering the difficulties in full frame instance segmentation, VAAL is able to effectively reduce the required time and effort for such dense annotations.

5 Analyzing VAAL in Detail

In this section, we take a deeper look into our model by first performing ablation and then evaluating the effect of possible biases and noise on its performance. Sensitivity of VAAL to budget size is also explored in 5.2.

5.1 Ablation study

Figure 4 presents our ablation study to inspect the contribution of the key modules in VAAL, including the VAE and the discriminator ($D$). We perform ablation on the segmentation task, which is more challenging than classification, and we use BDD100K as it is larger than Cityscapes. The variants of ablations we consider are: 1) eliminating the VAE, 2) frozen VAE with $D$, 3) eliminating $D$. In the first ablation, we explore the role of the VAE as the representation learner by having only a discriminator trained on the image space to discriminate between the labeled and unlabeled pools. As shown in Fig. 4, this setting causes the discriminator to only memorize the data and yields the lowest performance. It also reveals the key role of the VAE in not only learning a rich latent space, but also playing an effective mini-max game with the discriminator to avoid overfitting. In the second ablation scenario we add a VAE to the previous setting to encode-decode a lower dimensional space for training $D$. However, here we avoid training the VAE and hence merely explore its role as an autoencoder. This setting performs better than having only $D$ trained in a high dimensional space, but still performs similar to or worse than random sampling, suggesting that the discriminator failed at learning the representativeness of the samples in the unlabeled pool. In the last ablation, we explore the role of the discriminator by training only a VAE that uses the 2-Wasserstein distance from the cluster centroid of the labeled dataset as a heuristic to explicitly measure uncertainty. For a multivariate isotropic Gaussian distribution, the closed form solution for the 2-Wasserstein distance between two probability distributions [16] can be written as

$$W_{ij} = \Big[ \|\mu_i - \mu_j\|_2^2 + \big\|\Sigma_i^{1/2} - \Sigma_j^{1/2}\big\|_F^2 \Big]^{1/2} \tag{5}$$

where $\|\cdot\|_F$ represents the Frobenius norm, $\mu_j$, $\Sigma_j$ denote the $\mu$, $\Sigma$ predicted by the encoder, and $\mu_i$, $\Sigma_i$ are the mean and variance for the normal distribution over the labeled data from which the latent variable $z$ is generated. In this setting, we see an improvement over random sampling, which shows the effect of explicitly measuring the uncertainty in the learned latent space. However, VAAL appears to outperform all these scenarios by implicitly learning the uncertainty over the adversarial game between the discriminator and the VAE.
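For diagonal covariances the 2-Wasserstein distance reduces to a few elementwise operations, which the sketch below illustrates. It assumes variances are passed as per-dimension lists, and it assumes the heuristic selects the samples farthest from the labeled-pool Gaussian; that ranking direction and the function names are illustrative guesses rather than details stated in the text.

```python
import math


def wasserstein2_diag(mu_i, var_i, mu_j, var_j):
    """2-Wasserstein distance between two Gaussians with diagonal covariances."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_i, mu_j))
    # For diagonal matrices the Frobenius norm is taken over the diagonal entries.
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2
                   for a, b in zip(var_i, var_j))
    return math.sqrt(mean_term + cov_term)


def rank_by_distance(labeled_mu, labeled_var, encoded, budget):
    """Pick the `budget` samples farthest from the labeled-pool Gaussian.

    `encoded` is a list of (mu, var) pairs predicted by the encoder.
    """
    dists = [wasserstein2_diag(labeled_mu, labeled_var, mu, var)
             for mu, var in encoded]
    order = sorted(range(len(encoded)), key=lambda k: dists[k], reverse=True)
    return order[:budget]
```

With equal variances the distance collapses to the Euclidean distance between the means, so this heuristic behaves like a centroid-distance rule in the latent space.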

Figure 4: Ablation results on analyzing the effect of the VAE and the discriminator, denoted as $D$ here.

5.2 VAAL’s Robustness

Figure 5: Analyzing robustness of VAAL to noisy labels, budget size, and biased initial labeled pool using CIFAR100.

Effect of biased initial labels in VAAL. We investigate here how bias in the initial labeled pool affects the performance of VAAL and other baselines on the CIFAR100 dataset. Intuitively, bias can affect the training such that it causes the initially labeled samples to be not representative of the underlying data distribution by being inadequate to cover most of the regions in the latent space. We model a possible form of bias in the labeled pool by not providing labels for chosen classes at random and we compare it to the case where samples are randomly selected from all classes. We exclude the data for and classes at random in the initial labeled pool to explore how it affects the performance of the model. Figure 5 shows for and , VAAL is superior to Core-set and random sampling in selecting informative samples from the classes that were underrepresented in the initial labeled set. We also observe that VAAL with missing classes performs nearly identically to Core-set and significantly better than random sampling where each has half the number of missing classes.
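One way to construct such a biased initial pool is to hide a random subset of classes before drawing the initial labeled samples. The sketch below is a hypothetical helper under that assumption; `num_excluded` and `pool_size` are free parameters here and are not values from the paper.

```python
import random


def biased_initial_pool(labels, pool_size, num_excluded, seed=0):
    """Sample an initial labeled pool that contains no examples of some classes.

    Returns the chosen sample indices and the set of excluded classes.
    """
    rng = random.Random(seed)
    classes = sorted(set(labels))
    excluded = set(rng.sample(classes, num_excluded))
    candidates = [i for i, y in enumerate(labels) if y not in excluded]
    return rng.sample(candidates, pool_size), excluded
```

The excluded classes remain in the unlabeled pool, so a good acquisition strategy should start pulling them in during the first few rounds.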

Effect of budget size on performance. Figure 5 illustrates the effect of the budget size on our model compared to the most competitive baselines on CIFAR100. We repeated our experiments in Section 4.1 for a lower budget size of . We observed that VAAL outperforms Core-set and Ensemble w. VarR, as well as random sampling, on both budget sizes of and . Core-set is the second-best method, followed by Ensemble, in Fig. 5. We note that for all methods, including VAAL, has a slightly better performance compared to when which is expected to happen because a larger sampled batch results in adding redundant samples instead of more informative ones.

Noisy vs. ideal oracle in VAAL. In this analysis we investigate the performance of VAAL in the presence of noisy data caused by an inaccurate oracle. We assume the erroneous labels are due to the ambiguity between some classes and are not adversarial attacks. We model the noise as targeted noise on specific classes that a human labeler could plausibly confuse. We used CIFAR100 for this analysis because of its hierarchical structure in which classes in CIFAR100 are grouped into super-classes. Each image comes with a fine label (the class to which it belongs) and a coarse label (the super-class to which it belongs). We randomly change the ground truth labels for , and of the training set to have an incorrect label within the same super-class. Figure 5 shows how a noisy oracle affects the performance of VAAL, Core-set, and random sampling. Because neither Core-set nor VAAL depends on the task learner, we see that the relative performance is comparable to the ideal oracle presented in Section 4.1. Intuitively, as the percentage of noisy labels increases, all of the active learning strategies converge to random sampling.
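The targeted noise model can be sketched as relabeling a fraction of samples with a different fine class from the same super-class. The helper below is an illustration only: `superclass_of` is a hypothetical mapping from fine label to coarse label, `noise_rate` is a free parameter, and every super-class is assumed to contain at least two fine classes.

```python
import random


def noisy_oracle_labels(fine_labels, superclass_of, noise_rate, seed=0):
    """Replace a fraction of labels with a wrong class from the same super-class."""
    rng = random.Random(seed)
    siblings = {}
    for fine, coarse in superclass_of.items():
        siblings.setdefault(coarse, []).append(fine)
    n_noisy = int(round(noise_rate * len(fine_labels)))
    noisy_idx = set(rng.sample(range(len(fine_labels)), n_noisy))
    noisy = []
    for i, y in enumerate(fine_labels):
        if i in noisy_idx:
            # Choose any sibling class other than the true one.
            choices = [c for c in siblings[superclass_of[y]] if c != y]
            noisy.append(rng.choice(choices))
        else:
            noisy.append(y)
    return noisy
```

Since every corrupted label stays inside the correct super-class, the coarse labels are unchanged and only fine-grained distinctions are affected.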

Choice of the network architecture in . To ensure that VAAL's performance is not specific to the VGG16 architecture used in our classification experiments, we also used ResNet18 [21] in VAAL and in the most competitive baseline (Core-set). Figure 6 in the appendix shows that the choice of architecture does not affect the performance gap between VAAL and Core-set.

Method Time (sec)
MC-Dropout [13]
Core-set [40]
Ensembles w. VarR [1]
DBAL [14]
VAAL (ours)
Table 1: Time taken to sample, for one sampling iteration, from the unlabeled pool on the CIFAR10 dataset. For a fair comparison we use the same PyTorch data-loader across VAAL and baselines.

5.3 Sampling time analysis

The sampling strategy of an active learner has to select samples in a time-efficient manner. In other words, its cost should be as close as possible to that of random sampling, considering that random sampling is still an effective baseline. Table 1 compares VAAL with all our baselines on CIFAR10 using a single NVIDIA TITAN Xp, reporting the time needed to sample a fixed budget of images from the unlabeled pool for each method. MC-Dropout performs multiple forward passes to measure the uncertainty from dropout masks, which explains why it is very slow in sample selection. Core-set and Ensembles w. VarR are the most competitive baselines to VAAL in terms of their achieved mean accuracy. However, in sampling time, VAAL takes seconds while Core-set requires sec and Ensembles w. VarR needs sec. DBAL [14] is on par with VAAL in sampling time; however, DBAL is outperformed in accuracy by all other methods, including random sampling, which can sample in only a few milliseconds. The significant difference between Core-set and VAAL arises because Core-set needs to solve an optimization problem for sample selection, whereas VAAL only needs to perform inference on the discriminator and rank its output probabilities. The Ensembles w. VarR method uses models to measure the uncertainty, resulting in better computational efficiency, but it is still not as fast as VAAL.
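The ranking step that makes this kind of selection cheap can be sketched in a few lines. The sketch assumes the discriminator's forward pass has already produced one score per unlabeled sample, and that each score is the predicted probability of the sample belonging to the labeled pool, so the lowest scores are queried first. That scoring direction, the function name `select_for_labeling`, and the toy values are assumptions made here for illustration.

```python
import heapq


def select_for_labeling(labeled_probs, budget):
    """Return indices of the `budget` samples with the lowest discriminator score.

    labeled_probs -- list of floats; assumed to be the discriminator's predicted
                     probability that each unlabeled sample comes from the labeled pool
    budget        -- number of samples to send to the oracle
    """
    if budget > len(labeled_probs):
        raise ValueError("budget exceeds the size of the unlabeled pool")
    # Partial selection: no pairwise distances and no optimization problem to solve.
    return heapq.nsmallest(budget, range(len(labeled_probs)),
                           key=labeled_probs.__getitem__)
```

With scores `[0.9, 0.1, 0.5, 0.3, 0.8]` and a budget of 2, the function returns `[1, 3]`. Once the scores exist, selection costs roughly O(n log b) for n unlabeled samples and budget b, which is consistent with the contrast drawn above against methods that must solve an optimization problem per sampling round.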

6 Conclusion

In this paper we proposed a new batch-mode active learning algorithm, VAAL, that learns a latent representation on both labeled and unlabeled data in an adversarial game between a VAE and a discriminator, and implicitly learns the uncertainty for the samples deemed to be from the unlabeled pool. We demonstrate state-of-the-art results, in terms of both accuracy and sampling time, on small and large-scale image classification (CIFAR10, CIFAR100, Caltech-256, ImageNet) and segmentation datasets (Cityscapes, BDD100K). We further showed that VAAL is robust to noisy labels and biased initial labeled data, and that it performs consistently well across different oracle budgets.


Supplementary Material

A. Datasets

Table 2 shows a summary of the datasets utilized in our work along with their size and number of classes and budget size.

Dataset Classes Train + Val Test Labeled Budget Image Size
CIFAR10 [27]
CIFAR100 [27]
Caltech-256 [20]
ImageNet [6]
BDD100K [55]
Cityscapes [5]
Table 2: A summary of the datasets used in our experiments. CIFAR10, CIFAR100, Caltech-256 and ImageNet are used for image classification, while BDD100K and Cityscapes are large-scale segmentation datasets. The budget for each dataset is the number of images that can be sampled at each training iteration.

B. Hyperparameter Selection

Table 3 shows the hyperparameters found for our models through a grid search.

Experiment batch size epochs
Table 3: Hyperparameters used in our experiments for VAAL: the latent space dimension of the VAE; the learning rates for the VAE, the discriminator, and the task module; the regularization parameters for the transductive and adversarial terms used in Eq. (4); and the Lagrangian parameter in Eq. (1).

Figure 6 shows that our method is robust to the choice of architecture: it consistently outperforms Core-set [40] on CIFAR100.

Figure 6: Performance of VAAL using ResNet18 and VGG16 on CIFAR100