Task-Aware Variational Adversarial Active Learning

02/11/2020 ∙ by Kwanyoung Kim, et al. ∙ 0

Deep learning has achieved remarkable performance in various tasks thanks to massive labeled datasets. However, labeling a large amount of data is often challenging or infeasible due to high labeling cost, for example when labels must come from experts or when each large-scale data sample (e.g., a video or a very large image) takes a long time to label. Active learning addresses this by querying the most informative samples to be annotated from a massive unlabeled pool. Two promising directions for active learning that have been recently explored are the data distribution-based approach, which selects data points that are far from the current labeled pool, and the model uncertainty-based approach, which relies on the perspective of the task model. Unfortunately, the former does not exploit structures from tasks and the latter does not seem to utilize the overall data distribution well. Here, we propose methods that simultaneously take advantage of both the data distribution and model uncertainty approaches. Our proposed methods build on variational adversarial active learning (VAAL), which considers the data distribution of both labeled and unlabeled pools, by incorporating a learning loss prediction module and the RankCGAN concept into VAAL, modeling loss prediction as a ranker. We demonstrate that our proposed methods outperform recent state-of-the-art active learning methods on various balanced and imbalanced benchmark datasets.




1 Introduction

Deep learning has achieved remarkable performance in various computer vision tasks such as image classification (Krizhevsky et al., 2012; He et al., 2016), object detection (Ren et al., 2015; Redmon et al., 2016), and semantic segmentation (Long et al., 2015; Chen et al., 2018), thanks to massive annotated datasets such as ImageNet for image classification (Deng et al., 2009) and the PASCAL VOC datasets for classification, detection, and segmentation (Everingham et al., 2010). Obtaining good annotations is a challenging task and has often been a large-scale project. Moreover, labeling a large amount of data is often even more challenging or infeasible due to high labeling cost, such as labeling by experts (Esteva et al., 2017) or long labeling time per large-scale data sample, as with very large pathology images (Campanella et al., 2019). High labeling cost seems to be one of the factors that limit the applicability of deep learning to more research areas and to institutes with smaller labeling budgets.

There are several approaches to overcoming a limited labeling budget, such as utilizing a small amount of labeled data (few-shot learning) (Ravi and Larochelle, 2017), exploiting both a small amount of labeled data and a large amount of unlabeled data (semi-supervised learning) (Zhu et al., 2003; Kingma et al., 2014), and selecting data to label for the best possible performance (active learning) (Settles, 2009; Gal et al., 2017). While the first two approaches are applicable after labeling, the last approach is applicable before labeling. Active learning has been extensively studied in relatively traditional machine learning settings (non-deep learning approaches) (Cohn et al., 1996; Tong and Koller, 2001; Brinker, 2003; Melville and Mooney, 2004; Nguyen and Smeulders, 2004; Settles, 2009; Houlsby et al., 2011; Mac Aodha et al., 2014; Wang and Ye, 2015; Sundin et al., 2019; Pinsler et al., 2019) and has recently been investigated actively in deep learning settings (Gal et al., 2017; Sener and Savarese, 2018; Yang et al., 2017; Yoo and Kweon, 2019; Tran et al., 2019; Sinha et al., 2019; Kirsch et al., 2019).

In this work, we revisit two recent state-of-the-art active learning methods, learning loss (Yoo and Kweon, 2019) and variational adversarial active learning (VAAL) (Sinha et al., 2019), and investigate active learning methods that exploit the advantages of both. While VAAL exploits the feature structures of both labeled and unlabeled datasets in a task-agnostic way, without any information from the task learner (Sinha et al., 2019), learning loss uses a loss prediction module to predict the losses of the task learner for unlabeled data (Yoo and Kweon, 2019). We propose to incorporate task-related information on predicted loss into VAAL using a concept from RankCGAN (Saquil et al., 2018), so that both the global data structure from VAAL and the local task-related information from learning loss can be used to select samples to annotate. We evaluated our proposed methods on benchmark datasets with the same number of images per class (balanced datasets), namely CIFAR-10, SVHN, and Fashion-MNIST, as well as on a modified CIFAR-10 with different numbers of images per class (imbalanced datasets). Our contributions are:

  1. Revisiting the learning loss and VAAL works and proposing a new active learning method, called task-aware VAAL (TA-VAAL), that exploits both the global data structure from labeled / unlabeled datasets and the local task-aware information from the loss prediction module for active learning with deep neural networks,

  2. Extensive empirical analyses of the proposed method on both balanced and imbalanced datasets, using not only performance measures but also information-theoretic measures of the proposed active sampling method, showing its state-of-the-art performance as well as reliable active sampling results.

2 Related Work

There have been many works on active learning to select the most informative data points, and we categorize them into two approaches: the model uncertainty-based approach and the data distribution-based approach. The first approach uses unlabeled data in a passive way while the second uses unlabeled data in an active way. In other words, the former has sample selection rules that are not affected by unlabeled data and simply applies those rules to unlabeled data, while the latter exploits both labeled and unlabeled data to build (or train) the sample selection rules (or deep neural networks).

The model uncertainty-based approach defines and uses metrics for sample selection with labeled data. For example, the minimal distance from decision boundaries (or classification hyperplanes) can be used to select samples with the most ambiguous classification results (Tong and Koller, 2001; Brinker, 2003). Empirical risk minimization is used to minimize an upper bound of the true risk so that one can query the most informative samples that are the most uncertain and representative (Wang and Ye, 2015). Bayesian active learning by disagreement (BALD) maximizes the mutual information between model predictions and model parameters for active learning (Houlsby et al., 2011). However, these methods were not well-suited to deep neural networks.

Recently, there have been some works on model uncertainty-based active learning with deep neural networks. BALD was extended to accommodate deep neural networks with Bayesian neural networks and Monte-Carlo dropout (Gal et al., 2017). As a practical active learning method, a two-step approach was proposed that first selects the most uncertain samples by bootstrapping and then finds the most representative subset among them with a greedy method (Yang et al., 2017). Bayesian generative active deep learning was proposed to utilize both labeled data and labeled fake data (images generated by a deep generative model) to train a classifier (a task learner) as well as a discriminator for real / fake images (Tran et al., 2019). Yoo and Kweon (2019) proposed the learning loss method, which attaches a “loss prediction module” to a task learner; the loss prediction module is trained to predict the target losses of unlabeled data samples. In the learning loss method, the predicted loss is used as a surrogate for model uncertainty and is obtained from feature information in mid-layers.

The data distribution-based approach exploits both labeled and unlabeled data to form sample selection rules, so that the selected samples are far from the distribution of labeled data and best represent the unlabeled pool. For example, clustering the unlabeled data can help to choose samples from diverse clusters rather than from one or a small number of clusters (Nguyen and Smeulders, 2004). Expected error reduction using hierarchical clustering was developed to select active samples in a graph-based semi-supervised framework (Mac Aodha et al., 2014). An objective function with a diversity constraint was proposed to impose more diversity on the subset of the data pool for multi-class active learning (Yang et al., 2015).

Recently, there have also been some works on data distribution-based active learning with deep neural networks. The core set approach was developed to minimize the distance between labeled data points and the unlabeled data pool using intermediate feature information from trained convolutional neural network models (Sener and Savarese, 2018). Sinha et al. (2019) proposed VAAL, which trains a variational autoencoder (VAE) that captures representative information of both labeled and unlabeled data, and adopts adversarial learning to discriminate unlabeled data points from labeled data using information from the latent space of the trained VAE encoder.

Our work falls into the data distribution-based approach, as an extension of VAAL that incorporates task-related information into the VAE framework. We conjecture that our proposed framework can accommodate both the global data distribution structure and local task-related information, so that high performance and reliability can be achieved. Both our proposed work and the work of Tran et al. (2019) use deep generative models, but in quite different ways: our work uses unlabeled data to train the VAE while Tran et al. (2019) does not. Thus, the two methods do not compete with each other, but could instead complement each other.

(a) VAAL
(b) Ours (TA-VAAL)
Figure 1: Schematic diagrams of VAAL and our proposed method, TA-VAAL: VAAL is effective at learning the overall influence of labels propagated to the entire (unlabeled) data distribution. Injecting the capability of capturing fine-grained relative label rank information, TA-VAAL helps focus on more influential/informative labels and adjust how they are propagated.

3 Background

3.1 Active learning

Let us denote the pool of labeled data and annotations by $(\mathcal{X}_L, \mathcal{Y}_L)$, respectively, and the pool of unlabeled data by $\mathcal{X}_U$. The goal of active learning is to select samples $x_U$ from the unlabeled data pool $\mathcal{X}_U$ and to annotate them so that the paired samples / annotations $(x_L, y_L)$ are added to the labeled data pool, for the best possible performance of the given task learner $T$ (a classifier or deep neural network) under a limited annotation budget. At each stage $k$, the task learner is trained by minimizing the objective function with the $k$-stage labeled dataset $(\mathcal{X}_L^k, \mathcal{Y}_L^k)$. Given a training pair $(x_L, y_L)$ from the labeled pool, the task learner predicts $\hat{y}_L = T(x_L)$ and is trained by minimizing the loss function $l = L_T(\hat{y}_L, y_L)$ for all pairs in $(\mathcal{X}_L^k, \mathcal{Y}_L^k)$.

3.2 Learning loss for active learning

While the loss $l$ can be calculated when the ground truth label $y$ is available, it cannot be calculated without $y$. Yoo and Kweon (2019) proposed a loss prediction module that predicts the loss value $\hat{l}$ for a sample $x_U$ in the unlabeled data pool and used it to perform active learning. The loss prediction module consists of global average pooling, a fully connected layer and ReLU, and is trained to predict the ground-truth target loss $l$.

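The loss prediction module is optimized using an objective function $L_{loss}(\hat{l}, l)$, where $\hat{l}$ is the loss predicted by the module and $l$ is the target loss. Mean squared error (MSE) could be naively selected as the loss function $L_{loss}$, but training the loss prediction module with MSE often tends to minimize the scale of the entire objective function. To avoid this scaling issue, Yoo and Kweon (2019) used the ranking between two sample points as the loss function instead. They compared pairs of samples to obtain rankings: each mini-batch was re-grouped into a set of pairs $(x_i, x_j)$. If there were $B$ elements in the original mini-batch, the re-grouped set has $B/2$ elements (pairs). Thus, the margin ranking loss in (Yoo and Kweon, 2019) that should be minimized is

$$L_{loss}(\hat{l}_i, \hat{l}_j) = \max\left(0,\; -\mathbb{1}(l_i, l_j)\,(\hat{l}_i - \hat{l}_j) + \xi\right), \qquad \mathbb{1}(l_i, l_j) = \begin{cases} +1 & \text{if } l_i > l_j \\ -1 & \text{otherwise,} \end{cases} \tag{1}$$

where $\xi$ is a positive scalar that was set to 1.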

Learning loss only utilizes labeled data to train loss prediction module and applies it to unlabeled data to select samples that have fairly high predicted losses.
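The pairwise margin ranking loss described above can be sketched in plain Python. This is a minimal illustration, not the implementation of Yoo and Kweon (2019): the function name, the pairing of adjacent mini-batch elements, and the averaging over pairs are assumptions made here for clarity.

```python
def margin_ranking_loss(pred_losses, true_losses, margin=1.0):
    """Average pairwise margin ranking loss over a mini-batch.

    The mini-batch of B elements is re-grouped into B/2 pairs. Pairing
    adjacent elements (0,1), (2,3), ... is an assumption of this sketch.
    """
    n = len(pred_losses)
    if n == 0 or n != len(true_losses) or n % 2 != 0:
        raise ValueError("expected two equal-length, non-empty lists of even length")
    num_pairs = n // 2
    total = 0.0
    for p in range(num_pairs):
        i, j = 2 * p, 2 * p + 1
        # +1 if sample i has the larger true loss, -1 otherwise.
        sign = 1.0 if true_losses[i] > true_losses[j] else -1.0
        # Zero loss once the predicted ordering agrees by at least `margin`.
        total += max(0.0, -sign * (pred_losses[i] - pred_losses[j]) + margin)
    return total / num_pairs
```

Only the ordering of the true losses enters the loss, so the module is not penalized for mismatching the overall scale of the target loss.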

3.3 Variational adversarial active learning

A schematic diagram of VAAL is illustrated in Figure 1(a) (Sinha et al., 2019). VAAL first trains the VAE with both the labeled and unlabeled datasets to learn the representations of both datasets in a transductive manner, without using any information about the task learner. Then, a discriminator that decides whether data is labeled or unlabeled is connected to the latent space of the VAE, and the VAE and the discriminator are jointly trained in an adversarial manner (Goodfellow et al., 2014). Thus, the VAE encodes both the labeled and unlabeled data pools into the same latent space, and the discriminator is trained to predict whether a point in the latent space belongs to the labeled or the unlabeled data pool.

VAAL is a task-agnostic active learner, and its discriminator for labeled / unlabeled samples seems to work well as a surrogate for the task learner. However, we conjecture that direct task-related information could further improve the overall performance of active learning.

4 Method

4.1 Ranking conditional GANs in VAAL

A generative adversarial network (GAN) consists of a generator $G$ and a discriminator $D$ (Goodfellow et al., 2014) and has been applied to various tasks such as image style transfer, image super resolution, and image editing. The generator $G$ takes a latent variable $z$ as an input to generate a sample $G(z)$ and is trained to fool the discriminator, while the discriminator $D$ is trained to distinguish whether its input is real (label 1) or fake (label 0). Thus, the objective function for this minimax game can be expressed as:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))], \tag{2}$$

where the latent variable $z$ is drawn randomly from a probability distribution $p_z$ such as a normal distribution or a uniform distribution.


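To control the distribution of the latent space, the conditional GAN (CGAN) was proposed, which introduces an additional latent variable, $r$, to both the generator and the discriminator (Mirza and Osindero, 2014). The objective function of CGAN is as follows:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid r)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid r)))]. \tag{3}$$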


RankCGAN incorporates a “Ranker” into CGAN so that it produces a ranking attribute that provides subjective control over the latent subspace containing the additional latent variable (Saquil et al., 2018). By using one or more subjective attributes, the generative model can be controlled with ranking information. For example, typical ranking information could be subjective attributes such as “sporty” or “black” for shoe images (Saquil et al., 2018).

First, we propose to modify the VAE in the original VAAL (Figure 1(a)) to incorporate a rank variable from RankCGAN, as illustrated in Figure 1(b). By feeding in the ranking information about the predicted loss from the task learner with the loss prediction module, the RankCGAN framework allows us to control the latent subspace with loss predictions so that the overall latent space can be reshaped. We argue that this structure allows us to reflect the global data structure from both labeled and unlabeled datasets in the overall latent space and to control the loss prediction latent subspace inside it.

4.2 Ranker as loss prediction module in task learner

We argue that RankCGAN (Saquil et al., 2018) is a way to connect VAAL (Sinha et al., 2019) and learning loss (Yoo and Kweon, 2019). We first revisited VAAL with RankCGAN, and here we revisit learning loss with RankCGAN. The “Ranker” in RankCGAN (Saquil et al., 2018) is similar to the loss prediction module in learning loss for active learning (Yoo and Kweon, 2019) in a few aspects, one of which is the use of ranking losses for training. Rather than directly predicting the loss or attribute value itself, both predict the ranking of losses or values.

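Secondly, we propose to modify the original learning loss for active learning (Yoo and Kweon, 2019) with the ranking loss from (Saquil et al., 2018), which is as follows:

$$L_R(\hat{l}_i, \hat{l}_j) = -s_{ij}\,\log \mathrm{sig}(\hat{l}_i - \hat{l}_j) - (1 - s_{ij})\,\log\left(1 - \mathrm{sig}(\hat{l}_i - \hat{l}_j)\right), \tag{4}$$

where $R$ is the Ranker that predicts the loss ($\hat{l}_i = R(x_i)$, $\hat{l}_j = R(x_j)$), $s_{ij}$ equals 1 if the true loss of $x_i$ is larger than that of $x_j$ and 0 otherwise, and “sig” is the sigmoid function.

Thus, the total loss function of the task learner with the Ranker is expressed as

$$L = L_T(\hat{y}, y) + \lambda\, L_R(\hat{l}_i, \hat{l}_j), \tag{5}$$

where $L_T$ is the loss of the task learner and $\lambda$ is a scaling parameter.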

We empirically found that training a task learner with the ranking loss from (Saquil et al., 2018) was more stable and yielded better performance than the original learning loss work (Yoo and Kweon, 2019) with the original ranking loss in (1) for the task of active learning.
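To make the sigmoid-based ranking loss concrete, the following sketch computes a RankNet-style pairwise cross-entropy on two Ranker scores and combines it with a task loss through a scaling parameter. It is an illustration under assumptions: the function names, the `eps` guard, and the default `lam=1.0` are hypothetical choices, not values taken from the paper.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def pairwise_ranking_loss(score_i, score_j, i_has_larger_loss, eps=1e-12):
    """Cross-entropy on sig(score_i - score_j) against the true ordering."""
    target = 1.0 if i_has_larger_loss else 0.0
    p = sigmoid(score_i - score_j)
    # eps keeps log() finite when p saturates at 0 or 1.
    return -(target * math.log(p + eps) + (1.0 - target) * math.log(1.0 - p + eps))


def total_task_loss(task_loss, ranking_loss, lam=1.0):
    """Task loss plus the ranking loss weighted by a scaling parameter."""
    return task_loss + lam * ranking_loss
```

Unlike a hinge-style margin loss, this loss is smooth and never exactly zero, so every pair keeps contributing a gradient; whether that explains the stability observed above is not established here.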

4.3 Task-aware variational adversarial active learning

Finally, we propose our task-aware VAAL, which combines the conditional VAE with a rank variable (predicted loss or true loss) and the task learner with the Ranker: the Ranker output is fed into the rank variable of the conditional VAE to provide loss ranking information, as illustrated in Figure 1(b). Our proposed method tightly bridges the model uncertainty-based approach and the data distribution-based approach by using a conditional GAN (RankCGAN), so that the information about the data distribution accounts for model uncertainty information (predicted loss). We argue that our proposed approach has the advantage of using more data (unlabeled data) than typical model uncertainty-based approaches, which are trained only with labeled data. We also argue that our approach has the advantage of exploiting task-related information, unlike typical data distribution-based approaches.

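To train the conditional VAE in a conditionally adversarial manner with ranking (the output of the Ranker), the objective function of the conditional VAE for learning features of both labeled and unlabeled pools can be reformulated, written as a loss to be minimized, as

$$L_{VAE}^{trd} = -\mathbb{E}[\log p_\theta(x_L \mid z_L, r_L)] + \beta\, D_{KL}\left(q_\phi(z_L \mid x_L) \,\|\, p(z)\right) - \mathbb{E}[\log p_\theta(x_U \mid z_U, r_U)] + \beta\, D_{KL}\left(q_\phi(z_U \mid x_U) \,\|\, p(z)\right), \tag{6}$$

where $q_\phi$ and $p_\theta$ are the encoder and decoder of the VAE, parameterized by $\phi$ and $\theta$, $r_L$ and $r_U$ are the outputs of the Ranker for labeled and unlabeled data points, respectively, $p(z)$ is a Gaussian distribution, $\beta$ is a hyper-parameter, $D_{KL}$ is the Kullback-Leibler divergence, and the reparameterization technique was used for training. Another loss function for training the VAE is the adversarial loss, which drives the VAE to fool the binary discriminator $D$ by giving the labeled and unlabeled pools the same distribution in the latent space. The objective function for the conditional adversarial loss is

$$L_{VAE}^{adv} = -\mathbb{E}[\log D(q_\phi(z_L \mid x_L), r_L)] - \mathbb{E}[\log D(q_\phi(z_U \mid x_U), r_U)]. \tag{7}$$

Thus, the final loss function for training the conditional VAE is the sum of (6) and (7).

The loss for training the discriminator was designed as follows:

$$L_D = -\mathbb{E}[\log D(q_\phi(z_L \mid x_L), r_L)] - \mathbb{E}[\log(1 - D(q_\phi(z_U \mid x_U), r_U))]. \tag{8}$$

Finally, at each sampling step, the data points to be labeled are selected by performing the following operation:

$$x^{*} = \arg\min_{x_U \in \mathcal{X}_U} D(q_\phi(z_U \mid x_U), r_U), \tag{9}$$

where $\mathcal{X}_U$ is the unlabeled data pool. Note that a subset method, replacing $\mathcal{X}_U$ in (9) with a random subset of $\mathcal{X}_U$, was used to avoid outliers as much as possible, as was also done in (Yoo and Kweon, 2019).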
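The sampling step with the subset method can be sketched as follows. This is a hedged illustration: `disc_output` is a hypothetical callable standing in for the trained discriminator applied to the encoded sample and its rank variable, and the `pick_lowest` flag reflects an assumption about the output convention. If the discriminator output is read as the probability of being labeled, the lowest outputs are queried; under the opposite convention, the highest.

```python
import random


def select_queries(unlabeled_ids, disc_output, budget, subset_size,
                   pick_lowest=True, seed=0):
    """Choose `budget` samples to annotate from a random subset of the pool."""
    rng = random.Random(seed)
    pool = list(unlabeled_ids)
    # Subset method: score only a random subset of the unlabeled pool.
    subset = rng.sample(pool, min(subset_size, len(pool)))
    # Rank the subset by discriminator output (ascending if pick_lowest).
    ranked = sorted(subset, key=disc_output, reverse=not pick_lowest)
    return ranked[:budget]
```

For example, `select_queries(range(50000), disc_output, budget=1000, subset_size=10000)` would mirror a setting with a query size of 1,000 and a subset ten times larger, given a suitable `disc_output`.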

5 Experimental Results

5.1 Balanced benchmark datasets

We evaluated our proposed TA-VAAL method on various balanced benchmark datasets: CIFAR-10 (Krizhevsky, 2009), SVHN (Netzer et al., 2011) and Fashion-MNIST (Xiao et al., 2017). CIFAR-10 consists of 50,000 training images and 10,000 test images of size $32 \times 32$ with 10 object categories. SVHN contains 73,257 training images and 26,032 test images of size $32 \times 32$ with 10 object categories. Fashion-MNIST consists of a training set of 60,000 images and a test set of 10,000 images, which are grayscale images associated with annotations from 10 classes.

We initialized the labeled pool with 1,000 randomly selected images and set the query size to 1,000 at each stage. To avoid overlapping samples and to introduce diversity into the samples selected from the unlabeled pool, we adopted the subset method, which draws a random subset from the unlabeled pool at each stage before applying the active learning method. We set the subset size to 10,000, which is 10 times larger than the query size.
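The stage-wise protocol above amounts to simple bookkeeping over the labeled and unlabeled pools, sketched below. The `acquire` callable is a hypothetical placeholder for any query strategy (random sampling, learning loss, VAAL, or TA-VAAL), and the final pool size of 10,000 is taken from the 10k stage reported in the results; training of the task learner is left as a comment.

```python
import random


def run_stages(pool_size, acquire, initial_size=1000, query_size=1000,
               final_size=10000, seed=0):
    """Grow the labeled pool stage by stage and record its size."""
    rng = random.Random(seed)
    ids = list(range(pool_size))
    rng.shuffle(ids)
    # Initial labeled pool: randomly selected samples.
    labeled, unlabeled = ids[:initial_size], ids[initial_size:]
    sizes = [len(labeled)]
    while len(labeled) < final_size and unlabeled:
        # The task learner would be (re)trained on `labeled` here.
        k = min(query_size, final_size - len(labeled), len(unlabeled))
        chosen = list(acquire(unlabeled, k))
        if not chosen:
            break
        chosen_set = set(chosen)
        labeled = labeled + chosen
        unlabeled = [i for i in unlabeled if i not in chosen_set]
        sizes.append(len(labeled))
    return labeled, sizes
```

With a CIFAR-10-sized pool of 50,000 images, this yields ten evaluation points from 1k to 10k labeled samples.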

5.2 Imbalanced dataset

We also performed experiments on imbalanced datasets, in which the number of images differs across classes. We chose the CIFAR-10 dataset, which has 5,000 images per class for 10 classes (thus, the total number of images is 50,000). Images were randomly removed in an imbalanced manner as follows. We first removed images randomly from the first 4 classes to yield 3,000, 3,000, 4,000, and 5,000 images for classes 1, 2, 3, and 4, respectively. Then, more images were removed randomly from the remaining classes to produce three different levels of imbalance, as shown in Table 1:

Numbers of images per class (×100) | Total
30, 30, 40, 50, 0.5, 1, 1, 1, 2, 10 | 16,550
30, 30, 40, 50, 5, 5, 2, 3, 5, 20 | 19,000
30, 30, 40, 50, 10, 10, 10, 10, 10, 30 | 23,000
Table 1: The numbers of images per class (in units of 100 images) for all 10 classes, in order, on our generated imbalanced CIFAR-10 datasets. The entropies for the first, second, and third rows are 1.65, 1.89, and 2.11, respectively.

The smaller the entropy, the more imbalanced the dataset. The first 4 classes contain 15,000 images in total, and the remaining 6 classes contain 1,550, 4,000, and 8,000 images in the three datasets, respectively. The entropies of the three datasets are 1.65, 1.89 and 2.11, corresponding to the first, second, and third rows of Table 1.
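As a check on the dataset entropies, the sketch below computes the Shannon entropy of the class-count distribution for each row of Table 1. The use of the natural logarithm is an assumption; with that choice the computed values fall within about 0.01 of the reported 1.65, 1.89, and 2.11, and a perfectly balanced 10-class dataset gives $\ln 10 \approx 2.30$.

```python
import math


def class_count_entropy(counts):
    """Shannon entropy (in nats) of the empirical class distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Per-class image counts for the three imbalanced CIFAR-10 variants in Table 1.
TABLE_1 = [
    [3000, 3000, 4000, 5000, 50, 100, 100, 100, 200, 1000],
    [3000, 3000, 4000, 5000, 500, 500, 200, 300, 500, 2000],
    [3000, 3000, 4000, 5000, 1000, 1000, 1000, 1000, 1000, 3000],
]

for row in TABLE_1:
    print(sum(row), round(class_count_entropy(row), 3))
```

The same function applies to the class counts of the samples selected at each stage, which is the quantity compared across methods in the analysis of selected samples.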

5.3 Implementation details

For training, various techniques were used, such as random crops from zero-padded images, normalization with the mean and standard deviation of the training set, and horizontal flip and flop augmentation for CIFAR-10 and SVHN. Only normalization was applied for Fashion-MNIST. ResNet18 (He et al., 2016) was used for all task learners, and the stochastic gradient descent (SGD) optimizer was used with a momentum of 0.9 and a weight decay of 0.005. The learning rate was 0.1 for the first 160 epochs and 0.01 for the remaining 40 epochs. For the VAE, a Wasserstein auto-encoder (Tolstikhin et al., 2018) modified to take ranking information was used, and the discriminator was constructed as a 5-layer multi-layer perceptron (MLP). For both the VAE and the discriminator, the Adam optimizer (Kingma and Ba, 2015) with learning rate of 5 was used. The mini-batch size was 128 and the number of epochs was 200 for all datasets.
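The step learning-rate schedule of the task learner can be written as a small function. This is a sketch: the function and dictionary names are hypothetical, epochs are assumed to be counted from 0, and only the task-learner settings stated in the text are included.

```python
# Task-learner settings as stated in the text.
TASK_LEARNER_CONFIG = {
    "optimizer": "SGD",
    "momentum": 0.9,
    "weight_decay": 0.005,
    "batch_size": 128,
    "epochs": 200,
}


def task_learning_rate(epoch, base_lr=0.1, drop_epoch=160, drop_factor=0.1):
    """Learning rate for a 0-indexed epoch: 0.1 for epochs 0-159, then 0.01."""
    return base_lr if epoch < drop_epoch else base_lr * drop_factor
```

In a deep learning framework, the same behavior is usually obtained with a milestone-based scheduler using a single milestone at epoch 160.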

5.4 Results for balanced benchmark datasets

Four active learning methods were evaluated including random sampling, learning loss (Yoo and Kweon, 2019), VAAL (Sinha et al., 2019) and our proposed TA-VAAL method on three benchmark datasets, CIFAR-10, SVHN and Fashion MNIST. Figure 2 presents three graphs of the number of labeled images versus accuracy, the mean and standard deviation of 5 trials.

In Figure 2(a) on CIFAR-10, our proposed method achieved a mean accuracy of 90.32% at the last stage (10k), while the other methods yielded mean accuracies below 90%. Our proposed method outperforms the state-of-the-art methods at all stages. In particular, our proposed method yielded a mean accuracy of over 80% with only 4,000 labeled samples, which the other methods were not able to achieve.

The results on SVHN are shown in Figure 2(b). Our proposed method outperforms all state-of-the-art methods at all stages except one (5k). Our proposed method yielded substantially higher mean accuracies than the other methods at early stages such as 3k and 4k.

The results on Fashion-MNIST are depicted in Figure 2(c). Our proposed method outperforms all other compared methods at all stages except the first, and outperformed all other methods at the 3k stage by a margin of more than 5%.

Note that learning loss and our method yielded slightly higher or lower mean accuracies than the other methods at the first stage due to the additional loss prediction module attached to the task learner. We performed ablation studies, reported in a later section, to show that this additional loss is not the most important factor in the overall performance of our proposed method.

(a) CIFAR-10
(b) SVHN
(c) Fashion MNIST
Figure 2: Accuracy of active learning algorithms (random sampling, Learning loss, VAAL and ours (TA-VAAL)) on balanced datasets. The vertical width of the shaded region along each curve corresponds to twice the standard deviations.
(a) Dataset entropy = 1.65
(b) Dataset entropy = 1.89
(c) Dataset entropy = 2.11
Figure 3: Active learning results on imbalanced benchmark datasets: random sampling, Learning loss, VAAL and ours (TA-VAAL). Higher dataset entropy implies more balanced dataset.

5.5 Results on imbalanced dataset

Four active learning methods were evaluated including random sampling, learning loss (Yoo and Kweon, 2019), VAAL (Sinha et al., 2019) and our proposed TA-VAAL method on three imbalanced datasets by modifying CIFAR-10 as illustrated in Table 1. Figure 3 presents three graphs of the number of labeled images versus accuracy, the mean and standard deviation of 5 trials. Due to reduced data size, we experimented until the labeled data pool size of 5,000. Initial data size of labeled pool and incremental budget size were not changed.

Figure 3 shows the results on the modified CIFAR-10 with dataset entropies of 1.65 (the most imbalanced), 1.89 and 2.11 (the least imbalanced): our proposed method outperformed all other state-of-the-art methods at all stages. The margin over the other compared methods was larger for the more imbalanced datasets and became smaller for the less imbalanced datasets with higher dataset entropy. Even though the final labeled dataset size was 5,000 in all cases, note that some classes have only 50-100 images in total in the case of the dataset entropy of 1.65. Thus, in this harshest case, the standard deviation of the performance was large and the final accuracy at the 5k stage was still lower than 70% due to these bottlenecks. In the case with a dataset entropy of 1.89, our proposed method outperformed all other methods by substantial margins. Note that model uncertainty-based approaches such as ours and learning loss achieved substantially higher performance than the task-agnostic approaches when class sizes were imbalanced. Moreover, our proposed method yielded substantially higher performance than the learning loss method due to the incorporation of unlabeled data information through generative models.

6 Empirical Analysis Results

6.1 Task learners without loss prediction modules

In (Yoo and Kweon, 2019), the task learner with the loss prediction module was trained to show the performance on actively selected samples and was compared with other methods using task learners without loss prediction modules. Loss prediction modules are part of both the learning loss method and our proposed TA-VAAL method, and they seem to influence the overall performance as shown in Figure 2, often yielding slightly higher or lower performance.

To measure the quality of the samples selected by the active learning methods, we retrained the task learners of the learning loss method and our proposed method without attaching the loss prediction module (or Ranker) on the selected datasets up to 10k, as shown in Figure 4. Our proposed method was not able to yield the best performance at the 5k stage on SVHN in Figure 2(b), but it outperforms all other methods on SVHN once the effect of the loss prediction module on task learner training is minimized, as illustrated in Figure 4. At the last stage (10k), the performance of our proposed method dropped slightly, by 0.48%, compared with the result of the task learner with the Ranker, while the performance of the learning loss method improved slightly, by 0.24%. Despite these changes, our proposed method still outperformed the other state-of-the-art methods, achieving a mean accuracy of 92.27% in the last active learning cycle.

Figure 4: Active learning results on SVHN using task learners without loss prediction modules.

6.2 Ablation studies

Figure 5 shows the performance of our proposed methods with and without the proposed components and structures, along with other state-of-the-art methods. The means and standard deviations of 5 trials are reported. First, the learning loss method with the proposed ranking loss (4), called learning loss_v2, yielded substantially higher performance at later active learning stages and comparable performance at early stages compared to the original learning loss. Thus, it seems that using our proposed ranking loss (4) along with task learners is advantageous.

Another study is to incorporate ranking information into VAAL by using the original learning loss architecture, rather than our proposed Ranker (4). This combination of VAAL+learning loss still yielded substantially better performances than VAAL over all stages, but was not able to yield better performance than the original learning loss method at later stages while this combination seems to yield good performances at early stages.

Figure 5: Results of our proposed methods with and without proposed components and proposed structures on imbalanced dataset, CIFAR-10, whose dataset entropy is 1.89.

6.3 On selected samples using active learning methods

Figure 6 shows bar graphs of the number of labeled images (selected samples) versus the entropy of the class counts of the selected samples over the 10 classes. The higher the entropy, the more evenly the selected samples are spread over the classes. Figure 6(a) shows that our proposed method selected good samples with high class count entropy on the most severely imbalanced dataset, and Figure 6(b) shows similar results on the least severely imbalanced dataset, with robustness (small standard deviations).

These results also correspond to the performance results in Figure 3. For example, the learning loss method yielded substantially lower performance at the 2k stage, as shown in Figure 3(a), due to its data selection with very low class count entropy at the same stage, as illustrated in Figure 6(a). This is possibly because the limited amount of data for certain classes in the case of dataset entropy = 1.65 prevented the task learner in the learning loss method from being well trained. Meanwhile, our proposed method was able to select good samples at the same stage thanks to the structure from VAAL, so that both good performance and high class count entropy were achieved.

Lastly, Figure 7 shows that, at the last stage, the discriminator of VAAL yielded output values concentrated in a narrow range, so that the active learning selection became almost random, while our proposed method still yielded a wide range of discriminator outputs, supporting the good selections and performance illustrated in Figures 6 and 3.

(a) Dataset entropy = 1.65
(b) Dataset entropy = 2.11
Figure 6: Bar graphs of number of labeled images (selected samples) versus data class count entropy on the imbalanced datasets with (a) dataset entropy of 1.65 and (b) dataset entropy of 2.11.
Figure 7: Output of discriminator versus number of samples at the last stage on the imbalanced dataset with entropy 1.65.

7 Conclusion

We proposed methods that simultaneously take advantage of two promising, recently explored directions for active learning: the data distribution-based approach, which selects samples that are far from the current labeled pool, and the model uncertainty-based approach, which relies on task models. We demonstrated that our proposed methods outperform state-of-the-art active learning methods on various balanced and imbalanced benchmark datasets.


This work was supported partly by the Technology Innovation Program or Industrial Strategic Technology Development Program (10077533, Development of robotic manipulation algorithm for grasping/assembling with the machine learning using visual and tactile sensing information) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea) and partly by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number : HI18C0316).


  • K. Brinker (2003) Incorporating diversity in active learning with support vector machines. In ICML, pp. 59–66. Cited by: §1, §2.
  • G. Campanella, M. G. Hanna, L. Geneslaw, A. Miraflor, V. W. K. Silva, K. J. Busam, E. Brogi, V. E. Reuter, D. S. Klimstra, and T. J. Fuchs (2019) Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature medicine 25 (8), pp. 1301–1309. Cited by: §1.
  • L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pp. 801–818. Cited by: §1.
  • D. A. Cohn, Z. Ghahramani, and M. I. Jordan (1996) Active learning with statistical models. Journal of artificial intelligence research 4, pp. 129–145. Cited by: §1.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, Cited by: §1.
  • A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun (2017) Dermatologist-level classification of skin cancer with deep neural networks. Nature 542 (7639), pp. 115–118. Cited by: §1.
  • M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman (2010) The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88 (2), pp. 303–338. Cited by: §1.
  • Y. Gal, R. Islam, and Z. Ghahramani (2017) Deep bayesian active learning with image data. In ICML, pp. 1183–1192. Cited by: §1, §2.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, pp. 2672–2680. Cited by: §3.3, §4.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §1, §5.3.
  • N. Houlsby, F. Huszár, Z. Ghahramani, and M. Lengyel (2011) Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745. Cited by: §1, §2.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §5.3.
  • D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling (2014) Semi-supervised learning with deep generative models. In NIPS, pp. 3581–3589. Cited by: §1.
  • A. Kirsch, J. van Amersfoort, and Y. Gal (2019) Batchbald: efficient and diverse batch acquisition for deep bayesian active learning. In NeurIPS, pp. 7024–7035. Cited by: §1.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105. Cited by: §1.
  • A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report University of Toronto. Cited by: §5.1.
  • J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR, pp. 3431–3440. Cited by: §1.
  • O. Mac Aodha, N. D. Campbell, J. Kautz, and G. J. Brostow (2014) Hierarchical subquery evaluation for active learning on a graph. In CVPR, pp. 564–571. Cited by: §1, §2.
  • P. Melville and R. J. Mooney (2004) Diverse ensembles for active learning. In ICML, pp. 74. Cited by: §1.
  • M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §4.1.
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, pp. 1–9. Cited by: §5.1.
  • H. T. Nguyen and A. Smeulders (2004) Active learning using pre-clustering. In ICML, pp. 79. Cited by: §1, §2.
  • R. Pinsler, J. Gordon, E. Nalisnick, and J. M. Hernández-Lobato (2019) Bayesian batch active learning as sparse subset approximation. In NeurIPS, pp. 6356–6367. Cited by: §1.
  • S. Ravi and H. Larochelle (2017) Optimization as a model for few-shot learning. In ICLR, Cited by: §1.
  • J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In CVPR, pp. 779–788. Cited by: §1.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, pp. 91–99. Cited by: §1.
  • Y. Saquil, K. I. Kim, and P. Hall (2018) Ranking CGANs: subjective control over semantic image attributes. In BMVC, Cited by: §1, §4.1, §4.2, §4.2, §4.2.
  • O. Sener and S. Savarese (2018) Active learning for convolutional neural networks: a core-set approach. In ICLR, Cited by: §1, §2.
  • B. Settles (2009) Active learning literature survey. Technical report University of Wisconsin-Madison Department of Computer Sciences. Cited by: §1.
  • S. Sinha, S. Ebrahimi, and T. Darrell (2019) Variational adversarial active learning. In ICCV, pp. 5972–5981. Cited by: §1, §1, §2, §3.3, §4.2, §5.4, §5.5.
  • I. Sundin, P. Schulam, E. Siivola, A. Vehtari, S. Saria, and S. Kaski (2019) Active learning for decision-making from imbalanced observational data. In ICML, pp. 6046–6055. Cited by: §1.
  • I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf (2018) Wasserstein auto-encoders. In ICLR, Cited by: §5.3.
  • S. Tong and D. Koller (2001) Support vector machine active learning with applications to text classification. Journal of machine learning research 2 (Nov), pp. 45–66. Cited by: §1, §2.
  • T. Tran, T. Do, I. Reid, and G. Carneiro (2019) Bayesian generative active deep learning. In ICML, pp. 6295–6304. Cited by: §1, §2, §2.
  • Z. Wang and J. Ye (2015) Querying discriminative and representative samples for batch mode active learning. ACM Transactions on Knowledge Discovery from Data (TKDD) 9 (3), pp. 1–23. Cited by: §1, §2.
  • H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §5.1.
  • L. Yang, Y. Zhang, J. Chen, S. Zhang, and D. Z. Chen (2017) Suggestive annotation: a deep active learning framework for biomedical image segmentation. In MICCAI, pp. 399–407. Cited by: §1, §2.
  • Y. Yang, Z. Ma, F. Nie, X. Chang, and A. G. Hauptmann (2015) Multi-class active learning by uncertainty sampling with diversity maximization. International Journal of Computer Vision 113 (2), pp. 113–127. Cited by: §2.
  • D. Yoo and I. S. Kweon (2019) Learning loss for active learning. In CVPR, pp. 93–102. Cited by: §1, §1, §2, §3.2, §3.2, §4.2, §4.2, §4.2, §4.3, §5.4, §5.5, §6.1.
  • X. Zhu, Z. Ghahramani, and J. D. Lafferty (2003) Semi-supervised learning using gaussian fields and harmonic functions. In ICML, pp. 912–919. Cited by: §1.