Code for the paper https://arxiv.org/abs/1702.07956
We propose a new active learning by query synthesis approach using Generative Adversarial Networks (GAN). Different from regular active learning, the resulting algorithm adaptively synthesizes training instances for querying to increase learning speed. We generate queries according to the uncertainty principle, but our idea can work with other active learning principles. We report results from various numerical experiments to demonstrate the effectiveness of the proposed approach. In some settings, the proposed algorithm outperforms traditional pool-based approaches. To the best of our knowledge, this is the first active learning work using GAN.
One of the most exciting machine learning breakthroughs in recent years is the generative adversarial network (GAN) goodfellow2014generative. It trains a generative model by finding the Nash Equilibrium of a two-player adversarial game. Its ability to generate samples in complex domains enables new possibilities for active learners to synthesize training samples on demand, rather than relying on choosing instances to query from a given pool.
In the classification setting, given a pool of unlabeled data samples and a fixed labeling budget, active learning algorithms typically choose training samples strategically from a pool to maximize the accuracy of trained classifiers. The goal of these algorithms is to reduce label complexity. Such approaches are called pool-based active learning and are illustrated in Figure 1 (a).
In a nutshell, we propose to use GANs to synthesize informative training instances that are adapted to the current learner. We then ask human oracles to label these instances. The labeled data is added back to the training set to update the learner. This protocol is executed iteratively until the label budget is reached. This process is shown in Figure 1 (b).
The main contributions of this work are as follows:
To the best of our knowledge, this is the first active learning framework using deep generative models. [Footnote 1: The appendix of papernot2016semi mentioned three active learning attempts but did not report numerical results. Our approach is also different from those attempts.]
While we do not claim our method is always superior to the previous active learners in terms of accuracy, in some cases, it yields classification performance not achievable even by a fully supervised learning scheme. With enough capacity from the trained generator, our method allows us to have control over the generated instances which may not be available to the previous active learners.
We conduct experiments to compare our active learning approach with self-taught learning. [Footnote 2: See the supplementary document.] The results are promising.
The proposed approach should not be understood as a pool-based active learning method. Instead, it is active learning by query synthesis. We show that our approach can perform competitively when compared against pool-based methods.
Our work is related to two different subjects, active learning and deep generative models.
Active learning algorithms can be categorized into stream-based, pool-based and learning by query synthesis. Historically, stream-based and pool-based are the two popular scenarios of active learning Settles2010 .
In Lang1992, the authors synthesized learning queries and used human oracles to train a neural network for classifying handwritten characters. However, they reported poor results because the images generated by the learner were sometimes unrecognizable to the human oracles. We will report results on similar tasks, such as differentiating 5 versus 7, showing the advancement of our active learning scheme. Figure 2 compares image samples generated by the method in Lang1992 and by our algorithm.
The popular SVM algorithm from Tong1998 is an efficient pool-based active learning scheme for SVM. Their scheme is a special instance of the uncertainty sampling principle, which we also employ. Jain2010 reduces the exhaustive scanning through the database employed by SVM. Our algorithm shares the same advantage of not needing to test every sample in the database at each iteration of active learning, although we achieve this by not using a pool at all rather than by a clever trick. wang2014active proposed active transfer learning, which is reminiscent of our experiments in Section 5.1. However, we do not consider collecting new labeled data in target domains of transfer learning.
There have been some applications of generative models in semi-supervised learning and active learning. Previously, Nigam2000 proposed a semi-supervised learning approach to text classification based on generative models. Hospedales2013 applied Gaussian mixture models to active learning. In that work, the generative model served as a classifier. Compared with these approaches, we apply generative models to directly synthesize training data. This is a more challenging task.
One building block of our algorithm is the groundbreaking work of the GAN model in goodfellow2014generative . Our approach is an application of GAN in active learning.
Our approach is also related to Springenberg2015 which studied GAN in a semi-supervised setting. However, our task is active learning which is different from the semi-supervised learning they discussed. Our work shares the common strength with the self-taught learning algorithm in Raina2007 as both methods use the unlabeled data to help with the task. In the supplementary document, we compare our algorithm with a self-taught learning algorithm.
In a way, the proposed approach can be viewed as an adversarial training procedure goodfellow2014explaining, where the classifier is iteratively trained on the adversarial examples generated by the algorithm based on solving an optimization problem. goodfellow2014explaining focuses on adversarial examples that are generated by perturbing the original datasets within a small epsilon-ball, whereas we seek to produce examples using an active learning criterion.
To the best of our knowledge, the only previous mention of using GAN for active learning is in the appendix of papernot2016semi. The authors discussed therein three attempts to reduce the number of queries. In the third attempt, they generated synthetic samples and sorted them by information content, whereas we adaptively generate new queries by solving an optimization problem. There were no reported active learning numerical results in that work.
We briefly introduce some important concepts in active learning and generative adversarial networks.
In the PAC learning framework Valiant1984, label complexity describes the number of labeled instances needed to find a hypothesis with error $\epsilon$. The label complexity of passive supervised learning, i.e. using all the labeled samples as training data, is $\mathcal{O}(d/\epsilon)$ Vapnik1998, where $d$ is the VC dimension of the hypothesis class $\mathcal{H}$. Active learning aims to reduce the label complexity by choosing the most informative instances for querying while attaining a low error rate. For example, Hanneke2007 proved that the active learning algorithm from Cohn1994 has the label complexity bound $\mathcal{O}(\theta d \log(1/\epsilon))$, where $\theta$ is defined therein as the disagreement coefficient, thus reducing the theoretical bound for the number of labeled instances needed from passive supervised learning. Theoretically speaking, the asymptotic accuracy of an active learning algorithm cannot exceed that of a supervised learning algorithm. In practice, as we will demonstrate in the experiments, our algorithm may be able to achieve higher accuracy than passive supervised learning in some cases.
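To make the difference between passive and active label complexity concrete, the following sketch evaluates the leading terms $d/\epsilon$ and $\theta d \log(1/\epsilon)$ for a few error targets. It is only an illustration: the constants and lower-order factors hidden by the O-notation are ignored, and the values chosen for $d$ and $\theta$ are arbitrary assumptions, not quantities from the experiments.

```python
import math

def passive_bound(d, eps):
    """Leading term of the passive label complexity, d / eps (constants dropped)."""
    return d / eps

def active_bound(d, eps, theta):
    """Leading term of the disagreement-based bound, theta * d * log(1 / eps)."""
    return theta * d * math.log(1.0 / eps)

# Hypothetical values: VC dimension d and disagreement coefficient theta.
d, theta = 10, 2.0
for eps in (0.1, 0.01, 0.001):
    print(eps, round(passive_bound(d, eps)), round(active_bound(d, eps, theta)))
```

The passive term grows linearly in $1/\epsilon$ while the active term grows only logarithmically, which is why the gap widens as the target error shrinks.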
Stream-based active learning makes decisions on whether to query the streamed-in instances or not. Typical methods include Beygelzimer2008 ; Cohn1994 ; Dasgupta2007 . In this work, we will focus on comparing pool-based and query synthesis methods.
In pool-based active learning, the learner selects the unlabeled instances from an existing pool based on a certain criterion. Some pool-based algorithms make selections by using clustering techniques or maximizing a diversity measure, e.g. Brinker; Xu2007; Dasgupta2008; Nguyen; Yang2015; Hoi2009. Another commonly used pool-based active learning principle is uncertainty sampling. It amounts to querying the most uncertain instances. For example, the algorithms in Tong1998; Campbell2000 query the labels of the instances that are closest to the decision boundary of the support vector machine. Figure 3 (a) illustrates this selection process. Other pool-based works include houlsby2012collaborative, which proposes a Bayesian active learning by disagreement algorithm in the context of learning user preferences, and guillory2010interactive; golovin2010adaptive, which study the submodular nature of sequential active learning schemes.
Mathematically, let $P$ be the pool of unlabeled instances, and $f(x) = W^{\top}\phi(x) + b$ be the separating hyperplane, where $\phi$ is the feature map induced by the SVM kernel. The SVM algorithm in Tong1998 chooses a new instance to query by minimizing the distance (or its proxy) to the hyperplane
$$\min_{x \in P} \left\| W^{\top}\phi(x) + b \right\|. \qquad (1)$$
This formulation can be justified by the version space theory in separable cases Tong1998 or by other analyses in non-separable cases, e.g., Campbell2000 ; Bordes2005 . This simple and effective method is widely applied in many studies, e.g., Goh2004 ; Warmuth2002 .
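The pool-based selection rule described above can be sketched in a few lines. This is a minimal illustration rather than the algorithm of Tong1998: it assumes a linear classifier with an identity feature map, and the names `decision_value` and `select_queries`, as well as the toy pool, are hypothetical.

```python
def decision_value(w, b, x):
    """Linear decision function f(x) = w . x + b (identity feature map assumed)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def select_queries(pool, w, b, batch_size=1):
    """Return indices of the pool instances with the smallest |f(x)|."""
    ranked = sorted(range(len(pool)),
                    key=lambda i: abs(decision_value(w, b, pool[i])))
    return ranked[:batch_size]

# Toy pool of 2-D instances and a fixed hyperplane x0 - x1 = 0.
pool = [(2.0, 1.0), (0.1, -0.2), (-3.0, 0.5), (0.4, 0.5)]
w, b = (1.0, -1.0), 0.0
queries = select_queries(pool, w, b, batch_size=2)
```

Here `queries` holds the indices of the two instances nearest the hyperplane. Note that every pool instance is scored at each call, which is the exhaustive scan that query synthesis avoids.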
In the query synthesis scenario, an instance is synthesized instead of being selected from an existing pool. Previous methods tend to work in simple low-dimensional domains Angluin2001 but fail in more complicated domains such as images Lang1992 . Our approach aims to tackle this challenge.
The generative adversarial network (GAN) is a novel generative model invented by goodfellow2014generative. It can be viewed as the following two-player minimax game between the generator $G$ and the discriminator $D$,
$$\min_{\theta_G}\max_{\theta_D}\; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z}\left[\log\left(1 - D(G(z))\right)\right], \qquad (2)$$
where $p_{\text{data}}$ is the underlying distribution of the real data, and $D$ and $G$ each have their own set of parameters, $\theta_D$ and $\theta_G$. By solving this game, a generator $G$ is obtained. In the ideal scenario, given random input $z$, we have $G(z) \sim p_{\text{data}}$. However, finding this Nash Equilibrium is a difficult problem in practice. There is no theoretical guarantee for finding the Nash Equilibrium due to the non-convexity of $D$ and $G$. A gradient descent type algorithm is typically used for solving this optimization problem.
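As an illustration of the minimax objective, the sketch below estimates the standard GAN value $\mathbb{E}_x[\log D(x)] + \mathbb{E}_z[\log(1 - D(G(z)))]$ by Monte Carlo on a one-dimensional toy problem. Everything here is an assumption made for illustration: the Gaussian toy data, the hand-written discriminators and generators, and the helper names. No training takes place.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def safe_log(p, eps=1e-12):
    """Logarithm clipped away from zero to avoid math domain errors."""
    return math.log(max(p, eps))

def gan_value(discriminator, generator, real_samples, latent_samples):
    """Monte Carlo estimate of E_x[log D(x)] + E_z[log(1 - D(G(z)))]."""
    real_term = sum(safe_log(discriminator(x)) for x in real_samples) / len(real_samples)
    fake_term = sum(safe_log(1.0 - discriminator(generator(z)))
                    for z in latent_samples) / len(latent_samples)
    return real_term + fake_term

rng = random.Random(0)
real = [rng.gauss(3.0, 1.0) for _ in range(2000)]    # toy real data, centered at 3
latent = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # latent inputs z

def blind_d(x):    # discriminator that always outputs 1/2
    return 0.5

def sharp_d(x):    # discriminator that separates samples around x = 1.5
    return sigmoid(3.0 * (x - 1.5))

def poor_g(z):     # generator whose samples are centered at 0, far from the data
    return z

def matched_g(z):  # generator whose samples follow the toy data distribution
    return z + 3.0

for g in (poor_g, matched_g):
    print(g.__name__, gan_value(blind_d, g, real, latent), gan_value(sharp_d, g, real, latent))
```

Against the poor generator, the sharp discriminator attains a higher value than the blind one. Against the matched generator, it does worse than always answering 1/2, which reflects the ideal scenario in which generated and real samples are indistinguishable.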
A few updated GAN models have been proposed. One of them is infoGAN, which learns disentangled representations using unsupervised learning. Salimans2016 proposed a few improved techniques for training GAN. Another potentially important improvement of GAN, Wasserstein GAN, has been proposed by Arjovsky2017; gulrajani2017improved. The authors proposed an alternative way of training GAN, supported by theoretical analysis, which can avoid instabilities such as mode collapse. They also proposed a metric to evaluate the quality of the generation which may be useful for future GAN studies. Possible applications of Wasserstein GAN to our active learning framework are left for future work.
The invention of GAN triggered various novel applications. Yeh2016 performed an image inpainting task using GAN. Zhu2016 proposed iGAN to turn sketches into realistic images. Ledig2016 applied GAN to single image super-resolution. zhu2017unpaired proposed CycleGAN for image-to-image translation using only unpaired training data.
Our study is the first GAN application to active learning.
For a comprehensive review of GAN, readers are referred to Goodfellow-et-al-2016 .
In this section, we introduce our active learning approach which we call Generative Adversarial Active Learning (GAAL). It combines query synthesis with the uncertainty sampling principle.
The intuition of our approach is to generate instances which the current learner is uncertain about, i.e. to apply the uncertainty sampling principle. One particular choice for the loss function is based on the uncertainty sampling principle explained in Section 3.1. In the setting of a classifier with the decision function $f(x) = W^{\top}\phi(x) + b$, the (proxy) distance to the decision boundary is $\left\| W^{\top}\phi(x) + b \right\|$. Similar to the intuition of (1), given a trained generator function $G$, we formulate the active learning synthesis as the following optimization problem
$$\min_{z} \left\| W^{\top}\phi(G(z)) + b \right\|, \qquad (3)$$
where $z$ is the latent variable and $G$ is obtained by the GAN algorithm. Intuitively, minimizing this loss will push the generated samples toward the decision boundary. Figure 3 (b) illustrates this idea. Compared with the pool-based active learning in Figure 3 (a), our hope is that it may be able to generate more informative instances than those available in the existing pool.
The solution(s) to this optimization problem, $G(z)$, after being labeled, will be used as new training data for the next iteration. We outline our procedure in Algorithm 1.
It is possible to use a state-of-the-art classifier, such as a convolutional neural network. To do this, we can replace the feature map in Equation 3 with the feed-forward function of a convolutional neural network. In that case, the linear SVM will become the output layer of the network. In step 4 of Algorithm 1, one may also use a different active learning criterion. We emphasize that our contribution is the general framework instead of a specific criterion.
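The iterative protocol described above (train the learner, synthesize queries near the current decision boundary, have them labeled, and add them to the training set) can be sketched as follows. This is a minimal illustration and not the authors' implementation. It makes several assumptions: the generator is a fixed toy function rather than a trained GAN, the human oracle is replaced by a rule, the learner is a simple linear classifier trained with hinge-loss subgradient steps, and finite differences stand in for back-propagation. All function names are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear(samples, labels, epochs=200, lr=0.1):
    """Stand-in learner: linear classifier fit by hinge-loss subgradient steps."""
    w, b = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            if y * (dot(w, x) + b) < 1.0:  # margin violated: move the hyperplane
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def boundary_distance(generator, w, b, z):
    """Proxy distance |w . G(z) + b| of a generated instance to the boundary."""
    return abs(dot(w, generator(z)) + b)

def synthesize(generator, w, b, z, steps=100, lr=0.05, h=1e-4):
    """Descend the proxy distance in latent space; return the best z visited."""
    best = list(z)
    for _ in range(steps):
        grad = []
        for i in range(len(z)):  # central finite differences stand in for backprop
            zp, zm = list(z), list(z)
            zp[i] += h
            zm[i] -= h
            grad.append((boundary_distance(generator, w, b, zp)
                         - boundary_distance(generator, w, b, zm)) / (2 * h))
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
        if boundary_distance(generator, w, b, z) < boundary_distance(generator, w, b, best):
            best = list(z)
    return best

def gaal(generator, oracle, init_samples, latent_dim, rounds=3, batch=2, seed=0):
    """Iterate: train learner, synthesize queries, label them, grow the training set."""
    rng = random.Random(seed)
    samples = list(init_samples)
    labels = [oracle(x) for x in samples]
    for _ in range(rounds):  # rounds * batch plays the role of the label budget
        w, b = train_linear(samples, labels)
        for _ in range(batch):
            z0 = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
            x = generator(synthesize(generator, w, b, z0))
            samples.append(x)
            labels.append(oracle(x))
    return train_linear(samples, labels), samples, labels

# Hypothetical toy setup: a fixed 2-D "generator" and a rule-based "oracle".
def toy_generator(z):
    return [2.0 * math.tanh(z[0]), z[1]]

def toy_oracle(x):
    return 1 if x[0] + x[1] > 0 else -1

initial = [[-1.0, -1.0], [1.0, 1.0], [0.5, -2.0], [-0.5, 2.0]]
(w, b), samples, labels = gaal(toy_generator, toy_oracle, initial, latent_dim=2)
```

With a real generator and classifier, the finite-difference loop would be replaced by gradients obtained through automatic differentiation, and `toy_oracle` by human labeling.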
In training GAN, we follow the procedure detailed in Radford2015. Optimization problem (3) is non-convex with possibly many local minima. One typically aims at finding good local minima rather than the global minimum. We use a gradient descent algorithm with momentum to solve this problem. We also periodically restart the gradient descent to find other solutions. The gradients of $D$ and $G$ are calculated using back-propagation.
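The combination of momentum and restarts can be illustrated on a one-dimensional toy problem with two minima. The sketch below is an assumption-laden example, not the solver used in the experiments: the toy map $G(z) = (z, z^2)$ with $w = (0, 1)$ and $b = -1$ is hypothetical, the gradient is written by hand instead of being obtained by back-propagation, and the squared distance $\tfrac{1}{2} f^2$ is used as a smooth surrogate for the distance so that the gradient vanishes at the boundary.

```python
import random

def descend_with_momentum(grad, z0, steps=600, lr=0.01, momentum=0.9):
    """Heavy-ball gradient descent from a single starting point."""
    z = list(z0)
    velocity = [0.0] * len(z)
    for _ in range(steps):
        g = grad(z)
        velocity = [momentum * v - lr * gi for v, gi in zip(velocity, g)]
        z = [zi + v for zi, v in zip(z, velocity)]
    return z

def solve_with_restarts(grad, sample_start, n_restarts=5):
    """Restart the descent from fresh random points to collect several solutions."""
    return [descend_with_momentum(grad, sample_start()) for _ in range(n_restarts)]

# Toy problem: f(G(z)) = z^2 - 1, so the boundary is reached at z = +1 and z = -1.
def toy_grad(z):
    f = z[0] ** 2 - 1.0
    return [f * 2.0 * z[0]]  # derivative of 0.5 * f^2 with respect to z

rng = random.Random(0)
solutions = solve_with_restarts(toy_grad, lambda: [rng.uniform(-2.0, 2.0)])
```

Each restart ends near one of the two minimizers, and different starting points can end in different ones. This is the purpose of restarting: a single descent only ever returns one local solution.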
Alternatively, we can incorporate diversity into our active learning principle. Some active learning approaches rely on maximizing diversity measures, such as the Shannon Entropy. In our case, we can include in the objective function (3) a diversity measure such as proposed in Yang2015 ; Hoi2009 , thus increasing the diversity of samples. The evaluation of this alternative approach is left for future work.
We perform active learning experiments using the proposed approach. We also compare our approach to self-taught learning, a type of transfer learning method, in the supplementary document. The GAN implementation used in our experiment is a modification of a publicly available TensorFlow DCGAN implementation. [Footnote 3: https://github.com/carpedm20/DCGAN-tensorflow] The network architecture of DCGAN is described in Radford2015.
In our experiments, we focus on binary image classification, although this can be generalized to multiple classes using a one-vs-one or one-vs-all scheme Joshi2009. Recent advancements in GAN research show it could potentially model language as well gulrajani2017improved, although those results are preliminary at the current stage. We use a linear SVM as our classifier of choice (with parameter ). Even though classifiers with much higher accuracy (e.g., convolutional neural networks) can be used, our purpose is not to achieve absolute high accuracy but to study the relative performance between different active learning schemes.
The following schemes are implemented and compared in our experiments.
The proposed generative adversarial active learning (GAAL) algorithm as in Algorithm 1.
Using regular GAN to generate training data. We refer to this as simple GAN.
SVM algorithm from Tong1998 .
Passive random sampling, which randomly samples instances from the unlabeled pool.
Passive supervised learning, i.e., using all the samples in the pool to train the classifier.
Self-taught learning from Raina2007 .
We initialize the training set with 50 randomly selected samples. The algorithms proceed with a batch of 10 queries every time.
We use two datasets for training, MNIST and CIFAR-10. The MNIST dataset is a well-known image classification dataset with 60000 training samples. The training set and the test set follow the same distribution. We perform the binary classification experiment distinguishing 5 and 7, which is reminiscent of Lang1992. The training set of the CIFAR-10 dataset consists of 50000 color images from 10 categories. One might speculate about the possibility of distinguishing cats and dogs by training on cat-like dogs or dog-like cats. In practice, our human labelers failed to confidently identify most of the generated cat and dog images. Figure 4 (Top) shows generated samples. The authors of Salimans2016 reported attempts to generate high-resolution animal pictures, but with the wrong anatomy. We leave this task for future studies, possibly with improved techniques such as Arjovsky2017; gulrajani2017improved. For this reason, we perform binary classification on the automobile and horse categories. It is relatively easy for human labelers to identify car and horse body shapes. Typical generated samples, which are presented to the human labelers, are shown in Figure 4.
We use all the images of 5 and 7 from the MNIST training set as our unlabeled pool to train the generator $G$. Different from traditional active learning, we do not select new samples from the pool after initialization. Instead, we apply Algorithm 1 to generate a training query. For the generator $G$ and the discriminator $D$, we follow the same network architecture as Radford2015. We use a linear SVM as our classifier, although other classifiers can be used, e.g. Tong1998; Schein2007; Settles2010.
We first test the trained classifier on a test set that follows a distribution different from the training set. One purpose is to demonstrate the adaptive capability of the GAAL algorithm. In addition, because the MNIST test set and training set follow the same distribution, pool-based active learning methods have a natural advantage over active learning by synthesis since they use real images drawn from the exact same distribution as the test set. It is thus reasonable to test on sets that follow different, albeit similar, distributions. To this end, we use the USPS dataset from LeCun1989 as the test set with standard preprocessing. In reality, such settings are very common, e.g., training autonomous drivers on simulated datasets and testing on real vehicles, or training on handwritten characters and recognizing writing in different styles. This test setting is related to transfer learning, where the distribution of the training domain is different from that of the target domain. Figure 5 (Top) shows the results of our first experiment.
When using the full training set, with 11000 training images, the fully supervised accuracy is at . The accuracy of the random sampling scheme steadily approaches that level. On the other hand, GAAL is able to achieve accuracies better than that of the fully supervised scheme. With 350 training samples, its accuracy improves over supervised learning and even SVM, an aggressive active learner dasgupta2005analysis ; Tong1998 . Obviously, the accuracy of both SVM and random sampling will eventually converge to the fully supervised learning accuracy. Note that for the SVM algorithm, an exhaustive scan through the training pool is not always practical. In such cases, the common practice is to restrict the selection pool to a small random subset of the original data.
For completeness, we also perform the experiments in the settings where the training and test set follow the same distribution. Figure 5 (Bottom) shows these results. Somewhat surprisingly, in Figure 5 (Left), GAAL's classification accuracy starts to drop after about 100 samples. One possible explanation is that GAAL may be generating points close to the boundary that are also close to each other. This is more likely to happen if the boundary does not change much from one active learning cycle to the next. This probably happens because the test and train sets are identically distributed and simple, like MNIST. Therefore, after a while, the training set may be filled with many similar points, biasing the classifier and hurting accuracy. In contrast, because of the finite and discrete nature of pools in the given datasets, a pool-based approach, such as SVM, most likely explores points near the boundary that are substantially different. It is also forced to explore further points once these close-by points have already been selected. In a sense, the strength of GAAL might in fact be hurting its classification accuracy. We believe this effect is not so pronounced when the test and train sets are different because the boundary changes more significantly from one cycle to the next, which in turn induces some diversity in the generated samples.
To reach competitive accuracy when the training and test set follow the same distribution, we might incorporate a diversity term into our objective function in GAAL. We will address this in future work.
In the CIFAR-10 dataset, our human labelers noticed a higher chance of bad generated samples, e.g., instances that fail to represent either of the categories. This may be because of the significantly higher dimensionality compared with the MNIST dataset. In such cases, we asked the labelers to only label the samples they can distinguish. We speculate that recent improvements on GAN, e.g., Salimans2016; Arjovsky2017; gulrajani2017improved, may help mitigate this issue, provided the cause is the instability of GANs. Addressing this limitation will be left to future studies.
The proposed Algorithm 1 can be understood as an exploitation method, i.e., it focuses on generating the most informative training data based on the current decision boundary. On the other hand, it is often desirable for the algorithm to explore new areas of the data. To achieve this, we modify Algorithm 1 by simply executing random sampling every once in a while. This is a common practice in active learning baram2004online; roder2012active. We use the same experiment setup as in the previous section. Figure 6 shows the results of this mixed scheme.
A mixed scheme is able to achieve better performance than either using GAAL or random sampling alone. This implies that GAAL, as an exploitation scheme, performs even better in combination with an exploration scheme. A detailed analysis of such mixed schemes will be an interesting future topic.
In this work, we proposed a new active learning approach, GAAL, that employs generative adversarial networks. One possible explanation for GAAL not outperforming the pool-based approaches in some settings is that, in traditional pool-based learning, the algorithm will eventually exhaust all the points near the decision boundary and thus start exploring further points. However, this is not the case in GAAL, as it can always synthesize points near the boundary. This may in turn cause the generation of similar samples, thus reducing the effectiveness. We suspect that incorporating a diversity measure into the GAAL framework, as discussed at the end of Section 4, might mitigate this issue. This issue is related to the exploitation and exploration trade-off, which we explored in brief.
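One way to realize such a mixed scheme is to interleave the two query sources on a fixed period. The sketch below is an assumption: the text only states that random sampling is executed every once in a while, so the period `explore_every`, the function names, and the placeholder synthesizer are all hypothetical.

```python
import random

def mixed_round(round_index, explore_every, synthesize_batch, pool, batch_size, rng):
    """Return (source, batch): pool samples on exploration rounds, else synthesized queries."""
    if (round_index + 1) % explore_every == 0:
        return "random", rng.sample(pool, batch_size)   # exploration
    return "synthesize", synthesize_batch(batch_size)   # exploitation (GAAL)

def placeholder_synthesizer(batch_size):
    """Hypothetical stand-in for solving the query synthesis problem."""
    return ["synthetic query"] * batch_size

rng = random.Random(0)
pool = list(range(100))  # indices of unlabeled pool instances
rounds = [mixed_round(r, 3, placeholder_synthesizer, pool, 10, rng) for r in range(6)]
schedule = [source for source, _ in rounds]
```

With a period of 3, every third round draws its batch from the pool and the others synthesize queries. A full implementation would also remove the sampled instances from the pool once they are labeled.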
The results of this work are enough to inspire future studies of deep generative models in active learning. However, much work remains in establishing theoretical analysis and reaching better performance. We also suspect that GAAL can be modified to generate adversarial examples such as in goodfellow2014explaining . The comparison of GAAL with transfer learning (see the supplementary document) is particularly interesting and worth further investigation. We also plan to investigate the possibility of using Wasserstein GAN in our framework.
Multimodal concept-dependent active learning for image retrieval. In Proc. 12th Annu. ACM Int. Conf. Multimed. - Multimed. '04, page 564, New York, New York, USA, 2004. ACM Press.
IEEE Conf. Comput. Vis. Pattern Recognit., pages 2372–2379, 2009.
Active learning for logistic regression: An evaluation, volume 68. 2007.
One common strength of GAAL and self-taught learning is that both utilize the unlabeled data to help with the classification task. As we have seen in the MNIST experiment, our GAAL algorithm seems to be able to adapt to the learner. The results in this experiment are preliminary and not meant to be taken as comprehensive evaluations.
We use a Reconstruction Independent Component Analysis (RICA) model with a convolutional layer and a pooling layer. RICA is similar to a sparse autoencoder. Following standard self-taught learning procedures, we first train on the unlabeled pool dataset. Then we use the trained RICA as a feature extractor to obtain higher level features from randomly selected MNIST images. We then concatenate the features with the original image data to train the classifier. Finally, we test the trained classifier on the USPS dataset. We test the training size of, , , and . The reason for doing so is that deep learning type techniques are known to thrive in the abundance of training data. They may perform relatively poorly with a limited amount of training data, as in the active learning scenarios. We run the experiments 100 times and average the results. We use the same setting for the GAAL algorithm as in Section 5.1. The classifier we use is a linear SVM. Table 1 shows the classification accuracies of GAAL, self-taught learning and baseline supervised learning on raw image data.
|Algorithm||Training set size||Accuracy|
Using GAAL on the raw features achieves a higher accuracy than that of the self-taught learning with the same training size of . In fact, self-taught learning performs worse than the regular supervised learning when labeled data is scarce. This is possible for an autoencoder type algorithm. However, when we increase the training size, the self-taught learning starts to perform better. With 5000 training samples, self-taught learning outperforms GAAL with 250 training samples.
Based on these results, we suspect that GAAL also has the potential to be used as a self-taught algorithm. [Footnote 4: At this stage, self-taught learning has the advantage that it can utilize any unlabeled training data, i.e., not necessarily from the categories of interest. GAAL does not have this feature yet.] In practice, the GAAL algorithm can also be applied on top of the features extracted by a self-taught algorithm. A comprehensive comparison with a more advanced self-taught learning method with deeper architecture is beyond the scope of this work.