Discriminative Active Learning

07/15/2019 ∙ by Daniel Gissin, et al.

We propose a new batch mode active learning algorithm designed for neural networks and large query batch sizes. The method, Discriminative Active Learning (DAL), poses active learning as a binary classification task, attempting to choose examples to label in such a way as to make the labeled set and the unlabeled pool indistinguishable. Experimenting on image classification tasks, we empirically show our method to be on par with state of the art methods in medium and large query batch sizes, while being simple to implement and also extend to other domains besides classification tasks. Our experiments also show that none of the state of the art methods of today are clearly better than uncertainty sampling when the batch size is relatively large, negating some of the reported results in the recent literature.


Code Repositories

DiscriminativeActiveLearning

Code and website for DAL (Discriminative Active Learning) - a new active learning algorithm for neural networks in the batch setting. For the blog:



1 Introduction

In the past few years deep neural networks achieved groundbreaking results in many machine learning tasks, ranging from computer vision to speech recognition and NLP. While these successes can be attributed to the increase in computational power and to the expressiveness of neural networks, they are also a clear result of the use of massive datasets which contain millions of samples. The collection and labeling of such datasets is highly time consuming and often expensive, and much research has been done to try and reduce the cost of this process.

Active learning is one field of research which tries to address the difficulties of data labeling. It assumes the collection of data is relatively easy but the labeling process is costly, which is the case in many language, vision and speech recognition tasks. It tackles the question of which samples, if labeled, will result in the highest improvement in test accuracy under a fixed labeling budget. The most popular active learning setting is pool-based sampling (Settles, 2012), where we assume that we have a small labeled dataset $\mathcal{L}$ and access to a large unlabeled dataset $\mathcal{U}$ from which we choose the samples to be labeled next. We then start an iterative process where in every step the active learning algorithm uses information in $\mathcal{L}$ and $\mathcal{U}$ to choose the best $x \in \mathcal{U}$ to be labeled. $x$ is then labeled by an oracle and added to $\mathcal{L}$, and the process continues until we have reached the desired number of samples or test accuracy.
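The pool-based loop described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name, the `oracle` and `select` callables, and the toy data are all assumptions made for the example, and random selection stands in for a real query strategy.

```python
import random

def pool_based_active_learning(labeled, unlabeled, oracle, select, rounds):
    """Generic pool-based loop: pick a sample, ask the oracle, grow the labeled set.

    labeled   -- list of (x, y) pairs, the current labeled set
    unlabeled -- list of x values, the unlabeled pool
    oracle    -- callable x -> y standing in for the human annotator
    select    -- callable (labeled, unlabeled) -> index into unlabeled
    rounds    -- number of queries allowed by the labeling budget
    """
    labeled, unlabeled = list(labeled), list(unlabeled)  # work on copies
    for _ in range(rounds):
        if not unlabeled:
            break
        idx = select(labeled, unlabeled)
        x = unlabeled.pop(idx)          # remove x from the pool
        labeled.append((x, oracle(x)))  # label it and add it to the labeled set
    return labeled, unlabeled

# Toy usage: random selection as the simplest possible strategy.
rng = random.Random(0)
pool = [0.1, 0.4, 0.6, 0.9]
L, U = pool_based_active_learning(
    labeled=[(0.5, 1)],
    unlabeled=pool,
    oracle=lambda x: int(x >= 0.5),
    select=lambda lab, unl: rng.randrange(len(unl)),
    rounds=2,
)
```

Every strategy discussed in this paper differs only in what `select` does with the two sets.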

Many heuristic methods have been developed over the years (Settles, 2012), mostly for classification tasks, that have been shown to reduce the sample size by up to an order of magnitude. These methods fit a model to the labeled dataset and then choose a single sample to be labeled according to the performance of the model on the samples in $\mathcal{U}$. While these methods are very successful, they have a few shortcomings that make them less suited for large scale data collection for deep learning. First, it is highly impractical to query a single sample from $\mathcal{U}$ in every iteration, since we need a human in the loop for every one of those iterations. Adding to that the time needed to fit a neural network, collecting a large dataset using this setting simply takes too long. Second, many of these methods are designed solely for classification tasks, and extending them to other domains is often non-trivial.

To tackle the first problem, the batch-mode active learning setting was introduced, allowing for the querying of a batch of examples at every iteration. This change adds complexity to the problem since it might not be enough to greedily choose the top-$K$ samples from $\mathcal{U}$ according to some score, since they may be very similar to each other. Choosing a batch of diverse examples is often better than choosing one containing very similar examples, which may lead to performance that is worse than random sampling. Several methods were developed with this problem in mind (Chen & Krause, 2013), (Guo & Schuurmans, 2008), (Azimi et al., 2012), (Wei et al., 2015). The second problem however has not been explicitly addressed to the best of our knowledge, and active learning solutions to other domains have had to be adapted separately (Settles, 2008), (Vezhnevets et al., 2012).

To address the second problem and make the method agnostic to the specific task being performed, we do not assume access to a classifier trained on the labeled set when choosing the batch to query and instead only assume access to a learned representation of the data, $\Psi(x)$. We pose the active learning problem as a simple binary classification problem and choose the batch to query according to a solution to said problem. We compare our method empirically to today’s state-of-the-art active learning methods for image classification, showing competitive results while being simple to implement and conceptually simple to transfer to new tasks.

2 Related work

There has been extensive work on active learning prior to the advances in deep learning of recent years. This work is summarized in (Settles, 2012), which covers the many different heuristics developed for active learning, along with adaptations made for other tasks besides classification. A large empirical study of different heuristics applied to classification using the logistic regression model can be found in (Yang & Loog, 2018). This study shows uncertainty-based methods (Lewis & Gale, 1994) perform well on a large and diverse set of datasets while being very simple to implement. These methods score the unlabeled examples according to the classifier’s uncertainty about their class, and choose the example with highest uncertainty.
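As a concrete illustration of uncertainty-based scoring, the sketch below implements the two acquisition functions named later among the baselines, max-entropy and variation-ratio, on plain lists of class probabilities. The function names and the toy probabilities are assumptions for the example only.

```python
import math

def max_entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def variation_ratio(probs):
    """One minus the probability of the most likely class."""
    return 1.0 - max(probs)

def most_uncertain(pool_probs, score):
    """Index of the pool example whose prediction scores as most uncertain."""
    return max(range(len(pool_probs)), key=lambda i: score(pool_probs[i]))

# Predicted class probabilities for three unlabeled examples (made-up values).
pool_probs = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.40, 0.35, 0.25],  # nearly uniform, so highly uncertain
    [0.70, 0.20, 0.10],
]
query_index = most_uncertain(pool_probs, max_entropy)
```

Both scores agree on this toy pool, although in general they can rank examples differently.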

While neural networks for classification tasks output a probability distribution over labels, they have been shown to be over-confident and their softmax scores are unreliable measures of confidence. Recent work (Gal et al., 2017) shows an equivalence between dropout, a widely used neural network regularization method, and approximate Bayesian inference. This allows for the use of dropout at test time in order to achieve better confidence estimations of the network’s outputs. Our empirical analysis shows this method performs very similarly to uncertainty sampling using the regular softmax scores.
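The idea of using dropout at test time can be sketched as averaging the class probabilities of several stochastic forward passes. The snippet below is only an illustration under strong assumptions: `toy_forward` is a hypothetical two-class model with input dropout and fixed weights, not a real network, and the number of passes is arbitrary.

```python
import math
import random

def mc_dropout_predict(stochastic_forward, x, passes=20):
    """Average class probabilities over several forward passes with dropout left on."""
    total = None
    for _ in range(passes):
        probs = stochastic_forward(x)
        total = probs if total is None else [t + p for t, p in zip(total, probs)]
    return [t / passes for t in total]

rng = random.Random(0)

def toy_forward(x, drop_rate=0.5):
    """Hypothetical 2-class model: random input dropout, fixed weights, softmax."""
    # Inverted dropout: zero each input with probability drop_rate, rescale the rest.
    kept = [v / (1.0 - drop_rate) if rng.random() >= drop_rate else 0.0 for v in x]
    logits = [sum(kept), -sum(kept)]
    exps = [math.exp(l) for l in logits]
    return [e / sum(exps) for e in exps]

mean_probs = mc_dropout_predict(toy_forward, x=[0.2, -0.1, 0.4])
```

The averaged probabilities can then be fed to any uncertainty score in place of a single softmax output.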

In addition to uncertainty-based approaches to active learning, another approach that has been actively studied in the past is the margin-based approach, which has a similar flavor. Instead of scoring an example based on the output probabilities of the model, these methods choose the examples which are closest to the decision boundary of the model. While in linear models like logistic regression for binary classification problems these two approaches are equivalent, this isn’t the case for neural networks that have an intractable decision boundary. (Ducoffe & Precioso, 2018) approximates the distance to the decision boundary using the distance to the nearest adversarial example (Szegedy et al., 2013). Our empirical analysis shows that while this method is conceptually very different from uncertainty sampling, it ranks the unlabeled examples in a very similar way.

Another active learning approach is based on expected model change, choosing examples which will cause the biggest effect on our model when labeled. (Settles et al., 2008) and (Huang et al., 2016) use this approach by calculating the expected gradient length (EGL) of the loss with respect to the model parameters, where the expectation is over the posterior distribution over the labels induced by the trained classifier. While their results show significant improvements over random sampling for text and speech recognition tasks, our empirical analysis along with the analysis of (Ducoffe & Precioso, 2018) show that this method is not suited for image classification using CNNs.
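To make the EGL score concrete, the sketch below computes it in closed form for a deliberately simple model: a linear softmax classifier without bias, trained with cross-entropy loss. This model choice is an assumption made so the gradient norm can be written by hand; the cited works apply EGL to recurrent and convolutional networks, where the gradient is obtained by backpropagation.

```python
import math

def expected_gradient_length(probs, x):
    """EGL for a linear softmax classifier (weights only, no bias).

    With logits W x and cross-entropy loss, the gradient w.r.t. W under label y
    is the outer product (p - e_y) x^T, whose Frobenius norm is
    ||p - e_y|| * ||x||. EGL averages that norm over y ~ p.
    """
    x_norm = math.sqrt(sum(v * v for v in x))
    egl = 0.0
    for y, p_y in enumerate(probs):
        residual = [p - (1.0 if c == y else 0.0) for c, p in enumerate(probs)]
        egl += p_y * math.sqrt(sum(r * r for r in residual)) * x_norm
    return egl

confident = expected_gradient_length([1.0, 0.0], x=[3.0, 4.0])
uncertain = expected_gradient_length([0.5, 0.5], x=[3.0, 4.0])
```

A fully confident prediction yields a zero score, while the maximally uncertain one yields the largest score for a given input norm.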

Finally, (Sener & Savarese, 2018) and (Geifman & El-Yaniv, 2017) address the challenge posed by the batch-mode setting by formulating the active learning task as a core set selection problem, attempting to choose a batch which $\epsilon$-covers the unlabeled set as closely as possible and obtains a diverse query in the process. This is done robustly by formulating the $\epsilon$-cover as an integer program which allows a small subset of the examples (outliers) not to be covered. The metric space over which the $\epsilon$-cover is optimized is the learned representation of the data. While not discussed in their paper, the core set approach is similar to our method in that it only requires access to a learned representation of the data and not a trained classifier, which allows their method to easily extend to other tasks as well. However, their robust formulation is not as straightforward to implement as our method and does not scale as easily to large unlabeled datasets.
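The cover objective is usually approximated with greedy farthest-first traversal, which the experiments below also use to initialize the integer program. The following is a minimal sketch of that greedy step only, not of the robust integer-program formulation; the function name and toy points are assumptions for the example.

```python
import math

def farthest_first(points, labeled_idx, batch_size):
    """Greedy k-center selection: repeatedly pick the point farthest from all centers.

    points      -- list of coordinate tuples in the representation space
    labeled_idx -- indices of points already labeled (must be non-empty)
    batch_size  -- number of new points to select
    """
    # Distance from every point to its nearest current center.
    min_dist = [
        min(math.dist(p, points[c]) for c in labeled_idx) for p in points
    ]
    chosen = []
    for _ in range(batch_size):
        far = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(far)
        # The new center may now be the nearest center for some points.
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[far]))
    return chosen

points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (5.0, 0.0), (5.1, 0.0)]
batch = farthest_first(points, labeled_idx=[0], batch_size=2)
```

On this toy data the first pick is the most distant point and the second comes from a different region, so near-duplicates of an already chosen point are skipped.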

3 DAL - Discriminative Active Learning

We motivate our proposed method with a simple idea - when building a labeled dataset out of a large pool of unlabeled examples, we would like our dataset to represent the true distribution of the data as well as possible. In high dimensional and multi-modal data, this would ensure our labeled set has samples from all of the different modalities of the distribution. A naive solution would be to model the distribution of the unlabeled pool using some density-based or parameterized modeling method, and use the learned distribution to generate a sub-sample for labeling. However, these models are either hard to train or break down in high dimensional problems, so a different solution is needed.

Assuming the unlabeled pool is large enough to represent the true distribution, we can instead ask for every example how certain we are that it came from the unlabeled pool as opposed to our current labeled set. If the examples from the labeled set are indistinguishable from the unlabeled pool, then we have successfully represented the distribution with our labeled set. Another intuitive motivation is that if we can say with high probability that an unlabeled example came from the unlabeled set, then it is different from our current labeled examples and labeling it should be informative.

This proposition allows us to pose the active learning problem as a binary classification problem between the unlabeled class, $\mathcal{U}$, and the labeled class, $\mathcal{L}$. Given a simple binary classification task, we can now leverage powerful modern classification models to solve our problem.

Formally, with $\Psi : \mathcal{X} \rightarrow \hat{\mathcal{X}}$ being a mapping from the original input space to some learned representation (e.g. the representation learned to solve the original task using $\mathcal{L}$), we define a binary classification problem with $\hat{\mathcal{X}}$ as our input space and $\mathcal{Y} = \{l, u\}$ as our label space, where $l$ is the label for a sample being in the labeled set and $u$ is the label for the unlabeled set. For every iteration of the active learning process, we solve the classification problem over $\mathcal{U} \cup \mathcal{L}$ by minimizing the log loss and obtaining a model $\hat{P}(y \mid \Psi(x))$. We then select the top-$K$ samples that satisfy $\arg\max_{x \in \mathcal{U}} \hat{P}(y = u \mid \Psi(x))$.
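The selection rule can be illustrated with a small discriminator trained on representations. The sketch below swaps the neural network for plain logistic regression fitted by gradient descent, purely so the example stays self-contained; the function names, hyperparameters and one-dimensional toy representations are all assumptions.

```python
import math

def train_discriminator(labeled, unlabeled, epochs=500, lr=0.5):
    """Fit logistic regression on representations: target 0 = labeled, 1 = unlabeled."""
    data = [(x, 0.0) for x in labeled] + [(x, 1.0) for x in unlabeled]
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            for d in range(dim):
                grad_w[d] += (p - y) * x[d]   # gradient of the log loss
            grad_b += p - y
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    # Return a function giving the estimated probability of being unlabeled.
    return lambda x: 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

def dal_select(labeled, unlabeled, k):
    """Indices of the k pool points the discriminator is most sure are unlabeled."""
    p_unlabeled = train_discriminator(labeled, unlabeled)
    ranked = sorted(range(len(unlabeled)),
                    key=lambda i: p_unlabeled(unlabeled[i]), reverse=True)
    return ranked[:k]

# Labeled points cover only the left cluster; the pool also has a right cluster.
labeled = [(0.0,), (0.2,), (0.4,)]
unlabeled = [(0.1,), (0.3,), (3.0,), (3.2,)]
picked = dal_select(labeled, unlabeled, k=2)
```

The two pool points far from every labeled example receive the highest scores, which matches the intuition that they are the most informative to label.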

It would seem at first that our method ignores important available information: the labels of $\mathcal{L}$. While it is true that the discriminative part of our method ignores the labels, the representation $\Psi$ is learned using these labels (and possibly through semi-supervised methods). This learned representation is important, as using the original input space significantly hurts the performance of DAL in practice.

3.1 Relation to Domain Adaptation

We next show a connection to domain adaptation in the binary classification setting. This connection gives some theoretical motivation to our method.

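Domain adaptation deals with the problem of bounding the error of a classifier trained on a source distribution $\mathcal{D}_S$ but then tested on a different target distribution $\mathcal{D}_T$. Naturally, we are interested in how different the two distributions are. Following (Kifer et al., 2004), (Ben-David et al., 2010) suggests the $\mathcal{H}$-divergence as a relevant distance:

Definition 3.1

Given a domain $\mathcal{X}$ and two distributions, $\mathcal{D}_S$ and $\mathcal{D}_T$, over $\mathcal{X}$, and given a hypothesis class $\mathcal{H}$ over $\mathcal{X}$, the $\mathcal{H}$-divergence between $\mathcal{D}_S$ and $\mathcal{D}_T$ is

$$d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) = 2 \sup_{h \in \mathcal{H}} \left| \Pr_{x \sim \mathcal{D}_S}[h(x) = 1] - \Pr_{x \sim \mathcal{D}_T}[h(x) = 1] \right|$$

Furthermore, (Ben-David et al., 2010) relates the domain adaptation error to $d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)$, showing that the test error over $\mathcal{D}_T$ of a classifier which was trained on $\mathcal{D}_S$ is bounded by a term that depends on $d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)$.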

The similarity to our case is straightforward: we train a predictor based on the labeled examples, and expect it to also work well on the unlabeled examples. To do so, we would like the distributions over the labeled and unlabeled examples to be as similar as possible.

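Formally, we will construct source and target distributions as follows: the “target” distribution will be the distribution which is focused evenly on the examples in the unlabeled set, namely, $\mathcal{D}_T(x) = \frac{1}{|\mathcal{U}|} \mathbb{1}[x \in \mathcal{U}]$. Similarly, the “source” distribution will be $\mathcal{D}_S(x) = \frac{1}{|\mathcal{L}|} \mathbb{1}[x \in \mathcal{L}]$. Observe that the divergence becomes

$$d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) = 2 \sup_{h \in \mathcal{H}} \left| \frac{1}{|\mathcal{L}|} \sum_{x \in \mathcal{L}} \mathbb{1}[h(x) = 1] - \frac{1}{|\mathcal{U}|} \sum_{x \in \mathcal{U}} \mathbb{1}[h(x) = 1] \right|$$

Solving this problem is equivalent to finding a classifier in $\mathcal{H}$ that discriminates between $\mathcal{L}$ and $\mathcal{U}$. So, if we want to minimize the difference in performance between source and target distributions, we should minimize $d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)$. DAL can be viewed as a proxy to minimizing $d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)$ in a greedy manner: at each iteration, we first (approximately) find the $h$ that achieves the supremum in the definition of $d_{\mathcal{H}}$ and then we move an example from $\mathcal{U}$ to $\mathcal{L}$ so as to decrease $d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)$.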
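A small numerical example may help show how moving one example from the unlabeled pool to the labeled set lowers the empirical divergence. The sketch assumes a hypothetical finite hypothesis class of one-dimensional threshold classifiers, chosen so the supremum reduces to a maximum over three candidates; none of these names or values come from the paper.

```python
def empirical_h_divergence(labeled, unlabeled, hypotheses):
    """2 * max over h of |fraction of labeled with h(x)=1 - fraction of unlabeled with h(x)=1|."""
    def rate(h, xs):
        return sum(1 for x in xs if h(x) == 1) / len(xs)
    return 2.0 * max(abs(rate(h, labeled) - rate(h, unlabeled)) for h in hypotheses)

# Hypothetical hypothesis class: threshold classifiers h_t(x) = 1[x >= t].
thresholds = [0.5, 1.5, 2.5]
hypotheses = [lambda x, t=t: int(x >= t) for t in thresholds]

# Before: the labeled set misses the right-hand part of the pool.
before = empirical_h_divergence([0.0, 1.0], [0.0, 1.0, 2.0, 3.0], hypotheses)

# After: the point 3.0 is moved from the unlabeled pool to the labeled set.
after = empirical_h_divergence([0.0, 1.0, 3.0], [0.0, 1.0, 2.0], hypotheses)
```

Here the divergence drops from 1.0 to 2/3 after the move, which is the greedy step the paragraph above describes.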

3.1.1 Connection to domain adaptation by backpropagation

It is interesting to compare our method to the unsupervised domain adaptation method introduced in (Ganin & Lempitsky, 2014). Their method aims to learn a neural network that will generalize well to a target domain by demanding the representation both minimize the training loss on the source domain and maximize the binary classification loss of a classifier that distinguishes between the source and target representations. Here, the classification and domain adaptation objectives are trained jointly, while in DAL they are decoupled - the representation is learned by optimizing the classification objective and the sample to query is then chosen using the domain adaptation objective given the learned representation.

It would be interesting to use this domain adaptation framework in an active learning setting, as using the unlabeled pool in addition to the labeled set to learn the representation could be beneficial. This can be seen as an extension of DAL, which we leave as future work.

3.2 Relation to GANs

Our method is reminiscent of generative adversarial networks (Goodfellow et al., 2014) in that it attempts to fool a discriminator which tries to distinguish between data coming from two different distributions. However, while the distribution created by the generator in GANs is differentiable and changes according to the gradient of the discriminator in an attempt to maximize the discriminator loss, the active learning setting does not allow us to change the distribution of $\mathcal{L}$ in a differentiable way. Rather, we may only change $\mathcal{L}$ by transferring examples from $\mathcal{U}$, which is a discrete and non-differentiable decision.

In that sense, DAL’s choice of example being the one which the discriminator is most confident came from $\mathcal{U}$ can be seen as a greedy maximization of the loss of the discriminator, extending the GAN objective to a non-differentiable setting. Instead of modeling the distribution of $\mathcal{U}$ with a neural network (the generator), we model the distribution using $\mathcal{L}$, which is a subset of $\mathcal{U}$. Since we train the discriminator until convergence before choosing the examples to label, it gets to observe all of the examples of $\mathcal{L}$ and $\mathcal{U}$ and more easily detect modes which exist only in $\mathcal{U}$. This aids in avoiding the problem of mode collapse often exhibited by GANs.

It should be noted that there have been direct attempts to use GANs for active learning (Zhu & Bento, 2017) in the membership query synthesis active learning setting (Settles, 2012), but the results so far have been preliminary and do not improve upon random sampling when evaluating the method on real datasets. Since so far GANs have been hard to train and unstable when working on real data, our method allows a simpler way of tackling the objective of GANs for active learning.

3.3 Difference from the core set approach

It should be noted that similarly to the core set approach, our approach eventually covers the unlabeled data manifold with labeled examples, while not explicitly optimizing an $\epsilon$-cover objective. This could suggest a similarity between the two methods, but they differ on data manifolds which aren’t uniform in their density.

The core set approach attempts to cover all of the points in the manifold without considering density, which leads it to over-represent examples which come from sparse regions of the manifold. On the other hand, assuming DAL is able to represent $\hat{P}(y \mid \Psi(x))$ correctly, it will choose examples such that the ratio of labeled to unlabeled examples remains the same over the entire manifold. This means that $\mathcal{L}$ will not be biased towards sparse regions of the manifold, and will contain samples from $\mathcal{U}$ in proportion to the density. Experiments comparing the two methods on synthetic data confirm this intuition. However, this does not guarantee our method should perform better than the core set approach, since a bias towards sparse regions of the manifold may be advantageous in certain situations.

4 Practical considerations

We move to detailing some of the practical considerations that arise when implementing DAL.

4.1 Trading off speed for query diversity

As defined above, our method is greedy and chooses the top-$K$ examples to be labeled according to our score function, without consideration for the diversity of the batch. Observe however that if we keep $\Psi$ fixed, we do not need a human to label the chosen examples in order to move an example from $\mathcal{U}$ to $\mathcal{L}$ and continue optimizing. The representation of the chosen example stays the same and all we need to do is mark it as labeled. If the sample size we want to query is $K$, we can split the query into $n$ mini-queries by first choosing only the top-$K/n$ examples to be labeled using our method, moving those examples to $\mathcal{L}$ without actually labeling them yet, training a new classifier on the new $\mathcal{L}$ and $\mathcal{U}$, and repeating the process. This ensures the eventual chosen query is diverse, since consecutive mini-queries will be less likely to contain similar instances. $n$, the number of times we split the query and repeat this process, gives us a parameter with which to trade off between the running time of our algorithm and the amount of awareness it has for the diversity of its query. The final algorithm is detailed in Algorithm 1. In practice, we see an improvement in performance when using this kind of mini-query, but after a certain number of mini-queries the improvement levels off. For the datasets used in these experiments, we found 10-20 mini-queries to perform well.

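  Input: $\mathcal{L}$, $\mathcal{U}$, $K$, $n$ {$K$ is the total budget, $n$ is the number of mini-queries}
  for $i = 1 \dots n$ do
     $\hat{P} \leftarrow$ train a binary classifier on $\mathcal{L}$ and $\mathcal{U}$
     for $j = 1 \dots K/n$ do
        $\hat{x} \leftarrow \arg\max_{x \in \mathcal{U}} \hat{P}(y = u \mid \Psi(x))$
        $\mathcal{L} \leftarrow \mathcal{L} \cup \{\hat{x}\}$
        $\mathcal{U} \leftarrow \mathcal{U} \setminus \{\hat{x}\}$
     end for
  end for
  return $\mathcal{L}$, $\mathcal{U}$
Algorithm 1 Discriminative Active Learning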
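The effect of splitting the budget into mini-queries can be seen in a short sketch of Algorithm 1. To keep it self-contained, the trained neural discriminator is replaced by a stand-in: the fraction of unlabeled points among the k nearest neighbors of an example. That substitution, the function names, and the toy one-dimensional data are assumptions for illustration, not the paper's implementation.

```python
import math

def knn_p_unlabeled(x, labeled, unlabeled, k=3):
    """Stand-in discriminator: fraction of unlabeled points among the k nearest
    neighbors of x (x itself is in the pool, so it counts as its own neighbor)."""
    neighbors = [(math.dist(x, p), 0) for p in labeled]
    neighbors += [(math.dist(x, p), 1) for p in unlabeled]
    neighbors.sort()
    return sum(flag for _, flag in neighbors[:k]) / k

def dal_query(labeled, unlabeled, budget, n_mini):
    """Split a budget into n_mini mini-queries, re-scoring the pool after each one.

    Assumes budget is divisible by n_mini.
    """
    labeled, unlabeled = list(labeled), list(unlabeled)
    chosen = []
    for _ in range(n_mini):
        # "Train" the discriminator on the current split (here: a kNN lookup).
        scores = [knn_p_unlabeled(x, labeled, unlabeled) for x in unlabeled]
        ranked = sorted(range(len(unlabeled)), key=lambda i: scores[i], reverse=True)
        picked = sorted(ranked[: budget // n_mini], reverse=True)
        for i in picked:               # pop from the back so indices stay valid
            x = unlabeled.pop(i)
            labeled.append(x)          # marked as labeled; no oracle needed yet
            chosen.append(x)
    return chosen

labeled = [(0.0,), (0.1,), (0.2,)]
unlabeled = [(0.05,), (5.0,), (5.1,), (5.2,), (10.0,), (10.1,), (10.2,)]

greedy = dal_query(labeled, unlabeled, budget=2, n_mini=1)  # one shot
split = dal_query(labeled, unlabeled, budget=2, n_mini=2)   # two mini-queries
```

With a single query both selected points come from the same cluster, whereas two mini-queries select one point from each uncovered cluster, since the first pick lowers the score of its neighbors before the second is made.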

4.2 Choosing the architecture for the binary classification

Since the learned representation isn’t structured in a way that should give a convolutional architecture any advantage, we chose to use a multi-layered perceptron as the architecture for solving the binary classification task. The chosen architecture was arbitrary, having three hidden layers with a width of 256 and ReLU activation functions.

4.3 Training the binary classifier

Note that we are not solving the binary classification problem with generalization in mind, but rather we are trying to fit the training set as well as possible and detect the examples which are harder to fit. A large neural network is often expressive enough to achieve a loss of zero on this kind of binary training set, but it would find it harder to fit samples which are nearby and of different labels. This means that using a small neural network or using early stopping can ensure that the samples which don’t have a high probability of coming from $\mathcal{U}$ will have a lower $\hat{P}(y = u \mid \Psi(x))$. In practice, we found that for an unlabeled pool 10 times the size of the labeled set, stopping the training process when the training accuracy reached 98% performed well, but changing this value slightly didn’t have a significant effect on performance.
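The stopping rule itself is simple enough to sketch independently of any framework. In the snippet below, `train_one_epoch` and `training_accuracy` are hypothetical callables supplied by whatever training code is in use, and the toy model whose accuracy rises by ten points per epoch exists only to exercise the loop.

```python
def train_until_accuracy(train_one_epoch, training_accuracy, target=0.98, max_epochs=1000):
    """Run epochs until training accuracy reaches `target`; return the epochs used."""
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        if training_accuracy() >= target:
            return epoch
    return max_epochs

# Toy stand-in for a discriminator whose training accuracy rises each epoch.
state = {"acc": 0.5}

def fake_epoch():
    state["acc"] = min(1.0, state["acc"] + 0.1)

epochs_used = train_until_accuracy(fake_epoch, lambda: state["acc"])
```

Stopping on training accuracy rather than validation accuracy is deliberate here, since the discriminator is not meant to generalize.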

5 Empirical evaluation

5.1 Datasets and models

We evaluated our method on two datasets, MNIST (LeCun, 1998) and CIFAR-10 (Krizhevsky & Hinton, 2009). We used the same model to evaluate the labeled dataset of every method. For the MNIST task our classification model was a simple LeNet architecture (LeCun et al., 1998), while for the CIFAR-10 task we used VGG-16 (Simonyan & Zisserman, 2014) as our architecture.

5.2 Experimental setup

There isn’t a standard methodology for evaluating active learning algorithms with neural networks today. In reviewing the literature, an important factor which changes between algorithm evaluations is the query batch size, ranging from dozens of examples all the way up to thousands. Since the query batch size is expected to have a significant effect on the performance of the different methods, we evaluate the methods with query batch sizes of different orders of magnitude.

A single experiment for an algorithm entailed starting with an initial random batch which is identical for all algorithms. We then select a batch to query using the algorithm, add it to the labeled dataset and repeat this process a predetermined number of times. In every intermediate step we evaluate the current labeled dataset by training the model on the dataset for many epochs and saving the weights of the epoch with the best validation accuracy, where a random 20% of the labeled set is reserved as the validation set. The test accuracy of the best model is then recorded. To reduce the variance from the stochastic training process of neural networks, we average our results over 20 experiments for every algorithm. We optimize the model using the Adam optimizer (Kingma & Ba, 2014) with the default learning rate of 0.001.

We have released the complete code used for these experiments, including the implementations of DAL and the other baseline methods (Gissin, ).

5.3 Evaluated algorithms

We compare our method to the following baselines:

  • Random: the query batch is chosen uniformly at random out of the unlabeled set.

  • Uncertainty: the batch is chosen according to the uncertainty sampling decision rule. We report the best performance out of the max-entropy and variation-ratio acquisition functions.

  • DBAL (Deep Bayesian Active Learning): according to (Gal et al., 2017), we improve the uncertainty measures of our neural network with Monte Carlo dropout and report the best performance out of both the max-entropy and variation-ratio acquisition functions.

  • DFAL (DeepFool Active Learning): according to (Ducoffe & Precioso, 2018), we choose the examples with the closest adversarial example. We used cleverhans’ (Papernot et al., 2018) implementation of the DeepFool attack (Moosavi-Dezfooli et al., 2016).

  • EGL (Expected Gradient Length): according to (Huang et al., 2016), we choose the examples with the largest expected gradient norm, with the expectation over the posterior distribution of labels for the example according to the trained model.

  • Core-Set: we choose the examples that best cover the dataset in the learned representation space. To approximate the $\epsilon$-cover objective, we use the greedy farthest-first traversal algorithm and use it to initialize the IP formulation detailed in (Sener & Savarese, 2018). We use Gurobi (Gurobi Optimization, 2018) to iteratively solve the integer program.

  • DAL (Discriminative Active Learning): our method, detailed above.

Figure 1: Accuracy plots of the different algorithms for the MNIST active learning experiments. The results are averaged over 20 experiments and the error bars are empirical standard deviations. (a) Query batch size of 10. (b) Query batch size of 100. (c) Query batch size of 1000.

5.4 Results

In Figures 1 and 2 we compare the different algorithms on the MNIST and CIFAR-10 datasets, over a large range of query batch sizes.

A clear initial observation from the experiments is that, aside from Core-Set on very small batch sizes and EGL in general, all algorithms consistently outperform random sampling. For Uncertainty Sampling, this contrasts with past results in (Ducoffe & Precioso, 2018) and (Sener & Savarese, 2018), which show uncertainty-based methods performing on par with or worse than random sampling. Those results were attributed to the batch-mode setting, causing the examples sampled by uncertainty sampling to be highly correlated. However, our experiments show consistently that this isn’t the case even for very large batch sizes. For (Ducoffe & Precioso, 2018), after reviewing the published code we believe this is due to an implementation error of the max-entropy acquisition function. As for (Sener & Savarese, 2018), we are unable to offer a good explanation for the discrepancy with our result since their implementation of uncertainty-based methods was not published.

As for EGL, our result is consistent with (Ducoffe & Precioso, 2018), and an inspection of the queries made by EGL shows each query has a very high bias towards examples from a single class. A possible explanation for the discrepancy between these results and the strong results of EGL in (Huang et al., 2016) may be the architecture used for the different tasks. EGL was shown to perform well on text and speech tasks, which use recurrent architectures, while our implementation uses convolutional architectures. Since EGL uses the gradient with respect to the model parameters as the score function, it is quite plausible that the architecture used can have a big effect on the performance of the method.

The next observation is that once the batch size is relatively large, which is the regime most relevant to large-scale machine learning applications, all of the baseline methods are on par with each other, with results within the statistical margin of error. For small batch sizes of 10 and 100, we see a clear distinction between the performance of the uncertainty-based methods and the performance of DAL and Core-Set. The uncertainty-based methods perform best, with a small advantage to DFAL in the early stages of the active learning process. Between DAL and Core-Set, DAL is the clear winner in small batch sizes, with Core-Set performing similarly to random sampling for very small batch sizes. However, once the batch size is in the thousands and we observe the results for CIFAR-10, a more realistic dataset with larger query batch sizes, we see all of the methods having essentially the same performance while clearly beating random sampling.

Figure 2: Accuracy plots of the different algorithms for the CIFAR-10 active learning experiments. The results are averaged over 20 experiments and the error bars are empirical standard deviations. (a) Query batch size of 1000. (b) Query batch size of 5000. (c) Query batch size of 10000.

5.5 Ranking comparison

Following (Huang et al., 2016), we compare the rankings given by the different algorithms to the MNIST unlabeled dataset after the first iteration of the experiment. For pairs of algorithms, we plot the rankings in a 2D ranking-vs-ranking coordinate system, such that a plot close to the diagonal implies the two algorithms rank the unlabeled dataset in a similar way.
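The paper compares rankings visually; a single-number summary of the same comparison is the Spearman rank correlation, sketched below as an assumed stand-in for the plots. The score lists are made-up values, and ties are broken by index, which is a simplification.

```python
def ranks(scores):
    """Rank positions (0 = highest score); ties are broken by index for simplicity."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    position = [0] * len(scores)
    for rank, i in enumerate(order):
        position[i] = rank
    return position

def spearman(scores_a, scores_b):
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2) / (n*(n^2-1))."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((a - b) ** 2 for a, b in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

entropy_scores = [0.9, 0.1, 0.5, 0.7]
margin_scores = [0.8, 0.2, 0.4, 0.6]   # different values, same ordering
other_scores = [0.1, 0.9, 0.7, 0.5]    # exactly reversed ordering

same_order = spearman(entropy_scores, margin_scores)
reversed_order = spearman(entropy_scores, other_scores)
```

A value near 1 corresponds to a plot hugging the diagonal, a value near 0 to an uncorrelated scatter, and a value near -1 to a reversed ranking.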

Figure 3 compares Uncertainty, DBAL, DFAL and DAL. Core-Set is not compared since it does not output a ranking over the unlabeled set, but rather only outputs a query batch.

Figure 3: Comparing the ranking of the MNIST dataset according to the score of different algorithms. (a) Uncertainty vs DBAL. (b) Uncertainty vs DFAL. (c) Uncertainty vs DAL.

Unsurprisingly, we observe that Uncertainty and DBAL are similar since their acquisition function is the same and only the confidence measures slightly changed. On the other hand, while DFAL uses a very different method for ranking, we see that it ends up ranking the unlabeled set in a very similar way to DBAL and Uncertainty, suggesting that even in a very non-convex function like neural networks, a margin-based approach behaves similarly to an uncertainty-based one.

When comparing DAL to the other methods however, we see a big difference in the ranking which is completely uncorrelated with the rankings of the other methods. This suggests that DAL selects unlabeled examples in a completely different way than the baseline methods, due to the formulation of the active learning objective as a binary classification problem.

6 Conclusion

In this paper we propose a new active learning algorithm, DAL, which poses the active learning objective as a binary classification problem and attempts to make the labeled set indistinguishable from the unlabeled pool. We empirically show a strong similarity between the uncertainty and margin based methods in use today, and show our method to be completely different while obtaining results on par with the state of the art methods for image classification on medium and large query batch sizes. In addition, our method is simple to implement and can conceptually be extended to other domains, and so we see it as a welcome addition to the arsenal of methods in use today.

7 Acknowledgments

This research is supported by the European Research Council (TheoryDL project).

References