Effective Version Space Reduction for Convolutional Neural Networks

06/22/2020 ∙ by Jiayu Liu, et al.

In active learning, sampling bias could pose a serious inconsistency problem and hinder the algorithm from finding the optimal hypothesis. However, many methods for neural networks are hypothesis space agnostic and do not address this problem. We examine active learning with convolutional neural networks through the principled lens of version space reduction. We identify the connection between two approaches—prior mass reduction and diameter reduction—and propose a new diameter-based querying method—the minimum Gibbs-vote disagreement. By estimating version space diameter and bias, we illustrate how the version space of neural networks evolves and examine the realizability assumption. With experiments on MNIST, Fashion-MNIST, SVHN and STL-10 datasets, we demonstrate that diameter reduction methods reduce the version space more effectively and perform better than prior mass reduction and other baselines, and that the Gibbs-vote disagreement is on par with the best query method.

1 Introduction

Active learning is a supervised learning framework in which the learner is given access to a pool or stream of unlabeled samples and is allowed to selectively query labels from an oracle (e.g., a human annotator). In each query round, the learner queries the labels of some unlabeled samples and trains on the augmented labeled set to obtain new classifiers. The goal is to learn a good classifier or hypothesis using as few labels as possible. This setting is relevant in many real-world problems, where labeled data are scarce or expensive to obtain, but unlabeled data are cheap and abundant.

Figure 1: Version space reduction for binary classification. Upon observing the label of the queried sample, the current version space is split into two subspaces, one of which is removed while the other remains. Left: Prior mass reduction methods remove approximately half of the mass. Middle: Diameter reduction methods, like pairwise disagreement, query a sample that leads to subspaces of small diameter. Right: The proposed method, the Gibbs-vote disagreement, measures diameter by the expected distance between random hypotheses and their majority vote.

Many active learning methods for neural networks rely on measures of the “informativeness” of a query, in the form of classifier uncertainty, margin [19, 11] or information gain [18, 13, 21]. Other methods capture the informativeness by the representativeness of the query set using geometry-based [27] or discriminative [14] methods. However, most of these methods ignore the notion of the hypothesis space and do not address the problem of sampling bias [9], which plagues many active learning methods. Without carefully handling this problem, an active learning algorithm is not guaranteed to be consistent, i.e., capable of finding the optimal classifier in the hypothesis space.

We consider the hypothesis space of convolutional neural networks (ConvNets) and study version space reduction methods. Version space reduction works by removing hypotheses that are inconsistent with the observed labels from a predefined hypothesis space and maintaining the consistent sub-space, the version space. A key condition called the realizability assumption is that the hypothesis space contains the classifier that provides the ground truth—if not, there are no guarantees that the best hypothesis will not be removed, because a hypothesis might make mistakes on the queried samples but perform well on the data distribution.

For neural networks, the realizability assumption may not hold in all cases. For instance, no neural network can achieve arbitrarily small test error on some classification datasets. A workaround is to consider the effective labelings on a set of i.i.d. pool samples. To avoid an unreasonably large effective hypothesis space, as implied by the result of [33], we only consider the labelings achievable by training on unaltered samples and correct labels. We examine experimentally whether realizability holds under this restriction and analyze its implications for version space reduction methods.

Prior mass reduction [7, 15, 5] and diameter reduction [8, 30] are two widely used version space reduction approaches. See Fig. 1 for an illustration. However, prior mass reduction is not an appropriate objective for active learning [30], since an intermediate version space containing more than one hypothesis may still have a large diameter, i.e., a large worst-case error rate, despite having substantially reduced mass. We derive connections between prior mass and diameter reduction and introduce a new interpretation of diameter reduction as prior mass “reducibility reduction”.

We propose a new diameter measure called the Gibbs-vote disagreement, which equals the expected distance between the random hypotheses and their majority vote classifier. We show its relation to a common diameter measure, the pairwise disagreement, and discuss under which situations the former may be advantageous. We show experimentally on four image classification datasets that diameter reduction methods perform better than all baselines and that prior mass reduction [7, 15, 5] and other baselines like [18, 13, 27, 11] do not perform consistently better than random query and sometimes fail completely.

2 Related Work

A substantial body of work studies the label complexity of active learning and optimality guarantees for greedy version space reduction. Hanneke [16] and Balcan et al. [1] prove upper bounds on the label complexity in the realizable and non-realizable cases, using a parameter called the disagreement coefficient. Tosh and Dasgupta [30] propose a diameter-based active learning algorithm and characterize its sample complexity using a parameter called the splitting index. Dasgupta [7] shows that a greedy strategy maximizing the worst-case prior mass reduction is approximately as good as the optimal strategy. Golovin and Krause [15] show that the prior mass reduction utility function is adaptive submodular and that a greedy algorithm is guaranteed to obtain near-optimal solutions in the average-case scenario. Cuong et al. [6] prove a worst-case optimality guarantee for pointwise submodular functions.

A variety of methods relying on the informativeness of a query have been proposed for neural networks. Gal et al. [13] use the Monte Carlo dropout to approximate the mutual information between predictions and model posterior [18] in a Bayesian setting. Kirsch et al. [21] extend [18, 13] to a batch query method. Ducoffe and Precioso [11] use adversarial attacks to generate samples close to the decision boundaries. Sener and Savarese [27] adopt a core-set approach to select representative samples for query. Gissin and Shalev-Shwartz [14] use a discriminative method to select samples such that the labeled and the unlabeled set are indistinguishable. Pinsler et al. [26] formulate batch query as a sparse approximation to the expected complete data posterior of model parameters in a Bayesian setting. Beluch et al. [3] show that ensemble methods consistently outperform geometry-based methods [27] and the Monte Carlo dropout method [12, 13].

3 Preliminaries

Let $\mathcal{X}$ be the input feature space and $\mathcal{Y}$ the label space. Let $\mathcal{H}$ be a hypothesis space of functions $h : \mathcal{X} \to \mathcal{Y}$ and assume a prior $\pi$ over $\mathcal{H}$. A hypothesis randomly drawn from the prior is called a Gibbs classifier. Denote by $X$ a pool of i.i.d. samples from the data distribution and by $L$ the set of queried labeled samples. Define the version space corresponding to $L$ as

$V(L) = \{h \in \mathcal{H} : h(x) = y \ \text{for all}\ (x, y) \in L\}.$   (1)

Denote the subset of a version space $V$ that is consistent with $x$ being labeled as $y$ by

$V_{x,y} = \{h \in V : h(x) = y\}.$   (2)

The disagreement probability induced by the marginal distribution $\mathcal{D}_{\mathcal{X}}$ is defined as

$d(h, h') = \Pr_{x \sim \mathcal{D}_{\mathcal{X}}}\big[h(x) \neq h'(x)\big],$   (3)

which is a pseudo-metric on the hypothesis space. The disagreement and agreement regions of a version space $V$ are defined as

$\mathrm{DIS}(V) = \{x \in \mathcal{X} : \exists\, h, h' \in V,\ h(x) \neq h'(x)\},$   (4)
$\mathrm{AGR}(V) = \mathcal{X} \setminus \mathrm{DIS}(V).$   (5)
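On a finite set of unlabeled samples, the pseudo-metric in (3) is estimated by the fraction of samples on which two hypotheses predict different labels; a minimal numpy sketch (the function name is ours, not from the paper's released code):

```python
import numpy as np

def disagreement(preds_a: np.ndarray, preds_b: np.ndarray) -> float:
    """Empirical disagreement probability between two classifiers, estimated
    as the fraction of a finite i.i.d. unlabeled sample on which their
    predicted labels differ. Both inputs have shape (n_samples,)."""
    assert preds_a.shape == preds_b.shape
    return float(np.mean(preds_a != preds_b))
```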

4 Prior Mass Reduction

4.1 Gibbs Error

The Gibbs error [5] of an unlabeled sample $x$ is the average-case relative prior mass reduction:

$G(x) = \mathbb{E}_{y \sim p_x}\big[1 - p_x(y)\big] = 1 - \sum_{y \in \mathcal{Y}} p_x(y)^2, \qquad p_x(y) := \pi_V(V_{x,y}),$   (6)

where $\pi_V$ is the conditional distribution of $\pi$ restricted to $V$. The Gibbs error measures the proportion of inconsistent hypotheses, taking the expectation over all possible labelings of $x$ achievable by hypotheses in the version space. A greedy strategy that maximizes the average-case absolute prior mass reduction in each query can equivalently select the unlabeled sample that maximizes the Gibbs error,

$x^* \in \arg\max_{x}\ G(x).$   (7)

Define the prior mass reduction utility function as

$f(L) = 1 - \pi(V(L)).$   (8)

The optimization problem in (7) can be written, up to a scaling factor, as

$x^* \in \arg\max_{x}\ \mathbb{E}_{y \sim p_x}\big[f(L \cup \{(x, y)\}) - f(L)\big]$   (9)
$= \arg\max_{x}\ \Delta_{\mathrm{avg}}(x \mid L),$   (10)

where the notation $\Delta_{\mathrm{avg}}(x \mid L)$ denotes the expected marginal gain of $x$ in terms of prior mass reduction given the labeled samples in $L$.

A closely related objective for active learning is the entropy of the label of $x$ given $L$. It can be shown that the Gibbs error lower bounds the entropy. However, a greedy strategy that maximizes the entropy is not guaranteed to be near-optimal in the adaptive case [6]. Furthermore, this criterion empirically performs similarly to or worse than the maximum Gibbs error. For the sake of simplicity, we do not consider this method in this paper.
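As an illustration, the Gibbs error of each candidate can be estimated from an ensemble of hypotheses sampled from the version space by using the ensemble vote fractions as an estimate of the label distribution $p_x$. A minimal sketch, assuming hard integer label predictions (function and variable names are ours):

```python
import numpy as np

def gibbs_error(ensemble_preds: np.ndarray) -> np.ndarray:
    """Estimate the Gibbs error (6) of each unlabeled candidate.

    ensemble_preds: integer label predictions of shape (m, n) for an ensemble
    of m hypotheses sampled from the version space on n candidate samples.
    For each candidate, p_y is the fraction of the ensemble predicting label
    y; the Gibbs error is 1 - sum_y p_y^2.
    """
    m, n = ensemble_preds.shape
    n_classes = int(ensemble_preds.max()) + 1
    errors = np.empty(n)
    for j in range(n):
        counts = np.bincount(ensemble_preds[:, j], minlength=n_classes)
        p = counts / m
        errors[j] = 1.0 - np.sum(p ** 2)
    return errors

# Greedy maximum-Gibbs-error query (7): pick the candidate with the largest error.
# query_idx = int(np.argmax(gibbs_error(ensemble_preds)))
```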

4.2 Variation Ratio

The variation ratio of an unlabeled sample $x$ is the worst-case relative prior mass reduction upon the reveal of its label:

$VR(x) = \min_{y \in \mathcal{Y}}\big[1 - p_x(y)\big] = 1 - \max_{y \in \mathcal{Y}} p_x(y).$   (11)

It measures the proportion of inconsistent hypotheses under the worst-case labeling of $x$ and is a lower bound on the Gibbs error. A greedy strategy that maximizes the worst-case absolute prior mass reduction in each query selects the unlabeled sample that maximizes the variation ratio,

$x^* \in \arg\max_{x}\ VR(x),$   (12)

which can be expressed in terms of the prior mass reduction utility function, up to a scaling factor, as

$x^* \in \arg\max_{x}\ \min_{y \in \mathcal{Y}}\big[f(L \cup \{(x, y)\}) - f(L)\big]$   (13)
$= \arg\max_{x}\ \Delta_{\mathrm{wc}}(x \mid L),$   (14)

where the notation $\Delta_{\mathrm{wc}}(x \mid L)$ denotes the worst-case marginal gain of $x$ in terms of prior mass reduction given the labeled samples in $L$.
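The variation ratio admits the analogous ensemble estimate, one minus the fraction of members voting for the plurality label; a short sketch under the same assumptions as above:

```python
import numpy as np

def variation_ratio(ensemble_preds: np.ndarray) -> np.ndarray:
    """Estimate the variation ratio (11) of each candidate: one minus the
    fraction of the ensemble voting for the plurality label.
    ensemble_preds: integer label predictions of shape (m, n)."""
    m, n = ensemble_preds.shape
    n_classes = int(ensemble_preds.max()) + 1
    ratios = np.empty(n)
    for j in range(n):
        counts = np.bincount(ensemble_preds[:, j], minlength=n_classes)
        ratios[j] = 1.0 - counts.max() / m
    return ratios

# Greedy maximum-variation-ratio query (12):
# query_idx = int(np.argmax(variation_ratio(ensemble_preds)))
```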

5 Diameter Reduction

5.1 Worst-Case Pairwise Disagreement

The size of the version space can be measured by the expected pairwise disagreement between hypotheses drawn from the conditional distribution:

$d_{PW}(V) = \mathbb{E}_{h, h' \sim \pi_V}\big[d(h, h')\big].$   (15)

It is the average diameter of the version space. A greedy strategy selects the unlabeled sample that minimizes the worst-case pairwise disagreement,

$x^* \in \arg\min_{x}\ \max_{y \in \mathcal{Y}}\ d_{PW}(V_{x,y}).$   (16)

Other measures of diameter based on the supremum distance [20, 8] are not amenable to implementation because evaluation of such diameters involves optimization. The pairwise disagreement can be estimated from a finite set of sample hypotheses from the version space.
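With an ensemble of hypotheses sampled from the version space, the pairwise disagreement can be estimated on a held-out unlabeled set, mirroring the estimator given in Appendix 0.A.2; a minimal numpy sketch (names are ours, not the paper's code):

```python
import numpy as np

def pairwise_disagreement(ensemble_preds: np.ndarray) -> float:
    """Estimate the average pairwise disagreement (15) of a version space.

    ensemble_preds: label predictions of shape (m, n) for m sampled
    hypotheses on n i.i.d. evaluation samples. Returns the average, over all
    ordered pairs (i, j) with i != j, of the empirical disagreement
    probability between hypotheses i and j.
    """
    m = ensemble_preds.shape[0]
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i != j:
                total += np.mean(ensemble_preds[i] != ensemble_preds[j])
    return total / (m * (m - 1))
```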

5.2 Worst-Case Gibbs-Vote Disagreement

We propose to use a new diameter measure called the Gibbs-vote disagreement. It is the expected disagreement between random hypotheses and their majority vote:

$d_{GV}(V) = \mathbb{E}_{h \sim \pi_V}\big[d(h, h_{MV})\big],$   (17)

where $h_{MV}$ is the majority vote classifier of hypotheses from $\pi_V$. For each $x$, it induces the prediction

$h_{MV}(x) = \arg\max_{y \in \mathcal{Y}}\ \mathbb{E}_{h \sim \pi_V}\big[p_h(y \mid x)\big],$   (18)

where $p_h(y \mid x)$ is the predicted probability of $x$ belonging to class $y$ given by a hypothesis $h$. The majority vote classifier is the deterministic classifier that has the smallest expected distance to the Gibbs classifier [20, 10]:

$h_{MV} \in \arg\min_{g : \mathcal{X} \to \mathcal{Y}}\ \mathbb{E}_{h \sim \pi_V}\big[d(h, g)\big].$   (19)

Hence the Gibbs-vote disagreement measures the size of the version space by the expected distance of the random hypotheses to their “center”. Further, the following relation holds:

$\tfrac{1}{2}\, d_{PW}(V) \le d_{GV}(V) \le d_{PW}(V).$   (20)

We defer the proof to the appendix. Essentially, Equation (20) reveals that the Gibbs-vote disagreement is sandwiched between the average radius and the average diameter.

A greedy strategy selects the unlabeled sample that minimizes the worst-case Gibbs-vote disagreement,

$x^* \in \arg\min_{x}\ \max_{y \in \mathcal{Y}}\ \mathbb{E}_{h \sim \pi_{V_{x,y}}}\big[d(h, h_{MV}^{x,y})\big],$   (21)

where $h_{MV}^{x,y}$ is the majority vote of hypotheses from $V_{x,y}$, the subspace of the current version space if $x$ is labeled $y$.
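Under the ensemble approximation, the subspace induced by assigning label $y$ to a candidate $x$ corresponds to the ensemble members that predict $y$ on $x$, and its Gibbs-vote diameter is estimated against the plurality vote of that subset on a held-out unlabeled set. A minimal sketch of the resulting min-max selection rule (our own illustration; the function names and the hard-prediction assumption are ours, not the paper's code):

```python
import numpy as np

def gibbs_vote_diameter(preds: np.ndarray) -> float:
    """Gibbs-vote disagreement (17) of a set of hypotheses, estimated as the
    mean disagreement between each hypothesis and the plurality vote.
    preds: label predictions of shape (k, n_eval) for k hypotheses."""
    n_classes = int(preds.max()) + 1
    votes = np.apply_along_axis(
        lambda col: np.bincount(col, minlength=n_classes).argmax(), 0, preds)
    return float(np.mean(preds != votes[None, :]))

def select_query_gvd(cand_preds: np.ndarray, eval_preds: np.ndarray) -> int:
    """Greedy worst-case Gibbs-vote disagreement query selection (21).

    cand_preds: shape (m, n_cand), ensemble predictions on the unlabeled
    candidates; eval_preds: shape (m, n_eval), ensemble predictions on a
    held-out unlabeled set used to estimate diameters.
    """
    m, n_cand = cand_preds.shape
    best_idx, best_val = 0, np.inf
    for j in range(n_cand):
        worst = 0.0
        for y in np.unique(cand_preds[:, j]):    # labels achievable in V
            members = cand_preds[:, j] == y      # members consistent with (x_j, y)
            if members.sum() >= 2:
                worst = max(worst, gibbs_vote_diameter(eval_preds[members]))
        if worst < best_val:
            best_val, best_idx = worst, j
    return best_idx
```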

5.3 Diameter Reduction as Reducibility Reduction

Pairwise disagreement shares a simple relation with the Gibbs error—it is the expected Gibbs error:

$d_{PW}(V) = \mathbb{E}_{h, h' \sim \pi_V}\ \mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\big[\mathbb{1}[h(x) \neq h'(x)]\big]$   (22)
$= \mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\Big[1 - \sum_{y \in \mathcal{Y}} p_x(y)^2\Big]$   (23)
$= \mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\big[G(x)\big].$   (24)

A similar relation holds between the Gibbs-vote disagreement and the variation ratio:

$d_{GV}(V) = \mathbb{E}_{h \sim \pi_V}\ \mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\big[\mathbb{1}[h(x) \neq h_{MV}(x)]\big]$   (25)
$= \mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\big[\Pr_{h \sim \pi_V}[h(x) \neq h_{MV}(x)]\big]$   (26)
$= \mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\big[VR(x)\big],$   (27)

where the last equality holds because the predictions of the majority vote classifier are always the worst-case labels for prior mass reduction. Diameter reduction selects samples such that, upon revealing their labels, the induced subspaces have the least potential to be further reduced by a random query. Thus, it can be thought of as reducing the expected prior mass “reducibility”.

Prior mass reduction finds splits in directions that evenly partition the version space, but it can result in version spaces with irregular shapes, in the sense that the space is whittled down finely in some directions while remaining under-split in others. The worst-case error rate of the resulting version space can still be large. Diameter reduction resolves this issue. Fig. 1 illustrates the differences between prior mass and diameter reduction.
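These identities can be sanity-checked numerically in the ensemble approximation: if the finite ensemble itself is treated as the posterior over hypotheses, averaging the per-sample Gibbs error over an evaluation set reproduces the pairwise disagreement exactly, and averaging the variation ratio reproduces the Gibbs-vote disagreement. A small self-contained check on synthetic predictions (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, c = 20, 500, 10                      # ensemble size, evaluation samples, classes
preds = rng.integers(0, c, size=(m, n))    # stand-in hard predictions of m hypotheses

# Vote fractions p_x(y), treating the finite ensemble itself as the posterior pi_V.
p = np.stack([(preds == y).mean(axis=0) for y in range(c)])   # shape (c, n)

gibbs_err = 1.0 - (p ** 2).sum(axis=0)     # per-sample Gibbs error
var_ratio = 1.0 - p.max(axis=0)            # per-sample variation ratio

# Pairwise disagreement over all ordered pairs of ensemble members (with replacement).
pw = np.mean([np.mean(preds[i] != preds[j]) for i in range(m) for j in range(m)])
# Gibbs-vote disagreement against the plurality-vote predictions.
gv = np.mean(preds != p.argmax(axis=0)[None, :])

assert np.isclose(pw, gibbs_err.mean())    # pairwise disagreement = expected Gibbs error
assert np.isclose(gv, var_ratio.mean())    # Gibbs-vote disagreement = expected variation ratio
```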

5.4 Weighted Diameter Reduction

Tosh and Dasgupta [30] show that, in general, the average diameter cannot be decreased at a steady rate, and they propose to query the unlabeled samples that minimize the diameter weighted by the squared prior mass in the worst-case scenario,

$x^* \in \arg\min_{x}\ \max_{y \in \mathcal{Y}}\ \pi(V_{x,y})^2\, d_{PW}(V_{x,y})$   (28)
$= \arg\min_{x}\ \max_{y \in \mathcal{Y}}\ \Phi(V_{x,y}),$   (29)

where $\Phi(V) := \pi(V)^2\, d_{PW}(V)$. The potential $\Phi$ to be minimized is a surrogate for the “amount” of edges between hypotheses and is closely related to the splittability of the version space [30, 8].
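In the ensemble approximation, the prior mass of an induced subspace can be estimated by the fraction of ensemble members it contains, so the squared-mass-weighted worst-case diameter described above can be computed alongside the unweighted one; a brief sketch under the same conventions as the earlier snippets (our own illustration, not the authors' implementation):

```python
import numpy as np

def weighted_potential(cand_preds: np.ndarray, eval_preds: np.ndarray, j: int) -> float:
    """Worst-case squared-mass-weighted diameter for candidate j.

    The mass of the subspace induced by label y is estimated by the fraction
    of ensemble members predicting y on candidate j; its average diameter is
    the pairwise disagreement of those members on eval_preds.
    """
    m = cand_preds.shape[0]
    worst = 0.0
    for y in np.unique(cand_preds[:, j]):
        members = np.where(cand_preds[:, j] == y)[0]
        mass = len(members) / m
        diam = 0.0
        if len(members) >= 2:
            sub = eval_preds[members]
            diam = np.mean([np.mean(sub[a] != sub[b])
                            for a in range(len(sub)) for b in range(len(sub)) if a != b])
        worst = max(worst, mass ** 2 * diam)
    return worst

# M-PWD style query: argmin over candidates j of weighted_potential(cand_preds, eval_preds, j).
```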

6 Realizability Assumption

Even though neural networks are capable of fitting an arbitrary pool set, we show experimentally that the version space obtained by training on a subset of the pool set with stochastic gradient descent—the “samplable” version space—is biased and not likely to contain the correct labeling of the pool set. Indeed, the distance from the Bayes classifier, which provides the ground truth labeling, to the “boundary” of the version space is non-negligible.

Figure 2: Left: Projection of the Bayes classifier to the samplable version space. Right: Wrong agreement of version spaces trained on random samples. Total numbers of samples are 400, 1000, 3000 and 2580 for MNIST, Fashion-MNIST, SVHN and STL-10, respectively.

Let $\tilde{h}$ be the projection of the Bayes classifier $h^*$ to the set of hypotheses that agree with $V$ on $\mathrm{AGR}(V)$ (see the left plot of Fig. 2), i.e.,

$\tilde{h} \in \arg\min_{g \in \mathcal{A}(V)}\ d(h^*, g),$   (30)
$\mathcal{A}(V) := \{g \in \mathcal{H} : g(x) = h(x)\ \ \forall x \in \mathrm{AGR}(V),\ h \in V\}.$   (31)

It is easy to see that $\tilde{h}$ provides the ground truth on $\mathrm{DIS}(V)$ and predicts the same labels on $\mathrm{AGR}(V)$ as hypotheses in $V$ do, hence

$d(h^*, \tilde{h}) = \Pr_{x \sim \mathcal{D}_{\mathcal{X}}}\big[h^*(x) \neq h(x),\ x \in \mathrm{AGR}(V)\big], \qquad h \in V,$   (32)

which is the disagreement probability restricted to $\mathrm{AGR}(V)$, or equivalently the wrong agreement of hypotheses in $V$.
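In the ensemble approximation, the wrong agreement can be estimated as the fraction of evaluation samples on which all sampled hypotheses agree while their common prediction differs from the ground truth; a minimal sketch (our own illustration, with names that are ours):

```python
import numpy as np

def wrong_agreement(ensemble_preds: np.ndarray, true_labels: np.ndarray) -> float:
    """Fraction of samples in the empirical agreement region on which the
    hypotheses unanimously predict a label different from the ground truth.

    ensemble_preds: shape (m, n) label predictions of m sampled hypotheses;
    true_labels: shape (n,) ground-truth labels of the same samples.
    """
    unanimous = np.all(ensemble_preds == ensemble_preds[0:1], axis=0)  # empirical AGR(V)
    wrong = ensemble_preds[0] != true_labels
    return float(np.mean(unanimous & wrong))
```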

We show the evolution of the wrong agreement in the right plot of Fig. 2. As more random samples are queried, the wrong agreement decreases for all datasets, but much more slowly for some than for others. In Fig. 3, we show for MNIST a 2-D embedding of version spaces using Multi-Dimensional Scaling (MDS) [22], which finds a low-dimensional representation of potentially high-dimensional data by preserving pairwise distances between the data points. The Bayes classifier is not contained in any of the samplable version spaces, although the distances between them decrease steadily.

Figure 3: Embedding of version spaces on MNIST using MDS. As more random samples are used for training, the samplable version spaces move closer to the Bayes classifier but hardly cover it.

In general, neural networks trained on a random subset do not automatically predict all labels in the pool set correctly unless a relatively large proportion of samples is used for training. However, this fact does not render version space reduction inconsistent, because the samplable version space is not fixed: it shifts towards the correct labeling and finally covers it once the whole pool set has been used.

We conjecture that the dynamics of active learning with neural networks have two major components: (1) shrinkage of the samplable version space, which is explicitly optimized by the learning algorithm and (2) reduction of bias, which is not directly controllable. Empirical evidence is provided in the next section.

7 Evaluation

Datasets and Architectures We conduct active learning experiments on four image classification datasets: MNIST, Fashion-MNIST, SVHN and STL-10 (source code is available at https://github.com/jiayu-liu/effective-version-space-reduction-for-convnets). Neural network architectures are chosen to be competent for each dataset but as simple as possible, in the hope of controlling model complexity and mitigating the effect of overfitting. See Table 1 for the complete experiment settings.

Active Learning Methods We compare nine querying methods: Random, variation ratio (VR), Gibbs error (GE), Bayesian Active Learning by Disagreement with Monte Carlo dropout (BALD-MCD) [18, 13], Core-Set [27], Deep-Fool Active Learning (DFAL) [11], pairwise disagreement (PWD), Gibbs-vote disagreement (GVD), and double-weighted pairwise disagreement (M-PWD) [30]. For each method on each dataset, at least three runs of active learning with different random balanced initial training sets are performed.

Dataset Pool/Val/Test Model Ensemble Size Init/Query/Total Runs
MNIST 45000/5000/10000 2-conv-layer ConvNet 20 10/5/400 4
Fashion-MNIST 55000/5000/10000 3-conv-layer ConvNet 20 10/10/1000 4
SVHN 40000/5000/15000 6-conv-layer ConvNet 20 100/20/3000 4
STL-10 4000/1000/8000 ResNet18 20 100/40/2580 3
Table 1: Settings for each dataset used in the active learning experiments.
Figure 4: Accuracy over number of queried labels on the test set. Direct diameter reduction methods PWD and GVD are consistently better than Random and are among the best methods. Weighted diameter reduction M-PWD is on par with Random. Other baselines are effective on some datasets but inferior to Random on the others. Note that PWD, GVD and M-PWD exhibit smaller variances than the others.

Ensemble Size We train networks multiple times from scratch to obtain sample hypotheses and use them for prior mass and diameter estimation. Since diameters are estimated on partitioned version spaces, the ensemble size should be at least on the order of the number of classes. We set the size to 20. A larger ensemble improves estimation, but at the cost of longer training time. In preliminary experiments, we tried a larger ensemble (40) and did not observe significant differences. Hence we do not include experiments on this hyper-parameter in the paper.

Query Size We set a small query budget for each round to reduce the correlation between queries. A larger budget may alleviate the pressure of frequent retraining, but the effect of each query cannot then be estimated and examined reliably. We observed in preliminary experiments that using a larger budget (one or two orders of magnitude larger) hides the differences between methods.

Figure 5: Pairwise disagreement and wrong agreement over number of queried labels on the test set. Except for the direct diameter reduction methods PWD and GVD, the baselines are not consistently better than or on par with Random at reducing the version space diameter. Performing worse than Random: GE, VR and BALD-MCD on all datasets except MNIST, Core-Set on Fashion-MNIST and SVHN, DFAL on MNIST and SVHN, and M-PWD on MNIST.
MNIST Fashion-MNIST SVHN STL-10
#labels 400 1000 3000 2580
Random 93.47 ± 0.38 83.90 ± 0.38 85.60 ± 0.23 58.15 ± 0.54
VR 96.74 ± 0.15 83.05 ± 1.09 63.23 ± 1.99 59.13 ± 0.21
GE 96.79 ± 0.10 80.01 ± 0.94 64.08 ± 3.77 58.84 ± 0.34
BALD-MCD 96.51 ± 0.22 84.67 ± 0.41 85.26 ± 0.34 57.35 ± 0.64
Core-Set 95.38 ± 0.28 79.08 ± 0.82 84.91 ± 0.20 58.93 ± 0.33
DFAL 92.88 ± 1.19 85.38 ± 0.60 86.34 ± 0.33 58.81 ± 0.37
PWD 96.92 ± 0.12 85.92 ± 0.10 86.41 ± 0.12 59.45 ± 0.11
GVD 97.02 ± 0.06 86.01 ± 0.15 86.44 ± 0.20 59.33 ± 0.37
M-PWD 93.24 ± 0.09 84.33 ± 0.03 85.42 ± 0.16 57.81 ± 0.20
Table 2: Accuracy on the test set in percentage.
MNIST Fashion-MNIST SVHN STL-10
#labels 400 1000 3000 2580
Random 2.86 ± 0.18 7.55 ± 0.26 13.13 ± 0.29 32.88 ± 0.43
VR 2.27 ± 0.18 10.64 ± 0.72 46.88 ± 2.76 34.21 ± 0.08
GE 2.30 ± 0.04 11.38 ± 1.52 44.87 ± 4.25 34.25 ± 0.09
BALD-MCD 2.39 ± 0.15 8.11 ± 0.51 16.58 ± 0.42 33.55 ± 0.44
Core-Set 2.91 ± 0.18 10.79 ± 1.34 14.66 ± 0.47 33.13 ± 0.64
DFAL 3.79 ± 0.60 7.06 ± 0.60 13.98 ± 0.31 32.41 ± 0.27
PWD 1.93 ± 0.04 6.91 ± 0.16 12.80 ± 0.08 32.25 ± 0.26
GVD 1.98 ± 0.05 6.98 ± 0.26 12.88 ± 0.25 32.96 ± 0.48
M-PWD 3.37 ± 0.13 7.22 ± 0.08 13.31 ± 0.13 33.23 ± 0.18
Table 3: Diameter (pairwise disagreement) on the test set in percentage.

7.1 Diameter Reduction is More Effective Than Prior Mass Reduction

Fig. 4 and Table 2 show that the direct diameter reduction methods PWD and GVD are consistently better than Random and achieve higher accuracy than the other baselines, while the weighted diameter reduction M-PWD is on par with Random. Diameter reduction methods usually exhibit smaller variance because training on samples queried by PWD, GVD and M-PWD yields version spaces with smaller diameters and less diverse sample hypotheses. Prior mass reduction is not always effective and even fails on SVHN. This failure is an example of prior mass reduction being incapable of reducing the diameter, and it provides empirical evidence that prior mass may not be an appropriate objective for active learning.

7.2 Comparison to Other Baselines

BALD-MCD, Core-Set and DFAL are not consistently better than Random, although each of them achieves competitive test accuracy on certain datasets. Their inferiority to Random in terms of test accuracy usually correlates with a larger diameter (see Fig. 5 and Table 3). BALD-MCD and DFAL are closely related to prior mass reduction methods: BALD [18] seeks samples for which the model parameters under the posterior disagree the most about the prediction, and DFAL, inspired by margin-based active learning [2], tries to locate the decision boundary with as few labels as possible, which essentially amounts to removing inconsistent hypotheses in the realizable case. However, none of them explicitly minimizes the diameter, nor does Core-Set.

Note that, for a fair comparison, we do not augment the training set with the adversarial samples as the original DFAL paper [11] does. Samples with minimum adversarial perturbation can then be reliably verified to be less effective than those leading to minimum diameter. The original Core-Set paper [27] uses a large query batch size (on the order of 1000). However, many baselines rely on greedy selection and do not perform any batch optimization. To reduce query correlation, we adopt as small a batch size as possible. This allows reliable evaluation of the effectiveness of queried samples, as in the online setting. We are therefore able to identify one major cause of inferiority to Random: failing to effectively reduce the version space diameter.

7.3 Evolution of Samplable Version Space and its Implications

As shown in Fig. 5 and 3, the samplable version space shifts closer to the correct labeling while reducing its diameter as more labels are queried. These two processes together result in smaller test error.

No Direct Control Over Reduction of Version Space Bias Interestingly, the Core-Set method, which queries representative samples from the pool set by solving a k-center problem in the feature space learned by the neural networks, is incapable of achieving negligible wrong agreement on the learned version spaces. Indeed, it suffers from larger version space bias than the direct diameter reduction methods. After all, random queries, which are i.i.d. by assumption, fail to achieve this goal, as concluded in Section 6, and other approaches that do not augment the training data seem unlikely to succeed.

Prior Mass Induced by Stochastic Gradient Descent May Not Be a Reliable Surrogate Measure The continued decline in wrong agreement indicates that the distribution over labelings changes over time. This shifting density over samplable labelings renders the notion of prior mass problematic, and hence all quantities relying on prior mass may not be well-defined. A direct consequence is that an estimate of the worst-case version space reduction is more reliable than an average-case one. For example, VR provides a more reliable estimate of version space reduction than GE does.

Inferiority of the Weighted Diameter Reduction Method Estimating the weighted diameter involves estimating the prior mass. Hence, the inferiority of M-PWD to PWD and GVD can be attributed to the intrinsic difficulty of obtaining unbiased samplable version spaces and the resulting density shift. Supporting evidence is that on MNIST and Fashion-MNIST, where the wrong agreement is large (hence the density shift is large), the weighted variant performs worse, while on SVHN and STL-10, where the wrong agreement is small (hence the shift is small), the gap is less significant.

Figure 6: Distance from the Gibbs and the majority vote classifier to the projection of the Bayes classifier. On all four datasets, the majority vote classifier has a smaller distance, hence a smaller error rate. See the caption of Fig. 2 for the total numbers of random samples.

7.4 Gibbs-Vote Disagreement

The Gibbs-vote disagreement is among the best methods on all datasets, except for the early learning stage on SVHN. Its effectiveness can be ascribed to an interesting phenomenon: majority voting reduces mistakes. Although this need not always be the case, it occurs in many situations, and the boost in accuracy depends on the variance of the errors of the Gibbs classifiers [23]. We show empirically in Fig. 6 that the majority vote classifier indeed has a smaller error rate than random hypotheses in the version space. Hence, optimizing the Gibbs-vote disagreement not only reduces the diameter but also implicitly moves the consistent hypotheses closer to the correct labeling, which is useful when the samplable version spaces are biased and do not contain the Bayes classifier.

8 Conclusion

In this work, we studied version space reduction for convolutional neural networks. We revealed the differences and connections between prior mass and diameter reduction methods and proposed the Gibbs-vote disagreement as a new, effective diameter reduction method. With experiments on four datasets, we shed light on how version space reduction works in the deep active learning setting and demonstrated the superiority of diameter reduction over prior mass reduction methods and other baselines.

Acknowledgments

This work was supported by the BMBF project MLWin and the Munich Center for Machine Learning (MCML).

Appendix 0.A Estimators and Algorithm

0.A.1 Effective Hypothesis Space

Let $X_U$ be the set of unlabeled samples in the pool set and $X_L$ the set of queried unlabeled samples. The effective hypothesis space is the restriction of $\mathcal{H}$ to $X = X_U \cup X_L$, or equivalently the set of all labelings of $X$ achievable by hypotheses in $\mathcal{H}$:

$\mathcal{H}|_{X} = \big\{\,(h(x))_{x \in X} : h \in \mathcal{H}\,\big\}.$   (33)

0.A.2 Estimators of Diameters

Let $S = \{h_1, \ldots, h_m\}$ be a set of sample hypotheses from the version space $V$,

$h_i \sim \pi_V, \quad i = 1, \ldots, m.$   (34)

Assuming $m \ge 2$, an unbiased estimator of the pairwise disagreement can be constructed by computing average pairwise distances between hypotheses in $S$,

$\hat{d}_{PW}(V) = \frac{1}{m(m-1)} \sum_{i \neq j} d(h_i, h_j).$   (35)

Similarly, an unbiased estimator of the Gibbs-vote disagreement can be constructed by computing average distances between hypotheses in $S$ and the empirical majority vote $\hat{h}_{MV}$,

$\hat{d}_{GV}(V) = \frac{1}{m} \sum_{i=1}^{m} d(h_i, \hat{h}_{MV}).$   (36)

0.A.3 Algorithm

A (batch-mode) greedy algorithm that selects the unlabeled samples which induce the minimum version space, in terms of a given diameter measure, in the worst-case scenario is shown in Algorithm 1. Other active learning methods (e.g., prior mass reduction methods) can also be described by this algorithm, with the selection step in line 11 replaced by the corresponding objective function. Note that in lines 3 and 15 the version space is maintained explicitly, while in practice we only need to sample from the new version space by training neural networks on the updated set of queried samples. Hence, the version space is always maintained implicitly.

1:  Input: $T$ query rounds, $b$ batch size of query, $m$ size of ensemble, $U$ pool of unlabeled samples, diam diameter measure, $\pi$ prior over hypotheses
2:  $L \leftarrow \emptyset$
3:  $V \leftarrow \mathcal{H}$
4:  for $t = 1$ to $T$ do
5:     $S \leftarrow \emptyset$
6:     for $i = 1$ to $m$ do
7:        sample $h_i \sim \pi_V$
8:        $S \leftarrow S \cup \{h_i\}$
9:     end for
10:     for $j = 1$ to $b$ do
11:        $x^* \leftarrow \arg\min_{x \in U} \max_{y \in \mathcal{Y}} \mathrm{diam}(V_{x,y})$, estimated with $S$
12:        query the label $y^*$ of $x^*$
13:        $L \leftarrow L \cup \{(x^*, y^*)\}$, $U \leftarrow U \setminus \{x^*\}$
14:     end for
15:     $V \leftarrow V(L)$
16:  end for
17:  return  any $h \in V$
Algorithm 1 Worst-Case Diameter Reduction
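A high-level Python sketch of Algorithm 1 follows. It is our own illustration, not the released implementation: `train_on`, `predict` and `oracle_label` are hypothetical placeholders for the experiment-specific components (training a network on the current labeled set, hard prediction, and label querying), and `diam` is any diameter estimate such as the pairwise or Gibbs-vote disagreement.

```python
import numpy as np

def active_learning_worst_case_diameter(pool, oracle_label, train_on, predict,
                                        diam, T, b, m, eval_set):
    """Batch-mode greedy worst-case diameter reduction (sketch of Algorithm 1).

    pool: list of unlabeled samples; oracle_label(x): returns the label of x;
    train_on(L): trains a network on the labeled set L, i.e. samples a
    hypothesis from the implicit version space; predict(h, xs): hard labels;
    diam(preds): diameter estimate from an array of shape (k, n_eval) of
    member predictions on eval_set.
    """
    L = []                                           # queried labeled set
    unlabeled = list(pool)
    for _ in range(T):                               # query rounds
        ensemble = [train_on(L) for _ in range(m)]   # sample m hypotheses from V
        cand = np.stack([predict(h, unlabeled) for h in ensemble])   # (m, n_cand)
        evalp = np.stack([predict(h, eval_set) for h in ensemble])   # (m, n_eval)
        for _ in range(b):                           # greedy batch of size b
            scores = []
            for j in range(cand.shape[1]):
                worst = 0.0
                for y in np.unique(cand[:, j]):
                    members = cand[:, j] == y        # hypotheses consistent with (x_j, y)
                    if members.sum() >= 2:
                        worst = max(worst, diam(evalp[members]))
                scores.append(worst)
            j_star = int(np.argmin(scores))
            x = unlabeled.pop(j_star)
            L.append((x, oracle_label(x)))           # query the oracle
            cand = np.delete(cand, j_star, axis=1)
    return train_on(L)                               # a hypothesis consistent with L
```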

Appendix 0.B Proof of Equation 20

Proof

$2\, d_{GV}(V) = \mathbb{E}_{h, h' \sim \pi_V}\big[d(h, h_{MV}) + d(h_{MV}, h')\big] \ge \mathbb{E}_{h, h' \sim \pi_V}\big[d(h, h')\big],$

where the last inequality is by the triangle inequality. By definition, it holds that $\mathbb{E}_{h, h' \sim \pi_V}\big[d(h, h')\big] = d_{PW}(V)$, hence $d_{GV}(V) \ge \frac{1}{2}\, d_{PW}(V)$. Using the relations derived in Section 5.3,

$d_{PW}(V) = \mathbb{E}_{x}\big[G(x)\big], \qquad d_{GV}(V) = \mathbb{E}_{x}\big[VR(x)\big],$

it holds that

$d_{GV}(V) \le d_{PW}(V),$

since $VR(x) \le G(x)$ for every $x$. ∎

Appendix 0.C Singly-Weighted Diameter Reduction

Besides the doubly-weighted diameter reduction method mentioned in Section 5.4, there are singly-weighted variants.

0.C.1 Weighted Pairwise Disagreement

$x^* \in \arg\min_{x}\ \max_{y \in \mathcal{Y}}\ \pi(V_{x,y})\, d_{PW}(V_{x,y}),$   (37)

which is equivalent to minimizing the potential expected marginal gain on the prior mass reduction utility function,

$\pi(V_{x,y})\, d_{PW}(V_{x,y}) = \mathbb{E}_{x' \sim \mathcal{D}_{\mathcal{X}}}\big[\Delta_{\mathrm{avg}}(x' \mid L \cup \{(x, y)\})\big].$   (38)

0.C.2 Weighted Gibbs-Vote Disagreement

$x^* \in \arg\min_{x}\ \max_{y \in \mathcal{Y}}\ \pi(V_{x,y})\, d_{GV}(V_{x,y}),$   (39)

which is equivalent to minimizing the potential worst-case marginal gain,

$\pi(V_{x,y})\, d_{GV}(V_{x,y}) = \mathbb{E}_{x' \sim \mathcal{D}_{\mathcal{X}}}\big[\Delta_{\mathrm{wc}}(x' \mid L \cup \{(x, y)\})\big].$   (40)

Appendix 0.D Additional Evaluation Results

0.D.1 Evaluation on the Pool Set

In addition to the evaluation results on the test set shown in the paper, we show the results on the pool set in Figure 7. On MNIST, Fashion-MNIST and SVHN, diameter reduction methods are more effective than prior mass reduction methods and other baselines at finding the true labeling of the pool set using as few labels as possible. On STL-10, prior mass reduction performs better at reducing the diameter. An explanation is that we use unlabeled samples in the validation set to estimate relative prior mass and diameters when selecting each query, so the diameter reduction methods do not explicitly optimize the diameter measured on the pool set, but rather on an unbiased validation set.

Figure 7: Error rate, pairwise disagreement and wrong agreement over number of queried labels on the pool set.

0.D.2 Embedding of Version Spaces

To better illustrate the evolution of version spaces and the existence of bias in version spaces, we show in Figure 8 a 2-D embedding of sample hypotheses during the active learning process for each dataset, using Multi-Dimensional Scaling (MDS) [22]. MDS finds a low-dimensional representation of potentially high-dimensional data by preserving pairwise distances between the data points. We show sample hypotheses from the first (purple) and the last (red) version spaces, as well as intermediate version spaces obtained by training on randomly queried labels amounting to (approximately) 25%, 50% and 75% of the total budget. To achieve a better visualization, we first compute the embedding of the five Gibbs (random) classifiers and the Bayes classifier, and then compute the embedding of each version space separately and center it at the corresponding Gibbs classifier. We use the disagreement probability evaluated on the test set as the distance metric.

As more labels are queried, the version spaces move closer to the Bayes classifier while reducing their diameters. The bias in the version spaces is non-negligible for the four datasets. An active learning algorithm contributes to the shrinkage of the samplable version space but does not have direct control over the reduction of bias. How to efficiently reduce the bias remains an open problem for designing active learning algorithms for neural networks.

Figure 8: Embedding of version spaces using Multi-Dimensional Scaling (MDS). The 2-D embedding of version spaces obtained by training on random samples is calculated through MDS for each dataset. The purple dots in the largest clusters illustrate a set of sample hypotheses from the version space at the beginning of the active learning experiments, while the red dots in the smallest clusters illustrate hypotheses at the end, being closer to the Bayes classifier (star marker) than those from other version spaces. The blue, green and orange dots represent version spaces obtained by training with (approximately) 25%, 50% and 75% labels of the total budget, respectively. The Gibbs classifier (triangle marker) corresponding to each version space is a random classifier that predicts by randomly sampling a hypothesis from the version space and making the same prediction as the sample hypothesis does.
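Such embeddings can be reproduced with standard MDS on a precomputed disagreement matrix; a brief scikit-learn sketch (our own illustration, not the paper's plotting code):

```python
import numpy as np
from sklearn.manifold import MDS

def embed_hypotheses(preds: np.ndarray, random_state: int = 0) -> np.ndarray:
    """2-D MDS embedding of classifiers from their pairwise disagreement.

    preds: label predictions of shape (k, n) for k classifiers (sampled
    hypotheses, Gibbs classifiers, Bayes classifier) on the same test set.
    Returns an array of shape (k, 2).
    """
    k = preds.shape[0]
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            dist[i, j] = dist[j, i] = np.mean(preds[i] != preds[j])
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=random_state)
    return mds.fit_transform(dist)
```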

Appendix 0.E Datasets Selection

The four image classification datasets MNIST, Fashion-MNIST, SVHN and STL-10 are chosen based on several considerations: (1) a relatively balanced label distribution; (2) the existence of neural network models that train fast on the original training set; (3) no data augmentation is needed. Since active learning methods query highly biased samples, a balanced label distribution helps mitigate the problem of query label imbalance. The second point reduces the time needed to run the active learning experiments. The last point guarantees that the samples used for training are exactly those that have been queried.

Appendix 0.F Implementation Details

For the four datasets we consider, no data augmentation is used. Unless otherwise stated, the neural network models are trained using SGD with momentum 0.8. The learning rate decays by a factor of 0.1 whenever the validation accuracy does not improve for 10 consecutive training epochs, until it falls below a minimum value. The maximum number of training epochs is 200. The batch size for training is set to 32.
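A PyTorch-style sketch of this training configuration is given below; the initial and minimum learning rates shown are hypothetical placeholders, since the exact values are not reproduced here.

```python
import torch

def make_optimizer_and_scheduler(model: torch.nn.Module):
    """SGD with momentum 0.8 and decay-on-plateau, as described above.

    init_lr and min_lr are hypothetical placeholders, not the paper's values.
    """
    init_lr, min_lr = 1e-2, 1e-5   # placeholders, not the paper's values
    optimizer = torch.optim.SGD(model.parameters(), lr=init_lr, momentum=0.8)
    # Decay the learning rate by a factor of 0.1 when the validation accuracy
    # has not improved for 10 consecutive epochs, down to min_lr.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="max", factor=0.1, patience=10, min_lr=min_lr)
    return optimizer, scheduler

# Per epoch: call scheduler.step(val_accuracy) after evaluating on the validation set.
```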

0.F.1 MNIST [24]

We select a random balanced set of 50000 samples from the original 60000 training samples as the training/validation set and use the original 10000 test samples as the test set. The 2-conv-layer ConvNet is trained using RMSProp [29]. The learning rate decays by a factor of 0.5 whenever the validation accuracy does not improve for 10 consecutive training epochs, until it falls below a minimum value. Dropout [28] with rate 0.5 is applied to the output of the fully-connected layer between the last convolution layer and the output layer. The batch size for training is set to 16.

0.F.2 Fashion-MNIST [31]

We used the original balanced 60000 training and 10000 test samples as the training/validation and test sets. A 3-conv-layer ConvNet is used as the classifier model. Dropout with rate 0.5 is applied to the output of the fully-connected layer between the last convolution layer and the output layer.

0.F.3 SVHN [25]

We select a random balanced set of 45000 samples from the original 73257 training samples as the training/validation set and a random balanced set of 15000 samples from the original 26032 test samples as the test set. A 6-conv-layer ConvNet is used as the classifier model. Dropout with rate 0.3 is applied to the output of every two convolution layers and to the output of the fully-connected layer between the last convolution layer and the output layer.

0.F.4 STL-10 [4]

We used the original balanced 5000 training and 8000 test samples as the training/validation and test sets. ResNet18 [17] is used as the classifier model. Dropout of rate 0.5 is applied between all convolution layers in each convolutional block [32].

References

  • [1] M. Balcan, A. Beygelzimer, and J. Langford (2006) Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 65–72. Cited by: §2.
  • [2] M. Balcan, A. Broder, and T. Zhang (2007) Margin based active learning. In Proceedings of the 20th Annual Conference on Learning Theory, pp. 35–50. Cited by: §7.2.
  • [3] W. H. Beluch, T. Genewein, A. Nürnberger, and J. M. Köhler (2018) The power of ensembles for active learning in image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9368–9377. Cited by: §2.
  • [4] A. Coates, A. Ng, and H. Lee (2011) An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pp. 215–223. Cited by: §0.F.4.
  • [5] N. V. Cuong, W. S. Lee, N. Ye, K. M. A. Chai, and H. L. Chieu (2013) Active learning for probabilistic hypotheses using the maximum gibbs error criterion. In Advances in Neural Information Processing Systems, pp. 1457–1465. Cited by: §1, §1, §4.1.
  • [6] N. V. Cuong, W. S. Lee, and N. Ye (2014) Near-optimal adaptive pool-based active learning with general loss. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, pp. 122–131. Cited by: §2, §4.1.
  • [7] S. Dasgupta (2005) Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems, pp. 337–344. Cited by: §1, §1, §2.
  • [8] S. Dasgupta (2006) Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems, pp. 235–242. Cited by: §1, §5.1, §5.4.
  • [9] S. Dasgupta (2009) Two faces of active learning. In Proceedings of the 20th International Conference on Algorithmic Learning Theory, pp. 1–1. Cited by: §1.
  • [10] L. Devroye, L. Györfi, and G. Lugosi (2013) A probabilistic theory of pattern recognition. Vol. 31, Springer Science & Business Media. Cited by: §5.2.
  • [11] M. Ducoffe and F. Precioso (2018) Adversarial active learning for deep networks: a margin based approach. arXiv preprint arXiv:1802.09841. Cited by: §1, §1, §2, §7.2, §7.
  • [12] Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, pp. 1050–1059. Cited by: §2.
  • [13] Y. Gal, R. Islam, and Z. Ghahramani (2017) Deep bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning, pp. 1183–1192. Cited by: §1, §1, §2, §7.
  • [14] D. Gissin and S. Shalev-Shwartz (2019) Discriminative active learning. arXiv preprint arXiv:1907.06347. Cited by: §1, §2.
  • [15] D. Golovin and A. Krause (2011) Adaptive submodularity: a new approach to active learning and stochastic optimization. Journal of Artificial Intelligence Research, pp. 427–486. Cited by: §1, §1, §2.
  • [16] S. Hanneke (2007) A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on Machine Learning, pp. 353–360. Cited by: §2.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §0.F.4.
  • [18] N. Houlsby, F. Huszár, Z. Ghahramani, and M. Lengyel (2011) Bayesian active learning for classification and preference learning. Computing Research Repository abs/1112.5745. Cited by: §1, §1, §2, §7.2, §7.
  • [19] A. J. Joshi, F. Porikli, and N. Papanikolopoulos (2009) Multi-class active learning for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2372–2379. Cited by: §1.
  • [20] M. Kääriäinen (2005) Generalization error bounds using unlabeled data. In Proceedings of the 18th Annual Conference on Learning Theory, pp. 127–142. Cited by: §5.1, §5.2.
  • [21] A. Kirsch, J. van Amersfoort, and Y. Gal (2019) Batchbald: efficient and diverse batch acquisition for deep bayesian active learning. In Advances in Neural Information Processing Systems, pp. 7024–7035. Cited by: §1, §2.
  • [22] J. B. Kruskal (1978) Multidimensional scaling. Sage. Cited by: §0.D.2, §6.
  • [23] A. Lacasse, F. Laviolette, M. Marchand, P. Germain, and N. Usunier (2007) PAC-bayes bounds for the risk of the majority vote and the variance of the gibbs classifier. In Advances in Neural Information Processing Systems, pp. 769–776. Cited by: §7.4.
  • [24] Y. LeCun, C. Cortes, and C. Burges (1998) MNIST handwritten digit database. External Links: Link Cited by: §0.F.1.
  • [25] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, Cited by: §0.F.3.
  • [26] R. Pinsler, J. Gordon, E. Nalisnick, and J. M. Hernández-Lobato (2019) Bayesian batch active learning as sparse subset approximation. In Advances in Neural Information Processing Systems, pp. 6356–6367. Cited by: §2.
  • [27] O. Sener and S. Savarese (2018) Active learning for convolutional neural networks: a core-set approach. In Proceedings of the 6th International Conference on Learning Representations, Cited by: §1, §1, §2, §7.2, §7.
  • [28] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, pp. 1929–1958. Cited by: §0.F.1.
  • [29] T. Tieleman and G. Hinton (2012) Lecture 6.5 – RMSProp: divide the gradient by a running average of its recent magnitude. Note: Coursera: Neural networks for machine learning External Links: Link Cited by: §0.F.1.
  • [30] C. Tosh and S. Dasgupta (2017) Diameter-based active learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 3444–3452. Cited by: §1, §2, §5.4, §7.
  • [31] H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §0.F.2.
  • [32] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. In Proceedings of the 27th British Machine Vision Conference, pp. 87.1–87.12. Cited by: §0.F.4.
  • [33] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2017) Understanding deep learning requires rethinking generalization. In Proceedings of the 5th International Conference on Learning Representations, Cited by: §1.