Active learning is a supervised learning framework in which the learner is given access to a pool or stream of unlabeled samples and is allowed to selectively query labels from an oracle (e.g., a human annotator). In each query round, the learner queries the labels of some unlabeled samples and trains on the augmented labeled set to obtain new classifiers. The goal is to learn a good classifier or hypothesis using as few labels as possible. This setting is relevant in many real-world problems where labeled data are scarce or expensive to obtain but unlabeled data are cheap and abundant.
Many active learning methods for neural networks rely on measures of the “informativeness” of a query, in the form of classifier uncertainty, margin [19, 11] or information gain [18, 13, 21]. Other methods capture informativeness through the representativeness of the query set, using geometry-based or discriminative methods. However, most of these methods ignore the notion of the hypothesis space and do not address the problem of sampling bias, which plagues many active learning methods. Without carefully handling this problem, an active learning algorithm is not guaranteed to be consistent, i.e., capable of finding the optimal classifier in the hypothesis space.
We consider the hypothesis space of convolutional neural networks (ConvNets) and study version space reduction methods. Version space reduction works by removing hypotheses that are inconsistent with the observed labels from a predefined hypothesis space and maintaining the consistent sub-space, the version space. A key condition called the realizability assumption is that the hypothesis space contains the classifier that provides the ground truth—if not, there are no guarantees that the best hypothesis will not be removed, because a hypothesis might make mistakes on the queried samples but perform well on the data distribution.
For neural networks, the realizability assumption may not hold in all cases. For instance, no neural network can achieve arbitrarily small test error on some classification datasets. A workaround is to consider the effective labelings on a set of i.i.d. pool samples. To avoid the problem of an unreasonably large effective hypothesis space, as implied by the ability of neural networks to fit arbitrary labelings, we only consider the labelings achievable by training on unaltered samples and correct labels. We examine experimentally whether realizability holds under this restriction and analyze its implications for version space reduction methods.
Prior mass reduction [7, 15, 5] and diameter reduction [8, 30] are two widely used version space reduction approaches; see Fig. 1 for an illustration. However, prior mass reduction is not an appropriate objective for active learning, since any intermediate version space containing more than one hypothesis may still have a large diameter, i.e., a large error rate in the worst-case scenario, despite having substantially reduced mass. We derive connections between prior mass and diameter reduction and introduce a new interpretation of diameter reduction as prior mass “reducibility reduction”.
We propose a new diameter measure called the Gibbs-vote disagreement, which equals the expected distance between the random hypotheses and their majority vote classifier. We show its relation to a common diameter measure, the pairwise disagreement, and discuss under which situations the former may be advantageous. We show experimentally on four image classification datasets that diameter reduction methods perform better than all baselines and that prior mass reduction [7, 15, 5] and other baselines like [18, 13, 27, 11] do not perform consistently better than random query and sometimes fail completely.
2 Related Work
Much research has studied the label complexity of active learning and optimality guarantees for greedy version space reduction. Hanneke and Balcan et al. prove upper bounds on the label complexity in the realizable and non-realizable cases, using a parameter called the disagreement coefficient. Tosh and Dasgupta propose a diameter-based active learning algorithm and characterize its sample complexity using a parameter called the splitting index. Dasgupta shows that a greedy strategy maximizing the worst-case prior mass reduction is approximately as good as the optimal strategy. Golovin and Krause show that the prior mass reduction utility function is adaptive submodular and that a greedy algorithm is guaranteed to obtain near-optimal solutions in the average-case scenario. Cuong et al. prove a worst-case optimality guarantee for pointwise submodular functions.
A variety of methods relying on the informativeness of a query have been proposed for neural networks. Gal et al. use Monte Carlo dropout to approximate the mutual information between predictions and the model posterior in a Bayesian setting. Kirsch et al. extend [18, 13] to a batch query method. Ducoffe and Precioso use adversarial attacks to generate samples close to the decision boundaries. Sener and Savarese adopt a core-set approach to select representative samples for query. Gissin and Shalev-Shwartz use a discriminative method to select samples such that the labeled and the unlabeled set are indistinguishable. Pinsler et al. formulate batch query as a sparse approximation to the expected complete data posterior of model parameters in a Bayesian setting. Beluch et al. show that ensemble methods consistently outperform geometry-based methods and the Monte Carlo dropout method [12, 13].
Let $\mathcal{X}$ be the input feature space and $\mathcal{Y}$ the label space. Let $\mathcal{H}$ be a hypothesis space of functions $h\colon \mathcal{X} \to \mathcal{Y}$ and assume a prior $\pi$ over $\mathcal{H}$. A hypothesis randomly drawn from the prior is called a Gibbs classifier. Denote by $X$ a pool of i.i.d. samples from the data distribution and by $\mathcal{L}$ the set of queried labeled samples. Define the version space corresponding to $\mathcal{L}$ as
$$V = \{h \in \mathcal{H} : h(x) = y \ \text{for all}\ (x, y) \in \mathcal{L}\}.$$
Denote the subset of $V$ that is consistent with $x$ being labeled as $y$ by
$$V_x^y = \{h \in V : h(x) = y\}.$$
The disagreement probability induced by the marginal distribution over $\mathcal{X}$ is defined as
$$d(h, h') = \Pr_{x}\left[h(x) \neq h'(x)\right],$$
which is a pseudo-metric on the hypothesis space. The disagreement and agreement regions are defined as
$$\mathrm{DIS}(V) = \{x \in \mathcal{X} : \exists\, h, h' \in V,\ h(x) \neq h'(x)\}, \qquad \mathrm{AGR}(V) = \mathcal{X} \setminus \mathrm{DIS}(V).$$
4 Prior Mass Reduction
4.1 Gibbs Error
The Gibbs error of an unlabeled sample $x$ is the average-case relative prior mass reduction:
$$G(x) = \mathbb{E}_{y \sim p(\cdot \mid x)}\!\left[1 - \frac{\pi(V_x^y)}{\pi(V)}\right] = 1 - \sum_{y \in \mathcal{Y}} p(y \mid x)^2,$$
where $p(y \mid x) = \pi(V_x^y)/\pi(V)$ is the conditional label distribution induced by the prior restricted to $V$. The Gibbs error measures the proportion of inconsistent hypotheses, taking the expectation over all possible labelings of $x$ achievable by hypotheses in the version space. A greedy strategy that maximizes the average-case absolute prior mass reduction in each query can equivalently select the unlabeled sample that maximizes the Gibbs error:
$$x^{*} = \operatorname*{arg\,max}_{x}\ G(x).$$
Define the prior mass reduction utility function as
$$f(\mathcal{L}) = 1 - \pi(V_{\mathcal{L}}),$$
where $V_{\mathcal{L}}$ is the version space induced by $\mathcal{L}$. The optimization problem in (7) can be written, up to a scaling factor, as
$$x^{*} = \operatorname*{arg\,max}_{x}\ \mathbb{E}_{y \sim p(\cdot \mid x)}\big[f(\mathcal{L} \cup \{(x, y)\}) - f(\mathcal{L})\big],$$
where the expectation denotes the expected marginal gain of $x$ in terms of prior mass reduction given the labeled samples in $\mathcal{L}$.
A closely related objective for active learning is the label entropy given $x$. It can be shown that the Gibbs error lower bounds the entropy. However, a greedy strategy that maximizes the entropy is not guaranteed to be near-optimal in the adaptive case. Furthermore, this criterion empirically performs similarly to or worse than the maximum Gibbs error. For the sake of simplicity, we do not consider this method in this paper.
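The Gibbs error can be estimated from an ensemble of hypotheses sampled from the version space by approximating the conditional label distribution with ensemble vote frequencies. A minimal sketch (the function name and array layout are ours, not from the paper):

```python
import numpy as np

def gibbs_error(votes: np.ndarray, n_classes: int) -> np.ndarray:
    """Estimate the Gibbs error of each unlabeled sample.

    votes: (n_hypotheses, n_samples) array of hard label predictions,
           one row per hypothesis sampled from the version space.
    Returns 1 - sum_y p(y|x)^2, where p(y|x) is the fraction of
    hypotheses predicting label y for sample x.
    """
    p = np.stack([(votes == y).mean(axis=0) for y in range(n_classes)])
    return 1.0 - (p ** 2).sum(axis=0)
```

A greedy strategy would then query the pool sample with the largest estimated Gibbs error.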
4.2 Variation Ratio
The variation ratio of an unlabeled sample $x$ is the worst-case relative prior mass reduction upon the reveal of its label:
$$VR(x) = 1 - \max_{y \in \mathcal{Y}} p(y \mid x).$$
It measures the proportion of inconsistent hypotheses under the worst-case labeling of $x$ and is a lower bound on the Gibbs error. A greedy strategy that maximizes the worst-case absolute prior mass reduction in each query selects the unlabeled sample that maximizes the variation ratio:
$$x^{*} = \operatorname*{arg\,max}_{x}\ VR(x),$$
which can be expressed in terms of the prior mass reduction utility function, up to a scaling factor, as
$$x^{*} = \operatorname*{arg\,max}_{x}\ \min_{y \in \mathcal{Y}}\big[\pi(V) - \pi(V_x^y)\big],$$
where the minimum over labels gives the worst-case marginal gain of $x$ in terms of prior mass reduction given the labeled samples in $\mathcal{L}$.
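The variation ratio can likewise be estimated from the vote frequencies of an ensemble of sampled hypotheses; a minimal sketch (the function name and array layout are ours):

```python
import numpy as np

def variation_ratio(votes: np.ndarray, n_classes: int) -> np.ndarray:
    """Worst-case relative prior mass reduction: 1 - max_y p(y|x),
    with p(y|x) estimated as the fraction of sampled hypotheses
    (rows of `votes`) that predict label y for each sample."""
    p = np.stack([(votes == y).mean(axis=0) for y in range(n_classes)])
    return 1.0 - p.max(axis=0)
```

By construction this estimate never exceeds the Gibbs error estimate computed from the same votes.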
5 Diameter Reduction
5.1 Worst-Case Pairwise Disagreement
The size of the version space can be measured by the expected pairwise disagreement between hypotheses drawn from the conditional distribution over $V$:
$$\mathrm{PWD}(V) = \mathbb{E}_{h, h' \sim \pi|_V}\big[d(h, h')\big].$$
It is the average diameter of the version space. A greedy strategy selects the unlabeled sample that minimizes the worst-case pairwise disagreement:
$$x^{*} = \operatorname*{arg\,min}_{x}\ \max_{y \in \mathcal{Y}}\ \mathrm{PWD}(V_x^y).$$
Other measures of diameter based on the supremum distance [20, 8] are not amenable to implementation, because evaluating such diameters involves optimization. The pairwise disagreement, in contrast, can be estimated from a finite set of sample hypotheses from the version space.
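Such an estimator can be sketched as follows (names and array layout are ours), using the empirical disagreement of hard predictions on held-out samples as the distance between hypotheses:

```python
import numpy as np
from itertools import combinations

def pairwise_disagreement(votes: np.ndarray) -> float:
    """Average pairwise disagreement between sample hypotheses.

    votes: (n_hypotheses, n_eval) hard predictions of each sampled
    hypothesis on a held-out set of evaluation samples; the empirical
    distance between two hypotheses is the fraction of samples on
    which their predictions differ.
    """
    dists = [(votes[i] != votes[j]).mean()
             for i, j in combinations(range(len(votes)), 2)]
    return float(np.mean(dists))
```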
5.2 Worst-Case Gibbs-Vote Disagreement
We propose a new diameter measure called the Gibbs-vote disagreement. It is the expected disagreement between random hypotheses and their majority vote:
$$\mathrm{GVD}(V) = \mathbb{E}_{h \sim \pi|_V}\big[d(h, h_{MV})\big],$$
where $h_{MV}$ is the majority vote classifier of hypotheses from $\pi|_V$. For each $x \in \mathcal{X}$, it induces the prediction
$$h_{MV}(x) = \operatorname*{arg\,max}_{y \in \mathcal{Y}}\ \mathbb{E}_{h \sim \pi|_V}\big[p_h(y \mid x)\big],$$
where $p_h(y \mid x)$ is the predicted probability of $x$ belonging to class $y$ given by a hypothesis $h$. The majority vote classifier is the deterministic classifier that has the smallest expected distance to the Gibbs classifier [20, 10]:
$$h_{MV} = \operatorname*{arg\,min}_{g}\ \mathbb{E}_{h \sim \pi|_V}\big[d(h, g)\big].$$
Hence the Gibbs-vote disagreement measures the size of the version space by the expected distance of the random hypotheses to their “center”. Further, the following relation holds:
$$\tfrac{1}{2}\,\mathrm{PWD}(V) \;\le\; \mathrm{GVD}(V) \;\le\; \mathrm{PWD}(V). \tag{20}$$
We defer the proof to the appendix. Essentially, Equation (20) reveals that the Gibbs-vote disagreement is sandwiched between the average radius and the average diameter.
A greedy strategy selects the unlabeled sample that minimizes the worst-case Gibbs-vote disagreement:
$$x^{*} = \operatorname*{arg\,min}_{x}\ \max_{y \in \mathcal{Y}}\ \mathrm{GVD}(V_x^y),$$
where the majority vote in $\mathrm{GVD}(V_x^y)$ is that of hypotheses from the subspace $V_x^y$ of the current version space, i.e., assuming $x$ is labeled $y$.
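A corresponding estimator for the Gibbs-vote disagreement can be sketched from the same kind of ensemble votes, with the empirical majority vote taken per sample (names are ours; ties are broken arbitrarily):

```python
import numpy as np

def gibbs_vote_disagreement(votes: np.ndarray, n_classes: int) -> float:
    """Average distance between sampled hypotheses and their
    empirical majority vote, estimated from hard predictions.

    votes: (n_hypotheses, n_eval) label predictions on held-out samples.
    """
    counts = np.stack([(votes == y).sum(axis=0) for y in range(n_classes)])
    majority = counts.argmax(axis=0)          # plurality label per sample
    return float(np.mean([(h != majority).mean() for h in votes]))
```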
5.3 Diameter Reduction as Reducibility Reduction
Pairwise disagreement shares a simple relation with the Gibbs error: it is the expected Gibbs error over the data distribution:
$$\mathrm{PWD}(V) = \mathbb{E}_{x}\big[G(x)\big].$$
A similar relation holds between the Gibbs-vote disagreement and the variation ratio:
$$\mathrm{GVD}(V) = \mathbb{E}_{x}\big[\Pr_{h \sim \pi|_V}[h(x) \neq h_{MV}(x)]\big] = \mathbb{E}_{x}\big[VR(x)\big],$$
where the last equality holds because the predictions of the majority vote classifier are always the worst-case labels for prior mass reduction. Diameter reduction selects samples such that, upon revealing their labels, the induced subspaces have minimum potential to be further reduced by a random query. Thus, it can be thought of as reducing the expected prior mass “reducibility”.
Prior mass reduction finds splits in directions that evenly partition the version space, but can result in version spaces with irregular shapes, in the sense that the space may be whittled down finely in some directions while being under-split in others. The worst-case error rate of the resulting version space could still be large. Diameter reduction correctly resolves this issue. Fig. 1 illustrates the differences between prior mass and diameter reduction.
5.4 Weighted Diameter Reduction
Tosh and Dasgupta show that, in general, the average diameter cannot be decreased at a steady rate, and propose to query the unlabeled samples that minimize the diameter weighted by the squared prior mass in the worst-case scenario:
$$x^{*} = \operatorname*{arg\,min}_{x}\ \max_{y \in \mathcal{Y}}\ \pi(V_x^y)^2\, \mathrm{PWD}(V_x^y).$$
6 Realizability Assumption
Even though neural networks are capable of fitting an arbitrary pool set, we show experimentally that the version space obtained by training on a subset of the pool set with stochastic gradient descent, the “samplable” version space, is biased and not likely to contain the correct labeling of the pool set. Indeed, the distance from the Bayes classifier, which provides the ground truth labeling, to the “boundary” of the version space is non-negligible.
Let $\tilde{h}$ be the projection of the Bayes classifier $h^{*}$ onto the set of hypotheses that agree with $V$ on $\mathrm{AGR}(V)$ (see the left plot of Fig. 2), i.e., $\tilde{h}$ predicts as the hypotheses in $V$ do on $\mathrm{AGR}(V)$ and as $h^{*}$ does elsewhere. It is easy to see that $\tilde{h}$ provides the ground truth on $\mathrm{DIS}(V)$ and predicts the same labels on $\mathrm{AGR}(V)$ as hypotheses in $V$ do, hence
$$d(h^{*}, \tilde{h}) = d_{\mathrm{AGR}(V)}(h^{*}, \tilde{h}),$$
where $d_{\mathrm{AGR}(V)}$ is the disagreement probability restricted to $\mathrm{AGR}(V)$, or equivalently the wrong agreement of hypotheses in $V$.
We show the evolution of the wrong agreement in the right plot of Fig. 2. As more random samples are queried, the wrong agreement decreases for all datasets, but for some much more slowly than for others. In Fig. 3, we show for MNIST a 2-D embedding of version spaces using Multi-Dimensional Scaling (MDS), which finds a low-dimensional representation of potentially high-dimensional data by preserving pairwise distances between the data points. The Bayes classifier is not contained in any of the samplable version spaces, although the distances between them decrease steadily.
In general, neural networks trained on a random subset do not automatically predict all labels in the pool set correctly, unless a relatively large proportion of the samples is used for training. However, this fact does not render version space reduction inconsistent, because the samplable version space is not fixed: it shifts towards the correct labeling and finally covers it once the whole pool set has been used.
We conjecture that the dynamics of active learning with neural networks have two major components: (1) shrinkage of the samplable version space, which is explicitly optimized by the learning algorithm and (2) reduction of bias, which is not directly controllable. Empirical evidence is provided in the next section.
Datasets and Architectures
We conduct active learning experiments on four image classification datasets: MNIST, Fashion-MNIST, SVHN and STL-10. Source code is available at https://github.com/jiayu-liu/effective-version-space-reduction-for-convnets.
Neural network architectures are chosen to be competent for each dataset but as simple as possible in the hope of controlling the model complexity and mitigating the effect of overfitting.
See Table 1 for the complete experiment settings.
Active Learning Methods
We compare nine querying methods: Random, variation ratio (VR), Gibbs error (GE), Bayesian Active Learning by Disagreement with Monte Carlo dropout (BALD-MCD) [18, 13], Core-Set, DeepFool Active Learning (DFAL), pairwise disagreement (PWD), Gibbs-vote disagreement (GVD), and doubly-weighted pairwise disagreement (M-PWD). For each method on each dataset, at least three runs of active learning with different random balanced initial training sets are performed.
We train networks multiple times from scratch to obtain sample hypotheses and use them for prior mass and diameter estimation. Since diameters are estimated by considering partitioned version spaces, the ensemble size should be at least on the order of the number of classes. We set the size to 20. A larger ensemble improves estimation, but at the cost of longer training time. In preliminary experiments, we tried larger ensembles (40) and did not observe significant differences; hence we do not include experiments on this hyper-parameter in the paper.
Query Size We set a small query budget for each round to reduce the correlation between queries. A larger budget may alleviate the pressure of frequent retraining, but the effect of each query cannot be estimated and examined reliably. We observed in preliminary experiments that using a larger budget (one or two orders of magnitude larger) hides the differences between methods.
Table 2. Test accuracy (%, mean ± std) on the four datasets.

| Method | MNIST | Fashion-MNIST | SVHN | STL-10 |
| --- | --- | --- | --- | --- |
| Random | 93.47 ± 0.38 | 83.90 ± 0.38 | 85.60 ± 0.23 | 58.15 ± 0.54 |
| VR | 96.74 ± 0.15 | 83.05 ± 1.09 | 63.23 ± 1.99 | 59.13 ± 0.21 |
| GE | 96.79 ± 0.10 | 80.01 ± 0.94 | 64.08 ± 3.77 | 58.84 ± 0.34 |
| BALD-MCD | 96.51 ± 0.22 | 84.67 ± 0.41 | 85.26 ± 0.34 | 57.35 ± 0.64 |
| Core-Set | 95.38 ± 0.28 | 79.08 ± 0.82 | 84.91 ± 0.20 | 58.93 ± 0.33 |
| DFAL | 92.88 ± 1.19 | 85.38 ± 0.60 | 86.34 ± 0.33 | 58.81 ± 0.37 |
| PWD | 96.92 ± 0.12 | 85.92 ± 0.10 | 86.41 ± 0.12 | 59.45 ± 0.11 |
| GVD | 97.02 ± 0.06 | 86.01 ± 0.15 | 86.44 ± 0.20 | 59.33 ± 0.37 |
| M-PWD | 93.24 ± 0.09 | 84.33 ± 0.03 | 85.42 ± 0.16 | 57.81 ± 0.20 |
Table 3. Version space diameter (%, mean ± std) on the four datasets.

| Method | MNIST | Fashion-MNIST | SVHN | STL-10 |
| --- | --- | --- | --- | --- |
| Random | 2.86 ± 0.18 | 7.55 ± 0.26 | 13.13 ± 0.29 | 32.88 ± 0.43 |
| VR | 2.27 ± 0.18 | 10.64 ± 0.72 | 46.88 ± 2.76 | 34.21 ± 0.08 |
| GE | 2.30 ± 0.04 | 11.38 ± 1.52 | 44.87 ± 4.25 | 34.25 ± 0.09 |
| BALD-MCD | 2.39 ± 0.15 | 8.11 ± 0.51 | 16.58 ± 0.42 | 33.55 ± 0.44 |
| Core-Set | 2.91 ± 0.18 | 10.79 ± 1.34 | 14.66 ± 0.47 | 33.13 ± 0.64 |
| DFAL | 3.79 ± 0.60 | 7.06 ± 0.60 | 13.98 ± 0.31 | 32.41 ± 0.27 |
| PWD | 1.93 ± 0.04 | 6.91 ± 0.16 | 12.80 ± 0.08 | 32.25 ± 0.26 |
| GVD | 1.98 ± 0.05 | 6.98 ± 0.26 | 12.88 ± 0.25 | 32.96 ± 0.48 |
| M-PWD | 3.37 ± 0.13 | 7.22 ± 0.08 | 13.31 ± 0.13 | 33.23 ± 0.18 |
7.1 Diameter Reduction is More Effective Than Prior Mass Reduction
Fig. 4 and Table 2 show that the direct diameter reduction methods PWD and GVD are consistently better than Random and achieve higher accuracy than the other baselines, while the weighted diameter reduction method M-PWD is on par with Random. Diameter reduction methods usually exhibit less variance, because training on samples queried by PWD, GVD and M-PWD yields version spaces with smaller diameters and less diverse sample hypotheses. Prior mass reduction is not always effective and even fails on SVHN. This failure is an example of prior mass reduction being incapable of reducing the diameter, and provides empirical evidence that it may not be an appropriate objective for active learning.
7.2 Comparison to Other Baselines
BALD-MCD, Core-Set and DFAL are not consistently better than Random, although each of them achieves comparable test accuracy on certain datasets. Their inferiority to Random in terms of test accuracy usually correlates with a higher diameter (see Fig. 5 and Table 3). BALD-MCD and DFAL are closely related to prior mass reduction methods, in that BALD seeks samples for which the model parameters under the posterior disagree the most about the prediction, and DFAL, inspired by margin-based active learning, tries to locate the decision boundary with fewer labels, which essentially amounts to removing inconsistent hypotheses in the realizable case. However, none of them explicitly minimizes the diameter, nor does Core-Set.
Note that, for a fair comparison, we do not augment the training set with adversarial samples as the original DFAL paper does. Samples with minimum adversarial perturbation can then be reliably verified to be less effective than those that lead to minimum diameter. The original Core-Set paper uses a large query batch size (on the order of 1000). However, many baselines rely on greedy selection and do not perform any batch optimization. To reduce query correlation, we adopt as small a batch size as possible. This allows reliable evaluation of the effectiveness of queried samples, as in the online setting. We are therefore able to identify one major cause of inferiority to Random: failing to effectively reduce the version space diameter.
7.3 Evolution of Samplable Version Space and its Implications
As shown in Fig. 5 and Fig. 3, the samplable version space shifts closer to the correct labeling while reducing its diameter as more labels are queried. These two processes together result in smaller test error.
No Direct Control Over Reduction of Version Space Bias
Interestingly, the Core-Set method, which queries representative samples from the pool set by solving a k-center problem in the feature space learned by neural networks, is incapable of achieving negligible wrong agreement on the learned version spaces. Indeed, it suffers from larger version space bias than the direct diameter reduction methods. After all, random queries, which are i.i.d. by assumption, fail to achieve this goal, as concluded in Section 6, and other attempts that do not augment the training data seem unlikely to succeed.
Prior Mass Induced by Stochastic Gradient Descent May Not Be a Reliable Surrogate Measure
The continued decline in wrong agreement indicates that the distribution over labelings changes over time. This shifting density over samplable labelings renders the notion of prior mass problematic; hence all notions relying on the prior mass may be ill-defined. A direct consequence is that an estimate of the worst-case version space reduction is more reliable than an average-case one. For example, VR provides a more reliable estimate of version space reduction than GE does.
Inferiority of the Weighted Diameter Reduction Method The estimation of the weighted diameter involves estimating the prior mass. Hence, the inferiority of M-PWD to PWD and GVD can be attributed to the intrinsic difficulty of obtaining unbiased samplable version spaces and the resulting density shift. Supporting evidence can be seen by noting that on MNIST and Fashion-MNIST, where the wrong agreement is large (hence the density shift is large), the weighted variant performs worse, while on SVHN and STL-10, where the wrong agreement is small (hence the shift is small), the gap is less significant.
7.4 Gibbs-Vote Disagreement
The Gibbs-vote disagreement is among the best methods on all datasets, except in the early learning stage on SVHN. Its effectiveness can be ascribed to an interesting phenomenon: majority voting reduces mistakes. Although this need not always be the case, it occurs in many situations, and the boost to accuracy depends on the variance of the errors of the Gibbs classifiers. We show empirically in Fig. 6 that the majority vote classifier indeed has a smaller error rate than random hypotheses in the version space. Hence, optimizing the Gibbs-vote disagreement not only reduces the diameter but also implicitly moves the consistent hypotheses closer to the correct labeling, which is useful when the samplable version spaces are biased and do not contain the Bayes classifier.
In this work, we studied version space reduction for convolutional neural networks. We revealed the differences and connections between prior mass and diameter reduction methods and proposed the Gibbs-vote disagreement as a new, effective diameter reduction method. With experiments on four datasets, we shed light on how version space reduction works in the deep active learning setting and demonstrated the superiority of diameter reduction over prior mass reduction methods and other baselines.
This work was supported by the BMBF project MLWin and the Munich Center for Machine Learning (MCML).
Appendix 0.A Estimators and Algorithm
0.A.1 Effective Hypothesis Space
Let $U$ be the set of unlabeled samples in the pool set and $Q \subseteq U$ the set of queried unlabeled samples. The effective hypothesis space is the restriction of $\mathcal{H}$ to $U$, or equivalently, the set of all labelings of $U$ achievable by hypotheses in $\mathcal{H}$:
$$\mathcal{H}|_U = \big\{(h(x))_{x \in U} : h \in \mathcal{H}\big\}.$$
0.A.2 Estimators of Diameters
Let $\{h_1, \dots, h_m\}$ be a set of sample hypotheses from the version space. An unbiased estimator of the pairwise disagreement can be constructed by computing the average pairwise distance between hypotheses:
$$\widehat{\mathrm{PWD}} = \frac{2}{m(m-1)} \sum_{1 \le i < j \le m} d(h_i, h_j).$$
Similarly, an estimator of the Gibbs-vote disagreement can be constructed by computing the average distance between the hypotheses and their empirical majority vote $\hat{h}_{MV}$:
$$\widehat{\mathrm{GVD}} = \frac{1}{m} \sum_{i=1}^{m} d(h_i, \hat{h}_{MV}).$$
A (batch mode) greedy algorithm that selects the unlabeled samples inducing the minimum version space, in terms of a given diameter measure, in the worst-case scenario is shown in Algorithm 1. Other active learning methods (e.g., prior mass reduction methods) can also be described by this algorithm, with the selection line replaced by the corresponding objective function. Note that in the algorithm the version space is maintained explicitly, while in practice we only need to sample from the new version space by training neural networks on the updated set of queried samples; the version space is thus always maintained implicitly.
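The selection step of such a greedy algorithm can be sketched as follows, under the simplifying assumption that the sub-version-spaces $V_x^y$ are approximated by partitioning a fixed ensemble of sampled hypotheses according to their vote on the candidate (in practice, networks are retrained; all names here are ours):

```python
import numpy as np
from itertools import combinations

def _pairwise(votes: np.ndarray) -> float:
    """Average pairwise disagreement within a set of hypotheses."""
    pairs = list(combinations(range(len(votes)), 2))
    return float(np.mean([(votes[i] != votes[j]).mean() for i, j in pairs]))

def select_query(votes: np.ndarray, candidates, diameter=_pairwise):
    """Greedy worst-case diameter reduction over candidate pool samples.

    For each candidate x, the ensemble is partitioned by its predicted
    label on x; each partition approximates a sub-version-space V_x^y.
    The candidate whose worst-case (max over labels) partition diameter
    is smallest is selected.
    """
    best, best_val = None, float("inf")
    for x in candidates:
        worst = 0.0
        for y in np.unique(votes[:, x]):
            sub = votes[votes[:, x] == y]
            if len(sub) >= 2:            # a singleton partition has diameter 0
                worst = max(worst, diameter(sub))
        if worst < best_val:
            best, best_val = x, worst
    return best
```

Replacing `diameter` with a Gibbs-vote estimator yields the GVD variant; batch mode repeats the selection over the remaining candidates.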
Appendix 0.B Proof of Equation 20
By the triangle inequality for the pseudo-metric $d$,
$$\mathrm{PWD}(V) = \mathbb{E}_{h, h' \sim \pi|_V}\big[d(h, h')\big] \le \mathbb{E}_{h, h'}\big[d(h, h_{MV}) + d(h_{MV}, h')\big] = 2\,\mathrm{GVD}(V),$$
where the last inequality is the triangle inequality applied inside the expectation. By definition, it holds that $\mathrm{GVD}(V) = \min_{g} \mathbb{E}_{h \sim \pi|_V}[d(h, g)] \le \mathbb{E}_{h' \sim \pi|_V} \mathbb{E}_{h \sim \pi|_V}[d(h, h')] = \mathrm{PWD}(V)$. Using the relations derived in Section 5.3, i.e., that the pairwise disagreement is the expected Gibbs error and the Gibbs-vote disagreement is the expected variation ratio, it holds that
$$\tfrac{1}{2}\,\mathrm{PWD}(V) \le \mathrm{GVD}(V) \le \mathrm{PWD}(V).$$
Appendix 0.C Singly-Weighted Diameter Reduction
Besides the doubly-weighted diameter reduction method mentioned in Section 5.4, there are singly-weighted variants.
0.C.0.1 Weighted Pairwise Disagreement
$$x^{*} = \operatorname*{arg\,min}_{x}\ \max_{y \in \mathcal{Y}}\ \pi(V_x^y)\, \mathrm{PWD}(V_x^y),$$
which is equivalent to minimizing the potential expected marginal gain on the prior mass reduction utility function.
0.C.0.2 Weighted Gibbs-Vote Disagreement
$$x^{*} = \operatorname*{arg\,min}_{x}\ \max_{y \in \mathcal{Y}}\ \pi(V_x^y)\, \mathrm{GVD}(V_x^y),$$
which is equivalent to minimizing the potential worst-case marginal gain.
Appendix 0.D Additional Evaluation Results
0.D.1 Evaluation on the Pool Set
In addition to the evaluation results on the test set shown in the paper, we show results on the pool set in Figure 7. On MNIST, Fashion-MNIST and SVHN, diameter reduction methods are more effective than prior mass reduction methods and other baselines at finding the true labeling of the pool set using as few labels as possible. On STL-10, prior mass reduction performs better at reducing the diameter. An explanation is that we use unlabeled samples in the validation set to estimate the relative prior mass and diameters when selecting each query, so the diameter reduction methods do not explicitly optimize the diameter measured on the pool set, but rather on an unbiased validation set.
0.D.2 Embedding of Version Spaces
To better illustrate the evolution of version spaces and the existence of bias in version spaces, we show in Figure 8 a 2-D embedding of sample hypotheses during the active learning process for each dataset, using Multi-Dimensional Scaling (MDS). MDS finds a low-dimensional representation of potentially high-dimensional data by preserving pairwise distances between the data points. We show sample hypotheses from the first (purple) and the last (red) version spaces, as well as intermediate version spaces obtained by training on randomly queried labels amounting to approximately 25%, 50% and 75% of the total budget. To achieve better visualization, we first compute the embedding of the five Gibbs (random) classifiers and the Bayes classifier, and then compute the embedding of each version space separately and center it at the corresponding Gibbs classifier. We use the disagreement probability evaluated on the test set as the distance metric.
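For readers reimplementing this visualization: classical (Torgerson) MDS is one concrete choice that fits in a few lines of NumPy. The paper does not specify the MDS variant, so this is only a sketch (scikit-learn's metric MDS would be an alternative):

```python
import numpy as np

def classical_mds(D: np.ndarray, dim: int = 2) -> np.ndarray:
    """Classical MDS: given a symmetric (n, n) matrix of pairwise
    distances (here, disagreement probabilities between hypotheses),
    return an (n, dim) embedding that approximately preserves them."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    w, v = np.linalg.eigh(B)                   # eigenvalues in ascending order
    order = np.argsort(w)[::-1][:dim]          # keep the largest eigenvalues
    return v[:, order] * np.sqrt(np.clip(w[order], 0.0, None))
```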
As more labels are queried, the version spaces move closer to the Bayes classifier while reducing their diameters. The bias in the version spaces is non-negligible for the four datasets. An active learning algorithm contributes to the shrinkage of the samplable version space but does not have direct control over the reduction of bias. How to efficiently reduce the bias remains an open problem for designing active learning algorithms for neural networks.
Appendix 0.E Datasets Selection
The four image classification datasets MNIST, Fashion-MNIST, SVHN and STL-10 are chosen based on several considerations: (1) a relatively balanced label distribution; (2) the existence of neural network models that train fast on the original training set; (3) no data augmentation is needed. Since active learning methods query highly biased samples, a balanced label distribution helps mitigate the problem of query label imbalance. The second point reduces the time needed to run active learning experiments. The last point guarantees that the samples used for training are exactly those that have been queried.
Appendix 0.F Implementation Details
For the four datasets we consider, no data augmentation is used. Unless otherwise stated, the neural network models are trained using SGD with momentum 0.8. The learning rate decays by a factor of 0.1, down to a minimum value, whenever there is no improvement in validation accuracy for 10 consecutive training epochs. The maximum number of training epochs is 200. The batch size for training is set to 32.
0.F.1 MNIST
We select a random balanced set of 50000 samples from the original 60000 training samples as the training/validation set and use the original 10000 test samples as the test set. The 2-conv-layer ConvNet is trained using RMSProp. The learning rate decays by a factor of 0.5, down to a minimum value, whenever there is no improvement in validation accuracy for 10 consecutive training epochs. Dropout with rate 0.5 is applied to the output of the fully-connected layer between the last convolution layer and the output layer. The batch size for training is set to 16.
0.F.2 Fashion-MNIST
We use the original balanced 60000 training and 10000 test samples as the training/validation and test sets. A 3-conv-layer ConvNet is used as the classifier model. Dropout with rate 0.5 is applied to the output of the fully-connected layer between the last convolution layer and the output layer.
0.F.3 SVHN
We select a random balanced set of 45000 samples from the original 73257 training samples as the training/validation set and a random balanced set of 15000 samples from the original 26032 test samples as the test set. A 6-conv-layer ConvNet is used as the classifier model. Dropout with rate 0.3 is applied to the output of every two convolution layers and to the output of the fully-connected layer between the last convolution layer and the output layer.
0.F.4 STL-10
- Balcan, M.-F., Beygelzimer, A., Langford, J. (2006) Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 65–72. Cited by: §2.
- Balcan, M.-F., Broder, A., Zhang, T. (2007) Margin based active learning. In Proceedings of the 20th Annual Conference on Learning Theory, pp. 35–50. Cited by: §7.2.
- Beluch, W. H., Genewein, T., Nürnberger, A., Köhler, J. M. (2018) The power of ensembles for active learning in image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9368–9377. Cited by: §2.
- Coates, A., Ng, A. Y., Lee, H. (2011) An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pp. 215–223. Cited by: §0.F.4.
- Cuong, N. V., Lee, W. S., Ye, N., Chai, K. M. A., Chieu, H. L. (2013) Active learning for probabilistic hypotheses using the maximum Gibbs error criterion. In Advances in Neural Information Processing Systems, pp. 1457–1465. Cited by: §1, §1, §4.1.
- Cuong, N. V., Lee, W. S., Ye, N. (2014) Near-optimal adaptive pool-based active learning with general loss. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, pp. 122–131. Cited by: §2, §4.1.
- Dasgupta, S. (2005) Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems, pp. 337–344. Cited by: §1, §1, §2.
- Dasgupta, S. (2006) Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems, pp. 235–242. Cited by: §1, §5.1, §5.4.
- Dasgupta, S. (2009) Two faces of active learning. In Proceedings of the 20th International Conference on Algorithmic Learning Theory, pp. 1–1. Cited by: §1.
- Devroye, L., Györfi, L., Lugosi, G. (2013) A probabilistic theory of pattern recognition. Vol. 31, Springer Science & Business Media. Cited by: §5.2.
- Ducoffe, M., Precioso, F. (2018) Adversarial active learning for deep networks: a margin based approach. arXiv preprint arXiv:1802.09841. Cited by: §1, §1, §2, §7.2, §7.
- Gal, Y., Ghahramani, Z. (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, pp. 1050–1059. Cited by: §2.
- Gal, Y., Islam, R., Ghahramani, Z. (2017) Deep Bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning, pp. 1183–1192. Cited by: §1, §1, §2, §7.
- Gissin, D., Shalev-Shwartz, S. (2019) Discriminative active learning. arXiv preprint arXiv:1907.06347. Cited by: §1, §2.
- Golovin, D., Krause, A. (2011) Adaptive submodularity: a new approach to active learning and stochastic optimization. Journal of Artificial Intelligence Research, pp. 427–486. Cited by: §1, §1, §2.
- Hanneke, S. (2007) A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on Machine Learning, pp. 353–360. Cited by: §2.
- He, K., Zhang, X., Ren, S., Sun, J. (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §0.F.4.
- Houlsby, N., Huszár, F., Ghahramani, Z., Lengyel, M. (2011) Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745. Cited by: §1, §1, §2, §7.2, §7.
- Joshi, A. J., Porikli, F., Papanikolopoulos, N. (2009) Multi-class active learning for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2372–2379. Cited by: §1.
- Kääriäinen, M. (2005) Generalization error bounds using unlabeled data. In Proceedings of the 18th Annual Conference on Learning Theory, pp. 127–142. Cited by: §5.1, §5.2.
- Kirsch, A., van Amersfoort, J., Gal, Y. (2019) BatchBALD: efficient and diverse batch acquisition for deep Bayesian active learning. In Advances in Neural Information Processing Systems, pp. 7024–7035. Cited by: §1, §2.
- Kruskal, J. B., Wish, M. (1978) Multidimensional scaling. Sage. Cited by: §0.D.2, §6.
- Lacasse, A., Laviolette, F., Marchand, M., Germain, P., Usunier, N. (2007) PAC-Bayes bounds for the risk of the majority vote and the variance of the Gibbs classifier. In Advances in Neural Information Processing Systems, pp. 769–776. Cited by: §7.4.
- LeCun, Y., Cortes, C., Burges, C. J. C. (1998) MNIST handwritten digit database. Cited by: §0.F.1.
- Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A. Y. (2011) Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning. Cited by: §0.F.3.
- Pinsler, R., Gordon, J., Nalisnick, E., Hernández-Lobato, J. M. (2019) Bayesian batch active learning as sparse subset approximation. In Advances in Neural Information Processing Systems, pp. 6356–6367. Cited by: §2.
- Sener, O., Savarese, S. (2018) Active learning for convolutional neural networks: a core-set approach. In Proceedings of the 6th International Conference on Learning Representations. Cited by: §1, §1, §2, §7.2, §7.
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R. (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, pp. 1929–1958. Cited by: §0.F.1.
- Tieleman, T., Hinton, G. (2012) Lecture 6.5 – RMSProp: divide the gradient by a running average of its recent magnitude. Coursera: Neural Networks for Machine Learning. Cited by: §0.F.1.
- Tosh, C., Dasgupta, S. (2017) Diameter-based active learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 3444–3452. Cited by: §1, §2, §5.4, §7.
- Xiao, H., Rasul, K., Vollgraf, R. (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §0.F.2.
- Zagoruyko, S., Komodakis, N. (2016) Wide residual networks. In Proceedings of the 27th British Machine Vision Conference, pp. 87.1–87.12. Cited by: §0.F.4.
- Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O. (2017) Understanding deep learning requires rethinking generalization. In Proceedings of the 5th International Conference on Learning Representations. Cited by: §1.