Collecting the right data for supervised deep learning is an important and challenging task that can improve performance on most modern computer vision problems. Active learning aims to select, from a large unlabeled dataset, the smallest possible training set to solve a specific task. To this end, the uncertainty of a model is used as a means of quantifying what the model does not know, to select data to be annotated. In deterministic neural networks, uncertainty is typically measured based on the output of the last softmax-normalized layer. However, these estimates do not always provide reliable information to select appropriate training data, as they tend to be overconfident or incorrect [49, 22]. On the other hand, Bayesian methods provide a principled approach to estimate the uncertainty of a model. Though they have recently gained momentum, they have not found widespread use in practice [29, 47, 19].
The formulation of a Bayesian Neural Network (BNN) involves placing a prior distribution over all the parameters of the network, and obtaining the posterior given the observed data. The distribution of predictions provided by a trained BNN helps capture the model's uncertainty. However, training a BNN involves marginalization over all possible assignments of weights, which is intractable for deep BNNs without approximations [21, 4, 17]. The limitations of existing approximation algorithms restrict their applicability: they do not specifically address the fact that deep BNNs on large datasets are more difficult to optimize than deterministic networks, and they require extensive parameter tuning to provide good performance and uncertainty estimates. Furthermore, estimating uncertainty in BNNs requires drawing a large number of samples at test time, which can be extremely computationally demanding.
A practical alternative for uncertainty estimation is to use an ensemble of networks. Different models in an ensemble are treated as if they were samples drawn directly from a BNN posterior. Ensembles are easy to optimize and fast to execute. However, they do not approximate uncertainty in the same manner as a BNN. For example, the parameters in the first kernel of the first layer of a convolutional neural network may serve a completely different purpose in different members of an ensemble. Therefore, the variance of the values of these parameters after training cannot be compared to the variance that would have been obtained in the first kernel of the first layer of a trained BNN with the same architecture.
In this paper, we propose Deep Probabilistic Ensembles (DPEs), a novel approach to approximate BNNs using ensembles. Specifically, we use variational inference, a popular technique for training BNNs, to derive a KL divergence regularization term for ensembles. We show the potential of our approach for active learning on large-scale visual classification benchmarks with up to millions of samples. When applied to CIFAR-10, CIFAR-100 and ImageNet, DPEs consistently outperform strong baselines and previously published results. To further show the applicability of our method, we propose a framework to actively annotate data for semantic segmentation, which on BDD100k yields IoU improvements of up to 20% while handling classes that are underrepresented in the training distribution.
Our contributions therefore are (i) we propose KL regularization for training DPEs, which combine the advantages of ensembles and BNNs; (ii) we apply DPEs to active image classification on large-scale datasets; and (iii) we propose a framework for active semantic segmentation using DPEs. Our experiments on both visual tasks demonstrate the benefits and scalability of our method compared to existing methods. DPEs are parallelizable, easy to implement, yield high performance, and provide good uncertainty estimates when used for active learning.
2 Related Work
Regularizing Ensembles. Using ensembles of neural networks to improve performance has a long history [23, 13]. Attempts to regularize ensembles have typically focused on promoting diversity among the ensemble members, through penalties such as negative correlation. While these methods are presented from the perspective of improving accuracy, our approach improves the ensemble's uncertainty estimates with little to no impact on accuracy.
Training BNNs. There are four popular approximations for training BNNs. The first one is Markov Chain Monte Carlo (MCMC). MCMC approximates the posterior through Bayes' rule, by sampling to compute an integral. This method is accurate, but it is extremely sample-inefficient and has not been applied to deep BNNs. The second one is Variational Inference, an iterative optimization approach that updates a chosen variational distribution on each network parameter using the observed data. Training is fast for reasonable network sizes, but performance is not always ideal, especially for deeper networks, due to the nature of the optimization objective.
The third approximation is Monte Carlo Dropout (MC Dropout) and its variants [50, 1]. These methods regularize the network during training with a technique like dropout, and draw Monte Carlo samples at test time with different dropout masks as approximate samples from a BNN posterior. This can be seen as an approximation to variational inference when using Bernoulli priors. Due to its simplicity, several approaches have been proposed using MC dropout for uncertainty estimation [29, 30, 16, 19]. However, empirically, these uncertainty estimates do not perform as well as those obtained from ensembles [32, 2]. In addition, MC dropout requires the dropout rate to be very well tuned to produce reliable uncertainty estimates. The last approximation is Bayes by Backprop, which involves maintaining two parameters on each weight, a mean and a standard deviation, with their gradients calculated by backpropagation. Empirically, the uncertainty estimates obtained with this approach and MC Dropout perform similarly for active learning.
(Deep) Active Learning. A comprehensive review of classical approaches to active learning can be found in the survey literature. Most of these approaches rely on a confusion metric based on the model's predictions, like information theoretical methods that maximize expected information gain, committee-based methods that maximize disagreement between committee members [10, 15, 37, 9], entropy maximization methods, and margin-based methods [42, 6, 51, 28]. Data samples for which the current model is uncertain, or most confused, are queried for labeling, usually one at a time, and, in general, the model is re-trained for each of these queries. In the era of deep neural networks, querying single samples and retraining the model for each one is not feasible. Adding a single sample into the training process is likely to have no statistically significant impact on the performance of a deep neural network, and the retraining becomes computationally intractable.
Scalable alternatives to classical active learning ideas have been explored for the iterative batch query setting. Pool-based approaches filter unlabeled data using heuristics based on a confusion metric, and then choose what to label based on diversity within this pool [11, 52, 57]. More recent approaches explore active learning without breaking the task down using confusion metrics, through a lower bound on the prediction error known as the core-set loss
or learning the interestingness of samples in a data-driven manner through reinforcement learning [53, 14, 40]. Results in these directions are promising. However, state-of-the-art active learning algorithms are based on uncertainty estimates from network ensembles. Our experiments demonstrate the gains obtained over ensembles when they are trained with the proposed regularization scheme.
3 Deep Probabilistic Ensembles
For inference in a BNN, we can consider the weights $\mathbf{w}$ to be latent variables drawn from our prior distribution, $p(\mathbf{w})$. These weights relate to the observed dataset $\mathbf{x}$ through the likelihood, $p(\mathbf{x}|\mathbf{w})$. We aim to compute the posterior,

$$p(\mathbf{w}|\mathbf{x}) = \frac{p(\mathbf{x}|\mathbf{w})\, p(\mathbf{w})}{\int p(\mathbf{x}|\mathbf{w}')\, p(\mathbf{w}')\, d\mathbf{w}'}. \tag{1}$$
It is intractable to analytically integrate over all assignments of weights, so variational inference restricts the problem to a family of distributions $q(\mathbf{w})$ over the latent variables. We then try to optimize for the member of this family that is closest to the true posterior in terms of KL divergence,

$$q^*(\mathbf{w}) = \arg\min_{q} \; \mathrm{KL}\big(q(\mathbf{w}) \,\|\, p(\mathbf{w}|\mathbf{x})\big).$$
To make the computation tractable, the objective is modified by subtracting a constant independent of our weights, $\log p(\mathbf{x})$. This new objective to be minimized is the negative of the Evidence Lower Bound (ELBO),

$$-\mathrm{ELBO} = \mathrm{KL}\big(q(\mathbf{w}) \,\|\, p(\mathbf{w}|\mathbf{x})\big) - \log p(\mathbf{x}).$$
Substituting for the posterior term from Eq. 1, this can be simplified to

$$-\mathrm{ELBO} = \mathrm{KL}\big(q(\mathbf{w}) \,\|\, p(\mathbf{w})\big) - \mathbb{E}_{q(\mathbf{w})}\big[\log p(\mathbf{x}|\mathbf{w})\big], \tag{7}$$
where the first term is the KL divergence between $q(\mathbf{w})$ and the chosen prior distribution $p(\mathbf{w})$, and the second term is the expected negative log likelihood (NLL) of the data based on the current parameters. The optimization difficulty in variational inference arises partly due to this expectation of the NLL. In deterministic networks, fragile co-adaptations exist between different parameters, which can be crucial to their performance. Features typically interact with each other in a complex way, such that the optimal setting for certain parameters is highly dependent on specific configurations of the other parameters in the network. Co-adaptation makes training easier, but can reduce the ability of deterministic networks to generalize. Popular regularization techniques to improve generalization, such as Dropout, can be seen as a trade-off between co-adaptation and training difficulty. An ensemble of deterministic networks exploits co-adaptation: each network in the ensemble is optimized independently, making them easy to optimize.
For BNNs, the nature of the objective, an expectation over all possible assignments of $\mathbf{w}$, prevents the BNN from exploiting co-adaptations during training, since we seek to minimize the NLL for any generic deterministic network sampled from the BNN. While in theory this should lead to strong generalization, in practice it becomes very difficult to tune BNNs to produce competitive results. In this paper, we propose a form of regularization that uses the optimization simplicity of ensembles for training BNNs. The main properties of our approach compared to ensembles and BNNs are summarized in Table 1.
KL regularization. The standard approach to training neural networks involves regularizing each individual parameter with $L_1$ or $L_2$ penalty terms. We instead apply the KL divergence term in Eq. 7 as a regularization penalty on the set of values a given parameter takes over all members in an ensemble. If we choose the family of Gaussian functions for $q(\mathbf{w})$ and $p(\mathbf{w})$, this term can be analytically computed by assuming mutual independence between the network parameters and factoring the term into individual Gaussians. The KL divergence between two Gaussians with means $\mu_1$ and $\mu_2$, and standard deviations $\sigma_1$ and $\sigma_2$, is given by

$$\mathrm{KL}\big(\mathcal{N}(\mu_1, \sigma_1^2)\,\|\,\mathcal{N}(\mu_2, \sigma_2^2)\big) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}. \tag{8}$$
Choice of prior. In our experiments, we choose the network initialization technique proposed by He et al. as a prior. For batch normalization parameters, we fix $\sigma_p$, and set $\mu_p = 1$ for the weights and $\mu_p = 0$ for the biases. For the weights in convolutional layers with the ReLU activation (with $i$ input channels, $o$ output channels, and kernel dimensions $k \times k$), we set $\mu_p = 0$ and $\sigma_p^2 = 2/(i k^2)$. Linear layers can be considered a special case of convolutions, where the kernel has the same dimensions as the input activation. The KL regularization of a layer is obtained by removing the terms independent of $\theta$ and substituting for $\mu_p$ and $\sigma_p$ in Eq. 8,

$$\Omega_{\text{layer}} = \sum_j \left( \log \sigma_j + \frac{1}{i k^2 \sigma_j^2} + \frac{\mu_j^2}{2 \sigma_j^2} \right), \tag{9}$$
where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of the set of values taken by a parameter across different ensemble members. In Eq. 9, the first term prevents extremely large variances compared to the prior, so the ensemble members do not diverge completely from each other. The second term heavily penalizes variances less than the prior, promoting diversity between members. The third term closely resembles weight decay, keeping the mean of the weights close to that of the prior, especially when their variance is also low.
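As a rough sketch, the per-parameter penalty can be computed from ensemble statistics with a few lines of NumPy. This is an illustration of the idea, not a reference implementation: the function name `kl_penalty` is ours, and the exact constants may differ from those in Eq. 9.

```python
import numpy as np

def kl_penalty(ensemble_weights, prior_var):
    """KL-style penalty (up to constants) on the spread of a parameter's
    values across ensemble members, against a zero-mean Gaussian prior
    with variance `prior_var`. `ensemble_weights` has shape (E, ...),
    where E is the ensemble size."""
    mu = ensemble_weights.mean(axis=0)          # mean across members
    var = ensemble_weights.var(axis=0) + 1e-8   # variance across members
    # term 1 discourages very large variances; term 2 heavily penalizes
    # variances below the prior; term 3 is a weight-decay-like pull on
    # the mean, strongest when the variance is low.
    penalty = np.log(var) + (prior_var + mu ** 2) / var
    return penalty.sum()

# He prior for a conv layer with i input channels and k x k kernels
i, k = 64, 3
prior_var = 2.0 / (i * k * k)
rng = np.random.default_rng(0)
w = rng.normal(0.0, np.sqrt(prior_var), size=(8, 16, i, k, k))  # 8 members
print(kl_penalty(w, prior_var))
```

Shrinking all members toward a common value (reducing the across-member variance) increases the penalty, which is what drives diversity between ensemble members.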
Objective function. We can now rewrite the minimization objective from Eq. 7 for an ensemble as:

$$\mathcal{L}(\theta) = \beta\,\Omega(\theta) + \sum_{e=1}^{E} \sum_{(\mathbf{x}, y) \in \mathcal{D}} \ell\big(y, f(\mathbf{x}; \theta_e)\big), \tag{10}$$
where $\mathcal{D}$ is the training data, $E$ is the number of models in the ensemble, and $\theta_e$ refers to the parameters of the $e$-th model. $\ell$ is the cross-entropy loss for classification, and $\Omega$ is our KL regularization penalty over all the parameters $\theta$ of the ensemble. We obtain $\Omega$ by aggregating the penalty from Eq. 9 over all the layers in a network, and use a scaling term $\beta$ to balance the regularizer with the likelihood loss. By summing the loss over each independent model, we approximate the expectation of the ELBO's NLL term in Eq. 7 over only the current ensemble configuration, a subset of all possible assignments of $\mathbf{w}$. This is the main distinction between our approach and traditional variational inference.
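The structure of this objective can be sketched in NumPy as follows, a toy illustration with our own helper names (`cross_entropy`, `dpe_objective`) and illustrative constants, not the training code itself:

```python
import numpy as np

def cross_entropy(logits, label):
    # numerically stable NLL of one example under softmax(logits)
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[label])

def dpe_objective(member_logits, labels, member_weights, beta, prior_var):
    """Sketch of the DPE objective: the NLL summed over each independent
    ensemble member, plus beta times a KL-style penalty on the spread of
    weights across members. `member_logits` has shape (E, N, C) and
    `member_weights` has shape (E, P)."""
    nll = sum(cross_entropy(member_logits[e, n], labels[n])
              for e in range(member_logits.shape[0])
              for n in range(len(labels)))
    mu = member_weights.mean(axis=0)
    var = member_weights.var(axis=0) + 1e-8
    omega = float((np.log(var) + (prior_var + mu ** 2) / var).sum())
    return nll + beta * omega

rng = np.random.default_rng(1)
logits = rng.normal(size=(4, 5, 3))          # 4 members, 5 samples, 3 classes
labels = [0, 1, 2, 0, 1]
weights = rng.normal(0.0, 0.1, size=(4, 20)) # toy flattened parameters
loss = dpe_objective(logits, labels, weights, beta=1e-3, prior_var=0.01)
print(loss)
```

Note that each member's NLL term only touches that member's parameters, so the gradients of the likelihood term decouple across members; only the regularizer couples the ensemble.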
4 Uncertainty Estimation for Active Learning
In this section, we describe how DPEs can be used for active learning. There are four components in our pipeline:
A large unlabeled set $U$ consisting of $n$ samples, $\{x_1, \ldots, x_n\}$, where each $x_i$ is an input data point in the $d$-dimensional input feature space $\mathcal{X}$.
A labeled set $L$, consisting of labeled pairs $(x_i, y_i)$, where each $x_i$ is a data point and each $y_i$ is its corresponding label. For a $C$-way classification problem, the label space is $\{1, \ldots, C\}$.
An annotator that can be queried to map data points to labels.
Our DPE, a set of $E$ models, each trained to estimate the label of a given data point.
These four components interact with each other in an iterative process. In our setup, commonly referred to as batch mode active learning, every iteration involves two stages, which are summarized in Algorithm 1. In the first stage, the current labeled set $L$ is used to train the ensemble to convergence. In the second stage, the trained ensemble is applied to the existing unlabeled pool $U$, selecting a subset of $g$ samples from this pool to send to the annotator for labeling. The growth parameter $g$ may be fixed or vary over iterations. The selected subset is moved from $U$ to $L$ before the next iteration. Training proceeds in this manner until an annotation budget of $B$ total label queries has been met.
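The two stages above can be sketched as a toy Python loop. `train_fn`, `score_fn`, and `oracle` are hypothetical stand-ins for the ensemble trainer, the acquisition function, and the annotator:

```python
import numpy as np

def active_learning_loop(pool_x, oracle, train_fn, score_fn, growth, budget):
    """Batch-mode active learning, sketched. Each iteration trains a model
    on the labeled set, scores the unlabeled pool with an acquisition
    function, and queries the `growth` highest-scoring samples."""
    unlabeled = list(range(len(pool_x)))
    labeled = []
    while len(labeled) < budget and unlabeled:
        model = train_fn([(pool_x[i], oracle(i)) for i in labeled])
        scores = score_fn(model, [pool_x[i] for i in unlabeled])
        order = np.argsort(np.asarray(scores))[::-1]   # most uncertain first
        batch = [unlabeled[j] for j in order[:min(growth, budget - len(labeled))]]
        labeled += batch
        unlabeled = [i for i in unlabeled if i not in batch]
    return labeled

# Toy run: the "acquisition score" is just the sample value itself.
pool = [0.1, 0.9, 0.5, 0.7, 0.2]
picked = active_learning_loop(
    pool,
    oracle=lambda i: 0,             # dummy annotator
    train_fn=lambda data: None,     # stand-in for ensemble training
    score_fn=lambda model, xs: xs,  # toy acquisition function
    growth=2, budget=4)
print(picked)  # queries the highest-valued samples, two per iteration
```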
Central to the second stage is an acquisition function that is used for ranking all the available unlabeled data. For this, we employ the ensemble predictive entropy, given by

$$\mathcal{H}(\mathbf{x}) = -\sum_{c} \bar{p}_c(\mathbf{x}) \log \bar{p}_c(\mathbf{x}), \qquad \bar{p}(\mathbf{x}) = \frac{1}{E} \sum_{e=1}^{E} \mathrm{softmax}\big(z(\mathbf{x}; \theta_e)\big). \tag{11}$$
Here, $z$ refers to the unnormalized model prediction vector, and the softmax function maps it to a probability distribution over classes. Since the entropy for each query sample is independent of the others, the entire batch of $g$ samples can be added to $L$ in parallel. This allows for increased efficiency, but does not account for the fact that, within the batch, adding certain samples has an impact on the uncertainty associated with the other samples.
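The ensemble predictive entropy can be computed in a few lines, a minimal NumPy sketch assuming the stacked member logits are available:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predictive_entropy(logits):
    """Entropy of the mean softmax over ensemble members.
    `logits` has shape (E, N, C): members x samples x classes."""
    p = softmax(logits).mean(axis=0)               # average the members
    return -(p * np.log(p + 1e-12)).sum(axis=-1)   # per-sample entropy

# Confident, agreeing members score near 0; total uncertainty scores log C.
agree = np.tile(np.array([[[8.0, 0.0, 0.0]]]), (4, 1, 1))  # 4 members agree
spread = np.zeros((4, 1, 10))                              # uniform members
print(predictive_entropy(agree), predictive_entropy(spread))
```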
In our experiments, to circumvent the need for annotators in the experimental loop, we use datasets for which all samples are annotated in advance, but hide the annotations from the model. The dataset is initially treated as our unlabeled pool $U$, and labels are revealed to the model as the experiment proceeds.
5 Active Image Classification
In this section, we evaluate the effectiveness of our approach on active learning for image classification across various levels of task complexity with respect to number of classes, image resolution and quantity of data.
Datasets. We experiment with three widely used benchmarks for image classification: the CIFAR-10 and CIFAR-100 datasets, as well as the ImageNet 2012 classification dataset. The CIFAR datasets involve object classification tasks over natural images: CIFAR-10 is coarse-grained over 10 classes, and CIFAR-100 is fine-grained over 100 classes. For both tasks, there are 50k training images and 10k validation images of resolution 32 × 32. ImageNet consists of 1000 object classes, with annotation available for 1.28 million training images and 50k validation images of varying sizes and aspect ratios. All three datasets are balanced in terms of the number of training samples per class.
Network Architecture. We use ResNet 
backbones to implement the individual members of the DPE. With ImageNet, we use a ResNet-18 with the standard kernel sizes and counts. We use a variant of the same architecture for CIFAR, with pre-activation blocks and strides of 1 instead of 2 in the initial convolution and first residual block.
We retrain the ensemble from scratch for every iteration of active learning, to avoid remaining in a potentially sub-optimal local minimum from the earlier model trained with a smaller quantity of data. Optimization is done using Stochastic Gradient Descent with a learning rate of 0.1, batch size of 32 and momentum of 0.9. We do mean-std pre-processing, and augment the labeled dataset on-line with random crops and horizontal flips.
The scaling term $\beta$ for the KL regularization penalty is held constant across experiments. We use a patience parameter (set to 25) counting the number of epochs with no improvement in validation accuracy, after which the learning rate is dropped by a factor of 0.1. We end training when dropping the learning rate also gives no improvement in the validation accuracy after the same number of epochs. In each iteration, if the early stopping criterion is not met, we train for a maximum of 400 epochs.
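The patience rule can be replayed in isolation, a toy sketch (`patience_schedule` is our illustrative name, not the training code):

```python
def patience_schedule(val_accuracies, patience=25, factor=0.1, lr=0.1):
    """Replays the early-stopping rule: drop the learning rate by `factor`
    after `patience` epochs without validation improvement, and signal a
    stop if a drop also brings no improvement. Returns (final_lr, stopped)."""
    best, since_best, dropped = -1.0, 0, False
    for acc in val_accuracies:
        if acc > best:
            best, since_best, dropped = acc, 0, False
        else:
            since_best += 1
        if since_best >= patience:
            if dropped:                  # second stall in a row: end training
                return lr, True
            lr, since_best, dropped = lr * factor, 0, True
    return lr, False

# A run that stalls twice with patience 2: one LR drop, then a stop signal.
final_lr, stopped = patience_schedule([1, 1, 1, 1, 1], patience=2)
print(final_lr, stopped)
```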
In our experiments, we use ensembles of 8 members. Empirically, we found that, when using 3 or fewer members, the regularization penalty (specifically, the third term in Eq. 9, involving the mean square over the variance of the parameters across the DPE) causes instability and divergence in the loss. In addition, we obtained no benefits when using a larger number of members (e.g., 16 or more).
| Method | 10k (20%) | 50k (100%) | Ratio |
|---|---|---|---|
| Single + Random | 85.2 | 94.4 | 90.3 |
| DPE + Random | 87.9 | 95.2 | 92.3 |
| Single + Linear-8 | 87.5 | 94.4 | 92.7 |
| Ours (DPE + Linear-8) | 92.0 | 95.2 | 96.3 |
Growth Parameter. In our first experiment, we focus on analyzing variations of the growth parameter $g$. This parameter decides how many times the model must be retrained, and therefore has large implications on the time needed to reach the final annotation budget and complete the active learning experiment. For this experiment, we consider the CIFAR-10 dataset and an overall budget of 16k samples. As a baseline, we use random sampling as an acquisition function with $g = 2$k, training the DPE 8 times. For reference, we also compute the upper bound achieved by training the DPE with all 50k samples in the dataset. We consider three variations of the growth parameter with the $\mathcal{H}$ acquisition function: (i) Linear-4: we set $g = 4$k, training the DPE 4 times; (ii) Exponential-4: we initially set $g = 2$k, and then iteratively double its value over the 3 remaining active learning iterations; and (iii) Linear-8: we set $g = 2$k, training the DPE 8 times. Fig. 1
shows the error bars over three trials. Linear-8, which is computationally the most demanding, only marginally outperforms the other approaches. It is able to achieve 99.2% of the accuracy of the upper bound with just 32% of the labelling effort. This is probably due to the combination of two factors: (i) less correlation among the samples added (due to the lower growth parameter), and (ii) better uncertainty estimates in the intermediate stages due to more reinitializations. When the model is only retrained 4 times, both Linear-4 and Exponential-4 data addition methods achieve similar performance levels. From a computational point of view, the exponential approach is more efficient as the dataset size for the first three training iterations is smaller (2k, 4k and 8k compared to 4k, 8k and 12k).
| Task | Acquisition | 4k (8%) | 8k (16%) | 16k (32%) |
Comparison to Existing Methods. In a second experiment, we focus on comparing our approach to state-of-the-art deep active learning methods for classification. Specifically, we consider core-set selection and ensembles. The former uses a single VGG-16; the latter, an ensemble of DenseNet-121 models. In addition, we consider the 'Random' and 'Linear-8' experiments from Fig. 1 with a single model instead of the DPE. We use the same architecture (ResNet-18) for either a single model or a member of the DPE. For the baseline, we train the model using $L_2$ regularization instead of KL regularization. Table 2 summarizes our results and those reported in the corresponding papers using 10k samples on CIFAR-10. We also include the upper bound of each experiment, that is, the accuracy obtained when the model is trained using all the available data. As shown, our approach outperforms all the others, not only achieving higher accuracy, but also reducing the gap to the corresponding upper bound. Note that, even with random acquisition, the ratio to the upper bound is better for DPEs than for single deterministic models. This is an indicator of the performance gains achieved through ensembles. Interestingly, with DPEs we observe an even larger improvement for the Linear-8 active learning approach over the deterministic model. These results suggest DPEs provide more precise uncertainty estimates, which leads to better data selection strategies.
Acquisition Function. In a third experiment, we analyze the relevance of the acquisition function, a key component in the active learning pipeline. To this end, we compare $\mathcal{H}$ (as defined in Eq. 11) with four other acquisition functions: categories first entropy ($\mathcal{H}_c$), mutual information ($\mathcal{I}$), variance ($\mathcal{V}$) and variation ratios ($\mathcal{VR}$). The formulation of these functions and the one used in our approach is listed in Table 3.
Categories first entropy is the sum of the entropies of each of the members of the ensemble. Mutual information, also known as Jensen-Shannon divergence or information radius, is the difference between the entropy of the ensemble's mean prediction and the mean entropy of the individual members' predictions. This function is appropriate for active learning, as it has been shown to distinguish epistemic uncertainty (i.e., the confusion over the model's parameters that apply to a given observation) from aleatoric uncertainty (i.e., the inherent stochasticity associated with that observation). Aleatoric uncertainty cannot be reduced even upon collecting an infinite amount of data, and can lead to inefficiency in selecting samples for an uncertainty based active learning system. Variance measures the inconsistency between the members of the ensemble, which also aims to isolate epistemic uncertainty. Finally, variation ratios is the number of non-modal predictions (those that do not agree with the mode, or majority vote) made by the ensemble, and can be thought of as a discretized version of variance.
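These acquisition functions can be sketched from their standard definitions; the exact formulations in Table 3 may differ in normalization, so treat the following NumPy version as illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def acquisition_scores(logits):
    """Five acquisition functions over ensemble logits of shape (E, N, C)."""
    p = softmax(logits)                     # per-member probabilities
    h = entropy(p.mean(axis=0))             # predictive entropy of mean
    h_cat = entropy(p).sum(axis=0)          # categories first entropy
    mi = h - entropy(p).mean(axis=0)        # mutual information (BALD-style)
    var = p.var(axis=0).sum(axis=-1)        # variance across members
    votes = p.argmax(axis=-1)               # (E, N) hard predictions
    mode_count = np.array([np.bincount(v).max() for v in votes.T])
    vr = 1.0 - mode_count / p.shape[0]      # fraction of non-modal votes
    return h, h_cat, mi, var, vr

# Two confident members that disagree: high mutual information, vr = 0.5.
disagree = np.array([[[10.0, 0.0]], [[0.0, 10.0]]])
agree = np.array([[[10.0, 0.0]], [[10.0, 0.0]]])
print(acquisition_scores(disagree)[2], acquisition_scores(agree)[2])
```

The disagreement case illustrates why mutual information isolates epistemic uncertainty: each member is individually confident (low member entropy), yet the averaged prediction is uncertain.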
| Task | Data Sampling | 4k (8%) | 8k (16%) | 16k (32%) |
| Task | Data Sampling | 25.6k (2%) | 51.2k (4%) |
Our results are listed in Table 4. Each result corresponds to the mean of three trials. Intuitively, functions focused on epistemic uncertainty should perform better than simpler ones. However, as shown in our experiments, in practice there is no significant difference in performance among these acquisition functions. Our choice of a simple entropy-based acquisition function ($\mathcal{H}$) provides competitive results compared to the others.
Comparison to Ensembles. Finally, we focus on comparing DPEs to ensembles trained using standard $L_2$ regularization. To this end, we run experiments on CIFAR-10, CIFAR-100 and ImageNet. For CIFAR, we use an initial growth parameter of 2k samples, and compare performances at 4k, 8k and 16k samples (Exponential-4). For ImageNet, we use a slightly different experimental setting, comparing performances at 2% and 4% of the dataset (25.6k and 51.2k samples), since repeated training takes far longer with more samples. A summary of these results and a baseline using random data sampling is listed in Tables 5 and 6 for CIFAR and ImageNet, respectively. As shown, active learning with DPEs clearly outperforms the random baselines, and consistently provides better results than ensemble-based active learning methods. Further, though the gap is small, it typically grows as the annotation budget (size of the labeled dataset) increases, which is a particularly appealing property for large-scale labeling projects.
6 Active Semantic Segmentation
In this section, we introduce our framework and discuss our results of applying DPEs for active semantic segmentation. The goal of semantic segmentation is to assign a class to every pixel in an image. Generating high-quality ground truth at the pixel level is a time-consuming and expensive task. For instance, labeling an image was reported to take 60 minutes on average for the CamVid dataset, and approximately 90 minutes for the Cityscapes dataset. Due to the substantial manual effort involved in producing high-quality annotations, segmentation datasets with precise and comprehensive label maps are orders of magnitude smaller than classification datasets, despite the fact that collecting unlabeled data is not very expensive. When working with a limited budget, active learning is an appealing approach to effectively select data to be annotated for this task.
The proposed framework is outlined in Fig. 2. The DPE consists of a convolutional backbone as an encoder, and an ensemble of decoders. Only a small subset of the parameters and activations of the network are associated with these decoder layers. As a result, memory requirements for training scale linearly with the ensemble size in only this subset of parameters. We focus on analyzing the uncertainty of higher-level semantic concepts associated with deep features in the decoder, since we train the decoder with KL regularization while the encoder is trained using the common $L_2$ regularization.
The loss used to train semantic segmentation models is computed independently at each pixel. Therefore, training these models from partial annotations of an image is straightforward. In our experiments, as shown in Fig. 3, we split a training image into a grid of cropped sections. To measure the uncertainty of a crop, we sum the acquisition function over each pixel in the crop. We then make queries for crops through active learning rather than querying for entire images.
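The crop-level scoring described above can be sketched as follows, a minimal NumPy version assuming a per-pixel acquisition map has already been computed (the 3 × 4 grid matches the 12 crops per image used in our experiments):

```python
import numpy as np

def crop_scores(pixel_scores, grid=(3, 4)):
    """Sum a per-pixel acquisition map over a grid of crops.
    `pixel_scores` has shape (H, W); returns one score per crop, so
    label queries can be issued for crops rather than whole images."""
    h, w = pixel_scores.shape
    gh, gw = grid
    scores = []
    for r in range(gh):
        for c in range(gw):
            crop = pixel_scores[r * h // gh:(r + 1) * h // gh,
                                c * w // gw:(c + 1) * w // gw]
            scores.append(crop.sum())
    return np.array(scores)

# A uniform 6x8 map splits into 12 crops of 2x2 pixels, each scoring 4.
scores = crop_scores(np.ones((6, 8)))
print(scores)
```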
| Data Sampling | 6.7k (8%) | 13.4k (16%) | 26.9k (32%) |
| Method | road | sidewalk | building | wall | fence | pole | traffic light | traffic sign | vegetation | terrain | sky | person | rider | car | truck | bus | train | motorcycle | bicycle | mIoU @ 26.9k | mIoU @ 84k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 84k Crops (Upper Bound) | 92.1 | 56.3 | 81.6 | 20.7 | 40.3 | 42.5 | 44.9 | 43.6 | 84.3 | 45.0 | 93.7 | 56.1 | 25.1 | 86.6 | 38.9 | 56.9 | 0.0 | 38.6 | 42.1 | - | 52.1 |
| Ours (DPE + $\mathcal{H}$) | 91.3 | 55.0 | 80.3 | 18.4 | 36.4 | 38.2 | 40.0 | 39.9 | 83.5 | 42.6 | 93.0 | 55.1 | 20.1 | 85.5 | 34.1 | 52.2 | 0.0 | 36.8 | 50.0 | 50.1 | - |
| Target Signs (DPE + $\mathcal{H}_w$) | 91.1 | 55.2 | 80.1 | 22.0 | 36.9 | 38.7 | 40.5 | 41.9 | 83.6 | 41.3 | 93.0 | 53.2 | 16.9 | 85.3 | 31.8 | 48.0 | 0.0 | 34.1 | 49.8 | 49.7 | - |
Experiments. We set up the active segmentation experiments in a similar way to the Exponential-4 scheme we used for classification. We use iterations at 8%, 16% and 32% of the data, retraining from scratch for every new iteration.
Dataset. We experiment with the BDD100k dataset, which consists of 7000 training images and 1000 validation images at a resolution of 1280 × 720, provided with fine pixel annotation over 19 classes. There is significant class imbalance in this dataset, as shown in Fig. 4.
Network Architecture. We use a fully convolutional encoder-decoder network in our experiments. For the encoder, we dilate the last 2 blocks of a ResNet-18 backbone so that the encoded features are 1/8 of the input resolution. For each decoder, we use two convolutional layers of 64 and 19 kernels each.
Implementation Details. Since we have 12 crops per image (see Fig. 3), we perform active learning over the unlabeled set of 84k crops. We use a batch size of 8, a learning rate of 0.0001, and a patience of 10 epochs. Our network is initialized by pre-training on the same 19-class segmentation task with the Cityscapes dataset. We resize the Cityscapes images and train on random crops for 100 epochs, with a batch size of 16 and an exponentially decaying learning rate initialized at 0.001.
To maintain simplicity in our setup, we avoid tricks such as data augmentation, class-weighted sampling, class-weighted losses, or auxiliary losses at intermediate layers. We check the performance of this setup by training a fully-supervised version of our model on the entire dataset, which achieves a mean IoU of 52.1%, just 1% less than the baseline result provided with the BDD100k dataset for a model of similar architecture but slightly more parameters.
Results. The mean segmentation IoUs over 3 trials are shown in Table 7. As with classification, we observe clear overall benefits of DPEs in the semantic segmentation setup, with nearly a 2% improvement in mean IoU over the random baseline and 1% over ensembles at 26.9k training crops. Again, the relative gap with competing methods is initially small, but grows steadily as the size of the labeled set increases. DPEs recover 96.2% of the performance of the upper bound with just 32% of the labeling effort.
We compare our class-wise performance at 26.9k crops to the random baseline and fully-supervised upper bound in Table 8. We observe the largest margins of improvement with our method for foreground classes with fewer instances in the training data. On the motorcycle class, of which there are only 240 training instances, we see a difference in IoU of nearly 20%.
Targeting Specific Classes. We now consider a special use case where the performance in specific classes is critical. To that end, we define an acquisition function suitable for targeting specific classes based on a class weighting vector, $w$. This class-weighted acquisition function is defined as

$$\mathcal{H}_w(\mathbf{x}) = -\sum_{c} w_c\, \bar{p}_c(\mathbf{x}) \log \bar{p}_c(\mathbf{x}), \tag{12}$$

where $\bar{p}(\mathbf{x})$ is the mean softmax prediction of the ensemble, as in Eq. 11.
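One plausible reading of the class-weighted acquisition is a per-class reweighting of the predictive entropy; `weighted_entropy` below is our illustrative sketch of this idea, not the exact formulation:

```python
import numpy as np

def weighted_entropy(p_mean, w):
    """Class-weighted acquisition (sketch): each class's contribution to
    the predictive entropy is scaled by its weight w_c, so queries
    concentrate where the targeted classes are uncertain.
    `p_mean` has shape (..., C): the ensemble-averaged probabilities."""
    return -(w * p_mean * np.log(p_mean + 1e-12)).sum(axis=-1)

# Uniform prediction over 4 classes: weighting one target class keeps only
# that class's share of the full entropy.
p = np.full(4, 0.25)
target_only = np.array([0.0, 1.0, 0.0, 0.0])
print(weighted_entropy(p, np.ones(4)), weighted_entropy(p, target_only))
```

Setting all weights to 1 recovers the plain predictive entropy, so the standard acquisition is a special case of the weighted one.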
The last row in Table 8 shows our results using $\mathcal{H}_w$ to target the traffic sign class. Specifically, we set the weight for traffic sign to 1 and the remaining weights to 0. As shown, this results in a 2% improvement in IoU over the $\mathcal{H}$ acquisition function for the traffic sign class, without significantly affecting the IoU improvements on other classes (even improving by 3.6% on the wall class). We note that the gap to the upper bound for traffic signs is improved from 3.7% to 1.7%, a 54.1% relative performance gain.
In this paper, we introduced Deep Probabilistic Ensembles (DPEs) for uncertainty estimation in deep neural networks. The key idea is to train ensembles with a novel KL regularization term as a means to approximate variational inference for BNNs. Our results demonstrate that DPEs improve performance on active learning tasks over baselines and state-of-the-art active learning techniques on three different image classification datasets. Importantly, contrary to traditional Bayesian methods, our approach is simple to integrate in existing frameworks and scales to large models and datasets without impacting performance. Moreover, when applied to active semantic segmentation, DPEs yield up to 20% IoU improvement in underrepresented classes.
-  A. Atanov, A. Ashukha, D. Molchanov, K. Neklyudov, and D. Vetrov. Uncertainty Estimation via Stochastic Batch Normalization. ArXiv e-prints, 2018.
-  W. H. Beluch, T. Genewein, A. Nürnberger, and J. M. Köhler. The power of ensembles for active learning in image classification. In CVPR, 2018.
-  D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational Inference: A Review for Statisticians. ArXiv e-prints, 2016.
-  C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. In ICML, 2015.
-  G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. In ECCV, 2008.
-  C. Campbell, N. Cristianini, and A. J. Smola. Query learning with large margin classifiers. In ICML, 2000.
-  D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 1994.
-  M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
-  A. Culotta and A. McCallum. Reducing labeling effort for structured prediction tasks. In AAAI, 2005.
-  I. Dagan and S. P. Engelson. Committee-based sampling for training probabilistic classifiers. In ICML, 1995.
-  B. Demir, C. Persello, and L. Bruzzone. Batch-mode active-learning methods for the interactive classification of remote sensing images. IEEE Trans. on Geo. and Rem. Sen., 2011.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
-  T. G. Dietterich. Ensemble methods in machine learning. In Multiple Classifier Systems, LNCS 1857, pages 1–15. Springer, 2000.
-  M. Fang, Y. Li, and T. Cohn. Learning how to Active Learn: A Deep Reinforcement Learning Approach. ArXiv e-prints, 2017.
-  Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 1997.
-  Y. Gal. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.
-  Y. Gal and Z. Ghahramani. Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference. ArXiv e-prints, 2015.
-  Y. Gal, R. Islam, and Z. Ghahramani. Deep bayesian active learning with image data. ArXiv e-prints, 2017.
-  Y. Gal and L. Smith. Sufficient Conditions for Idealised Models to Have No Adversarial Examples: a Theoretical and Empirical Study with Bayesian Neural Networks. ArXiv e-prints, 2018.
-  Y. Geifman, G. Uziel, and R. El-Yaniv. Boosting Uncertainty Estimation for Deep Neural Classifiers. ArXiv e-prints, 2018.
-  A. Graves. Practical variational inference for neural networks. In NIPS. 2011.
-  C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On Calibration of Modern Neural Networks. ArXiv e-prints, 2017.
-  L. K. Hansen and P. Salamon. Neural network ensembles. IEEE Trans. Pattern Anal. Mach. Intell., 12(10):993–1001, Oct. 1990.
-  K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun. Identity Mappings in Deep Residual Networks. ArXiv e-prints, 2016.
-  G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. ArXiv e-prints, 2012.
-  A. J. Joshi, F. Porikli, and N. Papanikolopoulos. Multi-class active learning for image classification. In CVPR, 2009.
-  A. Kendall, V. Badrinarayanan, and R. Cipolla. Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding. ArXiv e-prints, 2015.
-  A. Kendall and Y. Gal. What uncertainties do we need in bayesian deep learning for computer vision? In NIPS, 2017.
-  A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
-  B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS, 2017.
-  D. D. Lewis and W. A. Gale. A sequential algorithm for training text classifiers. In SIGIR, 1994.
-  Y. Liu and X. Yao. Ensemble learning via negative correlation. Neural Networks, 12(10):1399–1404, 1999.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
-  D. J. C. MacKay. Information-based objective functions for active data selection. Neural Computation, 1992.
-  A. McCallum and K. Nigam. Employing em and pool-based active learning for text classification. In ICML, 1998.
-  R. M. Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.
-  I. Osband. Risk versus uncertainty in deep learning: Bayes, bootstrap and the dangers of dropout. In NIPS Workshops, 2016.
-  K. Pang, M. Dong, Y. Wu, and T. Hospedales. Meta-Learning Transferable Active Learning Policies by Deep Reinforcement Learning. ArXiv e-prints, 2018.
-  C. P. Robert and G. Casella. Monte Carlo Statistical Methods (Springer Texts in Statistics). Springer-Verlag, Berlin, Heidelberg, 2005.
-  G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In ICML, 2000.
-  O. Sener and S. Savarese. Active learning for convolutional neural networks: A core-set approach. In ICLR, 2018.
-  B. Settles. Active learning literature survey. Technical report, 2010.
-  P. Shyam, W. Jaśkowski, and F. Gomez. Model-Based Active Exploration. ArXiv e-prints, 2018.
-  A. Siddhant and Z. C. Lipton. Deep Bayesian Active Learning for Natural Language Processing: Results of a Large-Scale Empirical Study. ArXiv e-prints, 2018.
-  L. Smith and Y. Gal. Understanding Measures of Uncertainty for Adversarial Example Detection. In UAI, 2018.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. In ICML, 2014.
-  C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In ICLR, 2014.
-  M. Teye, H. Azizpour, and K. Smith. Bayesian Uncertainty Estimation for Batch Normalized Deep Networks. ArXiv e-prints, 2018.
-  S. Tong and D. Koller. Support vector machine active learning with applications to text classification. J. Mach. Learn. Res., 2002.
-  K. Wang, D. Zhang, Y. Li, R. Zhang, and L. Lin. Cost-effective active learning for deep image classification. IEEE Trans. Cir. and Sys. for Video Technol., 2017.
-  M. Woodward and C. Finn. Active One-shot Learning. ArXiv e-prints, 2017.
-  J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In NIPS, 2014.
-  F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
-  F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell. BDD100K: A Diverse Driving Video Database with Scalable Annotation Tooling. ArXiv e-prints, 2018.
-  Y. Zhang, M. Lease, and B. C. Wallace. Active discriminative text representation learning. In AAAI, 2017.