1 Introduction
Collecting the right data for supervised deep learning is an important and challenging task that can improve performance on most modern computer vision problems. Active learning aims to select, from a large unlabeled dataset, the smallest possible training set to solve a specific task [7]. To this end, the uncertainty of a model is used as a means of quantifying what the model does not know, in order to select data to be annotated. In deterministic neural networks, uncertainty is typically measured from the output of the last softmax-normalized layer. However, these estimates do not always provide reliable information for selecting appropriate training data, as they tend to be overconfident or incorrect [49, 22]. Bayesian methods, on the other hand, provide a principled approach to estimating the uncertainty of a model. Though they have recently gained momentum, they have not found widespread use in practice [29, 47, 19]. The formulation of a Bayesian Neural Network (BNN) involves placing a prior distribution over all the parameters of the network and obtaining the posterior given the observed data [38]. The distribution of predictions provided by a trained BNN captures the model's uncertainty. However, training a BNN involves marginalizing over all possible assignments of weights, which is intractable for deep BNNs without approximations [21, 4, 17]. Existing approximation algorithms are limited in their applicability: they do not specifically address the fact that deep BNNs on large datasets are more difficult to optimize than deterministic networks, and they require extensive parameter tuning to provide good performance and uncertainty estimates [39]. Furthermore, estimating uncertainty in BNNs requires drawing a large number of samples at test time, which can be extremely computationally demanding.
In practice, a common approach to estimating uncertainty is based on ensembles [32, 2, 20]. Different models in an ensemble of networks are treated as if they were samples drawn directly from a BNN posterior. Ensembles are easy to optimize and fast to execute. However, they do not approximate uncertainty in the same manner as a BNN. For example, the parameters in the first kernel of the first layer of a convolutional neural network may serve a completely different purpose in different members of an ensemble. Therefore, the variance of the values of these parameters after training cannot be compared to the variance that would have been obtained in the first kernel of the first layer of a trained BNN with the same architecture.
In this paper, we propose Deep Probabilistic Ensembles (DPEs), a novel approach to approximate BNNs using ensembles. Specifically, we use variational inference [3], a popular technique for training BNNs, to derive a KL divergence regularization term for ensembles. We show the potential of our approach for active learning on large-scale visual classification benchmarks with up to millions of samples. When applied to CIFAR-10, CIFAR-100 and ImageNet, DPEs consistently outperform strong baselines and previously published results. To further show the applicability of our method, we propose a framework to actively annotate data for semantic segmentation, which on BDD100k yields IoU improvements of up to 20% while handling classes that are underrepresented in the training distribution.
Our contributions are therefore: (i) we propose KL regularization for training DPEs, which combine the advantages of ensembles and BNNs; (ii) we apply DPEs to active image classification on large-scale datasets; and (iii) we propose a framework for active semantic segmentation using DPEs. Our experiments on both visual tasks demonstrate the benefits and scalability of our method compared to existing methods. DPEs are parallelizable, easy to implement, yield high performance, and provide good uncertainty estimates when used for active learning.
2 Related Work
Regularizing Ensembles. Using ensembles of neural networks to improve performance has a long history [23, 13]. Attempts to regularize ensembles have typically focused on promoting diversity among the ensemble members, through penalties such as negative correlation [34]. While these methods are presented from the perspective of improving accuracy, our approach improves the ensemble's uncertainty estimates with little to no impact on accuracy.
Training BNNs. There are four popular approximations for training BNNs. The first one is Markov Chain Monte Carlo (MCMC). MCMC approximates the posterior through Bayes' rule, by sampling to compute an integral [41]. This method is accurate, but it is extremely sample-inefficient and has not been applied to deep BNNs. The second one is Variational Inference, an iterative optimization approach that updates a chosen variational distribution on each network parameter using the observed data [3]. Training is fast for reasonable network sizes, but performance is not always ideal, especially for deeper networks, due to the nature of the optimization objective.
The third approximation is Monte Carlo Dropout (MC Dropout) [18] (and its variants [50, 1]). These methods regularize the network during training with a technique like dropout [48], and draw Monte Carlo samples at test time with different dropout masks as approximate samples from a BNN posterior. This can be seen as an approximation to variational inference when using Bernoulli priors [17]. Due to its simplicity, several approaches have been proposed using MC dropout for uncertainty estimation [29, 30, 16, 19]. However, empirically, these uncertainty estimates do not perform as well as those obtained from ensembles [32, 2]. In addition, MC dropout requires the dropout rate to be very well-tuned to produce reliable uncertainty estimates [39]. The last approximation is Bayes by Backprop [4], which involves maintaining two parameters for each weight, a mean and a standard deviation, with their gradients calculated by backpropagation. Empirically, the uncertainty estimates obtained with this approach and MC Dropout perform similarly for active learning [46].

(Deep) Active Learning. A comprehensive review of classical approaches to active learning can be found in [44]. Most of these approaches rely on a confusion metric based on the model's predictions, like information-theoretical methods that maximize expected information gain [36], committee-based methods that maximize disagreement between committee members [10, 15, 37, 9], entropy maximization methods [33], and margin-based methods [42, 6, 51, 28]. Data samples for which the current model is uncertain, or most confused, are queried for labeling, usually one at a time, and, in general, the model is retrained for each of these queries. In the era of deep neural networks, querying single samples and retraining the model for each one is not feasible: adding a single sample into the training process is likely to have no statistically significant impact on the performance of a deep neural network, and the retraining becomes computationally intractable.
Scalable alternatives to classical active learning ideas have been explored for the iterative batch query setting. Pool-based approaches filter unlabeled data using heuristics based on a confusion metric, and then choose what to label based on diversity within this pool [11, 52, 57]. More recent approaches explore active learning without breaking the task down using confusion metrics, either through a lower bound on the prediction error known as the core-set loss [43], or by learning the interestingness of samples in a data-driven manner through reinforcement learning [53, 14, 40]. Results in these directions are promising. However, state-of-the-art active learning algorithms are based on uncertainty estimates from network ensembles [2]. Our experiments demonstrate the gains obtained over ensembles when they are trained with the proposed regularization scheme.

3 Deep Probabilistic Ensembles
For inference in a BNN, we can consider the weights w to be latent variables drawn from our prior distribution, p(w). These weights relate to the observed dataset x through the likelihood, p(x|w). We aim to compute the posterior,

p(w|x) = \frac{p(x|w)\, p(w)}{\int p(x|w')\, p(w')\, dw'}    (1)

It is intractable to analytically integrate over all assignments of weights, so variational inference restricts the problem to a family of distributions Q over the latent variables. We then try to optimize for the member of this family that is closest to the true posterior in terms of KL divergence,

q^*(w) = \arg\min_{q(w) \in Q} \mathrm{KL}\big(q(w)\,\|\,p(w|x)\big)    (2)

       = \arg\min_{q(w) \in Q} \mathbb{E}_{q}\big[\log q(w)\big] - \mathbb{E}_{q}\big[\log p(w|x)\big]    (3)

To make the computation tractable, the objective is modified by subtracting \log p(x), a constant independent of our weights w. This new objective to be minimized is the negative of the Evidence Lower Bound (ELBO) [3],

\mathcal{L} = \mathbb{E}_{q}\big[\log q(w)\big] - \mathbb{E}_{q}\big[\log p(w|x)\big] - \log p(x)    (4)

Substituting for the posterior term from Eq. 1, this can be simplified to

\mathcal{L} = \mathbb{E}_{q}\big[\log q(w)\big] - \mathbb{E}_{q}\big[\log p(x|w) + \log p(w) - \log p(x)\big] - \log p(x)    (5)

   = \mathbb{E}_{q}\big[\log q(w)\big] - \mathbb{E}_{q}\big[\log p(w)\big] - \mathbb{E}_{q}\big[\log p(x|w)\big]    (6)

   = \mathrm{KL}\big(q(w)\,\|\,p(w)\big) - \mathbb{E}_{q}\big[\log p(x|w)\big]    (7)
where the first term is the KL divergence between q(w) and the chosen prior distribution p(w), and the second term is the expected negative log likelihood (NLL) of the data under the current parameters w. The optimization difficulty in variational inference arises partly from this expectation of the NLL. In deterministic networks, fragile co-adaptations exist between different parameters, which can be crucial to their performance [54]. Features typically interact with each other in a complex way, such that the optimal setting for certain parameters is highly dependent on specific configurations of the other parameters in the network. Co-adaptation makes training easier, but can reduce the ability of deterministic networks to generalize [27]. Popular regularization techniques to improve generalization, such as dropout, can be seen as a trade-off between co-adaptation and training difficulty [48]. An ensemble of deterministic networks exploits co-adaptation, as each network in the ensemble is optimized independently, making ensembles easy to optimize.
For BNNs, the nature of the objective, an expectation over all possible assignments of w, prevents the BNN from exploiting co-adaptations during training, since we seek to minimize the NLL for any deterministic network sampled from the BNN. While in theory this should lead to strong generalization, in practice it is very difficult to tune BNNs to produce competitive results. In this paper, we propose a form of regularization that leverages the optimization simplicity of ensembles for training BNNs. The main properties of our approach compared to ensembles and BNNs are summarized in Table 1.
Table 1: Properties of ensembles, DPEs and BNNs.

Model    | Co-adaptation | Training | Generalization
---------|---------------|----------|---------------
Ensemble | High          | Easy     | Low
DPE      | Medium        | Easy     | High
BNN      | Low           | Hard     | High
KL regularization. The standard approach to training neural networks involves regularizing each individual parameter with \ell_2 or \ell_1 penalty terms. We instead apply the KL divergence term in Eq. 7 as a regularization penalty \Omega, applied to the set of values that a given parameter takes over all members of the ensemble. If we choose the family of Gaussian functions for q and p, this term can be analytically computed by assuming mutual independence between the network parameters and factoring the term into individual Gaussians. The KL divergence between two Gaussians with means \mu_1 and \mu_2 and standard deviations \sigma_1 and \sigma_2 is given by

\mathrm{KL}\big(\mathcal{N}(\mu_1, \sigma_1^2)\,\|\,\mathcal{N}(\mu_2, \sigma_2^2)\big) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}    (8)
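As a sanity check on this closed form, the short script below (an illustration, not part of our training code; all names are hypothetical) compares the analytic Gaussian KL divergence against a Monte Carlo estimate of E_q[log q(w) − log p(w)]:

```python
import math
import random

def gaussian_kl(mu1, sigma1, mu2, sigma2):
    """Closed-form KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ) for univariate Gaussians."""
    return math.log(sigma2 / sigma1) + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5

def log_pdf(w, mu, sigma):
    """Log density of N(mu, sigma^2) at w."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (w - mu)**2 / (2 * sigma**2)

# The divergence of a distribution with itself is zero.
assert abs(gaussian_kl(0.3, 1.2, 0.3, 1.2)) < 1e-12

# Cross-check against a Monte Carlo estimate of E_q[log q(w) - log p(w)].
random.seed(0)
mu_q, s_q, mu_p, s_p = 0.5, 0.8, 0.0, 1.0
samples = [random.gauss(mu_q, s_q) for _ in range(200_000)]
mc = sum(log_pdf(w, mu_q, s_q) - log_pdf(w, mu_p, s_p) for w in samples) / len(samples)
assert abs(mc - gaussian_kl(mu_q, s_q, mu_p, s_p)) < 1e-2
```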
Choice of prior. In our experiments, we choose the network initialization technique proposed by He et al. [24] as a prior. For batch normalization parameters, we fix the prior standard deviation \sigma_p to a small constant, and set \mu_p = 1 for the weights and \mu_p = 0 for the biases. For the weights in convolutional layers with the ReLU activation (with n_i input channels, n_o output channels, and kernel dimensions k_h \times k_w), we set \mu_p = 0 and \sigma_p^2 = 2 / (n_i k_h k_w). Linear layers can be considered a special case of convolutions, where the kernel has the same dimensions as the input activation. Substituting the prior statistics (\mu_p, \sigma_p) and the ensemble statistics into Eq. 8 and removing constant terms, the KL regularization penalty of a layer is

\Omega_{\text{layer}} = \sum_{i} \left[ \log\frac{\sigma_i}{\sigma_p} + \frac{\sigma_p^2}{2\sigma_i^2} + \frac{\mu_i^2}{2\sigma_i^2} \right]    (9)
where \mu_i and \sigma_i are the mean and standard deviation of the set of values taken by parameter i across the different ensemble members. In Eq. 9, the first term prevents extremely large variances compared to the prior, so the ensemble members do not diverge completely from each other. The second term heavily penalizes variances smaller than the prior, promoting diversity between members. The third term closely resembles weight decay, keeping the mean of the weights close to that of the prior, especially when their variance is also low.
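The behavior of these terms can be illustrated with a small sketch (hypothetical helper `kl_penalty`, toy values; the prior variance follows the He et al. fan-in form): members that collapse onto nearly identical values incur a far larger penalty than members spread roughly like the prior.

```python
import math

def kl_penalty(member_values, sigma_p):
    """Penalty of Eq. 9 for a single parameter: member_values holds that
    parameter's value in each ensemble member, sigma_p is the prior std."""
    e = len(member_values)
    mu = sum(member_values) / e
    var = sum((v - mu) ** 2 for v in member_values) / e
    sigma = math.sqrt(var) + 1e-8          # guard against zero spread
    return (math.log(sigma / sigma_p)      # discourages very large variances
            + sigma_p**2 / (2 * sigma**2)  # blows up when members collapse together
            + mu**2 / (2 * sigma**2))      # weight-decay-like pull on the mean

# Prior std for a 3x3 convolution with 64 input channels: sqrt(2 / (n_i * k_h * k_w)).
sigma_p = math.sqrt(2.0 / (64 * 3 * 3))

diverse = [0.05, -0.05, 0.02, -0.02, 0.04, -0.04, 0.01, -0.01]  # spread like the prior
collapsed = [0.01 + 1e-5 * i for i in range(8)]                 # nearly identical members
assert kl_penalty(collapsed, sigma_p) > kl_penalty(diverse, sigma_p)
```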
Objective function. We can now rewrite the minimization objective from Eq. 7 for an ensemble as:

\mathcal{L}(\theta) = \beta\, \Omega(\theta) + \sum_{i=1}^{E} \sum_{(x, y) \in D} H\big(f_{\theta_i}(x),\, y\big)    (10)
where D is the training data, E is the number of models in the ensemble, and \theta_i refers to the parameters of the i-th model. H is the cross-entropy loss for classification and \Omega is our KL regularization penalty over all the parameters \theta of the ensemble. We obtain \Omega by aggregating the penalty from Eq. 9 over all the layers in a network, and use a scaling term \beta to balance the regularizer with the likelihood loss. By summing the loss over each independent model, we approximate the expectation of the ELBO's NLL term in Eq. 7 over only the current ensemble configuration, a subset of all possible assignments of w. This is the main distinction between our approach and traditional variational inference.
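A minimal sketch of how Eq. 10 aggregates the per-member losses (toy softmax outputs and a stand-in value for the penalty; `ensemble_objective` is illustrative, not our implementation):

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the correct class under a softmax output."""
    return -math.log(probs[label])

def ensemble_objective(member_probs, labels, omega, beta):
    """Eq. 10: beta * Omega plus the cross-entropy of every member on every sample."""
    nll = sum(cross_entropy(p, y)
              for member in member_probs            # one prediction list per member
              for p, y in zip(member, labels))
    return beta * omega + nll

member_probs = [
    [[0.9, 0.1], [0.2, 0.8]],   # member 1: softmax outputs on two samples
    [[0.7, 0.3], [0.4, 0.6]],   # member 2
]
labels = [0, 1]
loss = ensemble_objective(member_probs, labels, omega=2.0, beta=0.1)
expected = 0.1 * 2.0 - sum(math.log(v) for v in (0.9, 0.8, 0.7, 0.6))
assert abs(loss - expected) < 1e-12
```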
4 Uncertainty Estimation for Active Learning
In this section, we describe how DPEs can be used for active learning. There are four components in our pipeline:

A large unlabeled set U consisting of N samples, U = \{x_1, \ldots, x_N\}, where each x_j \in \mathcal{X} is an input data point and \mathcal{X} is the d-dimensional input feature space.

A labeled set L consisting of n labeled pairs (x_j, y_j), where each x_j is a data point and each y_j is its corresponding label. For a c-way classification problem, the label space is \{1, \ldots, c\}.

An annotator A that can be queried to map data points to labels, y = A(x).

Our DPE, a set of E models f_{\theta_1}, \ldots, f_{\theta_E}, each trained to estimate the label of a given data point.

These four components interact with each other in an iterative process. In our setup, commonly referred to as batch-mode active learning, every iteration involves two stages, which are summarized in Algorithm 1. In the first stage, the current labeled set L is used to train the ensemble to convergence. The trained ensemble is applied to the existing unlabeled pool U in the second stage, selecting a subset of g samples from this pool to send to the annotator for labeling. The growth parameter g may be fixed or vary over iterations. The selected subset is moved from U to L before the next iteration. Training proceeds in this manner until an annotation budget of B total label queries has been met.
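The two-stage loop can be sketched as follows, with `train_fn`, `score_fn` and `annotate_fn` as hypothetical stand-ins for the trained DPE, the acquisition function, and the annotator:

```python
def active_learning_loop(unlabeled, growth, budget, train_fn, score_fn, annotate_fn):
    """Batch-mode active learning: train on L, score U, move the top-g samples to L."""
    labeled, unlabeled = {}, list(unlabeled)
    while len(labeled) < budget and unlabeled:
        model = train_fn(labeled)                                  # stage 1: retrain
        ranked = sorted(unlabeled, key=lambda x: score_fn(model, x), reverse=True)
        for x in ranked[:growth]:                                  # stage 2: top-g queries
            labeled[x] = annotate_fn(x)
            unlabeled.remove(x)
    return labeled

# Toy run: "uncertainty" is just the sample's value, the annotator labels by parity.
out = active_learning_loop(range(20), growth=5, budget=10,
                           train_fn=lambda labeled: None,
                           score_fn=lambda model, x: x,
                           annotate_fn=lambda x: x % 2)
assert len(out) == 10
assert set(out) == set(range(10, 20))   # the ten highest-scoring samples were queried
```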
Central to the second stage is an acquisition function that is used for ranking all the available unlabeled data. For this, we employ the ensemble predictive entropy, given by

H = -\sum_{c=1}^{C} \left( \frac{1}{E} \sum_{i=1}^{E} \sigma(\mathbf{p}_i)_c \right) \log \left( \frac{1}{E} \sum_{i=1}^{E} \sigma(\mathbf{p}_i)_c \right)    (11)

where \mathbf{p}_i refers to the unnormalized prediction vector of model i and \sigma is the softmax function. Since the entropy for each query sample is independent of the others, the entire batch of g samples can be selected and added to L in parallel. This allows for increased efficiency, but it does not account for the fact that, within the batch, adding certain samples has an impact on the uncertainty associated with the other samples.

In our experiments, to circumvent the need for annotators in the experimental loop, we use datasets for which all samples are annotated in advance, but hide the annotations from the model. The dataset is initially treated as our unlabeled pool U, and labels are revealed to L as the experiment proceeds.
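For concreteness, a minimal implementation of the ensemble predictive entropy of Eq. 11 from raw member logits (illustrative code, assuming plain Python lists):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [v / total for v in exps]

def predictive_entropy(member_logits):
    """Eq. 11: entropy of the mean softmax prediction over ensemble members."""
    probs = [softmax(z) for z in member_logits]
    n_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    return -sum(pc * math.log(pc) for pc in mean if pc > 0)

# Members that agree confidently yield low entropy; disagreeing members yield high entropy.
agree = [[5.0, 0.0, 0.0]] * 4
disagree = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0], [0.0, 5.0, 0.0]]
assert predictive_entropy(agree) < predictive_entropy(disagree)
```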
5 Active Image Classification
In this section, we evaluate the effectiveness of our approach on active learning for image classification across various levels of task complexity with respect to the number of classes, image resolution and quantity of data.
Datasets. We experiment with three widely used benchmarks for image classification: the CIFAR-10 and CIFAR-100 datasets [31], as well as the ImageNet 2012 classification dataset [12]. The CIFAR datasets involve object classification tasks over natural images: CIFAR-10 is coarse-grained over 10 classes, and CIFAR-100 is fine-grained over 100 classes. For both tasks, there are 50k training images and 10k validation images of resolution 32×32. ImageNet consists of 1000 object classes, with annotation available for 1.28 million training images and 50k validation images of varying sizes and aspect ratios. All three datasets are balanced in terms of the number of training samples per class.
Network Architecture. We use ResNet [25] backbones to implement the individual members of the DPE. For ImageNet, we use a ResNet-18 with the standard kernel sizes and counts. We use a variant of the same architecture for CIFAR, with pre-activation blocks and strides of 1 instead of 2 in the initial convolution and first residual block [26].

Implementation Details. We retrain the ensemble from scratch for every iteration of active learning, to avoid remaining in a potentially suboptimal local minimum from the earlier model trained on a smaller quantity of data. Optimization is done using Stochastic Gradient Descent with a learning rate of 0.1, a batch size of 32 and momentum of 0.9. We apply mean-std preprocessing, and augment the labeled dataset online with random crops and horizontal flips.
We fix the scaling term \beta of the KL regularization penalty across experiments. We use a patience parameter (set to 25) to count the number of epochs with no improvement in validation accuracy, after which the learning rate is dropped by a factor of 0.1. We end training when dropping the learning rate also gives no improvement in validation accuracy after the same number of epochs. In each iteration, if the early stopping criterion is not met, we train for a maximum of 400 epochs.
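The stopping rule can be illustrated as follows (a simplified sketch with a toy accuracy stream and a small patience value; the real training loop operates on actual validation accuracies):

```python
def run_schedule(val_accs, patience=25, base_lr=0.1, factor=0.1):
    """Drop the LR after `patience` epochs without validation improvement;
    stop once a drop also fails to help. Returns (stop_epoch, final_lr)."""
    lr, best, wait, dropped = base_lr, float("-inf"), 0, False
    for epoch, acc in enumerate(val_accs):
        if acc > best:
            best, wait = acc, 0
        else:
            wait += 1
            if wait > patience:
                if dropped:
                    return epoch, lr               # second plateau: end training
                lr, wait, dropped = lr * factor, 0, True
    return len(val_accs), lr                       # ran to the epoch limit

# With patience=2: one improvement, a plateau (LR drop), then a second plateau (stop).
stop_epoch, final_lr = run_schedule([0.5, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6],
                                    patience=2)
assert stop_epoch == 7
assert abs(final_lr - 0.01) < 1e-9
```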
In our experiments, we use ensembles of 8 members. Empirically, we found that, when using 3 or fewer members, the regularization penalty (specifically, the third term in Eq. 9, involving the mean square over the variance of the parameters across the DPE) causes instability and divergence in the loss. In addition, we obtained no benefits when using a larger number of members (e.g., 16 or more).
Table 2: Accuracy (%) on CIFAR-10 with 10k labeled samples, with the full 50k upper bound and their ratio.

Method                | 10k (20%) | 50k (100%) | Ratio (%)
----------------------|-----------|------------|----------
Core-set [43]         | 74.0      | 90.0       | 82.2
Ensemble [2]          | 85.0      | 95.5       | 89.0
Single + Random       | 85.2      | 94.4       | 90.3
DPE + Random          | 87.9      | 95.2       | 92.3
Single + Linear8      | 87.5      | 94.4       | 92.7
Ours (DPE + Linear8)  | 92.0      | 95.2       | 96.3
Growth Parameter. In our first experiment, we focus on analyzing variations of the growth parameter g. This parameter determines how many times the model must be retrained, and therefore has large implications on the time needed to reach the final annotation budget and complete the active learning experiment. For this experiment, we consider the CIFAR-10 dataset and an overall budget of 16k samples. As a baseline, we use random sampling as an acquisition function with g = 2k, training the DPE 8 times. For reference, we also compute the upper bound achieved by training the DPE with all 50k samples in the dataset. We consider three variations of the growth parameter with the H acquisition function: (i) Linear4: we set g = 4k, training the DPE 4 times; (ii) Exponential4: we initially set g = 2k, and then iteratively double its value (adding 2k, 4k and 8k samples over the 3 remaining active learning iterations); and (iii) Linear8: we set g = 2k, training the DPE 8 times. Fig. 1 shows the error bars over three trials. Linear8, which is computationally the most demanding, only marginally outperforms the other approaches. It is able to achieve 99.2% of the accuracy of the upper bound with just 32% of the labeling effort. This is probably due to the combination of two factors: (i) less correlation among the samples added (due to the lower growth parameter), and (ii) better uncertainty estimates in the intermediate stages due to more re-initializations. When the model is only retrained 4 times, both the Linear4 and Exponential4 data addition methods achieve similar performance levels. From a computational point of view, the exponential approach is more efficient, as the dataset size for the first three training iterations is smaller (2k, 4k and 8k, compared to 4k, 8k and 12k).
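The bookkeeping behind these schedules is simple; the sketch below reproduces the cumulative labeled-set sizes quoted above (hypothetical helper `cumulative_sizes`):

```python
def cumulative_sizes(increments):
    """Labeled-set size after each active learning iteration."""
    sizes, total = [], 0
    for g in increments:
        total += g
        sizes.append(total)
    return sizes

linear8 = cumulative_sizes([2000] * 8)
linear4 = cumulative_sizes([4000] * 4)
exponential4 = cumulative_sizes([2000, 2000, 4000, 8000])

# All three schedules reach the same 16k budget, but the exponential one trains on
# smaller labeled sets in its first three iterations (2k, 4k, 8k vs. 4k, 8k, 12k).
assert linear8[-1] == linear4[-1] == exponential4[-1] == 16000
assert exponential4[:3] == [2000, 4000, 8000]
assert linear4[:3] == [4000, 8000, 12000]
```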
Table 3: Acquisition functions, computed from the softmax predictions \sigma(\mathbf{p}_i) of the E ensemble members.

Function                       | Formula
-------------------------------|-----------------------------------------------------------------
Entropy (H)                    | H = -\sum_c \bar{p}_c \log \bar{p}_c
Categories first entropy (H_c) | H_c = -\sum_{i=1}^{E} \sum_c \sigma(\mathbf{p}_i)_c \log \sigma(\mathbf{p}_i)_c
Mutual information (I)         | I = H - \frac{1}{E} H_c
Variance (V)                   | V = \frac{1}{E} \sum_{i=1}^{E} \lVert \sigma(\mathbf{p}_i) - \bar{p} \rVert^2
Variation ratios (VR)          | VR = 1 - \frac{1}{E} \max_c \sum_{i=1}^{E} [\arg\max_k \sigma(\mathbf{p}_i)_k = c]

where \bar{p} = \frac{1}{E} \sum_{i=1}^{E} \sigma(\mathbf{p}_i) is the mean ensemble prediction and [\cdot] is the indicator function.
Table 4: Validation accuracy (%) on CIFAR-10 (C10) and CIFAR-100 (C100) for different acquisition functions.

Task | Acquisition | 4k (8%) | 8k (16%) | 16k (32%)
-----|-------------|---------|----------|----------
C10  | Random      | 80.60   | 86.80    | 91.08
C10  | H_c         | 82.96   | 89.92    | 94.10
C10  | I           | 82.70   | 90.00    | 93.97
C10  | V           | 82.19   | 90.05    | 93.92
C10  | VR          | 82.76   | 90.29    | 94.06
C10  | Ours (H)    | 82.58   | 89.96    | 93.92
C100 | Random      | 39.57   | 54.92    | 66.65
C100 | H_c         | 38.47   | 55.63    | 69.16
C100 | I           | 39.96   | 55.50    | 69.01
C100 | V           | 41.13   | 56.62    | 69.17
C100 | VR          | 41.08   | 56.97    | 69.45
C100 | Ours (H)    | 41.05   | 56.91    | 69.79
Comparison to Existing Methods. In a second experiment, we compare our approach to state-of-the-art deep active learning methods for classification. Specifically, we consider core-set selection [43] and ensembles [2]. The former uses a single VGG-16; the latter, an ensemble of DenseNet-121 models. In addition, we consider the 'Random' and 'Linear8' experiments from Fig. 1 with a single model instead of the DPE. We use the same architecture (ResNet-18) for both the single model and each member of the DPE. For the single-model baseline, we train using \ell_2 regularization instead of KL regularization. Table 2 summarizes our results and those reported in the corresponding papers using 10k samples on CIFAR-10. We also include the upper bound of each experiment, that is, the accuracy obtained when the model is trained using all the available data. As shown, our approach outperforms all the others, not only achieving higher accuracy, but also reducing the gap to the corresponding upper bound. Note that, even with random acquisition, the ratio to the upper bound is better for DPEs than for single deterministic models. This is an indicator of the performance gains achieved through ensembling. Interestingly, with DPEs we observe an even larger improvement for the Linear8 active learning approach over the deterministic model. These results suggest DPEs provide more precise uncertainty estimates, which leads to better data selection strategies.
Acquisition Function. In a third experiment, we analyze the relevance of the acquisition function, a key component of the active learning pipeline. To this end, we compare H (as defined in Eq. 11) with four other acquisition functions: categories first entropy (H_c), mutual information (I), variance (V) and variation ratios (VR). The formulation of these functions and of the one used in our approach is listed in Table 3.
Categories first entropy is the sum of the entropies of the individual ensemble members. Mutual information, also known as Jensen-Shannon divergence or information radius, is the difference between H and the average member entropy [45]. This function is appropriate for active learning, as it has been shown to distinguish epistemic uncertainty (i.e., the confusion over the model's parameters that apply to a given observation) from aleatoric uncertainty (i.e., the inherent stochasticity associated with that observation) [30]. Aleatoric uncertainty cannot be reduced even upon collecting an infinite amount of data, and can lead to inefficiency in selecting samples for an uncertainty-based active learning system. Variance measures the inconsistency between the members of the ensemble, which also aims to isolate epistemic uncertainty [47]. Finally, variation ratios is the fraction of non-modal predictions (those that do not agree with the mode, or majority vote) made by the ensemble, and can be thought of as a discretized version of variance.
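The sketch below illustrates mutual information and variation ratios on toy member predictions (hypothetical helpers; note how confident disagreement, the epistemic case, is separated from confident agreement):

```python
import math

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def mean_probs(probs):
    return [sum(p[c] for p in probs) / len(probs) for c in range(len(probs[0]))]

def mutual_information(probs):
    """H of the mean prediction minus the mean member entropy: high only when
    individually confident members disagree (epistemic uncertainty)."""
    return entropy(mean_probs(probs)) - sum(entropy(p) for p in probs) / len(probs)

def variation_ratio(probs):
    """Fraction of member predictions that disagree with the majority vote."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return 1.0 - max(votes.count(v) for v in set(votes)) / len(votes)

agree = [[0.98, 0.01, 0.01]] * 4                      # confident, unanimous
disagree = [[0.98, 0.01, 0.01], [0.01, 0.98, 0.01],   # confident, conflicting
            [0.01, 0.01, 0.98], [0.98, 0.01, 0.01]]
assert mutual_information(agree) < 1e-6
assert mutual_information(disagree) > 0.5
assert variation_ratio(agree) == 0.0
assert variation_ratio(disagree) == 0.5
```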
Table 5: Validation accuracy (%) on CIFAR-10 (C10) and CIFAR-100 (C100) comparing data sampling strategies.

Task | Data Sampling | 4k (8%) | 8k (16%) | 16k (32%)
-----|---------------|---------|----------|----------
C10  | Random        | 80.60   | 86.80    | 91.08
C10  | Ensemble      | 82.41   | 90.05    | 94.13
C10  | Ours (DPE)    | 82.88   | 90.15    | 94.33
C100 | Random        | 39.57   | 54.92    | 66.65
C100 | Ensemble      | 40.49   | 56.89    | 69.68
C100 | Ours (DPE)    | 40.87   | 56.94    | 70.12
Table 6: Validation accuracy (%) on ImageNet comparing data sampling strategies.

Task     | Data Sampling | 25.6k (2%) | 51.2k (4%)
---------|---------------|------------|-----------
ImageNet | Random        | 48.97      | 64.06
ImageNet | Ensemble      | 52.95      | 67.25
ImageNet | Ours (DPE)    | 52.89      | 67.28
Our results are listed in Table 4. Each result corresponds to the mean of three trials. Intuitively, functions focused on epistemic uncertainty should perform better than simpler ones. However, as shown in our experiments, in practice there is no significant difference in performance among these acquisition functions. Our choice of a simple entropy-based acquisition function (H) provides competitive results compared to the others.
Comparison to Ensembles. Finally, we compare DPEs to ensembles trained using standard \ell_2 regularization. To this end, we run experiments on CIFAR-10, CIFAR-100 and ImageNet. For CIFAR, we use an initial growth parameter of 2k samples, and compare performance at 4k, 8k and 16k samples (Exponential4). For ImageNet, we use a slightly different experimental setting, comparing performance at 2% and 4% of the dataset (25.6k and 51.2k samples), since repeated training takes far longer with more samples. A summary of these results, along with a baseline using random data sampling, is listed in Tables 5 and 6 for CIFAR and ImageNet respectively. As shown, active learning with DPEs clearly outperforms the random baselines, and consistently provides better results than ensemble-based active learning methods. Further, though the gap is small, it typically grows as the annotation budget (the size of the labeled dataset) increases, which is a particularly appealing property for large-scale labeling projects.
6 Active Semantic Segmentation
In this section, we introduce our framework and discuss our results of applying DPEs to active semantic segmentation. The goal of semantic segmentation is to assign a class to every pixel in an image. Generating high-quality ground truth at the pixel level is a time-consuming and expensive task. For instance, labeling an image was reported to take 60 minutes on average for the CamVid dataset [5], and approximately 90 minutes for the Cityscapes dataset [8]. Due to the substantial manual effort involved in producing high-quality annotations, segmentation datasets with precise and comprehensive label maps are orders of magnitude smaller than classification datasets, even though collecting unlabeled data is inexpensive. When working with a limited budget, active learning is an appealing approach for effectively selecting data to be annotated for this task.
The proposed framework is outlined in Fig. 2. The DPE consists of a convolutional backbone as an encoder, and an ensemble of decoders. Only a small subset of the parameters and activations of the network are associated with these decoder layers. As a result, memory requirements for training scale linearly in only this subset of parameters. We focus on analyzing the uncertainty of higher-level semantic concepts associated with deep features in the decoder: we train the decoders with KL regularization, while the encoder is trained using the common \ell_2 regularization.

The loss used to train semantic segmentation models is computed independently at each pixel. Therefore, training these models from partial annotations of an image is straightforward. In our experiments, as shown in Fig. 3, we split each training image into a grid of 12 cropped sections. To measure the uncertainty of a crop, we sum the acquisition function over each pixel in the crop. We then make queries for crops through active learning, rather than querying for entire images.
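Crop scoring then reduces to summing a per-pixel uncertainty map over each grid cell; a toy sketch (hypothetical `crop_scores`, illustrative grid sizes):

```python
def crop_scores(pixel_uncertainty, crop_h, crop_w):
    """Sum a per-pixel acquisition map over each cell of a regular grid of crops."""
    h, w = len(pixel_uncertainty), len(pixel_uncertainty[0])
    scores = {}
    for top in range(0, h, crop_h):
        for left in range(0, w, crop_w):
            scores[(top, left)] = sum(pixel_uncertainty[r][c]
                                      for r in range(top, top + crop_h)
                                      for c in range(left, left + crop_w))
    return scores

# Toy 4x6 uncertainty map split into 2x3 crops; only the bottom-right crop is uncertain.
umap = [[0.0] * 6 for _ in range(4)]
umap[3][5] = 9.0
scores = crop_scores(umap, crop_h=2, crop_w=3)
best = max(scores, key=scores.get)
assert best == (2, 3)       # the bottom-right crop would be queried first
assert len(scores) == 4     # a 2x2 grid of crops
```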
Table 7: Mean IoU (%) on BDD100k comparing data sampling strategies.

Data Sampling | 6.7k (8%) | 13.4k (16%) | 26.9k (32%)
--------------|-----------|-------------|------------
Random        | 45.37     | 46.92       | 48.41
Ensemble      | 45.60     | 47.62       | 49.14
Ours (DPE)    | 45.80     | 48.10       | 50.12
Table 8: Per-class IoU (%) on BDD100k. Active learning methods are compared at 26.9k crops; the upper bound is trained on all 84k crops (its mIoU is reported at 84k).

Class         | 84k Crops (Upper Bound) | Random | Ours (DPE + H) | Target Signs (DPE + H_w)
--------------|-------------------------|--------|----------------|-------------------------
road          | 92.1                    | 91.2   | 91.3           | 91.1
sidewalk      | 56.3                    | 54.7   | 55.0           | 55.2
building      | 81.6                    | 80.2   | 80.3           | 80.1
wall          | 20.7                    | 17.3   | 18.4           | 22.0
fence         | 40.3                    | 38.0   | 36.4           | 36.9
pole          | 42.5                    | 38.4   | 38.2           | 38.7
traffic light | 44.9                    | 40.8   | 40.0           | 40.5
traffic sign  | 43.6                    | 40.3   | 39.9           | 41.9
vegetation    | 84.3                    | 83.6   | 83.5           | 83.6
terrain       | 45.0                    | 43.3   | 42.6           | 41.3
sky           | 93.7                    | 92.9   | 93.0           | 93.0
person        | 56.1                    | 53.6   | 55.1           | 53.2
rider         | 25.1                    | 18.5   | 20.1           | 16.9
car           | 86.6                    | 84.8   | 85.5           | 85.3
truck         | 38.9                    | 33.3   | 34.1           | 31.8
bus           | 56.9                    | 47.2   | 52.2           | 48.0
train         | 0.0                     | 0.0    | 0.0            | 0.0
motorcycle    | 38.6                    | 16.9   | 36.8           | 34.1
bicycle       | 42.1                    | 44.9   | 50.0           | 49.8
mIoU          | 52.1                    | 48.4   | 50.1           | 49.7
Experiments. We set up the active segmentation experiments in a similar way to the Exponential4 scheme we used for classification. We use iterations at 8%, 16% and 32% of the data, retraining from scratch for every new iteration.
Dataset. We experiment with the BDD100k [56] dataset, which consists of 7000 training images and 1000 validation images of resolution 1280×720, provided with fine pixel annotation over 19 classes. There is significant class imbalance in this dataset, as shown in Fig. 4.
Network Architecture. We use a fully convolutional encoder-decoder network in our experiments [35]. For the encoder, we dilate the last 2 blocks of a ResNet-18 backbone so that the encoded features are at 1/8 of the input resolution [55]. Each decoder consists of two convolutional layers, with 64 and 19 kernels respectively.
Implementation Details. Since we have 12 crops per image (see Fig. 3), we perform active learning over an unlabeled set of 84k crops. We use a batch size of 8, a learning rate of 0.0001, a fixed scaling term \beta, and a patience of 10 epochs. Our network is initialized by pretraining on the same 19-class segmentation task with the Cityscapes dataset [8]. We resize the Cityscapes images to a lower resolution and train on random crops for 100 epochs, with a batch size of 16 and an exponentially decaying learning rate initialized at 0.001.
To maintain simplicity in our setup, we avoid tricks such as data augmentation, class-weighted sampling, class-weighted losses, or auxiliary losses at intermediate layers. We check the performance of this setup by training a fully supervised version of our model on the entire dataset, which achieves a mean IoU of 52.1%, just 1% less than the baseline result reported for the BDD100k dataset with a model of similar architecture but slightly more parameters [56].
Results. The mean segmentation IoUs over 3 trials are shown in Table 7. As with classification, we observe clear overall benefits of DPEs in the semantic segmentation setup, with nearly a 2% improvement in mean IoU over the random baseline, and 1% over ensembles, at 26.9k training crops. Again, the relative gap to competing methods is initially small, but grows steadily as the size of the labeled set increases. DPEs recover 96.2% of the performance of the upper bound with just 32% of the labeling effort.
We compare our classwise performance at 26.9k crops to the random baseline and fullysupervised upper bound in Table 8. We observe the largest margins of improvement with our method for foreground classes with fewer instances in the training data. On the motorcycle class, of which there are only 240 training instances, we see a difference in IoU of nearly 20%.
Targeting Specific Classes. We now consider a special use case in which performance on specific classes is critical. To that end, we define an acquisition function suitable for targeting specific classes, based on a class weighting vector w. This class-weighted acquisition function is defined as

H_w = -\sum_{c=1}^{C} w_c \left( \frac{1}{E} \sum_{i=1}^{E} \sigma(\mathbf{p}_i)_c \right) \log \left( \frac{1}{E} \sum_{i=1}^{E} \sigma(\mathbf{p}_i)_c \right)    (12)

The last row in Table 8 shows our results using H_w to target the traffic sign class. Specifically, we set w_c to 1 for traffic sign and the remaining weights to 0. As shown, this results in a 2% improvement in IoU over the H acquisition function for the traffic sign class, without significantly affecting the IoU improvements on other classes (even improving by 3.6% on the wall class). We note that the gap to the upper bound for traffic signs is reduced from 3.7% to 1.7%, a 54.1% relative performance gain.
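A minimal sketch of the class-weighted acquisition in Eq. 12 (hypothetical helper, toy mean predictions; the weight vector zeroes out all but the targeted class):

```python
import math

def weighted_entropy(mean_probs, w):
    """Eq. 12-style acquisition: per-class entropy terms scaled by weights w."""
    return -sum(w[c] * p * math.log(p) for c, p in enumerate(mean_probs) if p > 0)

# Two mean predictions with identical (unweighted) entropy.
a = [0.5, 0.4, 0.1]
b = [0.1, 0.4, 0.5]
w_sign = [0.0, 0.0, 1.0]   # target only class 2, a stand-in for "traffic sign"
assert abs(weighted_entropy(a, [1, 1, 1]) - weighted_entropy(b, [1, 1, 1])) < 1e-12
assert weighted_entropy(b, w_sign) > weighted_entropy(a, w_sign)
```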
7 Conclusion
In this paper, we introduced Deep Probabilistic Ensembles (DPEs) for uncertainty estimation in deep neural networks. The key idea is to train ensembles with a novel KL regularization term as a means to approximate variational inference for BNNs. Our results demonstrate that DPEs improve performance on active learning tasks over baselines and state-of-the-art active learning techniques on three different image classification datasets. Importantly, contrary to traditional Bayesian methods, our approach is simple to integrate into existing frameworks and scales to large models and datasets without compromising performance. Moreover, when applied to active semantic segmentation, DPEs yield up to 20% IoU improvement in underrepresented classes.
References
 [1] A. Atanov, A. Ashukha, D. Molchanov, K. Neklyudov, and D. Vetrov. Uncertainty Estimation via Stochastic Batch Normalization. ArXiv e-prints, 2018.
 [2] W. H. Beluch, T. Genewein, A. Nürnberger, and J. M. Köhler. The power of ensembles for active learning in image classification. In CVPR, 2018.
 [3] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. arXiv e-prints, 2016.
 [4] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. In ICML, 2015.
 [5] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. In ECCV, 2008.
 [6] C. Campbell, N. Cristianini, and A. J. Smola. Query learning with large margin classifiers. In ICML, 2000.
 [7] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 1994.
 [8] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
 [9] A. Culotta and A. McCallum. Reducing labeling effort for structured prediction tasks. In AAAI, 2005.
 [10] I. Dagan and S. P. Engelson. Committee-based sampling for training probabilistic classifiers. In ICML, 1995.
 [11] B. Demir, C. Persello, and L. Bruzzone. Batch-mode active-learning methods for the interactive classification of remote sensing images. IEEE Trans. on Geo. and Rem. Sen., 2011.
 [12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
 [13] T. G. Dietterich. Ensemble methods in machine learning. In Multiple Classifier Systems, LBCS1857, pages 1–15. Springer, 2000.
 [14] M. Fang, Y. Li, and T. Cohn. Learning how to active learn: A deep reinforcement learning approach. arXiv e-prints, 2017.
 [15] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 1997.
 [16] Y. Gal. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.
 [17] Y. Gal and Z. Ghahramani. Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv e-prints, 2015.
 [18] Y. Gal, R. Islam, and Z. Ghahramani. Deep bayesian active learning with image data. arXiv e-prints, 2017.
 [19] Y. Gal and L. Smith. Sufficient conditions for idealised models to have no adversarial examples: a theoretical and empirical study with Bayesian neural networks. arXiv e-prints, 2018.
 [20] Y. Geifman, G. Uziel, and R. El-Yaniv. Boosting uncertainty estimation for deep neural classifiers. arXiv e-prints, 2018.
 [21] A. Graves. Practical variational inference for neural networks. In NIPS, 2011.
 [22] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. arXiv e-prints, 2017.
 [23] L. K. Hansen and P. Salamon. Neural network ensembles. IEEE Trans. Pattern Anal. Mach. Intell., 12(10):993–1001, Oct. 1990.
 [24] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
 [25] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [26] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. arXiv e-prints, 2016.
 [27] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv e-prints, 2012.
 [28] A. J. Joshi, F. Porikli, and N. Papanikolopoulos. Multi-class active learning for image classification. In CVPR, 2009.
 [29] A. Kendall, V. Badrinarayanan, and R. Cipolla. Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv e-prints, 2015.
 [30] A. Kendall and Y. Gal. What uncertainties do we need in bayesian deep learning for computer vision? In NIPS, 2017.
 [31] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
 [32] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS, 2017.
 [33] D. D. Lewis and W. A. Gale. A sequential algorithm for training text classifiers. In SIGIR, 1994.
 [34] Y. Liu and X. Yao. Ensemble learning via negative correlation. Neural Networks, 12(10):1399 – 1404, 1999.
 [35] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
 [36] D. J. C. MacKay. Informationbased objective functions for active data selection. Neural Computation, 1992.
 [37] A. McCallum and K. Nigam. Employing EM and pool-based active learning for text classification. In ICML, 1998.
 [38] R. M. Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.
 [39] I. Osband. Risk versus uncertainty in deep learning: Bayes, bootstrap and the dangers of dropout. In NIPS Workshops, 2016.
 [40] K. Pang, M. Dong, Y. Wu, and T. Hospedales. Meta-learning transferable active learning policies by deep reinforcement learning. arXiv e-prints, 2018.
 [41] C. P. Robert and G. Casella. Monte Carlo Statistical Methods (Springer Texts in Statistics). Springer-Verlag, Berlin, Heidelberg, 2005.
 [42] G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In ICML, 2000.
 [43] O. Sener and S. Savarese. Active learning for convolutional neural networks: A core-set approach. In ICLR, 2018.
 [44] B. Settles. Active learning literature survey. Technical report, 2010.
 [45] P. Shyam, W. Jaśkowski, and F. Gomez. Model-based active exploration. arXiv e-prints, 2018.
 [46] A. Siddhant and Z. C. Lipton. Deep Bayesian active learning for natural language processing: Results of a large-scale empirical study. arXiv e-prints, 2018.
 [47] L. Smith and Y. Gal. Understanding measures of uncertainty for adversarial example detection. In UAI, 2018.
 [48] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. In ICML, 2014.
 [49] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In ICLR, 2014.
 [50] M. Teye, H. Azizpour, and K. Smith. Bayesian uncertainty estimation for batch normalized deep networks. arXiv e-prints, 2018.
 [51] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. J. Mach. Learn. Res., 2002.
 [52] K. Wang, D. Zhang, Y. Li, R. Zhang, and L. Lin. Cost-effective active learning for deep image classification. IEEE Trans. Cir. and Sys. for Video Technol., 2017.
 [53] M. Woodward and C. Finn. Active one-shot learning. arXiv e-prints, 2017.
 [54] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In NIPS, 2014.
 [55] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
 [56] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell. BDD100K: A diverse driving video database with scalable annotation tooling. arXiv e-prints, 2018.
 [57] Y. Zhang, M. Lease, and B. C. Wallace. Active discriminative text representation learning. In AAAI, 2017.