BatchBALD: Efficient and Diverse Batch Acquisition for Deep Bayesian Active Learning

06/19/2019 ∙ by Andreas Kirsch, et al. ∙ University of Oxford

We develop BatchBALD, a tractable approximation to the mutual information between a batch of points and model parameters, which we use as an acquisition function to select multiple informative points jointly for the task of deep Bayesian active learning. BatchBALD is a greedy linear-time $1 - 1/e$-approximate algorithm amenable to dynamic programming and efficient caching. We compare BatchBALD to the commonly used approach for batch data acquisition and find that the current approach acquires similar and redundant points, sometimes performing worse than randomly acquiring data. We finish by showing that, using BatchBALD to consider dependencies within an acquisition batch, we achieve new state-of-the-art performance on standard benchmarks, providing substantial data efficiency improvements in batch acquisition.




1 Introduction

A key problem in deep learning is data efficiency. While excellent performance can be obtained with modern tools, these are often data-hungry, rendering the deployment of deep learning in the real world challenging for many tasks. Active learning (AL) (Cohn et al., 1996) is a powerful technique for attaining data efficiency. Instead of collecting and labelling a large dataset a priori, which often comes at a significant expense, in AL we iteratively acquire labels from an expert only for the most informative data points from a pool of available unlabelled data. After each acquisition step, the newly labelled points are added to the training set, and the model is retrained. This process is repeated until a suitable level of accuracy is achieved. The goal of AL is to minimise the amount of data that needs to be labelled. AL has already made real-world impact in manufacturing (Tong, 2001), robotics (Calinon et al., 2007), recommender systems (Adomavicius and Tuzhilin, 2005), medical imaging (Hoi et al., 2006), and NLP (Siddhant and Lipton, 2018), motivating the need for pushing AL even further.

In AL, the informativeness of new points is assessed by an acquisition function. There are a number of intuitive choices, such as model uncertainty and mutual information, and, in this paper, we focus on BALD (Houlsby et al., 2011), which has proven itself in the context of deep learning (Gal et al., 2017; Shen et al., 2018; Janz et al., 2017). BALD is based on mutual information and scores points based on how well their label would inform us about the true model parameter distribution. In deep learning models (He et al., 2016; Simonyan and Zisserman, 2015), we generally treat the parameters as point estimates instead of distributions. However, Bayesian neural networks have become a powerful alternative to traditional neural networks and do provide a distribution over their parameters. Improvements in approximate inference (Blundell et al., 2015; Gal and Ghahramani, 2016) have enabled their usage for high dimensional data such as images and in conjunction with BALD (Gal et al., 2017).

In practical AL applications, instead of single data points, batches of data points are acquired during each acquisition step to reduce the number of times the model is retrained and expert-time is requested. Model retraining becomes a computational bottleneck for larger models while expert time is expensive: consider, for example, the effort that goes into commissioning a medical specialist to label a single MRI scan, then waiting until the model is retrained, and then commissioning a new medical specialist to label the next MRI scan, and the extra amount of time this takes.

In Gal et al. (2017), batch acquisition, i.e. the acquisition of multiple points, takes the top $b$ points with the highest BALD acquisition score. This naive approach leads to acquiring points that are individually very informative, but not necessarily so jointly. See figure 1 for such a batch acquisition of BALD in which it performs poorly whereas scoring points jointly ("BatchBALD") can find batches of informative data points. Figure 2 shows how a dataset consisting of repeated MNIST digits (with added Gaussian noise) leads BALD to perform worse than random acquisition while BatchBALD sustains good performance.

Figure 1: Idealised acquisitions of BALD and BatchBALD. If a dataset were to contain many (near) replicas for each data point, then BALD would select all replicas of a single informative data point at the expense of other informative data points, wasting data efficiency.
Figure 2: Performance on Repeated MNIST with acquisition size 10. See section 4.1 for further details. BatchBALD outperforms BALD while BALD performs worse than random acquisition due to the replications in the dataset.

Naively finding the best batch to acquire requires enumerating all possible subsets within the available data, which is intractable as the number of potential subsets grows exponentially with the acquisition size and the number of available points to choose from. Instead, we develop a greedy algorithm that selects a batch in linear time, and show that it is at worst a $1 - 1/e$ approximation to the optimal choice for our acquisition function. We provide an open-source implementation.

The main contributions of this work are:

  1. BatchBALD, a data-efficient active learning method that acquires sets of high-dimensional image data, leading to improved data efficiency and reduced total run time, section 3.1;

  2. a greedy algorithm to select a batch of points efficiently, section 3.2; and

  3. an estimator for the acquisition function that scales to larger acquisition sizes and to datasets with many classes, section 3.3.

2 Background

2.1 Problem Setting

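The Bayesian active learning setup consists of an unlabelled dataset $\mathcal{D}_{pool}$, the current training set $\mathcal{D}_{train}$, a Bayesian model $\mathcal{M}$ with model parameters $\omega \sim p(\omega \mid \mathcal{D}_{train})$, and output predictions $p(y \mid x, \omega, \mathcal{D}_{train})$ for data point $x$ and prediction $y \in \{1, \ldots, c\}$ in the classification case. The conditioning of $\omega$ on $\mathcal{D}_{train}$ expresses that the model has been trained with $\mathcal{D}_{train}$. Furthermore, an oracle can provide us with the correct label $\tilde{y}$ for a data point in the unlabelled pool $x \in \mathcal{D}_{pool}$. The goal is to obtain a certain level of prediction accuracy with the least amount of oracle queries. At each acquisition step, a batch of data points $\{x_1^*, \ldots, x_b^*\}$ is selected using an acquisition function $a$ which scores a candidate batch of unlabelled data points $\{x_1, \ldots, x_b\} \subseteq \mathcal{D}_{pool}$ using the current model parameters $p(\omega \mid \mathcal{D}_{train})$:

$$\{x_1^*, \ldots, x_b^*\} = \operatorname*{arg\,max}_{\{x_1, \ldots, x_b\} \subseteq \mathcal{D}_{pool}} a\big(\{x_1, \ldots, x_b\},\, p(\omega \mid \mathcal{D}_{train})\big). \tag{1}$$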


2.2 BALD

BALD (Bayesian Active Learning by Disagreement) (Houlsby et al., 2011) uses an acquisition function that estimates the mutual information between the model predictions and the model parameters. Intuitively, it captures how strongly the model predictions for a given data point and the model parameters are coupled, implying that finding out about the true label of data points with high mutual information would also inform us about the true model parameters. Originally introduced outside the context of deep learning, the only requirement on the model is that it is Bayesian. BALD is defined as:

$$\mathbb{I}[y ; \omega \mid x, \mathcal{D}_{train}] = \mathbb{H}[y \mid x, \mathcal{D}_{train}] - \mathbb{E}_{p(\omega \mid \mathcal{D}_{train})}\big[\mathbb{H}[y \mid x, \omega, \mathcal{D}_{train}]\big]. \tag{2}$$


Looking at the two terms in equation (2), for the mutual information to be high, the left term has to be high and the right term low. The left term is the entropy of the model prediction, which is high when the model’s prediction is uncertain. The right term is an expectation of the entropy of the model prediction over the posterior of the model parameters and is low when the model is overall certain for each draw of model parameters from the posterior. Both can only happen when the model has many possible ways to explain the data, which means that the posterior draws are disagreeing among themselves.
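
The two terms can be estimated directly from sampled predictions. The following is a minimal sketch, not the authors' implementation: it assumes the predictions are already available as an array `probs` of shape `(N, K, C)` (pool points, posterior samples, classes), and the names `entropy`, `bald_scores` and `probs` are illustrative.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy in nats along `axis`; eps guards against log(0)."""
    return -np.sum(p * np.log(p + eps), axis=axis)

def bald_scores(probs):
    """BALD score for each pool point.

    probs: array of shape (N, K, C) holding p(y | x_i, w_j) for N pool points,
           K posterior samples (e.g. MC dropout masks) and C classes.
    """
    mean_probs = probs.mean(axis=1)                 # predictive distribution p(y | x_i)
    predictive_entropy = entropy(mean_probs)        # left term: entropy of the prediction
    expected_entropy = entropy(probs).mean(axis=1)  # right term: mean entropy per sample
    return predictive_entropy - expected_entropy

# Toy input: two samples that confidently disagree vs. two that agree on "unsure".
probs = np.array([
    [[0.99, 0.01], [0.01, 0.99]],   # disagreement between posterior samples
    [[0.50, 0.50], [0.50, 0.50]],   # every sample is equally uncertain
])
scores = bald_scores(probs)
```

In this toy input the first point scores high because the samples disagree, while the second scores approximately zero: its prediction is uncertain, but every posterior sample is uncertain in the same way.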

BALD was originally intended for acquiring individual data points and immediately retraining the model. This becomes a bottleneck in deep learning, where retraining takes a substantial amount of time. Applications of BALD (Gal and Ghahramani, 2016; Janz et al., 2017) usually acquire the top $b$ points. This can be expressed as summing over individual scores:

$$a_{BALD}\big(\{x_1, \ldots, x_b\},\, p(\omega \mid \mathcal{D}_{train})\big) = \sum_{i=1}^{b} \mathbb{I}[y_i ; \omega \mid x_i, \mathcal{D}_{train}], \tag{3}$$

and finding the optimal batch for this acquisition function using a greedy algorithm, which reduces to picking the top $b$ highest-scoring data points.
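
Because the batch score is a plain sum, selecting the batch amounts to sorting the individual scores. A small sketch, with made-up scores and the illustrative helper name `top_b_batch`:

```python
import numpy as np

def top_b_batch(scores, b):
    """Indices of the b highest individual scores (ties broken by index order)."""
    order = np.argsort(-np.asarray(scores), kind="stable")
    return order[:b].tolist()

# Hypothetical scores: points 0, 1 and 2 are near-replicas of one informative input.
scores = [0.61, 0.60, 0.59, 0.40, 0.35]
batch = top_b_batch(scores, b=3)
```

With these made-up numbers the batch consists entirely of the three near-replicas, which is the failure mode described in the introduction.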

2.3 Bayesian Neural Networks (BNN)

In this paper we focus on BNNs as our Bayesian model because they scale well to high dimensional inputs, such as images. Compared to regular neural networks, BNNs maintain a distribution over their weights instead of point estimates. Performing exact inference in BNNs is intractable for any reasonably sized model, so we resort to using a variational approximation. Similar to Gal et al. (2017), we use MC dropout (Gal and Ghahramani, 2016), which is easy to implement, scales well to large models and datasets, and is straightforward to optimise.

3 Methods

3.1 BatchBALD

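We propose BatchBALD as an extension of BALD whereby we jointly score points by estimating the mutual information between a joint of multiple data points and the model parameters:

$$a_{BatchBALD}\big(\{x_1, \ldots, x_b\},\, p(\omega \mid \mathcal{D}_{train})\big) = \mathbb{I}[y_1, \ldots, y_b ; \omega \mid x_1, \ldots, x_b, \mathcal{D}_{train}]. \tag{4}$$

Here, the notation $\mathbb{I}[x, y ; z \mid c]$ denotes the mutual information between the joint of the random variables $x, y$ and the random variable $z$, conditioned on $c$.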

This builds on the insight that independent selection of a batch of data points leads to data inefficiency as correlations between data points in an acquisition batch are not taken into account.

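To understand how to compute the mutual information between a set of points and the model parameters, we express $x_1, \ldots, x_b$ and $y_1, \ldots, y_b$ through joint random variables $x_{1:b}$ and $y_{1:b}$ in a product probability space and use the definition of the mutual information for two random variables:

$$\mathbb{I}[y_{1:b} ; \omega \mid x_{1:b}, \mathcal{D}_{train}] = \mathbb{H}[y_{1:b} \mid x_{1:b}, \mathcal{D}_{train}] - \mathbb{E}_{p(\omega \mid \mathcal{D}_{train})}\big[\mathbb{H}[y_{1:b} \mid x_{1:b}, \omega, \mathcal{D}_{train}]\big]. \tag{5}$$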


Intuitively, the mutual information between two random variables can be seen as the intersection of their information content. In fact, Yeung (1991) shows that a signed measure $\mu^*$ can be defined for discrete random variables $x$, $y$, such that $\mathbb{I}[x ; y] = \mu^*(x \cap y)$, $\mathbb{H}[x, y] = \mu^*(x \cup y)$, $\mathbb{E}_{p(y)} \mathbb{H}[x \mid y] = \mu^*(x \setminus y)$, and so on, where we identify random variables with their counterparts in information space, and conveniently drop conditioning on $\mathcal{D}_{train}$ and $x_i$.

Using this, BALD can be viewed as the sum of individual intersections $\sum_i \mu^*(y_i \cap \omega)$, which double counts overlaps between the $y_i$. Naively extending BALD to the mutual information between $y_1, \ldots, y_b$ and $\omega$, which is equivalent to $\mu^*\big(\bigcap_i y_i \cap \omega\big)$, would lead to selecting similar data points instead of diverse ones under maximisation.

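BatchBALD, on the other hand, takes overlaps into account by computing $\mu^*\big(\bigcup_i y_i \cap \omega\big)$ and is more likely to acquire a more diverse cover under maximisation:

\begin{align}
\mathbb{I}[y_1, \ldots, y_b ; \omega \mid x_1, \ldots, x_b, \mathcal{D}_{train}] &= \mathbb{H}[y_1, \ldots, y_b \mid x_1, \ldots, x_b, \mathcal{D}_{train}] - \mathbb{E}_{p(\omega \mid \mathcal{D}_{train})}\big[\mathbb{H}[y_1, \ldots, y_b \mid x_1, \ldots, x_b, \omega, \mathcal{D}_{train}]\big] \tag{6} \\
&= \mu^*\Big(\bigcup_i y_i\Big) - \mu^*\Big(\bigcup_i y_i \setminus \omega\Big) = \mu^*\Big(\bigcup_i y_i \cap \omega\Big). \tag{7}
\end{align}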


This is depicted in figure 3 and also motivates that $a_{BatchBALD} \le a_{BALD}$, which we prove in appendix B.1. For acquisition size 1, BatchBALD and BALD are equivalent.
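
A worked special case makes the double counting concrete. It rests on two simplifying assumptions that are ours rather than the paper's: every posterior sample predicts deterministically, so the expected conditional entropy vanishes, and $x_2$ is an exact replica of $x_1$, so that $y_2 = y_1$. Conditioning on the inputs and the training set is dropped for brevity.

```latex
\begin{align*}
a_{BALD}      &= \mathbb{I}[y_1 ; \omega] + \mathbb{I}[y_2 ; \omega]
               = \mathbb{H}[y_1] + \mathbb{H}[y_2]
               = 2\,\mathbb{H}[y_1], \\
a_{BatchBALD} &= \mathbb{I}[y_1, y_2 ; \omega]
               = \mathbb{H}[y_1, y_2]
               = \mathbb{H}[y_1].
\end{align*}
```

Under these assumptions BALD credits the replica with the full score a second time, whereas the joint score assigns it no additional value, so any other point with non-zero information about $\omega$ would be preferred.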

(a) BALD
(b) BatchBALD
Figure 3: Intuition behind BALD and BatchBALD using I-diagrams (Yeung, 1991). BALD overestimates the joint mutual information. BatchBALD, however, takes the overlap between variables into account and will strive to acquire a better cover of $\omega$. Areas contributing to the respective score are shown in grey, and areas that are double-counted in dark grey.

3.2 Greedy approximation algorithm for BatchBALD

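Input: acquisition size $b$, unlabelled dataset $\mathcal{D}_{pool}$, model parameters $p(\omega \mid \mathcal{D}_{train})$
  $A_0 \leftarrow \emptyset$
  for $n \leftarrow 1$ to $b$ do
      foreach $x \in \mathcal{D}_{pool} \setminus A_{n-1}$ do $s_x \leftarrow a_{BatchBALD}\big(A_{n-1} \cup \{x\},\, p(\omega \mid \mathcal{D}_{train})\big)$
      $x_n \leftarrow \operatorname*{arg\,max}_{x \in \mathcal{D}_{pool} \setminus A_{n-1}} s_x$
      $A_n \leftarrow A_{n-1} \cup \{x_n\}$
  end for
Output: acquisition batch $A_b = \{x_1, \ldots, x_b\}$
Algorithm 1 Greedy BatchBALD $1 - 1/e$-approximate algorithm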

To avoid the combinatorial explosion that arises from jointly scoring subsets of points, we introduce a greedy approximation for computing BatchBALD, depicted in algorithm 1. In appendix A, we prove that $a_{BatchBALD}$ is submodular, which means the greedy algorithm is $1 - 1/e$-approximate (Nemhauser et al., 1978; Krause et al., 2008).
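
The greedy loop itself is independent of how the joint score is computed. The sketch below takes the score as a callable and, purely for illustration, substitutes a toy set-coverage function (also submodular) for the BatchBALD score; `greedy_batch`, `covers` and `coverage` are hypothetical names.

```python
from typing import Callable, Hashable, Iterable, List

def greedy_batch(pool: Iterable[Hashable], b: int,
                 score: Callable[[List[Hashable]], float]) -> List[Hashable]:
    """Greedily build a batch of size b by maximising `score` one point at a time."""
    batch: List[Hashable] = []
    remaining = list(pool)
    for _ in range(min(b, len(remaining))):
        # Score every remaining candidate jointly with the points already chosen.
        best = max(remaining, key=lambda x: score(batch + [x]))
        batch.append(best)
        remaining.remove(best)
    return batch

# Toy stand-in for the joint score: number of distinct items covered by the batch.
covers = {"a": {1, 2, 3}, "b": {1, 2, 3}, "c": {4, 5}, "d": {5}}

def coverage(points):
    return len(set().union(*(covers[p] for p in points)))

batch = greedy_batch(covers, b=2, score=coverage)
```

After picking "a", the duplicate "b" adds nothing to the joint score, so the loop moves on to "c". A top-2 selection on individual scores would have returned "a" and "b".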

In appendix B.2, we show that, under idealised conditions, when using BatchBALD and a fixed final $|\mathcal{D}_{train}|$, the active learning loop itself can be seen as a greedy $1 - 1/e$-approximation algorithm, and that an active learning loop with BatchBALD and acquisition size larger than 1 is bounded by an active learning loop with individual acquisitions, that is BALD/BatchBALD with acquisition size 1, which is the ideal case.

3.3 Computing $a_{BatchBALD}$

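For brevity, we leave out conditioning on $x_1, \ldots, x_n$ and $\mathcal{D}_{train}$, and $p(\omega)$ denotes $p(\omega \mid \mathcal{D}_{train})$ in this section. $a_{BatchBALD}$ is then written as:

$$a_{BatchBALD}\big(\{x_1, \ldots, x_n\},\, p(\omega)\big) = \mathbb{H}[y_1, \ldots, y_n] - \mathbb{E}_{p(\omega)}\big[\mathbb{H}[y_1, \ldots, y_n \mid \omega]\big]. \tag{8}$$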


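Because the $y_i$ are independent when conditioned on $\omega$, computing the right term of equation (8) is simplified as the conditional joint entropy decomposes into a sum. We can approximate the expectation using a Monte-Carlo estimator with $k$ samples from our model parameter distribution $\hat{\omega}_j \sim p(\omega)$:

$$\mathbb{E}_{p(\omega)}\big[\mathbb{H}[y_1, \ldots, y_n \mid \omega]\big] = \sum_{i=1}^{n} \mathbb{E}_{p(\omega)}\big[\mathbb{H}[y_i \mid \omega]\big] \approx \frac{1}{k} \sum_{i=1}^{n} \sum_{j=1}^{k} \mathbb{H}[y_i \mid \hat{\omega}_j]. \tag{9}$$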


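Computing the left term of equation (8) is difficult because the unconditioned joint probability does not factorise. Applying the equality $p(y) = \mathbb{E}_{p(\omega)}[p(y \mid \omega)]$, and, using sampled $\hat{\omega}_j$, we compute the entropy by summing over all possible configurations $\hat{y}_{1:n}$ of $y_{1:n}$:

\begin{align}
\mathbb{H}[y_{1:n}] &= \mathbb{E}_{p(y_{1:n})}\big[-\log p(y_{1:n})\big] \tag{10} \\
&= \mathbb{E}_{p(\omega)} \mathbb{E}_{p(y_{1:n} \mid \omega)}\big[-\log \mathbb{E}_{p(\omega)}[p(y_{1:n} \mid \omega)]\big] \tag{11} \\
&\approx -\sum_{\hat{y}_{1:n}} \Big(\frac{1}{k} \sum_{j=1}^{k} p(\hat{y}_{1:n} \mid \hat{\omega}_j)\Big) \log \Big(\frac{1}{k} \sum_{j=1}^{k} p(\hat{y}_{1:n} \mid \hat{\omega}_j)\Big). \tag{12}
\end{align}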
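
For small batches the sum over configurations can be written out directly. The following sketch assumes the sampled predictions for the batch are stored as an array `probs` of shape `(n, K, C)`; the function names are illustrative and no attempt is made at efficiency.

```python
import itertools
import numpy as np

def joint_entropy_exact(probs):
    """Entropy of y_1..y_n under the sample-averaged joint predictive distribution.

    probs: array of shape (n, K, C) with p(y_i | w_j) for the n batch points.
    Enumerates all C**n label configurations, so it is only viable for small n.
    """
    n, k, c = probs.shape
    total = 0.0
    for config in itertools.product(range(c), repeat=n):
        # Given a fixed sample w_j, the joint probability factorises over points.
        per_sample = np.prod(probs[np.arange(n), :, list(config)], axis=0)  # (K,)
        p = per_sample.mean()                                               # (1/k) sum_j
        if p > 0:
            total -= p * np.log(p)
    return total

def batchbald_score(probs, eps=1e-12):
    """Joint entropy minus the summed expected conditional entropies."""
    cond = -np.sum(probs * np.log(probs + eps), axis=-1)   # (n, K): entropy per point and sample
    return joint_entropy_exact(probs) - cond.mean(axis=1).sum()

# Two exact replicas whose two posterior samples deterministically disagree.
replicas = np.array([[[1.0, 0.0], [0.0, 1.0]],
                     [[1.0, 0.0], [0.0, 1.0]]])
score = batchbald_score(replicas)
```

For the two replicas the joint score is log 2, the same as for a single copy, whereas summing the two individual scores would give 2 log 2.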


3.4 Efficient implementation

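In each iteration of the algorithm, $x_1, \ldots, x_{n-1}$ stay fixed while $x_n$ varies over $\mathcal{D}_{pool} \setminus A_{n-1}$. We can reduce the required computations by factorizing $p(y_{1:n} \mid \omega)$ into $p(y_{1:n-1} \mid \omega)\, p(y_n \mid \omega)$. We store $p(\hat{y}_{1:n-1} \mid \hat{\omega}_j)$ in a matrix $\hat{P}_{1:n-1}$ of shape $c^{n-1} \times k$ and $p(y_n \mid \hat{\omega}_j)$ in a matrix $\hat{P}_n$ of shape $c \times k$. The sum in (12) can then be turned into a matrix product:

$$\frac{1}{k} \sum_{j=1}^{k} p(\hat{y}_{1:n} \mid \hat{\omega}_j) = \frac{1}{k} \sum_{j=1}^{k} p(\hat{y}_{1:n-1} \mid \hat{\omega}_j)\, p(\hat{y}_n \mid \hat{\omega}_j) = \Big(\frac{1}{k} \hat{P}_{1:n-1} \hat{P}_n^{T}\Big)_{\hat{y}_{1:n-1},\, \hat{y}_n}. \tag{13}$$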


This can be further sped up by using batch matrix multiplication to compute the joint entropy for different $x_n$. $\hat{P}_{1:n-1}$ only has to be computed once, and we can recursively compute $\hat{P}_{1:n}$ using $\hat{P}_{1:n-1}$ and $\hat{P}_n$, which allows us to sample $p(y \mid \hat{\omega}_j)$ for each $x \in \mathcal{D}_{pool}$ only once at the beginning of the algorithm.
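
The matrix form and the recursion can be sketched in a few lines of NumPy. The shapes follow the description above (one row per label configuration, one column per posterior sample); the helper names `extend_joint` and `joint_entropy_matmul` are ours, and a batched implementation over many candidates is not shown.

```python
import numpy as np

def extend_joint(p_prev, p_new):
    """Recursively extend the per-sample joint probabilities by one more point.

    p_prev: (c**(n-1), K) matrix of p(configuration of y_1..y_{n-1} | w_j).
    p_new:  (c, K) matrix of p(y_n | w_j).
    Returns a (c**n, K) matrix of p(configuration of y_1..y_n | w_j).
    """
    return (p_prev[:, None, :] * p_new[None, :, :]).reshape(-1, p_prev.shape[1])

def joint_entropy_matmul(p_prev, p_new):
    """Joint entropy of (y_1..y_{n-1}, y_n) using a single matrix product."""
    k = p_prev.shape[1]
    joint = p_prev @ p_new.T / k      # (c**(n-1), c): sample-averaged joint probabilities
    nz = joint[joint > 0]
    return float(-(nz * np.log(nz)).sum())

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=(3, 4))   # toy p(y_i | w_j): n=3 points, K=4, C=3
p1, p2, p3 = (p.T for p in probs)                # each of shape (C, K)
p12 = extend_joint(p1, p2)                       # computed once for the fixed prefix
h123 = joint_entropy_matmul(p12, p3)             # evaluated per candidate x_n
```

Only `p3` changes between candidates, so `p12` is shared across the whole inner loop of the greedy algorithm.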

For larger acquisition sizes, we use $m$ MC samples of $y_{1:n-1}$ as enumerating all possible configurations becomes infeasible. See appendix C for details.
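
As an illustration of replacing enumeration by sampling, the sketch below uses a plain Monte-Carlo estimator: it draws complete label configurations from the sample-averaged joint and averages their negative log-probability. This is a simplified stand-in chosen for clarity and is not claimed to be the estimator of appendix C; the `(n, K, C)` layout of `probs` and all names are assumptions.

```python
import numpy as np

def joint_entropy_mc(probs, m, rng):
    """Monte-Carlo estimate of the joint entropy of y_1..y_n.

    probs: (n, K, C) array of p(y_i | w_j). Draws m label configurations from
    the sample-averaged joint and averages -log p(configuration).
    """
    n, k, c = probs.shape
    j = rng.integers(k, size=m)                 # which posterior sample generates each draw
    cum = probs.cumsum(axis=-1)                 # (n, K, C) cumulative distributions
    u = rng.random((m, n))
    # Inverse-CDF sampling of y_i ~ p(y_i | w_j) for every draw and point.
    labels = (u[:, :, None] > cum[:, j, :].transpose(1, 0, 2)).sum(axis=-1)
    labels = np.minimum(labels, c - 1)          # guard against rounding in the CDF
    # p(configuration) = (1/K) sum_j prod_i p(y_i | w_j), evaluated for every draw.
    per_point = probs[np.arange(n)[None, :], :, labels]   # (m, n, K)
    p_config = per_point.prod(axis=1).mean(axis=1)        # (m,)
    return float(-np.log(p_config).mean())

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=(2, 5))  # toy predictions: n=2, K=5, C=3
estimate = joint_entropy_mc(probs, m=50_000, rng=np.random.default_rng(1))
```

The cost per evaluation grows with the number of draws `m` rather than with the number of configurations, which is what makes larger acquisition sizes tractable.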

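Monte-Carlo sampling bounds the time complexity of the full BatchBALD algorithm to $O(b\,c \cdot \min\{c^b, m\} \cdot |\mathcal{D}_{pool}| \cdot k)$ compared to $O(c^b \cdot |\mathcal{D}_{pool}|^b \cdot k)$ for naively finding the exact optimal batch and $O((b + k) \cdot |\mathcal{D}_{pool}|)$ for BALD. Here, $b$ is the acquisition size, $c$ is the number of classes, $k$ is the number of MC dropout samples, and $m$ is the number of sampled configurations of $y_{1:n-1}$.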

4 Experiments

(a) BALD
(b) BatchBALD
Figure 4: Performance on MNIST for increasing acquisition sizes. BALD’s performance drops drastically as the acquisition size increases. BatchBALD maintains strong performance even with increasing acquisition size.

In our experiments, we start by showing how a naive application of the BALD algorithm to an image dataset can lead to poor results in a dataset with many (near) duplicate data points, and show that BatchBALD solves this problem in a grounded way while obtaining favourable results (figure 2).

We then illustrate BatchBALD’s effectiveness on standard AL datasets: MNIST and EMNIST. EMNIST (Cohen et al., 2017) is an extension of MNIST that also includes letters, for a total of 47 classes, and has a training set twice as large. See appendix E for examples of the dataset. We show that BatchBALD provides a substantial performance improvement in these scenarios, too, and has more diverse acquisitions.

In our experiments, we repeatedly go through active learning loops. One active learning loop consists of training the model on the available labelled data and subsequently acquiring new data points using a chosen acquisition function. As the labelled dataset is small in the beginning, it is important to avoid overfitting. We do this by using early stopping after 3 epochs of declining accuracy on the validation set. We pick the model with the highest validation accuracy. Throughout our experiments, we use the Adam (Kingma and Ba, 2014) optimiser with learning rate 0.001 and betas 0.9/0.999. All our results report the median of 6 trials, with lower and upper quartiles. We use these quartiles to draw the filled error bars on our figures.

We reinitialize the model after each acquisition, similar to Gal et al. (2017). We found this helps the model improve even when very small batches are acquired. It also decorrelates subsequent acquisitions as final model performance is dependent on a particular initialization (Frankle and Carbin, 2019).

When computing $p(y \mid x, \omega, \mathcal{D}_{train})$, it is important to keep the dropout masks in MC dropout consistent while sampling from the model. This is necessary to capture dependencies between the inputs for BatchBALD, and it makes the scores for different points more comparable by removing this source of noise. We do not keep the masks fixed when computing BALD scores because its performance usually benefits from the added noise. We also do not need to keep these masks fixed for training and evaluating the model.
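
What "consistent masks" means can be shown on a toy network. The sketch below is a hypothetical one-hidden-layer model in NumPy, not the architecture used in the experiments: the K dropout masks are drawn once and reused for every input, so that sample index j refers to the same sub-network for all pool points.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_consistent(x, w1, w2, masks, p_drop=0.5):
    """p(y | x_i, w_j) with the same K dropout masks reused for every input.

    x: (N, D) inputs; w1: (D, H); w2: (H, C); masks: (K, H) with 0/1 entries.
    Returns an (N, K, C) array in which column j always refers to the same
    sampled sub-network.
    """
    hidden = np.maximum(x @ w1, 0.0)                                    # (N, H)
    dropped = hidden[:, None, :] * masks[None, :, :] / (1.0 - p_drop)   # (N, K, H)
    return softmax(dropped @ w2)                                        # (N, K, C)

rng = np.random.default_rng(0)
k, d, h, c = 4, 3, 8, 5
masks = (rng.random((k, h)) > 0.5).astype(float)   # drawn once, then held fixed
w1, w2 = rng.normal(size=(d, h)), rng.normal(size=(h, c))
x = rng.normal(size=(2, d))
x[1] = x[0]                                        # an exact replica of the first input
probs = predict_consistent(x, w1, w2, masks)
```

With shared masks, the replica receives exactly the same prediction as the original under every sample, which is the dependency that the joint entropy needs to see. Independently redrawn masks per input would hide it.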

In all our experiments, we either compute joint entropies exactly by enumerating all configurations, or we estimate them using 10,000 MC samples, picking whichever method is faster. In practice, we compute joint entropies exactly for roughly the first 4 data points in an acquisition batch and use MC sampling thereafter.
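
A rough way to see where the switch to sampling happens is to compare the number of label configurations with the MC sample budget. The rule below is an assumption made for illustration (the text only says that the faster method is picked), and `use_exact_enumeration` is a hypothetical helper.

```python
def use_exact_enumeration(n, num_classes, num_mc_samples=10_000):
    """True if enumerating all label configurations of n points needs no more
    terms than drawing num_mc_samples configurations (an assumed rule of thumb)."""
    return num_classes ** n <= num_mc_samples

# Largest n for which enumeration is still within budget, for 10 and 47 classes.
largest_exact_10_classes = max(n for n in range(1, 11) if use_exact_enumeration(n, 10))
largest_exact_47_classes = max(n for n in range(1, 11) if use_exact_enumeration(n, 47))
```

For 10 classes the number of configurations reaches the 10,000-sample budget at four points, in line with the observation above. With 47 classes the crossover under this rule comes after two points.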

Figure 5: Performance on MNIST. BatchBALD outperforms BALD with acquisition size 10 and performs close to the optimum of acquisition size 1.
Figure 6: Relative total time on MNIST. Normalized to training BatchBALD with acquisition size 10 to 95% accuracy. The stars mark when 95% accuracy is reached for each method.

4.1 Repeated MNIST

As demonstrated in the introduction, naively applying BALD to a dataset that contains many (near) replicated data points leads to poor performance. We show how this manifests in practice by taking the MNIST dataset and replicating each data point in the training set three times. After normalising the dataset, we add isotropic Gaussian noise with a standard deviation of 0.1 to simulate slight differences between the duplicated data points in the training set. All results are obtained using an acquisition size of 10 and 10 MC dropout samples. The initial dataset was constructed by taking a balanced set of 20 data points, two of each class (similar to Gal et al. (2017)). These initial data points were chosen by running BALD 6 times with the initial dataset picked randomly and choosing the set of the median model; they were subsequently held fixed.
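
The dataset construction can be sketched as follows. The sketch assumes that "three times" means three noisy copies in total and that the inputs are already normalised; the function name and the seed handling are illustrative.

```python
import numpy as np

def make_repeated_dataset(x, y, repeats=3, noise_std=0.1, seed=0):
    """Replicate every (already normalised) example and add Gaussian noise.

    x: (N, ...) float array of inputs, y: (N,) array of labels.
    Returns arrays with repeats * N rows; copies of one example are adjacent.
    """
    rng = np.random.default_rng(seed)
    x_rep = np.repeat(x, repeats, axis=0)
    y_rep = np.repeat(y, repeats, axis=0)
    x_rep = x_rep + rng.normal(0.0, noise_std, size=x_rep.shape)  # isotropic noise
    return x_rep, y_rep

# Stand-in data with MNIST-like shapes (zeros instead of real images).
x = np.zeros((5, 28, 28))
y = np.arange(5)
x_rep, y_rep = make_repeated_dataset(x, y)
```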

Our model consists of two blocks of [convolution, dropout, max-pooling, relu], with 32 and 64 5x5 convolution filters. These blocks are followed by a two-layer MLP that includes dropout between the layers and has 128 and 10 hidden units. The dropout probability is 0.5 in all three locations. This architecture achieves 99% accuracy with 10 MC dropout samples during test time on the full MNIST dataset.

The results can be seen in figure 2. In this illustrative scenario, BALD performs poorly, and even randomly acquiring points performs better. However, BatchBALD is able to cope with the replication perfectly. In appendix D, we also compare with Variation Ratios (Freeman, 1965) and Mean STD (Kendall et al., 2015), which perform on par with random acquisition.

4.2 MNIST

                                                           90% accuracy       95% accuracy
BatchBALD                                                  70 / 90 / 110      190 / 200 / 230
BALD (reimplementation using reported experimental setup)  120 / 120 / 170    250 / 250 / >300
BALD (Gal et al., 2017)                                    145                335
Table 1: Number of required data points on MNIST until 90% and 95% accuracy are reached. 25%-, 50%- and 75%-quartiles for the number of required data points when available.

For the second experiment, we follow the setup of Gal et al. (2017) and perform AL on the MNIST dataset using 100 MC dropout samples. We use the same model architecture and initial dataset as described in section 4.1. Due to differences in model architecture, hyperparameters and model retraining, we significantly outperform the original results in Gal et al. (2017) as shown in table 1.

We first look at BALD for increasing acquisition size in figure 4(a). As we increase the acquisition size from the ideal of acquiring points individually and fully retraining after each point (acquisition size 1) to 40, there is a substantial performance drop.

BatchBALD, in figure 4(b), is able to maintain performance when doubling the acquisition size from 5 to 10. Performance drops only slightly at 40, possibly due to estimator noise.

The results for acquisition size 10 for both BALD and BatchBALD are compared in figure 5. BatchBALD outperforms BALD. Indeed, BatchBALD with acquisition size 10 performs close to the ideal with acquisition size 1. The total run time of training these three models until 95% accuracy is visualized in figure 6, where we see that BatchBALD with acquisition size 10 is much faster than BALD with acquisition size 1, and only marginally slower than BALD with acquisition size 10.

4.3 EMNIST

In this experiment, we show that BatchBALD also provides a significant improvement when we consider the more difficult EMNIST dataset (Cohen et al., 2017) in the Balanced setup, which consists of 47 classes, comprising letters and digits. The training set consists of 112,800 28x28 images balanced by class, of which the last 18,800 images constitute the validation set. We do not use an initial dataset and instead perform the initial acquisition step with the randomly initialized model. We use 10 MC dropout samples.

We use a similar model architecture as before, but with added capacity: three blocks of [convolution, dropout, max-pooling, relu], with 32, 64 and 128 3x3 convolution filters, and 2x2 max pooling. These blocks are followed by a two-layer MLP with 512 and 47 hidden units, with again a dropout layer in between. We use dropout probability 0.5 throughout the model.

The results for acquisition size 5 can be seen in figure 7. BatchBALD outperforms both random acquisition and BALD while BALD is unable to beat random acquisition. Figure 8 gives some insight into why BatchBALD performs better than BALD. The entropy of the categorical distribution of acquired class labels is consistently higher, meaning that BatchBALD acquires a more diverse set of data points. In figure 9, the classes on the x-axis are sorted by number of data points that were acquired of that class. We see that BALD undersamples classes while BatchBALD is more consistent.
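
The diversity measure is the entropy of the empirical class distribution of the acquired labels. A minimal sketch, with an illustrative function name and toy label lists:

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy (in nats) of the empirical class distribution of `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())

diverse = label_entropy([0, 1, 2, 3])      # four different classes
redundant = label_entropy([7, 7, 7, 7])    # the same class four times
```

A batch spread evenly over four classes reaches log 4, while a batch of a single class has entropy zero.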

Figure 7: Performance on EMNIST. BatchBALD consistently outperforms both random acquisition and BALD while BALD is unable to beat random acquisition.
Figure 8: Entropy of acquired class labels over acquisition steps on EMNIST. BatchBALD steadily acquires a more diverse set of data points.
Figure 9: Histogram of acquired class labels on EMNIST. BatchBALD left and BALD right. Classes are sorted by number of acquisitions, and only the lower half is shown for clarity. Several EMNIST classes are underrepresented in BALD while BatchBALD acquires classes more uniformly. The histograms were created from all acquired points at the end of an active learning loop. See appendix F for the full histograms including random acquisition.

5 Related work

AL is closely related to Bayesian Optimisation (BO), which is concerned with finding the global optimum of a function (Snoek et al., 2012), with the fewest number of function evaluations. This is generally done using a Gaussian Process. A common problem in BO is the lack of parallelism, with usually a single worker being responsible for function evaluations. In real-world settings, there are usually many such workers available and making optimal use of them is an open problem (González et al., 2016; Alvi et al., 2019) with some work exploring mutual information for optimising a multi-objective problem (Hernández-Lobato et al., 2016).

In AL of molecular data, the lack of diversity in batches of data points acquired using the BALD objective has been noted by Janz et al. (2017), who propose to resolve it by limiting the number of MC dropout samples and relying on noisy estimates.

A related approach to AL is semi-supervised learning (also sometimes referred to as weakly-supervised), in which the labelled data is commonly assumed to be fixed and the unlabelled data is used for unsupervised learning (Kingma et al., 2014; Rasmus et al., 2015). Wang et al. (2017); Sener and Savarese (2018) explore combining it with AL.

6 Scope and limitations

Unbalanced datasets. BALD and BatchBALD do not work well when the test set is unbalanced as they aim to learn well about all classes and do not follow the density of the dataset. However, if the test set is balanced, but the training set is not, we expect BatchBALD to perform well.

Unlabelled data. BatchBALD does not take into account any information from the unlabelled dataset. However, BatchBALD uses the underlying Bayesian model for estimating uncertainty for unlabelled data points, and semi-supervised learning could improve these estimates by providing more information about the underlying structure of the feature space. We leave a semi-supervised extension of BatchBALD to future work.

Noisy estimator. A significant amount of noise is introduced by MC-dropout’s variational approximation to training BNNs. Sampling of the joint entropies introduces additional noise. The quality of larger acquisition batches would be improved by reducing this noise.

7 Conclusion

We have introduced a new batch acquisition function, BatchBALD, for Deep Bayesian Active Learning, and a greedy algorithm that selects good candidate batches compared to the intractable optimal solution. Acquisitions show increased diversity of data points and improved performance over BALD and other methods.

While our method comes with additional computational cost during acquisition, BatchBALD is able to significantly reduce the number of data points that need to be labelled and the number of times the model has to be retrained, potentially saving considerable costs and filling an important gap in practical Deep Bayesian Active Learning.

8 Acknowledgements

The authors want to thank Binxin (Robin) Ru for helpful references to submodularity and the appropriate proofs. We would also like to thank the rest of OATML for their feedback at several stages of the project. AK is supported by the UK EPSRC CDT in Autonomous Intelligent Machines and Systems (grant reference EP/L015897/1). JvA is grateful for funding by the EPSRC (grant reference EP/N509711/1) and Google-DeepMind.

8.1 Author contributions

AK derived the original estimator, proved submodularity and bounds, implemented BatchBALD efficiently, and ran the experiments. JvA developed the narrative and experimental design, advised on debugging, structured the paper into its current form, and pushed it forward at difficult times. JvA and AK wrote the paper jointly.


  • Adomavicius and Tuzhilin [2005] Gediminas Adomavicius and Alexander Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge & Data Engineering, 2005.
  • Alvi et al. [2019] Ahsan S Alvi, Binxin Ru, Jan Calliess, Stephen J Roberts, and Michael A Osborne. Asynchronous batch Bayesian optimisation with improved local penalisation. arXiv preprint arXiv:1901.10452, 2019.
  • Blundell et al. [2015] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 1613–1622, 2015.
  • Calinon et al. [2007] Sylvain Calinon, Florent Guenter, and Aude Billard. On learning, representing, and generalizing a task in a humanoid robot. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 37(2):286–298, 2007.
  • Cohen et al. [2017] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. Emnist: Extending mnist to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 2921–2926. IEEE, 2017.
  • Cohn et al. [1996] David A Cohn, Zoubin Ghahramani, and Michael I Jordan. Active learning with statistical models. Journal of artificial intelligence research, 4:129–145, 1996.
  • Frankle and Carbin [2019] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019.
  • Freeman [1965] Linton C Freeman. Elementary applied statistics: for students in behavioral science. John Wiley & Sons, 1965.
  • Gal and Ghahramani [2016] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.
  • Gal et al. [2017] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1183–1192. JMLR. org, 2017.
  • González et al. [2016] Javier González, Zhenwen Dai, Philipp Hennig, and Neil Lawrence. Batch Bayesian optimization via local penalization. In Artificial Intelligence and Statistics, pages 648–657, 2016.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Hernández-Lobato et al. [2016] Daniel Hernández-Lobato, Jose Hernandez-Lobato, Amar Shah, and Ryan Adams. Predictive entropy search for multi-objective Bayesian optimization. In International Conference on Machine Learning, pages 1492–1501, 2016.
  • Hoi et al. [2006] Steven CH Hoi, Rong Jin, Jianke Zhu, and Michael R Lyu. Batch mode active learning and its application to medical image classification. In Proceedings of the 23rd international conference on Machine learning, pages 417–424. ACM, 2006.
  • Houlsby et al. [2011] Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011.
  • Janz et al. [2017] David Janz, Jos van der Westhuizen, and José Miguel Hernández-Lobato. Actively learning what makes a discrete sequence valid. arXiv preprint arXiv:1708.04465, 2017.
  • Kendall et al. [2015] Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680, 2015.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kingma et al. [2014] Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in neural information processing systems, pages 3581–3589, 2014.
  • Krause et al. [2008] Andreas Krause, Ajit Singh, and Carlos Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 9(Feb):235–284, 2008.
  • Nemhauser et al. [1978] George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions—i. Mathematical programming, 14(1):265–294, 1978.
  • Rasmus et al. [2015] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in neural information processing systems, pages 3546–3554, 2015.
  • Sener and Savarese [2018] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.
  • Shen et al. [2018] Yanyao Shen, Hyokun Yun, Zachary C. Lipton, Yakov Kronrod, and Animashree Anandkumar. Deep active learning for named entity recognition. In International Conference on Learning Representations, 2018.
  • Siddhant and Lipton [2018] Aditya Siddhant and Zachary C Lipton. Deep Bayesian active learning for natural language processing: Results of a large-scale empirical study. arXiv preprint arXiv:1808.05697, 2018.
  • Simonyan and Zisserman [2015] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
  • Snoek et al. [2012] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
  • Tong [2001] Simon Tong. Active learning: theory and applications, volume 1. Stanford University USA, 2001.
  • Wang et al. [2017] Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang Lin. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 27(12):2591–2600, 2017.
  • Yeung [1991] Raymond W Yeung. A new outlook on Shannon’s information measures. IEEE Transactions on Information Theory, 37(3):466–474, 1991.

Appendix A Proof of submodularity

Nemhauser et al. [1978] show that if a function is submodular, then a greedy algorithm like algorithm 1 is $1 - 1/e$-approximate. Here, we show that $a_{\text{BatchBALD}}$ is submodular.

We will show that $a_{\text{BatchBALD}}$ satisfies the following equivalent definition of submodularity:

Definition A.1.

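A function $f$ defined on subsets of $\Omega$ is called submodular if for every set $A \subset \Omega$ and two non-identical points $y_1, y_2 \in \Omega \setminus A$:

$$f(A \cup \{y_1\}) + f(A \cup \{y_2\}) \ge f(A \cup \{y_1, y_2\}) + f(A). \tag{14}$$

Submodularity expresses that there are "diminishing returns" for adding additional points to $A$.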

Lemma A.2.

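$a_{\text{BatchBALD}}(A, p(\omega)) := I(A; \omega)$ is submodular for $A \subset \mathcal{D}_{\text{pool}}$.

Proof.

Let $y_1, y_2 \in \mathcal{D}_{\text{pool}}$ with $y_1 \neq y_2$. We start by substituting the definition of $a_{\text{BatchBALD}}$ into (14) and subtracting $I(A; \omega)$ twice on both sides, using that $I(A \cup B; \omega) - I(B; \omega) = I(A; \omega \mid B)$:

$$\begin{aligned}
& I(A \cup \{y_1\}; \omega) + I(A \cup \{y_2\}; \omega) \ge I(A \cup \{y_1, y_2\}; \omega) + I(A; \omega) \\
\Leftrightarrow\; & I(y_1; \omega \mid A) + I(y_2; \omega \mid A) \ge I(y_1, y_2; \omega \mid A).
\end{aligned}$$

We rewrite the left-hand side using the definition of the mutual information $I(A; B) = H(A) - H(A \mid B)$ and reorder:

$$\begin{aligned}
I(y_1; \omega \mid A) + I(y_2; \omega \mid A)
&= H(y_1 \mid A) + H(y_2 \mid A) - \left( H(y_1 \mid A, \omega) + H(y_2 \mid A, \omega) \right) \\
&\ge H(y_1, y_2 \mid A) - H(y_1, y_2 \mid A, \omega) \\
&= I(y_1, y_2; \omega \mid A),
\end{aligned}$$

where we have used that entropies are subadditive in general and additive given $y_1 \perp y_2 \mid \omega$. ∎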

Following Nemhauser et al. [1978], we can conclude that algorithm 1 is $1 - 1/e$-approximate.
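
The submodularity inequality of Definition A.1 can also be checked numerically on a small example. The following is a minimal sketch, not the authors' implementation: it assumes a toy model in which the posterior is represented by `k` equally likely parameter samples and labels are independent given the parameters. The names `mutual_information`, `is_submodular` and `toy_probs` are hypothetical and introduced only for illustration.

```python
import itertools
import numpy as np


def mutual_information(probs, subset):
    """Exact I(y_S; omega) when omega is uniform over k parameter samples.

    probs[i, j, c] = p(y_i = c | omega_j); labels are independent given omega.
    """
    subset = list(subset)
    if not subset:
        return 0.0
    n_classes = probs.shape[2]
    joint_entropy = 0.0
    for config in itertools.product(range(n_classes), repeat=len(subset)):
        # p(config | omega_j) factorises over the points in the subset.
        per_sample = np.prod([probs[i, :, c] for i, c in zip(subset, config)], axis=0)
        p = per_sample.mean()
        if p > 0:
            joint_entropy -= p * np.log(p)
    # The conditional entropy is additive given omega.
    p_sub = probs[subset]
    safe_log = np.log(np.where(p_sub > 0, p_sub, 1.0))
    cond_entropy = -(p_sub * safe_log).sum(axis=2).mean(axis=1).sum()
    return float(joint_entropy - cond_entropy)


def is_submodular(f, ground_set, tol=1e-9):
    """Brute-force check of f(A+y1) + f(A+y2) >= f(A+y1+y2) + f(A)."""
    ground_set = list(ground_set)
    for r in range(len(ground_set) - 1):
        for A in itertools.combinations(ground_set, r):
            rest = [x for x in ground_set if x not in A]
            for y1, y2 in itertools.combinations(rest, 2):
                lhs = f(A + (y1,)) + f(A + (y2,))
                rhs = f(A + (y1, y2)) + f(A)
                if lhs < rhs - tol:
                    return False
    return True


rng = np.random.default_rng(0)
toy_probs = rng.dirichlet(np.ones(3), size=(4, 5))  # 4 pool points, 5 samples, 3 classes
print(is_submodular(lambda S: mutual_information(toy_probs, S), range(4)))
```

The check enumerates every label configuration, so it is only feasible for a handful of points; it is meant as a sanity check of the lemma, not as a way to compute the acquisition function in practice.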

Appendix B Connection between BatchBALD and BALD

In the following section, we show that BALD approximates BatchBALD and that BatchBALD approximates BALD with acquisition size 1. The BALD score is an upper bound of the BatchBALD score for any candidate batch. At the same time, BatchBALD can be seen as performing BALD with acquisition size 1 during each step of its greedy algorithm in an idealised setting.

B.1 BALD as an approximation of BatchBALD

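Using the subadditivity of information entropy and the independence of the $y_i$ given $\omega$, we show that BALD is an approximation of BatchBALD and is always an upper bound on the respective BatchBALD score:

$$\begin{aligned}
a_{\text{BatchBALD}}(\{x_1, \dots, x_b\}, p(\omega \mid \mathcal{D}_{\text{train}}))
&= H(y_1, \dots, y_b \mid x_1, \dots, x_b, \mathcal{D}_{\text{train}}) - \mathbb{E}_{p(\omega \mid \mathcal{D}_{\text{train}})} \left[ H(y_1, \dots, y_b \mid x_1, \dots, x_b, \omega, \mathcal{D}_{\text{train}}) \right] \\
&\le \sum_{i=1}^{b} H(y_i \mid x_i, \mathcal{D}_{\text{train}}) - \sum_{i=1}^{b} \mathbb{E}_{p(\omega \mid \mathcal{D}_{\text{train}})} \left[ H(y_i \mid x_i, \omega, \mathcal{D}_{\text{train}}) \right] \\
&= \sum_{i=1}^{b} I(y_i; \omega \mid x_i, \mathcal{D}_{\text{train}})
= a_{\text{BALD}}(\{x_1, \dots, x_b\}, p(\omega \mid \mathcal{D}_{\text{train}})).
\end{aligned}$$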
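
To make the upper bound concrete, the sketch below computes both scores exactly for a small candidate batch. It is an illustration under stated assumptions rather than the paper's code: the posterior is assumed to consist of `k` equally likely parameter samples, and the helper names `entropy` and `bald_and_batchbald` are hypothetical.

```python
import numpy as np


def entropy(p):
    """Shannon entropy (in nats) of a probability vector, ignoring zero entries."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


def bald_and_batchbald(probs):
    """Return (BALD, BatchBALD) for a batch.

    probs[i, j, c] = p(y_i = c | omega_j) for batch point i and parameter sample j.
    """
    b, k, _ = probs.shape
    # E_omega[H(y_i | omega)] per point; this part is shared by both scores.
    cond = np.array([np.mean([entropy(probs[i, j]) for j in range(k)]) for i in range(b)])
    # BALD sums the per-point mutual informations.
    marginals = probs.mean(axis=1)
    bald = sum(entropy(marginals[i]) - cond[i] for i in range(b))
    # BatchBALD uses the joint entropy over all label configurations.
    joint = np.ones((k, 1))
    for i in range(b):
        joint = (joint[:, :, None] * probs[i][:, None, :]).reshape(k, -1)
    batchbald = entropy(joint.mean(axis=0)) - cond.sum()
    return float(bald), float(batchbald)


# Three copies of the same point: BALD counts its information three times.
single = np.array([[0.9, 0.1], [0.1, 0.9]])      # k = 2 samples, 2 classes
duplicates = np.repeat(single[None], 3, axis=0)
print(bald_and_batchbald(duplicates))
```

For the duplicated point, the BatchBALD score cannot exceed the entropy of the two parameter samples, $\log 2$, while the BALD score grows linearly with the number of copies. This is the double counting that the bound above formalises.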

B.2 BatchBALD as an approximation of BALD with acquisition size 1

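To see why BALD with acquisition size 1 can be seen as an upper bound for BatchBALD performance in an idealised setting, we reformulate the score computed in the inner loop of algorithm 1.

Instead of the original term $I(y_1, \dots, y_{n-1}, y; \omega \mid x_1, \dots, x_{n-1}, x, \mathcal{D}_{\text{train}})$, we can equivalently maximise

$$I(y_1, \dots, y_{n-1}, y; \omega \mid x_1, \dots, x_{n-1}, x, \mathcal{D}_{\text{train}}) - I(y_1, \dots, y_{n-1}; \omega \mid x_1, \dots, x_{n-1}, \mathcal{D}_{\text{train}}),$$

as the right term is constant for all $x \in \mathcal{D}_{\text{pool}} \setminus A_{n-1}$ within the inner loop, which, in turn, is equivalent to

$$I(y; \omega \mid x, y_1, x_1, \dots, y_{n-1}, x_{n-1}, \mathcal{D}_{\text{train}})$$

once we expand the difference with the chain rule of mutual information. This means that, at each step of the inner loop, our greedy algorithm is maximising the mutual information of the individual available data points with the model parameters, conditioned on all the additional data points that have already been picked for acquisition and on the existing training set. Finally, assuming training our model captures all available information,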

where $\hat{y}_1, \dots, \hat{y}_{n-1}$ are the actual labels of $x_1, \dots, x_{n-1}$. The mutual information decreases as $\omega$ becomes more concentrated as we expand its training set, and thus the overlap of $y$ and $\omega$ will become smaller (in an information-measure-theoretical sense).

This shows that every step of the inner loop in our algorithm is at most as good as retraining our model on the new training set and picking $x_n$ using $a_{\text{BALD}}$ with acquisition size 1.

Relevance for the active training loop. We see that the active training loop as a whole is computing a greedy $1 - 1/e$-approximation of the mutual information of all acquired data points over all acquisitions with the model parameters.
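
The greedy step discussed in this section can be sketched as follows. This is an illustrative toy version, not the paper's implementation: it assumes `k` equally likely parameter samples, computes the joint mutual information exactly by enumerating label configurations, and requires `b` to be at most the pool size. The names `greedy_batch` and `pool` are hypothetical. The recorded gain of each step is the difference of two joint scores, which by the chain rule equals the conditional mutual information of the newly picked point.

```python
import numpy as np


def entropy(p):
    """Shannon entropy (in nats) of a probability vector, ignoring zero entries."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


def greedy_batch(probs, b):
    """Greedily pick b pool indices maximising I(y_batch; omega).

    probs[i, j, c] = p(y_i = c | omega_j); assumes b <= number of pool points.
    """
    n, k, _ = probs.shape
    # E_omega[H(y_i | omega)] for every pool point.
    cond = np.array([np.mean([entropy(probs[i, j]) for j in range(k)]) for i in range(n)])
    chosen, gains = [], []
    joint = np.ones((k, 1))  # p(labels of chosen points | omega_j), one column per configuration
    cond_sum = 0.0           # conditional entropy of the chosen points (additive given omega)
    current = 0.0            # I(y_chosen; omega)
    for _ in range(b):
        best = None
        for i in range(n):
            if i in chosen:
                continue
            cand = (joint[:, :, None] * probs[i][:, None, :]).reshape(k, -1)
            score = entropy(cand.mean(axis=0)) - (cond_sum + cond[i])
            if best is None or score > best[1]:
                best = (i, score, cand)
        i, score, cand = best
        gains.append(score - current)  # equals I(y_i; omega | y_chosen) by the chain rule
        chosen.append(i)
        joint, cond_sum, current = cand, cond_sum + cond[i], score
    return chosen, gains


# Points 0 and 1 are duplicates; point 2 splits the parameter samples differently.
a = np.array([[0.95, 0.05]] * 2 + [[0.05, 0.95]] * 2)
c = np.array([[0.9, 0.1], [0.1, 0.9]] * 2)
pool = np.stack([a, a, c])
chosen, gains = greedy_batch(pool, 2)
print(chosen, gains)
```

In this toy pool the two duplicates have the highest individual scores, so ranking points by their individual mutual information would select both of them. The greedy joint score instead picks one duplicate and then the complementary point, because the second duplicate adds little information once the first is in the batch.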

Appendix C Sampling of configurations

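We are using the same notation as in section 3.3. We factor $p(y_{1:n} \mid \omega)$ to avoid recomputations and rewrite $H(y_{1:n})$ as:

$$\begin{aligned}
H(y_{1:n}) &= \mathbb{E}_{p(\omega)} \mathbb{E}_{p(y_{1:n} \mid \omega)} \left[ -\log p(y_{1:n}) \right] \\
&= \mathbb{E}_{p(\omega)} \mathbb{E}_{p(y_{1:n-1} \mid \omega)} \mathbb{E}_{p(y_n \mid \omega)} \left[ -\log p(y_{1:n}) \right].
\end{aligned}$$

To be flexible in the way we sample $y_{1:n-1}$, we perform importance sampling of $p(y_{1:n-1} \mid \omega)$ using $p(y_{1:n-1})$, and, assuming we also have $m$ samples $\hat{y}_{1:n-1}$ from $p(y_{1:n-1})$, we can approximate:

$$\begin{aligned}
H(y_{1:n}) &= \mathbb{E}_{p(\omega)} \mathbb{E}_{p(y_{1:n-1})} \left[ \frac{p(y_{1:n-1} \mid \omega)}{p(y_{1:n-1})} \, \mathbb{E}_{p(y_n \mid \omega)} \left[ -\log p(y_{1:n}) \right] \right] \\
&= -\mathbb{E}_{p(y_{1:n-1})} \left[ \sum_{y_n} \frac{p(y_{1:n})}{p(y_{1:n-1})} \log p(y_{1:n}) \right] \\
&\approx -\frac{1}{m} \sum_{\hat{y}_{1:n-1}} \sum_{\hat{y}_n} \left[ \frac{\frac{1}{k} \hat{P}_{1:n-1} \hat{P}_n^{T}}{\frac{1}{k} \hat{P}_{1:n-1} \mathbb{1}_{k,1}} \odot \log \left( \frac{1}{k} \hat{P}_{1:n-1} \hat{P}_n^{T} \right) \right]_{\hat{y}_{1:n-1}, \hat{y}_n},
\end{aligned} \tag{38}$$

where we store $p(\hat{y}_{1:n-1} \mid \hat{\omega}_j)$ in a matrix $\hat{P}_{1:n-1}$ of shape $m \times k$ and $p(\hat{y}_n \mid \hat{\omega}_j)$ in a matrix $\hat{P}_n$ of shape $c \times k$, with $k$ the number of parameter samples $\hat{\omega}_j$ and $c$ the number of classes, and $\mathbb{1}_{k,1}$ is a $k \times 1$ matrix of $1$s. Equation (38) allows us to cache $\hat{P}_{1:n-1}$ inside the inner loop of algorithm 1 and use batch matrix multiplication for efficient computation.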
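
The matrix form of this estimator can be sketched in NumPy as follows. This is a sketch under assumptions, not the authors' implementation: `P_prev` (shape `m × k`) is assumed to hold the likelihood of each of `m` sampled configurations of the earlier points under each of `k` parameter samples, `P_next` (shape `c × k`) is assumed to hold the class probabilities of the candidate point, and the configurations are assumed to be drawn from their marginal distribution. The function name `joint_entropy_estimate` and the toy data are hypothetical.

```python
import numpy as np


def joint_entropy_estimate(P_prev, P_next):
    """Importance-weighted Monte Carlo estimate of the joint entropy H(y_1, ..., y_n).

    P_prev[s, j] = p(sampled configuration s of y_1..y_{n-1} | omega_j), shape (m, k)
    P_next[c, j] = p(y_n = c | omega_j),                                 shape (c, k)
    """
    m, k = P_prev.shape
    joint = P_prev @ P_next.T / k            # p(configuration s, y_n = c), shape (m, c)
    prior = P_prev @ np.ones((k, 1)) / k     # p(configuration s), shape (m, 1)
    log_joint = np.log(joint, where=joint > 0, out=np.zeros_like(joint))
    return float(-(joint / prior * log_joint).sum() / m)


# Toy check against exact enumeration (3 points, k = 8 samples, 3 classes).
rng = np.random.default_rng(0)
k, n_classes = 8, 3
probs = rng.dirichlet(np.ones(n_classes), size=(3, k))   # probs[i, j, :] = p(y_i | omega_j)

all_P = np.ones((1, k))
for i in range(2):                                        # all configurations of the first two points
    all_P = (all_P[:, None, :] * probs[i].T[None, :, :]).reshape(-1, k)
P_next = probs[2].T

prior_all = all_P.mean(axis=1)
idx = rng.choice(len(all_P), size=20000, p=prior_all / prior_all.sum())
estimate = joint_entropy_estimate(all_P[idx], P_next)

joint_all = all_P @ P_next.T / k
exact = float(-(joint_all * np.log(joint_all)).sum())
print(estimate, exact)
```

Because `P_prev` does not depend on the candidate point, it can be computed once per greedy step and reused for every candidate, leaving a single matrix product per candidate.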

Appendix D Additional results for Repeated MNIST

We show that BatchBALD also outperforms Var Ratios [Freeman, 1965] and Mean STD [Kendall et al., 2015].

Figure 10: Performance on Repeated MNIST. BALD, BatchBALD, Var Ratios, Mean STD and random acquisition: acquisition size 10 with 10 MC dropout samples.

Appendix E Example visualisation of EMNIST

Figure 11: Examples of all 47 classes of EMNIST

Appendix F Entropy and class acquisitions including random acquisition

Figure 12: Performance on EMNIST. BatchBALD consistently outperforms both random acquisition and BALD while BALD is unable to beat random acquisition.
Figure 13: Entropy of acquired class labels over acquisition steps on EMNIST. BatchBALD steadily acquires a more diverse set of data points than BALD.
Figure 14: Histogram of acquired class labels on EMNIST. BatchBALD (left) and BALD (right). Classes are sorted by number of acquisitions. Several EMNIST classes are underrepresented in BALD and random acquisition, while BatchBALD acquires classes more uniformly. The histograms were created from all acquired points at the end of an active learning loop.