1 Introduction
A key problem in deep learning is data efficiency. While excellent performance can be obtained with modern tools, these are often data-hungry, rendering the deployment of deep learning in the real world challenging for many tasks. Active learning (AL) (Cohn et al., 1996) is a powerful technique for attaining data efficiency. Instead of a priori collecting and labelling a large dataset, which often comes at a significant expense, in AL we iteratively acquire labels from an expert only for the most informative data points from a pool of available unlabelled data. After each acquisition step, the newly labelled points are added to the training set, and the model is retrained. This process is repeated until a suitable level of accuracy is achieved. The goal of AL is to minimise the amount of data that needs to be labelled. AL has already made real-world impact in manufacturing (Tong, 2001), robotics (Calinon et al., 2007), recommender systems (Adomavicius and Tuzhilin, 2005), medical imaging (Hoi et al., 2006), and NLP (Siddhant and Lipton, 2018), motivating the need for pushing AL even further.
In AL, the informativeness of new points is assessed by an acquisition function. There are a number of intuitive choices, such as model uncertainty and mutual information, and, in this paper, we focus on BALD (Houlsby et al., 2011), which has proven itself in the context of deep learning (Gal et al., 2017; Shen et al., 2018; Janz et al., 2017). BALD is based on mutual information and scores points based on how well their label would inform us about the true model parameter distribution. In deep learning models (He et al., 2016; Simonyan and Zisserman, 2015), we generally treat the parameters as point estimates instead of distributions. However, Bayesian neural networks have become a powerful alternative to traditional neural networks and do provide a distribution over their parameters. Improvements in approximate inference (Blundell et al., 2015; Gal and Ghahramani, 2016) have enabled their usage for high-dimensional data such as images and in conjunction with BALD (Gal et al., 2017).
In practical AL applications, instead of single data points, batches of data points are acquired during each acquisition step to reduce the number of times the model is retrained and expert time is requested. Model retraining becomes a computational bottleneck for larger models while expert time is expensive: consider, for example, the effort that goes into commissioning a medical specialist to label a single MRI scan, then waiting until the model is retrained, and then commissioning a new medical specialist to label the next MRI scan, and the extra amount of time this takes.
In Gal et al. (2017), batch acquisition, i.e. the acquisition of multiple points, takes the top points with the highest BALD acquisition score. This naive approach leads to acquiring points that are individually very informative, but not necessarily so jointly. See figure 2 for such a batch acquisition of BALD in which it performs poorly whereas scoring points jointly ("BatchBALD") can find batches of informative data points. Figure 2 shows how a dataset consisting of repeated MNIST digits (with added Gaussian noise) leads BALD to perform worse than random acquisition while BatchBALD sustains good performance.
Naively finding the best batch to acquire requires enumerating all possible subsets within the available data, which is intractable as the number of potential subsets grows exponentially with the acquisition size and the size of available points to choose from. Instead, we develop a greedy algorithm that selects a batch in linear time, and show that it is at worst a $1 - 1/e$ approximation to the optimal choice for our acquisition function. We provide an open-source implementation at https://github.com/BlackHC/BatchBALD.
The main contributions of this work are:

BatchBALD, a data-efficient active learning method that acquires sets of high-dimensional image data, leading to improved data efficiency and reduced total run time, section 3.1;

a greedy algorithm to select a batch of points efficiently, section 3.2; and

an estimator for the acquisition function that scales to larger acquisition sizes and to datasets with many classes, section 3.3.
2 Background
2.1 Problem Setting
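The Bayesian active learning setup consists of an unlabelled dataset $\mathcal{D}_{\text{pool}}$, the current training set $\mathcal{D}_{\text{train}}$, a Bayesian model $\mathcal{M}$ with model parameters $\omega \sim p(\omega \mid \mathcal{D}_{\text{train}})$, and output predictions $p(y \mid x, \omega, \mathcal{D}_{\text{train}})$ for data point $x$ and prediction $y \in \{1, \dots, c\}$ in the classification case. The conditioning of $\omega$ on $\mathcal{D}_{\text{train}}$ expresses that the model has been trained with $\mathcal{D}_{\text{train}}$. Furthermore, an oracle can provide us with the correct label $\tilde{y}$ for a data point in the unlabelled pool $x \in \mathcal{D}_{\text{pool}}$. The goal is to obtain a certain level of prediction accuracy with the least amount of oracle queries. At each acquisition step, a batch of data points $\{x_1^*, \dots, x_b^*\}$ is selected using an acquisition function $a$ which scores a candidate batch of unlabelled data points $\{x_1, \dots, x_b\} \subseteq \mathcal{D}_{\text{pool}}$ using the current model parameters $p(\omega \mid \mathcal{D}_{\text{train}})$:
$$\{x_1^*, \dots, x_b^*\} = \operatorname*{arg\,max}_{\{x_1, \dots, x_b\} \subseteq \mathcal{D}_{\text{pool}}} a\big(\{x_1, \dots, x_b\}, p(\omega \mid \mathcal{D}_{\text{train}})\big) \tag{1}$$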
2.2 BALD
BALD (Bayesian Active Learning by Disagreement) (Houlsby et al., 2011) uses an acquisition function that estimates the mutual information between the model predictions and the model parameters. Intuitively, it captures how strongly the model predictions for a given data point and the model parameters are coupled, implying that finding out about the true label of data points with high mutual information would also inform us about the true model parameters. Originally introduced outside the context of deep learning, the only requirement on the model is that it is Bayesian. BALD is defined as:
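$$\mathbb{I}(y; \omega \mid x, \mathcal{D}_{\text{train}}) = \mathbb{H}(y \mid x, \mathcal{D}_{\text{train}}) - \mathbb{E}_{p(\omega \mid \mathcal{D}_{\text{train}})}\big[\mathbb{H}(y \mid x, \omega, \mathcal{D}_{\text{train}})\big] \tag{2}$$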
Looking at the two terms in equation (2), for the mutual information to be high, the left term has to be high and the right term low. The left term is the entropy of the model prediction, which is high when the model’s prediction is uncertain. The right term is an expectation of the entropy of the model prediction over the posterior of the model parameters and is low when the model is overall certain for each draw of model parameters from the posterior. Both can only happen when the model has many possible ways to explain the data, which means that the posterior draws are disagreeing among themselves.
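The two terms of equation (2) can be estimated from a stack of predictive probabilities, one per posterior draw. The following is a minimal NumPy sketch, not the authors' implementation: the array layout `(N, K, C)` and the names `entropy` and `bald_scores` are assumptions made for illustration.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy in nats along `axis` (eps guards against log(0))."""
    return -np.sum(p * np.log(p + eps), axis=axis)

def bald_scores(probs):
    """probs has shape (N, K, C): N points, K posterior draws, C classes."""
    mean_probs = probs.mean(axis=1)                  # (N, C) averaged prediction
    predictive_entropy = entropy(mean_probs)         # left term of equation (2)
    expected_entropy = entropy(probs).mean(axis=1)   # right term of equation (2)
    return predictive_entropy - expected_entropy

probs = np.array([
    [[0.99, 0.01], [0.01, 0.99]],   # draws are confident but disagree
    [[0.50, 0.50], [0.50, 0.50]],   # draws agree that the label is uncertain
])
scores = bald_scores(probs)
```

The first point receives a high score because the draws disagree, while the second scores close to zero even though its averaged prediction is equally uncertain.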
BALD was originally intended for acquiring individual data points and immediately retraining the model. This becomes a bottleneck in deep learning, where retraining takes a substantial amount of time. Applications of BALD (Gal and Ghahramani, 2016; Janz et al., 2017) usually acquire the top $b$ points. This can be expressed as summing over individual scores:
$$a_{\text{BALD}}\big(\{x_1, \dots, x_b\}, p(\omega \mid \mathcal{D}_{\text{train}})\big) = \sum_{i=1}^{b} \mathbb{I}(y_i; \omega \mid x_i, \mathcal{D}_{\text{train}}) \tag{3}$$
and finding the optimal batch for this acquisition function using a greedy algorithm, which reduces to picking the top $b$ highest-scoring data points.
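Because the score in equation (3) is a plain sum, maximising it amounts to sorting the individual scores. A small sketch of that selection step, with hypothetical score values and a hypothetical helper name `top_b_batch`:

```python
import numpy as np

def top_b_batch(scores, b):
    """Indices of the b highest-scoring pool points, best first."""
    order = np.argsort(-scores, kind="stable")
    return order[:b]

# Points 1, 2 and 4 stand in for near-duplicates with almost equal scores.
scores = np.array([0.10, 0.62, 0.61, 0.05, 0.60])
batch = top_b_batch(scores, b=3)
```

With these made-up scores the batch consists of the three near-duplicates, which illustrates how individually informative points can crowd out everything else.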
2.3 Bayesian Neural Networks (BNN)
In this paper we focus on BNNs as our Bayesian model because they scale well to high dimensional inputs, such as images. Compared to regular neural networks, BNNs maintain a distribution over their weights instead of point estimates. Performing exact inference in BNNs is intractable for any reasonably sized model, so we resort to using a variational approximation. Similar to Gal et al. (2017), we use MC dropout (Gal and Ghahramani, 2016), which is easy to implement, scales well to large models and datasets, and is straightforward to optimise.
3 Methods
3.1 BatchBALD
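We propose BatchBALD as an extension of BALD whereby we jointly score points by estimating the mutual information between a joint of multiple data points and the model parameters:
$$a_{\text{BatchBALD}}\big(\{x_1, \dots, x_b\}, p(\omega \mid \mathcal{D}_{\text{train}})\big) = \mathbb{I}(y_1, \dots, y_b; \omega \mid x_1, \dots, x_b, \mathcal{D}_{\text{train}}) \tag{4}$$
Footnote 2: We use the notation $\mathbb{I}(x, y; z \mid c)$ to denote the mutual information between the joint of the random variables $x, y$ and the random variable $z$ conditioned on $c$.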
This builds on the insight that independent selection of a batch of data points leads to data inefficiency as correlations between data points in an acquisition batch are not taken into account.
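To understand how to compute the mutual information between a set of points and the model parameters, we express $x_1, \dots, x_b$, and $y_1, \dots, y_b$ through joint random variables $x_{1:b}$ and $y_{1:b}$ in a product probability space and use the definition of the mutual information for two random variables:
$$\mathbb{I}(y_{1:b}; \omega \mid x_{1:b}, \mathcal{D}_{\text{train}}) = \mathbb{H}(y_{1:b} \mid x_{1:b}, \mathcal{D}_{\text{train}}) - \mathbb{E}_{p(\omega \mid \mathcal{D}_{\text{train}})}\big[\mathbb{H}(y_{1:b} \mid x_{1:b}, \omega, \mathcal{D}_{\text{train}})\big] \tag{5}$$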
Intuitively, the mutual information between two random variables can be seen as the intersection of their information content. In fact, Yeung (1991) shows that a signed measure $\mu^*$ can be defined for discrete random variables $x$, $y$, such that $\mathbb{I}(x; y) = \mu^*(x \cap y)$, $\mathbb{H}(x, y) = \mu^*(x \cup y)$, $\mathbb{E}_{p(y)}\mathbb{H}(x \mid y) = \mu^*(x \setminus y)$, and so on, where we identify random variables with their counterparts in information space, and conveniently drop conditioning on $\mathcal{D}_{\text{train}}$ and $x_i$.
Using this, BALD can be viewed as the sum of individual intersections $\sum_i \mu^*(y_i \cap \omega)$, which double counts overlaps between the $y_i$. Naively extending BALD to the mutual information between $y_1, \dots, y_b$ and $\omega$, which is equivalent to $\mu^*\big(\bigcap_i y_i \cap \omega\big)$, would lead to selecting similar data points instead of diverse ones under maximisation.
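BatchBALD, on the other hand, takes overlaps into account by computing $\mu^*\big(\bigcup_i y_i \cap \omega\big)$ and is more likely to acquire a more diverse cover under maximisation:
$$\mathbb{I}(y_1, \dots, y_b; \omega \mid x_1, \dots, x_b, \mathcal{D}_{\text{train}}) = \mathbb{H}(y_{1:b} \mid x_{1:b}, \mathcal{D}_{\text{train}}) - \mathbb{E}_{p(\omega \mid \mathcal{D}_{\text{train}})}\big[\mathbb{H}(y_{1:b} \mid x_{1:b}, \omega, \mathcal{D}_{\text{train}})\big] \tag{6}$$
$$= \mu^*\Big(\bigcup_i y_i\Big) - \mu^*\Big(\bigcup_i y_i \setminus \omega\Big) = \mu^*\Big(\bigcup_i y_i \cap \omega\Big) \tag{7}$$
This is depicted in figure 3 and also motivates that $a_{\text{BatchBALD}} \le a_{\text{BALD}}$, which we prove in appendix B.1. For acquisition size 1, BatchBALD and BALD are equivalent.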
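The double counting can be seen on a toy case in which the same point is scored twice. The sketch below is an illustrative assumption, not taken from the paper: two posterior draws that confidently disagree, and helper names `mutual_information` and `joint_mutual_information` chosen for this example.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a probability vector, skipping zeros."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(p):
    """p has shape (K, C): class probabilities of one point under K draws."""
    return entropy(p.mean(axis=0)) - np.mean([entropy(row) for row in p])

def joint_mutual_information(p1, p2):
    """Joint score of two points whose labels are independent given each draw."""
    k = p1.shape[0]
    joint_given_w = np.einsum("ka,kb->kab", p1, p2)   # (K, C, C)
    joint = joint_given_w.mean(axis=0)                # average over draws
    conditional = np.mean([entropy(joint_given_w[j].ravel()) for j in range(k)])
    return entropy(joint.ravel()) - conditional

p = np.array([[1.0, 0.0], [0.0, 1.0]])          # two draws, confident and opposed
bald_sum = 2 * mutual_information(p)            # the duplicate is scored twice
batch_score = joint_mutual_information(p, p)    # the duplicate is scored jointly
```

Here the summed score is $2 \ln 2$ while the joint score stays at $\ln 2$: the second copy adds no information, and only the joint score reflects that.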


3.2 Greedy approximation algorithm for BatchBALD
To avoid the combinatorial explosion that arises from jointly scoring subsets of points, we introduce a greedy approximation for computing BatchBALD, depicted in algorithm 1. In appendix A, we prove that $a_{\text{BatchBALD}}$ is submodular, which means the greedy algorithm is $1 - 1/e$-approximate (Nemhauser et al., 1978; Krause et al., 2008).
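A greedy selection of this kind can be sketched in a few lines. This is a hedged illustration rather than the paper's algorithm 1: it scores every candidate by exact enumeration, which is only feasible for very small batches and few classes, and the names `batch_score` and `greedy_batch` as well as the toy pool are assumptions.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def batch_score(probs, batch):
    """Exact joint mutual information for the pool indices in `batch`.
    probs has shape (N, K, C): N pool points, K shared draws, C classes."""
    k = probs.shape[1]
    joint_given_w = np.ones((k, 1))      # (K, number of label configurations)
    conditional = 0.0
    for i in batch:
        # Labels are independent given a draw, so the joint is an outer product.
        joint_given_w = (joint_given_w[:, :, None] * probs[i][:, None, :]).reshape(k, -1)
        conditional += np.mean([entropy(row) for row in probs[i]])
    return entropy(joint_given_w.mean(axis=0)) - conditional

def greedy_batch(probs, b):
    """Add, b times, the point that maximises the joint score of the batch."""
    batch = []
    for _ in range(b):
        candidates = [i for i in range(len(probs)) if i not in batch]
        best = max(candidates, key=lambda i: batch_score(probs, batch + [i]))
        batch.append(best)
    return batch

duplicate = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
other = np.array([[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]])
pool = np.stack([duplicate, duplicate, other])   # points 0 and 1 are identical
chosen = greedy_batch(pool, b=2)
```

On this toy pool the second pick is point 2 rather than the duplicate, although the duplicate has the higher individual score.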
In appendix B.2, we show that, under idealised conditions, when using BatchBALD and a fixed final $|\mathcal{D}_{\text{train}}|$, the active learning loop itself can be seen as a greedy approximation algorithm, and that an active learning loop with BatchBALD and acquisition size larger than 1 is bounded by an active learning loop with individual acquisitions, that is BALD/BatchBALD with acquisition size 1, which is the ideal case.
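3.3 Computing $a_{\text{BatchBALD}}$
For brevity, we leave out conditioning on $x_1, \dots, x_n$, and $\mathcal{D}_{\text{train}}$, and $p(\omega)$ denotes $p(\omega \mid \mathcal{D}_{\text{train}})$ in this section. $a_{\text{BatchBALD}}$ is then written as:
$$a_{\text{BatchBALD}}\big(\{x_1, \dots, x_n\}, p(\omega)\big) = \mathbb{H}(y_1, \dots, y_n) - \mathbb{E}_{p(\omega)}\big[\mathbb{H}(y_1, \dots, y_n \mid \omega)\big] \tag{8}$$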
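Because the $y_i$ are independent when conditioned on $\omega$, computing the right term of equation (8) is simplified as the conditional joint entropy decomposes into a sum. We can approximate the expectation using a Monte-Carlo estimator with $k$ samples from our model parameter distribution $\hat{\omega}_j \sim p(\omega)$:
$$\mathbb{E}_{p(\omega)}\big[\mathbb{H}(y_1, \dots, y_n \mid \omega)\big] = \sum_{i=1}^{n} \mathbb{E}_{p(\omega)}\big[\mathbb{H}(y_i \mid \omega)\big] \approx \frac{1}{k} \sum_{i=1}^{n} \sum_{j=1}^{k} \mathbb{H}(y_i \mid \hat{\omega}_j) \tag{9}$$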
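Computing the left term of equation (8) is difficult because the unconditioned joint probability does not factorise. Applying the equality $p(y) = \mathbb{E}_{p(\omega)}[p(y \mid \omega)]$, and, using sampled $\hat{\omega}_j$, we compute the entropy by summing over all possible configurations $\hat{y}_{1:n}$ of $y_{1:n}$:
$$\mathbb{H}(y_{1:n}) = \mathbb{E}_{p(y_{1:n})}\big[-\log p(y_{1:n})\big] \tag{10}$$
$$= \mathbb{E}_{p(\omega)} \mathbb{E}_{p(y_{1:n} \mid \omega)}\big[-\log \mathbb{E}_{p(\omega)}[p(y_{1:n} \mid \omega)]\big] \tag{11}$$
$$\approx -\sum_{\hat{y}_{1:n}} \Big(\frac{1}{k} \sum_{j=1}^{k} p(\hat{y}_{1:n} \mid \hat{\omega}_j)\Big) \log \Big(\frac{1}{k} \sum_{j=1}^{k} p(\hat{y}_{1:n} \mid \hat{\omega}_j)\Big) \tag{12}$$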
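Both estimators in this section can be written out directly for a small batch. The sketch below enumerates every label configuration, so it is only meant to make the sums concrete; the array layout `(n, K, C)` and the function names are assumptions, and the probabilities are random stand-ins rather than model outputs.

```python
import itertools
import numpy as np

def joint_entropy_exact(probs):
    """probs[i, j, c] = probability of class c for point i under draw j.
    Sums over all C**n label configurations, averaging over the K draws."""
    n, k, c = probs.shape
    total = 0.0
    for config in itertools.product(range(c), repeat=n):
        # Given a draw, the labels are independent, so the joint is a product.
        p_given_w = np.prod([probs[i, :, y] for i, y in enumerate(config)], axis=0)
        p = p_given_w.mean()
        if p > 0:
            total -= p * np.log(p)
    return total

def conditional_entropy(probs):
    """Monte-Carlo estimate of the expected conditional joint entropy."""
    safe = np.where(probs > 0, probs, 1.0)            # log(1) = 0 for zero entries
    per_point_and_draw = -np.sum(probs * np.log(safe), axis=-1)   # (n, K)
    return per_point_and_draw.mean(axis=1).sum()

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=(3, 5))   # n=3 points, K=5 draws, C=4 classes
score = joint_entropy_exact(probs) - conditional_entropy(probs)
```

The loop visits $C^n$ configurations, which is why exact enumeration stops being practical after a handful of points.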
3.4 Efficient implementation
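In each iteration of the algorithm, $x_1, \dots, x_{n-1}$ stay fixed while $x_n$ varies over $\mathcal{D}_{\text{pool}} \setminus \{x_1, \dots, x_{n-1}\}$. We can reduce the required computations by factorizing $p(y_{1:n} \mid \omega)$ into $p(y_{1:n-1} \mid \omega)\, p(y_n \mid \omega)$. We store $p(\hat{y}_{1:n-1} \mid \hat{\omega}_j)$ in a matrix $\hat{P}_{1:n-1}$ of shape $c^{n-1} \times k$ and $p(y_n \mid \hat{\omega}_j)$ in a matrix $\hat{P}_n$ of shape $c \times k$. The sum in (12) can then be turned into a matrix product:
$$\frac{1}{k} \sum_{j=1}^{k} p(\hat{y}_{1:n} \mid \hat{\omega}_j) = \frac{1}{k} \sum_{j=1}^{k} p(\hat{y}_{1:n-1} \mid \hat{\omega}_j)\, p(\hat{y}_n \mid \hat{\omega}_j) = \Big(\frac{1}{k} \hat{P}_{1:n-1} \hat{P}_n^{\top}\Big)_{\hat{y}_{1:n-1},\, \hat{y}_n} \tag{13}$$
This can be further sped up by using batch matrix multiplication to compute the joint entropy for different $x_n$. $\hat{P}_{1:n-1}$ only has to be computed once, and we can recursively compute $\hat{P}_{1:n}$ using $\hat{P}_{1:n-1}$ and $\hat{P}_n$, which allows us to sample $p(y \mid \hat{\omega}_j)$ for each $x \in \mathcal{D}_{\text{pool}}$ only once at the beginning of the algorithm.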
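The caching scheme described here maps onto two small array operations. The following NumPy sketch is an assumption-laden illustration, not the released implementation: matrices are stored as (configurations, draws) and (classes, draws), and `extend_joint` and `joint_entropy_candidates` are hypothetical names.

```python
import numpy as np

def extend_joint(p_prev, p_new):
    """p_prev: (M, K) probabilities of M label configurations under K draws.
    p_new: (C, K) class probabilities of the added point. Returns (M * C, K)."""
    k = p_prev.shape[1]
    return (p_prev[:, None, :] * p_new[None, :, :]).reshape(-1, k)

def joint_entropy_candidates(p_prev, p_candidates):
    """p_candidates: (N, C, K). Returns one joint entropy per candidate."""
    k = p_prev.shape[1]
    # (M, K) @ (N, K, C) -> (N, M, C): averaged joint probabilities per candidate
    joint = np.matmul(p_prev, np.transpose(p_candidates, (0, 2, 1))) / k
    safe = np.where(joint > 0, joint, 1.0)            # log(1) = 0 for zero entries
    return -np.sum(joint * np.log(safe), axis=(1, 2))

rng = np.random.default_rng(0)
k, c = 6, 3
p_first = rng.dirichlet(np.ones(c), size=k).T                    # (C, K)
p_candidates = np.transpose(rng.dirichlet(np.ones(c), size=(4, k)), (0, 2, 1))
p_prev = extend_joint(np.ones((1, k)), p_first)   # cache after the first pick
entropies = joint_entropy_candidates(p_prev, p_candidates)
```

`p_prev` is built once per greedy step and reused for every candidate, so the per-candidate work reduces to one slice of a batched matrix product.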
For larger acquisition sizes, we use $m$ MC samples of $y_{1:n-1}$ as enumerating all possible configurations becomes infeasible. See appendix C for details.
MonteCarlo sampling bounds the time complexity of the full BatchBALD algorithm to compared to for naively finding the exact optimal batch and for BALD^{3}^{3}3 is the acquisition size, is the number of classes, is the number of MC dropout samples, and is the number of sampled configurations of . .
4 Experiments
In our experiments, we start by showing how a naive application of the BALD algorithm to an image dataset can lead to poor results in a dataset with many (near) duplicate data points, and show that BatchBALD solves this problem in a grounded way while obtaining favourable results (figure 2).
We then illustrate BatchBALD’s effectiveness on standard AL datasets: MNIST and EMNIST. EMNIST (Cohen et al., 2017) is an extension of MNIST that also includes letters, for a total of 47 classes, and has a twice as large training set. See appendix E for examples of the dataset. We show that BatchBALD provides a substantial performance improvement in these scenarios, too, and has more diverse acquisitions.
In our experiments, we repeatedly go through active learning loops. One active learning loop consists of training the model on the available labelled data and subsequently acquiring new data points using a chosen acquisition function. As the labelled dataset is small in the beginning, it is important to avoid overfitting. We do this by using early stopping after 3 epochs of declining accuracy on the validation set. We pick the model with the highest validation accuracy. Throughout our experiments, we use the Adam (Kingma and Ba, 2014) optimiser with learning rate 0.001 and betas 0.9/0.999. All our results report the median of 6 trials, with lower and upper quartiles. We use these quartiles to draw the filled error bars on our figures.
We reinitialize the model after each acquisition, similar to Gal et al. (2017). We found this helps the model improve even when very small batches are acquired. It also decorrelates subsequent acquisitions as final model performance is dependent on a particular initialization (Frankle and Carbin, 2019).
When computing $p(y \mid x, \omega, \mathcal{D}_{\text{train}})$, it is important to keep the dropout masks in MC dropout consistent while sampling from the model. This is necessary to capture dependencies between the inputs for BatchBALD, and it makes the scores for different points more comparable by removing this source of noise. We do not keep the masks fixed when computing BALD scores because its performance usually benefits from the added noise. We also do not need to keep these masks fixed for training and evaluating the model.
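One way to picture consistent masks is to draw the K masks once and reuse mask j for every input. The sketch below uses a toy two-layer network in NumPy as a stand-in for the real model; the weights, layer sizes and the name `predict_consistent` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_consistent(x, w1, w2, k, p_drop=0.5):
    """Returns (N, K, C) probabilities; draw j uses the same mask for every input."""
    masks = rng.random((k, w1.shape[1])) > p_drop              # K masks, drawn once
    hidden = np.maximum(x @ w1, 0.0)                           # (N, H)
    dropped = hidden[:, None, :] * masks[None] / (1.0 - p_drop)   # (N, K, H)
    return softmax(dropped @ w2)

w1 = rng.normal(size=(8, 16))
w2 = rng.normal(size=(16, 3))
x0 = rng.normal(size=8)
x = np.stack([x0, x0])                   # the same input twice
probs = predict_consistent(x, w1, w2, k=5)
```

Because both rows of `x` see identical masks, their per-draw predictions coincide, which is the dependency a joint score needs to detect duplicates.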
In all our experiments, we either compute joint entropies exactly by enumerating all configurations, or we estimate them using 10,000 MC samples, picking whichever method is faster. In practice, we compute joint entropies exactly for roughly the first 4 data points in an acquisition batch and use MC sampling thereafter.
4.1 Repeated MNIST
As demonstrated in the introduction, naively applying BALD to a dataset that contains many (near) replicated data points leads to poor performance. We show how this manifests in practice by taking the MNIST dataset and replicating each data point in the training set three times. After normalising the dataset, we add isotropic Gaussian noise with a standard deviation of 0.1 to simulate slight differences between the duplicated data points in the training set. All results are obtained using an acquisition size of 10 and 10 MC dropout samples. The initial dataset was constructed by taking a balanced set of 20 data points, two of each class (similar to Gal et al. (2017)). (Footnote 4: These initial data points were chosen by running BALD 6 times with the initial dataset picked randomly and choosing the set of the median model. They were subsequently held fixed.)
Our model consists of two blocks of [convolution, dropout, max-pooling, ReLU], with 32 and 64 5x5 convolution filters. These blocks are followed by a two-layer MLP that includes dropout between the layers and has 128 and 10 hidden units. The dropout probability is 0.5 in all three locations. This architecture achieves 99% accuracy with 10 MC dropout samples during test time on the full MNIST dataset.
The results can be seen in figure 2. In this illustrative scenario, BALD performs poorly, and even randomly acquiring points performs better. However, BatchBALD is able to cope with the replication perfectly. In appendix D, we also compare with Variation Ratios (Freeman, 1965), and Mean STD (Kendall et al., 2015) which perform on par with random acquisition.
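The dataset construction described at the start of this section can be sketched as follows. This is a guess at the procedure, not the authors' script: it assumes that "three times" means three copies in total, uses global mean and standard deviation for normalisation, and runs on random stand-in arrays instead of MNIST.

```python
import numpy as np

def make_repeated_dataset(images, labels, repeats=3, noise_std=0.1, seed=0):
    """Normalise, replicate every point `repeats` times, then add Gaussian noise."""
    rng = np.random.default_rng(seed)
    normalised = (images - images.mean()) / images.std()
    x = np.repeat(normalised, repeats, axis=0)    # copies are stored consecutively
    y = np.repeat(labels, repeats, axis=0)
    x = x + rng.normal(0.0, noise_std, size=x.shape)
    return x, y

images = np.random.default_rng(1).random((5, 28, 28))   # stand-in for MNIST images
labels = np.arange(5)
x, y = make_repeated_dataset(images, labels)
```

Each copy receives independent noise, so the replicas are near-duplicates rather than exact duplicates.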
4.2 MNIST
| | 90% accuracy | 95% accuracy |
|---|---|---|
| BatchBALD | 70 / 90 / 110 | 190 / 200 / 230 |
| BALD (reimplementation using reported experimental setup) | 120 / 120 / 170 | 250 / 250 / >300 |
| BALD (Gal et al., 2017) | 145 | 335 |
For the second experiment, we follow the setup of Gal et al. (2017) and perform AL on the MNIST dataset using 100 MC dropout samples. We use the same model architecture and initial dataset as described in section 4.1. Due to differences in model architecture, hyper parameters and model retraining, we significantly outperform the original results in Gal et al. (2017) as shown in table 1.
We first look at BALD for increasing acquisition size in figure 3(a). As we increase the acquisition size from the ideal of acquiring points individually and fully retraining after each point (acquisition size 1) to 40, there is a substantial performance drop.
BatchBALD, in figure 3(b), is able to maintain performance when doubling the acquisition size from 5 to 10. Performance drops only slightly at 40, possibly due to estimator noise.
The results for acquisition size 10 for both BALD and BatchBALD are compared in figure 6. BatchBALD outperforms BALD. Indeed, BatchBALD with acquisition size 10 performs close to the ideal with acquisition size 1. The total run time of training these three models until 95% accuracy is visualized in figure 6, where we see that BatchBALD with acquisition size 10 is much faster than BALD with acquisition size 1, and only marginally slower than BALD with acquisition size 10.
4.3 EMNIST
In this experiment, we show that BatchBALD also provides a significant improvement when we consider the more difficult EMNIST dataset (Cohen et al., 2017) in the Balanced setup, which consists of 47 classes, comprising letters and digits. The training set consists of 112,800 28x28 images balanced by class, of which the last 18,800 images constitute the validation set. We do not use an initial dataset and instead perform the initial acquisition step with the randomly initialized model. We use 10 MC dropout samples.
We use a similar model architecture as before, but with added capacity: three blocks of [convolution, dropout, max-pooling, ReLU], with 32, 64 and 128 3x3 convolution filters, and 2x2 max pooling. These blocks are followed by a two-layer MLP with 512 and 47 hidden units, with again a dropout layer in between. We use dropout probability 0.5 throughout the model.
The results for acquisition size 5 can be seen in figure 8. BatchBALD outperforms both random acquisition and BALD while BALD is unable to beat random acquisition. Figure 8 gives some insight into why BatchBALD performs better than BALD. The entropy of the categorical distribution of acquired class labels is consistently higher, meaning that BatchBALD acquires a more diverse set of data points. In figure 9, the classes on the x-axis are sorted by number of data points that were acquired of that class. We see that BALD undersamples classes while BatchBALD is more consistent.
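The diversity measure mentioned here, the entropy of the empirical class distribution of acquired labels, is simple to compute. A minimal sketch with made-up label sequences (the helper name `label_entropy` and the example data are assumptions):

```python
import numpy as np

def label_entropy(acquired_labels, num_classes):
    """Entropy in nats of the empirical class distribution of acquired labels."""
    counts = np.bincount(acquired_labels, minlength=num_classes)
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

balanced = np.arange(47)                               # one point per class
skewed = np.array([0] * 40 + list(range(1, 8)))        # mostly a single class
h_balanced = label_entropy(balanced, num_classes=47)
h_skewed = label_entropy(skewed, num_classes=47)
```

The entropy reaches its maximum of $\ln 47$ when all 47 classes are acquired equally often and drops as acquisitions concentrate on a few classes.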
5 Related work
AL is closely related to Bayesian Optimisation (BO), which is concerned with finding the global optimum of a function (Snoek et al., 2012), with the fewest number of function evaluations. This is generally done using a Gaussian Process. A common problem in BO is the lack of parallelism, with usually a single worker being responsible for function evaluations. In real-world settings, there are usually many such workers available and making optimal use of them is an open problem (González et al., 2016; Alvi et al., 2019) with some work exploring mutual information for optimising a multi-objective problem (Hernández-Lobato et al., 2016).
In AL of molecular data, the lack of diversity in batches of data points acquired using the BALD objective has been noted by Janz et al. (2017), who propose to resolve it by limiting the number of MC dropout samples and relying on noisy estimates.
A related approach to AL is semi-supervised learning (also sometimes referred to as weakly-supervised), in which the labelled data is commonly assumed to be fixed and the unlabelled data is used for unsupervised learning (Kingma et al., 2014; Rasmus et al., 2015). Wang et al. (2017) and Sener and Savarese (2018) explore combining it with AL.
6 Scope and limitations
Unbalanced datasets. BALD and BatchBALD do not work well when the test set is unbalanced as they aim to learn well about all classes and do not follow the density of the dataset. However, if the test set is balanced, but the training set is not, we expect BatchBALD to perform well.
Unlabelled data. BatchBALD does not take into account any information from the unlabelled dataset. However, BatchBALD uses the underlying Bayesian model for estimating uncertainty for unlabelled data points, and semi-supervised learning could improve these estimates by providing more information about the underlying structure of the feature space. We leave a semi-supervised extension of BatchBALD to future work.
Noisy estimator. A significant amount of noise is introduced by MC dropout's variational approximation to training BNNs. Sampling of the joint entropies introduces additional noise. The quality of larger acquisition batches would be improved by reducing this noise.
7 Conclusion
We have introduced a new batch acquisition function, BatchBALD, for Deep Bayesian Active Learning, and a greedy algorithm that selects good candidate batches compared to the intractable optimal solution. Acquisitions show increased diversity of data points and improved performance over BALD and other methods.
While our method comes with additional computational cost during acquisition, BatchBALD is able to significantly reduce the number of data points that need to be labelled and the number of times the model has to be retrained, potentially saving considerable costs and filling an important gap in practical Deep Bayesian Active Learning.
8 Acknowledgements
The authors want to thank Binxin (Robin) Ru for helpful references to submodularity and the appropriate proofs. We would also like to thank the rest of OATML for their feedback at several stages of the project. AK is supported by the UK EPSRC CDT in Autonomous Intelligent Machines and Systems (grant reference EP/L015897/1). JvA is grateful for funding by the EPSRC (grant reference EP/N509711/1) and GoogleDeepMind.
8.1 Author contributions
AK derived the original estimator, proved submodularity and bounds, implemented BatchBALD efficiently, and ran the experiments. JvA developed the narrative and experimental design, advised on debugging, structured the paper into its current form, and pushed it forward at difficult times. JvA and AK wrote the paper jointly.
References
 Adomavicius and Tuzhilin [2005] Gediminas Adomavicius and Alexander Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge & Data Engineering, 2005.
 Alvi et al. [2019] Ahsan S Alvi, Binxin Ru, Jan Calliess, Stephen J Roberts, and Michael A Osborne. Asynchronous batch Bayesian optimisation with improved local penalisation. arXiv preprint arXiv:1901.10452, 2019.

 Blundell et al. [2015] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 1613–1622, 2015.
 Calinon et al. [2007] Sylvain Calinon, Florent Guenter, and Aude Billard. On learning, representing, and generalizing a task in a humanoid robot. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 37(2):286–298, 2007.
 Cohen et al. [2017] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. Emnist: Extending mnist to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 2921–2926. IEEE, 2017.

 Cohn et al. [1996] David A Cohn, Zoubin Ghahramani, and Michael I Jordan. Active learning with statistical models. Journal of artificial intelligence research, 4:129–145, 1996.
 Frankle and Carbin [2019] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019.
 Freeman [1965] Linton C Freeman. Elementary applied statistics: for students in behavioral science. John Wiley & Sons, 1965.
 Gal and Ghahramani [2016] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.
 Gal et al. [2017] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1183–1192. JMLR.org, 2017.
 González et al. [2016] Javier González, Zhenwen Dai, Philipp Hennig, and Neil Lawrence. Batch Bayesian optimization via local penalization. In Artificial Intelligence and Statistics, pages 648–657, 2016.

 He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 Hernández-Lobato et al. [2016] Daniel Hernández-Lobato, Jose Hernandez-Lobato, Amar Shah, and Ryan Adams. Predictive entropy search for multi-objective Bayesian optimization. In International Conference on Machine Learning, pages 1492–1501, 2016.
 Hoi et al. [2006] Steven CH Hoi, Rong Jin, Jianke Zhu, and Michael R Lyu. Batch mode active learning and its application to medical image classification. In Proceedings of the 23rd international conference on Machine learning, pages 417–424. ACM, 2006.
 Houlsby et al. [2011] Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011.
 Janz et al. [2017] David Janz, Jos van der Westhuizen, and José Miguel Hernández-Lobato. Actively learning what makes a discrete sequence valid. arXiv preprint arXiv:1708.04465, 2017.
 Kendall et al. [2015] Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680, 2015.
 Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma et al. [2014] Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in neural information processing systems, pages 3581–3589, 2014.
 Krause et al. [2008] Andreas Krause, Ajit Singh, and Carlos Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 9(Feb):235–284, 2008.
 Nemhauser et al. [1978] George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions—i. Mathematical programming, 14(1):265–294, 1978.
 Rasmus et al. [2015] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in neural information processing systems, pages 3546–3554, 2015.

 Sener and Savarese [2018] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.
 Shen et al. [2018] Yanyao Shen, Hyokun Yun, Zachary C. Lipton, Yakov Kronrod, and Animashree Anandkumar. Deep active learning for named entity recognition. In International Conference on Learning Representations, 2018.
 Siddhant and Lipton [2018] Aditya Siddhant and Zachary C Lipton. Deep Bayesian active learning for natural language processing: Results of a large-scale empirical study. arXiv preprint arXiv:1808.05697, 2018.
 Simonyan and Zisserman [2015] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
 Snoek et al. [2012] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
 Tong [2001] Simon Tong. Active learning: theory and applications, volume 1. Stanford University USA, 2001.
 Wang et al. [2017] Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang Lin. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 27(12):2591–2600, 2017.
 Yeung [1991] Raymond W Yeung. A new outlook on shannon’s information measures. IEEE transactions on information theory, 37(3):466–474, 1991.
Appendix A Proof of submodularity
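Nemhauser et al. [1978] show that if a function is submodular, then a greedy algorithm like algorithm 1 is $1 - 1/e$-approximate. Here, we show that $a_{\text{BatchBALD}}$ is submodular.
We will show that $a_{\text{BatchBALD}}$ satisfies the following equivalent definition of submodularity:
Definition A.1.
A function $f$ defined on subsets of $\Omega$ is called submodular if for every set $A \subset \Omega$ and two non-identical points $y_1, y_2 \in \Omega \setminus A$:
$$f(A \cup \{y_1\}) + f(A \cup \{y_2\}) \ge f(A \cup \{y_1, y_2\}) + f(A) \tag{14}$$
Submodularity expresses that there are "diminishing returns" for adding additional points to $A$.
Lemma A.2.
$a_{\text{BatchBALD}}(A, p(\omega)) := \mathbb{I}(A; \omega)$ is submodular for $A \subset \mathcal{D}_{\text{pool}}$.
Proof.
Let $y_1, y_2 \in \mathcal{D}_{\text{pool}}$, $y_1 \ne y_2$. We start by substituting the definition of $a_{\text{BatchBALD}}$ into (14) and subtracting $\mathbb{I}(A; \omega)$ twice on both sides, using that $\mathbb{I}(A \cup B; \omega) - \mathbb{I}(B; \omega) = \mathbb{I}(A; \omega \mid B)$:
$$\mathbb{I}(A \cup \{y_1\}; \omega) + \mathbb{I}(A \cup \{y_2\}; \omega) \ge \mathbb{I}(A \cup \{y_1, y_2\}; \omega) + \mathbb{I}(A; \omega) \tag{15}$$
$$\Leftrightarrow \quad \mathbb{I}(y_1; \omega \mid A) + \mathbb{I}(y_2; \omega \mid A) \ge \mathbb{I}(y_1, y_2; \omega \mid A) \tag{16}$$
We rewrite the left-hand side using the definition of the mutual information and reorder:
$$\mathbb{I}(y_1; \omega \mid A) + \mathbb{I}(y_2; \omega \mid A) \tag{17}$$
$$= \mathbb{H}(y_1 \mid A) + \mathbb{H}(y_2 \mid A) - \big(\mathbb{H}(y_1 \mid A, \omega) + \mathbb{H}(y_2 \mid A, \omega)\big) \tag{18}$$
$$\ge \mathbb{H}(y_1, y_2 \mid A) - \mathbb{H}(y_1, y_2 \mid A, \omega) \tag{19}$$
$$= \mathbb{I}(y_1, y_2; \omega \mid A) \tag{20}$$
where we have used that entropies are subadditive in general and additive given $y_1 \perp y_2 \mid \omega$. ∎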
Appendix B Connection between BatchBALD and BALD
In the following section, we show that BALD approximates BatchBALD and that BatchBALD approximates BALD with acquisition size 1. The BALD score is an upper bound of the BatchBALD score for any candidate batch. At the same time, BatchBALD can be seen as performing BALD with acquisition size 1 during each step of its greedy algorithm in an idealised setting.
B.1 BALD as an approximation of BatchBALD
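Using the subadditivity of information entropy and the independence of the $y_i$ given $\omega$, we show that BALD is an approximation of BatchBALD and is always an upper bound on the respective BatchBALD score (conditioning dropped as in section 3.3):
$$a_{\text{BatchBALD}}\big(\{x_1, \dots, x_b\}, p(\omega)\big) = \mathbb{H}(y_1, \dots, y_b) - \mathbb{E}_{p(\omega)}\big[\mathbb{H}(y_1, \dots, y_b \mid \omega)\big] \tag{21}$$
$$\le \sum_{i=1}^{b} \mathbb{H}(y_i) - \sum_{i=1}^{b} \mathbb{E}_{p(\omega)}\big[\mathbb{H}(y_i \mid \omega)\big] \tag{22}$$
$$= \sum_{i=1}^{b} \mathbb{I}(y_i; \omega) \tag{23}$$
$$= a_{\text{BALD}}\big(\{x_1, \dots, x_b\}, p(\omega)\big) \tag{24}$$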
B.2 BatchBALD as an approximation of BALD with acquisition size 1
To see why BALD with acquisition size 1 can be seen as an upper bound for BatchBALD performance in an idealised setting, we reformulate line 1 in algorithm 1 on page 1.
Instead of the original term , we can equivalently maximise  
(26)  
as the right term is constant for all within the inner loop, which, in turn, is equivalent to  
(28)  
(29)  
once we expand . This means that, at each step of the inner loop, our greedy algorithm is maximising the mutual information of the individual available data points with the model parameters conditioned on all the additional data points that have already been picked for acquisition and the existing training set. Finally, assuming training our model captures all available information,  
(30)  
(31) 
where are the actual labels of . The mutual information decreases as becomes more concentrated as we expand its training set, and thus the overlap of and will become smaller (in an informationmeasuretheoretical sense).
This shows that every step of the inner loop in our algorithm is at most as good as retraining our model on the new training set and picking using with acquisition size 1.
Relevance for the active training loop. We see that the active training loop as a whole is computing a greedy approximation of the mutual information of all acquired data points over all acquisitions with the model parameters.
Appendix C Sampling of configurations
We are using the same notation as in section 3.3. We factor to avoid recomputations and rewrite as:
(32)  
(33)  
(34) 
To be flexible in the way we sample , we perform importance sampling of using , and, assuming we also have samples from , we can approximate:
(35)  
(36)  
(37)  
(38) 
where we store in a matrix of shape and in a matrix of shape and is a matrix of s. Equation (38) allows us to cache inside the inner loop of algorithm 1 and use batch matrix multiplication for efficient computation.