## 1 Introduction

Machine learning models are widely used for a vast array of real-world problems. They have been applied successfully in a variety of areas including biology (ching2018opportunities), chemistry (sanchez2018inverse), physics (guest2018deep), and materials engineering (aspuru2018materials). Key to the success of modern machine learning methods is access to high-quality data for training the model. However, such data can be expensive to collect for many problems. Active learning (settles2009active) is a popular methodology for intelligently selecting the fewest new data points to be labeled without sacrificing model accuracy. The usual setting is pool-based active learning, where one has access to a large unlabeled dataset and uses active learning to iteratively select new points from it to label. Our goal in this paper is to develop an active learning acquisition function that selects points to maximize the eventual test accuracy, which is also one of the most popular criteria used to evaluate an acquisition function.

In active learning, an acquisition function is used to select which new points to label. A large number of acquisition functions have been developed over the years, mostly for classification (settles2009active). Acquisition functions use model predictions or point locations (in input feature or learned representation space) to decide which points would be most helpful to label to improve model accuracy. The labels of those points are then queried and added to the training set. A natural choice of acquisition function is to acquire labels for points with the highest uncertainty or points closest to the decision boundary. Taking a Bayesian point of view, several acquisition functions select points that give the most knowledge about a model’s parameters, where knowledge is defined as the statistical dependency between the parameters of the model and the predictions for the selected points. Mutual information (MI) is the usual choice for the dependency, though other choices are possible. The focus for such functions has been the acquisition of one point at a time. However, each round of label acquisition and retraining of the ML model can be expensive, particularly in the case of deep neural networks, so several papers in the past few years acquire points in batches. To ensure that a batch is diverse, these methods either measure the MI of the entire batch jointly with respect to the model’s parameters or explicitly encourage batch diversity in the input or learned representation space.

Another intuitive strategy is to select points that we expect would provide substantial information about the labels of the rest of the unlabeled set, thus reducing model uncertainty. We show that the popular strategy of acquiring labels for points that maximize the mutual information with respect to the model parameters does not always minimize the uncertainty of the model’s predictions averaged over the still unlabeled points post acquisition. This suboptimal uncertainty can negatively affect test accuracy. Motivated by this observation, we propose acquiring a batch of points such that the model’s predictions on the batch have as high a statistical dependency as possible with the model’s predictions on the entire unlabeled set. Thus we want a batch that condenses the most information about the model’s predictions on the unlabeled set. We call our method Information Condensing Active Learning (ICAL). Naively searching over all possible batches to find the optimal one would take an exponential amount of time. We develop a greedy algorithm to select the batch of points efficiently. Similar greedy approaches have also been explored in the context of feature selection (da2015global; blanchet2008forward).

A key desideratum for our acquisition function is to be model agnostic. This is partly because the model distribution can be very heterogeneous. For example, ensembles, which are often used as a model distribution, can consist of just decision trees in a random forest or of different architectures for a neural network. This means we cannot assume any closed form for the model’s predictive distribution, and have to resort to Monte Carlo sampling of the predictions from the model to estimate the dependency between the model’s predictions on the query batch and the unlabeled set. Mutual information, however, is known to be hard to approximate using just samples (song2019understanding). Thus, to scale the method to larger batch sizes, we use the Hilbert-Schmidt Independence Criterion (HSIC), one of the most powerful extant statistical dependency measures for high-dimensional settings. Another advantage of HSIC is that it is differentiable, which, as we will discuss later in the text, can allow applications of the acquisition function to areas where MI would be difficult to make work.

To summarize, we introduce Information Condensing Active Learning (ICAL), which maximizes the amount of information gained with respect to the model’s predictions on the unlabeled set of points. ICAL is a batch mode acquisition function that is model agnostic and can be applied to both classification and regression tasks. We then develop an algorithm that can scale ICAL to large batch sizes when using HSIC as the dependency measure between random variables.

## 2 Related work

A review of work on acquisition functions for active learning prior to the recent focus on deep learning is given by settles2009active. The BALD (Bayesian Active Learning by Disagreement) acquisition function (houlsby2011bayesian) chooses the query point that has the highest mutual information with the model parameters. This turns out to be the point for which individual models sampled from the model distribution are confident in their predictions, but the overall predictive distribution has high entropy. In other words, it is the point on which the models are individually confident but disagree with each other the most.
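The characterization of BALD above can be made concrete with a small sketch. It assumes the usual decomposition of the mutual information into the entropy of the averaged prediction minus the average entropy of the individual predictions; the function names `entropy` and `bald_score` and the toy inputs are illustrative and not taken from the cited work.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def bald_score(sampled_probs):
    """Mutual information between the label and the model for one input.

    sampled_probs[i][c] is the probability of class c under the i-th model
    sampled from the model distribution (for example one MC dropout pass).
    """
    n_models = len(sampled_probs)
    n_classes = len(sampled_probs[0])
    mean_probs = [sum(p[c] for p in sampled_probs) / n_models
                  for c in range(n_classes)]
    predictive_entropy = entropy(mean_probs)            # entropy of the averaged prediction
    expected_entropy = sum(entropy(p) for p in sampled_probs) / n_models
    return predictive_entropy - expected_entropy

# Two models that are each certain but disagree: high score.
confident_but_disagreeing = [[1.0, 0.0], [0.0, 1.0]]
# Two models that agree on being unsure: zero score.
uncertain_but_agreeing = [[0.5, 0.5], [0.5, 0.5]]

score_disagree = bald_score(confident_but_disagreeing)
score_agree = bald_score(uncertain_but_agreeing)
```

Both toy inputs have the same predictive entropy, so a pure max-entropy criterion could not tell them apart, whereas the score above only rewards the case where the models disagree.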

guo2008discriminative, building on guo2007optimistic, formulate the problem as an integer program in which a batch is selected such that the post-acquisition model is highly confident on the training set and has low uncertainty on the unlabeled set. While the latter aspect is related to what we do, they need to retrain their model for every candidate batch they search over in the course of trying to find the optimal batch. As the total number of possible batches is exponential in the size of the unlabeled set, this can get too computationally expensive for neural networks, limiting the applicability of this approach. Thus, as far as we know, guo2008discriminative has only been applied to logistic regression. BMDR (wang2015querying) queries points that are as close to the classifier decision boundary as possible while still being representative of the overall sample distribution. The representativeness is measured using the maximum mean discrepancy (MMD) (gretton2012kernel) of the input features between the query batch and the set of all points, with a lower MMD indicating a more representative query batch. However, this approach is limited to classification problems as it needs a decision boundary to exist. BMAL (hoi2006batch) selects a batch such that the Fisher information matrices for the total unlabeled set and the selected batch are as close as possible. The Fisher information matrix is, however, quadratic in the number of parameters and thus infeasible to compute for modern deep neural networks. FASS (Filtered Active Submodular Selection) (wei2015submodularity) picks the most uncertain points and then selects a subset of those points that are as similar as possible to the whole candidate batch, which favors points that can represent the diversity of the initial set of most uncertain points.

Recently, active learning methods have been extended to the deep learning setting. gal2017deep adapts BALD (houlsby2011bayesian) to the deep learning setting by using Monte Carlo Dropout (gal2016dropout) to do inference for their Bayesian neural network. BatchBALD (kirsch2019batchbald) extends BALD to the batch setting for neural networks. pinsler2019bayesian adapt the Bayesian Coreset (campbell2018bayesian) approach for active learning, though their approach requires a batch size that changes for every acquisition. As the neural network decision boundary is intractable, DeepFool (ducoffe2018adversarial) uses the concept of adversarial examples (goodfellow2014explaining) to find points close to the decision boundary. However, this approach is again limited to classification tasks. sener2017active frame the problem as a core-set selection problem: they try to select points that $\delta$-cover the entire dataset, formulating this as a set covering problem that they solve using an integer program. FF-Comp (geifman2017deep) also frames the problem as a coreset problem. It builds a batch by repeatedly selecting the point that is farthest from its closest point among those already in the batch. DAL (gissin2019discriminative) trains a classifier after every acquisition to try to distinguish between the labeled and unlabeled sets of examples. It then selects the points the classifier is most confident are unlabeled, based on the idea that those are the points least like the training points and thus labeling them should be informative. Finally, BADGE (ash2019deep) samples points that are high magnitude and diverse in a hallucinated gradient space with respect to the last layer of a neural network. All of FF-Comp, DAL, sener2017active, and BADGE, however, operate on the learned representation space, as that is the only way these methods incorporate feedback from the training labels into the acquisition function. They are thus not model-agnostic, as they cannot be extended to model distributions where it is difficult to define a common representation space (as in a random forest or ensembles with heterogeneous architectures).

There is also extensive prior work on exploiting Gaussian Processes (GPs) for Active Learning (houlsby2011bayesian; krause2008near). However GPs are hard to scale especially for modern image datasets.

## 3 Background

We first define the statistical quantities we rely upon and then describe the acquisition functions we will be using for comparison as well as the baseline.

### 3.1 Statistical background

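The entropy of a distribution is defined as $H(Y) = -\sum_{c} p_c \log p_c$, where $p_c$ is the probability of the $c^{th}$ outcome. Mutual information (MI) between two random variables is defined as $I[X; Y] = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$, where $p(x, y)$ is the joint probability of $x, y$. Note that $I[X; Y] = H(Y) - H(Y \mid X)$.

A divergence $D(P \,\|\, Q)$ between two distributions is a measure of the discrepancy or difference between the two distributions $P, Q$. A key property of a divergence is that it is 0 if and only if $P, Q$ are the same distribution. In this paper, we will be using the KL divergence and the MMD, which are respectively defined as

$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}, \qquad D_{MMD}(P \,\|\, Q) = \left\lVert \mu_k(P) - \mu_k(Q) \right\rVert_{\mathcal{H}_k},$$

where $k$ is a kernel in the Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}_k$ and $\mu_k$ is the mean embedding of the distribution into $\mathcal{H}_k$ as per the kernel $k$. We can then use the notion of divergence to define the dependency between a set of random variables $X_{1:n}$ as follows

$$d^{D}(X_{1:n}) = D\left(P_{1:n} \,\Big\|\, \otimes_{i=1}^{n} P_i\right),$$

where $P_{1:n}$ is the joint distribution of $X_{1:n}$, and $P_i$ the marginal of $X_i$, with $\otimes_{i} P_i$ being the product of marginals. For $D = D_{KL}$ the dependency is exactly MI as defined above. For $D = D_{MMD}$ the dependency is the Hilbert-Schmidt Independence Criterion (HSIC), which we discuss next.

### 3.2 Hilbert-Schmidt Independence Criterion (HSIC)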

Suppose we have two (possibly multivariate) distributions and we want to measure the dependence between them. A well known way to measure it is using distance covariance which intuitively, measures the covariance between the distances of pairs of samples from the joint distribution and the product of marginal distributions (szekely2007measuring). HSIC can simply be thought of as distance covariance except in a kernel space (sejdinovic2013equivalence).

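More formally, if $X, Y$ are drawn from the joint distribution $P_{XY}$, then their HSIC is defined as

$$\mathrm{HSIC}(P_{XY}) = \mathbb{E}_{x,y,x',y'}\left[k(x,x')\,l(y,y')\right] + \mathbb{E}_{x,x'}\left[k(x,x')\right]\mathbb{E}_{y,y'}\left[l(y,y')\right] - 2\,\mathbb{E}_{x,y}\left[\mathbb{E}_{x'}\left[k(x,x')\right]\mathbb{E}_{y'}\left[l(y,y')\right]\right],$$

where $(x, y)$ and $(x', y')$ are independent pairs drawn from $P_{XY}$. Note that $\mathrm{HSIC}(P_{XY}) = 0$ if and only if $P_{XY} = P_X P_Y$, that is, if $X, Y$ are independent, for characteristic kernels $k$ and $l$.

For the case where we are measuring the joint dependence between $d$ variables, we can use the dHSIC statistic (sejdinovic2013kernel; pfister2018kernel). The computational complexity of dHSIC is bounded by the time taken to compute the kernel matrices, which is $O(m^2 d)$, where $m$ is the number of samples and $d$ the number of random variables. We use $\widehat{\mathrm{dHSIC}}$ to denote the empirical estimator of the dHSIC statistic.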
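As a rough illustration of how such a statistic is estimated from samples, the sketch below computes the standard biased two-variable HSIC estimator, $\frac{1}{m^2}\mathrm{tr}(KHLH)$ with centering matrix $H = I - \frac{1}{m}\mathbf{1}\mathbf{1}^\top$. It is not the dHSIC estimator itself; the Gaussian kernel, its bandwidth, and all function names are assumptions made for the example.

```python
import math

def gaussian_kernel_matrix(values, bandwidth=1.0):
    """Kernel matrix K[i][j] = exp(-(v_i - v_j)^2 / (2 * bandwidth^2))."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2)) for b in values]
            for a in values]

def center(K):
    """Double-center a symmetric kernel matrix, i.e. compute H K H."""
    m = len(K)
    row_means = [sum(row) / m for row in K]
    total_mean = sum(row_means) / m
    return [[K[i][j] - row_means[i] - row_means[j] + total_mean
             for j in range(m)] for i in range(m)]

def hsic_biased(K, L):
    """Biased empirical HSIC, (1 / m^2) * trace(K H L H), for symmetric K and L."""
    m = len(K)
    Kc = center(K)
    # trace(H K H L) equals the elementwise sum of (H K H) * L for symmetric matrices.
    return sum(Kc[i][j] * L[i][j] for i in range(m) for j in range(m)) / (m * m)

xs = [i / 10.0 for i in range(20)]
K_x = gaussian_kernel_matrix(xs)

# y is a deterministic function of x: the statistic is clearly positive.
hsic_dependent = hsic_biased(K_x, gaussian_kernel_matrix([x * x for x in xs]))
# y is constant and hence independent of x: the statistic vanishes.
hsic_constant = hsic_biased(K_x, gaussian_kernel_matrix([1.0] * len(xs)))
```

The estimator only touches the two $m \times m$ kernel matrices, which is why its cost is dominated by building them.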

### 3.3 Baseline Acquisition Functions

We give a brief overview of the acquisition functions we compare our method against in this paper. For simplicity we only consider the classification context, though the extension to a regression context is straightforward. Let the batch to acquire be denoted by $B$ with $b = |B|$, and let $m$ be the number of Monte Carlo dropout samples. Given a model distribution $\mathcal{M}$, training data $D_{train}$, unlabeled data $D_U$, input space $\mathcal{X}$, set of labels $\mathcal{Y}$, and an acquisition function $\alpha(x, \mathcal{M})$, we decide which point to query next via $x^{*} = \arg\max_{x \in D_U} \alpha(x, \mathcal{M})$.

#### Max entropy

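Max entropy selects the points that maximize the predictive entropy, $\alpha_{MaxEnt}(x, \mathcal{M}) = -\sum_{c} p(y = c \mid x, D_{train}) \log p(y = c \mid x, D_{train})$, where $p(y = c \mid x, D_{train})$ is the predictive probability of class $c$ for input $x$ given the training data $D_{train}$.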

#### BatchBALD

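BatchBALD (kirsch2019batchbald) tries to find a batch of points that has the highest mutual information with the model parameters. BALD is the non-batched version of BatchBALD. Formally, a batch is scored by the mutual information between the joint predictions for its points and the model parameters $\omega$: $\alpha_{BatchBALD}(\{x_1, \ldots, x_b\}, \mathcal{M}) = I[y_1, \ldots, y_b; \omega \mid x_1, \ldots, x_b, D_{train}]$.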

#### Filtered active submodular selection (FASS)

FASS (wei2015submodularity) first samples the most uncertain points to form a candidate set $C$ and then subselects points that are as representative of $C$ as possible. For the measure of uncertainty, FASS uses entropy. To measure the representativeness of a batch $B$ with respect to $C$, FASS tries to choose $B$ to maximize the following function

$$f(B) = \sum_{y \in \mathcal{Y}} \sum_{i \in V^{y}} \max_{s \in B \cap V^{y}} w(i, s).$$

Here $V^{y}$ is the set of points in $C$ with predicted label $y$, and $w(i, s)$ is the similarity function between points indexed by $i, s$, where $w(i, s) = d - \lVert x_i - x_s \rVert$ and $d$ is the maximum distance between two points. The idea here is that if a point in $B$ already exists that is close to some point $i$, then $f$ will favor adding points to the batch that are close to points other than $i$, thus increasing the batch diversity. Note that FASS is equivalent to Max Entropy if the candidate set is the same size as the batch.
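The subset-selection step can be sketched as greedy maximization of a facility-location style objective. The sketch assumes that the objective sums, over every candidate, its similarity to the closest batch point sharing its predicted label, with similarity equal to the maximum pairwise distance minus the distance; the one-dimensional features, the function names, and the use of a plain greedy loop are illustrative choices, and the initial uncertainty filter is omitted.

```python
def facility_location(batch, features, predicted):
    """Coverage of all candidates by the batch, within predicted-label groups."""
    max_dist = max(abs(a - b) for a in features for b in features)
    total = 0.0
    for i in range(len(features)):
        # Batch members that share candidate i's predicted label.
        same_label = [s for s in batch if predicted[s] == predicted[i]]
        if same_label:
            total += max(max_dist - abs(features[i] - features[s])
                         for s in same_label)
    return total

def greedy_select(features, predicted, batch_size):
    """Greedily add the candidate that increases the objective the most."""
    batch = []
    for _ in range(batch_size):
        remaining = [j for j in range(len(features)) if j not in batch]
        best = max(remaining,
                   key=lambda j: facility_location(batch + [j], features, predicted))
        batch.append(best)
    return batch

# Two well-separated groups of candidates with different predicted labels.
features = [0.0, 0.1, 0.3, 5.0, 5.2, 5.6]
predicted = [0, 0, 0, 1, 1, 1]
selected = greedy_select(features, predicted, batch_size=2)  # -> [1, 4]
```

After the first group is covered by its most central point, adding another point from the same group barely changes the objective, so the second pick comes from the other group.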

#### Bayesian Coresets

pinsler2019bayesian try to build a batch such that the log posterior after acquiring that batch best approximates the complete data log posterior (i.e. the log posterior after acquiring the entire pool set). Their approach closely follows the general Bayesian Coreset (campbell2018bayesian) approach, which constructs a weighted subset of data that approximates the full dataset. Crucially, pinsler2019bayesian assume that the posterior predictive distribution of a point is independent of the corresponding distribution of another point, an assumption we do not make. We show in the next section why avoiding such an assumption lets us more effectively minimize the error with respect to the test distribution, compared with just maximizing the information gain for the model posterior. As pinsler2019bayesian require a variable batch size whereas all other methods (including ours) use a fixed batch size, for fairness of comparison, if the batch for this approach is smaller than the batch size being used, we fill the rest of the batch with random points. In practice, we only observe this being necessary for CIFAR.

#### Random

The points are selected uniformly at random from the unlabeled pool. Thus the acquisition function corresponds to the uniform distribution over the pool.

## 4 Motivation

As mentioned in the Introduction, the intuition behind our method is that we want to acquire points that will give us as much information as possible about the still unlabeled points, thereby increasing the confidence of the model’s predictions. As the example below shows, modern active learning methods that pick the point with the most information about the model parameters could in fact end up increasing the average uncertainty of the predictions on the still unlabeled data. More formally, the point that is optimal for acquisition under such a criterion may not be optimal for decreasing the average entropy of the predictive distributions on the remaining unlabeled points after acquisition. This can pose a problem if our metric is test set accuracy: if the model is well calibrated, then we should expect worse average entropy (uncertainty) to lead to an increase in the number of errors.

#### Example 1

Suppose we have a model distribution with 10 possible models $m_1, \ldots, m_{10}$, each with equal prior probability of being the true model ($p(m_i) = 0.1$). Let the datapoints be $x_1, \ldots, x_n$, with their labels taking 4 possible values. For each datapoint, each model puts probability 1 on a single class, listed in the table below (columns index the models).

| Datapoint | $m_1$ | $m_2$ | $m_3$ | $m_4$ | $m_5$ | $m_6$ | $m_7$ | $m_8$ | $m_9$ | $m_{10}$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $x_1$ | 1 | 2 | 3 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| $x_2, \ldots, x_n$ | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 |

Given that we have no other information about the models, we update the posterior probabilities for the models as follows: if a model $m_i$ outputs label $y$ for a point $x$ but, after acquisition, the label for $x$ is not $y$, then we know that $m_i$ is not the correct model and thus its posterior probability is 0 (so it is eliminated). Otherwise we have no way of distinguishing between the remaining models, so they all have equal posterior probability. Under this scheme, $x_1$ has a higher mutual information with the model parameters than any of $x_2, \ldots, x_n$. However, selecting $x_1$ would decrease the expected posterior entropy of each remaining point only slightly. Acquiring any of $x_2, \ldots, x_n$ instead of $x_1$, however, would decrease that entropy to 0, which would cause a much larger decrease in the expected posterior entropy averaged over all datapoints if $n$ is large enough. The detailed calculations are in the Appendix.

While $x_2, \ldots, x_n$ may not individually contribute much to the entropy of the joint predictive distribution or to the MI with respect to the model parameters compared to $x_1$, collectively they are weighted $n - 1$ times more than $x_1$ when looking at the accuracy. We should thus expect a well-calibrated model to have a higher uncertainty, and thus make many more errors, on $x_2, \ldots, x_n$ if $x_1$ is acquired than if any of $x_2, \ldots, x_n$ is acquired. For instance, in the above example, as $n$ increases, the expected error rate would approach a nonzero value if $x_1$ is acquired, as the errors for $x_2, \ldots, x_n$ are correlated (weighted by 0.7, since 0.3 of the time the label of $x_1$ would also fix what the true model is, reducing the error rate on all of $x_2, \ldots, x_n$ to 0), whereas the rate would approach 0 were any of $x_2, \ldots, x_n$ to be acquired.
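The claims in this example can be checked numerically with a short sketch. It reads the two rows of labels given in the example as the labels that the ten models output for the first datapoint and for every other datapoint respectively, which is an assumption about the layout of the table; entropies are in nats, and all names are illustrative.

```python
import math

# Label output by each of the 10 equally likely models (assumed reading of the table).
X1_LABELS = [1, 2, 3, 4, 4, 4, 4, 4, 4, 4]      # first datapoint
OTHER_LABELS = [1, 1, 1, 1, 1, 1, 1, 1, 1, 2]   # every other datapoint

def label_entropy(labels, surviving):
    """Entropy of a point's label under a uniform posterior over surviving models."""
    counts = {}
    for i in surviving:
        counts[labels[i]] = counts.get(labels[i], 0) + 1
    total = len(surviving)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def expected_posterior_entropy(acquired, target):
    """Expected entropy of the target's label after observing the acquired label."""
    n_models = len(acquired)
    expected = 0.0
    for label in set(acquired):
        # Models that disagree with the observed label are eliminated.
        surviving = [i for i in range(n_models) if acquired[i] == label]
        expected += len(surviving) / n_models * label_entropy(target, surviving)
    return expected

all_models = list(range(10))
# Each model is deterministic, so the MI with the model equals the predictive entropy.
mi_x1 = label_entropy(X1_LABELS, all_models)        # about 0.940
mi_other = label_entropy(OTHER_LABELS, all_models)  # about 0.325
# Entropy left on one of the other points after each possible acquisition.
after_x1 = expected_posterior_entropy(X1_LABELS, OTHER_LABELS)        # about 0.287
after_other = expected_posterior_entropy(OTHER_LABELS, OTHER_LABELS)  # 0
```

Under this reading, the first datapoint has the larger mutual information, yet acquiring it leaves most of the uncertainty on the other points in place, whereas acquiring any one of the other points removes it entirely.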

This motivates our choice of acquisition function as one that selects the set of points whose acquisition would maximize the information gained about the predictive distribution on the unlabeled set. In Figure 1, we show the average posterior entropy of the model’s predictions for our method compared to BatchBALD, BayesCoreset, and Random acquisition. As can be seen from the figure, we are able to reduce the average posterior entropy much more effectively than the other methods. Details of this experiment are in Section 6.2.

## 5 Information Condensing Active Learning (ICAL)

In this section we present our acquisition function. As before, let $D_{train}$ be the training points, $D_U$ the unlabeled points, $y_x$ the random variable denoting the prediction for $x$ by the model trained on $D_{train}$, and $d$ the dependency measure being used. Then

$$\alpha_{ICAL}(\{x_1, \ldots, x_b\}, d) = \frac{1}{|D_U|} \sum_{x' \in D_U} d\left(\{y_{x_1}, \ldots, y_{x_b}\}, y_{x'}\right),$$

that is, we try to find the batch that has the highest average dependency with respect to the unlabeled points’ marginal predictive distributions.

### 5.1 Scaling estimation

As we mentioned in the introduction, we can use MI as the dependency measure, but it is tricky to estimate MI using just samples from the distribution, particularly for high-dimensional or continuous variables. Furthermore, MI estimators are usually not differentiable. Thus if we wanted to apply ICAL to domains where the pool set is continuous and infinite (for example, if we wanted to query gene expression perturbations for a cell), we would run into obstacles. This motivates our choice of HSIC as the dependency measure. In addition to being differentiable, HSIC has better empirical sample complexity for measuring dependency than estimators for MI. Indeed, popular MI estimators have been found to have variance with respect to ground truth MI that increases exponentially with the MI value (song2019understanding). HSIC has also been successfully used in the related context of feature selection via dependency maximization in the past (da2015global; song2012feature). Furthermore, HSIC is the Maximum Mean Discrepancy (MMD) between the joint distribution and the product of marginals. MMD is known to be bounded above in terms of the KL divergence (ramdas2015decreasing), and thus in terms of MI. Thus we use HSIC as the dependency measure for the rest of the paper.

Naively implementing $\alpha_{ICAL}$ would require $O(|D_U|\, m^2 b)$ steps per candidate batch being evaluated, where $b$ is the batch size, $D_U$ is the unlabeled set, and $m$ is the number of samples taken from the model distribution ($O(m^2 b)$ to estimate dHSIC, which we need to do $|D_U|$ times).

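However, recall from Section 3.2 that dHSIC is a function of solely the kernel matrices corresponding to the random variables (in this case the predictions $y_{x_1}, \ldots, y_{x_b}, y_{x'}$). Now one can define the kernel $k^{*} = \frac{1}{|D_U|} \sum_{x' \in D_U} k^{x'}$, where $k^{x'}$ is the kernel on $y_{x'}$. We can then prove the following theorems (proofs are in the Appendix).

###### Theorem 1.

$k^{*}$ is a valid kernel.

###### Theorem 2.

$\frac{1}{|D_U|} \sum_{x' \in D_U} \widehat{\mathrm{dHSIC}}(k^{x_1}, \ldots, k^{x_b}, k^{x'}) = \widehat{\mathrm{dHSIC}}(k^{x_1}, \ldots, k^{x_b}, k^{*})$, where $x_1, \ldots, x_b$ are the points in the candidate batch.

Using this reformulation, we only have to compute $k^{*}$ once per acquisition round, which lowers the computation cost. Estimating dHSIC would still require the number of samples $m$ to increase very rapidly with the batch size $b$ (proportional to the dimension of the joint distribution). To get around this but still maintain batch diversity, we try two strategies.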

For normal ICAL, we average the kernel matrices of points in the candidate batch. We then subsample points from every time a point is added to the batch and only compare the dependency with those. Thus even if two points are highly correlated, and one of them is added to the batch, the other would not necessarily get a similarly high statistic, and other points would get prioritized. We find in practice, that this is sufficient to acquire a diverse batch, as evidenced by Figure 4. This seems to be the case even for very large batches, and has the added benefit of further lowering the computational cost for evaluating a candidate batch to . We use for all our experiments.

We develop another strategy we call ICAL-pointwise which computes the marginal increase in dependence as a result of adding a point to the batch. If a point is highly correlated with elements of the current batch, the marginal increase would be negligible, making the point much less likely to be selected. Figure 2 is a representative figure of the relative performance of ICAL and ICAL-pointwise. The two variants perform very similarly despite ICAL-pointwise’s slight advantage in the early acquisitions. ICAL-pointwise however requires much less time for equivalent performance which we discuss briefly in Section 5.2 and more fully in the Appendix. For ease of presentation, we use ICAL in the Results section and defer the full description and evaluation of ICAL-pointwise to the Appendix.

As there are an exponential number of candidate batches, an exhaustive search to find the optimal batch is infeasible. For ICAL we use a greedy forward selection strategy to build the batch and find that it performs well empirically. As the over all of has to be computed every time a new point is being selected for the batch, and we have to perform this operation for each point that is added to the batch, this gives a computation cost of .
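A heavily simplified sketch of greedy forward selection is given below. It assumes a two-variable biased HSIC estimate between the average of the batch points' kernel matrices and the sum of the pool points' kernel matrices, uses a delta kernel on sampled labels, and omits both the random subsampling of pool points and the minibatch additions discussed in this section. All names and the toy data are hypothetical, so this is an illustration of the idea rather than the algorithm used in the experiments.

```python
def delta_kernel(labels):
    """K[i][j] = 1 if model samples i and j predict the same label, else 0."""
    return [[1.0 if a == b else 0.0 for b in labels] for a in labels]

def center(K):
    """Double-center a symmetric kernel matrix, i.e. compute H K H."""
    m = len(K)
    row_means = [sum(row) / m for row in K]
    total_mean = sum(row_means) / m
    return [[K[i][j] - row_means[i] - row_means[j] + total_mean
             for j in range(m)] for i in range(m)]

def hsic_biased(K, L):
    """Biased empirical HSIC, (1 / m^2) * trace(K H L H)."""
    m = len(K)
    Kc = center(K)
    return sum(Kc[i][j] * L[i][j] for i in range(m) for j in range(m)) / (m * m)

def add(K1, K2):
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(K1, K2)]

def scale(K, c):
    return [[c * a for a in row] for row in K]

def greedy_batch(point_kernels, batch_size):
    """Greedily add the point whose averaged batch kernel scores highest."""
    m = len(point_kernels[0])
    pool_kernel = [[0.0] * m for _ in range(m)]
    for K in point_kernels:           # computed once, then reused for every candidate
        pool_kernel = add(pool_kernel, K)
    batch, batch_sum = [], [[0.0] * m for _ in range(m)]
    for _ in range(batch_size):
        best_index, best_score = None, float("-inf")
        for j, K in enumerate(point_kernels):
            if j in batch:
                continue
            candidate = scale(add(batch_sum, K), 1.0 / (len(batch) + 1))
            score = hsic_biased(candidate, pool_kernel)
            if score > best_score:
                best_index, best_score = j, score
        batch.append(best_index)
        batch_sum = add(batch_sum, point_kernels[best_index])
    return batch

# sampled_labels[p][i]: label predicted for pool point p by the i-th sampled model.
sampled_labels = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 0],   # every model agrees: carries no information about the others
]
kernels = [delta_kernel(labels) for labels in sampled_labels]
batch = greedy_batch(kernels, batch_size=2)

# Scoring against the summed pool kernel equals summing the per-point scores.
pooled = kernels[0]
for K in kernels[1:]:
    pooled = add(pooled, K)
score_pooled = hsic_biased(kernels[0], pooled)
score_summed = sum(hsic_biased(kernels[0], K) for K in kernels)
```

The first pick is a point whose predictions are shared by most of the pool, and the point on which all sampled models agree is never selected. Because the toy pool contains exact duplicates and the subsampling step is omitted, nothing in this simplified criterion discourages picking a duplicate next, which is the behavior the subsampling and pointwise strategies are meant to counteract.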

It is possible that global nonlinear optimization of the batch ICAL criterion would work even better than greedy optimization already does with respect to state-of-the-art methods. Efficient techniques for doing this optimization are not obvious and are beyond the scope of this work. We note, however, that greedy forward selection is a popular technique that has been successfully used in a large variety of contexts (da2015global; blanchet2008forward).

### 5.2 Further scaling to large batch sizes

To scale to large batch sizes, instead of adding points to the batch to be acquired one at a time, we can add points in minibatches of size . While this comes at the cost of possible diversity in the batch, we find that the tradeoff is acceptable for the datasets we experimented with. This gives a final computation cost of . By contrast the corresponding runtime for BatchBALD is where is the number of classes. For all experiments with ICAL, we were able to use without any scaling difficulties. For ICAL-pointwise, we used only for CIFAR-10 and CIFAR-100. As alluded to previously, ICAL-pointwise can accommodate much larger compared to ICAL before its performance degrades, allowing for much greater scaling. We evaluate this aspect of ICAL-pointwise in the Appendix.

The final algorithm is given in Algorithm 1.

## 6 Results

We demonstrate the effectiveness of ICAL using standard image datasets including MNIST (lecun1998gradient), Repeated MNIST (kirsch2019batchbald), Extended MNIST (EMNIST) (cohen2017emnist), fashion-MNIST, and CIFAR-10 (krizhevsky2009learning). We compare ICAL with three state of the art methods for batched active learning acquisition – BatchBALD, FASS, and BayesCoreset. We also compare against BALD and Max Entropy (MaxEnt) which are not explicitly designed for batched selection, as well as against a Random acquisition baseline. ICAL consistently outperforms BatchBALD, FASS, and BayesCoreset on accuracy and negative log likelihood (NLL).

Throughout our experiments, for each dataset we hold out a fixed test set for evaluating model performance after training and a fixed validation set for training purposes. We retrain the model from scratch after each acquisition to avoid correlation between subsequently trained models, and we use early stopping after 3 (6 for ResNet18) consecutive epochs of validation accuracy drop. Following gal2017deep, we use neural networks with MC dropout (gal2016dropout) as a variational approximation for Bayesian neural networks. For the kernel, we simply use a mixture of rational quadratic kernels, which has been used successfully with kernel-based statistical dependency measures in the past, with the mixture length scales of (binkowski2018demystifying). All models are optimized with the Adam optimizer (kingma2014adam) using a learning rate of 0.001 and betas of (0.9, 0.999). The small batch size experiments are repeated 6 times with different seeds and a different initial training set for each run, with a balanced label distribution across all classes. The same set of seeds is used for different methods on the same task. 8 different seeds are used for the large batch size experiments on the CIFAR datasets.

### 6.1 MNIST and Repeated MNIST

We first examine ICAL’s performance on MNIST, which is a standard image dataset of handwritten digits. We further test the scenario where duplicated data points exist (repeated MNIST), as proposed in (kirsch2019batchbald). Each data point in MNIST is replicated three times in repeated-MNIST, and isotropic Gaussian noise with std=0.1 is added after normalizing the image. We use a CNN consisting of two convolutional layers with 32 and 64 5x5 convolution filters, each followed by MC dropout, max-pooling and ReLU. One fully connected layer with 128 hidden units and MC dropout is used after the convolutional layers, and the output softmax layer has dimension 10. All dropout layers use a probability of 0.5, and the architecture achieves over 99% accuracy on full MNIST. We use a validation set of size 1024 for MNIST and 3072 for repeated-MNIST, and a balanced test set of size 10,000 for both datasets. All models are trained for up to 30 epochs for MNIST and up to 40 epochs for repeated-MNIST. We sample an initial training set of size 20 (2 per class) and conduct 30 acquisitions of batch size 10 on both datasets, and we use 50 MC dropout samples to estimate the posterior.

The test accuracy and negative log-likelihood (NLL) are shown in Figure 3. ICAL significantly improves the NLL and outperforms all other baselines on accuracy, with higher margins in the earlier acquisition rounds. The performance is consistent across all runs (the variance is smaller than for the other baselines), and is robust even in the repeated-MNIST setup, where all the other greedy methods show worse performance.

### 6.2 EMNIST

We then extend the task to a more challenging dataset, Extended MNIST (EMNIST), which consists of 47 classes of 28x28 images of both digits and letters. We use the balanced EMNIST split, where each class has 2400 training examples. We use a validation set of size 16384 and a test set of size 18800 (400 per class), and train for up to 40 epochs. We use a CNN consisting of three convolutional layers with 32, 64, and 128 3x3 convolution filters, each followed by MC dropout, 2x2 max-pooling and ReLU. A fully connected layer with 512 hidden units and MC dropout is used after the convolutional layers. We use an initial training set of 47 (1 per class) and make 60 acquisitions of batch size 5. 50 MC dropout samples are used to estimate the posterior.

The results are in Figure 5. We do substantially better in terms of both accuracy and NLL compared to all other methods. A clue as to why our method outperforms on EMNIST can be found in Figure 4. ICAL is able to acquire more diverse and balanced batches, while all other methods have over- or under-represented classes (note that BatchBALD, Random and MaxEnt each totally miss examples from one of the classes). This indicates that our method is much more robust in terms of performance even when the number of classes increases, whereas the alternatives degrade.

### 6.3 Fashion-MNIST

We also examine ICAL’s performance on fashion-MNIST, which consists of 10 classes of 28x28 images of Zalando articles. We use a validation set of size 3072 and a test set of size 10,000 (1000 per class), and train for up to 40 epochs. The network architecture is the same as the one used in the MNIST task. We use an initial training set of 20 (2 per class) and make 30 acquisitions of batch size 10. 100 MC dropout samples are used to estimate the posterior. As shown in Figure 5, we again do significantly better in terms of both accuracy and NLL compared to all other methods. Note that almost all baselines except ICAL were inferior to the random baseline, showing the robustness of our method.

### 6.4 CIFAR

Finally, we test our method on the CIFAR-10 and CIFAR-100 datasets (krizhevsky2009learning) in a large batch size setting. CIFAR-10 consists of 10 classes with 6000 images per class, whereas CIFAR-100 has 100 classes with 600 images per class. We use a validation set of size 1024 and a balanced test set of size 10,000 for both datasets. For CIFAR-10, we start with an initial training set of 10000 examples (1000 per class), while for CIFAR-100, we start with 20000 examples (200 per class). We do 10 acquisitions on CIFAR-10 and 7 acquisitions on CIFAR-100 with a batch size of 3000. We use a ResNet18 with 2 additional fully connected layers with MC dropout, and train for up to 60 epochs with learning rate 0.1 (allowing early stopping). We run with 8 different seeds. The results are in Figure 6. Note that we are unable to compare against BatchBALD for either CIFAR dataset as it runs out of memory.

For CIFAR-10, ICAL dominates all other methods for all acquisitions except two – when the acquired dataset size is 19000 and when it is 28000. ICAL also achieves the highest accuracy at the end of all 10 acquisitions. With CIFAR-100, on all acquisitions ICAL outperforms a majority of the methods. Furthermore, ICAL again finishes with the highest accuracy by a significant margin at the end of the acquisition rounds.

## 7 Conclusion

We develop a novel batch mode active learning acquisition function ICAL that is model agnostic and applicable to both classification and regression tasks. We develop key optimizations that enable us to scale our method to large acquisition batch and unlabeled set sizes. We show that we are robustly able to outperform state of the art methods for batch mode active learning on a variety of image classification tasks in a deep neural network setting. Future work will involve scaling the method to even larger batch sizes possibly using techniques developed in the feature selection context (da2015global). Another interesting avenue for research could be to combine getting the most amount of information for both the model parameters and for the labels for the unlabeled set into a single acquisition function and get the best of both worlds.

## Acknowledgements

The authors would like to thank Dougal Sutherland and Tatsunori Hashimoto for many useful discussions about this project.

## References

## Appendix

### Derivation for Example 1

For , the mutual information between the predicted label and model parameters is:

For ,

After acquiring , assuming the true label for is 1, then we update the posterior over the model parameter such that and for . Then the expected averaged posterior entropy for is:

Similarly, we could compute the case where the true label for is 2-4:

The expectation of the averaged posterior entropy with respect to predicted label for (since we don’t know the true label) is:

### Proof of Theorem 1

The resulting kernel matrix is positive semidefinite (psd) and symmetric, as the sum of psd symmetric matrices is also psd and symmetric.

### Proof of Theorem 2

We show here that

but the extension to the arbitrary sums is straightforward. Using the definition of in (sejdinovic2013kernel),

### ICAL-pointwise

To evaluate the marginal dependency increase if a candidate point is added to batch , we sample a set from the pool set and compute the pairwise of both and with respect to each point in . Let the resulting vectors (each of length ) with the scores be and . Then the marginal dependency increase statistic for point is where is the element of the vector. We then modify the as follows - and use the point with the highest value of as the point to acquire. Note that as we want to get as accurate an estimate of as possible, we ideally want to choose as large a set as possible. In general, we also want to choose to be greater than the number of classes. This makes ICAL-pointwise more memory intensive compared to ICAL. We also tried another criterion for batch selection based on minimal-redundancy-maximal-relevance (peng2005feature), but that had significantly worse performance compared to ICAL and ICAL-pointwise.

In Figure 7, we analyze the performance of ICAL versus ICAL-pointwise when their parameters are set such that the computational cost is about the same. As can be seen, they are broadly similar, with ICAL-pointwise having a slight advantage in earlier acquisitions and ICAL being slightly better in later ones.

We also analyze the relative performance as the mini-batch size changes in Figure 8. In the Figure, is the number of iterations taken to build the entire acquisition batch (note that the actual acquisition happens after the entire batch has been built). ICAL-pointwise requires more computation time than ICAL in small setup, however if time is the major constraint, ICAL-pointwise is to be preferred as its performance degrades more slowly as , the size of the minibatch, increases. As the performance usually peaks at , if one is trying to get the best performance or if memory is a constraint, then ICAL is to be preferred.

### Runtime and memory considerations

BatchBALD runs out of memory on CIFAR-10 and CIFAR-100 and thus we are unable to compare against it for those two datasets. For the MNIST-variant datasets, ICAL takes about a minute for building the batch to acquire (batch sizes of 5 and 10). For CIFAR-10 (batch size 3000), with , the runtime is about 20 minutes but it scales linearly with (Figure 10). Thus it is only 5 minutes for ( ) which is already sufficient to give comparable performance to (Figure 9). For CIFAR-100 (batch size 3000), the performance does degrade with high but as we mentioned previously, ICAL-pointwise holds up a lot better in terms of performance with high (Figure 8) and thus if time is a strong consideration, that variant should be used instead.