1 Introduction
Information theory has provided the foundations for many recent advancements in deep learning: for example, information bottlenecks inform objectives both in supervised and unsupervised learning of high-dimensional data. However, the employed notation can be ambiguous for more complex expressions found in applied settings, as information theory is used by researchers from different backgrounds (statistics, computer science, and information engineering, to name a few). For example, $H(p, q)$ is sometimes used to denote the cross-entropy between $p$ and $q$, which conflicts with the common notation $H(X, Y)$ for the joint entropy of $X$ and $Y$, or it is not clarified that $H[X \mid Y]$, the conditional entropy of $X$ given $Y$, is an expectation over $Y$. A good notation, on the other hand, can convey valuable intuitions and concisely express new ideas, which we consider the first step to new advances.

In addition to disambiguating the commonly used notation, we propose that an extension of information-theoretic quantities to relations between events (outcomes) and random variables can be of great use in machine learning. The mutual information $I[X; Y]$ is only defined for random variables $X$ and $Y$, or as point-wise information $I[x; y]$ (DeWeese and Meister, 1999) for two outcomes $x$ and $y$. A natural extension is to consider the mutual information $I[X; y]$ between an outcome $y$ and a random variable $X$. Why is this not more commonly known outside the fields of neuroscience and cognitive sciences? We find a consistent extension for such cases which combines two previously separate quantities. As an application, we examine the core-set problem, which consists of selecting the most informative samples of a training set given the labels, and provide new results.
Concretely, we examine BALD (Bayesian Active Learning by Disagreement), an acquisition function in Bayesian active learning, and extend it to the core-set problem. In active learning, we have access to a huge reservoir of unlabelled data in a pool set and iteratively select samples from this pool set to be labeled by an oracle (e.g. human experts) to increase the model performance as quickly as possible. Acquisition functions are used to score all pool samples and the highest scorer is acquired. The goal of an acquisition function is to score the most informative samples the highest. BALD (Gal et al., 2017; Houlsby et al., 2011) is such an acquisition function. It maximizes the expected information gain of the model parameters given the prediction variable for a candidate sample from the pool set. It is immediately related to the concept of reduction in posterior uncertainty known from Bayesian optimal experimental design (Lindley, 1956).
The core-set problem, on the other hand, consists of identifying the most informative samples given the labels, the core set, such that a model trained on this core set performs as well as a model trained on the whole dataset.
In this paper, we introduce Core-Set by Disagreement (CSD), which, using our notation, maximizes $I[\Omega; y \mid x]$, the information gain of the model parameters $\Omega$ given the true label $y$ of a sample $x$ in the dataset. Figure 1 shows that it strongly outperforms BALD on MNIST.
2 Information Theory Notation
For a general introduction to information theory, we refer to Cover and Thomas (2006) and Yeung (2008). In the following, we introduce a disambiguated notation. We start with notation that is explicit about the probability distribution.

Definition 2.1.
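Let Shannon’s information content $h(\cdot)$, cross-entropy $H(\cdot \,\|\, \cdot)$, entropy $H(\cdot)$, and KL divergence $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ be defined for a probability distribution $p$ and non-negative function $q$ as:

$$
\begin{aligned}
h(x) &:= -\ln x, \\
H(p(X) \,\|\, q(X)) &:= \mathbb{E}_{p(x)}\, h(q(x)), \\
H(p(X)) &:= H(p(X) \,\|\, p(X)) = \mathbb{E}_{p(x)}\, h(p(x)), \\
D_{\mathrm{KL}}(p(X) \,\|\, q(X)) &:= H(p(X) \,\|\, q(X)) - H(p(X)),
\end{aligned}
$$

where we use the random variable $X$ (upper-case) to make clear which random variable the expectation is over.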
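As a small illustration of Definition 2.1, the following sketch evaluates the four quantities for discrete distributions given as lists of probabilities. It assumes the standard forms (information content $-\ln$, cross-entropy as an expectation of the information content, KL divergence as cross-entropy minus entropy); the function names and the example distributions `p` and `q` are hypothetical and chosen only for this example.

```python
import math

def information_content(p):
    """Shannon's information content h(p) = -ln p, in nats."""
    return -math.log(p)

def cross_entropy(p, q):
    """H(p || q) = E_{p(x)} h(q(x)); outcomes with p(x) = 0 contribute nothing.

    Assumes q(x) > 0 wherever p(x) > 0 (otherwise math.log raises ValueError).
    """
    return sum(px * information_content(qx) for px, qx in zip(p, q) if px > 0)

def entropy(p):
    """H(p) = H(p || p)."""
    return cross_entropy(p, p)

def kl_divergence(p, q):
    """D_KL(p || q) = H(p || q) - H(p)."""
    return cross_entropy(p, q) - entropy(p)

# Hypothetical example distributions over three outcomes.
p = [0.5, 0.25, 0.25]
q = [0.8, 0.1, 0.1]

print("H(p)         =", entropy(p))
print("H(p || q)    =", cross_entropy(p, q))
print("D_KL(p || q) =", kl_divergence(p, q))
```

For normalized `q`, the cross-entropy is never smaller than the entropy, so the KL divergence is non-negative, which is the kind of bound the notation is meant to make visible.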
Moreover, when focusing on variational aspects of the cross-entropy and the true probability distribution is understood, we use the notation for (Kirsch et al., 2020; Xu et al., 2020):
Definition 2.2.
When the true probability distribution is understood from context, we will use the following short-hands:
When we have a family of variational distribution that is parameterized by , we will simplify further to when the context is clear.
The main motivation for this notation is the following lemma:
Lemma 2.3.
We have the following lower-bound for the cross-entropy and KL, with :
Proof.
This follows directly from the non-negativity of the KL for densities when we substitute . ∎
These definitions are trivially extended to joints of random variables and conditionals by substituting the random variable of the product space or the conditional variable.
We can further canonically extend the definitions to tie random variables to specific outcomes, e.g. : if we mix (untied) random variables and tied random variables (outcomes), we define as an operator that takes an expectation of Shannon’s information content for the given expression over the random variables conditioned on the outcomes. For example, . Similarly, we have . The motivation is that we maintain identities . This leads to the following definition:
Definition 2.4.
Given random variables and and outcome , we define:
where we have shortened to .
The intuition behind all this is that e.g. measures the average length of transmitting and together when unbeknownst to the sender and receiver.
The mutual information and point-wise mutual information (Fano, 1961; Church and Hanks, 1990) are defined as:
Definition 2.5.
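For random variables $X$ and $Y$ and outcomes $x$ and $y$, respectively, the point-wise mutual information $I[x; y]$ and the mutual information $I[X; Y]$ are:

$$
I[x; y] := \ln \frac{p(x, y)}{p(x)\, p(y)}, \qquad I[X; Y] := \mathbb{E}_{p(x, y)}\, I[x; y].
$$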
This is similarly extended to conditional variants and so on. There are two common, sensible quantities we can define when we want to consider the information overlap between a random variable and an outcome: the information gain and the surprise (DeWeese and Meister, 1999; Butts, 2003). The two quantities are usually defined separately in the cognitive sciences and neuroscience (Williams, 2011); however, we can unify them by relaxing the symmetry of the mutual information:
Definition 2.6.
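Given random variables $X$ and $Y$ and outcome $y$ for $Y$, we define the information gain $I[X; y]$ and the surprise $I[y; X]$ as:

$$
\begin{aligned}
I[X; y] &:= H[X] - H[X \mid y] = \mathbb{E}_{p(x)}\left[-\ln p(x)\right] - \mathbb{E}_{p(x \mid y)}\left[-\ln p(x \mid y)\right], \\
I[y; X] &:= H[y] - H[y \mid X] = \mathbb{E}_{p(x \mid y)}\left[\ln \frac{p(y \mid x)}{p(y)}\right] = D_{\mathrm{KL}}(p(X \mid y) \,\|\, p(X)).
\end{aligned}
$$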
This unifying definition is novel to the best of our knowledge. The information gain measures how much the entropy (uncertainty) of reduces when we learn about while the surprise measures how many additional nats we need to transmit when we already know . The subtle difference becomes clearer when we note that the information gain can be chained while the surprise can not:
Lemma 2.7.
Given random variables and and outcomes and for :
Proof.
We have
while
∎
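The difference between the two quantities can be checked numerically. The sketch below assumes the usual forms from DeWeese and Meister (1999): the information gain as the entropy reduction $H[X] - H[X \mid y]$ and the surprise as $D_{\mathrm{KL}}(p(X \mid y) \,\|\, p(X))$, with the conditional versions obtained by replacing the prior $p(X)$ with $p(X \mid y_1)$. The joint table `joint` and all function names are hypothetical.

```python
import math

def entropy(dist):
    """H = E[-ln p] in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def kl(p, q):
    """D_KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def information_gain(prior, posterior):
    """Entropy reduction when moving from the prior to the posterior over X."""
    return entropy(prior) - entropy(posterior)

def surprise(prior, posterior):
    """KL divergence from the prior to the posterior over X."""
    return kl(posterior, prior)

# Hypothetical joint p(x, y1, y2), indexed as joint[x][y1][y2].
joint = [
    [[0.20, 0.05], [0.05, 0.10]],
    [[0.05, 0.10], [0.15, 0.30]],
]
num_x = len(joint)
p_x = [sum(sum(row) for row in joint[x]) for x in range(num_x)]
p_y1 = [sum(sum(joint[x][y1]) for x in range(num_x)) for y1 in range(2)]

def posterior(y1, y2=None):
    """p(X | y1) if y2 is None, else p(X | y1, y2)."""
    if y2 is None:
        return normalize([sum(joint[x][y1]) for x in range(num_x)])
    return normalize([joint[x][y1][y2] for x in range(num_x)])

# Both quantities average to the mutual information I[X; Y1].
mutual_information = sum(
    sum(joint[x][y1]) * math.log(sum(joint[x][y1]) / (p_x[x] * p_y1[y1]))
    for x in range(num_x)
    for y1 in range(2)
)
avg_gain = sum(p_y1[y1] * information_gain(p_x, posterior(y1)) for y1 in range(2))
avg_surprise = sum(p_y1[y1] * surprise(p_x, posterior(y1)) for y1 in range(2))

# Chaining over two outcomes y1 and y2.
y1, y2 = 0, 1
gain_joint = information_gain(p_x, posterior(y1, y2))
gain_chained = (information_gain(p_x, posterior(y1))
                + information_gain(posterior(y1), posterior(y1, y2)))
surprise_joint = surprise(p_x, posterior(y1, y2))
surprise_chained = (surprise(p_x, posterior(y1))
                    + surprise(posterior(y1), posterior(y1, y2)))

print("information gain:", gain_joint, "vs chained", gain_chained)
print("surprise:        ", surprise_joint, "vs chained", surprise_chained)
```

In this toy example the information gain telescopes exactly, while the two surprise values differ, because the second surprise term takes its expectation under $p(x \mid y_1, y_2)$ rather than $p(x \mid y_1)$.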
However, both quantities do chain for random variables:
Lemma 2.8.
Given random variables , , , and outcome for :
We can extend this to triple mutual information terms by adopting the extension (Yeung, 2008) for outcomes as well: and so on.
Lemma 2.9.
For random variables and : and and
3 Example Application: Applying Bayesian Active Learning to the Core-Set Problem
We start by briefly revisiting Bayesian Deep Learning and Active Learning before introducing our Core-Set by Disagreement acquisition function.
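Bayesian Modelling. The model parameters are treated as a random variable $\Omega$ with prior distribution $p(\omega)$. We denote the training set $\mathcal{D}^{\text{train}} = \{(x_i^{\text{train}}, y_i^{\text{train}})\}_{i=1}^{n}$, where $\{x_i^{\text{train}}\}$ are the input samples and $\{y_i^{\text{train}}\}$ the labels or targets.
The probabilistic model is as follows:

$$
p(y, x, \omega) = p(y \mid x, \omega)\, p(\omega)\, p(x),
$$

where $x$ and $y$ are outcomes for the random variables $X$ and $Y$ denoting the input and label, respectively.
To include multiple labels and inputs, we expand the model to joints of random variables $\{x_i\}$ and $\{y_i\}$, obtaining

$$
p(y_{1..n}, x_{1..n}, \omega) = p(\omega) \prod_{i=1}^{n} p(y_i \mid x_i, \omega)\, p(x_i).
$$

We are only interested in discriminative models and thus do not explicitly specify $p(x)$.
The posterior parameter distribution $p(\omega \mid \mathcal{D}^{\text{train}})$ is determined via Bayesian inference. We obtain it using Bayes’ law:

$$
p(\omega \mid \mathcal{D}^{\text{train}}) \propto p(y_{1..n} \mid x_{1..n}, \omega)\, p(\omega),
$$

which allows for predictions by marginalizing over $\Omega$:

$$
p(y \mid x, \mathcal{D}^{\text{train}}) = \mathbb{E}_{p(\omega \mid \mathcal{D}^{\text{train}})}\left[ p(y \mid x, \omega) \right].
$$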
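Exact Bayesian inference is intractable for complex models, and we use variational inference for approximate inference using a distribution $q_\theta(\omega)$. We determine $\theta$ by minimizing the following KL divergence to the posterior $p(\omega \mid \mathcal{D}^{\text{train}})$ over the model parameters $\Omega$ given the training set:

$$
D_{\mathrm{KL}}(q_\theta(\Omega) \,\|\, p(\Omega \mid \mathcal{D}^{\text{train}})).
$$

For Bayesian deep learning models, we use the local reparameterization trick and Monte-Carlo dropout for $q_\theta(\omega)$ (Gal and Ghahramani, 2016).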
Active Learning. In active learning, we have access to an unlabelled pool set . We iteratively select samples from the pool set and ask an oracle for labels. We then add those newly labeled samples to the training set, retrain our model, and repeat this process until the model satisfies our performance requirements.
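The acquisition loop described above can be sketched generically. The following is a minimal illustration for individual (one-sample) acquisition, not the training setup used in this paper: `train`, `score`, and `oracle` are hypothetical callables, and the toy problem (locating a threshold on a line with a midpoint-based score) only stands in for a real model and acquisition function.

```python
def active_learning_loop(pool, oracle, score, train, budget):
    """Greedy individual acquisition: label the top-scoring pool sample each round."""
    pool = list(pool)
    train_set = []
    model = train(train_set)
    for _ in range(budget):
        if not pool:
            break
        # Score every pool sample with the current model and acquire the best one.
        best = max(pool, key=lambda x: score(model, x))
        pool.remove(best)
        train_set.append((best, oracle(best)))
        # Retrain on the enlarged training set before the next round.
        model = train(train_set)
    return model, train_set

# Toy stand-ins (all hypothetical): the "model" is an interval (lo, hi) that
# is known to contain the decision threshold of a 1-D classifier.
def train(train_set):
    lo = max([x for x, label in train_set if not label], default=0.0)
    hi = min([x for x, label in train_set if label], default=1.0)
    return lo, hi

def score(model, x):
    lo, hi = model
    # Samples near the middle of the uncertain interval score highest.
    return -abs(x - (lo + hi) / 2)

def oracle(x):
    return x > 0.3  # true labelling rule, unknown to the learner

pool = [i / 100 for i in range(101)]
model, labelled = active_learning_loop(pool, oracle, score, train, budget=5)
print("interval after 5 acquisitions:", model)
```

Passing the model, scoring rule, and oracle as arguments keeps the loop independent of any particular acquisition function, so a different score can be swapped in without changing the loop itself.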
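Samples are usually acquired in batches instead of individually. We score candidate acquisition batches $\{x_1, \ldots, x_b\}$ with the acquisition batch size $b$ using an acquisition function $a$ and pick the highest-scoring one:

$$
\operatorname*{arg\,max}_{\{x_1, \ldots, x_b\} \subseteq \mathcal{D}^{\text{pool}}} a\left(\{x_1, \ldots, x_b\},\, p(\Omega \mid \mathcal{D}^{\text{train}})\right).
$$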
BALD was originally introduced as a one-sample acquisition function of the expected information gain between the prediction $Y$ for a candidate input $x$ and the model parameters $\Omega$:

$$
I[\Omega; Y \mid x, \mathcal{D}^{\text{train}}].
$$

In BatchBALD (Kirsch et al., 2019), this one-sample case was canonically extended to the batch acquisition case using the expected information gain between the joint of the predictions $Y_1, \ldots, Y_b$ for the batch candidates $x_1, \ldots, x_b$ and the model parameters $\Omega$:

$$
I[\Omega; Y_1, \ldots, Y_b \mid x_1, \ldots, x_b, \mathcal{D}^{\text{train}}].
$$
Notation. Instead of , , we will write , and so on to cut down on notation. Like above, all terms can be canonically extended to sets by substituting the joint. We provide the full derivations in the appendix. Also note again that lower-case variables like $y$ are outcomes and upper-case variables like $Y$ are random variables, with the exception of the datasets etc., which are sets of outcomes.
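The one-sample BALD score described above is commonly estimated from Monte-Carlo samples of the model parameters as the entropy of the mean prediction minus the mean entropy of the individual predictions. The sketch below assumes this standard estimator; `bald_score` and the layout of `mc_probs` are hypothetical names for illustration.

```python
import math

def entropy(probs):
    """Entropy in nats of a categorical distribution; 0 * ln 0 is treated as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def bald_score(mc_probs):
    """Estimate the mutual information between the prediction and the parameters.

    mc_probs[k][c] is the predicted probability of class c under the k-th
    sampled set of model parameters (for example, the k-th dropout mask).
    """
    num_samples = len(mc_probs)
    num_classes = len(mc_probs[0])
    mean_probs = [
        sum(sample[c] for sample in mc_probs) / num_samples
        for c in range(num_classes)
    ]
    expected_entropy = sum(entropy(sample) for sample in mc_probs) / num_samples
    # Entropy of the marginal prediction minus the expected conditional entropy.
    return entropy(mean_probs) - expected_entropy

# Parameter samples that agree carry no information about the parameters ...
print(bald_score([[0.7, 0.3]] * 4))
# ... while confident but contradictory samples give the maximum for two classes.
print(bald_score([[1.0, 0.0], [0.0, 1.0]]))
```

The score is high only when the individual parameter samples are confident yet disagree with each other, which is the "disagreement" in the name.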
3.1 From BALD to Core-Set by Disagreement
We examine BALD through the lens of our new notation and develop CSD as information gain. First, we note that BALD does not directly minimize the loss (cross-entropy) on the test distribution: it does not try to pick labels which minimize the generalization loss.
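BALD maximizes the expected information gain $I[\Omega; Y \mid x, \mathcal{D}^{\text{train}}]$ between the model parameters $\Omega$ and the prediction $Y$ for a candidate input $x$. We assume that our Bayesian model contains the true generating model parameters and that, by selecting samples that minimize the uncertainty $H[\Omega \mid \mathcal{D}^{\text{train}}]$ about the parameters, the model parameters will converge towards these true parameters as this uncertainty approaches zero.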
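BALD as an Approximation. BALD as the expected information gain is just an expectation over the current model’s predictions $p(y \mid x)$ for an input $x$:

$$
I[\Omega; Y \mid x] = \mathbb{E}_{p(y \mid x)}\, I[\Omega; y \mid x], \tag{1}
$$

where, following the definition of the information gain of the model parameters $\Omega$ given an outcome $y$:

$$
I[\Omega; y \mid x] = H[\Omega] - H[\Omega \mid y, x]. \tag{2}
$$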
That is, we can view BALD as weighting the information gains for different labels $y$ by the current model’s belief that $y$ is correct. If we had access to the label or a better proxy distribution for the label, we could improve on this. This could in particular help with the cold-start problem in active learning, when one starts training with no initial training set and the model predictions are not trustworthy at all. When we have access to the labels, we can directly use the information gain and select the samples using a Core-Set by Disagreement acquisition function:
$$
a_{\text{CSD}}(x, y) := I[\Omega; y \mid x]. \tag{3}
$$
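Evaluating the Information Gain. We show how to compute the information gain for the special case of an MC dropout model with dropout rate $\frac{1}{2}$. Computing the information gain for complex models is not trivial as its canonical expansion requires an explicit model. However, most available Bayesian neural networks, like MC dropout models, only provide implicit models which we can only sample from but which do not provide an easy way to compute their density. Moreover, to compute $H[\Omega \mid y, x]$ naively we would have to perform a Bayesian inference step. However, with Shannon’s information content $h(p) = -\ln p$, we can rewrite:

$$
I[\Omega; y \mid x] = H[\Omega] - H[\Omega \mid y, x] = \mathbb{E}_{p(\omega)}\, h(p(\omega)) - \mathbb{E}_{p(\omega \mid y, x)}\, h(p(\omega \mid y, x)). \tag{4}
$$

We can expand $h(p(\omega \mid y, x))$ using Bayes’ law to:

$$
h(p(\omega \mid y, x)) = h\!\left(\frac{p(y \mid x, \omega)\, p(\omega)}{p(y \mid x)}\right) = h(p(y \mid x, \omega)) + h(p(\omega)) - h(p(y \mid x)). \tag{5}
$$

The expectation of $h(p(y \mid x))$ is just $h(p(y \mid x))$, so we have:

$$
I[\Omega; y \mid x] = \mathbb{E}_{p(\omega)}\, h(p(\omega)) - \mathbb{E}_{p(\omega \mid y, x)}\left[ h(p(\omega)) + h(p(y \mid x, \omega)) \right] + h(p(y \mid x)).
$$

To compute expectations over $p(\omega \mid y, x)$, we use importance sampling:

$$
\mathbb{E}_{p(\omega \mid y, x)}\, f(\omega) = \mathbb{E}_{p(\omega)}\left[ \frac{p(y \mid x, \omega)}{p(y \mid x)}\, f(\omega) \right].
$$

Finally, if we use Monte-Carlo dropout with dropout rate $\frac{1}{2}$ to obtain a variational model distribution $q(\omega)$, we have $h(q(\omega)) = \text{const}$, so the two terms involving $h(q(\omega))$ cancel, and we can approximate $I[\Omega; y \mid x]$ as:

$$
I[\Omega; y \mid x] \approx h(p(y \mid x)) - \mathbb{E}_{q(\omega)}\left[ \frac{p(y \mid x, \omega)}{p(y \mid x)}\, h(p(y \mid x, \omega)) \right].
$$

In this special case, we indeed have $I[\Omega; y \mid x] = I[y; \Omega \mid x]$. We can use the surprise to approximate the information gain. However, as we will see below, this approximation is brittle.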
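The approximation described above can be sketched as a Monte-Carlo estimate. The code below assumes the estimator takes the form of the information content of the marginal label probability minus the importance-weighted average information content of the per-sample label probabilities, with the marginal estimated as the mean over sampled dropout masks. The names `csd_score` and `mc_probs`, and the example numbers, are hypothetical.

```python
import math

def csd_score(label_probs):
    """Approximate information gain of the parameters given the true label.

    label_probs[k] is the probability that the k-th sampled dropout mask
    assigns to the true label. Assumes at least one entry is positive.
    """
    num_samples = len(label_probs)
    marginal = sum(label_probs) / num_samples  # estimate of p(y | x)
    # Importance weights p(y | x, omega_k) / p(y | x) reweight prior samples
    # so that they behave like samples from the posterior given the label.
    weighted_content = sum(
        (p / marginal) * (-math.log(p)) for p in label_probs if p > 0
    ) / num_samples
    return -math.log(marginal) - weighted_content

# Hypothetical predictions of three dropout masks for a two-class problem:
# mc_probs[k][c] is the probability of class c under mask k.
mc_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
scores = [csd_score([sample[c] for sample in mc_probs]) for c in range(2)]
mean_probs = [sum(sample[c] for sample in mc_probs) / len(mc_probs) for c in range(2)]

# Weighting the per-label scores by the model's own predictions gives the
# expected score that BALD would assign without access to the label.
expected_score = sum(mean_probs[c] * scores[c] for c in range(2))
print("per-label scores:", scores)
print("expected score:  ", expected_score)
```

Under these assumptions the score is a KL divergence between the normalized importance weights and the uniform distribution over samples, so it is non-negative and vanishes when all masks assign the same probability to the label.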
4 Experiments
MNIST. We implement CSD and evaluate it on MNIST to show that it can identify a core-set of training samples that achieves high accuracy and low loss. Because we compute the information gain using the provided labels, CSD is very sensitive to mislabeled samples: if a sample is mislabeled, it will necessarily have a very high information gain and the model will become misspecified. For this reason, we train a LeNet ensemble with 5 models on MNIST and discard all samples with predictive entropy nats and whose labels do not match the predictions. This removes about 5678 samples from the training set.
We use a LeNet model (LeCun et al., 1998) with MC dropout (dropout rate ) in the core-set setting where we have access to labels but otherwise use an active learning setup with greedy acquisition. We use individual acquisition and compare to BALD, which does not make use of label information, and which we use as a sanity baseline. The training regime follows the one described in Kirsch et al. (2019).
Acquisition Function | 90% Acc | 95% Acc |
---|---|---|
Uniform | 125/130/150 | — |
BALD | 88/91/99 | 130/145/167 |
CSD (ours) | 55/58/58 | 105/111/115 |
Table 1: 25%/50%/75% quantiles of the number of acquired samples needed to reach 90% and 95% accuracy on MNIST, 5 trials each.

Figure 1 shows that CSD strongly outperforms BALD on MNIST (both with individual acquisitions). Indeed, only 58 samples are required to reach 90% accuracy at the median and 111 samples for 95% accuracy, compared to BALD, which needs about 30 samples more in each case; see also Table 1. We show an ablation of using CSD without removing mislabeled or ambiguous samples from the training set in Figure 2.
CIFAR-10.
However, we could not reproduce these results on CIFAR-10 (cleaned in the same way as MNIST above) with ResNet18 models and MC dropout. BALD performs much better than CSD, even when cold starting. The accuracy plot is depicted in Figure 3. This indicates that something is wrong. We have not been able to identify the issue yet.

BatchCSD. Finally, we examine an extension of CSD to the batch case following Kirsch et al. (2019) and compute using the approximation . However, even for a batch acquisition size of 5, this approximation already fails to work well, as depicted in Figure 4 on MNIST. BatchCSD performs worse than Uniform for samples and worse than BALD for 150 samples.
5 Conclusion & Limitations
In addition to a unified notation for information-theoretic quantities for both random variables and outcomes, we have unified information gain and surprise by defining the mutual information appropriately. We have used this notation to reinterpret BALD as expected information gain and found an approximation for the information gain which allowed us to introduce CSD and show that it works on MNIST. However, we have not been able to provide good results for CIFAR-10 or successfully extend our approximation to the batch case. Moreover, the approximation we have used only works for MC dropout with dropout rate $\frac{1}{2}$; otherwise, our approach requires an explicit model. Importantly, unlike BALD, the information gain in CSD also does not seem to be submodular, and we cannot infer a $1 - \frac{1}{e}$ optimality that way (Kirsch et al., 2019), although BALD’s submodularity and optimality are not tied to its generalization loss anyway.
References
- Butts [2003] Daniel A Butts. How much information is associated with a particular stimulus? Network: Computation in Neural Systems, 14(2):177–187, 2003.
- Church and Hanks [1990] Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990. URL https://www.aclweb.org/anthology/J90-1003.
- Cover and Thomas [2006] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, USA, 2006. ISBN 0471241954.
- DeWeese and Meister [1999] Michael R DeWeese and Markus Meister. How to measure the information gained from one symbol. Network: Computation in Neural Systems, 10(4):325–340, 1999.
- Fano [1961] Robert M Fano. Transmission of information: A statistical theory of communications. American Journal of Physics, 29(11):793–794, 1961.
- Gal and Ghahramani [2016] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059. PMLR, 2016.
- Gal et al. [2017] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian active learning with image data. In International Conference on Machine Learning, pages 1183–1192. PMLR, 2017.
- Houlsby et al. [2011] Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011.
- Kirsch et al. [2019] Andreas Kirsch, Joost van Amersfoort, and Yarin Gal. BatchBALD: Efficient and diverse batch acquisition for deep Bayesian active learning. In Advances in Neural Information Processing Systems, pages 7024–7035, 2019.
- Kirsch et al. [2020] Andreas Kirsch, Clare Lyle, and Yarin Gal. Unpacking information bottlenecks: Unifying information-theoretic objectives in deep learning. arXiv preprint arXiv:2003.12537, 2020.
- LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Lindley [1956] Dennis V Lindley. On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, pages 986–1005, 1956.
- Williams [2011] Paul L Williams. Information dynamics: Its theory and application to embodied cognitive systems. PhD thesis, Indiana University, 2011.
- Xu et al. [2020] Yilun Xu, Shengjia Zhao, Jiaming Song, Russell Stewart, and Stefano Ermon. A theory of usable information under computational constraints. arXiv preprint arXiv:2002.10689, 2020.
- Yeung [2008] R.W. Yeung. Information Theory and Network Coding. Information Technology: Transmission, Processing and Storage. Springer US, 2008. ISBN 9780387792347.