# A Practical Unified Notation for Information-Theoretic Quantities in ML

Information theory is of importance to machine learning, but the notation for information-theoretic quantities is sometimes opaque. The right notation can convey valuable intuitions and concisely express new ideas. We propose such a notation for machine learning users and expand it to include information-theoretic quantities between events (outcomes) and random variables. We apply this notation to a popular information-theoretic acquisition function in Bayesian active learning which selects the most informative (unlabelled) samples to be labelled by an expert. We demonstrate the value of our notation when extending the acquisition function to the core-set problem, which consists of selecting the most informative samples given the labels.


## 1 Introduction

Information theory has provided the foundations for many recent advancements in deep learning: for example, information bottlenecks inform objectives in both supervised and unsupervised learning of high-dimensional data. However, the employed notation can be ambiguous for the more complex expressions found in applied settings, as information theory is used by researchers from different backgrounds (statistics, computer science, and information engineering, to name a few). For example, $H(p, q)$ is sometimes used to denote the cross-entropy between $p$ and $q$, which conflicts with the common notation $H(X, Y)$ for the joint entropy of $X$ and $Y$; or it is left unclear that the conditional entropy $H(X \mid Y)$ of $X$ given $Y$ is an expectation over $Y$. A good notation, on the other hand, can convey valuable intuitions and concisely express new ideas, which we consider the first step towards new advances.
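To make the ambiguity concrete, here is a minimal numeric sketch (ours, not from any particular library) contrasting the cross-entropy $H(p \,\|\, q)$ with the joint entropy of two independent variables distributed as $p$ and $q$:

```python
import math

def h(z):
    """Shannon information content in nats."""
    return -math.log(z)

# Two Bernoulli distributions p and q over {0, 1}.
p = [0.8, 0.2]
q = [0.5, 0.5]

# Cross-entropy H(p ∥ q) = E_p[h(q)].
cross_entropy = sum(pi * h(qi) for pi, qi in zip(p, q))

# Joint entropy H(X, Y) for independent X ~ p, Y ~ q:
# a different quantity, even though both are often written "H(p, q)".
joint = [pi * qi for pi in p for qi in q]
joint_entropy = sum(pj * h(pj) for pj in joint)
```

Here the cross-entropy equals $\ln 2$ (since $q$ is uniform) while the joint entropy equals $H(p) + H(q)$, so overloading the same symbol invites exactly the confusion described above.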

In addition to disambiguating the commonly used notation, we propose that an extension of information-theoretic quantities to relations between events (outcomes) and random variables can be of great use in machine learning. The mutual information $I[X; Y]$ is only defined for random variables $X$ and $Y$, or as point-wise information $I[x; y]$ (DeWeese and Meister, 1999) for two outcomes. A natural extension is to consider the mutual information $I[x; Y]$ between an outcome $x$ and a random variable $Y$. Why is this not more commonly known outside the fields of neuroscience and cognitive science? We find a consistent extension for such cases which unifies two previously separate quantities. As an application, we examine the core-set problem, which consists of selecting the most informative samples of a training set given the labels, and provide new results.

Concretely, we examine BALD (Bayesian Active Learning by Disagreement), an acquisition function in Bayesian active learning, and extend it to the core-set problem. In active learning, we have access to a huge reservoir of unlabelled data in a pool set and iteratively select samples from this pool set to be labelled by an oracle (e.g. human experts) to increase the model's performance as quickly as possible. Acquisition functions are used to score all pool samples, and the highest scorer is acquired. The goal of an acquisition function is to score the most informative samples the highest. BALD (Gal et al., 2017; Houlsby et al., 2011) is such an acquisition function: it maximizes the expected information gain of the model parameters given the prediction variable for a candidate sample from the pool set. It is immediately related to the concept of reduction in posterior uncertainty known from Bayesian optimal experimental design (Lindley, 1956).

The core-set problem on the other hand consists of identifying the most informative samples given the labels, the core set, such that training a model on this core set will perform as well as a model trained on the whole dataset.

In this paper, we introduce Core-Set by Disagreement (CSD) which, using our notation, maximizes $I[\Omega; y \mid x, D_{train}]$, the information gain of the model parameters given the true label of a sample in the dataset. Figure 1 shows that it strongly outperforms BALD on MNIST.

## 2 Information Theory Notation

For a general introduction to information theory, we refer to Cover and Thomas (2006) and Yeung (2008). In the following, we introduce a disambiguated notation. We start with notation that is explicit about the probability distribution $p(x)$.

###### Definition 2.1.

Let Shannon’s information content $h$, cross-entropy $H(\cdot \,\|\, \cdot)$, entropy $H(\cdot)$, and KL divergence $D_{KL}(\cdot \,\|\, \cdot)$ be defined for a probability distribution $p(x)$ and a non-negative function $q(x)$ as:

$$\begin{aligned}
h(q) &:= -\ln q \\
H(p(X) \,\|\, q(X)) &:= \mathbb{E}_{p(x)}\, h(q(x)) \\
H(p(X)) &:= H(p(X) \,\|\, p(X)) \\
D_{KL}(p(X) \,\|\, q(X)) &:= H(p(X) \,\|\, q(X)) - H(p(X)),
\end{aligned}$$

where we use the random variable (upper-case $X$) to make clear which random variable the expectation is over.

Moreover, when focusing on variational aspects of the cross-entropy and the true probability distribution is understood, we use the notation $H[X \,\|\, q]$ for $H(p(X) \,\|\, q(X))$ (Kirsch et al., 2020; Xu et al., 2020):

###### Definition 2.2.

When the true probability distribution is understood from context, we will use the following short-hands:

$$H[X] := H(p(X)), \qquad H[X \,\|\, q] := H(p(X) \,\|\, q(X)).$$

When we have a family of variational distributions $q_\theta$ parameterized by $\theta$, we will simplify further to $H[X \,\|\, \theta]$ when the context is clear.

The main motivation for this notation follows from the following lemma:

###### Lemma 2.3.

We have the following lower bounds for the cross-entropy and the KL divergence, with $Z_q := \int q(x)\, dx$:

$$\begin{aligned}
H(p(X) \,\|\, q(X)) &\ge H(p(X)) + h(Z_q), \\
D_{KL}(p(X) \,\|\, q(X)) &\ge h(Z_q).
\end{aligned}$$
###### Proof.

This follows directly from the non-negativity of the KL divergence for densities when we substitute the normalized density $q(x)/Z_q$ for $q(x)$. ∎
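As a quick sanity check of Lemma 2.3, the bounds can be verified numerically for a small discrete example with an unnormalized $q$ (a sketch with made-up numbers):

```python
import math

def h(z):
    """Shannon information content in nats."""
    return -math.log(z)

p = [0.7, 0.2, 0.1]   # a normalized distribution
q = [0.9, 0.6, 0.5]   # a non-negative, unnormalized function
Z_q = sum(q)          # normalization constant of q

H_p = sum(pi * h(pi) for pi in p)               # entropy H(p(X))
H_pq = sum(pi * h(qi) for pi, qi in zip(p, q))  # cross-entropy H(p(X) ∥ q(X))
D_kl = H_pq - H_p                               # KL divergence as defined above
```

Both inequalities hold; note that for $Z_q > 1$, $h(Z_q)$ is negative, so the KL divergence against an unnormalized $q$ can itself be negative.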

These definitions are trivially extended to joints of random variables and conditionals by substituting the random variable of the product space or the conditional variable.

We can further canonically extend the definitions to tie random variables to specific outcomes, e.g. $H[X, y]$: if we mix (untied) random variables and tied random variables (outcomes), we define $H[\cdot]$ as an operator that takes an expectation of Shannon’s information content of the given expression over the random variables conditioned on the outcomes. For example, $H[X, y] = \mathbb{E}_{p(x \mid y)}\, h(p(x, y))$. Similarly, we have $H[X \mid y] = \mathbb{E}_{p(x \mid y)}\, h(p(x \mid y))$. The motivation is that we maintain identities such as $H[X \mid y] = H[X, y] - H[y]$. This leads to the following definition:

###### Definition 2.4.

Given random variables $X$ and $Y$ and an outcome $y$ of $Y$, we define:

$$\begin{aligned}
H[y] &:= h(p(y)) \\
H[X, y] &:= \mathbb{E}_{p(x \mid y)}\, H[x, y] = \mathbb{E}_{p(x \mid y)}\, h(p(x, y)) \\
H[X \mid y] &:= \mathbb{E}_{p(x \mid y)}\, H[x \mid y] = \mathbb{E}_{p(x \mid y)}\, h(p(x \mid y)) = H[X, y] - H[y] \\
H[y \mid X] &:= \mathbb{E}_{p(x \mid y)}\, H[y \mid x] = \mathbb{E}_{p(x \mid y)}\, h(p(y \mid x)) \neq H[X, y] - H[X],
\end{aligned}$$

where we have shortened $h(p(\,\cdot\,))$ to $H[\,\cdot\,]$ for fixed outcomes.

The intuition behind all of this is that, e.g., $H[X, y]$ measures the average length of transmitting $x$ and $y$ together when $Y = y$ unbeknownst to the sender and receiver.
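The identity $H[X \mid y] = H[X, y] - H[y]$ from Definition 2.4 can be checked on a toy joint distribution (a sketch; the numbers are arbitrary):

```python
import math

def h(z):
    """Shannon information content in nats."""
    return -math.log(z)

# A toy joint p(x, y) over x, y ∈ {0, 1}.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_y = {y: p[(0, y)] + p[(1, y)] for y in (0, 1)}
p_x_given_y = {(x, y): p[(x, y)] / p_y[y] for (x, y) in p}

y = 1  # a fixed outcome of Y

H_y = h(p_y[y])                                                 # H[y]
H_Xy = sum(p_x_given_y[(x, y)] * h(p[(x, y)]) for x in (0, 1))  # H[X, y]
H_X_given_y = sum(p_x_given_y[(x, y)] * h(p_x_given_y[(x, y)])  # H[X | y]
                  for x in (0, 1))
```

All three quantities are expectations over $p(x \mid y)$, and the identity holds exactly because $h(p(x, y)) = h(p(x \mid y)) + h(p(y))$.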

The mutual information and point-wise mutual information (Fano, 1961; Church and Hanks, 1990) are defined as:

###### Definition 2.5.

For random variables $X$ and $Y$ with outcomes $x$ and $y$ respectively, the point-wise mutual information and the mutual information are:

$$\begin{aligned}
I[x; y] &:= H[x] - H[x \mid y] = h\!\left(\frac{p(x)\, p(y)}{p(x, y)}\right) \\
I[X; Y] &:= H[X] - H[X \mid Y] = \mathbb{E}_{p(x, y)}\, I[x; y].
\end{aligned}$$

This is similarly extended to conditional quantities such as $I[x; y \mid z]$ and so on. There are two common, sensible quantities we can define when we want to consider the information overlap between a random variable and an outcome: the information gain and the surprise (DeWeese and Meister, 1999; Butts, 2003). The two quantities are usually defined separately in the cognitive sciences and neuroscience (Williams, 2011); however, we can unify them by relaxing the symmetry of the mutual information:

###### Definition 2.6.

Given random variables $X$ and $Y$ and an outcome $y$ of $Y$, we define the information gain and the surprise as:

$$\begin{aligned}
I[X; y] &:= H[X] - H[X \mid y] \\
I[y; X] &:= H[y] - H[y \mid X] = \mathbb{E}_{p(x \mid y)}\, I[x; y].
\end{aligned}$$

This unifying definition is novel to the best of our knowledge. The information gain measures how much the entropy (uncertainty) of $X$ reduces when we learn the outcome $y$, while the surprise measures how many additional nats we need to transmit $y$ when we already know $X$. The subtle difference becomes clearer when we note that the information gain can be chained while the surprise cannot:
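To illustrate the difference, the following sketch (with an arbitrary joint distribution) computes the information gain and the surprise for the same outcome; both average to the mutual information under $p(y)$, but they differ point-wise:

```python
import math

# A toy joint p(x, y) with a non-uniform marginal over x.
p = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}
Xs, Ys = (0, 1), (0, 1)
p_x = {x: sum(p[(x, y)] for y in Ys) for x in Xs}
p_y = {y: sum(p[(x, y)] for x in Xs) for y in Ys}

def entropy(dist):
    return -sum(v * math.log(v) for v in dist.values())

def gain(y):
    """Information gain I[X; y] = H[X] − H[X | y]."""
    p_x_y = {x: p[(x, y)] / p_y[y] for x in Xs}
    return entropy(p_x) - entropy(p_x_y)

def surprise(y):
    """Surprise I[y; X] = E_{p(x|y)} I[x; y]."""
    p_x_y = {x: p[(x, y)] / p_y[y] for x in Xs}
    return sum(p_x_y[x] * math.log(p[(x, y)] / (p_x[x] * p_y[y])) for x in Xs)

# Mutual information I[X; Y] computed directly from the joint.
mi = sum(v * math.log(v / (p_x[x] * p_y[y])) for (x, y), v in p.items())
```

For this joint, `gain(0)` ≈ 0.222 while `surprise(0)` ≈ 0.128, yet $\sum_y p(y)\, I[X; y] = \sum_y p(y)\, I[y; X] = I[X; Y]$.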

###### Lemma 2.7.

Given a random variable $X$ and outcomes $y_1$ and $y_2$ of $Y_1$ and $Y_2$:

$$\begin{aligned}
I[X; y_1, y_2] &= I[X; y_1] + I[X; y_2 \mid y_1] \\
I[y_1, y_2; X] &\neq I[y_1; X] + I[y_2; X \mid y_1].
\end{aligned}$$
###### Proof.

We have

$$\begin{aligned}
I[X; y_1, y_2] &= H[X] - H[X \mid y_1] + H[X \mid y_1] - H[X \mid y_1, y_2] \\
&= I[X; y_1] + I[X; y_2 \mid y_1],
\end{aligned}$$

while

$$I[y_1, y_2; X] = \underbrace{\mathbb{E}_{p(x \mid y_1, y_2)}\, I[y_1; x]}_{\neq\, \mathbb{E}_{p(x \mid y_1)} I[y_1; x] \,=\, I[y_1; X]} + \underbrace{\mathbb{E}_{p(x \mid y_1, y_2)}\, I[y_2; x \mid y_1]}_{=\, I[y_2; X \mid y_1]}.$$

∎

However, both quantities do chain for random variables:

###### Lemma 2.8.

Given random variables $X_1$, $X_2$ and an outcome $y$ of $Y$:

$$\begin{aligned}
I[X_1, X_2; y] &= I[X_1; y] + I[X_2; y \mid X_1] \\
I[y; X_1, X_2] &= I[y; X_1] + I[y; X_2 \mid X_1].
\end{aligned}$$

We can extend this to triple mutual information terms by adopting the extension (Yeung, 2008) for outcomes as well: $I[X; Y; z] := I[X; Y] - I[X; Y \mid z]$ and so on.

###### Lemma 2.9.

For random variables $X$ and $Y$ and an outcome $z$ of $Z$: $I[X; Y; z] = I[X; Y] - I[X; Y \mid z]$ and $I[X; Y; z] = I[Y; X; z]$.

## 3 Example Application: Applying Bayesian Active Learning to the Core-Set Problem

We start by briefly revisiting Bayesian Deep Learning and Active Learning before introducing our Core-Set by Disagreement acquisition function.

Bayesian Modelling. The model parameters $\omega$ are treated as a random variable $\Omega$ with prior distribution $p(\omega)$. We denote the training set $D_{train} = \{(x_i^{train}, y_i^{train})\}_i$, where $x_i^{train}$ are the input samples and $y_i^{train}$ the labels or targets.

The probabilistic model is as follows:

$$p(y, x, \omega) = p(y \mid x, \omega)\, p(\omega)\, p(x),$$

where $x$ and $y$ are outcomes for the random variables $X$ and $Y$ denoting the input and label, respectively.

To include multiple labels and inputs, we expand the model to joints of random variables $\{Y_i\}_i$ and $\{X_i\}_i$, obtaining

$$p(\{y_i\}_i, \{x_i\}_i, \omega) = p(\omega) \prod_{i \in I} p(y_i \mid x_i, \omega)\, p(x_i).$$

We are only interested in discriminative models and thus do not explicitly specify $p(x)$.

The posterior parameter distribution $p(\omega \mid D_{train})$ is determined via Bayesian inference. We obtain it using Bayes’ law:

$$p(\omega \mid D_{train}) \propto p(\{y_i^{train}\}_i \mid \{x_i^{train}\}_i, \omega)\, p(\omega),$$

which allows for predictions by marginalizing over $\omega$:

$$p(y \mid x, D_{train}) = \mathbb{E}_{\omega \sim p(\omega \mid D_{train})}\, p(y \mid x, \omega).$$

Exact Bayesian inference is intractable for complex models, and we use variational inference for approximate inference using a variational distribution $q(\omega)$. We determine $q(\omega)$ by minimizing the following KL divergence:

$$D_{KL}(q(\omega) \,\|\, p(\omega \mid D_{train})) = \underbrace{H(q(\omega) \,\|\, p(\{y_i^{train}\}_i \mid \{x_i^{train}\}_i, \omega))}_{\text{likelihood}} + \underbrace{D_{KL}(q(\omega) \,\|\, p(\omega))}_{\text{prior regularization}} + \underbrace{\log p(D_{train})}_{\text{model evidence}} \ge 0.$$

For Bayesian deep learning models, we use the local reparameterization trick and Monte-Carlo dropout for $q(\omega)$ (Gal and Ghahramani, 2016).

Active Learning. In active learning, we have access to an unlabelled pool set $D_{pool}$. We iteratively select samples from the pool set and ask an oracle for labels. We then add those newly labelled samples to the training set, retrain our model, and repeat this process until the model satisfies our performance requirements.

Samples are usually acquired in batches instead of individually. We score candidate acquisition batches $\{x_i^{acq}\}_i$ of acquisition batch size $b$ using an acquisition function $a$ and pick the highest-scoring one:

$$\operatorname*{arg\,max}_{\{x_i^{acq}\}_{i \in \{1, \dots, b\}} \subseteq D_{pool}} a(\{x_i^{acq}\}_i, p(\Omega \mid D_{train})).$$
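The acquisition loop described above can be sketched as follows; `train`, `oracle`, and `score` are stand-ins for the model-specific pieces (all names are ours, not from any library):

```python
def active_learning_loop(pool, oracle, train, score, rounds, batch_size=1):
    """Pool-based active learning: score the pool, acquire the top batch,
    query the oracle for labels, retrain, and repeat."""
    labelled = []
    model = train(labelled)
    for _ in range(rounds):
        # Score every pool sample and acquire the highest scorers.
        ranked = sorted(pool, key=lambda x: score(model, x), reverse=True)
        batch = ranked[:batch_size]
        for x in batch:
            pool.remove(x)
            labelled.append((x, oracle(x)))
        model = train(labelled)
    return model, labelled

# Dummy example: the stand-in scorer prefers numerically larger samples.
model, labelled = active_learning_loop(
    pool=list(range(10)),
    oracle=lambda x: x % 2,          # stand-in labelling oracle
    train=lambda data: len(data),    # stand-in "model"
    score=lambda model, x: x,        # stand-in acquisition function
    rounds=3)
```

With a real model, `score` would be an acquisition function such as BALD evaluated under the current posterior.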

BALD was originally introduced as a one-sample acquisition function: the expected information gain between the prediction $Y$ for a candidate input $x$ and the model parameters $\Omega$. In BatchBALD (Kirsch et al., 2019), this one-sample case was canonically extended to the batch acquisition case using the expected information gain between the joint of the predictions $\{Y_i^{acq}\}_i$ for the batch candidates and the model parameters $\Omega$:

$$a_{BALD}(\{x_i^{acq}\}_i, p(\Omega \mid D_{train})) := I[\Omega; \{Y_i^{acq}\}_i \mid \{x_i^{acq}\}_i, D_{train}].$$

Notation. Instead of $Y^{acq}$, $x^{acq}$, and so on, we will simply write $Y$, $x$, and so on to cut down on notation. Like above, all terms can be canonically extended to sets by substituting the joint. We provide the full derivations in the appendix. Also note again that lower-case variables like $x$, $y$, $\omega$ are outcomes and upper-case variables like $X$, $Y$, $\Omega$ are random variables, with the exception of the datasets $D_{train}$, $D_{pool}$, etc., which are sets of outcomes.
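For intuition, the one-sample BALD score can be estimated from Monte-Carlo parameter samples via the standard decomposition $I[\Omega; Y \mid x, D_{train}] = H[Y \mid x, D_{train}] - \mathbb{E}_{\omega}\, H[Y \mid x, \omega]$; a minimal sketch (function names are ours):

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def bald_score(sampled_predictions):
    """Estimate I[Ω; Y | x] = H[Y | x] − E_ω H[Y | x, ω] from a list of
    class-probability vectors p(y | x, ω_k), one per posterior sample ω_k
    (e.g. one per MC dropout mask)."""
    k = len(sampled_predictions)
    n_classes = len(sampled_predictions[0])
    mean_pred = [sum(ps[c] for ps in sampled_predictions) / k
                 for c in range(n_classes)]
    return entropy(mean_pred) - sum(entropy(ps) for ps in sampled_predictions) / k
```

Confident but disagreeing posterior samples yield a high score (the parameters are uncertain about this sample), while unanimous predictions yield a score of zero.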

### 3.1 BALD → Core-Set by Disagreement

We examine BALD through the lens of our new notation and develop CSD as an information gain. First, we note that BALD does not optimize for the loss (cross-entropy) on the test distribution to become minimal: it does not try to pick labels which minimize the generalization loss.

BALD maximizes the expected information gain $I[\Omega; Y \mid x, D_{train}]$. We assume that our Bayesian model contains the true generating model parameters, and by selecting samples that minimize the uncertainty $H[\Omega \mid D_{train}]$, the model parameters will converge towards these true parameters as $|D_{train}| \to \infty$.

BALD as an Approximation. BALD, as the expected information gain, is just an expectation of the information gain under the current model’s predictions for $y$:

$$I[\Omega; Y \mid x, D_{train}] = \mathbb{E}_{p(y \mid x, D_{train})}\, I[\Omega; y \mid x, D_{train}], \qquad (1)$$

where, following the definition of the information gain:

$$I[\Omega; y \mid x, D_{train}] = H[\Omega \mid D_{train}] - H[\Omega \mid y, x, D_{train}]. \qquad (2)$$

That is, we can view BALD as weighting the information gains for different $y$ by the current model’s belief that $y$ is the correct label. If we had access to the label or a better proxy distribution for the label, we could improve on this. This could in particular help with the cold-starting problem in active learning, when one starts training with no initial training set and the model’s predictions are not trustworthy at all. When we have access to the labels, we can directly use the information gain and select samples using a Core-Set by Disagreement acquisition function:

$$a_{CSD}(y^{acq}, x^{acq}, p(\Omega \mid D_{train})) := I[\Omega; y^{acq} \mid x^{acq}, D_{train}]. \qquad (3)$$
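On an explicit toy Bayesian model, the relationship between the per-outcome information gain (Eq. 2) and its expectation (Eq. 1) can be verified directly (a sketch with made-up probabilities):

```python
import math

# Toy explicit model: two parameter settings ω and a binary label y for a fixed x.
p_omega = {0: 0.6, 1: 0.4}                         # p(ω | D_train)
p_y_given_omega = {0: [0.9, 0.1], 1: [0.3, 0.7]}   # p(y | x, ω)

# Predictive p(y | x, D_train) and updated posterior p(ω | y, x, D_train).
p_y = [sum(p_omega[w] * p_y_given_omega[w][y] for w in p_omega) for y in (0, 1)]
p_omega_given_y = {y: {w: p_omega[w] * p_y_given_omega[w][y] / p_y[y]
                       for w in p_omega} for y in (0, 1)}

def entropy(dist):
    return -sum(v * math.log(v) for v in dist.values())

# Per-outcome information gain I[Ω; y | x] = H[Ω] − H[Ω | y, x]  (cf. Eq. 2)
gain = {y: entropy(p_omega) - entropy(p_omega_given_y[y]) for y in (0, 1)}
# BALD is the expectation of the gain under the predictive  (cf. Eq. 1)
bald = sum(p_y[y] * gain[y] for y in (0, 1))
```

With the true label in hand, the CSD score (Eq. 3) simply reads off `gain[y]` for that label instead of averaging; the expectation itself equals the mutual information.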

Evaluating the Information Gain. We show how to compute the information gain for the special case of an MC dropout model with dropout rate $0.5$. Computing the information gain for complex models is not trivial as its canonical expansion requires an explicit model. However, most available Bayesian neural networks, like MC dropout models, only provide implicit models which we can sample from but which do not provide an easy way to compute their density. Moreover, to compute $H[\Omega \mid y, x, D_{train}]$ naively, we would have to perform a Bayesian inference step. However, we can rewrite:

$$\begin{aligned}
I[\Omega; y \mid x, D_{train}] &= H[\Omega \mid D_{train}] - H[\Omega, y \mid x, D_{train}] + H[y \mid x, D_{train}] \\
&= H[\Omega \mid D_{train}] + H[y \mid x, D_{train}] - \mathbb{E}_{p(\omega \mid y, x, D_{train})}\, H[\omega, y \mid x, D_{train}]. \qquad (4)
\end{aligned}$$

We can expand $H[\omega, y \mid x, D_{train}]$ to:

$$H[\omega, y \mid x, D_{train}] = H[\omega \mid D_{train}] + H[y \mid x, \omega]. \qquad (5)$$

The expectation of $H[y \mid x, \omega]$ over $p(\omega \mid y, x, D_{train})$ is just $H[y \mid \Omega, x, D_{train}]$, so we have:

$$\begin{aligned}
I[\Omega; y \mid x, D_{train}] &= \underbrace{\mathbb{E}_{p(\omega \mid D_{train})}\, H[\omega \mid D_{train}] - \mathbb{E}_{p(\omega \mid y, x, D_{train})}\, H[\omega \mid D_{train}]}_{③} \\
&\quad + \underbrace{H[y \mid x, D_{train}] - H[y \mid \Omega, x, D_{train}]}_{=\, I[y;\, \Omega \,\mid\, x,\, D_{train}]}.
\end{aligned}$$

To compute $H[y \mid \Omega, x, D_{train}]$, we use importance sampling:

$$\begin{aligned}
H[y \mid \Omega, x, D_{train}] &= \mathbb{E}_{p(\omega \mid y, x, D_{train})}\, H[y \mid \omega, x] \\
&= \mathbb{E}_{p(\omega \mid D_{train})}\, \frac{p(\omega \mid y, x, D_{train})}{p(\omega \mid D_{train})}\, H[y \mid \omega, x] \\
&= \mathbb{E}_{p(\omega \mid D_{train})}\, \frac{p(y \mid x, \omega)}{p(y \mid x, D_{train})}\, H[y \mid \omega, x].
\end{aligned}$$
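The importance-sampling identity above can be checked on a toy explicit model: sampling $\omega$ from $p(\omega \mid D_{train})$ and reweighting by $p(y \mid x, \omega) / p(y \mid x, D_{train})$ recovers the exact posterior expectation (a sketch with arbitrary numbers):

```python
import math
import random

p_omega = {0: 0.6, 1: 0.4}                         # p(ω | D_train)
p_y_given_omega = {0: [0.9, 0.1], 1: [0.3, 0.7]}   # p(y | x, ω)
y = 1
p_y = sum(p_omega[w] * p_y_given_omega[w][y] for w in p_omega)  # p(y | x, D_train)

def surprisal(w):
    """H[y | ω, x] = h(p(y | x, ω))."""
    return -math.log(p_y_given_omega[w][y])

# Exact expectation under the updated posterior p(ω | y, x, D_train).
exact = sum(p_omega[w] * p_y_given_omega[w][y] / p_y * surprisal(w)
            for w in p_omega)

# Importance-sampled estimate using only samples from p(ω | D_train).
random.seed(0)
samples = random.choices(list(p_omega), weights=list(p_omega.values()), k=20000)
estimate = sum(p_y_given_omega[w][y] / p_y * surprisal(w) for w in samples) / 20000
```

No extra Bayesian inference step is needed: the weights only involve the model likelihood and the predictive, both of which are available from posterior samples.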

Finally, if we use Monte-Carlo dropout with dropout rate $0.5$ to obtain a variational model distribution $q(\omega) \approx p(\omega \mid D_{train})$, we have $h(q(\omega)) = \text{const}$ because every dropout mask is equally likely, and we can approximate the difference of expectations ③ as:

$$\begin{aligned}
&\mathbb{E}_{p(\omega \mid D_{train})}\, H[\omega \mid D_{train}] - \mathbb{E}_{p(\omega \mid y, x, D_{train})}\, H[\omega \mid D_{train}] \\
&\quad \approx \mathbb{E}_{p(\omega \mid D_{train})}\, h(q(\omega)) - \mathbb{E}_{p(\omega \mid y, x, D_{train})}\, h(q(\omega)) = h(q(\omega)) - h(q(\omega)) = 0.
\end{aligned}$$

In this special case, we indeed have $I[\Omega; y \mid x, D_{train}] \approx I[y; \Omega \mid x, D_{train}]$: we can use the surprise to approximate the information gain. However, as we will see below, this approximation is brittle.

## 4 Experiments

MNIST. We implement CSD and evaluate it on MNIST to show that it can identify a core-set of training samples that achieves high accuracy and low loss. Because we compute the information gain using the provided labels, CSD is very sensitive to mislabeled samples: if a sample is mislabeled, it will necessarily have a very high information gain and the model will become misspecified. For this reason, we train a LeNet ensemble with 5 models on MNIST and discard all samples whose predictive entropy exceeds a threshold in nats and whose labels do not match the predictions. This removes about 5678 samples from the training set.

We use a LeNet model (LeCun et al., 1998) with MC dropout (dropout rate $0.5$) in the core-set setting, where we have access to labels but otherwise use an active learning setup with greedy acquisition. We use individual acquisition and compare to BALD, which does not make use of label information and which we use as a sanity baseline. The training regime follows the one described in Kirsch et al. (2019).

Figure 1 shows that CSD strongly outperforms BALD on MNIST (both with individual acquisitions). Indeed, only 58 samples are required to reach 90% accuracy on average, and 111 samples for 95% accuracy, compared to BALD, which needs about 30 samples more in each case; see also Table 1. We show an ablation of CSD without removing mislabeled or ambiguous samples from the training set in Figure 2.

CIFAR-10. However, we cannot reproduce these results on a similarly cleaned CIFAR-10 with ResNet18 models and MC dropout. BALD performs much better than CSD, even when cold-starting. The accuracy plot is depicted in Figure 3. This indicates that something is wrong; we have not been able to identify the issue yet.

BatchCSD. Finally, we examine an extension of CSD to the batch case following Kirsch et al. (2019) and compute the batch information gain using the surprise approximation from above. However, even for a batch acquisition size of 5, this approximation already does not work well, as depicted in Figure 4 on MNIST: BatchCSD performs worse than uniform acquisition for the first acquired samples and worse than BALD at 150 samples.

## 5 Conclusion & Limitations

In addition to a unified notation for information-theoretic quantities for both random variables and outcomes, we have unified the information gain and the surprise by defining the mutual information between outcomes and random variables appropriately. We have used this notation to reinterpret BALD as an expected information gain and found an approximation for the information gain, which allowed us to introduce CSD and show that it works on MNIST. However, we have not been able to obtain good results on CIFAR-10 or to successfully extend our approximation to the batch case. Moreover, the approximation we have used only works for MC dropout with dropout rate $0.5$; otherwise, our approach requires an explicit model. Importantly, unlike BALD, the information gain in CSD also does not seem to be submodular, so we cannot infer optimality guarantees that way (Kirsch et al., 2019), although BALD’s submodularity and optimality are not tied to its generalization loss anyway.

## References

• Butts [2003] Daniel A Butts. How much information is associated with a particular stimulus? Network: Computation in Neural Systems, 14(2):177–187, 2003.
• Church and Hanks [1990] Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.
• Cover and Thomas [2006] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, USA, 2006. ISBN 0471241954.
• DeWeese and Meister [1999] Michael R DeWeese and Markus Meister. How to measure the information gained from one symbol. Network: Computation in Neural Systems, 10(4):325–340, 1999.
• Fano [1961] Robert M Fano. Transmission of information: A statistical theory of communications. American Journal of Physics, 29(11):793–794, 1961.
• Gal and Ghahramani [2016] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059. PMLR, 2016.
• Gal et al. [2017] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep bayesian active learning with image data. In International Conference on Machine Learning, pages 1183–1192. PMLR, 2017.
• Houlsby et al. [2011] Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011.
• Kirsch et al. [2019] Andreas Kirsch, Joost van Amersfoort, and Yarin Gal. Batchbald: Efficient and diverse batch acquisition for deep bayesian active learning. In Advances in Neural Information Processing Systems, pages 7024–7035, 2019.
• Kirsch et al. [2020] Andreas Kirsch, Clare Lyle, and Yarin Gal. Unpacking information bottlenecks: Unifying information-theoretic objectives in deep learning. arXiv preprint arXiv:2003.12537, 2020.
• LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
• Lindley [1956] Dennis V Lindley. On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, pages 986–1005, 1956.
• Williams [2011] Paul L Williams. Information dynamics: Its theory and application to embodied cognitive systems. PhD thesis, PhD thesis, Indiana University, 2011.
• Xu et al. [2020] Yilun Xu, Shengjia Zhao, Jiaming Song, Russell Stewart, and Stefano Ermon. A theory of usable information under computational constraints. arXiv preprint arXiv:2002.10689, 2020.
• Yeung [2008] R.W. Yeung. Information Theory and Network Coding. Information Technology: Transmission, Processing and Storage. Springer US, 2008. ISBN 9780387792347.