A Practical Unified Notation for Information-Theoretic Quantities in ML

06/22/2021 ∙ by Andreas Kirsch, et al.

Information theory is of importance to machine learning, but the notation for information-theoretic quantities is sometimes opaque. The right notation can convey valuable intuitions and concisely express new ideas. We propose such a notation for machine learning users and expand it to include information-theoretic quantities between events (outcomes) and random variables. We apply this notation to a popular information-theoretic acquisition function in Bayesian active learning which selects the most informative (unlabelled) samples to be labelled by an expert. We demonstrate the value of our notation when extending the acquisition function to the core-set problem, which consists of selecting the most informative samples given the labels.


1 Introduction

Information theory has provided the foundations for many recent advances in deep learning: for example, information bottlenecks inform objectives in both supervised and unsupervised learning of high-dimensional data. However, the employed notation can be ambiguous for the more complex expressions found in applied settings, as information theory is used by researchers from different backgrounds (statistics, computer science, and information engineering, to name a few). For example,

$\mathrm{H}(p, q)$ is sometimes used to denote the cross-entropy between $p$ and $q$, which conflicts with the common notation $\mathrm{H}(X, Y)$ for the joint entropy of $X$ and $Y$; or it is not clarified that the conditional entropy $\mathrm{H}(X \mid Y)$ of $X$ given $Y$ is an expectation over $Y$. A good notation, on the other hand, can convey valuable intuitions and concisely express new ideas, which we consider the first step towards new advances.

In addition to disambiguating the commonly used notation, we propose that an extension of information-theoretic quantities to relations between events (outcomes) and random variables can be of great use in machine learning. The mutual information $\mathrm{I}[X; Y]$ is only defined for random variables $X$ and $Y$, or as point-wise information (DeWeese and Meister, 1999) for two outcomes $x$ and $y$. A natural extension is to consider the mutual information between an outcome $x$ and a random variable $Y$. Why is this not more commonly known outside the fields of neuroscience and the cognitive sciences? We find a consistent extension for such cases which combines two previously separate quantities. As an application, we examine the core-set problem, which consists of selecting the most informative samples of a training set given the labels, and provide new results.

Figure 1: CSD vs BALD vs uniform acquisition on MNIST after ambiguous and mislabeled training samples have been dropped from the training set. CSD requires only 58 samples to reach 90% accuracy compared to 91 samples for BALD. 5 trials each.

Concretely, we examine BALD (Bayesian Active Learning by Disagreement), an acquisition function in Bayesian active learning, and extend it to the core-set problem. In active learning, we have access to a huge reservoir of unlabelled data in a pool set and iteratively select samples from this pool set to be labeled by an oracle (e.g. human experts) to increase the model performance as quickly as possible. Acquisition functions are used to score all pool samples and the highest scorer is acquired. The goal of an acquisition function is to score the most informative samples the highest. BALD (Gal et al., 2017; Houlsby et al., 2011) is such an acquisition function. It maximizes the expected information gain of the model parameters given the prediction variable for a candidate sample from the pool set. It is immediately related to the concept of reduction in posterior uncertainty known from Bayesian optimal experimental design (Lindley, 1956).
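As a point of reference, the expected information gain that BALD maximizes is commonly written as follows; the symbol choices here are ours ($\Theta$ for the model parameters, $Y$ for the prediction variable, $x$ for the candidate input, and $\mathcal{D}^{\text{train}}$ for the current training set):

$$a_{\text{BALD}}(x) = \mathrm{I}[\Theta; Y \mid x, \mathcal{D}^{\text{train}}] = \mathrm{H}[Y \mid x, \mathcal{D}^{\text{train}}] - \mathbb{E}_{\mathrm{p}(\theta \mid \mathcal{D}^{\text{train}})}\,\mathrm{H}[Y \mid x, \theta],$$

that is, the mutual information between the prediction and the parameters, or equivalently the expected reduction in parameter uncertainty from observing the label of $x$.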

The core-set problem, on the other hand, consists of identifying the most informative samples given the labels, the core set, such that a model trained on this core set performs as well as a model trained on the whole dataset.

In this paper, we introduce Core-Set by Disagreement (CSD), which, using our notation, maximizes $\mathrm{I}[\Theta; y \mid x]$, the information gain of the model parameters given the true label $y$ of a sample $x$ in the dataset. Figure 1 shows that it strongly outperforms BALD on MNIST.

2 Information Theory Notation

For a general introduction to information theory, we refer to Cover and Thomas (2006) and Yeung (2008). In the following, we introduce a disambiguated notation. We start with notation that is explicit about the probability distribution.

Definition 2.1.

Let Shannon’s information content, cross-entropy, entropy, and KL divergence be defined for a probability distribution and a non-negative function as:

where we use the random variable (upper-case) to make clear which random variable the expectation is over.
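For concreteness, writing $p$ for the probability distribution and $q$ for the non-negative function (these symbol names are our choice), the standard forms of these quantities are:

$$\begin{aligned}
\mathrm{h}_p(x) &:= -\log p(x) && \text{(Shannon's information content)}\\
\mathrm{H}(p \,\Vert\, q) &:= \mathbb{E}_{X \sim p}\left[-\log q(X)\right] && \text{(cross-entropy)}\\
\mathrm{H}(p) &:= \mathrm{H}(p \,\Vert\, p) = \mathbb{E}_{X \sim p}\left[-\log p(X)\right] && \text{(entropy)}\\
\mathrm{D}_{\mathrm{KL}}(p \,\Vert\, q) &:= \mathrm{H}(p \,\Vert\, q) - \mathrm{H}(p) = \mathbb{E}_{X \sim p}\left[\log \tfrac{p(X)}{q(X)}\right] && \text{(KL divergence)}
\end{aligned}$$

with each expectation taken over the upper-case random variable $X \sim p$, matching the convention stated above.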

Moreover, when focusing on variational aspects of the cross-entropy and when the true probability distribution is understood, we use the following notation (Kirsch et al., 2020; Xu et al., 2020):

Definition 2.2.

When the true probability distribution is understood from context, we will use the following short-hands:

When we have a parameterized family of variational distributions, we will simplify the notation further when the context is clear.

The main motivation for this notation is the following lemma:

Lemma 2.3.

We have the following lower bound for the cross-entropy and the KL divergence:

Proof.

This follows directly from the non-negativity of the KL divergence for densities. ∎
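One way to state the bound, with $p$ the probability distribution and $q$ the non-negative function as before (again our symbol choices):

$$\mathrm{H}(p \,\Vert\, q) \;\geq\; \mathrm{H}(p) \quad\Longleftrightarrow\quad \mathrm{D}_{\mathrm{KL}}(p \,\Vert\, q) \;\geq\; 0,$$

with equality if and only if $p = q$ (almost everywhere) when $q$ is a normalised distribution. The cross-entropy is thus a variational upper bound on the entropy, which is what motivates the short-hand notation for variational distributions.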

These definitions extend trivially to joints of random variables and to conditionals by substituting the random variable over the product space or the conditional random variable, respectively.

We can further canonically extend the definitions to tie random variables to specific outcomes: if we mix (untied) random variables and tied random variables (outcomes), we define the entropy as an operator that takes an expectation of Shannon’s information content for the given expression over the random variables, conditioned on the outcomes. The motivation is that we maintain the usual chain-rule identities. This leads to the following definition:

Definition 2.4.

Given random variables and an outcome, we define:

where we use a shortened notation for the tied outcome.

The intuition behind all this is that, e.g., such a quantity measures the average length of transmitting the random variable and the outcome together when the outcome is unbeknownst to the sender and receiver.
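As an illustration of this convention, assuming random variables $X$ and $Y$ with a tied outcome $y$ of $Y$ (the concrete symbols are our choice):

$$\mathrm{H}[X, y] := \mathbb{E}_{\mathrm{p}(x \mid y)}\left[-\log \mathrm{p}(x, y)\right], \qquad \mathrm{H}[X \mid y] := \mathbb{E}_{\mathrm{p}(x \mid y)}\left[-\log \mathrm{p}(x \mid y)\right], \qquad \mathrm{H}[y] := -\log \mathrm{p}(y),$$

so that the usual chain rule $\mathrm{H}[X, y] = \mathrm{H}[X \mid y] + \mathrm{H}[y]$ is preserved, mirroring $\mathrm{H}[X, Y] = \mathrm{H}[X \mid Y] + \mathrm{H}[Y]$.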

The mutual information and point-wise mutual information (Fano, 1961; Church and Hanks, 1990) are defined as:

Definition 2.5.

For random variables and their respective outcomes, the point-wise mutual information and the mutual information are:
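In standard form, assuming symbols $X$, $Y$ for the random variables and $x$, $y$ for their outcomes (our choice), these read:

$$\mathrm{I}[x; y] := \log \frac{\mathrm{p}(x, y)}{\mathrm{p}(x)\,\mathrm{p}(y)}, \qquad \mathrm{I}[X; Y] := \mathbb{E}_{\mathrm{p}(x, y)}\big[\mathrm{I}[X; Y]\big] = \mathrm{H}[X] + \mathrm{H}[Y] - \mathrm{H}[X, Y].$$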

This is similarly extended to conditional quantities and so on. There are two common, sensible quantities we can define when we want to consider the information overlap between a random variable and an outcome: the information gain and the surprise (DeWeese and Meister, 1999; Butts, 2003). The two quantities are usually defined separately in the cognitive sciences and neuroscience (Williams, 2011); however, we can unify them by relaxing the symmetry of the mutual information:

Definition 2.6.

Given random variables and an outcome for one of them, we define the information gain and the surprise as:
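Concretely, assuming the operator from Definition 2.4 and writing $X$, $Y$ for the random variables with outcome $y$ of $Y$ (our symbol choices), a natural way to write the two quantities is:

$$\mathrm{I}[X; y] := \mathrm{H}[X] - \mathrm{H}[X \mid y] \quad \text{(information gain)}, \qquad \mathrm{I}[y; X] := \mathrm{H}[y] - \mathrm{H}[y \mid X] \quad \text{(surprise)}.$$

Under these definitions, the surprise equals $\mathrm{D}_{\mathrm{KL}}(\mathrm{p}(x \mid y) \,\Vert\, \mathrm{p}(x))$ and is therefore non-negative, while the information gain can be negative.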

This unifying definition is novel to the best of our knowledge. The information gain measures how much the entropy (uncertainty) of the random variable reduces when we learn the outcome, while the surprise measures how many additional nats we need to transmit the outcome when we already know the random variable. The subtle difference becomes clearer when we note that the information gain can be chained while the surprise cannot:

Lemma 2.7.

Given random variables and corresponding outcomes:

Proof.

The chain rule holds exactly for the information gain, while for the surprise the two terms take expectations over different conditional distributions, which do not recombine. ∎
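In the notation assumed above, the distinction can be sketched as follows (with $X$ a random variable and $y$, $z$ outcomes; symbols ours):

$$\mathrm{I}[X; y, z] = \mathrm{H}[X] - \mathrm{H}[X \mid y, z] = \mathrm{I}[X; y] + \mathrm{I}[X; z \mid y],$$

whereas

$$\mathrm{I}[y, z; X] = \mathrm{H}[y, z] - \mathbb{E}_{\mathrm{p}(x \mid y, z)}\left[-\log \mathrm{p}(y, z \mid x)\right] \neq \mathrm{I}[y; X] + \mathrm{I}[z; X \mid y] \;\text{in general},$$

because the surprise terms on the right-hand side take their expectations over $\mathrm{p}(x \mid y)$ and $\mathrm{p}(x \mid y, z)$ respectively.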

However, both quantities do chain for random variables:

Lemma 2.8.

Given random variables and an outcome for one of them:

We can extend this to triple mutual information terms by adopting the multivariate extension (Yeung, 2008) for outcomes as well, and so on.
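A sketch of this extension, following Yeung's definition of the multivariate mutual information and our symbol conventions:

$$\mathrm{I}[X; Y; Z] := \mathrm{I}[X; Y] - \mathrm{I}[X; Y \mid Z], \qquad \mathrm{I}[X; Y; z] := \mathrm{I}[X; Y] - \mathrm{I}[X; Y \mid z],$$

and analogously when further arguments are tied to outcomes.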

Lemma 2.9.

For random variables, we have the following identities:
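One set of identities consistent with the quantities as sketched above (a hedged reading in our notation):

$$\mathrm{I}[X; Y] = \mathbb{E}_{\mathrm{p}(y)}\big[\mathrm{I}[X; y]\big] = \mathbb{E}_{\mathrm{p}(y)}\big[\mathrm{I}[y; X]\big],$$

that is, both the information gain and the surprise average back to the mutual information.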

3 Example Application: Applying Bayesian Active Learning to the Core-Set Problem

We start by briefly revisiting Bayesian Deep Learning and Active Learning before introducing our Core-Set by Disagreement acquisition function.

Bayesian Modelling. The model parameters are treated as a random variable with a prior distribution. We denote the training set as a set of input samples and their labels (targets).

The probabilistic model is as follows:

where the lower-case letters denote outcomes of the random variables for the input and the label, respectively.

To include multiple labels and inputs, we expand the model to joints of random variables.

We are only interested in discriminative models and thus do not explicitly specify the input distribution.
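A sketch of the model just described, with $\theta$ for the parameters and assuming the labels are conditionally independent given the inputs and parameters, as is standard for discriminative models (symbol choices are ours):

$$\mathrm{p}(y, \theta \mid x) = \mathrm{p}(y \mid x, \theta)\,\mathrm{p}(\theta), \qquad \mathrm{p}(y_1, \ldots, y_n, \theta \mid x_1, \ldots, x_n) = \mathrm{p}(\theta)\prod_{i=1}^{n} \mathrm{p}(y_i \mid x_i, \theta).$$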

The posterior parameter distribution is determined via Bayesian inference: we obtain it using Bayes’ law, which in turn allows for predictions by marginalizing over the parameters.

Exact Bayesian inference is intractable for complex models, so we use variational inference for approximate inference with a variational distribution, which we determine by minimizing its KL divergence to the posterior.
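The corresponding inference steps, again as a hedged sketch in our notation with training set $\mathcal{D}^{\text{train}}$ and variational distribution $\mathrm{q}(\theta)$:

$$\mathrm{p}(\theta \mid \mathcal{D}^{\text{train}}) = \frac{\mathrm{p}(\mathcal{D}^{\text{train}} \mid \theta)\,\mathrm{p}(\theta)}{\mathrm{p}(\mathcal{D}^{\text{train}})}, \qquad \mathrm{p}(y \mid x, \mathcal{D}^{\text{train}}) = \mathbb{E}_{\mathrm{p}(\theta \mid \mathcal{D}^{\text{train}})}\big[\mathrm{p}(y \mid x, \theta)\big], \qquad \mathrm{q}^{*} = \arg\min_{\mathrm{q}} \mathrm{D}_{\mathrm{KL}}\big(\mathrm{q}(\theta)\,\Vert\,\mathrm{p}(\theta \mid \mathcal{D}^{\text{train}})\big).$$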

For Bayesian deep learning models, we use the local reparameterization trick and Monte-Carlo dropout as the variational distribution (Gal and Ghahramani, 2016).

Active Learning. In active learning, we have access to an unlabelled pool set. We iteratively select samples from the pool set and ask an oracle for labels. We then add those newly labeled samples to the training set, retrain our model, and repeat this process until the model satisfies our performance requirements.

Samples are usually acquired in batches instead of individually. We score candidate acquisition batches of acquisition batch size b using an acquisition function and pick the highest-scoring one:

BALD was originally introduced as a one-sample acquisition function: the expected information gain between the prediction for a candidate input and the model parameters. In BatchBALD (Kirsch et al., 2019), this one-sample case was canonically extended to the batch acquisition case using the expected information gain between the joint of the predictions for the batch candidates and the model parameters.
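For reference, the selection rule and the batch acquisition score can be sketched as follows, with $\mathcal{D}^{\text{pool}}$ for the pool set, prediction variables $Y_1, \ldots, Y_b$ for the batch candidates, and $a_{\text{BALD}}$ as introduced above (symbols are ours):

$$\{x_1^{*}, \ldots, x_b^{*}\} = \arg\max_{\{x_1, \ldots, x_b\} \subseteq \mathcal{D}^{\text{pool}}} a\big(\{x_1, \ldots, x_b\}\big), \qquad a_{\text{BatchBALD}}\big(\{x_1, \ldots, x_b\}\big) = \mathrm{I}[\Theta; Y_1, \ldots, Y_b \mid x_1, \ldots, x_b, \mathcal{D}^{\text{train}}].$$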

Notation. To cut down on notation, we will use short-hands for these expressions. Like above, all terms can be canonically extended to sets by substituting the joint. We provide the full derivations in the appendix. Also note again that lower-case variables are outcomes and upper-case variables are random variables, with the exception of the datasets, which are sets of outcomes.

3.1 BALD Core-Set by Disagreement

We examine BALD through the lens of our new notation and develop CSD as an information gain. First, we note that BALD does not optimize for the loss (cross-entropy) on the test distribution to become minimal: it does not try to pick samples whose labels would minimize the generalization loss.

BALD maximizes the expected information gain. We assume that our Bayesian model contains the true generating model parameters; by selecting samples that minimize the posterior uncertainty, the model parameters will converge towards these true parameters as more samples are acquired.

BALD as an Approximation. BALD, as the expected information gain, is just an expectation of the information gain under the current model’s predictions for the label:

(1)

where, following the definition:

(2)
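A plausible reading of equations (1) and (2), in the notation assumed above:

$$\mathrm{I}[\Theta; Y \mid x, \mathcal{D}^{\text{train}}] = \mathbb{E}_{\mathrm{p}(y \mid x, \mathcal{D}^{\text{train}})}\big[\mathrm{I}[\Theta; y \mid x, \mathcal{D}^{\text{train}}]\big], \qquad \mathrm{I}[\Theta; y \mid x, \mathcal{D}^{\text{train}}] = \mathrm{H}[\Theta \mid \mathcal{D}^{\text{train}}] - \mathrm{H}[\Theta \mid x, y, \mathcal{D}^{\text{train}}].$$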

That is, we can view BALD as weighting the information gains for different labels by the current model’s belief that each label is correct. If we had access to the label or a better proxy distribution for the label, we could improve on this. This could in particular help with the cold-starting problem in active learning (when one starts training with no initial training set and the model predictions are not trustworthy at all). When we have access to the labels, we can directly use the information gain and select samples using a Core-Set by Disagreement acquisition function:

(3)
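In our assumed notation, the CSD acquisition function of equation (3) would then read:

$$a_{\text{CSD}}(x, y) = \mathrm{I}[\Theta; y \mid x, \mathcal{D}^{\text{train}}],$$

scored over the labelled samples $(x, y)$ of the training set rather than over unlabelled pool samples.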

Evaluating the Information Gain. We show how to compute the information gain for the special case of an MC dropout model with a particular dropout rate. Computing the information gain for complex models is not trivial as its canonical expansion requires an explicit model. However, most available Bayesian neural networks, like MC dropout models, only provide implicit models which we can sample from but which do not offer an easy way to compute their density. Moreover, computing the information gain naively would require a Bayesian inference step. However, we can rewrite:

(4)

We can expand this to:

(5)

Taking the expectation of (2), we then have:

To compute this term, we use importance sampling:

Finally, if we use Monte-Carlo dropout with a particular dropout rate to obtain a variational model distribution, we can approximate (3) as:

In this special case, the information gain and the surprise indeed coincide, and we can use the surprise to approximate the information gain. However, as we will see below, this approximation is brittle.
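One way to see the reduction mentioned above: by Bayes’ law, $-\log \mathrm{p}(\theta \mid x, y) = -\log \mathrm{p}(y \mid x, \theta) - \log \mathrm{q}(\theta) + \log \mathrm{p}(y \mid x)$; when every parameter sample is equally likely under the variational distribution $\mathrm{q}$, the $-\log \mathrm{q}(\theta)$ contribution cancels against $\mathrm{H}[\Theta]$, and the information gain reduces to the surprise $\mathrm{I}[y; \Theta \mid x] = \mathrm{H}[y \mid x] - \mathrm{H}[y \mid \Theta, x]$. The exact estimator is not fully recoverable from the derivation above, but a minimal sketch of a surprise estimate via self-normalised importance sampling over MC dropout samples could look as follows (the function name and the weighting scheme are our assumptions, not the paper's reference implementation):

```python
import numpy as np

def surprise_estimate(log_probs_y_given_theta: np.ndarray) -> float:
    """Estimate the surprise I[y; Theta | x] = H[y | x] - H[y | Theta, x]
    from K Monte-Carlo dropout samples theta_1..theta_K ~ q(theta).

    log_probs_y_given_theta: log p(y | x, theta_k) for the observed label y,
    one entry per dropout sample.
    """
    log_p = np.asarray(log_probs_y_given_theta, dtype=np.float64)
    K = log_p.shape[0]

    # H[y | x] ~= -log( (1/K) sum_k p(y | x, theta_k) )  (marginal predictive)
    log_marginal = np.logaddexp.reduce(log_p) - np.log(K)
    h_y_given_x = -log_marginal

    # H[y | Theta, x] = E_{p(theta | x, y)}[-log p(y | x, theta)].
    # With proposal q(theta) and unnormalised weights p(y | x, theta_k),
    # self-normalised importance sampling gives:
    weights = np.exp(log_p - np.logaddexp.reduce(log_p))  # normalised weights
    h_y_given_theta_x = np.sum(weights * (-log_p))

    return h_y_given_x - h_y_given_theta_x
```

Under the restriction mentioned above (all parameter samples equally likely under $\mathrm{q}$), this surprise estimate would serve as the CSD score.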

4 Experiments

MNIST. We implement CSD and evaluate it on MNIST to show that it can identify a core-set of training samples that achieves high accuracy and low loss. Because we compute the information gain using the provided labels, CSD is very sensitive to mislabeled samples: if a sample is mislabeled, it will necessarily have a very high information gain and the model will become misspecified. For this reason, we train a LeNet ensemble with 5 models on MNIST and discard all samples whose predictive entropy exceeds a threshold (in nats) and whose labels do not match the ensemble’s predictions. This removes about 5678 samples from the training set.
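A sketch of this filtering step; the entropy threshold is not specified above, so it is left as a parameter, and we read the "and" as requiring both conditions (function name and exact criterion are our assumptions):

```python
import numpy as np

def keep_mask(ensemble_probs: np.ndarray, labels: np.ndarray, entropy_threshold: float) -> np.ndarray:
    """Return a boolean mask of training samples to keep.

    ensemble_probs: [N, C] mean predictive probabilities of the LeNet ensemble.
    labels: [N] integer class labels.
    entropy_threshold: maximum allowed predictive entropy in nats.
    """
    eps = 1e-12
    entropy = -np.sum(ensemble_probs * np.log(ensemble_probs + eps), axis=-1)
    predictions = ensemble_probs.argmax(axis=-1)
    ambiguous = entropy >= entropy_threshold
    mislabeled = predictions != labels
    # Drop samples that are both ambiguous (high entropy) and whose labels
    # disagree with the ensemble prediction; keep everything else.
    return ~(ambiguous & mislabeled)
```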

We use a LeNet model (LeCun et al., 1998) with MC dropout in the core-set setting, where we have access to labels but otherwise use an active learning setup with greedy acquisition. We use individual acquisition and compare to BALD, which does not make use of label information and which we use as a sanity baseline. The training regime follows the one described in Kirsch et al. (2019).

Acquisition Function    90% Acc        95% Acc
Uniform                 125/130/150
BALD                    88/91/99       130/145/167
CSD (ours)              55/58/58       105/111/115

Table 1: 25%/50%/75% quantiles of the number of acquired samples needed to reach 90% and 95% accuracy on MNIST. 5 trials each.

Figure 1 shows that CSD strongly outperforms BALD on MNIST (both with individual acquisitions). Indeed, only 58 samples are required to reach 90% accuracy at the median and 111 samples for 95% accuracy, compared to BALD, which needs about 30 samples more in each case; see also Table 1. We show an ablation of CSD without removing mislabeled or ambiguous samples from the training set in Figure 2.

Figure 2: Ablation with ambiguous and mislabeled training samples included: CSD vs BALD vs uniform acquisition on MNIST. CSD performs worse than uniform acquisition. 5 trials each.

CIFAR-10. However, we cannot reproduce the same results on cleaned CIFAR-10 (cleaned similarly to the MNIST setup described above) with ResNet18 models and MC dropout. BALD performs much better than CSD, even when cold starting. The accuracy plot is depicted in Figure 3. This indicates that something is wrong; we have not yet been able to identify the issue.

BatchCSD. Finally, we examine an extension of CSD to the batch case following Kirsch et al. (2019) and compute the scores using an approximation. However, even for a batch acquisition size of 5, this approximation already does not work well, as depicted in Figure 4 on MNIST. BatchCSD performs worse than uniform acquisition for small numbers of acquired samples and worse than BALD at 150 samples.

Figure 3: CSD vs BALD on CIFAR-10 after ambiguous and mislabeled training samples have been removed. CSD performs worse than BALD. 5 trials each.
Figure 4: Ablation: BatchCSD vs CSD (vs BALD vs uniform acquisition) on MNIST. BatchCSD performs worse than BALD. 5 trials each.

5 Conclusion & Limitations

In addition to a unified notation for information-theoretic quantities for both random variables and outcomes, we have unified the information gain and the surprise by defining the mutual information appropriately. We have used this notation to reinterpret BALD as an expected information gain and found an approximation for the information gain, which allowed us to introduce CSD and show that it works on MNIST. However, we have not been able to obtain good results on CIFAR-10 or to successfully extend our approximation to the batch case. Moreover, the approximation we have used only works for MC dropout with a particular dropout rate; otherwise, our approach requires an explicit model. Importantly, unlike BALD, the information gain in CSD does not seem to be submodular, so we cannot infer an optimality guarantee that way (Kirsch et al., 2019), although BALD's submodularity and optimality guarantee are not tied to its generalization loss anyway.

References