Pareto-optimal data compression for binary classification tasks
The goal of lossy data compression is to reduce the storage cost of a data set X while retaining as much information as possible about something (Y) that you care about. For example, what aspects of an image X contain the most information about whether it depicts a cat? Mathematically, this corresponds to finding a mapping X→ Z≡ f(X) that maximizes the mutual information I(Z,Y) while the entropy H(Z) is kept below some fixed threshold. We present a method for mapping out the Pareto frontier for classification tasks, reflecting the tradeoff between retained entropy and class information. We first show how a random variable X (an image, say) drawn from a class Y∈{1,...,n} can be distilled into a vector W=f(X)∈R^{n-1} losslessly, so that I(W,Y)=I(X,Y); for example, for a binary classification task of cats and dogs, each image X is mapped into a single real number W retaining all information that helps distinguish cats from dogs. For the n=2 case of binary classification, we then show how W can be further compressed into a discrete variable Z=g_β(W)∈{1,...,m_β} by binning W into m_β bins, in such a way that varying the parameter β sweeps out the full Pareto frontier, solving a generalization of the Discrete Information Bottleneck (DIB) problem. We argue that the most interesting points on this frontier are "corners" maximizing I(Z,Y) for a fixed number of bins m=2,3,..., which can conveniently be found without multiobjective optimization. We apply this method to the CIFAR-10, MNIST and Fashion-MNIST datasets, illustrating how it can be interpreted as an information-theoretically optimal image clustering algorithm.
A core challenge in science, and in life quite generally, is data distillation: keeping only a manageably small fraction of our available data X while retaining as much information as possible about something (Y) that we care about. For example, what aspects of an image X contain the most information about whether it depicts a cat (Y=1) rather than a dog (Y=2)? Mathematically, this motivates finding a mapping X→ Z≡ f(X) that maximizes the mutual information I(Z,Y) while the entropy H(Z) is kept below some fixed threshold. The tradeoff between H(Z) (bits stored) and I(Z,Y) (useful bits) is described by a Pareto frontier, defined as
$$I_*(H_*) \equiv \sup_{\{f:\,H[f(X)]\le H_*\}} I[f(X),Y] \tag{1}$$
and illustrated in Figure 1 (this is for a toy example described below; we compute the Pareto frontier for our cat/dog example in Section III). The shaded region is impossible because I(Z,Y)≤I(X,Y) and I(Z,Y)≤H(Z). The colored dots correspond to random likelihood binnings into various numbers of bins, as described in the next section, and the upper envelope of all attainable points defines the Pareto frontier. Its “corners”, which are marked by black dots and maximize I(Z,Y) for m bins (m=1,2,...), are seen to lie close to the vertical dashed lines H(Z)=log m, corresponding to all bins having equal size. We plot the H-axis flipped to conform with the tradition that up and to the right are more desirable.
The core goal of this paper is to present a method for computing such Pareto frontiers.
This Pareto frontier challenge is part of the broader quest for data distillation: lossy data compression that retains as much as possible of the information that is useful to us. Ideally, the information can be partitioned into a set of independent chunks and sorted from most to least useful, enabling us to select the number of chunks to retain so as to optimize our tradeoff between utility and data size. Consider two random variables X and Y which may each be vectors or scalars. For simplicity, consider them to be discrete with finite entropy. (Footnote 1: The discreteness restriction loses us no generality in practice, since we can always discretize real numbers by rounding them to some very large number of significant digits.) For prediction tasks, we might interpret Y as the future state of a dynamical system that we wish to predict from the present state X. For classification tasks, we might interpret Y as a class label that we wish to predict from an image, sound, video or text string X. Let us now consider various forms of ideal data distillation, as summarized in Table 1.
Random vectors | What is distilled? | Probability distribution: Gaussian | Probability distribution: Non-Gaussian |
---|---|---|---|
1 | Entropy | PCA | Autoencoder |
2 | Mutual information | CCA | Latent reps |
Table 1: Data distillation: the relationship between Principal Component Analysis (PCA), Canonical Correlation Analysis (CCA), nonlinear autoencoders and nonlinear latent representations.
If we distill X as a whole, then we would ideally like to find a function f such that the so-called latent representation Z=f(X) retains the full entropy H(X), decomposed into independent parts with vanishing mutual information. (Footnote 2: When implementing any distillation algorithm in practice, there is always a one-parameter tradeoff between compression and information retention which defines a Pareto frontier. A key advantage of the latent variables (or variable pairs) being statistically independent is that this allows the Pareto frontier to be trivially computed, by simply sorting them by decreasing information content and varying the number retained.) For the special case where X is a vector with a multivariate Gaussian distribution, the optimal solution is Principal Component Analysis (PCA) Pearson (1901), which has long been a workhorse of statistical physics and many other disciplines: here f is simply a linear function mapping X into the eigenbasis of the covariance matrix of X. The general case remains unsolved, and it is easy to see that it is hard: if where implements some state-of-the-art cryptographic code, then finding f (to recover the independent pieces of information and discard the useless parts) would generically require breaking the code. Great progress has nonetheless been made for many special cases, using techniques such as nonlinear autoencoders Vincent et al. (2008) and Generative Adversarial Networks (GANs) Goodfellow et al. (2014).

Now consider the case where we wish to distill X and Y separately, into and , retaining the mutual information between the two parts. Then we ideally have , This problem has attracted great interest, especially for time series where and for some sequence of states () in physics or other fields, where one typically maps the state vectors into some lower-dimensional vectors , after which the prediction is carried out in this latent space. For the special case where has a multivariate Gaussian distribution, the optimal solution is Canonical Correlation Analysis (CCA) Hotelling (1936): here both and
are linear functions, computed via a singular-value decomposition (SVD) Eckart and Young (1936) of the cross-correlation matrix after prewhitening X and Y. The general case remains unsolved, and is obviously even harder than the above-mentioned 1-vector autoencoding problem. The recent works Oord et al. (2018); Clark et al. (2019) review the state of the art and present Contrastive Predictive Coding and Dynamic Component Analysis, powerful new distillation techniques for time series, following the long tradition of setting even though this is provably not optimal for the Gaussian case as shown in Tegmark (2019).

The goal of this paper is to make progress in the lower right quadrant of Table 1. We will first show that if Y∈{1,2}
(as in binary classification tasks) and we can successfully train a classifier that correctly predicts the conditional probability distribution P(Y|X), then it can be used to provide an exact solution to the distillation problem, losslessly distilling X into a single real variable W=f(X). We will generalize this to an arbitrary classification problem Y∈{1,...,n} by losslessly distilling X into a vector W∈R^{n-1}, although in this case, the components of the vector W may not be independent. We will then return to the binary classification case and provide a family of binnings that map W into an integer Z, allowing us to scan the full Pareto frontier reflecting the tradeoff between retained entropy and class information, illustrating the end-to-end procedure with the CIFAR-10, MNIST and Fashion-MNIST datasets.

This work is closely related to the Information Bottleneck (IB) method Tishby et al. (2000), which provides an insightful, principled approach for balancing compression against prediction Tan et al. (2019). Just as in our work, the IB method aims to find a random variable Z=f(X) that loosely speaking retains as much information as possible about Y and as little other information as possible. The IB method implements this by maximizing the IB-objective
(2) |
where the Lagrange multiplier β tunes the balance between knowing about Y and forgetting about X. Strouse and Schwab (2017) considered the alternative Deterministic Information Bottleneck (DIB) objective
(3) |
to close the loophole where Z retains random information that is independent of both X and Y (which is possible if f is a function that contains random components rather than being fully deterministic). (Footnote 3: If Z=f(X) for some deterministic function f, which is typically not the case in the popular variational IB-implementation Alemi et al. (2016); Chalk et al. (2016); Fischer (2018), then H(Z|X)=0, so I(Z,X)=H(Z)−H(Z|X)=H(Z), which means that the two objectives (2) and (3) are identical.) However, there is a well-known problem with this objective that occurs when Z is continuous: H(Z) is strictly speaking infinite, since it requires an infinite amount of information to store the infinitely many decimals of a generic real number. Although this infinity is normally regularized away by only defining H(Z) up to an additive constant, which is irrelevant when minimizing (3), the problem is that we can define a new rescaled random variable
$$Z' \equiv aZ \tag{4}$$
for a constant a and obtain (Footnote 4: Throughout this paper, we take log to denote the logarithm in base 2, so that entropy and mutual information are measured in bits.)
$$H(Z') = H(Z) + \log|a| \tag{5}$$
and
$$I(Z',Y) = I(Z,Y). \tag{6}$$
This means that by choosing |a|≪1, we can make H(Z′) arbitrarily negative while keeping I(Z′,Y) unchanged, thus making the objective (3) arbitrarily negative. The objective is therefore not bounded from below, and trying to minimize it will not produce an interesting result. We will eliminate this Z-rescaling problem by making Z discrete rather than continuous, so that H(Z) is always well-defined and finite. Another challenge with the DIB objective of equation (3), which we will also overcome, is that it minimizes a linear combination of the two axes in Figure 1, and can therefore only discover concave parts of the Pareto frontier, not convex ones (which are seen to dominate in Figure 1).
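The rescaling problem can be illustrated in closed form for a Gaussian variable, whose differential entropy in bits is ½ log₂(2πeσ²). The following is only a sketch under that Gaussian assumption (the text above makes no such assumption), and the function name is hypothetical: scaling the variable by a shifts its entropy by log₂ a, which becomes arbitrarily negative as a shrinks, while an invertible rescaling cannot change the mutual information.

```python
import math

def gaussian_differential_entropy_bits(sigma):
    """Differential entropy (in bits) of a 1-D Gaussian with standard deviation sigma."""
    return 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)

sigma = 1.0  # assumed standard deviation of the unscaled variable
base = gaussian_differential_entropy_bits(sigma)
for a in (1.0, 1e-1, 1e-3, 1e-6):
    # Scaling a Gaussian by a scales its standard deviation by a.
    scaled = gaussian_differential_entropy_bits(a * sigma)
    print(f"a={a:g}: entropy={scaled:.3f} bits, "
          f"shift={scaled - base:.3f}, log2(a)={math.log2(a):.3f}")
```

The printed shift equals log₂ a for every a, so no finite lower bound on the entropy term exists for continuous variables.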
The rest of this paper is organized as follows: In Section II.1, we will provide an exact solution for the binary classification problem where Y∈{1,2} by losslessly distilling X into a single real variable W. We also generalize this to an arbitrary classification problem Y∈{1,...,n} by losslessly distilling X into a vector W∈R^{n-1}, although the components of the vector W may not be independent. In Section II.2, we return to the binary classification case and provide a family of binnings that map W into an integer, allowing us to scan the full Pareto frontier reflecting the tradeoff between retained entropy and class information. We apply our method to various image datasets in Section III and discuss our conclusions in Section IV.
Our algorithm for mapping the Pareto frontier transforms our original data set in a series of steps which will be described in turn below:
(7) |
As we will show, the first, second and fourth transformations retain all mutual information with the label Y, and the information loss about Y can be kept arbitrarily small in the third step. In contrast, the last step treats the information loss as a tuneable parameter that parameterizes the Pareto frontier.
Our first step is to compress X (an image, say) into W, a set of n−1 real numbers, in such a way that no class information is lost about Y.
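(Lossless Distillation Theorem): For an arbitrary random variable X and a categorical random variable Y∈{1,...,n}, we have
$$I(W,Y) = I(X,Y), \tag{8}$$
where W≡f(X)∈R^{n-1} is defined by (Footnote 5: Note that we ignore the n-th component W_n since it is redundant: W_n = 1 − W_1 − ... − W_{n−1}.)
$$W_i \equiv P(Y=i\,|\,X), \qquad i=1,\dots,n-1. \tag{9}$$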
Let denote the domain of , i.e., , and define the set-valued function
These sets form a partition of parameterized by , since they are disjoint and
(10) |
For example, if and , then the sets are simply contour curves of the conditional probability . This partition enables us to uniquely specify as the pair by first specifying which set it belongs to (determined by ), and then specifying the particular element within that set, which we denote . This implies that
(11) |
completing the proof. The last equal sign follows from the fact that the conditional probability is independent of , since it is by definition constant throughout the set . ∎
The following corollary implies that W is an optimal distillation of the information X has about Y, in the sense that it constitutes a lossless compression of said information: I(W,Y)=I(X,Y) as shown, and the total information content (entropy) in W cannot exceed that of X since it is a deterministic function thereof.
With the same notation as above, we have
(12) |
For any two random variables, we have the identity , where is their mutual information and denotes conditional entropy. We thus obtain
(13) | |||||
which completes the proof. We obtain the second line by using from Theorem 1 and specifying by and , and the third line since is independent of , as above. ∎
In most situations of practical interest, the conditional probability distribution P(Y|X) is not precisely known, but can be approximated by training a neural-network-based classifier that outputs the probability distribution for Y given any input X. We present such examples in Section III. The better the classifier, the smaller the information loss will be, approaching zero in the limit of an optimal classifier.

Let us now focus on the special case where n=2, i.e., binary classification tasks. For example, X may correspond to images of equal numbers of felines and canines to be classified despite challenges with variable lighting, occlusion, etc. as in Figure 2, and Y may correspond to the labels “cat” and “dog”. In this case, Y contains H(Y)=1 bit of information of which I(X,Y)≤1 bit is contained in X. Theorem II.1 shows that for this case, all of this information about whether an image contains a cat or a dog can be compressed into a single number W which is not a bit like Y, but a real number between zero and one.
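The lossless distillation can be checked numerically on a small discrete example. The sketch below uses a hypothetical joint distribution over four inputs and two classes (the numbers are illustrative and not taken from any dataset in this paper): mapping each input x to the conditional probability P(Y=1|x) merges inputs that share the same conditional probability, yet leaves the mutual information with Y unchanged.

```python
import math
from collections import defaultdict

def mutual_information_bits(joint):
    """Mutual information (bits) from a dict {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

# Hypothetical joint distribution P(x, y); inputs "a" and "c" share P(Y=1|x) = 0.8.
p_xy = {
    ("a", 1): 0.20, ("a", 2): 0.05,
    ("b", 1): 0.05, ("b", 2): 0.20,
    ("c", 1): 0.16, ("c", 2): 0.04,
    ("d", 1): 0.10, ("d", 2): 0.20,
}

# Marginal P(x), then the distilled variable w(x) = P(Y=1 | x).
p_x = defaultdict(float)
for (x, y), p in p_xy.items():
    p_x[x] += p
# Rounding only guards against floating-point noise when comparing equal probabilities.
w_of_x = {x: round(p_xy[(x, 1)] / p_x[x], 12) for x in p_x}

# Joint distribution of (W, Y): inputs with the same w collapse into one value.
p_wy = defaultdict(float)
for (x, y), p in p_xy.items():
    p_wy[(w_of_x[x], y)] += p

i_xy = mutual_information_bits(p_xy)
i_wy = mutual_information_bits(p_wy)
print(f"I(X,Y) = {i_xy:.6f} bits, I(W,Y) = {i_wy:.6f} bits")
print(f"distinct X values: {len(p_x)}, distinct W values: {len(set(w_of_x.values()))}")
```

The two mutual informations agree even though W takes fewer distinct values than X, which is the content of the theorem in this toy setting.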
The goal of this section is to find a class of functions g that perform Pareto-optimal lossy compression of W, mapping it into an integer Z≡g(W) that maximizes I(Z,Y) for a fixed entropy H(Z). (Footnote 6: Throughout this paper, we will use the term “Pareto-optimal” or “optimal” in this sense, i.e., maximizing I(Z,Y) for a fixed H(Z).) The only input we need for our work in this section is the joint probability distribution of W and Y, whose marginal distributions are the discrete probability distribution for Y and the probability distribution for W, which we will henceforth assume to be continuous:
(14) | |||||
(15) |
For convenience and without loss of generality, we will henceforth assume that W has a uniform distribution on the unit interval [0,1]. We can do this because if W were not uniformly distributed, we could make it so by using the standard statistical technique of applying its cumulative probability distribution function to it:
$$W \mapsto F(W), \qquad F(w) \equiv P(W \le w), \tag{16}$$
retaining all information, I(F(W),Y)=I(W,Y), since this procedure is invertible almost everywhere.
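The uniformization step is the standard probability integral transform. As a minimal sketch, assume a hypothetical variable with an exponential distribution, chosen only because its cumulative distribution function is known in closed form; applying that function to the samples should yield values spread evenly over the unit interval.

```python
import math
import random

random.seed(0)

# Hypothetical non-uniform variable: exponential with rate 1.
samples = [random.expovariate(1.0) for _ in range(100_000)]

def exponential_cdf(w):
    """Cumulative distribution function of the rate-1 exponential distribution."""
    return 1.0 - math.exp(-w)

uniformized = [exponential_cdf(w) for w in samples]

# Uniformity check: each of 10 equal-width bins should hold about 10% of the samples.
counts = [0] * 10
for u in uniformized:
    counts[min(int(u * 10), 9)] += 1
fractions = [c / len(uniformized) for c in counts]
print([round(f, 3) for f in fractions])
```

In practice the cumulative distribution function is not known and would have to be estimated, for example from the ranks of the observed values.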
Given a set of bin boundaries grouped into a vector , we define the integer-valued contiguous binning function
(17) |
can thus be interpreted as the ID of the bin into which falls. Note that is a monotonically increasing piecewise constant function of that is shaped like an -level staircase with steps at .
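A contiguous binning function of this kind can be sketched with a binary search over the sorted bin boundaries. The function name and the convention for points lying exactly on a boundary are assumptions made here for illustration; the boundary convention affects only a set of measure zero.

```python
from bisect import bisect_right

def contiguous_bin(w, boundaries):
    """Return the 1-based ID of the bin containing w, given sorted bin boundaries."""
    # Number of boundaries that are <= w, plus one.
    return bisect_right(boundaries, w) + 1

boundaries = [0.25, 0.5, 0.75]  # three boundaries define four bins
for w in (0.0, 0.3, 0.5, 0.74, 0.99):
    print(w, "->", contiguous_bin(w, boundaries))
```

The returned bin ID never decreases as w increases, which is the staircase shape described above.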
Let us now bin into equispaced bins, by mapping it into an integer (the bin ID) defined by
(18) |
where is the vector with elements , . As illustrated visually in Figure 3 and mathematically in Appendix A, binning corresponds to creating a new random variable for which the conditional distribution is replaced by a piecewise constant function , replacing the values in each bin by their average. The binned variable thus retains only information about which bin falls into, discarding all information about the precise location within that bin. In the limit of infinitesimal bins, , and we expect the above-mentioned discarded information to become negligible. This intuition is formalized by A.1 in Appendix A, which under mild smoothness assumptions ensuring that is not pathological shows that
(19) |
i.e., that we can make the binned data retain essentially all the class information from W as long as we use enough bins.
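This convergence can be illustrated with a hypothetical conditional probability P(Y=1|W=w)=w for uniform W, an assumption chosen only because the unbinned mutual information then has the closed form 1 − 1/(2 ln 2) bits. The sketch replaces the conditional probability in each equal-width bin by its bin average and compares the retained information with the unbinned value.

```python
import math

def binned_information_bits(n_bins):
    """Class information (bits) retained by n_bins equal-width bins of uniform W,
    under the illustrative assumption P(Y=1 | W=w) = w."""
    info = 1.0  # H(Y) = 1 bit, since P(Y=1) = 1/2 under this assumption
    for k in range(n_bins):
        q = (k + 0.5) / n_bins  # bin-averaged P(Y=1 | bin k)
        h = -(q * math.log2(q) + (1 - q) * math.log2(1 - q))
        info -= h / n_bins      # subtract P(bin) * H(Y | bin)
    return info

# Unbinned value: 1 minus the integral of the binary entropy function over [0, 1].
exact = 1.0 - 1.0 / (2.0 * math.log(2.0))
for n in (2, 10, 100, 1000):
    print(n, f"{binned_information_bits(n):.5f}", f"(unbinned: {exact:.5f})")
```

The binned value never exceeds the unbinned one and approaches it as the number of bins grows.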
In practice, such as for the numerical experiments that we will present in Section III, training data is never infinite and the conditional probability function is never known to perfect accuracy. This means that the pedantic distinction between and for very large is completely irrelevant in practice. In the rest of this paper, we will therefore work with the unbinned (W) and binned (W′) data somewhat interchangeably for convenience, occasionally dropping the apostrophe from W′ when no confusion is caused.
For convenience and without loss of generality, we can assume that the conditional probability distribution is a monotonically increasing function. We can do this because if this were not the case, we could make it so by sorting the bins by increasing conditional probability, as illustrated in Figure 3, because both the entropy and the mutual information are left invariant by this renumbering/relabeling of the bins. The “cat” probability (the total shaded area in Figure 3) is of course also left unchanged by both this sorting and by the above-mentioned binning.
We are now finally ready to tackle the core goal of this paper: mapping the Pareto frontier of optimal data compression that reflects the tradeoff between H(Z) and I(Z,Y). While fine-grained binning has no effect on the entropy H(Y) and negligible effect on I(W,Y), it dramatically reduces the entropy of our data. Whereas H(W)=∞ since W is continuous (Footnote 7: While this infinity, which reflects the infinite number of bits required to describe a single generic real number, is customarily eliminated by defining entropy only up to an overall additive constant, we will not follow that custom here, for the reason explained in the introduction.), H(W′) is finite, approaching infinity only in the limit of infinitely many infinitesimal bins. Taken together, these scalings of I and H imply that the leftmost part of the Pareto frontier, defined by equation (1) and illustrated in Figure 1, asymptotes to a horizontal line of height I(X,Y) as H→∞.
To reach the interesting parts of the Pareto frontier further to the right, we must destroy some information about . We do this by defining
(20) |
where the function groups the tiny bins indexed by into fewer ones indexed by , . There are vast numbers of such possible groupings, since each group corresponds to one of the nontrivial subsets of the tiny bins. Fortunately, as we will now prove, we need only consider the contiguous groupings, since non-contiguous ones are inferior and cannot lie on the Pareto frontier. Indeed, we will see that for the examples in Section III, suffices to capture the most interesting information.
(Contiguous Binning Theorem): If has a uniform distribution and the conditional probability distribution is monotonically increasing, then all points on the Pareto frontier correspond to binning into contiguous intervals, i.e., if
(21) |
then there exists a set of bin boundaries such that the binned variable satisfies and .
We prove this by contradiction: we will assume that there is a point on the Pareto frontier to which we can come arbitrarily close with for for a compression function that is not a contiguous binning function, and obtain a contradiction by using to construct another compression function lying above the Pareto frontier, with and . The joint probability distribution for the and is given by the Lebesgue integral
(22) |
where is the joint probability distribution for and introduced earlier and is the set , i.e., the set of -values that are grouped together into the large bin. We define the marginal and conditional probabilities
(23) |
Figure 4 illustrates the case where the binning function corresponds to large bins, the second of which consists of two non-contiguous regions that are grouped together; the shaded rectangles in the bottom panel have width , height and area .
According to Theorem B.1 in the Appendix, we obtain the contradiction required to complete our proof (an alternative compression above the Pareto frontier with and ) if there are two different conditional probabilities , and we can change into so that the joint distribution of Z and Y changes in the following way:
1. only and change,
2. both marginal distributions remain the same,
3. the new conditional probabilities and are further apart.
Figure 4 shows how this can be accomplished for non-contiguous binning: let be a bin with non-contiguous support set (bin 2 in the illustrated example), let be a bin whose support (bin 4 in the example) contains a positive measure subset within two parts and of , and define a new binning function that differs from only by swapping a set against a subset of either or of measure (in the illustrated example, the binning function change implementing this subset is shown with dotted lines). This swap leaves the total measure of both bins (and hence the marginal distribution ) unchanged, and also leaves unchanged. If , we perform this swap between an (as in the figure), and if , we instead perform this swap between an , in both cases guaranteeing that and move further apart (since is monotonically increasing). This completes our proof by contradiction except for the case where ; in this case, we swap to entirely eliminate the discontiguity, and repeat our swapping procedure between other bins until we increase the entropy (again obtaining a contradiction) or end up with a fully contiguous binning (if needed, can be changed to eliminate any measure-zero subsets that ruin contiguity, since they leave the Lebesgue integral in equation (22) unchanged.) ∎
Theorem II.2 implies that we can in practice find the Pareto frontier for any random variable by searching the space of contiguous binnings of after uniformization, binning and sorting. In practice, we can first try the 2-bin case by scanning the bin boundary , then trying the 3-bin case by trying bin boundaries , then trying the 4-bin case, etc., as illustrated in Figure 1. Each of these cases corresponds to a standard multi-objective optimization problem aiming to maximize the two objectives and . We perform this optimization numerically with the AWS algorithm of Kim and de Weck (2005) as described in the next section.
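The 2-bin scan can be sketched by brute force. This sketch swaps in a plain grid search rather than the AWS algorithm named above, and assumes a hypothetical monotonic conditional probability P(Y=1|W=w)=w for uniform W so that all bin probabilities have closed forms; the function names are illustrative.

```python
import math

def entropy_bits(probs):
    """Shannon entropy (bits) of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def pareto_point(boundaries):
    """Return (H(Z), I(Z,Y)) for contiguous bins of uniform W,
    under the illustrative assumption P(Y=1 | W=w) = w."""
    edges = [0.0] + list(boundaries) + [1.0]
    joint = []  # rows: [P(Z=k, Y=1), P(Z=k, Y=2)]
    for lo, hi in zip(edges[:-1], edges[1:]):
        p_y1 = (hi ** 2 - lo ** 2) / 2  # integral of w over the bin
        joint.append([p_y1, (hi - lo) - p_y1])
    p_z = [sum(row) for row in joint]
    p_y = [sum(row[j] for row in joint) for j in range(2)]
    h_z = entropy_bits(p_z)
    h_zy = entropy_bits([p for row in joint for p in row])
    return h_z, h_z + entropy_bits(p_y) - h_zy  # I = H(Z) + H(Y) - H(Z,Y)

# Brute-force scan of the single boundary for the 2-bin case.
grid = [k / 100 for k in range(1, 100)]
best_b = max(grid, key=lambda b: pareto_point([b])[1])
best_h, best_i = pareto_point([best_b])
print(f"best boundary {best_b:.2f}: H(Z)={best_h:.4f} bits, I(Z,Y)={best_i:.4f} bits")
```

For more bins the same `pareto_point` function can be evaluated on vectors of several boundaries, at which point a grid search becomes expensive and a dedicated multi-objective optimizer is the more natural choice.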
Although the uniformization, binning and sorting procedures are helpful in practice as well as for simplifying proofs, they are not necessary in practice. Since what we really care about is grouping into intervals containing similar conditional probabilities , not similar -values, it is easy to see that binning horizontally after sorting is equivalent to binning vertically before sorting. In other words, we can eliminate the binning and sorting steps if we replace “horizontal” binning by “vertical” binning
(24) |
where denotes the conditional probability as before.
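Vertical binning can be sketched as applying the contiguous binning function to the conditional probability rather than to the variable itself. The exact form of equation (24) is not reproduced here, so this reading is an assumption, as are the function names and the non-monotonic conditional probability used for illustration.

```python
from bisect import bisect_right

def vertical_bin(w, p1, boundaries):
    """1-based bin ID obtained by binning the conditional probability p1(w), not w."""
    return bisect_right(boundaries, p1(w)) + 1

def p1(w):
    """Hypothetical non-monotonic conditional probability P(Y=1 | W=w)."""
    return 4 * w * (1 - w)

boundaries = [0.5]  # a single boundary on the probability axis
ids = [vertical_bin(w, p1, boundaries) for w in (0.05, 0.5, 0.95)]
print(ids)
```

Values of w at opposite ends of the unit interval land in the same bin because their conditional probabilities are similar, with no sorting step required.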
We will now test our algorithm for Pareto frontier mapping using some well-known datasets: the CIFAR-10 image database Krizhevsky et al. (2014), the MNIST database of hand-written digits LeCun et al. (2010) and the Fashion-MNIST database of garment images Xiao et al. (2017). Before doing this, however, let us build intuition for how it works by testing on a much simpler toy model that is analytically solvable, where the accuracy of all approximations can be exactly determined.

Let the random variables X and Y be defined by the bivariate probability distribution
(25) |
which corresponds to and being two independent and identically distributed random variables with triangle distribution if , but flipped if . This gives bit and mutual information
(26) |
The compressed random variable defined by equation (9) is thus
(27) |
After defining for a vector of bin boundaries, a straightforward calculation shows that the joint probability distribution of and the binned variable is given by
(28) |
where the cumulative distribution function is given by
(29) |
Computing using this probability distribution recovers exactly the same mutual information bits as in equation (26), as we proved in Theorem II.1.
Given any binning vector , we can plot a corresponding point in Figure 1 by computing , , etc., where is given by equation (28).
The figure shows 6,000 random binnings each for bins; as we have proven, the upper envelope of points corresponding to all possible (contiguous) binnings defines the Pareto frontier. The Pareto frontier begins with the black dot at (0,0) (the lower right corner), since 1 bin obviously destroys all information. The 2-bin case corresponds to a 1-dimensional closed curve parametrized by the single parameter that specifies the boundary between the two bins: it runs from when , moves to the left until when , and returns to when . The and branches are indistinguishable in Figure 1 because of the symmetry of our warmup problem, but in generic cases, a closed loop can be seen where only the upper part defines the Pareto frontier.
More generally, we see that the set of all binnings into bins maps the vector of bin boundaries into a contiguous region in Figure 1. The inferior white region below can also be reached if we use non-contiguous binnings.
The Pareto Frontier is seen to resemble the top of a circus tent, with convex segments separated by “corners” where the derivative vanishes, corresponding to a change in the number of bins. We can understand the origin of these corners by considering what happens when adding a new bin of infinitesimal size ε. As long as the conditional probability is continuous, this changes all probabilities by amounts O(ε), and the probabilities corresponding to the new bin (which used to vanish) will now be O(ε). The function ε log ε has infinite derivative at ε=0, blowing up as log ε, which implies that the entropy increase scales as −ε log ε. In contrast, a straightforward calculation shows that all log ε-terms cancel when computing the mutual information, which changes only by O(ε). As we birth a new bin and move leftward from one of the black dots in Figure 1, the initial slope of the Pareto frontier is thus
(30) |
In other words, the Pareto frontier starts out horizontally to the left of each of its corners in Figure 1. Indeed, the corners are “soft” in the sense that the derivative of the Pareto Frontier is continuous and vanishes at the corners: for a given number of bins, I(Z,Y) by definition takes its global maximum at the corresponding corner, so the derivative vanishes also as we approach the corner from the right. (Footnote 8: The first corner (the transition from 2 to 3 bins) can nonetheless look fairly sharp because the 2-bin curve turns around rather abruptly, and the right derivative does not vanish in the limit where a symmetry causes the upper and lower parts of the 2-bin loop to coincide.)
Our theorems imply that in the limit of infinitely many bins, successive corners become gradually less pronounced (with ever smaller derivative discontinuities), because the left asymptote of the Pareto frontier simply approaches the horizontal line I=I(X,Y).
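The scaling argument behind the horizontal start of the frontier at each corner can be written out explicitly. The following is a sketch under the assumption that a new bin of probability ε is split off from an existing bin of probability P; the symbols ε and P are introduced here for illustration only.

```latex
% Sketch: a bin of probability \epsilon is split off from a bin of probability P.
\begin{align}
\Delta H &= -\epsilon\log\epsilon - (P-\epsilon)\log(P-\epsilon) + P\log P \\
         &= -\epsilon\log\epsilon + \mathcal{O}(\epsilon), \\
\Delta I &= \mathcal{O}(\epsilon), \\
\frac{\Delta I}{\Delta H} &\sim \frac{\epsilon}{-\epsilon\log\epsilon}
   = \frac{1}{-\log\epsilon} \;\longrightarrow\; 0
   \quad\text{as } \epsilon\to 0 .
\end{align}
```

The slope vanishes only logarithmically in ε, which is consistent with corners that are soft yet can look fairly sharp in a plot.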
For our toy example, we knew the conditional probability distribution P(Y|X) and could therefore compute W exactly. For practical examples where this is not the case, we can instead train a neural network to implement a function that approximates P(Y|X). For our toy example, we train a fully connected feedforward neural network to predict Y from X using cross-entropy loss; it has 2 hidden layers, each with 256 neurons with ReLU activation, and a final linear layer with softmax activation, whose first neuron defines W. As illustrated in Figure 5, the network prediction for the conditional probability is fairly accurate, but slightly over-confident, tending to err on the side of predicting more extreme probabilities (further from 0.5). The average KL-divergence between the predicted and actual conditional probability distribution is about , which causes negligible loss of information about Y.

For practical examples where the conditional joint probability distribution
cannot be computed analytically, we need to estimate it from the observed distribution of W-values output by the neural network. For our examples, we do this by fitting each probability distribution by a beta-distribution times the exponential of a polynomial of degree :
(31) |
where the coefficient is fixed by the normalization requirement . We use this simple parametrization because it can fit any smooth distribution arbitrarily well for sufficiently large , and provides accurate fits for the probability distributions in our examples using quite modest ; for example, gives for
which causes rather negligible loss of information about . For our examples below where we do not know the exact distribution and merely have samples drawn from it, one for each element of the data set, we instead perform the fitting by the standard technique of minimizing the cross entropy loss, i.e.,
(33) |
Table 2 lists the fitting coefficients used, and Figure 6 illustrates the fitting accuracy.
Experiment | Y | |||||||
---|---|---|---|---|---|---|---|---|
Analytic | 1 | 0.0668 | -4.7685 | 16.8993 | -25.0849 | 13.758 | 0.5797 | -0.2700 |
2 | 0.4841 | -5.0106 | 5.7863 | -1.5697 | -1.7180 | -0.3313 | -0.0030 | |
Fashion-MNIST | Pullover | 0.2878 | -12.9596 | 44.9217 | -68.0105 | 37.3126 | 0.3547 | -0.2838 |
Shirt | 1.0821 | -23.8350 | 81.6655 | -112.2720 | 53.9602 | -0.4068 | 0.4552 | |
CIFAR-10 | Cat | 0.9230 | 0.2165 | 0.0859 | 6.0013 | -1.0037 | 0.8499 | |
0.6795 | 0.0511 | 0.6838 | -1.0138 | 0.9061 | ||||
Dog | 0.8970 | 0.2132 | 0.0806 | 6.0013 | -1.0039 | 0.8500 | ||
0.7872 | 0.0144 | 0.7974 | -0.9440 | 0.7237 | ||||
MNIST | One | 3.1188 | -65.224 | 231.4 | -320.054 | 150.779 | 1.1226 | -0.6856 |
Seven | -1.0325 | -47.5411 | 189.895 | -269.28 | 127.363 | -0.8219 | 0.1284 |
The MNIST database consists of 28x28 pixel greyscale images of handwritten digits: 60,000 training images and 10,000 testing images LeCun et al. (2010). We use the digits 1 and 7, since they are the two that are most frequently confused, relabeled as Y=1 (ones) and Y=2 (sevens). To increase difficulty, we inject 30% pixel noise, i.e., randomly flip each pixel with 30% probability (see examples in Figure 2). For easy comparison with the other cases, we use the same number of samples for each class.
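The noise injection can be sketched as follows. How a greyscale pixel is "flipped" is not specified above, so this sketch assumes pixel values scaled to [0, 1] and takes flipping to mean replacing a value v by 1 − v; the function name is hypothetical.

```python
import random

def flip_pixels(image, flip_prob=0.3, rng=random):
    """Return a copy of a 2-D image (values in [0, 1]) in which each pixel is
    independently inverted (v -> 1 - v) with probability flip_prob."""
    return [[1.0 - v if rng.random() < flip_prob else v for v in row]
            for row in image]

rng = random.Random(0)
image = [[0.0] * 28 for _ in range(28)]  # hypothetical blank 28x28 image
noisy = flip_pixels(image, 0.3, rng)

# On a blank image, the mean pixel value equals the fraction of flipped pixels.
flipped_fraction = sum(v for row in noisy for v in row) / (28 * 28)
print(f"fraction of flipped pixels: {flipped_fraction:.3f}")
```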
The Fashion-MNIST database has the exact same format (60,000 + 10,000 28x28 pixel greyscale images), depicting not digits but 10 classes of clothing Xiao et al. (2017). Here we again use the two most easily confused classes: pullovers (Y=1) and shirts (Y=2); see Figure 2 for examples.
The architecture of the neural network classifier we train on the above two datasets is adapted from the PyTorch MNIST example (Footnote 9: We use the neural network architecture from github.com/pytorch/examples/blob/master/mnist/main.py; the only difference in architecture is that our output number of neurons is 2 rather than 10.): two convolutional layers (kernel size 5, stride 1, ReLU activation) with 20 and 50 features, respectively, each of which is followed by max-pooling with kernel size 2. This is followed by a fully connected layer with 500 ReLU neurons and finally a softmax layer that produces the predicted probabilities for the two classes. After training, we apply the trained model to the test set to obtain W for each dataset.

CIFAR-10 Krizhevsky and Hinton (2009) is one of the most widely used datasets for machine learning research, and contains 60,000 32x32 color images in 10 different classes. We use only the cat (Y=1) and dog (Y=2) classes, which are the two that are empirically hardest to discriminate; see Figure 2 for examples. We use a ResNet18 architecture He et al. (2016). (Footnote 10: The architecture is adapted from github.com/kuangliu/pytorch-cifar, for which we use its ResNet18 model; the only difference in architecture is that we use 2 rather than 10 output neurons.) We train with a learning rate of 0.01 for the first 150 epochs, 0.001 for the next 100, and 0.0001 for the final 100 epochs; we keep all other settings the same as in the original repository.
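The learning-rate schedule described above is piecewise constant and can be written as a small function. This is a framework-independent sketch; the function name and the zero-based epoch index are assumptions, and the actual training code in the referenced repository may set the rate differently.

```python
def learning_rate(epoch):
    """Piecewise-constant schedule over 350 epochs (zero-based epoch index):
    0.01 for the first 150 epochs, 0.001 for the next 100, 0.0001 for the final 100."""
    if epoch < 150:
        return 0.01
    if epoch < 250:
        return 0.001
    return 0.0001

print([learning_rate(e) for e in (0, 149, 150, 249, 250, 349)])
```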
Figure 6 shows observed cumulative distribution functions (solid curves) for the generated by the neural network classifiers, together with our above-mentioned analytic fits (dashed curves). (In the case of CIFAR-10, the observed distribution was so extremely peaked near the endpoints that we replaced equation (31) by a more accurate fit.)
The final result of our calculations is shown in Figure 8: the Pareto frontiers for our four datasets, computed using our method. We will return to discuss these curves extensively in the next section.
We have presented a method for mapping out the Pareto frontier for classification tasks (as in Figure 8), reflecting the tradeoff between retained entropy and class information. We first showed how a random variable (an image, say) drawn from a class can be distilled into a vector losslessly, so that . For the case of binary classification, we then showed how the Pareto frontier is swept out by a one-parameter family of binnings of into a discrete variable that corresponds to binning into bins, such that is maximized for each fixed entropy .
To build intuition for our results, let us consider our CIFAR-10 example of images depicting cats () and dogs () as in Figure 2 and ask what aspects of an image capture the most information about the species . Above, we estimated that bits, so what captures the largest fraction of this information for a fixed entropy? Given a good neural network classifier, a natural guess might be the single bit containing its best guess, say “it’s probably a cat”. This corresponds to defining if , otherwise, and gives the joint distribution )
corresponding to bits. But our results show that we can improve things in two separate ways.
First of all, if we only want to store one bit , then we can do better, corresponding to the first “corner” in Figure 8: moving the likelihood cutoff from to , i.e., redefining if , increases the mutual information to bits.
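The effect of moving the cutoff can be explored numerically. The following is a minimal sketch on synthetic data, not the CIFAR-10 classifier outputs: the helper names are hypothetical, the classes are labeled 0 and 1 for convenience, and the toy values w are drawn from a skewed Beta distribution with y drawn as Bernoulli(w), so that w is exactly the probability that y equals 1.

```python
import math
import random

def mutual_information(joint):
    """I(Z,Y) in bits from a table joint[z][y] of probabilities or counts."""
    total = sum(sum(row) for row in joint)
    pz = [sum(row) / total for row in joint]
    py = [sum(col) / total for col in zip(*joint)]
    info = 0.0
    for z, row in enumerate(joint):
        for y, value in enumerate(row):
            p = value / total
            if p > 0:
                info += p * math.log2(p / (pz[z] * py[y]))
    return info

def one_bit_information(samples, cutoff):
    """I(Z,Y) for the one-bit compression Z = 1 if w > cutoff else 0."""
    joint = [[0, 0], [0, 0]]
    for w, y in samples:
        joint[int(w > cutoff)][y] += 1
    return mutual_information(joint)

# Toy data (an assumption): skewed w, with y ~ Bernoulli(w).
rng = random.Random(0)
samples = []
for _ in range(20000):
    w = rng.betavariate(0.5, 2.0)
    samples.append((w, int(rng.random() < w)))

cutoffs = [i / 100 for i in range(1, 100)]
info_at_half = one_bit_information(samples, 0.5)
best_cutoff = max(cutoffs, key=lambda c: one_bit_information(samples, c))
best_info = one_bit_information(samples, best_cutoff)
print(f"cutoff 0.50: {info_at_half:.3f} bits")
print(f"cutoff {best_cutoff:.2f}: {best_info:.3f} bits")
```

Because the scanned grid includes 0.5, the best cutoff found can never do worse than the naive "most likely class" bit; how much better it does depends on how asymmetric the distribution of w is.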
More importantly, we are still falling far short of the bits of information we had without data compression, capturing only 82% of the available species information. Our Theorem II.1 showed that we can retain all this information if we instead define as the cat probability itself: . For example, a given image might be compressed not into “It’s probably a cat” but into “I’m 94.2477796% sure it’s a cat”. However, it is clearly impractical to report the infinitely many decimals required to retain all the species information, which would make infinite. Loosely speaking, our results can be interpreted as the optimal way to round , so that the information required to store it becomes finite. We found that simply rounding to a fixed number of decimals is suboptimal; for example, if we pick 2 decimals and say “I’m 94.25% sure it’s a cat”, then we have effectively binned the probability into 10,000 bins of equal size, even though we can often do much better with bins of unequal size, as illustrated in the bottom panel of Figure 1. Moreover, when the probability is approximated by a neural network, we found that what should be optimally binned is not but the conditional probability illustrated in Figure 7 (“vertical binning”).
It is convenient to interpret our Pareto-optimal data compression as clustering, i.e., as a method of grouping our images or other data into clusters based on what information they contain about . For example, Figure 2 illustrates CIFAR-10 images clustered by their degree of “cattiness” into 5 groups that might be nicknamed “1.9% cat”, “11.8% cat”, “31.4% cat”, “68.7% cat” and “96.7% cat”. This gives the joint distribution ) where
and gives , thus increasing the fraction of species information retained from 82% to 99%.
This is a striking result: we can group the images into merely five groups and discard all information about all images except which group they are in, yet retain 99% of the information we cared about. Such grouping may be helpful in many contexts. For example, given a large sample of labeled medical images of potential tumors, they can be used to define say five optimal clusters, after which future images can be classified into five degrees of cancer risk that collectively retain virtually all the malignancy information in the original images.
Given that the Pareto frontier is continuous and corresponds to an infinite family of possible clusterings, which one is most useful in practice? Just as in more general multi-objective optimization problems, the most interesting points on the frontier are arguably its “corners”, indicated by dots in Figure 8, where we do notably well on both criteria. We see that the parts of the frontier between corners tend to be convex and thus rather unappealing, since any weighted average of and will be maximized at a corner. Our results show that these corners can conveniently be computed without numerically tedious multiobjective optimization, by simply maximizing the mutual information for bins. The first corner, at bit, corresponds to the learnability phase transition for DIB, i.e., the largest for which DIB is able to learn a non-trivial representation. In contrast to the IB learnability phase transition (Wu et al., 2019) where increases continuously from 0, here the has a jump from 0 to a positive value, due to the non-concave nature of the Pareto frontier.
Moreover, all the examples in Figure 8 are seen to get quite close to the asymptote for , so the most interesting points on the Pareto frontier are simply the first handful of corners. For these examples, we also see that the greater the mutual information is, the fewer bins are needed to capture most of it.
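One way to maximize the mutual information for a fixed number of bins is dynamic programming over candidate bin boundaries. The sketch below is an illustration under stated assumptions, not necessarily the procedure used to produce Figure 8: the data are assumed to be pre-histogrammed into fine cells ordered by the quantity being binned, each cell holding counts of the two classes, and the bins are assumed to be contiguous groups of cells. The names `group_cost` and `best_binning` are hypothetical.

```python
import math

def group_cost(n0, n1):
    """One bin's contribution to N * H(Y|Z), in bits, given its class counts."""
    total = n0 + n1
    return -sum(n * math.log2(n / total) for n in (n0, n1) if n > 0)

def best_binning(cells, m):
    """Split `cells` (ordered list of (n0, n1) counts) into m contiguous bins
    that minimize H(Y|Z), i.e. maximize I(Z,Y).

    Returns (I(Z,Y) in bits, list of interior cut indices).
    """
    k = len(cells)
    prefix = [(0, 0)]                       # cumulative class counts
    for n0, n1 in cells:
        prefix.append((prefix[-1][0] + n0, prefix[-1][1] + n1))

    def cost(i, j):                         # bin made of cells i .. j-1
        return group_cost(prefix[j][0] - prefix[i][0],
                          prefix[j][1] - prefix[i][1])

    inf = float("inf")
    best = [[inf] * (k + 1) for _ in range(m + 1)]
    back = [[0] * (k + 1) for _ in range(m + 1)]
    best[0][0] = 0.0
    for b in range(1, m + 1):               # number of bins used so far
        for j in range(b, k + 1):           # bins cover cells 0 .. j-1
            for i in range(b - 1, j):       # last bin starts at cell i
                c = best[b - 1][i] + cost(i, j)
                if c < best[b][j]:
                    best[b][j], back[b][j] = c, i

    starts, j = [], k                       # recover where each bin starts
    for b in range(m, 0, -1):
        j = back[b][j]
        starts.append(j)
    cuts = sorted(starts)[1:]               # drop the leading 0

    n_total = prefix[k][0] + prefix[k][1]
    h_y = group_cost(prefix[k][0], prefix[k][1]) / n_total
    return h_y - best[m][k] / n_total, cuts

# Usage on a toy histogram of four cells with (class 0, class 1) counts.
cells = [(9, 1), (8, 2), (2, 8), (1, 9)]
for m in range(1, 5):
    info, cuts = best_binning(cells, m)
    print(m, round(info, 4), cuts)
```

The cost of the search grows as m times the square of the number of cells, which is manageable for histograms with a few hundred cells. The information found can only stay the same or increase as m grows, since any bin can be split without losing information.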
An alternative way of interpreting the Pareto plane in Figure 8 is as a tradeoff between two evils: information bloat, H(Z|Y) = H(Z) − I(Z,Y), and information loss, I(X,Y) − I(Z,Y).
What we are calling the information bloat has also been called “causal waste” Thompson et al. (2018). It is simply the conditional entropy of given , and represents the excess bits we need to store in order to retain the desired information about . Geometrically, it is the horizontal distance to the impossible region to the right in Figure 8, and we see that for MNIST, it takes local minima at the corners for both 1 and 2 bins. The information loss is simply the information discarded by our lossy compression of . Geometrically, it is the vertical distance to the impossible region at the top of Figure 1. As we move from corner to corner adding more bins, we typically reduce the information loss at the cost of increased information bloat. For the examples in Figure 8, we see that going beyond a handful of bins essentially just adds bloat without significantly reducing the information loss.
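Both quantities are easy to compute from the joint distribution of the compressed variable and the class. The sketch below is a minimal illustration: the function names are hypothetical, the joint table is a toy example, and the value passed as `info_available` (the information in the uncompressed data) is a made-up number rather than one of the paper's estimates.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def bloat_and_loss(joint, info_available):
    """Return (information bloat, information loss) in bits.

    joint[z][y] is the joint probability of compressed value z and class y;
    info_available is the mutual information between the raw data and the class.
    """
    pz = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    h_z = entropy(pz)
    h_joint = entropy([p for row in joint for p in row])
    info_retained = h_z + entropy(py) - h_joint   # I(Z,Y)
    bloat = h_z - info_retained                   # conditional entropy H(Z|Y)
    loss = info_available - info_retained
    return bloat, loss

# Toy example: a one-bit compression that guesses the class correctly 80% of the time.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
bloat, loss = bloat_and_loss(joint, info_available=0.5)
print(f"bloat = {bloat:.4f} bits, loss = {loss:.4f} bits")
```

In this toy case most of the stored bit is bloat: the compression costs one bit of entropy but retains well under half a bit of class information.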
We just discussed how lossy compression is a tradeoff between information bloat and information loss. Let us now elaborate on the latter, for the real-world situation where is approximated by a neural network.
If the neural network learns to become perfect, then the function that it implements will be such that satisfies , which corresponds to the dashed curves in Figure 7 being identical to the solid curves. Although we see that this is close to being the case for the analytic and MNIST examples, the neural networks are further from optimal for Fashion-MNIST and CIFAR-10. The figure illustrates that the general trend is for these neural networks to overfit and therefore be overconfident, predicting probabilities that are too extreme.
This fact that probably indicates that our Fashion-MNIST and CIFAR-10 classifiers destroy information about , but it does not prove this, because if we had a perfect lossless classifier satisfying , then we could define an overconfident lossless classifier by an invertible (and hence information-preserving) reparameterization such as that violates the condition .
So how much information does contain about ? One way to lower-bound uses the classification accuracy: if we have a classification problem where and compress into a single classification bit (corresponding to a binning of into two bins), then we can write the joint probability distribution for and the guessed class as
For a fixed total error rate ε, Fano’s Inequality implies that the mutual information takes a minimum
I(Z,Y) = 1 + ε log₂ ε + (1−ε) log₂(1−ε)   (39)
when the two kinds of misclassification are equally likely, so if we can train a classifier that gives an error rate ε, then the right-hand side of equation (39) places a lower bound on the mutual information I(X,Y). The prediction accuracy is shown for reference on the right side of Figure 8. Note that getting close to one bit of mutual information requires extremely high accuracy; for example, 99% prediction accuracy corresponds to only 0.92 bits of mutual information.
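The relation between accuracy and this lower bound can be evaluated directly. The sketch below assumes the standard binary form of Fano's inequality for two equally likely classes, in which the bound equals one bit minus the binary entropy of the error rate; the function names are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a coin with probability p of heads."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def info_lower_bound(error_rate):
    """Lower bound in bits, assuming two equally likely classes (H(Y) = 1 bit)."""
    return 1.0 - binary_entropy(error_rate)

for accuracy in (0.80, 0.90, 0.99, 0.999):
    bound = info_lower_bound(1 - accuracy)
    print(f"accuracy {accuracy:.3f}: at least {bound:.2f} bits")
```

Under this assumption the 99% accuracy case reproduces the 0.92 bits quoted in the text, and even 90% accuracy guarantees only about half a bit.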
We can obtain a stronger estimated lower bound on from the cross-entropy loss function used to train our classifiers:
(40)
where denotes the average KL-divergence between true and predicted conditional probability distributions, and denotes ensemble averaging over data points, which implies that
(41)
If as we discussed above, then and hence the loss can be further reduced by recalibrating as we have done, which increases the information bound from equation (41) up to the value computed directly from the observed joint distribution .
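The effect of recalibration on the cross-entropy bound can be illustrated with a toy calculation. The sketch below rests on several assumptions: the bound is taken to be the class entropy minus the average cross-entropy loss (both in bits), the true conditional probabilities are a made-up symmetric list chosen so that the class entropy is exactly one bit, and `sharpen` is a hypothetical invertible map standing in for an overconfident classifier. Losses reported in nats by a training framework would need to be divided by ln 2 first.

```python
import math

def cross_entropy_bits(true_probs, predicted_probs):
    """Expected cross-entropy loss in bits, averaged over equally weighted points.

    true_probs[i] is the true probability of class 1 for data point i,
    predicted_probs[i] is the classifier's predicted probability.
    """
    total = 0.0
    for p, q in zip(true_probs, predicted_probs):
        total -= p * math.log2(q) + (1 - p) * math.log2(1 - q)
    return total / len(true_probs)

def sharpen(p):
    """Hypothetical overconfident reparameterization, invertible on (0, 1)."""
    return p ** 2 / (p ** 2 + (1 - p) ** 2)

true_probs = [0.1, 0.3, 0.7, 0.9]   # toy values; their mean is 0.5, so H(Y) = 1 bit
h_y = 1.0

overconfident = [sharpen(p) for p in true_probs]
bound_overconfident = h_y - cross_entropy_bits(true_probs, overconfident)
bound_calibrated = h_y - cross_entropy_bits(true_probs, true_probs)
print(f"overconfident: {bound_overconfident:.3f} bits")
print(f"calibrated:    {bound_calibrated:.3f} bits")
```

The overconfident predictions carry the same information as the calibrated ones, since the map is invertible, yet they yield a weaker bound. Recalibrating removes the KL-divergence penalty and tightens the bound without retraining.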
Unfortunately, without knowing the true probability , there is no rigorous and practically useful upper bound on the mutual information other than the trivial inequality bit, as the following simple counterexample shows: suppose our images are encrypted with some encryption algorithm that is extremely time-consuming to crack, rendering the images for all practical purposes indistinguishable from random noise. Then any reasonable neural network will produce a useless classifier giving even though the true mutual information could be as large as one bit. In other words, we generally cannot know the true information loss caused by compressing , so the best we can do in practice is to pick a corner reasonably close to the upper asymptote in Figure 8.
As mentioned in the introduction, the Discrete Information Bottleneck (DIB) method Strouse and Schwab (2017) maximizes a linear combination of the two axes in Figure 8. We have presented a method solving a generalization of the DIB problem. The generalization lies in switching the objective from equation (3) to equation (1), which has the advantage of discovering the full Pareto frontier in Figure 8 instead of merely the corners and concave parts (as mentioned, the DIB objective cannot discover convex parts of the frontier). The solution lies in our proof that the frontier is spanned by binnings of the likelihood into bins, which enables it to be computed more efficiently than with the iterative/variational method of Strouse and Schwab (2017).
The popular original Information Bottleneck (IB) method Tishby et al. (2000) generalizes DIB by allowing the compression function to be non-deterministic, thus adding noise that is independent of . Starting with a Pareto-optimal and adding such noise will simply shift us straight to the left in Figure 8, away from the frontier (which is by definition monotonically decreasing) and into the Pareto-suboptimal region in the vs. plane. As shown in Strouse and Schwab (2017), IB-compressions tend to altogether avoid the rightmost part of Figure 8, with an entropy that never drops below some fixed value independent of .
Our results suggest a number of opportunities for further work, ranging from information theory to machine learning, neuroscience and physics.
As to information theory, it will be interesting to try to generalize our method from binary classification to classification into more than two classes. Also, one can ask if there is a way of pushing the general information distillation problem all the way to bits. It is easy to show that a discrete random variable can always be encoded as independent random bits (Bernoulli variables) , defined by
(The mapping from bit strings to integers is defined so that is the position of the last bit that equals one when is preceded by a one. For example, for , the mapping from length-3 bit strings to integers is , , , .)