How (Not) To Train Your Neural Network Using the Information Bottleneck Principle

02/27/2018 ∙ by Rana Ali Amjad, et al.

In this theory paper, we investigate training deep neural networks (DNNs) for classification via minimizing the information bottleneck (IB) functional. We show that, even if the joint distribution between continuous feature variables and the discrete class variable is known, the resulting optimization problem suffers from two severe issues: First, for deterministic DNNs, the IB functional is infinite for almost all weight matrices, making the optimization problem ill-posed. Second, the invariance of the IB functional under bijections prevents it from capturing desirable properties for classification, such as robustness, architectural simplicity, and simplicity of the learned representation. We argue that these issues are partly resolved for stochastic DNNs, DNNs that include a (hard or soft) decision rule, or by replacing the IB functional with related, but more well-behaved cost functions. We conclude that recent successes reported about training DNNs using the IB framework must be attributed to such solutions. As a side effect, our results imply limitations of the IB framework for the analysis of DNNs.


1 Introduction

Recently, the information bottleneck (IB) framework has been proposed for analyzing and understanding DNNs [1]. The IB framework admits evaluating the optimality of the learned representation and has been used to make claims regarding properties of stochastic gradient descent optimization and the computational benefit of many hidden layers [2]. Whether these claims all hold true is the subject of an ongoing debate (cf. [3]).

Rather than adding to this debate, the purpose of this paper is to contribute to a different research area sparked by [1]: Training DNNs for classification by minimizing the IB functional. To be more precise, suppose that $Y$ is a class variable, $X$ denotes the features at the input of the DNN, and $L$ is either a latent representation or the output of the DNN for the input $X$. The IB functional then is [4]

$I(X;L) - \beta\, I(Y;L)$   (1)

for some trade-off parameter $\beta > 0$. A DNN minimizing this functional thus has a maximally compressed latent representation or output (because the mutual information $I(X;L)$ is small) that is informative about the class variable (because $I(Y;L)$ is large). The IB framework thus introduces a regularization term that depends on the representation rather than on the parameters of the DNN. This data-dependent regularization has the potential to capture properties in the latent representations or the output of the DNN desirable for the specific classification task; namely, robustness against noise and small distortions and simplicity of the representation (see Sec. 3 for details).
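To make the two terms in (1) concrete, the following minimal sketch (ours, not part of the original formulation) evaluates the IB functional for a toy discrete joint distribution and a hand-picked deterministic representation; the distribution, the map `f`, and the value of `beta` are illustrative assumptions.

```python
import numpy as np

def mutual_information(p_joint):
    """I(A;B) in nats for a joint pmf given as a 2-D array p_joint[a, b]."""
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float(np.sum(p_joint[mask] * np.log(p_joint[mask] / (p_a * p_b)[mask])))

# Toy joint distribution of (X, Y): four feature values, two classes.
p_xy = np.array([[0.25, 0.00],
                 [0.20, 0.05],
                 [0.05, 0.20],
                 [0.00, 0.25]])

# Deterministic representation L = f(X): collapse feature values {0,1} -> 0 and {2,3} -> 1.
f = np.array([0, 0, 1, 1])

# Joint distributions of (X, L) and (L, Y) induced by f.
p_xl = np.zeros((4, 2))
p_ly = np.zeros((2, 2))
for x in range(4):
    p_xl[x, f[x]] = p_xy[x].sum()
    p_ly[f[x]] += p_xy[x]

beta = 2.0  # illustrative trade-off parameter
i_xl = mutual_information(p_xl)
i_yl = mutual_information(p_ly)
print(f"I(X;L) = {i_xl:.3f} nat, I(Y;L) = {i_yl:.3f} nat, "
      f"IB functional = {i_xl - beta * i_yl:.3f}")
```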

Subsequently, the IB framework has been used to train DNNs for discrete or continuous features [5, 6, 7, 8, 9]. These works report remarkable performance in classification tasks (see also Sec. 6), but only after slightly departing from the IB framework, arguing that (1) is hard to compute. As a remedy, they replace the mutual information terms with bounds in order to obtain cost functions that can be computed and optimized using gradient-based methods.

In this work, we present a thorough analysis of using the IB functional for training DNNs. Specifically, we show that in deterministic DNNs the IB functional leads to an ill-posed optimization problem by either being infinite for almost all parameter settings (Sec. 4.1) or by being a piecewise constant function of the parameters (Sec. 4.2). Moreover, we show in Sec. 4.3 that the IB functional captures only a small subset of the properties desirable for the representation when performing classification, and hence is not suitable as a cost function for training deterministic DNNs. We then show that the utility of the IB functional can partly be recovered by applying it for training stochastic DNNs or by including the decision rule (Sec. 5.1 through Sec. 5.3). Furthermore, we argue in Sec. 5.4 that replacing the IB functional by a more well-behaved cost function inspired by (1) may be an even better option. Building on these considerations, we postpone discussing the related work until Sec. 6. We argue that the successes of [5, 6, 7, 8, 9] must be attributed to such steps – replacing the functional, making the DNN stochastic, data augmentation, including the decision rule – for whose validity these works provide experimental evidence, and not to the fact that they are based on the IB principle.

In our analysis, we make the uncommon assumption that the joint distribution between the features $X$ and the class $Y$ is known. This not only admits more rigorous statements, but also makes them independent of the optimization heuristic used for training and corresponds to a best-case scenario for training. Nevertheless, we make regular comments on how our analysis changes in case only a finite dataset is available.

Finally, we wish to mention that while the focus of this work is on training DNNs, our results have strong implications for recent efforts in analyzing DNNs using the IB framework. Specifically, while statements about the trade-off between compression and preservation of class information may still be possible if the functional is finite, without applying the remedies from Sec. 5 the IB framework does not admit making statements about the robustness, classification performance, or representational simplicity of a given DNN.

2 Setup and Preliminaries

We consider a feature-based classification problem. Suppose that the joint distribution between the $n$-dimensional (random) feature vector $X$ and the (random) class label $Y$ is given by $P_{X,Y}$. We denote realizations of random variables (RVs) with lower case letters, e.g., $x$ or $y$. Unless otherwise specified, we assume that $X$ has an arbitrary distribution and that $Y$ has a discrete distribution on some finite set $\mathcal{Y}$ of class labels.

Classification shall be performed by a feed-forward DNN. This is the same setup as the one discussed in [1, 6, 3, 5, 2]. The DNN accepts the RV $X$ at the input and responds with the transformed RV $\hat{Y}$, based on which the class label $Y$ can be estimated with a decision rule (see [10, Chap. 6.2] for a discussion on decision rules). If the DNN has $N$ hidden layers, then the column vector collecting all neuron outputs in the $i$-th hidden layer is the intermediate representation denoted by $L_i$. We call $L_0 = X$ the input and $L_{N+1} = \hat{Y}$ the output of the DNN, while for $i = 1, \dots, N$, we call $L_i$ a latent representation. Abusing notation, $|L_i|$ denotes the number of neurons in the $i$-th layer, e.g., $|L_0| = n$. Whenever we talk about intermediate representations for which the layer number is immaterial, we write $L$ instead of $L_i$.

If the DNN is deterministic, then $X$ and $L_i$ are related by a function that maps the former to the latter and that depends on a set of (weight and bias) parameters. E.g., if $W_i$ is the matrix of weights between the $(i-1)$-th and the $i$-th layer, $b_i$ the vector of biases for the $i$-th layer, and $\sigma$ an activation function, then $L_0 = X$ and

$L_i = \sigma(W_i L_{i-1} + b_i)$   (2)

where the activation function is applied coordinate-wise. Whether the activation function is sigmoidal, ReLU, leaky ReLU, tanh, or softplus is immaterial for the results that follow, unless stated otherwise. We define $L_{i,k}$ to be the $k$-th component of the vector $L_i$ and $g_i\colon u \mapsto \sigma(W_i u + b_i)$ to be the function implemented by the $i$-th layer. We have $L_i = f_i(X)$, where $f_i = g_i \circ \cdots \circ g_1$ shall be called the encoder for $L_i$. Similarly, $\hat{Y} = h_i(L_i)$, where $h_i = g_{N+1} \circ \cdots \circ g_{i+1}$ is called the decoder of $L_i$. If the DNN is stochastic, then $g_i$ is a stochastic map parameterized by $(W_i, b_i)$, and the encoder and decoder are obtained in the same way as for deterministic DNNs by appropriately concatenating the stochastic maps $g_i$.
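As a concrete (and purely illustrative) rendering of (2), the following sketch computes all intermediate representations of a small deterministic feed-forward DNN; the ReLU activation and the random parameter values are assumptions made only for this example.

```python
import numpy as np

def relu(z):
    """Coordinate-wise activation function sigma; ReLU is chosen for concreteness."""
    return np.maximum(z, 0.0)

def encoder(x, weights, biases, activation=relu):
    """Return the representations L_1, ..., L_{N+1} for the input L_0 = x (cf. (2)).

    weights[i-1] is W_i, the weight matrix between layer i-1 and layer i;
    biases[i-1] is b_i, the bias vector of layer i.
    """
    representations = []
    l = np.asarray(x, dtype=float)
    for W, b in zip(weights, biases):
        l = activation(W @ l + b)   # L_i = sigma(W_i L_{i-1} + b_i)
        representations.append(l)
    return representations

# Example: 2-dimensional input, one hidden layer with three neurons, one output neuron.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [np.zeros(3), np.zeros(1)]
L1, Y_hat = encoder(np.array([1.0, -0.5]), weights, biases)
print("latent representation L_1:", L1, "output:", Y_hat)
```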

We denote entropy, differential entropy, and mutual information by $H(\cdot)$, $h(\cdot)$, and $I(\cdot;\cdot)$, respectively (see [11, Ch. 2 & 9]). Specifically, if $X$ is continuous and $Y$ is discrete, then

$I(X;Y) = H(Y) - H(Y|X)$   (3)

where all terms can be assumed to be finite. In contrast, $H(X) = \infty$ whenever $X$ is not discrete and $h(X) = -\infty$ whenever $X$ is not continuous (see [12, Lemma 3.1 & p. 50]).

3 Learning Representations for Classification

The authors of [1] formulated supervised deep learning as the goal of finding maximally compressed representations of the features that preserve as much information about the class variable as possible. For classification tasks, this goal alone is not sufficient. We will now present a list of properties, (P1)-(P4), of an intermediate representation that are desirable for the classification task. We do not accompany these properties with precise mathematical definitions – this is out of the scope of this paper and left for future work. Nevertheless, taken as guiding principles, these properties are sufficient to point out the shortcomings of the IB principle for training DNNs and to discuss ways to remedy them. For classification, the representation should

  • (P1) inform about $Y$. This means that the representation $L$ should contain as much information about the class variable $Y$ as was contained in the features $X$, i.e., $L$ should be a sufficient statistic for $Y$.

  • (P2) be maximally compressed. The representation $L$ should not tell more about $X$ than is necessary to correctly estimate $Y$, i.e., it should attain invariance, in some sense, to nuisance factors which are not relevant to the class label $Y$. Compression, and consequently invariance, can, for example, be quantified statistically (e.g., $L$ is a minimal sufficient statistic for $Y$) or geometrically (e.g., data points from different classes are mapped to different dense clusters in the range of $L$).

  • (P3) admit a simple decision function. The successive intermediate representations should be such that the class can be estimated from them using successively “simpler” functions. The term “simple” here has to be taken relative to the capabilities of the information sink or the system processing $L$. E.g., in DNNs, decisions are often made by searching for the output neuron with the maximum activation (if there are $|\mathcal{Y}|$ output neurons) or by binary quantization (for a single output neuron), so the intermediate representation should be such that these simple decision functions suffice to predict the class label from the $|L_i|$-dimensional RV $L_i$.

  • (P4) be robust. This means that adding a small amount of noise to $X$ or transforming it with a well-behaved transform (e.g., affine transforms or small deformations) should not lead to big differences in the intermediate representation. E.g., the dense clusters in the range of $L$ corresponding to different classes should be far apart, and small deformations of $X$ should not change the cluster that a data point is mapped to.

Historically, the primary goals of training have been extracting information about $Y$ from the input $X$ such that this simplified extracted information can be effectively used by a simple decision mechanism to estimate $Y$ (P1 and P3). Traditionally, these goals have been achieved by using mean-squared error or cross-entropy as a cost function. Furthermore, robustness (P4) has been linked to improved generalization capabilities of learning algorithms [13, 14, 15]; regularization measures such as dropout have been shown to instill robustness and improved generalization. Reference [1] has additionally introduced the idea of having maximally compressed intermediate representations (P2). The intuition behind this requirement is that it should avoid overfitting by making the network forget the specific details of the individual examples and by making it invariant to nuisances not relevant for the classification task.

In addition to achieving P1 through P4, one may wish that the DNN producing these intermediate representations is architecturally economical. E.g., the DNN should consist of few hidden layers, of few neurons per layer, of few convolution filters or sparse weight matrices $W_i$, or the inference process based on the DNN should be computationally economical. This goal becomes particularly important when deploying DNNs on embedded/edge devices with limited computational resources and real-time processing constraints. While the network architectures currently leading to state-of-the-art performance in various classification tasks are highly over-parameterized, it has also been observed that a major portion of the network parameters can be pruned without significant deterioration in performance [16, 17, 18], and it has been suggested that the over-parameterization of the network merely provides ease of optimization during training [19]. Hence one may wish for intermediate representations with characteristics that either help in training architecturally/computationally economical DNNs to achieve state-of-the-art performance or that admit significant pruning after training without performance degradation.

Of course, these goals are not completely independent. For example, if a representation is robust and compressed, e.g., if the different regions in the input domain corresponding to different classes are mapped to clusters that are dense and far apart in the domain of the intermediate representation, then it may be easier to find a simple decision rule to estimate $Y$ from $L$. Such a representation $L$, however, may require an encoder with significant architectural/computational complexity.

Since goals P1-P4 are formulated as properties of the intermediate representation, achieving them can be accomplished by designing regularizers for $L$ based on, e.g., the joint distribution between $X$, $Y$, and $L$. Such regularization departs from classical regularization that depends on the parameters of the DNN and relates closely to representation learning. Representation learning is an active field of research, and various sets of desired properties for representations have previously been proposed. These are similar to our proposal but differ in subtle and key aspects.

In [20], Bengio et al. discussed desired characteristics of representations of the input in terms of invariant, disentangled, and smoothly varying factors, whereas our focus is on learning representations for a specific classification task. Nevertheless, P1-P4 have similarities to the properties discussed in [20]. For example, the hierarchical organization of explanatory factors discussed in [20] can lead to more abstract concepts at deeper layers. This subsequently may imply successively simpler decision functions required to estimate $Y$ from $L_i$ (P3). Similarly, [20] discusses invariance and manifold learning mainly in the context of auto-encoders, focusing primarily on $X$. Our P2 goes one step further by including $Y$ in the picture, i.e., it aims to remove all information from $X$ that is not useful for determining $Y$. In a geometric understanding of compression, this could mean collapsing the input manifolds corresponding to different class labels to, for example, separate dense clusters in the representation domain (as observed in [6, Fig. 2] and [5, Fig. 2]). Furthermore, in the context of representation learning, robustness is often related to denoising and contractive auto-encoders. However, our P4 aims to learn representations that are robust for the classification task, whereas for auto-encoders the aim is to learn robust representations from which the input can be recovered.

The authors of [21] focused on formulating a similar set of desired properties in terms of information-theoretic objectives. Their approach involves treating also the network parameters as RVs, unlike [6, 5, 1] and our work, where only $X$, $Y$, and the latent representations (which are transformations of $X$) are RVs. Their definitions share a similar intuitive meaning as ours; e.g., sufficiency is equivalent to P1, minimality and invariance follow the same spirit as P2, and invariance can also be partially linked to robustness (P4). However, as we discuss in Sec. 4 (at least for the case when only $X$ and $Y$ are RVs, but not the network parameters), defining P1-P4 in terms of information-theoretic quantities may not imply characteristics of DNNs that are desired for a classification task.

Both [20] and [21] have introduced an additional desired property of representations that they call disentanglement. In the context of classification, disentanglement is meant to complement invariance (P2 and P4 in our case). Invariance is achieved by keeping the robust features which are relevant to the classification task, whereas disentanglement requires making the extracted relevant features independent from one another (in the sense of total correlation [21] or some other metric). We have not included this property in our list for the following two reasons: First, disentangling features does not necessarily improve classification or generalization performance. Second, features that are understandable for humans are not necessarily statistically independent when conditioned on the class variable (such as, e.g., size and weight of an object). We believe that more experiments are necessary to determine whether (and when) disentanglement, separated from other desirable properties, improves classification performance or human understandability of the internal representations. Thus, for now, disentanglement is not included in our list of desirable properties.

4 Why and How IB Fails for Training Deterministic DNNs

In this section, we investigate the problem of learning an intermediate representation $L$ (which can also be the output $\hat{Y}$) by a deterministic DNN with a given structure via minimizing the IB functional, i.e., we consider¹

$\min_{\theta}\; I(X;L) - \beta\, I(Y;L)$   (4)

where $\theta$ collects the weight matrices and bias vectors of the DNN.

¹ Note that since the DNN is deterministic, we have $I(X;L) = H(L)$. The IB functional thus coincides with the “deterministic” IB functional proposed in [22].

The IB functional applied to DNNs therefore focuses on P1 and P2, defining them via the mutual information terms $I(Y;L)$ and $I(X;L)$, respectively. Such an approach has been proposed by [1, 2] and, subsequently, the IB framework has been suggested as a possible design principle for DNNs [1, 6, 5, 8]. It was claimed that on this basis compressed, simple, and robust representations can be obtained (see [6, Fig. 2] and [5, Fig. 2]).

Indeed, while the intermediate layers of a DNN with good performance are characterized by a high $I(Y;L)$, they do not need to have a small $I(X;L)$ (cf. [23]), indicating that a small value of the IB functional is not necessary for good classification performance. Furthermore, since the IB framework was introduced to regularize intermediate representations rather than DNN parameters, a small value of $I(X;L)$ does not imply low architectural/computational complexity, as was empirically observed in [2]. Finally, small values of $I(X;L)$ do not relate causally to improved generalization performance, as has been observed based on empirical evidence in [3].

We show that applying the IB framework for training DNNs in this way suffers from two more major issues: The first issue is that, in many practically relevant cases, the IB functional is either equal to infinity or a piecewise constant function of the set of parameters $\theta$. This either makes the optimization problem ill-posed or makes solving it difficult. We investigate these issues in Secs. 4.1 and 4.2. The second issue, which we investigate in Sec. 4.3, is connected to the invariance of mutual information under bijections and shows that focusing on goals P1 and P2 is not sufficient for a good classification system, at least when capturing P1 and P2 within the IB functional (4). Specifically, we show that minimizing the IB functional (4) does not necessarily lead to classifiers that are robust (P4) or that allow using simple decision functions (P3).

4.1 Continuous Features: The IB Functional is Infinite

Solving (4) requires that the IB functional can be evaluated for a set of parameters $\theta$. Since $Y$ is a discrete RV with finite support, the precision term $I(Y;L)$ is finite and can be computed (at least in principle). Suppose now that the distribution of the features $X$ has an absolutely continuous component. Under this assumption, the following theorem shows that, for almost every non-trivial choice of $\theta$, the IB functional is infinite and, hence, its optimization is ill-posed. The proof is deferred to Sec. 8.

Theorem 1.

Let $X$ be an $n$-dimensional RV, the distribution of which has an absolutely continuous component with a probability density function that is continuous on a compact set in $\mathbb{R}^n$. Consider a DNN as in the setup of Sec. 2. Suppose that the activation function $\sigma$ is either bi-Lipschitz or continuously differentiable with strictly positive derivative. Then, for every $i$ and almost every choice of weight matrices, we have

$I(X; L_i) = \infty.$   (5)
Fig. 1: (a)-(c): The line segment depicts the set from which the one-dimensional feature RV $X$ takes its values; red and black mark feature values corresponding to the two classes. (a): Discrete distribution; $X$ has mass points indicated by circles whose sizes are proportional to the probability masses. (b): Data set; crosses indicate data points used for training. (c): Continuous distribution; the support of $X$ is indicated by the thick lines, and the probability masses on each interval are identical to the probability masses of the points in (a). (d): The function implemented by a DNN with one hidden layer with two neurons, ReLU activation functions, and a single output neuron. (e)-(g): The mutual information $I(X;L)$ as a function of a single weight parameter, evaluated on a grid ranging from 0 to 5 in steps of 0.05, for the scenarios in (a)-(c), respectively. The mutual information is piecewise constant; the missing values in (g) indicate that the mutual information is infinite at the respective positions.

In [3, Appendix C] it has been observed that the mutual information between the continuously distributed input $X$ and an intermediate representation $L$ becomes infinite if $L$ has a continuous distribution. This assumption is often not satisfied: For example, the output of a ReLU activation function is, in general, a mixture of a continuous and a discrete distribution. Also, if the number of neurons of some layer exceeds the number of neurons of a preceding layer or the dimension of the input $X$, then $L$ cannot have a continuous distribution on $\mathbb{R}^{|L|}$ if the activation functions satisfy the conditions of Theorem 1. Therefore, our Theorem 1 is more general than [3, Appendix C] in the sense that continuity of the distribution of $L$ is not required.

Theorem 1 shows that the IB functional leads to an ill-posed optimization problem for, e.g., sigmoidal and tanh activation functions (which are continuously differentiable with strictly positive derivative) as well as for leaky ReLU activation functions (which are bi-Lipschitz). The situation is different for ReLU or step activation functions. For these activation functions, the intermediate representations may have purely discrete distributions, from which it follows that the IB functional is finite (at least for a non-vanishing set of parameters). As we discuss in Sec. 4.2, in such cases other issues dominate, such as the IB functional being piecewise constant.

Note that the issue discussed in this section is not that the IB functional is difficult to compute, as was implied in [6, 5]. Indeed, Theorem 1 provides us with the correct value of the IB functional, i.e., infinity, for almost every choice of weight matrices. At the same time, Theorem 1 shows that in such a scenario it is ill-advised to estimate mutual information from a data sample, as such estimates are only valid if the true mutual information determined by the assumed underlying distribution is finite. Indeed, the estimate reveals more about the estimator and the dataset than it does about the true mutual information, as the latter is always infinite by Theorem 1; see also the discussion in [3, Sec. 2 & Appendix C].
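The following sketch (our own illustration, not an experiment from the paper) shows why such estimates are misleading: for a continuous input passed through a deterministic, injective layer, a histogram-based estimate of $I(X;L)$ keeps growing as the binning is refined, reflecting the estimator rather than the (infinite) true value.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=100_000)      # continuous feature X
l = np.tanh(2.0 * x + 0.1)                    # deterministic, injective layer output L

def binned_mi(x, l, n_bins):
    """Histogram-based estimate of I(X;L) in nats with n_bins bins per variable."""
    joint, _, _ = np.histogram2d(x, l, bins=n_bins)
    p = joint / joint.sum()
    p_x = p.sum(axis=1, keepdims=True)
    p_l = p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (p_x * p_l)[mask])))

for n_bins in (8, 32, 128, 512):
    print(f"{n_bins:4d} bins: estimated I(X;L) = {binned_mi(x, l, n_bins):.2f} nat")
# The estimate grows roughly with log(n_bins): it reflects the binning and the sample,
# not a finite ground-truth value, since the true I(X;L) is infinite for this map.
```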

4.2 Discrete Features or Learning from Data: The IB Functional is Piecewise Constant

We next assume that the features have a discrete distribution, i.e., $X$ can assume only a finite number of different points in $\mathbb{R}^n$. For example, one may assume that $X$ is a RV over black-and-white images with $n$ pixels, in which case the distribution of $X$ is supported on $\{0,1\}^n$. In such a case, the entropy of $X$ is finite and, thus, so is the entropy of every intermediate representation $L$. More precisely, since the DNN is deterministic, the distribution of $L$ is discrete as well, from which it follows that $I(X;L) = H(L)$ can assume only finitely many values. Similarly, since both $Y$ and $L$ are discrete, also $I(Y;L)$ can assume only finitely many different values. As a consequence, the IB functional is a piecewise constant function of the parameters $\theta$ and, as such, difficult to optimize. Specifically, the gradient of the IB functional w.r.t. the parameter values is zero almost everywhere, and one has to resort to optimization heuristics that are not gradient-based.

The problem of piecewise constancy persists if the empirical joint distribution of $X$ and $Y$ based on a dataset with finitely many data points is used to optimize the IB objective. The entropy of $X$ is then at most the logarithm of the number of data points, and the IB functional remains piecewise constant. Indeed, $I(X;L)$ and $I(Y;L)$ may only change when two different data points which were previously mapped to different values of the intermediate representation now get mapped to the same value, or vice versa. It was shown empirically in [3, Fig. 15] that $I(X;L) = H(X)$ throughout training, i.e., for a large selection of weight matrices.

Finally, the IB functional can be piecewise constant also for a continuously distributed feature RV $X$ if step or ReLU activation functions are used. This can happen, for example, if the distribution of $X$ is supported on a disconnected set. Such a situation is depicted in Fig. 1, together with the scenarios of a discretely distributed feature RV and of a finite dataset.
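The following minimal sketch (an illustrative assumption mirroring the spirit of Fig. 1, not its exact setting) evaluates $I(X;L)$ for a single ReLU neuron over a grid of weight values when $X$ has a discrete distribution; only a handful of distinct values occur, confirming that the functional is piecewise constant in the parameters.

```python
import numpy as np

def discrete_mi(a, b):
    """I(A;B) in nats from paired samples of two discrete variables."""
    va, ia = np.unique(a, return_inverse=True)
    vb, ib = np.unique(b, return_inverse=True)
    joint = np.zeros((len(va), len(vb)))
    np.add.at(joint, (ia, ib), 1.0)
    p = joint / joint.sum()
    pa, pb = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / (pa * pb)[m])))

# Discrete feature variable with four equiprobable mass points, single ReLU neuron.
x = np.array([-1.0, -0.3, 0.4, 1.2])
values = []
for w in np.arange(0.0, 5.05, 0.05):
    l = np.maximum(w * x - 1.0, 0.0)          # L = ReLU(w * x + b) with b = -1 fixed
    values.append(discrete_mi(x, l))
# I(X;L) only changes when the map x -> l merges or separates feature values, so only
# a handful of distinct values appear over the whole grid (gradient is zero a.e.).
print(sorted(set(np.round(values, 6))))
```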

4.3 Invariance under Bijections: The IB Functional is Insufficient

Leaving aside the fundamental problems discussed in Sec. 4.1 and Sec. 4.2, we now show that the IB functional is insufficient to fully characterize classification problems using DNNs. Specifically, we show that training a DNN by minimizing (4) does not lead to representations that admit simple decision functions (P3) or that are robust to noise, well-behaved transformations, or small distortions (P4). To this end, we give several examples comparing two DNNs whose intermediate representations are equivalent in terms of the IB functional, but where one of them is clearly the more desirable solution. Since the IB functional does not give preference to either of the two solutions, we conclude that it is insufficient to achieve intermediate representations satisfying the requirements stated in Sec. 3. For the sake of argument, we present simple, synthetic examples instead of empirical evidence on real-world datasets to illustrate these shortcomings. On the one hand, the examples have all the essential aspects associated with training a DNN for a practical classification task. On the other hand, because of their simplicity, the examples lend themselves to clearly highlighting and explaining the different shortcomings in isolation. One can then easily extrapolate how one may encounter these issues in practical scenarios.

Fig. 2: Representational simplicity and robustness in binary classification: The top figure on the L.H.S. illustrates the two-dimensional input space and the support of the input $X$. The remaining figures on the L.H.S. show various functions of the first input coordinate (on which the class label depends) implementable using a ReLU-based DNN. The figures on the R.H.S. show the output RVs obtained when $X$ is transformed via the corresponding functions on the L.H.S. Red and black indicate feature values corresponding to the two classes.

We consider a binary classification problem (i.e., $|\mathcal{Y}| = 2$) based on the two-dimensional input $X$ shown in Fig. 2. The input RV/samples take values in four disjoint compact sets: two (marked red) corresponding to one class and two (marked black) corresponding to the other. We denote the union of these four sets by $\mathcal{X}$. Perfect classification is possible in principle, i.e., the distributions of $X$ given different classes have disjoint supports² and $I(X;Y) = H(Y)$. The DNNs required to obtain the intermediate representations discussed in the examples can be easily implemented using ReLU activation functions (the examples can be modified to work with other activation functions). For the sake of simplicity, in this example the class label depends on only one dimension of the input. Moreover, in the examples we evaluate the IB functional for the output $\hat{Y}$. The considerations are equally valid if the presented functions are encoders for a latent representation $L$ instead of $\hat{Y}$. Finally, one can extend the examples to the case where the intermediate representation has more than one dimension.

² We refer the reader to [24] regarding additional issues from which the IB framework suffers in this scenario.

First, consider the first pair of functions on the left-hand side (L.H.S.) of Fig. 2, implemented by two DNNs. The corresponding figures on the right-hand side (R.H.S.) show the supports of the distributions of the two outputs, i.e., the sets to which $\mathcal{X}$ is mapped by the two functions, respectively. It is easy to see that the IB functional evaluates to the same value for both DNNs: both DNNs have identical compression terms and perfect precision, i.e., $I(Y;\hat{Y}) = H(Y)$. However, while the first output admits a simple decision by thresholding at a single value, the second representation requires a more elaborate decision rule. This holds true regardless of whether the input has a continuous or a discrete distribution supported on a subset of $\mathcal{X}$. It also holds if the computations are done based on a dataset with input samples lying in $\mathcal{X}$.

The same phenomenon can be observed when comparing the second pair of functions in Fig. 2, implemented by two DNNs. If the input has a continuous distribution supported on $\mathcal{X}$, this leads to the continuous output RVs shown on the R.H.S. Again, both DNNs have perfect precision and identical compression terms, where the compression terms are both infinite in this case due to the continuously distributed input and outputs. However, one output admits a simple decision by thresholding at a single value, whereas the other requires a more elaborate decision rule.

We finally turn to the question of robustness against noisy inputs. This, in general, cannot be answered by looking at the intermediate representations alone, as we show in the following two examples. To this end, first consider the situation depicted in Fig. 3. As can be seen, perfect classification is possible with a single neuron with a ReLU activation function. We consider two different DNNs with no hidden layers and a single output neuron, with two different parameterizations. Both parameterizations are equivalent in terms of the IB functional, leading to identical precision and compression terms. Note, however, that one of the two DNNs is more robust to a small amount of noise or distortion than the other. This can be seen from the blue dot in Fig. 3, indicating a noisy input generated (with high probability) by one of the two class labels. While one parameterization does not admit distinguishing this point from features generated by the other class label, the second parameterization does (see the R.H.S. of Fig. 3). Indeed, thresholding the two outputs yields the decision regions indicated by the dashed and dotted lines on the left of Fig. 3.

As a second example, consider the two DNNs implementing the remaining pair of functions in Fig. 2; the corresponding output RVs for the given $X$ are shown on the R.H.S. For the given distribution of $X$, the two functions coincide on $\mathcal{X}$, i.e., the two DNNs implement the same function over the support of $X$. Adding noise to $X$ or distorting it has the potential to enlarge the support of its distribution. In this case, one of the two DNNs will be more robust to such noise and distortions than the other, whose function has sharp transitions outside the current support of the distribution of $X$. This holds true whether the input is continuous or discrete with support on a subset of $\mathcal{X}$. It also holds if the computations are done based on a dataset with input samples lying in $\mathcal{X}$.

Fig. 3: Robustness in binary classification. The L.H.S. shows the two-dimensional feature space, with the two feature coordinates on the horizontal and vertical axes; the feature $X$ is distributed on one set for one class label and on a disjoint set for the other, as indicated by the red and black markers. The R.H.S. shows the supports of the output distributions obtained by two different DNNs with identical IB functionals. The blue dot represents a noisy feature or a data point not in the training set. See the text for details.

In conclusion, the IB framework may serve to train a DNN whose output is maximally compressed (in an information-theoretic sense) and maximally informative about the class. However, one cannot expect that the DNN output admits taking a decision with a simple function or that the DNN is robust against noisy or distorted features. This holds true also for DNNs with discrete-valued features $X$, such as those discussed in [1], for which the issue of Sec. 4.1 does not appear. Based on the link between robustness and generalization, this suggests that the IB functional also cannot be used to quantify generalization of a DNN during training.
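The bijection invariance underlying these examples is easy to reproduce numerically. In the hedged sketch below (our own toy construction, not the exact functions of Fig. 2), two deterministic outputs related by a bijection on their values have identical compression and precision terms, yet only one of them admits the simple decision rule of thresholding at zero.

```python
import numpy as np

def discrete_mi(a, b):
    """I(A;B) in nats from paired samples of two discrete variables."""
    va, ia = np.unique(a, return_inverse=True)
    vb, ib = np.unique(b, return_inverse=True)
    joint = np.zeros((len(va), len(vb)))
    np.add.at(joint, (ia, ib), 1.0)
    p = joint / joint.sum()
    pa, pb = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / (pa * pb)[m])))

# Discrete toy features and labels: the class is the sign of the feature.
x = np.array([-2, -1, 1, 2] * 250)
y = (x > 0).astype(int)

z1 = x.astype(float)                              # output of a "simple" DNN
perm = {-2: 1.0, -1: -2.0, 1: 2.0, 2: -1.0}       # a bijection on the values of z1
z2 = np.array([perm[v] for v in x])               # output of an "interleaved" DNN

print("I(X;Z1) =", discrete_mi(x, z1), " I(X;Z2) =", discrete_mi(x, z2))
print("I(Y;Z1) =", discrete_mi(y, z1), " I(Y;Z2) =", discrete_mi(y, z2))
# Both pairs coincide, so the IB functional cannot distinguish the two DNNs, although
# only the first output admits the simple decision rule "threshold at 0":
print("accuracy of 'threshold at 0' on Z1:", np.mean((z1 > 0) == y))
print("accuracy of 'threshold at 0' on Z2:", np.mean((z2 > 0) == y))
```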

5 How to Use IB-Like Cost Functions for Training DNNs

The issues we discussed in Sec. 4 apply to training deterministic DNNs. In this section we discuss possible remedies for these problems, such as forcing the intermediate representation to be discrete, training stochastic DNNs, and replacing the IB functional by a more well-behaved cost function inspired by the IB framework. With these approaches one can guarantee that the IB functional is finite and that specific pairs of intermediate representations, related by invertible transforms, are not equivalent anymore. However, some of the proposed approaches lead to the IB functional being piecewise constant, similar to the scenario in Sec. 4.2.

5.1 Including the Decision Rule

One approach to successfully apply the IB framework for DNN training is to include a simple decision rule. Specifically, when the framework is applied to the output representation $\hat{Y}$, for a fixed decision function $m$ and $\tilde{Y} = m(\hat{Y})$, the goal shall be to solve

$\min_{\theta}\; I(X;\tilde{Y}) - \beta\, I(Y;\tilde{Y}).$   (6)

For example, for a binary $Y$ and a single output neuron, one could set $\tilde{Y} = \mathbb{1}[\hat{Y} > \tau]$ for a fixed threshold $\tau$, where $\mathbb{1}[\cdot]$ is the indicator function; for $|\mathcal{Y}|$ output neurons, $\tilde{Y}$ could be the index of the output neuron with the maximum value.

Since $\tilde{Y}$ has a discrete distribution with as many mass points as the class variable, the compression term appears useless: compression is enforced by including the decision rule $m$. Similarly, the simplicity of the representation is automatically enforced by the simplicity of the decision function $m$ when solving (6). Moreover, the IB functional becomes computable for the output layer because $I(X;\tilde{Y}) \le H(\tilde{Y}) \le \log|\mathcal{Y}| < \infty$. Finally, for the first example depicted in Fig. 2, including a simple threshold decision rule clearly favors the first function over the second: while both still have identical compression terms (finite now due to quantization), the precision term remains maximal, i.e., $I(Y;\tilde{Y}) = H(Y)$, only for the representation that admits thresholding, whereas $I(Y;\tilde{Y}) < H(Y)$ for the other.

Including the decision rule, however, does not lead to improved robustness. Indeed, consider Fig. 3: the resulting decision RVs $\tilde{Y}$ for the two parameterizations are identical (and, thus, so are the IB functionals), although one of them suffers from reduced robustness. The same can be observed by looking at the last pair of functions in Fig. 2. Furthermore, this shows that including a decision rule, due to the coarse quantization of $\hat{Y}$, leads to large equivalence classes of DNNs that evaluate to the same value in (6), which is conceptually similar to the IB functional being piecewise constant (cf. Sec. 4.2 and [2, Sec. 3.5]).

To apply this method to intermediate representations other than $\hat{Y}$, one possible approach is to feed the latent representation $L_i$ to an auxiliary decision rule and to minimize (6) for the resulting decision RV. This subsequently leads to layer-wise training in a greedy manner, similar to the one discussed in [25]. Such a layer-wise greedy training procedure must be carefully designed in order to exploit the full benefits of deeper DNN architectures.
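As a numerical illustration of (6) (a toy construction of ours, not an experiment from the referenced works), the sketch below includes a hard threshold decision rule: the two outputs that the plain IB functional could not distinguish in the previous sketch now receive different objective values, because the precision term $I(Y;\tilde{Y})$ collapses for the representation that does not admit thresholding.

```python
import numpy as np

def discrete_mi(a, b):
    """I(A;B) in nats from paired samples of two discrete variables."""
    va, ia = np.unique(a, return_inverse=True)
    vb, ib = np.unique(b, return_inverse=True)
    joint = np.zeros((len(va), len(vb)))
    np.add.at(joint, (ia, ib), 1.0)
    p = joint / joint.sum()
    pa, pb = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / (pa * pb)[m])))

# Same discrete toy problem; two DNN outputs that are bijections of each other.
x = np.array([-2, -1, 1, 2] * 250)
y = (x > 0).astype(int)
yhat_a = x.astype(float)                                                 # thresholdable
yhat_b = np.array([{-2: 1.0, -1: -2.0, 1: 2.0, 2: -1.0}[v] for v in x])  # interleaved

beta, tau = 1.0, 0.0
for name, yhat in [("A", yhat_a), ("B", yhat_b)]:
    y_tilde = (yhat > tau).astype(int)            # hard decision rule m(.)
    objective = discrete_mi(x, y_tilde) - beta * discrete_mi(y, y_tilde)
    print(f"DNN {name}: I(X;Y~) = {discrete_mi(x, y_tilde):.3f}, "
          f"I(Y;Y~) = {discrete_mi(y, y_tilde):.3f}, objective (6) = {objective:.3f}")
```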

5.2 Probabilistic Interpretation of the Neuron Outputs

Another option, related to Sec. 5.1, is to introduce a soft decision rule. For example, in a one-vs-all classification problem with a softmax output layer with $|\mathcal{Y}|$ neurons, the $y$-th entry of $\hat{Y}$ can be interpreted as the probability that the class label equals $y$. Thus, the soft decision $\tilde{Y}$ is a discrete RV with alphabet $\mathcal{Y}$ that depends stochastically (and not deterministically) on the feature vector $X$. Using this approach not only guarantees that the functional in (6) is finite but also, unlike in Sec. 5.1, admits applying gradient-based optimization techniques even for finite datasets. Moreover, using a soft decision rule makes the precision term sensitive to simplification, encouraging this property in the output. The precision term also promotes P2 in the sense of encouraging dense clusters in the output domain. These claims can be verified, for example, by looking at the functions in Fig. 2 (assuming that the inputs are uniform over their support) and identifying the values of $\hat{Y}$ with probabilities of one of the two classes. The utility of the compression term becomes even more questionable than in Sec. 5.1. For one, $\tilde{Y}$ is discrete, which automatically enforces implicit compression. Moreover, the compression term $I(X;\tilde{Y})$ now differs between the two functions compared in Fig. 2, so that minimizing it prefers one of the two rather than evaluating them equally – and not necessarily the more desirable one. Therefore, one either should choose $\beta$ in (6) more carefully or drop the compression term altogether.

To apply this method to a latent representation, one possible approach is to feed the latent representation to a linear layer of size $|\mathcal{Y}|$ followed by a softmax layer to generate the soft decision. Similar to the case in Sec. 5.1, this subsequently leads to layer-wise training in a greedy manner and hence must be carefully designed in order to exploit the full benefits of deeper DNN architectures.
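A hedged sketch of the probabilistic interpretation: the soft decision $\tilde{Y}$ is drawn from the softmax distribution of each sample, and the precision term $I(Y;\tilde{Y})$ is computed from the resulting joint distribution. The two sets of logits are invented for illustration; they show that, unlike the deterministic precision term, the soft precision term rewards confident, well-separated outputs.

```python
import numpy as np

def mutual_information(p_joint):
    """I(A;B) in nats for a joint pmf given as a 2-D array."""
    pa = p_joint.sum(axis=1, keepdims=True)
    pb = p_joint.sum(axis=0, keepdims=True)
    m = p_joint > 0
    return float(np.sum(p_joint[m] * np.log(p_joint[m] / (pa * pb)[m])))

def soft_precision(logits, labels):
    """I(Y;Y~) when Y~ is drawn from the softmax distribution of each sample."""
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    n_classes = logits.shape[1]
    joint = np.zeros((n_classes, n_classes))      # joint pmf of (Y, Y~)
    for p, y in zip(probs, labels):
        joint[y] += p / len(labels)               # uniform weight per sample
    return mutual_information(joint)

# Two hypothetical softmax outputs on the same four samples (two per class):
labels = np.array([0, 0, 1, 1])
confident = np.array([[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [0.0, 4.0]])
hesitant  = np.array([[0.4, 0.0], [0.4, 0.0], [0.0, 0.4], [0.0, 0.4]])

print("I(Y;Y~), confident outputs:", round(soft_precision(confident, labels), 3))
print("I(Y;Y~), hesitant outputs: ", round(soft_precision(hesitant, labels), 3))
# The soft precision term rewards well-separated (confident) outputs, whereas the
# deterministic precision term is left unchanged by any bijective re-scaling.
```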

5.3 Stochastic DNNs

A further approach is to use the IB functional to train stochastic DNNs rather than deterministic ones. A DNN can be made stochastic by, for example, introducing noise to the intermediate representation(s). The statistics of the introduced noise can also be treated as trainable parameters or adapted to (the statistics of) the intermediate representation(s). The objective function to be optimized remains (4). For $I(X;L)$ to be finite, it suffices to add noise with an absolutely continuous distribution to $L$. This approach can be used for layer-wise training as well as for training the DNN as a whole. Depending upon where and what type of noise is introduced, the precision term $I(Y;L)$ can encourage robust representations, for which the class information does not degrade by the introduction of noise and/or deformations. Similarly, it may also promote intermediate representations with well-separated (sub-)regions corresponding to different labels, which in turn admit simpler decision functions for the stochastic DNN. For example, a small amount of uniform noise added to the intermediate representation leads to a better IB functional for the simpler of the two functions in each of the first two examples of Fig. 2. For stochastic DNNs, the compression term can encourage more compact representations: again in Fig. 2, adding a small amount of noise to the output makes the compression term larger for one of the two representations than for the other, making the representation with the smaller compression term more desirable. Note that in the case of stochastic DNNs, the noisy intermediate representation is fed as input to the next layer of the DNN.

In addition to resolving the issues associated with IB functional mentioned in Sec. 4, training a stochastic DNN in such a way also provides a novel way of data augmentation. Sampling the intermediate representation multiple times during training for each input sample can be viewed as a way of dataset augmentation, which may lead to improved robustness. Introducing noise in a latent (bottleneck) representation thus presents an alternative to the data augmentation approach proposed in [26], which requires training a separate auto-encoder to obtain latent representations to be perturbed by noise.
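The following PyTorch sketch illustrates the structural idea (it is our own minimal construction, not the architecture of any cited work): Gaussian noise with a trainable scale is injected at a bottleneck representation, the noisy representation is fed to the subsequent layer, and sampling it several times per input provides the data-augmentation effect described above. For brevity, the tractable cross-entropy loss stands in for the precision term (cf. Secs. 5.2 and 5.4).

```python
import torch
import torch.nn as nn

class StochasticBottleneckNet(nn.Module):
    """Toy stochastic DNN: Gaussian noise is injected at a bottleneck representation.

    The noise scale is a trainable parameter; the noisy representation is what gets
    fed to the next layer, so the network is stochastic during training and inference.
    """
    def __init__(self, in_dim=2, bottleneck_dim=8, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(),
                                     nn.Linear(32, bottleneck_dim))
        self.log_noise_scale = nn.Parameter(torch.zeros(bottleneck_dim))
        self.decoder = nn.Linear(bottleneck_dim, n_classes)

    def forward(self, x, n_samples=1):
        l = self.encoder(x)                                    # deterministic part
        noise = torch.randn(n_samples, *l.shape, device=l.device)
        l_noisy = l.unsqueeze(0) + noise * self.log_noise_scale.exp()
        return self.decoder(l_noisy)                           # (n_samples, batch, classes)

# Sampling the noisy bottleneck several times per input acts as task-adapted data
# augmentation: every replica contributes to the loss.
net = StochasticBottleneckNet()
x = torch.randn(16, 2)
y = torch.randint(0, 2, (16,))
logits = net(x, n_samples=5)                                   # 5 noisy replicas per input
loss = nn.functional.cross_entropy(logits.reshape(-1, 2), y.repeat(5))
loss.backward()
print("loss:", float(loss))
```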

5.4 Replacing the IB Functional

A final approach is to replace the IB functional by a cost function that is more well-behaved, but motivated by the IB framework. Specifically, by replacing the mutual information terms by (not necessarily symmetric) quantities $\tilde{I}(X;L)$ and $\tilde{I}(Y;L)$, we replace (4) by

$\min_{\theta}\; \tilde{I}(X;L) - \beta\, \tilde{I}(Y;L).$   (7)

This approach can be used for training the DNN both as a whole and layer-wise.

We first consider setting $\tilde{I}(X;L) = I(Q_1(X); Q_2(L))$ and $\tilde{I}(Y;L) = I(Y; Q_3(L))$, where $Q_1$, $Q_2$, and $Q_3$ are quantizers that are adapted according to the statistics of the latent representation and w.r.t. one another.³ It is important to note that the quantization is not performed inside the DNN, but only for computing the cost function in (7). This is the typical approach taken when mutual information is estimated from finite datasets using histogram-based methods. Unlike [2], we argue that the design of $Q_1$, $Q_2$, and $Q_3$ should not only be guided by the goal of estimating the true mutual information (which is bound to fail according to our analysis in Sec. 4.1), but also by the aim to instill the desired properties from Sec. 3 into the cost function (7).

³ It is important to adapt the quantizers (or the noise levels) to (the statistics of) the latent representation in order to rule out ways to decrease the cost without fundamentally changing the characteristics of $L$, e.g., by simple scaling.

The effect of quantization is that $\tilde{I}(X;L)$ becomes finite. Moreover, if $Q_3$ is set appropriately, solving (7) leads to simpler representations. Considering again Fig. 2, a coarse quantizer $Q_3$ prefers the representations that admit simple threshold decisions over their more complicated counterparts; however, the finer the quantization, the less sensitive is (7) to the simplicity of the intermediate representation.

The fact that the quantizers used for the compression and precision terms need not coincide yields an advantage over the solution proposed in Sec. 5.1, in the sense that the compression term can become useful now. With the above choice of $Q_3$ we see that the precision term is the same for the first two functions in Fig. 2. However, if $Q_1$ is the identity function and $Q_2$ a uniform quantizer with four quantization levels on the range of the output, then one of the two functions leads to a larger compression term than the other, thus favoring the latter. However, similarly as we observed in Sec. 5.1, the quantized IB functional partitions DNNs into large equivalence classes that do not necessarily distinguish according to robustness. Additionally, the quantized IB functional is piecewise constant when used for finite datasets. Finally, choosing appropriate quantizers is not trivial; e.g., the effect of this choice has been empirically evaluated in [3], with focus only on the compression term. For the quantized IB functional, choosing quantizers becomes even more complicated.

Without going into details, we note that computing a noisy IB functional, for example by setting $\tilde{I}(X;L) = I(X; L + N_1)$ and $\tilde{I}(Y;L) = I(Y; L + N_2)$ for noise variables $N_1$ and $N_2$ that are adapted according to (the statistics of) $L$ and w.r.t. each other (cf. footnote 3), can lead to simplified and compact representations. In contrast to the quantized IB functional, the noisy IB functional can even lead to robust representations and, for appropriately chosen noise models, is not piecewise constant for finite datasets, hence admitting efficient optimization using gradient-based methods. Again, we note that noise is not introduced inside the DNN, but only in the computation of (7); hence, the DNN is still deterministic.
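A minimal sketch of the quantized surrogate (our illustration; the quantizer design is an assumption): the compression term is computed for a quantized version of a scalar latent representation, with the quantizer adapted to the range of the representation and applied only inside the cost, so the DNN itself stays deterministic. Re-adapting the quantizer after a simple re-scaling of the latent representation leaves the surrogate essentially unchanged, which is exactly the purpose of the adaptation mentioned in footnote 3.

```python
import numpy as np

def discrete_mi(a, b):
    """I(A;B) in nats from paired samples of two discrete variables."""
    va, ia = np.unique(a, return_inverse=True)
    vb, ib = np.unique(b, return_inverse=True)
    joint = np.zeros((len(va), len(vb)))
    np.add.at(joint, (ia, ib), 1.0)
    p = joint / joint.sum()
    pa, pb = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / (pa * pb)[m])))

def quantized_compression(x_ids, latent, n_levels):
    """Surrogate compression term I(X; Q(L)) for a scalar latent representation.

    The quantizer Q is adapted to the range of L and used only inside the cost."""
    lo, hi = latent.min(), latent.max()
    q = np.floor((latent - lo) / (hi - lo + 1e-12) * n_levels).astype(int)
    return discrete_mi(x_ids, q)

# Deterministic toy "bottleneck": scaling the latent by 10 is a bijection, so the true
# I(X;L) is unchanged, and the adapted quantizer keeps the surrogate comparable too.
rng = np.random.default_rng(3)
x_ids = np.arange(1000)                          # identifies the input sample
latent = np.tanh(rng.standard_normal(1000))
for n_levels in (2, 4, 16):
    print(n_levels, "levels:",
          round(quantized_compression(x_ids, latent, n_levels), 3),
          round(quantized_compression(x_ids, 10 * latent, n_levels), 3))
```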

Other than quantizing or adding noise in the computation of the mutual information terms, one may go one step further and replace these terms with different quantities. For example, it is common to replace the precision term by the cross-entropy between the true conditional distribution of $Y$ given $L$ and, e.g., a parametric surrogate distribution. Moreover, also the compression term can be replaced by terms that are inspired by $I(X;L)$ but differ in essential details. These changes to the optimization problem often directly enforce goals such as P3 and P4, even though these goals were not captured by the original optimization problem.

Finally, when replacing the two terms in (4) with different quantities, one may even choose different intermediate representations for the precision term and for the compression term. For example, the compression term can be defined based on a latent representation $L_i$, while the precision term is defined based on the output $\hat{Y}$. The compression term can then enforce desired properties on the latent representation, whereas the precision term ensures that the output of the DNN admits simple decisions and predicts the class well enough. In contrast, evaluating (4) only for an internal representation trains only the encoder, failing to instill desired properties into the output $\hat{Y}$; evaluating (4) only for the output trains the whole DNN, but does not necessarily lead to internal representations with the desirable properties from Sec. 3.

It is worth mentioning that the approaches in this section are not completely independent. For example, on the one hand, a probabilistic interpretation of the output (Sec. 5.2) can be considered a special type of stochastic DNN in which the stochasticity appears only in the output neurons. On the other hand, evaluating the IB functional for this probabilistic interpretation can be considered as replacing the IB functional with a different cost function. This is in line with the reasoning in [27], illustrating that the same problem may be solved equivalently by adapting the optimization method, the feasible set, or the cost function.

A common theme in our remedies from Secs. 5.1 to 5.4 is that they encourage latent representations in which data points from different classes are represented in some geometrically compact manner. In other words, the proposed remedies encourage compression (P2) in a geometric sense rather than in the sense of a minimal sufficient statistic. This is intuitive, since representing classes by clusters that are tight and far apart allows using simple decision rules for classification (P3). While such clustering does not immediately ensure robustness (P4), the injection of noise (either directly or only in the computation of the IB functional, cf. Secs. 5.3 and 5.4) does.

All this certainly does not imply that measuring P2 in information-theoretic terms is inadequate. Rather, it illustrates that measuring P2 in information-theoretic terms is insufficient to instill desirable properties such as simple decision functions or robustness, while understanding P2 in geometric terms has the potential to do so.

6 Critical Discussion of and Experimental Evidence from Related Work

In this section, we not only critically assess and provide insights into the related work in light of Sec. 4, but we also discuss, where relevant, how some of these works provide experimental evidence supporting the approaches we propose in Sec. 5. Some of these works report successes in terms of different operational goals, such as generalization, adversarial robustness, and out-of-distribution data detection, that are directly relevant for applying classifiers in practice. This shows that our proposed remedies are not only successful in instilling desired characteristics, such as P3 and P4, in intermediate representations, but are also, via these characteristics, able to achieve various operational goals. Hence, although this work mainly aims at providing analytical understanding via theory and intuitive examples, we also rely on the experimental evidence from other works to support our claims.

The idea of using the IB framework for DNNs was first introduced in [1]. They proposed using the IB functional to analyze DNNs in terms of performance as well as architectural compactness and argued that this can be done not only for the output but also for hidden layer representations of a DNN. This, purportedly, leads to a deeper insight into the inner workings of a DNN than an evaluation based on the output or network parameters could. They also suggested the IB functional as an optimization criterion for training DNNs.

In [2], the authors applied these ideas to analyze DNNs trained using cross-entropy without regularization. They empirically observed that compression (in the sense of a small $I(X;L)$) cannot be linked to architectural simplicity. Furthermore, based on their empirical observations, they claimed that training includes a compression phase that, they believe, is causally linked to improved generalization performance of the DNN. The authors of [3] presented empirical evidence contradicting this claim, which initiated a debate that is still ongoing. Moreover, the authors of [3] discussed analytically and empirically that the compression phase observed in [2] is an artifact of the quantization strategy used to approximate the compression term, in connection with the activation function used. They also briefly looked at the computability issues of the compression term of the IB functional [3, Sec. 2 & Appendix C], recognized that this term is infinite if the intermediate representation is continuous, and suggested replacing the compression term by the mutual information between $X$ and a noisy or quantized version of $L$ (cf. Sec. 5.4). Besides, recently invertible DNN architectures [23, 28] have been proposed that achieve state-of-the-art performance. For such invertible networks, $I(X;L)$ stays the same regardless of the network parameters. The success of such networks also casts doubt on the claims in [2] that information-theoretic compression can be linked to generalization capabilities. The discussion in Sec. 4, although presented in the context of training DNNs, also holds for the analysis of DNNs using the IB functional. Hence the shortcomings and issues mentioned in Sec. 4 shed new light on this debate and are in line with the observations made in [3].

The authors of [29] studied the latent representations (obtained via training a DNN using a standard loss function) in the context of estimating the compression term. They inject Gaussian noise at the output of every neuron and show that, in this case, geometric clustering of the latent representation is captured well by both the compression term $I(X;L)$ and the entropy of a quantized version of the latent representation. As we mention at the end of Sec. 5, encouraging geometric clustering has the potential to directly instill the desirable properties of simple decision rules (P3) and robustness (P4) into latent representations. The observations in [29] therefore support our proposal to either use a stochastic DNN (Sec. 5.3) or to replace the cost function (e.g., via the quantized entropy) to instill desirable properties into the latent representations (Sec. 5.4).

Reference [30] uses the compression term $I(X;L)$ to bound the generalization gap from above. Although theoretically interesting, the bound relies on strong assumptions, such as $X$ and $L$ being discrete and the use of an “optimal” decoder for the latent representation $L$. Furthermore, the upper bound only accounts for the generalization gap and not for the actual performance (which can be bad despite a small generalization gap). Finally, the upper bound is infinite, and hence not useful, in the setting of deterministic DNNs with continuous $X$ and $L$ that was discussed in Sec. 4.

We next turn to works that train DNNs using cost functions inspired by the IB principle. The authors of [27] proposed minimizing the IB functional based on parametric distributions for the (stochastic) encoder and decoder, i.e., combining approaches from Secs. 5.3 and 5.4. They showed that minimizing the resulting cost function (regularized by total correlation to encourage disentangled representations) is equivalent to minimizing cross-entropy over DNNs with multiplicative noise (dubbed information dropout). They also discovered that, for a certain choice of the trade-off parameter $\beta$ and for the goal of reconstruction, i.e., $Y = X$, the regularized cost function is equivalent to the one used for variational auto-encoding.

Reference [6] trained a stochastic DNN using a variational upper bound on the IB functional and showed that the resulting DNN has state-of-the-art generalization performance as well as improved robustness to adversarial examples. They introduce noise at a dedicated bottleneck layer (cf. Sec. 5.3), leading to a stochastic DNN with a finite IB functional for the bottleneck and the subsequent layers. The authors then replace the compression term for the bottleneck layer with a variational upper bound (cf. Sec. 5.4) to make the compression term tractable; the resulting term is no longer invariant under bijections and encourages bottleneck representations that are compact in a geometric sense. They further replace the precision term by the cross-entropy loss. This can be interpreted as two steps applied sequentially, namely first lower bounding the precision term for the bottleneck layer by the precision term for the output, and then lower bounding the latter by the cross-entropy loss with a probabilistic interpretation of the output⁴ (cf. Sec. 5.2). Combining the bounds on the compression and precision terms thus instills desirable properties in both the bottleneck representation and the output representation (cf. the end of Sec. 5.4). Unlike the original mutual information terms, cross-entropy applied to a probabilistic interpretation of the output is no longer insensitive to bijections and, in conjunction with the noise introduced at the bottleneck layer, enforces simplicity of the decision rule and robustness in the trained DNN (see [6, Fig. 2]).

⁴ See Sec. 9 for a detailed discussion of this two-step perspective.

The authors of [5] optimized the IB functional using stochastic DNNs, which is closely related to training stochastic DNNs using the IB functional. They followed a very similar approach as the authors of [6], with the main difference that they replace the variational bound on the compression term by a non-parametric bound. They show that the intermediate representations they obtain form geometrically dense clusters as compared to the representations of DNNs trained using traditional cost functions (see  [5, Fig. 2]).

The authors of [8] used the same technique as in [6] to train stochastic neural networks, but they measure the performance of the DNNs in terms of classification calibration as well as the DNN’s ability to detect out-of-distribution data. Since the training technique is the same as in [6], our discussion of [6] also applies here.

Also [9] uses a setup similar to [6]. For example, they introduce Gaussian noise at the latent representation (Sec. 5.3) and approximate the compression term similarly. However, they replace the precision term in the IB functional with a term that quantifies channel deficiency based on the Kullback-Leibler divergence and use a tractable approximation for this new precision term (Sec. 5.4).

In summary, the authors of [6, 5, 8, 9] propose cost functions motivated by the IB framework but depart from it by using a combination of techniques from Sec. 5. Their promising performance in terms of various operational goals is therefore experimental evidence of the success of our proposed remedies. Note that these works do not explicitly mention that the IB functional leads to an ill-posed optimization problem whose solution lacks desirable properties such as representational simplicity and robustness; rather, they introduce the aforementioned modifications to obtain tractable bounds on the IB functional that can be optimized using gradient-based methods.

7 Concluding Remarks

We have shown in Sec. 4 that training deterministic DNNs using the IB functional suffers from serious problems. Aside from the optimization problem (4) being ill-posed (Sec. 4.1) or inaccessible to gradient-based optimization (Sec. 4.2), the IB functional does not capture desirable properties of intermediate representations, such as allowing simple decisions and robustness to noise (Sec. 4.3). Including a simple decision rule while computing the IB functional solves some of these problems, but may lead to cost functions that are piecewise constant (Sec. 5.1). Similarly, training stochastic DNNs with the IB functional (Sec. 5.3) solves some problems and additionally provides a new mechanism for task-specific data augmentation. Most obviously, replacing the IB functional with a cost function that is more well-behaved also leads to robust and simple representations (Sec. 5.4). To achieve this goal it is not even necessary to depart far from the IB framework: Replacing $I(X;L)$ by $I(X; L+N_1)$ and $I(Y;L)$ by $I(Y; L+N_2)$, for noise terms $N_1$ and $N_2$, can lead to DNNs that are robust and whose output allows using a simple decision rule. Sec. 6 not only critically assesses the related work but also utilizes it to provide empirical evidence for the success of the remedies we propose in Sec. 5.

We wish to mention that the discussion in Sec. 4.3 holds equally well for the analysis of DNNs using the IB framework: A good result in terms of the IB functional admits statements neither about the robustness of the DNN nor about the simplicity of the required decision function. The ill-posed or piecewise constant nature of the IB functional for classification tasks using DNNs (cf. Sec. 4.1 and Sec. 4.2) further complicates the situation and makes it an unfit tool for analysis. Our results regarding generalization are thus in line with the observations in [3].

We believe that the idea of regularization introduced by the IB functional, i.e., to regularize the intermediate representations rather than the parameters of the DNN, has great potential. Traditional complexity measures focus on what a DNN can do based on the network architecture while ignoring the (estimated) data statistics and the actual function implemented by the DNN, i.e., the network parameters after training on the actual data. For example, the VC dimension is independent of the data distribution and the learned network parameters, and the generalization error bound based on the Rademacher complexity only depends on the (estimated) distribution of the features, ignoring the class labels and the learned network parameters. It has been noted that these traditional measures fail to explain why largely over-parameterized DNNs generalize well although they are capable of memorizing the whole dataset [14, 31, 21]. The discussion in [32] suggests that a perspective on DNN capacity which also involves the (estimated) relation between $X$ and $Y$ as well as the learned network parameters can lead to more meaningful insights; the methods employed in [14, 15, 23] for explaining and understanding the success of such networks explicitly or implicitly focus on the learned representations for the specific task.

All this hints at the fact that regularization based on properties of intermediate representations can be beneficial. The experiments in [29] also support this claim by arguing that, e.g., geometric clustering of the latent representations is a valid goal for training DNNs for classification. Such a direct design can also ensure compatibility with standard optimization tools used in deep learning, such as gradient-based training methods.

In the recent literature, several regularizers have been proposed that try to instill desired characteristics directly in an intermediate representation (without necessarily being motivated by the IB principle). Reference [33] introduces $I(Z;S)$, where $S$ is some nuisance factor or discriminatory trait, as a regularizer. Minimizing it makes the latent representation and the performance of the DNN invariant to $S$. The authors evaluate this regularization in a variational auto-encoding setup and as an additional term in the variational IB setup of [6]. Depending upon the relation between $X$ and $S$, the term $I(Z;S)$ may also suffer from the issues discussed in Sec. 4.1 or Sec. 4.2 for deterministic DNNs. In [34], the authors regularize the final softmax output, interpreted as a probability distribution over the class labels (cf. Sec. 5.2), by its entropy to penalize overly confident output estimates. The experiments in [35] suggest that back-propagating the classification error or a similar loss function from the output does not lead to latent representations with properties such as discrimination and invariance (which are intuitively similar to the properties discussed in Sec. 3). They propose a "hint penalty" that encourages latent representations to be similar if they correspond to the same class. Similarly, [36] defines the regularization on latent representations as a label-based clustering objective, which is conceptually similar to defining the goal of compression (cf. Sec. 3) in a geometric sense. The authors of [36] discuss the performance of such regularizers for various problems including auto-encoder design, classification, and zero-shot learning.
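As a concrete, purely illustrative sketch of such representation-based regularizers, the snippet below combines an output-entropy confidence penalty in the spirit of [34] with a simple label-based clustering penalty on the latent representation in the spirit of [35, 36]; the trade-off weights and function names are assumptions, not the exact objectives of those works.

```python
import torch
import torch.nn.functional as F

def confidence_penalty(logits: torch.Tensor) -> torch.Tensor:
    """Negative entropy of the softmax output; adding this term to the loss
    discourages overly confident predictions."""
    log_p = F.log_softmax(logits, dim=1)
    return (log_p.exp() * log_p).sum(dim=1).mean()

def clustering_penalty(z: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Mean squared distance of each latent vector to its class centroid,
    a simple geometric stand-in for label-based clustering of the latents."""
    penalty = z.new_zeros(())
    for c in labels.unique():
        zc = z[labels == c]
        penalty = penalty + ((zc - zc.mean(dim=0)) ** 2).sum()
    return penalty / z.shape[0]

def total_loss(logits, z, labels, beta1=0.1, beta2=0.01):
    """Cross-entropy plus the two representation-based regularizers;
    beta1 and beta2 are assumed trade-off weights."""
    return (F.cross_entropy(logits, labels)
            + beta1 * confidence_penalty(logits)
            + beta2 * clustering_penalty(z, labels))
```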

In addition to improved generalization, carefully designed representation-based regularizers may have further advantages. Enhanced adversarial robustness for such networks has been demonstrated in [6]. Such regularization also provides a more flexible data augmentation method for training and inference than the fixed transformations currently applied at the input, e.g., rotations and translations. For example, the authors of [6, 5] sample the noisy bottleneck representation multiple times for each input training example. This data augmentation mechanism also has an advantage over the one introduced in [26]: it obviates the need to train a separate auto-encoder and adapts automatically to the classification task at hand. Regularizers can furthermore be used to enforce privacy guarantees or to ensure insensitivity to transformations such as rotations and translations, an insensitivity that is considered one cause of the superior performance of DNNs [21].
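The latent-space data augmentation mentioned above can be sketched as follows: for each training example, the bottleneck representation is sampled several times under noise and the classification loss is averaged over the samples. The number of samples and the noise level are arbitrary choices here; this is only a schematic variant of the procedure used in [6, 5].

```python
import torch
import torch.nn.functional as F

def mc_bottleneck_loss(encoder, head, x, y, sigma=0.1, num_samples=5):
    """Average the classification loss over several noisy samples of the
    bottleneck representation for the same input -- data augmentation acting
    on the latent space rather than on the input."""
    z = encoder(x)
    losses = []
    for _ in range(num_samples):
        z_noisy = z + sigma * torch.randn_like(z)
        losses.append(F.cross_entropy(head(z_noisy), y))
    return torch.stack(losses).mean()
```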

All of these works and those discussed in Sec. 6 provide empirical evidence that regularizing latent representation(s) is a promising endeavor, improving generalization, robustness, fairness, classifier calibration, and data augmentation. On the one hand, the discussion at the end of Sec. 5 and in [29], together with the empirical evidence from [35, 36, 23], leads us to believe that defining a latent representation regularizer in a geometric sense, in conjunction with noise/stochasticity, is a promising direction for future research. On the other hand, although regularizing latent representations is a key feature of the IB framework, the IB functional fails to instill the desired properties in the latent representations. The success of invertible DNNs, e.g., iRevnet [23], and our analyses suggest that the information-theoretic, compression-based regularization term either becomes obsolete or has to be replaced. Similarly, if one aims to define the latent regularizer via some other information-theoretic cost (e.g., as in [33]), which is often intuitively attractive, it is important to mitigate issues including, but not limited to, the ones discussed in Sec. 4.1, Sec. 4.2, and Sec. 4.3. In contrast, it has been shown (e.g., in [5, 6]) that restricting to a specific prior distribution, together with the approximations used to evaluate the information-theoretic cost, leads to a more direct and intuitive geometric interpretation that can be exploited. Thus, in light of the discussion in this work and in [14, 29, 23], along with the empirical evidence in [35, 36], we conclude that designing regularizers directly with the aim of instilling properties desirable for the intermediate representation (such as those discussed in Sec. 3) may be a more fruitful approach than trying to repair the problems inherent in the IB functional (or other information-theoretic cost functions) in the context of classification.

8 Proof of Theorem 1

We denote vectors by lower-case letters, i.e., we write $x = (x_1, \dots, x_N)$. Moreover, we define the $N$-dimensional cube with side length $\delta$ and bottom-left corner at $x$ as $C_\delta(x) := [x_1, x_1+\delta) \times \cdots \times [x_N, x_N+\delta)$. For example, the RV $\hat{X}_\delta := \delta \lfloor X/\delta \rfloor$, where the floor operation is applied element-wise, is obtained by quantizing $X$ with a quantizer that has quantization bins $C_\delta(\delta k)$, $k \in \mathbb{Z}^N$.

Let $H(\hat{X})$ denote the entropy of the discrete RV $\hat{X}$ with probability mass function $p_{\hat{X}}$ and alphabet $\hat{\mathcal{X}}$, and let

\[ H_2(\hat{X}) := -\log \sum_{\hat{x}\in\hat{\mathcal{X}}} p_{\hat{X}}(\hat{x})^2 \tag{8} \]

denote the Rényi entropy of second order of $\hat{X}$. The correlation dimension of a general RV $X$ is defined as [37]

\[ d_2(X) := \lim_{\delta\to 0} \frac{H_2(\hat{X}_\delta)}{\log(1/\delta)} \tag{9} \]

provided the limit exists. The information dimension $d(X)$ is defined accordingly, with the Rényi entropy of second order replaced by the entropy [38].
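To make definitions (8) and (9) concrete, the following numerical sketch quantizes samples of a two-dimensional RV with side length $\delta$, estimates the Rényi entropy of second order of the quantized RV, and evaluates the ratio $H_2(\hat{X}_\delta)/\log(1/\delta)$, which approaches the correlation dimension $d_2(X) = N = 2$ as $\delta$ shrinks. The sample size and the grid of $\delta$ values are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((200_000, 2))      # samples of a 2-D RV with a density on [0, 1)^2

for delta in [0.2, 0.1, 0.05, 0.02]:
    cells = np.floor(x / delta).astype(np.int64)            # quantization bin of each sample
    _, counts = np.unique(cells, axis=0, return_counts=True)
    p = counts / counts.sum()                               # empirical pmf of the quantized RV
    h2 = -np.log(np.sum(p ** 2))                            # Rényi entropy of second order, cf. (8)
    print(f"delta={delta:5.2f}   H2/log(1/delta) = {h2 / np.log(1/delta):.2f}")
```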

Proof of the Theorem.

The proof consists of four ingredients. Assuming that the distribution of $X$ has a continuous PDF supported on a compact set, we first show that the input $X$ has positive correlation dimension. Then, we show that the correlation dimension remains positive throughout the DNN. Afterwards, we show that the output has positive information dimension, from which it follows that $I(X;Z) = \infty$. Finally, we relax the condition that $X$ has a continuous PDF supported on a compact set and require only that its distribution contains such a component.

We start by assuming that $X$ has a continuous PDF $f_X$ that is supported on a compact set in $\mathbb{R}^N$.

Lemma 1.

Let $X$ be an $N$-dimensional RV with a PDF $f_X$ that is continuous and supported on a compact set in $\mathbb{R}^N$. Then, $d_2(X) = N$.

This result generalizes [37, Th. 3.I.c] to higher-dimensional RVs.

Proof.

Since $f_X$ is continuous, so is its square $f_X^2$. Since both $f_X$ and $f_X^2$ are continuous and supported on a compact set, they are Riemann integrable. Hence, the differential Rényi entropy of second order,

\[ h_2(X) := -\log \int_{\mathbb{R}^N} f_X(x)^2 \,\mathrm{d}x \tag{10} \]

exists. We can sandwich the integral in (10) by its lower and upper Darboux sums, i.e., we can write

\[ \sum_{k\in\mathbb{Z}^N} \delta^N \inf_{x\in C_\delta(\delta k)} f_X(x)^2 \;\le\; \int_{\mathbb{R}^N} f_X(x)^2 \,\mathrm{d}x \;\le\; \sum_{k\in\mathbb{Z}^N} \delta^N \sup_{x\in C_\delta(\delta k)} f_X(x)^2. \]

Note further that by the mean value theorem we can find $\xi_k \in C_\delta(\delta k)$ such that

\[ \Pr\big(\hat{X}_\delta = \delta k\big) = \int_{C_\delta(\delta k)} f_X(x)\,\mathrm{d}x = \delta^N f_X(\xi_k). \tag{11} \]

This allows us to write $H_2(\hat{X}_\delta)$ as

\[ H_2(\hat{X}_\delta) = -\log \sum_{k\in\mathbb{Z}^N} \delta^{2N} f_X(\xi_k)^2 = N\log(1/\delta) - \log \sum_{k\in\mathbb{Z}^N} \delta^N f_X(\xi_k)^2. \tag{12} \]

Since $f_X(\xi_k)^2$ lies between the infimum and the supremum that $f_X^2$ can assume on the cube $C_\delta(\delta k)$, we obtain