## 1 Introduction

Information theory acts as a powerful complement to traditional statistical machine learning theory. It provides bounds that are agnostic to parameter count, yields interpretable results, and connects physically to several fields of application. However, the rigorous study of information-theoretic properties relating to the architecture of neural networks remains somewhat immature. In this paper we study, for a large family of architectures, the maximum mutual information (MMI) that the network's hidden representation can hold about an input feature space $X$ with distribution $P_X$. That is, we study the supremum $\sup_{\theta \in \Theta} I(X; T_\theta)$, where $\Theta$ is the parameter space of our architecture, $T_\theta$ is the random variable denoting the output of our final hidden layer (our representation), and $I(X; T_\theta)$ is the mutual information between the representation and the input.

The reason for studying this quantity is the prevalence of the following concept in Information-Theoretic Machine Learning Theory (ITMLT): $I(X;T)$ acts as a data-sensitive measure of complexity for the representation. Indeed, we can often obtain bounds on machine learning performance that depend strongly on $I(X;T)$. In this sense, the MMI is a data-sensitive measure of complexity for the network itself. Its study can thus give us insight into why certain architectures perform well on certain datasets, and may be useful for finding architectures that will work well with new datasets.

In a much more practical sense, ITMLT establishes the existence of a "best" value of $I(X;T)$, one which maximizes the potential of obtaining a good representation. While this target always exists, it typically will not be known entirely. A sub-branch of this field argues that neural networks will attempt to find this target naturally through training [shwartzziv2017opening; tishby2015deep; achille2018emergence], but there is no guarantee that the training process will succeed. It is, however, often possible to approximate this target with an upper bound. We can then choose our architecture to enforce this upper bound naturally via its maximum mutual information, significantly reducing the search space for the target in the training process.

While the capacity of neural networks has been analytically studied in a different sense (i.e., via storage of "patterns") [mackay2003information, chapter 5; prados1989neural], and numerical methods for estimating mutual information in deep networks exist [gabrie2018entropy; paninski2003estimation; bahl1986maximum; pmlr-v80-belghazi18a; hjelm2018learning; gao2017estimating], no analytical studies of this type exist in the literature.

## 2 Notation, Background, and Motivation

Our primary motivation for studying maximum mutual information comes from modern theoretical work in ITMLT, which studies the classification of a discrete variable $Y$ from an input $X$, jointly distributed as $P_{XY}$. The theory studies the potential losses in the quality of a learning machine's representation $T$ of its input when trained from just a sample of data.

A natural quality measure for a representational variable $T$ is its mutual information $I(Y;T)$ with the classification variable [cover2012elements]. This is a measure of how well $T$ predicts the class variable $Y$, and is a component in the objective functions of several deep learning methods [chen2016infogan; foggo2019improving; alemi2016deep; kolchinsky2019nonlinear; bang2019explaining]. We can then conceive of a "best possible" representation as the one that maximizes $I(Y;T)$ when the natural distribution $P_{XY}$ is known perfectly; we denote it $T^*$. Of course, we typically do not know $P_{XY}$ perfectly, but we can still obtain a representation $\hat{T}$ from our samples. If we constrain both variables, $T^*$ and $\hat{T}$, to have a fixed mutual information $r$ with $X$ (i.e., $I(X;T^*) = I(X;\hat{T}) = r$), then we have the following bound on the quality loss of $\hat{T}$ against $T^*$ [Foggo_2019]:

(1) |

Note that the constraint $I(X;T) = r$ is a traditional choice first made in the early literature on the Information Bottleneck method [tishby2000information; slonim2000agglomerative; friedman2013multivariate], a method which helped spark the existence of ITMLT. The constraint exists to normalize the two representations to a fixed level of "forgetfulness" with respect to the input. The under-braces on the right hand side of this inequality, implying that one term is data dependent and the other is architecture dependent in a decoupled way, are justified by the work introducing the bound [Foggo_2019]. In that work, the data-dependent term was upper bounded in a way that did not depend on the architecture's complexity, by terms whose order involves only a small integer and the size of the training sample. Further research has verified similar data-dependent bounds [br2019analyzing]. Since these bounds do not depend on architectural complexity, all of the dependence on architecture must be contained in the other term.

On the other side of the coin, we have strong data processing inequalities [polyanskiy2017strong; polyanskiy2015dissipation], such as $I(Y;T) \le \rho_m^2(X,Y)\, I(X;T)$, where $\rho_m(X,Y)$ is the maximal correlation coefficient of $X$ and $Y$, i.e., $\rho_m(X,Y) = \sup_{f,g} \mathbb{E}[f(X)g(Y)]$, the supremum being taken over all functions $f$ and $g$ with $\mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0$ and $\mathbb{E}[f^2(X)] = \mathbb{E}[g^2(Y)] = 1$.

A qualitative combination of these two contrasting inequalities can be summarized as follows: $I(Y;T)$ increases quickly with $r = I(X;T)$ up to its maximum of $I(X;Y)$ [by the strong data processing inequality], and our estimate follows this trend due to the low information losses at low $r$. But eventually the risk of information loss in inequality (1) forces our best estimate of $I(Y;\hat{T})$ to decrease linearly from this maximum value. The location of this behavioral change is problem-specific. But since the relevant quantities can be estimated from a small sample of data, these inequalities can be used to approximate a target value of $r$, yielding an interval in which it is likely to be contained. Setting our architecture to have a maximum mutual information near the supremum of this estimated interval is therefore desirable.

## 3 The MMI Bottleneck and Parallel Structures

We begin with a key takeaway in terms of series and parallel components. The MMI of channels composed in series is the smallest MMI among them, and the MMI of channels operating in parallel is the sum of the individual MMI values. One consequence of this takeaway is that the information-theoretic properties of fully connected architectures are strongly dominated by the dimension of the smallest layer.

We will often be able to identify a dominant structural parameter that limits the MMI of a series calculation. When we identify this parameter, we will call it the *MMI bottleneck*. In a fully connected network, it is given by the dimension of the smallest hidden layer. Special focus should be given to the MMI bottleneck when attempting to control the network's maximum mutual information.
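As a toy illustration of the two composition rules above, consider the following sketch; the numeric MMI values are hypothetical, chosen only for illustration.

```python
# Hypothetical per-channel MMI values (in nats), for illustration only.
series_mmi = [4.0, 1.5, 3.25]    # stages composed in series
parallel_mmi = [0.75, 1.25, 0.5]  # channels operating in parallel

# Series: information must pass through every stage, so the chain is
# limited by its tightest stage (the MMI bottleneck).
mmi_series = min(series_mmi)

# Parallel: independent channels contribute additively.
mmi_parallel = sum(parallel_mmi)

print(mmi_series)    # 1.5
print(mmi_parallel)  # 2.5
```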

## 4 Single Layer Linear Networks

We will first study networks consisting of a single layer, and with no activation function. While this is a highly specialized case, the results and methods of obtaining those results generalize quite well to other cases. This section will consider both fully connected architectures and convolutional architectures.

### 4.1 Fully Connected Case

We begin by deriving the MMI of a linear network with a standardized Gaussian input. This is a highly specialized case, but we will see that it generalizes quite nicely to a large family of architectures, including convolutional architectures and architectures with ReLU activation functions.

We consider the constrained problem in which the weight matrices are constrained in Frobenius norm. We will see that the maximum mutual information of this family of architectures is discontinuous in the Frobenius norm constraint, with at most finitely many points of discontinuity. We will thus first specialize to the case where our Frobenius norm constraint is larger than the largest discontinuity point before moving to the more general scenarios.

###### Theorem 1.

Let be a positive definite matrix and let . Let and be natural numbers representing the input and hidden dimensions. Let denote the Gaussian distribution with mean and covariance matrix . Let , , , , where , is the identity matrix in dimensions, and is the bias vector. Let . Let denote the diagonal matrix containing the largest eigenvalues of . Let denote the smallest eigenvalue of , and let . Let , , and define . Then:

(2) |

###### Proof.

Since $X$ is Gaussian and the network is linear, the representation is Gaussian for all parameter values. Thus, we can express the mutual information in terms of log-determinants of covariance matrices. Now, by the matrix determinant lemma, the determinant depends on the weight matrix only through its inner-product matrix, and so we can condense the dependence of the MMI optimization problem on the weights to obtain:

(3) |

where . Due to the positive definiteness of the covariance and Hadamard's inequality, we can cast this constrained maximization problem into the realm of eigenvalues, since the optimal matrix will be diagonal and will therefore have the same eigenvectors as the covariance matrix:

(4) |

where is the largest eigenvalue of , and is the un-ordered eigenvalue of . The final constraint comes from the fact that is only rank . Furthermore, it must be the case that the mandatory zero-valued eigenvalues of , when they exist (), must be placed on the indices , as these correspond to the largest values of . Indeed, suppose that we have placed a nonzero eigenvalue on one of these indices (without loss of generality, say index , and that we set this eigenvalue to ) in such a way that all of the constraints are met. Then we must have placed a zero-valued eigenvalue on another index (which we will denote, without loss of generality, as ). Then the objective function can be increased without violating any constraints by taking units of eigenvalue off of one index and placing them on the other, and so this cannot be a solution to our optimization problem. To see this, observe that:

(5) |

We are thus left with the following optimization problem:

(6) |

This is a classic 'water-filling' problem with heights given by scaled versions of the inverses of the first eigenvalues of the covariance. Thus, for a given 'water level', a solution is readily available, given by [optimality]. However, finding the relationship between the water level and the constraint requires additional work, as the solution must satisfy [consistency]. We will show that our assumption yields a consistent solution in which all maxima of the optimality equation are obtained in the second argument. To see this, note that under such a solution, the consistency equation yields a water level which coincides with the optimality equation, since the required inequality holds at each index. Thus this solution holds and, in all, we have an MMI of:

(7) |

completing the proof. ∎
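The determinant identity invoked at the start of the proof can be checked numerically. The sketch below uses assumed names (`W` for the weight matrix, `S` for the input covariance) and arbitrary dimensions; it verifies the Weinstein-Aronszajn form of the matrix determinant lemma, which is what lets the objective depend on the weights only through an inner-product matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 3, 5                       # hidden and input dimensions (arbitrary)
W = rng.standard_normal((k, n))   # a stand-in weight matrix
A = rng.standard_normal((n, n))
S = A @ A.T + np.eye(n)           # a positive definite covariance

# det(I_k + W S W^T) = det(I_n + S W^T W): the left side involves the
# k x k output covariance, the right side only the inner-product W^T W.
lhs = np.linalg.det(np.eye(k) + W @ S @ W.T)
rhs = np.linalg.det(np.eye(n) + S @ W.T @ W)
print(np.isclose(lhs, rhs))  # True
```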

Some takeaways from Theorem 1 are now in order. First, there is an MMI bottleneck given by the minimum of the input dimension and the hidden dimension. Second, the hidden dimension controls the number of principal components used to maximize the mutual information. Third, the largest discontinuity point is dominated by the difference between the *largest* reciprocal eigenvalue and the *average* reciprocal eigenvalue of those principal components that are used. We also see that smaller principal components are removed first. Furthermore, if the constraint is large, we obtain an approximation in terms of the average reciprocal eigenvalue of the components that are used.
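The water-filling solution referenced in the proof can be sketched numerically. The snippet below is illustrative rather than the paper's exact notation: it assumes unit noise variance `beta2`, eigenvalues `lam` of the input covariance, and a budget `a` standing in for the Frobenius norm constraint, and solves for the water level by bisection.

```python
import numpy as np

def water_fill(lam, a, beta2=1.0, iters=100):
    """Maximize sum(0.5*log(1 + d_i*lam_i/beta2)) s.t. d >= 0, sum(d) = a."""
    lam = np.sort(np.asarray(lam, float))[::-1]  # eigenvalues, descending
    lo, hi = 0.0, a + beta2 / lam.min() + 1.0    # bracket the water level
    for _ in range(iters):
        nu = 0.5 * (lo + hi)
        d = np.maximum(nu - beta2 / lam, 0.0)    # fill above each 'height'
        if d.sum() > a:
            hi = nu
        else:
            lo = nu
    d = np.maximum(0.5 * (lo + hi) - beta2 / lam, 0.0)
    mmi = 0.5 * np.sum(np.log1p(d * lam / beta2))
    return d, mmi

# Smaller principal components are dropped first as the budget shrinks.
lam = [4.0, 2.0, 0.5]
d_small, _ = water_fill(lam, a=0.1)   # only the top component is active
d_large, _ = water_fill(lam, a=10.0)  # all components are active
print(d_small)
print(d_large)
```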

We now move on to the remaining discontinuity points, which are defined in the following lemma.

###### Lemma 1.

Take all of the assumptions from Theorem 1 except for the assumption that . Let the largest eigenvalue of be denoted by . Let be a natural number, , and let denote the diagonal matrix containing the largest eigenvalues of . Now, let . Then:

(8) |

Proofs of lemmas can be found in the supplementary material accompanying this paper.

Note that each discontinuity point is calculated in the same way as the largest one, but with successive removals of the smallest principal components from our dataset. With all of the discontinuities defined, we can calculate the maximum mutual information for the case when our Frobenius norm constraint is contained in any of the corresponding intervals of continuity.

###### Theorem 2.

###### Proof.

We can follow the proof of Theorem 1 up until the optimization problem given by (6), whose solution will now be different because the constraint has changed. Again, we need to find a solution consistent with [optimality] and [consistency]. We claim that our assumption yields a consistent solution in which the eigenvalues are zero for the excluded indices and nonzero otherwise. Under such a solution, the consistency equation determines the water level. We will show that this coincides with the optimality equation. First, define . Then, if , we have:

(10) |

On the other hand, if , then:

(11) |

And so optimality is achieved. Under this solution, the objective function value is given by:

(12) |

Thus, in all, we have that MMI is given by:

(13) |

where the factors of and the term have come from equation (3). Finally, the only factors remaining in the second term post-cancellation are the eigenvalues of the covariance with indices smaller than or equal to the cutoff, transforming that term into what is presented in equation (9). This completes the proof. ∎

Theorem 2 is a straightforward generalization of Theorem 1. The only additional insight is that the Frobenius norm constraint acts to remove the smallest principal components from our maximum mutual information calculation. This role is similar to that of the hidden dimension.

The MMI calculations in Theorems 1 and 2 will be seen to be very important: nearly every other case is a generalization of these two theorems! We have plotted some sample MMI curves as a function of the Frobenius norm for this family of architectures in Figure 1.

### 4.2 Convolutional Case

We will now discuss the case where our fully connected layer is replaced with a convolutional layer. We will keep the Gaussian data assumption and the linear activation assumption. For notational simplicity, we will also assume that our convolutional calculations involve non-overlapping strides, and that we have translation-invariant statistics in our dataset. These assumptions can be dropped without affecting the insight that follows.

###### Theorem 3.

Let be a positive definite matrix and let . Let , , and be natural numbers such that divides . Let , , , where and , with denoting the slice of on indices through . Thus is a convolution applied to a vectorized input with non-overlapping stride and filters. Suppose is block diagonal with identical blocks (translation-invariant statistics) given by the matrix . Let denote the maximum mutual information of the linear fully connected network given by Theorems 1 and 2. Let be fixed and let . Then .

###### Proof.

We can view the output of the convolution as a matrix product of and , where is a block diagonal matrix in which every nonzero block is the matrix . We can then follow the proof of Theorem 1 until the matrix determinant lemma step, with this block diagonal matrix in place of the weight matrix. From here, we can further factor as follows by noting that both matrices involved are block diagonal:

(14) |

from which we are back to the original optimization problem in the proofs of Theorems 1 and 2, just multiplied by a factor equal to the number of blocks. ∎

We see that each convolution window acts as a parallel calculation, and the per-window MMI values are therefore summed in the overall MMI calculation. Furthermore, each convolution operation has an MMI bottleneck given by the minimum of the (vectorized) block size and the number of output channels. Note that the principal components used in the calculation correspond only to features that occur within a single convolution window.
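A minimal sketch of this parallel-sum structure, with hypothetical dimensions and a hypothetical per-window MMI value:

```python
# Under the assumed setup of Theorem 3: an input of dimension n split
# into n // s non-overlapping windows of size s, with block-diagonal
# (translation-invariant) covariance.  Each window is an independent
# parallel channel, so the convolutional MMI is the per-window fully
# connected MMI summed over the n // s windows.
def conv_mmi(per_window_mmi: float, n: int, s: int) -> float:
    assert n % s == 0, "the stride must divide the input dimension"
    return (n // s) * per_window_mmi

# e.g. a 12-dimensional input with window/stride 4 and a (hypothetical)
# per-window MMI of 0.75 nats:
print(conv_mmi(0.75, n=12, s=4))  # 2.25
```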

## 5 Single Layer Relu Networks

We will now move on to studying what happens to the MMI values when we place ReLU activations on the hidden layers. The answer is quite nice: nothing changes at all! Thus we can take all of the insight from the previous sections and apply it to ReLU networks. Unfortunately, it takes quite a bit of setup and rigor to prove this fact.

We can provide some insight into the rigor that follows before diving into the proofs. Essentially, we will show that the mutual information between the input and the representation in a ReLU network is always bounded above by that of a corresponding linear network. However, we will be able to construct a sequence of ReLU networks whose marginal distributions on the representation converge (weakly) to that of the maximum mutual information solution of the linear architectures. All we will need then is some notion of continuity of mutual information with respect to these marginals, and we will have our proof. This penultimate step is taken care of primarily in Lemma 3.

###### Lemma 2.

Take all of the assumptions of either Theorem 1, Theorem 2, or Theorem 3. Let be fixed and define through a new model given by for the fully connected case, or for the convolutional case. Let denote the covariance matrix of as defined in Theorem 1 and let denote the covariance matrix of . Then:

(15) |

###### Lemma 3.

Take all of the assumptions of either Theorem 1 or Theorem 2, and all definitions from Lemma 2. Denote the marginal probability laws of and as and , with densities denoted and . Let be the set of Borel measurable sets on , and let denote the total variation distance between and . That is, . Finally, let denote the binary entropy function. Let . Then for all such that , there exists a non-negative function which is continuous in from the right at , has , and:

(16) |

Proofs of lemmas can be found in the supplementary material accompanying this paper.

###### Theorem 4.

###### Proof.

First, we note that, for all , we have:

(18) |

where the inequality follows from Lemma 2. It follows immediately that . We now show that is an achievable value for given the constraint . Fix . Then given any value of , we can set large enough in each dimension such that is less than for all satisfying . When this is the case, the total variation between and is also bounded above by . Then by Lemma 3, there exists such that:

(19) |

for all satisfying this constraint, and where the right hand side of this inequality is a continuous function of and satisfies . Thus we can achieve for all satisfying the constraint. Since can be made arbitrarily small via continuity, we can achieve:

(20) |

Inputting the MMI-achieving matrix from Theorems 1 and 2 into (20) yields the result. ∎

We see that all of the insights that we obtained for linearly activated networks hold for ReLU-activated networks as well.

## 6 Single Layer Fully Connected Networks with Bijective Activation Functions

We will now move on to deriving the MMI for one final family of architectures: single layer networks with bijective activation functions. This family includes sigmoidal activations, tanh activations, SELU activations, and more. Conveniently, we once again find ourselves looking back to the linear case for its calculation.

###### Theorem 5.

Let denote the pre-activated representation variable of a single layer neural network. Let be a bijective activation whose log-derivative has finite expectation, and let the representation be given by . Then:

(21) |

###### Proof.

This follows immediately from the following set of equalities:

(22) |

where is the determinant of the Jacobian matrix of . ∎
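Written out explicitly under the notation assumed here ($T$ the pre-activated representation, $f$ the bijective activation with Jacobian $J_f$, and $h$ differential entropy), the change-of-variables terms cancel:

```latex
\begin{align*}
I(X; f(T)) &= h(f(T)) - h(f(T) \mid X) \\
           &= \bigl(h(T) + \mathbb{E}\log\lvert\det J_f(T)\rvert\bigr)
            - \bigl(h(T \mid X) + \mathbb{E}\log\lvert\det J_f(T)\rvert\bigr) \\
           &= h(T) - h(T \mid X) \\
           &= I(X; T).
\end{align*}
```

The finite-expectation assumption on the log-derivative is exactly what makes the two expectation terms finite, so that their cancellation is legitimate.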

From this theorem we can immediately see that if we take the linear case, place our noise injection on the pre-activated variable, and then apply the bijective activation, we will have the same MMI as we did in the linearly activated case.

## 7 Multilayer Fully Connected Linear, Relu, and Bijective Networks

We finally move on to the multi-layer case for linear and ReLU fully connected networks.

###### Theorem 6.

Take all assumptions and definitions from the previous theorems corresponding to a fully connected network (linear or ReLU), but assume that we are using a multi-layer neural network instead of a single layer network, with the noise placed on the final layer. Let denote the number of hidden units in each layer. Redefine to be the minimum over the layer dimensions. Then the results of those previous theorems hold.

###### Proof.

In the linear case, we can take the proof of Theorem 1 and replace the single weight matrix with the product of the layers' weight matrices (the biases have no effect on the mutual information). We need only note that the corresponding inner-product matrix has rank as redefined in this theorem's hypothesis. In the ReLU case, we can follow the exact sequence of steps that were performed in the single layer case, noting that we can obtain the marginal total variation bound (for any fixed constraint) by fixing each bias to be large enough that sufficiently small amounts of marginal probability are contained in the saturated regions of each layer. ∎

We see that all of our previous insights for these families of fully connected networks hold. However, a new MMI bottleneck can be identified: the dimension of the smallest layer of the network.
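This bottleneck rule can be sketched as follows (the layer sizes are hypothetical, chosen only for illustration):

```python
# For a fully connected linear/ReLU network, the MMI bottleneck is the
# smallest dimension information must pass through: the minimum over
# the input dimension and all hidden layer dimensions.
def mmi_bottleneck(input_dim: int, hidden_dims: list[int]) -> int:
    return min([input_dim] + hidden_dims)

print(mmi_bottleneck(64, [128, 16, 32]))  # 16
```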

## 8 Conclusion

We have rigorously derived the maximum mutual information for a large number of neural network architectures, and provided several insights along the way. Nearly every case generalizes from Theorems 1 and 2. All of the studied fully connected single layer architectures have the exact same MMI expressions as those studied in Theorems 1 and 2, and single layer convolutional architectures require only a small adjustment to those calculations (Theorem 3). Multi-layer networks generalize from these as well, but with the primary architectural parameter given by the smallest hidden dimension in the network. Thus great care should be given to the design of this layer when attempting to control the network's mutual information.

While we have not provided MMI calculations for every existing setup (doing so in one paper would be nearly impossible given the ever-expanding set of neural architectures in existence), we hope that these calculations and insights provide enough detail to approximate any architecture the reader may be interested in studying, and that the proofs provided are generalizable enough for when an approximation is not enough. More architectures are the subject of future work. Of particular future interest are the inherently lossy dropout mechanisms, pooling strategies, normalization, and skip layers.

## References

## 9 Proofs of Lemmas

We repeat the theorem statements (without their proofs) alongside these lemmas to minimize the amount of back-and-forth jumping the reader must perform, as we refer to the theorems several times in the lemmas.

###### Theorem 7.

Let be a positive definite matrix and let . Let and be natural numbers representing the input and hidden dimensions. Let denote the Gaussian distribution with mean and covariance matrix . Let , , , , where , is the identity matrix in dimensions, and is the bias vector. Let . Let denote the diagonal matrix containing the largest eigenvalues of . Let denote the smallest eigenvalue of , and let . Let , , and define . Then:

(23) |

###### Lemma 4.

Take all of the assumptions from Theorem 7 except for the assumption that . Let the largest eigenvalue of be denoted by . Let be a natural number, , and let denote the diagonal matrix containing the largest eigenvalues of . Now, let . Then:

(24) |

###### Proof.

First, is zero since:

(25) |

Next, we note that the difference is given by:

(26) |

completing the proof. ∎

###### Theorem 8.

###### Theorem 9.

Let be a positive definite matrix and let . Let , , and be natural numbers such that divides , Let , , , Where and with denoting the slice of on indices through . Thus is a convolution applied to a vectorized input with non-overlapping stride and filters. Suppose is block diagonal with identical blocks (translation invariant statistics) given by the matrix . Let denote the maximum mutual information of the linear fully connected network given by Theorems 7 and 8. Let be fixed and let
