## Introduction

A major area of research seeks to understand
deep neural networks'
remarkable ability to generalize to unseen examples.
One promising research direction is to view deep neural networks through the lens of information
theory [tishbydeep]. Abstractly, deep connections exist between the information
a learning algorithm extracts and its generalization
capabilities [littlebits, bayesianbounds]. Inspired by these general results,
recent papers have attempted to measure information-theoretic quantities in ordinary
deterministic neural networks [blackbox, emergence, whereinfo].
Both practical and theoretical problems arise
in the deterministic case [hownot, saxe, brendan].
These difficulties stem from the fact that
mutual information (MI) is reparameterization independent [coverthomas].[^1]

[^1]: If we send a random variable through an invertible function, its MI with respect to any other variable remains unchanged.
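This invariance is easy to check numerically in the jointly Gaussian case, where MI has a closed form. The sketch below (a toy illustration, not part of the paper's setup) applies an invertible linear map to one of the variables and confirms the MI is unchanged:

```python
import numpy as np

def gaussian_mi(cov, dx):
    """I(X;Y) for jointly Gaussian (X,Y) with joint covariance `cov`,
    where X occupies the first dx coordinates."""
    sx = cov[:dx, :dx]
    sy = cov[dx:, dx:]
    return 0.5 * np.log(np.linalg.det(sx) * np.linalg.det(sy)
                        / np.linalg.det(cov))

rng = np.random.default_rng(0)
dx, dy = 2, 2

# random positive-definite joint covariance
m = rng.normal(size=(dx + dy, dx + dy))
cov = m @ m.T + 0.1 * np.eye(dx + dy)

# invertible linear reparameterization X -> A X
a = rng.normal(size=(dx, dx)) + 2.0 * np.eye(dx)
t = np.eye(dx + dy)
t[:dx, :dx] = a
cov2 = t @ cov @ t.T

# the two MIs agree up to floating-point error
print(gaussian_mi(cov, dx), gaussian_mi(cov2, dx))
```

The determinant of the transformed block picks up a factor of det(A)^2 that cancels between numerator and denominator, which is why the MI is unchanged.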

One workaround is to make a network explicitly stochastic, either in its activations [vib] or its weights [emergence]. Here we take an alternative approach, harnessing the stochasticity in our choice of initial parameters. That is, we consider an *ensemble* of neural networks, all trained with the same training procedure and data. This generates an ensemble of predictions, and characterizing the generalization properties of the ensemble should characterize the generalization of individual draws from this ensemble. Infinitely-wide neural networks behave as if they are linear in their parameters [widelinear]. Their evolution is fully described by the *neural tangent kernel* (NTK). The NTK is constant in time and can be tractably computed [neuraltangents]. For our purposes, it can be considered a function of the network's architecture: the number and structure of layers, the nonlinearity, the initial parameters' distributions, etc. All told, the output of an infinite ensemble of infinitely-wide neural networks, initialized with Gaussian weights and biases and trained with gradient flow to minimize a square loss, is simply a conditional Gaussian distribution:

$$ z \mid x \sim \mathcal{N}\big(\mu(x),\, \Sigma(x)\big), $$

where $z$ is the output of the network and $x$ is its input. The mean $\mu$ and covariance $\Sigma$ functions can be computed [neuraltangents]. For more background on the NTK and NNGP, as well as the full forms of $\mu$ and $\Sigma$, see sec:ntk. This simple form allows us to bound several interesting information-theoretic quantities, including: the MI between the representation and the targets ($I(Z;Y)$, sec:izy), the MI between the representation and the inputs after training ($I(Z;X)$, sec:izx), and the MI between the representations and the training set, conditioned on the input ($I(Z;\mathcal{D} \mid X)$, sec:izd). We are also able to compute in closed form: the Fisher information metric (sec:fisher), the distance the parameters move (sec:dist), and the MI between the parameters and the data ($I(\Theta;\mathcal{D})$, sec:itd). Because infinitely-wide neural networks are linear in their parameters, their information geometry in parameter space is very simple. The Fisher information metric is constant and flat, so the trace of the Fisher does not evolve as it does in [whereinfo]. While the Euclidean distance the parameters move is small [widelinear], the distance they move according to the Fisher metric is finite. Finally, the MI between the data and the parameters tends to infinity, rendering PAC-Bayes-style bounds on generalization vacuous [emergence, bayesianbounds, littlebits].
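The Gaussianity of the ensemble output follows from linearity in the parameters. The toy sketch below illustrates this with a hypothetical fixed feature map `phi` standing in for a network's linearization at initialization (the real NTK computation for actual architectures is what neuraltangents performs): with $\theta \sim \mathcal{N}(0, \sigma_w^2 I)$, the outputs of the ensemble are jointly Gaussian with covariance $K(x, x') = \sigma_w^2\, \phi(x)^\top \phi(x')$, which a Monte-Carlo ensemble reproduces.

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(x):
    # hypothetical fixed feature map standing in for the network's
    # linearization around its initialization
    return np.stack([x, x**2, np.sin(x)], axis=-1)

xs = np.array([-1.0, 0.3, 2.0])
features = phi(xs)                      # (3 inputs, 3 features)
sigma_w = 0.7

# closed form: zero-mean Gaussian with K(x,x') = sigma_w^2 phi(x).phi(x')
k_analytic = sigma_w**2 * features @ features.T

# Monte-Carlo ensemble of linear-in-parameters models
# f(x; theta) = theta . phi(x), theta ~ N(0, sigma_w^2 I)
thetas = rng.normal(scale=sigma_w, size=(200_000, 3))
outputs = thetas @ features.T
k_empirical = np.cov(outputs, rowvar=False, bias=True)

print(np.max(np.abs(k_analytic - k_empirical)))  # small sampling error
```

Because gradient-flow training of a linear-in-parameters model is itself a linear operation on $\theta$, the trained ensemble stays Gaussian at every time, which is what makes the quantities above tractable.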

## Experiments

For jointly Gaussian data (inputs $X$ and targets $Y$), the Gaussian Information Bottleneck [gaussib] gives an exact characterization of the optimal tradeoff between $I(Z;X)$ and $I(Z;Y)$, where $Z$ is a stochastic representation, $p(z \mid x)$, of the input. Below we fit infinite ensembles of infinitely-wide neural networks to jointly Gaussian data and measure estimates of these mutual informations. This allows us to assess how close to optimal these networks perform. The jointly Gaussian dataset we created is described in detail in sec:gaussian. We trained a three-layer fully-connected network with each of two activation functions. fig:loss_vs_time_gauss shows the test set loss as a function of time for different choices of initial weight variance $\sigma_w^2$. For both activation functions, at the highest $\sigma_w^2$ shown (darkest purple), the networks *underfit*. For lower initial weight variances, they all show signs of *overfitting*, in the sense that the networks would benefit from early stopping. This overfitting is worse for one of the two nonlinearities, where we see a divergence in the final test set loss as $\sigma_w^2$ decreases. For all of these networks the training loss goes to zero.
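In the scalar special case of the Gaussian Information Bottleneck, the optimal frontier has a simple closed form, $I(Z;Y) = -\tfrac{1}{2}\log\!\big(1 - \rho^2(1 - e^{-2 I(Z;X)})\big)$ for correlation $\rho$ between $X$ and $Y$. The sketch below (a one-dimensional toy, not the paper's multivariate dataset) checks that the optimal family $Z = X + \varepsilon$ lands exactly on this curve:

```python
import numpy as np

rho = 0.8  # corr(X, Y); X, Y standard normal (toy stand-in dataset)

def frontier(i_zx):
    """Optimal I(Z;Y) at complexity I(Z;X) for the scalar Gaussian IB."""
    return -0.5 * np.log(1.0 - rho**2 * (1.0 - np.exp(-2.0 * i_zx)))

# in 1-D the optimal representations are Z = X + eps, eps ~ N(0, s2)
for s2 in [0.1, 1.0, 10.0]:
    i_zx = 0.5 * np.log(1.0 + 1.0 / s2)        # I(Z;X) for additive noise
    corr_zy2 = rho**2 / (1.0 + s2)             # corr(Z, Y)^2
    i_zy = -0.5 * np.log(1.0 - corr_zy2)       # I(Z;Y)
    print(s2, i_zx, i_zy, frontier(i_zx))      # i_zy lies on the frontier
```

Sweeping the noise variance traces out the whole frontier, which is the red curve the information-plane figures compare against.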

In fig:inf_plane_gauss we show the performance of these networks on the information plane. The horizontal axis shows a variational lower bound on the complexity of the learned representation, $I(Z;X)$. The vertical axis shows a variational lower bound on the learned relevant information, $I(Z;Y)$. For details on the calculation of the MI estimates, see sec:info. The curves show trajectories of the networks' representations as time varies from $t = 0$ to $t \to \infty$ for different weight variances (the bias variance in all networks was fixed to 0.01). The red line is the optimal theoretical IB bound. There are several features worth highlighting. First, we emphasize the somewhat surprising result that, as time goes to infinity, the MI between an infinite ensemble of infinitely-wide neural networks' output and their input is finite and quite small. Even though every individual network provides a seemingly rich deterministic representation of the input, when we marginalize over the random initialization, the ensemble compresses the input quite strongly. The networks overfit at late times. For one family of networks, the more complex representations (larger $I(Z;X)$) overfit more. With optimal early stopping, over a wide range of $\sigma_w^2$, these models achieve a near-optimal trade-off between prediction and compression. Varying the initial weight variance controls the amount of information the ensemble extracts.
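The relevance coordinate rests on the standard variational lower bound $I(Z;Y) \ge H(Y) + \mathbb{E}[\log q(y \mid z)]$ for any decoder $q$. A minimal sketch, assuming scalar jointly Gaussian $Z$ and $Y$ and a hypothetical Gaussian decoder $q(y \mid z) = \mathcal{N}(az, v)$ (so every expectation is available in closed form), shows the bound is tight for the optimal decoder and falls below the true MI otherwise:

```python
import numpy as np

r = 0.6  # corr(Z, Y); both standard normal

def decoder_bound(a, v):
    """Variational lower bound I(Z;Y) >= H(Y) + E[log q(y|z)]
    for the Gaussian decoder q(y|z) = N(a*z, v)."""
    mse = 1.0 - 2.0 * a * r + a**2           # E[(Y - a Z)^2]
    h_y = 0.5 * np.log(2 * np.pi * np.e)     # entropy of Y ~ N(0, 1)
    e_log_q = -0.5 * np.log(2 * np.pi * v) - mse / (2 * v)
    return h_y + e_log_q

true_mi = -0.5 * np.log(1 - r**2)
tight = decoder_bound(r, 1 - r**2)   # optimal decoder: bound is tight
loose = decoder_bound(0.3, 1.0)      # suboptimal decoder: strictly below
print(true_mi, tight, loose)
```

In the experiments the decoder is learned rather than known, so the plotted coordinate is a conservative estimate of the true relevant information.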

Next, we repeat the experiment of the previous section on the MNIST dataset [mnist]. Unlike the usual setup, we turn MNIST into a binary regression task for the parity of the digit (even or odd). The network this time is a standard two-layer convolutional neural network, again with each of the two activation functions. fig:mnist shows the results. Unlike in the jointly Gaussian dataset case, here both networks show some region of initial weight variances that does not overfit, in the sense of demonstrating no advantage from early stopping. One of the networks does show overfitting at low initial weight variances, but the other does not. Notice that in the information plane, that network shows overfitting at higher representational complexities ($I(Z;X)$ large), while the other does not.

## Conclusion

Infinite ensembles of infinitely-wide neural networks provide an interesting model family. Being linear in their parameters, they permit tractable calculation of many information-theoretic quantities and their bounds. Despite their simplicity, they can still achieve good generalization performance [cando]. This challenges existing claims about the purported connections between information theory and generalization in deep neural networks. In this preliminary work, we laid the groundwork for a larger-scale empirical and theoretical study of generalization in this simple model family. Given that real networks approach this family in their infinite-width limit, we believe a better understanding of generalization in the NTK limit will shed light on generalization in deep neural networks.
