1 Introduction
Although Deep Neural Networks (DNNs) are at the core of most state-of-the-art systems in computer vision, the theoretical understanding of such networks is still not at a satisfactory level (Shwartz-Ziv & Tishby, 2017). In order to provide insight into the inner workings of DNNs, the prospect of utilizing the Mutual Information (MI), a measure of the dependency between two random variables, has recently garnered a significant amount of attention (Cheng et al., 2018; Noshad et al., 2019; Saxe et al., 2018; Shwartz-Ziv & Tishby, 2017; Yu et al., 2018; Yu & Principe, 2019). Given the input variable X and the desired output Y for a supervised learning task, a DNN is viewed as a transformation of X into a representation that is favorable for obtaining a good prediction of Y. By treating the output of each hidden layer as a random variable T, one can model the MI I(T;X) between T and X. Likewise, the MI I(T;Y) between T and Y can be modeled. The quantities I(T;X) and I(T;Y) span what is referred to as the Information Plane (IP). Several works have demonstrated that one may unveil interesting properties of the training dynamics by analyzing DNNs in the IP. Figure 1, produced using our proposed estimator, illustrates one such insight that is similar to the observations of Shwartz-Ziv & Tishby (2017), where training can be separated into two distinct phases, the fitting phase and the compression phase. This claim has been highly debated, as subsequent research has linked the compression phase to saturation of neurons (Saxe et al., 2018) or clustering of the hidden representations (Goldfeld et al., 2019).

Contributions We propose a novel approach for estimating MI, wherein a kernel tensor-based estimator of Rényi's entropy allows us to provide the first analysis of large-scale DNNs as commonly found in state-of-the-art methods. We further highlight that the multivariate matrix-based approach proposed by Yu et al. (2019) can be viewed as a special case of our approach. However, our proposed method alleviates numerical instabilities associated with the multivariate matrix-based approach, which enables estimation of entropy for high-dimensional multivariate data. Further, using the proposed estimator, we investigate the claim of Cheng et al. (2018) that I(T;X) ≈ H(X) and I(T;Y) ≈ H(Y) in high dimensions (in which case MI-based analysis would be meaningless) and illustrate that this does not hold for our estimator. Finally, our results indicate that the compression phase is apparent mostly for the training data, particularly for more challenging datasets. When utilizing a technique such as early stopping, a common technique to avoid overfitting, training tends to stop before the compression phase occurs (see Figure 1). This may indicate that the compression phase is linked to the overfitting phenomenon.
2 Related Work
Analyzing DNNs in the IP was first proposed by Tishby & Zaslavsky (2015) and later demonstrated by Shwartz-Ziv & Tishby (2017). Among other results, the authors studied the evolution of the IP during the training process of DNNs and noted that the process was composed of two different phases: first, an initial fitting phase where I(T;Y) increases, followed by a phase of compression where I(T;X) decreases. These results were later questioned by Saxe et al. (2018), who argued that the compression phase is not a general property of the DNN training process, but rather an effect of different activation functions. However, a recent study by Noshad et al. (2019) seems to support the claim of a compression phase, regardless of the activation function. The authors argue that the MI estimator utilized in Saxe et al. (2018) might not be accurate enough and demonstrate that a compression phase does occur, but that the amount of compression can vary between different activation functions. Another recent study, by Chelombiev et al. (2019), also reported a compression phase, but highlighted the importance of adaptive MI estimators. They also showed that when L2 regularization was included in the training, compression was observed regardless of activation function. Also, some recent studies have discussed the limitations of the IP framework for the analysis and optimization of particular types of DNNs (Kolchinsky et al., 2019; Amjad & Geiger, 2019).

On a different note, Cheng et al. (2018) proposed an evaluation framework for DNNs based on the IP and demonstrated that MI can be used to infer the capability of DNNs to recognize objects in an image classification task. Furthermore, the authors argue that when the number of neurons in a hidden layer grows large, I(T;X) and I(T;Y) barely change and are approximately deterministic, i.e. I(T;X) ≈ H(X) and I(T;Y) ≈ H(Y). Therefore, they only model the MI between X and the last hidden layer (that is, the output of the network), and between the last hidden layer and Y.
Yu & Principe (2019)
initially proposed to utilize empirical estimators of Rényi's MI for investigating different data processing inequalities in stacked autoencoders (SAEs). They also claimed that the compression phase in the IP of SAEs is determined by the relationship between the SAE bottleneck layer size and the intrinsic dimensionality of the given data.
Yu et al. (2019) also extended the empirical estimators of Rényi's MI to the multivariate scenario and applied the new estimator to simple but realistic Convolutional Neural Networks (CNNs) (Yu et al., 2018). However, the results so far suffer from a high computational burden and are hard to generalize to deep and large-scale CNNs.

3 Matrix-Based Mutual Information
Here, we review Rényi's α-order entropy and its multivariate extension proposed by Yu et al. (2019).
3.1 Matrix-Based Rényi's α-Order Entropy
Rényi's α-order entropy is a generalization of Shannon's entropy (Shannon, 1948; Rényi, 1961). For a random variable X with probability density function (PDF) f(x) over a finite set 𝒳, Rényi's α-order entropy is defined as:

H_\alpha(X) = \frac{1}{1-\alpha} \log_2 \left( \int_{\mathcal{X}} f^\alpha(x)\, dx \right).    (1)
Equation 1 has been widely applied in machine learning (Principe, 2010), and the particular case of α = 2, combined with Parzen window density estimation (Parzen, 1962), forms the basis of Information Theoretic Learning (Principe, 2010). However, accurately estimating PDFs for high-dimensional data, which is typically the case for DNNs, is a challenging task. To avoid the problem of high-dimensional PDF estimation, Giraldo et al. (2012) proposed a non-parametric framework for estimating entropy directly from data using infinitely divisible kernels (Giraldo & Príncipe, 2013), with properties similar to those of Rényi's α-order entropy.

Definition 3.1.
(Giraldo et al., 2012) Let x_1, x_2, …, x_N ∈ 𝒳 denote data points and let κ: 𝒳 × 𝒳 → ℝ be an infinitely divisible positive definite kernel (Bhatia, 2006). Given the kernel matrix K ∈ ℝ^{N×N} with elements K_{ij} = κ(x_i, x_j) and the matrix A with elements A_{ij} = \frac{1}{N} \frac{K_{ij}}{\sqrt{K_{ii} K_{jj}}}, the matrix-based Rényi's α-order entropy is given by
S_\alpha(A) = \frac{1}{1-\alpha} \log_2 \left( \operatorname{tr}(A^\alpha) \right) = \frac{1}{1-\alpha} \log_2 \left( \sum_{i=1}^{N} \lambda_i^\alpha(A) \right),    (2)
where tr(·) denotes the trace and λ_i(A) denotes the i-th eigenvalue of A.
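As a concrete illustration, the estimator in Equation 2 can be computed directly from pairwise kernel evaluations and an eigendecomposition. The sketch below is our own minimal NumPy example (not the authors' implementation), using an RBF kernel:

```python
import numpy as np

def rbf_kernel_matrix(X, sigma):
    """Pairwise RBF kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

def matrix_renyi_entropy(K, alpha=1.01):
    """Matrix-based Renyi alpha-entropy of Equation 2.

    The kernel matrix is normalized so that tr(A) = 1, then the entropy is
    1/(1-alpha) * log2 of the sum of its eigenvalues raised to alpha.
    """
    N = K.shape[0]
    d = np.sqrt(np.diag(K))
    A = K / (N * np.outer(d, d))
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)  # guard tiny negatives
    return float(np.log2(np.sum(lam**alpha)) / (1.0 - alpha))
```

For an RBF kernel the diagonal of K is 1, so A = K/N: a vanishing kernel width drives A toward I/N and the entropy toward its maximum log₂ N, while a very large width drives the entropy toward 0.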
Not all estimators of entropy have the same properties (Paninski, 2003). The ones developed for Shannon entropy suffer from the curse of dimensionality (Kwak & Chong-Ho Choi, 2002). In contrast, the matrix-based Rényi entropy in Equation 2 has the same functional form as the corresponding statistical quantity in a Reproducing Kernel Hilbert Space (RKHS). Essentially, we are projecting the marginal distribution into an RKHS to measure entropy and mutual information. This is similar to the approach of the maximum mean discrepancy and the kernel mean embedding (Gretton et al., 2012; Muandet et al., 2017). Moreover, Giraldo et al. (2012) showed that under certain conditions the difference between the estimator and the true quantity can be bounded, as shown in Proposition A.1 in Appendix A. Notice that the dimensionality of the data does not appear in Proposition A.1. This makes the matrix-based entropy estimator more robust for high-dimensional data compared to the KNN- and KDE-based information estimators used in previous works (Saxe et al., 2018; Chelombiev et al., 2019), which have difficulties handling high-dimensional data (Kwak & Chong-Ho Choi, 2002). Also, there is no need for the binning procedure utilized in previous works (Shwartz-Ziv & Tishby, 2017), which is known to struggle with the ReLU activation function commonly used in DNNs
(Saxe et al., 2018).

Remark 1. In the limit α → 1, Equation 2 reduces to a matrix-based analogue of Shannon's entropy,

S_1(A) = -\sum_{i=1}^{N} \lambda_i(A) \log_2 \lambda_i(A).    (3)

A proof is provided in Appendix B.
In addition to the definition of matrix-based entropy, Giraldo et al. (2012) define the joint entropy between A and B as
S_\alpha(A, B) = S_\alpha\left( \frac{A \circ B}{\operatorname{tr}(A \circ B)} \right),    (4)
where A and B are the kernel matrices of two different representations of the same objects and ∘ denotes the Hadamard product. Finally, the MI is, similar to Shannon's formulation, defined as
I_\alpha(A; B) = S_\alpha(A) + S_\alpha(B) - S_\alpha(A, B).    (5)
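Under Equations 4 and 5, the MI estimate only requires the two normalized kernel matrices and their Hadamard product. A minimal self-contained sketch (ours, for illustration; function names are hypothetical):

```python
import numpy as np

def entropy_from_A(A, alpha=1.01):
    # S_alpha of a trace-normalized PSD matrix (Equation 2).
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    return float(np.log2(np.sum(lam**alpha)) / (1.0 - alpha))

def normalize(K):
    # A_ij = K_ij / (N * sqrt(K_ii K_jj)), so that tr(A) = 1.
    d = np.sqrt(np.diag(K))
    return K / (K.shape[0] * np.outer(d, d))

def mutual_information(Kx, Ky, alpha=1.01):
    # Equation 5: I(A;B) = S(A) + S(B) - S(A,B), with the joint entropy of
    # Equation 4 computed from the trace-normalized Hadamard product.
    A, B = normalize(Kx), normalize(Ky)
    AB = A * B
    joint = entropy_from_A(AB / np.trace(AB), alpha)
    return entropy_from_A(A, alpha) + entropy_from_A(B, alpha) - joint
```

As a sanity check, for an ideal binary label kernel (entries 0 or 1) the Hadamard product of the matrix with itself reproduces the matrix after trace normalization, so the MI of the labels with themselves equals their entropy.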
3.2 Multivariate Matrix-Based Rényi's α-Entropy Functionals
The matrix-based Rényi's α-order entropy functional is not directly suitable for estimating the amount of information in the features produced by a convolutional layer in a DNN, as the output consists of C feature maps, each represented by its own matrix, that characterize different properties of the same sample. Yu et al. (2019) proposed a multivariate extension of the matrix-based Rényi's α-order entropy, which computes the joint entropy among C variables as
S_\alpha(A_1, A_2, \ldots, A_C) = S_\alpha\left( \frac{A_1 \circ A_2 \circ \cdots \circ A_C}{\operatorname{tr}(A_1 \circ A_2 \circ \cdots \circ A_C)} \right),    (6)
where A_1, A_2, …, A_C are the kernel matrices of the C feature maps. Yu et al. (2018) also demonstrated how Equation 6 could be utilized for analyzing the synergy and redundancy of convolutional layers in DNNs, but noted that this formulation can encounter difficulties when the number of feature maps increases, such as in more complex CNNs. Difficulties arise due to the Hadamard products in Equation 6: each element of the trace-normalized A_c takes a value between 0 and 1/N, and the product of C such elements thus tends towards 0 as C grows. Yu et al. (2018) reported such challenges when attempting to model the IP of the VGG16 network (Simonyan & Zisserman, 2015).
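The numerical issue can be seen directly: with trace-one normalization every entry of A_c is at most 1/N, so the entrywise product of C such matrices scales like N^(-C), and off-diagonal entries (which are smaller still) eventually underflow in double precision. A toy demonstration of ours, with placeholder batch size and filter count:

```python
import numpy as np

N, C = 100, 64  # batch size and number of feature maps (both placeholders)
rng = np.random.RandomState(0)

# Accumulate the Hadamard product of C normalized RBF kernel matrices,
# each with entries in (0, 1/N], as in Equation 6.
prod = np.ones((N, N))
for _ in range(C):
    X = rng.randn(N, 8)
    sq = np.sum(X**2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / 2.0)
    prod *= K / N  # each filter contributes a factor of at most 1/N

# Every entry has collapsed towards zero: the diagonal is ~ N^(-C), and
# most off-diagonal entries underflow entirely, destroying the structure
# that the trace normalization in Equation 6 is meant to rescale.
print(prod.max())
```

With a few hundred filters, as in VGG16-scale networks, even the diagonal falls below the smallest representable double, which is the instability reported by Yu et al. (2018).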
4 Tensor-Based Mutual Information
To estimate information theoretic quantities of the features produced by convolutional layers, and to address the limitations discussed above, we introduce our tensor-based approach for estimating entropy and MI in DNNs and show that the multivariate approach in Section 3.2 arises as a special case.
4.1 Tensor Kernels for Mutual Information Estimation
The output of a convolutional layer is represented as a tensor X_i ∈ ℝ^{H×W×C} for a data point i. As discussed above, the matrix-based Rényi entropy estimator cannot handle such tensor data without modification. To handle the tensor-based nature of convolutional layers, we propose to utilize tensor kernels (Signoretto et al., 2011) to produce a kernel matrix for the output of a convolutional layer. A tensor formulation of the radial basis function (RBF) kernel can be stated as
k_{\text{ten}}(X_i, X_j) = \exp\left( -\frac{\| X_i - X_j \|_F^2}{2\sigma^2} \right),    (7)
where ‖·‖_F denotes the Frobenius norm and σ is the kernel width parameter. In practice, the tensor kernel in Equation 7 can be computed by reshaping the tensor into a vectorized representation while replacing the Frobenius norm with a Euclidean norm. We estimate the MI in Equation 5 by replacing the matrix A with

A_{ij} = \frac{1}{N} \frac{k_{\text{ten}}(X_i, X_j)}{\sqrt{k_{\text{ten}}(X_i, X_i)\, k_{\text{ten}}(X_j, X_j)}}.    (8)
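The reshaping trick works because the Frobenius norm of a tensor difference equals the Euclidean norm of its vectorization. A sketch of ours of the tensor kernel of Equation 7 (the 1/N normalization of Equation 8 can then be applied to the result):

```python
import numpy as np

def tensor_rbf_kernel(T, sigma):
    """Tensor RBF kernel matrix for a batch of feature maps.

    T has shape (N, H, W, C). The Frobenius norm over (H, W, C) equals the
    Euclidean norm of the flattened (N, H*W*C) representation, so the tensor
    kernel of Equation 7 reduces to a standard RBF kernel on vectors.
    """
    V = T.reshape(T.shape[0], -1)
    sq = np.sum(V**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * V @ V.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))
```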
4.2 Tensor-Based Approach Contains Multivariate Approach as Special Case
Let x_i denote the vector representation of data point i in layer ℓ and let x_i^{(c)} denote the part of this representation produced by filter c. In the following, we omit the layer index for ease of notation, but assume it is fixed. Consider the case when each κ_c is an RBF kernel with kernel width parameter σ_c, that is, κ_c(x_i^{(c)}, x_j^{(c)}) = \exp\left(-\| x_i^{(c)} - x_j^{(c)} \|^2 / (2\sigma_c^2)\right). In this case, (K_c)_{ii} = 1 and A_c = \frac{1}{N} K_c, since the diagonal of an RBF kernel matrix consists of ones. Thus, element (i, j) of the Hadamard product A_1 ∘ A_2 ∘ ⋯ ∘ A_C is given by

\frac{1}{N^C} \prod_{c=1}^{C} \exp\left( -\frac{\| x_i^{(c)} - x_j^{(c)} \|^2}{2\sigma_c^2} \right).

If we let σ_1 = σ_2 = ⋯ = σ_C = σ, this expression is reduced to

\frac{1}{N^C} \exp\left( -\frac{\| X_i - X_j \|_F^2}{2\sigma^2} \right) = \frac{1}{N^{C-1}} \left( A_{\text{ten}} \right)_{ij}.    (9)

Accordingly, S_α(A_1, A_2, …, A_C) = S_α(A_ten), since the constant factor cancels under the trace normalization in Equation 6, implying that the tensor method is equivalent to the multivariate matrix-based joint entropy when the width parameter is equal within a given layer. However, the tensor-based approach eliminates the effect of the numerical instabilities one encounters in layers with many filters, thereby enabling the analysis of complex neural networks.
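The equivalence is easy to verify numerically: with a shared width σ, the entrywise product of the unnormalized per-filter RBF kernel matrices matches the tensor kernel of Equation 7 exactly, since the per-filter squared distances sum to the squared Frobenius distance. A check of ours under these assumptions (shapes are placeholders):

```python
import numpy as np

rng = np.random.RandomState(0)
N, H, W, C = 6, 4, 4, 3  # placeholder batch and feature-map dimensions
T = rng.randn(N, H, W, C)
sigma = 1.5

def rbf(V, s):
    sq = np.sum(V**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * V @ V.T, 0.0)
    return np.exp(-d2 / (2.0 * s**2))

# Entrywise product of per-filter kernel matrices with a shared width.
hadamard = np.ones((N, N))
for c in range(C):
    hadamard *= rbf(T[..., c].reshape(N, -1), sigma)

# Tensor kernel on the full feature maps with the same width.
K_ten = rbf(T.reshape(N, -1), sigma)
```

Adding the 1/N factor per filter only rescales the product by N^(-C), which the trace normalization in Equation 6 removes.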
4.3 Choosing the Kernel Width
With methods involving RBF kernels, the choice of the kernel width parameter, σ, is always critical. For supervised learning problems, one might choose this parameter by cross-validation based on validation accuracy, while in unsupervised problems one might use a rule of thumb (Shi & Malik, 2000; Shi et al., 2009; Silverman, 1986). However, when estimating MI in DNNs, the data is often high-dimensional, in which case unsupervised rules of thumb often fail (Shi et al., 2009).
In this work, we choose σ based on an optimality criterion. Intuitively, one can make the following observation: a good kernel matrix should reveal the class structure present in the data. This can be accomplished by maximizing the so-called kernel alignment loss (Cristianini et al., 2002) between the kernel matrix of a given layer, K_σ, and the label kernel matrix, K_y. The kernel alignment loss is defined as

A(K_a, K_b) = \frac{\langle K_a, K_b \rangle_F}{\| K_a \|_F \, \| K_b \|_F},    (10)

where ‖·‖_F and ⟨·,·⟩_F denote the Frobenius norm and inner product, respectively. Thus, we choose our optimal σ as

\sigma^{*} = \arg\max_{\sigma} A(K_\sigma, K_y).
To stabilize the σ values across mini-batches, we employ an exponential moving average, such that in layer ℓ at iteration t we have

\sigma_{\ell, t} = \beta \sigma_{\ell, t-1} + (1 - \beta) \sigma^{*}_{\ell, t},

where β ∈ [0, 1] and σ*_{ℓ,t} is the width that maximizes Equation 10 at iteration t.
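The width-selection procedure can be sketched as follows (our illustration; the candidate grid size and β value are placeholders): score each candidate σ by the alignment of Equation 10 between the layer's kernel matrix and the label kernel matrix, take the argmax, and smooth across mini-batches with an exponential moving average.

```python
import numpy as np

def kernel_alignment(Ka, Kb):
    # Equation 10: Frobenius inner product normalized by Frobenius norms.
    return np.sum(Ka * Kb) / (np.linalg.norm(Ka) * np.linalg.norm(Kb))

def rbf(V, s):
    sq = np.sum(V**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * V @ V.T, 0.0)
    return np.exp(-d2 / (2.0 * s**2))

def choose_sigma(Z, Ky, n_candidates=75):
    # Candidate widths span 0.1x to 10x the mean pairwise distance within
    # the mini-batch, as in the paper's setup; the alignment-maximizing
    # candidate is returned.
    sq = np.sum(Z**2, axis=1)
    d = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0))
    mean_dist = d[np.triu_indices_from(d, k=1)].mean()
    candidates = np.linspace(0.1 * mean_dist, 10 * mean_dist, n_candidates)
    scores = [kernel_alignment(rbf(Z, s), Ky) for s in candidates]
    return candidates[int(np.argmax(scores))]

def ema_update(sigma_prev, sigma_star, beta=0.9):
    # Exponential moving average across mini-batches (beta is a placeholder).
    return beta * sigma_prev + (1.0 - beta) * sigma_star
```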
5 Experimental Results
We evaluate our approach by comparing to previous results obtained on small networks, considering the MNIST dataset and a Multilayer Perceptron (MLP) architecture inspired by Saxe et al. (2018). We further compare to a small CNN architecture similar to that of Noshad et al. (2019), before considering large networks, namely the VGG16, and a more challenging dataset, namely CIFAR-10. Note that, unless stated otherwise, we use CNN to denote the small CNN architecture. Details about the MLP and the CNN utilized in these experiments can be found in Appendix C. All MI estimates were computed using Equations 3, 4 and 5 and the tensor approach described in Section 4.

Since the MI is computed at the mini-batch level, a certain degree of noise is present. To smooth the MI estimates we employ a moving average approach where each sample is averaged over a fixed number of mini-batches; the window size differs between the MLP/CNN experiments and the VGG16 experiment. We use a batch size of 100 samples and determine the kernel width using the kernel alignment loss defined in Equation 10. For each hidden layer, we choose the kernel width that maximizes the kernel alignment loss in the range between 0.1 and 10 times the mean distance between the samples in one mini-batch. Initially, we sample 75 equally spaced values for the kernel width in the given range for the MLP and CNN, and 300 values for the VGG16 network. During training, we dynamically reduce the number of samples to 50 and 100, respectively, in order to reduce computational complexity, motivated by the fact that the kernel width remains relatively stable during the latter part of training (illustrated in Appendix F). We chose the range between 0.1 and 10 times the mean distance between the samples in one mini-batch to avoid the kernel width becoming too small and to ensure that we cover a wide enough range of possible values. For the input kernel width we empirically evaluated values in the range 2-16 and found consistent results for values in the range 4-12. All our experiments were conducted with an input kernel width of 8.
For the label kernel matrix, we want a kernel width that is as small as possible to approach an ideal kernel matrix, while at the same time large enough to avoid numerical instabilities. For all our experiments we use a value of 0.1 for the kernel width of the label kernel matrix.
Comparison to Previous Approaches First, we study the IP of the MLP examined in previous works on DNN analysis using information theory (Noshad et al., 2019; Saxe et al., 2018). We utilize stochastic gradient descent with a learning rate of 0.09, a cross-entropy loss function, and repeat the experiment 5 times. Figure 1 displays the IP of the MLP with a ReLU activation function in each hidden layer. MI was estimated using the training data of the MNIST dataset. A similar experiment was performed with the tanh activation function, obtaining similar results; the interested reader can find these results in Appendix D.

From Figure 1 one can clearly observe a fitting phase, where both I(T;X) and I(T;Y) increase rapidly, followed by a compression phase where I(T;X) decreases and I(T;Y) remains unchanged. Also note that I(T;Y) for the output layer (layer 5 in Figure 1) stabilizes at an approximate value of log_2(10) ≈ 3.32, which is to be expected. This can be seen by noting that when the network achieves approximately 100% accuracy, I(Ŷ;Y) ≈ H(Y), where Ŷ denotes the output of the network, since Y and Ŷ will be approximately identical and the MI between a variable and itself is just the entropy of the variable. The entropy of Y is estimated using Equation 3, which requires the computation of the eigenvalues of the label kernel matrix K_y. For the ideal case, where (K_y)_{ij} = 1 if y_i = y_j and zero otherwise, K_y is a rank-C matrix, where C is the number of classes in the data. Thus, the normalized matrix A_y = \frac{1}{N} K_y has C non-zero eigenvalues, which are given by λ_c = N_c / N, where N_c is the number of data points in class c. Furthermore, if the dataset is balanced we have N_c = N/C, so λ_c = 1/C. Then, Equation 3 gives us the entropy estimate

S(A_y) = -\sum_{c=1}^{C} \frac{1}{C} \log_2 \frac{1}{C} = \log_2 C.    (11)
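The closed form in Equation 11 is easy to verify numerically: for a balanced, ideal label kernel matrix the C non-zero eigenvalues are all 1/C, so the entropy is exactly log₂ C (≈ 3.32 bits for the ten MNIST classes). A quick check of ours:

```python
import numpy as np

def label_entropy(y, alpha=1.01):
    # Ideal label kernel: (K_y)_ij = 1 if y_i = y_j, else 0; normalize by N
    # (the diagonal is 1) and apply the matrix-based entropy of Equation 2.
    Ky = (y[:, None] == y[None, :]).astype(float)
    A = Ky / len(y)
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    return float(np.log2(np.sum(lam**alpha)) / (1.0 - alpha))

# Balanced ten-class labels, 100 samples per class.
y = np.repeat(np.arange(10), 100)
print(label_entropy(y))  # close to log2(10) ≈ 3.32
```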
Next, we examine the IP of a CNN similar to that studied by Noshad et al. (2019), with an experimental setup similar to that of the MLP experiment. Figure 2 displays the IP of the CNN with a ReLU activation function in all hidden layers. A similar experiment was conducted using the tanh activation function and can be found in Appendix E. While the output layer behaves similarly to that of the MLP, the preceding layers show much less movement. In particular, no fitting phase is observed, which might be a result of the convolutional layers being able to extract the necessary information in very few iterations. Note that the output layer is again settling at the expected value log_2(10), similar to the MLP, as it also achieves close to 100% accuracy.
Increasing DNN size Finally, we analyze the IP of the VGG16 network on the CIFAR-10 dataset, with the same experimental setup as in the previous experiments. To our knowledge, this is the first time that the full IP has been modeled for such a large-scale network. Figures 3 and 4 show the IP when estimating the MI on the training dataset and the test dataset, respectively. For the training dataset, we can clearly observe the same trend as for the smaller networks, where layers experience a fitting phase during the early stages of training and a compression phase in the later stage. Note that the compression phase is less prominent for the test dataset. Also note the difference between the final values of I(T;Y) for the output layer estimated using the training and test data, which is a result of the different accuracies achieved on the training and test data. Cheng et al. (2018) claim that I(T;X) ≈ H(X) and I(T;Y) ≈ H(Y) for high-dimensional data, and highlight particular difficulties with estimating the MI between convolutional layers and the input/output. However, this statement is dependent on their particular estimator of the MI, and the results presented in Figures 3 and 4 demonstrate that neither I(T;X) nor I(T;Y) is deterministic for our proposed estimator.
Effect of Early Stopping
We also investigate the effect of using early stopping on the IP described above. Early stopping is a regularization technique where the validation accuracy is monitored and training is stopped if the validation accuracy does not increase for a set number of iterations, often referred to as the patience hyperparameter. Figure 1 displays the results of monitoring where the training would stop if the early stopping procedure was applied for different values of patience. For a patience of 5 iterations, the network training would stop before the compression phase takes place for several of the layers. For larger patience values, the effects of the compression phase can be observed before training is stopped. Early stopping is a procedure intended to prevent the network from overfitting, which may imply that the compression phase observed in the IP of DNNs can be related to overfitting.

Data Processing Inequality
A DNN consists of a chain of mappings from the input, through the hidden layers, and to the output. One can interpret a DNN as a Markov chain (Shwartz-Ziv & Tishby, 2017; Yu & Principe, 2019) that defines an information path (Shwartz-Ziv & Tishby, 2017), which should satisfy the following Data Processing Inequality (DPI) (Cover & Thomas, 2006):

I(X; T_1) \geq I(X; T_2) \geq \cdots \geq I(X; T_L),    (12)

where L is the number of layers in the network. An indication of a good MI estimator is that it tends to uphold the DPI. Figure 8 in Appendix G illustrates the mean difference in MI between two subsequent layers in the MLP and in the VGG16 network. Positive numbers indicate that the MI decreases, thus indicating compliance with the DPI. We observe that our estimator complies with the DPI for all layers in the MLP and for all except one in the VGG16 network.
6 Conclusion
In this work, we propose a novel framework for analyzing DNNs from an MI perspective using a tensor-based estimator of Rényi's α-order entropy. Our experiments illustrate that the proposed approach scales to large DNNs, which allows us to provide insights into the training dynamics. We observe that the compression phase in neural network training tends to be more prominent when MI is estimated on the training set, and that commonly used early-stopping criteria tend to stop training before or at the onset of the compression phase. This could imply that the compression phase is linked to overfitting. Furthermore, we showed that, for our tensor-based approach, the claim that I(T;X) ≈ H(X) and I(T;Y) ≈ H(Y) does not hold. We believe that our proposed approach can provide new insight and facilitate a more theoretical understanding of DNNs.
References
 Amjad & Geiger (2019) R. A. Amjad and B. C. Geiger. Learning representations for neural network-based classification using the information bottleneck principle. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2019. ISSN 01628828. doi: 10.1109/TPAMI.2019.2909031.
 Bhatia (2006) Rajendra Bhatia. Infinitely divisible matrices. The American Mathematical Monthly, 113(3):221–235, 2006. ISSN 00029890, 19300972. URL http://www.jstor.org/stable/27641890.
 Chelombiev et al. (2019) Ivan Chelombiev, Conor Houghton, and Cian O’Donnell. Adaptive estimators show information compression in deep neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SkeZisA5t7.
 Cheng et al. (2018) Hao Cheng, Dongze Lian, Shenghua Gao, and Yanlin Geng. Evaluating capability of deep neural networks for image classification via information plane. In The European Conference on Computer Vision (ECCV), September 2018.
 Cover & Thomas (2006) Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, New York, NY, USA, 2006. ISBN 0471241954.
 Cristianini et al. (2002) Nello Cristianini, John Shawe-Taylor, Andre Elisseeff, and Jaz S Kandola. On kernel-target alignment. In Advances in neural information processing systems, pp. 367–373, 2002.
 Giraldo & Príncipe (2013) Luis Gonzalo Sánchez Giraldo and José C. Príncipe. Information theoretic learning with infinitely divisible kernels. CoRR, abs/1301.3551, 2013. URL http://arxiv.org/abs/1301.3551.
 Giraldo et al. (2012) Luis Gonzalo Sanchez Giraldo, Murali Rao, and José Carlos Príncipe. Measures of entropy from data using infinitely divisible kernels. IEEE Transactions on Information Theory, 61:535–548, 2012.

 Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington (eds.), Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pp. 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR. URL http://proceedings.mlr.press/v9/glorot10a.html.
 Goldfeld et al. (2019) Ziv Goldfeld, Ewout Van Den Berg, Kristjan Greenewald, Igor Melnyk, Nam Nguyen, Brian Kingsbury, and Yury Polyanskiy. Estimating information flow in deep neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2299–2308, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/goldfeld19a.html.
 Gretton et al. (2012) Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. J. Mach. Learn. Res., 13:723–773, March 2012. ISSN 15324435. URL http://dl.acm.org/citation.cfm?id=2188385.2188410.

 He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034, 2015.
 Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 Kolchinsky et al. (2019) Artemy Kolchinsky, Brendan D. Tracey, and Steven Van Kuyk. Caveats for information bottleneck in deterministic scenarios. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rke4HiAcY7.

 Kwak & Chong-Ho Choi (2002) N. Kwak and Chong-Ho Choi. Input feature selection by mutual information based on Parzen window. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(12):1667–1671, Dec 2002. ISSN 01628828. doi: 10.1109/TPAMI.2002.1114861.
 Muandet et al. (2017) Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning, 10:1–141, 01 2017. doi: 10.1561/2200000060.
 Nielsen & Chuang (2011) Michael A. Nielsen and Isaac L. Chuang. Quantum Computation and Quantum Information: 10th Anniversary Edition. Cambridge University Press, New York, NY, USA, 10th edition, 2011. ISBN 1107002176, 9781107002173.
 Noshad et al. (2019) M. Noshad, Y. Zeng, and A. O. Hero. Scalable mutual information estimation using dependence graphs. In ICASSP 2019  2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2962–2966, May 2019. doi: 10.1109/ICASSP.2019.8683351.
 Parzen (1962) Emanuel Parzen. On estimation of a probability density function and mode. Ann. Math. Statist., 33(3):1065–1076, 09 1962. doi: 10.1214/aoms/1177704472. URL https://doi.org/10.1214/aoms/1177704472.

Paszke et al. (2017)
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary
DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer.
Automatic differentiation in pytorch.
In NIPSW, 2017.  Principe (2010) Jose C. Principe. Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives. Springer Publishing Company, Incorporated, 1st edition, 2010. ISBN 1441915699, 9781441915696.
 Renyi (1961) Alfred Renyi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pp. 547–561, Berkeley, Calif., 1961. University of California Press. URL https://projecteuclid.org/euclid.bsmsp/1200512181.

Saxe et al. (2018)
Andrew Michael Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy
Kolchinsky, Brendan Daniel Tracey, and David Daniel Cox.
On the information bottleneck theory of deep learning.
In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ry_WPGA.  Shannon (1948) C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(4):623–656, 1948. doi: 10.1002/j.15387305.1948.tb00917.x. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/j.15387305.1948.tb00917.x.
 Shi & Malik (2000) Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 2000.

Shi et al. (2009)
Tao Shi, Mikhail Belkin, Bin Yu, et al.
Data spectroscopy: Eigenspaces of convolution operators and clustering.
The Annals of Statistics, 37(6B):3960–3984, 2009.  ShwartzZiv & Tishby (2017) Ravid ShwartzZiv and Naftali Tishby. Opening the black box of deep neural networks via information. CoRR, abs/1703.00810, 2017. URL http://arxiv.org/abs/1703.00810.
 Signoretto et al. (2011) Marco Signoretto, Lieven De Lathauwer, and Johan AK Suykens. A kernelbased framework to tensorial data analysis. Neural networks, 24(8):861–874, 2011.
 Silverman (1986) Bernard W Silverman. Density Estimation for Statistics and Data Analysis, volume 26. CRC Press, 1986.
 Simonyan & Zisserman (2015) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
 Tishby & Zaslavsky (2015) N. Tishby and N. Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pp. 1–5, April 2015. doi: 10.1109/ITW.2015.7133169.
 Yu et al. (2019) S. Yu, L. G. Sanchez Giraldo, R. Jenssen, and J. C. Principe. Multivariate extension of matrix-based Rényi's α-order entropy functional. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2019. doi: 10.1109/TPAMI.2019.2932976.
 Yu & Principe (2019) Shujian Yu and Jose C. Principe. Understanding autoencoders with information theoretic concepts. Neural Networks, 117:104 – 123, 2019. ISSN 08936080. doi: https://doi.org/10.1016/j.neunet.2019.05.003. URL http://www.sciencedirect.com/science/article/pii/S0893608019301352.
 Yu et al. (2018) Shujian Yu, Kristoffer Wickstrøm, Robert Jenssen, and José C. Príncipe. Understanding convolutional neural network training with information theory. CoRR, abs/1804.06537, 2018. URL http://arxiv.org/abs/1804.06537.
Appendix A Bound on Matrix–Based Entropy Estimator
Not all estimators of entropy have the same properties. The ones developed for Shannon entropy suffer from the curse of dimensionality (Kwak & Chong-Ho Choi, 2002). In contrast, Rényi entropy estimators have the same functional form as the corresponding statistical quantity in a Reproducing Kernel Hilbert Space (RKHS), thus capturing properties of the data population. Essentially, we are projecting the marginal distribution into an RKHS in order to measure entropy and MI. This is similar to the approach of the maximum mean discrepancy and the kernel mean embedding (Gretton et al., 2012; Muandet et al., 2017). The connection with the data population can be shown via the theory of covariance operators. The covariance operator G is defined through the bilinear form
⟨f, G g⟩ = ∫_𝒳 f(x) g(x) dP(x), where P is a probability measure and f, g ∈ H. Based on the empirical distribution P_N, the empirical version G_N of G obtained from a sample of size N is given by:

G_N = \frac{1}{N} \sum_{i=1}^{N} \varphi(x_i) \otimes \varphi(x_i),

where φ denotes the feature map of the RKHS H.
By analyzing the spectra of G and G_N, Giraldo et al. (2012) showed that the difference between the corresponding entropies can be bounded, as stated in the following proposition:
Proposition A.1.
Let P_N be the empirical distribution. Then, as a consequence of Proposition 6.1 in Giraldo et al. (2012), S_α(A) = S_α(G_N). The difference between S_α(G_N) and S_α(G) can be bounded under the conditions of Theorem 6.2 in Giraldo et al. (2012), with probability 1,
(13) 
where C is a compact self-adjoint operator.
Appendix B Proof of Equation 3 in Section 3
Proof.
At α = 1, Equation 2 yields the indeterminate form 0/0, since log_2 tr(A) = log_2 1 = 0 and 1 - α = 0. L'Hôpital's rule yields

\lim_{\alpha \to 1} S_\alpha(A) = \lim_{\alpha \to 1} \frac{\frac{d}{d\alpha} \log_2 \sum_{i=1}^{N} \lambda_i^\alpha(A)}{\frac{d}{d\alpha} (1 - \alpha)} = -\sum_{i=1}^{N} \lambda_i(A) \log_2 \lambda_i(A).    (14)
∎
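The limit can also be checked numerically: evaluating Equation 2 at α values approaching 1 converges to the Shannon-like expression of Equation 3. A small sketch of ours, using an RBF kernel matrix on random data:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(30, 3)
sq = np.sum(X**2, axis=1)
K = np.exp(-np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0) / 2.0)
lam = np.clip(np.linalg.eigvalsh(K / 30), 0.0, None)  # diagonal of K is 1
lam = lam[lam > 0]

shannon = -np.sum(lam * np.log2(lam))      # Equation 3 (the alpha -> 1 limit)
for alpha in (1.1, 1.01, 1.001):
    s_alpha = np.log2(np.sum(lam**alpha)) / (1.0 - alpha)
    print(alpha, abs(s_alpha - shannon))   # gap shrinks as alpha -> 1
```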
Appendix C Detailed Description of Networks from Section 5
We provide a detailed description of the architectures utilized in Section 5 of the main paper. Weights were initialized according to He et al. (2015) when the ReLU activation function was applied, and according to Glorot & Bengio (2010) for the experiments conducted using the tanh activation function. Biases were initialized as zeros for all networks. All networks were implemented using the deep learning framework PyTorch (Paszke et al., 2017).
C.1 Multilayer Perceptron Utilized in Section 5
The MLP architecture used in our experiments is the same architecture utilized in previous studies on the IP of DNNs (Noshad et al., 2019; Saxe et al., 2018), but with Batch Normalization (Ioffe & Szegedy, 2015) included after the activation function of each hidden layer. Specifically, the MLP in Section 5 includes (from input to output) the following components:

Fully connected layer with 784 inputs and 1024 outputs.

Activation function.

Batch normalization layer

Fully connected layer with 1024 inputs and 20 outputs.

Activation function.

Batch normalization layer

Fully connected layer with 20 inputs and 20 outputs.

Activation function.

Batch normalization layer

Fully connected layer with 20 inputs and 20 outputs.

Activation function.

Batch normalization layer

Fully connected layer with 20 inputs and 10 outputs.

Softmax activation function.
C.2 Convolutional Neural Network Utilized in Section 5
The CNN architecture in our experiments is similar to the one used by Noshad et al. (2019). Specifically, the CNN in Section 5 includes (from input to output) the following components:

Convolutional layer with 1 input channel and 4 filters, stride of 1 and no padding.

Activation function.

Batch normalization layer

Convolutional layer with 4 input channels and 8 filters, filter size , stride of 1 and no padding.

Activation function.

Batch normalization layer

Max pooling layer with filter size , stride of 2 and no padding.

Convolutional layer with 8 input channels and 16 filters, filter size , stride of 1 and no padding.

Activation function.

Batch normalization layer

Max pooling layer with filter size , stride of 2 and no padding.

Fully connected layer with 400 inputs and 256 outputs.

Activation function.

Batch normalization layer

Fully connected layer with 256 inputs and 10 outputs.

Softmax activation function.
Appendix D IP of MLP with tanh activation function from Section 5
Figure 5 displays the IP of the MLP described above with a tanh activation function applied in each hidden layer. Similarly to the ReLU experiment in the main paper, a fitting phase is observed, where both I(T;X) and I(T;Y) increase rapidly, followed by a compression phase where I(T;X) decreases and I(T;Y) remains unchanged. Also note that, similar to the ReLU experiment, I(T;Y) stabilizes close to the theoretical maximum value of log_2(10) ≈ 3.32.
Appendix E IP of CNN with tanh activation function from Section 5
Figure 6 displays the IP of the CNN described above with a tanh activation function applied in each hidden layer. Just as for the CNN experiment with ReLU activation function in the main paper, no fitting phase is observed for the majority of the layers, which might indicate that the convolutional layers can extract the essential information after only a few iterations.
Appendix F Kernel Width σ
We further evaluate our dynamic approach for finding the kernel width σ. Figure 7 shows the variation of σ in each layer for the MLP, the small CNN, and the VGG16 network. We observe that the optimal kernel width for each layer (based on the criterion in Section 4.3) stabilizes reasonably quickly and remains relatively constant during training. This illustrates that decreasing the sampling range is a meaningful approach to reduce computational complexity.
Appendix G Data Processing Inequality
Figure 8 illustrates the mean difference in MI between two subsequent layers in the MLP and VGG16 network. Positive numbers indicate that MI decreases, thus indicating compliance with the DPI. We observe that our estimator complies with the DPI for all layers in the MLP and for all except one in the VGG16 network.
Note that the difference in MI is considerably lower for the early layers in the network, which is further shown by the grouping of the early layers for our convolution-based architectures (Figures 2–4).