I Introduction
There has been growing interest in understanding the mapping and training of deep neural networks (DNNs) using information theory [1, 2, 3]. According to Shwartz-Ziv and Tishby [4], a DNN should be analyzed by measuring the information quantities that each layer’s representation $T$ preserves about the input signal $X$ with respect to the desired signal $Y$ (i.e., $I(X;T)$ with respect to $I(T;Y)$, where $I$ denotes mutual information), which has been called the Information Plane (IP). Moreover, they also show empirically that the common stochastic gradient descent (SGD) optimization undergoes two separate phases in the IP: an early “fitting” phase, in which both $I(X;T)$ and $I(T;Y)$ increase rapidly along with the iterations, and a later “compression” phase, in which there is a reversal such that $I(X;T)$ and $I(T;Y)$ continually decrease. However, the observations so far have been constrained to a simple multilayer perceptron (MLP) on toy data, and were later questioned by some counterexamples in [5].
In our most recent work [6], we used a novel matrix-based Rényi’s entropy [7] to analyze the information flow in stacked autoencoders (SAEs). We observed that the existence of a “compression” phase in the IP is predicated on the proper dimension of the bottleneck layer of SAEs: if the bottleneck layer size is larger than the intrinsic dimensionality [8] of the training data, the mutual information values start to increase up to a point and then go back, approaching the bisector of the IP; if it is smaller than the intrinsic dimensionality, the mutual information values increase consistently up to a point and never go back.
Despite the great potential of earlier works [4, 5, 6], there are several open questions when it comes to the applications of information theoretic concepts to convolutional neural networks (CNNs). These include but are not limited to:
1) The accurate and tractable estimation of information quantities in CNNs. Specifically, in a convolutional layer, the input signal $X$ is represented by multiple feature maps, as opposed to a single vector in the fully connected layers. Therefore, the quantity we really need to measure is the multivariate mutual information (MMI) between a single variable (e.g., $X$) and a group of variables (e.g., different feature maps)$^{1}$. Unfortunately, the reliable estimation of MMI is widely acknowledged as an intractable or infeasible task in the machine learning and information theory communities [9], especially when each variable lies in a high-dimensional space.
$^{1}$ By variable, we mean a random element, which can be, for instance, a vector-valued random variable.
2) A systematic framework to analyze CNN layer representations. By interpreting a feedforward DNN as a Markov chain, the existence of the data processing inequality (DPI) is a general consensus [4, 6]. However, it is necessary to identify more inner properties of CNN layer representations using a principled approach or framework, beyond the DPI.
In this paper, we answer these questions and make the following contributions:
1) By suggesting the multivariate extension of the matrix-based Rényi’s entropy functional [10], we show that the information flow, especially the MMI, in CNNs can be easily measured without approximation or accurate probability density function (PDF) estimation.
2) By introducing the partial information decomposition (PID) framework [11], we develop three quantities that enable identifying the synergy and redundancy tradeoff amongst different feature maps in convolutional layers. Our result has direct impact on the design of CNNs.
II Information Quantity Estimation in CNNs
In this section, we give a brief introduction to the recently proposed matrix-based Rényi’s entropy functional estimator [7] and its multivariate extension [10]. Benefiting from this novel definition, we present a simple method to measure the MMI between any pair of layer representations in CNNs. The theoretical foundations for our estimators are proved in [7, 10].
II-A Matrix-based Rényi’s entropy functional and its multivariate extension
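In information theory, a natural extension of the well-known Shannon’s entropy is Rényi’s $\alpha$-order entropy [12]. For a random variable $X$ with probability density function (PDF) $f(x)$ in a finite set $\mathcal{X}$, the $\alpha$-entropy $H_\alpha(f)$ is defined as:
$$H_\alpha(f) = \frac{1}{1-\alpha}\log\int_{\mathcal{X}} f^{\alpha}(x)\,dx \qquad (1)$$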
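As a small numerical illustration of how the Rényi entropy relates to Shannon’s entropy, the sketch below evaluates a discrete analogue of Eq. (1). The discrete sum in place of the integral, the base-2 logarithm, the function name `renyi_entropy` and the example distribution are all assumptions made for illustration; they are not part of the estimator used in this paper.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Discrete Renyi entropy of order alpha, in bits (illustrative analogue of Eq. (1))."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # zero-probability outcomes do not contribute
    if np.isclose(alpha, 1.0):
        # Shannon entropy, i.e. the limit of the Renyi entropy as alpha -> 1
        return float(-np.sum(p * np.log2(p)))
    return float(np.log2(np.sum(p ** alpha)) / (1.0 - alpha))

# Hypothetical distribution chosen only for illustration.
p = [0.5, 0.25, 0.125, 0.125]
shannon = renyi_entropy(p, 1.0)
near_shannon = renyi_entropy(p, 1.01)   # an order close to 1 stays close to Shannon
collision = renyi_entropy(p, 2.0)       # order 2 gives a smaller value for non-uniform p
```

For a uniform distribution every order gives the same value, $\log_2$ of the number of outcomes, which is a convenient sanity check.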
Rényi’s entropy functional has a long track record of usefulness in machine learning and its applications [13]. Unfortunately, the need for accurate PDF estimation impedes its more widespread adoption in data-driven science. To solve this problem, [7, 10] suggest similar quantities that resemble quantum Rényi’s entropy [14], defined in terms of the normalized eigenspectrum of the Hermitian matrix of the projected data in a reproducing kernel Hilbert space (RKHS), thus estimating the entropy and the joint entropy among two or more variables directly from data without PDF estimation. For brevity, we directly give the definitions.
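Definition 1
Let $\kappa:\mathcal{X}\times\mathcal{X}\mapsto\mathbb{R}$ be a real valued positive definite kernel that is also infinitely divisible [15]. Given $X=\{x_1,x_2,\dots,x_n\}$ and the Gram matrix $K$ obtained from evaluating a positive definite kernel $\kappa$ on all pairs of exemplars, that is $(K)_{ij}=\kappa(x_i,x_j)$, a matrix-based analogue to Rényi’s $\alpha$-entropy for a normalized positive definite (NPD) matrix $A$ of size $n\times n$, such that $\mathrm{tr}(A)=1$, can be given by the following functional:
$$S_\alpha(A) = \frac{1}{1-\alpha}\log_2\left(\mathrm{tr}(A^\alpha)\right) = \frac{1}{1-\alpha}\log_2\left[\sum_{i=1}^{n}\lambda_i(A)^\alpha\right] \qquad (2)$$
where $A_{ij}=\frac{1}{n}\frac{K_{ij}}{\sqrt{K_{ii}K_{jj}}}$ and $\lambda_i(A)$ denotes the $i$-th eigenvalue of $A$.
Definition 2
Given a collection of $n$ samples $\{s_i=(x_1^i,x_2^i,\dots,x_C^i)\}_{i=1}^{n}$, where the superscript $i$ denotes the sample index, each sample contains $C$ ($C\geq 2$) measurements $x_1\in\mathcal{X}_1$, $x_2\in\mathcal{X}_2$, $\dots$, $x_C\in\mathcal{X}_C$ obtained from the same realization, and the positive definite kernels $\kappa_1:\mathcal{X}_1\times\mathcal{X}_1\mapsto\mathbb{R}$, $\kappa_2:\mathcal{X}_2\times\mathcal{X}_2\mapsto\mathbb{R}$, $\dots$, $\kappa_C:\mathcal{X}_C\times\mathcal{X}_C\mapsto\mathbb{R}$, a matrix-based analogue to Rényi’s $\alpha$-order joint entropy among $C$ variables can be defined as:
$$S_\alpha(A_1,A_2,\dots,A_C) = S_\alpha\left(\frac{A_1\circ A_2\circ\dots\circ A_C}{\mathrm{tr}(A_1\circ A_2\circ\dots\circ A_C)}\right) \qquad (3)$$
where $(A_1)_{ij}=\kappa_1(x_1^i,x_1^j)$, $(A_2)_{ij}=\kappa_2(x_2^i,x_2^j)$, $\dots$, $(A_C)_{ij}=\kappa_C(x_C^i,x_C^j)$, and $\circ$ denotes the Hadamard product.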
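The two definitions can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors’ implementation: the function names (`normalize_gram`, `matrix_renyi_entropy`, `joint_renyi_entropy`) are hypothetical, the order `alpha = 1.01` is an illustrative value close to 1, and small negative eigenvalues caused by round-off are clipped to zero.

```python
import numpy as np

def normalize_gram(K):
    """A_ij = K_ij / (n * sqrt(K_ii * K_jj)), which gives tr(A) = 1."""
    K = np.asarray(K, dtype=float)
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d) / K.shape[0]

def matrix_renyi_entropy(A, alpha):
    """S_alpha(A) = log2(sum_i lambda_i(A)^alpha) / (1 - alpha), as in Eq. (2)."""
    eigvals = np.linalg.eigvalsh(A)            # A is symmetric, so eigvalsh applies
    eigvals = np.clip(eigvals, 0.0, None)      # guard against tiny negative round-off
    return float(np.log2(np.sum(eigvals ** alpha)) / (1.0 - alpha))

def joint_renyi_entropy(mats, alpha):
    """Joint entropy via the trace-normalized Hadamard product, as in Eq. (3)."""
    prod = np.ones_like(np.asarray(mats[0], dtype=float))
    for A in mats:
        prod = prod * A                        # element-wise (Hadamard) product
    return matrix_renyi_entropy(prod / np.trace(prod), alpha)

alpha = 1.01                                   # illustrative order (assumption)
n = 4
distinct = normalize_gram(np.eye(n))           # n mutually "orthogonal" samples
identical = normalize_gram(np.ones((n, n)))    # n identical samples
h_distinct = matrix_renyi_entropy(distinct, alpha)
h_identical = matrix_renyi_entropy(identical, alpha)
h_joint = joint_renyi_entropy([distinct, identical], alpha)
```

The two extreme Gram matrices give the expected limits: a flat eigenspectrum yields $\log_2 n$ bits, a rank-one matrix yields zero, and joining a constant variable to an informative one leaves the joint entropy at $\log_2 n$.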
II-B Multivariate mutual information estimation in CNNs
Suppose there are $C$ filters in the convolutional layer. Given an input image, it is then represented by $C$ different feature maps, each characterizing a specific property of the input. This suggests that the amount of information that the convolutional layer gained from the input $X$ is preserved in $C$ information sources $T^1$, $T^2$, $\dots$, $T^C$. Therefore, the amount of information about the input $X$ captured by the feature maps is:
$$I(X;\{T^1,T^2,\dots,T^C\}) = H(X) + H(T^1,T^2,\dots,T^C) - H(X,T^1,T^2,\dots,T^C) \qquad (4)$$
where $H$ denotes entropy for a single variable or joint entropy for a group of variables.
With the estimators in Definitions 1 and 2, the quantity in Eq. (4) can be estimated as:
$$I_\alpha(B;\{A_1,A_2,\dots,A_C\}) = S_\alpha(B) + S_\alpha(A_1,A_2,\dots,A_C) - S_\alpha(A_1,A_2,\dots,A_C,B) \qquad (5)$$
Here, $B$ and $A_1,\dots,A_C$ denote Gram matrices evaluated on the input tensor and the $C$ feature map tensors, respectively. Specifically, $x_p^i$ (in Definition 2) refers to the feature map generated from the $i$-th input sample using the $p$-th ($1\leq p\leq C$) filter, and $A_p$ is evaluated exactly on $\{x_p^i\}_{i=1}^{n}$. Obviously, instead of estimating the joint PDF on $\{X,T^1,T^2,\dots,T^C\}$, which is typically unattainable, one just needs to compute $C+1$ Gram matrices using a real valued positive definite kernel that is also infinitely divisible [15].
III Main Results
This section presents two sets of experiments to validate the existence of two DPIs in CNNs, and the novel nonparametric information theoretic estimators put forth in this work. Specifically, Section III-A validates the existence of two DPIs in CNNs, whereas Section III-B illustrates, via the application of the PID framework, some interesting observations associated with different CNN topologies in the training phase. Following this, we present two implications for the design and training of CNNs motivated by these results. We finally point out, in Section III-C, an advanced interpretation of the information plane (IP) that deserves more (theoretical) investigation. Two benchmark datasets, namely MNIST [16] and Fashion-MNIST [17], are selected for evaluation. To avoid interrupting the flow of presentation, the results on Fashion-MNIST are presented in Appendix A.
The baseline CNN architecture to be considered in this work is a LeNet [16] like network with two convolutional layers, two pooling layers, and two fully connected layers (thus constituting hidden layers). Each convolutional layer consists of filters. We train the CNN using the basic SGD with momentum and minibatch size . In both datasets, we select learning rate and
training epochs. Both “sigmoid” and “ReLU” activation functions are tested. For the estimation of MMI, we fix $\alpha$ close to $1$ to approximate Shannon’s definition, and use the radial basis function (RBF) kernel $\kappa(x_i,x_j)=\exp\left(-\frac{\|x_i-x_j\|^2}{2\sigma^2}\right)$ to obtain the Gram matrices. The kernel size $\sigma$ is determined based on Silverman’s rule of thumb [18], $\sigma=h\times n^{-1/(4+d)}$, where $n$ is the number of samples in the minibatch, $d$ is the sample dimensionality and $h$ is an empirical value selected experimentally by taking into consideration the data’s average marginal variance. In this paper, we select separate values of $h$ for the input signal forward propagation chain and for the error backpropagation chain.
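The estimation pipeline described above (kernel size from Silverman’s rule, RBF Gram matrices, then the MMI between the input and a group of feature maps) can be sketched as follows. This is an illustrative sketch on synthetic data, not the experimental code: the helper names, the values `alpha = 1.01` and `h = 5.0`, the array shapes, and the random "feature maps" are all assumptions introduced here.

```python
import numpy as np

def silverman_sigma(n, d, h):
    """Kernel size sigma = h * n^(-1/(4+d)) (Silverman's rule of thumb)."""
    return h * n ** (-1.0 / (4.0 + d))

def rbf_gram(samples, sigma):
    """Trace-normalized RBF Gram matrix; each sample is flattened to a vector."""
    X = np.asarray(samples, dtype=float)
    X = X.reshape(X.shape[0], -1)
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    return K / np.trace(K)

def entropy(A, alpha):
    """Matrix-based Renyi entropy of a trace-one positive semidefinite matrix."""
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    return float(np.log2(np.sum(lam ** alpha)) / (1.0 - alpha))

def joint_entropy(mats, alpha):
    """Joint entropy from the trace-normalized Hadamard product of Gram matrices."""
    prod = np.ones_like(mats[0])
    for A in mats:
        prod = prod * A
    return entropy(prod / np.trace(prod), alpha)

def mmi(B, feature_grams, alpha):
    """S(B) + S(A_1..A_C) - S(A_1..A_C, B): MMI between input and feature maps."""
    feature_grams = list(feature_grams)
    return (entropy(B, alpha)
            + joint_entropy(feature_grams, alpha)
            - joint_entropy(feature_grams + [B], alpha))

# Synthetic stand-ins for one minibatch (all sizes and values are hypothetical).
rng = np.random.default_rng(0)
n, C, alpha, h = 100, 3, 1.01, 5.0
images = rng.normal(size=(n, 8, 8))
weights = rng.normal(size=(C, 64, 16)) / 8.0
feature_maps = [np.tanh(images.reshape(n, -1) @ weights[p]) for p in range(C)]

B = rbf_gram(images, silverman_sigma(n, 64, h))
A = [rbf_gram(T, silverman_sigma(n, 16, h)) for T in feature_maps]
mmi_value = mmi(B, A, alpha)
```

Only $C+1$ Gram matrices of size $n\times n$ are needed per minibatch, so the cost is dominated by eigendecompositions of $n\times n$ matrices rather than by any density estimate in the high-dimensional sample space.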
III-A Two DPIs and their validation
We expect the existence of two DPIs in any feedforward CNN with $K$ hidden layers, i.e., $I(X;T_1)\geq I(X;T_2)\geq\dots\geq I(X;T_K)$ and $I(\delta_K;\delta_{K-1})\geq I(\delta_K;\delta_{K-2})\geq\dots\geq I(\delta_K;\delta_1)$, where $T_1$, $T_2$, $\dots$, $T_K$ are successive hidden layer representations from the first hidden layer to the output layer and $\delta_K$, $\delta_{K-1}$, $\dots$, $\delta_1$ are errors from the output layer to the first hidden layer. This is because both $X\rightarrow T_1\rightarrow\dots\rightarrow T_K$ and $\delta_K\rightarrow\delta_{K-1}\rightarrow\dots\rightarrow\delta_1$ form a Markov chain [4, 6].
Fig. 1 shows the DPIs at the initial training stage, at an intermediate training stage and at the final training stage, respectively. As can be seen, the DPIs hold in most cases. Note that there are a few disruptions in the error backpropagation chain. One possible reason is that when training converges, the error becomes so small that Silverman’s rule of thumb is no longer a reliable choice for selecting the scale parameter in our estimator.
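Checking a DPI on estimated values amounts to testing whether a sequence of mutual information estimates is non-increasing along the chain. The helper below is a hypothetical sketch of such a check; the function names, the tolerance parameter `tol` and the example numbers are assumptions for illustration and are not measurements from this paper.

```python
def dpi_holds(mi_values, tol=0.0):
    """True if the estimates are non-increasing along the chain, up to tol."""
    return all(a + tol >= b for a, b in zip(mi_values, mi_values[1:]))

def dpi_violations(mi_values, tol=0.0):
    """1-based positions k where the estimate at step k+1 exceeds the one at step k."""
    steps = enumerate(zip(mi_values, mi_values[1:]), start=1)
    return [k for k, (a, b) in steps if b > a + tol]

# Hypothetical estimates of I(X;T_1), ..., I(X;T_4), in bits.
forward_chain = [6.4, 6.1, 5.2, 3.3]
# Hypothetical estimates with one disruption between the 2nd and 3rd entries.
backward_chain = [2.0, 1.4, 1.5, 0.9]

forward_ok = dpi_holds(forward_chain)
backward_ok = dpi_holds(backward_chain)
backward_bad_steps = dpi_violations(backward_chain)
```

A small positive `tol` can be used to ignore disruptions that are within the numerical noise of the estimator.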
III-B Redundancy and Synergy in Layer Representations
In this section, we explore some hidden properties, with the help of the PID framework, associated with different information theoretic quantities of convolutional layer representations in the training phase of CNNs. Particularly, we are interested in determining the redundancy and synergy amongst different feature maps and how their tradeoffs evolve with training in different CNN topologies. Moreover, we are also interested in identifying some upper or lower limits (if they exist) for these quantities.
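Given the input signal $X$ and two feature maps $T^1$ and $T^2$, the PID framework indicates that the MMI $I(X;\{T^1,T^2\})$ can be decomposed into four nonnegative components: the synergy $\mathrm{Syn}(X;\{T^1,T^2\})$ that measures the information about $X$ provided by the coalition or combination of $T^1$ and $T^2$ (i.e., the information that cannot be captured by either $T^1$ or $T^2$ alone); the redundancy $\mathrm{Rdn}(X;\{T^1,T^2\})$ that measures the shared information about $X$ that can be provided by either $T^1$ or $T^2$; and the unique information $\mathrm{Unq}(X;T^1)$ (or $\mathrm{Unq}(X;T^2)$) that measures the information about $X$ that can only be provided by $T^1$ (or $T^2$). Moreover, the unique information, the synergy and the redundancy satisfy (see Fig. 2 for better understanding):
$$I(X;\{T^1,T^2\}) = \mathrm{Syn}(X;\{T^1,T^2\}) + \mathrm{Rdn}(X;\{T^1,T^2\}) + \mathrm{Unq}(X;T^1) + \mathrm{Unq}(X;T^2) \qquad (6)$$
$$I(X;T^1) = \mathrm{Rdn}(X;\{T^1,T^2\}) + \mathrm{Unq}(X;T^1) \qquad (7)$$
$$I(X;T^2) = \mathrm{Rdn}(X;\{T^1,T^2\}) + \mathrm{Unq}(X;T^2) \qquad (8)$$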
The intuitive framework for $I(X;\{T^1,T^2\})$ can be straightforwardly extended to more than three variables, thus decomposing $I(X;\{T^1,T^2,\dots,T^C\})$ into many more components, whose number grows quickly with $C$. Although the PID framework, coupled with its lattice decomposition, offers us an intuitive way to understand the interactions between the input and different feature maps, the reliable estimation of each PID term still remains a big challenge. In fact, there is no universal agreement on the definition of synergy and redundancy even for interactions among one-dimensional variables, let alone on the estimation of each synergy or redundancy item among numerous variables in a high-dimensional space [19, 20]. To this end, we develop three quantities that avoid the direct computation of synergy and redundancy, in order to characterize intrinsic properties of CNN layer representations. They are:
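1) $I(X;\{T^1,T^2,\dots,T^C\})$, which is exactly the MMI. This quantity measures the amount of information about $X$ that is captured by all feature maps (in one convolutional layer).
2) $I(X;T^i)+I(X;T^j)-I(X;\{T^i,T^j\})$, averaged over pairs of feature maps $(T^i,T^j)$, which is referred to as the redundancy-synergy tradeoff. This quantity measures the (average) redundancy-synergy tradeoff in different feature maps. This is because, by Eqs. (6)-(8),
$$I(X;T^1)+I(X;T^2)-I(X;\{T^1,T^2\}) = \mathrm{Rdn}(X;\{T^1,T^2\}) - \mathrm{Syn}(X;\{T^1,T^2\}) \qquad (9)$$
Obviously, a positive value of this tradeoff implies redundancy, whereas a negative value signifies synergy [21]. Here, instead of measuring all PID terms, whose number increases polynomially with $C$, we sample pairs of feature maps, calculate the information quantities for each pair, and finally compute averages over all pairs to determine if synergy dominates in the training phase. Note that the pairwise sampling procedure has been widely used in neuroscience [22].
3) $2I(X;\{T^i,T^j\})-I(X;T^i)-I(X;T^j)$, averaged over pairs of feature maps $(T^i,T^j)$, which is referred to as the weighted non-redundant information. This quantity measures the (average) amount of non-redundant information about $X$ that is captured by pairs of feature maps. Again, by Eqs. (6)-(8),
$$2I(X;\{T^1,T^2\})-I(X;T^1)-I(X;T^2) = \mathrm{Unq}(X;T^1)+\mathrm{Unq}(X;T^2)+2\,\mathrm{Syn}(X;\{T^1,T^2\}) \qquad (10)$$
We call this quantity “weighted” because we overemphasized the role of synergy. Note that the actual amount of non-redundant information is $\mathrm{Unq}(X;T^1)+\mathrm{Unq}(X;T^2)+\mathrm{Syn}(X;\{T^1,T^2\})$, rather than $\mathrm{Unq}(X;T^1)+\mathrm{Unq}(X;T^2)+2\,\mathrm{Syn}(X;\{T^1,T^2\})$.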
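The sign convention of the tradeoff can be checked on two textbook toy systems. The sketch below uses exact Shannon mutual information on small discrete variables (not the matrix-based estimator of this paper), and the function names and the two toy cases are assumptions chosen for illustration: in the XOR case the target is only recoverable from both sources together (pure synergy), and in the copy case both sources duplicate the target (pure redundancy).

```python
import itertools
import numpy as np

def mutual_information(pairs):
    """Shannon MI (bits) between the coordinates of equiprobable (a, b) outcomes."""
    n = len(pairs)
    joint, pa, pb = {}, {}, {}
    for a, b in pairs:
        joint[(a, b)] = joint.get((a, b), 0.0) + 1.0 / n
        pa[a] = pa.get(a, 0.0) + 1.0 / n
        pb[b] = pb.get(b, 0.0) + 1.0 / n
    return float(sum(p * np.log2(p / (pa[a] * pb[b])) for (a, b), p in joint.items()))

def pairwise_quantities(samples):
    """samples: list of (x, t1, t2). Returns (MMI, tradeoff, weighted non-redundant)."""
    i1 = mutual_information([(x, t1) for x, t1, t2 in samples])
    i2 = mutual_information([(x, t2) for x, t1, t2 in samples])
    i12 = mutual_information([(x, (t1, t2)) for x, t1, t2 in samples])
    tradeoff = i1 + i2 - i12          # equals Rdn - Syn
    weighted = 2.0 * i12 - i1 - i2    # equals Unq1 + Unq2 + 2 * Syn
    return i12, tradeoff, weighted

bits = list(itertools.product([0, 1], repeat=2))
xor_case = [(t1 ^ t2, t1, t2) for t1, t2 in bits]   # X = T1 XOR T2
copy_case = [(x, x, x) for x in (0, 1)]             # T1 = T2 = X

xor_mmi, xor_tradeoff, xor_weighted = pairwise_quantities(xor_case)
copy_mmi, copy_tradeoff, copy_weighted = pairwise_quantities(copy_case)
```

Both systems carry one bit of MMI, but the tradeoff is negative for XOR and positive for the copy, while the weighted non-redundant information is two bits for XOR (one bit of synergy counted twice) and zero for the copy.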
We compute these three quantities in the training phase, and compare their values across different CNN topologies. Fig. 3(a)-3(d) show the MMI values in the two convolutional layers. By the DPI, the maximum amount of information that each layer representation can capture is exactly the entropy of the input. As can be seen, as the number of filters increases, the total amount of information that each convolutional layer captures also increases correspondingly. However, it is interesting to find that the MMI values approach their theoretical maximum (i.e., the ensemble average entropy of the minibatch input) with only a limited number of filters in the first and second convolutional layers. More filters (in a reasonable range) can improve the classification performance. However, if we blindly increase the number of filters, the classification accuracy stops increasing or even becomes worse.
We argue that this phenomenon can be explained by the percentage of the MMI that the redundancy-synergy tradeoff or the weighted non-redundant information accounts for in each pair of feature maps (i.e., each quantity divided by the pairwise MMI $I(X;\{T^i,T^j\})$). In fact, Fig. 3(e)-3(h) show that more filters can push the network towards an improved redundancy-synergy tradeoff, i.e., the synergy gradually dominates in each pair of feature maps as the number of filters increases. That is perhaps one of the main reasons why an increased number of filters can lead to better classification performance, even though the total multivariate mutual information stays the same. However, if we look deeper, it seems that the redundancy is always larger than the synergy, such that their tradeoff never crosses the x-axis. This may suggest a (virtual) lower bound on the redundancy-synergy tradeoff. On the other hand, one should note that the amount of non-redundant information is always less than (or upper bounded by) the MMI regardless of the number of filters; therefore, it is impossible to improve the classification performance by blindly increasing the number of filters.
Having illustrated the DPIs and the redundancy-synergy tradeoffs, we can summarize some implications concerning the design and training of CNNs. First, as a possible application of the DPI in the error backpropagation chain, the DPI provides an indicator of where to perform the “bypass” in the recently proposed Relay backpropagation [23]. Second, the DPIs and the redundancy-synergy tradeoff may give some guidelines on the depth and width of CNNs. Indeed, we need multiple layers to denoise the input and to extract representations at different levels of abstraction. However, more layers will lead to severe information loss. The same interpretation goes for the number of filters in convolutional layers: we need a sufficient number of filters to ensure that the layer representations can extract and transfer as much input information as possible and to learn a good redundancy-synergy tradeoff. However, too many filters do not always lead to an increased amount of non-redundant information, as the minimum probability of classification error is upper bounded by the mutual information expressed in different forms (e.g., [24, 25]).
III-C Revisiting the Information Plane (IP)
The behavior of curves in the IP is currently a controversial issue. Recall the discrepancy reported by Saxe et al. [5]: the existence of the compression phase observed by Shwartz-Ziv and Tishby [4] depends on the adopted nonlinearity functions, in that double-sided saturating nonlinearities like “tanh” or “sigmoid” yield a compression phase, but linear activation functions and single-sided saturating nonlinearities like the “ReLU” do not. Interestingly, Noshad et al. [26] employed dependence graphs to estimate mutual information values and observed the compression phase even when using “ReLU” activation functions. On the other hand, Goldfeld et al. [27] argued that compression is due to the clustering of layer representations, but that it is hard to observe compression in large networks. We disagree with this attribution of different behavior to the nonlinear activation functions. Instead, we often forget that estimators rarely share all the properties of the statistically defined quantities [28]. Hence, the variability in the displayed behavior is most likely attributable to the different estimators$^{2}$, although this argument is rarely invoked in the literature. This is the reason we suggest that a first step, before analyzing the information plane curves, is to show that the employed estimators meet the expectation of the DPI (or similar known properties of the statistical quantities). We show above that our Rényi’s entropy estimator passes this test.
$^{2}$ Shwartz-Ziv and Tishby [4] use the basic Shannon’s definition and estimate mutual information by dividing neuron activation values into equal-interval bins, whereas the base estimator used by Saxe et al. [5] provides Kernel Density Estimator (KDE) based lower and upper bounds on the true mutual information [29, 26].
The IPs for different CNN topologies on the MNIST dataset are shown in Fig. 4. From the first row, both $I(X;T)$ and $I(T;Y)$ increase rapidly up to a certain point with the SGD iterations, independently of the adopted activation functions or the number of filters in the convolutional layers. This result conforms to the description in [27], suggesting that the behavior of CNNs in the IP is not the same as that of the MLPs in [4, 5, 26], and that our intrinsic dimensionality hypothesis in [6] is specific to SAEs. However, if we remove the redundancy in $I(X;T)$ and $I(T;Y)$, and only preserve the unique information and the synergy (i.e., substituting $I(X;T)$ and $I(T;Y)$ with their corresponding (average) weighted non-redundant information defined in Section III-B), it is easy to observe the compression phase in the modified IP. Moreover, it seems that “sigmoid” is more likely to incur compression than “ReLU”, where this intensity can be attributed to the nonlinearity. Our results shed light on the discrepancy between [4] and [5], and refine the argument in [26].
IV Conclusions and Future Work
This paper presents a systematic method to analyze the mapping and training of convolutional neural networks (CNNs) from an information theoretic perspective. Using the multivariate extension of the matrix-based Rényi’s entropy functional, we validated two data processing inequalities in CNNs. The introduction of the partial information decomposition (PID) framework enables us to pin down the redundancy-synergy tradeoff in layer representations. We also analyzed the behavior of curves in the information plane, aiming to clarify the debate on the existence of compression in DNNs. Future work is twofold:
1) All the information quantities mentioned in this paper are estimated based on a vector rastering of samples, i.e., each layer input (e.g., an input image, a feature map) is first converted to a single vector before entropy or mutual information estimation. Despite its simplicity, this distorts the spatial relationships amongst neighboring pixels. Therefore, a question remains on reliable information theoretic estimation that is feasible for a tensor structure.
2) We look forward to evaluating our estimators on more complex CNN architectures, such as VGGNet [30] and ResNet [31]. According to our observations, it is easy to validate the DPI and the rapid increase of mutual information (in top layers) in VGG on the CIFAR dataset [32] (see Fig. 5). However, it seems that the MMI values in bottom layers are likely to be “saturated”. The problem arises when we try to take the Hadamard product of the kernel matrices of each feature map in Eq. (5). The elements in these (normalized) kernel matrices have values between $0$ and $1$, and taking the entrywise product of many such matrices, as in the convolutional layers of VGG, will tend towards a diagonal matrix, with nearly zero entries everywhere off the diagonal. The eigenvalues of the resulting matrix will quickly have almost the same value across training epochs. We aim to solve this limitation in future work.
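The saturation effect described in the second item can be reproduced on synthetic data. The sketch below is an illustration under assumptions, not an experiment from this paper: it draws independent random "feature maps", builds one RBF Gram matrix per map, and tracks the joint entropy of the trace-normalized Hadamard product as more maps are included. The sample count, dimensionality, kernel size `sigma`, order `alpha` and the number of maps are all hypothetical values.

```python
import numpy as np

def entropy(A, alpha=1.01):
    """Matrix-based Renyi entropy of a trace-one positive semidefinite matrix."""
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    return float(np.log2(np.sum(lam ** alpha)) / (1.0 - alpha))

def random_rbf_gram(rng, n, d, sigma):
    """RBF Gram matrix of n random d-dimensional samples (unit diagonal)."""
    X = rng.normal(size=(n, d))
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
n, d, sigma = 50, 8, 3.0                      # hypothetical sizes
grams = [random_rbf_gram(rng, n, d, sigma) for _ in range(64)]

joint_entropies = {}
for c in (1, 4, 16, 64):
    prod = np.ones((n, n))
    for K in grams[:c]:
        prod = prod * K                       # Hadamard product of the first c matrices
    joint_entropies[c] = entropy(prod / np.trace(prod))

# After the loop, prod is the product of all 64 matrices.
normalized = prod / np.trace(prod)
max_deviation = float(np.max(np.abs(normalized - np.eye(n) / n)))
upper_limit = float(np.log2(n))
```

Because every off-diagonal entry is a product of factors strictly below one, the normalized product approaches the scaled identity and the joint entropy approaches $\log_2 n$, after which it can no longer distinguish between training epochs.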
References

 [1] N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in IEEE ITW, 2015, pp. 1–5.
 [2] A. Achille and S. Soatto, “Emergence of invariance and disentanglement in deep representations,” JMLR, vol. 19, no. 1, pp. 1947–1980, 2018.
 [3] T. Tax, P. A. Mediano, and M. Shanahan, “The partial information decomposition of generative neural network models,” Entropy, vol. 19, no. 9, p. 474, 2017.
 [4] R. Shwartz-Ziv and N. Tishby, “Opening the black box of deep neural networks via information,” arXiv preprint arXiv:1703.00810, 2017.
 [5] A. M. Saxe et al., “On the information bottleneck theory of deep learning,” in ICLR, 2018.
 [6] S. Yu and J. C. Principe, “Understanding autoencoders with information theoretic concepts,” arXiv preprint arXiv:1804.00057, 2018.
 [7] L. G. Sanchez Giraldo, M. Rao, and J. C. Principe, “Measures of entropy from data using infinitely divisible kernels,” IEEE Transactions on Information Theory, vol. 61, no. 1, pp. 535–548, 2015.
 [8] F. Camastra and A. Staiano, “Intrinsic dimension estimation: Advances and open problems,” Information Sciences, vol. 328, pp. 26–41, 2016.

 [9] G. Brown, A. Pocock, M.-J. Zhao, and M. Luján, “Conditional likelihood maximisation: a unifying framework for information theoretic feature selection,” JMLR, vol. 13, no. Jan, pp. 27–66, 2012.
 [10] S. Yu, L. G. Sanchez Giraldo, R. Jenssen, and J. C. Principe, “Multivariate extension of matrix-based Renyi’s $\alpha$-order entropy functional,” arXiv preprint arXiv:1808.07912, 2018.
 [11] P. L. Williams and R. D. Beer, “Nonnegative decomposition of multivariate information,” arXiv preprint arXiv:1004.2515, 2010.
 [12] A. Rényi, “On measures of entropy and information,” in Proc. of the 4th Berkeley Sympos. on Math. Statist. and Prob., vol. 1, 1961, pp. 547–561.
 [13] J. C. Principe, Information theoretic learning: Renyi’s entropy and kernel perspectives. Springer Science & Business Media, 2010.
 [14] M. Müller-Lennert, F. Dupuis, O. Szehr, S. Fehr, and M. Tomamichel, “On quantum Rényi entropies: A new generalization and some properties,” J. Math. Phys., vol. 54, no. 12, p. 122203, 2013.
 [15] R. Bhatia, “Infinitely divisible matrices,” The American Mathematical Monthly, vol. 113, no. 3, pp. 221–235, 2006.
 [16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [17] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
 [18] B. W. Silverman, Density estimation for statistics and data analysis. CRC press, 1986, vol. 26.
 [19] N. Bertschinger, J. Rauh, E. Olbrich, J. Jost, and N. Ay, “Quantifying unique information,” Entropy, vol. 16, no. 4, pp. 2161–2183, 2014.
 [20] V. Griffith and C. Koch, “Quantifying synergistic mutual information,” in Guided Self-Organization: Inception. Springer, 2014, pp. 159–190.

 [21] A. J. Bell, “The co-information lattice,” in Proceedings of the Fifth International Workshop on Independent Component Analysis and Blind Signal Separation: ICA, vol. 2003, 2003.
 [22] N. Timme, W. Alford, B. Flecker, and J. M. Beggs, “Synergy, redundancy, and multivariate information measures: an experimentalist’s perspective,” J. Comput. Neurosci., vol. 36, no. 2, pp. 119–140, 2014.
 [23] L. Shen and Q. Huang, “Relay backpropagation for effective learning of deep convolutional neural networks,” in ECCV, 2016, pp. 467–482.
 [24] M. Hellman and J. Raviv, “Probability of error, equivocation, and the Chernoff bound,” IEEE Transactions on Information Theory, vol. 16, no. 4, pp. 368–372, 1970.
 [25] I. Sason and S. Verdú, “Arimoto–Rényi conditional entropy and Bayesian $M$-ary hypothesis testing,” IEEE Transactions on Information Theory, vol. 64, no. 1, pp. 4–25, 2018.
 [26] M. Noshad and A. O. Hero III, “Scalable mutual information estimation using dependence graphs,” arXiv preprint arXiv:1801.09125, 2018.
 [27] Z. Goldfeld et al., “Estimating information flow in neural networks,” arXiv preprint arXiv:1810.05728, 2018.
 [28] L. Paninski, “Estimation of entropy and mutual information,” Neural computation, vol. 15, no. 6, pp. 1191–1253, 2003.
 [29] A. Kolchinsky and B. Tracey, “Estimating mixture entropy with pairwise distances,” Entropy, vol. 19, no. 7, p. 361, 2017.
 [30] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
 [31] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
 [32] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.