Understanding Convolutional Neural Network Training with Information Theory

04/18/2018 · Shujian Yu et al. · University of Tromsø – The Arctic University of Norway · University of Florida

Using information theoretic concepts to understand and explore the inner organization of deep neural networks (DNNs) remains a big challenge. Recently, the concept of an information plane began to shed light on the analysis of multilayer perceptrons (MLPs). We provided an in-depth insight into stacked autoencoders (SAEs) using a novel matrix-based Rényi's α-entropy functional, enabling for the first time the analysis of the dynamics of learning using information flow in real-world scenarios involving complex network architectures and large data sets. Despite the great potential of these past works, there are several open questions when it comes to applying information theoretic concepts to understand convolutional neural networks (CNNs). These include, for instance, the accurate estimation of information quantities among multiple variables, and the many different training methodologies. By extending the novel matrix-based Rényi's α-entropy functional to a multivariate scenario, this paper presents a systematic method to analyze CNN training using information theory. Our results validate two fundamental data processing inequalities in CNNs, and also have direct impacts on previous work concerning the training and design of CNNs.


I Introduction

There has been a growing interest in understanding deep neural network (DNN) mapping and training using information theory [1, 2, 3]. According to Shwartz-Ziv and Tishby [4], a DNN should be analyzed by measuring the information quantities that each layer's representation $T$ preserves about the input signal $X$ with respect to the desired signal $Y$ (i.e., $I(X;T)$ with respect to $I(T;Y)$, where $I(\cdot;\cdot)$ denotes mutual information), which has been called the Information Plane (IP). Moreover, they also empirically show that the common stochastic gradient descent (SGD) optimization undergoes two separate phases in the IP: an early "fitting" phase, in which both $I(X;T)$ and $I(T;Y)$ increase rapidly with the iterations, and a later "compression" phase, in which there is a reversal such that $I(X;T)$ continually decreases. However, the observations so far have been constrained to a simple multilayer perceptron (MLP) on toy data, and were later questioned by some counter-examples in [5].

In our most recent work [6], we use a novel matrix-based Rényi's α-entropy [7] to analyze the information flow in stacked autoencoders (SAEs). We observed that the existence of the "compression" phase associated with $I(X;T)$ and $I(T;Y)$ in the IP is predicated on the size $K$ of the bottleneck layer of the SAE relative to the intrinsic dimensionality $d$ [8] of the training data: if $K$ is larger than $d$, the mutual information values start to increase up to a point and then go back, approaching the bisector of the IP; if $K$ is smaller than $d$, the mutual information values increase consistently up to a point and never go back.

Despite the great potential of these earlier works [4, 5, 6], there are several open questions when it comes to applying information theoretic concepts to convolutional neural networks (CNNs). These include, but are not limited to:

1) The accurate and tractable estimation of information quantities in CNNs. Specifically, in a convolutional layer the input signal $X$ is represented by multiple feature maps, as opposed to a single vector in the fully connected layers. Therefore, the quantity we really need to measure is the multivariate mutual information (MMI) between a single variable (e.g., the input $X$) and a group of variables (e.g., the different feature maps)¹. Unfortunately, the reliable estimation of MMI is widely acknowledged as an intractable or infeasible task in the machine learning and information theory communities [9], especially when each variable is in a high-dimensional space.

¹ By variable, we mean a random element, which can be a vector-valued random variable, for instance.

2) A systematic framework to analyze CNN layer representations. By interpreting a feedforward DNN as a Markov chain, the existence of a data processing inequality (DPI) is a general consensus [4, 6]. However, it is necessary to identify more intrinsic properties of CNN layer representations using a principled approach or framework, beyond the DPI.

In this paper, we answer these questions and make the following contributions:

1) By suggesting the multivariate extension of the matrix-based Rényi's α-entropy functional [10], we show that the information flow, especially the MMI, in CNNs can be easily measured without approximation or accurate probability density function (PDF) estimation.

2) By introducing the partial information decomposition (PID) framework [11], we develop three quantities that enable identifying the synergy and redundancy tradeoff amongst different feature maps in convolutional layers. Our results have a direct impact on the design of CNNs.

II Information Quantity Estimation in CNNs

In this section we give a brief introduction to the recently proposed matrix-based Rényi's α-entropy functional estimator [7] and its multivariate extension [10]. Benefiting from this novel definition, we present a simple method to measure the MMI between any pairwise layer representations in CNNs. The theoretical foundations for our estimators are proved in [7, 10].

II-A Matrix-based Rényi's α-entropy functional and its multivariate extension

In information theory, a natural extension of the well-known Shannon entropy is Rényi's α-order entropy [12]. For a random variable $X$ with probability density function (PDF) $f(x)$ over a finite set $\mathcal{X}$, the α-entropy $H_\alpha(X)$ is defined as:

$$H_\alpha(X)=\frac{1}{1-\alpha}\log\int_{\mathcal{X}}f^{\alpha}(x)\,dx \tag{1}$$
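As a sanity check on Eq. (1) in the discrete case (where the integral becomes a sum over probabilities), the following short sketch verifies that the α-order entropy approaches the Shannon entropy as α approaches 1. The distribution and the choice α = 1.01 (a value close to 1) are only illustrative here.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Discrete Renyi alpha-entropy (in bits): 1/(1-alpha) * log2(sum_i p_i^alpha)."""
    p = np.asarray(p, dtype=float)
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

p = np.array([0.5, 0.25, 0.125, 0.125])
shannon = -np.sum(p * np.log2(p))             # 1.75 bits
print(renyi_entropy(p, alpha=1.01), shannon)   # nearly identical for alpha close to 1
```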

Rényi’s entropy functional evidences a long track record of usefulness in machine learning and its applications [13]. Unfortunately, the accurate PDF estimation impedes its more widespread adoption in data driven science. To solve this problem, [7, 10] suggest similar quantities that resembles quantum Rényi’s entropy [14] in terms of the normalized eigenspectrum of the Hermitian matrix of the projected data in RKHS, thus estimating the entropy, joint entropy among two or more variables directly from data without PDF estimation. For brevity, we directly give the definition.

Definition 1

Let $\kappa:\mathcal{X}\times\mathcal{X}\mapsto\mathbb{R}$ be a real valued positive definite kernel that is also infinitely divisible [15]. Given $\{x_i\}_{i=1}^{n}\subset\mathcal{X}$ and the Gram matrix $K$ obtained from evaluating the kernel on all pairs of exemplars, that is $(K)_{ij}=\kappa(x_i,x_j)$, a matrix-based analogue to Rényi's α-entropy for a normalized positive definite (NPD) matrix $A$ of size $n\times n$, such that $\operatorname{tr}(A)=1$, can be given by the following functional:

$$S_\alpha(A)=\frac{1}{1-\alpha}\log_2\big[\operatorname{tr}(A^{\alpha})\big]=\frac{1}{1-\alpha}\log_2\Big[\sum_{i=1}^{n}\lambda_i(A)^{\alpha}\Big] \tag{2}$$

where $A_{ij}=\frac{1}{n}\frac{K_{ij}}{\sqrt{K_{ii}K_{jj}}}$ and $\lambda_i(A)$ denotes the $i$-th eigenvalue of $A$.
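As a concrete illustration of Eq. (2), the sketch below estimates the matrix-based entropy of a mini-batch with an RBF kernel. This is a minimal NumPy sketch under our own naming (rbf_gram, matrix_entropy are not from the paper's code); the kernel width sigma is a free parameter whose selection is discussed in Section III.

```python
import numpy as np

def rbf_gram(x, sigma):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for row-vector samples x of shape (n, d)."""
    sq = np.sum(x ** 2, axis=1, keepdims=True)
    d2 = np.maximum(sq + sq.T - 2.0 * x @ x.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def matrix_entropy(K, alpha=1.01):
    """Matrix-based Renyi alpha-entropy (Eq. 2) computed from a Gram matrix K."""
    n = K.shape[0]
    d = np.sqrt(np.diag(K))
    A = K / np.outer(d, d) / n            # normalized PD matrix with tr(A) = 1
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)   # guard against tiny negative eigenvalues
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

# toy usage: entropy of a mini-batch of 100 random 784-dimensional "images"
x = np.random.rand(100, 784)
print(matrix_entropy(rbf_gram(x, sigma=8.0)))   # upper bounded by log2(100)
```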

Definition 2

Given a collection of $n$ samples $\{s_i=(x_1^i,x_2^i,\cdots,x_L^i)\}_{i=1}^{n}$, where the superscript $i$ denotes the sample index, each sample contains $L$ ($L\geq 2$) measurements $x_1\in\mathcal{X}_1$, $x_2\in\mathcal{X}_2$, $\cdots$, $x_L\in\mathcal{X}_L$ obtained from the same realization, and the positive definite kernels $\kappa_1:\mathcal{X}_1\times\mathcal{X}_1\mapsto\mathbb{R}$, $\kappa_2:\mathcal{X}_2\times\mathcal{X}_2\mapsto\mathbb{R}$, $\cdots$, $\kappa_L:\mathcal{X}_L\times\mathcal{X}_L\mapsto\mathbb{R}$, a matrix-based analogue to Rényi's α-order joint entropy among the $L$ variables can be defined as:

$$S_\alpha(A_1,A_2,\cdots,A_L)=S_\alpha\!\left(\frac{A_1\circ A_2\circ\cdots\circ A_L}{\operatorname{tr}(A_1\circ A_2\circ\cdots\circ A_L)}\right) \tag{3}$$

where $(A_1)_{ij}=\kappa_1(x_1^i,x_1^j)$, $(A_2)_{ij}=\kappa_2(x_2^i,x_2^j)$, $\cdots$, $(A_L)_{ij}=\kappa_L(x_L^i,x_L^j)$, and $\circ$ denotes the Hadamard product.

II-B Multivariate mutual information estimation in CNNs

Suppose there are $C$ filters in a convolutional layer. Given an input image $X$, the layer represents it by $C$ different feature maps, each characterizing a specific property of the input. This suggests that the amount of information that the convolutional layer gains from the input is preserved in the information sources $F_1, F_2, \cdots, F_C$. Therefore, the amount of information about the input $X$ captured by the $C$ feature maps is:

$$I(X;F_1,F_2,\cdots,F_C)=H(X)+H(F_1,F_2,\cdots,F_C)-H(X,F_1,F_2,\cdots,F_C) \tag{4}$$

where $H$ denotes the entropy for a single variable or the joint entropy for a group of variables.

Given Eq. (2) and Eq. (3), $I(X;F_1,F_2,\cdots,F_C)$ in a mini-batch of size $n$ can be estimated with:

$$I_\alpha(B;A_1,A_2,\cdots,A_C)=S_\alpha(B)+S_\alpha(A_1,A_2,\cdots,A_C)-S_\alpha(B,A_1,A_2,\cdots,A_C) \tag{5}$$

Here, $B$ and $A_1,A_2,\cdots,A_C$ denote Gram matrices evaluated on the input tensor and the $C$ feature map tensors, respectively. Specifically, $x_k^i$ (in Definition 2) refers to the feature map generated from the $i$-th input sample using the $k$-th ($1\leq k\leq C$) filter, and $A_k$ is evaluated exactly on $\{x_k^i\}_{i=1}^{n}$. Obviously, instead of estimating the joint PDF, which is typically unattainable, one just needs to compute Gram matrices using a real valued positive definite kernel that is also infinitely divisible [15].
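To make Eq. (5) concrete, here is a minimal, self-contained NumPy sketch of the MMI estimator between a mini-batch of inputs and the feature maps of one convolutional layer. The function names, the RBF kernel, and the kernel widths are our own choices for illustration, not the authors' released code; each feature map is rastered into a vector, as discussed in Section IV.

```python
import numpy as np

def rbf_gram(x, sigma):
    """Gram matrix from an RBF kernel; x holds one vectorized sample per row (n x d)."""
    sq = np.sum(x ** 2, axis=1, keepdims=True)
    d2 = np.maximum(sq + sq.T - 2.0 * x @ x.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def entropy(A, alpha=1.01):
    """Matrix-based Renyi entropy (Eq. 2) of a trace-normalized PD matrix A."""
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def normalize_gram(K):
    """Turn a Gram matrix into an NPD matrix with unit trace."""
    d = np.sqrt(np.diag(K))
    A = K / np.outer(d, d)
    return A / np.trace(A)

def joint_entropy(mats, alpha=1.01):
    """Joint entropy via Hadamard products of normalized Gram matrices (Eq. 3)."""
    P = mats[0]
    for M in mats[1:]:
        P = P * M                        # Hadamard (element-wise) product
    return entropy(P / np.trace(P), alpha)

def mmi(x, feature_maps, sigma_x, sigma_f, alpha=1.01):
    """I(X; F_1, ..., F_C) for one mini-batch (Eq. 5).
    x: (n, d) vectorized inputs; feature_maps: list of C arrays, each of shape (n, d_k)."""
    B = normalize_gram(rbf_gram(x, sigma_x))
    As = [normalize_gram(rbf_gram(f, sigma_f)) for f in feature_maps]
    return entropy(B, alpha) + joint_entropy(As, alpha) - joint_entropy([B] + As, alpha)

# toy usage: a mini-batch of 100 inputs and C = 6 random "feature maps"
x = np.random.rand(100, 784)
fmaps = [np.random.rand(100, 196) for _ in range(6)]
print(mmi(x, fmaps, sigma_x=8.0, sigma_f=4.0))
```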

III Main Results

This section presents two sets of experiments to validate the existence of two DPIs in CNNs and the novel nonparametric information theoretic estimators put forth in this work. Specifically, Section III-A validates the existence of two DPIs in CNNs, whereas Section III-B illustrates, via the application of the PID framework, some interesting observations associated with different CNN topologies in the training phase. Following this, we present two implications for the design and training of CNNs motivated by these results. We finally point out, in Section III-C, an advanced interpretation of the information plane (IP) that deserves more (theoretical) investigation. Two benchmark datasets, namely MNIST [16] and Fashion-MNIST [17], are selected for evaluation. To avoid disrupting the flow of presentation, the results on Fashion-MNIST are presented in the Appendix.

The baseline CNN architecture considered in this work is a LeNet-5 [16] like network with two convolutional layers, two pooling layers, and two fully connected layers (six layers in total). The number of filters in each convolutional layer is varied across experiments, as specified below. We train the CNN using basic SGD with momentum and a fixed mini-batch size. On both datasets, we use a fixed learning rate and a fixed number of training epochs. Both "sigmoid" and "ReLU" activation functions are tested. For the estimation of MMI, we fix $\alpha=1.01$ to approximate Shannon's definition, and use the radial basis function (RBF) kernel $\kappa(x_i,x_j)=\exp\!\big(-\|x_i-x_j\|^2/(2\sigma^2)\big)$ to obtain the Gram matrices. The kernel size $\sigma$ is determined based on Silverman's rule of thumb [18], $\sigma=\gamma\,n^{-1/(4+d)}$, where $n$ is the number of samples in the mini-batch, $d$ is the sample dimensionality, and $\gamma$ is an empirical value selected experimentally by taking into consideration the data's average marginal variance. In this paper, different values of $\gamma$ are selected for the input signal forward propagation chain and for the error backpropagation chain.
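The following short sketch shows one way to implement this kernel-size rule; the mini-batch size, dimensionalities, and γ values below are placeholders, since the empirical values used in the paper depend on the data's average marginal variance.

```python
import numpy as np

def silverman_kernel_size(n, d, gamma):
    """Silverman-style rule of thumb for the RBF kernel width: sigma = gamma * n**(-1/(4+d))."""
    return gamma * n ** (-1.0 / (4.0 + d))

# example: a mini-batch of n = 100 vectorized 28x28 inputs (d = 784) and 10-dimensional output errors;
# the gamma values are placeholders, not the paper's settings
sigma_forward = silverman_kernel_size(n=100, d=784, gamma=8.0)
sigma_backward = silverman_kernel_size(n=100, d=10, gamma=0.1)
print(sigma_forward, sigma_backward)
```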

III-A Two DPIs and their validation

We expect the existence of two DPIs in any feedforward CNN with $L$ hidden layers, i.e., $I(X;T_1)\geq I(X;T_2)\geq\cdots\geq I(X;T_L)$ and $I(\delta_L;\delta_{L-1})\geq I(\delta_L;\delta_{L-2})\geq\cdots\geq I(\delta_L;\delta_1)$, where $T_1,T_2,\cdots,T_L$ are the successive hidden layer representations from the first hidden layer to the output layer and $\delta_L,\delta_{L-1},\cdots,\delta_1$ are the errors from the output layer back to the first hidden layer. This is because both $X\rightarrow T_1\rightarrow T_2\rightarrow\cdots\rightarrow T_L$ and $\delta_L\rightarrow\delta_{L-1}\rightarrow\cdots\rightarrow\delta_1$ form Markov chains [4, 6].
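A minimal way to check the forward DPI with the estimator of Section II is sketched below on a toy chain of random fully connected "layers" (a stand-in for the CNN used in the paper); the architecture, helper names, and kernel-size heuristic are ours, and in the actual experiments the representations are the rastered convolutional feature maps.

```python
import numpy as np

def rbf_gram(x, sigma):
    sq = np.sum(x ** 2, axis=1, keepdims=True)
    return np.exp(-np.maximum(sq + sq.T - 2.0 * x @ x.T, 0.0) / (2.0 * sigma ** 2))

def entropy(A, alpha=1.01):
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def normalize_gram(K):
    d = np.sqrt(np.diag(K))
    A = K / np.outer(d, d)
    return A / np.trace(A)

def pairwise_sigma(x):
    """Crude stand-in for the Silverman-based kernel size: half the mean pairwise distance."""
    sq = np.sum(x ** 2, axis=1, keepdims=True)
    d2 = np.maximum(sq + sq.T - 2.0 * x @ x.T, 0.0)
    return 0.5 * np.sqrt(d2[d2 > 0].mean())

def mi(x, t, alpha=1.01):
    """Pairwise matrix-based mutual information I(X;T) for one mini-batch."""
    B = normalize_gram(rbf_gram(x, pairwise_sigma(x)))
    A = normalize_gram(rbf_gram(t, pairwise_sigma(t)))
    J = B * A
    return entropy(B, alpha) + entropy(A, alpha) - entropy(J / np.trace(J), alpha)

rng = np.random.default_rng(0)
x = rng.random((100, 784))                 # a mini-batch of 100 vectorized inputs
chain, t = [], x
for width in (64, 16, 4):                  # toy chain X -> T1 -> T2 -> T3 (a stand-in for CNN layers)
    t = np.maximum(t @ rng.normal(size=(t.shape[1], width)), 0.0)   # random ReLU "layer"
    chain.append(t)

# the forward DPI predicts I(X;T1) >= I(X;T2) >= I(X;T3)
print([round(mi(x, t), 3) for t in chain])
```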

Fig. 1 shows the DPIs at the initial training stage, after 3 epochs of training, and at the end of training (10 epochs), respectively. As can be seen, the DPIs hold in most of the cases. Note that there are a few disruptions in the error backpropagation chain. One possible reason is that when training converges, the error becomes tiny, such that Silverman's rule of thumb is no longer a reliable choice for selecting the scale parameter σ in our estimator.

Fig. 1: Two DPIs in CNNs. Panels (a)-(c) show the validation results at the initial iteration, after 3 epochs, and after 10 epochs for one CNN configuration; panels (d)-(f) show the same three stages for a CNN with a different number of filters in the two convolutional layers. In each subfigure, the blue curves show the MMI values between the input and the different layer representations, whereas the green curves show the MMI values between the error at the output layer and the errors at the different hidden layers.

III-B Redundancy and Synergy in Layer Representations

In this section, we use the PID framework to explore some hidden properties of the information theoretic quantities associated with convolutional layer representations during the training phase of CNNs. In particular, we are interested in determining the redundancy and synergy amongst different feature maps and how their tradeoff evolves with training in different CNN topologies. Moreover, we are also interested in identifying upper or lower limits (if they exist) for these quantities.

Given the input signal $X$ and two feature maps $F_1$ and $F_2$, the PID framework indicates that the MMI $I(X;F_1,F_2)$ can be decomposed into four non-negative components: the synergy $Syn(X;F_1,F_2)$ that measures the information about $X$ provided by the coalition or combination of $F_1$ and $F_2$ (i.e., the information that cannot be captured by either $F_1$ or $F_2$ alone); the redundancy $Rdn(X;F_1,F_2)$ that measures the shared information about $X$ that can be provided by either $F_1$ or $F_2$; and the unique information $Unq(X;F_1)$ (or $Unq(X;F_2)$) that measures the information about $X$ that can only be provided by $F_1$ (or $F_2$). Moreover, the unique information, the synergy, and the redundancy satisfy (see Fig. 2 for a better understanding):

$$I(X;F_1,F_2)=Syn(X;F_1,F_2)+Rdn(X;F_1,F_2)+Unq(X;F_1)+Unq(X;F_2) \tag{6}$$
$$I(X;F_1)=Rdn(X;F_1,F_2)+Unq(X;F_1) \tag{7}$$
$$I(X;F_2)=Rdn(X;F_1,F_2)+Unq(X;F_2) \tag{8}$$
Fig. 2: Synergy and redundancy amongst different feature maps. (a) shows the interactions between the input signal $X$ and two feature maps $F_1$ and $F_2$; the shaded area indicates the MMI $I(X;F_1,F_2)$. (b) shows the PID of $I(X;F_1,F_2)$.
Fig. 3: The multivariate mutual information (MMI), the redundancy-synergy tradeoff, and the weighted non-redundant information in CNNs trained on the MNIST dataset. (a)-(d) show the MMI in the first and second convolutional layer representations for different topologies: (a) 2 filters in C1 and 2 filters in C2; (b) 6 filters in C1 and 32 filters in C2; (c) 128 filters in C1 and 32 filters in C2; (d) 6 filters in C1 and 128 filters in C2. The dashed black line indicates the upper bound of the MMI, i.e., the average mini-batch input entropy. The average classification accuracy (ACA) on the testing set (over Monte-Carlo simulations) is reported in each subtitle. (e) and (f) show the percentages of the redundancy-synergy tradeoff and the weighted non-redundant information, with respect to the MMI in each pair of feature maps, for CNNs that differ in the number of filters in C1; (g) and (h) show the same percentages for CNNs that differ in the number of filters in C2.

The intuitive framework for $I(X;F_1,F_2)$ can be straightforwardly extended to more than three variables (i.e., more than two feature maps), thus decomposing the MMI into many more components; the number of individual non-negative terms grows very quickly with the number of feature maps. Admittedly, although the PID framework coupled with its lattice decomposition offers us an intuitive way to understand the interactions between the input and the different feature maps, the reliable estimation of each PID term still remains a big challenge. In fact, there is no universal agreement on the definition of synergy and redundancy even for one-dimensional multi-way interactions, let alone the estimation of each synergy or redundancy term among numerous variables in a high-dimensional space [19, 20]. To this end, we develop three quantities that avoid the direct computation of synergy and redundancy, yet characterize intrinsic properties of CNN layer representations. They are:

1) $I(X;F_1,F_2,\cdots,F_C)$, which is exactly the MMI. This quantity measures the amount of information about $X$ that is captured by all the feature maps (in one convolutional layer).

2) $I(X;F_i)+I(X;F_j)-I(X;F_i,F_j)$ for a sampled pair of feature maps $(F_i,F_j)$, which is referred to as the redundancy-synergy tradeoff. This quantity measures the (average) redundancy-synergy tradeoff in different feature maps, because, by Eqs. (6)-(8) applied to the pair $(F_i,F_j)$,

$$I(X;F_i)+I(X;F_j)-I(X;F_i,F_j)=Rdn(X;F_i,F_j)-Syn(X;F_i,F_j) \tag{9}$$

Obviously, a positive value of this tradeoff implies redundancy, whereas a negative value signifies synergy [21]. Here, instead of measuring all the PID terms, whose number grows rapidly with the number of filters $C$, we sample pairs of feature maps, calculate the information quantities for each pair, and finally compute averages over all sampled pairs to determine whether synergy dominates in the training phase. Note that this pairwise sampling procedure has been widely used in neuroscience [22].

3) $2I(X;F_i,F_j)-I(X;F_i)-I(X;F_j)$, which is referred to as the weighted non-redundant information. This quantity measures the (average) amount of non-redundant information about $X$ that is captured by pairs of feature maps. Again, by Eqs. (6)-(8),

$$2I(X;F_i,F_j)-I(X;F_i)-I(X;F_j)=Unq(X;F_i)+Unq(X;F_j)+2\,Syn(X;F_i,F_j) \tag{10}$$

We call this quantity "weighted" because it overemphasizes the role of synergy. Note that the actual amount of non-redundant information is $Unq(X;F_i)+Unq(X;F_j)+Syn(X;F_i,F_j)$, rather than $Unq(X;F_i)+Unq(X;F_j)+2\,Syn(X;F_i,F_j)$.
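The identities behind Eqs. (9) and (10) are pure arithmetic on the PID components, so they can be checked in a few lines. The sketch below uses made-up component values (they are not measured quantities from the paper) and verifies that the two proposed quantities recover Rdn − Syn and Unq + Unq + 2·Syn, respectively.

```python
# hypothetical PID components for one pair of feature maps (illustrative numbers only)
syn, rdn, unq_i, unq_j = 0.30, 0.45, 0.10, 0.20

# Eqs. (6)-(8): reconstruct the pairwise mutual information values
I_xfifj = syn + rdn + unq_i + unq_j       # I(X; F_i, F_j)
I_xfi = rdn + unq_i                       # I(X; F_i)
I_xfj = rdn + unq_j                       # I(X; F_j)

# Eq. (9): redundancy-synergy tradeoff (positive -> redundancy dominates, negative -> synergy)
tradeoff = I_xfi + I_xfj - I_xfifj
assert abs(tradeoff - (rdn - syn)) < 1e-12

# Eq. (10): weighted non-redundant information (synergy counted twice)
wnri = 2 * I_xfifj - I_xfi - I_xfj
assert abs(wnri - (unq_i + unq_j + 2 * syn)) < 1e-12

print(tradeoff, wnri)   # 0.15 and 0.90 for these illustrative numbers
```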

We compute these three quantities in the training phase and compare their values for different CNN topologies. Fig. 3(a)-3(d) demonstrate the MMI values in the two convolutional layers. By the DPI, the maximum amount of information that each layer representation can capture is exactly the entropy of the input. As can be seen, as the number of filters increases, the total amount of information captured by each convolutional layer also increases correspondingly. However, it is interesting to find that the MMI values approach their theoretical maximum (i.e., the ensemble average entropy of the mini-batch input) with only a small number of filters in each of the two convolutional layers. More filters (in a reasonable range) can improve the classification performance; however, if we blindly increase the number of filters, the classification accuracy does not increase any more, or even becomes worse.

We argue that this phenomenon can be explained by the percentage that the redundancy-synergy tradeoff or the weighted non-redundant information accounts for in the MMI of each pair of feature maps, i.e., $\big(I(X;F_i)+I(X;F_j)-I(X;F_i,F_j)\big)/I(X;F_i,F_j)$ or $\big(2I(X;F_i,F_j)-I(X;F_i)-I(X;F_j)\big)/I(X;F_i,F_j)$. In fact, referring to Fig. 3(e)-3(h), it is obvious that more filters push the network towards an improved redundancy-synergy tradeoff, i.e., synergy gradually comes to dominate each pair of feature maps as the number of filters increases. This is perhaps one of the main reasons why an increased number of filters can lead to better classification performance, even though the total multivariate mutual information stays the same. However, if we look deeper, it seems that the redundancy is always larger than the synergy, such that their tradeoff never crosses the x-axis. This may suggest a (virtual) lower bound on the redundancy-synergy tradeoff. On the other hand, one should note that the amount of non-redundant information is always less than (or upper bounded by) the MMI no matter the number of filters; therefore, it is impossible to improve the classification performance by blindly increasing the number of filters.

Having illustrated the DPIs and the redundancy-synergy tradeoffs, it is easy to summarize some implications concerning the design and training of CNNs. First, as a possible application of the DPI in the error backpropagation chain, one has to realize that the DPI provides an indicator of where to perform the "bypass" in the recently proposed Relay Backpropagation [23]. Second, the DPIs and the redundancy-synergy tradeoff may give some guidelines on the depth and width of CNNs. Indeed, we need multiple layers to denoise the input and to extract representations at different levels of abstraction; however, more layers lead to more severe information loss. The same interpretation goes for the number of filters in the convolutional layers: we need a sufficient number of filters to ensure that the layer representations can extract and transfer as much input information as possible and to learn a good redundancy-synergy tradeoff. However, too many filters do not always lead to an increased amount of non-redundant information, as the minimum probability of classification error is upper bounded by the mutual information expressed in different forms (e.g., [24, 25]).

III-C Revisiting the Information Plane (IP)

The behavior of the curves in the IP is currently a controversial issue. Recall the discrepancy reported by Saxe et al. [5]: the existence of the compression phase observed by Shwartz-Ziv and Tishby [4] depends on the adopted nonlinearity: double-sided saturating nonlinearities like "tanh" or "sigmoid" yield a compression phase, but linear activation functions and single-sided saturating nonlinearities like the "ReLU" do not. Interestingly, Noshad et al. [26] employed dependence graphs to estimate mutual information values and observed the compression phase even with "ReLU" activation functions. On the other hand, Goldfeld et al. [27] argued that compression is due to the clustering of layer representations, but that it is hard to observe the compression in large networks. We disagree with this attribution of the different behaviors to the nonlinear activation functions. Instead, we often forget that estimators rarely share all the properties of the statistically defined quantities [28]. Hence, the variability in the displayed behavior is most likely attributable to the different estimators², although this argument is rarely invoked in the literature. This is the reason we suggest that a first step, before analyzing the information plane curves, is to show that the employed estimator meets the expectation of the DPI (or similar known properties of the statistical quantities). We showed above that our Rényi's entropy estimator passes this test.

² Shwartz-Ziv and Tishby [4] use the basic Shannon definition and estimate mutual information by dividing the neuron activation values into equal-interval bins, whereas the estimator used by Saxe et al. [5] provides Kernel Density Estimator (KDE) based lower and upper bounds on the true mutual information [29, 26].
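To illustrate how much the estimator choice matters, here is a minimal sketch of the simplest estimator mentioned in the footnote: a plug-in estimate obtained by discretizing activations into equal-interval bins. The number of bins is a free parameter (we do not reproduce the exact value used in [4]), and different choices visibly change the estimated I(X;T).

```python
import numpy as np

def binned_mutual_information(x_ids, t, num_bins):
    """Plug-in I(X;T) after discretizing the activations t into equal-interval bins.
    x_ids: integer label/id per sample (n,); t: activations (n, d); num_bins is an assumption."""
    lo, hi = t.min(), t.max()
    digitized = np.floor((t - lo) / (hi - lo + 1e-12) * num_bins).astype(int)
    # each distinct row of binned activations is treated as one discrete state of T
    t_states = np.unique(digitized, axis=0, return_inverse=True)[1].ravel()
    x_states = np.unique(x_ids, return_inverse=True)[1].ravel()

    def H(states):
        p = np.bincount(states) / len(states)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    joint = x_states * (t_states.max() + 1) + t_states
    return H(x_states) + H(t_states) - H(joint)

# toy usage: 1000 samples from 10 "classes" and a 2-unit layer; finer binning inflates the estimate
rng = np.random.default_rng(0)
x_ids = rng.integers(0, 10, size=1000)
t = rng.normal(size=(1000, 2)) + x_ids[:, None] * 0.2
print(binned_mutual_information(x_ids, t, num_bins=4),
      binned_mutual_information(x_ids, t, num_bins=30))
```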

The IPs for different CNN topologies on the MNIST dataset are shown in Fig. 4. From the first row, both $I(X;T)$ and $I(T;Y)$ increase rapidly up to a certain point with the SGD iterations, independently of the adopted activation functions or the number of filters in the convolutional layers. This result conforms to the description in [27], suggesting that the behaviour of CNNs in the IP is not the same as that of the MLPs in [4, 5, 26], and that our intrinsic dimensionality hypothesis in [6] is specific to SAEs. However, if we remove the redundancy in $I(X;T)$ and $I(T;Y)$ and only preserve the unique information and the synergy (i.e., substituting $I(X;T)$ and $I(T;Y)$ with their corresponding (average) weighted non-redundant information defined in Section III-B), it is easy to observe the compression phase in the modified IP. Moreover, it seems that "sigmoid" is more likely to incur the compression than "ReLU", where this intensity can be attributed to the nonlinearity. Our results shed light on the discrepancy between [4] and [5], and refine the argument in [26].

Fig. 4: The Information Plane (IP) and the modified Information Plane (M-IP) of different CNN topologies trained on the MNIST dataset. The number of filters in C1, the number of filters in C2, and the adopted activation function are indicated in the subtitle of each plot: (a)-(c) IP with sigmoid, (d)-(f) M-IP with sigmoid, and (g)-(i) M-IP with ReLU, each for three different filter configurations. The curves in the IP increase rapidly up to a point without compression (see (a)-(c)). By contrast, it is easy to observe compression in the M-IP (see (d), (e), (g) and (h)). Moreover, compared with ReLU, sigmoid is more likely to incur compression (e.g., compare (e) with (h), or (f) with (i)).

IV Conclusions and Future Work

This paper presents a systematic method to analyze convolutional neural network (CNN) mapping and training from an information theoretic perspective. Using the multivariate extension of the matrix-based Rényi's α-entropy functional, we validated two data processing inequalities in CNNs. The introduction of the partial information decomposition (PID) framework enables us to pin down the redundancy-synergy tradeoff in layer representations. We also analyzed the behavior of the curves in the information plane, aiming to clarify the debate on the existence of compression in DNNs. Future work is twofold:

1) All the information quantities mentioned in this paper are estimated based on a vector rastering of samples, i.e., each layer input (e.g., an input image or a feature map) is first converted to a single vector before entropy or mutual information estimation. Despite its simplicity, this distorts the spatial relationships amongst neighboring pixels. Therefore, a question remains on reliable information theoretic estimation that is applicable to a tensor structure.

2) We look forward to evaluating our estimators on more complex CNN architectures, such as VGGNet [30] and ResNet [31]. According to our observations, it is easy to validate the DPI and the rapid increase of mutual information (in the top layers) of VGG-16 on the CIFAR-10 dataset [32] (see Fig. 5). However, it seems that the MMI values in the bottom layers are likely to be "saturated". The problem arises when we take the Hadamard product of the kernel matrices of the individual feature maps in Eq. (5). The elements of these (normalized) kernel matrices have values between 0 and 1, and taking the entrywise product of hundreds of such matrices, as in a convolutional layer of VGG-16, tends towards a nearly diagonal matrix whose off-diagonal entries vanish. The eigenvalues of the resulting matrix quickly take on almost the same values across training epochs. We aim to solve this limitation in future work.
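The saturation effect described above is easy to reproduce in isolation. The toy sketch below (our own illustration, with arbitrary data, batch size, and kernel width) multiplies an increasing number of unit-diagonal RBF Gram matrices entrywise and shows that the matrix-based joint entropy climbs toward its maximum, log2(n), at which point the estimated MMI stops being informative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                                    # mini-batch size (arbitrary for this illustration)

def unit_diag_gram(d):
    """Normalized RBF Gram matrix (unit diagonal) of n random d-dimensional 'feature map' vectors."""
    x = rng.random((n, d))
    sigma = np.sqrt(d) / 2.0               # heuristic width so off-diagonals start well away from 0
    sq = np.sum(x ** 2, axis=1, keepdims=True)
    return np.exp(-np.maximum(sq + sq.T - 2.0 * x @ x.T, 0.0) / (2.0 * sigma ** 2))

def entropy(A, alpha=1.01):
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

for num_maps in (2, 16, 64, 256):
    P = np.ones((n, n))
    for _ in range(num_maps):
        P = P * unit_diag_gram(d=64)       # Hadamard product of unit-diagonal Gram matrices
    joint_entropy = entropy(P / np.trace(P))
    # the joint entropy approaches log2(n) as more maps are multiplied together
    print(num_maps, round(joint_entropy, 3), "max =", round(np.log2(n), 3))
```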

Fig. 5: The DPI in VGG-16 on CIFAR-10: $I(X;T_1)\geq I(X;T_2)\geq\cdots\geq I(X;T_{16})$. Layers 1 to 13 are convolutional layers, whereas Layers 14 to 16 are fully connected layers.

References