Understanding Autoencoders with Information Theoretic Concepts

03/30/2018 ∙ by Shujian Yu, et al. ∙ University of Florida

Despite their great success in practical applications, there is still a lack of theoretical and systematic methods to analyze deep neural networks. In this paper, we illustrate an advanced information theoretic methodology to understand the dynamics of learning and the design of autoencoders, a special type of deep learning architectures that resembles a communication channel. By generalizing the information plane to any cost function, and inspecting the roles and dynamics of different layers using layer-wise information quantities, we emphasize the role that mutual information plays in quantifying learning from data. We further propose and also experimentally validate, for mean square error training, two hypotheses regarding the layer-wise flow of information and intrinsic dimensionality of the bottleneck layer, using respectively the data processing inequality and the identification of a bifurcation point in the information plane that is controlled by the given data. Our observations have direct impact on the optimal design of autoencoders, the design of alternative feedforward training methods, and even in the problem of generalization.


1 Introduction

Deep neural networks (DNNs) have drawn significant interest from the machine learning community, especially due to their recent empirical success in various applications such as image recognition krizhevsky2012imagenet , speech recognition graves2013speech , natural language processing mesnil2013investigation , etc. Despite the overwhelming advantages achieved by deep neural networks over classical machine learning models, the theoretical and systematic understanding of deep neural networks still remains limited and unsatisfactory. Consequently, deep models themselves are typically regarded as “black boxes” alain2016understanding .

This is unfortunate terminology that the second author has long disputed principe2000neural . In fact, most neural architectures are homogeneous in terms of processing elements (PEs), e.g., sigmoid nonlinearities. Therefore, whether they are used in the first layer, in a middle layer or in the output layer, they always perform the same function: they create ridge functions in the space spanned by the previous layer outputs, i.e., training only controls the steering of the ridge, while the bias controls the aggregation of the different partitions. Moreover, it is also possible to provide geometric interpretations of the projections, extending the well-known work of Kolmogorov on optimal filtering in linear spaces kolmogorov1939interpolation . What has been missing is a framework that can provide an assessment of solution quality during training besides the quantification of the “external” error.

More recently, there has been a growing interest in understanding deep neural networks using information theory. Information theoretic learning (ITL) principe2010information has already been successfully applied to machine learning as a source of cost functions, but its role can be extended to create a framework that helps optimally design deep learning architectures, as explained in this paper. Recently, Tishby proposed the Information Plane (IP) as an alternative way to understand the role of learning in deep architectures shwartz2017opening . The use of information theoretic ideas is an excellent addition because Information Theory is essentially a theory of bounds mackay2003information . Entropy and mutual information quantify properties of data and the results of functional transformations applied to data at a sufficiently abstract level that can lead to optimal performance, as illustrated by Stratonovich’s three variational problems stratonovich1965value . These recent works demonstrate the potential that various information theory concepts hold to open the “black box” of deep neural networks.

As an application we will concentrate on the design of stacked autoencoders (SAE), a fully unsupervised deep architecture. Autoencoders have a remarkable similarity with a transmission channel yu2017autoencoders and so they are a very good choice to evaluate the appropriateness of using ITL in understanding the architectures and the dynamics of learning in DNNs. We are interested in unveiling the role of the layer-wise mutual information during the autoencoder training phase, and investigating how its dynamics through learning relate to different information theoretic concepts (e.g., different data processing inequalities). We propose to do this for arbitrary topologies using empirical estimators of Renyi’s mutual information, as explained in giraldo2015measures . Moreover, we are also interested in how to use our observations to benefit the design and implementation of deep neural networks, such as optimizing the network topology or training the network in a greedy, layer-wise feedforward manner, as an alternative to standard backpropagation.

The rest of this paper is organized as follows. In section 2, we briefly introduce background and related works, including a review of the geometric projection view of multilayer systems, elements of Renyi’s entropy and their matrix-based functional as well as previous works on understanding deep neural networks. Following this, we suggest three fundamental properties associated with the layer-wise mutual information and also give our reasoning in section 3. We then carry out experiments on three real-world datasets to validate these properties in section 4. An experimental interpretation is also presented. We conclude this paper and present our insights in section 5.

2 Background and Related work

In this section, we start with a review of the geometric interpretation of multilayer system mappings as well as the basic autoencoder, which provide a geometric underpinning for the Information Plane (IP) quantification. After that, we give a brief introduction to Renyi’s entropy and its associated matrix functional defined on the normalized eigenspectrum of the Hermitian matrix of the projected data in reproducing kernel Hilbert spaces (RKHS). Finally, we briefly review related work on the understanding and interpretation of DNNs.

2.1 A geometric perspective to deep neural network projections

The famed optimal linear model $y = \mathbf{w}^T\mathbf{x}$, where $\mathbf{x}$ is a vector in $\mathbb{R}^d$, is a one-layer system that provides the basis for understanding multilayer nonlinear models. One can consider that the $d$-dimensional input creates a projection plane in an $N$-dimensional space, with bases given by $\{\mathbf{x}_1, \ldots, \mathbf{x}_d\}$. Each one of these vectors is formed by the $N$ samples of the input training data, i.e., $\mathbf{x}_k = [x_k(1), \ldots, x_k(N)]^T$. The output of the linear model must exist in this space since it is a linear combination of the inputs with parameters $\mathbf{w}$. Let $\mathbf{z}$ be a set of $N$ measurements that correspond to the function we want to approximate. Most often the vector $\mathbf{z}$ does not belong to the hyperplane defined by the input, so the problem of regression is to find the best projection. Since Legendre and Gauss we know that the optimal solution $\mathbf{w}^* = \mathbf{R}^{-1}\mathbf{P}$ minimizes the error power (the norm of the error vector) when approximating the cloud of measurements in $\mathbf{z}$, where $\mathbf{R}$ is the autocorrelation of the input and $\mathbf{P}$ is the crosscorrelation vector between $\mathbf{x}$ and $\mathbf{z}$. Geometrically this corresponds to finding the orthogonal projection of $\mathbf{z}$ into the space spanned by the input (see Fig. 1).

Figure 1: Illustration of the problem of regression in the joint space. The input space created by the input signal vectors (assumed horizontal) is used to find the best approximation (the orthogonal projection) of the data into this space. $d$ is the dimensionality of $\mathbf{x}$.
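The orthogonal-projection view of regression can be checked numerically. The following is a minimal sketch, assuming sample estimates of the autocorrelation matrix and crosscorrelation vector; the function name `least_squares_projection`, the synthetic data and the sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

def least_squares_projection(X, z):
    """Return the least-squares weights and the projection of z onto span(X).

    X : (N, d) matrix whose columns span the input space.
    z : (N,) vector of measurements to approximate.
    """
    N = X.shape[0]
    R = X.T @ X / N            # sample autocorrelation of the input (d x d)
    P = X.T @ z / N            # sample crosscorrelation between input and target (d,)
    w = np.linalg.solve(R, P)  # optimal weights, i.e. R^{-1} P
    return w, X @ w

# Hypothetical data: 50 samples of a 3-dimensional input and an unrelated target.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
z = rng.normal(size=50)

w_opt, y_hat = least_squares_projection(X, z)
error = z - y_hat  # error vector of the optimal linear model
```

Because `y_hat` is the orthogonal projection of `z`, the error vector is orthogonal to every input basis vector, which is what `X.T @ error` being numerically zero expresses.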

The problem with the linear solution is that the output must exist in the space spanned by the input, and when this is not the case (which is the norm), the optimal solution can yield an error that is still too large to make the solution practical. This is the reason why we commonly use nonlinear mapping functions, which are no longer restricted to provide outputs in the span of the input space, and can therefore provide smaller approximation errors. In (principe2000neural, Chapters 5 and 10) and more recently in principe2015universal we show that the Kolmogorov interpretation can provide insights into the inner workings of any multilayer perceptron (MLP), and we repeat it here for completeness.

Let $f(\mathbf{x})$ be a continuous function in $\mathbb{R}^d$. The goal is to approximate $f$ in $\mathbb{R}^d$ by a function $g$ that is built in the following way:

$$g(\mathbf{x}) = \sigma\left(\sum_{i=1}^{K} w_i\, \sigma\left(\sum_{k=1}^{d} a_{ik} x_k + b_i\right)\right) \tag{1}$$

where $\sigma$ are smooth nonlinear functions and $w_i$, $a_{ik}$ and $b_i$ are real-valued parameters (weights and biases), $k$ is an index over the input dimension and $i$ is the index over the number of functions in the composition, implemented by the network processing elements (PEs). The number of layers in (1) can be expanded and give rise to deep networks that have emerged as the big topic in neural networks. But from a function approximation perspective the single hidden layer is quite adequate as the basic topology to understand deep architectures. For simplicity, we are going to drop the external nonlinearity, yielding:

$$g(\mathbf{x}) = \sum_{i=1}^{K} w_i\, \sigma\left(\sum_{k=1}^{d} a_{ik} x_k + b_i\right) \tag{2}$$

Let us denote $h_i$ as:

$$h_i = \sigma\left(\sum_{k=1}^{d} a_{ik} x_k + b_i\right) \tag{3}$$

and substituting (3) in (2) we obtain:

$$g(\mathbf{x}) = \sum_{i=1}^{K} w_i h_i \tag{4}$$

If the same geometric interpretation of regression is used here, we see that the output of the one hidden layer machine is nothing but a projection on the space created by the outputs ($h_i$) of the hidden layer PEs (the multidimensional internal projection space or MIPS). The only problem is that the MIPS bases are controlled by the input data as well as by the parameters as shown in (3), so they change during learning. Moreover, because of the nonlinear PEs, the space spanned by these bases is no longer limited to the span of the input. The MIPS can be placed anywhere to fulfill the approximation to the target function, depending on the first layer weights $a_{ik}$, which is exactly the reason why the one hidden layer machine is a universal approximator.

Since the optimization problem remains the same, we can now understand better the role of each one of the layers of our learning machine (Fig. 2): the output weights are still finding the orthogonal projection on the MIPS subspace spanned by the hidden-layer PE outputs, and this optimization is convex in the parameters provided the output PE is linear. Moreover, the MIPS is no longer the input space and it is dynamically changing during learning, because the bases are themselves a function of the first layer weights, which change during training. This also shows that in the beginning of training, the MIPS coincides with the input data space (when the weights are started with small random values that put the sigmoid in the linear region), but progressively the mappings become much more dependent upon the goal of the processing dictated by the desired response. In a deep layer network, this perspective of using pairs of layers to understand the internal mechanism of finding representations remains essentially the same (the nonlinearity at the output becomes part of the next pair of layers). Understanding this mechanism also saves precious adaptation time, because it is obvious that the optimal projection on the top layer can only be determined when the previous MIPS stabilize, which calls for different learning rates in each layer. This is textbook material principe2000neural that was never assimilated by practitioners and non-practitioners alike, who keep declaring that MLPs are black boxes, which they are not! The difficult part is to predict the effect of the biases in the inner layers, which have the ability to aggregate subsets of previously created partitions, as required for the overall mapping. However, this geometric picture still tells us little about how information flows inside the network as its parameters are being adapted; it only tells us how the mappings are implemented.

Figure 2: (a) shows a single hidden layer MLP with several PEs in the hidden layer and (b) shows the interpretation of the hidden layer PE outputs as the projection space where the optimal solution of the input-output map is obtained.
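The role of the output layer as a projection onto the MIPS can be illustrated with a small numerical sketch. It assumes tanh PEs, a frozen randomly initialized first layer and a linear output PE; the names `hidden_outputs` and `output_weights`, the toy target and all sizes are hypothetical choices made for illustration only.

```python
import numpy as np

def hidden_outputs(X, A, b):
    """Outputs of the hidden-layer PEs: one column per PE, one row per sample."""
    return np.tanh(X @ A + b)

def output_weights(H, z):
    """Least-squares output weights: orthogonal projection of z onto span(H)."""
    w, *_ = np.linalg.lstsq(H, z, rcond=None)
    return w

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))  # 200 samples of a 1-D input
z = np.sin(3.0 * X[:, 0])                  # target that is nonlinear in the input

A = rng.normal(size=(1, 10))               # first-layer weights (kept fixed here)
b = rng.normal(size=10)                    # first-layer biases
H = hidden_outputs(X, A, b)                # bases of the internal projection space

w = output_weights(H, z)
residual = z - H @ w                       # what the current MIPS cannot represent
```

With the first layer frozen, the problem for `w` is ordinary linear regression in the space spanned by the columns of `H`, so the residual is orthogonal to those columns. Training the first layer would move this space itself, which is the part the geometric picture attributes to `A` and `b`.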

To summarize, the pair-wise interactions among the variables in the original data space, the MIPS and the output space play significant roles in understanding learning (or mapping) systems that include but are not limited to DNNs. From this perspective, it makes sense to infer the system properties by inspecting the mutual information between input and hidden representations and the mutual information between hidden representation and output. A straightforward example comes from the domain of system identification based on mutual information criteria (chen2013system, Chapter 6), in which either the minimum mutual information criterion or the maximum mutual information criterion consistently optimizes such interaction between variables in two data spaces of interest. However, the motivation of this paper is not limited to mutual information cost functions, since arbitrary cost functions are the norm in neural network training. We believe the above analysis provides a geometric perspective for the changes in representations during learning, which can be further quantified by the information flow quantification provided by the IP in shwartz2017opening .

2.2 Autoencoder and its geometric perspective

This section gives a brief introduction to the basic architecture of the autoencoder, also from a geometric perspective. The autoencoder is a special type of MLP that aims to transform inputs into outputs with the least possible amount of distortion. It consists of two modules: a feedforward encoder module that maps the input $\mathbf{x}$ to a code vector or hidden representation $\mathbf{h}$ in the bottleneck layer and a decoder module that tends to reconstruct the input sample from $\mathbf{h}$. In this sense, the autoencoder is a supervised version of a clustering algorithm in a projected space, where the bottleneck layer PEs implement global projections.

Specifically, we are given a (mini) batch of $N$ samples in matrix $\mathbf{X} \in \mathbb{R}^{N \times d}$, where each row is an input vector. The output of the autoencoder (i.e., $\hat{\mathbf{X}}$) is enforced to equal $\mathbf{X}$ with high fidelity by minimizing the squared reconstruction error $\|\mathbf{X} - \hat{\mathbf{X}}\|_F^2$. For simplicity, we assume there is only one linear PE in the bottleneck layer and the weights are symmetric in encoder and decoder (see Fig. 3(a)). Over the batch, the encoder does $\mathbf{h} = \mathbf{X}\mathbf{w}$ and the obtained vector lies in the column space spanned by $\mathbf{X}$ as emphasized in Section 2.1. Because of symmetry, the decoder does $\hat{\mathbf{X}} = \mathbf{h}\mathbf{w}^T = \mathbf{X}\mathbf{w}\mathbf{w}^T$. Then the objective is to minimize $\mathrm{tr}\big((\mathbf{X} - \mathbf{X}\mathbf{w}\mathbf{w}^T)^T(\mathbf{X} - \mathbf{X}\mathbf{w}\mathbf{w}^T)\big)$, where $\mathrm{tr}$ denotes trace. This is exactly Principal Component Analysis (PCA) kokiopoulou2011trace , and the optimal solution is given by $\mathbf{w}^*$, the top eigenvector of the matrix $\mathbf{X}^T\mathbf{X}$ baldi1989neural .

The point is that we have an analytical solution to this problem. Similar to the regression case in Section 2.1, we end up in the span of the input batch, defined by the column space of $\mathbf{X}$. Of course, $K$ PEs will lead to $\mathbf{W}^*$, the matrix of the top $K$ eigenvectors of $\mathbf{X}^T\mathbf{X}$ (see Fig. 3(b)). Actually, if we have multiple hidden layers with nonlinear PEs in both encoder and decoder, this interpretation still holds in the bottleneck (or innermost) layer. The only difference is that the eigenvectors are now embedded in the span of the nonlinear MIPS, rather than the input batch.

Figure 3: (a) shows a standard autoencoder with one PE. A sample $\mathbf{x}$ is compressed to one component $h$ in the bottleneck layer by the encoder. The decoder reconstructs $\hat{\mathbf{x}}$ from $h$. The sample $\hat{\mathbf{x}}$ is usually a noise-reduced representation of $\mathbf{x}$. The network can be extended to extract more than one component by additional PEs in the bottleneck layer as shown in (b). The code lies in the column space spanned by the input batch if the PEs in the bottleneck layer are linear.
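The equivalence between the one-PE linear autoencoder with tied weights and PCA can be probed numerically. The sketch below assumes an uncentered batch and a unit-norm weight vector; `top_eigenvector`, `reconstruction_error` and the synthetic batch are illustrative names and data, not part of the paper.

```python
import numpy as np

def top_eigenvector(X):
    """Eigenvector of X^T X associated with its largest eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)  # eigenvalues in ascending order
    return eigvecs[:, -1]

def reconstruction_error(X, w):
    """Squared reconstruction error of a tied-weight linear autoencoder with one PE."""
    w = w / np.linalg.norm(w)   # compare directions only, so use a unit-norm weight
    h = X @ w                   # encoder: one code value per sample
    X_hat = np.outer(h, w)      # decoder: same weights, transposed
    return np.sum((X - X_hat) ** 2)

rng = np.random.default_rng(0)
# Hypothetical batch with one dominant direction of variation.
X = rng.normal(size=(100, 4)) @ np.diag([3.0, 1.0, 0.5, 0.1])

w_star = top_eigenvector(X)
best = reconstruction_error(X, w_star)
random_errors = [reconstruction_error(X, rng.normal(size=4)) for _ in range(100)]
```

No randomly drawn direction should reconstruct the batch better than `w_star`, since for a unit-norm weight the error equals the trace of `X.T @ X` minus the quadratic form of the weight, and that form is maximized by the top eigenvector.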

2.3 Elements of Renyi’s entropy and their matrix-based functional

In information theory, a natural extension of the well-known Shannon’s entropy is Renyi’s $\alpha$-entropy renyi1961measures . For a random variable $X$ with probability density function (PDF) $f(x)$ in a finite set $\mathcal{X}$, the $\alpha$-entropy $H_\alpha(X)$ is defined as:

$$H_\alpha(X) = \frac{1}{1-\alpha} \log \int_{\mathcal{X}} f^\alpha(x)\, dx \tag{5}$$

For $\alpha = 1$, (5) is defined in the limit $\alpha \to 1$, which reduces to the Shannon (differential) entropy. It also turns out that for any real $\alpha$, the above quantity can be expressed as a function of inner products between PDFs. In particular, the $2$-order (or quadratic) entropy can be expressed as:

$$H_2(X) = -\log \int_{\mathcal{X}} f^2(x)\, dx \tag{6}$$

In order to apply this expression to any PDF, Information Theoretic Learning (ITL) principe2010information uses Parzen-window density estimation with a Gaussian kernel to estimate the $\alpha$-norm of the PDF directly from data. More specifically, the estimator of Renyi’s quadratic entropy is given by:

$$\hat{H}_2(X) = -\log\left(\frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G_{\sigma\sqrt{2}}(x_i - x_j)\right) \tag{7}$$

where $x_i$ represents a realization of $X$ and $G_{\sigma\sqrt{2}}$ is a Gaussian kernel of size $\sigma\sqrt{2}$. Note that this estimator has a free parameter (the kernel size $\sigma$) and that its scalability is constrained by its origin in kernel density estimation silverman1986density .
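A direct implementation of the Parzen-window estimator of the quadratic entropy is short. This is a minimal sketch assuming an isotropic Gaussian kernel and natural logarithms; the function name `renyi_quadratic_entropy`, the kernel size and the test data are assumptions for illustration. Its cost grows quadratically with the number of samples because all pairwise differences are formed.

```python
import numpy as np

def renyi_quadratic_entropy(x, sigma):
    """Parzen-window estimate of Renyi's quadratic entropy.

    x     : array of N samples, shape (N,) or (N, d).
    sigma : Parzen kernel size; pairwise terms use a Gaussian of size sigma*sqrt(2).
    """
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    n, d = x.shape
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    var = 2.0 * sigma ** 2  # variance of the Gaussian with size sigma*sqrt(2)
    kernel = np.exp(-sq_dists / (2.0 * var)) / (2.0 * np.pi * var) ** (d / 2.0)
    information_potential = kernel.mean()  # (1/N^2) * sum over all sample pairs
    return -np.log(information_potential)

rng = np.random.default_rng(0)
tight = rng.normal(scale=0.1, size=500)  # concentrated samples
wide = rng.normal(scale=2.0, size=500)   # spread-out samples

h_tight = renyi_quadratic_entropy(tight, sigma=0.5)
h_wide = renyi_quadratic_entropy(wide, sigma=0.5)
```

The spread-out sample receives the larger entropy estimate; the absolute values depend on the chosen `sigma`, which is the free parameter mentioned above.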

Recently, a novel non-parametric estimator for the matrix based Renyi’s entropy was developed under the ITL framework giraldo2015measures . The new estimator is a smooth matrix functional on the manifold of the normalized positive definite (NPD) matrices over the real numbers, and has been shown to be effective in autoencoders giraldo2013rate , MLPs huang2016flow and dimensionality reduction alvarez2017kernel .

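Let $\kappa: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a real valued positive definite kernel that is also infinitely divisible bhatia2006infinitely . Given $X = \{x_1, x_2, \ldots, x_n\}$, the Gram matrix $K$ obtained from evaluating a positive definite kernel $\kappa$ on all pairs of exemplars, that is $(K)_{ij} = \kappa(x_i, x_j)$, can be employed to define a quantity with properties similar to those of an entropy functional, for which the PDF of $X$ does not need to be estimated.

More specifically, a matrix-based analogue to Renyi’s $\alpha$-entropy for a NPD matrix $A$ of size $n \times n$, such that $\mathrm{tr}(A) = 1$, can be given by the functional:

$$S_\alpha(A) = \frac{1}{1-\alpha} \log_2\left(\mathrm{tr}(A^\alpha)\right) = \frac{1}{1-\alpha} \log_2\left[\sum_{i=1}^{n} \lambda_i(A)^\alpha\right] \tag{8}$$

where $\lambda_i(A)$ denotes the $i$-th eigenvalue of $A$, a normalized version of $K$:

$$A_{ij} = \frac{1}{n} \frac{K_{ij}}{\sqrt{K_{ii} K_{jj}}} \tag{9}$$

Furthermore, based on the product kernel, the joint-entropy can be defined as:

$$S_\alpha(A, B) = S_\alpha\left(\frac{A \circ B}{\mathrm{tr}(A \circ B)}\right) \tag{10}$$

where $A \circ B$ denotes the Hadamard product between the matrices $A$ and $B$. It can be shown that if the Gram matrices $A$ and $B$ are constructed using normalized infinitely divisible kernels (based on (9)), such that $A_{ii} = B_{ii} = 1/n$, (10) is never larger than the sum of the individual entropies $S_\alpha(A)$ and $S_\alpha(B)$. This allows us to define the matrix notion of Renyi’s mutual information:

$$I_\alpha(A; B) = S_\alpha(A) + S_\alpha(B) - S_\alpha(A, B) \tag{11}$$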
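The matrix-based quantities in (8) to (11) translate almost line by line into code. The sketch below assumes a Gaussian kernel (one example of an infinitely divisible kernel), an entropy order different from 1 and a fixed kernel size; the function names, the default `alpha` and the synthetic data are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

def gram_matrix(x, sigma):
    """Normalized Gaussian Gram matrix with unit trace, following (9)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))
    diag = np.sqrt(np.diag(K))
    return K / np.outer(diag, diag) / len(x)

def matrix_entropy(A, alpha=2.0):
    """Matrix-based Renyi alpha-entropy (in bits) from the eigenvalues of A, as in (8)."""
    eigvals = np.linalg.eigvalsh(A)
    eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative round-off values
    return np.log2(np.sum(eigvals ** alpha)) / (1.0 - alpha)

def joint_entropy(A, B, alpha=2.0):
    """Joint entropy from the trace-normalized Hadamard product, as in (10)."""
    AB = A * B
    return matrix_entropy(AB / np.trace(AB), alpha)

def mutual_information(A, B, alpha=2.0):
    """Matrix-based Renyi mutual information, as in (11)."""
    return matrix_entropy(A, alpha) + matrix_entropy(B, alpha) - joint_entropy(A, B, alpha)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
y = x + 0.1 * rng.normal(size=(100, 2))  # a noisy copy of x

A_x = gram_matrix(x, sigma=1.0)
A_y = gram_matrix(y, sigma=1.0)
entropy_x = matrix_entropy(A_x)
mi_xy = mutual_information(A_x, A_y)
```

To use this on a network, `x` and `y` would be replaced by the activations of two layers over the same mini-batch. The eigendecomposition dominates the cost, and the result depends on the kernel size `sigma`.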

As we are going to see, this theory allows us to interpret mappings created by arbitrary cost functions, which are the norm in deep learning. It is also readily applicable to any dataset because both the ITL estimators of Renyi’s entropy and mutual information and their NPD matrix extensions can be directly applied to data. However, there is again a free parameter (the kernel size) that needs to be cross-validated or carefully tuned for the estimation silverman1986density , and the computational complexity is high because of the intrinsic eigendecomposition.

2.4 Previous approaches

Current works on understanding DNNs typically fall into two categories. The first category intends to explain the mechanism of DNNs by building a strong connection with widely acknowledged concepts or theorems from other disciplines. The authors in mehta2014exact showed that there is an exact one-to-one mapping between the variational renormalization group (RG) in theoretical physics and stacked restricted Boltzmann machines (RBM), which suggests that stacked RBMs iteratively integrate out irrelevant features in the bottom layer while retaining the most relevant ones in the upper layer. This argument was later questioned by lin2017does , in which the authors claimed an extraordinary link between DNNs and the nature of the universe. Therefore, the essence of DNNs seems to be buried in the laws of physics. On the other hand, the authors in tishby2015deep proposed to formulate the learning of a DNN as a tradeoff between compression and prediction, i.e., the DNN learning problem can be formulated under the information bottleneck (IB) framework tishby2000information that attempts to extract the minimal sufficient statistics of the input data with respect to the target. The authors in khadivi2016flow investigated the flow of the discrete Shannon entropy across consecutive layers in an MLP and defined a new optimization problem for training an MLP based on the IB principle. Moreover, they demonstrated numerically that an MLP can successfully learn Boolean functions (AND, OR, XOR) while achieving the minimal representation of the data. A similar work is shown in huang2016flow , where the training of an MLP is formulated with the rate-distortion function. Another recent work demonstrated that SAEs have remarkable similarity with communication channels, thus holding the potential to lead to alternative communication system designs yu2017autoencoders .

Following tishby2015deep , a DNN should be analyzed by measuring the information quantities that each layer’s output preserves about the input with respect to the target. The terminology of the IP framework is defined thereafter in shwartz2017opening : the IP is the plane of information quantities that each hidden layer $T$ preserves about the input $X$ with respect to the target $Y$, i.e., $I(X;T)$ with respect to $I(T;Y)$, where $I$ denotes mutual information. That work empirically shows that the common stochastic gradient descent optimization undergoes two separate phases: an early “drift” phase, in which the variance of the weights’ gradients is much smaller than the means of the gradients; and a later “diffusion” phase, in which there is a rapid reversal such that the variance of the weights’ gradients becomes greater than the means of the gradients. In spite of imposing a constraint not widely used in machine learning, i.e., the cost function is not necessarily an information quantity, shwartz2017opening conjectured that each layer’s inputs and outputs follow the IB framework. These results, along with explanations for the importance of network depth and the information bottleneck optimality of the layers, made shwartz2017opening a very promising avenue to improve the understanding of DNNs. However, the results so far have not been extended to real-world scenarios involving large networks and complex datasets.

On the other hand, the approaches in the second category concentrate more on the analysis of deep feature representations from a geometric perspective. The projection space perspective can benefit from, and be easily integrated with, the IP framework. In fact, the mutual information $I(X;T)$ between the input $X$ and a hidden representation $T$ quantifies the mutual information between the cloud of samples in the input space and the corresponding projected cloud of points in the MIPS, which is a very efficient way of quantifying the MIPS rotation through learning. Likewise, the mutual information $I(T;Y)$ between $T$ and the target $Y$ measures the mutual information between the codes in the MIPS and the cloud of points formed by the desired response. But there are more examples such as bengio2013better , in which the authors conjectured that deep layers can help extract the underlying factors of variation that define the structure of the data geometry. This hypothesis was experimentally validated in brahma2016deep by quantitatively defining several manifold measures. Other examples include pascanu2013number and montufar2014number , in which the authors demonstrated that the layer-wise composition of functions in DNNs is able to separate the input data space into exponentially more linear response regions than their shallow counterparts, thus increasing the power of computing complex and structured data. Unlike the earlier work, the authors of achille2017emergence suggested using the information stored in the weights, rather than activations or layer outputs, to understand the network optimization and representations. According to them, networks with low information in the weights realize invariant and disentangled representations. Therefore, invariance and disentanglement emerge naturally when training a network with implicit (e.g., Stochastic Gradient Descent or SGD) or explicit (e.g., IB Lagrangian tishby2000information ) regularization. Although these results are very promising, we will show in the later portion of this paper that there should be a limit on the number of layers, because the deeper the neural network, the more information about the input is lost.

There are some other works concentrating on hidden code visualization, aiming at giving insights into the function of hidden layers. For example, the authors in zeiler2014visualizing use a deconvolutional network (deconvnet) zeiler2011adaptive to visualize features in higher layers of convolutional neural networks (CNNs), whereas the authors in mahendran2015understanding suggested understanding CNN features by inverting them to measure how much information is retained in these features from an image reconstruction perspective. Other related works include yosinski2015understanding , nguyen2016multifaceted , etc., and the trend is to explore the hidden mechanism of different layers using an explanatory graph zhang2017interpreting . However, these methods are typically only applicable to CNNs and fail to unveil the intrinsic properties of DNNs in the training phase.

In our perspective, DNNs are definitely not “black boxes” as illustrated in their geometric interpretation extended with the significance of layer-wise mutual information. However, the usefulness of the IP framework in machine learning requires further analysis to relate how the processing of information through nonlinearities can achieve the task goals, and help properly design hyper-parameters of the mapper and the learning process. Along these lines, we suggest and verify three fundamental properties associated with different layer-wise mutual information, including the data processing inequality and two novel and related IPs effectively extending the IP to any pairwise layers to further understand the learning process. Moreover, it is worth noting that, our idea is motivated from a geometric interpretation, rather than strictly by the IB principle tishby2000information , i.e. it is not necessary to be interested in mutual information cost functions as is the norm in neural network training, so we believe this motivation complements shwartz2017opening .

3 Understanding Autoencoders with Information Theoretic Concepts

3.1 The Data Processing Inequality (DPI) and its extensions to stacked autoencoders (SAEs)

Before systematically interpreting SAEs using information theoretic concepts, it is useful to have a statistical insight into their architecture. The SAE is a simple extension of the autoencoder (and also a special case of the MLP) that attempts to reconstruct its input. Different from the basic autoencoder or MLP, an SAE contains multiple encoding and decoding stages made up of a sequence of nonlinear encoding layers followed by a stack of decoding layers, and the desired signal in the training phase is exactly the input itself.

Therefore, given a basic SAE shown in Fig. 4(a), where $X$ and $X'$ are the input and output variables respectively, $T_1, \ldots, T_K$ denote different hidden layer representations in the encoder and $T_1', \ldots, T_K'$ denote different hidden layer representations in the decoder, it would be interesting to infer some intrinsic properties embedded in SAEs. To this end, we present the first two fundamental properties:


Fundamental Property I: There exists the famed (named here the first type of) Data Processing Inequality (DPI) in both encoder and decoder of SAEs, i.e., $I(X;T_1) \ge I(X;T_2) \ge \cdots \ge I(X;T_K)$ and $I(X';T_1') \ge I(X';T_2') \ge \cdots \ge I(X';T_K')$.


Fundamental Property II: There also exists a second type of DPI associated with the layer-wise mutual information, i.e., $I(X;X') \ge I(T_1;T_1') \ge \cdots \ge I(T_K;T_K')$.


Reasoning:

Recall that in the basic learning mechanism (i.e., backpropagation) of any feedforward DNN (including SAEs, MLPs, etc.), the input signals are propagated from the input layer to the output layer and the errors are back-propagated from the output layer to the input layer. Both propagations are unidirectional and static, hence obeying the Markov assumption and thus forming a Markov chain shwartz2017opening ; luttrell1994bayesian . This is because, in the feedforward propagation phase, the amount of information on the input carried by the current layer only depends on the previous layer, as required by the Markov assumption. The Markov chain length will be given by the number of layers of the DNN. Similarly, in the back-propagation phase, the amount of information on the desired signal (characterized by the error) carried by the current layer is only determined by the previous top layer, which means that there is also a Markov chain in the inverse direction (output to input) enforced by backpropagation training.

Therefore, on the one hand, the successive representations in the encoder should form a simple Markov chain shwartz2017opening ; luttrell1994bayesian ; huang2016flow , i.e., $X \to T_1 \to \cdots \to T_K$. On the other hand, the symmetric counterparts in the decoder should also form a simple Markov chain, i.e., $X' \to T_1' \to \cdots \to T_K'$. (We stop at the $K$-th layer, or bottleneck layer, because it is sufficient for our goal of understanding coding and decoding in SAEs. But the chain should continue to the output layer (for the feedforward chain) or the input layer (for the backpropagation chain) in MLPs or other DNNs for an in-depth understanding.) This is because, if the SAE is well trained by backpropagation, the error backpropagation follows a Markov model from the output to the input of the network, such that the encoder chain and the decoder chain are symmetric or “dual” (liggett2012interacting, Chapter II) jansen2014notion of each other, i.e., the layer-by-layer transition probabilities converge to an equilibrium (norris1998markov, Chapters 1 and 3) and the decoder chain “undoes” the transformations operated by the encoder chain.

The first type of DPI is a natural outcome of our assumption that both the encoder representations and the decoder representations form a Markov chain. The second type of DPI is actually built upon these two chains jointly. In fact, given a strictly convex function $f$, the $f$-divergence can be defined csiszar1972class as a generalized notion of the divergence between two probability distributions $P$ and $Q$:

$$D_f(P \| Q) = \int f\left(\frac{dP}{dQ}\right) dQ \tag{12}$$

When the $f$-divergence is applied to the joint distribution (in the role of $P$) and the product of marginals (in the role of $Q$) of two random variables (such as $X$ and $Y$), it yields a generalized notion of mutual information cover2012elements :

$$I_f(X;Y) = D_f\left(P_{XY} \,\|\, P_X \times P_Y\right) \tag{13}$$

The classical Kullback-Leibler (KL) divergence is a special case of the $f$-divergence when $f(t) = t \log t$. In this sense, the standard mutual information $I(X;Y)$, defined as $D_{KL}(P_{XY} \,\|\, P_X \times P_Y)$, is just a special case of its generalized version $I_f(X;Y)$. The generalized mutual information was shown in csiszar1972class to obey a second type of DPI, thus extending the famed (first type of) DPI in a broader sense, i.e.,

$$I_f(X;Y) \ge I_f(X';Y') \tag{14}$$

where $X'$ and $Y'$ denote indirect observations of $X$ and $Y$, respectively. The equality holds if and only if $X'$ and $Y'$ are sufficient statistics with respect to $\{X, Y\}$ csiszar1972class ; merhav2011data . By referring to the above descriptions, we expect a monotonically non-increasing trend (as the number of layers increases) of the mutual information between the layer outputs and their “symmetric” counterparts, i.e., we cannot gain more mutual information when we process the original observations in a deeper layer. Therefore this seems to impose a limit on the number of layers in practical situations, which has not yet been recognized as a limitation in deep learning empirical validation.
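The special case mentioned above can be spelled out as a short worked derivation. It is a sketch under the assumption that $P$ and $Q$ admit densities $p$ and $q$, with $q > 0$ wherever $p > 0$, and that the generator is $f(t) = t \log t$; the density notation is introduced here only for illustration.

```latex
\begin{align*}
D_f(P \,\|\, Q)
  &= \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx
   = \int q(x)\, \frac{p(x)}{q(x)} \log \frac{p(x)}{q(x)}\, dx \\
  &= \int p(x) \log \frac{p(x)}{q(x)}\, dx
   = D_{KL}(P \,\|\, Q), \\[4pt]
I_f(X;Y)
  &= D_{KL}\!\left(P_{XY} \,\|\, P_X \times P_Y\right)
   = \iint p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}\, dx\, dy
   = I(X;Y).
\end{align*}
```

Other convex generators give other members of the family, and the second type of DPI is stated for the whole family rather than for the KL case alone.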

(a) The architecture of an SAE with ($K-1$) hidden layers in both encoder and decoder.
(b) The graph representation of an SAE with ($K-1$) hidden layers in both encoder and decoder.
Figure 4: (a) shows a stacked autoencoder (SAE) with ($K-1$) hidden layers in both encoder and decoder, and (b) shows its graph representation, where the black solid arrow denotes the direction of input feedforward propagation and the green dashed arrow denotes the direction of information flow in the error back-propagation phase. In both figures, $X$ denotes the observations in the input layer, the bottleneck layer holds the codes, and $X'$ is the output. The input data $X$ is encoded by a probabilistic encoder and recovered by the corresponding decoder, leading to the output data $X'$. Thus the SAE operates simultaneously to perform encoding (i.e., layer-by-layer probability transition in the forward pass) and decoding (i.e., layer-by-layer probability transition in the backward pass). The data processing inequality associated with the layer-wise mutual information states that $I(X;X') \ge I(T_1;T_1') \ge \cdots \ge I(T_K;T_K')$, where $T_i$ and $T_i'$ denote the $i$-th hidden representation in the encoder and its symmetric counterpart in the decoder.

3.2 Two types of Information Planes (IPs)

The IP, initiated in tishby2015deep and matured in shwartz2017opening , creates an observable space for how stochastic gradient descent optimizes the deep neural network: compression by diffusion creates efficient internal representations in each layer. However, we note that this work applies only to the bottleneck training method and has not yet been exploited to guide the design of DNNs or to explain learning, i.e., the mechanism of compression has not been elucidated yet. Additionally, although the IP presents an explicit way to inspect pairs of layer-wise mutual information simultaneously, we demonstrate that inspecting only the mutual information that each layer preserves about the input with respect to the target is insufficient to provide a comprehensive understanding of neural network training.

To this end, we extend the definition of the IP to a broader and more general perspective and suggest two novel IPs: 1) the plane of information quantities that each hidden layer preserves about the input with respect to the output, i.e., with respect to ( for SAEs); 2) the plane of the information quantity that each hidden layer (in the encoder) preserves about the input with respect to the information quantity that its counterpart (or symmetric) hidden layer (in the decoder) preserves about the output, i.e., with respect to ( for SAEs). We term them Information Plane I (IP-I) and Information Plane II (IP-II), respectively.

The IP-I makes a simple modification to the original IP in shwartz2017opening by substituting with . The motivation for this modification is straightforward: the output layer contains significant signals to analyze in any neural network architecture haykin1994neural . Moreover, if we insist on using for analyzing SAEs, the IP curve reduces to a line because the target is just a mirrored input, thus resulting in a poor visualization and the loss of useful information. The IP-II, on the other hand, compares the amount of information that gained from with the amount of information that gained from , which also provides an implicit measure of how well the marginal distributions and match each other. Such a visualization is promising, as it tells us when the symmetric layer-wise SAE pairs match well under the objective of minimizing reconstruction error. We believe it has the potential to guide the development of new training methods in a feedforward manner (a similar vision is shown in bengio2014auto , but no solid examples are presented), as an alternative to the standard back-propagation method, and may also help answer questions about generalization.


Fundamental Property III: We expect the existence of a different behavior in the IP (a bifurcation point) associated with the dimensionality of the SAE bottleneck layer that is controlled by the intrinsic dimensionality of the given data, i.e., the curves in the IP-I or IP-II might demonstrate two distinct patterns depending upon the ability of the bottleneck layer to represent the input data intrinsic dimensionality.


Reasoning: The exploration of bifurcation or critical points is not new in machine learning and time series analysis. An interesting example comes from time series analysis, in which Takens’ Theorem takens1981detecting states that if the dynamical system degrees of freedom are confined to an attractor of dimension in the state space, then the topology of the attractor that characterizes the dynamical system can be discovered from the analysis of the time series data when it is arranged into a delay coordinate map that concatenates previous outputs. In other words, when , it is impossible to recover the attractor without any distortion, so results will suffer. Therefore, it is reasonable to conjecture that the SAE’s bottleneck layer is controlled by the data’s characteristics (e.g., the intrinsic dimensionality). If the IP is a good observable of learning, then we should see a difference in the dynamics of learning for bottleneck layers that are above and below the intrinsic dimensionality of the data. We test this property by altering the topologies of SAEs, specifically the number of units in the bottleneck layer.
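The delay coordinate map mentioned in the reasoning can be sketched in a few lines. This is a generic illustration, not code from the paper: the function name `delay_embed`, the sine-wave series, and the choices `dim=2` and `lag=25` are all assumptions made for the example.

```python
import numpy as np

def delay_embed(series, dim, lag=1):
    """Stack lagged copies of a scalar series into delay-coordinate vectors.

    Row j of the result is [s[j], s[j + lag], ..., s[j + (dim - 1) * lag]].
    """
    series = np.asarray(series, dtype=float)
    n_vectors = len(series) - (dim - 1) * lag
    if n_vectors <= 0:
        raise ValueError("series too short for the requested embedding")
    columns = [series[i * lag : i * lag + n_vectors] for i in range(dim)]
    return np.stack(columns, axis=1)

# Assumed toy signal: a sine wave, whose attractor is a closed curve.
t = np.linspace(0, 8 * np.pi, 400)
embedded = delay_embed(np.sin(t), dim=2, lag=25)
print(embedded.shape)
```

If the embedding dimension is chosen too small for the underlying attractor, distinct states are mapped onto each other, which is the distortion the argument above relies on.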

4 Experiments

This section presents two sets of experiments to corroborate the fundamental properties of section 3 directly from data, using the nonparametric statistical estimators put forth in this work. Specifically, section 4.1 validates the first type of DPI and also demonstrates the two IPs defined in section 3.2 to illustrate the existence of a bifurcation point that is controlled by the given data, whereas section 4.2 validates the second type of DPI raised in section 3.1. Note that we also give a preliminary interpretation of the observations shown in section 4.2 by inspecting the distribution of the hidden codes in the training phase. All the experiments reported in this work were conducted in MATLAB b under a Windows bit operating system.

The real-world datasets selected for evaluation are explained next and Fig. 5 depicts the representative images from each dataset.

(a) MNIST lecun1998gradient , contains a training set of images and a testing set of images of handwritten digits. Each digit has been normalized and centered in a image. The thickness, height, angular alignment, and relative position in a frame are some of the intrinsic hidden properties that govern the generation of the examples for each digit manifold. The entire data set of images can be considered as an embedded manifold plus additive noise brahma2016deep .

(b) Fashion-MNIST xiao2017fashion , is a recently released benchmark to test machine learning algorithms. As an alternative to MNIST, it features the same image size, data format and the structure of training and testing splits. The only difference is that the handwritten digits are replaced with different fashion products, like T-Shirts or Trousers. This will provide diversity for the size of the embedded manifolds.

(c) FERG-DB aneja2016modeling , contains face images from stylized characters with annotated facial expressions. The images for each character are grouped into types of expressions, i.e., anger, disgust, fear, joy, neutral, sadness and surprise. Each image has a resolution of either (full resolution) or (reduced size). We take the inner pixels of each reduced size image and resize it to the size of pixels from which we form a vector with dimensions as the input. According to our initial investigation on FERG-DB using t-SNE maaten2008visualizing , the variance among different subjects is much higher than the variance among different facial expressions. This means that the dimensionality of the embedded manifold of the data set is perhaps too high to be well estimated with the available data. For this reason, we only conduct one subject-dependent facial expression classification experiment using all facial expression images of “Bonnie”. The selected dataset is separated into for training and for testing.

In this paper, we use the basic SAE with no other architecture constraints. The activation functions of all the neurons are sigmoid functions, which have been theoretically proven effective in encouraging sparse representation arpit2016regularized . The only exception is the bottleneck layer, in which a simple linear activation function is employed to obey the Folded Markov Chain (FMC) architecture (note that Fundamental Property I and Fundamental Property II still hold even if we relax the selection of activation functions; without loss of generality, this work only considers sigmoid activation functions in hidden layers and a linear activation function in the bottleneck layer). The networks were trained using SGD under the objective of minimizing reconstruction error power. The topology of SAEs on MNIST and Fashion-MNIST is fixed to be “--------” as suggested in hinton2006reducing , where denotes the number of neurons in the bottleneck layer. Unlike MNIST, the topology of SAEs on FERG-DB is selected as “--------”. Due to page limitations, we only demonstrate the results on MNIST in sections 4.1 and 4.2. The corresponding results on Fashion-MNIST and FERG-DB, and the robustness analysis of the effect of kernel size tuning on the estimation of information quantities, are shown in the supplementary material.
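The architecture described above (a symmetric encoder and decoder with sigmoid units everywhere except a linear bottleneck) can be sketched as a forward pass that returns every layer output, since those outputs are what the information planes are computed from. The layer widths `[20, 32, 16, 8, 3]`, the weight scale, and the names `init_sae` and `forward` are hypothetical choices for illustration; they are not the topology used in the paper, and training is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_sae(layer_sizes, rng):
    """layer_sizes lists encoder widths from input to bottleneck; the decoder mirrors it."""
    sizes = list(layer_sizes) + list(layer_sizes[-2::-1])
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x, bottleneck_index):
    """Return the output of every layer; only the bottleneck layer is linear."""
    activations, h = [], x
    for i, (w, b) in enumerate(params):
        z = h @ w + b
        h = z if i == bottleneck_index else sigmoid(z)
        activations.append(h)
    return activations

rng = np.random.default_rng(0)
layer_sizes = [20, 32, 16, 8, 3]          # hypothetical widths, bottleneck of size 3
params = init_sae(layer_sizes, rng)
x = rng.uniform(size=(64, 20))            # one assumed mini-batch
acts = forward(params, x, bottleneck_index=len(layer_sizes) - 2)
mse = float(np.mean((acts[-1] - x) ** 2)) # reconstruction error that training would minimize
print([a.shape for a in acts], round(mse, 4))
```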

The training of SAE is iterated for epochs, with the mini-batch size set to . The information quantities mentioned in this paper are estimated using the matrix-based functional of Renyi’s α-entropy giraldo2015measures with to approximate Shannon’s entropy as suggested in giraldo2015measures ; giraldo2013rate . Since the kernel size in the estimation of Renyi’s α-entropy is a compromise between bias and variance of the estimator, we must select the kernel size properly because the estimated entropy values depend upon the kernel size. We tune the kernel size by Silverman’s rule of thumb silverman1986density , which takes into consideration the change in kernel size with the dimension of the data (more details on the selection of are given in section 5).
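As a rough illustration of the matrix-based estimator, the sketch below follows the commonly stated form of the approach: a Gaussian Gram matrix is normalized to unit trace, the entropy is computed from its eigenvalues, and the joint entropy uses the normalized Hadamard product of two Gram matrices. The value `alpha=1.01`, the kernel sizes, the toy data and all function names are assumptions for this example rather than settings confirmed by the text.

```python
import numpy as np

def gram_matrix(x, sigma):
    """Gaussian Gram matrix of the rows of x, normalized to unit trace."""
    sq_norms = np.sum(x ** 2, axis=1)
    sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * x @ x.T, 0.0)
    k = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return k / np.trace(k)

def renyi_entropy(a, alpha=1.01):
    """Matrix-based Renyi alpha-entropy (in bits) of a unit-trace PSD matrix."""
    eigvals = np.clip(np.linalg.eigvalsh(a), 0.0, None)  # clip tiny negative round-off
    return float(np.log2(np.sum(eigvals ** alpha)) / (1.0 - alpha))

def mutual_information(x, y, sigma_x, sigma_y, alpha=1.01):
    """I(X;Y) = S(A) + S(B) - S(A, B), with the joint built from the Hadamard product."""
    a, b = gram_matrix(x, sigma_x), gram_matrix(y, sigma_y)
    ab = a * b
    ab = ab / np.trace(ab)
    return renyi_entropy(a, alpha) + renyi_entropy(b, alpha) - renyi_entropy(ab, alpha)

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))                                    # assumed toy "layer input"
y = x @ rng.normal(size=(3, 2)) + 0.1 * rng.normal(size=(50, 2))  # noisy function of x
h_x = renyi_entropy(gram_matrix(x, 1.0))
mi_xy = mutual_information(x, y, 1.0, 1.0)
print(f"S(X) = {h_x:.3f} bits, I(X;Y) = {mi_xy:.3f} bits")
```

The estimate depends on the two kernel sizes, which is why the choice of kernel size is discussed separately in the paper.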

Figure 5: Visualization of training samples selected from (a) MNIST; (b) Fashion-MNIST and (c) FERG-DB (after pre-processing).

4.1 Experimental validation of Fundamental Properties I and III

We first validate Fundamental Properties I and III, since these two properties can be easily verified with IPs. Specifically, we expect the DPI to hold, such that and . We also expect a bifurcation point associated with the value of that is controlled by the intrinsic dimensionality camastra2016intrinsic of the given data: the curves in the IPs may demonstrate distinct behavior depending on or . (Note that the intrinsic dimensionality mentioned in this paper only refers to an effective dimensionality that can give a reasonable fit wang2008scale ; we leave a rigorous investigation of the physical meaning of this dimensionality as future work.) To corroborate this argument, we test different SAE topologies with ranging from to . The corresponding IP-I is shown in Fig. 6.

Fig. 6 shows the behavior of the IP-I in the encoder and the decoder for several values of the bottleneck layer size . As can be seen, is consistently larger than , is consistently larger than and is consistently larger than , no matter the value of . Moreover, after a very short period of training (the SAE is trained with a certain fidelity), is consistently larger than , is consistently larger than and is consistently larger than , no matter the value of . Therefore, the first type of DPI always holds, i.e., and .
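The "consistently larger than" check described above can be automated once layer-wise mutual information estimates have been recorded during training. The helper below is a hypothetical sketch: the name `dpi_violations`, the tolerance, and the numbers in `trace` are invented for illustration and are not results from the paper.

```python
def dpi_violations(layerwise_mi, tol=1e-6):
    """Find places where a deeper layer appears to hold more information than a shallower one.

    layerwise_mi[t][k] is an estimate of the mutual information between the input
    and layer k at iteration t, with k ordered from shallow to deep.
    Returns (iteration, layer index) pairs that break the expected ordering.
    """
    violations = []
    for t, row in enumerate(layerwise_mi):
        for k in range(len(row) - 1):
            if row[k + 1] > row[k] + tol:
                violations.append((t, k + 1))
    return violations

# Synthetic numbers, made up for the example: the second iteration breaks the ordering.
trace = [
    [3.0, 2.5, 2.0],
    [3.1, 3.2, 2.0],
]
print(dpi_violations(trace))
```

An empty result over all recorded iterations would correspond to the ordering reported for the encoder and decoder layers.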

(a) IP-I (encoder part) when
(b) IP-I (decoder part) when
(c) IP-I (encoder part) when
(d) IP-I (decoder part) when
(e) IP-I (encoder part) when
(f) IP-I (decoder part) when
(g) IP-I (encoder part) when
(h) IP-I (decoder part) when
Figure 6: The validation of the bifurcation point associated with . (a), (c), (e) and (g) demonstrate the IP-I ( is in the encoder module) with equal to , , and respectively, whereas (b), (d), (f) and (h) demonstrate the corresponding IP-I when is in the decoder module. As can be seen, the general patterns of the curves in the IPs begin to change between and . This suggests that the effective dimensionality of the MNIST dataset is approximately or .

A finer analysis of the IP-I curves shows that starts higher for the layers closer to the input (shallow), but the rate of increase of is the fastest for the first layer of the encoder, showing that early in learning the shallow layers learn faster about the desired than the deeper layers. The vertical increase of is very likely due to the overcomplete first layer projection space. For a properly set bottleneck layer, the final value of in each layer tends to be close to the initial value of (see IP-I encoder curves), which is very interesting since it means that the mutual information between the input and the layer is transferred to the mutual information between the layer and the desired. This can potentially be used to evaluate whether the overall system is trained well enough, as well as to properly set the learning rates for each layer. Notice also that the shallow layers are more sensitive to the size of the bottleneck layer . The behavior of the curves in the layers close to the bottleneck layer approaches the entropy of the codes, which means that the SAE learning is controlled by the evolution of the entropy in the bottleneck layer codes, which does not conform with the IB principle. Therefore, the change of curve patterns in IP-I seems a good indicator of the DPI property. The picture for the IP-I layers in the decoder is not as clear, because the mutual information only stabilizes once the bottleneck layer settles.

We now start the analysis of the IP-I encoder because it is the one that refers to the coding of information. The effective dimensionality for this dataset is between or . When , the (majority of) pair-wise mutual information curves in the IP-I start to increase up to a point and then go back approaching the bisector of the plane, i.e., converging to the line . This is not surprising as the optimal solution (here “optimal” means that the SAE is trained with the objective of minimizing the distortion measure, i.e., mean square error) resides in this bisector because . However, for the case of , we observed a different behavior. In fact, the curves associated with and keep increasing with two different slopes up to a point, while the is smaller and increases slower such that the curve is further away from the bisector.

The above results partially corroborate the conclusion in shwartz2017opening : there indeed exist two separate phases when using the standard SGD to train DNNs. In the first and short phase, the network is progressively fitting the data manifold, whereas in the second and much longer phase, the purpose of training is to fine tune the representation locally. If the cost function is mutual information as in shwartz2017opening , one could in fact talk about compression of representations. However, here with mean square error (MSE) training, this metaphor does not hold since MSE is not a sparsifying criterion. Nevertheless, we also see that the combined nonlinearity of the units enhances the quality of the local representations, perhaps by moving the units to saturation. However, what we discovered is that, for the SAE, this conclusion only holds in an ideal scenario (i.e., ). By contrast, if , the network is incapable of fitting the data with high fidelity. As a result, the representations are unable to match the local neighborhoods of the data manifold (see Fig. 8).

Finally, it is worth noting that our estimated matches well the values of intrinsic dimensionality given by benchmarking estimators. In fact, the intrinsic dimensionality estimated by the Maximum Likelihood Estimation (MLE) levina2005maximum , the Minimum Neighbor Distance (MiND) Estimator lombardi2011minimum and the Dimensionality from Angle and Norm Concentration (DANCo) ceruti2014danco are , and , respectively.
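To give a sense of how such a benchmark estimator works, here is a minimal sketch of a nearest-neighbor maximum likelihood estimate of intrinsic dimensionality in the form commonly attributed to levina2005maximum. It is an assumed, simplified variant (brute-force distances, a fixed `k=10`, local estimates averaged over all points) and is not the implementation used for the numbers quoted above.

```python
import numpy as np

def mle_intrinsic_dimension(x, k=10):
    """Average of local maximum likelihood dimension estimates from k nearest neighbors."""
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    knn = np.sort(dists, axis=1)[:, 1 : k + 1]        # drop the zero self-distance
    log_ratios = np.log(knn[:, -1:] / knn[:, :-1])    # log(T_k / T_j) for j = 1..k-1
    local_dims = (k - 1) / np.sum(log_ratios, axis=1)
    return float(np.mean(local_dims))

# Assumed toy data: a 2-D plane linearly embedded in a 5-D ambient space.
rng = np.random.default_rng(0)
plane = rng.uniform(size=(500, 2))
data = plane @ rng.normal(size=(2, 5))
estimate = mle_intrinsic_dimension(data)
print(f"estimated intrinsic dimensionality: {estimate:.2f}")
```

On this toy set the estimate should land near 2 even though the ambient dimension is 5, with some upward bias expected for small `k`.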

4.2 Experimental validation of Fundamental Property II

We then validate the second type of DPI in the Fundamental Property II, that is . To this end, we demonstrate, in Fig. 7, the layer-wise mutual information corresponding to two network topologies with different numbers of neurons in the bottleneck layer, i.e., and . The experimental results corroborate the DPI property: the deeper the neural network, the more information about the input is lost, thus the less information the network can manipulate. In fact, by referring to Figs 7(a) and 7(d), both and gradually deviate from , and , with consistently larger than . But there are differences between the two bottleneck layer cases. For all the mutual information curves are parallel to each other during training (compare Figs 7(b) and 7(c)), while is monotonically increasing and much smaller than . However, for , although and are slightly larger than , these two values are almost the same. One probable reason is that we select an over-complete representation in the first hidden layer lewicki2006learning , which should not affect the mutual information (the topological hypervolume of data in over-complete projections remains the same). This phenomenon does not occur for , because a -dimensional projection space is insufficient to guarantee lossless reconstruction of input data, thus the first hidden layer keeps losing information even though it is over-complete.

Figure 7: The validation of the data processing inequality (DPI) associated with the layer-wise mutual information. (a) demonstrates the layer-wise mutual information when ; (b) shows a zoomed-in view of (a) for the iterations between (approximately) and ; (c) shows a further zoomed-in view of (b). Similarly, (d) demonstrates the layer-wise mutual information when ; (e) shows a zoomed-in view of (d) for the iterations between (approximately) and ; (f) shows a further zoomed-in view of (e). In all sub-figures, the green triangles denote the mutual information between and , the red circles denote the mutual information between and , the blue plus signs denote the mutual information between and , the magenta squares denote the mutual information between and , and the black asterisks denote the mutual information between and (reduces to the entropy of in our case).

But the most interesting observation is that the entropy of bottleneck codes begins to decrease after a certain number of iterations when while for no similar phenomenon occurs. This suggests that the bottleneck codes undergo different forms of specialization when the reconstruction reaches a certain fidelity, but the existence of a compression phase depends on whether the topology can guarantee an information-lossless reconstruction. Otherwise, we think that this specialization results in distortion of the original manifold. To verify this, we demonstrate the geometric distribution of for both and in Fig. 8. Note that it is impossible to explicitly observe -dimensional point clouds in a -dimensional (or -dimensional) space, thus we randomly select (out of ) neurons and use a scatterplot matrix to visualize the geometric distribution changes.

As can be seen, in both cases the codes attempt to fill up the projection space, thus increasing the overall entropy (see Figs 8(b) and 8(f)). However, in the case of , the clusters break up (green, yellow) and the codes persistently enlarge to cover the projection space, with no trend to decrease the redundancy (see Figs 8(c) and 8(d)). This is because the compressed -dimensional space is insufficient to accommodate the natural structure of the data, so continuous training distorts the local structure of the data manifold. The parameter adaptation tries to minimize the error by spreading the codes in a larger region of the space, which reduces classification error until unit saturation takes over. However, the volume of the projected data is maximum so local structure is lost. By contrast, in the case of , when the reconstruction reaches a certain fidelity and the network has sufficient discriminative power with degrees of freedom (see Fig. 8(f)), the manifold of each class begins to shrink (see Figs 8(g) and 8(h)), thus decreasing the overall entropy and achieving also a very good classification accuracy.

Figure 8: Bottleneck layer codes visualization. (a)-(d) demonstrate the codes distribution at iteration , , and respectively when , whereas (e)-(h) demonstrate the codes distribution at iteration , , and respectively when . In each sub-figure, each color denotes a different class.
Figure 9: The frequency histograms of activation values of randomly selected neurons in the first hidden layer . (a)-(d) demonstrate the histograms at iteration , , and respectively when , whereas (e)-(h) demonstrate the histograms at iteration , , and respectively when . In each sub-figure, the -axis denotes the bounded activation range (divided into bins of equal length), whereas the -axis denotes the frequency in each bin. The higher the frequency in the leftmost and the rightmost bins, the broader the space that the hidden codes occupy, and thus the higher the saturation.

We expect a similar phenomenon to happen for other hidden layer representations. To verify this, we demonstrate the geometric distribution of for both and in Fig. 9. Specifically, we randomly select (out of ) neurons and plot the normalized histograms (by frequency) of their activation values to infer the geometric distribution changes of in . Intuitively, the broader the space the codes occupy (before activation), the higher the possibility of neuron saturation. In fact, from Fig. 9, almost all the neurons in tend to be saturated at the end of iteration when . This suggests that the original hidden representations of persistently enlarge the projection space, just as does. By contrast, except for a few neurons, there is no obvious saturation for the other neurons of when . Moreover, the normalized histograms remain almost the same from iteration to . This suggests that the hidden representations of are self-constrained in , the same as .
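The histogram-based reading of saturation can be turned into a single number: the share of activations that fall in the two extreme bins. The sketch below assumes sigmoid activations in [0, 1] and a hypothetical bin count of 10; the function name and the two synthetic pre-activation spreads are illustrative choices, not values from the experiments.

```python
import numpy as np

def saturation_fraction(activations, n_bins=10):
    """Fraction of activations in the leftmost or rightmost bin of the range [0, 1]."""
    counts, _ = np.histogram(activations, bins=n_bins, range=(0.0, 1.0))
    return float((counts[0] + counts[-1]) / counts.sum())

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Codes that occupy a narrow region before activation versus a broad one (synthetic).
narrow = saturation_fraction(sigmoid(rng.normal(0.0, 0.5, size=10_000)))
broad = saturation_fraction(sigmoid(rng.normal(0.0, 10.0, size=10_000)))
print(f"narrow spread: {narrow:.3f}, broad spread: {broad:.3f}")
```

The broader the pre-activation spread, the larger the fraction, which matches the intuition stated above.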

Finally, it is worth noting that this behavior is expected to generalize to other DNNs, because the DPI is an intrinsic characteristic of any feedforward DNN. Further work should validate this property in DNNs and make the proper modifications to analyze recurrent neural networks (RNNs).

5 Conclusions

In this paper, we analyzed the learning of deep neural networks (DNNs) from a joint geometric and information theoretic perspective, thus emphasizing the role that pair-wise mutual information plays in understanding DNNs. As an application of this idea, three fundamental properties are presented, concentrating on stacked autoencoders (SAEs). The experiments on three real-world datasets validated the data processing inequality associated with layer-wise mutual information and the existence of a bifurcation point associated with the topology of SAEs that is controlled by the training data. Moreover, this indirectly corroborates the appropriateness of the nonparametric estimators that were used to apply the information theoretic understanding.

Our observations offer several insights and implications for future research:

1) The potential of Information Theoretic Learning (ITL) principe2010information in understanding DNNs.

Using information theory to explain DNNs remains a promising avenue, but there are still several important issues in the implementation. Among them is the accurate and tractable estimation of information quantities from large data. This is because Shannon’s definition is hard to estimate, which severely limits its power to analyze machine learning algorithms gao2015efficient . For example, employing Shannon’s discrete entropy, khadivi2016flow limits the analysis to simple Boolean networks (discrete codes), whereas shwartz2017opening still concentrates on small toy datasets.

Information Theoretic Learning (ITL) principe2010information , on the other hand, utilizes Renyi’s quadratic information measures renyi1961measures and Parzen windowing parzen1962estimation to estimate information quantities directly from continuous random variables with few assumptions. The recently proposed matrix formulation of Renyi’s information giraldo2015measures is a departure from the original quadratic information measures and allows estimation of high dimensional data. This useful property makes it well suited to analyze the dynamics or information flow of any deep neural network, thus achieving the goal of explaining DNN mappings. However, as emphasized in previous sections, care must be taken to select an appropriate value for the kernel size σ. In this paper, σ is defined by Silverman’s rule of thumb silverman1986density :

$$\sigma = h \times n^{-1/(4+d)} \qquad (15)$$

where n is the number of samples (mini-batch size of SGD training in our application), d is the sample dimensionality (number of neurons for each layer in our application), and h is an empirical value selected experimentally by taking into account the data’s average marginal variance. We understand that this density estimation perspective may not be the best way to select the RKHS inner product, but its advantage of showing a dependence on dimension made it still effective. Theoretically, for small σ, the Gram matrix approaches identity and thus its eigenvalues become more similar, with as the limit case. Therefore, both entropy and mutual information monotonically increase as σ decreases giraldo2015measures . We select , as the entropy estimated using matches well with the geometric distributional changes mentioned in section 4.2 (see Fig. 10). However, we also show, in the supplementary material, that even though (hence ) is not optimized, we can still observe the same trends of general patterns of the curves in the IP, although the values of entropy change with the kernel size as expected. The supplementary material also shows a similar behavior with the estimation of mutual information, which now depends upon two kernel sizes.
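The effect of the kernel size on the entropy estimate can be illustrated numerically. The sketch below assumes the common form of Silverman's rule, sigma = h * n^(-1/(4+d)), a matrix-based Renyi entropy with an assumed `alpha=1.01`, and synthetic Gaussian codes; the function names and the three values of `h` are hypothetical.

```python
import numpy as np

def silverman_sigma(n, d, h):
    """Kernel size from the assumed rule sigma = h * n^(-1 / (4 + d))."""
    return h * n ** (-1.0 / (4.0 + d))

def renyi_entropy(x, sigma, alpha=1.01):
    """Matrix-based Renyi alpha-entropy (in bits) from a unit-trace Gaussian Gram matrix."""
    sq_norms = np.sum(x ** 2, axis=1)
    sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * x @ x.T, 0.0)
    gram = np.exp(-sq_dists / (2.0 * sigma ** 2))
    gram = gram / np.trace(gram)
    eigvals = np.clip(np.linalg.eigvalsh(gram), 0.0, None)
    return float(np.log2(np.sum(eigvals ** alpha)) / (1.0 - alpha))

rng = np.random.default_rng(0)
n, d = 40, 3                                   # assumed mini-batch size and layer width
codes = rng.normal(size=(n, d))
h_values = (0.01, 1.0, 100.0)
entropies = [renyi_entropy(codes, silverman_sigma(n, d, h)) for h in h_values]
for h, s in zip(h_values, entropies):
    print(f"h = {h:>6}: sigma = {silverman_sigma(n, d, h):.4f}, entropy = {s:.4f} bits")
print(f"upper bound log2(n) = {np.log2(n):.4f} bits")
```

With a very small kernel size the Gram matrix is nearly the identity and the estimate sits at its upper bound log2(n), so it can no longer reflect changes in the distribution of the codes.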

Figure 10: The entropy of bottleneck layer codes (in logarithmic scale) with respect to different values of and kernel size . As can be seen, increases monotonically as or decreases. makes quickly approach its upper bound, thus failing to reveal any distributional changes in the bottleneck layer codes. Although the entropy given by matches well with the general trend of the distribution (the green and blue curves show the same trends), a large makes tend to ( in our application), which is not convincing. By contrast, provides an entropy estimator that uses the full range and is discriminative of the distributional changes of the codes.

2) Implications on the design of DNNs topology.

The optimal design of DNNs topology is essential in many practical applications. Unfortunately, there is still a lack of fixed rules or widely acknowledged methods currently available. Previous works either employ a trial-and-error process starting from a set of rules of thumb or dynamically adjust the network configuration (e.g., the cascade-correlation algorithm fahlman1990cascade ).

With the advent of deep learning, the tendency is to design much deeper neural networks to guarantee favorable performance on different tasks, especially for image classification he2016deep . However, from the DPI perspective validated in this work (we also refer to another type of DPI specifically for MLP as shown in shwartz2017opening ), the deeper the neural networks, the more information about the input is lost, thus the less information the network can manipulate. In this sense, one can expect an upper bound on the number of layers in DNNs that achieves optimal performance. The advantage of the proposed methodology is that the experimentalist can find out how much residual information exists in the intermediate layers to guarantee a generalization performance. Alternatively, this may help “principled tweaking” by replacing the nonlinear units by linear units (as proposed by glorot2011deep ) in specific layers that substantially reduce the mutual information. In fact, more layers will not only result in more information loss, but will also introduce many more parameters that are hard to tune and will compromise generalization.

3) Implications on the feedforward training of DNNs.

The idea of training DNNs using information theoretic concepts has a long history dating back to the celebrated “InfoMax” principle proposed by Linsker linsker1988self ; linsker1989generate , which states that the most informative learner is the one that maximizes the mutual information between input (e.g., sample attributes) and target (e.g., class label). Motivated by the “InfoMax” principle, several training methods have been developed concentrating on different types of network architectures, including MLP xu1999training , Autoencoders miranda2013breaker , Restricted Boltzmann Machine (RBM) peng2016mutual , etc.

We believe the utilization of ITL and the IPs (e.g., Figs 11(a) and 11(b)) holds potential for feedforward training of DNNs, as an alternative to the basic backpropagation method. Our argument stems from three main reasons. First, ITL gives a tractable estimation of information quantities, which is critical to implement the “InfoMax” principle. Note that the gradient of the mutual information becomes much less dependent on kernel size, because it is insensitive to the bias caused by the selection of the kernel size. Second, the IPs provide an explicit and flexible way to visualize information flow between any layers of interest. For example, by monitoring the information quantities between symmetric pair-wise layers (e.g., and ) in a greedy manner, bengio2014auto suggests the possibility of using “target propagation” to train SAEs. In this sense, the IP-II might be a good complement to implement this idea. In fact, by comparing Fig. 11(a) with Fig. 11(b), it is interesting to find that can exceed . This is a fundamental difference from IP-I, in which all curves are strictly below the bisector of the IP. Third, by comparing Figs 11(a), 11(b) and 11(c), it is interesting that for the classification case, the bottleneck layer code that corresponds to the training phase close to the knee (in both IP-I and IP-II) is the point where classification accuracy on the testing set is maximum. These results may provide an explicit cut-off point for “early stopping” for optimal generalization haykin2009neural .
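One way to act on the bisector observation is to record the two IP-II coordinates at each monitored iteration and report the first iteration at which the curve rises above the bisector. The helper below is a hypothetical sketch: the name `first_bisector_crossing`, the tolerance and the numbers in the example are invented, and which quantity is placed on which axis is left to the user.

```python
def first_bisector_crossing(horizontal, vertical, tol=0.0):
    """Index of the first recorded iteration whose IP point lies above the bisector.

    horizontal[t] and vertical[t] are the two information quantities plotted in the
    information plane at iteration t. Returns None if the curve never crosses.
    """
    for t, (x, y) in enumerate(zip(horizontal, vertical)):
        if y > x + tol:
            return t
    return None

# Synthetic trajectory, made up for illustration only.
mi_horizontal = [1.0, 2.0, 3.0, 3.5, 3.6]
mi_vertical = [0.5, 1.5, 2.9, 3.7, 3.9]
print(first_bisector_crossing(mi_horizontal, mi_vertical))
```

Whether such a crossing is a reliable stopping signal is exactly the kind of question the observations above leave open.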

4) Implications on optimal generalization.

We extend the above observations to the problem of generalization, i.e., the ability of the model (learned from training data) to fit unseen instances (or testing data) haykin2009neural . This is perhaps the most challenging topic in DNNs, as it has been experimentally shown that popular techniques including explicit regularization (e.g., weight decay or Dropout srivastava2014dropout ) or implicit methods (e.g., early stopping or batch normalization ioffe2015batch ) cannot explain the generalization of DNNs very well zhang2016understanding .

To the best of our knowledge, the analysis of generalization ability using information theoretic concepts has seldom been investigated before, except for some recently published works (e.g., raginsky2017information ; alabdulmohsin2017information ; zheng2017understanding ). Different from these works, we present an alternative perspective herein. In fact, we experimentally found that the bottleneck layer code that corresponds to the training phase (shown in the IPs) for the SAE is a stable indicator of the knee of the generalization performance when the codes are used for classification using a Softmax Regression classifier (see Fig. 11). If this preliminary observation extends to other cases, it may be possible to address the problem of generalization of a classifier using an ITL framework. We leave a rigorous implementation of this idea as future work.
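The evaluation step described above, classifying fixed bottleneck codes with softmax regression, can be sketched as follows. This is a plain gradient-descent implementation on synthetic two-dimensional "codes"; the learning rate, step count, blob centers and function names are assumptions for illustration, and the codes here are random clusters rather than the output of a trained SAE.

```python
import numpy as np

def train_softmax_regression(codes, labels, n_classes, lr=0.1, n_steps=500):
    """Fit a softmax regression classifier on fixed codes by batch gradient descent."""
    n, d = codes.shape
    w, b = np.zeros((d, n_classes)), np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(n_steps):
        logits = codes @ w + b
        logits -= logits.max(axis=1, keepdims=True)      # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                      # gradient of mean cross-entropy
        w -= lr * codes.T @ grad
        b -= lr * grad.sum(axis=0)
    return w, b

def accuracy(w, b, codes, labels):
    return float(np.mean(np.argmax(codes @ w + b, axis=1) == labels))

# Synthetic stand-in for bottleneck codes: three well-separated clusters.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
train_labels = rng.integers(0, 3, size=300)
test_labels = rng.integers(0, 3, size=150)
train_codes = centers[train_labels] + rng.normal(size=(300, 2))
test_codes = centers[test_labels] + rng.normal(size=(150, 2))

w, b = train_softmax_regression(train_codes, train_labels, n_classes=3)
test_accuracy = accuracy(w, b, test_codes, test_labels)
print(f"test accuracy: {test_accuracy:.3f}")
```

Repeating this evaluation on codes saved at different training iterations would produce a generalization curve of the kind compared against the information planes.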

(a) IP-I when
(b) IP-II when
(c) generalization curve
Figure 11: The relationships between IP-I, IP-II and the generalization performance in the classification case. We consistently look at the bottleneck layer codes when the system is trained as an SAE and test the codes’ accuracy in classification using a Softmax Regression classifier (without fine-tuning). (a) and (b) demonstrate the IP-I (encoder module) and IP-II when , where the black dashed line indicates the bisector. (c) demonstrates the corresponding classification accuracy using with respect to the number of iterations. A fundamental difference between IP-I and IP-II is that can exceed . Therefore, it is very probable that when IP-II goes beyond the bisector, that particular layer is overtrained. Moreover, it is interesting that for the classification case, the bottleneck layer code that corresponds to the training phase close to the knee (in both IP-I and IP-II) is the point where classification accuracy on the testing set is maximum.

Acknowledgement

The authors would like to express their sincere gratitude to Dr. Luis Gonzalo Sánchez Giraldo from the University of Miami and Dr. Robert Jenssen from the UiT - The Arctic University of Norway for their careful reading of our manuscript and many insightful comments and suggestions. This work is supported in part by the U.S. Office of Naval Research under Grant N---.

References

  • (1) A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, 2012, pp. 1097–1105.
  • (2) A. Graves, A.-r. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks, in: Acoustics, speech and signal processing (icassp), 2013 ieee international conference on, IEEE, 2013, pp. 6645–6649.
  • (3) G. Mesnil, X. He, L. Deng, Y. Bengio, Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding., in: Interspeech, 2013, pp. 3771–3775.
  • (4) G. Alain, Y. Bengio, Understanding intermediate layers using linear classifier probes, arXiv preprint arXiv:1610.01644.
  • (5) J. C. Principe, N. R. Euliano, W. C. Lefebvre, Neural and adaptive systems: fundamentals through simulations, Vol. 672, Wiley New York, 2000.
  • (6) A. N. Kolmogorov, Sur l’interpolation et extrapolation des suites stationnaires, CR Acad. Sci 208 (1939) 2043–2045.
  • (7) J. C. Principe, Information theoretic learning: Renyi’s entropy and kernel perspectives, Springer Science & Business Media, 2010.
  • (8) R. Shwartz-Ziv, N. Tishby, Opening the black box of deep neural networks via information, arXiv preprint arXiv:1703.00810.
  • (9) D. J. MacKay, Information theory, inference and learning algorithms, Cambridge university press, 2003.
  • (10) R. L. Stratonovich, On value of information, Izvestiya of USSR Academy of Sciences, Technical Cybernetics 5 (1) (1965) 3–12.
  • (11) S. Yu, M. Emigh, E. Santana, J. C. Príncipe, Autoencoders trained with relevant information: Blending shannon and wiener’s perspectives, in: Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, IEEE, 2017, pp. 6115–6119.
  • (12) L. G. S. Giraldo, M. Rao, J. C. Principe, Measures of entropy from data using infinitely divisible kernels, IEEE Transactions on Information Theory 61 (1) (2015) 535–548.
  • (13) J. C. Principe, B. Chen, Universal approximation with convex optimization: Gimmick or reality?[discussion forum], IEEE Computational Intelligence Magazine 10 (2) (2015) 68–77.
  • (14) B. Chen, Y. Zhu, J. Hu, J. C. Principe, System parameter identification: information criteria and algorithms, Newnes, 2013.
  • (15) E. Kokiopoulou, J. Chen, Y. Saad, Trace optimization and eigenproblems in dimension reduction methods, Numerical Linear Algebra with Applications 18 (3) (2011) 565–602.
  • (16) P. Baldi, K. Hornik, Neural networks and principal component analysis: Learning from examples without local minima, Neural networks 2 (1) (1989) 53–58.
  • (17) A. Rényi, et al., On measures of entropy and information, in: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, The Regents of the University of California, 1961.
  • (18) B. W. Silverman, Density estimation for statistics and data analysis, Vol. 26, CRC press, 1986.
  • (19) L. G. S. Giraldo, J. C. Principe, Rate-distortion auto-encoders, arXiv preprint arXiv:1312.7381.
  • (20) C.-W. Huang, S. S. S. Narayanan, Flow of renyi information in deep neural networks, in: Machine Learning for Signal Processing (MLSP), 2016 IEEE 26th International Workshop on, IEEE, 2016, pp. 1–6.
  • (21) A. M. Álvarez-Meza, J. A. Lee, M. Verleysen, G. Castellanos-Dominguez, Kernel-based dimensionality reduction using renyi’s α-entropy measures of similarity, Neurocomputing 222 (2017) 36–46.
  • (22) R. Bhatia, Infinitely divisible matrices, The American Mathematical Monthly 113 (3) (2006) 221–235.
  • (23) P. Mehta, D. J. Schwab, An exact mapping between the variational renormalization group and deep learning, arXiv preprint arXiv:1410.3831.
  • (24) H. W. Lin, M. Tegmark, D. Rolnick, Why does deep and cheap learning work so well?, Journal of Statistical Physics 168 (6) (2017) 1223–1247.
  • (25) N. Tishby, N. Zaslavsky, Deep learning and the information bottleneck principle, in: Information Theory Workshop (ITW), 2015 IEEE, IEEE, 2015, pp. 1–5.
  • (26) N. Tishby, F. C. Pereira, W. Bialek, The information bottleneck method, arXiv preprint physics/0004057.
  • (27) P. Khadivi, R. Tandon, N. Ramakrishnan, Flow of information in feed-forward deep neural networks, arXiv preprint arXiv:1603.06220.
  • (28) Y. Bengio, G. Mesnil, Y. Dauphin, S. Rifai, Better mixing via deep representations, in: Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 552–560.
  • (29) P. P. Brahma, D. Wu, Y. She, Why deep learning works: A manifold disentanglement perspective, IEEE transactions on neural networks and learning systems 27 (10) (2016) 1997–2008.
  • (30) R. Pascanu, G. Montufar, Y. Bengio, On the number of response regions of deep feed forward networks with piece-wise linear activations, arXiv preprint arXiv:1312.6098.
  • (31) G. F. Montufar, R. Pascanu, K. Cho, Y. Bengio, On the number of linear regions of deep neural networks, in: Advances in neural information processing systems, 2014, pp. 2924–2932.
  • (32) A. Achille, S. Soatto, Emergence of invariance and disentangling in deep representations, arXiv preprint arXiv:1706.01350.
  • (33) M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: European conference on computer vision, Springer, 2014, pp. 818–833.
  • (34) M. D. Zeiler, G. W. Taylor, R. Fergus, Adaptive deconvolutional networks for mid and high level feature learning, in: Computer Vision (ICCV), 2011 IEEE International Conference on, IEEE, 2011, pp. 2018–2025.
  • (35) A. Mahendran, A. Vedaldi, Understanding deep image representations by inverting them, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 5188–5196.
  • (36) J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, H. Lipson, Understanding neural networks through deep visualization, arXiv preprint arXiv:1506.06579.
  • (37) A. Nguyen, J. Yosinski, J. Clune, Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks, arXiv preprint arXiv:1602.03616.
  • (38) Q. Zhang, R. Cao, F. Shi, Y. N. Wu, S.-C. Zhu, Interpreting cnn knowledge via an explanatory graph, arXiv preprint arXiv:1708.01785.
  • (39) S. P. Luttrell, A bayesian analysis of self-organizing maps, Neural Computation 6 (5) (1994) 767–794.
  • (40) T. M. Liggett, Interacting particle systems, Vol. 276, Springer Science & Business Media, 2012.
  • (41) S. Jansen, N. Kurt, et al., On the notion (s) of duality for markov processes, Probability surveys 11 (2014) 59–120.
  • (42) J. R. Norris, Markov chains, no. 2, Cambridge university press, 1998.
  • (43) I. Csiszár, A class of measures of informativity of observation channels, Periodica Mathematica Hungarica 2 (1-4) (1972) 191–213.
  • (44) T. M. Cover, J. A. Thomas, Elements of information theory, John Wiley & Sons, 2012.
  • (45) N. Merhav, Data processing theorems and the second law of thermodynamics, IEEE Transactions on Information Theory 57 (8) (2011) 4926–4939.
  • (46) S. Haykin, Neural networks: a comprehensive foundation, Prentice Hall PTR, 1994.
  • (47) Y. Bengio, How auto-encoders could provide credit assignment in deep networks via target propagation, arXiv preprint arXiv:1407.7906.
  • (48) F. Takens, et al., Detecting strange attractors in turbulence, Lecture notes in mathematics 898 (1) (1981) 366–381.
  • (49) Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
  • (50) H. Xiao, K. Rasul, R. Vollgraf, Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, arXiv preprint arXiv:1708.07747.
  • (51) D. Aneja, A. Colburn, G. Faigin, L. Shapiro, B. Mones, Modeling stylized character expressions via deep learning, in: Asian Conference on Computer Vision, Springer, 2016, pp. 136–153.
  • (52) L. v. d. Maaten, G. Hinton, Visualizing data using t-sne, Journal of Machine Learning Research 9 (Nov) (2008) 2579–2605.
  • (53) D. Arpit, Y. Zhou, H. Ngo, V. Govindaraju, Why regularized auto-encoders learn sparse representation?, in: International Conference on Machine Learning, 2016, pp. 136–144.
  • (54) G. E. Hinton, R. R. Salakhutdinov, Reducing the dimensionality of data with neural networks, science 313 (5786) (2006) 504–507.
  • (55) F. Camastra, A. Staiano, Intrinsic dimension estimation: Advances and open problems, Information Sciences 328 (2016) 26–41.
  • (56) X. Wang, J. S. Marron, et al., A scale-based approach to finding effective dimensionality in manifold learning, Electronic Journal of Statistics 2 (2008) 127–148.
  • (57) E. Levina, P. J. Bickel, Maximum likelihood estimation of intrinsic dimension, in: Advances in neural information processing systems, 2005, pp. 777–784.
  • (58) G. Lombardi, A. Rozza, C. Ceruti, E. Casiraghi, P. Campadelli, Minimum neighbor distance estimators of intrinsic dimension, Machine Learning and Knowledge Discovery in Databases (2011) 374–389.
  • (59) C. Ceruti, S. Bassis, A. Rozza, G. Lombardi, E. Casiraghi, P. Campadelli, Danco: An intrinsic dimensionality estimator exploiting angle and norm concentration, Pattern recognition 47 (8) (2014) 2569–2581.
  • (60) M. S. Lewicki, T. J. Sejnowski, Learning overcomplete representations, Learning 12 (2).
  • (61) S. Gao, G. Ver Steeg, A. Galstyan, Efficient estimation of mutual information for strongly dependent variables, in: Artificial Intelligence and Statistics, 2015, pp. 277–286.
  • (62) E. Parzen, On estimation of a probability density function and mode, The annals of mathematical statistics 33 (3) (1962) 1065–1076.
  • (63) S. E. Fahlman, C. Lebiere, The cascade-correlation learning architecture, in: Advances in neural information processing systems, 1990, pp. 524–532.
  • (64) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • (65) X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.
  • (66) R. Linsker, Self-organization in a perceptual network, Computer 21 (3) (1988) 105–117.
  • (67) R. Linsker, How to generate ordered maps by maximizing the mutual information between input and output signals, Neural computation 1 (3) (1989) 402–411.
  • (68) D. Xu, J. C. Principe, Training mlps layer-by-layer with the information potential, in: Neural Networks, 1999. IJCNN’99. International Joint Conference on, Vol. 3, IEEE, 1999, pp. 1716–1720.
  • (69) V. Miranda, J. Krstulovic, J. Hora, V. Palma, J. C. Principe, Breaker status uncovered by autoencoders under unsupervised maximum mutual information training, in: 17th International Conference on Intelligent System Applications to Power Systems, 2013.
  • (70) K.-H. Peng, H. Zhang, Mutual information-based rbm neural networks, in: Pattern Recognition (ICPR), 2016 23rd International Conference on, IEEE, 2016, pp. 2458–2463.
  • (71) S. S. Haykin, Neural networks and learning machines, Vol. 3, Pearson Upper Saddle River, NJ, USA:, 2009.
  • (72) N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting., Journal of machine learning research 15 (1) (2014) 1929–1958.
  • (73) S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: International Conference on Machine Learning, 2015, pp. 448–456.
  • (74) C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding deep learning requires rethinking generalization, arXiv preprint arXiv:1611.03530.
  • (75) M. Raginsky, A. Xu, Information-theoretic analysis of generalization capability of learning algorithms, in: Advances in Neural Information Processing Systems, 2017, pp. 2521–2530.
  • (76) I. Alabdulmohsin, An information-theoretic route from generalization in expectation to generalization in probability, in: Artificial Intelligence and Statistics, 2017, pp. 92–100.
  • (77) G. Zheng, J. Sang, C. Xu, Understanding deep learning generalization by maximum entropy, arXiv preprint arXiv:1711.07758.

Supplementary Material to Understanding Autoencoders with Information Theoretic Concepts

In this supplementary material, we first validate the three fundamental properties on Fashion-MNIST and FERG-DB. The network topology of stacked autoencoders (SAEs) on Fashion-MNIST is “784-1000-500-250--250-500-1000-784”, where denotes the number of neurons in the bottleneck layer. By contrast, the network topology of SAEs on FERG-DB is selected to be “1024-512-256-100--100-256-512-1024”. As in the experiments on MNIST, the information quantities mentioned here are estimated using the matrix-based functional of Renyi’s α-entropy giraldo2015measures . We select to approximate Shannon’s entropy and tune the kernel size by Silverman’s rule of thumb silverman1986density . Note that we also conduct a simple simulation, on the MNIST dataset, to illustrate that the phenomena observed in the main text are robust to the variation of kernel size (in a reasonable range).
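The estimation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a Gaussian kernel, uses the commonly stated form of the matrix-based functional (the entropy is computed from the eigenvalues of a unit-trace Gram matrix), and treats the order `alpha=1.01` and the Silverman constant `h` as illustrative choices. All function names are hypothetical.

```python
import numpy as np

def silverman_sigma(x, h=1.0):
    """Kernel size sigma = h * n^(-1/(4+d)); h is a hand-tuned constant (assumption)."""
    n, d = x.shape
    return h * n ** (-1.0 / (4.0 + d))

def normalized_gram(x, sigma):
    """Gaussian Gram matrix of the rows of x, scaled to unit trace."""
    diff = x[:, None, :] - x[None, :, :]
    k = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * sigma ** 2))
    return k / np.trace(k)

def renyi_entropy(a, alpha=1.01):
    """S_alpha(A) = log2(sum_i lambda_i^alpha) / (1 - alpha), in bits.

    alpha close to 1 approximates Shannon's entropy; 1.01 is an
    illustrative value, not one taken from the text above.
    """
    lam = np.linalg.eigvalsh(a)
    lam = np.clip(lam, 0.0, None)  # drop tiny negative eigenvalues caused by round-off
    return float(np.log2(np.sum(lam ** alpha)) / (1.0 - alpha))

# Usage on random data standing in for a mini-batch of layer activations.
rng = np.random.default_rng(0)
batch = rng.normal(size=(100, 8))
sigma = silverman_sigma(batch, h=1.0)
entropy_bits = renyi_entropy(normalized_gram(batch, sigma))
```

Two sanity checks follow from the definition: identical samples give an entropy of zero, and n samples that are far apart relative to the kernel size give an entropy close to log2(n).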

6 Property validation on Fashion-MNIST

6.1 Validation of Fundamental Properties I and III

We first test different SAE topologies with ranging from to . We expect different behaviors of the curves shown in the IPs depending on or , where is an effective dimensionality that can fit the training data well. The corresponding IP-I is shown in Fig. 12. The monotonically decreasing characteristics of and are easy to observe. Thus, Fundamental Property I (i.e., and ) always holds.

(a) IP-I (encoder part) when
(b) IP-I (decoder part) when
(c) IP-I (encoder part) when
(d) IP-I (decoder part) when
(e) IP-I (encoder part) when
(f) IP-I (decoder part) when
(g) IP-I (encoder part) when
(h) IP-I (decoder part) when
Figure 12: The validation of the bifurcation point associated with . (a), (c), (e) and (g) demonstrate the IP-I ( is in the encoder module) with equal to , , and respectively, whereas (b), (d), (f) and (h) demonstrate the corresponding IP-I when is in the decoder module. As can be seen, the general patterns of the curves in the IPs begin to change between and . This suggests that the effective dimensionality of the Fashion-MNIST dataset is approximately or .

If we look deeper, the IP-I (encoder part) is more sensitive to the change of , thus providing a good indicator to investigate the data dimensionality property. The experimental results validate Fundamental Property III very well, because the curves associated with and begin to approach the bisector after a certain point when , rather than deviating from the bisector as demonstrated when . Moreover, it is worth noting that our estimated matches well the values of intrinsic dimensionality given by benchmarking estimators. In fact, the intrinsic dimensionality estimated by the Maximum Likelihood Estimation (MLE) levina2005maximum , the Minimum Neighbor Distance (MiND) Estimator lombardi2011minimum and the Dimensionality from Angle and Norm Concentration (DANCo) ceruti2014danco are , and , respectively.
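For readers who want to reproduce such a benchmark, the MLE estimator of intrinsic dimensionality can be sketched in a few lines. This is a minimal sketch under stated assumptions: it uses the standard local estimate (the inverse of the mean log-ratio between the k-th nearest-neighbour distance and the closer ones) averaged over all points, the neighbourhood size `k=10` is an illustrative choice, and the function name is hypothetical. It is not the code used to obtain the values reported above.

```python
import numpy as np

def mle_intrinsic_dim(x, k=10):
    """MLE of intrinsic dimension from k-nearest-neighbour distances, averaged over points."""
    diff = x[:, None, :] - x[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)               # exclude each point from its own neighbours
    knn = np.sqrt(np.sort(d2, axis=1)[:, :k])  # ascending distances T_1 <= ... <= T_k
    log_ratio = np.log(knn[:, -1:] / knn[:, :-1])  # log(T_k / T_j) for j = 1..k-1
    local = (k - 1) / np.sum(log_ratio, axis=1)    # per-point dimension estimate
    return float(np.mean(local))

# Usage: a 2-D sheet embedded in 5-D should yield an estimate near 2.
rng = np.random.default_rng(0)
sheet = np.hstack([rng.uniform(size=(500, 2)), np.zeros((500, 3))])
estimate = mle_intrinsic_dim(sheet, k=10)
```

The pairwise-difference array costs O(n² d) memory, so for datasets the size of Fashion-MNIST a subsample or a tree-based neighbour search would be needed.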

6.2 Validation of Fundamental Property II

Similar to the validation procedure on MNIST, we select two specific values of to validate the second type of DPI in Fundamental Property II. Fig. 13 demonstrates the layer-wise mutual information corresponding to and respectively. It is obvious that the DPI associated with mutual information (i.e., ) holds regardless of the value of .
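In general form, the DPI states that for a Markov chain X → T1 → T2 → …, the mutual information with the input cannot increase along the chain, i.e., I(X; T1) ≥ I(X; T2) ≥ …. Given a list of estimated layer-wise values at one training iteration, this can be checked mechanically. The helper below is a hypothetical sketch; the tolerance `tol` is an assumption that absorbs estimation noise, and the numbers in the usage lines are made up for illustration, not results from the experiments.

```python
def dpi_violations(mi_by_layer, tol=1e-6):
    """Return the indices i where mi_by_layer[i + 1] exceeds mi_by_layer[i] by more than tol."""
    return [i for i in range(len(mi_by_layer) - 1)
            if mi_by_layer[i + 1] > mi_by_layer[i] + tol]

# Illustrative values only (ordered from the layer closest to the input onward).
ok = dpi_violations([5.0, 4.2, 4.2, 3.1])   # non-increasing sequence
bad = dpi_violations([5.0, 5.5, 3.0])       # second value exceeds the first
```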

Figure 13: The validation of the data processing inequality (DPI) associated with the layer-wise mutual information. (a) demonstrates the layer-wise mutual information when ; (b) shows a zoomed-in view of (a) for the iterations between (approximately) and ; (c) shows a further zoomed-in view of (b). Similarly, (d) demonstrates the layer-wise mutual information when ; (e) shows a zoomed-in view of (d) for the iterations between (approximately) and ; (f) shows a further zoomed-in view of (e). In all sub-figures, the green triangles denote the mutual information between and , the red circles denote the mutual information between and , the blue plus signs denote the mutual information between and , the magenta squares denote the mutual information between and , and the black asterisks denote the mutual information between and (reduces to the entropy of in our case).

7 Property validation on FERG-DB

7.1 Validation of Fundamental Properties I, II and III

We test the DPIs and the existence of the bifurcation point together. We vary the value of from to . Figs. 14(a), 14(c), 14(e) and 14(g) demonstrate the IP-I when equals , , and respectively, whereas Figs. 14(b), 14(d), 14(f) and 14(h) demonstrate the layer-wise mutual information.

As can be seen, the two types of DPI hold regardless of the value of , i.e., , and . However, there are no distinct patterns in the IP-I, i.e., all the curves approach the bisector monotonically for the values tried. This is different from the scenario in MNIST or Fashion-MNIST when , in which all curves first increase up to a point and then turn back toward the bisector. It is also different from the scenario when , in which all the curves move away from the bisector. One possible reason is that the effective dimensionality for FERG-DB is so small that can already fit the data very well. Therefore, we expect the effective dimensionality to be . Because the size of the bottleneck layer is always larger than , the curves in the IPs should demonstrate the same pattern.

To support our argument, we estimate using the aforementioned benchmarking estimators. It is surprising to find that the intrinsic dimensionality estimated by the Maximum Likelihood Estimation (MLE) levina2005maximum , the Minimum Neighbor Distance (MiND) Estimator lombardi2011minimum and the Dimensionality from Angle and Norm Concentration (DANCo) ceruti2014danco are , and , respectively. These results corroborate our argument, i.e., since the bifurcation point in the IPs for FERG-DB is just , all the curves for should demonstrate the same pattern regardless of the value of . Recall that this dataset is a representation of a face from a single cartoon character (or template), which can be sketched using just one line with different shapes; the result is therefore acceptable.

(a) IP-I (encoder part) when
(b) DPI when
(c) IP-I (encoder part) when
(d) DPI when
(e) IP-I (encoder part) when
(f) DPI when
(g) IP-I (encoder part) when
(h) DPI when
Figure 14: The information plane and corresponding pair-wise mutual information with different values of . (a), (c), (e) and (g) demonstrate the IP-I ( is in the encoder module) with equal to , , and respectively, whereas (b), (d), (f) and (h) demonstrate the corresponding layer-wise mutual information to illustrate the DPI.

8 Robustness analysis on kernel size

As emphasized in the main text, care must be taken to select an appropriate value for the kernel size . This paper selects . We show, in Fig. 15, that even though (hence ) is not optimized, we can still observe the same general patterns of the curves in the IP, which are controlled by the bottleneck size.

(a) IP (, )
(b) IP (, )
(c) IP (, )
(d) IP (, )
Figure 15: The robustness analysis on the selection of parameter (hence ). (a) and (b) demonstrate, when , the IPs for and respectively. (c) and (d) demonstrate, when , the IPs for and respectively. The SAE topology is fixed to “784-1000-500-250--250-500-1000-784”, where “” denotes the bottleneck layer size. The distinctions between the curves in the IP remain the same if is in a large yet reasonable range. However, it should be acknowledged that if is too small (less than ), all the estimated values saturate and approach their limit. On the other hand, if is large enough (more than ), it is hard to observe the distinctions due to the large estimation variance among different epochs.

We also show that the value of mutual information between two variables is monotonically decreasing if the kernel size for either of the variables increases. To this end, we train a basic autoencoder with topology “--” on the MNIST dataset for epochs, which we observed to be sufficient for reliable convergence. We estimate the mutual information with respect to different values (, ) in both the input layer and the bottleneck layer. Fig. 16 demonstrates the value of at the end of epoch and epoch . As can be seen, is monotonically decreasing as either of the increases. This suggests that the same trends for entropy also apply to mutual information.
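A kernel-size sweep of this kind can be sketched as follows. This is a minimal sketch under assumptions, not the experiment reported here: random data and a random nonlinear projection stand in for the input and bottleneck activations, the kernel is Gaussian, the joint entropy is taken from the normalized Hadamard product of the two Gram matrices (the usual matrix-based construction), and the order `alpha=1.01` and the grid of kernel sizes are illustrative. All names are hypothetical.

```python
import numpy as np

def gram(x, sigma):
    """Unit-trace Gaussian Gram matrix of the rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    k = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * sigma ** 2))
    return k / np.trace(k)

def entropy(a, alpha=1.01):
    """Matrix-based Renyi entropy (bits) from the eigenvalues of a unit-trace matrix."""
    lam = np.clip(np.linalg.eigvalsh(a), 0.0, None)
    return float(np.log2(np.sum(lam ** alpha)) / (1.0 - alpha))

def mutual_information(x, t, sigma_x, sigma_t, alpha=1.01):
    """I(X;T) = S(A) + S(B) - S(A o B / tr(A o B)), with a separate kernel size per variable."""
    a, b = gram(x, sigma_x), gram(t, sigma_t)
    ab = a * b                      # Hadamard (element-wise) product
    ab = ab / np.trace(ab)
    return entropy(a, alpha) + entropy(b, alpha) - entropy(ab, alpha)

# Stand-in data: x plays the role of the input, t of the bottleneck code.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4))
t = np.tanh(x @ rng.normal(size=(4, 2)))

sigmas = [0.5, 1.0, 2.0, 4.0]       # illustrative grid of kernel sizes
mi_grid = np.array([[mutual_information(x, t, sx, st) for st in sigmas]
                    for sx in sigmas])
```

Each row of `mi_grid` fixes the input kernel size and each column fixes the bottleneck kernel size, so the trend along either axis can be inspected directly or plotted as a surface.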

(a) epoch 1 (, )
(b) epoch 1 (, )
(c) epoch 1 (, )
(d) epoch 30 (, )
(e) epoch 30 (, )
(f) epoch 30 (, )
Figure 16: (a)-(c) show the mutual information value (after epoch of training) with respect to different kernel sizes in the input layer and bottleneck layer at view , , and respectively, where denotes azimuth and denotes elevation. (d)-(f) show the mutual information value (after epochs of training) with respect to different kernel sizes in the input layer and bottleneck layer at view , , and respectively.