The classical pre-training process of deep neural networks is unsupervised. It consists of learning a deep nonlinear representation of the data, which is then used to initialize a deep supervised feed-forward neural network. This pre-training is usually done by greedily learning and stacking simple learning modules. It is hypothesized that unsupervised pre-training is useful because the nonlinear representation captures the manifold shape of the input distribution, so that nonlinear variations in the input become linear variations of the representation vector.
We identify a learning module as a model that provides a conditional distribution involving two random vectors $x$ and $h$: $x$ represents the input or visible variables, and $h$ the features or hidden variables. Recently, much research effort has focused on these modules. It has given rise to a large number of models which essentially differ by the kind of information extracted from $x$ to form the features $h$. We identify this information as the mutual information $I(x;h)$. Using a generative model of $x$ to learn $h$ lets us see what kind of information contributes to $I(x;h)$ through the following decomposition:

$$I(x;h) = H(x) - H(x \mid h),$$
where $H$ is the entropy functional. The underlying modeling hypotheses define the information conveyed by $h$. For example, in a factored RBM, the factors allow, when $h$ is observed, modeling some dependencies between the components of $x$; this information is captured in $I(x;h)$. The remaining information is left in $H(x \mid h)$; this includes, for example, higher-order dependencies. It is also possible to hide or reveal some information by pre-processing the data, and learning a generative model on the transformed data can be easier. A popular pre-processing step is sphering, which decorrelates the components of $x$; this helps learning an ICA model by determining half of its parameters. It is also possible to learn $h$ without a generative model of $x$, for example with auto-encoders. We believe that a desirable property is a mutual information that represents information useful for solving the supervised problem easily, e.g. with a linear model using $h$ as input.
The information can be revealed by more or less complex interactions between the components of $h$; this also influences how easily the supervised problem can be solved from the representation $h$. For example, suppose that $x$ is Bernoulli with $p(x=1)=1/2$, and suppose that $h=(h_1,h_2)$
follows a uniform distribution on $\{0,1\}^2$. A generative model of $x$ could be $x = h_1 \oplus h_2$. In that case, if one component $h_1$ or $h_2$ is not observed, it is not possible to determine the value of $x$; in other words, $I(x;h_1) = I(x;h_2) = 0$. Information about $x$ is revealed only by observing both components of $h$: its value is determined by an interaction between components corresponding to the xor function. To minimize interactions between components, one can consider a learning objective that maximizes $I(x;h_j)$ for each component $h_j$. A particular setting is obtained when the components of $h$ are independent and conditionally independent given $x$; in this case $I(x;h) = \sum_j I(x;h_j)$. We consider a more general objective which consists of maximizing the mutual information $I(x;h_C)$, where $C$ is any subset of components of $h$. With an emphasis on small subsets, this maximization leads to representations that are robust: even if some components of $h$ are not observed, we still have information about $x$. In this work we show that sparse coding helps to obtain such representations. We shall see that this objective can be derived from another one which integrates the supervised signal.
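This xor example can be checked numerically: with $h$ uniform on $\{0,1\}^2$ and the generative model $x = h_1 \oplus h_2$, each single component carries zero information about $x$, while the pair determines it completely. A minimal sketch (the helper below is illustrative, not from the paper):

```python
import math
from itertools import product

def mutual_information(pairs):
    """I(A;B) in nats from a list of equiprobable (a, b) outcomes."""
    n = len(pairs)
    pa, pb, pab = {}, {}, {}
    for a, b in pairs:
        pa[a] = pa.get(a, 0.0) + 1.0 / n
        pb[b] = pb.get(b, 0.0) + 1.0 / n
        pab[(a, b)] = pab.get((a, b), 0.0) + 1.0 / n
    return sum(p * math.log(p / (pa[a] * pb[b])) for (a, b), p in pab.items())

# h = (h1, h2) uniform on {0,1}^2, generative model x = h1 XOR h2.
outcomes = [(h1 ^ h2, (h1, h2)) for h1, h2 in product([0, 1], repeat=2)]

i_pair = mutual_information(outcomes)                             # I(x; (h1, h2)) = ln 2
i_single = mutual_information([(x, h[0]) for x, h in outcomes])   # I(x; h1) = 0
```

Maximizing $I(x;h_j)$ for each component alone would therefore reject this representation, even though $h$ as a whole captures $x$ perfectly.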
Few models have focused on explicitly integrating the supervision signal during the pre-training process. This is an important question, since nothing guarantees that information is represented in $h$ in such a way that the supervision can be easily disentangled by a simple model using $h$ as input. For example, a manifold of the data learned in an unsupervised scheme gives no guarantee that the supervision, e.g. discrete classes, splits the manifold into easily separable parts. Another motivation is that the distribution of $x$ may be too complicated to be properly modeled with a simple model: the mutual information that can be learned is limited by the model's capacity, so it is important that this capacity be spent on information useful for the supervised task. We denote $y$ the variable representing the supervision, e.g. labels. Previous related works can be interpreted as a joint optimization of $I(x;h)$ and $I(y;h)$. We propose to maximize the mutual information $I(y;h_C)$ for any subset $C$ of components of $h$. This objective leads to distributions that are robust: if some components of $h$ are noisy or give misleading information about $y$, other components can still fill the gap of information about $y$. Moreover, we can show that it helps to model $p(y \mid h)$ with a simple model like Naive Bayes, because it yields components of $h$ that are conditionally independent given $y$.
2 Learning objective
We aim to learn a model of $p(y \mid x)$, where $x$ and $y$ are two random vectors which respectively represent the input and the output, and we have a set $D$ of samples of their joint distribution. We model $p(y \mid x)$ with a deep feed-forward neural network initialized with a deep representation. We hypothesize that the deep representation is a distribution that factorizes over multiple layers as follows:

$$p(h^{(1)},\dots,h^{(L)} \mid x) = p(h^{(1)} \mid x) \prod_{l=2}^{L} p(h^{(l)} \mid h^{(l-1)}).$$
As in previous deep learning work, the representation is trained by greedily stacking simpler learning modules, each extracting features from the previous layer. We model a module by a parameterized distribution $p_\theta(h \mid v)$, where $v$ and $h$ are the observed and hidden random vectors. For layer $l$ we have $h = h^{(l)}$ and, if $l = 1$, then $v = x$, else $v = h^{(l-1)}$. Note that observations of $v$ also come with observations of $y$ during the training phase.
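The greedy stack can be sketched as follows, assuming each trained module exposes a factorized posterior mean over its hidden units (the sigmoid parameterization is an illustrative choice, e.g. the posterior of an RBM):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deep_representation(x, weights, biases):
    """Infer every layer of the deep representation: layer l computes
    p(h^(l)_j = 1 | v) from v = x for l = 1 and v = h^(l-1) otherwise,
    matching the layer-wise factorization of the deep posterior."""
    v = x
    layers = []
    for W, b in zip(weights, biases):
        v = sigmoid(v @ W + b)   # activation probabilities of this layer
        layers.append(v)
    return layers
```

During pre-training, each pair `(W, b)` would be learned in turn by a module trained on the activations of the layer below.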
We suppose that $h$ has $n$ components and denote $h_j$ the $j$-th component of $h$. We write $H(\cdot)$ for the Shannon entropy (or the differential entropy if the variables are continuous), and $I(\cdot\,;\cdot)$ for the mutual information between two variables.
We suppose that the inference of $h$ is easy by assuming that the posterior factorizes over components:

$$p(h \mid v) = \prod_{j=1}^{n} p(h_j \mid v). \qquad (1)$$
2.2 Learning objective
We denote $S_k$ the set of all subsets of components of $h$ of size $k$. We consider the following objective:

$$\max_\theta\ \sum_{k=1}^{n} \alpha_k \sum_{C \in S_k} I(y; h_C). \qquad (2)$$
The coefficients $\alpha_k$ are positive hyper-parameters. This objective maximizes the mutual information $I(y;h_C)$; maximization for subsets of size $k$ can be emphasized by a high value of $\alpha_k$. Let $C \in S_k$; then for any component $h_j$ in $C$ we can write, by the chain rule, $I(y;h_C) = I(y; h_j \mid h_{C \setminus \{j\}}) + I(y; h_{C \setminus \{j\}})$. This consideration allows us to equivalently express (2) as component-wise sums:

$$\max_\theta\ \sum_{k=0}^{n-1} \beta_k \sum_{(j,C) \in P_k} I(y; h_j \mid h_C), \qquad (3)$$

with $\beta_k$ positive coefficients obtained from the $\alpha_k$ and binomial coefficients (the relation can be proved by recursion), and $P_k$ the set of pairs $(j, C)$ defined by $P_k = \{(j, C) : C \in S_k,\ h_j \notin C\}$.
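The chain-rule decomposition used above is a standard identity; it can be verified numerically on an arbitrary small joint distribution over $(y, h_1, h_2)$ (all helpers below are illustrative):

```python
import math
from itertools import product

def entropy(p):
    """Shannon entropy in nats of a dict {outcome: probability}."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def marginal(joint, keep):
    """Marginalize {(y, h1, h2): prob} onto the given index tuple."""
    out = {}
    for outcome, pr in joint.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + pr
    return out

def mi(joint, a, b):
    """I(A; B) = H(A) + H(B) - H(A, B)."""
    return entropy(marginal(joint, a)) + entropy(marginal(joint, b)) \
        - entropy(marginal(joint, a + b))

def cond_mi(joint, a, b, c):
    """I(A; B | C) = H(A, C) - H(C) - H(A, B, C) + H(B, C)."""
    return entropy(marginal(joint, a + c)) - entropy(marginal(joint, c)) \
        - entropy(marginal(joint, a + b + c)) + entropy(marginal(joint, b + c))

# Arbitrary joint distribution over (y, h1, h2); index 0 is y.
joint = {(y, h1, h2): (1 + y + 2 * h1 * h2) / 16.0
         for y, h1, h2 in product([0, 1], repeat=3)}

lhs = mi(joint, (0,), (1, 2))                                  # I(y; (h1, h2))
rhs = cond_mi(joint, (0,), (1,), (2,)) + mi(joint, (0,), (2,))
```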
To compute this objective using the model $p_\theta(h \mid v)$, we need the distribution of $(v, y)$, which is unknown; however, we can use the data set $D$ to obtain an estimate. Denoting $\hat{I}$ the resulting estimator of the mutual information, the learning objective becomes:

$$\max_\theta\ \sum_{k=0}^{n-1} \beta_k \sum_{(j,C) \in P_k} \hat{I}(y; h_j \mid h_C). \qquad (4)$$
We use this objective to learn the modules, and we expect that stacking them allows us to greedily obtain better solutions, with higher values of (4).
2.3 Adding constraints to the objective
Since $I(y; h_j \mid h_C) = H(h_j \mid h_C) - H(h_j \mid h_C, y)$, the maximization of (4) can be carried out by maximizing the difference between the estimates of these two entropies. This can yield different solutions depending on the value of $H(h_j \mid h_C)$. We make the prior hypothesis that the estimate is more robust if $H(h_j \mid h_C)$ is high, with redundant information about $y$. We suppose that $v$ represents a source of information that may help to satisfy this property. Therefore, we propose to increase the entropy $H(h_j \mid h_C)$ by increasing the mutual information $I(v; h_j \mid h_C)$.
Constrained learning objective:
We propose to find a solution of (4) by solving:

$$\max_\theta\ \sum_{k=0}^{n-1} \beta_k \sum_{(j,C) \in P_k} \Big[ \hat{I}(v; h_j \mid h_C) - \hat{H}(h_j \mid h_C, y) \Big]. \qquad (5)$$

This learning objective corresponds to a constrained version of (4), where the coefficients $\beta_k$ imply a corresponding coefficient on each entropy term. (Note that this requires some hypotheses if the variables are continuous, because we then consider differential entropies, which can diverge to $-\infty$. This divergence is avoided if we assume noise in the distribution, which bounds its differential entropy with a finite value. If the model is such that $H(h_j \mid v)$ does not depend on the parameters, maximizing $I(v; h_j)$ is equivalent to maximizing $H(h_j)$.)
However, solving problem (5) is not tractable because:
the size of the set $S_k$ is the binomial coefficient $\binom{n}{k}$, so the sum over its elements is not tractable for intermediate values of $k$,
computing $\hat{H}(h_j \mid h_C)$ and $\hat{H}(h_j \mid h_C, y)$ has a complexity of $O(m\, q^{k+1})$, where $q = 2^b$ is the number of values that a component can take with a discretization using $b$ bits, and $m$ is the size of $D$.
To find solutions to problem (5), we propose to consider two sub-problems:

maximization of the conditional mutual information $\hat{I}(v; h_j \mid h_C)$,

minimization of the conditional entropy $\hat{H}(h_j \mid h_C, y)$.
In the next sections we show how we can approximately optimize these two quantities.
2.4 Maximization of the conditional mutual information
For clarity, we drop the estimator notation and the parameters $\theta$ in this section. We show how to maximize:

$$I(v; h_j \mid h_C). \qquad (6)$$
The idea is as follows:
For small sets $C$, we spread the conditional mutual information over the components of $h$ to obtain a lower bound on $I(v; h_j \mid h_C)$. This lower bound is a function of $I(v; h_j)$; thus increasing the mutual information also increases the conditional mutual information.
For large sets $C$, we show how sparsity can be used to increase $I(v; h_j \mid h_C)$.
Spreading the information:
We say that the information is spread to depth $p$ if, for all $k \le p$ and all pairs $(j, C) \in P_k$, the conditional entropy takes its maximal value:

$$H(h_j \mid h_C) = H(h_j), \qquad (7)$$

i.e. $h_j$ is independent of any subset $C$ of at most $p$ other components.
Spreading information is not tractable for large depths $p$; in practice we will not be able to spread the information beyond small depths. Note also that by (1) we have:

$$H(h_j \mid h_C, v) = H(h_j \mid v). \qquad (8)$$
Lower bound:
If the information is spread to depth $p$, then we have, for $|C| \le p$:

$$I(v; h_j \mid h_C) \ge I(v; h_j). \qquad (9)$$
Since the mutual information can be written:

$$I(v; h_j \mid h_C) = H(h_j \mid h_C) - H(h_j \mid h_C, v), \qquad (10)$$

the spreading to depth $p$ implies, for $|C| \le p$:

$$H(h_j \mid h_C) = H(h_j). \qquad (11)$$

Let $C$ be such that $h_j \notin C$; the hypothesis (1) allows us to write $H(h_j \mid h_C, v) = H(h_j \mid v)$. Then we have $I(v; h_j \mid h_C) = H(h_j) - H(h_j \mid v) = I(v; h_j)$, which gives (9).
We see that for $|C| \le p$ the bound (9) grows with $I(v; h_j)$, so $I(v; h_j \mid h_C)$ can be increased by increasing $I(v; h_j)$. But this is not the case for sets $C$ with $|C| > p$: for example, under (1) we have $I(v; h_j \mid h_C) = I(v; h_j) - I(h_j; h_C)$, so the bound depends on $I(h_j; h_C)$ and is low if it is high. However, we can use sparsity to control the growth of $I(h_j; h_C)$ and guarantee a higher bound on $I(v; h_j \mid h_C)$.
Spreading and sparsity:
Constraining $I(h_j; h_C)$ can be done by reducing the entropies $H(h_j)$, since $I(h_j; h_C) \le H(h_j)$; we can do this by constraining the probabilities (or densities) $p(h_j = 0)$ to a high value, which leads $h$ to be sparse.
Suppose that the information is spread to depth $p$. Since we have $I(v; h_j \mid h_C) \le H(h_j)$, constraining $H(h_j)$ does not allow $I(v; h_j \mid h_C)$ to exceed this entropy. Thus sparsity may be useful to increase the conditional mutual information for large sets $C$, but too much sparsity may hurt the conditional mutual information for smaller sets $C$.
2.4.1 Optimization in the binary case
We propose to optimize the conditional mutual information in the case where $h \in \{0,1\}^n$. This case allows us to easily optimize the spread of information to depth one. First we show that under some conditions we can estimate the mutual information by the entropies $H(h_j \mid h_C)$; then we deduce a simple way to optimize the spread to depth one.
Estimating the mutual information:
If $p(h_j = 1 \mid v)$ saturates to zero or one, then for all components $h_j$ and all pairs $(j, C)$ we have:

$$I(v; h_j \mid h_C) = H(h_j \mid h_C). \qquad (12)$$

This is because $H(h_j \mid h_C, v) = 0$, which is implied by $H(h_j \mid v) = 0$.
Optimization may thus be done by saturating the probabilities $p(h_j = 1 \mid v)$ to one or to zero. Note that previous work advocates such optimization, but with a motivation related to an invariance property of the representation.
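This estimate can be checked on a toy distribution where a feature is a deterministic, hence saturated, function of $v$ (the particular feature is arbitrary):

```python
import math

def H(p):
    """Shannon entropy in nats of a dict {outcome: probability}."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)

# v uniform over four values; the feature h_j is deterministic given v,
# i.e. p(h_j = 1 | v) is saturated to 0 or 1 for every v.
vs = [0, 1, 2, 3]
feature = lambda v: int(v >= 2)

p_v = {v: 1.0 / len(vs) for v in vs}
p_joint = {(v, feature(v)): 1.0 / len(vs) for v in vs}
p_h = {}
for v in vs:
    p_h[feature(v)] = p_h.get(feature(v), 0.0) + 1.0 / len(vs)

h_cond = H(p_joint) - H(p_v)     # H(h_j | v) = H(v, h_j) - H(v) = 0
i_vh = H(p_h) - h_cond           # I(v; h_j) = H(h_j)
```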
Spreading information to depth one:
We estimate the conditional mutual information by the conditional entropy and suppose, without loss of generality, that $p(h_j = 1) \le 1/2$ for every component. It can then be verified that the information is spread to depth one if:

$$p(h_j = 1) = s_1 \ \ \text{for all } j, \qquad p(h_j = 1, h_{j'} = 1) = s_2 \ \ \text{for all } j \ne j', \qquad (13)$$

with $s_2 = s_1^2$ ensuring exact pairwise independence. So, under the condition $p(h_j = 1) \le 1/2$, spreading information to depth one in the binary case can be done by constraining the probabilities of activation of the components to one value, and the probabilities of joint activation of pairs of components to another value.
Optimization of (13) can be done by minimizing sums of Kullback-Leibler divergences, reintroducing the parameters $\theta$ of the model:

$$\Omega_1(\theta) = \sum_{j} \mathrm{KL}\big(\mathcal{B}(s_1) \,\big\|\, p_\theta(h_j)\big), \qquad \Omega_2(\theta) = \sum_{j \ne j'} \mathrm{KL}\big(\mathcal{B}(s_2) \,\big\|\, p_\theta(h_j h_{j'})\big), \qquad (14)$$

where $\mathcal{B}(s)$ is the Bernoulli distribution of parameter $s$. The constraints (13) are satisfied if both functions $\Omega_1$ and $\Omega_2$ equal zero.
We propose to set $s_1$ and $s_2$ as hyper-parameters; note that $s_1$ also controls the sparsity level of $h$. For simplicity we will choose $s_2 = s_1^2$, which leads the components to be pairwise independent.
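The resulting regularizer can be sketched as follows, assuming the module outputs a batch of activation probabilities $p(h_j = 1 \mid v)$; per-component mean activations are pushed toward a target $s_1$ and pairwise joint activations toward $s_2$. Function names and the direction of the KL divergences are our choices, not necessarily the paper's exact loss:

```python
import numpy as np

def kl_bernoulli(p, q):
    """KL(Ber(p) || Ber(q)), broadcasting over arrays."""
    eps = 1e-8
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def spread_penalty(probs, s1, s2):
    """probs: (m, n) array of p(h_j = 1 | v) over a batch of m samples.
    The depth-zero term ties every mean activation to s1 (sparsity); the
    depth-one term ties every pairwise joint activation to s2 (choosing
    s2 = s1**2 pushes components toward pairwise independence)."""
    m, n = probs.shape
    mean_act = probs.mean(axis=0)            # estimate of p(h_j = 1)
    joint = (probs.T @ probs) / m            # p(h_j = 1, h_j' = 1), using (1)
    off_diag = joint[~np.eye(n, dtype=bool)]
    return kl_bernoulli(s1, mean_act).sum() + kl_bernoulli(s2, off_diag).sum()
```

The joint-activation estimate relies on the factorized posterior hypothesis (1): components are conditionally independent given $v$.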
Optimized function: let $\mathcal{L}_I$ be the loss function optimized to learn the mutual information $I(v;h)$ with the help of the training set $D$, inducing a model $p_\theta(h \mid v)$. For example, $\mathcal{L}_I$ can be the negative log-likelihood of a generative model, or a reconstruction error if $h$ is learned using an auto-encoder. We define a loss function which allows us to optimize (6), by jointly optimizing $\mathcal{L}_I$ and the spread of information:

$$\mathcal{L}_{spread} = \mathcal{L}_I + \lambda_1 \Omega_1 + \lambda_2 \Omega_2,$$

where $\Omega_1$ and $\Omega_2$ are the Kullback-Leibler penalties enforcing the constraints (13), and $\lambda_1, \lambda_2$ are positive weights.
Note that this loss does not explicitly include an optimization of the saturation of the probabilities, i.e. a minimization of $H(h_j \mid v)$. This is because we hypothesize that the combination of the sparsity constraint and the model behind $\mathcal{L}_I$ does not allow a high entropy $H(h_j \mid v)$. However, relaxing the sparsity constraint and adding an explicit optimization of this entropy is a path that would be interesting to explore.
2.5 Minimization of the conditional entropy
In this section, we show how to minimize the entropy of $h$ conditioned on the supervision variable $y$. This is easier than increasing the mutual information, as seen in the previous sub-section, because it basically consists of deleting information. We want to minimize:

$$\sum_{k=0}^{n-1} \beta_k \sum_{(j,C) \in P_k} H(h_j \mid h_C, y). \qquad (15)$$
Since we have $H(h_j \mid h_C, y) \le H(h_j \mid y)$, we can minimize (15) by minimizing the upper bound:

$$\sum_{j=1}^{n} H(h_j \mid y). \qquad (16)$$
Minimizing (16) also minimizes the (conditional) total correlation:

$$\mathrm{TC}(h \mid y) = \sum_{j=1}^{n} H(h_j \mid y) - H(h \mid y).$$
The (conditional) total correlation is non-negative and equals zero if and only if the components of $h$ are independent conditionally on $y$. This kind of independence is the hypothesis made by Naive Bayes models; therefore minimizing (16) allows us to use them to model $p(y \mid h)$. In particular, we can prove that if the components of $h$ are conditionally independent given $y$, if $y$ is countable, $h$ binary, and the conditional activation probabilities lie strictly between zero and one, then $p(y \mid h)$ can be modeled by a linear model.
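The linearity claim is easy to verify: under the Naive Bayes factorization with binary components and activation probabilities strictly inside $(0,1)$, $\log p(y, h)$ is an affine function of $h$, so classes are separated by linear score functions. A sketch with hypothetical per-class probabilities:

```python
import math

# Hypothetical Naive Bayes parameters: p(h_j = 1 | y) for two classes.
p_act = {0: [0.9, 0.2], 1: [0.3, 0.8]}
prior = {0: 0.5, 1: 0.5}

def log_joint(y, h):
    """log p(y, h) under the Naive Bayes factorization."""
    lp = math.log(prior[y])
    for hj, pj in zip(h, p_act[y]):
        lp += math.log(pj if hj else 1.0 - pj)
    return lp

def linear_score(y, h):
    """The same quantity written as a linear function w . h + b."""
    b = math.log(prior[y]) + sum(math.log(1.0 - pj) for pj in p_act[y])
    w = [math.log(pj / (1.0 - pj)) for pj in p_act[y]]
    return b + sum(wj * hj for wj, hj in zip(w, h))
```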
As in the previous sub-section, we suppose that $h$ is binary. Since many supervised problems are classification tasks involving a uniform discrete random variable $y$ taking a small number of values, we propose to develop an optimizable function to minimize (16) under this hypothesis.
2.5.1 Optimization in the binary case with a small number of classes
We suppose that $h \in \{0,1\}^n$, and that $y$ is a uniform discrete random variable taking values in $\{1,\dots,c\}$, so that $p(y = t) = 1/c$. We also suppose that $n$ is sufficiently large and the activation probabilities sufficiently small, so that for every component we have $p(h_j = 1) \le 1/c$. Let $h_j$ be a component of $h$ and consider the joint distribution of $(h_j, y)$; we have $H(h_j \mid y) = \sum_{t=1}^{c} p(y = t)\, H(h_j \mid y = t)$.
We can show that a distribution satisfying these hypotheses minimizes (16) if it also satisfies:

$$p(h_j = 1 \mid y = t) \in \{0,\; c\, p(h_j = 1)\} \quad \text{for all } j, t. \qquad (17)$$
We assign to each component $h_j$ a class by defining a surjection $\sigma$ from components to classes (this is possible if $n \ge c$; we suppose that this is the case). The distribution is of the form (17) if and only if we have:

$$p(h_j = 1 \mid y = \sigma(j)) = c\, p(h_j = 1) \quad \text{and} \quad p(h_j = 1 \mid y \ne \sigma(j)) = 0. \qquad (18)$$
This can be optimized using the data set $D$ by minimizing the following Kullback-Leibler divergences:

$$\sum_{(v, y) \in D}\; \sum_{j} \mathrm{KL}\Big( \mathcal{B}\big( \mathbb{1}[\sigma(j) = y]\; c\, p(h_j = 1) \big) \,\Big\|\, p_\theta(h_j \mid v) \Big), \qquad (19)$$

where $\mathcal{B}(s)$ is the Bernoulli distribution of parameter $s$, $(v, y)$ is a sample of $D$, and $\mathbb{1}[\cdot]$ is the indicator function: it equals $1$ if $\sigma(j) = y$ and $0$ otherwise.
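A sketch of this supervised penalty, assuming a surjection `class_of` from components to classes and two hypothetical activation targets `t_on` and `t_off` (in the paper the targets follow from the activation level and the number of classes):

```python
import numpy as np

def kl_ber(p, q):
    """KL(Ber(p) || Ber(q))."""
    eps = 1e-8
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def supervised_spread(probs, labels, class_of, t_on, t_off):
    """probs: (m, n) activation probabilities; labels: (m,) class labels;
    class_of[j]: class assigned to component j (the surjection). Pushes
    each component's mean activation toward t_on on its own class and
    toward t_off on the other classes."""
    loss = 0.0
    for j, c in enumerate(class_of):
        on = probs[labels == c, j].mean()
        off = probs[labels != c, j].mean()
        loss += float(kl_ber(t_on, on) + kl_ber(t_off, off))
    return loss
```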
2.6 Joint optimization
We propose a global loss that jointly optimizes the maximization of the conditional mutual information and the minimization of the conditional entropy:

$$\mathcal{L} = \mathcal{L}_{spread} + \gamma\, \mathcal{L}_{sup},$$

where $\mathcal{L}_{spread}$ is the loss of Section 2.4.1, $\mathcal{L}_{sup}$ the supervised penalty of Section 2.5.1, and the hyper-parameters are the positive weights of the two terms (with $\gamma \ge 0$).
We used two data sets, MNIST and CIFAR-BW. MNIST is the well-known digit classification data set. We used 50000 examples for training, 10000 for validation, and tested on the official 10000 test examples. CIFAR-BW is a gray-scale version of the CIFAR-10 data set, obtained by averaging the RGB values. This data set represents an image-classification task with 10 classes. We trained on 40000 examples, with 10000 for validation and 10000 for test.
We trained a one-hidden-layer representation on MNIST with an RBM and optimized sparsity and the spread of information to depth one. We compared the conditional mutual information obtained with that of a Sparse RBM, i.e. an RBM trained with a sparsity regularization, which corresponds to optimizing the spread to depth zero. Figure 1 shows the histogram of the minimal mutual information carried by the components with a conditioning set of size one: for a component $h_j$, it is $\min_{j' \ne j} I(x; h_j \mid h_{j'})$, where $x$ follows the distribution of MNIST. We see that the Sparse RBM does not prevent two components from being completely redundant, which is characterized by an information of 0 nat, while spreading the information over components to depth one prevents this discrepancy.
Figure 2 compares classification performance on MNIST and CIFAR-BW when the spread is optimized to depth zero only (Sparse RBM/GRBM) or to depth one. For the loss $\mathcal{L}_I$ we used an RBM on MNIST and a Gaussian RBM (GRBM) on CIFAR-BW; the Sparse GRBM is a GRBM trained with the same regularization as the Sparse RBM. For each case we show the best performance obtained after a grid search on the hyper-parameters, using a model with one hidden layer of 1024 components. Each result is the mean classification error over 30 runs with different random initializations of the parameters.
Sparse RBM / GRBM: 1.36 (MNIST), 49.1 (CIFAR-BW) mean classification error.
Figure 3 shows box plots of the classification error on MNIST using a model with 2 hidden layers of 1000 components each, when optimizing an RBM only, optimizing the information spread to depth one alone, optimizing an RBM along with the spread, and optimizing both the spread and the supervised term. (The numbers of runs are 5, 6, 11, and 18 respectively.) Optimizing the spread alone yields worse performance with 2 hidden layers than with one hidden layer only (see Figure 2). However, its effect appears to be beneficial when optimized along with the supervised term.
Figure 4 shows the effect of the supervised term on the probabilities of the components. Each square displays the hidden representation of an example from MNIST; the black dots represent components with a high probability of being equal to one, and components on the same row are specialized for the same label (their $\sigma$ values are equal). For each layer, we increased the weight of the supervised term by a constant factor. We see that the distribution of the components shifts toward the distribution of the supervision as depth increases, while the loss tries to keep as much information as possible about the input.
(Figure 4 panels, from left to right: first, second, and third hidden layers.)
We have introduced an objective based on information theory that aims to discover robust representations. The objective is not tractable, so we derived a surrogate loss, a weighted sum of two terms. The first term maximizes the entropy of the components of the representation while keeping their redundancy small; we have seen its relations to sparse coding methods. We proposed to increase this entropy using information expressed by the input distribution, which links our approach to unsupervised pre-training methods. The second term minimizes the entropy of the components conditioned on the supervision signal. This leads to components that are conditionally independent given the supervision, which allows the distribution of the supervision to be modeled with a Naive Bayes model using the representation as input. We worked in the context of deep learning, where the deep representation is obtained by greedily training and stacking simple modules; the final model is a deep feed-forward neural network initialized with this representation. A set of experiments has shown promising results. We have seen that pre-training a neural network with the unsupervised term alone gives good results for a shallow representation, but that adding a hidden layer worsens performance; using both terms gives our best performance. This advocates the integration of the supervised signal during pre-training, although more experiments have to be done to confirm these results.
- A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley, 2001.
- Anthony J. Bell and Terrence J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 1995.
- Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 2009.
- Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems 19, 2007.
- Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
- Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
- Hugo Larochelle and Yoshua Bengio. Classification using discriminative restricted Boltzmann machines. In Proceedings of the 25th International Conference on Machine Learning (ICML '08), 2008.
- Honglak Lee, Chaitanya Ekanadham, and Andrew Y. Ng. Sparse deep belief net model for visual area V2. In Advances in Neural Information Processing Systems 20, pages 873-880, 2008.
- Michael S. Lewicki and Terrence J. Sejnowski. Learning overcomplete representations. Neural Computation, 12(2):337-365, 2000.
- Marc'Aurelio Ranzato and Martin Szummer. Semi-supervised learning of compact document representations with deep networks. In Proceedings of the 25th International Conference on Machine Learning (ICML '08), 2008.
- Marc'Aurelio Ranzato, Y-Lan Boureau, and Yann LeCun. Sparse feature learning for deep belief networks. In Advances in Neural Information Processing Systems (NIPS 2007), 2007.
- Marc'Aurelio Ranzato, Alex Krizhevsky, and Geoffrey E. Hinton. Factored 3-way restricted Boltzmann machines for modeling natural images. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS '10), 2010.
- Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning (ICML '11), 2011.
- Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning (ICML '08), 2008.
- Max Welling, Michal Rosen-Zvi, and Geoffrey E. Hinton. Exponential family harmoniums with an application to information retrieval. In NIPS, 2004.