Conclusions and perspectives
We have established that there is a strong relationship between minimizing a codelength of the data and minimizing reconstruction error using an auto-encoder. A variational approach provides a bound on data codelength in terms of the reconstruction error to which certain regularization terms are added.
The additional terms in the codelength bounds can be interpreted as a denoising condition from features to reconstructed output. This is in contrast with previously proposed denoising auto-encoders, in which the noise is applied to the inputs. For neural networks, this criterion can be trained using standard backpropagation techniques.
The codelength approach determines an optimal noise level for this denoising interpretation, namely, the one that will provide the tightest codelength. This optimal noise is approximately the inverse Hessian of the reconstruction function, for which several approximation techniques exist in the literature.
A practical consequence is that the noise level should be set differently for each data sample in a denoising approach.
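As a rough illustration of why the noise level ends up sample-dependent, the following sketch computes the inverse Hessian of a reconstruction loss for a few samples. Everything in it is an assumption made for the example: the one-dimensional decoder `g`, the output standard deviation `sigma`, and the use of a finite-difference Hessian with respect to a single scalar feature. Any contribution of the elementary model on the features is ignored.

```python
import math

def g(y):
    """Hypothetical 1-D generative model (decoder), chosen only for illustration."""
    return math.tanh(2.0 * y)

def recon_loss(y, x, sigma=0.1):
    """Squared reconstruction error scaled by an assumed output variance sigma**2."""
    return (g(y) - x) ** 2 / (2.0 * sigma ** 2)

def hessian_wrt_feature(y, x, h=1e-4):
    """Central finite-difference second derivative of the loss with respect to y."""
    return (recon_loss(y + h, x) - 2.0 * recon_loss(y, x) + recon_loss(y - h, x)) / h ** 2

def noise_variance(y, x, floor=1e-8):
    """Per-sample noise variance taken as the inverse of the scalar Hessian."""
    return 1.0 / max(hessian_wrt_feature(y, x), floor)

features = [0.0, 0.5, 1.5]
# Each sample is taken to be perfectly reconstructed (x = g(y)), so the
# Hessian reduces to g'(y)**2 / sigma**2 up to finite-difference error.
variances = [noise_variance(y, g(y)) for y in features]
for y, v in zip(features, variances):
    print(f"feature y={y:4.1f}  noise variance ~ {v:.3e}")
```

In this toy setting the variance grows where the decoder saturates: features to which the output is insensitive tolerate more noise, which is the qualitative behaviour described above.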
Under certain approximations, the codelength approach also translates as a penalty for large derivatives from feature to output. This differs from the penalty posited in contractive auto-encoders, which bears on the derivatives from input to features. However, the resulting criterion is hard to train for complex models such as multilayer neural networks. More work is needed on this point.
Including the variances of the outputs as parameters results in better compression bounds and a modified reconstruction error involving the logarithms of the square errors together with the data quantization level. Still, having these variances as parameters is a modeling choice that may be relevant for compression but not in applications where the actual reconstruction error is considered.
It would be interesting to explore the practical consequences of these insights. Another point in need of further inquiry is how this codelength viewpoint combines with the stacking approach to deep learning, namely, after the data $x$ have been learned using features $y$ and an elementary model for $y$, to further learn a finer model of $y$. For instance, it is likely that there is an interplay, in the denoising interpretation, between the noise level used on $y$ when computing the codelength of $x$, and the output variance used in the definition of the reconstruction error of a model of $y$ at the next level. This would require modeling the transmission of noise from one layer to another in stacked generative models and optimizing the levels of noise to minimize a resulting bound on codelength of the output.
- [AO12] Ludovic Arnold and Yann Ollivier. Layer-wise training of deep generative models. Preprint, arXiv:1212.1524, 2012.
- [Bis95] Christopher M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995.
- [Bis06] Christopher M. Bishop. Pattern recognition and machine learning. Springer, 2006.
- [GCB97] Yves Grandvalet, Stéphane Canu, and Stéphane Boucheron. Noise injection: Theoretical prospects. Neural Computation, 9(5):1093–1108, 1997.
- [Gra11] Alex Graves. Practical variational inference for neural networks. In John Shawe-Taylor, Richard S. Zemel, Peter L. Bartlett, Fernando C. N. Pereira, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain, pages 2348–2356, 2011.
- [Grü07] Peter D. Grünwald. The minimum description length principle. MIT Press, 2007.
- [HS06] Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313:504–507, 2006.
- [HvC93] Geoffrey E. Hinton and Drew van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Lenny Pitt, editor, Proceedings of the Sixth Annual ACM Conference on Computational Learning Theory, COLT 1993, Santa Cruz, CA, USA, July 26-28, 1993, pages 5–13. ACM, 1993.
- [KW13] Diederik P. Kingma and Max Welling. Stochastic gradient VB and the variational auto-encoder. Preprint, arXiv:1312.6114, 2013.
- [LBOM96] Yann LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Genevieve B. Orr and Klaus-Robert Müller, editors, Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science, pages 9–50. Springer, 1996.
- [Oll13] Yann Ollivier. Riemannian metrics for neural networks I: feedforward networks. Preprint, http://arxiv.org/abs/1303.0818, 2013.
- [PH87] David C. Plaut and Geoffrey Hinton. Learning sets of filters using back-propagation. Computer Speech and Language, 2:35–61, 1987.
- [RVM+11] Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 833–840. Omnipress, 2011.
- [VLL+10] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408, 2010.
Appendix: Derivative of $\log\det H$ for multilayer neural networks
The codelength bound from Theorem 5 involves a term $\log\det H$ where $H$ is the Hessian of the loss function for input $x$. Optimizing this term with respect to the model parameters is difficult in general.
We consider the case when the generative model is a multilayer neural network. We provide an algorithm to compute the derivative of the $\log\det H$ term appearing in Theorem 5 with respect to the network weights, using the layer-wise diagonal Gauss–Newton approximation of the Hessian from [LBOM96]. The algorithm has the same asymptotic computational cost as backpropagation.
So let the generative model be a multilayer neural network with activation function $s$. The activity of unit $i$ is
$$a_i = s(V_i), \qquad V_i = \sum_{j \to i} w_{ji}\, a_j,$$
where the sum includes the bias term via the always-activated unit $j = 0$ with $a_0 \equiv 1$.
Let $L$ be the loss function of the network.
The layer-wise diagonal Gauss–Newton approximation computes an approximation $\ell_i$ to the Hessian $\partial^2 L / \partial a_i^2$ in the following way [LBOM96, Sections 7.3–7.4]: On the output units $k$, $\ell_k$ is directly set to $\partial^2 L / \partial a_k^2$, and this is backpropagated through the network via
$$\ell_i := \sum_{j,\; i \to j} w_{ij}^2\, s'(V_j)^2\, \ell_j \tag{42}$$
so that computing $\ell$ is similar to backpropagation using squared weights. This is also related to the backpropagated metric from [Oll13].
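The recursion can be sketched in a few lines for a plain layered network. This is only an illustration under stated assumptions: the function names, the tanh activation, the 2-3-2 architecture and the omission of bias units are all choices made for the example. Each unit receives the sum, over the units it feeds into, of the squared weight times the squared activation derivative times the value already computed at that unit.

```python
import math

def s(v):
    """Activation function (tanh is an arbitrary choice for this sketch)."""
    return math.tanh(v)

def s_prime(v):
    """Derivative of the activation function."""
    return 1.0 - math.tanh(v) ** 2

def forward(weights, y):
    """Return the pre-activations of every layer after the input layer.

    weights[l][i][j] is the weight from unit i of layer l to unit j of
    layer l + 1. Bias units are omitted to keep the sketch short.
    """
    activities, pre_activations = list(y), []
    for w in weights:
        v = [sum(w[i][j] * activities[i] for i in range(len(w)))
             for j in range(len(w[0]))]
        pre_activations.append(v)
        activities = [s(vj) for vj in v]
    return pre_activations

def gauss_newton_diagonal(weights, y, ell_out):
    """Backpropagate ell from the output layer down to the input (feature) layer."""
    pre_activations = forward(weights, y)
    ell = list(ell_out)
    for w, v in zip(reversed(weights), reversed(pre_activations)):
        # Same shape as ordinary backpropagation, but with squared weights
        # and squared activation derivatives.
        ell = [sum(w[i][j] ** 2 * s_prime(v[j]) ** 2 * ell[j]
                   for j in range(len(v)))
               for i in range(len(w))]
    return ell

# Hypothetical 2-3-2 network and feature vector.
weights = [
    [[0.5, -1.0, 0.3], [0.8, 0.2, -0.6]],
    [[1.0, -0.5], [0.7, 0.4], [-0.2, 0.9]],
]
y = [0.1, -0.4]
ell_in = gauss_newton_diagonal(weights, y, ell_out=[1.0, 1.0])
print(ell_in)
```

The cost is one forward pass plus one backward pass, which is why the approximation is as cheap as ordinary backpropagation.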
Theorem 9 (Gradient of the determinant of the Gauss–Newton Hessian).
Consider a generative model given by a multilayer neural network. Let the reconstruction error be where are the components of the reconstructed data using features . Let the elementary model on be Gaussian with variance .
Let as in Theorem 5. Let be the layer-wise diagonal Gauss–Newton approximation of , namely
with computed from (42), initialized via on the output layer.
Then the derivative of with respect to the network weights can be computed exactly with an algorithmic cost of two forward and backpropagation passes.
This computation is trickier than it looks because the coefficients used in the backpropagation for depend on the weights of all units before (because does), not only the units directly influencing .
Apply the following lemma with , , and . ∎
Lemma 10 (Gradients of backpropagated quantities).
Let be a function of the state of a neural network computed according to the backpropagation equation
initialized with some fixed values on the output layer.
for some functions on the input layer .
Then the derivatives of with respect to the network parameters can be computed at the same algorithmic cost as one forward and two backpropagation passes, as follows.
Compute for all by backpropagation.
Compute the variable by forward propagation for all units , as
initialized with for in the input layer.
Compute the variable by backpropagation for all units , as
(also used for initialization with in the output layer, with an empty sum in the second term).
Then the derivatives of are
for all .
Note that we assume that the values used to initialize on the output layer are fixed (do not depend on the network weights). Any dependency of on the output layer activity values can, instead, be incorporated into via .
We assume that the network is an arbitrary finite, directed acyclic graph. We also assume (for simplicity only) that no unit is both an output unit and influences other units. We denote if there is an edge from to , if there is a path of length from to , and if or .
The computation has a structure similar to the forward-backward algorithm used in hidden Markov models.
For any pair of units , in the network, define the “backpropagation transfer rate” [Oll13] from to as
where the sum is over all paths from to in the network (including the length- path for ), and is the length of . In particular, and if there is no path from to . By construction these satisfy the backpropagation equation
for . By induction
where the sum is over in the output layer . Consequently the derivative of with respect to a weight is
so that we have to compute the derivatives of . (This assumes that the initialization of on the output layer does not depend on the weights .)
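The relationship between path sums and the backpropagation recursion can be checked numerically on a small graph. The sketch below is an illustration only: the five-unit DAG, its edge coefficients and the function names are hypothetical. It takes the transfer rate between two units to be the sum, over all paths joining them, of the product of the edge coefficients along the path, and verifies that backpropagated values are the transfer-rate-weighted sums of the output values.

```python
# Hypothetical DAG: edge (i, j) carries the coefficient used when
# backpropagating a value from unit j to unit i.
coef = {(0, 1): 0.5, (0, 2): -1.0, (1, 3): 2.0,
        (2, 3): 0.25, (1, 4): 1.5, (2, 4): 3.0}
units = [0, 1, 2, 3, 4]          # listed in topological order
outputs = [3, 4]
children = {i: [j for (a, j) in coef if a == i] for i in units}

def all_paths(i, k):
    """Yield every path from unit i to unit k as a list of units."""
    if i == k:
        yield [i]                # the length-0 path
        return
    for j in children[i]:
        for tail in all_paths(j, k):
            yield [i] + tail

def transfer(i, k):
    """Sum over paths from i to k of the product of edge coefficients."""
    total = 0.0
    for path in all_paths(i, k):
        product = 1.0
        for a, b in zip(path, path[1:]):
            product *= coef[(a, b)]
        total += product
    return total

def backprop(c_out):
    """Backpropagate fixed output values through the whole graph."""
    c = dict(c_out)
    for i in reversed(units):
        if i not in c:
            c[i] = sum(coef[(i, j)] * c[j] for j in children[i])
    return c

c_out = {3: 1.0, 4: 2.0}
c = backprop(c_out)
for i in units:
    by_paths = sum(transfer(i, k) * c_out[k] for k in outputs)
    print(i, c[i], by_paths)
```

Enumerating paths is exponential in general and is used here only as a check; the recursion computes the same values in one pass over the edges.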
A weight influences and also influences which in turn influences all values of at subsequent units. Let us first compute the derivative of with respect to . Summing over paths from to we find
by substituting , for each value of , and unraveling the definition of and .
Since only influences later units in the network, the only non-zero terms are those with . We can decompose into and :
Now, for , the influence of on has to transit through some unit directly connected to , namely, for any function ,
where is the activation function of the network. So
where the difference between the last two lines is that we removed the condition in the summation over : indeed, any with non-vanishing satisfies hence . According to (55), is , so that (59) is . Collecting from (56), we find
so that the quantities can be computed by backpropagation on , if the are known.
To compute the derivatives of with respect to a weight , observe that influences the term in , as well as all terms with via its influence on . Since we find
where we have set
and where we have used that satisfies
It remains to provide ways to compute and . For , note that the transfer rates satisfy the forward propagation equation
by construction. Summing over with weights yields the forward propagation equation for given in the statement of the lemma.
Finally, by summing over and in (60), with weights , and using the definition of and again the property , we obtain
which is the backpropagation equation for and concludes the proof. ∎