The denoising autoencoder (DAE) is a role model for representation learning, the objective of which is to capture a good representation of the data. Vincent2008 introduced it as a modification of traditional autoencoders. In a traditional autoencoder, we train a neural network as an identity map $X \mapsto X$ and extract the hidden layer to obtain the so-called “code.” On the other hand, the DAE is trained as a denoising map $\tilde{X} \mapsto X$ of deliberately corrupted inputs $\tilde{X}$. The corrupt-and-denoise principle is simple, yet it is compatible with stacking and has thus inspired many new autoencoders. See Section 1.1 for details.
We are interested in what deeper layers represent and why we should deepen layers. In contrast to the rapid development in its application, the stacked autoencoder remains unexplained analytically, because generative models, or probabilistic alternatives, are currently attracting more attention. In addition, deterministic approaches, such as kernel analysis and signal processing, tend to focus on convolution networks from a group invariance aspect. We address these questions from deterministic viewpoints: transportation theory and ridgelet analysis.
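Alain2014 derived an explicit map that a shallow DAE learns as
$$x \mapsto \frac{\mathbb{E}_\varepsilon[\, p_0(x - \varepsilon)\,(x - \varepsilon)\,]}{\mathbb{E}_\varepsilon[\, p_0(x - \varepsilon)\,]}, \tag{1}$$
and showed that it converges to the score $\nabla \log p_0$ of the data distribution $p_0$ as the variance of $\varepsilon$ tends to zero. Then, they recast it as manifold learning and score matching. We reinterpret (1) as a transportation map of $x$, the variance as time, and the infinitesimal limit as the initial velocity field.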
Ridgelet analysis is an integral representation theory of neural networks (Sonoda2015; Sonoda2014; Candes1998; Murata1996). It has a concrete geometric interpretation as wavelet analysis in the Radon domain. We can clearly state that the first hidden layer of a stacked DAE is simply a discretization of the ridgelet transform of (1). On the other hand, the character of deeper layers is still unclear, because the ridgelet transform on stacked layers means the composition of ridgelet transforms, which lacks geometric interpretation. One of the challenges here is to develop the integral representation of deep neural networks.
We make two important observations. First, through decoding, a stacked DAE is equivalent to a composition of DAEs. By definition, they differ from each other, because “stacked” means a concatenation of autoencoders with each output layer removed, while “composition” means a concatenation of autoencoders with each output layer remaining. Nevertheless, decoding relates the stacked DAE and the composition of DAEs. Then, ridgelet transform is reasonable, because it can be performed layer-wise, which leads to the integral representation of a deep neural network.
Second, an infinite composition results in a continuous DAE, which is rich in analytic properties and geometric interpretation, because it solves the backward heat equation. This means that what deep layers do is to transport mass so as to decrease entropy. Together with ridgelet analysis, we can conclude that what a deep layer represents is a discretization of the ridgelet transform of the transportation map.
1.1 Related Work
Vincent2008 introduced the DAE as a modification of traditional autoencoders. While the traditional autoencoder is trained as an identity map $X \mapsto X$, the DAE is trained as a denoising map $\tilde{X} \mapsto X$ for artificially corrupted inputs $\tilde{X}$, in order to enhance robustness.
Theoretical justifications and extensions follow from at least five aspects: manifold learning (Rifai2011; Alain2014), generative modeling (Vincent2010; Bengio2013; Bengio2014), infomax principle (Vincent2010), learning dynamics (Erhan2010), and score matching (Vincent2011). The first three aspects were already mentioned in the original paper (Vincent2008). According to these aspects, a DAE learns one of the following: a manifold on which the data are arranged (manifold learning); the latent variables, which often behave as nonlinear coordinates in the feature space, that generate the data (generative modeling); a transformation of the data distribution that maximizes the mutual information (infomax); good initial parameters that allow the training to avoid local minima (learning dynamics); or the data distribution (score matching).
A turning point appears to be the finding of the score matching aspect (Vincent2011), which reveals that score matching with a special form of energy function coincides with a DAE. This means that a DAE is a density estimator of the data distribution. In other words, it extracts and stores information as a function of . Since then, many researchers have set aside stacking deterministic autoencoders and have instead developed generative density estimators (Bengio2013; Bengio2014).
Generative modeling is more compatible not only with the restricted Boltzmann machine and deep belief nets (Hinton2006a) and the deep Boltzmann machine (Salakhutdinov2009), but also with many sophisticated algorithms, such as the variational autoencoder (Kingma2014a), minimum probability flow (Sohl-Dickstein2009; Sohl-Dickstein2015), generative adversarial networks (Goodfellow2014), semi-supervised learning (Kingma2014; Rasmus2015), and image generation (Radford2015). In generative models, what a hidden layer represents basically corresponds to either the “hidden state” itself that generates the data or the parameters (such as means and covariance matrices) of the probability distribution of the hidden states. See Bengio2014, for example.
“What do deep layers represent?” and “why deep?” are difficult questions for concrete mathematical analysis because a deep layer is a composition of nonlinear maps. In fact, even a shallow network is a universal approximator; that is, it can approximate any function, and thus, deep structure is simply redundant in theory. It has even been reported that a shallow network could outperform a deep network (Ba2014). Hence, no studies on subjects such as “integral representations of deep neural networks” or “deep ridgelet transform” exist.
Thus far, few studies have characterized the deep layer of stacked autoencoders. The only conclusion that has been drawn is the traditional belief that a combination of the “codes” exponentially enhances the expressive power of the network by constructing a hierarchy of knowledge and it is efficient to capture a complex feature of the data. Bouvrie2009, Bruna2013, Patel2015 and Anselmi2015a developed sophisticated formulations for convolution networks from a group invariance viewpoint. However, their analyses are inherently restricted to the convolution structure, which is compatible with linear operators.
In this paper, we consider an autoencoder to be a transportation map and focus on its dynamics, which is a deterministic standpoint. We address the questions stated above while seeking an integral representation of a deep neural network.
In this paper, we treat five versions of DAEs: the ordinary DAE $\Phi_t$, anisotropic DAE $\Phi_t(\cdot\,; D)$, stacked DAE $h_L \circ \cdots \circ h_0$, a composition of DAEs $\Phi_{t_L} \circ \cdots \circ \Phi_{t_0}$, and the continuous DAE $\varphi_t$.
By using the single symbols $\Phi$ and $\varphi$, we emphasize that they are realized as a shallow network or a network with a single hidden layer. Provided that there is no risk of confusion, the term “DAE” without any modifiers means a shallow DAE, without distinguishing “ordinary,” “anisotropic,” or “continuous,” because they are all derived from (3).
By $\partial_t$, $\nabla$, and $\Delta$, we denote time derivative, gradient, and Laplacian, by $|\cdot|$ the Euclidean norm, by $\mathrm{id}$ the identity map, and by $N(\mu, \Sigma)$ the uni/multivariate Gaussian with mean $\mu$ and covariance matrix $\Sigma$.
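An (anisotropic) heat kernel $W_t(x, y; D)$ is the fundamental solution of an anisotropic diffusion equation on $\mathbb{R}^m$ with respect to the diffusion coefficient tensor $D(x, t)$:
$$\partial_t W_t(x, y) = \nabla \cdot \left[ D(x, t) \nabla W_t(x, y) \right], \quad x \in \mathbb{R}^m,\ t > 0, \qquad W_0(x, y) = \delta(x - y).$$
When $D(x, t) \equiv I$, the diffusion equation and the heat kernel are reduced to a heat equation $\partial_t W_t = \Delta W_t$ and a Gaussian $W_t(x, y; I) = (4 \pi t)^{-m/2} \exp\left( -|x - y|^2 / 4t \right)$. If $D$ is clear from the context, we write simply $W_t(x, y)$ without indicating $D$.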
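The isotropic case can be checked numerically. The following is a minimal sketch, assuming the convention $\partial_t W_t = \Delta W_t$, under which the one-dimensional kernel centered at the origin is $(4\pi t)^{-1/2} \exp(-x^2/4t)$; the function name `heat_kernel`, the grid, and the step size are illustrative choices, not part of the original text.

```python
import numpy as np

def heat_kernel(x, t):
    """Isotropic 1-D heat kernel W_t(x, 0) = (4*pi*t)^(-1/2) * exp(-x^2 / (4*t))."""
    return np.exp(-x**2 / (4.0 * t)) / np.sqrt(4.0 * np.pi * t)

x = np.linspace(-3.0, 3.0, 13)   # evaluation points (illustrative)
t, h = 0.5, 1e-3                 # time and finite-difference step (illustrative)

# Central finite differences for the time derivative and the Laplacian.
dW_dt = (heat_kernel(x, t + h) - heat_kernel(x, t - h)) / (2.0 * h)
lap_W = (heat_kernel(x + h, t) - 2.0 * heat_kernel(x, t) + heat_kernel(x - h, t)) / h**2

# The residual of the heat equation should vanish up to discretization error.
max_residual = np.max(np.abs(dW_dt - lap_W))
print(max_residual)
```

Under this convention, $W_t$ is the density of $N(0, 2tI)$, so a Gaussian of variance $t$ corresponds to the kernel at time $t/2$.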
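For a map $f : \mathbb{R}^m \to \mathbb{R}^n$ with $f = (f_1, \dots, f_n)$, the Jacobian $\nabla f$ is calculated by $(\nabla f)_{ij} = \partial f_i / \partial x_j$, regarding $\nabla f$ as an $n \times m$ matrix. By $f_\sharp p$, we denote the pushforward measure of a probability measure $p$ with respect to a map $f$, which satisfies $(f_\sharp p \circ f)\, |\nabla f| = p$, where $|\nabla f|$ here stands for the absolute value of the Jacobian determinant. See (Evans2015) for details.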
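The pushforward property can be illustrated on a case with a known answer. The sketch below assumes that the property meant here is the usual change-of-variables identity for densities, $(f_\sharp p \circ f)\,|\det \nabla f| = p$, and uses a hypothetical affine map $f(x) = ax + b$ acting on $p = N(0, 1)$, for which $f_\sharp p = N(b, a^2)$; the names `normal_pdf`, `a`, and `b` are illustrative.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Density of the univariate Gaussian N(mean, var)."""
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

a, b = 2.0, 1.0                       # hypothetical affine map f(x) = a*x + b

def f(x):
    return a * x + b

x = np.linspace(-3.0, 3.0, 7)
p = normal_pdf(x, 0.0, 1.0)                       # density p of N(0, 1)
pushforward_at_fx = normal_pdf(f(x), b, a**2)     # density of f#p = N(b, a^2), evaluated at f(x)

# Change of variables: (f#p ∘ f)(x) * |det ∇f(x)| should equal p(x).
lhs = pushforward_at_fx * abs(a)
max_gap = np.max(np.abs(lhs - p))
print(max_gap)
```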
2.1 Denoising Autoencoder
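Let $X$ be a random vector in $\mathbb{R}^m$ and $\tilde{X}$ be its corruption:
$$\tilde{X} = X + \varepsilon, \quad \varepsilon \sim N(0, tI).$$
We train a shallow neural network $\Phi$ for minimizing an objective function
$$\mathbb{E}_{X, \tilde{X}} \left| \Phi(\tilde{X}) - X \right|^2.$$
In this study, we assume that $\Phi$ has a sufficiently large number of hidden units to approximate any function, and thus, the training attains the Bayes optimal. In other words, $\Phi$ converges to the regression function
$$\Phi(\tilde{x}) = \mathbb{E}[X \mid \tilde{X} = \tilde{x}]$$
as the number of hidden units tends to infinity. We regard and treat this limit as a shallow network and call it a denoising autoencoder or DAE.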
Let $\Phi$ be a DAE trained for $X$. Denote by $h$ and $g$ the hidden layer and output layer of $\Phi$, respectively; that is, they satisfy $\Phi = g \circ h$. According to custom, we call $h$ the encoder, $g$ the decoder, and $z := h(x)$ the feature of $x$.
Remark on a potential confusion. Although we trained $\Phi$ as a function of $\tilde{x}$ in order to enhance robustness, we plug in $x$ in place of $\tilde{x}$. Then, $\Phi$ no longer behaves as an identity map, which may be expected from traditional autoencoders, but as a denoising map formulated in (3).
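The convergence of training to the regression function can be seen in a toy case. The sketch below assumes univariate Gaussian data $X \sim N(0, s^2)$ and corruption variance $t$, for which the regression function is known in closed form, $\mathbb{E}[X \mid \tilde{X} = \tilde{x}] = \frac{s^2}{s^2 + t}\tilde{x}$. Because this function is linear, a one-parameter least-squares fit stands in for the shallow network; the variable names and the values of `s2`, `t`, and `n` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
s2, t, n = 1.0, 0.25, 200_000     # data variance, noise variance, sample size (illustrative)

x = rng.normal(0.0, np.sqrt(s2), size=n)             # clean data X ~ N(0, s2)
x_tilde = x + rng.normal(0.0, np.sqrt(t), size=n)    # corrupted input X~ = X + eps

# Minimize the empirical objective sum |a * x_tilde - x|^2 over the slope a.
# For Gaussian data the Bayes-optimal denoiser is linear, so this toy fit
# plays the role of the trained network.
slope_fit = np.dot(x_tilde, x) / np.dot(x_tilde, x_tilde)
slope_bayes = s2 / (s2 + t)       # slope of E[X | X~ = x~]

print(slope_fit, slope_bayes)
```

Since the fitted slope is smaller than one, plugging a clean $x$ into the fitted map shrinks it toward the mode at the origin, which is the non-identity, denoising behavior described in the remark.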
2.2 Alain’s Derivation of Denoising Autoencoders
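Alain2014 showed that the regression function is reduced to
$$\Phi_t = \mathrm{id} + t \nabla \log \left[ W_{t/2} * p_0 \right], \tag{3}$$
where $W_{t/2}$ is the isotropic heat kernel ($D \equiv I$), which coincides with the density of the corruption noise $N(0, tI)$, and $p_0$ is the data distribution. The proof is straightforward:
$$\mathbb{E}[X \mid \tilde{X} = \tilde{x}] = \frac{\int x\, W_{t/2}(\tilde{x}, x)\, p_0(x)\, dx}{\int W_{t/2}(\tilde{x}, x')\, p_0(x')\, dx'} = \tilde{x} + t\, \frac{\nabla \left[ W_{t/2} * p_0 \right](\tilde{x})}{\left[ W_{t/2} * p_0 \right](\tilde{x})},$$
where the second equation follows by the fact that $\nabla_x W_{t/2}(x, y) = -\frac{x - y}{t}\, W_{t/2}(x, y)$.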
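As an infinitesimal limit, (3) is reduced to an asymptotic formula:
$$\Phi_t = \mathrm{id} + t \nabla \log p_0 + o(t), \quad t \to 0. \tag{4}$$
We can interpret it as a velocity field over the ground space $\mathbb{R}^m$:
$$\partial_t \Phi_{t=0}(x) = \nabla \log p_0(x), \quad x \in \mathbb{R}^m. \tag{5}$$
It implies that the initial velocity of the transportation of a mass on $x$ is given by the score, which is in the sense of “score matching.”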
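As a worked example, consider isotropic Gaussian data, $p_0 = N(0, s^2 I)$. This choice, the symbol $s^2$, and the notation $\Phi_t$ for the DAE with corruption variance $t$ are illustrative assumptions rather than part of the original derivation. Under the convention that $W_{t/2}$ is the density of $N(0, tI)$, each step can be followed in closed form:

```latex
\begin{align*}
W_{t/2} * p_0 &= N\bigl(0, (s^2 + t) I\bigr), \\
\nabla \log \left[ W_{t/2} * p_0 \right](x) &= -\frac{x}{s^2 + t}, \\
\Phi_t(x) &= x - \frac{t}{s^2 + t}\, x = \frac{s^2}{s^2 + t}\, x, \\
\Phi_t(x) &= x - \frac{t}{s^2}\, x + o(t) = x + t \nabla \log p_0(x) + o(t) \quad (t \to 0), \\
\partial_t \Phi_{t=0}(x) &= -\frac{x}{s^2} = \nabla \log p_0(x).
\end{align*}
```

In this case every point is pulled toward the mode at the origin, and its initial velocity equals the score of $p_0$.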
2.3 Anisotropic Denoising Autoencoder
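We introduce the anisotropic DAE as
$$\Phi_t(\cdot\,; D) := \mathrm{id} + t \nabla \log \left[ W_{t/2}(D) * p_0 \right]$$
by replacing the heat kernel $W_{t/2}(I)$ in (3) with an anisotropic heat kernel $W_{t/2}(D)$, where $W_{t/2}(D)$ is shorthand for $W_{t/2}(\cdot, \cdot\,; D)$. The original formulation corresponds to the case $D \equiv I$.
Because of the definition, the initial velocity does not depend on $D$. Hence, (5) still holds for the anisotropic case.
If $D$ is clear from the context, we write simply $\Phi_t$ without indicating $D$.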
2.4 Stacked Denoising Autoencoder
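Let $H_\ell$ ($\ell = 0, \dots, L + 1$) be vector spaces and $z_\ell$ denote a feature vector that takes a value in $H_\ell$. The input space ($\mathbb{R}^m$) and an input vector ($x$) are rewritten in $H_0$ and $z_0$, respectively. A stacked DAE is obtained by iteratively alternating (i) training a DAE $\Phi_\ell : H_\ell \to H_\ell$ for the feature $z_\ell$ and (ii) extracting a new feature $z_{\ell+1} := h_\ell(z_\ell)$ with the encoder $h_\ell : H_\ell \to H_{\ell+1}$ of $\Phi_\ell$.
We call a composition $h_L \circ \cdots \circ h_0$ of encoders a stacked DAE, which corresponds to the solid lines in the diagram below.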
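The alternating procedure can be sketched in a few lines. The code below is a deliberately simplified illustration, not the training scheme of the paper: each toy DAE uses an encoder with fixed random weights, and only its linear decoder is fitted by ridge regression. The function names `train_dae` and `stack_daes`, the layer sizes, the noise variance, and the regularization constant are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_dae(z, n_hidden, t):
    """Fit a toy DAE (decoder ∘ encoder) on features z, one sample per row.

    Simplification: the encoder has fixed random weights; only the linear
    decoder is fitted, by ridge regression, to map encoder(z + noise) to z.
    """
    z_tilde = z + rng.normal(0.0, np.sqrt(t), size=z.shape)   # corrupt with variance t
    w = rng.normal(0.0, 1.0, size=(z.shape[1], n_hidden))
    b = rng.normal(0.0, 1.0, size=n_hidden)

    def encoder(u):
        return np.tanh(u @ w + b)                 # maps layer l features to layer l+1

    feats = encoder(z_tilde)
    gram = feats.T @ feats + 1e-3 * np.eye(n_hidden)
    v = np.linalg.solve(gram, feats.T @ z)        # ridge-regularized least squares

    def decoder(u):
        return u @ v                              # maps layer l+1 features back to layer l

    return encoder, decoder

def stack_daes(x, hidden_sizes, t):
    """Alternate (i) training a DAE on the current feature and (ii) encoding it."""
    encoders, z = [], x
    for n_hidden in hidden_sizes:
        encoder, _ = train_dae(z, n_hidden, t)    # the decoder is discarded when stacking
        encoders.append(encoder)
        z = encoder(z)                            # new feature for the next layer
    return encoders, z

x = rng.normal(size=(500, 2))                     # toy input data in R^2
encoders, top_feature = stack_daes(x, hidden_sizes=[16, 8], t=0.1)
print(top_feature.shape)
```

Only the encoders survive the loop, so the returned feature is the composition of encoders applied to the input, which is what the term “stacked” refers to here.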