Decoding Stacked Denoising Autoencoders

05/10/2016 ∙ by Sho Sonoda, et al.

Data representation in a stacked denoising autoencoder is investigated. Decoding is a simple technique for translating a stacked denoising autoencoder into a composition of denoising autoencoders in the ground space. In the infinitesimal limit, a composition of denoising autoencoders is reduced to a continuous denoising autoencoder, which is rich in analytic properties and geometric interpretation. For example, the continuous denoising autoencoder solves the backward heat equation and transports each data point so as to decrease entropy of the data distribution. Together with ridgelet analysis, an integral representation of a stacked denoising autoencoder is derived.




1 Introduction

The denoising autoencoder (DAE) is a role model for representation learning, the objective of which is to capture a good representation of the data. Vincent2008 introduced it as a heuristic modification of traditional autoencoders for enhancing robustness. In the setting of traditional autoencoders, we train a neural network as an identity map $x \mapsto x$ and extract the hidden layer to obtain the so-called “code.” On the other hand, the DAE is trained as a denoising map $\tilde{x} \mapsto x$ of deliberately corrupted inputs $\tilde{x}$. The corrupt-and-denoise principle is simple, yet it is compatible with stacking and has thus inspired many new autoencoders. See Section 1.1 for details.

We are interested in what deeper layers represent and why we should deepen layers. In contrast to the rapid development in its application, the stacked autoencoder remains unexplained analytically, because generative models, or probabilistic alternatives, are currently attracting more attention. In addition, deterministic approaches, such as kernel analysis and signal processing, tend to focus on convolution networks from a group invariance aspect. We address these questions from deterministic viewpoints: transportation theory and ridgelet analysis.

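Alain2014 derived an explicit map that a shallow DAE learns as

$$x \mapsto \frac{\mathbb{E}_{\varepsilon}[p(x - \varepsilon)(x - \varepsilon)]}{\mathbb{E}_{\varepsilon}[p(x - \varepsilon)]}, \tag{1}$$

and showed that its displacement from $x$, scaled by the noise variance, converges to the score $\nabla \log p$ of the data distribution $p$ as the variance of $\varepsilon$ tends to zero. Then, they recast it as manifold learning and score matching. We reinterpret (1) as a transportation map of $x$, the variance as time, and the infinitesimal limit as the initial velocity field.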

Ridgelet analysis is an integral representation theory of neural networks (Sonoda2015; Sonoda2014; Candes1998; Murata1996). It has a concrete geometric interpretation as wavelet analysis in the Radon domain. We can clearly state that the first hidden layer of a stacked DAE is simply a discretization of the ridgelet transform of (1). On the other hand, the character of deeper layers is still unclear, because the ridgelet transform on stacked layers means the composition of ridgelet transforms, which lacks geometric interpretation. One of the challenges here is to develop the integral representation of deep neural networks.

We make two important observations. First, through decoding, a stacked DAE is equivalent to a composition of DAEs. By definition, they differ from each other, because “stacked” means a concatenation of autoencoders with each output layer removed, while “composition” means a concatenation of autoencoders with each output layer remaining. Nevertheless, decoding relates the stacked DAE and the composition of DAEs. Then, ridgelet transform is reasonable, because it can be performed layer-wise, which leads to the integral representation of a deep neural network.

Second, an infinite composition results in a continuous DAE, which is rich in analytic properties and geometric interpretation, because it solves the backward heat equation. This means that what deep layers do is to transport mass so as to decrease entropy. Together with ridgelet analysis, we can conclude that what a deep layer represents is a discretization of the ridgelet transform of the transportation map.
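The entropy-decreasing behavior of a composition of DAEs can be illustrated in closed form. The sketch below assumes one-dimensional Gaussian data $N(0, v)$, an assumption made here for illustration only; in that case the Bayes-optimal denoiser for noise variance $\tau$ is the linear shrinkage $x \mapsto x\, v / (v + \tau)$, and each DAE in the composition is retrained on the output distribution of the previous one. The function names are hypothetical.

```python
import math

def dae_step_gaussian(variance, tau):
    """Variance of N(0, variance) data after applying one optimal DAE.

    For Gaussian data the optimal denoiser is x -> shrink * x, so the
    transported data are again Gaussian with variance shrink**2 * variance.
    """
    shrink = variance / (variance + tau)
    return shrink ** 2 * variance

def gaussian_entropy(variance):
    """Differential entropy of a one-dimensional Gaussian."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)

def compose_daes(variance, tau, n_steps):
    """Entropy of the data distribution after 0, 1, ..., n_steps DAE steps."""
    entropies = [gaussian_entropy(variance)]
    for _ in range(n_steps):
        variance = dae_step_gaussian(variance, tau)
        entropies.append(gaussian_entropy(variance))
    return entropies

entropies = compose_daes(variance=1.0, tau=0.01, n_steps=20)
print(entropies[0], entropies[-1])
```

Each step contracts the data toward the mode, so the entropy sequence is strictly decreasing in this Gaussian example. The number of steps is kept small because the variance shrinks toward zero as steps accumulate.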

Figure 1: Denoising autoencoder (left), stacked denoising autoencoder with linear output (center), and a composition of two denoising autoencoders (right). Decoding translates a stacked denoising autoencoder into a composition of denoising autoencoders.

1.1 Related Work

Vincent2008 introduced the DAE as a modification of traditional autoencoders. While the traditional autoencoder is trained as an identity map $x \mapsto x$, the DAE is trained as a denoising map $\tilde{x} \mapsto x$ for artificially corrupted inputs $\tilde{x}$, in order to enhance robustness.

Theoretical justifications and extensions follow from at least five aspects: manifold learning (Rifai2011; Alain2014), generative modeling (Vincent2010; Bengio2013; Bengio2014), infomax principle (Vincent2010), learning dynamics (Erhan2010), and score matching (Vincent2011). The first three aspects were already mentioned in the original paper (Vincent2008). According to these aspects, a DAE learns one of the following: a manifold on which the data are arranged (manifold learning); the latent variables, which often behave as nonlinear coordinates in the feature space, that generate the data (generative modeling); a transformation of the data distribution that maximizes the mutual information (infomax); good initial parameters that allow the training to avoid local minima (learning dynamics); or the data distribution (score matching).

A turning point appears to be the finding of the score matching aspect (Vincent2011), which reveals that score matching with a special form of energy function coincides with a DAE. This means that a DAE is a density estimator of the data distribution $p$. In other words, it extracts and stores information as a function of $p$. Since then, many researchers have omitted stacking deterministic autoencoders and have instead developed generative density estimators (Bengio2013; Bengio2014).

Generative modeling is more compatible not only with the restricted Boltzmann machine and deep belief nets (Hinton2006a) and the deep Boltzmann machine (Salakhutdinov2009), but also with many sophisticated algorithms, such as the variational autoencoder (Kingma2014a), minimum probability flow (Sohl-Dickstein2009; Sohl-Dickstein2015), generative adversarial networks (Goodfellow2014), semi-supervised learning (Kingma2014; Rasmus2015), and image generation (Radford2015). In generative models, what a hidden layer represents basically corresponds to either the “hidden state” itself that generates the data or the parameters (such as means and covariance matrices) of the probability distribution of the hidden states. See Bengio2014, for example.

“What do deep layers represent?” and “why deep?” are difficult questions for concrete mathematical analysis because a deep layer is a composition of nonlinear maps. In fact, even a shallow network is a universal approximator; that is, it can approximate any function, and thus, deep structure is simply redundant in theory. It has even been reported that a shallow network could outperform a deep network (Ba2014). Hence, no studies on subjects such as “integral representations of deep neural networks” or “deep ridgelet transform” exist.

Thus far, few studies have characterized the deep layer of stacked autoencoders. The only conclusion that has been drawn is the traditional belief that a combination of the “codes” exponentially enhances the expressive power of the network by constructing a hierarchy of knowledge and it is efficient to capture a complex feature of the data. Bouvrie2009, Bruna2013, Patel2015 and Anselmi2015a developed sophisticated formulations for convolution networks from a group invariance viewpoint. However, their analyses are inherently restricted to the convolution structure, which is compatible with linear operators.

In this paper, we consider an autoencoder to be a transportation map and focus on its dynamics, which is a deterministic standpoint. We address the questions stated above while seeking an integral representation of a deep neural network.

2 Preliminaries

In this paper, we treat five versions of DAEs: the ordinary DAE $\Phi$, anisotropic DAE $\Phi(\cdot\,; D)$, stacked DAE $h_L \circ \cdots \circ h_0$, a composition of DAEs $\Phi_L \circ \cdots \circ \Phi_0$, and the continuous DAE $\varphi$.

By using the single symbols $\Phi$ and $\varphi$, we emphasize that they are realized as a shallow network or a network with a single hidden layer. Provided that there is no risk of confusion, the term “DAE $\Phi$” without any modifiers means a shallow DAE, without distinguishing “ordinary,” “anisotropic,” or “continuous,” because they are all derived from (3).

By $\partial_t$, $\nabla$ and $\Delta$, we denote time derivative, gradient, and Laplacian, by $|\cdot|$ the Euclidean norm, by $\mathrm{Id}$ the identity map, and by $N(\mu, \Sigma)$ the uni/multivariate Gaussian with mean $\mu$ and covariance matrix $\Sigma$.

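An (anisotropic) heat kernel $W_t(x, y; D)$ is the fundamental solution of an anisotropic diffusion equation on $\mathbb{R}^m$ with respect to the diffusion coefficient tensor $D(x, t)$:

$$\partial_t W_t(x, y) = \nabla \cdot [D(x, t) \nabla W_t(x, y)], \qquad W_0(x, y) = \delta(x - y).$$

When $D(x, t) \equiv I$, the diffusion equation and the heat kernel are reduced to a heat equation $\partial_t W_t = \Delta W_t$ and a Gaussian $W_t(x, y) = (4 \pi t)^{-m/2} \exp(-|x - y|^2 / 4t)$. If $D$ is clear from the context, we write simply $W_t(x, y)$ without indicating $D$.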

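For a map $f : \mathbb{R}^m \to \mathbb{R}^n$ with $m \le n$, the Jacobian is calculated by $|\nabla f| = \sqrt{\det[(\nabla f)^{\top} (\nabla f)]}$, regarding $\nabla f$ as an $n \times m$ matrix. By $f_{\sharp} p$, we denote the pushforward measure of a probability measure $p$ with respect to a map $f$, which satisfies $(f_{\sharp} p \circ f)\, |\nabla f| = p$. See (Evans2015) for details.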

2.1 Denoising Autoencoder


Let $x$ be a random vector in $\mathbb{R}^m$ and $\tilde{x}$ be its corruption:

$$\tilde{x} = x + \varepsilon, \qquad \varepsilon \sim N(0, tI).$$

We train a shallow neural network $g$ for minimizing an objective function

$$\mathbb{E}_{x, \tilde{x}}\, |g(\tilde{x}) - x|^2.$$

In this study, we assume that $g$ has a sufficiently large number of hidden units to approximate any function, and thus, the training attains the Bayes optimal. In other words, $g$ converges to the regression function

$$\Phi(\tilde{x}) := \mathbb{E}[x \mid \tilde{x}], \tag{2}$$

as the number of hidden units tends to infinity. We regard and treat this limit $\Phi$ as a shallow network and call it a denoising autoencoder or DAE.
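The claim that training attains the regression function can be checked numerically in a case where that function is known. The sketch below assumes one-dimensional Gaussian data $N(0, s^2)$ and noise variance $t$ (both chosen here for illustration), for which the conditional mean of the clean input given the corrupted one is linear with slope $s^2 / (s^2 + t)$. A least-squares fit of a linear denoiser stands in for the trained network; the function name is hypothetical.

```python
import numpy as np

def fit_linear_denoiser(x, x_tilde):
    """Least-squares slope a minimizing the mean of (a * x_tilde - x)**2."""
    return float(np.dot(x_tilde, x) / np.dot(x_tilde, x_tilde))

rng = np.random.default_rng(0)
s2, t = 1.0, 0.25           # data variance and noise variance (illustrative values)
n = 200_000

x = rng.normal(0.0, np.sqrt(s2), size=n)            # clean samples
x_tilde = x + rng.normal(0.0, np.sqrt(t), size=n)   # corrupted samples

slope_fit = fit_linear_denoiser(x, x_tilde)
slope_theory = s2 / (s2 + t)  # slope of the conditional mean for Gaussian data
print(slope_fit, slope_theory)
```

For non-Gaussian data the regression function is nonlinear, so a linear fit would no longer suffice; the Gaussian case is used only because the optimum is available in closed form.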

Let $\Phi$ be a DAE trained for $\tilde{x}$. Denote by $h$ and $k$ the hidden layer and output layer of $\Phi$, respectively; that is, they satisfy $\Phi = k \circ h$. According to custom, we call $h$ the encoder, $k$ the decoder, and $z := h(x)$ the feature of $x$.

Remark on a potential confusion. Although we trained $\Phi$ as a function of $\tilde{x}$ in order to enhance robustness, we plug in $x$ in place of $\tilde{x}$. Then, $\Phi$ no longer behaves as an identity map, which may be expected from traditional autoencoders, but as a denoising map formulated in (3).

2.2 Alain’s Derivation of Denoising Autoencoders

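Alain2014 showed that the regression function (2) for a DAE is reduced to (1). We can rewrite it as

$$\Phi_t = \mathrm{Id} + t \nabla \log [W_{t/2} * p_0], \tag{3}$$

where $W_t$ is the isotropic heat kernel ($D \equiv I$) and $p_0$ is the data distribution (denoted by $p$ in (1)). The proof is straightforward:

$$\frac{\mathbb{E}_{\varepsilon}[p_0(x - \varepsilon)(x - \varepsilon)]}{\mathbb{E}_{\varepsilon}[p_0(x - \varepsilon)]} = x - \frac{\mathbb{E}_{\varepsilon}[p_0(x - \varepsilon)\, \varepsilon]}{\mathbb{E}_{\varepsilon}[p_0(x - \varepsilon)]} = x + t\, \frac{\nabla W_{t/2} * p_0(x)}{W_{t/2} * p_0(x)} = x + t \nabla \log [W_{t/2} * p_0](x),$$

where the second equation follows by the fact that $\nabla W_{t/2}(\varepsilon) = -(\varepsilon / t)\, W_{t/2}(\varepsilon)$.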
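The equivalence between the ratio of expectations in (1) and the smoothed-score form in (3) can be checked numerically. The sketch below assumes a one-dimensional Gaussian mixture as the data distribution and Gaussian noise of variance $t$; both are illustrative choices, and all names are hypothetical. For a mixture, smoothing by the noise density simply adds $t$ to each component variance, which gives (3) in closed form, while (1) is evaluated by a Riemann sum over the noise.

```python
import numpy as np

# Illustrative mixture parameters (assumptions, not from the text).
WEIGHTS = np.array([0.3, 0.7])
MEANS = np.array([-2.0, 1.0])
STDS = np.array([0.5, 1.0])

def normal_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def p0(x):
    """Mixture density, evaluated elementwise."""
    x = np.asarray(x, dtype=float)[..., None]
    return np.sum(WEIGHTS * normal_pdf(x, MEANS, STDS ** 2), axis=-1)

def dae_quadrature(x, t, n_grid=20001, width=10.0):
    """Ratio form: E[p0(x - eps)(x - eps)] / E[p0(x - eps)], eps ~ N(0, t)."""
    half = width * np.sqrt(t)
    eps = np.linspace(-half, half, n_grid)
    w = normal_pdf(eps, 0.0, t)       # noise density on the grid
    y = x - eps
    py = p0(y)
    return float(np.sum(w * py * y) / np.sum(w * py))  # grid spacing cancels

def dae_closed_form(x, t):
    """Smoothed-score form: x + t * d/dx log(noise density * p0)(x)."""
    var = STDS ** 2 + t               # convolution adds variances
    comp = WEIGHTS * normal_pdf(x, MEANS, var)
    return float(x + t * np.sum(comp * (MEANS - x) / var) / np.sum(comp))

for x in (-3.0, 0.0, 2.0):
    print(x, dae_quadrature(x, 0.5), dae_closed_form(x, 0.5))
```

The two columns printed for each point should agree up to quadrature error.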

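As an infinitesimal limit, (3) is reduced to an asymptotic formula:

$$\Phi_t = \mathrm{Id} + t \nabla \log p_0 + o(t), \qquad t \to 0. \tag{4}$$

We can interpret it as a velocity field over the ground space $\mathbb{R}^m$:

$$\partial_t \Phi_{t=0}(x) = \nabla \log p_0(x), \qquad x \in \mathbb{R}^m. \tag{5}$$

It implies that the initial velocity of the transportation of a mass at $x$ is given by the score, which is the score in the sense of “score matching.”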
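The velocity-field reading can be illustrated with a finite difference: the displacement of a point under the DAE, divided by $t$, should approach the score as $t$ tends to zero. The sketch below assumes a symmetric two-component Gaussian mixture in one dimension, for which the DAE is available in closed form; the parameters and function names are illustrative assumptions.

```python
import numpy as np

# Illustrative mixture parameters (assumptions, not from the text).
W = np.array([0.5, 0.5])
MU = np.array([-1.0, 1.0])
VAR = np.array([0.25, 0.25])

def smoothed_score(x, t):
    """d/dx log of the mixture smoothed by N(0, t); t = 0 gives the score."""
    var = VAR + t
    comp = W * np.exp(-(x - MU) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    return float(np.sum(comp * (MU - x) / var) / np.sum(comp))

def dae(x, t):
    """Closed-form DAE for the mixture: x + t * smoothed score."""
    return x + t * smoothed_score(x, t)

def velocity_gap(x, t):
    """Distance between the finite-difference velocity and the score at x."""
    return abs((dae(x, t) - x) / t - smoothed_score(x, 0.0))

x = 0.3
gaps = [velocity_gap(x, t) for t in (1e-1, 1e-2, 1e-3)]
print(gaps)
```

The gap shrinks roughly in proportion to $t$, consistent with a first-order expansion around $t = 0$.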

2.3 Anisotropic Denoising Autoencoder

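We introduce the anisotropic DAE as

$$\Phi_t(\cdot\,; D) := \mathrm{Id} + t \nabla \log [W_{t/2}(D) * p_0],$$

by replacing the heat kernel $W_{t/2}$ in (3) with an anisotropic heat kernel $W_{t/2}(D)$. The original formulation corresponds to the case $D \equiv I$.

By definition, the initial velocity does not depend on $D$, because the heat kernel at time zero is the Dirac delta for every $D$. Hence, (5) still holds for the anisotropic case.

If $D$ is clear from the context, we write simply $\Phi_t$ without indicating $D$.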
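The independence of the initial velocity from the diffusion tensor can be seen in a small example. The sketch below makes several assumptions for illustration: the data are zero-mean Gaussian with covariance `cov`, the tensor $D$ is a constant positive-definite matrix, and the anisotropic DAE takes the form "identity plus $t$ times the gradient of the log of the smoothed density," with smoothing by a Gaussian of covariance $tD$. Under these assumptions the smoothed density is Gaussian with covariance `cov + t * D`. The function name is hypothetical.

```python
import numpy as np

def anisotropic_dae_gaussian(x, t, cov, diffusion):
    """x + t * grad log N(x; 0, cov + t * diffusion) for Gaussian data."""
    return x - t * np.linalg.solve(cov + t * diffusion, x)

cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])          # data covariance (illustrative)
x = np.array([1.0, -1.0])
score = -np.linalg.solve(cov, x)      # score of N(0, cov) at x

t = 1e-4
velocities = []
for diffusion in (np.eye(2), np.diag([5.0, 0.2])):
    # Finite-difference velocity at small t for two different tensors.
    velocities.append((anisotropic_dae_gaussian(x, t, cov, diffusion) - x) / t)

print(score, velocities)
```

Both tensors give a velocity close to the same score vector; the tensor only affects the trajectory at later times.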

2.4 Stacked Denoising Autoencoder

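Let $H_\ell$ ($\ell = 0, \dots, L+1$) be vector spaces and $z_\ell$ denote a feature vector that takes a value in $H_\ell$. The input space ($\mathbb{R}^m$) and an input vector ($x$) are rewritten in $H_0$ and $z_0$, respectively. A stacked DAE is obtained by iteratively alternating (i) training a DAE $\Phi_\ell : H_\ell \to H_\ell$ for the feature $z_\ell$ and (ii) extracting a new feature $z_{\ell+1} := h_\ell(z_\ell)$ with the encoder $h_\ell : H_\ell \to H_{\ell+1}$ of $\Phi_\ell$.

We call a composition $h_L \circ \cdots \circ h_0$ of encoders a stacked DAE, which corresponds to the solid lines in the diagram below.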