# Transportation analysis of denoising autoencoders: a novel method for analyzing deep neural networks

The feature map obtained from the denoising autoencoder (DAE) is investigated by determining the transportation dynamics of the DAE, which is a cornerstone for deep learning. Despite the rapid development in its application, deep neural networks remain analytically unexplained, because the feature maps are nested and the parameters are not faithful. In this paper, we address the problem of formulating the nested complex of parameters by regarding the feature map as a transport map. Even when a feature map has different dimensions between input and output, we can regard it as a transportation map by considering that both the input and output spaces are embedded in a common high-dimensional space. In addition, the trajectory is a geometric object and thus is independent of parameterization. In this manner, transportation can be regarded as a universal character of deep neural networks. By determining and analyzing the transportation dynamics, we can understand the behavior of a deep neural network. In this paper, we investigate a fundamental case of deep neural networks: the DAE. We derive the transport map of the DAE, and reveal that the infinitely deep DAE transports mass to decrease a certain quantity, such as the entropy, of the data distribution. These results, though analytically simple, shed light on the correspondence between deep neural networks and the Wasserstein gradient flows.


## 1 Introduction

Despite the rapid development in its application, the deep structure of neural networks remains analytically unexplained because (1) functional composition has poor compatibility with the basics of machine learning, "basis and coefficients," and (2) the parameterization of neural networks is not faithful, and thus parametric arguments are subject to technical difficulties such as local minima and algebraic singularities. In this paper, we introduce the transportation interpretation of deep neural networks: we regard a neural network with $m$ inputs and $n$ outputs as a vector-valued map $g : \mathbb{R}^m \to \mathbb{R}^n$, and interpret $g$ as a transport map that transforms the input vector $x$ to $g(x)$. Because the composition of transport maps is also a transport map, a trajectory is the natural model of the composition structure of deep neural networks. Furthermore, because a trajectory is independent of its parameterization, redundant parameterization of neural networks is avoided. By determining and analyzing the transportation dynamics of a deep neural network, we can understand the behavior of that network. For example, we can expect that in a deep neural network that distinguishes pictures of dogs and cats, the feature extractor would be a transport map that separates the input vectors of dogs and cats, like the physical phenomenon of oil and water being immiscible. It is noteworthy that the input and output dimensions of a feature map in a neural network rarely coincide with each other. Nevertheless, we can regard the feature map in a neural network as a transport map by considering that both the input and output spaces are embedded in a common high-dimensional space. In this manner, we can always assign a trajectory to a deep neural network, and transportation is therefore a universal character of deep neural networks.

The denoising autoencoder (DAE), used to obtain a good representation of data, is a cornerstone for deep learning, or representation learning. The traditional autoencoder is a neural network that is trained as an identity map $g(x) \approx x$. The hidden layer of the network is used as a feature map, which is often called the "code" because, in general, the activation pattern appears random and encoded. Vincent et al. (2008) introduced the DAE as a heuristic modification of traditional autoencoders to increase robustness. In this case, the DAE is trained as a "denoising" map

$$g(\tilde{x}) \approx x$$

of deliberately corrupted inputs $\tilde{x}$. Though the corrupt-and-denoise principle is simple, it has been used successfully for deep learning and has therefore inspired many representation learning algorithms (Vincent et al., 2010; Vincent, 2011; Rifai et al., 2011; Bengio et al., 2013, 2014; Alain and Bengio, 2014). Though the term "DAE" is the name of a training method, as long as there is no risk of confusion, we abbreviate "a training result of the DAE" as "a DAE."

As discussed later, we found that when the corruption process is additive, i.e., $\tilde{x} = x + \varepsilon$ with some noise $\varepsilon$, then the DAE takes the form

$$g_t(\tilde{x}) = \tilde{x} - E_t[\varepsilon \mid \tilde{x}], \tag{1}$$

where $t$ denotes the noise variance, and the expectation is taken with respect to the posterior distribution of the noise $\varepsilon$ given $\tilde{x}$. We can observe that the DAE (1) is composed of the traditional autoencoder $\tilde{x}$ and the denoising term $-E_t[\varepsilon \mid \tilde{x}]$. From a statistical viewpoint, this form is reasonable because a DAE is an estimator of the mean, or the location parameter. Specifically, given a corrupted input $\tilde{x}$ of an unknown truth $x$, $g_t(\tilde{x})$ is an estimator of $x$.

In this study, we interpret (1) as a transport map, by regarding the denoising term $-E_t[\varepsilon \mid \tilde{x}]$ as a displacement vector from the origin $\tilde{x}$. In addition, we regard the noise variance $t$ as transport time. As time $t$ evolves, the data distribution $\mu_0$ will be deformed to $\mu_t$ according to the mass transportation given by $g_t$, i.e., $\mu_t$ is the pushforward measure of $\mu_0$ by $g_t$, and is denoted by $\mu_t = g_{t\sharp}\mu_0$. Because $g_t$ defines a time-dependent dynamical system, $g_t$ is difficult to analyze. Instead, we focus on $\mu_t$, and show that $\mu_t$ evolves according to a Wasserstein gradient flow with respect to a certain potential functional $F$, which is independent of time. In general, a DAE is identified by $F$.

In the following sections, we determine and analyze the transport map of DAEs. In Section 2, we show that the DAE is given by (1), and that $\mu_t$ evolves according to the continuity equation as $t \to 0$. Then, in Section 3, we consider the composition of DAEs, or a deep DAE, and show that the continuum limit of the compositions satisfies the continuity equation at every time $t$. Finally, in Section 4, we explain the association between the DAE and the Wasserstein gradient flow.

### 1.1 A minimum introduction to Wasserstein gradient flow

The Wasserstein gradient flow (Villani, 2009, § 23), also known as the Otto calculus and the abstract gradient flow, is an infinite-dimensional gradient flow defined on the $2$-Wasserstein space $\mathcal{P}_2(\mathbb{R}^m)$. Here, $\mathcal{P}_2(\mathbb{R}^m)$ is a family of sufficiently smooth probability density functions on $\mathbb{R}^m$ that have at least second moments, equipped with the $2$-Wasserstein metric; this is an infinite-dimensional Riemannian metric that is compatible with the $2$-Wasserstein distance $W_2$; and $W_2(\mu, \nu)$ is a distance between two probability densities $\mu, \nu \in \mathcal{P}_2(\mathbb{R}^m)$, which coincides with the infimum of the total Euclidean cost to transport mass that is distributed according to $\mu$ to $\nu$. In summary, $\mathcal{P}_2(\mathbb{R}^m)$ is a functional Riemannian manifold, and the infinite-dimensional gradient operator $\mathrm{grad}$ on $\mathcal{P}_2(\mathbb{R}^m)$ is defined via the metric.

### 1.2 Related works

Alain and Bengio (2014) were the first to derive a special case of (1), and their paper motivated the present study. While we investigate a deterministic formulation of DAEs, the transport map, they developed a probabilistic formulation of DAEs, i.e., generative modeling (Alain et al., 2016). Presently, various formulations based on this generative modeling method are widespread; for example, the variational autoencoder (Kingma and Welling, 2014), minimum probability flow (Sohl-Dickstein et al., 2015), and generative adversarial networks (GANs) (Goodfellow et al., 2014). In particular, the Wasserstein GAN (Arjovsky et al., 2017) employed Wasserstein geometry to reformulate and improve GANs.

## 2 DAE

We formulate the DAE as a variational problem, and show that the minimizer $g^*$, or the training result, is a transport map. Because a single training result of the DAE is typically a neural network, even though the variational formulation is independent of the choice of approximators, we refer to the minimizer as a DAE. We further investigate the initial velocity vector field of the mass transportation, and show that the data distribution evolves according to the continuity equation.

### 2.1 Training procedure of DAE

Let $x$ be an $m$-dimensional random vector that is distributed according to $\mu_0$, and let $\tilde{x}$ be its corruption defined by

$$\tilde{x} = x + \varepsilon, \quad \varepsilon \sim \nu_t,$$

where $\nu_t$ denotes the noise distribution parametrized by variance $t$. A basic example of $\nu_t$ is the Gaussian noise with mean $0$ and variance $t$, i.e., $\nu_t = N(0, tI)$.

The DAE is a function $g$ that is trained to remove the corruption $\tilde{x}$ and restore it to the original $x$; this is equivalent to training $g$ to minimize the objective function

$$L[g] := E_{x, \tilde{x}}\,|g(\tilde{x}) - x|^2. \tag{2}$$

In this study, we assume that $g$ is a universal approximator, which need not be a neural network, and thus $L[g]$ can attain a minimum. Typical examples of $g$ are neural networks with a sufficiently large number of hidden units, $B$-splines, random forests, and kernel machines.
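As a numerical illustration (a sketch with toy values assumed here, not from the original), the minimizer of (2) can be approximated by empirical least squares. For one-dimensional Gaussian data $\mu_0 = N(0, s^2)$ and Gaussian noise of variance $t$, the posterior mean is linear in $\tilde{x}$ with slope $s^2/(s^2 + t)$, and a least-squares fit recovers it:

```python
import numpy as np

# Monte Carlo sketch: the minimizer of the DAE objective (2) is the posterior
# mean E[x | x_tilde]. For 1-D Gaussian data N(0, s2) and Gaussian noise of
# variance t, that posterior mean is linear: x_tilde * s2 / (s2 + t).
rng = np.random.default_rng(1)
s2, t, n = 2.0, 0.5, 500_000                   # toy values (assumed)
x = rng.normal(0.0, np.sqrt(s2), n)            # clean samples from mu0
x_tilde = x + rng.normal(0.0, np.sqrt(t), n)   # corrupted inputs

# Least-squares linear fit (through the origin) of x on x_tilde
slope = (x_tilde @ x) / (x_tilde @ x_tilde)
print(slope, s2 / (s2 + t))  # both close to 0.8
```

Any sufficiently rich approximator trained on the pairs $(\tilde{x}, x)$ would converge to the same map; the linear fit suffices here only because the Gaussian posterior mean happens to be linear.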

### 2.2 Transport map of DAE

The global minimizer of (2) is explicitly obtained as follows.

###### Theorem 2.1 (Generalization of (Alain and Bengio, 2014, Theorem 1)).

For every $t \geq 0$ and $\tilde{x} \in \mathbb{R}^m$, $L[g]$ attains the global minimum at

$$g^*(\tilde{x}) = E_t[x \mid \tilde{x}] = \frac{1}{\mu_0 * \nu_t(\tilde{x})} \int_{\mathbb{R}^m} x \, \nu_t(x - \tilde{x}) \, \mu_0(x) \, dx \tag{3}$$

$$\phantom{g^*(\tilde{x})} = \tilde{x} - E_t[\varepsilon \mid \tilde{x}] = \tilde{x} - \frac{1}{\mu_0 * \nu_t(\tilde{x})} \int_{\mathbb{R}^m} \varepsilon \, \nu_t(\varepsilon) \, \mu_0(\tilde{x} - \varepsilon) \, d\varepsilon, \tag{4}$$

where $*$ denotes the convolution operator.

Henceforth, we refer to the minimizer $g^*$ as a DAE, and symbolize (4) by $\Phi_t$. That is,

$$\Phi_t(x) := x - \frac{1}{\mu_0 * \nu_t(x)} \int_{\mathbb{R}^m} \varepsilon \, \nu_t(\varepsilon) \, \mu_0(x - \varepsilon) \, d\varepsilon.$$

As previously stated, the DAE is composed of the identity map $x \mapsto x$ and the denoising map $x \mapsto -E_t[\varepsilon \mid x]$. In particular, when $t = 0$, the denoising map vanishes and the DAE reduces to a traditional autoencoder. We reinterpret the DAE as a transport map with transport time $t$ that transports mass at $x$ toward $\Phi_t(x)$ with displacement vector $-E_t[\varepsilon \mid x]$.

Note that the variational calculation first appeared in (Alain and Bengio, 2014, Theorem 1), in which the authors obtained (3). In statistics, (4) is known as Brown’s representation of the posterior mean (George et al., 2006). This is not just a coincidence because, as (3) suggests, the DAE is an estimator of the mean.

### 2.3 Initial velocity of the transport map

For the sake of simplicity and generality, we consider a generalized form of short-time transport maps:

$$g_t(x) := x + t \nabla V_t(x), \tag{5}$$

with some potential function $V_t$, and the potential velocity field, or flux, $\nabla V_t$. For example, as shown in (7), the Gaussian DAE is expressed in this form. Note that the establishment of a reasonable correspondence between the noise distribution $\nu_t$ and the potential $V_t$ for an arbitrary $\nu_t$ is an open question.

For the initial moment ($t = 0$), the following lemma holds.

###### Lemma 2.2.

Given a data distribution $\mu_0$, the pushforward measure $\mu_t = g_{t\sharp}\mu_0$ satisfies the continuity equation

$$\partial_t \mu_t(x) = -\nabla \cdot [\mu_t(x) \nabla V_t(x)], \quad t = 0, \tag{6}$$

where $\nabla \cdot$ denotes the divergence operator on $\mathbb{R}^m$.

The proof is given in Appendix B. Intuitively, the statement seems natural because (5) is a standard setup for the continuity equation. Note that this relation does not hold in general; in particular, it fails for $t > 0$. This is because time-dependent dynamics should be written as an ordinary differential equation such as $\frac{d}{dt} x(t) = \nabla V_t(x(t))$.

### 2.4 Example: Gaussian DAE

When $\nu_t = N(0, tI)$, the posterior mean is analytically obtained as follows:

$$E_t[\varepsilon \mid x] = -t \frac{\nabla \nu_t * \mu_0(x)}{\nu_t * \mu_0(x)} = -t \nabla \log[\nu_t * \mu_0(x)],$$

where the first equation follows from Stein's identity

$$-t \nabla \nu_t(\varepsilon) = \varepsilon \nu_t(\varepsilon),$$

which is known to hold only for Gaussians.

###### Theorem 2.3.

The Gaussian DAE is given by

$$\Phi_t(x) = x + t \nabla \log[W_t * \mu_0](x), \tag{7}$$

with the Gaussian $W_t := \nu_t = N(0, tI)$.
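To make (7) concrete, here is a small numerical sketch (toy values assumed, not from the original). For one-dimensional $\mu_0 = N(0, s^2)$, the smoothed density $W_t * \mu_0$ is $N(0, s^2 + t)$, so (7) reduces to the closed form $\Phi_t(x) = x \, s^2/(s^2 + t)$, which a finite-difference score reproduces:

```python
import numpy as np

# Sketch of eq. (7): Phi_t(x) = x + t * d/dx log[(W_t * mu0)(x)] for 1-D
# Gaussian data mu0 = N(0, s2). Then W_t * mu0 = N(0, s2 + t), so the exact
# answer is Phi_t(x) = x * s2 / (s2 + t).
s2, t = 2.0, 0.5  # toy values (assumed)

def smoothed_log_density(x):
    v = s2 + t  # variance of the smoothed density W_t * mu0
    return -x**2 / (2 * v) - 0.5 * np.log(2 * np.pi * v)

def gaussian_dae(x, h=1e-5):
    # score of the smoothed density via a central finite difference
    score = (smoothed_log_density(x + h) - smoothed_log_density(x - h)) / (2 * h)
    return x + t * score

print(gaussian_dae(1.3), 1.3 * s2 / (s2 + t))  # both approximately 1.04
```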

As $t \to 0$, the initial velocity vector is given by the score (as in score matching):

$$\partial_t \Phi_{t=0}(x) = \lim_{t \to 0} \frac{\Phi_t(x) - x}{t} = \nabla \log \mu_0(x). \tag{8}$$

Hence, by substituting the score (8) in the continuity equation (6), we have

$$\partial_t \mu_{t=0}(x) = -\nabla \cdot [\mu_0(x) \nabla \log \mu_0(x)] = -\nabla \cdot [\nabla \mu_0(x)] = -\triangle \mu_0(x),$$

where $\triangle$ denotes the Laplacian on $\mathbb{R}^m$.

###### Theorem 2.4.

The pushforward measure $\mu_t = \Phi_{t\sharp}\mu_0$ of the Gaussian DAE satisfies the backward heat equation:

$$\partial_t \mu_{t=0}(x) = -\triangle \mu_0(x). \tag{9}$$

We shall investigate the backward heat equation in Section 4.

## 3 Deep Gaussian DAEs

As a concrete example of deep DAEs, we further investigate the Gaussian DAE ($\nu_t = N(0, tI)$). We introduce the composition of DAEs, and the continuous DAE as an infinitesimal limit. We can understand the composition of DAEs as the Eulerian broken line approximation of a continuous DAE.

### 3.1 Composition of Gaussian DAEs

Let $x_0$ be an $m$-dimensional input vector that is subject to the data distribution $\mu_0$, and let $\Phi_0$ be a DAE that is trained for $x_0$ with noise variance $\tau_0$. Write $x_1 := \Phi_0(x_0)$. Then $x_1$ is a random vector in $\mathbb{R}^m$ that is subject to the pushforward measure $\Phi_{0\sharp}\mu_0$, and thus we can train another DAE $\Phi_1$ using $x_1$ with noise variance $\tau_1$. By repeating the procedure, we can obtain $x_{\ell+1}$ from $x_\ell$, and $\Phi_\ell$ with variance $\tau_\ell$. We write the composition of DAEs by

$$\Phi_{t_{0:L}}(x) := \Phi_L \circ \cdots \circ \Phi_0(x),$$

where $t_{0:L}$ denotes the "total time"; $t_{0:L} := \sum_{\ell=0}^{L} \tau_\ell$. By definition, at every $t_\ell$, the velocity vector of a composition of DAEs coincides with the score

$$\partial_t \Phi_{t_{0:\ell}}\big|_{t = t_\ell}(x) = \nabla \log \mu_{t_\ell}(x).$$

### 3.2 Continuous Gaussian DAE

We set the total time $T$ and take the limit $L \to \infty$ of the layer number. Then, we can see that the velocity vector of the "infinite composition of DAEs" tends to satisfy the continuity equation at every time. Hence, we introduce an ideal version of the DAE as follows.

###### Definition 3.1.

Set a data distribution $\mu_0$. We call the solution operator, or flow, $\varphi_t$ of the following dynamics the continuous DAE:

$$\frac{d}{dt} x(t) = \nabla \log \mu_t(x(t)), \quad t \geq 0, \tag{10}$$

where $\mu_t := \varphi_{t\sharp}\mu_0$.

The limit converges to a continuous DAE when, for example, the score $\nabla \log \mu_t$ is Lipschitz continuous at every time $t$, because the trajectory of the composition corresponds to a Eulerian broken line approximation of the integral curve of (10).
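The convergence can be sketched numerically (a one-dimensional Gaussian toy example with assumed values, not from the original). For $\mu_\ell = N(0, v)$, one DAE layer with noise variance $\tau$ maps $x \mapsto x\,v/(v+\tau)$ and the variance to $v\,(v/(v+\tau))^2$, while the continuous DAE yields variance $v_0 - 2T$ at total time $T$; composing many small steps approaches the latter:

```python
# Euler-broken-line sketch: compose 1-D Gaussian DAE variance updates with
# small noise variances tau = T / L and compare against the continuous DAE,
# whose pushforward variance at total time T is v0 - 2*T.
def composed_variance(v0, T, L):
    v, tau = v0, T / L
    for _ in range(L):
        v = v * (v / (v + tau)) ** 2  # pushforward variance of one DAE layer
    return v

v0, T = 4.0, 0.5  # toy values (assumed)
print(composed_variance(v0, T, 10))    # coarse composition
print(composed_variance(v0, T, 1000))  # close to v0 - 2*T = 3.0
```

Each step shrinks the variance by approximately $2\tau$, so the per-step error is $O(\tau^2)$ and the total error vanishes as $L \to \infty$, as in a standard Euler scheme.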

The following property is immediate from Theorem 2.4.

###### Theorem 3.1.

Let $\varphi_t$ be a continuous DAE trained for $\mu_0$. Then, the pushforward measure $\mu_t = \varphi_{t\sharp}\mu_0$ is the solution of the initial value problem

$$\partial_t \mu_t(x) = -\triangle \mu_t(x), \quad \mu_{t=0}(x) = \mu_0(x), \tag{11}$$

which we refer to as the backward heat equation.

The backward heat equation (11) is equivalent to the following final value problem for the ordinary heat equation:

$$\partial_t u_t(x) = \triangle u_t(x), \quad u_{t=T}(x) = \mu_0(x) \text{ for some } T,$$

where $u_t$ denotes a probability measure on $\mathbb{R}^m$. Indeed,

$$\mu_t(x) = u_{T-t}(x)$$

is the solution of (11). In other words, the backward heat equation describes the time reversal of an ordinary diffusion process.
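As a sanity check (a worked one-dimensional Gaussian example, not from the original text), let $\mu_0 = N(0, w_0)$. The heat semigroup grows the variance at rate $2$, so

$$u_t = N\big(0,\; w_0 - 2(T - t)\big) \quad \text{solves} \quad \partial_t u_t = \triangle u_t, \quad u_{t=T} = \mu_0,$$

and hence

$$\mu_t = u_{T-t} = N(0,\; w_0 - 2t),$$

which loses variance as $t$ grows; this matches the pushforward measure of the continuous DAE computed in Section 4.3.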

### 3.3 Numerical example of trajectories

Figure 2 compares the trajectories of four DAEs trained for the same data distribution

$$\mu_0 = N([0,0], \mathrm{diag}[2,1]).$$

The trajectories are analytically calculated as

$$\varphi_t(x) = \sqrt{I - 2t\Sigma_0^{-1}}\,(x - \mu_0) + \mu_0, \tag{12}$$

and

$$\Phi_t(x) = (I + t\Sigma_0^{-1})^{-1} x + (I + t^{-1}\Sigma_0)^{-1} \mu_0, \tag{13}$$

where $\mu_0$ and $\Sigma_0$ denote (with a slight abuse of notation) the mean and covariance matrix of the normal distribution, respectively.

The continuous DAE (12) attains the singularity in finite time. On the contrary, the DAE (13) slows down as $t$ increases and never attains the singularity in finite time. As the number of layers tends to infinity, the composition of DAEs draws a similar orbit as the continuous DAE; the curvature of the orbits also changes accordingly.
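A small numerical sketch of (12) and (13) (zero-mean case, values assumed for illustration) shows the contrast: the continuous DAE collapses the smaller-variance coordinate at a finite time, while the DAE only shrinks it:

```python
import numpy as np

# Trajectories (12) and (13) for mu0 = N([0,0], diag[2,1]) with zero mean,
# so the mean terms drop out of both formulas.
Sigma0_inv = np.linalg.inv(np.diag([2.0, 1.0]))
I = np.eye(2)

def continuous_dae(x, t):
    # eq. (12): sqrt(I - 2 t Sigma0^{-1}) x  (diagonal case, clipped at the singularity)
    scale = np.sqrt(np.clip(np.diag(I - 2 * t * Sigma0_inv), 0.0, None))
    return scale * x

def dae(x, t):
    # eq. (13): (I + t Sigma0^{-1})^{-1} x
    return np.linalg.inv(I + t * Sigma0_inv) @ x

x0 = np.array([1.0, 1.0])
print(continuous_dae(x0, 0.5))  # second coordinate collapses to 0 at t = 1/2
print(dae(x0, 0.5))             # second coordinate only shrinks to 2/3
```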

## 4 Wasserstein gradient flow

As an analogy of the Gaussian DAE, we can expect that the pushforward measure $\mu_t$ of a general continuous DAE satisfies the continuity equation

$$\partial_t \mu_t(x) = -\nabla \cdot [\mu_t(x) \nabla V_t(x)], \quad t > 0. \tag{14}$$

According to the Otto calculus (Villani, 2009, Ex. 15.10), the solution coincides with a trajectory of the Wasserstein gradient flow

$$\dot{\mu}_t = -\mathrm{grad}\, F[\mu_t] \tag{15}$$

with respect to a potential functional $F$. Here, $\mathrm{grad}$ denotes the gradient operator on the $2$-Wasserstein space $\mathcal{P}_2(\mathbb{R}^m)$, and $F$ satisfies the following equation:

$$\frac{d}{dt} F[\mu_t] = \int_{\mathbb{R}^m} \nabla V_t(x) \, [\partial_t \mu_t](x) \, dx.$$

Recall that the $2$-Wasserstein space is a functional manifold. While (15) is an ordinary differential equation on the space of probability density functions, (14) is a partial differential equation on the Euclidean space $\mathbb{R}^m$. Hence, we use different notations for the time derivatives: $\dot{\mu}_t$ and $\partial_t \mu_t$.

The Wasserstein gradient flow (15) possesses a distinct advantage: the potential functional $F$ does not depend on the time $t$. In the following subsections, we will see both the Boltzmann entropy and the Renyi entropy as examples of $F$.

### 4.1 Example: Gaussian DAE

According to Wasserstein geometry, the ordinary heat equation corresponds to a Wasserstein gradient flow that increases the entropy functional

$$H[\mu] := -\int_{\mathbb{R}^m} \mu(x) \log \mu(x) \, dx$$

(Villani, 2009, Th. 23.19). Consequently, we can conclude that the feature map of the Gaussian DAE is a transport map that decreases the entropy of the data distribution. This is immediate, because when $F = H$, then $\nabla V_t = \nabla \log \mu_t$; thus,

$$\partial_t \mu_t(x) = -\triangle \mu_t(x), \tag{16}$$

which means (14) reduces to the backward heat equation.

### 4.2 Example: Renyi Entropy

Similarly, when $F$ is the Renyi entropy

$$H_\alpha[\mu] := \int_{\mathbb{R}^m} \frac{\mu^\alpha(x) - \mu(x)}{\alpha - 1} \, dx,$$

then $\nabla V_t = \nabla \frac{\alpha}{\alpha - 1} \mu_t^{\alpha - 1}$ (see (Villani, 2009, Ex. 15.6) for the proof), and thus (14) reduces to the backward porous medium equation

$$\partial_t \mu_t(x) = -\triangle \mu_t^\alpha(x). \tag{17}$$

### 4.3 Numerical example of abstract trajectories

Figure 3 compares the abstract trajectories of pushforward measures in the space of bivariate Gaussians

$$\mu_0 = N([0,0], \mathrm{diag}[\sigma_1^2, \sigma_2^2]).$$

The entropy functional is given by

$$H(\sigma_1, \sigma_2) = \frac{1}{2} \log \big|\mathrm{diag}[\sigma_1^2, \sigma_2^2]\big| + \mathrm{const.} = \log \sigma_1 + \log \sigma_2 + \mathrm{const.}$$

Note that the parameterization $(\sigma_1, \sigma_2)$ is reasonable because, in this space, the Wasserstein distance between two points $(\sigma_1, \sigma_2)$ and $(\sigma_1', \sigma_2')$ is given by the Euclidean distance $\sqrt{(\sigma_1 - \sigma_1')^2 + (\sigma_2 - \sigma_2')^2}$. The pushforward measures are analytically calculated as

$$\varphi_{t\sharp} N(\mu_0, \Sigma_0) = N(\mu_0, \Sigma_0 - 2tI),$$

and

$$\Phi_{t\sharp} N(\mu_0, \Sigma_0) = N\big(\mu_0, \Sigma_0 (I + t\Sigma_0^{-1})^{-2}\big),$$

where $\mu_0$ and $\Sigma_0$ denote the mean and covariance matrix of the normal distribution, respectively.
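These closed forms can be checked by Monte Carlo (a sketch with assumed toy values, not from the original): push samples of $N(0, \Sigma_0)$ through the zero-mean DAE map and compare the empirical covariance with $\Sigma_0 (I + t\Sigma_0^{-1})^{-2}$:

```python
import numpy as np

# Monte Carlo check of the pushforward measure of the DAE:
# Phi_t maps x -> (I + t Sigma0^{-1})^{-1} x (zero-mean case), so the
# pushforward covariance should be Sigma0 (I + t Sigma0^{-1})^{-2}.
rng = np.random.default_rng(0)
Sigma0 = np.diag([2.0, 1.0])
t = 0.3  # toy transport time (assumed)

A = np.linalg.inv(np.eye(2) + t * np.linalg.inv(Sigma0))
samples = rng.multivariate_normal([0.0, 0.0], Sigma0, size=200_000)
empirical = np.cov(samples @ A.T, rowvar=False)

print(np.round(empirical, 2))
print(np.round(Sigma0 @ A @ A, 2))  # the two covariance matrices should agree
```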

## 5 Discussion

We investigated deep denoising autoencoders (DAEs) using transportation theory.

The training algorithm of the DAE is equivalent to the minimization of the objective $L[g]$ with respect to $g$. We found that the minimizer is given by the transport map (4). The initial velocity vector of the mass transportation is given by the score. Consequently, for Gaussian DAEs, the initial velocity of the pushforward measure coincides with the negative Laplacian $-\triangle \mu_0$. In particular, the DAE transports mass to restore the diffusion. From a statistical viewpoint, this is a natural consequence because the DAE is an estimator of the mean.

These properties are limited to $t = 0$ for the DAE. Hence, we introduced the composition of DAEs and its limit, i.e., the continuous DAE. We can understand the composition of DAEs as a Eulerian broken line approximation of a continuous DAE. The pushforward measure of the continuous Gaussian DAE satisfies the backward heat equation (Theorem 3.1). According to Wasserstein geometry, the continuous Gaussian DAE, which is an infinitely deep DAE, transports mass to decrease the entropy of the data distribution.

In general, the estimation of the time reversal of a diffusion process is an inverse problem. In fact, our preliminary experiments indicated that the training result is sensitive to small perturbations of the training data. However, as previously mentioned, from a statistical viewpoint, this was expected, because, by definition, a DAE is an estimator of the mean. Therefore, just as a good estimator reduces the uncertainty of a parameter, the DAE will decrease the entropy of the data distribution.

We expect that not only the DAE, but also a wide range of deep neural networks, including both supervised and unsupervised ones, can be uniformly regarded as transport maps. For example, it is not difficult to imagine that DAEs with non-Gaussian noise correspond to other Lyapunov functionals such as the Renyi entropy and the Bregman divergence. The form of transport maps emerges not only in DAEs, but also, for example, in ResNet (He et al., 2016). Transportation analysis of these deep neural networks will be part of our future works.

## Appendix A Proof of Theorem 2.1

This proof follows from a variational calculation. Rewrite

$$L[g] = \int_{\mathbb{R}^m} E_\varepsilon |g(x + \varepsilon) - x|^2 \mu_0(x) \, dx = \int_{\mathbb{R}^m} E_\varepsilon\!\left[ |g(x') - x' + \varepsilon|^2 \mu_0(x' - \varepsilon) \right] dx', \quad x' \leftarrow x + \varepsilon.$$

Then, for every function $h$, the variation $\delta L[h]$ is given by the directional derivative along $h$:

$$\delta L[h] = \frac{d}{ds} L[g + sh] \Big|_{s=0} = \int_{\mathbb{R}^m} \frac{\partial}{\partial s} E_\varepsilon\!\left[ |g(x) + s h(x) - x + \varepsilon|^2 \mu_0(x - \varepsilon) \right] dx \Big|_{s=0} = 2 \int_{\mathbb{R}^m} E_\varepsilon\!\left[ (g(x) - x + \varepsilon) \mu_0(x - \varepsilon) \right] h(x) \, dx.$$

At a critical point of $L$, $\delta L[h] = 0$ for every $h$. Hence

$$E_\varepsilon\!\left[ (g^*(x) - x + \varepsilon) \mu_0(x - \varepsilon) \right] = 0, \quad \text{a.e. } x,$$

and we have

$$g^*(x) = \frac{E_\varepsilon[(x - \varepsilon)\, \mu_0(x - \varepsilon)]}{E_\varepsilon[\mu_0(x - \varepsilon)]} = x - \frac{E_\varepsilon[\varepsilon\, \mu_0(x - \varepsilon)]}{E_\varepsilon[\mu_0(x - \varepsilon)]},$$

which coincides with (3) and (4), respectively.

The minimizer $g^*$ attains the global minimum because, for every function $h$,

$$L[g^* + h] = \int_{\mathbb{R}^m} E_\varepsilon\!\left[ |\varepsilon - E_t[\varepsilon \mid x] + h(x)|^2 \mu_0(x - \varepsilon) \right] dx$$

$$= \int_{\mathbb{R}^m} E_\varepsilon\!\left[ |\varepsilon - E_t[\varepsilon \mid x]|^2 \mu_0(x - \varepsilon) \right] dx + \int_{\mathbb{R}^m} E_\varepsilon\!\left[ |h(x)|^2 \mu_0(x - \varepsilon) \right] dx + 2 \int_{\mathbb{R}^m} E_\varepsilon\!\left[ (\varepsilon - E_t[\varepsilon \mid x]) \mu_0(x - \varepsilon) \right] h(x) \, dx$$

$$= L[g^*] + L[h] + 2 \cdot 0 \geq L[g^*].$$

## Appendix B Proof of Lemma 2.2

To facilitate visualization, we write $g(x, t)$, $V(x, t)$, and $\mu(x, t)$ instead of $g_t(x)$, $V_t(x)$, and $\mu_t(x)$, respectively. It then immediately follows that

$$g(x, 0) = x, \quad \partial_t g(x, 0) = \nabla V(x, 0), \quad \nabla g(x, 0) = I.$$

According to the change of variables formula,

$$\mu(g(x, t), t) \cdot |\nabla g(x, t)| = \mu(x, 0),$$

where $|\cdot|$ denotes the determinant.

Take the logarithm of both sides, and then differentiate with respect to $t$. Then the RHS vanishes, and the LHS is calculated as follows:

$$\partial_t \log\!\left[ \mu(g(x,t), t) \cdot |\nabla g(x,t)| \right] = \frac{\partial_t[\mu(g(x,t), t)]}{\mu(g(x,t), t)} + \partial_t \log |\nabla g(x,t)| = \frac{(\nabla \mu)(g(x,t), t) \cdot \partial_t g(x,t) + (\partial_t \mu)(g(x,t), t)}{\mu(g(x,t), t)} + \mathrm{tr}\!\left[ (\nabla g(x,t))^{-1} \nabla \partial_t g(x,t) \right],$$

where the second term follows from the differentiation formula (Petersen and Pedersen, 2012, (43))

$$\partial \log |J| = \mathrm{tr}[J^{-1} \partial J].$$

Substitute $t = 0$. Then we have

$$\frac{\nabla \mu(x, 0) \cdot \nabla V(x, 0) + (\partial_t \mu)(x, 0)}{\mu(x, 0)} + \mathrm{tr}[\nabla \nabla V(x, 0)] = 0,$$

that is,

$$\partial_t \mu(x, 0) = -\nabla \cdot [\mu(x, 0) \nabla V(x, 0)].$$

## References

• Alain and Bengio [2014] Guillaume Alain and Yoshua Bengio. What regularized auto-encoders learn from the data-generating distribution. JMLR, pages 3743–3773, 2014.
• Alain et al. [2016] Guillaume Alain, Yoshua Bengio, Li Yao, Jason Yosinski, Eric Thibodeau-Laufer, Saizheng Zhang, and Pascal Vincent. GSNs: generative stochastic networks. Information and Inference, (2):210–249, 2016.
• Arjovsky et al. [2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. Technical report, 2017.
• Bengio et al. [2013] Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized denoising auto-encoders as generative models. In NIPS2013, pages 899–907, 2013.
• Bengio et al. [2014] Yoshua Bengio, Éric Thibodeau-Laufer, Guillaume Alain, and Jason Yosinski. Deep generative stochastic networks trainable by backprop. In ICML2014, pages 226–234, 2014.
• George et al. [2006] Edward I. George, Feng Liang, and Xinyi Xu. Improved minimax predictive densities under Kullback–Leibler loss. Annals of Statistics, 34(1):78–91, 2006.
• Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS2014, pages 2672–2680, 2014.
• He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
• Kingma and Welling [2014] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR2014, pages 1–14, 2014.
• Petersen and Pedersen [2012] Kaare Brandt Petersen and Michael Syskind Pedersen. The Matrix Cookbook. Technical report, Technical University of Denmark, 2012.
• Rifai et al. [2011] Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: explicit invariance during feature extraction. In ICML2011, pages 833–840, 2011.
• Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML2015, pages 2256–2265, 2015.
• Villani [2009] Cédric Villani. Optimal Transport: Old and New. Springer-Verlag Berlin Heidelberg, 2009.
• Vincent [2011] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
• Vincent et al. [2008] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML2008, pages 1096–1103, 2008.
• Vincent et al. [2010] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. JMLR, pages 3371–3408, 2010.