Despite rapid progress in its applications, the deep structure of neural networks remains analytically unexplained because (1) functional composition has poor compatibility with the basics of machine learning, “basis and coefficients,” and (2) the parameterization of neural networks is not faithful, so parametric arguments are subject to technical difficulties such as local minima and algebraic singularities. In this paper, we introduce the transportation interpretation of deep neural networks; we regard a neural network with $m$ inputs and $n$ outputs as a vector-valued map $g : \mathbb{R}^m \to \mathbb{R}^n$, and interpret $g$ as a transport map that transforms the input vector $x$ to $g(x)$. Because the composition of transport maps is also a transport map, a trajectory is the natural model of the composition structure of deep neural networks. Furthermore, because a trajectory is independent of its parameterization, redundant parameterization of neural networks is avoided.
By determining and analyzing the transportation dynamics of a deep neural network, we can understand the behavior of that network. For example, in a deep neural network that distinguishes pictures of dogs from pictures of cats, we can expect the feature extractor to be a transport map that moves the input vectors of dogs and cats apart, much as oil and water separate. Note that the input and output dimensions of a feature map in a neural network rarely coincide. Nevertheless, we can regard the feature map as a transport map by considering both the input and output spaces to be embedded in a common high-dimensional space. In this manner, we can always associate a trajectory with a deep neural network, and transportation is therefore a universal characteristic of deep neural networks.
The denoising autoencoder (DAE), used to obtain a good representation of data, is a cornerstone of deep learning, or representation learning. The traditional autoencoder is a neural network that is trained as an identity map $g(x) = x$. The hidden layer of the network is used as a feature map, which is often called the “code” because, in general, the activation pattern appears random and encoded. Vincent et al. (2008) introduced the DAE as a heuristic modification of traditional autoencoders to increase robustness. In this case, the DAE is trained as a “denoising” map $g(\tilde{x}) \approx x$ of deliberately corrupted inputs $\tilde{x}$. Though the corrupt-and-denoise principle is simple, it has been used successfully in deep learning and has therefore inspired many representation learning algorithms (Vincent et al., 2010; Vincent, 2011; Rifai et al., 2011; Bengio et al., 2013, 2014; Alain and Bengio, 2014). Though the term “DAE” is the name of a training method, as long as there is no risk of confusion, we abbreviate “a training result $g$ of the DAE” as “a DAE $g$”.
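As discussed later, we found that when the corruption process is additive, i.e., $\tilde{x} = x + \varepsilon$ with some noise $\varepsilon$, then the DAE $g$ takes the form
$$g(\tilde{x}) = \tilde{x} - \mathbb{E}_t[\varepsilon \mid \tilde{x}], \tag{1}$$
where $t$ denotes noise variance, and the expectation is taken with respect to a posterior distribution of noise $\varepsilon$ given $\tilde{x}$. We can observe that the DAE (1) is composed of the traditional autoencoder $\tilde{x} \mapsto \tilde{x}$ and the denoising term $\tilde{x} \mapsto -\mathbb{E}_t[\varepsilon \mid \tilde{x}]$. From a statistical viewpoint, this form is reasonable because a DAE is an estimator of the mean, or the location parameter. Specifically, given a corrupted input $\tilde{x}$ of an unknown truth $x$, $g(\tilde{x})$ is an estimator of $x$.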
In this study, we interpret (1) as a transport map, by regarding the denoising term as a displacement vector from the origin $\tilde{x}$. In addition, we regard the noise variance $t$ as transport time, and write $g_t$ for the DAE trained with noise variance $t$. As time $t$ evolves, the data distribution $\mu_0$ will be deformed to $\mu_t$ according to the mass transportation given by $g_t$, i.e., $\mu_t$ is the pushforward measure of $\mu_0$ by $g_t$, and is denoted by $g_{t\sharp}\mu_0$. Because $g_t$ defines a time-dependent dynamical system, $g_t$ is difficult to analyze. Instead, we focus on $\mu_t$, and show that $\mu_t$ evolves according to a Wasserstein gradient flow with respect to a certain potential functional $F[\mu_t]$, which is independent of time. In general, a DAE is identified by $F$.
In the following sections, we determine and analyze the transport map $g_t$ of DAEs. In Section 2, we show that $g_t$ is given by (1), and that $\mu_t$ evolves according to the continuity equation as $t \to 0$. Then, in Section 3, we consider the composition of DAEs, or a deep DAE, and show that the continuum limit of the compositions satisfies the continuity equation at every time $t$. Finally, in Section 4, we explain the association between the DAE and the Wasserstein gradient flow.
1.1 A minimal introduction to Wasserstein gradient flow
The Wasserstein gradient flow (Villani, 2009, § 23), also known as the Otto calculus and the abstract gradient flow, is an infinite-dimensional gradient flow defined on the $L^2$-Wasserstein space $\mathcal{P}_2(\mathbb{R}^m)$. Here, $\mathcal{P}_2(\mathbb{R}^m)$ is a family of sufficiently smooth probability density functions on $\mathbb{R}^m$ that have at least second moments, equipped with the $L^2$-Wasserstein metric $\mathfrak{g}_2$; $\mathfrak{g}_2$ is an infinite-dimensional Riemannian metric that is compatible with the $L^2$-Wasserstein distance $W_2$; and $W_2(\mu, \nu)$ is a distance between two probability densities $\mu$ and $\nu$ in $\mathcal{P}_2(\mathbb{R}^m)$, which coincides with the infimum of the total Euclidean cost to transport mass that is distributed according to $\mu$ to $\nu$. In summary, $\mathcal{P}_2(\mathbb{R}^m)$ is a functional Riemannian manifold, and the infinite-dimensional gradient operator $\mathrm{grad}$ on $\mathcal{P}_2(\mathbb{R}^m)$ is defined via the metric $\mathfrak{g}_2$.
1.2 Related works
Alain and Bengio (2014) were the first to derive a special case of (1), and their paper motivated the present study. While we investigate a deterministic formulation of DAEs (the transport map), they developed a probabilistic formulation of DAEs, i.e., generative modeling (Alain et al., 2016). Presently, various formulations based on this generative modeling method are widespread; for example, the variational autoencoder (Kingma and Welling, 2014), minimum probability flow (Sohl-Dickstein et al., 2015), and generative adversarial networks (GANs) (Goodfellow et al., 2014). In particular, Wasserstein GAN (Arjovsky et al., 2017) employed Wasserstein geometry to reformulate and improve GANs.
We formulate the DAE as a variational problem, and show that the minimizer $g^*$, or the training result, is a transport map. Although a single training run of the DAE typically produces a neural network, the variational formulation is independent of the choice of approximator, and we therefore refer to the minimizer $g^*$ itself as a DAE. We further investigate the initial velocity vector field for mass transportation, and show that the data distribution evolves according to the continuity equation.
2.1 Training procedure of DAE
Let $x$ be an $m$-dimensional random vector that is distributed according to $\mu_0$, and $\tilde{x}$ be its corruption defined by
$$\tilde{x} = x + \varepsilon, \quad \varepsilon \sim \nu_t,$$
where $\nu_t$ denotes the noise distribution parametrized by variance $t \ge 0$. A basic example of $\nu_t$ is the Gaussian noise with mean $0$ and variance $t$, i.e., $\nu_t = N(0, tI)$.
The DAE is a function that is trained to remove the corruption from $\tilde{x}$ and restore the original $x$; this is equivalent to training a function $g$ to minimize an objective function, i.e.,
$$L[g] := \mathbb{E}_{x, \tilde{x}} \, | g(\tilde{x}) - x |^2. \tag{2}$$
In this study, we assume that $g$ is a universal approximator, which need not be a neural network, and thus can attain a minimum. Typical examples of $g$ are neural networks with a sufficiently large number of hidden units, $B$-splines, random forests, and kernel machines.
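The training objective above can be estimated by Monte Carlo. The following is a minimal sketch, assuming one-dimensional Gaussian data and Gaussian noise; the names `denoising_loss`, `m`, `v`, and `t` are illustrative and do not come from the paper. It compares the identity map with the closed-form posterior mean for this toy setting.

```python
import numpy as np

def denoising_loss(g, x, x_tilde):
    """Monte Carlo estimate of the denoising objective E|g(x_tilde) - x|^2."""
    return float(np.mean((g(x_tilde) - x) ** 2))

rng = np.random.default_rng(0)
m, v, t = 1.0, 2.0, 0.5  # data mean, data variance, noise variance (illustrative values)

x = rng.normal(m, np.sqrt(v), 200_000)             # clean samples
x_tilde = x + rng.normal(0.0, np.sqrt(t), x.size)  # additive Gaussian corruption

identity = lambda z: z                                # traditional autoencoder
posterior_mean = lambda z: z - t * (z - m) / (v + t)  # E[x | x_tilde] for Gaussian data and noise

loss_identity = denoising_loss(identity, x, x_tilde)
loss_posterior = denoising_loss(posterior_mean, x, x_tilde)
```

In this toy setting the identity map should incur a loss close to the noise variance `t`, while the posterior mean should come close to the posterior variance `v * t / (v + t)`, which is smaller.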
2.2 Transport map of DAE
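The global minimizer of (2) is explicitly obtained as follows.
Theorem 2.1 (Generalization of (Alain and Bengio, 2014, Theorem 1)).
For every data distribution $\mu_0$ and noise distribution $\nu_t$, the objective $L[g]$ attains the global minimum at
$$g^*(x) = \frac{1}{\nu_t * \mu_0(x)} \int_{\mathbb{R}^m} x' \, \nu_t(x - x') \, \mu_0(x') \, dx' \tag{3}$$
$$\phantom{g^*(x)} = x - \frac{1}{\nu_t * \mu_0(x)} \int_{\mathbb{R}^m} \varepsilon \, \nu_t(\varepsilon) \, \mu_0(x - \varepsilon) \, d\varepsilon, \tag{4}$$
where $*$ denotes the convolution operator.
Henceforth, we refer to the minimizer as a DAE, and symbolize (4) by $\Phi_t$. That is,
$$\Phi_t(x) := x - \frac{1}{\nu_t * \mu_0(x)} \int_{\mathbb{R}^m} \varepsilon \, \nu_t(\varepsilon) \, \mu_0(x - \varepsilon) \, d\varepsilon.$$
As previously stated, the DAE $\Phi_t$ is composed of the identity map $x \mapsto x$ and the denoising map $x \mapsto -\mathbb{E}_t[\varepsilon \mid x]$, where the posterior mean $\mathbb{E}_t[\varepsilon \mid x]$ of the noise is the integral term in (4). In particular, when $t = 0$, the denoising map vanishes and the DAE reduces to a traditional autoencoder. We reinterpret the DAE $\Phi_t$ as a transport map with transport time $t$ that transports mass at $x$ toward $x + \Delta x$ with displacement vector $\Delta x = -\mathbb{E}_t[\varepsilon \mid x]$.
Note that the variational calculation first appeared in (Alain and Bengio, 2014, Theorem 1), in which the authors obtained (3). In statistics, (4) is known as Brown’s representation of the posterior mean (George et al., 2006). This is not just a coincidence because, as (3) suggests, the DAE is an estimator of the mean.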
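The closed-form minimizer can be evaluated numerically. Below is a minimal sketch, assuming a one-dimensional setting with Gaussian data density and Gaussian noise density so that the result can be compared against the known posterior mean; the function names and parameter values are hypothetical. It evaluates the "identity minus posterior mean of the noise" form by quadrature on a uniform grid.

```python
import numpy as np

def gaussian_pdf(z, mean, var):
    return np.exp(-(z - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def dae_by_quadrature(x, mu0, nu_t, eps):
    """x - (int eps nu_t(eps) mu0(x - eps) d eps) / (int nu_t(eps) mu0(x - eps) d eps).

    `eps` is a uniform grid; the grid spacing cancels in the ratio of the two sums.
    """
    w = nu_t(eps) * mu0(x - eps)  # integrand of the convolution (nu_t * mu0)(x)
    return x - np.sum(eps * w) / np.sum(w)

m, v, t = 1.0, 2.0, 0.5                   # illustrative data mean/variance and noise variance
eps = np.linspace(-12.0, 12.0, 4801)      # integration grid for the noise variable
mu0 = lambda z: gaussian_pdf(z, m, v)     # assumed data density
nu_t = lambda z: gaussian_pdf(z, 0.0, t)  # assumed noise density

x = 3.0
numeric = dae_by_quadrature(x, mu0, nu_t, eps)
closed_form = x - t * (x - m) / (v + t)   # posterior mean for Gaussian data and noise
```

The quadrature routine itself does not rely on Gaussianity; any pair of densities that can be evaluated on the grid could be passed in, with the Gaussian pair serving only as a check.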
2.3 Initial velocity of the transport map
For the sake of simplicity and generality, we consider a generalized form of short time transport maps:
$$g_t(x) = x + t \, \nabla V_t(x), \tag{5}$$
with some potential function $V_t$, and the potential velocity field, or flux, $\nabla V_t$. For example, as shown in (7), the Gaussian DAE is expressed in this form. Note that the establishment of a reasonable correspondence between the potential $V_t$ and the noise distribution $\nu_t$ for an arbitrary $\nu_t$ is an open question.
For the initial moment ($t = 0$), the following lemma holds.
Lemma 2.2. Given a data distribution $\mu_0$, the pushforward measure $\mu_t := g_{t\sharp}\mu_0$ satisfies the continuity equation
$$\partial_t \mu_{t=0}(x) = -\nabla \cdot [\mu_0(x) \nabla V_0(x)], \tag{6}$$
where $\nabla \cdot$ denotes the divergence operator on $\mathbb{R}^m$.
The proof is given in Appendix B. Intuitively, the statement seems natural because (5) is a standard setup for the continuity equation. Note that this relation does not hold in general. In particular, $\partial_t \mu_t \neq -\nabla \cdot [\mu_t \nabla V_t]$ for $t > 0$. This is because time-dependent dynamics should be written as an ordinary differential equation such as $\partial_t g_t(x) = \nabla V_t(x)$.
2.4 Example: Gaussian DAE
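When $\nu_t = N(0, tI)$, the posterior mean $\mathbb{E}_t[\varepsilon \mid x]$ of the noise is analytically obtained as follows, where $\mu_0$ is the data distribution.
$$\mathbb{E}_t[\varepsilon \mid x] = -\frac{t \, \nabla [W_t * \mu_0](x)}{[W_t * \mu_0](x)} = -t \, \nabla \log [W_t * \mu_0](x),$$
where the first equation follows by Stein’s identity
$$-t \, \nabla W_t(\varepsilon) = \varepsilon \, W_t(\varepsilon),$$
which is known to hold only for Gaussians.
Theorem 2.3. The Gaussian DAE $\Phi_t$ is given by
$$\Phi_t(x) = x + t \, \nabla \log [W_t * \mu_0](x), \tag{7}$$
with Gaussian $W_t(x) := (2\pi t)^{-m/2} \exp\left(-|x|^2 / 2t\right)$.
When $t \to 0$, the initial velocity vector is given by the score (i.e., score matching)
$$\partial_t \Phi_{t=0}(x) = \lim_{t \to 0} \frac{\Phi_t(x) - x}{t} = \nabla \log \mu_0(x). \tag{8}$$
Substituting the score (8) into the continuity equation (6) gives
$$\partial_t \mu_{t=0}(x) = -\nabla \cdot [\mu_0(x) \nabla \log \mu_0(x)] = -\Delta \mu_0(x),$$
where $\Delta$ denotes the Laplacian on $\mathbb{R}^m$.
Theorem 2.4. The pushforward measure of the Gaussian DAE satisfies the backward heat equation:
$$\partial_t \mu_{t=0}(x) = -\Delta \mu_0(x). \tag{9}$$
We shall investigate the backward heat equation in Section 4.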
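The backward heat equation at the initial moment can be checked numerically in a case where everything is available in closed form. The sketch below assumes one-dimensional Gaussian data $N(m, v)$, for which the Gaussian DAE is the affine map $x \mapsto m + (x - m)\,v/(v + t)$ and the pushforward is again Gaussian; the helper names and the values of `m`, `v`, and `h` are illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(z, mean, var):
    return np.exp(-(z - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def pushforward_var(v, t):
    """Variance of the pushforward of N(m, v) under x -> m + (x - m) * v / (v + t)."""
    return v * (v / (v + t)) ** 2

m, v = 0.0, 1.5                  # assumed data mean and variance
x = np.linspace(-3.0, 3.0, 13)   # evaluation points
h = 1e-6                         # small time step for the forward difference

# Forward difference in time of the pushforward density at t = 0.
dmu_dt = (gaussian_pdf(x, m, pushforward_var(v, h)) - gaussian_pdf(x, m, v)) / h

# Second spatial derivative (1-D Laplacian) of the Gaussian density N(m, v).
laplacian = gaussian_pdf(x, m, v) * ((x - m) ** 2 / v ** 2 - 1.0 / v)

# The backward heat equation predicts dmu_dt = -laplacian at t = 0.
max_gap = float(np.max(np.abs(dmu_dt + laplacian)))
```

The quantity `max_gap` should be on the order of the step size `h`, since the only error sources are the forward difference in time and floating-point rounding.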
3 Deep Gaussian DAEs
As a concrete example of deep DAEs, we further investigate the Gaussian DAE, i.e., the DAE trained with Gaussian noise. We introduce the composition of DAEs, and the continuous DAE as its infinitesimal limit. We can understand the composition of DAEs as the Eulerian broken line approximation of a continuous DAE.
3.1 Composition of Gaussian DAEs
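Let $x_0$ be an $m$-dimensional input vector that is subject to data distribution $\mu_0$, and $\Phi_0$ be a DAE that is trained for $\mu_0$ with noise variance $\tau_0$. Write $x_1 := \Phi_0(x_0)$. Then $x_1$ is a random vector in $\mathbb{R}^m$ that is subject to the pushforward measure $\mu_1 := \Phi_{0\sharp}\mu_0$, and thus, we can train another DAE $\Phi_1$ using $\mu_1$ with noise variance $\tau_1$. By repeating the procedure, we can obtain $x_{\ell+1} := \Phi_\ell(x_\ell)$ from $x_\ell$, and $\mu_{\ell+1} := \Phi_{\ell\sharp}\mu_\ell$ with variance $\tau_\ell$. We write the composition of DAEs by
$$\Phi^t_{0:L}(x) := \Phi_L \circ \cdots \circ \Phi_0(x),$$
where $t$ denotes “total time”; $t := \tau_0 + \cdots + \tau_L$. By definition, at every $t_\ell := \tau_0 + \cdots + \tau_{\ell-1}$, the velocity vector of a composition of DAEs coincides with the score
$$\partial_t \Phi^{t = t_\ell}_{0:\ell}(x) = \nabla \log \mu_\ell(x).$$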
3.2 Continuous Gaussian DAE
We fix the total time $t$ and take the limit $L \to \infty$ of the layer number $L$. Then, we can see that the velocity vector of the “infinite composition of DAEs” tends to satisfy the continuity equation at every time. Hence, we introduce an ideal version of the DAE as follows.
Set data distribution $\mu_0$. We call the solution operator, or flow $\varphi_t$, of the following dynamics the continuous DAE.
$$\frac{d}{dt} x(t) = \nabla \log \mu_t(x(t)), \quad \mu_t := \varphi_{t\sharp}\mu_0. \tag{10}$$
The limit converges to a continuous DAE when, for example, the score $\nabla \log \mu_t$ is Lipschitz continuous at every time $t$, because the trajectory of the composition corresponds to a Eulerian broken line approximation of the integral curve of (10).
The following property is immediate from Theorem 2.4.
Theorem 3.1. Let $\varphi_t$ be a continuous DAE trained for $\mu_0$. Then, the pushforward measure $\mu_t := \varphi_{t\sharp}\mu_0$ is the solution of the initial value problem
$$\partial_t \mu_t(x) = -\Delta \mu_t(x), \quad \mu_{t=0}(x) = \mu_0(x), \tag{11}$$
which we refer to as the backward heat equation.
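The relation between a single DAE, a composition of DAEs, and the continuous DAE can be illustrated in the simplest possible case. The sketch below assumes one-dimensional Gaussian data with variance `v0`, for which each Gaussian DAE layer trained with noise variance `tau` maps a current variance `v` to `v * (v / (v + tau))**2`, and for which the backward heat equation predicts the variance `v0 - 2 * t` at time `t` (valid only while `t < v0 / 2`). The function name and all parameter values are hypothetical.

```python
def compose_gaussian_daes(v0, total_time, num_layers):
    """Variance after stacking `num_layers` Gaussian DAEs on 1-D Gaussian data.

    Each layer is retrained on the current pushforward distribution and uses
    noise variance total_time / num_layers.
    """
    tau = total_time / num_layers
    v = v0
    for _ in range(num_layers):
        v = v * (v / (v + tau)) ** 2  # pushforward variance under one Gaussian DAE layer
    return v

v0, t = 1.0, 0.2
single = compose_gaussian_daes(v0, t, 1)     # one DAE with noise variance t
deep = compose_gaussian_daes(v0, t, 10_000)  # many thin layers with the same total time
continuous = v0 - 2.0 * t                    # variance predicted by the backward heat equation
```

With these values the single DAE shrinks the variance less than the continuous limit does, while the deep composition should approach `continuous`, which is the broken-line behaviour described above.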
3.3 Numerical example of trajectories
Figure 2 compares the trajectories of four DAEs trained for the same data distribution
The trajectories are analytically calculated as
are mean and covariance matrix of the normal distribution, respectively.
The continuous DAE (12) attains the singularity at . On the contrary, the DAE (13) slows down as and never attains the singularity in finite time. As tends to infinity, draws a similar orbit as the continuous DAE ; the curvature of orbits also changes according to .
4 Wasserstein gradient flow
As an analogy of the Gaussian DAE, we can expect that the pushforward measure of a general continuous DAE satisfies the continuity equation:
According to Otto calculus (Villani, 2009, Ex.15.10), the solution coincides with a trajectory of the Wasserstein gradient flow
with respect to a potential functional . Here, denotes the gradient operator on -Wasserstein space , and satisfies the following equation:
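Recall that the Wasserstein gradient flow (15) is an ordinary differential equation on the space of probability densities, whereas the continuity equation (14) is a partial differential equation on the Euclidean space. Hence, we use different notations for the time derivatives: $\frac{d}{dt}$ and $\partial_t$.
The Wasserstein gradient flow (15) possesses a distinct advantage that the potential functional $F$ does not depend on time $t$. In the following subsections, we will see both the Boltzmann entropy and the Rényi entropy as examples of $F$.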
4.1 Example: Gaussian DAE
According to Wasserstein geometry, an ordinary heat equation corresponds to a Wasserstein gradient flow that increases the entropy functional (Villani, 2009, Th. 23.19). Consequently, we can conclude that the feature map of the Gaussian DAE is a transport map that decreases the entropy of the data distribution:
This is immediate, because when , then ; thus,
which means (14) reduces to the backward heat equation.
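The entropy decrease can also be verified by a direct calculation. The following is a short sketch, assuming that $H[\mu] = -\int \mu \log \mu \, dx$ denotes the Boltzmann entropy, that $\mu_t$ solves the backward heat equation $\partial_t \mu_t = -\Delta \mu_t$, and that $\mu_t$ is smooth, positive, and decays fast enough for integration by parts; the symbols $H$ and $\mu_t$ are notational assumptions.

```latex
% Assumes: H[\mu] = -\int \mu \log \mu \, dx, \partial_t \mu_t = -\Delta \mu_t,
% and sufficient decay of \mu_t for the boundary terms to vanish.
\begin{align*}
\frac{d}{dt} H[\mu_t]
  &= -\int (\log \mu_t + 1) \, \partial_t \mu_t \, dx
   = \int (\log \mu_t + 1) \, \Delta \mu_t \, dx \\
  &= -\int \nabla \log \mu_t \cdot \nabla \mu_t \, dx
   = -\int \frac{|\nabla \mu_t|^2}{\mu_t} \, dx \;\le\; 0 .
\end{align*}
```

The last integral is the Fisher information of $\mu_t$, so under these assumptions the entropy is non-increasing along the flow.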
4.2 Example: Rényi entropy
4.3 Numerical example of abstract trajectories
Figure 3 compares the abstract trajectories of pushforward measures in the space of bivariate Gaussians
The entropy functional is given by
Note that the parameterization is reasonable, because, in this space, the Wasserstein distance between two points and is given by . The pushforward measures are analytically calculated as
where and are mean and covariance matrix of the normal distribution, respectively.
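For Gaussians with diagonal covariance matrices, the 2-Wasserstein distance reduces to a Euclidean distance between the means and the per-axis standard deviations. The sketch below assumes that setting; whether it matches the exact parameterization used in Figure 3 is an assumption, and the function name is hypothetical.

```python
import numpy as np

def w2_diagonal_gaussians(mean_a, std_a, mean_b, std_b):
    """2-Wasserstein distance between N(mean_a, diag(std_a^2)) and N(mean_b, diag(std_b^2)).

    Valid for diagonal (hence commuting) covariance matrices only.
    """
    mean_a, std_a, mean_b, std_b = map(np.asarray, (mean_a, std_a, mean_b, std_b))
    return float(np.sqrt(np.sum((mean_a - mean_b) ** 2) + np.sum((std_a - std_b) ** 2)))

# Two zero-mean bivariate Gaussians with per-axis standard deviations (1, 2) and (4, 6).
d = w2_diagonal_gaussians([0.0, 0.0], [1.0, 2.0], [0.0, 0.0], [4.0, 6.0])
```

In such coordinates a straight line in the plane of standard deviations is a Wasserstein geodesic, which is one reason a plot of abstract trajectories in these coordinates is geometrically meaningful.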
We investigated deep denoising autoencoders (DAEs) using transportation theory.
The training algorithm of the DAE is equivalent to the minimization of $\mathbb{E}\,|g(x + \varepsilon) - x|^2$ with respect to $g$. We found that the minimizer is given by a transport map (4). The initial velocity vector of the mass transportation is given by the score $\nabla \log \mu_0$. Consequently, for Gaussian DAEs, the initial velocity $\partial_t \mu_{t=0}$ of the pushforward measure coincides with the negative Laplacian $-\Delta \mu_0$. In particular, the DAE transports mass to reverse the diffusion. From a statistical viewpoint, this is a natural consequence because the DAE is an estimator of the mean.
These properties are limited to $t = 0$ for the DAE. Hence, we introduced the composition of DAEs and its limit, i.e., the continuous DAE. We can understand the composition of DAEs as a Eulerian broken line approximation of a continuous DAE. The pushforward measure of the continuous Gaussian DAE satisfies the backward heat equation (Theorem 3.1). According to Wasserstein geometry, the continuous Gaussian DAE, which is an infinitely deep DAE, transports mass to decrease the entropy of the data distribution.
In general, the estimation of the time reversal of a diffusion process is an inverse problem. In fact, our preliminary experiments indicated that the training result is sensitive to the small perturbation of training data. However, as previously mentioned, from a statistical viewpoint, this was expected, because, by definition, a DAE is an estimator of the mean. Therefore, like a good estimator that reduces uncertainty of a parameter, the DAE will decrease entropy of the parameter.
We expect that not only the DAE, but also a wide range of deep neural networks, including both supervised and unsupervised ones, can be uniformly regarded as transport maps. For example, it is not difficult to imagine that DAEs with non-Gaussian noise correspond to other Lyapunov functionals such as the Rényi entropy and the Bregman divergence. The form of transport maps emerges not only in DAEs, but also, for example, in ResNet (He et al., 2016). Transportation analysis of these deep neural networks will be part of our future work.
Appendix A Proof of Theorem 2.1
This proof follows from a variational calculation. Rewrite
Then, for every function , variation is given by the directional derivative along :
At a critical point of , for every . Hence
and we have
The attains the global minimum, because, for every function ,
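The steps of this proof can be written out as follows. This is a sketch of the calculation in assumed notation: $\mu_0$ is the data density, $\nu_t$ is the noise density, $L[g] = \mathbb{E}\,|g(x + \varepsilon) - x|^2$ is the objective, $h$ is an arbitrary perturbation, and $g^*$ is the critical point.

```latex
% Notation assumed: \mu_0 data density, \nu_t noise density, h arbitrary perturbation.
\begin{align*}
L[g] &= \iint |g(x') - x' + \varepsilon|^2 \, \nu_t(\varepsilon) \, \mu_0(x' - \varepsilon) \, d\varepsilon \, dx'
  && (x' = x + \varepsilon) \\
\delta L[h] &= \frac{d}{ds} L[g + s h] \Big|_{s=0}
  = 2 \iint \big( g(x') - x' + \varepsilon \big) \cdot h(x') \, \nu_t(\varepsilon) \, \mu_0(x' - \varepsilon) \, d\varepsilon \, dx' \\
0 &= \big( g^*(x') - x' \big) \, [\nu_t * \mu_0](x') + \int \varepsilon \, \nu_t(\varepsilon) \, \mu_0(x' - \varepsilon) \, d\varepsilon
  && (\delta L[h] = 0 \text{ for every } h) \\
g^*(x') &= x' - \frac{1}{[\nu_t * \mu_0](x')} \int \varepsilon \, \nu_t(\varepsilon) \, \mu_0(x' - \varepsilon) \, d\varepsilon \\
L[g^* + h] &= L[g^*] + \int |h(x')|^2 \, [\nu_t * \mu_0](x') \, dx' \;\ge\; L[g^*]
\end{align*}
```

The last line uses that the cross term equals $\delta L[h]$ evaluated at $g^*$, which vanishes, so the critical point is a global minimizer.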
Appendix B Proof of Lemma 2.2
To facilitate visualization, we write and instead of , and , respectively. It immediately follows then,
According to the change of variables formula,
where denotes the determinant.
Take logarithm on both sides, and then differentiate with respect to . Then, the RHS vanishes, and the LHS is calculated as follows.
where the second term follows a differentiation formula (Petersen and Pedersen, 2012, (43))
Substitute . Then, we have
which leads to
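The calculation in this proof can be sketched as follows, in assumed notation: $g_t(x) = x + t \nabla V_t(x)$ is the short time transport map, $\mu_t = g_{t\sharp}\mu_0$ is the pushforward density, $\nabla g_t$ is the Jacobian matrix of $g_t$, and $|\cdot|$ denotes the determinant (positive for small $t$).

```latex
% Notation assumed: g_t(x) = x + t \nabla V_t(x), \mu_t = g_{t\sharp} \mu_0.
\begin{align*}
\mu_t(g_t(x)) \, |\nabla g_t(x)| &= \mu_0(x)
  && \text{(change of variables)} \\
\frac{\partial_t \mu_t(g_t(x)) + \nabla \mu_t(g_t(x)) \cdot \partial_t g_t(x)}{\mu_t(g_t(x))}
  + \operatorname{tr}\!\big[ (\nabla g_t(x))^{-1} \, \nabla \partial_t g_t(x) \big] &= 0
  && \text{(log, then } \partial_t \text{)} \\
\partial_t \mu_{t=0}(x) + \nabla \mu_0(x) \cdot \nabla V_0(x) + \mu_0(x) \, \Delta V_0(x) &= 0
  && (t \to 0) \\
\partial_t \mu_{t=0}(x) &= -\nabla \cdot \big[ \mu_0(x) \, \nabla V_0(x) \big]
\end{align*}
```

The third line uses $g_0(x) = x$, $\nabla g_0 = I$, and $\partial_t g_t \to \nabla V_0$ as $t \to 0$, so the trace term becomes $\operatorname{tr}[\nabla \nabla V_0] = \Delta V_0$.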
References
- Alain and Bengio (2014) Guillaume Alain and Yoshua Bengio. What Regularized Auto-Encoders Learn from the Data Generating Distribution. JMLR, pages 3743–3773, 2014.
- Alain et al. (2016) Guillaume Alain, Yoshua Bengio, Li Yao, Jason Yosinski, Eric Thibodeau-Laufer, Saizheng Zhang, and Pascal Vincent. GSNs: Generative Stochastic Networks. Information and Inference, (2):210–249, 2016.
- Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. Technical report, 2017.
- Bengio et al. (2013) Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized denoising auto-encoders as generative models. In NIPS2013, pages 899–907, 2013.
- Bengio et al. (2014) Yoshua Bengio, Éric Thibodeau-Laufer, Guillaume Alain, and Jason Yosinski. Deep Generative Stochastic Networks Trainable by Backprop. In ICML2014, pages 226–234, 2014.
- George et al. (2006) Edward I. George, Feng Liang, and Xinyi Xu. Improved minimax predictive densities under Kullback-Leibler loss. Annals of Statistics, 34(1):78–91, 2006.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In NIPS2014, pages 2672–2680, 2014.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In , pages 770–778, 2016.
- Kingma and Welling (2014) Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In ICLR2014, pages 1–14, 2014.
- Petersen and Pedersen (2012) Kaare Brandt Petersen and Michael Syskind Pedersen. The Matrix Cookbook, Version: November 15, 2012. Technical report, Technical University of Denmark, 2012.
- Rifai et al. (2011) Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: explicit invariance during feature extraction. In ICML2011, pages 833–840, 2011.
- Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. In ICML2015, pages 2256–2265, 2015.
- Villani (2009) Cédric Villani. Optimal Transport: Old and New. Springer-Verlag Berlin Heidelberg, 2009.
- Vincent (2011) Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
- Vincent et al. (2008) Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and Composing Robust Features with Denoising Autoencoders. In ICML2008, pages 1096–1103, 2008.
- Vincent et al. (2010) Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. JMLR, pages 3371–3408, 2010.