Optimization on the space of probability distributions is important to a number of machine learning models including variational inference, generative models [13, 2]
, and policy optimization in reinforcement learning. A number of recent studies have considered solution approaches to these problems based upon a construction of gradient flow on the space of probability distributions [24, 17, 12, 8, 20, 6]. Such constructions are useful for convergence analysis as well as development of numerical algorithms.
In this paper, we propose a methodology and numerical algorithms that achieve accelerated
gradient flows on the space of probability distributions. The proposed numerical algorithms are related to yet distinct from the accelerated stochastic gradient descent
and Hamiltonian Markov chain Monte-Carlo (MCMC) algorithms[19, 7]. The proposed methodology extends the variational formulation of  from vector valued variables to probability distributions. The original formulation of  was used to derive and analyze the convergence properties of a large class of accelerated optimization algorithms, most significant of which is the continuous-time limit of the Nesterov’s algorithm 
. In this paper, the limit is referred to as the Nesterov’s ordinary differential equation (ODE).
The extension proposed in our work is based upon a generalization of the formula for the Lagrangian in : (i) the kinetic energy term is replaced with the expected value of kinetic energy; and (ii) the potential energy term is replaced with a suitably defined functional on the space of probability distributions. The variational problem is to obtain a trajectory in the space of probability distributions that minimizes the action integral of the Lagrangian.
The variational problem is modeled as a mean-field optimal problem. The maximum principle of the optimal control theory is used to derive the Hamilton’s equations which represent the first order optimality conditions. The Hamilton’s equations provide a generalization of the Nesterov’s ODE to the space of probability distributions. A candidate Lyapunov function is proposed for the convergence analysis of the solution of the Hamilton’s equations. In this way, quantitative estimates on convergence rate are obtained for the case when the objective functional is displacement convex . Table 1 provides a summary of the relationship between the original variational formulation in  and the extension proposed in this paper.
We also consider the important special case when the objective functional is the relative entropy functional defined with respect to a target probability distribution . In this case, the accelerated gradient flow is shown to be related to the continuous limit of the Hamiltonian Monte-Carlo algorithm  (Remark 1). The Hamilton’s equations are finite-dimensional for the special case when the initial and the target probability distributions are both Gaussian. In this case, the mean evolves according to the Nesterov’s ODE. For the general case, the Lyapunov function-based convergence analysis applies when the target distribution is log-concave.
As a final contribution, the proposed methodology is used to obtain a numerical algorithm. The algorithm is an interacting particle system that empirically approximates the distribution with a finite but large number of particles. The difficult part of this construction is the approximation of the interaction term between particles. For this purpose, two types of approximations are described: (i) Gaussian approximation which is asymptotically (as ) exact in Gaussian settings; and (ii) Diffusion map approximation which is computationally more demanding but asymptotically exact for a general class of distributions.
The outline of the remainder of this paper is as follows: Sec. 2 provides a brief review of the variational formulation in . The proposed extension to the space of probability distribution appears in Sec. 3 where the main result is also described. The numerical algorithm along with the results of numerical experiments appear in Sec. 4. Comparisons with MCMC and Hamiltonian MCMC are also described. The conclusions appear in Sec. 5.
Notation: The gradient and divergence operators are denoted as and respectively. With multiple variables, denotes the gradient with respect to the variable . Therefore, the divergence of the vector field is . The space of absolutely continuous probability measures on
with finite second moments is denoted by. The Wasserstein gradient and the Gâteaux derivative of a functional is denoted as and respectively (see Appendix C
for definition). The probability distribution of a random variableis denoted as .
2 Review of the variational formulation of 
The basic problem is to minimize a smooth convex function on . The standard form of the gradient descent algorithm for this problem is an ODE:
Accelerated forms of this algorithm are obtained based on a variational formulation due to . The formulation is briefly reviewed here using an optimal control formalism. The Lagrangian is defined as
where is the time, is the state, is the velocity or control input, and the time-varying parameters satisfy the following scaling conditions: , , and where and are constants.
The variational problem is
The Hamiltonian function is
where is dual variable and denotes the dot product between vectors and .
According to the Pontryagin’s Maximum Principle, the optimal control . The resulting Hamilton’s equations are
The system (5) is an example of accelerated gradient descent algorithm. Specifically, if the parameters are defined using , one obtains the continuous-time limit of the Nesterov’s accelerated algorithm. It is referred to as the Nesterov’s ODE in this paper.
For this system, a Lyapunov function is as follows:
where . It is shown in  that upon differentiating along the solution trajectory, . This yields the following convergence rate:
3 Variational formulation for probability distributions
3.1 Motivation and background
Let be a functional on the space of probability distributions. Consider the problem of minimizing . The (Wasserstein) gradient flow with respect to is
where is the Wasserstein gradient of .
An important example is the relative entropy functional where where is referred to as the target distribution. The gradient of relative entropy is given by . The gradient flow
is the Fokker-Planck equation . The gradient flow achieves the density transport from an initial probability distribution to the target (here, also equilibrium) probability distribution ; and underlies the construction and the analysis of Markov chain Monte-Carlo (MCMC) algorithms. The simplest MCMC algorithm is the Langevin stochastic differential equation (SDE):
where is the standard Brownian motion in .
The main problem of this paper is to construct an accelerated form of the gradient flow (8). The proposed solution is based upon a variational formulation. As tabulated in Table 1, the solution represents a generalization of  from its original deterministic finite-dimensional to now probabilistic infinite-dimensional settings.
The variational problem can be expressed in two equivalent forms: (i) The probabilistic form is described next in the main body of the paper; and (ii) The partial differential equation (PDE) form appears in the Appendix. The probabilistic form is stressed here because it represents a direct generalization of the Nesterov’s ODE and because it is closer to the numerical algorithm.
3.2 Probabilistic form of the variational problem
Consider the stochastic process that takes values in and evolves according to:
where the control input also takes values in , and is the probability distribution of the initial condition . It is noted that the randomness here comes only from the random initial condition.
Suppose the objective functional is of the form . The Lagrangian is defined as
This formula is a natural generalization of the Lagrangian (2) and the parameters are defined exactly the same as in the finite-dimensional case. The stochastic optimal control problem is:
is the probability density function of the random variable.
The Hamiltonian function for this problem is given by [5, Sec. 6.2.3]:
where is the dual variable.
3.3 Main result
Consider the variational problem (11).
The optimal control where the optimal trajectory evolves according to the Hamilton’s equations:
where is any convex function and .
Suppose also that the functional is displacement convex and is its minimizer. Define the energy along the optimal trajectory
where the map is the optimal transport map from to . Suppose also that the following technical assumption holds: . Then . Consequently, the following rate of convergence is obtained along the optimal trajectory:
The Hamilton’s equations are derived using the standard mean-field optimal control theory . The Lyapunov function argument is based upon the variational inequality characterization of a displacement convex function [1, Eq. 10.1.7]. The detailed proof appears in the Appendix. We expect that the technical assumption is not necessary. This is the subject of the continuing work. ∎
3.4 Relative entropy as the functional
In the remainder of this paper, we assume that the functional is the relative entropy where is a given target probability distribution. In this case the Hamilton’s equations are given by
where and . Moreover, if is convex (or equivalently is log-concave), then is displacement convex with the unique minimizer at and the convergence estimate is given by .
The Hamilton’s equations (16) with the relative entropy functional is related to the under-damped Langevin equation . The difference is that the deterministic term in (16) is replaced with a random Brownian motion term in the under-damped Langevin equation. More detailed comparison appears in the Appendix D.
3.5 Quadratic Gaussian case
Suppose the initial distribution and the target distribution are both Gaussian, denoted as and , respectively. This is equivalent to the objective function being quadratic of the form . Therefore, this problem is referred to as the quadratic Gaussian case. The following Proposition shows that the mean of the stochastic process evolves according to the Nesterov ODE (5):
(Quadratic Gaussian case) Consider the variational problem (11) for the quadratic Gaussian case. Then
The stochastic process is a Gaussian process. The Hamilton’s equations are given by:
where and are the mean and the covariance of .
Upon taking the expectation of both sides, and denoting
which is identical to Nesterov ODE (5).
4 Numerical algorithm
The proposed numerical algorithm is based upon an interacting particle implementation of the Hamilton’s equation (16). Consider a system of particles that evolve according to:
The interaction term is an empirical approximation of the term in (16). We propose two types of empirical approximations as follows:
1. Gaussian approximation: Suppose the density is approximated as a Gaussian . In this case, . This motivates the following empirical approximation of the interaction term:
where is the empirical mean and is the empirical covariance.
Even though the approximation is asymptotically (as ) exact only under the Gaussian assumption, it may be used in a more general settings, particularly when the density11].
2. Diffusion map approximation: This is based upon the diffusion map approximation of the weighted Laplacian operator [9, 14]. For a function , the weighted Laplacian is defined as . Denote as the coordinate function on . It is a straightforward calculation to show that . This allows one to use the diffusion map approximation of the weighted Laplacian to approximate the interaction term as follows:
where the kernel is constructed empirically in terms of the Gaussian kernel . The parameter is referred to as the kernel bandwidth. The approximation is asymptotically exact as and . The approximation error is of order
where the first term is referred to as the bias error and the second term is referred to as the variance error. The variance error is the dominant term in the error for small values of , whereas the bias error is the dominant term for large values of (see Figure 3).
The resulting interacting particle algorithm is tabulated in Table 1. The symplectic method proposed in  is used to carry out the numerical integration. The algorithm is applied to two examples as described in the following sections.
(Comparison with density estimation) The diffusion map approximation algorithm is conceptually different from an explicit density estimation-based approach. A basic density estimation is to approximate where is the Gaussian kernel. Using such an approximation, the interaction term is approximated as
Despite the apparent similarity of the two formulae, (19) for diffusion map approximation and (20) for density estimation, the nature of the two approximations is different. The difference arises because the kernel in (19) is data-dependent whereas the kernel in (20) is not. While both approximations are exact in the asymptotic limit as and , they exhibit different convergence rates. Numerical experiments presented in Figure 3-(d) show that the diffusion map approximation has a much smaller variance for intermediate values of . Theoretical understanding of the difference is the subject of continuing work.
4.1 Gaussian Example
Consider the Gaussian example as described in Sec. 3.5. The simulation results for the scalar () case with initial distribution and target distribution where and is depicted in Figure 1-(a)-(b). For this simulation, the numerical parameters are as follows: , , , , ,, and . The result numerically verifies the convergence rate derived in Theorem 1 for the case where the target distribution is Gaussian.
4.2 Non-Gaussian example
This example involves a non-Gaussian target distribution which is a mixture of two one-dimensional Gaussians with and . The simulation results are depicted in Figure 2-(a)-(b). The numerical parameters are same as in the Example 4.1. The interaction term is approximated using the diffusion map approximation with . The numerical result depicted in Figure 2-(a) show that the diffusion map algorithm converges to the mixture of Gaussian target distribution. The result depicted in Figure 2-(b) suggests that the convergence rate also appears to hold for this non-log-concave target distribution. Theoretical justification of this is subject of continuing work.
4.3 Comparison with MCMC and HMCMC
This section contains numerical experiment comparing the performance of the accelerated algorithm 1 using the diffusion map (DM) approximation (19) and the density estimation (DE)-based approximation (20) with the Markov chain Monte-Carlo (MCMC) algorithm studied in  and the Hamiltonian MCMC algorithm studied in .
We consider the problem setting of the mixture of Gaussians as in example 4.2. All algorithms are simulated with a fixed step-size of for iterations. The performance is measured by computing the mean-squared error in estimating the expectation of the function . The mean-square error at the -th iteration is computed by averaging the error over runs:
The numerical results are depicted in Figure 3. Figure 3 depicts the m.s.e as a function of . It is observed that the accelerated algorithm 1 with the diffusion map approximation admits an order of magnitude better m.s.e for the same number of particles. It is also observed that the m.s.e decreases rapidly for intermediate values of before saturating for large values of , where the bias term dominates (see discussion following Eq. 19).
Figure 3 depicts the m.s.e as a function of the number of iterations for a fixed number of particles . It is observed that the accelerated algorithm 1 displays the quickest convergence amongst the algorithms tested.
Figure 3 depicts the average computational time per iteration as a function of the number of samples . The computational time of the diffusion map approximation scales as because it involves computing a matrix , while the computational cost of the MCMC and HMCMC algorithms scale as . The computational complexity may be improved by (i) exploiting the sparsity structure of the matrix ; (ii) sub-sampling the particles in computing the empirical averages; (iii) adaptively updating the matrix according to a certain error criteria.
Finally, we provide comparison between diffusion map approximation (20) and the density-based approximation (20): Figure 3 depicts the m.s.e for these two approximations as a function of the kernel-bandwidth for a fixed number of particles . For very large and for very small values of , where bias and variance dominates the error, respectively, the two algorithms have similar m.s.e. However, for intermediate values of , the diffusion map approximation has smaller variance, and thus lower m.s.e.
5 Conclusion and directions for future work
The main contribution of this paper is to extend the variational formulation of  to obtain theoretical results and numerical algorithms for accelerated gradient flow in the space of probability distributions. In continuous-time settings, bounds on convergence rate are derived based on a Lyapunov function argument. Two numerical algorithms based upon an interacting particle representation are presented and illustrated with examples. As has been the case in finite-dimensional settings, the theoretical framework is expected to be useful in this regard. Some direction for future include: (i) removing the technical assumption in the proof of the Theorem 1; (ii) analysis of the convergence under the weaker assumption that the target distribution satisfies only a spectral gap condition; and (iii) analysis of the numerical algorithms in the finite- and in the finite cases.
-  Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
-  Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
-  Michael Betancourt, Michael I Jordan, and Ashia C Wilson. On symplectic optimization. arXiv preprint arXiv:1802.03653, 2018.
-  David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
-  Rene Carmona and François Delarue. Probabilistic Theory of Mean Field Games with Applications I-II. Springer, 2017.
-  Changyou Chen, Ruiyi Zhang, Wenlin Wang, Bai Li, and Liqun Chen. A unified particle-optimization framework for scalable bayesian sampling. arXiv preprint arXiv:1805.11659, 2018.
-  Xiang Cheng, Niladri S Chatterji, Peter L Bartlett, and Michael I Jordan. Underdamped langevin mcmc: A non-asymptotic analysis. arXiv preprint arXiv:1707.03663, 2017.
-  Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. arXiv preprint arXiv:1805.09545, 2018.
-  Ronald R Coifman and Stéphane Lafon. Diffusion maps. Applied and computational harmonic analysis, 21(1):5–30, 2006.
-  Alain Durmus and Eric Moulines. High-dimensional bayesian inference via the unadjusted langevin algorithm. arXiv preprint arXiv:1605.01559, 2016.
-  Geir Evensen. The ensemble kalman filter: Theoretical formulation and practical implementation. Ocean dynamics, 53(4):343–367, 2003.
-  Charlie Frogner and Tomaso Poggio. Approximate inference with wasserstein gradient flows. arXiv preprint arXiv:1806.04542, 2018.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  Matthias Hein, Jean-Yves Audibert, and Ulrike von Luxburg. Graph laplacians and their convergence on random neighborhood graphs. Journal of Machine Learning Research, 8(Jun):1325–1368, 2007.
-  Prateek Jain, Sham M Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Accelerating stochastic gradient descent. arXiv preprint arXiv:1704.08227, 2017.
-  Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the fokker–planck equation. SIAM journal on mathematical analysis, 29(1):1–17, 1998.
Qiang Liu and Dilin Wang.
Stein variational gradient descent: A general purpose bayesian inference algorithm.In Advances In Neural Information Processing Systems, pages 2378–2386, 2016.
-  Robert J McCann. A convexity principle for interacting gases. Advances in mathematics, 128(1):153–179, 1997.
-  Radford M Neal et al. Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
-  Pierre H Richemond and Brendan Maginnis. On wasserstein reinforcement learning and the fokker-planck equation. arXiv preprint arXiv:1712.07185, 2017.
-  Weijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling nesterov’s accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pages 2510–2518, 2014.
-  Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
-  Andre Wibisono, Ashia C Wilson, and Michael I Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, page 201614734, 2016.
-  Ruiyi Zhang, Changyou Chen, Chunyuan Li, and Lawrence Carin. Policy optimization as wasserstein gradient flows. arXiv preprint arXiv:1808.03030, 2018.
Appendix A PDE formulation of the variational problem
An equivalent pde formulation is obtained by considering the stochastic optimal control problem (11) as a deterministic optimal control problem on the space of the probability distributions. Specifically, the process is a deterministic process that takes values in and evolves according to the continuity equation
where is now a time-varying vector field. The Lagrangian is defined as:
The optimal control problem is:
The Hamiltonian function is
where is the dual variable and the inner-product
Appendix B Restatement of the main result and its proof
We restate Theorem 1 below which now includes the pde formulation as well.
For the probabilistic form (11) of the variational problem, the optimal control , where the optimal trajectory evolves according to the Hamilton’s odes:
where is a convex function, and .
For the pde form (23) of the variational problem, the optimal control is , where the optimal trajectory evolves according to the Hamilton’s pdes:
The solutions of the two forms are equivalent in the following sense:
Suppose additionally that the functional is displacement convex and is its minimizer. Define
where the map is the optimal transport map from to . Suppose also that the following technical assumption holds: . Then . Consequently, the following rate of convergence is obtained along the optimal trajectory
The Hamiltonian function defined in (12) is equal to
after inserting the formula for the Lagrangian. According to the maximum principle in probabilistic form for (mean-field) optimal control problems (see [5, Sec. 6.2.3]), the optimal control law and the Hamilton’s equations are
where are independent copies of . The derivatives
It follows that
where we used the definition and the identity [5, Sec. 5.2.2 Example 3]