Scaling limit of the Stein variational gradient descent part I: the mean field regime

05/10/2018 ∙ by Jianfeng Lu, et al. ∙ Duke University 0

We study an interacting particle system in R^d motivated by Stein variational gradient descent [Q. Liu and D. Wang, NIPS 2016], a deterministic algorithm for sampling from a given probability density with unknown normalization. We prove that in the large particle limit the empirical measure converges to a solution of a non-local and nonlinear PDE. We also prove global well-posedness and uniqueness of the solution to the limiting PDE. Finally, we prove that the solution to the PDE converges to the unique invariant solution in large time limit.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In this paper we study the following interacting particle system in :

(1)

We refer to each of the functions as a particle. The function is a smooth, symmetric, and positive definite kernel. The function is a smooth potential such that is integrable. More specific assumptions about and are given below.

We are interested in the macroscopic behavior of the particle system (1) as

in the framework of mean field limit. Formally this mean field limit is described by the following non-local, nonlinear partial differential equation (PDE):

(2)

We aim to make a rigorous connection between (1) and (2). Specifically, we prove global existence and uniqueness of a solution to this initial value problem, for in the appropriate regularity class, and we show that the empirical measure

converges as to the solution of (2), assuming converges to in the appropriate sense. We also want to study the long-time behavior of solutions to the mean field PDE (2). It is easy to see that the probability density

with is an invariant solution to (2). Under certain assumptions, we prove that converges weakly to as .

1.1 Motivation

Our interest in the particle system (1) is mainly motivated by the recent works by Liu and Wang [27, 26], where a time-discretized form of (1) was introduced as an algorithm called Stein Variational Gradient Descent (SVGD). The idea of the algorithm is to transport a set of particles in so that their empirical measure approximates the target probability measure , with an unknown normalization factor . At discrete times, the particles are updated via the map

(3)

where is a small time step size and is a velocity field, which is chosen appropriately so to have a “fastest decay” of the Kullback-Leibler (KL) divergence between the push-forward measure and the target . Recall that the KL-divergence (or relative entropy) between probability measures and is

if is absolutely continuous with respect to , and we set if is singular to . This idea of SVGD can be formalized as choosing the velocity field to solve the variational problem

(4)

at each time step, where

is a suitable space of vector fields. It is not clear that (

4) is well-defined, because the measure may be singular with respect to and . However, as shown in [27], (4) can be given meaning through the observation that if is absolutely continuous with respect to and , then

where is the so-called Stein operator defined by

In view of (4), this leads to the definition of Stein discrepancy

(5)

which has the property that is equal to zero if and only if provided that the space is sufficiently rich. For the empirical measure , the objective function in (5) may be well-defined and finite even though . Furthermore, [27] showed that if the space is chosen to be a reproducing kernel Hilbert space with a positive definite kernel , then the velocity field optimizing (5) can be characterized explicitly and is given by

Therefore, interpreting (4) by (5) and using the fact that , one sees that the optimal solution of (4) is given by

(6)

Putting this optimal velocity back into (3) and letting the step size gives the evolution (1).

The variational picture described above about the particle system (1) suggests that the mean field limit (2

) might also admit a variational structure. Indeed, it has been shown heuristically in

[26] that equation (2) can be viewed formally as a gradient flow for the KL-divergence functional

with respect to a generalized optimal transport metric whose definition involves the reproducing kernel Hilbert space with kernel . This in particular implies that the KL-divergence functional is a Lyapunov functional for the PDE (2), namely

Interpreting an evolutionary PDE as a gradient flow in the space of probability measures with respect to certain Wasserstein metric dates back to the seminar work on Fokker-Planck equation by Jordan, Kinderlehrer and Otto [22]. By now, similar gradient flow structures have been identified for a large family of evolution equations, including porous medium equation [29], McKean-Vlasov equation [7], etc. In the present paper, we will not pursue further the rigorous definition and analysis of the gradient flow structure of (2). Instead, we take the system (1) as our starting point and prove its connection to the mean field PDE (2).

1.2 Relevant Literature

Sampling from a density of the form without knowing the normalization constant

is a fundamental problem in Bayesian statistics and machine learning. One generic approach that has been tremendously successful in recent years is the Markov chain Monte Carlo (MCMC) methodology based on Metropolis-Hastings mechanism. The general principle of Metropolis-Hastings algorithms is to build an ergodic Markov chain whose invariant measure is the target measure

by first making candidate samples (proposals), which are then tuned to ensure stationarity via acception/rejection. In practice, one common approach to constructing proposals is by discretizing some stochastic dynamics, such as the following (overdamped) Langevin dynamics:

(7)

where is a standard Brownian motion in . A vanilla Euler-Maruyama discretization scheme associated to (7) together with Metropolis-Hastings step leads to the famous Metropolis Adjusted Langevin Algorithm (MALA) [31, 3] (whose non-Metropolized version known as unadjusted Langevin algorithm (ULA) [11, 16]).

One advantageous feature of stochastic dynamics-based sampling methods, e.g. MALA or ULA, is that the dynamics tend to explore high probability regions (around the local minima of ), while the random noise helps the dynamics to escape outside the basin of attraction and thus promotes its exploration of the entire state space. In contrast to this stochastic sampling approach, (1) may be viewed as a deterministic (albeit coupled) particle system for approximating . Qualitatively speaking, the terms in (1) which involve tend to drive particles toward local minima of (note however the nonlocal interaction due to the presence of ). On the other hand, the terms involving are repulsive, forcing the particles to disperse; this is seen in the fact that

where is the interaction energy. Here we assumed that . This interaction term in SVGD plays a role similar to that of the diffusion term in stochastic-dynamics-based sampling methods. Intuitively, one would expect that the empirical measure of the particles tends to be close to in the limit of both large sample size and long time. One of the contributions of this paper is to prove this convergence rigorously.

To compare these two sampling approaches at the PDE level, observe that the probability density for defined by (7) solves the linear Fokker-Planck equation

(8)

It is well known [28] that under some mild assumption on , the solution of (8) converges to the equilibrium distribution exponentially fast. On the other hand, if we formally set , the non-local mean-field equation (2) becomes

(9)

which is a non-linear porous medium equation with an additional transport due to . So, compared to (8), the mobility term and the transport term in (9) are small where the density is small. This suggests that the convergence of the solution of (9) towards may be slower than that of (8). In this paper, we consider only a fixed kernel , but if we scale the kernel as , it is natural to expect the large particle limit of (1) to be governed by (9) instead of (2), if is not too large — rigorous justification of such convergence result is still work in process.

One should also compare (1) with the following more standard deterministic interacting particle system:

(10)

It is well-known [14] that under suitable assumption on and , the mean field limit of (10) is the following McKean-Vlasov equation

(11)

The particle system (1) differs from (10) in that the external force added to each particle is non-local, and is defined by averaging the individual forces with weights defined by the kernel . Interestingly, such non-local external force guarantees that is a stationary solution of (2) — this explains the rationale for using the deterministic particle system (1) as an approximation algorithm for sampling . On the contrary, is not a stationary solution of (11). In fact, if or is non-convex, the equation (11) may have multiple stationary solutions; see e.g. [4, 5]. We also remark that the nonlocal external force makes the analysis of (1) more challenging than that of (10).

Although sampling via a deterministic particle system is less common, the use of deterministic particles is ubiquitous in numerical approximations of partial differential equations arising in physics and biology. For example, the point vortex method have been proved successful for solving equations in fluid mechanics [19, 30], and similarly the weighted particle method [12] and the diffusion-velocity method [13] for convection-diffusion and nonlinear-wave equations [9]. For a comprehensive discussion on deterministic particle methods we refer the reader to the recent review paper [8] and references therein. Recently, a blob method was proposed in [10] for an aggregation equation, which is the equation (2) with and with being attractive rather than repulsive. One typical aggregation equation is the so-called Keller-Segel equation [23, 20]. The same blob method was generalized by [6] to a more general class of nonlinear diffusion equations, which has a -Wasserstein gradient flow structure. A key feature of the blob method considered there is that the particle system preserves a similar gradient flow structure as the diffusion equation, which facilitates the proof of large particle limits. On the contrary, the SVGD dynamics (1) is not a gradient flow. This again makes the analysis of the mean field limit non-trivial.

1.3 Plan of The Paper

The rest of the paper is organized as follows. In Section 2, we first make several technical assumptions on and and then state our main results under these assumptions. In Section 3, we prove the existence and uniqueness of weak solutions to the mean field equation Eq. 2 as well as the ODE system Eq. 1

of SVGD by use of the mean field characteristic flow. Some useful estimates on the solution

Eq. 1 are also derived. Section 4 concerns the regularity of the solution to the mean field equation (2) under additional regularity assumption on . Section 5 devotes to the proof of the passage from the particles system Eq. 1 to its mean field PDE Eq. 2. Finally, in Section 6 we prove that the solution of (2) converges to the equilibrium as .

2 Preliminaries and Main Results

2.1 Assumptions and Notation

Throughout the paper we assume that the kernel satisfies the following:

Assumption

is at least with bounded derivatives. In addition, is symmetric and positive definite, meaning that

A canonical choice of satisfying Assumption 2.1 is a Gaussian kernel, e.g. . Higher regularity of will be needed to obtain higher regularity of the solution of the mean field PDE; see Proposition 2. For the long time convergence of the solution, we will need further assumption on ; see Theorem 3.

For the potential function , we will assume the following:

Assumption
  • and if .

  • There exists a constant and some index such that

    and that

    (12)
  • For any , there exists a constant such that if , then

Remark 1

We comment that Section 2.1 (A1)-(A3) will be used in the proofs of the existence, uniqueness and regularity of the solution of mean field equation. Note that by setting and in (A3), we have that

(13)

for some constant . These assumptions are by no means sharp, but proves to be sufficient for the validity of our theorems. Section 2.1 (A2) implies that there is such that

(14)

where . Indeed, this follows from

where , and then integrating from to . It is also easy to check that Section 2.1 is fulfilled by even polynomials up to order .

We use and denote the set of Borel probability measures on satisfying

(15)

respectively. Thanks to Eq. 14, we have for any . For , denotes the -Wasserstein distance [34]. Given a probability measure and a Borel-measurable map , we denote by the push-forward of the measure under the map . In places where is time-dependent, we often use notation to emphasize this time dependence in a succinct way; on the other hand, differentiation with respect to the variable will always be denoted by .

For , we denote by the usual Sobolev space of functions whose weak derivatives up to -th order belong to . When , we write . For our result on regularity of solutions to the PDE (2), we introduce function spaces

with norms and respectively. We set

with the canonical norms

We will use constant to denote a generic constant which depends on . Similar rules apply to , etc. We also use constants to denote generic constants that are independent of quantities of interest. The exact values of these constants may change from line to line.

2.2 Main Results

Our first result is the global well-posedness of the nonlinear mean field PDE (2). Observe that (2) is a nonlinear transport equation of the form , where is the vector field

(16)

Given a measure , is well-defined. In fact, due to Assumption 2.1 (A2), is Lipschitz continuous and bounded over :

(17)

We say that a measure-valued function (where is given the topology of weak convergence) is a weak solution to (2) with initial condition if

(18)

and

holds for all . Recall that .

Theorem 1

Let satisfy Section 2.1. For any , there is a unique which is a weak solution to (2) with initial condition . Moreover there is (depending on and ) such that

(19)

If , then , as well, with .

The theorem is proved in Section 3.1 Our next result, proved in Section 3.2, establishes that the finite particle system is well-posed, and that the associated empirical measure is a weak solution of the PDE (2):

Proposition 1

Let satisfy Section 2.1. Then for any initial condition , the system (1) has a unique global solution , and the measure is a weak solution to the PDE (2).

In particular, the bound (19) holds for the empirical measure . With additional assumptions about the behavior of and as , we are able to improve upon (19) and show that is bounded in time; see Lemma 1 below.

When the initial condition is more regular, then the weak solution inherits higher regularity, as described by the following proposition. We remark that this regularity result will not be used in our proof of the mean field limit, but is of interest on its own account from the PDE perspective.

Proposition 2

Let satisfy Section 2.1. Suppose that has a density . If is the unique weak solution to (2) with this initial condition, then also has a density. Furthermore, if for some , and the kernel is times differentiable with bounded derivatives, then has a density satisfying

(20)

where the constants depend only on and .

In the case that , a similar regularity result to Eq. 20 was proved for aggregation equation by Laurent [24]. The presence of the potential makes the problem more difficult since the velocity is unbounded at infinity. This difficulty was circumvented with the help of the mean field characteristic flow (c.f. Definition 1), which allows us to express the solution in terms of the initial condition and the flow map. The regularity of simply transfers from that of provided we can show . See the detailed proof in Section 4.

Next, we prove a stability estimate for weak solutions to (2).

Theorem 2

Let satisfy Section 2.1 with in (A2). Let be the conjugate index of , i.e. . Let . Assume that are two initial probability measures in satisfying , . Let and be the associated weak solutions to (2). Then given any , there exists a constant depending on and such that

(21)

Theorem 2 addresses the behavior of the particle system as . Suppose that the initial points are such that as . Then if is the unique weak solution to (2) with initial condition , Theorem 2 implies that uniformly over , since is a weak solution to (2). This hypothesis of the convergence of the initial empirical measure, i.e., , can be justified rigorously, e.g., when the initial particles are independent samples drawn from . For a detailed discussion on the convergence of empirical measures in , we refer the interested readers to references [17, 35, 2, 33, 25].

We prove Theorem 2 in Section 5 by following Dobrushin’s coupling argument [14, 18] for the mean field characteristic flow (defined later at (23)). The proof follows closely the proof of Theorem 1.4.1 of [18], which dealt with the case . The stability estimate there was stated in terms of -Wasserstein distance, and mainly resulted from the Lipschitz condition of . However, we are only be able to prove the stability of mean field characteristic flow in -Wasserstein distance with strictly larger than one. This is again due to the presence of the nonlinear drift term in the vector field (16).

Our last result pertains to the long time behavior of solutions of (2) with sufficiently regular initial condition. Since the probability density is an invariant solution to the PDE (2), it is natural to ask whether is the unique invariant measure, and whether as . Generally speaking, (2) may admit many invariant measures. For example, for any stationary solution to the finite particle system (1), the empirical measure corresponds to a (stationary) weak solution of the PDE (2); there may be many such stationary solutions. However, if we restrict to initial conditions which are absolutely continuous with respect to , one may expect that solutions to (2) converge to as . The following theorem confirms this intuition. For technical reasons, we need to make further assumptions on the kernel .

Theorem 3

Let satisfy Section 2.1. Assume that satisfies Section 2.1 and the following extra assumption:

(22)

Let be the solution to (2) with initial condition satisfying . Then converges weakly to as .

Theorem 3 in particular implies that is the unique equilibrium of the mean field equation Eq. 2 provided that the initial distribution has a density and satisfies . However, if the initial distribution is discrete, such as in the case of the particle system (1), there could be multiple equilibria, in which case the long time behavior of may depend on the initial distribution.

The proof of Theorem 3 is presented in Section 6. A quantitative convergence rate is far from clear to us. The main obstacle is the lack of a generalized logarithmic Sobolev inequality which could lower bound the Stein discrepancy in terms of the relative entropy. This issue is to be investigated in future works. Another important unresolved issue is whether “generic” stationary solutions of the particle system (1) are close in some sense to , when is large.

3 Well-posedness of the PDE and the particle system

In this section we prove Theorem 1 and Proposition 1. The main ingredient in the proof of Theorem 1 is the so-called mean field characteristic flow, introduced in Section 3.1. In Section 3.2, we prove Proposition 1 and an additional estimate on the particle system under strong assumptions on .

3.1 Mean field characteristic flow

Here we define the mean field characteristic flow for the PDE (2) (c.f [18]), which will play an essential role in the proof of the large particle limit of (1).

Definition 1

Given a probability measure , we say that the map

is a mean field characteristic flow associated to the particle system (1) or to the mean field PDE (2) if is in time and solves the following problem

(23)

The expression means that the measure is the push-forward of under the map . We think of as a family of maps from to , parameterized by and . We first prove in the theorem below that the mean field characteristic flow (23) is well-defined. To this end, define the set of functions

which is a complete metric space with the uniform metric . Recall the space of measures defined in (15).

Theorem 4

Assume the conditions of Theorem 1 hold, and . For any , there exists a unique solution to the problem (23). Moreover, the measure satisfies , some constant that is independent of .

Proof

We follow the proof of Theorem 1.3.2 in [18]. The proof of the theorem consists of two steps.

Step 1 (local well-posedness): Fix , and define

We prove that there exists such that the problem (23) has a unique solution in the set

which is a complete metric space, with metric

Consider the integral formulation of (23) given by

(24)

Let us define the operator by

Our goal is to show that is a contraction in , and thus has a unique fixed point.

We first show that maps into . Checking that is continuous is straightforward; we need to establish a bound on . If , then for any and ,

(25)

Then according to Assumptions (2.1) (A3), there exists a positive constant such that

(26)

As a consequence, we have

where we used the assumption that . Therefore,

if . This shows that maps from to , if is sufficiently small.

Next, we show that is indeed a contraction on . If , then for any and ,

(27)

The first term on the right side above can be bounded from above by

Thanks to (25) and (13), the second term can be bounded from above by

(28)

To bound the last term on the right side of (27), using Section 2.1 (A2) one obtains that

(29)

where in the last inequality we have used the fact that so that also satisfies the inequality (25), which enables us to apply (A3) of Section 2.1. Plugging (29) into the integral of the last term on the right side of (27), we can bound the last term by

Combining the estimates above leads to

which implies that is a contraction on when is small enough. By the contraction mapping theorem, has a unique fixed point , which solves (24). After defining , one sees that solves (23) in the small time interval .

Step 2 (Extension of local solution): Considering the bounds in the previous step, it is clear that the local solution may be extended beyond time as long as the quantity

remains finite. We now establish an a priori bound on this quantity, showing that the local solution may be extended for all .