In this paper we study the following interacting particle system in :
We refer to each of the functions as a particle. The function is a smooth, symmetric, and positive definite kernel. The function is a smooth potential such that is integrable. More specific assumptions about and are given below.
We are interested in the macroscopic behavior of the particle system (1) as
in the framework of mean field limit. Formally this mean field limit is described by the following non-local, nonlinear partial differential equation (PDE):
We aim to make a rigorous connection between (1) and (2). Specifically, we prove global existence and uniqueness of a solution to this initial value problem, for in the appropriate regularity class, and we show that the empirical measure
converges as to the solution of (2), assuming converges to in the appropriate sense. We also want to study the long-time behavior of solutions to the mean field PDE (2). It is easy to see that the probability density
with is an invariant solution to (2). Under certain assumptions, we prove that converges weakly to as .
Our interest in the particle system (1) is mainly motivated by the recent works by Liu and Wang [27, 26], where a time-discretized form of (1) was introduced as an algorithm called Stein Variational Gradient Descent (SVGD). The idea of the algorithm is to transport a set of particles in so that their empirical measure approximates the target probability measure , with an unknown normalization factor . At discrete times, the particles are updated via the map
where is a small time step size and is a velocity field, which is chosen appropriately so to have a “fastest decay” of the Kullback-Leibler (KL) divergence between the push-forward measure and the target . Recall that the KL-divergence (or relative entropy) between probability measures and is
if is absolutely continuous with respect to , and we set if is singular to . This idea of SVGD can be formalized as choosing the velocity field to solve the variational problem
at each time step, where
is a suitable space of vector fields. It is not clear that (4) is well-defined, because the measure may be singular with respect to and . However, as shown in , (4) can be given meaning through the observation that if is absolutely continuous with respect to and , then
where is the so-called Stein operator defined by
In view of (4), this leads to the definition of Stein discrepancy
which has the property that is equal to zero if and only if provided that the space is sufficiently rich. For the empirical measure , the objective function in (5) may be well-defined and finite even though . Furthermore,  showed that if the space is chosen to be a reproducing kernel Hilbert space with a positive definite kernel , then the velocity field optimizing (5) can be characterized explicitly and is given by
) might also admit a variational structure. Indeed, it has been shown heuristically in that equation (2) can be viewed formally as a gradient flow for the KL-divergence functional
with respect to a generalized optimal transport metric whose definition involves the reproducing kernel Hilbert space with kernel . This in particular implies that the KL-divergence functional is a Lyapunov functional for the PDE (2), namely
Interpreting an evolutionary PDE as a gradient flow in the space of probability measures with respect to certain Wasserstein metric dates back to the seminar work on Fokker-Planck equation by Jordan, Kinderlehrer and Otto . By now, similar gradient flow structures have been identified for a large family of evolution equations, including porous medium equation , McKean-Vlasov equation , etc. In the present paper, we will not pursue further the rigorous definition and analysis of the gradient flow structure of (2). Instead, we take the system (1) as our starting point and prove its connection to the mean field PDE (2).
1.2 Relevant Literature
Sampling from a density of the form without knowing the normalization constant
is a fundamental problem in Bayesian statistics and machine learning. One generic approach that has been tremendously successful in recent years is the Markov chain Monte Carlo (MCMC) methodology based on Metropolis-Hastings mechanism. The general principle of Metropolis-Hastings algorithms is to build an ergodic Markov chain whose invariant measure is the target measureby first making candidate samples (proposals), which are then tuned to ensure stationarity via acception/rejection. In practice, one common approach to constructing proposals is by discretizing some stochastic dynamics, such as the following (overdamped) Langevin dynamics:
where is a standard Brownian motion in . A vanilla Euler-Maruyama discretization scheme associated to (7) together with Metropolis-Hastings step leads to the famous Metropolis Adjusted Langevin Algorithm (MALA) [31, 3] (whose non-Metropolized version known as unadjusted Langevin algorithm (ULA) [11, 16]).
One advantageous feature of stochastic dynamics-based sampling methods, e.g. MALA or ULA, is that the dynamics tend to explore high probability regions (around the local minima of ), while the random noise helps the dynamics to escape outside the basin of attraction and thus promotes its exploration of the entire state space. In contrast to this stochastic sampling approach, (1) may be viewed as a deterministic (albeit coupled) particle system for approximating . Qualitatively speaking, the terms in (1) which involve tend to drive particles toward local minima of (note however the nonlocal interaction due to the presence of ). On the other hand, the terms involving are repulsive, forcing the particles to disperse; this is seen in the fact that
where is the interaction energy. Here we assumed that . This interaction term in SVGD plays a role similar to that of the diffusion term in stochastic-dynamics-based sampling methods. Intuitively, one would expect that the empirical measure of the particles tends to be close to in the limit of both large sample size and long time. One of the contributions of this paper is to prove this convergence rigorously.
To compare these two sampling approaches at the PDE level, observe that the probability density for defined by (7) solves the linear Fokker-Planck equation
It is well known  that under some mild assumption on , the solution of (8) converges to the equilibrium distribution exponentially fast. On the other hand, if we formally set , the non-local mean-field equation (2) becomes
which is a non-linear porous medium equation with an additional transport due to . So, compared to (8), the mobility term and the transport term in (9) are small where the density is small. This suggests that the convergence of the solution of (9) towards may be slower than that of (8). In this paper, we consider only a fixed kernel , but if we scale the kernel as , it is natural to expect the large particle limit of (1) to be governed by (9) instead of (2), if is not too large — rigorous justification of such convergence result is still work in process.
One should also compare (1) with the following more standard deterministic interacting particle system:
The particle system (1) differs from (10) in that the external force added to each particle is non-local, and is defined by averaging the individual forces with weights defined by the kernel . Interestingly, such non-local external force guarantees that is a stationary solution of (2) — this explains the rationale for using the deterministic particle system (1) as an approximation algorithm for sampling . On the contrary, is not a stationary solution of (11). In fact, if or is non-convex, the equation (11) may have multiple stationary solutions; see e.g. [4, 5]. We also remark that the nonlocal external force makes the analysis of (1) more challenging than that of (10).
Although sampling via a deterministic particle system is less common, the use of deterministic particles is ubiquitous in numerical approximations of partial differential equations arising in physics and biology. For example, the point vortex method have been proved successful for solving equations in fluid mechanics [19, 30], and similarly the weighted particle method  and the diffusion-velocity method  for convection-diffusion and nonlinear-wave equations . For a comprehensive discussion on deterministic particle methods we refer the reader to the recent review paper  and references therein. Recently, a blob method was proposed in  for an aggregation equation, which is the equation (2) with and with being attractive rather than repulsive. One typical aggregation equation is the so-called Keller-Segel equation [23, 20]. The same blob method was generalized by  to a more general class of nonlinear diffusion equations, which has a -Wasserstein gradient flow structure. A key feature of the blob method considered there is that the particle system preserves a similar gradient flow structure as the diffusion equation, which facilitates the proof of large particle limits. On the contrary, the SVGD dynamics (1) is not a gradient flow. This again makes the analysis of the mean field limit non-trivial.
1.3 Plan of The Paper
The rest of the paper is organized as follows. In Section 2, we first make several technical assumptions on and and then state our main results under these assumptions. In Section 3, we prove the existence and uniqueness of weak solutions to the mean field equation Eq. 2 as well as the ODE system Eq. 1
of SVGD by use of the mean field characteristic flow. Some useful estimates on the solutionEq. 1 are also derived. Section 4 concerns the regularity of the solution to the mean field equation (2) under additional regularity assumption on . Section 5 devotes to the proof of the passage from the particles system Eq. 1 to its mean field PDE Eq. 2. Finally, in Section 6 we prove that the solution of (2) converges to the equilibrium as .
2 Preliminaries and Main Results
2.1 Assumptions and Notation
Throughout the paper we assume that the kernel satisfies the following:
is at least with bounded derivatives. In addition, is symmetric and positive definite, meaning that
A canonical choice of satisfying Assumption 2.1 is a Gaussian kernel, e.g. . Higher regularity of will be needed to obtain higher regularity of the solution of the mean field PDE; see Proposition 2. For the long time convergence of the solution, we will need further assumption on ; see Theorem 3.
For the potential function , we will assume the following:
and if .
There exists a constant and some index such that
For any , there exists a constant such that if , then
We comment that Section 2.1 (A1)-(A3) will be used in the proofs of the existence, uniqueness and regularity of the solution of mean field equation. Note that by setting and in (A3), we have that
for some constant . These assumptions are by no means sharp, but proves to be sufficient for the validity of our theorems. Section 2.1 (A2) implies that there is such that
where . Indeed, this follows from
where , and then integrating from to . It is also easy to check that Section 2.1 is fulfilled by even polynomials up to order .
We use and denote the set of Borel probability measures on satisfying
respectively. Thanks to Eq. 14, we have for any . For , denotes the -Wasserstein distance . Given a probability measure and a Borel-measurable map , we denote by the push-forward of the measure under the map . In places where is time-dependent, we often use notation to emphasize this time dependence in a succinct way; on the other hand, differentiation with respect to the variable will always be denoted by .
For , we denote by the usual Sobolev space of functions whose weak derivatives up to -th order belong to . When , we write . For our result on regularity of solutions to the PDE (2), we introduce function spaces
with norms and respectively. We set
with the canonical norms
We will use constant to denote a generic constant which depends on . Similar rules apply to , etc. We also use constants to denote generic constants that are independent of quantities of interest. The exact values of these constants may change from line to line.
2.2 Main Results
Given a measure , is well-defined. In fact, due to Assumption 2.1 (A2), is Lipschitz continuous and bounded over :
We say that a measure-valued function (where is given the topology of weak convergence) is a weak solution to (2) with initial condition if
holds for all . Recall that .
The theorem is proved in Section 3.1 Our next result, proved in Section 3.2, establishes that the finite particle system is well-posed, and that the associated empirical measure is a weak solution of the PDE (2):
In particular, the bound (19) holds for the empirical measure . With additional assumptions about the behavior of and as , we are able to improve upon (19) and show that is bounded in time; see Lemma 1 below.
When the initial condition is more regular, then the weak solution inherits higher regularity, as described by the following proposition. We remark that this regularity result will not be used in our proof of the mean field limit, but is of interest on its own account from the PDE perspective.
Let satisfy Section 2.1. Suppose that has a density . If is the unique weak solution to (2) with this initial condition, then also has a density. Furthermore, if for some , and the kernel is times differentiable with bounded derivatives, then has a density satisfying
where the constants depend only on and .
In the case that , a similar regularity result to Eq. 20 was proved for aggregation equation by Laurent . The presence of the potential makes the problem more difficult since the velocity is unbounded at infinity. This difficulty was circumvented with the help of the mean field characteristic flow (c.f. Definition 1), which allows us to express the solution in terms of the initial condition and the flow map. The regularity of simply transfers from that of provided we can show . See the detailed proof in Section 4.
Next, we prove a stability estimate for weak solutions to (2).
Theorem 2 addresses the behavior of the particle system as . Suppose that the initial points are such that as . Then if is the unique weak solution to (2) with initial condition , Theorem 2 implies that uniformly over , since is a weak solution to (2). This hypothesis of the convergence of the initial empirical measure, i.e., , can be justified rigorously, e.g., when the initial particles are independent samples drawn from . For a detailed discussion on the convergence of empirical measures in , we refer the interested readers to references [17, 35, 2, 33, 25].
We prove Theorem 2 in Section 5 by following Dobrushin’s coupling argument [14, 18] for the mean field characteristic flow (defined later at (23)). The proof follows closely the proof of Theorem 1.4.1 of , which dealt with the case . The stability estimate there was stated in terms of -Wasserstein distance, and mainly resulted from the Lipschitz condition of . However, we are only be able to prove the stability of mean field characteristic flow in -Wasserstein distance with strictly larger than one. This is again due to the presence of the nonlinear drift term in the vector field (16).
Our last result pertains to the long time behavior of solutions of (2) with sufficiently regular initial condition. Since the probability density is an invariant solution to the PDE (2), it is natural to ask whether is the unique invariant measure, and whether as . Generally speaking, (2) may admit many invariant measures. For example, for any stationary solution to the finite particle system (1), the empirical measure corresponds to a (stationary) weak solution of the PDE (2); there may be many such stationary solutions. However, if we restrict to initial conditions which are absolutely continuous with respect to , one may expect that solutions to (2) converge to as . The following theorem confirms this intuition. For technical reasons, we need to make further assumptions on the kernel .
Theorem 3 in particular implies that is the unique equilibrium of the mean field equation Eq. 2 provided that the initial distribution has a density and satisfies . However, if the initial distribution is discrete, such as in the case of the particle system (1), there could be multiple equilibria, in which case the long time behavior of may depend on the initial distribution.
The proof of Theorem 3 is presented in Section 6. A quantitative convergence rate is far from clear to us. The main obstacle is the lack of a generalized logarithmic Sobolev inequality which could lower bound the Stein discrepancy in terms of the relative entropy. This issue is to be investigated in future works. Another important unresolved issue is whether “generic” stationary solutions of the particle system (1) are close in some sense to , when is large.
3 Well-posedness of the PDE and the particle system
In this section we prove Theorem 1 and Proposition 1. The main ingredient in the proof of Theorem 1 is the so-called mean field characteristic flow, introduced in Section 3.1. In Section 3.2, we prove Proposition 1 and an additional estimate on the particle system under strong assumptions on .
3.1 Mean field characteristic flow
The expression means that the measure is the push-forward of under the map . We think of as a family of maps from to , parameterized by and . We first prove in the theorem below that the mean field characteristic flow (23) is well-defined. To this end, define the set of functions
which is a complete metric space with the uniform metric . Recall the space of measures defined in (15).
We follow the proof of Theorem 1.3.2 in . The proof of the theorem consists of two steps.
Step 1 (local well-posedness): Fix , and define
We prove that there exists such that the problem (23) has a unique solution in the set
which is a complete metric space, with metric
Consider the integral formulation of (23) given by
Let us define the operator by
Our goal is to show that is a contraction in , and thus has a unique fixed point.
We first show that maps into . Checking that is continuous is straightforward; we need to establish a bound on . If , then for any and ,
Then according to Assumptions (2.1) (A3), there exists a positive constant such that
As a consequence, we have
where we used the assumption that . Therefore,
if . This shows that maps from to , if is sufficiently small.
Next, we show that is indeed a contraction on . If , then for any and ,
The first term on the right side above can be bounded from above by
where in the last inequality we have used the fact that so that also satisfies the inequality (25), which enables us to apply (A3) of Section 2.1. Plugging (29) into the integral of the last term on the right side of (27), we can bound the last term by
Combining the estimates above leads to
which implies that is a contraction on when is small enough. By the contraction mapping theorem, has a unique fixed point , which solves (24). After defining , one sees that solves (23) in the small time interval .
Step 2 (Extension of local solution): Considering the bounds in the previous step, it is clear that the local solution may be extended beyond time as long as the quantity
remains finite. We now establish an a priori bound on this quantity, showing that the local solution may be extended for all .