# Maximum Mean Discrepancy Gradient Flow

We construct a Wasserstein gradient flow of the maximum mean discrepancy (MMD) and study its convergence properties. The MMD is an integral probability metric defined for a reproducing kernel Hilbert space (RKHS), and serves as a metric on probability measures for a sufficiently rich RKHS. We obtain conditions for convergence of the gradient flow towards a global optimum, that can be related to particle transport when optimizing neural networks. We also propose a way to regularize this MMD flow, based on an injection of noise in the gradient. This algorithmic fix comes with theoretical and empirical evidence. The practical implementation of the flow is straightforward, since both the MMD and its gradient have simple closed-form expressions, which can be easily estimated with samples.


## 1 Introduction

We address the problem of defining a gradient flow on the space of probability distributions endowed with the Wasserstein metric, which transports probability mass from a starting distribution $\nu_0$ to a target distribution $\mu$. Our flow is defined on the maximum mean discrepancy (MMD) gretton2012kernel, an integral probability metric Mueller97 which uses the unit ball in a characteristic RKHS sriperumbudur2010hilbert as its witness function class. Specifically, we choose the function in the witness class that has the largest difference in expectation under $\mu$ and $\nu$: this difference constitutes the MMD. The idea of descending a gradient flow over the space of distributions can be traced back to the seminal work of jordan1998variational, who revealed that the Fokker-Planck equation is a gradient flow of the Kullback-Leibler divergence. Its time-discretization leads to the celebrated Langevin Monte Carlo algorithm, which comes with strong convergence guarantees (see durmus2018analysis; dalalyan2019user), but requires knowledge of an analytical form of the target $\mu$. A more recent gradient flow approach, Stein Variational Gradient Descent (SVGD) liu2017stein, also leverages this analytical $\mu$.

The study of particle flows defined on the MMD relates to two important topics in modern machine learning. The first is the training of Implicit Generative Models, notably generative adversarial networks gans. Integral probability metrics have been used extensively as critic functions in this setting: these include the Wasserstein distance towards-principled-gans; wgan-gp; sinkhorn-igm and maximum mean discrepancy gen-mmd; Li:2015; Li:2017a; cramer-gan; Binkowski:2018; Arbel:2018. In (Mroueh:2019, Section 3.3), a connection between IGMs and particle transport is proposed, where it is shown that gradient flow on the witness function of an integral probability metric takes a similar form to the generator update in a GAN. The critic IPM in this case is the Kernel Sobolev Discrepancy (KSD), which has an additional gradient norm constraint on the witness function compared with the MMD. It is intended as an approximation to the negative Sobolev distance from the optimal transport literature Otto:2000; Villani:2009; Peyre:2011. There remain certain differences between gradient flow and GAN training, however. First, and most obviously, gradient flow can be approximated by representing $\nu_n$ as a set of particles, whereas in a GAN $\nu_n$ is the output of a generator network. The requirement that this generator network be a smooth function of its parameters causes a departure from pure particle flow. Second, in modern implementations Li:2017a; Binkowski:2018; Arbel:2018, the kernel used in computing the critic witness function for an MMD GAN critic is parametrized by a deep network, and an alternating optimization between the critic parameters and the generator parameters is performed. Despite these differences, we anticipate that the theoretical study of MMD flow convergence will provide helpful insights into conditions for GAN convergence, and ultimately, improvements to GAN training algorithms.

Regarding the second topic, we note that the properties of gradient descent for large neural networks have been modeled using the convergence towards a global optimum of particle transport in the population limit, when the number of particles goes to infinity rotskoff2018neural; chizat2018global; mei2018mean; sirignano2018mean. In particular, rotskoff2019global show that gradient descent on the parameters of a neural network can also be seen as a particle transport problem, which has as its population limit a gradient flow of a functional defined for probability distributions over the parameters of the network. This functional is in general non-convex, which makes the convergence analysis challenging. The particular structure of the MMD allows us to relate its gradient flow to neural network optimization in a well-specified regression setting similar to rotskoff2019global; chizat2018global (we make this connection explicit in Appendix F).

Our main contribution in this work is to establish conditions for convergence of MMD gradient flow to its global optimum. We give detailed descriptions of MMD flow for both its continuous-time and discrete instantiations in Section 2. In particular, the MMD flow may employ a sample approximation for the target $\mu$: unlike e.g. Langevin Monte Carlo or SVGD, it does not require $\mu$ in analytical form. Global convergence is especially challenging to prove: while for functionals that are displacement convex, the gradient flow can be shown to converge towards a global optimum ambrosio2008gradient, the case of non-convex functionals, like the MMD, requires different tools. A modified gradient flow is proposed in rotskoff2019global that uses particle birth and death to reach global optimality. Global optimality may also be achieved simply by teleporting particles from $\nu$ to $\mu$, as occurs for the Sobolev Discrepancy flow absent a kernel regulariser (Mroueh:2019, Theorem 4, Appendix D). Note, however, that the regularised Kernel Sobolev Discrepancy flow does not rely on teleportation.

Our approach takes inspiration in particular from Bottou:2017, where it is shown that although the 1-Wasserstein distance is non-convex, it can be optimized up to some barrier that depends on the diameter of the domain of the target distribution. Similarly to Bottou:2017, we provide in Section 3 a barrier on the gradient flow of the MMD, although the tightness of this barrier in terms of the target diameter remains to be established. We obtain a further condition on the evolution of the flow to ensure global optimality, and give rates of convergence in that case; the condition is a strong one, however: it implies that the negative Sobolev distance between the target and the current particles remains bounded at all times.

We thus propose a way to regularize the MMD flow, based on a noise injection (Section 4) in the gradient, with more tractable theoretical conditions for convergence. Encouragingly, the noise injection is shown in practice to ensure convergence in a simple illustrative case where the original MMD flow fails. Finally, while our emphasis has been on establishing conditions for convergence, we note that MMD gradient flow has a simple implementation using samples from $\mu$ and $\nu$, and requires only evaluating the gradient of the kernel on the given samples.

## 2 Gradient flow of the MMD in W2

### 2.1 Construction of the gradient flow

In this section we introduce the gradient flow of the Maximum Mean Discrepancy (MMD) and highlight some of its properties. We start by briefly reviewing the MMD introduced in gretton2012kernel. We define $\mathcal{X}$ as the closure of a convex open set, and $\mathcal{P}_2(\mathcal{X})$ as the set of probability distributions on $\mathcal{X}$ with finite second moment, equipped with the 2-Wasserstein metric denoted $W_2$. For any $\nu\in\mathcal{P}_2(\mathcal{X})$, $L_2(\nu)$ is the set of square-integrable functions w.r.t. $\nu$. The reader may find the relevant mathematical background in Appendix A.

#### Maximum Mean Discrepancy.

Given a characteristic kernel $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$, we denote by $\mathcal{H}$ its corresponding RKHS (see smola1998learning). The space $\mathcal{H}$ is a Hilbert space with inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ and norm $\|\cdot\|_{\mathcal{H}}$. We will rely on specific assumptions on the kernel, which are given in Appendix B. In particular, Appendix B states that the gradient of the kernel, $\nabla k$, is Lipschitz with constant $L$. For such kernels, it is possible to define the Maximum Mean Discrepancy as a distance on $\mathcal{P}_2(\mathcal{X})$. The MMD can be written as the RKHS norm of the unnormalised witness function $f_{\mu,\nu}$ between $\mu$ and $\nu$, which is the difference between the mean embeddings of $\nu$ and $\mu$:

$$\mathrm{MMD}(\mu,\nu)=\|f_{\mu,\nu}\|_{\mathcal{H}},\qquad f_{\mu,\nu}(z)=\int k(x,z)\,\mathrm{d}\nu(x)-\int k(x,z)\,\mathrm{d}\mu(x)\quad\forall z\in\mathcal{X}. \tag{1}$$

Throughout the paper, $\mu$ will be fixed and $\nu$ can vary; hence we only consider the dependence in $\nu$ and write $F(\nu)=\frac12\mathrm{MMD}^2(\mu,\nu)$. A direct computation (Mroueh:2019, Appendix B) shows that for any finite signed measure $\chi$ such that $\nu+\epsilon\chi\in\mathcal{P}_2(\mathcal{X})$, we have

$$\lim_{\epsilon\to0}\,\epsilon^{-1}\bigl(F(\nu+\epsilon\chi)-F(\nu)\bigr)=\int f_{\mu,\nu}(x)\,\mathrm{d}\chi(x). \tag{2}$$

This means that $f_{\mu,\nu}$ is the differential of $F$ at $\nu$. Interestingly, $F(\nu)$ admits a free-energy expression:

$$F(\nu)=\int V(x)\,\mathrm{d}\nu(x)+\frac12\int W(x,y)\,\mathrm{d}\nu(x)\,\mathrm{d}\nu(y)+C, \tag{3}$$

where $V$ is a confinement potential, $W$ an interaction potential, and $C$ a constant, defined by:

$$V(x)=-\int k(x,x')\,\mathrm{d}\mu(x'),\qquad W(x,x')=k(x,x'),\qquad C=\frac12\int k(x,x')\,\mathrm{d}\mu(x)\,\mathrm{d}\mu(x'). \tag{4}$$

Formulation Equation 3 and the simple expression of the differential in Equation 2 will be key to constructing a gradient flow of $F$ to transport particles. In Equation 4, $V$ reflects the potential generated by $\mu$ and acting on each particle, while $W$ reflects the potential arising from the interactions between those particles.
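Since Equation 1 involves only kernel evaluations, the squared MMD can be estimated directly from samples of $\mu$ and $\nu$. Below is a minimal sketch (ours, not from the paper), using a Gaussian kernel; the function names are illustrative.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 sigma^2)), computed pairwise
    # between the rows of x (n, d) and y (m, d).
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd_squared(X, Y, sigma=1.0):
    # Biased (V-statistic) estimate of MMD^2(mu, nu) from samples
    # X ~ mu and Y ~ nu, via the free-energy terms of Equations 3-4.
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()
```

Note the estimate is zero when both sample sets coincide, and grows as the two clouds separate.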

#### Gradient flow of the MMD.

We consider now the problem of transporting mass from an initial distribution $\nu_0$ to a target distribution $\mu$, by finding a continuous path $(\nu_t)_{t\ge0}$ starting from $\nu_0$ that converges to $\mu$ while decreasing $F(\nu_t)$. Such a path should be physically plausible, in that teleportation phenomena are not allowed. For instance, a path that mixes $\nu_0$ and $\mu$ with time-varying weights would constantly teleport mass between $\nu_0$ and $\mu$, even though it decreases $F$ since $F$ is mixture convex (Mroueh:2019, Section 3.1, Case 1). The physicality of the path is understood in terms of classical statistical physics: given an initial configuration of particles, these can move towards a new configuration through successive small transformations, without jumping from one location to another.

Optimal transport theory provides a way to construct such a continuous path by means of the continuity equation. Given a vector field $v_t$ on $\mathcal{X}$ and an initial condition $\nu_0$, the continuity equation is a partial differential equation defining a path $(\nu_t)_{t\ge0}$ evolving under the action of the vector field, and reads $\partial_t\nu_t+\mathrm{div}(\nu_t v_t)=0$ for all $t\ge0$. The reader can find more detailed discussions in Section A.2 or Santambrogio:2015. Following ambrosio2008gradient, a natural choice is to take $v_t$ as the negative gradient of the differential of $F$ at $\nu_t$, since this corresponds to a gradient flow of $F$ associated with the $W_2$ metric (see Section A.3). By Equation 2, we know that the differential of $F$ at $\nu_t$ is given by $f_{\mu,\nu_t}$, hence $v_t=-\nabla f_{\mu,\nu_t}$ (see also Section A.3 for an expression of $f_{\mu,\nu_t}$ in terms of convolutions). The gradient flow of $F$ is then defined by the solution of

$$\partial_t\nu_t=\mathrm{div}\bigl(\nu_t\,\nabla f_{\mu,\nu_t}\bigr). \tag{5}$$

Equation 5 is non-linear, in that the vector field depends itself on $\nu_t$. This type of equation is associated in the probability theory literature with the so-called McKean-Vlasov process kac1956foundations; mckean1966class,

$$\mathrm{d}X_t=-\nabla f_{\mu,\nu_t}(X_t)\,\mathrm{d}t,\qquad X_0\sim\nu_0. \tag{6}$$

In fact, Equation 6 defines a process whose distribution satisfies Equation 5, as shown in Proposition 1. The process $(X_t)_{t\ge0}$ can be interpreted as the trajectory of a single particle, starting from an initial random position $X_0$ drawn from $\nu_0$. The trajectory is driven by the velocity field $-\nabla f_{\mu,\nu_t}$, and is affected by the other particles through the dependence of this velocity field on $\nu_t$, the current distribution of all particles. Existence and uniqueness of a solution to Equations 6 and 5 are guaranteed in the next proposition, whose proof is given in Section C.1.

###### Proposition 1.

Let $\nu_0\in\mathcal{P}_2(\mathcal{X})$. Then, under the assumptions in Appendix B, there exists a unique process $(X_t)_{t\ge0}$ satisfying the McKean-Vlasov equation Equation 6 such that $X_0\sim\nu_0$. Moreover, the distribution $\nu_t$ of $X_t$ is the unique solution of Equation 5 starting from $\nu_0$, and defines a gradient flow of $F$.

Besides existence and uniqueness of the gradient flow of $F$, one expects $F(\nu_t)$ to decrease along the path and, ideally, to converge towards $0$. The first property is rather easy to obtain, and is the object of Proposition 2, similar to the result for the KSD flow in (Mroueh:2019, Section 3.1).

###### Proposition 2.

Under the assumptions in Appendix B, $F(\nu_t)$ is decreasing in time and satisfies:

$$\frac{\mathrm{d}F(\nu_t)}{\mathrm{d}t}=-\int\|\nabla f_{\mu,\nu_t}(x)\|^2\,\mathrm{d}\nu_t(x). \tag{7}$$

This property results from Equation 5 and the energy identity in (ambrosio2008gradient, Theorem 11.3.2), and is proved in Section C.1. From Equation 7, $F$ can be seen as a Lyapunov functional for the dynamics defined by Equation 5, since it is decreasing in time. Hence, the continuous-time gradient flow introduced in Equation 5 allows us to formally consider the notion of gradient descent on $\mathcal{P}_2(\mathcal{X})$ with $F$ as a cost function. A time-discretized version of the flow naturally follows, and is provided in the next section.

### 2.2 Euler scheme

We consider here a forward-Euler scheme for Equation 5. For any measurable map $T:\mathcal{X}\to\mathcal{X}$ and any $\nu\in\mathcal{P}_2(\mathcal{X})$, we denote by $T_{\#}\nu$ the pushforward measure (see Section A.2). Starting from $\nu_0$ and using a step-size $\gamma>0$, a sequence $(\nu_n)_{n\in\mathbb{N}}$ is given by iteratively applying

$$\nu_{n+1}=\bigl(I-\gamma\nabla f_{\mu,\nu_n}\bigr)_{\#}\nu_n. \tag{8}$$

For all $n\ge0$, Equation 8 gives the distribution of the process defined by

$$X_{n+1}=X_n-\gamma\nabla f_{\mu,\nu_n}(X_n),\qquad X_0\sim\nu_0. \tag{9}$$

The asymptotic behavior of Equation 8 as $n\to\infty$ will be the object of Section 3. For now, we provide a guarantee that the sequence $(\nu_n)_{n\in\mathbb{N}}$ approaches the continuous-time flow $(\nu_t)_{t\ge0}$ as the step-size $\gamma\to0$.
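When $\nu_n$ is represented by particles and $\mu$ by samples, the update in Equation 9 amounts to moving each particle along the gradient of the sample-based witness function. A minimal sketch (ours; a Gaussian kernel and illustrative names, not the paper's implementation):

```python
import numpy as np

def witness_grad(Z, X, Y, sigma=1.0):
    # Gradient, at each row of Z, of the empirical witness
    # f(z) = mean_i k(Y_i, z) - mean_m k(X_m, z)   (cf. Equation 1),
    # for the Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2)),
    # where X ~ mu (target samples) and Y ~ nu_n (current particles).
    def term(A):
        diff = Z[:, None, :] - A[None, :, :]              # (n_z, n_a, d)
        k = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * sigma ** 2))
        # grad_z k(a, z) = -(z - a) / sigma^2 * k(a, z), averaged over a
        return -(diff * k[..., None]).mean(axis=1) / sigma ** 2
    return term(Y) - term(X)

def mmd_flow(X, Y0, step=0.5, n_steps=300, sigma=2.0):
    # Forward-Euler MMD flow (Equation 9): Y_{n+1} = Y_n - step * grad f(Y_n).
    Y = Y0.copy()
    for _ in range(n_steps):
        Y = Y - step * witness_grad(Y, X, Y, sigma)
    return Y
```

The target term attracts particles towards the samples of $\mu$, while the self-interaction term spreads the particles out, mirroring the confinement and interaction potentials of Equation 4.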

###### Proposition 3.

Let $T>0$. Consider the sequence $(\nu_n)_{n\in\mathbb{N}}$ defined in Equation 8, and the path $(\rho_t^\gamma)_{t\in[0,T]}$ obtained by linear interpolation of the $(\nu_n)$ on time intervals of length $\gamma$. Then, under the assumptions in Appendix B,

$$W_2(\rho_t^\gamma,\nu_t)\le\gamma\,C(T)\qquad\forall t\in[0,T], \tag{10}$$

where $C(T)$ is a constant that depends only on $T$.

A proof of Proposition 3 is provided in Section C.2 and relies on standard techniques to control the discretization error of a forward-Euler scheme. Proposition 3 means that the sequence $(\nu_n)$ can be linearly interpolated, giving rise to a path which gets arbitrarily close to $(\nu_t)$ on bounded intervals. Note that as $T\to\infty$ the bound is expected to blow up. However, this result is enough to show that Equation 8 is indeed a discrete-time flow of $F$. In fact, provided that $\gamma$ is small enough, $(F(\nu_n))_{n\in\mathbb{N}}$ is a decreasing sequence, as shown in Proposition 4.

###### Proposition 4.

Under the assumptions in Appendix B, and for $\gamma\le\frac{2}{3L}$, the sequence $(F(\nu_n))_{n\in\mathbb{N}}$ is decreasing, and

$$F(\nu_{n+1})-F(\nu_n)\le-\gamma\Bigl(1-\frac{3\gamma L}{2}\Bigr)\int\|\nabla f_{\mu,\nu_n}(x)\|^2\,\mathrm{d}\nu_n(x)\qquad\forall n\ge0.$$

Proposition 4, whose proof is given in Section C.2, is a discrete analog of Proposition 2. In fact, Equation 8 is intractable in general, as it requires knowledge of $\nu_n$ (and thus of $f_{\mu,\nu_n}$) exactly at each iteration $n$. Nevertheless, we present in Section 4.2 a practical algorithm using a finite number of samples, which is provably convergent towards Equation 8 as the sample size increases. We thus begin by studying the convergence properties of the time-discretized MMD flow Equation 8 in the next section.

## 3 Convergence properties of the MMD flow

We are interested in analyzing the asymptotic properties of the gradient flow of $F$. Although we know from Propositions 4 and 2 that $F$ decreases in time, it can very well converge to local minima. One way to see this is by looking at the equilibrium condition for Equation 7. As a non-negative and decreasing function, $F(\nu_t)$ is guaranteed to converge towards a finite limit $F^*$, which implies in turn that the r.h.s. of Equation 7 converges to $0$. If $\nu_t$ happens to converge towards some distribution $\nu^*$, it is possible to show that the equilibrium condition Equation 11 must hold (mei2018mean, Prop. 2),

$$\int\|\nabla f_{\mu,\nu^*}(x)\|^2\,\mathrm{d}\nu^*(x)=0. \tag{11}$$

Condition Equation 11 does not necessarily imply that $\nu^*$ is a global optimum, however ((mei2018mean, Th. 6) and rotskoff2019global). For instance, the claim that the KSD flow converges globally (Mroueh:2019, Prop. 3, Appendix B.1) requires an assumption (Mroueh:2019, Assump. A) that excludes local minima which are not global (see Section D.1; recall that the KSD is related to the MMD). Global convergence of the flow is harder to obtain, and will be the topic of this section. The main challenge is the lack of convexity of $F$ w.r.t. the Wasserstein metric. We show that $F$ is merely $\Lambda$-displacement convex, and that standard optimization techniques only provide a loose bound on its asymptotic value. We next exploit a Lojasiewicz-type inequality to prove convergence to the global optimum, provided that a particular quantity remains bounded at all times.

### 3.1 Optimization in a (W2) non-convex setting

The displacement convexity of a functional $F$ is an important criterion in characterizing the convergence of its Wasserstein gradient flow. Displacement convexity states that $t\mapsto F(\rho_t)$ is convex whenever $(\rho_t)_{t\in[0,1]}$ is a path of minimal length between two distributions $\nu$ and $\nu'$ (see Definition 2). Displacement convexity should not be confused with mixture convexity, which corresponds to the usual notion of convexity. As a matter of fact, $F$ is mixture convex, in that $F(t\nu+(1-t)\nu')\le tF(\nu)+(1-t)F(\nu')$ for all $t\in[0,1]$ and $\nu,\nu'\in\mathcal{P}_2(\mathcal{X})$ (see Lemma 25). Unfortunately, $F$ is not displacement convex. Instead, $F$ only satisfies a weaker notion of displacement convexity called $\Lambda$-displacement convexity, given in Definition 4 (Section A.4).

###### Proposition 5.

Under the assumptions in Appendix B, $F$ is $\Lambda$-displacement convex, and satisfies

$$F(\rho_t)\le(1-t)F(\nu)+tF(\nu')-\int_0^1\Lambda(\rho_s,v_s)\,G(s,t)\,\mathrm{d}s \tag{12}$$

for all $t\in[0,1]$ and any displacement geodesic $(\rho_t)_{t\in[0,1]}$ from $\nu$ to $\nu'$ with velocity vectors $(v_t)_{t\in[0,1]}$. The functional $\Lambda$ is defined for any pair $(\rho,v)$ with $\rho\in\mathcal{P}_2(\mathcal{X})$ and $v\in L_2(\rho)$ by

$$\Lambda(\rho,v)=\Bigl\|\int v(x)\cdot\nabla_x k(x,\cdot)\,\mathrm{d}\rho(x)\Bigr\|_{\mathcal{H}}^2-\sqrt{2\lambda d}\;F(\rho)^{\frac12}\int\|v(x)\|^2\,\mathrm{d}\rho(x), \tag{13}$$

where $G(s,t)$ is a non-negative weight function and $\lambda$ is the constant defined in Appendix B.

Proposition 5 is obtained by computing the second time derivative of $F(\rho_t)$, which is then lower-bounded by $\Lambda(\rho_t,v_t)$ (see Section D.2). In Equation 13, $\Lambda$ is a difference of two non-negative terms: thus $\Lambda$ can become negative, and displacement convexity does not hold in general. However, it is still possible to provide an upper bound on the asymptotic value of $F(\nu_n)$ when the $(\nu_n)$ are obtained using Equation 8. This bound is given in Theorem 6, and depends on a scalar $\bar K$ defined via a constant-speed displacement geodesic $(\rho_t)_{t\in[0,1]}$ from $\nu_0$ to the optimal value $\mu$, with velocity vectors $(v_t)_{t\in[0,1]}$ of constant norm.

###### Theorem 6.

Let $\bar K$ be the average of $\Lambda(\rho_t,v_t)$ along such a geodesic. Under the assumptions in Appendix B, and for a suitably small step-size $\gamma$,

$$F(\nu_n)\le\frac{W_2^2(\nu_0,\mu)}{2\gamma n}-\bar K. \tag{14}$$

Theorem 6 is obtained using techniques from optimal transport and optimization. It relies on Proposition 5 and Proposition 4 to prove an extended variational inequality (see Proposition 16), and concludes using a suitable Lyapunov function. A full proof is given in Section D.3. When $\bar K$ is non-negative, one recovers the usual $O(1/(\gamma n))$ convergence rate of the gradient descent algorithm. However, $\bar K$ can be negative in general, and $-\bar K$ would therefore act as a barrier on the optimal value that $F(\nu_n)$ can achieve as $n\to\infty$. In that sense, the above result is similar to (Bottou:2017, Theorem 6.9). Theorem 6 only provides a loose bound, however. In Section 3.2 we show global convergence, under the boundedness at all times of a specific distance between $\nu_t$ and $\mu$.

### 3.2 A condition for global convergence

The lack of convexity of $F$, as shown in Section 3.1, suggests that a finer analysis of the convergence should be performed. One strategy is to provide estimates for the dynamics in Proposition 2 using differential inequalities, which can be solved using Gronwall's lemma (see oguntuase2001inequality). Such inequalities are known in the optimization literature as Lojasiewicz inequalities (see Bolte:2016), and upper-bound $F(\nu_t)$ by the absolute value of its time derivative. The latter is the squared weighted Sobolev semi-norm of $f_{\mu,\nu_t}$ (see Section D.4), also written $\|f_{\mu,\nu_t}\|^2_{\dot H(\nu_t)}$. Thus one needs to find a relationship between $F(\nu_t)$ and $\|f_{\mu,\nu_t}\|_{\dot H(\nu_t)}$. For this purpose, we consider the weighted negative Sobolev distance on $\mathcal{P}_2(\mathcal{X})$, defined by duality (see also Peyre:2011).

###### Definition 1.

Let $\nu\in\mathcal{P}_2(\mathcal{X})$, with $\|\cdot\|_{\dot H(\nu)}$ its corresponding weighted Sobolev semi-norm. The weighted negative Sobolev distance between any $p$ and $q$ in $\mathcal{P}_2(\mathcal{X})$ is defined as

$$\|p-q\|_{\dot H^{-1}(\nu)}=\sup_{f\in L_2(\nu),\;\|f\|_{\dot H(\nu)}\le1}\Bigl|\int f(x)\,\mathrm{d}p(x)-\int f(x)\,\mathrm{d}q(x)\Bigr|, \tag{15}$$

with possibly infinite values.

Equation 15 plays a fundamental role in dynamic optimal transport. It can be seen as the minimum kinetic energy needed to advect the mass of $\nu$ to $\mu$ (see Mroueh:2019). It is shown in Section D.4 that

$$\|f_{\mu,\nu_t}\|_{\mathcal{H}}^2\le\|f_{\mu,\nu_t}\|_{\dot H(\nu_t)}\,\|\mu-\nu_t\|_{\dot H^{-1}(\nu_t)}. \tag{16}$$

Provided that $\|\mu-\nu_t\|_{\dot H^{-1}(\nu_t)}$ remains bounded by some positive constant at all times, Equation 16 leads to a functional version of the Lojasiewicz inequality for $F$. It is then possible to use the general strategy explained earlier to prove convergence of the flow to a global optimum:

###### Proposition 7.

Under the assumptions in Appendix B:

(i) If $\|\mu-\nu_t\|_{\dot H^{-1}(\nu_t)}\le C$ for all $t\ge0$, then $F(\nu_t)\le\bigl(F(\nu_0)^{-1}+4C^{-2}t\bigr)^{-1}$, i.e. $F(\nu_t)=O(C^2/t)$.

(ii) If $\|\mu-\nu_n\|_{\dot H^{-1}(\nu_n)}\le C$ for all $n\ge0$, then $F(\nu_n)=O\bigl(C^2/(\gamma n)\bigr)$ for $\gamma$ small enough.

Proofs of Proposition 7 (i) and (ii) are direct consequences of Propositions 4 and 2 and the bounded-energy assumption: see Section D.4. The fact that Equation 15 appears in the context of Wasserstein flows of $F$ is not a coincidence. Indeed, the negative Sobolev distance is a linearization of the Wasserstein distance (see Peyre:2011; Otto:2000 and Section D.5). Gradient flows of $F$ defined under different metrics would involve other kinds of distances instead of Equation 15. For instance, rotskoff2019global consider gradient flows under a hybrid metric (a mixture between the Wasserstein distance and the KL divergence), where convergence rates can be obtained provided that the chi-square divergence remains bounded. As shown in Section D.5, the bounded quantity in each case linearizes the corresponding metric when $\mu$ and $\nu_t$ are close. Hence, we conjecture that gradient flows of $F$ under a given metric can be shown to converge when the linearization of that metric remains bounded. In practice, it is hard to guarantee that $\|\mu-\nu_t\|_{\dot H^{-1}(\nu_t)}$ remains bounded at all times. One possible approach could be to regularize $F$ using an estimate of Equation 15. Indeed, Mroueh:2019 considers the gradient flow of a regularized version of the negative Sobolev distance, which can be written in closed form, and shows that this flow decreases the MMD. Combining both losses could improve the overall convergence properties of the MMD flow, albeit at additional computational cost. In the next section, we propose a different approach to improve convergence, together with a particle-based algorithm to approximate the MMD flow in practice.

## 4 A practical algorithm to descend the MMD flow

### 4.1 A noisy update as a regularization

We showed in Section 3.1 that $F$ is a non-convex functional, and derived a condition in Section 3.2 to reach the global optimum. We now address the case where such a condition does not necessarily hold, and provide a regularization of the gradient flow to help achieve global optimality in this scenario. Our starting point is the equilibrium condition in Equation 11. If an equilibrium $\nu^*$ satisfying Equation 11 happens to have a positive density, then $\nabla f_{\mu,\nu^*}$ vanishes everywhere, so $f_{\mu,\nu^*}$ is constant. This in turn would mean that $f_{\mu,\nu^*}=0$ when the RKHS does not contain constant functions, as holds for a Gaussian kernel (Steinwart:2008a, Corollary 4.44). Hence $\nu^*$ would be a global optimum, since $F(\nu^*)=\frac12\|f_{\mu,\nu^*}\|^2_{\mathcal{H}}=0$. The limit distribution might be singular, however, and can even be a Dirac distribution (mei2018mean, Theorem 6). Although the gradient $\nabla f_{\mu,\nu^*}$ is not identically $0$ in that case, Equation 11 only evaluates it on the support of $\nu^*$, on which it vanishes. Hence a possible fix is to make sure that the gradient of the unnormalised witness function is also evaluated at points outside the support of $\nu_n$. Here, we propose to regularize the flow by injecting noise into the gradient during the updates of Equation 9,

$$X_{n+1}=X_n-\gamma\nabla f_{\mu,\nu_n}(X_n+\beta_n U_n),\qquad n\ge0, \tag{17}$$

where $U_n$ is a standard Gaussian variable and $\beta_n$ is the noise level at iteration $n$. Compared to Equation 9, the position of the sample $X_n$ is first blurred before evaluating the gradient. Intuitively, if $\nu_n$ approaches a local optimum, $\nabla f_{\mu,\nu_n}$ would be small on the support of $\nu_n$, but it might be much larger outside of it; evaluating $\nabla f_{\mu,\nu_n}$ outside the support of $\nu_n$ can therefore help in escaping the local minimum. The stochastic process Equation 17 is different from adding a diffusion term to Equation 5. The latter would correspond to regularizing $F$ using an entropic term, as in mei2018mean; csimcsekli2018sliced (see also Section A.5 on the Langevin diffusion). Equation 17 also differs from craig2016blob; carrillo2019blob, where $F$ (and thus its associated velocity field) is regularized by convolving the interaction potential in Equation 4 with a mollifier. The optimal solution of a regularized version of the functional will in general differ from the non-regularized one, however, which is not desirable in our setting. Equation 17 is more closely related to the continuation methods Gulcehre:2016a; Gulcehre:2016; Chaudhari:2017 and graduated optimization Hazan:2015 used for non-convex optimization in Euclidean spaces, which inject noise into the gradient of a loss function at each iteration. The key difference here is the dependence of $f_{\mu,\nu_n}$ on the current distribution $\nu_n$, which is inherent to functional optimization. We show in Proposition 8 that Equation 17 attains the global minimum of $F$ provided that the level of the noise is well controlled, with the proof given in Section E.1.

###### Proposition 8.

Let $(\nu_n)_{n\in\mathbb{N}}$ be defined by Equation 17 with an initial $\nu_0\in\mathcal{P}_2(\mathcal{X})$. Denote by $g$ the density of the standard Gaussian distribution, and write $D_{\beta_n}(\nu_n):=\int\|\nabla f_{\mu,\nu_n}(x+\beta_n u)\|^2\,g(u)\,\mathrm{d}u\,\mathrm{d}\nu_n(x)$ for the expected squared gradient under the blurred distribution. Under the assumptions in Appendix B, and for a choice of $\beta_n$ such that

$$8\lambda^2\beta_n^2\,F(\nu_n)\le D_{\beta_n}(\nu_n), \tag{18}$$

the following inequality holds:

$$F(\nu_{n+1})-F(\nu_n)\le-\frac{\gamma}{2}(1-3\gamma L)\,D_{\beta_n}(\nu_n), \tag{19}$$

where $\lambda$ and $L$ are defined in Appendix B and depend only on the choice of the kernel. Moreover, if $\gamma\le\frac{1}{3L}$, then

$$F(\nu_n)\le F(\nu_0)\,e^{-4\lambda^2\gamma(1-3\gamma L)\sum_{i=0}^{n}\beta_i^2}. \tag{20}$$

A particular case where $\sum_{i\ge0}\beta_i^2=\infty$ holds is when $\beta_n$ decays as $1/\sqrt{n}$ while still satisfying Equation 18. In this case, convergence occurs at a polynomial rate. At each iteration, the level of the noise needs to be adjusted so that the gradient is not too blurred; this ensures that each step decreases the loss functional. However, $\beta_n$ does not need to decrease at each iteration: it could be increased adaptively whenever needed. For instance, when the sequence gets closer to a local optimum, it is helpful to increase the level of the noise to probe the gradient in regions where its value is not flat. Note that for $\beta_n=0$ in Equation 19, we recover a bound similar to Proposition 4.
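For the reader's convenience, here is how the geometric rate in Equation 20 follows from Equations 18 and 19 (a short verification using only the stated bounds, assuming $\gamma\le\frac{1}{3L}$ so the prefactor is non-negative). Substituting the lower bound of Equation 18 into Equation 19 gives

$$
F(\nu_{n+1})-F(\nu_n)\;\le\;-\frac{\gamma}{2}(1-3\gamma L)\,D_{\beta_n}(\nu_n)\;\le\;-4\lambda^2\gamma(1-3\gamma L)\,\beta_n^2\,F(\nu_n),
$$

so that $F(\nu_{n+1})\le\bigl(1-4\lambda^2\gamma(1-3\gamma L)\beta_n^2\bigr)F(\nu_n)\le e^{-4\lambda^2\gamma(1-3\gamma L)\beta_n^2}F(\nu_n)$, using $1-x\le e^{-x}$; iterating over the first $n$ steps yields Equation 20.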

### 4.2 The sample-based approximate scheme

We now provide a practical algorithm implementing the noisy updates of the previous section, which employs a discretization in space. The update Equation 17 involves computing expectations of the gradient of the kernel w.r.t. the target distribution $\mu$ and the current distribution $\nu_n$ at each iteration $n$. This suggests a simple approximate scheme, based on samples from these two distributions, where at each iteration $n$ we evolve a system of $N$ interacting particles whose empirical distribution approximates $\nu_n$. More precisely, given i.i.d. samples $(X_0^i)_{1\le i\le N}$ and $(Y^m)_{1\le m\le M}$ from $\nu_0$ and $\mu$, respectively, and a step-size $\gamma$, the approximate scheme iteratively updates the $i$-th particle as

$$X_{n+1}^i=X_n^i-\gamma\nabla f_{\hat\mu,\hat\nu_n}\bigl(X_n^i+\beta_n U_n^i\bigr), \tag{21}$$

where $(U_n^i)$ are i.i.d. standard Gaussians, and $\hat\mu$ and $\hat\nu_n$ denote the empirical distributions of $(Y^m)_{1\le m\le M}$ and $(X_n^i)_{1\le i\le N}$, respectively. Implementing Equation 21 is straightforward, as it only requires evaluating the gradient of $k$ on the current particles and the target samples. Pseudocode is provided in Algorithm 1. The overall computational cost of the algorithm at each iteration is $O(N(N+M))$, with $O(N+M)$ memory. The computational cost becomes linear when the kernel is approximated using random features, as is the case for regression with neural networks (Appendix F). This is in contrast to the cubic cost of the flow of the KSD Mroueh:2019, which requires solving a linear system at each iteration. The cost can also be compared to the algorithm in csimcsekli2018sliced, which involves computing empirical CDFs and quantile functions of random projections of the particles.

The approximation scheme in Equation 21 is a particle version of Equation 17, so one would expect it to converge towards its population version as $N$ and $M$ go to infinity. This is shown below.

###### Theorem 9.

Let $T>0$ and $n\in\mathbb{N}$ such that $n\gamma\le T$. Let $(\nu_n)$ and $(\hat\nu_n)$ be defined by Equation 8 and Equation 21, respectively. Suppose the assumptions in Appendix B hold and that $\beta_n\le B$ for all $n$, for some $B>0$. Then:

$$\mathbb{E}\bigl[W_2(\hat\nu_n,\nu_n)\bigr]\le\frac14\Bigl(\frac{1}{\sqrt N}\bigl(B+\mathrm{var}(\nu_0)^{\frac12}\bigr)e^{2LT}+\frac{1}{\sqrt M}\,\mathrm{var}(\mu)^{\frac12}\Bigr)\bigl(e^{4LT}-1\bigr).$$

Theorem 9 controls the propagation of chaos at each iteration, and uses techniques from jourdain2007nonlinear. Notice also that these rates remain true when no noise is added to the updates, i.e. for the original flow with $\beta_n=0$. A proof is provided in Section E.2. The dependence in $M$ underlines the fact that our procedure could be of interest as a sampling algorithm when one only has access to samples of $\mu$ (see Section A.5 for a more detailed discussion).

#### Experiments.

Figure 1 illustrates the behavior of the proposed algorithm Equation 21 in a simple setting, and compares it with the gradient flow of the MMD without noise injection. Here, a student network is trained to produce the outputs of a teacher network using gradient descent. More details on the experiment are provided in Section G.1. As discussed in Appendix F, this setting can be seen as a particle version of an MMD flow with a suitable kernel. Here, the MMD flow fails to converge towards the global optimum. Such behavior is consistent with the observations in Chizat:2018a when the parameters are initialized from Gaussian noise with relatively high variance (which is the case here). On the other hand, adding noise to the gradient appears to lead to global convergence: the training error keeps decreasing, and leads to a much better validation error. Another illustrative experiment, on a simple flow between Gaussians, is given in Section G.2.

## 5 Conclusion

We have introduced MMD flow, a novel flow over the space of distributions, with a practical space-time discretized implementation and a regularisation scheme to improve convergence. We provide theoretical results, highlighting intrinsic properties of the regular MMD flow, and guarantees on convergence based on recent results in optimal transport, probabilistic interpretations of PDEs, and particle algorithms. Future work will focus on a deeper understanding of regularization for MMD flow, and its application in sampling and optimization for large neural networks.

This appendix is organized as follows. In Appendix A, the mathematical background needed for this paper is given. In Appendix B, we state the main assumptions used in this work. Appendix C is dedicated to the construction of the gradient flow of the MMD. Appendix D provides proofs for the convergence results in Section 3. Appendix E is dedicated to the modified gradient flow based on noise injection. In Appendix F, we discuss the connection with optimization of neural networks. Appendix G provides details about the experiments. Finally, some auxiliary results are provided in Appendix H.

## Appendix A Mathematical background

We define $\mathcal{X}$ as the closure of a convex open set, and $\mathcal{P}_2(\mathcal{X})$ as the set of probability distributions on $\mathcal{X}$ with finite second moment, equipped with the 2-Wasserstein metric denoted $W_2$. For any $\nu\in\mathcal{P}_2(\mathcal{X})$, $L_2(\nu)$ is the set of square-integrable functions w.r.t. $\nu$.

### a.1 Maximum Mean Discrepancy and Reproducing Kernel Hilbert Spaces

We recall here fundamental definitions and properties of reproducing kernel Hilbert spaces (RKHS) (see smola1998learning) and of the Maximum Mean Discrepancy (MMD). Given a positive semi-definite kernel $k(x,x')$ defined for all $x,x'\in\mathcal X$, we denote by $\mathcal H$ its corresponding RKHS (see smola1998learning). The space $\mathcal H$ is a Hilbert space with inner product $\langle\cdot,\cdot\rangle_{\mathcal H}$ and corresponding norm $\lVert\cdot\rVert_{\mathcal H}$. A key property of $\mathcal H$ is the reproducing property: $f(x)=\langle f,k(x,\cdot)\rangle_{\mathcal H}$ for all $f\in\mathcal H$ and $x\in\mathcal X$. Moreover, if $k$ is $m$-times differentiable w.r.t. each of its coordinates, then any $f\in\mathcal H$ is $m$-times differentiable and $\partial^\alpha f(x)=\langle f,\partial^\alpha k(x,\cdot)\rangle_{\mathcal H}$, where $\alpha$ is any multi-index with $|\alpha|\le m$ (Steinwart:2008a, Lemma 4.34). When $k$ has at most quadratic growth, then for all $\mu\in\mathcal P_2(\mathcal X)$, $\int k(x,x)\,d\mu(x)<\infty$. In that case, for any $\mu\in\mathcal P_2(\mathcal X)$, $\phi_\mu:=\int k(x,\cdot)\,d\mu(x)$ is a well-defined element of $\mathcal H$, called the mean embedding of $\mu$. The kernel $k$ is said to be characteristic when the mean embedding is injective, that is, when each mean embedding is associated to a unique probability distribution. When $k$ is characteristic, it is possible to define a distance between distributions in $\mathcal P_2(\mathcal X)$, called the Maximum Mean Discrepancy:

$$\mathrm{MMD}(\mu,\nu)=\lVert\phi_\mu-\phi_\nu\rVert_{\mathcal H}\qquad\forall\,\mu,\nu\in\mathcal P_2(\mathcal X).\tag{22}$$

The difference between the mean embeddings of $\mu$ and $\nu$ is an element of $\mathcal H$ called the unnormalised witness function between $\mu$ and $\nu$: $f_{\mu,\nu}=\phi_\mu-\phi_\nu$. The MMD can also be seen as an Integral Probability Metric:

$$\mathrm{MMD}(\mu,\nu)=\sup_{g\in B}\int g\,d\mu-\int g\,d\nu\tag{23}$$

where $B$ is the unit ball of the RKHS $\mathcal H$.
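The closed-form expression in Equation 22 makes the MMD straightforward to estimate from samples. Below is a minimal NumPy sketch of the (biased, V-statistic) estimator with a Gaussian kernel; the kernel choice and bandwidth are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); returns the Gram matrix.
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd_squared(X, Y, sigma=1.0):
    # Biased (V-statistic) estimate of MMD^2 between samples X ~ mu, Y ~ nu:
    # mean k(X, X) - 2 mean k(X, Y) + mean k(Y, Y).
    return (gaussian_kernel(X, X, sigma).mean()
            - 2 * gaussian_kernel(X, Y, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean())

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))
Y = rng.standard_normal((500, 2)) + 2.0
print(mmd_squared(X, X))   # ~ 0 for identical samples
print(mmd_squared(X, Y))   # strictly positive for well-separated distributions
```

An unbiased U-statistic variant simply drops the diagonal terms of the within-sample Gram matrices.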

### a.2 2-Wasserstein geometry

For two given probability distributions $\nu$ and $\mu$ in $\mathcal P_2(\mathcal X)$, we denote by $\Pi(\nu,\mu)$ the set of possible couplings between $\nu$ and $\mu$. In other words, $\Pi(\nu,\mu)$ contains all possible distributions $\pi$ on $\mathcal X\times\mathcal X$ such that if $(X,Y)\sim\pi$ then $X\sim\nu$ and $Y\sim\mu$. The 2-Wasserstein distance on $\mathcal P_2(\mathcal X)$ is defined by means of an optimal coupling between $\nu$ and $\mu$ in the following way:

$$W_2^2(\nu,\mu):=\inf_{\pi\in\Pi(\nu,\mu)}\int\lVert x-y\rVert^2\,d\pi(x,y)\qquad\forall\,\nu,\mu\in\mathcal P_2(\mathcal X).\tag{24}$$

It is a well-established fact that such an optimal coupling $\pi^*$ exists (Villani:2009; Santambrogio:2015). Moreover, it can be used to define a path $(\rho_t)_{t\in[0,1]}$ between $\nu$ and $\mu$ in $\mathcal P_2(\mathcal X)$. For a given time $t$ in $[0,1]$ and a given sample $(x,y)$ from $\pi^*$, it is possible to construct a sample from $\rho_t$ by taking the convex combination $s_t(x,y)$ of $x$ and $y$, where $s_t$ is given by:

$$s_t(x,y)=(1-t)x+ty\qquad\forall\,x,y\in\mathcal X,\ \forall\,t\in[0,1].\tag{25}$$

The function $s_t$ is well defined since $\mathcal X$ is a convex set. More formally, $\rho_t$ can be written as the projection, or push-forward, of the optimal coupling $\pi^*$ by $s_t$:

$$\rho_t=(s_t)_{\#}\pi^*.\tag{26}$$

We recall that for any measurable map $T$ and any $\rho\in\mathcal P_2(\mathcal X)$, the push-forward measure $T_{\#}\rho$ is characterized by:

$$\int_{y\in\mathcal X}\phi(y)\,dT_{\#}\rho(y)=\int_{x\in\mathcal X}\phi(T(x))\,d\rho(x)\quad\text{for every measurable and bounded function }\phi.\tag{27}$$

It is easy to see that Equation 26 satisfies the following boundary conditions at $t=0$ and $t=1$:

$$\rho_0=\nu,\qquad\rho_1=\mu.\tag{28}$$
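In one dimension, the optimal coupling $\pi^*$ is the quantile (sorted) coupling, so samples from the path of Equation 26 can be produced directly by applying $s_t$ to sorted samples from the two endpoints. A minimal NumPy sketch, with illustrative Gaussian endpoints:

```python
import numpy as np

def displacement_interp(x, y, t):
    # Samples from the displacement geodesic rho_t between two 1-D empirical
    # measures: in 1-D the optimal coupling pairs sorted samples (the quantile
    # coupling), so s_t reduces to (1 - t) x_(i) + t y_(i) on order statistics.
    xs, ys = np.sort(x), np.sort(y)
    return (1 - t) * xs + t * ys

rng = np.random.default_rng(0)
x = rng.normal(-2.0, 1.0, size=1000)   # samples from nu
y = rng.normal(3.0, 0.5, size=1000)    # samples from mu
mid = displacement_interp(x, y, 0.5)
print(mid.mean())   # ~ 0.5: the mean moves linearly along the geodesic
```

At $t=0$ and $t=1$ this recovers the two endpoint samples, matching the boundary conditions of Equation 28.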

Paths of the form of Equation 26 are called displacement geodesics. They can be seen as the shortest paths from $\nu$ to $\mu$ in terms of mass transport (Santambrogio:2015, Theorem 5.27). It can be shown that there exists a velocity vector field $(t,x)\mapsto V_t(x)$ with values in $\mathcal X$ such that $\rho_t$ satisfies the continuity equation:

$$\partial_t\rho_t+\operatorname{div}(\rho_t V_t)=0\qquad\forall\,t\in[0,1].\tag{29}$$

This equation expresses two facts: first, $\partial_t\rho_t$ reflects the infinitesimal changes in $\rho_t$ as dictated by the vector field (also referred to as velocity field) $V_t$; second, the total mass of $\rho_t$ does not vary in time, as a consequence of the divergence theorem. Equation 29 is well defined in the distributional sense even when $\rho_t$ does not have a density. At each time $t$, $V_t$ can be interpreted as a tangent vector to the curve $(\rho_t)_{t\in[0,1]}$, so that the length of the curve is given by:

$$l((\rho_t)_{t\in[0,1]})^2=\int_0^1\lVert V_t\rVert^2_{L^2(\rho_t)}\,dt\quad\text{where}\quad\lVert V_t\rVert^2_{L^2(\rho_t)}=\int\lVert V_t(x)\rVert^2\,d\rho_t(x).\tag{30}$$

This perspective provides a dynamical interpretation of $W_2$ as the length of the shortest path from $\nu$ to $\mu$, and is summarized by the celebrated Benamou-Brenier formula (benamou2000computational):

$$W_2(\nu,\mu)=\inf_{(\rho_t,V_t)_{t\in[0,1]}}l((\rho_t)_{t\in[0,1]})\tag{31}$$

where the infimum is taken over all couples $(\rho_t)_{t\in[0,1]}$ and $(V_t)_{t\in[0,1]}$ satisfying Equation 29 with boundary conditions given by Equation 28. If $(\rho_t,V_t)_{t\in[0,1]}$ satisfies Equation 29 and Equation 28 and realizes the infimum in Equation 31, it is simply called a geodesic between $\nu$ and $\mu$; it is called a constant-speed geodesic if, in addition, the norm $\lVert V_t\rVert_{L^2(\rho_t)}$ is constant for all $t$. As a consequence, Equation 26 is a constant-speed displacement geodesic.

###### Remark 1.

Such paths should not be confused with another kind of path, the mixture geodesic. The mixture geodesic $(m_t)_{t\in[0,1]}$ from $\nu$ to $\mu$ is obtained by first choosing either $\nu$ or $\mu$ according to a Bernoulli distribution of parameter $t$, and then sampling from the chosen distribution:

$$m_t=(1-t)\nu+t\mu\qquad\forall\,t\in[0,1].\tag{32}$$

Paths of the form of Equation 32 can be thought of as the shortest paths between two distributions when distances on $\mathcal P_2(\mathcal X)$ are measured using the MMD (see Bottou:2017, Theorem 5.3). We refer to Bottou:2017 for an overview of the notion of shortest paths in probability spaces and for the differences between mixture geodesics and displacement geodesics. Although we are interested in the MMD as a loss function, we will not consider the geodesics naturally associated to it, and will instead consider the displacement geodesics defined in Equation 26, for reasons that will become clear in Section A.4.

### a.3 Gradient flows on the space of probability measures

Consider a real-valued functional $F$ defined over $\mathcal P_2(\mathcal X)$. We call $\frac{\partial F}{\partial\nu}$, if it exists, the unique (up to additive constants) function such that $\frac{d}{d\epsilon}F(\nu+\epsilon\chi)\big|_{\epsilon=0}=\int\frac{\partial F}{\partial\nu}(x)\,d\chi(x)$ for any admissible perturbation $\chi$. The function $\frac{\partial F}{\partial\nu}$ is called the first variation of $F$ evaluated at $\nu$. We consider here functionals of the form:

$$F(\nu)=\int U(\nu(x))\,dx+\int V(x)\,\nu(x)\,dx+\frac12\iint W(x,y)\,\nu(x)\,\nu(y)\,dx\,dy\tag{33}$$

where $U$ is the internal potential, $V$ an external potential, and $W$ a (symmetric) interaction potential. The formal gradient flow equation associated to such a functional can be written (see carrillo2006contractions, Lemmas 8 to 10):

$$\frac{\partial\nu}{\partial t}=\operatorname{div}\Big(\nu\,\nabla\frac{\partial F}{\partial\nu}\Big)=\operatorname{div}\big(\nu\,\nabla(U'(\nu)+V+W*\nu)\big)\tag{34}$$

where $\operatorname{div}$ is the divergence operator and $\nabla\frac{\partial F}{\partial\nu}$ is the strong subdifferential of $F$ associated to the $W_2$ metric (see ambrosio2008gradient, Lemma 10.4.1). Indeed, for some generalized notion of gradient $\nabla_{W_2}F$, and for sufficiently regular $\nu$ and $F$, the r.h.s. of Equation 34 can be formally written as $-\nabla_{W_2}F(\nu)$. The dissipation of energy along the flow is then given by (see Villani:2004):

$$\frac{dF(\nu_t)}{dt}=-D(\nu_t)\quad\text{with}\quad D(\nu)=\int\Big\lVert\nabla\frac{\partial F}{\partial\nu}(x)\Big\rVert^2\,\nu(x)\,dx.\tag{35}$$
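The dissipation identity in Equation 35 can be sanity-checked numerically in the simplest case, where $F$ has only an external potential term: the flow of Equation 34 then moves each particle by $-\nabla V$, and the finite-difference decrease of $F$ should match $D(\nu)$. A sketch with the illustrative choice $V(x)=\lVert x\rVert^2/2$:

```python
import numpy as np

# For F(nu) = int V dnu (no internal or interaction term), Equation 34 moves
# each particle by x' = -grad V(x), and Equation 35 predicts that F decreases
# at rate D(nu) = int ||grad V||^2 dnu.  Here V(x) = ||x||^2 / 2, grad V = x.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 2)) + 1.0   # particle approximation of nu
dt = 1e-3

F = lambda Z: 0.5 * (Z**2).sum(axis=1).mean()   # empirical F(nu)
D = lambda Z: (Z**2).sum(axis=1).mean()         # empirical dissipation D(nu)

f0, d0 = F(X), D(X)
X = X - dt * X                                  # one explicit Euler step of the flow
print((f0 - F(X)) / dt, d0)   # finite-difference decrease rate vs. D(nu)
```

The two printed numbers agree up to a discretization error of order `dt`, consistent with Equation 35.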

### a.4 Displacement convexity

Just as in Euclidean spaces, an important criterion for characterizing the convergence of the Wasserstein gradient flow of a functional is displacement convexity (see Villani:2004, Definition 16.5, 1st bullet point):

###### Definition 2.

[Displacement convexity] We say that a functional $F$ is displacement convex if for any $\nu_0,\nu_1\in\mathcal P_2(\mathcal X)$ and any constant-speed geodesic $(\rho_t)_{t\in[0,1]}$ between $\nu_0$ and $\nu_1$ with velocity vector field $(V_t)_{t\in[0,1]}$ as defined by Equation 29, the following holds:

$$F(\rho_t)\le(1-t)F(\nu_0)+tF(\nu_1)\qquad\forall\,t\in[0,1].\tag{36}$$

Definition 2 can be relaxed to a more general notion of convexity called $\Lambda$-displacement convexity (see Villani:2009, Definition 16.5, 3rd bullet point). We first define an admissible functional $\Lambda$:

###### Definition 3.

[Admissible functional] Consider a functional $(\rho,v)\mapsto\Lambda(\rho,v)$ defined for any probability distribution $\rho\in\mathcal P_2(\mathcal X)$ and any square-integrable vector field $v\in L^2(\rho)$. We say that $\Lambda$ is admissible if it satisfies:

• For any $\rho\in\mathcal P_2(\mathcal X)$, $v\mapsto\Lambda(\rho,v)$ is a quadratic form.

• For any geodesic $(\rho_t)_{t\in[0,1]}$ between two distributions $\nu_0$ and $\nu_1$ with corresponding vector fields $(V_t)_{t\in[0,1]}$, it holds that $\inf_{t\in[0,1]}\Lambda(\rho_t,V_t)>-\infty$.

We can now define the notion of $\Lambda$-displacement convexity:

###### Definition 4.

[$\Lambda$-displacement convexity] We say that a functional $F$ is $\Lambda$-displacement convex if for any $\nu_0,\nu_1\in\mathcal P_2(\mathcal X)$ and any constant-speed geodesic $(\rho_t)_{t\in[0,1]}$ between $\nu_0$ and $\nu_1$ with velocity vector field $(V_t)_{t\in[0,1]}$ as defined by Equation 29, the following holds:

$$F(\rho_t)\le(1-t)F(\nu_0)+tF(\nu_1)-\int_0^1\Lambda(\rho_s,V_s)\,G(s,t)\,ds\qquad\forall\,t\in[0,1]\tag{37}$$

where $\Lambda$ satisfies Definition 3, and $G(s,t)=s(1-t)\,\mathbb 1\{s\le t\}+t(1-s)\,\mathbb 1\{s\ge t\}$. A particular case is when $\Lambda(\rho,v)=\lambda\lVert v\rVert^2_{L^2(\rho)}$ for some $\lambda\in\mathbb R$. In that case, Equation 37 becomes:

$$F(\rho_t)\le(1-t)F(\nu_0)+tF(\nu_1)-\frac{\lambda}{2}\,t(1-t)\,W_2^2(\nu_0,\nu_1)\qquad\forall\,t\in[0,1].\tag{38}$$