# On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport

Many tasks in machine learning and signal processing can be solved by minimizing a convex function of a measure. This includes sparse spikes deconvolution or training a neural network with a single hidden layer. For these problems, we study a simple minimization method: the unknown measure is discretized into a mixture of particles and a continuous-time gradient descent is performed on their weights and positions. This is an idealization of the usual way to train neural networks with a large hidden layer. We show that, when initialized correctly and in the many-particle limit, this gradient flow, although non-convex, converges to global minimizers. The proof involves Wasserstein gradient flows, a by-product of optimal transport theory. Numerical experiments show that this asymptotic behavior is already at play for a reasonable number of particles, even in high dimension.


## 1 Introduction

A classical task in machine learning and signal processing is to search for an element in a Hilbert space that minimizes a smooth, convex loss function $R$ and that is a linear combination of a few elements from a large given parameterized set $\{\phi(\theta)\}_{\theta \in \Theta}$. A general formulation of this problem is to describe the linear combination through an unknown signed measure $\mu$ on the parameter space and to solve for

$$J^* = \min_{\mu \in \mathcal{M}(\Theta)} J(\mu), \qquad J(\mu) := R\Big(\int \phi \, d\mu\Big) + G(\mu) \tag{1}$$

where $\mathcal{M}(\Theta)$ is the set of signed measures on the parameter space $\Theta$ and $G$ is an optional convex regularizer, typically the total variation norm when sparse solutions are preferred. In this paper, we consider the infinite-dimensional case where the parameter space $\Theta$ is a domain of $\mathbb{R}^d$ and $\phi$ is differentiable. This framework covers:

• Training neural networks with a single hidden layer, where the goal is to select, within a specific class, a function that maps features to labels, from the observation of a joint distribution of features and labels. This corresponds to the Hilbert space being the space of square-integrable real-valued functions of the features, $R$ being, e.g., the quadratic or the logistic loss function, and $\phi(\theta)$ being a single neuron with parameters $\theta$ and a given activation function. Common choices are the sigmoid function or the rectified linear unit haykin1994neural ; goodfellow2016deep , see more details in Section 4.2.

• Sparse spikes deconvolution, where one attempts to recover a signal which is a mixture of impulses on $\Theta$ given a noisy and filtered observation $y$ (a square-integrable function on $\Theta$). This corresponds to the Hilbert space being the space of square-integrable real-valued functions on $\Theta$, $\phi(\theta)$ defining the translations of the filter impulse response $\psi$ and $R(f) = \frac{1}{2\lambda}\|y - f\|_{L^2}^2$, for some $\lambda > 0$ that depends on the estimated noise level. Solving (1) then allows one to reconstruct the mixture of impulses with some guarantees de2012exact ; duval2015exact .

• Low-rank tensor decomposition haeffele2017global , recovering mixture models from sketches poon2018dual , see boyd2017alternating for a detailed list of other applications. For example, with symmetric matrices, we recover low-rank matrix decompositions srebro .

### 1.1 Review of optimization methods and previous work

While (1) is a convex problem, finding approximate minimizers is hard as the variable is infinite-dimensional. Several lines of work provide optimization methods but with strong limitations.

A first approach tackles a variant of (1) where the regularization term is replaced by an upper bound on the total variation norm; the associated constraint set is the convex hull of all Diracs and negatives of Diracs at elements of $\Theta$, and is thus adapted to conditional gradient algorithms jaggi . At each iteration, one adds a new particle by solving a linear minimization problem over the constraint set (which corresponds to finding a particle $\theta \in \Theta$), and then updates the weights. The resulting iterates are sparse and there is a guaranteed sublinear convergence rate of the objective function to its minimum. However, the linear minimization subroutine is hard to perform in general: it is for instance NP-hard for neural networks with homogeneous activations bach2017breaking . One thus generally resorts to space gridding (in low dimension) or to approximate steps, akin to boosting wang2015functional . The practical behavior is improved with nonconvex updates boyd2017alternating ; bredies2013inverse reminiscent of the flow studied below.

##### Semidefinite hierarchy.

Another approach is to parameterize the unknown measure by its sequence of moments. The space of such sequences is characterized by a hierarchy of SDP-representable necessary conditions. This approach concerns a large class of generalized moment problems lasserre2010moments and can be adapted to deal with special instances of (1) catala2017low . It is however restricted to functions $\phi$ which are combinations of few polynomial moments, and its complexity explodes exponentially with the dimension $d$. For $d > 1$, convergence to a global minimizer is only guaranteed asymptotically, similarly to the results of the present paper.

A third approach, which exploits the differentiability of , consists in discretizing the unknown measure as a mixture of particles parameterized by their positions and weights. This corresponds to the finite-dimensional problem

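$$\min_{w \in \mathbb{R}^m,\ \theta \in \Theta^m} J_m(w, \theta) \quad \text{where} \quad J_m(w, \theta) := J\Big(\frac{1}{m} \sum_{i=1}^m w_i \delta_{\theta_i}\Big), \tag{2}$$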

which can then be solved by classical gradient descent-based algorithms. This method is simple to implement and is widely used for the task of neural network training but, a priori, we may only hope to converge to local minima since $J_m$ is non-convex. Our goal is to show that this method also benefits from the convex structure of (1) and enjoys an asymptotic global optimality guarantee.
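To make the discretization (2) concrete, here is a minimal sketch of gradient descent on the weights and positions of $m$ particles. Everything specific in it is an assumption chosen for illustration and does not come from the paper: the Gaussian bump used as $\phi(\theta)$, the quadratic loss without regularizer, the synthetic target `y`, and the step-halving safeguard that keeps the loss decreasing.

```python
import numpy as np

def bump(theta, x, width=0.1):
    """Hypothetical phi(theta) sampled on the grid x; shape (m, n)."""
    return np.exp(-0.5 * ((x[None, :] - theta[:, None]) / width) ** 2)

def J_m(w, theta, x, y):
    """Objective (2) with R(f) = 0.5 * mean((f - y)^2) and G = 0 (assumed)."""
    pred = w @ bump(theta, x) / w.size
    return 0.5 * np.mean((pred - y) ** 2)

def grad_J_m(w, theta, x, y, width=0.1):
    """Gradients of J_m with respect to the weights and the positions."""
    m, n = w.size, x.size
    phi = bump(theta, x, width)
    resid = w @ phi / m - y
    grad_w = phi @ resid / (m * n)
    dphi = phi * (x[None, :] - theta[:, None]) / width ** 2  # d phi / d theta
    grad_theta = w * (dphi @ resid) / (m * n)
    return grad_w, grad_theta

def particle_gradient_descent(m=20, steps=500, lr=0.5):
    x = np.linspace(0.0, 1.0, 200)
    # Synthetic target: two spikes filtered by the same bump (illustrative).
    y = bump(np.array([0.3]), x)[0] - 0.5 * bump(np.array([0.7]), x)[0]
    w, theta = np.zeros(m), np.linspace(0.0, 1.0, m)
    losses = [J_m(w, theta, x, y)]
    for _ in range(steps):
        gw, gt = grad_J_m(w, theta, x, y)
        # Gradients are scaled by m: each particle carries a mass 1/m.
        w_new, theta_new = w - lr * m * gw, theta - lr * m * gt
        loss_new = J_m(w_new, theta_new, x, y)
        if loss_new < losses[-1]:
            w, theta = w_new, theta_new
            losses.append(loss_new)
        else:
            lr *= 0.5  # step-halving safeguard, not part of plain gradient descent
    return w, theta, losses
```

Calling `particle_gradient_descent()` returns the final weights, positions and the sequence of accepted loss values; with the safeguard the recorded losses are strictly decreasing by construction, which says nothing by itself about reaching a global minimum.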

There is a recent literature on global optimality results for (2) in the specific task of training neural networks. It is known that in this context, $J_m$ has fewer, or no, local minima in an over-parameterization regime and stochastic gradient descent (SGD) finds a global minimizer under restrictive assumptions soudry2017exponentially ; venturi2018neural ; soltanolkotabi2017theoretical ; li2017convergence ; see soltanolkotabi2017theoretical for an account of recent results. Our approach is not directly comparable to these works: it is more abstract and nonquantitative (we study an ideal dynamics that one can only hope to approximate) but also much more generic. Our objective, in the space of measures, has many local minima, but we build gradient flows that avoid them, relying mainly on the homogeneity properties of $J_m$ (see haeffele2017global ; journee2010low for other uses of homogeneity in non-convex optimization). The novelty is to see (2) as a discretization of (1), a point of view also present in nitanda2017stochastic but not yet exploited for global optimality guarantees.

### 1.2 Organization of the paper and summary of contributions

Our goal is to explain when and why the non-convex particle gradient descent finds global minima. We do so by studying the many-particle limit $m \to \infty$ of the gradient flow of $J_m$. More specifically:

• In Section 2, we introduce a more general class of problems and study the many-particle limit of the associated particle gradient flow. This limit is characterized as a Wasserstein gradient flow (Theorem 2.6), an object which is a by-product of optimal transport theory.

• In Section 3, under assumptions on $\phi$ and the initialization, we prove that if this Wasserstein gradient flow converges, then the limit is a global minimizer of $J$. Under the same conditions, it follows that if $(w^{(m)}(t), \theta^{(m)}(t))_{t \geq 0}$ are gradient flows for $J_m$ suitably initialized, then

$$\lim_{m,t \to \infty} J(\mu_{m,t}) = J^* \quad \text{where} \quad \mu_{m,t} = \frac{1}{m} \sum_{i=1}^m w_i^{(m)}(t)\, \delta_{\theta_i^{(m)}(t)}.$$

• Two different settings that leverage the structure of $\phi$ are treated: the $2$-homogeneous and the partially $1$-homogeneous case. In Section 4, we apply these results to sparse deconvolution and training neural networks with a single hidden layer, with sigmoid or ReLU activation function. In each case, our result prescribes conditions on the initialization pattern.

• We perform simple numerical experiments that indicate that this asymptotic regime is already at play for small values of $m$, even for high-dimensional problems. The method behaves incomparably better than simply optimizing on the weights with a very large set of fixed particles.

Our focus on qualitative results might be surprising for an optimization paper, but we believe that this is an insightful first step given the hardness and the generality of the problem. We suggest understanding our result as a first consistency principle for practical and commonly used non-convex optimization methods. While we focus on the idealistic setting of a continuous-time gradient flow with exact gradients, this is expected to reflect the behavior of first order descent algorithms, as they are known to approximate the former: see scieur2017integration for (accelerated) gradient descent and (kushner2003stochastic, Thm. 2.1) for SGD.

##### Notation.

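Scalar products and norms are denoted by $\cdot$ and $|\cdot|$ respectively in $\mathbb{R}^d$, and by $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$ in the Hilbert space. Norms of linear operators are also denoted by $\|\cdot\|$. The differential of a function $f$ at a point $x$ is denoted $df_x$. We write $\mathcal{M}(\mathbb{R}^d)$ for the set of finite signed Borel measures on $\mathbb{R}^d$, $\delta_x$ is a Dirac mass at a point $x$ and $\mathcal{P}_2(\mathbb{R}^d)$ is the set of probability measures endowed with the Wasserstein distance $W_2$ (see Appendix A).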

##### Recent related work.

Several independent works mei2018mean ; rotskoff2018neural ; sirignano2018mean have studied the many-particle limit of training a neural network with a single large hidden layer and a quadratic loss $R$. Their main focus is on quantifying the convergence of SGD or noisy SGD to the limit trajectory, which is precisely a mean-field limit in this case. Since in our approach this limit is mostly an intermediate step necessary to state our global convergence theorems, it is not studied extensively for itself. These papers thus provide a solid complement to Section 2.4 (a difference is that we do not assume that $R$ is quadratic nor that $V$ is differentiable). Also, mei2018mean proves a quantitative global convergence result for noisy SGD to an approximate minimizer: we stress that our results are of a different nature, as they rely on homogeneity and not on the mixing effect of noise.

## 2 Particle gradient flows and many-particle limit

### 2.1 Main problem and assumptions

From now on, we consider the following class of problems on the space $\mathcal{M}_+(\Omega)$ of non-negative finite measures on a domain $\Omega \subset \mathbb{R}^d$ which, as explained below, is more general than (1):

$$F^* = \min_{\mu \in \mathcal{M}_+(\Omega)} F(\mu) \quad \text{where} \quad F(\mu) = R\Big(\int \Phi \, d\mu\Big) + \int V \, d\mu, \tag{3}$$

and we make the following assumptions.

###### Assumptions 2.1.

The Hilbert space is separable, $\Omega \subset \mathbb{R}^d$ is the closure of a convex open set, and

1. (smooth loss) $R$ is differentiable, with a differential that is Lipschitz on bounded sets and bounded on sublevel sets,

2. (basic regularity) $\Phi$ is (Fréchet) differentiable, $V$ is semiconvex (a function $f$ is semiconvex, or $\lambda$-convex, if $f + \lambda |\cdot|^2$ is convex, for some $\lambda \in \mathbb{R}$; on a compact domain, any smooth function is semiconvex), and

3. (locally Lipschitz derivatives with sublinear growth) there exists a family of nested nonempty closed convex subsets of such that:

1. for all ,

2. and are bounded and is Lipschitz on each , and

3. there exists such that for all , where stands for the maximal norm of an element in .

Assumption 2.1-(iii) reduces to classical local Lipschitzness and growth assumptions on and if the nested sets are the balls of radius , but unbounded sets are also allowed. These sets are a technical tool used later to confine the gradient flows in areas where gradients are well-controlled. By convention, we set if is not concentrated on . Also, the integral is a Bochner integral (cohn1980measure, , App. E6). It yields a well-defined value in whenever is measurable and . Otherwise, we also set by convention.

##### Recovering (1) through lifting.

It is shown in Appendix A.2 that, for a class of admissible regularizers $G$ containing the total variation norm, problem (1) admits an equivalent formulation as (3). Indeed, consider the lifted domain $\Omega = \mathbb{R} \times \Theta$, the function $\Phi(w, \theta) = w\, \phi(\theta)$ and, for the total variation norm, $V(w, \theta) = |w|$. Then $J^*$ equals $F^*$ and given a minimizer of one of the problems, one can easily build minimizers for the other. This equivalent lifted formulation removes the asymmetry between weight and position: weight becomes just another coordinate of a particle's position. This is the right point of view for our purpose and this is why $F$ is our central object of study in the following.
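As a quick check of the lifting on atomic measures, the following computation assumes the lifted choices $\Phi(w, \theta) = w\,\phi(\theta)$ and $V(w, \theta) = |w|$ (the latter corresponding to the total variation regularizer), with particles $u_i = (w_i, \theta_i)$. It is a sketch of the finite-particle case only, not the general argument of Appendix A.2.

$$
\begin{aligned}
\int \Phi \, d\Big(\frac{1}{m}\sum_{i=1}^m \delta_{(w_i,\theta_i)}\Big)
&= \frac{1}{m}\sum_{i=1}^m w_i\, \phi(\theta_i)
= \int \phi \, d\Big(\frac{1}{m}\sum_{i=1}^m w_i \delta_{\theta_i}\Big), \\
\int V \, d\Big(\frac{1}{m}\sum_{i=1}^m \delta_{(w_i,\theta_i)}\Big)
&= \frac{1}{m}\sum_{i=1}^m |w_i|
\;\geq\; \Big|\frac{1}{m}\sum_{i=1}^m w_i \delta_{\theta_i}\Big|(\Theta).
\end{aligned}
$$

The inequality is an equality whenever particles sharing the same position $\theta_i$ have weights of the same sign, in particular when the positions are pairwise distinct. In that situation the lifted objective of the particle measure coincides with $J_m(w, \theta)$ for the total variation regularizer.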

##### Homogeneity.

The functions $\Phi$ and $V$ obtained through the lifting share the property of being positively $1$-homogeneous in the variable $w$. A function $f$ between vector spaces is said to be positively $p$-homogeneous when, for all $\lambda > 0$ and every argument $x$, it holds $f(\lambda x) = \lambda^p f(x)$. This property is central for our global convergence results (but is not needed throughout Section 2).
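The definition of positive homogeneity can be checked numerically on a small example. The sketch below assumes a lifted ReLU feature of the form $w \cdot \max(0, \theta \cdot x)$, which is an illustrative choice; the helper names and the random test data are hypothetical.

```python
import numpy as np

def relu(s):
    return np.maximum(s, 0.0)

def lifted_relu_feature(w, theta, x):
    """Assumed lifted feature: w * relu(x . theta), one value per row of x."""
    return w * relu(x @ theta)

def homogeneity_gap(f, u, degree, lam=3.0):
    """max |f(lam * u) - lam**degree * f(u)|, checked at a single lam > 0.

    A gap of zero is necessary (not sufficient) for f to be positively
    homogeneous of the given degree.
    """
    return float(np.max(np.abs(f(lam * u) - lam ** degree * f(u))))

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 3))   # 50 sample inputs in R^3
u = rng.standard_normal(4)         # u = (w, theta) with theta in R^3
theta_fixed = u[1:]

# Homogeneity jointly in (w, theta), and in the weight w alone.
joint = lambda v: lifted_relu_feature(v[0], v[1:], x)
in_w = lambda w: lifted_relu_feature(w[0], theta_fixed, x)
```

With these definitions, `homogeneity_gap(joint, u, 2)` and `homogeneity_gap(in_w, u[:1], 1)` are zero up to rounding: the feature is $2$-homogeneous in $(w, \theta)$ and $1$-homogeneous in $w$ alone.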

We first consider an initial measure which is a mixture of particles (an atomic measure) and define the initial object in our construction: the particle gradient flow. For a number $m$ of particles, and a vector $u = (u_1, \dots, u_m) \in \Omega^m$ of positions, this is the gradient flow of

$$F_m(u) := F\Big(\frac{1}{m} \sum_{i=1}^m \delta_{u_i}\Big) = R\Big(\frac{1}{m} \sum_{i=1}^m \Phi(u_i)\Big) + \frac{1}{m} \sum_{i=1}^m V(u_i), \tag{4}$$

or, more precisely, its subgradient flow because $V$ can be non-smooth. We recall that a subgradient of a (possibly non-convex) function $f$ at a point $u_0$ is a vector $p$ satisfying $f(u) \geq f(u_0) + p \cdot (u - u_0) + o(|u - u_0|)$ for all $u$. The set of subgradients at $u_0$ is a closed convex set called the subdifferential of $f$ at $u_0$, denoted $\partial f(u_0)$ rockafellar97 .

###### Definition 2.2 (Particle gradient flow).

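A gradient flow for the functional $F_m$ is an absolutely continuous path $u : \mathbb{R}_+ \to \Omega^m$ which satisfies $u'(t) \in -m\, \partial F_m(u(t))$ for almost every $t \geq 0$. (An absolutely continuous function $x$ is almost everywhere differentiable and satisfies $x(t) - x(s) = \int_s^t x'(r)\, dr$ for all $s < t$.)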

This definition uses a subgradient scaled by $m$, which is the subgradient relative to the scalar product on $(\mathbb{R}^d)^m$ scaled by $1/m$: this normalization amounts to assigning a mass $1/m$ to each particle and is convenient for taking the many-particle limit $m \to \infty$. We now state basic properties of this object.

###### Proposition 2.3.

For any initialization , there exists a unique gradient flow for . Moreover, for almost every , it holds and the velocity of the -th particle is given by , where for and ,

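$$v_t(u) = \tilde v_t(u) - \mathrm{proj}_{\partial V(u)}\big(\tilde v_t(u)\big) \quad \text{with} \quad \tilde v_t(u) = -\Big[\Big\langle R'\Big(\int \Phi \, d\mu_{m,t}\Big), \partial_j \Phi(u)\Big\rangle\Big]_{j=1}^d, \qquad \mu_{m,t} = \frac{1}{m}\sum_{i=1}^m \delta_{u_i(t)}. \tag{5}$$

The expression of the velocity involves a projection because gradient flows select subgradients of minimal norm santambrogio2015optimal . We have denoted by $R'(f)$ the gradient of $R$ at $f$ and by $\partial_j \Phi(u)$ the differential of $\Phi$ at $u$ applied to the $j$-th vector of the canonical basis of $\mathbb{R}^d$. Note that $\tilde v_t$ is (minus) the gradient of the first term in (4): when $V$ is differentiable, we have $v_t(u) = \tilde v_t(u) - \nabla V(u)$ and we recover the classical gradient of (4). When $V$ is non-smooth, this gradient flow can be understood as a continuous-time version of the forward-backward minimization algorithm combettes2011proximal .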
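The role of the projection can be illustrated in one dimension. The sketch below assumes a scalar coordinate $w$ with $V(w) = |w|$ and takes the smooth part of the velocity, here called `v_tilde`, as given; the velocity is then the minimal-norm element of $\tilde v - \partial V(w)$, that is, $\tilde v$ minus its projection onto $\partial V(w)$. The function names are hypothetical.

```python
import numpy as np

def project_onto_subdifferential_abs(w, z):
    """Project z onto the subdifferential of |.| at w.

    The subdifferential is {sign(w)} when w != 0 and the interval [-1, 1]
    when w == 0.
    """
    if w != 0.0:
        return float(np.sign(w))
    return float(np.clip(z, -1.0, 1.0))

def velocity(w, v_tilde):
    """Minimal-norm element of v_tilde - dV(w), for the assumed V(w) = |w|."""
    return v_tilde - project_onto_subdifferential_abs(w, v_tilde)
```

For instance, a particle sitting at `w = 0` does not move as long as `|v_tilde| <= 1`, and leaves zero with speed `|v_tilde| - 1` otherwise, which is the continuous-time counterpart of soft-thresholding.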

The fact that the velocity of each particle can be expressed as the evaluation of a velocity field (Eq. (5)) makes it easy, at least formally, to generalize the particle gradient flow to arbitrary measure-valued initializations, not just atomic ones. On the one hand, the evolution of a time-dependent measure $(\mu_t)_t$ under the action of instantaneous velocity fields $(v_t)_t$ can be formalized by a conservation of mass equation, known as the continuity equation, that reads $\partial_t \mu_t = -\mathrm{div}(v_t \mu_t)$ where $\mathrm{div}$ is the divergence operator (for a smooth vector field $E = (E_i)_{i=1}^d$, its divergence is given by $\mathrm{div}(E) = \sum_{i=1}^d \partial E_i / \partial x_i$; see Appendix B). On the other hand, there is a direct link between the velocity field (5) and the functional $F$. The differential of $F$ evaluated at $\mu$ is represented by the function $F'(\mu)$ defined as

$$F'(\mu)(u) := \Big\langle R'\Big(\int \Phi \, d\mu\Big), \Phi(u)\Big\rangle + V(u).$$

Thus $v_t$ is simply a field of (minus) subgradients of $F'(\mu_{m,t})$; it is in fact the field of minimal norm subgradients. We write this relation $v_t \in -\partial F'(\mu_{m,t})$. The set $\partial F'(\mu)$ is called the Wasserstein subdifferential of $F$ at $\mu$, as it can be interpreted as the subdifferential of $F$ relatively to the Wasserstein metric on the space of probability measures (see Appendix B.2.1). We thus expect that for initializations with arbitrary probability distributions, the generalization of the gradient flow coincides with the following object.

###### Definition 2.4 (Wasserstein gradient flow).

A Wasserstein gradient flow for the functional on a time interval is an absolutely continuous path in that satisfies, distributionally on ,

$$\partial_t \mu_t = -\mathrm{div}(v_t \mu_t) \quad \text{where} \quad v_t \in -\partial F'(\mu_t). \tag{6}$$

This is a proper generalization of Definition 2.2 since, whenever $u$ is a particle gradient flow for $F_m$, then $t \mapsto \mu_{m,t} := \frac{1}{m}\sum_{i=1}^m \delta_{u_i(t)}$ is a Wasserstein gradient flow for $F$ in the sense of Definition 2.4 (see Proposition B.1). By leveraging the abstract theory of gradient flows developed in ambrosio2008gradient , we show in Appendix B.2.1 that these Wasserstein gradient flows are well-defined.
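The reason an atomic measure carried by a particle flow solves the continuity equation can be seen with a short formal computation. It assumes a smooth, compactly supported test function $\varphi$ and particles moving with $u_i'(t) = v_t(u_i(t))$ for almost every $t$; the rigorous statement is the one referred to as Proposition B.1.

$$
\frac{d}{dt} \int \varphi \, d\mu_{m,t}
= \frac{d}{dt}\, \frac{1}{m} \sum_{i=1}^m \varphi(u_i(t))
= \frac{1}{m} \sum_{i=1}^m \nabla \varphi(u_i(t)) \cdot v_t(u_i(t))
= \int \nabla \varphi \cdot v_t \, d\mu_{m,t}.
$$

This is exactly the weak (distributional) form of $\partial_t \mu_t = -\mathrm{div}(v_t \mu_t)$, obtained by integrating the equation against $\varphi$ and integrating by parts.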

###### Proposition 2.5 (Existence and uniqueness).

Under Assumptions 2.1, if is concentrated on a set , then there exists a unique Wasserstein gradient flow for starting from . It satisfies the continuity equation with the velocity field defined in (5) (with in place of ).

Note that the condition on the initialization is automatically satisfied in Proposition 2.3 because there the initial measure has a finite discrete support: it is thus contained in any for large enough.

### 2.4 Many-particle limit

We now characterize the many-particle limit of classical gradient flows, under Assumptions 2.1.

###### Theorem 2.6 (Many-particle limit).

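Consider a sequence $(u_m)_m$ of classical gradient flows for $F_m$, initialized in a common set of the nested family of Assumptions 2.1-(iii). If $\mu_{m,0}$ converges to some $\mu_0$ for the Wasserstein distance $W_2$, then $(\mu_{m,t})_t$ converges, as $m \to \infty$, to the unique Wasserstein gradient flow of $F$ starting from $\mu_0$.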

Given a measure $\mu_0$, an example for the sequence of initializations is $u_m(0) = (u_1, \dots, u_m)$ where $u_1, u_2, \dots$ are independent samples distributed according to $\mu_0$. By the law of large numbers for empirical distributions, the sequence of empirical distributions $\mu_{m,0} = \frac{1}{m}\sum_{i=1}^m \delta_{u_i}$ converges (almost surely, for $W_2$) to $\mu_0$. In particular, our proof of Theorem 2.6 gives an alternative proof of the existence claim in Proposition 2.5 (the latter remains necessary for the uniqueness of the limit).
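The convergence of empirical distributions can be observed numerically in one dimension, where the $W_2$ distance between two empirical measures with the same number of atoms reduces to matching sorted samples. The sketch below is only an illustration: the standard normal distribution, the seeds and the function names are assumptions, and it compares two independent samples rather than a sample with its limit.

```python
import numpy as np

def w2_empirical_1d(x, y):
    """W2 between two 1D empirical measures with equally many atoms.

    In one dimension the optimal coupling pairs the sorted samples.
    """
    x, y = np.sort(x), np.sort(y)
    return float(np.sqrt(np.mean((x - y) ** 2)))

def w2_between_independent_samples(m, seed=0):
    """W2 between two independent m-samples of a standard normal (assumed)."""
    rng = np.random.default_rng(seed)
    return w2_empirical_1d(rng.standard_normal(m), rng.standard_normal(m))

def average_w2(m, n_seeds=10):
    """Average the distance over a few seeds to smooth out randomness."""
    return float(np.mean([w2_between_independent_samples(m, s) for s in range(n_seeds)]))
```

Comparing `average_w2(20)` with `average_w2(20000)` shows the distance between two independent empirical measures shrinking as $m$ grows, consistent with both of them approaching the same limit.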

## 3 Convergence to global minimizers

### 3.1 General idea

As can be seen from Definition 2.4, a probability measure $\mu$ is a stationary point of a Wasserstein gradient flow if and only if $0 \in \partial F'(\mu)(u)$ for $\mu$-a.e. $u$. It is proved in nitanda2017stochastic that these stationary points are, in some cases, optimal over probabilities that have a smaller support. However, they are not in general global minimizers of $F$ over $\mathcal{M}_+(\Omega)$, even when $R$ is convex. Such global minimizers are indeed characterized as follows.

###### Proposition 3.1 (Minimizers).

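Assume that $R$ is convex. A measure $\mu \in \mathcal{M}_+(\Omega)$ such that $F(\mu) < \infty$ minimizes $F$ on $\mathcal{M}_+(\Omega)$ iff $F'(\mu) \geq 0$ and $F'(\mu)(u) = 0$ for $\mu$-a.e. $u \in \Omega$.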

Despite these strong differences between stationarity and global optimality, we show in this section that Wasserstein gradient flows converge to global minimizers, under two main conditions:

• On the structure: $\Phi$ and $V$ must share a homogeneity direction (see Section 2.1 for the definition of homogeneity), and

• On the initialization: the support of the initialization of the Wasserstein gradient flow satisfies a “separation” property. This property is preserved throughout the dynamics and, combined with homogeneity, allows the flow to escape from neighborhoods of non-optimal points.

We turn these general ideas into concrete statements for two cases of interest, that exhibit different structures and behaviors: (i) when $\Phi$ and $V$ are positively $2$-homogeneous and (ii) when $\Phi$ and $V$ are positively $1$-homogeneous with respect to one variable.

### 3.2 The 2-homogeneous case

In the $2$-homogeneous case a rich structure emerges, where the $(d-1)$-dimensional sphere $\mathbb{S}^{d-1}$ plays a special role. This covers the case of lifted problems of Section 2.1 when $\phi$ is $1$-homogeneous and neural networks with ReLU activation functions.

###### Assumptions 3.2.

The domain is $\Omega = \mathbb{R}^d$ with $d \geq 2$, $\Phi$ is differentiable with a locally Lipschitz differential, $V$ is semiconvex, and $V$ and $\Phi$ are both positively $2$-homogeneous. Moreover,

1. (smooth convex loss) The loss is convex, differentiable with differential Lipschitz on bounded sets and bounded on sublevel sets,

2. (Sard-type regularity) For all , the set of regular values of is dense in its range (it is in fact sufficient that this holds for functions which are of the form for some ). Here, for a function $g$, a regular value is a real number $\alpha$ in the range of $g$ such that $g^{-1}(\alpha)$ is included in an open set where $g$ is differentiable and where its differential does not vanish.

Taking the balls of radius $r > 0$ as the nested family, these assumptions imply Assumptions 2.1. We believe that Assumption 3.2-(4) is not of practical importance: it is only used to avoid some pathological cases in the proof of Theorem 3.3. By applying Morse-Sard’s lemma abraham1967transversal , it is anyway fulfilled if the function in question is sufficiently many times continuously differentiable. We now state our first global convergence result. It involves a condition on the initialization, a separation property, that can only be satisfied in the many-particle limit. In an ambient space $\Omega$, we say that a set $C$ separates the sets $A$ and $B$ if any continuous path in $\Omega$ with endpoints in $A$ and $B$ intersects $C$.

###### Theorem 3.3.

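Under Assumptions 3.2, let $(\mu_t)_{t \geq 0}$ be a Wasserstein gradient flow of $F$ such that, for some $0 < r_a < r_b$, the support of $\mu_0$ is contained in the ball $B(0, r_b)$ and separates the spheres $r_a \mathbb{S}^{d-1}$ and $r_b \mathbb{S}^{d-1}$. If $(\mu_t)_t$ converges to $\mu_\infty$ in $W_2$, then $\mu_\infty$ is a global minimizer of $F$ over $\mathcal{M}_+(\Omega)$. In particular, if $(u_m(t))_{m,t}$ is a sequence of classical gradient flows initialized in $B(0, r_b)$ such that $\mu_{m,0}$ converges weakly to $\mu_0$ then (limits can be interchanged)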

$$\lim_{t,m \to \infty} F(\mu_{m,t}) = \min_{\mu \in \mathcal{M}_+(\Omega)} F(\mu).$$

A proof and stronger statements are presented in Appendix C. There, we give a criterion for Wasserstein gradient flows to escape neighborhoods of non-optimal measures (also valid in the finite-particle setting) and then show that it is always satisfied by the flow defined above. We also weaken the assumption that $\mu_t$ converges: we only need a certain projection of $\mu_t$ to converge weakly. Finally, the fact that limits in $m$ and $t$ can be interchanged is not anecdotal: it shows that the convergence is not conditioned on a relative speed of growth of both parameters.

This result might be easier to understand by drawing an informal distinction between (i) the structural assumptions which are instrumental and (ii) the technical conditions which have a limited practical interest. The initialization and the homogeneity assumptions are of the first kind. The Sard-type regularity is in contrast a purely technical condition: it is generally hard to check and known counter-examples involve artificial constructions such as the Cantor function whitney1935function . Similarly, when there is compactness, a gradient flow that does not converge is an unexpected (in some sense adversarial) behavior, see a counter-example in absil2005convergence . We were however not able to exclude this possibility under interesting assumptions (see a discussion in Appendix C.5).

### 3.3 The partially 1-homogeneous case

Similar results hold in the partially $1$-homogeneous setting, which covers the lifted problems of Section 2.1 when $\phi$ is bounded (e.g., sparse deconvolution and neural networks with sigmoid activation).

###### Assumptions 3.4.

The domain is with , and where and are bounded, differentiable with Lipschitz differential. Moreover,

1. (smooth convex loss) The loss is convex, differentiable with differential Lipschitz on bounded sets and bounded on sublevel sets,

2. (Sard-type regularity) For all , the set of regular values of is dense in its range, and

3. (boundary conditions) The function behaves nicely at the boundary of the domain: either

1. and for all , converges, uniformly in as , to a function satisfying the Sard-type regularity, or

2. is the closure of a bounded open convex set and for all , satisfies Neumann boundary conditions (i.e., for all , where is the normal to at ).

With the family of nested sets , , these assumptions imply Assumptions 2.1. The following theorem mirrors the statement of Theorem 3.3, but with a different condition on the initialization. The remarks after Theorem 3.3 also apply here.

###### Theorem 3.5.

Under Assumptions 3.4, let be a Wasserstein gradient flow of such that for some , the support of is contained in and separates from . If converges to in , then is a global minimizer of over . In particular, if is a sequence of classical gradient flows initialized in such that converges to in then (limits can be interchanged)

$$\lim_{t,m \to \infty} F(\mu_{m,t}) = \min_{\mu \in \mathcal{M}_+(\Omega)} F(\mu).$$

## 4 Case studies and numerical illustrations

In this section, we apply the previous abstract statements to specific examples and show on synthetic experiments that the particle-complexity to reach global optimality is very favorable.

### 4.1 Sparse deconvolution

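For sparse deconvolution, it is typical to consider a signal on the torus $\Theta$. The loss function is $R(f) = \frac{1}{2\lambda}\|y - f\|_{L^2}^2$ for some $\lambda > 0$, a parameter that increases with the noise level, and the regularization is the total variation norm $|\mu|(\Theta)$. Consider a filter impulse response $\psi$ and let $\phi(\theta)$ be its translation by $\theta$. The object sought after is a signed measure on $\Theta$, which is obtained from a probability measure on the lifted domain $\mathbb{R} \times \Theta$ by applying an operator that integrates out the weight coordinate. We show in Appendix D that Theorem 3.5 applies.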

###### Proposition 4.1 (Sparse deconvolution).

Assume that the filter impulse response is times continuously differentiable, and that the support of contains . If the projection of the Wasserstein gradient flow of weakly converges to , then is a global minimizer of

$$\min_{\mu \in \mathcal{M}(\Theta)} \frac{1}{2\lambda} \Big\| y - \int \psi \, d\mu \Big\|_{L^2}^2 + |\mu|(\Theta).$$

We show an example of such a reconstruction on the -torus on Figure 1, where the ground truth consists of weighted spikes, is an ideal low pass filter (a Dirichlet kernel of order ) and is a noisy observation of the filtered spikes. The particle gradient flow is integrated with the forward-backward algorithm combettes2011proximal and the particles initialized on a uniform grid on .
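The forward-backward scheme mentioned above alternates an explicit gradient step on the smooth part of the objective with a proximal step on the non-smooth part. The following is a generic sketch, not the implementation used for Figure 1: it assumes the regularizer acts on the weights as $|w|$, that the gradients of the smooth part are supplied by the caller, and that each particle carries a mass $1/m$.

```python
import numpy as np

def soft_threshold(w, tau):
    """Proximal operator of tau * |.|, applied entrywise."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def forward_backward_step(w, theta, grad_w, grad_theta, step):
    """One forward-backward step on weights w and positions theta.

    grad_w and grad_theta are gradients of the smooth term of the particle
    objective (supplied by the caller). They are rescaled by m because each
    particle carries a mass 1/m.
    """
    m = w.size
    w_half = w - step * m * grad_w              # forward (explicit) step
    theta_new = theta - step * m * grad_theta   # positions: smooth part only
    w_new = soft_threshold(w_half, step)        # backward (proximal) step
    return w_new, theta_new
```

Weights whose magnitude falls below the step size after the explicit step are set exactly to zero, which is how this scheme produces sparse solutions without extra pruning.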

### 4.2 Neural networks with a single hidden layer

We consider a joint distribution of features and labels and the marginal distribution of features. The loss is the expected risk defined on , where is either the squared loss or the logistic loss. Also, we set for an activation function . Depending on the choice of , we face two different situations.

##### Sigmoid activation.

If is a sigmoid, say , then Theorem 3.5, with domain applies. The natural (optional) regularization term is , which amounts to penalizing the norm of the weights.

###### Proposition 4.2 (Sigmoid activation).

Assume that has finite moments up to order , that the support of is and that boundary condition 3.4-(iii)-(a) holds. If the Wasserstein gradient flow of converges in to , then is a global minimizer of .

Note that we have to explicitly assume the boundary condition 3.4-(iii)-(a) because the Sard-type regularity at infinity cannot be checked a priori (this technical detail is discussed in Appendix D.3).

##### ReLU activation.

The activation function $\sigma(s) = \max\{0, s\}$ is positively $1$-homogeneous: this makes $\Phi$ $2$-homogeneous and corresponds, at a formal level, to the setting of Theorem 3.3. An admissible choice of regularizer here would be the (semi-convex) function  bach2017breaking . However, as shown in Appendix D.4, the differential of $\Phi$ has discontinuities: this altogether prevents defining gradient flows, even in the finite-particle regime.

Still, a statement holds for a different parameterization of the same class of functions, which makes $\Phi$ differentiable. To see this, consider a domain which is the disjoint union of $2$ copies of $\mathbb{R}^d$. On the first copy, define $\Phi$ in terms of $s(u) = u|u|$, the signed square function. On the second copy, $\Phi$ has the same definition but with a minus sign. This trick gives the same expressive power as classical ReLU networks. In practice, it corresponds to simply putting, say, random signs in front of the activation. The regularizer here can be .
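A small numerical sketch may help to see why the signed square is convenient. The exact formula for $\Phi$ is not reproduced here, so the feature below is an assumption: it applies the ReLU to the inner product between the input and the entrywise signed square of the parameters, with a sign that stands for the choice of copy.

```python
import numpy as np

def signed_square(u):
    """s(u) = u * |u|, applied entrywise; its derivative 2|u| is continuous."""
    return u * np.abs(u)

def relu(s):
    return np.maximum(s, 0.0)

def feature(u, x, sign=1.0):
    """Assumed feature: sign * relu(x . s(u)); sign = -1 on the second copy."""
    return sign * relu(x @ signed_square(u))

rng = np.random.default_rng(0)
x = rng.standard_normal((20, 3))   # hypothetical inputs
u = rng.standard_normal(3)         # hypothetical parameters
```

Under this assumed form, `feature(lam * u, x)` equals `lam ** 2 * feature(u, x)` for any `lam > 0`, so the feature is positively $2$-homogeneous in the parameters, and `signed_square` has derivative zero at the origin, unlike the absolute value.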

###### Proposition 4.3 (ReLU activation).

Assume that has finite second moments, that the support of is for some (on both copies of ) and that the Sard-type regularity Assumption 3.2-(4) holds. If the Wasserstein gradient flow of converges in to , then is a global minimizer of .

We display on Figure 2 particle gradient flows for training a neural network with a single hidden layer and ReLU activation in the classical (non-differentiable) parameterization, with $V = 0$ (no regularization). Features are normally distributed, and the ground truth labels are generated with a similar network. The particle gradient flow is “integrated” with mini-batch SGD and the particles are initialized on a small centered sphere.

### 4.3 Empirical particle-complexity

Since our convergence results are non-quantitative, one might argue that similar (and much simpler to prove) asymptotic results hold for the method of distributing particles on the whole of $\Theta$ and simply optimizing on the weights, which is a convex problem. Yet, the comparison of the particle-complexity shown in Figure 3 stands strongly in favor of particle gradient flows. While exponential particle-complexity is unavoidable for the convex approach, we observed on several synthetic problems that particle gradient descent only needs a slight over-parameterization to find global minimizers within optimization error (see details in Appendix D.5).

## 5 Conclusion

We have established asymptotic global optimality properties for a family of non-convex gradient flows. These results were enabled by the study of a Wasserstein gradient flow: this object simplifies the handling of many-particle regimes, analogously to a mean-field limit. The particle-complexity to reach global optimality turns out very favorable on synthetic numerical problems. This confirms the relevance of our qualitative results and calls for quantitative ones that would further exploit the properties of such particle gradient flows. Multiple layer neural networks are also an interesting avenue for future research.

#### Acknowledgments

We acknowledge support from grants from Région Ile-de-France and the European Research Council (grant SEQUOIA 724063).

## Supplementary material

Supplementary material for the paper: “On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport” authored by Lénaïc Chizat and Francis Bach (NIPS 2018).

This appendix is organized as follows:

• Appendix A: Introductory facts

• Appendix B: Many-particle limit and Wasserstein gradient flow

• Appendix C: Convergence to global minimizers

• Appendix D: Case studies and numerical experiments

## Appendix A Introductory facts

### A.1 Tools from measure theory

In this paper, the term measure refers to a finite signed measure on $\mathbb{R}^d$, $d \geq 1$, endowed with its Borel $\sigma$-algebra. We write $\mathcal{M}(X)$ for the set of such measures concentrated on a measurable set $X \subset \mathbb{R}^d$. Hereafter, we gather some concepts and facts from measure theory that are used in the proofs.

##### Variation of a signed measure.

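The Jordan decomposition theorem [10, Cor. 4.1.6] asserts that any finite signed measure $\mu$ can be decomposed as $\mu = \mu_+ - \mu_-$ where $\mu_+$ and $\mu_-$ are nonnegative finite measures. If $\mu_+$ and $\mu_-$ are chosen with minimal total mass, the variation of $\mu$ is the nonnegative measure $|\mu| := \mu_+ + \mu_-$ and $|\mu|(\mathbb{R}^d)$ is the total variation norm of $\mu$.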

##### Support and concentration set.

The support $\operatorname{spt}\mu$ of a measure $\mu \in \mathcal{M}_+(\mathbb{R}^d)$ is the complement of the largest open set of measure $0$, or, equivalently, the set of points all of whose neighborhoods have positive measure. We say that $\mu$ is concentrated on a set $S \subset \mathbb{R}^d$ if the complement of $S$ is included in a measurable set of measure $0$. In particular, $\mu$ is concentrated on $\operatorname{spt}\mu$.

##### Pushforward.

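Let $X$ and $Y$ be measurable subsets of $\mathbb{R}^d$ and let $T: X \to Y$ be a measurable map. To any measure $\mu \in \mathcal{M}(X)$ corresponds a measure $T_\#\mu \in \mathcal{M}(Y)$ called the pushforward of $\mu$ by $T$. It is defined as $T_\#\mu(B) = \mu(T^{-1}(B))$ for every measurable set $B \subset Y$ and corresponds to the distribution of the “mass” of $\mu$ after it has been displaced by the map $T$. It satisfies $\int_Y \varphi \, d(T_\#\mu) = \int_X \varphi \circ T \, d\mu$ whenever $\varphi: Y \to \mathbb{R}$ is a measurable function such that $\varphi \circ T$ is $\mu$-integrable [10, Prop. 2.6.8]. In particular, with a projection map $\pi^i: (x_1, x_2, \dots) \mapsto x_i$, the pushforward $\pi^i_\#\mu$ is the marginal of $\mu$ on the $i$-th factor.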
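For a discrete measure, the pushforward simply relocates each atom while keeping its mass. The sketch below checks the change-of-variables identity and the marginal property on a small example; the list-of-atoms representation, the map `T`, and the test function `phi` are assumptions made for illustration.

```python
import math


def pushforward(atoms, T):
    """Pushforward of a discrete measure: each atom moves to T(x), mass unchanged."""
    return [(T(x), w) for x, w in atoms]


def integrate(phi, atoms):
    """Integral of phi against a discrete measure."""
    return sum(w * phi(x) for x, w in atoms)


# A probability measure on R^2 with two atoms.
mu = [((0.0, 1.0), 0.25), ((2.0, 1.0), 0.75)]
T = lambda x: x[0] + x[1]   # a measurable map from R^2 to R
phi = math.cos              # a bounded test function on R

# Change of variables: integrating phi against T#mu equals integrating phi∘T against mu.
lhs = integrate(phi, pushforward(mu, T))
rhs = integrate(lambda x: phi(T(x)), mu)

# Pushing forward by a coordinate projection yields the corresponding marginal.
first_marginal = pushforward(mu, lambda x: x[0])
```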

##### Weak convergence and Bounded Lipschitz norm.

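We say that a sequence of measures $\mu_n \in \mathcal{M}(\mathbb{R}^d)$ weakly (or narrowly) converges to $\mu$ if, for every continuous and bounded function $\varphi: \mathbb{R}^d \to \mathbb{R}$, it holds $\int \varphi \, d\mu_n \to \int \varphi \, d\mu$. For sequences which are bounded in total variation norm, this is equivalent to the convergence in Bounded Lipschitz norm. The latter is defined, for $\mu \in \mathcal{M}(\mathbb{R}^d)$, as

$$\|\mu\|_{\mathrm{BL}} := \sup\left\{ \int \varphi \, d\mu \;;\; \varphi: \mathbb{R}^d \to \mathbb{R},\ \mathrm{Lip}(\varphi) \leq 1,\ \|\varphi\|_\infty \leq 1 \right\} \tag{7}$$

where $\mathrm{Lip}(\varphi)$ is the smallest Lipschitz constant of $\varphi$ and $\|\cdot\|_\infty$ the supremum norm.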

##### Wasserstein metric.

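The $p$-Wasserstein distance between two probability measures $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$ is defined as

$$W_p(\mu,\nu) := \left( \min \int |y-x|^p \, d\gamma(x,y) \right)^{1/p}$$

where the minimization is over the set of probability measures $\gamma \in \mathcal{P}(\mathbb{R}^d \times \mathbb{R}^d)$ such that the marginal on the first factor is $\mu$ and is $\nu$ on the second factor. The set of probability measures with finite second moments endowed with the metric $W_2$ is a complete metric space that we denote $\mathcal{P}_2(\mathbb{R}^d)$. A sequence $(\mu_n)_n$ converges to $\mu$ in $\mathcal{P}_2(\mathbb{R}^d)$ iff for every continuous function $\varphi$ with at most quadratic growth it holds $\int \varphi \, d\mu_n \to \int \varphi \, d\mu$ [3, Prop. 7.1.5] (this is stronger than weak convergence). Using, respectively, the duality formula for $W_1$ [29, Eq. (3.1)] and Jensen’s inequality, it holds

$$\|\mu-\nu\|_{\mathrm{BL}} \leq W_1(\mu,\nu) \leq W_2(\mu,\nu).$$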
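This chain of inequalities can be illustrated numerically on the real line, where the optimal coupling between two uniform empirical measures with the same number of atoms matches the sorted samples. The sketch below relies on that one-dimensional fact. It only computes a lower estimate of the Bounded Lipschitz norm, obtained from one admissible test function (a clipped identity, chosen here for illustration), so it is a consistency check rather than an exact evaluation.

```python
def wasserstein_1d(xs, ys, p):
    """W_p between uniform empirical measures on R with equally many atoms.

    In one dimension, an optimal coupling pairs the sorted atoms.
    """
    assert len(xs) == len(ys)
    n = len(xs)
    cost = sum(abs(x - y) ** p for x, y in zip(sorted(xs), sorted(ys))) / n
    return cost ** (1.0 / p)


def bl_lower_estimate(xs, ys):
    """Lower estimate of ||mu - nu||_BL using a single admissible test function."""
    phi = lambda x: max(-1.0, min(1.0, x))  # 1-Lipschitz and bounded by 1
    n = len(xs)
    return abs(sum(phi(x) for x in xs) - sum(phi(y) for y in ys)) / n


xs = [-0.5, 0.1, 0.4, 2.0]
ys = [-0.2, 0.0, 1.0, 3.5]
w1 = wasserstein_1d(xs, ys, 1)
w2 = wasserstein_1d(xs, ys, 2)
bl = bl_lower_estimate(xs, ys)
```

On this example the three quantities are ordered as `bl <= w1 <= w2`, in agreement with the inequality.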

###### Lemma A.1 (Wasserstein continuity of F).

Under Assumptions 2.1, the function $F$ is continuous for the Wasserstein metric $W_2$.

###### Proof.

Let be such that . By Assumption 2.1-(iii)-(c), and have at most quadratic growth. It follows and since, by the properties of Bochner integrals [10, Prop. E.5], it holds we also have strongly in . As is continuous in the strong topology of , it follows . ∎

### A.2 Lifting to the space of probability measures

Let us give technical details about the lifting introduced in Section 2.1 that allows one to pass from a problem on the space of signed measures on $\Theta$ (the minimization of $J$ defined in (1)) to an equivalent problem on the space of probability measures on a bigger space (the minimization of $F$ defined in (3)).

##### Homogeneity.

We recall that a function $f$ from $\mathbb{R}^d$ to a vector space is said to be positively $p$-homogeneous, with $p \geq 0$, if for all $u \in \mathbb{R}^d$ and $\lambda > 0$ it holds $f(\lambda u) = \lambda^p f(u)$. We often use without explicit mention the properties related to homogeneity such as the fact that the (sub)-derivative of a positively $p$-homogeneous function is positively $(p-1)$-homogeneous and, for $f$ differentiable (except possibly at $0$), the identity $u \cdot \nabla f(u) = p f(u)$ for $u \neq 0$.
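As a quick numerical sanity check of these properties, the sketch below verifies the scaling relation and Euler's identity $u \cdot \nabla f(u) = p f(u)$ for one positively 2-homogeneous function. The function `f`, the evaluation point, and the finite-difference gradient are assumptions made for illustration.

```python
def f(u):
    """An illustrative positively 2-homogeneous function on R^2."""
    x, y = u
    return x * x + abs(x * y)


def grad_fd(g, u, eps=1e-6):
    """Central finite-difference approximation of the gradient of g at u."""
    grad = []
    for i in range(len(u)):
        up = list(u)
        dn = list(u)
        up[i] += eps
        dn[i] -= eps
        grad.append((g(up) - g(dn)) / (2 * eps))
    return grad


p = 2
u = [1.5, -0.7]   # a point where f is differentiable
lam = 3.0

# Homogeneity: f(lam * u) should equal lam^p * f(u).
homogeneity_gap = abs(f([lam * c for c in u]) - lam ** p * f(u))

# Euler's identity: u . grad f(u) should equal p * f(u).
euler_gap = abs(sum(c * g for c, g in zip(u, grad_fd(f, u))) - p * f(u))
```

Both gaps are zero up to floating-point and finite-difference error.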

#### A.2.1 The partially 1-homogeneous case

We take , and for some continuous functions and . This setting covers the lifted problems mentioned in Section 2.1. We first show that can be indifferently minimized over or over , thanks to the homogeneity of and in the variable .

###### Proposition A.2.

For all , there is such that .

###### Proof.

If then where is any point in . Otherwise, we define the map and the probability measure , which satisfies . ∎

We now introduce a projection operator that is adapted to the partial homogeneity of and . It is defined by for all and measurable set or, equivalently, by the property that for all continuous and bounded test function ,

$$\int_\Theta \varphi(\theta)\, d h^1(\mu)(\theta) = \int_{\mathbb{R}\times\Theta} w\, \varphi(\theta)\, d\mu(w,\theta).$$

This operator is well defined whenever is -integrable.
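For a discrete measure on $\mathbb{R} \times \Theta$, this projection multiplies the mass of each atom by its weight coordinate $w$ and keeps only the position $\theta$. The sketch below checks the defining identity on a small example; the data layout (a list of `((w, theta), mass)` pairs with scalar `theta`) and the test function are illustrative assumptions.

```python
import math


def h1(mu):
    """Project a discrete measure on R x Theta to a signed measure on Theta.

    `mu` is a list of ((w, theta), mass) pairs. The result maps each theta to
    the sum of w * mass over the atoms located at that theta.
    """
    nu = {}
    for (w, theta), mass in mu:
        nu[theta] = nu.get(theta, 0.0) + w * mass
    return nu


# Four particles of mass 1/4 each; two of them share the same position theta.
mu = [((2.0, 0.3), 0.25), ((-1.0, 0.3), 0.25),
      ((0.5, 0.8), 0.25), ((-3.0, 1.2), 0.25)]
nu = h1(mu)

phi = math.sin  # a continuous and bounded test function on Theta
lhs = sum(phi(theta) * signed_mass for theta, signed_mass in nu.items())
rhs = sum(w * phi(theta) * mass for (w, theta), mass in mu)
```

Particles sharing a position have their signed contributions added, which is why opposite weights at the same $\theta$ partially cancel.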

###### Proposition A.3 (Equivalence under lifting).

It holds . For a regularizer on of the form , it holds . If the infimum defining is attained and if minimizes , then there exists that minimizes over .

###### Proof.

A signed measure can be expressed as where and (take for instance the normalized variation of if ). The measure

$$\mu := (f \times \mathrm{id})_\# \sigma \tag{8}$$

belongs to and satisfies . This proves that is surjective. It is clear by the definition of that for all , it holds hence , with equality when is the minimizer in the definition of . ∎

The class of regularizers considered in Proposition A.3 includes the total variation norm.

###### Proposition A.4 (Total variation).

Let . For , it holds with equality if, for instance, is a lift of of the form (8).

###### Proof.

Let and . We define and . Clearly, and by the definition of the total variation of a signed measure, . There is equality whenever has -measure (see [10, Cor. 4.1.6]), a condition which is satisfied by the lift in (8). ∎
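The role of cancellations can be seen on discrete measures. The sketch below assumes the bound reads as "the total variation of the projected measure is at most the integral of $|w|$ against the lifted measure", and compares a lift where two particles with opposite weights share a position against a lift without such overlap. The helper names and data are hypothetical.

```python
def h1(mu):
    """Projection of a discrete measure on R x Theta (list of ((w, theta), mass))."""
    nu = {}
    for (w, theta), mass in mu:
        nu[theta] = nu.get(theta, 0.0) + w * mass
    return nu


def total_variation(nu):
    """Total variation norm of a discrete signed measure {theta: signed mass}."""
    return sum(abs(v) for v in nu.values())


def lifted_mass(mu):
    """Integral of |w| against the lifted measure mu."""
    return sum(abs(w) * mass for (w, _), mass in mu)


# Opposite weights at theta = 0.3 cancel partially: the inequality is strict.
mu_overlap = [((2.0, 0.3), 0.25), ((-1.0, 0.3), 0.25),
              ((0.5, 0.8), 0.25), ((-3.0, 1.2), 0.25)]
tv_overlap = total_variation(h1(mu_overlap))
bound_overlap = lifted_mass(mu_overlap)

# A single sign per position: no cancellation, equality holds.
mu_separated = [((1.0, 0.3), 0.25), ((0.5, 0.8), 0.25), ((-3.0, 1.2), 0.5)]
tv_separated = total_variation(h1(mu_separated))
bound_separated = lifted_mass(mu_separated)
```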

#### A.2.2 The 2-homogeneous case

Another structure that is studied in this paper is when $\Phi$ and $V$ are defined on $\mathbb{R}^d$ and are positively $2$-homogeneous. In this case, the role played by $\Theta$ in the previous section is played by the unit sphere $\mathbb{S}^{d-1}$ of $\mathbb{R}^d$. We could again make links between $F$ (defined as in Eq. (3)) and a functional on nonnegative measures on the sphere (playing the role of $J$) but here we will limit ourselves to defining the projection operator $h^2$ relevant in this setting. It is characterized by the relationship, for every continuous and bounded function $\varphi: \mathbb{S}^{d-1} \to \mathbb{R}$ (with the convention ):

$$\int_{\mathbb{S}^{d-1}} \varphi(\theta)\, d h^2(\mu)(\theta) = \int_{\mathbb{R}^d} |u|^2 \varphi(u/|u|)\, d\mu(u).$$

This operator is well-defined iff $\mu$ has finite second order moments.
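For a discrete measure, this projection sends each atom $u$ to its direction $u/|u|$ and weights it by $|u|^2$ times its mass. The sketch below illustrates two consequences: the total mass of the projected measure equals the second moment, and rescaling an atom by $\lambda$ while dividing its mass by $\lambda^2$ leaves the projection unchanged. The data layout, the rounding of directions used to merge atoms, and the handling of atoms at the origin (skipped, since the factor $|u|^2$ vanishes there) are assumptions of this sketch.

```python
import math


def h2(mu):
    """Project a discrete measure on R^d to a nonnegative measure on the sphere.

    `mu` is a list of (u, mass) pairs with u a tuple of floats. The result maps
    each direction u/|u| to the accumulated value of |u|^2 * mass.
    """
    nu = {}
    for u, mass in mu:
        r = math.sqrt(sum(c * c for c in u))
        if r == 0.0:
            continue  # the factor |u|^2 vanishes, so this atom contributes nothing
        direction = tuple(round(c / r, 12) for c in u)
        nu[direction] = nu.get(direction, 0.0) + r * r * mass
    return nu


mu_a = [((3.0, 4.0), 0.5), ((0.0, -2.0), 0.5)]

# Same directions: the first atom is scaled by 2 and its mass divided by 4,
# with the leftover mass placed at the origin to keep a probability measure.
mu_b = [((6.0, 8.0), 0.125), ((0.0, -2.0), 0.5), ((0.0, 0.0), 0.375)]

nu_a = h2(mu_a)
nu_b = h2(mu_b)
second_moment_a = sum(mass * sum(c * c for c in u) for u, mass in mu_a)
```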

## Appendix B Many-particle limit and Wasserstein gradient flow

### B.1 Proof of Proposition 2.3

As the sum of a continuously differentiable and a semiconvex function, $F_m$ is locally semiconvex and the existence of a unique gradient flow on a maximal interval $[0,T)$ with the claimed properties is standard, see [30, Sec. 2.1]. Now, a general property of gradient flows is that for a.e. $t \in [0,T)$, the derivative is (minus) the subgradient of minimal norm. This leads to the explicit formula involving the velocity field with pointwise minimal norm:

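$$\begin{aligned}
v_t(u) &= \arg\min\left\{ |v|^2 \;;\; \tilde v_t(u) - v \in \partial V(u) \right\} \\
&= \tilde v_t(u) - \arg\min\left\{ |\tilde v_t(u) - z|^2 \;;\; z \in \partial V(u) \right\} \\
&= (\mathrm{id} - \mathrm{proj}_{\partial V(u)})(\tilde v_t(u)).
\end{aligned}$$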
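To make the projection formula concrete, consider the one-dimensional regularizer $V(u) = \lambda |u|$, an illustrative choice that is not taken from the paper. Its subdifferential is the singleton $\{\lambda \operatorname{sign}(u)\}$ for $u \neq 0$ and the interval $[-\lambda, \lambda]$ at $u = 0$, so the projection is a clipping and the minimal-norm velocity at the origin is a soft-thresholding of $\tilde v_t$. The sketch compares the closed form with a brute-force minimization over a grid of subgradients.

```python
def subdifferential_abs(u, lam):
    """Interval (lo, hi) equal to the subdifferential of V(u) = lam * |u|."""
    if u > 0:
        return (lam, lam)
    if u < 0:
        return (-lam, -lam)
    return (-lam, lam)


def velocity(v_tilde, u, lam):
    """Minimal-norm velocity: v_tilde minus its projection onto the subdifferential."""
    lo, hi = subdifferential_abs(u, lam)
    proj = min(max(v_tilde, lo), hi)
    return v_tilde - proj


def velocity_brute_force(v_tilde, u, lam, n=2000):
    """Same quantity by minimizing |v_tilde - z| over a grid of subgradients z."""
    lo, hi = subdifferential_abs(u, lam)
    grid = [lo + (hi - lo) * k / n for k in range(n + 1)]
    z_best = min(grid, key=lambda z: abs(v_tilde - z))
    return v_tilde - z_best


stuck = velocity(0.3, 0.0, 1.0)     # small drive at the kink: the particle stays put
escaping = velocity(2.5, 0.0, 1.0)  # large drive at the kink: reduced by lam
smooth = velocity(0.5, 2.0, 1.0)    # away from the kink: v_tilde minus V'(u)
```

At the kink, a drive $\tilde v_t$ with magnitude below $\lambda$ yields zero velocity, which is how such a flow keeps particles at $u = 0$.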

In the specific case of gradient flows of lower bounded functions, we can derive estimates that imply that $T = \infty$ (even if $F_m$ is not globally semiconvex). Indeed, for all $t \in [0,T)$, it holds

$$F_m(u(0)) - F_m(u(t)) = -\int_0^t \frac{d}{ds} F_m(u(s))\, ds = \frac{1}{m} \int_0^t |u'(s)|^2\, ds \geq \frac{1}{tm} \left( \int_0^t |u'(s)|\, ds \right)^2$$

by Jensen’s inequality. Since $F_m$ is lower bounded, this proves that the gradient flow has bounded length on bounded time intervals. By compactness, if $T$ were finite then $\lim_{t \to T} u(t)$ would exist, thus contradicting the maximality of $T$, hence $T = \infty$ and the gradient flow is globally defined.
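The length estimate can be observed numerically. The Jensen (or Cauchy-Schwarz) step bounds the squared length $\left(\int_0^t |u'(s)|\,ds\right)^2$ by $t \int_0^t |u'(s)|^2\,ds$, so the length of the trajectory up to time $t$ is at most $\sqrt{t\, m\, (F_m(u(0)) - F_m(u(t)))}$. The sketch below checks this on an explicit Euler discretization of $u' = -m \nabla F_m(u)$ for a toy quadratic objective. The objective, the step size, and the time scaling by $m$ are assumptions made for illustration, and the discretization only approximates the continuous-time flow.

```python
import math


def F_m(u):
    """Toy lower-bounded objective: F_m(u) = (1/m) * sum_i u_i^2 / 2."""
    return sum(0.5 * c * c for c in u) / len(u)


def grad_F_m(u):
    """Gradient of the toy objective."""
    return [c / len(u) for c in u]


def flow(u0, t_end, dt):
    """Explicit Euler scheme for u'(t) = -m * grad F_m(u(t)); returns (u(t_end), length)."""
    m = len(u0)
    u = list(u0)
    length = 0.0
    for _ in range(int(round(t_end / dt))):
        step = [-dt * m * g for g in grad_F_m(u)]
        length += math.sqrt(sum(s * s for s in step))
        u = [c + s for c, s in zip(u, step)]
    return u, length


u0 = [1.0, -2.0, 0.5]
m = len(u0)
t_end = 2.0
u_t, length = flow(u0, t_end, dt=1e-3)

energy_drop = F_m(u0) - F_m(u_t)
length_bound = math.sqrt(t_end * m * energy_drop)
```

On this example the trajectory length stays below `length_bound`, with a visible margin because the speed decays along the flow.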