A classical task in machine learning and signal processing is to search for an element in a Hilbert space that minimizes a smooth, convex loss function and that is a linear combination of a few elements from a large given parameterized set . A general formulation of this problem is to describe the linear combination through an unknown signed measure on the parameter space and to solve for
where is the set of signed measures on the parameter space and is an optional convex regularizer, typically the total variation norm when sparse solutions are preferred. In this paper, we consider the infinite-dimensional case where the parameter space is a domain of and is differentiable. This framework covers:
Training neural networks with a single hidden layer, where the goal is to select, within a specific class, a function that maps features in to labels in , from the observation of a joint distribution of features and labels. This corresponds to being the space of square-integrable real-valued functions on , being, e.g., the quadratic or the logistic loss function, and , with an activation function haykin1994neural ; goodfellow2016deep (see Section 4.2 for more details).
Sparse spikes deconvolution, where one attempts to recover a signal which is a mixture of impulses on given a noisy and filtered observation (a square-integrable function on ). This corresponds to being the space of square-integrable real-valued functions on , defining the translations of the filter impulse response and , for some that depends on the estimated noise level. Solving (1) then makes it possible to reconstruct the mixture of impulses with some guarantees de2012exact ; duval2015exact .
1.1 Review of optimization methods and previous work
While (1) is a convex problem, finding approximate minimizers is hard as the variable is infinite-dimensional. Several lines of work provide optimization methods but with strong limitations.
Conditional gradient / Frank-Wolfe.
This approach tackles a variant of (1) where the regularization term is replaced by an upper bound on the total variation norm; the associated constraint set is the convex hull of all Diracs and negatives of Diracs at elements of , and is thus adapted to conditional gradient algorithms jaggi . At each iteration, one adds a new particle by solving a linear minimization problem over the constraint set (which corresponds to finding a particle ), and then updates the weights. The resulting iterates are sparse and there is a guaranteed sublinear convergence rate of the objective function to its minimum. However, the linear minimization subroutine is hard to perform in general: it is, for instance, NP-hard for neural networks with homogeneous activations bach2017breaking . One thus generally resorts to space gridding (in low dimension) or to approximate steps, akin to boosting wang2015functional . The practical behavior is improved with nonconvex updates boyd2017alternating ; bredies2013inverse reminiscent of the flow studied below.
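The conditional gradient iteration described above can be sketched on a toy problem. The sketch below is only an illustration under assumptions that are not in the text: the parameter space is replaced by a finite grid (the "space gridding" option), the features are Gaussian bumps sampled at a few points, the loss is quadratic, and the step size is the classical 2/(k+2) schedule. The names `bump`, `frank_wolfe` and `tau` are hypothetical.

```python
import math

def bump(theta, xs, width=0.1):
    """Illustrative feature phi(theta): a Gaussian bump centered at theta, sampled at xs."""
    return [math.exp(-((x - theta) ** 2) / (2 * width ** 2)) for x in xs]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def frank_wolfe(y, xs, grid, tau, n_iter=50):
    """Minimize 0.5 * ||f - y||^2 over f = sum_j w_j * phi(grid[j]) with sum_j |w_j| <= tau."""
    feats = [bump(t, xs) for t in grid]
    weights = [0.0] * len(grid)
    f = [0.0] * len(xs)
    history = []
    for k in range(n_iter):
        grad = [fi - yi for fi, yi in zip(f, y)]
        history.append(0.5 * dot(grad, grad))
        # Linear minimization over the constraint set: its extreme points are
        # +/- tau times a Dirac, so we look for the grid point whose feature
        # is most correlated with the gradient and take the opposite sign.
        scores = [dot(phi, grad) for phi in feats]
        j = max(range(len(grid)), key=lambda i: abs(scores[i]))
        sign = -1.0 if scores[j] > 0 else 1.0
        gamma = 2.0 / (k + 2.0)
        # Convex combination of the current iterate and the new atom.
        weights = [(1 - gamma) * w for w in weights]
        weights[j] += gamma * sign * tau
        f = [(1 - gamma) * fi + gamma * sign * tau * p for fi, p in zip(f, feats[j])]
    return weights, history

xs = [i / 49 for i in range(50)]
grid = [i / 20 for i in range(21)]
y = [a - 0.5 * b for a, b in zip(bump(0.3, xs), bump(0.7, xs))]
weights, history = frank_wolfe(y, xs, grid, tau=1.5)
```

Each iteration touches a single grid point, so the iterate has at most as many atoms as iterations; the cost of the exhaustive search over the grid is what becomes prohibitive in higher dimension.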
Another approach is to parameterize the unknown measure by its sequence of moments. The space of such sequences is characterized by a hierarchy of SDP-representable necessary conditions. This approach concerns a large class of generalized moment problems lasserre2010moments and can be adapted to deal with special instances of (1) catala2017low . It is however restricted to which are combinations of few polynomial moments, and its complexity grows exponentially with the dimension . For , convergence to a global minimizer is only guaranteed asymptotically, similarly to the results of the present paper.
Particle gradient descent.
A third approach, which exploits the differentiability of , consists in discretizing the unknown measure as a mixture of particles parameterized by their positions and weights. This corresponds to the finite-dimensional problem
which can then be solved by classical gradient descent-based algorithms. This method is simple to implement and is widely used for the task of neural network training but, a priori, we may only hope to converge to local minima since is non-convex. Our goal is to show that this method also benefits from the convex structure of (1) and enjoys an asymptotic global optimality guarantee.
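As a rough illustration of particle gradient descent, the sketch below runs plain gradient descent jointly on the weights and positions of a few particles. Everything specific here is an assumption made for the example and not taken from the text: Gaussian-bump features sampled on a grid, a quadratic loss averaged over the sample points, no regularizer, particles initialized with zero weight on a uniform grid, and gradients multiplied by the number of particles (each particle carries mass 1/m).

```python
import math

WIDTH = 0.1  # illustrative width of the feature bump

def phi(theta, xs):
    """Illustrative feature: Gaussian bump centered at theta, sampled at xs."""
    return [math.exp(-((x - theta) ** 2) / (2 * WIDTH ** 2)) for x in xs]

def dphi(theta, xs):
    """Derivative of phi with respect to the position theta."""
    return [p * (x - theta) / WIDTH ** 2 for p, x in zip(phi(theta, xs), xs)]

def particle_gradient_descent(y, xs, weights, positions, lr=0.02, n_steps=500):
    """Gradient descent on F(w, theta) = ||(1/m) sum_i w_i phi(theta_i) - y||^2 / (2n)."""
    m, n = len(weights), len(xs)
    w, th = list(weights), list(positions)
    losses = []
    for _ in range(n_steps + 1):
        feats = [phi(t, xs) for t in th]
        f = [sum(w[i] * feats[i][k] for i in range(m)) / m for k in range(n)]
        r = [fk - yk for fk, yk in zip(f, y)]
        losses.append(sum(rk * rk for rk in r) / (2 * n))
        # Partial derivatives of F, multiplied by m (mass 1/m per particle).
        gw = [sum(p * rk for p, rk in zip(feats[i], r)) / n for i in range(m)]
        gt = [w[i] * sum(d * rk for d, rk in zip(dphi(th[i], xs), r)) / n
              for i in range(m)]
        w = [wi - lr * g for wi, g in zip(w, gw)]
        th = [ti - lr * g for ti, g in zip(th, gt)]
    return w, th, losses

xs = [k / 39 for k in range(40)]
y = [a - 0.5 * b for a, b in zip(phi(0.3, xs), phi(0.7, xs))]
w, th, losses = particle_gradient_descent(y, xs, [0.0] * 8, [i / 7 for i in range(8)])
```

Unlike the fixed-grid approach, the positions move during the descent, which is what makes the finite-dimensional objective non-convex.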
There is a recent literature on global optimality results for (2) in the specific task of training neural networks. It is known that, in this context, has fewer, or no, local minima in an over-parameterization regime and stochastic gradient descent (SGD) finds a global minimizer under restrictive assumptions soudry2017exponentially ; venturi2018neural ; soltanolkotabi2017theoretical ; li2017convergence ; see soltanolkotabi2017theoretical for an account of recent results. Our approach is not directly comparable to these works: it is more abstract and nonquantitative (we study an ideal dynamics that one can only hope to approximate) but also much more generic. Our objective, in the space of measures, has many local minima, but we build gradient flows that avoid them, relying mainly on the homogeneity properties of (see haeffele2017global ; journee2010low for other uses of homogeneity in non-convex optimization). The novelty is to see (2) as a discretization of (1), a point of view also present in nitanda2017stochastic but not yet exploited for global optimality guarantees.
1.2 Organization of the paper and summary of contributions
Our goal is to explain when and why the non-convex particle gradient descent finds global minima. We do so by studying the many-particle limit of the gradient flow of . More specifically:
In Section 3, under assumptions on and the initialization, we prove that if this Wasserstein gradient flow converges, then the limit is a global minimizer of . Under the same conditions, it follows that if are gradient flows for suitably initialized, then
Two different settings that leverage the structure of are treated: the -homogeneous and the partially -homogeneous case. In Section 4, we apply these results to sparse deconvolution and training neural networks with a single hidden layer, with sigmoid or ReLU activation function. In each case, our result prescribes conditions on the initialization pattern.
We perform simple numerical experiments that indicate that this asymptotic regime is already at play for small values of , even for high-dimensional problems. The method behaves incomparably better than simply optimizing on the weights with a very large set of fixed particles.
Our focus on qualitative results might be surprising for an optimization paper, but we believe that this is an insightful first step given the hardness and the generality of the problem. We suggest understanding our result as a first consistency principle for practical and commonly used non-convex optimization methods. While we focus on the idealized setting of a continuous-time gradient flow with exact gradients, this is expected to reflect the behavior of first-order descent algorithms, as they are known to approximate the former: see scieur2017integration for (accelerated) gradient descent and (kushner2003stochastic, Thm. 2.1) for SGD.
Scalar products and norms are denoted by and respectively in , and by and in the Hilbert space . Norms of linear operators are also denoted by . The differential of a function at a point is denoted . We write for the set of finite signed Borel measures on , is a Dirac mass at a point and is the set of probability measures endowed with the Wasserstein distance (see Appendix A).
Recent related work.
Several independent works mei2018mean ; rotskoff2018neural ; sirignano2018mean have studied the many-particle limit of training a neural network with a single large hidden layer and a quadratic loss . Their main focus is on quantifying the convergence of SGD or noisy SGD to the limit trajectory, which is precisely a mean-field limit in this case. Since in our approach this limit is mostly an intermediate step necessary to state our global convergence theorems, it is not studied extensively in its own right. These papers thus provide a solid complement to Section 2.4 (a difference is that we do not assume that is quadratic nor that is differentiable). Also, mei2018mean proves a quantitative global convergence result for noisy SGD to an approximate minimizer: we stress that our results are of a different nature, as they rely on homogeneity and not on the mixing effect of noise.
2 Particle gradient flows and many-particle limit
2.1 Main problem and assumptions
From now on, we consider the following class of problems on the space of non-negative finite measures on a domain which, as explained below, is more general than (1):
and we make the following assumptions.
is a separable Hilbert space, is the closure of a convex open set, and
(smooth loss) is differentiable, with a differential that is Lipschitz on bounded sets and bounded on sublevel sets,
(basic regularity) is (Fréchet) differentiable, is semiconvex (footnote 1: a function is semiconvex, or -convex, if is convex, for some ; on a compact domain, any smooth function is semiconvex), and
(locally Lipschitz derivatives with sublinear growth) there exists a family of nested nonempty closed convex subsets of such that:
for all ,
and are bounded and is Lipschitz on each , and
there exists such that for all , where stands for the maximal norm of an element in .
Assumption 2.1-(iii) reduces to classical local Lipschitzness and growth assumptions on and if the nested sets are the balls of radius , but unbounded sets are also allowed. These sets are a technical tool used later to confine the gradient flows in areas where gradients are well-controlled. By convention, we set if is not concentrated on . Also, the integral is a Bochner integral (cohn1980measure, App. E6). It yields a well-defined value in whenever is measurable and . Otherwise, we also set by convention.
Recovering (1) through lifting.
It is shown in Appendix A.2 that, for a class of admissible regularizers containing the total variation norm, problem (1) admits an equivalent formulation as (3). Indeed, consider the lifted domain , the function and . Then equals and given a minimizer of one of the problems, one can easily build minimizers for the other. This equivalent lifted formulation removes the asymmetry between weight and position—weight becomes just another coordinate of a particle’s position. This is the right point of view for our purpose and this is why is our central object of study in the following.
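To make the lifting concrete in the discrete case, here is a small sketch. It assumes, for illustration only, that a lifted particle is a pair (w, theta) carrying mass 1/m, that the associated signed measure puts mass w/m at theta, and that the lifted regularizer is the average of |w|; the function names are hypothetical.

```python
from collections import defaultdict

def project_to_signed_measure(particles):
    """Map lifted particles (w, theta), each of mass 1/m, to a signed discrete measure."""
    m = len(particles)
    signed = defaultdict(float)
    for w, theta in particles:
        signed[theta] += w / m
    return dict(signed)

def total_variation(signed):
    """Total variation norm of a discrete signed measure {position: mass}."""
    return sum(abs(v) for v in signed.values())

def lifted_regularizer(particles):
    """Average of |w| over the particles (illustrative lifted regularizer)."""
    return sum(abs(w) for w, _ in particles) / len(particles)

particles = [(2.0, 0.1), (-1.0, 0.5), (1.0, 0.1), (0.0, 0.9)]
signed = project_to_signed_measure(particles)      # {0.1: 0.75, 0.5: -0.25, 0.9: 0.0}
cancelling = [(1.0, 0.1), (-1.0, 0.1)]             # opposite weights at the same position
```

By the triangle inequality, the averaged |w| is always at least the total variation of the projected measure, with equality when no two particles at the same position have weights of opposite sign. The two quantities can therefore differ for a given configuration even though the two problems are described above as equivalent.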
The functions and obtained through the lifting share the property of being positively -homogeneous in the variable . A function between vector spaces is said to be positively -homogeneous when for all and argument , it holds . This property is central for our global convergence results (but is not needed throughout Section 2).
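A quick numerical check may help fix ideas about positive homogeneity. The sketch below assumes the usual definition, f(lam * u) = lam**p * f(u) for all lam > 0, and uses a weighted ReLU unit with a fixed input as an illustrative example; neither the helper names nor the specific input come from the text.

```python
def relu(t):
    return max(t, 0.0)

X = (1.0, -2.0)  # a fixed, arbitrary input used only for this illustration

def relu_unit(u):
    """u = (w, theta_1, theta_2); returns w * relu(<theta, X>)."""
    w, theta = u[0], u[1:]
    return w * relu(sum(t * xi for t, xi in zip(theta, X)))

def is_positively_homogeneous(f, u, p, scales=(0.5, 2.0, 3.0), tol=1e-9):
    """Check f(lam * u) == lam**p * f(u) for a few positive scales lam."""
    base = f(u)
    return all(
        abs(f([lam * ui for ui in u]) - lam ** p * base) <= tol
        for lam in scales
    )

u = [1.5, 2.0, 0.5]
two_homogeneous = is_positively_homogeneous(relu_unit, u, 2)   # True
one_homogeneous = is_positively_homogeneous(relu_unit, u, 1)   # False
# Homogeneity of degree 1 in the weight alone, with the position frozen:
in_weight_only = is_positively_homogeneous(
    lambda w: relu_unit([w[0], 2.0, 0.5]), [1.5], 1
)                                                              # True
```

The check only tests a few scales at one point, so it can refute homogeneity but not prove it.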
2.2 Particle gradient flow
We first consider an initial measure which is a mixture of particles—an atomic measure— and define the initial object in our construction: the particle gradient flow. For a number of particles, and a vector of positions, this is the gradient flow of
or, more precisely, its subgradient flow because can be non-smooth. We recall that a subgradient of a (possibly non-convex) function at a point is a satisfying for all . The set of subgradients at is a closed convex set called the subdifferential of at denoted rockafellar97 .
Definition 2.2 (Particle gradient flow).
A gradient flow for the functional is an absolutely continuous (footnote 2: an absolutely continuous function is almost everywhere differentiable and satisfies for all ) path which satisfies for almost every .
This definition uses a subgradient scaled by , which is the subgradient relative to the scalar product on scaled by : this normalization amounts to assigning a mass to each particle and is convenient for taking the many-particle limit . We now state basic properties of this object.
For any initialization , there exists a unique gradient flow for . Moreover, for almost every , it holds and the velocity of the -th particle is given by , where for and ,
The expression of the velocity involves a projection because gradient flows select subgradients of minimal norm santambrogio2015optimal . We have denoted by the gradient of at and by the differential applied to the -th vector of the canonical basis of . Note that is (minus) the gradient of the first term in (4): when is differentiable, we have and we recover the classical gradient of (4). When is non-smooth, this gradient flow can be understood as a continuous-time version of the forward-backward minimization algorithm combettes2011proximal .
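For readers unfamiliar with the forward-backward algorithm, one discrete step can be sketched as follows. The sketch assumes, purely for illustration, a non-smooth term of the form lam * |w| on each weight, whose proximal operator is soft-thresholding; the function names and the parameters `lr` and `lam` are hypothetical.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.| evaluated at v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def forward_backward_step(w, grad_smooth, lr, lam):
    """One forward-backward step for smooth(w) + lam * sum_i |w_i|.

    Forward: explicit gradient step on the smooth part.
    Backward: proximal step on the non-smooth part.
    """
    return [soft_threshold(wi - lr * gi, lr * lam) for wi, gi in zip(w, grad_smooth)]

# With a zero smooth gradient, the step only shrinks the weights toward 0.
shrunk = forward_backward_step([1.0, -0.2, 0.05], [0.0, 0.0, 0.0], lr=0.5, lam=0.2)
```

Weights whose magnitude falls below the threshold are set exactly to zero, which is the discrete-time trace of the projection appearing in the velocity.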
2.3 Wasserstein gradient flow
The fact that the velocity of each particle can be expressed as the evaluation of a velocity field (Eq. (5)) makes it easy, at least formally, to generalize the particle gradient flow to arbitrary measure-valued initializations, not just atomic ones. On the one hand, the evolution of a time-dependent measure under the action of instantaneous velocity fields can be formalized by a conservation of mass equation, known as the continuity equation, that reads where is the divergence operator (see Appendix B; footnote 3: for a smooth vector field , its divergence is given by ). On the other hand, there is a direct link between the velocity field (5) and the functional . The differential of evaluated at is represented by the function defined as
Thus is simply a field of (minus) subgradients of ; it is in fact the field of minimal norm subgradients. We write this relation . The set is called the Wasserstein subdifferential of , as it can be interpreted as the subdifferential of relative to the Wasserstein metric on (see Appendix B.2.1). We thus expect that for initializations with arbitrary probability distributions, the generalization of the gradient flow coincides with the following object.
Definition 2.4 (Wasserstein gradient flow).
A Wasserstein gradient flow for the functional on a time interval is an absolutely continuous path in that satisfies, distributionally on ,
This is a proper generalization of Definition 2.2 since, whenever is a particle gradient flow for , then is a Wasserstein gradient flow for in the sense of Definition 2.4 (see Proposition B.1). By leveraging the abstract theory of gradient flows developed in ambrosio2008gradient , we show in Appendix B.2.1 that these Wasserstein gradient flows are well-defined.
Proposition 2.5 (Existence and uniqueness).
Note that the condition on the initialization is automatically satisfied in Proposition 2.3 because there the initial measure has a finite discrete support: it is thus contained in any for large enough.
2.4 Many-particle limit
We now characterize the many-particle limit of classical gradient flows, under Assumptions 2.1.
Theorem 2.6 (Many-particle limit).
Consider a sequence of classical gradient flows for initialized in a set . If converges to some for the Wasserstein distance , then converges, as , to the unique Wasserstein gradient flow of starting from .
Given a measure , an example for the sequence is where are independent samples distributed according to . By the law of large numbers for empirical distributions, the sequence of empirical distributions converges (almost surely, for ) to . In particular, our proof of Theorem 2.6 gives an alternative proof of the existence claim in Proposition 2.5 (the latter remains necessary for the uniqueness of the limit).
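The convergence of empirical distributions can be observed numerically in one dimension, where the Wasserstein distance between two empirical measures with equally many, equally weighted atoms is obtained by matching sorted samples. The sketch below assumes the order-2 distance and a uniform target on [0, 1], approximated by its mid-quantile grid; these choices and the helper names are illustrative.

```python
import random

def wasserstein2_1d(xs, ys):
    """Order-2 Wasserstein distance between two equally weighted empirical measures on R."""
    if len(xs) != len(ys):
        raise ValueError("both samples must have the same size")
    n = len(xs)
    # On the real line, the monotone (sorted) matching is an optimal coupling.
    return (sum((a - b) ** 2 for a, b in zip(sorted(xs), sorted(ys))) / n) ** 0.5

def empirical_gap(n, seed=0):
    """Distance between n uniform samples and the mid-quantile grid of Uniform[0, 1]."""
    rng = random.Random(seed)
    samples = [rng.random() for _ in range(n)]
    grid = [(i + 0.5) / n for i in range(n)]
    return wasserstein2_1d(samples, grid)

shift = wasserstein2_1d([0.0, 1.0, 2.0], [3.0, 4.0, 5.0])  # translation by 3 gives 3.0
gap = empirical_gap(2000)
```

Calling `empirical_gap` with increasing `n` gives a rough picture of how quickly the empirical measure approaches its target in this simple setting.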
3 Convergence to global minimizers
3.1 General idea
As can be seen from Definition 2.4, a probability measure is a stationary point of a Wasserstein gradient flow if and only if . It is proved in nitanda2017stochastic that these stationary points are, in some cases, optimal over probabilities that have a smaller support. However, they are not in general global minimizers of over , even when is convex. Such global minimizers are indeed characterized as follows.
Proposition 3.1 (Minimizers).
Assume that is convex. A measure such that minimizes on iff and for -a.e. .
Despite these strong differences between stationarity and global optimality, we show in this section that Wasserstein gradient flows converge to global minimizers, under two main conditions:
On the structure: and must share a homogeneity direction (see Section 2.1 for the definition of homogeneity), and
On the initialization: the support of the initialization of the Wasserstein gradient flow satisfies a “separation” property. This property is preserved throughout the dynamics and, combined with homogeneity, allows the flow to escape from neighborhoods of non-optimal points.
We turn these general ideas into concrete statements for two cases of interest, that exhibit different structures and behaviors: (i) when and are positively -homogeneous and (ii) when and are positively -homogeneous with respect to one variable.
3.2 The -homogeneous case
In the -homogeneous case a rich structure emerges, where the -dimensional sphere plays a special role. This covers the case of lifted problems of Section 2.1 when is -homogeneous and neural networks with ReLU activation functions.
The domain is with and is differentiable with locally Lipschitz, is semiconvex and and are both positively -homogeneous. Moreover,
(smooth convex loss) The loss is convex, differentiable with differential Lipschitz on bounded sets and bounded on sublevel sets,
(Sard-type regularity) For all , the set of regular values (footnote 4: for a function , a regular value is a real number in the range of such that is included in an open set where is differentiable and where does not vanish) of is dense in its range (it is in fact sufficient that this holds for functions which are of the form for some ).
Taking the balls of radius as the family , these assumptions imply Assumptions 2.1. We believe that Assumption 3.2-(4) is not of practical importance: it is only used to avoid some pathological cases in the proof of Theorem 3.3. By applying Morse-Sard’s lemma abraham1967transversal , it is in any case fulfilled if the function in question is times continuously differentiable. We now state our first global convergence result. It involves a condition on the initialization, a separation property, that can only be satisfied in the many-particle limit. In an ambient space , we say that a set separates the sets and if any continuous path in with endpoints in and intersects .
Under Assumptions 3.2, let be a Wasserstein gradient flow of such that, for some , the support of is contained in and separates the spheres and . If converges to in , then is a global minimizer of over . In particular, if is a sequence of classical gradient flows initialized in such that converges weakly to then (limits can be interchanged)
A proof and stronger statements are presented in Appendix C. There, we give a criterion for Wasserstein gradient flows to escape neighborhoods of non-optimal measures—also valid in the finite-particle setting—and then show that it is always satisfied by the flow defined above. We also weaken the assumption that converges: we only need a certain projection of to converge weakly. Finally, the fact that limits in and can be interchanged is not anecdotal: it shows that the convergence is not conditioned on a relative speed of growth of both parameters.
This result might be easier to understand by drawing an informal distinction between (i) the structural assumptions which are instrumental and (ii) the technical conditions which have a limited practical interest. The initialization and the homogeneity assumptions are of the first kind. The Sard-type regularity is in contrast a purely technical condition: it is generally hard to check and known counter-examples involve artificial constructions such as the Cantor function whitney1935function . Similarly, when there is compactness, a gradient flow that does not converge is an unexpected (in some sense adversarial) behavior, see a counter-example in absil2005convergence . We were however not able to exclude this possibility under interesting assumptions (see a discussion in Appendix C.5).
3.3 The partially -homogeneous case
Similar results hold in the partially -homogeneous setting, which covers the lifted problems of Section 2.1 when is bounded (e.g., sparse deconvolution and neural networks with sigmoid activation).
The domain is with , and where and are bounded, differentiable with Lipschitz differential. Moreover,
(smooth convex loss) The loss is convex, differentiable with differential Lipschitz on bounded sets and bounded on sublevel sets,
(Sard-type regularity) For all , the set of regular values of is dense in its range, and
(boundary conditions) The function behaves nicely at the boundary of the domain: either
and for all , converges, uniformly in as , to a function satisfying the Sard-type regularity, or
is the closure of a bounded open convex set and for all , satisfies Neumann boundary conditions (i.e., for all , where is the normal to at ).
With the family of nested sets , , these assumptions imply Assumptions 2.1. The following theorem mirrors the statement of Theorem 3.3, but with a different condition on the initialization. The remarks after Theorem 3.3 also apply here.
Under Assumptions 3.4, let be a Wasserstein gradient flow of such that for some , the support of is contained in and separates from . If converges to in , then is a global minimizer of over . In particular, if is a sequence of classical gradient flows initialized in such that converges to in then (limits can be interchanged)
4 Case studies and numerical illustrations
In this section, we apply the previous abstract statements to specific examples and show on synthetic experiments that the particle-complexity to reach global optimality is very favorable.
4.1 Sparse deconvolution
For sparse deconvolution, it is typical to consider a signal on the -torus . The loss function is for some , a parameter that increases with the noise level, and the regularization is . Consider a filter impulse response and let . The object sought after is a signed measure on , which is obtained from a probability measure on by applying an operator defined by for all measurable . We show in Appendix D that Theorem 3.5 applies.
Proposition 4.1 (Sparse deconvolution).
Assume that the filter impulse response is times continuously differentiable, and that the support of contains . If the projection of the Wasserstein gradient flow of weakly converges to , then is a global minimizer of
We show an example of such a reconstruction on the -torus in Figure 1, where the ground truth consists of weighted spikes, is an ideal low pass filter (a Dirichlet kernel of order ) and is a noisy observation of the filtered spikes. The particle gradient flow is integrated with the forward-backward algorithm combettes2011proximal and the particles are initialized on a uniform grid on .
4.2 Neural networks with a single hidden layer
We consider a joint distribution of features and labels and the marginal distribution of features. The loss is the expected risk defined on , where is either the squared loss or the logistic loss. Also, we set for an activation function . Depending on the choice of , we face two different situations.
If is a sigmoid, say , then Theorem 3.5, with domain applies. The natural (optional) regularization term is , which amounts to penalizing the norm of the weights.
Proposition 4.2 (Sigmoid activation).
Assume that has finite moments up to order , that the support of is and that boundary condition 3.4-(iii)-(a) holds. If the Wasserstein gradient flow of converges in to , then is a global minimizer of .
The activation function is positively -homogeneous: this makes -homogeneous and corresponds, at a formal level, to the setting of Theorem 3.3. An admissible choice of regularizer here would be the (semi-convex) function bach2017breaking . However, as shown in Appendix D.4, the differential has discontinuities: this altogether prevents us from defining gradient flows, even in the finite-particle regime.
Still, a statement holds for a different parameterization of the same class of functions, which makes differentiable. To see this, consider a domain which is the disjoint union of copies of . On the first copy, define where is the signed square function. On the second copy, has the same definition but with a minus sign. This trick yields the same expressive power as classical ReLU networks. In practice, it corresponds to simply putting, say, random signs in front of the activation. The regularizer here can be .
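The reparameterization can be illustrated with a short sketch. Since the exact formula is elided in the text above, the form used here is an assumption: the signed square is taken to be s(t) = t|t|, applied coordinate-wise to the parameters before the ReLU, and the sign in front of the activation selects one of the two copies of the domain. All names are hypothetical.

```python
import math

def signed_square(t):
    """Assumed signed square s(t) = t * |t|; differentiable, with derivative 2 * |t|."""
    return t * abs(t)

def signed_sqrt(v):
    """Inverse of the signed square: signed_square(signed_sqrt(v)) == v."""
    return math.copysign(math.sqrt(abs(v)), v)

def unit(theta, x, sign=1.0):
    """sign * relu(sum_j s(theta_j) * x_j); sign = +1 or -1 picks the copy of the domain."""
    pre = sum(signed_square(t) * xi for t, xi in zip(theta, x))
    return sign * max(pre, 0.0)

x = [3.0, 0.5]
theta = [1.0, -2.0]
value = unit(theta, x)                             # relu(1 * 3 - 4 * 0.5) = 1.0
scaled = unit([2.0 * t for t in theta], x)         # 4.0: degree-2 homogeneity
# Any classical ReLU direction v is reached by theta_j = signed_sqrt(v_j).
v = [-2.0, 0.3]
recovered = [signed_square(signed_sqrt(vj)) for vj in v]
```

Because the signed square is a bijection of the real line, every direction of a classical ReLU unit is reachable, while the fixed sign in front replaces the trainable output weight.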
Proposition 4.3 (ReLU activation).
We display in Figure 2 particle gradient flows for training a neural network with a single hidden layer and ReLU activation in the classical (non-differentiable) parameterization, with (no regularization). Features are normally distributed, and the ground truth labels are generated with a similar network with neurons. The particle gradient flow is “integrated” with mini-batch SGD and the particles are initialized on a small centered sphere.
4.3 Empirical particle-complexity
Since our convergence results are non-quantitative, one might argue that similar (and much simpler to prove) asymptotic results hold for the method of distributing particles on the whole of and simply optimizing on the weights, which is a convex problem. Yet, the comparison of the particle-complexity shown in Figure 3 stands strongly in favor of particle gradient flows. While exponential particle-complexity is unavoidable for the convex approach, we observed on several synthetic problems that particle gradient descent only needs a slight over-parameterization to find global minimizers within optimization error (see details in Appendix D.5).
We have established asymptotic global optimality properties for a family of non-convex gradient flows. These results were enabled by the study of a Wasserstein gradient flow: this object simplifies the handling of many-particle regimes, analogously to a mean-field limit. The particle-complexity to reach global optimality turns out to be very favorable on synthetic numerical problems. This confirms the relevance of our qualitative results and calls for quantitative ones that would further exploit the properties of such particle gradient flows. Multiple-layer neural networks are also an interesting avenue for future research.
We acknowledge support from grants from Région Ile-de-France and the European Research Council (grant SEQUOIA 724063).
- (1) Ralph Abraham and Joel Robbin. Transversal mappings and flows. WA Benjamin New York, 1967.
- (2) Pierre-Antoine Absil, Robert Mahony, and Benjamin Andrews. Convergence of the iterates of descent methods for analytic cost functions. SIAM Journal on Optimization, 16(2):531–547, 2005.
- (3) Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
- (4) Francis Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19):1–53, 2017.
- (5) Adrien Blanchet and Jérôme Bolte. A family of functional inequalities: Łojasiewicz inequalities and displacement convex functions. arXiv preprint arXiv:1612.02619, 2016.
- (6) Nicholas Boyd, Geoffrey Schiebinger, and Benjamin Recht. The alternating descent conditional gradient method for sparse inverse problems. SIAM Journal on Optimization, 27(2):616–639, 2017.
- (7) Kristian Bredies and Hanna Katriina Pikkarainen. Inverse problems in spaces of measures. ESAIM: Control, Optimisation and Calculus of Variations, 19(1):190–218, 2013.
- (8) Felix E. Browder. Fixed point theory and nonlinear problems. Proc. Sym. Pure. Math, 39:49–88, 1983.
- (9) Paul Catala, Vincent Duval, and Gabriel Peyré. A low-rank approach to off-the-grid sparse deconvolution. Journal of Physics: Conference Series, 904(1):012015, 2017.
- (10) Donald L. Cohn. Measure theory, volume 165. Springer, 1980.
- (11) Patrick L. Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. In Fixed-point algorithms for inverse problems in science and engineering, pages 185–212. Springer, 2011.
- (12) Yohann De Castro and Fabrice Gamboa. Exact reconstruction using Beurling minimal extrapolation. Journal of Mathematical Analysis and applications, 395(1):336–354, 2012.
- (13) Vincent Duval and Gabriel Peyré. Exact support recovery for sparse spikes deconvolution. Foundations of Computational Mathematics, 15(5):1315–1355, 2015.
- (14) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
- (15) Suriya Gunasekar, Blake E. Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems 30, 2017.
- (16) Benjamin D. Haeffele and René Vidal. Global optimality in neural network training. In , pages 7331–7339, 2017.
- (17) Daniel Hauer and José Mazón. Kurdyka-Łojasiewicz-Simon inequality for gradient flows in metric spaces. arXiv preprint arXiv:1707.03129, 2017.
- (18) Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 1994.
- (19) Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the International Conference on Machine Learning (ICML), 2013.
- (20) Michel Journée, Francis Bach, P-A Absil, and Rodolphe Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 20(5):2327–2351, 2010.
- (21) Harold Kushner and G. George Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003.
- (22) Jean-Bernard Lasserre. Moments, positive polynomials and their applications, volume 1. World Scientific, 2010.
- (23) Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems, pages 597–607, 2017.
- (24) Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
- (25) Atsushi Nitanda and Taiji Suzuki. Stochastic particle gradient descent for infinite ensembles. arXiv preprint arXiv:1712.05438, 2017.
- (26) Clarice Poon, Nicolas Keriven, and Gabriel Peyré. A dual certificates analysis of compressive off-the-grid recovery. arXiv preprint arXiv:1802.08464, 2018.
- (27) Ralph T. Rockafellar. Convex Analysis. Princeton University Press, 1997.
- (28) Grant M Rotskoff and Eric Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv preprint arXiv:1805.00915, 2018.
- (29) Filippo Santambrogio. Optimal transport for applied mathematicians. Birkhäuser, NY, 2015.
- (30) Filippo Santambrogio. Euclidean, metric, and Wasserstein gradient flows: an overview. Bulletin of Mathematical Sciences, 7(1):87–154, 2017.
- (31) Damien Scieur, Vincent Roulet, Francis Bach, and Alexandre d’Aspremont. Integration methods and optimization algorithms. In Advances in Neural Information Processing Systems, pages 1109–1118, 2017.
- (32) Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks. arXiv preprint arXiv:1805.01053, 2018.
- (33) Mahdi Soltanolkotabi, Adel Javanmard, and Jason D. Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.
- (34) Daniel Soudry and Elad Hoffer. Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.
- (35) Luca Venturi, Afonso Bandeira, and Joan Bruna. Neural networks with finite intrinsic dimension have no spurious valleys. arXiv preprint arXiv:1802.06384, 2018.
- (36) Chu Wang, Yingfei Wang, Robert Schapire, et al. Functional Frank-Wolfe boosting for general loss functions. arXiv preprint arXiv:1510.02558, 2015.
- (37) Hassler Whitney et al. A function not constant on a connected set of critical points. Duke Mathematical Journal, 1(4):514–517, 1935.
Supplementary material for the paper: “On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport” authored by Lénaïc Chizat and Francis Bach (NIPS 2018).
Appendix A Introductory facts
A.1 Tools from measure theory
In this paper, the term measure refers to a finite signed measure on $\mathbb{R}^d$, $d \geq 1$, endowed with its Borel $\sigma$-algebra. We write $\mathcal{M}(X)$ for the set of such measures concentrated on a measurable set $X \subset \mathbb{R}^d$. Hereafter, we gather some concepts and facts from measure theory that are used in the proofs.
Variation of a signed measure.
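The Jordan decomposition theorem [10, Cor. 4.1.6] asserts that any finite signed measure $\mu \in \mathcal{M}(\mathbb{R}^d)$ can be decomposed as $\mu = \mu_+ - \mu_-$ where $\mu_+, \mu_- \in \mathcal{M}_+(\mathbb{R}^d)$ are nonnegative measures. If $\mu_+$ and $\mu_-$ are chosen with minimal total mass, the variation of $\mu$ is the nonnegative measure $|\mu| := \mu_+ + \mu_-$ and $|\mu|(\mathbb{R}^d)$ is the total variation norm of $\mu$.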
Support and concentration set.
The support $\operatorname{spt} \mu$ of a measure $\mu \in \mathcal{M}_+(\mathbb{R}^d)$ is the complement of the largest open set of measure $0$, or, equivalently, the set of points whose neighborhoods all have positive measure. We say that $\mu$ is concentrated on a set $S \subset \mathbb{R}^d$ if the complement of $S$ is included in a measurable set of measure $0$. In particular, $\mu$ is concentrated on $\operatorname{spt} \mu$.
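Let $X$ and $Y$ be measurable subsets of $\mathbb{R}^d$ and let $T : X \to Y$ be a measurable map. To any measure $\mu \in \mathcal{M}(X)$ corresponds a measure $T_\# \mu \in \mathcal{M}(Y)$ called the pushforward of $\mu$ by $T$. It is defined as $T_\# \mu(B) = \mu(T^{-1}(B))$ for every measurable set $B \subset Y$ and corresponds to the distribution of the “mass” of $\mu$ after it has been displaced by the map $T$. It satisfies $\int_Y \varphi \, d(T_\# \mu) = \int_X \varphi \circ T \, d\mu$ whenever $\varphi : Y \to \mathbb{R}$ is a measurable function such that $\varphi \circ T$ is $\mu$-integrable [10, Prop. 2.6.8]. In particular, with a projection map $\pi^i : (x_1, x_2, \dots) \mapsto x_i$, the pushforward $\pi^i_\# \mu$ is the marginal of $\mu$ on the $i$-th factor.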
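For a finitely supported measure, the pushforward simply moves each atom while keeping its (signed) weight. The sketch below is only an illustration: the dictionary representation and the helper names `pushforward` and `integrate` are assumptions made for this example and do not appear in the paper. It checks the change-of-variables identity and the marginal property on a small signed measure.

```python
from collections import defaultdict

def pushforward(mu, T):
    """Pushforward of a discrete signed measure mu = {point: weight} by the map T."""
    out = defaultdict(float)
    for x, w in mu.items():
        out[T(x)] += w  # atoms sent to the same image point add their weights
    return dict(out)

def integrate(phi, mu):
    """Integral of phi against the discrete measure mu."""
    return sum(w * phi(x) for x, w in mu.items())

# A signed measure on R^2 with three atoms (hypothetical data).
mu = {(0.0, 1.0): 0.5, (0.0, 2.0): -0.25, (1.0, 1.0): 1.0}

# Projection onto the first factor: its pushforward is the first marginal.
proj1 = lambda x: x[0]
marginal = pushforward(mu, proj1)

# Change of variables: the integral of phi against T#mu equals
# the integral of (phi o T) against mu.
phi = lambda y: y * y + 1.0
lhs = integrate(phi, marginal)
rhs = integrate(lambda x: phi(proj1(x)), mu)
```

Here the two atoms with first coordinate 0 are merged by the projection, so the marginal carries the sum of their signed weights at that point.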
Weak convergence and Bounded Lipschitz norm.
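We say that a sequence of measures $\mu_n \in \mathcal{M}(\mathbb{R}^d)$ weakly (or narrowly) converges to $\mu$ if, for every continuous and bounded function $\varphi : \mathbb{R}^d \to \mathbb{R}$, it holds $\int \varphi \, d\mu_n \to \int \varphi \, d\mu$. For sequences which are bounded in total variation norm, this is equivalent to the convergence in Bounded Lipschitz norm. The latter is defined, for $\mu \in \mathcal{M}(\mathbb{R}^d)$, as
$$\|\mu\|_{\mathrm{BL}} := \sup \left\{ \int \varphi \, d\mu \;;\; \varphi : \mathbb{R}^d \to \mathbb{R},\ \mathrm{Lip}(\varphi) \leq 1,\ \|\varphi\|_\infty \leq 1 \right\}$$
where $\mathrm{Lip}(\varphi)$ is the smallest Lipschitz constant of $\varphi$ and $\|\cdot\|_\infty$ the supremum norm.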
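The $2$-Wasserstein distance between two probability measures $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$ is defined as
$$W_2(\mu, \nu) := \left( \min_{\gamma} \int |y - x|^2 \, d\gamma(x, y) \right)^{1/2}$$
where the minimization is over the set of probability measures $\gamma \in \mathcal{P}(\mathbb{R}^d \times \mathbb{R}^d)$ such that the marginal on the first factor is $\mu$ and is $\nu$ on the second factor. The set of probability measures with finite second moments endowed with the metric $W_2$ is a complete metric space that we denote $\mathcal{P}_2(\mathbb{R}^d)$. A sequence $(\mu_n)_n$ converges to $\mu$ in $\mathcal{P}_2(\mathbb{R}^d)$ iff for every continuous function $\varphi$ with at most quadratic growth it holds $\int \varphi \, d\mu_n \to \int \varphi \, d\mu$ [3, Prop. 7.1.5] (this is stronger than weak convergence). Using, respectively, the duality formula for $W_1$ [29, Eq. (3.1)] and Jensen’s inequality, it holds
$$\|\mu - \nu\|_{\mathrm{BL}} \leq W_1(\mu, \nu) \leq W_2(\mu, \nu).$$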
Note that the functional of interest in this article is continuous for the Wasserstein metric. This strong regularity is rather rare in the study of Wasserstein gradient flows.
Lemma A.1 (Wasserstein continuity of $F$).
Under Assumptions 2.1, the function $F$ is continuous for the Wasserstein metric $W_2$.
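To make the Wasserstein distance used above concrete, the following sketch treats the simplest case: two empirical measures on the real line with the same number of equally weighted atoms, for which the optimal coupling is known to be the monotone (sorted) matching. The function name `wasserstein_1d` and the sample points are assumptions made for this illustration. The example also checks the comparison $W_1 \leq W_2$, which follows from Jensen's inequality.

```python
import math

def wasserstein_1d(xs, ys, p=2):
    """W_p between uniform empirical measures on R with equally many atoms.

    In one dimension the optimal coupling matches sorted samples.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes the same number of atoms")
    n = len(xs)
    cost = sum(abs(x - y) ** p for x, y in zip(sorted(xs), sorted(ys))) / n
    return cost ** (1.0 / p)

# Hypothetical sample points.
xs = [0.0, 1.0, 3.0]
ys = [2.0, 0.5, 5.0]

w1 = wasserstein_1d(xs, ys, p=1)
w2 = wasserstein_1d(xs, ys, p=2)

# Translating a measure by c moves it at W_2-distance |c|.
shifted = [x + 2.0 for x in xs]
w2_shift = wasserstein_1d(xs, shifted, p=2)
```

In higher dimension no such closed form is available and the minimization over couplings has to be solved numerically.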
A.2 Lifting to the space of probability measures
Let us give technical details about the lifting introduced in Section 2.1 that allows one to pass from a problem on the space of signed measures on $\Theta$ (the minimization of $J$ defined in (1)) to an equivalent problem on the space of probability measures on a bigger space $\Omega$ (the minimization of $F$ defined in (3)).
We recall that a function $f$ from $\mathbb{R}^d$ to a vector space is said to be positively $p$-homogeneous, with $p \geq 0$, if for all $u \in \mathbb{R}^d$ and $\lambda > 0$ it holds $f(\lambda u) = \lambda^p f(u)$. We often use without explicit mention the properties related to homogeneity such as the fact that the (sub)-derivative of a positively $p$-homogeneous function is positively $(p-1)$-homogeneous and, for $f$ differentiable (except possibly at $0$), the identity $u \cdot \nabla f(u) = p f(u)$ for $u \neq 0$.
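The homogeneity properties recalled above can be checked numerically. The sketch below uses a hypothetical positively $2$-homogeneous function on $\mathbb{R}^2$ (chosen for this illustration, not taken from the paper) and a finite-difference gradient to verify both the scaling relation and Euler's identity $u \cdot \nabla f(u) = p f(u)$ at a point where $f$ is differentiable.

```python
def f(u):
    """A positively 2-homogeneous function on R^2: f(lam * u) = lam**2 * f(u) for lam > 0."""
    x, y = u
    return max(x, 0.0) ** 2 + abs(x * y)

def grad(fun, u, eps=1e-6):
    """Central finite-difference gradient (meaningful where fun is differentiable)."""
    g = []
    for i in range(len(u)):
        up = list(u)
        up[i] += eps
        dn = list(u)
        dn[i] -= eps
        g.append((fun(up) - fun(dn)) / (2 * eps))
    return g

p = 2
u = [1.5, -0.7]   # a point away from the kinks of f
lam = 3.0

# Scaling relation f(lam * u) = lam**p * f(u).
homogeneity_gap = abs(f([lam * c for c in u]) - lam ** p * f(u))

# Euler's identity u . grad f(u) = p * f(u).
euler_gap = abs(sum(c * g for c, g in zip(u, grad(f, u))) - p * f(u))
```

Both gaps vanish up to rounding and finite-difference error. Note that this `f` is not positively homogeneous under negative scalings, which is why only $\lambda > 0$ is considered.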
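A.2.1 The partially $1$-homogeneous case
We take $\Omega = \mathbb{R} \times \Theta$, and $\Phi(w, \theta) = w \cdot \phi(\theta)$ and $V(w, \theta) = |w| \tilde{V}(\theta)$ for some continuous functions $\phi : \Theta \to \mathcal{F}$ and $\tilde{V} : \Theta \to \mathbb{R}_+$. This setting covers the lifted problems mentioned in Section 2.1. We first show that $F$ can be minimized indifferently over $\mathcal{M}_+(\Omega)$ or over $\mathcal{P}(\Omega)$, thanks to the homogeneity of $\Phi$ and $V$ in the variable $w$.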
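For all $\mu \in \mathcal{M}_+(\Omega)$, there is $\nu \in \mathcal{P}(\Omega)$ such that $F(\mu) = F(\nu)$.
If $|\mu|(\Omega) = 0$ then $F(\mu) = F(\delta_{(0, \theta_0)})$ where $\theta_0$ is any point in $\Theta$. Otherwise, we define the map $T : (w, \theta) \mapsto (|\mu|(\Omega) \cdot w, \theta)$ and the probability measure $\nu := T_\#(\mu / |\mu|(\Omega))$, which satisfies $F(\nu) = F(\mu)$. ∎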
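The rescaling argument in this proof can be illustrated on a discrete measure. The sketch below is written under the assumption that the integrands are $1$-homogeneous in the weight variable, that is, of the form $\Phi(w, \theta) = w\,\phi(\theta)$ and $V(w, \theta) = |w|\,\tilde{V}(\theta)$; the concrete choices of $\phi$ and $\tilde{V}$, the atom representation `(mass, w, theta)`, and the helper names are hypothetical. Dividing the masses by the total mass while multiplying the weights by it yields a probability measure with the same integrals, hence the same objective value.

```python
import math

def integral(atoms, g):
    """Integral of g(w, theta) against the discrete measure sum_i m_i * delta_(w_i, theta_i)."""
    return sum(m * g(w, theta) for m, w, theta in atoms)

def rescale_to_probability(atoms):
    """Divide the masses by the total mass and multiply the weights w by it."""
    total = sum(m for m, _, _ in atoms)
    if total == 0.0:
        raise ValueError("zero measure: use a Dirac mass at w = 0 instead")
    return [(m / total, total * w, theta) for m, w, theta in atoms]

# Hypothetical integrands, 1-homogeneous in w (illustration only).
Phi = lambda w, theta: w * math.cos(theta)
V = lambda w, theta: abs(w) * (1.0 + theta ** 2)

# A nonnegative measure with total mass 4 (hypothetical data).
mu = [(0.5, 1.0, 0.0), (2.0, -0.3, 1.0), (1.5, 0.8, 2.0)]
nu = rescale_to_probability(mu)
```

Since the objective depends on the measure only through these two integrals, matching them is enough to match the objective under the stated assumption.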
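We now introduce a projection operator $h^1 : \mathcal{M}_+(\Omega) \to \mathcal{M}(\Theta)$ that is adapted to the partial homogeneity of $\Phi$ and $V$. It is defined by $h^1(\mu)(B) = \int_{\mathbb{R}} w \, \mu(dw, B)$ for all $\mu \in \mathcal{M}_+(\Omega)$ and measurable set $B \subset \Theta$ or, equivalently, by the property that for every continuous and bounded test function $\varphi : \Theta \to \mathbb{R}$,
$$\int_\Theta \varphi(\theta) \, d h^1(\mu)(\theta) = \int_\Omega w \, \varphi(\theta) \, d\mu(w, \theta).$$
This operator is well defined whenever $(w, \theta) \mapsto |w|$ is $\mu$-integrable.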
Proposition A.3 (Equivalence under lifting).
It holds . For a regularizer on of the form , it holds . If the infimum defining is attained and if minimizes , then there exists that minimizes over .
A signed measure can be expressed as where and (take for instance the normalized variation of if ). The measure
belongs to and satisfies . This proves that is surjective. It is clear by the definition of that for all , it holds hence , with equality when is the minimizer in the definition of . ∎
The class of regularizers considered in Proposition A.3 includes the total variation norm.
Proposition A.4 (Total variation).
Let . For , it holds with equality if, for instance, is a lift of of the form (8).
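The comparison between the total variation of a projected measure and the weighted mass of its lift can be illustrated on discrete measures. The sketch below is written under assumptions that are not spelled out in the surviving text: the projection is taken to sum `mass * w` over atoms sharing the same parameter, and the lift of a signed measure is built from its normalized variation with a density of constant magnitude. All helper names and data are hypothetical.

```python
from collections import defaultdict

def project(atoms):
    """Assumed form of the projection: sum mass * w over atoms sharing the same theta."""
    nu = defaultdict(float)
    for m, w, theta in atoms:
        nu[theta] += m * w
    return dict(nu)

def lift(nu):
    """Lift nu = {theta: a} with sigma = |nu| / |nu|(Theta) and density |nu|(Theta) * sign(a)."""
    total = sum(abs(a) for a in nu.values())
    return [(abs(a) / total, total if a > 0 else -total, theta)
            for theta, a in nu.items() if a != 0.0]

def total_variation(nu):
    """Total variation norm of a discrete signed measure."""
    return sum(abs(a) for a in nu.values())

def weighted_mass(atoms):
    """Integral of |w| against the measure on the lifted space."""
    return sum(m * abs(w) for m, w, _ in atoms)

nu = {0.0: 1.5, 1.0: -0.5}
mu_lift = lift(nu)                                # probability measure, no cancellation
mu_cancel = [(0.5, 2.0, 0.0), (0.5, -2.0, 0.0)]   # opposite weights at the same theta
```

For `mu_lift` the projection recovers `nu` and the two quantities coincide. For `mu_cancel` the weights cancel after projection, so the total variation of the projection is zero while the weighted mass is not, which shows that the inequality can be strict.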
A.2.2 The $2$-homogeneous case
Another structure that is studied in this paper is when and are defined on and are positively -homogeneous. In this case, the role played by in the previous section is played by the unit sphere of . We could again make links between (defined as in Eq. (3)) and a functional on nonnegative measures on the sphere (playing the role of ) but here we will limit ourselves to defining the projection operator relevant in this setting. It is characterized by the relationship, for every continuous and bounded function (with the convention ):
This operator is well-defined iff has finite second order moments.
Appendix B Many-particle limit and Wasserstein gradient flow
B.1 Proof of Proposition 2.3
As the sum of a continuously differentiable and a semiconvex function, is locally semiconvex and the existence of a unique gradient flow on a maximal interval with the claimed properties is standard, see [30, Sec. 2.1]. Now, a general property of gradient flows is that for a.e , the derivative is (minus) the subgradient of minimal norm. This leads to the explicit formula involving the velocity field with pointwise minimal norm:
In the specific case of gradient flows of lower bounded functions, we can derive estimates that imply that (even if is not globally semiconvex). Indeed, for all , it holds
by Jensen’s inequality. Since is lower bounded, this proves that the gradient flow has bounded length on bounded time intervals. By compactness, if were finite then would exist, thus contradicting the maximality of , hence and the gradient flow is globally defined.
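The bounded-length argument has a simple discrete-time counterpart that can be checked numerically. The sketch below replaces the gradient flow by gradient descent on a hypothetical smooth quadratic objective (not the functional of the paper). For a step size $\eta \leq 1/L$, where $L$ is the Lipschitz constant of the gradient, each step decreases the objective by at least $|u_{k+1} - u_k|^2 / (2\eta)$, so the Cauchy–Schwarz inequality bounds the path length after $N$ steps by $\sqrt{2 N \eta \, (F(u_0) - F(u_N))}$. The factor $2$ is specific to this discretization.

```python
import math

def F(u):
    """Hypothetical smooth objective with gradient Lipschitz constant L = 4."""
    return 0.5 * (u[0] ** 2 + 4.0 * u[1] ** 2)

def grad_F(u):
    return (u[0], 4.0 * u[1])

def gradient_descent(u0, step, n_steps):
    """Explicit Euler discretization of the gradient flow u' = -grad F(u)."""
    path = [tuple(u0)]
    for _ in range(n_steps):
        u = path[-1]
        g = grad_F(u)
        path.append((u[0] - step * g[0], u[1] - step * g[1]))
    return path

step, n_steps = 0.1, 200          # step <= 1 / L = 0.25
path = gradient_descent((2.0, -1.0), step, n_steps)

# Length of the discrete trajectory.
length = sum(math.dist(a, b) for a, b in zip(path, path[1:]))

# Bound from energy dissipation and Cauchy-Schwarz.
elapsed = step * n_steps
bound = math.sqrt(2.0 * elapsed * (F(path[0]) - F(path[-1])))
```

Because the objective is bounded from below, the right-hand side stays finite on any bounded time horizon, which mirrors the reasoning used above to rule out blow-up in finite time.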
B.2 Link between classical and Wasserstein gradient flows
We first give a rigorous definition of the continuity equation which appears in the definition of Wasserstein gradient flows (Definition 2.4).
Considerations from fluid mechanics suggest that if a time-dependent distribution of mass is displaced under the action of a velocity field , then the continuity equation is satisfied: