# Global convergence of neuron birth-death dynamics

Neural networks with a large number of parameters admit a mean-field description, which has recently served as a theoretical explanation for the favorable training properties of "overparameterized" models. In this regime, gradient descent obeys a deterministic partial differential equation (PDE) that converges to a globally optimal solution for networks with a single hidden layer under appropriate assumptions. In this work, we propose a non-local mass transport dynamics that leads to a modified PDE with the same minimizer. We implement this non-local dynamics as a stochastic neuronal birth-death process and we prove that it accelerates the rate of convergence in the mean-field limit. We subsequently realize this PDE with two classes of numerical schemes that converge to the mean-field equation, each of which can easily be implemented for neural networks with finite numbers of parameters. We illustrate our algorithms with two models to provide intuition for the mechanism through which convergence is accelerated.

## Authors

• 2 publications
• 10 publications
• 77 publications
• 11 publications
• ### Scaling limit of the Stein variational gradient descent part I: the mean field regime

We study an interacting particle system in R^d motivated by Stein variat...
05/10/2018 ∙ by Jianfeng Lu, et al. ∙ 0

• ### Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit

We consider learning two layer neural networks using stochastic gradient...
02/16/2019 ∙ by Song Mei, et al. ∙ 0

• ### Accelerating Langevin Sampling with Birth-death

A fundamental problem in Bayesian inference and statistical machine lear...
05/23/2019 ∙ by Yulong Lu, et al. ∙ 0

• ### On Sparsity in Overparametrised Shallow ReLU Networks

The analysis of neural network training beyond their linearization regim...
06/18/2020 ∙ by Jaume de Dios, et al. ∙ 0

• ### PDE-constrained Models with Neural Network Terms: Optimization and Global Convergence

Recent research has used deep learning to develop partial differential e...
05/18/2021 ∙ by Justin Sirignano, et al. ∙ 0

• ### On the Global Convergence of Gradient Descent for multi-layer ResNets in the mean-field regime

Finding the optimal configuration of parameters in ResNet is a nonconvex...
10/06/2021 ∙ by Zhiyan Ding, et al. ∙ 0

• ### Rigorous modelling of nonlocal interactions determines a macroscale advection-diffusion PDE

A slowly-varying or thin-layer multiscale assumption empowers macroscale...
01/21/2020 ∙ by A. J. Roberts, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

As a consequence of the universal approximation theorems, sufficiently wide single layer neural networks are expressive enough to accurately represent a broad class of functions [Cyb89, Bar93, PS91]. The existence of a neural network function arbitrarily close to a given target function, however, is not a guarantee that any particular optimization procedure can identify the optimal parameters. Recently, using mathematical tools from optimal transport theory and interacting particle systems, it was shown that gradient descent [RVE18b, MMN18, SS18, CB18b]

and stochastic gradient descent converge asymptotically to the target function in the large data limit.

This analysis relies on taking a “mean-field” limit in which the number of parameters tends to infinity. In this setting, gradient descent optimization dynamics is described by a partial differential equation (PDE), corresponding to a Wasserstein gradient flow on a convex energy functional. While this PDE provides a powerful conceptual framework for analyzing the properties of neural networks evolving under gradient descent dynamics, the formula confers few immediate practical advantages. Nevertheless, analysis of this Wasserstein gradient flow motivates the interesting possibility of altering the dynamics to accelerate convergence.

In this work, we propose a dynamical scheme involving a parameter birth-death process. It can be defined on systems of interacting (e.g., neural network optimization) or non-interacting particles. We prove that the resulting modified transport equation converges to the global minimum of the loss in both interacting and non-interacting regimes (under appropriate assumptions), and we provide an explicit rate of convergence in the latter case for the mean-field limit. Interestingly—and unlike the gradient flow—the only

fixed point of the dynamics we introduce is the global minimum of the loss function. We study the fluctuations of finite-particle dynamics around this mean-field convergent solution, showing that they are of the same order throughout the dynamics and therefore providing algorithmic guarantees directly applicable to finite single-layer neural network optimization. Finally, we derive algorithms that converge to the birth-death PDEs and verify numerically that these schemes accelerate convergence even for finite numbers of parameters.

Summarily, we describe:

Global convergence and monotonicity of the energy with birth-death dynamics — We propose in Section 3 two distinct modifications of the original gradient flow that can be interpreted as birth-death processes. In this sense, the processes we describe amount to non-local mass transport in the equation governing the parameter distribution. We prove that the schemes we introduce increase the rate of contraction of the energy compared to gradient descent and stochastic gradient descent for fixed , and derive asymptotic rates of convergence (Section 4).

Analysis of fluctuations and quenching protocol — The birth-death dynamics introduces additional fluctuations that are not present in gradient descent dynamics. In Section 5 we calculate these fluctuations using tools for measure-valued Markov processes and propose a simple quenching procedure to control fluctuations and ensure long-time convergence.

Algorithms for realizing the birth-death schemes — In Section 6 we detail numerical schemes (and provide implementations in PyTorch) of the birth-death schemes described below. The computational cost of implementing our procedure is minimal because no additional gradient computations are required. We demonstrate the efficacy of these algorithms on simple, illustrative examples in Section 7.

## 2 Related Works

Non-local update rules appear in various areas of machine learning and optimization. Derivative-free optimization

[RS13]

offers a general framework for optimizing complex non-convex functions using non-local search heuristics. Some notable examples include Particle Swarm Optimization

[Ken11] and Evolutionary Strategies, such as the Covariance Matrix Adaptation method [Han06]

. These approaches have found some renewed interest in the optimization of neural networks in the context of Reinforcement Learning

[SHC17, SMC17]

and hyperparameter optimization

[JDO17].

Our setup of non-interacting potentials is closely related to the so-called Estimation of Distribution Algorithms

[BC95, LL01]

, which define update rules for a probability distribution over a search space by querying the values of a given function to be optimized. In particular, Information Geometric Optimization Algorithms

[OAAH17]

study the dynamics of parametric densities using ordinary differential equations, focusing on invariance properties. In contrast, our focus in on the combination of transport (gradient-based) and birth-death dynamics.

Dropout [SHK14] is a regularization technique popularized by the AlexNet CNN [KSH12] reminiscent of a birth-death process, but we note that its mechanism is very different: rather than killing a neuron and replacing it by a new one with some rate, Dropout momentarily masks neurons, which become active again at the same position; in other words, Dropout implements a purely local transport scheme, as opposed to our non-local dynamics.

Finally, closest to our motivation is [WLLM18], who, building on the recent body of works that leverage optimal-transport techniques to study optimization in the large parameter limit [RVE18a, CB18b, MMN18, SS18], proposed a modification of the dynamics that replaced traditional stochastic noise by a resampling of a fraction of neurons from a base, fixed measure. Our model has significant differences to this scheme, namely we show that the dynamics preserves the same global minimizers and accelerates the rate of convergence.

## 3 Mean-field PDE and Birth-Death Dynamics

### 3.1 Mean-Field Limit and Liouville dynamics

Gradient descent propagates the parameters locally in proportion to the gradient of the objective function. In some cases, an optimization algorithm can benefit from nonlocal dynamics, for example, by allowing new parameters to appear at favorable values and existing parameters to be removed if they diminish the quality of the representation. In order to exploit a nonlocal dynamical scheme, it is useful to interpret the parameters as a system of particles, -dimensional differentiable manifold, , evolving on a landscape determined by the objective function . Here we will focus on situations where the objective function may involve interactions between pairs of parameters:

 ℓ(θ1,…,θn)=n∑i=1F(θi)+12nn∑i,j=1K(θi,θj) (1)

where is a single particle energy function and is a symmetric semi-positive definite interaction kernel. Interestingly, optimizing neural networks with the mean-squared loss function fits precisely this framework [RVE18b, MMN18, CB18b]

. Consider a supervised learning problem using a neural network with nonlinearity

. If we write the neural network as

 fn(x;θ1,…,θn)=1nn∑i=1φ(x,θi) (2)

and expand the loss function,

 ℓ(θ1,…,θn)=12Ey,x|y−fn(x;θ1,…,θn)|2, (3)

we see that, up to an irrelevant constant depending only on the data distribution, we arrive at (1) with

 F(θ)=−Ey,x[yφ(x,θ)], (4)

and,

 K(θ,θ′)=Ex[φ(x,θ)φ(x,θ′)]. (5)

We also consider non-interacting objective functions in which in (1). Optimization problems that fit this framework include resource allocation tasks in which, e.g., weak performers are eliminated, Evolution Strategies, and Information Geometric Optimization [OAAH17].

In the case of gradient descent dynamics, the evolution of the particles is governed for by

 ˙θi=−∇θiℓ(θ1,…,θn). (6)

To analyze the dynamics of this particle system, we consider the “mean-field” limit . As the number of particles becomes large, the empirical distribution of particles

 μ(n)t(dθ)=1nn∑j=1δθj(t)(dθ). (7)

leads to a deterministic partial differential equation at first order [RVE18b, MMN18, CB18b, SS18],

 ∂tμt=∇⋅(μt∇V), (8)

where is the weak limit of and is some distribution from which the initial particle positions are drawn independently. The potential is specified by the objective function as

 V(θ,[μ])=F(θ)+∫DK(θ,θ′)μ(dθ′). (9)

and (8) should be interpreted in the weak sense in general:

 ∀ϕ∈C∞b(D) :∂t∫Dϕ(θ)μt(dθ)=−∫D∇ϕ(θ)⋅∇V(θ,[μt])μt(dθ), (10)

where denotes the space of smooth functions with compact support on .

Interestingly, is the gradient with respect to of an energy functional ,

 E[μ]=∫DF(θ)μ(dθ)+12∫D×DK(θ,θ′)μ(dθ)μ(dθ′). (11)

As a result, the nonlinear Liouville equation (8) is the Wasserstein gradient flow with respect to the energy functional . Local minima of (where ) are clearly fixed points of this gradient flow, but these fixed points may not always be minimizers of the energy when . When the initial distribution of parameters has full support, neural networks evolving with gradient descent avoid these spurious fixed points under appropriate assumptions about their nonlinearity [CB18b, RVE18b, MMN18].

### 3.2 Birth-Death augmented Dynamics

Here we consider a more general dynamical scheme that involves nonlocal transport of particle mass; remarkably, this dynamics avoids spurious fixed points and local minima, and converges asymptotically to the global minimum. Consider the following modification of the Wasserstein gradient flow above:

 ∂tμt=∇⋅(μt∇V)−αVμt(α>0). (12)

The additional term is a birth-death term that modifies the mass of . If is positive, this mass will decrease, corresponding to the removal or “death” of parameters. If is negative, this mass will increase, which can be implemented as duplication or “cloning” of parameters. For a finite number of parameters, this dynamics could lead to changes in the architecture of the network. In many applications it is preferable to fix the total population, achieved by simply adding a conservation term to the dynamics,

 ∂tμt=∇⋅(μt∇V)−αVμt+α¯Vμt. (13)

where .

The dynamics defined by (13) are well-defined in the space of measures (i.e., they preserve the mass and the positivity of ), and the birth-death terms improve the rate of energy decay, as shown by the following proposition:

###### Proposition 3.1

Let be a solution of (13) for the initial condition , the space of probability measures on . Then, for all , and satisfies

 ˙E(t)=−∫D|∇V(θ,[μt])|2μt(dθ)−α∫D(V(θ,[μt])−¯V[μt])2μt(dθ)≤0. (14)

Proof: Both statements are direct consequences of the definition, and are obtained by using and respectively as test functions in (13), together with .

The birth-death term thus contributes to increase the rate of decay of the energy at all times. A natural question is whether such improved energy decay can lead to global convergence of the dynamics to the global minimum of the energy. As it turns out, the answer is yes: the fixed points of the birth-death PDEs (12) and (13) are the global minimizers of the energy , as we prove in Section 4. How to implement a particle dynamics consistent with (13) is discussed in Sections 5 and 6.

While the birth-death dynamics described above ensures convergence in the mean-field limit, when is finite, particles can only be created in proportion to the empirical distribution In particular, such a birth process corresponds to “cloning” or creating identical replicas of existing particles. In practice, there may be an advantage to exploring parameter space with a distribution distinct from the instantaneous empirical particle distribution (7). To enable this exploration we introduce a birth term proportional to a distribution which we will assume has full support on . In this case, the time evolution of the distribution is described by

 ∂tμt=∇⋅(μt∇V) −αV+μt+α¯V+μb1V≤0μb(V≤0) (15) +α′V−μb−α′¯Vb−μt1V>0μt(V>0),

where , , , , and , That is, we kill particles in proportion to in region where but create new particles from in regions where .

This new birth-death dynamics also satisfies the consistency conditions of Proposition 3.1:

###### Proposition 3.2

Let be a solution of (15), with . Then, for all , and satisfies

 ˙E(t)≤−∫D|∇V(θ,[μt])|2μt(dθ). (16)

Proof: By considering again and as a test function in (15), we verify that and

 ˙E(t) =∫DV(θ,[μt])∂tμt(dθ) =−∫D|∇V|2dμt−α∫DVV+dμt+α¯V+μb(V≤0)∫DV1V≤0dμb +α′∫DVV−dμb−α′¯Vb−μt(V>0)∫DV1V>0dμt =−∫D|∇V|2dμt−α∫D|V+|2dμt−α¯V+¯Vb−μb(V≤0)−α′∫D|V−|2dμb−α′¯V+¯Vb−μt(V>0) ,

which proves (16) since all the terms at the right hand side of this equation are negative individually.

## 4 Convergence of Transport Dynamics with Birth-Death

Here, we compare the solutions of the original PDE (8) with those of the PDE (13) with birth-death. We restrict ourselves to situations where and in (11) are such that is bounded from below. Our main technical contributions are results about convergence towards global energy minimizer as well as convergence rates as the dynamics approaches these minimizers. We consider separately the non-interacting and the interacting cases.

Under gradient descent dynamics, global convergence can be established with appropriate assumptions on the initialization and architecture of the neural network. [MMN18]

establishes global convergence and provides a rate for neural networks with bounded activation functions evolving under stochastic gradient descent. Similar results were obtained in

[CB18b, RVE18b], in which it is proven that gradient descent converges to the globally optimal solution for neural networks with particular homogeneity conditions on the activation functions and regularizers. Closely related to the present work, [WLLM18] provides a convergence rate for a “perturbed” gradient flow in which uniform noise is added to the PDE (8). It should be emphasized that, unlike our formulation, the addition of uniform noise changes the fixed point of the PDE and convergence to only an approximate global solution can be obtained in that setting.

### 4.1 Non-interacting Case

We consider first the non-interacting case with and , under

###### Assumption 4.1

is a Morse function, coercive, and with a single global minimum located at .

With no loss of generality we set since adding an offset to in (13) does not affect the dynamics. We also denote by the Hessian of at : recall that a Morse function is such that its Hessian is nondegenerate at all its critical points (where ) and it is coercive if . Our main result is

###### Theorem 4.2 (Global Convergence and Rate: Non-interacting Case)

Assume that the initial condition of the PDE (12) has a density positive everywhere in . Then under Assumption 4.1 the solution of (12) satisfies

 limt→∞μt=δθ∗. (17)

In addition we can quantify the convergence rate: if , then such that , the time needed to reach satisfies

 tϵ≤Cϵ−(d+2)/2. (18)

Furthermore the rate of convergence becomes exponential in time asymptotically: then for all , such that

 ¯F(t)≤α−1tr(H∗e−2H∗(t−δ))if \ \ t≥tδ. (19)

In fact we show that

 limt→∞α¯F(t)tr(H∗e−2H∗t)=1. (20)

The theorem is proven in Appendix A and is based on the following lemma obtained by solving the PDE (12) by the method of characteristics:

###### Lemma 4.3

Denote by the solution of the ODE

 ˙Θ(t,θ)=−∇F(Θ(t,θ)),Θ(0,θ)=θ (21)

Then under the conditions of Theorem 4.2, the solution of the PDE (12) has a density given by

 ρt(dθ)=exp(∫0−tG(Θ(s,θ))ds)ρ0(Θ(−t,dθ))∫Dexp(∫0−tG(Θ(s,θ′))ds)ρ0(Θ(−t,dθ′))dθ′ (22)

where .

The proof of Theorem 4.2 shows that the additional birth-death terms in the PDE (12) allow the measure to concentrate rapidly in the vicinity of ; subsequently, the transport term takes over and leads to the exponential rate of energy decay in (19). The proof also shows that, if we remove the transportation term in the PDE (12), the energy only decreases linearly in time asymptotically. This means that the combination of the transportation and the birth-death terms accelerates convergence. A similar theorem can be proven for the PDE (15).

### 4.2 Interacting Case

Let us now consider the interacting case, when is given by (9) with . We make

###### Assumption 4.4

The set is a -dimensional differentiable manifold which is either closed (i.e. compact, with no boundaries), or open (i.e. with no closed subset), or the Cartesian product of a closed and an open manifold.

As well as,

###### Assumption 4.5

The kernel is symmetric, positive semi-definite, and twice differentiable in its arguments, ; . If is not closed, is bounded on , and is bounded from below and coercive in the sense that for any , there exists a compact such that .

Assumption 4.5 guarantees that the quadratic energy in (11) is convex. As a result it has a minimum value. Assumption 4.5 also guarantees that this minimum is only achieved by minimizers; this is a necessary condition that can perhaps be obtained under weaker assumptions. It should be noted that the technical assumptions on

in the non-closed case prohibit some nonlinearities, such as ReLU. Furthermore, if the nonlinearity is homogeneous of degree one, then the proof can be adapted to the case that

with compact. These global minimizers may not be unique, and satisfy:

 {V(θ,[μ])=¯V[μ]∀θ∈suppμV(θ,[μ])≥¯V[μ]∀θ∈D. (23)

where . These equations are well known [Ser15], and it is also known that any solution to these equations has compact support (which may be if is closed). We recall the derivation of these conditions in Appendix B.

A well-known issue with the PDE (8) is that it potentially has many more fixed points than has minimizers: Indeed, rather than (23), these fixed points only need to satisfy

 ∇V(θ,[μ])=0∀θ∈suppμ. (24)

It is therefore remarkable that, if we pick an initial condition for the birth-death PDE (13) that has full support, the solution to this equation converges to a global minimizer of :

###### Theorem 4.6 (Global Convergence to Global Minimizers: Interacting Case)

Let denote the solution of (13) for the initial condition with and with a density . Then under Assumption 4.5.

 limt→∞μt=μ∗ (25)

where is a global minimizer of .

This theorem is proven in Appendix C. One aspect of the argument is based on the evolution equation (14) for . Since , by the bounded convergence theorem, the evolution must stop eventually since is bounded from below by Assumption 4.5. This happens when both integrals in (14) are zero, i.e. when the first equation in (23) as well as (24) are satisfied. What remains to be shown is that the second equation in (23) must also be satisfied, which is done in Appendix C.

Regarding the rate of convergence, we have the following result:

###### Theorem 4.7 (Asymptotic Convergence Rate: Interacting Case)

Let denote the solution of (13) for the initial condition with and with a density . Assume that, as where has a density such that . Let . Then, for any , such that

 E(t)≤Ct−1if \ \ t≥tC (26)

The proof of this theorem is given in Appendix D where we prove

 limt→∞tE(t)=0. (27)

If , then a proof following the same steps carries through and rate of convergence proportional to holds.

## 5 From Mean-field to Particle Dynamics with Birth-Death

Here we show how to design a particle dynamics consistent with (13) at mean field level () and analyze the scale of the fluctuations above that limit at finite . For simplicity we consider the non-interacting case when and, without loss of generality, we assume that –particle dynamics consistent with (13) in the general case are considered below in Sec. 6. We proceed formally, and leave the details of a rigorous derivation of these results to a future publication.

The birth-death dynamics is represented as a continuous time Markov process in which the particles have an exponential survival/duplication probability. To realize this process, we equip each particle evolving by the GD flow (6) with an exponential clock with rate such that: with probability , the particle will be duplicated during the small interval , and a particle chosen at random in the stack will be killed to preserve the population size. If we focus only on this part of the dynamics, it is easy to write the infinitesimal generator of the measure valued Markov process [Daw06] it induces at the level of the particle distribution defined in (7): the action of this generator on a functional evaluated on reads

 αnn∑i,j=1∫R2kF(θi)δθi(dθ)δθj(dθ′)(Φ[μ(n)+n−1(δθj−δθi)]−Φ[μ(n)]) (28) =nα∫R2kF(θ)μn(dθ)μ(n)(dθ′)(Φ[μ(n)+n−1(δθ′−δθ)]−Φ[μ(n)])=:(LnΦ)[μ(n)]

The operator in the second equality is obtained using the properties of the Dirac in the first equality, and this operator is now defined on any probability measure . If we formally take the limit as , we deduce that with

 (LΦ)[μ] =α∫R3kF(θ)μ(dθ)μ(dθ′)(δθ′(dθ′′)−δθ(dθ′′))δΦ[μ]δμ(θ′′) (29) =α∫R2k(F(θ′)−F(θ))μ(dθ)μ(dθ′)δΦ[μ]δμ(θ)

This is the generator of the PDE (using )

 ∂tμt=−αFμt+α¯Fμt, (30)

with . This equation is nothing but the PDE (13) when

and we neglect the transport term. That is, we have formally established that this PDE arises in the mean-field limit as a consequence of the Law of Large Numbers (LLN) for the process for the particles described above.

Quantifying the fluctuations above the LLN essentially amounts to going to next order in the expansion in . Proceeding similarly as we did to derive (29), we deduce that, as ,

 n((LnΦ)[μ]−(LΦ)[μ]) (31) →α∫R3kF(θ)μ(dθ)μ(dθ′)(δθ′(dθ′′)−δθ(dθ′′))(δθ′(dθ′′′)−δθ(dθ′′′))δ2Φ[μ]δμ(θ′′)δμ(θ′′′) =α∫R2kF(θ)μ(dθ)μ(dθ′)(δ2Φ[μ]δ2μ(θ)+δ2Φ[μ]δ2μ(θ′)−2δ2Φ[μ]δμ(θ)δμ(θ′))

This is a second order operator, which indicates that at next order we should formally turn (30

) into a stochastic differential equation by adding to this equation a Gaussian white-noise term of order

with a covariance structure consistent with (31). Since this noise is small when

is large, it is simpler and more precise to phrase this result in term of a Central Limit Theorem (CLT). Specifically, if

and is the solution of (30) then, as ,

 √n(μ(n)t−μt)⇀ωt%inlaw (32)

where is Gaussian random distribution whose equation can be obtained by linearizing the aforementioned stochastic equation around the solution of (30). Formally

 ∂tωt=−αFωt+α¯Fωt+α(∫DFdωt)μt+√2η(t), (33)

where is a white-noise term with covariance consistent with (31):

 Eη(t)η(t′) =α(F(θ)+¯F)μt(dθ)δθ(dθ′)δ(t−t′) (34) −α(F(θ)+F(θ′))μt(dθ)μt(dθ′)δ(t−t′)

Since is Gaussian with zero mean, all its information is contained in its covariance , for which we can derive an equation from (33)

 ∂tΣt =−α(F(θ)+F(θ′))Σt+2α¯FΣt (35) +αμt(dθ)∫RkF(θ′′)Σt(dθ′′,dθ)+αμt(dθ′)∫RkF(θ′′)Σt(dθ′′,dθ′) +α(F(θ)+¯F)μt(dθ)δθ(dθ′)−α(F(θ)+F(θ′))μt(dθ)μt(dθ′)

It is easy to see that the fluctuations are mass-conserving, in the sense that for all since this is true initially and . If we add the transport (i.e. the particles evolve by GD beside the birth-death process described above), we get an additional term at the right hand side of (35). A similar equation also holds for the interacting case, and essentially amounts to replacing by —the particle process with transport and birth-death that arises in this case is described in Sec. 6.

The main conclusion to draw from the developments above is that implementing a birth-death process on top of the GD dynamics is simple and it leads to the PDE (13) in the mean field limit, with small Gaussian fluctuations above that limit that scale as . These fluctuations are controlled on finite time intervals, and can be made to disappear at long time scales, when is closed to converged, by decreasing to zero according to some schedule. Note also that the argument above explains why should be kept in to control the fluctuations, even though sending to infinity would help at mean field level— in practice, remains finite, and the limits , do not commute, which requires keeping independent of the size of the network.

## 6 Algorithms

Numerical schemes that converge to the PDEs presented in Sec. 3 are both straightforward to interpret and easy to implement. We denote by a configuration of particles in the interacting potential in (1), and by the solution of the GD flow in this potential

 ˙Θi(t)=−∇F(Θi(t))−1nn∑j=1∇K(Θi(t),Θj(t)) (36)

with and . The parameters evolve according to the procedure defined in Algorithm 1.

The algorithm implements gradient descent with birth-death for the particles . After each time interval , the contribution of each particle to the empirical potential is computed. The continuous time Markov process described in Sec. 5 is then realized by removing/duplicating particles depending on the sign of Positive corresponds to decreasing mass in the PDE (12); as such, those particle are survive with exponentially decaying probability. On the other hand, negative means that the mass is locally increasing at mean-field level, so the particles duplicate with an exponential rate.

For some optimization problems, e.g. neural networks, calculating

directly is not possible, as it requires an integral over the data distribution. However, in this case, stochastic gradient descent provides an unbiased estimator for

We define the procedures clone and kill so that the particle system asymptotically converges to the birth-death dynamics (12). An algorithm for the reinjection dynamics (15) is similar and is given explicitly in Appendix E.

If the prior is used in the dynamics, the kill and clone procedures must be implemented differently. Particles that have can be duplicated in precisely the same fashion as in the first scheme. When , however, the kill procedure amounts to replacing the particle with a randomly sampled In our instantiation of the reinjection scheme, we use

 μb(dθ)=δ0(dθ1)¯μb(dθ2,…,dθk) (37)

where we denote componentwise and we have assumed homogeneity of degree one of the nonlinearity in the first component, , . With this choice, the terms involving in (15) disappear since the prior has no mass on .

## 7 Numerical Experiments

### 7.1 Mixture of Gaussians

We take as an illustrative example a mixture of Gaussians in dimension ,

 f(x)=1mm∑i=1¯ci(2πσ2i)d/2e−|x−¯yi|2/(2σ2i), (38)

which we approximate as a neural network with Gaussian nonlinearities with fixed standard deviation

,

 fn(x;c,y)=1nn∑i=1ci(2πσ2)d/2ne−|x−yi|2/(2σ2), (39)

denoting the parameters This is a useful test of our results because we can do exact gradient descent dynamics on the mean-squared loss function:

 ℓ(c,y)=12∫Rd|f(x)−fn(x;c,y)|2dx (40)

Because all the integrals are Gaussian, this loss can be computed analytically, and so can and its gradient.

In Fig. 1, we show convergence to the energy minimizer for a mixture of three Gaussians (details and source code are provided in the SM). The non-local mass transport dynamics dramatically accelerates convergence towards the minimizer. While gradient descent eventually converges in this setting—there is no metastability—the dynamics are particularly slow as the mass concentrates near the minimum and maxima of the target function. However, with the birth-death dynamics, this mass readily appears at those locations. The advantage of the birth-death dynamics with a reinjection distribution is highlighted by choosing an unfavorable initialization in which the particle mass is concentrated around In this case, both GD and GD with birth-death (12) do not converge on the timescale of the dynamics. With the reinjection distribution, new mass is created near and convergence is achieved.

### 7.2 Student-Teacher ReLU Network

In many optimization problems, it is not possible to evaluate exactly. Instead, typically is estimated as a sample mean over a batch of data. We consider a student-teacher set-up similar to [CB18a] in which we use single hidden layer ReLU networks to approximate a network of the same type with fewer neurons. We use as the target function a ReLU network with 50- input and 10 hidden units. We approximate the teacher with neural networks with neurons (see SM). The networks are trained with stochastic gradient descent (SGD) and the mini-batch estimate of the gradient of output layer, which is computed at each step of SGD, is used to compute which determines the rate of birth-death. In experiments with the reinjection distribution, we use (37) with Gaussian

As shown in Fig. 2, we find that the birth-death dynamics accelerates convergence to the teacher network. We emphasize that because the birth-death dynamics is stochastic at finite particle numbers, the fluctuations associated with the process could be unfavorable in some cases. In such situations, it is useful to reduce as a function of time. On the other hand, in some cases we have observed much more dramatic accelerations from the birth-death dynamics.

## 8 Conclusions

The success of an optimization algorithm based on gradient descent requires good coverage of the parameter space so that local updates can reach the minima of the loss function quickly. Our approach liberates the parameters from a purely local dynamics and allows rapid reallocation to values at which they can best reduce the approximation error. Importantly, we have constructed the non-local birth-death dynamics so that it converges to the minimizers of the loss function. For a very general class of minimization problems—both interacting and non-interacting potentials—we have established convergence to energy minimizers under the dynamics described by the mean-field PDE with birth-death. Remarkably, for interacting systems with sufficiently short-ranged interaction terms we can guarantee global convergence. We have also computed the asymptotic rate of convergence with birth-death dynamics.

These theoretical results translate to dramatic reductions in convergence time for our illustrative examples. It is worth emphasizing that the schemes we have described are straightforward to implement and come with little computational overhead. Extending this type of dynamics to deep neural network architectures could accelerate the slow dynamics at the initial layers often observed in practice. Hyperparameter selection strategies based on evolutionary alogorithms [SMC17] provide another interesting potential application of our approach.

While we have characterized the basic behavior of optimization under the birth-death dynamics, many theoretical questions remain. More explicit calculations of global convergence rates for the interacting case and tighter rates for the non-interacting case would be exciting additions. The proper choice of is another question worth exploring because, as highlighted in our simple example, favorable reinjection distributions can rapidly overcome slow dynamics. Finally, a mean-field perspective on deep neural networks would enable us to translate some of the guarantees here to deep architectures.

## Appendix A Convergence and Rates in the Non-interacting Case

### a.1 Non-interacting Case without the Transportation Term

Let us look first at the PDE satisfied by the measure in the non-interacting case, i.e. with satisfying Assumption 4.1, and without the transportation term:

 ∂tμt=−αF(θ)μt+α¯F(t)μt, (41)

where . This equation can be solved exactly. Assuming that has a density everywhere positive on , has a density given by

 ρt(θ)=eα∫t0¯F(s)ds−αtF(θ)ρ0(θ). (42)

 eα∫t0¯F(s)ds∫Rke−αtF(θ′)ρ0(θ′)dθ′=1 ⇔ e−α∫t0¯F(s)ds=∫Rke−αtF(θ′)ρ0(θ′)dθ′.

Therefore, by plugging this last expression in equation (42), we obtain the explicit expression

 ρt(θ)=e−αtF(θ)ρ0(θ)∫Rke−αtF(θ′)ρ0(θ′)dθ′. (43)

We can use this equation to express the energy :

 ¯F(t)=∫RkF(θ)e−αtF(θ)ρ0(θ)dθ∫Rke−αtF(θ)ρ0(θ)dθ=ddαtG(αt), (44)

where is the function defined as:

 G(αt)=−log∫Rke−αtF(θ)ρ0(θ)dθ. (45)

At late times, the factor focuses all the mass in the vicinity of the global minimum of . Therefore, we can neglect the influence of the density in this integral. More precisely a calculation using the Laplace method indicates that

 ∫Rke−αtF(θ)dθ∼(2π)d/2(αt)−d/2(det(H∗))−1/2. (46)

where is the Hessian at the global minimum located at , and indicates that the ratio of both sides of the equation tend to 1 as . This shows that

 ¯F(t)∼12d(αt)−1as \ \ αt→∞ (47)

### a.2 Non-interacting Case with Transportation and Birth-Death

#### a.2.1 Proof of Theorem 4.2

We first prove the following intermediate result

###### Lemma A.1

Let arbitrary, and define

 ϕδ(θ)=max(0,1−δ−1F(θ)) , fδ=∫Rkϕδ(θ)μ0(dθ) .

Then

 ∀t :E(t)≤δ+1αtfδ . (48)

Proof: By slightly abusing notation, we define

 fδ(t)=∫Rkϕδ(θ)μt(dθ) .

We consider the following Lyapunov function:

 Lδ(t)=αt(E(t)−δ)+1fδ(t) . (49)

Its time derivative is

 ˙Lδ(t)=α(E(t)−δ)+αt˙E(t)−˙fδ(t)f2δ(t) . (50)

By definition, we have

 ˙E(t)=−∫Rk|∇F(θ)|2μt(dθ)−α∫Rk(F(θ)−F(t))2μt(dθ)≤0 . (51)

We also have

 ˙fδ(t) =−∫Rk⟨∇ϕδ(θ),∇F(θ)⟩μt(dθ)−α∫Rkϕδ(θ)F(θ)μt