1 Introduction
Neural networks have become state-of-the-art models in numerous machine learning tasks, and strong empirical performance is often achieved by deeper networks. One landmark example is the residual network (ResNet)
[he2016deep, he2016identity], which can be efficiently optimized even at extremely large depth, such as 1000 layers. However, there is a gap between this empirical success and the theoretical understanding: ResNets can be trained to almost zero loss with standard stochastic gradient descent, yet it is known that larger depth leads to an increasingly nonconvex landscape even in the presence of residual connections
[yun2019deep]. While global convergence can be obtained in the so-called "lazy" regime, e.g. [jacot2018neural, du2018gradient], such kernel models cannot capture fully-trained neural networks [suzuki2018adaptivity, chizat2019lazy, ghorbani2019limitations].

In this work, we aim to demonstrate the provable optimization of ResNets beyond the restrictive "lazy" regime. To do so, we build upon recent works that connect ordinary differential equation (ODE) models to infinite-depth neural networks [weinan2017proposal, lu2017beyond, sonoda2017double, haber2017stable, chen2018neural, dupont2019augmented, zhang2018dynamically, thorpe2018deep, sonoda2019transport, lu2019understanding]. Specifically, each residual block of a ResNet can be written as $x_{l+1} = x_l + \frac{1}{L} f(x_l, \theta_l)$, which can be seen as the Euler discretization of the ODE $\dot{X}(t) = f(X(t), \theta(t))$. This turns training the neural network into solving an optimal control problem [li2017maximum, weinan2019mean, liu2019deep]
, under which backpropagation can be understood as simulating the adjoint equation
[chen2018neural, li2017maximum, pmlrv80li18b, zhang2019you, li2020scalable]. However, this analogy does not directly provide guarantees of global convergence, even in the continuum limit.

To address the problem of global convergence, we propose a new limiting ODE model of ResNets. Formally, we model deep ResNets via a mean-field ODE model
This model regards every residual block as a particle and optimizes over the empirical distribution $\rho$ of particles $(\theta, t)$, where $\theta$ denotes the weight of the residual block and $t$ denotes the (continuous) layer index of the residual block. We consider properties of the loss landscape with respect to the distribution of weights, an approach similar to [bengio2006convex, bach2017breaking]. Inspired by the observation of [veit2016residual] that a deep ResNet behaves like an ensemble of shallow models, we compare a deep ResNet with its counterpart two-layer network and show that the gradients of the two models are close to each other. This leads us to conclude that, although the loss landscape may not be convex, every local minimizer is a global one.
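The Euler-discretization view of a ResNet can be sketched in a few lines of NumPy. The two-layer block parametrization $f(x,\theta) = w\,\sigma(ux+b)$ below, and all parameter shapes, are hypothetical choices for illustration rather than the exact architecture of the paper:

```python
import numpy as np

def residual_block(x, theta):
    # A toy two-layer residual map f(x, theta) = w * sigmoid(u @ x + b),
    # where theta = (w, u, b) is a hypothetical parametrization.
    w, u, b = theta
    return w * (1.0 / (1.0 + np.exp(-(u @ x + b))))

def resnet_forward(x0, thetas):
    # Forward pass of an L-block ResNet: x_{l+1} = x_l + (1/L) f(x_l, theta_l),
    # i.e. the forward-Euler discretization of dX/dt = f(X, theta(t)) on [0, 1].
    L = len(thetas)
    x = x0
    for theta in thetas:
        x = x + residual_block(x, theta) / L
    return x

rng = np.random.default_rng(0)
d = 4
thetas = [(rng.normal(size=d), rng.normal(size=(d, d)), rng.normal(size=d))
          for _ in range(100)]
out = resnet_forward(np.ones(d), thetas)
```

As the depth grows (with step size 1/L), this iteration converges to the solution of the continuous ODE evaluated at time 1.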
1.1 Contribution
Our contributions can be summarized as follows:

We derive a new continuous-depth limit of deep ResNets. In this new model, each residual block is regarded as a particle, and the training dynamics are captured by the gradient flow on the distribution of the particles.

We analyze the loss landscape with respect to the particle distribution and show that all local minima have zero loss, which indicates that every local optimum is global. This property leads to the conclusion that a full-support stationary point of the Wasserstein gradient flow is a global optimum. To the best of our knowledge, this is the first global convergence result for multi-layer neural networks in the mean-field regime without a convexity assumption on the loss landscape.

We propose novel numerical schemes to approximate the mean-field limit of deep ResNets and demonstrate that they achieve superior empirical results on real-world datasets.
1.2 Related Work
MeanField Limit and Global Convergence.
Recent works have explored the global convergence of two-layer neural networks by studying suitable scaling limits of stochastic gradient descent when the width is sent to infinity and the second layer is scaled by one over the width [nitanda2017stochastic, mei2018mean, rotskoff2018neural, chizat2018global, sirignano2019mean]. Though global convergence can be obtained under certain conditions for two-layer networks, it is highly nontrivial to extend this framework to multi-layer neural networks: recent attempts [araujo2019mean, sirignano2019mean, nguyen2019mean, fang2019convex] either do not address realistic neural architectures directly or do not provide conditions for global convergence. Parallel to the mean-field regime, [jacot2018neural, du2018gradient, allen2018convergence, zou2018stochastic, oymak2019towards] provided global convergence results for multi-layer networks in the so-called "lazy" or kernel regime. However, this description of deep neural networks is rather limited: the scaling of the initialization forces the distance traveled by each parameter to vanish asymptotically [chizat2019lazy], and thus training becomes equivalent to kernel regression with respect to the neural tangent kernel [arora2019exact, jacot2018neural]. On the other hand, it is well known that properly trained neural networks can outperform kernel models in learning various target functions [wei2019regularization, suzuki2018adaptivity, ghorbani2019limitations, ba2020generalization, allen2019can].
In contrast, the mean-field regime considered in this work does not reduce training to kernel regression; in other words, the mean-field setting allows neurons to travel further and learn adaptive features.
Landscape of ResNets.
[li2017convergence, liu2019towards] provided convergence results of gradient descent on two-layer residual neural networks and showed that the global minimum is unique. In parallel, [shamir2018resnets, kawaguchi2019depth]
showed that when the network consists of one residual block, the gradient descent solution is provably better than a linear classifier. However, recent work has also pointed out that these positive results may not hold for deep ResNets composed of multiple residual blocks. Regarding deeper models,
[hardt2016identity, bartlett2019gradient, wu2019global] proved the global convergence of gradient descent for training deep linear ResNets. Yet it is known that even mild nonlinear activation functions can destroy these good landscape properties
[yun2018small]. In addition, [bartlett2018representing] considered a ResNet model with compositions of close-to-identity functions and provided a convergence result with respect to the Fréchet gradient. However, [bartlett2018representing] also pointed out that such a conclusion may no longer hold for a realistic ResNet model. Our paper fills this gap by introducing a new continuous model and providing conditions for global convergence beyond the previously considered kernel regime [du2018gradient, zhang2019training, allen2018convergence].
1.3 Notations and Preliminaries
Notations.
Let $\delta$ denote the Dirac mass and $\mathbb{1}_A$ the indicator function of a set $A$. We denote by $\mathcal{P}_2$
the set of probability measures endowed with the Wasserstein-2 distance (see below for the definition). Let $\mu$
be the population distribution of the input data and $\|\cdot\|_\mu$ the norm induced by $\mu$.
Fréchet Derivative.
We extend the notion of the gradient to infinite-dimensional spaces. For a functional $E$ defined on a Banach space $B$, the Fréchet derivative at $u \in B$ is the element $DE(u)$ of the dual space $B^*$ that satisfies $E(u + h) = E(u) + DE(u)[h] + o(\|h\|_B)$ as $\|h\|_B \to 0$.
In this paper, $D$ is used to denote the Fréchet derivative.
Wasserstein Space.
The Wasserstein-2 distance between two probability measures $\mu$ and $\nu$ is defined as
$W_2(\mu, \nu) = \Big( \inf_{\gamma \in \Gamma(\mu,\nu)} \int \|x - y\|^2 \, d\gamma(x, y) \Big)^{1/2}.$
Here $\Gamma(\mu,\nu)$ denotes the set of all couplings between $\mu$ and $\nu$, i.e., all probability measures $\gamma$ with marginal $\mu$ on the first factor and $\nu$ on the second.
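On the real line, the Wasserstein-2 distance between two empirical measures with the same number of equally weighted atoms admits a closed form: the optimal (monotone) coupling pairs the sorted atoms. A short sketch, with illustrative data:

```python
import numpy as np

def w2_1d(samples_mu, samples_nu):
    # For two empirical measures on R with equally many equal-weight atoms,
    # the optimal coupling is monotone: sort both supports and pair the
    # order statistics.  W_2 is the root-mean-square pairing cost.
    a = np.sort(np.asarray(samples_mu, dtype=float))
    b = np.sort(np.asarray(samples_nu, dtype=float))
    assert a.shape == b.shape, "equal atom counts assumed"
    return np.sqrt(np.mean((a - b) ** 2))

# Translating a measure by c changes W_2 by exactly |c|.
mu = np.array([0.0, 1.0, 2.0])
nu = mu + 3.0
print(w2_1d(mu, nu))  # → 3.0
```

In higher dimensions no such sorting shortcut exists and one must solve the coupling problem in the definition directly.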
Bounded Lipschitz norm.
We say that a sequence of measures $(\mu_n)$ weakly (or narrowly) converges to $\mu$ if, for all continuous and bounded functions $f$, it holds that $\int f \, d\mu_n \to \int f \, d\mu$. For sequences which are bounded in total variation norm, this is equivalent to convergence in the Bounded Lipschitz norm. The latter is defined, for a signed measure $\mu$, as
$\|\mu\|_{BL} = \sup \Big\{ \int f \, d\mu \;:\; \|f\|_{\mathrm{Lip}} \le 1, \ \|f\|_\infty \le 1 \Big\},$  (1)
where $\|f\|_{\mathrm{Lip}}$ is the smallest Lipschitz constant of $f$ and $\|f\|_\infty$ the supremum norm.
2 Limiting Model
Following the observation that each residual block of a ResNet can be considered as one step of the forward Euler approximation of an ODE [weinan2017proposal, lu2017beyond, sonoda2017double, haber2017stable], a series of recent papers [zhang2018dynamically, zhang2019you, chen2018neural, li2017maximum, li2020scalable] analyzed deep neural networks in the continuous limit. [thorpe2018deep] proved the Gamma-convergence of ResNets in the asymptotic limit. However, two points of that approach require further investigation. First, [thorpe2018deep] introduced a regularization term that depends on the depth of the network. This regularization becomes stronger as the network gets deeper, which implies a more constrained space of functions that the network can represent.
Second, while the Gamma-convergence result is concerned with the convergence of the global minima of a sequence of energy functionals, it gives rather little information about the landscape of the limiting functional, which can be quite complicated for nonconvex objective functions. Later work [avelin2019neural] proved that stochastic gradient descent of a deep ResNet with constant weights across layers converges to the gradient flow of the loss under the ODE model. However, letting the weights of the ResNet be the same across all layers weakens the approximation power and makes the optimization landscape more complicated. To understand the reason behind the global convergence of the gradient flow, in this section we propose a new continuous limiting model of the deep residual network.
2.1 A New Continuous Model
The goal is to minimize the loss function
$E(\rho) = \mathbb{E}_{x \sim \mu}\big[ \tfrac{1}{2}\, | g(X(1, x)) - y(x) |^2 \big]$  (2)
over parameter distributions $\rho \in \mathcal{P}(\Theta \times [0,1])$, for $\theta$ in a compact set $\Theta$ and $t \in [0,1]$. Here $X(t, x)$ is the solution of the ODE
$\dot{X}(t, x) = \int_\Theta f(X(t, x), \theta)\, \rho(\theta, t)\, d\theta, \qquad X(0, x) = A x.$  (3)
The ODE (3) is understood in the integrated sense, i.e., for a fixed distribution $\rho$ and input $x$, the solution path satisfies
$X(t, x) = A x + \int_0^t \int_\Theta f(X(s, x), \theta)\, \rho(\theta, s)\, d\theta\, ds.$
Here $y$
is the function to be estimated. The parameter $A$
represents the first convolution layer in the ResNet [he2016deep, he2016identity], which extracts features before sending them to the residual blocks. To simplify the analysis, we set $A$ to a predefined linear transformation (
i.e. we do not train the first-layer parameters) with the technical assumption that the singular values of $A$ are bounded above and bounded away from zero. We remark that this assumption is not unrealistic; for example,
[oyallon2017scaling] let the first layer be a predefined wavelet transform and still achieved state-of-the-art results on several benchmark datasets. Here $f(\cdot, \theta)$ is the residual block with parameter $\theta$ that aims to learn a feature transformation. For simplicity, we assume that the residual block is a two-layer neural network and $\sigma$ is an activation function, such as the sigmoid or ReLU. Note that in our notation
the activation function $\sigma$ is applied separately to each component of the vector.
Finally, $g$ is a pooling operator that transfers the final feature to the classification result, and an $\ell_2$ loss function is used for example. We also assume that $g$ is a predefined linear transform with bounded norm, which can easily be achieved via an operator used in realistic architectures such as global average pooling [lin2013network]. Before starting the analysis, we first list the necessary regularity assumptions.
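A minimal numerical sketch of this forward model, assuming (as hypothetical stand-ins) a toy residual map $f$, an identity first layer $A$, and global average pooling as the output map $g$:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

A = np.eye(d)            # fixed (untrained) first layer; identity for simplicity

def g(x):
    # Fixed linear "pooling" to a scalar prediction: global average pooling.
    return float(np.mean(x))

def f(x, theta):
    # Hypothetical two-layer residual block with theta = (w, u, b).
    w, u, b = theta
    return w * np.tanh(u @ x + b)

def predict(x_in, particles, n_steps=200):
    # Forward-Euler solve of the mean-field ODE on [0, 1]:
    #   dX/dt = average over particles theta of f(X, theta),  X(0) = A x,
    # followed by the pooling map g at the terminal state.
    x = A @ x_in
    dt = 1.0 / n_steps
    for _ in range(n_steps):
        drift = np.mean([f(x, th) for th in particles], axis=0)
        x = x + dt * drift
    return g(x)

particles = [(rng.normal(size=d), rng.normal(size=(d, d)), rng.normal(size=d))
             for _ in range(50)]
pred = predict(np.ones(d), particles)
```

Here the distribution over weights is represented by a finite bag of particles, shared across all layer times for simplicity; training would move the particles by gradient flow on this loss.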

(Boundedness of data and target distribution) The input data lies almost surely in a compact ball, i.e. $\|x\| \le C_x$ for some constant $C_x$. At the same time, the target function is also bounded, $\|y\|_\infty \le C_y$, for some constant $C_y$.

(Lipschitz continuity of distribution with respect to depth) There exists a constant such that
for all .

The kernel is a universal kernel [micchelli2006universal], i.e. the span of is dense in .

(Locally Lipschitz derivative with sublinear growth [chizat2018global]) There exists a family of nested nonempty closed convex subsets of that satisfies:

for all .

There exist constants such that
holds for all . Also the gradient of with respect to is a Lipschitz function with Lipschitz constant .

For each , the gradient respect to the parameter is also bounded
for some constant .

Remark 1
Let us elaborate on these assumptions in the neural network setting. For Assumption 1.4, the universality of the kernel holds for the sigmoid and ReLU activation functions. The local regularity Assumption 1.5 concerning the function $f$ can easily be satisfied. Hence, in order to satisfy the local regularity condition, one possible solution is to utilize an activation function with a Lipschitz gradient and set the local set to be a ball of radius $r$ centered at the origin.
Under these assumptions, we can establish the existence, uniqueness, stability, and wellposedness of our forward model.
Theorem 1 (Well-posedness of the Forward Model)
Suppose Assumption 1 holds and, in addition, that there exists a constant $r$ such that $\rho$ is concentrated on one of the nested sets $Q_r$. Then the ODE in (3) has a unique solution for any initial condition. Moreover, for any pair of distributions $\rho_1$ and $\rho_2$, there exists a constant $C$ such that
(4) 
where $W_2(\rho_1, \rho_2)$ is the 2-Wasserstein distance between $\rho_1$ and $\rho_2$.
Proof
We first show the existence and uniqueness of . From now on, let
(5) 
Then, the ODE (3) becomes
(6) 
and by the condition of the theorem and Assumption 2.1 we have
(7) 
This is because the continuous function is now defined on a domain in which its arguments lie in a compact set, which yields an upper bound that holds uniformly. This bound will be used repeatedly in the following sections.
Hence, the right-hand side is bounded. On the other hand, it is integrable with respect to time and Lipschitz continuous with respect to the state in any bounded region (by part 2 of Assumption 2.1). Therefore, consider the region , where . By the existence and uniqueness theorem for ODEs (the Picard–Lindelöf theorem), the solution of (6) initialized from exists and is unique on .
Next, we show the continuity of with respect to . Letting , we have
(8) 
Let . For the first term in (8), since both and are controlled by , by 2 of Assumption 2.1 we have the following Lipschitz condition for
(9) 
For the second term of (8), we have
(10) 
Since is Lipschitz continuous with respect to and also bounded by , we have is Lipschitz continuous w.r.t . On the other hand, still by Assumption 2.1, is Lipschitz with respect to . As a result, the function is Lipschitz continuous on with , which implies
(11) 
Finally, by defining
(12) 
we have by (8)
(13) 
Applying Gronwall’s inequality gives
(14) 
and specifically for we have
(15) 
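For completeness, the last step of the proof uses the standard integral form of Gronwall's inequality, which can be stated as:

```latex
% Gronwall's inequality (integral form): a function dominated by an affine
% integral of itself grows at most exponentially.
\[
  u(t) \;\le\; \alpha + \beta \int_0^t u(s)\,\mathrm{d}s
  \quad \text{for all } t \in [0, T],\ \alpha \ge 0,\ \beta \ge 0
  \;\Longrightarrow\;
  u(t) \;\le\; \alpha\, e^{\beta t}.
\]
```

Applied to the difference of the two solution paths, this converts the integral bound in (13) into the exponential-in-time stability estimate, and evaluating at the terminal time gives (4).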
2.2 Deep Residual Networks Behave Like an Ensemble of Shallow Models
In this section, we briefly explain the intuition behind our analysis, namely that a deep residual network can be approximated by a two-layer neural network. [veit2016residual] introduced an unraveled view of ResNets and showed that deep ResNets behave like ensembles of shallow models. First, we offer a formal derivation to reveal the connection between a deep ResNet and a two-layer neural network. The first residual block is formulated as
$x_1 = x_0 + \tfrac{1}{L} f(x_0, \theta_0).$
By Taylor expansion, the second layer output is given by
$x_2 = x_1 + \tfrac{1}{L} f(x_1, \theta_1) = x_0 + \tfrac{1}{L}\big( f(x_0, \theta_0) + f(x_0, \theta_1) \big) + \tfrac{1}{L^2} \nabla_x f(x_0, \theta_1) f(x_0, \theta_0) + O(L^{-3}).$
Iterating this expansion gives rise to
$x_L = x_0 + \tfrac{1}{L} \sum_{l=0}^{L-1} f(x_0, \theta_l) + \tfrac{1}{L^2} \sum_{l' < l} \nabla_x f(x_0, \theta_l)\, f(x_0, \theta_{l'}) + \cdots$
Here we only keep the terms that are at most quadratic in $1/L$. A similar derivation shows that at order $k$ in $1/L$ there are $\binom{L}{k}$ terms with coefficient $L^{-k}$ each. This implies that the $k$-th order term decays as $\binom{L}{k} L^{-k} \approx 1/k!$, suggesting that one can approximate a deep network by keeping a few leading orders.
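This decay can be checked numerically. In the toy sketch below (with a hypothetical block form and a free scale eps standing in for 1/L), the gap between the composed deep network and its first-order "two-layer" sum, which evaluates every block at the input, shrinks quadratically in the block scale:

```python
import numpy as np

rng = np.random.default_rng(2)
d, L = 4, 5
thetas = [(rng.normal(size=d), rng.normal(size=(d, d)), rng.normal(size=d))
          for _ in range(L)]

def f(x, theta):
    # Hypothetical smooth residual map with theta = (w, u, b).
    w, u, b = theta
    return w * np.tanh(u @ x + b)

def deep(x0, eps):
    # Full ResNet: compose L residual blocks x <- x + eps * f(x, theta_l).
    x = x0
    for th in thetas:
        x = x + eps * f(x, th)
    return x

def shallow(x0, eps):
    # "Unraveled" first-order term: a two-layer-style sum of the same blocks,
    # all evaluated at the input x0 (no composition).
    return x0 + eps * sum(f(x0, th) for th in thetas)

x0 = rng.normal(size=d)
e1 = np.linalg.norm(deep(x0, 1e-2) - shallow(x0, 1e-2))
e2 = np.linalg.norm(deep(x0, 5e-3) - shallow(x0, 5e-3))
ratio = e1 / e2   # ~ 4: the discrepancy is second order in the block scale
```

Halving the block scale roughly quarters the error, consistent with the higher-order interaction terms being negligible.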
3 Landscape Analysis of the MeanField Model
In the following, we show that the landscape of a deep residual network enjoys the remarkable property that any local optimum is global, by comparing the gradient of the deep residual network with the mean-field model of a two-layer neural network [mei2018mean, chizat2019lazy, nitanda2017stochastic]. To estimate the accuracy of the first-order approximation (i.e. linearization), we apply the adjoint sensitivity analysis [boltyanskiy1962mathematical] and show that the difference between the gradients of the two models can be bounded via the stability constant of the backward adjoint equation. More precisely, the goal is to show that the backward adjoint equation affects the gradient only up to a bounded constant.
3.1 Gradient via the Adjoint Sensitivity Method
Adjoint Equation.
To optimize the objective (2), we calculate the gradient via the adjoint sensitivity method [boltyanskiy1962mathematical]. To derive the adjoint equation, we first view our generative model, where the distribution is treated as a parameter, as
(16) 
with
(17) 
The loss function can be written as
(18) 
Define
(19) 
The derivative of with respect to , denoted by the Jacobian , satisfies at any previous time the adjoint equation of the ODE
(20) 
Next, the perturbation of by
is given by the chain rule as
(21)  
where (the derivative of with respect to ) satisfies the adjoint equation
which represents the gradient as a second, backwards-in-time augmented ODE. Here the Hamiltonian is defined as .
Utilizing the adjoint equation, we can characterize the gradient of our model with respect to the distribution. More precisely, we characterize the variation of the loss function with respect to the distribution in the following theorem.
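The adjoint computation can be illustrated on a scalar toy problem: for an Euler-discretized ODE, a discrete adjoint (reverse) sweep recovers the same parameter gradient as finite differences. The dynamics $\dot{x} = \tanh(\theta x)$ and the squared terminal loss below are illustrative stand-ins, not the paper's model:

```python
import numpy as np

def forward(theta, x0=1.0, y=2.0, n=100):
    # Euler discretization of dx/dt = tanh(theta * x) on [0, 1], plus a
    # squared loss on the terminal state.
    h = 1.0 / n
    xs = [x0]
    for _ in range(n):
        xs.append(xs[-1] + h * np.tanh(theta * xs[-1]))
    return xs, 0.5 * (xs[-1] - y) ** 2

def adjoint_grad(theta, x0=1.0, y=2.0, n=100):
    # Discrete adjoint sweep: propagate p_k = p_{k+1} * d x_{k+1} / d x_k
    # backwards in time, accumulating dL/dtheta = sum_k p_{k+1} * d x_{k+1} / d theta.
    h = 1.0 / n
    xs, _ = forward(theta, x0, y, n)
    p = xs[-1] - y                # terminal condition: dL / d x_n
    g = 0.0
    for k in reversed(range(n)):
        s = 1.0 - np.tanh(theta * xs[k]) ** 2   # sech^2(theta * x_k)
        g += p * h * s * xs[k]    # parameter sensitivity at step k
        p *= 1.0 + h * theta * s  # pull adjoint back one step
    return g

theta, eps = 0.7, 1e-6
_, lp = forward(theta + eps)
_, lm = forward(theta - eps)
fd = (lp - lm) / (2 * eps)        # central finite-difference gradient
ag = adjoint_grad(theta)
```

The two gradients agree to high precision, which is the discrete analogue of backpropagation simulating the continuous adjoint equation.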
Theorem 2 (Gradient of the parameter)
For let
Then for every , we have
for the convex combination with .
Proof
To simplify the notation, we write . From Theorem 1 (the well-posedness of the model), we know that the function is continuous, and thus
Now we bound . First, notice that the adjoint equation is a linear equation:
with solution
Next, we bound in order to show that . The difference is estimated via Duhamel’s principle.
At the same time we have
Here and the last equality holds because . This leads us to
Thus
Combining with the definition of the adjoint equation and , we have
3.2 Landscape Analysis
In this section we aim to show that the proposed model enjoys a benign landscape in the Wasserstein geometry. Specifically, we can always find a descent direction around any point whose loss is strictly larger than 0, which means that every local minimum is a global one.
Theorem 3
If
for some probability distribution
which concentrates on one of the nested sets , then there exists a descent direction s.t.
Proof
First, we lower bound the gradient with respect to the feature map by the loss function, to show that changing the feature map can always lead to a lower loss. A similar observation was made in [bartlett2018representing, bartlett2019gradient].
Lemma 1
The norm of the solution to the adjoint equation can be bounded by the loss
Proof
By definition,
which implies that .
By assumption there exists a constant such that
Integrating the inequality above with respect to over , and using the fact that , one obtains that .
Recall that solves the adjoint equation
(22) 
where by the assumption on and the above bound on , we have for any
It then follows from Gronwall’s inequality that