A Mean-field Analysis of Deep ResNet and Beyond: Towards Provable Optimization Via Overparameterization From Depth

03/11/2020 ∙ by Yiping Lu, et al. ∙ Duke University, Princeton University, Stanford University

Training deep neural networks with stochastic gradient descent (SGD) often achieves zero training loss on real-world tasks, although the optimization landscape is known to be highly non-convex. To understand the success of SGD for training deep neural networks, this work presents a mean-field analysis of deep residual networks, building on a line of works that interpret the continuum limit of a deep residual network as an ordinary differential equation as the depth tends to infinity. Specifically, we propose a new continuum limit of deep residual networks, which enjoys a good landscape in the sense that every local minimizer is global. This characterization enables us to derive the first global convergence result for multilayer neural networks in the mean-field regime. Our proof does not assume convexity of the loss landscape; instead, it relies on a zero-loss property of the global minimizer that holds whenever the model enjoys a universal approximation property. Key to our result is the observation that a deep residual network resembles a shallow network ensemble, i.e. a two-layer network. We bound the difference between the shallow network and our ResNet model via the adjoint sensitivity method, which enables us to apply existing mean-field analyses of two-layer networks to deep networks. Furthermore, we propose several novel training schemes based on the new continuous model, including one training procedure that switches the order of the residual blocks and results in strong empirical performance on benchmark datasets.







1 Introduction

Neural networks have become state-of-the-art models in numerous machine learning tasks and strong empirical performance is often achieved by deeper networks. One landmark example is the residual network (ResNet)

[he2016deep, he2016identity]

, which can be efficiently optimized even at extremely large depths such as 1000 layers. However, there exists a gap between this empirical success and the theoretical understanding: ResNets can be trained to almost zero loss with standard stochastic gradient descent, yet it is known that larger depth leads to an increasingly non-convex landscape even in the presence of residual connections

[yun2019deep]. While global convergence can be obtained in the so-called “lazy” regime e.g. [jacot2018neural, du2018gradient], such kernel models cannot capture fully-trained neural networks [suzuki2018adaptivity, chizat2019lazy, ghorbani2019limitations].

In this work, we aim to demonstrate the provable optimization of ResNet beyond the restrictive “lazy” regime. To do so, we build upon recent works that connect ordinary differential equation (ODE) models to infinite-depth neural networks [weinan2017proposal, lu2017beyond, sonoda2017double, haber2017stable, chen2018neural, dupont2019augmented, zhang2018dynamically, thorpe2018deep, sonoda2019transport, lu2019understanding]. Specifically, each residual block of a ResNet can be written as $x_{l+1} = x_l + \frac{1}{L} f(x_l, \theta_l)$, which can be seen as the Euler discretization of the ODE $\dot{x}(t) = f(x(t), \theta(t))$. This turns training the neural network into solving an optimal control problem [li2017maximum, weinan2019mean, liu2019deep]

, under which backpropagation can be understood as simulating the adjoint equation 

[chen2018neural, li2017maximum, pmlr-v80-li18b, zhang2019you, li2020scalable]. However, this analogy does not directly provide guarantees of global convergence even in the continuum limit.
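The block-as-Euler-step correspondence above can be sketched numerically. The drift `f` below is a toy scalar stand-in (a single tanh unit), not the paper's residual architecture:

```python
import numpy as np

def f(x, theta):
    """Toy residual-block function (hypothetical stand-in)."""
    return np.tanh(theta * x)

def resnet_forward(x0, thetas):
    """L residual blocks: x_{l+1} = x_l + (1/L) f(x_l, theta_l),
    i.e. L forward-Euler steps of dx/dt = f(x, theta(t)) with step 1/L."""
    L = len(thetas)
    x = x0
    for theta in thetas:
        x = x + f(x, theta) / L
    return x
```

As the depth L grows, the output of `resnet_forward` converges to the time-1 solution of the corresponding ODE, which is exactly the continuum-limit picture used throughout the paper.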

Figure 1: Illustration that a ResNet behaves like a shallow network ensemble, i.e. a two-layer overparameterized neural network. The high-level intuition is to show that the gradients of the two models are on the same scale when their losses are comparable.

To address the problem of global convergence, we propose a new limiting ODE model of ResNets. Formally, we model deep ResNets via a mean-field ODE model

This model considers every residual block as a particle and optimizes over the empirical distribution $\rho$ of the particles $(\theta_l, t_l)$, where $\theta_l$ denotes the weight of the $l$-th residual block and $t_l$ denotes its layer index. We consider properties of the loss landscape with respect to the distribution of weights, an approach similar to [bengio2006convex, bach2017breaking]. Inspired by the observation of [veit2016residual] that a deep ResNet behaves like an ensemble of shallow models, we compare a deep ResNet with its two-layer counterpart and show that the gradients of the two models are close to each other. This leads us to conclude that, although the loss landscape may not be convex, every local minimizer is a global one.
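A minimal numerical sketch of this mean-field forward pass, assuming particles are (weight, layer-index) pairs and using a toy scalar drift in place of the residual block:

```python
import numpy as np

def f(theta, x):
    """Toy residual map (hypothetical stand-in for a two-layer block)."""
    return np.tanh(theta * x)

def meanfield_forward(x0, particles, n_steps):
    """Euler scheme for dx/dt = ∫ f(theta, x) rho(theta | t) dtheta,
    where rho is the empirical distribution of (theta, t) particles:
    theta is a block's weight, t its normalized layer index.
    Assumes n_steps <= number of particles."""
    parts = sorted(particles, key=lambda p: p[1])   # order by layer index t
    groups = np.array_split(parts, n_steps)         # particles per time bin
    dt = 1.0 / n_steps
    x = x0
    for g in groups:
        # empirical average of f over the particles in this time slice
        drift = np.mean([f(th, x) for th, _ in g])
        x = x + dt * drift
    return x
```

With many particles per time slice, this averaging over weights is the discrete analogue of integrating against the conditional distribution of $\theta$ at layer-time $t$.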

1.1 Contribution

Our contributions can be summarized as follows:

  • We derive a new continuous depth limit of deep ResNets. In this new model, each residual block is regarded as a particle and the training dynamics are captured by a gradient flow on the distribution $\rho$ of the particles.

  • We analyze the loss landscape with respect to $\rho$ and show that all local minima attain zero loss, which indicates that every local optimum is global. This property leads to the conclusion that a stationary point of the Wasserstein gradient flow with full support is a global optimum. To the best of our knowledge, this is the first global convergence result for multi-layer neural networks in the mean-field regime without a convexity assumption on the loss landscape.

  • We propose novel numerical schemes to approximate the mean-field limit of deep ResNets and demonstrate that they achieve superior empirical results on real-world datasets.

1.2 Related Work

Mean-Field Limit and Global Convergence.

Recent works have explored the global convergence of two-layer neural networks by studying suitable scaling limits of stochastic gradient descent when the width is sent to infinity and the second layer is scaled by one over the width [nitanda2017stochastic, mei2018mean, rotskoff2018neural, chizat2018global, sirignano2019mean]. Though global convergence can be obtained under certain conditions for two-layer networks, it is highly nontrivial to extend this framework to multi-layer neural networks: recent attempts [araujo2019mean, sirignano2019mean, nguyen2019mean, fang2019convex] either do not address realistic neural architectures directly or do not provide conditions for global convergence. Parallel to the mean-field regime, [jacot2018neural, du2018gradient, allen2018convergence, zou2018stochastic, oymak2019towards] provided global convergence results for multi-layer networks in the so-called “lazy” or kernel regime. However, this description of deep neural networks is rather limited: the scaling of the initialization forces the distance traveled by each parameter to vanish asymptotically [chizat2019lazy], and thus training becomes equivalent to kernel regression with respect to the neural tangent kernel [arora2019exact, jacot2018neural]. On the other hand, it is well known that properly trained neural networks can outperform kernel models in learning various target functions [wei2019regularization, suzuki2018adaptivity, ghorbani2019limitations, ba2020generalization, allen2019can]. In contrast, the mean-field regime considered in this work does not reduce training to kernel regression; in other words, the mean-field setting allows neurons to travel and learn adaptive features.

Landscape of ResNets.

[li2017convergence, liu2019towards] provided convergence results for gradient descent on two-layer residual neural networks and showed that the global minimum is unique. In parallel, [shamir2018resnets, kawaguchi2019depth] showed that when the network consists of a single residual block, the gradient descent solution is provably better than a linear classifier. However, recent work has also pointed out that these positive results may not hold for deep ResNets composed of multiple residual blocks. Regarding deeper models, [hardt2016identity, bartlett2019gradient, wu2019global] proved the global convergence of gradient descent for training deep linear ResNets. Yet it is known that even mild nonlinear activation functions can destroy these good landscape properties [yun2018small]. In addition, [bartlett2018representing] considered a ResNet model composed of close-to-identity functions and provided a convergence result with respect to the Fréchet gradient. However, [bartlett2018representing] also pointed out that such a conclusion may no longer hold for a realistic ResNet model. Our paper fills this gap by introducing a new continuous model and providing conditions for global convergence beyond the previously considered kernel regime [du2018gradient, zhang2019training, allen2018convergence].

1.3 Notations and Preliminaries


Let $\delta_x$ denote the Dirac mass at $x$ and $\mathbb{1}_\Omega$ the indicator function of a set $\Omega$. We denote by $\mathcal{P}_2$ the set of probability measures with finite second moment, endowed with the Wasserstein-2 distance (see below for the definition). Let $\mu$ be the population distribution of the input data and $\|\cdot\|_\mu$ the $L^2$ norm induced by $\mu$.

Fréchet Derivative.

We extend the notion of the gradient to infinite-dimensional spaces. For a functional $F$ defined on a Banach space $\mathcal{B}$, the Fréchet derivative at $x \in \mathcal{B}$ is the element $DF(x)$ of the dual space $\mathcal{B}^*$ that satisfies

$$\lim_{\|h\|_{\mathcal{B}} \to 0} \frac{\left| F(x+h) - F(x) - DF(x)(h) \right|}{\|h\|_{\mathcal{B}}} = 0.$$

In this paper, $DF$ is used to denote the Fréchet derivative.

Wasserstein Space.

The Wasserstein-$p$ distance between two probability measures $\mu$ and $\nu$ is defined as

$$W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int \|x - y\|^p \, \mathrm{d}\gamma(x, y) \right)^{1/p}.$$

Here $\Gamma(\mu, \nu)$ denotes the set of all couplings between $\mu$ and $\nu$, i.e., all probability measures on the product space with marginal $\mu$ on the first factor and $\nu$ on the second.
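As a concrete illustration: for empirical measures on the real line with equally many equally weighted atoms, the infimum over couplings is attained by the monotone (sorted) matching, which gives a short sketch:

```python
import numpy as np

def wasserstein_p(xs, ys, p=2):
    """W_p distance between the empirical measures (1/n) sum_i delta_{x_i}
    and (1/n) sum_i delta_{y_i} on the real line. In one dimension the
    optimal coupling pairs the sorted samples."""
    xs, ys = np.sort(xs), np.sort(ys)
    return float(np.mean(np.abs(xs - ys) ** p)) ** (1.0 / p)
```

For instance, translating a measure by a constant $c$ shifts it by exactly $|c|$ in any $W_p$ distance.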

Bounded Lipschitz norm.

We say that a sequence of measures $(\mu_n)$ weakly (or narrowly) converges to $\mu$ if, for every continuous and bounded function $\varphi$, it holds that $\int \varphi \, \mathrm{d}\mu_n \to \int \varphi \, \mathrm{d}\mu$. For sequences that are bounded in total variation norm, this is equivalent to convergence in the bounded Lipschitz norm. The latter is defined, for a signed measure $\mu$, as

$$\|\mu\|_{BL} = \sup \left\{ \int \varphi \, \mathrm{d}\mu \;:\; \mathrm{Lip}(\varphi) + \|\varphi\|_\infty \le 1 \right\},$$

where $\mathrm{Lip}(\varphi)$ is the smallest Lipschitz constant of $\varphi$ and $\|\cdot\|_\infty$ the supremum norm.

2 Limiting Model

Following the observation that each residual block of a ResNet can be considered as one step of the forward Euler approximation of an ODE [weinan2017proposal, lu2017beyond, sonoda2017double, haber2017stable], a series of recent papers [zhang2018dynamically, zhang2019you, chen2018neural, li2020scalable, li2017maximum] analyzed deep neural networks in the continuous limit. [thorpe2018deep] proved the Gamma-convergence of ResNets in the asymptotic limit. However, two points of that approach require further investigation. First, [thorpe2018deep] introduced a regularization term whose strength grows with the depth of the network. This regularization becomes stronger as the network gets deeper, which implies a more constrained space of functions that the network can represent.

Second, while the Gamma-convergence result concerns the convergence of the global minima of a sequence of energy functionals, it gives rather little information about the landscape of the limiting functional, which can be quite complicated for non-convex objectives. Later work [avelin2019neural] proved that stochastic gradient descent on a deep ResNet with constant weights across layers converges to the gradient flow of the loss in the ODE model. However, forcing the weights of the ResNet to be the same across all layers weakens the approximation power and complicates the optimization landscape. To explain the global convergence of the gradient flow, in this section we propose a new continuous limiting model of the deep residual network.

2.1 A New Continuous Model

The goal is to minimize the loss function

$$E(\rho) = \mathbb{E}_{x \sim \mu} \left\| g(X_\rho(x, 1)) - y(x) \right\|^2 \tag{2}$$

over parameter distributions $\rho(\theta, t)$ with $\theta$ in a compact set and $t \in [0, 1]$. Here $X_\rho(x, t)$ is the solution of the ODE

$$\frac{\partial}{\partial t} X_\rho(x, t) = \int f(\theta, X_\rho(x, t)) \, \rho(\theta, t) \, \mathrm{d}\theta, \qquad X_\rho(x, 0) = A x. \tag{3}$$

The ODE (3) is understood in the integrated sense, i.e., for a fixed distribution $\rho$ and input $x$, the solution path satisfies

$$X_\rho(x, t) = A x + \int_0^t \int f(\theta, X_\rho(x, s)) \, \rho(\theta, s) \, \mathrm{d}\theta \, \mathrm{d}s.$$

Here $y$ is the function to be estimated. The parameter $A$ represents the first convolution layer in the ResNet [he2016deep, he2016identity], which extracts features before sending them to the residual blocks. To simplify the analysis, we set $A$ to a predefined linear transformation (i.e., the first-layer parameters are not trained) under the technical assumption that the singular values of $A$ are bounded above and bounded below away from zero. We remark that this assumption is not unrealistic; for example, [oyallon2017scaling] let the first layer be a predefined wavelet transform and still achieved state-of-the-art results on several benchmark datasets. Here $f(\theta, \cdot)$ is the residual block with parameter $\theta$ that aims to learn a feature transformation. For simplicity, we assume that the residual block is a two-layer neural network, $f(\theta, x) = U \sigma(W x + b)$ with $\theta = (U, W, b)$, where $\sigma$ is an activation function such as the sigmoid or ReLU. Note that in our notation the activation function is applied separately to each component of the vector. Finally, $g$ is a pooling operator that maps the final feature to the classification output, and an $\ell_2$ loss is used, for example. We also assume that $g$ is a predefined linear transform with bounded operator norm, which can easily be achieved via operators used in realistic architectures, such as global average pooling [lin2013network]. Before starting the analysis, we first list the necessary regularity assumptions.

  1. (Boundedness of data and target distribution) The input data $x$ lies $\mu$-almost surely in a compact ball, i.e. $\|x\| \le c_1$ for some constant $c_1$. At the same time, the target function is also bounded: $\|y(x)\| \le c_2$ for some constant $c_2$.

  2. (Lipschitz continuity of distribution with respect to depth) There exists a constant $C$ such that

    $W_2\!\left( \rho(\cdot, t_1), \rho(\cdot, t_2) \right) \le C \, |t_1 - t_2|$

    for all $t_1, t_2 \in [0, 1]$.

  3. The kernel induced by the residual block is a universal kernel [micchelli2006universal], i.e. the span of $\{ f(\theta, \cdot) : \theta \}$ is dense in $L^2(\mu)$.

  4. (Locally Lipschitz derivative with sub-linear growth [chizat2018global]) There exists a family $\{Q_r\}_{r > 0}$ of nested nonempty closed convex subsets of the parameter space that satisfies:

    • $\{\theta : \mathrm{dist}(\theta, Q_r) \le r'\} \subset Q_{r + r'}$ for all $r, r' > 0$.

    • There exist constants $C_1, C_2$ such that

      $\sup_{\theta \in Q_r} \|f(\theta, x)\| \le C_1 + C_2 r$

      holds for all $r > 0$ and all admissible $x$. Also, the gradient of $f$ with respect to $x$ is a Lipschitz function on $Q_r$ with Lipschitz constant $L_r$.

    • For each $r > 0$, the gradient with respect to the parameter is also bounded,

      $\sup_{\theta \in Q_r} \|\nabla_\theta f(\theta, x)\| \le C_r,$

      for some constant $C_r$.

Remark 1

Let us elaborate on these assumptions in the neural network setting. The universal kernel assumption holds for the sigmoid and ReLU activation functions. The local regularity assumption on the residual-block function can easily be satisfied. For instance, one possible solution is to utilize an activation function with a Lipschitz gradient and set the local set $Q_r$ to be a ball of radius $r$ centered at the origin.

Under these assumptions, we can establish the existence, uniqueness, stability, and well-posedness of our forward model.

Theorem 1 (Well-posedness of the Forward Model)

Suppose Assumption 1 holds, and further assume that there exists a constant $r$ such that $\rho$ is concentrated on one of the nested sets $Q_r$. Then the ODE in (3) has a unique solution on $[0, 1]$ for any initial condition. Moreover, for any pair of distributions $\rho_1$ and $\rho_2$ concentrated on $Q_r$, there exists a constant $C$ such that

$$\sup_x \left\| X_{\rho_1}(x, 1) - X_{\rho_2}(x, 1) \right\| \le C \, W_2(\rho_1, \rho_2),$$

where $W_2(\rho_1, \rho_2)$ is the 2-Wasserstein distance between $\rho_1$ and $\rho_2$.


We first show the existence and uniqueness of $X_\rho$. From now on, let

$$\bar f_\rho(x, t) := \int f(\theta, x) \, \rho(\theta, t) \, \mathrm{d}\theta.$$

Then the ODE (3) becomes

$$\frac{\partial}{\partial t} X_\rho(x, t) = \bar f_\rho(X_\rho(x, t), t),$$

and by the condition of the theorem and the assumptions in Section 2.1 we have

$$\left\| \bar f_\rho(x, t) \right\| \le \sup_{\theta \in Q_r} \|f(\theta, x)\| \le C_{\max}.$$

This is because the continuous function $f$ is now defined on a domain in which $\theta$ lies in the compact set $Q_r$ and $x$ lies in a compact ball, which yields an upper bound $C_{\max}$ such that $\|f(\theta, x)\| \le C_{\max}$ holds on this domain. The notation $C_{\max}$ will be used repeatedly in what follows.

Hence, the averaged drift is bounded. On the other hand, it is integrable with respect to $t$ and Lipschitz continuous with respect to $x$ in any bounded region (by the local regularity assumption of Section 2.1). Therefore, restricting attention to a bounded region containing the initial condition, the existence and uniqueness theorem for ODEs (the Picard–Lindelöf theorem) shows that the solution of the rewritten ODE exists and is unique on $[0, 1]$.
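The Picard–Lindelöf argument invoked here is constructive: the integral map $x(\cdot) \mapsto x_0 + \int_0^t \bar f(x(s), s)\,\mathrm{d}s$ is iterated to its unique fixed point. A minimal numerical sketch with a hypothetical drift (the drift must be Lipschitz in $x$ and vectorized over the time grid):

```python
import numpy as np

def picard_solve(fbar, x0, n_grid=200, n_iter=30):
    """Picard iteration for x(t) = x0 + int_0^t fbar(x(s), s) ds on [0, 1],
    discretized on a uniform grid with trapezoidal integration.
    Converges when fbar is Lipschitz in x (Picard-Lindelof)."""
    ts = np.linspace(0.0, 1.0, n_grid)
    dt = ts[1] - ts[0]
    x = np.full(n_grid, x0, dtype=float)   # initial guess: constant path
    for _ in range(n_iter):
        drift = fbar(x, ts)
        # cumulative trapezoid rule: integral of drift from 0 to each t
        integral = np.concatenate([[0.0], np.cumsum(0.5 * (drift[1:] + drift[:-1]) * dt)])
        x = x0 + integral
    return ts, x
```

For the linear drift $\bar f(x, t) = x$ the iteration reproduces the exponential solution $x(t) = x_0 e^{t}$ up to discretization error.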

Next, we show the continuity of $X_\rho$ with respect to $\rho$ (recall $\bar f_\rho(x, t) = \int f(\theta, x) \rho(\theta, t) \, \mathrm{d}\theta$). Letting $\Delta(t) := \|X_{\rho_1}(x, t) - X_{\rho_2}(x, t)\|$, we have

$$\Delta(t) \le \int_0^t \left\| \bar f_{\rho_1}(X_{\rho_1}(x, s), s) - \bar f_{\rho_1}(X_{\rho_2}(x, s), s) \right\| \mathrm{d}s + \int_0^t \left\| \bar f_{\rho_1}(X_{\rho_2}(x, s), s) - \bar f_{\rho_2}(X_{\rho_2}(x, s), s) \right\| \mathrm{d}s. \tag{8}$$

For the first term in (8), since both $X_{\rho_1}$ and $X_{\rho_2}$ are controlled by the bound $C_{\max}$, the local regularity assumption of Section 2.1 gives the Lipschitz condition

$$\left\| \bar f_{\rho_1}(X_{\rho_1}(x, s), s) - \bar f_{\rho_1}(X_{\rho_2}(x, s), s) \right\| \le L_1 \, \Delta(s).$$

For the second term of (8), since $f(\theta, x)$ is Lipschitz continuous with respect to $\theta$ and bounded on the relevant compact domain, the function $\theta \mapsto f(\theta, x)$ is Lipschitz with a constant $L_2$ uniform in $x$, which implies

$$\left\| \bar f_{\rho_1}(X_{\rho_2}(x, s), s) - \bar f_{\rho_2}(X_{\rho_2}(x, s), s) \right\| \le L_2 \, W_2(\rho_1, \rho_2).$$

Combining the two bounds with (8) gives

$$\Delta(t) \le L_1 \int_0^t \Delta(s) \, \mathrm{d}s + L_2 \, t \, W_2(\rho_1, \rho_2).$$

Applying Gronwall's inequality gives

$$\Delta(t) \le L_2 \, t \, e^{L_1 t} \, W_2(\rho_1, \rho_2),$$

and specifically for $t = 1$ we have

$$\left\| X_{\rho_1}(x, 1) - X_{\rho_2}(x, 1) \right\| \le L_2 \, e^{L_1} \, W_2(\rho_1, \rho_2).$$


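For reference, the integral form of Grönwall's inequality used in the last step can be stated as follows:

```latex
\text{If } u \ge 0 \text{ satisfies } u(t) \le \alpha(t) + b \int_0^t u(s)\,\mathrm{d}s
\ \text{ for all } t \in [0, 1],
\text{ with } b \ge 0 \text{ and } \alpha \text{ nondecreasing, then}
\quad u(t) \le \alpha(t)\, e^{b t}.
```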
2.2 Deep Residual Network Behaves Like an Ensemble Of Shallow Models

In this section, we briefly explain the intuition behind our analysis, i.e. that a deep residual network can be approximated by a two-layer neural network. [veit2016residual] introduced an unraveled view of ResNets and showed that deep ResNets behave like ensembles of shallow models. Here, we offer a formal derivation to reveal the connection between a deep ResNet and a two-layer neural network. The first residual block is formulated as

$$x_1 = x_0 + \frac{1}{L} f(\theta_1, x_0).$$

By Taylor expansion, the second layer output is given by

$$x_2 = x_1 + \frac{1}{L} f(\theta_2, x_1) = x_0 + \frac{1}{L} f(\theta_1, x_0) + \frac{1}{L} f(\theta_2, x_0) + \frac{1}{L^2} \nabla_x f(\theta_2, x_0) \, f(\theta_1, x_0) + O\!\left(\frac{1}{L^3}\right).$$

Iterating this expansion gives rise to

$$x_L = x_0 + \frac{1}{L} \sum_{l=1}^{L} f(\theta_l, x_0) + \frac{1}{L^2} \sum_{l' < l} \nabla_x f(\theta_l, x_0) \, f(\theta_{l'}, x_0) + \cdots.$$

Here we only keep the terms that are at most quadratic in $1/L$. A similar derivation shows that at order $k$ in $1/L$ there are $\binom{L}{k}$ terms with coefficient $L^{-k}$ each. This implies that the $k$-th order term decays as $\binom{L}{k} L^{-k} \le 1/k!$, suggesting that one can approximate a deep network by keeping a few leading orders.
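The combinatorial decay claim can be checked numerically, assuming (as in the unraveled expansion above) that the order-$k$ terms number $\binom{L}{k}$, each with coefficient $L^{-k}$:

```python
from math import comb, factorial

def order_k_weight(L, k):
    """Total weight of the order-k terms in the unraveled expansion:
    C(L, k) terms, each carrying a coefficient (1/L)^k.
    Since C(L, k) / L^k = (1/k!) * prod_{i<k} (1 - i/L), this is
    always at most 1/k!."""
    return comb(L, k) / L**k
```

The factorial bound explains why truncating the expansion at a few leading orders incurs a rapidly vanishing error, uniformly in the depth L.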

3 Landscape Analysis of the Mean-Field Model

In the following, we show that the landscape of a deep residual network enjoys the extraordinary property that any local optimum is global, by comparing the gradient of the deep residual network with that of the mean-field model of a two-layer neural network [mei2018mean, chizat2019lazy, nitanda2017stochastic]. To estimate the accuracy of the first-order approximation (i.e. linearization), we apply adjoint sensitivity analysis [boltyanskiy1962mathematical] and show that the difference between the gradients of the two models can be bounded via the stability constant of the backward adjoint equation. More precisely, the goal is to show that the backward adjoint equation changes the gradient only by a bounded factor.

3.1 Gradient via the Adjoint Sensitivity Method

Adjoint Equation.

To optimize the objective (2), we calculate the gradient via the adjoint sensitivity method [boltyanskiy1962mathematical]. To derive the adjoint equation, we first view our generative model, in which $\rho$ is treated as a parameter, as

$$\frac{\partial}{\partial t} X_\rho(x, t) = \int f(\theta, X_\rho(x, t)) \, \rho(\theta, t) \, \mathrm{d}\theta, \qquad X_\rho(x, 0) = A x.$$

The loss function can be written as

$$E(\rho) = \mathbb{E}_{x \sim \mu} \left\| g(X_\rho(x, 1)) - y(x) \right\|^2.$$

The derivative of $X_\rho(x, 1)$ with respect to $X_\rho(x, t)$, denoted by the Jacobian $J(t)$, satisfies at any previous time $t$ the adjoint equation of the ODE,

$$\frac{\mathrm{d}}{\mathrm{d}t} J(t) = -J(t) \, \nabla_x \! \int f(\theta, X_\rho(x, t)) \, \rho(\theta, t) \, \mathrm{d}\theta.$$

Next, the perturbation of the loss $E$ by a variation $\delta\rho$ of the distribution is given by the chain rule as

$$\delta E = \mathbb{E}_{x \sim \mu} \int_0^1 \!\! \int p(t)^\top f(\theta, X_\rho(x, t)) \, \delta\rho(\theta, t) \, \mathrm{d}\theta \, \mathrm{d}t,$$

where $p(t)$ (the derivative of the loss with respect to $X_\rho(x, t)$) satisfies the adjoint equation

$$\dot p(t) = -\left( \nabla_x \! \int f(\theta, X_\rho(x, t)) \, \rho(\theta, t) \, \mathrm{d}\theta \right)^{\!\top} p(t), \qquad p(1) = 2 \, g^\top \! \left( g(X_\rho(x, 1)) - y(x) \right),$$

which represents the gradient as a second, backwards-in-time augmented ODE. Here the Hamiltonian is defined as $H(x, p, \theta) = p^\top f(\theta, x)$.
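To make the adjoint computation concrete, the following toy sketch differentiates a scalar ODE with a single parameter via the discrete adjoint of the forward Euler scheme. The drift, loss, and parameterization are illustrative stand-ins, not the paper's model:

```python
import numpy as np

def forward(theta, x0, n=1000):
    """Forward Euler for dx/dt = tanh(theta * x) on [0, 1] (toy drift)."""
    dt = 1.0 / n
    xs = [x0]
    for _ in range(n):
        xs.append(xs[-1] + dt * np.tanh(theta * xs[-1]))
    return xs

def adjoint_grad(theta, x0, y, n=1000):
    """Gradient of E = 0.5 * (x(1) - y)^2 w.r.t. theta via the discrete
    adjoint: propagate p backwards from p(1) = x(1) - y through the
    transposed linearized dynamics, accumulating dE/dtheta along the way."""
    dt = 1.0 / n
    xs = forward(theta, x0, n)
    p = xs[-1] - y                           # terminal adjoint p(1) = dE/dx(1)
    grad = 0.0
    for k in range(n - 1, -1, -1):
        x = xs[k]
        s = 1.0 / np.cosh(theta * x) ** 2    # sech^2 = derivative of tanh
        grad += dt * p * s * x               # df/dtheta = x * sech^2(theta*x)
        p += dt * p * s * theta              # backward step: df/dx = theta*sech^2
    return grad
```

Because the discrete adjoint is exactly reverse-mode differentiation of the Euler loop, the result agrees with a finite-difference check to high precision.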

Utilizing the adjoint equation, we can characterize the gradient of our model with respect to the distribution $\rho$. More precisely, we characterize the variation of the loss function with respect to the distribution in the following theorem.

Theorem 2 (Gradient of the parameter)

For fixed $\rho$, let

Then for every , we have

for a convex combination of the two distributions.


To simplify the notation, we use the shorthand above. From Theorem 1 (the well-posedness of the model), we know that the function in question is continuous, and thus

Now we bound the adjoint variable. First, notice that the adjoint equation is a linear equation:

with solution

Next, we bound the difference in order to control the gradient. The way to estimate the difference is to utilize Duhamel's principle.
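For reference, Duhamel's principle (variation of constants) for a generic linear, time-varying system takes the form:

```latex
\dot p(t) = A(t)\, p(t) + b(t)
\quad \Longrightarrow \quad
p(t) = \Phi(t, t_0)\, p(t_0) + \int_{t_0}^{t} \Phi(t, s)\, b(s)\, \mathrm{d}s,
```

where $\Phi(t, s)$ is the state-transition matrix of the homogeneous system $\dot p = A(t) p$. Comparing two linear systems whose generators differ then yields a difference controlled by the size of the perturbation times the stability constant of $\Phi$.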

At the same time we have

Here, the last equality holds by the terminal condition of the adjoint equation. This leads us to


Combining this with the definition of the adjoint equation, we have

3.2 Landscape Analysis

In this section we aim to show that the proposed model enjoys a good landscape in the Wasserstein geometry. Specifically, we can always find a descent direction around any point whose loss is strictly larger than 0, which means that every local minimum is a global one.

Theorem 3


Suppose the loss is strictly positive for some probability distribution $\rho$ which concentrates on one of the nested sets $Q_r$. Then there exists a descent direction, i.e., a perturbation of $\rho$ along which the loss strictly decreases.


First, we lower bound the gradient with respect to the feature map by the loss function, to show that changing the feature map can always lead to a lower loss. A similar observation was made by [bartlett2018representing, bartlett2019gradient].

Lemma 1

The norm of the solution to the adjoint equation can be bounded by the loss


By definition,

which implies that the norm of the terminal adjoint value is controlled by the loss.

By assumption there exists a constant such that

Integrating the inequality above with respect to $x$ over the data distribution, and using the norm condition on $g$, one obtains the desired bound.

Recall that $p(t)$ solves the adjoint equation


where, by the regularity assumption and the above bound, we have for any $t$

It then follows from Gronwall's inequality that