Neural networks have become state-of-the-art models in numerous machine learning tasks, and strong empirical performance is often achieved by deeper networks. One landmark example is the residual network (ResNet) [he2016deep, he2016identity], which can be efficiently optimized even at extremely large depths such as 1000 layers. However, there is a gap between this empirical success and the theoretical understanding: ResNets can be trained to almost zero loss with standard stochastic gradient descent, yet it is known that larger depth leads to an increasingly non-convex landscape even in the presence of residual connections [yun2019deep]. While global convergence can be obtained in the so-called "lazy" regime, e.g. [jacot2018neural, du2018gradient], such kernel models cannot capture fully-trained neural networks [suzuki2018adaptivity, chizat2019lazy, ghorbani2019limitations].
In this work, we aim to demonstrate the provable optimization of ResNets beyond the restrictive "lazy" regime. To do so, we build upon recent works that connect ordinary differential equation (ODE) models to infinite-depth neural networks [weinan2017proposal, lu2017beyond, sonoda2017double, haber2017stable, chen2018neural, dupont2019augmented, zhang2018dynamically, thorpe2018deep, sonoda2019transport, lu2019understanding]. Specifically, each residual block of a ResNet can be written as $x_{l+1} = x_l + \frac{1}{L} f(x_l, \theta_l)$, which can be seen as the forward-Euler discretization of the ODE $\dot{X}(t) = f(X(t), \theta(t))$. This turns training the neural network into solving an optimal control problem [li2017maximum, weinan2019mean, liu2019deep], under which backpropagation can be understood as simulating the adjoint equation [chen2018neural, li2017maximum, pmlr-v80-li18b, zhang2019you, li2020scalable]. However, this analogy does not directly provide guarantees of global convergence even in the continuum limit.
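To make the block-as-Euler-step correspondence concrete, the sketch below runs a deep ResNet forward pass as a forward-Euler scheme with step size $1/L$. The tanh two-layer branch, sizes, and random weights are illustrative assumptions for the sketch, not the architecture analyzed later.

```python
import numpy as np

def f(x, theta):
    """Toy residual branch: a two-layer map W2 @ tanh(W1 @ x)."""
    W1, W2 = theta
    return W2 @ np.tanh(W1 @ x)

def resnet_forward(x, thetas):
    """Deep ResNet: x_{l+1} = x_l + (1/L) f(x_l, theta_l).

    This is exactly the forward-Euler discretization of the ODE
    dx/dt = f(x, theta(t)) on [0, 1] with step size 1/L.
    """
    L = len(thetas)
    for theta in thetas:
        x = x + f(x, theta) / L
    return x

rng = np.random.default_rng(0)
d, L = 4, 100
thetas = [(rng.standard_normal((d, d)), rng.standard_normal((d, d)))
          for _ in range(L)]
out = resnet_forward(rng.standard_normal(d), thetas)
```

Doubling the depth while keeping the $1/L$ scaling refines the time grid rather than changing the underlying dynamics, which is what makes the continuum limit meaningful.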
To address the problem of global convergence, we propose a new limiting ODE model of ResNets. Formally, we model deep ResNets via the mean-field ODE
$$\dot{X}(t) = \int_\Omega f(X(t), \theta)\, \rho(\theta, t)\, d\theta.$$
This model considers every residual block as a particle and optimizes over the empirical distribution $\rho(\theta, t)$ of particles, where $\theta$ denotes the weight of the residual block and $t$ denotes the (continuous) layer index of the residual block. We consider properties of the loss landscape with respect to the distribution of weights, an approach similar to [bengio2006convex, bach2017breaking]. Inspired by the observation of [veit2016residual] that a deep ResNet behaves like an ensemble of shallow models, we compare a deep ResNet with its counterpart two-layer network and show that the gradients of the two models are close to each other. This leads us to conclude that, although the loss landscape may not be convex, every local minimizer is a global one.
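The particle picture can be sketched numerically: the velocity of the state at layer time $t$ is the average of the residual branch over the particles active near $t$. Everything below (the tanh branch, the particle count, the time grid) is an illustrative assumption for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_particles, n_steps = 4, 32, 50

def f(x, theta):
    """Toy residual branch used by every particle."""
    W1, W2 = theta
    return W2 @ np.tanh(W1 @ x)

# Each particle is a pair (theta, t): a block weight and its layer time.
particles = [((rng.standard_normal((d, d)), rng.standard_normal((d, d))),
              rng.uniform()) for _ in range(n_particles)]

def mean_field_forward(x):
    """Euler scheme for the mean-field dynamics: the velocity at time t is
    the average of f(x, theta) over particles whose layer time falls in the
    current time slice (an empirical stand-in for the conditional law)."""
    dt = 1.0 / n_steps
    for step in range(n_steps):
        t0, t1 = step * dt, (step + 1) * dt
        active = [th for th, t in particles if t0 <= t < t1]
        if active:
            x = x + dt * np.mean([f(x, th) for th in active], axis=0)
    return x

out = mean_field_forward(rng.standard_normal(d))
```

As the number of particles grows, the empirical average over each slice approximates the integral against the particle distribution at that layer time.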
Our contributions can be summarized as follows:
We derive a new continuous-depth limit of deep ResNets. In this new model, each residual block is regarded as a particle, and the training dynamics are captured by the gradient flow on the distribution $\rho$ of the particles.
We analyze the loss landscape with respect to $\rho$ and show that all local minima have zero loss, which indicates that every local optimum is a global one. This property leads to the conclusion that a full-support stationary point of the Wasserstein gradient flow is a global optimum. To the best of our knowledge, this is the first global convergence result for multi-layer neural networks in the mean-field regime without a convexity assumption on the loss landscape.
We propose novel numerical schemes to approximate the mean-field limit of deep ResNets and demonstrate that they achieve superior empirical results on real-world datasets.
1.2 Related Work
Mean-Field Limit and Global Convergence.
Recent works have explored the global convergence of two-layer neural networks by studying suitable scaling limits of stochastic gradient descent when the width is sent to infinity and the second layer is scaled by one over the width of the network [nitanda2017stochastic, mei2018mean, rotskoff2018neural, chizat2018global, sirignano2019mean]. Though global convergence can be obtained under certain conditions for two-layer networks, it is highly nontrivial to extend this framework to multi-layer neural networks: recent attempts [araujo2019mean, sirignano2019mean, nguyen2019mean, fang2019convex] either do not address realistic neural architectures directly or do not provide conditions for global convergence. Parallel to the mean-field regime, [jacot2018neural, du2018gradient, allen2018convergence, zou2018stochastic, oymak2019towards] provided global convergence results for multi-layer networks in the so-called "lazy" or kernel regime. However, this description of deep neural networks is rather limited: the scaling of the initialization forces the distance traveled by each parameter to vanish asymptotically [chizat2019lazy], and thus training becomes equivalent to kernel regression with respect to the neural tangent kernel [arora2019exact, jacot2018neural]. On the other hand, it is well known that properly trained neural networks can outperform kernel models in learning various target functions [wei2019regularization, suzuki2018adaptivity, ghorbani2019limitations, ba2020generalization, allen2019can]. In contrast, the mean-field regime considered in this work does not reduce training to kernel regression; in other words, the mean-field setting allows neurons to travel further and learn adaptive features.
Landscape of ResNets.
[li2017convergence, liu2019towards] provided convergence results for gradient descent on two-layer residual neural networks and showed that the global minimum is unique. In parallel, [shamir2018resnets, kawaguchi2019depth] showed that when the network consists of one residual block, the gradient descent solution is provably better than a linear classifier. However, recent work also pointed out that these positive results may not hold for deep ResNets composed of multiple residual blocks. Regarding deeper models, [hardt2016identity, bartlett2019gradient, wu2019global] proved the global convergence of gradient descent for training deep linear ResNets. Yet it is known that even mild nonlinear activation functions can destroy these good landscape properties [yun2018small]. In addition, [bartlett2018representing] considered a ResNet model with compositions of close-to-identity functions and provided a convergence result regarding the Fréchet gradient. However, [bartlett2018representing] also pointed out that such a conclusion may no longer hold for a realistic ResNet model. Our paper fills this gap by introducing a new continuous model and providing conditions for global convergence beyond the previously considered kernel regime [du2018gradient, zhang2019training, allen2018convergence].
1.3 Notations and Preliminaries
Let $\delta_x$ denote the Dirac mass at $x$ and $\mathbf{1}_A$ the indicator function of a set $A$. We denote by $\mathcal{P}_2$ the set of probability measures with finite second moment, endowed with the Wasserstein-2 distance (see below for the definition). Let $\mu$ be the population distribution of the input data and $\|\cdot\|_\mu$ the norm induced by $\mu$.
We extend the notion of the gradient to infinite-dimensional spaces. For a functional $F$ defined on a Banach space $B$, the Fréchet derivative $DF(x)$ is an element of the dual space $B^*$ that satisfies
$$F(x + h) = F(x) + \langle DF(x), h \rangle + o(\|h\|) \quad \text{as } \|h\| \to 0.$$
In this paper, $D$ is used to denote the Fréchet derivative.
The Wasserstein-$p$ distance between two probability measures $\mu_1, \mu_2$ is defined as
$$W_p(\mu_1, \mu_2) = \left( \inf_{\gamma \in \Gamma(\mu_1, \mu_2)} \int \|x - y\|^p \, d\gamma(x, y) \right)^{1/p}.$$
Here $\Gamma(\mu_1, \mu_2)$ denotes the set of all couplings between $\mu_1$ and $\mu_2$, i.e., all probability measures on the product space with marginal $\mu_1$ on the first factor and $\mu_2$ on the second.
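For intuition, in one dimension the optimal coupling between two empirical measures with equally many atoms is the monotone rearrangement, so $W_2$ reduces to matching sorted samples. A minimal sketch (the helper name and sample values are our own):

```python
import numpy as np

def w2_empirical_1d(xs, ys):
    """W2 between two 1-D empirical measures with equally many atoms.

    In 1-D the optimal coupling sorts both samples and matches them
    in order (monotone rearrangement)."""
    xs, ys = np.sort(xs), np.sort(ys)
    return np.sqrt(np.mean((xs - ys) ** 2))

a = np.array([0.0, 1.0, 2.0])
print(w2_empirical_1d(a, a))        # identical measures: distance 0.0
print(w2_empirical_1d(a, a + 3.0))  # a shift by c gives W2 = |c|: 3.0
```

In higher dimensions the infimum over couplings must be solved as an optimal transport problem, but the 1-D case already conveys how $W_2$ penalizes how far mass must move.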
Bounded Lipschitz norm.
We say that a sequence of measures $(\mu_n)$ weakly (or narrowly) converges to $\mu$ if, for every continuous and bounded function $f$, it holds that $\int f \, d\mu_n \to \int f \, d\mu$. For sequences that are bounded in total variation norm, this is equivalent to convergence in the bounded Lipschitz norm. The latter is defined, for a (signed) measure $\nu$, as
$$\|\nu\|_{BL} = \sup \left\{ \int f \, d\nu \;:\; \mathrm{Lip}(f) \le 1, \ \|f\|_\infty \le 1 \right\},$$
where $\mathrm{Lip}(f)$ is the smallest Lipschitz constant of $f$ and $\|f\|_\infty$ the supremum norm.
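A simple example illustrating the norm: for two Dirac masses the supremum is attained by a clamped distance function, giving

```latex
\|\delta_x - \delta_y\|_{BL}
  = \sup_{\mathrm{Lip}(f)\le 1,\ \|f\|_\infty \le 1} \big( f(x) - f(y) \big)
  = \min\big( |x - y|,\, 2 \big),
```

so $\delta_{x_n}$ converges to $\delta_x$ in bounded Lipschitz norm exactly when $x_n \to x$, whereas the total variation distance stays equal to $2$ whenever $x_n \neq x$. This is why the bounded Lipschitz norm, unlike total variation, metrizes the weak convergence used throughout the analysis.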
2 Limiting Model
Following the observation that each residual block of a ResNet can be considered as one step of the forward Euler approximation of an ODE [weinan2017proposal, lu2017beyond, sonoda2017double, haber2017stable], a series of recent papers [zhang2018dynamically, zhang2019you, chen2018neural, li2020scalable, li2017maximum] analyzed deep neural networks in the continuous limit. [thorpe2018deep] proved the Gamma-convergence of ResNets in the asymptotic limit. However, two points of that approach require further investigation. First, [thorpe2018deep] introduced a regularization term whose coefficient depends on the depth of the network. This regularization becomes stronger as the network gets deeper, which implies a more constrained space of functions that the network can represent.
Second, while the Gamma-convergence result is concerned with the convergence of the global minima of a sequence of energy functionals, it gives rather little information about the landscape of the limiting functional, which can be quite complicated for non-convex objective functions. Later, [avelin2019neural] proved that stochastic gradient descent on a deep ResNet with constant weights across layers converges to the gradient flow of the loss under the ODE model. However, constraining the weights of the ResNet to be the same across all layers weakens the approximation power and makes the optimization landscape more complicated. To understand the reason behind the global convergence of the gradient flow, in this section we propose a new continuous limiting model of the deep residual network.
2.1 A New Continuous Model
The goal is to minimize the loss function
$$E(\rho) = \mathbb{E}_{x \sim \mu} \left[ \tfrac{1}{2} \big( w^\top X(1; x) - y(x) \big)^2 \right] \tag{2}$$
over parameter distributions $\rho(\theta, t)$, where $\theta$ ranges over a compact set $\Omega$ and $t \in [0, 1]$. Here $X(t; x)$ is the solution of the ODE
$$\dot{X}(t; x) = \int_\Omega f(X(t; x), \theta)\, \rho(\theta, t)\, d\theta, \qquad X(0; x) = Ax. \tag{3}$$
The ODE (3) is understood in the integrated sense, i.e., for a fixed distribution $\rho$ and input $x$, the solution path satisfies
$$X(t; x) = Ax + \int_0^t \int_\Omega f(X(s; x), \theta)\, \rho(\theta, s)\, d\theta\, ds.$$
The function $y$ is the target function to be estimated. The parameter $A$ represents the first convolution layer in the ResNet [he2016deep, he2016identity], which extracts features before sending them to the residual blocks. To simplify the analysis, we set $A$ to be a predefined linear transformation (i.e. we do not train the first-layer parameters $A$), with the technical assumption that the singular values of $A$ are bounded away from zero and bounded above, where $\sigma(A)$ denotes the set of singular values. We remark that this assumption is not unrealistic; for example, [oyallon2017scaling] let the first layer be a predefined wavelet transform and still achieved state-of-the-art results on several benchmark datasets. Here $f(\cdot, \theta)$ is the residual block with parameter $\theta$, which aims to learn a feature transformation. For simplicity, we assume that the residual block is a two-layer neural network whose activation function (such as the sigmoid or ReLU) is applied separately to each component of its input vector.
Finally, $w$ is a pooling operator that maps the final feature to the classification result, and an $\ell_2$ loss function is used, for example. We also assume that $w$ is a predefined linear transform with bounded norm, which can be easily achieved via an operator used in realistic architectures such as global average pooling [lin2013network]. Before starting the analysis, we first list the necessary regularity assumptions.
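Putting the pieces together, the following is a minimal sketch of the full forward model: a fixed (untrained) first layer $A$, a stack of two-layer residual blocks with step size $1/L$, an average-pooling readout $w$, and a squared loss. All sizes, the tanh activation, and the weight scales are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d, L = 8, 16, 10

A = rng.standard_normal((d, d_in)) / np.sqrt(d_in)  # fixed first layer
w = np.ones(d) / d                                  # average-pooling readout

blocks = [(rng.standard_normal((d, d)) / d, rng.standard_normal((d, d)) / d)
          for _ in range(L)]

def model(x):
    h = A @ x                           # feature extraction (untrained)
    for W1, W2 in blocks:               # residual blocks, step size 1/L
        h = h + (W2 @ np.tanh(W1 @ h)) / L
    return w @ h                        # pool the final feature to a scalar

x, y_target = rng.standard_normal(d_in), 0.3
loss = 0.5 * (model(x) - y_target) ** 2
```

Only the residual-block weights (the particles) are optimized; $A$ and $w$ stay fixed, matching the setup above.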
(Boundedness of data and target distribution) The input data $x$ lies $\mu$-almost surely in a compact ball, i.e. $\|x\| \le c_x$ for some constant $c_x$. At the same time, the target function is also bounded, $|y(x)| \le c_y$, for some constant $c_y$.
(Lipschitz continuity of the distribution with respect to depth) There exists a constant $C_L$ such that
$$\|\rho(\cdot, t_1) - \rho(\cdot, t_2)\|_{BL} \le C_L |t_1 - t_2|$$
for all $t_1, t_2 \in [0, 1]$.
(Universal kernel) The kernel induced by the residual block is a universal kernel [micchelli2006universal], i.e. the span of $\{f(\cdot, \theta) : \theta \in \Omega\}$ is dense in the space of continuous functions on compact sets.
(Locally Lipschitz derivative with sub-linear growth [chizat2018global]) There exists a family $\{Q_r\}_{r > 0}$ of nested nonempty closed convex subsets of $\Omega$ that satisfies:
(i) $\{\theta : \mathrm{dist}(\theta, Q_r) \le r'\} \subset Q_{r + r'}$ for all $r, r' > 0$.
(ii) There exist constants $C_1, C_2 > 0$ such that
$$\sup_{\theta \in Q_r} \|f(\cdot, \theta)\|_\infty \le C_1 + C_2 r$$
holds for all $r > 0$. Also, the gradient of $f$ with respect to $x$ is a Lipschitz function on $Q_r$, with Lipschitz constant $L_r$.
(iii) For each $r > 0$, the gradient with respect to the parameter is also bounded,
$$\sup_{\theta \in Q_r} \|\nabla_\theta f(x, \theta)\| \le C_r,$$
for some constant $C_r$.
Let us elaborate on these assumptions in the neural network setting. For Assumption 1.4, the universal kernel property holds for the sigmoid and ReLU activation functions. The local regularity Assumption 1.5 concerning the function $f$ can easily be satisfied by a two-layer residual block with a smooth activation. Hence, in order to satisfy the local regularity condition, one possible solution is to utilize an activation function with Lipschitz gradient and set the local set $Q_r$ to be a ball of radius $r$ centered at the origin.
Under these assumptions, we can establish the existence, uniqueness, stability, and well-posedness of our forward model.
Theorem 1 (Well-posedness of the Forward Model)
Under Assumption 1, and further assuming that there exists $r > 0$ such that $\rho$ is concentrated on one of the nested sets $Q_r$, the ODE in (3) has a unique solution on $[0, 1]$ for any initial condition $X(0) = Ax$. Moreover, for any pair of distributions $\rho_1$ and $\rho_2$, there exists a constant $C$ such that
$$\sup_{t \in [0, 1]} \|X_{\rho_1}(t; x) - X_{\rho_2}(t; x)\| \le C\, W_2(\rho_1, \rho_2),$$
where $W_2(\rho_1, \rho_2)$ is the 2-Wasserstein distance between $\rho_1$ and $\rho_2$.
We first show the existence and uniqueness of the solution. From now on, let
$$F(X, t) := \int_\Omega f(X, \theta)\, \rho(\theta, t)\, d\theta.$$
Then, the ODE (3) becomes
$$\dot{X}(t) = F(X(t), t), \tag{6}$$
and by the condition of the theorem and Assumption 2.1, $F$ is bounded. This is because the continuous function $f$ is now defined on a domain where $X$ lies in a compact set and $\theta \in Q_r$, which leads to an upper bound $C_F$ such that $\|F(X, t)\| \le C_F$ holds on this domain. The notation $C_F$ will be used repeatedly in the following sections.
Hence, $F$ is bounded. On the other hand, $F$ is integrable with respect to $t$ and Lipschitz continuous with respect to $X$ in any bounded region (by part 2 of Assumption 2.1). Therefore, consider a bounded region containing the trajectory. By the existence and uniqueness theorem for ODEs (the Picard–Lindelöf theorem), the solution of (6) initialized from $X(0) = Ax$ exists and is unique on $[0, 1]$.
Next, we show the continuity of $X$ with respect to $\rho$. Letting $\Delta(t) := \|X_{\rho_1}(t) - X_{\rho_2}(t)\|$, we have
For the second term of (8), we have
Since $f$ is Lipschitz continuous with respect to $\theta$ and also bounded, the field $F$ is Lipschitz continuous with respect to the measure argument. On the other hand, still by Assumption 2.1, $f$ is Lipschitz with respect to $X$. As a result, the function $F$ is Lipschitz continuous in both arguments on the region of interest, which implies
Finally, by defining
we have by (8)
Applying Gronwall's inequality gives
and specifically for $t = 1$ we have
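Gronwall's inequality is also easy to check numerically: for an $L$-Lipschitz vector field, the gap between two trajectories can grow at most by the factor $e^{Lt}$. A small sketch with an illustrative vector field and constants of our own choosing:

```python
import numpy as np

L_const, dt, T = 1.0, 1e-3, 2.0
f = lambda x: np.sin(x)          # Lipschitz with constant 1

x, y = 0.30, 0.31                # two nearby initial conditions
for _ in range(int(T / dt)):     # forward Euler for both trajectories
    x += dt * f(x)
    y += dt * f(y)

gap0 = 0.01                      # initial gap |x(0) - y(0)|
gronwall_bound = gap0 * np.exp(L_const * T)
# Gronwall: |x(T) - y(T)| <= |x(0) - y(0)| * exp(L * T)
assert abs(x - y) <= gronwall_bound
```

This is exactly the mechanism used above: the difference between solutions satisfies a linear integral inequality, and Gronwall turns it into an exponential-in-time stability bound.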
2.2 Deep Residual Networks Behave Like an Ensemble of Shallow Models
In this section, we briefly explain the intuition behind our analysis, i.e. that a deep residual network can be approximated by a two-layer neural network. [veit2016residual] introduced an unraveled view of ResNets and showed that deep ResNets behave like ensembles of shallow models. First, we offer a formal derivation to reveal the connection between a deep ResNet and a two-layer neural network. The first residual block is formulated as
By Taylor expansion, the second layer output is given by
Iterating this expansion gives rise to
Here we only keep the terms that are at most quadratic in $1/L$. A similar derivation shows that at order $k$ there are $\binom{L}{k}$ terms with coefficient $L^{-k}$ each. This implies that the $k$-th order term is bounded by $\binom{L}{k} L^{-k} \le 1/k!$, which decays rapidly in $k$, suggesting that one can approximate a deep network by keeping a few leading orders.
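The leading-order claim can be probed numerically: shrinking the residual branches by a factor $\varepsilon$ should shrink the gap between the full ResNet and its first-order, two-layer-like approximation roughly like $\varepsilon^2$. A toy sketch, with all sizes, the tanh branch, and the scale factors being illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 4, 50
thetas = [(rng.standard_normal((d, d)), rng.standard_normal((d, d)))
          for _ in range(L)]

def f(x, theta, eps):
    W1, W2 = theta
    return eps * (W2 @ np.tanh(W1 @ x))

def resnet(x, eps):
    """Full deep ResNet with blocks scaled by eps."""
    for th in thetas:
        x = x + f(x, th, eps) / L
    return x

def two_layer(x, eps):
    """First-order 'unraveled' approximation: x + (1/L) sum_l f(x, theta_l),
    i.e. every block sees the unperturbed input — a two-layer network."""
    return x + sum(f(x, th, eps) for th in thetas) / L

x0 = rng.standard_normal(d)
gaps = {}
for eps in (0.1, 0.01):
    gaps[eps] = np.linalg.norm(resnet(x0, eps) - two_layer(x0, eps))
```

Since each block's input deviates from $x_0$ by $O(\varepsilon)$ and $f$ is Lipschitz, the gap is $O(\varepsilon^2)$: dividing $\varepsilon$ by 10 shrinks the gap by roughly a factor of 100 in this experiment.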
3 Landscape Analysis of the Mean-Field Model
In the following, we show that the landscape of the deep residual network enjoys the remarkable property that any local optimum is global, by comparing the gradient of the deep residual network with that of the mean-field model of a two-layer neural network [mei2018mean, chizat2019lazy, nitanda2017stochastic]. To estimate the accuracy of the first-order approximation (i.e. linearization), we apply the adjoint sensitivity analysis [boltyanskiy1962mathematical] and show that the difference between the gradients of the two models can be bounded via the stability constant of the backward adjoint equation. More precisely, the goal is to show that the backward adjoint equation affects the gradient only by a bounded constant factor.
3.1 Gradient via the Adjoint Sensitivity Method
To optimize the objective (2), we calculate the gradient via the adjoint sensitivity method [boltyanskiy1962mathematical]. To derive the adjoint equation, we first view our generative model, in which $\rho$ is treated as a parameter, as
The loss function can be written as
The derivative of the final state with respect to the state at time $t$, denoted by the Jacobian, satisfies at any previous time the adjoint equation of the ODE
Next, the perturbation of the loss induced by a perturbation of the distribution $\rho$
is given by the chain rule as
where $p(t)$ (the derivative of the loss with respect to the state $X(t)$) satisfies the adjoint equation
which represents the gradient as a second, backwards-in-time augmented ODE. Here the Hamiltonian is defined as $H(X, p, \theta) = p^\top f(X, \theta)$.
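As a sanity check of the adjoint computation, consider the scalar ODE $\dot{x} = \theta x$ on $[0, 1]$ with loss $x(1)^2$, for which the gradient with respect to $\theta$ is known in closed form: $2 T x_0^2 e^{2\theta T}$. The self-contained sketch below (all constants illustrative) integrates forward, then runs the adjoint backwards in time and accumulates the gradient.

```python
import numpy as np

# Adjoint sensitivity for dx/dt = theta * x on [0, T], loss = x(T)**2.
theta, x0, T, n = 0.5, 1.0, 1.0, 20000
dt = T / n

# Forward pass, storing the trajectory for the backward sweep.
xs = [x0]
for _ in range(n):
    xs.append(xs[-1] + dt * theta * xs[-1])

# Backward (adjoint) pass: da/dt = -a * df/dx = -a * theta, a(T) = 2 x(T).
a = 2.0 * xs[-1]
grad = 0.0
for k in range(n, 0, -1):
    grad += dt * a * xs[k - 1]   # accumulate a(t) * df/dtheta = a * x
    a += dt * a * theta          # integrate the adjoint backwards in time

analytic = 2.0 * T * x0**2 * np.exp(2.0 * theta * T)
print(grad, analytic)  # agree up to discretization error
```

The adjoint variable plays exactly the role of $p(t)$ above: it transports the sensitivity of the loss backwards through the dynamics, and the gradient is the time integral of $p(t)$ against the parameter derivative of the vector field.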
Utilizing the adjoint equation, we can characterize the gradient of our model with respect to the distribution $\rho$. More precisely, we characterize the variation of the loss function with respect to the distribution in the following theorem.
Theorem 2 (Gradient of the parameter)
Then for every $\nu \in \mathcal{P}_2$, we have
for the convex combination $\rho_s = (1 - s)\rho + s\nu$ with $s \in [0, 1]$.
To simplify the notation, we write $X_s$ for the solution associated with $\rho_s$. From Theorem 1 (the well-posedness of the model), we know that this solution is a continuous function of $s$, and thus
Now we bound this quantity. First, notice that the adjoint equation is a linear equation:
Next, we bound the difference between the two adjoint solutions. The way to estimate the difference is to utilize Duhamel's principle.
At the same time we have
Here the last equality holds by the terminal condition of the adjoint equation. This leads us to
Combining this with the definition of the adjoint equation and the bound above, we have
3.2 Landscape Analysis
In this section we aim to show that the proposed model enjoys a good landscape in the Wasserstein geometry. Specifically, we can always find a descent direction around any point whose loss is strictly larger than 0, which means that every local minimum is a global one.
If the loss is strictly positive at some probability distribution $\rho$ which concentrates on one of the nested sets $Q_r$, then there exists a descent direction $\nu$ such that
First, we lower bound the gradient with respect to the feature map by the loss function, to show that changing the feature map can always lead to a lower loss. A similar observation was made by [bartlett2018representing, bartlett2019gradient].
The norm of the solution to the adjoint equation can be bounded by the loss,
which implies the desired lower bound.
By assumption, there exists a constant $c > 0$ such that
Integrating the inequality above with respect to $t$ over $[0, 1]$, and using the fact established above, one obtains the desired bound.
Recall that the adjoint variable solves the adjoint equation
where, by the assumption on $f$ and the above bound, we have for any $t$
It then follows from Gronwall's inequality that