ResNet and Neural ODEs
Ever since the very popular ResNet paper was published, several authors have observed that the ResNet architecture is structurally similar to the Euler discretization of an ODE ([10, 26]). However, the original ‘ResNet block’ considered in that paper is defined as
are linear operators. The activation function
is applied component-wise on vectors in. is the number of layers used in the construction. For standard neural networks the operators
are matrices, but for convolutional neural networks (CNNs) the operators are discrete convolution operators.
Obviously, (1.1) cannot be considered as the Euler discretization of an ODE, as the activation function is applied after the addition of and . However, removing this activation function and instead considering the difference equation
we see that this is the Euler discretization (with time-step 1) of the ODE
and where is the short notation for . Note that in (1.3) the time-step is 1 and hence the time horizon will be the number of layers . This timescale is not optimal in the sense that if the ODE is stable then the system will be attracted to zero as .
In this paper we consider (general) autonomous ODEs as in (1.3), with time independent and a time horizon of , i.e.
This type of model is called a Neural ODE (NODE) and has recently garnered a lot of attention as it has been shown to work very well in practice. Naturally, we also consider the Euler discretization of (1.4), which can be written as the difference equation
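The relation between the finite layer model and the ODE can be illustrated numerically. The following is a minimal sketch, assuming a hypothetical scalar vector field f(x) = w2 * tanh(w1 * x + b) with weights shared across layers, a time horizon of 1, and a step of 1/N; none of these specific choices (the function f, the parameter values, the fine reference discretization) are taken from the paper.

```python
import math

def f(x, theta):
    """Hypothetical scalar vector field f_theta(x) = w2 * tanh(w1 * x + b)."""
    w1, b, w2 = theta
    return w2 * math.tanh(w1 * x + b)

def resnet_forward(x0, theta, n_layers, horizon=1.0):
    """Forward Euler with step horizon / n_layers; all layers share theta."""
    h = horizon / n_layers
    x = x0
    for _ in range(n_layers):
        x = x + h * f(x, theta)
    return x

theta = (1.5, 0.2, -0.8)  # illustrative values only
# A very fine discretization stands in for the exact ODE solution at time 1.
reference = resnet_forward(1.0, theta, 100_000)
errors = {n: abs(resnet_forward(1.0, theta, n) - reference) for n in (10, 20, 40, 80)}
for n, err in errors.items():
    print(n, err)
```

In this toy setting the error should roughly halve each time the number of layers doubles, which is the first-order behaviour expected from the forward Euler scheme.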
Empirical risk minimization
Assume that we are given data distributed according to a probability measure where , and , . Let be a non-negative convex function. The learning problem for a model , where indicates that the function is parameterized by weights , can be formulated as
where is often referred to as the risk or the risk function. In practice, the risk we have to minimize is the empirical risk, and it is a well-established fact that for neural networks the minimization problem in (1.6) is, in general, a non-convex minimization problem [33, 2, 35, 8]. As such, many search algorithms may get trapped at, or converge to, local minima which are not global minima. Currently, a variety of different methods are used in deep learning when training the model, i.e. when trying to find an approximate solution to the problem in (1.6); we refer to  for an overview of various methods. One of the most popular methods, and perhaps the standard way of approaching the problem in (1.6), is back-propagation using stochastic gradient descent, see  for a more recent outline of the method. While much emphasis is put on back-propagation in the deep learning community, from a theoretical perspective it does not matter if we use a forward or a backward mode of auto-differentiation.
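As a hedged illustration of empirical risk minimization by stochastic gradient descent, the sketch below uses a hypothetical one-parameter model h(x) = theta * x with squared loss on synthetic data. This toy problem is convex, unlike the neural network case discussed above, and the model, data, learning rate, and step count are all assumptions made for the example.

```python
import random

def model(x, theta):
    """Hypothetical one-parameter model h_theta(x) = theta * x."""
    return theta * x

def empirical_risk(theta, data):
    """Average squared loss over the sample."""
    return sum((model(x, theta) - y) ** 2 for x, y in data) / len(data)

def sgd(theta, data, lr=0.05, steps=2000, seed=0):
    """Plain SGD using one randomly drawn sample per step."""
    rng = random.Random(seed)
    for _ in range(steps):
        x, y = rng.choice(data)
        grad = 2.0 * (model(x, theta) - y) * x  # derivative of (theta * x - y)^2 in theta
        theta -= lr * grad
    return theta

# Synthetic data: y = 3 x plus small Gaussian noise (illustrative only).
data_rng = random.Random(1)
data = [(x, 3.0 * x + data_rng.gauss(0.0, 0.1))
        for x in (data_rng.uniform(-1.0, 1.0) for _ in range(200))]
theta_hat = sgd(0.0, data)
print(theta_hat, empirical_risk(theta_hat, data))
```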
Continuous approximation of SGD
where is a covariance matrix and is a standard -dimensional Wiener process defined on a probability space. The idea of approximating stochastic gradient descent with a continuous time process has been noted by several authors, see [3, 4, 6, 13, 30, 31]. A special case of what we prove in this paper, see Theorem 2.7 below, is that the stochastic gradient descent (1.7) used to minimize the risk for the ResNet model in (1.5) converges to the stochastic gradient descent used to minimize the risk for the Neural ODE model in (1.4). This convergence is proved in the sense of expectation with respect to the random initialization of the weights in the stochastic gradient descent. Furthermore, we prove that the corresponding discrepancy errors decay as where is the number of layers or discretization steps.
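A continuous-time form of stochastic gradient descent of this kind can be simulated with the Euler-Maruyama scheme. The sketch below assumes a scalar SDE whose drift is the negative gradient of a hypothetical quadratic risk R(theta) = theta^2 / 2 and whose noise level sigma is constant; the risk, the parameter values, and the function names are illustrative assumptions and not the setting of the paper.

```python
import math
import random

def grad_risk(theta):
    """Gradient of the hypothetical risk R(theta) = theta**2 / 2."""
    return theta

def euler_maruyama(theta0, sigma, horizon, n_steps, rng):
    """Simulate d(theta) = -grad_risk(theta) dt + sigma dW on [0, horizon]."""
    dt = horizon / n_steps
    theta = theta0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Wiener increment over one step
        theta = theta - grad_risk(theta) * dt + sigma * dw
    return theta

rng = random.Random(0)
samples = [euler_maruyama(2.0, 0.5, 5.0, 500, rng) for _ in range(1000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)
```

For this quadratic risk the process is of Ornstein-Uhlenbeck type, so the sample variance at a large time should be close to sigma^2 / 2.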
Novelty and significance
It is fair to say that, in general, there are very few papers making more fundamental and theoretical contributions to the understanding of deep learning and, more specifically, of ResNet-like neural networks. However, in the latter case there is a strand of recent and interesting contributions. In  the authors allow the parameters of the model to be layer- and time-dependent, resulting in non-autonomous ODEs with corresponding Euler discretization:
In particular, based on assumptions on , more restrictive than those we use, it is proved in  that as the number of layers tends to infinity in (1.8), the risk associated to (1.8) and defined as in (1.6) converges in the sense of gamma convergence to the risk associated to the corresponding (continuous) ODE in (1.8): we refer to Theorem 2.1 in  and to  for an introduction to gamma convergence. The authors obtain that the minima for finite layer networks converge to minima of the continuous limit, infinite layer, network. To prove that the limit exists and has nice properties they introduce a regularization which penalizes the norm of the difference between the weights in subsequent layers. We emphasize that in  the authors only consider the convergence of minimizers and not the convergence of the actual optimization procedure. In  the authors study the limit problem directly and reformulate the problem as an optimal control problem for an ODE acting on measures. However, the complexity of such networks can be quite substantial due to the time-dependency of the weights, and it is unclear what would be the best way to construct a penalization such that the limit has nice properties.
As we mentioned before, in  the authors consider the autonomous ODE in creftype 1.4, i.e. they make the assumption that all layers share the same weights, and they develop two things. Firstly, they develop an adjoint equation that allows them to approximately compute the gradient with a depth independent memory cost. Secondly, they show through numerical examples that the approach works surprisingly well for some problems.
In general, the upshot of the ODE approach is the increased regularity, since trajectories are continuous and do not intersect and, for autonomous ODEs, they are reversible, see . However, the increased regularity comes at a cost, as Neural ODEs can have difficulties solving certain classification problems, see Section 7.
Our main contribution is that we establish, in the context of minimization by stochastic gradient descent, a theoretical foundation for considering Neural ODEs as the deep limit of ResNets.
Overview of the paper
The rest of the paper is organized as follows. In Section 2 we introduce the necessary formalism and notation and state our results: Theorems 2.10, 2.7 and 2.5. In Section 3 we estimate, for fixed, the discretization error arising as a consequence of the Euler scheme, and we prove some technical estimates. In Section 4 we collect and develop the results concerning stochastic differential equations and Fokker-Planck equations that are needed in the proofs of Theorems 2.10, 2.7 and 2.5. Section 5 is devoted to the Fokker-Planck equations for the probability densities associated to the stochastic gradient descent for the Euler scheme and the continuous ODE, respectively. We establish some slightly delicate decay estimates for these densities, of Gaussian nature, assuming that the initialization density has compact support: see Lemma 5.1 below. In Section 6 we prove Theorems 2.10, 2.7 and 2.5. In Section 7 we discuss a number of numerical experiments. These experiments indicate that in practice the rate of convergence is highly problem dependent, and that it can be considerably faster than indicated by our theoretical bounds. Finally, Section 8 is devoted to a few concluding remarks.
Acknowledgment. The authors were partially supported by a grant from the Göran Gustafsson Foundation for Research in Natural Sciences and Medicine.
2. Statement of main results
Our main results concern (general) ResNet type deep neural networks defined on the interval (). To outline our setup we consider
is a potentially high-dimensional vector of parameters acting as a degree of freedom. Given we consider as divided into intervals each of length , and we define , , recursively as
We define whenever . We will not indicate the dependency on , when it is unambiguous.
We are particularly interested in the case when is a general vector valued (deep) neural network having parameters but in the context of ResNets a specification for is, as discussed in the introduction,
where are parameters and is a globally Lipschitz activation function. However, our arguments rely only on certain regularity and growth properties of . We will formulate our results using the following classes of functions.
Let , , be a non-decreasing function. We say that the function is in regularity class if
whenever , , .
Some remarks are in order concerning Definition 2.1. Firstly, it contains the prototype neural network in (2.2). Secondly, it is essentially closed under compositions, see Lemma 2.2 below. Therefore, finite layer neural networks satisfy Definition 2.1. We defer the proof of Lemma 2.2 to Section 3.
Let . Then for .
Certain results in this paper require us to control the second derivatives of the risk. We therefore also introduce the following class of functions.
Let , , be a non-decreasing function. We say that the function is in regularity class if and if there exists a polynomial such that
whenever , .
Let , then for .
Given a probability measure , , , , and with defined as in Section 2, we consider the penalized risk
where is a hyper-parameter and is a non-negative and convex regularization. The finite layer model in Section 2 is, as described in Section 1, the forward Euler discretization of the autonomous system of ordinary differential equations
where . Given data from the distribution and with solving the system of Neural ODEs in creftype 2.4, we consider the penalized risk
Throughout this paper we will assume that moments of all orders are finite for the probability measure . By construction the input data, , as well as are vectors in . In the case when the output data is , we need to modify the and by performing final transformations of and achieving outputs in . These modifications are trivial to incorporate and throughout the paper we will in our derivations therefore simply assume that .
for . Throughout the paper we will assume, for simplicity, that the constant covariance matrix has full rank, something which, in reality, may not be the case, see . We want to understand in what sense approximates as . To answer this, we first need to make sure that and exist. Note that and can, as functions of , grow exponentially: simply consider the ODE which has as a solution. This creates problems as the setting does not fit the standard theory of SDEs, see , a theory which most commonly requires that the growth of the drift term is at most linear. However, if the drift terms are confining potentials, i.e.
then we have existence and uniqueness for the SDEs in Section 2, see Section 4. In particular, if we have a bound on the growth of , then, as we will see, we can choose to have sufficiently strong convexity to ensure the existence of a large constant such that the drift terms are confining potentials in the sense stated.
If we choose so that and are convex outside some large ball then and can be seen as bounded perturbations of strictly convex potentials, see the proof of Theorem 2.5. Using this we can use the log-Sobolev inequality and hyper-contractivity properties of certain semi-groups, see Section 4, to obtain very good tail bounds for the densities of and . In fact, these tail bounds are good enough for us to prove that the expected risks are bounded (the expectation is over trajectories) and that in probability. However, these bounds do not seem to be strong enough to give us quantitative convergence estimates for the difference . The main reason for this is that, even though we have good tail bounds for the densities of and , we do not have good estimates for in terms of . The following is our first theorem.
Let , , be a non-decreasing function and assume that . Given , , there exists a regularizing function such that if we consider the corresponding penalized risks and , defined using and with , then and are bounded perturbations of strictly convex functions. Furthermore, given and a compactly supported initial distribution for , we have
Theorem 2.5 remains true but with a different rate of convergence, if we replace with .
There are a number of ways to introduce more restrictive assumptions on the risk in order to strengthen the convergence and in order to obtain quantitative bounds for the difference . Our approach is to truncate the loss function. This can be done in several ways, but a very natural choice is to simply restrict the hypothesis space by not allowing weights with too large norm. Specifically, we let be a large degree of freedom, and we consider
instead of and , where is a smooth function such that if and if .
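A smooth cut-off of the kind used in such a truncation can be built from the standard function exp(-1/t). The sketch below is only an illustration under stated assumptions: the thresholds (value 1 up to a radius, value 0 beyond twice that radius), the multiplicative form of the truncation, and the names cutoff and truncated are hypothetical and not taken from the paper.

```python
import math

def _g(t):
    """Smooth function that vanishes for t <= 0 and is positive for t > 0."""
    return math.exp(-1.0 / t) if t > 0 else 0.0

def cutoff(r, radius):
    """Smooth cut-off: 1 for r <= radius, 0 for r >= 2 * radius (assumed thresholds)."""
    t = (2.0 * radius - r) / radius  # equals 1 at r = radius and 0 at r = 2 * radius
    return _g(t) / (_g(t) + _g(1.0 - t))

def truncated(risk, radius):
    """Return theta -> cutoff(|theta|, radius) * risk(theta) for a list-valued theta."""
    def wrapped(theta):
        norm = math.sqrt(sum(t * t for t in theta))
        return cutoff(norm, radius) * risk(theta)
    return wrapped

toy_risk = lambda theta: 7.0  # hypothetical constant risk, for illustration only
small = truncated(toy_risk, 1.0)([0.3, 0.4])  # norm 0.5, inside the radius
large = truncated(toy_risk, 1.0)([3.0, 4.0])  # norm 5.0, beyond twice the radius
print(small, large, cutoff(1.5, 1.0))
```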
Having truncated the loss functions we run continuous forms of SGDs
to minimize the modified risks. Using this setup, assuming also that when is large, the modified risks in Section 2 will satisfy quadratic growth estimates at infinity, and the modified risk will be globally Lipschitz. As a consequence all the tools from classical stochastic differential equations are at our disposal, see Section 4. This allows us to prove that in the sense of mean square convergence. However, the classical SDE theory still does not seem to allow us to easily conclude that converges in any meaningful way to . To overcome this difficulty we develop a PDE based approach to obtain further estimates for the densities of and , and their differences. In particular, we prove the following theorem.
Let , , be a non-decreasing function and assume that . Let and be fixed and assume that has compact support in , . Assume also that
on , and for some . Then there exists a positive and finite constant , depending on the function as well as , , , , , , , and , such that
Furthermore, if , then
To prove Theorem 2.7 we develop certain estimates for , , i.e. for the probability densities associated to , , by exploring the associated Fokker-Planck equations: see Section 5. In fact, we prove several estimates which give that in a very strong sense, stronger than is initially apparent from the statement of Theorem 2.7. Particular consequences of our proofs are the estimates
whenever , and where . In particular, these estimates indicate that , and have Gaussian tails away from the (compact) support of .
The estimates (2.9) and (2.10) can be interpreted from a probabilistic point of view. Firstly, the bound (2.9) for is equivalent to , which implies that all moments of are finite. Secondly, we can interpret (2.10) as the total variation distance between and being of order with respect to the measure .
A direct consequence of (2.10) is the following corollary which states that is a weak order 1 approximation of .
Let be as in Theorem 2.7. Let be a continuous function satisfying the growth condition
for some polynomial of order . Then there exists a constant depending on as well as , , , , , , , and such that
Using our estimates we can also derive the following theorem from the standard theory.
Let , , be a non-decreasing function and assume that . Let and be fixed and assume that . Then
3. ResNet and the Neural ODE: error analysis
In this section we will estimate the error formed by the discretization of the ODE. This will be done on the level of the trajectory as well as on the level of the first and second derivatives with respect to . Note that by construction, see (2.4) and Section 2, we have
for all . We assume consistently that the paths and start at for , and are driven by the parameters . We will in the following, for simplicity, use the notation , , . Recall that and are introduced in Section 2 using the cut-off parameter . Let and . In this section we prove estimates on the discrepancy error between the trajectories of and the discrete trajectories . We begin by bounding the difference.
Let , , be a non-decreasing function and assume that as in Definition 2.1. Then there exists a function such that
holds for all , and such that
holds for all .
To start the proof we first write
To bound the second term in (3.3) we note that
where for , and . Equation (3.7) can be rewritten as . Iterating this inequality we see that
for any . Elementary calculations give us
Choosing small enough i.e. we deduce
Iterating the above inequality times we obtain
It remains to prove (3.2). To start the proof, note that the first term on the left in (3.2) is already bounded by (3.10). Thus, we only need to establish that is bounded. We first note, using the definition of and Definition 2.1, that
By the triangle inequality we get by rearrangement that
This is again an estimate of the form , where . Iterating this we obtain as in (3.8) that
and by elementary calculations as in (3.9) we can conclude that
This proves the final estimate stated in (3.2) and finishes the proof. ∎
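The iteration step used in the proof is a discrete Gronwall-type argument. As a hedged numerical illustration, assume a recursion of the generic form a_{k+1} = (1 + c/N) a_k + b/N^2 with a_0 = 0, where the constants c and b are hypothetical and not those appearing in the proof. Iterating N times gives a_N = (b/N) ((1 + c/N)^N - 1) / c, which is at most (b/N) (e^c - 1) / c, so the accumulated error is of order 1/N.

```python
import math

def iterate_recursion(c, b, n):
    """Iterate a_{k+1} = (1 + c/n) * a_k + b / n**2 from a_0 = 0 for n steps."""
    a = 0.0
    for _ in range(n):
        a = (1.0 + c / n) * a + b / n ** 2
    return a

def gronwall_bound(c, b, n):
    """Closed-form upper bound (b / n) * (exp(c) - 1) / c for the recursion above."""
    return (b / n) * (math.exp(c) - 1.0) / c

for n in (10, 100, 1000):
    print(n, iterate_recursion(1.0, 2.0, n), gronwall_bound(1.0, 2.0, n))
```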
We next upgrade the previous lemma to the level of the gradient of the trajectories with respect to .
Let , , be a non-decreasing function and assume that as in Definition 2.1. Then there exists a function such that
holds for all , and such that