1 Introduction
Dynamical systems are widely applied to characterize scientific principles and phenomena in various fields such as physics, biology, chemistry, economics, etc. In many situations, the observational data are accessible, whereas the underlying dynamics remain elusive. Datadriven discovery of dynamical systems is, therefore, an important research direction. There have been extensive study on datadriven discovery using Gaussian processes [Kocijan2005, Raissi2017_1, Raissi2017_2, Raissi2018_2], symbolic regression [Schmidt2009], Ssystems formalism [Daniels2015], sparse regression [Brunton2016, Rudy2017, Zhang2018, Zhang2021], numerical PDE analysis [Kang2019], statistical learning [Lu2019]
, etc. Recently, along with the rapid advancements of deep learning, the discovery of dynamics using neural networks has also been proposed
[Raissi2018_1, Tipireddy2019, Rudy2019, Gulgec2019, HARLIM2021109922, Qin2019, Sun2019, Long2019, Wu2020]. This paper studies highorder schemes for the discovery of dynamics using deep learning.In numerical analysis, developing highorder methods is an important topic in many applications. In solving dynamical systems, highorder discretization techniques such as linear multistep methods (LMMs) and RungeKutta methods have been welldeveloped [Atkinson2011, Gautschi1997, Mayers2003]. Recently, LMMs have been employed for the discovery of dynamics [Raissi2018_3, Tipireddy2019, Xie2019, Keller2019]. In [Keller2019], a rigorous framework based on refined notions of consistency and stability is established to yield the convergence of LMMbased discovery for three popular LMM schemes (the AdamsBashforth, AdamsMoulton, and Backwards Differentiation Formula schemes). However, the theory in [Keller2019] is specialized for methods that cannot provide a closedform expression for the governing function, which is needed in many applications. Therefore, this paper studies the convergence theory of LMM schemes and deep learning, which can provide a closedform description of the governing equation.
This paper concentrates on two types of discovery problems. The first type is to do the discovery on a trajectory of the dynamical system as in [Raissi2018_3, Tipireddy2019, Xie2019]. In this case, the observational data are collected from a specific trajectory, and the purpose is to identify the governing function on this trajectory with a closedform expression in a form of a neural network, the parameters of which are trained by minimizing the square residual of the corresponding LMM scheme. Through this work, we can forecast the future behavior of the same dynamics or predict the dynamics on nearby trajectories. The second type is the discovery on a compact region consisting of a bunch of trajectories on which the observational data are collected. The purpose is to identify the governing function in a connected compact region of the domain of the governing equation, which may not have been discussed in the literature.
In this paper, we perform a convergence analysis of these methods based on the LMM framework discussed in [Keller2019]. We first consider the LMMs using an abstract approximation set . The main result indicates that when using a th order LMM with a step size in time, the grid error of the obtained approximate function is bounded by where is the condition number of the corresponding matrix of the LMM and is any upper bound of the approximation error between and the target function to be discovered (Theorem 5.1). Next, based on Theorem 5.1, we develop the error estimate of the networkbased LMMs by using the recently developed approximation theory of deep networks [Shen2019, Shen2020, Lu2020, Shen2020_2]. Note that Theorem 5.1 can also be used for the error estimate of LMMs using other approximation structures. Moreover, in connection with the stability theory developed in [Keller2019], we discuss the situations that is uniformly bounded with respect to . Therefore, the grid error decays to zero as and the network size approaches to infinity.
In numerical examples, several artificial models and physically relevant benchmark problems are solved by the proposed networkbased LMMs. It is observed the numerical error orders are consistent with our theory. We also conduct experiments to simulate the optimization errors in practice and the implicit regularization of deep learning. The results indicate that thanks to the implicit regularization, the networkbased methods without auxiliary initial conditions can still find correct solutions numerically. Moreover, if the LMM scheme is unstable, the corresponding matrix is highly illconditioned, making traditional approximations (e.g., grid functions and polynomials) less robust, while the network approximation with gradient descent still manages to find solutions with similar ranges of errors as some of the stable counterparts.
This paper is organized as follows. In Section 2, background knowledge of dynamical systems and LMMs is introduced. In Section 3, we will introduce the LMM approach for the discovery of dynamics and discuss auxiliary conditions for unique recovery. In Section 4, the networkbased LMM approach with ReLU neural networks is described. In Section 5, we discuss the convergence rate of the preceding approach with various LMM families. Numerical experiments are provided in Section 6 to validate the theoretical results. Finally, we conclude this paper in Section 7.
2 Dynamical Systems
In this section, we introduce some basic notations and definitions, as adopted by [Keller2019]. Most of the materials on LMMs can be found in [dahlquist56, dahlquist63, henrici62].
2.1 Initial Value Problem
Suppose is the dimension of the dynamics, let us consider the following dynamical system with an initial condition
(1)  
(2) 
where
is an unknown vectorvalued state function;
is a given vectorvalued governing function; is a given initial vector. To seek a numerical solution, one usually discretizes the problem by setting equidistant grid points in the time interval. Let be an integer, we define and for . The objective for solving the initial value problem in (1)(2) is to find an approximate value for each when is given.2.2 Linear Multistep Method
LMMs are widely utilized in solving dynamical systems. Suppose are given states, then for can be computed by the following linear step scheme,
(3) 
where for are specified coefficients and is always nonzero. By the scheme, all are evaluated iteratively from to . In each step, are all given or computed previously such that can be computed by solving algebraic equations. If , the scheme is called explicit since does not appear on the right hand side of (3) and can be computed directly by
(4) 
Otherwise, the scheme is called implicit and it requires solving nonlinear equations for . The first value is simply set as , while other initial values need to be computed by other approaches before performing the LMM if . Common types of LMMs include AdamsBashforth (AB) schemes, AdamsMoulton (AM) schemes, and Backwards Differentiation Formula (BDF) schemes.
2.3 Consistency
An LMM is effective for a dynamical system only if it is consistent; that is, the discrete scheme (3) approximates the original differential equation (1) accurately as is small enough. More specifically, we first define the local truncation error as
(5) 
for . Note that is a numeric vector. It is clear that the local truncation error is defined by substituting the true function into the discrete scheme (3), and measures the extent to which the true solution satisfies the discrete equation. By (1), we have the expression
(6) 
3 Discovery of Dynamics
In this section, we introduce the discovery of dynamics on a single trajectory, on which a time series of the state is available. Conventional LMMs with auxiliary conditions for this type of discovery are introduced. Note that these methods can be simply generalized for the discovery on a compact region, which will be discussed in Section 4.5.
The discovery of dynamics is essentially an inverse process of solving a dynamical system (1)(2) with given observations on the state. That is, suppose that only the information of the state at the equidistant time steps are provided, we would like to recover , namely, the governing function of the state.
3.1 Linear Multistep Method
Let and be two vectorvalued functions satisfying the dynamics (1), and we assume and are both unknown. Now given for , the objective is to determine , i.e. to find a closedform expression for or to evaluate for all . One effective approach is to build a discrete relation between and by LMMs [Keller2019], namely,
(9) 
where is an approximation of . Note that (9) directly follows the LMM scheme (3). Different from (3) that evaluates given , (9) computes from the data . It indicates the dynamics discovery is actually an inverse process of solving the dynamical system [Keller2019]. Moreover, we note that the components of can be discovered independently. Thus, in the remainder of this paper, without loss of generality, we work with a scalarvalued system to simplify (9) using notation in a scalar form as the following general equation,
(10) 
It is worth noting that may not be involved in (10) for some indices between and . For example, in AB schemes, does not appear in (10) since . In general, given an LMM, we use and to denote the first and last indices such that and are involved in (10) with nonzero coefficients (correspondingly, and are both nonzero). We also write as the total number of involved in (10). We briefly list , , , and the truncation error orders of AB, AM, and BDF schemes in Table 1.
Method  
step AB  
step AM  
step BDF 
3.2 Auxiliary Conditions
For each linear step method, it is supposed to compute all unknowns by the linear relation (10). In the following, we will use the special notation and bold fonts to denote column vectors of size , distinguishing them from other vectors or vector functions. We write
(11)  
(12) 
and
(13) 
then (10) leads to the following linear system,
(14) 
However, the number of equations and unknowns may not be equal in (10). For AB and AM schemes, it is insufficient to determine by (10) since equations are less than unknowns. This is also implied by that the linear system (14) is underdetermined. For this issue, a natural solution is to provide auxiliary linear conditions to make unique. For example, we can compute certain unknown directly by firstorder (derivative) finite difference method (FDM) using related data. For consistency, the selected FDM should be of the same error order as the LMM. Assume the LMM has order , one straightforward way is to compute the initial unknowns by onesided FDM of order , i.e.,
(15) 
where are the corresponding finite difference coefficients. Note that (15) has the error estimate
(16) 
If we write
(17) 
then combining the LMM (10) and the initial condition (15) leads to the following augmented linear system
(18) 
where
(19) 
with being the identity matrix and
being the zero matrix of size
. Clearly, (18) has a unique solution since the coefficient matrix is lower triangular with nonzero diagonals. Moreover, if , the linear system (18) is sparse.In general, as pointed out in [Keller2019], we can formulate the auxiliary conditions in various ways, not just as discussed above. Different auxiliary conditions, such as initial and terminal conditions have different effects on the stability and the convergence of the method, see further discussions in [Keller2019]. An interesting question is whether the regularization effect provided by the neural network approximations could help mitigate these effects.
4 Neural Network Approximation
In this section, we first introduce the concept of fully connected neural networks (FNNs) and their approximation properties. Next, the networkbased LMMs for the discovery on a trajectory will be presented together with the discussion on implicit regularization. In the end, we will generalize the methods for the discovery on a compact region.
4.1 Preliminaries
We introduce the fully connected neural network (FNN) which is widely used in deep learning. Mathematically speaking, given an activation function
, , and for , an FNN is the composition of simple nonlinear functions, called hidden layer functions, in the following formulation:(20) 
where ; with and for . With the abuse of notations, means that is applied entrywise to a vector to obtain another vector of the same size. is the width of the th layer and is the depth of the FNN. is the set of all parameters in
to determine the underlying neural network. Common types of activation functions include the rectified linear unit (ReLU)
and the sigmoid function
.4.2 Approximation Property
Now let us introduce existing results on the approximation property of ReLU FNNs. Given a function on a compact subset in , we can define the modulus of continuity by
(21) 
where is the Euclidean norm of a vector in . Suppose is any subset in , we define the norm in ,
(22) 
Besides, we define
(23) 
Approximation properties of ReLU FNNs for continuous functions and smooth functions are indicated as follows. Given any and a function on a compact subset of ,

if , there exists a ReLU FNN with width and depth such that
(24) 
if with , there exists a ReLU FNN with width and depth such that
(25)
The estimate (24) directly follows Theorem 4.3 in [Shen2019], and the estimate (25) can be derived from Theorem 1.1 in [Lu2020] by generalizing the results in the regular domain to a compact subset .
Note that the error bounds in (24) and (25
) suffer the curse of dimensionality; namely, they exponentially depend on the dimension of the whole space
. However, if we are only interested in the approximation on a lowdimensional manifold rather than a general compact subset in , the following stronger results can be adopted.Given , , . Let be a compact dimensional Riemannian submanifold having condition number , volume , and geodesic covering regularity , and define the neighborhood of as
(26) 
Suppose is a function defined in ,

if , there exists a ReLU FNN with width and depth such that
(27) 
if with , there exists a ReLU FNN with width and depth such that
(28)
where
(29) 
is an integer such that .
Equation (27) in Proposition 4.2 is an immediate result of Theorem 1.2 in [Shen2019] and Equation (28) can be derived from Theorem 1.1 in [Lu2020] and Theorem 4.4 in [Shen2019] similarly. In Proposition 4.2, both the error bounds and the ReLU FNN sizes depend on instead of so that the curse of dimensionality is lessened. Note that when is closer to 1, is closer to , then the approximation actually occurs in a reduced space with dimension close to instead of the whole space .
The approximation properties of other FNNs are also studied. For example, the properties of the FloorReLU FNN and a special threehiddenlayer FNN can be found in [Shen2020] and [Shen2020_2], respectively. Also, dimensionindependent error bounds of FNNs for the target functions in Barron space are investigated in [Barron1992]. It is also interesting to apply these approximation theories to develop error estimates of dynamics discovery as future work.
4.3 Networkbased Methods for Discovery
Let us review the discovery of dynamics on a single trajectory introduced in Section 3. Indeed, the discovery by conventional LMMs is simple to implement, and the solution can be found by merely solving a linear system. However, the governing function is only computed at prescribed equidistant time steps, and the relation between and the state is still unknown. One strategy to overcome this limitation is to approximate each component of by functions of specific structures such as neural networks, polynomials, splines, etc. The approximate functions can be determined through optimizations and will serve as closedform expressions for . In real applications, once has been recovered with an explicit expression, the future behavior of the on the same trajectory can be forecasted via solving (1)(2) with the given initial condition. On the other hand, the behavior of the on nearby trajectories can also be predicted via solving (1)(2) with perturbed initial conditions.
Among all structures of approximations, it is popular to employ neural networks in the discovery problems. Especially, when is moderately large, it is convenient to use neural networks to approximate the governing functions with highdimensional inputs, which is usually intractable for other structures. Therefore we focus on the networkbased methods in this paper. Note that the proposed methods can be easily generalized for other structures of approximations.
We consider the neural network approximation based on the LMM scheme (10). Generally, we use to denote the set of all neural networks with a specified architecture of a size set . For example, can be the set of all FNNs with the fixed size , where is the depth and is the width. The notation means that some of the numbers in go to infinity.
Now we introduce a network to approximate , an arbitrary component of . The method of neural network approximation can be developed by replacing each with in (10), namely,
(30) 
where for are given sample locations.
Unfortunately, if
is too small, the degree of freedom of
will be less than the number of equations in (30) and, hence, there is no such that (30) is satisfied precisely. Even if is large enough, it is usually intractable to solve (30) for directly because of the nonlinear parametrization of neural networks. Consequently, in practice, we seek by minimizing the residual of (30) under a machine learning framework. Namely, we aim to find
such that(31) 
where
(32) 
However, similar to the underdetermined linear system (14) that has infinitely many solutions, there exist infinitely many sets of real numbers such that providing
(33) 
For each set , if the degree of freedom of is large enough, there is always some such that (33) is satisfied due to overfitting. In this situation, is a global minimizer of . Consequently, admits infinitely many global minimizers, all of which lead to but take distinct values at . It implies a minimizer of might be totally different from the target governing function we aim to approximate.
To ensure the uniqueness of the minimizer in the function space at grid points, we introduce auxiliary conditions and build an augmented loss function based on (
32). For example, the initial condition (15) on the solution network is enforced by solving(34) 
where
(35) 
The augmented optimization above guarantees that for providing , for any .
Indeed, two networks that are equal at grids are not necessarily equal on the whole trajectory
. Fortunately, it is shown for regression problems and partial differential equation problems, deep learning can generalize well
[Kawaguchi2017, Mei2018, Mei2019, Luo2020]. This means the closeness of two networks at a dense set of training inputs can lead to their closeness at other nearby inputs. It can be inferred that for providing for for any as long as is moderately large.4.4 Implicit Regularization
We discuss the implicit regularization effect [Neyshabur2017, Lei2018] of gradient descent in practical deep learning. For regression problems, if we use overparameterized FNNs with the standard random initialization, gradient descent can lead to global convergence with a linear convergence rate under certain conditions [Jacot2018, Chen2019, Du2018, Zhu2019]. Similar results also exist in the problems of solving partial differential equations [Luo2020]. Even though the global convergence could be established with overparametrization, global minimizers are typically not unique. It is interesting to investigate what global minimizers would be identified by gradient descent and how the training process would reduce fitting errors. To answer these questions, it has been shown that, in regression problems, the training of FNN first captures lowfrequency components of the target function and then starts to eliminate the highfrequency fitting error [Xu2019, Luo2019]. Similar work about this spectral bias of deep learning is discussed in [Cao2019, pmlrv97rahaman19a]. In sum, all the above discussions show that neural networks trained by gradient descent in regression problems have an implicit bias towards smooth functions with low frequencies among all possible neural networks that perfectly fit training data.
Now let us consider the preceding networkbased LMM optimizations. Note that the loss function (32) without auxiliary conditions and the loss function (35) with auxiliary conditions are formally close to the loss in regression problems. Especially, for BDF schemes, and , so the loss functions (32) and (35) are exactly the loss. Hence, it is conjectured that the implicit regularization discussed above can also be applied to the LMM optimizations. Namely, the gradient descent tends to find a very smooth function among all global minimizers. Consequently, if the target governing function is also smooth enough, the gradient descent is expected to find a good approximation either through (31) without auxiliary conditions, or through (34) with auxiliary conditions. Numerical experiments in Section 6 will validate this fact.
4.5 Discovery on a Compact Region
The networkbased formulation (34)(35) is specific for the discovery on a single trajectory from which the data are collected. More generally, we can build similar formulations for the discovery on a connected compact region, from which a set of trajectories can be sampled. Through such a discovery, we can recover the whole vector field in this subset.
Suppose is the solution of (1) with initial value . Let be a compact subset in , then is a compact region filled with all trajectories starting from with time period . In practice, suppose we are given a dataset , where is a set of points densely distributed in , and suppose is densely covered by . We aim to use neural networks to approximate the governing function in the whole subset .
Note that (35) is a loss function with respect to one trajectory. For multiple trajectories, we can build a similar loss function by summing up all individual loss functions with respect to each trajectory. Specifically, let be a network that approximates a certain component of the governing function, then we can determine by
(36) 
where
(37) 
and
(38) 
5 Convergence Analysis
In this section, we consider the convergence of the preceding networkbased dynamics discovery using LMMs, namely, the convergence from the global minimizer of the optimization to the exact governing function as and . The optimization with auxiliary initial conditions is taken as a special case for analysis. For the optimization with other auxiliary conditions, a similar argument can be applied.
5.1 Error Estimates on a Trajectory
We consider the error estimation of the discovery on the specific trajectory . For leastsquare optimizations, people are usually interested in the type error estimation. Therefore, let us introduce the following seminorm with a given ,
(39) 
Note that is not a norm in since does not imply in . However, acts as a norm in the space of all grid functions merely defined on [Keller2019].
As discussed above, for a specific LMM, some states in may not be involved in the scheme. For a fair evaluation, it is proper to study the convergence at all involved states, namely, . Therefore, we rewrite as the LMMrelated seminorm
(40) 
Without ambiguity, we use the notation for all LMMs afterwards. If we write in a vector form, namely,
(41) 
then it follows
(42) 
where is the Euclidean norm of a column vector.
First, let us reformulate the optimization (34)(35) with an abstract admissible set, say,
(43) 
where is defined in (35) and is a general nonempty set of functions. We aim to estimate the distance between and .
For a given LMM, recall that defined in (13) is constructed by lining up the LMM coefficients into rows and is defined in (18). We denote the 2condition number of by . The estimation is described as follows.
In the dynamical system (1), suppose and is defined in , a small neighborhood of . Let be an arbitrary component of . Also, let be an integer and , then we have
(44) 
where is a global minimizer of defined by (35) corresponding to an LMM with order ; is any real number such that ; is a constant independent of and .
Given , similar to (5), we can define the componentwise local truncation error by
(45) 
Then by denoting
(46)  
(47) 
we have
(48) 
By the hypothesis that the LMM has order , there exists some independent of such that
(49) 
On the other hand, since , there exists a function such that
(50) 
for all .
Also, write , where is defined in (17). Then by (16), there exists some constant independent of such that
(51) 
Since is a global minimizer of , it satisfies , namely,
(55) 