# Accelerating Neural ODEs with Spectral Elements

This paper proposes the use of spectral element methods canuto_spectral_1988 for fast and accurate training of Neural Ordinary Differential Equations (ODE-Nets; Chen2018NeuralOD). This is achieved by expressing their dynamics as truncated series of Legendre polynomials. The series coefficients, as well as the network weights, are computed by minimizing the weighted sum of the loss function and the violation of the ODE-Net dynamics. The problem is solved by coordinate descent that alternately minimizes, with respect to the coefficients and the weights, two unconstrained sub-problems using standard backpropagation and gradient methods. The resulting optimization scheme is fully time-parallel and results in a low memory footprint. Experimental comparison to standard methods, such as backpropagation through explicit solvers and the adjoint technique Chen2018NeuralOD, on training surrogate models of small and medium-scale dynamical systems shows that it is at least one order of magnitude faster at reaching a comparable value of the loss function. The corresponding testing MSE is one order of magnitude smaller as well, suggesting generalization capabilities increase.

## Authors

• 6 publications
• 7 publications
• 26 publications
• 10 publications
• ### Polynomial Neural Networks and Taylor maps for Dynamical Systems Simulation and Learning

The connection of Taylor maps and polynomial neural networks (PNN) to so...
12/19/2019 ∙ by Andrei Ivanov, et al. ∙ 0

• ### Revealing hidden dynamics from time-series data by ODENet

To understand the hidden physical concepts from observed data is the mos...
05/11/2020 ∙ by Pipi Hu, et al. ∙ 0

• ### Generalization of partitioned Runge–Kutta methods for adjoint systems

This study computes the gradient of a function of numerical solutions of...
03/22/2020 ∙ by Takeru Matsuda, et al. ∙ 0

• ### Spectral Methods - Part 2: A comparative study of reduced order models for moisture transfer diffusive problems

This paper explores in details the capabilities of two model reduction t...
05/16/2017 ∙ by Suelen Gasparin, et al. ∙ 0

• ### On the overfly algorithm in deep learning of neural networks

In this paper we investigate the supervised backpropagation training of ...
07/27/2018 ∙ by Alexei Tsygvintsev, et al. ∙ 0

• ### Model inference for Ordinary Differential Equations by parametric polynomial kernel regression

Model inference for dynamical systems aims to estimate the future behavi...
08/06/2019 ∙ by David K. E. Green, et al. ∙ 12

• ### Time Dependence in Non-Autonomous Neural ODEs

Neural Ordinary Differential Equations (ODEs) are elegant reinterpretati...
05/05/2020 ∙ by Jared Quincy Davis, et al. ∙ 7

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Neural Ordinary Differential Equations (ODE-Nets; Chen et al., 2018

) can learn latent models from observations that are sparse in time. This property has the potential to enhance the performance of neural network predictive models in applications where information is sparse in time and it is important to account for exact arrival times and delays. In complex control systems and model-based reinforcement learning, planning over a long horizon is often needed, while high frequency feedback is necessary for maintaining stability

(Franklin et al., 2014). Discrete-time models, including RNNs (Jain & Medsker, 1999), often struggle to fully meet the needs of such applications due to the fixed time resolution. ODE-Nets have been shown to provide superior performance with respect to classic RNNs on time series forecasting with sparse training data. However, learning their parameters can be computationally intensive. In particular, ODE-Nets are memory efficient but time inefficient. In this paper, we address this bottleneck and propose a novel alternative strategy for ODE-Nets training.

### Summary of contributions.

We propose SNet, a compact representation of ODE-Nets that makes use of a higher-order approximation of its states by means of Legendre polynomials. This is outlined in Section 4. In order to find the optimal polynomial coefficients and network parameters, we develop a novel optimization scheme, which does not require to solve an ODE at each iteration. The resulting algorithm is detailed in Section 3 and is based on backpropagation (Linnainmaa, 1970; Werbos, 1981; Lecun, 1988) and automatic differentiation (Paszke et al., 2017). The proposed method is fully parallel with respect to time and its approximation error reduces exponentially with the Legendre polynomial order (Canuto et al., 1988).

### Summary of numerical experiments.

In Section 5, our method is tested on a 6-state vehicle dynamics identification problem, where it is at least one order or magnitude faster in each optimizer iteration than explicit and adjoint methods, while convergence is achieved in a third of the iterations. At test time, the MSE is reduced by one order of magnitude. In Section 6, the method is used for the identification of a 30-state system consisting of identical vehicles, coupled via a known collision avoidance policy. Again, our method converges in a third of the iterations required by backpropagation thourgh a solver and each iteration is x faster than the fastest explicit scheme.

## 2 Neural ordinary differential equations

The minimization of a scalar-valued loss function that depends on the output of an ODE-Net can be formulated as a general constrained optimization problem:

 minθ∈Rm∫t1t0 L(t,x(t)) dt, (1) s.t. ˙x(t)=f(t,x(t),u(t);θ), x(t0)=x0,

where is the state, is the input, the loss and ODE functions and are given, and the parameters have to be learned. Equation (1

) can be used to represent several inverse problems, for instance in machine learning, estimation, and optimal control

(Stengel, 1994; Law et al., 2015; Ross, 2009). Problem (1) can be solved using gradient-based optimization through several time-stepping schemes for solving the ODE. (Chen et al., 2018; Gholami et al., 2019) have proposed to use the adjoint method when is a neural network. These methods are typically relying on explicit time-stepping schemes (Butcher & Wanner, 1996). Limitations of these approaches are briefly summarized:

### Limitations of backpropagation through an ODE solver.

The standard approach for solving this problem is to compute the gradients using backpropagation through a discrete approximation of the constraints, such as Runge-Kutta methods (Runge, 1895; Butcher & Wanner, 1996) or multi-step solvers (Raissi et al., 2018). This ensures that the solution remains feasible (within a numerical tolerance) at each iteration of a gradient descent method. However, it has several drawbacks: 1) the memory cost of storing intermediate quantities during backpropagation can be significant, 2) the application of implicit methods would require solving a nonlinear equation at each step, 3) the numerical error can significantly affect the solution, and 4) the problem topology can be unsuitable for optimization (Petersen et al., 2018).

ODE-Nets (Chen et al., 2018) solve (1) using the adjoint method, which consists of simulating a dynamical system defined by an appropriate augmented Hamiltonian (Ross, 2009), with an additional state referred to as the adjoint. In the backward pass the adjoint ODE is solved numerically to provide the gradients of the loss function. This means that intermediate states of the forward pass do not need to be stored. An additional step of the ODE solver is needed for the backward pass. This suffers from a few drawbacks: 1) the dynamics of either the hidden state or the adjoint might be unstable, due to the symplectic structure of the underlying Hamiltonian system, referred to as the curse of sensitivity in (Ross, 2009); 2) the procedure requires solving a differential algebraic equation and a boundary value problem which is complex, time consuming, and might not have a solution (Ross & Karpenko, 2012).

### Limitations of hybrid methods.

ANODE (Gholami et al., 2019) combines the two methods above in order to mitigate their limitations. The problem is split into time batches, where the adjoint is used, and only some intermediate states from the forward pass are stored in memory. This is a compromise between time and storage and does not entirely resolve the above limitations, thus only providing results that are within the convex envelope of the two methods. Moreover, there is no clear strategy to a priori determine the size of the time chunks, which can significantly affect speed and performance.

## 3 Relaxation of the supervised learning problem

Our algorithm is based on two ingredients: i) the discretization of the problem using spectral elements leading to SNet, detailed in Section 4, and ii) the relaxation of the ODE constraint from (1), enabling efficient training through backpropagation. The latter can be applied directly at the continuous level and significantly reduces the difficulty of the optimization, as shown in our examples.

The problem in (1) is split into two smaller subproblems: one finds the trajectory that minimizes an unconstrained relaxation of (1). The other trains the network weights such that the trajectory becomes a solution of the ODE. Both are addressed using standard gradient descent and backpropagation. In particular, a fixed number of ADAM or SGD steps is performed for each problem in an alternate fashion, until convergence. In the following, the details of each subproblem are discussed.

### Step 0: Initial trajectory.

The initial trajectory is chosen by solving the problem

 minx(t)∈X∫t1t0 L(t,x(t)) dt, (2) s.t. x(t0)=x0.

If this problem does not have a unique solution, a regularization term is added. For a quadratic loss, a closed-form solution of the above problem is readily available. Otherwise, a prescribed number of SGD iterations is used.

### Step 1: Coordinate descent on residual.

Once the initial trajectory is found, is computed by solving the unconstrained problem:

 minθ∈Rm∫t1t0 R(t,x⋆(t),θ) dt,R(t,x(t),θ):=∥˙x(t)−f(t,x(t),u(t);θ)∥2 (3)

If the value of the residual at the optimum is smaller than a prescribed tolerance, then the algorithms stops. Otherwise, steps 1 and 2 are iterated until convergence.

### Step 2: Coordinate descent on relaxation.

Once the candidate parameters are found, the trajectory is updated by minimizing the relaxed objective:

 minx(t)∈X∫t1t0 γL(t,x(t))+R(t,x(t),θ⋆) dt, (4) s.t. x(t0)=x0.

### Discussion.

The proposed algorithm can be seen as an alternating coordinate gradient descent on the relaxed functional used in problem (4), i.e., by alternating a minimization with respect to and . If , multiple minima can exist, since each choice of the parameters would induce a different dynamics , solution of the original constraint. For , the loss function in (4) trades-off the ODE solution residual for the data fitting, providing a unique solution. The choice of implicitly introduces a satisfaction tolerance , i.e., similar to regularized regression (Hastie et al., 2001), implying that . Concurrently, problem (3) reduces the residual.

## 4 SNet – High-order discretization of the relaxed problem

In order to numerically solve the problems presented in the previous section, a discretization of is needed. Rather than updating the values at time points from the past to the future, we introduce a compact representation of the complete discrete trajectory by means of the spectral element method.

### Spectral approximation.

We start by representing the scalar unknown trajectory, , and the known input, , as truncated series:

 x(t)=p∑i=0xiψi(t),u(t)=p∑i=0uiψi(t), (5)

where and are a set of given basis functions that span a space . In this work, we choose orthogonal Legendre polynomials of order (Canuto et al., 1988), where

is a hyperparameter.

In order to compute the coefficients of (5), we enforce the equation at a discrete set of collocation points . Here, we choose Gauss-Lobatto nodes (Canuto et al., 1988), which include the initial point

. Introducing the vectors of series coefficients

and of evaluations at quadrature points , the collocation problem can be solved in matrix form as

 xI=M−1x(tQ),Mqi:=ψi(tq). (6)

We approximate the integral (3) as a sum of residual evaluations over . Assuming that , the integrand at all quadrature points can be computed as a component-wise norm

 (7)

### Fitting the input data.

Integral (4) entails evaluating the loss function at quadrature points, which requires the knowledge of the input data at . For the case when this is available, we propose a new direct training scheme, namely, -SNet. This is summarized in Algorithm 1.

If data at is not available, for instance when it is sparse, or if problem (2) does not admit a unique solution, a least-squares approach can be used. We approximate the integral by evaluating at the available time points. The corresponding alternating coordinate descent scheme -SNet is presented in Algorithm 2. We use fixed numbers and of updates for, respectively, and

. Both are performed with standard routines, such as SGD. In the following sections, we study the consequences of a very low-data scenario on this approach. In our experiments, we use ADAM to optimize the parameters and an interpolation order

, but any other orders and solvers are possible.

Algorithm 1 -SNet training   Input: from (6)-(7).      while  do         end while   Output: Algorithm 2 -SNet training   Input: , , , .   while  do      for  do                end for      for  do                end for   end while   Output:

### Ease of time parallelization.

If is enforced explicitly at , then the resulting discrete system can be seen as an implicit time-stepping method of order . However, while ODE integrators can only be made parallel across the different components of , the assembly of the residual can be done in parallel also across time.

### Memory cost.

It can be shown that, if an ODE admits a smooth solution, the approximation error of the SNet converges exponentially with (Canuto et al., 1988), thereby producing a very compact representation of an ODE-Net. Thanks to this property, is typically much lower than the equivalent number of time steps of explicit or implicit schemes with a fixed order. This greatly reduces the complexity and the memory requirements of the proposed method, which can be evaluated at any via (5) by only storing few coefficients.

## 5 Modeling of a planar vehicle dynamics

Let us consider the system

 ˙η=J(η)v,M˙v+d(v)+C(v)v=u, (8)

where are the states, is the control, is the Coriolis matrix, is the damping force, assumed to be linear, and encodes the coordinate transformation from the body to the world frame (Fossen, 2011).

### Identification of a surrogate model.

A model the true system is built using a neural network for each matrix

 J(η)=fJ(η;θJ),C(v)=fC(v;θC),d(v)=fd(v;,θd).

Each network consists of two layers, the first with a activation. Bias is excluded for and . For , we use and input features, where is the vehicle orientation. When inserted in (8), these discrete networks produce an ODE-Net that is a surrogate model of the physical system. Time horizon and batch size of are used. All experiments are performed with learning rates for ADAM (for all methods) and for SGD (only for -SNet). The trajectories of the system are shown in Figure 1, for a single batch, as well as the learning curve.

### Comparison of methods with full data.

The trajectories were sampled at 100 equally-spaced time points and compared the training performance of the novel and traditional training methods. We choose an order for the spectral interpolation. For the -SNet method, we use and

iterations for the SGD and ADAM algorithms at each epoch, as outlined in Algorithm

2, and perturb the initial trajectory as , . This perturbation prevents the exact convergence of Algorithm 1 during initialization, allowing to perform the alternating coordinate descent algorithm. Table 1(a) shows that -SNet outperforms BKPR-DoPr5 by a factor of 50, while producing a significantly improved generalization. The speedup reduces to 20 for -SNet, which however yields a further reduction of the testing MSE by a factor of 10, as can be seen in Figure 2.

### Comparison of methods with sparse data.

We compared again the performance of the methods, with trajectories sampled at fewer time points. In the case of -SNet and -SNet, only points are needed for very accurate integration of the loss function, if such points coincide with the Gauss-Lobatto quadrature points. We found that 100 equally-spaced points produce a comparable result. Here, the points are reduced further. Table 1(b) shows that -SNet preserves a good testing MSE, at the price of an increased number of iterations. We argue that this could be improved by a more specific Legendre interpolation of the data or by a more specific solver for (2). However, even with no modifications and only of data, -SNet is 10x faster than BKPR-DoPr5 with th of test MSE.

## 6 Learning a multi-agent simulation

Consider the multi-agent system consisting of kinematic vehicles:

 ˙ηi =J(ηi)tanh⎛⎝w+Kc(ηi)+1NaNa∑j≠iKo(ηi,ηj)⎞⎠, (9)

where are the states (planar position and orientation), is the coordinate transform from the body to world frame, common to all agents. The agents velocities are determined by known arbitrary control and collision avoidance policies, respectively, and plus some additional high frequency measurable signal , shared by all vehicles. We wish to learn their kinematics matrix by means of a neural network as in Section 5. The task is simpler here, but the resulting ODE has states, coupled by . We simulate agents in series.

### Results.

We use learning rates for ADAM (for all methods) and for SGD (only for -SNet), time horizon and batch size . The learning curves for high-data regime are in Figure 3. For method -SNet, we use and training is terminated when the loss in (4) is less than , with and 111For the case of data only we set . Table 2 summarizes results. -SNet is the fastest method, followed by -SNet which is the best performing. BKPR-Euler falls back in iteration time by a factor , with x worse test MSE. Down-sampling data by and confirms the trend. BCKPR-DoPr5 failed due to an underflow when computing the time step, solver tolerances have been increased to rtol, atol. Since the loss continued to increase, at epochs training was terminated. Test trajectories are shown in Figure 4. Additional figures and details are in Appendix.

## 7 Scope and limitations

### Test time and cross-validation

At test time, since the future outputs are unknown, an explicit integrator is needed. For cross-validation, the loss needs instead to be evaluated on a different dataset. In order to do so, one needs to solve the ODE forward in time. However, since the output data is available during cross-validation, a corresponding polynomial representation of the form (5) can be found and the relaxed loss (4) can be evaluated efficiently.

### Nonsmooth dynamics.

We have used the fact that the ODE-Net dynamics is smooth in order to take advantage of the exponential convergence of spectral methods. However, this might not be true in general. In these cases, the optimal choice would be to enrich the spectral element space with (possibly low-order) locally-supported basis functions (Canuto et al., 1988).

### Topological properties.

As discussed in (Petersen et al., 2018), the set of functions generated by a fixed neural network topology does not posses favorable topological properties for optimization. We argue that one reason for the performance improvements shown by our algorithms is that the constraint relaxation proposed in this work can improve the properties of the optimization space.

### Multiple ODEs: Synchronous vs Asynchronous.

The proposed method can be used for an arbitrary cascade of dynamical systems as they can be coded as a single ODE. In the special case when only the final state of one ODE (or its trajectory) is fed into the next block, e.g. as in (Ciccone et al., 2018), the method could be extended by means of smaller optimizations, where is the number of ODEs.

## 8 Related work

### RNN training pathologies.

One of the first RNNs to be trained successfully were LSTMs (Greff et al., 2017), due to their particular architecture. Training an arbitrary RNN effectively is generally difficult as standard RNN dynamics can become unstable or chaotic during training and this can cause the gradients to explode and SGD to fail (Pascanu et al., 2012). When RNNs consist of discretised ODEs, then stability of SGD is intrinsically related to the size of the convergence region of the solver (Ciccone et al., 2018). Since higher-order and implicit solvers have larger convergence region (Hairer et al., 1993), following (Pascanu et al., 2012) it can be argued that our method has the potential to mitigate instabilities and hence to make the learning more efficient. This is supported by our results.

### Unrolled architectures.

In (Graves, 2016), an RNN has been used with a stopping criterion, for iterative estimation with adaptive computation time. Highway (Srivastava et al., 2015) and residual networks (He et al., 2015) have been studied in (Greff et al., 2016) as unrolled estimators. In this context, (Haber & Ruthotto, 2017) treated residual networks as autonomous discrete-ODEs and investigated their stability. Finally, in (Ciccone et al., 2018) a discrete-time non-autonomous ODE based on residual networks has been made explicitly stable and convergent to an input-dependant equilibrium, then used for adaptive computation.

### Training stable ODEs.

In (Haber & Ruthotto, 2017; Ciccone et al., 2018), ODE stability conditions where used to train unrolled recurrent residual networks. Similarly, when using our method on (Ciccone et al., 2018) ODE stability can be enforced by projecting the state weight matrices, , into the Hurwitz stable space: i.e. . At test time, overall stability will also depend on the solver (Durran, 2010; Isermann, 1989). Therefore, a high order variable step method (e.g. DoPr5) should be used at test time in order to minimize the approximation error.

### Dynamics and machine learning.

A physics prior on a neural network was used by (Jia et al., 2018) in the form of a consistency loss with data from a simulation. In (De Avila Belbute-Peres et al., 2018), a differentiable physics framework was introduced for point mass planar models with contact dynamics. (Ruthotto & Haber, 2018)

looked at Partial Differential Equations (PDEs) to analyze neural networks, while

(Raissi & Karniadakis, 2018; Raissi et al., 2017) used Gaussian Processes (GP) to model PDEs. The solution of a linear ODE was used in (Soleimani et al., 2017) in conjunction with a structured multi-output GP to model patients outcome of continuous treatment observed at random times. (Pathak et al., 2017) predicted the divergence rate of a chaotic system with RNNs.

## 9 Conclusions

We proposed SNet, a novel approximation of the internal dynamics of ODE-Nets in terms of a truncated Legendre series. This allows for a very compact representation using very few coefficients thanks to its exponential convergence for smooth functions. In order to solve the associated optimization problem, a coordinate descent method is employed, where the series coefficients and net weights are updated alternately. When a good guess for the optimal trajectory can be produced from the training data, such as when the data set is rich, then the optimization converges very rapidly. SNet trains orders of magnitude faster and improves generalization with respect to the state of the art. Faster and more accurate ODE-Nets, such as SNet, are a step towards tackling larger and more complex problems.

## Appendix A Vehicle dynamics simulation

The model is formulated in a concentrated parameter form (Siciliano et al., 2008). We follow the notation of (Fossen, 2011). Recall the system definition:

 ˙η=J(η)v,M˙v+d(v)+C(v)v=u,

where are the states, namely, the , and coordinates in a fixed (world) frame, the vehicle orientation with respect this this frame, , and the body-frame velocities, , , and angular rate, . The input is a set of torques in the body-frame, . The Kinematic matrix is

 J(η)=⎡⎢⎣cos(ϕ)−sin(ϕ)0sin(ϕ)cos(ϕ)0001⎤⎥⎦,

the mass matrix is

 M=⎡⎢⎣m000m000I⎤⎥⎦,

where is the vehicle mass and represents the rotational intertia. The Coriolis matrix is

 C(v)=⎡⎢⎣0−mω0mω00000⎤⎥⎦,

and the damping force is . We set and . The input, , comes from a Fourier series with fundamental amplitude .

## Appendix B Multi-agent simulation

The multi-agent simulation consists of kinematic vehicles:

 ˙ηi=J(ηi)vi,vi=tanh⎛⎝w+Kc(ηi)+1NaNa∑j≠iKo(ηi,ηj)⎞⎠,

where are the states for each vehicle, namely, the , positions and the orientation of vehicle in the world frame, while are the controls signals, in the form of linear and angular velocities, , . The kinematics matrix is

 J(ηi)=⎡⎢⎣cos(ϕi)0sin(ϕi)001⎤⎥⎦.

The agents velocities are determined by known arbitrary control and collision avoidance policies, respectively, and . In particular:

 Kc(ηi)=[kvkϕδϕi],Ko(ηi,ηj)=[−kvoe−dlse−|δϕij+π/2|kϕoδϕij],

where , and .

### Training configuration.

We set

 kv=0.05kϕ=0.1,kvo=0.001,kϕo=0.01,ls=0.01.

The signal is generated by a Fourier series with fundamental amplitude .

### Test configuration.

We change to:

 Ktest(ηi,ηj)=[−kvoe−dlskϕoδϕij],

with and . We also set .

## Appendix C Robustness of the methods

In the main paper we have demonstrated that the proposed methods -SNet and -SNet can train deep ODE models orders of magnitude faster and with better generalization than standard backpropagation through an ODE solver as well as the adjoint method. Since for moderate scale ODEs the use of the adjoint becomes infeasible in terms of iteration time, in Section 6 we focused solely on comparing our methods with backpropagation through an ODE solver.

In Section 6, the use of a high order variable-step method (DoPr5), allegedly providing accurate solutions, didn’t lead to good training results. In particular, the loss function continued to increase over the iterations. On the other hand, despite being nearly times slower than our methods, the fixed-step forward Euler solver was successfully used for learning the dynamics of a -state system in the training configuration described in Appendix B. One should however note that, in this configuration, the gains for the collision avoidance policy (which couples the ODE) were set to small values. This makes the system simpler and more stable than having a larger gain. As a result, if one attempts to train with the test configuration from Appendix B, where the gains are increased and the system is more unstable, then backpropagating trough Euler simply fails. Comparing Figures 3 and 5, it can be seen that the learning curves of our methods are unaffected by the change in the gains, while in the latter configuration BKPR-Euler fails to decrease the loss.

The forward Euler method is known to have a small region of convergence. In other words, integrating very fast dynamics requires a very small time step,

, in order to provide accurate results. In particular, for the solver error to be bounded, the eigenvalues of the state Jacobian of the ODE need to lie into the circle of the complex plane centered at

with radius (Ciccone et al., 2018; Isermann, 1989). Higher-order explicit methods, such as Runge-Kutta (Runge, 1895), have larger but still limited convergence regions. Our algorithms on the other hand are implicit methods, which have a larger region of convergence than recursive (explicit) methods (Hairer et al., 1993). We claim that this results in a more stable and robust training. This claim is supported by our experiments.

Reducing the time step can improve the Euler accuracy but it can still lead to vanishing or exploding gradients (Zilly et al., 2016; Goodfellow et al., 2016). In the next section, we show that our methods do not suffer from this problem.

Consider the classic discrete-time RNN:

 x(t+1)=f(x(t),u(t);θ). (10)

Then, given a loss , the following gradients are used during training222We consider a single batch element: and are in , where is the state dimensionality.:

 ∂L∂θ=tf∑t=t0∂L(x(t))∂x(t)∂x(t)∂θ, (11)

where, for any

, the chain rule gives:

 ∂x(t+1)∂θ=∂f(x(t),u(t);θ)∂θ+J(x(t))∂x(t)∂θ, (12) J(x(t))=f(x(t),u(t);θ)∂x(t)

Iteration of (12) is the main principle of backpropagation through time (Goodfellow et al., 2016). A known fact is that iterating (12) is similar to a geometric series. Therefore, depending on the spectral radius of the Jacobian, , it can result in vanishing () or exploding () gradients (Zilly et al., 2016).

We can now demonstrate that, by avoiding function compositions, our approach is immune to the exploding gradient problem. In particular, our gradients are fully

time-independent and their accuracy is not affected by the sequence length. Recall the ODE:

 ˙x(t)=f(x(t),u(t);θ). (13)

The following result is obtained:

###### Theorem 1.

Assume the full Jacobian of has a finite spectral radius for all . Then, the norms of the gradients used in Algorithm 1 and Algorithm 2 are also finite for all .

We will prove the theorem for Algorithm 2 since it includes the relevant parts of Algorithm 1.

###### Proof.

Given the Legendre basis functions, , define the ODE residual as:

 r(x(t),θ)=p∑i=0xi˙ϕi(t)−f(p∑i=0xiϕi(t),u(t);θ),

for which the residual loss is . Then, Algorithm 2 consists of the concurrent minimization of the relaxed loss function:

 ∫t1t0γL(x(t)))+R(x(t),θ⋆)dt,

with respect to the coefficients of the Legendre polynomial given , and of the residual loss:

 ∫t1t0R(x(t),θ)dt,

with respect to given the set of coefficients . For the residual loss gradients are:

 ∂R(x⋆(t),θ)∂θ=−2 r(x⋆(t),θ)T ∂f(∑pi=0x⋆iϕi(t),u(t);θ)∂θ, (14)

where there is no recursion over the previous values , since the basis functions are given and the points are treated as data. By assumption, the Jacobian of

has finite singular values. Hence, by standard matrix norm identities (see Chapter 5 of

Horn & Johnson (2012)) the gradient of with respect to has a finite norm for all .

Similarly, the absence of time recursions in the gradient for the update using the relaxed loss also follows trivially from the fact that the coefficients are independent variables, that we assume a given , that the basis functions are fixed. Then, the claim follows again from the assumption on the Jacobian of . ∎

Note that the result of Theorem 1

cannot easily be achieved when training ODEs using backpropagation through the solver or the adjoint method unless some gradient conditioning, such as clipping or batch normalization

(Ioffe & Szegedy, 2015), is performed after applying . On the other hand, our result relies only on the gradient of being finite. This is trivial for a shallow and, for a very deep , the methods in (Ioffe & Szegedy, 2015; Srivastava et al., 2015; Ciccone et al., 2018; He et al., 2015) can be applied just inside the architecture of , if necessary. This fundamental difference with standard RNN training allows for to have unrestricted Jacobian magnitudes which are needed to effectively model instabilities and long-short term dynamics, similar to LSTMs (Greff et al., 2017).