Neural Ordinary Differential Equations (ODE-Nets; Chen et al., 2018
) can learn latent models from observations that are sparse in time. This property has the potential to enhance the performance of neural network predictive models in applications where information is sparse in time and it is important to account for exact arrival times and delays. In complex control systems and model-based reinforcement learning, planning over a long horizon is often needed, while high frequency feedback is necessary for maintaining stability(Franklin et al., 2014). Discrete-time models, including RNNs (Jain & Medsker, 1999), often struggle to fully meet the needs of such applications due to the fixed time resolution. ODE-Nets have been shown to provide superior performance with respect to classic RNNs on time series forecasting with sparse training data. However, learning their parameters can be computationally intensive. In particular, ODE-Nets are memory efficient but time inefficient. In this paper, we address this bottleneck and propose a novel alternative strategy for ODE-Nets training.
Summary of contributions.
We propose SNet, a compact representation of ODE-Nets that makes use of a higher-order approximation of its states by means of Legendre polynomials. This is outlined in Section 4. In order to find the optimal polynomial coefficients and network parameters, we develop a novel optimization scheme, which does not require to solve an ODE at each iteration. The resulting algorithm is detailed in Section 3 and is based on backpropagation (Linnainmaa, 1970; Werbos, 1981; Lecun, 1988) and automatic differentiation (Paszke et al., 2017). The proposed method is fully parallel with respect to time and its approximation error reduces exponentially with the Legendre polynomial order (Canuto et al., 1988).
Summary of numerical experiments.
In Section 5, our method is tested on a 6-state vehicle dynamics identification problem, where it is at least one order or magnitude faster in each optimizer iteration than explicit and adjoint methods, while convergence is achieved in a third of the iterations. At test time, the MSE is reduced by one order of magnitude. In Section 6, the method is used for the identification of a 30-state system consisting of identical vehicles, coupled via a known collision avoidance policy. Again, our method converges in a third of the iterations required by backpropagation thourgh a solver and each iteration is x faster than the fastest explicit scheme.
2 Neural ordinary differential equations
The minimization of a scalar-valued loss function that depends on the output of an ODE-Net can be formulated as a general constrained optimization problem:
where is the state, is the input, the loss and ODE functions and are given, and the parameters have to be learned. Equation (11994; Law et al., 2015; Ross, 2009). Problem (1) can be solved using gradient-based optimization through several time-stepping schemes for solving the ODE. (Chen et al., 2018; Gholami et al., 2019) have proposed to use the adjoint method when is a neural network. These methods are typically relying on explicit time-stepping schemes (Butcher & Wanner, 1996). Limitations of these approaches are briefly summarized:
Limitations of backpropagation through an ODE solver.
The standard approach for solving this problem is to compute the gradients using backpropagation through a discrete approximation of the constraints, such as Runge-Kutta methods (Runge, 1895; Butcher & Wanner, 1996) or multi-step solvers (Raissi et al., 2018). This ensures that the solution remains feasible (within a numerical tolerance) at each iteration of a gradient descent method. However, it has several drawbacks: 1) the memory cost of storing intermediate quantities during backpropagation can be significant, 2) the application of implicit methods would require solving a nonlinear equation at each step, 3) the numerical error can significantly affect the solution, and 4) the problem topology can be unsuitable for optimization (Petersen et al., 2018).
Limitations of adjoint methods.
ODE-Nets (Chen et al., 2018) solve (1) using the adjoint method, which consists of simulating a dynamical system defined by an appropriate augmented Hamiltonian (Ross, 2009), with an additional state referred to as the adjoint. In the backward pass the adjoint ODE is solved numerically to provide the gradients of the loss function. This means that intermediate states of the forward pass do not need to be stored. An additional step of the ODE solver is needed for the backward pass. This suffers from a few drawbacks: 1) the dynamics of either the hidden state or the adjoint might be unstable, due to the symplectic structure of the underlying Hamiltonian system, referred to as the curse of sensitivity in (Ross, 2009); 2) the procedure requires solving a differential algebraic equation and a boundary value problem which is complex, time consuming, and might not have a solution (Ross & Karpenko, 2012).
Limitations of hybrid methods.
ANODE (Gholami et al., 2019) combines the two methods above in order to mitigate their limitations. The problem is split into time batches, where the adjoint is used, and only some intermediate states from the forward pass are stored in memory. This is a compromise between time and storage and does not entirely resolve the above limitations, thus only providing results that are within the convex envelope of the two methods. Moreover, there is no clear strategy to a priori determine the size of the time chunks, which can significantly affect speed and performance.
3 Relaxation of the supervised learning problem
Our algorithm is based on two ingredients: i) the discretization of the problem using spectral elements leading to SNet, detailed in Section 4, and ii) the relaxation of the ODE constraint from (1), enabling efficient training through backpropagation. The latter can be applied directly at the continuous level and significantly reduces the difficulty of the optimization, as shown in our examples.
The problem in (1) is split into two smaller subproblems: one finds the trajectory that minimizes an unconstrained relaxation of (1). The other trains the network weights such that the trajectory becomes a solution of the ODE. Both are addressed using standard gradient descent and backpropagation. In particular, a fixed number of ADAM or SGD steps is performed for each problem in an alternate fashion, until convergence. In the following, the details of each subproblem are discussed.
Step 0: Initial trajectory.
The initial trajectory is chosen by solving the problem
If this problem does not have a unique solution, a regularization term is added. For a quadratic loss, a closed-form solution of the above problem is readily available. Otherwise, a prescribed number of SGD iterations is used.
Step 1: Coordinate descent on residual.
Once the initial trajectory is found, is computed by solving the unconstrained problem:
If the value of the residual at the optimum is smaller than a prescribed tolerance, then the algorithms stops. Otherwise, steps 1 and 2 are iterated until convergence.
Step 2: Coordinate descent on relaxation.
Once the candidate parameters are found, the trajectory is updated by minimizing the relaxed objective:
The proposed algorithm can be seen as an alternating coordinate gradient descent on the relaxed functional used in problem (4), i.e., by alternating a minimization with respect to and . If , multiple minima can exist, since each choice of the parameters would induce a different dynamics , solution of the original constraint. For , the loss function in (4) trades-off the ODE solution residual for the data fitting, providing a unique solution. The choice of implicitly introduces a satisfaction tolerance , i.e., similar to regularized regression (Hastie et al., 2001), implying that . Concurrently, problem (3) reduces the residual.
4 SNet – High-order discretization of the relaxed problem
In order to numerically solve the problems presented in the previous section, a discretization of is needed. Rather than updating the values at time points from the past to the future, we introduce a compact representation of the complete discrete trajectory by means of the spectral element method.
Collocation and quadrature.
In order to compute the coefficients of (5), we enforce the equation at a discrete set of collocation points . Here, we choose Gauss-Lobatto nodes (Canuto et al., 1988), which include the initial point
. Introducing the vectors of series coefficientsand of evaluations at quadrature points , the collocation problem can be solved in matrix form as
We approximate the integral (3) as a sum of residual evaluations over . Assuming that , the integrand at all quadrature points can be computed as a component-wise norm
Fitting the input data.
Integral (4) entails evaluating the loss function at quadrature points, which requires the knowledge of the input data at . For the case when this is available, we propose a new direct training scheme, namely, -SNet. This is summarized in Algorithm 1.
If data at is not available, for instance when it is sparse, or if problem (2) does not admit a unique solution, a least-squares approach can be used. We approximate the integral by evaluating at the available time points. The corresponding alternating coordinate descent scheme -SNet is presented in Algorithm 2. We use fixed numbers and of updates for, respectively, and
. Both are performed with standard routines, such as SGD. In the following sections, we study the consequences of a very low-data scenario on this approach. In our experiments, we use ADAM to optimize the parameters and an interpolation order, but any other orders and solvers are possible.
Ease of time parallelization.
If is enforced explicitly at , then the resulting discrete system can be seen as an implicit time-stepping method of order . However, while ODE integrators can only be made parallel across the different components of , the assembly of the residual can be done in parallel also across time.
It can be shown that, if an ODE admits a smooth solution, the approximation error of the SNet converges exponentially with (Canuto et al., 1988), thereby producing a very compact representation of an ODE-Net. Thanks to this property, is typically much lower than the equivalent number of time steps of explicit or implicit schemes with a fixed order. This greatly reduces the complexity and the memory requirements of the proposed method, which can be evaluated at any via (5) by only storing few coefficients.
5 Modeling of a planar vehicle dynamics
Let us consider the system
where are the states, is the control, is the Coriolis matrix, is the damping force, assumed to be linear, and encodes the coordinate transformation from the body to the world frame (Fossen, 2011).
Identification of a surrogate model.
A model the true system is built using a neural network for each matrix
Each network consists of two layers, the first with a activation. Bias is excluded for and . For , we use and input features, where is the vehicle orientation. When inserted in (8), these discrete networks produce an ODE-Net that is a surrogate model of the physical system. Time horizon and batch size of are used. All experiments are performed with learning rates for ADAM (for all methods) and for SGD (only for -SNet). The trajectories of the system are shown in Figure 1, for a single batch, as well as the learning curve.
Comparison of methods with full data.
The trajectories were sampled at 100 equally-spaced time points and compared the training performance of the novel and traditional training methods. We choose an order for the spectral interpolation. For the -SNet method, we use and
iterations for the SGD and ADAM algorithms at each epoch, as outlined in Algorithm2, and perturb the initial trajectory as , . This perturbation prevents the exact convergence of Algorithm 1 during initialization, allowing to perform the alternating coordinate descent algorithm. Table 1(a) shows that -SNet outperforms BKPR-DoPr5 by a factor of 50, while producing a significantly improved generalization. The speedup reduces to 20 for -SNet, which however yields a further reduction of the testing MSE by a factor of 10, as can be seen in Figure 2.
Comparison of methods with sparse data.
We compared again the performance of the methods, with trajectories sampled at fewer time points. In the case of -SNet and -SNet, only points are needed for very accurate integration of the loss function, if such points coincide with the Gauss-Lobatto quadrature points. We found that 100 equally-spaced points produce a comparable result. Here, the points are reduced further. Table 1(b) shows that -SNet preserves a good testing MSE, at the price of an increased number of iterations. We argue that this could be improved by a more specific Legendre interpolation of the data or by a more specific solver for (2). However, even with no modifications and only of data, -SNet is 10x faster than BKPR-DoPr5 with th of test MSE.
6 Learning a multi-agent simulation
Consider the multi-agent system consisting of kinematic vehicles:
where are the states (planar position and orientation), is the coordinate transform from the body to world frame, common to all agents. The agents velocities are determined by known arbitrary control and collision avoidance policies, respectively, and plus some additional high frequency measurable signal , shared by all vehicles. We wish to learn their kinematics matrix by means of a neural network as in Section 5. The task is simpler here, but the resulting ODE has states, coupled by . We simulate agents in series.
We use learning rates for ADAM (for all methods) and for SGD (only for -SNet), time horizon and batch size . The learning curves for high-data regime are in Figure 3. For method -SNet, we use and training is terminated when the loss in (4) is less than , with and 111For the case of data only we set . Table 2 summarizes results. -SNet is the fastest method, followed by -SNet which is the best performing. BKPR-Euler falls back in iteration time by a factor , with x worse test MSE. Down-sampling data by and confirms the trend. BCKPR-DoPr5 failed due to an underflow when computing the time step, solver tolerances have been increased to rtol, atol. Since the loss continued to increase, at epochs training was terminated. Test trajectories are shown in Figure 4. Additional figures and details are in Appendix.
7 Scope and limitations
Test time and cross-validation
At test time, since the future outputs are unknown, an explicit integrator is needed. For cross-validation, the loss needs instead to be evaluated on a different dataset. In order to do so, one needs to solve the ODE forward in time. However, since the output data is available during cross-validation, a corresponding polynomial representation of the form (5) can be found and the relaxed loss (4) can be evaluated efficiently.
We have used the fact that the ODE-Net dynamics is smooth in order to take advantage of the exponential convergence of spectral methods. However, this might not be true in general. In these cases, the optimal choice would be to enrich the spectral element space with (possibly low-order) locally-supported basis functions (Canuto et al., 1988).
As discussed in (Petersen et al., 2018), the set of functions generated by a fixed neural network topology does not posses favorable topological properties for optimization. We argue that one reason for the performance improvements shown by our algorithms is that the constraint relaxation proposed in this work can improve the properties of the optimization space.
Multiple ODEs: Synchronous vs Asynchronous.
The proposed method can be used for an arbitrary cascade of dynamical systems as they can be coded as a single ODE. In the special case when only the final state of one ODE (or its trajectory) is fed into the next block, e.g. as in (Ciccone et al., 2018), the method could be extended by means of smaller optimizations, where is the number of ODEs.
8 Related work
RNN training pathologies.
One of the first RNNs to be trained successfully were LSTMs (Greff et al., 2017), due to their particular architecture. Training an arbitrary RNN effectively is generally difficult as standard RNN dynamics can become unstable or chaotic during training and this can cause the gradients to explode and SGD to fail (Pascanu et al., 2012). When RNNs consist of discretised ODEs, then stability of SGD is intrinsically related to the size of the convergence region of the solver (Ciccone et al., 2018). Since higher-order and implicit solvers have larger convergence region (Hairer et al., 1993), following (Pascanu et al., 2012) it can be argued that our method has the potential to mitigate instabilities and hence to make the learning more efficient. This is supported by our results.
In (Graves, 2016), an RNN has been used with a stopping criterion, for iterative estimation with adaptive computation time. Highway (Srivastava et al., 2015) and residual networks (He et al., 2015) have been studied in (Greff et al., 2016) as unrolled estimators. In this context, (Haber & Ruthotto, 2017) treated residual networks as autonomous discrete-ODEs and investigated their stability. Finally, in (Ciccone et al., 2018) a discrete-time non-autonomous ODE based on residual networks has been made explicitly stable and convergent to an input-dependant equilibrium, then used for adaptive computation.
Training stable ODEs.
In (Haber & Ruthotto, 2017; Ciccone et al., 2018), ODE stability conditions where used to train unrolled recurrent residual networks. Similarly, when using our method on (Ciccone et al., 2018) ODE stability can be enforced by projecting the state weight matrices, , into the Hurwitz stable space: i.e. . At test time, overall stability will also depend on the solver (Durran, 2010; Isermann, 1989). Therefore, a high order variable step method (e.g. DoPr5) should be used at test time in order to minimize the approximation error.
Dynamics and machine learning.
A physics prior on a neural network was used by (Jia et al., 2018) in the form of a consistency loss with data from a simulation. In (De Avila Belbute-Peres et al., 2018), a differentiable physics framework was introduced for point mass planar models with contact dynamics. (Ruthotto & Haber, 2018)
looked at Partial Differential Equations (PDEs) to analyze neural networks, while(Raissi & Karniadakis, 2018; Raissi et al., 2017) used Gaussian Processes (GP) to model PDEs. The solution of a linear ODE was used in (Soleimani et al., 2017) in conjunction with a structured multi-output GP to model patients outcome of continuous treatment observed at random times. (Pathak et al., 2017) predicted the divergence rate of a chaotic system with RNNs.
We proposed SNet, a novel approximation of the internal dynamics of ODE-Nets in terms of a truncated Legendre series. This allows for a very compact representation using very few coefficients thanks to its exponential convergence for smooth functions. In order to solve the associated optimization problem, a coordinate descent method is employed, where the series coefficients and net weights are updated alternately. When a good guess for the optimal trajectory can be produced from the training data, such as when the data set is rich, then the optimization converges very rapidly. SNet trains orders of magnitude faster and improves generalization with respect to the state of the art. Faster and more accurate ODE-Nets, such as SNet, are a step towards tackling larger and more complex problems.
Butcher & Wanner (1996)
Butcher, J., & Wanner, G. (1996).
Runge-Kutta methods: some historical notes.
Applied Numerical Mathematics, 22(1-3), 113–151.
- Canuto et al. (1988) Canuto, C., M.Y. Hussaini, A. Quarteroni, & T.A. Zang (1988). Spectral Methods in Fluid Dynamics.. Springer-Verlag.
- Chen et al. (2018) Chen, T. Q., Rubanova, Y., Bettencourt, J., & Duvenaud, D. K. (2018). Neural Ordinary Differential Equations. In NeurIPS.
- Ciccone et al. (2018) Ciccone, M., Gallieri, M., Masci, J., Osendorfer, C., & Gomez, F. (2018). Nais-net: Stable deep networks from non-autonomous differential equations. In NeurIPS.
- De Avila Belbute-Peres et al. (2018) De Avila Belbute-Peres, F., Smith, K. A., Allen, K., Tenenbaum, J. B., & Kolter, J. Z. (2018). End-to-end differentiable physics for learning and control. In NeurIPS.
Durran, D. R. (2010).
Numerical Methods for Fluid Dynamics.
Springer New York.
- Fossen (2011) Fossen, T. I. (2011). Handbook of marine craft hydrodynamics and motion control. John Wiley & Sons.
- Franklin et al. (2014) Franklin, G. F., Powell, J. D., & Emami-Naeini, A. (2014). Feedback Control of Dynamic Systems (7th Edition). Pearson.
- Gholami et al. (2019) Gholami, A., Keutzer, K., & Biros, G. (2019). ANODE: Unconditionally Accurate Memory-Efficient Gradients for Neural ODEs. Preprint at arXiv:1902.10298.
- Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org.
Graves, A. (2016).
Adaptive Computation Time for Recurrent Neural Networks.
Greff et al. (2017)
Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., &
Schmidhuber, J. (2017).
LSTM: A Search Space Odyssey.
IEEE Transactions on Neural Networks and Learning Systems,
Greff et al. (2016)
Greff, K., Srivastava, R. K., & Schmidhuber, J. (2016).
Highway and residual networks learn unrolled iterative estimation.
Haber & Ruthotto (2017)
Haber, E., & Ruthotto, L. (2017).
Stable Architectures for Deep Neural Networks.
arXiv preprint arXiv:1705.03341.
- Hairer et al. (1993) Hairer, E., Nørsett, S. P., & Wanner, G. (1993). Solving Ordinary Differential Equations I (2Nd Revised. Ed.): Nonstiff Problems. Berlin, Heidelberg: Springer-Verlag.
- Hastie et al. (2001) Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning. Springer Series in Statistics. New York, NY, USA: Springer New York Inc.
He et al. (2015)
He, K., Zhang, X., Ren, S., & Sun, J. (2015).
Deep Residual Learning for Image Recognition.
- Horn & Johnson (2012) Horn, R. A., & Johnson, C. R. (2012). Matrix Analysis. New York, NY, USA: Cambridge University Press, 2nd ed.
Ioffe & Szegedy (2015)
Ioffe, S., & Szegedy, C. (2015).
Batch Normalization: Accelerating Deep Network Training by
Reducing Internal Covariate Shift.
Isermann, R. (1989).
Digital Control Systems.
Springer Berlin Heidelberg.
- Jain & Medsker (1999) Jain, L. C., & Medsker, L. R. (1999). Recurrent Neural Networks: Design and Applications (International Series on Computational Intelligence). CRC Press.
Jia et al. (2018)
Jia, X., Karpatne, A., Willard, J., Steinbach, M., Read, J. S., Hanson, P. C.,
Dugan, H. A., & Kumar, V. (2018).
Physics guided recurrent neural networks for modeling dynamical
systems: Application to monitoring water temperature and quality in lakes.
- Law et al. (2015) Law, K., Stuart, A., & Zygalakis, K. (2015). Data assimilation. Cham, Switzerland: Springer.
- Lecun (1988) Lecun, Y. (1988). A theoretical framework for back-propagation. In D. Touretzky, G. Hinton, & T. Sejnowski (Eds.) Proceedings of the 1988 Connectionist Models Summer School, CMU, Pittsburg, PA, (pp. 21–28). Morgan Kaufmann.
- Linnainmaa (1970) Linnainmaa, S. (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master’s thesis, Univ. Helsinki.
- Pascanu et al. (2012) Pascanu, R., Mikolov, T., & Bengio, Y. (2012). On the difficulty of training recurrent neural networks. CoRR, abs/1211.5063.
Paszke et al. (2017)
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z.,
Desmaison, A., Antiga, L., & Lerer, A. (2017).
Automatic differentiation in pytorch.In NIPS 2017 Workshop Autodiff.
Pathak et al. (2017)
Pathak, J., Lu, Z., Hunt, B. R., Girvan, M., & Ott, E. (2017).
Using Machine Learning to Replicate Chaotic Attractors and
Calculate Lyapunov Exponents from Data.
Chaos: An Interdisciplinary Journal of Nonlinear Science,
- Petersen et al. (2018) Petersen, P., Raslan, M., & Voigtlaender, F. (2018). Topological properties of the set of functions generated by neural networks of fixed size. arXiv preprint arXiv:1806.08459.
Raissi & Karniadakis (2018)
Raissi, M., & Karniadakis, G. E. (2018).
Hidden Physics Models: Machine Learning of Nonlinear
Partial Differential Equations.
Journal of Computational Physics, 357, 125–141.
Raissi et al. (2017)
Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2017).
Numerical Gaussian Processes for Time-dependent and
Non-linear Partial Differential Equations.
arXiv:1703.10230 [cs, math, stat].
Raissi et al. (2018)
Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2018).
Multistep Neural Networks for Data-driven Discovery of
Nonlinear Dynamical Systems.
arXiv:1801.01236 [nlin, physics:physics, stat].
- Ross (2009) Ross, I. M. (2009). A primer on Pontryagin’s principle in optimal control, vol. 2. Collegiate publishers San Francisco.
- Ross & Karpenko (2012) Ross, I. M., & Karpenko, M. (2012). A review of pseudospectral optimal control: From theory to flight. Annual Reviews in Control, 36(2), 182–197.
Runge, C. (1895).
Ueber die numerische Auflï¿œsung von Differentialgleichungen.
Mathematische Annalen, 46(2), 167–178.
Ruthotto & Haber (2018)
Ruthotto, L., & Haber, E. (2018).
Deep Neural Networks Motivated by Partial Differential
arXiv:1804.04272 [cs, math, stat].
- Siciliano et al. (2008) Siciliano, B., Sciavicco, L., Villani, L., & Oriolo, G. (2008). Robotics: Modelling, Planning and Control (Advanced Textbooks in Control and Signal Processing). Springer.
Soleimani et al. (2017)
Soleimani, H., Subbaswamy, A., & Saria, S. (2017).
Treatment-Response Models for Counterfactual Reasoning with
Continuous-time, Continuous-valued Interventions.
arXiv:1704.02038 [cs, stat].
Srivastava et al. (2015)
Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015).
- Stengel (1994) Stengel, R. F. (1994). Optimal control and estimation. Dover books on advanced mathematics. New York: Dover Publications.
- Werbos (1981) Werbos, P. J. (1981). Applications of advances in nonlinear sensitivity analysis. In Proceedings of the 10th IFIP Conference, 31.8 - 4.9, NYC, (pp. 762–770).
Zilly et al. (2016)
Zilly, J. G., Srivastava, R. K., Koutník, J., & Schmidhuber, J. (2016).
Recurrent Highway Networks.
Appendix A Vehicle dynamics simulation
where are the states, namely, the , and coordinates in a fixed (world) frame, the vehicle orientation with respect this this frame, , and the body-frame velocities, , , and angular rate, . The input is a set of torques in the body-frame, . The Kinematic matrix is
the mass matrix is
where is the vehicle mass and represents the rotational intertia. The Coriolis matrix is
and the damping force is . We set and . The input, , comes from a Fourier series with fundamental amplitude .
Appendix B Multi-agent simulation
The multi-agent simulation consists of kinematic vehicles:
where are the states for each vehicle, namely, the , positions and the orientation of vehicle in the world frame, while are the controls signals, in the form of linear and angular velocities, , . The kinematics matrix is
The agents velocities are determined by known arbitrary control and collision avoidance policies, respectively, and . In particular:
where , and .
The signal is generated by a Fourier series with fundamental amplitude .
We change to:
with and . We also set .
Appendix C Robustness of the methods
In the main paper we have demonstrated that the proposed methods -SNet and -SNet can train deep ODE models orders of magnitude faster and with better generalization than standard backpropagation through an ODE solver as well as the adjoint method. Since for moderate scale ODEs the use of the adjoint becomes infeasible in terms of iteration time, in Section 6 we focused solely on comparing our methods with backpropagation through an ODE solver.
In Section 6, the use of a high order variable-step method (DoPr5), allegedly providing accurate solutions, didn’t lead to good training results. In particular, the loss function continued to increase over the iterations. On the other hand, despite being nearly times slower than our methods, the fixed-step forward Euler solver was successfully used for learning the dynamics of a -state system in the training configuration described in Appendix B. One should however note that, in this configuration, the gains for the collision avoidance policy (which couples the ODE) were set to small values. This makes the system simpler and more stable than having a larger gain. As a result, if one attempts to train with the test configuration from Appendix B, where the gains are increased and the system is more unstable, then backpropagating trough Euler simply fails. Comparing Figures 3 and 5, it can be seen that the learning curves of our methods are unaffected by the change in the gains, while in the latter configuration BKPR-Euler fails to decrease the loss.
The forward Euler method is known to have a small region of convergence. In other words, integrating very fast dynamics requires a very small time step,
, in order to provide accurate results. In particular, for the solver error to be bounded, the eigenvalues of the state Jacobian of the ODE need to lie into the circle of the complex plane centered atwith radius (Ciccone et al., 2018; Isermann, 1989). Higher-order explicit methods, such as Runge-Kutta (Runge, 1895), have larger but still limited convergence regions. Our algorithms on the other hand are implicit methods, which have a larger region of convergence than recursive (explicit) methods (Hairer et al., 1993). We claim that this results in a more stable and robust training. This claim is supported by our experiments.
Appendix D Gradient analysis
Consider the classic discrete-time RNN:
Then, given a loss , the following gradients are used during training222We consider a single batch element: and are in , where is the state dimensionality.:
where, for any
, the chain rule gives:
Iteration of (12) is the main principle of backpropagation through time (Goodfellow et al., 2016). A known fact is that iterating (12) is similar to a geometric series. Therefore, depending on the spectral radius of the Jacobian, , it can result in vanishing () or exploding () gradients (Zilly et al., 2016).
We can now demonstrate that, by avoiding function compositions, our approach is immune to the exploding gradient problem. In particular, our gradients are fullytime-independent and their accuracy is not affected by the sequence length. Recall the ODE:
The following result is obtained:
Given the Legendre basis functions, , define the ODE residual as:
for which the residual loss is . Then, Algorithm 2 consists of the concurrent minimization of the relaxed loss function:
with respect to the coefficients of the Legendre polynomial given , and of the residual loss:
with respect to given the set of coefficients . For the residual loss gradients are:
where there is no recursion over the previous values , since the basis functions are given and the points are treated as data. By assumption, the Jacobian of
has finite singular values. Hence, by standard matrix norm identities (see Chapter 5 ofHorn & Johnson (2012)) the gradient of with respect to has a finite norm for all .
Similarly, the absence of time recursions in the gradient for the update using the relaxed loss also follows trivially from the fact that the coefficients are independent variables, that we assume a given , that the basis functions are fixed. Then, the claim follows again from the assumption on the Jacobian of . ∎
Note that the result of Theorem 1
cannot easily be achieved when training ODEs using backpropagation through the solver or the adjoint method unless some gradient conditioning, such as clipping or batch normalization(Ioffe & Szegedy, 2015), is performed after applying . On the other hand, our result relies only on the gradient of being finite. This is trivial for a shallow and, for a very deep , the methods in (Ioffe & Szegedy, 2015; Srivastava et al., 2015; Ciccone et al., 2018; He et al., 2015) can be applied just inside the architecture of , if necessary. This fundamental difference with standard RNN training allows for to have unrestricted Jacobian magnitudes which are needed to effectively model instabilities and long-short term dynamics, similar to LSTMs (Greff et al., 2017).