I Introduction
Systems described by differential equations (DEs) are ubiquitous in science. Therefore, methods for solving them are of wide interest and importance. While numerical methods for DEs have existed for centuries, modern advances in machine/deep learning, cluster/parallel computing, etc., have shown Neural Networks (NNs) to be powerful options for studying complex systems Han et al. (2018); Sirignano and Spiliopoulos (2018); Raissi et al. (2019). Landmark works Narendra and Parthasarathy (1992); Lagaris et al. (1998); Kuschewski et al. (1993); Polycarpou and Ioannou (1991); Narendra and Parthasarathy (1989) envisioned the utility of NNs in the study of dynamical phenomena and leveraged that potential. Recently, those methods were adapted to construct NN solvers for Hamiltonian systems that were several orders of magnitude more accurate than symplectic Euler solvers at equivalent temporal discretization Mattheakis et al. (2020).
NN DE solvers generate smooth, closed-form functional approximations to solutions of DEs over domains of interest, providing arbitrarily precise and numerically robust approximations. Further, these methods are amenable to parallelization in ways many discrete and/or iterative methods are inherently incapable of.
However, there has been a general lack of work on directly analysing the local errors present in the predictions of NN DE solvers. Much of the attention focuses on surrogate markers like the loss functions, which provide imprecise, ambiguous, and/or global measures of performance. This adds a layer of uncertainty to understanding the fitness of the prediction, which can often only be resolved by benchmarking the NN predictions against solutions obtained by established methods, possibly defeating the entire purpose of building NN DE solvers and/or leading to inherently untrustworthy predictions.
Computational costs can also limit the utility of NN DE solvers, especially when studying low dimensional systems or when domain resolution and further manipulation of quantities of interest (taking derivatives, compositions, etc) is not substantially important.
Local error quantification can mitigate these problems by providing precise benchmarking and error correction options. Indeed, it is often a possible/useful option for refining predictions of traditional numerical methods. We show that similar estimates are obtainable for NN DE solvers as well: the learning done by NN DE solvers can be powerful enough to reveal its own limitations. We also show that these estimates help in obtaining faster and better approximations to the true solutions. Let $F$ be an operator prescribing the evolution of a $D$-dimensional dynamical system:

$\dot{x}(t) = F(x(t)) \quad (1)$
In this work, we present methods that can extend the applicability and effectiveness of the NN DE solver variants presented in Narendra and Parthasarathy (1992); Lagaris et al. (1998); Raissi et al. (2019); Kuschewski et al. (1993); Polycarpou and Ioannou (1991); Narendra and Parthasarathy (1989); Mattheakis et al. (2020); Sirignano and Spiliopoulos (2018); Han et al. (2018); Choudhary et al. (2020) and other related works, when such solvers are constructed for dynamical systems (our strategies are expected to generalize to all NN DE solvers designed for systems with piecewise smooth components, but we leave this for later work). We work with the assumption that $F$ is a smooth operator over the domain of interest. We derive results for the error in NN predictions and prescribe error correction methods that can magnify the speed and accuracy of the NN. We end this work by exemplifying the utility of our ideas on a nonlinear oscillator and the chaotic Hénon-Heiles system.
II Neural Network Approaches to Smooth Dynamical Systems
The central aim of this section is to generalize existing results and showcase strategies to significantly magnify the efficiency of existing dynamical system NN solvers, with minimal modifications to the existing methods.
In Mattheakis et al. (2020), the authors presented a rapidly convergent NN that could find accurate functional approximations for the evolution of phase space parameters of various Hamiltonian systems, chaotic and nonlinear systems included, by simply demanding information about the initial phase space coordinate and the temporal domain of interest. They demonstrated how NN training can be accelerated by tailoring the architecture for the problem at hand (physical insight increasing NN accuracy by advising the choice of activation functions) and how parameters of inherent significance can be studied better by involving the advances that machine/deep learning methods have made in the past few decades (NNs bettering physical insight by accurately approximating the dynamics at hand).
The NN itself was structurally simple and easy to train (Fig. 1): a single hub input layer demanding a set of $N$ points in the temporal domain of interest each training iteration, two hidden layers with $\sin(\cdot)$ activation hubs, and an output layer with $D$ outputs, one for each state parameter described in Eqn. 1. Thus, the NN was a $D$-dimensional output map $N(t)$ for any input $t$: sourcing $t \in [0, T]$ meant the NN was being trained to give an effective functional approximation for the expected evolution of the dynamical system over $[0, T]$. To enforce the initial conditions during the training process, the NN output hubs $N(t)$ and the ultimate NN prediction $\hat{x}(t)$ were related as: $\hat{x}(t) = x(0) + (1 - e^{-t})\,N(t)$.
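The architecture described above can be sketched in a few lines. The following is a minimal, hypothetical NumPy stand-in: the layer widths and the (untrained, random) weights are illustrative assumptions, not the network of Mattheakis et al. (2020), but the $\sin(\cdot)$ activations and the $\hat{x}(t) = x(0) + (1 - e^{-t})N(t)$ parametrization follow the text.

```python
import numpy as np

# Minimal sketch of the solver architecture described above: a feed-forward
# map t -> N(t) with two hidden sin() hub layers, whose output is
# reparametrized so the initial condition holds exactly at t = 0.
# Sizes and random weights are illustrative assumptions only.
rng = np.random.default_rng(0)
D, width = 2, 32                      # state dimension, hidden width (assumed)
W1, b1 = rng.normal(size=(width, 1)), np.zeros((width, 1))
W2, b2 = rng.normal(size=(width, width)), np.zeros((width, 1))
W3, b3 = rng.normal(size=(D, width)), np.zeros((D, 1))

def N(t):
    """Raw network output N(t) for a batch of times t, shape (K,) -> (D, K)."""
    h = np.sin(W1 @ t[None, :] + b1)
    h = np.sin(W2 @ h + b2)
    return W3 @ h + b3

def x_hat(t, x0):
    """Prediction with the initial condition built in: x0 + (1 - e^-t) N(t)."""
    return x0[:, None] + (1.0 - np.exp(-t))[None, :] * N(t)
```

Note that at $t = 0$ the prefactor $(1 - e^{-t})$ vanishes, so $\hat{x}(0) = x(0)$ holds for any weight values, which is exactly why the parametrization enforces the initial condition without a penalty term.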
The authors used the residual of Hamilton's equations as the basis for a mean squared temporal loss:

$L = \frac{1}{N} \sum_{i=0}^{N-1} \left[\dot{\hat{x}}(t_i) - J\,\nabla H(\hat{x}(t_i))\right] \cdot \left[\dot{\hat{x}}(t_i) - J\,\nabla H(\hat{x}(t_i))\right] \quad (2)$

where $J$ is the symplectic matrix and $\nabla H(\hat{x})$ is the gradient of the Hamiltonian evaluated at the prediction $\hat{x}$. Since $\hat{x}$ was a function of $t$ by the construction of the network, $\dot{\hat{x}}$ could be evaluated at any $t$ from within the NN, making the training completely unsupervised.
Eqn. 2 also implies the capacity of the NN to dispense with causality, since the evaluation of $\hat{x}(t)$ is not dependent on $\hat{x}(t' < t)$, showcasing the capacity of NN DE solvers to be parallelized in physical time $t$. The authors also presented the following repackaging of the residual:
$\dot{\hat{x}} - J\,\nabla H(\hat{x}) = J\,\mathcal{H}(\hat{x})\,\epsilon - \dot{\epsilon} + O(\epsilon^2) \quad (3)$

where $\mathcal{H}(\hat{x})$ is the Hessian of the Hamiltonian evaluated at $\hat{x}$ and $\epsilon = x - \hat{x}$ is the difference between the true evolution and the NN prediction.
Let us describe (and generalize to smooth dynamical systems) strategies for constructing efficient NN DE solvers similar to the ones found in Lagaris et al. (1998); Mattheakis et al. (2020) and other works. Let the NN make its predictions for some discrete, finite set of time points $\{t_0, t_1, \dots, t_{N-1}\}$, with $t_0 = 0$ and $t_{N-1} = T$, at every iteration of training. All intermediate $t_i$ are sampled randomly from $(0, T)$ before each forward pass. We define the following time averaged cost/loss function $L$:

$L = \frac{1}{N} \sum_{i=0}^{N-1} \mathcal{L}(t_i) \cdot \mathcal{L}(t_i), \qquad \mathcal{L}(t) = \dot{\hat{x}}(t) - F(\hat{x}(t)) \quad (4)$

where $\hat{x}$ is the NN prediction and $F$ is the dynamical operator of the system.
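A sketch of the loss in Eqn. 4 for a generic system follows. In a real NN DE solver the derivative $\dot{\hat{x}}$ is taken exactly from within the network; here it is approximated with a central difference purely for illustration, and the "prediction" below is a hypothetical smooth stand-in rather than a trained network.

```python
import numpy as np

def F(x):
    """Dynamical operator of a nonlinear oscillator: F(x, p) = (p, -x - x^3)."""
    return np.array([x[1], -x[0] - x[0]**3])

def x_hat(t):
    """Hypothetical smooth 'prediction' standing in for a trained NN."""
    return np.array([np.cos(t), -np.sin(t)])

def local_error_vector(t, h=1e-5):
    """L(t) = d/dt x_hat(t) - F(x_hat(t)); derivative by central difference."""
    x_dot = (x_hat(t + h) - x_hat(t - h)) / (2 * h)
    return x_dot - F(x_hat(t))

# Eqn. 4: mean of the squared local error vector over the sampled time points.
ts = np.linspace(0.0, 2.0, 100)
loss = np.mean([local_error_vector(t) @ local_error_vector(t) for t in ts])
```

For this stand-in prediction the first component of $\mathcal{L}$ vanishes identically while the second does not, so the loss isolates exactly where the trial trajectory fails to satisfy the DE.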
The local error vector $\mathcal{L}(t)$ is the centerpiece of the entire NN DE solver: $L \to 0$ implies $\mathcal{L}(t) \to 0$, for $t \in [0, T]$, which implies $\epsilon(t) \to 0$, if the initial condition is being enforced. This can be handled by parametrizations like those in Mattheakis et al. (2020): $\hat{x}(t) = x(0) + (1 - e^{-t})\,N(t)$. Let $x(t)$ be the true value at $t$ and $\epsilon(t) = x(t) - \hat{x}(t)$. Assume the NN is trained sufficiently such that the Taylor expansion of $F(x)$ about $\hat{x}$ is convergent in $\epsilon$. We get:
$F(x) = F(\hat{x}) + (\epsilon \cdot \nabla)\,F(\hat{x}) + \frac{1}{2!}\,(\epsilon \cdot \nabla)^2 F(\hat{x}) + \dots \quad (5)$

Here, $(\epsilon \cdot \nabla)\,F$, $(\epsilon \cdot \nabla)^2 F$, and so on, are evaluated at $\hat{x}$. We note that many common dynamical operators are built from elementary functions with infinite or qualitatively large radii of convergence.
We know that $\dot{x} = F(x)$. We use Eqn. 4 and 5 to generalize Eqn. 3 (Eqn. 15 in Mattheakis et al. (2020)):

$\dot{\epsilon}(t) = (\epsilon \cdot \nabla)\,F(\hat{x}) - \mathcal{L}(t) + O(\epsilon^2) \quad (6)$
Let us say that $L$ is the value of Eqn. 4 for some given loss profile, and $\sigma_{\min}$ is the minimum singular value of the discretized operator $\frac{d}{dt} - (\,\cdot\,\nabla)\,F(\hat{x})$ over $[0, T]$. Keeping only the leading order term in Eqn. 6 and using the methodology in Mattheakis et al. (2020), the generalization of their result (the corresponding bound in Mattheakis et al. (2020)) on upper bounds on magnitudes of individual error components of $\epsilon$ is:

$|\epsilon_j(t_i)| \leq \frac{\sqrt{N L}}{\sigma_{\min}} \quad (7)$
Eqn. 7 gives us a way of using the loss function to bound the error in the predictions of the NN DE solver. However, given smoothness for our operator and the sufficient training assumption used to obtain Eqn. 6, we can derive more than just bounds on expected NN error.
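The bound of Eqn. 7 can be checked numerically. The sketch below is a hypothetical construction, not the paper's experiment: it discretizes the leading-order relation $\mathcal{D}\epsilon = -\mathcal{L}$ (with $\mathcal{D}$ a forward-difference version of $\frac{d}{dt} - \nabla F$ for a 1D linear system), picks an arbitrary error trajectory, computes the residual it induces, and verifies that the singular-value bound holds.

```python
import numpy as np

# Discretize D = d/dt - grad F for the linear system F(x) = a x on N grid
# points (forward differences; grid size and a are illustrative assumptions).
rng = np.random.default_rng(1)
N, dt, a = 200, 0.01, -1.0
main = np.full(N, -1.0 / dt - a)                 # diagonal: -(1/dt) - a
D = np.diag(main) + np.diag(np.full(N - 1, 1.0 / dt), k=1)

eps = rng.normal(size=N) * 1e-3                  # an arbitrary error trajectory
l = -D @ eps                                     # residual it induces (Eqn. 6)
L = np.mean(l * l)                               # mean squared loss (Eqn. 4)
sigma_min = np.linalg.svd(D, compute_uv=False).min()
bound = np.sqrt(N * L) / sigma_min               # right-hand side of Eqn. 7
```

Since $\|\mathcal{L}\|_2 = \sqrt{N L}$ on the grid and $\|\epsilon\|_2 \leq \|\mathcal{L}\|_2 / \sigma_{\min}$ by standard linear algebra, every component of $\epsilon$ sits below the bound by construction, whatever trajectory is chosen.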
Eqn. 4 tells us $\mathcal{L}(t)$ is a smooth function (since $\hat{x}$ is a smooth function of $t$ by the construction of our NN and $F$ is assumed to be a smooth operator). Further, the NN can calculate $\mathcal{L}(t)$ for any $t \in [0, T]$. Pausing/finishing the training of the original NN DE solver makes $\hat{x}(t)$ a fixed closed form expression, which implies $\epsilon(t)$ itself is now some determinable trajectory in time like $x(t)$, one whose evolution is governed by Eqn. 6. Therefore, we use the following discrete, recursive relation to estimate $\epsilon(t)$ by picking a small enough time step $\Delta t$, such that the first order update is reasonably accurate; such a finite $\Delta t$ exists due to the smoothness of $\hat{x}$ and $F$:

$\epsilon(t + \Delta t) = \epsilon(t) + \left[(\epsilon(t) \cdot \nabla)\,F(\hat{x}(t)) - \mathcal{L}(t)\right] \Delta t, \qquad \epsilon(0) = 0 \quad (8)$
We can use Eqn. 8 to quantify the error in the NN prediction during any stage of training, to as good a resolution and accuracy as needed, by choosing an adequately small $\Delta t$. Thus, the NN DE solver, when trained by using the residual of the DE itself, contains all the information needed to quantify the fitness of its own approximation.
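The recursion in Eqn. 8 can be exercised on a toy case where everything is known in closed form. The sketch below assumes $F(x) = -x$ (true solution $x = e^{-t}$) and a hypothetical "prediction" $\hat{x} = e^{-t}(1 + 0.01\,t)$ that matches the initial condition, so the exact error is $\epsilon = -0.01\,t\,e^{-t}$ and the recursion should recover it.

```python
import numpy as np

# For F(x) = -x: (eps . grad)F = -eps, and the residual of the chosen
# x_hat works out analytically to L(t) = 0.01 e^-t.
F_jac = lambda x: -1.0
x_hat = lambda t: np.exp(-t) * (1.0 + 0.01 * t)
local_err = lambda t: 0.01 * np.exp(-t)          # L(t) = x_hat' - F(x_hat)

T, dt = 2.0, 1e-4
eps, t = 0.0, 0.0
while t < T:
    # Eqn. 8: eps(t + dt) = eps(t) + [ (eps . grad)F(x_hat) - L(t) ] dt
    eps += (F_jac(x_hat(t)) * eps - local_err(t)) * dt
    t += dt

true_eps = -0.01 * T * np.exp(-T)                # exact error at t = T
```

For this linear system Eqn. 6 has no $O(\epsilon^2)$ remainder, so the only discrepancy left is the first order discretization error of the recursion itself, which shrinks with $\Delta t$ as the text describes.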
Eqn. 6 and 8 allow a powerful possibility: using a second NN to produce a smooth, closed form, functional approximation $\hat{\epsilon}(t)$ for the error trajectory associated with any prediction $\hat{x}(t)$. The corrected prediction $\hat{x} + \hat{\epsilon}$ retains all the advantages of being an NN approximation to the solution of the DE, while being significantly more accurate. The latter half of standard training gives diminishing returns in accuracy, thanks to a saturation of an NN's predictive power, machine precision limitations and/or any other factors that may be at play. Eventually, computational resources are better spent on adjusting for the error in the NN prediction, rather than trying to minimize $L$ further.
Eqn. 8 hints directly at the error correction method: using the new NN as a regression tool (Algorithm 1).

1. Train the NN DE solver for $K$ iterations, until $\hat{x}$ is reasonably within the radius of convergence of the expansion in Eqn. 5 for all $t$ (Eqn. 7 can aid in identification).

2. Generate $\hat{x}(t)$ and $\mathcal{L}(t)$ at $M$ uniformly distanced points in $[0, T]$ for some adequately large $M$, including at $t = 0$ and $t = T$.

3. Create a copy of the original solver. Reinitialize its weights/biases and enforce $\hat{\epsilon}(0) = 0$.

4. Generate a dataset $\{t_j, \epsilon(t_j)\}$ using Eqn. 8 + Step 2.

5. Define the local loss vector for the second solver: $\mathcal{L}_2(t_j) = \hat{\epsilon}(t_j) - \epsilon(t_j)$.

6. Every iteration, keep $t = 0$ and randomly select $m - 1$ other points from the set of points for which $\epsilon$ is available ($M \gg m$ reduces the likelihood of similar or repeating batches, defeating overfitting concerns by mimicking stochastic methods).

7. Train the second NN using $L_2 = \frac{1}{m} \sum_j \mathcal{L}_2(t_j) \cdot \mathcal{L}_2(t_j)$.

8. Use $\hat{x} + \hat{\epsilon}$ as the approximation at the end of training.
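The data flow of the algorithm above can be sketched compactly. The "second solver" below is a hypothetical stand-in, a linear model in $(1 - e^{-t})\sin(kt)$ features that enforces $\hat{\epsilon}(0) = 0$ like the NN parametrization, and for brevity it is fit in closed form by least squares; in the algorithm proper, a copy of the NN solver trained on mini-batches plays that role. The dataset values, grid size $M$, and batch size $m$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
T, M, m = 2.0, 400, 32
ts = np.linspace(0.0, T, M)                  # step 2: uniform grid incl. 0 and T

def features(t):
    """(1 - e^-t) sin(k t) basis; vanishes at t = 0, like the NN output."""
    return (1.0 - np.exp(-t))[:, None] * np.sin(np.outer(t, np.arange(1, 9)))

# Stand-in for the Eqn. 8 output (step 4), chosen representable by the model
# so the sketch is exactly solvable.
eps = features(ts) @ np.array([2e-3, 0.0, -1e-3, 0.0, 0.0, 0.0, 0.0, 5e-4])

def sample_batch():
    """Step 6: always keep t = 0, pick m - 1 other points without replacement."""
    return np.concatenate(([0], rng.choice(np.arange(1, M), m - 1, replace=False)))

# Steps 5 and 7, collapsed: least squares minimizes the mean of L2 . L2.
w, *_ = np.linalg.lstsq(features(ts), eps, rcond=None)
eps_hat = features(ts) @ w                   # the fitted error curve
```

The one-time cost of building the $\{t_j, \epsilon(t_j)\}$ table is what replaces the repeated in-network differentiation during the remaining training, which is the source of the speedup claimed below.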
This algorithm cuts down the computational costs of calculating $\dot{\hat{x}}$ every iteration. For NN DE solvers Narendra and Parthasarathy (1992); Lagaris et al. (1998); Raissi et al. (2019); Kuschewski et al. (1993); Polycarpou and Ioannou (1991); Narendra and Parthasarathy (1989); Mattheakis et al. (2020); Sirignano and Spiliopoulos (2018); Han et al. (2018); Choudhary et al. (2020), appropriate variants of such differential terms lead to the dominant costs per iteration: the differential term $\dot{\hat{x}}$ has to be evaluated from within the NN, before backpropagation (an additional differential cost) can be applied. Hence, our algorithm is faster than the existing methods.
An important choice is $M$: setting up Alg. 1 costs about $M$ forward passes. However, simple combinatorics dictates that the NN could train using $\{t_j, \epsilon(t_j)\}$ for far more iterations than there are stored points, if $M$ and/or $m$ were large enough.
To put this into perspective, for the NN presented in Mattheakis et al. (2020), even a modest $M$ would practically ensure that an exactly repeated batch never occurs with random selection. The utility of the dataset would last practically forever, before overfitting concerns start building up in any appreciable sense. The cost of these forward passes is insignificant, since NN DE solver variants train over tens of thousands of full training iterations. We use such a choice of $M$ in our examples.
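The combinatorics claim is easy to make concrete. With $t = 0$ always kept, the number of distinct batches is $\binom{M-1}{m-1}$; the values of $M$ and $m$ below are illustrative assumptions, yet the count already dwarfs any plausible iteration budget.

```python
import math

# Number of distinct batches of size m drawn from M stored points,
# with t = 0 always included (M and m here are assumed for illustration).
M, m = 400, 32
n_batches = math.comb(M - 1, m - 1)
```

Even at tens of thousands of iterations, the probability of ever seeing the same batch twice is negligible, which is why the one-time dataset does not raise overfitting concerns.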
Existing NN DE solvers give diminishing returns on accuracy as soon as they start settling around a local loss minimum. Further training leads to better performance at increasingly untenable costs. The core idea is to use those resources for precise, local corrections to the original NN predictions. Using the constraints that govern $\epsilon$, we can quantify/better the fitness of our approximations, without external resources/references. These ideas should extend to other NN DE solvers too.
We end this section by noting that NN DE solvers are usually better at sidestepping the curse of dimensionality than grid-based discrete approaches Poggio et al. (2017) (in some sense, they do not overcome the curse of dimensionality, but avoid it by focusing only on the relevant features of the problem in question). Indeed, for the Black-Scholes equation, this has been rigorously proven Grohs et al. (2018). NN DE solvers also exhibit the capacity for producing general representations that can easily adapt to changes in the parameters describing the problem Magill et al. (2018). They are a rapidly emerging set of powerful tools for studying differential equations. It is imperative that new techniques for improving their efficiency/effectiveness are found.
III A Nonlinear and a Chaotic Example
We exemplify the utility of local error quantification on two Hamiltonian systems ($F(x) = J\,\nabla H(x)$, where $J = \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix}$, $(0, I)$ are the (Null, Identity) block matrices and $H$ is the Hamiltonian). The first is the nonlinear oscillator:

$H = \frac{p^2}{2} + \frac{x^2}{2} + \frac{x^4}{4}$

The second is the chaotic Hénon-Heiles system Henon and Heiles (1964):

$H = \frac{p_x^2 + p_y^2}{2} + \frac{x^2 + y^2}{2} + x^2 y - \frac{y^3}{3}$
The central objectives of this section are to:

1. showcase the local error quantification capacity of the NN DE solver from within, using Eqn. 8.

2. showcase that the proposed methods are capable of accuracy better than standard methods.

3. showcase that the proposed methods are significantly more efficient than the standard methods.
We test our methods on the NN solver described in Mattheakis et al. (2020). For a fair comparison, we do not modify the solver in any way that affects its performance/efficiency. We claim and prove that our methods are capable of quantifying the error in the NN solver approximations from within the setup. Further, we claim that we can leverage this capacity of the NN DE solver to significantly better its efficiency. To verify these claims, we use Scipy's odeint to generate what can be considered the ground truth or true solutions for all practical purposes Oliphant (2007).
We set up 100 randomly initialized NN DE solvers for each of the two Hamiltonian systems. To begin, we use the solver proposed in Mattheakis et al. (2020) to train each unique NN solver for 50,000 iterations. We make a very minor modification to an otherwise exact copy of the original setup in Mattheakis et al. (2020): we save the weight/bias values after 20,000 iterations of NN training are completed (this minimal modification adds a one time, insignificant computational cost to the training, with no effect on the accuracy).
We create copies of each NN solver and initialize them at the weight/bias values that were saved at the end of 20,000 iterations of training. These copies are then trained for 30,000 more iterations using our proposed algorithm. In effect, we are comparing what the effectiveness of the NN DE solvers could have been, if they had used our proposed training methods, with 20,000 standard training iterations followed by 30,000 iterations of Alg. 1. Fig 2(a) and 3(a) visually verify the accuracy of Alg. 1: $\hat{x} + \hat{\epsilon}$ is practically $x$.
We use several measures to quantify performance: per iteration runtime, mean error, and max error. Per iteration runtime is the total cost of setting up and using a method, normalized by the number of iterations of use. Mean error of the final NN approximation is the mean of $|\epsilon(t)|$ over $[0, T]$. Max error is the max value $|\epsilon(t)|$ takes over $[0, T]$.
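The two error measures above amount to simple reductions over a common time grid. A minimal sketch, with hypothetical trajectories standing in for the NN prediction and the reference solution:

```python
import numpy as np

def mean_error(x_true, x_pred):
    """Mean of |eps(t)| over the sampled domain."""
    return np.abs(x_true - x_pred).mean()

def max_error(x_true, x_pred):
    """Max of |eps(t)| over the sampled domain."""
    return np.abs(x_true - x_pred).max()

# Hypothetical trajectories for illustration only.
t = np.linspace(0.0, 2.0, 100)
x_true = np.exp(-t)
x_pred = np.exp(-t) + 1e-3 * np.sin(t)
```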
$\epsilon(t)$ can be approximated internally from the standard NN DE solver using Eqn. 8 and externally by comparing $\hat{x}$ with odeint's ground truth $x$. Fig. 2 (b) and 3 (b, c, d) verify this inherent capacity of the NN.
In Table 1, we report the median observed performance for each algorithm/Hamiltonian system pair. The results are in line with claims made in the previous section: proposed methods outperform existing ones used in Mattheakis et al. (2020).
System | Training Algorithm | Median runtime | Median mean error | Median max error
NL Osc | Standard | 4.8 | 3.63 | 1.50
NL Osc | Alg. #1 | 2.2 | 2.15 | 0.86
HH | Standard | 7.2 | 0.76 | 0.22
HH | Alg. #1 | 2.6 | 0.44 | 0.15
IV Conclusions
We showcase the power of error quantification on precise, local scales in the context of smooth dynamical system NN DE solvers. The local information provided by the residual of the governing DE, when viewed from the NN perspective, leads to methods that can accurately quantify the error in the NN prediction over the domain of interest. We were also able to propose an algorithm that directly uses that information to produce better solutions than those obtainable by standard NN DE methods.
These methods should extend naturally to DEs that admit piecewise well behaved solutions, since a natural definition for $\mathcal{L}$ exists for NN DE solvers designed for such settings. In turn, $\mathcal{L}$ also encodes relevant information about the fitness of the NN approximation.
To our knowledge, this work is the first attempt to utilize information embedded in the components that make up the cost/loss function of NN DE solvers (indeed, most physics-inspired NNs). Future work will focus on exploring whether our ideas can generalize to most well behaved systems and their corresponding NN DE solvers. Future work will also involve exploration of other algorithms to take advantage of the inherent capacity for error quantification possessed by NN DE solvers (we present one such example in the appendix).
Acknowledgements.
ASD was supported as a Research Fellow by the School of Engineering and Applied Sciences at Harvard University. WTR is supported by a Chancellor's Fellowship from UCSB.
References
 Han et al. (2018) J. Han, A. Jentzen, and W. E, Proceedings of the National Academy of Sciences (2018).
 Sirignano and Spiliopoulos (2018) J. Sirignano and K. Spiliopoulos, Journal of Computational Physics 375, 1339–1364 (2018).
 Raissi et al. (2019) M. Raissi, P. Perdikaris, and G. E. Karniadakis, Journal of Computational Physics 378, 686 (2019).
 Narendra and Parthasarathy (1992) K. Narendra and K. Parthasarathy, International Journal of Approximate Reasoning 6:2, 109 (1992).
 Lagaris et al. (1998) I. E. Lagaris, A. Likas, and D. I. Fotiadis, IEEE Transactions on Neural Networks and Learning Systems 9, 987 (1998).
 Kuschewski et al. (1993) J. G. Kuschewski, S. Hui, and S. H. Zak, IEEE Transactions on Control Systems Technology 1:1, 37 (1993).
 Polycarpou and Ioannou (1991) M. M. Polycarpou and P. A. Ioannou, (1991).
 Narendra and Parthasarathy (1989) K. Narendra and K. Parthasarathy, Proceedings of the 28th IEEE Conference on Decision and Control (1989).
 Mattheakis et al. (2020) M. Mattheakis, D. Sondak, A. S. Dogra, and P. Protopapas, arXiv:2001.11107 (2020).
 Choudhary et al. (2020) A. Choudhary, J. F. Lindner, E. G. Holliday, S. T. Miller, S. Sinha, and W. L. Ditto, Phys. Rev. E 101, 062207 (2020).
 Poggio et al. (2017) T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, and Q. Liao, International Journal of Automation and Computing 14, 503 (2017).
 Grohs et al. (2018) P. Grohs, F. Hornung, A. Jentzen, and P. V. Wurstemberger, Mem. Amer. Math. Soc. (to appear) (2018).
 Magill et al. (2018) M. Magill, F. Qureshi, and H. W. de Haan, Advances in Neural Information Processing Systems 2018 (NeurIPS 2018) (2018).
 Henon and Heiles (1964) M. Henon and C. Heiles, The Astronomical Journal 69, 73 (1964).
 Oliphant (2007) T. E. Oliphant, Computing in Science and Engineering 9 (2007).
V Appendix: Another Error Correction Scheme
Eqn. 6 also hints at an error correction method similar to the NN DE solvers themselves: setting up a new NN DE solver that seeks to minimize the residual of Eqn. 6. $\mathcal{L}(t)$ can be generated for any $t$ using the original solver and the new solver can be structured as an exact copy of the old one, with a minor modification to the final parametrization: $\hat{\epsilon}(t) = (1 - e^{-t})\,N_2(t)$, where $N_2(t)$ are the hubs of the new output layer.
The new DE and NN solver are structurally and functionally similar to their original NN DE solver counterparts: their per iteration costs should be similar. The method does not change, only the quantity of interest does.

1. Train the NN DE solver for $K$ iterations, until $\hat{x}$ is reasonably within the radius of convergence of the expansion in Eqn. 5 for all $t$ (Eqn. 7 can aid in identification).

2. Generate $\hat{x}(t)$ and $\mathcal{L}(t)$ at $M$ uniformly distanced points in $[0, T]$ for some adequately large $M$, including at $t = 0$ and $t = T$.

3. Create a copy of the original solver. Reinitialize its weights/biases and enforce $\hat{\epsilon}(0) = 0$.

4. Define the local loss vector for the second solver: $\mathcal{L}_2(t) = \dot{\hat{\epsilon}} - (\hat{\epsilon} \cdot \nabla)\,F(\hat{x}) + \mathcal{L}(t)$.

5. Every iteration, keep $t = 0$ and randomly select $m - 1$ other points from the set of points for which $\hat{x}$ and $\mathcal{L}$ are available (generating data at $M \gg m$ points reduces the likelihood of similar or repeating batches, reducing overfitting concerns).

6. Train the second NN using $L_2 = \frac{1}{m} \sum_j \mathcal{L}_2(t_j) \cdot \mathcal{L}_2(t_j)$.

7. Use $\hat{x} + \hat{\epsilon}$ as the approximation at the end of training.
As mentioned before, NN DE solvers reach a plateau in loss performance after a certain amount of training. This may be a result of various factors, one of which is the saturation of the accuracy possible with a given NN architecture. The second NN DE solver is intended for stepping in once a situation like this is reached: by fixing the first NN as a relatively coarser approximation, the second NN seeks to provide the smaller scale, finer adjustments that are needed to achieve further accuracy.
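The residual that this second solver minimizes can be sketched on the 1D toy case used earlier. The sketch below assumes $F(x) = -x$ and the same hypothetical prediction $\hat{x} = e^{-t}(1 + 0.01\,t)$; if a trial $\hat{\epsilon}$ equals the exact error $-0.01\,t\,e^{-t}$, the residual of Eqn. 6 should vanish up to finite-difference error.

```python
import numpy as np

# Hypothetical 1D setup: for F(x) = -x, (eps_hat . grad)F = -eps_hat, and
# the residual of the chosen x_hat is L(t) = 0.01 e^-t analytically.
x_hat = lambda t: np.exp(-t) * (1.0 + 0.01 * t)
local_err = lambda t: 0.01 * np.exp(-t)
eps_hat = lambda t: -0.01 * t * np.exp(-t)       # trial output of second solver

def residual(t, h=1e-5):
    """L2(t) = d/dt eps_hat - (eps_hat . grad)F(x_hat) + L(t)."""
    deps = (eps_hat(t + h) - eps_hat(t - h)) / (2 * h)
    return deps - (-1.0) * eps_hat(t) + local_err(t)

ts = np.linspace(0.0, 2.0, 50)
```

Unlike Algorithm 1, no $\{t_j, \epsilon(t_j)\}$ dataset is precomputed here: the second solver is trained on this residual directly, exactly as the first solver is trained on the residual of the original DE.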