Neural dynamical systems are dynamical systems described at least in part by neural networks. Our interest in the subject emerges in the context of Systems Dynamics and Optimization dddas (SDO), which is central to many applications such as storm prediction ravela12, climate-risk based decision support ravela10, or autonomous observatories ravela13
. The SDO cycle conceptually involves a forward path dynamically parameterizing, reducing, calibrating, initializing and simulating numerical models, and quantifying their uncertainties. SDO further involves a return path for adaptive observation, inversion and estimation. In this context, neural networks are attractive as rapidly executing surrogates, and as parametrizations of dynamics difficult numerically model, such as sub-grid-scale turbulencekarpatne19.
Neural networks were classically developed to model discrete processes lstm
, but physical systems evolve continuously in time. Continuous-time Neural Networks (CtNNs) seek to advance neural modeling in both feed-forward or recurrent forms; the latter consist of neurons connected through a time delay and continuous-time activationCtNN_Chen.
CtNNs can approximate the time trajectory of any -dimensional dynamical system to arbitrary accuracy CtRNN_Funahashi. Interest in them as Neural ODEs neuode has grown, particularly to describe dynamics in the form , with inputs , feed-forward/feedback components , and parameters (weights, biases) being a function of time . More common is to consider autonomous dynamics, for example , superposing a non-neural part () and a neural CtNN part (). The neural part may be modeled with time-invariant parameters, i.e. .
Variational estimation for continuous dynamical systems directly applies to training Neural-ODEs. The attendant Hamilton-Jacobi-Bellman (HJB) equations estimate CtNN and non-neural parameters alike bryson
. Although the Machine Learning community cites the use of HJB as a critical discoveryneuode, nevertheless, HJB applies to continuous dynamics in general, and requires no special adaptation for Neural-ODEs. In practice, it is essential to consider the discrete adjoint equations whence a fundamental connection becomes even more plainly visible. For discrete networks, the adjoint equations exactly define back-propagation mlclass; bryson. They emerge as normal equations of a multi-stage (multi-layer network) two-point boundary value problem bryson for stage parameters (layer weights and biases) using terminal constraints (training data). Simply changing the definition of stages from “layers” to “discrete (time) steps,” the same methodology yields forward-backward iterations for training discretized CtNNs.
Whether CtNNs are interesting in their own right or as models of continuous processes, discrete-time simulation is presently central for training and operating them. The standard advice neuode is to use neural computation to calculate derivatives, then resort to a classical numerical integration, interleaving the two. Couldn’t a purely neural computation generate the discrete solution instead?
In this first paper of a three-part series, we show that a single recurrent Discrete-time Neural Network (DtNN) couples derivative calculation and numerical stepping to generate numerical sequences, which we define as neural integration. We construct DtNNs to implement the explicit fixed-step Runge-Kutta method press07 of arbitrary order. The key benefit of our approach is compactness; unlike extant methods RKNN; RKNN2019, the order of approximation for integration does not change the size of our network. We further apply this methodology to semi-implicit methods press07. Specifically, we couple a two-step Adams-Bashforth predictor to a two-step Adams-Moulton corrector. Once initialized, a single DtNN in each case performs neural integration, sequentially issuing outputs.
To demonstrate our approach, we define and employ PolyNets, the class of CtNNs for dynamical systems with polynomial nonlinearities. PolyNets are exact when polynomial coefficients of the governing equations are known. Exact PolyNets conveniently prove correctness of the proposed neural integration methods. We further demonstrate correspondence between neural and numerical integration using numerical simulations of the chaotic Lorenz-63 Lorenz1963 model. Extensions of methodology to non-polynomial dynamics are also discussed.
Ii Related Work
Multiple efforts for implementing neural ordinary differential equationsneuode are being developed, including by incorporating physical constraints during learning RackauckasDiffEq. Neural-ODEs must be discretized for numerical computation; it is typically suggested that CtNNs produce the derivatives subsequently passed to a traditional numerical integrator before returning to the neural computation as input for the next time step neuode. Our approach instead allows the CtNN to be discretized and executed entirely as a DtNN for neural integration.
Previous works have constructed numerical integrators in the form of a neural circuit RKNN; RKNN2019. However, the size of these networks scales with the order of the Runge-Kutta method. Our approach offers a fixed recurrent architecture invariant to order, which is possible through the use of multiplicative nodes, which are now fairly common, e.g. see lstm.
We devise methods for explicit Runge-Kutta (RK) of fixed time-step. We also demonstrate a semi-implicit predictor-corrector Adams-Bashforth-Moulton (ABM) method. To the best of our knowledge, the fixed-size RK neural integrator and the ABM neural integrator are new.
Our neural integration approach is easily demonstrated as correct. Each neural integration method described here is a PolyNet. If the embedded CtNN is certified to be “error-free” then so is neural integration; no algorithmic error beyond the approximation inherent in the numerical method is induced. However, CtNN certification is generally difficult because they are typically trained to an error tolerance. Except when CtNNs are exact PolyNets, the DtNN in our construction is also an exact PolyNet. Thus, algorithmic correctness of neural integration follows. This is important because differences between neural integration and numerical integration can in general arise for many reasons including training error, numerical approximation, computational implementation and numerical precision. It would be difficult to partition these errors, particularly when modeling chaotic processes.
CtNNs must in general be trained and even in the special case of PolyNets, this is true when the polynomial coefficients are unknown. Solving the discrete adjoint equations corresponding to HJB for learning requires discrete forward and backward numerical integration, which neural integration can accurately and efficaciously perform.
We posit that a key advantage of neural integration is that specialized hardware can be exploited better, particularly for Monte Carlo simulations of dynamical systems. For example, CUDA benefits of parallel simulations, e.g. for Monte Carlo, have already been shown Niemeyer2014.
Iii Exact PolyNets
To demonstrate neural integration, we introduce PolyNets, a class of CtNNs that match polynomial dynamics. Polynomial dynamics informs a broad class ODEs and PDEs (e.g. Navier Stokes). Although we do not consider PDEs in this work, the methods described here also informs their numerical integration. In Algorithm 1 that follows, we show that neural computation of polynomial dynamics with known coefficients is (trivially) exact. A fuller discussion of PolyNet classes is presented in Section V.
Let be an -dimensional variable and a dynamical system , where is a polynomial containing monomials of maximum degree . There are possible monomial terms per output and there are outputs. The function can be of the form:
where, , and is defined as
The equivalent PolyNet
is defined as a three-layer network with inputs , outputs , and one hidden layer . is constructed as follows (Algorithm 1):
Set each output node’s activation to be linear.
Set as the bias for the output node .
Connect input node to output node with weight , where and .
To map the input to the hidden layer, re-write the single term,
where and is the Kronecker delta function.
The term is a hidden product node with linear activation and connecting to the input layer. Product nodes are commonly used in deep networks lstm. There are exactly connections between node and , each of unit weight. Note logarithms and anti-logarithms among others are obvious ways to express powers. however, we’ve avoided them to preclude invoking complex number representations.
A hidden node connects to an output node additively with weight , where and .
There are at most nodes in the hidden layer. Thus the PolyNet with parameters , , and at most weights of value
constitute the parameter vectorof .
Thus, by Algorithm 1, the dynamical system , (Equation 1) and the PolyNet neural network are equal.
iii.1 Example: Lorenz Dynamics
In Figure 1, a neural architecture derived directly from the governing equations of the Lorenz system computes the time derivatives given an input. The equations describing the system are
Interestingly, Lorenz system is a polynomial with variables and degree (maximum) but requires no more than two hidden nodes as implemented by the PolyNet. Note also that hidden nodes of degree are technically unnecessary, and we have simply replaced them with the necessary additive connections from input to output nodes.
Iv Neural Integration
In this section, we construct neural integrators, first for the Runge-Kutta (RK) method and second to Adams-Bashforth-Moulton (ABM). For the Runge-Kutta method, consider a neural implementation of the common order explicit Runge-Kutta scheme with fixed step:
If is an external function guaranteed to be accurate/correct, then the step-update is polynomial in and , and it is an exact PolyNet. A simple design is to follow Algorithm 1, however, the network size will change with order, which we avoid using recurrence.
Consider instead the network depicted in Figure 2 and Figure 3. In this circuit, coefficients cyclically enter the network through the nodes on the left, and the output for the next time step is computed after five iterations of the system. In this circuit, computation occurs in the order specified by the numbers adjacent to each node. The computations at nodes , , and are then held by delays in the recurrence until the next iteration. Each node takes the sum or product of its inputs as indicated by the symbols on each node. In the circuit, represents the gradient computation achieved via any neural circuit representing a set of polynomial ODEs. In our example, the Lorenz neural circuit in Figure 1 is implemented for , but any neural dynamical system could replace it. Every iterations, node contains the output that is the next element in the time series. With this construction, given a single initial state, the neural circuit iteratively outputs the time series. The neural circuit in Figure 2 is an exact recurrent PolyNet of fixed size that implements explicit fixed-step RK4 shown in Equations 5-9.
The structure shown in Figure 2 can be modified slightly to apply to a general Runge-Kutta method. Since we consider general Runge-Methods to integrate ordinary differential equations, the equations take the form
To generalize the structure of our system, we modify node and the associated coefficient array to an array of nodes, one for each where , such that node will pass the appropriate to each node with a timed one-hot multiplicative weight. Each node will pass the to itself and node with the appropriate multiplicative coefficient such that the input to from array is always at round for . In addition, the coefficient array must change to
Finally, the coefficient array changes to the length one-hot vector with the single high bit in the final entry. With these changes, the neural architecture for the order Runge-Kutta method applies to general explicit Runge-Kutta methods. Note that the order of approximation can be varied during neural integration merely by increasing or decreasing coefficient sequences without changing the network. This provides a serious advantage in changing stiffness/dynamical regimes.
iv.1 Semi-Implicit Methods
To extend the methodology to other numerical integrators, consider the two-step Adams-Bashforth predictor, coupled to a two-step Adams- Moulton Corrector, according to
The neural architecture for this numerical integrator is shown in Figure 4, where again any equation can be integrated by the system.
depicts a comparison between a fourth-order explicit Runge-Kutta neural integration, a second order Adams-Bashforth-Moulton method neural integration and the corresponding ordinary numerical integration. The computational difference between the neural circuit method and the classical numerical computations were that the neural circuit was implemented with a TensorFlow TensorGraph, and the classical numerical computation was done with a simple function111All programs were written in Python and executed on a Laptop.. Unsurprisingly, since the operations are mathematically equivalent, the output time series are exactly equivalent, save computational floating point discrepancies that magnify as a result of the system’s chaotic behavior. To demonstrate that the discrepancies are due to floating point computation, we have perturbed the initial condition by , which is the precision used for the computation. As a comparison, we have also generated a time series with a perturbation of to show that there is some level of randomness in the accuracy of the system following an initial perturbation.
Although exact PolyNets are sufficient to establish the correctness of neural integration, two issues remain. The first is computational efficiency. We posit that constructing a DtNN allows for faster and more efficient computation than interleaving neural and classical computation. Representing calculations with tensors enables computation on specialized hardware for potentially faster performance. The scaling with independently executing DtNNs, such as for uncertainty quantification and other applications, can be enormous. Indeed, for nominal ODE integration, such benefit already been shown Niemeyer2014. It remains to be seen whether the benefits apparent for many independent DtNN simulations extend to other problems such as variational Bayes. We hypothesize that the gains could be significant especially if coupled with a low-cost variable-order neural integration, as our circuits allow, is minimal.
The second issue is of training embedded CtNNs. To discuss this further, consider the network-dynamics table shown in Table 1. When polynomial structure is known but coefficients are unknown, PolyNet Identification222This term is used in the sense of System Identification in Estimation and Control. is necessary. This is accomplished using the discrete adjoint method bryson for DtNN parameter estimation. In the forward pass, the DtNN simulates outputs up to a horizon and, in the backward pass, errors are back-propagated for estimation. DtNN learning over a receding horizon relates closely to classical receding horizon model predictive optimal control. Of course, DtNNs can be trained using alternate approaches, particularly using Bayesian filters, and fixed point, fixed interval and fixed lag smoothers Ravela2007.
|Poly w/ Coeff.||Poly w/o Coeff.||Non-poly|
When the dynamics have non-polynomial governing equations, PolyNets are a useful approximation in the sense of Stone-Weierstrass weierstrass theorem enabling approximations of smooth non-linearities. However, in this case the PolyNet structure (number of monomials, degree, connectivity) is also in error. The challenging problem of PolyNet Structure Learning333Used analogously to Structure Learning in Machine Learning and Structural Model Errors in Geophysics. must be addressed. However, if parsimony principles are applicable to modeling, then one can gradually increase PolyNet complexity, perhaps by sampling, and apply model selection criteria. This approach is however in stark contrast to structure learning wherein a dense, large network is “made sparse,” a problem both in neural learning as well as graphical models. The systematization of model complexity may enable more tractable structure learning.
PolyNets are also, we posit, promising theory-guided learning machines karpatne19. Using theory, via governing equations to establish the initial PolyNet makes them “explainable” and avoids the ”black-box” nature typical neural construction. Subsequently, with the arrival of data, PolyNet Identification and PolyNet Structure Learning interleave, following parsimony principles titrating model complexity. In these steps, to be sure, only PolyNet terms are estimated– the integrator loop remains immutable. Thus, constructing neural circuits for dynamical systems would potentially allow for inference of the underlying equations of a system while accounting for the known physical behavior.
Theory-guided learning, as discussed, offers enormous advantage. Contrast it with the general setting wherein a polynomial is learned directly from data. Using a two-layer neural network similar to the PolyNet, bounds derived by Andoni et al. andoni are show that the problem is substantially worse. This absence of guidance for PolyNet design implies a combinatorial explosion emerging with degree. This in turn causes the number of nodes, required training data, iterations to convergence to all become prohibitive for most applications.
When the governing equations are polynomial but PolyNets aren’t used, the situation also becomes quite a bit complicated. Neural networks have been argued to be polynomial machines where hidden layers PolyReg control the effective degree. However, while neural architectures offer enormous modeling flexibility through myriad activation and learning rules, they can be very sub-optimal with respect to number of parameters needed to model a polynomial. For example, a single neuron with tanh activation poorly models 444See http://essg.mit.edu/ml/yx2; sensitivity to noise, weak generalization and no ‘extrapolation” skill are easily evident. In fact, the Taylor expansion of has no second-degree term. It may be easier to perform ”polynomial regression” instead (as PolyReg argue) provided, as discussed in the previous paragraph, the combinatorial explosion of monomials can be managed. We hypothesize that layered architectures alternating compression (by dimensionality reduction) with nonlinear stretching (by low-degree PolyNets) could be a promising alternative to classical neural circuitry or classical polynomial regression.
Regardless, given polynomial equations and the insistence on using a standard neural network architecture (e.g. feed-forward with smooth nonlinear activation), we show elsewhere that bounds on neural structure can be established ravela19. These bounds have been experimentally validated ziwei, but naturally are higher than PolyNets provide. For example, for the Lorenz system, exactly PolyNet nodes are needed, however, a feed-forward network with activation, nodes appear to be necessary. Andoni et al.’s bounds are prohibitively weaker.
In this paper we demonstrate neural integration: solving neural ordinary differential equations as neural computation. Neural integration is carried out by Discrete-time Neural Networks (DtNN). For a provided Neural-ODE, we construct DtNNs for explicit fixed-step Runge-Kutta methods of any order. The DtNNs are recurrent PolyNets and remain of fixed size irrespective of order. We also construct DtNN for an second-order Adams-Bashforth predictor coupled to a second-order Adams-Moulton corrector. In both these cases, we show that neural integration is correct for any correct embedded Continuous-time Neural Network (CtNN). We further demonstrate this numerically using exact PolyNet for the Lorenz-63 attractor, and showing convergence between neural and numerical integration up to numerical representation. We argue that PolyNets offer a for theory-guided modeling because the availability.
Our focus in this first of a three-part paper is to simply demonstrate neural integration. The next paper focuses PolyNet identification and the third paper focuses on PolyNet adaptation. We also seek to apply this methodology to Systems Dynamics and Optimization (SDO) problems in the environment and for the design of new adaptive autopilots for environmental monitoring.
Ms. Trautner is an undergraduate researcher at ESSG advised by Dr. Ravela. Support from ONR grant N00014-19-1-2273, the MIT Environmental Solutions Initiative, the Maryanne and John Montrym Fund, and the MIT Lincoln Laboratory are gratefully acknowledged. Discussions with members of the Earth Signals and Systems group is also gratefully acknowledged.