 # Learning second order coupled differential equations that are subject to non-conservative forces

In this article we address the question whether it is possible to learn the differential equations describing the physical properties of a dynamical system, subject to non-conservative forces, from observations of its realspace trajectory(ies) only. We introduce a network that incorporates a difference approximation for the second order derivative in terms of residual connections between convolutional blocks, whose shared weights represent the coefficients of a second order ordinary differential equation. We further combine this solver-like architecture with a convolutional network, capable of learning the relation between trajectories of coupled oscillators and therefore allows us to make a stable forecast even if the system is only partially observed. We optimize this map together with the solver network, while sharing their weights, to form a powerful framework capable of learning the complex physical properties of a dissipative dynamical system.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction and Review

Dynamical systems find manifold applications to processes in everyday life and allow insights into many areas not only of physics, but also mathematics, or theoretical biology. A dynamic system can be described by a mathematical model of a time-dependent process whose further course depends only on its initial state, but not on the choice of the starting point in time. In this work we focus on dynamical systems that can be described by ordinary differential equations (ODEs).

If the system is only subject to conservative forces, i.e. forces that can be derived from a scalar potential, higher order ODEs can be written as first order systems by introducing derivatives as new dependent variables of the system. An example of such a reduction are Hamilton’s equations (Equation 1). These equations relate the canonical coordinates, position and momentum , of a system to its total energy, (for a 1-dim. system with one particle of mass , and ).

 dpdt=−∂H∂q,dqdt=+∂H∂p. (1)

There have been several recent articles that focus on dynamical systems, only subject to conservative forces, in the context of machine learning

[Lu2018, Long2018, Chen2018, Greydanus2019, Dupont2019]. The fact that such systems are reducible to first order ODEs qualifies them as interesting examples to be studied with residual type networks. It has been first observed by [E2017] that a ResNet block can be understood as a difference equation that approximates a first order ODE.

Residual Neural Networks

[He2016] are deep neural networks defined by stacking network blocks combined with a residual connection, adding the input of the block to its output (see left panel of Figure 1). If we assume a 1-layer convolutional network with hidden nodes and , filter, applied on a 1-dimensional input , this becomes:

 xt+1,h=xt+\emph{NL}(∞∑j=0ωl=1h,jxt−j), (2)

or:

 xt+1=xt+F(xt,Θt) (3)

with representing the learned network parameters and NL the non-linearity that has been chosen. If we have more than one layer in the network and represents the hidden state of the layer, we can rewrite Equation 3 by introducing a strictly positive parameter , representing the time discretization of a given observable: timestep/number of layers.

 xl+1=xl+d~F(xl,Θl). (4)

For sufficiently small , the residual block can be interpreted as a forward Euler discretization of a first order differential. In this sense Equation 4 builds the link between the network architecture and the dynamics of the system under observation.

The family of problems captured by this formalism is limited to systems that are not subject to dissipation. In this note we are widening the scope and target non-reducible, second-order dynamical systems, i.e. systems that are subject to non-conservative forces as for example friction. We introduce a new network that incorporates the notion of a second-order time derivative, as a function of the systems real space observable and it’s time derivative. We then show that we can learn the systems approximate dynamics from it’s real space trajectory only. We then further ask the question if we still can learn the governing physics of a coupled dynamic system driven by non-conservative forces, if we only partially observe its real-space trajectories?
From here the paper is structured in the following way: In Section 2 we introduce the concept of how residual connections can be used to emulate a second-order finite difference step, Section 3 we then put the architecture to the test and Section 4 deals with the question if we can further exploit the mathematical relation between real-space trajectories of coupled systems to approximate the differential equations of such a system if observed only partially.

## 2 Architecture

The base architecture of the network proposed, is inspired by a finite difference solver. We are looking for solutions of a second order (linear) problem of the form:

 ¨x+p(t)˙x+q(t)x=r(t) (5)

To be able to feed our network with real-space trajectory coordinates of a dissipative dynamical system, we need to incorporate the notion of a second order derivative into our architecture that we call OscillatorNet. We use the second order central difference approximation, where represents the time discretization or sampling frequency of a time-series,

 ¨xt≈xt+Δ−2xt+xt−ΔΔ2+O(Δ2), (6)

to retrieve the residual architecture as:

 xt+Δ=xt+(xt−xt−Δ)−F(xt,Θt)⋅Δ2 (7) Figure 1: Left: Schematic of the ResNet architecture. Right: Schematic of the OscillatorNet architecture. The yellow arrow indicates the residual connection, the green arrows indicate the differential residual

The right panel in Figure 1 shows a schematic of a second order differential block. Additional to the residual connection (yellow arrow) we add a differential residual (green arrows), i.e. we subtract in a separate channel from and then add it together with the residual to the network-block output, therefore implementing a second order finite difference step. The differential residual only adds the terms necessary to approximate a second order differential. The network therefore does not have parameters available to learn dissipation (see top panel of Figure 3). By doubling the number of filters we not only learn the inhomogeneous term as a function of the time series variable, i.e. the position (e.g. the force applied by a spring in an elastic pendulum; see lower panel of Figure 2) but it also becomes possible, to learn a second inhomogeneous term as a function of the derivative of the time series, i.e. the speed (e.g. the friction force acting on an elastic pendulum). This allows the system to numerically approximate the full differential equation generating the data:

 ¨x=f(x,˙x). (8)

## 3 Learning ODEs from real-space trajectories

### 3.1 Damped harmonic oscillator

The damped harmonic oscillator is a simple dynamical system subject to dissipation (see Figure 2), described by the differential equation:

 d2xdt2=−bmdxdt−kmx. (9)

We replace the derivatives with their central difference approximation and get:

 (xt+Δ−2xt+xt−ΔΔ2)≈−bm(xt−xt−ΔΔ)−kmxt (10)
 xt+Δ≈−bΔm(xt−xt−Δ)−kΔ2mxt+2xt−xt−Δ (11)

#### 3.1.1 Canonical weights

Three physical constants govern Equation 11: the mass , the spring constant and the damping coefficient . From now on we will refer to them as canonical weights, to be learned by the solver network:

 W=[m,b,k]. (12)

We chose the canonical weights of the network to reflect the proper parametrization of our ODE for explanatory reasons only. As we will reveal in the result section, this choice allows us to demonstrate the capabilities of OscillatorNet in a intuitive manner. However if we want the learned values to be expressed in their proper SI units we have to initialize them with values of the correct order of magnitude. In Section 4.1.2 we generalize these results by training more general weights only dependent on the sampling frequancy of the input time-series.

For two input vectors

and we can express a finite difference step of the network in terms of the canonical weights as:

 [kΔ2m−bΔm]⋅[xtxt−xt−Δ]

#### 3.1.2 Experiments

In our experiment we input 60 consecutive real-space trajectory points of a damped harmonic oscillator with mass , damping coefficient and spring constant in a 1-layer OscillatorNet (kernel size = 1, channels = 2, i.e. number of free parameters = 3). We can now carry out a single finite difference step while the parameters of the differential equation are automatically learned from the data. If we choose to forecast multiple points (free forecast), each prediction is fed back into the layer, before the network is re-trained to generate the next forecast.

Figure 3 shows the results of a free forecast of 60 real-space trajectory points, i.e. except for the training set, no data of the time-series is used to train the network. The top panel shows the networks output with only one filter. We learn the correct frequency but the amplitude deviates increasingly as a function of time, due to the fact that we cannot learn a second inhomogeneous term as function of the oscillators momentum and therefor the total energy is conserved. Figure 3: Top panel: Output from OscillatorNet without doubling the number of filters, i.e. the network cannot learn the dissipative term oft he governing equation. Botton panel: The network has enough free parametrs to learn the full equation ¨x=f(x,˙x). The blue dots represent the output of 1 layer ResNet. Inset: Forecast if the network is trained on less than 1/4π of the oscillator signal.

The lower panel of Figure 3 shows the forecast of OscillatorNet with two filters (needs some explanation of the generator). In comparison we also show the result of a single ResNet layer (blue dots).
The fact that we can forecast multiple time-steps in a stable fashion, without exposing the network to more ground-truth data, indicates that we indeed learn the differential equation. To confirm this claim and to show that we accurately learn the governing physics of the system, we extract the trained canonical weights. A comparison of the true values versus the canonical weights learned by OscillatorNet can be found in Table 1.

An impressive feature of the network is it’s ability to approximatively learn the correct parameters even if it is only trained on a very small sample of the time-series. The inset in Figure 3 shows the output of the network if trained on less than a quarter period of the damped oscillator signal. The learned parameters can be found in Table 2.

### 3.2 Coupled harmonic oscillators

OscillatorNet generalizes to arbitrary many coupled masses. Let’s take a look at two coupled oscillators (Figure 4), subject to different spring constants and damping coefficients, described by the system of ODEs:

 d2x1dt2 = −b1m1dx1dt−k1m1x1+(x2−x1)k2m1 d2x2dt2 = −b2m2dx2dt−(x2−x1)k2m2 (13)

Replacing the derivatives with the respective central difference approximations and solving for becomes:

 xt+Δ1 ≈ −b1Δm1(xt1−xt−Δ1)−Δ2m1(k1+k2)xt1+k2Δ2m1xt2+2xt1−xt−Δ1 xt+Δ2 ≈ −b2Δm2(xt2−xt−Δ2)−k2Δ2m2(xt1+xt2)+2xt2−xt−Δ2 (14)

The canonical weights of this problem are:

 W=[m1,m2,b1,b2,k1,k2] (15)

For two input time-series ]and we can express a finite difference step of the network in terms of the canonical weights as:

 ⎡⎢⎣−Δ2m1(k1+k2)Δ2m1k2−Δb1m10Δ2m2k2−Δ2m2k20−Δb2m2⎤⎥⎦⋅⎡⎢ ⎢ ⎢ ⎢ ⎢⎣xt1xt2xt1−xt−Δ1xt2−xt−Δ2⎤⎥ ⎥ ⎥ ⎥ ⎥⎦

We train OscillatorNet on 60 points of the signal by embedding each trajectory as a separate channel into the network. The network then carries out a single finite difference step for both time-serie embeddings, while learning the parameters of the differential equations, before we feed the forecast back into the network and retrain to produce the next forecast. The lower panel in Figure 5 shows the results of the free forecast of 60 points for each trajectory, achieved in this manner. The learned weights can be found in Table 3. Figure 5: Top panel: 60 points free forecast of one trajectory of a coupled harmonic oscillator after the network is trained on only on the trajectory of oscillator 1. Lower panel: 60 points free forecast of the full oscillator system after trainig on 60 points of each oscillator trajectory.

## 4 Exploiting the relation between real-space trajectories of a coupled system

In a real-world application we might not be able to observe the complete configuration space of a coupled dynamical system. Let’s consider again the toy-model consisting of two coupled masses depicted in Figure 4. If only partial information about the system is available, i.e. just one time-series describing the trajectory of the first oscillator, a stable forecast is not possible (see top panel of Figure 5). Let’s go back to a system of coupled ODEs as in Equation 13. If the boundary conditions are fixed on one side, we can recover the position in time of a second coupled oscillator from the first. If we solve the first equation for and replace the derivative with it’s proper backward difference approximation, we arrive at :

 xt2=1k2{m1Δ2(xt1−2t−Δx1+xt−2Δ1)+b1Δ(xt1−xt−Δ1)+k1xt1}+xt1. (16)

If we have more than two coupled oscillators in the system of consideration, the oscillator position is given by:

 xti=1ki{mi−1Δ2(xti−1−2t−Δxi−1+xt−2Δi−1) (17) +bi−1Δ(xti−1−xt−Δi−1)},2

Equation 17 can be realized with a convolutional network of the form:

 xt2=α⋅Conv3(xt1,[+1,−2,+1])+β⋅% Conv2(xt1,[+1,−1])+γ⋅xt1. (18)

We refer to this part of the network as mapping (see Figure 6), where we learn the relation between and ( and for 3 oscillators, etc.). The weights of the convolutional kernel in Equation 18 are fixed and represent a stencil on which we perform the numerical derivative. It is important that we use backward finite difference coefficients to respect the causal order of the time-series. The accuracy of this numerical derivative can be increased by changing it’s order. For example the second order stencil carries the following coefficients:

The 3 parameters and in Equation 18 however are trainable by an additional convolutional kernel that acts as a projection operator:

 α=m1k2Δ2,β=b1k2Δ,γ=k1+k2k2 (19)

Since these parameters are also a combination of canonical parameters that form a subset of , the canonical weights that where introduced in the solver part of the network, it seems natural to share this subset over the whole network (solver and mapping). In other words, during training we update the weights of the mapping and the solver combined. However, since , and we embed only one of the two trajectories in the network, when back-propagating the error not all canonical weights are updated but only the subset . This limits our capability in learning the physics that govern the system but not necessarily our ability to produce a stable forecast over a large horizon.

### 4.1 Results

It is only possible to share the parameters between mapping and solver, if we implement the projection according to Equation 18. In other words, the choice of canonical weights dictates the size of the convolutional kernel that learns the linear combination of and . For the canonical weights chosen in our examples the kernel is limited to size . We can choose a wider kernel but have to cut the umbilical cord between solver and mapping. In our experiments we have tried both, a kernel of size that allows us to share and , as well as a wider mapping kernel of size , where we update only on the solver side.

#### 4.1.1 Canonical weights Figure 7: Top panel: Network output of the mapping within the training set. Lower panel:Mapping output for causal padding (blue diamonds) and valid padding (green triangles) and their respective free forecasts in the test set (for clarity only the first ten points are plotted). The orange line is the network output using valid padding but with the inner feedback loop activated.

The first experiment shows the results of sharing the solver weights with the mapping, implemented according to Equation 18, with a projection kernel of size one. The blue diamonds show the mapping output if we set padding to causal, for the green triangles, padding has been set to valid. If we use zero padding during the training of the network, the numerical derivatives calculated on the mapping stencils are affected, as can be seen in the top panel of Figure 7 between timestamp t= and t=

. If we set padding to valid, these points are truncated. What is more evident, is that even though if padding is set to valid, the mapping is not learning the correct trajectory for the second mass, it approximates a higher amplitude solution, i.e. all points where the trajectories of the two oscillators time-series intersect, are matched by the mapped trajectory. This result is not surprising since we update only a subset of all canonical weights when we back-propagate the error, the network is free to learn a solution that is numerically more stable, given it’s partial degrees of freedom:

, , and , while and are fixed at initialization. Figure 8: Top panel: Blue diamonds show the quality of the mapping and the impact of zero padding, the green triangles show the mapping with padding set to valid. Botton panel: Free forecast of oscillator 1 on basis of wide mapping kernel with padding set to valid (solid green line), and causal (solid blue line).

The lower panel of Figure 7 shows the out-of-sample free forecast based on the above training. Not surprisingly, the highest error is observed when padding is set to causal and therefore affecting the mapping kernel. But even the forecast with padding set to valid, proves to be only stable for the first 4 points before the error becomes too large. The forecast however can be stabilized, if we feed the mapped time-series back into the solver at each free-forecast step. The result of this inner feedback loop (IFL) is the orange curve in the lower panel of Figure7.

If we widen the projection kernel to size 25, we are no longer able to share the weights between solver and mapping, but on the other hand we get more stable results in terms of the quality of the forecast. Figure 8 shows the behaviour of the mapping network in the test and training set (blue diamonds for padding set to causal and green triangles for valid padding). If we take a look at the parameter learned (Table 6), and compare it to Table 3, the results of embedding both trajectories into the network, we can see that we achieve a lower error when mapping to a higher frequency solution.

#### 4.1.2 Combined weights

When we look at the above results, we find that all examples where we use mapping, fail at approximating the second spring constant. The reason lies in the fact that the chosen weights make the problem numerically ill defined. In an attempt to make the network more general and ensure more numerical stability, we choose weights that only carry the units and . In terms of the canonical weights, we can express the new set of parameters as:

 parametera = Δ2k2m1 (20) parameterb = Δ2k2m2 (21) parameterc = Δb1m1 (22) parameterd = Δb2m2 (23) parametere = k1+k2k2 (24)

We initialize the network with the same values as before, for the combined weights, all results can be found in Figure 9 and Tables 8, 9 & 10 Figure 9: Top panel: Mapping output during training for a wide mapping kernel and causal padding (blue diamonds), a wide mapping kernel and valid padding (green triangles) and for a mapping kernel of size 1 with valid padding (red triangles).Botton panel: Test set showing the free forecast of different kernel sizes and padding. The mapping and forecast if kernel size is set to one (red triangles and dots) is only shown for the first 6 points due to large increase in error.

## 5 Conclusion

In this work we have introduced a new type of network, capable of solving the time-series patterns that form the solution of second order ODEs, subject to non-conservative forces. Instead of locally interpolating a time-series, the network introduced learns a prior representing the governing physical laws and is therefore capable to forecast the systems behaviour over a large time-horizon. Moreover we have shown that given the correct parametrization of the network we can approximate the governing equations to high accuracy. We have further investigated the possibility that, given only partial information about a coupled system we can train an internal mapping to retrieve a stable forecast of this partial system over many points. However our experiments have shown that if internal mapping is used, the network tends to learn a high amplitude