1 Introduction
Dynamical systems are a tool of choice to model the evolution of phenomena occurring in nature. In order to derive a dynamical system describing a real world physical process, one must first gather measurements of this system. Then, a set of variables describing the system at a given time , called the state, along with a transition function linking consecutive states in time, is inferred based on the available measurements. Generally, the continuous limit proves to be more tractable, powerful and convenient for calculations, so that one usually considers an evolution equation of the form :
(1) 
Many phenomena studied in physics, computer vision, biology
(pde_biology), geoscience (pde_geoscience), finance (pde_finance), etc… obey a general equation of this form. For this reason, an extensive effort has been put into gaining a better understanding and resolving this equation. However, for many practical problems, the relation between the components of the state is highly nonlinear and complex to describe analytically: finding an appropriate evolution model can thus elude scientific communities for decades.With the availability of very large amounts of data captured via diverse sensors and recent advances of statistical methods, a new datadriven paradigm for modeling dynamical systems is emerging, where relations between the states are no longer handcrafted, but automatically discovered based on the available observations. This problem can be approached by considering some class of admissible functions , and looking for a such that the solution of :
(2) 
fits the measured data. This approach has motivated some recent work for exploiting machine learning in order to solve differential equations. For example, rudy_datadriven_2017 parameterizes
as sparse linear regression over a set of predefined candidate differential terms,
RAISSI2017683; Raissi18 or Long2018 use statistical models such as Gaussian processes and neural networks to model and learn a solution to the corresponding equation.Previous methods have essentially considered the case where the state of the system is fullyobserved at all times . However, for many realworld applications, the entire state of the system is not fully visible to external sensors: one usually only has access to lowdimensional projections of the state, i.e. observations. Intuitively, the latter can be seen as what is readily and easily measurable; this means that, in contrast with the ideal case where the full state can be observed at all times with perfect certainty, there is an important loss of information. This issue is a major one in many fields within applied sciences (data_assim_bocquet; CAnalysisMF).
In our work, we consider the problem of learning complex spatiotemporal dynamical systems with neural networks from observations , which are only partially informative with respect to the full state
. First, we formulate the problem as a continuoustime optimal control problem, where the parameters of the neural network are viewed as control variables. From this, we then present a natural algorithm solving the resulting optimization problem, placing the neural network in an ordinary differential equation (ODE) solver in order to produce future predictions. Finally, we successfully apply our method to three increasingly challenging datasets and show promising results, comparing our approach to standard deep learning baselines.
Our main contributions are the following:

a general, widely applicable approach for modeling spacetime evolving processes with neural networks;

linking the continuoustime optimal control framework to neural network training in the partially observed case;

experiments with realistic dynamical systems exhibiting good forecasting performance for long time horizons in different settings;

experiments showing successful unsupervised learning of the true hidden state dynamics of the dynamical system;

all our results are achieved without imposing priors over the form of the studied equation. This is an important novelty w.r.t. existing work.
2 Background
2.1 Continuous State Space Models
We consider spacetime dynamics for which can be written as a function of where and are respectively the time and space variables,
the domain over which we study the system. The spatial vectorvalued function
contains the quantities of interest describing a studied physical system at time .In a realistic setting, the state is generally only partially observed e.g., when studying the ocean’s circulation, variables contained in the system’s state such as temperature or salinity are observable, while others such as velocity or pressure are not. In other words, the measured data is only a projection of the complete state . This measurement process can be modelled with a fixed operator linking the system’s state to the corresponding observation :
In the following, is supposed known, fixed and differentiable^{1}^{1}1In most practical cases, this hypothesis is verified as can usually be represented as a smooth operator.. Let us note that, generally, the measurement process represents a considerable loss of information compared to the case where is available, as the measurements may be sparse and lowdimensional.
Moreover, we assume that obeys a differential equation of the general form of equation 1, with an initial condition . This leads us to the following continuous state space model:
(3) 
2.2 Neural Ordinary Differential Equations
Recently, the link has been made between residual networks and dynamical systems E2017: a residual block can be seen as the explicit Euler discretization of the following system:
(4) 
Adopting this viewpoint, time corresponds to the neural network’s layer index, the initial condition to the network’s input, and the forward pass as the time integration , where corresponds to its output. node propose computing this intractable integral using an ordinary differential equation (ODE) solver. During training, in order to compute the derivative with respect to the neural network parameters, the corresponding adjoint state equation is solved backward in time. Note that in this work, instead of considering the evolution of the inner dynamics of the neural throughout its layers, we consider the dynamics of the studied process itself, in the context partially observed states.
3 Theoretical Framework
In this section, we set the theoretical framework necessary to solve our problem. As we will see, it can be formulated as a continuoustime optimal control problem, where the control variables correspond to the network’s parameters. In order to train our model, we derive the forward and backward equations necessary for the classical gradient descent algorithm solving it and discuss the two main methods available to compute numerical solutions.
3.1 Optimization Problem
Our goal is to learn the differential equation driving the dynamics of a smooth state function for which we only have supervision over observations through a fixed operator . In order to enforce our dynamical system to explain the observations, we define a cost functional of the form :
(5) 
Here, is a spatiotemporal field representing observations of the underlying system, the output of the system, and the norm associated to the Hilbert space over .
Since the state is constrained to follow the dynamics described by equation 2, starting from its initial condition , the optimization problem is in fact a constrained one : —l— θE_Y∈Dataset[J(Y,H(X))] dXtdt=F_θ(X_t) X_0=g_θ(Y_k,˘X_0) where is a smooth vector valued function defining the trajectory of , and gives us the initial condition . In other words, parameterizes both the dynamics through and the initialization through . In particular, if a full initial state is given as input to the system, can be taken as independent of any parameter and doesn’t need to be learned.
For any , we assume that and are such that there always exists a unique solution to the equation given as a constraint in equation 3.1. In the following, we will call such a solution .
3.2 Adjoint State Method
Now that the optimization problem is stated, an algorithm to solve it must be designed. For this, we will use a standard gradient descent technique. In order to use gradient descent, we must first calculate the gradient of the cost functional under the constraints, i.e. the differential of . However, this implies calculating , which is often very computationally demanding, as it implies solving forward equations.
But, by considering the Lagrangian formulation of the constrained optimization problem introduced in equation 3.1, it is possible to avoid explicitly calculating . The Lagrangian is defined as :
(6) 
here, the scalar product is the scalar product associated to the space over .
As, for any , satisfies the constraints by definition, we can now write :
which gives :
By calculating the differential of w.r.t. and using it to have the gradient of , we can obtain :
Theorem 1 (Adjoint State Equation).
(7) 
where is solution of :
(8) 
solved backwards, starting with , and where :
and
where denotes the adjoint operator of linear operator .
Proof.
The proof is deferred to section of the supplementary material, Section B.
We now have equations entirely characterizing the gradient of our cost functional: for a given value of , we can solve the forward equation 2 to find . Then, can be solved backwards as its equation only depends on which gives us all necessary elements to calculate the gradient of . This gives us the following iterative algorithm to solve the optimization problem, starting from a random initialization of :
3.3 Approximate Solutions
While Algorithm 1 seems quite straightforward, solving the forward and backward equations (2, 8) generally is not. Typically, they do not yield a closed form solution. We must content ourselves with approximate solutions. There are essentially two different ways to tackle this problem (control): the differentiatethendiscretize approach, or the discretizethendifferentiate approach^{2}^{2}2The differentiatethendiscretize method is often referred to as the continuous adjoint method, and the discretizethendifferentiate approach as the discrete adjoint method (sirkes1997finite)..
In a differentiatethendiscretize approach, one directly approximates the equations using numerical schemes. Here, the approximation error to the gradient comes from the discretization error made in the solver for both the forward and backward equations. This method is used in the black box solvers presented in node. This method has the advantage of allowing the use of nondifferentiable steps in the solver. However, this method can yield inconsistent gradients of cost functional , and the discretization of the adjoint equations depends highly on the studied problem and must carefully be selected (Bocquet2012).
In a discretizethendifferentiate approach, a differentiable solver for the forward equations is used, e.g. using an explicit Euler scheme . Based on the solver’s sequence of operations for the forward equations, the backward equations and the gradient can be directly obtained using automatic differentiation software (pytorch)
. This algorithm is actually equivalent to backpropagation
(lecun1988theoretical). As the stepsize approaches zero, the forward and backward equations are recovered. In this paper, we will use this method as the explicit Euler solver gives good results for our examples while being more easily tractable.4 Experiments
In this section we evaluate our approach, both quantitatively and qualitatively. We consider three different datasets representing dynamical systems with increasing complexity. We evaluate our method with respect to its ability to predict observations and to reproduce the dynamics of the hidden state. For the first two datasets, we use the full initial condition as input. For the last dataset, we only have access to a subset of the states which makes us propose a variant of our approach in order to accommodate this situation.
4.1 Datasets
The first two datasets are completely simulated: we have the true full state to initialize our algorithm in equation (3.1). The last dataset is based on a complex simulation, where real observations are assimilated in order to correct the simulation. Note that for this dataset, we do not have access to the full initial conditions.

The Shallow Water equations are derived from the Navier Stokes equations when integrating over the depth of the fluid (see supplementary material, section A.1). These equations are discretized on a spatial grid. We decompose the simulation into trainvalidation and test subsets of and acquisitions respectively.

The Euler equations, which are also derived from the Navier Stokes equations when neglecting the viscosity term (see supplementary material Section A.2). These equations are discretized on a spatial grid. We use observations for the train set and for the test.

Glorys2v4, parent2013global is a very challenging simulation to learn. We consider as observations the Sea Surface Temperature (SST) from a certain zone provided by the Global Ocean Physics reanalysis Glorys2v4 provided by the Copernicus Marine environment monitoring service ^{3}^{3}3http://marine.copernicus.eu. A brief description of Glorys2v4 is provided in appendix A.3. The dataset consists of daily temperatures from 20061228 to 20151230, from which we extracted a subregion. We take the first days for training, and leave the rest for the test set. Here, the full state is not completely available as initial input, we only have a proxy for one variable and for two dimensions of it: the velocity field. This makes initializing our dynamical system more challenging.
4.2 Implementation Details
We decompose the simulations into training sequences of fixed length, using timesteps for the target sequence. In practice, the cost functional
is estimated on a minibatch of sequences from the dataset, and optimized using stochastic gradient descent.
Throughout all the experiments, is a standard residual network (he_deep_2016), with downsampling layers, residual blocks, and bilinear upconvolutions instead of transposed convolutions. To discretize the forward equation (2) in time, we use a simple Euler scheme. Note that the discretization stepsize may differ from the time interval between consecutive observations; in our case, we apply Euler steps between two observations, i.e. . For the spatial discretization, we use the standard gridlike discretization induced by the dataset.
The weights of the residual network are initialized using an orthogonal initialization. Our model is trained using a exponential scheduled sampling scheme with exponential decay, using the Adam optimizer, with a learning rate set to . We use the Pytorch deep learning library (pytorch).
4.3 Experiments with Shallow water equations
The system of equations is described in more details in the supplementary material, Section A.1. Here, the state corresponds to the column height and the twodimensional velocity vector field, is a linear projector giving the first component of so that observation is the mixed layer depth anomaly and velocity is unobserved. The problem amounts to predicting future states with a training supervision over densities only and an initial full state given to the system. For experiments with shallow water and Euler simulations, we set to be equal to the initial full state provided as input to the system. Note that it is not uncommon to have prior knowledge on the system’s initial condition state (bereziat).
Forecasting Observations.
Figure 1 shows a sample of the predictions of our system over the test set. We can clearly see that it is able to predict observations up to a long forecasting horizon, which means that it generalizes and thus has managed to learn the dynamical system. Note that the initial state used at test time has never been seen at training time which means that the optimization problem was solved correctly without overfitting. The cost function and the supervision were only defined at the level of observations. For the velocity vector field, color represents the angle, and the intensity the magnitude of the associated vectors.
Hidden State Discovery.
Our method forecasts a full state and not only the observations . In order to predict the observations correctly, our model has to learn to predict future hidden states that contain information of the true state. By feeding the true initial conditions to our model, we find that our method is able to learn the true dynamics of the hidden state with a good accuracy, while never directly enforcing a penalty on the the latter. Note that the only access our method has to full states is through the initial state provided as input. This result is intriguing: the model should theoretically be able to use a state encoding that is different from the one given by the initial condition. We hypothesize that our network’s architecture is biased towards preservation of the input code. This is also empirically observed in the domain translation domain.
Interpolation between data points.
Our framework allows us to forecast for arbitrary times . Figure 2 shows a sample of this interpolation mechanism. In this example, the model has been trained by regressing to the targets every 3 images (materialized on the figure by the red boxes). The outputs of the model are then compared with the unseen ground truth states. This shows that our approach allows us to learn the true evolution of the state. This is an important feature of our method, similar in this aspect to the claims of node, even though it is applied here to a highdimensional, highly nonlinear and partially observed learned dynamical system, for which we can interpolate the observations as well as the inferred hidden state.
4.4 Experiments with the Euler equations
The encouraging results of the previous subsection made us want to try our methods with more complex dynamics, namely the Euler equations, in the same conditions to see if it is able to cope with a more difficult example. We use exactly the same architecture as the the previous experiment, and obtain similarly good results on the test set, as shown in Figure 3. Again, we manage to predict observations as well as hidden state dynamics for long forecasting horizons with one full state as input and a training supervision over observations only. The form of Euler equations is provided in appendix A.2.
4.5 Experiments with Glorys2v4
This dataset is much more challenging and represents a leap from the fully simulated ones presented before. One reason is obviously the high dimensionality of the system and the absence of a full state as initial input to our system as we only have a proxy over the velocity field. A second one is the fact that we only work over sequences from the same ocean zone while the model functions within a larger area. This makes the dynamics for a single zone nonstationary as boundary conditions are constantly shifting, thus violating an important assumption of our method and making it almost impossible to make long term forecasts with a reasonable number of observations. All we can hope for is for the dynamics to be locally stationary so that the model can work well for a few steps.
Model  K=5  K=10 

Ours  
Ours (with estimation)  
PKnI (bezenac_deep_2018)  
ConvLSTM ((convlstm)) 
Model  K=5  K=10 

Ours  
Ours (with Estimation)  
PKnI (bezenac_deep_2018)  
ConvLSTM (convlstm) 
. We use the average cosine similarity between the velocity vectors
and ground truth , defined as .Dealing with partial initial conditions.
In order to take into account the observations made above regarding this system, especially the fact that the initial temperatures (in this case, since the we observe the temperatures, ) and the proxy of the velocity field provided as initial input is insufficient to represent the full state, we take in equation (3.1) to be:
(9) 
where corresponds to the past observations ( in the experiments), and is an encoder neural network^{4}^{4}4In this case, corresponds to the parameters of and , which are not shared across networks.. Using allows us to encode available information from the observations which is not contained in nor in . For , we use the UNet architecture (unet). This variant accommodates our approach to model to this dataset, and shows the potential of our method to be used in settings of varying difficulty. We now compare our method against several baselines.
4.5.1 Baselines
PKnI.
This is the physicsinformed deep learning model in bezenac_deep_2018, where prior physical knowledge is integrated: it uses an advectiondiffusion equation to link the velocity with the observed temperatures, and uses a neural network to estimate the velocities.
Convolutional LSTM.
LSTM NN which uses convolutional transitions in the inner LSTM module (convlstm). This model can only produce observations.
4.5.2 Results
We test both variants of our model. The first one is the same as in previous experiments: we take as input and consider the full state to be . The second variant accommodates the fact that the latter is not the full state, and use an encoder network to produce an augmented state. Table 1 shows the forecast error on the observations for different time horizons ( and ). Note that both models variants outperform our baselines across the the different time horizons. In Table 2, we also evaluate our hidden state. For this, we calculate the cosine similarity between the hidden states associated to the proxy on the velocity vectors and the proxy itself. Interestingly, both both our methods outperform the baselines, and tend to produce vector field correlated with . Finally, in figure 4, we can see that despite the high uncertainty from both the partial knowledge about the initial conditions and the varying boundary, our approach performs well.
5 Related Work
Datadriven Forecasting of SpaceTime Dynamics.
Forecasting spacetime dynamics with machine learning methods has been a long standing endeavour. (cressie2015statistics) gives a comprehensive introduction to the use of classical statistical methods to predict spatial timeseries, including the use of hierarchical models. In the neural networks community, (lstm)
introduced the famous Long ShortTerm Memory model which proved powerful in integrating temporal correlations and for which a convolutional version, more suited to spatiotemporal dependencies, was introduced by
(convlstm). More recent work includes (videopixel) which showed compelling results for video forecasting including on the standard Moving MNIST baseline while (Ziat2017SpatioTemporalNN) used embeddings to encode the dynamics in a latent space where the forecasting is done. All the works mentionned here aimed directly to the estimation of a transition function such that where is the studied spatial timeseries which means that the dynamics aren’t understood as resulting from a differential equation as we do in our approach.DataDriven Discovery of Differential Equations.
In the past, several works have already attempted to learn differential equations from data, such as e.g. Crutchfield87equationsof, Alvarez:2013:LLF:2554063.2554065. More recently, rudy_datadriven_2017 uses sparse regression on a dictionary of differential terms to recover the underlying PDE. In RAISSI2017683, they propose recovering the coefficients of the differential terms by deriving a GP kernel from a linearized form of the PDE. Long2018 carefully tailor the neural network architecture, based on the discretization of the different terms of the underlying PDE. Raissi18 develops a NN framework for learning PDEs from data. bilinear construct a bilinear network and use an architecture similar to finite difference schemes to learn fully observed dynamical systems. In those approaches, we often see that either the form of the PDE or the variable dependency are supposed to be known and that the context is the unrealistic setting where the state is fully observed. A more hybrid example is bezenac_deep_2018 where they propose to learn a forecasting system in the partially observable case, where part of the differential equation is known, and the other is approximated using the data, which allows the network hidden state to be interpretable.
6 Discussion
Benefits of ContinuousTime.
In the machine learning community, the forecasting problem is often seen as a learning a neural network mapping consecutive states in time. In this work, we take an alternate approach, and use the neural network to express the rate of change of the states instead. This task is intrinsically simpler for the network, and is in fact the natural way to model time varying processes. Moreover, this allows us to accommodate irregularly acquired observations, and as demonstrated by the experiments, allows interpolation between observations. From a more theoretic viewpoint, the adjoint equations derived in theorem 1 may be helpful in analyzing the behaviour of the backpropagated gradient w.r.t. the properties of the studied system.
Limitations.
However, there are still many aspects to explore. The fact that we are using explicit discretization should be limiting w.r.t. the class of equations we can learn as stiff equations necessitate the use of implicit methods and this can be worked around by the adjoint method we presented. We have also restricted ourselves to a linear and it would be interesting to see how our algorithms work for operators with a more complicated structure. Finally, we have restricted ourselves to the stationary hypothesis while, as we can see through the Glorys2v4 example, realworld processes, when looked at from a local point of view^{5}^{5}5Meaning that not all exterior forces are factored into the model., aren’t. These are interesting directions for future work.
Hidden State Discovery.
By feeding the initial condition to the neural network, and training the network to regress only to the observations, it was not expected that the neural network would forecast the hidden state in a way that closely mimics the true state of the underlying dynamical system. Indeed, the neural network must predict a hidden state that contains the information of the dynamical system’s state in order to correctly forecast the observations for multiple time steps, but the way the network structures this information is not constrained by the loss functional. We believe that these results are due to the fact that is easier for the network to use the same coding scheme as in the initial condition, instead of creating a disjoint code of its own for the following time steps. We see this empirical result as a very important one as it implies that it is possible to learn very complex dynamics with only partial information, without necessarily incorporating prior knowledge on the dynamics of the state. Along with the results obtained for the very challenging Glorys2v4 dataset, we are convinced this constitutes an important step towards applying learning to realworld physical processes. Obviously, the interaction of this phenomenon with the integration of physical priors into the algorithm, for example by adding explicit differential operators into
, is a very interesting question.7 Conclusion
We have introduced a general datadriven framework to predict the evolution of spacetime processes, when the system is highly complex and nonlinear and the state is not fully observed. Assuming the underlying system follows a timedependant differential equation, we estimate the unknown evolution term with a neural network. We argue that this is in fact a natural way to model continuoustime systems. Viewing its parameters as control variables, we propose a learning algorithm for the neural network, making use of results from continuoustime optimal control theory. Experiments performed on two simulated datasets from fluid dynamics and on data from a sophisticated data simulator used in climate modeling show that the proposed method not only is able to produce high quality forecasts at different horizons, but also learns with a good accuracy the underlying state space dynamics. This may open the way for new methods for integrating prior physical knowledge, e.g. by imposing constraints directly on the modeled evolution term.
References
Appendix A Equations
In this section, we succinctly describe the equations used in our experiments.
a.1 The Shallow Water Equations
The shallowwater model can be written as:
(10)  
where:

, , are state variables, standing for velocity and mixed layer depth anomaly)

is the vorticity.

is the reduced gravity

is the mean mixedlayer depth.

is the density of the water set to

is the dissipation coefficient set to

is the diffusion coefficient set to

is the zonal wind forcing defined in Eq. A.1
The zonal wind forcing is defined as:
where:

is the maximum intensity of the wind stress(in the standard case ).

is the latitude coordinate

is the center coordinate of the domain

is the length of the domain ( in our case).
Here, the state is composed of the velocity vector and the mixed layer depth:
For our simulations, the spatial differential operators have been discretized using finite differences on a Arakawa Cgrid.
a.2 The Euler Equations
(11)  
where is the divergence operator, corresponds to the flow velocity vector, to the pressure, and to the density.
The Euler equations are not of the form equation 1 as we still have the pressure variable as well as the null divergence constraint. However, the HelmholzLeray decomposition result states that for any vector field , there exists and such that :
and
Moreover, this pair is unique up to an additive constant for . Thus, we can define a linear operator by :
This operator is a continuous linear projector which is the identity for divergencefree vector fields and vanishes for those deriving from a potential.
By taking a solution of equation 11 and applying on the first equation, we have, as is divergence free from the third equation and as derives from a potential :
where permuting derivation and is justified by the continuity of the operator^{6}^{6}6One can use a finite difference approximation to show it for example..
Conversely, the solution of the above system is such that :
which gives, by exchanging and the integral^{7}^{7}7To prove this, we can take a sum approximation to the integral and use again the linearity then the continuity of . :
so that is automatically of null divergence by definition of . The two systems are thus equivalent.
In conclusion, we have:
Moreover, is generally a two or threedimensional spatial field while is a scalar field.
a.3 Glorys2v4
The Glorys2v4 product is a reanalysis of the global Ocean (and the Sea Ice, not considered in this work). The numerical ocean model is NEMOv3.1 (madec) constrained by partial real observations of Temperature, Salinity and Sea Level. Oceanic output variables of this product are daily means of Temperature, Salinity, Currents, Sea Surface Height at a resolution of 1/4 degree horizontal resolution.
The NEMO model describes the ocean by the primitive equations (NavierStokes equations together with an equation of states).
Let the 3D basis vectors, the vector velocity, (the subscript denotes the local horizontal vector, i.e. over the plane), the potential temperature, the salinity, the in situ density. The vector invariant form of the primitive equations in the vector system provides the following six equations (namely the momentum balance, the hydrostatic equilibrium, the incompressibility equation, the heat and salt conservation equations and an equation of state):
where is the in situ density given by the equation of the state A.3, is a reference density, the pressure, is the Coriolis acceleration. , and are the parameterizations of smallscale physics for momentum, temperature and salinity, and , and surface forcing terms.
As in subsection A.2, the divergencefree constraint over can be enforced through the Leray operator. Moreover, is a function of other state variables so that the state can be written as:
where is the daily mean temperature derived from the instantaneous potential temperature T in the model.
Appendix B Proof of Theorem 1
We start by differentiating . In what follows, all considered functions are supposed to be twice continuously differentiable in all variables and we will use the notation to designate the differential of with respect to i.e. the unique linear operator such that:
By hypothesis, we consider this operator to be always continuous in our case.
Straightforward calculus gives us:
Let us fix and a variation . Then, we have, by definition:
and, for any and any :
and:
so that:
Then, because is twice continuously differentiable:
and:
Moreover, as all differential operators below are continuous by hypothesis, we have that:
so that:
We now have all elements to conclude calculating the derivative of , with some more easy calculus:
By the Schwarz theorem, as is twice continuously differentiable, we have that . Integrating by parts, we get:
Putting all this together and arranging it, we get:
We can now define:
and
and, recalling that can be freely chosen, impose that is solution of:
with final condition . We also choose so that, finally, we have:
which concludes the proof.∎
Comments
There are no comments yet.