    # Uncovering differential equations from data with hidden variables

Finding a set of differential equations to model dynamical systems is a difficult task present in many branches of science and engineering. We propose a method to learn systems of differential equations directly from data. Our method is based on solving a tailor-made ℓ_1-regularised least-squares problem and can deal with latent variables by adding higher-order derivatives to account for the lack of information. Extensive numerical studies show that our method can recover useful representations of the dynamical system that generated the data even when some variables are not observed. Moreover, being based on solving a convex optimisation problem, our method is much faster than competing approaches based on solving combinatorial problems. Finally, we apply our methodology to a real data-set of temperature time series.

## Code Repositories

### L-ODEfind



## 1 Introduction

Many branches of science are based on the study of dynamical systems. Examples include meteorology, biology, and physics. The usual way to model deterministic dynamical systems is by using (partial) differential equations. Typically, differential equation models for a given dynamical system are derived using a priori insights into the problem at hand; the model is then validated using empirical observations. In an era in which massive data-sets pertaining to different fields of science are widely available, an interesting question is whether a useful differential equations model can be learned directly from data, without any major modeling effort on the part of the researcher.

Our goal in this paper is to develop a general methodology for building such differential equations models in contexts in which not all relevant variables are observed, that is, in cases in which the main variable of interest depends on other variables of which no measurements are available. As a concrete example, consider the following problem. RTE, the electricity transmission system operator of France, uses high-level simulations of hourly temperature series to study the impact different climate scenarios have on electricity consumption, and hence on the French electrical power grid. The underlying simulations are based on the Navier-Stokes equations and include as variables the wind velocity, density, pressure, etc. The resulting dynamical system is known to be chaotic; see ChaosNavierStokes. Can we learn a system of differential equations that adequately models the dynamics of the temperature time series if the only observable variable is temperature, that is, if pressure, wind velocity, etc., are latent variables?

In the case in which all relevant variables are observed, the problem of recovering the differential equations that generated the observed data has been studied recently. The papers DataDrivenPDE2, DataDrivenPDE and sindy, which motivated our research, developed an approach that consists of linking the dynamical system discovery problem to a statistical regression problem, making it scalable and easy to apply in different contexts. The main idea of their approach is to consider a set of differential operators (possibly in both time and space, if appropriate), discretize them, for example by using finite differences, and then regress the outcome of interest on the discretized differential operators. By solving the regression problem using an ad hoc thresholded least-squares algorithm, they can build sparse, interpretable models that mostly use low-order derivatives. They explored the applicability of their method on simulated data, but only in situations in which all the variables of the simulated models are observed.

To accommodate the possibility of latent variables, we note that, for a large class of dynamical systems, it is possible to reconstruct a trajectory (equivalent to the original one) given only one of the model variables, using its delayed lags (Takens). Based on this result, we propose to augment the methodology developed in DataDrivenPDE2, DataDrivenPDE and sindy by including higher-order time derivatives, to tackle situations in which not all relevant variables are observed. We estimate the coefficients of the dynamical system using the Lasso estimator: an ℓ_1-regularised least-squares regression estimator (lasso). We choose the Lasso due to its simplicity, the abundance of theoretical guarantees on its performance (sparse-book) and the availability of efficient algorithms to solve the convex optimization problem that defines the estimator. After estimating the regression coefficients, we build a forecasting method by integrating the retrieved differential equation. We call this methodology pdefind-latent.

Most closely related to our approach are the proposals of GPoMo and Bongard9943. Their method, GPoMo, addresses the recovery problem via a combinatorial search over a predefined set of polynomial functions of the observable variables. The method proceeds by iteratively choosing families of combinations of terms that minimize the Akaike or the Bayesian information criterion, and finally returns the set of best models. The authors also discuss the ability of their algorithm to find equations that capture the dynamics when only some variables are observed. However, we will show that this approach tends to be slow and does not scale to large problems.

Evaluating the equations found by different methods can be challenging. For example, when the data generating process is chaotic, small variations of the parameters or the initial conditions can produce large discrepancies in the predictions. This implies that any metric based solely on the prediction accuracy can be misleading. We propose to use a metric (the Wasserstein distance, villani_optimal) that compares the trajectories of each method and the real series on the phase diagram (the phase diagram of a dynamical system is a plot of the main variable of the dynamical system, e.g. position, against its first derivative, e.g. velocity).

The rest of this paper is organized as follows. In Section 2 we describe our methodology in detail and provide an overview of the GPoMo method. Section 3 describes our experimental setup and Section 4 presents the results of our experiments. In particular, in Section 4.1, we compare the performance of both methods in recovering differential equations in the case in which all relevant variables are observed. In Section 4.2 we compare them in the harder case in which at least one relevant variable driving the dynamical system is latent. We apply both methods to the RTE data in Section 4.3. Finally, in Section 5 we discuss future work and possible extensions. The Supplementary materials for this article include further results of our numerical experiments, excluded from the main paper due to space constraints.

## 2 Differential equations recovery as a linear regression problem

We consider dynamical systems represented by functions satisfying a set of differential equations of the form

 Df=U(f,Ef), (1)

where D and E are differential operators in, possibly, both spatial and temporal variables, and U is an unknown map. Suppose we have measurements of f on a grid, that is, we observe the values of f at a discrete set of grid points. An example of a dynamical system we will study in this paper is the classical Rössler system (2) (Rosseler). This system was originally designed to have properties similar to, while being simpler than, the Lorenz system (lorenz1963), which was itself a simplified model for atmospheric convection. The system is given by

 df_1(t)/dt = −f_2(t) − f_3(t),
 df_2(t)/dt = f_1(t) + α f_2(t), (2)
 df_3(t)/dt = β + f_3(t)(f_1(t) − γ),

for constants α, β and γ. This system can be written in the form (1) by taking D = d/dt and U the polynomial map given by the right-hand sides of (2). For certain values of the parameters α, β and γ, the system is known to have chaotic solutions.
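As a concrete illustration, the Rössler system (2) can be simulated numerically with a standard ODE solver. This is a minimal sketch; the parameter values below are the classical textbook choices for the Rössler attractor, not necessarily the configuration used in our experiments.

```python
# Sketch: simulate the Rössler system (2) with scipy's solve_ivp.
# The parameters alpha, beta, gamma are illustrative defaults only.
import numpy as np
from scipy.integrate import solve_ivp

def rossler(t, f, alpha=0.2, beta=0.2, gamma=5.7):
    f1, f2, f3 = f
    # right-hand sides of system (2)
    return [-f2 - f3, f1 + alpha * f2, beta + f3 * (f1 - gamma)]

sol = solve_ivp(rossler, (0.0, 100.0), [1.0, 1.0, 1.0],
                t_eval=np.linspace(0.0, 100.0, 10001), rtol=1e-8)
```

The resulting trajectory `sol.y` (shape 3 × 10001) can then be sampled to produce the synthetic measurements fed to the recovery methods.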

Our objective is to find some system of differential equations that can explain the behaviour of the measurements. As discussed in the introduction, we follow the approach outlined by DataDrivenPDE2, DataDrivenPDE and sindy. Their approach works by choosing a large dictionary of functions and regressing discretisations of the time derivatives of f on the dictionary. The dictionary in question can be formed, for example, by collecting polynomial powers of the variables, products of the variables, spatial derivatives of f and trigonometric functions of t. A simple concrete example of such a dictionary, in the case in which f has a single component f_1 depending on (x, y, z, t), is the following:

 A = {x, y, z, t, x², yz, z², tx, sin(t), f_1, f_1², f_1·x, ∂f_1(x,y,z,t)/∂x, ∂f_1(x,y,z,t)/∂y, ∂f_1(x,y,z,t)/∂z}.

Of course, in practice all derivatives are replaced by the corresponding finite differences computed from the measurements on the grid. Having chosen a dictionary, we let A = (A_1, …, A_p) be the vector collecting all p members of the dictionary.

Using the observations of the dynamical system, a regression model can be fitted to find the combination of the elements of the dictionary of functions that adequately explains the behaviour of the time derivatives of f. That is, for each component f_h we look for a vector of regression coefficients c_h = (c_{1,h}, …, c_{p,h}) such that

 ∂f_h(x,y,z,t)/∂t ≈ Σ_{i=1}^p c_{i,h} A_i(x,y,z,t). (3)

To estimate the coefficients, the authors propose an ad hoc linear regression estimator based on iteratively thresholding the least-squares estimator. Through extensive numerical experiments, they show that this methodology is able to learn systems of partial differential equations that adequately model the dynamical system that generated the data. Unfortunately, if some variables are latent, that is, if one is unable to measure at least one of the components of f, the approach described above breaks down. Next, we describe a way of extending this methodology to deal with the case in which some variables are latent.
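To make the regression setup concrete, the following sketch builds a small dictionary from a single observed series sampled on a uniform time grid, with derivatives replaced by finite differences. The function name, the grid and the particular dictionary entries are our own illustrative choices, not the original implementation.

```python
# Sketch: assemble a design matrix of dictionary functions for the
# regression (3), using finite differences for the derivative column.
import numpy as np

def build_dictionary(f1, t):
    """Return dictionary names and the corresponding design matrix."""
    df1 = np.gradient(f1, t)  # finite-difference approximation of df1/dt
    cols = {
        "1": np.ones_like(f1),
        "f1": f1,
        "f1^2": f1 ** 2,
        "sin(t)": np.sin(t),
        "df1/dt": df1,
    }
    names = list(cols)
    X = np.column_stack([cols[n] for n in names])
    return names, X
```

Each column of `X` is one dictionary element evaluated on the grid; the response (a discretised time derivative) is then regressed on these columns.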

### 2.1 Our proposal

It is known that, for a large class of dynamical systems, it is possible to reconstruct a trajectory (equivalent to the original one) given only some of the model variables, using their delayed lags; see Takens for a formalization of this statement. In the same direction, the differential embedding method of differentialEmbedding deals with the problem of reconstruction from a single time series by adding higher-order derivatives. Based on these results, we propose to augment the methodology developed in DataDrivenPDE2, DataDrivenPDE and sindy by including higher-order derivatives.

More precisely, our approach works by expanding the dictionary of functions to include higher-order derivatives of the observable variables. On top of including linear and non-linear functions of the grid variables, polynomial powers and spatial derivatives of the observable components of f, we add higher-order time derivatives of the observable components. For example, if one component f_1 is observed while another is latent, our dictionary might include ∂f_1/∂t and ∂²f_1/∂t².

A regression model can now be built to find a combination of the elements of the dictionary of functions that adequately explains the behaviour of some time derivative of higher order than those included in the dictionary. More precisely, for an appropriately chosen order n and for all h such that f_h is observable, we assume a model for the measurements of the form

 ∂ⁿf_h(x_i,y_j,z_k,t_l)/∂tⁿ = Σ_{g=1}^p c⁰_{h,g} A_g(x_i,y_j,z_k,t_l) + ε_{ijkl}, (4)

where the ε_{ijkl} are zero-mean i.i.d. random errors. The task then is to learn, from possibly noisy observations, vectors c_h that approximate the true coefficient vectors c⁰_h, for all h such that f_h is observable.
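A sketch of the derivative augmentation: starting from a single observed series, stack the series and its successive finite-difference time derivatives as dictionary columns, so that the n-th derivative can be regressed on them as in (4). The function name and interface are illustrative, not the paper's code.

```python
# Sketch: augment the dictionary of a single observed variable with its
# higher-order time derivatives, computed by repeated finite differencing.
import numpy as np

def derivative_features(f, t, order):
    """Columns: f, df/dt, ..., d^order f / dt^order (finite differences)."""
    feats = [np.asarray(f, dtype=float)]
    for _ in range(order):
        feats.append(np.gradient(feats[-1], t))
    return np.column_stack(feats)
```

Repeated application of `np.gradient` amplifies measurement noise, which is consistent with the degraded performance under noise discussed in Section 5.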

The regression model has to be learned using the available data. In principle, this regression problem could be solved using least squares. However, the ordinary least-squares estimator is ill-defined when the number p of predictor variables is larger than the number of observations. Since the analyst is usually uncertain about the number of dictionary elements needed to adequately model the system of interest, the method used to solve the regression problem should allow for a large number of predictor variables (possibly larger than the number of observations) and automatically produce sparse models, that is, accurate models that use only a relatively small fraction of the predictor variables. The Lasso regression technique is perfectly suited for this task. The Lasso is an ℓ_1-regularised least-squares regression estimator, defined as follows. For each h such that f_h is observable we let

 c*_h = argmin_{c ∈ ℝ^p} Σ_{i,j,k,l} ( ∂ⁿf_h(x_i,y_j,z_k,t_l)/∂tⁿ − Σ_{g=1}^p c_g A_g(x_i,y_j,z_k,t_l) )² + λ ‖c‖_1, (5)

where λ ≥ 0 is a tuning constant measuring the amount of regularisation. It can be shown (sparse-book) that the ℓ_1 penalty encourages sparse solutions and that the larger λ is, the sparser the solution vector c*_h will be. In practice, λ is usually chosen by cross-validation. Note that any other sparse regression technique could have been used to estimate the coefficients; we prefer the Lasso due to its simplicity and the wide availability of efficient algorithms to compute it. See for example CD.
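A minimal sketch of the ℓ_1-regularised fit (5) using scikit-learn's `Lasso`. The toy signal, the tiny dictionary and the penalty value are illustrative choices, not the paper's actual configuration; in practice λ (scikit-learn's `alpha`) would be chosen by cross-validation, e.g. with `LassoCV`.

```python
# Sketch: regress a finite-difference second derivative on a small
# dictionary with the Lasso, as in (5). Toy data only.
import numpy as np
from sklearn.linear_model import Lasso

t = np.linspace(0.0, 10.0, 2000)
f = np.sin(t)                                   # toy observed variable
d2f = np.gradient(np.gradient(f, t), t)         # finite-difference d^2 f/dt^2
X = np.column_stack([np.ones_like(t), f, f**2, f**3])   # small dictionary
model = Lasso(alpha=1e-3, fit_intercept=False).fit(X, d2f)
# Since d^2(sin t)/dt^2 = -sin t, the coefficient on f should be close to -1
# and the other coefficients should be shrunk towards zero.
```

The sparsity of `model.coef_` illustrates how the ℓ_1 penalty selects the few dictionary elements that explain the response.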

The main assumption behind this methodology is that the dynamical system that generated the data at hand can, in reality, be at least approximated using a sparse model, that is, that the vectors c⁰_h in (4) are either exactly or approximately sparse. This hypothesis is known to hold for several dynamical systems of interest in different fields of science; see sindy. If the hypothesis holds, we can expect the Lasso estimates to select only a few elements of the dictionary, namely those that do a good job of explaining variations in the response variable (sparse-book).

## 3 Numerical experiments

To evaluate the performance of the proposed method and compare it with the GPoMo algorithm, we used the dynamical systems provided by the GPoM R package (Nosé-Hoover 1986, Genesio-Tesi 1992, Sprott F, H, K, O, P, G, M, Q and S, Lorenz 1963 and 1984, Burke and Shaw 1981, Rössler 1976, Chlouverakis-Sprott 2004, Cord 2012). For all these dynamical systems, we use the exact same configuration (coefficients, time duration, etc.) used in the GPoM package. All of them have three variables f_1, f_2, f_3. Of particular interest is the Rössler (1976) system (2), a chaotic system extensively used by GPoMo to show how their method succeeds in finding adequate models.

Evaluating the equations found by the methods is difficult, due to the chaotic nature of the systems considered and, in the case in which only some variables are observed, the lack of a ground-truth equation to compare with. This difficulty makes any metric that compares the prediction accuracy of a method against the true series (like the mean squared error) unreliable. Phase diagrams are a powerful way of summarising the behavior of a dynamical system, capturing information about possible conserved physical quantities (phase_diagram_conservation), asymptotic behavior, attractors, etc. Hence, visual inspection of the estimated phase diagrams can reveal which models are reasonable, since the trajectories of two systems that behave similarly will produce series that evolve in the same regions of the phase diagram.

These insights led us to propose the Wasserstein distance, also called the earth-mover distance (villani_optimal), to compare the phase diagram obtained by each method to the true phase diagram. Given a trajectory over the phase space, we can collect all its points and think of them as drawn from a distribution characteristic of the system that generated those data points. Let P be that distribution and Q the one obtained through proper integration of the equation found by a given method. The earth-mover distance between P and Q is the infimum, over all couplings γ of P and Q, of E_{(X,Y)∼γ}[‖X − Y‖]. We evaluate the system recovered by a given method by computing this distance, where smaller values are preferable.
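For finitely many phase-space points with uniform weights, the earth-mover distance reduces to a small linear program. The sketch below is illustrative only (in practice a dedicated optimal-transport library would be used); it takes two point clouds sampled from the phase diagrams and returns the distance between their empirical distributions.

```python
# Sketch: exact earth-mover distance between two uniformly weighted
# point clouds, solved as a linear program over transport plans.
import numpy as np
from scipy.optimize import linprog

def emd(P, Q):
    n, m = len(P), len(Q)
    # pairwise Euclidean costs between phase-space points, flattened
    C = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2).ravel()
    # marginal constraints: rows of the plan sum to 1/n, columns to 1/m
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])
    res = linprog(C, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun
```

The linear program scales poorly in the number of points, so in our experiments the trajectories would be subsampled before computing the distance.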

To study the impact noisy measurements have on pdefind-latent and GPoMo, we introduced various levels of random noise to the dynamical systems. More precisely, for each time point and each variable in the system, we add to the true data a zero-mean normal random variable with standard deviation equal to a·σ̂, where σ̂ is the sample standard deviation of the variable in question and a ≥ 0, so that larger values of a imply more noise being added to the system.
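The noise scheme above can be sketched in a few lines; the function name and the fixed seed are our own illustrative choices.

```python
# Sketch: add zero-mean Gaussian noise whose standard deviation is a
# fraction `a` of the variable's sample standard deviation.
import numpy as np

def add_noise(series, a, rng=None):
    rng = rng or np.random.default_rng(0)
    return series + rng.normal(0.0, a * series.std(), size=series.shape)
```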

## 4 Results

### 4.1 Full information

When all the information is available and no noise is added, both methods are capable of recovering the dynamical system. This can be seen, for example, in Figure 1 (a), where, for the Rössler system, the predictions of each method follow the true dynamics without much variation over the plotted time span. Moreover, when looking at the phase diagram, it is clear that both can reproduce the particular dynamics of each variable. After applying GPoMo and pdefind-latent to all the dynamical systems provided by the GPoM package (under different noise conditions), we calculated the EMD over the resulting phase diagrams and plotted the corresponding distributions to compare the performance of both methods. In the violin plots of Figure 1 we can see that both methods perform similarly with and without noise, with pdefind-latent’s metric distribution nearer to zero, meaning its phase portraits are more similar to the true ones than those obtained with GPoMo. When noise is added, the performance of both methods progressively deteriorates as the noise level increases.

### 4.2 Latent variables

In this section, we explore the feasibility of using higher-order time derivatives to tackle the lack of information, using the same dynamical systems but simulating the situation in which just one of the three variables is observed. For both methods, we fixed the maximum polynomial order of the dictionary and included derivatives up to a fixed maximum order.

When no noise is added, both methods are in many cases capable of recovering an equivalent dynamical system. This can be seen in Figures 2 (a) and (b), where, again for the Rössler system, the predictions and phase diagrams of both methods reproduce the same behavior as the real series.

A full comparison of the methods can be seen in Figure 2, where for each noise level we plot the distribution of EMD distances over the phase diagrams. The distributions were generated by applying both methods to each of the three variables of all the available dynamical systems. The pdefind-latent method produces models that are better at capturing the geometry of the true phase-diagram trajectories when no noise is added, as the metric obtained is slightly lower than that of GPoMo. When noise is added, both methods lose their power.

We close this section by highlighting the following important fact: pdefind-latent is orders of magnitude faster than GPoMo. For the data-sets analyzed in this paper, with or without latent variables, it takes less than 10 seconds to compute pdefind-latent, whereas it takes minutes or even hours to compute GPoMo using the GPoM R package (see Figure 3). This can be explained by the fact that pdefind-latent solves a convex optimization problem, whereas GPoMo performs a complicated combinatorial search.

Further results of our numerical experiments can be found in the supplementary materials to this article.

Figure 3: Comparison of the time taken by both methods for scenarios grouped according to whether all information is available (full) or only one variable is available (missing). The pdefind-latent method is orders of magnitude faster than GPoMo, usually taking less than 10 seconds, compared with more than a minute for GPoMo, and even hours in some cases.

### 4.3 RTE data

Finally, we address the problem of recovering a differential equation from the temperature series provided by RTE. These series were generated by simulations of fluid dynamics over a European spatial grid with thousands of geographical points, running for years and taking into account many variables such as wind velocity, pressure and air density, together with temperature. However, we had access only to the hourly temperature time series over the grid points covering France.

We performed many experiments using the available data in different settings: we fit models using only the time series of a single geographical point, of two geographical points far from each other, or of two neighbouring points. At the same time, we created tailor-made dictionaries of functions to feed the algorithm, using different combinations of derivatives, polynomials, and regressors. The two regressors added were trigonometric functions with periods of one day and one year, fitted using the temperature series. In that case, the predictions agree reasonably well with the real series when looking at both the phase diagram and the future predictions, leading to a good approximate model.

As this learning task turns out to be difficult, most of the models were not compatible with the real data when looking at the predictions in the phase diagram, and many of them eventually diverged. A remarkable exception is the case of regressors with two separate geographical points, derivatives up to order three and polynomials up to a fixed order (Figure 4).

## 5 Discussion

When dealing with information loss in a known and controlled system, we demonstrated that the methodology proposed by DataDrivenPDE2, DataDrivenPDE and sindy is a powerful tool to have in hand when looking to learn systems of differential equations from data measurements. When all information is available, pdefind-latent results are competitive with those obtained using the GPoMo algorithm when there is no noise added. When only noisy measurements are available, the performance of both algorithms deteriorates. When not all variables of the system are observed, both methods can recover useful dynamical systems when no noise is added to the observations. When using the temperature data from RTE, we found an equation whose predictions yielded a behaviour compatible with the real series when looking at both the phase diagram and the future predictions.

Importantly, pdefind-latent is orders of magnitude faster than GPoMo (see Figure 3): for the data-sets analysed in this paper it takes less than 10 seconds to compute, whereas it takes minutes or even hours to compute GPoMo using the GPoM R package.

Some drawbacks of the method can be explored in the future. While the regularisation parameter λ was chosen by cross-validation, it remains an open problem to automate the choice of the maximal order of derivation and the size of the dictionary. There is no doubt that these parameters influence the performance of the method.

A direction for improving the results in the presence of noise could be a better approximation of the derivative: in this paper, we simply replaced the derivative by the normalised increment of the function at the finest scale. Doing this in the presence of noise increases the variability, since the noise is also normalised. This may partly explain the poor performance of the method in the presence of noise, and there might be room for improvement in this direction.

Finally, one could also consider replacing the Fourier and polynomial bases by more localised bases, such as wavelet bases, which have proven reliable for solving PDEs.

## 6 Acknowledgements

Funding: This work was supported by Aristas S. R. L. and RTE (Réseau de Transport d’Électricité).