We consider the problem of learning a dynamical system from multiple, vector-valued time series. Suppose we have time series or trajectories, each observed at discrete times . We assume this temporal grid is sufficiently fine to capture the dynamics of the system of interest. Let denote the observation for trajectory at time . Here is arbitrary. Given this data, we compute estimates of (i) the states and (ii) a vector field that determines a dynamical system model for the time-evolution of the states.
Suppose the true trajectory satisfies the nonlinear system of differential equations given by
We model the observation as the true state plus noise:
We seek that approximates , and that approximates . While there has been much recent interest in learning differential equations from data, the bulk of the literature focuses on computing from relatively clean data;  notes that when Gaussian noise of strength greater than is present in observed states, estimation becomes unstable and inaccurate. We demonstrate below that leading methods encounter difficulty even at relatively low noise magnitudes [2, 3]. Our method yields accurate estimates even when the data is corrupted by noise.
Our prior work 
established that the block coordinate descent proximal filtering method yields more accurate and robust parameter estimates than either iPDA or the extended Kalman filter. A key element of our filtering approach is its proximal step. In contrast, other techniques such iPDA and soft adherence use a penalty term that anchors the filtered states (at all iterations) to the data [6, 7]. Our work shares goals with , who are concerned with recovering the functional form of sufficiently ergodic, chaotic dynamical systems from highly corrupted measurements. While our parameterizations of differ considerably, both the present paper and  develop and apply alternating minimization methods.
The present paper focuses on extending  to the setting where the functional form of is unknown and must be estimated. In our prior work, we assumed this vector field was known up to a finite-dimensional set of parameters. Hence our prior work contains no mention of neural networks, nor does it contain comparisons against the methods of  and .
The central finding of this paper is as follows: as the magnitude of noise increases, computing accurate and is still possible if one alternates model estimation steps with filtering steps, and if one trains using a larger number of trajectories. Using tests on simulated data, we show that leading equation discovery methods can be retrofitted with our filtering approach, greatly enhancing the ability of these methods to cope with noisy data. These methods differ in the way they model the estimated vector field : sparse linear combinations of prescribed functions , neural shape functions, or dense, feedforward neural networks . We conduct these tests for both the FitzHugh–Nagumo system and a nonlinear oscillator chain, showing that we can recover both the filtered states and the underlying vector field with high accuracy.
We also apply our method to power grid data recorded by micro-phasor measurement units (PMUs). The units record synchronized measurements of voltage and phase angle at the power distribution level . Using data taken on two different days for which the system’s status has been labeled, we estimate two vector fields, one for normal operation and one for anomalous operation. To estimate these vector fields, we found it essential to both filter the data and increase the number of trajectories.
Let model the true vector field . For now, we are not concerned with the actual details of this model except to say that the model parameters are given by . Consider an explicit Euler time discretization of (1) with time step . For trajectory , we have
We use Euler purely for simplicity here; in practice the method can accommodate higher-order, explicit time integration schemes. Let denote the collection of all filtered states over all trajectories. Then define the objective function
Assume that if the true states were known, that could be trained by minimizing over . We thus refer to minimization over as model training.
Let denote the collection of all observations over all trajectories. We propose the following alternating procedure to learn and . The states are initialized to be the data: —superscripts denote the iteration number:
We terminate when the change in is sufficiently small.
Ii-B Three Parameterizations of
We first take to be a linear combination of prescribed functions. Given a input we let denote a dictionary of functions. For instance, for we can take , i.e., all polynomials in the components of , up to degree two. Using this dictionary, we write
where is an matrix of coefficients. Throughout this paper, to extend to a function , we apply to each slice of . To solve for , we apply an iteratively thresholded least-squares regression procedure that promotes a sparse solution —for further details, consult [2, 11].
Ii-B2 Neural Shape Functions
Instead of prescribing and learning only , here we learn both and . Let be the desired number of one-dimensional shape functions; we model these using a dense neural network with one-dimensional input, hidden layers, and a final layer with outputs. Let denote the output corresponding to scalar input ; then is the -th one-dimensional shape function.
|pred. err on|
|pred. err on|
To create multi-dimensional shape functions, we take tensor products of one-dimensional shape functions. Let a multi-index denote a collection of integers such that . Define . Then for a given multi-index , define the multi-dimensional shape function
Let be a collection of multi-indices and its cardinality. Suppose we compute for each ; in this way we obtain a vector . Here plays the same role as in SINDy above; is analogous to , the number of shape functions. Now let denote an matrix of weights. The neural shape function model is . Note that there are possible multi-dimensional shape functions. In practice, we choose such that , thus constraining to be a small linear combination of multi-dimensional shape functions. The set of parameters consists of together with all weights and biases in the network.
Suppose we work in a compact subset . Given direct observations of a smooth vector field in , it is possible to achieve arbitrarily small error
by choosing the neural shape function hyperparameters sufficiently large. To understand why, note that universal approximation theory guarantees that (even when) the space of our one-dimensional neural shape functions is dense in the space of continuous functions. Hence tensor products of these shape functions are dense in the space of tensor products of continuous functions. By Stone-Weierstrass, we see that any smooth vector field can be approximated by linear combinations of tensor products of continuous functions (e.g., univariate polynomials). Putting the previous two facts together, we conclude that linear combinations of our multi-dimensional neural shape functions can be used to approximate smooth vector fields; the accuracy of this approximation can be controlled by the number of units in each network together with .
Ii-B3 Dense Neural Network (DNN)
We also consider a dense, feedforward neural network model with inputs and outputs, the model used by .
Here we present results from simulated data experiments in which the ground truth vector field is known.
To generate simulated data, we set the system parameters , , and initial condition . We then numerically solve the system on the interval with , recording data at a spacing of . To these clean trajectories we add mean-zero Gaussian noise at strengths of , , and . We train using data on ; in Figure 1, we plot prediction errors for both the training interval and an extrapolatory test interval .
For this initial comparison, let us define the concept of prediction error. Suppose we are at iteration and that we have trained the model for epochs. Define the predicted states to be the numerical solution of (1) that results from using the filtered initial conditions contained in (for the -th trajectory, ) together with the current best vector field . We call the distance between and the prediction error—we claim this is a good metric to detect overfitting. If we find that the current prediction error increases as we increase the number of epochs, we halt the optimization step (5a), set to the weights of the network with minimum current prediction error, and proceed to the filtering step (5b).
In Figure 3, we plot the predicted and true states for the neural shape function model of learned from data with Gaussian noise. We also plot the learned one-dimensional neural shape functions . An advantage of this method is the ability to graphically interpret these shape functions, regardless of the dimension of the vector field being modeled.
Among the methods studied in Figure 1, only our method and that of  attempt to deal with noise during training. Instead of using a proximal term of the form as in (5b),  use a penalty term of the form ; filtered states are always anchored to the data. While our method outperforms this penalty-based method, both are more robust to noise than approaches that do not incorporate filtering at all. Clearly SINDy  has issues with even small amounts of noise, while the dense neural network approach of  encounters problems starting at noise.
|shape :||shape :||shape :|
|shape :||shape :||shape :|
We have carried out tests similar to that of Figure 1 for nonlinear, chaotic systems such as the Lorenz, Rössler, and double pendulum systems. For such systems, if we perturb the vector field or initial conditions slightly, sensitive dependence implies that over time, trajectories will diverge exponentially. For such systems, the prediction error on can seem large even if we estimate the vector field and filtered states with a reasonable degree of accuracy. These tests motivate us to measure the vector field error , which provides insight into how well the estimated vector field captures the global behavior of .
We now retrofit both  (the dense neural network model for ) and  (SINDy) with our filtering procedure. To be clear, both retrofitted methods follow (5); the only difference is in how is modeled.
As all the results shown in Figure 1 are for one trajectory, we now move to the setting where we have multiple trajectories. We numerically integrate the FitzHugh–Nagumo system with initial conditions taken from an equispaced grid on the square , producing a set of trajectories. We then apply each retrofitted procedure to either , (randomly chosen), or all trajectories from the collection. In each case, we obtain an estimated vector field . We compute the vector field error using elementary quadrature on the square . Here is the right-hand side of the FitzHugh–Nagumo system; see the ground truth vector field plotted in Figure 2.
The results, shown in Figure 9 and Table I, clearly show the benefit of combining any method with filtering and increasing the number of trajectories used for training, even if those trajectories are all corrupted with noise. The improvement is particularly striking for the SINDy method; in fact, after incorporating SINDy into (5), it outperforms the other methods.
Iii-B Nonlinear Oscillator Network
Consider a ring of masses connected by identical springs, each with potential energy and force . Here denotes displacement from equilibrium. Then the equations of motion for the mass-spring system are
for , with the understanding that and . For our tests, we choose the double-well potential , so that .
We focus on the system with masses; when we write the system in first-order form, the system has degrees of freedom corresponding to and for each . To generate data for our tests, we simulate (8) using the 8th-order Dormand-Prince integrator in scipy.integrate, with absolute and relative tolerances tuned to . We generate trajectories, each with an initial condition sampled uniformly from the hypercube . Each trajectory is saved at equispaced time steps from to a final time of , i.e., . To these clean trajectories , we add mean-zero Gaussian noise with strengths of , , and as before, resulting in noisy data .
We focus our attention on a retrofitted SINDy method with modeled using a dictionary of polynomials. In this method, we use a variant of (5) theoretically analyzed in our previous work . In this method, we incorporate the SINDy model (6), which has the benefit of being linear in the parameters . This linearity implies that the objective function (4) is convex in . Suppose we split the decision variables into two halves, the first half consisting of time steps , and the second half consisting of time steps . Then . This leads us to the following block coordinate descent algorithm:
Splitting and formulating the algorithm in this way gives us block convexity. That is, when we hold two of the three variables in fixed and minimize over the remaining variable, we obtain in each case a convex subproblem . Note that this property would not hold if we were to instead use either of the neural network approaches to model , as in this case (9a) would be non-convex.
When we train, we restrict to consist of only the first steps of each trajectory. Starting with noisy data , we run algorithm (9) with . Note that multiplies the squared Frobenius error between two large matrices; hence this value of is still consequential. We terminate if the norm difference between and is less than , or if we have already completed iterations.
To solve (9b) and (9c), we use steps of gradient descent with a learning rate of . To solve (9a), we use the iteratively thresholded least squares procedure from [2, 11] with a threshold of . For the dictionary defined in (6), we include all polynomials up to degree in variables, not including an intercept or constant term, resulting in columns. We have uploaded our code to https://github.com/hbhat4000/filtersindy/
An interesting property of algorithm (9) is the way in which behaves as a function of . In Figure 4, we have plotted this quantity as a function of iteration . We have recorded these magnitudes of the change in filtered states while running algorithm (9) on the full -trajectory training set, contaminated with either , , or noise. In all cases, we see monotonic convergence. Note that the vertical scale on this plot is logarithmic, further indicating the rapid decrease in as a function of . As a practical matter, this means that we do not need to run (9) for many iterations, and that the results are robust to the number of iterations.
After running algorithm (9) on either , , or trajectories with either , , or noise, we quantify errors in three ways—see Table II. The vector field error measures the mean absolute error between ground truth and estimated vector fields. The quantity is the mean absolute error between true and filtered states. After training on the first steps of each trajectory, we compute predicted trajectories by numerically integrating the estimated vector field forward in time, starting from the filtered initial conditions. The prediction error is the mean absolute error between these predicted trajectories and the true states . When we numerically integrate the estimated vector field, we use a standard fourth-order explicit Runge-Kutta method.
The first thing to notice from the results in Table II is that they confirm that combining SINDy with our alternating minimization filtering approach allows SINDy to estimate the mass-spring system’s vector field accurately. Even with 10% noise in the original data, the vector field error when we train on trajectories is . To get a sense for what this number means, note that the ground truth vector field is highly sparse—across the entire ground truth coefficient matrix , out of possible entries, only are nonzero. The retrofitted SINDy algorithm gets this sparsity pattern correct; if we round the coefficients obtained by SINDy, we obtain the ground truth vector field.
Let us now interpret the filtering errors (or second column) of Table II. At the 10% noise level, increasing the number of trajectories appears to decrease the filtering error from (with 10 trajectories) only slightly to . To visualize this, we present Figure 5. From top to bottom, we present the results of training with (top), (middle), or (bottom) trajectories worth of data, all contaminated with Gaussian noise. For the purposes of illustration, we have singled out one trajectory that is a member of all three training sets. Each plot contains black curves, corresponding to for , together with the filtered (or hatted) versions, for a total of curves per plot. When there are only trajectories (top panel), one can discern differences between red and black curves; as we go to trajectories (bottom panel), these errors mostly disappear.
Overall, Figure 5 supports the notion that it is indeed possible to estimate filtered states even when the equations of motion (i.e., the vector field) for the underlying system are themselves unknown and must be simultaneously estimated. However, there is a difference between filtered trajectories and predicted trajectories of the dynamical system. One will notice from Figure 5 that the time axis ends at ; we have used steps of training data with , so there is no noisy training data to filter beyond .
To obtain solutions of the estimated dynamical system beyond , we must numerically integrate, as in the predicted trajectories whose errors are quantified in the third column of Table II. When we compute the predicted trajectories in this table, we integrate all the way up to and then compare against the ground truth (clean) states of the system . We view predictions on as a true test set, i.e., a test of the estimated vector field’s ability to extrapolate beyond the training set.
We again single out the same trajectory common to our training sets with , , or total trajectories. Starting with the estimated/filtered initial conditions, we numerically integrate the estimated vector field forward in time using a standard fourth-order explicit Runge-Kutta method, from to . In Figure 6, we compare the results of these numerical integrations (in green) against the ground truth trajectories (in black). When the number of trajectories is small (top), the predicted dynamics are qualitatively wrong. As we train on more trajectories (middle, bottom), the dynamics begin to qualitatively match the true mass-spring system’s dynamics. Note that we obtain much better quantitative accuracy for , the training interval.
For the bottom panel (trained on trajectories), by the time we reach , the predicted state has drifted noticeably away from the ground truth. We can improve upon this situation by resetting the state of the system, at , to equal our estimate of the true state at that time and then continuing the numerical integration until . In Figure 7, we see that this procedure leads to improved quantitative agreement between predicted and ground truth trajectories on the test interval .
Iii-C MicroPMU Data
We apply the neural shape function plus filtering method to PMU data taken from a Mountain View substation near Riverside, CA. We use one hour’s worth of data from both Aug. 9 and Aug. 1, 2017. We use measurements aggregated at a spacing of . In prior work, these dates have been identified as corresponding to normal (Aug. 9) and anomalous (Aug. 1) system operation. Our idea is to learn two vector fields, one for normal and one for anomalous behavior.
To begin, we prefiltered the data using a wavelet low-pass filter. This was to remove high-frequency noise prevalent in the raw data. Next, we focused our modeling efforts on two phase angle variables, and . This means that we did not use components of the full -dimensional signal. To implement the multiple trajectory idea, we reshaped the single hour’s trajectory from its original length of to trajectories each consisting of points. To implement (5a) we use epochs of the Adam optimizer with a learning rate of ; to implement (5b), we use L-BFGS-B from scipy.optimize with default tolerances and . The network itself consists of shape functions with a depth of and units per layer. The results are robust to increasing , , , or . However, note that training on one prefiltered trajectory of length , without alternating filtering, does not yield usable models.
In Figure 8
, we plot the phase portraits for the learned systems. Interestingly, the main difference between the learned vector fields for the normal (left) and anomalous (right) settings has to do with stability. In short, we see that normal (respectively, anomalous) system operation corresponds to stable (respectively, unstable) oscillations. We have confirmed this by numerically finding the fixed points of the vector fields and checking the eigenvalues of the Jacobians at these fixed points.
We find that we can compensate for noisy training data by increasing its volume (the number of trajectories) and by applying simultaneous filtering. Filtering, in the form of the proximal step (5b), reestimates the states based on the best model at the current iteration, . When is heavily contaminated by noise, several iterations of (5b) allow to step away from as needed. While we have focused on the FitzHugh–Nagumo and nonlinear mass-spring systems in the paper, we are currently running additional tests for higher-dimensional physical systems. Preliminary results confirm the same trends we have observed for the FitzHugh–Nagumo system.
SINDy uses polynomial shape functions and the FitzHugh–Nagumo vector field consists of polynomials. Yet unless we include filtering and multiple trajectories, SINDy performs poorly on noisy data. For non-polynomial vector fields, the neural network approaches allow for greater model flexibility. We hypothesize that on non-polynomial systems, neural network-based approaches will prove superior. With the neural shape function approach, one can plot and visualize the one-dimensional shape functions , analogous to visualizing the components of in SINDy. However, due to its special form and construction, the neural shape function model is more difficult to train than a standard dense neural network.
In future work, we seek to replace these dense neural networks with sparsely connected neural networks that have optimal approximation properties . We will also combine these methods with dimensionality reduction techniques to obtain reduced-order models from the full -dimensional PMU time series. Ultimately, our goal is to uncover global phenomena, such as instabilities, attractors, and periodic orbits, that go beyond single trajectory forecasts.
|Filter SINDy||Neural Shape Functions||Filter DNN|
H. Schaeffer, “Learning partial differential equations via data discovery and sparse optimization,”Proceedings of the Royal Society A, vol. 473, no. 2197, pp. 20 160 446, 20, 2017.
-  S. L. Brunton, J. L. Proctor, and J. N. Kutz, “Discovering governing equations from data by sparse identification of nonlinear dynamical systems,” Proceedings of the National Academy of Sciences of the United States of America, vol. 113, no. 15, pp. 3932–3937, 2016.
-  M. Raissi, P. Perdikaris, and G. E. Karniadakis, “Physics informed deep learning (part i): Data-driven solutions of nonlinear partial differential equations,” arXiv:1711.10561, 2017.
R. Raziperchikolaei and H. S. Bhat, “A block coordinate descent proximal
method for simultaneous filtering and parameter estimation,” in
Proceedings of the 36th International Conference on Machine Learning
, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. Long Beach, California, USA: PMLR, 09–15 Jun 2019, pp. 5380–5388. [Online]. Available:http://proceedings.mlr.press/v97/raziperchikolaei19a.html
-  N. Parikh and S. Boyd, “Proximal algorithms,” Found. Trends Optim., 2014. [Online]. Available: http://dx.doi.org/10.1561/2400000003
-  J. Ramsay and G. Hooker, Dynamic Data Analysis: Modeling Data with Differential Equations, ser. Springer Series in Statistics. Springer New York, 2017.
-  S. H. Rudy, S. L. Brunton, and J. N. Kutz, “Smoothing and parameter estimation by soft-adherence to governing equations,” Journal of Computational Physics, vol. 398, p. 108860, 2019.
-  G. Tran and R. Ward, “Exact Recovery of Chaotic Systems from Highly Corrupted Data,” Multiscale Modeling & Simulation, vol. 15, no. 3, pp. 1108–1129, Jan. 2017.
S. H. Rudy, J. N. Kutz, and S. L. Brunton, “Deep learning of dynamics and signal-noise decomposition with time-stepping constraints,”Journal of Computational Physics, vol. 396, pp. 483–506, 2019.
-  J. Cadena, P. Ray, and E. Stewart, “Fingerprint discovery for transformer health prognostics from micro-phasor measurements,” in ICML 2019, Time Series Workshop, Long Beach, CA, 2019.
-  L. Zhang and H. Schaeffer, “On the Convergence of the SINDy algorithm,” Multiscale Modeling & Simulation, vol. 17, no. 3, pp. 948–972, 2019.
H. Bölcskei, P. Grohs, G. Kutyniok, and P. Petersen, “Optimal
approximation with sparsely connected deep neural networks,”
SIAM Journal on Mathematics of Data Science, vol. 1, no. 1, pp. 8–45, 2019.