Weak SINDy: A Data-Driven Galerkin Method for System Identification

We present a weak formulation and discretization of the system discovery problem from noisy measurement data. This method of learning differential equations from data fits into a new class of algorithms that replace pointwise derivative approximations with linear transformations and subsequent variance reduction techniques and improves on the standard SINDy algorithm by orders of magnitude. We first show that in the noise-free regime, this so-called Weak SINDy framework is capable of recovering the dynamic coefficients to very high accuracy, with the number of significant digits equal to the tolerance of the data simulation scheme. Next we show that the weak form naturally accounts for measurement noise and recovers approximately twice the significant digits of the standard SINDy algorithm while significantly reducing the size of linear systems in the algorithm. In doing so, we combine the ease of implementation of the SINDy algorithm with the natural noise-reduction of integration to arrive at a more robust and user-friendly method of sparse recovery that correctly identifies systems in both small-noise and large-noise regimes.



There are no comments yet.


page 14

page 15

page 20


Weak SINDy: Galerkin-Based Data-Driven Model Selection

We present a weak formulation and discretization of the system discovery...

Weak SINDy For Partial Differential Equations

We extend the WSINDy (Weak SINDy) method of sparse recovery introduced p...

A toolkit for data-driven discovery of governing equations in high-noise regimes

We consider the data-driven discovery of governing equations from time-s...

Online Weak-form Sparse Identification of Partial Differential Equations

This paper presents an online algorithm for identification of partial di...

A symplectic discontinuous Galerkin full discretization for stochastic Maxwell equations

This paper proposes a fully discrete method called the symplectic dG ful...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Problem Statement

Consider a dynamical system in dimensions of the form


and measurement data given at timepoints by

The matrix represents i.i.d. measurement noise. The focus of this article is the reconstruction of the dynamics (1.1) from the measurements y.

The SINDy algorithm (Sparse Identification of Non-linear Dynamics [1]) has been shown to be successful in solving this problem for sparsely represented nonlinear dynamics when noise is small and dynamic scales do not vary across multiple orders of magnitude. This framework assumes that the function in (1.1) is given component-wise by


for some known family of functions and a sparse weight matrix . Here and throughout we define . The problem is then transformed into solving for by building a data matrix given by

so that the candidate functions are directly evaluated at the noisy data. Solving (1.1) for then reduces to solving


for a sparse weight matrix w, where is the numerical time derivative of y. Sequentially-thresholded least squares is then used to arrive at a sparse solution.

The formulation of system discovery problems in terms of a candidate basis of nonlinear functions (1.2) and subsequent discretization (1.3) was first introduced in [7]

in the context of catastrophe prediction with compressed sensing techniques used to enforce sparsity. Since then there has been an explosion of interest in the problem of identifying nonlinear dynamical systems from data, with some of the primary techniques being Gaussian process regression, deep neural networks (


), Bayesian inference (

[9], [10]) and a variety of methods from numerical analysis ([2], [3]

). These techniques have been successfully applied to discovery of both ordrinary and partial differential equations. The variety of approaches qualitatitively differ in the interpretability of the resulting data-driven dynamical system, the practicality of the algorithm, and the robustness due to noise, scale separation, and so on. For instance, a neural-network based data-driven dynamical system does not easily lend itself to physical interpretation. As well, certain sparsification techniques are not practical to the general scientific community, where the problem of system identification from data is ubiquitous. The SINDy algorithm allows for direct interpretations of the dynamics from identified differential equations and uses sequentially thresholded least-squares to enforce sparsity, which is not nearly as robust as other approaches but is easy to implement and has been proven to converge to sparse local minimizers in

[8]. For these reasons, we use sequential thresholding in this article to demonstrate the viability of the method, and note that improvement using a more robust sparsifying strategy is possible.

The aim of the present article is to provide rigorous justification for using the weak formulation of the dynamics in place of local point-wise derative approximations, as well as a robust algorithm for doing so. As such, we restrict numerical experiments to autonomous ordinary differential equations for their immenability to analysis. In future works, more robust sparisification measures will be explored. As well, adaptation to PDEs and non-autonomous systems is forthcoming. We note that the use of integral equations for system identification was introduced in

[6], where compressed sensing techniques were used to enforce sparsity, and that this technique can be seen as a special case of the method introduced here. In section 2 we introduce the algorithm with analysis of the resulting error structure and in section 3 we provide numerical experimentation with a range of nonlinear systems.

2. Weak SINDy

In this article, we approach the problem (1.3) from a different perspective, by utilizing the weak form of the differential equation. Recall that for any smooth test function (absolutely continuous is enough) and interval , equation (1.1) admits the weak formulation


With , we arrive at the integral equation of the dynamics explored in [6]. If we instead take to be non-constant and compactly supported in , we arrive at


We then define the generalized residual for a given test function by replacing with a candidate element from the span of and x with y as follows:


Clearly with and we have for all compactly-supported in ; however, y is a discrete set of data, hence (2.3) can at best be approximated numerically, with measurement noise presenting a significant barely to accurate quadrature.

2.1. Method Overview

For analogy with traditional Galerkin methods, consider the forward problem of solving a dynamical system such as (1.1) for x. The Galerkin approach is to seek a solution x represented in a chosen trial basis such that the residual , defined by

is minimized over all test functions living in the span of a given test function basis . If the trial and test function bases are known analytically, inner product of the form appearing in the residual can be computed exactly, so that computational errors result only from representing the solution in a finite-dimensional function space.

The method we present here can be considered a data-driven Galerkin method of solving for where the trial “basis” is given by the set of gridfunctions evaluated at the data and only the test-function basis is known analytically. In this way, inner products appearing in must be approximated numerically, implying that the accuracy of the recovered weights w is ultimately limited by the quadrature scheme used to discretize inner products. Using Lemma 2.1 below, we show that the correct coefficients may be recovered to effective machine precision accuracy (given by the tolerance of the forward ODE solver) from noise-free trajectories y by discretizing (2.2) using the trapezoidal rule and choosing to decay smoothly to zero at the boundaries of its support.

Having chosen a quadrature scheme, the next accuracy barrier is presented by measurement noise, which introduces a bias in the weights. Below we analyze the distribution of the residuals to arrive at a generalized least squares approach where the covariance matrix can be computed directly from the test functions. This analysis also shows that placing test functions near steep gradients in the dynamics improves recovery, hence we develop a self-consistent and stable algorithm for constructing a test function basis adaptively near these regions which also does not rely on pointwise approximation of derivatives. Overall, we show that when noise is present, our method produces a recovered weight matrix

with the number of significant digits scaling optimally with the signal-to-noise ratio

(defined below).

Remark. The weak formulation of the dynamics (2.2) introduces a wealth of information. Given timepoints , equation (2.2) affords residuals over all possible supports . Using multiple families of test functions over these supports can be viewed as a type of data assimilation, with the action of each test function providing useful information for the covariance structure of the residuals. The information complexity of such an exhaustive approach quickly becomes computationally intractable; however, we show here that even with large noise recovery the true weights using .

2.2. Algorithm: Weak SINDy

We state here the Weak SINDy algorithm in full generality, which uses a generalized least squares approach with regularization and sequential thresholding to enforce sparsity.


  1. Construct matrix of trial gridfunctions

  2. Construct integration matrices V, from test functions such that the th rows satisfy ,

  3. Compute Gram matrix and right-hand side so that and

  4. Solve the generalized least-squares problem with -regularization

    using SINDy to enforce sparsity.

Below we analyze the residual to arrive at near optimal strategies for choosing and directly from the data . The choice of test function basis is chosen in tandem with a quadrature rule to eliminate numerical errors. Methods of choosing the sparsity and regularization parameters and a priori from the data exist [4], however in this article we do not optimize this process in order to focus on errors resulting from the weak formulation of the dynamics.

2.3. Residual Analysis

Performance of the Weak SINDy method is determined by the behavior of the residuals

Expanding this out and linearizing around the true data, we have

where . The errors manifest in the following ways:

  • is the misfit between w and

  • results from measurement error in trial gridfunctions

  • results from replacing with in the right-hand side of (1.1)

  • is the integration error ( by Lemma 2.1)

  • results from truncating the Taylor expansion of around the exact data x:

Clearly, recovery of when is straight forward: and are the only error terms, thus one only needs to select a quadrature scheme so that the integration error is negligible will be the minimizer (Figure 2 demonstrates this fact).

For , accurate recovery of

requires one to choose hyperparameters that exemplify the true misfit term

by enforcing that the other error terms are of lower order. We look for and that approximately enforce

utilizing that minimizing the residual in the least-squares sense corresponds to normally-distributed errors.

Approximate Covariance

Neglecting and , we can rewrite with and together as


where is the identity in . Assuming are i.i.d with mean and variance

, the transformed random variables

are also i.i.d. and so the Lyupanov central limit theorem asserts that


If it holds that , then we get

If we then consider the entire residual vector

, this corresponds to the approximate distribution

with , or .

Adaptive Refinement

Before advocating for a particular type of test function basis, we show that by localizing around large , an approximate cancellation of the error terms and results. Consider the one-dimensional case (). When is large compared to , we approximately have


for some , i.e. the perturbed value lands close to the true trajectory x at the time

. Visually this is clear, and for a heuristic argument, let

be the point of intersection between the tangent line at and :

When is large compared to , such an intersection exists and will be very small, hence will approximately lie on the true trajectory.

If we linearize using this approximation we get


Assuming is sufficiently localized around , (2.4) implies that

hence , while (2.5) implies

having integrated by parts. Putting the pieces together, the residual takes the form

and we see that and have effectively canceled. In higher dimensions this interpretation breaks down, but nevertheless, for any given coordinate , it holds that terms in the error expansion vanish around points where is large, precisely because .

2.4. Test Function Basis

Here we arrive at a test functions space and quadrature scheme to minimize integration errors and enact the heuristic arguments above, which rely on sufficiently localized. To ensure the integration error in approximating inner products is negligible, we rely on the following lemma, which provides a bound on the error in discretizing the weak derivative relation


for compactly supported .

Lemma 2.1 (Numerical Error in Weak Derivatives).

Let have continuous derivatives of order and define . Then if has roots of multiplicity , then


where . In other words, the composite trapezoidal rule discretizes the weak derivative relation (2.6) to order .


This is a simple consequence of the Euler-Maclaurin formula. If is a smooth function, then

where are the Bernoulli numbers. The asymptotic expansion provides corrections to the trapezoidal rule that realize machine precision accuracy up until a certain value of , after which terms in the expansion grow and the series diverges.

In our case, where the root conditions on imply that

So for odd, we have that

For even , the leading term is with a slightly different coefficient. ∎

We now define the test functions space to be uni-modal piece-wise polynomials of the form

where satisfies and . Functions are compactly supported in and non-negative. The normalization

ensures that . With , the exact leading order error in term in (2.7) is

which is negligible for a wide range of reasonable and values. The Bernoulli numbers eventually start growing like , but for smaller values of they are moderate. For instance, with and , this error term is up until , where it takes the value , while for , the error is below machine precision for all between 7 and 819. For these reasons, in what follows we choose test functions and discretize all integrals are discretized using the trapezoidal rule. Unless otherwise stated, each function satsifies and so is fully determined by the tuple indicating its polynomial degree and support.

Strategy 1: Uniform Grid

The simplest strategy for choosing a test function basis is to place uniformly on the interval each with support size

and a polynomial degree . This is the uniform grid strategy, only in light of the residual analysis above we introduce two parameters and below to pick the degree and the number of basis functions . We fix the support by


where is the magnitude of the th Fourier mode of y minus its mean. Large changes in the dynamics are most likely to occur within time intervals of length equal to the largest Fourier mode. We let which selects the polynomial degree as follows. Analytically,

and so we set . The shift parameter is then used to determine and the endpoints :

in other words is the height of intesection between and . This fixes for each pair of neighboring test functions and measures the amount of overlap between successive test functions, which factors into the covariance matrix . Larger implies that neighboring basis functions overlap on more points, with indicating that . Specifically, neighboring basis functions overlap on timepoints.

In Figures 3 and 4 we vary the parameters and and observe that results agree with intuition: larger and smaller lead to better recovery of .


  1. Construct matrix of trial gridfunctions

  2. Construct integration matrices V, from test functions parametrizes according to and (2.8).

  3. Compute Gram matrix and right-hand side so that and

  4. Compute approximate covariance and Cholesky factorization

  5. Solve the -regularized weighted least-squares problem

    using SINDy to enforce sparsity and applied using C.

Strategy 2: Adaptive Grid

Motivated by the arguments above, we now introduce an algorithm for constructing a test function basis localized near points of large change in the dynamics. This occurs in 3 steps: (1) construct a weak approximation to the derivative of the dynamics , (2) sample points c from a cumulative distribution with density proportional to the total variation , (3) construct test functions centered at c using a width-at-half-max parameter . Each of these steps is numerically stable and carried out independently along each dimension. A visual diagram is provided in Figure 1.

Figure 1. Counter-clockwise from top left: test function and derivative used to compute v, approximate total variation , cumulative distribution , noisy data y from the Duffing equation and resulting test functons centers c.

Step 1: Weak Derivative Approximation: Define , where the matrix enacts a linear convolution with the derivative of a piece-wise polynomial test function of degree and support size so that

The parameters and are chosen by the user, with and corresponding to taking a centered finite difference derivative with 3-point stencil. Larger results in more smoothing and minimizes the corruption from noise while still capturing the correct large deviations in the dynamics. For all examples we let and and note that greater disparity between and

results in more pronounced localization (less uniform distribution).

Step 2: Selecting c: Having computed v, define to be the cumulative sum of normalized so that . In this way

is a valid cumulative distribution function with density proportional to the total variation of

y. We then find c by sampling form . Let with begin the size of the test function basis, we then define , or numerically,

This stage requires the use to select the number of test functions .

Step 3: Construction of Test functions : Having chosen the location of the centerpoint for each test function , we are left to choose the degree of the polynomial and the supports . The degree is chosen according to the width-at-half-max parameter , which specifies the difference in timepoints between each center and , while the supports are chosen such that . This gives us a nonlinear system of two equations in two unknowns which can be easily solved (i.e. using MATLAB’s fzero). This can be done for one reference test functions and the rest of the weights obtained by translation.

The adaptive grid Weak SINDy algorithm is summarized as follows:


  1. Construct matrix of trial gridfunctions

  2. Construct integration matrices V, from test functions determined adaptively by according to the procedure above.

  3. Compute Gram matrix and right-hand side so that and

  4. Compute approximate covariance and Cholesky factorization

  5. Solve the -regularized weighted least-squares problem

    using SINDy to enforce sparsity and applied using C.

The parameters and play a role in determining how localized the test function basis is around steep gradients and ultimately depend on the timestep . As mentioned above, we set and throughout as this produces sufficient localization for the example featured in this article. For simplicity we fix the number of test functions to be a multiple of the number of trial functions (i.e. etc.). For larger noise it is necessary to use a larger basis, while for small noise it as often sufficient to set . The optimal value of

depends on the timescales of the dynamics and can be chosen from the data using the Fourier transform as in the uniform grid case, however for simplicity we set


3. Numerical Experiments

We now show that the Weak SINDy approach recovers the correct dynamics to high accuracy over a range of signal-to-noise ratios. To generate exact data we use MATLAB’s ode45 with absolute and relative tolerance e. Noise is then added to the exact trajectories by fixing and adding i.i.d Gaussian noise to each data point with mean zero and variance is computed by

We examine the following canonical nonlinear systems with variations in the specified parameters:

Van der Pol

The Duffing equation and Van der Pol oscillator present cases of an approximately linear systems with cubic nonlinearities. Solutions to the Van der Pol oscillator and Lotka-Volterra system exhibit orbits with variable speed of motion, in particular regions with rapid change between regions of very litte variation. For the Lorenz system, we focus on recovering the system in the chaotic regime. For this reason we fix the parameters of the differential equation to lie in the region with large Lyupanov exponents and vary the initial conditions. The initial conditions are chosen from a uniform distribution, and , which covers the strange attractor. In this case we see that if the initial conditions lead to trajectories which do not visit both sides of the strange attractor, then the system is not properly identified. This can be expected for recovery in general: trajectories that do not sample important regions of phase space cannot be relied upon to provide an accurate representation of the dynamics at large.

In what follows the sparsity and regularization parameters are set to and .

3.1. Noise-Free Data (Figure 2)

Here we show that in the zero-noise case () we recover the correct weight matrix to within the tolerance of the ODE solver, which is fixed at . We find that accurate recovery occurs regardless of sparsity enforcement or regularization, and so we set , orders of magnitude below any of the coefficients, and . For the data-driven trial basis , we include all polynomials up to degree 5 in the state variables as well as , for and ranging from to . In addition, we find that recovery occurs with the minimal number of basis functions , making the Gram matrix square. We use the uniform grid approach above with support selected from the Fourier transform of y and shift parameter fixed to ensure that .

The main goal of these figures is to examine the effect of increasing the polynomial degree

of the test functions to show that convergence to machine precision is realized in this limit (i.e. convergence to within the accuracy tolerance of the ODE solver). The only outlier in this regard is the Lorenz equation, where for some trajectories the system is not recovered. We find that recovery fails precisely for trajectories that do not visit both sides of the Lorenz attractor, and so inaccurate recovery can be attributed to not have visited a significant enough region of phase space.

Duffing Van der Pol
Lotka-Volterra Lorenz
V1 V2 V3 V4 Notes
Van der Pol,
Lorenz (see below) (see below) - - ,
Figure 2. Plots of relative error vs. when . For each system, a range of parameter values is considered, indicated by the different versions V1-V4. For the Duffing equation, Van der Pol oscillator and Lotka-Volterra system we see convergence in the recovery of coefficients using the uniform-grid Weak SINDy approach (Strategy 1) to the accuracy of the ODE solver () as is increased. For the Lorenz system, 20 random initial conditions were selected with 2/20 trajectories not yielding recovery of the correct coefficients due to having not visited both sides of the Lorenz strange attractor.

3.2. Small-Noise Regime (Figures 3 and 4)

We now turn to the case of low to moderate noise levels, examining a signal-to-noise ratio in the range . We observe another nice property of the Weak SINDy method, that the error in the coefficients scales with , in that the recovered coefficients having approximately significant digits.

We again use the uniform grid approach. We examine not only the polynomial degree but the number of basis functions used in recovery. To reiterate the arguments above, the magnitude of compared to affects the distribution of the residual, so we define and define the degree by fixing and then calculating . In this way, increasing corresponds to increasing . We look at which corresponds roughly to . This together with the spacing parameter determines the test functions basis. We enforce that two neighboring basis functions and intersect at a height of , so that with , the two functions perfectly overlap, and with their supports are disjoint. In this way, larger corresponds to higher . We examine . For (featured in both Figures) we note that as is varied from to , the number of basis functions ranges from to , or to .

We simulated 200 instantiations of noise for the Duffing equation and Van der Pol oscillator and for each noisy trajectory examined a range of the parameter values and . As one might expect form the noise-free case above, increasing leads monotonically to better recovery. In addition, increasing

also leads to better recovery. The mean and standard deviation of the coefficient error

are also pictured, along with sample resulting data-driven dynamical systems, with trajectories denoted by .

Figure 3. Dynamic recovery of the Duffing equation with parameters . Top row: example trajectory y (left) and learned dynamics (right) both plotted over true data x with , , and an error of . Middle row: heat map of the average error (left) and standard deviation (right) over 200 noisy trajectories with with increasing along the -axis and increasing along the -axis. Bottom: decreasing error trend for fixed for various . For each the expected error falls roughly an order of magnitude below the as increases.
Figure 4. Dynamic recovery of the Van der Pol oscillator with parameter . Top row: example trajectory y (left) and learned dynamics (right) both plotted over true data x with , , and an error of . Middle row: color plot of of the average error (left) and standard deviation of the error (right) over 200 noisy trajectories with with increasing along the -axis and increasing along the -axis. Bottom: decreasing error trend for fixed for various . As with the Duffing equation in Figure 3, for each the expected error falls roughly an order of magnitude below the as increases, however the expected accuracy begins to break down for larger , motivating Strategy 2.

3.3. Large-Noise Regime (Figures 5 to 8)

For the last set of experiments we show that recovery in the large noise regime is satisfactory using strategy 2 (non-uniform grid). The signal to noise ratio is for the Duffing, Van der Pol and Lorenz equations, and for Lotka-Volterra. In each case we set the weak differentiation parameters to and and the width-at-half-max to timepoints. For the 2D systems, we use test basis functions, while for the Lorenz equation were used. In each case the correct terms were identified with relative coefficient error less than , indicating approximately two significant digits. We plot the noisy data y, the true data x and the simulated data-driven dynamical systems in dynamo view and phase space to exemplify the separation of scales and the severity of the corruption from noise. We extend by .

For the Duffing equation, the data-driven trajectory diverges slightly from the true data as the system relaxes to equilibrium but is otherwise qualitatively accurate. The Van der Pol oscillator exhibits a case of nearly the same limit cycle being identified but the dominant timescale of is slightly different than that x, so that the data-driven trajectories diverge over time while visiting the same region of phase space. This suggests the level of accuracy needed to accurately recover systems from trajectories with sharp gradients. The recovered Lotka-Volterra trajectory is nearly indistiguishable from the true data. The recovered Lorenz trajecgtory remains close to the true trajectory up until around , before the two diverge as is expected from chaotic dynamics. The Lorenz attractor is captured.

Figure 5. Duffing Equation, same parameters as in Figure 3. Accurate recovery of the stable spiral with . All correct terms were identified with an error in the weights of and . The number of basis functions used was and the width-at-half-max parameter was set to time-points, resulting in .
Figure 6. Van der Pol oscillator, same parameters as in Figure 4. Accurate recovery of the limit cycle for . All correct terms were identified with an error in the weights of and . The number of basis functions used was and the width-at-half-max parameter was set to