Robust subsampling-based sparse Bayesian inference to tackle four challenges (large noise, outliers, data integration, and extrapolation) in the discovery of physical laws from data

07/17/2019
by   Sheng Zhang, et al.
Purdue University

The derivation of physical laws is a dominant topic in scientific research. We propose a new method for discovering physical laws from data that tackles four challenges left open by previous methods: (1) large noise in the data, (2) outliers in the data, (3) integrating data collected from different experiments, and (4) extrapolating the solutions to regions that have no available data. To resolve these four challenges, we aim to discover the governing differential equations and develop a model-discovering method based on sparse Bayesian inference and subsampling. The subsampling technique is used here for improving the accuracy of the Bayesian learning algorithm, whereas it is usually employed elsewhere for estimating statistics or speeding up algorithms. The optimal subsampling size is moderate, neither too small nor too big. Another merit of our method is that it can work with limited data by virtue of Bayesian inference. We demonstrate how to use our method to tackle the four aforementioned challenges step by step through numerical examples: (1) a predator-prey model with noise, (2) the shallow water equations with outliers, (3) heat diffusion with random initial and boundary conditions, and (4) a fish-harvesting problem with bifurcations. Numerical results show that the robustness and accuracy of our new method are significantly better than those of other model-discovering methods and traditional regression methods.


1 Introduction

The search for physical laws has been a fundamental aim of science for centuries. Physical laws are critical to the understanding of natural phenomena and the prediction of future dynamics. They are either derived from other known physical laws or generalized from empirical observations of physical behavior. We focus on the second task, which is also called data-driven discovery of governing physical laws. It deals with the case where experimental data are given while the governing physical model is unclear. Traditional methods for discovering physical laws from data include interpolation and regression. Suppose an unknown physical law relates a set of measured quantities. Given data, traditional methods approximate the expression of the law in terms of a class of functions of those quantities. This approach has two limitations:

  • If the data are collected from different experiments, traditional methods are not able to use all of the data together to discover the physical laws. For example, free falls from different initial heights follow the same physical law but have different trajectories. Traditional methods can only use the data on a single trajectory to discover the path and predict future motion, whereas using all the data together would increase the accuracy.

  • Traditional methods are incapable of extrapolating to the areas where no data are given.

To resolve these two limitations, we adopt the following strategy: first, discover the differential equations that the physical law satisfies; second, solve the differential equations analytically or numerically. For the free-fall example, we can use all the data from different trajectories together to discover the differential equation and then extrapolate to any other initial height. This discovery pattern is applicable to a larger class of models than traditional methods and derives the governing differential equations, which provide insights into the governing physical laws behind the observations zhang_robust_2018 . Many fundamental laws are formulated in the form of differential equations, such as Maxwell's equations in classical electromagnetism, the Einstein field equations in general relativity, the Schrödinger equation in quantum mechanics, the Navier-Stokes equations in fluid dynamics, the Boltzmann equation in thermodynamics, the predator-prey equations in biology, and the Black-Scholes equation in economics. While automated techniques for generating and collecting data from scientific measurements are becoming ever more precise and powerful, automated processes for extracting knowledge in analytic form from data remain limited schmidt_distilling_2009 . Our goal is to develop automated algorithms for extracting the governing differential equations from data.

Consider a differential equation of the form

(1)

with an unknown right-hand side. Given data collected from a system governed by this differential equation, automated algorithms for deriving the expression of the right-hand side have been studied from various approaches. One of the approaches assumes that the right-hand side is a linear combination of simple functions of the variables involved. First, construct a moderately large set of basis-functions that may contain all of its terms; then, apply an algorithm to select from the basis-functions the subset that constitutes exactly those terms and to estimate the corresponding weights in the linear combination.

Suppose a set of basis-functions has been chosen. Then we need to estimate the weights in the following linear combination:

(2)

Given the data, the above problem becomes a regression problem as follows:

(3)

where the last term is the model error. Let

(4)
(5)
(6)

Equation (3) may be written in vector form as follows:

(7)

Now the problem is to estimate the weight-vector given the known data vector and the known matrix of basis-function values.

Since many physical systems have only a few terms in their equations, the set of basis-functions usually contains many more terms than the true equation does, which suggests the use of sparse methods to select the subset of basis-functions and estimate the weights. These sparse methods include sequential threshold least squares (also called sparse identification of nonlinear dynamics, SINDy) brunton_discovering_2016 ; rudy_data-driven_2017 , lasso (least absolute shrinkage and selection operator) tibshirani_regression_1996 ; schaeffer_learning_2017 , and threshold sparse Bayesian regression zhang_robust_2018 . Sequential threshold least squares performs least-squares regression and eliminates the terms with small weights iteratively (Algorithm 1); lasso solves the following optimization problem:

(8)

where the regularization parameter may be fitted by cross-validation (Algorithm 2); threshold sparse Bayesian regression calculates the posterior distribution of the weights given the data and then filters out small weights, iterating until convergence (Algorithm 3). A comparison of these three sparse methods in zhang_robust_2018 shows that threshold sparse Bayesian regression is more accurate and robust than the other two.

The same mechanism as above also applies to the discovery of general differential equations, including higher-order differential equations and implicit differential equations zhang_robust_2018 , besides differential equations of the form (1). Nevertheless, the mechanism is described here in terms of the form (1) for convenience and simplicity, so that more attention is given to the essence of the algorithm itself. In addition, to apply the algorithm to real-world problems, dimensional analysis can be incorporated into the construction of the basis-functions zhang_robust_2018 . Any physically meaningful equation has the same dimensions on every term, a property known as dimensional homogeneity. Therefore, when summing up terms in the equations, the addends should be of the same dimension.

Sparse regression methods for data-driven discovery of differential equations have been developed recently with a wide range of applications, for example, inferring biological networks mangan_inferring_2016 , sparse identification of a predator-prey system dam_sparse_2017 , model selection via integral terms schaeffer_sparse_2017 , extracting high-dimensional dynamics from limited data schaeffer_extracting_2017 , recovery of chaotic systems from highly corrupted data tran_exact_2017 , model selection for dynamical systems via information criteria mangan_model_2017 , model predictive control in the low-data limit kaiser_sparse_2017 , sparse learning of stochastic dynamical systems boninsegna_sparse_2018 , model selection for hybrid dynamical systems mangan_model_2018 , identification of parametric partial differential equations rudy_data-driven_2018 , extracting structured dynamical systems with very few samples schaeffer_extracting_2018 , constrained Galerkin regression loiseau_constrained_2018 , rapid model recovery quade_sparse_2018 , and convergence of the SINDy algorithm zhang_convergence_2018 .

Moreover, other methods for data-driven discovery of differential equations have been proposed as well, for instance, deep neural networks rudy_deep_2018 ; raissi_physics_2017-1 ; raissi_physics_2017 and Gaussian processes raissi_machine_2017 . One advantage of the sparse regression methods is their ability to provide explicit formulas of the differential equations, from which further analysis of the systems may be performed, while deep neural networks usually provide "black boxes" in which the mechanism of the systems is not clearly revealed. Another advantage of the sparse regression methods is that they do not require much prior knowledge of the differential equations, while Gaussian process methods place restrictions on the form of the differential equations and are used to estimate only a few parameters.

Previous developments and applications based on sparse regression methods mostly employ either sequential threshold least squares or lasso, or their variations. One of the reasons why data-driven discovery of differential equations has not yet been applied in industry is the instability of its methods. Previous methods require data of very high quality, which is usually not available in industry. Although threshold sparse Bayesian regression is more accurate and robust than the other two methods and provides error bars that quantify the uncertainties zhang_robust_2018 , its performance is still unsatisfactory if the provided data have large noise or contain outliers. Therefore, it is worthwhile to improve the threshold sparse Bayesian regression algorithm and apply it to the fields above, as this will improve the overall performance of the method in most cases. In this paper, we develop a subsampling-based technique for improving the threshold sparse Bayesian regression algorithm, so that the new algorithm is robust to large noise and outliers. Note that subsampling methods are usually employed in the literature for estimating statistics efron_jackknife_1981 or speeding up algorithms rudy_data-driven_2017 , but the subsampling method in this paper is used to improve the accuracy of the Bayesian learning algorithm. In practice, denoising techniques can be applied to reduce part of the noise and remove some outliers in the data before our algorithm is run.

The remainder of this paper is structured as follows. In Section 2, we introduce the threshold sparse Bayesian regression algorithm. In Section 3, we detail our new subsampling-based algorithm. In Section 4, we investigate the robustness of our new algorithm through an example of discovering the predator-prey model with noisy data. In Section 5, we discuss how to use our new algorithm to remove outliers, with an example of discovering the shallow water equations using the data corrupted by outliers. In Section 6, we tackle the challenge of data integration through an example of discovering the heat diffusion model with random initial and boundary conditions. In Section 7, we tackle the challenge of extrapolation through an example of discovering the fish-harvesting model with bifurcations. Finally, the summary is given in Section 8.

2 Threshold sparse Bayesian regression

2.1 Bayesian hierarchical model setup

Consider again the regression problem of Section 1, with a known data vector, a known matrix of basis-function values, the weight-vector to be estimated sparsely, and the model error:

(9)

We adopt a sparse Bayesian framework based on the RVM (relevance vector machine tipping_sparse_2001 , which is motivated by automatic relevance determination domany_bayesian_1996 ; neal_bayesian_1996 ) to estimate the weight-vector. The Bayesian framework assumes that the model errors are independent and identically distributed zero-mean Gaussian with a common variance. The variance may be specified beforehand, but in this paper it is fitted from the data. The model gives a multivariate Gaussian likelihood on the data vector:

(10)

Now we introduce a Gaussian prior over the weight-vector. The prior is governed by a set of hyper-parameters, one hyper-parameter associated with each component of the weight-vector:

(11)

The values of the hyper-parameters are estimated from the data. See Figure 1 for the graphical structure of this model.
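
For reference, in the standard relevance vector machine formulation the likelihood (10) and prior (11) take the following forms (a reconstruction in our own notation, consistent with tipping_sparse_2001 : data vector $\mathbf{y}$, matrix $\boldsymbol{\Phi}$, weight-vector $\mathbf{w}$, noise variance $\sigma^2$, and one precision hyper-parameter $\alpha_i$ per weight):

p(\mathbf{y} \mid \mathbf{w}, \sigma^2) = \mathcal{N}\!\left(\mathbf{y} \mid \boldsymbol{\Phi}\mathbf{w},\, \sigma^2 \mathbf{I}\right), \qquad p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_i \mathcal{N}\!\left(w_i \mid 0,\, \alpha_i^{-1}\right).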

Figure 1: Graphical structure of the sparse Bayesian model.

2.2 Inference

The posterior over all unknown parameters given the data can be decomposed as follows:

(12)

As analytic computations cannot be performed in full, we approximate the posterior over the hyper-parameters and the noise variance by a Dirac delta function at its maximum likelihood estimate:

(13)

We may use the Dirac delta function as an approximation on the basis that this point-estimate is representative of the posterior, in the sense that integrals computed with the point-estimate are roughly equal to those obtained by sampling from the full posterior distribution tipping_sparse_2001 . This maximization is a type-II maximum likelihood and can be computed using a fast method tipping_fast_2003 . We may now integrate out the hyper-parameters and the noise variance to get the posterior over the weight-vector:

(14)

in which the posterior covariance and mean are:

(15)
(16)

Therefore the posterior for each weight can be deduced from (14):

(17)

with the corresponding mean and standard deviation. Thus the mean posterior prediction of the weight-vector and the other quantities that we want to obtain are determined by the values of the hyper-parameters and the noise variance. This is the beauty of the Bayesian approach: there is no need to determine a regularization parameter via expensive cross-validation, and moreover likelihood values and confidence intervals for the solution can be easily calculated schmolck_smooth_2007 . Another merit of the Bayesian approach is that it can work with limited data, since it incorporates prior information into the problem to supplement the limited data.
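
As a concrete illustration of this inference step, here is a minimal NumPy sketch of the posterior computation in (15)-(16), using the standard relevance vector machine formulas tipping_sparse_2001 ; the names `Phi`, `y`, `alpha`, and `sigma2` are our own choices, and the hyper-parameters are assumed to have already been estimated by type-II maximum likelihood.

```python
import numpy as np

def rvm_posterior(Phi, y, alpha, sigma2):
    """Posterior over the weight-vector for fixed hyper-parameters:
    Sigma = (A + Phi^T Phi / sigma2)^(-1),  mu = Sigma Phi^T y / sigma2,
    where A = diag(alpha) is the prior precision matrix."""
    A = np.diag(alpha)
    Sigma = np.linalg.inv(A + Phi.T @ Phi / sigma2)   # posterior covariance
    mu = Sigma @ Phi.T @ y / sigma2                   # posterior mean
    return mu, Sigma
```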

2.3 Implementation

In practice the optimal values of many hyper-parameters in (13) are infinite tipping_sparse_2001 , and thus from (15)-(17) the posterior distributions of many components of the weight-vector are sharply peaked at zero. This leads to the sparsity of the resulting weight-vector. To further encourage accuracy and robustness, a threshold is placed on the model to filter out possible disturbances present in the weight-vector. The weight-vector is then re-estimated using the remaining terms, iteratively until convergence. The entire procedure is summarized in Algorithm 3. A discussion about how to choose the threshold and its impact on the solution is detailed in zhang_robust_2018 . As the threshold is a parameter representing the model complexity, we may use machine learning techniques such as cross-validation to determine it. Note that in Algorithm 3, the weight-vector becomes sparser after each loop by design. Thus the convergence of the algorithm is guaranteed, given the convergence of the maximum-likelihood calculation in (13).

In this paper, we define an error bar value that quantifies the quality of the posterior estimated model as follows:

(18)

where each estimated variance is divided by the square of the corresponding estimated mean to normalize the variance of each weight. This definition adds up all the normalized variances of the weights present in the result and penalizes uncertainty in the estimates. In other words, a smaller error bar value means smaller normalized variances and higher posterior confidence, and implies higher model quality. Given a set of candidate models for the data, the preferred model would be the one with the minimum error bar value.
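
A minimal sketch of this error bar, written against the posterior mean `mu` and covariance `Sigma` from the previous subsection (variable names are ours; the weights "present in the result" are taken to be those with nonzero posterior mean):

```python
import numpy as np

def error_bar(mu, Sigma):
    """Sum of posterior variances, each normalized by the squared posterior
    mean, over the weights retained in the discovered model."""
    keep = mu != 0
    return float(np.sum(np.diag(Sigma)[keep] / mu[keep] ** 2))
```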

As a comparison, other sparse regression algorithms are listed: sequential threshold least squares (Algorithm 1) and lasso (Algorithm 2).

Input: the data vector, the matrix of basis-function values, the threshold
Output: the weight-vector
Solve the least-squares problem for the weight-vector;
For components of the weight-vector with absolute value less than the threshold, set them to 0;
while true do
       Delete the columns of the matrix whose corresponding weight is 0;
       Solve the reduced least-squares problem for the remaining weights;
       Update the corresponding components of the weight-vector;
       For components of the weight-vector with absolute value less than the threshold, set them to 0;
       if the weight-vector is the same as on the last loop then
             break;
       end if
end while
Algorithm 1 Sequential threshold least squares
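
For concreteness, a minimal runnable sketch of Algorithm 1 in Python with NumPy; the names `Phi` (matrix of basis-function values), `y` (data vector), and `threshold` are our choices, not notation from the paper.

```python
import numpy as np

def sequential_threshold_least_squares(Phi, y, threshold, max_iter=20):
    """Alternate least-squares fits with hard thresholding of small weights
    (cf. Algorithm 1)."""
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    w[np.abs(w) < threshold] = 0.0
    for _ in range(max_iter):
        support = w != 0
        if not support.any():
            break                                  # all weights pruned
        w_sub, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        w_new = np.zeros_like(w)
        w_new[support] = w_sub
        w_new[np.abs(w_new) < threshold] = 0.0
        if np.array_equal(w_new != 0, support):    # support unchanged: converged
            return w_new
        w = w_new
    return w
```
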
Input: the data vector, the matrix of basis-function values
Output: the weight-vector
Solve the lasso problem (8), where the regularization parameter is fitted by five-fold cross-validation with minimum mean squared error (MSE) on the validation sets.
Algorithm 2 Lasso
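
Algorithm 2 can be reproduced with a standard library; a sketch assuming scikit-learn is available (disabling the intercept, so that the fit matches the linear combination (2), is our choice):

```python
from sklearn.linear_model import LassoCV

def lasso_weights(Phi, y):
    """Lasso with the regularization parameter chosen by five-fold
    cross-validation (minimum mean squared error on the validation folds)."""
    model = LassoCV(cv=5, fit_intercept=False)
    model.fit(Phi, y)
    return model.coef_
```
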
Input: the data vector, the matrix of basis-function values, the threshold
Output: the weight-vector, its posterior variance
Calculate the posterior distribution of the weight-vector as in Section 2.2, and take its mean as the estimate;
For components of the estimate with absolute value less than the threshold, set them to 0;
while true do
       Delete the columns of the matrix whose corresponding weight is 0;
       Calculate the posterior distribution for the reduced problem, and take its mean;
       Update the corresponding components of the estimate;
       For components of the estimate with absolute value less than the threshold, set them to 0;
       if the estimate is the same as on the last loop then
             break;
       end if
end while
Set the submatrix of the posterior variance corresponding to the non-zero components of the estimate to the last estimated posterior variance in the preceding procedure, and set the other elements to 0.
Algorithm 3 Threshold sparse Bayesian regression (TSBR)

3 Subsampling-based threshold sparse Bayesian regression

3.1 Motivation

In the regression problem (3), when we have more data points than unknown weights, we may use a subset of the data to estimate the weights. We do this on the basis that data sets collected from real-world cases may contain outliers or a percentage of data points with large noise. Classical methods for parameter estimation, such as least squares, fit a model to all of the presented data and have no internal mechanism for detecting or discarding outliers. Other robust regression methods designed to overcome the limitations of traditional methods include iteratively reweighted least squares holland_robust_1977 ; chartrand_iteratively_2008 , random sample consensus fischler_random_1981 , least absolute deviations karst_linear_1958 , the Theil-Sen estimator hallet_rank-invariant_1992 ; sen_estimates_1968 , and repeated median regression siegel_robust_1982 ; rousseeuw_least_1984 , but they do not fit very well into the framework of data-driven discovery of differential equations. Classical methods for parameter estimation are averaging methods based on the assumption (the smoothing assumption) that there will always be enough good values to smooth out any gross noise fischler_random_1981 . In practice the data may contain more noise than the good values can compensate for, breaking the smoothing assumption. To deal with this situation, an effective way is to discard the "bad" data points (outliers or points with large noise) and use the remaining "good" data points for estimation. Based on this idea, we propose an algorithm called subsampling-based threshold sparse Bayesian regression (SubTSBR). This method filters out bad data points by means of random subsampling.

3.2 Implementation

SubTSBR approaches the problem by randomly selecting data points to estimate the weights and using a measure (the error bar) to evaluate the estimates. When an estimate is "good" (small error bar), our algorithm identifies the corresponding selected data points as good data points and takes the estimate as the final result. To be specific, our algorithm is given a user-preset subsampling size and number of loops at the very beginning. For each loop, a random subset of the data of the given subsampling size is selected and used to estimate the weights in the following regression problem:

(19)

where the terms on the right-hand side are the basis-functions multiplied by their weights, plus the model error. This regression problem can be written symbolically in the following form:

(20)

which is (9). By running Algorithm 3, we obtain a differential equation:

(21)

along with an error bar calculated by (18). After repeating this procedure for the prescribed number of loops with different randomly selected data points, the differential equation with the smallest error bar among all the loops is chosen as the final result of the whole subsampling algorithm. Our algorithm has two user-preset parameters: the subsampling size and the number of loops. Their impact on the accuracy of the final result is discussed in Section 4 with an example. Note that the above mechanism is described in the pattern (1) for convenience and simplification. It also applies to higher-order differential equations and implicit differential equations, as long as the differential equations can be symbolized into the form (20). The SubTSBR procedure is summarized in Algorithm 4, where the for-loop can be executed in parallel.

Input: the data vector, the matrix of basis-function values, the threshold, the subsampling size, the number of loops
Output: the weight-vector, its posterior variance
for each of the loops do
       Randomly choose rows of the data vector and the matrix, as many as the subsampling size, to form a subsampled problem;
       Use Algorithm 3 to solve the subsampled problem, getting an estimated weight-vector and its posterior variance;
       Calculate the error bar of this estimate using (18);
end for
Return the estimated weight-vector and posterior variance from the loop with the smallest error bar.
Algorithm 4 Subsampling-based threshold sparse Bayesian regression (SubTSBR)
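
A minimal sketch of the subsampling loop in Algorithm 4, built on top of a routine `tsbr` implementing Algorithm 3 and the `error_bar` function sketched in Section 2.3 (both assumed here, not shown); as noted above, the loop can be run in parallel.

```python
import numpy as np

def subtsbr(Phi, y, threshold, subsample_size, num_loops, seed=None):
    """Run TSBR on random row subsets and keep the estimate with the
    smallest error bar (cf. Algorithm 4)."""
    rng = np.random.default_rng(seed)
    n = Phi.shape[0]
    best = None
    for _ in range(num_loops):
        rows = rng.choice(n, size=subsample_size, replace=False)  # random subset
        mu, Sigma = tsbr(Phi[rows], y[rows], threshold)           # Algorithm 3 (assumed)
        eb = error_bar(mu, Sigma)                                 # error bar (18)
        if best is None or eb < best[0]:
            best = (eb, mu, Sigma)
    return best[1], best[2]
```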

3.3 Why it works

The numerical results in this paper show that our subsampling algorithm can improve the overall accuracy in the discovery of differential equations, and that the error bar (18) is capable of evaluating the estimates. The given data contain some data points with small noise and some with large noise. When a subset consisting only of data points with small noise is selected, our algorithm estimates the weights well and indicates that this is the case by showing a small error bar (18). As we do not know which data points have small noise and which have large noise before the model is discovered, we select subsets from the data randomly, repeating multiple times. When the selected data points happen to have small noise, we obtain a good estimate of the weights and at the same time recognize this case.

The numerical results also show that in order to attain optimal performance, the subsampling size should be moderate, neither too small nor too big, while the number of loops should be as big as computing time permits. As the correctness of the governing differential equations is crucial, computing time can be a secondary consideration. In practice, we can increase the number of loops gradually and stop the algorithm when the smallest error bar among all the loops drops below a preset value or stops decreasing.

4 The challenge of large noise

4.1 Problem description

The noise in the data can hinder model-discovering algorithms from getting the correct results. Here we detail the mechanism of our algorithm, tune the parameters, and investigate the robustness against noise.

4.2 Example: the predator-prey model with noise

The predator-prey model is a system of two first-order nonlinear differential equations and is frequently used to describe the interaction between two species, one a predator and the other its prey. The populations change over time as follows:

(22)
(23)

where the two state variables are the numbers of prey and predators, respectively, and the remaining coefficients are positive real parameters describing the interaction of the two species. In this example, we fix the parameters as follows:

(24)
(25)

We assume that we do not know the formula of the system (24) - (25), neither the terms nor the parameters, and we try to discover the model from noisy data.

4.2.1 Data collection

We first generate data points from the system (24) - (25) with fixed initial values over a finite time interval. Then independent and identically distributed white noise is added to all the data for both populations. See Figure 2 for the true model and the noisy data.
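
For illustration, here is a sketch of this data-generation step using SciPy; the parameter values, initial conditions, time grid, and noise level below are placeholders of our own, since the paper's specific numbers are not reproduced here.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Lotka-Volterra predator-prey dynamics with placeholder coefficients
a, b, c, d = 1.0, 0.1, 0.075, 1.5     # hypothetical parameter values

def predator_prey(t, z):
    x, y = z                          # prey and predator populations
    return [a * x - b * x * y, c * x * y - d * y]

t_grid = np.linspace(0.0, 20.0, 400)  # placeholder observation times
sol = solve_ivp(predator_prey, (0.0, 20.0), [10.0, 5.0], t_eval=t_grid)

rng = np.random.default_rng(0)
noisy_data = sol.y + rng.normal(0.0, 0.5, sol.y.shape)  # i.i.d. Gaussian noise (placeholder level)
```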


Figure 2: The true predator-prey model and the noisy data. (a) Population vs time. (b) Population(predator) vs population(prey).

4.2.2 Calculate numerical derivatives using total-variation derivative

Now we need to calculate the derivatives of the data to estimate the left-hand-side terms in (24) - (25). The noise in the data would be amplified greatly if the derivatives were calculated by direct numerical differentiation; see Figure 3 for the derivatives approximated with a finite-difference gradient. Therefore, we use the total-variation derivative rudin_nonlinear_1992 ; chartrand_numerical_2011 instead. For a real function on an interval, the total-variation derivative is computed as the minimizer of the functional:

(26)

where the operator in the data-fidelity term is antidifferentiation and the regularization parameter controls the balance between the two terms. The numerical implementation is introduced in chartrand_numerical_2011 . See Figure 3 for the derivatives of the data approximated using the total-variation derivative.

As we can see in Figure 3, a robust differentiation method is critical for obtaining high-quality derivatives. Besides the total-variation derivative, many other methods are available for robust differentiation. Another approach is to use denoising techniques to reduce the noise in the data before taking derivatives; for example, a neural network is used to denoise data and approximate derivatives in lagergren_learning_2019 . Those methods may perform better in practical use, depending on the situation. Here we do not use denoising techniques, for the sake of demonstrating the robustness of our algorithm against noise. In practice, denoised data may still contain noise, so our algorithm can be applied after the denoising process and may achieve good results.
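
To make the idea of (26) concrete, here is a simplified sketch (not the algorithm of chartrand_numerical_2011): discretize antidifferentiation with a cumulative sum, smooth the total-variation term slightly so that a quasi-Newton solver can be applied, and minimize numerically. It is intended for small data sets, and the variable names and smoothing constant are our own choices.

```python
import numpy as np
from scipy.optimize import minimize

def tv_derivative(f, dt, alpha, eps=1e-8):
    """Estimate u ~ df/dt as the minimizer of
    alpha * TV(u) + 0.5 * || antiderivative(u) - (f - f[0]) ||^2."""
    target = f - f[0]

    def objective(u):
        Au = np.cumsum(u) * dt                       # discrete antidifferentiation
        tv = np.sum(np.sqrt(np.diff(u) ** 2 + eps))  # smoothed total variation
        return alpha * tv + 0.5 * np.sum((Au - target) ** 2)

    u0 = np.gradient(f, dt)                          # crude initial guess
    return minimize(objective, u0, method="L-BFGS-B").x
```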


Figure 3: (a) The true derivatives of the data and the approximated derivatives using gradient. (b) The true derivatives of the data and the approximated derivatives using total-variation derivative (tvd).

4.2.3 Discover the model using different sparse algorithms

If sequential threshold least squares (Algorithm 1) is applied to discover the model (24) - (25), we get the following result:

(27)
(28)

If lasso (Algorithm 2) is used, we have:

(29)
(30)

If TSBR (Algorithm 3) is used, we get:

(31)
(32)

with the corresponding error bars, where the numbers in front of each term read as "mean (standard deviation)" of the corresponding weights. Note that sequential threshold least squares, lasso, and TSBR each use all data points at the same time to discover the model.

In contrast, if SubTSBR (Algorithm 4) is applied to discover the model, we have:

(33)
(34)

with the corresponding error bars. The approximated weights in (33) - (34) are slightly smaller in absolute value than the true weights in (24) - (25) because, when we calculate the numerical derivatives, the total-variation derivative (26) smooths out some variation in the derivatives. This defect does not have much impact on the result and can be addressed by using different methods to calculate the derivatives or by collecting the derivatives along with the data directly. Here the subsampling size and the number of loops are fixed at preset values. All methods in this example use the same threshold and use monomials of up to degree three in the state variables as basis-functions. The result of SubTSBR (33) - (34) approximates the true system (24) - (25) significantly better than those of the other sparse algorithms. Although the data contain a considerable amount of noise, SubTSBR successfully finds the exact terms in the true system and accurately estimates the parameters. See Figure 4 for the final selected data points in SubTSBR. See Figure 5 for the dynamics calculated by the systems derived from each algorithm. Note that since the data are collected over the first part of the time interval, Figure 5(a) shows the approximation and Figure 5(b) shows the prediction.

Figure 4: The final selected data points in SubTSBR.


Figure 5: (a) Approximated dynamics by SubTSBR, TSBR, lasso, and sequential threshold least squares from to . (b) Predicted dynamics from to .

4.2.4 Basis-selection success rate vs subsampling size and the number of loops

In SubTSBR (Algorithm 4), we have two parameters to set: the subsampling size and the number of loops. Here we first investigate the impact of different subsampling sizes on the basis-selection success rate. In Figure 6, each curve is drawn by fixing the number of loops. Then, for each subsampling size, SubTSBR is applied to the data set collected above. This procedure is repeated many times for each fixed number of loops and subsampling size, and the percentage of runs that identify exactly the terms in the system (24) - (25) is calculated and plotted. In the discovery process, the most difficult part is to identify the exact terms in the system; once the exact terms are identified, it is usually easy to estimate the weights.

Figure 6 shows that for each fixed number of loops, the basis-selection success rate goes up and then down as the subsampling size increases. When the subsampling size equals the total number of data points, all the data points are used and SubTSBR is equivalent to TSBR (Algorithm 3); in this case the true terms cannot be identified. In addition, for each chosen number of loops there is an optimal subsampling size, and the optimal subsampling size increases with the number of loops.

Now we investigate the impact of different numbers of loops on the basis-selection success rate. In Figure 6, each curve is drawn by fixing the subsampling size. Then, for each number of loops, SubTSBR is applied to the data set to discover the model. The discovery is repeated many times for each fixed subsampling size and number of loops, and the percentage of runs that identify exactly the terms in the system (24) - (25) is calculated and plotted.

Figure 6 shows that for each fixed subsampling size, the basis-selection success rate keeps going up as the number of loops increases. In addition, the larger the subsampling size is, the more loops are needed for the basis-selection success rate to reach a certain level. This is because our data set is polluted by Gaussian noise and naturally contains some data points with large noise and some with small noise. As the subsampling size gets bigger, it is less likely for each random subset of the data to exclude the data points with large noise. When more loops are used, the likelihood that at least one of the loops excludes the data points with large noise increases. As long as one of the loops excludes the data points with large noise, that loop may successfully select the true basis-functions and have the smallest error bar; if it does, the final result comes from that loop and SubTSBR selects the true basis-functions successfully in that run. This explains why the curves for smaller subsampling sizes rise faster.

On the other hand, when more data points are used, the noise inside the subsample is smoothed out more easily in the regression (20). If all the included data points in the subsample have small noise, then the result from a larger subsampling size may be better. This explains why the saturated basis-selection success rate is higher for larger subsampling sizes within a certain range. In conclusion, there is a tradeoff for larger subsampling sizes: it is more difficult to include only data points with small noise, while it is easier to smooth out the noise inside the subsample.


Figure 6: The total number of data points is . (a) Basis-selection success rate vs subsampling size, with different numbers of loops. (b) Basis-selection success rate vs the number of loops, with different subsampling sizes.

4.2.5 Adjusted error bar and auto-fitting of subsampling size

In real-world applications, the situation is usually more complicated and the problem of setting the best subsampling size is subtle. Since the true equations are unknown, the basis-selection success rate cannot be calculated, so drawing a curve like the ones in Figure 6 to find the best subsampling size is not possible. Here we define an adjusted error bar as an indicator of the quality of the approximated model and fit the subsampling size automatically.

The error bar defined in (18) depends on the number of data points used and the quality of the approximated model. When the subsampling size is within a reasonable range, the quality of the approximated model does not differ too much, so the error bar is dominated by the number of data points (see Figure 7). By the formula (18), the error bar is negatively correlated with the number of data points. If we want to compare results among different subsampling sizes, we need to adjust the error bar such that it is independent of the subsampling size. Inspired by the rate of convergence of Monte Carlo method, we give the following empirical formula:

(35)

In Figure 8, we can see that the adjusted error bar depends almost only on the quality of the approximated model. Note that we obtain the best models within an intermediate range of subsampling sizes (see Figure 6). Meanwhile, the percentile curves in Figure 8 indicate that the optimal subsampling sizes are in about that range. This observation confirms that the adjusted error bar depends on the quality of the discovered model: the better the model, the smaller the adjusted error bar.

Now we do not have to set the subsampling size at the beginning; we may try different subsampling sizes to discover the model, and the best result can be selected from all the results with different subsampling sizes. (In Figure 7 and Figure 8, for each fixed number of loops and subsampling size, we run SubTSBR repeatedly and discover models with their error bars or adjusted error bars. Then the 25th, 50th, and 75th percentiles of these error bars or adjusted error bars are plotted.)

Figure 7: Error bar vs subsampling size. The total number of data points is .


Figure 8: Adjusted error bar vs subsampling size, with different numbers of loops. The total number of data points is .

4.2.6 A new data set with larger noise

Here we use a new data set with larger white noise to discover the predator-prey model (24) - (25). All other settings remain the same. Corresponding to Figure 2, the noisy data are presented in Figure 9(a) and Figure 9(b). Corresponding to Figure 6, the basis-selection success rates are presented in Figure 9(c) and Figure 9(d). With larger noise, the chance of successfully picking out the true terms from the basis-functions is lower. As a result, more loops are needed in this case. Figure 9 only demonstrates the results for a limited range of numbers of loops, but many more loops may be used in practical problems. As for computation time, it takes a number of seconds to discover the dynamical system in this example, for the given subsampling size and number of loops, on one core of an Intel i7-6700HQ CPU (coded in MATLAB 2018a).


Figure 9: (a) and (b) The noisy data with larger noise than the ones in Figure 2. (c) and (d) The basis-selection success rates with larger noise than the ones in Figure 6.

4.2.7 Another new data set with smaller noise

Now we use another new data set with smaller white noise to discover the predator-prey model (24) - (25). All other settings remain the same. The numerical results are presented in Figure 10. In this case, we have a large portion of data points with small noise to use, so a high basis-selection success rate can be reached with just a few loops. Also, the noise inside the subsamples is small, so we do not need a large subsampling size to smooth it out. The basis-selection success rate is very high even with a small subsampling size.


Figure 10: (a) and (b) The noisy data with smaller noise than the ones in Figure 2. (c) and (d) The basis-selection success rates with smaller noise than the ones in Figure 6.

5 The challenge of outliers

5.1 Problem description

The outliers in the data can cause serious problems in the estimations and should be removed. The subsampling procedure in Algorithm 4 is designed to be resistant to outliers. Here we calculate how many loops are needed to exclude the outliers from a data set. Then we apply our theory to an example.

5.2 The number of loops needed to remove outliers

Suppose we are given a set of data points, a portion of which are outliers, and suppose the subsampling size is fixed. We want to determine the number of loops such that, with a prescribed confidence, at least one of the randomly selected subsets of the data contains no outliers.

The number of outliers and the number of "good" data points are determined by the outlier portion. For a random subset of the given size to contain no outliers, the probability is

(36)

When the total number of data points is large relative to the subsampling size, this probability is approximately

(37)

For a random subset of the given size to contain at least one outlier, the probability is

(38)

For all of the random subsets, one per loop, to each contain at least one outlier, the probability is

(39)

For at least one of the random subsets to contain no outliers, the probability is

(40)

Set this probability to be greater than or equal to the prescribed confidence level:

(41)

Then we have

(42)

See Figure 11 for the relationship between the subsampling size, the portion of outliers, and the minimum number of loops at a fixed confidence level.
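
For reference, a sketch of this count under the large-sample approximation described above; the symbols are our choices (outlier portion r, subsampling size s, confidence p), not notation taken from the paper.

```python
import math

def min_num_loops(outlier_portion, subsample_size, confidence=0.99):
    """Smallest number of loops such that, with probability >= confidence,
    some random subset of the given size contains no outliers, using the
    approximation P(subset is outlier-free) ~ (1 - r)^s."""
    p_clean = (1.0 - outlier_portion) ** subsample_size
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_clean))

# Example: 30% outliers, subsets of 20 points, 99% confidence
print(min_num_loops(0.3, 20))
```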

Figure 11: The relationship between the subsampling size, the portion of outliers, and the minimum number of loops in (42) at a fixed confidence level. The black curves are contours of fixed minimum number of loops.

5.3 Example: shallow water equations with outliers

Consider the following 2-D conservative form of shallow water equations:

(43)
(44)
(45)

on a rectangular spatial domain and a finite time interval, with reflective boundary conditions and a water drop initiating gravity waves, where the unknowns are the total fluid column height and the fluid's horizontal flow velocity averaged across the vertical column, and where the remaining constant is the gravitational acceleration. The first equation can be derived from mass conservation, the last two from momentum conservation. Here, we have made the assumption that the fluid density is constant.
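
For reference, the standard 2-D conservative form of the shallow water equations with constant density (a reconstruction in our own notation: column height $h$, depth-averaged velocity $(u, v)$, gravitational acceleration $g$) reads

\frac{\partial h}{\partial t} + \frac{\partial (hu)}{\partial x} + \frac{\partial (hv)}{\partial y} = 0, \qquad
\frac{\partial (hu)}{\partial t} + \frac{\partial}{\partial x}\!\left(hu^2 + \tfrac{1}{2}gh^2\right) + \frac{\partial (huv)}{\partial y} = 0, \qquad
\frac{\partial (hv)}{\partial t} + \frac{\partial (huv)}{\partial x} + \frac{\partial}{\partial y}\!\left(hv^2 + \tfrac{1}{2}gh^2\right) = 0.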

5.3.1 Data collection

We generate the numerical solution to the shallow water equations using the Lax-Wendroff finite difference method. See Figure 12. The data are collected at a single time frame, and the required partial derivatives are calculated by the three-point central-difference formula; the time derivatives use points from two adjacent time frames. Assume that only the central part of the data is made accessible, and that a portion of the accessible points have their observed values corrupted by independent and identically distributed random errors drawn from a uniform distribution. See Figure 12. Thus, the accessible data used to discover the model are:

(46)

a portion of which are outliers.


Figure 12: Surface plot displays height colored by momentum. (a) A water drop falls into the pool. (b) The gravity waves are traveling and being reflected by the boundary. (c) The water surface state when the data are collected. (d) The accessible data are corrupted by outliers.

5.3.2 Discovery of the model

We apply the dimensional analysis method introduced in zhang_robust_2018 to construct the basis-functions of the same dimension as the left-hand side of the first equation in order to discover it. Similarly, we can construct the basis-functions of the same dimension as the left-hand sides of the other two equations in order to discover them. The threshold is set at the same value for all algorithms. If sequential threshold least squares (Algorithm 1) is applied, we get the following result:

(47)
(48)
(49)

If lasso (Algorithm 2) is used, we have:

(50)
(51)
(52)

If TSBR (Algorithm 3) is used, we get:

(53)
(54)
(55)

In contrast, if SubTSBR (Algorithm 4) is applied to discover the model, we have:

(56)