The search for physical laws has been a fundamental aim of science for centuries. Physical laws are critical to the understanding of natural phenomena and the prediction of future dynamics. They are either derived from other known physical laws or generalized from empirical observations of physical behavior. We focus on the second task, also called data-driven discovery of governing physical laws, which deals with the case where experimental data are given while the governing physical model is unclear. Traditional methods for discovering physical laws from data include interpolation and regression. Suppose an unknown physical law relates an output quantity to some input variables. Given observed data, traditional methods approximate its expression in terms of a class of functions of the input variables. This approach has two limitations:
If the data are collected from different experiments, traditional methods are not able to use all of the data together to discover the physical laws. For example, free falls from different initial heights follow the same physical law but have different trajectories. Traditional methods can only use the data on one trajectory to discover the path and predict future motion. However, using all the data can increase the accuracy.
Traditional methods are incapable of extrapolating to areas where no data are given.
To resolve these two limitations, we adopt the following strategy: first, discover the differential equations that the physical system satisfies; second, solve the differential equations analytically or numerically. For the free-fall example, we can use all the data from different trajectories together to discover the differential equation and extrapolate to any other given initial height. This discovery pattern is applicable to a larger class of models than traditional methods and derives the governing differential equations, which provide insights into the governing physical laws behind the observations zhang_robust_2018 . Many fundamental laws are formulated as differential equations, such as the Maxwell equations in classical electromagnetism, the Einstein field equations in general relativity, the Schrödinger equation in quantum mechanics, the Navier-Stokes equations in fluid dynamics, the Boltzmann equation in thermodynamics, the predator-prey equations in biology, and the Black-Scholes equation in economics. While automated techniques for generating and collecting data from scientific measurements are becoming increasingly precise and powerful, automated processes for extracting knowledge in analytic form from data remain limited schmidt_distilling_2009 . Our goal is to develop automated algorithms for extracting the governing differential equations from data.
Consider a differential equation of the form
with the unknown function to be discovered. Given data collected from a system governed by this differential equation, automated algorithms for deriving the expression of the unknown function have been studied from various approaches. One approach assumes that the unknown function is a linear combination of simple functions of the state variables and their derivatives. First, construct a moderately large set of basis-functions that may contain all the terms of the equation; then, apply algorithms to select from the basis-functions the subset that comprises exactly the terms of the equation and estimate the corresponding weights in the linear combination.
Suppose the basis-functions have been chosen. Then we need to estimate the weights in the following linear combination:
Given the data, the above problem becomes the following regression problem:
where is the model error. Let
The regression problem above may be written in vector form as follows:
Now the problem is to estimate the weight-vector given the known response vector and the known design matrix.
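To illustrate how such a design matrix arises, the hypothetical sketch below builds its columns from monomial basis-functions of a state variable `u` and its derivative `ux`; the variable names and the maximum degree are illustrative, not the paper's choices.

```python
import numpy as np

def build_theta(u, ux, degree=2):
    """Build the design matrix whose columns are monomials u^i * ux^j
    with i + j <= degree, evaluated at the given data points."""
    cols, names = [], []
    for i in range(degree + 1):
        for j in range(degree + 1 - i):
            cols.append(u**i * ux**j)
            names.append(f"u^{i} u_x^{j}")
    return np.column_stack(cols), names
```

The response vector is then the left-hand side of the differential equation evaluated at the same data points, and the regression seeks a sparse weight-vector over these columns.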
Since many physical systems have few terms in their equations, the set of basis-functions usually has many more terms than the true equation, which suggests the use of sparse methods to select the subset of basis-functions and estimate the weights. These sparse methods include sequential threshold least squares (also called sparse identification of nonlinear dynamics, SINDy) brunton_discovering_2016 ; rudy_data-driven_2017 , lasso (least absolute shrinkage and selection operator) tibshirani_regression_1996 ; schaeffer_learning_2017 , and threshold sparse Bayesian regression zhang_robust_2018 . Sequential threshold least squares performs least-squares regression and eliminates the terms with small weights iteratively (Algorithm 1); lasso solves the following optimization problem:
where the regularization parameter may be fitted by cross-validation (Algorithm 2); threshold sparse Bayesian regression calculates the posterior distribution of the weights given the data and then filters out small weights, iterating until convergence (Algorithm 3). A comparison of these three sparse methods is given in zhang_robust_2018 and shows that threshold sparse Bayesian regression is more accurate and robust than the other two.
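For concreteness, sequential threshold least squares can be sketched in a few lines. This is a generic rendition of the STLSQ idea (fit, zero out small weights, refit on the surviving columns), not the paper's exact implementation; the threshold value is illustrative.

```python
import numpy as np

def stlsq(Theta, y, threshold=0.05, max_iter=10):
    """Sequential threshold least squares: iteratively refit and
    eliminate basis-functions whose weights fall below `threshold`."""
    w = np.linalg.lstsq(Theta, y, rcond=None)[0]
    for _ in range(max_iter):
        small = np.abs(w) < threshold
        w[small] = 0.0
        big = ~small
        if not big.any():
            break
        # refit using only the columns that survived thresholding
        w[big] = np.linalg.lstsq(Theta[:, big], y, rcond=None)[0]
    return w
```

On well-conditioned data this recovers the support of a sparse weight-vector whenever the true weights are well above the threshold and the spurious ones fall below it.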
The same mechanism as above also applies to the discovery of general differential equations, including higher-order differential equations and implicit differential equations zhang_robust_2018 , besides the differential equations of the form (1). Nevertheless, the mechanism is described in the pattern (1) here for convenience and simplicity, so that more attention is given to the essence of the algorithm itself. In addition, to apply the algorithm to real-world problems, dimensional analysis can be incorporated in the construction of the basis-functions zhang_robust_2018 . Any physically meaningful equation has the same dimensions on every term, a property known as dimensional homogeneity. Therefore, when summing up terms in the equations, the addends should be of the same dimension.
Sparse regression methods for data-driven discovery of differential equations have been developed recently with a wide range of applications, for example, inferring biological networks mangan_inferring_2016 , sparse identification of a predator-prey system dam_sparse_2017 , model selection via integral terms schaeffer_sparse_2017 , extracting high-dimensional dynamics from limited data schaeffer_extracting_2017 , recovery of chaotic systems from highly corrupted data tran_exact_2017 , model selection for dynamical systems via information criteria mangan_model_2017 , model predictive control in the low-data limit kaiser_sparse_2017 , sparse learning of stochastic dynamical systems boninsegna_sparse_2018 , model selection for hybrid dynamical systems mangan_model_2018 , identification of parametric partial differential equations rudy_data-driven_2018 , extracting structured dynamical systems with very few samples schaeffer_extracting_2018 , constrained Galerkin regression loiseau_constrained_2018 , rapid model recovery quade_sparse_2018 , and convergence of the SINDy algorithm zhang_convergence_2018 . Moreover, other methods for data-driven discovery of differential equations have been proposed as well, for instance, deep neural networks rudy_deep_2018 ; raissi_physics_2017-1 ; raissi_physics_2017 and Gaussian processes raissi_machine_2017 . One advantage of the sparse regression methods is their ability to provide explicit formulas for the differential equations, from which further analysis of the systems may be performed, while deep neural networks usually provide “black boxes”, in which the mechanism of the system is not clearly revealed. Another advantage of the sparse regression methods is that they do not require much prior knowledge of the differential equations, while Gaussian process methods restrict the form of the differential equations and are used to estimate only a few parameters.
Previous developments and applications based on sparse regression methods mostly employ either sequential threshold least squares or lasso, or their variations. One of the reasons why data-driven discovery of differential equations has not yet been applied in industry is the instability of its methods. Previous methods require data of very high quality, which is usually not available in industry. Although threshold sparse Bayesian regression is more accurate and robust than the other two methods and provides error bars that quantify the uncertainties zhang_robust_2018 , its performance is still unsatisfactory if the provided data have large noise or contain outliers. Therefore, it is instructive to improve the threshold sparse Bayesian regression algorithm and apply it to the fields above, as this will improve the overall performance of the method in most cases. In this paper, we develop a subsampling-based technique for improving the threshold sparse Bayesian regression algorithm, so that the new algorithm is robust to large noise and outliers. Note that subsampling methods are usually employed for estimating statistics efron_jackknife_1981 or speeding up algorithms rudy_data-driven_2017 in the literature, but the subsampling method in this paper is used to improve the accuracy of the Bayesian learning algorithm. In practice, denoising techniques can be used to reduce part of the noise and outliers in the data before our algorithm is applied.
The remainder of this paper is structured as follows. In Section 2, we introduce the threshold sparse Bayesian regression algorithm. In Section 3, we detail our new subsampling-based algorithm. In Section 4, we investigate the robustness of our new algorithm through an example of discovering the predator-prey model with noisy data. In Section 5, we discuss how to use our new algorithm to remove outliers, with an example of discovering the shallow water equations using the data corrupted by outliers. In Section 6, we tackle the challenge of data integration through an example of discovering the heat diffusion model with random initial and boundary conditions. In Section 7, we tackle the challenge of extrapolation through an example of discovering the fish-harvesting model with bifurcations. Finally, the summary is given in Section 8.
2 Threshold sparse Bayesian regression
2.1 Bayesian hierarchical model setup
Let be a known vector, be a known matrix, be the weight-vector to be estimated sparsely, and be the model error:
We adopt a sparse Bayesian framework based on the relevance vector machine (RVM) tipping_sparse_2001 , which is motivated by automatic relevance determination domany_bayesian_1996 ; neal_bayesian_1996 , to estimate the weight-vector. The Bayesian framework assumes that the model errors are independent and identically distributed zero-mean Gaussian with a common variance. The variance may be specified beforehand, but in this paper it is fitted from the data. The model gives a multivariate Gaussian likelihood on the response vector:
Now we introduce a Gaussian prior over the weight-vector. The prior is governed by a set of hyper-parameters, one hyper-parameter associated with each component of the weight-vector:
where . The values of the hyper-parameters are estimated from the data. See Figure 1 for the graphical structure of this model.
The posterior over all unknown parameters given the data can be decomposed as follows:
As analytic computations cannot be performed in full, we approximate the hyper-parameter posterior by a Dirac delta function at the maximum likelihood estimate. We may use the Dirac delta function as an approximation on the basis that this point-estimate is representative of the posterior, in the sense that the integral calculation for the posterior using the point-estimate is roughly equal to the one obtained by sampling from the full posterior distribution tipping_sparse_2001 . This maximization is a type-II maximum likelihood and can be computed using a fast method tipping_fast_2003 . Now, we may integrate out the hyper-parameters and the noise variance to get the posterior over the weight-vector:
in which the posterior covariance and mean are:
Therefore the posterior for each weight can be deduced from (14):
Thus the mean posterior prediction of the weight-vector and the other quantities that we want to obtain are determined by the values of the estimated hyper-parameters. This is the beauty of the Bayesian approach: there is no need to determine a regularization parameter via expensive cross-validation, and moreover likelihood values and confidence intervals for the solution can be easily calculated schmolck_smooth_2007 . Another merit of the Bayesian approach is that it can work with limited data, since it incorporates prior information into the problem to supplement the limited data.
In practice the optimal values of many hyper-parameters in (13) are infinite tipping_sparse_2001 , and thus from (15)-(17) the posterior distributions of many components of the weight-vector are sharply peaked at zero. This leads to the sparsity of the resulting weight-vector. To further encourage accuracy and robustness, a threshold is placed on the model to filter out possible disturbances present in the weight-vector. The weight-vector is then reestimated using the remaining terms, iterating until convergence. The entire procedure is summarized in Algorithm 3. A discussion of how to choose the threshold and its impact on the solution is given in zhang_robust_2018 . As the threshold is a parameter representing the model complexity, we may use machine learning techniques such as cross-validation to determine it. Note that in Algorithm 3, the weight-vector becomes sparser after each loop by design. Thus the convergence of the algorithm is guaranteed given the convergence of the maximum likelihood calculation in (13).
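The posterior computation at the core of each iteration can be sketched with the standard RVM formulas from tipping_sparse_2001 : Sigma = (diag(alpha) + Phi^T Phi / sigma2)^(-1) and mu = Sigma Phi^T y / sigma2. The hyper-parameter optimization itself is omitted here; alpha and sigma2 are taken as given.

```python
import numpy as np

def rvm_posterior(Phi, y, alpha, sigma2):
    """Gaussian posterior over the weights given the design matrix Phi,
    response y, per-weight hyper-parameters alpha, and noise variance."""
    A = np.diag(alpha)
    Sigma = np.linalg.inv(A + Phi.T @ Phi / sigma2)   # posterior covariance
    mu = Sigma @ Phi.T @ y / sigma2                   # posterior mean
    return mu, Sigma
```

As a sanity check, when all alpha tend to zero the posterior mean reduces to the ordinary least-squares solution, and a large alpha pins the corresponding weight near zero.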
In this paper, we define an error bar value that quantifies the quality of the posterior estimated model as follows:
where each estimated variance is divided by the square of the corresponding estimated mean to normalize the variance of each weight. This definition adds up the normalized variances of all the weights present in the result, penalizing uncertain estimations. In other words, a smaller error bar value means smaller normalized variances and higher posterior confidence, and implies higher model quality. Given a set of candidate models for the data, the preferred model is the one with the minimum error bar value.
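A direct transcription of this description is shown below: the posterior variance of each retained weight, normalized by its squared posterior mean, is summed over the weights present in the result. The convention of treating zero-mean weights as dropped is an implementation assumption here.

```python
import numpy as np

def error_bar(mu, Sigma):
    """Sum of posterior variances normalized by squared posterior means,
    taken over the weights retained in the result (nonzero mean)."""
    keep = mu != 0
    return float(np.sum(np.diag(Sigma)[keep] / mu[keep] ** 2))
```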
3 Subsampling-based threshold sparse Bayesian regression
In the regression problem (3), when we have more data points than unknown weights, we may use a subset of the data to estimate the weights. We do this on the basis that data sets collected from real-world cases may contain outliers or a percentage of data points with large noise. Classical methods for parameter estimation, such as least squares, fit a model to all of the presented data and have no internal mechanism for detecting or discarding outliers. Other robust regression methods designed to overcome the limitations of traditional methods include iteratively reweighted least squares holland_robust_1977 ; chartrand_iteratively_2008 , random sample consensus fischler_random_1981 , least absolute deviations karst_linear_1958 , the Theil-Sen estimator hallet_rank-invariant_1992 ; sen_estimates_1968 , and repeated median regression siegel_robust_1982 ; rousseeuw_least_1984 , but they do not fit well into the framework of data-driven discovery of differential equations. Classical methods for parameter estimation are averaging methods based on the assumption (the smoothing assumption) that there will always be enough good values to smooth out any gross noise fischler_random_1981 . In practice the data may contain more noise than the good values can compensate for, breaking the smoothing assumption. An effective way to deal with this situation is to discard the “bad” data points (outliers or those with large noise) and use the remaining “good” data points for estimation. Based on this idea, we propose an algorithm called subsampling-based threshold sparse Bayesian regression (SubTSBR). This method filters out bad data points by means of random subsampling.
SubTSBR approaches the problem by randomly selecting data points to estimate the weights and using a measure (the error bar) to evaluate the estimations. When an estimation is “good” (small error bar), our algorithm identifies the corresponding selected data points as good data points and takes the estimation as the final result. To be specific, the algorithm is given a user-preset subsampling size and number of loops at the very beginning. For each loop, a subset of the data is randomly selected and used to estimate the weights in the following regression problem:
where are the basis-functions and is the model error. This regression problem can be symbolized into the following form:
along with an error bar calculated by (18). After repeating this procedure for the preset number of loops with different randomly selected data points, the differential equation with the smallest error bar among all the loops is chosen as the final result of the whole subsampling algorithm. Our algorithm has two user-preset parameters: the subsampling size and the number of loops. Their impact on the accuracy of the final result is discussed in Section 4 with an example. Note that the above mechanism is described in the pattern (1) for convenience and simplicity. It also applies to higher-order differential equations and implicit differential equations, as long as the differential equations can be symbolized into the form (20). The SubTSBR procedure is summarized in Algorithm 4, where the for-loop can be parallelized.
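The subsampling loop itself can be sketched as follows. Here `fit_tsbr` is an assumed callback standing in for Algorithm 3: it takes a subsampled design matrix and response and returns a weight-vector together with its error bar; any fitter with that signature works in this sketch.

```python
import numpy as np

def subtsbr(Theta, y, subsample_size, n_loops, fit_tsbr, rng=None):
    """Fit on random subsamples and keep the model whose error bar
    is smallest over all loops (the loop can be parallelized)."""
    rng = np.random.default_rng(rng)
    best_w, best_err = None, np.inf
    for _ in range(n_loops):
        idx = rng.choice(len(y), size=subsample_size, replace=False)
        w, err = fit_tsbr(Theta[idx], y[idx])
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err
```

Because each loop draws an independent subsample, a subsample that happens to avoid the bad data points produces a small error bar and wins the selection.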
3.3 Why it works
The numerical results in this paper show that our subsampling algorithm can improve the overall accuracy of the discovery of differential equations, and that the error bar (18) is capable of evaluating the estimations. The given data contain some data points with small noise and some with large noise. When a subset consisting only of data points with small noise is selected, our algorithm estimates the weights well and indicates that this is the case by producing a small error bar (18). As we do not know which data points have small noise and which have large noise before the model is discovered, we select subsets from the data randomly, repeating multiple times. When the selected data points happen to have small noise, we obtain a good estimation of the weights and at the same time recognize this case.
The numerical results also show that to attain optimal performance, the subsampling size should be moderate, neither too small nor too big, while the number of loops should be as big as possible when computing time permits. As the correctness of the governing differential equations is crucial, computing time can be a secondary consideration. In practice, we can increase the number of loops gradually and stop the algorithm when the smallest error bar among all the loops drops below a preset value or stops decreasing.
4 The challenge of large noise
4.1 Problem description
Noise in the data can prevent model-discovery algorithms from obtaining the correct results. Here we detail the mechanism of our algorithm, tune its parameters, and investigate its robustness against noise.
4.2 Example: the predator-prey model with noise
The predator-prey model is a pair of first-order nonlinear differential equations frequently used to describe the interaction between two species, one the predator and the other the prey. The populations change over time as follows:
where the two state variables are the numbers of prey and predator, respectively, and the parameters are positive real numbers describing the interaction of the two species. In this example, we fix the parameters as follows:
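Data of this kind can be generated by integrating the system numerically and adding Gaussian noise. The sketch below uses a classical fourth-order Runge-Kutta step; the parameter values, initial condition, step size, and noise level are illustrative stand-ins, not the values fixed in the paper.

```python
import numpy as np

def lotka_volterra(z, a=1.0, b=0.1, c=1.5, d=0.075):
    """Right-hand side of the predator-prey system (illustrative parameters)."""
    x, y = z
    return np.array([a * x - b * x * y, -c * y + d * x * y])

def rk4(f, z0, dt, n_steps):
    """Integrate z' = f(z) with the classical RK4 scheme."""
    traj = [np.asarray(z0, dtype=float)]
    for _ in range(n_steps):
        z = traj[-1]
        k1 = f(z)
        k2 = f(z + dt / 2 * k1)
        k3 = f(z + dt / 2 * k2)
        k4 = f(z + dt * k3)
        traj.append(z + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.array(traj)

traj = rk4(lotka_volterra, [10.0, 5.0], dt=0.01, n_steps=1000)
noisy = traj + 0.05 * np.random.default_rng(0).normal(size=traj.shape)
```

The noisy trajectory plays the role of the collected data; the discovery algorithms then try to recover the right-hand side from it.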
4.2.1 Data collection
4.2.2 Calculate numerical derivatives using total-variation derivative
Now we need to calculate the derivatives of the data to estimate the left-hand-side terms in (24) - (25). The noise in the data would be greatly amplified if the derivatives were calculated by direct numerical differentiation. See Figure 3 for the derivatives approximated using the gradient. Therefore, we use the total-variation derivative rudin_nonlinear_1992 ; chartrand_numerical_2011 instead. For a real function on an interval, the total-variation derivative computes the derivative as the minimizer of the functional:
where is the operator of antidifferentiation and is a regularization parameter that controls the balance between the two terms. The numerical implementation is introduced in chartrand_numerical_2011 . See Figure 3 for the derivatives approximated using the total-variation derivative.
As we can see in Figure 3, a robust differentiation method is critical for obtaining high-quality derivatives. Besides the total-variation derivative, many other methods are available for robust differentiation. Another approach is to use denoising techniques to reduce the noise in the data before taking derivatives. For example, a neural network is used to denoise data and approximate derivatives in lagergren_learning_2019 . Those methods may perform better in practical use, depending on the situation. Here we do not use denoising techniques, for the sake of demonstrating the robustness of our algorithm against noise. In practice, denoised data may still contain noise; our algorithm can be applied after the denoising processes and may achieve good results.
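The functional above can be minimized numerically. The sketch below is a simplified stand-in for the implementation in chartrand_numerical_2011 : it smooths the absolute value with a small eps, uses rectangle-rule antidifferentiation for the data-fidelity operator, and runs plain gradient descent; all parameter values are illustrative.

```python
import numpy as np

def tv_derivative(f, dt, alpha=1e-3, eps=1e-6, iters=3000, step=0.2):
    """Estimate u ~ f' by minimizing
        0.5 * ||A u - (f - f[0])||^2 + alpha * sum sqrt((D u)^2 + eps),
    where A is rectangle-rule antidifferentiation and D a forward difference."""
    n = len(f)
    g = f - f[0]
    A = dt * np.tril(np.ones((n, n)))   # cumulative antidifferentiation
    u = np.gradient(f, dt)              # naive derivative as initial guess
    for _ in range(iters):
        r = A @ u - g                   # data-fidelity residual
        du = np.diff(u)
        w = du / np.sqrt(du**2 + eps)   # subgradient of smoothed |.|
        grad_tv = np.zeros(n)
        grad_tv[:-1] -= w
        grad_tv[1:] += w
        u = u - step * (A.T @ r + alpha * grad_tv)
    return u
```

The regularization term rewards piecewise-constant derivatives, which is why the result resists the noise amplification seen with plain finite differences.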
4.2.3 Discover the model using different sparse algorithms
If lasso (Algorithm 2) is used, we have:
If TSBR (Algorithm 3) is used, we get:
with error bars and , where the numbers in front of each term read as “mean (standard deviation)” of the corresponding weights. Note that sequential threshold least squares, lasso, and TSBR each use all data points at the same time to discover the model.
In contrast, if SubTSBR (Algorithm 4) is applied to discover the model, we have:
with error bars and . The approximated weights in (33) - (34) are slightly smaller in absolute value than the true weights in (24) - (25) because, when we calculate the numerical derivatives, the total-variation derivative (26) smooths out some variation in the derivatives. This defect does not have much impact on the result and can be addressed by using different methods to calculate the derivatives or by collecting the derivatives along with the data directly. Here the subsampling size is and the number of loops is . All methods in this example have the threshold set at and use monomials up to degree three as basis-functions ( terms in total). The result of SubTSBR (33) - (34) approximates the true system (24) - (25) significantly better than the other sparse algorithms. Although the data contain a considerable amount of noise, SubTSBR successfully finds the exact terms in the true system and accurately estimates the parameters. See Figure 4 for the final selected data points in SubTSBR. See Figure 5 for the dynamics calculated by the systems derived from each algorithm. Note that since the data are collected from to , Figure 5 shows both the approximation and the prediction.
4.2.4 Basis-selection success rate vs subsampling size and the number of loops
In SubTSBR (Algorithm 4), we have two parameters to set: the subsampling size and the number of loops. Here we first investigate the impact of different subsampling sizes on the basis-selection success rate. In Figure 6, each curve is drawn by fixing the number of loops. Then, for each subsampling size, SubTSBR is applied to the data set collected above. The method is run repeatedly for each fixed number of loops and subsampling size, and the percentage of runs that successfully identify the exact terms in the system (24) - (25) is calculated and plotted. In the discovery process, the most difficult part is to identify the exact terms in the system; once the exact terms are identified, it is usually easy to estimate the weights.
Figure 6 shows that for each fixed number of loops, the basis-selection success rate goes up and then down as the subsampling size increases. When the subsampling size equals the total number of data points, all the data points are used and SubTSBR is equivalent to TSBR (Algorithm 3); in this case the true terms cannot be identified. In addition, for each chosen number of loops there is an optimal subsampling size, and the optimal subsampling size increases with the number of loops.
Now we investigate the impact of different numbers of loops on the basis-selection success rate. In Figure 6, each curve is drawn by fixing the subsampling size. Then, for each number of loops, SubTSBR is applied to the data set to discover the model. The discovery is repeated for each fixed subsampling size and number of loops, and the percentage of runs that successfully identify the exact terms in the system (24) - (25) is calculated and plotted.
Figure 6 shows that for each fixed subsampling size, the basis-selection success rate keeps going up as the number of loops increases. In addition, the larger the subsampling size, the more loops are needed for the basis-selection success rate to reach a certain level. This is because our data set is polluted by Gaussian noise and naturally contains some data points with large noise and some with small noise. As the subsampling size gets bigger, each random subset of the data is less likely to exclude the data points with large noise. When more loops are used, the likelihood that at least one loop excludes the data points with large noise increases. As long as one loop excludes the data points with large noise, that loop may successfully select the true basis-functions and have the smallest error bar; in that case, the final result comes from that loop and SubTSBR selects the true basis-functions successfully in that run. This explains why the curves with smaller subsampling sizes go up faster.
On the other hand, when more data points are used, the noise inside the subsample is smoothed out more easily in the regression (20). If all the data points included in the subsample have small noise, then the result from a larger subsampling size may be better. This explains why the saturated basis-selection success rate is higher for larger subsampling sizes within a certain range. In conclusion, there is a tradeoff for larger subsampling sizes: it is more difficult to include only data points with small noise, while it is easier to smooth out the noise inside the subsample.
4.2.5 Adjusted error bar and auto-fitting of subsampling size
In real-world applications, the situation is usually more complicated and the problem of setting the best subsampling size is subtle. Since the true equations are unknown, the basis-selection success rate cannot be calculated, so drawing a curve like those in Figure 6 to find the best subsampling size is not possible. Here we define an adjusted error bar as an indicator of the quality of the approximated model and fit the subsampling size automatically.
The error bar defined in (18) depends on the number of data points used and on the quality of the approximated model. When the subsampling size is within a reasonable range, the quality of the approximated model does not differ much, so the error bar is dominated by the number of data points (see Figure 7). By the formula (18), the error bar is negatively correlated with the number of data points. If we want to compare results across different subsampling sizes, we need to adjust the error bar so that it is independent of the subsampling size. Inspired by the rate of convergence of the Monte Carlo method, we give the following empirical formula:
In Figure 8, we can see that the adjusted error bar depends almost only on the quality of the approximated model. Note that we have the best models when the subsampling size is between and (see Figure 6). Meanwhile, the percentile curves in Figure 8 indicate that the optimal subsampling sizes are in roughly that range. This observation confirms that the adjusted error bar depends on the quality of the discovered model: the better the model, the smaller the adjusted error bar.
Now, we do not have to set the subsampling size at the beginning; instead, we may try different subsampling sizes to discover the model and select the best result from all the results with different subsampling sizes. (In Figure 7 and Figure 8, for each fixed number of loops and subsampling size, we run SubTSBR repeatedly and discover models with their error bars or adjusted error bars. The 25th, 50th, and 75th percentiles of these error bars or adjusted error bars are then plotted.)
4.2.6 A new data set with larger noise
Here we use a new data set with white noise to discover the predator-prey model (24) - (25). All other settings remain the same. Corresponding to Figure 2, the noisy data are presented in Figure 9 and Figure 9. Corresponding to Figure 6, the basis-selection success rates are presented in Figure 9 and Figure 9. With larger noise, the chance of successfully picking out the true terms from the basis-functions is lower, so more loops are needed in this case. Figure 9 only demonstrates results with numbers of loops up to , but many more loops may be used in practical problems. As for computation time, it takes about seconds to discover the dynamical system in this example with subsampling size and number of loops on one core of an Intel i7-6700HQ CPU (coded in MATLAB 2018a).
4.2.7 Another new data set with smaller noise
Now we use another new data set with white noise to discover the predator-prey model (24) - (25). All other settings remain the same. The numerical results are presented in Figure 10. In this case, we have a large portion of data points with small noise to use, so a high basis-selection success rate can be reached with only a small number of loops. Also, the noise inside the subsamples is small, so we do not need a large subsampling size to smooth it out. The basis-selection success rate is very high even with a small subsampling size.
5 The challenge of outliers
5.1 Problem description
Outliers in the data can cause serious problems in the estimations and should be removed. The subsampling procedure in Algorithm 4 is designed to be resistant to outliers. Here we calculate how many loops are needed to exclude the outliers from a data set, and then apply the theory to an example.
5.2 The number of loops needed to remove outliers
Suppose we are given data points, a portion of which are outliers, and the subsampling size is fixed. We wish to determine the number of loops such that, with a given confidence, at least one of the randomly selected subsets of the data contains no outlier.
The number of outliers and the number of “good” data points are determined by the outlier portion. The probability that a random subset of the given size contains no outlier is
When the data set is large, this probability is approximately
The probability that a random subset contains at least one outlier is
The probability that all of the random subsets each contain at least one outlier is
The probability that at least one of the random subsets contains no outlier is
Setting this probability greater than or equal to the desired confidence level:
we then have
See Figure 11 for the relationship between the subsampling size, the portion of outliers, and the minimum number of loops at a fixed confidence level.
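The bound above is easy to evaluate numerically. The sketch below uses the large-data approximation from the derivation, treating the subsample draws as independent, so the probability that one subsample of size s is outlier-free with outlier portion r is about (1-r)**s; b loops then suffice when 1 - (1 - (1-r)**s)**b >= c.

```python
import math

def min_loops(r, s, c=0.99):
    """Minimum number of loops so that, with confidence c, at least one
    random subsample of size s contains no outlier (outlier portion r)."""
    p_clean = (1.0 - r) ** s            # one subsample is outlier-free
    return math.ceil(math.log(1.0 - c) / math.log(1.0 - p_clean))
```

For example, with 20% outliers and subsampling size 10, the formula gives 41 loops at 99% confidence; larger subsamples require rapidly more loops.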
5.3 Example: shallow water equations with outliers
Consider the following 2-D conservative form of the shallow water equations:
on and , with reflective boundary conditions and a water drop initiating gravity waves, where is the total fluid column height, is the fluid’s horizontal flow velocity averaged across the vertical column, and is the gravitational acceleration. The first equation can be derived from mass conservation and the last two from momentum conservation. Here we have made the assumption that the fluid density is constant.
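For reference, the standard textbook statement of this conservative form, writing $h$ for the total fluid column height, $(u,v)$ for the depth-averaged horizontal velocity, and $g$ for the gravitational acceleration, reads:

```latex
\begin{aligned}
\frac{\partial h}{\partial t}
  + \frac{\partial (hu)}{\partial x}
  + \frac{\partial (hv)}{\partial y} &= 0,\\
\frac{\partial (hu)}{\partial t}
  + \frac{\partial}{\partial x}\!\left(hu^2 + \tfrac{1}{2}gh^2\right)
  + \frac{\partial (huv)}{\partial y} &= 0,\\
\frac{\partial (hv)}{\partial t}
  + \frac{\partial (huv)}{\partial x}
  + \frac{\partial}{\partial y}\!\left(hv^2 + \tfrac{1}{2}gh^2\right) &= 0.
\end{aligned}
```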
5.3.1 Data collection
We generate the numerical solution to the shallow water equations using the Lax-Wendroff finite difference method with and . See Figure 12. The data are collected at and the partial derivatives , , , , , and are calculated by the three-point central-difference formula. The calculation of the partial derivatives , , and uses the points from two adjacent time frames. Assume that only the central part of the data is made accessible and that a portion of the values of , , and are corrupted by independent and identically distributed random errors drawn from a uniform distribution. There are accessible data points. See Figure 12. Thus, the accessible data to discover the model are:
of which are outliers.
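The three-point central-difference approximation used for the partial derivatives can be sketched as follows; this is a generic implementation, and the grid layout and names are illustrative.

```python
import numpy as np

def central_diff(F, dx, axis=0):
    """Approximate the partial derivative of a gridded field F along `axis`
    by (F[i+1] - F[i-1]) / (2 dx); only interior points are returned."""
    Fp = np.roll(F, -1, axis=axis)
    Fm = np.roll(F, 1, axis=axis)
    d = (Fp - Fm) / (2.0 * dx)
    return np.take(d, range(1, F.shape[axis] - 1), axis=axis)
```

The same stencil applied along the time axis, using two adjacent time frames around each sample, yields the time derivatives.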
5.3.2 Discovery of the model
We apply the dimensional analysis method introduced in zhang_robust_2018 to construct the basis-functions of the same dimension as () to discover it: , , , , , , , , , , , . Similarly, we can construct the basis-functions of the same dimension as and () to discover them: , , , , , , , , , , , , , . The threshold for all algorithms is set at . If sequential threshold least squares (Algorithm 1) is applied, we get the following result:
If lasso (Algorithm 2) is used, we have:
If TSBR (Algorithm 3) is used, we get:
In contrast, if SubTSBR (Algorithm 4) is applied to discover the model, we have: