Scientific discovery has attracted increasing attention over the past years due to the boosting interest in exploring the new world. Data-driven approaches (machine learning, etc.) recently became possible and popular owing to revolutionary advances in sensing technologies and computational powers. Among all methods investigated for discovering scientific laws, sparse regression [rudy2017data]
gains the most attention in extracting scientific laws (expressed as ODEs or PDEs) from measured/simulated data. Sparse regression methods inherently promote the model parsimony which is widely recognized as an intrinsic property of the governing model for a real physical system. Moreover, techniques including genetic algorithm[xu2020dlga]bassenne2019computational], and symbolic regression [vaddireddy2020feature, reinbold2021robust] have been adopted to help build/enrich the term library of the underlying model. Sparse regression methods, however, lack robustness due to issues including inaccurate numerical differentiation (especially with large noise) [meidani2021data], hard-thresholding [zhang2021robust], and incomplete library (even with library enriching techniques) [xu2020dlga]. Therefore, a robust method of scientific discovery is necessitated for discovering the underlying physics from measured data that are usually incomplete and/or imprecise.
Recently, fusing prior domain knowledge in solving computational physics problems has been widely recognized as potential and promising for improving the stability, generality, and robustness of the computational model. This potential has been demonstrated in physics informed machine learning [raissi2019physics, wang2017physics, zhang2021structural] and hybrid modeling [quaghebeur2021incorporating, rai2020driven]. The prior knowledge about the general physical principles has also been fused into discovering the governing PDE(s) when establishing the candidate model library [reinbold2021robust, meidani2021data]. In ref. [meidani2021data]
, a machine learning classification method was proposed to detect the presence of physical patterns (such as diffusion and convection) in the investigated system and subsequently identify the governing PDE(s). Features used for this classification problem are extracted from the spatial and temporal derivatives of data and their statistics and frequency domain characteristics, considering the behaviors of physical patterns potentially existing in the investigated system. A classifier is first trained using a library of candidate models and then tested on the target system. Results show that the robustness of derived features is very limited when the level of noise goes above 0.1%, despite the novelty of this scientific discovery scheme.
In this study, we use the Euler characteristics (ECs) as features for characterizing dynamical systems and will examine its robustness when large noise exists in the measured/simulated data. EC is a general topological descriptor that characterizes the geometrical features of complex datasets with reduced dimensionality and complexity while preserving the maximum information [SMITH2021107463]. Compared with other topological descriptors, EC has improved generality in characterizing a wide range of mathematical objects due to its fundamental connections with other descriptive tools such as statistics, field theory, and graph theory. Given a data object, its topology will be reduced to an EC curve by using the filtration transformation technique.
The general definition of the EC () for a 2D manifold is: , in which is the number of 0-dimensional topological bases (known as connected components), is the number of 1-dimensional topological bases (known as holes), and where is the set of nonnegative integers. The relationship between the number of topological bases and dimensions can be extended to shapes with higher dimensions. The EC of an -dimensional shape is defined as , in which is the Betti number and denotes the number of unique -dimensional topological bases for the given shape [adler2008some].
For a dynamical system, its states (or solutions, i.e., or ) lie in a high-dimensional field , in which is the spatial domain and is the temporal domain. As a result, an -dimensional manifold will be formulated with the solutions using the chart . Here , the dimension of the topological space, is larger than the dimension of by two due to the addition of the temporal coordinate and the function ; is a Euclidean space formed by ; the filed/function maps from to a scalar function of the system solutions. Taking the dissipative system characterized by the 1D Burgers equation ( is the diffusion coefficient) as an example, its solution lies in a 2D field and is embedded in a 3D manifold given by a 3D chart (i.e., ) (see Figure 2 (a)). The function maps from the spatiotemporal coordinates (i.e., and ) to the scalar solutions (i.e., ) of the equation. The chart is also often referred to as the graph of field . It should be noted that the graph does not live in a Euclidean space while does. As a result, the ECs and associated EC curve focus on the global topology of the graph during a filtration instead of the specific local connectivity information.
Filtration can be applied to manifolds by virtue of the concept of superlevel set [poincare1895analysis]. Given a manifold with field and domain , the superlevel set at a threshold is defined as: }. The super level set has an associated field graph with defined over . The field graph contains all points of the manifold that have a function value greater than or equal to (it is a filtration of the manifold). The filtration creates a nested set of field graphs which are obtained by defining a sequence of decreasing filtration values with associated field graphs: . Here, the field graphs are sparser with larger threshold values; we also have that (the original graph) if . For each superlevel set we obtain the field graph and we compute and record its EC value (e.g., we determine the number of connected components and number of holes). This information is used to construct an EC curve, which contains the pairs . It is important to highlight that the EC curve provides a topological descriptor for a general -dimensional field. The types of topological bases change with dimension.
Taking the dissipative system characterized by the 1D Burgers equation as an example, with the solutions of the 1D Burgers equation that are embedded in a 3D manifold, the threshold in filtration is a 2D plane that cuts through the 3D graph yielding connected components and holes (see Figure 2 (a)). When the plane passes through a local maximum we have that connected components are formed, when it passes a saddle point components are joined to form holes, and when a local minimum is passed holes are filled. This reveals that filtration captures the incidence of different types of critical points in the field (its topological features) and this information is summarized in the EC curve (see Figure 2 (b)). Moreover, we found that the EC curve is immune from noise contamination (see Figure 3), which has been mathematically proved in ref. [ghrist2008barcodes]. This advantageous property of ECs over other non-topological features will largely improve the robustness of physics discovery especially when the measured data contain a large level of noise. Therefore, this methodology we propose in this study will potentially solve the most critical challenge in data-driven physics discovery [rudy2017data, reinbold2021robust, zhang2021robust].
Given the measured data from an instrumented system, we formulate the manifold
on which the filtration will be performed. The resulting EC vector will be used as features for characterizing the system and distinguishing it from others with a different (though maybe similar) governing equation. First, we establish a library of candidate models either using the results of sparse regression methods or with our knowledge about the investigated system. Data (i.e.,) and features (i.e. ECs) will be simulated with extensive combinations of random initial and boundary conditions so that the final decision regarding physics discovery will be as robust as possible. Simulated data may contain a certain level of measurement noise. We subsequently train a classifier using the simulated data with the label set as the ID of the system model that a dataset belongs to. Taking the above dissipative system as an example, with a collected noisy dataset, the -PDE method (see details in Methods) yields two candidate models: (1) and (2) . Then we simulate a collection of systems for each candidate model by assigning random values of and
and specifying random initial conditions and boundary conditions. The outcome feature vectors from these two collections are labeled with the model IDs, i.e., 1 and 2, respectively. These simulated data are used to train a classifier. The Support Vector Machine (SVC) classifier is used in this study. The trained SVC model will be used to determine the governing model of a new system by feeding its output features into the classifier (see details of this methodology inMethods). To demonstrate the effectiveness of ECs in representing dynamical systems, the following scenarios are investigated in this study.
First, we extend the analysis in ref. [SMITH2021107463] for a reaction-diffusion system from the perspective of system identification and physics discovery.
Second, we use this machine learning classification scheme to improve the confidence level of the sparse regression results. As mentioned above, most sparse regression methods lack robustness and may yield a considerably different model when the hyperparameter settings change slightly. The classification outcome using the EC features can help evaluate the representativeness of each outputted model.
Third, we use the EC features to improve the accuracy of classification for identifying the 2D dynamical systems investigated in ref. [meidani2021data]. The robustness will be examined when the data contain large measurement noise.
When investigating a 2D system, the component of the system states is used for calculating the ECs, and we found that the identification results are not significantly affected when using the component for the EC calculations.
Charactering reaction and diffusion
In ref. [SMITH2021107463], ECs are used to characterize the 2D reaction-diffusion system ( and ) with different nonzero reaction and diffusion coefficients (i.e., and , respectively). In this study, we remove either the reaction or diffusion term from the governing equation such that we will have three candidate models in the library for the physics discovery problem: (1) and , (2) and , and (3) and . 100 examples are simulated for each model with random coefficient values and initial and boundary conditions (details about data generation can be found in Methods). From the reduced representations (see Figures 4 and 5), we found that the three systems can be well separated using the ECs. Both the SVD projection (see Figure 4) and the UMAP embeddings (see Figure 5) show that examples of model (2) with only the reaction term are more “distant” from that of the other two models that both have the diffusion term. We speculate from this phenomenon that this 2D system is more dominated by the diffusion term than the reaction term. The average testing accuracy using the trained classifier is 0.95, 0.93, 0.98, 0.92, 0.97, and 0.98 for the cases with 0%, 10%, 20%, 30%, 40%, 50% measurement noise, respectively. Therefore, we found in this analysis that the ECs can be used for identifying the 2D reaction-diffusion system in addition to distinguishing the system with different model parameter values (as shown in ref. [SMITH2021107463]).
Classification for confidence enhancement
In this section, we use the classification scheme to enhance the outcome of sparse regression methods of physics discovery. The 1D dissipative system characterized by the Burgers equation is used as an example. With the simulated data from this system, the -PDE method recommends two candidate models: (1) and (2) . Our previous study [zhang2021robust] shows that it could be considerably challenging to make a final decision between these two models if the initial and boundary conditions are not known a priori. In addition to the above two models from the -PDE method, we added a third model into the model library: (3) for better comparison. We found that the simulated examples from the three models can be well separated by using the ECs as characterizing features (see Figures 6 and 7). Moreover, the measurement noise (as much as 50%) does not undermine the representativeness of ECs in characterizing the system models. As expected, the average testing accuracy using the trained classifier is satisfactorily high (0.99, 0.99, 0.99, 1.00, 1.00, and 1.00 for cases with 0% to 50% noise, respectively). Comparing the results with previous studies in the literature, we can find that our methods outperform many existing methods of physics discovery including the popular PDE-FIND method [rudy2017data] and DLGA-PDE method [xu2020dlga], in terms of its robustness when large measurement noise exists in data.
Identification of 2D system in ref. [meidani2021data]
In this section, we use the EC features to characterize the six 2D spatiotemporal systems in ref. [meidani2021data] (see column 2 of Table 1 for the governing equations). It can be observed that many models share one or more terms in common, which significantly amplifies the challenge of data-driven system identification. As a result, these models are not as well separated as in the above studies (see Figures 8 and 9). It should be noted that both the SVD projection and the UMAP embeddings are reduced representations of the original data and may not fully reflect the characterizing capability of the ECs. The classification results in Table 1 show that we can accurately identify the system models using the ECs as features. Compared with the hand-crafted features in ref. [meidani2021data] (in which the average accuracy reduces from 98.7% to 86.4% when 0.5% noise is added to the clean data), the ECs are much more robust in cases with noisy data. In this section, we demonstrate that the EC-based classification method can be used for identifying a wide range of 2D dynamical systems.
|model class||model equation||precision||recall||f1-score|
|0% noise||50% noise||0% noise||50% noise||0% noise||50% noise|
As we have demonstrated in this study, the ECs can be used as representative features for characterizing a dynamical system and distinguishing it from other systems with different governing models. The robustness of the data-driven physics discovery using ECs has been verified with data containing a wide range of levels of measurement noise. Compared with previous sparse regression methods, the proposed method does not require implementing the fragile numerical differentiation, non-stable hyperparameter tuning, and hard-thresholding that requires empirical specification of the threshold values. Moreover, in this study, we provide an alternative (i.e., machine learning classification) to make confident choices among the candidate models provided by sparse regression. The adoption of EC features largely improves the robustness of the classification scheme proposed in ref. [meidani2021data] and makes this scheme more feasible for practical application. Additionally, the method we proposed in this study requires much less data than most existing methods of physics discovery. The calculation of topological features (such as ECs) does not require a dense grid for the solutions of the system (i.e. or ), unlike the numerical differentiation in sparse regression methods (such as ref. [rudy2017data]) for preparing the library of candidate terms or in classification methods (such as in ref. [meidani2021data]) for calculating the spatial/temporal derivative based features. This advantage can largely reduce the amount of data for discovering the underlying physics of a new dynamical system.
Although the results validate the potential of data-driven physics discovery using the EC-boosted classification approach, it requires establishing a complete (or overcomplete) model library to make the decision about the governing model. In case establishing such a library is not guaranteed, especially for a novel system for which little prior knowledge is available, a purely data-driven approach can be used to compensate for the discovered physics model. Hybrid system modeling integrates first-principle physics modeling with a data-driven approach (such as a black-box neural network model) for improved accuracy and robustness of system modeling. Our method can be used to enhance the confidence level of the physics modeling part so that the black-box modeling part has less burden of correcting the misrepresented physics.
Experimental system and data collection.
One 1D system and seven 2D systems (including the reaction-diffusion system in ref. [SMITH2021107463] and six 2D systems in ref. [meidani2021data]) are investigated in this study.The 1D dissipative system is characterized by the Burgers equation with being the diffusion coefficient. The range of its solution is specified as and . The time and spatial ranges are discretized into 99 and 255 uniform intervals, respectively. As a result, , , and the system state has a dimension of
. Random initial and boundary conditions are adopted to demonstrate the generality of the method. 0% to 50% white Gaussian noise is added to the synthetic data to examine the robustness of the method when a considerable level of noise exists. In this study, the noise level is quantified by the percentage of the standard deviation of the measured variable. For example, if 10% noise is added to the solutions, then the outcome is where generates white Gaussian noise of the specified dimension. The code for generating the solutions of this system can be found in the Data availability and Code availability sections. The data for the 2D reaction-diffusion system investigated in this study are the same with that used in ref. [SMITH2021107463] and can be found in Data availability. The data for the rest six 2D systems investigated in this study (listed in Table 1 are the same with that used in ref. [meidani2021data] and can be found in Data availability.)
The -PDE method proposed in the authors’ previous study [zhang2021robust]
is used to generate candidate models for the investigated 1D dissipative system. First, the collected data are split into training and validation sets and processed using a neural network model with the early-stopping strategy. This step is expected to significantly remove the measurement noise. Following the data preprocessing, the spatial and temporal derivatives are numerically calculated using the finite difference or polynomial interpolation method, and the library matrix will be formulated with the spatial derivatives. Subsequently, the fast Fourier transform (FFT) will be implemented to transform the sparse regression problem to the frequency domain and further reduce the influence of noise via frequency cutoff. Finally, the-algorithm yields several candidate models. More details of the -PDE method can be found in ref. [zhang2021robust].
Robust physics discovery via classification using the EC features
In this study, the machine learning classification method is applied to examine and subsequently improve the confidence in the learned model terms. With the spatiotemporal data collected from the investigated system, the collection of candidate models can be built whether using the sparse regression results (as in ref. [zhang2021robust]) or with the prior knowledge about the system (as in ref. [meidani2021data]
). Subsequently, the data are simulated for all the candidate models with random settings of model parameter values, initial and boundary conditions, and the noise level. Following this step, ECs are calculated for each dataset and labeled with the model ID that the dataset is generated from. Next, with the labeled data, an SVM classifier can be trained with the hyperparameter tuned using a grid search strategy. Finally, the measured data from the investigated system are fed into the trained classifier for testing, and the model yielding the highest classification score is determined as the most probable model for this dynamical system.
Taking the dissipative system characterized by the Burgers equation as an example, the -PDE method yields two candidate models: (1) and (2) , which constitute the collection of candidate models together with a third candidate model: (3) . Then the spatiotemporal data and corresponding ECs are simulated for the three models with different parameter combinations. Finally, the trained classifier yields model (1) when tested on the original dataset. Then we decide that the governing equation for this system is . The framework of this method is shown in Figure 10.
The data used for 1D Burgers equation and 2D reaction-diffusion equation can be found on GitHub at https://github.com/ymlasu/EC-PDE. The data used for 2D system identification in Table 1 can be found on Github at https://github.com/BaratiLab/PDE-Identification-Features referring to ref. [meidani2021data].
All python codes used for physics discovery in this study can be found on GitHub at https://github.com/ymlasu/EC-PDE. Any other requests should be made to the corresponding author.
The research reported in this paper was supported by funds from NASA University Leadership Initiative Program (Contract No. NNX17AJ86A, Project Officer: Dr. Anupa Bajwa, Principal Investigator: Dr. Yongming Liu). The support is gratefully acknowledged.
Author contributions statement
Z.Z. was responsible for the concept, research design, data analysis, and result interpretation. N.X. provided assistance in the manifold learning using the UMAP method. Y.L. supervised this research and was involved in the conceptualization and research design. All authors were involved in the manuscript writing and correction.
The authors declare no competing interests.