1 Introduction
Discovering relationships between observable features and responses is a regression problem in supervised learning. We formulate the response variable as a deterministic expression of the features (e.g., linear, polynomial, or logarithmic), together with a random error that has zero mean and is independent of the features [8]. The physical dynamics is usually governed by an ordinary or partial differential equation. With time-series observations, Brunton et al. [1] and Rudy et al. [17] devised a method to find the governing equations by a sparse regression algorithm. To determine a governing equation, we face two significant issues: selecting appropriate measurable features as a basis, and sparsifying the coefficients of the features [25]. There is no universal basis that performs well in all cases. However, based on the theories of Taylor expansion and trigonometric expansion, polynomial and trigonometric terms are naturally selected as features [1, 17]. The remaining sparse regression is aimed at making some of the coefficients zero, in order to eliminate the uncorrelated features and prevent the overfitting caused by minimizing a least-squares loss. The $\ell_0$ norm, which counts the number of nonzero coefficients, is naturally added to the loss function as a penalty. However, it makes the loss function nonconvex, and the optimization problem becomes intractable to solve [14]. Consequently, the $\ell_0$ penalty is then relaxed to the less shrinking but solvable $\ell_2$ (Ridge) regression or to the $\ell_1$ norm (Lasso) [4, 20]. Comparatively, $\ell_1$ is sparser than $\ell_2$. In order to approximate the sparsity of the $\ell_0$ norm more closely, properties and iterative algorithms of the nonconvex $\ell_{1/2}$ norm have also been studied [5, 23, 26].
The observations are unavoidably contaminated by external noise, no matter whether the data are response variables [23] or time-series trajectories of processes [1, 17]. Facing this undesirable noise, one way [1, 17] is to filter it: employ a total variation regularization to denoise the derivatives, and then construct a data matrix. However, since stronger noise generally results in greater deviation of the sparse solution, a better penalty may be employed, or a logical sparse regression framework may be proposed, for example Sparse Identification of Nonlinear Dynamics (SINDy) [1, 17], which repeatedly vanishes the estimators with absolute values smaller than a threshold and re-estimates the nonzero coefficients until a stable solution is reached. The threshold is selected by cross-validation as the value minimizing the loss function.
Without denoising, we here attempt to extract a governing stochastic differential equation, with its drift term and diffusion coefficient, from a time-series observation of its most probable transition trajectory. Here, the additive diffusion term characterizes multiple unmeasurable external environmental influences. Practically, the evolutionary history of transition pathways, or the most probable transition pathways, may be obtained in different fields: for example, the metabolic pathways on a prokaryotic genome [10], and the primary absorbance and fluorescence signals for gene expression [19]. Technically, one-dimensional diffusion has also been predicted by extension (time) domain averages, which match the dominant path shape for transition pathways [9]. The most probable transition trajectory, as a substitute for all the pathways in the state space, has been used to approximate the transition probability density function [16]. The most probable pathways between different brain regions have been simulated by a probabilistic fiber tractography algorithm, called the ConTrack algorithm, from measurable pathways [18]. Theoretically, a statistic of a process governed by a stochastic differential equation satisfies a deterministic differential equation [22]. The most probable transition trajectory, in a finite time interval with fixed boundary points, of a diffusion process (e.g., a solution of a stochastic differential equation) can be characterized by a second-order ordinary differential equation (ODE) with boundary conditions [7]. This is attained by expressing the transition probability of the diffusion process as an integral, whose integrand is called the Onsager-Machlup (OM) function. A variational principle then reveals that the OM function satisfies the Euler-Lagrange equation, which furthermore leads to the desired second-order ODE. The corresponding boundary value problem is numerically solved by a shooting method [11, 12]. The OM function and the most probable transition pathway for a jump-diffusion process have been studied recently by Chao and Duan [3].
In this paper, we first present the background of the OM function for a diffusion system, and the Euler-Lagrange equation for the most probable transition trajectory, i.e., the second-order ODE with boundary conditions, in section 2. Then in section 3.1, we devise a method to determine the drift and diffusion coefficient of a governing stochastic differential equation from observation data on the most probable transition trajectory. To this end, we need a function basis for the drift, and we pick a polynomial basis up to fifth order (or any finite order), together with an appropriate cost with penalty. This optimization problem is set up in section 3.2.
We devise an algorithm to recursively estimate the coefficients in the diffusion model with the help of the convex optimization package cvx in Matlab, and then sparsify the estimators by vanishing the ones with absolute value smaller than a threshold, in section 3.3. In section 4, we make some applications and present the results.
2 The Onsager-Machlup function and the most probable transition trajectory
We start from a scalar stochastic differential equation (SDE) [6] as the model for the governing stochastic law
(2.1) $dX_t = f(X_t)\,dt + \varepsilon\,dB_t$,
where $B_t$ is a Brownian motion defined on a sample space $\Omega$ with probability measure $\mathbb{P}$, and $\varepsilon > 0$ is the noise intensity (i.e., diffusion coefficient). The random sample trajectories satisfy the two-point boundary conditions
$X_0 = x_0, \qquad X_T = x_T,$
with a finite transition time $T$.
The most probable transition trajectory, denoted by $z(t)$, of this stochastic differential equation, from state $x_0$ to state $x_T$ within a time period $T$, is the pathway which extremizes the Onsager-Machlup (OM) action functional $\int_0^T OM(\dot z, z)\,dt$. Here the integrand (like a Lagrangian function in classical mechanics) is known as the OM function [7, 15, 21]:
(2.2) $OM(\dot z, z) = \Big(\dfrac{\dot z - f(z)}{\varepsilon}\Big)^2 + f'(z).$
We recall the notations $\dot z = \frac{dz}{dt}$ and $\ddot z = \frac{d^2 z}{dt^2}$. By the variational principle, we obtain the Euler-Lagrange equation [7] to be satisfied by the most probable transition trajectory $z$:
(2.3) $\dfrac{d}{dt} \dfrac{\partial}{\partial \dot z} OM(\dot z, z) = \dfrac{\partial}{\partial z} OM(\dot z, z).$
Thus, we get a second-order ODE for the most probable transition trajectory $z$, with two-point boundary conditions:
(2.4) $\ddot z = f(z) f'(z) + \dfrac{\varepsilon^2}{2} f''(z), \qquad z(0) = x_0, \quad z(T) = x_T.$
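As a quick numerical illustration (not the paper's code), the boundary value problem (2.4) can be solved in a few lines. The paper uses a shooting method; this sketch uses SciPy's collocation solver `solve_bvp` instead, which serves the same purpose, with an assumed double-well drift $f(x) = x - x^3$ and assumed values of $\varepsilon$, $x_0$, $x_T$ and $T$:

```python
import numpy as np
from scipy.integrate import solve_bvp

# Illustrative double-well drift and its derivatives (assumed for this sketch)
f   = lambda x: x - x**3
fp  = lambda x: 1 - 3 * x**2
fpp = lambda x: -6 * x
eps, x0, xT, T = 0.2, -1.0, 1.0, 2.0   # assumed parameter values

def rhs(t, y):
    # y[0] = z, y[1] = z';  equation (2.4): z'' = f(z) f'(z) + (eps^2/2) f''(z)
    return np.vstack([y[1], f(y[0]) * fp(y[0]) + 0.5 * eps**2 * fpp(y[0])])

def bc(ya, yb):
    # two-point boundary conditions z(0) = x0, z(T) = xT
    return np.array([ya[0] - x0, yb[0] - xT])

t = np.linspace(0.0, T, 101)
y_guess = np.vstack([np.linspace(x0, xT, t.size),
                     np.full(t.size, (xT - x0) / T)])  # straight-line guess
sol = solve_bvp(rhs, bc, t, y_guess)
z = sol.sol(t)[0]   # the most probable transition trajectory on [0, T]
```

Sampling `z` on a uniform grid then gives exactly the kind of simulated observation used in section 3.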
As discussed in the previous section, the most probable transition trajectory is sometimes observable. With the observed most probable transition trajectory, we will extract the underlying governing stochastic model in the next section. That is, we will determine the drift $f$ and the diffusion coefficient $\varepsilon$.
3 Sparse identification of the governing stochastic differential equation
3.1 The equation for the most probable transition trajectory
Based on time-series data collected with time step $h$, that is, $z_i = z(t_i)$ with $t_i = ih$, we want to identify the drift term $f$ and the diffusion coefficient $\varepsilon$.
We construct a basis library consisting of polynomials of the observational feature, i.e., $\{1, x, x^2, \ldots, x^n\}$. To illustrate our method, we take $n = 5$. The drift term can then be expressed as
(3.1) $f(x) = \lambda_0 + \lambda_1 x + \lambda_2 x^2 + \lambda_3 x^3 + \lambda_4 x^4 + \lambda_5 x^5.$
Putting this expression into (2.4), we then have the second-order ODE
(3.2) $\ddot z = c_0 + c_1 z + c_2 z^2 + \cdots + c_9 z^9,$
with coefficients
$c_0 = \lambda_0\lambda_1 + \varepsilon^2\lambda_2,$
$c_1 = \lambda_1^2 + 2\lambda_0\lambda_2 + 3\varepsilon^2\lambda_3,$
$c_2 = 3\lambda_1\lambda_2 + 3\lambda_0\lambda_3 + 6\varepsilon^2\lambda_4,$
$c_3 = 2\lambda_2^2 + 4\lambda_1\lambda_3 + 4\lambda_0\lambda_4 + 10\varepsilon^2\lambda_5,$
$c_4 = 5\lambda_2\lambda_3 + 5\lambda_1\lambda_4 + 5\lambda_0\lambda_5,$
$c_5 = 3\lambda_3^2 + 6\lambda_2\lambda_4 + 6\lambda_1\lambda_5,$
$c_6 = 7\lambda_3\lambda_4 + 7\lambda_2\lambda_5,$
$c_7 = 4\lambda_4^2 + 8\lambda_3\lambda_5,$
$c_8 = 9\lambda_4\lambda_5,$
$c_9 = 5\lambda_5^2.$
Denote the vectors $\boldsymbol\lambda = (\lambda_0, \lambda_1, \ldots, \lambda_5)^\top$ and $\mathbf c = (c_0, c_1, \ldots, c_9)^\top$, where $\top$ means transposition, and denote the right-hand sides of the above equalities as $F_k(\boldsymbol\lambda, \varepsilon)$, $k = 0, 1, \ldots, 9$. Additionally, denote $F = (F_0, F_1, \ldots, F_9)^\top$. Then the coefficient vector $\boldsymbol\lambda$ in (3.1) is nonlinearly related to the coefficients $\mathbf c$ in (3.2), and we indicate this fact by
(3.3) $\mathbf c = F(\boldsymbol\lambda, \varepsilon).$
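The map $F$ in (3.3) is just polynomial arithmetic: $\mathbf c$ collects the coefficients of $f f' + \frac{\varepsilon^2}{2} f''$. A small sketch (illustrative only; the function name is our own) computes it by polynomial multiplication:

```python
import numpy as np
from numpy.polynomial import polynomial as P

def drift_to_ode_coeffs(lam, eps):
    """Map drift coefficients lam (ascending powers) and noise intensity eps
    to the coefficients c of the Euler-Lagrange ODE z'' = f f' + (eps^2/2) f''."""
    lam = np.asarray(lam, dtype=float)
    c = P.polymul(lam, P.polyder(lam))              # coefficients of f * f'
    c = np.pad(c, (0, 2 * len(lam) - 2 - len(c)))   # pad to full degree 2n - 1
    fpp = (eps**2 / 2.0) * P.polyder(lam, 2)        # (eps^2/2) f''
    c[:len(fpp)] += fpp
    return c
```

For instance, for $\boldsymbol\lambda = (0, 1, 0, -1, 0, 0)^\top$ and $\varepsilon = 0.2$ this returns $c_1 = 0.88$, $c_3 = -4$, $c_5 = 3$ and zeros elsewhere, matching the coefficient formulas above.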
3.2 The optimization problem
We use simulated data for the most probable transition trajectory $z(t)$. That is, we collect a time history of observations with time step $h$, and denote $z_i = z(t_i)$, with $t_i = ih$ for $i = 0, 1, \ldots, N$, and $Nh = T$. Taking the center difference to approximate $\ddot z$ to second-order accuracy, we get the series $\ddot z_i = (z_{i+1} - 2z_i + z_{i-1})/h^2$, for $i = 1, \ldots, N-1$. Each pair of data $(z_i, \ddot z_i)$ satisfies the ODE (3.2), that is,
(3.4) $\ddot z_i = c_0 + c_1 z_i + c_2 z_i^2 + \cdots + c_9 z_i^9, \qquad i = 1, \ldots, N-1.$
Define
(3.5) $\ddot{\mathbf z} = (\ddot z_1, \ldots, \ddot z_{N-1})^\top, \qquad \Theta(\mathbf z) = \begin{pmatrix} 1 & z_1 & z_1^2 & \cdots & z_1^9 \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & z_{N-1} & z_{N-1}^2 & \cdots & z_{N-1}^9 \end{pmatrix}.$
The matrix formulation for the 'unknown' coefficient vector $\mathbf c$ is
(3.6) $\ddot{\mathbf z} = \Theta(\mathbf z)\, \mathbf c.$
This is an overdetermined system consisting of $N-1$ equalities and 10 unknown coefficients.
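Assembling (3.4)-(3.6) from a sampled trajectory is mechanical; the following sketch (function name and sanity-check data are our own assumptions) builds the center-difference vector and the monomial library:

```python
import numpy as np

def build_system(z, h, degree=9):
    """Assemble the linear system zdd = Theta(z) c of (3.6): center-difference
    second derivatives on interior points, and monomial columns 1, z, ..., z^degree."""
    z = np.asarray(z, dtype=float)
    zdd = (z[2:] - 2.0 * z[1:-1] + z[:-2]) / h**2          # (3.4), i = 1..N-1
    Theta = np.vander(z[1:-1], degree + 1, increasing=True)
    return Theta, zdd

# sanity check on a parabola z(t) = t^2, whose second derivative is exactly 2
h = 0.02
t = np.arange(0.0, 1.0 + h / 2, h)
Theta, zdd = build_system(t**2, h)
c0, *_ = np.linalg.lstsq(Theta, zdd, rcond=None)   # initial least-squares fit
```

Since the synthetic system is consistent, the least-squares fit recovers the constant term $c_0 = 2$ and (numerically) zero for all higher-order coefficients.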
To determine the SDE system, we need to distill the sparse coefficients $\boldsymbol\lambda$ and determine the diffusion coefficient $\varepsilon$. Starting from the most probable trajectory satisfying (2.4), we initially estimate the coefficients $\mathbf c$ in (3.6), and then work out $\boldsymbol\lambda$ and $\varepsilon$ from the relationship (3.3).
Finally, the sparse problem is
(3.7) $\min_{\mathbf c,\, \boldsymbol\lambda,\, \varepsilon} \ \|\ddot{\mathbf z} - \Theta(\mathbf z)\,\mathbf c\|_2^2 + \mu \|\mathbf c\|_0, \quad \text{subject to } \mathbf c = F(\boldsymbol\lambda, \varepsilon).$
For $\mathbf x = (x_1, \ldots, x_n)^\top$, we denote the $\ell_2$ norm $\|\mathbf x\|_2 = \big(\sum_{i=1}^n x_i^2\big)^{1/2}$, the $\ell_1$ norm $\|\mathbf x\|_1 = \sum_{i=1}^n |x_i|$, and the $\ell_0$ norm $\|\mathbf x\|_0$ as the number of nonzero components $x_i$. The sparse problem can be transformed into the following regularized least-squares problem with an $\ell_0$ penalty [24]:
(3.8) $\min_{\mathbf c,\, \boldsymbol\lambda,\, \varepsilon} \ w_1 \|\ddot{\mathbf z} - \Theta(\mathbf z)\,\mathbf c\|_2^2 + w_2 \|\mathbf c - F(\boldsymbol\lambda, \varepsilon)\|_2^2 + \mu \|\mathbf c\|_0,$
with weights $w_1, w_2 > 0$ and small $\mu > 0$. The first two terms in this optimization problem are a weighted sum of the inconsistencies of the two constraints (3.6) and (3.3). To make the optimization problem solvable, we relax $\|\mathbf c\|_0$ in (3.8) to $\|\mathbf c\|_1$, and take
(3.9) $L(\mathbf c, \boldsymbol\lambda, \varepsilon) = w_1 \|\ddot{\mathbf z} - \Theta(\mathbf z)\,\mathbf c\|_2^2 + w_2 \|\mathbf c - F(\boldsymbol\lambda, \varepsilon)\|_2^2 + \mu \|\mathbf c\|_1$
as loss function.
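For fixed $(\boldsymbol\lambda, \varepsilon)$, minimizing the loss over $\mathbf c$ is a convex $\ell_1$-regularized least-squares problem. As an illustrative stand-in for the cvx call (simplified to a single data-fit term, dropping the coupling term for brevity; all data here are synthetic assumptions), a basic proximal-gradient (ISTA) iteration looks like:

```python
import numpy as np

def ista(Theta, zdd, mu, iters=5000):
    """Proximal-gradient (ISTA) sketch for min_c ||zdd - Theta c||_2^2 + mu ||c||_1."""
    step = 1.0 / (2.0 * np.linalg.norm(Theta, 2) ** 2)  # 1/L, L = Lipschitz const.
    c = np.zeros(Theta.shape[1])
    for _ in range(iters):
        c = c - step * 2.0 * Theta.T @ (Theta @ c - zdd)         # gradient step
        c = np.sign(c) * np.maximum(np.abs(c) - step * mu, 0.0)  # soft threshold
    return c

# illustrative recovery of a sparse coefficient vector from synthetic data
rng = np.random.default_rng(0)
Theta = rng.standard_normal((100, 10))
c_true = np.array([0, 0.88, 0, -4, 0, 3, 0, 0, 0, 0])
c_hat = ista(Theta, Theta @ c_true, mu=1e-3)
```

The soft-thresholding step is exactly the proximal operator of $\mu\|\mathbf c\|_1$, which is what makes the $\ell_1$ relaxation computationally tractable.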
3.3 An implementable algorithm for the optimization problem
Note that the first constraint (3.6) is linear with respect to $\mathbf c$, while the second constraint (3.3) is linear with respect to $\mathbf c$ but nonlinear with respect to $\boldsymbol\lambda$ and $\varepsilon$. For a linear system $A\mathbf x = \mathbf b$, the analytical solution $\hat{\mathbf x} = (A^\top A)^{-1} A^\top \mathbf b$ of the least-squares problem is generally used as the optimization solution. But the backslash solution, i.e., $\hat{\mathbf x} = A \backslash \mathbf b$ in Matlab, is more prone to result in a sparse solution [12].
Our goal is a sparse $\boldsymbol\lambda$ and an estimate of $\varepsilon$, which are nonlinearly related to $\mathbf c$. As $\mathbf c$ appears linearly inside the first two terms of (3.8), while $\boldsymbol\lambda$ and $\varepsilon$ appear nonlinearly in the second term, we attempt to split the loss function into two parts. We numerically estimate $\mathbf c$ and $(\boldsymbol\lambda, \varepsilon)$, respectively, and update the estimators cyclically [27]. In total, we take an initial estimation step and then cycle an improvement step to implement this procedure.
In the initial estimation step, we leverage the backslash solution $\hat{\mathbf c} = \Theta(\mathbf z) \backslash \ddot{\mathbf z}$ as an estimator of $\mathbf c$, and then compute the real values $\hat\lambda_i$ and $\hat\varepsilon$ one by one from (3.3). Note that the error of $\hat{\mathbf c}$ results in errors of $\hat{\boldsymbol\lambda}$ and $\hat\varepsilon$. Moreover, not all of the 10 equalities in (3.3) are used to get $\hat{\boldsymbol\lambda}$ and $\hat\varepsilon$; the unused equalities bring about differences between $\hat{\mathbf c}$ and $F(\hat{\boldsymbol\lambda}, \hat\varepsilon)$. A comparatively bigger difference means reduced confidence in the estimators, especially for the components of $\hat{\boldsymbol\lambda}$ and for $\hat\varepsilon$ with small absolute values.
In the improvement step, we update $\hat{\mathbf c}$ by minimizing the loss $L$ in (3.9). This is implemented with the convex optimization package cvx. We then solve for the real numbers $\hat\lambda_i$ and $\hat\varepsilon$ with the updated $\hat{\mathbf c}$. Note that we have added a penalty term in the estimation of $\mathbf c$, as the sparsity of $\boldsymbol\lambda$, especially the nonexistent higher-order terms, causes that of $\mathbf c$.
We loop the improvement step until a stable solution is reached. For each updated estimator, we manage sparsity with sequential thresholding as in [1, 17]. The threshold parameters and the weight parameters are chosen by cross-validation. The exact procedure of the estimation is as follows:

1. Preparation.
   (a) For a set of $\boldsymbol\lambda$ and $\varepsilon$, simulate time-series data $\{z_i\}$ with equal time step $h$ from (2.4) by the shooting method, then approximate the time series of $\ddot z$ from these observations by the center difference, and get $\ddot{\mathbf z}$.
   (b) Construct a basis library of power functions $\{1, z, z^2, \ldots, z^9\}$, and develop the basis to get the matrix of time history $\Theta(\mathbf z)$ as in (3.5).
   (c) Split the observations randomly into a training part (80% of the data) and a testing part (the remaining 20%). Accordingly, $\ddot{\mathbf z}$ and $\Theta(\mathbf z)$ are divided into $\ddot{\mathbf z}_{\mathrm{train}}$, $\Theta_{\mathrm{train}}$ and $\ddot{\mathbf z}_{\mathrm{test}}$, $\Theta_{\mathrm{test}}$, respectively.
2. Initial estimation.
   (a) Get an initial estimator $\hat{\mathbf c} = \Theta_{\mathrm{train}} \backslash \ddot{\mathbf z}_{\mathrm{train}}$, and then calculate the real values $\hat\lambda_i$ and $\hat\varepsilon$ one by one from (3.3).
   (b) Calculate the test error $Err^*$ of the loss function on the testing part.
3. Improvement.
   (a) Set the weight parameters $w_1$, $w_2$ and the penalty parameter $\mu$.
   (b) Calculate $\hat{\mathbf c} = F(\hat{\boldsymbol\lambda}, \hat\varepsilon)$, and then let $\hat c_k = 0$ if $|\hat c_k| < \delta_c$. Leverage the cvx package to update the nonzero $\hat c_k$ by minimizing $L$.
   (c) Work out $\hat\lambda_i$ and $\hat\varepsilon$ from (3.3), and let $\hat\lambda_i = 0$ if $|\hat\lambda_i| < \delta_\lambda$.
4. Loop steps 3(b) and 3(c) until the last two sets of estimators are close enough. Denote the estimators by $\hat{\mathbf c}$, $\hat{\boldsymbol\lambda}$ and $\hat\varepsilon$.
5. Calculate the test error $Err$ as in step 2(b) with these estimators and the test data, and compare it with $Err^*$. If $Err < Err^*$, increase the threshold $\delta_c$ by a step and replace the best estimators; otherwise, take back two steps of $\delta_c$, halve the step, and put forward a new step.
6. Loop steps 3-5, find the best $w_1$, $w_2$ and $\mu$ minimizing the test error, and take the corresponding results as the estimators.
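The thresholding-and-refitting pattern used above can be illustrated in isolation by the SINDy-style sequentially thresholded least-squares loop of [1, 17] applied to (3.6); this simplified sketch acts only on $\mathbf c$ and omits the cvx coupling (the data are synthetic assumptions):

```python
import numpy as np

def stlsq(Theta, zdd, threshold, iters=10):
    """Sequentially thresholded least squares: zero out small coefficients,
    then refit the remaining ones, repeating until stable."""
    c = np.linalg.lstsq(Theta, zdd, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(c) < threshold
        c[small] = 0.0
        if (~small).any():
            c[~small] = np.linalg.lstsq(Theta[:, ~small], zdd, rcond=None)[0]
    return c

# recover a sparse c from exact synthetic data on a monomial library
rng = np.random.default_rng(1)
z = rng.uniform(-1.0, 1.0, 200)
Theta = np.vander(z, 10, increasing=True)
c_true = np.array([0, 0.88, 0, -4, 0, 3, 0, 0, 0, 0])
c_hat = stlsq(Theta, Theta @ c_true, threshold=0.1)
```

Each refit restricts the least-squares problem to the surviving columns of $\Theta$, which is what keeps the zeroed coefficients at zero.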
In this algorithm, we make preparations in step 1, get initial estimators in step 2, and loop step 3 to improve the estimators with the weight parameters $w_1$, $w_2$, the penalty $\mu$, and the thresholds $\delta_c$ and $\delta_\lambda$. Recall that our goal is getting a sparse $\hat{\boldsymbol\lambda}$ and estimating $\hat\varepsilon$, while the coefficients $\mathbf c$ in (3.2) result from $\boldsymbol\lambda$ and $\varepsilon$ as in (3.3). In step 3(b), we vanish the absolutely small coefficients of the calculated $\hat{\mathbf c}$ and update the nonzero $\hat c_k$ with the estimated $\hat{\boldsymbol\lambda}$ and $\hat\varepsilon$, with the aspiration of sparsifying $\hat{\mathbf c}$. As each $c_k$ results from $\boldsymbol\lambda$ and $\varepsilon$, its estimate can be close to but not exactly 0; we therefore update the threshold $\delta_c$ and compare the test errors resulting from different thresholds in step 5, in order to determine a more precise $\hat{\mathbf c}$. We compare the test errors with different weights $w_1$, $w_2$ and penalties $\mu$ in step 6, and take the results with the minimum test error as the estimators.
4 Numerical experiments
Example. We consider reconstructing the stochastic differential equation
(4.1) $dX_t = f(X_t)\,dt + \varepsilon\,dB_t,$
where the drift $f$ is a cubic polynomial $f(x) = \lambda_0 + \lambda_1 x + \lambda_2 x^2 + \lambda_3 x^3$ and $\varepsilon$ is the diffusion coefficient. We are given the observation (simulated data) of the most probable transition trajectory $z(t)$, which theoretically satisfies the Euler-Lagrange equation
(4.2) $\ddot z = f(z) f'(z) + \dfrac{\varepsilon^2}{2} f''(z), \qquad z(0) = x_0, \quad z(T) = x_T.$
Given a group of true values as in Table 1, we can simulate a most probable trajectory from the second-order ODE system (4.2) by the shooting method, and get a time series of observations $z_i = z(t_i)$, with $t_i = ih$, $i = 0, 1, \ldots, N$, and $Nh = T$. Based on the time-series data, we reversely reconstruct the system (4.1), namely, discover the drift $f$ and determine the diffusion coefficient $\varepsilon$.
We initially assume $f$ to be a higher-order polynomial, such as the fifth-order polynomial in (3.1). Starting from the linear system (3.2) and combining it with (3.3), we extrapolate the sparse $\hat{\boldsymbol\lambda}$ and estimate $\hat\varepsilon$ according to the algorithm in section 3.3. Some groups of true values and the corresponding estimators are illustrated in Table 1.
For this black-box problem, the small test errors, together with the chosen parameters, listed in Table 2, give us confidence in the estimators. Note that the $\lambda_i$ are real values. We firstly get $\hat\lambda_5 = \pm(\hat c_9 / 5)^{1/2}$ if $\hat c_9 \neq 0$; otherwise, $\hat\lambda_5 = 0$. The error in the estimator $\hat c_9$ causes that of $\hat\lambda_5$. We then get $\hat\lambda_4$ from $\hat c_8 = 9 \hat\lambda_4 \hat\lambda_5$ if $\hat\lambda_5 \neq 0$; otherwise, from $\hat c_7 = 4 \hat\lambda_4^2$. The error of $\hat\lambda_4$ results from that of $\hat c_8$, or from that of $\hat c_7$ together with that of $\hat\lambda_5$. We similarly solve out the real values $\hat\lambda_3$, $\hat\lambda_2$, $\hat\lambda_1$, $\hat\lambda_0$ and $\hat\varepsilon$ one by one from the equalities in (3.3), taking them successively. The real values solved out later have more error sources. The value of $\hat\varepsilon$ should, at last, make the remaining equalities approximately hold. In fact, not all of the 10 equalities are used to solve out the 7 coefficients; the unused equalities bring about the residual error $\Delta = \|\hat{\mathbf c} - F(\hat{\boldsymbol\lambda}, \hat\varepsilon)\|_2$. A bigger $\Delta$ means errors in the estimators, and it reduces our confidence in the $\hat\lambda_i$ and $\hat\varepsilon$ solved out later, which have more error sources. The relatively bigger $\Delta$ in the 6th and 2nd groups indicates a relatively bigger error of $\hat\varepsilon$ than in the other groups; moreover, the estimator $\hat{\boldsymbol\lambda}$ there may also not be as good as in the other groups.
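The one-by-one recovery of $(\boldsymbol\lambda, \varepsilon)$ from $\hat{\mathbf c}$ described above can be sketched for the special case of a cubic drift with $\lambda_3 \neq 0$; the sign ambiguity of $\lambda_3$ is resolved by requiring $\varepsilon^2 > 0$ and checking one unused equality (the function name and test values are our own illustrative assumptions):

```python
import numpy as np

def recover_cubic(c):
    """Invert the map (lambda, eps) -> c of (3.3) for a cubic drift
    f(x) = l0 + l1 x + l2 x^2 + l3 x^3, assuming l3 != 0.  Tries both signs
    of l3 and keeps the candidate with eps^2 > 0 and the smaller residual
    in the unused equality for c0."""
    best = None
    for s in (1.0, -1.0):
        l3 = s * np.sqrt(c[5] / 3.0)                 # from c5 = 3 l3^2
        l2 = c[4] / (5.0 * l3)                       # from c4 = 5 l2 l3
        l1 = (c[3] - 2.0 * l2**2) / (4.0 * l3)       # from c3 = 2 l2^2 + 4 l1 l3
        l0 = (c[2] - 3.0 * l1 * l2) / (3.0 * l3)     # from c2 = 3 l1 l2 + 3 l0 l3
        eps2 = (c[1] - l1**2 - 2.0 * l0 * l2) / (3.0 * l3)  # from c1
        if eps2 <= 0:
            continue
        resid = abs(c[0] - (l0 * l1 + eps2 * l2))    # unused equality c0 as a check
        if best is None or resid < best[2]:
            best = ([l0, l1, l2, l3], float(np.sqrt(eps2)), resid)
    return best

# synthetic check: forward-map an assumed cubic drift, then invert
c = np.zeros(10)
c[:6] = [0.024, 0.856, 1.8, -4.08, -3.6, 4.32]  # from lam = (0, 1, 0.6, -1.2), eps = 0.2
lam, eps, resid = recover_cubic(c)
```

The unused equality for $c_0$ plays the role of the residual $\Delta$ discussed above: a large value signals inconsistent estimators.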
              True value 1  Estimator  True value 2  Estimator  True value 3  Estimator
$\lambda_0$   0             0          0             0          0             0
$\lambda_1$   1.1           1.1008     1             1.0022     1             0.9991
$\lambda_2$   0.8           0.8047     0.6           0.6059     0.6           0.6046
$\lambda_3$   1.3           1.3001     1.2           1.2021     1.2           1.1997
$\lambda_4$   0             0          0             0          0             0
$\lambda_5$   0             0          0             0          0             0
$\varepsilon$ 0.1           0.1142     0.7           0.6416     0.2           0.1895
              True value 4  Estimator  True value 5  Estimator  True value 6  Estimator
$\lambda_0$   0             0          0             0          0             0
$\lambda_1$   1             0.9990     1             1.0047     0.5           0.4836
$\lambda_2$   0             0          0             0          1             1.0217
$\lambda_3$   1             0.9966     1             1.0000     1             1.0077
$\lambda_4$   0             0          0             0          0             0
$\lambda_5$   0             0          0             0          0             0
$\varepsilon$ 0.2           0.1998     1             0.9984     1             0.8107
Group 1  Group 2  Group 3  Group 4  Group 5  Group 6  
0.0045  0.0047  0.0058  0.0111  0.0216  0.0031  
0.0310  0.2495  0.0250  0  4.4409  0.6681  
0.0058  0.0169  0.0067  0.0078  0.0108  0.0363  
0.95  0.95  0.95  0.7  0.5  0.95  
0.04  0.01  0.01  0.02  0.3  0.02 
5 Discussion
Based on the theoretical relationship between the governing SDE system and the Euler-Lagrange equation (a second-order ODE) for its most probable transition trajectory, we have devised an algorithm to discover the drift term and the diffusion coefficient of the SDE from time-series data for the most probable transition trajectory.
We have employed polynomials as the basis of the drift term owing to the fact that a function can be approximated by a Taylor expansion. For more complicated cases, trigonometric functions may be added to the basis polynomials [1, 17]. Moreover, for a governing stochastic model with a possibly rational drift, we may multiply both sides of the model by a polynomial and convert the rational drift to a polynomial drift [13]. Nevertheless, there is no one-size-fits-all basis in data science.
In any case, once the basis is selected, the subsequent work is recursively estimating the coefficients in the system, and then eliminating the uncorrelated features by vanishing the estimators close to 0. The test error can be used to measure the accuracy of the estimators. In our estimation, we employ the part of the loss function involving $\mathbf c$ or $(\boldsymbol\lambda, \varepsilon)$ to estimate and improve the corresponding coefficients. Noting that the sparsity of $\boldsymbol\lambda$ causes that of $\mathbf c$, we have added an $\ell_1$ penalty on $\mathbf c$ in the loss function to improve the estimation of $\mathbf c$. Nevertheless, though we want to sparsify $\boldsymbol\lambda$, we can only solve the $\lambda_i$ from $\mathbf c$ one by one, as they are nonlinearly related to $\mathbf c$; adding a penalty on $\boldsymbol\lambda$ would complicate the solving process more than lead to a sparse solution. We choose the threshold $\delta_c$ for vanishing the absolutely small elements of $\hat{\mathbf c}$ by cross-validation, since $\mathbf c$ should in fact equal $F(\boldsymbol\lambda, \varepsilon)$, and a true sparse $\boldsymbol\lambda$ can result in very small elements of $\mathbf c$. We fix a small threshold $\delta_\lambda$ to sparsify the coefficients $\hat\lambda_i$, owing to the fact that the estimator $\hat{\boldsymbol\lambda}$ is not as sensitive to its threshold as $\hat{\mathbf c}$.
Acknowledgements
This work was partially supported by NSFC grants 11531006, 11601491 and 11771449.
References
 [1] S. L. Brunton, J. L. Proctor and J. N. Kutz, Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc. Natl. Acad. Sci. 113 3932-3937 (2016).
 [2] X. Chang, Z. Xu, H. Zhang, J. Wang and Y. Liang, Robust regularization theory based on $L_{1/2}$ regularization: the asymptotic distribution and variable selection consistency of solutions. Sci. Sin. Math. 40(10) 985-998 (2010).
 [3] Y. Chao and J. Duan, The Onsager-Machlup function as Lagrangian for the most probable path of a jump-diffusion process. Nonlinearity 32 3715-3741 (2019).
 [4] S. Chen, D. L. Donoho and M. A. Saunders, Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20(1) 33-61 (1998).
 [5] X. Chen, D. Yang, Q. Zhang and J. Liang, $L_{1/2}$ regularization based numerical method for effective reconstruction of bioluminescence tomography. J. Appl. Phys. 115 184702 (2014).
 [6] J. Duan, An Introduction to Stochastic Dynamics. Cambridge University Press, 2015.
 [7] D. Dürr and A. Bach, The Onsager-Machlup function as Lagrangian for the most probable path of a diffusion process. Commun. Math. Phys. 60 153-170 (1978).
 [8] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning. Springer, 2009.
 [9] N. Q. Hoffer, K. Neupane, A. G. T. Pyo and M. T. Woodside, Measuring the average shape of transition paths during the folding of a single biological molecule. Proc. Natl. Acad. Sci. 116(17) 8125-8130 (2019).
 [10] W. Iwasaki and T. Takagi, Rapid pathway evolution facilitated by horizontal gene transfers across prokaryotic lineages. PLoS Genet. 5(3) e1000402 (2009).
 [11] R. Kress, Numerical Analysis. Springer, 1998.
 [12] J. N. Kutz, Data-Driven Modeling & Scientific Computation. Oxford University Press, 2013.
 [13] N. M. Mangan, S. L. Brunton, J. L. Proctor and J. N. Kutz, Inferring biological networks by sparse identification of nonlinear dynamics. IEEE Transactions on Molecular, Biological, and Multi-Scale Communications 2(1) 52-63 (2016).
 [14] B. K. Natarajan, Sparse approximate solutions to linear systems. SIAM J. Comput. 24(2) 227-234 (1995).
 [15] L. Onsager and S. Machlup, Fluctuations and irreversible processes. Phys. Rev. 91(6) 1505-1512 (1953).
 [16] A. F. Psaros, I. A. Kougioumtzoglou and I. Petromichelakis, Sparse representations and compressive sampling for enhancing the computational efficiency of the Wiener path integral technique. Mech. Syst. Signal Process. 111 87-101 (2018).
 [17] S. H. Rudy, S. L. Brunton, J. L. Proctor and J. N. Kutz, Data-driven discovery of partial differential equations. Sci. Adv. 3 e1602614 (2017).
 [18] A. J. Sherbondy, R. F. Dougherty, M. Ben-Shachar, S. Napel and B. A. Wandell, ConTrack: Finding the most likely pathways between brain regions using diffusion tractography. Journal of Vision 8(9):15 1-16 (2008).
 [19] D. Stefan, C. Pinel, S. Pinhal, E. Cinquemani, J. Geiselmann and H. de Jong, Inference of quantitative models of bacterial promoters from time-series reporter gene data. PLoS Comput. Biol. 11(1) e1004028 (2015).
 [20] R. Tibshirani, Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B 58(1) 267-288 (1996).
 [21] L. Tisza and I. Manning, Fluctuations and irreversible thermodynamics. Phys. Rev. 105 1695-1705 (1957).
 [22] D. Wu, M. Fu and J. Duan, Discovering mean residence time and escape probability from data of stochastic dynamical systems. Chaos 29 093122 (2019).
 [23] Z. Xu, H. Zhang, Y. Wang and X. Chang, $L_{1/2}$ regularization. Sci. China Inf. Sci. 53 1159-1169 (2010).
 [24] Z. Xu, X. Chang, F. Xu and H. Zhang, $L_{1/2}$ regularization: A thresholding representation theory and a fast solver. IEEE Trans. Neural Netw. Learn. Syst. 23(7) 1013-1027 (2012).
 [25] O. Yair, R. Talmon, R. R. Coifman and I. G. Kevrekidis, Reconstruction of normal forms by learning informed observation geometries from data. Proc. Natl. Acad. Sci. 114(38) E7865-E7874 (2017).
 [26] J. Zeng, S. Lin, Y. Wang and Z. Xu, $L_{1/2}$ regularization: convergence of iterative half thresholding algorithm. IEEE Trans. Signal Process. 62(9) 2317-2329 (2014).
 [27] P. Zheng, T. Askham, S. L. Brunton, J. N. Kutz and A. Y. Aravkin, A unified framework for sparse relaxed regularized regression: SR3. IEEE Access 7 1404-1423 (2019).