A novel use of machine learning, which has potential both for modeling with large or high-frequency data sets as well as advancing fundamental science, is the discovery of governing differential equations from data. Unlike highly specialized algorithms used to refine existing models, these novel methods are distinguished by comparatively limited assumptions, and the ability to produce various types of equations from a wide variety of data.
We now give a very brief (and incomplete) overview of a few proposed algorithms. Key differences between various works include the techniques used to generate candidate equations and the technique used to select equations. A fundamental problem that all these work address is that, if such algorithms are to mimic a human scientist, while generating accurate models, they must also avoid or penalize spurious “overfitted” models. In 
, a symbolic regression method was developed to learn conservation laws of physical systems. The laws, which could be nonlinear, were created using genetic programming and evaluated concurrently on predicative ability and parsimony (number of terms) in order to prevent overfitting. Earlier, in in a dynamical system context, the equations were evaluated using “probing” tests while the overfitting was addressed using a separate “snipping” process. In , a more parametric method for learning nonlinear autonomous systems was developed in which the candidate equation is build from a linear combination of elements from a user-defined “dictionary”. Parsimony translated to sparsity in the dictionary elements, so sparse regression was used to determine the coefficients. In , a similar approach was used to generate nonlinear partial differential evolution equations.
Building on an earlier work  where Gaussian process regression was used to infer solutions of equations, a Gaussian process framework was developed in  for parametric learning of linear differential equations, of the form
given small data on and . Here refers to a linear differential operator with parameters ; for example, could be a linear combination of differential operators with coefficients . The method placed a Gaussian Process on the function and used linearity of to obtain a joint Gaussian Process on in which the unknown parameters were subsumed as hyperparameters (this is reviewed in detail in Section 2). This allowed the use of standard continuous optimization techniques to optimize the negative log-marginal likelihood, which effects an automatic trade-off between data-fit and model complexity, and thereby find . Thus, in this approach, the problem of identifying the differential equation is “embedded” into the problem of interpolation/regression of data. Moreover, the Gaussian Process method only requires computation of the forward action of on the covariance kernel, rather than the solution of the differential equation. However, the method also required the user to select a “dictionary” of terms and assume a parametric form of the equation.
The inclusion of fractional-order operators in this framework, with the order itself is a parameter, allows for a single fractional-order operator to interpolate between derivatives of all orders. This allows the user to directly control the parsimony, by fitting a specified number of fractional derivatives to data, without making assumptions on the orders of derivatives to be discovered. Therefore, building on a basic example of a fractional-order operator treated in , we significantly extend the framework to treat other space-fractional differential operators and covariance kernels. The main problem that must be addressed is the efficient computation of the action of the unknown (fractional) linear operator on covariance kernels.
At the same time, fractional-order derivatives are far more than a tool to interpolate between integer-order and facilitate data-driven discovery of PDEs using continuous optimization techniques. The advances in the present article further improve the ability to discover fractional-order partial differential equations (FPDEs) from real-world data. It is now well understood that FPDEs have a profound connection with anomalous diffusions and systems driven by heavy-tailed processes (, ). While such heavy-tailed data abounds in the fields of, e.g., hydrology, finance, and plasma physics, FPDEs are currently underutilized as tools to model macroscopic properties of such systems. This can be attributed to the analytic difficulties of deriving FPDE models; not only is there an additional parameter, but the involved formulas and nonlocal nature of fractional-order derivatives make them significantly less attractive for specialists in other fields to work with.
Machine Learning is a natural tool for ameliorating this issue. As proof of this concept, we point to the work , which employed a multi-fidelity Gaussian Process method to discover fractional-order of the fractional advection-dispersion equation governing underground tritium transport through a heterogeneous medium at the Macrodispersion Experimental site at Columbus Air Force Base. This resulted in improved and highly efficient fitting of variable-order fractional equation to data. Along more theoretical lines,  explored the determination, using Gaussian processes, of the fractional-order of an elliptic problem involving a Spectral fractional Laplacian on a bounded domain with Neumann boundary condition. In addition, the authors also proved wellposedness of the inverse Bayesian formulation of this problem in the sense of . Our work differs from  in that we do not repeatedly solve the forward problem, and from  in that we do not place a prior on the fractional order or on other parameters; rather, the parameters are inferred by placing a prior and training on the solution and right-hand side of the equation. This allows for more flexibility with regard to the form of the equation and the inclusion of additional parameters.
In the aforementioned Gaussian Process framework of , it is possible to treat time-derivatives by considering the data and equation in space-time. Time may be treated as another dimension in the covariance kernel. An alternative method is suggested by the work of , in which learning of evolution equations is based on the numerical differentiation of Gaussian processes. There, the data is given at different “snap-shots” in time which are used to discretize the time-derivative. In this way, even nonlinear time-dependent equations can be discovered using correspondence between nonlinear terms and specific linearizations of the discretized system. The training is similar to that of , and we extend this to fractional operators as well.
This work is organized as follows. In section 2, we review the Gaussian Process framework of  and . In section 3, we review the Matérn family of covariance kernels and space-fractional operators and present new formulas for space-fractional derivatives of such covariance kernels. Such covariance kernels can be more suited to rough data in certain applications. The inclusion of these new formulas, which can be efficiently treated using generalized Gauss-Laguerre quadrature, allows the discovery of various fractional equations in a unified way. In section 4 we present basic synthetic examples to illustrate the methodology, and in section 5 we apply the methodology to the discovery and interpolation of integer-order advection and diffusion using the fractional-order framework. This includes an interesting problem of interpolating two-term advection-diffusion using a single fractional-order term, and an exploration of user-selected parsimony. After reviewing the relation between -stable processes and fractional diffusion in section 6 and studying a synthetic example, in section 7 we apply the methodology to the modeling of relative stock performance (Intel to S&P 500) by -stable processes via an associated fractional-order diffusion equation.
2 The Gaussian Process Framework
We review the framework developed by  for parametric learning of linear differential equations, of the form
given data on and . Here, is a linear operator with unknown parameters . Here, and throughout the article, we use boldface characters (such as
) to denote vector-valued variables, and capital boldface characters (such as) to denote data vectors.
Assume to be Gaussian process with mean and covariance function
with hyperparameters :
We shall be vague about the form of the covariance kernel until Section 3; for now, it suffices to say that it describes how the correlation between the values of at two points and falls off with or otherwise behaves with the two points, and that it must be a symmetric, positive semidefinite function 
. Then, the linear transformationof the Gaussian process implies a Gaussian Process for (see , §9.4),
with covariance kernel
Moreover, the covariance between and , and between and is
respectively. By symmetry of ,
The hyperparameters of the joint Gaussian Process
are then learned by training on the data of given at points and of given at points . This is done with a Quasi-Newton optimizer L-BFGS to minimize the negative log marginal likelihood ():
where , , and is given by
The additional noise parameters and are included to learn uncorrelated noise in the data; their inclusion above corresponds to the assumption that
with and independently .
Next we review the time-stepping Gaussian Process method of  for learning linear (in our case) equations of the form
For our purposes, we consider two “snapshots” and of the system at two times and , respectively, such that
We perform, in the case of two snapshots, a backward Euler discretization
Then we assume a Gaussian Process, but for :
As before, the linearity of leads to a Gaussian Process for . We obtain the joint Gaussian process
where, denoting the identity operator by Id,
Equation (16) can be compared to equation (8), and equation (17) to (5) and (6). The set-ups are very similar, and again, has been merged into the hypermarameters of this joint Gaussian Process. Given data at spacial points and for the functions and , represented by the vectors and , respectively, the new hyperparameters are trained by employing the same Quasi-Newton optimizer L-BFGS as before to minimize the given by equation (9). In this case, , , and is given by
Here, is an additional parameter to learn noise in the data, under the assumption
with and being independent.
3 Fractional Derivatives of Covariance Kernels
Many properties of a Gaussian Process are determined by the choice of covariance kernel . In particular, the covariance kernel encodes an assumption about the smoothness of the field that being interpolated. Stein 
writes “…properties of spatial interpolants depend strongly on the local behavior of the random field. In practice, this local behavior is not known and must be estimated from the same data that will be used to do the interpolation. This state of affairs strongly suggests that it is critical to select models for the covariance structures that include at least one member whose local behavior accurately reflects the actual local behavior of the spatially varying quantity under study”. Matérn kernels(defined below), a family of stationary kernels which includes the exponential kernel for , and the squared-exponential kernel in the limit , have been widely used for this reason. A Gaussian Process with Matérn covariance kernel is -times mean square differentiable for . We have developed a computational methodology that allows for Matérn kernels of arbitrary real order to be used in our Gaussian Process, namely in (3) and (15). In fact, the parameter itself can be treated and optimized as a hyperparameter of the Gaussian Process, just as the equation parameters were in Section 2. We employ such an algorithm to explore the effect of the parameter when working with rough time series histogram data in section 6.
The main problem that arises when using fractional operators is the computation of their action on kernels such as the Matérn class, as required by equations (5) and (6) for the time-independent case and (17) for the time-dependent case. This cannot be done analytically, and requires a numerical approach, in contrast to the works  and  where (standard) differential operators applied to kernels were obtained symbolically in closed form using Mathematica. Moreover, unlike standard derivatives, fractional derivatives, whether in ,
, or on bounded subsets, are nonlocal operators typically defined by singular integrals or eigenfunction expansions that are difficult and expensive to discretize, . However, space-fractional derivatives in enjoy representations as Fourier multiplier operators. In other words, they are equivalent to multiplication by a function in frequency space. This representation suggests a computational method that avoids any singular integral operators or the solution of extension problems in . The downside to Fourier methods is that, if used to compute the fractional derivative of a function on , they may require quadrature of a (forward and inverse) Fourier integral. Thus, if one wishes to compute the fractional derivative of a covariance kernel , as in (5), (6), or (17), this may entail quadrature for and , and quadrature for . These dimensions for quadrature can be cut in half provided the (forward) Fourier transforms of these kernels were known analytically. Moreover, if the covariance kernel is stationary, we see in Theorem 1 below that can further be reduced from a -dimensional integral to -dimensional one.
Thus, the entire problem of kernel computation is reduced to -dimensional quadrature if the following three conditions are satisfied:
The spacial differential operator is a Fourier multiplier operator:
This is true for a variety of fractional space derivatives:
Here, and throughout this article, we use the Fourier transform convention
The covariance kernel is stationary:
This is true of the squared-exponential kernel in one-dimension
as well as frequently used multivariate squared-exponential kernels, formed by multiplication
of the one-dimensional kernel. The same is true for the Matérn kernels , which have one-dimensional form
and the corresponding multidimensional kernels
The notation refers to the modified Bessel function, which is potentially confusing, but will not be an issue as we will focus on Fourier representation of in what follows.
The (forward) Fourier transform of the stationary covariance kernel is known analytically. This is satisfied by the squared exponential kernel and the Matérn kernel111In the machine learning literature, authors such as  describe this Fourier transform as the spectral density in the context of Bochner’s theorem on stationary kernels, and write it in the equivalent form, up to Fourier transform convention: . :
The same is true for the multidimensional kernels and by linearity of the Fourier transform, and for and by Fubini’s theorem.
Suppose conditions (1)-(3) on the covariance kernel and the operator and are satisfied. Then the fractional derivatives of the covariance kernel , , and can be computed by -dimensional integrals
For a proof of this theorem, see Appendix A.
Provided the integrals (30) can be computed numerically at the locations of the data, they can be used to build the kernel matrix and evaluate the objective function given by (9). In the Quasi-Newton L-BFGS method (discussed in Section 2) that is used to train the extended hyperparameters implicit in , we supply the gradient of the , the components of which are given by
See , §5.4. To obtain , we note in (30) that the derivative may be passed into the integrand and through the complex exponential factor. The resulting derivative of the product of the multiplier(s) , which contains the equation parameters, and , which contains the original kernel parameters, can be obtained symbolically as a closed-form expression. The same numerical procedure is used to evaluate the resulting integrals as for (30). The Hessian is not supplied in closed form and is approximated from evaluations of the gradient.
When training a Gaussian process, it is advantageous to standardize the data  so that when possible. In the framework discussed here, this reduces the difficulty of computing the kernel functions in (30) by restricting the frequency and support of the integrands. When necessary, standardization for the applications considered here can be performed by rescaling the values and positions of the data point by appropriate constants. Once the differential equation is learned for the scaled solution , the true coefficients can be obtained via inverse scaling. This is discussed in detail for an example in Section 7. During training, we expect convergence to a local minimum of the , and we have not found the optimal to depend significantly on the initial guess in our examples, but there is no guarantee of this. Uniqueness, stability, and convergence remain important open questions.
Although numerical calculation of the above integrals can be performed using Gauss-Hermite quadrature, this is not optimal as the fractional-order monomial is not smooth at the origin. To obtain faster convergence with the number of quadrature points, a superior choice is generalized Gauss-Laguerre quadrature, involving a weight function of the form for :
Here, are the Gauss-Laguerre weights, and the nodes. In practice, it is essential for to match the fractional part of the power of the monomial in the integrand , as the remainder yields a smooth function. We use the Golub-Welsch algorithm to find the nodes, but compute the weights by evaluating the generalized Gauss-Laguerre polynomial at these nodes for higher relative accuracy.
We describe two examples and discuss the convergence of the numerical Gauss-Laguerre quadrature in each one. First, consider the left-sided Riemann-Louiville derivative in one dimension, which involves in the above formulas. Owing to the symmetry of , one can write
These integrals call for generalized Gauss-Laguerre quadrature to be performed with for and for . Using the Matérn kernel with as an example, the convergence of the error with the number of quadrature points is shown in Table 1. The MATLAB integral function is used for reference when computing the error. The setup for working with the right-sided Riemann-Louiville derivative, or for the one-dimensional fractional Laplacian, is entirely similar.
Next we consider the fractional Laplacian in two dimensions. This involves the multiplier . We transform the integrals into polar coordinates:
The quadrature of these integrals is performed using the trapezoid rule in and generalized Gauss-Laguerre quadrature in , using for and for . Using the product multivariate Matérn kernel as an example, convergence with respect to the number of quadrature points in is show in in Table 2, where 64 quadrature points in and 8, 16, 32, 64 quadrature points in are used. Reference answers were generated by nesting the MATLAB integral function.
The examples in the following sections are built by defining the Matern kernels in Mathematica and performing derivatives with respect to the hyperparameters symbolically. The expressions are ported into MATLAB code where Gauss-Laguerre quadrature with appropriate is used to compute the (9) as well as the derivatives of the with respect to the parameters. For our purposes, 64 quadrature points in one dimension and 64x64 quadrature points (as described above) in two dimensions is sufficient.
4 A Basic Example
In this section, we will illustrate the methodology to discover the parameters and in the fractional elliptic equation
in one and two space dimensions from data on and .
Following the time-independent framework, this means that we must optimize the negative log marginal likelihood (9), where in the covariance matrix (10), the kernels are given by the integral formulas (30). In the latter formulas, the multiplier and the Fourier transform of the stationary prior kernel must be specified; the multiplier corresponding to the operator is in one dimension, and in two dimensions. We choose to use the Matérn kernel , given by (27
) in one dimension, with a tensor product (28) of Matérn kernels in two dimensions. Thus, by equations (29), This completes the description of the covariance kernel.
In the one-dimensional case the solution/RHS pair
is used. Two data sets are generated for two experiments. In both experiments, we use the above solution/RHS pair to generate data with . The Matérn kernel with fixed is used to discover the parameters, and the initial parameters for the optimization are taken to be . In the first experiment, 7 data points at for and 11 data points for are generated via latin hypercube sampling. No noise is added. In the second experiment, 20 data points for each of and
are generated in the same way, but normal random noise of standard deviation 0.1 forand 0.2 for is added to the data.
The GP regression for the first experiment is shown in Figure 1. The equation parameters recovered are and , which are within 1% of the true values. The GP regression for the second experiment is shown in Figure 2. There, we have also plotted the twice the standard deviation of the Gaussian process plus twice the standard deviation of the learned noise, and (in the previous experiment, these learned parameters were miniscule). The parameters learned in the second experiment are and , which are within 7% and 6% of the true values, respectively.
In the two dimensional example, the exact solution pair used to generate data is now
with Again, the Matérn parameter is used. The initial parameters for the optimization are . We take 40 data points for each of and , generated using latin hypercube sampling on . The parameters and were used to generate data. The result of the training is shown in Figure 3. The learned parameters are and , within of the true values, with hyperparameters , , and .
These examples demonstrate the feasiblity of implementing the fractional kernels as described in the previous sections and the accuracy of discovered parameters, even with noise or small data. Moreover, there is no theoretical difficulty in increasing the dimension of the problem. However, in additional to longer runtime for the computation of formulas (30), the user should expect significantly more data to be required for accurate parameter estimation. For example, performing the same two-dimensional example above, with only 20 data points for each of and , results in learned parameters and , the parameter exhibiting an error of roughly 20%. In the analogous one-dimensional example (Fig. 1), a error is obtained for less than half this number of data points.
In concluding this section, we point that since the estimation of the equation parameters is based on accurate Gaussian process regression through the training data, there are various well-known hazards to avoid. In addition to obvious issues such as unresolved data and data that is too noisy, too much data in featureless “flat” regions carries the risk of overfitting, and should be avoided.
5 Discovering and Interpolating Integer Order Models
As discussed in the introduction, fractional-order differential operators interpolate between classical differential operators of integer order, reducing the task of choosing a “dictionary” of operators of various orders and the assumptions that this entails. The user controls the parsimony directly, by choosing the number of fractional terms in the candidate equation. For example, the model in Section 4
was constrained to be of parsimony one, since it includes a single differential operator of fractional order. This raises several questions, such as: Can the method be used to discover integer-order operators when they drive the true dynamics? What can be expected if the user-selected parsimony is lower than the parsimony of the “true” model driving the data? Can lower-parsimony models still be used to model dynamics? To explore these questions, we consider the parametrized model
for . where the left-sided Riemann-Louiville derivative was defined in (21). Note that
For , equation (43) reduces to the advection equation (with speed )
while for , it reduces to the diffusion equation (with diffusion coefficient )
We perform four experiments. Importantly, we learn these equations using the time-stepping methodology, optimizing (9), where the kernel blocks are given by (17) and (18). We take (effectively a squared-exponential kernel) and again use (30) with generalized Gauss-Laguerre quadrature to evaluate the action of the fractional derivatives on . In all of experiments, we generate data that satisfies the initial condition . We choose and ; thus, . The GP is trained on 30 data points for each of these time slices. The experiments are
Data generated from , the solution to the integer order advection-diffusion equation
We learn the order and coefficient in (43). Note that this archetype is limited to only one space-differential operator; we will see that the algorithm will select best order to capture both the advection and diffusion in the data.
The same advection-diffusion data as in experiment 3, but with the two term, four parameter candidate equation
We note that all of these experiments call for only a single fractional-order dictionary term; even Experiment 4 uses two copies of the same archetype. Experiments 1-3 use initial parameters and , and Experiment 4 uses , .
In Experiments 1 and 2, we note that fractional-order parameters are discovered close to (within 5% of) the true integer-order parameters. In this sense, the true dynamics can be considered recovered. The numerical difference from the true parameters despite the high number of data points (30 per slice) is likely due to approximation error from the backwards Euler approximation (14), and may be resolved by using higher-order differentiation with more time slices , as simply taking to be extremely small may cause the optimization to be dominated by numerical error in computing the kernels.
Experiment 3 shows what occurs when the user-defined parsimony is less than the true dynamics used to generate data. The optimizer still converges, and to a sensible answer – the result can be interpreted as an interpolation of the two integer-order operators in the true dynamics. Moreover, as shown in Figure 4, near the time of the data used to train the model, and even much later, the fractional dynamics are a good approximation to the true dynamics, while being simpler in the sense of being driven by only one spatial derivative. This leads to potential applications of interpolating complex systems using lower-parsimony fractional models.
In Experiment 4, the parsimony was increased with the inclusion of an additional indepedent copy of the fractional archetype. The advection-diffusion equation is recovered with parameters within of the true values. Thus, with a single fractional archetype and user-controlled parsimony, it is possible to discover advection, diffusion, advection-diffusion, as well as a single-term fractional interpolation of advection-diffusion.