1 Introduction
A novel use of machine learning, with potential both for modeling large or high-frequency data sets and for advancing fundamental science, is the discovery of governing differential equations from data. Unlike highly specialized algorithms used to refine existing models, these novel methods are distinguished by comparatively limited assumptions and by the ability to produce various types of equations from a wide variety of data.
We now give a very brief (and incomplete) overview of a few proposed algorithms. Key differences between the various works include the techniques used to generate candidate equations and the techniques used to select among them. A fundamental problem that all these works address is that, if such algorithms are to mimic a human scientist while generating accurate models, they must also avoid or penalize spurious “overfitted” models. In [34], a symbolic regression method was developed to learn conservation laws of physical systems. The laws, which could be nonlinear, were created using genetic programming and evaluated concurrently on predictive ability and parsimony (number of terms) in order to prevent overfitting. Earlier, in [2], in a dynamical systems context, the equations were evaluated using “probing” tests while overfitting was addressed using a separate “snipping” process. In [5], a more parametric method for learning nonlinear autonomous systems was developed, in which the candidate equation is built from a linear combination of elements from a user-defined “dictionary”. Parsimony translated to sparsity in the dictionary elements, so sparse regression was used to determine the coefficients. In [32], a similar approach was used to generate nonlinear partial differential evolution equations. Building on an earlier work [26] where Gaussian process regression was used to infer solutions of equations, a Gaussian process framework was developed in [27] for parametric learning of linear differential equations, of the form
(1) \[ \mathcal{L}_x^\phi u(x) = f(x), \]
given small data on $u$ and $f$. Here $\mathcal{L}_x^\phi$ refers to a linear differential operator with parameters $\phi$; for example, $\mathcal{L}_x^\phi$ could be a linear combination of differential operators with coefficients $\phi$. The method placed a Gaussian process prior on the function $u$ and used linearity of $\mathcal{L}_x^\phi$ to obtain a joint Gaussian process on $(u, f)$ in which the unknown parameters $\phi$ were subsumed as hyperparameters (this is reviewed in detail in Section 2). This allowed the use of standard continuous optimization techniques to minimize the negative log marginal likelihood, which effects an automatic tradeoff between data fit and model complexity, and thereby to find $\phi$. Thus, in this approach, the problem of identifying the differential equation is “embedded” into the problem of interpolation/regression of the data. Moreover, the Gaussian process method only requires computation of the forward action of $\mathcal{L}_x^\phi$ on the covariance kernel, rather than the solution of the differential equation. However, the method also required the user to select a “dictionary” of terms and assume a parametric form of the equation.
The inclusion of fractional-order operators in this framework, with the order itself as a parameter, allows a single fractional-order operator to interpolate between derivatives of all orders. This allows the user to directly control the parsimony by fitting a specified number of fractional derivatives to data, without making assumptions on the orders of the derivatives to be discovered. Therefore, building on a basic example of a fractional-order operator treated in [27], we significantly extend the framework to treat other space-fractional differential operators and covariance kernels. The main problem that must be addressed is the efficient computation of the action of the unknown (fractional) linear operator on covariance kernels.
At the same time, fractional-order derivatives are far more than a tool to interpolate between integer-order derivatives and facilitate data-driven discovery of PDEs using continuous optimization techniques. The advances in the present article further improve the ability to discover fractional-order partial differential equations (FPDEs) from real-world data. It is now well understood that FPDEs have a profound connection with anomalous diffusion and systems driven by heavy-tailed processes ([18], [19]). While such heavy-tailed data abounds in fields such as hydrology, finance, and plasma physics, FPDEs are currently underutilized as tools to model macroscopic properties of such systems. This can be attributed to the analytic difficulties of deriving FPDE models; not only is there an additional parameter, but the involved formulas and nonlocal nature of fractional-order derivatives make them significantly less attractive for specialists in other fields to work with. Machine learning is a natural tool for ameliorating this issue. As proof of this concept, we point to the work [21], which employed a multifidelity Gaussian process method to discover the fractional order of the fractional advection-dispersion equation governing underground tritium transport through a heterogeneous medium at the Macrodispersion Experimental site at Columbus Air Force Base. This resulted in improved and highly efficient fitting of a variable-order fractional equation to data. Along more theoretical lines, [10] explored the determination, using Gaussian processes, of the fractional order of an elliptic problem involving a spectral fractional Laplacian on a bounded domain with Neumann boundary condition. In addition, the authors also proved well-posedness of the inverse Bayesian formulation of this problem in the sense of [36]. Our work differs from [21] in that we do not repeatedly solve the forward problem, and from [10] in that we do not place a prior on the fractional order or on other parameters; rather, the parameters are inferred by placing a prior on, and training on, the solution and right-hand side of the equation. This allows for more flexibility with regard to the form of the equation and the inclusion of additional parameters.
In the aforementioned Gaussian process framework of [27], it is possible to treat time derivatives by considering the data and equation in space-time; time is simply treated as another dimension in the covariance kernel. An alternative method is suggested by the work of [24], in which the learning of evolution equations is based on the numerical time discretization of Gaussian processes. There, the data is given at different “snapshots” in time, which are used to discretize the time derivative. In this way, even nonlinear time-dependent equations can be discovered using a correspondence between nonlinear terms and specific linearizations of the discretized system. The training is similar to that of [27], and we extend this approach to fractional operators as well.
This work is organized as follows. In Section 2, we review the Gaussian process frameworks of [27] and [24]. In Section 3, we review the Matérn family of covariance kernels and space-fractional operators, and present new formulas for space-fractional derivatives of such covariance kernels; such covariance kernels can be better suited to rough data in certain applications. The inclusion of these new formulas, which can be efficiently treated using generalized Gauss-Laguerre quadrature, allows the discovery of various fractional equations in a unified way. In Section 4 we present basic synthetic examples to illustrate the methodology, and in Section 5 we apply the methodology to the discovery and interpolation of integer-order advection and diffusion using the fractional-order framework. This includes an interesting problem of interpolating two-term advection-diffusion using a single fractional-order term, and an exploration of user-selected parsimony. After reviewing the relation between stable processes and fractional diffusion in Section 6 and studying a synthetic example, in Section 7 we apply the methodology to the modeling of relative stock performance (Intel to S&P 500) by stable processes via an associated fractional-order diffusion equation.
2 The Gaussian Process Framework
We review the framework developed in [27] for parametric learning of linear differential equations, of the form
(2) \[ \mathcal{L}_x^\phi u(x) = f(x), \]
given data on $u$ and $f$. Here, $\mathcal{L}_x^\phi$ is a linear operator with unknown parameters $\phi$. Here, and throughout the article, we use boldface characters (such as $\mathbf{x}$) to denote vector-valued variables, and capital boldface characters (such as $\mathbf{X}$) to denote data vectors. Assume $u$ to be a Gaussian process with mean $0$ and covariance function $k_{uu}$ with hyperparameters $\theta$:
(3) \[ u(x) \sim \mathcal{GP}\left(0,\; k_{uu}(x, x'; \theta)\right). \]
We shall be vague about the form of the covariance kernel until Section 3; for now, it suffices to say that it describes how the correlation between the values of $u$ at two points $x$ and $x'$ falls off with the distance between, or otherwise varies with, the two points, and that it must be a symmetric, positive semidefinite function [31]. Then, the linear transformation $f = \mathcal{L}_x^\phi u$ of the Gaussian process implies a Gaussian process for $f$ (see [31], §9.4),

(4) \[ f(x) \sim \mathcal{GP}\left(0,\; k_{ff}(x, x'; \theta, \phi)\right), \]
with covariance kernel

(5) \[ k_{ff}(x, x') = \mathcal{L}_x^\phi\, \mathcal{L}_{x'}^\phi\, k_{uu}(x, x'). \]
Moreover, the covariance between $u(x)$ and $f(x')$, and between $f(x)$ and $u(x')$, is

(6) \[ k_{uf}(x, x') = \mathcal{L}_{x'}^\phi\, k_{uu}(x, x'), \qquad k_{fu}(x, x') = \mathcal{L}_x^\phi\, k_{uu}(x, x'), \]
respectively. By symmetry of $k_{uu}$,

(7) \[ k_{fu}(x, x') = k_{uf}(x', x). \]
The hyperparameters of the joint Gaussian process

(8) \[ \begin{bmatrix} u(X_u) \\ f(X_f) \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0},\; \mathbf{K}\right) \]
are then learned by training on the data $\mathbf{Y}_u$ of $u$ given at points $X_u$ and $\mathbf{Y}_f$ of $f$ given at points $X_f$. This is done with the quasi-Newton optimizer L-BFGS, minimizing the negative log marginal likelihood (NLML; see [31]):
(9) \[ \mathrm{NLML} = \frac{1}{2}\, \mathbf{y}^{\mathsf{T}} \mathbf{K}^{-1} \mathbf{y} + \frac{1}{2} \log |\mathbf{K}| + \frac{N}{2} \log 2\pi, \]
where $\mathbf{y} = [\mathbf{Y}_u; \mathbf{Y}_f]$ is the concatenated data vector, $N$ is its total length, and $\mathbf{K}$ is given by
(10) \[ \mathbf{K} = \begin{bmatrix} k_{uu}(X_u, X_u) + \sigma_u^2 I & k_{uf}(X_u, X_f) \\ k_{fu}(X_f, X_u) & k_{ff}(X_f, X_f) + \sigma_f^2 I \end{bmatrix}. \]
The additional noise parameters $\sigma_u$ and $\sigma_f$ are included to learn uncorrelated noise in the data; their inclusion above corresponds to the assumption that
(11) \[ \mathbf{Y}_u = u(X_u) + \boldsymbol{\epsilon}_u, \qquad \mathbf{Y}_f = f(X_f) + \boldsymbol{\epsilon}_f, \]
with $\boldsymbol{\epsilon}_u \sim \mathcal{N}(\mathbf{0}, \sigma_u^2 I)$ and $\boldsymbol{\epsilon}_f \sim \mathcal{N}(\mathbf{0}, \sigma_f^2 I)$ independent.
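For concreteness, the training objective above can be sketched in a few lines. The following minimal Python/NumPy implementation (the function and variable names are ours, not notation from [27]) assembles the block covariance matrix (10) and evaluates the negative log marginal likelihood (9):

```python
import numpy as np

def joint_K(Kuu, Kuf, Kfu, Kff, sigma_u, sigma_f):
    """Assemble the block covariance matrix of eq. (10), adding the
    noise variances on the diagonal blocks."""
    top = np.hstack([Kuu + sigma_u ** 2 * np.eye(len(Kuu)), Kuf])
    bot = np.hstack([Kfu, Kff + sigma_f ** 2 * np.eye(len(Kff))])
    return np.vstack([top, bot])

def nlml(y, K, jitter=1e-10):
    """Negative log marginal likelihood of y ~ N(0, K), eq. (9).
    A Cholesky factorization is used instead of forming K^{-1}:
    it is more stable and yields log|K| from the factor's diagonal."""
    n = len(y)
    L = np.linalg.cholesky(K + jitter * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha
            + np.sum(np.log(np.diag(L)))       # = 0.5 * log|K|
            + 0.5 * n * np.log(2 * np.pi))
```

In practice the blocks $k_{uu}$, $k_{uf}$, $k_{fu}$, $k_{ff}$ are themselves functions of $(\theta, \phi)$, so an outer optimizer such as L-BFGS repeatedly rebuilds $\mathbf{K}$ and re-evaluates this objective.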
Next we review the time-stepping Gaussian process method of [24] for learning linear (in our case) equations of the form

(12) \[ \frac{\partial u}{\partial t} = \mathcal{L}_x^\phi u. \]
For our purposes, we consider two “snapshots” $u^{n-1}$ and $u^n$ of the system at two times $t_{n-1}$ and $t_n$, respectively, such that

(13) \[ u^{n-1}(x) = u(x, t_{n-1}), \qquad u^n(x) = u(x, t_n), \qquad \Delta t = t_n - t_{n-1}. \]
We perform, in the case of two snapshots, a backward Euler discretization

(14) \[ u^{n-1} = \left( \mathrm{Id} - \Delta t\, \mathcal{L}_x^\phi \right) u^n. \]
Then we assume a Gaussian process prior, but for $u^n$:

(15) \[ u^n(x) \sim \mathcal{GP}\left(0,\; k(x, x'; \theta)\right). \]
As before, the linearity of $\mathcal{L}_x^\phi$ leads to a Gaussian process for $u^{n-1}$. We obtain the joint Gaussian process

(16) \[ \begin{bmatrix} u^n(x) \\ u^{n-1}(x) \end{bmatrix} \sim \mathcal{GP}\left( \mathbf{0},\; \begin{bmatrix} k_{n,n} & k_{n,n-1} \\ k_{n-1,n} & k_{n-1,n-1} \end{bmatrix} \right), \]
where, denoting the identity operator by Id,

(17) \[ k_{n,n} = k, \qquad k_{n,n-1} = \left( \mathrm{Id} - \Delta t\, \mathcal{L}_{x'}^\phi \right) k, \qquad k_{n-1,n-1} = \left( \mathrm{Id} - \Delta t\, \mathcal{L}_x^\phi \right)\left( \mathrm{Id} - \Delta t\, \mathcal{L}_{x'}^\phi \right) k. \]
Equation (16) can be compared to equation (8), and equation (17) to (5) and (6). The setups are very similar, and again, $\phi$ has been merged into the hyperparameters of this joint Gaussian process. Given data at spatial points $X_n$ and $X_{n-1}$ for the functions $u^n$ and $u^{n-1}$, represented by the vectors $\mathbf{U}^n$ and $\mathbf{U}^{n-1}$, respectively, the new hyperparameters are trained by employing the same quasi-Newton optimizer L-BFGS as before to minimize the NLML given by equation (9). In this case, $\mathbf{y} = [\mathbf{U}^n; \mathbf{U}^{n-1}]$, $N$ is the total number of data points, and $\mathbf{K}$ is given by
(18) \[ \mathbf{K} = \begin{bmatrix} k_{n,n}(X_n, X_n) + \sigma^2 I & k_{n,n-1}(X_n, X_{n-1}) \\ k_{n-1,n}(X_{n-1}, X_n) & k_{n-1,n-1}(X_{n-1}, X_{n-1}) + \sigma^2 I \end{bmatrix}. \]
Here, $\sigma$ is an additional parameter to learn noise in the data, under the assumption

(19) \[ \mathbf{U}^n = u^n(X_n) + \boldsymbol{\epsilon}^n, \qquad \mathbf{U}^{n-1} = u^{n-1}(X_{n-1}) + \boldsymbol{\epsilon}^{n-1}, \]

with $\boldsymbol{\epsilon}^n \sim \mathcal{N}(\mathbf{0}, \sigma^2 I)$ and $\boldsymbol{\epsilon}^{n-1} \sim \mathcal{N}(\mathbf{0}, \sigma^2 I)$ independent.
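To make the structure of (17) concrete, consider the simple (integer-order) advection operator $\mathcal{L}_x = -c\,\partial_x$ applied to a squared-exponential kernel, for which all required derivatives are available in closed form. The following Python sketch is purely illustrative (the operator, kernel, and function names are our own choices, not the fractional operators treated later):

```python
import numpy as np

def euler_blocks(x, xp, dt, c, l=1.0):
    """Covariance blocks of eq. (17) for backward Euler applied to the
    advection equation u_t = -c u_x (so L = -c d/dx), with the
    squared-exponential kernel k(x,x') = exp(-(x-x')^2 / (2 l^2)).
    Closed-form facts used: with r = x - x',
      d/dx' k = (r/l^2) k,   d/dx k = -(r/l^2) k,
      d2/(dx dx') k = (1/l^2)(1 - r^2/l^2) k."""
    r = x - xp
    k = np.exp(-r ** 2 / (2 * l ** 2))
    k_n_n = k
    # (Id - dt*L_{x'}) k = k + dt*c * d/dx' k
    k_n_nm1 = (1 + dt * c * r / l ** 2) * k
    # (Id - dt*L_x)(Id - dt*L_{x'}) k ; the first-order terms cancel
    # for a stationary kernel, leaving the second-order cross term
    k_nm1_nm1 = (1 + (dt * c) ** 2 * (1 - r ** 2 / l ** 2) / l ** 2) * k
    return k_n_n, k_n_nm1, k_nm1_nm1
```

Assembling these blocks over two point sets yields a symmetric positive semidefinite joint covariance matrix, as it must, since $u^{n-1}$ is a linear transformation of $u^n$.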
3 Fractional Derivatives of Covariance Kernels
Many properties of a Gaussian process are determined by the choice of covariance kernel $k$. In particular, the covariance kernel encodes an assumption about the smoothness of the field that is being interpolated. Stein [35]
writes “…properties of spatial interpolants depend strongly on the local behavior of the random field. In practice, this local behavior is not known and must be estimated from the same data that will be used to do the interpolation. This state of affairs strongly suggests that it is critical to select models for the covariance structures that include at least one member whose local behavior accurately reflects the actual local behavior of the spatially varying quantity under study”. Matérn kernels
(defined below), a family of stationary kernels which includes the exponential kernel for $\nu = 1/2$ and the squared-exponential kernel in the limit $\nu \to \infty$, have been widely used for this reason. A Gaussian process with Matérn covariance kernel of order $\nu$ is $k$ times mean square differentiable for $\nu > k$. We have developed a computational methodology that allows Matérn kernels of arbitrary real order $\nu$ to be used in our Gaussian process, namely in (3) and (15). In fact, the parameter $\nu$ itself can be treated and optimized as a hyperparameter of the Gaussian process, just as the equation parameters were in Section 2. We employ such an algorithm to explore the effect of the parameter $\nu$ when working with rough time series histogram data in Section 6.

The main problem that arises when using fractional operators is the computation of their action on kernels such as the Matérn class, as required by equations (5) and (6) for the time-independent case and (17) for the time-dependent case. This cannot be done analytically and requires a numerical approach, in contrast to the works [27] and [24], where (standard) differential operators applied to kernels were obtained symbolically in closed form using Mathematica. Moreover, unlike standard derivatives, fractional derivatives, whether on the real line, on $\mathbb{R}^d$, or on bounded subsets, are nonlocal operators typically defined by singular integrals or eigenfunction expansions that are difficult and expensive to discretize [14], [18]. However, space-fractional derivatives in $\mathbb{R}^d$ enjoy representations as Fourier multiplier operators; in other words, they are equivalent to multiplication by a function in frequency space. This representation suggests a computational method that avoids any singular integral operators or the solution of extension problems. The downside to Fourier methods is that, if used to compute the fractional derivative of a function, they may require quadrature of a (forward and inverse) Fourier integral. Thus, if one wishes to compute the fractional derivative of a covariance kernel, as in (5), (6), or (17), this may entail high-dimensional quadrature for the nested forward and inverse transforms. These dimensions for quadrature can be cut in half provided the (forward) Fourier transforms of the kernels are known analytically. Moreover, if the covariance kernel is stationary, we see in Theorem 1 below that the computation can be further reduced from a $2d$-dimensional integral to a $d$-dimensional one. Thus, the entire problem of kernel computation is reduced to $d$-dimensional quadrature if the following three conditions are satisfied:

The spatial differential operator $\mathcal{L}_x^\phi$ is a Fourier multiplier operator:

(20) \[ \mathcal{F}\left[ \mathcal{L}_x^\phi u \right](\omega) = m^\phi(\omega)\, \mathcal{F}[u](\omega). \]

This is true for a variety of fractional space derivatives:

(21) \[ \mathcal{L}_x^\phi = \frac{d^\gamma}{dx^\gamma}: \;\; m(\omega) = (i\omega)^\gamma; \qquad \mathcal{L}_x^\phi = \frac{d^\gamma}{d(-x)^\gamma}: \;\; m(\omega) = (-i\omega)^\gamma; \qquad \mathcal{L}_x^\phi = (-\Delta)^{\gamma/2}: \;\; m(\boldsymbol{\omega}) = |\boldsymbol{\omega}|^\gamma, \]

where $\frac{d^\gamma}{dx^\gamma}$ and $\frac{d^\gamma}{d(-x)^\gamma}$ denote the left- and right-sided Riemann-Liouville derivatives on the real line.
Here, and throughout this article, we use the Fourier transform convention

(22) \[ \mathcal{F}[u](\omega) = \int_{\mathbb{R}^d} u(x)\, e^{-i \omega \cdot x}\, dx, \qquad \mathcal{F}^{-1}[\hat u](x) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \hat u(\omega)\, e^{i \omega \cdot x}\, d\omega. \]
The covariance kernel is stationary:

(23) \[ k(x, x'; \theta) = k(x - x'; \theta). \]

This is true of the squared-exponential kernel in one dimension,

(24) \[ k(x, x'; \sigma, l) = \sigma^2 \exp\left( -\frac{(x - x')^2}{2 l^2} \right), \]

as well as of frequently used multivariate squared-exponential kernels, formed by multiplication

(25) \[ k(\mathbf{x}, \mathbf{x}') = \prod_{j=1}^{d} k(x_j, x_j'; \sigma_j, l_j) \]

or addition

(26) \[ k(\mathbf{x}, \mathbf{x}') = \sum_{j=1}^{d} k(x_j, x_j'; \sigma_j, l_j) \]

of the one-dimensional kernel. The same is true for the Matérn kernels, which have one-dimensional form

(27) \[ k_\nu(x, x'; \sigma, l) = \sigma^2\, \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, |x - x'|}{l} \right)^{\nu} K_\nu\!\left( \frac{\sqrt{2\nu}\, |x - x'|}{l} \right), \]

and the corresponding multidimensional kernels

(28) \[ \prod_{j=1}^{d} k_\nu(x_j, x_j'; \sigma_j, l_j) \quad \text{and} \quad \sum_{j=1}^{d} k_\nu(x_j, x_j'; \sigma_j, l_j). \]

The notation $K_\nu$ refers to the modified Bessel function of the second kind; the similarity to the notation for covariance kernels is potentially confusing, but will not be an issue, as we will focus on the Fourier representation of $k_\nu$ in what follows.

The (forward) Fourier transform of the stationary covariance kernel is known analytically. This is satisfied by the squared-exponential kernel and the Matérn kernel (in the machine learning literature, authors such as [31] describe this Fourier transform as the spectral density, in the context of Bochner's theorem on stationary kernels):

(29) \[ \hat k(\omega; \sigma, l) = \sigma^2\, l \sqrt{2\pi}\, e^{- l^2 \omega^2 / 2}, \qquad \hat k_\nu(\omega; \sigma, l) = \sigma^2\, \frac{2 \sqrt{\pi}\, \Gamma(\nu + 1/2)\, (2\nu)^{\nu}}{\Gamma(\nu)\, l^{2\nu}} \left( \frac{2\nu}{l^2} + \omega^2 \right)^{-(\nu + 1/2)}. \]

The same is true for the multidimensional kernels formed by addition, by linearity of the Fourier transform, and for those formed by multiplication, by Fubini's theorem.
Theorem 1
Suppose conditions (1)-(3) on the covariance kernel $k$ and the operator $\mathcal{L}_x^\phi$ are satisfied. Then the fractional derivatives $\mathcal{L}_{x'}^\phi k$, $\mathcal{L}_x^\phi k$, and $\mathcal{L}_x^\phi \mathcal{L}_{x'}^\phi k$ of the covariance kernel can be computed by the $d$-dimensional integrals

(30) \[ \mathcal{L}_{x'}^\phi k\,(x - x') = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} m^\phi(-\omega)\, \hat k(\omega)\, e^{i \omega \cdot (x - x')}\, d\omega, \]
\[ \mathcal{L}_x^\phi k\,(x - x') = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} m^\phi(\omega)\, \hat k(\omega)\, e^{i \omega \cdot (x - x')}\, d\omega, \]
\[ \mathcal{L}_x^\phi \mathcal{L}_{x'}^\phi k\,(x - x') = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} m^\phi(\omega)\, m^\phi(-\omega)\, \hat k(\omega)\, e^{i \omega \cdot (x - x')}\, d\omega. \]
For a proof of this theorem, see Appendix A.
Provided the integrals (30) can be computed numerically at the locations of the data, they can be used to build the kernel matrix $\mathbf{K}$ and evaluate the objective function given by (9). In the quasi-Newton L-BFGS method (discussed in Section 2) that is used to train the extended hyperparameters implicit in $\mathbf{K}$, we supply the gradient of the NLML, the components of which are given by
(31) \[ \frac{\partial\, \mathrm{NLML}}{\partial \theta_j} = \frac{1}{2} \operatorname{tr}\left( \mathbf{K}^{-1} \frac{\partial \mathbf{K}}{\partial \theta_j} \right) - \frac{1}{2}\, \mathbf{y}^{\mathsf{T}} \mathbf{K}^{-1} \frac{\partial \mathbf{K}}{\partial \theta_j} \mathbf{K}^{-1} \mathbf{y}. \]
See [31], §5.4. To obtain $\partial \mathbf{K} / \partial \theta_j$, we note in (30) that the derivative may be passed into the integrand and through the complex exponential factor. The resulting derivative of the product of the multiplier(s) $m^\phi$, which contains the equation parameters, and $\hat k$, which contains the original kernel parameters, can be obtained symbolically as a closed-form expression. The same numerical procedure is used to evaluate the resulting integrals as for (30). The Hessian is not supplied in closed form and is approximated from evaluations of the gradient.
When training a Gaussian process, it is advantageous to standardize the data [40] when possible. In the framework discussed here, this reduces the difficulty of computing the kernel functions in (30) by restricting the frequency content and support of the integrands. When necessary, standardization for the applications considered here can be performed by rescaling the values and positions of the data points by appropriate constants. Once the differential equation is learned for the scaled solution, the true coefficients can be obtained via inverse scaling. This is discussed in detail for an example in Section 7. During training, we expect convergence to a local minimum of the NLML, and we have not found the optimum to depend significantly on the initial guess in our examples, but there is no guarantee of this. Uniqueness, stability, and convergence remain important open questions.
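As a hypothetical illustration of the inverse scaling step (the function and argument names are ours, not notation from this article), suppose a one-term model $c\, D_x^\gamma u = f$ is learned after standardizing $u$, $f$, and $x$ by constant scales. A fractional derivative of order $\gamma$ picks up a factor $x_{\text{scale}}^{-\gamma}$ under the substitution $\tilde x = x / x_{\text{scale}}$, so the learned coefficient can be mapped back as follows:

```python
def unscale_coefficient(c_tilde, gamma, x_scale, u_scale, f_scale):
    """Map the coefficient learned for the standardized equation
    c~ * D^gamma u~ = f~  (with u~ = u/u_scale, f~ = f/f_scale,
    x~ = x/x_scale) back to the original variables c * D^gamma u = f.
    Substituting the scalings gives c = c~ * f_scale * x_scale**gamma / u_scale."""
    return c_tilde * f_scale * x_scale ** gamma / u_scale
```

Note that the fractional order $\gamma$ itself is scale invariant; only the coefficient needs to be unscaled.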
Although numerical calculation of the above integrals can be performed using Gauss-Hermite quadrature, this is not optimal, as the fractional-order monomial $\omega^\gamma$ is not smooth at the origin. To obtain faster convergence with the number of quadrature points, a superior choice is generalized Gauss-Laguerre quadrature, involving a weight function of the form $s^{\alpha} e^{-s}$ for $\alpha > -1$:

(32) \[ \int_0^\infty s^{\alpha} e^{-s} f(s)\, ds \approx \sum_{i=1}^{n} w_i\, f(s_i). \]
Here, $w_i$ are the generalized Gauss-Laguerre weights, and $s_i$ the nodes. In practice, it is essential for $\alpha$ to match the fractional part of the power of the monomial in the integrand, as the remainder then yields a smooth function. We use the Golub-Welsch algorithm to find the nodes, but compute the weights by evaluating the generalized Gauss-Laguerre polynomial at these nodes for higher relative accuracy.
We describe two examples and discuss the convergence of the numerical Gauss-Laguerre quadrature in each one. First, consider the left-sided Riemann-Liouville derivative in one dimension, which involves the multiplier $m(\omega) = (i\omega)^\gamma$ in the above formulas. Owing to the symmetry of $\hat k$, one can write

(33) \[ \mathcal{L}_x^\phi k\,(r) = \frac{1}{2\pi} \int_{-\infty}^{\infty} (i\omega)^\gamma\, \hat k(\omega)\, e^{i \omega r}\, d\omega \]
(34) \[ = \frac{1}{2\pi} \int_{-\infty}^{\infty} |\omega|^\gamma\, e^{i \frac{\pi\gamma}{2} \operatorname{sgn}(\omega)}\, \hat k(\omega)\, e^{i \omega r}\, d\omega \]
(35) \[ = \frac{1}{\pi} \int_{0}^{\infty} \omega^\gamma\, \hat k(\omega) \left[ \cos\!\left( \frac{\pi\gamma}{2} \right) \cos(\omega r) - \sin\!\left( \frac{\pi\gamma}{2} \right) \sin(\omega r) \right] d\omega, \]

where $r = x - x'$, and

(36) \[ \mathcal{L}_{x'}^\phi k\,(r) = \frac{1}{\pi} \int_{0}^{\infty} \omega^\gamma\, \hat k(\omega) \left[ \cos\!\left( \frac{\pi\gamma}{2} \right) \cos(\omega r) + \sin\!\left( \frac{\pi\gamma}{2} \right) \sin(\omega r) \right] d\omega. \]

Similarly,

(37) \[ \mathcal{L}_x^\phi \mathcal{L}_{x'}^\phi k\,(r) = \frac{1}{\pi} \int_{0}^{\infty} \omega^{2\gamma}\, \hat k(\omega)\, \cos(\omega r)\, d\omega. \]
These integrals call for generalized Gauss-Laguerre quadrature with $\alpha$ equal to the fractional part of $\gamma$ for (35) and (36), and of $2\gamma$ for (37). Using a Matérn kernel as an example, the convergence of the error with the number of quadrature points is shown in Table 1. The MATLAB integral function is used for reference when computing the error. The setup for working with the right-sided Riemann-Liouville derivative, or for the one-dimensional fractional Laplacian, is entirely similar.
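To illustrate the pattern, the following Python sketch evaluates the integral (37) for a squared-exponential kernel $k(r) = e^{-r^2/(2l^2)}$, whose Fourier transform $\hat k(\omega) = l\sqrt{2\pi}\, e^{-l^2\omega^2/2}$ is available in closed form (the function names are ours, and the kernel is chosen so the result can be checked analytically; the Matérn case only changes the $\hat k$ line):

```python
import numpy as np
from math import gamma

def gauss_laguerre(n, alpha):
    """Generalized Gauss-Laguerre nodes/weights via Golub-Welsch."""
    k = np.arange(n)
    off = np.sqrt(k[1:] * (k[1:] + alpha))
    J = np.diag(2 * k + alpha + 1.0) + np.diag(off, 1) + np.diag(off, -1)
    s, V = np.linalg.eigh(J)
    return s, gamma(alpha + 1) * V[0, :] ** 2

def kff_rl_se(r, gam, l=1.0, n=128):
    """Formula (37) for the squared-exponential kernel:
    (1/pi) * int_0^inf w^{2 gam} khat(w) cos(w r) dw.
    The Laguerre weight s^alpha e^{-s}, with alpha the fractional part
    of 2*gam, is factored out; the exp(s) compensation is folded into
    a single exponential to avoid overflow at large nodes."""
    power = 2.0 * gam
    alpha = power % 1.0
    s, w = gauss_laguerre(n, alpha)
    g = (s ** (power - alpha) * l * np.sqrt(2 * np.pi)
         * np.exp(s - (l * s) ** 2 / 2) * np.cos(s * r))
    return (w @ g) / np.pi
```

Two consistency checks: at $\gamma = 0$ the formula must return $k(r)$ itself, and at $\gamma = 1$ (so $\omega^{2\gamma} = \omega^2$) it must return $\partial_x \partial_{x'} k = -k''(r) = (1/l^2 - r^2/l^4)\,k(r)$.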
Next we consider the fractional Laplacian in two dimensions, which involves the multiplier $|\boldsymbol{\omega}|^\gamma$. We transform the integrals into polar coordinates:

(38) \[ \mathcal{L}_x^\phi k\,(\mathbf{r}) = \frac{1}{(2\pi)^2} \int_{0}^{2\pi}\! \int_{0}^{\infty} \rho^{\gamma + 1}\, \hat k(\rho \cos\theta, \rho \sin\theta)\, \cos\!\left( \rho\, (r_1 \cos\theta + r_2 \sin\theta) \right) d\rho\, d\theta, \]
(39) \[ \mathcal{L}_x^\phi \mathcal{L}_{x'}^\phi k\,(\mathbf{r}) = \frac{1}{(2\pi)^2} \int_{0}^{2\pi}\! \int_{0}^{\infty} \rho^{2\gamma + 1}\, \hat k(\rho \cos\theta, \rho \sin\theta)\, \cos\!\left( \rho\, (r_1 \cos\theta + r_2 \sin\theta) \right) d\rho\, d\theta. \]
The quadrature of these integrals is performed using the trapezoid rule in $\theta$ and generalized Gauss-Laguerre quadrature in $\rho$, with $\alpha$ equal to the fractional part of $\gamma + 1$ for (38) and of $2\gamma + 1$ for (39). Using the product multivariate Matérn kernel as an example, convergence with respect to the number of quadrature points in $\rho$ is shown in Table 2, where 64 quadrature points in $\theta$ and 8, 16, 32, and 64 quadrature points in $\rho$ are used. Reference answers were generated by nesting the MATLAB integral function.
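The polar-coordinate version can be sketched the same way. The snippet below (illustrative names of our own; a two-dimensional Gaussian kernel, with closed-form transform $\hat k(\boldsymbol{\omega}) = 2\pi e^{-|\boldsymbol{\omega}|^2/2}$, stands in for the Matérn kernel used in the paper) evaluates formula (38) with a periodic trapezoid rule in $\theta$ and generalized Gauss-Laguerre quadrature in $\rho$:

```python
import numpy as np
from math import gamma

def gauss_laguerre(n, alpha):
    """Generalized Gauss-Laguerre nodes/weights via Golub-Welsch."""
    k = np.arange(n)
    off = np.sqrt(k[1:] * (k[1:] + alpha))
    J = np.diag(2 * k + alpha + 1.0) + np.diag(off, 1) + np.diag(off, -1)
    s, V = np.linalg.eigh(J)
    return s, gamma(alpha + 1) * V[0, :] ** 2

def frac_laplacian_2d(r1, r2, gam, n_rho=128, n_theta=64):
    """(-Delta)^(gam/2) of k(x) = exp(-|x|^2/2) in 2D at r = (r1, r2),
    via formula (38): trapezoid rule in theta (spectrally accurate for
    periodic integrands) and Gauss-Laguerre in rho with
    alpha = frac(gam + 1)."""
    alpha = (gam + 1.0) % 1.0
    rho, w = gauss_laguerre(n_rho, alpha)
    theta = 2 * np.pi * np.arange(n_theta) / n_theta
    # radial factor rho^{gam+1} * khat(rho), with the Laguerre weight
    # rho^alpha e^{-rho} factored out (one exponential avoids overflow)
    radial = w * rho ** (gam + 1.0 - alpha) * 2 * np.pi * np.exp(rho - rho ** 2 / 2)
    phase = np.cos(np.outer(rho, r1 * np.cos(theta) + r2 * np.sin(theta)))
    return (radial @ phase).sum() * (2 * np.pi / n_theta) / (2 * np.pi) ** 2
```

As before, $\gamma = 0$ must reproduce $k(\mathbf{r})$ and $\gamma = 2$ must reproduce $-\Delta k = (2 - |\mathbf{r}|^2)\, e^{-|\mathbf{r}|^2/2}$.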
The examples in the following sections are built by defining the Matérn kernels in Mathematica and performing derivatives with respect to the hyperparameters symbolically. The expressions are ported into MATLAB code, where Gauss-Laguerre quadrature with appropriate $\alpha$ is used to compute the NLML (9) as well as the derivatives of the NLML with respect to the parameters. For our purposes, 64 quadrature points in one dimension and 64×64 quadrature points (as described above) in two dimensions are sufficient.
Table 1. Error of the generalized Gauss-Laguerre quadrature versus number of quadrature points (one-dimensional Matérn kernel functions; the MATLAB integral function is used as reference).

Function          8           16          32          64
          1.190e-02   3.468e-04   2.917e-06   2.876e-07
   1      5.126e-03   8.029e-06   3.741e-07   3.204e-07
  10      8.163e-02   1.446e-02   2.378e-04   4.188e-06
          4.752e-02   1.991e-03   2.822e-05   3.315e-06
   1      6.179e-03   9.528e-05   2.524e-07   1.634e-07
  10      6.937e-02   9.915e-03   3.456e-04   2.828e-06
          1.624e-01   8.980e-03   2.277e-04   3.449e-06
   1      1.006e-03   1.515e-04   9.092e-07   9.642e-07
  10      6.571e-02   4.774e-03   3.370e-04   1.217e-06
Table 2. Error of the quadrature for the two-dimensional product Matérn kernel versus number of quadrature points in $\rho$, with 64 quadrature points in $\theta$ (reference answers from nested MATLAB integral calls).

Function          8           16          32          64
          5.406e-02   6.201e-04   8.248e-06   9.939e-05
   1      1.368e-02   3.611e-04   2.659e-06   2.955e-06
  10      7.955e-01   8.385e-02   3.873e-03   2.006e-05
          2.025e-01   3.922e-03   6.705e-05   3.842e-05
   1      2.769e-02   9.611e-04   8.921e-06   8.977e-06
  10      3.717e-01   2.013e-02   3.340e-03   1.515e-05
          4.759e-01   2.644e-02   5.154e-04   7.347e-05
   1      8.117e-02   1.449e-03   8.740e-06   9.848e-06
  10      8.713e-02   4.553e-02   1.535e-03   8.209e-06
4 A Basic Example
In this section, we will illustrate the methodology by discovering the parameters $c$ and $\gamma$ in the fractional elliptic equation

(40) \[ c\, (-\Delta)^{\gamma/2} u(x) = f(x) \]

in one and two space dimensions from data on $u$ and $f$.
Following the time-independent framework, this means that we must optimize the negative log marginal likelihood (9), where, in the covariance matrix (10), the kernels are given by the integral formulas (30). In the latter formulas, the multiplier and the Fourier transform of the stationary prior kernel must be specified; the multiplier corresponding to the operator in (40) is $c\,|\omega|^\gamma$ in one dimension and $c\,|\boldsymbol{\omega}|^\gamma$ in two dimensions. We choose to use the Matérn kernel given by (27) in one dimension, with a tensor product (28) of Matérn kernels in two dimensions; the required Fourier transforms are then given by equations (29). This completes the description of the covariance kernel.

In the one-dimensional case the solution/RHS pair
(41) 
is used. Two data sets are generated for two experiments; in both, we use the above solution/RHS pair to generate the data. The Matérn kernel with fixed $\nu$ is used to discover the parameters, and initial values are chosen for the parameters in the optimization. In the first experiment, 7 data points for $u$ and 11 data points for $f$ are generated via Latin hypercube sampling; no noise is added. In the second experiment, 20 data points for each of $u$ and $f$ are generated in the same way, but normal random noise of standard deviation 0.1 for $u$ and 0.2 for $f$ is added to the data.

The GP regression for the first experiment is shown in Figure 1. The equation parameters recovered are within 1% of the true values. The GP regression for the second experiment is shown in Figure 2. There, we have also plotted twice the standard deviation of the Gaussian process plus twice the standard deviation of the learned noise (in the previous experiment, these learned noise parameters were minuscule). The parameters learned in the second experiment are within 7% and 6% of the true values, respectively.
In the two-dimensional example, the exact solution pair used to generate data is now
(42) 
Again, the Matérn kernel with fixed $\nu$ is used, with initial values chosen for the optimization as before. We take 40 data points for each of $u$ and $f$, generated using Latin hypercube sampling. The result of the training is shown in Figure 3. The learned equation parameters are close to the true values.
These examples demonstrate the feasibility of implementing the fractional kernels as described in the previous sections and the accuracy of the discovered parameters, even with noise or small data. Moreover, there is no theoretical difficulty in increasing the dimension of the problem. However, in addition to longer runtime for the computation of formulas (30), the user should expect significantly more data to be required for accurate parameter estimation. For example, performing the same two-dimensional example above with only 20 data points for each of $u$ and $f$ results in one learned parameter exhibiting an error of roughly 20%. In the analogous one-dimensional example (Fig. 1), a far smaller error is obtained for less than half this number of data points.
In concluding this section, we point out that, since the estimation of the equation parameters is based on accurate Gaussian process regression through the training data, there are various well-known hazards to avoid. In addition to obvious issues such as unresolved data and data that is too noisy, too much data in featureless “flat” regions carries the risk of overfitting, and should be avoided.
5 Discovering and Interpolating Integer Order Models
As discussed in the introduction, fractional-order differential operators interpolate between classical differential operators of integer order, reducing the task of choosing a “dictionary” of operators of various orders and the assumptions that this entails. The user controls the parsimony directly, by choosing the number of fractional terms in the candidate equation. For example, the model in Section 4 was constrained to be of parsimony one, since it includes a single differential operator of fractional order. This raises several questions, such as: Can the method be used to discover integer-order operators when they drive the true dynamics? What can be expected if the user-selected parsimony is lower than the parsimony of the “true” model driving the data? Can lower-parsimony models still be used to model the dynamics? To explore these questions, we consider the parametrized model
(43) \[ \frac{\partial u}{\partial t} = c\, \frac{\partial^\gamma u}{\partial x^\gamma} \]

for $\gamma \in [1, 2]$, where the left-sided Riemann-Liouville derivative $\frac{\partial^\gamma}{\partial x^\gamma}$ was defined via its multiplier in (21). Note that

(44) \[ (i\omega)^\gamma = i\omega \;\text{ for }\; \gamma = 1, \qquad (i\omega)^\gamma = -\omega^2 \;\text{ for }\; \gamma = 2. \]

For $\gamma = 1$, equation (43) reduces to the advection equation (with speed $-c$)

(45) \[ \frac{\partial u}{\partial t} = c\, \frac{\partial u}{\partial x}, \]

while for $\gamma = 2$, it reduces to the diffusion equation (with diffusion coefficient $c$)

(46) \[ \frac{\partial u}{\partial t} = c\, \frac{\partial^2 u}{\partial x^2}. \]
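The interpolation property can be checked numerically. The sketch below (a hypothetical helper of our own, using the FFT rather than the Gauss-Laguerre machinery of Section 3, purely as an independent check) applies the left-sided Riemann-Liouville derivative on a periodic grid through its Fourier multiplier $(i\omega)^\gamma$; for a function decaying well inside the domain, $\gamma = 1$ reproduces $u'$ and $\gamma = 2$ reproduces $u''$, with fractional $\gamma$ interpolating between them:

```python
import numpy as np

def rl_derivative_fft(u, x, gam):
    """Left-sided Riemann-Liouville derivative of samples u on the
    uniform grid x, via the Fourier multiplier (i*omega)^gam and the
    FFT.  Periodic approximation: accurate when u decays to ~0 well
    inside the domain."""
    n = len(x)
    omega = 2 * np.pi * np.fft.fftfreq(n, d=x[1] - x[0])
    mult = np.empty(n, dtype=complex)
    nz = omega != 0
    mult[nz] = (1j * omega[nz]) ** gam   # principal branch
    mult[~nz] = 0.0
    return np.fft.ifft(mult * np.fft.fft(u)).real
```

For a Gaussian $u(x) = e^{-x^2/2}$ on a wide grid, the $\gamma = 1$ result matches $u'(x) = -x\,u(x)$ and the $\gamma = 2$ result matches $u''(x) = (x^2 - 1)\,u(x)$ to near machine precision, while intermediate $\gamma$ yields a one-parameter family of profiles between them.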
We perform four experiments. Importantly, we learn these equations using the time-stepping methodology, optimizing (9), where the kernel blocks are given by (17) and (18). We take a large fixed $\nu$ (effectively a squared-exponential kernel) and again use (30) with generalized Gauss-Laguerre quadrature to evaluate the action of the fractional derivatives on the kernel. In all of the experiments, we generate data satisfying a fixed initial condition at two time slices $t_{n-1}$ and $t_n$, and the GP is trained on 30 data points for each of these time slices. The experiments are

Data generated from the solution to the integer-order advection-diffusion equation

(47) \[ \frac{\partial u}{\partial t} = c_1\, \frac{\partial u}{\partial x} + c_2\, \frac{\partial^2 u}{\partial x^2}. \]

We learn the order and coefficient in (43). Note that this archetype is limited to only one spatial differential operator; we will see that the algorithm will select the best order to capture both the advection and the diffusion in the data.

The same advection-diffusion data as in Experiment 3, but with the two-term, four-parameter candidate equation

(48) \[ \frac{\partial u}{\partial t} = c_1\, \frac{\partial^{\gamma_1} u}{\partial x^{\gamma_1}} + c_2\, \frac{\partial^{\gamma_2} u}{\partial x^{\gamma_2}}. \]
We note that all of these experiments call for only a single fractional-order dictionary term; even Experiment 4 uses two copies of the same archetype. Experiments 1-3 use one set of initial parameters for $(\gamma, c)$, and Experiment 4 uses initial values for both pairs $(\gamma_1, c_1)$ and $(\gamma_2, c_2)$.
In Experiments 1 and 2, we note that fractional-order parameters are discovered close to (within 5% of) the true integer-order parameters. In this sense, the true dynamics can be considered recovered. The numerical difference from the true parameters, despite the high number of data points (30 per slice), is likely due to approximation error from the backward Euler approximation (14), and may be resolved by using higher-order differentiation with more time slices [25], as simply taking $\Delta t$ to be extremely small may cause the optimization to be dominated by numerical error in computing the kernels.
Experiment 3 shows what occurs when the user-defined parsimony is less than that of the true dynamics used to generate the data. The optimizer still converges, and to a sensible answer: the result can be interpreted as an interpolation of the two integer-order operators in the true dynamics. Moreover, as shown in Figure 4, near the time of the data used to train the model, and even much later, the fractional dynamics are a good approximation to the true dynamics, while being simpler in the sense of being driven by only one spatial derivative. This suggests potential applications of interpolating complex systems using lower-parsimony fractional models.
In Experiment 4, the parsimony was increased with the inclusion of an additional independent copy of the fractional archetype. The advection-diffusion equation is recovered with parameters close to the true values. Thus, with a single fractional archetype and user-controlled parsimony, it is possible to discover advection, diffusion, advection-diffusion, as well as a single-term fractional interpolation of advection-diffusion.
Exp.  Data  Candidate  Parameters Learned 

1  Advection: 