The interest in using machine learning as an alternative to the classical numerical analysis methods [26, 11, 25, 65] for the solution of the inverse [42, 50, 14, 29, 69, 68, 1], and forward problems [46, 21, 52, 27, 44] in differential equations modelling dynamical systems can be traced back three decades ago. Today, this interest has been boosted together with our need to better understand and analyse the emergent dynamics of complex multiphysics/ multiscale dynamical systems of fundamental theoretical and technological importance . The objectives are mainly two. First, that of the solution of the inverse problem, i.e. that of identifying/discovering the hidden macroscopic laws, thus learning nonlinear operators and constructing coarse-scale dynamical models of ODEs and PDEs and their closures, from microscopic large-scale simulations and/or from multi-fidelity observations [10, 57, 58, 59, 62, 9, 3, 47, 74, 15, 16, 48]. Second, based on the constructed coarse-scale models, to systematically investigate their dynamics by efficiently solving the corresponding differential equations, especially when dealing with (high-dimensional) PDEs [24, 13, 15, 16, 22, 23, 38, 49, 59, 63]. Towards this aim, physics-informed machine learning [57, 58, 59, 48, 53, 15, 16, 40]
has been addressed to integrate available/incomplete information from the underlying physics, thus relaxing the “curse of dimensionality”. However, failures may arise at the training phase especially in deep learning formulations, while there is still the issue of the corresponding computational cost[45, 75, 76, 40]. Thus, a bet and a challenge is to develop physics-informed machine learning methods that can achieve high approximation accuracy at a low computational cost.
, for the numerical solution of initial-value problems of nonlinear ODEs and index-1 DAEs as these may also arise from the spatial discretization of PDEs. Our scheme consists of a single hidden layer, with Gaussian kernels, in which the weights between the input and hidden layer are fixed to ones. The shape parameters of the Gaussian kernels are random variables drawn from a uniform distribution which bounds are “parsimoniously” chosen based on the expected bias-variance trade-off
. The unknown parameters, i.e. the weights between the hidden and the output layer are estimated by solving with Newton-type iterations a system of nonlinear algebraic equations. For low to medium scale systems this is achieved using SVD decomposition, while for large scale systems, we exploit a sparse QR factorization algorithm withregularization . Furthermore, to facilitate the convergence of Newton’s iterations, especially at very stiff regimes and regimes with very sharp gradients, we (a) propose a variable step size scheme for adjusting the interval of integration based on the elementary local error control algorithm , and (b) address a natural continuation method for providing good initial guesses for the unknown weights. To demonstrate the performance of the proposed method in terms of both approximation accuracy and computational cost, we have chosen seven benchmark problems, four index-1 DAEs and three stiff problems of ODEs, thus comparing it with the ode23s, ode23t and ode15s solvers of the MATLAB suite ODE . In particular, we considered the index-1 DAE Robertson model describing the kinetics of an autocatalytic reaction [60, 66], an index-1 DAEs model describing the motion of a bead on a rotating needle , an index-1 DAEs model describing the dynamics of a power discharge control problem , the index-1 DAE chemical Akzo Nobel problem [51, 72], the Belousov-Zabotinsky chemical kinetics stiff ODEs [7, 77], the Allen-Chan phase-field PDE describing the process of phase separation for generic interfaces  and Kuramoto-Sivashinsky PDE [43, 70, 73]. The PDEs are discretized in space with central finite differences, thus resulting to a system of stiff ODEs . The results show that the proposed scheme outperforms the aforementioned solvers in several cases in terms of numerical approximation accuracy, especially in problems where stiffness and sharp gradients arise, while the required computational times are comparable for any practical purposes.
First, we describe the problem, and present some preliminaries on the use of machine learning for the solution of differential equations and on the concept of random projections for the approximation of continuous functions. We then present the proposed physics-informed random projection neural network scheme, and building on previous works , we prove that in principle the proposed PIRPNN can approximate with any given accuracy any unique continuously differentiable function that satisfies the Picard-Lindelöf Theorem. Finally, we address (a) a variable step size scheme for adjusting the interval of integration and (b) a natural continuation method, to facilitate the convergence of Newton iterations, especially in regimes where stiffness and very sharp gradients arise.
2.1 Description of the Problem
We focus on IVPs of ODEs and index-1 DAEs that may also arise from the spatial discretization of PDEs using for example finite differences, finite elements and spectral methods. In particular, we consider IVPs in the linear implicit form of:
denotes the set of the states , is the so-called mass matrix with elements , denotes the set of Lipschitz continuous multivariate functions, say in some closed domain , and are the initial conditions. When , the system reduces to the canonical form. The above formulation includes problems of DAEs when is a singular matrix, including semiexplicit DAEs in the form :
where, we assume that the Jacobian is nonsingular, .
In this work, we use physics-informed random projection neural networks for the numerical solution of the above type of IVPs which solutions are characterized both by sharp gradients and stiffness [64, 66].
Stiff problems are the ones which integration “with a code that aims at non stiff problems proves conspicuously inefficient for no obvious reason (such a severe lack of smoothness in the equation or the presence of singularities)”. At this point, it is worthy to note that stiffness is not connected to the presence of sharp gradients . For example, at the regimes where the relaxation oscillations of the van der Pol model exhibit very sharp changes resembling discontinuities, the equations are not stiff.
2.2 Physics-informed machine learning for the solution of differential equations
Let’s assume a set of points of the independent (spatial) variables, thus defining the size grid in the domain , points along the boundary of the domain and points in the time interval, where the solution is sought. For our illustrations, let’s consider a time-dependent PDE in the form of
where is the partial differential operator acting on satisfying the boundary conditions , where is the boundary differential operator. Then, the solution with machine learning of the above PDE involves the solution of a minimization problem of the form:
where represents a machine learning constructed function approximating the solution at at time and is a machine learning algorithm; contains the parameters of the machine learning scheme (e.g. for a FNN the internal weights , the biases , the weights between the last hidden and the output layer ),
contains the hyper-parameters (e.g. the parameters of the activation functions for a FNN, the learning rate, etc.). In order to solve the optimization problem (4), one usually needs quantities such as the derivatives of with respect to and the parameters of the machine learning scheme (e.g. the weights and biases for a FNN). These can be obtained in several ways, numerically using finite differences or other approximation schemes, or by symbolic or automatic differentiation [5, 49]). The above approach can be directly implemented also for solving systems of ODEs as these may also arise by discretizing in space PDEs. Yet, for large scale problems even for the simple case of single layer networks, when the number of hidden nodes is large enough, the solution of the above optimization problem is far from trivial. Due to the “curse of dimensionality” the computational cost is high, while when dealing with differential equations which solutions exhibit sharp gradients and/or stiffness several difficulties or even failures in the convergence have been reported [45, 75, 76].
2.3 Random Projection Neural Networks
Random projection neural networks (RPNN) including Random Vector Functional Link Networks (RVFLNs)[36, 35], Echo-State Neural Networks and Reservoir Computing , and Extreme Learning Machines [34, 33] have been introduced to tackle the “curse of dimensionality” encountered at the training phase. One of the fundamental works on random projections is the celebrated Johnson and Lindenstrauss Lemma  stating that for a matrix containing , points in , there exists a projection defined as:
has components which are i.i.d. random variables sampled from a normal distribution, which mapsinto a random subspace of dimension , where the distance between any pair of points in the embedded space is bounded in the interval .
Regarding single layer feedforward neural networks (SLFNNs), in order to improve the approximation accuracy, Rosenblatt  suggested the use of randomly parametrized activation functions for single layer structures. Thus, the approximation of a sufficiently smooth function is written as a linear combination of appropriately randomly parametrized family of basis functions as
where are the weighting coefficients of the inputs, are the biases and are the shape parameters of the basis functions.
More generally, for SLFNNs with inputs, outputs and neurons in the hidden layer, the random projection of samples in the -dimensional input space can be written in a matrix-vector notation as:
is a random matrix containing the outputs of the hidden layer as shaped by thesamples in the -dimensional space, the randomly parametrized internal weights , the biases and shape parameters of the activation functions; is the matrix containing the weights between the hidden and the output layer.
In the early ’90s, Barron 
proved that for functions with integrable Fourier transformations, a random sample of the parameters of sigmoidal basis functions from an appropriately chosen distribution results to an approximation error of the order of. Igelnik and Pao  extending Barron’s proof  for any family of integrable basis functions proved the following theorem for RVFLNs:
SLFNNs as defined by Eq.(6) with weights and biases selected randomly from a uniform distribution and for any family of integrable basis functions , are universal approximators of any Lipschitz continuous function defined in the standard hypercube and the expected rate of convergence of the approximation error, i.e., the distance between and on any compact set defined as:
is of the order of , where ; denotes expectation over a probabilistic space , is the support of in .
be a probability distribution drawn i.i.d. onand consider the basis functions that satisfy . Define the set of functions
and let be a measure on . Then, if one takes a function in , and values of the shape parameter are drawn i.i.d. from , for any
with probability at leastover , there exists a function defined as so that
For the so-called Extreme-Learning machines (ELMs), which similarly to RVFLNs are FNNs with randomly assigned internal weights and biases of the hidden layers, Huang et al. [34, 33] has proved the following theorem:
For any set of input-output pairs (), the projection matrix in Eq.(7) is invertible and
with probability 1 for , , randomly chosen from any probability distribution, and an activation function that is infinitely differentiable.
2.4 The Proposed Physics-Informed RPNN-based method for the solution of ODES and index-1 DAEs
Here, we propose a physics-informed machine learning method based on random projections for the solution of IVPs of a system given by Eq.(1)/(2) in collocation points in an interval, say . According to the previous notation, for this problem we have . Thus, Eq.(7) reads:
Thus, the output of the RPNN is spanned by the range , i.e. the column vectors of , say . Hence, the output of the RPNN can be written as:
For an IVP of variables, we construct such PIRPNNs. We denote by the set of functions that approximate the solution profile at time , defined as:
where is the column vector containing the values of the basis functions at time as shaped by and containing the values of the parameters of the basis functions and is the vector containing the values of the output weights of the -th PIRPNN network. Note that the above set of functions are continuous functions of and satisfy explicitly the initial conditions.
For index-1 DAEs, with say , or in the semiexplicit form of (2), there are no explicit initial conditions for the variables , or the variables in (2): these values have to satisfy the constraints (equivalently ) , thus one has to start with consistent initial conditions. Assuming that the corresponding Jacobian matrix of the with respect to , and for the semiexplicit form (2), , is not singular, one has to solve initially at , using for example Newton-Raphson iterations, the above nonlinear system of algebraic equations in order to find a consistent set of initial values. Then, one can write the approximation functions of the / variables as in Eq.(14).
With collocation points in , by fixing the values of the interval weights and the shape parameters
, the loss function that we seek to minimize with respect to the unknown coefficientsis given by:
When the system of ODEs/DAEs results from the spatial discretization of PDEs, we assume that the corresponding boundary conditions have been appropriately incorporated into the resulting algebraic equations explicitly or otherwise can be added in the loss function as algebraic constraints.
Based on the above notation, we construct PIRPNNs, taking, for each network , Gaussian RBFs which, for , are given by:
The values of the (hyper) parameters, namely , are set as:
while the values of the shape parameters are sampled from an appropriately chosen uniform distribution. Under the above assumptions, the time derivative of is given by:
2.4.1 Approximation with the PIRPNN
At this point, we note that in Theorems 2.1 and 2.3, the universal approximation property is based on the random base expansion given by Eq.(6), while in our case, we have a slightly different expansion given by Eq.(14). Here, we show that the PIRPNN given by Eq.(14) is a universal approximator of the solution of the ODEs in canonical form or of the index-1 DAEs in the semiexplicit form (2).
For the IVP problem (1) in the canonical form or in the semiexplicit form (2), the PIRPNN solution given by Eq.(14) with Gaussian basis functions defined by Eq.(16), and the values of the shape parameters drawn i.i.d. from a uniform distribution converges uniformly to the solution profile in a closed time interval .
Assuming that the system in Eq.(1) can be written in the canonical form, the Picard-Lindelöf Theorem  holds true, then it exists a unique continuously differentiable function defined on a closed time interval given by:
From Eq.(14) we have:
By the change of variables, , the integral in Eq.(18) becomes
Thus, in fact, upon convergence, the PIRPNN provides an approximation of the normalized integral. By Theorem 3, we have that in the interval , the PIRPNN with the shape parameter of the Gaussian kernel drawn i.i.d. from a uniform distribution provides, a uniform approximation of the integral in Eq.(18) in terms of a Monte Carlo integration method as also described in .
Hence, as the initial conditions are explicitly satisfied by , we have from Eq.(10) an upper bound for the uniform approximation of the solution profile with probability .
For index-1 DAEs in the semiexplicit form of (2), by the implicit function theorem, we have that the DAE system is in principle equivalent with the ODE system in the canonical form:
where is the unique solution of . Hence, in that case, the proof of convergence reduces to the one above for the ODE system in the canonical form. ∎
2.4.2 Computation of the unknown weights
For collocation points, the outputs of each network , , are given by:
The minimization of the loss function (15) is performed over the nonlinear residuals :
where , , , is the column vector obtained by collecting the values of all vectors , . Thus, the solution to the above non-linear least squares problem can be obtained, e.g. with Newton-type iterations such as Newton-Raphson, Quasi-Newton and Gauss-Newton methods (see e.g. ). For example, by setting , the new update at the -th Gauss-Newton iteration is computed by the solution of the linearized system:
where is the Jacobian matrix of with respect to . Note that the residuals depend on the derivatives and the approximation functions , while the elements of the Jacobian matrix depend on the derivatives of as well as on the mixed derivatives . Based on (17), the latter are given by
Based on the above, the elements of the Jacobian matrix can be computed analytically as:
where, as before, and .
However, in general, even when , the Jacobian matrix is expected to be rank deficient, or nearly rank deficient, since some of the rows due to the random construction of the basis functions can be nearly linear dependent. Thus, the solution of the corresponding system, and depending on the size of the problem, can be solved using for example truncated SVD decomposition or QR factorization with regularization. The truncated SVD decomposition scheme results to the Moore-Penrose pseudoinverse and the updates are given by:
is the inverse of the diagonal matrix with singular values ofabove a certain value , and ,
are the matrices with columns the corresponding left and right eigenvectors, respectively. If we have already factorizedat an iteration , then in order to decrease the computational cost, one can proceed with a Quasi-Newton scheme, thus using the same pseudoinverse of the Jacobian for the next iterations until convergence, e.g., computing the SVD decomposition for only the first and second Newton-iterations and keep the same for the next iterations.
For large-scale sparse Jacobian matrices, as those arising for example from the discretization of PDEs, one can solve the regularization problem using other methods such as sparse QR factorization. Here, to account for the ill-posed Jacobian, we have used a sparse QR factorization with regularization as implemented by SuiteSparseQR, a multifrontal multithreaded sparse QR factorization package [18, 19].
2.4.3 Parsimonious construction of the PIRPNN
The variable step size scheme.
In order to deal with the presence of sharp gradients that resemble singularities at the time interval of interest, and stiffness, we propose an adaptive scheme for adjusting the step size of time integration as follows. The full time interval of integration is divided into sub-intervals, i.e., , where are determined in an adaptive way. This decomposition of the interval leads to the solution of consecutive IVPs. In order to describe the variable step size scheme, let us assume that we have solved the problem up to interval , hence we have found and we are seeking in the current interval with a width of . Moreover suppose that the Quasi-Newton iterations after a certain number of iterations, say, (here ) the resulting approximation error is [26, 71]:
where is the tolerance relative to the size of each derivative component and is a threshold tolerance. Now, if the solution is accepted, otherwise the solution is rejected.
In both cases, the size of the time interval will be updated according to the elementary local error control algorithm :
where is a scaling factor and is a safe/conservative factor. Also should not be allowed to increase or decrease too much, so should not be higher than a (here set to ) and to be smaller than a (here set to ).
Thus, if the Quasi-Newton scheme does not converge to a specific tolerance within a number of iterations , then the interval width is decreased, thus redefining a new guess for and the Quasi-Newton scheme is repeated in the interval .
Regarding the interplay between the number of collocation points and the time interval of integration as shaped by the variable step
size scheme, we note that for band-limited profile solutions, according to the Nyquist sampling theorem,
we can reconstruct the solution if the sampling rate is at least samples per second; for sampling points in a time interval say , the maximum allowed frequency for band-limited signals should not exceed . For any practical purposes, a sampling rate of at least 4 or even 10 times the critical frequency is required in order to deal with phenomena such as aliasing and response spectrum shocks.
For time-limited signals, thus, assuming that the energy of the signal in the time domain is bounded i.e., , then for all practical purposes, the critical frequency can be set as the frequency beyond which the amplitude of the Fourier transform can be considered negligible, i.e., lower than a certain threshold, say .
A continuation method for Newton’s iterations.
For Newton-type schemes, the speed of the convergence to (or the divergence from) the solution depends on the choice of the initial guess, here for the unknown weights. Thus, we address a numerical natural continuation method for providing “good” initial guesses for the weights of the PIRPNN. Suppose that we have already converged to the solution in the interval ; we want to provide for the next time interval , as computed from the proposed adaptation scheme described above, a good initial guess for the weights of the PIRPNN. We state the following proposition.
Let be the solution found with PIRPNN at the end of the time interval . Then, an initial guess for the weights of the PIRPNN for the time interval is given by:
where is the matrix with the initial guess of the output weights of the PIRPNNs and is the vector containing the values of the random basis functions in the interval .
At time , a first-order estimation of the solution is given by:
where is known. For the next time interval , the approximation of the solution with the PIRPNNs reads:
It can be easily seen, that the economy SVD decomposition of the -dimensional vector is given by:
Thus, the pseudo-inverse of , is . Hence, by Eq.(33), an initial guess for the weights for the time interval is given by:
Estimation of the interval of uniform distribution based on the variance/bias trade-off decomposition
Based on the choice of Gaussian basis functions, from Eqs.(6), (13) one has to choose the number of the basis functions, and the interval of the uniform distribution say , from which the values of the shape parameters are drawn. The theorems of uniform convergence (sections 2.3 and 2.4.1) consider the problem from the function approximation point of view. Regarding the approximation of a discrete set of data points, it has been proved that a set of randomly and independently constructed vectors in the hypercube will be pair-wise -orthogonal (i.e., ) with probability , where is sufficiently small for .
Here, we construct random vectors by parsimoniously sampling the values of the shape parameter from an appropriately bounded uniform interval for minimizing the two sources of error approximation, i.e., the bias and the variance in order to get good generalization properties. In our scheme, these, over all possible values of the shape parameter are given by (see Eq.(19)):
where denotes expectation operator. Overfitting, i.e., a high variance occurs for large values of and underfitting, i.e., a high bias occurs for small values of .
The expected value of the kernel
with respect to the probability density function of the uniform distribution of the random variablereads:
Similarly, the variance is given by:
The above expressions suggest that . This leaves us with only one parameter to be determined for the “optimal” estimation of the upper bound of the uniform interval. Here, the value of is found based on a reference solution, say resulting from the integration of a stiff problem, which solution profiles contain also sharp gradients. For our computations, we have chosen as reference solution the one resulting from the van der Pol (vdP) ODEs given by:
for and , as initial conditions; the time interval was set to , i.e., approximately three times the period of the relaxation oscillations, which for , is . The particular choice of results to a stiff problem, thus containing very sharp gradients resembling approximately a discontinuity in the solution profile within the integration interval.
The reference solution was obtained using the ode15s of the MATLAB ODE suite with absolute and relative error tolerances equal to 1e14.
Here, in order to estimate the “optimal” values of for the vdP problem, we computed the bias-variance loss function using points in fixing the number of collocation points to . In Figure 1, we show the contour plots of the bias and variance errors and the computational times required for convergence, with both absolute tolerance and relative tolerance set to 1e06, for the vdP model.
As can be seen in Figure 1(a) there is a linear band of “optimal” combinations of within which the bias-variance error is for any practical purposes the same of the order of 1e2. In particular, this band can be coarsely described by a linear relation of the form . Furthermore, as shown in Figure1(b), within this band of “optimal” combinations, the computational cost is minimum for . Based on the above, the parsimonious of the variables and are giving the best numerical accuracy and minimum computational cost. We note that the above parsimonious optimal values are set fixed once and for all for all the benchmark problems that we consider here.