DeepAI

Numerical Solution of Stiff Ordinary Differential Equations with Random Projection Neural Networks

We propose a numerical scheme based on Random Projection Neural Networks (RPNN) for the solution of Ordinary Differential Equations (ODEs) with a focus on stiff problems. In particular, we use an Extreme Learning Machine, a single-hidden layer Feedforward Neural Network with Radial Basis Functions which widths are uniformly distributed random variables, while the values of the weights between the input and the hidden layer are set equal to one. The numerical solution is obtained by constructing a system of nonlinear algebraic equations, which is solved with respect to the output weights using the Gauss-Newton method. For our illustrations, we apply the proposed machine learning approach to solve two benchmark stiff problems, namely the Rober and the van der Pol ones (the latter with large values of the stiffness parameter), and we perform a comparison with well-established methods such as the adaptive Runge-Kutta method based on the Dormand-Prince pair, and a variable-step variable-order multistep solver based on numerical differentiation formulas, as implemented in the and MATLAB functions, respectively. We show that our proposed scheme yields good numerical approximation accuracy without being affected by the stiffness, thus outperforming in same cases the and functions. Importantly, upon training using a fixed number of collocation points, the proposed scheme approximates the solution in the whole domain in contrast to the classical time integration methods.

• 5 publications
• 4 publications
• 13 publications
• 13 publications
12/10/2020

Extreme learning machine collocation for the numerical solution of elliptic PDEs with sharp gradients

We introduce a new numerical method based on machine learning to approxi...
03/10/2022

Parsimonious Physics-Informed Random Projection Neural Networks for Initial-Value Problems of ODEs and index-1 DAEs

We address a physics-informed neural network based on the concept of ran...
06/15/2022

On Numerical Integration in Neural Ordinary Differential Equations

The combination of ordinary differential equations and neural networks, ...
02/13/2022

Obtaining weights for Gröbner basis computation in parameter identifiability problems

We consider a specific class of polynomial systems that arise in paramet...
04/24/2022

Numerical Computation of Partial Differential Equations by Hidden-Layer Concatenated Extreme Learning Machine

The extreme learning machine (ELM) method can yield highly accurate solu...
06/09/2012

A Connectionist Network Approach to Find Numerical Solutions of Diophantine Equations

The paper introduces a connectionist network approach to find numerical ...
01/31/2022

Constructing coarse-scale bifurcation diagrams from spatio-temporal observations of microscopic simulations: A parsimonious machine learning approach

We address a three-tier data-driven approach to solve the inverse proble...

1 Introduction

The idea of using Artificial Neural Networks (ANNs) for the numerical solution of Ordinary Differential Equations (ODEs) dates back to the ’90s. One of the first works in the field was proposed by Lee and Kang [44], who addressed a modified Hopfield Neural Network to solve a first-order nonlinear ODE. This method is based on the discretization of the differential operator with finite differences and the minimization of an energy function of the resulting algebraic difference equations. Following up this work, Meade and Fernadez [50]

used a non-iterative scheme based on Feedforward Neural Networks (FNN) for the solution of linear ODEs, where the estimation of the weights of the FNN is based on the Galerkin weighted-residuals method. In 1998, Lagaris et al.

[42] introduced a numerical optimization method based on FNNs for the solution of nonlinear ODEs and PDEs. The method constructs appropriate trial functions by the aid of a single-hidden layer FNN that is trained to minimize the error between the FNN prediction and the right-hand side of the differential equations. The proposed approach is demonstrated through both initial and boundary-value problems and a comparison with Galerkin finite elements is also provided. Based on the work of Lagaris et al. [42], Filici [23]

proposed a single-layer multiple linear output perceptron, providing a proof for the error bound. Network training was performed with the LMDER optimization solver from MINIPACK. The performance of the approach was tested through simple ODE problems, including the van der Pol equation, but without stiffness. Tsoulos et al.

[68] employed an FNN trained by grammatical evolution and a local optimization procedure to solve second-order ODEs. More recently, Dufera [18] addressed a deep training network with back-propagation. The performance of the approach was tested with the non-stiff ODEs presented in Lagaris et al. [42]. The authors report that their scheme outperforms the non-adaptive 4th order Runge-Kutta method in terms of numerical approximation accuracy when considering short time intervals and a small number of collocation points. A review and presentation of various ANN schemes for the solution of ODES can be found in Yadav et al. [73]. In all the above procedures, a computationally demanding optimization algorithm is required for the evaluation of the parameters of the network and the ODEs problems are non-stiff.

Yang et al. [74] aim to solve ODE problems using the so-called Extreme Learning Machine (ELM) concept [29, 31]: the weights between the input and the hidden layer, as well as the biases of the hidden nodes, are chosen randomly, and the remaining unknown weights between the hidden and the output layer are computed in one step using least squares with regularization. The authors address a single-layer Legendre neural network to solve non-stiff ODEs up to second order, including the Emden-Fowler equation. The performance of the scheme is compared against other machine learning schemes, including a cosine basis function neural network trained by the gradient descent algorithm, the explicit Euler scheme, the Suen 3rd-order and the classical 4th-order Runge-Kutta methods. The authors report high numerical accuracy and computing times comparable or smaller than the other methods.

As discussed, the above studies deal with non-stiff ODE problems. One of the first papers that dealt with the solution of stiff ODEs and Differential-Algebraic Equations by ANNs was that of Gerstberger and Rentrop [24], where an FNN architecture was proposed to implement the implicit Euler scheme. The performance of that architecture was demonstrated using scalar ODEs and the van der Pol equations considering however only mild stiffness, setting the parameter that controls the stiffness to values up to 5.

More recently, theoretical and technological advances have renewed the interest for developing and applying machine learning techniques for the solution of differential equations. Due to the fact that (large-scale systems of) stiff ODEs arise in modelling an extremely wide range of problems, from biology and neuroscience to engineering processes and chemical kinetics, and from financial systems to social science and epidemiology, there is a re-emerging interest in developing new methods for their efficient numerical solution. Mall and Chakraverty [49] proposed a single-layer Hermite polynomial-based neural network trained with back-propagation to approximate the solutions of the van der Pol-Duffing and Duffing oscillators with low to medium values of the stiffness parameter. Following up this work, Chakraverty and Mall [12] proposed a single-layer Chebyshev neural network with regression-based weights to solve first- and second-order nonlinear ODEs and in particular the nonlinear Lane-Emden equation. The training was achieved using back-propagation. Budkina et al. [7] and Famelis and Kaloutsa [22] proposed a single-layer ANN to solve highly stiff ODE problems. The networks were trained for different values of the parameter that controls stiffness over relatively short time intervals using the Levenberg-Marquardt algorithm. The above studies on the solution of stiff ODEs report good numerical approximation accuracy of the proposed schemes, in general for relatively short time intervals and small sizes of the grid. However, no indication is given about the computational time required for the training versus the time required by widely used solvers for low-to-medium and medium-to-high stiff problems such as the ode45 and ode15s functions available from the MATLAB ODE suite [67]

. We note that the previous approaches, in general, fit into the framework of Physics-Informed Neural Networks (PINNs), which are trained to solve supervised learning tasks while respecting any given laws of physics described by general nonlinear differential equations

[60, 39]. This is also part of the so-called scientific machine learning, which is emerging as a potential alternative to classical scientific computing.

1.1 Our Contribution

We propose a new scheme based on Random Projection Neural Networks (RPNNs) and in particular on ELMs. RPNNs are a family of networks including randomized and Random Vector Functional Link Networks (RVFLNs)

[65, 53, 34], Echo-State Neural Networks and Reservoir Computing [36, 37, 52, 64, 55], and Extreme Learning Machines [29, 31, 30]. The keystone idea behind all these approaches is to use a fixed-weight configuration between the input and the hidden layer, fixed biases for the nodes of the hidden layer, and a linear output layer. Hence, the output is projected linearly onto the functional subspace spanned by the nonlinear basis functions of the hidden layer, and the only unknowns that have to be determined are the weights between the hidden and the output layer. Their estimation is done in one step using least squares. Here, for the solution of stiff problems of ODEs, we propose a single-layer FNN with Radial Basis Functions (RBFs) whose parameters are properly uniformly-distributed random variables. Thus, the proposed neural network scheme constitutes a Lipschitz embedding constructed through the random projection. The feasibility of the scheme is guaranteed by the celebrated Johnson and Lindenstrauss Lemma [38] and universal approximation theorems proved for random projection networks, and in particular for ELMs [32].
By combining this choice of the underlying functional space, for which we can also explicitly compute the derivatives, and the Gauss-Newton method for the solution of the system of nonlinear algebraic equations equivalent to the least squares problem, we obtain an efficient way to calculate by collocation a neural network function that approximates the exact solution in any point of the domain. It is very important to mention that once the random parameters are chosen in a proper interval, the underlying space is capable to catch the steep gradient behaviours of the solution, thus providing good numerical accuracy also for stiff ODE problems and outperforming in some cases the ode45 and the stiff solver ode15s: no adaptive strategy on the step size is needed so that the method can be used “as it is” despite the stiffness (see also what was demonstrated for boundary-layer problems with steep gradients in [9, 21]).
We apply the proposed scheme to solve two benchmark stiff ODE problems, namely the ROBER problem, a stiff system of three nonlinear ODEs describing the kinetics of an autocatalytic reaction [63], and the well-known van der Pol equations with high stiffness. We test the performance of the scheme, in terms of both approximation accuracy and computational time. We show that the proposed scheme performs comparably to, and in some cases where steep gradients arise better than, the ode15s in terms of numerical approximation accuracy, while ode45

in some cases completely fails or needs many points to satisfy a specific tolerance. Moreover, our approach provides the solution directly as a function that can be evaluated at every point of the domain, in contrast to the classical numerical methods, where extra interpolation/extrapolation steps are usually performed for this purpose.

The structure of the paper is as follows. In Section 2 we give very briefly a coarse definition of stiff ODEs. In Section 3, we provide some preliminaries on the solution of ODEs with FFNs based on the “classical" optimization approach, and the basic concept behind the use of RPNNs; we also discuss the celebrated Johnson and Lindenstrauss Lemma [38]. In Section 4, we propose our approach for solving ODEs with RPNNs, providing also a pseudo-code of the corresponding algorithm, and discuss its approximation properties within the framework of the universal approximation theorem of ELMs [32]. In section 5, we present the numerical results obtained by applying the proposed approach to the above-mentioned low-dimensional stiff ODE problems (Rober and van der Pol problems) along with a comparison with ode45 and ode15s. Conclusions are given in Section 6.

2 Stiff ODEs

We aim to solve initial-value ODE problems:

 ⎧⎪⎨⎪⎩dyidx=fi(x,y1,y2,...,ym),yi(x0)=αi,i=1,2,…,m, (1)

where the functions and the values are the input data, and the functions are the unknowns. In order to simplify the notation, we group the functions in a vector function , and the functions in .

The unknown solution can exhibit a complex behaviour, including steep gradients, which result in difficulties for the design and implementation of numerical methods. These difficult problems are usually the ones where the so-called stiffness arises. Until now, there is no complete and precise definition for stiffness. Following Lambert ([43]), one has to consider stiffness as a phenomenon exhibited by the system rather than a property of it. Generally, an ODE problem is called stiff if there exists a solution that varies very slowly but there are nearby solutions that vary rapidly, so that (explicit) numerical algorithms need many small steps to obtain accurate and reliable results.

A widely used definition for stiffness is the following, given by J.D. Lambert (see also [6]):

If a numerical method with a finite region of absolute stability, applied to a system with any initial conditions, is forced to use in a certain interval of integration a step length which is excessively small in relation to the smoothness of the exact solution in that interval, then the system is said to be stiff in that interval.

In the case of constant coefficients, where the stiffness ratio is introduced, one can define stiffness as follows. Consider the linear constant coefficient inhomogeneous system in the unknown :

 dydx=Ay+¯¯¯f(x), (2)

where is a nonlinear term and

is a constant and diagonalizable matrix with eigenvalues

,

, (assumed distinct) and corresponding eigenvectors

. Suppose also that:

 Re(λl)<0,l=1,2,...,m. (3)
Definition 2.1 (Stiffness ratio).

With the above notations, let be such that

 |Re(¯¯¯λ)|≥|Re(λl)|≥|Re(λ––)|,l=1,2,…,m.

We define the stiffness ratio of system (2) as

 |Re(¯¯¯λ)||Re(λ––)|.

The above definition is motivated by the explicit general solution of (2), that is

 y(x)=m∑l=1κlexp(λlx)cl+g(x), (4)

where is a particular integral taking into account the forcing term. In the hypothesis that (3) holds true, one has that each of the terms vanishes as , and hence the solution approaches asymptotically as . In this case, the term decays monotonically if is real, and sinusoidally if is complex. Interpreting to be time (as it is often in physical problems), is called the transient solution and the steady-state solution. If is large, then the corresponding term decays quickly as increases and thus it is called a fast transient; if is small, then decays slowly and is called a slow transient. Letting and be the indices identifying and , respectively, we have that is the fastest transient and the slowest. The ratio between these two terms gives the stiff behaviour of the linear ODE system.

Many examples of stiff problems exhibit also other features, but for each feature there maybe other stiff problems not exhibiting that particular feature. However, we note that Lambert refers to these features as “statements" rather than definitions. A few of them are the following: (1) a linear constant coefficient system is stiff if all of its eigenvalues have negative real part and the stiffness ratio is large; (2) stiffness occurs when stability requirements, rather than those of accuracy, constrain the step length; and (3) stiffness occurs when some components of the solution decay much more rapidly than others. One could also say that stiffness is a numerical efficiency issue since non-stiff methods can in theory solve stiff problems if they use many points (i.e. a lot of computing time). In practice, many extremely stiff problems bias “classical” numerical methods to tiny steps, thus making the computational cost huge and hence the methods inapplicable. Furthermore, in some contexts, a specific number of points is available and as a consequence, the problem is unsolvable with several numerical methods.

3 Preliminaries

3.1 Feedforward Neural Networks

A FNN is a biologically inspired regression (or classification) algorithm. It consists of processing units, called neurons or nodes, organised in layers. There are three types of layers: the input layer, the output layer and possibly one or more hidden layers. The units of the output and every hidden layer are connected to all the units of the previous layer. These connections may have different strengths, called weights, and encode the “knowledge” of the network. The data enter the input layer and pass through each layer during the forward phase. Each unit accepts a signal consisting of a (usually linear) combination of the outputs of the previous layer units and, using an activation function, creates an output that is transmitted to the next layer until the final output layer. Different activation functions and/or numbers of units can be used for each layer. A simple example of FNN is the Single Layer FNN (SLFNN), consisting of

input units, a single layer with hidden units with biases, and output units (without biases). Given an input , the output of this SLFNN reads:

 N(x)=WoΦ(Wx+b,q), (5)

where is the matrix containing the weights from the hidden layer to the output layer, is a vector function with components activation functions , is the matrix containing the weights from the input to the hidden layer, is the vector of the biases of the hidden nodes, and

is a vector containing the hyperparameters of the neural network, such as the parameters of the activation functions (e.g., for radial basis functions, the biases of the hidden neurons and the variances of the Gaussian functions), the learning rate, and the batch size. Many results are available concerning the approximation properties of the previous FFN. The most important one from the numerical point of view is the Universal Approximation Theorem, for which we refer to the original papers

[14, 27, 28, 56] and the recent review [41]. Next we present some results concerning our proposed machine learning scheme. What can be summarized here is that a FNN is capable of approximating uniformly any (piecewise-)continuous (multivariate) function, to any desired accuracy. This implies that any failure of a function mapping by a (multilayer) network must arise from an inadequate choice of weights and biases or an insufficient number of hidden nodes. Moreover, in the univariate case only one hidden layer is needed.

Given a training set of input-output pairs, the estimation of the appropriate weights and biases is usually attempted by using an optimization procedure aimed at minimizing a cost function. In particular, the “classical way" (see e.g. [42]

) to solve (partial) differential equations in a

-dimensional domain with FNNs involves the solution of a minimization problem as follows. One first defines a trial solution in the form of

 Ψ(x,P,q)=Ω(x,N(x,P,q)), (6)

where has components, each associated with a component of the solution , is sufficiently smooth, has components , and is a matrix containing the network parameters , and the vector of biases associated with the th component of . Then, we consider a discretization of the equation that has to be solved, so as to settle the conditions that the optimization has to manage.

In the case of the solution of ODEs, according to the above notation , i.e. the input domain is one-dimensional (representing for example time). By discretizing the domain of the ODE problem (1) with points , so as to collocate the equations, one can determine the values of the network parameters in by solving the following optimization problem:

 minPE(P):=n∑j=1∥∥∥dΨdx(xj,P,q)−f(xj,Ψ(xj,P,q))∥∥∥2. (7)

More details on the construction of trial solutions and the minimization problem concerning our approach are given in Section 4. In order to deal with the problem (7), one usually needs quantities such as the derivatives of with respect to the input and the weights and biases represented by . They can be obtained in several ways, e.g. by computing analytical derivatives, by using finite difference or other numerical approximations, by symbolic differentiation or by automatic differentiation (see, e.g., [46] and the references therein).

Although a specialized form of automatic differentiation, known as back-propagation, is widely used in ANNs, in this work, we can easily compute analytical first-order derivatives (see also [42]), and then apply to problem (7) a variety of numerical methods that use those derivatives [40]. For a Single-Input-Single-Output (SISO) NN (i.e. and ), with points and one hidden layer with elements in (5), it is straightforward to show that the first-order derivative with respect to , is given by

 ∂N∂xl=h∑j=1woj∂Φj∂xl, (8)

while the derivatives of the network with respect to the training parameters, which are in general the input and output weights and , are given by:

 ∂N∂wj=h∑i=1woi∂Φiwj,∂N∂woj=Φj. (9)

(Note that we use the lightface notation to indicate the SISO version of the neural network ). Once we have the above derivatives of , we can compute the first-order derivatives of the cost function in (7) with respect to the parameters of , and thus we are in principle able to train the network by any minimization technique using those derivatives. However, the training phase may require a very large computing time, thus we address a strategy that optimizes the parameters via the solution of a suitable least-squares problem.

Our proposed FNN is based on radial basis transfer functions, introduced in the next section. We remark that another natural choice for the transfer function, widely used in ANN, is a sigmoidal function, which is a bounded monotone S-shaped function with a bell-shaped first derivative (see, e.g.

[57]).

3.2 Radial Basis Function Neural Networks

Radial Basis Function Neural Networks (RBFNNs) are special cases of ANNs where each hidden unit is associated with a radial basis transfer function. Powell introduced radial basis functions (RBFs) in 1985 for multivariate function interpolation ([58]). As transfer functions in ANN, RBFs were first addressed by Broomhead and Lowe ([5]). An RBF is a function whose value depends only on the distance between its argument and some fixed point called center:

 ϕc(x)=ϕ(||x−c||), (10)

where denotes a vector norm.

A widely used RBF is the Gaussian RBF:

 G(x,c,σ)=exp(−||x−c||2σ2), (11)

employed with the Euclidean norm. A Gaussian RBF is characterized by the width or scale parameter .

In RBFNNs, each hidden unit is more “sensitive" to data points close to its center. Furthermore, for Gaussian RBFNNs, this sensitivity can be tuned by adjusting the width parameter : a larger implies less sensitivity. Thus, for most of the problems, the choice of the centers is crucial. For the specific case of RBFNNs with the same in all hidden nodes, Park and Sandberg proved that those networks are capable of universal approximation [56]. Furthermore, Liao et al. extended this result to very mild hypotheses on the activation function [45]. In Section 4, we suggest a choice of these parameters in the context of solving problems of ODEs.

Theorem 3.1.

Let be as in equation (10). If is continuous almost everywhere, locally essentially bounded, and not a polynomial, then for any compact set is dense in with respect to the uniform norm, i.e., given any and , there exists such that

 ||γ−g||L∞(K)≤ϵ.

RBFNNs have gained popularity because of a number of advantages compared to other types of ANNs, including their simpler structure and faster learning algorithms. RBFNNs have been applied to problems in many different areas, such as image processing [20, 51, 75]), speech recognition [70, 62, 2], time series analysis [76], adaptive equalization [11, 13] and others.

3.3 Random Projection RBFNNs

The training of an ANN requires the minimization of a cost function. But, even for the simple case of SLFNNs, this task may become challenging. Many optimization algorithms used for training ANNs apply stochastic gradient-based approaches which back-propagate the error and adjust the weights through specific directions [4]. More recently, second-order stochastic optimization methods have been widely investigated to get better performances than first-order methods, especially when ill-conditioned problems must be solved (see, e.g., [66] and the references therein). Nevertheless, there are still difficulties in using these approaches, such as the setting of the so-called hyperparameters, the significant increase of computing time when the number of data or the number of nodes in the hidden layer grows, and the high non-convexity stemming from the use of nonlinear activation functions, which can lead the algorithms to local minima.

A way to deal with the “curse of dimensionality" in training ANNs is to apply the concept of random projection. The idea behind random projections is to construct a Lipschitz mapping that projects the data into a random subspace. The feasibility of this approach can been justified by the celebrated Johnson and Lindenstrauss (JL) Theorem:

[38]:

Theorem 3.2 (Johnson and Lindenstrauss).

Let be a set of points in . Then, and such that , there exists a map such that

 (1−ϵ)∥u−v∥2≤∥F(u)−F(v)∥2≤(1+ϵ)∥u−v∥2∀u,v∈X. (12)

Note that while the above theorem is deterministic, its proof relies on probabilistic techniques combined with Kirszbraun’s theorem to yield a so-called extension mapping [38]. In particular, it can be shown that one of the many such embedding maps is simply a linear projection matrix with suitable random entries. Then, the JL Theorem may be proved using the following lemma.

Lemma 3.1.

Let be a set of points in and let be the random projection defined by

 F(u)=1√kRu,u∈Rd,

where

has components that are i.i.d. random variables sampled from a normal distribution. Then,

 (1−ϵ)∥u∥2≤∥F(u)∥2≤(1+ϵ)∥u∥2

is true with probability

.

Similar proofs have been given for distributions different from the normal one (see, e.g., [1, 15, 69, 72]). In general, the proof of the JL Theorem is based on the fact that inequality (12) is true with probability 1 if is large enough. Thus, the theorem states that there exists a projection (referred to as encoder) of into a random subspace of dimension , where the distance between any pair of points in the embedded space is bounded in the interval . Moreover, [15] proved that if the random projection is of Gaussian type, then a lower bound of the embedding dimension is given by .

We note that the above mapping is a feature mapping, which in principle may result in a dimensionality reduction () or a projection into a higher dimensional space () in which one seeks a linear low-dimensional manifold (in analogy to the case of kernel-based manifold learning methods). We also note that while the above linear random projection is but one of the choices for constructing a JL embedding (and proving it), it has been experimentally demonstrated and/or theoretically proven that appropriately constructed nonlinear random embeddings may outperform simple linear random projections. For example, in [25] it was shown that deep networks with random weights for each layer result in even better approximation accuracy than the simple linear random projection. Based on the concept of random projection, Schmidt et al. [65] performed computational experiments using FNNs with sigmoidal transfer functions in the hidden layer and a bias on the output layer, thus showing that by fixing the weights between the input and the hidden layer at random values, and by training the output weights via the solution of linear equations stemming from the Fisher minimization problem, the approximation accuracy is equivalent to that obtained with the standard back-propagation iterative procedure where all the weights are calibrated.

In the same year, Random Vector Functional-Link Networks (RVFLNs) were addressed in [54], in which the input layer is directly connected also to the output layer, the internal weights are chosen randomly in and the output weights are estimated in one step by solving a system of linear equations. By formulating a limit-integral of the function to be approximated with Monte Carlo simulations, in [35], it was proved that RVFLNs are universal approximators for continuous functions on bounded finite-dimensional sets and that the rate at which the approximation error vanishes is of the order of , where is the number of basis functions parametrized by random variables. Extensive computational experiments performed in [77] showed that the direct links employed in RVFLNs between the input and the output layer play an important role for the performance, while the bias of the output neuron is not so important.

Reservoir Computing (otherwise called Eco State Networks) is another approach to network training based on the concept of random projection [36, 71, 48, 8]

. The basic structure of Reservoir Computing consists of a Recurrent Neural Network (RNN) with a large number of hidden units, an extra input and an extra output unit. Internal connection weights form a sparse random connectivity pattern with fixed values. The estimation of the optimal output weights is achieved by solving a system of linear equations arising from a mean-squares minimization problem.

Furthermore, it has been shown that single-layer FNNs with randomly assigned input weights and biases of the hidden layer, and with infinitely differentiable functions at the hidden layer called Extreme Learning Machines (ELMs) can universally approximate any continuous function on any compact input set [33, 29, 30, 31]. For the case of a FNN with a single hidden layer of units, the random projection of the input space can be written as

 Y=Φ(X), (13)

where the columns of the matrix represent a set of points in the input -dimensional space, the columns of are the corresponding random projections, and acts as an encoder, i.e. a family of transfer functions whose parameters are sampled from a certain random distribution function. If the values of the weights between the input and the hidden layer are fixed, then the random projection can be written as a linear map:

 Y=WoΦ,Y∈Rk×n, (14)

where

is a random matrix containing the outputs of the nodes of the hidden layer as shaped by the

random distribution functions (e.g., RBFs or sigmoidal functions) and the -dimensional inputs. Thus, the so-called ELMs can be seen as underdetermined linear systems in which the output weights are estimated by solving minimum-norm least-squares problems [33, 30, 31, 3]. Moreover, for ELMs an interpolation-like theorem has been proven by Huang et al. in [33]:

Theorem 3.3.

Let us consider a single-hidden-layer FNN with hidden units and an infinitely differentiable transfer function ,   distinct input-output pairs (

, and randomly chosen values from any continuous probability distribution for the internal weights

and for the values of the biases of the neurons of the hidden layer, grouped in . Let us also denote by and the matrices containing the internal and output weights, and by and the matrices with columns and . Then, the hidden layer output matrix , whose elements are determined by the action of the transfer functions on (where has all the entries equal to 1), has full rank and

 ∥WoΦ−Y∥=0

with probability 1.

A review on on neural networks with random weights can be found in [10].

4 The Proposed Method

Here, we focus on the numerical solution of the initial-value problem (1) in an interval . Based on the previous notation, for this problem and . Thus, we denote by the th component of the trial solution defined in (6), where is the matrix containing the weights between the hidden and the output layer. Note that we separate from the other network parameters consisting of the input weights, the biases of the hidden layer and the parameters of the transfer function, and denote those parameters again by , with a little abuse of notation. Furthermore, the vector of hyperparameters in (6) is neglected because we do not use it. Following [42] and taking into account that the trial solution must satisfy the initial value conditions , , we set:

 Ψi(x,woi,pi)=αi+(x−x0)Ni(x,woi,pi), (15)

where is a single-output FNN with parameters the output weights , and contains the remaining parameters associated with that network. Then, if one considers the numerical solution based on collocation points , the error function we seek to minimize for training the FNN is given by:

 E(Wo,P)= (16) m∑i=1n∑j=1(dΨidx(xj,woi,pi)−fi(xj,Ψ1(xj,wo1,p1),…,Ψm(xj,wom,pm))2.

Here, we propose a RPNN for the solution of ODEs that has input collocation points, nodes in the hidden layer with Gaussian RBFs, and single-input-single output neural sub-networks with a linear output transfer function. In particular, we consider each sub-network to be a linear combination of RBFs, as specified next:

 Ni(x,woi,pi)=h∑j=1wojiGji(x),i=1,2,…,m, (17)

where

 Gji(x) =G(wjix+bji,cj,σji)=exp(−(wjix+bji−cj)2σji), (18) j=1,2,…h,i=1,2,…m,

with and with and . Notice that in the above configuration, we fix all the internal weights to 1 and the centers to be equidistant in the domain, while we set randomly from appropriately chosen uniform distributions the biases and the width parameters . This configuration is slightly different from the classical ELM configuration with RBFs, where the internal weights are set equal to 1 and the biases equal to 0, while the centers are randomly uniformly distributed in (see, e.g., [32]). However, it is straightforward to show that we still get universal approximation properties, as stated in the next theorem.

Theorem 4.1.

Let us fix a continuous (target) function and consider the system of functions in (17), with , and defined in (18) and the subsequent lines. Then, for every sequence of randomly chosen parameters there exists a choice of such that

 limm→∞∥Ni(x,woi,pi)−φ(x)∥=0 with probability one.

Proof. Since and are fixed, one can manipulate (18) and write it as

 Gji(x)=exp(−(x+αji)2βji),

where the parameters and are random variables (because and are random variables sampled from continuous probability distributions). Therefore, the network fits the hypotheses of an FNN with random hidden nodes. Because the considered RBFs are sufficiently regular, Theorem II.1 in [32] holds and hence the thesis follows in a straightforward manner. ∎
Under the previous assumptions, the derivative of the th component of the trial solution with respect to the collocation point is given by (see (8)):

 ∂Ψi∂xl=Ni(xl,woi,pi)−(xl−x0)Σhj=12σjiwoji(xl+bji−cj)exp(−(xl+bji−cj)2σji), (19)

while, for any fixed , the derivative of with respect to the only unknown parameter is given by (see (15) and (9)):

 ∂Ψi∂woji=(xl−x0)exp(−(xl+bji−cji)2σji). (20)

For the determination of the randomized parameters of the Gaussian RBF, we need to fix reference intervals where to look for these parameters. Our requests are that the functions are neither too steep nor too flat in the reference interval, and on each collocation point there are at least two basis functions giving values that are not too small. Then, based on numerical experiments, the biases of the hidden units and the parameters are taken to be uniformly randomly distributed in the intervals

 [−(xend−x0)6−¯s,−¯s] and [110,20(xend−x0)3],

respectively. We emphasise that this choice appears to be problem independent. Therefore, the only parameters that have to be determined by training the network are the output weights . Hence, for the collocation points , the outputs of each network , , are given by:

 Ni(x1,x2,…xn,woi,pi)=Riwoi, (21)

where is the vector with th component the output of corresponding to , and is defined as

 Ri(x1,…,xn,pi)=⎡⎢ ⎢⎣G1i(x1)⋯Ghi(x1)⋮⋮⋮G1i(xn)⋯Ghi(xn)⎤⎥ ⎥⎦. (22)

The minimization of the error function given in (16) is achieved by a Gauss-Newton scheme (see, e.g., [40]) over nonlinear residuals , with , , , given by:

 Fq(Wo)=dΨidxl(xl,woi,pi)−fi(xl,Ψ1(xl,wo1,p1),…,Ψm(xl,wom,pm)). (23)

By setting and looking at as a vector where the unknown output weights are stacked, i.e.

 Wo=⎡⎢ ⎢ ⎢ ⎢ ⎢⎣wo1wo2⋮wom⎤⎥ ⎥ ⎥ ⎥ ⎥⎦∈Rmh,

 Wo(ν+1)=Wo(ν)+dWo(ν),dWo(ν)=argminWo∥(∇WoF(Wo(ν)))TdW+F(Wo(ν))∥, (24)

where denotes the current iteration and is the Jacobian matrix of with respect to :

 ∇WoF(Wo(ν))=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣∂F1∂wo11∂F1∂wo12…∂F1∂wo1h…∂F1∂wo1(mh)∂F2∂wo11∂F2∂wo12…∂F2∂wo1h…∂F2∂wo1(mh)⋮⋮⋱⋮⋱⋮∂Fq∂wo11∂Fq∂wo12…∂Fq∂wo1h…∂Fq∂wo1(mh)⋮⋮⋱⋮⋱⋮∂Fnm∂wo11∂Fnm∂wo12…∂Fnm∂wo1h…∂Fnm∂wo1(mh)⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦∣∣(Wo(ν)). (25)

Note that the residuals depend on the derivatives in (19) and the trial functions in (15), while the elements of the Jacobian matrix depend on the derivatives of in (20) as well as on the mixed derivatives , which, based on (19), are given by

 ∂2Ψi∂xl∂woji=∂Ni(xl,woi,pi)∂woji−(xl−x0)2σji(xl+bji−cj)exp(−(xl+bji−cj)2σji), (26)

where

 ∂Ni(xl,woi,pi)∂woji=exp(−(xl+bji−cj)2σji). (27)

As in general , the minimization problem in (24) is an underdetermined linear system, and we compute its solution requiring a minimum

norm. This can be performed by using the Moore-Penrose pseudoinverse of the Jacobian matrix with the Singular Value Decomposition. More precisely, we estimate the pseudoinverse by cutting off all the singular values (and their corresponding singular vectors) below a small tolerance

. This allows us to deal with the case where the Jacobian is rank deficient, which may happen because the RBFs are not injective functions. Furthermore, this choice also allows us to cope with the difference between the exact rank and the numerical rank of the Jacobian matrix. Following [26], for some small the -rank of a matrix is defined as follows:

 rϵ=min{rank(B)∈Rmn×mh:∥M−B∥L2≤ϵ}. (28)

Then, if is “small enough”, we neglect all the singular values below and approximate the pseudoinverse of the Jacobian matrix as

 (∇WoF)+=VrϵΣ+rϵUTrϵ, (29)

where is the diagonal matrix with the largest singular values of , is its pseudoinverse, and and are the matrices with columns the corresponding left and right eigenvectors, respectively. The value of used in our experiments is specified at the beginning of Section 5. Thus, the direction at each Gauss-Newton iteration is given by

 dWo(ν)=−VrϵΣ+rϵUTrϵF(Wo(ν)).

We can summarize the proposed method in the pseudo-code shown in Algorithm 1, where denotes the uniform random distribution in the interval and is a tolerance used in the Gauss-Newton stopping criterion.

5 Numerical results

We implemented the Algorithm 1 using MATLAB 2021a on an Intel Core i5-8265U with up to 3.9 GHz frequency and a memory of 8 GB. The Moore-Penrose pseudoinverse of was computed with the MATLAB built-in function pinv, with a singular value tolerance . For our simulations, we chose two well-known and challenging stiff ODE systems: the ROBER and the van der Pol system (the latter becomes stiff as its parameter increases). For comparison purposes, we also solved the ODE problems with two widely used MATLAB built-in functions, namely ode45, implementing an adaptive-step Runge-Kutta method based on the Dormand-Prince pair, and ode15s, implementing a variable-step variable-order multistep method based on numerical differentiation formulas. In order to estimate the error in the solution computed with our RPNN approach, we used as reference solution the one computed by ode15s with absolute and relative error tolerances equal to . To this aim, we computed the and the norms of the differences between the computed and the reference solutions, as well as the Mean Absolute Error (MAE); all the above performance metrics are being evaluated at the grid collocation points where the solution is sought with the proposed RPNN algorithm. Finally, we ran each solver 100 times and computed the median, 5th percentile and 95th percentile of the execution time, in seconds. The time of each run was measured by using the MATLAB commands tic and toc. Henceforth, we use instead of , since the independent variable in the test problems represents time. We also assume .

5.1 Case study 1: ROBER system

The ROBER model was developed by Robertson to describe the kinetics of an autocatalytic reaction [63]. The set of the reactions reads:

 A →k1C+B (30) B+B →k2C+B B+C →k3A+C,

where A, B, C are chemical species and , and are reaction rate constants. Assuming idealized conditions and that the mass action law is applied for the rate functions, we have the following system of ODEs:

 y′1 =−k1y1+k3y2y3 (31) y′2 =k1y1−k2y22−k3y2y3 y′3 =k2y22,

where , and denote the concentrations of A, B and C, respectively. The ROBER problem has been used as a benchmark problem to test stiff ODE integrators.

In our simulations, we set the typical values of the parameters, i.e. , and , and , and as initial values of the concentrations. Stiffness arises from the large difference among the reaction rate constants. Based on the proposed methodology, we construct three trial solutions:

 Ψ1(t) =α1+tN1(t,wo1), Ψ2(t) =α2+tN2(t,wo2), Ψ3(t) =α3+tN3(t,wo3),

where , and and the ’s have been neglected because they have been fixed. The error function to be minimized is given by:

 E(wo1,wo2,wo3) =n∑i=1((dΨ1dt(ti,wo1)−f1(ti,Ψ1,Ψ2,Ψ3))2 +(dΨ2dt(ti,wo2)−f2(ti,Ψ1,Ψ2,Ψ3))2 +(dΨ3dt(ti,wo3)−f3(ti,Ψ1,Ψ2,Ψ3))2),

where

 f1 =−k1(α1+tiN1(t1,wo1)), f2 =k1(α1+tiN1(t1,wo1))−k2t2iN2(ti,wo2)2−k3t2iN2(ti,wo2)N3(ti,wo3), f3 =k2t2iN2(ti,wo2)2.

For the numerical solution of the ROBER system with the proposed method, we used a grid of equidistant collocation points in the time interval . We also set and hence . For our illustrations, we tested the numerical performance of the scheme using two values of the tolerance , i.e.  and . For comparison purposes, we used the same values of both the absolute and relative error tolerances for the functions ode45 and ode15s. The approximate solutions obtained with tolerances and are shown in Figure 1. Furthermore, in Tables 1 and 2, we report the corresponding numerical approximation accuracy obtained with the various methods in terms of the MAE, -norm and -norm of the error with respect to the reference solution. Here, in order to compute these errors, we evaluated the corresponding solutions in the grid points . In Table 3, we show the number of points required by each method, and in Table 4, we report the execution times (median, 5th percentile and 95th percentile) of all the methods, including the time required to compute the reference solution. Finally, in Tables 5 and 6, we present the errors of the approximate solutions obtained with both tolerances, evaluated at 1000 equidistant points in . We note that for ode45 and ode15s the values of the solutions at these points were obtained by using