1 Introduction
The discovery of physical laws and the solution of the inverse problem in complex systems modelling, i.e. the construction of Partial Differential Equations (PDEs) for the emergent dynamics from data, and consequently the systematic analysis of their dynamics with established numerical analysis techniques, is a holy grail in the study of complex systems and has been the focus of intense research efforts over the last years [1, 2, 3, 4]. From the early '90s, exploiting both theoretical and technological advances, researchers employed machine learning algorithms for system identification using macroscopic observations, i.e. assuming that we already know the set of coarse variables to model the underlying dynamics, and for the derivation of normal forms [5, 6, 7, 8, 9, 10]. More recently, Bongard and Lipson [11]
proposed a method for generating symbolic equations for nonlinear dynamical systems that can be described by ordinary differential equations (ODEs) from time series. Brunton et al. [12] introduced the so-called sparse identification of nonlinear dynamics (SINDy) method to obtain explicit data-driven PDEs when the variables are known, and to construct normal forms for bifurcation analysis. Wang et al. [3] proposed a physics-informed machine learning scheme based on deep learning to learn the solution operator of arbitrary PDEs. Kovachki et al. [4] introduced the concept of Neural Operators, i.e. mesh-free, infinite-dimensional operators built with neural networks, to learn surrogate functional maps for the solution operators of PDEs. However, for complex systems, such "good" macroscopic observables that can be used effectively for modelling the dynamics of the emergent patterns are not always directly available. Thus, an appropriate set of "hidden" macroscopic variables has to be identified from data. Such data can be available either directly from experiments or from detailed simulations using, for example, molecular dynamics, agent-based models, and Monte Carlo methods. Hence, all in all, we are confronted with two major problems: (a) the identification of the appropriate variables that define (parametrize) the emerging (coarse-grained) dynamics, and (b) the construction of models based on these variables. In the early 2000s, the Equation-Free and Variable-Free multiscale framework [13, 14, 15, 16, 17] provided a systematic approach for the numerical analysis (numerical bifurcation analysis, design of controllers, optimization, rare-events analysis) of the emergent dynamics, as well as for the acceleration of microscopic simulations, by bridging the microscale, where the physical laws may be known, and the macroscopic scale, where the emergent dynamics evolve. This bridging is achieved via the concept of the "coarse time steppers", i.e. the construction of a black-box map on the macroscopic scale.
By doing so, one can perform multiscale numerical analysis tasks, even for microscopically large-scale systems, by exploiting the algorithms (toolkit) of matrix-free methods in the Krylov subspace [13, 18, 19, 20, 21, 17], thus bypassing the need to construct explicit models in the form of PDEs. In the case where the macroscopic variables are not known a priori, one can resort to nonlinear manifold learning algorithms such as Diffusion Maps [22, 23, 24, 25] to identify the intrinsic dimension of the slow manifold on which the emergent dynamics evolve.
Over the last few years, efforts have been focused on developing physics-informed machine learning methods for solving both the forward problem, i.e. the numerical solution of high-dimensional multiscale problems described by PDEs, and the inverse problem, i.e. that of discovering the hidden physics [1, 26, 27, 28, 29], thus both identifying the set of coarse observables and, based on them, learning the effective PDEs. Lee et al. [29] proposed a methodology to find the right-hand side of macroscopic PDEs directly from microscopic data, using Diffusion Maps and Automatic Relevance Determination for selecting a good set of macroscopic variables, and Gaussian processes and artificial neural networks for modelling purposes. The approach was applied to learn a "black-box" PDE from data produced by Lattice Boltzmann simulations of the FitzHugh-Nagumo model at a specific value of the bifurcation parameter where sustained oscillations are observed.
In this paper, building on previous efforts [29], we exploit machine learning to perform numerical bifurcation analysis from spatiotemporal data produced by microscopic simulators. For the discovery of the appropriate set of coarse-grained variables, we used parsimonious Diffusion Maps [30, 31], while for the identification of the right-hand side of the emergent coarse-grained PDEs, we used shallow Feedforward Neural Networks (FNNs) and Random Projection Neural Networks (RPNNs), for which we propose an appropriate sampling approach for the construction of the (random) basis functions. For our illustrations, we have used a Lattice Boltzmann simulator of the FitzHugh-Nagumo (FHN) spatiotemporal dynamics. Upon training, the tracing of the coarse-grained bifurcation diagram was obtained by coupling the machine learning models with the pseudo-arclength continuation approach. The performance of the machine learning schemes was compared against the reference bifurcation diagram obtained by finite differences of the FHN PDEs.
2 Methodology
The pipeline of our computational framework for constructing the bifurcation diagrams from data produced by detailed microscopic simulations consists of three tasks: (a) the identification of a set of coarse-scale variables from fine-scale spatiotemporal data using manifold learning, and in particular parsimonious Diffusion Maps with leave-one-out cross-validation (LOOCV), (b) based on the parsimonious coarse-grained set of variables, the reconstruction of the right-hand side of the effective PDEs using machine learning, and (c) based on the machine learning models, the construction of the coarse-scale bifurcation diagrams of the emergent dynamics using the numerical bifurcation analysis toolkit.
The assumption here is that the emergent dynamics of the complex system under study on a domain $\Omega \subseteq \mathbb{R}^{d}$ can be modelled by a system of, say, $m$ (parabolic) PDEs of the form:

$$\frac{\partial u_i(\boldsymbol{x},t)}{\partial t} = F_i\bigl(\boldsymbol{x}, \boldsymbol{u}(\boldsymbol{x},t), D\boldsymbol{u}(\boldsymbol{x},t), D^{2}\boldsymbol{u}(\boldsymbol{x},t), \dots, D^{\nu}\boldsymbol{u}(\boldsymbol{x},t); \boldsymbol{\lambda}\bigr), \qquad i=1,2,\dots,m, \tag{1}$$

where $\boldsymbol{u}(\boldsymbol{x},t) = (u_1(\boldsymbol{x},t), \dots, u_m(\boldsymbol{x},t))$, $F_i$ is a nonlinear operator, $D^{\nu}\boldsymbol{u}$ is the generic multi-index $\nu$-th order spatial derivative at time $t$, i.e.:

$$D^{\nu}\boldsymbol{u}(\boldsymbol{x},t) = \left\{ \frac{\partial^{|\nu|} \boldsymbol{u}(\boldsymbol{x},t)}{\partial x_1^{\nu_1} \partial x_2^{\nu_2} \cdots \partial x_d^{\nu_d}} \;:\; |\nu| = \nu_1 + \nu_2 + \cdots + \nu_d \right\},$$

and $\boldsymbol{\lambda}$ denotes the (bifurcation) parameters of the system.

The boundary conditions read:

$$B_l\bigl(\boldsymbol{u}(\boldsymbol{x},t), D\boldsymbol{u}(\boldsymbol{x},t)\bigr) = 0, \qquad \boldsymbol{x} \in \partial\Omega_l, \tag{2}$$

where $\{\partial\Omega_l\}$ denotes a partition of the boundary $\partial\Omega$ of $\Omega$, and the initial conditions are:

$$\boldsymbol{u}(\boldsymbol{x}, 0) = \boldsymbol{u}_0(\boldsymbol{x}). \tag{3}$$

The right-hand side of the $i$-th PDE depends, say, on a subset of the variables and bifurcation parameters from the full set of observables, their spatial derivatives up to order $\nu$, and the parameters $\boldsymbol{\lambda}$. Let us denote this set as $S_i$, with cardinality $n_i$. Hence, at each spatial point $x_j$ and time instant $t_k$, the set of features for the $i$-th PDE can be described by a vector $\boldsymbol{z}^{(i)}_{jk} \in \mathbb{R}^{n_i}$. Here, we assume that such macroscopic PDEs in principle exist, but are not available in closed form.
Instead, we assume that we have detailed observations from microscopic simulations, from which we can compute the time and spatial derivatives of all the observables at a number of points in time and space using, e.g., finite differences. Thus, we aim to (a) identify the intrinsic dimension of the manifold on which the coarse-grained dynamics evolve, i.e. for each PDE identify the number of required variables and the coordinates that define the low-dimensional manifold, and, based on them, (b) identify the right-hand side (RHS) of the effective PDEs using machine learning.
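As a minimal sketch of this preprocessing step, the derivative features entering the regression can be estimated with second-order finite differences on a uniform space-time grid; the function name and the choice of features `(u, u_x, u_xx)` below are illustrative, not the paper's exact feature set:

```python
import numpy as np

def build_features(u, dx, dt):
    """Estimate u_x, u_xx and the regression target u_t on a uniform grid
    (rows: space, columns: time) with finite differences, and stack the
    per-point feature vectors (u, u_x, u_xx)."""
    u_x = np.gradient(u, dx, axis=0)      # first spatial derivative
    u_xx = np.gradient(u_x, dx, axis=0)   # second spatial derivative
    u_t = np.gradient(u, dt, axis=1)      # time derivative (target)
    return np.stack([u.ravel(), u_x.ravel(), u_xx.ravel()], axis=1), u_t.ravel()

# toy check on u(x,t) = x**2 * t, for which u_x = 2xt, u_xx = 2t, u_t = x**2
x = np.linspace(0.0, 1.0, 101)
t = np.linspace(0.0, 1.0, 51)
X, T = np.meshgrid(x, t, indexing="ij")
Z, dudt = build_features(X**2 * T, x[1] - x[0], t[1] - t[0])
```

Each row of `Z`, paired with the corresponding entry of `dudt`, is then one training sample for the regression of the right-hand side.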
To demonstrate the proposed approach, we have chosen to produce data from Lattice Boltzmann (LB) simulations of the coupled FitzHugh-Nagumo PDEs of activation-inhibition dynamics. Using the LB simulator, we produced data in time and space from different initial conditions and values of the bifurcation parameter. For the identification of an appropriate set of coarse-scale variables that define the low-dimensional manifold on which the emergent dynamics evolve, we performed feature selection using parsimonious Diffusion Maps [30, 31]. Then, we trained the machine learning schemes to learn the right-hand side of the coarse-grained PDEs on the low-dimensional manifold. Based on the constructed models, we performed numerical bifurcation analysis, employing the pseudo-arclength continuation method. The performance of the proposed data-driven scheme for constructing the coarse-grained bifurcation diagram was validated against the one computed with the PDEs using finite differences. A schematic overview of the proposed framework for the case of two effective PDEs (as in the problem of the FitzHugh-Nagumo activation-inhibition dynamics) is shown in Figure 1.
In what follows, we first describe the parsimonious Diffusion Maps algorithm for feature selection. Then, we present the machine learning schemes used for identifying the right-hand side of the effective PDEs from the microscopic simulations, and then we show how one can couple the machine learning models with the pseudo-arclength continuation method to construct the coarse-scale bifurcation diagrams. Finally, we present the numerical results and compare the performance of the proposed machine learning schemes.
2.1 Parsimonious Diffusion Maps
Diffusion Maps is a nonlinear manifold learning algorithm [22, 23, 24] that identifies a low-dimensional representation of points $\boldsymbol{x}_i$, $i = 1, \dots, n$, lying in a high-dimensional space $\mathbb{R}^{N}$, using the diffusion distance among points as the preserved metric [24]. Diffusion Maps assume that the high-dimensional data are embedded in a smooth low-dimensional manifold. It can be shown that the eigenvectors of the large normalized kernel matrices constructed from the data converge to the eigenfunctions of the Laplace-Beltrami operator on this manifold in the limit of infinite data [22, 24]. The approximation of this Laplace-Beltrami operator is made by representing the weighted edges connecting nodes $i$ and $j$, commonly by a normalized diffusion kernel matrix $\boldsymbol{W}$ with elements:

$$w_{ij} = \exp\left(-\frac{\|\boldsymbol{x}_i - \boldsymbol{x}_j\|^{2}}{\varepsilon}\right). \tag{4}$$
Then, one can define the diffusion matrix $\boldsymbol{P}$ by:

$$\boldsymbol{P} = \boldsymbol{D}^{-1}\boldsymbol{W}, \qquad \boldsymbol{D} = \operatorname{diag}\left(\sum_{j=1}^{n} w_{ij}\right), \tag{5}$$

whose elements $p_{ij}$ correspond to the probability of jumping from one point to another in the high-dimensional space. Taking the power $t$ of the diffusion matrix $\boldsymbol{P}$ is essentially identical to observing $t$ steps forward of a Markov chain process on the data points. Thus, the transition probability $p^{(t)}_{ij}$ of moving from point $\boldsymbol{x}_i$ to point $\boldsymbol{x}_j$ in $t$ steps reads:

$$p^{(t)}_{ij} = \left[\boldsymbol{P}^{t}\right]_{ij}. \tag{6}$$
On the weighted graph, the random walk can be defined as:

$$p(\boldsymbol{x}_i, \boldsymbol{x}_j) = \frac{w_{ij}}{\deg(\boldsymbol{x}_i)}, \tag{7}$$

where $\deg(\boldsymbol{x}_i)$ denotes the weighted degree of the point $\boldsymbol{x}_i$, defined as:

$$\deg(\boldsymbol{x}_i) = \sum_{j=1}^{n} w_{ij}. \tag{8}$$
At the next step, it is easy to compute the symmetrically normalized matrix $\tilde{\boldsymbol{W}}$, defined as:

$$\tilde{\boldsymbol{W}} = \boldsymbol{D}^{-1/2}\,\boldsymbol{W}\,\boldsymbol{D}^{-1/2}. \tag{9}$$

The eigendecomposition of $\tilde{\boldsymbol{W}}$ results in $\tilde{\boldsymbol{W}} = \boldsymbol{V}\boldsymbol{\Lambda}\boldsymbol{V}^{T}$, where $\boldsymbol{\Lambda}$ is a diagonal matrix with the eigenvalues $\lambda_i$ and $\boldsymbol{V}$ is the matrix whose columns are the eigenvectors $\boldsymbol{v}_i$ of $\tilde{\boldsymbol{W}}$. The eigenvalues of $\boldsymbol{P}$ are the same as those of $\tilde{\boldsymbol{W}}$, since $\boldsymbol{P}$ is the adjoint of the symmetric matrix $\tilde{\boldsymbol{W}}$, while the left and right eigenvectors of $\boldsymbol{P}$ (say $\boldsymbol{\phi}_i$ and $\boldsymbol{\psi}_i$) are related to those of $\tilde{\boldsymbol{W}}$ as [32]:

$$\boldsymbol{\phi}_i = \boldsymbol{D}^{1/2}\boldsymbol{v}_i, \qquad \boldsymbol{\psi}_i = \boldsymbol{D}^{-1/2}\boldsymbol{v}_i. \tag{10}$$
The embedding of the manifold in $k$ dimensions is realized by taking the first $k$ non-trivial/independent eigenvectors of $\boldsymbol{P}$:

$$\boldsymbol{x}_i \mapsto \bigl(\lambda_1^{t}\psi_1(\boldsymbol{x}_i),\; \lambda_2^{t}\psi_2(\boldsymbol{x}_i),\; \dots,\; \lambda_k^{t}\psi_k(\boldsymbol{x}_i)\bigr), \tag{11}$$

where $t$ denotes the number of diffusion steps (usually $t = 1$) and $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_k$ are the descending-order eigenvalues. For any pair of points $\boldsymbol{x}_i$ and $\boldsymbol{x}_j$, the diffusion distance at the time step $t$ is defined as:

$$D_t(\boldsymbol{x}_i, \boldsymbol{x}_j) = \sqrt{\sum_{l=1}^{n} \frac{\bigl(p^{(t)}_{il} - p^{(t)}_{jl}\bigr)^{2}}{\pi(\boldsymbol{x}_l)}}, \tag{12}$$

where $\pi$ denotes the stationary distribution of the random walk described by the diffusion matrix $\boldsymbol{P}$ [33]:

$$\pi(\boldsymbol{x}_i) = \frac{\deg(\boldsymbol{x}_i)}{\sum_{j=1}^{n} \deg(\boldsymbol{x}_j)}. \tag{13}$$
In practice, the embedding dimension $k$ is determined by the spectral gap in the eigenvalues of the final decomposition. Such a numerical gap means that the first few eigenvalues are adequate for the approximation of the diffusion distance between all pairs of points [22, 23]. Here, we retain only the parsimonious eigendirections of the final decomposition, as proposed in [30, 31].
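The pipeline of Eqs. (4)-(11) can be sketched in a few lines of numpy; the variable names are illustrative, the kernel scale `eps` is a free parameter, and a single diffusion step $t = 1$ is assumed:

```python
import numpy as np

def diffusion_maps(X, eps, k=2, t=1):
    """Diffusion Maps sketch: Gaussian kernel (Eq. 4), degrees (Eq. 8),
    symmetric normalization (Eq. 9), right eigenvectors of P (Eq. 10)
    and the k-dimensional embedding (Eq. 11)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / eps)                         # kernel matrix, Eq. (4)
    deg = W.sum(axis=1)                           # weighted degrees, Eq. (8)
    W_sym = W / np.sqrt(np.outer(deg, deg))       # D^(-1/2) W D^(-1/2), Eq. (9)
    lam, V = np.linalg.eigh(W_sym)
    order = np.argsort(lam)[::-1]                 # descending eigenvalues
    lam, V = lam[order], V[:, order]
    psi = V / np.sqrt(deg)[:, None]               # right eigenvectors of P
    return (lam[1:k + 1] ** t) * psi[:, 1:k + 1]  # skip the trivial constant one

# toy usage: points on a noisy circle embed into two diffusion coordinates
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 200)
pts = np.c_[np.cos(theta), np.sin(theta)] + 0.01 * rng.normal(size=(200, 2))
emb = diffusion_maps(pts, eps=0.5, k=2)
```

For the circle, the two retained coordinates recover (a rotation of) the angular parametrization, which is the kind of intrinsic coordinate sought in the framework above.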
2.1.1 Feature selection using Diffusion Maps with leave-one-out cross-validation (LOOCV)
Here, by identifying the coarse-scale spatiotemporal behaviour of a system of PDEs, we mean learning their right-hand sides as a black-box model. Hence, we first have to deal with the task of discovering a set of coarse-grained variables embedded in the high-dimensional input data space. For this task, one can employ various methods for feature selection, such as LASSO [34, 35] and Random Forests [36, 37]. In our framework, we used a method that extracts the dominant features based on manifold parametrization through output-informed Diffusion Maps [30]. The core assumption of this method is that, given a dataset in a high-dimensional space, we can parametrize it on a lower-dimensional manifold. For this purpose, given a set of Diffusion Maps eigenvectors $\{\boldsymbol{\psi}_1, \boldsymbol{\psi}_2, \dots, \boldsymbol{\psi}_k\}$, for each eigenvector $\boldsymbol{\psi}_l$ we use a local linear regression model:

$$\boldsymbol{\psi}_l \approx \alpha_l + \boldsymbol{\beta}_l^{T}\boldsymbol{\Psi}_{l-1}, \qquad \boldsymbol{\Psi}_{l-1} = (\boldsymbol{\psi}_1, \dots, \boldsymbol{\psi}_{l-1}), \tag{14}$$

to investigate whether $\boldsymbol{\psi}_l$ is a dependent eigendirection; here $\alpha_l \in \mathbb{R}$ and $\boldsymbol{\beta}_l \in \mathbb{R}^{l-1}$. The values of the parameters $\alpha_l$ and $\boldsymbol{\beta}_l$ are found by solving, for each point $i$, a locally weighted optimization problem of the form:

$$\hat{\alpha}_l, \hat{\boldsymbol{\beta}}_l = \operatorname*{arg\,min}_{\alpha, \boldsymbol{\beta}} \sum_{j \neq i} K\bigl(\boldsymbol{\Psi}_{l-1}(i), \boldsymbol{\Psi}_{l-1}(j)\bigr)\bigl(\psi_l(j) - \alpha - \boldsymbol{\beta}^{T}\boldsymbol{\Psi}_{l-1}(j)\bigr)^{2}, \tag{15}$$

where $K$ is a kernel weight function, usually the Gaussian kernel:

$$K\bigl(\boldsymbol{\Psi}_{l-1}(i), \boldsymbol{\Psi}_{l-1}(j)\bigr) = \exp\left(-\frac{\|\boldsymbol{\Psi}_{l-1}(i) - \boldsymbol{\Psi}_{l-1}(j)\|^{2}}{\sigma^{2}}\right), \tag{16}$$

where $\sigma$ is the shape parameter. The final normalized leave-one-out cross-validation (LOOCV) error for this local linear fit is defined as:

$$r_l = \sqrt{\frac{\sum_{i=1}^{n}\bigl(\psi_l(i) - \hat{\psi}_l(i)\bigr)^{2}}{\sum_{i=1}^{n}\psi_l(i)^{2}}}, \tag{17}$$

where $\hat{\psi}_l(i) = \hat{\alpha}_l + \hat{\boldsymbol{\beta}}_l^{T}\boldsymbol{\Psi}_{l-1}(i)$ is the leave-one-out prediction at point $i$. For small values of $r_l$, $\boldsymbol{\psi}_l$ is considered to be dependent on the other eigenvectors, and hence a harmonic or repeated eigendirection, while large values of $r_l$ suggest that $\boldsymbol{\psi}_l$ can serve as a new independent eigendirection.
In our approach, we provide as inputs to the Diffusion Maps algorithm the combined input-output domain (the observables and their spatial and time derivatives). In principle, any of the subsets that is capable of parametrizing the discovered embedding coordinates chosen by the above-described methodology can be considered as a new possible input data domain for learning the right-hand side of the effective PDEs. In practice, we find the subsets of variables of the input space that minimally parametrize the intrinsic embedding by quantifying the quality of the parametrization with a total regression loss based on the mean squared error:

$$\mathcal{L}_{total} = \sum_{l=1}^{k} \mathcal{L}_l. \tag{18}$$

Here, as $\mathcal{L}_l$, we define the regression loss for representing the intrinsic coordinate $\boldsymbol{\psi}_l$ when using a candidate subset of the input features:

$$\mathcal{L}_l = \frac{1}{n}\sum_{i=1}^{n}\bigl(\psi_l(i) - \hat{f}_l(\boldsymbol{x}_i)\bigr)^{2}, \tag{19}$$

where $\hat{f}_l$ is the output of the regressor with inputs the values of the selected features in the ambient space and target values the eigenvectors $\boldsymbol{\psi}_l$. Note that in this procedure, we did not include the values of the bifurcation parameter in the dataset. We have chosen to employ the above method separately for every subset of data corresponding to the same value of the bifurcation parameter, and finally to select the subset of features with the minimum sum of total regression losses across all the embedding spaces.
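A sketch of the LOOCV residual $r_l$ of Eq. (17), with the Gaussian kernel of Eq. (16); the function name and the toy data are illustrative:

```python
import numpy as np

def residual(Psi_prev, psi_l, sigma):
    """Normalized LOOCV error r_l (Eq. 17) for predicting the eigenvector
    psi_l from the preceding ones via locally weighted linear regression
    with a Gaussian kernel (Eqs. 14-16)."""
    n = len(psi_l)
    preds = np.empty(n)
    A = np.c_[np.ones(n), Psi_prev]               # design matrix [1, Psi_prev]
    for i in range(n):
        d2 = ((Psi_prev - Psi_prev[i]) ** 2).sum(axis=1)
        sw = np.sqrt(np.exp(-d2 / sigma**2))      # kernel weights, Eq. (16)
        sw[i] = 0.0                               # leave point i out
        coef, *_ = np.linalg.lstsq(sw[:, None] * A, sw * psi_l, rcond=None)
        preds[i] = coef[0] + Psi_prev[i] @ coef[1:]
    return np.sqrt(((psi_l - preds) ** 2).sum() / (psi_l ** 2).sum())

# a harmonic direction (psi1**2) yields a small r_l; pure noise a large one
rng = np.random.default_rng(1)
psi1 = np.sort(rng.uniform(-1.0, 1.0, 80))
r_dep = residual(psi1[:, None], psi1**2, sigma=0.2)
r_ind = residual(psi1[:, None], rng.normal(size=80), sigma=0.2)
```

The toy example mirrors the selection rule in the text: the quadratic harmonic of the first coordinate is flagged as dependent (small residual), while an unrelated direction is flagged as independent (large residual).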
2.2 Shallow Feedforward Neural Networks
It is well known that FNNs are universal approximators of any (piecewise) continuous (multivariate) function to any desired accuracy [38, 39, 40, 41, 42]. This implies that any failure of a network must arise from an inadequate choice/calibration of the weights and biases or from an insufficient number of hidden nodes.
The output of a FNN with two hidden layers, with $H$ hidden units in each layer, that models the right-hand side $F_i$ of the $i$-th PDE at each input point $\boldsymbol{z}$, evaluated at each point in space $x_j$ and time $t_k$, can be written as:

$$\hat{F}_i(\boldsymbol{z}) = b^{o} + \sum_{q=1}^{H} w^{o}_{q}\,\psi\!\left(b^{(2)}_{q} + \sum_{r=1}^{H} w^{(2)}_{qr}\,\psi\bigl(\boldsymbol{w}^{(1)T}_{r}\boldsymbol{z} + b^{(1)}_{r}\bigr)\right), \tag{20}$$

where $\psi(\cdot)$ is the activation function (based on the above formulation, it is assumed to be the same for all nodes in the hidden layers), $w^{o}_{q}$ are the external weights connecting the second hidden layer and the output, $b^{o}$ is the bias of the output node, the matrix $\boldsymbol{W}^{(1)}$ with rows $\boldsymbol{w}^{(1)}_{r}$ contains the weights connecting the input and the first hidden layer, $b^{(1)}_{r}$ are the biases of the nodes of the first hidden layer, the matrix $\boldsymbol{W}^{(2)}$ with elements $w^{(2)}_{qr}$ contains the weights connecting the $r$-th unit of the first hidden layer with the $q$-th unit of the second hidden layer, and $b^{(2)}_{q}$ are the biases of the second hidden layer. In the same way, one can easily extend the above formula to more than two hidden layers. Then, a loss function for each one of the $m$ PDEs can be specified as:

$$E_i = \sum_{j,k}\left(\hat{F}_i(\boldsymbol{z}^{(i)}_{jk}) - \frac{\partial u_i(x_j, t_k)}{\partial t}\right)^{2}. \tag{21}$$
The main challenge in training a neural network is generalization. Foresee and Hagan [43] showed that adding the regularization term $\|\boldsymbol{w}\|_{2}^{2}$, where $\boldsymbol{w}$ collects all the weights and biases of the network, to the cost function maximizes the posterior probability based on Bayes' rule. Hence, the total cost function is:

$$E = E_i + \alpha\,\|\boldsymbol{w}\|_{2}^{2}, \tag{22}$$

where $\alpha$ is the regularization parameter that has to be tuned. For our simulations, we used Bayesian-regularized backpropagation, updating the weight values using the Levenberg-Marquardt algorithm [44], as implemented in Matlab.
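The forward pass of Eq. (20) and the regularized cost of Eq. (22) amount to a few matrix products; the sketch below uses tanh activations and illustrative shapes (the actual training in the paper uses Matlab's Bayesian-regularized Levenberg-Marquardt, not this code):

```python
import numpy as np

def fnn_forward(Z, W1, b1, W2, b2, wo, bo, act=np.tanh):
    """Two-hidden-layer FNN of Eq. (20); Z holds one input point per row."""
    H1 = act(Z @ W1.T + b1)          # first hidden layer
    H2 = act(H1 @ W2.T + b2)         # second hidden layer
    return H2 @ wo + bo              # linear output node

def total_cost(pred, target, weights, alpha):
    """Sum-of-squares loss (Eq. 21) plus the L2 penalty of Eq. (22)."""
    return ((pred - target) ** 2).sum() + alpha * sum((w ** 2).sum() for w in weights)

rng = np.random.default_rng(2)
Z = rng.normal(size=(50, 3))                          # 50 points, 3 features
W1, b1 = rng.normal(size=(8, 3)), rng.normal(size=8)  # H = 8 hidden units
W2, b2 = rng.normal(size=(8, 8)), rng.normal(size=8)
wo, bo = rng.normal(size=8), 0.1
pred = fnn_forward(Z, W1, b1, W2, b2, wo, bo)
cost = total_cost(pred, rng.normal(size=50), [W1, W2, wo], alpha=0.01)
```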
2.3 Random Projection Neural Networks
Random Projection Neural Networks (RPNNs) are a family of neural networks including Random Vector Functional Links (RVFLs) [45, 46], Reservoir Computing/Echo State Networks [47, 48], Extreme Learning Machines [49] and Liquid State Machines [50]. The basic idea behind all these approaches, the seed of which can be found in the pioneering work of Rosenblatt back in the '60s [51], is to use FNNs with fixed weights between the input and the hidden layer(s), fixed biases for the nodes of the hidden layer(s), and a linear output layer. Based on this configuration, the output is projected linearly onto the functional subspace spanned by the nonlinear basis functions of the hidden layer, and the only remaining unknowns are the weights between the hidden and the output layer. Their estimation is done by solving a linear regularized least-squares problem [52, 53]. The universal approximation properties of RPNNs have been proved in a series of papers (see, e.g., [45, 46, 48, 49]). In general, the universal approximation property of random projections can be rationalized by the celebrated Johnson and Lindenstrauss (JL) Theorem [54]:

Theorem 1 (Johnson and Lindenstrauss)
Let $\boldsymbol{X}$ be a set of $n$ points in $\mathbb{R}^{N}$. Then, for all $\epsilon \in (0, 1)$ and $k \geq O(\ln n / \epsilon^{2})$, there exists a map $G: \mathbb{R}^{N} \to \mathbb{R}^{k}$ such that, for all $\boldsymbol{x}_i, \boldsymbol{x}_j \in \boldsymbol{X}$:

$$(1 - \epsilon)\,\|\boldsymbol{x}_i - \boldsymbol{x}_j\|^{2} \;\leq\; \|G(\boldsymbol{x}_i) - G(\boldsymbol{x}_j)\|^{2} \;\leq\; (1 + \epsilon)\,\|\boldsymbol{x}_i - \boldsymbol{x}_j\|^{2}. \tag{23}$$
Note that while the above theorem is deterministic, its proof relies on probabilistic techniques combined with Kirszbraun's theorem to yield a so-called extension mapping [54]. In particular, it can be shown that one of the many such embedding maps is simply a linear projection matrix $\boldsymbol{R} \in \mathbb{R}^{k \times N}$ with entries that are i.i.d. random variables sampled from a normal distribution. In particular, the JL Theorem may be proved using the following lemma.

Lemma 2

Let $\boldsymbol{X}$ be a set of $n$ points in $\mathbb{R}^{N}$ and let $G$ be the random projection defined by

$$G(\boldsymbol{x}) = \frac{1}{\sqrt{k}}\,\boldsymbol{R}\boldsymbol{x},$$

where $\boldsymbol{R} \in \mathbb{R}^{k \times N}$ has components that are i.i.d. random variables sampled from the standard normal distribution. Then, for $k \geq O(\ln n / \epsilon^{2})$, the bound (23) holds with high probability.
Similar proofs have been given for distributions different from the normal one (see, e.g. [55, 56, 57, 58]).
The above is a feature mapping which may result in a dimensionality reduction ($k < N$) or, in analogy to the case of kernel-based manifold learning methods, a projection into a higher-dimensional space ($k > N$). We also note that while the above linear random projection is but one of the choices for constructing a JL embedding, it has been experimentally demonstrated and/or theoretically proven that appropriately constructed nonlinear random embeddings may outperform simple linear random projections. For example, in [59] it was shown that deep networks with random weights for each layer result in even better approximation accuracy than the simple linear random projection.
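The distance-preservation property of Lemma 2 is easy to verify empirically; the sketch below projects points with a Gaussian matrix and checks the distortion of all pairwise distances (the dimension values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, N, k = 50, 2000, 1200        # points, ambient dimension, projected dimension
X = rng.normal(size=(n, N))
R = rng.normal(size=(k, N))     # i.i.d. standard normal entries
G = X @ R.T / np.sqrt(k)        # the linear JL map G(x) = (1/sqrt(k)) R x

def pdist2(A):
    """Squared pairwise distances between the rows of A."""
    sq = (A ** 2).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * A @ A.T

mask = ~np.eye(n, dtype=bool)
ratios = pdist2(G)[mask] / pdist2(X)[mask]  # should concentrate around 1
```

For this choice of $k$, the ratios of projected to original squared distances stay within a few percent of 1, as Eq. (23) predicts.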
Here, for learning the right-hand side of the set of PDEs, we used RPNNs (in the form of an Extreme Learning Machine [49]) with a single hidden layer [52]. For $H$ hidden units in the hidden layer, the output of the proposed RPNN can be written as:

$$\hat{F}_i(\boldsymbol{z}) = \sum_{q=1}^{H} w^{o}_{q}\,\psi\bigl(\boldsymbol{w}_{q}^{T}\boldsymbol{z} + b_{q}\bigr) + b^{o}, \tag{24}$$

where $\boldsymbol{z}$ denotes the inputs computed at each point in space $x_j$ and time $t_k$, $\psi(\cdot)$ is the activation function, $w^{o}_{q}$ are the external weights connecting the hidden layer and the output, $b^{o}$ is the bias of the output node, while the matrix $\boldsymbol{W}$ with rows $\boldsymbol{w}_{q}$ and the $b_{q}$ are the weights connecting the input and the hidden layer and the biases of the hidden layer, respectively. Now, in the proposed RPNN scheme, the $\boldsymbol{w}_{q}$ and $b_{q}$ are random variables drawn from appropriate uniform distributions, the output bias $b^{o}$ is set to zero, and the output weights $\boldsymbol{w}^{o}$ are determined by solving a linear least-squares problem:

$$\min_{\boldsymbol{w}^{o}}\,\|\boldsymbol{R}\boldsymbol{w}^{o} - \boldsymbol{y}\|_{2}^{2}, \tag{25}$$

where $\boldsymbol{y}$ is the vector collecting all the target outputs of the RPNN for the $M$ points in space and $N$ points in time, and $\boldsymbol{R}$ is the $(M \cdot N) \times H$ matrix whose elements are given by:

$$R_{(jk),q} = \psi\bigl(\boldsymbol{w}_{q}^{T}\boldsymbol{z}_{jk} + b_{q}\bigr), \tag{26}$$
where $q = 1, \dots, H$ and the index pair $(j, k)$ runs over all $M \cdot N$ space-time points. Regarding the regression problem, generally $M \cdot N \gg H$, so the system in Eq. (25) is overdetermined. As the resulting matrix $\boldsymbol{R}$ is not guaranteed to be full rank, the solution can be computed with the Singular Value Decomposition (SVD). Given the SVD $\boldsymbol{R} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{T}$, the pseudo-inverse is:

$$\boldsymbol{R}^{+} = \boldsymbol{V}\boldsymbol{\Sigma}^{+}\boldsymbol{U}^{T}, \tag{27}$$

where $\boldsymbol{U}$ and $\boldsymbol{V}$ are the unitary matrices of left and right singular vectors, respectively, and $\boldsymbol{\Sigma}$ is the diagonal matrix of singular values $\sigma_q$. Finally, in order to regularize the problem, we can select just the $r$ singular values that are greater than a given tolerance, i.e. $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > tol$. Hence, the output weights are computed as:

$$\boldsymbol{w}^{o} = \boldsymbol{V}_r\boldsymbol{\Sigma}_r^{-1}\boldsymbol{U}_r^{T}\boldsymbol{y}, \tag{28}$$

where $\boldsymbol{U}_r$, $\boldsymbol{\Sigma}_r$ and $\boldsymbol{V}_r$ are the restrictions of $\boldsymbol{U}$, $\boldsymbol{\Sigma}$ and $\boldsymbol{V}$ to the first $r$ singular values.
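The training step of Eqs. (24)-(28) can be sketched as follows; the weight ranges, tolerance, and toy target below are illustrative, and the conventional uniform sampling shown here is the baseline that the next subsection improves upon:

```python
import numpy as np

def fit_rpnn(Z, y, H=100, tol=1e-8, seed=4):
    """Fit the output weights of a single-hidden-layer RPNN (Eqs. 24-28):
    random internal weights/biases, logistic activations, and a
    truncated-SVD pseudo-inverse for the linear least-squares problem."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-5, 5, size=(H, Z.shape[1]))  # random internal weights
    b = rng.uniform(-5, 5, size=H)                # random biases
    R = 1.0 / (1.0 + np.exp(-(Z @ W.T + b)))      # hidden-layer matrix, Eq. (26)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    r = int((s > tol * s[0]).sum())               # drop tiny singular values
    w_o = Vt[:r].T @ ((U[:, :r].T @ y) / s[:r])   # output weights, Eq. (28)
    return W, b, w_o

def rpnn_predict(Z, W, b, w_o):
    return (1.0 / (1.0 + np.exp(-(Z @ W.T + b)))) @ w_o

# toy regression: learn f(z) = sin(pi z) on [-1, 1] from 400 samples
Z = np.linspace(-1.0, 1.0, 400)[:, None]
y = np.sin(np.pi * Z[:, 0])
W, b, w_o = fit_rpnn(Z, y)
err = np.abs(rpnn_predict(Z, W, b, w_o) - y).max()
```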
For the regression problem, we aim at learning the right-hand side of the PDEs from spatiotemporal data with single-hidden-layer RPNNs with random basis functions $\psi_q$:

$$\psi_q(\boldsymbol{z}) = \psi\bigl(\boldsymbol{w}_{q}^{T}\boldsymbol{z} + b_{q}\bigr). \tag{29}$$

Then the approximated function is just a linear combination of the random basis functions $\psi_q$. For our computations, we selected as activation function the logistic sigmoid, given by:

$$\psi(x) = \frac{1}{1 + e^{-x}}, \tag{30}$$

where $x$ is given by the linear combination $x = \boldsymbol{w}_{q}^{T}\boldsymbol{z} + b_{q}$.
2.3.1 Random Sampling Procedure
For the construction of an appropriate set of random basis functions for the solution of the inverse problem (i.e. that of learning the effective PDEs from data), we suggest a random sampling procedure different from the one usually implemented in RPNNs, and in particular in Extreme Learning Machines [52, 53, 60, 61, 62, 63], for the solution of the forward problem, i.e. that of the numerical solution of Partial Differential Equations. Since in the inverse problem we aim at solving a high-dimensional overdetermined system, it is important to parsimoniously select the underlying basis functions $\psi_q$, i.e. to seek appropriate internal weights $\boldsymbol{w}_q$ and biases $b_q$ that lead to non-trivial basis functions.

In general, the weights and biases are uniformly random sampled from a subset of the input/feature space, but a high dimension of the input/feature space leads to the curse of dimensionality: many basis functions are then necessary to correctly "explore" the input space and provide a good basis. Hence, our goal is to construct the $\boldsymbol{w}_q$ and $b_q$ with a simple data-driven manifold learning step, in order to obtain a basis of functions that describes well the manifold $\mathcal{M}$ in which the data are embedded. It is well known that the output of a neuron is given by a ridge function $\psi(\boldsymbol{w}^{T}\boldsymbol{z} + b)$. The inflection point of the logistic sigmoid is at $x = 0$, where $\psi(0) = 1/2$. The points $\boldsymbol{z}$ that satisfy the following relation [52, 53]:

$$\boldsymbol{w}_{q}^{T}\boldsymbol{z} + b_{q} = 0 \tag{31}$$

form a hyperplane $H_q$ (of dimension one less than that of the input space) whose orientation is defined by the direction of $\boldsymbol{w}_q$. Along $H_q$, $\psi_q$ is constantly $1/2$. We call the points $\boldsymbol{c}_q \in H_q$ the centers of the ridge function $\psi_q$. Here, the goal is to select centers that lie on the manifold $\mathcal{M}$ (note that this is not achieved by random weights and biases) and to find directions $\boldsymbol{w}_q$ that make $\psi_q$ non-constant/non-trivial along $\mathcal{M}$ (note that for ridge functions there are many directions for which this does not happen). Thus, we suggest to uniformly random sample $H$ points $\boldsymbol{c}_q$ from the input data to serve as the centers of the functions $\psi_q$: in this way, the inflection points of $\psi_q$ lie on the manifold $\mathcal{M}$. Also, we independently randomly sample $H$ other points $\boldsymbol{a}_q$ from the inputs. Then, we construct the hidden weights as:

$$\boldsymbol{w}_{q} = \boldsymbol{a}_{q} - \boldsymbol{c}_{q}, \tag{32}$$

in order to align the direction of each ridge function with the line connecting $\boldsymbol{c}_q$ and $\boldsymbol{a}_q$. By doing so, the ridge function is constant in directions orthogonal to the line connecting two points on the manifold, while along this line it changes in value, so it is able to discriminate between the points lying along this direction. Thus, the biases are set as:

$$b_{q} = -\boldsymbol{w}_{q}^{T}\boldsymbol{c}_{q}. \tag{33}$$

Eq. (31) then ensures that $\boldsymbol{c}_q$ is a center of the ridge function $\psi_q$.
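The construction of Eqs. (32)-(33) can be sketched as follows; the function name and array shapes are illustrative:

```python
import numpy as np

def sample_hidden_parameters(Z, H, seed=5):
    """Draw H centers c_q and auxiliary points a_q from the input data and
    build the internal weights (Eq. 32) and biases (Eq. 33), so that each
    ridge function has its inflection hyperplane passing through a data
    point (duplicate draws, giving a zero direction, could be resampled)."""
    rng = np.random.default_rng(seed)
    c = Z[rng.integers(0, Z.shape[0], H)]   # centers sampled from the data
    a = Z[rng.integers(0, Z.shape[0], H)]   # independently sampled points
    W = a - c                               # directions between data points
    b = -np.einsum("ij,ij->i", W, c)        # biases: w_q . c_q + b_q = 0
    return W, b, c

rng = np.random.default_rng(5)
Z = rng.normal(size=(500, 4))
W, b, c = sample_hidden_parameters(Z, H=64)
# every center satisfies Eq. (31), i.e. it lies on its inflection hyperplane
check = np.einsum("ij,ij->i", W, c) + b
```

The resulting `W` and `b` then replace the uniformly drawn internal parameters in the RPNN training step, leaving the output-weight computation unchanged.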
3 Coarse-grained numerical bifurcation analysis from spatiotemporal data

For assessing the performance of the proposed scheme, we selected the celebrated, well-studied FitzHugh-Nagumo (FHN) model, first introduced in [64] to simplify the Hodgkin-Huxley model into a two-dimensional system of ODEs describing the dynamics of the voltage across a nerve cell. In particular, we consider the FHN equations which add a spatial diffusion term to describe the propagation of an action potential as a traveling wave. The bifurcation diagram of the one-dimensional set of PDEs is known to have a turning point and two supercritical Andronov-Hopf bifurcation points. In what follows, we describe the model along with the initial and boundary conditions, and then we present the Lattice Boltzmann model.
3.1 The Macroscale model: the FitzHugh-Nagumo Partial Differential Equations
The evolution of the activation $u(x,t)$ and inhibition $v(x,t)$ dynamics is described by the following two coupled nonlinear parabolic PDEs:

$$\frac{\partial u}{\partial t} = D^{u}\frac{\partial^{2} u}{\partial x^{2}} + u - u^{3} - v,$$
$$\frac{\partial v}{\partial t} = D^{v}\frac{\partial^{2} v}{\partial x^{2}} + \varepsilon\,(u - \alpha_{1} v - \alpha_{0}), \tag{34}$$

with homogeneous Neumann boundary conditions:

$$\frac{\partial u}{\partial x} = 0, \qquad \frac{\partial v}{\partial x} = 0 \qquad \text{on } \partial\Omega. \tag{35}$$

$D^{u}$, $D^{v}$, $\alpha_{0}$ and $\alpha_{1}$ are parameters, and $\varepsilon$ is the kinetic bifurcation parameter. For our simulations, we fixed the values of $D^{u}$, $D^{v}$, $\alpha_{0}$ and $\alpha_{1}$, and varied the bifurcation parameter $\varepsilon$ in an interval containing the critical points. In order to explore the dynamic behaviour, we considered various initial conditions $u(x,0)$ and $v(x,0)$, selected randomly as follows:

$$u(x, 0) \sim U(u_{\min}, u_{\max}), \qquad v(x, 0) \sim U(v_{\min}, v_{\max}), \tag{36}$$

where $U(a, b)$ denotes the uniform distribution in the interval $[a, b]$.
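The reference bifurcation diagram against which the scheme is validated comes from a finite-difference discretization of Eq. (34); a minimal explicit-Euler sketch is shown below. The grid, time step, parameter values and initial condition are illustrative placeholders, not the paper's actual settings:

```python
import numpy as np

def fhn_rhs(u, v, dx, Du=1.0, Dv=4.0, eps=0.01, a0=-0.03, a1=2.0):
    """Finite-difference right-hand side of the FHN PDEs (Eq. 34), with
    zero-flux (homogeneous Neumann) boundaries via ghost-node reflection."""
    def lap(w):
        wp = np.pad(w, 1, mode="edge")              # reflect at the boundaries
        return (wp[2:] - 2.0 * w + wp[:-2]) / dx**2
    return (Du * lap(u) + u - u**3 - v,
            Dv * lap(v) + eps * (u - a1 * v - a0))

x = np.linspace(0.0, 20.0, 201)
dx, dt = x[1] - x[0], 1e-3
u, v = np.tanh(x - 10.0), np.zeros_like(x)          # illustrative initial data
for _ in range(1000):                               # explicit Euler steps
    du, dv = fhn_rhs(u, v, dx)
    u, v = u + dt * du, v + dt * dv
```

In practice, the steady states of this discretized system (computed with Newton's method rather than time stepping) provide the reference branch for the bifurcation diagram.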
3.2 The Lattice Boltzmann model
The Lattice Boltzmann model serves as our fine-scale simulator. The statistical description of the system at a mesoscopic level uses the concept of the distribution function $f(\boldsymbol{x}, \boldsymbol{v}, t)$, i.e. $f(\boldsymbol{x}, \boldsymbol{v}, t)\,d\boldsymbol{x}\,d\boldsymbol{v}$ is the infinitesimal probability of having particles at location $\boldsymbol{x}$ with velocities $\boldsymbol{v}$ at a given time $t$, which reduces the high number of equations and unknowns. At this level, a system without an external force is governed by the Boltzmann Transport equation [66]:

$$\frac{\partial f}{\partial t} + \boldsymbol{v}\cdot\nabla_{\boldsymbol{x}} f = Q(f), \tag{37}$$

where the term $Q(f)$ describes the rate of collisions between particles. In 1954, Bhatnagar, Gross and Krook (BGK) [67] introduced an approximation model for the collision operator:

$$Q(f) \approx -\frac{1}{\tau}\bigl(f - f^{eq}\bigr), \tag{38}$$

where $\tau$ is the so-called relaxation time coefficient and $f^{eq}$ denotes the local equilibrium distribution function.

In the LBM, Eqs. (37)-(38) are collocated (assumed valid) along specific directions $\boldsymbol{c}_i$ on a lattice:

$$\frac{\partial f_i}{\partial t} + \boldsymbol{c}_i\cdot\nabla_{\boldsymbol{x}} f_i = -\frac{1}{\tau}\bigl(f_i - f_i^{eq}\bigr), \tag{39}$$

and then Eq. (39) is discretized with a time step $\Delta t$ as follows:

$$f_i(\boldsymbol{x} + \boldsymbol{c}_i\Delta t,\, t + \Delta t) = f_i(\boldsymbol{x}, t) - \frac{\Delta t}{\tau}\bigl(f_i(\boldsymbol{x}, t) - f_i^{eq}(\boldsymbol{x}, t)\bigr). \tag{40}$$

One common interpretation of Eq. (40) is to think of the distribution functions as fictitious particles that stream and collide along specified linkages of the lattice. Lattices are usually denoted by the notation DdQq, where d is the spatial dimension of the problem and q refers to the number of connections of each node in the lattice. The nodes of the lattice coincide with the points of a spatial grid with spatial step $\Delta x$.
Here, in order to estimate the coarse-scale observables $u(x,t)$ and $v(x,t)$ of the FHN dynamics, we considered the D1Q3 implementation, i.e. we used the one-dimensional lattice with three velocities: particles can stream to the right ($c_1 = \Delta x/\Delta t$), to the left ($c_{-1} = -\Delta x/\Delta t$) or stay still on the node ($c_0 = 0$). Also, we assume the coexistence of two different distribution functions, $f^{u}_{i}$ for describing the distribution of the activator particles and $f^{v}_{i}$ for the distribution of the inhibitor particles, where the subscript $i \in \{-1, 0, 1\}$ refers to the associated direction. Therefore, one can figure that at each instant there are six fictitious particles on each node of the lattice: two resting on the node (with distributions $f^{u}_{0}$ and $f^{v}_{0}$), two moving to the left (with distributions $f^{u}_{-1}$ and $f^{v}_{-1}$) and two moving to the right (with distributions $f^{u}_{1}$ and $f^{v}_{1}$). The relation between the above distributions and the coarse-scale densities $u$ and $v$ is given by the zeroth moment (across the velocity directions) of the overall distribution function:

$$u(x, t) = \sum_{i=-1}^{1} f^{u}_{i}(x, t), \qquad v(x, t) = \sum_{i=-1}^{1} f^{v}_{i}(x, t). \tag{41}$$
The coexistence of multiple distributions makes it necessary to introduce weights $w_i$ for the connections in the lattice, which should satisfy the following properties:

- Normalization: $\sum_{i} w_i = 1$
- Symmetry: $w_i = w_{-i}$
- Isotropy: $\sum_{i} w_i c_i = 0$, $\sum_{i} w_i c_i^{2} = c_s^{2}$, $\sum_{i} w_i c_i^{3} = 0$,

where $c_s$ is the speed of sound in the lattice. Thus, the weights are equal to $1/6$ for the moving particles and $4/6$ for the resting particle. The resulting speed of sound in the lattice is $c_s = \frac{1}{\sqrt{3}}\frac{\Delta x}{\Delta t}$.
As the BGK operator (38) suggests, one key step in applying the LBM to solving reaction-advection-diffusion PDEs is to determine the local equilibrium distribution function associated with a given model. For particles with macroscopic density $\rho$ that move in a medium with macroscopic velocity $\boldsymbol{u}_m$, the Maxwell distribution is:

$$f^{eq} = \frac{\rho}{(2\pi R T)^{d/2}}\exp\left(-\frac{\|\boldsymbol{v} - \boldsymbol{u}_m\|^{2}}{2 R T}\right), \tag{42}$$

where $d$ is the spatial dimension of the problem, $T$ is the temperature and $R$ is the universal gas constant. The exponential in Eq. (42) can be expanded using a Taylor series, ignoring terms of order $O(\|\boldsymbol{u}_m\|^{3})$ and higher, thus obtaining:

$$f^{eq}_{i} = w_i\,\rho\left(1 + \frac{\boldsymbol{c}_i\cdot\boldsymbol{u}_m}{c_s^{2}} + \frac{(\boldsymbol{c}_i\cdot\boldsymbol{u}_m)^{2}}{2 c_s^{4}} - \frac{\|\boldsymbol{u}_m\|^{2}}{2 c_s^{2}}\right), \tag{43}$$

with $c_s^{2} = R T$. Now, since the FHN PDEs are only diffusive, i.e. there are no advection terms, the medium is stationary ($\boldsymbol{u}_m = \boldsymbol{0}$) and the equilibrium distribution function, discretized along the lattice direction $i$, simplifies to:

$$f^{eq,u}_{i} = w_i\,u(x, t), \qquad f^{eq,v}_{i} = w_i\,v(x, t). \tag{44}$$
Now, in the FHN model, we need to consider also the reaction terms, and so, finally, the time evolution of the microscopic simulator associated with the FHN dynamics on a given lattice is:

$$f^{k}_{i}(x + c_i\Delta t,\, t + \Delta t) = f^{k}_{i}(x, t) - \frac{\Delta t}{\tau_k}\bigl(f^{k}_{i}(x, t) - f^{eq,k}_{i}(x, t)\bigr) + \Delta t\,w_i\,R^{k}(x, t), \tag{45}$$

where the superscript $k = u$ denotes the activator and $k = v$ the inhibitor, and the reaction terms are directly derived from the right-hand side of Eq. (34):

$$R^{u} = u - u^{3} - v, \qquad R^{v} = \varepsilon\,(u - \alpha_{1} v - \alpha_{0}). \tag{46}$$

Finally, the relaxation coefficient $\tau_k$ is related to the macroscopic diffusion coefficient $D^{k}$ of the FHN model and, in general, depends on the speed of sound associated with the lattice [68]:

$$\tau_k = \frac{D^{k}}{c_s^{2}} + \frac{\Delta t}{2}. \tag{47}$$
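Putting Eqs. (41), (44) and (45) together, one collide-react-stream update on a D1Q3 lattice can be sketched as follows. This is a simplified illustration for a single species with a zero reaction term (pure diffusion) and crude zero-flux edge handling, not the paper's simulator:

```python
import numpy as np

def lbm_step(f, w, tau, dt, reaction):
    """One BGK collide + react + stream update (Eq. 45) on a D1Q3 lattice;
    f has shape (3, n_nodes) for the directions (-1, 0, +1)."""
    rho = f.sum(axis=0)                      # coarse-scale density, Eq. (41)
    feq = w[:, None] * rho                   # stationary equilibrium, Eq. (44)
    f = f - (dt / tau) * (f - feq) + dt * w[:, None] * reaction(rho)
    f[2] = np.roll(f[2], 1)                  # stream the right movers
    f[0] = np.roll(f[0], -1)                 # stream the left movers
    f[2, 0], f[0, -1] = f[2, 1], f[0, -2]    # crude zero-flux edges
    return f

w = np.array([1/6, 4/6, 1/6])                # D1Q3 weights for i = -1, 0, +1
rho0 = 1.0 + 0.1 * np.sin(np.linspace(0, 2*np.pi, 100))
f = w[:, None] * rho0                        # initialize at equilibrium
for _ in range(200):                         # pure diffusion: zero reaction
    f = lbm_step(f, w, tau=1.0, dt=1.0, reaction=lambda r: 0.0 * r)
rho = f.sum(axis=0)
```

With zero reaction, the density profile is progressively smoothed, mimicking the diffusive part of Eq. (34); for the FHN simulator, two such populations with the reaction terms of Eq. (46) would be evolved jointly.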
4 Algorithm flow chart
Summarizing, the proposed three-tier algorithm for constructing bifurcation diagrams from data is provided in the form of pseudo-code in Algorithm 1. The first two steps are related to the identification of the effective coarse-scale observables and the learning of the right-hand side of the effective PDEs. The third step is where the pseudo-arclength continuation method is applied for the tracing of the solution branch through saddle-node bifurcations.
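The third step can be sketched on a scalar toy problem: pseudo-arclength continuation augments $f(u, \lambda) = 0$ with an arclength constraint so the branch can be traced around turning points. The code below is a minimal predictor-corrector sketch on the fold of $f(u, \lambda) = \lambda - u^2$, where the learned right-hand side would play the role of $f$; it is not the paper's implementation:

```python
import numpy as np

def continuation(f, u0, lam0, du0, dlam0, ds=0.05, steps=80):
    """Pseudo-arclength predictor-corrector for a scalar f(u, lam) = 0:
    Newton iteration on {f = 0, secant arclength constraint = 0}."""
    branch = [(u0, lam0)]
    u, lam, tu, tl = u0, lam0, du0, dlam0
    for _ in range(steps):
        up, lp = u + ds * tu, lam + ds * tl           # tangent predictor
        for _ in range(20):                           # Newton corrector
            F = f(up, lp)
            C = tu * (up - u) + tl * (lp - lam) - ds  # arclength constraint
            h = 1e-7                                  # finite-difference step
            J = np.array([[(f(up + h, lp) - F) / h, (f(up, lp + h) - F) / h],
                          [tu, tl]])
            dup, dlp = np.linalg.solve(J, [-F, -C])
            up, lp = up + dup, lp + dlp
            if abs(F) < 1e-12 and abs(C) < 1e-12:
                break
        tu, tl = up - u, lp - lam                     # new secant tangent
        nrm = np.hypot(tu, tl)
        tu, tl = tu / nrm, tl / nrm
        u, lam = up, lp
        branch.append((u, lam))
    return np.array(branch)

# steady states of f(u, lam) = lam - u**2 form a parabola with a turning
# point at (0, 0); start on the lower branch and trace through the fold
branch = continuation(lambda u, lam: lam - u**2,
                      u0=-1.0, lam0=1.0, du0=1.0, dlam0=0.0)
```

Unlike naive parameter continuation, which fails where $\partial f/\partial u$ becomes singular, the augmented system remains solvable at the fold, so the computed branch passes smoothly from the lower to the upper solution family.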