Combining Machine Learning with Knowledge-Based Modeling for Scalable Forecasting and Subgrid-Scale Closure of Large, Complex, Spatiotemporal Systems

02/10/2020 · Alexander Wikner, et al.

We consider the commonly encountered situation (e.g., in weather forecasting) where the goal is to predict the time evolution of a large, spatiotemporally chaotic dynamical system when we have access to both time series data of previous system states and an imperfect model of the full system dynamics. Specifically, we attempt to utilize machine learning as the essential tool for integrating the use of past data into predictions. In order to facilitate scalability to the common scenario of interest where the spatiotemporally chaotic system is very large and complex, we propose combining two approaches: (i) a parallel machine learning prediction scheme; and (ii) a hybrid technique for a composite prediction system composed of a knowledge-based component and a machine-learning-based component. We demonstrate that not only can this method combining (i) and (ii) be scaled to give excellent performance for very large systems, but also that the length of time series data needed to train our multiple, parallel machine learning components is dramatically less than that necessary without parallelization. Furthermore, considering cases where computational realization of the knowledge-based component does not resolve subgrid-scale processes, our scheme is able to use training data to incorporate the effect of the unresolved short-scale dynamics upon the resolved longer-scale dynamics ("subgrid-scale closure").


I Introduction

In recent years, machine learning techniques have been used to solve a number of complex modeling problems, ranging from effective translation between hundreds of different human languages [wu_googles_2016] to predicting the bioactivity of small molecules for drug discovery [wallach_atomnet:_2015]. Typically, the most impressive results have been obtained using artificial neural networks with many hidden neural states [lecun_deep_2015]. These hidden layers form a “black box” model whose internal parameters are trained given a set of measured training data but, after which, only the final model output is observed. This formulation using measured training data contrasts with how models used for forecasting physical spatiotemporally chaotic processes are formulated, which is typically done using the available scientific knowledge of the underlying mechanisms that govern the system’s evolution. For example, in the case of forecasting weather, this knowledge includes the Navier-Stokes equations, the first law of thermodynamics, the ideal gas law, and (see Sec. IV) simplified representations of physics at the unresolved spatial scales [bauer_quiet_2015].

In this paper, focusing on the key issues of scalability and unresolved subgrid physics, we consider the general problem of forecasting a very large and complex spatiotemporally chaotic system for which we have access to both past time series of measurements of the system state evolution and an imperfect knowledge-based prediction model. We present a method for combining machine learning prediction with imperfect knowledge-based forecasting that is scalable to large systems, with the aim that the resulting combined prediction system can be significantly more accurate and efficient than either a pure knowledge-based prediction or a pure machine-learning-based prediction. A main source of difficulty for scalability of the machine learning is that the dimension of the state of the systems we are interested in can be extremely large. For example, in state-of-the-art global numerical weather models the state dimension (number of variables at all grid points) can be on the order of $10^9$. Thus, both the machine learning input (the current atmospheric state) and output (the predicted atmospheric state) have this dimensionality. (In contrast to the description of some machine learning techniques as “deep”, one might refer to the situations we address as “wide”.) The prediction method that we propose for such large, complex systems builds on previous work on parallelizable machine learning prediction [parallelreservoir2018] and on hybridization of knowledge-based modeling with machine learning [hybridreservoir2018]. We call our technique Combined Hybrid/Parallel Prediction (CHyPP, pronounced “chip”). Although the general method we propose is applicable to different kinds of machine learning, the numerical examples presented in this paper use a machine learning method known as reservoir computing [herbert2001echo, Maass_LSM_2002]. Jaeger and Haas [jaeger_harnessing_2004] described the effectiveness of reservoir computing for predicting low-dimensional chaotic systems. Research surrounding this technique has since expanded [lu_reservoir_2017, inubushi_reservoir_2017], and it has recently been shown that reservoir computing using recurrent neural networks can produce predictions of chaotic systems of similar quality to those of other recurrent architectures, such as LSTMs [LSTM_1997] and GRUs [cho_learning_2014], while often requiring much less computational time to train [vlachas_forecasting_2019]. Reservoir computing techniques can additionally be extended to physical implementations using, e.g., photonics [vinckier_high-performance_2015] and Field Programmable Gate Arrays (FPGAs) [penkovsky_efficient_2018, tanaka_recent_2019].

The rest of the paper is organized as follows. In Sec. II, we first review a simple version of reservoir computing [herbert2001echo, Maass_LSM_2002] and discuss its shortcomings for forecasting high-dimensional spatiotemporal chaos. We next describe the hybrid reservoir prediction technique (Refs. [hybridreservoir2018] and [wan2018data]), as well as previous work on how machine learning can be parallelized for prediction of spatiotemporal systems [parallelreservoir2018]. We then present our proposed CHyPP architecture combining the two. In Sec. III, we demonstrate how the CHyPP methodology improves on each of the component prediction methods. For these demonstrations we use the paradigmatic example of the Kuramoto-Sivashinsky model as our test model of the spatiotemporally chaotic system that we aim to predict. We highlight the scalability of the proposed method to very large systems, as well as its efficient use of training data, which we view as the crucial issues for the general class of applications in which we are interested. In Sec. IV, we consider a situation with multiple time and space scales and show, by numerical simulation tests, that CHyPP can, through its use of data, effectively account for unknown subgrid-scale processes. The main conclusion of this paper is that our CHyPP methodology provides an extremely promising framework, potentially facilitating significant advances in the forecasting of large, complex, spatiotemporally chaotic systems. We believe that, in addition to weather, the method that we propose may potentially be applicable to a host of important areas, enabling currently unattainable capabilities. Some speculative examples of potential applications are forecasting of ocean conditions; forecasting conditions in the solar wind, magnetosphere, and ionosphere (also known as “space weather”, important for its impact on Earth-orbiting satellites, GPS accuracy, and high-frequency communications); forecasting the evolution of forest fires and their response to mitigating interventions; forecasting the responses of ecological systems to climate change; and analysis of neuronal activity.

II Reservoir Computing Architecture for CHyPP

II.1 A Simple Machine Learning Predictor

To begin, we consider the goal of a generic machine learning system for time-series prediction of an unknown dynamical system evolving on an attractor of that system. Later, we will take the machine learning system to be a reservoir computer and the chaotic system to be not completely unknown, and we will try to make use of that partial knowledge. Given a finite-duration time series of the unknown system state’s evolution up to a certain time $t$, where the state at each time is represented by a $D$-dimensional vector $\mathbf{u}(t)$, our goal is to predict the subsequent evolution of the state. As illustrated in Fig. 1(a), in the initial training phase, at each time $t$, $\mathbf{u}(t)$ is input to the machine learning system, which is trained to output a time-$\Delta t$ prediction, $\tilde{\mathbf{u}}(t+\Delta t)$, of the dynamical system state. We refer to the just-described input-output configuration as the “open-loop” system (Fig. 1(a)). To ensure an accurate representation of the true dynamics with a reservoir of limited size, $\Delta t$ is typically short compared to natural time scales (such as the correlation time or the “Lyapunov time”) present in the unknown dynamical system whose state is to be predicted. Once trained, the machine learning system can be run in a “closed-loop” feedback configuration (Fig. 1(b)) to autonomously generate predictions over a finite duration of time. That is, with $T$ representing the time at the end of the training data, we replace the former input from the training data by the output, by inserting an output-to-input feedback loop, shown by the dashed line in Fig. 1(b). Then, when $\mathbf{u}(T)$ is the input, the machine learning system produces an output prediction for $\mathbf{u}(T+\Delta t)$, which we refer to as $\tilde{\mathbf{u}}(T+\Delta t)$. When this predicted state is then used as the input, the system produces an output prediction $\tilde{\mathbf{u}}(T+2\Delta t)$ for $\mathbf{u}(T+2\Delta t)$. This process is then iterated to produce predictions of the system state at $t = T + k\Delta t$ for $k = 1, 2, 3, \ldots$ (Fig. 1(b)).

Figure 1: Machine learning prediction device (a) open-loop training phase and (b) closed-loop prediction phase.

In the rest of this section, we first present background from previous work (Secs. II.2–II.4), then introduce our CHyPP method for combining a knowledge-based model with reservoir-based machine learning to form a scalable system concept suitable for state prediction of very large, spatiotemporally chaotic systems. Specifically, in Sec. II.2, we review a basic reservoir computing setup based on the methods of [herbert2001echo, Maass_LSM_2002], along with the proposal for its use as a predictor carried out in Ref. [jaeger_harnessing_2004]. In Sec. II.3, we build upon the simple setup of Sec. II.2 and describe the methodology from Ref. [hybridreservoir2018] for hybrid forecasting of the dynamical system using a single reservoir computing network and an imperfect model. In Sec. II.4, the reservoir computing forecasting technique of Sec. II.2 is extended via parallelization of the machine learning with multiple parallel reservoir computers, in order to predict high-dimensional spatiotemporally chaotic systems, as was first described in [parallelreservoir2018] (but without the incorporation of a knowledge-based model). Finally, in Sec. II.5, we present our proposed CHyPP architecture and technique for combining the parallel reservoir method of Sec. II.4 with the hybridization of a knowledge-based predictor and reservoir-based machine learning prediction of Sec. II.3. It is our belief that it is only by means of such a combination that the most effective application of machine-learning-enabled prediction can be realized for large, complex, spatiotemporally chaotic dynamical systems.

II.2 Basic Reservoir Computing

We now consider that the machine learning device shown in Fig. 1 is a reservoir computer which, as shown in Fig. 2 (and further discussed subsequently), consists of a linear input coupler ($\mathbf{W}_{in}$) that couples the input state $\mathbf{u}(t)$ into the reservoir (the circle in Fig. 2). The state of the reservoir is given by a high-dimensional vector $\mathbf{r}(t)$, which is then linearly coupled by $\mathbf{W}_{out}$ to produce an output vector which, through the training, is a very good approximation to $\mathbf{u}(t+\Delta t)$. In this paper, our implementation of the reservoir is an artificial recurrent neural network with a large number of nodes.

Figure 2: Diagram of the reservoir computer setup. In the “open-loop” training phase (analogous to Fig. 1(a)), the dashed line representing coupling from the output back to the input is absent. In the “closed-loop” prediction phase (analogous to Fig. 1(b)), the coupling from the output back to the input (dashed line) is activated.

The artificial neural network that forms the basis of our reservoir computing implementation is illustrated in Fig. 2. The reservoir network adjacency matrix is chosen to be a randomly generated, sparse matrix $\mathbf{A}$ that represents a directed graph with weighted edges. The adjacency matrix $\mathbf{A}$ has dimensions $N \times N$, where $N$ is the number of nodes in the network. Elements of $\mathbf{A}$ are randomly generated such that the average number of incident connections per node (average number of nonzero elements of the matrix in each row) is set to a chosen value $\langle d \rangle$, the “average in-degree”, while the nonzero elements of $\mathbf{A}$ are chosen from a uniform distribution. Once generated, $\mathbf{A}$ is re-scaled (i.e., multiplied by a scalar) so that its largest absolute eigenvalue is a prescribed value $\rho$, called the spectral radius. Each node in the network has an associated scalar state $r_i(t)$. The state of the network is represented by the $N$-dimensional vector $\mathbf{r}(t)$, whose elements are $r_i(t)$, where $1 \le i \le N$.
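As a concrete illustration, a minimal sketch of this reservoir construction in Python (NumPy/SciPy) might look as follows. The sampling interval $[-1,1]$ for the nonzero elements and the function name are our own assumptions, not specifics from the paper:

```python
import numpy as np
import scipy.sparse as sparse
from scipy.sparse.linalg import eigs

def generate_adjacency(N, avg_degree, spectral_radius, seed=0):
    """Sparse random reservoir adjacency matrix A (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Expected number of nonzeros per row equals the average in-degree <d>.
    A = sparse.random(N, N, density=avg_degree / N, format="csr",
                      random_state=seed,
                      data_rvs=lambda size: rng.uniform(-1.0, 1.0, size))
    # Rescale so that the largest-magnitude eigenvalue equals the spectral radius rho.
    largest = np.abs(eigs(A, k=1, return_eigenvectors=False)[0])
    return A * (spectral_radius / largest)
```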

The reservoir network state evolves dynamically while receiving an input $\mathbf{v}(t)$ through an $N \times D$ input coupling matrix, $\mathbf{W}_{in}$. We choose the matrix $\mathbf{W}_{in}$ to contain an equal number of nonzero elements in each column, which corresponds to coupling each element of the reservoir input to an equal number of reservoir node states. Nonzero elements of this matrix are selected randomly and uniformly from the interval $[-\sigma, \sigma]$, where $\sigma$ is referred to as the input weight. Given the current state of the reservoir $\mathbf{r}(t)$, the reservoir state is advanced at each time step $\Delta t$ using a hyperbolic tangent activation function,

$$\mathbf{r}(t+\Delta t) = \tanh\left[\mathbf{A}\,\mathbf{r}(t) + \mathbf{W}_{in}\,\mathbf{v}(t)\right]. \quad (1)$$

Before prediction begins, the reservoir computer is trained in the “open-loop” configuration. During this training phase, $\mathbf{v}(t) = \mathbf{u}(t) + \mathbf{s}(t)$. Here, $\mathbf{u}(t)$ is the measurement of the dynamical system state at time $t$ in the form of a $D$-dimensional vector. As in Ref. [herbert2001echo], we add a small, normally distributed $D$-dimensional vector $\mathbf{s}(t)$ of mean 0 and standard deviation $\beta$ to the input dynamical system state during training. The elements of the vector $\mathbf{s}(t)$ are chosen randomly and independently at each time $t$. The function of this added “stabilization noise” is to allow the reservoir computer to learn how to return to the true trajectory when the input trajectory has been perturbed away from it. We find that, in many cases, this additional small noise input beneficially promotes stability of the closed-loop prediction configuration once training has been completed.
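A sketch of this open-loop training drive, under the notation assumptions above (the helper name is ours):

```python
def drive_open_loop(A, W_in, measurements, noise_std, seed=1):
    """Run the reservoir in the open-loop (training) configuration, Eq. (1)."""
    rng = np.random.default_rng(seed)
    r = np.zeros(A.shape[0])
    states = []
    for u in measurements:                                # u(t): D-dim measurement
        v = u + noise_std * rng.standard_normal(u.shape)  # stabilization noise s(t)
        r = np.tanh(A @ r + W_in @ v)                     # reservoir update, Eq. (1)
        states.append(r)
    return np.array(states)                               # one reservoir state per step
```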

The adjustable constants characterizing the overall prediction system (e.g., $\langle d \rangle$, $\rho$, $\sigma$, $\Delta t$, and the regularization parameter $\alpha$ introduced below) are referred to as “hyperparameters”, and it is important that they be chosen carefully in order for the reservoir computer to predict accurately. For example, as explained in previous literature, the hyperparameters can be chosen so that the reservoir system has the so-called “echo-state property” (see, e.g., Ref. [herbert2001echo]) whereby, in the open-loop training phase, aside from an initial transient and with the random sequence $\mathbf{s}(t)$ fixed, the reservoir state becomes uniquely determined by the reservoir input sequence (and hence independent of the initial values of $\mathbf{r}$). Accordingly, prior to initiation of the training, we ignore and discard the reservoir and input states for the first few time steps. The state of the reservoir at the end of this transient nullification period is labelled $\mathbf{r}(0)$. Starting with $\mathbf{r}(0)$, the training system states $\mathbf{u}(n\Delta t)$ ($n$ an integer, $0 \le n\Delta t \le T$) and the resulting reservoir states, $\mathbf{r}(n\Delta t)$, are recorded and saved. We then desire to use these saved states to produce an output, $\tilde{\mathbf{u}}(t+\Delta t)$, when $\mathbf{u}(t)$ is the input, which we desire to be very close to $\mathbf{u}(t+\Delta t)$. To do this, we find it useful to perform an ad-hoc operation on the reservoir state vectors that squares the value of half of the node states. Specifically, we define $\tilde{\mathbf{r}}$ such that,

$$\tilde{r}_i = r_i \quad \text{for } i \text{ odd}, \quad (2)$$
$$\tilde{r}_i = r_i^2 \quad \text{for } i \text{ even}. \quad (3)$$
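In code, one possible form of this transformation (which half of the node states is squared is a convention; we assume the even-indexed ones here):

```python
def readout_transform(r):
    """Square half of the node states, Eqs. (2)-(3) (even/odd split assumed)."""
    r_tilde = np.asarray(r, dtype=float).copy()
    r_tilde[..., 1::2] **= 2     # square every second node state
    return r_tilde
```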

As surmised in footnote [16] of Ref. [parallelreservoir2018], this operation improves prediction by breaking a particular odd symmetry of the reservoir dynamics that is not generally present in the dynamics to be predicted. We next couple the transformed reservoir state via a $D \times N$ output coupling matrix $\mathbf{W}_{out}$ to produce an output $\tilde{\mathbf{u}}(t+\Delta t)$,

$$\tilde{\mathbf{u}}(t+\Delta t) = \mathbf{W}_{out}\,\tilde{\mathbf{r}}(t+\Delta t), \quad (4)$$

and we endeavor to choose (“train”) the matrix elements of $\mathbf{W}_{out}$ so that $\tilde{\mathbf{u}}(t+\Delta t)$ is close to $\mathbf{u}(t+\Delta t)$. In general, this will require that $N \gg D$. To accomplish this, we try to minimize the difference between $\mathbf{W}_{out}\,\tilde{\mathbf{r}}(n\Delta t)$ and $\mathbf{u}(n\Delta t)$ over the training data. To prevent overfitting, we insert a Tikhonov regularization term to penalize very large values of the matrix elements of $\mathbf{W}_{out}$ [tikhonov_solutions_1977]; that is, we minimize

$$\sum_{n} \left\| \mathbf{W}_{out}\,\tilde{\mathbf{r}}(n\Delta t) - \mathbf{u}(n\Delta t) \right\|^2 + \alpha \left\| \mathbf{W}_{out} \right\|^2 \quad (5)$$

over the scalar values of the matrix elements of $\mathbf{W}_{out}$. Here, $\alpha$ is a small regularization parameter and $\|\cdot\|$ denotes the Euclidean norm (the Frobenius norm, for matrices). This technique is commonly known as ridge regression. In our subsequent numerical experiments (Secs. III and IV), we use the “matrix solution” for the minimization problem to determine the trained $\mathbf{W}_{out}$. In particular, we proceed as follows. We first form a matrix $\tilde{\mathbf{R}}$ whose $n$th column is the transformed reservoir state $\tilde{\mathbf{r}}(n\Delta t)$. We define a target matrix $\mathbf{U}$ consisting of the time series of training data such that the $n$th column of $\mathbf{U}$ is $\mathbf{u}(n\Delta t)$. We then determine the matrix $\mathbf{W}_{out}$ that satisfies the following linear system,

$$\mathbf{W}_{out}\left(\tilde{\mathbf{R}}\,\tilde{\mathbf{R}}^T + \alpha \mathbf{I}\right) = \mathbf{U}\,\tilde{\mathbf{R}}^T. \quad (6)$$
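A compact sketch of this direct matrix solution under our notation assumptions (`readout_transform` is the hypothetical helper above):

```python
def train_W_out(reservoir_states, targets, alpha):
    """Solve W_out (R R^T + alpha I) = U R^T, Eq. (6), by ridge regression."""
    R = readout_transform(np.asarray(reservoir_states)).T  # N x T columns r-tilde
    U = np.asarray(targets).T                              # D x T training targets
    G = R @ R.T + alpha * np.eye(R.shape[0])               # regularized Gram matrix
    # W_out = U R^T G^{-1}: solve the symmetric system G X = R U^T, then transpose.
    return np.linalg.solve(G, R @ U.T).T
```

In usage, each reservoir state must be paired with the measurement one step after the input that produced it; e.g., with the output of `drive_open_loop`, one would call `train_W_out(states[:-1], measurements[1:], alpha)`.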

We note that methods of solving Eq. (5) other than the direct matrix solution are also available and may sometimes be advantageous (e.g., GMRES [saad_gmres_1986], stochastic gradient descent [zhang_solving_2004], etc.). By means of this minimization, it is hoped that $\mathbf{W}_{out}\,\tilde{\mathbf{r}}(t+\Delta t) \approx \mathbf{u}(t+\Delta t)$ is achieved. This completes the training process, following which we can switch to the closed-loop configuration (Fig. 1(b) and the dashed line in Fig. 2) and attempt to predict the subsequent evolution of $\mathbf{u}(t)$. Prediction then proceeds via Eqs. (1), (2), and (4), where $\tilde{\mathbf{u}}(t+\Delta t)$ is the prediction of the dynamical system state and the reservoir input is received from the feedback loop ($\mathbf{v}(t) = \tilde{\mathbf{u}}(t)$).
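Putting the pieces together, a minimal sketch of the closed-loop prediction loop (again with our hypothetical helper names):

```python
def predict_closed_loop(A, W_in, W_out, r, n_steps):
    """Autonomous prediction: each output is fed back as the next input."""
    predictions = []
    for _ in range(n_steps):
        u_tilde = W_out @ readout_transform(r)   # output, Eq. (4)
        predictions.append(u_tilde)
        r = np.tanh(A @ r + W_in @ u_tilde)      # feedback v(t) = u-tilde(t), Eq. (1)
    return np.array(predictions)
```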

The closed-loop configuration can be regarded as a surrogate dynamical system that mimics the original unknown dynamical system. As such, if the original unknown dynamical system is chaotic, the closed-loop predictor system will also be chaotic. Due to the exponential growth of small errors implied by chaos, we cannot expect prediction to be good for more than several Lyapunov times (the Lyapunov time is the typical e-folding time for error growth in a chaotic system). Thus, we will regard our predictions as successful when they are good for a few Lyapunov times.

Now consider that we have made a prediction and, at some later time, we wish to perform another prediction of the same spatiotemporally evolving system with unknown dynamics. It is not necessary to retrain our predictor; we can, instead, re-use the previously obtained $\mathbf{W}_{out}$ [parallelreservoir2018]. To do so, we re-initialize the reservoir state to zero, switch the reservoir computer into its open-loop configuration, and allow it to evolve given input states of the unknown dynamical system measured at times $t_p - \xi \le t \le t_p$ (i.e., $\mathbf{v}(t) = \mathbf{u}(t)$). Here, $\xi$ is a synchronization time that is sufficiently longer than the characteristic memory of the reservoir computer but, importantly, is much shorter than the training time needed to determine $\mathbf{W}_{out}$, and $t_p$ is the time at which we want to begin our prediction. After this synchronization period, the reservoir computer is switched to its standard closed-loop prediction configuration and is used to make predictions at later times.
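As a sketch, this re-synchronization is just a short open-loop pass over recent measurements before switching back to closed-loop prediction:

```python
def synchronize(A, W_in, recent_measurements):
    """Open-loop pass that lets the reservoir 'forget' its zero initialization."""
    r = np.zeros(A.shape[0])
    for u in recent_measurements:      # measurements over the synchronization window
        r = np.tanh(A @ r + W_in @ u)
    return r                           # reservoir state ready for closed-loop use
```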

If the original system is very high-dimensional (i.e., the dimension $D$ of the measurement vector is very large), then $N$ must be exceedingly large. This can make the training to determine $\mathbf{W}_{out}$ infeasible. For example, if we solve Eq. (5) by the direct matrix method, Eq. (6) shows that we must solve an $N \times N$ linear system. For our computational resources, we find that this becomes impossible as $N$ grows sufficiently large. Due to this and other similar considerations, we deem the method discussed in this section to be untenable for the prediction of large, spatiotemporally chaotic systems of the type we are interested in.

II.3 Hybrid Reservoir Computing

Figure 3: Diagram of the hybrid reservoir computer setup. In the “open-loop” training phase (analogous to Fig. 1(a)), the dashed line representing coupling from the output back to the input is absent. In the “closed-loop” prediction phase (analogous to Fig. 1(b)), the coupling from the output back to the input (dashed line) is activated.

In this section, we briefly review the hybrid scheme proposed in Ref. [hybridreservoir2018] for combining reservoir computing with an imperfect knowledge-based model of the dynamical system of interest. We again assume access to time series data of measurements of the state of the dynamical system. We further assume that an imperfect knowledge-based model of the system producing the measurements is available and that this imperfect model is capable of forecasting the state of the dynamical system with some degree of accuracy, which we wish to improve upon. In the hybrid setup of Ref. [hybridreservoir2018] (Fig. 3) described below, it has been shown that the machine learning method and the knowledge-based model augment each other and, in conjunction, can provide significantly better forecasts than either the knowledge-based model or the pure machine learning model acting alone.

As in Sec. II, we assume that the data used for training is given by measurements of the state of the dynamical system at equally spaced increments in time, $\Delta t$, forming a vector time series $\mathbf{u}(n\Delta t)$. The imperfect knowledge-based model is an operator $\mathcal{M}$ that maps the state $\mathbf{u}(t)$ to a forecast $\mathcal{M}[\mathbf{u}(t)]$ of the state at time $t + \Delta t$.

We advance the reservoir state in time using the same activation function as described in Sec. II.2,

$$\mathbf{r}(t+\Delta t) = \tanh\left[\mathbf{A}\,\mathbf{r}(t) + \mathbf{W}_{in}\,\mathbf{v}(t)\right]. \quad (7)$$

Once again, during the training phase, $\mathbf{v}(t) = \mathbf{u}(t) + \mathbf{s}(t)$. The training process is similar to the one employed in Sec. II for the basic reservoir computer, but with the addition of the knowledge-based prediction (as illustrated in Fig. 3). Using ridge regression, we find a linear mapping from $\tilde{\mathbf{r}}(t+\Delta t)$ and $\mathcal{M}[\mathbf{v}(t)]$ that produces an approximate prediction of $\mathbf{u}(t+\Delta t)$,

$$\tilde{\mathbf{u}}(t+\Delta t) = \mathbf{W}_{out}\begin{bmatrix}\tilde{\mathbf{r}}(t+\Delta t)\\ \mathcal{M}[\mathbf{v}(t)]\end{bmatrix}. \quad (8)$$

Here, $\mathbf{v}(t)$ is the same as that input to the reservoir, $\mathbf{v}(t) = \mathbf{u}(t) + \mathbf{s}(t)$. We again include the small vector $\mathbf{s}(t)$ in the knowledge-based model input during training to improve the stability of the method. Additionally, recall that $\tilde{\mathbf{r}}$ is related to $\mathbf{r}$ by Eqs. (2) and (3). In the prediction phase, we run the hybrid system in a closed-loop feedback configuration (Fig. 3 with the dashed-line feedback connection present) using Eqs. (7), (2), and the following equation,

$$\tilde{\mathbf{u}}(t+\Delta t) = \mathbf{W}_{out}\begin{bmatrix}\tilde{\mathbf{r}}(t+\Delta t)\\ \mathcal{M}[\tilde{\mathbf{u}}(t)]\end{bmatrix}. \quad (9)$$

During the prediction phase, $\tilde{\mathbf{u}}(t)$ is the hybrid forecast and the hybrid input is received from the feedback loop ($\mathbf{v}(t) = \tilde{\mathbf{u}}(t)$). Note that, in this scheme, the output is a linear combination of the (transformed) reservoir state and the knowledge-based model output that optimizes the agreement of the combined system output with the training data. Thus, we can regard the result as an optimized combination of the reservoir and knowledge-based components. Hence, we expect that if one component is superior for some aspect of the prediction, then it will be weighted more highly for that aspect of the prediction. This suggests that predictions by this method may be greatly improved over those available from either the knowledge-based component or the reservoir component acting alone (e.g., see Fig. 7 of Ref. [hybridreservoir2018]).
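A sketch of one closed-loop hybrid step under these assumptions (`model_step` stands in for the imperfect knowledge-based operator $\mathcal{M}$; all helper names are ours):

```python
def hybrid_step(A, W_in, W_out, r, u_tilde, model_step):
    """One closed-loop step of the hybrid scheme, Eqs. (7), (2)-(3), and (9)."""
    m = model_step(u_tilde)              # knowledge-based forecast M[u-tilde(t)]
    r = np.tanh(A @ r + W_in @ u_tilde)  # reservoir update, Eq. (7)
    features = np.concatenate([readout_transform(r), m])
    return r, W_out @ features           # new reservoir state and hybrid forecast
```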

In addition to the hybrid configuration shown in Fig. 3, we have also tested a modified configuration in which there is an additional input to the reservoir component from the output of the knowledge-based model, $\mathcal{M}[\mathbf{v}(t)]$. We have empirically found that this modification sometimes yields a small improvement in prediction; however, for simplicity, we henceforth only consider the configuration in the figure.

II.4 Parallelization

Figure 4: Diagram of parallel reservoir computer setup.

To obtain a good prediction of a chaotic dynamical system state using reservoir computing, the reservoir dimensionality must be much greater than that of the dynamical system (i.e., $N \gg D$) so that there are enough free parameters available in $\mathbf{W}_{out}$ for fitting the reservoir output to the measured dynamical system state at time $t + \Delta t$. This can cause the computational cost of determining an optimal output matrix to become unfeasibly high for large dynamical systems, e.g., because implementation of this step by the method of Eq. (6) involves solving an $N \times N$ linear system. As a point of reference, we note that the dimension of the state vector of a typical current operational global weather forecasting model is on the order of $10^9$. A method to make consideration of such problems feasible for machine learning approaches was proposed in Ref. [parallelreservoir2018]. The idea is to exploit the short range of causal interactions over a small time period in many spatiotemporally chaotic dynamical systems. This was shown to allow the use of multiple relatively small reservoir computers, each of which makes predictions of part of the full dynamical system state in a local region, as illustrated in Fig. 4 and explained below. This method has the advantage that, in the training phase, all of the relatively small output matrices of the reservoir computers can be trained independently in parallel, thus greatly reducing the difficulty of training.

For illustrative purposes, consider a spatiotemporal dynamical system in one spatial dimension with periodic boundary conditions. Let the dynamical system state be represented by a $D$-dimensional vector time series $\mathbf{u}(t)$ in which each scalar component represents the time series at a single spatial grid point. We divide the system state into $P$ equally sized, contiguous regions, each containing $Q$ system variables, where $D = PQ$. We denote the system variables in these regions as $\mathbf{u}_i(t)$, where $1 \le i \le P$. Each local region $i$ in space is predicted by a reservoir $R_i$, each of which has $N$ internal reservoir states and an adjacency matrix $\mathbf{A}_i$ generated via the process described in Sec. II.2. Each reservoir is coupled to its input, $\mathbf{v}_i(t)$, via a matrix of input weights, $\mathbf{W}_{in,i}$. This input corresponds to a local region of the system that contains the region to be predicted by that reservoir as well as a size-$\ell$ overlap region on either side. We denote the dynamical system state in this input region by the $(Q + 2\ell)$-dimensional vector $\mathbf{h}_i(t)$. This overlap accounts for the short-range causal interactions across the boundaries of the local regions. The assumption here is that, over the incremental prediction time $\Delta t$, state information does not propagate fast enough for grid points outside the input region of reservoir $R_i$ to influence the time change in the dynamical system states predicted by reservoir $R_i$.
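For a periodic one-dimensional grid, assembling these overlapping local input regions is a simple slicing operation; a sketch (the helper name is ours):

```python
def local_input_regions(u_global, P, Q, ell):
    """Split a periodic global state into P regions of Q variables,
    each padded with ell overlap points from its periodic neighbors."""
    return [np.take(u_global, np.arange(i * Q - ell, (i + 1) * Q + ell),
                    mode="wrap")           # 'wrap' enforces periodic boundaries
            for i in range(P)]
```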

Each reservoir state is advanced using the following equation,

$$\mathbf{r}_i(t+\Delta t) = \tanh\left[\mathbf{A}_i\,\mathbf{r}_i(t) + \mathbf{W}_{in,i}\,\mathbf{v}_i(t)\right]. \quad (10)$$

During the training phase (Fig. 4 with the dashed output-to-input connection absent),

$$\mathbf{v}_i(t) = \mathbf{h}_i(t) + \mathbf{s}_i(t). \quad (11)$$

Here, the $(Q+2\ell)$-dimensional vector $\mathbf{s}_i(t)$ is the $i$th local input region of a global $D$-dimensional vector of normally distributed random variables, $\mathbf{s}(t)$, chosen independently at each time $t$.

After a suitably long transient nullification period, we determine the output matrices $\mathbf{W}_{out,i}$ for each reservoir that solve the least-squares optimization problem using ridge regression,

$$\min_{\mathbf{W}_{out,i}} \sum_{n} \left\|\mathbf{W}_{out,i}\,\tilde{\mathbf{r}}_i(n\Delta t) - \mathbf{u}_i(n\Delta t)\right\|^2 + \alpha\left\|\mathbf{W}_{out,i}\right\|^2. \quad (12)$$

Note that, for each $i$, the matrix $\mathbf{W}_{out,i}$ can be relatively small, as the number of outputs is the number of state variables in region $i$ (not the entire global state). Furthermore, the determinations of the relatively small matrices $\mathbf{W}_{out,i}$ are independent for each region $i$, and thus can be computed in parallel. The “direct matrix method” solution for determining each of the matrices $\mathbf{W}_{out,i}$ proceeds as follows. First, we rewrite Eq. (12) as

$$\min_{\mathbf{W}_{out,i}} \left\|\mathbf{W}_{out,i}\,\tilde{\mathbf{R}}_i - \mathbf{U}_i\right\|^2 + \alpha\left\|\mathbf{W}_{out,i}\right\|^2, \quad (13)$$

where, in Eq. (13), $\mathbf{U}_i$ and $\tilde{\mathbf{R}}_i$ are analogous to $\mathbf{U}$ and $\tilde{\mathbf{R}}$ in the single-reservoir prediction (see Eq. (6)). $\mathbf{U}_i$ is the target matrix such that its $n$th column is $\mathbf{u}_i(n\Delta t)$, while $\tilde{\mathbf{R}}_i$ is obtained from $\mathbf{R}_i$ analogously to the single-reservoir case as described in Eq. (2). Each $\mathbf{W}_{out,i}$ is calculated by solving the following equation,

$$\mathbf{W}_{out,i}\left(\tilde{\mathbf{R}}_i\,\tilde{\mathbf{R}}_i^T + \alpha\mathbf{I}\right) = \mathbf{U}_i\,\tilde{\mathbf{R}}_i^T. \quad (14)$$

Note that, as previously claimed, the minimization problem, Eq. (13), is completely independent for each $i$ and can be solved for the different regions in parallel. As in Sec. II.2, in the prediction phase and after a period of synchronization, we produce a full state prediction by running the system in a closed-loop feedback configuration (i.e., Fig. 4 with the dashed output-to-input feedback connection present). This is done by concatenating the local predictions $\tilde{\mathbf{u}}_i$ from each reservoir (where $\tilde{\mathbf{u}}_i$ is the output state from reservoir $R_i$). Reservoir $R_i$ then receives inputs from its own outputs in addition to the left and right overlap-zone inputs from the $\ell$ grid points to its left and the $\ell$ grid points to its right. The entire system thus evolves as follows,

$$\tilde{\mathbf{u}}_i(t+\Delta t) = \mathbf{W}_{out,i}\,\tilde{\mathbf{r}}_i(t+\Delta t), \quad (15)$$
$$\mathbf{v}_i(t+\Delta t) = \tilde{\mathbf{h}}_i(t+\Delta t), \quad (16)$$
$$\mathbf{r}_i(t+2\Delta t) = \tanh\left[\mathbf{A}_i\,\mathbf{r}_i(t+\Delta t) + \mathbf{W}_{in,i}\,\mathbf{v}_i(t+\Delta t)\right], \quad (17)$$

where $\tilde{\mathbf{h}}_i(t+\Delta t)$ denotes the portion of the concatenated global prediction $\tilde{\mathbf{u}}(t+\Delta t)$ lying in the input region of reservoir $R_i$, and $\tilde{\mathbf{r}}_i$ is obtained from $\mathbf{r}_i$ using Eq. (2).
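A sketch of one closed-loop step of this parallel scheme (reservoir objects with fields `A`, `W_in`, `W_out`, and `r` are our own illustrative structure, not the paper's):

```python
def parallel_predict_step(reservoirs, u_global, Q, ell):
    """One closed-loop step of the parallel scheme, Eqs. (15)-(17)."""
    P = len(reservoirs)
    inputs = local_input_regions(u_global, P, Q, ell)  # overlap regions, Eq. (16)
    local_predictions = []
    for res, v in zip(reservoirs, inputs):
        res.r = np.tanh(res.A @ res.r + res.W_in @ v)  # Eq. (17)
        local_predictions.append(res.W_out @ readout_transform(res.r))  # Eq. (15)
    return np.concatenate(local_predictions)           # concatenated global forecast
```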

II.5 Combined Hybrid/Parallel Prediction (CHyPP)

Figure 5: Diagram of the Combined Hybrid/Parallel Prediction (CHyPP) architecture using reservoir computing.

In our previous work [hybridreservoir2018], we constructed a hybrid prediction method using a single reservoir. In this section, we consider a combination of the parallel approach of Sec. II.4 (which enables computationally efficient scaling to high-dimensional spatiotemporal chaos) and the hybrid approach of Sec. II.3 (which allows us to utilize partial prior knowledge of the dynamical system), where we assume that the knowledge-based system provides global predictions over the entire spatial domain. While the approach is easily generalized to two- and three-dimensional spatial processes, in order to most simply demonstrate our proposed methodology, we again consider a one-dimensional, spatiotemporally chaotic dynamical system with periodic boundary conditions, with a state represented by a $D$-dimensional vector time series $\mathbf{u}(t)$. Our approximate knowledge-based prediction operator $\mathcal{M}$ gives a global prediction of the full system state for a time $\Delta t$ later: $\mathcal{M}[\mathbf{u}(t)]$. As in our parallel reservoir computer prediction described in Sec. II.4, we partition the system state into $P$ equally sized, contiguous regions containing $Q$ variables, where $D = PQ$, and each such region is predicted by a reservoir $R_i$, $1 \le i \le P$.

Each reservoir input is coupled to a local region of the system state as in Sec. II.4, and the reservoir state is advanced using the following equation,

$$\mathbf{r}_i(t+\Delta t) = \tanh\left[\mathbf{A}_i\,\mathbf{r}_i(t) + \mathbf{W}_{in,i}\,\mathbf{v}_i(t)\right]. \quad (18)$$

During the initial training phase, $\mathbf{v}_i(t) = \mathbf{h}_i(t) + \mathbf{s}_i(t)$. In Eq. (18), $\mathbf{W}_{in,i}$ is the input coupling matrix for the local system states, analogous to that described in Sec. II.2. As in Sec. II.4, $\mathbf{h}_i(t)$ contains the state measurements at grid points within the local region to be predicted along with $\ell$ grid points to either side, and $\mathbf{s}_i(t)$ is the local region of the global vector of normally distributed random numbers $\mathbf{s}(t)$. Each reservoir is trained independently in parallel using a set of training data consisting of an equally spaced time series of measured states of the large-scale dynamics beginning at $t = 0$ after some initial transient nullification period. Again, we solve the least-squares optimization problem with ridge regression to determine an output mapping $\mathbf{W}_{out,i}$ for each reservoir (analogous to Eq. (13)),

$$\min_{\mathbf{W}_{out,i}} \left\|\mathbf{W}_{out,i}\begin{bmatrix}\tilde{\mathbf{R}}_i\\ \mathbf{M}_i\end{bmatrix} - \mathbf{U}_i\right\|^2 + \alpha\left\|\mathbf{W}_{out,i}\right\|^2. \quad (19)$$

In Eq. (19), $\mathbf{M}_i$ is a matrix whose $n$th column is $\mathcal{M}_i[\mathbf{v}((n-1)\Delta t)]$, where $\mathcal{M}_i$ is the knowledge-based prediction of the local region $i$ of the system,

$$\mathcal{M}_i[\mathbf{u}(t)] = \left\{\mathcal{M}[\mathbf{u}(t)]\right\}_i, \quad (20)$$

i.e., the restriction of the global knowledge-based forecast to region $i$.

The solution to Eq. (19) by the direct matrix method is

$$\mathbf{W}_{out,i}\left(\mathbf{Z}_i\,\mathbf{Z}_i^T + \alpha\mathbf{I}\right) = \mathbf{U}_i\,\mathbf{Z}_i^T, \quad \text{where } \mathbf{Z}_i \equiv \begin{bmatrix}\tilde{\mathbf{R}}_i\\ \mathbf{M}_i\end{bmatrix}. \quad (21)$$
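A sketch of this per-region training step under our notation assumptions (`R_tilde`, `M`, and `U` hold, as columns, the transformed reservoir states, local model forecasts, and local training targets over the training period):

```python
def train_chypp_readout(R_tilde, M, U, alpha):
    """Per-region ridge regression on stacked features, Eq. (21)."""
    Z = np.vstack([R_tilde, M])               # stack reservoir and model features
    G = Z @ Z.T + alpha * np.eye(Z.shape[0])  # regularized Gram matrix
    return np.linalg.solve(G, Z @ U.T).T      # W_out,i = U_i Z_i^T G^{-1}
```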

In the prediction phase, we run the system in a closed-loop configuration according to Eqs. (22)–(25) to predict each local region of the system given each reservoir state and the knowledge-based model prediction, obtaining $\tilde{\mathbf{u}}_i(t+\Delta t)$. The local predictions are appropriately concatenated to form a full state prediction, which is then used as input for the next prediction step, as follows,

$$\tilde{\mathbf{u}}_i(t+\Delta t) = \mathbf{W}_{out,i}\begin{bmatrix}\tilde{\mathbf{r}}_i(t+\Delta t)\\ \mathcal{M}_i[\tilde{\mathbf{u}}(t)]\end{bmatrix}, \quad (22)$$
$$\tilde{\mathbf{u}}(t+\Delta t) = \left[\tilde{\mathbf{u}}_1(t+\Delta t)^T, \ldots, \tilde{\mathbf{u}}_P(t+\Delta t)^T\right]^T, \quad (23)$$
$$\mathbf{v}_i(t+\Delta t) = \tilde{\mathbf{h}}_i(t+\Delta t), \quad (24)$$
$$\mathbf{r}_i(t+2\Delta t) = \tanh\left[\mathbf{A}_i\,\mathbf{r}_i(t+\Delta t) + \mathbf{W}_{in,i}\,\mathbf{v}_i(t+\Delta t)\right]. \quad (25)$$

In the above equations, we once again calculate $\tilde{\mathbf{r}}_i$ from $\mathbf{r}_i$ using Eq. (2).
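Combining the pieces, a sketch of one closed-loop CHyPP step (again with our illustrative reservoir structure; `model_step` stands in for the global knowledge-based operator $\mathcal{M}$):

```python
def chypp_predict_step(reservoirs, model_step, u_global, Q, ell):
    """One closed-loop CHyPP step, Eqs. (22)-(25)."""
    P = len(reservoirs)
    m_global = model_step(u_global)        # one global knowledge-based forecast
    m_local = np.split(m_global, P)        # local model forecasts, Eq. (20)
    inputs = local_input_regions(u_global, P, Q, ell)
    local_predictions = []
    for res, v, m in zip(reservoirs, inputs, m_local):
        res.r = np.tanh(res.A @ res.r + res.W_in @ v)          # Eq. (25)
        features = np.concatenate([readout_transform(res.r), m])
        local_predictions.append(res.W_out @ features)         # Eq. (22)
    return np.concatenate(local_predictions)                   # Eq. (23)
```

Note that a single global model forecast serves all of the reservoirs, so the knowledge-based model is run only once per prediction step regardless of the number of parallel reservoirs.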

III Test Results on the Kuramoto-Sivashinsky Equation

We test the effectiveness of our proposed CHyPP method (Sec. II.5) by forecasting the state evolution of the Kuramoto-Sivashinsky equation with periodic boundary conditions [kuramoto1978, sivashinsky1977],

$$y_t = -y\,y_x - y_{xx} - y_{xxxx}, \quad (26)$$

where $y = y(x,t)$ and $y(x+L,t) = y(x,t)$. This spatially one-dimensional model generally produces spatiotemporally chaotic dynamics for sufficiently large periodicity length $L$. For the purpose of comparing various methods, we regard Eq. (26) as generating the state measurements of a putative system that we are interested in, while the imperfect prediction model we use is a modified version of this same equation, in which an error term $\epsilon$ is introduced in the coefficient of the second-derivative term,

$$y_t = -y\,y_x - (1+\epsilon)\,y_{xx} - y_{xxxx}. \quad (27)$$

We have also investigated the case where the error is introduced by multiplying one of the other terms in Eq. (26) by $(1+\epsilon)$; for that case, the results of our method are qualitatively similar to those for Eq. (27). We form our simulated measured time series by taking the $j$th element of $\mathbf{u}(t)$ to be $y(j\Delta x, t)$, where $\Delta x$ is the grid spacing used for our numerical solutions of Eq. (26). As a metric for how long a prediction is valid, we calculate the normalized root-mean-square error (NRMSE) between the true and predicted system states. We define the length of valid prediction, or “valid time”, to be the time at which the NRMSE first exceeds 0.2. Since the NRMSE saturates at $\sqrt{2}$, we consider this to be the point at which the error in the prediction reaches about 15% of its saturation value. In Sec. III.1, we demonstrate that our CHyPP methodology can scale to predict very large systems. In Sec. III.2, we show that the CHyPP method requires significantly less training data than the parallel scheme of Sec. II.4 (using reservoirs without a knowledge-based component). We then discuss, in Sec. III.3, the sensitivity of the CHyPP and parallel machine learning (Sec. II.4) methods to the local overlap length $\ell$.
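A sketch of this validity metric (the normalization convention is our assumption):

```python
def valid_time(u_true, u_pred, dt, threshold=0.2):
    """Time until the spatially averaged NRMSE first exceeds the threshold."""
    err = np.sqrt(np.mean((u_true - u_pred) ** 2, axis=1))  # RMSE at each time
    nrmse = err / np.sqrt(np.mean(u_true ** 2))             # normalize by RMS of truth
    exceeded = np.nonzero(nrmse > threshold)[0]
    return (exceeded[0] if exceeded.size else len(nrmse)) * dt
```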

For all of our numerical experiments in this paper, in addition to our solution of the imperfect model, we use a digital computer to implement the machine learning. However, if, in the future, CHyPP is applied to very large systems (e.g., weather forecasting) requiring a much larger number of parallel machine learning units, then we envision that it may prove useful to perform the parallel machine learning using a special-purpose, physically implemented reservoir computing array, e.g., based on FPGAs or photonic devices [penkovsky_efficient_2018, tanaka_recent_2019, vinckier_high-performance_2015]. Such implementations, called “AI hardware accelerators”, show great promise with respect to low cost, speed, and compactness.

III.1 Prediction Scalability

$\langle d \rangle$ (average in-degree): 3
$\rho$ (spectral radius): 0.6
$\ell$ (local overlap length): 6
$\sigma$ (input coupling strength): 0.1
$\Delta t$ (prediction time step): 0.25
$\alpha$ (regularization): —
$\xi$ (synchronization time): 25
$N$ (reservoir(s)-only tests): —
$N$ (CHyPP tests): —
Table 1: Hyperparameters
Figure 6: Average normalized root-mean-square error (NRMSE) of the CHyPP prediction, where we have fixed the number of system variables predicted in each region, $Q$, and the reservoir spatial density, $P/L$. Predictions are made using an imperfect model with second-derivative error $\epsilon$. The results shown are the average of 100 predictions, where each prediction is made using the same set of reservoirs and the same training data set but where the CHyPP method is synchronized to different initial conditions. We see that the NRMSE curves are relatively invariant as $L$ increases, indicating that our method can be scaled to large systems.

We first test the ability of our CHyPP method to scale to large system sizes. We consider the Kuramoto-Sivashinsky equation where we fix the number of system variables (grid points) each reservoir is trained to predict, $Q$, as well as the reservoir spatial density, $P/L$ ($P$ is the number of reservoirs), while varying the periodicity length $L$. For all tests in this section, we use the hyperparameters in Table 1 unless otherwise specified. We additionally fix the error $\epsilon$ in the incorrect model. Fig. 6 shows the resulting NRMSE between the true state and our hybrid parallel prediction, averaged over 100 prediction periods, versus time. For each value of $L$ plotted in Fig. 6, the density of the parallel reservoir computers, $P/L$, is kept constant. Additionally, the time plotted horizontally is in units of the Lyapunov time (the average chaos-induced e-folding time of small errors in the predicted state orbit). The NRMSE is relatively unchanged as the value of $L$ is increased, indicating that the CHyPP prediction, like the parallel reservoir-only prediction in [parallelreservoir2018], can be scaled to very large systems by the addition of more reservoirs.

Finally, we note that our test system, the Kuramoto-Sivashinsky equation, is homogeneous in space, while the real systems we are interested in are generally spatially inhomogeneous. For example, weather forecasting accounts for geographic features (continents, mountains, etc.), as well as for latitudinal variation of solar input, among other spatially inhomogeneous factors. In order to ensure that our numerical tests with a homogeneous model are also relevant for a typical inhomogeneous situation, we have not made any use of the homogeneity of Eq. (26): the adjacency matrices $\mathbf{A}_i$ corresponding to each reservoir are independently randomly generated, and each $\mathbf{W}_{out,i}$ is determined separately (rather than taking all $\mathbf{A}_i$ and $\mathbf{W}_{out,i}$ to be the same, as would be possible for a homogeneous system).

III.2 CHyPP Promotes the Possibility of Good Performance Using a Relatively Small Duration of Training Data

Figure 7: (A) The true dynamics of the Kuramoto-Sivashinsky equation obtained by numerical evolution of Eq. (26), normalized to have mean 0 and variance 1. We plot the spatial grid point along the vertical axis and the Lyapunov time on the horizontal axis. The Lyapunov time is defined by $1/\Lambda_{max}$, where $\Lambda_{max}$ is the largest Lyapunov exponent computed from Eq. (26). Predictions of this evolution by four different methods are shown in panels (B), (C), (D), and (E). The prediction in (B) is made by a single reservoir computer with 16,000 nodes; the prediction in (C) is made by 16 reservoir computers in parallel, each with 4,000 nodes; the prediction in (D) is made by a hybrid with a single reservoir computer with 16,000 nodes; and the prediction in (E) is made by CHyPP with 16 reservoir computers in parallel, each with 4,000 nodes, and an imperfect model with second-derivative error $\epsilon$. The plots on the right, (b), (c), (d), and (e), display the difference between the true dynamics (A) and the prediction using each of the corresponding techniques. The vertical black line marks the valid prediction time. Each technique was trained using a 3,375 Lyapunov time sequence of training data.

The prediction results shown in Figs. 7 and 8 use the parameters contained in Table 1. In addition, the number of reservoirs and the length of the training data are specified in the figure captions. The imperfect model has an error of $\epsilon$ and a small valid prediction time of only 0.48 Lyapunov times. Each plotted valid time is averaged over 100 predictions that are generated using the same set of reservoirs with the same training data sequence but that are synchronized to different initial conditions. Figure 7 displays a set of example predictions of one of these test time series. Figures 7 and 8 both demonstrate that CHyPP yields significantly longer valid predictions than the parallel reservoir-only method, which (for our choice of $\epsilon$ and reservoir size) produces longer valid predictions than the imperfect model alone. Both the parallel hybrid and parallel reservoir-only predictions outperform the imperfect-model-only approach. From Fig. 8, we also observe that the CHyPP valid time saturates (i.e., reaches a valid prediction time that increases only negligibly with the addition of more training data) at a much shorter length of training data than the parallel reservoir-only method. As a result, the length of training data beyond which we would not benefit from increasing the size of the reservoir is also much shorter for the CHyPP prediction. For example, consider the plots in (b) and (f) in Fig. 8, where each prediction uses 4 reservoirs. If we have a 500 Lyapunov time length of training data, we would not benefit from increasing the number of nodes per reservoir above 2000 in the reservoirs-only prediction, but could obtain a significant performance improvement by doing so using CHyPP. CHyPP exhibits its most impressive performance for very short lengths of training data. With only 22.5 Lyapunov times worth of training data, the parallel reservoir-only method is able to predict for only 0.44 Lyapunov times, whereas CHyPP with 16 reservoirs of 2000 nodes each predicts for 3.35 Lyapunov times on average, matching the saturated single reservoir-only prediction that is trained using 150 times more training data.

Figure 8: We plot the average valid time as a function of the duration of the training data, each in Lyapunov time, for the parallel reservoir(s)-only method using (a) 1 reservoir, (b) 4 reservoirs, (c) 8 reservoirs, and (d) 16 reservoirs. We plot the same quantity obtained from the parallel hybrid method using (e) 1 reservoir, (f) 4 reservoirs, (g) 8 reservoirs, and (h) 16 reservoirs. In each case, the imperfect knowledge-based predictor in the hybrid prediction had an error of $\epsilon$ in the second-derivative term, resulting in an average valid time for the model-only prediction of 0.48 of a Lyapunov time. A knowledge-based predictor with no error, begun using the most accurate initial condition from the parallel hybrid, was able to predict for an average of 11 Lyapunov times. The legend in each plot indicates the number of nodes used in each reservoir.

III.3 Prediction Quality Dependence on Local Overlap Length

In this section, we investigate the dependence of CHyPP performance on the local overlap length $\ell$. Figure 9 shows that when the local overlap length is zero or close to zero, the parallel reservoir-only prediction valid time (solid red line) is very poor, with valid predictions of only around 0.5 Lyapunov times. CHyPP, however, is still able to obtain good predictions with very little local overlap. This indicates that CHyPP is able to utilize the inaccurate model prediction to infer the propagation of dynamical influences between adjacent prediction regions. We note that neither CHyPP nor the 16-parallel-reservoirs-only prediction is able to improve upon a corresponding single, large-reservoir hybrid prediction or a corresponding single, large-reservoir-only prediction when the local overlap length is near 0. Regarding this comparison, however, it is important to recognize that in systems larger than the one we have used to test our method here, use of a single-reservoir prediction is not possible, because the reservoir size necessary for prediction becomes infeasible, as discussed in Sec. II. This result is nevertheless very important for large-scale implementations of the proposed method. In realistic systems with 3 spatial dimensions, a small increase in the local overlap length can lead to increasingly large memory requirements and increasingly large amounts of information that must be communicated between reservoirs at each prediction iteration. Being able to obtain a good prediction with a smaller local overlap length, when the propagating dynamics is adequately explained by the imperfect model, is thus another advantage of CHyPP.

Figure 9: We plot the valid time for the parallel and non-parallel hybrid and reservoir(s)-only predictions as a function of the local overlap length $\ell$.

IV Test Results on Multiscale System Prediction and the Problem of Subgrid-Scale Closure

A common difficulty in numerical modeling arises because many physical processes involve dynamics on multiple scales. As a result, fundamentally crucial subgrid scale dynamics is often only crudely captured in an ad-hoc manner. The formulation of subgrid scale models by analytical techniques has been extensively studied (e.g., see Ref. [meneveau_scale-invariance_2000]) and is sometimes referred to as “subgrid-scale closure”. In this section, we show that our CHyPP methodology provides a very effective data-based (as opposed to analysis-based) approach to subgrid-scale closure.

In particular, we test our CHyPP prediction method on a dynamical system with multiple spatial and temporal scales. The particular system we choose to predict is the multiscale “toy” atmospheric model formulated by E. N. Lorenz in his 2005 paper [lorenz2005]. We refer to this model as Lorenz Model III. This model is a smoothed extension of Lorenz’s original “toy” atmospheric model described in his 1996 paper [lorenz1996predictability] (hereafter referred to as Lorenz Model I), with the addition of small-scale dynamical activity. Lorenz Model III describes the evolution of a single atmospheric variable, $Z$, on a one-dimensional grid of $N$ grid points with periodic boundary conditions, representing a single latitude circle. The value of $Z$ at grid point $n$, $Z_n$, evolves according to the following equation,

$$\frac{dZ_n}{dt} = [X,X]_{K,n} + b^2\,[Y,Y]_{1,n} + c\,[Y,X]_{1,n} - X_n - b\,Y_n + F. \quad (28)$$

In Eq. (28), $X_n$ is a smoothed version of $Z_n$, and $Y_n$ is the difference between $Z_n$ and $X_n$,

$$X_n = {\sum_{i=-I}^{I}}' \left(\alpha - \beta|i|\right) Z_{n+i}, \quad (29)$$
$$Y_n = Z_n - X_n. \quad (30)$$

Here, $I$ denotes the smoothing distance, and $\alpha$ and $\beta$ are constants that depend on $I$ [lorenz2005]. Thus, $X_n$ describes the large-spatial-scale, long-time-scale wave component of $Z_n$, while $Y_n$ describes the small-spatial-scale, short-time-scale wave component. The bracket $[A,B]_{K,n}$ indicates a coupling between the variables $A$ and $B$ [i.e., interaction within scales ($[X,X]$ or $[Y,Y]$) or between scales ($[Y,X]$)]. For $K$ odd, this coupling takes the form,

$$[A,B]_{K,n} = {\sum_{j=-J}^{J}}'\, {\sum_{i=-J}^{J}}'\, \frac{-A_{n-2K-i}\,B_{n-K-j} + A_{n-K+j-i}\,B_{n+K+j}}{K^2}. \quad (31)$$

Here, $J = (K-1)/2$, and each $\Sigma'$ reduces to a standard summation $\Sigma$. For $K$ even, $J = K/2$, and $\Sigma'$ denotes a modified summation in which the first and last summands are divided by 2. In Eq. (28), the parameters $K$, $b$, $c$, and $F$ describe the coupling distance of the system’s large-scale dynamics; the increase in rapidity and decrease in amplitude of the small-scale oscillations (relative to the large-scale dynamics); the degree of interaction between the large- and small-scale dynamics; and the overall forcing in the system, respectively. We note that when $K = 1$ and the small-scale component $Y_n$ vanishes (so that $Z_n = X_n$), this model reduces to Lorenz Model I.

For the purposes of testing our CHyPP method, we also introduce another model formulated by Lorenz, which we will refer to as Lorenz Model II [lorenz2005]. This model is equivalent to Lorenz Model III with no distinct small-scale wave component or smoothing present in the equation,

$$\frac{dZ_n}{dt} = [Z,Z]_{K,n} - Z_n + F. \quad (32)$$

Importantly for our tests, Lorenz notes that, for constant $F$, the dominant wavenumber in Lorenz Model II depends only on the ratio $N/K$.
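As an illustration, a sketch of the Model II tendency for odd $K$ (using the ordinary-summation case of Eq. (31); `np.roll` implements the periodic indexing, and the helper names are ours):

```python
import numpy as np

def smooth(A, K):
    """Running mean W_n = (1/K) * sum_{i=-J..J} A_{n-i}, with J = (K-1)/2."""
    J = (K - 1) // 2
    return sum(np.roll(A, i) for i in range(-J, J + 1)) / K

def bracket(A, B, K):
    """[A,B]_{K,n} of Eq. (31) for odd K, written via running means."""
    J = (K - 1) // 2
    WA, WB = smooth(A, K), smooth(B, K)
    s = sum(np.roll(WA, K - j) * np.roll(B, -K - j) for j in range(-J, J + 1)) / K
    return -np.roll(WA, 2 * K) * np.roll(WB, K) + s   # -W_{n-2K} W_{n-K} + sum term

def model2_tendency(Z, K, F):
    """Lorenz Model II, Eq. (32): dZ_n/dt = [Z,Z]_{K,n} - Z_n + F."""
    return bracket(Z, Z, K) - Z + F
```

As a consistency check, setting $K = 1$ reduces `bracket(Z, Z, 1)` to $(Z_{n+1} - Z_{n-2})\,Z_{n-1}$, the familiar Lorenz Model I (Lorenz-96) advection term.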

We test our CHyPP method by using it to predict the dynamics generated from Lorenz Model III, using Lorenz Model II as our imperfect knowledge-based predictor. As we discussed in Sec. I, knowledge-based models used to predict physical multiscale systems (e.g., the weather) often use simplified representations of subgrid-scale dynamics. To replicate this type of imperfection in the knowledge-based model in CHyPP, we realize the Lorenz Model II dynamics over an equal or smaller number of grid points while using the same ratio $N/K$ as used to generate the true Lorenz Model III dynamics. Our imperfect model thus incorrectly represents the effect of small-scale dynamics which, when the Model II grid is coarser than the Model III grid, are also subgrid.

$\langle d \rangle$ (average in-degree): 3
$\rho$ (spectral radius): 0.6
$\ell$ (local overlap length): $N/12$
$\sigma$ (input coupling strength): 1
$\Delta t$ (prediction time step): 0.005
$\alpha$ (regularization): —
$\xi$ (synchronization time): 0.5
Table 2: Hyperparameters

For the results that follow in this section, we use the parameters in Table 2. In addition, the values of the remaining hyperparameters used in each of the reservoir-computing-based prediction methods have been selected separately for each method so as to maximize the valid prediction time.

Figure 10: Plot of the true dynamics from Lorenz Model III using the parameter choices of Ref. [lorenz2005] ($N = 960$, $K = 32$, $I = 12$, $b = 10$, $c = 2.5$, and $F = 15$). The dynamics have been normalized to set the mean value to 0 and the variance to 1. We plot the value of $Z_n$, color-coded (color bar on the right), with the spatial grid point along the vertical axis and the model time on the horizontal axis.
Figure 11: We plot the predictions of the normalized true dynamics from Lorenz Model III (displayed in Fig. 10), where we have measured every spatial grid point of Model III during training. We plot the spatial grid point along the vertical axis and the model time on the horizontal axis. The plots on the left show the predictions made using each of the specified methods, while the plots on the right show the difference between the true dynamics and the prediction (i.e., the prediction error). The parallel reservoir-only prediction and CHyPP each used their own separately optimized hyperparameter values.
Figure 12: We plot the predictions of the normalized true dynamics from Lorenz Model III (displayed in Fig. 10), where, corresponding to the Model II grid, we have measured only every second spatial grid point of Model III during training. We plot the spatial grid point along the vertical axis and the model time on the horizontal axis. The plots on the left show the predictions made using each of the specified methods, while the plots on the right show the difference between the true dynamics and the prediction (i.e., the prediction error). The parallel reservoir-only prediction and CHyPP each used their own separately optimized hyperparameter values.
Figure 13: We plot the predictions of the normalized true dynamics from Lorenz Model III (displayed in Fig. 10), where, corresponding to the Model II grid, we have measured only every fourth spatial grid point of Model III during training. We plot the spatial grid point along the vertical axis and the model time on the horizontal axis. The plots on the left show the predictions made using each of the specified methods, while the plots on the right show the difference between the true dynamics and the prediction (i.e., the prediction error). The parallel reservoir-only prediction and CHyPP each used their own separately optimized hyperparameter values.
Figure 14: We plot the predictions of the normalized true dynamics from Lorenz Model III (displayed in Fig. 10), where, corresponding to the Model II grid, we have measured only every eighth spatial grid point of Model III during training. We plot the spatial grid point along the vertical axis and the model time on the horizontal axis. The plots on the left show the predictions made using each of the specified methods, while the plots on the right show the difference between the true dynamics and the prediction (i.e., the prediction error). The parallel reservoir-only prediction and CHyPP each used their own separately optimized hyperparameter values.

Figure 10 shows a solution of Lorenz Model III in which the value of $Z_n$ is color-coded. The spatial variable is plotted vertically, time is plotted horizontally, and the model parameters are given in the caption. As seen in Fig. 10, there is a wave-like motion with a dominant wavenumber of 7 (7 oscillations along the vertical periodicity length). This corresponds to Lorenz’s design of the model to mimic atmospheric dynamics, which has a predominant Rossby-wave wavenumber of this order as one goes around a mid-latitude circle. Figures 11–14 show predictions of the true Lorenz Model III dynamics (in Fig. 10) for grid resolutions of varying coarseness. In particular, while the truth (Fig. 10) is obtained from Model III on the full grid, the Model II grid used for Figs. 11, 12, 13, and 14 retains every grid point, every second grid point, every fourth grid point, and every eighth grid point, respectively. Note also that the measurements are taken from the Model III result only at the Model II grid points. For both the parallel reservoir-only predictions and CHyPP, we fix the ratio of the number of nodes per reservoir to the number of grid points each reservoir predicts.

We find that CHyPP significantly outperforms each of its component methods, and that this is true for all of the grid resolutions we tested. We note that, unlike in the Model II and parallel reservoir-only predictions, the prediction error in CHyPP seems to appear locally in a small region and manifests as slanted streaks in the error plots in Figs. 11–14. We have verified that the slant of these streaks corresponds to the group velocity of the dominant wave motion (wavenumber 7). For all predictions made, we again calculate a valid time of prediction; however, since in this case early error growth using CHyPP is local in space and affects only a small part of the prediction domain, we choose to use a higher valid-time error threshold of 0.85 (approximately 60% of the error saturation value) so that the valid-time metric reflects when error is present in the entire prediction domain. Figure 15 shows the valid time, averaged over 100 predictions from different initial conditions, versus training time for different grid resolutions and reservoir sizes. The dashed lines are parallel reservoir-only predictions, while the solid lines are CHyPP predictions. In this figure, the reservoir sizes are scaled in proportion to the number of Model II grid points used. We find that, while the quality of the scaled parallel reservoir-only prediction degrades significantly as the grid resolution decreases, the quality of the scaled CHyPP prediction degrades only slightly as the resolution is reduced from full resolution to one-eighth resolution.

Figure 15: We plot the average valid time for each prediction as a function of the length of training data used. Dashed lines display the results of parallel reservoir-only predictions, whereas solid lines display results from CHyPP, where the knowledge-based model is Lorenz Model II realized on the indicated number of grid points. Each prediction method uses 24 reservoirs.

The Model II valid time is essentially constant across resolutions, corresponding to the fact that all of the grid spacings tested are well below the characteristic Model II spatial scale. We see from Fig. 15 that the parallel reservoir-only prediction and the CHyPP prediction appear to reach valid-time saturation at about the same training data length for each grid resolution.

V Conclusion

In this paper, we address the general goal of utilizing machine learning to enable expanded capability in the forecasting of large, complex, spatiotemporally chaotic systems for which an imperfect knowledge-based model exists. Some typical common sources of imperfection in such a knowledge-based model are unresolved subgrid-scale processes and a lack of first-principles knowledge (or of the computational ability needed) for modeling some necessary aspect or aspects of the physics. The hope is that these “imperfections” can be compensated for by use of measured time series data and machine learning. The two main foreseeable difficulties in realizing this hope are how to effectively combine the machine learning component with the knowledge-based component in such a way that they mutually enhance each other, and how to promote feasible scaling of the machine learning requirements with respect to computational cost and the necessary amount of training data. Note that addressing the first of these issues necessarily lessens the difficulty of dealing with the second, since good use of any valid scientific knowledge of the system being forecast can potentially reduce the amount of learning required from the machine learning component.

To address these two issues, we propose a methodology (CHyPP) that combines two previously proposed techniques: (i) a hybrid utilization of an imperfect knowledge-based model with a single machine learning component [hybridreservoir2018, wan2018data], and (ii) a parallel machine learning scheme using many spatially distributed machine learning devices [parallelreservoir2018]. We note that (ii) applies to large spatial systems with the common attribute of what we have called ‘local short-time causal interactions’. Numerical tests of our proposed combination of (i) and (ii) are presented in Secs. III and IV and demonstrate good quality prediction that is scalable with size (Figs. 6 and 7), reduced required length of training data (Fig. 8), and ability to compensate for unresolved subgrid-scale processes (Figs. 10-15).

In this paper, our proof-of-principle problems have been one-dimensional in space with relatively simple mathematical formulations. We note, however, that the CHyPP methodology is readily applicable to higher dimensions and more complicated situations. For example, we are currently in the process of applying CHyPP to global atmospheric weather forecasting for which the atmospheric state is spatially three-dimensional and the system is strongly inhomogeneous, e.g., due to the presence of complex geographic features (including continents, mountains, and oceans), as well as the latitudinal variation of solar heating.

In conclusion, we have shown that data-assisted forecasting via parallel reservoir computing and an imperfect knowledge-based model can significantly improve prediction of a large spatiotemporally chaotic system over existing methods. This method is scalable to even larger systems, requires significantly less training data than previous methods to obtain high quality predictions, is able to effectively utilize a knowledge-based forecast of information propagation between local regions to improve prediction quality, and effectively provides a means of using data to compensate for unresolved subgrid scale processes (playing a role akin to traditional analysis-based closure schemes).

Acknowledgements

We thank Sarthak Chandra for his helpful discussion and comments. This work was supported by DARPA contract DARPA-PA-18-01(HR111890044).

References