1 Introduction
Natural systems, ranging from climate and ocean circulation to organisms and cells, involve complex dynamics extending over multiple spatiotemporal scales. Centuries-old efforts to comprehend and forecast the dynamics of such systems have spurred developments in large-scale simulations, dimensionality reduction techniques and a multitude of forecasting methods. The goals of understanding and prediction have complemented each other but have been hindered by the high dimensionality and chaotic behavior of these systems. In recent years we have observed a convergence of these approaches due to advances in computing power, algorithmic innovations and the ample availability of data. Major beneficiaries of this convergence are data-driven dimensionality reduction methods [7, 2, 3, 5, 6, 4, 8], model identification procedures [11, 10, 12, 13, 14, 15, 16, 9] and forecasting techniques [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] that aim to provide precise short-term predictions while capturing the long-term statistics of these systems. Successful forecasting methods address the highly nonlinear energy transfer mechanisms between modes that are not captured effectively by the dimensionality reduction methods.
The pioneering technique of analog forecasting proposed in [32] inspired widespread research in nonparametric prediction approaches. Two dynamical system states are called analogues if they resemble one another on the basis of a specific criterion. This class of methods uses a training set of historical observations of the system. The system evolution is predicted using the evolution of the closest analogue from the training set, corrected by an error term. This approach has led to promising results in practice [33], but the selection of the resemblance criterion used to pick the optimal analogue is far from straightforward. Moreover, the geometrical association between the current state and the training set is not exploited. More recently [34], analog forecasting has been performed using a weighted combination of data points based on a localized kernel that quantifies the similarity of the new point and the weighted combination. This technique exploits the local geometry instead of selecting a single optimal analogue. Similar kernel-based methods [35] use diffusion maps to globally parametrize a low-dimensional manifold capturing the slower time scales. Moreover, nontrivial interpolation schemes are investigated in order to encode the system dynamics in this reduced order space as well as to map them back to the full space (lifting). Although the geometrical structure of the data is taken into account, the solution of an eigensystem with a size proportional to the training data is required, rendering the approach computationally expensive. In addition, the inherent uncertainty due to sparse observations in certain regions of the attractor introduces prediction errors which cannot be modeled in a deterministic context. In [36] a method based on Gaussian process regression (GPR) [37] was proposed for prediction and uncertainty quantification in the reduced order space. The technique is based on a training set that sparsely samples the attractor. Stochastic predictions exploit the geometrical relationship between the current state and the training set, assuming a Gaussian prior over the modeled latent variables. A key advantage of GPR is that uncertainty bounds can be analytically derived from the hyperparameters of the framework. Moreover, in [36] a Mean Stochastic Model (MSM) is used for undersampled regions of the attractor to ensure accurate modeling of the steady state in the long-term regime. However, the resulting inference and training have a quadratic cost in terms of the number of data samples. Some of the earlier approaches to capture the evolution of time series in chaotic systems using recurrent neural networks were developed during the inception of Long Short-Term Memory networks (LSTMs) [38]. However, to the best of our knowledge, these methods have been applied only to low-dimensional chaotic systems [43].
Similarly, other machine learning techniques based on Multi-Layer Perceptrons (MLPs) [44] and Echo State Networks (ESNs) [45, 46, 47, 48] have been successful, albeit only for low order dynamical systems. Recent work in [49, 50] demonstrated promising results of ESNs for high-dimensional chaotic systems. In this paper, we propose LSTM-based methods that exploit information on the recent history of the reduced order state to predict the high-dimensional dynamics. Time series data are used to train the model, while no knowledge of the underlying system equations is required. Inspired by Takens' theorem [51], an embedding space is constructed using time-delayed versions of the reduced order variable. The proposed method tries to identify an approximate forecasting rule globally for the reduced order space. In contrast to GPR [36], the method has a deterministic output, while its training cost scales linearly with the number of training samples and its inference cost is independent of the training set size. Moreover, following [36], the LSTM is combined with an MSM to cope with attractor regions that are not captured in the training set. In attractor regions underrepresented in the training set, the MSM is used to guarantee convergence to the invariant measure and avoid an exponential growth of the prediction error. The effectiveness of the proposed hybrid method in accurate short-term prediction and in capturing the long-term behavior is shown on the Lorenz 96 system and the Kuramoto-Sivashinsky equation. Finally, the method is also tested on predictions of a prototypical climate model.
The structure of the paper is as follows: In Section 2 we explain how the LSTM can be employed for modeling and prediction of a reference dynamical system, and a blended LSTM-MSM technique is introduced. In Section 3 three other state-of-the-art methods, GPR, MSM and the hybrid GPR-MSM scheme, are presented and two comparison metrics are defined. The proposed LSTM technique and its LSTM-MSM extension are benchmarked against GPR and GPR-MSM on three complex chaotic systems in Section 4. In Section 5 we discuss the computational complexity of training and inference in the LSTM. Finally, Section 6 offers a summary and discusses future research directions.
2 Long Short-Term Memory (LSTM) Recurrent Neural Networks
The LSTM was introduced in order to regularize the training of recurrent neural networks (RNNs) [38]. RNNs contain loops that allow information to be passed between consecutive temporal steps (see Figure 1) and can be expressed as:

(1) h_t = σ_h(W_{ih} x_t + W_{hh} h_{t-1} + b_h)
(2) o_t = σ_o(W_{ho} h_t + b_o)

where x_t, o_t and h_t are the input, the output and the hidden state of the RNN at time step t. The weight matrices are W_{ih} (input-to-hidden), W_{hh} (hidden-to-hidden) and W_{ho} (hidden-to-output). Moreover, σ_h and σ_o are the hidden and output activation functions, while b_h and b_o are the respective biases. Temporal dependencies are captured by the hidden-to-hidden weight matrix W_{hh}, which couples two consecutive hidden states together. A schematic of the RNN architecture is given in Figure 1. In many practical applications, RNNs suffer from the vanishing (or exploding) gradient problem and have failed to capture long-term dependencies
[52, 53]. Today RNNs owe their renaissance largely to the LSTM, which copes effectively with the aforementioned problem using gates. The LSTM has been successfully applied in sequence modeling [41], speech recognition [39], handwriting recognition [40] and language translation [42]. The equations of the LSTM are
(3) f_t = σ_f(W_f [h_{t-1}, x_t] + b_f)
    i_t = σ_i(W_i [h_{t-1}, x_t] + b_i)
    C̃_t = tanh(W_C [h_{t-1}, x_t] + b_C)
    C_t = f_t ∘ C_{t-1} + i_t ∘ C̃_t
    o_t = σ_o(W_o [h_{t-1}, x_t] + b_o)
    h_t = o_t ∘ tanh(C_t)

where f_t, i_t and o_t are the gate signals (forget, input and output gates), x_t is the input, h_t is the hidden state and C_t is the cell state, while W_f, W_i, W_C, W_o and b_f, b_i, b_C, b_O are weight matrices and biases, [h_{t-1}, x_t] denotes the concatenation of the previous hidden state with the current input, and ∘ is the element-wise product. The activation functions σ_f, σ_i and σ_o are sigmoids. For a more detailed explanation of the LSTM architecture refer to [38]. In the following we refer to the LSTM hidden and cell states (h_t and C_t) jointly as the LSTM states. The dimension h of these states is called the number of hidden units and it controls the capability of the cell to encode history information. In practice we want the output to have a specific dimension. For this reason, a fully connected final layer without activation function is added
(4) y_t = W_{yh} h_t + b_y

where W_{yh} and b_y are the weights and bias of the final layer. All parameters (weights and biases) are encoded in w, and f^w is the LSTM cell function that maps the previous LSTM states and the current input to the output, y_t = f^w(h_{t-1}, C_{t-1}, x_t). By unfolding the LSTM τ time steps in the past and ignoring dependencies longer than τ we get

(5) y_t ≈ F^w(x_{t-τ+1}, …, x_t)

where F^w represents the iterative application of f^w and the computation of the LSTM states for τ time steps, starting from zero initial states. For a more detailed explanation of the formula for F^w and a figure of the neural network architecture refer to the Appendix.
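The cell equations (3) and the unfolded map (5) can be made concrete with a short sketch. The following numpy implementation is illustrative only: the weight layout (one matrix per gate acting on the concatenated state and input) and the function names are our assumptions, not the exact architecture or parameters used in the experiments.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, C_prev, W, b):
    """One application of the LSTM cell: gates, cell update, hidden state."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate
    i = sigmoid(W["i"] @ z + b["i"])          # input gate
    o = sigmoid(W["o"] @ z + b["o"])          # output gate
    C_tilde = np.tanh(W["C"] @ z + b["C"])    # candidate cell state
    C = f * C_prev + i * C_tilde              # new cell state
    h = o * np.tanh(C)                        # new hidden state
    return h, C

def lstm_forward(xs, W, b, W_out, b_out, n_hidden):
    """Unfolded LSTM: zero initial states, tau = len(xs) steps,
    followed by a linear output layer without activation."""
    h = np.zeros(n_hidden)
    C = np.zeros(n_hidden)
    for x in xs:
        h, C = lstm_step(x, h, C, W, b)
    return W_out @ h + b_out
```

A trained implementation would of course learn the weight matrices by backpropagation; this sketch only mirrors the algebra of the cell.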
In this work, we consider the reduced order problem, where the state of a dynamical system is projected to the reduced order space, yielding the reduced order state z_t. The system is considered to be autonomous, while ż_t is the reduced state derivative at time step t. Following [43], the LSTM is trained on time series data from the system in the reduced order space to predict the reduced state derivative ż_t from a short history of the reduced order state consisting of τ past, temporally consecutive states. In this work, we approximate the derivative from the original time series using first-order forward differences. The loss that has to be minimized is defined as

(6) L(w) = Σ_t ‖ ż_t − F^w(z_{t−τ+1}, …, z_t) ‖²

The short-term history of the first τ − 1 states is not available, and the forward difference at the last state would require the next sample, which is why in total we have N − τ training samples from a time series with N samples. During training, the weights w of the LSTM are optimized according to (6). The parameter τ is denoted as the truncation layer, and time dependencies longer than τ
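As a sketch of the data preparation implied by the loss above, the snippet below builds pairs of input histories and forward-difference derivative targets from a reduced order time series; the array shapes and names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def build_training_samples(z, dt, tau):
    """Build (history, derivative-target) pairs from a reduced-order time
    series z of shape (N, d). Derivatives are approximated with first-order
    forward differences; the first tau-1 states lack a full history and the
    last state lacks a forward difference, leaving N - tau samples."""
    zdot = (z[1:] - z[:-1]) / dt               # forward-difference derivatives
    inputs, targets = [], []
    for t in range(tau - 1, len(z) - 1):
        inputs.append(z[t - tau + 1 : t + 1])  # tau consecutive states
        targets.append(zdot[t])                # derivative at the last state
    return np.array(inputs), np.array(targets)

def mse_loss(pred, target):
    """Mean squared error over all samples."""
    return np.mean(np.sum((pred - target) ** 2, axis=-1))
```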
Training of this model is performed using backpropagation through time, truncated at layer τ, and mini-batch optimization with the Adam method [1] with an adaptive learning rate. The LSTM weights are initialized using the method of Xavier [69]. Training is stopped when convergence of the training error is detected or the maximum number of epochs is reached. During training, the loss of the model is evaluated on a separate validation dataset to avoid overfitting. The training procedure is explained in detail in the Appendix.
An important issue is how to select the hidden state dimension h and how to initialize the LSTM states at the truncation layer τ. A small h reduces the expressive capabilities of the LSTM and deteriorates inference performance. On the other hand, a big h is more sensitive to overfitting and raises the computational cost of training. For this reason, h has to be tuned depending on the observed training behavior. In this work, we performed a grid search and selected the optimal h for each application. For the truncation layer τ, there are two alternatives, namely stateless and stateful LSTM. In the stateless LSTM, the LSTM states at layer τ are initialized to zero as in equation (5). As a consequence, the LSTM can only capture dependencies up to τ previous time steps. In the second variant, the stateful LSTM, the state is always propagated for τ time steps in the future and then reinitialized to zero, to help the LSTM capture longer dependencies. In this work, the systems considered exhibit chaotic behavior and the dependencies are inherently short-term, as states at two time instants that differ significantly can be considered statistically independent. For this reason, the short temporal dependencies can be captured without propagating the hidden state over a long horizon. As a consequence, we consider only the stateless variant. We also applied the stateful LSTM without any significant improvement, so we omit the results for brevity. The trained LSTM model can be used to iteratively predict the system dynamics, as illustrated in Figure 2. This is a solely data-driven approach and no explicit information regarding the form of the underlying equations is required.
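The iterative prediction procedure can be sketched as follows. The `predict_derivative` callback stands for any trained model mapping a window of τ states to a derivative, and the forward-Euler integration of the predicted derivative is a simplifying assumption (any explicit integrator could be substituted).

```python
import numpy as np

def iterative_forecast(z_history, predict_derivative, dt, n_steps):
    """Iteratively predict the dynamics: the model maps the last tau states
    to a derivative, which is integrated one step forward; the oldest state
    in the history window is then discarded."""
    window = list(z_history)                  # tau most recent reduced states
    trajectory = []
    for _ in range(n_steps):
        zdot = predict_derivative(np.array(window))
        z_next = window[-1] + dt * zdot       # forward-Euler integration
        trajectory.append(z_next)
        window = window[1:] + [z_next]        # slide the history window
    return np.array(trajectory)
```

With a linear decay model `lambda w: -w[-1]` the forecast contracts toward zero, which is a convenient sanity check.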
2.1 Mean Stochastic Model (MSM) and Hybrid LSTM-MSM
The MSM is a powerful data-driven method used to quantify uncertainty and perform forecasts in turbulent systems with high intrinsic attractor dimensionality [54, 36]. It is parameterized a priori to capture global statistical information of the attractor by design, while its computational complexity is very low compared to LSTM or GPR. The concept behind the MSM is to model each component of the state independently with an Ornstein-Uhlenbeck (OU) process that captures the energy spectrum and the damping time scales of the statistical equilibrium. The process takes the following form
(7) dz_i = −d_i z_i dt + σ_i dW_i

where d_i and σ_i are parameters fitted to the centered training data and W_i is a Wiener process. In the statistical steady state the mean, energy and damping time scale of the process are given by

(8) z̄_i = 0,   E_i = E[z_i z_i*] = σ_i² / (2 d_i),   T_i = 1 / d_i

where z_i* denotes the complex conjugate of z_i. In order to fit the model parameters d_i and σ_i, we directly estimate the variance E_i from the time series training data and the decorrelation time T_i using

(9) T_i = (1 / E_i) ∫₀^∞ E[z_i(t) z_i*(t + s)] ds
After computing these two quantities we replace them in (8) and solve with respect to d_i and σ_i. Since the MSM is modeled a priori to mimic the global statistical behavior of the attractor, forecasts made with the MSM can never escape it. This is not the case with LSTM and GPR: although short-term predictions are accurate, prediction errors accumulate and iterative forecasts escape the attractor due to the chaotic dynamics. This problem has been addressed with respect to GPR in [36]. In order to cope effectively with it, we introduce a hybrid LSTM-MSM technique that prevents forecasts from diverging from the attractor.
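A minimal sketch of the MSM fitting and simulation follows, under the simplifying assumption of a real-valued scalar component. The parameters follow from the steady-state relations E = σ²/(2d) and T = 1/d, and the process is simulated with the Euler-Maruyama scheme; all names are illustrative.

```python
import numpy as np

def fit_msm(E, T):
    """Fit the OU parameters from the estimated variance E and
    decorrelation time T using E = sigma^2 / (2 d) and T = 1 / d."""
    d = 1.0 / T
    sigma = np.sqrt(2.0 * E / T)
    return d, sigma

def simulate_msm(z0, d, sigma, dt, n_steps, rng):
    """Euler-Maruyama simulation of the real-valued OU process
    dz = -d z dt + sigma dW."""
    z = np.empty(n_steps + 1)
    z[0] = z0
    for n in range(n_steps):
        z[n + 1] = z[n] - d * z[n] * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    return z
```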
The state-dependent decision rule for forecasting in LSTM-MSM is given by

(10) ż_t = F^w(z_{t−τ+1}, …, z_t)  if p̃(z_t) ≥ δ;  otherwise ż_t is given by the MSM (7)

where p̃ is an approximation of the probability density function of the training dataset and δ a constant tuned threshold. We approximate p̃ using a mixture of Gaussian kernels. This hybrid architecture exploits the advantages of LSTM and MSM. In case there is a high probability that the state lies close to the training dataset (interpolation), the LSTM, having memorized the local dynamics, is used to perform inference. This ensures accurate LSTM short-term predictions. On the other hand, close to the boundaries, the attractor is only sparsely sampled and errors from LSTM predictions would lead to divergence. In this case, the MSM guarantees that forecasting trajectories remain close to the attractor and that we converge to the statistical invariant measure in the long term.

3 Benchmark and Performance Measures
The performance of the proposed LSTM-based prediction mechanism is benchmarked against the following state-of-the-art methods:

- Mean Stochastic Model (MSM)
- Gaussian Process Regression (GPR)
- Mixed Model (GPR-MSM)
In order to guarantee that the prediction performance is independent of the initial condition selected, for all applications and all performance measures considered the average value of each measure for a number of different initial conditions sampled independently and uniformly from the attractor is reported. The ground truth trajectory is obtained by integrating the discretized reference equation starting from each initial condition, and projecting the states to the reduced order space. The reference equation and the projection method are of course application dependent.
From each initial condition, we generate an empirical Gaussian ensemble of dimension around the initial condition with a small variance . This noise represents the uncertainty in the knowledge of the initial system state. We forecast the evolution of the ensemble by iteratively predicting the derivatives and integrating (deterministically for each ensemble member for the LSTM, stochastically for GPR) and we keep track of the mean. We select an ensemble size , which is the usual choice in environmental science, e.g. weather prediction and shortterm climate prediction [55].
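The ensemble procedure above can be sketched in a few lines. Two simplifications to flag: the `step` map consumes a single state (whereas the LSTM actually consumes a short history window), and the names and signature are illustrative assumptions.

```python
import numpy as np

def ensemble_mean_forecast(z0, step, n_members, noise_std, n_steps, rng):
    """Propagate a Gaussian ensemble around the initial condition z0 and
    track the ensemble mean at every step. `step` maps a state to the
    next state (deterministically or stochastically)."""
    ensemble = z0 + noise_std * rng.standard_normal((n_members, z0.size))
    means = []
    for _ in range(n_steps):
        ensemble = np.array([step(m) for m in ensemble])
        means.append(ensemble.mean(axis=0))   # predicted ensemble mean
    return np.array(means)
```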
The ground truth trajectory z_t at each time instant is then compared with the predicted ensemble mean ẑ_t. As a comparison measure we use the root mean square error (RMSE), defined as

RMSE_k(t) = sqrt( (1/I) Σ_{i=1}^I ( z_{k,i}(t) − ẑ_{k,i}(t) )² )

where the index k denotes the k-th component of the reduced order state z_t, i denotes the initial condition, and I is the total number of initial conditions. The RMSE is computed at each time instant for each component k of the reduced order state, resulting in error curves that describe the evolution of the error with time. Moreover, we use the standard deviation σ_k of the attractor samples in each dimension as a relative comparison measure. Assuming that the attractor consists of samples z^n, with n = 1, …, N, the attractor standard deviation in dimension k is defined as σ_k = sqrt( (1/N) Σ_{n=1}^N (z_k^n − z̄_k)² ), where z̄_k is the mean of the samples in this dimension. If the prediction error is bigger than this standard deviation, then a trivial mean predictor performs better. Moreover, we use the mean Anomaly Correlation Coefficient (ACC) [58] over initial conditions to quantify the pattern correlation of the predicted trajectories with the ground truth. The ACC is defined as

(11) ACC(t) = (1/I) Σ_{i=1}^I [ Σ_k w_k (z_{k,i} − z̄_k)(ẑ_{k,i} − z̄_k) ] / sqrt( Σ_k w_k (z_{k,i} − z̄_k)² · Σ_k w_k (ẑ_{k,i} − z̄_k)² )

where k refers to the mode number, i refers to the initial condition, w_k are mode weights selected according to the energies of the modes after dimensionality reduction, and z̄_k is the time average of the respective mode, considered as reference. This score ranges from −1 to 1. If the forecast is perfect, the score equals 1. The ACC is a widely used forecasting accuracy score in the meteorological community [57].
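Both measures can be sketched directly; shapes and names are illustrative assumptions, and the ACC here averages the per-initial-condition anomaly correlations at a fixed time instant.

```python
import numpy as np

def rmse(truth, pred):
    """RMSE per component at one time instant over I initial conditions.
    truth, pred: arrays of shape (I, d)."""
    return np.sqrt(np.mean((truth - pred) ** 2, axis=0))

def acc(truth, pred, clim, w):
    """Mean anomaly correlation coefficient over I initial conditions.
    clim is the time-average reference, w the mode weights."""
    ta = truth - clim                        # truth anomalies
    pa = pred - clim                         # predicted anomalies
    num = np.sum(w * ta * pa, axis=1)
    den = np.sqrt(np.sum(w * ta ** 2, axis=1) * np.sum(w * pa ** 2, axis=1))
    return np.mean(num / den)
```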
4 Applications
In this section, the effectiveness of the proposed method is demonstrated on three chaotic dynamical systems exhibiting different levels of chaos, from weakly chaotic to fully turbulent: the Lorenz 96 system, the Kuramoto-Sivashinsky equation and a prototypical barotropic climate model.
4.1 The Lorenz 96 System
In [56] a model of the large-scale behavior of the mid-latitude atmosphere is introduced. This model describes the time evolution of the components X_j for j = 1, …, J of a spatially discretized (over a single latitude circle) atmospheric variable. In the following we refer to this model as the Lorenz 96 system. The Lorenz 96 is often used ([57, 36] and references therein) as a toy problem to benchmark methods for weather prediction.
The system of differential equations that governs the Lorenz 96 is defined as

(12) dX_j/dt = (X_{j+1} − X_{j−2}) X_{j−1} − X_j + F

for j = 1, …, J, where by definition X_{−1} = X_{J−1}, X_0 = X_J and X_{J+1} = X_1. In our analysis J = 40. The right-hand side of (12) consists of a nonlinear advection term (X_{j+1} − X_{j−2}) X_{j−1}, a linear dissipative term −X_j and a positive external forcing term F. The nonlinear advection term conserves the discrete energy of the system, and the Lorenz 96 states remain bounded. By increasing the external forcing parameter F, the behavior that the system exhibits changes from periodic to weakly chaotic, ending up in fully turbulent regimes. These regimes can be observed in Figure 3.
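The Lorenz 96 right-hand side and a fourth-order Runge-Kutta step can be sketched as follows; the time step and the perturbed initial condition in the usage example are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def lorenz96_rhs(X, F):
    """Right-hand side of the Lorenz 96 system with cyclic boundaries:
    (X_{j+1} - X_{j-2}) X_{j-1} - X_j + F."""
    return (np.roll(X, -1) - np.roll(X, 2)) * np.roll(X, 1) - X + F

def rk4_step(X, F, dt):
    """One fourth-order Runge-Kutta step."""
    k1 = lorenz96_rhs(X, F)
    k2 = lorenz96_rhs(X + 0.5 * dt * k1, F)
    k3 = lorenz96_rhs(X + 0.5 * dt * k2, F)
    k4 = lorenz96_rhs(X + dt * k3, F)
    return X + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
```

Note that X_j = F for all j is a fixed point, so trajectories are usually started from a small perturbation of it.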
Following [36, 55], we apply a shifting and scaling to standardize the Lorenz 96 states X_j. Denote by X̄ the mean state and by E_p the average energy fluctuation. In order for the scaled Lorenz 96 states to have zero mean and unit energy we transform them using

(13) X̃_j = (X_j − X̄) / √E_p

In this way the scaled energy is equal to one and the scaled variables have zero mean. The scaled Lorenz 96 states obey the following differential equation

(14) dX̃_j/dt = √E_p (X̃_{j+1} − X̃_{j−2}) X̃_{j−1} + X̄ (X̃_{j+1} − X̃_{j−2}) − X̃_j + (F − X̄) / √E_p
4.1.1 Dimensionality Reduction: Discrete Fourier Transform
Firstly, the Discrete Fourier Transform (DFT) is applied to the energy-standardized Lorenz 96 states X̃_j. The Fourier coefficients and the inverse DFT to recover the Lorenz 96 states are given by

(15) X̂_k = Σ_{j=0}^{J−1} X̃_j e^{−2πi jk/J},   X̃_j = (1/J) Σ_{k=0}^{J−1} X̂_k e^{2πi jk/J}
After applying the DFT to the Lorenz 96 states we end up with a symmetric energy spectrum that can be uniquely characterized by J/2 + 1 coefficients X̂_k for k = 0, …, J/2 (J is considered to be an even number). In our case J = 40, thus we end up with 21 complex coefficients X̂_k. These coefficients are referred to as the Fourier modes or simply modes. The Fourier energy of each mode is defined as E_k = E[X̂_k X̂_k*].
The energy spectrum of the Lorenz 96 system is plotted in Figure 4 for different values of the forcing term F. We take into account only the modes corresponding to the highest energies, and the rest of the modes are truncated. For the different forcing regimes, the six most energetic modes correspond to the bulk of the total energy. The space where the reduced variables live is referred to as the reduced order phase space, and the six most energetic modes are denoted as X̂_{k_1}, …, X̂_{k_6}. As shown in [59], the most energetic modes are not necessarily the ones that best capture the dynamics of the model. Including more modes, or designing a criterion to identify the most important modes in the reduced order space, may boost prediction accuracy. However, in this work we are not interested in an optimal reduced space representation, but rather in the effectiveness of a prediction model given this space. The truncated modes are ignored for now. Nevertheless, their effect can be modeled stochastically as in [36].
Since each Fourier mode is a complex number, it consists of a real part and an imaginary part. By stacking the real and imaginary parts of the six retained modes we end up with the 12-dimensional reduced model state

(16) z_t = [Re(X̂_{k_1}), Im(X̂_{k_1}), …, Re(X̂_{k_6}), Im(X̂_{k_6})]^T

Assuming that X_j for j = 1, …, J are the Lorenz 96 states at time instant t, the mapping from the full state to the reduced state is unique, and the reduced model state z_t of the Lorenz 96 has a specific vector value.
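The DFT reduction can be sketched as follows. One assumption to flag: this snippet ranks modes by the instantaneous energy of a single snapshot, whereas the paper selects them from the time-averaged energy spectrum.

```python
import numpy as np

def reduce_dft(X, n_modes):
    """Map a real state X (length J) to a reduced state: keep the n_modes
    most energetic Fourier coefficients and stack real and imaginary parts."""
    Xhat = np.fft.rfft(X)                      # J/2 + 1 complex coefficients
    energy = np.abs(Xhat) ** 2
    keep = np.argsort(energy)[::-1][:n_modes]  # most energetic wavenumbers
    kept = Xhat[keep]
    return np.concatenate([kept.real, kept.imag]), keep
```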
4.1.2 Training and Prediction in Lorenz 96
The reduced Lorenz 96 system states are considered as the true reference states . The LSTM is trained to forecast the derivative of the reduced order state as elaborated in Section 2. We use a stateless LSTM with hidden units and the backpropagation truncation horizon set to .
In order to obtain training data for the LSTM, we integrate the Lorenz 96 system, Eq. (12), starting from an initial condition, using a Runge-Kutta 4th-order method with a fixed time step. In this way a time series of system states is constructed. We obtain the reduced order state time series by applying the DFT mapping described in Section 4.1.1. From this time series we discard the first time steps as initial transients, ending up with a time series with N samples. A similar but independent process is repeated for the validation set.
4.1.3 Results
The trained LSTM models are used for prediction based on the iterative procedure explained in Section 2. In this section, we demonstrate the forecasting capabilities of the LSTM and compare it with GPR. 1000 different initial conditions uniformly sampled from the attractor are simulated. For each initial condition, an ensemble is considered by perturbing it with Gaussian noise of small variance.
In Figures 4(a), 4(b), and 4(c) we report the mean RMSE prediction error of the most energetic mode for the three forcing regimes over the first prediction time steps. In the RMSE, the complex norm is taken into account. A fraction of the standard deviation of the attractor is also plotted for reference. As F increases, the system becomes more chaotic and difficult to predict. As a consequence, the number of prediction steps that remain under the threshold decreases. The LSTM models extend this predictability horizon for all forcing regimes compared to GPR and MSM. However, when the LSTM is combined with the MSM, the short-term prediction performance is compromised. Nevertheless, hybrid LSTM-MSM models outperform GPR methods in short-term prediction accuracy.
In Figures 4(d), 4(e), and 4(f), the RMSE error over a longer horizon is plotted. The standard deviation of the attractor is plotted for reference. We can observe that both GPR and LSTM diverge, while the MSM and the blended schemes remain close to the attractor in the long term, as expected.
In Figures 4(g), 4(h), and 4(i), the mean ACC over 1000 initial conditions is given. The predictability threshold of 0.6 is also plotted. After crossing this critical threshold, the methods do not predict better than a trivial mean predictor. In the weakly chaotic regime, GPR methods show inferior performance compared to LSTM approaches, as analyzed previously in the RMSE comparison. However, in the intermediate regime, LSTM models do not predict better than the mean at longer horizons, while GPR shows better performance. In turn, when blended with the MSM, the compromise in performance for GPR-MSM is much bigger compared to LSTM-MSM. The LSTM-MSM scheme shows slightly superior performance to GPR-MSM during the entire relevant time period. In the fully turbulent regime, the LSTM shows comparable performance with both GPR and MSM, and all methods converge as chaoticity rises, since the intrinsic dimensionality of the system attractor increases and the system becomes inherently unpredictable.
In Figure 6, the evolution of the mean RMSE over all initial conditions of the individual wavenumbers of the Lorenz 96 is plotted. In contrast to GPR, the RMSE of the LSTM is much lower in the moderate- and low-energy wavenumbers than in the most energetic mode; this difference among modes is not observed in GPR. This can be attributed to the highly nonlinear energy transfer mechanisms between these lower-energy modes, as opposed to the Gaussian and locally linear energy transfers of the most energetic mode.
As illustrated before, the hybrid LSTM-MSM architecture effectively combines the accurate short-term prediction performance of the LSTM with the long-term stability of the MSM. The ratio of ensemble members modeled by the LSTM in the hybrid scheme is plotted with respect to time in Figure 6(a). Starting from the initial ensemble, as LSTM forecasts may deviate from the attractor, the MSM takes over more and more members of the hybrid scheme. As a consequence, the ratio of ensemble members modeled by the LSTM decreases with time. In line with the GPR results presented in [36] and plotted in Figure 6(b), the slope of this ratio curve increases with F. However, the LSTM ratio decreases more slowly compared to GPR.
4.2 Kuramoto-Sivashinsky Equation
The Kuramoto-Sivashinsky (KS) system is extensively used in many scientific fields to model a multitude of chaotic physical phenomena. It was first derived by Kuramoto [60, 61] as a turbulence model of the phase gradient of a slowly varying amplitude in a reaction-diffusion type medium with a negative viscosity coefficient. Later, Sivashinsky [62] studied the spontaneous instabilities of the plane front of a laminar flame, ending up with the KS equation, while in [63] the KS equation is found to describe the surface behavior of a viscous liquid in a vertical flow.
For our study, we restrict ourselves to the one-dimensional KS equation

(17) ∂u/∂t = −ν ∂⁴u/∂x⁴ − ∂²u/∂x² − u ∂u/∂x,   x ∈ [0, L]

where u(x, t) is the modeled quantity of interest, depending on a spatial variable x and time t. The negative viscosity is modeled by the parameter ν > 0. We impose Dirichlet and second-type boundary conditions to guarantee ergodicity [64]. In order to spatially discretize (17) we use a grid with spacing Δx and N interior nodes. Further, we denote by u_i the value of u at node i. Discretization using a second-order differences scheme yields

(18) du_i/dt = −ν (u_{i+2} − 4u_{i+1} + 6u_i − 4u_{i−1} + u_{i−2}) / Δx⁴ − (u_{i+1} − 2u_i + u_{i−1}) / Δx² − u_i (u_{i+1} − u_{i−1}) / (2Δx)

Further, we impose u_0 = u_{N+1} = 0 and add ghost nodes to account for the Dirichlet and second-type boundary conditions; the number of nodes N is held fixed in our analysis. The Kuramoto-Sivashinsky equation exhibits different levels of chaos depending on the bifurcation parameter ν [65]; smaller values of ν lead to more chaotic systems [36].
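The discretized right-hand side can be sketched as follows; treating both the boundary and the ghost nodes as zeros is a simplifying assumption for the boundary conditions described above.

```python
import numpy as np

def ks_rhs(u, nu, dx):
    """Second-order finite-difference right-hand side of the discretized KS
    equation: negative-viscosity fourth derivative, second derivative and
    nonlinear advection. Boundary and ghost nodes are set to zero here."""
    up = np.concatenate([[0.0, 0.0], u, [0.0, 0.0]])  # ghost + boundary nodes
    u_xx = (up[3:-1] - 2 * up[2:-2] + up[1:-3]) / dx ** 2
    u_xxxx = (up[4:] - 4 * up[3:-1] + 6 * up[2:-2] - 4 * up[1:-3] + up[:-4]) / dx ** 4
    u_x = (up[3:-1] - up[1:-3]) / (2 * dx)
    return -nu * u_xxxx - u_xx - u * u_x
```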
(Figure 7: contour plots of the KS solution and cumulative energy spectrum; 20 modes.)
In our analysis, the spatial domain size L is held constant and the chaoticity level is controlled through the negative viscosity ν, where a smaller value leads to a system with a higher level of chaos (see Figure 7(a)). In our study, we consider two values of ν to benchmark the prediction skills of the proposed method. The discretized equation (18) is integrated with a small fixed time step. The data points of the initial transient are discarded. Half of the remaining data are used for training and the other half for validation.
4.2.1 Dimensionality Reduction: Singular Value Decomposition
The dimensionality of the problem is reduced using Singular Value Decomposition (SVD). By subtracting the temporal mean ū and stacking the centered snapshots as rows, we end up with the data matrix U ∈ R^{D×N}, where D is the number of data samples. Performing SVD on U leads to

(19) U = M Σ V^T

with Σ diagonal with descending diagonal elements (the singular values), and M and V orthogonal matrices of left and right singular vectors. The right singular vectors corresponding to the r largest singular values are the first r columns of V. Stacking these singular vectors yields V_r ∈ R^{N×r}. Assuming that u_t is the vector of the discretized values of u at time t, in order to get a reduced order representation corresponding to the r components with the highest energies (singular values) we multiply

(20) z_t = V_r^T (u_t − ū)

The percentage of cumulative energy w.r.t. the number of PCA modes considered is plotted in Figure 7(b). In our study, we keep the 20 most energetic modes, as they explain most of the total energy.
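The SVD reduction and the corresponding lifting back to the full space can be sketched as follows; rows of the data matrix are time samples, and the function names are illustrative.

```python
import numpy as np

def fit_svd_basis(U_data, r):
    """Compute the temporal mean and the r most energetic right-singular
    directions of the stacked data matrix (rows = time samples)."""
    mean = U_data.mean(axis=0)
    A = U_data - mean
    _, _, Vt = np.linalg.svd(A, full_matrices=False)  # singular values descending
    return mean, Vt[:r].T                             # basis of shape (N, r)

def project(u, mean, basis):
    """Project a snapshot to the reduced coordinates."""
    return basis.T @ (u - mean)

def lift(z, mean, basis):
    """Map reduced coordinates back to the full space."""
    return mean + basis @ z
```

On data that truly lies in an r-dimensional subspace, project followed by lift reconstructs the snapshot exactly.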
4.2.2 Results
We train stateless LSTM models for both values of ν. For testing, starting from initial conditions uniformly sampled from the attractor, we generate a Gaussian ensemble centered around each initial condition in the original space with a small standard deviation. This ensemble is propagated using the LSTM prediction models, and the GPR, MSM and GPR-MSM models trained as in [36]. The root mean square error between the predicted ensemble mean and the ground truth is plotted in Figures 8(a) and 8(b) for the two values of the parameter ν. All methods reach the invariant measure much faster in the more chaotic regime than in the less chaotic one (note the different integration times in the two plots).
In both chaotic regimes, the reduced order LSTM outperforms all other methods in the short term before escaping the attractor. However, in the long term, the LSTM does not stabilize and eventually diverges faster than GPR (see Figure 8(b)). Blending the LSTM with the MSM alleviates the problem, and both accurate short-term predictions and long-term stability are attained. Moreover, the hybrid LSTM-MSM has better forecasting capabilities compared to GPR.
The need for blending the LSTM with the MSM is less imperative in the KS equation, as the system is less chaotic than the Lorenz 96 and the LSTM methods diverge much more slowly, while they sufficiently capture the complex nonlinear dynamics. As the intrinsic dimensionality of the attractor rises, the LSTM diverges faster.
The mean ACC (11) is plotted with respect to time in Figures 8(c) and 8(d) for the two values of ν. The evolution of the ACC confirms the aforementioned analysis. The mean ACC of the trajectory predicted with the LSTM remains above the predictability threshold of 0.6 for a longer time than the other methods. This predictability horizon is shorter in the more chaotic regime, since accurate predictions become more challenging as the chaoticity of the system rises. For a plot of the time evolution of the ratio of ensemble members modeled with LSTM dynamics in the hybrid LSTM-MSM, refer to the Appendix.
4.3 A Barotropic Climate Model
In this section, we examine a standard barotropic climate model [66] originating from a realistic winter circulation. The model equation is given by

(21) ∂ζ/∂t = −J(ψ, ζ + f + h) − k₁ ζ − k₂ ∇⁸ ζ + ζ*

where ψ is the stream function, ζ = ∇²ψ the relative vorticity, f the Coriolis parameter, ζ* a constant vorticity forcing and h the nondimensional orography, while k₁ and k₂ are the Ekman damping and the scale-selective damping coefficients. J is the Jacobi operator, given by

(22) J(a, b) = ∂a/∂λ ∂b/∂μ − ∂a/∂μ ∂b/∂λ

where μ and λ denote the sine of the geographical latitude and the longitude, respectively. The equation of the barotropic model (21) is nondimensionalized using the radius of the earth as unit length and the inverse of the earth's angular velocity as time unit. The nondimensional orography h is related to the real Northern Hemisphere orography h′ by h = A γ h′ / H, where A is a fixed amplitude, γ is a factor expressing the strength of the surface winds blowing across the orography, and H a scale height [66]. The stream function ψ is expanded into a spherical harmonics series and truncated at wavenumber 21, while modes with an even total wavenumber are excluded, avoiding currents across the equator and ending up with a hemispheric model with 231 degrees of freedom.
The training data are obtained by integrating Eq. (21) after an initial spin-up period, using a fourth-order Adams-Bashforth integration scheme with a small time step in accordance with [36]; k₁ corresponds to an Ekman damping time scale of a few days, while k₂ is selected such that the smallest resolved wavenumber is damped at a time scale of a few days. In this way we end up with a time series with N samples. The spherical surface is discretized into a mesh of equally spaced latitudes and longitudes. From the gathered data, one part is used for training and the rest for validation. The mean and variance of the statistical steady state are shown in Figures 9(a) and 9(b).
The dimensionality of the barotropic climate model truncated at wavenumber 21 is 231. In order to reduce it, we identify Empirical Orthogonal Functions (EOFs) that form an orthogonal basis of the reduced-order space. The details of the method are described in the Appendix. EOF analysis has been used to identify individual realistic climatic modes such as the Arctic Oscillation (AO), the Pacific/North America (PNA) and the Tropical/Northern Hemisphere (TNH) pattern, known as teleconnections [67, 68]. Accurate prediction of these modes is of high practical importance, as they feature realistic climate patterns. After projecting the state of the barotropic model onto the EOFs, we take into account only the coefficients corresponding to the most energetic EOFs, which form the reduced-order state. In our study the dimensionality of the reduced space is kept small, since the truncated modes contain only a small fraction of the energy, while the retained most energetic modes contain most of the total energy, as depicted in Figure 9(c).
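The EOF reduction sketched above can be illustrated with a minimal snapshot-based example. This is an illustrative sketch based on the singular value decomposition of the centered data matrix; `compute_eofs` and `project` are our hypothetical helper names, not code from this work.

```python
import numpy as np

def compute_eofs(X, n_modes):
    """Compute Empirical Orthogonal Functions (EOFs) from snapshot data.

    X: array of shape (n_samples, n_dims), rows are state snapshots.
    Returns the mean state, the n_modes leading EOFs and the fraction
    of the total energy (variance) captured by each retained mode.
    """
    mean = X.mean(axis=0)
    A = X - mean                            # center the snapshots
    # Thin SVD of the anomaly matrix; right singular vectors are the EOFs
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    energy = s**2 / np.sum(s**2)            # variance fraction per mode
    return mean, Vt[:n_modes], energy[:n_modes]

def project(X, mean, eofs):
    """Project states onto the retained EOFs (reduced-order coefficients)."""
    return (X - mean) @ eofs.T
```

Projecting the full state onto the retained EOFs yields the reduced-order coefficient vector used for forecasting.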
4.3.1 Training and Prediction
The reduced-order state that we want to predict using the LSTM consists of the components of the EOF coefficient vector. A stateless LSTM architecture is considered. The prototypical system is less chaotic than the KS equation and the Lorenz 96, which enables us to use more hidden units. The reason is that, as chaoticity decreases, trajectories sampled from the attractor as training and validation datasets become more interconnected, and the task is inherently easier and less prone to overfitting. In the extreme case of a periodic system, the information content would be identical. Initial conditions for testing are picked randomly and uniformly from the attractor. A Gaussian ensemble with a small variance along each dimension is formed around each initial condition and marched using the reduced-order GPR, MSM, mixed GPR-MSM and LSTM methods.
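The ensemble initialization described above can be sketched as follows; this is a hypothetical helper, with `sigma2` standing in for the small per-dimension variance used in the paper.

```python
import numpy as np

def gaussian_ensemble(z0, n_members, sigma2, seed=0):
    """Form a Gaussian ensemble of initial conditions around the
    reduced-order state z0 by adding independent zero-mean
    perturbations with variance sigma2 along each dimension."""
    rng = np.random.default_rng(seed)
    noise = np.sqrt(sigma2) * rng.standard_normal((n_members, z0.size))
    return z0 + noise
```

Each ensemble member is then marched forward independently by the chosen reduced-order model.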
4.3.2 Results
The RMSE of the four most energetic reduced-order variables is plotted in Figure 11. The LSTM takes the longest time to reach the attractor, while GPR-based methods generally reach it sooner. In contrast, the MSM reaches the attractor after barely an hour. This implies that the LSTM can better capture the nonlinear dynamics compared to GPR. Note that the barotropic model is much less chaotic than the Lorenz 96 system in its strongly forced regime; here all methods show comparable prediction performance. Blended LSTM models with MSM are omitted, as LSTM forecasts only reach the attractor standard deviation towards the end of the simulated time and LSTM-MSM shows identical performance.
(Legend for Figure 11: MSM; GPR; GPR-MSM; LSTM.)
5 A Comment on Computational Cost of Prediction
The computational cost of making a single prediction can be quantified by the number of operations (multiplications and additions) needed. In GPR-based approaches the computational cost scales with the number of samples \(N\) used in training, which is large for the GPR methods illustrated in the previous section. The GPR models the global dynamics by uniformly sampling the attractor and "carries" this training dataset at each time instant in order to identify the geometric relation between the input and the training dataset (modeled with a covariance matrix metric) and to make (exact or approximate) probabilistic inference on the output.
In contrast, the LSTM adjusts its parameters to reproduce the local dynamics. As a consequence, the inference computational complexity does not depend on the number of samples used for training; it depends only on the dimension of each input, the number of inputs and the number of hidden units. This complexity is significantly smaller than that of GPR, which translates into faster predictions. However, the LSTM is also more prone to diverging from the attractor, as there is no guarantee that the infrequent training samples near the attractor limits were memorized. This remark explains the faster divergence of the LSTM in the more turbulent regimes considered in Section 4.
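As a back-of-the-envelope illustration of this comparison, the leading-order operation counts can be sketched as follows. The constants are rough and the function names are ours, not from the paper; the GPR count assumes the kernel weight vector has been precomputed.

```python
def gpr_inference_cost(n_train, d_in):
    """Rough operation count for a single GPR mean prediction:
    evaluating the kernel against all n_train training samples
    (~n_train * d_in operations) plus the dot product with the
    precomputed weight vector (~n_train operations)."""
    return n_train * d_in + n_train

def lstm_inference_cost(d_in, n_steps, n_hidden):
    """Rough operation count for one LSTM forward pass over n_steps
    inputs: each of the four gates applies a
    (d_in + n_hidden) x n_hidden weight matrix at every step."""
    return n_steps * 4 * (d_in + n_hidden) * n_hidden
```

For training-set sizes of the order used in this study, the LSTM count is orders of magnitude below the GPR count, which is the source of the speed-up noted above.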
6 Conclusions
We propose a data-driven method, based on long short-term memory networks, for modeling and prediction in the reduced-order space of chaotic dynamical systems. The LSTM uses the short-term history of the reduced-order variable to predict the state derivative, which is then used for a one-step prediction. The network is trained on time-series data and requires no prior knowledge of the underlying governing equations. Using the trained network, long-term predictions are made by iteratively predicting one step forward.
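This iterative one-step prediction loop can be sketched as follows; `lstm_derivative` is a placeholder for the trained network, and a forward Euler update is assumed for illustration.

```python
import numpy as np

def iterative_forecast(history, lstm_derivative, dt, n_steps):
    """Iteratively predict forward: the network maps the short-term history
    of the reduced-order state to the state derivative, which is used for a
    one-step (forward Euler) update; the predicted state is then appended
    to the history and the process repeats.

    history: array (d, dim) holding the last d reduced-order states.
    lstm_derivative: callable mapping a (d, dim) history to a (dim,) derivative.
    """
    history = list(history)
    trajectory = []
    for _ in range(n_steps):
        dzdt = lstm_derivative(np.asarray(history))
        z_next = history[-1] + dt * dzdt    # one-step prediction
        trajectory.append(z_next)
        history = history[1:] + [z_next]    # slide the history window
    return np.asarray(trajectory)
```

Because each predicted state is fed back as input, the prediction error accumulates over the forecast horizon, which is the behavior analyzed in the comparisons above.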
The features of the proposed technique are showcased through comparisons with GPR and MSM on benchmark cases. Three applications are considered: the Lorenz 96 system, the Kuramoto-Sivashinsky equation and a barotropic climate model. The chaoticity of these systems ranges from weakly chaotic to fully turbulent, ensuring a thorough simulation study. The comparison measures include the RMSE and the ACC between the predicted trajectories and trajectories of the real dynamics.
In all cases the proposed approach performs better in short-term predictions, as the LSTM is more efficient in capturing the local dynamics and the complex interactions between the modes. However, the prediction error accumulates as predictions are performed iteratively and, similarly to GPR, the forecast does not converge to the invariant measure. Furthermore, in the cases of increased chaoticity the LSTM diverges faster than GPR. This may be attributed to the absence of certain attractor regions in the training data, insufficient training, and the propagation of the exponentially increasing prediction error. To mitigate this effect, the LSTM is also combined with the MSM, following ideas presented in [36], in order to guarantee convergence to the invariant measure. Blending LSTM or GPR with MSM leads to a deterioration in short-term prediction performance, but the steady-state statistical behavior is captured. The hybrid LSTM-MSM exhibits slightly superior performance to GPR-MSM in all systems considered in this study.
In the Kuramoto-Sivashinsky equation the LSTM captures the local dynamics better than in the Lorenz 96 system, due to the lower intrinsic attractor dimensionality. The LSTM is more accurate than GPR in the short term but, especially in the chaotic regime, its forecasts diverge from the attractor faster. LSTM-MSM counters this effect and long-term forecasts converge to the invariant measure, at the expense of a compromise in short-term forecasting accuracy. The higher short-term forecasting accuracy of the LSTM can be attributed to the fact that it is a nonlinear approximator and can also capture correlations between modes in the reduced space. In contrast, GPR is a locally linear approximator modeling each mode independently in the output, assuming Gaussian correlations between modes in the input. LSTM and GPR show comparable forecasting accuracy in the barotropic model, as its intrinsic dimensionality is significantly smaller than that of the Kuramoto-Sivashinsky and Lorenz 96 systems and both methods can effectively capture the dynamics.
Future directions include modeling the lower-energy modes and the interpolation errors using a stochastic component in the LSTM, to improve forecasting accuracy. Another possible research direction is to model the attractor in the reduced space using a mixture of LSTM models, one for each region. The LSTM proposed in this work models the attractor globally; however, different attractor regions may exhibit very different dynamic behaviors, which cannot be simultaneously modeled using only one network. Moreover, these local models could be combined with a closure scheme compensating for truncation and modeling errors. This local modeling approach may further improve prediction performance.
7 Data Accessibility
The code and data used in this work are available at the link:
https://polybox.ethz.ch/index.php/s/keH7PftvLmbkYU1
The password is rspa_paper. The TensorFlow library and Python 3 were used for the implementation of the LSTM architectures, while MATLAB was used for the Gaussian processes. These packages need to be installed in order to run the code.
8 Authors’ Contributions
PRV conceived the idea of the blended LSTMMSM scheme, implemented the neural network architectures and the simulations, interpreted the computational results, and wrote the manuscript. WB supervised the work and contributed to the implementation of the LSTM. ZYW implemented the GPR and made contributions to the manuscript. PK had the original idea of the LSTM scheme and contributed to the manuscript. PK and TPS contributed to the interpretation of the results and offered consultation. All authors gave final approval for publication.
9 Competing Interests
We have no competing interests.
10 Funding
TPS and ZYW have been supported by an Air Force Office of Scientific Research grant FA9550-16-1-0231, an Office of Naval Research grant N00014-15-1-2381, and an Army Research Office grant 66710-EG-YIP. PK and PRV gratefully acknowledge support from the European Research Council (ERC) Advanced Investigator Award (No. 341117).
11 Acknowledgments
We thank the two anonymous reviewers whose insightful comments helped us to enhance the manuscript.
Appendix A Long short-term memory (LSTM)
A.1 Training and inference
In this section, the LSTM training procedure is explained in detail. We assume that time series data stemming from a dynamical system are available in the form of pairs \((z_t, \dot{z}_t)\), where \(z_t\) is the state at time step \(t\) and \(\dot{z}_t\) is its derivative. The available time series data are divided into two separate sets, the training dataset and the validation dataset, with \(N_{train}\) and \(N_{val}\) denoting the number of training and validation samples respectively; the ratio \(N_{train}/N_{val}\) is kept fixed. This data is stacked as
(23) \[ x_i = (z_{i-d+1}, \dots, z_i), \qquad y_i = \dot{z}_i \]
for \(i = d, \dots, N\), where \(d\) is the truncated backpropagation horizon, in order to form the training (and validation) inputs and outputs of the LSTM. These training samples are used to optimize the parameters of the LSTM (weights and biases) in order to learn the mapping from the input history to the output derivative. The loss function of each sample is
(24) \[ L_i = \lVert \hat{y}_i - y_i \rVert_2^2 \]
where \(\hat{y}_i\) denotes the network prediction, while the total loss is defined as
(25) \[ L = \frac{1}{N} \sum_{i=1}^{N} L_i \]
where \(N\) is the total number of samples. These samples can be further stacked together in batches, with the loss of a batch defined as the mean loss of the samples belonging to it. Using only one sample for the loss-gradient estimation may lead to noisy gradient estimates and slow convergence; mini-batch optimization tackles this issue.
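The sample stacking and mini-batching described above can be sketched as follows, with `d` the history length; the helper names are ours, not from the paper.

```python
import numpy as np

def make_samples(z, dz, d):
    """Stack a time series into (history, derivative) training pairs:
    input i is the window (z_{i-d+1}, ..., z_i), output i is dz_i."""
    X = np.stack([z[i - d + 1:i + 1] for i in range(d - 1, len(z))])
    Y = dz[d - 1:]
    return X, Y

def minibatches(X, Y, batch_size, seed=0):
    """Yield shuffled mini-batches; averaging the loss over a batch gives
    a less noisy gradient estimate than single-sample updates."""
    idx = np.random.default_rng(seed).permutation(len(X))
    for s in range(0, len(X), batch_size):
        b = idx[s:s + batch_size]
        yield X[b], Y[b]
```

One pass over all mini-batches constitutes a training epoch.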
At the beginning of the training the weights are randomly initialized using the Xavier scheme. We also tried other initialization methods, such as drawing the initial weights from random normal distributions or setting them to constant values, but these often led to saturation of the activation functions, especially for architectures with a longer backpropagation horizon. The training proceeds by optimizing the network weights iteratively for each batch. In order to perform this optimization step, a gradient descent optimizer can be used:
(26) \[ w_{new} = w_{old} - \eta \, \nabla_w L \]
where \(\eta\) is the step-size parameter, \(w_{old}\) are the weights before optimizing on the batch and \(w_{new}\) are the updated weights. Plain gradient descent optimization suffers in practice from slow convergence and from convergence to suboptimal local solutions. This approach is especially ill-suited for high-dimensional problems in deep learning, where the parameters (weights) to be optimized lie in a high-dimensional space with many local optima. Sparse gradients stemming from the mini-batch optimization also lead to slow convergence, as previously computed gradients are ignored. Recent advances in stochastic optimization have led to adaptive schemes that efficiently cope with the aforementioned problems.
In our work, we used the Adam stochastic optimization method. Adam exploits previously computed gradients using moment estimates. The weights are initialized with the Xavier scheme and the moment vectors to \(m_0 = 0\) and \(v_0 = 0\). At each iteration \(t\) the updates of the Adam optimizer are
(27) \[ g_t = \nabla_w L, \quad m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2, \]
\[ \hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \quad w_t = w_{t-1} - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \]
where \(\beta_1\), \(\beta_2\) and \(\epsilon\) are hyperparameters, \(g_t^2\) is the pointwise square of the gradient and \(\beta^t\) is the parameter \(\beta\) raised to the \(t\)-th power, with \(t\) the iteration number. After updating the weights using the Adam procedure (27) for every batch, a training epoch is completed. Many such epochs are performed until the total training loss reaches a plateau. After each epoch the loss is also evaluated on the validation dataset, in order to avoid overfitting; the validation loss is used as a proxy of the generalization error. The training is stopped when the validation error has not decreased for a number of consecutive epochs or when the maximum number of epochs is reached. We found that our results were robust with respect to the selection of these hyperparameters. To speed up convergence, a higher initial learning rate was used and the models were then refined with a smaller one.
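For concreteness, a single Adam update as described above can be sketched in NumPy; this is a generic textbook implementation for illustration, not the TensorFlow code used in this work.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m)
    and its pointwise square (v), bias-corrected by 1 - beta^t, followed
    by a scaled step. Returns the updated weights and moment vectors."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)          # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Iterating this step over the mini-batches of one pass through the data constitutes a training epoch.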
A.2 Weighting the loss function
In the training procedure described above the loss function for each sample is given by
(28) \[ L_i = \lVert \hat{y}_i - y_i \rVert_2^2 \]
However, in the applications considered in this paper the neural network output is a multidimensional vector representing a prediction of the derivative of the reduced-order state of a dynamical system. In a dynamical system, specific reduced-order states are more important than others, as they may explain a larger portion of the total energy. This importance can be introduced into the loss function by assigning different weights to different outputs of the neural network. The loss of each sample then takes the following form
(29) \[ L_i = \sum_{j=1}^{o} w_j \, (\hat{y}_{i,j} - y_{i,j})^2 \]
where \(o\) is the output dimension and the weights \(w_j\) are selected according to the significance of each output component, e.g. the energy of each component in the physical system.
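A one-line sketch of this weighted loss (with a hypothetical helper name):

```python
import numpy as np

def weighted_loss(y_pred, y_true, weights):
    """Per-sample squared error with each output component scaled by a
    weight reflecting its significance (e.g. the energy of the mode)."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return np.sum(np.asarray(weights) * (y_pred - y_true)**2, axis=-1)
```

Setting all weights to one recovers the unweighted loss (28).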
A.3 LSTM architecture
An RNN unfolded \(d\) time steps in the past is illustrated in Figure 12. The following discussion on Stateless and Stateful RNNs generalizes to LSTMs, with the only difference that the hidden state consists of both the hidden state \(h_t\) and the cell state \(C_t\) instead of solely \(h_t\), and that the functions coupling the hidden states with the input, as well as the output with the hidden states, are more complicated.
In Stateless RNNs the hidden states at the truncation layer are always initialized with zeros. As a consequence, only the short-term history is used to perform a prediction. The only difference when using LSTM cells is that the update function has a more complex structure and that the cell state is additionally initialized with zeros.
In contrast, in Stateful RNNs the hidden states at the truncation layer are carried over from the past. In this case, these states can be initialized by "teacher forcing" the RNN using data from a longer history in the past. For example, assuming a longer history is known, we can start from zero hidden states far enough in the past and propagate them through the given history while ignoring the outputs; the resulting hidden states can then be used for the prediction as in Figure 12. This approach has two disadvantages.

1. In order to be able to forecast starting from various initial conditions, even with "teacher forcing" some initialization of the hidden states is imperative. This initialization introduces an additional error, which is not the case for the Stateless RNN.

2. In the Stateful RNN a longer history needs to be known in order to initialize the hidden states with "teacher forcing". Even though more data need to be available, we did not observe any improvement in prediction accuracy in the cases considered. This follows from the chaotic nature of the systems, as information further than a few time steps in the past is irrelevant for the prediction.
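The two initialization strategies can be contrasted with a toy vanilla-RNN sketch; we assume a plain tanh cell with hypothetical random weights (an LSTM would additionally carry a cell state).

```python
import numpy as np

def rnn_forward(x_seq, Wx, Wh, h0=None):
    """Unfold a vanilla tanh RNN over a sequence of inputs.
    A Stateless network always starts from h0 = 0; a Stateful one is
    handed the hidden state carried over from a longer history."""
    h = np.zeros(Wh.shape[0]) if h0 is None else h0
    for x in x_seq:
        h = np.tanh(Wx @ x + Wh @ h)
    return h

def teacher_force_init(warmup_seq, Wx, Wh):
    """'Teacher force' the hidden state of a Stateful RNN by running it
    over a longer known history, ignoring the outputs along the way."""
    return rnn_forward(warmup_seq, Wx, Wh)
```

The warm-up pass requires the longer history to be available, which is the second disadvantage noted above.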