Natural systems, ranging from climate and ocean circulation to organisms and cells, involve complex dynamics extending over multiple spatio-temporal scales. Centuries old efforts to comprehend and forecast the dynamics of such systems have spurred developments in large scale simulations, dimensionality reduction techniques and a multitude of forecasting methods. The goals of understanding and prediction have been complementing each other but have been hindered by the high dimensionality and chaotic behavior of these systems. In recent years we observe a convergence of these approaches due to advances in computing power, algorithmic innovations and the ample availability of data. A major beneficiary of this convergence are data-driven dimensionality reduction methods [7, 2, 3, 5, 6, 4, 8], model identification procedures [11, 10, 12, 13, 14, 15, 16, 9] and forecasting techniques [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] that aim to provide precise short-term predictions while capturing the long-term statistics of these systems. Successful forecasting methods address the highly non-linear energy transfer mechanisms between modes not captured effectively by the dimensionality reduction methods.
The pioneering technique of analog forecasting proposed in  inspired a widespread research in non-parametric prediction approaches. Two dynamical system states are called analogues if they resemble one another on the basis of a specific criterion. This class of methods uses a training set of historical observations of the system. The system evolution is predicted using the evolution of the closest analogue from the training set corrected by an error term. This approach has led to promising results in practice  but the selection of the resemblance criterion to pick the optimal analogue is far from straightforward. Moreover, the geometrical association between the current state and the training set is not exploited. More recently , analog forecasting is performed using a weighted combination of data-points based on a localized kernel that quantifies the similarity of the new point and the weighted combination. This technique exploits the local geometry instead of selecting a single optimal analogue. Similar kernel-based methods, 
use diffusion maps to globally parametrize a low dimensional manifold capturing the slower time scales. Moreover, non-trivial interpolation schemes are investigated in order to encode the system dynamics in this reduced order space as well as map them to the full space (lifting). Although the geometrical structure of the data is taken into account, the solution of an eigen-system with a size proportional to the training data is required, rendering the approach computationally expensive. In addition, the inherent uncertainty due to sparse observations in certain regions of the attractor introduces prediction errors which cannot be modeled in a deterministic context. In a method based on Gaussian process regression (GPR)  was proposed for prediction and uncertainty quantification in the reduced order space. The technique is based on a training set that sparsely samples the attractor. Stochastic predictions exploit the geometrical relationship between the current state and the training set, assuming a Gaussian prior over the modeled latent variables. A key advantage of GPR is that uncertainty bounds can be analytically derived from the hyper-parameters of the framework. Moreover, in  a Mean Stochastic Model (MSM) is used for under-sampled regions of the attractor to ensure accurate modeling of the steady state in the long-term regime. However the resulting inference and training have a quadratic cost in terms of the number of data samples .
Some of the earlier approaches to capture the evolution of time series in chaotic systems using recurrent neural networks were developed during the inception of the Long Short-Term Memory networks (LSTM) . However, to the best of our knowledge, these methods have been used only on low-dimensional chaotic systems 44], Echo State Networks (ESNs) [45, 46]47, 48] have been successful, albeit only for low order dynamical systems. Recent work in [49, 50] demonstrated promising results of ESNs for high dimensional chaotic systems.
In this paper, we propose LSTM based methods that exploit information of the recent history of the reduced order state to predict the high-dimensional dynamics. Time-series data are used to train the model while no knowledge of the underlying system equations is required. Inspired by Taken’s theorem  an embedding space is constructed using time delayed versions of the reduced order variable. The proposed method tries to identify an approximate forecasting rule globally for the reduced order space. In contrast to GPR , the method has a deterministic output while its training cost scales linearly with the number of training samples and it exhibits an inference computational cost. Moreover, following , LSTM is combined with a MSM, to cope with attractor regions that are not captured in the training set. In attractor regions, under-represented in the training set, the MSM is used to guarantee convergence to the invariant measure and avoid an exponential growth of the prediction error. The effectiveness of the proposed hybrid method in accurate short-term prediction and capturing the long-term behavior is shown in the Lorenz 96 system and the Kuramoto-Sivashinsky system. Finally the method is also tested on predictions of a prototypical climate model.
The structure of the paper is as follows: In Section 2 we explain how the LSTM can be employed for modeling and prediction of a reference dynamical system and a blended LSTM-MSM technique is introduced. In Section 3 three other state of the art methods, GPR, MSM and the hybrid GPR-MSM scheme are presented and two comparison metrics are defined. The proposed LSTM technique and its LSTM-MSM extension are benchmarked against GPR and GPR-MSM in three complex chaotic systems in Section 4. In Section 5 we discuss the computational complexity of training and inference in LSTM. Finally, Section 6 offers a summary and discusses future research directions.
2 Long Short-Term Memory (LSTM) Recurrent Neural Networks
The LSTM was introduced in order to regularize the training of recurrent neural networks (RNNs). RNNs contain loops that allow information to be passed between consecutive temporal steps (see Figure 0(a)) and can be expressed as:
where , and are the input, the output and the hidden state of the RNN at time step . The weight matrices are (input-to-hidden), (hidden-to-hidden), (hidden-to-output), and . Moreover, and
are the hidden and output activation functions, whileand are the respective biases. Temporal dependencies are captured by the hidden-to-hidden weight matrix , which couples two consecutive hidden states together. A schematic of the RNN architecture is given in Figure 1.
In many practical applications, RNNs suffer from the vanishing (or exploding) gradient problem and have failed to capture long-term dependencies[52, 53]. Today the RNNs owe their renaissance largely to the LSTM, that copes effectively with the aforementioned problem using gates. The LSTM has been successfully applied in sequence modeling , speech recognition , hand-writing recognition  and language translation .
The equations of the LSTM are
where , are the gate signals (forget, input and output gates), is the input, is the hidden state, is the cell state, while , , , are weight matrices and biases. The activation functions , and are sigmoids. For a more detailed explanation on the LSTM architecture refer to . In the following we refer to the LSTM hidden and cell states ( and ) jointly as LSTM states. The dimension of these states is called the number of hidden units and it controls the capability of the cell to encode history information. In practice we want the output to have a specific dimension . For this reason, a fully connected final layer without activation function is added
where all parameters (weights and biases) are encoded in , , and is the LSTM cell function that maps the previous LSTM States and current input to the output. By unfolding the LSTM time-steps in the past and ignoring dependencies longer that we get
where represents the iterative application of and computation of the LSTM states for time steps. For a more detailed explanation of the formula for , and a Figure of the neural network architecture refer to the Appendix.
In this work, we consider the reduced order problem where the state of a dynamical system is projected in the reduced order space. The system is considered to be autonomous, while is the system state derivative at time step . Following , The LSTM is trained using time series data from the system in the reduced order space to predict the reduced state derivative from a short history of the reduced order state consisting of past temporally consecutive states. In this work, we approximated the derivative from the original time series using first order forward differences. The loss that has to be minimized is defined as
The short-term history for the states before is not available, that is why in total we have training samples from a time series with samples. During training the weights of the LSTM are optimized according to . The parameter is denoted as truncation layer and time dependencies longer than
are not explicitly captured in the loss function.
Training of this model is performed using Back-propagation through time, truncated at layer and mini-batch optimization with the Adam method  with an adaptive learning rate (initial learning rate ). The LSTM weights are initialized using the method of Xavier . Training is stopped when convergence of the training error is detected or the maximum of epochs is reached. During training the loss of the model is evaluated on a separate validation dataset to avoid overfitting. The training procedure is explained in detail in the Appendix.
An important issue is how to select the hidden state dimension and how to initialize the LSTM states at the truncation layer . A small reduces the expressive capabilities of the LSTM and deteriorates inference performance. On the other hand, a big is more sensitive to overfitting and the computational cost of training rises. For this reason, has to be tuned depending on the observed training behavior. In this work, we performed a grid search and selected the optimal for each application. For the truncation layer , there are two alternatives, namely stateless and stateful LSTM. In stateless LSTM the LSTM states at layer are initialized to zero as in equation (5). As a consequence, the LSTM can only capture dependencies up to previous time steps. In the second variant, the stateful LSTM, the state is always propagated for time steps in the future and then reinitialized to zero, to help the LSTM capture longer dependencies. In this work, the systems considered exhibit chaotic behavior and the dependencies are inherently short-term, as the states in two time steps that differ significantly can be considered statistically independent. For this reason, the short temporal dependencies can be captured without propagating the hidden state for a long horizon. As a consequence, we consider only the stateless variant . We also applied stateful LSTM without any significant improvement so we omit the results for brevity. The trained LSTM model can be used to iterative predict the system dynamics as illustrated in Figure 2. This is a solely data-driven approach and no explicit information regarding the form of the underlying equations is required.
2.1 Mean Stochastic Model (MSM) and Hybrid LSTM-MSM
The MSM is a powerful data-driven method used to quantify uncertainty and perform forecasts in turbulent systems with high intrinsic attractor dimensionality [54, 36]. It is parameterized a priori to capture global statistical information of the attractor by design, while its computationally complexity is very low compared to LSTM or GPR. The concept behind MSM is to model each component of the state independently with an Ornstein-Uhlenbeck (OU) process that captures the energy spectrum and the damping time scales of the statistical equilibrium. The process takes the following form
where are parameters fitted to the centered training data and is a Wiener process. In the statistical steady state the mean, energy and damping time scale of the process are given by
where denotes the complex conjugate of . In order to fit the model parametersfrom the time series training data and the decorrelation time using
After computing these two quantities we replace in (8) and solve with respect to and . Since the MSM is modeled a priori to mimic the global statistical behavior of the attractor, forecasts made with MSM can never escape. This is not the case with LSTM and GPR, as prediction errors accumulate and iterative forecasts escape the attractor due to the chaotic dynamics, although short-term predictions are accurate. This problem has been addressed with respect to GPR in . In order to cope effectively with this problem we introduce a hybrid LSTM-MSM technique that prevents forecasts from diverging from the attractor.
The state dependent decision rule for forecasting in LSTM-MSM is given by
is an approximation of the probability density function of the training dataset anda constant threshold tuned based on . We approximate using a mixture of Gaussian kernels. This hybrid architecture exploits the advantages of LSTM and MSM. In case there is a high probability that the state lies close to the training dataset (interpolation) the LSTM having memorized the local dynamics is used to perform inference. This ensures accurate LSTM short-term predictions. On the other hand, close to the boundaries the attractor is only sparsely sampled and errors from LSTM predictions would lead to divergence. In this case, MSM guarantees that forecasting trajectories remain close to the attractor, and that we converge to the statistical invariant measure in the long-term.
3 Benchmark and Performance Measures
The performance of the proposed LSTM based prediction mechanism is benchmarked against the following state-of-the-art methods:
Mean Stochastic Model (MSM)
Gaussian Process Regression (GPR)
Mixed Model (GPR-MSM)
In order to guarantee that the prediction performance is independent of the initial condition selected, for all applications and all performance measures considered the average value of each measure for a number of different initial conditions sampled independently and uniformly from the attractor is reported. The ground truth trajectory is obtained by integrating the discretized reference equation starting from each initial condition, and projecting the states to the reduced order space. The reference equation and the projection method are of course application dependent.
From each initial condition, we generate an empirical Gaussian ensemble of dimension around the initial condition with a small variance . This noise represents the uncertainty in the knowledge of the initial system state. We forecast the evolution of the ensemble by iteratively predicting the derivatives and integrating (deterministically for each ensemble member for the LSTM, stochastically for GPR) and we keep track of the mean. We select an ensemble size , which is the usual choice in environmental science, e.g. weather prediction and short-term climate prediction .
The ground truth trajectory at each time instant is then compared with the predicted ensemble mean . As a comparison measure we use the root mean square error (RMSE) defined as where index denotes the th component of the reduced order state , is the initial condition, and is the total number of initial conditions. The RMSE is computed at each time instant for each component
of the reduced order state, resulting in error curves that describe the evolution of error with time. Moreover, we use the standard deviationof the attractor samples in each dimension as a relative comparison measure. Assuming that the attractor consists of samples , with , the attractor standard deviation in dimension is defined as , where is the mean of the samples in this dimension. If the prediction error is bigger than this standard deviation, then a trivial mean predictor performs better.
Moreover, we use the mean Anomaly Correlation Coefficient (ACC)  over initial conditions to quantify the pattern correlation of the predicted trajectories with the ground-truth. The ACC is defined as
where refers to the mode number, refers to the initial condition, are mode weights selected according to the energies of the modes after dimensionality reduction and is the time average of the respective mode, considered as reference. This score ranges from to . If the forecast is perfect, the score equals to . The ACC coefficient is a widely used forecasting accuracy score in the meteorological community .
In this section, the effectiveness of the proposed method is demonstrated with respect to three chaotic dynamical systems, exhibiting different levels of chaos, from weakly chaotic to fully turbulent, i.e. the Lorenz 96 system, the Kuramoto-Sivashinsky equation and a prototypical barotropic climate model.
4.1 The Lorenz 96 System
In  a model of the large-scale behavior of the mid-latitude atmosphere is introduced. This model describes the time evolution of the components for of a spatially discretized (over a single latitude circle) atmospheric variable. In the following we refer to this model as the Lorenz 96. The Lorenz 96 is usually used ([57, 36] and references therein) as a toy problem to benchmark methods for weather prediction.
The system of differential equations that governs the Lorenz 96 is defined as
for , where by definition . In our analysis . The right-hand side of (12) consists of a non-liner adjective term , a linear advection (dissipative) term and a positive external forcing term . The discrete energy of the system remains constant throughout time and the Lorenz 96 states remain bounded. By increasing the external forcing parameter F the behavior that the system exhibits changes from periodic to weakly chaotic () to end up in fully turbulent regimes (). These regimes can be observed in Figures 3.
Following [36, 55] we apply a shifting and scaling to standardize the Lorenz 96 states . The discrete or Dirichlet energy is given by . In order for the scaled Lorenz 96 states to have zero mean and unit energy we transform them using
where is the average energy fluctuation. In this way the scaled energy is and the scaled variables have zero mean with the mean state. The scaled Lorenz 96 states obey the following differential equation
4.1.1 Dimensionality Reduction: Discrete Fourier Transform
Firstly, the Discrete Fourier Transform (DFT) is applied to the energy standardized Lorenz 96 states. The Fourier coefficients and the inverse DFT to recover the Lorenz 96 states are given by
; ; ; of the total energy
After applying the DFT to the Lorenz 96 states we end up with a symmetric energy spectrum that can be uniquely characterized by ( is considered to be an even number) coefficients for . In our case , thus we end up with complex coefficients . These coefficients are referred to as the Fourier modes or simply modes. The Fourier energy of each mode is defined as .
The energy spectrum of the Lorenz 96 system is plotted in Figure 4 for different values of the forcing term . We take into account only the modes corresponding to the highest energies and the rest of the modes are truncated. For the different forcing regimes , the six most energetic modes correspond to approximately , and of the total energy respectively. The space where the reduced variables live in is referred to as the reduced order phase space and the most energetic modes are notated as for . As shown in  the most energetic modes are not necessarily the ones that capture better the dynamics of the model. Including more modes, or designing a criterion to identify the most important modes in the reduced order space may boost prediction accuracy. However, in this work we are not interested in an optimal reduced space representation, but rather in the effectiveness of a prediction model given this space. The truncated modes are ignored for now. Nevertheless, their effect can be modeled stochastically as in .
Since each Fourier mode is a complex number, it consists of a real part and an imaginary part. By stacking these real and imaginary parts of the truncated modes we end up with the dimensional reduced model state
Assuming that for are the Lorenz 96 states at time instant , the mapping ,
is unique and the reduced model state of the Lorenz 96 has a specific vector value.
4.1.2 Training and Prediction in Lorenz 96
The reduced Lorenz 96 system states are considered as the true reference states . The LSTM is trained to forecast the derivative of the reduced order state as elaborated in Section 2. We use a stateless LSTM with hidden units and the back-propagation truncation horizon set to .
In order to obtain training data for the LSTM, we integrate the Lorenz 96 system state Eg. (12) starting from an initial condition for using a Runge-Kutta 4th order method with a time step up to . In this way a time series is constructed. We obtain the reduced order state time series , using the DFT mapping . From this time series we discard the first initial time steps as initial transients, ending up with a time series with samples. A similar but independent process is repeated for the validation set.
The trained LSTM models are used for prediction based on the iterative procedure explained in Section 2. In this section, we demonstrate the forecasting capabilities of LSTM and compare it with GPs. different initial conditions uniformly sampled from the attractor are simulated. For each initial condition, an ensemble with size is considered by perturbing it with a normal noise with variance .
In Figures 4(a), 4(b), and 4(c) we report the mean RMSE prediction error of the most energetic mode , scaled with for the forcing regimes for the first time steps (). In the RMSE the complex norm is taken into account. The of the standard deviation of the attractor is also plotted for reference (). As increases, the system becomes more chaotic and difficult to predict. As a consequence, the number of prediction steps that remain under the threshold are decreased. The LSTM models extend this predictability horizon for all forcing regimes compared to GPR and MSM. However, when LSTM is combined with MSM the short-term prediction performance is compromised. Nevertheless, hybrid LSTM-MSM models outperform GPR methods in short-term prediction accuracy.
In Figures 4(d), 4(e), and 4(f), the RMSE error for is plotted. The standard deviation from the attractor is plotted for reference. We can observe that both GPR and LSTM diverge, while MSM and blended schemes remain close to the attractor in the long-term as expected.
; ; threshold ; MSM; GPR; GPR-MSM; LSTM; LSTM-MSM
In Figures 4(g), 4(h), and 4(i), the mean ACC over 1000 initial conditions is given. The predictability threshold of is also plotted. After crossing this critical threshold, the methods do not predict better than a trivial mean predictor. For GPR methods show inferior performance compared to LSTM approaches as analyzed previously in the RMSE comparison. However, for LSTM models do not predict better than the mean after , while GPR shows better performance. In turn, when blended with MSM the compromise in the performance for GPR-MSM is much bigger compared to LSTM-MSM. The LSTM-MSM scheme shows slightly superior performance than GPR-MSM during the entire relevant time period (). For the fully turbulent regime , LSTM shows comparable performance with both GPR and MSM and all methods converge as chaoticity rises, since the intrinsic dimensionality of the system attractor increases and the system become inherently unpredictable.
In Figure 6, the evolution of the mean RMSE over initial conditions of the wavenumbers of the Lorenz 96 with forcing is plotted. In contrast to GPR, the RMSE error of LSTM is much lower in the moderate and low energy wavenumbers compared to the most energetic mode . This difference among modes is not observed in GPR. This can be attributed to the highly non-linear energy transfer mechanisms between these lower energy modes as opposed to the Gaussian and locally linear energy transfers of the most energetic mode.
; MSM; GPR; GPR-MSM; LSTM; LSTM-MSM
As illustrated before, the hybrid LSTM-MSM architecture effectively combines the accurate short-term prediction performance of LSTM with the long-term stability of MSM. The ratio of ensemble members modeled by LSTM in the hybrid scheme is plotted with respect to time in Figure 6(a). Starting from the initial ensemble of size , as the LSTM forecast might deviate from the attractor, the MSM is used to forecast in the hybrid scheme. As a consequence, the ratio of ensemble members modeled by LSTM decreases with time. In parallel with the GPR results presented in  and plotted in Figure 6(b), the slope of this ratio curve increases with up to time . However, the LSTM ratio decreases slower compared to GPR.
4.2 Kuramoto-Sivashinsky Equation
The Kuramoto-sivashinsky (K-S) system is extensively used in many scientific fields to model a multitude of chaotic physical phenomena. It was first derived by Kuramoto [60, 61] as a turbulence model of the phase gradient of a slowly varying amplitude in a reaction-diffusion type medium with negative viscosity coefficient. Later, Sivashinsky  studied the spontaneous instabilities of the plane front of a laminar flame ending up with the K-S equation, while in  the K-S equation is found to describe the surface behavior of viscous liquid in a vertical flow.
For our study, we restrict ourselves to the one dimensional K-S equation with boundary and initial conditions given by
where is the modeled quantity of interest depending on a spatial variable and time . The negative viscosity is modeled by the parameter . We impose Dirichlet and second-type boundary conditions to guarantee ergodicity . In order to spatially discretize (17) we use a grid size with the number of nodes. Further, we denote with the value of at node . Discretization using a second order differences scheme yields
Further, we impose and add ghost nodes , to account for the Dirichlet and second-order boundary conditions. In our analysis, the number of nodes is . The Kuramoto-Sivashinsky equation exhibits different levels of chaos depending on the bifurcation parameter . Higher values of lead to more chaotic systems .
; ; ; 20 modes
In our analysis the spatial variable bound is held constant to and chaoticity level is controlled through the negative viscosity , where a smaller value leads to a system with a higher level of chaos (see Figure 7(a)). In our study, we consider two values, namely and to benchmark the prediction skills of the proposed method. The discretized equation (18) is integrated with a time interval up to . The data points up to are discarded as initial transients. Half of the remaining data ( samples) are used for training and the other half for validation.
4.2.1 Dimensionality Reduction: Singular Value Decomposition
The dimensionality of the problem is reduced using Singular Value Decomposition (SVD). By subtracting the temporal meanand stacking the data, we end up with the data matrix , where is the number of data samples ( in our case). Performing SVD on leads to
with diagonal, with descending diagonal elements. The right singular vectors corresponding to the largest singular values are the first columns of . Stacking these singular vectors yields . Assuming that is a vector of the discretized values of in time , in order to get a reduced order representation corresponding to the components with the highest energies (singular values) we multiply
The percentage of cumulative energy w.r.t. to the number of PCA modes considered is plotted in Figure 7(b). In our study, we pick (out of ) most energetic modes, as they explain approximately of the total energy.
We train stateless LSTM models with and . For testing, starting from initial conditions uniformly sampled from the attractor, we generate a Gaussian ensemble of dimension centered around the initial condition in the original space with standard deviation of . This ensemble is propagated using the LSTM prediction models, and GPR, MSM and GPR-MSM models trained as in . The root mean square error between the predicted ensemble mean and the ground-truth is plotted in Figures 8(a), 8(b) for different values of the parameter . All methods reach the invariant measure much faster for compared to the less chaotic regime (note the different integration times for , while for ).
In both chaotic regimes and , the reduced order LSTM outperforms all other methods in the short-term before escaping the attractor. However, in the long-term, LSTM does not stabilize and will eventually diverge faster than GPR (see Figure 8(b)). Blending LSTM with MSM alleviates the problem and both accurate short-term predictions and long-term stability is attained. Moreover, the hybrid LSTM-MSM has better forecasting capabilities compared to GPR.
The need for blending LSTM with MSM in the KS equation is less imperative as the system is less chaotic than the Lorenz 96 and LSTM methods diverge much slower, while they sufficiently capture the complex nonlinear dynamics. As the intrinsic dimensionality of the attractor rises LSTM diverges faster.
The mean ACC (11) is plotted with respect to time in Figures 8(c) and 8(d) for and respectively. The evolution of the ACC justifies the aforementioned analysis. The mean ACC of the trajectory predicted with LSTM remains above the predictability threshold of for a highest time duration compared to other methods. This predictability horizon is approximately for and for , since the chaoticity of the system rises and accurate predictions become more challenging. For a plot of the time evolution of the ratio of the ensemble members that are modeled with LSTM dynamics in the hybrid LSTM-MSM refer to the Appendix.
; threshold; MSM; GPR; GPR-MSM; LSTM; LSTM-MSM
4.3 A Barotropic Climate Model
In this section, we examine a standard barotropic climate model  originating from a realistic winter circulation. The model equations are given by
where is the stream function, the relative vorticity, the Coriolis parameter, a constant vorticity forcing, while and are the Ekman damping and the scale-selective damping coefficient. is the Jacobi operator given by
where and denote the sine of the geographical latitude and longitude respectively. The equation of the barotropic model (21) is non-dimensionalized using the radius of the earth as unit length and the inverse of the earth angular velocity as time unit. The non-dimensional orography is related to the real Northern Hemisphere orography by , where is a fixed amplitude of , is a factor expressing the surface wind strength blowing across the orography, and a scale height . The stream-function is expanded into a spherical harmonics series and truncated at wavenumber 21, while modes with an even total wavenumber are excluded, avoiding currents across the equator and ending up with a hemispheric model with degrees of freedom.
The training data are obtained by integrating the Eq. (21) for days after an initial spin-up period of days, using a fourth-order Adams-Bashforth integration scheme with a -min time step in accordance with , with days, while is selected such that wavenumber is damped at a time scale of days. In this way we end up with a time series with samples. The spherical surface is discretized into a mesh with equally spaces latitude and longitude. From the gathered data, is used for training and for validation. The mean and variance of the statistical steady state are shown in Figures 9(a) and 9(b).
The dimensionality of the barotropic climate model truncated at wavenumber is . In order to reduce it, we identify Empirical Orthogonal Functions (EOFs) , that form an orthogonal basis of the reduced order space. The details of the method are described in the Appendix. EOF analysis has been used to identify individual realistic climatic modes such as the Arctic Oscillation (AO), the Pacific/North America (PNA) and the Tropical/Northern Hemisphere (TNH) pattern known as teleconnections [67, 68]. Accurate prediction of these modes is of high practical importance as they feature realistic climate patterns. After projecting the state of the barotropic model to the EOFs, we take into account only the coefficients corresponding to the most energetic EOFs that form the reduced order state . In our study, the dimensionality of the reduced space is , as contains only of the energy of , while the most energetic modes contain approximately of the total energy, as depicted in Figure 9(c).
4.3.1 Training and Prediction
The reduced order state that we want to predict using the LSTM are the components of . A stateless LSTM with hidden units is considered, while the truncated back-propagation horizon is set to . The prototypical system is less chaotic than the KS equation and the Lorenz 96, which enables us to use more hidden units. The reason is that as chaoticity is decreased trajectories sampled from the attractor as training and validation dataset become more interconnected and the task is inherently easier and less prone to overfitting. In the extreme case of a periodic system, the information would be identical. points are randomly and uniformly picked from the attractor as initial conditions for testing. A Gaussian ensemble with a small variance () along each dimension is formed and marched using the reduced-order GPR, MSM, Mixed GPR-MSM and LSTM methods.
The RMSE error of the four most energetic reduced order space variables for is plotted in Figure 11. The LSTM takes to reach the attractor, while GPR based methods generally take . In contrast, the MSM reaches the attractor already after hour. This implies that the LSTM can better capture the non-linear dynamics compared to GPR. Note that the barotropic model is much less chaotic than the Lorenz 96 system with , where all methods show comparable prediction performance. Blended LSTM models with MSM are omitted here, as LSTM models only reach the attractor standard deviation towards the end of the simulated time and MSM-LSTM shows identical performance.
; MSM ; GPR ; GPR-MSM ; LSTM
5 A Comment on Computational Cost of Prediction
The computational cost of making a single prediction can be quantified by the number of operations (multiplications and additions) needed. In GPR based approaches the computational cost is of order , where is the number of samples used in training. For GPR methods illustrated in the previous section . The GPR models the global dynamics by uniformly sampling the attractor and ”carries” this training dataset at each time instant to identify the geometric relation between the input and the training dataset (modeled with a covariance matrix metric) and make (exact or approximate) probabilistic inference on the output.
In contrast, LSTM adjusts its parameters to reproduce the local dynamics. As a consequence, the inference computational complexity does not depend on the number of samples used for training. The inference complexity is roughly , where is the dimension of each input, is the number of inputs and is the number of hidden units. This complexity is significantly smaller than GPR, which can be translated to faster prediction. However, it is logical that the LSTM is more prone to diverge from the attractor, as there is no guarantee that the infrequent training samples near the attractor limits where memorized. This remark explains the faster divergence of LSTM in the more turbulent regimes considered in Section 4.
We propose a data-driven method, based on long short-term memory networks, for modeling and prediction in the reduced space of chaotic dynamical systems. The LSTM uses the short-term history of the reduced order variable to predict the state derivative and uses it for one-step prediction. The network is trained on time-series data and it requires no prior knowledge of the underlying governing equations. Using the trained network, long-term predictions are made by iteratively predicting one step forward.
The features of the proposed technique are showcased through comparisons with GPR and MSM on benchmarked cases. Three applications are considered, the Lorenz 96 system, the Kuramoto-Sivashinsky equation and a barotropic climate model. The chaoticity of these systems ranges from weakly chaotic to fully turbulent, ensuring a complete simulation study. Comparison measures include the RMSE and ACC between the predicted trajectories and trajectories of the real dynamics.
In all cases, the proposed approach performs better, in short-term predictions, as the LSTM is more efficient in capturing the local dynamics and complex interactions between the modes. However, the prediction error accumulates as we iteratively perform predictions and similar to GPR does not converge to the invariant measure. Furthermore in the cases of increased chaoticity the LSTM diverges faster than GPR. This may be attributed to the absence of certain attractor regions in the training data, insufficient training, and propagation of the exponentially increasing prediction error. To mitigate this effect, LSTM is also combined with MSM, following ideas presented in , in order to guarantee convergence to the invariant measure. Blending LSTM or GPR with MSM leads to a deterioration in the short-term prediction performance but the steady-state statistical behavior is captured. The hybrid LSTM-MSM exhibits a slightly superior performance than GPR-MSM in all systems considered in this study.
In the Kuramoto-Sivashinsky equation LSTM can capture better the local dynamics compared to Lorenz 96 due to the lower intrinsic attractor dimensionality. LSTM is more accurate than GPR in the short-term, but especially in the chaotic regime forecasts of LSTM fly away from the attractor faster. LSTM-MSM counters this effect and long-term forecasts converge to the invariant measure at the expense of a compromise in the short-term forecasting accuracy. The higher short-term forecasting accuracy of LSTM can be attributed to the fact that it is a nonlinear approximator and can also capture correlations between modes in the reduced space. In contrast, GPR is a locally linear approximator modeling each mode independently in the output, assuming Gaussian correlations between modes in the input. LSTM and GPR show comparable forecasting accuracy in the barotropic model, as the intrinsic dimensionality is significantly smaller than Kuramoto-Sivashinsky and Lorenz 96 and both methods can effectively capture the dynamics.
Future directions include modeling the lower energy modes and interpolation errors using a stochastic component in the LSTM to improve the forecasting accuracy. Another possible research direction is to model the attractor in the reduced space using a mixture of LSTM models, one model for each region. The LSTM proposed in this work models the attractor globally. However, different attractor regions may exhibit very different dynamic behaviors, which cannot be simultaneously modeled using only one network. Moreover, these local models can be combined with a closure scheme compensating for truncation and modeling errors. This local modeling approach may further improve prediction performance.
7 Data Accessibility
The code and data used in this work are available at the link:
The password is rspa_paper. The TensorFlow library and python3 were used for the implementation of LSTM architectures while Matlab was used for Gaussian Processes. These packages need to be installed in order to run the codes.
8 Authors’ Contributions
PRV conceived the idea of the blended LSTM-MSM scheme, implemented the neural network architectures and the simulations, interpreted the computational results, and wrote the manuscript. WB supervised the work and contributed to the implementation of the LSTM. ZYW implemented the GPR and made contributions to the manuscript. PK had the original idea of the LSTM scheme and contributed to the manuscript. PK and TPS contributed to the interpretation of the results and offered consultation. All authors gave final approval for publication.
9 Competing Interests
We have no competing interests.
TPS and ZYW have been supported by an Air Force Office of Scientific Research grant FA9550-16-1-0231, an Office of Naval Research grant N00014-15-1-2381, and an Army Research Office grant 66710-EG-YIP. PK and PRV gratefully acknowledge support from the European Research Council (ERC) Advanced Investigator Award (No. 341117).
We thank the two anonymous reviewers whose insightful comments helped us to enhance the manuscript.
Appendix A Long short-term memory (LSTM)
a.1 Training and inference
In this section, the LSTM training procedure is explained in detail. We assume that time series data stemming from a dynamical system is available in the form , where is the state at time step and is the derivative. The available time series data are divided into two separate sets, the training dataset and the validation dataset, i.e. and . and are the number of training and validation samples respectively. We set the ratio to . This data is stacked as
for , in order to form the training (and validation) input and output of the LSTM. These training samples are used to optimize the parameters of the LSTM (weights and biases) in order to learn the mapping . The loss function of each sample is
while the total Loss is defined as
where is the total number of samples. These samples can be further stacked together as batches of size , with the loss of the batch defined as the mean loss of the samples belonging to the batch. Using only one sample for the loss gradient estimation may lead to noisy gradient estimates and slow convergence. Mini-batch optimization tackles this issue.
At the beginning of the training the weights are randomly initialized to
using Xavier initialization. We also tried other initialization methods like drawing initial weights from random normal distributions, or initializing them to constant values, but they often led to saturation of the activation functions, especially for architectures with higher back-propagation horizon. The training proceeds by optimizing the network weights iteratively for each batch. In order to perform this optimization step, a gradient descent optimizer can be used
where is the step-size parameter, are the weights before optimizing the batch and
are the updated weights. Plain gradient descent optimization suffers from slow convergence in practice and convergence to local sub-optimal solutions. This approach is especially not well-suited for high dimensional problems in deep learning where the number of parameters (weights) to be optimized lie in a high-dimensional manifold with many local optima. Sparse gradients stemming from the mini-batch-optimization lead also to slow convergence as previously computed gradients are ignored. Recent advances in stochastic optimization led to the invention of adaptive schemes that efficiently cope with the aforementioned problems.
In our work, we used the Adam stochastic optimization method. Adam exploits previously computed gradients using moments. The weights are initialized toand the moment vectors to and . At each step the updates in the Adam optimizer are
where and are hyper-parameters, is the point-wise square of the gradient and is the parameter in the th power, where is the iteration number. After updating the weights using the Adam procedure (27) for every batch, a training epoch is completed. Many such epochs are performed until the total training loss reaches a plateau. After each epoch the loss is evaluated also in the validation data set, in order to avoid overfitting. The validation loss is used as a proxy of the generalization error. The training is stopped when the validation error is not decreasing for consecutive epochs or the maximum of epochs is reached. In our work we used , , . We found that our results were robust towards the selection of these hyper-parameters. To speed up convergence speed, a higher initial learning rate was used and the models were then refined with .
a.2 Weighting the loss function
In the training procedure described above the loss function for each sample is given by
However, in the applications considered in this paper the neural network output is a multidimensional vector and represents a prediction of the derivative of the reduced order state of a dynamical system. In a dynamical system, specific reduced order states are more important than others as they may explain a bigger portion of the total energy. This importance can be introduced in the loss function by assigning different weights in different outputs of the neural network. The loss of each sample takes then the following form
where is the output dimension and weights are selected according to the significance of each output component, e.g. energy of each component in the physical system.
a.3 LSTM architecture
An RNN unfolded temporal time steps in the past is illustrated in Figure 12. The following discussion on Stateless and Stateful RNNs generalizes to LSTMs, with the only difference that the hidden state consists of instead of solely and the functions coupling the hidden states with the input as well as the output with the hidden states are more complicated.
In Stateless RNNs the hidden states at the truncation layer , are initialized always with . As a consequence, and only the short-term history is used to perform a prediction. The only difference when using LSTM cells is that the function has a more complex structure and additionally .
In contrast, in Stateful RNNs the states . In this case, these states can be initialized by ”teacher forcing” the RNN using data from a longer history in the past. For example, assuming is known, we can set , and compute using the given history ignoring the outputs. This value can then be used to predict as in Figure 12. This approach has two disadvantages.
In order to be able to forecast starting from various initial conditions, even with ”teacher forcing” some initialization of the hidden states is imperative. This initialization introduces additional error, which is not the case for the Stateless RNN.
In the Stateful RNN a longer history needs to be known in order to initialize the hidden states with ”teacher forcing”. Even though more data needs to be available, we did not observe any prediction accuracy improvement in the cases considered. This follows from the chaotic nature of the systems, as information longer than some time-steps in the past are irrelevant for the prediction.