Hierarchical (Deep) Echo State Networks with Uncertainty Quantification for Spatio-Temporal Forecasting

06/28/2018 ∙ by Patrick L. McDermott, et al. ∙ 0

Long-lead forecasting for spatio-temporal problems can often entail complex nonlinear dynamics that are difficult to specify it a priori. Current statistical methodologies for modeling these processes are often overparameterized and thus, struggle from a computational perspective. One potential parsimonious solution to this problem is a method from the dynamical systems and engineering literature referred to as an echo state network (ESN). ESN models use so-called reservoir computing to efficiently estimate a dynamical neural network forecast, model referred to as a recurrent neural network (RNN). Moreover, so-called deep models have recently been shown to be successful at predicting high-dimensional complex nonlinear processes. These same traits can be used to characterize many spatio-temporal processes. Here we introduce a deep ensemble ESN (D-EESN) model. Through the use of an ensemble framework, this model is able to generate forecasts that are accompanied by uncertainty estimates. After introducing the D-EESN, we then develop a hierarchical Bayesian implementation. We use a general hierarchical Bayesian framework that accommodates non-Gaussian data types and multiple levels of uncertainties. The proposed methodology is first applied to a data set simulated from a novel non-Gaussian multiscale Lorenz-96 dynamical system simulation model and then to a long-lead United States (U.S.) soil moisture forecasting application.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Spatio-temporal data are ubiquitous in engineering and the sciences, and their study is important for understanding and predicting a wide variety of processes. One of the chief difficulties in modeling spatial processes that change with time is the complexity of the dependence structures that must describe how such a process varies, and the presence of high-dimensional complex datasets and large prediction domains. It is particularly challenging to specify parameterizations for nonlinear dynamical spatio-temporal models that are simultaneously useful scientifically (e.g., long-lead forecasting as discussed below) and efficient computationally. Statisticians have developed some “deep” mechanistically-motivated hierarchical models that can accommodate process complexity as well as the uncertainties in predictions and inference, typically within a hierarchical Bayesian paradigm (see the overview in Cressie  Wikle, 2011)

. However, these models can be computationally expensive, require prior information and/or a significant amount of data to fit, and are typically application specific. On the other hand, the science, engineering, and machine learning communities have developed alternative approaches for nonlinear spatio-temporal modeling, particularly in the neural network context (e.g., recurrent neural networks, RNNs). These approaches can be very flexible but, again, are computationally expensive, require a lot of training data, and/or “pre-training.” In addition, these approaches often do not provide formal measures of uncertainty quantification. There are, however, parsimonious approaches to RNNs in the engineering literature, such as the echo state network (ESN), although the standard implementation of this approach does not include spatio-temporal dependencies, deep learning, or formal uncertainty quantification. Here, we present a hierarchical deep statistical implementation of an ESN model for spatio-temporal processes with the goal of long-lead forecasting with uncertainty quantification.

The methodology presented here is motivated by the problem of long-lead forecasting of environmental processes. The atmosphere is a chaotic dynamical system. Because of that, skillful weather forecasts are only possible out to about 10-14 days (e.g., Stern  Davidson, 2015). However, dynamical processes in the ocean operate on much longer time scales, and many atmospheric processes depend crucially on the ocean as a forcing mechanism. This coupling between the slowly varying ocean and the faster varying atmosphere (and associated processes), allows for the skillful prediction of some general properties of the atmospheric state many months to over a year in advance (i.e., long-lead forecasting). Although statistical models are not as skillful as deterministic numerical weather prediction models for short-to-medium weather forecasting, they have consistently performed as well or better than deterministic models for long-lead forecasting (e.g., Barnston ., 1999; Jan van Oldenborgh ., 2005)

. In some cases, fairly standard linear regression or multivariate canonical correlation analysis methods can be used to generate effective long-lead forecasts

(e.g., Penland  Magorian, 1993; Knaff  Landsea, 1997). However, given the inherent nonlinearity of these systems, it has consistently been shown that well crafted nonlinear statistical methods often perform better than linear methods, at least for some spatial regions and time spans (e.g., Drosdowsky, 1994; Tang ., 2000; Berliner ., 2000; Timmermann ., 2001; Kondrashov ., 2005; Wikle  Hooten, 2010; Gladish  Wikle, 2014). It remains an active area of research to develop nonlinear statistical models for long-lead forecasting, and there is a need to develop methods that are computationally efficient, skillful, and can provide realistic uncertainty quantification in the presence of multiple time and spatial scales.

Statistical approaches for nonlinear dynamical spatio-temporal models (DSTMs) have focused on accommodating the quadratic nonlinearity that is present in many mechanistic models of such systems (Kravtsov ., 2005; Wikle  Hooten, 2010; Richardson, 2017). These models, at least when implemented in a way that fully accounts for uncertainty in data, process, and parameters, can be quite computationally challenging, mainly due to the very large number of parameters that must be estimated. Solutions to this challenge require reducing the dimension of the state-space, regularizing the parameter space, the incorporation of additional information (prior knowledge), and novel computational approaches (see the summary in Wikle, 2015). Parsimonious alternatives include analog methods (e.g., McDermott  Wikle, 2016; Zhao  Giannakis, 2016), and individual (agent) based models (e.g., Hooten  Wikle, 2010).

RNNs provide an alternative approach to model multiple scale nonlinear spatio-temporal processes. In essence, RNNs are a type of artificial neural network, originally developed in the 1980s (Hopfield, 1982)

that, unlike traditional feed-forward neural networks, include “memory” and allow cycles that can process sequences in their hidden layers. That is, unlike feed-forward neural networks, RNNs explicitly account for the dynamical structure of the data. This has made them ideal for applications in natural language processing, speech recognition, and image captioning. These methods have not been used extensively for spatio-temporal prediction, although there are notable exceptions

(Dixon ., 2017; McDermott  Wikle, 20171). Like the quadratic nonlinear spatio-temporal DSTMs in statistics, these models have a very large number of parameters (weights), and can be quite difficult to tune and train, and are computationally intensive. One way to get the advantages of a RNN within a more parsimonious parameter estimation context is through the use of so-called “reservoir computing” methods - the most common of which is the ESN (Jaeger, 2001)

. In this case the hidden states and inputs evolve in a “dynamical reservoir” in which the parameters (weights) that describe their evolution are drawn at random with most (e.g., 90% or upwards) assumed to be zero. Then, the only parameters that are estimated are the output parameters (weights) that connect the hidden states to the output response. For example, in the context of continuous spatio-temporal responses, this is just a regression estimation problem (usually with a regularization penalty; e.g., ridge regression). See Section

3.1 for a model representation of the ESN.

Historically, the reservoir parameters in the ESN are just chosen once, with fairly large hidden state dimensions. Although this often leads to good predictions, it provides no opportunity for uncertainty quantification. An alternative is to perform parametric bootstrap or ensemble estimation, in which multiple reservoir samples are drawn (e.g., Sheng ., 2013; McDermott  Wikle, 20172). This provides a measure of uncertainty quantification and allows one to choose smaller hidden state dimensions, essentially building an ensemble of weak learners analogous to many methods in machine learning (e.g., Friedman ., 2001). There have also been Bayesian implementations of the ESN model (e.g., Li ., 2012; Chatzis, 2015)

, but none of these have been implemented with traditional Markov chain Monte Carlo (MCMC) estimation methods, as is the case here, where multiple levels of uncertainties can be accounted for. In the context of spatio-temporal forecasting,

(McDermott  Wikle, 20171, 20172) have shown that augmenting the traditional ESN with quadratic output terms (analogous to the quadratic nonlinear component in statistical DSTMs) and input embeddings (e.g., including lags of the input variables as motivated by Takens’ (1981) representation in dynamical systems) can improve forecast accuracy compared to traditional ESN models.

“Deep” models have recently shown great success in many neural network applications in machine learning (e.g., Krizhevsky ., 2012). As mentioned above, deep (hierarchical) statistical models have also been shown to be effective in complex spatio-temporal models. There are challenges in training such models given the very large number of parameters (weights), so it can be advantageous to consider deep models that are also relative parsimonious in their parameter space. It is then natural to explore the potential for deep or hierarchical ESN models. The purpose of adding additional layers in the ESN framework is to model (learn) additional temporal scales (Jaeger, 2007). Deep ESN models provide a greater level of flexibility by allowing individual layers to potentially represent different time scales. The model presented here attempts to exploit the multiscale nature of deep ESN models for spatio-temporal forecasting.

Although deep ESN models have been considered in the engineering literature (Jaeger, 2007; Triefenbach ., 2013; Antonelo ., 2017; Ma ., 2017), none of these approaches accommodate uncertainty quantification. Furthermore, these methods do not include a spatial component nor are they applied to spatio-temporal systems. Here we develop a hierarchical ESN approach for spatio-temporal prediction that explicitly accounts for uncertainty quantification. We first present an ensemble approach, extending the work of McDermott  Wikle (20172) to the deep setting, and then consider a hierarchical Bayesian implementation of the deep ESN model that more rigorously accounts for observation model uncertainty.

Section 2 describes the motivating spatio-temporal long-lead forecasting problem, in this case using Pacific sea surface temperature (SST) anomalies to forecast soil moisture anomalies over the US “corn belt” 6 months in the future. Section 3 describes the hierarchical ESN methodology, first from the ensemble perspective and then the Bayesian perspective. Section 4 presents a simulation study using a non-Gaussian multiscale Lorenz-96 system and then presents the soil moisture forecasting example. We conclude with a brief discussion in Section 5.

2 Motivating Problem: Long-Lead Forecasting

In the context of atmospheric and oceanic processes, long-lead forecasting refers to forecasting atmospheric, oceanic or related variables on monthly to yearly time scales. Although it is fundamentally impossible to generate skillful meteorological forecasts of atmospheric processes on time horizons of greater than about ten days to two weeks, the ocean operates on much longer time scales than the atmosphere and provides a significant amount of the forcing of atmospheric variability. Thus, this linkage between the ocean and the atmosphere can lead to skillful long-lead “forecasts” of the atmospheric state, or other processes linked to the atmospheric state (e.g., soil moisture) on time scales of months to two years or so (e.g., Philander, 1990). In this case, one cannot typically generate skillful meteorological point forecasts, but can provide general distributional forecasts that are skillful relative to naïve models such as climatology (long-term averages) or persistence (assuming current conditions persist into the future).

Historically, successful long-lead forecasting applications are typically tied to the ocean El Niño/Southern Oscillation (ENSO) phenomenon, which shows quasi-periodic variability between warmer than normal ocean states in the central and eastern tropical Pacific ocean (El Niño) and colder than normal ocean states in the central tropical Pacific (La Niña). The ENSO phenomenon accounts for the largest amount of variability in the tropical Pacific ocean and leads to world wide impacts due to atmospheric “teleconnections” (i.e., the shifting of the warm pools in the tropical Pacific also shift the convective clusters of precipitation that drive upper atmospheric wave trains, that in turn influence the atmospheric circulation; e.g., locations of jet streams). These atmospheric circulation changes then affect temperature, precipitation, and many responses to those variables such as habitat conditions for ecological processes, soil moisture, severe weather, etc., as described in Philander (1990). It has been demonstrated for over two decades that skillful long-lead forecasts of sea surface temperature (SST) in the tropical Pacific (and, hence, ENSO) is possible with both deterministic and statistical models. In fact, this is one of the aforementioned situations in the atmospheric and ocean sciences where statistical forecasting is as good as, or better than, deterministic models (e.g., Barnston ., 1999; Jan van Oldenborgh ., 2005).

There are essentially two general approaches to the long-lead forecasting of a response to SST forcing. One approach is to generate a long-lead forecast of SST (e.g., a 6 month forecast) and then use contemporaneous relationships between the ocean state and the response of interest (e.g., say midwest soil moisture) to generate a long-lead forecast of that response. This typically requires a dynamical forecast of SST and then some regression or classification model from SST to the outcome of interest. The alternative is to model the relationship between SST and the future response at a chosen lead time (e.g., forecasting midwest soil moisture in May given SST in November). This might be called a spatio-temporal regression, where one is predicting a spatial field in the future given a spatial field at the current time. In either case, linear models have been shown to perform reasonably well in these situations (e.g., Van den Dool ., 2003), but it is typically the case that models that include nonlinear interactions can perform more skillfully, and can produce more realistic long-lead forecast distributions (e.g., Sheffield ., 2004; Fischer ., 2007; McDermott  Wikle, 2016).

2.1 Long Lead Forecasting of Soil Moisture in the Midwest U.S. Corn Belt

Soil moisture is fundamentally important to many processes (e.g., agricultural production, hydrological runoff). In particular, the amount of soil moisture available to crops such as wheat and corn at certain critical phases of their growth cycle can make a significant impact on yield (Carleton ., 2008). Thus, having a long-lead understanding of the plausible distribution of soil moisture over an expansive area of agricultural production can assist producers by suggesting optimal management approaches (e.g., timing of planting, nutrient supplementation, and irrigation). Given the aforementioned links between tropical Pacific SST and North American weather patterns, it is not surprising that skillful long-lead forecasts of soil moisture in major production areas of the U.S. are possible (e.g., Van den Dool ., 2003). Indeed, the U.S. National Oceanic and Atmospheric Administration (NOAA) and National Center for Environmental Prediction (NCEP) routinely provide soil moisture outlooks (“forecasts”) that are based on a combination of deterministic and statistical (constructed analog) models (e.g., see http://www.cpc.ncep.noaa.gov/soilmst
/index_jh.html), although for shorter lead times than of interest here. Recently, McDermott  Wikle (2016) showed that a Bayesian analog forecasting model could provide skillful high-spatial-resolution forecasts of soil moisture anomalies over the state of Iowa in the U.S. at lead times up to 6 months.

The application here will consider the problem of forecasting soil moisture over the midwest U.S. corn belt in May given data from the previous November (i.e., a 6 month lead time). We consider May soil moisture because it is an important time period for planting corn in this region (i.e., Blackmer ., 1989). This application corresponds to the long-lead spatio-temporal field regression approach to nowcasting described above – that is, we regress a tropical Pacific SST field in November onto the soil moisture field from the following May. The details associated with the data and model used for this example are given in Section 4.3.

3 Methodology

Suppose we are interested in forecasting the spatio-temporal process for time periods at a discrete set of spatial locations . Using a chosen linear dimension reduction method, can be decomposed such that , where is a matrix of spatial basis functions and is a

-dimensional vector of basis coefficients, indexed by

(e.g., Cressie  Wikle, 2011, Chap. 7). Next, assume we have inputs corresponding to a spatio-temporal data set. Specifically, let be a vector of input variables that correspond to a discrete set of spatial locations at time . (More generally the input vector may consist of time-varying input covariates that are not indexed by space; e.g., ).

Below we provide a brief overview of the ensemble ESN model considered in McDermott  Wikle (20172), followed by the multi-level (deep) extension that we call a deep ensemble echo state network (D-EESN) model. We then describe a Bayesian version of the D-EESN model that we label (BD-ESSN).

3.1 Basic ESN Background

The quadratic ESN model outlined in McDermott  Wikle (20172) is given by:

Data stage: (1)
Output stage: (2)
Hidden stage: (3)

where is an matrix of spatial basis functions, is a -vector that contains the associated basis expansion coefficients (where ), is an -dimensional vector of “hidden units,”

is a nonlinear activation function (e.g., a sigmoidal function such as a hyperbolic tangent function),

is the “spectral radius” (largest eigenvalue) of

, is a scaling parameter, and , , , are weight (parameter) matrices of dimension , , , and , respectively (defined below). The square parameter matrix can be thought of analogously to a transition matrix in a vector autoregressive (VAR) model in that it can capture transition dynamic interactions between various inputs. The scaling parameter helps control the amount of memory in the system and is restricted such that, for stability purposes (Jaeger, 2007). We let be an -vector of embeddings (lagged input values) given by:


Importantly, only and the weight matrices and in the output stage (2) are estimated (usually with a ridge penalty). These weight matrices can be thought of similarly to regression parameters that weight the hidden units appropriately. In contrast, the “reservoir weights” in (3) are drawn from the following distributions:


where and denote indicator variables, denotes a Dirac function, and (and

) can be thought of as the probability of including a particular weight (parameter) in the model. As is common in the machine learning literature, both

and are also set to small values to help prevent overfitting. Similarly, the parameters and are set to small values to create a sparse network. Both the sparseness and randomness act as a regularization mechanism, which prevents the ESN model from overfitting to in-sample data. In summary, note that the hidden (reservoir) stage in (3) corresponds to a nonlinear stochastic transformation of the input vectors that are then regressed, with regularization, onto the output vectors in the output stage (i.e., (2) above), which are then transformed back to the data scale in (1).

Traditional ESN models (e.g., Jaeger, 2007; Lukoševičius  Jaeger, 2009) do not typically include the quadratic () term in the output stage nor do they include the embeddings in the input vector, but McDermott  Wikle (20172) found that those are often helpful when forecasting spatio-temporal processes. In addition, traditional ESN applications typically include a “leaking rate” parameter that corresponds to a convex combination of the previous hidden state and the current reservoir value, (e.g., Lukoševičius, 2012). We have not found this to be as useful for long-lead spatio-temporal forecasting as it has been in other applications and so omit it in our presentation for notational simplicity.

To facilitate uncertainty quantification, McDermott  Wikle (20172) were motivated by parametric bootstrap prediction methods (e.g., Genest  Rémillard, 2008; Sheng ., 2013) to consider an ensemble of predictions from the quadratic ESN model. In particular, Algorithm 1 of McDermott  Wikle (20172) provides a simple procedure that generates forecast realizations : by implementing the reservoir model in (3) to obtain in (2) through the use of simulated (sample) weight matrices from (5) and (6

). These samples can then be used in a Monte Carlo sense to obtain features of the predictive distribution (e.g., means, variances, etc.). This quadratic-ensemble ESN (Q-EESN) model was shown to be quite effective in capturing the uncertainty associated with long-lead forecasting of tropical Pacific SST.

3.2 Deep Ensemble ESN (D-EESN)

Here we develop a deep extension of the bootstrap-based Q-EESN model of McDermott  Wikle (20172) described in Section 3.1. That model has multiple levels, but is not a “deep” model in the sense that it has no mechanism to link hidden layers, which might be important for processes that occur on multiple time scales. To accommodate such structure, we extend some of the deep ESN model components developed in Ma . (2017) and Antonelo . (2017) to a spatio-temporal ensemble framework in the following D-EESN model. In particular, the D-EESN model with hidden layers is defined as follows for time period and :


where is a -dimensional vector for the hidden layer and each parameter matrix is defined as:


Let be a matrix. The function in (11) denotes a dimension reduction mapping function of this matrix (see below for specific examples) such that for , where is a matrix such that . More concisely, this transformation is on the matrix , resulting in the transformed matrix , and then for each time period the appropriate is extracted from the transformed matrix to construct (10). Through the inclusion of the terms in (9), the model can potentially weight layers lower down in the hierarchy that may represent different time scales or features (i.e., Ma ., 2017). Note, the activation function is included in (9) to place the dimension reduction variables on a similar scale as the variables (i.e., the hidden variables that have not gone through the dimension reduction transformation).

For a given hidden unit , denotes the largest eigenvalue of the square matrix , where as in the basic ESN model above, is drawn from the following distribution:


and each element of the parameter matrix is drawn from:


The embedded input vector, is defined as in (4) above. Estimation of is carried out through the use of ridge regression using the ridge hyper-parameter , where there is a one-to-one relationship between and the constant in (9). Note that we do not include a quadratic output term in this model as we did in (2) for simplicity, but this could easily be added if the application warrants (note - both applications presented here did not benefit from the addition of quadratic output terms in the deep model setting).

The bootstrap ensemble prediction for the D-EESN model is presented in Algorithm 1 below. In particular, Algorithm 1 starts by drawing and for every layer in the D-EESN model. Next, these parameters are plugged-in sequentially, starting with the input layer defined in (12) and ending with the hidden layer that comes directly before the output stage (i.e., (10) for ). Finally, with all of the hidden states calculated, ridge regression is used to estimate the regression parameters in (9).

Data :
Input : hyper-parameters: , chosen by cross-validation.
for  do
       Simulate and from (14) and (15) for
       Calculate with (12) using and
       for  to  do
             Calculate using (11)
             Calculate with (10) using and
       end for
      Use ridge regression to calculate
       Calculate out-of-sample forecasts
end for
Output : Ensemble of forecasts
Algorithm 1 D-EESN algorithm

The reservoir hyper-parameters,

are chosen by cross-validation driven by a genetic algorithm (see Section

4.1 for further details). As in McDermott  Wikle (20172), we found that fixing the parameters as well as the number of hidden units for all of the layers except the first (i.e., ) was a reasonable assumption here since the dimension of all the hidden units besides the top layer (i.e., ) are eventually reduced using the dimension reduction transformation.

Note that this model consists of a series of linked non-linear stochastic transformations of the input vector

that are available for prediction. In addition, the data reduction steps act similarly to the “pooling” step in a convolutional neural network

(Krizhevsky ., 2012). That is, it simultaneously reduces the dimension of the hidden layers and provides a summary of the important features. Similar to how ridge regression acts to reduce redundancy in the classic ESN model, dimension reduction serves this purpose in the deep framework. Due to the limited amount of research on deep ESNs, the question of which dimension reduction method to use for in (11

) is still an open research question. Here, principal component analysis (PCA) and Laplacian eigenmaps

(Belkin  Niyogi, 2001) were explored as possible choices for . Although these methods were selected to represent both linear and nonlinear dimension reduction techniques, there are certainly other choices that could be explored in future applications.

3.3 Bayesian Deep Ensemble ESN (BD-EESN)

The D-EESN model given in Section 3.2 is very efficient to implement, but at a potential cost of not fully accounting for all sources of uncertainty in estimation and prediction. In particular, the data stage given in (8) does not account for truncation error in the basis expansion, nor the error associated with the estimates of the regression or residual variance parameters in (9). This can easily be remedied via Bayesian estimation at these stages. To our knowledge, this is the first ESN model for spatio-temporal data to be implemented within a traditional hierarchical Bayesian framework.

We develop the Bayesian deep EESN (BD-EESN) model in a general form here so that it can be applied to both the traditional EESN and the D-EESN described above. In particular, let:


where and D denotes an unspecified distribution; for both of the applications discussed here and (this specification is application dependent and could be altered to accommodate other distributions). Here, is defined as a (known) spatial covariance matrix (e.g., accommodating the basis truncation and/or measurement error), see Section 4 for application-specific details. The regression matrices in (17) are defined as follows for and :


In this notation, is the sampled -dimensional vector from a dynamical reservoir generated as in the D-EESN model in Section 3.2; that is, we generate reservoir samples “off-line” using (10) from the D-EESN model (for ). Similarly, are also sampled a priori and come from the dimension reduction stage(s) of the D-EESN model (i.e., (12) above). That is, both and are treated as fixed covariates for the BD-EESN model. So, the output model takes an ensemble approach in terms of generating a suite of nonlinear stochastic transformed input variables for a Bayesian regression, where each are matrices of regression parameters as defined in (18). Note the obvious similarity between the process model in (9) and the one for the D-EESN model in (17). If all of the terms for are set to zero, the process model in (17) can be used with the traditional EESN model (or the Q-EESN model by simply adding quadratic terms).

This model is clearly over-parameterized, so the regression parameters in the BD-EESN model are given stochastic search variable selection (SSVS) priors (George  McCulloch, 1997). While one of the many variable selection priors from the Bayesian variable selection literature can be used here, we choose to use SSVS priors for their ability to produce efficient shrinkage. In particular, SSVS priors shrink a (large) percentage of parameters to zero (in a spike-and-slab implementation) or infinitesimally close to zero, while leaving the remaining variables unconstrained. Thus, the regression parameters in the BD-EESN model are given the following hierarchical prior distribution for and :


where indexes the hidden units for a particular layer, , and

can be thought of as the prior probability of including a particular variable in the model. Finishing the prior specifications for the model, the variance parameter

is given an inverse-gamma prior such that . The hyper-parameters , , , , and are problem-specific (see the examples in Section 4.1 below).

Since the parameters here are given conjugate priors, the full-conditional distributions that make up the Gibbs sampler MCMC algorithm needed to implement this model are straightforward to sample from. A summary of the entire estimation procedure for the BD-EESN can be found in Algorithm

LABEL:algorithm2. Forecasts for the BD-EESN are made in a similar manner as the D-EESN (as shown by the output step of Algorithm 1). That is, training is carried out using data up to time period , while out-of-sample forecasts are continually made periods into the future using the corresponding lagged input variable to generate the appropriate hidden state variables. Finally, at each iteration of the MCMC algorithm, the sampled regression parameters are used with these hidden states to produce out-of-sample forecasts.

4 Simulation and Motivating Examples

Here we describe the model setup used to implement the D-EESN and BD-EESN models on a complex simulated process and for the soil moisture long-lead forecasting problem that motivated these models.

4.1 Model Setup

The previously mentioned cross-validation for the D-EESN is carried out using a genetic algorithm (GA) (e.g., Sivanandam  Deepa, 2007) contained in the GA package (https://cran.r-project.org/web/packages/GA) from the R statistical computing program ( http://cran.r-project.us.org). Unlike the basic ESN model, the number of hyper-parameters for the deep EESN model can increase quickly as the number of layers increases (e.g., with five-layers and a relatively coarse search grid the number of total parameters in the search space can easily approach ), thus rendering grid search approaches computationally burdensome at best (i.e., Ma ., 2017). Through the use of a GA, the hyper-parameters that make up a deep ESN can be selected at a fraction of the computational cost. The GA was implemented with 40 generations and a population size of 20 for all of the applications presented below. This is the first implementation, to our knowledge, of either a traditional or deep ESN within an ensemble framework using a GA. The bounds of the parameter search space for each of the hyper-parameters in the D-EESN model can be found in Table 1.

All ensemble models are comprised of 100 ensembles, we did not find any of the applications to be overly sensitive to this choice. Using this number of ensembles represents a compromise between computational efficiency and achieving consistent (reproducible) results (e.g., we found that using less than approximately 30 ensembles produced unstable results, while values greater than 100 did not change the results significantly). Note, previous bootstrap ensemble ESN papers have generally used far fewer ensembles (e.g., Sheng ., 2013). For context, on a 2.3 GHz laptop the D-EESN algorithm defined in Algorithm 1 takes 4.3 and 17.5 seconds for a two-layer and seven-layer model, respectively, using 100 ensembles with the Lorenz-96 application described in Section 4.2 below.

Regarding the previous discussed dimension reduction function for the reduction stage of the D-EESN model (i.e., (11) above), PCA basis functions were selected for both applications and models using cross-validation. Although, we should note that the model was not very sensitive to this choice among the basis functions considered. Finally, for simplicity we use the same dimension (i.e., ) for each dimension reduction layer in the D-EESN model, a similar assumption was made in Ma . (2017). Similar to McDermott  Wikle (20172) all of the hyper-parameters in the set are fixed at . Finally, for was set to 84 for all of the deep models. The model was not sensitive to this value with values ranging between 60 and 100 producing nearly identical results for the metrics evaluated below. As previously discussed, the bootstrap framework presented here aims to create an ensemble of weak learners, thus the selection of a moderate value for .

Search Space:
Table 1: Search spaces for all of the hyper-parameters in the D-EESN model. These search spaces were implemented for the D-EESN using a genetic algorithm (GA) algorithm with cross-validation. Brackets denote discrete spaces and denote continuous spaces.

Estimation of all the Bayesian models considered here is carried out using a Gibbs sampler MCMC algorithm with 5,000 iterations where the first 1,000 iterations are treated as burn-in. The trace plots showed no evidence of non-convergence (more details regarding convergence are given below). The specific hyper-parameters used for the Bayesian implementation can be found in Table 2. Note, more restrictive priors (in terms of regularization) are employed for the models with more regression parameters (i.e., models with more hidden layers). This improves the mixing of the MCMC algorithm along with preventing the model from overfitting.

Application: Priors
BD-EESN models with , and  for
BD-EESN models with , and  for
Table 2: Specific hyper-parameters used for the various BD-EESN implementations. The hyper-parameters were selected so the in-sample MSE of the BD-EESN roughly matched the in-sample MSE for the corresponding D-EESN model. None of the applications were overly sensitive to the specified priors, with moderate variations from the values given below producing similar results.

Both the D-EESN and BD-EESN are compared against the previously described single-layered Q-EESN model in order to investigate the added utility of using a deep framework. In addition, naïve or simple forecasting methods are often employed for difficult long-lead forecasting problems such as the soil moisture application, to act as a baseline for comparison. We consider both a climatological and linear dynamical spatial-temporal model (DSTM) here as baseline models for the soil moisture application (for consistency, we also compare to the linear model in the deep Lorenz-96 simulation study described below). The linear DSTM model is defined as follows:


where is a transition matrix estimated using least squares (with associated Gaussian error-terms ) and is a covariance matrix estimated empirically using the in-sample residuals from the model fit (i.e., this is essentially a two-stage least squares estimation procedure, where is first estimated as previously described, followed by ). See Chapter 7 of Cressie and Wikle (2011) for a comprehensive overview of linear DSTMs.

The models considered here are evaluated in terms of both out-of-sample prediction accuracy, and the quality of their respective forecast distributions (i.e., uncertainty quantification). In particular, mean squared prediction error (MSPE), defined here as the average squared difference between out-of-sample realizations and forecasts averaged over space and time, is calculated for every model. The forecast distributions are evaluated using the continuous ranked probability score (CRPS). Unlike MSPE, CRPS also considers the quality of a given model’s uncertainty quantification by comparing the forecast distribution with the observed values (e.g., see Gneiting  Katzfuss, 2014). The classic CRPS formulation for Gaussian data (i.e., Gneiting ., 2005) is used for the soil moisture application, while the deep Lorenz-96 simulation uses the closed form expression for CRPS with log-Gaussian data from Baran  Lerch (2015). Regarding the soil moisture application, as previously noted, standard forecasting methodologies for difficult long-lead problems are often compared to simpler forecast methods such as climatological or linear forecasts. The so-called skill score

(SS) is an evaluation metric that compares a forecasting method to some reference forecast

(e.g., D. Wilks, 2001). Assuming one wants to compare a reference forecast to some model forecast in terms of MSPE, SS is defined as follows:


In our application, SS is calculated for each location in the soil moisture application by calculating MSPE across time. Thus, by computing a SS for each spatial location we can obtain a spatial field that shows where the new model improves upon a particular reference model.

4.2 Simulation Study: Deep Lorenz-96 Model

The deterministic model of Lorenz (1996), often referred to as the “Lorenz-96 model,” is frequently used as a simulation model in the atmospheric science and dynamical systems literature because it incorporates the quadratic nonlinear interactions of the famous Lorenz (1963) “butterfly” model in a one-dimensional spatial setting. A multiscale extension of this model (see DS. Wilks, 2005) has gained popularity in the atmospheric science and applied mathematics literature for its ability to represent realistic nonlinear interactions between small and large scale variables (Grooms  Lee, 2015). Moreover, the Lorenz-96 model operates on multiple scales in both space and time, consisting of locations with slowly varying large-scale behavior and locations with fast varying small-scale behavior. Thus, the multiscale Lorenz-96 model represents a very relevant example for the multiscale deep EESN methodology developed here. To make this simulation model even more realistic, we extend it by adding an additional data stage to allow for non-Gaussian data types. We will refer to this simulation model as the “deep Lorenz-96 model.” To our knowledge, this deep Lorenz-96 model is novel.

Figure 1: Description of the spatial structure for the deep Lorenz-96 model with 9 large-scale locations that each have 11 associated small-scale locations, which gives 99 small-scale locations. For the sake of visualization we have used a smaller number of large and small scale locations here than in the analyzed deep Lorenz-96 simulated data example. The various red-shaded small circles represent small-scale locations (i.e., in (23)), while the large blue and black circles denote the large-scale locations and , respectively, from the process model defined in (23).

The spatial structure of the Lorenz-96 system (as shown in Figure 1) is a one-dimensional circular structure (i.e., periodic boundary conditions) where each large-scale location is evenly spaced. Furthermore, each large-scale location is associated with small scale locations (see Figure 1). Using the process model parameterization from Chorin  Lu (2015) and the inclusion of a data model, our deep Lorenz-96 model is defined as follows for time period :

Data Model:
Process Model: (23)

where and are user defined parameters, and correspond to large-scale locations, corresponds to a small-scale location for and , and denotes the absolute value. The parameter corresponds to a forcing term and helps determine the amount of nonlinearity in the model. In this setting, is a “time separation parameter” such that smaller values lead to a faster-varying temporal-scale for the small-scale locations. The contribution of the small-scale locations to the large-scale locations (and vice versa) is determine by and , respectively. Finally, the scaling parameter helps ensure the log-Gaussian data model does not produce unrealistically large realizations (we let here).

Since the methodology presented here concerns multiscale spatio-temporal modeling, the parameters for the deep Lorenz-96 model are selected to emphasize the level of multiscale behavior in the process (as can be seen from the simulated deep Lorenz-96 data shown in Figure 2). In particular we use the following settings: , and , while we follow Chorin  Lu (2015) in setting and . The variance term in (23) is set such that . An Euler solver is used to numerically solve the Lorenz-96 equations in (23) using a time step of . After a burn-in period, 510 periods of simulated data from the deep Lorenz-96 model are retained, with the final 75 periods held out and treated as out-of-sample realizations.

Figure 2: A simulation from the deep Lorenz-96 model defined in (23) for 100 time periods. (a) Realizations from the data model in (23) for three locations where these locations are transformations of the slowly varying large-scale locations shown in the second row (i.e., (b)). (b) Realizations for three large-scale locations from the process model defined in (23). (c) Realizations corresponding to small-scale locations for each of the three large-scale locations displayed in the second row. Note, each column displays four (of the twenty) slow varying small-scale locations associated with a particular large-scale location.

Only the large-scale process is treated as observed here, thus both the large and small scale Lorenz-96 variables (i.e., and ) are considered unobserved. Therefore, unlike the soil moisture application, both the input and output of the model are the same process (i.e., ). The input and output are separated by three periods here (i.e., the lead time is set to three periods), in order to make the problem slightly more difficult. The previously discussed embedding lag (i.e., in (4)) is set to the lead time (three periods) for the deep Lorenz-96 D-EESN implementations. The number of embedding lags (i.e., in (4)) is selected with the GA (using cross-validation) to be three. Finishing the specification of the data models in (8) and (16), is defined as , where is a identity matrix, a log-Gaussian distribution is used for the unspecified distribution in (16), and the covariance matrix is set such that , where .

Q-EESN 82.64 3876.45
BQ-EESN 81.51 3875.61
D-EESN (7 L) 76.00 3872.72
BD-EESN (7 L) 79.08 3873.87
Lin. DSTM 100.12 4557.78
Table 3: Results for the deep Lorenz-96 simulation study in terms of mean squared prediction error (MSPE) and continuous ranked probability score (CRPS). Q-EESN refers to the quadratic EESN; BQ-EESN denotes the Bayesian version of the Q-EESN model; D-EESN (7 L) is the D-EESN model with seven layers, and BD-EESN (7 L) denotes the BD-EESN model with seven layers. Note, smaller is better for both MSPE and CRPS.

The out-of-sample validation metrics in Table 3 show the seven-layer D-EESN models to be the best forecast model for the deep Lorenz-96 data. In particular, both of the seven-layer D-EESN models perform better in terms of MSPE and CRPS than the Q-EESN or linear model. Although not shown here, D-EESN models with 2-6 layers all performed better than the Q-EESN model and worse than the seven-layer D-EESN model in terms of MSPE, with the MSPE monotonically decreasing as layers were added. We found that after seven-layers any gain in forecast accuracy was very minimal in comparison to the extra computation required to keep adding layers.

Figure 3: Out-of-sample forecast summary plots for 6 of the 18 large-scale locations from the deep Lorenz-96 simulation study using the seven layer D-EESN model. The black line denotes the true simulated value in each plot (i.e., in (23)), while the red line denotes the forecasted mean. In a given plot, the shaded gray area represents the 95% point-wise prediction intervals (P.I.s) over all ensembles.

The Bayesian version of the seven-layer D-EESN model produces similar MSPE and CRPS values to the non-Bayesian version. Although one would expect the Bayesian model to perform similar in terms of MSPE to the non-Bayesian version, there is potential for the Bayesian version to improve in terms of uncertainty quantification (as shown below for the soil moisture application below). This model’s ability to quantify uncertainty is illustrated in Figure 3, which shows forecast summaries for the first six locations with the seven-layer D-EESN model. Despite the clear nonlinearity in the process, the model correctly forecasts much of the overall quasi-cyclic nature and intensity of the process, while producing uncertainty metrics that cover many of the true values.

4.3 Midwest Soil Moisture Long-Lead Forecasting Application

The previously mentioned soil moisture data comes from the Climate Prediction Center’s (CPC) high resolution monthly global soil moisture data set (Fan  van den Dool (2004), https://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/.CPC/.GM
SM/.w/). As described in Smith . (2008), the driving inputs used to create this derived data consist of global precipitation and temperature data along with an accompanying land model. Spatially, the data domain covers N - N latitude and W - W longitude at a resolution of . The monthly data set begins in January 1948 and goes through December 2017. While the model is trained on monthly data from January 1948 through November 2011, only the May forecasted values from the out-of-sample period covering 2012-2017 are used to evaluate the model, given the importance of May soil moisture for planting corn in the U.S. corn belt. We follow the common practice in the atmospheric science literature of converting data into anomalies by subtracting the respective in-sample monthly means from the data.

Dimension reduction for the soil moisture data is carried out using empirical orthogonal functions (EOFs; i.e., or spatial-temporal principal component analysis) (e.g., see Cressie  Wikle, 2011, Chapter 5). Therefore, is obtained using the first 15 EOFs, which account for of the variation in the soil moisture data (note, the model was not sensitive to small variations in this choice). The data model spatial covariance matrix is calculated using the following formulation from Cressie  Wikle (2011, Equation 7.6): , where and

are the remaining eigenvalues and eigenvectors, respectively, that are not used in the decomposition of

and is a constant (set to ). Since the data are converted into anomalies, a Gaussian distribution is used for the distribution in (16).

The monthly SST data comes from the extended reconstruction sea surface temperature (ERSST) data set (Smith . (2008), http://iridl.ldeo.columbia.edu/
SOURCES/.NOAA/.NCDC/.ERSST) and cover the same temporal period as the soil moisture data. The spatial domain for the SST data is given by S - N latitude and E - W longitude with a resolution of , and covers much of the mid-Pacific ocean. Dimension reduction is once again carried out using EOFs by retaining the first 5 EOFs of the SST data set, which account for almost of the variation in the data (the model was also not overly sensitive to this choice). The embedding lag for the input is set to the lead time of six periods and the number of embedding lags is selected to be three using cross-validation with the GA. Convergence for the MCMC Gibbs algorithm sampler was further assessed using the Gelman-Rubin diagnostic (Gelman  Rubin, 1992) with four chains, which did not suggest any lack of convergence.

Q-EESN 3507.84 239.16 50.60%
BQ-EESN 3463.28 196.59 55.20%
D-EESN (2 L) 3303.64 229.96 60.00%
BD-EESN (2 L) 3296.39 190.45 58.24%
D-EESN (3 L) 3307.51 224.68 60.32
BD-EESN (3 L) 3299.04 189.80 63.54%
Lin. DSTM 3509.63 198.09 50.00%
Climatological 3642.54 - -
Table 4: Validation results for the long-lead soil moisture forecasting application using mean squared prediction error (MSPE), continuous ranked probability score (CRPS), and percentage of skill score values greater than zero (i.e., % SS0 ). The prefix “B” denotes a Bayesian version of the model and and “L” denotes the number of layers in a given model.

Table 4 shows the performance of various models for out-of-sample long-lead forecasting of May soil moisture. Both the Bayesian and non-Bayesian D-EESN models perform the best in terms of MSPE. The D-EESN models also represent the largest improvement over the climatological forecast. Further, both D-EESN models also outperform the single-layered Q-EESN model, suggesting the soil moisture application benefits from a deep framework. Notably, the two and three layer models appear to have similar MSPE values, indicating that two layers is likely sufficient. While the Bayesian and non-Bayesian models perform similarly in terms of forecast accuracy, the Bayesian D-EESN models perform much better in terms of CRPS than the non-Bayesian versions. In particular, the Bayesian version produces CRPS values that are almost lower than the non-Bayesian version. Absent the improvement allowed by the Bayesian implementation, the D-EESN would be much less able to quantify the uncertainty in the long-lead soil moisture forecasts.

Figure 4:

Posterior summaries for the soil moisture application in May 2014 and 2015 using the two-layer BD-EESN model. (a) True soil moisture moisture values for each spatial location. (b) Posterior predictive mean values for each spatial location. (c) Posterior predictive standard deviations for each spatial location. Note, for the sake of visualization, each plot has been standardized by their respective means and standard deviations. Also, note that the scale bars are different for the 2014 and 2015 anomalies to improve forecast contrast. We have removed extreme outliers for the sake of visualization (indicated by grey and black grid squares).

Figure 4 shows the posterior predictive means and standard deviations for 2014 and 2015 over the prediction domain as given by the two-layer BD-EESN model. The model appears to mostly pick up the correct pattern of the soil moisture anomaly signal in the western and mid-central part of the spatial domain for both years, while struggling more with the upper midwest states in 2014. Regarding the critical corn belt, the model suggests an overall large amount of uncertainty in Missouri, especially in 2014, while Iowa appears to have a moderate to low amount of uncertainty for both years. As previously noted, the BD-EESN produces considerably better uncertainty metrics for the soil moisture data than the non-Bayesian D-EESN. The relative difference between these two predictive uncertainties can be seen in Figure 5, where both versions of the two-layer D-EESN model are plotted for 2015. Although the forecast means are very similar in Figure 5, the non-Bayesian D-EESN produces much smaller standard deviations across the spatial domain compared to the Bayesian version. Considering the inherent difficulty in predicting soil moisture six months into the future, the standard deviations for the non-Bayesian model appear unrealistically low. This point is confirmed by the Bayesian model producing a considerably lower CRPS value than the non-Bayesian version.

Figure 5: May 2015 forecast summaries for both the Bayesian and non-Bayesian two-layer D-EESN model. (a) True soil moisture moisture values for each spatial location. (b) Forecasted mean values with a given forecasting method for each spatial location. (c) Forecast standard deviations with a given forecasting method for each spatial location. Note, for the sake of visualization each plot has been standardized by their respective means and standard deviations. We have removed extreme outliers for the sake of visualization (indicated by grey and black grid squares).

Next, the previously defined skill score (SS) in (22) is shown in Figure 6 for the two-layer BD-EESN model, where a climatological forecast is used as the reference model. Values of SS greater than zero represent locations where the BD-EESN model improved upon the climatological forecast, while values less than zero indicate locations where the model did worse than the climatological forecast. Overall, the BD-EESN model outperforms the climatological forecast in the central western part of the domain, and performs worse in the east central and northern part of the domain. In particular, the model does worse in these regions by predicting too little soil moisture in the north central part of the domain, relative to the truth, and over predicts the amount of soil moisture in the east central part of the domain. Critically, across much of the agriculturally important corn belt, including much of Iowa, the BD-EESN model improves upon the climatological forecast.

Figure 6: Skill score (SS) plot for the soil moisture application with the Bayesian two-layer D-EESN model. SS is calculated using (22) from above, where the reference model used here is a climatological forecast. For each forecasting method, the MSPE is calculated by averaging over the out-of-sample periods. Values of SS greater than zero indicate an improvement of the two-layer BD-EESN model over the climatological forecast, while values less than zero indicate locations where the two-layer BD-EESN model performed worse than climatology. We have removed extreme outliers for the sake of visualization (indicated by grey grid squares).

5 Discussion

The models presented here are essentially just regularized spatio-temporal regression models where the inputs (predictors) are stochastically and dynamically transformed. Indeed, the model presented in (16) and (17) might be considered a generalized additive model applied to spatio-temporal data, where the inputs are transformed through a telescoping set of ESN models. But, because of the inherent randomness in the choice of reservoir parameters in the ESNs, we have to do many independent replications of these transformations in order to obtain reproducible results. This is the nature of stochastic transformations. Note that these replications of the deep ESN are given equal weight in (17), but this need not be the case in general. It is also important to note that although the spatio-temporal regression model is not strictly dynamic (in the sense of a spatial field evolving in time), the transformations are inherently dynamic through the ESN recurrence relation. Finally, it is important to note that the multiple levels of transformation allow for different time scales in the predictors – this is the advantage of the deep architecture. How do we know that deep structure is giving multi-time scale predictors? Figure 7 shows the most frequently chosen predictors for the first EOF coefficient in the soil moisture example (). These predictors exhibit different time scales. Although this shows that different time scales are important for the prediction, it is not so clear how to interpret these predictors given the complex nonlinear transformations that generated them in the deep ESN structure. This lack of interpretability is a fundamental problem with many deep models in the machine learning context as well, and is an active area of investigation in statistics and machine learning.

Figure 7: Time series of the six most chosen hidden unit predictors used to predict the first soil moisture EOF coefficient () as chosen via SSVS in the BD-EESN model. The percentage of MCMC iterations that include each predictor is given shown above each series.

Despite the large amount of research into deep models, uncertainty quantification is rarely considered in their application. Given the demonstrated predictive ability of these methods, having the ability to quantify uncertainty with deep models is very powerful and widely applicable. Through the use of an ensemble framework, the deep ESN models for spatio-temporal processes presented here allow for uncertainty quantification. The non-Gaussian Lorenz-96 simulated example showed that the D-EESN improved upon the traditional ESN model, while also producing very robust estimates of the forecast uncertainty. Furthermore, the Bayesian version of the D-EESN model provides a formal framework in which many data types can be considered and multiple levels of uncertainties can be accounted for. The results for the BD-EESN with the soil moisture data illustrated this point by considerably improving upon the uncertainty metrics produced by the D-EESN model. We were also able to show spatially where the deep models improved upon simpler forecast methods, thus giving model developers and resource managers potentially useful information.

As discussed above, there are many more possible choices for the dimension reduction function used with the hidden units from the D-EESN model. One potential choice that has been unexplored in the literature is so-called convolution operators. Deep image classification methods have shown convolution operators to be extremely powerful for slowly learning general (spatial) features (Krizhevsky ., 2012). Its possible that the deep RNN framework could also benefit from such tools, especially in situations where the input has explicit spatial structure.

Long-lead spatio-temporal soil moisture forecasting is inherently a challenging problem. It is not uncommon to treat such difficult forecasting problems as categorial instead of continuous. That is, treating the response as categorical by re-labeling continuous values with qualitative values (e.g., below average, average, above average). Unlike the RNN literature, most of the ESN literature has focused on continuous responses. Given the flexibility of the BD-EESN, a categorical model formulation can be developed and is the subject of future research. Finally, the soil moisture forecasts may also benefit from using other climate indexes or more local variables (such as precipitation) as inputs into the model.


This work was partially supported by the US National Science Foundation (NSF) and the US Census Bureau under NSF grant SES-1132031, funded through the NSF-Census Research Network (NCRN) program, and NSF award DMS-1811745.


  • Antonelo . (2017) antonelo2017echoAntonelo, EA., Camponogara, E.  Foss, B.  2017. Echo State Networks for data-driven downhole pressure estimation in gas-lift oil wells Echo state networks for data-driven downhole pressure estimation in gas-lift oil wells. Neural Networks85106–117.
  • Baran  Lerch (2015) baran2015logBaran, S.  Lerch, S.  2015. Log-normal distribution based Ensemble Model Output Statistics models for probabilistic wind-speed forecasting Log-normal distribution based ensemble model output statistics models for probabilistic wind-speed forecasting. Quarterly Journal of the Royal Meteorological Society1416912289–2299.
  • Barnston . (1999) barnston1999predictiveBarnston, AG., He, Y.  Glantz, MH.  1999. Predictive skill of statistical and dynamical climate models in SST forecasts during the 1997-98 El Niño episode and the 1998 La Niña onset Predictive skill of statistical and dynamical climate models in sst forecasts during the 1997-98 el niño episode and the 1998 la niña onset. Bulletin of the American Meteorological Society802217–243.
  • Belkin  Niyogi (2001) laplacianEigenBelkin, M.  Niyogi, P.  2001. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering Laplacian eigenmaps and spectral techniques for embedding and clustering. MIPS14.
  • Berliner . (2000) berliner2000longBerliner, LM., Wikle, CK.  Cressie, N.  2000. Long-lead prediction of Pacific SSTs via Bayesian dynamic modeling Long-lead prediction of pacific ssts via bayesian dynamic modeling. Journal of climate13223953–3968.
  • Blackmer . (1989) blackmer1989correlationsBlackmer, A., Pottker, D., Cerrato, M.  Webb, J.  1989. Correlations between soil nitrate concentrations in late spring and corn yields in Iowa Correlations between soil nitrate concentrations in late spring and corn yields in iowa. Journal of Production Agriculture22103–109.
  • Carleton . (2008) carleton2008synopticCarleton, AM., Arnold, DL., Travis, DJ., Curran, S.  Adegoke, JO.  2008. Synoptic circulation and land surface influences on convection in the Midwest US “Corn Belt” during the summers of 1999 and 2000. Part I: Composite synoptic environments Synoptic circulation and land surface influences on convection in the midwest us “corn belt” during the summers of 1999 and 2000. part i: Composite synoptic environments. Journal of Climate21143389–3415.
  • Chatzis (2015) chatzis2015sparseChatzis, SP.  2015. Sparse Bayesian Recurrent Neural Networks Sparse bayesian recurrent neural networks. Joint European Conference on Machine Learning and Knowledge Discovery in Databases Joint european conference on machine learning and knowledge discovery in databases ( 359–372).
  • Chorin  Lu (2015) chorin2015discreteChorin, AJ.  Lu, F.  2015. Discrete approach to stochastic parametrization and dimension reduction in nonlinear dynamics Discrete approach to stochastic parametrization and dimension reduction in nonlinear dynamics. Proceedings of the National Academy of Sciences112329804–9809.
  • Cressie  Wikle (2011) CandW2011Cressie, N.  Wikle, C.  2011. Statistics for Spatio-Temporal Data Statistics for spatio-temporal data. New YorkJohn Wiley & Sons.
  • Dixon . (2017) dixon2017deepDixon, MF., Polson, NG.  Sokolov, VO.  2017. Deep Learning for Spatio-Temporal Modeling: Dynamic Traffic Flows and High Frequency Trading Deep learning for spatio-temporal modeling: Dynamic traffic flows and high frequency trading. arXiv preprint arXiv:1705.09851.
  • Drosdowsky (1994) drosdowsky1994analogDrosdowsky, W.  1994. Analog (nonlinear) forecasts of the Southern Oscillation index time series Analog (nonlinear) forecasts of the southern oscillation index time series. Weather and forecasting9178–84.
  • Fan  van den Dool (2004) fan2004climateFan, Y.  van den Dool, H.  2004. Climate Prediction Center global monthly soil moisture data set at 0.5 resolution for 1948 to present Climate prediction center global monthly soil moisture data set at 0.5 resolution for 1948 to present. Journal of Geophysical Research: Atmospheres109.
  • Fischer . (2007) fischer2007soilFischer, EM., Seneviratne, SI., Vidale, PL., Lüthi, D.  Schär, C.  2007. Soil moisture–atmosphere interactions during the 2003 European summer heat wave Soil moisture–atmosphere interactions during the 2003 european summer heat wave. Journal of Climate20205081–5099.
  • Friedman . (2001) friedman2001elementsFriedman, J., Hastie, T.  Tibshirani, R.  2001. The elements of statistical learning The elements of statistical learning. Springer series in statistics New York.
  • Gelman  Rubin (1992) gelman1992inferenceGelman, A.  Rubin, DB.  1992. Inference from iterative simulation using multiple sequences Inference from iterative simulation using multiple sequences. Statistical science457–472.
  • Genest  Rémillard (2008) genest2008validityGenest, C.  Rémillard, B.  2008. Validity of the parametric bootstrap for goodness-of-fit testing in semiparametric models Validity of the parametric bootstrap for goodness-of-fit testing in semiparametric models. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques Annales de l’institut henri poincaré, probabilités et statistiques ( 44).
  • George  McCulloch (1997) george1997approachesGeorge, EI.  McCulloch, RE.  1997. Approaches for Bayesian variable selection Approaches for bayesian variable selection. Statistica sinica339–373.
  • Gladish  Wikle (2014) gladish2014physicallyGladish, DW.  Wikle, CK.  2014. Physically motivated scale interaction parameterization in reduced rank quadratic nonlinear dynamic spatio-temporal models Physically motivated scale interaction parameterization in reduced rank quadratic nonlinear dynamic spatio-temporal models. Environmetrics254230–244.
  • Gneiting  Katzfuss (2014) gneiting2014probabilisticGneiting, T.  Katzfuss, M.  2014. Probabilistic forecasting Probabilistic forecasting. Annual Review of Statistics and Its Application1125–151.
  • Gneiting . (2005) gneiting2005calibratedGneiting, T., Raftery, AE., Westveld III, AH.  Goldman, T.  2005. Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation Calibrated probabilistic forecasting using ensemble model output statistics and minimum crps estimation. Monthly Weather Review13351098–1118.
  • Grooms  Lee (2015) grooms2015frameworkGrooms, I.  Lee, Y.  2015. A framework for variational data assimilation with superparameterization A framework for variational data assimilation with superparameterization. Nonlinear Processes in Geophysics225.
  • Hooten  Wikle (2010) hooten2010statisticalHooten, MB.  Wikle, CK.  2010. Statistical agent-based models for discrete spatio-temporal systems Statistical agent-based models for discrete spatio-temporal systems. Journal of the American Statistical Association105489236–248.
  • Hopfield (1982) hopfield1982neuralHopfield, JJ.  1982. Neural networks and physical systems with emergent collective computational abilities Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences7982554–2558.
  • Jaeger (2001) jaeger2001echoJaeger, H.  2001. The “echo state” approach to analysing and training recurrent neural networks-with an erratum note The “echo state” approach to analysing and training recurrent neural networks-with an erratum note. Bonn, Germany: German National Research Center for Information Technology GMD Technical Report14834.
  • Jaeger (2007) jaeger2007discoveringJaeger, H.  2007. Discovering multiscale dynamical features with hierarchical echo state networks Discovering multiscale dynamical features with hierarchical echo state networks . Jacobs University Bremen.
  • Jan van Oldenborgh . (2005) jan2005didJan van Oldenborgh, G., Balmaseda, MA., Ferranti, L., Stockdale, TN.  Anderson, DL.  2005. Did the ECMWF seasonal forecast model outperform statistical ENSO forecast models over the last 15 years? Did the ecmwf seasonal forecast model outperform statistical enso forecast models over the last 15 years? Journal of climate18163240–3249.
  • Knaff  Landsea (1997) knaff1997ninoKnaff, JA.  Landsea, CW.  1997. An El Niño–Southern Oscillation climatology and persistence (CLIPER) forecasting scheme An el niño–southern oscillation climatology and persistence (cliper) forecasting scheme. Weather and forecasting123633–652.
  • Kondrashov . (2005) kondrashov2005hierarchyKondrashov, D., Kravtsov, S., Robertson, AW.  Ghil, M.  2005. A hierarchy of data-based ENSO models A hierarchy of data-based enso models. Journal of climate18214425–4444.
  • Kravtsov . (2005) kravtsov2005multilevelKravtsov, S., Kondrashov, D.  Ghil, M.  2005. Multilevel regression modeling of nonlinear processes: Derivation and applications to climatic variability Multilevel regression modeling of nonlinear processes: Derivation and applications to climatic variability. Journal of Climate18214404–4424.
  • Krizhevsky . (2012) krizhevsky2012imagenetKrizhevsky, A., Sutskever, I.  Hinton, GE.  2012. Imagenet classification with deep convolutional neural networks Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems Advances in neural information processing systems ( 1097–1105).
  • Li . (2012) li2012chaoticLi, D., Han, M.  Wang, J.  2012. Chaotic time series prediction based on a novel robust echo state network Chaotic time series prediction based on a novel robust echo state network. IEEE Transactions on Neural Networks and Learning Systems235787–799.
  • Lorenz (1963) lorenz1963deterministicLorenz, EN.  1963. Deterministic nonperiodic flow Deterministic nonperiodic flow. Journal of the atmospheric sciences202130–141.
  • Lorenz (1996) lorenz1996predictabilityLorenz, EN.  1996. Predictability: A problem partly solved Predictability: A problem partly solved. Proc. Seminar on predictability Proc. seminar on predictability ( 1).
  • Lukoševičius (2012) lukovsevivcius2012practicalLukoševičius, M.  2012. A practical guide to applying echo state networks A practical guide to applying echo state networks. Neural networks: tricks of the trade Neural networks: tricks of the trade ( 659–686). Springer.
  • Lukoševičius  Jaeger (2009) lukovsevivcius2009reservoirLukoševičius, M.  Jaeger, H.  2009. Reservoir computing approaches to recurrent neural network training Reservoir computing approaches to recurrent neural network training. Computer Science Review33127–149.
  • Ma . (2017) ma2017deepMa, Q., Shen, L.  Cottrell, GW.  2017. Deep-ESN: A Multiple Projection-encoding Hierarchical Reservoir Computing Framework Deep-esn: A multiple projection-encoding hierarchical reservoir computing framework. arXiv preprint arXiv:1711.05255.
  • McDermott  Wikle (2016) mcdermott2016modelMcDermott, PL.  Wikle, CK.  2016. A model-based approach for analog spatio-temporal dynamic forecasting A model-based approach for analog spatio-temporal dynamic forecasting. Environmetrics27270–82.
  • McDermott  Wikle (20171) mcdermott2017bayesianMcDermott, PL.  Wikle, CK.  20171. Bayesian Recurrent Neural Network Models for Forecasting and Quantifying Uncertainty in Spatial-Temporal Data Bayesian recurrent neural network models for forecasting and quantifying uncertainty in spatial-temporal data. arXiv preprint arXiv:1711.00636.
  • McDermott  Wikle (20172) mcdermott2017ensembleMcDermott, PL.  Wikle, CK.  20172. An Ensemble Quadratic Echo State Network for Nonlinear Spatio-Temporal Forecasting An ensemble quadratic echo state network for nonlinear spatio-temporal forecasting. STAT6315-330.
  • Penland  Magorian (1993) penland1993predictionPenland, C.  Magorian, T.  1993. Prediction of Nino 3 sea surface temperatures using linear inverse modeling Prediction of nino 3 sea surface temperatures using linear inverse modeling. Journal of Climate661067–1076.
  • Philander (1990) philander1990ninoPhilander, S.  1990. El Niño, La Niña, and the Southern Oscillation El niño, la niña, and the southern oscillation. Academic Press.
  • Richardson (2017) richardson2017sparsityRichardson, RA.  2017. Sparsity in nonlinear dynamic spatiotemporal models using implied advection Sparsity in nonlinear dynamic spatiotemporal models using implied advection. Environmetrics.
  • Sheffield . (2004) sheffield2004simulatedSheffield, J., Goteti, G., Wen, F.  Wood, EF.  2004. A simulated soil moisture based drought analysis for the United States A simulated soil moisture based drought analysis for the united states. Journal of Geophysical Research: Atmospheres109.
  • Sheng . (2013) sheng2013predictionSheng, C., Zhao, J., Wang, W.  Leung, H.  2013. Prediction intervals for a noisy nonlinear time series based on a bootstrapping reservoir computing network ensemble Prediction intervals for a noisy nonlinear time series based on a bootstrapping reservoir computing network ensemble. IEEE Transactions on neural networks and learning systems2471036–1048.
  • Sivanandam  Deepa (2007) sivanandam2007introductionSivanandam, S.  Deepa, S.  2007. Introduction to genetic algorithms Introduction to genetic algorithms. Springer Science & Business Media.
  • Smith . (2008) smith2008improvementsSmith, TM., Reynolds, RW., Peterson, TC.  Lawrimore, J.  2008. Improvements to NOAA’s historical merged land–ocean surface temperature analysis (1880–2006) Improvements to noaa’s historical merged land–ocean surface temperature analysis (1880–2006). Journal of Climate21102283–2296.
  • Stern  Davidson (2015) stern2015trendsStern, H.  Davidson, NE.  2015. Trends in the skill of weather prediction at lead times of 1–14 days Trends in the skill of weather prediction at lead times of 1–14 days. Quarterly Journal of the Royal Meteorological Society1416922726–2736.
  • Takens (1981) takens1981detectingTakens, F.  1981. Detecting strange attractors in turbulence Detecting strange attractors in turbulence. Lecture notes in mathematics8981366–381.
  • Tang . (2000) tang2000skillTang, B., Hsieh, WW., Monahan, AH.  Tangang, FT.  2000. Skill comparisons between neural networks and canonical correlation analysis in predicting the equatorial Pacific sea surface temperatures Skill comparisons between neural networks and canonical correlation analysis in predicting the equatorial pacific sea surface temperatures. Journal of Climate131287–293.
  • Timmermann . (2001) timmermann2001empiricalTimmermann, A., Voss, H.  Pasmanter, R.  2001. Empirical dynamical system modeling of ENSO using nonlinear inverse techniques Empirical dynamical system modeling of enso using nonlinear inverse techniques. Journal of physical oceanography3161579–1598.
  • Triefenbach . (2013) triefenbach2013acousticTriefenbach, F., Jalalvand, A., Demuynck, K.  Martens, JP.  2013. Acoustic modeling with hierarchical reservoirs Acoustic modeling with hierarchical reservoirs. IEEE Transactions on Audio, Speech, and Language Processing21112439–2450.
  • Van den Dool . (2003) van2003performanceVan den Dool, H., Huang, J.  Fan, Y.  2003. Performance and analysis of the constructed analogue method applied to US soil moisture over 1981–2001 Performance and analysis of the constructed analogue method applied to us soil moisture over 1981–2001. Journal of Geophysical Research: Atmospheres108D16.
  • Wikle (2015) wikle2015modernWikle, CK.  2015. Modern perspectives on statistics for spatio-temporal data Modern perspectives on statistics for spatio-temporal data. Wiley Interdisciplinary Reviews: Computational Statistics7186–98. 10.1002/wics.1341
  • Wikle  Hooten (2010) wikle2010generalWikle, CK.  Hooten, MB.  2010. A general science-based framework for dynamical spatio-temporal models A general science-based framework for dynamical spatio-temporal models. Test193417–451.
  • D. Wilks (2001) wilks2001skillWilks, D.  2001. A skill score based on economic value for probability forecasts A skill score based on economic value for probability forecasts. Meteorological Applications82209–219.
  • DS. Wilks (2005) wilks2005effectsWilks, DS.  2005. Effects of stochastic parametrizations in the Lorenz’96 system Effects of stochastic parametrizations in the lorenz’96 system. Quarterly Journal of the Royal Meteorological Society131606389–407.
  • Zhao  Giannakis (2016) zhao2016analogZhao, Z.  Giannakis, D.  2016. Analog forecasting with dynamics-adapted kernels Analog forecasting with dynamics-adapted kernels. Nonlinearity299.