1 Introduction
1.1 Background
Recent international efforts to mitigate greenhouse gas emissions, such as the Paris agreement (UNFCCC, 2015) whose goal is to hold global warming below 2 degrees Celsius, reflect the urgent need for seeking renewable and clean energy resources. The integration of renewable energies such a wind power into power systems has become a common goal for numerous countries, and it has been developing on a large scale around the world in the last decade. The World Wind Energy Association (WWEA) in its 2017 halfyear statistics report (WWEA, 2017), claims that the world wind market has reached over 511 gigawatts (GW), representing an annual growth rate of 5.1%. This growth rate is expected to increase considerably as a consequence of the Paris agreement (Moody’s, 2016).
Wind power generation is sustainable and it is emissionfree. Moreover, the cost of a kilowatt (kW) of windpowered electricity is nearly the same as that of coal or nuclear energy (Hering and Genton, 2010). The advantages of wind energy are counterbalanced by several challenges, such as high variability, limited dispatchability, and nonstorability. Wind power production data observed at fine temporal resolutions exhibit successive periods with fluctuations of various dynamic nature and magnitude (Pinson and Madsen, 2012). This high variability makes the wind power a nonsteady, nonconstant source of energy. Moreover, unlike traditional sources of energy, wind power is not fully dispatchable. Wind farms cannot increase their power generation upon request when there is not sufficient wind; they can only reduce the output. Furthermore, wind cannot be stored for future power generation, and windpowered energy should be used as it becomes available.
For wind power producers participating in a liberalized electricity market, the natural fluctuation of the wind resource translates into penalties related to regulation costs. Specifically, in these markets, wind energy producers have to propose bids in advance and are then charged for any imbalance between the actual production and the energy bid (Pinson et al., 2007). Making accurate predictions of the future wind power is therefore paramount to reduce the variability and risk faced by wind energy producers.
Wind power forecasts can be made directly if power data are available (Lenzi et al., 2017; Trombe and Pinson, 2012). However, wind speed forecasting can be more precise than wind power forecasting due to the spatial correlation of wind (Zhu and Genton, 2012). In this paper, we focus on wind speed probabilistic forecasting based on turbines installed at the border between Oregon and Washington, along the Columbia River (see Figure 1
). Wind power forecasts can be easily obtained from the forecasted wind speeds using an estimate of the deterministic cubic power curve typically provided by the turbine manufacturer (see, e.g., Figure 5 in
Hering and Genton, 2010).Over the last decades, statistical models have shown to be very effective in capturing the fluctuating characteristics of wind speed to produce accurate shortterm wind speed forecasts. Purely temporal models are built assuming that wind speed at each time point is partially predicted by the wind speed in its near past. In this context, different time series models have been proposed for shortterm forecasts, such as autoregressive models
(Huang and Chalabi, 1995), autoregressive moving average models (Erdem and Shi, 2011), and autoregressive integrated moving average models (PalomaresSalas et al., 2009). Nevertheless, models that incorporate both temporal and spatial correlations have been found to increase accuracy over conventional time series models. Spatiotemporal models are built under the idea that wind speed characteristics of a region resemble those of neighboring regions. In this framework, the class of regimebased forecasting models describes wind speed conditional on predominant wind patterns in a given region, incorporating offsite information from nearby areas. In this class, the truncated normal regimeswitching (TNRS) regression models assume that wind speed follows a truncated normal distribution with regimedependent mean and variance; see
Gneiting et al. (2006), Hering and Genton (2010), and Kazor and Hering (2015).1.2 A threestage hierarchical Bayesian modeling approach
Our main goal is to develop a flexible model that comprises insite and offsite information to help us forecast future wind speeds at the turbine locations, with a particular focus on extreme wind speeds. The statistical modeling of extremes is challenging, as it usually focuses on the estimation of high (or low) quantiles, where the available data are limited. We here propose two flexible models for both nonextreme (in the bulk) and extreme (in the upper tail) wind speeds, designed to provide forecasts beyond observed wind speed values. Our threestage hierarchical Bayesian approach is composed of (i) a data layer with suitable distributions for extreme and nonextreme wind speeds, (ii) a latent process that drives the spacetime trends and dependence, and (iii) prior distributions. We propose and compare two different models for the latent process. Our models belong to the class of latent Gaussian models, which can be efficiently fitted using the integrated nested Laplace approximation (INLA;
Rue et al., 2009) conveniently implemented in the RINLA package.The data consist of hourly measurements of wind speed and wind direction at 20 towers along the border between Oregon and Washington, and are publicly available at http://transmission.bpa.gov/Business/Operations/Wind/MetData.aspx. Figure 1 maps the location of the 20 towers labeled by acronym. Figure 7 in the Supplementary Material shows wind roses for all the 20 stations, illustrating the different wind regimes among stations. After basic cleaning and preprocessing, we find that each station encompasses between and hourly measurements of nonzero wind speed from January 2012 to December 2014. Figure 5 in the Supplementary Material displays time series plots exhibiting seasonality patterns at almost every location. Autocorrelation plots shown in Figure 7 of the Supplementary Material confirm the presence of seasonal patterns at most of the stations and show that the data are highly autocorrelated, as expected. Figure 2 shows an example of the persistence of high wind speeds. Persistence is the variable’s tendency to maintain its current state and is one of the most important characteristics of climate variables.
The remainder of the article is organized as follows. In Section 2 we develop our modeling strategy, while in Section 3 we explain our Bayesian estimation approach using INLA. Section 4 presents the results of our modeling approaches in terms of wind speed forecasting. Conclusions are given in Section 5.
2 Modeling wind speeds using latent Gaussian models
2.1 Extreme wind speeds
High wind speed events are of particular interest for risk assessment. Therefore, while we aim to produce accurate probabilistic forecasts for all possible values of wind speeds, we want to pay particular attention to the upper tail of the distribution, and so we need a model that is able not only to forecast future means and variances but also resilient to extrapolate beyond observed data. ExtremeValue Theory (EVT) offers a rigorous mathematical framework that justifies such an extrapolation; see Davison and Huser (2015) for a review.
In the context of windpowered energy, a succession of large wind speed values over a period of time may pose a considerable risk to the wind turbines. Therefore a suitable representation for extremes is given by the socalled wind speed exceedances, i.e., wind speeds that exceed a certain high threshold
. A description of the stochastic behavior of these extreme events is given by the conditional probability
, where has distribution . Under broad conditions (Davison and Smith, 1990), it can be shown that as becomes large, the following approximation holds:(2.1) 
where is the generalized Pareto (GP) distribution with scale parameter , shape parameter , and . The shape parameter determines the weight of the upper tail: heavy tails are obtained with , light tails with , and bounded tails with . For sufficiently large we have that
(2.2) 
with . Thus, the GP distribution characterizes the intensity of wind speed exceedances, whereas controls the frequency of wind speed exceedances.
2.2 A threestage model for wind speeds
The statistical modeling of univariate extreme values has been in development since the 1970s. A key component for practical applications has been the development of methodology to account for nonstationarity in the distribution of interest, as advocated by Davison and Smith (1990). Typical approaches to this problem are based on the generalized linear modeling framework, allowing the parameters of a marginal distribution to depend on covariates. Hierarchical Bayesian models for threshold exceedances were introduced by Casson and Coles (1999), Cooley et al. (2007), and Opitz et al. (2018). While Casson and Coles (1999) and Cooley et al. (2007)
use expensive Markov chain Monte Carlo (MCMC) methods to fit their spatial models,
Opitz et al. (2018)perform approximate Bayesian inference using INLA.
Our approach is similar to the models of Cooley et al. (2007) and Opitz et al. (2018), but it differs in spirit and goals: while Cooley et al. (2007)
aimed at spatially interpolating large return levels, and
Opitz et al. (2018)’s objective was to accurately estimate the 99.8% precipitation quantile over time at some observed and unobserved locations, our goal is to produce accurate shortterm wind speed forecasts at the turbine locations. Here, following Opitz et al. (2018), we overcome the computational costs and tricky convergence checks of MCMC methods by using a fast and accurate inference method based on INLA; see Section 3. We now detail our threestage modeling and estimation strategy for wind speed forecasting.Stage 1.
As wind speeds are nonnegative and usually rightskewed, distributions such as the Gamma, Weibull, and truncated normal are commonly chosen to fit the bulk of the wind speed distribution; see, e.g.
Hering and Genton (2010), Kazor and Hering (2015). Here we assume that positive nonextreme wind speeds at location and time ,, can be characterized by a Gamma distribution parametrized in terms of an
quantile and a precision parameter , assumed to be constant in time:where is the quantile function of the Gamma distribution with unit rate and shape , evaluated at the probability .
Stage 2. In Stage 1, the probability of exceeding the high threshold is fixed over positive wind speeds. Here we reestimate this probability to account for the few zero values that were previously excluded. We model exceedance indicators
using the Bernoulli distribution,
, where , and has been previously estimated in Stage 1.Stage 3. Since the tail of the Gamma distribution decays exponentially fast, probabilities associated with extreme events might be underestimated. To correct the probabilities associated with tail events, we use a GP distribution; recall Section 2.1. Threshold exceedances defined as , are characterized by a GP distribution as in (2.1), but parametrized in terms of a shape parameter (constant in time) and a specific timevarying quantile . With this parametrization, the GP distribution is
Note that the goal of the Gamma is twofold: to describe the distribution of nonextreme wind speeds and to obtain a suitable timevarying threshold
to define wind speed exceedances. The second and third stages model the frequency and intensity of extremes, respectively, and are connected to the first stage through the timevarying threshold. The shape parameters for the Gamma and GP likelihoods, treated as hyperparameters, are fixed in time for the model presented in Section
2.3.2, and fixed in space and time (i.e., , ) for the model presented in Section 2.3.3.2.3 Latent Gaussian models and prior specification
2.3.1 Generalities
Latent Gaussian models are a broad and flexible class of models, ranging from generalized linear models to spatial and spatiotemporal models (Rue et al., 2009). This class of models can be specified using a hierarchical model formulation, where the observations (which here represent wind speeds at any location in , where is the study region shown in Figure 1, or exceedance indicators, or threshold exceedances) are assumed to be conditionally independent given a latent Gaussian random field and hyperparameters ,
(2.3) 
The latent Gaussian random field , controlled by the hyperparameters , describes the underlying dependence structure of the data, and its specification in (2.3.1) is key to generate a flexible and versatile class of models. Here, and are the mean and precision matrix of , respectively. We assume that only depends on a linear predictor which has an additive structure with respect to some fixed covariates and random effects, i.e.,
(2.4) 
Here, is the overall intercept, are known covariates with coefficients , and are specific Gaussian processes defined in terms of a set of covariates . If we further assume that
has a Gaussian prior, then the joint distribution of
is Gaussian, and yields the latent field in the hierarchical formulation (2.3.1); see e.g., Rue et al. (2017).To construct a suitable linear predictor for our wind speed forecasting problem, we must take into account the properties detailed in Section 1, i.e., spacetime dependence, seasonality, and persistence. In the following sections, we propose two different models for the linear predictors to capture these characteristics.
2.3.2 Temporal model with offsite information
For the modeling of wind speeds, we use the latent Gaussian modeling framework of Section 2.3.1, and we propose using the following linear predictor, for each fixed location :
(2.5) 
where for each , is the lagged time series of wind speeds at the th neighbor of , and is the set of neighbors of of cardinality . The coefficients (fixed effects) quantify the effect that wind speeds at the th neighbor observed at time lag one have on the response. The random effect corresponds to a zeromean autoregressive Gaussian process of first order, i.e.,
where is a precision parameter. The random effect captures the hourly variation of wind speeds, and is described by a cyclic Gaussian random walk of second order with precision , defined over the 24 hours within a day (Rue and Held, 2005, Ch. 3). Let denote the hour associated to time , then:
Owing to the natural spatiotemporal variability of wind speeds, spatial information is here included in the linear predictor (2.5) in the form of offsite predictors. Although traditional statistical models for wind speed and wind power forecasting consist of onsite observations and/or meteorological forecasts (see e.g., Trombe et al., 2012; Pinson and Madsen, 2012), other studies have explored the potential of offsite observations as new predictors (Gneiting et al., 2006; Zhu and Genton, 2012; Kazor and Hering, 2015). Specifically, lagged wind speeds at nearby stations can be informative of upcoming changes in the station of interest (Trombe and Pinson, 2012). More details on the selection of offsite predictors are given in Section 4.1.
The two random effects in the linear predictor (2.5) account for unobserved heterogeneity within each station. The random effect captures the autocorrelation structure of order one within each time series of wind speeds. Our experiments show that including greater lags do not improve the fit or predictive power of our model. Moreover, the random effect along with the lagged covariates account for temporal dependence and persistence. The random effect represents the variation of wind speeds within a day. Exploratory analysis (not shown) on the hourly behavior of wind speeds shows a daily pattern that varies between stations. The random effect accounts for this diurnal and volatile nature of wind speeds in a less restrictive way than the harmonic terms in the trigonometric direction diurnal model of Hering and Genton (2010), or the regime switching spacetime diurnal model of Gneiting et al. (2006). In both random effects, the precision hyperparameters , , control the strength of dependence among neighboring covariate classes.
The first two layers of our hierarchical model (see representation (2.3.1)) are completed by linking each one of the stages detailed in Section 2.2 to the linear predictor in (2.5) as follows:
Stage 1:  
Stage 2:  
Stage 3:  (2.6) 
The third and final layer of our hierarchical model is given by the prior distributions for the hyperparameters, namely and . When little expert knowledge is available, a common practice is to assume noninformative priors. As an alternative, informative priors can be proposed using Penalized Complexity (PC) priors (Simpson et al., 2017). In this framework, model components (in our case, and ) are considered to be flexible extensions of simpler and less flexible base models. Priors for the model components parameters are then developed in such a way that they shrink the components towards their base models. Simpson et al. (2017)
propose to use the KullbackLeibler divergence to measure the squared distance from the base model to its flexible extension, and to penalize this distance at a constant rate. For our wind speed forecasting problem, we assume fairly informative PC priors for the correlation hyperparameter of the AR(1) process, and the precision hyperparameter of the random walk of order two. Specifically,
and , wheredenotes the standard deviation of the temporally aggregated wind speeds. A diffuse prior is assumed for
, the precision hyperparameter of the AR(1). The prior distribution for the Gamma shape is slightly informative and is defined through a Gamma distribution with shape 10 and rate 1, which gives a high probability to values between 5 and 15. Finally, a strong PC prior is assumed for the shape parameter of the GP distribution; since large values of the shape parameter are usually unrealistic, we here assume that .2.3.3 Spacetime model using stochastic partial differential equations
The expression for the linear predictor in (2.5) corresponds to a temporal model that introduces spatial information using lagged offsite predictors. We also explore a proper spacetime model, which explicitly accounts for spatial dependence amongst wind speeds at different stations. Specifically, we assume that the spacetime dependence between wind speeds at different wind towers can be described by a spatiotemporal term that varies in time according to a firstorder autoregressive structure. Specifically, we assume that
where , and is a zeromean, temporally independent Gaussian field, that is completely determined by its covariance function
where is the stationary Mátern correlation function (see, e.g., Handcock and Stein, 1993). This gives rise to our second linear predictor:
(2.7) 
where is an intercept, is the cyclic random effect described in Section 2.3.2, capturing the subdaily variations that are common to all the stations, and is its precision parameter.
The latent process in (2.7
) has dense covariance and (possibly) precision matrices, which implies that any attempt to make Bayesian inference can be computationally demanding. To avoid this issue, we use the Stochastic Partial Differential Equations (SPDE) approach introduced by
Lindgren et al. (2011). The SPDE approach consists of constructing a continuous approximation to the Gaussian field by using a continuous SPDE model defined on the entire study area. It can be shown that this SPDE has a Gaussian field with a Matérn covariance function as stationary solution (Whittle, 1954, 1963). Under certain conditions (see, e.g., Bakka et al., 2018), the continuous SPDE solution has the Markovian property. This property produces sparse precision matrices that can be easily factorized, and are the focus of the INLA methodology. An approximate discrete solution of the SPDE in a bounded domain, defined in our case by the location of the towers, is obtained using a finite element method, that allows for flexible boundaries and different levels of accuracy for the discretization; see Section A.3 of the Supplementary Material for details on the mesh that is used to discretize the study region. For more details on spatial modeling using the SPDE approach, see the recent review by Bakka et al. (2018).Since the Cascade Mountains illustrated in Figure 1 naturally divide the study region in two subregions, we consider two submodels for , one for stations to the East and another for stations to the West of the mountains. We fit each set of stations separately but using the same latent structure specified in (2.7).
The link between this new linear predictor and the model likelihoods is the same as in (2.3.2). The Gaussian field has two parameters, namely the marginal variance and the range of dependence (the smoothness parameter is kept fixed and equal to 2). PC priors on these parameters are chosen such that the variance is shrinked towards zero, whereas the range is shrinked towards infinity (Fuglstad et al., 2018). Specifically, we set and , where is the standard deviation of the temporally aggregated wind speeds, and is the median of the distance between stations. For stations to the East of the Cascade Mountains, km, and for stations to the West, km. A PC prior is also chosen for the correlation coefficient of the autoregressive term in , specifically . Prior for the precision of is the same as before. Priors for the shape parameters of the Gamma and GP distributions, which are now fixed over space and time (i.e., , ), are also the same as before; see Section 2.3.2.
To easily distinguish between our threestage model using the first or second linear predictor, in the following we will refer to the model using the first linear predictor, , as the temporal model with offsite predictors, whereas the model using the second linear predictor, , will be referred to as the SPDE model.
3 Inference based on INLA
Here we describe the form of the joint posterior distribution for each of the three stages detailed in Section 2.2. In the following, we remove the spatial component from the models for ease of notation. Let y
denote the vector of observations for any of the three stages detailed above, with associated hyperparameters
. As above, let be the latent Gaussian random field, , and . Then, the joint posterior distribution of parameters and hyperparameters at any location , for any of the three stages, can be written as(3.1) 
where and are the mean and precision matrix of , respectively, and for Stages 1 and 2, and is equal to the number of exceedances in Stage 3. The main objectives of the statistical inference are to extract from (3) the marginal posterior distributions for each of the elements of the linear predictor vector, and for each hyperparameter, i.e.,
(3.2) 
from which predictive distributions may be derived.
In a Bayesian framework, model estimation is typically performed using simulationbased techniques, such as MCMC methods. Alternatively, approximate methods can be used to cope with the computation of high dimensional integrals needed to obtain posterior distributions. One of such approaches that has become increasingly popular in the last decade is the integrated nested Laplace approximation (INLA; Rue et al., 2009), where posterior distributions of interest are numerically approximated using the Laplace approximation, therefore avoiding the usually complex updating schemes, long running times, and diagnostic convergence checks of simulationbased MCMC. INLA is designed for latent Gaussian models and therefore, it can be successfully used in a wide variety of applications (see for instance Riebler et al., 2012; Lombardo et al., 2018; Krainski et al., 2019). The RINLA package (Bivand et al., 2015) is a convenient interface to the INLA methodology. A wide variety of models are already implemented, and the package is continuously maintained and updated. In particular, our work has motivated the INLA implementation of the Gamma and Weibull quantile regressions, both now available to the users. Details regarding the numerical approximations to (3.2) are given by INLA are provided in Section A.1 of the Supplementary Material.
4 Wind speed probabilistic forecasting results
4.1 Automatic offsite predictor selection based on wind direction
An important step to fit our temporal model with offsite predictors at each station , is to select a suitable set of offsite predictors . There are different approaches for integrating offsite predictors into forecasting models (Larson and Westrick, 2006; Damousis et al., 2004). Here, we develop a datadriven approach for identifying dominant wind directions, which we then use subsequently to automatically choose the offsite predictors. Specifically, for each station , we fit a mixture of von Mises circular distributions to the time series of wind directions, . The von Mises density with location parameter and concentration parameter is given by , where is the modified Bessel function of order 0. For a mixture of von Mises distributions, we identify the dominant wind directions with the locations parameters , , and construct the sets of angles , with . Then, the set of offsite predictors for the station corresponds to all the stations whose angle with lies in , such that their distance to is less than some maximum distance . We selected the number of components for the mixture of von Mises distributions via the Bayesian Information Criterion (BIC), also guided by the wind roses displayed in Figure 6 of the Supplementary Material. The angle depends on the geographical conditions and the distance between stations. We choose for all stations except Mt. Hebo (HEB) where . The maximum distance between stations is km, which corresponds to the mean distance among all stations. To illustrate our approach, Figure 3 displays the fitted dominant wind directions for Biddle Buttle and Roosevelt stations (see Figure 8 of the Supplementary Material for the fits at all stations). The number of offsite predictors for all the stations varies between 2 and 10, while the distance between offsite predictors and ranges between 13.3km and 175.3km, for all .
4.2 Posterior predictive distributions
Here we briefly explain how to obtain posterior predictive distributions for the 1hour, 2hour, and 3hour ahead probabilistic forecasts of hourly wind speeds, produced by our threestage hierarchical Bayesian model, using the linear predictors introduced in Section
2.3. For the SPDE model, we use a rolling training period of length 5 days, whereas for the temporal model, we multiply this period by the number of stations in each side of the Cascade Mountains, as a way to balance the effective sample sizes of the SPDE and the temporal models. We generate samples from the posterior predictive distribution, for each station, each forecasting time horizon, and each linear predictor, as follows: we extract the fitted linear predictors and hyperparameters for each stage, and use (2.3.2) to obtain samples for the Gamma, Bernoulli, and GP predictive distributions. We replace Gamma samples by threshold exceedances whenever the threshold is exceeded, i.e., whenever the associated Bernoulli sample is equal to 1. In other words, the tail of the Gamma distribution is corrected by the GP distribution in the presence of exceedances.4.3 Forecast evaluation for extreme and nonextreme wind speeds
Here we consider performance measures that describe the ability of our threestage model to forecast extreme and nonextreme wind speeds. For nonextreme wind speeds, we consider the continuous ranked probability score (CRPS) introduced by Gneiting et al. (2007). The CRPS is a robust proper scoring rule that quantifies the calibration and sharpness of the forecast (Gneiting et al., 2007). It is defined as
(4.1) 
where is the predictive distribution at station and time , and is the observed wind speed. The average CRPS for the hour ahead forecast is
(4.2) 
To assess the predictive performance in the upper tail of the distribution, we use the quantile loss function and the thresholdweighted continuous ranked probability score (twCRPS)
(Gneiting and Ranjan, 2011; Lerch et al., 2017). The quantile loss function measures the performance of a model to estimate a specific quantile , and it is defined asThe twCRPS is defined as
where is a nonnegative weight function on the real line. For , the twCRPS reduces to the CRPS in (4.1). As suggested by Lerch and Thorarinsdottir (2013), since our interest is on the right tail of the wind speed distribution, we may set , for some . The average twCRPS for hour ahead forecast is defined as in (4.2).
Table 1 shows performance measures for onehour ahead forecasts using the first and second proposed linear predictors, respectively. Throughout all the fits, we set the probabilities defined in Section 2.2 as and . The performance measures are the mean CRPS, mean absolute error (MAE), mean twCRPS using an indicator weight function with equal to the 90%, 95%, and 98% quantile of the wind speed distribution at each station, and mean quantile loss (MQL). We can see that the temporal model with offsite predictors performs slightly better than the SPDE model at predicting average wind speed values (CRPS is better at 12 out of 20 stations, whereas MAE is better at 11 out of 20 stations); on the other hand, the SPDE model is much better at predicting extreme wind speeds. The SPDE model overcomes the temporal model with offsite predictors in terms of the twCRPS at almost all stations, whereas the MQL is better at 13 out of 20 stations. This may be due to the difficulty in estimating the GP shape parameter at each station using the temporal model, while a single shape parameter is assumed in the SPDE model, reducing dramatically the estimated posterior predictive uncertainty by borrowing strength across all stations. An additional advantage of the SPDE model is that it can be also used to forecast wind speed at unobserved stations, which it cannot be done with the temporal model with offsite predictors.
twCRPS  
Station  CRPS  MAE  MQL  
AUG  0.90/0.81  1.83/0.83  0.32/0.20  0.14/0.10  0.07/0.03  0.75/0.55 
BID  0.69/2.79  1.89/3.49  0.31/0.02  0.17/0.01  0.08/0.00  0.64/2.96 
BUT  0.87/1.05  1.89/1.04  0.12/0.01  0.07/0.00  0.04/0.00  0.80/0.20 
CNK  0.96/0.76  1.93/1.47  0.17/0.18  0.09/0.17  0.04/0.12  0.96/0.78 
FOR  0.35/2.07  0.62/3.08  0.24/0.02  0.11/0.01  0.04/0.00  0.33/0.23 
GDH  0.70/0.71  1.23/1.42  0.27/0.22  0.11/0.07  0.05/0.01  0.61/0.59 
HOO  0.60/0.40  0.92/0.02  0.27/0.03  0.11/0.01  0.04/0.00  0.54/0.51 
HOR  0.73/0.76  1.22/1.25  0.26/0.22  0.12/0.15  0.06/0.58  0.65/0.64 
KEN  1.05/1.10  1.91/1.84  0.29/0.00  0.17/0.00  0.06/0.00  0.81/0.59 
MAR  1.01/2.12  2.23/3.82  0.25/0.02  0.14/0.01  0.07/0.00  0.96/2.05 
MEG  0.64/1.50  1.16/2.59  0.35/0.02  0.15/0.01  0.06/0.00  0.59/1.38 
HEB  0.52/1.22  1.07/2.41  0.35/0.02  0.17/0.01  0.08/0.00  0.49/0.73 
NAS  0.66/1.46  1.22/2.79  0.30/0.02  0.14/0.01  0.06/0.00  0.60/1.56 
ROO  1.00/0.75  2.05/1.51  0.18/0.01  0.10/0.00  0.06/0.00  0.86/0.64 
SML  0.87/0.55  1.61/1.32  0.30/0.01  0.14/0.00  0.03/0.00  0.80/0.63 
SHA  0.79/0.42  1.28/0.89  0.35/0.34  0.15/0.13  0.06/0.03  0.68/0.55 
SUN  1.06/0.27  1.83/1.80  0.29/0.22  0.17/0.16  0.09/0.06  0.98/0.78 
TIL  0.49/1.08  0.84/1.79  0.34/0.01  0.21/0.00  0.08/0.00  0.48/0.64 
TRO  0.56/0.77  1.21/1.59  0.26/0.01  0.13/0.00  0.05/0.00  0.54/0.68 
WAS  0.78/0.73  1.33/1.59  0.36/0.01  0.10/0.00  0.04/0.00  0.72/0.38 
We assess the calibration of our probabilistic forecasts using reliability diagrams (see, e.g., Lenzi et al., 2017). Reliability refers to the ability of the model to match the observation frequencies. The diagram is constructed as follows: for every station, we compute the nominal coverage rate, which is the proportion of times that the cumulative distribution of our threestage model is below a certain threshold. Our model is well calibrated if this proportion is close to the observed frequencies. Figure 4 (a) show reliability diagrams for the temporal model with offsite predictors (red line) and the SPDE model (green line). Both models tend to overestimate the wind speed quantiles smaller than the median. The temporal model with offsite predictors underestimates the wind speed values larger than the median, whereas the SPDE model is better calibrated at higher quantiles.
Finally, to compare the forecast performance of our two models, we compute predictive measures of fit for extreme wind speeds. Figure 4
(b) and (c) shows Probability Integral Transform (PIT) plots based on the GP distribution, using the temporal model with offsite predictors and the SPDE model. The PIT is the probability of a new response being less than the observed response using a model based on the rest of the data. If the model assumptions are correct, we expect the PIT values to be uniformly distributed. Although there is still room for improvement, we can see that both models perform decently well at predicting the severity of extreme events, and that the SPDE model seems slightly better.
5 Conclusion
Wind power is getting increasing attention from many countries as a source of clean energy with little environmental impact. A recent report from The European Power Sector (AgoraEnergiewende and Sandbag, 2018) shows that renewable sources of energy like wind, solar and biomass, overtook coal for the first time. In particular, wind generation increased considerably by 19% during 2017, due to good wind conditions and huge investments in wind power plants.
As cities powered by electricity generated by wind are now a reality, new and accurate methods to forecast wind speed, and therefore wind power, are required. In this work, we explored a threestage Bayesian model designed to forecast extreme wind speeds. Our model corrects the tail of the Gamma distribution by a generalized Pareto distribution in the presence of exceedances. Each stage of our model belongs to the class of latent Gaussian models, for which the integrated nested Laplace approximation (INLA; Rue et al., 2009) method is wellsuited. Considering an additive structure, we proposed two types of linear predictors describing the spatiotemporal dynamics of wind speeds. The first linear predictor includes offsite information in terms of lagged wind speeds from neighboring stations. To select the neighbors, we propose an automatic method to identify dominant wind direction patterns based on mixtures of von Mises circular distributions. The second linear predictor considers a spatiotemporal term with Matérn covariance structure, that varies in time according to a firstorder autoregressive dynamic. This model is fitted using the SPDE approach, which links Gaussian random fields with dense precision matrices with Gaussian Markov Random fields, characterized by sparse precision matrices. In terms of extreme wind speed forecasting, both models seem to perform decently well, although the SPDE model seems to be better calibrated at higher quantiles, probably because it better exploits spatial information. It would be interesting to explore the potential of the SPDE model to predict extreme wind speeds at unobserved locations, which could be helpful for optimal design of wind farms.
Thanks to the very powerful and fast INLA approach, we can implement in a reasonable amount of time complex hierarchical spatial models that are well suited to wind speed data. Finally, by selecting a suitable distribution in the first stage, our threestage model procedure for exceedances can be easily adapted to model and forecast other types of environmental data.
It would be interesting to extend our approach to cases with lighter Weibulllike tails. One option could be to pretransform the data as , with some , and then proceed identically, or alternatively to fit directly a generalized Gamma distribution. Similarly, it would be interesting to fit the GP distribution with shape parameter (i.e., not only ). Nevertheless, designing suitable shrinkage priors for is still an open question.
Appendix A Supplementary Material
a.1 Inference for our threestage model using INLA
From a computational point of view, the main goal of INLA is to accurately approximate the integrals in Section 4 of the main paper. Here we discuss general concepts behind this numerical approximation, and the reader is referred to Rue et al. (2009), Blangiardo and Cameletti (2015), and Rue et al. (2017) for more mathematical details.
We need to compute from which also all the relevant marginals can be obtained, and .

Computation of . Note that
(A.1) where is the Gaussian approximation of and is its mode. This Gaussian approximation can be written as
where is the location of the mode, , and the vector contains in the diagonal the negative second derivatives of the loglikelihood at the mode with respect to ; see Rue and Held (2005), Section 4.4.1. The proportional sign in (A.1) comes from the fact that the normalizing constant for is unknown. Rue and Martino (2007) show that the approximation is particularly accurate for a wide range of latent Gaussian models.

Computation of . Although we could start from the previous result and take the Gaussian marginals derived from (this is the Gaussian approximation method in INLA), the error in this approximation is considerably high (Rue and Martino, 2007). To cope with this issue, Rue et al. (2009) propose to use the Laplace approximation one more time to approximate (this is the nested part in INLA). Specifically, we can write
(A.2) As before, is the Gaussian approximation of evaluated at its mode . Although this approximation (Laplace approximation method in INLA) typically works well (Blangiardo et al., 2013), it involves the location of the mode and the factorization of a matrix many times for each . Alternatively, we can use a Taylor expansion on the numerator and denominator of the expression (A.2), thus correcting the Gaussian approximation for location and skewness with a lower computational cost (this is the simplified Laplace approximation method in INLA).
The INLA strategy can be summarized as follows:

Explore the marginal joint posterior to locate the mode.

Conduct a grid search to produce a set of points and integration weights to obtain .

Approximate the marginal posterior using interpolation based on .

For each , evaluate the conditional posteriors on a grid of values for . Obtain marginal posteriors by numerical integration (this is the integrated part in INLA).
As a result, the approximated posterior marginals of interest returned by INLA have the following form:
a.2 Exploratory analysis
a.2.1 Time series plots

























a.2.2 Wind roses

























a.2.3 Autocorrelation plots

























a.2.4 Dominant wind directions for the temporal model with offsite predictors

























a.3 Discretization of the study region for the SPDE model
Figure 9 shows the mesh that is used to discretize the study region, split into two parts for the West and the East side of the Cascade Mountains.
Acknowledgements
We thank Amanda Hering for helpful discussion and suggestions, and for providing the wind speed data. We also extend our thanks to Thomas Opitz for helpful discussion, and to Amanda Lenzi for her constructive suggestions regarding the INLA implementation. Support from the KAUST Supercomputing Laboratory and access to Shaheen is also gratefully acknowledged. This publication is based upon work supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. OSRCRG20173434.
References
 AgoraEnergiewende and Sandbag (2018) AgoraEnergiewende and Sandbag (2018) The European Power Sector in 2017. State of Affairs and Review of Current Developments. Press Release, EU, December 17, 2008, available at: https://sandbag.org.uk/wpcontent/uploads/2018/01/EUpowersectorreport2017.pdf .
 Bakka et al. (2018) Bakka, H., Rue, H., Fuglstad, G. A., Riebler, A., Bolin, D., Illian, J., Krainski, E., Simpson, D. and Lindgren, F. (2018) Spatial modelling with RINLA: A review. WIREs Computational Statistics (arxiv:1802.06350) (Invited extended review, to appear).
 Bivand et al. (2015) Bivand, R., GómezRubio, V. and Rue, H. (2015) Spatial data analysis with RINLA with some extensions. Journal of Statistical Software 63(1), 1–31.
 Blangiardo and Cameletti (2015) Blangiardo, M. and Cameletti, M. (2015) Spatial and spatiotemporal Bayesian models with RINLA. John Wiley & Sons.
 Blangiardo et al. (2013) Blangiardo, M., Cameletti, M., Baio, G. and Rue, H. (2013) Spatial and spatiotemporal models with RINLA. Spatial and spatiotemporal epidemiology 4, 33–49.
 Casson and Coles (1999) Casson, E. and Coles, S. (1999) Spatial regression models for extremes. Extremes 1(4), 449–468.
 Cooley et al. (2007) Cooley, D., Nychka, D. and Naveau, P. (2007) Bayesian spatial modeling of extreme precipitation return levels. Journal of the American Statistical Association 102(479), 824–840.
 Damousis et al. (2004) Damousis, I. G., Alexiadis, M. C., Theocharis, J. B. and Dokopoulos, P. S. (2004) A fuzzy model for wind speed prediction and power generation in wind parks using spatial correlation. IEEE Transactions on Energy Conversion 19(2), 352–361.
 Davison and Huser (2015) Davison, A. C. and Huser, R. (2015) Statistics of extremes. Annual Review of Statistics and its Application 2, 203–235.
 Davison and Smith (1990) Davison, A. C. and Smith, R. L. (1990) Models for exceedances over high thresholds. Journal of the Royal Statistical Society. Series B (Methodological) 52(3), 393–442.
 Erdem and Shi (2011) Erdem, E. and Shi, J. (2011) ARMA based approaches for forecasting the tuple of wind speed and direction. Applied Energy 88(4), 1405–1414.
 Fuglstad et al. (2018) Fuglstad, G.A., Simpson, D., Lindgren, F. and Rue, H. (2018) Constructing priors that penalize the complexity of Gaussian random fields. Journal of the American Statistical Association pp. 1–8.
 Gneiting et al. (2007) Gneiting, T., Balabdaoui, F. and Raftery, A. E. (2007) Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69(2), 243–268.
 Gneiting et al. (2006) Gneiting, T., Larson, K., Westrick, K., Genton, M. G. and Aldrich, E. (2006) Calibrated probabilistic forecasting at the stateline wind energy center: The regimeswitching space–time method. Journal of the American Statistical Association 101(475), 968–979.
 Gneiting and Ranjan (2011) Gneiting, T. and Ranjan, R. (2011) Comparing density forecasts using thresholdand quantileweighted scoring rules. Journal of Business & Economic Statistics 29(3), 411–422.
 Handcock and Stein (1993) Handcock, M. S. and Stein, M. L. (1993) A Bayesian analysis of kriging. Technometrics 35(4), 403–410.
 Hering and Genton (2010) Hering, A. S. and Genton, M. G. (2010) Powering up with spacetime wind forecasting. Journal of the American Statistical Association 105(489), 92–104.
 Huang and Chalabi (1995) Huang, Z. and Chalabi, Z. (1995) Use of timeseries analysis to model and forecast wind speed. Journal of Wind Engineering and Industrial Aerodynamics 56(2–3), 311–322.
 Kazor and Hering (2015) Kazor, K. and Hering, A. S. (2015) The role of regimes in shortterm wind speed forecasting at multiple wind farms. Stat 4(1), 271–290.
 Krainski et al. (2019) Krainski, E. T., GómezRubio, V., Bakka, H., Lenzi, A., CastroCamilo, D., Simpson, D., Lindgren, F. and Rue, H. (2019) Advanced Spatial Modeling with Stochastic Partial Differential Equations using R and INLA. CRC press. Github version www.rinla.org/spdebook.
 Larson and Westrick (2006) Larson, K. A. and Westrick, K. (2006) Shortterm wind forecasting using offsite observations. Wind Energy: An International Journal for Progress and Applications in Wind Power Conversion Technology 9(1–2), 55–62.
 Lenzi et al. (2017) Lenzi, A., Pinson, P., Clemmensen, L. H. and Guillot, G. (2017) Spatial models for probabilistic prediction of wind power with application to annualaverage and high temporal resolution data. Stochastic Environmental Research and Risk Assessment 31(7), 1615–1631.
 Lerch and Thorarinsdottir (2013) Lerch, S. and Thorarinsdottir, T. L. (2013) Comparison of nonhomogeneous regression models for probabilistic wind speed forecasting. Tellus A: Dynamic Meteorology and Oceanography 65(1), 21206.
 Lerch et al. (2017) Lerch, S., Thorarinsdottir, T. L., Ravazzolo, F., Gneiting, T. et al. (2017) Forecaster’s dilemma: Extreme events and forecast evaluation. Statistical Science 32(1), 106–127.
 Lindgren et al. (2011) Lindgren, F., Rue, H. and Lindström, J. (2011) An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73(4), 423–498.
 Lombardo et al. (2018) Lombardo, L., Opitz, T. and Huser, R. (2018) Point processbased modeling of multiple debris flow landslides using INLA: an application to the 2009 Messina disaster. Stochastic Environmental Research and Risk Assessment 32(7), 2179–2198.
 Moody’s (2016) Moody’s (2016) Global wind turbine manufacturers: COP 21 agreement will support wind turbine manufacturers’ sales momentum. Available at hhttps://www.moodys.com .
 Opitz et al. (2018) Opitz, T., Huser, R., Bakka, H. and Rue, H. (2018) INLA goes extreme: Bayesian tail regression for the estimation of high spatiotemporal quantiles. Extremes 21(3), 441–462.

PalomaresSalas et al. (2009)
PalomaresSalas, J., De La Rosa, J., Ramiro, J., Melgar, J., Aguera, A. and Moreno, A. (2009) Arima vs. neural networks for wind speed forecasting.
In Computational Intelligence for Measurement Systems and Applications, 2009. CIMSA’09. IEEE International Conference on, pp. 129–133.  Pinson et al. (2007) Pinson, P., Chevallier, C. and Kariniotakis, G. N. (2007) Trading wind generation from shortterm probabilistic forecasts of wind power. IEEE Transactions on Power Systems 22(3), 1148–1156.
 Pinson and Madsen (2012) Pinson, P. and Madsen, H. (2012) Adaptive modelling and forecasting of offshore wind power fluctuations with Markovswitching autoregressive models. Journal of forecasting 31(4), 281–313.
 Riebler et al. (2012) Riebler, A., Held, L., Rue, H. et al. (2012) Estimation and extrapolation of time trends in registry data—borrowing strength from related populations. The Annals of Applied Statistics 6(1), 304–333.
 Rue and Held (2005) Rue, H. and Held, L. (2005) Gaussian Markov random fields: theory and applications. CRC press.
 Rue and Martino (2007) Rue, H. and Martino, S. (2007) Approximate Bayesian inference for hierarchical Gaussian Markov random field models. Journal of statistical planning and inference 137(10), 3177–3192.
 Rue et al. (2009) Rue, H., Martino, S. and Chopin, N. (2009) Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal statistical society: Series B (Statistical Methodology) 71(2), 319–392.
 Rue et al. (2017) Rue, H., Riebler, A., Sørbye, S. H., Illian, J. B., Simpson, D. P. and Lindgren, F. K. (2017) Bayesian computing with INLA: a review. Annual Review of Statistics and Its Application 4, 395–421.
 Simpson et al. (2017) Simpson, D., Rue, H., Riebler, A., Martins, T. G., Sørbye, S. H. et al. (2017) Penalising model component complexity: A principled, practical approach to constructing priors. Statistical Science 32(1), 1–28.
 Trombe and Pinson (2012) Trombe, P.J. and Pinson, P. (2012) Highresolution forecasting of wind power generation with regimeswitching models and offsite observations. DTU Informatics.
 Trombe et al. (2012) Trombe, P.J., Pinson, P. and Madsen, H. (2012) A general probabilistic forecasting framework for offshore wind power fluctuations. Energies 5(3), 621–657.
 UNFCCC (2015) UNFCCC, V. (2015) Adoption of the Paris Agreement. Proposal by the President (Draft Decision). United Nations Office, Geneva, Switzerland .
 Whittle (1954) Whittle, P. (1954) On stationary processes in the plane. Biometrika pp. 434–449.
 Whittle (1963) Whittle, P. (1963) Stochasticprocesses in several dimensions. Bulletin of the International Statistical Institute 40(2), 974–994.
 WWEA (2017) WWEA (2017) The WWEA halfyear report. europa.eu/rapid/pressrelease_IP081998_en.pdf .
 Zhu and Genton (2012) Zhu, X. and Genton, M. G. (2012) Shortterm wind speed forecasting for power system operations. International Statistical Review 80(1), 2–23.
Comments
There are no comments yet.