post-processing experiments with neural networks
Ensemble weather predictions require statistical post-processing of systematic errors to obtain reliable and accurate probabilistic forecasts. Traditionally, this is accomplished with distributional regression models in which the parameters of a predictive distribution are estimated from a training period. We propose a flexible alternative based on neural networks that can incorporate nonlinear relationships between arbitrary predictor variables and forecast distribution parameters that are automatically learned in a data-driven way rather than requiring pre-specified link functions. In a case study of 2-meter temperature forecasts at surface stations in Germany, the neural network approach significantly outperforms benchmark post-processing methods while being computationally more affordable. Key components to this improvement are the use of auxiliary predictor variables and station-specific information with the help of embeddings. Furthermore, the trained neural network can be used to gain insight into the importance of meteorological variables thereby challenging the notion of neural networks as uninterpretable black boxes. Our approach can easily be extended to other statistical post-processing and forecasting problems. We anticipate that recent advances in deep learning combined with the ever-increasing amounts of model and observation data will transform the post-processing of numerical weather forecasts in the coming decade.READ FULL TEXT VIEW PDF
Quantifying uncertainty in weather forecasts typically employs ensemble
Accurate and reliable probabilistic forecasts of hydrological quantities...
Currently all major meteorological centres generate ensemble forecasts u...
A new probabilistic post-processing method for wind vectors is presented...
Predictive hydrological uncertainty can be quantified by using ensemble
Modern weather forecasts are commonly issued as consistent multi-day for...
Meteorological ensembles are a collection of scenarios for future weathe...
post-processing experiments with neural networks
Numerical weather prediction based on physical models of the atmosphere has improved continuously since its inception more than four decades ago (Bauer et al., 2015). In particular, the emergence of ensemble forecasts — simulations with varying initial conditions and/or model physics — added another dimension by quantifying the flow-dependent uncertainty. Yet despite these advances the raw forecasts continue to exhibit systematic errors which need to be corrected using statistical post-processing methods (Hemri et al., 2014). Considering of the ever-increasing social and economical value of numerical weather prediction — for example in the renewable energy industry — producing accurate and calibrated probabilistic forecasts is an urgent challenge.
Most post-processing methods correct systematic errors in the raw ensemble forecast by learning a function that relates the response variable of interest to predictors. From a machine learning perspective, post-processing can be viewed as a supervised learning task. For the purpose of this study we will consider post-processing in a narrower distributional regression framework where the aim is to model the conditional distribution of the weather variable of interest given a set of predictors. The two most prominent approaches for probabilistic forecasts, Bayesian model averaging(BMA; Raftery et al., 2005) and non-homogeneous regression, also referred to as ensemble model output statistics (EMOS; Gneiting et al., 2005)
, rely on parametric forecast distributions. This means one has to specify a predictive distribution and estimate its parameters, for example the mean and the standard deviation in the case of a Gaussian distribution. In the EMOS framework the distribution parameters are connected to summary statistics of the ensemble predictions through suitable link functions which are estimated by minimizing a probabilistic loss function over a training dataset. Including additional predictors, such as forecasts of cloud cover or humidity, is not straightforward in this framework and requires elaborate approaches to avoid overfitting(Messner et al., 2016), a term that describes the inability of a model to generalize to data outside the training dataset. We propose an alternative approach based on modern machine learning methods, which is capable of including arbitrary predictors and learns nonlinear dependencies in a data-driven way.
Much work over the past years has been spent on flexible machine learning techniques for statistical modeling and forecasting (McGovern et al., 2017)2001), for instance, can model nonlinear relationships including arbitrary predictors while being robust to overfitting. They have been used for classification and prediction of precipitation (Gagne et al., 2014), severe wind (Lagerquist et al., 2017) and hail (Gagne et al., 2017)
. In a post-processing context, quantile regression forest models have been proposed byTaillardat et al. (2016).
Neural networks are a flexible and user-friendly machine learning algorithm that can model arbitrary nonlinear functions (Nielsen, 2015). They consist of several layers of interconnected nodes which are modulated with simple nonlinearities (Figure 2; Section 42015), but also biology, physics and chemistry (Angermueller et al., 2016; Goh et al., 2017) have been transformed by neural networks. In the atmospheric sciences, neural networks have been used to detect extreme weather in climate datasets (Liu et al., 2016) and parameterize sub-grid processes in global circulation models (Gentine et al., 2018). Neural networks have also been used for forecasting solar irradiances (Wang et al., 2012; Chu et al., 2013) and damaging winds (Lagerquist et al., 2017). However, the complexity of the neural networks used in these studies was limited.
Here, we demonstrate how neural networks can be used for probabilistic post-processing of ensemble forecasts in the distributional regression framework. The presented model architecture allows for the incorporation of various features that are relevant for correcting systematic deficiencies of ensemble predictions, and to estimate the network parameters by optimizing the continuous ranked probability score — a mathematically principled loss functions for probabilistic forecasts. Specifically, we explore a case study of 2-meter temperature forecasts at surface stations in Germany with data from 2007–2016. We compare different neural network configurations to benchmark post-processing methods for varying training period lengths. We further use the trained neural networks to gain meteorological insight into the problem at hand. Our ultimate goal is to present an efficient, multi-purpose approach to statistical post-processing and probabilistic forecasting. To the best of our knowledge, this study is the first to tackle ensemble post-processing using neural networks.
The remainder of the paper is structured as follows: Section 2 describes the forecast and observation data as well as the notation used throughout the study. In Section 3 we describe the benchmark post-processing models, followed by a description of the neural network techniques in Section 4. The main results are presented in Section 5. In Section 6 we explore the relative importance of the predictor variables. A discussion of possible extensions follows in Section 7 before a conclusion in Section 8.
For this study, we focus on 2-meter temperature forecasts at surface stations in Germany at a forecast lead time of 48 h. The forecasts are taken from the Interactive Grand Global Ensemble (TIGGE) dataset111available at http://apps.ecmwf.int/datasets/data/tigge/, see https://github.com/slerch/ppnn/tree/master/data_retrieval (Bougeault et al., 2010). In particular, we use the global European Centre for Medium-Range Weather Forecasts (ECMWF) 50-member ensemble forecasts initialized at 00 UTC every day. The data in the TIGGE archive is upscaled to a
grid which corresponds to a horizontal grid spacing of around 35/55 km (zonal/meridional). For comparison with the station observations, the gridded data were bilinearly interpolated to the observation locations. In addition to the target variable, we retrieved several auxiliary predictor variables (Table1). These were chosen broadly based on meteorological intuition. For each variable, we reduced the 50-member ensemble to its mean and standard deviation.
Ensemble predictions are available from 3 January 2007 to 31 December 2016 every day. To ensure our verification procedure mimics operational conditions we set aside the year 2016 as a validation set. For model estimation we use two training periods, 2007–2015 and 2015 only, to assess the importance of training sample size.
|Ensemble predictions (mean and standard deviation)|
|cape||Convective available potential energy|
|tcc||Total cloud cover|
|sshf||Sensible heat flux|
|slhf||Latent heat flux|
|d2m||2-meter dew point temperature|
|ssr||Short wave radiation flux|
|str||Long wave radiation flux|
|v_pl500||V-wind at 500 hPa|
|u_pl500||U-wind at 500 hPa|
|u_pl850||U-wind at 850 hPa|
|v_pl850||V-wind at 850 hPa|
|gh_pl500||Geopotential at 500 hPa|
|q_pl850||Specific humidity at 850 hPa|
|station_alt||Altitude of station|
|orog||Altitude of model grid point|
|station_lat||Latitude of station|
|station_lon||Longitude of station|
The forecasts are evaluated at 537 weather stations in Germany (see Figure 1222All maps in this article were produced using the R package ggmap (Kahle and Wickham, 2013).). 2-meter temperature data are available from the Climate Data Center of the German weather service (Deutscher Wetterdienst, DWD)333available at https://www.dwd.de/DE/klimaumwelt/cdc/cdc_node.html. Several stations have periods of missing data which are omitted from the analysis. During the evaluation period in calendar year 2016, observations are available at 499 stations.
After removing missing observations, the 2016 validation set contains 182 218 samples, the 2007–2015 training set contains 1 626 724 samples and the 2015 training set contains 180 849 samples.
We now introduce the notation that is used throughout the rest of the paper. An observation of 2-meter temperature at station and time will be denoted by . For each and , the 50-member ECMWF ensemble forecast of variable is given by with mean value and standard deviation . The mean values and standard deviations of all variables in the upper part of Table 1
are combined with station-specific features in the lower part, and aggregated into a vector of predictors. Further, we write to denote the vector of predictors that only contains mean value and standard deviations of the 2-meter temperature forecasts.
In the general EMOS framework proposed by Gneiting et al. (2005), the conditional distribution of the weather variable of interest, , given ensemble predictions , is modeled by a single parametric forecast distribution with parameters ,
The parameters vary over space and time, and depend on the ensemble predictions through suitable link functions ,
Here, we are interested in modeling the conditional distribution of temperature and follow Gneiting et al. (2005) who introduce a model based on ensemble predictions of temperature, , only, where the forecast distribution is Gaussian with mean and standard deviation , i.e.,
and where the link functions for mean and standard deviation are affine functions of the ensemble mean and standard deviation, respectively,
Over the past decade, the EMOS framework has been extended from temperature to other weather variables including wind speed (Thorarinsdottir and Gneiting, 2010; Lerch and Thorarinsdottir, 2013; Baran and Lerch, 2015; Scheuerer and Möller, 2015) and precipitation (Messner et al., 2014; Scheuerer, 2014; Scheuerer and Hamill, 2015).
The model parameters (or EMOS coefficients) are estimated by minimizing the mean continuous ranked probability score (CRPS) as function of the parameters over a training set. The CRPS is an example of a proper scoring rule, i.e., a mathematically principled loss function for distribution forecasts, and is a standard choice in meteorological applications. Details on the mathematical background of proper scoring rules and their use for model estimation are provided in Appendix B.
Training sets are often considered to be comprised of the most recent days only. However, as we did not find substantial differences in predictive performance (see Section 5), we estimate the coefficients over a fixed training set, they thus do not vary over time and we denote them by . Estimation is usually either performed locally, i.e., considering only forecast cases from the station of interest, or globally by pooling together forecasts and observations from all stations. We refer to the corresponding EMOS models by EMOS-loc and EMOS-gl, respectively. The parameters of the global model do not depend on the station , and are thus unable to correct location-specific deficiencies of the ensemble forecasts. Alternative approaches where training sets are selected based on similarities of weather situations or observation station characteristics were proposed by Junk et al. (2015) and Lerch and Baran (2017). EMOS-gl and EMOS-loc are implemented in R with the help of the scoringRules package (Jordan et al., 2017).
Extending the EMOS framework to allow for including additional predictor variables is non-trivial as the increased number of parameters can result in overfitting. Messner et al. (2017) proposed a boosting algorithm for this purpose. In this approach components of the link function in (2) are chosen to be an affine function for the mean and an exponential transformation of an affine function for the standard deviation ,
Here, and denote coefficient vectors corresponding to the vector of predictors extended by a constant. As for the standard EMOS models, the coefficient vectors are estimated over fixed training periods and thus do not depend on , we suppress the index in the following.
The boosting algorithm proceeds iteratively by updating the coefficient of the predictor that improves the current model fit most. As the coefficient vectors are initialized as , only the most important variables will have non-zero coefficients if the algorithm is stopped before convergence. The contributions of the different predictors are assessed by computing average correlations to partial derivatives of the loss function with respect to and over the training set. If the current model fit is improved, the coefficient vectors are updated by a pre-defined step size into the direction of steepest descent of linear approximations of the gradients.
We denote local EMOS models with an additional boosting step by EMOS-loc-bst. The tuning parameters of the algorithm were chosen by fitting models for a variety of choices and picking the configuration with the best out-of-sample predictions (Table A2) based on implementations in the R package crch (Messner et al., 2016). Note, however, that the results are not very sensitive to the exact choice of tuning parameters. For the local model considered here, the station-specific features in the lower part of Table 1 are not relevant and are excluded from . Boosting-based variants of global EMOS models have also been tested, but result in worse forecasts.
The boosting-based EMOS-loc-bst model differs from the standard EMOS models (EMOS-gl and EMOS-loc) in several aspects. First, the boosting step allows to include covariate information from predictor variables other than temperature forecasts. Second, the parameters are estimated by maximum likelihood estimation, i.e., by minimizing the mean logarithmic score by contrast to minimum CRPS estimation, see Appendix B for details. Further, the affine link function for the standard deviation in (3) is replaced by an affine function for the logarithm of the standard deviation in (4).
Parametric distributional regression models such as the EMOS methods described above require the choice of a suitable parametric family . While the conditional distribution of temperature can be well approximated by a Gaussian distribution, this poses a limitation for other weather variables such as wind speed or precipitation where the choice is less obvious (see, e.g., Baran and Lerch, 2018).
Non-parametric distributional regression approaches provide alternatives that circumvent the choice of parametric family. For example, quantile regression approaches approximate the conditional distribution by a set of quantiles. In the context of post-processing ensemble forecasts, Taillardat et al. (2016) proposed a quantile regression forest (QRF) model based on the work of Meinshausen (2006) that allows to include additional predictor variables.
The QRF model is based on the idea of generating random forests from classification and regression trees (Breiman et al., 1984)
. These are binary decision trees obtained by iteratively splitting the training data into two groups according to some threshold for one of the predictors, chosen such that every split minimizes the sum of the variance of the response variable in each of the resulting groups. The splitting procedure is iterated until a stopping criterion is reached. The final groups (or terminal leaves) thus contain subsets of the training observations based on the predictor values, and out of sample forecasts at stationand time can be obtained by proceeding through the decision tree according to the corresponding predictor values . Random forest models (Breiman, 2001)
increase stability of the predictions by averaging over many random decision trees generated by selecting a random subset of the predictors at each candidate split in conjunction with bagging, i.e., bootstrap aggregation of random subsamples of training sets. In the quantile regression forest approach, each tree provides an approximation of the distribution of the variable of interest given by the empirical cumulative distribution function (CDF) of the observation values in the terminal leaf associated with the current predictor values. Quantile forecasts can then be computed from the combined forecast distribution which is obtained by averaging over all tree-based empirical CDFs.
We implement a local version of the QRF model where separate models are estimated for each station based on training sets that only contain past forecasts and observations from that specific station. As discussed by Taillardat et al. (2016), the predicted quantiles are necessarily restricted to the range of observed values in the training period by construction which may be disadvantageous in case of shorter training periods. However, global variants of the QRF model did not result in improved forecast performance even with only one year of training data, we will thus restrict attention to the local QRF model. The models are implemented using the quantregForest package (Meinshausen, 2017) for R. Tuning parameters are chosen as for the EMOS-loc-bst model (Table A2).
The QRF approach has recently been extended into several directions. Athey et al. (2016) propose a generalized version of random forest-based quantile regression based on theoretical considerations (GRF) which has been tested but did not result in improved forecast performance. Taillardat et al. (2017) combine QRF (and GRF) models and parametric distributional regression by fitting a parametric CDF to the observations in the terminal leaves instead of using the empirical CDF. Schlosser et al. (2018) combine parametric distributional regression and random forests for parameter estimation in the framework of a generalized additive model for location, scale and shape.
In this section we will give a brief introduction to neural networks. For a more detailed treatment the interested reader is referred to more comprehensive resources (e.g., Nielsen, 2015; Goodfellow et al., 2016). The network techniques are implemented using the Python
libraries Keras(Chollet et al., 2015)
and TensorFlow(Abadi et al., 2016).
Neural networks consist of several layers of nodes (Figure 2), each of which is a weighted sum of all nodes from the previous layer plus a bias term:
The first layer contains the input values, or features, while the last layer represents the output values, or targets. In the layers in-between, called hidden layers, each node value is passed through a nonlinear activation function. For this study, we use a rectified linear unit (ReLU):
This activation function allows the neural network to represent nonlinear functions. The weights and biases are optimized to reduce a loss function using stochastic gradient descent (SGD). Here we employ an SGD version called Adam(Kingma and Ba, 2014).
In this study we use networks without a hidden layer and with a single hidden layer (Figure 2). The former, which we will call fully-connected networks (FCN), model the outputs as a linear combination of the inputs. The latter, called neural networks (NN) here, are capable of representing nonlinear relationships. Introducing additional hidden layers to neural networks did not improve the predictions as additional model complexity increases the potential of overfitting.
Neural networks can be applied to a range of problems, such as regression and classification. The main difference between those are the contents and activation function of the output layer, and the loss function. Here we use the neural network for the distributional regression task of post-processing ensemble forecasts. Our output layer represents the distribution parameters and of the Gaussian predictive distribution. No activation function is applied. The corresponding probabilistic forecast describes the conditional distribution of the observation given the predictors as input features. As a loss function we use the closed form expression of the CRPS for a Gaussian distribution, see equation (B2). This is a non-standard choice in the neural network literature (D’Isanto and Polsterer (2018) is the only previous study to our knowledge) but provides a mathematically principled choice for the distributional regression problem at hand (see Appendix B for the mathematical background). Other probabilistic neural network approaches include quantile regression (Taylor, 2000) and distribution-to-distribution regression (Kou et al., 2018).
The simplest network model is a fully connected model based on predictors , i.e., mean and standard deviation of ensemble predictions of temperature only (denoted by FCN). Apart from additional connections for the mean and standard deviation to ensemble standard deviation and mean, respectively, the FCN model is conceptually equivalent to EMOS-gl, but differs in the parameter estimation approaches. A neural network with a hidden layer for the -input did not show any improvements over the simple linear model suggesting that there is no nonlinear relationships to exploit. Additional information from auxiliary variables can be taken into account by considering the entire vector of predictors as input features. The corresponding fully connected and neural network models are referred to as FCN-aux and NN-aux.
To enable the networks to learn station-specific information we use embeddings, a common technique in natural language processing and recommender systems. An embedding is a mapping from a discrete object, in our case the station ID, to a vector of real numbers. The elements of this vector with length are also referred to as latent features. In total, therefore, the embedding matrix has dimension , where is the number of stations. The latent features are concatenated with the predictors, or , and are updated along with the weights and biases during training. This allows the algorithm to learn a specific set of numbers for each station. Here we use as larger values did not improve the predictions.
The fully connected network with input features and embeddings is abbreviated by FCN-emb. As with FCN, adding a hidden layer did not improve the results. Fully connected and neural networks with both, station embeddings and auxiliary inputs , are denoted by FCN-aux-emb and NN-aux-emb.
Neural networks with a large number of parameters, i.e., weights and biases, can suffer from overfitting. One way to reduce overfitting is to stop training early. When to stop can be guessed by taking out a subset (20%) from the training set (2007–2015 or 2015) and checking when the score on this separate dataset stops improving. This gives a good approximation when to stop training on the full training set without using the actual 2016 validation set during training. Other common regularization techniques to prevent overfitting, such as dropout or weight decay (L2 regularization), were not successful in our case.
Finally, we train ensembles of ten neural networks with different random initial parameters for each configuration and average over the forecast distribution parameter estimates to obtain . For the more complex network models this helps to stabilize the parameter estimates by reducing the variability due to random variations between model runs and slightly improves the forecasts.
The CRPS values averaged over all stations and the entire 2016 validation period are summarized in Table 2. For the 2015 training period, EMOS-gl gives a 13% relative improvement compared to the raw ECMWF ensemble forecasts in terms of mean CRPS.444Note that standard EMOS models are often implemented with rolling training periods, i.e., parameter estimation is based on the preceding days. Such implementations have been tested, but only result in marginal improvements. For the longer training period from 2007–2015, we have also implemented EMOS models that are estimated based on a centered window around the current day from all previous years. For the local model, this results in a slightly improved mean CRPS of 0.88, but does not improve the forecasts of EMOS-gl. As expected, FCN which mimics the design of EMOS-gl achieves a very similar score. Adding local station information in EMOS-loc and FCN-emb improves the global score by another 10%. While EMOS-loc estimates a separate model for each station, FCN-emb can be seen as a global network-based implementation of EMOS-loc. Adding covariate information through auxiliary variables results in an improvement for the fully connected models similar to that of adding station information. Combining auxiliary variables and station embeddings in FCN-emb-aux improves the mean CRPS further to 0.88 but the effects do not stack linearly. Adding covariate information in EMOS models using boosting (EMOS-loc-bst) outperforms FCN-emb-aux by 3%. Allowing for nonlinear interactions of station information and auxiliary variables using a neural network (NN-aux-emb) achieves the best results, improving the best benchmark technique (EMOS-loc-bst) by 3% for a total improvement compared to the raw ensemble of 29%. The QRF model is unable to compete with most of the post-processing models for the 2015 training period.
The relative scores and model rankings for the 2007–2015 training period closely match those of the 2015 period. For the linear models (EMOS-gl, EMOS-loc and all FCN) more data does not improve the score by much. For EMOS-loc-bst and the neural network models, however, the skill is increased by 4–5%. This suggests that longer training periods are most efficiently exploited by more complex, nonlinear models. QRF improves the most, now being among the best models, which indicates a minimum data amount required for this method to work. This is likely due to the limitation of predicted quantiles to the range of observed values in the training data, see Section 3.3.
|Model||Description||Mean CRPS for|
|Benchmark post-processing methods|
|EMOS-loc-bst||Local EMOS with boosting||0.85||0.80|
|QRF||Local quantile regression forest||0.95||0.81|
|Neural network models|
|FCN||Fully connected network||1.01||1.01|
|FCN-aux||…with auxiliary predictors||0.92||0.91|
|FCN-emb||…with station embeddings||0.91||0.91|
|FCN-aux-emb||…with both of the above||0.88||0.87|
|NN-aux||1-hidden layer neural network with auxiliary predictors||0.90||0.86|
|NN-aux-emb||…and station embeddings||0.82||0.78|
To assess calibration, verification rank and probability integral transform (PIT) histograms of raw and post-processed forecasts are shown in Figures C1 and C2. The raw ensemble forecasts are under-dispersed as indicated by the U-shaped verification rank histogram, i.e., observations tend to fall outside the range of the ensemble too frequently. By contrast, all post-processed forecast distributions are substantially better calibrated and the corresponding PIT histograms show much smaller deviations from uniformity. All models show a slight overprediction of high temperatures and, with the exception of QRF, an underprediction of low values. The linear EMOS and FCN models as well as QRF are further slightly overdispersive as indicated by inverse U-shaped upper parts of the histogram.
shows the station-wise distribution of the continuous ranked probability skill score (CRPSS), which measures the probabilistic skill relative to a reference model. Positive values indicate an improvement over the reference. Compared to the raw ensemble, forecasts at most stations are improved by all post-processing methods with only a few negative outliers. Compared to EMOS-loc only FCN-aux-emb, the neural network models and EMOS-loc-bst show improvements at the majority of stations. Corresponding plots with the three best-performing models as reference experiments are provided in FigureC3. It is interesting to note that the network models, with the exception of FCN and FCN-emb, have more outliers, particularly for negative values compared to the EMOS methods and QRF which have very few negative outliers. This might be due to a few stations with strongly location-specific error characteristics that the locally estimated benchmark models are better able capture. Training with data from 2007 to 2015 alleviates this somewhat.
Figure 4 shows maps with the best-performing models in terms of mean CRPS for each station. For the majority of stations NN-aux-emb provides the best predictions. The variability of station-specific best models is greater for the 2015 training period compared to 2007–2015. The top three models for the 2015 period are NN-aux-emb (best at 65.9% of stations), EMOS-loc-bst (16.0%) and NN-aux (7.2%), and for 2007–2015 NN-aux-emb (73.5%), EMOS-loc-bst (12.4%) and QRF (7.4%). At coastal and offshore locations, particularly for the shorter training period, the benchmark methods tend to outperform the network methods. Ensemble forecast errors at these location likely have a strong location-specific component that might be easier to capture for the locally estimated EMOS and QRF methods.
Additionally, we evaluated the statistical significance of the differences between the competing post-processing methods using a combination of Diebold-Mariano tests (Diebold and Mariano, 1995) and a Benjamini and Hochberg (1995) procedure to account for temporal and spatial dependencies of forecast errors. We thereby follow suggestions of Wilks (2016), mathematical details are deferred to Appendix B.3. The results provided in Appendix C.4 generally indicate high ratios of stations with significant score differences in favor of the neural network models. Even when compared to the second best-performing model, EMOS-loc-bst, NN-aux-emb is significantly better at 30% of the stations and worse at only 2% or less for both training periods.
While a direct comparison of computation times for the different methods is difficult, even the most complex network methods are a factor of two or more faster than EMOS-loc-bst. This includes creating an ensemble of ten different model realizations. QRF is by far the slowest method, being roughly ten times slower than EMOS-loc-bst. Complex neural networks benefit substantially from running on a graphics processing unit (GPU) compared to running on the core processing unit (CPU; roughly six times slower for NN-aux-emb). Neural network-ready GPUs are now widely available in many scientific computing environments or via cloud computing555For example, see https://colab.research.google.com/. For more details on the computational methods and results see Appendix C.3.
To assess the relative importance of all features we use a technique called permutation importance that was first described in the context of random forests (Breiman, 2001). We randomly shuffle each predictor/feature in the validation set one at a time and observe the increase in mean CRPS compared to the unpermuted features. While unable to capture colinearities between features this method does not require re-estimating the model with each individual feature omitted.
Consider a random permutation of station and time indices and let denote the vector of predictors where variable is permuted according to , i.e., a vector with -th entry
The importance of input feature is computed as mean CRPS difference
where we average over the entire evaluation set and denotes the conditional forecast distribution given a vector of predictors.
We picked three network setups to investigate how feature importance changes by adding station embeddings and a nonlinear layer (Figure 5). For the linear model without station embeddings (FCN-aux) station altitude and orography, the altitude of the model grid cell, are the most important predictors after the mean temperature forecast. This makes sense since our interpolation from the forecast model grid to the station does not adjust for the height of the surface station. The only other features with significant importance are the mean shortwave radiation flux and 850 hPa specific humidity. Adding station embeddings (FCN-aux-emb) reduces the significance of the station altitude information which now seems to be encoded in the latent embedding features. The nonlinearity added by the hidden layer in NN-aux-emb increases the sensitivity to permuting input features overall and distributes the feature importance more evenly. In particular, we note an increase in the importance of the station altitude and orography but also the sensible and latent heat flux and total cloud cover.
The most important features, apart from the obvious mean forecast temperature and station altitude, seem to be indicative of insolation, either directly like the shortwave flux or indirectly like the 850 hPa humidity. It is interesting that the latter seems to be picked by the algorithms as a proxy for cloud cover rather than the direct cloud cover feature. Curiously, the temperature standard deviation is not an important feature for the post-processing models. We suspect that this is a consequence of the low correlation between the raw ensemble standard deviation and the forecast error ( on the test set) and the general under-dispersion (mean spread-error ratio of 0.51). The post-processing algorithms almost double the spread to achieve a spread-error ratio of 0.95. The correlation of the raw and post-processed ensemble spreads is 0.39 suggesting that the post-processing is mostly an additive correction to the ensemble spread.
Note that this method of assessing feature importance is in principle possible for boosting- and QRF-based models. However, for the local implementations of the algorithm the importance changes from station to station, making interpretation more difficult.
Here we discuss some approaches we attempted that failed to improve our results, as well as directions for future research.
Having to describe the distribution of the target variable in parametric techniques is a non-trivial task. For temperature, a Gaussian distribution is a good approximation but for other variables, such as wind speed or precipitation, finding a distribution that fits the data is a substantial challenge (e.g., Taillardat et al., 2016; Baran and Lerch, 2018)
. Ideally, a machine learning algorithm would learn to predict the full probability distribution rather than distribution parameters only. One way to achieve this is to approximate the forecast distribution by a combination of uniform distributions and predicting the probability of the temperature being within pre-specified bins. Initial experiments indicate that the neural network is able to produce a good approximation of a Gaussian distribution but the skill was comparable only to the raw ensemble. This suggests that for target variables which are well approximated by a parametric distribution, utilizing these distributions is advantageous. One direction for future research is to apply this approach to more complex variables.
Standard EMOS models are often estimated based on so-called “rolling” training windows with data from previous days only in order to incorporate temporal dependencies of ensemble forecast errors. For neural networks, one way to incorporate temporal dependencies is to use convolutional or recurrent neural networks(Schmidhuber, 2015) which can process sequences as an input. In our tests, this leads to more overfitting without an improvement to the validation score. For other datasets, however, we believe that these approaches are worth revisiting.
One popular way to combat overfitting in machine learning algorithms is data augmentation. In the example of image recognition models, the training images are randomly rotated, flipped, zoomed, etc. to artificially increase the sample size (e.g., Krizhevsky et al., 2012). We tried a similar approach by adding random noise of a reasonable scale to the input features, but found no improvement in the validation score. A potential alternative to adding random noise might be augmenting the forecasts for a station with data from neighboring stations or grid points.
Similarly to rolling training windows for the traditional EMOS models, we tried updating the neural network each day during the validation period with the data from the previous time step, but found no improvements. This supports our observation that rolling training windows only bring marginal improvements for the benchmark EMOS models. Such an online learning approach could be more relevant in an operational setting, however, where model versions might change frequently or it is too expensive to re-estimate the entire post-processing model every time new data becomes available.
We have restricted the set of predictors to observation station characteristics and summary statistics (mean and standard deviation) of ensemble predictions of several weather variables. Recently, flexible distribution-to-distribution regression network models have been proposed in the machine learning literature (e.g., Oliva et al., 2013; Kou et al., 2018). Adaptations of such approaches might enable the use of the entire ensemble forecast of each predictor variable as input feature. However, training of these substantially more complex models likely requires longer training periods than were possible in our study.
Another possible extension would be to post-process forecasts on the entire two-dimensional grid, rather than individual stations locations, for example by using convolutional neural networks. This adds computational complexity and probably requires more training data but could provide information about the large-scale weather patterns and help to produce spatially consistent predictions.
We have considered probabilistic forecasts of a single weather variable at a single location and look-ahead time only. However, many applications require accurate models of cross-variable, spatial and temporal dependence structures, and much recent work has been focused on multivariate post-processing methods (e.g., Schefzik et al., 2013). Extending the neural network based approaches to multivariate forecast distributions accounting for such dependencies presents a promising starting point for future research.
In this study we demonstrated how neural networks can be used for distributional regression post-processing of ensemble weather forecasts. Our neural network models significantly outperform state-of-the-art post-processing techniques while being computationally more efficient. The main advantages of using neural networks are the capability of capturing nonlinear relations between arbitrary predictors and distribution parameters without having to specify appropriate link functions, and the ease of adding station information in a global model by using embeddings. The network model parameters are estimated by optimizing the CRPS, a non-standard choice in the machine learning literature tailored to probabilistic forecasting. Furthermore, the rapid pace of development in the deep learning community provides flexible and efficient modeling techniques and software libraries. The presented approach can therefore be easily applied to other problems.
The building blocks of our network model architecture provide general insight into the relative importance of model properties for post-processing ensemble forecasts. Specifically, the results indicate that encoding local information is very important for providing skillful probabilistic temperature forecasts. Further, including covariate information via auxiliary variables improves the results considerably, particularly when allowing for nonlinear relations of predictors and forecast distribution parameters. Ideally, any post-processing model should thus strive to incorporate all of these aspects.
We also showed that a trained machine learning model can be used to gain meteorological insight. In our case, it allowed us to identify the variables most important for correcting systematic temperature forecast errors of the ensemble. In this context, neural networks are somewhat interpretable and give us more information than we originally asked for. While a direct interpretation of the individual parameters of the model is intractable, this challenges the common notion of neural networks as pure black boxes.
Because of their flexibility neural networks are ideally suited to handle the increasing amounts of model and observation data as well as the diverse requirements for correcting multifaceted aspects of systematic ensemble forecast errors. We anticipate, therefore, that they will provide a valuable addition to the modeler’s toolkit for many areas of statistical post-processing and forecasting.
The research leading to these results has been done within the subprojects A6 “Representing forecast uncertainty using stochastic physical parameterizations” and C7 “Statistical postprocessing and stochastic physics for ensemble predictions” of the Transregional Collaborative Research Center SFB/TRR 165 “Waves to Weather” funded by the German Research Foundation (DFG). Sebastian Lerch is grateful for infrastructural support by the Klaus Tschira Foundation. The authors thank Tilmann Gneiting, Alexander Jordan, and Maxime Taillardat for helpful discussions and for providing code. The initial impetus for this work stems from a meeting with Kai Polsterer who presented a probabilistic neural network based approach to astrophysical image data analysis.
Using artificial intelligence to improve real-time decision-making for high-impact weather.Bulletin of the American Meteorological Society, 98, 2073–2090.
Extending extended logistic regression: Extended versus separate versus ordered versus censored.Monthly Weather Review, 142, 3003–3014.
Statistical postprocessing of ensemble precipitation forecasts by fitting censored, shifted gamma distributions.Monthly Weather Review, 143, 4578–4596.
|EMOS-loc-bst||maximum number of iterations||1 000|
|stopping criterion for boosting algorithm||AIC|
|QRF||number of trees||1 000|
|minimum size of terminal leaves||10|
|number of variables randomly sampled as||25|
|candidates at each split|
|Model||Number of||Epochs||Learning rate||Batch size||Hidden||Embedding|
|FCN||6||30 (15)||0.1 (0.1)||4 096 (4 096)|
|FCN-aux||82||30 (10)||0.02 (0.02)||1024 (1 024)|
|FCN-emb||1 084||30 (10)||0.02 (0.02)||1 024 (1 024)||2 (2)|
|FCN-aux-emb||1 160||30 (10)||0.02 (0.02)||1 024 (1 024)||2 (2)|
|NN-aux||3 326||(10)||(0.02)||(1 024)||(64)||(2)|
|NN-aux-emb||24 116||30 (10)||0.01 (0.002)||1 024 (4 096)||50 (512)||2 (2)|
For the purpose of the present Section, denote a generic probabilistic forecast for 2-meter temperature at station and time by . Note that
may be a parametric forecast distribution represented by CDF or probability density function (PDF), an ensemble forecastor a set of quantiles. We may choose to suppress the index at times for ease of notation.
As argued by Gneiting et al. (2007), probabilistic forecasts should generally aim to maximize sharpness subject to calibration. In a nutshell, a forecast is called calibrated if the realizing observation cannot be distinguished from a random draw from the forecast distribution. Calibration thus refers to the statistical consistency between forecast distribution and observation. By contrast, sharpness is a property of the forecast only and refers to the concentration of the predictive distribution. The calibration of ensemble forecasts can be assessed via verification rank (VR) histograms summarizing the distribution of ranks of the observation when it is pooled with the ensemble forecast (Hamill, 2001; Gneiting et al., 2007; Wilks, 2011). For continuous forecast distributions histograms of the PIT provide analogs of verification rank histograms. Calibrated forecasts result in uniform VR and PIT histograms, and deviations from uniformity indicate specific systematic errors such as biases or an under-representation of the forecast uncertainty.
For comparative model assessment, proper scoring rules allow simultaneous evaluation of calibration and sharpness (Gneiting and Raftery, 2007). A scoring rule assigns a numerical score to a pair of probabilistic forecast and corresponding realizing observation , and is called proper relative to a class of forecast distributions if
i.e., if the expected score is optimized if the true distribution of the observation is issued as forecast. Here, scoring rules are considered to be negatively oriented with smaller scores indicating better forecasts
Popular examples of proper scoring rule include the logarithmic score (LogS; Good, 1952)
where denotes the observation and denotes the PDF of the forecast distribution, and the continuous ranked probability score (CRPS; Matheson and Winkler, 1976)
denotes the CDF of the forecast distribution with finite first moment andis an indicator function that is 1 if and 0 otherwise. The integral in (B1) can be computed analytically for ensemble forecasts and a variety of continuous forecast distributions (see, e.g., Jordan et al., 2017). Specifically, the CRPS of a Gaussian distribution with mean value and standard deviation can be computed as
where and denote CDF and PDF of a standard Gaussian distribution, respectively (Gneiting et al., 2005).
Apart from forecast evaluation, proper scoring rules can also be used for parameter estimation. Following the generic optimum score estimation framework of Gneiting and Raftery (2007, Section 9.1), the parameters of a forecast distribution are determined by optimizing the value of a proper scoring rule, on average over a training sample. Optimum score estimation based on the LogS then corresponds to classical maximum likelihood estimation, whereas optimum score estimation based on the CRPS is often employed as a more robust alternative in meteorological applications. Analytical closed-form solutions of the CRPS, for example for a Gaussian distribution in (B2), allow for computing analytical gradient functions that can be leveraged in numerical optimization, see Jordan et al. (2017) for details.
In practical applications, scoring rules are usually computed as averages over stations and/or time periods. To assess the relative improvement over a reference forecast , we further introduce the continuous ranked probability skill score
which is positively oriented and can be interpreted as relative improvement over the reference. The CRPSS is usually computed as skill score of CRPS averages.
Formal statistical tests of equal forecast performance for assessing statistical significance of score differences have been widely used in the economic literature. Consider two forecasts and , with corresponding mean scores for over a test , where we assume that the forecast was issued time steps before the observation was recorded. Diebold and Mariano (1995)
propose the test statistic
where is an estimator of the asymptotic standard deviation of the score difference of and . Under standard regularity conditions,
asymptotically follows a standard normal distribution under the null hypothesis of equal predictive performance ofand . Thereby, negative values of indicate superior predictive performance of , whereas positive values indicate superior performance of . To account for temporal dependencies of score differences, we use the square root of the sample auto-covariance up to lag as estimator following Diebold and Mariano (1995). We employ Diebold-Mariano tests on an observation station-level, i.e., the mean CRPS values are determined by averaging over all scores at the specific station of interest,
where denotes days in the evaluation period.
Compared to previous uses of Diebold-Mariano tests in post-processing applications (e.g., Baran and Lerch, 2016) we further account for spatial dependencies of score differences at the different stations. Following suggestions of Wilks (2016) we apply a Benjamini and Hochberg (1995) procedure to control the false discovery rate at level . In a nutshell, the algorithm requires a higher standard in order to reject a local null hypothesis of equal predictive performance by selecting a threshold -value based on the set of ordered local -values . Particularly, is the largest that is not larger than , where is the number of tests, i.e., the number of stations in the evaluation set.
Table C1 shows computation times required for training the different post-processing models for both training sets. As noted before, the computation times are not directly comparable due to implementations in different programming languages and hardware environments. The computation times for the benchmark models, implemented in R using the crch (Messner et al., 2016), quantregForest (Meinshausen, 2017) and scoringRules (Jordan et al., 2017) packages, were obtained on a standard laptop computer, whereas the network models were implemented with the Python libraries Keras (Chollet et al., 2015) and TensorFlow (Abadi et al., 2016), and run on a single GPU (Nvidia Tesla K20). Computation times on a regular CPU are roughly 6 times longer for the most complex networks. For the simple networks the difference is negligible. Note that the inference time, i.e., the time to make a prediction after the model has been trained, is on the order of a few seconds for all models. Further, note that all computation times reported here are substantially lower compared to the computational costs of generating the raw ensemble forecast.
|Model||Computation time (min)|
|with training data from|
Pair-wise one-sided Diebold-Mariano tests are applied to all possible comparisons of forecast models at each of the 499 stations individually. To account for multiple hypothesis testing and spatial correlations of score differences, we apply a Benjamini-Hochberg procedure to the corresponding -values when aggregating the results by determining the ratio of stations with significant score differences, see Appendix B.3 for details.
Table C2 summarizes pair-wise Diebold-Mariano tests by showing the ratio of stations with statistically significant CRPS differences after applying a Benjamini-Hochberg procedure for a nominal level of . Generally, the results indicate large numbers of stations with significant differences of the network models when compared to standard EMOS approaches. NN-aux-emb shows the highest ratios of significant score differences over any competitor, and is significantly outperformed at very few station and only by the best-performing alternatives.
Training with 2015 data
Training with 2007-2015 data