Neural networks for post-processing ensemble weather forecasts

05/23/2018 ∙ by Stephan Rasp, et al. ∙ 8

Ensemble weather predictions require statistical post-processing of systematic errors to obtain reliable and accurate probabilistic forecasts. Traditionally, this is accomplished with distributional regression models in which the parameters of a predictive distribution are estimated from a training period. We propose a flexible alternative based on neural networks that can incorporate nonlinear relationships between arbitrary predictor variables and forecast distribution parameters that are automatically learned in a data-driven way rather than requiring pre-specified link functions. In a case study of 2-meter temperature forecasts at surface stations in Germany, the neural network approach significantly outperforms benchmark post-processing methods while being computationally more affordable. Key components to this improvement are the use of auxiliary predictor variables and station-specific information with the help of embeddings. Furthermore, the trained neural network can be used to gain insight into the importance of meteorological variables thereby challenging the notion of neural networks as uninterpretable black boxes. Our approach can easily be extended to other statistical post-processing and forecasting problems. We anticipate that recent advances in deep learning combined with the ever-increasing amounts of model and observation data will transform the post-processing of numerical weather forecasts in the coming decade.



There are no comments yet.


page 13

Code Repositories


post-processing experiments with neural networks

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Numerical weather prediction based on physical models of the atmosphere has improved continuously since its inception more than four decades ago (Bauer et al., 2015). In particular, the emergence of ensemble forecasts — simulations with varying initial conditions and/or model physics — added another dimension by quantifying the flow-dependent uncertainty. Yet despite these advances the raw forecasts continue to exhibit systematic errors which need to be corrected using statistical post-processing methods (Hemri et al., 2014). Considering of the ever-increasing social and economical value of numerical weather prediction — for example in the renewable energy industry — producing accurate and calibrated probabilistic forecasts is an urgent challenge.

Most post-processing methods correct systematic errors in the raw ensemble forecast by learning a function that relates the response variable of interest to predictors. From a machine learning perspective, post-processing can be viewed as a supervised learning task. For the purpose of this study we will consider post-processing in a narrower distributional regression framework where the aim is to model the conditional distribution of the weather variable of interest given a set of predictors. The two most prominent approaches for probabilistic forecasts, Bayesian model averaging

(BMA; Raftery et al., 2005) and non-homogeneous regression, also referred to as ensemble model output statistics (EMOS; Gneiting et al., 2005)

, rely on parametric forecast distributions. This means one has to specify a predictive distribution and estimate its parameters, for example the mean and the standard deviation in the case of a Gaussian distribution. In the EMOS framework the distribution parameters are connected to summary statistics of the ensemble predictions through suitable link functions which are estimated by minimizing a probabilistic loss function over a training dataset. Including additional predictors, such as forecasts of cloud cover or humidity, is not straightforward in this framework and requires elaborate approaches to avoid overfitting

(Messner et al., 2016), a term that describes the inability of a model to generalize to data outside the training dataset. We propose an alternative approach based on modern machine learning methods, which is capable of including arbitrary predictors and learns nonlinear dependencies in a data-driven way.

Much work over the past years has been spent on flexible machine learning techniques for statistical modeling and forecasting (McGovern et al., 2017)

. Random forests

(Breiman, 2001), for instance, can model nonlinear relationships including arbitrary predictors while being robust to overfitting. They have been used for classification and prediction of precipitation (Gagne et al., 2014), severe wind (Lagerquist et al., 2017) and hail (Gagne et al., 2017)

. In a post-processing context, quantile regression forest models have been proposed by

Taillardat et al. (2016).

Neural networks are a flexible and user-friendly machine learning algorithm that can model arbitrary nonlinear functions (Nielsen, 2015). They consist of several layers of interconnected nodes which are modulated with simple nonlinearities (Figure 2; Section 4

). Over the past decade many fields, most notably computer vision and natural language processing

(LeCun et al., 2015), but also biology, physics and chemistry (Angermueller et al., 2016; Goh et al., 2017) have been transformed by neural networks. In the atmospheric sciences, neural networks have been used to detect extreme weather in climate datasets (Liu et al., 2016) and parameterize sub-grid processes in global circulation models (Gentine et al., 2018). Neural networks have also been used for forecasting solar irradiances (Wang et al., 2012; Chu et al., 2013) and damaging winds (Lagerquist et al., 2017). However, the complexity of the neural networks used in these studies was limited.

Here, we demonstrate how neural networks can be used for probabilistic post-processing of ensemble forecasts in the distributional regression framework. The presented model architecture allows for the incorporation of various features that are relevant for correcting systematic deficiencies of ensemble predictions, and to estimate the network parameters by optimizing the continuous ranked probability score — a mathematically principled loss functions for probabilistic forecasts. Specifically, we explore a case study of 2-meter temperature forecasts at surface stations in Germany with data from 2007–2016. We compare different neural network configurations to benchmark post-processing methods for varying training period lengths. We further use the trained neural networks to gain meteorological insight into the problem at hand. Our ultimate goal is to present an efficient, multi-purpose approach to statistical post-processing and probabilistic forecasting. To the best of our knowledge, this study is the first to tackle ensemble post-processing using neural networks.

The remainder of the paper is structured as follows: Section 2 describes the forecast and observation data as well as the notation used throughout the study. In Section 3 we describe the benchmark post-processing models, followed by a description of the neural network techniques in Section 4. The main results are presented in Section 5. In Section 6 we explore the relative importance of the predictor variables. A discussion of possible extensions follows in Section 7 before a conclusion in Section 8.

Python (Python Software Foundation, 2017) and R (R Core Team, 2017) code for reproducing the results is available at

2 Data and notation

2.1 Forecast data

For this study, we focus on 2-meter temperature forecasts at surface stations in Germany at a forecast lead time of 48 h. The forecasts are taken from the Interactive Grand Global Ensemble (TIGGE) dataset111available at, see (Bougeault et al., 2010). In particular, we use the global European Centre for Medium-Range Weather Forecasts (ECMWF) 50-member ensemble forecasts initialized at 00 UTC every day. The data in the TIGGE archive is upscaled to a

grid which corresponds to a horizontal grid spacing of around 35/55 km (zonal/meridional). For comparison with the station observations, the gridded data were bilinearly interpolated to the observation locations. In addition to the target variable, we retrieved several auxiliary predictor variables (Table

1). These were chosen broadly based on meteorological intuition. For each variable, we reduced the 50-member ensemble to its mean and standard deviation.

Ensemble predictions are available from 3 January 2007 to 31 December 2016 every day. To ensure our verification procedure mimics operational conditions we set aside the year 2016 as a validation set. For model estimation we use two training periods, 2007–2015 and 2015 only, to assess the importance of training sample size.

Feature Description
Ensemble predictions (mean and standard deviation)
t2m 2-meter temperature
cape Convective available potential energy
sp Surface pressure
tcc Total cloud cover
sshf Sensible heat flux
slhf Latent heat flux
u10 10-meter U-wind
v10 10-meter V-wind
d2m 2-meter dew point temperature
ssr Short wave radiation flux
str Long wave radiation flux
sm Soil moisture
v_pl500 V-wind at 500 hPa
u_pl500 U-wind at 500 hPa
u_pl850 U-wind at 850 hPa
v_pl850 V-wind at 850 hPa
gh_pl500 Geopotential at 500 hPa
q_pl850 Specific humidity at 850 hPa
Station-specific information
station_alt Altitude of station
orog Altitude of model grid point
station_lat Latitude of station
station_lon Longitude of station
Table 1: Abbreviations and description of all features. Detailed definitions are available at

2.2 Observation data

The forecasts are evaluated at 537 weather stations in Germany (see Figure 1222All maps in this article were produced using the R package ggmap (Kahle and Wickham, 2013).). 2-meter temperature data are available from the Climate Data Center of the German weather service (Deutscher Wetterdienst, DWD)333available at Several stations have periods of missing data which are omitted from the analysis. During the evaluation period in calendar year 2016, observations are available at 499 stations.

After removing missing observations, the 2016 validation set contains 182 218 samples, the 2007–2015 training set contains 1 626 724 samples and the 2015 training set contains 180 849 samples.

Figure 1: Locations of DWD surface observation stations. The gray scale values of the points indicate the altitude in meters.

2.3 Notation

We now introduce the notation that is used throughout the rest of the paper. An observation of 2-meter temperature at station and time will be denoted by . For each and , the 50-member ECMWF ensemble forecast of variable is given by with mean value and standard deviation . The mean values and standard deviations of all variables in the upper part of Table 1

are combined with station-specific features in the lower part, and aggregated into a vector of predictors

. Further, we write to denote the vector of predictors that only contains mean value and standard deviations of the 2-meter temperature forecasts.

3 Benchmark post-processing techniques

3.1 Ensemble model output statistics

In the general EMOS framework proposed by Gneiting et al. (2005), the conditional distribution of the weather variable of interest, , given ensemble predictions , is modeled by a single parametric forecast distribution with parameters ,


The parameters vary over space and time, and depend on the ensemble predictions through suitable link functions ,


Here, we are interested in modeling the conditional distribution of temperature and follow Gneiting et al. (2005) who introduce a model based on ensemble predictions of temperature, , only, where the forecast distribution is Gaussian with mean and standard deviation , i.e.,

and where the link functions for mean and standard deviation are affine functions of the ensemble mean and standard deviation, respectively,


Over the past decade, the EMOS framework has been extended from temperature to other weather variables including wind speed (Thorarinsdottir and Gneiting, 2010; Lerch and Thorarinsdottir, 2013; Baran and Lerch, 2015; Scheuerer and Möller, 2015) and precipitation (Messner et al., 2014; Scheuerer, 2014; Scheuerer and Hamill, 2015).

The model parameters (or EMOS coefficients) are estimated by minimizing the mean continuous ranked probability score (CRPS) as function of the parameters over a training set. The CRPS is an example of a proper scoring rule, i.e., a mathematically principled loss function for distribution forecasts, and is a standard choice in meteorological applications. Details on the mathematical background of proper scoring rules and their use for model estimation are provided in Appendix B.

Training sets are often considered to be comprised of the most recent days only. However, as we did not find substantial differences in predictive performance (see Section 5), we estimate the coefficients over a fixed training set, they thus do not vary over time and we denote them by . Estimation is usually either performed locally, i.e., considering only forecast cases from the station of interest, or globally by pooling together forecasts and observations from all stations. We refer to the corresponding EMOS models by EMOS-loc and EMOS-gl, respectively. The parameters of the global model do not depend on the station , and are thus unable to correct location-specific deficiencies of the ensemble forecasts. Alternative approaches where training sets are selected based on similarities of weather situations or observation station characteristics were proposed by Junk et al. (2015) and Lerch and Baran (2017). EMOS-gl and EMOS-loc are implemented in R with the help of the scoringRules package (Jordan et al., 2017).

3.2 Boosting for predictor selection in EMOS models

Extending the EMOS framework to allow for including additional predictor variables is non-trivial as the increased number of parameters can result in overfitting. Messner et al. (2017) proposed a boosting algorithm for this purpose. In this approach components of the link function in (2) are chosen to be an affine function for the mean and an exponential transformation of an affine function for the standard deviation ,


Here, and denote coefficient vectors corresponding to the vector of predictors extended by a constant. As for the standard EMOS models, the coefficient vectors are estimated over fixed training periods and thus do not depend on , we suppress the index in the following.

The boosting algorithm proceeds iteratively by updating the coefficient of the predictor that improves the current model fit most. As the coefficient vectors are initialized as , only the most important variables will have non-zero coefficients if the algorithm is stopped before convergence. The contributions of the different predictors are assessed by computing average correlations to partial derivatives of the loss function with respect to and over the training set. If the current model fit is improved, the coefficient vectors are updated by a pre-defined step size into the direction of steepest descent of linear approximations of the gradients.

We denote local EMOS models with an additional boosting step by EMOS-loc-bst. The tuning parameters of the algorithm were chosen by fitting models for a variety of choices and picking the configuration with the best out-of-sample predictions (Table A2) based on implementations in the R package crch (Messner et al., 2016). Note, however, that the results are not very sensitive to the exact choice of tuning parameters. For the local model considered here, the station-specific features in the lower part of Table 1 are not relevant and are excluded from . Boosting-based variants of global EMOS models have also been tested, but result in worse forecasts.

The boosting-based EMOS-loc-bst model differs from the standard EMOS models (EMOS-gl and EMOS-loc) in several aspects. First, the boosting step allows to include covariate information from predictor variables other than temperature forecasts. Second, the parameters are estimated by maximum likelihood estimation, i.e., by minimizing the mean logarithmic score by contrast to minimum CRPS estimation, see Appendix B for details. Further, the affine link function for the standard deviation in (3) is replaced by an affine function for the logarithm of the standard deviation in (4).

3.3 Quantile regression forests

Parametric distributional regression models such as the EMOS methods described above require the choice of a suitable parametric family . While the conditional distribution of temperature can be well approximated by a Gaussian distribution, this poses a limitation for other weather variables such as wind speed or precipitation where the choice is less obvious (see, e.g., Baran and Lerch, 2018).

Non-parametric distributional regression approaches provide alternatives that circumvent the choice of parametric family. For example, quantile regression approaches approximate the conditional distribution by a set of quantiles. In the context of post-processing ensemble forecasts, Taillardat et al. (2016) proposed a quantile regression forest (QRF) model based on the work of Meinshausen (2006) that allows to include additional predictor variables.

The QRF model is based on the idea of generating random forests from classification and regression trees (Breiman et al., 1984)

. These are binary decision trees obtained by iteratively splitting the training data into two groups according to some threshold for one of the predictors, chosen such that every split minimizes the sum of the variance of the response variable in each of the resulting groups. The splitting procedure is iterated until a stopping criterion is reached. The final groups (or terminal leaves) thus contain subsets of the training observations based on the predictor values, and out of sample forecasts at station

and time can be obtained by proceeding through the decision tree according to the corresponding predictor values . Random forest models (Breiman, 2001)

increase stability of the predictions by averaging over many random decision trees generated by selecting a random subset of the predictors at each candidate split in conjunction with bagging, i.e., bootstrap aggregation of random subsamples of training sets. In the quantile regression forest approach, each tree provides an approximation of the distribution of the variable of interest given by the empirical cumulative distribution function (CDF) of the observation values in the terminal leaf associated with the current predictor values

. Quantile forecasts can then be computed from the combined forecast distribution which is obtained by averaging over all tree-based empirical CDFs.

We implement a local version of the QRF model where separate models are estimated for each station based on training sets that only contain past forecasts and observations from that specific station. As discussed by Taillardat et al. (2016), the predicted quantiles are necessarily restricted to the range of observed values in the training period by construction which may be disadvantageous in case of shorter training periods. However, global variants of the QRF model did not result in improved forecast performance even with only one year of training data, we will thus restrict attention to the local QRF model. The models are implemented using the quantregForest package (Meinshausen, 2017) for R. Tuning parameters are chosen as for the EMOS-loc-bst model (Table A2).

The QRF approach has recently been extended into several directions. Athey et al. (2016) propose a generalized version of random forest-based quantile regression based on theoretical considerations (GRF) which has been tested but did not result in improved forecast performance. Taillardat et al. (2017) combine QRF (and GRF) models and parametric distributional regression by fitting a parametric CDF to the observations in the terminal leaves instead of using the empirical CDF. Schlosser et al. (2018) combine parametric distributional regression and random forests for parameter estimation in the framework of a generalized additive model for location, scale and shape.

4 Neural networks

In this section we will give a brief introduction to neural networks. For a more detailed treatment the interested reader is referred to more comprehensive resources (e.g., Nielsen, 2015; Goodfellow et al., 2016). The network techniques are implemented using the Python

libraries Keras

(Chollet et al., 2015)

and TensorFlow

(Abadi et al., 2016).

Neural networks consist of several layers of nodes (Figure 2), each of which is a weighted sum of all nodes from the previous layer plus a bias term:


The first layer contains the input values, or features, while the last layer represents the output values, or targets. In the layers in-between, called hidden layers, each node value is passed through a nonlinear activation function. For this study, we use a rectified linear unit (ReLU):

This activation function allows the neural network to represent nonlinear functions. The weights and biases are optimized to reduce a loss function using stochastic gradient descent (SGD). Here we employ an SGD version called Adam

(Kingma and Ba, 2014).

In this study we use networks without a hidden layer and with a single hidden layer (Figure 2). The former, which we will call fully-connected networks (FCN), model the outputs as a linear combination of the inputs. The latter, called neural networks (NN) here, are capable of representing nonlinear relationships. Introducing additional hidden layers to neural networks did not improve the predictions as additional model complexity increases the potential of overfitting.

Figure 2: Schematic of a fully-connected (left) and a neural network with one hidden layer (right). In both cases, data flows from left to right. Orange nodes and connections illustrate station embeddings, blue ones auxiliary input variables. Mathematical operations are to be understood as element-wise operations for vector objects.

4.1 Neural networks for ensemble post-processing

Neural networks can be applied to a range of problems, such as regression and classification. The main difference between those are the contents and activation function of the output layer, and the loss function. Here we use the neural network for the distributional regression task of post-processing ensemble forecasts. Our output layer represents the distribution parameters and of the Gaussian predictive distribution. No activation function is applied. The corresponding probabilistic forecast describes the conditional distribution of the observation given the predictors as input features. As a loss function we use the closed form expression of the CRPS for a Gaussian distribution, see equation (B2). This is a non-standard choice in the neural network literature (D’Isanto and Polsterer (2018) is the only previous study to our knowledge) but provides a mathematically principled choice for the distributional regression problem at hand (see Appendix B for the mathematical background). Other probabilistic neural network approaches include quantile regression (Taylor, 2000) and distribution-to-distribution regression (Kou et al., 2018).

The simplest network model is a fully connected model based on predictors , i.e., mean and standard deviation of ensemble predictions of temperature only (denoted by FCN). Apart from additional connections for the mean and standard deviation to ensemble standard deviation and mean, respectively, the FCN model is conceptually equivalent to EMOS-gl, but differs in the parameter estimation approaches. A neural network with a hidden layer for the -input did not show any improvements over the simple linear model suggesting that there is no nonlinear relationships to exploit. Additional information from auxiliary variables can be taken into account by considering the entire vector of predictors as input features. The corresponding fully connected and neural network models are referred to as FCN-aux and NN-aux.

4.2 Station embeddings

To enable the networks to learn station-specific information we use embeddings, a common technique in natural language processing and recommender systems. An embedding is a mapping from a discrete object, in our case the station ID, to a vector of real numbers. The elements of this vector with length are also referred to as latent features. In total, therefore, the embedding matrix has dimension , where is the number of stations. The latent features are concatenated with the predictors, or , and are updated along with the weights and biases during training. This allows the algorithm to learn a specific set of numbers for each station. Here we use as larger values did not improve the predictions.

The fully connected network with input features and embeddings is abbreviated by FCN-emb. As with FCN, adding a hidden layer did not improve the results. Fully connected and neural networks with both, station embeddings and auxiliary inputs , are denoted by FCN-aux-emb and NN-aux-emb.

4.3 Further network details

Neural networks with a large number of parameters, i.e., weights and biases, can suffer from overfitting. One way to reduce overfitting is to stop training early. When to stop can be guessed by taking out a subset (20%) from the training set (2007–2015 or 2015) and checking when the score on this separate dataset stops improving. This gives a good approximation when to stop training on the full training set without using the actual 2016 validation set during training. Other common regularization techniques to prevent overfitting, such as dropout or weight decay (L2 regularization), were not successful in our case.

Finally, we train ensembles of ten neural networks with different random initial parameters for each configuration and average over the forecast distribution parameter estimates to obtain . For the more complex network models this helps to stabilize the parameter estimates by reducing the variability due to random variations between model runs and slightly improves the forecasts.

5 Results

Tuning parameters for all benchmark and network models are listed in Tables A2 and A2. Details on the employed evaluation methods are provided in Appendix B.

5.1 General results

The CRPS values averaged over all stations and the entire 2016 validation period are summarized in Table 2. For the 2015 training period, EMOS-gl gives a 13% relative improvement compared to the raw ECMWF ensemble forecasts in terms of mean CRPS.444Note that standard EMOS models are often implemented with rolling training periods, i.e., parameter estimation is based on the preceding days. Such implementations have been tested, but only result in marginal improvements. For the longer training period from 2007–2015, we have also implemented EMOS models that are estimated based on a centered window around the current day from all previous years. For the local model, this results in a slightly improved mean CRPS of 0.88, but does not improve the forecasts of EMOS-gl. As expected, FCN which mimics the design of EMOS-gl achieves a very similar score. Adding local station information in EMOS-loc and FCN-emb improves the global score by another 10%. While EMOS-loc estimates a separate model for each station, FCN-emb can be seen as a global network-based implementation of EMOS-loc. Adding covariate information through auxiliary variables results in an improvement for the fully connected models similar to that of adding station information. Combining auxiliary variables and station embeddings in FCN-emb-aux improves the mean CRPS further to 0.88 but the effects do not stack linearly. Adding covariate information in EMOS models using boosting (EMOS-loc-bst) outperforms FCN-emb-aux by 3%. Allowing for nonlinear interactions of station information and auxiliary variables using a neural network (NN-aux-emb) achieves the best results, improving the best benchmark technique (EMOS-loc-bst) by 3% for a total improvement compared to the raw ensemble of 29%. The QRF model is unable to compete with most of the post-processing models for the 2015 training period.

The relative scores and model rankings for the 2007–2015 training period closely match those of the 2015 period. For the linear models (EMOS-gl, EMOS-loc and all FCN) more data does not improve the score by much. For EMOS-loc-bst and the neural network models, however, the skill is increased by 4–5%. This suggests that longer training periods are most efficiently exploited by more complex, nonlinear models. QRF improves the most, now being among the best models, which indicates a minimum data amount required for this method to work. This is likely due to the limitation of predicted quantiles to the range of observed values in the training data, see Section 3.3.

Model Description Mean CRPS for
training period
2015 2007–2015
Raw ensemble 1.16 1.16
Benchmark post-processing methods
EMOS-gl Global EMOS 1.01 1.00
EMOS-loc Local EMOS 0.90 0.90
EMOS-loc-bst Local EMOS with boosting 0.85 0.80
QRF Local quantile regression forest 0.95 0.81
Neural network models
FCN Fully connected network 1.01 1.01
FCN-aux …with auxiliary predictors 0.92 0.91
FCN-emb …with station embeddings 0.91 0.91
FCN-aux-emb …with both of the above 0.88 0.87
NN-aux 1-hidden layer neural network with auxiliary predictors 0.90 0.86
NN-aux-emb …and station embeddings 0.82 0.78
Table 2: Mean CRPS for raw and post-processed ECMWF ensemble forecasts, averaged over all available observations in calendar year 2016. The lowest (i.e., best) values are marked in bold font.

To assess calibration, verification rank and probability integral transform (PIT) histograms of raw and post-processed forecasts are shown in Figures C1 and C2. The raw ensemble forecasts are under-dispersed as indicated by the U-shaped verification rank histogram, i.e., observations tend to fall outside the range of the ensemble too frequently. By contrast, all post-processed forecast distributions are substantially better calibrated and the corresponding PIT histograms show much smaller deviations from uniformity. All models show a slight overprediction of high temperatures and, with the exception of QRF, an underprediction of low values. The linear EMOS and FCN models as well as QRF are further slightly overdispersive as indicated by inverse U-shaped upper parts of the histogram.

5.2 Station-by-station results

CRPSS relative to raw ensemble forecasts

CRPSS relative to EMOS-loc

Figure 3: Boxplots of station-wise mean continuous ranked probability skill score of all post-processing models using the raw ensemble (top row) and EMOS-loc (bottom row) as reference forecast. A dot within each box represents the mean CRPSS at one of the observation stations. The CRPSS is computed so that positive values indicate an improvement of the model specified on the horizontal axis over the reference. Similar plots with different reference models are provided in Appendix C.2.

Figure 3

shows the station-wise distribution of the continuous ranked probability skill score (CRPSS), which measures the probabilistic skill relative to a reference model. Positive values indicate an improvement over the reference. Compared to the raw ensemble, forecasts at most stations are improved by all post-processing methods with only a few negative outliers. Compared to EMOS-loc only FCN-aux-emb, the neural network models and EMOS-loc-bst show improvements at the majority of stations. Corresponding plots with the three best-performing models as reference experiments are provided in Figure

C3. It is interesting to note that the network models, with the exception of FCN and FCN-emb, have more outliers, particularly for negative values compared to the EMOS methods and QRF which have very few negative outliers. This might be due to a few stations with strongly location-specific error characteristics that the locally estimated benchmark models are better able capture. Training with data from 2007 to 2015 alleviates this somewhat.

Figure 4 shows maps with the best-performing models in terms of mean CRPS for each station. For the majority of stations NN-aux-emb provides the best predictions. The variability of station-specific best models is greater for the 2015 training period compared to 2007–2015. The top three models for the 2015 period are NN-aux-emb (best at 65.9% of stations), EMOS-loc-bst (16.0%) and NN-aux (7.2%), and for 2007–2015 NN-aux-emb (73.5%), EMOS-loc-bst (12.4%) and QRF (7.4%). At coastal and offshore locations, particularly for the shorter training period, the benchmark methods tend to outperform the network methods. Ensemble forecast errors at these location likely have a strong location-specific component that might be easier to capture for the locally estimated EMOS and QRF methods.

Additionally, we evaluated the statistical significance of the differences between the competing post-processing methods using a combination of Diebold-Mariano tests (Diebold and Mariano, 1995) and a Benjamini and Hochberg (1995) procedure to account for temporal and spatial dependencies of forecast errors. We thereby follow suggestions of Wilks (2016), mathematical details are deferred to Appendix B.3. The results provided in Appendix C.4 generally indicate high ratios of stations with significant score differences in favor of the neural network models. Even when compared to the second best-performing model, EMOS-loc-bst, NN-aux-emb is significantly better at 30% of the stations and worse at only 2% or less for both training periods.

Figure 4: Observation station locations color-coded by the best performing model (in terms of mean CRPS over calendar year 2016) for models trained on data from 2015 (left), and on data from 2007–2015 (right). Point shapes indicate the type of model.

5.3 Computational aspects

While a direct comparison of computation times for the different methods is difficult, even the most complex network methods are a factor of two or more faster than EMOS-loc-bst. This includes creating an ensemble of ten different model realizations. QRF is by far the slowest method, being roughly ten times slower than EMOS-loc-bst. Complex neural networks benefit substantially from running on a graphics processing unit (GPU) compared to running on the core processing unit (CPU; roughly six times slower for NN-aux-emb). Neural network-ready GPUs are now widely available in many scientific computing environments or via cloud computing555For example, see For more details on the computational methods and results see Appendix C.3.

6 Feature importance

To assess the relative importance of all features we use a technique called permutation importance that was first described in the context of random forests (Breiman, 2001). We randomly shuffle each predictor/feature in the validation set one at a time and observe the increase in mean CRPS compared to the unpermuted features. While unable to capture colinearities between features this method does not require re-estimating the model with each individual feature omitted.

Consider a random permutation of station and time indices and let denote the vector of predictors where variable is permuted according to , i.e., a vector with -th entry

The importance of input feature is computed as mean CRPS difference

where we average over the entire evaluation set and denotes the conditional forecast distribution given a vector of predictors.

Figure 5: Feature importance for the 15 most important predictors. Note that the values for t2m_mean are divided by ten. See Table 1 for variable abbreviations and descriptions.

We picked three network setups to investigate how feature importance changes by adding station embeddings and a nonlinear layer (Figure 5). For the linear model without station embeddings (FCN-aux) station altitude and orography, the altitude of the model grid cell, are the most important predictors after the mean temperature forecast. This makes sense since our interpolation from the forecast model grid to the station does not adjust for the height of the surface station. The only other features with significant importance are the mean shortwave radiation flux and 850 hPa specific humidity. Adding station embeddings (FCN-aux-emb) reduces the significance of the station altitude information which now seems to be encoded in the latent embedding features. The nonlinearity added by the hidden layer in NN-aux-emb increases the sensitivity to permuting input features overall and distributes the feature importance more evenly. In particular, we note an increase in the importance of the station altitude and orography but also the sensible and latent heat flux and total cloud cover.

The most important features, apart from the obvious mean forecast temperature and station altitude, seem to be indicative of insolation, either directly like the shortwave flux or indirectly like the 850 hPa humidity. It is interesting that the latter seems to be picked by the algorithms as a proxy for cloud cover rather than the direct cloud cover feature. Curiously, the temperature standard deviation is not an important feature for the post-processing models. We suspect that this is a consequence of the low correlation between the raw ensemble standard deviation and the forecast error ( on the test set) and the general under-dispersion (mean spread-error ratio of 0.51). The post-processing algorithms almost double the spread to achieve a spread-error ratio of 0.95. The correlation of the raw and post-processed ensemble spreads is 0.39 suggesting that the post-processing is mostly an additive correction to the ensemble spread.

Note that this method of assessing feature importance is in principle possible for boosting- and QRF-based models. However, for the local implementations of the algorithm the importance changes from station to station, making interpretation more difficult.

7 Discussion

Here we discuss some approaches we attempted that failed to improve our results, as well as directions for future research.

Having to describe the distribution of the target variable in parametric techniques is a non-trivial task. For temperature, a Gaussian distribution is a good approximation but for other variables, such as wind speed or precipitation, finding a distribution that fits the data is a substantial challenge (e.g., Taillardat et al., 2016; Baran and Lerch, 2018)

. Ideally, a machine learning algorithm would learn to predict the full probability distribution rather than distribution parameters only. One way to achieve this is to approximate the forecast distribution by a combination of uniform distributions and predicting the probability of the temperature being within pre-specified bins. Initial experiments indicate that the neural network is able to produce a good approximation of a Gaussian distribution but the skill was comparable only to the raw ensemble. This suggests that for target variables which are well approximated by a parametric distribution, utilizing these distributions is advantageous. One direction for future research is to apply this approach to more complex variables.

Standard EMOS models are often estimated based on so-called “rolling” training windows with data from previous days only in order to incorporate temporal dependencies of ensemble forecast errors. For neural networks, one way to incorporate temporal dependencies is to use convolutional or recurrent neural networks

(Schmidhuber, 2015) which can process sequences as an input. In our tests, this leads to more overfitting without an improvement to the validation score. For other datasets, however, we believe that these approaches are worth revisiting.

One popular way to combat overfitting in machine learning algorithms is data augmentation. In the example of image recognition models, the training images are randomly rotated, flipped, zoomed, etc. to artificially increase the sample size (e.g., Krizhevsky et al., 2012). We tried a similar approach by adding random noise of a reasonable scale to the input features, but found no improvement in the validation score. A potential alternative to adding random noise might be augmenting the forecasts for a station with data from neighboring stations or grid points.

Similarly to rolling training windows for the traditional EMOS models, we tried updating the neural network each day during the validation period with the data from the previous time step, but found no improvements. This supports our observation that rolling training windows only bring marginal improvements for the benchmark EMOS models. Such an online learning approach could be more relevant in an operational setting, however, where model versions might change frequently or it is too expensive to re-estimate the entire post-processing model every time new data becomes available.

We have restricted the set of predictors to observation station characteristics and summary statistics (mean and standard deviation) of ensemble predictions of several weather variables. Recently, flexible distribution-to-distribution regression network models have been proposed in the machine learning literature (e.g., Oliva et al., 2013; Kou et al., 2018). Adaptations of such approaches might enable the use of the entire ensemble forecast of each predictor variable as input feature. However, training of these substantially more complex models likely requires longer training periods than were possible in our study.

Another possible extension would be to post-process forecasts on the entire two-dimensional grid, rather than individual stations locations, for example by using convolutional neural networks. This adds computational complexity and probably requires more training data but could provide information about the large-scale weather patterns and help to produce spatially consistent predictions.

We have considered probabilistic forecasts of a single weather variable at a single location and look-ahead time only. However, many applications require accurate models of cross-variable, spatial and temporal dependence structures, and much recent work has been focused on multivariate post-processing methods (e.g., Schefzik et al., 2013). Extending the neural network based approaches to multivariate forecast distributions accounting for such dependencies presents a promising starting point for future research.

8 Conclusion

In this study we demonstrated how neural networks can be used for distributional regression post-processing of ensemble weather forecasts. Our neural network models significantly outperform state-of-the-art post-processing techniques while being computationally more efficient. The main advantages of using neural networks are the capability of capturing nonlinear relations between arbitrary predictors and distribution parameters without having to specify appropriate link functions, and the ease of adding station information in a global model by using embeddings. The network model parameters are estimated by optimizing the CRPS, a non-standard choice in the machine learning literature tailored to probabilistic forecasting. Furthermore, the rapid pace of development in the deep learning community provides flexible and efficient modeling techniques and software libraries. The presented approach can therefore be easily applied to other problems.

The building blocks of our network model architecture provide general insight into the relative importance of model properties for post-processing ensemble forecasts. Specifically, the results indicate that encoding local information is very important for providing skillful probabilistic temperature forecasts. Further, including covariate information via auxiliary variables improves the results considerably, particularly when allowing for nonlinear relations of predictors and forecast distribution parameters. Ideally, any post-processing model should thus strive to incorporate all of these aspects.

We also showed that a trained machine learning model can be used to gain meteorological insight. In our case, it allowed us to identify the variables most important for correcting systematic temperature forecast errors of the ensemble. In this context, neural networks are somewhat interpretable and give us more information than we originally asked for. While a direct interpretation of the individual parameters of the model is intractable, this challenges the common notion of neural networks as pure black boxes.

Because of their flexibility neural networks are ideally suited to handle the increasing amounts of model and observation data as well as the diverse requirements for correcting multifaceted aspects of systematic ensemble forecast errors. We anticipate, therefore, that they will provide a valuable addition to the modeler’s toolkit for many areas of statistical post-processing and forecasting.


The research leading to these results has been done within the subprojects A6 “Representing forecast uncertainty using stochastic physical parameterizations” and C7 “Statistical postprocessing and stochastic physics for ensemble predictions” of the Transregional Collaborative Research Center SFB/TRR 165 “Waves to Weather” funded by the German Research Foundation (DFG). Sebastian Lerch is grateful for infrastructural support by the Klaus Tschira Foundation. The authors thank Tilmann Gneiting, Alexander Jordan, and Maxime Taillardat for helpful discussions and for providing code. The initial impetus for this work stems from a meeting with Kai Polsterer who presented a probabilistic neural network based approach to astrophysical image data analysis.


  • Abadi et al. (2016) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y. and Zheng, X. (2016). Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 265–283.
  • Angermueller et al. (2016) Angermueller, C., Pärnamaa, T., Parts, L. and Stegle, O. (2016). Deep learning for computational biology. Molecular Systems Biology, 12, 878.
  • Athey et al. (2016) Athey, S., Tibshirani, J. and Wager, S. (2016). Generalized random forests. Preprint, available at
  • Baran and Lerch (2015) Baran, S. and Lerch, S. (2015). Log-normal distribution based Ensemble Model Output Statistics models for probabilistic wind-speed forecasting. Quarterly Journal of the Royal Meteorological Society, 141, 2289–2299.
  • Baran and Lerch (2016) Baran, S. and Lerch, S. (2016). Mixture EMOS model for calibrating ensemble forecasts of wind speed. Environmetrics, 27, 116–130.
  • Baran and Lerch (2018) Baran, S. and Lerch, S. (2018). Combining predictive distributions for the statistical post-processing of ensemble forecasts. International Journal of Forecasting, 34, 477–496.
  • Bauer et al. (2015) Bauer, P., Thorpe, A. and Brunet, G. (2015). The quiet revolution of numerical weather prediction. Nature, 525, 47–55.
  • Benjamini and Hochberg (1995) Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57, 289–300.
  • Bougeault et al. (2010) Bougeault, P., Toth, Z., Bishop, C., Brown, B., Burridge, D., Chen, D. H., Ebert, B., Fuentes, M., Hamill, T. M., Mylne, K., Nicolau, J., Paccagnella, T., Park, Y.-Y., Parsons, D., Raoult, B., Schuster, D., Dias, P. S., Swinbank, R., Takeuchi, Y., Tennant, W., Wilson, L. and Worley, S. (2010). The THORPEX Interactive Grand Global Ensemble. Bulletin of the American Meteorological Society, 91, 1059–1072.
  • Breiman (2001) Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
  • Breiman et al. (1984) Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth International Group, Belmont.
  • Chollet et al. (2015) Chollet, F. et al. (2015). Keras.
  • Chu et al. (2013) Chu, Y., Pedro, H. T. C. and Coimbra, C. F. M. (2013). Hybrid intra-hour DNI forecasts with sky image processing enhanced by stochastic learning. Solar Energy, 98, 592–603.
  • Diebold and Mariano (1995) Diebold, F. X. and Mariano, R. S. (1995). Comparing predictive accuracy. Journal of Business and Economic Statistics, 13, 253–263.
  • D’Isanto and Polsterer (2018) D’Isanto, A. and Polsterer, K. L. (2018). Photometric redshift estimation via deep learning-generalized and pre-classification-less, image based, fully probabilistic redshifts. Astronomy & Astrophysics, 609, A111.
  • Gagne et al. (2017) Gagne, D. J., McGovern, A., Haupt, S. E., Sobash, R. A., Williams, J. K. and Xue, M. (2017). Storm-based probabilistic hail forecasting with machine learning applied to convection-allowing ensembles. Weather and Forecasting, 32, 1819–1840.
  • Gagne et al. (2014) Gagne, D. J., McGovern, A. and Xue, M. (2014). Machine learning enhancement of storm-scale ensemble probabilistic quantitative precipitation forecasts. Weather and Forecasting, 29, 1024–1043.
  • Gentine et al. (2018) Gentine, P., Pritchard, M. S., Rasp, S., Reinaudi, G. and Yacalis, G. (2018). Could machine learning break the convection parameterization deadlock? Geophysical Research Letters. In review.
  • Gneiting et al. (2007) Gneiting, T., Balabdaoui, F. and Raftery, A. E. (2007). Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society Series B, 69, 243–268.
  • Gneiting and Raftery (2007) Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102, 359–378.
  • Gneiting et al. (2005) Gneiting, T., Raftery, A. E., Westveld, A. H. and Goldman, T. (2005). Calibrated Probabilistic Forecasting Using Ensemble Model Output Statistics and Minimum CRPS Estimation. Monthly Weather Review, 133, 1098–1118.
  • Goh et al. (2017) Goh, G. B., Hodas, N. O. and Vishnu, A. (2017). Deep learning for computational chemistry. Journal of Computational Chemistry, 38, 1291–1307.
  • Good (1952) Good, I. J. (1952). Rational decisions. Journal of the Royal Statistical Society Series B, 14, 107–114.
  • Goodfellow et al. (2016) Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning. MIT Press.
  • Hamill (2001) Hamill, T. M. (2001). Interpretation of rank histograms for verifying ensemble forecasts. Monthly Weather Review, 129, 550–560.
  • Hemri et al. (2014) Hemri, S., Scheuerer, M., Pappenberger, F., Bogner, K. and Haiden, T. (2014). Trends in the predictive performance of raw ensemble weather forecasts. Geophysical Research Letters, 41, 9197–9205.
  • Jordan et al. (2017) Jordan, A., Krüger, F. and Lerch, S. (2017). Evaluating probabilistic forecasts with scoringRules. Preprint, available at
  • Junk et al. (2015) Junk, C., Delle Monache, L. and Alessandrini, S. (2015). Analog-based ensemble model output statistics. Monthly Weather Review, 143, 2909–2917.
  • Kahle and Wickham (2013) Kahle, D. and Wickham, H. (2013). ggmap: Spatial visualization with ggplot2. The R Journal, 5, 144–161.
  • Kingma and Ba (2014) Kingma, D. P. and Ba, J. (2014). Adam: A Method for Stochastic Optimization. Preprint, available at
  • Kou et al. (2018) Kou, C., Lee, H. K. and Ng, T. K. (2018). Distribution regression network. Preprint, available at
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (F. Pereira, C. J. C. Burges, L. Bottou and K. Q. Weinberger, eds.). Curran Associates, Inc., 1097–1105.
  • Lagerquist et al. (2017) Lagerquist, R., McGovern, A. and Smith, T. (2017). Machine learning for real-time prediction of damaging straight-line convective wind. Weather and Forecasting, 32, 2175–2193.
  • LeCun et al. (2015) LeCun, Y., Bengio, Y. and Hinton, G. (2015). Deep learning. Nature, 521, 436–444.
  • Lerch and Baran (2017) Lerch, S. and Baran, S. (2017). Similarity-based semilocal estimation of post-processing models. Journal of the Royal Statistical Society Series C, 66, 29–51.
  • Lerch and Thorarinsdottir (2013) Lerch, S. and Thorarinsdottir, T. L. (2013). Comparison of non-homogeneous regression models for probabilistic wind speed forecasting. Tellus A, 65, 21206.
  • Liu et al. (2016) Liu, Y., Racah, E., Correa, J., Khosrowshahi, A., Lavers, D., Kunkel, K., Wehner, M. and Collins, W. (2016). Application of deep convolutional neural networks for detecting extreme weather in climate datasets. Preprint, available at
  • Matheson and Winkler (1976) Matheson, J. E. and Winkler, R. L. (1976). Scoring rules for continuous probability distributions. Management Science, 22, 1087–1096.
  • McGovern et al. (2017) McGovern, A., Elmore, K. L., Gagne, D. J., Haupt, S. E., Karstens, C. D., Lagerquist, R., Smith, T. and Williams, J. K. (2017).

    Using artificial intelligence to improve real-time decision-making for high-impact weather.

    Bulletin of the American Meteorological Society, 98, 2073–2090.
  • Meinshausen (2006) Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research, 7, 983–999.
  • Meinshausen (2017) Meinshausen, N. (2017). quantregForest: Quantile Regression Forests. R package version 1.3-7, URL
  • Messner et al. (2014) Messner, J. W., Mayr, G. J., Wilks, D. S. and Zeileis, A. (2014).

    Extending extended logistic regression: Extended versus separate versus ordered versus censored.

    Monthly Weather Review, 142, 3003–3014.
  • Messner et al. (2016) Messner, J. W., Mayr, G. J. and Zeileis, A. (2016). Heteroscedastic censored and truncated regression with crch. The R Journal, 8, 173–181.
  • Messner et al. (2017) Messner, J. W., Mayr, G. J. and Zeileis, A. (2017). Nonhomogeneous boosting for predictor selection in ensemble postprocessing. Monthly Weather Review, 145, 137–147.
  • Nielsen (2015) Nielsen, M. A. (2015). Neural Networks and Deep Learning. Determination Press.
  • Oliva et al. (2013) Oliva, J., Póczos, B. and Schneider, J. (2013). Distribution to distribution regression. In ICML’13: Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28 (S. Dasgupta and D. McAllester, eds.)., 1049–1057.
  • Python Software Foundation (2017) Python Software Foundation (2017). Python Software, Version 3.6.4. Beaverton, OR. URL
  • R Core Team (2017) R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL
  • Raftery et al. (2005) Raftery, A. E., Gneiting, T., Balabdaoui, F. and Polakowski, M. (2005). Using Bayesian model averaging to calibrate forecast ensembles. Monthly Weather Review, 133, 1155–1174.
  • Schefzik et al. (2013) Schefzik, R., Thorarinsdottir, T. L. and Gneiting, T. (2013). Uncertainty quantification in complex simulation models using ensemble copula coupling. Statistical Science, 28, 616–640.
  • Scheuerer (2014) Scheuerer, M. (2014). Probabilistic quantitative precipitation forecasting using ensemble model output statistics. Quarterly Journal of the Royal Meteorological Society, 140, 1086–1096.
  • Scheuerer and Hamill (2015) Scheuerer, M. and Hamill, T. M. (2015).

    Statistical postprocessing of ensemble precipitation forecasts by fitting censored, shifted gamma distributions.

    Monthly Weather Review, 143, 4578–4596.
  • Scheuerer and Möller (2015) Scheuerer, M. and Möller, D. (2015). Probabilistic wind speed forecasting on a grid based on ensemble model output statistics. Annals of Applied Statistics, 9, 1328–1349.
  • Schlosser et al. (2018) Schlosser, L., Hothorn, T., Stauffer, R. and Zeileis, A. (2018). Distributional regression forests for probabilistic precipitation forecasting in complex terrain. Preprint, available at
  • Schmidhuber (2015) Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117.
  • Taillardat et al. (2017) Taillardat, M., Fougères, A.-L., Naveau, P. and Mestre, O. (2017). Forest-based methods and ensemble model output statistics for rainfall ensemble forecasting. Preprint, available at
  • Taillardat et al. (2016) Taillardat, M., Mestre, O., Zamo, M. and Naveau, P. (2016). Calibrated ensemble forecasts using quantile regression forests and ensemble model output statistics. Monthly Weather Review, 144, 2375–2393.
  • Taylor (2000) Taylor, J. W. (2000). A quantile regression neural network approach to estimating the conditional density of multiperiod returns. Journal of Forecasting, 19, 299–311.
  • Thorarinsdottir and Gneiting (2010) Thorarinsdottir, T. L. and Gneiting, T. (2010). Probabilistic forecasts of wind speed: Ensemble model output statistics by using heteroscedastic censored regression. Journal of the Royal Statistical Society Series A, 173, 371–388.
  • Wang et al. (2012) Wang, F., Mi, Z., Su, S. and Zhao, H. (2012). Short-term solar irradiance forecasting model based on artificial neural network using statistical feature parameters. Energies, 5, 1355–1370.
  • Wilks (2011) Wilks, D. S. (2011). Statistical Methods in the Atmospheric Sciences. 3rd ed. Elsevier Academic Press.
  • Wilks (2016) Wilks, D. S. (2016). “The stippling shows statistically significant grid points”: How research results are routinely overstated and overinterpreted, and what to do about it. Bulletin of the American Meteorological Society, 97, 2263–2273.

Appendix A Hyperparameters for benchmark and network models

Model Parameter Value
EMOS-gl none
EMOS-loc none
EMOS-loc-bst maximum number of iterations 1 000
step size 0.05
stopping criterion for boosting algorithm AIC
QRF number of trees 1 000
minimum size of terminal leaves 10
number of variables randomly sampled as 25
 candidates at each split
Table A2: Hyperparameters for network models. Values in parentheses indicate settings for the longer training period from 2007–2015. Parameters refers to all learnable values: weights, biases and latent embedding features. An epoch refers to one pass through all training samples. Batch size refers to the number of random training samples considered per gradient update in the SGD optimization.
Model Number of Epochs Learning rate Batch size Hidden Embedding
parameters nodes size
FCN 6 30 (15) 0.1 (0.1) 4 096 (4 096)
FCN-aux 82 30 (10) 0.02 (0.02)  1024 (1 024)
FCN-emb 1 084 30 (10) 0.02 (0.02) 1 024 (1 024) 2 (2)
FCN-aux-emb 1 160 30 (10) 0.02 (0.02) 1 024 (1 024) 2 (2)
NN-aux 3 326 (10) (0.02) (1 024) (64) (2)
NN-aux-emb 24 116 30 (10) 0.01 (0.002) 1 024 (4 096) 50 (512) 2 (2)
Table A1: Hyperparameters for benchmark models. AIC denotes the Akaike information criterion.

Appendix B Forecast evaluation

For the purpose of the present Section, denote a generic probabilistic forecast for 2-meter temperature at station and time by . Note that

may be a parametric forecast distribution represented by CDF or probability density function (PDF), an ensemble forecast

or a set of quantiles. We may choose to suppress the index at times for ease of notation.

b.1 Calibration and sharpness

As argued by Gneiting et al. (2007), probabilistic forecasts should generally aim to maximize sharpness subject to calibration. In a nutshell, a forecast is called calibrated if the realizing observation cannot be distinguished from a random draw from the forecast distribution. Calibration thus refers to the statistical consistency between forecast distribution and observation. By contrast, sharpness is a property of the forecast only and refers to the concentration of the predictive distribution. The calibration of ensemble forecasts can be assessed via verification rank (VR) histograms summarizing the distribution of ranks of the observation when it is pooled with the ensemble forecast (Hamill, 2001; Gneiting et al., 2007; Wilks, 2011). For continuous forecast distributions histograms of the PIT provide analogs of verification rank histograms. Calibrated forecasts result in uniform VR and PIT histograms, and deviations from uniformity indicate specific systematic errors such as biases or an under-representation of the forecast uncertainty.

b.2 Proper scoring rules

For comparative model assessment, proper scoring rules allow simultaneous evaluation of calibration and sharpness (Gneiting and Raftery, 2007). A scoring rule assigns a numerical score to a pair of probabilistic forecast and corresponding realizing observation , and is called proper relative to a class of forecast distributions if

i.e., if the expected score is optimized if the true distribution of the observation is issued as forecast. Here, scoring rules are considered to be negatively oriented with smaller scores indicating better forecasts

Popular examples of proper scoring rule include the logarithmic score (LogS; Good, 1952)

where denotes the observation and denotes the PDF of the forecast distribution, and the continuous ranked probability score (CRPS; Matheson and Winkler, 1976)



denotes the CDF of the forecast distribution with finite first moment and

is an indicator function that is 1 if and 0 otherwise. The integral in (B1) can be computed analytically for ensemble forecasts and a variety of continuous forecast distributions (see, e.g., Jordan et al., 2017). Specifically, the CRPS of a Gaussian distribution with mean value and standard deviation can be computed as


where and denote CDF and PDF of a standard Gaussian distribution, respectively (Gneiting et al., 2005).

Apart from forecast evaluation, proper scoring rules can also be used for parameter estimation. Following the generic optimum score estimation framework of Gneiting and Raftery (2007, Section 9.1), the parameters of a forecast distribution are determined by optimizing the value of a proper scoring rule, on average over a training sample. Optimum score estimation based on the LogS then corresponds to classical maximum likelihood estimation, whereas optimum score estimation based on the CRPS is often employed as a more robust alternative in meteorological applications. Analytical closed-form solutions of the CRPS, for example for a Gaussian distribution in (B2), allow for computing analytical gradient functions that can be leveraged in numerical optimization, see Jordan et al. (2017) for details.

In practical applications, scoring rules are usually computed as averages over stations and/or time periods. To assess the relative improvement over a reference forecast , we further introduce the continuous ranked probability skill score

which is positively oriented and can be interpreted as relative improvement over the reference. The CRPSS is usually computed as skill score of CRPS averages.

b.3 Statistical tests of equal predictive performance

Formal statistical tests of equal forecast performance for assessing statistical significance of score differences have been widely used in the economic literature. Consider two forecasts and , with corresponding mean scores for over a test , where we assume that the forecast was issued time steps before the observation was recorded. Diebold and Mariano (1995)

propose the test statistic

where is an estimator of the asymptotic standard deviation of the score difference of and . Under standard regularity conditions,

asymptotically follows a standard normal distribution under the null hypothesis of equal predictive performance of

and . Thereby, negative values of indicate superior predictive performance of , whereas positive values indicate superior performance of . To account for temporal dependencies of score differences, we use the square root of the sample auto-covariance up to lag as estimator following Diebold and Mariano (1995). We employ Diebold-Mariano tests on an observation station-level, i.e., the mean CRPS values are determined by averaging over all scores at the specific station of interest,

where denotes days in the evaluation period.

Compared to previous uses of Diebold-Mariano tests in post-processing applications (e.g., Baran and Lerch, 2016) we further account for spatial dependencies of score differences at the different stations. Following suggestions of Wilks (2016) we apply a Benjamini and Hochberg (1995) procedure to control the false discovery rate at level . In a nutshell, the algorithm requires a higher standard in order to reject a local null hypothesis of equal predictive performance by selecting a threshold -value based on the set of ordered local -values . Particularly, is the largest that is not larger than , where is the number of tests, i.e., the number of stations in the evaluation set.

Appendix C Additional results

c.1 Calibration assessment

Figure C1: Verification rank and PIT histograms for raw and post-processed ensemble forecasts based on models estimated using data from 2015, aggregated over all forecast cases during the evaluation period in calendar year 2016.
Figure C2: Verification rank and PIT histograms for raw and post-processed ensemble forecasts based on models estimated using data from 2007–2015, aggregated over all forecast cases during the evaluation period in calendar year 2016..

c.2 CRPSS results for alternative benchmark models

CRPSS relative to EMOS-loc-boost

CRPSS relative to QRF

CRPSS relative to NN-aux-emb

Figure C3: As Figure 3, but with different reference models.

c.3 Details on computational aspects

Table C1 shows computation times required for training the different post-processing models for both training sets. As noted before, the computation times are not directly comparable due to implementations in different programming languages and hardware environments. The computation times for the benchmark models, implemented in R using the crch (Messner et al., 2016), quantregForest (Meinshausen, 2017) and scoringRules (Jordan et al., 2017) packages, were obtained on a standard laptop computer, whereas the network models were implemented with the Python libraries Keras (Chollet et al., 2015) and TensorFlow (Abadi et al., 2016), and run on a single GPU (Nvidia Tesla K20). Computation times on a regular CPU are roughly 6 times longer for the most complex networks. For the simple networks the difference is negligible. Note that the inference time, i.e., the time to make a prediction after the model has been trained, is on the order of a few seconds for all models. Further, note that all computation times reported here are substantially lower compared to the computational costs of generating the raw ensemble forecast.

Model Computation time (min)
with training data from
2015 2007–2015
Benchmark models
EMOS-loc 1
EMOS-loc-bst 14 48
QRF 8 430
Network models
FCN-aux 2
FCN-emb 3
FCN-aux-emb 3
NN-aux 4 25
NN-aux-emb 9 16
Table C1: Computation times (in minutes) for estimating post-processing models with the two training sets and computing out-of-sample forecasts for the evaluation period.

c.4 Statistical significance of score differences

Pair-wise one-sided Diebold-Mariano tests are applied to all possible comparisons of forecast models at each of the 499 stations individually. To account for multiple hypothesis testing and spatial correlations of score differences, we apply a Benjamini-Hochberg procedure to the corresponding -values when aggregating the results by determining the ratio of stations with significant score differences, see Appendix B.3 for details.

Table C2 summarizes pair-wise Diebold-Mariano tests by showing the ratio of stations with statistically significant CRPS differences after applying a Benjamini-Hochberg procedure for a nominal level of . Generally, the results indicate large numbers of stations with significant differences of the network models when compared to standard EMOS approaches. NN-aux-emb shows the highest ratios of significant score differences over any competitor, and is significantly outperformed at very few station and only by the best-performing alternatives.

Training with 2015 data

-gl -loc -loc-bst -aux -emb -aux-emb -aux -aux-emb
Ens. 00.6 00.0 00.0 00.6 00.6 00.8 00.0 00.8 00.8 00.6
EMOS-gl 83.2 00.2 00.0 10.4 10.2 03.0 00.2 00.6 02.0 00.2
EMOS-loc 96.2 71.3 00.0 50.5 71.9 17.4 24.8 05.2 09.6 01.4
EMOS-loc-bst 93.8 72.7 40.5 89.8 74.3 41.7 49.1 21.0 30.5 02.0
QRF 54.7 22.0 03.6 00.0 22.4 08.0 03.6 03.4 05.2 00.2
FCN 83.0 07.4 00.2 00.0 10.4 03.0 00.2 00.6 02.0 00.2
FCN-aux 83.2 60.3 17.2 01.8 47.5 62.3 19.0 01.0 00.4 00.2
FCN-emb 89.4 67.1 01.0 00.0 44.1 68.1 11.4 00.8 06.4 00.6
FCN-aux-emb 86.6 78.8 53.1 07.6 69.1 79.6 55.1 58.5 27.1 00.2
NN-aux 87.2 69.5 25.9 02.0 57.5 70.7 22.8 30.9 08.0 00.4
NN-aux-emb 93.6 89.4 67.1 30.3 92.2 90.2 67.3 72.7 43.5 64.9

Training with 2007-2015 data

-gl -loc -loc-bst -aux -emb -aux-emb -aux -aux-emb
Ens. 00.6 00.0 00.0 00.0 00.6 00.8 00.0 00.8 00.8 00.0
EMOS-gl 86.8 00.2 00.0 00.2 02.6 03.0 00.2 00.6 00.2 00.0
EMOS-loc 98.8 72.7 00.0 00.2 71.7 17.2 17.4 03.6 06.8 00.6
EMOS-loc-bst 99.4 98.0 91.4 21.0 97.8 82.0 94.2 70.3 49.7 01.4
QRF 98.6 94.2 79.2 01.4 94.2 57.7 84.4 38.1 33.5 01.2
FCN 87.8 11.0 00.2 00.0 00.2 03.2 00.2 00.6 00.2 00.0
FCN-aux 87.6 65.5 24.2 00.0 00.4 65.5 26.7 00.8 01.4 00.0
FCN-emb 93.4 71.3 00.0 00.0 0.2 70.5 12.0 01.2 04.6 00.0
FCN-aux-emb 91.2 82.8 60.3 00.0 00.6 81.8 58.1 64.1 16.4 00.0
NN-aux 95.6 84.8 54.5 01.4 9.8 84.8 72.9 58.5 34.5 00.0
NN-aux-emb 98.8 97.8 95.2 29.9 52.9 97.6 92.0 96.0 91.0 74.5
Table C2: Ratio of stations (in %) where pair-wise Diebold-Mariano tests indicate statistically significant CRPS differences after applying a Benjamini-Hochberg procedure to account for multiple testing for a nominal level of of the corresponding one-sided tests. The -entry in the -th row and -th column indicates the ratio of stations where the null hypothesis of equal predictive performance of the corresponding one-sided Diebold-Mariano test is rejected in favor of the model in the -th row when compared to the model in the -th column. The remainder of the sum of - and -entry to 100% is the ratio of stations where the score differences are not significant.