Deep Learning for Post-Processing Ensemble Weather Forecasts

by Peter Grönquist et al.
ETH Zurich

Quantifying uncertainty in weather forecasts typically employs ensemble prediction systems, which consist of many perturbed trajectories run in parallel. These systems are associated with a high computational cost and often include statistical post-processing steps to inexpensively improve their raw prediction qualities. We propose a mixed prediction and post-processing model based on a subset of the original trajectories. In the model, we implement methods from deep learning to account for non-linear relationships that are not captured by current numerical models or other post-processing methods. Applied to global data, our mixed models achieve a relative improvement of the ensemble forecast skill of over 13%. The improvement is even larger for extreme weather events on selected case studies, where we see an improvement in predictions by up to 26%. In addition, the computational costs of ensemble prediction systems can potentially be reduced, allowing weather forecasting pipelines to run higher-resolution trajectories, and resulting in even more accurate raw ensemble forecasts.









1 Introduction

Operational weather predictions have a large impact on society. They influence individuals on a daily basis, and in more severe cases, save lives and property by predicting extreme events such as tropical cyclones. However, developing reliable weather prediction systems is a difficult task due to the complexity of the Earth System and the chaotic behaviour of its components: Small errors introduced by observations, their assimilation, and the forecast model configuration escalate chaotically, leading to a significant loss in forecast skill within a week. Numerical Weather Prediction (NWP) is based on computer models solving complex partial differential equations at limited resolution. To be useful, weather forecasts try to estimate the uncertainties in predictions using ensemble simulations, where a forecast model is run a number of times from slightly different initial conditions, parameter values, and stochastic forcing. The resulting spread of predictions among ensemble members provides an estimate for the prediction uncertainty. This enables us to estimate the probability of, for example, precipitation for a specific location and time of day as well as the probability of a tropical cyclone hitting a large city.
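The ensemble mean and spread described above can be sketched directly. The following minimal example uses synthetic data with assumed shapes, not the paper's actual fields, and shows how the spread of predictions among members yields a per-gridpoint uncertainty estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for an ensemble: 10 perturbed trajectories of a
# temperature field (in Kelvin) on a small latitude/longitude grid.
members = 288.0 + rng.normal(scale=1.5, size=(10, 4, 8))  # (member, lat, lon)

# The ensemble mean is the central estimate; the spread (standard deviation
# across members) quantifies the prediction uncertainty at each gridpoint.
ens_mean = members.mean(axis=0)
ens_spread = members.std(axis=0)
```

A larger spread at a gridpoint indicates lower forecast confidence there; this is the quantity the uncertainty quantification model later predicts from fewer members.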

In this paper we focus on post-processing ensemble predictions performed at the European Centre for Medium-Range Weather Forecasts (ECMWF) [1] using deep neural networks. ECMWF runs an operational forecast that consists of one high-resolution (9 km grid) deterministic forecast (HRES) and an ensemble (ENS) with 51 members at a lower resolution (18 km), of which one is the unperturbed control trajectory. Each ensemble member starts from slightly different initial conditions and uses a different stochastic forcing in the physical parameterization schemes of subgrid-scale processes, so-called stochastic parameterization schemes. While ensemble methods have become a standard tool for numerical weather prediction, there is an ongoing discussion on how many ensemble members should be used. Larger ensembles allow for a better sampling of the probability density function (PDF) of predictions. However, computing power is limited and forecasts are bound by strict operational time windows of a couple of hours. Smaller ensembles would therefore allow individual members to run at higher resolution, likely resulting in better forecasts by each ensemble member [2].


The demand for ever more precise and dependable forecasts has led NWP to rank amongst the scientific domains with the most significant demand for supercomputing time [3, 4, 5, 6]. As such, the NWP field is constantly looking for new methods to improve the accuracy and reduce the computational cost of its models. This is where recent advances in deep neural networks (DNNs) [7] become relevant. The breadth of tasks DNNs can address, and the efficiency of their inference, have made them a very attractive option for improving weather forecasts [8, 9, 10, 11]. Related studies have also shown their capabilities for predicting chaotic behavior [12, 13]. However, the full potential of these methods remains unexplored in many areas of NWP.

We use convolutional neural networks (CNNs) [14] and locally connected networks (LCNs) [15] to both improve forecast skill and reduce the computational requirements for NWP. Firstly, we reduce the number of ensemble members required by predicting spread (standard deviation) values quantifying ensemble uncertainty. Secondly, we perform bias correction on the reduced ensemble mean. Finally, we combine spread prediction and output bias correction to improve forecast skill scores. The reduced number of ensemble forecasts required allows NWP to be run at a fraction of the cost of additional trajectories. Prediction time is further reduced by making use of high-throughput graphics processing units (GPUs) for DNN inference.

Our code and data are publicly available.

1.1 Related Work

There have been many works leveraging the modelling capabilities of neural networks (NNs) for NWP. Early attempts at applying shallow NNs (multi-layer perceptrons) showed success in emulating physical processes and saving computational power [16]. Since then, building on recent DNN developments, much effort has gone into applying NNs to weather nowcasting [17, 18, 19, 20]. Nowcasting focuses on the emulation of physical processes for short-term (up to six hours), high-resolution forecasts. Other works have also shown the significant capabilities of DNNs to predict longer-ranging forecasts and extreme weather patterns [21, 22, 23, 24, 25, 26].

In contrast, we focus on the post-processing of operational medium-range ensemble forecasts and the prediction of extreme weather events. Post-processing ensemble outputs has been a long-standing effort in the weather forecasting community. Methods such as Ensemble Model Output Statistics (EMOS) [27] and Bayesian Model Averaging (BMA) [28] currently allow for improvements of the raw ensemble forecast skill. Hamill and Whitaker [29] show initial explorations of those techniques on reforecast datasets, also used in this paper, for temperature at 850 hPa (T850) and geopotential at 500 hPa (Z500). Advances in neural networks have only recently reached the field of ensemble models in weather forecasting, focusing on their application to specific weather stations [9, 30] or global interpolations [31]. We expand on this work by applying DNNs to the novel task of improving the forecast skill for global predictions, specifically extreme weather forecasts, while reducing their computational costs.

2 Data

The quality of ensemble forecasts has improved significantly over the last decades, and ensemble predictions use ever higher resolutions and numbers of trajectories. Learning from past ensemble predictions would therefore lead to inconsistencies, as the correction for mean and spread would need to adjust for changes in prediction quality over time. To address this, reforecasts [32] apply current state-of-the-art forecast models to past measurements.

We use data from reforecast ensemble experiments at ECMWF. These are routinely generated to provide an estimate for the climatological background state of the atmosphere, which is required for data assimilation[33, 34]. The reforecast experiments run a 10-member ensemble (ENS10) and an unperturbed control experiment on a cubic octahedral reduced Gaussian grid with approximately 18 km grid-spacing (Tco639) and 91 vertical levels. Simulations are performed with the same system for 1999-2017 with two forecast simulations starting each week. This provides a very large dataset with consistent forecast quality.

To fully train our networks and evaluate their forecast skill we also need ground-truth weather conditions at specific forecast lead times. For this, we use the fifth major ECMWF ReAnalysis (ERA5) [35], which includes data on weather from 1979 up to the present. Compared to reforecasts, reanalysis datasets are produced by applying a constant stream of observations through state-of-the-art data assimilation on state-of-the-art forecast models used for decade-long simulations. In ERA5, this process generates hourly measurements for over 300 parameters.

2.1 Data Selection

We use the ENS10 dataset for our forecasts and make use of ERA5's constant data assimilation product as ground truth, given the ENS10 forecast lead times. ENS10 and ERA5 both provide global data, which we interpolate to a latitude/longitude grid with a 0.5-degree resolution. We do this to avoid the native grid that was used within simulations, as it is unstructured in the longitudinal direction. While the use of latitude/longitude grids does lead to over-saturation of gridpoints near the poles, it simplifies our models; we leave the use of unstructured grids to future research. We also focus on a single pressure level for each model. When predicting temperature at 850 hPa (T850), we provide all input fields at 850 hPa. Similarly, when predicting geopotential at 500 hPa (Z500), all input fields are at 500 hPa. The years 1999-2013 are used for training, 2014-15 for validation, and 2016-17 for testing. Since the datasets are reforecasts and a reanalysis, there is no difference in data assimilation and predictions between older and more recent dates, and therefore the selection of consecutive years should not have a major impact. The effects of climate change on the uncertainty of forecasts are currently being explored [36]; for our selected parameters and years, these effects are low. It is, however, important to use complete years, as different seasons demonstrate different weather patterns. We target forecasts with a lead time of 48 hours, and use the reduced ensemble forecasts for 0-, 24-, and 48-hour lead times as inputs.

2.2 Data Preprocessing

Figure 1: Inputs and ground truths to our neural networks.

As the datasets consist of several terabytes of data, we set up a data preprocessing pipeline to enable faster training. We first select the relevant inputs and labels to each of our respective models from the data provided in GRIB[37] format. We then convert the data from a 16-bit fixed point format to 32-bit floating point. This simplifies and speeds up training, while not impacting the results. Finally, we standardize our features and save them in TFRecord format, which will be randomly shuffled at training time. The resulting inputs and training targets can be seen in Figure 1. We base our model inputs on the first five ENS10 trajectories as we observed no significant differences in the average means or spreads when using different selections.
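As a rough sketch of the conversion and standardization steps above (the helper name and field values are illustrative, not the paper's pipeline code):

```python
import numpy as np

def standardize(field, eps=1e-8):
    """Zero-mean, unit-variance scaling of one input field (illustrative)."""
    return ((field - field.mean()) / (field.std() + eps)).astype(np.float32)

# GRIB data decoded as 16-bit values is first cast to 32-bit floating point,
# which simplifies and speeds up training without affecting the results.
raw = np.arange(12, dtype=np.int16).reshape(3, 4)
feature = standardize(raw.astype(np.float32))
```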

For DNNs to learn local weather patterns, it is important to keep local spatial differences in variability (coherence) when normalizing meteorological data [23]. However, if only a single mean and standard deviation are used to scale values across the whole globe, there will be massive differences for specific regions, e.g., different means and standard deviations closer to the poles compared to the equatorial region. This can lead to poor accuracy when applying CNNs, which are translation-invariant. At the same time, just applying gridpoint-wise standardization loses important information otherwise represented through the coherence.

To remedy this problem, we apply a heuristic we refer to as Local Area-wise Standardization (LAS). First, we apply a moving-average and moving-standard-deviation filter on our training set, with a step size of one and a filter size equal to the largest CNN filter size we apply in our DNNs. Then, as our mean and standard deviation maps now have reduced dimensions, we pad them using the edge values for latitudes and a wrap-around for longitudes. Finally, we apply a Gaussian filter with a large standard deviation (10) to the padded result. This upscaling method allows the coherence between singular grid points to be kept.
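The LAS heuristic can be sketched as follows. The filter size `k` and the small truncated smoothing kernel are stand-ins for the paper's choices, so treat this as an illustration of the idea rather than the exact implementation:

```python
import numpy as np

def smooth(a, g, axis, mode):
    """Separable 1-D convolution along one axis with padding mode `mode`."""
    r = len(g) // 2
    pad = [(r, r) if i == axis else (0, 0) for i in range(a.ndim)]
    a = np.pad(a, pad, mode=mode)
    return np.apply_along_axis(lambda v: np.convolve(v, g, "valid"), axis, a)

def local_area_standardization(field, k=7, sigma=10.0):
    h, w = field.shape
    # Moving mean and standard deviation with step size one.
    win = np.lib.stride_tricks.sliding_window_view(field, (k, k))
    mean, std = win.mean(axis=(-1, -2)), win.std(axis=(-1, -2))
    # Pad the reduced maps back: edge values in latitude, wrap in longitude.
    ph, pw = h - mean.shape[0], w - mean.shape[1]
    lat = ((ph // 2, ph - ph // 2), (0, 0))
    lon = ((0, 0), (pw // 2, pw - pw // 2))
    mean = np.pad(np.pad(mean, lat, mode="edge"), lon, mode="wrap")
    std = np.pad(np.pad(std, lat, mode="edge"), lon, mode="wrap")
    # Smooth both maps with a (truncated) wide Gaussian to keep coherence.
    r = np.arange(-3, 4)
    g = np.exp(-r**2 / (2 * sigma**2))
    g /= g.sum()
    mean = smooth(smooth(mean, g, 0, "edge"), g, 1, "wrap")
    std = smooth(smooth(std, g, 0, "edge"), g, 1, "wrap")
    return (field - mean) / (std + 1e-8)
```

Compared to a single global mean and standard deviation, this keeps regional differences (e.g., poles vs. equator) while still exposing local coherence to translation-invariant CNNs.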

Using LAS we notice a relative improvement in our DNN results of around 15% for spread prediction, as well as faster convergence compared to applying the same standardization for all grid points. However, as expected, we see no difference when using it with our LCNs, which are not translation-invariant (see Section 3.2).

3 Neural Networks

We develop separate neural networks for our uncertainty quantification and output bias correction tasks: an Inception-style residual network [38] and a U-Net [39] with an added LCN, respectively. While prior work in weather uncertainty prediction [31] used a single 3D U-Net [40], we found that different network architectures perform better for our tasks.

Residual connections [41], which pass unmodified features between non-adjacent layers, are also crucial for our results. Indeed, Chen et al. [42] demonstrated that applying successive residual connections has many similarities to ordinary differential equations. We also considered recurrent neural networks, but do not apply them here due to the short input sequences and the lack of improvement observed in prior work.
3.1 Uncertainty Quantification Model

Figure 2: Spread Prediction Network for Uncertainty Quantification. All layers marked Conv reduce the number of filters. For other convolutional layers, we use Batch Normalization (BN) and ReLU activations on the outputs.

There are many uncertainties present in NWP models and data. Data, or aleatoric, uncertainty stems from observational measurement noise, while model uncertainty comes from structural (e.g., data assimilation, forecast model) and parametric uncertainties. In addition to these inherent uncertainties, we also introduce a structural uncertainty by assuming that the forecast variables follow Gaussian distributions, a common NWP assumption. Our DNN is able to address data and structural uncertainties; however, we cannot address parametric uncertainties, as the data assimilation pipelines and forecast models we use are fixed and used as prediction labels. More specifically, to reduce computational requirements, our DNN initially aims to predict the full ensemble spread using only a subset of NWP ensemble trajectories. The architecture is summarized in Figure 2.


The non-linear nature of our DNN model, which introduces a previously unused structure in NWP and statistical post-processing, combined with the forecasts of the reduced NWP ensemble, helps address the original structural uncertainty. Such a design also allows our model to take into account deterministic forecasts, which it would otherwise struggle to learn. Additionally, as DNNs are relatively robust to noise, they are naturally able to account for data uncertainty. We also perform a minimal post-processing on NWP output, which reduces the number of required parameters and extracts basic features. This is used for all subsequent steps.

The core of the DNN is based on the ResNet architecture [41]. We use ten Inception-style layers [38] with residual connections; we did not see any improvement with more layers. Each layer is composed of three parallel convolutions, allowing the network to learn differently-sized receptive fields. These use dilated convolutions, which perform similarly to full convolutional kernels, at a reduced computational cost. We also perform a channel-wise concatenation of the post-processed NWP output to the input of each Inception-style layer. This allows the network to prioritize between different lead times from NWP forecast spreads and its own outputs. Using auxiliary losses for different lead times and depths performed worse than pure NWP predictions.
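To illustrate the dilated-convolution idea used in the Inception-style layers, here is a single-channel NumPy sketch with hypothetical kernels, not the actual multi-channel network code:

```python
import numpy as np

def dilated_conv2d(x, kernel, dilation=1):
    """'Same'-padded single-channel 2-D dilated convolution.

    A k x k kernel with dilation d covers a receptive field of
    (k - 1) * d + 1 gridpoints per axis using only k * k weights.
    """
    k = kernel.shape[0]
    r = (k - 1) * dilation // 2
    xp = np.pad(x, r, mode="edge")
    out = np.zeros_like(x, dtype=float)
    for i in range(k):
        for j in range(k):
            di, dj = i * dilation, j * dilation
            out += kernel[i, j] * xp[di:di + x.shape[0], dj:dj + x.shape[1]]
    return out

rng = np.random.default_rng(1)
x = rng.normal(size=(12, 12))
kern = rng.normal(size=(3, 3))

# Three parallel branches with growing receptive fields, as in an
# Inception-style layer, merged and combined with a residual connection.
branches = [dilated_conv2d(x, kern, d) for d in (1, 2, 3)]
y = x + sum(branches) / len(branches)
```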

Finally, the output of the last Inception-style layer is combined with the NWP spread through a weighted mean. This guarantees the network performs at least as well as the NWP spread used as input during training. However, this does not guarantee success for unseen examples; we validate this experimentally. By combining this model with our bias correction model and training with ERA5 data as the ground truth (see Section 3.3), it is possible to also account for parametric uncertainty. This allows our combined networks to cover all types of uncertainty.

3.2 Output Bias Correction Model

Figure 3: Output Bias Correction model, based on a three-level U-Net and added LCN structure.

Our output bias correction model, summarized in Figure 3, corrects for biases in NWP forecasts that might arise through local trends in weather patterns. It is trained using the mean ensemble predictions with a 48-hour lead time and ERA5 data as ground-truth. Since the forecast can resemble the ground-truth, a straightforward predictor will closely resemble the identity function. Prior research [41] suggests that approximating an identity mapping with several non-linear layers is difficult. We therefore train our model to predict the difference between the NWP prediction and the ground-truth.
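The training-target choice above, predicting the correction rather than the corrected field, can be expressed in a couple of lines (synthetic arrays for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
forecast = 288.0 + rng.normal(size=(4, 8))               # ensemble-mean input
ground_truth = forecast + 0.3 * rng.normal(size=(4, 8))  # ERA5 stand-in

# The network is trained to output the residual rather than the field itself,
# since approximating a near-identity mapping with deep non-linear layers is
# hard; the corrected forecast is then recovered by addition.
target_residual = ground_truth - forecast
corrected = forecast + target_residual  # a perfect model recovers the truth
```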

The network is based on a U-Net structure. It repeatedly applies several layers of convolution per level, followed by downscaling. These levels are repeated several times before being upscaled in a similar, level-wise manner. Residual connections are also used between the down- and upscaling sides. We make three key changes to adapt the standard U-Net to our task. First, instead of up-convolution, we use bilinear interpolation to upscale, followed by a convolution with stride 1; this avoids the checkerboard artifacts that are known to appear when only using a simple deconvolution operation [43]. Second, we reduce the number of levels in the U-Net from five to three, as we found that additional levels resulted in overfitting on our data. Finally, we reduce the number of filters in each convolution by half, as we observed no additional performance improvements from using more.
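A minimal sketch of the first change, bilinear upscaling followed by a stride-1 convolution instead of a transposed convolution (single channel; the 3x3 smoothing kernel is an assumption for illustration):

```python
import numpy as np

def bilinear_upsample2x(a):
    """2x bilinear upsampling of a 2-D array."""
    h, w = a.shape
    ys = (np.arange(2 * h) + 0.5) / 2 - 0.5
    xs = (np.arange(2 * w) + 0.5) / 2 - 0.5
    fy, fx = np.floor(ys), np.floor(xs)
    wy, wx = (ys - fy)[:, None], (xs - fx)[None, :]
    y0 = np.clip(fy.astype(int), 0, h - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x0 = np.clip(fx.astype(int), 0, w - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    return (a[y0][:, x0] * (1 - wy) * (1 - wx) + a[y1][:, x0] * wy * (1 - wx)
            + a[y0][:, x1] * (1 - wy) * wx + a[y1][:, x1] * wy * wx)

def conv3x3(x, kernel):
    """'Same'-padded stride-1 convolution applied after upsampling."""
    xp = np.pad(x, 1, mode="edge")
    return sum(kernel[i, j] * xp[i:i + x.shape[0], j:j + x.shape[1]]
               for i in range(3) for j in range(3))

# Smooth interpolation plus a learnable convolution avoids the uneven
# overlap of a deconvolution, which is what causes checkerboard artifacts.
up = bilinear_upsample2x(np.arange(16.0).reshape(4, 4))
out = conv3x3(up, np.full((3, 3), 1.0 / 9.0))
```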

As we aim to predict the bias emerging from specific regional patterns, the translational invariance of regular convolution hinders performance. We therefore use a locally-connected network (LCN) as the last layer. LCNs perform a similar operation to regular convolution, but instead of sharing a filter, an independent filter is used for each output neuron. When training, we apply L1 regularization on the difference between all adjacent filters in an LCN, to encourage adjacent filters to learn similar weights; for an infinite regularization parameter, the LCN converges to a convolutional layer. This helps avoid overfitting.
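A 1x1 LCN and the adjacency regularizer can be sketched as follows (per-gridpoint scalar filters for readability; the real layers carry multiple channels):

```python
import numpy as np

rng = np.random.default_rng(2)
H, W = 6, 8

# One independent 1x1 filter (weight and bias) per output gridpoint; a
# regular convolution would instead share a single w and b across the grid.
w = rng.normal(size=(H, W))
b = rng.normal(size=(H, W))

def lcn_1x1(x, w, b):
    return w * x + b  # elementwise: every output neuron has its own filter

def adjacency_l1(w, lam=1e-3):
    """L1 penalty on differences between neighbouring filters. As lam grows,
    filters are pushed toward equality, i.e. toward a shared convolution."""
    return lam * (np.abs(np.diff(w, axis=0)).sum()
                  + np.abs(np.diff(w, axis=1)).sum())

y = lcn_1x1(rng.normal(size=(H, W)), w, b)
penalty = adjacency_l1(w)
```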

In order to remain computationally efficient, our best models first use a U-Net to perform feature extraction and then apply an LCN to obtain the final output bias correction. As the U-Net is able to learn long-range dependencies, a single LCN is sufficient to learn the gridpoint-wise dependencies, and we observed no improvements from using larger filter sizes.

3.3 Metrics

As our models solve different tasks, we need different metrics to evaluate them. When training, we treat both uncertainty quantification and output bias correction as regressions, and aim to predict extreme cases (outliers). We therefore train the networks on mean-squared error (MSE) and evaluate them with root mean-squared error (RMSE). However, when predicting the spread, the results lack the sharp edges that exist in the original forecasts. In computer vision, this problem is mitigated using the Structural SIMilarity (SSIM) metric [44]. There can be infinitely many solutions to the task of minimizing RMSE; while remaining within the realm of these solutions, the SSIM measures the structural similarity between two images, with a value of 1 being a perfect match. Therefore, we use the negative mean SSIM of our prediction compared to the full ENS10 ensemble as our training loss for the uncertainty quantification model.
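As a simplified illustration of the SSIM-based training loss, the sketch below computes a single global-window SSIM; production SSIM implementations average over local windows, and the arrays here are stand-ins for predicted and full-ensemble spreads:

```python
import numpy as np

def ssim_global(x, y, c1=1e-4, c2=9e-4):
    """Single-window SSIM between two images; 1 means a perfect match."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return (((2 * mx * my + c1) * (2 * cov + c2))
            / ((mx**2 + my**2 + c1) * (vx + vy + c2)))

rng = np.random.default_rng(3)
spread_true = rng.random((8, 8))                      # full-ensemble spread
spread_pred = spread_true + 0.05 * rng.random((8, 8)) # model prediction

# Training on the negative mean SSIM means minimizing the loss maximizes
# structural similarity to the full-ensemble spread.
loss = -ssim_global(spread_pred, spread_true)
```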

To then gain an understanding of the forecast skill of our combined predictions and of the ENS10 forecasts, we use the Continuous Ranked Probability Score (CRPS) [45]. CRPS, generally used to measure whether ensemble methods represent uncertainty correctly, is the integral of the square of the difference between the cumulative distribution function (CDF) $F$ of the probabilistic prediction and that of the ground truth $y$ (see Figure 4):

$$\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \left( F(x) - \mathbb{1}\{x \geq y\} \right)^2 \, dx$$

Figure 4: Example CRPS (green area) for a temperature prediction.

Here, $\mathbb{1}$ is the indicator function (equivalent to the CDF of a deterministic value). In the following, we not only combine the uncertainty and bias correction networks to calculate CRPS, but also train a combination of both in a network that is optimized to reduce the CRPS. We achieve this by replacing the labels of our uncertainty quantification network, which were previously the spread values of the full ENS10 trajectories, with the difference between the ground truth and the output bias-corrected forecast. With our assumption that the forecasts follow a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$, the error function $\mathrm{erf}$, and $z = (y - \mu)/\sigma$, we then set the CRPS loss as follows:

$$\mathrm{CRPS}(\mu, \sigma, y) = \sigma \left( z \, \mathrm{erf}\!\left(\tfrac{z}{\sqrt{2}}\right) + \sqrt{\tfrac{2}{\pi}} \, e^{-z^2/2} - \tfrac{1}{\sqrt{\pi}} \right)$$

Finally, the relative CRPS improvement of a prediction over the original raw ensemble is defined as the Continuous Ranked Probability Skill Score (CRPSS):

$$\mathrm{CRPSS} = 1 - \frac{\mathrm{CRPS}_{\mathrm{prediction}}}{\mathrm{CRPS}_{\mathrm{ensemble}}}$$
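The Gaussian closed form of the CRPS and the CRPSS can be written directly; these are standard formulas, implemented here with only the Python standard library (the function names are ours):

```python
import math

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2) against an
    observed outcome y; lower values indicate better forecasts."""
    z = (y - mu) / sigma
    return sigma * (z * math.erf(z / math.sqrt(2))
                    + math.sqrt(2 / math.pi) * math.exp(-z * z / 2)
                    - 1 / math.sqrt(math.pi))

def crpss(crps_prediction, crps_ensemble):
    """Relative CRPS improvement of a prediction over a reference ensemble."""
    return 1.0 - crps_prediction / crps_ensemble
```

Note that even a forecast centered exactly on the outcome retains a CRPS of sigma * (sqrt(2) - 1) / sqrt(pi), so both sharpening the spread and correcting the mean reduce the score.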

3.4 Implementation

Our networks are implemented with TensorFlow. Layers are initialized with a truncated normal distribution. We train with the Adam optimizer [47] and L2 regularization. Our models have not been fine-tuned extensively, and there is further potential for improvement.

The uncertainty networks are trained for 4,725 update steps with a batch size of 2, requiring about four hours on one Nvidia V100 32 GB GPU (provided by the Swiss National Supercomputing Center). The bias correction networks are trained for the same wall time, taking about 25,000 update steps with a batch size of 2. We use early stopping on the validation dataset to identify the best parameters. Training can be done once and the resulting networks used until the ensemble prediction system is upgraded. Using the same GPU, inference for one parameter and forecast on a global grid takes approximately 0.31 seconds per network.

4 Results

Notation | Description
B{n}     | Output bias correction NN trained with n trajectories
U{n}     | Spread prediction NN trained with n trajectories
E{n}     | Ensemble with n trajectories
Lin{n}   | Gridpoint-wise linear regression from n trajectories
C        | Uncertainty NN trained on CRPS
G        | Ground truth data from ERA5
Table 1: Notation for our model configurations and ground-truth data.

We primarily train our models to predict T850 but also evaluate their prediction capacity on Z500. First, our uncertainty quantification and bias correction are evaluated separately, on the global RMSE against the spread of the ENS10 forecasts and against the ERA5 ground truth, respectively. All results are for a forecast lead time of 48 hours. In addition to our DNNs, we train linear regression models on ensemble trajectories as another baseline (see Table 1).

(a) T850
(b) Z500
(c) T850
(d) Z500
Figure 5: Notched boxplots for the global RMSE of the Uncertainty Quantification and Output Bias Correction Networks for each day of our test set (2016-17). For the x-axis notation, see Table 1.

Figure 5 (a, b) shows the improvement in spread prediction of our uncertainty network using five ensemble trajectories, compared to simply using the five trajectories. There are significant improvements for both temperature and geopotential. Figure 5 (c, d) shows our output bias correction for predicting deviating weather patterns given a forecast mean and no measure of uncertainty. We see improvements for T850, but our network struggles to provide a strong global improvement for Z500, which we analyze more thoroughly through case studies (see Section 44.1).

T850 | E3   | E4   | E5   | E6   | E7   | E8   | E9   | Lin5 | U3    | U4    | U5
Abs. | 0.35 | 0.28 | 0.23 | 0.19 | 0.15 | 0.11 | 0.07 | 0.21 | 0.26  | 0.23  | 0.19
Rel. | -    | -    | -    | -    | -    | -    | -    | 9.0% | 26.6% | 18.7% | 19.5%
Table 2: T850 average RMSE towards the E10 spread for different ensemble sizes and models on our test set (2016-17). Abs.: absolute values. Rel.: relative improvement over the original forecast.
Parameter | Lin10 | UN0  | UN1  | UN2  | UN0-LCN | UN1-LCN | UN2-LCN | UN1-LCN-reg
T850      | 4.8%  | 6.3% | 7.1% | 6.7% | 7.6%    | 7.7%    | 7.6%    | 7.9%
Z500      | 2.1%  | 1.2% | 1.6% | 1.0% | 2.6%    | 2.5%    | 2.4%    | 2.3%
Table 3: Ablation study of our output bias correction, measured by the relative RMSE improvement over the ensemble mean forecast. UN{n}: U-Net with n levels. UN{n}-LCN: U-Net with n levels followed by an LCN. -reg: with L1 regularization on the LCN.

Table 2 shows our analysis of the final results for the uncertainty quantification network using various numbers of trajectories on T850. Even with a very low number of trajectories, our model is already able to predict the uncertainty to some degree. Also, while there is understandably a decrease in the relative improvement as the number of trajectories grows, the networks still consistently outperform the baselines, even for a relatively high number of trajectories (half of the full ensemble). We also present the results of our ablation studies on the output bias correction networks in Table 3. They show that T850 correction benefits from convolution operations, whereas Z500 correction benefits most from a locally connected structure. Further supporting this, we observe that L1 regularization does not have as significant an impact for Z500, suggesting that locally independent weights are more important there.

(a) T850
(b) Z500
Figure 6: Notched boxplots for the global average CRPS values for each day in our test set (2016-17), for all our networks and raw ensemble combinations.

RMSE is insufficient to measure forecast skill, as it does not encompass both mean and spread. We therefore also consider CRPS (lower scores are better). We omit EMOS methods, as there is no previous work for these specific features on our datasets. Instead, we use E5 and E10, the raw five- and ten-member ensembles, as our baselines. The results are presented in Figure 6. To measure our improvements, we specifically focus on the CRPSS of our combined network trained on CRPS, as compared to the raw ensemble outputs in Table 4. Our models, both those trained on CRPS and on SSIM, also outperform the full ensemble forecast CRPS for both T850 and Z500, despite the latter using ten trajectories. The major source of improvement for our DNNs, especially for T850, is a reduction in extreme values (outliers), which are indicative of extreme weather events and forecast busts.

CRPSS              | Z500   | T850
B5U5C towards E10  | 0.0650 | 0.0969
B5U5C towards E5   | 0.0972 | 0.1334
Table 4: Continuous Ranked Probability Skill Scores over our test set (2016-17).

4.1 Extreme Weather Forecasts

Thus far, our results have only demonstrated improvements on average values. They do not cover the performance for uncertainty quantification of extreme events, where forecast reliability is essential. We now present three cases of extreme weather phenomena within our test set, selected from across the world, to demonstrate the networks' improvements for specific predictions. Figures 7, 8, and 9 show our CRPS improvements for tropical cyclone Winston, Hurricane Matthew, and a cold wave over Asia, respectively. Blue indicates that the (post-processed) five-member ensemble has more skill than the ten-member ensemble. All times are 00:00 UTC. We also present global CRPS plots in Figure 10.

(a) E10
(b) E5-E10
(c) B5U5-E10
(d) B5U5C-E10
Figure 7:

Tropical cyclone Winston, which raged over Fiji from February until March 2016, has been classified as the most intense cyclone ever recorded in the southern hemisphere, according to the Southwest Pacific Enhanced Archive for Tropical Cyclones. It reached category five on February 20. We present the prediction for Z500 as forecast on February 19 for February 21, and the differences in CRPS. (a) The CRPS for the ten-member ensemble; the center of the cyclone is clearly visible. (b)-(d) The difference in CRPS between the ten-member ensemble and five-member ensembles with and without post-processing. Our CRPS-trained network shows improvement over E10. It also demonstrates similar confidence in the southern area towards which the cyclone is moving, while being only minimally worse where the cyclone has already passed. This results in a large forecast skill improvement in the selected area, with a CRPSS of 0.739 (26.1% improvement over E10).

(a) E10
(b) E5-E10
(c) B5U5-E10
(d) B5U5C-E10
Figure 8: Hurricane Matthew, a category-five hurricane, brought severe destruction to the Caribbean and southeastern United States during September and October 2016. We look at the Z500 forecast for October 3. Again we see large improvements over the center of the hurricane, as well as in the northern regions the hurricane later progressed over, with minimal worsening in outer regions.
(a) E10
(b) E5-E10
(c) B5U5-E10
(d) B5U5C-E10
Figure 9: Cold wave over Asia: During January 2016, an unprecedented cold wave rushed over East and South Asia, leading to record lows. We focus on a forecast for January 24, where the T850 forecast CRPS has its worst spike. In this case, our CRPS-trained model brings a large improvement of more than 25.5% (CRPSS) over the most affected region compared to the ten-member ensemble, while keeping regions of low CRPS fairly close to their original values, resulting in a total forecast skill improvement of around 19.5% for the selected region.
(a) E10
(b) E5
(c) B5U5
(d) B5U5C
Figure 10: Global T850 CRPS plots for our models and the ENS10 forecast for January 24 2016 during the cold wave over East and South Asia (lower values are better).

5 Conclusion

We show that using informed model construction, deep learning can indeed improve the skill of global ensemble weather predictions. In particular, since we not only predict an ensemble spread, but also perform a locally-adaptive output bias correction, we improve the results of a five-member ensemble to even surpass the forecast skill of a ten-member ensemble. When tasked with hard-to-predict extreme weather cases, such as tropical cyclone Winston, the combined models exhibit especially pronounced forecast skill improvements. Through the use of heterogeneous hardware, they are able to run these global post-processing steps within tenths of a second. In the future, such deep learning tools could allow for reduced ensembles to be run at higher resolutions, providing cheaper and more informed predictions.

The network structures used in this paper should also be tested for other applications of deep learning in NWP, such as learning the model error in data-assimilation systems or learning the global equations of motion. Future research could investigate whether the networks need to be re-trained to process other physical fields and forecast lead times, or whether the normalisation of spread values could allow the same network to be applied to these tasks through transfer learning. Recurrent networks that encompass more time steps, as well as deep learning models capable of working on the native unstructured grid of the prediction model (e.g., graph neural networks [48]), can also be investigated in this context. Finally, the presented improvements would need to be studied when applied to ensembles with more members, such as the operational 50-member ensemble system of the ECMWF.

We encourage researchers to make use of the ERA5 and ENS10 datasets as well as our code, to apply new deep learning methods and expand on our initial architectures, helping weather forecast centers worldwide.


  • [1] European Centre for Medium-Range Weather Forecasts, “The ECMWF ensemble prediction system,” 2012.
  • [2] M. Leutbecher and Z. B. Bouallègue, “On the probabilistic skill of dual-resolution ensemble forecasts,” 2019.
  • [3] T. Schulthess, P. Bauer, O. Fuhrer, T. Hoefler, C. Schaer, and N. Wedi, “Reflecting on the goal and baseline for exascale computing: a roadmap based on weather and climate simulations,” Computing in Science and Engineering (CiSE), vol. 21, no. 1, Jan. 2019.
  • [4] P. Neumann, P. Düben, P. Adamidis, P. Bauer, M. Brück, L. Kornblueh, D. Klocke, B. Stevens, N. Wedi, and J. Biercamp, “Assessing the scales in numerical weather and climate predictions: will exascale be the rescue?” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 377, no. 2142, 2019. [Online]. Available:
  • [5] C. Schär, O. Fuhrer, A. Arteaga, N. Ban, C. Charpilloz, S. D. Girolamo, L. Hentgen, T. Hoefler, X. Lapillonne, D. Leutwyler, K. Osterried, D. Panosetti, S. Rüdisühli, L. Schlemmer, T. Schulthess, M. Sprenger, S. Ubbiali, and H. Wernli, “Kilometer-scale climate models: Prospects and challenges,” Bulletin of the American Meteorological Society, vol. 100, no. 12, Dec. 2019, early Online Release.
  • [6] P. Dueben, N. Wedi, S. Saarinen, and C. Zeman, “Global simulations of the atmosphere at 1.45 km grid-spacing with the integrated forecasting system,” Journal of the Meteorological Society of Japan. Ser. II, vol. advpub, 2020.
  • [7] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, 2015.
  • [8] N. D. Brenowitz and C. S. Bretherton, “Prognostic validation of a neural network unified physics parameterization,” Geophysical Research Letters, vol. 45, no. 12, 2018. [Online]. Available:
  • [9] S. Rasp and S. Lerch, “Neural Networks for Postprocessing Ensemble Weather Forecasts,” Monthly Weather Review, vol. 146, no. 11, Nov 2018.
  • [10] S. Rasp, M. S. Pritchard, and P. Gentine, “Deep learning to represent sub-grid processes in climate models,” 2018.
  • [11] P. D. Dueben and P. Bauer, “Challenges and design choices for global weather and climate models based on machine learning,” Geoscientific Model Development, vol. 11, no. 10, 2018. [Online]. Available:
  • [12] J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott, “Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach,” 2018.
  • [13] P. R. Vlachas, J. Pathak, B. R. Hunt, T. P. Sapsis, M. Girvan, E. Ott, and P. Koumoutsakos, “Backpropagation algorithms and reservoir computing in recurrent neural networks for the forecasting of complex spatiotemporal dynamics,” 2019.
  • [14] Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio, “Object recognition with gradient-based learning,” in Shape, Contour and Grouping in Computer Vision, 1999. [Online]. Available:
  • [15] A. Coates, B. Huval, T. Wang, D. J. Wu, A. Y. Ng, and B. Catanzaro, “Deep learning with COTS HPC systems,” in Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ser. ICML’13, 2013, p. III–1337–III–1345.
  • [16] F. Chevallier, F. Chéruy, N. A. Scott, and A. Chédin, “A neural network approach for a fast and accurate computation of a longwave radiative budget,” Journal of Applied Meteorology, vol. 37, no. 11, 1998.
  • [17] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, “Convolutional LSTM network: A machine learning approach for precipitation nowcasting,” in Advances in neural information processing systems, 2015.
  • [18] X. Shi, Z. Gao, L. Lausen, H. Wang, D.-Y. Yeung, W.-k. Wong, and W.-c. Woo, “Deep learning for precipitation nowcasting: A benchmark and a new model,” in Advances in neural information processing systems, 2017.
  • [19] A. Heye, K. Venkatesan, and J. Cain, “Precipitation nowcasting: Leveraging deep recurrent convolutional neural networks,” Proceedings of the Cray User Group (CUG), 2017.
  • [20] S. Agrawal, L. Barrington, C. Bromberg, J. Burge, C. Gazen, and J. Hickey, “Machine learning for precipitation nowcasting from radar images,” 2019.
  • [21] T. Kurth, S. Treichler, J. Romero, M. Mudigonda, N. Luehr, E. Phillips, A. Mahesh, M. Matheson, J. Deslippe, M. Fatica et al., “Exascale deep learning for climate analytics,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 2018.
  • [22] P. R. Larraondo, L. J. Renzullo, I. Inza, and J. A. Lozano, “A data-driven approach to precipitation parameterizations using convolutional encoder-decoder neural networks,” 2019.
  • [23] J. A. Weyn, D. R. Durran, and R. Caruana, “Can machines learn to predict weather? using deep learning to predict gridded 500-hpa geopotential height from historical weather data,” Journal of Advances in Modeling Earth Systems, vol. 11, no. 8, 2019. [Online]. Available:
  • [24] T. Kurth, S. Treichler, J. Romero, M. Mudigonda, N. Luehr, E. Phillips, A. Mahesh, M. Matheson, J. Deslippe, M. Fatica, P. Prabhat, and M. Houston, “Exascale deep learning for climate analytics,” in SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, 2018.
  • [25] R. Lagerquist, A. McGovern, and D. J. Gagne II, “Deep learning for spatially explicit prediction of synoptic-scale fronts,” Weather and Forecasting, vol. 34, no. 4, 2019. [Online]. Available:
  • [26] Y. Liu, E. Racah, Prabhat, J. Correa, A. Khosrowshahi, D. Lavers, K. Kunkel, M. Wehner, and W. Collins, “Application of deep convolutional neural networks for detecting extreme weather in climate datasets,” 2016.
  • [27] T. Gneiting, A. E. Raftery, A. H. Westveld, and T. Goldman, “Calibrated probabilistic forecasting using ensemble model output statistics and minimum crps estimation,” Monthly Weather Review, vol. 133, no. 5, 2005. [Online]. Available:
  • [28] A. E. Raftery, T. Gneiting, F. Balabdaoui, and M. Polakowski, “Using bayesian model averaging to calibrate forecast ensembles,” Monthly Weather Review, vol. 133, no. 5, 2005. [Online]. Available:
  • [29] T. M. Hamill and J. S. Whitaker, “Ensemble calibration of 500-hpa geopotential height and 850-hpa and 2-m temperatures using reforecasts,” Monthly Weather Review, vol. 135, no. 9, 2007. [Online]. Available:
  • [30] A. Baran, S. Lerch, M. E. Ayari, and S. Baran, “Machine learning for total cloud cover prediction,” 2020.
  • [31] P. Grönquist, T. Ben-Nun, N. Dryden, P. Dueben, L. Lavarini, S. Li, and T. Hoefler, “Predicting weather uncertainty with deep convnets,” arXiv:1911.00630, 2019.
  • [32] T. M. Hamill, J. S. Whitaker, and S. L. Mullen, “Reforecasts: An important dataset for improving weather predictions,” Bulletin of the American Meteorological Society, vol. 87, no. 1, 2006. [Online]. Available:
  • [33] F. Vitart, “Evolution of ECMWF sub-seasonal forecast skill scores,” 2014.
  • [34] F. Vitart, M. Alonso-Balmaseda, A. Benedetti, B. Balan-Sarojini, S. Tietsche, J. Yao, M. Janousek, G. Balsamo, M. Leutbecher, P. Bechtold, I. Polichtchouk, D. Richardson, T. Stockdale, and C. Roberts, “Extended-range prediction,” 2019.
  • [35] European Centre for Medium-Range Weather Forecasts, “ERA5,”, 2019.
  • [36] S. Scher and G. Messori, “How global warming changes the difficulty of synoptic weather forecasting,” Geophysical Research Letters, vol. 46, no. 5, 2019. [Online]. Available:
  • [37] World Meteorological Organization, “FM 92 GRIB,”, 2003.
  • [38] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [39] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” CoRR, vol. abs/1505.04597, 2015. [Online]. Available:
  • [40] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3D U-Net: learning dense volumetric segmentation from sparse annotation,” in International conference on medical image computing and computer-assisted intervention, 2016.
  • [41] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.
  • [42] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud, “Neural ordinary differential equations,” in Advances in neural information processing systems, 2018.
  • [43] A. Odena, V. Dumoulin, and C. Olah, “Deconvolution and checkerboard artifacts,” Distill, vol. 1, no. 10, 2016.
  • [44] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, 2004.
  • [45] H. Hersbach, “Decomposition of the continuous ranked probability score for ensemble prediction systems,” Weather and Forecasting, vol. 15, no. 5, 2000. [Online]. Available:
  • [46] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015. [Online]. Available:
  • [47] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [48] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, 2009.