deep-weather
Deep Learning for Post-Processing Ensemble Weather Forecasts
Quantifying uncertainty in weather forecasts typically employs ensemble prediction systems, which consist of many perturbed trajectories run in parallel. These systems are associated with a high computational cost and often include statistical post-processing steps to inexpensively improve their raw prediction qualities. We propose a mixed prediction and post-processing model based on a subset of the original trajectories. In the model, we implement methods from deep learning to account for non-linear relationships that are not captured by current numerical models or other post-processing methods. Applied to global data, our mixed models achieve a relative improvement of the ensemble forecast skill of over 13%, with the largest gains for extreme weather events on selected case studies, where we see an improvement in predictions of up to 26%. The computational costs of ensemble prediction systems can thus potentially be reduced, allowing weather forecasting pipelines to run higher-resolution trajectories and resulting in even more accurate raw ensemble forecasts.
Operational weather predictions have a large impact on society. They influence individuals on a daily basis, and in more severe cases, save lives and property by predicting extreme events such as tropical cyclones. However, developing reliable weather prediction systems is a difficult task due to the complexity of the Earth System and the chaotic behaviour of its components: Small errors introduced by observations, their assimilation, and the forecast model configuration escalate chaotically, leading to a significant loss in forecast skill within a week. Numerical Weather Prediction (NWP) is based on computer models solving complex partial differential equations at limited resolution. To be useful, weather forecasts try to estimate the uncertainties in predictions using ensemble simulations, where a forecast model is run a number of times from slightly different initial conditions, parameter values, and stochastic forcing. The resulting spread of predictions among ensemble members provides an estimate for the prediction uncertainty. This enables us to estimate the probability of, for example, precipitation for a specific location and time of day as well as the probability of a tropical cyclone hitting a large city.
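As a toy illustration of this idea (not from the paper; the member values are made up), the ensemble mean, spread, and an event probability can be read directly off the member forecasts:

```python
import numpy as np

# Hypothetical 2 m temperature forecasts (degrees C) from 10 perturbed
# ensemble members for one location and valid time (made-up values).
members = np.array([18.2, 19.1, 17.8, 20.3, 18.9, 19.5, 18.4, 21.0, 17.5, 19.8])

mean = members.mean()          # best-guess forecast
spread = members.std(ddof=1)   # ensemble spread = uncertainty estimate

# Empirical probability that temperature exceeds 19 degrees C
p_above_19 = (members > 19.0).mean()
```

In an operational system the same statistics are computed per grid point and per lead time; the spread field is exactly what the uncertainty network in this paper learns to predict from fewer members.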
In this paper we will focus on post-processing ensemble predictions performed at the European Centre for Medium-Range Weather Forecasts (ECMWF) [1]
using deep neural networks. ECMWF runs an operational forecast that consists of one high resolution (9 km grid) deterministic forecast (HRES), and an ensemble (ENS) with 51 members at a lower resolution (18 km), of which one is the unperturbed control trajectory. Each ensemble member starts from slightly different initial conditions and uses a different stochastic forcing in the physical parameterization schemes of subgrid-scale processes — so-called stochastic parameterization schemes. While ensemble methods have become a standard tool for numerical weather predictions, there is an ongoing discussion on how many ensemble members should be used. Larger ensembles allow for a better sampling of the probability density function (PDF) of predictions. However, computing power is limited and forecasts are bound by strict operational time windows of a couple of hours. Smaller ensembles would therefore allow individual members to run at higher resolution, likely resulting in better forecasts by each ensemble member
[2]. The demand for ever more precise and dependable forecasts has led NWP methods to rank amongst the scientific domains with the most significant demand for supercomputing time [3, 4, 5, 6]. As such, the NWP field is constantly looking for new methods to improve accuracy and reduce the computational cost of its models. This is where the recent advances in deep neural networks (DNNs) [7] become relevant. The breadth of tasks and efficient inference DNNs enable has made them a very attractive option for improving weather forecasts [8, 9, 10, 11]. Related studies have also shown their capabilities for predicting chaotic behavior [12, 13]. However, the full potential of these methods remains unexplored in many areas of NWP.
We use convolutional neural networks (CNNs)
[14] and locally connected networks (LCNs) [15] to both improve forecast skill and reduce the computational requirements for NWP. Firstly, we reduce the number of ensemble members required by predicting spread (standard deviation) values quantifying ensemble uncertainty. Secondly, we perform bias correction on the reduced ensemble mean. Finally, we combine spread prediction and output bias correction to improve forecast skill scores. The reduced number of ensemble forecasts required allows NWP to be run at a fraction of the cost of additional trajectories. Prediction time is further reduced by making use of high-throughput graphics processing units (GPUs) for DNN inference.
Our code and data are publicly available at https://github.com/spcl/deep-weather.
There have been many works leveraging the modelling capabilities of neural networks (NNs) for NWP. Early attempts at applying shallow NNs (multi-layer perceptrons) showed success in emulating physical processes and saving computational power
[16]. Since then, building on recent DNN developments, much effort has gone into applying NNs to weather nowcasting [17, 18, 19, 20]. Nowcasting focuses on the emulation of physical processes for short-term (up to six hours), high-resolution forecasts. Other works have also shown the significant capabilities of DNNs to predict longer-ranging forecasts and extreme weather patterns [21, 22, 23, 24, 25, 26]. In contrast, we focus on the post-processing of operational medium-range ensemble forecasts and the prediction of extreme weather events. Post-processing ensemble outputs has been a long-standing effort in the weather forecasting community. Methods such as Ensemble Model Output Statistics (EMOS) [27] and Bayesian Model Averaging (BMA) [28] currently allow for improvements of the raw ensemble forecast skill. Hamill and Whitaker [29] show initial explorations of those techniques on reforecast datasets, also used in this paper, for temperature at 850 hPa (T850) and geopotential at 500 hPa (Z500). Advances in neural networks have only recently reached the field of ensemble models in weather forecasting, focusing on its application to specific weather stations [9, 30]
or global interpolations
[31]. We expand on this work by applying DNNs to the novel task of improving the forecast skill of global predictions, specifically extreme weather forecasts, while reducing their computational costs.

The quality of ensemble forecasts has improved significantly over the last decades, and ensemble predictions use ever-increasing resolutions and numbers of trajectories. Learning from past ensemble predictions would therefore lead to inconsistencies, as the correction for mean and spread would need to adjust for changes in the quality of predictions over time. To address this, reforecasts [32] apply current state-of-the-art forecast models to past measurements.
We use data from reforecast ensemble experiments at ECMWF. These are routinely generated to provide an estimate for the climatological background state of the atmosphere, which is required for data assimilation[33, 34]. The reforecast experiments run a 10-member ensemble (ENS10) and an unperturbed control experiment on a cubic octahedral reduced Gaussian grid with approximately 18 km grid-spacing (Tco639) and 91 vertical levels. Simulations are performed with the same system for 1999-2017 with two forecast simulations starting each week. This provides a very large dataset with consistent forecast quality.
To fully train our networks and evaluate their forecast skill we also need ground truth weather conditions at specific forecast lead times. For this, we use the fifth major ECMWF ReAnalysis (ERA5) [35], which includes data on weather from 1979 up to the present (available for download at https://cds.climate.copernicus.eu/). Compared to reforecasts, reanalysis datasets are produced by applying a constant stream of observations through state-of-the-art data assimilation on state-of-the-art forecast models used for decade-long simulations. In ERA5, this process generates hourly measurements for over 300 parameters.
We use the ENS10 dataset for our forecasts and make use of ERA5’s constant data assimilation product as ground truth, given the ENS10 forecast lead times. ENS10 and ERA5 both provide global data, which we interpolate to a latitude/longitude grid with a 0.5 degree resolution. We do this to avoid the native grid that was used within simulations, as it is unstructured in the longitudinal direction. While the use of latitude/longitude grids does lead to over-saturation of gridpoints in the poles, it simplifies our models; we leave the use of unstructured grids to future research. We also focus on a single pressure level for each model. When predicting temperature at 850 hPa (T850), we provide all input fields at 850 hPa. Similarly, when predicting geopotential at 500 hPa (Z500) all input fields are at 500 hPa. The years 1999-2013 are used for training, 2014-15 for validation, and 2016-17 for testing. Since the datasets are re-forecasts and a reanalysis, there is no difference in data assimilation and predictions between older and more recent dates, and therefore the selection of consecutive years should not have a major impact. The effects of climate change on the uncertainty of forecasts are currently being explored [36]. For our selected parameters and years, these effects are low. It is, however, important to use complete years, as different seasons demonstrate different weather patterns. We target forecasts with a lead time of 48 hours, and use the reduced ensemble forecasts for 0, 24, and 48 hour lead times as inputs.
As the datasets consist of several terabytes of data, we set up a data preprocessing pipeline to enable faster training. We first select the relevant inputs and labels to each of our respective models from the data provided in GRIB[37] format. We then convert the data from a 16-bit fixed point format to 32-bit floating point. This simplifies and speeds up training, while not impacting the results. Finally, we standardize our features and save them in TFRecord format, which will be randomly shuffled at training time. The resulting inputs and training targets can be seen in Figure 1. We base our model inputs on the first five ENS10 trajectories as we observed no significant differences in the average means or spreads when using different selections.
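The standardization step can be sketched as follows (a simplified stand-in: the array shapes are illustrative, and the GRIB decoding and TFRecord serialization are omitted):

```python
import numpy as np

def standardize(features, eps=1e-8):
    """Standardize each feature channel over the training set.

    features: array of shape (samples, channels, lat, lon), cast to
    float32 from the 16-bit storage format before statistics are taken.
    Returns the standardized data plus the per-channel mean/std, which
    must be reused to transform the validation and test sets.
    """
    x = features.astype(np.float32)
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    std = x.std(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / (std + eps), mean, std

# Example with random stand-in data in a 16-bit format
x = np.random.rand(4, 2, 8, 16).astype(np.float16)
z, mu, sd = standardize(x)
```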
For DNNs to learn local weather patterns, it is important to keep local spatial differences in variability (coherence) when normalizing meteorological data [23]. However, if only one value per mean and standard deviation are applied to scale values on the whole globe, there will be massive differences for specific regions, e.g., different means and standard deviations closer to the poles compared to the equatorial region. This can lead to poor accuracy when applying CNNs that are translation-invariant. At the same time, just applying gridpoint-wise standardization will result in losing important information, otherwise represented through the coherence.
To remedy this problem, we apply a heuristic we refer to as Local Area-wise Standardization (LAS). First, we apply a moving-average and moving-standard-deviation filter on our training set, with a step size of one and a filter size equal to the largest CNN filter size we apply in our DNNs. Then, as the resulting mean and standard deviation maps have reduced dimensions, we pad them using the edge values for latitudes and a wrap-around for longitudes. Finally, we apply a Gaussian filter with a large standard deviation (10) to the padded result. This upscaling method preserves the coherence between individual grid points.
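A minimal sketch of LAS, assuming a 7-point moving filter (the paper uses its largest CNN filter size, which is not stated here) and using scipy's per-axis boundary modes as an equivalent shortcut for the explicit edge/wrap padding described above:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def local_area_standardize(field, filter_size=7, sigma=10.0):
    """Local Area-wise Standardization (LAS) sketch for a 2D field.

    field: array of shape (lat, lon). filter_size is an assumption;
    sigma=10 follows the large Gaussian standard deviation in the text.
    Latitudes are edge-padded ('nearest'); longitudes wrap around.
    """
    modes = ['nearest', 'wrap']
    mean = uniform_filter(field, size=filter_size, mode=modes)
    sq = uniform_filter(field ** 2, size=filter_size, mode=modes)
    std = np.sqrt(np.maximum(sq - mean ** 2, 1e-12))
    # Gaussian smoothing keeps local coherence between grid points
    mean = gaussian_filter(mean, sigma=sigma, mode=modes)
    std = gaussian_filter(std, sigma=sigma, mode=modes)
    return (field - mean) / std
```

Unlike global standardization, each grid point is scaled by smoothly varying local statistics, so polar and equatorial regions are normalized consistently while local variability is preserved.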
Using LAS we notice a relative improvement in our DNN results of around 15% for spread prediction, as well as faster convergence times compared to applying the same standardization for all grid points. However, as expected, we see no difference when using it with our LCNs, which are not translation-invariant (see Section 3.2).
We develop separate neural networks for our uncertainty quantification and output bias correction tasks: An Inception-style residual network [38] and a U-Net [39] with added LCN, respectively. While prior work in weather uncertainty prediction [31] used a single 3D U-Net [40], we find different network architectures performed better for our tasks.
Residual connections [41], which pass unmodified features between non-adjacent layers, are also crucial for our results. Indeed, Chen et al. [42]
demonstrated that applying successive residual connections has many similarities to ordinary differential equations. We also considered recurrent neural networks, but do not apply them here due to the short input sequences and lack of improvement in prior work
[31]. Convolutions with 1×1 kernels are meant to reduce the number of filters. For other convolutional layers we use Batch Normalization (BN) and ReLU activations on the outputs.
There are many uncertainties present in NWP models and data. Data, or aleatoric, uncertainty stems from observational measurement noise, while model uncertainty comes from structural (e.g., data assimilation, forecast model) and parametric uncertainties. In addition to these inherent uncertainties, we also introduce a structural uncertainty by assuming that the forecast variables follow Gaussian distributions, a common NWP assumption. Our DNN is able to address data and structural uncertainties; however, we cannot address parametric uncertainties, as the data assimilation pipelines and forecast models we use are fixed and used as prediction labels. More specifically, to reduce computational requirements, our DNN initially aims to predict the full ensemble spread using only a subset of NWP ensemble trajectories. The architecture is summarized in Figure
2. The non-linear nature of our DNN model, which introduces a previously unused structure in NWP and statistical post-processing, combined with the forecasts of the reduced NWP ensemble, helps address the original structural uncertainty. Such a design also allows our model to take into account deterministic forecasts, which it would otherwise struggle to learn. Additionally, as DNNs are relatively robust to noise, they are naturally able to account for data uncertainty. We also perform a minimal post-processing on NWP output, which reduces the number of required parameters and extracts basic features. This is used for all subsequent steps.
The core of the DNN is based on the ResNet architecture [41]. We use ten Inception-style layers [38] with residual connections; we did not see any improvement with more layers. Each layer is composed of three parallel convolutions, allowing the network to learn differently-sized receptive fields. These use dilated convolutions, which perform similarly to full convolutional kernels, at a reduced computational cost. We also perform a channel-wise concatenation of the post-processed NWP output to the input of each Inception-style layer. This allows the network to prioritize between different lead times from NWP forecast spreads and its own outputs. Using auxiliary losses for different lead times and depths performed worse than pure NWP predictions.
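A hedged Keras sketch of one such layer follows; the filter count, the dilation rates, and the channel-matching 1×1 convolution are our assumptions, not the paper's exact configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_residual_block(x, filters=32):
    """One Inception-style residual layer (illustrative configuration).

    Three parallel dilated 3x3 convolutions emulate differently-sized
    receptive fields at reduced cost; the block input is added back
    through a residual connection.
    """
    branches = []
    for dilation in (1, 2, 3):
        b = layers.Conv2D(filters, 3, padding="same",
                          dilation_rate=dilation)(x)
        b = layers.BatchNormalization()(b)
        b = layers.ReLU()(b)
        branches.append(b)
    merged = layers.Concatenate()(branches)
    # 1x1 convolution to match the input channel count for the residual add
    merged = layers.Conv2D(x.shape[-1], 1, padding="same")(merged)
    return layers.Add()([x, merged])
```

In the full model, ten such blocks would be stacked, with the post-processed NWP output concatenated channel-wise to each block's input as described in the text.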
Finally, the output of the last Inception-style layer is combined with the NWP spread through a weighted mean. This guarantees the network performs at least as well as the NWP spread used as input during training. However, this does not guarantee success for unseen examples; we validate this experimentally. By combining this model with our bias correction model and training with ERA5 data as the ground truth (see Section 3.3), it is possible to also account for parametric uncertainty. This allows our combined networks to cover all types of uncertainty.
Our output bias correction model, summarized in Figure 3, corrects for biases in NWP forecasts that might arise through local trends in weather patterns. It is trained using the mean ensemble predictions with a 48-hour lead time and ERA5 data as ground-truth. Since the forecast can resemble the ground-truth, a straightforward predictor will closely resemble the identity function. Prior research [41] suggests that approximating an identity mapping with several non-linear layers is difficult. We therefore train our model to predict the difference between the NWP prediction and the ground-truth.
The network is based on a U-Net structure. It repeatedly applies several layers of convolution per level, followed by downscaling. These levels are repeated several times, before being upscaled in a similar, level-wise manner. Residual connections are also used between the down- and upscaling sides. We make three key changes to adapt the standard U-Net to our task. First, instead of up-convolution, we use bilinear interpolation to upscale, followed by a convolution with stride 1. This is due to checkerboard artifacts that are known to appear when only using a simple deconvolution operation
[43]. Second, we reduce the number of levels in the U-Net from five to three, as we found using additional levels resulted in overfitting on our data. Finally, we reduced the number of filters in each convolution by half, as we observed no additional performance improvements by using more.

As we aim to predict the bias emerging from specific regional patterns, the translational invariance of regular convolution hinders performance. We therefore use a locally-connected network (LCN) as the last layer. LCNs perform a similar operation to regular convolution, but instead of sharing a filter, an independent filter is used for each output neuron. When training, we apply L1 regularization on the difference between all adjacent filters in an LCN, to encourage adjacent filters to learn similar weights; for an infinite regularization parameter, the LCN converges to a convolutional layer. This helps avoid overfitting.
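The adjacent-filter L1 regularization can be expressed as a penalty on differences between filters at neighboring grid points; a numpy sketch under an assumed weight layout:

```python
import numpy as np

def lcn_smoothness_penalty(weights, lam=1e-3):
    """L1 penalty encouraging adjacent LCN filters to be similar.

    weights: per-gridpoint filters with an illustrative layout of shape
    (lat, lon, k, k, in_channels). As lam grows, all filters are pushed
    toward equality, i.e. the LCN degenerates into an ordinary
    (weight-sharing) convolution.
    """
    d_lat = np.abs(np.diff(weights, axis=0)).sum()  # north-south neighbors
    d_lon = np.abs(np.diff(weights, axis=1)).sum()  # east-west neighbors
    return lam * (d_lat + d_lon)
```

The penalty is zero exactly when every grid point shares the same filter, which makes the convolutional layer the limiting case described in the text.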
In order to remain computationally efficient, our best models first use a U-Net to perform feature extraction and then apply an LCN to obtain our final output bias correction. As the U-Net is able to learn long-range dependencies, a single LCN with small kernels is sufficient to learn the gridpoint-wise dependencies, and we observed no improvements from using larger filter sizes.

As our models solve different tasks, we need to use different metrics to evaluate them. When training, we treat both uncertainty quantification and output bias correction as regressions, and aim to predict extreme cases (outliers). We therefore train the networks on mean-squared error (MSE) and evaluate them with root mean-squared error (RMSE). However, when predicting the spread, the results lack the sharp edges that exist in the original forecasts. In computer vision, this problem is mitigated using the Structural SIMilarity (SSIM) metric
[44]. There can be infinitely many solutions to the task of minimizing RMSE. Within this space of solutions, the SSIM measures the structural similarity between two images, with a value of 1 indicating a perfect match. Therefore, we use the negative mean SSIM of our prediction compared to the full ENS10 ensemble as our training loss for the uncertainty quantification model.

To then gain an understanding of the forecast skill of our combined predictions and of the ENS10 forecasts, we use the Continuous Ranked Probability Score (CRPS) [45]
. CRPS, generally used to measure whether ensemble methods represent uncertainty correctly, is the integral of the square of the difference between the cumulative distribution function (CDF) $F$ of the probabilistic prediction and that of the ground truth $y$ (see Figure 4):

$$\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \left( F(x) - \mathbb{1}[x \geq y] \right)^2 \, dx$$

Here, $\mathbb{1}$ is the indicator function (equivalent to the CDF of a deterministic value). In the following, we do not only combine the uncertainty and bias correction networks to calculate CRPS, but also train a combination of both in a network that is optimized to reduce the CRPS. We achieve this by replacing the labels of our uncertainty quantification network, which were previously the spread values of the full ENS10 trajectories, with the difference between the ground truth $y$ and the output bias-corrected forecast $\mu$. With our assumption of the forecasts being of a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$, with the error function $\operatorname{erf}$ and standard deviation $\sigma$, we then set the CRPS loss as follows:

$$\mathrm{CRPS}\!\left(\mathcal{N}(\mu, \sigma^2), y\right) = \sigma \left[ z \operatorname{erf}\!\left(\frac{z}{\sqrt{2}}\right) + \sqrt{\frac{2}{\pi}}\, e^{-z^2/2} - \frac{1}{\sqrt{\pi}} \right], \qquad z = \frac{y - \mu}{\sigma}$$
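For a Gaussian forecast $\mathcal{N}(\mu, \sigma^2)$, the CRPS has a well-known closed form, which can be implemented directly (a sketch of the standard result, not code from the paper):

```python
import math

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2) against an
    observed value y. Lower is better; 0 would be a perfect, certain hit."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return sigma * (z * math.erf(z / math.sqrt(2.0))
                    + 2.0 * pdf - 1.0 / math.sqrt(math.pi))
```

Because the expression is differentiable in both mu and sigma, it can serve directly as a training loss for the combined network.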
Finally, the relative CRPS improvement of a prediction over the original raw ensemble is defined as the Continuous Ranked Probability Skill Score (CRPSS):

$$\mathrm{CRPSS} = 1 - \frac{\mathrm{CRPS}_{\text{prediction}}}{\mathrm{CRPS}_{\text{raw ensemble}}}$$
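As a sketch, the skill score is a one-line computation (the function name is ours):

```python
def crpss(crps_prediction, crps_reference):
    """Relative CRPS improvement over a reference (raw ensemble) forecast.
    Positive values mean the prediction is more skillful; 0 means no gain."""
    return 1.0 - crps_prediction / crps_reference
```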
Our networks are implemented with TensorFlow
[46]. Layers are initialized with a truncated normal distribution. We train with the Adam optimizer
[47] and L2 regularization. Our models have not been fine-tuned extensively, and there is further potential for improvement.

The uncertainty networks are trained for 4,725 update steps with a batch size of 2, requiring about four hours on one Nvidia V100 32 GB GPU (provided by the Swiss National Supercomputing Centre). The bias correction networks are trained for the same wall time, taking about 25,000 update steps with a batch size of 2. We use early stopping on the validation dataset to identify the best parameters. Training can be done once and the resulting networks used until the ensemble prediction system is upgraded. Using the same GPU, inference for one parameter and forecast on a global grid takes approximately 0.31 seconds per network.
Notation | Description |
---|---|
B{n} | Output bias correction NN trained with n trajectories |
U{n} | Spread prediction NN trained with n trajectories |
E{n} | Ensemble with n trajectories |
Lin{n} | Gridpoint-wise linear regression from n trajectories |
C | Uncertainty NN trained on CRPS |
G | Ground truth data from ERA5 |
We primarily train our models to predict T850 but also evaluate their prediction capacity on Z500. First, our uncertainty quantification and bias correction are evaluated separately on the global RMSE for the spread of ENS10 forecasts or the ERA5 ground-truth respectively. All results are for a forecast lead time of 48 hours. In addition to our DNNs, we train linear regression models on ensemble trajectories as another baseline (see Table 1).
Figure 5 (a, b) shows the improvement in spread prediction of our uncertainty network using five ensemble trajectories, compared to simply using the five trajectories. There are significant improvements for both temperature and geopotential. Figure 5 (c, d) shows our output bias correction for predicting deviating weather patterns given a forecast mean and no measure of uncertainty. We see improvements for T850, but our network struggles to provide a strong global improvement for Z500, which we analyze more thoroughly through case studies (see Section 4.1).
T850 | E3 | E4 | E5 | E6 | E7 | E8 | E9 | Lin5 | U3 | U4 | U5 |
---|---|---|---|---|---|---|---|---|---|---|---|
Abs. | 0.35 | 0.28 | 0.23 | 0.19 | 0.15 | 0.11 | 0.07 | 0.21 | 0.26 | 0.23 | 0.19 |
Rel. | - | - | - | - | - | - | - | 9.0% | 26.6% | 18.7% | 19.5% |
Parameter | Lin10 | UN0 | UN1 | UN2 | UN0-LCN | UN1-LCN | UN2-LCN | UN1-LCN-reg |
---|---|---|---|---|---|---|---|---|
T850 | 4.8% | 6.3% | 7.1% | 6.7% | 7.6% | 7.7% | 7.6% | 7.9% |
Z500 | 2.1% | 1.2% | 1.6% | 1.0% | 2.6% | 2.5% | 2.4% | 2.3% |
Table 2 shows our analysis of the final results for the uncertainty quantification network using diverse trajectory numbers on T850. With a very low number of trajectories we can see that our model is already able to predict uncertainty to some degree. Also, while there is understandably a decrease in performance with a growing number of trajectories, the networks still consistently outperform the baselines. This is even the case for a relatively high number of trajectories (half). We also present the results of our ablation studies on the output bias correction networks in Table 3. It shows that T850 correction benefits from convolution operations, whereas Z500 correction benefits most from using a locally connected structure. Further supporting this, we observe that L1 regularization does not have as significant an impact for Z500, suggesting that locally independent weights are more important.
RMSE is insufficient to measure forecast skill, as it does not encompass both mean and spread. We therefore also consider CRPS (lower scores are better). We omit EMOS methods, as there is no previous work for these specific features on our datasets. Instead we use E5 and E10, the raw five- and ten-member ensembles, as our baselines.

CRPSS | Z500 | T850 |
---|---|---|
B5U5C towards E10 | 0.0650 | 0.0969 |
B5U5C towards E5 | 0.0972 | 0.1334 |

The results are presented in Figure 6. To measure our improvements, we specifically focus on the CRPSS of our combined network trained on CRPS, as compared to the raw ensemble outputs in Table 4. Our models, both those trained on CRPS and SSIM, also outperform the full ensemble forecast CRPS for both T850 and Z500, despite the latter using ten trajectories. The major source of improvement for our DNNs, especially for T850, is a reduction in extreme values (outliers), which are indicative of extreme weather events and forecast busts.
Thus far, our results have only demonstrated improvements on average values. They do not cover the performance for uncertainty quantification of extreme events, where forecast reliability is essential. We now present three cases of extreme weather phenomena within our test set, selected from across the world, to demonstrate the networks’ improvements for specific predictions. Figures 9, 8 and 7 show our CRPS improvements for tropical cyclone Winston, hurricane Matthew, and a cold wave over Asia, respectively. Blue color indicates that the (post-processed) five-member ensemble has more skill when compared to the ten-member ensemble. All times are 00:00 UTC. We also present global CRPS plots in Figure 10.
Tropical cyclone Winston, raging from February until March 2016 over Fiji, has been classified as the most intense cyclone in the southern hemisphere ever recorded, according to the Southwest Pacific Enhanced Archive for Tropical Cyclones. It reached category five on February 20. We present the prediction for Z500 as forecast on February 19 for February 21, and differences in CRPS. (a) The CRPS for the ten-member ensemble. The center of the cyclone is clearly visible. (b)-(d) The difference in CRPS between the ten-member ensemble and five-member ensembles with and without post-processing. Our CRPS-trained network shows improvement over E10. It also demonstrates similar confidence in the southern area where the cyclone is moving, while being minimally worse where the cyclone has already passed. This results in a large forecast skill improvement in the selected area, with a CRPSS of 0.739 (26.1% improvement over E10).
We show that using informed model construction, deep learning can indeed improve the skill of global ensemble weather predictions. In particular, since we not only predict an ensemble spread, but also perform a locally-adaptive output bias correction, we improve the results of a five-member ensemble to even surpass the forecast skill of a ten-member ensemble. When tasked with hard-to-predict extreme weather cases, such as tropical cyclone Winston, the combined models exhibit especially pronounced forecast skill improvements. Through the use of heterogeneous hardware, they are able to run these global post-processing steps within tenths of a second. In the future, such deep learning tools could allow for reduced ensembles to be run at higher resolutions, providing cheaper and more informed predictions.
The network structures used in this paper should also be tested for other applications of deep learning in NWP, such as the learning of model error in data-assimilation systems or the learning of the global equations of motion. Future research could be conducted into whether the networks need to be re-trained to process other physical fields and forecast lead times or whether the normalisation of spread values could allow the same network to also be applied to these tasks through transfer learning. Recurrent networks that encompass more time steps, as well as deep learning models that are capable of working on the native unstructured grid of the prediction model (e.g., graph neural networks
[48]), can also be investigated in this context. Finally, the presented improvements would need to be studied when applied to ensembles with more members, such as the operational 50-member ensemble system of the ECMWF.

We encourage researchers to make use of the ERA5 and ENS10 datasets as well as our code, to apply new deep learning methods and expand on our initial architectures, helping weather forecast centers worldwide.
P. D. Dueben and P. Bauer, “Challenges and design choices for global weather and climate models based on machine learning,” Geoscientific Model Development, vol. 11, no. 10, 2018. [Online]. Available: https://www.geosci-model-dev.net/11/3999/2018/

P. R. Vlachas, J. Pathak, B. R. Hunt, T. P. Sapsis, M. Girvan, E. Ott, and P. Koumoutsakos, “Backpropagation algorithms and reservoir computing in recurrent neural networks for the forecasting of complex spatiotemporal dynamics,” 2019.