WeatherBench: A benchmark dataset for data-driven weather forecasting

Data-driven approaches, most prominently deep learning, have become powerful tools for prediction in many domains. A natural question to ask is whether data-driven methods could also be used for numerical weather prediction. First studies show promise but the lack of a common dataset and evaluation metrics make inter-comparison between studies difficult. Here we present a benchmark dataset for data-driven medium-range weather forecasting, a topic of high scientific interest for atmospheric and computer scientists alike. We provide data derived from the ERA5 archive that has been processed to facilitate the use in machine learning models. We propose a simple and clear evaluation metric which will enable a direct comparison between different methods. Further, we provide baseline scores from simple linear regression techniques, deep learning models as well as purely physical forecasting models. All data is publicly available and the companion code is reproducible with tutorials for getting started. We hope that this dataset will accelerate research in data-driven weather forecasting.


page 3

page 8


RainBench: Towards Global Precipitation Forecasting from Satellite Imagery

Extreme precipitation events, such as violent rainfall and hail storms, ...

An Evaluation of Machine Learning and Deep Learning Models for Drought Prediction using Weather Data

Drought is a serious natural disaster that has a long duration and a wid...

Benchmark Dataset for Precipitation Forecasting by Post-Processing the Numerical Weather Prediction

Precipitation forecasting is an important scientific challenge that has ...

Statistical and machine learning ensemble modelling to forecast sea surface temperature

In situ and remotely sensed observations have huge potential to develop ...

Long-term stability and generalization of observationally-constrained stochastic data-driven models for geophysical turbulence

Recent years have seen a surge in interest in building deep learning-bas...

Towards a Precipitation Bias Corrector against Noise and Maldistribution

With broad applications in various public services like aviation managem...

WeatherBench Probability: A benchmark dataset for probabilistic medium-range weather forecasting along with deep learning baseline models

WeatherBench is a benchmark dataset for medium-range weather forecasting...

Code Repositories


A benchmark dataset for data-driven weather forecasting

view repo


This repository contains code to train a direct residual convolutional neural network on weather data.

view repo


WARNING! This is my working fork! For the actual repo please go to

view repo

1 Overview of previous work

In this section, we briefly describe the four existing studies on data-driven NWP with a focus on the data, methods and evaluation.

1.1 Dueben2018

In this study, the authors trained a neural network to predict 500 hPa geopotential (Z500; see Section 3

for details on commonly used fields), and in some experiments 2 meter-temperature, 1 hour ahead. The training data was taken from the ERA5 archive for the time period from 2010 to 2017 and regridded to a 6 degree latitude-longitude grid. Two neural network variants were used, a fully connected neural network and a spatially localized network, similar to a convolutional neural network (CNN). After training they then created iterative forecasts up to 120 h lead time for 10 month validation period. They compared their data-driven forecasts to an operational NWP model and the same model run at a spatial resolution comparable to the data-driven method. One interesting detail is that their networks predict the difference from one time step to the next, instead of the absolute field. To create these iterative forecasts, they use a third-order Adams-Bashford explicit time-stepping scheme. The CNN predicting only geopotential performed best but was unable to beat the low-resolution physical baseline.

1.2 Scher2018 and Scher2019a

These two studies addressed the issue of data-driven weather forecasting in a simplified reality setting. Long runs of simplified General Circulation Models (GMCs) were used as “reality”. Neural networks were trained to predict the model fields several days ahead. The neural network architecture are CNNs with an encoder-decoder setup. They take as input the instantaneous 3D model fields at one timestep, and output the same model fields at some time later. In Scher2018, a separate network was trained for each lead-time up to 14 days. Scher2019a trained only on 1-day forecasts, and constructed longer forecasts iteratively. Interestingly, networks trained to directly predict a certain forecast time, e.g. 5 days, outperformed iterative networks. The forecasts were evaluated using the root mean squared error and the anomaly correlation coefficient of Z500 and 800 hPa temperature. Scher2018 used a highly simplified GCM without hydrological cycle, and achieved very high predictive skill. Additionally, they were able to create stable "climate" runs (long series of consecutive forecasts) with the network. Scher2019a used several more realistic and complex GCMs. The data-driven model achieved relatively good short-term forecast skill, but was unable to generate stable and realistic “climate” runs. In terms of neural-network architectures they showed that architectures tuned on simplified GCMs also work on more complex GCMs, and that the same architecture also has some prediction skill on single-level reanalysis data.

1.3 Weyn2019

In this study, reanalysis-derived Z500 and 700-300 hPa thickness at 6-hourly time steps are predicted with deep CNNs. The data are from the Climate Forecast System (CFS) Reanalysis from 1979–2010 with 2.5-degree horizontal resolution and cropped to the northern hemisphere. The authors used similar encoder-decoder convolutional networks as those used by Scher2018 and Scher2019a

but also experimented with adding a convolution long short-term memory

(LSTM; Hochreiter1997) hidden layer. As in Scher2019a, forecasts are generated iteratively by feeding the model’s outputs back in as inputs. The authors found that using two input time steps, 6 h apart, and predicting two output time steps, performed better than using a single step. Their best CNN forecast outperforms a climatology benchmark at up to 120 h lead time, and appears to correctly asymptote towards persistence forecasts at longer lead times up to 14 days.

These three approaches outline promising first steps towards data-driven forecasting. The differences of the proposed methods already highlight the importance of a common benchmark case to compare prediction skill.

2 Dataset

For the proposed benchmark, we use the ERA5 reanalysis dataset (C3S2017) for training and testing. Reanalysis datasets provide the best guess of the atmospheric state at any point in time by combining a forecast model with the available observations. The raw data is available hourly for 40 years from 1979 to 2018 on a 0.25°latitude-longitude grid (7211440 grid points) with 37 vertical levels.

Since this raw dataset is very large (a single vertical level for the entire time period amounts to almost 700GB of data), we regrid the data to lower resolutions. This is also a more realistic use case, since very high resolutions are still hard to handle for deep learning models because of GPU memory constraints and I/O speed. In particular, we chose 5.625° (3264 grid points), 2.8125° (64128 grid points) and 1.40525° (128256 grid points) resolution for our data. The regridding was done with the xesmf Python package (Zhuang2019)

using a bilinear interpolation. Powers of two for the grid are used since this is common for many deep learning architectures where image sizes are halved in the algorithm. Further, for 3D fields we selected 10 vertical levels: 1, 10, 100, 200, 300, 400, 500, 600, 700, 850 and 1000 hPa. Note that it is common to use pressure in hecto-Pascals as a vertical coordinate instead of physical height. The pressure at sea level is approximately 1000 hPa and decreases roughly exponentially with height. 850 hPa is at around 1.5 km height. 500 hPa is at around 5.5 km height. If the surface pressure is smaller than a given pressure level, for example at high altitudes, the pressure-level values are interpolated.

The processed data (see Table 1) are available at ???. The data are split into yearly NetCDF files for each variable and resolution, packed in a zip file. The entire dataset at 5.625° resolution has a size of 191GB. Individual variables amount to around 25GB three-dimensional and 2GB for two-dimensional fields. File sizes for 2.8125° and 1.40525° resolutions are a factor 4 and 16 times larger. Data processing was organized using Snakemake (Koster2012). For further instructions on data downloading visit the Github page111 The available variables were chosen based on meteorological consideration. Geopotential, temperature, humidity and wind are prognostic state variables in most physical NWP and climate models. Geopotential at a certain pressure level , typically denoted as with units of ms, defined as


where describes height in meters and m s is the gravitational acceleration. Horizontal relative vorticity, defined as , describes the rotation of air at a given point in space. Potential vorticity (Hoskins1985; Holton2004) is a commonly used quantity in synoptic meteorology which combines the rotation (vorticity) and vertical temperature gradient of the atmosphere. It is defined as , where is the density, is the absolute vorticity (relative plus the Earth’s rotation) and is the potential temperature. In addition to the three-dimensional fields, we also include several two-dimensional fields: 2 meter-temperature is often used as an impact variable because of its relevance for human activities and is directly affected by the diurnal solar cycle; 10 meter-wind is also an important impact-related forecast variable, for example for wind energy; similarly, total cloud cover is an essential variable for solar energy forecasting. We also included precipitation but urge caution since precipitation in reanalysis datasets often shows large deviation from observations (e.g. Betts2019; Xu2019). Finally, we added the top-of-atmosphere incoming solar radiation as it could be a useful input variable to encode the diurnal cycle. Further, there are several potentially important time-invariant fields, which are contained in the constants file. The first three variables enclose information about the surface: the land-sea mask is a binary field with ones for land points; the soil type consists of seven different soil categories222Coarse = 1, Medium = 2, Medium fine = 3, Fine = 4, Very fine = 5, Organic = 6, Tropical organic = 7, see; orography is simply the surface height. In addition, we included two-dimensional fields with the latitude and longitude values at each point. Particularly the latitude values could become important for the network to learn latitude-specific information such as the grid structure or the Coriolis effect (see Section 5). The Github code repository includes all scripts for downloading and processing of the data. This enables users to download additional variables or regrid the data to a different resolution.

Long name Short name Description Unit Levels
geopotential z Proportional to the height of a pressure level [ms] 10 levels
temperature t Temperature [K] 10 levels
specific_humidity q Mixing ratio of water vapor [kg kg] 10 levels
relative_humidity rh Humidity relative to saturation [%] 10 levels
u_component_of_wind u Wind in x/longitude-direction [m s] 10 levels
v_component_of_wind v Wind in y/latitude direction [m s] 10 levels
vorticity vo Relative horizontal vorticity [1 s] 10 levels
potential_vorticity pv Potential vorticity [K m kg s] 10 levels
2m_temperature t2m Temperature at 2 m height above surface [K] Single level
10m_u_component_of_wind u10 Wind in x/longitude-direction at 10 m height [m s] Single level
10m_v_component_of_wind v10 Wind in y/latitude-direction at 10 m height [m s] Single level
total_cloud_cover tcc Fractional cloud cover (0–1) Single level
total_precipitation tp Hourly precipitation [m] Single level
toa_incident_solar_radiation tisr Accumulated hourly incident solar radiation [J m] Single level
constants File containing time-invariant fields
land_binary_mask lsm Land-sea binary mask (0/1) Single level
soil_type slt Soil-type categories see text Single level
orography orography Height of surface [m] Single level
latitude lat2d 2D field with latitude at every grid point [°] Single level
longitude lon2d 2D field with longitude at every grid point [°] Single level
Table 1: List of variables contained in the benchmark dataset.

3 Evaluation

Evaluation is done for the years 2017 and 2018. To make sure no overlap exists between the training and test dataset, the first test date is 1 January 2017 00UTC plus forecast time (i.e. for a three day forecast the first test date would be 4 January 2017 00UTC) while the last training target is 31 December 2016 23UTC. Further, the evaluation presented here is done on 5.625° resolution333The evaluation of all baselines in this paper are done in this Jupyter notebook: This means that predictions at higher resolutions have to be downscaled to the evaluation resolution. We also evaluated some baselines at higher resolutions and found that the scores were almost identical with differences smaller than 1%. Therefore we are reassured that little information is lost by evaluating at a coarser resolution.

A note on validation and testing: In machine learning it is good practice to split the data into three parts: the training, validation and test sets. The training dataset is used to actually fit the model. The validation dataset is used during experimentation to check the model performance on data not seen during training. However, there is the danger that through continued tuning of hyperparameters one unwillingly overfits to the validation dataset. Therefore it is advisable to keep a third, testing, dataset for final evaluations of model performance. For this benchmark this final evaluation is done for the years 2017 and 2018. Therefore, we strongly encourage users of this dataset to pick a period from 1979 to 2016 for validation of their models for hyperparameter tuning. Because meteorological fields are highly correlated in time, it is advisable to choose a longer contiguous chunck of data for validation instead of a completely random split. Here we chose the year 2016 for validation.

We chose 500 hPa geopotential and 850 hPa temperature as primary verification fields. Geopotential at 500 hPa pressure, often abbreviated as Z500, is a commonly used variable that encodes the synoptic-scale pressure distribution. It is the standard verification variable for most medium-range NWP models. Note that geopotential height, also commonly used, is defined as with units of meters. We picked 850 hPa temperature as our secondary verification field because temperature is a more impact-related variable. 850 hPa is usually above the planetary boundary layer and therefore not affected by diurnal variations but provides information about broader temperature trends, including cold spells and heat waves.

We chose the root mean squared error (RMSE) as a metric because it is easy to compute and mirrors the loss used for most ML applications. We define the RMSE as the mean latitude-weighted RMSE over all forecasts:


where is the model forecast and is the ERA5 truth. is the latitude weighting factor for the latitude at the th latitude index:


The anomaly correlation coefficient (ACC) is another common metric but for the chosen variable the qualitative results were very similar. For other variables, however, the ACC might provide additional information.

4 Baselines

To evaluate the skill of a forecasting model it is important to have baselines to compare to. In this section, we compute scores for several baselines. The results are summarized in Fig. 2 and Table 2.

Figure 2: RMSE of a) 500 hPa geopotential and b) 850 hPa temperature for different baselines at 5.625° resolution. Solid lines for linear regression and CNN indicate iterative forecasts, while dots represent direct forecasts for 3 and 5 days lead time.
Baseline Z500 RMSE (3 days / 5 days) [m s] T850 RMSE (3 days / 5 days) [K]
Persistence 936 / 1033 4.23 / 4.56
Climatology 1075 5.51
Weekly climatology 816 3.50
Linear regression (direct) 714 / 814 3.19 / 3.52
Linear regression (iterative) 719 / 812 3.17 / 3.48
CNN (direct) 626 / 757 2.87 / 3.37
CNN (iterative) 1114 / 1559 4.48 / 9.69
IFS T42 489 / 743 3.09 / 3.83
Operational IFS 154 / 334 1.36 / 2.03
Table 2: Baseline scores for 3 and 5 days forecast time at 5.625° resolution. Best machine learning model and operational baseline are highlighted.

4.1 Persistence and Climatology

The two simplest possible forecasts are a) a persistence forecast in which the fields at initialization time are used as forecasts ("tomorrow’s weather is today’s weather"), and b) a climatological forecast. For the climatological forecast, two different climatologies were computed from the training dataset (1979–2016): first, a single mean over all times in the training dataset and, second, a mean computed for each of the 52 calendar weeks. The weekly climatology is significantly better, approximately matching the persistence forecast between 1 and 2 days, since it takes into account the seasonal cycle. This means that to be useful, a forecast system needs to beat the weekly climatology and the persistence forecast.

4.2 Operational NWP model

The gold standard of medium-range NWP is the operational IFS (Integrated Forecast System) model of the European Center for Medium-range Weather Forecasting (ECMWF)444 We downloaded the forecasts for 2017 and 2018 from the THORPEX Interactive Grand Global Ensemble (TIGGE; Bougeault2010) archive, which contains the operational forecasts, initialized at 00 and 12 UTC regridded to a 0.5° by 0.5° grid, which we further regridded to 5.625°. Note that the forecast error starts above zero because the operational IFS is initializes from a different analysis. Operational forecasting is computationally very expensive. The current IFS deterministic forecast is computed on a cluster with 11,664 cores. One 10 day forecast at 10 km resolution takes around 1 hour of real time to compute.

4.3 Physical NWP model run at coarser resolution

To provide physical baselines more in line with the computational resources of a data-driven model, we ran the IFS model at two coarser horizontal resolutions, T42 (approximately 2.8° or 300 km resolution at the equator) and T64 (approximately 1.9° or 200 km), but still with 62 vertical levels. The runs were initialized from ???. The RMSE for these two forecasts for Z500 is in-between the operational IFS and the data-driven baselines. Therefore, beating these coarse-resolution physical forecasts is a realistic, primary target for data-driven approaches. For 850 hPa temperature the coarse-resolution forecasts perform much worse. The likely reason for this is that temperature close to the ground is much more affected by the resolution and representation of topography within the model. Further, the model was not specifically tuned for these resolutions. Therefore, the temperature scores from T42 and T64 should not be considered as a serious baseline. Computationally, a single forecast takes ??? for the T42 model and ??? for the T64 model on a single CPU core.

4.4 Linear regression

As a first purely data-driven baseline we fit a simple linear regression model to directly predict the Z500 and T850 fields at a given lead time. The 2D input and output fields are flattened and concatenated for this purpose (23264 4096). We do this for 3 d and 5 d lead time, as well as 6 h to create an iterative forecast. The advantage of iterative forecasts is that a single model is able to make predictions for any forecast time rather than having to train several models. For iterative forecasts the model takes its previous output as input for the next step. To create a 5 day iterative forecast the model trained to predict 6 hour forecasts is called 20 times. For this model, the iterative forecast performs just as well as the direct forecast due to its linear nature. At 5 days, the linear regression forecast is about as good as the weekly climatology.

4.5 Simple convolutional neural network

As our deep learning baseline we chose a simple fully-convolutional neural network. CNNs are the natural choice for spatial data since they exploit translational invariances in images/fields. Here we train a CNN with 5 layers. Each hidden layer has 64 channels with a convolutional kernel of size 5 and ELU activations (Clevert2015). The input and output layers have two channels, representing Z500 and T850. The model was trained using the Adam optimizer (Kingma2014)

and a mean squared error loss function. The total number of trainable parameters is 313,858. We implemented periodic convolutions in the longitude direction but not the latitude direction. The implementation can be found in the Github repository. The direct CNN forecasts beat the linear regression forecasts for 3 and 5 days forecast time. However, at 5 days these forecasts are only marginally better than the weekly climatology (see Table 

LABEL:tab:2). This baseline, however, is to be seen simply as a starting point for more sophisticated data driven methods. The iterative CNN forecast, which equivalently to the linear regression iterative forecast was created by chaining together 6 hourly predictions, performs well up to around 1.5 days but then the network’s errors grow quickly and diverge. This confirms the findings of Scher2019 whose experiments showed that training with longer lead time yields better results than chaining together short-term forecasts. However, the poor skill of the iterative forecast could easily be a result of using an overly simplistic network architecture. The iterative forecasts of Weyn2019, who employ a more complex network structure, show stable long term performance up to two weeks with realistic statistics.

4.6 Example forecasts

To further illustrate the prediction task, Fig. 3 shows example geopotential and temperature fields. The ERA5 temporal differences show several interesting features. First, the geopotential fields and differences are much smoother compared to the temperature fields. The differences in both fields are also much smaller in the tropics compared to the extratropics where propagating fronts can cause rapid temperature changes. An interesting feature is detectable in the 6h Z500 difference field in the tropics. These alternating patterns are signatures of atmospheric tides.

The CNN forecasts for 6h lead time are not able to capture these wave-like patterns which hints at a failure to capture the basic physics of the atmosphere. For 5 days forecast time the CNN model predicts unrealistically smooth fields. This is likely a result of two factors: first, the two input fields used in this baseline CNN contain insufficient information to create a skillful 5 day forecast; and second, at 5 days the atmosphere already shows some chaotic behavior which causes a model trained with a simple RMSE loss to predict smooth fields (see Section 5). The IFS operational forecast has much smaller errors than the CNN forecast. It is able to capture the propagation of tropical waves. Its main errors appear at 5 days in the mid-latitudes where extratropical cycles are in slightly wrong positions.

Figure 3: Example fields for 2017-01-01 00UTC initialization time. The top two rows show the ERA5 "truth" fields for geopotential (Z500) and temperature (T850) at initialization time (t=0h) and for 6h and 5d forecast time. In addition, the difference between the forecast times and the initialization time is shown. The third and fourth rows show the forecasts from the CNN model. Rows five and six show the IFS operational model. For the CNN forecasts the first column is identical to the ERA5 truth. We selected the 6h iterative CNN model for the 6h forecast but the 5d direct CNN model for the 5 day forecast. For the IFS the initial states (t=0h) differ slightly albeit not visibly. In addition to the forecast fields the error relative to the ERA5 "truth" is shown in the third and fifth columns. Please note that the colorbars for the difference fields change.

5 Discussion

5.1 Weather-specific challenges

From a ML perspective, state-to-state weather prediction is similar to image-to-image translation. For this sort of problem many deep learning techniques have been developed in recent years

(Kaji2019). However, forecasting weather differs in some important ways from typical image-to-image applications and raises several open questions.

First, the atmosphere is three-dimensional. So far, this aspect has not been taken into account. In the networks of Scher2019, for example, the different levels have been treated as separate channels of the CNN. However, simply using a three-dimensional CNN might not work either because atmospheric dynamics and grid spacings change in the vertical, thereby violating the assumption of translation invariance which underlies the effectiveness of CNNs. This directly leads to the next challenge: On a regular latitude-longitude grid, the dynamics also change with latitude because towards the poles the grid cells become increasingly stretched. This is in addition to the Coriolis effect, the deflection of wind caused by the rotation of Earth, which also depends on latitude. A possible solution in the horizontal could be to use spherical convolutions (Cohen2018; Perraudin2019; Jiang2019) or to feed in latitude information to the network.

Another potential issue is the limited amount of training data available. 40 years of hourly data amounts to around 350,000 samples. However, the samples are correlated in time. If one assumes that a new weather situation occurs every day, then the number of samples is reduced to around 15,000. Without empirical evidence it is hard to estimate whether this number is sufficient to train complex networks without overfitting. Should overfitting be a problem, one could try transfer learning. In transfer learning, the network is pretrained on a similar task or dataset, for example, climate model simulations, and then finetuned on the actual data. This is common practice in computer vision and has been successfully applied to seasonal ENSO forecasting

(Ham2019). Another common method to prevent overfitting is data augmentation, which in traditional computer vision is done by e.g. randomly rotating or flipping the image. However, many of the traditional data augmentation techniques are questionable for physical fields. Random rotations, for example, will likely not work for this dataset since the x and y directions are physically distinct. Thus, finding good data augmentation techniques for physical fields is an outstanding problem. Using ensemble analyses and forecasts could provide more diversity in the training dataset.

Finally, there are technical challenges. Data for a single variable with ten levels at 5.625° resolution take up around 30 GB of data. For a network with several variables or even at higher resolution, the data might not fit into CPU RAM any more and data loading could become a bottleneck. For image files, efficient data loaders have been created555See e.g. or

. One promising but so far unexplored option is to use Tensorflow’s TFRecords ( For netCDF files, however, so far no efficient solution exists to our knowledge. Further, one can assume that to create a competitive data-driven NWP model, high resolutions have to be used, for which GPU RAM quickly becomes a limitation. This suggests that multi-GPU training might be necessary to scale up this approach (potentially similar to the technical achievement of Kurth2018).

5.2 Probabilistic forecasts and extremes

One important aspect that is not currently addressed by this benchmark is probabilistic forecasting. Because of the chaotic evolution of the atmosphere, it is very important to also have an estimate of the uncertainty of a forecast. In physical NWP this is done by running several forecasts, called an ensemble, from slightly different initial conditions and potentially with different or stochastic model physics (Palmer2019b)

. From this Monte Carlo forecast one can then estimate a probability distribution. A different approach, which is often taken in statistical post-processing of NWP forecasts, is to directly estimate a parametric distribution

(e.g. Gneiting2005; Rasp2018d)

. For a probabilistic forecast to be reliable the forecast uncertainty has to be an accurate indicator of the error. A good first order approximation for this is the spread (ensemble standard deviation) to error (RMSE) ratio which should be one

(Leutbecher2008). A more stringent test is to use a proper probabilistic scoring rule, for example the continuous ranked probability score (CRPS) (Gneiting2007; Gneiting2014a). For deterministic forecast the CRPS reduces to the mean absolute error. Extending this benchmark to probabilistic forecasting simply requires computing a probabilistic score. How to produce probabilistic data-driven forecasts is a very interesting research question in its own right. We encourage users of this benchmark to explore this dimension.

A related issue is the question of extreme weather situations, for example heat waves. These events are, by definition, rare, which means that they will contribute little to regular verification metrics like the RMSE. However, for society these events are highly important. For this reason, it would make sense to evaluate extreme situations separately. But defining extremes is ambiguous which is why there is no standard metric for evaluating extremes. The goal of this benchmark is to provide a simple, clear problem. Therefore, we decided to omit extremes for now but users are encouraged to chose their own verification of extremes.

5.3 Climate simulations

Another aspect that is untouched by the benchmark challenge proposed here is climate prediction. Even though weather and climate deal with the same underlying physical system, they pose different forecasting challenges. In weather forecasting the goal is to predict the state of the atmosphere at a specific time into the future. This is only possible up to the prediction horizon of the atmosphere, which is thought to be at roughly two weeks. Climate models, on the other hand, are evaluated by comparing long-term statistics to observations, for example the mean surface temperature (Stocker2013). Scher2019 created iterative climate time scale runs with their data-driven models and compared first and second-order statistics. They found that the model sometimes produced a stable climate but with significant biases and a poor seasonal cycle. This indicates that, so far, iterative data-driven models have been unable to produce physically reasonable long-term predictions. This remains a key challenge for future research. While not specifically included in this benchmark, a good test for climate simulations is to look at long term mean statistics and the seasonal cycle as done in Figs. 6 and 7 of Scher2019.

Climate change simulations represent another step up in complexity. To start with, external greenhouse gas forcing would have to be included. Further, future climates will produce atmospheric states that lie outside of the historical manifold of states. Plain neural networks are very bad at extrapolating to climates beyond what they have seen in the training dataset (Rasp2018c). For this reason, climate change simulations with current data-driven methods are likely not a good idea. However, research into physical machine learning is ongoing and might offer new opportunities in the near future (e.g. Bar-Sinai2019a; Beucler2019).

5.4 Promising research directions

There is a wide variety of promising research directions for data-driven weather forecasting. The most obvious direction is to increase the amount of data used for training and the complexity of the network architecture. This dataset provides a, so far, unexploited volume and diversity of data for training. It is up to future research to find out exactly which combination of variables will turn out to be useful. Further, this dataset offers a four times higher horizontal resolution than all previous studies. The hope is that this data will enable researcher to train more complex models than have previously been used.

With regards to model architecture, there is a huge variety of network architectures that can be explored. U-Nets (Ronneberger2015) have been used extensively for image segmentation tasks that require computations across several spatial scales. Resnets (He2015) are currently the state of the art for image classification and their residual nature could be a good fit for state-to-state forecasting tasks. For synthesis tasks, generative adverserial networks (GANs) (Goodfellow2014) were shown to be particularly powerful for creating realistic natural images and fluid flows (Xie2018). This might be attractive since minimizing a mean loss, such as the MSE, for random or stochastic data leads to unrealistically smooth predictions as seen in Fig. 3. Conditional GANs (Mirza2014; Isola2016) could potentially alleviate this issue but it is still unclear to what extent GAN predictions are able to recover the multi-variate distribution of the training samples.

In this paper a benchmark dataset for data-driven weather forecasting is presented. It focuses on global medium-range (roughly 2 days to 2 weeks) prediction. With the rise of deep learning in physics, weather prediction is a challenging and interesting target because of the large overlap with traditional deep learning tasks (Reichstein2019). While first attempts have been made in this direction, as discussed in Section 1, the field currently lacks a common dataset which enables the inter-comparison of different methods. We hope that this benchmark can provide a foundation for accelerated research in this area. Loosely following Ebert-Uphoff2017, the key features of this benchmark are:

  • Scientific impact: Numerical weather forecasting impacts many aspects of society. Currently, NWP model run on massive super-computers at very high computational cost. Building a capable data-driven model would be beneficial in many ways (see Section WeatherBench: A benchmark dataset for data-driven weather forecasting). In addition, there is the open, and highly debated, question whether fully data-driven methods are able to learn a good representation of atmospheric physics.

  • Challenge for data science:

    While global weather prediction is conceptually similar an image-to-image task, and therefore allows for the application of many state-of-the-art deep learning techniques, there are some unique challenges to this problem: the three-dimensional, anisotropic nature of the atmosphere; non-uniform-grids; potentially limited amounts of training data and the technical challenge of handling large data volumes.

  • Clear metric for success: We defined a single metric (RMSE) for two fields (500 hPa geopotential and 850 hPa temperature). These scores provide a simple measure of success for data-driven, medium-range forecast model.

  • Quick start:

    The code repository contains a quick-start Jupyter notebook for reading the data, training a neural network and evaluating the predictions against the target data. In addition, the repository contains many functions which are likely to be used frequently, for example an implementation of periodic convolutions in Keras.

  • Reproducibility and citability: All baselines and results form this paper are fully reproducible from the code repository. Further, the baseline predictions are all saved in the data repository. The data has been assigned a permanent DOI.

  • Communication platform: We will use the Github code repository as an evolving hub for this project. We encourage users of this dataset to start by forking the repository and eventually merge code that might be useful for others back into the main branch. The main platform for communication, e.g. asking questions, about this project will be Github issues.

We hope that this benchmark will foster collaboration between atmospheric and data scientists in the ways we imagined and beyond.

All code is available at An archived snapshot of the repository at the time of submission can be found at ???.

All data can be downloaded from ???

SR, PD, SS and JW conceived the idea. SR prepared the data and baselines and led the writing. All authors contributed to the manuscript.

The authors declare no competing interests.

Stephan Rasp acknowledges funding from the German Research Foundation (DFG). We thank the Copernicus Climate Change Service (C3S) for allowing us to redistribute the data. Peter D. Düben gratefully acknowledges funding from the Royal Society for his University Research Fellowship and the ESIWACE2 project. The ESIWACE2 project have received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 823988.