## 1 Introduction

Weather forecasting is the prediction of future weather conditions such as precipitation, temperature, pressure and wind and is fundamental to both science and society. The field joins the forefronts of physical modelling and computing technology into a single century-long scientific and technological endeavor (bauer2015quiet). The social and economic benefits of accurate weather forecasting range from improvements in our daily lives to substantial impacts on agriculture, energy and transportation and to the prevention of human and economic losses through better prediction of hazardous conditions such as storms and floods (shakti2015comparison; hwang2015improved).

Operational weather forecasts are based on Numerical Weather Prediction (NWP), which uses the laws of physics to simulate the dynamics of the atmosphere. NWP has seen substantial advances over the past decades due to improvements in the representation of the physics, an increase in observational data and an exponential growth in computing capabilities. These aspects have increased the spatial and temporal resolution of NWP and have advanced the range of skillful weather forecasts by about one day per decade. Despite the advances, several challenges remain for NWP (bauer2015quiet). The computational and power demands of NWP grow as a power of the resolution of the forecast, creating a trade-off between the accuracy of the forecast, which requires ever higher resolution, and the time required to produce it. In addition, the mathematical formulation of NWP is derived from our current understanding of atmospheric physics, which may be imprecise or cannot be fully resolved at the resolution of the model.

In the field of machine learning, deep neural networks (DNNs) have seen remarkable progress in recent years due to increased amounts of available data, better model architectures and ease of implementation on powerful specialized hardware such as GPUs and TPUs (jouppi2017datacenter). Recently introduced DNN architectures are able to effectively process and make use of large spatial and temporal contexts in the input data (krizhevsky2012imagenet; he2016deep; vaswani2017attention). These models can produce probabilistic outputs and represent uncertainty in the predictions. Their predictive performance tends to improve with increasing amounts of training data, while the specification of the DNNs remains easy to comprehend and maintain. These properties make DNNs especially promising for weather forecasting due to the vast amounts of continually collected data from satellites, ground based radars and weather stations requiring no human annotation, the large computational requirements of the task, the rich spatial and temporal structure of the inputs and predictions, and the inherent need to represent uncertainty.

We adopt DNNs to tackle the self-annotated structured prediction task of precipitation forecasting. The setting is an ideal benchmark due to the availability of dense and continual precipitation measurements that require no human annotation and well defined metrics to measure performance. We develop *MetNet*, a Neural Weather Model (NWM) that forecasts rates of precipitation with a lead time of up to 8 hours, a spatial resolution of 1 km and a temporal resolution of 2 minutes. MetNet covers a geographical area corresponding to the continental United States. The architecture is conditioned on lead time and uses axial self-attention (ho2019axial) at its core to aggregate a large spatial context of 1024 km × 1024 km. MetNet relies on mosaicked ground based radar and satellite imagery as input; the predictions take on the order of seconds independently of lead time and can be made in parallel.

We show that for up to 7 to 8 hours of lead time MetNet is able to outperform the High Resolution Rapid Refresh (HRRR) system which is the current best operational NWP available from NOAA, the National Oceanic and Atmospheric Administration. MetNet substantially outperforms other 2-3 hour forecasting methods (ayzel2019optical; agrawalmachine). Ablation studies show that MetNet’s architecture is able to capture the large spatial context that is needed for accurate predictions with many hours of lead time. These results also suggest the effectiveness of axial self-attention blocks beyond generative modelling of images. Visualizations suggest that MetNet is able to capture advection and the formation of new regions of precipitation. To our knowledge MetNet is the first machine learning model to outperform NWP on a structured prediction task at such a range and resolution.

## 2 Precipitation Forecasting

Precipitation provides a benchmark for a highly varying and densely measured target (agrawalmachine). We cast precipitation forecasting as a structured prediction problem where the output comes in the form of a three-dimensional tensor. Each value of the tensor corresponds to a time and a location and indicates the corresponding rate of precipitation measured in mm/h. Target precipitation rates are estimated by the Multi Radar Multi Sensor (MRMS) ground based radars as a function of the returned radar echoes (zhang2016multi). The spatial frames obtained from MRMS cover the continental United States. Each pixel covers 0.01 degrees of longitude and latitude, corresponding to approximately 1 km. In addition to MRMS frames, the available input data include the 16 spectral bands of the optical Geostationary Operational Environmental Satellite 16 (GOES-16). Figure 1 contains examples of MRMS and GOES-16 frames.

## 3 Neural Weather Models

We next describe defining features of DNNs when used as neural weather models (NWMs), that is, as models for the prediction of structured weather variables.

### 3.1 Direct Probabilistic Model

NWMs predict the probabilities of target weather conditions $\mathbf{y}$ at a target time $T_y$ from a set of input conditions $\mathbf{x}$ at input times up to $T_x$:

$$P(\mathbf{y} \mid \mathbf{x}) = f_\theta(\mathbf{x}) \tag{1}$$

where $f_\theta(\mathbf{x})$ is a probability distribution over the targets $\mathbf{y}$ given the inputs $\mathbf{x}$ and $f$ is a deep neural network with learnable parameters $\theta$. The probabilistic form of a NWM accounts for uncertainty by computing a probability distribution over possible outcomes; the NWM does not produce a single deterministic output.

### 3.2 Architectural Constraints and Learning

NWMs contain no explicit assumptions about the physics of weather. Generic assumptions about the spatial and temporal relations among the inputs and between inputs and targets are encoded as parametric architectural constraints in the NWM. The parameters governing these relations are learnt through back-propagation by minimizing the forecast error between the observed and the predicted targets.
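The learning setup above can be sketched in a few lines. The model, feature sizes and learning rate below are toy illustrations, not the paper's: a parametric model maps inputs to a categorical distribution over discretized targets, and its parameters are adjusted by gradient descent on the cross-entropy forecast error.

```python
import numpy as np

# Toy sketch of learning by minimizing forecast error (names illustrative):
# a linear model with softmax output stands in for the deep network.
rng = np.random.default_rng(0)
x = rng.normal(size=4)            # input conditions (toy feature vector)
theta = rng.normal(size=(4, 8))   # learnable parameters, 8 target bins
y = 3                             # observed target bin

def forward(theta, x):
    logits = x @ theta
    p = np.exp(logits - logits.max())
    return p / p.sum()            # categorical distribution over bins

def loss(theta, x, y):
    return -np.log(forward(theta, x)[y])  # cross-entropy forecast error

p = forward(theta, x)
loss_before = -np.log(p[y])
# For softmax + cross-entropy the gradient of the logits is (p - onehot(y)).
grad_logits = p.copy()
grad_logits[y] -= 1.0
theta = theta - 0.1 * np.outer(x, grad_logits)  # one gradient-descent step
loss_after = loss(theta, x, y)
```

In the real model the gradient is computed by back-propagation through the full network rather than by this closed form.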

### 3.3 Discrete Distributions

Weather variables often correspond to physical measurements on a continuous scale. Instead of modelling the continuous values of the targets we discretize the variables into a large number of small intervals that cover the continuous range of interest. The prediction from the NWM is then a categorical distribution that assigns a probability to each of the intervals (Figure 2). The generated categorical distribution is highly flexible and stabilises the training of the underlying DNN (oord2016pixel; kalchbrenner2017video).
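The discretization can be sketched as follows; the bin width and bin count here are illustrative placeholders, not the paper's values.

```python
import numpy as np

# Sketch of discretizing a continuous weather variable (mm/h) into a
# categorical target: each rate maps to the index of the small interval
# containing it, and out-of-range rates fall into the last bin.
bin_width = 0.2                          # mm/h per interval (assumed)
num_bins = 64                            # number of intervals (assumed)
edges = np.arange(num_bins + 1) * bin_width

def rate_to_bin(rate_mm_h):
    """Index of the interval containing the rate; the last bin is open-ended."""
    return min(int(rate_mm_h // bin_width), num_bins - 1)
```

The network is then trained with a standard categorical cross-entropy loss against these bin indices.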

## 4 MetNet Architecture

We next describe the neural network architecture that underlies the NWM for precipitation rate forecasting that we call MetNet.

### 4.1 Input Patch

MetNet receives as input a four-dimensional tensor that corresponds to data from a large patch of the continental United States, with dimensions time, height, width and number of channels. The time dimension comprises slices sampled every 15 minutes over a 90 minute interval prior to $T_x$, where $T_x$ is the time at which the model makes a prediction into the future. The input data is calculated from a patch covering a geographical area of 1024 × 1024 kilometers. The input features comprise the MRMS radar image, the 16 spectral bands of the GOES-16 satellite and additional real-valued features for the longitude, latitude and elevation of each location in the patch, as well as for the hour, day and month of the input time $T_x$. The latter time features are tiled along the spatial dimensions of the input tensor (Figure 3).
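The tiling of scalar features along the spatial dimensions can be sketched as follows; the patch size and normalization are illustrative assumptions, and the same broadcasting mechanism applies to the lead-time encoding of the next section.

```python
import numpy as np

# Sketch of tiling scalar inputs (e.g. hour, day, month) along the spatial
# dimensions so that every location carries them as extra channels.
H = W = 8                         # spatial size of the (toy) input patch
radar = np.random.rand(H, W, 1)   # one precipitation-rate channel
hour, day, month = 13 / 24.0, 5 / 7.0, 3 / 12.0  # normalized time scalars

time_feats = np.broadcast_to(np.array([hour, day, month]), (H, W, 3))
patch = np.concatenate([radar, time_feats], axis=-1)
# Every spatial location now sees the same time features as channels.
```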

### 4.2 Conditioning on Target Lead Time

A forward pass through MetNet makes a prediction for a single lead time. The model is informed about the desired lead time by concatenating this information with the descriptive input features. The aim is for the computation in MetNet to be aware of the lead time of the prediction from the very outset, so that every aspect of the computation can be conditioned on the lead time. The lead time is represented as an integer indicating minutes from 2 to 480. The integer is tiled along the spatial locations in the patch and is encoded as a one-hot vector, that is, an all-zero vector with a 1 at the position corresponding to the lead time. By changing the target lead time given as input, one can use the same MetNet model to make forecasts for the entire range of target times that MetNet is trained on.

### 4.3 Target Patch

Accurate predictions of the target tensor with up to 480 minutes of lead time require a large spatial context around the target. With an input patch covering 1024 × 1024 km and an indicative average precipitation displacement of 1 km per minute, we set the target patch to cover 64 × 64 km centered on the input patch. This leaves at least 480 km of spatial context on each of the four sides of the target patch, satisfying the indicative displacement rate.
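The context arithmetic can be checked directly, assuming the 1024 km input patch and 64 km target patch sizes and the indicative 1 km/min displacement:

```python
# Margin of spatial context on each side of a centered target patch.
input_km, target_km = 1024, 64
margin_km = (input_km - target_km) // 2   # context on each side
max_displacement_km = 1 * 480             # 1 km/min over a 480 min lead time
```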

### 4.4 Output Layer

The output of MetNet is a 512-way categorical distribution. Each of the 512 bins corresponds to a 0.2 mm/h interval of predicted precipitation rate, starting from 0 mm/h up to 102.4 mm/h. All precipitation rates higher than 102.4 mm/h are grouped in the last bin. Probabilities for any precipitation range of interest, or for precipitation rate above a given threshold, are obtained by summing the probabilities of the intervals in that range or above that threshold.
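The threshold read-out can be sketched as a sum over bin probabilities; the bin count, bin width and flat toy distribution below are illustrative, not the model's actual output.

```python
import numpy as np

# Sketch: probability of precipitation above a rate threshold is the sum
# of the categorical bin probabilities at or above the threshold.
bin_width = 0.2                            # mm/h per bin (assumed)
num_bins = 32                              # toy bin count
probs = np.full(num_bins, 1.0 / num_bins)  # a flat toy distribution

def prob_above(threshold_mm_h):
    first_bin = int(np.ceil(threshold_mm_h / bin_width))
    return probs[first_bin:].sum()
```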

### 4.5 Spatial Downsampler

MetNet aims at fully capturing the spatial context in the input patch. A trade-off arises between the fidelity of the representation and the memory and computation required to compute it. To maintain viable memory and computation requirements, the first part of MetNet contracts the input tensor spatially using a series of convolution and pooling layers. The slices along the time dimension of the input patch are processed separately. Each slice is first packaged into a spatially downsampled input tensor (see Appendix A for the exact pre-processing operations). Each slice is then processed by the following neural network layers: a convolution with 160 channels, a max-pooling layer with stride 2, three more convolutions with 256 channels and one more max-pooling layer with stride 2. These operations contract each slice spatially by a further factor of four and produce tensors with 256 channels.
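The spatial contraction from the two stride-2 pooling layers can be sketched as follows; only the pooling is shown, and the shapes are toy values rather than the paper's.

```python
import numpy as np

# Sketch of the stride-2 max pooling used to contract each time slice.
def max_pool_2x2(x):
    """Stride-2 max pooling over an (H, W, C) tensor with even H and W."""
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

slice_ = np.random.rand(16, 16, 4)
contracted = max_pool_2x2(max_pool_2x2(slice_))  # two pools: factor 4 per axis
```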

### 4.6 Temporal Encoder

The second part of MetNet encodes the input patch along the temporal dimension. The spatially contracted slices are given to a recurrent neural network following the order of time. We use a Convolutional Long Short-Term Memory network with 384 channels for the temporal encoding (xingjian2015convolutional). The recurrent network is able to gauge the temporal dynamics of the input slices and, by following the direction of time, it is able to give more relevance to the patterns in the most recent input slice. The result is a single tensor with 384 channels where each location summarizes spatially and temporally one region of the large context in the input patch.

### 4.7 Spatial Aggregator

To make MetNet’s receptive field cover the full global spatial context in the input patch, the third part of MetNet uses a series of eight axial self-attention blocks (ho2019axial; donahue2019large). Four axial self-attention blocks operating along the width and four blocks operating along the height are interleaved and have 2048 channels and 16 attention heads each. Axial self-attention blocks sidestep the computationally prohibitive quadratic factor of vanilla self-attention blocks (vaswani2017attention) while preserving the benefit of reaching a full receptive field. The global context can be reached in just two axial self-attention blocks compared to the 32 blocks that are needed using standard convolutions. In total this setting for MetNet has 225M parameters.
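A minimal single-head sketch of the axial pattern is below: attention is applied along one spatial axis at a time, so a width pass followed by a height pass reaches the full spatial context while avoiding the quadratic cost in the number of positions. Identity query/key/value projections and toy shapes are simplifying assumptions, not the paper's configuration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend_last_axis(x):
    """Self-attention over the length axis of an (..., L, C) tensor,
    with identity projections for simplicity."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def axial_block(x):
    """x: (H, W, C). One attention pass along W, then one along H."""
    x = attend_last_axis(x)                                        # along W
    x = np.swapaxes(attend_last_axis(np.swapaxes(x, 0, 1)), 0, 1)  # along H
    return x

feats = np.random.rand(8, 8, 16)
out = axial_block(feats)
```

A full block would add learned projections, multiple heads and residual connections; the sketch shows only the axial factorization itself.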

| Configuration | Spatial Context (km) | Temporal Context (min) | Data Sources |
|---|---|---|---|
| MetNet | 1024 | 90 | MRMS, GOES-16 |
| MetNet-ReducedSpatial | 512 | 90 | MRMS, GOES-16 |
| MetNet-ReducedTemporal | 1024 | 30 | MRMS, GOES-16 |
| MetNet-GOESOnly | 1024 | 90 | GOES-16 |

## 5 Discussion of NWMs and NWP

We next discuss some of the properties of MetNet and NWMs in relation to analogous aspects of NWP (bauer2015quiet). MetNet outputs a probability distribution directly, while NWP produces a probabilistic prediction by ensembling a set of deterministic physical simulations with either different initial conditions (Figure 2) or different model parameters. MetNet is constructed from general modules such as convolutions, recurrence and attention that depend on the structure of the task but are largely independent of the underlying domain, whereas NWP relies on explicit phenomena-dependent physical equations; a consequence of this is that MetNet's performance correlates with the availability of data for the task.

The latency of MetNet is independent of the magnitude of the target lead time and all lead times can be predicted at once in parallel. In NWP, on the other hand, the simulation is sequential along time and the latency scales linearly with target lead time. In practice MetNet's latency is on the order of seconds for any of the target lead times (up to 480 minutes), whereas that of NWP is on the order of tens of minutes to hours.

An interesting observation concerns the scalability of the models as currently designed with respect to the underlying resolution. For MetNet a doubling in spatial resolution requires 4 times more computation using the current architectural choices. For NWP doubling the resolution requires approximately 8 times more computation (bauer2015quiet). An important difference here is that the performance of NWP is inherently tied to the spatial resolution of the model, since increased resolution allows more physical phenomena to be directly resolved. In MetNet, on the other hand, the performance is not directly tied to the resolution, since the network can learn to represent any sub-resolution structure in the hidden layers in a distributed way.
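The scaling comparison follows from dimensional counting: doubling the resolution doubles each of the two spatial dimensions for a 2D model like MetNet, while NWP must additionally halve its time step, adding a third factor of two.

```python
# Resolution-doubling cost factors under the dimensional-counting argument.
doubling = 2
metnet_factor = doubling ** 2   # two spatial dimensions -> 4x compute
nwp_factor = doubling ** 3      # two spatial dimensions + time step -> 8x
```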

## 6 Experiments

We design three sets of experiments to evaluate the performance and characteristics of MetNet’s predictions.

### 6.1 Eight Hour Forecasts

In the first set of experiments, we evaluate the performance of MetNet on the precipitation rate forecasting benchmark using the data collected in agrawalmachine that cover the continental US. We compare MetNet with NOAA's current HRRR system, with a strong optical flow method (ayzel2019optical) and with a persistence baseline, using the F1 score on three precipitation rate thresholds: 0.2 mm/h, 1 mm/h and 2 mm/h in Figure 5 (see Appendix C and D for details). HRRR generates forecasts covering the same region as MetNet once an hour for up to 18 hours into the future at a native resolution of 3 km. Since MetNet outputs probabilities, for each threshold we sum the probabilities along the relevant range and calibrate the corresponding F1 score on a separate validation set. MetNet outperforms HRRR substantially on the three thresholds up to a lead time of respectively 400, 440 and the full 480 minutes. MetNet is also substantially better than the optical flow method and the persistence baseline for all lead times. The F1 score degrades for higher precipitation rate thresholds for all methods since these events become increasingly rare. Recent work using neural networks for precipitation forecasting focuses on lead times between 60 and 90 minutes (agrawalmachine; ayzel2019all; lebedev2019precipitation; xingjian2015convolutional; shi2017deep), with optical flow at times outperforming neural networks (ayzel2019all; lebedev2019precipitation). To our knowledge MetNet is the first machine learning model to outperform HRRR and optical flow methods on a richly structured weather benchmark at such a scale and range.
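The evaluation metric can be sketched as follows: fields are binarized at a rate threshold and the F1 score is computed between the predicted and observed masks. The arrays below are toy data, not benchmark values.

```python
import numpy as np

# F1 score between binarized precipitation masks at a rate threshold.
def f1_at_threshold(pred_mm_h, target_mm_h, threshold):
    pred = pred_mm_h >= threshold
    target = target_mm_h >= threshold
    tp = np.sum(pred & target)
    fp = np.sum(pred & ~target)
    fn = np.sum(~pred & target)
    # Returns 0.0 when there are no true positives (F1 is undefined then).
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

pred = np.array([0.0, 0.5, 1.5, 3.0])
obs = np.array([0.0, 0.0, 2.0, 3.0])
score = f1_at_threshold(pred, obs, 0.2)  # tp=2, fp=1, fn=0
```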

### 6.2 MetNet Ablation Experiments

Next we perform ablation experiments to shed light on the importance of capturing spatial and temporal context and the importance of the various data sources in the input. The first ablation experiment reduces the spatial size of the input patch to 512 km. The very first convolutional layer in the spatial downsampling part of MetNet is removed and all else is kept exactly the same. The performance of this configuration, called MetNet-ReducedSpatial, is similar to MetNet up to 150 minutes and then it starts to become progressively worse. This indicates the importance of the large spatial context used as input as well as the ability of MetNet’s architecture to capture information contained in the original receptive field of 1024 km. This contrasts with other neural networks used for 1 hour precipitation forecasting that have a U-Net-style architecture (ronneberger2015u; ayzel2019all; agrawalmachine). The receptive field of these networks at the border of the target patch is limited and likely hurts their performance and suitability for the task.

The second ablation configuration is called MetNet-ReducedTemporal and reduces the temporal context of MetNet's input features from 90 minutes prior to $T_x$ to 30 minutes prior to $T_x$. This does not affect MetNet's performance significantly, indicating that 30 minutes of context suffice to capture the advection in the input patch. In the MetNet-GOESOnly configuration, we evaluate the contribution of the MRMS data and the ability of MetNet to predict precipitation rate from just the globally available GOES-16 data. Despite starting off substantially worse, MetNet-GOESOnly's performance approaches that of the full MetNet configuration with increasing hours of lead time, suggesting that the MRMS data becomes less necessary at longer lead times.

### 6.3 Visualization

In this section we visualize the 1 mm/h precipitation rate predictions of MetNet (full videos: https://tinyurl.com/metnet-videos). Table 2 includes predictions every two hours for up to 8 hours for MetNet, HRRR and MRMS. MetNet's predictions demonstrate substantial precipitation rate increases and decreases across time in a given location. In particular, MetNet is able to capture the creation of a significant region of higher precipitation, as seen towards the middle of the US map for sample 1. With increasing lead time MetNet's predictions become increasingly blurry, which is explained by the increased uncertainty over the exact time and location of precipitation events.

## 7 Conclusion

In this paper we presented MetNet, a neural weather model for precipitation forecasting. MetNet improves upon the current operational NWP system HRRR for up to 8 hours of lead time. Reaching beyond 8 hours will require ever larger input contexts, rigorous engineering and deeper neural networks.

## Acknowledgements

We would like to thank Manoj Kumar, Wendy Shang, Stephen Hoyer, Lak Lakshmanan, Rob Carver, Aaron Bell, John Burge and Cenk Gazen for comments on the manuscript and insightful discussions.

## References

## Appendix A Data

We construct the input and target data by selecting data relative to an anchor time at which we stop collecting input data. The anchor times are defined by the acquisition times of the optical GOES satellite data, which is sampled every 10 to 15 minutes. For the input data we find the first MRMS and GOES frames immediately prior to the desired sampling times relative to the anchor. We ensure that the distributions of the input data are approximately normally distributed and within a bounded range by applying the following data normalization: the GOES data is robustly normalized by subtracting the median and dividing by the inter-quantile range, and the MRMS data is similarly normalized. To avoid any remaining outliers we replace all NaNs with zeros and squash the data to (−1, 1) using the hyperbolic tangent function. As additional static features we include longitude, latitude, elevation and time (represented as three normalized features) as feature maps. To reduce the memory footprint on the model we perform a space-to-depth transformation or downsampling of the input; see Figure 7 for details.

The prediction targets are the MRMS measurements immediately after the desired sampling times relative to the anchor. We only retain samples where all GOES and MRMS frames could be mapped to the desired sampling times with a maximum of 5 minutes of mismatch. Furthermore, the range of each radar in the MRMS data source is limited even for an unblocked field of view, with data quality generally dropping with distance to the radar. To minimize issues with wrongly labelled targets we only assign a loss to pixels contained in the NOAA better quality map (Figure 8). All data are remapped to the NOAA CONUS rectangular grid approximately covering the continental US at a uniform latitude-longitude resolution.

For both MRMS and GOES we acquired data for the period January 2018 through July 2019. We split the data temporally into three non-overlapping data sets by repeatedly using approximately 16 days for training followed by two days for validation and two days for testing. From these temporal splits we randomly extracted 13,717 test and validation samples and kept increasing the training set size until we observed no over-fitting at 1.72 million training samples. For the ablation studies we leave the target unchanged and only crop the spatial or temporal extent of the input data. All models and data-processing were implemented in a combination of TensorFlow (tensorflow2015-whitepaper) and JAX (jax2018github) using up to 256 Google TPU accelerators in parallel.

## Appendix B Numerical Weather Prediction

*Numerical Weather Prediction* is the most successful framework for medium- and long-range (up to 6 days with high confidence) forecasts to date (bauer2015quiet). The core of NWP models is a set of PDEs and other equations which condense our current beliefs about the dynamical behavior of the earth's atmosphere (wicker2013everything). The exact formulation of NWP models varies depending on the use case, target domain and scale, but the model variables include surface-based, airborne and satellite-based measurements, such as temperature, moisture, precipitation, wind fields and pressure (ecmwf2018statement). Besides the validity of the mathematical formulation, the accuracy relies on the quality of the initial conditions, that is, on how faithfully they resemble the current state of the atmosphere. The initial conditions in modern NWP models are estimated through a complex data assimilation process leveraging a wide range of available information, including past and asynoptic observations, to improve initialization (barker2004three). Despite this effort, initial condition errors always exist (NAP6434) and negatively impact model performance during the spin-up period (hwang2015improved), leading to suboptimal short-range forecasts. In addition, solving the PDEs numerically is computationally demanding, with runtimes of more than one hour for regional models such as HRRR.

## Appendix C HRRR Baseline

We primarily compare against the current operational NOAA HRRR version 3 (benjamin2016north) NWP model. HRRR produces forecasts each hour for 1-18 hours into the future at a 3-by-3 kilometer resolution. The HRRR predictions are resampled to the NOAA CONUS grid described in Appendix A. We compare the HRRR precipitation rate forecasts with the nearest MRMS measurement, discarding samples where an MRMS target could not be found within 5 minutes of the HRRR lead time.
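The target-matching rule can be sketched as follows; timestamps are toy values in minutes and the function name is illustrative.

```python
# Match a forecast lead time to the nearest measurement, discarding the
# sample when no measurement lies within the 5 minute tolerance.
def nearest_within(target_time, measurement_times, tolerance=5):
    """Nearest measurement time, or None if none is within tolerance."""
    best = min(measurement_times, key=lambda t: abs(t - target_time))
    return best if abs(best - target_time) <= tolerance else None

mrms_times = [0, 10, 22, 30]
```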

## Appendix D Optical Flow Baseline

To establish a strong optical flow baseline we benchmarked several algorithms for precipitation rate forecasting available in the RainyMotion (ayzel2019optical) and PySteps (pulkkinen2019pysteps) libraries. For all methods we only use the MRMS data to fit the tracking algorithm, but otherwise leave the data unchanged. In Figure 9 (left) we compare three optical flow based methods for precipitation rate forecasting. We found that using a dense inverse search algorithm (kroeger2016fast) for tracking and a backward constant-vector scheme for extrapolation (bowler2004development) performed better than both the classical Lucas-Kanade tracking (lucas1981iterative) coupled with a backward semi-Lagrangian extrapolation scheme and the probabilistic Short-Term Ensemble Prediction System (bowler2006steps; seed2013formulation). Figure 9 (right) shows the effect of reducing the context size for optical flow. Similar to the findings in Figure 6, adding more spatial context is always beneficial, especially for the longest lead times. In summary, we found the algorithm using dense inverse search with constant-vector extrapolation and the largest spatial context to perform best, and we use it for all optical flow results in the main paper.
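The backward constant-vector scheme can be sketched as follows: each forecast pixel samples the last observed field at the location displaced backwards along a motion vector. The constant whole-pixel motion and toy field are simplifying assumptions; in practice the motion field is estimated by an optical flow algorithm such as dense inverse search and interpolation is used for sub-pixel displacements.

```python
import numpy as np

def extrapolate(field, flow_yx, steps):
    """Backward constant-vector extrapolation.
    field: (H, W); flow_yx: (dy, dx) per step, in whole pixels."""
    H, W = field.shape
    dy, dx = flow_yx
    out = np.zeros_like(field)
    for y in range(H):
        for x in range(W):
            sy, sx = y - steps * dy, x - steps * dx  # backward sampling
            if 0 <= sy < H and 0 <= sx < W:
                out[y, x] = field[sy, sx]
    return out

rain = np.zeros((6, 6))
rain[1, 1] = 3.0                                 # one rainy pixel
forecast = extrapolate(rain, (1, 1), steps=2)    # advects the pixel to (3, 3)
```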
