1 Introduction
Precipitation nowcasting refers to the problem of providing very short range (e.g., 0-6 hours) forecasts of the rainfall intensity in a local region based on radar echo maps^{1}, rain gauge and other observation data, as well as Numerical Weather Prediction (NWP) models. It significantly impacts the daily lives of many and plays a vital role in many real-world applications. Among other possibilities, it helps drivers by predicting road conditions, enhances flight safety by providing weather guidance for regional aviation, and avoids casualties by issuing citywide rainfall alerts. In addition to the inherent complexities of the atmosphere and relevant dynamical processes, the ever-growing need for real-time, large-scale, and fine-grained precipitation nowcasting poses extra challenges to the meteorological community and has aroused research interest in the machine learning community [xingjian2015convolutional; sun2014use].
^{1} The radar echo maps are Constant Altitude Plan Position Indicator (CAPPI) images, which can be converted to rainfall intensity maps using the Marshall-Palmer relationship or Z-R relationship [marshall1948distribution].
The conventional approaches to precipitation nowcasting used by existing operational systems rely on optical flow [woo2017operational]. In a modern-day nowcasting system, the convective cloud movements are first estimated from the observed radar echo maps by optical flow and are then used to predict the future radar echo maps using semi-Lagrangian advection. However, these methods are unsupervised from the machine learning point of view in that they do not take advantage of the vast amount of existing radar echo data. Recently, progress has been made by utilizing supervised deep learning [lecun2015deep] techniques for precipitation nowcasting. Shi et al. [xingjian2015convolutional] formulated precipitation nowcasting as a spatiotemporal sequence forecasting problem and proposed the Convolutional Long Short-Term Memory (ConvLSTM) model, which extends the LSTM [hochreiter1997long] by having convolutional structures in both the input-to-state and state-to-state transitions, to solve the problem. Using the radar echo sequences for model training, the authors showed that ConvLSTM is better at capturing the spatiotemporal correlations than the fully-connected LSTM and gives more accurate predictions than the Real-time Optical flow by Variational methods for Echoes of Radar (ROVER) algorithm [woo2017operational] currently used by the Hong Kong Observatory (HKO).
However, despite the pioneering effort in this interesting direction, that paper has some deficiencies. First, the deep learning model is only evaluated on a relatively small dataset containing 97 rainy days, and only the nowcasting skill score at the 0.5mm/h rain-rate threshold is compared. As real-world precipitation nowcasting systems need to pay additional attention to heavier rainfall events such as rainstorms, which pose a greater threat to society, the performance at the 0.5mm/h threshold (indicating raining or not) alone is not sufficient for demonstrating the algorithm's overall performance [woo2017operational]. In fact, as the area of deep learning for precipitation nowcasting is still in its early stage, it is not clear how models should be evaluated to meet the needs of real-world applications. Second, although the convolutional recurrence structure used in ConvLSTM is better than the fully-connected recurrent structure in capturing spatiotemporal correlations, it is not optimal and leaves room for improvement. For motion patterns like rotation and scaling, the local correlation structure of consecutive frames will be different for different spatial locations and timestamps. It is thus inefficient to use convolution, which applies a location-invariant filter, to represent such a location-variant relationship. Previous attempts have tried to solve the problem by revising the output of a recurrent neural network (RNN) from the raw prediction to be some location-variant transformation of the input, like optical flow or dynamic local filter [finn2016unsupervised; de2016dynamic]. However, not much research has been conducted to address the problem by revising the recurrent structure itself.
In this paper, we aim to address these two problems by proposing both a benchmark and a new model for precipitation nowcasting. For the new benchmark, we build the HKO-7 dataset, which contains radar echo data from 2009 to 2015 near Hong Kong. Since the radar echo maps arrive in a stream in the real-world scenario, the nowcasting algorithms can adopt online learning to adapt to the newly emerging patterns dynamically. To take this setting into account, we use two testing protocols in our benchmark: the offline setting, in which the algorithm can only use a fixed window of the previous radar echo maps, and the online setting, in which the algorithm is free to use all the historical data and any online learning algorithm. Another issue for the precipitation nowcasting task is that the proportions of rainfall events at different rain-rate thresholds are highly imbalanced. Heavier rainfall occurs less often but has a higher real-world impact. We thus propose the Balanced Mean Squared Error (B-MSE) and Balanced Mean Absolute Error (B-MAE) measures for training and evaluation, which assign more weight to heavier rainfall in the calculation of MSE and MAE. We empirically find that the balanced variants of the loss functions are more consistent with the overall nowcasting performance at multiple rain-rate thresholds than the original loss functions. Moreover, our experiments show that training with the balanced loss functions is essential for deep learning models to achieve good performance at higher rain-rate thresholds. For the new model, we propose the Trajectory Gated Recurrent Unit (TrajGRU) model, which uses a subnetwork to output the state-to-state connection structures before state transitions. TrajGRU allows the state to be aggregated along some learned trajectories and thus is more flexible than the Convolutional GRU (ConvGRU) [ballas2016delving], whose connection structure is fixed. We show that TrajGRU outperforms ConvGRU, Dynamic Filter Network (DFN) [de2016dynamic] as well as 2D and 3D Convolutional Neural Networks (CNNs) [mathieu2015deep; vondrick2016generating] on both a synthetic MovingMNIST++ dataset and the HKO-7 dataset.
Using the new dataset, testing protocols, training loss and model, we provide an extensive empirical evaluation of seven models, including a simple baseline model which always predicts the last frame, two optical flow based models (ROVER and its nonlinear variant), and four representative deep learning models (TrajGRU, ConvGRU, 2D CNN, and 3D CNN). We also provide a large-scale benchmark for precipitation nowcasting. Our experimental validation shows that (1) all the deep learning models outperform the optical flow based models, (2) TrajGRU attains the best overall performance among all the deep learning models, and (3) after applying online fine-tuning, the models tested in the online setting consistently outperform those in the offline setting. To the best of our knowledge, this is the first comprehensive benchmark of deep learning models for the precipitation nowcasting problem. Besides, since precipitation nowcasting can be viewed as a video prediction problem [ranzato2014video; vondrick2016generating], our work is the first to provide evidence and justification that online learning could potentially be helpful for video prediction in general.
2 Related Work
Deep learning for precipitation nowcasting and video prediction
For the precipitation nowcasting problem, the reflectivity factors in radar echo maps are first transformed to grayscale images before being fed into the prediction algorithm [xingjian2015convolutional]. Thus, precipitation nowcasting can be viewed as a type of video prediction problem with a fixed "camera", which is the weather radar. Therefore, methods proposed for predicting future frames in natural videos are also applicable to precipitation nowcasting and are related to our paper. There are three types of general architecture for video prediction: RNN based models, 2D CNN based models, and 3D CNN based models. Ranzato et al. [ranzato2014video] proposed the first RNN based model for video prediction, which uses a convolutional RNN with a $1 \times 1$ state-state kernel to encode the observed frames. Srivastava et al. [srivastava2015unsupervised] proposed the LSTM encoder-decoder network, which uses one LSTM to encode the input frames and another LSTM to predict multiple frames ahead. The model was generalized in [xingjian2015convolutional] by replacing the fully-connected LSTM with ConvLSTM to capture the spatiotemporal correlations better. Later, Finn et al. [finn2016unsupervised] and De Brabandere et al. [de2016dynamic] extended the model in [xingjian2015convolutional] by making the network predict the transformation of the input frame instead of directly predicting the raw pixels. Villegas et al. [villegas2017decomposing] proposed to use both an RNN that captures the motion and a CNN that captures the content to generate the prediction. Along with RNN based models, 2D and 3D CNN based models were proposed in [mathieu2015deep] and [vondrick2016generating] respectively. Mathieu et al. [mathieu2015deep] treated the frame sequence as multiple channels and applied a 2D CNN to generate the prediction, while Vondrick et al. [vondrick2016generating] treated the frames as the depth dimension and applied a 3D CNN. Both papers show that a Generative Adversarial Network (GAN) [goodfellow2014generative] is helpful for generating sharp predictions.
Structured recurrent connection for spatiotemporal modeling
From a higher-level perspective, precipitation nowcasting and video prediction are intrinsically spatiotemporal sequence forecasting problems in which both the input and output are spatiotemporal sequences [xingjian2015convolutional]. Recently, there has been a trend of replacing the fully-connected structure in the recurrent connections of RNNs with other topologies to enhance the network's ability to model the spatiotemporal relationship. Other than ConvLSTM, which replaces the full connection with convolution and is designed for dense videos, the Social-LSTM [alahi2016social] and the Structural-RNN (S-RNN) [jain2016structural] have been proposed with a similar notion. Social-LSTM defines the topology based on the distance between different people and is designed for human trajectory prediction, while S-RNN defines the structure based on the given spatiotemporal graph. All these models are different from our TrajGRU in that our model actively learns the recurrent connection structure. Liang et al. [liang2017interpretable] have proposed the Structure-evolving LSTM, which also has the ability to learn the connection structure of RNNs. However, their model is designed for the semantic object parsing task and learns how to merge the graph nodes automatically. It is thus different from TrajGRU, which aims at learning the local correlation structure for spatiotemporal data.
Benchmark for video tasks
There exist benchmarks for several video tasks like online object tracking [wu2013online] and video object segmentation [perazzi2016benchmark]. However, there is no benchmark for the precipitation nowcasting problem, which is also a video task but has its own unique properties: the radar echo map is a completely different type of data, and the data is highly imbalanced (as mentioned in Section 1). The large-scale benchmark created as part of this work could help fill the gap.
3 Model
In this section, we present our new model for precipitation nowcasting. We first introduce the general encoding-forecasting structure used in this paper. Then we review the ConvGRU model and present our new TrajGRU model.
3.1 Encoding-forecasting Structure
We adopt a similar formulation of the precipitation nowcasting problem as in [xingjian2015convolutional]. Assume that the radar echo maps form a spatiotemporal sequence $\langle \mathcal{I}_1, \mathcal{I}_2, \ldots \rangle$. At a given timestamp $t$, our model generates the most likely $K$-step predictions, $\hat{\mathcal{I}}_{t+1}, \hat{\mathcal{I}}_{t+2}, \ldots, \hat{\mathcal{I}}_{t+K}$, based on the previous $J$ observations including the current one: $\mathcal{I}_{t-J+1}, \mathcal{I}_{t-J+2}, \ldots, \mathcal{I}_t$. Our encoding-forecasting network first encodes the observations into $n$ layers of RNN states: $\mathcal{H}_t^1, \mathcal{H}_t^2, \ldots, \mathcal{H}_t^n = h(\mathcal{I}_{t-J+1}, \mathcal{I}_{t-J+2}, \ldots, \mathcal{I}_t)$, and then uses another $n$ layers of RNNs to generate the predictions based on these encoded states: $\hat{\mathcal{I}}_{t+1}, \hat{\mathcal{I}}_{t+2}, \ldots, \hat{\mathcal{I}}_{t+K} = g(\mathcal{H}_t^1, \mathcal{H}_t^2, \ldots, \mathcal{H}_t^n)$. Figure 1 illustrates our encoding-forecasting structure for $n = 3$, $J = 2$, $K = 2$. We insert downsampling and upsampling layers between the RNNs, which are implemented by convolution and deconvolution with stride. The reason to reverse the order of the forecasting network is that the high-level states, which have captured the global spatiotemporal representation, could guide the update of the low-level states. Moreover, the low-level states could further influence the prediction. This structure is more reasonable than the previous structure [xingjian2015convolutional], which does not reverse the link of the forecasting network, because we are free to plug in additional RNN layers on top and no skip-connection is required to aggregate the low-level information. One can choose any type of RNN, like ConvGRU or our newly proposed TrajGRU, in this general encoding-forecasting structure as long as its states correspond to tensors.
3.2 Convolutional GRU
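The main formulas of the ConvGRU used in this paper are given as follows:
$$
\begin{aligned}
\mathcal{Z}_t &= \sigma(\mathcal{W}_{xz} * \mathcal{X}_t + \mathcal{W}_{hz} * \mathcal{H}_{t-1}), \\
\mathcal{R}_t &= \sigma(\mathcal{W}_{xr} * \mathcal{X}_t + \mathcal{W}_{hr} * \mathcal{H}_{t-1}), \\
\mathcal{H}'_t &= f(\mathcal{W}_{xh} * \mathcal{X}_t + \mathcal{R}_t \circ (\mathcal{W}_{hh} * \mathcal{H}_{t-1})), \\
\mathcal{H}_t &= (1 - \mathcal{Z}_t) \circ \mathcal{H}'_t + \mathcal{Z}_t \circ \mathcal{H}_{t-1}.
\end{aligned}
\tag{1}
$$
The bias terms are omitted for notational simplicity. '$*$' is the convolution operation and '$\circ$' is the Hadamard product. Here, $\mathcal{H}_t, \mathcal{R}_t, \mathcal{Z}_t, \mathcal{H}'_t \in \mathbb{R}^{C_h \times H \times W}$ are the memory state, reset gate, update gate, and new information, respectively. $\mathcal{X}_t \in \mathbb{R}^{C_i \times H \times W}$ is the input and $f$ is the activation, which is chosen to be leaky ReLU with negative slope equal to 0.2 [maas2013rectifier] throughout the paper. $H, W$ are the height and width of the state and input tensors, and $C_h, C_i$ are the channel sizes of the state and input tensors, respectively. Every time a new input arrives, the reset gate controls whether to clear the previous state and the update gate controls how much of the new information is written to the state.
3.3 Trajectory GRU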
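When used for capturing spatiotemporal correlations, the deficiency of ConvGRU and other ConvRNNs is that the connection structure and weights are fixed for all the locations. The convolution operation basically applies a location-invariant filter to the input. If the inputs are all zero and the reset gates are all one, we could rewrite the computation process of the new information at a specific location $(i, j)$ at timestamp $t$, i.e., $\mathcal{H}'_{t,:,i,j}$, as follows:
$$
\mathcal{H}'_{t,:,i,j} = f\big(\mathbf{W}_{hh}\,\mathrm{concat}(\langle \mathcal{H}_{t-1,:,p,q} \mid (p, q) \in \mathcal{N}^h_{i,j} \rangle)\big) = f\Big(\sum_{l=1}^{|\mathcal{N}^h_{i,j}|} \mathbf{W}^l_{hh}\, \mathcal{H}_{t-1,:,p_{l,i,j},q_{l,i,j}}\Big).
\tag{2}
$$
Here, $\mathcal{N}^h_{i,j}$ is the ordered neighborhood set at location $(i, j)$ defined by the hyperparameters of the state-to-state convolution such as kernel size, dilation and padding [yu2016multi]. $(p_{l,i,j}, q_{l,i,j})$ is the $l$th neighborhood location of position $(i, j)$. The $\mathrm{concat}(\cdot)$ function concatenates the inner vectors in the set and $\mathbf{W}_{hh}$ is the matrix representation of the state-to-state convolution weights.
As the hyperparameter of convolution is fixed, the neighborhood set $\mathcal{N}^h_{i,j}$ stays the same for all locations. However, most motion patterns have different neighborhood sets for different locations. For example, rotation and scaling generate flow fields with different angles pointing to different directions. It would thus be more reasonable to have a location-variant connection structure as
$$
\mathcal{H}'_{t,:,i,j} = f\Big(\sum_{l=1}^{L} \mathbf{W}^l_{hh}\, \mathcal{H}_{t-1,:,p_{l,i,j}(\theta),q_{l,i,j}(\theta)}\Big),
\tag{3}
$$
where $L$ is the total number of local links and $(p_{l,i,j}(\theta), q_{l,i,j}(\theta))$ is the $l$th neighborhood parameterized by $\theta$.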
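Based on this observation, we propose the TrajGRU, which uses the current input and previous state to generate the local neighborhood set for each location at each timestamp. Since the location indices are discrete and non-differentiable, we use a set of continuous optical flows to represent these "indices". The main formulas of TrajGRU are given as follows:
$$
\begin{aligned}
\mathcal{U}_t, \mathcal{V}_t &= \gamma(\mathcal{X}_t, \mathcal{H}_{t-1}), \\
\mathcal{Z}_t &= \sigma\Big(\mathcal{W}_{xz} * \mathcal{X}_t + \sum_{l=1}^{L} \mathcal{W}^l_{hz} * \mathrm{warp}(\mathcal{H}_{t-1}, \mathcal{U}_{t,l}, \mathcal{V}_{t,l})\Big), \\
\mathcal{R}_t &= \sigma\Big(\mathcal{W}_{xr} * \mathcal{X}_t + \sum_{l=1}^{L} \mathcal{W}^l_{hr} * \mathrm{warp}(\mathcal{H}_{t-1}, \mathcal{U}_{t,l}, \mathcal{V}_{t,l})\Big), \\
\mathcal{H}'_t &= f\Big(\mathcal{W}_{xh} * \mathcal{X}_t + \mathcal{R}_t \circ \Big(\sum_{l=1}^{L} \mathcal{W}^l_{hh} * \mathrm{warp}(\mathcal{H}_{t-1}, \mathcal{U}_{t,l}, \mathcal{V}_{t,l})\Big)\Big), \\
\mathcal{H}_t &= (1 - \mathcal{Z}_t) \circ \mathcal{H}'_t + \mathcal{Z}_t \circ \mathcal{H}_{t-1}.
\end{aligned}
\tag{4}
$$
Here, $L$ is the total number of allowed links. $\mathcal{U}_t, \mathcal{V}_t \in \mathbb{R}^{L \times H \times W}$ are the flow fields that store the local connection structure generated by the structure generating network $\gamma$. The $\mathcal{W}^l_{hz}, \mathcal{W}^l_{hr}, \mathcal{W}^l_{hh}$ are the weights for projecting the channels, which are implemented by $1 \times 1$ convolutions. The $\mathrm{warp}(\mathcal{H}_{t-1}, \mathcal{U}_{t,l}, \mathcal{V}_{t,l})$ function selects the positions pointed out by $\mathcal{U}_{t,l}, \mathcal{V}_{t,l}$ from $\mathcal{H}_{t-1}$ via the bilinear sampling kernel [jaderberg2015spatial; ilg2017flownet]. If we denote $\mathcal{M} = \mathrm{warp}(\mathcal{I}, \mathbf{U}, \mathbf{V})$ where $\mathcal{M}, \mathcal{I} \in \mathbb{R}^{C \times H \times W}$ and $\mathbf{U}, \mathbf{V} \in \mathbb{R}^{H \times W}$, we have:
$$
\mathcal{M}_{c,i,j} = \sum_{m=1}^{H} \sum_{n=1}^{W} \mathcal{I}_{c,m,n} \max(0, 1 - |i + \mathbf{V}_{i,j} - m|) \max(0, 1 - |j + \mathbf{U}_{i,j} - n|).
\tag{5}
$$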
The advantage of such a structure is that we could learn the connection topology by learning the parameters of the subnetwork $\gamma$. In our experiments, $\gamma$ takes the concatenation of $\mathcal{X}_t$ and $\mathcal{H}_{t-1}$ as the input and is fixed to be a one-hidden-layer convolutional neural network with $5 \times 5$ kernel size and 32 feature maps. Thus, $\gamma$ has only a small number of parameters and adds nearly no cost to the overall computation. Compared to a ConvGRU with $K \times K$ state-to-state convolution, TrajGRU is able to learn a more efficient connection structure with $L < K^2$. For ConvGRU and TrajGRU, the number of model parameters is dominated by the size of the state-to-state weights, which is $O(L \times C_h^2)$ for TrajGRU and $O(K^2 \times C_h^2)$ for ConvGRU. If $L$ is chosen to be smaller than $K^2$, the number of parameters of TrajGRU can also be smaller than that of ConvGRU, and the TrajGRU model is able to use the parameters more efficiently. An illustration of the recurrent connection structures of ConvGRU and TrajGRU is given in Figure 2. Recently, Jeon & Kim [jeon2017active] have used similar ideas to extend the convolution operations in CNNs. However, their proposed Active Convolution Unit (ACU) focuses on images, where the need for location-variant filters is limited. Our TrajGRU focuses on videos, where location-variant filters are crucial for handling motion patterns like rotations. Moreover, we are revising the structure of the recurrent connection and have tested different numbers of links, while [jeon2017active] fixes the link number to 9.
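To make the warp operation and the gating of TrajGRU concrete, the following NumPy sketch evaluates the bilinear sampling kernel densely and runs a single recurrent step. It is an illustration under stated assumptions, not the implementation used in the experiments: the flow fields are passed in directly rather than produced by the structure generating network, the input-to-state transitions are simplified to 1x1 projections, and the names `warp`, `conv1x1`, `trajgru_step` and the `params` dictionary are hypothetical.

```python
import numpy as np

def warp(I, U, V):
    """Bilinear sampling: M[c, i, j] reads I at row i + V[i, j], column j + U[i, j].

    Samples that fall outside the grid contribute zero.
    """
    C, H, W = I.shape
    rows, cols = np.arange(H), np.arange(W)
    # ky[i, j, m] = max(0, 1 - |i + V[i, j] - m|), shape (H, W, H)
    ky = np.maximum(0.0, 1.0 - np.abs(rows[:, None, None] + V[:, :, None] - rows[None, None, :]))
    # kx[i, j, n] = max(0, 1 - |j + U[i, j] - n|), shape (H, W, W)
    kx = np.maximum(0.0, 1.0 - np.abs(cols[None, :, None] + U[:, :, None] - cols[None, None, :]))
    return np.einsum("cmn,ijm,ijn->cij", I, ky, kx)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def conv1x1(weight, x):
    """1x1 convolution: weight is (C_out, C_in), x is (C_in, H, W)."""
    return np.einsum("oc,chw->ohw", weight, x)

def trajgru_step(X, H_prev, U, V, params):
    """One TrajGRU update with L links; U and V have shape (L, H, W).

    Assumption: U and V are given (the paper generates them with a subnetwork),
    and the input-to-state weights are 1x1 projections for brevity.
    """
    L = U.shape[0]
    warped = [warp(H_prev, U[l], V[l]) for l in range(L)]

    def state_to_state(weights):
        # Sum over links of a 1x1 projection of the warped previous state.
        return sum(conv1x1(weights[l], warped[l]) for l in range(L))

    Z = sigmoid(conv1x1(params["Wxz"], X) + state_to_state(params["Whz"]))
    R = sigmoid(conv1x1(params["Wxr"], X) + state_to_state(params["Whr"]))
    H_new = leaky_relu(conv1x1(params["Wxh"], X) + R * state_to_state(params["Whh"]))
    return (1.0 - Z) * H_new + Z * H_prev

# Toy usage with arbitrary (hypothetical) sizes.
rng = np.random.default_rng(0)
C_i, C_h, H, W, L = 2, 3, 4, 5, 2
X = rng.normal(size=(C_i, H, W))
H_prev = rng.normal(size=(C_h, H, W))
U = rng.normal(size=(L, H, W))
V = rng.normal(size=(L, H, W))
params = {
    "Wxz": rng.normal(size=(C_h, C_i)), "Whz": rng.normal(size=(L, C_h, C_h)),
    "Wxr": rng.normal(size=(C_h, C_i)), "Whr": rng.normal(size=(L, C_h, C_h)),
    "Wxh": rng.normal(size=(C_h, C_i)), "Whh": rng.normal(size=(L, C_h, C_h)),
}
H_t = trajgru_step(X, H_prev, U, V, params)
print(H_t.shape)  # (3, 4, 5)
```

With zero flow the warp returns its input unchanged, so every link reads the same location; non-zero flows let each link read from a different, location-dependent position, which is what distinguishes the learned connection structure from a fixed convolution neighborhood.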
4 Experiments on MovingMNIST++
Before evaluating our model on the more challenging precipitation nowcasting task, we first compare TrajGRU with ConvGRU, DFN and 2D/3D CNNs on a synthetic video prediction dataset to justify its effectiveness.
The previous MovingMNIST dataset [srivastava2015unsupervised; xingjian2015convolutional] only moves the digits with a constant speed and is not suitable for evaluating different models' ability in capturing more complicated motion patterns. We thus design the MovingMNIST++ dataset by extending MovingMNIST to allow random rotations, scale changes, and illumination changes. Each frame is of size $64 \times 64$ and contains three moving digits. We use 10 frames as input to predict the next 10 frames. As the frames have illumination changes, we use MSE instead of cross-entropy for training and evaluation^{2}. We train all models using the Adam optimizer [kingma2014adam] with learning rate equal to $10^{-4}$ and momentum equal to 0.5. For the RNN models, we use the encoding-forecasting structure introduced previously with three RNN layers. All RNNs are either ConvGRU or TrajGRU and all use the same set of hyperparameters. For TrajGRU, we initialize the weight of the output layer of the structure generating network to zero. The strides of the middle downsampling and upsampling layers are chosen to be 2. The numbers of filters for the three RNNs are 64, 96, 96 respectively. For the DFN model, we replace the output layer of ConvGRU with a local filter and transform the previous frame to get the prediction. For the RNN models, we train them for 200,000 iterations with norm clipping threshold equal to 10 and batch size equal to 4. For the CNN models, we train them for 100,000 iterations with norm clipping threshold equal to 50 and batch size equal to 32. The detailed experimental configuration of the models for the MovingMNIST++ experiment can be found in the appendix. We have also tried to use conditional GAN for the 2D and 3D models but have failed to get reasonable results.
^{2} The MSE for the MovingMNIST++ experiment is averaged by both the frame size and the length of the predicted sequence.
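Table 1: Comparison of TrajGRU and the baseline models on the MovingMNIST++ dataset.
|  | Conv-K3-D2 | Conv-K5-D1 | Conv-K7-D1 | Traj-L5 | Traj-L9 | Traj-L13 | TrajGRU-L17 | DFN | Conv2D | Conv3D |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| #Parameters | 2.84M | 4.77M | 8.01M | 2.60M | 3.42M | 4.00M | 4.77M | 4.83M | 29.06M | 32.52M |
| Test MSE | 1.495 | 1.310 | 1.254 | 1.351 | 1.247 | 1.170 | 1.138 | 1.461 | 1.681 | 1.637 |
| Standard Deviation | 0.003 | 0.004 | 0.006 | 0.020 | 0.015 | 0.022 | 0.019 | 0.002 | 0.001 | 0.002 |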
Table 1 gives the results of different models on the same test set that contains 10,000 sequences. We train all models using three different seeds to report the standard deviation. We find that TrajGRU with only 5 links outperforms ConvGRU with state-to-state kernel size $3 \times 3$ and dilation $2 \times 2$ (9 links). Also, the performance of TrajGRU improves as the number of links increases. TrajGRU with $L = 13$ outperforms ConvGRU with a $7 \times 7$ state-to-state kernel and yet has fewer parameters. Another observation from the table is that DFN does not perform well on this synthetic dataset. This is because DFN uses softmax to enhance the sparsity of the learned local filters, which fails to model illumination change because the maximum value always gets smaller after convolving with a positive kernel whose weights sum up to 1. For DFN, when the pixel values get smaller, it is impossible for them to increase again. Figure 3 visualizes the learned structures of TrajGRU. We can see that the network has learned reasonable local link patterns.
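The limitation of softmax-normalized filters noted above can be checked numerically. The sketch below is an illustration only: it applies one shared softmax-normalized kernel to a random non-negative frame, whereas a dynamic filter network predicts a separate filter per location, and the helper names `softmax` and `filter_same` are hypothetical. Because every output pixel is a convex combination of input pixels (and zero padding), the peak intensity can never grow.

```python
import numpy as np

def softmax(x):
    """Softmax over all entries, giving positive weights that sum to 1."""
    e = np.exp(x - x.max())
    return e / e.sum()

def filter_same(img, kernel):
    """Apply a k x k filter with zero padding so the output keeps the input size."""
    k = kernel.shape[0]
    pad = k // 2
    H, W = img.shape
    padded = np.pad(img, pad)  # constant zero padding
    out = np.zeros((H, W))
    for di in range(k):
        for dj in range(k):
            out += kernel[di, dj] * padded[di:di + H, dj:dj + W]
    return out

rng = np.random.default_rng(0)
frame = rng.uniform(0.0, 1.0, size=(16, 16))   # non-negative "pixel" intensities
kernel = softmax(rng.normal(size=(5, 5)))      # positive weights summing to 1

peaks = [frame.max()]
for _ in range(3):
    frame = filter_same(frame, kernel)
    peaks.append(frame.max())

# The sequence of peak values is non-increasing.
print(all(b <= a for a, b in zip(peaks, peaks[1:])))  # True
```

A brightening illumination change requires the peak to increase from one frame to the next, which such a filter cannot produce from the previous frame alone.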
5 Benchmark for Precipitation Nowcasting
5.1 HKO-7 Dataset
The HKO-7 dataset used in the benchmark contains radar echo data from 2009 to 2015 collected by HKO. The radar CAPPI reflectivity images, which have a resolution of $480 \times 480$ pixels, are taken from an altitude of 2km and cover a $512\text{km} \times 512\text{km}$ area centered in Hong Kong. The data are recorded every 6 minutes and hence there are 240 frames per day. The raw logarithmic radar reflectivity factors are linearly transformed to pixel values via $\text{pixel} = \lfloor 255 \times \frac{\text{dBZ} + 10}{70} + 0.5 \rfloor$ and are clipped to be between 0 and 255. The raw radar echo images generated by Doppler weather radar are noisy due to factors like ground clutter, sea clutter, anomalous propagation and electromagnetic interference [lee2017ensemble]. To alleviate the impact of noise in training and evaluation, we filter the noisy pixels in the dataset and generate the noise masks by a two-stage process described in the appendix.
As rainfall events occur sparsely, we select the rainy days based on the rain barrel information to form our final dataset, which has 812 days for training, 50 days for validation and 131 days for testing. Our current treatment is close to the real-life scenario, as we are able to train an additional model that classifies whether or not it will rain on the next day and apply our precipitation nowcasting model if this coarser-level model predicts that it will be rainy. The radar reflectivity values are converted to rainfall intensity values (mm/h) using the Z-R relationship: $\text{dBZ} = 10 \log_{10} a + 10 b \log_{10} R$, where $R$ is the rain-rate level, $a = 58.53$ and $b = 1.56$. The overall statistics and the average monthly rainfall distribution of the HKO-7 dataset are given in the appendix.
5.2 Evaluation Methodology
As the radar echo maps arrive in a stream, nowcasting algorithms can apply online learning to adapt to the newly emerging spatiotemporal patterns. We propose two settings in our evaluation protocol: (1) the offline setting in which the algorithm always receives 5 frames as input and predicts 20 frames ahead, and (2) the online setting in which the algorithm receives segments of length 5 sequentially and predicts 20 frames ahead for each new segment received. The evaluation protocol is described more systematically in the appendix. The testing environment guarantees that the same set of sequences is tested in both the offline and online settings for fair comparison.
For both settings, we evaluate the skill scores for multiple thresholds that correspond to different rainfall levels to give an all-round evaluation of the algorithms' nowcasting performance. Table 2 shows the distribution of different rainfall levels in our dataset. We choose to use the thresholds 0.5, 2, 5, 10, 30 to calculate the Critical Success Index (CSI) and Heidke Skill Score (HSS) [hogan2010equitability]. For calculating the skill score at a specific threshold $\tau$, which is 0.5, 2, 5, 10 or 30, we first convert the pixel values in prediction and ground-truth to 0/1 by thresholding with $\tau$. We then calculate the TP (prediction=1, truth=1), FN (prediction=0, truth=1), FP (prediction=1, truth=0), and TN (prediction=0, truth=0). The CSI score is calculated as $\frac{\text{TP}}{\text{TP} + \text{FN} + \text{FP}}$ and the HSS score is calculated as $\frac{2(\text{TP} \times \text{TN} - \text{FN} \times \text{FP})}{(\text{TP} + \text{FN})(\text{FN} + \text{TN}) + (\text{TP} + \text{FP})(\text{FP} + \text{TN})}$. During the computation, the masked points are ignored.
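The thresholding procedure above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the benchmark's evaluation code: the function names are hypothetical, the mask is assumed to be a boolean array in which True marks a valid pixel, and the pixel and Z-R conversions use the assumed constants a = 58.53 and b = 1.56 together with the linear pixel mapping over the range -10 to 60 dBZ.

```python
import numpy as np

# Assumed Z-R constants: dBZ = 10*log10(A) + 10*B*log10(R)
A, B = 58.53, 1.56

def pixel_to_rain_rate(pixel):
    """Invert the linear pixel mapping to dBZ, then the Z-R relationship to mm/h."""
    dbz = np.asarray(pixel, dtype=np.float64) * 70.0 / 255.0 - 10.0
    return 10.0 ** ((dbz - 10.0 * np.log10(A)) / (10.0 * B))

def rain_rate_to_pixel(rain_rate):
    """Forward mapping: mm/h to dBZ to a pixel value clipped to [0, 255]."""
    dbz = 10.0 * np.log10(A) + 10.0 * B * np.log10(rain_rate)
    return np.clip(np.floor(255.0 * (dbz + 10.0) / 70.0 + 0.5), 0, 255)

def skill_scores(pred, truth, mask, threshold):
    """CSI and HSS at one rain-rate threshold; pixels where mask is False are ignored."""
    p = pred >= threshold
    t = truth >= threshold
    tp = float(np.sum(p & t & mask))
    fn = float(np.sum(~p & t & mask))
    fp = float(np.sum(p & ~t & mask))
    tn = float(np.sum(~p & ~t & mask))
    csi_den = tp + fn + fp
    hss_den = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    csi = tp / csi_den if csi_den > 0 else float("nan")
    hss = 2.0 * (tp * tn - fn * fp) / hss_den if hss_den > 0 else float("nan")
    return csi, hss

# Toy example in mm/h: one hit, one miss, one false alarm, one correct negative.
pred = np.array([[1.0, 0.0], [3.0, 0.2]])
truth = np.array([[1.0, 1.0], [0.0, 0.1]])
mask = np.ones_like(pred, dtype=bool)
csi, hss = skill_scores(pred, truth, mask, threshold=0.5)
print(csi, hss)  # 0.3333333333333333 0.0
```

Returning NaN when a denominator is zero is a design choice of this sketch; it keeps frames without any rainfall at a given threshold from silently counting as perfect or failed forecasts.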
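As shown in Table 2, the frequencies of different rainfall levels are highly imbalanced. We propose to use the weighted loss function to help solve this problem. Specifically, we assign a weight $w(x)$ to each pixel according to its rainfall intensity $x$:
$$
w(x) = \begin{cases} 1, & x < 2 \\ 2, & 2 \le x < 5 \\ 5, & 5 \le x < 10 \\ 10, & 10 \le x < 30 \\ 30, & x \ge 30 \end{cases}
$$
Also, the masked pixels have weight 0. The resulting B-MSE and B-MAE scores are computed as $\text{B-MSE} = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{480} \sum_{j=1}^{480} w_{n,i,j} (x_{n,i,j} - \hat{x}_{n,i,j})^2$ and $\text{B-MAE} = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{480} \sum_{j=1}^{480} w_{n,i,j} |x_{n,i,j} - \hat{x}_{n,i,j}|$, where $N$ is the total number of frames and $w_{n,i,j}$ is the weight corresponding to the $(i, j)$th pixel in the $n$th frame. For the conventional MSE and MAE measures, we simply set all the weights to 1 except the masked points.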
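Table 2: Rain rate statistics in the HKO-7 benchmark.
| Rain Rate (mm/h) | Proportion (%) | Rainfall Level |
| --- | --- | --- |
| $0 \le x < 0.5$ | 90.25 | No / Hardly noticeable |
| $0.5 \le x < 2$ | 4.38 | Light |
| $2 \le x < 5$ | 2.46 | Light to moderate |
| $5 \le x < 10$ | 1.35 | Moderate |
| $10 \le x < 30$ | 1.14 | Moderate to heavy |
| $30 \le x$ | 0.42 | Rainstorm warning |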
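The balanced losses can be written down directly from the weighting scheme, with the weight growing with the ground-truth rain rate at the breakpoints 2, 5, 10 and 30 mm/h. The sketch below is a minimal NumPy illustration under stated assumptions: the function names are hypothetical, the mask is a boolean array with True for valid pixels, the weights are derived from the ground-truth rain rate, and the errors are taken on whatever scale `pred` and `truth` are supplied in.

```python
import numpy as np

def balancing_weights(rain_rate, mask):
    """Per-pixel weight: 1 below 2 mm/h, then 2, 5, 10, 30 at the listed breakpoints."""
    w = np.ones_like(rain_rate, dtype=np.float64)
    for breakpoint, weight in [(2.0, 2.0), (5.0, 5.0), (10.0, 10.0), (30.0, 30.0)]:
        w[rain_rate >= breakpoint] = weight
    w[~mask] = 0.0  # masked (noisy) pixels do not contribute
    return w

def balanced_errors(pred, truth, weights):
    """B-MSE and B-MAE for arrays of shape (N, H, W): summed over pixels, averaged over frames."""
    n_frames = pred.shape[0]
    diff = truth - pred
    bmse = np.sum(weights * diff ** 2) / n_frames
    bmae = np.sum(weights * np.abs(diff)) / n_frames
    return bmse, bmae

# Toy example: one 2x2 frame, values in mm/h.
truth = np.array([[[0.0, 3.0], [12.0, 40.0]]])
pred = np.array([[[0.0, 2.0], [10.0, 30.0]]])
mask = np.ones_like(truth, dtype=bool)
w = balancing_weights(truth, mask)
bmse, bmae = balanced_errors(pred, truth, w)
print(w.ravel(), bmse, bmae)  # [ 1.  2. 10. 30.] 3042.0 322.0
```

In the toy frame the single heavy-rain pixel (error 10, weight 30) contributes 3000 of the 3042 to B-MSE, which shows how the weighting shifts the objective toward the rare heavy-rainfall pixels.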
5.3 Evaluated Algorithms
We have evaluated seven nowcasting algorithms, including the simplest model which always predicts the last frame, two optical flow based methods (ROVER and its nonlinear variant), and four deep learning methods (TrajGRU, ConvGRU, 2D CNN, and 3D CNN). Specifically, we have evaluated the performance of deep learning models in the online setting by fine-tuning the algorithms using AdaGrad [duchi2011adaptive] with learning rate equal to $10^{-4}$. We optimize the sum of B-MSE and B-MAE during offline training and online fine-tuning. During the offline training process, all models are optimized by the Adam optimizer with learning rate equal to $10^{-4}$ and momentum equal to 0.5, and we train these models with early stopping on the sum of B-MSE and B-MAE. For RNN models, the training batch size is set to 4. For the CNN models, the training batch size is set to 8. For TrajGRU and ConvGRU models, we use a 3-layer encoding-forecasting structure with the number of filters for the RNNs set to 64, 192, 192. We use kernel size equal to $3 \times 3$, $3 \times 3$, $3 \times 3$ for the ConvGRU models while the number of links is set to 13, 13, 9 for the TrajGRU model. We also train the ConvGRU model with the original MSE and MAE loss, which is named "ConvGRU-nobal", to evaluate the improvement by training with the B-MSE and B-MAE loss. The other model configurations including ROVER, ROVER-nonlinear and deep models are included in the appendix.
5.4 Evaluation Results
The overall evaluation results are summarized in Table 3. In order to analyze the confidence interval of the results, we train 2D CNN, 3D CNN, ConvGRU and TrajGRU models using three different random seeds and report the standard deviation in Table 4. We find that training with balanced loss functions is essential for good nowcasting performance on heavier rainfall. The ConvGRU model that is trained without balanced loss, which best represents the model in [xingjian2015convolutional], has worse nowcasting scores than the optical flow based methods at the 10mm/h and 30mm/h thresholds. Also, we find that all the deep learning models that are trained with the balanced loss outperform the optical flow based models. Among the deep learning models, TrajGRU performs the best and 3D CNN outperforms 2D CNN, which shows that an appropriate network structure is crucial to achieving good performance. The improvement of TrajGRU over the other models is statistically significant because the differences in B-MSE and B-MAE are larger than three times their standard deviation. Moreover, the performance with online fine-tuning enabled is consistently better than that without online fine-tuning, which verifies the effectiveness of online learning at least for this task.
Based on the evaluation results, we also compute the Kendall's $\tau$ coefficients [kendall1938new] between the MSE, MAE, B-MSE, B-MAE and the CSI, HSS at different thresholds. As shown in Table 5, B-MSE and B-MAE have stronger correlation with the CSI and HSS in most cases.
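Table 3: HKO-7 benchmark results. Each threshold $r \ge \tau$ refers to a rain rate of at least $\tau$ mm/h.
| Algorithms | CSI $r \ge 0.5$ | CSI $r \ge 2$ | CSI $r \ge 5$ | CSI $r \ge 10$ | CSI $r \ge 30$ | HSS $r \ge 0.5$ | HSS $r \ge 2$ | HSS $r \ge 5$ | HSS $r \ge 10$ | HSS $r \ge 30$ | B-MSE | B-MAE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Offline Setting | | | | | | | | | | | | |
| Last Frame | 0.4022 | 0.3266 | 0.2401 | 0.1574 | 0.0692 | 0.5207 | 0.4531 | 0.3582 | 0.2512 | 0.1193 | 15274 | 28042 |
| ROVER + Linear | 0.4762 | 0.4089 | 0.3151 | 0.2146 | 0.1067 | 0.6038 | 0.5473 | 0.4516 | 0.3301 | 0.1762 | 11651 | 23437 |
| ROVER + Nonlinear | 0.4655 | 0.4074 | 0.3226 | 0.2164 | 0.0951 | 0.5896 | 0.5436 | 0.4590 | 0.3318 | 0.1576 | 10945 | 22857 |
| 2D CNN | 0.5095 | 0.4396 | 0.3406 | 0.2392 | 0.1093 | 0.6366 | 0.5809 | 0.4851 | 0.3690 | 0.1885 | 7332 | 18091 |
| 3D CNN | 0.5109 | 0.4411 | 0.3415 | 0.2424 | 0.1185 | 0.6334 | 0.5825 | 0.4862 | 0.3734 | 0.2034 | 7202 | 17593 |
| ConvGRU-nobal | 0.5476 | 0.4661 | 0.3526 | 0.2138 | 0.0712 | 0.6756 | 0.6094 | 0.4981 | 0.3286 | 0.1160 | 9087 | 19642 |
| ConvGRU | 0.5489 | 0.4731 | 0.3720 | 0.2789 | 0.1776 | 0.6701 | 0.6104 | 0.5163 | 0.4159 | 0.2893 | 5951 | 15000 |
| TrajGRU | 0.5528 | 0.4759 | 0.3751 | 0.2835 | 0.1856 | 0.6731 | 0.6126 | 0.5192 | 0.4207 | 0.2996 | 5816 | 14675 |
| Online Setting | | | | | | | | | | | | |
| 2D CNN | 0.5112 | 0.4363 | 0.3364 | 0.2435 | 0.1263 | 0.6365 | 0.5756 | 0.4790 | 0.3744 | 0.2162 | 6654 | 17071 |
| 3D CNN | 0.5106 | 0.4344 | 0.3345 | 0.2427 | 0.1299 | 0.6355 | 0.5736 | 0.4766 | 0.3733 | 0.2220 | 6690 | 16903 |
| ConvGRU | 0.5511 | 0.4737 | 0.3742 | 0.2843 | 0.1837 | 0.6712 | 0.6105 | 0.5183 | 0.4226 | 0.2981 | 5724 | 14772 |
| TrajGRU | 0.5563 | 0.4798 | 0.3808 | 0.2914 | 0.1933 | 0.6760 | 0.6164 | 0.5253 | 0.4308 | 0.3111 | 5589 | 14465 |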
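Table 4: Standard deviations of the deep learning models over three random seeds on the HKO-7 benchmark.
| Algorithms | CSI $r \ge 0.5$ | CSI $r \ge 2$ | CSI $r \ge 5$ | CSI $r \ge 10$ | CSI $r \ge 30$ | HSS $r \ge 0.5$ | HSS $r \ge 2$ | HSS $r \ge 5$ | HSS $r \ge 10$ | HSS $r \ge 30$ | B-MSE | B-MAE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Offline Setting | | | | | | | | | | | | |
| 2D CNN | 0.0032 | 0.0023 | 0.0015 | 0.0001 | 0.0025 | 0.0032 | 0.0025 | 0.0018 | 0.0003 | 0.0043 | 90 | 95 |
| 3D CNN | 0.0043 | 0.0027 | 0.0016 | 0.0024 | 0.0024 | 0.0042 | 0.0028 | 0.0018 | 0.0031 | 0.0041 | 44 | 26 |
| ConvGRU | 0.0022 | 0.0018 | 0.0031 | 0.0008 | 0.0022 | 0.0022 | 0.0021 | 0.0040 | 0.0010 | 0.0038 | 52 | 81 |
| TrajGRU | 0.0020 | 0.0024 | 0.0025 | 0.0031 | 0.0031 | 0.0019 | 0.0024 | 0.0028 | 0.0039 | 0.0045 | 18 | 32 |
| Online Setting | | | | | | | | | | | | |
| 2D CNN | 0.0002 | 0.0005 | 0.0002 | 0.0002 | 0.0012 | 0.0002 | 0.0005 | 0.0002 | 0.0003 | 0.0019 | 12 | 12 |
| 3D CNN | 0.0004 | 0.0003 | 0.0002 | 0.0003 | 0.0008 | 0.0004 | 0.0004 | 0.0003 | 0.0004 | 0.0001 | 23 | 27 |
| ConvGRU | 0.0006 | 0.0012 | 0.0017 | 0.0019 | 0.0024 | 0.0006 | 0.0012 | 0.0019 | 0.0023 | 0.0031 | 30 | 69 |
| TrajGRU | 0.0008 | 0.0004 | 0.0002 | 0.0002 | 0.0002 | 0.0007 | 0.0004 | 0.0002 | 0.0002 | 0.0003 | 10 | 20 |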
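Table 5: Kendall's $\tau$ coefficients between the loss measures and the skill scores.
| Skill Scores | CSI $r \ge 0.5$ | CSI $r \ge 2$ | CSI $r \ge 5$ | CSI $r \ge 10$ | CSI $r \ge 30$ | HSS $r \ge 0.5$ | HSS $r \ge 2$ | HSS $r \ge 5$ | HSS $r \ge 10$ | HSS $r \ge 30$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MSE | 0.24 | 0.39 | 0.39 | 0.07 | 0.01 | 0.33 | 0.42 | 0.39 | 0.06 | 0.01 |
| MAE | 0.41 | 0.57 | 0.55 | 0.25 | 0.27 | 0.50 | 0.60 | 0.55 | 0.24 | 0.26 |
| B-MSE | 0.70 | 0.57 | 0.61 | 0.86 | 0.84 | 0.62 | 0.55 | 0.61 | 0.86 | 0.84 |
| B-MAE | 0.74 | 0.59 | 0.58 | 0.82 | 0.92 | 0.67 | 0.57 | 0.59 | 0.83 | 0.92 |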
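As an illustration of how a rank correlation between a loss and a skill score can be computed, the sketch below implements Kendall's tau (the tau-a variant, without tie correction) in plain Python and applies it to the offline B-MSE and CSI at the 30 mm/h threshold for the eight algorithms in Table 3. Two points are assumptions of this sketch rather than the procedure behind Table 5: the loss is negated so that a positive coefficient means "lower loss goes with higher skill", and only the offline rows are used, so the value need not match the table.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Offline rows of Table 3, in order: Last Frame, ROVER + Linear, ROVER + Nonlinear,
# 2D CNN, 3D CNN, ConvGRU-nobal, ConvGRU, TrajGRU.
bmse = [15274, 11651, 10945, 7332, 7202, 9087, 5951, 5816]
csi_30 = [0.0692, 0.1067, 0.0951, 0.1093, 0.1185, 0.0712, 0.1776, 0.1856]

# Negate the loss so that agreement between "lower loss" and "higher skill" is positive.
tau = kendall_tau([-v for v in bmse], csi_30)
print(f"{tau:.2f}")  # 0.79
```

Of the 28 algorithm pairs, 25 are ranked the same way by B-MSE and by CSI at 30 mm/h; the three disagreements all involve ROVER + Linear, ROVER + Nonlinear and ConvGRU-nobal.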
6 Conclusion and Future Work
In this paper, we have provided the first large-scale benchmark for precipitation nowcasting and have proposed a new TrajGRU model with the ability to learn the recurrent connection structure. We have shown that TrajGRU is more efficient in capturing the spatiotemporal correlations than ConvGRU. For future work, we plan to test if TrajGRU helps improve other spatiotemporal learning tasks like visual object tracking and video segmentation. We will also try to build an operational nowcasting system using the proposed algorithm.
Acknowledgments
This research has been supported by General Research Fund 16207316 from the Research Grants Council and Innovation and Technology Fund ITS/205/15FP from the Innovation and Technology Commission in Hong Kong. The first author has also been supported by the Hong Kong PhD Fellowship.
References
 [1] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In CVPR, 2016.
 [2] Nicolas Ballas, Li Yao, Chris Pal, and Aaron Courville. Delving deeper into convolutional networks for learning video representations. In ICLR, 2016.
 [3] Bert De Brabandere, Xu Jia, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In NIPS, 2016.
 [4] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
 [5] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016.
 [6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.

 [7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
 [8] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 [9] Robin J Hogan, Christopher AT Ferro, Ian T Jolliffe, and David B Stephenson. Equitability revisited: Why the “equitable threat score” is not equitable. Weather and Forecasting, 25(2):710–726, 2010.
 [10] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, 2017.
 [11] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 [12] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In NIPS, 2015.
 [13] Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. Structural-RNN: Deep learning on spatio-temporal graphs. In CVPR, 2016.
 [14] Yunho Jeon and Junmo Kim. Active convolution: Learning the shape of convolution for image classification. In CVPR, 2017.
 [15] Maurice G Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.
 [16] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [17] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
 [18] Hansoo Lee and Sungshin Kim. Ensemble classification for anomalous propagation echo detection with clustering-based subset-selection method. Atmosphere, 8(1):11, 2017.
 [19] Xiaodan Liang, Liang Lin, Xiaohui Shen, Jiashi Feng, Shuicheng Yan, and Eric P Xing. Interpretable structure-evolving LSTM. In CVPR, 2017.
 [20] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.
 [21] John S Marshall and W Mc K Palmer. The distribution of raindrops with size. Journal of Meteorology, 5(4):165–166, 1948.
 [22] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
 [23] Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, 2016.
 [24] Marc'Aurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
 [25] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, 2015.
 [26] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
 [27] Juanzhen Sun, Ming Xue, James W Wilson, Isztar Zawadzki, Sue P Ballard, Jeanette Onvlee-Hooimeyer, Paul Joe, Dale M Barker, Ping-Wah Li, Brian Golding, et al. Use of NWP for nowcasting convective precipitation: Recent progress and challenges. Bulletin of the American Meteorological Society, 95(3):409–426, 2014.
 [28] Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. In ICLR, 2017.
 [29] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In NIPS, 2016.
 [30] Wang-chun Woo and Wai-kin Wong. Operational application of optical flow techniques to radar-based rainfall nowcasting. Atmosphere, 8(3):48, 2017.
 [31] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Online object tracking: A benchmark. In CVPR, 2013.
 [32] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
Appendix A Weight Initialization
The weights and biases of all models are initialized with the MSRA initializer [7] except that the weights and biases of the structure generating network in TrajGRUs are initialized to be zero.
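As a rough illustration of this initialization scheme, the following NumPy sketch draws convolution weights from the fan-in Gaussian form of the MSRA initializer described in [7] and zeroes every parameter of the structure generating network. The parameter names, the `structure_gen` prefix, and the choice to set 1-D bias vectors to zero are assumptions made for this example; the exact variant used by a given framework may differ.

```python
import numpy as np

def msra_normal(shape, rng):
    """Fan-in Gaussian MSRA/He init for a weight of shape (out_ch, in_ch, kh, kw)."""
    fan_in = int(np.prod(shape[1:]))      # in_ch * kh * kw
    std = np.sqrt(2.0 / fan_in)           # variance 2 / fan_in, as in [7]
    return rng.normal(0.0, std, size=shape)

def init_params(shapes, zero_prefix="structure_gen", seed=0):
    """Initialize a dict of parameters given a dict of name -> shape.

    Parameters whose name starts with `zero_prefix` (hypothetical naming
    convention for the structure generating network) are set to zero.
    1-D parameters (biases) are also set to zero, which is an assumption.
    """
    rng = np.random.default_rng(seed)
    params = {}
    for name, shape in shapes.items():
        if name.startswith(zero_prefix) or len(shape) == 1:
            params[name] = np.zeros(shape)
        else:
            params[name] = msra_normal(shape, rng)
    return params
```

With a zero-initialized structure generating network, its output is zero regardless of the input at the start of training, so the learned connection structure starts from an input-independent configuration.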
Appendix B Structure Generating Network in TrajGRU
The structure generating network takes the concatenation of the state tensor and the input tensor as the input. We fix the network to have two convolution layers. The first convolution layer uses kernel size, padding size, 32 filters and uses the leaky ReLU activation. The second convolution layer uses kernel size, padding and filters where is the number of links.
Appendix C Details about the MovingMNIST++ Experiment
C.1 Generation Process
For each sequence, we choose three digits randomly from the MNIST dataset (http://yann.lecun.com/exdb/mnist/). Each digit will move, rotate, and scale up or down at a randomly sampled speed. Also, we multiply the pixel values by an illumination factor at every time step so that the digits have time-varying appearances. The hyperparameters of the generation process are given in Table 6. In our experiment, we always generate a length-20 sequence and use the first 10 frames to predict the last 10 frames.
Hyperparameter  Value 

Number of digits  3 
Frame size  
Velocity  
Scaling factor  
Rotation angle  
Illumination factor 
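To make the generation process more concrete, the sketch below samples per-frame motion parameters for the three digits of one sequence. It does not render images, ignores boundary handling, and every numeric range in it (frame size, speed, scaling rate, rotation rate, illumination range) is a placeholder assumption rather than a value from Table 6. The function names are hypothetical.

```python
import numpy as np

def sample_digit_trajectory(rng, seq_len=20, frame_size=64, max_speed=3.0,
                            scale_rate=(0.95, 1.05), max_rot_deg=10.0,
                            illum=(0.8, 1.2)):
    """Sample per-frame position, scale, angle and illumination for one digit.

    All default ranges are placeholders, not the paper's hyperparameters.
    """
    start = rng.uniform(0, frame_size, size=2)           # initial position
    velocity = rng.uniform(-max_speed, max_speed, size=2)
    rate = rng.uniform(*scale_rate)                      # per-frame scaling
    omega = rng.uniform(-max_rot_deg, max_rot_deg)       # degrees per frame
    t = np.arange(seq_len)
    return {
        "position": start + t[:, None] * velocity,       # shape (seq_len, 2)
        "scale": rate ** t,                              # shape (seq_len,)
        "angle": omega * t,                              # shape (seq_len,)
        # A fresh illumination factor is drawn for every frame.
        "illumination": rng.uniform(*illum, size=seq_len),
    }

def sample_sequence_params(num_digits=3, seq_len=20, seed=0):
    """Sample trajectories for all digits in one sequence."""
    rng = np.random.default_rng(seed)
    return [sample_digit_trajectory(rng, seq_len) for _ in range(num_digits)]

def split_input_target(frames, in_len=10):
    """Use the first `in_len` frames as input and the rest as target."""
    return frames[:in_len], frames[in_len:]
```

A renderer would apply each frame's affine parameters to the digit image and multiply the result by that frame's illumination factor; `split_input_target` then yields the 10 input frames and 10 target frames.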
C.2 Network Structures
Appendix D Details about the HKO7 Benchmark
D.1 Overall Data Statistics
D.2 Denoising Process
We first remove the ground clutter and sun spikes, which appear at fixed positions, by detecting the outlier locations in the image. For each in-boundary location in the frame, we use the ratio of its pixel value equal to as the feature and estimate these features’ sample mean and covariance matrix. We then calculate the Mahalanobis distance of these features using the estimated mean and covariance (the implementation uses the Moore-Penrose pseudoinverse). Locations whose Mahalanobis distance is higher than the mean distance plus three times the standard deviation are classified as outliers. After outlier detection, the locations in the image are divided into 177316 inliers, 2824 outliers and 50260 out-of-boundary points. The outlier detection process is illustrated in Figure 6. After outlier detection, we further remove other types of noise, like sea clutter, by filtering out the pixels with value smaller than 71 and larger than 0. Two examples that compare the original radar echo sequence and the denoised sequence are included in the attached “denoising” folder.
D.3 Evaluation Protocol
We illustrate our evaluation protocol in Algorithm 1. The evaluation type can be either ‘offline’ or ‘online’. In the online setting, the model is able to store the previously seen sequences in a buffer and fine-tune its parameters using training batches sampled from the buffer. For algorithms that are tested in the online setting in the paper, we sample the last 25 consecutive frames in the buffer to update the model if these frames are available. The buffer is emptied once a new episode flag is received, which indicates that the newly observed 5-frame segment is not consecutive to the previous frames.
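The buffer behaviour in the online setting can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation behind Algorithm 1: the class name `OnlineBuffer` and its methods are hypothetical, and the buffer here keeps only the most recent 25 frames, whereas the actual buffer may retain more.

```python
from collections import deque

class OnlineBuffer:
    """Keeps the most recent consecutive frames for online fine-tuning."""

    def __init__(self, window=25):
        self.window = window
        # Assumption: only the last `window` frames are retained.
        self.frames = deque(maxlen=window)

    def observe(self, segment, new_episode=False):
        """Add a newly observed segment (e.g. 5 frames) to the buffer.

        A new-episode flag means the segment is not consecutive to the
        frames already stored, so the buffer is emptied first.
        """
        if new_episode:
            self.frames.clear()
        self.frames.extend(segment)

    def training_window(self):
        """Return the last `window` consecutive frames, or None if unavailable."""
        if len(self.frames) < self.window:
            return None
        return list(self.frames)
```

In an evaluation loop, the model would be updated only when `training_window()` returns a full window, which corresponds to the "if these frames are available" condition above.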