1 Introduction
The weather system is a challenging complex system, and state-of-the-art methods rely on Numerical Weather Prediction (NWP), which is computationally intensive [2]. Recently, data-driven approaches for weather forecasting have become a major interest of researchers, as they are computationally simpler and more straightforward [3, 9, 10, 11].
Long Short-Term Memory (LSTM), proposed by Hochreiter and Schmidhuber [8], is a popular type of recurrent neural network which is able to capture long-term dependencies. LSTMs have been widely used and have shown significant performance on different sequence learning problems and time series prediction [4, 7, 12, 14, 17, 18]. LSTMs have also been used in analyzing spatio-temporal datasets [13, 15]. A stacked LSTM is a deep architecture which consists of more than one LSTM layer, where the input of each LSTM layer is the hidden states of the previous LSTM layer [6, 17].
In this paper a spatio-temporal stacked LSTM model is proposed and its performance is evaluated on the application of temperature prediction. In the proposed model, independent LSTM models are trained per location and, afterward, the input of the second-layer LSTM is formed by combining the hidden states of the LSTM models in the first layer. It is worth mentioning that the general structure of the spatio-temporal stacked LSTM is similar to the approach proposed by Su et al. in the framework of Convolutional Neural Networks for 3D shape recognition [16]. Note that both the stacked LSTM and the spatio-temporal stacked LSTM methodology can be seen as a multi-view approach, as, similar to [9], they fuse the information of different cities. The stacked LSTM benefits from early fusion, and the spatio-temporal stacked LSTM from intermediate fusion, of the information from the different views.
2 Spatio-temporal Stacked LSTM
Assuming $i_t^{(l)}$, $f_t^{(l)}$, $o_t^{(l)}$, $c_t^{(l)}$ and $h_t^{(l)}$ to be the values of the input gate, forget gate, output gate, memory cell and hidden state at time $t$ in the sequence and layer $l$ respectively, and $x_t^{(v)}$ to be the input of the system at time $t$ at location $v$, the stacked LSTM model based on the architecture of the LSTM cell defined in [5] is shown in Table 1. Note that the input of the model is a concatenation of the variables from all $V$ locations, i.e. $x_t = [x_t^{(1)}; x_t^{(2)}; \ldots; x_t^{(V)}]$. In this study we focus on a 2-layer stacked LSTM; however, the methodology can be extended to a larger number of layers. The full weight matrices $W_{x\cdot}$ and $W_{h\cdot}$ are the weights that connect the input and the previous hidden state to the corresponding gates and the memory cell. The weight matrices $W_{c\cdot}$ are diagonal matrices that connect the memory cell to the different gates. Note that the number of neurons for all gates is a predefined parameter and the equations are applied for each neuron. For simplicity, we use the column vectors $W^{(l)}$ and $b^{(l)}$ to indicate all the elements in the weight matrices and the bias terms of layer $l$ respectively.

Table 1: Stacked LSTM model.

|           | Layer 1 LSTM | Layer 2 LSTM |
|-----------|--------------|--------------|
| Input     | $x_t = [x_t^{(1)}; \ldots; x_t^{(V)}]$ | $h_t^{(1)}$ |
| Equations | the cell equations below with input $u_t = x_t$ and parameters $W^{(1)}, b^{(1)}$ | the cell equations below with input $u_t = h_t^{(1)}$ and parameters $W^{(2)}, b^{(2)}$ |
| Summary   | $h_t^{(1)} = \mathrm{LSTM}(x_t, h_{t-1}^{(1)}; W^{(1)}, b^{(1)})$ | $h_t^{(2)} = \mathrm{LSTM}(h_t^{(1)}, h_{t-1}^{(2)}; W^{(2)}, b^{(2)})$ |

Following [5], the cell equations for an LSTM with input $u_t$ are
$$i_t = \sigma(W_{xi} u_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i),$$
$$f_t = \sigma(W_{xf} u_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f),$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} u_t + W_{hc} h_{t-1} + b_c),$$
$$o_t = \sigma(W_{xo} u_t + W_{ho} h_{t-1} + W_{co} c_t + b_o),$$
$$h_t = o_t \odot \tanh(c_t).$$
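The per-gate equations of the Graves-style cell referenced above can be sketched as a single NumPy step. This is an illustrative sketch, not the paper's implementation; the dictionary-based parameter layout and all dimensions (18 input variables, 8 hidden neurons) are assumptions, and the peephole weights `w_c*` are stored as vectors since the corresponding matrices are diagonal.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step with diagonal peephole connections (Graves, 2013)."""
    i = sigmoid(p["W_xi"] @ x + p["W_hi"] @ h_prev + p["w_ci"] * c_prev + p["b_i"])
    f = sigmoid(p["W_xf"] @ x + p["W_hf"] @ h_prev + p["w_cf"] * c_prev + p["b_f"])
    c = f * c_prev + i * np.tanh(p["W_xc"] @ x + p["W_hc"] @ h_prev + p["b_c"])
    o = sigmoid(p["W_xo"] @ x + p["W_ho"] @ h_prev + p["w_co"] * c + p["b_o"])
    h = o * np.tanh(c)
    return h, c

def init_params(n_in, n_hid, rng):
    """Random parameters: full input/hidden matrices, diagonal peepholes."""
    p = {}
    for g in "ifoc":
        p[f"W_x{g}"] = rng.standard_normal((n_hid, n_in)) * 0.1
        p[f"W_h{g}"] = rng.standard_normal((n_hid, n_hid)) * 0.1
        p[f"b_{g}"] = np.zeros(n_hid)
    for g in "ifo":  # no peephole into the cell-input nonlinearity
        p[f"w_c{g}"] = rng.standard_normal(n_hid) * 0.1
    return p

# Run a short sequence through one layer.
rng = np.random.default_rng(0)
params = init_params(n_in=18, n_hid=8, rng=rng)  # 18 weather variables (assumed sizes)
h = c = np.zeros(8)
for t in range(10):  # sequence length 10, as in the experiments
    x_t = rng.standard_normal(18)
    h, c = lstm_step(x_t, h, c, params)
print(h.shape)  # (8,)
```

A second layer is obtained by feeding `h` as the input `x` of another such cell, which is exactly the stacking described in Table 1.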
In Table 2 the equations for the proposed spatio-temporal stacked LSTM model are shown. As explained, instead of one LSTM model in the first layer, there are independent LSTM models per location; hence, having 5 locations, 5 LSTM models are created in the first layer. The full weight matrices $W^{(1,v)}_{x\cdot}$ refer to the connection of the data of location $v$ to the different gates in the corresponding LSTM model. Similarly, the other weight matrices, biases, gate values, memory cells and hidden states use the index $v$ to indicate that they correspond to the LSTM related to location $v$. For simplicity, we use the column vectors $W^{(1,v)}$ and $b^{(1,v)}$, which include all the weight and bias elements, to refer to the parameters of the LSTM related to location $v$ in the first layer. The information from the different locations is then combined by merging the hidden states of the first layer and passing the result as input to the second layer; for the second LSTM layer, the definition of $W^{(2)}$ and $b^{(2)}$ remains the same. Figure 1 depicts the stacked LSTM and the spatio-temporal stacked LSTM model when the number of layers is equal to two. As shown, for the stacked LSTM model the hidden states of the first layer are used as the input of the second layer, whereas in the case of the spatio-temporal stacked LSTM there are independent LSTM models per location, and the input of the second LSTM layer is defined as the combination of the hidden states of the LSTM models of the first layer. Note that if the number of layers is more than two, the merging of the hidden states is possible at any layer before the last LSTM layer; e.g., for a 3-layer spatio-temporal stacked LSTM model, the combination of the hidden states can happen after the first LSTM layer or after the second one. In both the stacked LSTM and the spatio-temporal stacked LSTM model, a dense layer is used after the second LSTM layer.
The final prediction is obtained as $\hat{y}_{T+p} = W_y h_T^{(2)} + b_y$, where $p$ is the number of days ahead to predict, $T$ is the input sequence length, and $W_y$ and $b_y$ are the weights and bias term in the dense layer. For the experiments, we use a quadratic loss function to train the network and utilize $\ell_2$ norm regularization to avoid overfitting.

Table 2: Spatio-temporal stacked LSTM model.

|           | Layer 1 LSTM (for $v = 1, \ldots, V$) | Layer 2 LSTM |
|-----------|---------------------------------------|--------------|
| Input     | $x_t^{(v)}$ | $[h_t^{(1,1)}; \ldots; h_t^{(1,V)}]$ |
| Equations | the cell equations of Table 1 with input $u_t = x_t^{(v)}$ and parameters $W^{(1,v)}, b^{(1,v)}$ | the cell equations of Table 1 with input $u_t = [h_t^{(1,1)}; \ldots; h_t^{(1,V)}]$ and parameters $W^{(2)}, b^{(2)}$ |
| Summary   | $h_t^{(1,v)} = \mathrm{LSTM}(x_t^{(v)}, h_{t-1}^{(1,v)}; W^{(1,v)}, b^{(1,v)})$ | $h_t^{(2)} = \mathrm{LSTM}([h_t^{(1,1)}; \ldots; h_t^{(1,V)}], h_{t-1}^{(2)}; W^{(2)}, b^{(2)})$ |
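The wiring just described (independent first-layer LSTMs per location whose hidden states are concatenated into the second-layer input) can be sketched as follows. This is a minimal illustration, not the paper's code: the cell below omits the peephole connections for brevity, and all sizes (5 cities, 18 variables, 6 and 10 hidden neurons) are assumed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_lstm(n_in, n_hid, rng):
    """Return a step function for a minimal LSTM cell (no peepholes)."""
    W = rng.standard_normal((4 * n_hid, n_in + n_hid)) * 0.1
    b = np.zeros(4 * n_hid)

    def step(x, h, c):
        z = W @ np.concatenate([x, h]) + b
        i, f, o, g = np.split(z, 4)  # gate pre-activations
        c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h_new = sigmoid(o) * np.tanh(c_new)
        return h_new, c_new

    return step

rng = np.random.default_rng(1)
V, n_in, n_hid1, n_hid2 = 5, 18, 6, 10   # 5 cities, 18 variables each (assumed sizes)

# Layer 1: one independent LSTM per location.
layer1 = [make_lstm(n_in, n_hid1, rng) for _ in range(V)]
states1 = [(np.zeros(n_hid1), np.zeros(n_hid1)) for _ in range(V)]

# Layer 2: one LSTM over the merged first-layer hidden states.
layer2 = make_lstm(V * n_hid1, n_hid2, rng)
h2, c2 = np.zeros(n_hid2), np.zeros(n_hid2)

for t in range(10):                      # sequence length 10
    x_t = [rng.standard_normal(n_in) for _ in range(V)]       # one vector per city
    states1 = [layer1[v](x_t[v], *states1[v]) for v in range(V)]
    merged = np.concatenate([h for h, _ in states1])          # intermediate fusion
    h2, c2 = layer2(merged, h2, c2)

print(h2.shape)  # (10,)
```

Replacing the five per-location cells with a single cell over the concatenated input `np.concatenate(x_t)` recovers the plain stacked LSTM of Table 1.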
One of the advantages of the proposed spatio-temporal stacked LSTM method is the smaller number of parameters in comparison with the stacked LSTM. Assume the total number of neurons in the first and the second layer is the same in both cases; in other words, if the number of neurons of the stacked LSTM model in the first layer is $n$, then in the spatio-temporal stacked LSTM each LSTM model in the first layer has $n/V$ neurons, where $V$ is the number of locations. In this case, the spatio-temporal stacked LSTM is similar to a stacked LSTM in which the full weight matrices are constrained to be block-diagonal, where each block is related to one location. Hence, the number of parameters to be optimized is smaller in the proposed method, which makes the spatio-temporal stacked LSTM a better choice when the number of samples in the training set is relatively small. On the other hand, in the spatio-temporal stacked LSTM, the relation between the locations is taken into account in the second LSTM layer by combining the hidden states from the first layer.
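The block-diagonal argument above can be made concrete by counting only the input-to-gate weights of the first layer (hidden-to-hidden weights scale analogously); the sizes below are hypothetical, chosen only to make the arithmetic visible.

```python
V = 5          # number of locations
d = 18         # input variables per location (as in the data description)
n = 100        # total first-layer neurons (hypothetical)

# Stacked LSTM: full input weight matrices over the concatenated input,
# one matrix for each of the four gates (input, forget, output, cell input).
full = 4 * n * (V * d)

# Spatio-temporal stacked LSTM: V independent LSTMs of n/V neurons each,
# i.e. a block-diagonal structure with one d-wide block per location.
block_diag = V * 4 * (n // V) * d

print(full, block_diag)    # 36000 7200
print(full // block_diag)  # 5, i.e. a factor V fewer input weights
```

The reduction factor for these input weights is exactly $V$, which is why the proposed model is attractive when training data is scarce.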
3 Experiments
In this paper, the data have been collected from the Weather Underground company website [1] and cover a time period from the beginning of 2007 to mid-2014 for 5 cities: Brussels, Antwerp, Liege, Amsterdam and Eindhoven. To evaluate the performance of the proposed methods in various weather conditions, two test sets are defined: (i) from mid-November 2013 to mid-December 2013 (Nov/Dec) and (ii) from mid-April 2014 to mid-May 2014 (Apr/May). The data contain 18 measured weather variables, such as temperature and humidity, for each day per city. In order to benefit from all available data, the training set used for each test set includes the data from the beginning of 2007 until the day before the corresponding test set.
In this study, the LSTM cell architecture described by Zaremba et al. [19], as implemented in TensorFlow, has been used for the experiments. The ranges for the tuning parameters were selected empirically. For the stacked LSTM, several candidate values for the number of neurons were examined for the first and the second layer; in the case of the spatio-temporal stacked LSTM, the number of neurons of the LSTM per location in the first layer was chosen from a correspondingly scaled candidate set. Note that, as there are 5 locations, the total number of neurons in the first layer is the same in both models. For the inner state, we deployed both tanh and sigmoid as the activation function. In the experiments, the sequence length is set to 10.
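The empirical tuning described above can be organized as a simple grid search; the candidate values below are placeholders, not the ranges actually used in the paper.

```python
from itertools import product

# Hypothetical candidate values (the paper's actual ranges are not reproduced here).
neurons_layer1 = [50, 100]          # stacked LSTM, first layer
neurons_layer2 = [20, 50]           # second layer
activations = ["tanh", "sigmoid"]   # inner-state activation

# Every combination would be trained and compared on a validation set.
grid = list(product(neurons_layer1, neurons_layer2, activations))
print(len(grid))  # 8

# For the spatio-temporal variant, each of the 5 per-location LSTMs gets one
# fifth of the first-layer neurons, keeping the totals comparable.
V = 5
per_location = [n // V for n in neurons_layer1]
print(per_location)  # [10, 20]
```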
The experiments were conducted for the prediction of the minimum and maximum temperature in Brussels for 1 to 6 days ahead. To mitigate the effect of local minima in training the neural networks, the experiments are repeated 5 times and the median Mean Absolute Error (MAE) and Mean Squared Error (MSE) on both test sets are presented in Table 3. As shown, using sigmoid as the inner activation function can result in better performance. Moreover, in most cases taking the spatial information into account improves the performance; this is more evident when the inner activation function is tanh.
Table 3: MAE and MSE of the stacked LSTM and the spatio-temporal (ST) stacked LSTM models on both test sets.

| Test set | Steps ahead | Temp. | MAE stacked (tanh) | MAE ST (tanh) | MSE stacked (tanh) | MSE ST (tanh) | MAE stacked (sigmoid) | MAE ST (sigmoid) | MSE stacked (sigmoid) | MSE ST (sigmoid) |
|---|---|---|---|---|---|---|---|---|---|---|
| Nov/Dec | 1 | Min | 1.66 | 1.43 | 4.36 | 3.64 | 1.69 | 1.56 | 4.04 | 3.71 |
| Nov/Dec | 1 | Max | 1.15 | 1.22 | 2.48 | 2.65 | 1.37 | 1.23 | 3.87 | 2.96 |
| Nov/Dec | 2 | Min | 2.30 | 1.72 | 8.17 | 4.36 | 1.86 | 1.76 | 5.12 | 5.00 |
| Nov/Dec | 2 | Max | 1.89 | 1.71 | 6.51 | 4.57 | 1.65 | 1.61 | 4.24 | 4.99 |
| Nov/Dec | 3 | Min | 3.04 | 1.72 | 14.16 | 4.22 | 1.94 | 1.94 | 5.35 | 5.23 |
| Nov/Dec | 3 | Max | 3.44 | 1.86 | 17.30 | 4.83 | 1.73 | 1.73 | 4.99 | 5.34 |
| Nov/Dec | 4 | Min | 3.28 | 1.98 | 17.52 | 6.39 | 1.66 | 1.57 | 4.16 | 3.74 |
| Nov/Dec | 4 | Max | 2.56 | 2.14 | 9.36 | 5.87 | 1.61 | 1.76 | 3.85 | 3.87 |
| Nov/Dec | 5 | Min | 3.72 | 1.71 | 22.20 | 4.39 | 1.58 | 1.58 | 4.06 | 4.13 |
| Nov/Dec | 5 | Max | 2.75 | 1.89 | 11.30 | 4.92 | 1.58 | 1.55 | 3.51 | 3.65 |
| Nov/Dec | 6 | Min | 3.23 | 1.90 | 14.06 | 5.42 | 1.76 | 1.90 | 4.68 | 5.09 |
| Nov/Dec | 6 | Max | 4.01 | 1.80 | 22.06 | 5.40 | 1.63 | 1.68 | 4.70 | 4.79 |
| Apr/May | 1 | Min | 1.64 | 1.60 | 4.15 | 4.63 | 1.58 | 1.55 | 3.78 | 3.92 |
| Apr/May | 1 | Max | 2.27 | 2.45 | 7.66 | 8.51 | 2.24 | 2.27 | 8.00 | 7.93 |
| Apr/May | 2 | Min | 2.52 | 2.15 | 12.36 | 9.09 | 2.01 | 1.96 | 7.86 | 7.75 |
| Apr/May | 2 | Max | 2.64 | 2.64 | 11.87 | 10.37 | 2.77 | 2.55 | 10.03 | 9.38 |
| Apr/May | 3 | Min | 2.83 | 2.20 | 14.58 | 9.07 | 2.09 | 2.03 | 8.86 | 8.54 |
| Apr/May | 3 | Max | 3.61 | 3.03 | 18.63 | 12.75 | 2.53 | 2.58 | 8.98 | 9.29 |
| Apr/May | 4 | Min | 2.63 | 2.25 | 12.09 | 9.59 | 2.03 | 2.07 | 8.27 | 8.36 |
| Apr/May | 4 | Max | 3.08 | 2.92 | 13.02 | 12.69 | 2.72 | 2.59 | 10.18 | 9.75 |
| Apr/May | 5 | Min | 2.63 | 2.51 | 11.71 | 12.36 | 2.36 | 2.31 | 10.17 | 10.05 |
| Apr/May | 5 | Max | 3.62 | 2.89 | 19.82 | 12.89 | 2.64 | 2.99 | 10.17 | 13.46 |
| Apr/May | 6 | Min | 3.04 | 2.93 | 15.58 | 15.39 | 2.46 | 2.53 | 10.92 | 11.99 |
| Apr/May | 6 | Max | 2.77 | 2.99 | 10.99 | 13.51 | 3.04 | 2.95 | 12.58 | 12.15 |
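The evaluation protocol behind Table 3 (5 repeated runs, median MAE and MSE per configuration) can be sketched as follows; the target series and "predictions" below are synthetic stand-ins, not the paper's data.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    """Mean Squared Error."""
    return np.mean((y_true - y_pred) ** 2)

rng = np.random.default_rng(42)
y_true = rng.normal(10.0, 5.0, size=30)  # e.g. one month of daily temperatures

# 5 repeated runs to reduce sensitivity to local minima; here each run's
# prediction is just the target plus noise, purely for illustration.
maes, mses = [], []
for run in range(5):
    y_pred = y_true + rng.normal(0.0, 1.5, size=y_true.shape)
    maes.append(mae(y_true, y_pred))
    mses.append(mse(y_true, y_pred))

print(float(np.median(maes)), float(np.median(mses)))
```

Taking the median over runs, rather than the mean, keeps a single badly converged run from dominating the reported score.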
In addition, the comparison between the performance of the stacked LSTM, the proposed method and Weather Underground over the two test sets together is depicted in Figure 3. As shown, although only a few locations are taken into account, the LSTM models (as data-driven approaches) outperform the state-of-the-art method for minimum temperature prediction. For maximum temperature prediction, the performance of the LSTM models is competitive with that of the state-of-the-art method.
4 Conclusion
In this study, we proposed a spatio-temporal stacked LSTM model in which, in the first layer, different LSTM models are considered per location; the corresponding hidden states are then merged and given as input to the next layer. The proposed method was deployed in an application of weather forecasting.
The experimental results suggest that taking the spatio-temporal properties of the data into account in the LSTM model can improve the prediction performance. Moreover, the proposed method is shown to be competitive with the state-of-the-art method in weather forecasting.
References
(1) Weather Underground. www.wunderground.com.
(2) Bauer, P., Thorpe, A., and Brunet, G. The quiet revolution of numerical weather prediction. Nature 525, 7567 (2015), 47–55.
(3) Feng, C., Cui, M., Hodge, B.-M., and Zhang, J. A data-driven multi-model methodology with deep feature selection for short-term wind forecasting. Applied Energy 190 (2017), 1245–1257.
(4) Freeman, B. S., Taylor, G., Gharabaghi, B., and Thé, J. Forecasting air quality time series using deep learning. Journal of the Air & Waste Management Association (2018), just-accepted.
(5) Graves, A. Generating sequences with recurrent neural networks. Preprint arXiv:1308.0850 (2013).
(6) Graves, A., Jaitly, N., and Mohamed, A.-r. Hybrid speech recognition with deep bidirectional LSTM. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on (2013), IEEE, pp. 273–278.
(7) Graves, A., Mohamed, A.-r., and Hinton, G. Speech recognition with deep recurrent neural networks. In ICASSP (2013), IEEE, pp. 6645–6649.
(8) Hochreiter, S., and Schmidhuber, J. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
(9) Houthuys, L., Karevan, Z., and Suykens, J. A. K. Multi-view LS-SVM regression for black-box temperature prediction in weather forecasting. IJCNN (2017), 1102–1108.
(10) Hu, Q., Su, P., Yu, D., and Liu, J. Pattern-based wind speed prediction based on generalized principal component analysis. IEEE Transactions on Sustainable Energy 5, 3 (2014), 866–874.
(11) Karevan, Z., Feng, Y., and Suykens, J. A. K. Moving least squares support vector machines for weather temperature prediction. In Proc. of the European Symposium on Artificial Neural Networks (2016), pp. 611–616.
(12) Lipton, Z. C., Kale, D. C., Elkan, C., and Wetzel, R. Learning to diagnose with LSTM recurrent neural networks. ICLR (2016).
(13) Liu, J., Shahroudy, A., Xu, D., and Wang, G. Spatio-temporal LSTM with trust gates for 3D human action recognition. In European Conference on Computer Vision (2016), Springer, pp. 816–833.
(14) Ng, J. Y.-H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., and Toderici, G. Beyond short snippets: Deep networks for video classification. In CVPR (2015), IEEE, pp. 4694–4702.
(15) Patraucean, V., Handa, A., and Cipolla, R. Spatio-temporal video autoencoder with differentiable memory. Preprint arXiv:1511.06309 (2015).
(16) Su, H., Maji, S., Kalogerakis, E., and Learned-Miller, E. Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 945–953.
(17) Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (2014), pp. 3104–3112.
(18) Tian, Y., and Pan, L. Predicting short-term traffic flow by long short-term memory recurrent neural network. In 2015 IEEE International Conference on Smart City/SocialCom/SustainCom (2015), IEEE, pp. 153–158.
(19) Zaremba, W., Sutskever, I., and Vinyals, O. Recurrent neural network regularization. Preprint arXiv:1409.2329 (2014).