Weather is a challenging complex system, and state-of-the-art forecasting relies on Numerical Weather Prediction (NWP), which is computationally intensive bauer2015quiet. Recently, data-driven approaches to weather forecasting have attracted considerable research interest because they are computationally simpler and more straightforward feng2017data ; Houthuys2017 ; hu2014pattern ; ZahraESANN2017.
Long Short-Term Memory (LSTM), proposed by Hochreiter and Schmidhuber hochreiter1997long, is a popular type of recurrent neural network that is able to capture long-term dependencies. LSTMs have been widely used and have shown strong performance on sequence learning and time series prediction problems ng2015beyond ; sutskever2014sequence ; graves2013speech ; freeman2018forecasting ; lipton2015learning ; tian2015predicting. LSTMs have also been used to analyze spatio-temporal datasets patraucean2015spatio ; liu2016spatio. A stacked LSTM is a deep architecture consisting of more than one LSTM layer, in which the input of each LSTM layer is the sequence of hidden states of the previous LSTM layer graves2013hybrid ; sutskever2014sequence.
In this paper, a spatio-temporal stacked LSTM model is proposed and its performance is evaluated on temperature prediction. In the proposed model, independent LSTM models are trained per location, and the input of the second-layer LSTM is then formed by combining the hidden states of the first-layer LSTM models. The general structure of the spatio-temporal stacked LSTM is similar to the approach proposed by Su et al. for 3D shape recognition with Convolutional Neural Networks su2015multi. Both the stacked LSTM and the spatio-temporal stacked LSTM can be interpreted as multi-view approaches since, similar to Houthuys2017, they fuse the information of different cities: the stacked LSTM relies on early fusion of the views, whereas the spatio-temporal stacked LSTM relies on intermediate fusion.
2 Spatio-temporal Stacked LSTM
Let $i_t^{(\ell)}$, $f_t^{(\ell)}$, $o_t^{(\ell)}$, $c_t^{(\ell)}$ and $h_t^{(\ell)}$ be the values of the input gate, forget gate, output gate, memory cell and hidden state at time $t$ in the sequence and layer $\ell$ respectively, and let $x_t^{[k]}$ be the input of the system at time $t$ at location $k$. The stacked LSTM model, based on the architecture of the LSTM cell defined in graves2013generating, is shown in Table 1. Note that the input $x_t$ of the model is a concatenation of the variables from all locations; i.e. $x_t = [x_t^{[1]}; \ldots; x_t^{[K]}]$, where $K$ is the number of locations. In this study we focus on a 2-layer stacked LSTM; however, the methodology can be extended to a larger number of layers. The full weight matrices $W_{x\ast}$ for $\ast \in \{i, f, o, c\}$ connect the input to the corresponding gates and the memory cell. The weight matrices $W_{c\ast}$ for $\ast \in \{i, f, o\}$ are diagonal matrices that connect the memory cell to the gates. The number of neurons is a predefined parameter shared by all gates, and the equations are applied for each neuron. For simplicity, we use column vectors $w^{(\ell)}$ and $b^{(\ell)}$ to denote all the elements of the weight matrices and bias terms of layer $\ell$ respectively.
Table 1: Equations of the stacked LSTM model, given separately for the Layer 1 LSTM and the Layer 2 LSTM.
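For orientation, the LSTM cell of graves2013generating is commonly written as follows for a single layer. This is a sketch in the standard notation of that reference, not a reproduction of Table 1: the symbols are assumptions, with $x_t$ the input, $h_t$ the hidden state, $c_t$ the memory cell, $i_t$, $f_t$, $o_t$ the input, forget and output gates, $\sigma$ the logistic sigmoid and $\odot$ the element-wise product. For a second layer, $x_t$ would be replaced by the hidden state of the first layer at the same time step.

```latex
\begin{align}
i_t &= \sigma\left(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o\right) \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{align}
```

Here the matrices $W_{x\ast}$ and $W_{h\ast}$ are full, while the peephole matrices $W_{c\ast}$ are diagonal, so each gate unit only sees its own memory cell.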
Table 2 shows the equations of the proposed spatio-temporal stacked LSTM model. Instead of one LSTM model in the first layer, there is an independent LSTM model per location; hence, with 5 locations, 5 LSTM models are created. The full weight matrices of the LSTM model for location $k$ connect the data of location $k$ to the gates of that model. Similarly, the other weight matrices, biases, gate values, memory cell and hidden state carry a subscript $k$ to indicate that they belong to the LSTM of location $k$. For simplicity, we use column vectors that collect all the weights and all the bias terms of the first-layer LSTM of each location $k$. The information from the different locations is then combined by merging the hidden states of the first layer and passing the result as input to the second layer. For the second LSTM layer, the definition of the weight and bias vectors remains the same.

Figure 1 depicts the stacked LSTM and the spatio-temporal stacked LSTM when the number of layers is equal to two. In the stacked LSTM model, the hidden states of the first layer are used as the input of the second layer. In the spatio-temporal stacked LSTM, there are independent LSTM models per location, and the input of the second LSTM layer is formed by combining the hidden states of the first-layer LSTM models. If the number of layers is more than two, the hidden states can be merged at any layer before the last LSTM layer; e.g. in a 3-layer spatio-temporal stacked LSTM model, the hidden states can be combined after the first or after the second LSTM layer. In both models, a dense layer follows the second LSTM layer.
The final prediction is obtained by applying the dense layer, with its weights and bias term, to the output of the second LSTM layer for an input sequence of a given length; the target is the temperature a given number of days ahead. For the experiments, we use a quadratic loss function to train the network and norm-based regularization to avoid overfitting.
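The dense layer and the training objective described above can be sketched as follows. This is only an illustration under stated assumptions: the dense layer is taken to be linear and applied to the last hidden state of the second LSTM layer, the regularizer is assumed to be a squared l2 norm of the dense weights, and the names `dense_predict`, `regularized_quadratic_loss` and `reg_strength` are hypothetical.

```python
def dense_predict(h_last, weights, bias):
    """Linear dense layer: weighted sum of the last hidden state plus a bias."""
    return sum(w * h for w, h in zip(weights, h_last)) + bias


def regularized_quadratic_loss(predictions, targets, weights, reg_strength):
    """Mean squared error plus an assumed squared l2 penalty on the weights."""
    n = len(targets)
    quadratic = sum((p - y) ** 2 for p, y in zip(predictions, targets)) / n
    penalty = reg_strength * sum(w ** 2 for w in weights)
    return quadratic + penalty


# Toy usage with made-up numbers (not data from the experiments).
prediction = dense_predict([1.0, 2.0], weights=[0.5, -1.0], bias=0.25)
loss = regularized_quadratic_loss([1.0, 3.0], [0.0, 1.0], [0.5, -1.0], 0.1)
```

In practice the penalty would typically cover the LSTM weights as well; only the dense weights are penalized here to keep the sketch short.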
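Table 2: Equations of the spatio-temporal stacked LSTM model, given separately for the Layer 1 LSTM (one model per location) and the Layer 2 LSTM.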
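The forward pass of a 2-layer spatio-temporal stacked LSTM can be sketched in NumPy as below. It is a minimal illustration, not the implementation used in the experiments: the cell follows the peephole form of graves2013generating, the hidden states are merged by concatenation (an assumption about how the combination is done), and the helper names, random initialization and layer sizes are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_lstm(input_dim, hidden_dim, rng):
    """Random parameters for one peephole LSTM (gates i, f, o and cell c)."""
    p = {}
    for g in "ifoc":
        p["Wx" + g] = 0.1 * rng.standard_normal((hidden_dim, input_dim))
        p["Wh" + g] = 0.1 * rng.standard_normal((hidden_dim, hidden_dim))
        p["b" + g] = np.zeros(hidden_dim)
    for g in "ifo":
        # Diagonal peephole matrices are stored as vectors.
        p["Wc" + g] = 0.1 * rng.standard_normal(hidden_dim)
    return p


def lstm_step(p, x, h_prev, c_prev):
    """One time step of a peephole LSTM cell; returns the new (h, c)."""
    i = sigmoid(p["Wxi"] @ x + p["Whi"] @ h_prev + p["Wci"] * c_prev + p["bi"])
    f = sigmoid(p["Wxf"] @ x + p["Whf"] @ h_prev + p["Wcf"] * c_prev + p["bf"])
    c = f * c_prev + i * np.tanh(p["Wxc"] @ x + p["Whc"] @ h_prev + p["bc"])
    o = sigmoid(p["Wxo"] @ x + p["Who"] @ h_prev + p["Wco"] * c + p["bo"])
    return o * np.tanh(c), c


def st_stacked_forward(x_seq, layer1, layer2):
    """x_seq has shape (T, K, d); layer1 is a list of K parameter dicts.

    Returns the hidden state of the second layer at the last time step.
    """
    T, K, _ = x_seq.shape
    n1 = layer1[0]["bi"].shape[0]
    n2 = layer2["bi"].shape[0]
    h1 = [np.zeros(n1) for _ in range(K)]
    c1 = [np.zeros(n1) for _ in range(K)]
    h2, c2 = np.zeros(n2), np.zeros(n2)
    for t in range(T):
        # First layer: one independent LSTM per location.
        for k in range(K):
            h1[k], c1[k] = lstm_step(layer1[k], x_seq[t, k], h1[k], c1[k])
        # Second layer: input is the merged first-layer hidden states.
        h2, c2 = lstm_step(layer2, np.concatenate(h1), h2, c2)
    return h2


# Toy usage: 5 locations, 18 variables each, sequence length 10.
rng = np.random.default_rng(0)
K, d, T = 5, 18, 10
layer1 = [init_lstm(d, 4, rng) for _ in range(K)]
layer2 = init_lstm(K * 4, 6, rng)
h_last = st_stacked_forward(rng.standard_normal((T, K, d)), layer1, layer2)
```

A plain stacked LSTM corresponds to replacing the K first-layer models by a single LSTM that receives the concatenated input of all locations.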
One advantage of the proposed spatio-temporal stacked LSTM is that it has fewer parameters than the stacked LSTM. Assume that the total number of neurons in the first layer and in the second layer is the same in both models; in other words, if the first layer of the stacked LSTM has $N$ neurons, then each first-layer LSTM model in the spatio-temporal stacked LSTM has $N/K$ neurons, where $K$ is the number of locations. In this case, the spatio-temporal stacked LSTM corresponds to a stacked LSTM whose full first-layer weight matrices are constrained to be block diagonal, with each block related to one location. Hence, the number of parameters to be optimized is smaller in the proposed method, which makes the spatio-temporal stacked LSTM a better choice when the number of training samples is relatively small. At the same time, the relations between the locations are still taken into account in the second LSTM layer, which combines the hidden states of the first layer.
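The reduction in first-layer parameters can be made concrete with a small count. The sketch below assumes a peephole LSTM cell (full input and recurrent matrices for the three gates and the cell, diagonal peephole weights for the three gates, and four bias vectors); the function names and the choice of 50 first-layer neurons are hypothetical, while 5 locations and 18 variables per location are taken from the data description.

```python
def lstm_param_count(input_dim, hidden_dim):
    """Number of parameters of one peephole LSTM layer."""
    full = 4 * hidden_dim * (input_dim + hidden_dim)  # input and recurrent matrices
    peephole = 3 * hidden_dim                         # diagonal cell-to-gate weights
    biases = 4 * hidden_dim
    return full + peephole + biases


def first_layer_counts(total_neurons, n_locations, vars_per_location):
    """Return (stacked, spatio-temporal stacked) first-layer parameter counts."""
    stacked = lstm_param_count(n_locations * vars_per_location, total_neurons)
    per_location = lstm_param_count(vars_per_location, total_neurons // n_locations)
    return stacked, n_locations * per_location


print(first_layer_counts(total_neurons=50, n_locations=5, vars_per_location=18))
# → (28350, 5950)
```

With these assumed sizes, the block-diagonal structure reduces the first-layer parameter count by roughly a factor of five.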
The data were collected from the Weather Underground website Wunderground and cover the period from the beginning of 2007 to mid-2014 for 5 cities: Brussels, Antwerp, Liege, Amsterdam and Eindhoven. To evaluate the performance of the proposed methods in various weather conditions, two test sets are defined: (i) from mid-November 2013 to mid-December 2013 (Nov/Dec) and (ii) from mid-April 2014 to mid-May 2014 (Apr/May). The data contain 18 measured weather variables, such as temperature and humidity, for each day per city. In order to benefit from all available data, the training data for each test set include all data from the beginning of 2007 until the day before that test set.
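The train/test split described above can be sketched as follows. The exact start days are assumptions (the text only states "mid-November" and "mid-April"), and the names `DATA_START`, `TEST_STARTS` and `training_period` are hypothetical.

```python
from datetime import date, timedelta

DATA_START = date(2007, 1, 1)  # "beginning of 2007" in the text

# Assumed start days; the text only gives the month halves.
TEST_STARTS = {"Nov/Dec": date(2013, 11, 15), "Apr/May": date(2014, 4, 15)}


def training_period(test_start, data_start=DATA_START):
    """First and last training day for a test set that starts on test_start."""
    return data_start, test_start - timedelta(days=1)


for name, start in TEST_STARTS.items():
    first, last = training_period(start)
    print(name, first.isoformat(), "to", last.isoformat())
```

Under this scheme, the training data for the Apr/May test set also contain the days of the Nov/Dec test period, since everything before the test set is used.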
In this study, the LSTM cell architecture described by Zaremba et al. zaremba2014recurrent, as implemented in TensorFlow, was used for the experiments. The ranges considered for the tuning parameters were selected empirically. For the stacked LSTM, one set of candidate values was examined for the number of neurons in the first layer and another for the second layer. For the spatio-temporal stacked LSTM, the number of neurons of each per-location LSTM in the first layer was chosen from a corresponding set of candidates. Note that as there are 5 locations in the first layer, the total number of neurons in the first layer is similar in both models. For the inner state, both tanh and sigmoid were tried as the activation function. In the experiments, the sequence length is 10.
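Building training samples with a sequence length of 10 can be sketched as below. The alignment is an assumption: each window of consecutive days is paired with the target value a given number of days after the last day of the window. The function name `make_windows` and its arguments are hypothetical.

```python
def make_windows(features, targets, seq_len=10, steps_ahead=1):
    """Pair each window of seq_len days with the target steps_ahead days later."""
    inputs, labels = [], []
    last_start = len(features) - seq_len - steps_ahead
    for start in range(last_start + 1):
        end = start + seq_len  # index one past the last day of the window
        inputs.append(features[start:end])
        labels.append(targets[end - 1 + steps_ahead])
    return inputs, labels


# Toy usage: 15 days of a single made-up feature, predicting 1 day ahead.
days = list(range(15))
X, y = make_windows(days, [10 * v for v in days], seq_len=10, steps_ahead=1)
```

For a 6-days-ahead forecast, `steps_ahead=6` shifts the label accordingly and leaves fewer usable windows at the end of the record.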
The experiments were conducted for the prediction of the minimum and maximum temperature in Brussels for 1 to 6 days ahead. To reduce the influence of local minima in neural network training, each experiment is repeated 5 times, and the median Mean Absolute Error (MAE) and Mean Squared Error (MSE) on both test sets are presented in Table 3. The results show that using sigmoid as the inner activation function can yield better performance. Moreover, in most cases taking the spatial information into account improves the performance; this is more evident when the activation function of the inner state is tanh.
Table 3: Median MAE and MSE of the stacked LSTM and the spatio-temporal (ST) stacked LSTM per test set, number of steps ahead and temperature (minimum or maximum), with tanh and with sigmoid as the inner activation function.
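The evaluation protocol (repeated runs, then the median error) can be sketched as follows. It assumes that the median is taken separately for each metric across the repeated runs; the function names are hypothetical and the numbers in the usage example are made up.

```python
from statistics import median


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean Squared Error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def median_scores(y_true, runs):
    """runs holds one list of predictions per repeated training run."""
    return (median(mae(y_true, r) for r in runs),
            median(mse(y_true, r) for r in runs))


# Toy usage: three runs on two test days.
scores = median_scores([0.0, 0.0], [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
print(scores)  # → (2.0, 4.0)
```

Using the median rather than the mean keeps a single poorly converged run from dominating the reported error.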
In addition, Figure 3 compares the performance of the stacked LSTM, the proposed method and Weather Underground over the two test sets combined. Although very few locations are taken into account, the LSTM models (as data-driven approaches) outperform the state-of-the-art method for minimum temperature prediction. For maximum temperature prediction, the performance of the LSTM models is competitive with that of the state-of-the-art method.
In this study, we proposed a spatio-temporal stacked LSTM model in which the first layer consists of a separate LSTM model per location, whose hidden states are merged and given as input to the next layer. The proposed method was applied to weather forecasting.
The experimental results suggest that taking the spatio-temporal properties of the data into account in the LSTM model can improve prediction performance. Moreover, the proposed method is shown to be competitive with the state-of-the-art method in weather forecasting.
- (1) Weather underground. www.wunderground.com.
- (2) Bauer, P., Thorpe, A., and Brunet, G. The quiet revolution of numerical weather prediction. Nature 525, 7567 (2015), 47–55.
- (3) Feng, C., Cui, M., Hodge, B.-M., and Zhang, J. A data-driven multi-model methodology with deep feature selection for short-term wind forecasting. Applied Energy 190 (2017), 1245–1257.
- (4) Freeman, B. S., Taylor, G., Gharabaghi, B., and Thé, J. Forecasting air quality time series using deep learning. Journal of the Air & Waste Management Association, just-accepted (2018).
- (5) Graves, A. Generating sequences with recurrent neural networks. preprint arXiv:1308.0850 (2013).
- (6) Graves, A., Jaitly, N., and Mohamed, A.-r. Hybrid speech recognition with deep bidirectional lstm. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on (2013), IEEE, pp. 273–278.
- (7) Graves, A., Mohamed, A.-R., and Hinton, G. Speech recognition with deep recurrent neural networks. In ICASSP (2013), IEEE, pp. 6645–6649.
- (8) Hochreiter, S., and Schmidhuber, J. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
- (9) Houthuys, L., Karevan, Z., and Suykens, J. A. K. Multi-view LS-SVM regression for black-box temperature prediction in weather forecasting. IJCNN (2017), 1102–1108.
- (10) Hu, Q., Su, P., Yu, D., and Liu, J. Pattern-based wind speed prediction based on generalized principal component analysis. IEEE Transactions on Sustainable Energy 5, 3 (2014), 866–874.
- (11) Karevan, Z., Feng, Y., and Suykens, J. A. K. Moving least squares support vector machines for weather temperature prediction. In Proc. of the European Symposium on Artificial Neural Networks (2016), pp. 611–616.
- (12) Lipton, Z. C., Kale, D. C., Elkan, C., and Wetzel, R. Learning to diagnose with LSTM recurrent neural networks. ICLR (2016).
- (13) Liu, J., Shahroudy, A., Xu, D., and Wang, G. Spatio-temporal LSTM with trust gates for 3D human action recognition. In European Conference on Computer Vision (2016), Springer, pp. 816–833.
- (14) Ng, J. Y.-H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., and Toderici, G. Beyond short snippets: Deep networks for video classification. In CVPR, 2015 (2015), IEEE, pp. 4694–4702.
- (15) Patraucean, V., Handa, A., and Cipolla, R. Spatio-temporal video autoencoder with differentiable memory. preprint arXiv:1511.06309 (2015).
- (16) Su, H., Maji, S., Kalogerakis, E., and Learned-Miller, E. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision (2015), pp. 945–953.
- (17) Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in neural information processing systems (2014), pp. 3104–3112.
- (18) Tian, Y., and Pan, L. Predicting short-term traffic flow by long short-term memory recurrent neural network. In Smart City/SocialCom/SustainCom (SmartCity), 2015 IEEE International Conference on (2015), IEEE, pp. 153–158.
- (19) Zaremba, W., Sutskever, I., and Vinyals, O. Recurrent neural network regularization. preprint arXiv:1409.2329 (2014).