1 Introduction and Literature Review
Time series data arise in broad areas such as engineering, medicine, finance, and economics, and various statistical and machine learning techniques have been applied to time series analysis. Recently, several new scalable time series methods have been studied for forecasting [lv2015traffic, ahmad2017unsupervised], classification [zheng2014time], and clustering [mikalsen2018time], and these works illustrate performance gains over traditional time series techniques on large-scale problems. Moreover, spatial time series problems arise when there is a spatial dependency between neighboring time series. Spatial-temporal data arise in diverse areas, including power grids [bessa2015spatial], load demand forecasting [qiu2017empirical], weather forecasting [grover2015deep], smart city applications [tascikaraoglu2018evaluation], and transportation systems, such as traffic flow forecasting [polson2017deep, zhang2017deep].
Traffic flow prediction is one of the essential components of Intelligent Transportation Systems and one of the most challenging spatial-temporal problems, because of its recurrent and non-recurrent patterns and the physical dynamics involved. Traffic flow prediction can help travelers make better decisions and improve traffic management, while decreasing traffic congestion and air pollution. Recently, smart devices have increased the role of traffic flow prediction in our daily lives, helping people plan their travel and find the most efficient routes. With the advent of new sensing, computing, and networking technologies, such as cameras, sensors, radars, inductive loops, and GPS devices, a large volume of data is readily available [zheng2016big]. These increasingly large data sets mean that big data, and techniques to handle such data, play a key role in the success of future transportation systems [zhang2011data]. Hence, to improve the performance of transportation systems, researchers are motivated to take advantage of new spatial-temporal data-driven techniques and to design scalable algorithms, such as deep neural networks, capable of processing large volumes of data [lv2015traffic, al2015efficient].
1.1 Background
Starting in the 1970s with the original work of Gazis and Knapp [gazis1971line], there have been many studies applying time series forecasting techniques to the traffic flow prediction problem, including parametric techniques, such as auto-regressive integrated moving average (ARIMA) [kamarianakis2003forecasting] and Seasonal-ARIMA [kumar2015short], and statistical techniques, such as Bayesian analysis [ghosh2007bayesian, yu2003short, wang2014new]. However, these models have several limitations, because of their prior assumptions and their difficulty in handling missing data, noisy data, outliers, and the curse of dimensionality. Shallow-architecture neural networks can handle high-dimensional data, but cannot capture patterns of high computational complexity. With their superior performance on large-scale problems, deep neural networks became an alternative technique applied to large-scale multivariate time series forecasting problems.
Recently, there have been several attempts to design deep learning models for multivariate time series forecasting problems. The primary work related to ours proposes a stacked autoencoder (SAE) model to learn traffic flow features and illustrates the advantage of the SAE model over a multilayer perceptron [lv2015traffic]. In [huang2014deep], the authors propose stacked autoencoders with multi-task learning at the top layers of the neural network. A Deep Belief Network (DBN) composed of layers of restricted Boltzmann machines is proposed in [kuremoto2014time]. In [wang2018optimal], an ensemble of four categories of fully connected neural networks is applied to a time series forecasting problem. In [qiu2014ensemble], an ensemble of DBNs with Support Vector Regression for aggregating the outputs is proposed for time series forecasting. However, in fully connected neural networks, the number of parameters increases exponentially with the input size, so convergence of the model is computationally expensive and challenging. Several other neural network layers have been proposed to reduce the computational time and capture patterns in temporally complex, high-dimensional datasets.
Convolutional Neural Networks (CNNs) extract features from various types of input data, such as images, videos, and audio. Weight sharing, the main feature of CNNs, reduces the number of parameters in deep neural network models. These properties improve the performance of learning algorithms by reducing the complexity of the parameters [krizhevsky2012imagenet]. The performance of deep CNNs in multivariate time series forecasting has been examined: in [ma2017learning], the spatial-temporal relations of traffic flow data are represented as images, and a CNN model is trained on these images to forecast speed in large transportation networks. In [deng2018exploring], the authors studied image-like representations of spatial time series data using convolution layers and ensemble learning. However, a convolution layer considers spatial structure in a Euclidean space, which can miss some information in graph-structured data [henaff2015deep]. As an alternative approach, following [bruna2013spectral], spatial dependency is captured using bidirectional diffusion convolutional recurrent networks in [li2017graph], which illustrates that a graph-structured representation of time series data captures spatial relations among time series. Moreover, in the presence of temporal data, recurrent neural networks have shown great performance in time series forecasting [connor1994recurrent]. The vanishing gradient problem in deep multilayer perceptrons and recurrent neural networks is addressed by the Long Short-Term Memory (LSTM) cell [sak2014long], which significantly improves time series forecasting [zhao2017lstm], traffic speed prediction [ma2015long], and traffic flow estimation with missing data [tian2018lstm].
While convolutional neural networks can exhibit excellent performance on spatial data and recurrent neural networks have advantages on problems with temporal data, spatial-temporal problems combine both. In [xingjian2015convolutional], the authors propose a convolutional-LSTM layer for a weather forecasting problem, which considers spatio-temporal sequences. A convolutional deep learning model for multivariate time series forecasting is proposed in [yi2017grouped], with explicit grouping of input time series and implicit grouping using error backpropagation. In [cheng2017deeptransport], a CNN-LSTM model is applied to downstream and upstream data to capture physical relationships among traffic flow data: a convolutional layer is followed by an LSTM layer for the downstream and upstream traffic flow data. In [liu2018short], a CNN and a gated CNN followed by attention layers are illustrated for spatial-temporal data. The above works illustrate the capability of CNN-LSTM models for learning spatial-temporal features. However, there has been no analysis of designing a neural network architecture with separate components that capture spatial and temporal patterns individually.
1.2 Contribution
In the aforementioned works, spatial time series forecasting has been studied with the objective of proposing various types of convolutional and recurrent neural network layers. However, spatial-temporal data have their own specific patterns, which motivates us to use spatial and time series decomposition and to explicitly consider the various types of patterns when designing an efficient neural network architecture. Several challenges in spatial-temporal data should be considered in designing the deep neural network architecture. In spatial-temporal data, time series residuals are not merely meaningless noise; they are related to the physical properties and dynamics of spatially dependent time series. Moreover, convolutional layers can capture spatial and short-term patterns, but sliding convolution kernels over spatial features misses the network structure. In the presence of long-term patterns, an LSTM layer shows great performance in forecasting problems because it can separately capture patterns in detrended data. Furthermore, a challenging problem is to remain robust to small amounts of missing spatial-temporal data in time series forecasting.
In this paper, we address the problem of explicitly decomposing spatial-temporal patterns in designing a deep neural network, and we illustrate its performance improvement on a large-scale traffic flow prediction problem. The contributions of the paper are as follows:
- We illustrate an approach for explicitly considering various types of patterns in a deep neural network architecture for a spatial multivariate time series forecasting problem.
- We describe a Dynamic Time Warping-based clustering method and time series decomposition with the objective of finding compact regions with similar time series residuals.
- A multi-kernel convolution layer is designed for spatial time series data, to keep the spatial structure of the time series data and extract short-term and spatial patterns. It is followed by a convolution-LSTM component to capture long-term patterns from trends, and a pretrained denoising autoencoder to make predictions robust to missing data.
- The spatial and temporal patterns in traffic flow data are analyzed, and the performance gains of the proposed model relative to baseline and state-of-the-art deep neural networks are illustrated for traffic flow prediction, capturing meaningful time series residuals and providing predictions robust to missing data.
The rest of the paper is organized as follows. In Section 2, we define the problem. In Section 3, the technical background of the proposed model is presented. In Section 4, the proposed framework is illustrated, followed by the experimental results and the conclusion in Section 5.
2 Problem Definition
Time series data are a set of successive measurements x^i = (x^i_1, ..., x^i_T), where x^i_t is an observed variable at location i and time step t, and T is the size of the time series. A sensor i gathers x^i_t with f corresponding features. Spatial-temporal data are a set of multivariate time series {x^1, ..., x^n}, represented by a matrix of size n × T × f, where n is the number of sensors, which gather spatial-temporal data synchronously in a geographical area.
Given X as the set of all time series in a region, a spatial time series forecasting problem is cast as a regression problem. Given a time window of the last w steps and a prediction horizon h, the objective is to predict (x_{t+1}, ..., x_{t+h}), given (x_{t-w+1}, ..., x_t). The time window restricts the model to a small portion of the previous temporal data for predicting the horizon, while we expect the model to memorize the long-term patterns. In equation (1), the optimum parameter θ* gives the best model for forecasting the time series data. In a neural network, θ is the set of weights of the model, and the optimization algorithm minimizes a nonlinear loss function L by solving the following non-convex optimization problem,

θ* = argmin_θ Σ_t L(f(x_{t-w+1}, ..., x_t; θ), (x_{t+1}, ..., x_{t+h}))   (1)
In this paper, given spatial multivariate time series data X of size n × T × f, a deep neural network predicts an output of size n × h × f', where f' is the number of output features, for input data X.
3 Technical Background
Here, we detail the core components of the proposed approach, including Dynamic Time Warping, fuzzy hierarchical agglomerative clustering, convolutional layers, convolutional LSTM layers, and denoising autoencoders.
3.1 Dynamic Time Warping
The Dynamic Time Warping (DTW) algorithm finds an optimal alignment path between two time series. It compares each point of the first time series with the points of the second one; hence, similar patterns occurring in different time slots are considered similar. A dynamic programming method finds the optimal match [petitjean2012summarizing]. Here, we illustrate DTW for d-dimensional time series. Algorithm (1) finds the minimum distance between two d-dimensional time series of sizes T1 and T2.
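The dynamic programming recursion in Algorithm (1) can be sketched for one-dimensional series as follows; the function name and variable names are illustrative, not the paper's notation:

```python
def dtw_distance(a, b):
    """Minimum cumulative alignment cost between series a and b."""
    n, m = len(a), len(b)
    INF = float("inf")
    # D[i][j] = cost of the best warping path aligning a[:i] with b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]
```

Note that, unlike the Euclidean distance, series shifted in time can still have zero DTW distance, which is exactly why DTW is used to compare residuals occurring in different time slots.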
3.2 Fuzzy Hierarchical Clustering
Given n data points, a fuzzy hierarchical clustering method finds a membership matrix of size n × c, where c is the number of clusters and the entry u_ij illustrates the distance of data point i to cluster j. To apply a DTW-based clustering method, the main challenge is computing the mean of a cluster, addressed in [gupta1996nonlinear, niennattrakul2009shape, petitjean2012summarizing], because the initial values impact the final results of the algorithm. Hence, we consider a fuzzy hierarchical clustering method that does not need to find the cluster mean. Following the work of [konkol2015fuzzy], complete linkage is used for distances between clusters, and single linkage is used for distances between points, and between a point and a cluster. Algorithm (2) finds the membership matrix of sensors to clusters.
The distance matrix is the set of distances between all pairs of time series and clusters, and is initialized with the distances between all pairs of points. A function finds the closest pair of elements in the set, where each element can be a point or a cluster. A second matrix lists the clusters and their members; a merge function adds the selected pair of elements to this list, merging the two elements into a new cluster. A third matrix lists the assigned points, updated whenever a point is assigned to a cluster. Based on the new formation of clusters, an update function computes the new distances between points and clusters: it updates the distance of all clusters and unassigned points to the new cluster, the fuzzy distance of all assigned points to the new cluster, and the fuzzy distance of all points of the new cluster to the other clusters. The fuzzy distance between an assigned point and a cluster is obtained using equations (2)-(4).
(2)  
(3)  
(4) 
where the first quantity is the minimum distance of an assigned point to any of the clusters, u_ij is the membership value of assigned point i to cluster j, m is a fuzziness parameter, and the distance function is based on single linkage for each pair of points, or a point and a cluster, and complete linkage for two clusters.
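A simplified, crisp (non-fuzzy) version of the merge loop in Algorithm (2) can be sketched as follows, assuming a precomputed DTW distance matrix between sensors; the paper's stopping rule (average cluster length in miles) is replaced here by a target cluster count, and the fuzzy membership update is omitted, so this is an illustration of the agglomerative skeleton only:

```python
def agglomerate(dist, n_clusters):
    """Greedy agglomerative clustering on a precomputed distance matrix.

    dist[i][j] is e.g. the DTW distance between sensors i and j.
    Complete linkage is used between clusters, as in the text.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(c1, c2):  # complete linkage: farthest pair of members
        return max(dist[i][j] for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        # find and merge the closest pair of clusters
        a, b = min(((p, q) for p in range(len(clusters))
                    for q in range(p + 1, len(clusters))),
                   key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```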
3.3 Convolution Layer
A convolutional layer uses three ideas: local receptive fields, shared weights, and spatial subsampling; these make it an effective and efficient model for exploiting local stationarity in grid data [krizhevsky2012imagenet]. Given an input matrix of size m × n, a 2-dimensional convolution layer has a weight matrix of size k1 × k2, called a kernel, where k1 ≤ m and k2 ≤ n. A convolution multiplication with strides s1 and s2 is obtained by sliding the kernel over the input matrix. The kernel is a shared weight, which assumes locally stationary input data. Given the input of layer l, the output of the layer is obtained by applying the convolution, an activation function, and a bias vector. Pooling layers among successive convolution layers reduce the size of the hidden layers while extracting features in locally connected layers; a max-pooling layer selects the maximum value in a matrix of size p1 × p2 and reduces the dimensions of the layer by factors of p1 and p2.
3.4 Convolution-LSTM Layer
A Long Short-Term Memory (LSTM) cell is a special recurrent neural network cell with powerful modeling of long-term dependencies [sak2014long]. A memory cell C_t, input gate i_t, output gate o_t, and forget gate f_t work together in the hidden units H_t. Given the convolution operator * and the Hadamard product ∘, a convolutional LSTM is as follows [xingjian2015convolutional],

i_t = σ(W_xi * X_t + W_hi * H_{t-1} + W_ci ∘ C_{t-1} + b_i)   (5)
f_t = σ(W_xf * X_t + W_hf * H_{t-1} + W_cf ∘ C_{t-1} + b_f)   (6)
C_t = f_t ∘ C_{t-1} + i_t ∘ tanh(W_xc * X_t + W_hc * H_{t-1} + b_c)   (7)
o_t = σ(W_xo * X_t + W_ho * H_{t-1} + W_co ∘ C_t + b_o)   (8)
H_t = o_t ∘ tanh(C_t)   (9)
A convolution-LSTM layer has the same structure as a convolution layer, but with LSTM cells. The gates prevent the gradient from vanishing quickly by storing it in the memory cell. The convolution-LSTM layer receives an input of size w × m × n × f, where w is the time window, the matrix of size m × n holds the spatial information on a grid, and each element has f features.
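A single step of equations (5)-(9) can be sketched in NumPy as follows; for brevity this sketch omits the peephole terms (W_ci, W_cf, W_co) and uses one channel and one filter, so it illustrates the gate structure rather than a full layer:

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 'same' 2-D convolution (cross-correlation) for one channel."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h, c, W):
    """One ConvLSTM step per equations (5)-(9), peephole terms omitted.

    W maps gate names ('xi', 'hi', ..., 'bo') to kernels / scalar biases.
    """
    i = sigmoid(conv2d_same(x, W["xi"]) + conv2d_same(h, W["hi"]) + W["bi"])
    f = sigmoid(conv2d_same(x, W["xf"]) + conv2d_same(h, W["hf"]) + W["bf"])
    g = np.tanh(conv2d_same(x, W["xc"]) + conv2d_same(h, W["hc"]) + W["bc"])
    c_new = f * c + i * g                     # equation (7)
    o = sigmoid(conv2d_same(x, W["xo"]) + conv2d_same(h, W["ho"]) + W["bo"])
    h_new = o * np.tanh(c_new)                # equation (9)
    return h_new, c_new
```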
3.5 Denoising Stacked Autoencoder
Given input data x of dimension d, an autoencoder transforms the input with a nonlinear function z = f(x), where z lies in a lower-dimensional space [vincent2008extracting]. The decoder generates x' = g(z), where x' has the same dimension as x. In the training process, the objective is to reconstruct x by minimizing a loss function L, such as the least-squares function, between x and x', obtaining the optimum model parameters over all input data as follows,

θ* = argmin_θ Σ_x L(x, g(f(x; θ)))   (10)

Stacked autoencoders are a set of multiple autoencoder layers, in which the input of each layer is the output of the previous layer [vincent2010stacked]. The input data are corrupted with some noise, while the output remains unchanged. Adding noise to the input data and training the neural network to reconstruct the clean output helps the neural network learn features that are robust to noisy data. The noise can be added to the network using a dropout function in each layer, in which a given percentage of neurons is dropped in each round of the training process. Unsupervised training of stacked autoencoders in this form is capable of reconstructing the original data in the presence of noisy or missing data [zhou2017delta, vincent2010stacked, gondara2018mida].
4 Methodology
In this section, we describe the architecture of the proposed deep learning framework for the spatial time series forecasting problem. The proposed framework is illustrated in Fig. (1). The network structure represents the distance between neighboring sensors, and the spatial-temporal data include the time series data for each sensor.
4.1 Preprocessing
A time series decomposition method is applied to the input time series, generating three components: the seasonal, trend, and residual time series, respectively. In spatial time series data, the residuals can be more than mere noise. For example, in a transportation network, time series residuals can be caused by the traffic evolution of the network, and they form meaningful patterns among neighboring time series, as analyzed in the experimental results of Section 5.
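A minimal classical additive decomposition can be sketched as follows; the paper does not specify its exact decomposition procedure, so the moving-average trend estimate here is an assumption:

```python
import numpy as np

def decompose(x, period):
    """Additive decomposition x = seasonal + trend + residual.

    A centered moving average estimates the trend; the seasonal component
    is the per-phase mean of the detrended series (classical decomposition).
    """
    x = np.asarray(x, dtype=float)
    kernel = np.ones(period) / period
    trend = np.convolve(x, kernel, mode="same")
    detrended = x - trend
    # average each phase of the cycle, then tile back to full length
    seasonal = np.array([detrended[p::period].mean() for p in range(period)])
    seasonal = np.tile(seasonal, len(x) // period + 1)[: len(x)]
    residual = x - trend - seasonal
    return seasonal, trend, residual
```

By construction the three components sum back to the original series, so the residual carries exactly what the seasonal and trend parts do not explain.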
To apply Algorithm (2) to the time series residuals, we consider a set of geographically nearest neighbors for each sensor. The algorithm updates single-linkage distances between pairs of time series from this set; thus the clusters are not distributed across a wide geographical area. Since some of the sensors might affect more than one regional cluster, the clustering algorithm outputs a fuzzy membership of each sensor to its similar clusters; each sensor has a membership value for some clusters. We say two time series are similar if they have similar patterns over some time shift, or have zero distance from each other. Hence, for a given distance function, which we take to be DTW, the fuzzy hierarchical clustering algorithm finds clusters of sensors with similar residual time series by minimizing the distance among cluster members. To represent short-term similarity among neighboring time series, we use a rolling window on the training data and average the corresponding DTW distances. A rolling window finds similarity between short-term time windows of a neighboring area. To reduce computational time, the rolling window is only applied when there is high interaction among neighboring time series; for example, in traffic flow data, the interaction among neighboring sensors increases during peak hours and congestion periods. Applying Algorithm (2) with the aforementioned modifications to the spatial time series finds fuzzy clusters of time series based on DTW distance.
4.2 Neural Network Architecture
The details of the deep neural network are represented in Fig. 2. The time series residuals, detrended and represented by a matrix of size n × w, are the first input of the neural network. A convolutional component is applied to extract patterns from the time series residuals. For a given set of time series, a general convolution kernel slides along the first and second axes. However, because the sensors can have a spatial structure, like sensors in a transportation network, sliding a kernel over the sensor axis does not preserve the structure of the network. Moreover, each sensor's time series residuals depend only on a small region of the network. Hence, we propose a multi-kernel convolution layer, which receives the cluster set and the residual time series data. The kernel for a cluster only has trainable weights for the sensors in that cluster; in other words, only the sensors in the cluster have local connectivity to the same residual time series. Each kernel slides over the time axis to obtain hidden units, followed by a pooling layer. Several convolution-ReLU-pooling layers extract short-term and spatial patterns from the time series residuals in each neighborhood. The outputs of the kernels are concatenated and connected to a fully-connected layer, represented by a hidden layer whose dimensions are the number of features represented in the convolution layers and the total number of sensors.
The time series trends represent long-term patterns. The trends are concatenated with the residual features on the last axis. Unlike residuals, which represent the physical dynamics of the problem and are only similar between neighboring areas, trends can represent global changes in the spatial-temporal data. Hence, we consider LSTM cells to capture long-term patterns from the concatenated output of the extracted features of the smaller regions. The model continues with 2-dimensional convolution-LSTM layers. A 2-dimensional convolution-LSTM layer, described in Section 3.4, receives the concatenated input and applies the convolution over the spatial grid with two channels. This convolutional layer has a different architecture from the first multi-kernel convolution layer; that is, each neural cell is an LSTM cell, and the convolution is applied over all input sensors. Several convolution-LSTM layers extract features from the residuals and trends. Seasonal components represent patterns that repeat over the given time horizon; the output is therefore concatenated with the seasonal patterns of the time window and followed by a fully-connected layer. The output consists of the predicted values for all sensors over the prediction horizon h.
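The multi-kernel idea above, one kernel per cluster sliding only over the time axis, can be sketched as follows; the function and argument names are illustrative, and a single filter per cluster is used for brevity:

```python
import numpy as np

def multi_kernel_conv(residuals, clusters, kernels):
    """Apply one temporal kernel per cluster of sensors.

    residuals: (n_sensors, window) array of detrended residual series.
    clusters:  list of sensor-index lists (possibly overlapping, fuzzy).
    kernels:   one (len(cluster), k) weight matrix per cluster; a kernel
               only connects the sensors inside its own cluster and slides
               over the time axis, preserving the network structure.
    """
    outputs = []
    for members, W in zip(clusters, kernels):
        sub = residuals[members]                  # (|cluster|, window)
        k = W.shape[1]
        T = sub.shape[1] - k + 1
        # valid convolution over time: one feature value per time offset
        feats = np.array([np.sum(sub[:, t:t + k] * W) for t in range(T)])
        outputs.append(feats)
    return outputs
```

The per-cluster outputs would then be concatenated and fed to the fully-connected layer described in the text.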
One of the challenges in spatial-temporal data is to produce robust predictions in the presence of missing or noisy data. Hence, we consider an autoencoder as the last component of the model. A denoising autoencoder-decoder reconstructs the last output for each cluster. In the pretraining step, for a given prediction horizon and cluster, each denoising autoencoder-decoder reconstructs its cluster's output, with dropout layers between successive layers. The denoising autoencoder component generates the predictions. As the output of the autoencoders is designed based on the clusters, some sensors appear in more than one cluster; the fully-connected target layer is connected to all common variables between the denoising autoencoders with a linear activation function. In the training process, the objective is to minimize a loss function L, such as the mean squared error, between the predicted and real values, obtaining the optimum model parameters for the input data using stochastic gradient descent,

θ* = argmin_θ Σ_{(X, y)} L(f(X; θ), y)   (11)
5 Experimental Analysis
We illustrate the analysis and the performance of the proposed methodology on traffic flow data.
5.1 Data Set
We use traffic flow data from the Bay Area of California, represented in Fig. 3, which is commonly used and available from PEMS [californiapems]. The traffic flow is gathered every 30 seconds and aggregated every 5 minutes in the dataset. Each sensor on the California highways reports flow, occupancy, and speed at each time stamp. A sensor is a loop detector device at a mainline, off-ramp, or on-ramp location. In preprocessing, we selected the 597 sensors that have more than a threshold number of observed values in the first six months of 2016.
5.2 Pattern analysis in traffic data
To illustrate the specific characteristics of traffic data arising from the dynamics of traffic flow, we analyze the spatial, short-term, and long-term patterns.
In Fig. 4, an additive time series decomposition of traffic flow data is illustrated for one station. Given a one-day frequency, the time series decomposition shows similar, repeated (seasonal) daily patterns. Moreover, there are long-term weekly patterns, shown as trends. The long-term patterns, both seasonal and trend, arise from similar periodic demand generated outside the highway network; in other words, they are related to the origin-destination matrix of vehicles in the network. The residual patterns are not merely random noise, but rather the result of spatial and short-term patterns, and are related to the evolution of traffic flow or sudden changes in smaller regions of the network. Consequently, they show more similar patterns for neighboring time series, as illustrated in Section 5.2.1.
5.2.1 Residual time series in traffic flow data
A time series decomposition consists of residual, trend, and seasonal components. The residuals are usually interpreted as random noise in time series data. However, in traffic flow data, the residuals are the result of the physical evolution of the network. In non-data-driven traffic flow problems, the first-order and second-order traffic flow fundamental diagrams show the relation between traffic flow, occupancy, and speed. Given a wave speed w, free-flow speed v, and maximum density k_max in a road segment, the first-order dynamical traffic flow theorem approximates the flow at density k by q = min(v k, w (k_max - k)). The wave speed reduces flow at high occupancy. In Fig. 6, we examine the nonlinear relation of flow, speed, and occupancy over one day at one station. It illustrates how high occupancy and the resulting reduction of average speed in the road segment lead to traffic congestion. This property of the fundamental traffic flow diagram explains the non-repeated, residual patterns in traffic flow data as a result of congestion. Congestion propagation in a transportation network reveals the relation among neighboring sensors on highways, described in Fig. 6 for the flow data of three successive sensors: congestion propagates over these sensors with nearly 20 minutes of delay. For a larger area, in Fig. 7, the speed of 13 successive sensors is represented as an image. The reduction of speed in peak hours is presented with darker colors, illustrating that the reduction in speed is similar in neighboring areas.
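The first-order (triangular) fundamental diagram can be illustrated with a small helper; the speeds and densities below are made-up numbers for illustration, not values calibrated to the dataset:

```python
def triangular_flow(density, v_free, w, k_max):
    """Triangular fundamental diagram: flow grows linearly with density at
    the free-flow speed, then falls at wave speed w toward jam density."""
    return min(v_free * density, w * (k_max - density))
```

For example, with v_free = 60, w = 12, and k_max = 120, flow peaks at the critical density k = 20 and falls off on either side, which is the congestion branch the residual patterns reflect.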
5.3 Fuzzy Hierarchical Clustering
In this section, we illustrate the results of the fuzzy clustering applied to the time series residuals. In Fig. 8, the DTW distance matrix shows residual similarity among neighboring sensors. The matrix shows the average DTW distance for the peak times of the training data, which have the highest dependency because of high occupancy values. Each cluster is obtained by comparing neighboring sensors. Near the diagonal, the lowest distance values show the similarity between neighboring sensors.
After preprocessing of the time series data, there are 597 sensors with complete data over a period of six months. The fuzzy clustering finds the membership of each sensor to the clusters. In the fuzzy membership matrix, we consider a threshold of 0.1: all sensors with a membership value greater than 0.1 are considered members of a cluster. We also constrain the average spatial extent of the clusters to less than 10 miles; the agglomerative clustering stops when the average exceeds 10 miles. As the clustering is applied to mainline stations, we also add the on-ramp and off-ramp sensors to their closest mainline stations. The result of the fuzzy hierarchical clustering method is 64 clusters, with an average of 9.7 elements, a standard deviation of 4.2, a minimum cluster size of 3, and a maximum of 14. The lengths of the smallest and largest clusters are 0.3 miles and 32.1 miles, and 53 sensors appear in more than one cluster, nearly 9% of the total sensors. To examine the relation between trends in one spatial area, we obtain the DTW distance of each pair of sensors using a rolling window. For each time window, we normalize trends by subtracting the last value from all time stamp values. The average DTW distance over all pairs of sensors is 0.7, which shows the high similarity of trends. By contrast, as presented in Section 5.2.1, the average DTW distance of the time series residuals over all pairs of sensors is 4.5, while applying the fuzzy clustering method reduces the average DTW distance within clusters to 0.6. As a result, we only apply fuzzy clustering to the time series residuals.
5.4 Results of comparison
Our model is demonstrated to outperform the state-of-the-art on the traffic flow prediction dataset. All models are trained using the ADAM optimizer [kingma2014adam]. The batch size of each iteration is set to 512, with 400 epochs. All experiments are implemented in TensorFlow [abadi2016tensorflow] and conducted on an NVIDIA Tesla K80 GPU. We used a grid search to find the deep neural network architectures with the best performance and most efficient computational time.
The input matrix has dimensions of sensors by time window by features, with 597 sensors and three features: flow, occupancy, and speed. For the MLP, LSTM, CNN, and the proposed multi-kernel CNN-LSTM models, the input is reshaped to the appropriate dimensions, described in the model details in Section 5.4.2. For all models, the data are transformed into the range [0, 1]. For the models without a time series decomposition component, including MLP, LSTM, and CNN, we transform the data into stationary data by subtracting the value at the last time step of the window from all input values; detrending for the models with time series decomposition components is as follows. The residual time series is already stationary. To feed the trend and seasonal components to a neural network, we make them stationary by subtracting the last value of each time window. To recover the output, we add the predicted value to the sum of the subtracted trend and seasonal values. The output matrix contains the predictions for each sensor, where the horizon corresponds to the 15-min, 30-min, 45-min, and 60-min predictions in the result tables.
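The windowing and last-value detrending described above can be sketched as follows; the function name is ours:

```python
import numpy as np

def make_windows(series, window, horizon):
    """Slice a series into (input window, prediction horizon) pairs, and make
    each input stationary by subtracting its last value, as in the text."""
    X, Y = [], []
    for t in range(len(series) - window - horizon + 1):
        past = np.array(series[t:t + window], dtype=float)
        future = np.array(series[t + window:t + window + horizon], dtype=float)
        last = past[-1]
        X.append(past - last)      # stationary input window
        Y.append(future - last)    # target, recovered later by adding `last`
    return np.array(X), np.array(Y)
```

At prediction time, the subtracted value is added back to recover the forecast on the original scale.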
5.4.1 Baseline models
As illustrated in early traffic flow prediction studies, traffic flow patterns are similar at the same hours and weekdays. The first baseline model (Ave-weekday-hourly) uses the average traffic flow of each station as a timetable for each time of day, given a weekday; the short-term prediction for each sensor is obtained using the average values in the training data. The second baseline model (current-value) uses the current value of traffic flow as the short-term prediction.
5.4.2 StateoftheArt Neural Network Models
In this section, we describe the implemented neural network models. A multilayer perceptron (MLP) with three fully connected layers, Xavier initialization [glorot2010understanding], ReLU activation, and (500, 300, 200) hidden units is used. A deep belief network (DBN) with greedy layer-wise pretraining of autoencoders finds a good initialization for a fully-connected neural network; a fine-tuning step for the stacked autoencoder finishes the training, and we use 30 epochs for pretraining each layer. A fully connected Long Short-Term Memory neural network (LSTM) is capable of capturing long-term temporal patterns. However, in most studies, fully connected LSTM models have a shallow structure, and they are slow to converge despite their strong capability for capturing long-term patterns. The optimum LSTM neural network structure has hidden units of size (400, 200), and the input is reshaped from a vector to a two-dimensional matrix. To use a convolutional neural network (CNN) for time series forecasting, the input matrix is reshaped into three dimensions, where the channels are traffic flow, speed, and occupancy. The optimum implemented deep CNN model has four layers with max-pooling and batch normalization layers. The numbers of filters are (16, 32, 64, 128), the kernel is (5, 5), and the max-pooling layers reduce the dimension by two in each layer. Two fully connected layers connect the convolution layers to the output layer.
The CNN-LSTM model captures short-term and spatial patterns in CNN layers, and temporal patterns in LSTM layers. Two convolution layers are applied over all input sensors with filters (16, 32). LSTM layers of size (300, 150) follow the output of the CNN component, followed by a fully connected layer. The C-CNN-LSTM model is a clustering-based CNN-LSTM, in which a multi-kernel convolution layer extracts spatial and short-term patterns from the time series residuals, using the clusters obtained in Section 5.3.
A pretrained denoising stacked autoencoder-decoder is applied to each cluster of sensors to generate a robust output. Each layer is connected to a dropout layer with a rate of 0.2. As the average cluster size is nearly 10 with a standard deviation of 4, described in Section 5.3, we use the same architecture for all clusters, with fully connected layers of (40, 20, 10, 20, 40) units and ReLU activation. The pretraining is done in 60 epochs, and the weights are loaded into the proposed model described next.
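The pretraining of one denoising layer can be sketched as single SGD steps in NumPy; the layer sizes, corruption rate, and learning rate here are illustrative toy values, and a real implementation would use the stacked (40, 20, 10, 20, 40) architecture described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, rate):
    """Dropout-style corruption: randomly zero out a fraction of the inputs."""
    return x * (rng.random(x.shape) >= rate)

def dae_step(x, params, lr=0.05):
    """One SGD step of a single-layer denoising autoencoder (a sketch).

    The input is corrupted, but the reconstruction target is the clean x,
    which is what makes the learned features robust to missing values.
    """
    W1, b1, W2, b2 = params
    xc = corrupt(x, 0.2)
    h = np.maximum(0.0, xc @ W1 + b1)         # ReLU encoder
    x_hat = h @ W2 + b2                       # linear decoder
    err = x_hat - x                           # compare against the CLEAN input
    # gradients (up to a constant factor absorbed into lr)
    dW2 = h.T @ err / len(x); db2 = err.mean(axis=0)
    dh = (err @ W2.T) * (h > 0)
    dW1 = xc.T @ dh / len(x); db1 = dh.mean(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2
    return np.mean(err ** 2)
```

Running many such steps drives the reconstruction error down even though the encoder only ever sees corrupted inputs.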
The cluster-based CNN-LSTM with denoising autoencoders (C-CNN-LSTM-DA) is the model proposed in section 4, which uses clustering of time series residuals, trends, and seasonal components, along with denoising autoencoders and time series decomposition, for each cluster. The architecture consists of two convolution layers with ReLU and max-pooling layers and filters (32, 64), followed by two fully connected layers and two two-dimensional convolutional LSTM layers with filters (16, 32) for capturing long-term patterns.
5.5 Performance Measure
To evaluate the performance of the proposed model, we use two performance indices, Mean Absolute Error (MAE) and Root Mean Square Error (RMSE), given in equations (12) and (13).
\[ \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{12} \]
\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{13} \]
Here, $y_i$ are the real values and $\hat{y}_i$ are the predicted values. In this paper, predictions are made for 15-min, 30-min, 45-min, and 60-min time horizons.
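In code, the two measures are (a minimal numpy sketch):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error, equation (12)."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root Mean Square Error, equation (13)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy flow values (vehicles per interval).
y_true = np.array([100.0, 120.0, 90.0])
y_pred = np.array([110.0, 115.0, 95.0])
print(mae(y_true, y_pred))   # ≈ 6.667
print(rmse(y_true, y_pred))  # ≈ 7.071
```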
5.6 Performance results on testing data
In the first analysis, we compare the performance of the models for traffic flow prediction. The results are presented in Table 1, which compares all described models for prediction horizons of 15 min, 30 min, 45 min, and 60 min on testing data. The two baseline models have the worst performance, and the neural network models perform much better. The LSTM model outperforms the MLP, DBN, and CNN models, demonstrating its strength in time series forecasting. CNN-LSTM models are more capable of capturing short-term and long-term patterns and are comparable with LSTM. Two models, C-CNN-LSTM and C-CNN-LSTM-DA, perform better due to explicitly separating spatial regions, and their performance is quite close to each other. In the next sections, we show that in the presence of missing data, the model with denoising autoencoders performs better.
Baseline models: Current value, Average weekday-hourly. State-of-the-art neural networks: MLP, DBN, LSTM, CNN, CNN-LSTM. Proposed models: C-CNN-LSTM, C-CNN-LSTM-DA.

Horizon  Metric  Current value  Avg weekday-hourly  MLP   DBN   LSTM  CNN   CNN-LSTM  C-CNN-LSTM  C-CNN-LSTM-DA
15 min   MAE     24             27.1                16.3  15.5  14    16    13.6      12.3        12.1
15 min   RMSE    36             43.2                28.1  27    25    27.4  24.8      23.1        22.7
30 min   MAE     31             27.1                16.9  15.9  14.4  16.2  14.3      12.7        12.4
30 min   RMSE    45             43.2                29    28    26.2  28.4  26.0      23.4        22.9
45 min   MAE     38             27.1                17.1  16.2  14.9  16.8  15        12.9        12.8
45 min   RMSE    54             43.2                29.8  29    28.1  29.3  28.2      23.4        23.1
60 min   MAE     44             27.1                17.6  16.5  15.2  17.2  15.1      13.3        13.3
60 min   RMSE    63             43.2                30.8  29.3  28.4  30.1  28.1      23.8        23.7
5.7 Performance results of peak and off-peak traffic
The next experiment compares peak and off-peak hours. In peak hours, the physical properties and evolution of traffic flow affect congestion propagation in the network; therefore, the residuals arise from the evolution of traffic and represent meaningful spatial patterns. On the other hand, in off-peak hours, traffic flows at free speed without congestion, so the flow is obtained from long-term patterns in the network. In Fig. 9, the outputs of the C-CNN-LSTM-DA and MLP models are illustrated. Among the neural network models, MLP has the worst traffic flow prediction performance, while C-CNN-LSTM-DA is the best in Table 1. The C-CNN-LSTM-DA model performs better in peak hours, when residuals are high, which shows the weakness of fully connected neural networks in capturing residual patterns. In off-peak hours, all neural network models have comparable, good performance.
In Table 2, we compare the performance of the models on peak and off-peak traffic flow data. We select the MLP and LSTM models, which only capture temporal patterns, along with the proposed model, which explicitly captures spatial patterns. The table compares residual MAE and MAE on the predicted values; for residual MAE, we detrend the traffic flow predictions and then compute the MAE. For off-peak hours, LSTM performs comparably to the proposed model, as it simply captures long-term patterns. However, C-CNN-LSTM-DA performs substantially better than LSTM in peak hours. In Fig. 10, we compare LSTM and C-CNN-LSTM-DA; C-CNN-LSTM-DA captures large residuals that LSTM misses. While a smoother prediction that ignores noise indicates the model is not overfitted, meaningful residual patterns in spatial data need to be carefully considered in the model's predictions.
Flow state     Metric        MLP   LSTM  C-CNN-LSTM-DA
Off-peak time  Residual MAE  6.1   4.3   4.4
Off-peak time  MAE           12.3  11.9  11.8
Peak time      Residual MAE  12.2  11.1  8.2
Peak time      MAE           18.3  16.8  13.8
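The residual-MAE computation can be sketched as follows; here a centered moving average stands in for the detrending step, which is an assumption on our part — the paper's own decomposition component may detrend differently:

```python
import numpy as np

def detrend(series, window=12):
    """Subtract a moving-average trend (e.g. 12 five-minute steps = 1 hour)."""
    kernel = np.ones(window) / window
    trend = np.convolve(series, kernel, mode="same")
    return series - trend

def residual_mae(y_true, y_pred, window=12):
    """MAE between the detrended observed and detrended predicted series."""
    return np.mean(np.abs(detrend(y_true, window) - detrend(y_pred, window)))

# Toy daily profile with noise: the residual MAE isolates how well the
# high-frequency part (the residuals) is predicted.
t = np.arange(288)
observed = 100 + 50 * np.sin(2 * np.pi * t / 288) \
    + np.random.default_rng(0).normal(0, 5, 288)
predicted = 100 + 50 * np.sin(2 * np.pi * t / 288)
print(residual_mae(observed, predicted))
```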
5.8 Performance results with missing data
We evaluate the proposed model relative to the other neural network models in the presence of missing data. We randomly generate blocks of missing values in the test data. Each block affects one randomly selected sensor, starts at a random time, and has a duration generated from a normal distribution with mean 2 hours and standard deviation 0.5 hours. For each sensor, one block of missing values is generated per week. The missing data increase the model's prediction error, and the real values are used for evaluation. Since the missing values are applied to only one random sensor at a time, we expect neighboring sensors to help the model continue to predict the traffic flows well.
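The corruption procedure can be sketched as follows (a numpy sketch that reads the normally distributed quantity as the block duration; the function name and the one-week array sizes are illustrative assumptions):

```python
import numpy as np

def add_missing_block(data, steps_per_hour=12, mean_h=2.0, sd_h=0.5, seed=0):
    """Corrupt one randomly chosen sensor with a contiguous block of NaNs.

    The block duration is drawn from a normal distribution with mean
    2 hours and standard deviation 0.5 hours, as in the experiment setup.
    """
    rng = np.random.default_rng(seed)
    corrupted = data.copy()
    n_steps, n_sensors = data.shape
    sensor = rng.integers(n_sensors)
    length = max(1, int(round(rng.normal(mean_h, sd_h) * steps_per_hour)))
    start = rng.integers(0, max(1, n_steps - length))
    corrupted[start:start + length, sensor] = np.nan
    return corrupted

# One week of 5-minute data (2016 steps) over 20 sensors.
week = np.random.default_rng(1).random((2016, 20))
corrupted = add_missing_block(week)
print(np.isnan(corrupted).sum())  # number of missing steps in the block
```

Only one sensor column is corrupted per call, so its uncorrupted neighbors remain available to the model.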
The results are presented in Table 3; for brevity, we only show the 30-min predictions. C-CNN-LSTM-DA outperforms the other models in the presence of missing values. In Fig. 11a, we illustrate the increase in error of the different models with random missing values. The figure shows a smaller increase in error for C-CNN-LSTM-DA, as the time series decomposition and denoising autoencoder components generate predictions that are more robust to missing values. In the presence of missing data, the LSTM network's forecasts can deviate far from the real values, as shown in Fig. 11b.
Metric  LSTM  CNN-LSTM  C-CNN-LSTM  C-CNN-LSTM-DA
MAE     16.7  16.5      14.1        13.1
RMSE    28.8  28.9      25.2        23.9
6 Conclusion and future work
This paper presents a new framework for the spatial time series forecasting problem and its application to traffic flow data. The proposed method consists of several components. First, for time series data gathered on a network, a standard convolution layer does not capture the network structure, because its kernel slides over spatial locations. Hence, we obtain fuzzy clusters of time series and apply a multi-kernel convolution component, in which each kernel slides only over time steps and preserves the network structure of the time series. Second, time series residuals are not merely noise; they result from interactions among neighboring time series. As an example, the similarity of time series residuals is presented in Fig. 6 and Fig. 8. Thus, a convolution component is applied on the time series residuals to extract short-term interactions among neighboring time series. In Table 2, we evaluate the prediction of time series residuals: in off-peak hours, the performance of the proposed model matches LSTM, but the proposed model performs better in peak hours. Table 1 compares the baseline, state-of-the-art neural network, and CNN-LSTM models for traffic flow prediction; the C-CNN-LSTM model, which uses spatial clustering and time series decomposition, performs better than the baseline and state-of-the-art models. One of the challenges in spatial-temporal data is working with missing data. We illustrate the performance of using a pretrained denoising autoencoder-decoder as the last component of C-CNN-LSTM-DA in Fig. 11, which shows that the increase in error for the model with denoising autoencoders is smaller than for the other models.
This study demonstrates the effectiveness of designing new neural network architectures around the specific properties of spatial-temporal problems. Each component of the neural network is designed based on the characteristics of the extracted patterns. In the experimental results, we analyzed the spatial-temporal patterns in traffic flow data and illustrated the effect of each neural network component on the improvement of the results. Related analyses can be constructed for other spatial-temporal problems, such as anomaly detection, missing data imputation, time series clustering, and time series classification. In addition, different spatial-temporal problems have different physical properties or dynamical systems, which make their patterns unique to each problem.
References
 (1) Y. Lv, Y. Duan, W. Kang, Z. Li, F.-Y. Wang, Traffic flow prediction with big data: A deep learning approach, IEEE Transactions on Intelligent Transportation Systems 16 (2) (2015) 865–873.
 (2) S. Ahmad, A. Lavin, S. Purdy, Z. Agha, Unsupervised real-time anomaly detection for streaming data, Neurocomputing 262 (2017) 134–147.
 (3) Y. Zheng, Q. Liu, E. Chen, Y. Ge, J. L. Zhao, Time series classification using multi-channels deep convolutional neural networks, in: International Conference on Web-Age Information Management, Springer, 2014, pp. 298–310.
 (4) K. Ø. Mikalsen, F. M. Bianchi, C. Soguero-Ruiz, R. Jenssen, Time series cluster kernel for learning similarities between multivariate time series with missing data, Pattern Recognition 76 (2018) 569–581.
 (5) R. J. Bessa, A. Trindade, V. Miranda, Spatial-temporal solar power forecasting for smart grids, IEEE Transactions on Industrial Informatics 11 (1) (2015) 232–241.
 (6) X. Qiu, Y. Ren, P. N. Suganthan, G. A. Amaratunga, Empirical mode decomposition based ensemble deep learning for load demand time series forecasting, Applied Soft Computing 54 (2017) 246–255.
 (7) A. Grover, A. Kapoor, E. Horvitz, A deep hybrid model for weather forecasting, in: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2015, pp. 379–386.
 (8) A. Tascikaraoglu, Evaluation of spatio-temporal forecasting methods in various smart city applications, Renewable and Sustainable Energy Reviews 82 (2018) 424–435.
 (9) N. G. Polson, V. O. Sokolov, Deep learning for short-term traffic flow prediction, Transportation Research Part C: Emerging Technologies 79 (2017) 1–17.
 (10) J. Zhang, Y. Zheng, D. Qi, Deep spatio-temporal residual networks for citywide crowd flows prediction, in: AAAI, 2017, pp. 1655–1661.
 (11) X. Zheng, W. Chen, P. Wang, D. Shen, S. Chen, X. Wang, Q. Zhang, L. Yang, Big data for social transportation, IEEE Transactions on Intelligent Transportation Systems 17 (3) (2016) 620–630.
 (12) J. Zhang, F.-Y. Wang, K. Wang, W.-H. Lin, X. Xu, C. Chen, et al., Data-driven intelligent transportation systems: A survey, IEEE Transactions on Intelligent Transportation Systems 12 (4) (2011) 1624–1639.
 (13) O. Y. Al-Jarrah, P. D. Yoo, S. Muhaidat, G. K. Karagiannidis, K. Taha, Efficient machine learning for big data: A review, Big Data Research 2 (3) (2015) 87–93.
 (14) D. C. Gazis, C. H. Knapp, On-line estimation of traffic densities from time-series of flow and speed data, Transportation Science 5 (3) (1971) 283–301.
 (15) Y. Kamarianakis, P. Prastacos, Forecasting traffic flow conditions in an urban network: Comparison of multivariate and univariate approaches, Transportation Research Record: Journal of the Transportation Research Board 1857 (2003) 74–84.
 (16) S. V. Kumar, L. Vanajakshi, Short-term traffic flow prediction using seasonal ARIMA model with limited input data, European Transport Research Review 7 (3) (2015) 21.
 (17) B. Ghosh, B. Basu, M. O'Mahony, Bayesian time-series model for short-term traffic flow forecasting, Journal of Transportation Engineering 133 (3) (2007) 180–189.
 (18) G. Yu, J. Hu, C. Zhang, L. Zhuang, J. Song, Short-term traffic flow forecasting based on Markov chain model, in: IEEE Intelligent Vehicles Symposium, 2003, pp. 208–212.
 (19) J. Wang, W. Deng, Y. Guo, New Bayesian combination method for short-term traffic flow forecasting, Transportation Research Part C: Emerging Technologies 43 (2014) 79–94.
 (20) W. Huang, G. Song, H. Hong, K. Xie, Deep architecture for traffic flow prediction: Deep belief networks with multitask learning, IEEE Transactions on Intelligent Transportation Systems 15 (5) (2014) 2191–2201.
 (21) T. Kuremoto, S. Kimura, K. Kobayashi, M. Obayashi, Time series forecasting using a deep belief network with restricted Boltzmann machines, Neurocomputing 137 (2014) 47–56.
 (22) L. Wang, Z. Wang, H. Qu, S. Liu, Optimal forecast combination based on neural networks for time series forecasting, Applied Soft Computing 66 (2018) 1–17.
 (23) X. Qiu, L. Zhang, Y. Ren, P. N. Suganthan, G. Amaratunga, Ensemble deep learning for regression and time series forecasting, in: 2014 IEEE Symposium on Computational Intelligence in Ensemble Learning (CIEL), IEEE, 2014, pp. 1–6.
 (24) A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
 (25) X. Ma, Z. Dai, Z. He, J. Ma, Y. Wang, Y. Wang, Learning traffic as images: A deep convolutional neural network for large-scale transportation network speed prediction, Sensors 17 (4) (2017) 818.
 (26) S. Deng, S. Jia, J. Chen, Exploring spatial–temporal relations via deep convolutional neural networks for traffic flow prediction with incomplete data, Applied Soft Computing, 2018.
 (27) M. Henaff, J. Bruna, Y. LeCun, Deep convolutional networks on graph-structured data, arXiv preprint arXiv:1506.05163.
 (28) J. Bruna, W. Zaremba, A. Szlam, Y. LeCun, Spectral networks and locally connected networks on graphs, arXiv preprint arXiv:1312.6203.
 (29) Y. Li, R. Yu, C. Shahabi, Y. Liu, Graph convolutional recurrent neural network: Data-driven traffic forecasting, arXiv preprint arXiv:1707.01926.
 (30) J. T. Connor, R. D. Martin, L. E. Atlas, Recurrent neural networks and robust time series prediction, IEEE Transactions on Neural Networks 5 (2) (1994) 240–254.
 (31) H. Sak, A. Senior, F. Beaufays, Long short-term memory recurrent neural network architectures for large scale acoustic modeling, in: Fifteenth Annual Conference of the International Speech Communication Association, 2014.
 (32) Z. Zhao, W. Chen, X. Wu, P. C. Chen, J. Liu, LSTM network: A deep learning approach for short-term traffic forecast, IET Intelligent Transport Systems 11 (2) (2017) 68–75.
 (33) X. Ma, Z. Tao, Y. Wang, H. Yu, Y. Wang, Long short-term memory neural network for traffic speed prediction using remote microwave sensor data, Transportation Research Part C: Emerging Technologies 54 (2015) 187–197.
 (34) Y. Tian, K. Zhang, J. Li, X. Lin, B. Yang, LSTM-based traffic flow prediction with missing data, Neurocomputing 318 (2018) 297–305.
 (35) S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, W.-c. Woo, Convolutional LSTM network: A machine learning approach for precipitation nowcasting, in: Advances in Neural Information Processing Systems, 2015, pp. 802–810.
 (36) S. Yi, J. Ju, M.-K. Yoon, J. Choi, Grouped convolutional neural networks for multivariate time series, arXiv preprint arXiv:1703.09938.
 (37) X. Cheng, R. Zhang, J. Zhou, W. Xu, DeepTransport: Learning spatial-temporal dependency for traffic condition forecasting, arXiv preprint arXiv:1709.09585.
 (38) Q. Liu, B. Wang, Y. Zhu, Short-term traffic speed forecasting based on attention convolutional neural network for arterials, Computer-Aided Civil and Infrastructure Engineering 33 (11) (2018) 999–1016.
 (39) F. Petitjean, P. Gançarski, Summarizing a set of time series by averaging: From Steiner sequence to compact multiple alignment, Theoretical Computer Science 414 (1) (2012) 76–91.
 (40) L. Gupta, D. L. Molfese, R. Tammana, P. G. Simos, Nonlinear alignment and averaging for estimating the evoked potential, IEEE Transactions on Biomedical Engineering 43 (4) (1996) 348–356.
 (41) V. Niennattrakul, C. A. Ratanamahatana, Shape averaging under time warping, in: 6th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON 2009), Vol. 2, IEEE, 2009, pp. 626–629.
 (42) M. Konkol, Fuzzy agglomerative clustering, in: International Conference on Artificial Intelligence and Soft Computing, Springer, 2015, pp. 207–217.
 (43) P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, in: Proceedings of the 25th International Conference on Machine Learning, ACM, 2008, pp. 1096–1103.
 (44) P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.-A. Manzagol, Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion, Journal of Machine Learning Research 11 (Dec) (2010) 3371–3408.
 (45) T. Zhou, G. Han, X. Xu, Z. Lin, C. Han, Y. Huang, J. Qin, δ-agree AdaBoost stacked autoencoder for short-term traffic flow forecasting, Neurocomputing 247 (2017) 31–38.
 (46) L. Gondara, K. Wang, MIDA: Multiple imputation using denoising autoencoders, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2018, pp. 260–272.
 (47) California Department of Transportation, Performance Measurement System (PeMS), http://pems.dot.ca.gov/, 2017.
 (48) D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
 (49) M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., TensorFlow: A system for large-scale machine learning, in: OSDI, Vol. 16, 2016, pp. 265–283.
 (50) X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.