A Spatial-Temporal Decomposition Based Deep Neural Network for Time Series Forecasting

02/02/2019 · by Reza Asadi, et al.

Spatial time series forecasting problems arise in a broad range of applications, such as environmental and transportation problems. These problems are challenging because of the existence of specific spatial, short-term, and long-term patterns, and the curse of dimensionality. In this paper, we propose a deep neural network framework for large-scale spatial time series forecasting problems. The neural network architecture is explicitly designed to capture these various types of patterns. In preprocessing, a time series decomposition method separately feeds short-term, long-term, and spatial patterns into different components of the neural network. A fuzzy clustering method finds clusters of neighboring time series based on the similarity of their residuals, since residuals can carry meaningful short-term patterns in spatial time series. In the neural network architecture, each kernel of a multi-kernel convolution layer is applied to one cluster of time series to extract short-term features in neighboring areas. The output of the convolution layer is concatenated with the trends and followed by a convolution-LSTM layer that captures long-term patterns over larger regions. To make predictions robust to missing data, an unsupervised pretrained denoising autoencoder reconstructs the output of the model in a fine-tuning step. The experimental results show that the model outperforms baseline and state-of-the-art models on a traffic flow prediction dataset.




1 Introduction and Literature Review

Time series data arise in broad areas, such as engineering, medicine, finance, and economics. Various statistical and machine learning techniques have been applied to time series analysis. Recently, several scalable time series analyses have been studied, such as forecasting, anomaly detection ahmad2017unsupervised , classification zheng2014time , and clustering mikalsen2018time . These works demonstrate performance gains over traditional time series techniques on large-scale problems. Moreover, spatial time series problems arise when there is a spatial dependency between neighboring time series. Spatial-temporal data arise in diverse areas, including power grids bessa2015spatial , load demand forecasting qiu2017empirical , weather forecasting grover2015deep , smart city applications tascikaraoglu2018evaluation , and transportation systems, such as traffic flow forecasting polson2017deep , zhang2017deep .

Traffic flow prediction is one of the essential components of Intelligent Transportation Systems and one of the most challenging spatial-temporal problems, because of recurrent and non-recurrent patterns and the physical dynamics involved. Traffic flow prediction can help travelers make better decisions and improve traffic management, while decreasing traffic congestion and air pollution. Recently, smart devices have increased the role of traffic flow prediction in daily life, helping people plan their travel and find the most efficient routes. With the advent of new sensing, computing, and networking technologies, such as cameras, sensors, radars, inductive loops, and GPS devices, a large volume of data is readily available zheng2016big . These increasingly large data sets mean that big data, and techniques to handle such data, play a key role in the success of future transportation systems zhang2011data . Hence, to improve the performance of transportation systems, researchers are motivated to take advantage of new spatial-temporal data-driven techniques and design scalable algorithms capable of processing large volumes of data, such as deep neural networks lv2015traffic , al2015efficient .

1.1 Background

Starting in the 1970s with the original work of Gazis and Knapp gazis1971line , there have been many studies applying time series forecasting techniques to the traffic flow prediction problem, including parametric techniques, such as auto-regressive integrated moving average (ARIMA) kamarianakis2003forecasting and Seasonal-ARIMA kumar2015short , and statistical techniques, such as Bayesian analysis ghosh2007bayesian , Markov chains, and Bayesian networks. However, these models have several limitations, because of prior assumptions and poor handling of missing data, noisy data, outliers, and the curse of dimensionality. Shallow neural networks can handle high-dimensional data, but cannot capture patterns of high computational complexity. With the superior performance of deep neural networks on large-scale problems, they have become an alternative technique for large-scale multi-variate time series forecasting problems.

Recently, there have been several attempts to design deep learning models for multi-variate time series forecasting problems. The primary work related to ours proposes a stacked autoencoder (SAE) model to learn traffic flow features and illustrates the advantage of the SAE model over a multi-layer perceptron lv2015traffic . In huang2014deep , the authors propose stacked autoencoders with multi-task learning at the top layers of the neural network. A Deep Belief Network (DBN) composed of layers of restricted Boltzmann machines is proposed in kuremoto2014time . In wang2018optimal , an ensemble of four categories of fully-connected neural networks is applied to the time series forecasting problem. In qiu2014ensemble , an ensemble of DBNs with Support Vector Regression for aggregating outputs is proposed for the time series forecasting problem. However, in fully-connected neural networks, the number of parameters grows rapidly with increasing input size; therefore, the convergence of the model is computationally expensive and challenging. Several other neural network layers have been proposed to reduce the computational time and capture patterns in temporal datasets of high computational complexity.

Convolutional Neural Networks (CNNs) extract features from various types of input data, such as images, videos, and audio. Weight sharing, the main feature of CNNs, reduces the number of parameters in deep neural network models. These properties improve the performance of learning algorithms by reducing the complexity of the parameters krizhevsky2012imagenet . The performance of deep CNNs for multi-variate time series forecasting has been examined: in ma2017learning , the spatial-temporal relations of traffic flow data are represented as images, and a CNN model is trained on these images to forecast speed in large transportation networks. In deng2018exploring , the authors study image-like representations of spatial time series data using convolution layers and ensemble learning. A convolution layer considers spatial structure in a Euclidean space, which can miss information in graph-structured data henaff2015deep . As an alternative approach, following the work bruna2013spectral , spatial dependency is captured using bi-directional diffusion convolutional recurrent networks li2017graph , illustrating that a graph-structured representation of time series data captures the spatial relations among time series. Moreover, in the presence of temporal data, recurrent neural networks have shown great performance in time series forecasting connor1994recurrent . The vanishing gradient problem in deep multi-layer perceptrons and recurrent neural networks is addressed by the Long-Short Term Memory (LSTM) model sak2014long , which significantly improves time series forecasting zhao2017lstm , traffic speed prediction ma2015long , and traffic flow estimation with missing data tian2018lstm .

While convolutional neural networks exhibit excellent performance on spatial data and recurrent neural networks have advantages on problems with temporal data, spatial-temporal problems combine both. In xingjian2015convolutional , the authors propose a convolutional-LSTM layer for the weather forecasting problem, which considers spatio-temporal sequences. A convolutional deep learning model for multi-variate time series forecasting is proposed in yi2017grouped , with explicit grouping of input time series and implicit grouping using error back-propagation. In cheng2017deeptransport , a CNN-LSTM model is used on downstream and upstream data to capture the physical relationships in traffic flow data: a convolutional layer is followed by an LSTM layer for downstream and upstream traffic flow. In liu2018short , a CNN and a gated CNN followed by attention layers are illustrated for spatial-temporal data. The capability of CNN-LSTM models to learn spatial-temporal features is illustrated in the above works. However, there has been no analysis of designing a neural network architecture with separate components to capture distinct spatial-temporal patterns.

1.2 Contribution

In the aforementioned works, spatial time series forecasting has been studied with the objective of proposing various types of convolution and recurrent neural network layers. However, spatial-temporal data have their own specific patterns, which motivates us to use spatial and time series decomposition and to explicitly consider the various types of patterns in designing an efficient neural network architecture. Several challenges in spatial-temporal data should be considered when designing the deep neural network architecture. In spatial-temporal data, time series residuals are not merely meaningless noise; they are related to the physical properties and dynamics of spatially dependent time series. Moreover, convolutional layers can capture spatial and short-term patterns, but sliding convolution kernels over spatial features misses the network structure. In the presence of long-term patterns, an LSTM layer shows great performance in forecasting problems because it can separately capture patterns in the detrended data. Furthermore, a challenging problem is handling small amounts of missing spatial-temporal data in time series forecasting.

In this paper, we address the problem of explicitly decomposing spatial-temporal patterns in the design of a deep neural network, and we illustrate its performance improvement on a large-scale traffic flow prediction problem. The contributions of the paper are as follows:

  • We illustrate an approach for explicitly considering various types of patterns in a deep neural network architecture for a spatial multi-variate time series forecasting problem.

  • We describe a Dynamic Time Warping-based clustering method and time series decomposition with the objective of finding compact regions with similar time series residuals.

  • A multi-kernel convolution layer is designed for spatial time series data to keep the spatial structure of the time series and extract short-term and spatial patterns. It is followed by a convolution-LSTM component to capture long-term patterns from trends, and a pretrained denoising autoencoder to make predictions robust to missing data.

  • The spatial and temporal patterns in traffic flow data are analyzed, and the performance gains of the proposed model relative to baseline and state-of-the-art deep neural networks are illustrated for traffic flow prediction, capturing meaningful time series residuals and providing predictions robust to missing data.

The rest of the paper is organized as follows. In Section II, we define the problem. In Section III, the technical background of the proposed model is presented. In Section IV, the proposed framework is illustrated, followed by the results of the work and the conclusion discussed in Section V.

2 Problem Definition

Time series data are a set of successive measurements x^i = (x^i_1, …, x^i_T), where x^i_t is an observed variable at location i and time step t, and T is the size of the time series. A sensor i gathers x^i_t together with d corresponding features. Spatial-temporal data are a set of multi-variate time series {x^1, …, x^S}, represented by a matrix of size S × T × d, where S is the number of sensors, which gather spatial-temporal data synchronously in a geographical area.

Given X = {x^1, …, x^S} as the set of all time series in a region, a spatial time series forecasting problem is cast as a regression problem. Given a time window of the last w steps and a prediction horizon h, the objective is to predict (x_{t+1}, …, x_{t+h}), given (x_{t−w+1}, …, x_t). The time window is used to consider only a small portion of previous temporal data for predicting the horizon, while we expect the model to memorize the long-term patterns. In equation (1), the optimum parameter θ* gives the best model f_θ for forecasting the time series data. In a neural network, θ is the set of weights of the model, and the optimization algorithm minimizes a non-linear loss function L by solving the following non-convex optimization problem,

θ* = argmin_θ Σ_t L(f_θ(x_{t−w+1}, …, x_t), (x_{t+1}, …, x_{t+h})).    (1)

In this paper, given spatial multi-variate time series data X, a deep neural network predicts an output of size S × h × d′, where d′ is the number of output features, for input data of size S × w × d.
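As a concrete illustration of the regression setup above, the sliding-window inputs and horizon targets can be built as follows (a minimal sketch; the array layout and variable names are our assumptions, not the paper's code):

```python
import numpy as np

def make_windows(series, w, h):
    """Build (input window, horizon target) pairs from a (T, S) array,
    where T is the number of time steps and S the number of sensors."""
    X, Y = [], []
    for t in range(w, len(series) - h + 1):
        X.append(series[t - w:t])      # last w steps as input
        Y.append(series[t:t + h])      # next h steps to predict
    return np.stack(X), np.stack(Y)

# toy example: 100 time steps, 4 sensors, window 12, horizon 3
data = np.random.rand(100, 4)
X, Y = make_windows(data, w=12, h=3)
print(X.shape, Y.shape)  # (86, 12, 4) (86, 3, 4)
```

Each training sample pairs one window of past observations with the following horizon, matching the regression formulation in equation (1).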

3 Technical Background

Here, we detail the core components of the proposed approach, including Fuzzy Hierarchical Agglomerative Clustering, Convolutional layers, Convolutional LSTM layers and Denoising Autoencoder.

3.1 Dynamic Time Warping

A dynamic time warping (DTW) algorithm finds an optimal alignment path between two time series by comparing each point of the first time series with points of the second. Hence, similar patterns occurring in different time slots are still matched as similar. A dynamic programming method finds the optimal match petitjean2012summarizing . Here, we illustrate DTW for multi-dimensional time series. Algorithm (1) finds the minimum distance between two multi-dimensional time series of lengths T1 and T2.

procedure Dist(a, b)
     return the Euclidean distance between the multi-dimensional points a and b
end procedure
procedure DTW(X, Y) Two input time series of lengths T1 and T2
     D ← matrix of size (T1+1) × (T2+1) Initialization of the distance matrix
     D[0, 0] ← 0
     for i ← 1 to T1 do
         D[i, 0] ← ∞
     end for
     for j ← 1 to T2 do
         D[0, j] ← ∞
     end for
     for i ← 1 to T1 do
         for j ← 1 to T2 do
             D[i, j] ← Dist(X_i, Y_j) + min(D[i−1, j], D[i, j−1], D[i−1, j−1])
         end for
     end for
     return D[T1, T2] Return the nonlinear distance of the two time series
end procedure
Algorithm 1 Multi-dimensional Dynamic Time Warping
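A straightforward NumPy implementation of the dynamic-programming recurrence in Algorithm 1 can look like this (a sketch assuming Euclidean point distance; variable names are ours):

```python
import numpy as np

def dtw_distance(x, y):
    """DTW distance between two multi-dimensional time series,
    x: (T1, d) array, y: (T2, d) array."""
    T1, T2 = len(x), len(y)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])  # point distance
            # best of insertion, deletion, and match moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T1, T2]

a = np.array([[0.], [1.], [2.], [3.]])
b = np.array([[0.], [0.], [1.], [2.], [3.]])
print(dtw_distance(a, b))  # 0.0: identical shapes up to time warping
```

Note the zero distance despite the unequal lengths: the repeated first point of `b` is absorbed by the warping path, which is exactly why DTW suits residuals that are shifted in time between neighboring sensors.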

3.2 Fuzzy Hierarchical Clustering

Given data points

, a fuzzy hierarchical clustering method finds a membership matrix

, where is the number of clusters and illustrates the distance of data points to cluster .

To apply a DTW-based clustering method, the main challenge is to compute the mean of a cluster addressed in gupta1996nonlinear , niennattrakul2009shape , petitjean2012summarizing , because the initial values impacts on the final results of the algorithm. Hence, we consider fuzzy hierarchical clustering method without a need to find the cluster mean. Following the work of konkol2015fuzzy , a complete-linkage is used for distances between clusters and a single-linkage is used for distance between points, and a point and a cluster. An algorithm (2) finds the membership matrix of sensors to clusters.

The distance matrix D holds the distances between all pairs of time series and clusters, and it is initialized with all pairwise distances between points. A closest-pair function finds the closest pair of elements (u, v) in the set, where u and v can be points or clusters. The cluster list C stores the clusters and their members; a merge function merges u and v into a new cluster and adds it to the list. The list A stores points that have been assigned to a cluster. Based on the new formation of clusters, an update function recomputes the distances between points and clusters: it updates the distance of all clusters and unassigned points to the new cluster, the fuzzy distance of all assigned points to the new cluster, and the distances of all points of the new cluster to the other clusters. The fuzzy distance between an assigned point i and a cluster c is obtained using equation (2).


where d_min(i) is the minimum distance of an assigned point i to any of the clusters, u_{ic} is the membership value of assigned point i in cluster c, m is a fuzziness parameter, and the distance function is based on single-linkage for each pair of points, or a point and a cluster, and complete-linkage for two clusters.

procedure FHC(X, G) input time series and spatial distances
     D ← pairwise DTW distances distance matrix
     A ← ∅ list of assigned points
     C ← ∅ list of clusters and their members
     while the convergence criterion is not satisfied do
         (u, v) ← closest pair of elements in D
         merge u and v into a new cluster and add it to C
         update A, and update the distances in D using equation (2)
     end while
     return C Return the list of clusters
end procedure
Algorithm 2 A DTW-based Fuzzy Hierarchical Clustering on Spatial Time Series
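Since equation (2) is only partially reproduced above, the following is a hedged sketch of one standard way to turn point-to-cluster distances into fuzzy memberships with a fuzziness parameter m (a fuzzy c-means-style formula of our choosing, not necessarily the paper's exact equation):

```python
import numpy as np

def fuzzy_memberships(distances, m=2.0):
    """Fuzzy membership of one point to each cluster from its distances;
    a smaller distance yields a larger membership (c-means style)."""
    d = np.asarray(distances, dtype=float)
    inv = (1.0 / d) ** (1.0 / (m - 1.0))
    return inv / inv.sum()   # memberships sum to 1

# a point close to cluster 0 and farther from clusters 1 and 2
u = fuzzy_memberships([1.0, 4.0, 8.0])
print(u.round(3))  # highest membership for the closest cluster
```

Thresholding such memberships (e.g. at 0.1, as done in Section 5.3) lets one sensor belong to more than one regional cluster.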

3.3 Convolution Layer

A convolutional layer uses three ideas: local receptive fields, shared weights, and spatial subsampling, making it an effective and efficient model for exploiting local stationarity in grid data krizhevsky2012imagenet . Given an input matrix of size m × n, a 2-dimensional convolution layer has a weight matrix W of size k1 × k2, called a kernel, where k1 ≤ m and k2 ≤ n. A convolution multiplication with strides s1 and s2 is obtained by sliding the kernel over the input matrix. The kernel is a shared weight, which assumes locally stationary input data. Given h^{l−1} as the input of layer l, the layer output is obtained by h^l = σ(W ∗ h^{l−1} + b) for an activation function σ and bias vector b. Pooling layers among successive convolution layers reduce the size of the hidden layers while extracting features in locally connected layers; a max-pooling layer selects the maximum value in each window of size p1 × p2, reducing the dimensions of the layer by factors of p1 and p2.
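The kernel-sliding and max-pooling operations can be sketched in NumPy as follows (stride 1 and non-overlapping pooling; shapes and names are illustrative assumptions):

```python
import numpy as np

def conv2d(x, w, b=0.0):
    """Valid 2-D convolution (cross-correlation) with stride 1."""
    m, n = x.shape
    k1, k2 = w.shape
    out = np.empty((m - k1 + 1, n - k2 + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k1, j:j + k2] * w) + b
    return out

def max_pool(x, p1, p2):
    """Non-overlapping max pooling; divides each dimension by p1 and p2."""
    m, n = x.shape
    trimmed = x[:m - m % p1, :n - n % p2]
    return trimmed.reshape(m // p1, p1, n // p2, p2).max(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)
h = conv2d(x, np.ones((2, 2)))   # (3, 3) feature map
print(max_pool(h, 2, 2).shape)   # (1, 1)
```

The same shared 2 × 2 kernel is applied at every position, which is the weight-sharing property that keeps the parameter count independent of the input size.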

3.4 Convolution-LSTM layer

A Long-Short Term Memory (LSTM) cell is a special recurrent neural network cell with powerful modelling of long-term dependencies sak2014long . A memory cell C_t, input gate i_t, output gate o_t, and forget gate f_t work together in the hidden units H_t. Given the convolution operator ∗ and the Hadamard product ∘, a convolutional LSTM is as follows xingjian2015convolutional ,

i_t = σ(W_xi ∗ X_t + W_hi ∗ H_{t−1} + W_ci ∘ C_{t−1} + b_i)
f_t = σ(W_xf ∗ X_t + W_hf ∗ H_{t−1} + W_cf ∘ C_{t−1} + b_f)
C_t = f_t ∘ C_{t−1} + i_t ∘ tanh(W_xc ∗ X_t + W_hc ∗ H_{t−1} + b_c)
o_t = σ(W_xo ∗ X_t + W_ho ∗ H_{t−1} + W_co ∘ C_t + b_o)
H_t = o_t ∘ tanh(C_t)

A convolution-LSTM layer has the same structure as a convolution layer, but with LSTM cells. The gates prevent the gradient from vanishing quickly by storing it in memory. The convolution-LSTM layer receives an input of size w × g1 × g2 × d, where w is the time window, each g1 × g2 matrix carries the spatial information on a grid, and each grid element has d features.

3.5 Denoising Stacked Autoencoder

Given d-dimensional input data x, an autoencoder transforms the input with a non-linear function z = σ(Wx + b), where z lies in a lower-dimensional space vincent2008extracting . The decoder generates x̂ = σ′(W′z + b′), where x̂ has the original dimension d. In the training process, the objective is to reconstruct x by minimizing a loss function L(x, x̂), such as the least-squares function, between x and x̂, obtaining the optimum model parameters over all input data,

θ* = argmin_θ Σ_x L(x, x̂).

Stacked autoencoders are a set of multiple autoencoder layers, in which the input of each layer is the output of the previous layer vincent2010stacked . In a denoising autoencoder, the input data is corrupted with some noise, while the target output remains unchanged. Adding noise to the input data and training the neural network to reconstruct the clean output helps the neural network learn features that are robust to noisy data. The noise can be added to the network using a dropout function in each layer, in which a random fraction of neurons is dropped in each round of the training process. Unsupervised training of stacked autoencoders of this form is capable of reconstructing the original data in the presence of noisy or missing data zhou2017delta , vincent2010stacked , gondara2018mida .
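The dropout-style input corruption used in such pretraining can be sketched as follows (the corruption rate and the least-squares objective follow the description above; the exact noise scheme is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, p=0.2):
    """Randomly zero out a fraction p of the inputs (masking noise)."""
    mask = rng.random(x.shape) >= p
    return x * mask

def reconstruction_loss(x, x_hat):
    """Least-squares reconstruction objective of the autoencoder."""
    return np.mean((x - x_hat) ** 2)

x = np.ones((8, 10))
x_noisy = corrupt(x, p=0.2)
# the network is trained to map x_noisy back to the clean target x
print(reconstruction_loss(x, x_noisy))
```

Because the target stays clean while the input is masked, the trained model learns to fill in dropped entries, which is what makes it useful for missing sensor data.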

4 Methodology

In this section, we describe the architecture of the proposed deep learning framework for the spatial time series forecasting problem. The proposed framework is illustrated in Fig. (1). The network structure represents the distances between neighboring sensors, and the spatial-temporal data include time series data for each sensor.

4.1 Preprocessing

A time series decomposition method is applied on each input time series, which generates three components: the seasonal, trend, and residual time series, respectively. In spatial time series data, residuals can be more than just noise. For example, in a transportation network, time series residuals can be caused by the traffic evolution of the network, and they form meaningful patterns among neighboring time series, as analyzed in the experimental results of Section 5.

To apply algorithm (2) on time series residuals, we consider a set of geographically nearest neighbors for each sensor. The algorithm updates single-linkage distances only between time series from this neighbor set; thus, the clusters are not spread over a large geographical area. Since some sensors might affect more than one regional cluster, the clustering algorithm outputs the fuzzy membership of each sensor to its similar clusters, so each sensor has a membership value for some cluster c. We say two time series are similar if they have similar patterns over some time shift, or zero distance from each other. Hence, for a given distance function, which we take to be DTW, the fuzzy hierarchical clustering algorithm finds clusters of sensors with similar residual time series by minimizing the distance among cluster members. To represent short-term similarity among neighboring time series, we use a rolling window on the training data and average the corresponding DTW distances. A rolling window finds the similarity between short-term time windows of a neighboring area. To reduce computational time, the rolling window is applied only when there is high interaction among neighboring time series. For example, in traffic flow data, the interaction among neighboring sensors increases during peak hours and congestion periods. Applying algorithm (2) with the aforementioned modifications to spatial time series finds fuzzy clusters of time series based on DTW distance.
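The additive decomposition used in this preprocessing step can be sketched with a moving-average trend and per-period seasonal means (a simplified stand-in for a library routine such as statsmodels' seasonal_decompose; the period and names are illustrative):

```python
import numpy as np

def additive_decompose(x, period):
    """Split a 1-D series into seasonal + trend + residual components."""
    x = np.asarray(x, dtype=float)
    # centered moving-average trend
    kernel = np.ones(period) / period
    trend = np.convolve(x, kernel, mode="same")
    detrended = x - trend
    # seasonal component: mean of each position within the period
    seasonal = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal = np.tile(seasonal, len(x) // period + 1)[:len(x)]
    residual = x - trend - seasonal
    return seasonal, trend, residual

t = np.arange(288 * 7)                       # one week of 5-minute steps
x = np.sin(2 * np.pi * t / 288) + 0.001 * t  # daily cycle plus a slow trend
s, m, r = additive_decompose(x, period=288)
print(x.shape, (s + m + r).shape)
```

The three outputs are then routed to different parts of the network: residuals to the multi-kernel convolution, trends to the convolution-LSTM, and seasonal components to the final concatenation.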

Figure 1: The proposed framework for spatial multi-variate time series forecasting problem

4.2 Neural Network Architecture

The details of the deep neural network are represented in Fig. 2. Time series residuals are the first input of the neural network, detrended and represented with a matrix of size S × w. A convolutional component is applied to extract patterns from the time series residuals. For a given set of time series, a general convolution kernel slides along the first and second axes. However, because the sensors can have a spatial structure, like sensors in a transportation network, sliding a kernel over the sensor axis cannot keep the structure of the network. Moreover, each sensor's time series residuals depend only on small regions of the network. Hence, we propose a multi-kernel convolution layer, which receives the cluster set and the residual time series data. The kernel for a cluster c is described with a weight matrix whose weights are nonzero only for the sensors belonging to c; in other words, the number of trainable variables in the kernel corresponding to cluster c is proportional to the cluster size. Only the sensors in cluster c have local connectivity to the same residual time series. Each kernel slides over the time axis and obtains hidden units for its cluster, followed by a pooling layer. Several convolution-ReLU-pooling layers extract short-term and spatial patterns from the time series residuals in each neighborhood. The outputs of the kernels are concatenated and connected to a fully-connected layer, represented with a hidden layer of size S × k, where k is the number of features represented by the convolution layers and S is the total number of sensors.
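The multi-kernel layer above can be sketched as one temporal kernel per cluster, applied only to that cluster's sensors and then concatenated (shapes, names, and the single-feature output per position are our assumptions for illustration):

```python
import numpy as np

def multi_kernel_conv(residuals, clusters, kernels):
    """residuals: (S, T) sensor-by-time residual matrix.
    clusters: list of sensor-index arrays; kernels: one (|c|, k) weight
    matrix per cluster. Each kernel slides over the time axis only,
    so the network structure of each cluster is preserved."""
    outputs = []
    for idx, w in zip(clusters, kernels):
        block = residuals[idx]                  # (|c|, T) cluster slice
        k = w.shape[1]
        feats = [np.sum(block[:, t:t + k] * w)  # one feature per position
                 for t in range(block.shape[1] - k + 1)]
        outputs.append(np.array(feats))
    return np.concatenate(outputs)              # concatenated cluster features

res = np.random.rand(6, 20)                     # 6 sensors, 20 time steps
clusters = [np.array([0, 1, 2]), np.array([2, 3, 4, 5])]  # overlapping (fuzzy)
kernels = [np.ones((3, 4)), np.ones((4, 4))]
out = multi_kernel_conv(res, clusters, kernels)
print(out.shape)  # (34,): 17 temporal positions per cluster
```

Note that sensor 2 appears in both clusters, mirroring the fuzzy memberships, and that no kernel ever slides across the sensor axis.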

Figure 2: The proposed spatial-temporal decomposition deep neural network architecture

The time series trends represent long-term patterns. The trends are concatenated to the convolutional features on the last axis. Unlike residuals, which represent the physical dynamics of the problem and are similar only between neighboring areas, trends can represent global changes in the spatial-temporal data. Hence, we use LSTM cells to capture long-term patterns from the concatenated output of the features extracted from smaller regions. The model continues with 2-dimensional convolution-LSTM layers. A 2-dimensional convolution-LSTM layer, described in Section 3.4, receives the concatenated input and applies the convolution to the matrix with two channels. This convolutional layer has a different architecture from the first multi-kernel convolution layer; that is, each neural cell is an LSTM cell and is applied to all input sensors. Several convolution-LSTM layers extract features from residuals and trends. Seasonal patterns represent repeated patterns for the given time horizon, so the output is concatenated with the seasonal patterns of the time window and followed by a fully-connected layer. The output has size S × h, where h is the prediction horizon, and consists of the predicted values for all sensors over the prediction horizon.

One of the challenges in spatial-temporal data is to make robust predictions in the presence of missing or noisy data. Hence, we use an autoencoder as the last component of the model. A denoising autoencoder-decoder reconstructs the last output for each cluster. In the pretraining step, for a prediction horizon and a cluster c, each denoising autoencoder-decoder generates the reconstructed predictions for the sensors of c, with dropout layers between successive layers. Because the output of the autoencoders is designed based on the clusters, some sensors appear in more than one cluster; the fully-connected target layer is therefore connected to all variables shared between denoising autoencoders, with a linear activation function. In the training process, the objective is to minimize a loss function, such as the mean square error, between the predicted and real values, obtaining the optimum model parameters for the input data using stochastic gradient descent.
5 Experimental Analysis

We illustrate the analysis and the performance of the proposed methodology on traffic flow data.

5.1 Data Set

We use traffic flow data from the Bay Area of California, represented in Fig. 3, which is commonly used and available in PeMS californiapems . The traffic flow is gathered every 30 seconds and aggregated every 5 minutes in the dataset. Each sensor on the highways of California reports flow, occupancy, and speed at each time stamp. A sensor is a loop detector device at a mainline, off-ramp, or on-ramp location. In preprocessing, we selected 597 sensors with sufficiently complete observations over the first six months of 2016.

Figure 3: Traffic sensors over the Bay Area, California. The red dots represent loop detector sensors on highways.

5.2 Pattern analysis in traffic data

To illustrate the specific characteristics of traffic data that arise from the dynamics of traffic flow, we analyze the spatial, short-term, and long-term patterns.

In Fig. 4, an additive time series decomposition of traffic flow data is illustrated for one station. Given a one-day frequency, the time series decomposition shows similar, repeated (seasonal) daily patterns. Moreover, there are long-term weekly patterns, shown as trends. The long-term patterns, such as the seasonal and trend components, arise from similar periodic patterns generated outside of the highway network; in other words, they are related to the origin-destination matrix of vehicles in the network. The residual patterns are not only random noise, but also the result of spatial and short-term patterns, and they are related to the evolution of traffic flow or sudden changes in smaller regions of the network. Consequently, residuals have more similar patterns for neighboring time series, as illustrated in Section 5.2.1.

(a) The observed flow data.
(b) Seasonal patterns of traffic flow data.
(c) Trend of traffic flow data.
(d) Residuals of traffic flow data.
Figure 4: Time series decomposition of traffic flow data for one station in the network with daily frequency.

5.2.1 Residual time series in traffic flow data

A time series decomposition consists of residual, trend, and seasonal components. The residuals are often interpreted as random noise in time series data. However, in traffic flow data, the residuals result from the physical evolution of the network. In non-data-driven traffic flow problems, first-order and second-order traffic flow fundamental diagrams show the relation between traffic flow, occupancy, and speed. Given a wave speed w, free-flow speed v_f, and maximum density ρ_max in a road segment, the first-order dynamical traffic flow theorem approximates flow from density ρ by q = min(v_f ρ, w(ρ_max − ρ)). The wave speed reduces flow at high occupancy. In Fig. 5, we examine the non-linear relation of flow, speed, and occupancy for one day and one station. It illustrates how high occupancy reduces the average speed in the road segment, leading to traffic congestion. This property of the fundamental traffic flow diagram explains non-repeated, residual patterns in traffic flow data as a result of congestion. Congestion propagation in a transportation network reveals the relation among neighboring sensors on highways, described in Fig. 6 for the flow data of three successive sensors; congestion propagates over these sensors with nearly a 20-minute delay. For a larger area, in Fig. 7, the speed of 13 successive sensors is represented as an image. The reduction of speed during peak hours is shown with darker colors, illustrating that the reduction in speed is similar in neighboring areas.
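The first-order relation described above corresponds, under our reading of the wave-speed, free-speed, and maximum-density parameters, to a triangular fundamental diagram, which can be computed as follows (parameter values are illustrative assumptions):

```python
import numpy as np

def triangular_flow(density, v_free=70.0, wave_speed=15.0, rho_max=120.0):
    """Triangular fundamental diagram: flow grows linearly at the free
    speed up to the critical density, then the backward wave reduces it
    toward zero at the jam density rho_max."""
    density = np.asarray(density, dtype=float)
    return np.minimum(v_free * density, wave_speed * (rho_max - density))

rho = np.linspace(0, 120, 5)
print(triangular_flow(rho))  # low flow at both empty and jammed densities
```

Flow therefore peaks at an intermediate critical density, which is why high occupancy (high density) coincides with reduced speed and flow in the congestion plots.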

(a) An example of the relation among flow, occupancy, and speed. Occupancy values greater than 8 decrease the average speed as a result of the wave speed.
(b) Log plot representing the linear relation between occupancy and flow, with a free-flow speed near 70.
Figure 5: The relation between occupancy, speed, and flow.
(a) The upstream and downstream sensors spatially affect each other.
(b) The reduction in the speed of sensors 1 and 2 happens twice in this plot, with a 20-minute delay in congestion propagation.
Figure 6: Congestion propagation in successive sensors.
Figure 7: Image representation of the speed values of 13 successive sensors over 7 hours on a highway. It shows that neighboring sensors have the same congestion hours.

5.3 Fuzzy Hierarchical Clustering

Figure 8: The table shows the Dynamic Time Warping distance of time series residuals among 15 stations on a highway. The result of the hierarchical clustering method is illustrated with three clusters. The values near the diagonal have lower distances.

In this section, we illustrate the results of fuzzy clustering applied on time series residuals. In Fig. 8, the DTW distance matrix shows the residual similarity among neighboring sensors. The matrix shows the average DTW distance for the peak times of the training data, which have the highest dependency because of high occupancy values. Each cluster is obtained by comparing neighboring sensors. The elements near the diagonal have the lowest distance values, showing the similarity between neighboring sensors.

After preprocessing of the time series data, there are 597 sensors with complete data over a period of six months. The fuzzy clustering finds the membership of each sensor to clusters. In the fuzzy membership matrix, we consider a threshold of 0.1: all sensors with a membership value of more than 0.1 are considered members of a cluster. We also constrain the average cluster length to be less than 10 miles; the agglomerative clustering stops when the average becomes greater than 10 miles. As the clustering is applied on mainline stations, we also add the on-ramp and off-ramp sensors to their closest mainline stations. The result of the fuzzy hierarchical clustering method has 64 clusters, where the average number of elements is 9.7 with a standard deviation of 4.2, a minimum cluster size of 3, and a maximum of 14. The lengths of the smallest and largest clusters are 0.3 miles and 32.1 miles, and there are 53 sensors that appear in more than one cluster. To examine the relation between trends within one spatial area, we obtain the DTW distance of each pair of sensors using a rolling window. For each time window, we normalize the trends by subtracting the last value from all time stamps. The average DTW distance over all pairs of sensors is 0.7, which shows the high similarity of trends. By contrast, as presented in Section 5.2.1, the average DTW distance of time series residuals over all pairs of sensors is 4.5, while applying the fuzzy clustering method to the residuals reduces the average DTW distance within clusters to 0.6. As a result, we apply fuzzy clustering only on time series residuals.

5.4 Results of comparison

Our model outperforms state-of-the-art models on the traffic flow prediction dataset. All models are trained using the Adam optimizer (48), with a batch size of 512 and 400 epochs. All experiments are implemented in TensorFlow (49) and conducted on an NVIDIA Tesla K80 GPU. We used a grid search to find the deep neural network architectures with the best performance and the most efficient computation time.

The input matrix contains, for each of the 597 sensors, a time window of observations with three features: flow, occupancy and speed. For the MLP, LSTM, CNN and the proposed multi-kernel CNN-LSTM models, the input is reshaped to the appropriate dimensions, described in Section 5.4.2. For all models the data are scaled into the range [0, 1]. For the models without a time series decomposition component (MLP, LSTM and CNN), we make the data stationary by subtracting the value at the last time step of the window from all input values. Detrending for the models with a time series decomposition component works as follows. The residual time series is already stationary. To feed the trend and seasonal components to the neural network, we make them stationary by subtracting the last value of each time window. To recover the output, we add the predicted residual to the sum of the predicted trend and seasonal values. The output matrix contains the forecasts for the 15-min, 30-min, 45-min and 60-min horizons reported in the result tables.
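The detrending and recovery steps above can be sketched as follows. This is a minimal interpretation of the scheme described in the text (function names are illustrative); the min/max used for scaling would come from the training data.

```python
import numpy as np

def minmax_scale(x, lo, hi):
    """Scale values into [0, 1] using training-set min/max."""
    return (x - lo) / (hi - lo)

def make_stationary(window):
    """Detrend a window by subtracting its last value, as done for the
    trend/seasonal inputs (the residual series is already stationary)."""
    window = np.asarray(window, dtype=float)
    return window - window[-1]

def recover_forecast(pred_residual, pred_trend, pred_seasonal):
    """Recombine component forecasts into the final traffic-flow forecast."""
    return pred_residual + pred_trend + pred_seasonal
```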

5.4.1 Baseline models

As illustrated in earlier traffic flow prediction studies, traffic flow patterns are similar at the same hours and weekdays. The first baseline model (Ave-weekday-hourly) uses the average traffic flow of each station as a timetable indexed by weekday and time of day; the short-term prediction for each sensor is the average value over the training data for that slot. The second baseline model (current-value) uses the current value of traffic flow as the short-term prediction.
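The two baselines can be sketched in a few lines. This is an illustrative implementation, assuming integer weekday and time-slot labels accompany each observation; names are not from the paper.

```python
import numpy as np

def weekday_hourly_table(flows, weekdays, slots):
    """Ave-weekday-hourly baseline: mean training flow per (weekday, slot)."""
    buckets = {}
    for f, d, s in zip(flows, weekdays, slots):
        buckets.setdefault((d, s), []).append(f)
    return {k: float(np.mean(v)) for k, v in buckets.items()}

def predict_baseline(table, weekday, slot):
    """Look up the historical average for the requested weekday/slot."""
    return table[(weekday, slot)]

def current_value_baseline(series):
    """Current-value baseline: the last observation is the forecast."""
    return series[-1]
```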

5.4.2 State-of-the-Art Neural Network Models

In this section, we describe the implemented neural network models. A multi-layer perceptron (MLP) with three fully connected layers, Xavier initialization (50), ReLU activation, and (500, 300, 200) hidden units is used. A deep belief network (DBN) with greedy layer-wise pretraining of autoencoders finds a good initialization for a fully connected neural network; a fine-tuning step of the stacked autoencoder finishes training. We use 30 epochs for pretraining each layer. A fully connected long short-term memory network (LSTM) is capable of capturing long-term temporal patterns. In most studies, however, fully connected LSTM models have a shallow structure, and they converge slowly despite their strength in capturing long-term patterns. The best LSTM structure found has hidden layers of size (400, 200); its input is reshaped from a vector to a two-dimensional matrix of time steps by features. To use a convolutional neural network (CNN) for time series forecasting, the input matrix is reshaped to three dimensions, with traffic flow, speed and occupancy as the channels. The best deep CNN model has four layers with max-pooling and batch normalization; the numbers of filters are (16, 32, 64, 128), the kernel size is (5, 5), and the max-pooling layers halve each dimension in each layer. Two fully connected layers connect the convolutional layers to the output layer.
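The CNN baseline described above can be sketched in Keras. This is a hedged reconstruction from the stated hyperparameters, not the authors' code: the input size, the width of the dense layer, and the output horizon are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(num_sensors=64, window=32, horizon=4):
    """Sketch of the deep CNN baseline: four conv blocks with (16, 32, 64,
    128) filters, 5x5 kernels, batch normalization and 2x2 max-pooling,
    followed by two fully connected layers. Input sizes are illustrative."""
    inp = layers.Input(shape=(num_sensors, window, 3))  # channels: flow/speed/occupancy
    x = inp
    for filters in (16, 32, 64, 128):
        x = layers.Conv2D(filters, (5, 5), padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D((2, 2))(x)  # halve both dimensions
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)
    out = layers.Dense(num_sensors * horizon)(x)  # one forecast per sensor per step
    return models.Model(inp, out)
```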

The CNN-LSTM model captures short-term and spatial patterns in CNN layers, and temporal patterns in LSTM layers. Two convolutional layers with (16, 32) filters are applied to all input sensors. LSTM layers of size (300, 150) follow the output of the CNN, followed by a fully connected layer. The C-CNN-LSTM model is a clustering-based CNN-LSTM, in which a multi-kernel convolutional layer extracts spatial, short-term patterns from time series residuals, using the clusters obtained in Section 5.3.
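One plausible arrangement of the CNN-LSTM baseline, sketched in Keras: convolution over the time axis extracts short-term features, and stacked LSTMs capture the temporal structure. The filter and unit sizes follow the text; the input dimensions and the choice of 1-D convolution over flattened sensor features are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_lstm(num_sensors=30, window=12, features=3):
    """Sketch of the CNN-LSTM baseline with (16, 32) conv filters and
    (300, 150) LSTM units, as described in the text."""
    inp = layers.Input(shape=(window, num_sensors * features))
    x = layers.Conv1D(16, 3, padding="same", activation="relu")(inp)
    x = layers.Conv1D(32, 3, padding="same", activation="relu")(x)
    x = layers.LSTM(300, return_sequences=True)(x)
    x = layers.LSTM(150)(x)
    out = layers.Dense(num_sensors)(x)  # one forecast per sensor
    return models.Model(inp, out)
```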

A pretrained stacked denoising autoencoder is applied to each cluster of sensors to generate robust output. Each layer is followed by a dropout layer with rate 0.2. As the average cluster size is roughly 10 with a standard deviation of 4 (Section 5.3), we use the same architecture for all clusters: fully connected layers of (40, 20, 10, 20, 40) units with ReLU activation. Pretraining runs for 60 epochs, and the weights are loaded into the proposed model described in the next section.
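The per-cluster denoising autoencoder can be sketched in Keras as below. The (40, 20, 10, 20, 40) widths, ReLU activation and dropout rate follow the text; the flattened input size is an assumption. Pretraining would fit noisy inputs against clean targets, e.g. `model.fit(x + noise, x)`.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_denoising_autoencoder(cluster_size=10, window=4):
    """Sketch of the per-cluster stacked denoising autoencoder:
    fully connected (40, 20, 10, 20, 40) units with ReLU and dropout 0.2
    after each layer. The input size (cluster_size * window) is illustrative."""
    dim = cluster_size * window
    inp = layers.Input(shape=(dim,))
    x = inp
    for units in (40, 20, 10, 20, 40):
        x = layers.Dense(units, activation="relu")(x)
        x = layers.Dropout(0.2)(x)  # dropout regularizes the denoising task
    out = layers.Dense(dim)(x)  # reconstruct the clean input
    return models.Model(inp, out)
```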

The cluster-based CNN-LSTM with denoising autoencoder (C-CNN-LSTM-DA) is the model proposed in Section 4, which uses clustering of time series residuals, trends and seasonal components along with denoising autoencoders and time series decomposition for each cluster. The architecture consists of two convolutional layers with ReLU and max-pooling (32 and 64 filters), followed by two fully connected layers and two 2-D convolutional LSTM layers (16 and 32 filters) for capturing long-term patterns.

5.5 Performance Measure

To evaluate the performance of the proposed model, we use two performance indices, Mean Absolute Error (MAE) and Root Mean Square Error (RMSE), defined in equation (12):

MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|,    RMSE = √( (1/n) Σᵢ (yᵢ − ŷᵢ)² )    (12)

Here, yᵢ are the real values and ŷᵢ are the predicted values. In this paper, predictions are made for 15-min, 30-min, 45-min and 60-min time horizons.
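These two metrics are straightforward to implement; a minimal NumPy version:

```python
import numpy as np

def mae(y, y_hat):
    """Mean Absolute Error: average magnitude of prediction errors."""
    return float(np.mean(np.abs(y - y_hat)))

def rmse(y, y_hat):
    """Root Mean Square Error: penalizes large errors more than MAE."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))
```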

5.6 Performance results on testing data

In the first analysis, we compare the performance of the models for traffic flow prediction. The results are shown in Table 1, comparing all described models at prediction horizons of 15, 30, 45 and 60 minutes on testing data. The two baseline models have the worst performance; the neural network models perform much better. The LSTM model outperforms the MLP, DBN and CNN models, demonstrating its strength in time series forecasting. CNN-LSTM models are more capable of capturing short-term and long-term patterns and are comparable with the LSTM. The two proposed models, C-CNN-LSTM and C-CNN-LSTM-DA, perform best, due to explicitly separating spatial regions; their performance is quite close. In the following sections, we show that in the presence of missing data, the model with denoising autoencoders has better performance.

Baseline models State-of-the-art neural networks Proposed models
Horizon Metric current-value Ave-weekday-hourly MLP DBN LSTM CNN CNN-LSTM C-CNN-LSTM C-CNN-LSTM-DA
15 min MAE 24 27.1 16.3 15.5 14 16 13.6 12.3 12.1
RMSE 36 43.2 28.1 27 25 27.4 24.8 23.1 22.7
30 min MAE 31 27.1 16.9 15.9 14.4 16.2 14.3 12.7 12.4
RMSE 45 43.2 29 28 26.2 28.4 26.0 23.4 22.9
45 min MAE 38 27.1 17.1 16.2 14.9 16.8 15 12.9 12.8
RMSE 54 43.2 29.8 29 28.1 29.3 28.2 23.4 23.1
60 min MAE 44 27.1 17.6 16.5 15.2 17.2 15.1 13.3 13.3
RMSE 63 43.2 30.8 29.3 28.4 30.1 28.1 23.8 23.7
Table 1: Evaluation of the models for traffic flow forecasting problem.
(a) Traffic flow prediction with MLP
(b) Traffic flow prediction with C-CNN-LSTM-DA
Figure 9: Prediction results for traffic flow data of one sensor over one week, where the blue line is predicted, and the red is the real value. The proposed model outperforms MLP in peak hours, while they have comparable performance in off-peak hours.
Figure 10: Comparison of the predictions; to illustrate the capability of the proposed model in capturing residual patterns. Some of the big fluctuations are meaningful residual patterns, and can be predicted.

5.7 Performance results of peak and off-peak traffic

The next experiment compares peak and off-peak hours. In peak hours, the physical properties and evolution of traffic flow can drive congestion propagation through the network; the residuals therefore arise from the evolution of traffic and are meaningful spatial patterns. In off-peak hours, by contrast, traffic moves at free-flow speed without congestion, so flow is determined by long-term patterns in the network. In Fig. 9, the outputs of the C-CNN-LSTM-DA and MLP models are illustrated. Among the neural network models, the MLP has the worst traffic flow prediction performance, while C-CNN-LSTM-DA is the best in Table 1. The C-CNN-LSTM-DA model performs better in peak hours, when residuals are large, which exposes the weakness of fully connected neural networks in capturing residual patterns. In off-peak hours, all neural network models perform comparably well.

In Table 2, we compare the performance of the models on peak and off-peak traffic flow data. We select the MLP and LSTM models, which capture only temporal patterns, along with the proposed model, which explicitly captures spatial patterns. The table reports both residual MAE and MAE of the prediction values; for residual MAE, we detrend the traffic flow prediction before computing the error. For off-peak hours, the LSTM performs comparably to the proposed model, as it readily captures long-term patterns. In peak hours, however, C-CNN-LSTM-DA performs substantially better than the LSTM. In Fig. 10, we compare the LSTM and C-CNN-LSTM-DA predictions: C-CNN-LSTM-DA captures large residuals that the LSTM misses. While a smoother prediction that ignores noise shows the model is not overfitted, meaningful residual patterns in spatial data must be carefully captured by the model.

Flow State Metric MLP LSTM C-CNN-LSTM-DA
Off-peak time Residual MAE 6.1 4.3 4.4
MAE 12.3 11.9 11.8
Peak time Residual MAE 12.2 11.1 8.2
MAE 18.3 16.8 13.8
Table 2: Performance evaluation of three models for traffic flow forecasting in peak and off-peak hours

5.8 Performance results with missing data

We evaluate the proposed model against the other neural network models in the presence of missing data. We randomly generate blocks of missing values in the test data. Each block corresponds to one randomly selected sensor at a random starting time, with a duration drawn from a normal distribution with mean 2 hours and standard deviation 0.5 hours. For each sensor, one block of missing values is generated per week. The missing data increase prediction error, and the real values are used for evaluation of the model. Since the missing values are applied to only one random sensor at a time, we expect neighboring sensors to help the model continue to predict traffic flow well.
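The missing-block protocol above can be sketched as follows. This is an illustrative reconstruction: the fill value (0) and steps-per-hour resolution are assumptions, and the original data array is left untouched.

```python
import numpy as np

def inject_missing_blocks(data, steps_per_hour=12, blocks_per_sensor=1, seed=0):
    """For each sensor, zero out `blocks_per_sensor` blocks at random starts,
    with durations drawn from Normal(2 h, 0.5 h), per the text.

    data: array of shape (num_sensors, num_steps); returns a corrupted copy.
    """
    rng = np.random.default_rng(seed)
    corrupted = data.copy()
    num_sensors, num_steps = data.shape
    for s in range(num_sensors):
        for _ in range(blocks_per_sensor):
            # Duration in time steps, at least one step long.
            dur = max(1, int(rng.normal(2.0, 0.5) * steps_per_hour))
            start = int(rng.integers(0, max(1, num_steps - dur)))
            corrupted[s, start:start + dur] = 0.0
    return corrupted
```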

The results are shown in Table 3; for brevity, we report only the 30-min prediction. C-CNN-LSTM-DA outperforms the other models in the presence of missing values. In Fig. 11a, we illustrate the increase in error across models with random missing values: the increase is smallest for C-CNN-LSTM-DA, as the time series decomposition and denoising autoencoder components yield predictions that are more robust to missing values. With missing data, the forecast of the LSTM network can drift far from the real values (Fig. 11b).

(a) The increase in prediction error is reduced by clustering of sensors and the denoising autoencoder.
(b) Comparison of prediction with random missing data
Figure 11: Prediction results with random missing data
MAE 16.7 16.5 14.1 13.1
RMSE 28.8 28.9 25.2 23.9
Table 3: The average MAE and RMSE for the best four multi-variate time series forecasting models with randomly generated missing data.

6 Conclusion and future work

This paper presents a new framework for spatial time series forecasting problems and its application to traffic flow data. The proposed method consists of several components. First, for time series data gathered on a network, a standard convolutional layer does not capture the network structure, because its kernel slides over spatial locations. Hence, we obtain fuzzy clusters of time series and apply a multi-kernel convolution component, in which each kernel slides only over time steps and preserves the network structure of the time series. Second, time series residuals are not noise; they are the result of interactions among neighboring time series, as illustrated by the similarity of residuals in Fig. 6 and Fig. 8. Thus, the convolution component is applied to time series residuals to extract short-term interactions among neighboring time series. In Table 2, we evaluate the prediction of time series residuals: in off-peak hours, the proposed model matches the LSTM, while in peak hours it performs better. Table 1 compares the baseline, state-of-the-art neural network and CNN-LSTM models for traffic flow prediction; the C-CNN-LSTM model, which uses spatial clustering and time series decomposition, outperforms the baseline and state-of-the-art models. One of the challenges in spatial-temporal data is handling missing data. Fig. 11 illustrates the benefit of a pretrained denoising autoencoder as the last component of C-CNN-LSTM-DA: the increase in error for the model with the denoising autoencoder is smaller than for the other models.

This study demonstrates the effectiveness of designing neural network architectures around the specific properties of spatial-temporal problems. Each component of the neural network is designed based on the characteristics of the patterns it extracts. In the experimental results, we analyzed the spatial-temporal patterns in traffic flow data and illustrated the effect of each neural network component on the improvement of the results. Similar analyses can be constructed for other spatial-temporal problems, such as anomaly detection, missing data imputation, time series clustering and time series classification. In addition, different spatial-temporal problems have different physical properties or dynamical systems, which makes their patterns unique to each problem.


  • (1) Y. Lv, Y. Duan, W. Kang, Z. Li, F.-Y. Wang, Traffic flow prediction with big data: a deep learning approach, IEEE Transactions on Intelligent Transportation Systems 16 (2) (2015) 865–873.
  • (2) S. Ahmad, A. Lavin, S. Purdy, Z. Agha, Unsupervised real-time anomaly detection for streaming data, Neurocomputing 262 (2017) 134–147.
  • (3) Y. Zheng, Q. Liu, E. Chen, Y. Ge, J. L. Zhao, Time series classification using multi-channels deep convolutional neural networks, in: International Conference on Web-Age Information Management, Springer, 2014, pp. 298–310.
  • (4) K. Ø. Mikalsen, F. M. Bianchi, C. Soguero-Ruiz, R. Jenssen, Time series cluster kernel for learning similarities between multivariate time series with missing data, Pattern Recognition 76 (2018) 569–581.

  • (5) R. J. Bessa, A. Trindade, V. Miranda, Spatial-temporal solar power forecasting for smart grids, IEEE Transactions on Industrial Informatics 11 (1) (2015) 232–241.
  • (6) X. Qiu, Y. Ren, P. N. Suganthan, G. A. Amaratunga, Empirical mode decomposition based ensemble deep learning for load demand time series forecasting, Applied Soft Computing 54 (2017) 246–255.
  • (7) A. Grover, A. Kapoor, E. Horvitz, A deep hybrid model for weather forecasting, in: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2015, pp. 379–386.
  • (8) A. Tascikaraoglu, Evaluation of spatio-temporal forecasting methods in various smart city applications, Renewable and Sustainable Energy Reviews 82 (2018) 424–435.
  • (9) N. G. Polson, V. O. Sokolov, Deep learning for short-term traffic flow prediction, Transportation Research Part C: Emerging Technologies 79 (2017) 1–17.
  • (10) J. Zhang, Y. Zheng, D. Qi, Deep spatio-temporal residual networks for citywide crowd flows prediction., in: AAAI, 2017, pp. 1655–1661.
  • (11) X. Zheng, W. Chen, P. Wang, D. Shen, S. Chen, X. Wang, Q. Zhang, L. Yang, Big data for social transportation, IEEE Transactions on Intelligent Transportation Systems 17 (3) (2016) 620–630.
  • (12) J. Zhang, F.-Y. Wang, K. Wang, W.-H. Lin, X. Xu, C. Chen, et al., Data-driven intelligent transportation systems: A survey, IEEE Transactions on Intelligent Transportation Systems 12 (4) (2011) 1624–1639.
  • (13) O. Y. Al-Jarrah, P. D. Yoo, S. Muhaidat, G. K. Karagiannidis, K. Taha, Efficient machine learning for big data: A review, Big Data Research 2 (3) (2015) 87–93.
  • (14) D. C. Gazis, C. H. Knapp, On-line estimation of traffic densities from time-series of flow and speed data, Transportation Science 5 (3) (1971) 283–301.
  • (15) Y. Kamarianakis, P. Prastacos, Forecasting traffic flow conditions in an urban network: Comparison of multivariate and univariate approaches, Transportation Research Record: Journal of the Transportation Research Board (1857) (2003) 74–84.
  • (16) S. V. Kumar, L. Vanajakshi, Short-term traffic flow prediction using seasonal arima model with limited input data, European Transport Research Review 7 (3) (2015) 21.
  • (17) B. Ghosh, B. Basu, M. O’Mahony, Bayesian time-series model for short-term traffic flow forecasting, Journal of transportation engineering 133 (3) (2007) 180–189.
  • (18) G. Yu, J. Hu, C. Zhang, L. Zhuang, J. Song, Short-term traffic flow forecasting based on markov chain model, in: Intelligent Vehicles Symposium, 2003. Proceedings. IEEE, IEEE, 2003, pp. 208–212.
  • (19) J. Wang, W. Deng, Y. Guo, New bayesian combination method for short-term traffic flow forecasting, Transportation Research Part C: Emerging Technologies 43 (2014) 79–94.
  • (20) W. Huang, G. Song, H. Hong, K. Xie, Deep architecture for traffic flow prediction: deep belief networks with multitask learning, IEEE Transactions on Intelligent Transportation Systems 15 (5) (2014) 2191–2201.
  • (21) T. Kuremoto, S. Kimura, K. Kobayashi, M. Obayashi, Time series forecasting using a deep belief network with restricted boltzmann machines, Neurocomputing 137 (2014) 47–56.

  • (22) L. Wang, Z. Wang, H. Qu, S. Liu, Optimal forecast combination based on neural networks for time series forecasting, Applied Soft Computing 66 (2018) 1–17.
  • (23) X. Qiu, L. Zhang, Y. Ren, P. N. Suganthan, G. Amaratunga, Ensemble deep learning for regression and time series forecasting, in: Computational Intelligence in Ensemble Learning (CIEL), 2014 IEEE Symposium on, IEEE, 2014, pp. 1–6.
  • (24) A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, 2012, pp. 1097–1105.

  • (25) X. Ma, Z. Dai, Z. He, J. Ma, Y. Wang, Y. Wang, Learning traffic as images: a deep convolutional neural network for large-scale transportation network speed prediction, Sensors 17 (4) (2017) 818.
  • (26) S. Deng, S. Jia, J. Chen, Exploring spatial–temporal relations via deep convolutional neural networks for traffic flow prediction with incomplete data, Applied Soft Computing, 2018.
  • (27) M. Henaff, J. Bruna, Y. LeCun, Deep convolutional networks on graph-structured data, arXiv preprint arXiv:1506.05163.
  • (28) J. Bruna, W. Zaremba, A. Szlam, Y. LeCun, Spectral networks and locally connected networks on graphs, arXiv preprint arXiv:1312.6203.
  • (29) Y. Li, R. Yu, C. Shahabi, Y. Liu, Graph convolutional recurrent neural network: Data-driven traffic forecasting, arXiv preprint arXiv:1707.01926.
  • (30) J. T. Connor, R. D. Martin, L. E. Atlas, Recurrent neural networks and robust time series prediction, IEEE transactions on neural networks 5 (2) (1994) 240–254.
  • (31) H. Sak, A. Senior, F. Beaufays, Long short-term memory recurrent neural network architectures for large scale acoustic modeling, in: Fifteenth annual conference of the international speech communication association, 2014.
  • (32) Z. Zhao, W. Chen, X. Wu, P. C. Chen, J. Liu, Lstm network: a deep learning approach for short-term traffic forecast, IET Intelligent Transport Systems 11 (2) (2017) 68–75.
  • (33) X. Ma, Z. Tao, Y. Wang, H. Yu, Y. Wang, Long short-term memory neural network for traffic speed prediction using remote microwave sensor data, Transportation Research Part C: Emerging Technologies 54 (2015) 187–197.
  • (34) Y. Tian, K. Zhang, J. Li, X. Lin, B. Yang, Lstm-based traffic flow prediction with missing data, Neurocomputing 318 (2018) 297–305.
  • (35) S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, W.-c. Woo, Convolutional lstm network: A machine learning approach for precipitation nowcasting, in: Advances in neural information processing systems, 2015, pp. 802–810.
  • (36) S. Yi, J. Ju, M.-K. Yoon, J. Choi, Grouped convolutional neural networks for multivariate time series, arXiv preprint arXiv:1703.09938.
  • (37) X. Cheng, R. Zhang, J. Zhou, W. Xu, Deeptransport: Learning spatial-temporal dependency for traffic condition forecasting, arXiv preprint arXiv:1709.09585.
  • (38) Q. Liu, B. Wang, Y. Zhu, Short-term traffic speed forecasting based on attention convolutional neural network for arterials, Computer-Aided Civil and Infrastructure Engineering 33 (11) (2018) 999–1016.
  • (39) F. Petitjean, P. Gançarski, Summarizing a set of time series by averaging: From steiner sequence to compact multiple alignment, Theoretical Computer Science 414 (1) (2012) 76–91.
  • (40) L. Gupta, D. L. Molfese, R. Tammana, P. G. Simos, Nonlinear alignment and averaging for estimating the evoked potential, IEEE Transactions on Biomedical Engineering 43 (4) (1996) 348–356.
  • (41) V. Niennattrakul, C. A. Ratanamahatana, Shape averaging under time warping, in: Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, 2009. ECTI-CON 2009. 6th International Conference on, Vol. 2, IEEE, 2009, pp. 626–629.
  • (42) M. Konkol, Fuzzy agglomerative clustering, in: International Conference on Artificial Intelligence and Soft Computing, Springer, 2015, pp. 207–217.

  • (43) P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, in: Proceedings of the 25th international conference on Machine learning, ACM, 2008, pp. 1096–1103.
  • (44) P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.-A. Manzagol, Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion, Journal of machine learning research 11 (Dec) (2010) 3371–3408.
  • (45) T. Zhou, G. Han, X. Xu, Z. Lin, C. Han, Y. Huang, J. Qin, -agree adaboost stacked autoencoder for short-term traffic flow forecasting, Neurocomputing 247 (2017) 31–38.
  • (46) L. Gondara, K. Wang, Mida: Multiple imputation using denoising autoencoders, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2018, pp. 260–272.
  • (47) California PeMS, http://pems.dot.ca.gov/, 2017.
  • (48) D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
  • (49) M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., Tensorflow: a system for large-scale machine learning., in: OSDI, Vol. 16, 2016, pp. 265–283.
  • (50) X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the thirteenth international conference on artificial intelligence and statistics, 2010, pp. 249–256.