Spatio-temporal problems have been studied in broad domains, such as transportation systems, power grid networks and weather forecasting, where data is collected in a geographical area over time. Traffic flow data are an important type of spatio-temporal data. Unlike traditional methods for static network flow problems and route finding, in which the solution is found by solving an optimization problem, data-driven spatio-temporal approaches have recently been broadly applied to traffic flow data. Spatio-temporal data are gathered by a large number of sensors, which inevitably miss observations for a variety of reasons, such as error-prone measurements, malfunctioning sensors, or communication errors. In the presence of missing data, the performance of machine learning tasks such as classification, clustering and forecasting drops dramatically and inference becomes biased. Hence, researchers address the problem either by estimating missing values in a preprocessing step, or by developing machine learning models that are robust to missing data. Here we propose a method for missing data imputation in the preprocessing step.
Statistical and machine learning techniques are broadly applied for missing data imputation. The primary approach is to use an ARIMA model, which works well under linear assumptions. A matrix completion method has also been proposed for missing data imputation; however, it requires low-rankness and static data. Dimensionality reduction techniques perform well for missing data imputation, e.g., a probabilistic principal component analysis method for missing traffic flow data, and a tensor-based model for traffic data completion. Most recently, a clustering approach in spatial and temporal contexts has been proposed for missing data imputation, including pattern clustering-classification and an Extreme Learning Machine, together with an in-depth review of related work on missing data imputation in traffic flow problems. While clustering and dimensionality reduction techniques differ from our model, some similarities suggest an avenue for further investigation in the future.
The increasing size of spatio-temporal datasets motivates researchers to develop scalable missing data imputation techniques. Contrary to statistical techniques, neural networks do not rely on hand-crafted feature engineering and do not use prior assumptions on input data. Shallow neural networks have been shown to perform well compared with other machine learning algorithms on traffic data, but their performance degrades on large-scale problems. Recently, the strong performance of deep neural networks on large-scale problems and their flexible architectures for capturing spatial and temporal data have given them an advantage over statistical and other machine learning techniques. Following an earlier proposal of a denoising autoencoder with a fully connected neural network, a comparison of denoising autoencoders and k-means clustering for traffic flow imputation has been studied. Multiple missing data imputation with multiple trainings of fully connected, overcomplete autoencoders has also been examined.
Since training neural networks is computationally expensive, fully connected, multiply trained and overcomplete autoencoders can be inefficient solutions for large-scale problems. Moreover, recent works demonstrate the increased performance of convolutional layers and LSTM layers for extracting spatial and temporal patterns compared to fully connected layers. A convolutional neural network has been proposed for missing data imputation in traffic flow data. The model captures spatial and short-term patterns with a convolutional layer. A bidirectional LSTM with a modification of the LSTM neurons has also been proposed, but spatial data is not considered. Convolutional recurrent neural networks perform well in large-scale spatio-temporal problems. In related work, a spatio-temporal autoencoder is proposed for high-dimensional patient data with missing values, and classifiers are applied on top of the learned features.
In the aforementioned works, deep neural networks have been studied on spatio-temporal data. However, there is a lack of analysis of convolution-recurrent autoencoders for missing data imputation in traffic flow problems, where the objective is to learn spatial patterns with convolution layers and temporal patterns with LSTM layers. In this paper, we first propose a convolution recurrent autoencoder for multiple missing data imputation. The model is examined on traffic flow data. It is shown that the proposed convolution recurrent autoencoder improves the performance of missing data imputation. Moreover, the latent feature representation of the autoencoders is analyzed. This analysis shows that the latent feature space is a semantically meaningful representation of traffic flow data. We also examine the performance of applying k-nearest-neighbor (KNN) to evaluate the effectiveness of using the autoencoders’ latent representation in missing data imputation. The proposed model can be applied to missing data imputation in spatio-temporal problems.
II-A Problem definition
Spatio-temporal data is represented by a three-dimensional array whose axes are the sensors, the time steps and the features. Missing data can occur in various ways, for example at individual points or over intervals, where one sensor loses data for a period of time. To apply a deep neural network for time series imputation, a sliding window method generates fixed-length windows over the time axis, each covering all sensors and features. In the rest of the paper, we call each such window a data point. For the purpose of training and evaluation, an interval of missing values is added to the input data, which produces a corrupted copy of each data point. The objective is to impute the missing values of the corrupted data point using spatial and temporal correlation. In Fig. 1, a schematic example of applying a sliding window on a spatial time series with interval-wise missing values is represented.
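The sliding window and the interval-wise missing values described above can be illustrated with a minimal NumPy sketch. The function names, the toy array and the use of NaN as the missing-value marker are assumptions made for illustration; the paper does not state how missing entries are encoded.

```python
import numpy as np

def sliding_windows(series, window):
    """Cut a (sensors, steps, features) array into overlapping windows.

    Returns an array of shape (n_windows, sensors, window, features),
    where consecutive windows are shifted by one time step.
    """
    steps = series.shape[1]
    starts = range(steps - window + 1)
    return np.stack([series[:, s:s + window, :] for s in starts])

def mask_interval(series, sensor, start, length):
    """Return a copy in which one sensor loses data over an interval, plus the mask."""
    corrupted = series.astype(float).copy()
    mask = np.zeros(series.shape, dtype=bool)
    mask[sensor, start:start + length, :] = True
    corrupted[mask] = np.nan  # assumption: NaN marks a missing value
    return corrupted, mask

# Toy example: 3 sensors, 10 time steps, 1 feature.
flow = np.arange(30, dtype=float).reshape(3, 10, 1)
corrupted, mask = mask_interval(flow, sensor=1, start=4, length=3)

targets = sliding_windows(flow, window=6)       # clean data points
inputs = sliding_windows(corrupted, window=6)   # data points with a missing interval
```

Each row of `inputs` pairs with the same row of `targets`, which is the input and target arrangement a denoising model would train on.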
II-B A denoising autoencoder
A denoising encoder decoder has been proposed in prior work and can be applied to the missing data imputation problem. In the training process, a denoising encoder decoder receives the data point with missing values as input and the complete data point as target data. It reconstructs its input by minimizing a loss function, e.g. the mean square error between the target and the autoencoder’s output. In other words, the autoencoder receives a data point with some missing values and reconstructs it with the objective of accurate missing data imputation. An encoder reduces the dimension to a latent feature space that is smaller than the input space, which extracts the most important patterns of the input data. An autoencoder is capable of producing semantically meaningful representations on real-world datasets. The decoder reconstructs the input from its latent representation. In a two layer encoder decoder, the encoder and the decoder each consist of a single layer that combines a learned transformation with an activation function and a dropout function. Multiple fully connected, convolutional or recurrent layers can also be used as an encoder or decoder.
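As a rough illustration of the two layer encoder decoder, the following NumPy sketch runs one untrained forward pass and evaluates the mean square loss against the clean target. The tanh activation, the placement of dropout on the encoder input, the zero-filling of missing values and all variable names are assumptions for the sketch, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate, rng):
    """Inverted dropout: zero a random fraction of entries and rescale the rest."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def encode(x, W_enc, b_enc, rate, rng):
    """Encoder layer: dropout on the input, affine map, tanh activation (assumed)."""
    return np.tanh(dropout(x, rate, rng) @ W_enc + b_enc)

def decode(h, W_dec, b_dec):
    """Decoder layer with a linear activation."""
    return h @ W_dec + b_dec

def mse(target, output):
    """Mean square loss between the clean target and the reconstruction."""
    return float(np.mean((target - output) ** 2))

n_inputs, n_latent = 60, 10   # e.g. 10 sensors x 6 time steps, flattened
W_enc = rng.normal(0, 0.1, (n_inputs, n_latent))
b_enc = np.zeros(n_latent)
W_dec = rng.normal(0, 0.1, (n_latent, n_inputs))
b_dec = np.zeros(n_inputs)

clean = rng.random((4, n_inputs))   # batch of 4 complete data points
corrupted = clean.copy()
corrupted[:, 12:18] = 0.0           # assumption: missing entries are zero-filled

latent = encode(corrupted, W_enc, b_enc, rate=0.2, rng=rng)
reconstruction = decode(latent, W_dec, b_dec)
loss = mse(clean, reconstruction)   # the quantity a training loop would minimize
```

Training would adjust the weights to reduce `loss`; the sketch only shows how the corrupted input, the latent vector and the clean target relate.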
III A convolution-recurrent deep neural network encoder decoder framework
In multiple imputation, as discussed in detail in the literature, each missing datum is replaced with a weighted average of more than one imputation. Hence, we propose a framework for multiple missing data imputation on spatio-temporal data, represented in Fig. 2.
A sliding window method feeds fixed-size windows of the input data to the autoencoder. A convolution recurrent autoencoder reconstructs the input data and automatically imputes missing values. Because consecutive windows overlap, each time step is reconstructed several times, once for every window that contains it. The average of these reconstructed values is the output of the neural network. The evaluation of the reconstructed values is shown in Section IV-E. The second approach uses the latent feature representation of autoencoders. A KNN finds the most similar data points in the training data. The average of these produces the imputed values for the testing data. This approach is evaluated in Section IV-G.
III-A A CNN-BiLSTM Autoencoder
Here we introduce the proposed convolution recurrent autoencoder for spatio-temporal missing data imputation. The proposed model is illustrated in Fig. 3.
To extract spatial and temporal patterns, an encoder consists of both convolution and LSTM layers. A convolutional layer has a kernel, which slides over the spatial time series data. For non-grid data, sliding a kernel over spatial features loses the network structure and reduces the performance of the model. Hence the kernel only slides over the time axis. Kernels of various lengths have been shown to perform better than a single kernel length. Hence, several kernels with different lengths are applied to the input data. Each kernel produces one feature map per filter. All of the outputs are concatenated into a single representation whose depth is the total number of filters across all kernels.
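The multi-kernel convolution over the time axis can be sketched in plain NumPy as follows. The kernel lengths, the zero "same" padding that keeps all outputs the same length for concatenation, the treatment of sensors as input channels, and the function names are assumptions for illustration; only the filter size of 8 and the Leaky ReLU activation are taken from the experimental setup.

```python
import numpy as np

def temporal_conv_same(window, kernels, bias):
    """1-D convolution along time with stride 1 and zero 'same' padding.

    window:  (time, sensors), sensors act as input channels (assumption)
    kernels: (length, sensors, filters)
    returns: (time, filters)
    """
    length = kernels.shape[0]
    left = (length - 1) // 2
    padded = np.pad(window, ((left, length - 1 - left), (0, 0)))
    return np.stack([
        np.tensordot(padded[t:t + length], kernels, axes=([0, 1], [0, 1])) + bias
        for t in range(window.shape[0])
    ])

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

rng = np.random.default_rng(0)
window = rng.random((6, 10))      # one data point: 6 time steps, 10 sensors
kernel_lengths = [1, 2, 3, 4]     # hypothetical lengths, not from the paper
n_filters = 8

outputs = []
for length in kernel_lengths:
    kernels = rng.normal(0, 0.1, (length, window.shape[1], n_filters))
    outputs.append(leaky_relu(temporal_conv_same(window, kernels, np.zeros(n_filters))))

# Concatenate the feature maps of all kernels along the filter axis.
features = np.concatenate(outputs, axis=1)

# Small sanity check with all-ones data and an all-ones kernel of length 3.
check = temporal_conv_same(np.ones((4, 2)), np.ones((3, 2, 1)), np.zeros(1))
```

Because the kernel spans every sensor and moves only along time, the spatial ordering of non-grid sensors never enters the computation.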
An LSTM layer receives the output of the convolution layer. An LSTM cell uses input, output and forget gates to mitigate vanishing gradients in recurrent cells. It also returns a hidden state and a cell state. A bidirectional LSTM layer captures the relation of past and future data simultaneously. It has two sets of LSTM cells which propagate states in two opposite directions. Thus a bidirectional LSTM layer is used for the recurrent component. The output of the bidirectional LSTM combines the outputs of the forward and backward directions. The latent feature representation of the encoder consists of the LSTM states, namely the hidden and cell states of the forward and backward directions of the bidirectional LSTM.
The decoder receives the encoder states and the encoder output. The decoder consists of a bidirectional LSTM and a fully connected layer. The LSTM layer receives the hidden and cell states of the encoder to reconstruct the input data. A bidirectional model reconstructs past and future data. It is followed by a fully connected layer with a linear activation function.
Training the encoder decoder with convolution and LSTM layers is slow, as the gradient of the loss function is propagated backward through the LSTM cells and then through the convolutional layers. To increase the speed of training, we use a residual layer to connect the output of the convolution layer to the fully connected layer. In the training process, the gradient of the loss function reaches the convolution layer more directly and, as a result, the encoder decoder converges faster when learning spatial and temporal patterns.
The reconstruction of the input automatically imputes missing data from the spatial and temporal correlation among neighboring areas. Given the time window, every time step is reconstructed once for each window that contains it, and the average of these reconstructions is used for missing data imputation. The encoder decoder reconstructs the input data by minimizing the loss function over all time steps.
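The averaging of overlapping reconstructions can be written as a short NumPy sketch. It assumes windows shifted by one time step, so that window i covers steps i to i + window - 1; the function name and the toy values are illustrative.

```python
import numpy as np

def average_overlapping(reconstructions, n_steps):
    """Average all reconstructions of each time step.

    reconstructions: (n_windows, window, sensors); window i covers
                     time steps i .. i + window - 1 (stride 1 assumed)
    returns:         (n_steps, sensors)
    """
    n_windows, window, sensors = reconstructions.shape
    total = np.zeros((n_steps, sensors))
    count = np.zeros((n_steps, 1))
    for i in range(n_windows):
        total[i:i + window] += reconstructions[i]
        count[i:i + window] += 1
    return total / count

# Toy example: window i reconstructs every covered step as the constant i.
reconstructions = np.stack([np.full((3, 2), float(i)) for i in range(6)])
averaged = average_overlapping(reconstructions, n_steps=8)
```

Interior time steps are averaged over as many reconstructions as the window length, while the first and last few steps are covered by fewer windows.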
III-B Missing data imputation using latent feature representations
A KNN algorithm compares the distance among all data points and finds the nearest data points. This approach finds the most similar data points and then uses their average for missing data imputation. With a sliding window approach, the number of data points in the training data is the same as the number of time steps. For a given data point, which is a matrix over sensors and time steps in the window, the total number of comparisons in KNN grows with the number of training data points multiplied by the size of this matrix. Moreover, a time series distance can be obtained with Dynamic Time Warping, which is computationally more expensive than the Euclidean distance.
The latent representation of an autoencoder is a fixed-size vector of reduced dimension. Applying KNN on the latent representation is computationally more efficient than on the time series data points. The total number of comparisons grows with the number of training data points multiplied by the latent size, and the latent feature distance can be computed with the Euclidean distance, which is faster than Dynamic Time Warping. In the experimental analysis, we evaluate the computational time of applying KNN on the latent features. Moreover, the average of the most similar data points is used as a multiple missing data imputation. The results of this analysis are compared with PCA-KNN in the experimental results.
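A minimal sketch of KNN imputation in the latent space is given below. It assumes that latent vectors have already been produced by some trained encoder; the toy latent vectors, the function name and the NaN marker for missing entries are illustrative assumptions.

```python
import numpy as np

def knn_impute(latent_query, latent_train, train_points, k):
    """Average the k training data points whose latent vectors are nearest.

    Distances are Euclidean distances between fixed-size latent vectors.
    """
    dists = np.linalg.norm(latent_train - latent_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return train_points[nearest].mean(axis=0), nearest

# Toy latent vectors for 4 training data points (would come from an encoder).
latent_train = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.0, 1.0]])
# Matching training data points, each a (sensors, window) matrix.
train_points = np.stack([np.full((2, 3), float(i)) for i in range(4)])

# Query data point with one missing entry and its (assumed) latent vector.
query_point = np.array([[1.0, np.nan, 1.0], [1.0, 1.0, 1.0]])
latent_query = np.array([0.1, 0.1])

estimate, nearest = knn_impute(latent_query, latent_train, train_points, k=3)
missing = np.isnan(query_point)
filled = np.where(missing, estimate, query_point)  # only missing entries are replaced
```

Each distance touches only as many numbers as the latent size, which is where the saving over comparing full time series windows comes from.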
IV Experimental results
We examine the performance of the proposed model on traffic flow data available in PeMS. Traffic data are gathered every 30 seconds and aggregated every 5 minutes using loop detector stations on highways. We use three subsets of stations in the Bay Area, represented in Fig. 4, and report the average performance of our model over these three regions to obtain a more reliable evaluation of the models. The first region has 10 mainline stations on 22 miles of highway US 101-South. The second region has 9 mainline stations on 10 miles of I-280-South highway, and the third region has 11 mainline stations on 13 miles of I-880-South. The training data covers the first 4 months of 2016 and the testing data covers the next 2 months. The selected sensors have more than 99% of available data for this time period.
The data is scaled to the range [0, 1], where for each data set 0 is the minimum flow observed and 1 is the maximum. A sliding window approach is used to generate image-like input for time series data. During the experiments, we found that a time window of size 6 (30 minutes) works well. Each data point therefore covers all sensors of a region over 6 consecutive time steps.
To evaluate the model for missing data imputation, we added missing blocks to have a ground truth for evaluation. The missing data is generated randomly on the training and testing data. We generated blocks of missing data with a size of 0.5 to 4 hours. The sensors are randomly selected for each missing block. In our analysis, training data without missing values did not result in a robust autoencoder for missing data imputation. Therefore, 25% of the training and testing data is treated as missing values. In the analysis, the performance of the missing data imputation models is examined only on these missing blocks, which are identified by an index list.
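The scaling and the generation of random missing blocks can be sketched as follows. The block lengths of 6 to 48 steps correspond to 0.5 to 4 hours at 5-minute resolution; the synthetic flow values, the uniform sampling of block length and position, and the function names are assumptions for illustration.

```python
import numpy as np

def min_max_scale(data):
    """Scale a data set to [0, 1] using its observed minimum and maximum."""
    lo, hi = data.min(), data.max()
    return (data - lo) / (hi - lo), lo, hi

def add_missing_blocks(shape, target_ratio, min_len, max_len, rng):
    """Mark random single-sensor blocks as missing until the ratio is reached.

    shape = (time steps, sensors); returns a boolean mask, True where missing.
    """
    mask = np.zeros(shape, dtype=bool)
    while mask.mean() < target_ratio:
        length = rng.integers(min_len, max_len + 1)
        start = rng.integers(0, shape[0] - length + 1)
        sensor = rng.integers(0, shape[1])
        mask[start:start + length, sensor] = True
    return mask

rng = np.random.default_rng(0)
# Synthetic stand-in for one week of 5-minute flow readings from 10 sensors.
flow = rng.uniform(50, 600, size=(2016, 10))
scaled, flow_min, flow_max = min_max_scale(flow)

# Blocks of 6 to 48 steps (0.5 to 4 hours) until 25% of entries are missing.
mask = add_missing_blocks(scaled.shape, 0.25, 6, 48, rng)
missing_index = np.argwhere(mask)   # index list used later for evaluation
```

Keeping `flow_min` and `flow_max` allows imputed values to be mapped back to the original flow units before computing errors.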
IV-C Baseline missing data imputation models
Our first missing data imputation method uses a temporal average to fill missing data. Traffic flow patterns are repeated every week. Hence, a weekly-hourly average table is obtained from the training data (W-H-Average). The main drawback of using a temporal average is that specific days such as holidays or event days (games, festivals, concerts) have their own patterns, which are not repeated in the training data.
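One possible reading of the weekly-hourly average table is an average per sensor for each of the 168 hours of the week, as in the sketch below. The assumption that both series start at the same hour of the week, the function names and the toy data are illustrative and not taken from the paper.

```python
import numpy as np

STEPS_PER_HOUR = 12        # 5-minute aggregation
HOURS_PER_WEEK = 7 * 24

def hour_of_week(n_steps):
    """Hour-of-week slot (0..167) of each step, assuming step 0 starts a week."""
    return (np.arange(n_steps) // STEPS_PER_HOUR) % HOURS_PER_WEEK

def weekly_hourly_table(train, train_mask):
    """Average observed flow per hour-of-week slot and sensor: shape (168, sensors)."""
    slots = hour_of_week(train.shape[0])
    observed = np.where(train_mask, np.nan, train)
    return np.stack([np.nanmean(observed[slots == h], axis=0)
                     for h in range(HOURS_PER_WEEK)])

def impute_from_table(series, mask, table):
    """Replace missing entries with the table value of their hour-of-week slot."""
    return np.where(mask, table[hour_of_week(series.shape[0])], series)

# Toy data: two training weeks whose flow equals the hour-of-week slot.
train = np.repeat(hour_of_week(2 * 2016)[:, None], 2, axis=1).astype(float)
train_mask = np.zeros(train.shape, dtype=bool)
train_mask[100:130, 0] = True
table = weekly_hourly_table(train, train_mask)

test = np.repeat(hour_of_week(2016)[:, None], 2, axis=1).astype(float)
test_mask = np.zeros(test.shape, dtype=bool)
test_mask[500:520, 1] = True
imputed = impute_from_table(test, test_mask, table)
```

On this perfectly periodic toy data the table recovers the missing values exactly; on real data a holiday falling in a slot would be imputed with the ordinary weekly pattern, which is the drawback noted above.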
The second method uses the closest sensors to estimate the missing data. The value of traffic flow should be similar to that of the closest sensors on a highway. Following prior work, a Dynamic Time Warping distance method finds the most similar sensors using time series residuals. The method uses the average of the two closest sensors to estimate the missing data (Neighbor-Value).
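A simplified sketch of the neighbor-based baseline follows. It uses a textbook quadratic-time Dynamic Time Warping distance on the raw observed values rather than on time series residuals, so it is an illustration under stated assumptions and not the exact procedure of the cited work; all names and the toy data are hypothetical.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-programming DTW distance with absolute-difference cost."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def neighbor_value(series, target, missing, n_neighbors=2):
    """Fill the target sensor's missing steps with the mean of its closest sensors.

    series:  (steps, sensors); missing: boolean over steps for the target sensor.
    """
    observed = ~missing
    dists = {s: dtw_distance(series[observed, target], series[observed, s])
             for s in range(series.shape[1]) if s != target}
    closest = sorted(dists, key=dists.get)[:n_neighbors]
    filled = series[:, target].copy()
    filled[missing] = series[missing][:, closest].mean(axis=1)
    return filled, closest

# Toy data: sensors 1 and 2 track sensor 0 closely, sensor 3 does not.
base = np.sin(np.arange(40) / 5.0)
series = np.column_stack([base, base + 0.01, base - 0.01, 5 * base + 10])
missing = np.zeros(40, dtype=bool)
missing[10:20] = True
series[missing, 0] = np.nan

filled, closest = neighbor_value(series, target=0, missing=missing)
```

The similarity is computed only on steps where the target sensor is observed, so the missing block itself never influences which neighbors are chosen.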
In the third baseline method, the most important principal components are selected, then a KNN finds the most similar data points. The average of the nearest values is used to estimate missing data (KNN-PCA). In the analysis, we examine different numbers of PCA components. The first 10 components contain more than 95% of the information. Also, larger values of k improve the result, as the average of several imputations is usually a better estimate of the missing values. The best number of PCA components and the best value of k are 10 and 20, respectively. The number of features is the number of sensors multiplied by the time window, which is 60, 54 and 66 for the three regions. The best values of MAE and RMSE are shown in Table I.
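The KNN-PCA baseline can be approximated with the sketch below, which computes principal components through a singular value decomposition, keeps enough components to reach a chosen explained-variance ratio, and averages the nearest neighbors in the projected space. Interpreting the information ratio as explained variance, the synthetic data and the function names are assumptions for illustration.

```python
import numpy as np

def fit_pca(points, variance_ratio):
    """Return the mean, the leading components and the cumulative explained variance."""
    mean = points.mean(axis=0)
    _, sing, vt = np.linalg.svd(points - mean, full_matrices=False)
    explained = np.cumsum(sing ** 2) / np.sum(sing ** 2)
    n_components = int(np.searchsorted(explained, variance_ratio) + 1)
    return mean, vt[:n_components], explained

def project(points, mean, components):
    """Project flattened data points onto the selected components."""
    return (points - mean) @ components.T

def knn_average(query_proj, train_proj, train_points, k):
    """Average the k training points nearest to the query in the projected space."""
    dists = np.linalg.norm(train_proj - query_proj, axis=1)
    return train_points[np.argsort(dists)[:k]].mean(axis=0)

rng = np.random.default_rng(0)
# Synthetic flattened data points: 60 features driven by 3 hidden factors.
train_points = (rng.normal(size=(200, 3)) @ rng.normal(size=(3, 60))
                + 0.01 * rng.normal(size=(200, 60)))

mean, components, explained = fit_pca(train_points, variance_ratio=0.95)
train_proj = project(train_points, mean, components)

query = train_points[0]
estimate = knn_average(project(query, mean, components), train_proj, train_points, k=20)
```

In practice the entries of `estimate` at the missing positions of the query would be used as the imputed values.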
IV-D Autoencoder models
Here we describe the implemented autoencoders. For all of the models, the batch size is set to 256 and the number of epochs is set to 100. An Adam optimizer with a learning rate of 0.001 is used for training the models.
A fully connected denoising encoder decoder is implemented for missing data imputation (FC-NN). The model is trained with an architecture of (32, 16, 12, 16, 32), obtained by grid search over various numbers of layers and hidden units. Each layer is a fully connected layer with a Leaky ReLU activation function.
Table I: Missing data imputation error for traffic flow data.
To capture temporal patterns, an LSTM encoder decoder with 32 neurons is trained (LSTM). To capture the effect of past and future data points, a bidirectional LSTM is implemented with 16 neurons in each direction (BiLSTM). A dropout with parameter 0.2 prevents over-fitting of the LSTM layers. A convolution recurrent encoder decoder (CNN-BiLSTM) is implemented with four kernels of different sizes, a filter size of 8 and a Leaky ReLU activation function. The bidirectional LSTM has 16 units in each direction and is connected to a fully connected layer whose size equals the number of input sensors. The slow convergence of the CNN-BiLSTM model motivates us to add a residual layer connecting the convolution output to the output of the BiLSTM for faster gradient propagation. The resulting model, CNN-BiLSTM-Res, is the proposed model in Fig. 3.
IV-E Comparison of results
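Given $x_i$ as the real value and $\hat{x}_i$ as the predicted value, and given the set of missing data points in the testing data with their corresponding index set $I$, Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are used for evaluation: (1) $\mathrm{MAE} = \frac{1}{|I|} \sum_{i \in I} |x_i - \hat{x}_i|$ and (2) $\mathrm{RMSE} = \sqrt{\frac{1}{|I|} \sum_{i \in I} (x_i - \hat{x}_i)^2}$. In both (1) and (2), the index $i$ is selected from the index set of missing data.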
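The two error measures, restricted to the missing entries, can be computed as in the short sketch below. The boolean mask representation of the index set and the toy numbers are assumptions for illustration.

```python
import numpy as np

def masked_mae(actual, predicted, mask):
    """Mean absolute error over the entries marked as missing."""
    return float(np.mean(np.abs(actual[mask] - predicted[mask])))

def masked_rmse(actual, predicted, mask):
    """Root mean square error over the entries marked as missing."""
    return float(np.sqrt(np.mean((actual[mask] - predicted[mask]) ** 2)))

actual = np.array([[100.0, 200.0], [300.0, 400.0]])
predicted = np.array([[110.0, 200.0], [300.0, 370.0]])
mask = np.array([[True, False], [False, True]])   # True where data was removed

mae = masked_mae(actual, predicted, mask)     # errors of 10 and 30 give 20.0
rmse = masked_rmse(actual, predicted, mask)   # square root of (100 + 900) / 2
```

Entries outside the mask do not contribute, so a model is not rewarded for simply copying values it was given.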
The results are presented in Table I. They show that the temporal and spatial averages, i.e. the first two models, perform poorly for missing data imputation. Among the three baseline models, KNN-PCA is the best missing data imputation technique. Autoencoders perform significantly better than the baseline models. The LSTM model performs better than FC-NN for missing data imputation because it captures temporal patterns. A bidirectional LSTM improves performance further by capturing the relation between past and future data simultaneously. The CNN-BiLSTM converges to a solution only slowly and is not better than the BiLSTM model. Finally, the proposed CNN-BiLSTM-Res encoder decoder has the best MAE and RMSE. This shows that a residual layer improves the performance of a combination of convolution and LSTM layers. The CNN-BiLSTM-Res model achieves a 13% and 7% improvement in MAE and RMSE, respectively, compared with the best BiLSTM model. As described in Section III-A, because of the slow convergence of convolutional LSTM models, a residual layer is used to propagate gradients of the loss function directly to the convolution layer. Fig. 5 shows the convergence of CNN-BiLSTM and CNN-BiLSTM-Res, where CNN-BiLSTM-Res converges faster.
In Fig. 6, the prediction results of FC-NN and CNN-BiLSTM-Res are shown as an example of missing data imputation. Compared with FC-NN, the imputation of CNN-BiLSTM-Res is clearly more accurate and closer to the ground truth. In Fig. 7, the plot illustrates the missing data imputation by CNN-BiLSTM-Res for two missing blocks over three days, and shows the closeness of the imputed data to the real traffic flow data. This example shows that the estimate for a missing block is very close to the real values; however, the distance between real and predicted values is still larger inside missing blocks than for healthy data, i.e. the time series values outside the missing blocks.
IV-F Discussion on multiple missing data imputation
For non-temporal data, an autoencoder reconstructs one value for each input data point. However, for temporal data, a sliding window generates a data point for each time step. Referring to Fig. 1, each data point actually contains all of the values within a time window. For a given time window, each time step therefore has as many reconstructed values as there are windows containing it. The result in Table I is for multiple missing data imputation. Here we use a single reconstruction of each output for comparison purposes. In other words, here we describe the single missing data imputation output of applying autoencoders on traffic flow data.
The values of MAE for FC-NN, LSTM, BiLSTM and CNN-BiLSTM are 23.7, 15.5, 11.9 and 6.9, respectively. The values of RMSE for FC-NN, LSTM, BiLSTM and CNN-BiLSTM are 32.1, 22.5, 18.1 and 13.7, respectively. Compared to Table I, a single missing data imputation performs much worse. The analysis shows that performing multiple imputations and using their average significantly improves missing data imputation. This multiple imputation approach improves the output of autoencoders on time series data.
IV-G Latent feature representation
The latent feature representation of autoencoders carries meaningful information. In Fig. 8, a t-SNE method visualizes the latent feature representation of the hidden state of FC-NN for 7 days. The plot shows that data points from the same time of day lie closer to each other. Here our objective is to illustrate how the latent feature representation can be used for missing data imputation. Hence we use the concept of similarity of data points. A KNN is applied on the latent feature representation of the training data points, and the k most similar data points are used. The error of their average for different values of k is shown in Fig. 9. The plot shows that 1 nearest neighbor on the latent feature representation results in 23.5 and 31.0 for MAE and RMSE, whereas 13 nearest neighbors result in 16.7 and 22.6 for MAE and RMSE, respectively. The reduction in missing data imputation error shows the effectiveness of multiple imputation on the latent feature representation. We also examine the relation between the size of the latent features and missing data imputation on FC-NN in Fig. 10. The analysis shows that the performance of missing data imputation varies across latent sizes of 2 to 20. The results suggest that the best latent size is 10.
A KNN is applied on the latent feature representation of the various implemented autoencoders. Applying KNN on the latent features of FC-NN and on the hidden and cell states of LSTM and BiLSTM gives MAE of 16.6, 18.1 and 17.8 and RMSE of 22.5, 24.1 and 23.8, respectively. While a FC-NN with six layers is the best model for generating latent features, the other convolution-recurrent models cannot easily generate a latent feature representation suitable for missing data imputation. One conclusion is that the size of the latent vector greatly affects the result. A KNN on a smaller latent vector finds better missing data imputations. The analysis also shows that applying KNN on the latent features of FC-NN is better than KNN-PCA, which shows that autoencoders are capable of generating a better latent feature representation for traffic flow data.
V Conclusion and Future Work
In this paper, we study autoencoders for missing data imputation in spatio-temporal problems and examine the performance of various autoencoders in capturing spatial and temporal patterns. We illustrate that a convolution recurrent autoencoder can capture spatial and temporal patterns and outperforms state-of-the-art missing data imputation. We conclude that a convolution layer with various kernel sizes and a bidirectional LSTM improve missing data imputation in traffic flow data. Also, the slow convergence of the convolution-recurrent autoencoder is improved with a residual layer. We also describe an approach to multiple imputation for autoencoders on time series data. The results show that multiple imputation is significantly better than single imputation. Moreover, we illustrate the advantage of using the latent features of autoencoders for missing data imputation. We describe an approach for using the autoencoders’ latent feature representation for multiple imputation. The analysis shows that it outperforms KNN on principal components of traffic flow data. However, the latent features of convolution-recurrent autoencoders need a careful design of the architecture to obtain better results, which can be explored further in future work.
Future research will focus on generative neural networks. Moreover, while convolution-recurrent neural networks have been shown to perform well on spatio-temporal problems, spatial and temporal clustering techniques could make the model more effective on larger geographical areas.
-  G. Atluri, A. Karpatne, and V. Kumar, “Spatio-temporal data mining: A survey of problems and methods,” ACM Computing Surveys (CSUR), vol. 51, no. 4, p. 83, 2018.
-  R. Asadi, S. S. Kia, and A. Regan, “Cycle basis distributed admm solution for optimal network flow problem over biconnected graphs,” in 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 717–724, IEEE, 2016.
-  A. Regan, C. Christie, R. Asadi, and D. Arkhipov, “Integration of crowdsourced and traditional data for route analysis and route finding for pedestrians with disabilities,” in Canadian Transportation Research Forum 51st Annual Conference-North American Transport Challenges in an Era of Change//Les défis des transports en Amérique du Nord à une aire de changement Toronto, Ontario, May 1-4, 2016, 2016.
-  B. Bae, H. Kim, H. Lim, Y. Liu, L. D. Han, and P. B. Freeze, “Missing data imputation for traffic flow speed using spatio-temporal cokriging,” Transportation Research Part C: Emerging Technologies, vol. 88, pp. 124–139, 2018.
-  L. Li, J. Zhang, Y. Wang, and B. Ran, “Missing value imputation for traffic-related time series data based on a multi-view learning method,” IEEE Transactions on Intelligent Transportation Systems, 2018.
-  C. F. Ansley and R. Kohn, “On the estimation of arima models with missing values,” in Time series analysis of irregularly observed data, pp. 9–37, Springer, 1984.
-  H.-F. Yu, N. Rao, and I. S. Dhillon, “Temporal regularized matrix factorization for high-dimensional time series prediction,” in Advances in neural information processing systems, pp. 847–855, 2016.
-  L. Qu, L. Li, Y. Zhang, and J. Hu, “PPCA-based missing data imputation for traffic flow volume: A systematical approach,” IEEE Transactions on Intelligent Transportation Systems, vol. 10, no. 3, pp. 512–522, 2009.
-  B. Ran, H. Tan, Y. Wu, and P. J. Jin, “Tensor based missing traffic data completion with spatial–temporal correlation,” Physica A: Statistical Mechanics and its Applications, vol. 446, pp. 54–63, 2016.
-  I. Laña, I. I. Olabarrieta, M. Vélez, and J. Del Ser, “On the imputation of missing data for road traffic forecasting: New insights and novel techniques,” Transportation research part C: emerging technologies, vol. 90, pp. 18–33, 2018.
-  R. Asadi and M. Ghatee, “A rule-based decision support system in intelligent hazmat transportation system,” IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 5, pp. 2756–2764, 2015.
-  P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of machine learning research, vol. 11, no. Dec, pp. 3371–3408, 2010.
-  Y. Duan, Y. Lv, Y.-L. Liu, and F.-Y. Wang, “An efficient realization of deep learning for traffic data imputation,” Transportation research part C: emerging technologies, vol. 72, pp. 168–181, 2016.
-  L. Gondara and K. Wang, “Mida: Multiple imputation using denoising autoencoders,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 260–272, Springer, 2018.
-  Y. Zhuang, R. Ke, and Y. Wang, “Innovative method for traffic data imputation based on convolutional neural network,” IET Intelligent Transport Systems, 2018.
-  W. Cao, D. Wang, J. Li, H. Zhou, L. Li, and Y. Li, “Brits: Bidirectional recurrent imputation for time series,” in Advances in Neural Information Processing Systems, pp. 6776–6786, 2018.
-  R. Asadi and A. Regan, “A spatial-temporal decomposition based deep neural network for time series forecasting,” arXiv preprint arXiv:1902.00636, 2019.
-  Y. Jia, C. Zhou, and M. Motani, “Spatio-temporal autoencoder for feature learning in patient data with missing observations,” in 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 886–890, IEEE, 2017.
-  Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng, “Building high-level features using large scale unsupervised learning,” arXiv preprint arXiv:1112.6209, 2011.
-  J. L. Schafer and M. K. Olsen, “Multiple imputation for multivariate missing-data problems: A data analyst’s perspective,” Multivariate behavioral research, vol. 33, no. 4, pp. 545–571, 1998.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
-  S. Salvador and P. Chan, “Toward accurate dynamic time warping in linear time and space,” Intelligent Data Analysis, vol. 11, no. 5, pp. 561–580, 2007.
-  “California. pems, http://pems.dot.ca.gov/, 2017,”
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.
-  L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of machine learning research, vol. 9, no. Nov, pp. 2579–2605, 2008.