I Introduction
Short-term road traffic prediction (STTP) has recently been studied as a means of achieving efficient route planning and traffic control in Intelligent Transportation Systems (ITS) [1]. The main idea of STTP is to predict the road traffic state (i.e., flow, speed and density) over the next five to thirty minutes by analyzing historical data [2]. However, existing STTP studies mainly focus on a single road segment, or on a small-scale network containing several adjacent road segments, which is at odds with effective route planning, since the latter requires a global perspective based on information from the whole network [3, 4, 5]. Besides, the majority of existing STTP algorithms are limited to a single scenario such as a freeway, an arterial or a corridor, and are difficult to generalize to a heterogeneous road network. Past STTP methods for large-scale road networks develop either a specific model for each road segment, termed the individual-based model (IM), or a general model for all road segments, termed the whole-based model (WM). Owing to the multiplicity and heterogeneity of large-scale networks, neither of the two is appropriate. Firstly, too many IMs take up a large amount of storage in ITS. Secondly, a single WM is not competent for modeling a whole network with different types of traffic patterns. Moreover, the development of ITS across cities increases the amount of traffic data in terms of time span and granularity [6]. Making full use of this big traffic data to improve prediction performance becomes a challenge. Therefore, a feasible STTP approach for large-scale networks needs to be studied.
Generally, representation learning, a.k.a. dimension reduction, is used to transform raw data into a good representation that makes subsequent tasks easier. It plays an important role in time series clustering, because time series are essentially high-dimensional and susceptible to noise. Hence, clustering directly on raw series is computationally expensive, and distance measures are highly sensitive to distortions. Recently, deep learning (DL) has achieved great success in many areas, including computer vision, speech recognition and natural language processing, due to its theoretical function approximation properties [7] and demonstrated feature learning capabilities [8]. Therefore, deep representation is used for traffic series clustering.
In this paper, a feasible framework composed of a deep clustering module and several prediction models is proposed for STTP at large-scale networks. More specifically, a shape-based representation learning method is developed for road segment clustering, and several prediction models are then combined to achieve STTP over the network. The main contributions of the paper are summarized as follows:

By fully exploiting the periodicity of traffic patterns, we propose a method to generate triplets from an unlabeled dataset. The raw traffic series are divided into sub-series by periods, three of which are selected to generate a triplet according to a specific criterion. Compared to the raw series, the dimension of the sub-series used for representation learning is significantly reduced.

A supervised deep clustering module, termed DeepCluster, is developed. Unlike existing handcrafted features, such as frequency transformation, wavelet transformation and shapelets, a purely data-driven method is proposed to learn the shape-based representations of traffic series in a visualized way. A rasterization strategy is first designed to transform the traffic series into traffic images. A convolutional neural network (CNN) with triplet loss is then used for representation learning. Finally, the representations are used to cluster the network into groups by traditional clustering methods.

Based on the idea of model sharing, group-based models (GMs) that together constitute a prediction for the network are proposed to achieve a good tradeoff between the number of models and prediction performance. Specifically, all road segments in one group share one prediction model, and each GM is trained on the aggregated training samples generated by the road segments of that group. Model sharing increases the number and diversity of the training samples, which is beneficial for DNN training. The experimental results validate that the GM has stronger generalization ability than the IM. We also analyze the impact of the input interval on performance by experiments.
The rest of the paper is organized as follows. Section II reviews related work. In Section III, the data used throughout the paper is described. Section IV formulates the STTP problem at large-scale networks. In Section V, the DL methodologies are introduced. The proposed STTP framework, including DeepCluster and DeepPrediction, is presented in Section VI. In Section VII, simulation results demonstrating the performance of the proposed framework are given, before the paper is concluded in Section VIII.
II Related Works
II-A Time Series Representation Learning
A wide variety of methods have been developed for time series representation learning in clustering [9, 10, 11], such as spectral transformation [12], wavelet transformation [12], eigenvalue analysis techniques [13], piecewise linear approximation (PLA) [14], adaptive piecewise constant approximation (APCA) [15], symbolic approximation (SAX) [16], piecewise aggregate approximation (PAA) [17] and perceptually important points (PIP) [18]. However, all these methods rely on handcrafted features, which are designed to describe specific time series patterns and depend heavily on the database.
A new trend has appeared with artificial neural networks (ANNs), especially deep NN (DNN) based representation learning in clustering, which is data-driven and capable of learning a powerful representation from raw data through a high-level, nonlinear mapping. Some works have therefore used deep representation learning to improve clustering performance. C. Song et al. in [19] integrated the k-means algorithm into a stacked autoencoder (SAE) by minimizing the reconstruction error as well as the distance between data points and their corresponding clusters, alternately learning the representations and updating the cluster centers. In [20, 21], the k-means algorithm used nonlinear representations learned by DNNs for clustering. J. Xie et al. in [22] proposed deep embedded clustering, which simultaneously learns the representations and cluster assignments by defining a centroid-based probability distribution and minimizing its Kullback-Leibler (KL) divergence to an auxiliary target distribution. K. Tian et al. in [23] improved on existing works by proposing a general, flexible framework that integrates traditional clustering methods into different DNNs. The framework is optimized by the alternating direction method of multipliers (ADMM). However, the above methods all work with static data, which is in general simple and low-dimensional compared with time series data, and there is less research on deep representation learning of time series for clustering. Therefore, an efficient time series representation learning algorithm dedicated to clustering needs to be developed.
II-B Short Term Traffic Prediction
There is a large body of research on single-point STTP [3], including the autoregressive integrated moving average (ARIMA) family of models, regression models, Markov models, Kalman filters, Bayesian networks, traffic flow theory-based simulation models and ANNs. Single-point models predict the future traffic state of a target road segment using only its own historical data, which ignores the relations between the target road segment and adjacent segments. Consequently, some studies have focused on predicting one or multiple segments by taking the spatiotemporal interrelations between adjacent road segments into account [24, 25, 26, 27]. However, the above network-level STTP studies are restricted to small regions that contain several adjacent road segments.
Recently, a few studies have begun to pay attention to prediction at large-scale networks. In [29, 30], a dynamic simulator based on traffic flow theory was used for STTP over the whole network with limited traffic data. The works in [31, 32, 33, 34] predicted only the traffic state of a representative subset of roads and used data compression technologies to extend the prediction to the whole network. However, the prediction performance was poor owing to compression and reconstruction errors. Min et al. in [35] considered a road network consisting of about 500 road segments. However, they developed a custom model for the test area, which is not practical. M. Asif et al. in [36] performed prediction for each individual road segment with the support vector regression (SVR) algorithm over a large network containing 5,000 road segments. Then the k-means algorithm was used to cluster the road segments to analyze the spatial prediction performance. But the prediction method may not work well, since the performances differed greatly among clusters and the mean error of one cluster is up to of five-minute prediction. Besides, STTP for each individual road segment is hard to implement on large-scale networks in practice. X. Ma et al. in [37] proposed a CNN-based method that arranges the traffic data into 2D (2-dimensional) matrices as inputs to predict large-scale traffic speeds. However, they built only one model and expected it to fit all segments, without considering that the whole network is heterogeneous, with different types of segments. Therefore, these attempts are hard to implement on large-scale networks with high accuracy.
III The Data
The traffic data used throughout the paper is described in this section. The topology of Liuli Bridge is shown in Fig. 1. The network consists of about 1,000 road segments with diverse road functions, including expressways, arterial roads, access roads and side roads. The dataset, collected by the Beijing Transportation Institute, contains traffic speed data from September 2017 to November 2017 with a five-minute sampling interval. Hence, it has totally measured data, where means the total number of days and means the number of values collected in each day. The data is measured by GPS-equipped vehicles such as taxis and buses.
IV Formulation of STTP Problem
Consider a largescale network consisting of road segments, i.e., , where is a time series of measurements at segment . We denote a sub traffic series by
(1) 
where is a set of continuous measured values with intervals from a time series , that starts at position with , and . is abbreviated as for simplicity.
Let be the forecast of traffic state of the prediction horizon , given the corresponding historical measurements up to time . The goal of STTP is to construct a mapping between the historical traffic state and the future one, i.e.,
(2)  
As stated above, IM and WM are both inappropriate for large-scale networks, because such networks consist not only of a large number of road segments but also of a variety of road segment types, as shown in Fig. 2. On the one hand, it is impractical to construct and store massive numbers of IMs in ITS. Besides, the number of training samples collected from one segment is insufficient to learn a robust DL model. On the other hand, it is impossible to build one model for a whole network with different types of traffic patterns. In addition, taking historical data from all segments as inputs makes such a model vulnerable to the curse of dimensionality. How to make proper use of the tremendous amount of traffic data to achieve effective and practical STTP therefore remains a problem.
To tackle this problem, we cluster the road segments into groups, each of which has a typical traffic pattern. Within each group, the traffic patterns of all road segments are highly similar in shape. Based on that, an STTP model is built for each group, rather than for each segment or for the whole network. The challenges in our problem include i) representation learning of traffic series that are high-dimensional and sensitive to distortion, and ii) learning representations from unlabeled traffic data that benefit the clustering task.
V Deep Learning for Traffic Predictions
In this section, we introduce the DL technologies used to deal with the tremendous amount of traffic series, namely CNNs and recurrent NNs (RNNs).
V-A Convolutional Neural Networks
The key aspect of CNNs is that the features are not designed by human engineers, but are learned from data using a general-purpose learning procedure [8]. Fig. 3 shows the architecture of a typical CNN, named LeNet-5. CNNs can take arrays of any form as inputs, such as 1D series, 2D images and 3D videos. A CNN is made up of layers, of which the two main types that differ from regular ANNs are convolutional layers (C layers in Fig. 3) and subsampling layers (S layers in Fig. 3).
In the th convolutional layer , the outputs of the previous layer are convolved with several convolutional kernels. After that, biases are added to the outputs, which are then passed through a nonlinear activation function to form new representations (features in Fig. 3) for the next layer. Assume the current layer accepts an input volume of size . Formally, the output of size filtered by the th kernel of size with stride is given by
(3) 
where is the number of kernels and is a bias of layer , respectively. represents a discrete convolution operator . is an activation function such as the tanh function or the ReLU function. By concatenating along the last dimension, the output for layer of the size can be derived, and both of which can be calculated by
(4) 
(5) 
(6) 
where represents rounded down. With parameter sharing, there are learnable weights of layer in total,
(7) 
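To make the size bookkeeping of a convolutional layer concrete, the following is a minimal sketch under stated assumptions: the standard floor-based output-size rule is used, zero padding is an added optional parameter, and one bias per kernel is counted as learnable. The function and argument names are illustrative and are not taken from the paper.

```python
def conv_output_shape(in_h, in_w, kernel_h, kernel_w, stride, num_kernels, padding=0):
    """Shape (height, width, depth) of a convolutional layer's output volume.

    Assumes the usual rule floor((in + 2 * padding - kernel) / stride) + 1.
    """
    out_h = (in_h + 2 * padding - kernel_h) // stride + 1
    out_w = (in_w + 2 * padding - kernel_w) // stride + 1
    return out_h, out_w, num_kernels


def conv_param_count(kernel_h, kernel_w, in_depth, num_kernels):
    """Learnable weights under parameter sharing.

    Assumption: each kernel spans the full input depth and has one bias.
    """
    return (kernel_h * kernel_w * in_depth + 1) * num_kernels


# First convolutional layer of LeNet-5: 32x32x1 input, six 5x5 kernels, stride 1.
print(conv_output_shape(32, 32, 5, 5, 1, 6))  # (28, 28, 6)
print(conv_param_count(5, 5, 1, 6))           # 156
```

Because the weight count depends only on the kernel size and depth, and not on the input resolution, the same layer can process high-dimensional inputs without a growth in parameters.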
In the th subsampling layer , the spatial resolution of representations is reduced to increase the level of distortion invariance. After layer , the layer accepts a volume of size as input. Specifically, representations in the previous layer are pooled over neighborhood within a rectangular region of , by either a max-pooling function
(8)  
or an average-pooling function, where is the output of size , and both of them can be calculated by
(9) 
(10) 
(11) 
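The subsampling step can be illustrated with a small max-pooling routine. This is a sketch only, assuming a single-channel feature map stored as a list of rows and no padding; the name `max_pool_2d` is hypothetical.

```python
def max_pool_2d(feature_map, pool_h, pool_w, stride):
    """Max-pool a 2D feature map (list of rows) over pool_h x pool_w windows."""
    in_h, in_w = len(feature_map), len(feature_map[0])
    # Output size follows the same floor rule as for convolution (no padding).
    out_h = (in_h - pool_h) // stride + 1
    out_w = (in_w - pool_w) // stride + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [feature_map[i * stride + di][j * stride + dj]
                      for di in range(pool_h) for dj in range(pool_w)]
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 5],
               [0, 1, 7, 2],
               [3, 2, 1, 6]]
print(max_pool_2d(feature_map, 2, 2, 2))  # [[4, 5], [3, 7]]
```

Replacing `max(window)` with `sum(window) / len(window)` would give the average-pooling variant.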
The convolutional and subsampling operators make the new representations more invariant to distortion than the raw data. Besides, parameter sharing makes CNNs capable of processing high-dimensional inputs. These characteristics allow CNNs to be adopted for time series representation learning.
In this paper, we explore an efficient deep CNN architecture, FaceNet [39], to learn the deep representations of the raw time series.
V-B Recurrent Neural Networks
Unlike the regular ANNs, RNNs are capable of exhibiting the temporal correlations of time series, which makes them applicable to tasks such as language modeling, speech recognition or time series forecasting.
Assuming the duration of the temporal correlations (defined as time step) is , a three-layer RNN can be regarded as a layer feedforward NN by unfolding it through time, as shown in Fig. 4. The RNN reads a series one value at a time, and each RNN block takes the value at one time as input. The current hidden state at time is computed from the current input and the previous hidden state by
(12)  
where is the hidden state of the last two RNN blocks. is the activation function of the hidden layer. The key idea of RNNs is to imitate sequential dynamic behavior with a chain-like structure that allows information to be passed from the previous layer to the current one. In this paper, RNNs are used to model the temporal correlations of traffic series.
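The unfolding through time can be sketched with a plain Elman-style recurrence, in which each hidden state is computed from the current input and the previous hidden state. This is an illustrative sketch and not the LSTM used later in the paper; the weight names `w_in`, `w_rec` and `bias`, the tanh activation and the zero initial state are all assumptions.

```python
import math


def rnn_forward(series, w_in, w_rec, bias):
    """Unroll a single-layer recurrent network over a scalar series.

    w_in:  list of H input weights
    w_rec: H x H recurrent weight matrix (list of rows)
    bias:  list of H biases
    Returns the list of hidden states, one per time step.
    """
    hidden_size = len(bias)
    h = [0.0] * hidden_size  # assumed zero initial hidden state
    states = []
    for x in series:
        # The comprehension reads the previous h before h is rebound.
        h = [math.tanh(w_in[k] * x
                       + sum(w_rec[k][m] * h[m] for m in range(hidden_size))
                       + bias[k])
             for k in range(hidden_size)]
        states.append(h)
    return states


states = rnn_forward([0.5, 0.1, -0.3], w_in=[1.0, -0.5],
                     w_rec=[[0.2, 0.0], [0.1, 0.3]], bias=[0.0, 0.1])
print(len(states), len(states[0]))  # 3 2
```

The same weights are reused at every time step, which is what lets the network be viewed as a deep feedforward network with shared parameters.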
VI Proposed Framework for STTP
In this section, a framework dedicated to STTP at large-scale networks is described in detail. The architecture of this framework is shown in Fig. 5. It consists of two major components, i.e., DeepCluster and DeepPrediction. The inputs are historical traffic states with a fixed interval coming from different road segments, while the outputs are predictions for a given time period. The inputs are fed to the DeepCluster module, where the road segments are divided into several groups. Afterwards, the DeepPrediction module performs the predictions for the network.
VI-A DeepCluster
A traffic series clustering method for large-scale networks is first proposed, which is implemented via deep representation learning. Before developing the clustering algorithm, the problem of clustering at large-scale networks is formally defined as follows:
Definition 1: Given a large-scale network consisting of traffic series, i.e., , the process of partitioning into groups is called traffic series clustering, such that homogeneous traffic series are grouped together based on a certain similarity measure.
In contrast to traditional extrinsic handcrafted features, human brains can easily grasp intrinsic visual-based features, which is why they can quickly distinguish different types of time series with the help of their high abstraction ability. Moreover, compared with raw time series, intrinsic visual-based features are more stable: they are less affected by distortions and by the scale of the samples. To address the issues of clustering methods based on raw data or handcrafted features, we use deep representation learning for series clustering. A DNN is employed to learn a mapping from the raw high-dimensional traffic series to low-dimensional representations that are used for clustering.
The DeepCluster module includes triplets generation, inputs transformation, representation learning and clustering. Details of each step are given below.

Triplets Generation. As can be seen in Fig. 6, the traffic patterns follow the same trend from day to day. In order to study the periodic traffic pattern within a day, we calculate the traffic similarity defined in [38]. The traffic similarity is defined as the normalized gap between each pair of measurements taken in two consecutive days at one road segment. As stated in Section III, traffic speeds are collected every 5 minutes. Since one day has 288 time intervals, the traffic similarity at segment in time slot can be calculated by
(13) 
The cumulative distribution function (CDF) of is shown in Fig. 7. We can see that more than are smaller than , which indicates that a periodic pattern exists in the traffic series at most road segments.
To fully exploit the temporal features and periodic patterns of traffic, we split the traffic series into sub-series by periods and generate triplets for representation learning. Given traffic series with period measured from road segments, we split the series into sub-series by periods, termed periodic sub-series. Thus, we have periodic sub-series for each segment,
(14) 
here is the th periodic sub-series at segment with . A triplet is made up by randomly choosing two different periodic sub-series from one segment, and one sub-series from another segment,
(15) 
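The triplet construction described above can be sketched as follows. This is a minimal illustration under assumptions: the series are plain Python lists, incomplete trailing periods are dropped, and the names `split_into_periods` and `sample_triplet` are hypothetical rather than taken from the paper.

```python
import random


def split_into_periods(series, period):
    """Cut a series into consecutive, non-overlapping sub-series of one period each."""
    count = len(series) // period  # assumption: drop an incomplete last period
    return [series[i * period:(i + 1) * period] for i in range(count)]


def sample_triplet(periodic, rng):
    """Draw one (anchor, positive, negative) triplet.

    periodic: dict mapping segment id -> list of periodic sub-series.
    Anchor and positive are two different sub-series of one segment;
    the negative is a sub-series of a different segment.
    """
    anchor_seg, negative_seg = rng.sample(list(periodic), 2)
    anchor, positive = rng.sample(periodic[anchor_seg], 2)
    negative = rng.choice(periodic[negative_seg])
    return anchor, positive, negative


rng = random.Random(0)
# Toy data: two segments, three "days" of four samples each (288 per day in the paper).
periodic = {
    "seg_a": split_into_periods([50, 48, 30, 45, 51, 47, 29, 44, 52, 49, 31, 46], 4),
    "seg_b": split_into_periods([20, 35, 36, 21, 19, 34, 37, 22, 21, 33, 35, 20], 4),
}
anchor, positive, negative = sample_triplet(periodic, rng)
```

No labels are needed: the segment identity of each sub-series serves as the only supervision signal.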
Inputs transformation. In order to extract shape features, a rasterization strategy is designed to turn the series into images, as shown in Fig. 8. The transformed images reveal the shape information of the series well, such as bulges and sinks. Let the series be standardized by min-max normalization to keep values between 0 and 1. A series is transformed to a matrix by expanding each element to a vector. For the th element , the position at the th column of the matrix is,
(16) 
The matrix corresponding to the series can be written as:
(17) 
where is a dimensional vector with the pixel value of at its th entry standing for white and standing for black elsewhere. The transformed image is shown in Fig. 9. The matrices are used as the inputs to representation learning. The sub-image corresponding to the sub-series is represented as . Therefore, the triplet becomes,
(18) 
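A rasterization of this kind can be sketched as below. The exact position formula is not reproduced here, so the row index is an assumption: the normalized value is scaled to the image height, rounded, and flipped so that larger values appear nearer the top. The pixel values 255 (white) and 0 (black) and the name `rasterize` are also assumptions.

```python
def rasterize(series, height):
    """Turn a 1D series into a height x len(series) image (list of rows).

    Each column holds exactly one white pixel whose row encodes the
    min-max normalized value of the corresponding series element.
    """
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat series
    image = [[0] * len(series) for _ in range(height)]
    for col, value in enumerate(series):
        scaled = (value - lo) / span                        # in [0, 1]
        row = (height - 1) - round(scaled * (height - 1))   # row 0 is the top
        image[row][col] = 255
    return image


for row in rasterize([0, 5, 10], height=3):
    print(row)
# [0, 0, 255]
# [0, 255, 0]
# [255, 0, 0]
```

Since only the normalized shape is drawn, two series that differ in absolute speed level but share the same profile map to the same image.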
Representation learning and clustering. A DNN with the triplet loss from [39] is employed to learn representations over a triplet, mapping from an image space into a feature space. The triplet loss encourages the representations of a pair of sub-images from one segment to be close to each other in the feature space, and those from different segments to be far apart. The representation of is denoted by . Thus, the triplet loss that is being minimized is,
(19) 
where is the set of all possible triplets. The structure of the DNN with triplet loss is shown in Fig. 9, where the outputs of the last layer are the representations used for clustering. The dimension of the representations used in clustering is lower than that of the raw series. For example, consider a traffic series with five-minute interval during days. The length of whole series is , while the length of daily sub-series is . If we use dimensional representations in clustering, the ratio of reduction in dimension is about . Subsequently, we average all the representations from one road segment, and cluster the representations into groups, where is much less than . Therefore, road segments are clustered into groups,
(20) 
where denotes the th group. represents the th road segment in network , which is clustered into group .
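For reference, the per-triplet loss and the per-segment averaging can be sketched as follows. The loss is written in the commonly used FaceNet form, a hinge on the difference of squared Euclidean distances plus a margin; the margin value and the function names are assumptions, and the embeddings here are plain lists rather than network outputs.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin):
    """Zero once the negative is farther from the anchor than the positive by the margin."""
    return max(squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin, 0.0)


def mean_embedding(embeddings):
    """Average the embeddings of one segment's sub-images into a single vector."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


print(triplet_loss([0, 0], [2, 0], [1, 0], margin=0.5))  # 3.5
print(triplet_loss([0, 0], [0, 1], [3, 4], margin=0.2))  # 0.0
print(mean_embedding([[1, 2], [3, 4]]))                  # [2.0, 3.0]
```

The averaged vectors, one per road segment, are what a conventional clustering method would then partition into groups.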
VI-B DeepPrediction
After partitioning the network into groups, we build a prediction model for a group in the DeepPrediction module. Some definitions and statements are first given.
Definition 2: Given two functions and , if coincides with within a specified measurement range after horizontal translation, is homogeneous with .
Statement 1: Given two homogeneous functions and , for simplicity assuming has coincided with , and distinct successive samples generated from . Construct a mapping between historical values and the future value: . Similarly get successive samples from at same values and construct the mapping . It is obvious that is equal to .
Based on Statement 1, we propose the idea of model sharing: all road segments within a group can share one prediction model. The implementation of DeepPrediction is elaborated as follows:

Interval confirmation. Given the periodicity, it is intuitive to use the measurements in one period to predict the next traffic state. In order to measure the autocorrelation between current and past traffic values, we calculate the autocorrelation function (ACF) at lag , which is the correlation between series values that are intervals apart. As shown in Fig. 10, the measurements are linearly correlated with the contiguous measurements. The high autocorrelations imply that importing all measurements in a period will result in information redundancy. We calculate the input interval by
(21) 
where denotes the ACF at lag , and is the given threshold that is determined by experiments. Therefore, the input series from becomes
(22) 
The length of the input reduces from to correspondingly, where represents the operation of rounding up.
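The interval selection can be illustrated with a small sketch. The selection rule used here is an assumption, not the paper's formula: the interval is taken as the largest lag up to which the sample ACF stays at or above the threshold. The subsampling keeps every interval-th value counting back from the most recent one, so the input length becomes the rounded-up ratio of window length to interval.

```python
from math import ceil


def autocorrelation(series, lag):
    """Sample autocorrelation of a non-constant series at a positive lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return cov / var


def choose_interval(series, threshold, max_lag):
    """Assumed rule: largest lag (>= 1) before the ACF first drops below the threshold."""
    interval = 1
    for lag in range(1, max_lag + 1):
        if autocorrelation(series, lag) < threshold:
            break
        interval = lag
    return interval


def subsample_input(window, interval):
    """Keep every interval-th value, always retaining the most recent one."""
    return window[::-1][::interval][::-1]


window = list(range(288))  # one day of five-minute samples
print(len(subsample_input(window, 5)), ceil(288 / 5))  # 58 58
```

A higher threshold yields a smaller interval and hence a longer, more redundant input, which is the tradeoff the experiments on the input interval examine.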

Model sharing. Within each group, we train one model for all road segments, which is known as a group-based model (GM). We generate the training samples for each group as
(23) 
where and denote the input and output of model, respectively. After that, we aggregate the samples within a group to train a GM for group ,
(24) 
Then the aggregated STTP model at the large-scale network can be written as:
(25)
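The sample aggregation behind model sharing can be sketched as follows. This is an illustration under assumptions: samples are built with a sliding window over each segment's series, the input is subsampled with a fixed interval, and the group label of every segment is already known. The names `make_samples`, `pool_group_samples` and `group_of` are hypothetical.

```python
def make_samples(series, input_span, interval, horizon):
    """Slide over one segment's series and emit (inputs, target) pairs.

    inputs: every interval-th value of the last input_span measurements
    target: the value horizon steps after the end of the window
    """
    samples = []
    for end in range(input_span, len(series) - horizon + 1):
        window = series[end - input_span:end]
        inputs = window[::-1][::interval][::-1]
        samples.append((inputs, series[end + horizon - 1]))
    return samples


def pool_group_samples(series_by_segment, group_of, input_span, interval, horizon):
    """Aggregate the samples of all segments that share a group label."""
    pooled = {}
    for seg_id, series in series_by_segment.items():
        pooled.setdefault(group_of[seg_id], []).extend(
            make_samples(series, input_span, interval, horizon))
    return pooled


series_by_segment = {"s1": [0, 1, 2, 3, 4, 5],
                     "s2": [10, 11, 12, 13, 14, 15],
                     "s3": [20, 21, 22, 23, 24, 25]}
group_of = {"s1": 0, "s2": 0, "s3": 1}
pooled = pool_group_samples(series_by_segment, group_of,
                            input_span=3, interval=1, horizon=1)
print({g: len(s) for g, s in pooled.items()})  # {0: 6, 1: 3}
```

One model would then be fitted per key of `pooled`, so the number of models equals the number of groups rather than the number of segments.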
VII Performance Evaluation
In this section, we evaluate the proposed framework on the network mentioned in Section III. For simplicity, 27 road segments are chosen. The network, experimental settings and performance metrics are described first. Then, we analyze the performance over different metrics.
VII-A Experiment Settings
For the DeepCluster module, we split the traffic series into daily sub-series of length for each segment. Fig. 11 shows that the traffic patterns on weekdays differ from those at weekends between six and ten o’clock in the morning (the circled region), since most people do not work at weekends. Besides, the traffic patterns behave abnormally during the National Day holiday, as shown in Fig. 12. As a result, daily sub-series are chosen by getting rid of the ones at weekends and during the National Day. Then we transfer the sub-series of size into images of size . As discussed in Section VI-A, we generate triplets by the daily sub-series from road segments, which are used for representation learning. The deep structure of FaceNet used in this paper is Inception_ResNet, the configuration of which is the same as in [39]. As a segment’s representative, the average representation of its sub-series is used for clustering by the k-means method. is confirmed by Silhouette coefficient [40].
For the DeepPrediction module, we use a state-of-the-art RNN, i.e., long short-term memory (LSTM) [41], for STTP. The input span of traffic series is chosen to be a day. Then the length of the input is that is confirmed by the experiments discussed later. We split the data into training set and testing set for each road segment, and aggregate the training set belonging to the same group to train the LSTM. In the end, GMs are aggregated.
Module  Network  Parameter  Size 
DeepCluster  FaceNet  Image size  160 
Batch size  
Segments per batch  
Images per segment  
Embedding size  
DeepPrediction  LSTM  Time steps  
LSTM1  
LSTM2  
Dense1  
Dense2 
The implications of the parameters given in this table are explained exactly in [39].
The inputs and outputs size are described in .
The key parameters of the relevant DNNs are listed in Table I. If not mentioned specifically, all models are trained on eighty percent of the data and tested on the remaining data. fold cross-validation is adopted over training dataset. The k-means method is implemented using Scikit-learn with Python 3.6.5. The NNs are run on an NVIDIA P2000 GPU with TensorFlow r1.8, CUDA 9.0 and CuDNN 9.0. Moreover, four performance metrics, namely relative error (RE), mean relative error (MRE), max mean relative error (MARE) and minimum mean relative error (MIRE), are used for evaluation, which are defined as
(26) 
(27) 
(28) 
(29) 
where denotes the RE of th segment in network clustered into group with being the true speed and being the prediction. is the number of road segments in the group . Besides, , and are MRE, MARE and MIRE of group , respectively. The performance metrics for road network can be similarly calculated.
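The four metrics can be sketched in code as follows. The definitions are assumptions based on the metric names: RE is the absolute prediction error divided by the true speed, a segment's MRE is the mean RE over its test samples, and the group-level MRE, MARE and MIRE are the mean, maximum and minimum of the segment MREs. The function names are illustrative.

```python
def relative_error(true_speed, predicted_speed):
    """Absolute error relative to the true (non-zero) speed."""
    return abs(predicted_speed - true_speed) / true_speed


def segment_mre(true_series, predicted_series):
    """Mean relative error of one road segment over its test samples."""
    errors = [relative_error(t, p) for t, p in zip(true_series, predicted_series)]
    return sum(errors) / len(errors)


def group_metrics(segment_mres):
    """Return (MRE, MARE, MIRE) of a group from its segments' mean relative errors."""
    mre = sum(segment_mres) / len(segment_mres)
    return mre, max(segment_mres), min(segment_mres)


mres = [segment_mre([50, 40], [45, 44]),   # both errors are 10 percent
        segment_mre([60, 30], [48, 24])]   # both errors are 20 percent
mre, mare, mire = group_metrics(mres)
print(round(mre, 3), round(mare, 3), round(mire, 3))  # 0.15 0.2 0.1
```

Applying `group_metrics` to the MREs of all segments in the network, instead of one group, gives the network-level figures.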
Input interval  MRE of Training(%)  MRE of Testing(%) 
1  
3  
5  
7 
Prediction Horizon  Group  Algorithm 






1(fiveminute)  1  GM  
IM  
2  GM  
IM  
3  GM  
IM  
2(Tenminute)  1  GM  
IM  
2  GM  
IM  
3  GM  
IM  
3(Fifteenminute)  1  GM  
IM  
2  GM  
IM  
3  GM  
IM 
GM: Groupbased Model. IM: Individualbased Model.
VII-B Simulation Results
Three experiments are conducted, including road segments clustering, interval confirmation and STTP at network.

Road segments clustering. All 27 road segments are clustered into three groups by DeepCluster, as shown in Fig. 13(a) to Fig. 13(c). It can be found that the series within a group are in general homogeneous with one another, in the sense defined in Section VI-B, which demonstrates DeepCluster’s capacity for extracting shape-based features. For example, the segments in cluster 1 have a breakdown in traffic speed during the evening peak period, followed by a speed recovery. The segments in cluster 2 have a breakdown during the morning peak, and then swing back and forth around a middle speed. The segments in cluster 3 bear a slight resemblance to cluster 1 during the evening peak period; however, they hold a stable middle speed after six o’clock in the morning.

Interval confirmation. This part investigates the effect of the input interval on predictive performance and determines the threshold of the ACF defined in Section VI-B. The LSTM is used to predict the speed in the next five minutes under different input intervals over the random segments. From the performance listed in Table II, the MRE of training increases with the decrease of . However, the performance improvements are insignificant when , such as the training MRE at and when and . Besides, the testing MRE at is slightly larger than that at . This is because the capacity of the model becomes stronger as the input interval decreases, leading to overfitting. From this result, the threshold is empirically set to . In the end, the input interval is set to 5, corresponding to twenty-five minutes, for all other simulations.

STTP at network. For performance comparison, we construct an IM for each segment with the same LSTM configuration under different prediction horizon . Simulation results are listed in Table III. The IMs have lower training MRE than the GMs because the capacity of the IMs is much stronger than that of the GMs. However, the GMs achieve smaller gaps between training MRE and testing MRE in all tests, since increasing the number and diversity of the training samples improves the generalization capability of the model. In contrast, the IMs suffer from overfitting as a result of modeling the noise. As shown in Fig. 14, the gaps of GMs are close to while the gaps of IMs are around .
From Table III, we can observe that the GMs perform better than the IMs in terms of testing error in the relatively simple task of five-minute forecasting. The testing MRE of the GMs and IMs are and for group 1, and for group 2, and for group 3, respectively. However, as the task becomes more complex, the capacity of the GMs becomes insufficient. For example, the testing MRE of GMs are around more than that of IMs when , while the testing MRE of GMs are around more than that of IMs when .
As shown in Fig. 15, the GM predicts the trends of traffic speed well, but the performance gets worse as the prediction horizon increases. It also shows that the model does not work well for 10- and 15-minute forecasting in rush hours (the dashed area in Fig. 15), when the traffic speed changes sharply.
The proposed framework is scalable: it can be applied to large-scale networks easily by significantly reducing the number of models, and it reaches a compromise between the number of models and prediction performance. Compared to the traditional IMs, the number of prediction models has been reduced up to with about performance degradation, in terms of network MRE in our test, as shown in Fig. 14. In conclusion, the performance of the framework is comparable to that of customized IMs, which validates its ability for STTP at large-scale networks.
VIII Conclusion
The multiplicity and heterogeneity of large-scale networks make STTP at such networks a challenging and important problem. By exploiting the characteristics of traffic patterns, a DL framework for STTP at large-scale networks is proposed in this paper. The key point of the framework is the combination of DeepCluster and DeepPrediction, together with the model sharing strategy. We evaluate the proposed framework on a real large-scale network of Liuli Bridge in Beijing, and obtain some insights into generic DL models. Although the prediction performance of the GMs is slightly worse than that of the IMs in most tests, the GMs have better generalization ability. For five-minute prediction, the GM gets error lower than IM. We also discuss the effect of the input interval on prediction performance, which guides the selection of an effective input interval for the framework. Furthermore, we use only 3 models to achieve the STTP at network, while the traditional way needs to construct 27 models.
IX Acknowledgement
This work is funded in part by the National Natural Science Foundation of China under Grant 61731004.
References
 [1] M. Wang, H. Shan, R. Lu, R. Zhang, X. Shen, and F. Bai, “Real-time path planning based on hybrid-VANET-enhanced transportation system,” IEEE Transactions on Vehicular Technology, vol. 64, no. 5, pp. 1664–1678, 2015.
 [2] S. Sun and C. Zhang, “The selective random subspace predictor for traffic flow forecasting,” IEEE Transactions on Intelligent Transportation Systems, vol. 8, no. 2, pp. 367–373, June 2007.
 [3] E. I. Vlahogianni, M. G. Karlaftis, and J. C. Golias, “Short-term traffic forecasting: Where we are and where we’re going,” Transportation Research Part C: Emerging Technologies, vol. 43, pp. 3–19, June 2014.
 [4] A. Ermagun and D. Levinson, “Spatiotemporal traffic forecasting: review and proposed directions,” Transport Reviews, vol. 38, no. 6, pp. 786–814, February 2018.
 [5] I. Lana, J. D. Ser, M. Velez, and E. Vlahogianni, “Road Traffic Forecasting: Recent Advances and New Challenges,” IEEE Intelligent Transportation Systems Magazine, vol. 10, no. 2, pp. 93–109, April 2018.
 [6] W. Xu, H. Zhou, N. Cheng, F. Lyu, W. Shi, J. Chen, and X. Shen, “Internet of vehicles in big data era,” IEEE/CAA Journal of Automatica Sinica, vol. 5, no. 1, pp. 19–35, 2018.
 [7] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, no. 2, pp. 251–257, March 1991.
 [8] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, May 2015.
 [9] A. Bagnall, E. Keogh, S. Lonardi, and G. Janacek, “A bit level representation for time series data mining with shape based similarity,” Data Mining and Knowledge Discovery, vol. 13, no. 1, pp. 11–40, May 2006.
 [10] X. Wang, A. Mueen, H. Ding, G. Trajcevski, P. Scheuermann, and E. Keogh, “Experimental comparison of representation methods and distance measures for time series data,” Data Mining and Knowledge Discovery, vol. 26, no. 2, pp. 275–309, March 2013.
 [11] S. Aghabozorgi, A. S. Shirkhorshidi, and T. Y. Wah, “Time-series clustering – A decade review,” Information Systems, vol. 53, pp. 16–38, October 2015.
 [12] R. Agrawal, C. Faloutsos, and A. Swami, “Efficient similarity search in sequence databases,” in Proceedings of 4th International Conference on Foundations of Data Organization and Algorithms, Heidelberg, June 1993, pp. 69–84.
 [13] F. Korn, H. V. Jagadish, and C. Faloutsos, “Efficiently supporting ad hoc queries in large datasets of time sequences,” in Proceedings of ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, USA, May 1997, pp. 289–300.
 [14] E. J. Keogh and M. J. Pazzani, “An Enhanced Representation of Time Series Which Allows Fast and Accurate Classification, Clustering and Relevance Feedback,” in Proceedings of 4th International Conference on Knowledge Discovery and Data Mining (KDD), NY, August 1998, pp. 239–247.
 [15] E. Keogh, K. Chakrabarti, M. Sharad, and M. Pazzani, “Locally adaptive dimensionality reduction for indexing large time series databases,” in Proceedings of ACM SIGMOD International Conference on Management of Data, Santa Barbara, California, USA, May 2001, pp. 151–162.
 [16] E. Keogh, S. Lonardi, and C. A. Ratanamahatana, “Towards parameter-free data mining,” in Proceedings of 10th ACM SIGMOD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, August 2004, pp. 206–215.
 [17] B. K. Yi and C. Faloutsos, “Fast time sequence indexing for arbitrary Lp norms,” in Proceedings of 26th International Conference on Very Large Data Bases (VLDB), Cairo, Egypt, September 2000, pp. 385–394.
 [18] F. L. Chung, T. C. Fu, R. W. P. Luk, and V. T. Y. Ng, “Flexible time series pattern matching based on perceptually important points,” in Proceedings of International Joint Conference on Artificial Intelligence Workshop (Learning From Temporal and Spatial Data), Seattle, WA, August 2001, pp. 1–7.
 [19] C. Song, F. Liu, Y. Huang, L. Wang, and T. Tan, “Auto-encoder based data clustering,” in Iberoamerican Congress on Pattern Recognition, Springer, 2013, pp. 117–124.
 [20] F. Tian, B. Gao, Q. Cui, E. Chen, and T. Liu, “Learning deep representations for graph clustering,” in Proceedings of 28th AAAI Conference on Artificial Intelligence, 2014, pp. 1293–1299.
 [21] J. Xu, P. Wang, G. Tian, B. Xu, J. Zhao, and F. Wang, “Short Text Clustering via Convolutional Neural Networks,” in Proceedings of Conference on North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Denver, Colorado, May 2015, pp. 62–69.
 [22] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for clustering analysis,” in Proceedings of the 33rd International Conference on Machine Learning, San Juan, PR, USA, May 2016, pp. 478–487.
 [23] K. Tian, S. Zhou, and J. Guan, “DeepCluster: A General Clustering Framework Based on Deep Learning,” in Proceedings of Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, Cham, December 2017, pp. 809–825.
 [24] S. Sun, R. Huang, and Y. Gao, “Network-scale traffic modeling and forecasting with graphical lasso and neural networks,” Journal of Transportation Engineering, vol. 138, no. 11, pp. 1358–1367, 2012.
 [25] B. Yu, X. L. Song, F. Guan, Z. M. Yang, and B. Z. Yao, “k-Nearest neighbor model for multiple-time-step prediction of short-term traffic condition,” Journal of Transportation Engineering, vol. 142, no. 6, pp. 1–10, 2016.
 [26] N. G. Polson and V. O. Sokolov, “Deep learning for short-term traffic flow prediction,” Transportation Research Part C: Emerging Technologies, vol. 79, pp. 1–17, June 2017.
 [27] Y. Li, R. Yu, C. Shahabi, and Y. Liu, “Diffusion convolutional recurrent neural network: Data-driven traffic forecasting,” in Proceedings of International Conference on Learning Representations (ICLR), Vancouver, Canada, April 2018.
 [28] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, June 2015, pp. 815–823.
 [29] M. Ben-Akiva, M. Bierlaire, H. Koutsopoulos, and R. Mishalani, “DynaMIT: a simulation-based system for traffic prediction,” in Proceedings of DACCORD Short Term Forecasting Workshop, Delft, Netherlands, Feb. 1998.
 [30] A. Abadi, T. Rajabioun, and P. A. Ioannou, “Traffic flow prediction for road transportation networks with limited traffic data,” IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 2, pp. 653–662, April 2015.
 [31] T. Djukic, J. W. C. Van Lint, and S. P. Hoogendoorn, “Application of principal component analysis to predict dynamic origin–destination matrices,” Transportation Research Record: Journal of the Transportation Research Board, vol. 2283, no. 1, pp. 81–89, January 2012.
 [32] N. Mitrovic, M. T. Asif, U. Rasheed, J. Dauwels, and P. Jaillet, “CUR decomposition for compression and compressed sensing of large-scale traffic data,” in Proceedings of 16th International Conference on Intelligent Transportation Systems (ITSC), The Hague, Netherlands, October 2013, pp. 1475–1480.
 [33] M. T. Asif, S. Kannan, J. Dauwels, and P. Jaillet, “Data compression techniques for urban traffic data,” in Proceedings of IEEE Symposium on Computational Intelligence in Vehicles and Transportation Systems (CIVTS), Singapore, April 2013, pp. 44–49.
 [34] N. Mitrovic, M. T. Asif, J. Dauwels, and P. Jaillet, “Low-Dimensional Models for Compressed Sensing and Prediction of Large-Scale Traffic Data,” IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 5, pp. 2949–2954, April 2015.
 [35] W. Min and L. Wynter, “Real-time road traffic prediction with spatio-temporal correlations,” Transportation Research Part C: Emerging Technologies, vol. 19, no. 4, pp. 606–616, 2011.
 [36] M. T. Asif, J. Dauwels, C. Y. Goh, A. O. E. Fathi, M. Xu, M. M. Dhanya, N. Mitrovic, and P. Jaillet, “Spatiotemporal patterns in large-scale traffic speed prediction,” IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 2, pp. 794–804, April 2014.
 [37] X. Ma, Z. Dai, Z. He, J. Ma, Y. Wang, and Y. P. Wang, “Learning Traffic as Images: A Deep Convolutional Neural Network for Large-Scale Transportation Network Speed Prediction,” Sensors, vol. 17, no. 4, pp. 1–16, April 2017.
 [38] K. Xie, L. Wang, X. Wang, G. Xie, J. Wen, G. Zhang, J. Cao, and D. Fang, “Accurate Recovery of Internet Traffic Data: A Sequential Tensor Completion Approach,” IEEE/ACM Transactions on Networking (TON), vol. 26, no. 2, pp. 793–806, April 2018.
 [39] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, June 2015, pp. 815–823.
 [40] P. J. Rousseeuw, “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis,” Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, November 1987.
 [41] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, November 1997.