Clustering of Time Series Data with Prior Geographical Information

07/03/2021 ∙ by Reza Asadi, et al. ∙ University of California, Irvine

Time series data are broadly studied in various domains of transportation systems. Traffic data are a challenging example of spatio-temporal data, as they are multi-variate time series with high correlations in spatial and temporal neighborhoods. Spatio-temporal clustering of traffic flow data finds similar patterns in both the spatial and temporal domains, which provides better capability for analyzing a transportation network and for improving related machine learning models, such as traffic flow prediction and anomaly detection. In this paper, we propose a spatio-temporal clustering model that clusters time series data based on spatial and temporal contexts. We propose a variation of the Deep Embedded Clustering (DEC) model for finding spatio-temporal clusters. The proposed model, Spatial-DEC (S-DEC), uses prior geographical information in building latent feature representations. We also define evaluation metrics for spatio-temporal clusters. Not only do the obtained clusters have better temporal similarity when evaluated using DTW distance, but they also better represent spatial connectivity and dis-connectivity. We use traffic flow data obtained from PeMS in our analysis. The results show that the proposed Spatial-DEC finds more desirable spatio-temporal clusters.




I Introduction

Spatio-temporal data arise in broad areas of engineering and environmental sciences. Data mining techniques have been used extensively for spatio-temporal analysis [atluri2018spatio]. Geo-referenced time series are a subset of spatio-temporal data, in which fixed locations over a geographical area observe some features over a time period in a synchronous way. Traffic data are a complex example of geo-referenced data: they are multi-variate time series, including the flow, speed and occupancy of a large number of sensors, with correlations and similarities in spatial and temporal neighborhoods. Spatio-temporal analysis of traffic data has a pivotal role in future research to improve the performance of transportation systems [nagy2018survey], such as reducing traffic congestion and air pollution [chowdhury2017data], understanding the behaviour of a transportation network [rempe2016spatio], predicting traffic speed and flow [zang2018long], [asadi2020spatio], and detecting non-recurrent congestion events [anbaroglu2014spatio].

The volume and variety of spatio-temporal data have increased with the advent of new sensing technologies, such as cameras, GPS and sensors [toch2019analyzing]. The increasing volume of traffic data requires the development of large-scale machine learning algorithms and big data analytics [zhu2018big], and data-driven approaches to traffic data [wang2016soft]. Deep learning models have recently been successfully applied in spatial and temporal domains [wang2019deep]. These models especially outperform traditional machine learning and statistical methods on large-scale data. Several studies show the success of deep learning solutions, such as traffic flow forecasting [ma2020daily], missing data imputation [chen2019traffic] and spatio-temporal modelling of traffic flow data [dixon2019deep]. The success of deep learning models in various domains, along with the challenges of applying them to spatio-temporal traffic data, is the main motivation to further study the problem.

I-A Clustering of traffic data

Spatio-temporal clustering of traffic data has been broadly studied with various goals. First, congestion detection and prediction can assist travelers and traffic management systems to improve the efficiency of existing systems. Second, detecting similarity in traffic patterns can help machine learning models to find similar regions in a transportation network. This can improve missing data imputation and traffic forecasting models, or can identify anomalies in the data.

In [cheng2018classifying], the authors propose an improvement of fuzzy k-means clustering to classify traffic states into five groups ranging from mild to extreme traffic. Also, in [celikoglu2014dynamic], a clustering of traffic flow data is obtained based on congestion levels; the clusters are described in the temporal domain based on levels of congestion. While these works cluster traffic data based on traffic congestion, they do not consider the spatial domain in their analysis. Moreover, in [wei2020spatio], a method is proposed to better understand how traffic conditions are correlated in space-time, clustering traffic data into four congestion levels using an improved spatio-temporal Moran scatter-plot. These works cluster traffic data based on level of congestion. In contrast, we consider clustering of traffic data based on similarities of patterns, which is more generalizable to various machine learning problems, such as traffic flow prediction and anomaly detection. In [anbaroglu2014spatio], the authors define a measure, called the Link Journey Time, and obtain spatio-temporal clusters of non-recurrent events. Each spatio-temporal cluster is a detected non-recurrent event that represents neighboring spatial and temporal features. Their model considers a similarity measure to obtain spatio-temporal clusters. However, it is only designed to find non-recurrent events, and does not generalize to finding similar regions or temporal patterns.

In [shi2019detection], the authors consider the problem of clustering traffic flow data to obtain similar spatial and temporal patterns. They propose spatio-temporal clustering of traffic flow by considering the topology of the network and the similarity of time series data, where clusters are built by successive connections of neighbors. This work relies on prior assumptions about the data and the topology of the network. A data-driven approach is expected to find spatio-temporal clusters without any prior assumptions, which would be more generalizable to different problems and scenarios [kim2017data]. In [cheng2007mining], similarities of urban traffic flow are explored with a discrete wavelet transform. In [chunchun2011traffic], a fuzzy clustering method is proposed for traffic flow segments. Dynamic Time Warping (DTW), as a temporal similarity function, is used to identify locations with temporal similarities; the authors consider the problem of clustering time series segments. In [nguyen2019feature], spatio-temporal data are represented as image-like representations, and point-based and segment-based clusterings of speed are proposed to represent classes of traffic congestion in the spatial and temporal domains. The segment-based clustering is similar to our approach, as we also find clusters based on time series segments. However, they use filters from computer vision to obtain features, whereas we consider a temporal similarity distance to represent similarities in traffic flow data. Moreover, they evaluate clusters by visually assessing the model's output; we use a similar visualization method to present interesting insights in the clusters of traffic flow data.

Similarity of traffic patterns not only detects traffic congestion, but also detects spatially and temporally heterogeneous neighborhoods. In [tang2019short], k-means clustering is applied to find traffic flow variations based on spatio-temporal correlations. The clusters of similar locations are the input to a neural network, which predicts traffic flow with higher performance. Clusters of similar locations have also been used with an autoencoder to impute missing values. In [qiu2019traffic], a clustering method finds road segments based on their features, and missing data imputation is applied to incomplete speed data. In [salamanis2017identifying], a clustering model is used to identify anomalies in traffic flow data. We consider the problem of discovering spatio-temporal similarities in traffic data. Clusters of traffic data, such as speed, flow and occupancy, can represent levels of congestion. However, clusters of traffic flow data can also represent locations and time stamps with similar patterns, with the goal of finding heterogeneous spatial and temporal domains.

I-B Spatio-temporal clustering with deep learning

Since we consider clustering of traffic flow segments, we first review some recent research on time series clustering, and then describe the literature on deep learning models for clustering problems. In [aghabozorgi2015time], a broad range of time series clustering applications is described. The main components of time series clustering are studied, including time series representations, similarity and distance measures, clustering prototypes and the clustering itself. In [soheily2016generalized], the authors describe the challenges of k-means clustering with time warp measures, and propose weighted and kernel time warp measures for k-means clustering; their method provides a faster estimation of clusters. Time series clustering is investigated further in [paparrizos2017fast]. These works illustrate that novel time series representations and distance measures are the main approaches to improving temporal clustering.

A broad range of clustering models have been applied to spatio-temporal data, such as k-means [huang2016time], DBSCAN [birant2007st], agglomerative clustering [yao2018stepwise], and matrix factorization based clustering [zhou2018visual]. However, the increasing size of datasets requires more scalable models, such as deep learning models. For a huge dataset of data points with spatial and temporal properties, applying traditional clustering methods such as k-means to traffic data is computationally expensive and can have poor performance [huang2016time]. More efficient heuristic methods for k-means clustering of traffic flow data have been studied. Complex spatio-temporal patterns in traffic data necessitate further consideration of spatial and temporal information in the models. Deep learning models significantly improve the performance of various machine learning problems, such as computer vision and natural language processing, and have been broadly used for various large-scale spatio-temporal problems [wang2019deep], [asadi2019convolution]. Moreover, deep learning models for clustering tasks are broadly studied in [min2018survey]. Deep embedded clustering was first introduced in [xie2016unsupervised]. Variations of the model have been studied in broad domains. Joint training of the model to preserve the latent feature space structure is proposed in [guo2017improved]. In [yang2017towards], the authors analyze clustering-friendly latent representations, which jointly optimize dimension reduction using both a neural network and k-means clustering. While most of the research applies deep embedded clustering to images, there are few studies showing its performance on time series data. In [tzirakis2019time], the authors jointly cluster and train the model, and also segment time series data with agglomerative clustering. In [8970987], a DEC with a cluster tree structure is proposed to dynamically obtain the number of clusters, while the original DEC has a fixed number of clusters. However, these models do not consider any prior relation among the clusters. In [ren2019semi], a variation of DEC is proposed which considers the pairwise distance between data points; the model uses the prior distances as a measure to classify unlabeled data points. This work considers a relation among the clusters for unsupervised learning, and it is similar to our work, as we consider a prior relation among the clusters based on their geographical information. In [madiraju2018deep], various similarity metrics are evaluated to obtain clusters with DEC. While these models consider temporal similarity in a DEC, there is a lack of deep learning models that not only find clusters based on temporal similarity but also take prior spatial features into account.

I-C Contributions of the work

The aforementioned works on clustering of time series data show the importance, advantages and applications of applying deep learning models to cluster time series and spatio-temporal traffic data. They also show that several deep learning models have recently been developed for clustering problems, with the goal of modifying the latent feature space. There are various works that develop deep learning models for clustering of time series data. Clustering of time series data finds clusters of a transportation network based on the similarity of traffic flow data [asadi2019spatio]. However, considering prior geographical information in designing clusters is a challenging problem.

In this paper, we focus on clustering of time series with prior geographical information. We propose a model that modifies the latent feature representation based on geographical information. The model is a variation of DEC that finds spatio-temporal clusters by adding a new loss function to the model.

The contributions of the paper are as follows:

  • We formulate spatio-temporal clustering of traffic flow data as the clustering of time series segments.

  • A spatial deep embedded clustering (Spatial-DEC) model is proposed which considers prior geographical information within the latent feature representation. To the best of our knowledge, this is the first work which considers prior geographical information to obtain spatial clusters with the DEC model.

  • We illustrate the application of clustering of traffic flow data in transportation systems.

  • The spatio-temporal clusters obtained by deep learning models are evaluated on traffic flow data available in PeMS.

In Section II, we describe the problem definition. In Section III, we describe the technical background of the proposed model. In Section IV, a deep learning model, Spatial-DEC, is proposed for spatio-temporal clustering. In Section V, the models are evaluated on traffic flow data. Section VI describes the conclusions and future work.

II Problem Definition

Spatio-temporal data is represented with a three-dimensional matrix $X \in \mathbb{R}^{S \times T \times F}$, where $S$ is the number of sensors, $T$ is the number of time stamps and $F$ is the number of traffic features, including flow, speed and occupancy. Each fixed location $s$ has its own multi-variate time series data $X_s$. A sliding window method, given a time window $w$, generates a sequence of data points. In other words, the function receives input data $X$ and time window size $w$, and outputs all data points, i.e. time series segments, $x_{s,t}$ at time stamp $t$ and location $s$. Throughout the paper, we represent each data point with two indices, $s$ for the location index and $t$ for the temporal index. A clustering method assigns a data point to a cluster $c_j$, where $j \in \{1, \dots, K\}$ and $K$ is the given number of clusters. In alternative approaches, one can consider the problem of clustering whole time series $X_s$, e.g. clustering of trajectory data [sabarish2018clustering], or sub-sequences of spatio-temporal data. A clustering model finds similar data points based on a distance function, such as the Euclidean distance. Here, we call a cluster a temporal cluster when its members have high temporal similarity, which can be measured with a DTW distance function; it is desirable to have a dense and compact temporal cluster. We define a spatial cluster as one whose location indices have data points with high temporal similarity.
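As a concrete illustration, the sliding window segmentation described above can be sketched in a few lines of NumPy. The non-overlapping windows and the flattened array layout are simplifying assumptions for this sketch, not the paper's exact implementation.

```python
import numpy as np

def sliding_segments(X, w):
    """Split spatio-temporal data X with shape (S, T, F) into time series
    segments of length w for every sensor.

    Returns an array of shape (S * (T // w), w * F): one flattened segment
    per (location, time window) pair, using non-overlapping windows.
    """
    S, T, F = X.shape
    n_win = T // w
    # Drop trailing time stamps that do not fill a whole window.
    X = X[:, :n_win * w, :].reshape(S, n_win, w, F)
    return X.reshape(S * n_win, w * F)

# Toy example: 3 sensors, 24 time stamps, 1 feature (flow), window of 12.
X = np.random.rand(3, 24, 1)
segments = sliding_segments(X, 12)
print(segments.shape)  # (6, 12)
```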

Traffic flow data is a spatio-temporal data. In Fig. 1, we present an example of traffic flow for three sensors and three days. Finding temporal similarity among road networks is challenging with point-wise clustering. To prevent fluctuations in the clusters, we consider segment-wise clustering. In Fig. 2, we give a schematic representation of the input and output of the spatio-temporal clustering. The input data points are time series segments for three road segments and two time stamps. The selected data points are from PeMS traffic flow data, but the time stamps and road segments are arbitrary and are presented for the purpose of clarifying the problem definition. Each data point in the figure is a time series of length 12. For 5-min time stamps, each data point represents one hour of traffic flow data for one road segment. The horizontal axis shows time stamps, and the vertical axis shows normalized traffic flow. The three output clusters represent similar data points. The clusters represent similar patterns over different days and hours; they also represent locations that are similar in a transportation network.

Fig. 1: An example of traffic flow data for three road segments, represented with three colors. The time stamps are every 5-min and the figure represents traffic flow data for three days.

This spatio-temporal clustering problem is challenging. First, the clustering method should consider both temporal and spatial similarities. Second, given a large number of time stamps $T$, e.g. six months, and a large number of road segments $S$ of a city, a sliding window method generates on the order of $S \times T$ data points, which can be a very large number. While k-means clustering methods have been proposed for time series segments, their performance drops when faced with a large number of data points, and their computational cost is high. Moreover, a k-means clustering method is difficult to modify to consider both spatial and temporal similarities. Hence, in this work, we propose a deep learning model, Spatial-DEC, to solve the spatio-temporal clustering problem.


Fig. 2: An example of spatio-temporal clustering of traffic flow data

III Technical Background

III-A Autoencoders

An autoencoder was first proposed in [vincent2010stacked]. It consists of an encoder $f(\cdot)$ and a decoder $g(\cdot)$, with an activation function and a dropout function applied in each layer. The encoder is the first neural network component; it reduces the dimension of the input data $x$ to a latent feature space $z \in \mathbb{R}^{d}$, where $d$ is smaller than the input dimension. The second neural network component is the decoder, which reconstructs the input data from its latent representation.

In a deep autoencoder, the encoder and decoder consist of several layers; together they form a symmetric, multi-layered neural network. The loss function, e.g. mean square error, reduces the difference between the input data and its reconstruction. In other words, the input and target data are both $x$. For the given spatio-temporal data, the reconstruction loss function is as follows,

$$L_r = \sum_{s=1}^{S} \sum_{t=1}^{T} \| x_{s,t} - g(f(x_{s,t})) \|^2, \qquad (1)$$

where $T$ is the number of time stamps and $S$ is the number of sensors or locations. The target of the autoencoder is the same as its input $x_{s,t}$. Minimizing this objective function results in learning the latent feature representation of the input data. Throughout the paper, the reconstruction loss is weighted by $\gamma_r$.
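The encoder, decoder and reconstruction loss $L_r$ can be sketched in plain NumPy. The random weights, single hidden transform and layer sizes below are assumptions for illustration only; the paper's model is a deep autoencoder trained with Keras.

```python
import numpy as np

rng = np.random.default_rng(0)
w, d = 12, 4  # segment length and an assumed latent size

# Random weights stand in for trained encoder/decoder parameters.
W_enc = rng.standard_normal((w, d)) * 0.1
W_dec = rng.standard_normal((d, w)) * 0.1

def relu(a):
    return np.maximum(a, 0.0)

def encode(x):
    # z = sigma(x W_enc): latent features for a batch of segments.
    return relu(x @ W_enc)

def decode(z):
    # Reconstruction of the segments from latent features.
    return z @ W_dec

x = rng.random((5, w))            # 5 flow segments of length 12
x_hat = decode(encode(x))         # reconstructed segments
L_r = np.sum((x - x_hat) ** 2)    # reconstruction loss of Equation (1)
print(x.shape, encode(x).shape)   # (5, 12) (5, 4)
```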

III-B Deep Embedded Clustering

The deep embedded clustering (DEC) neural network was introduced in [xie2016unsupervised]. The encoder transforms $x_{s,t}$ into the latent feature space $z_{s,t}$. A clustering layer is connected to the latent feature layer. The weights of the clustering layer are initialized with cluster centers obtained by k-means clustering; cluster center $j$ is represented with $\mu_j$. Given $K$ as the number of clusters and $d$ as the latent feature size, the clustering layer is a dense layer that converts latent features into a vector of size $K$, whose $j$-th element represents the probability that the data point is assigned to cluster $j$.
Given initial cluster centers $\mu_j$, obtained by k-means clustering, and latent features $z_i$, a Student's t-distribution measures the similarity between cluster centers and data points as follows,

$$q_{ij} = \frac{(1 + \| z_i - \mu_j \|^2)^{-1}}{\sum_{j'} (1 + \| z_i - \mu_{j'} \|^2)^{-1}}, \qquad (2)$$

where the degree of freedom of the Student's t-distribution is one. The probability of assigning a data point $z_i$ to the cluster with center $\mu_j$ is represented with $q_{ij}$, and the assigned cluster is $\arg\max_j q_{ij}$. The clustering algorithm iteratively adjusts clusters by learning from high-confidence assignments. To learn from high-confidence assignments, an auxiliary target distribution is defined as follows,

$$p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}}, \qquad f_j = \sum_i q_{ij}, \qquad (3)$$

where $f_j$ is the soft frequency of cluster $j$. The KL-divergence loss between $P$ and $Q$ learns the high-confidence soft cluster assignments,

$$L_c = \mathrm{KL}(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}. \qquad (4)$$
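The soft assignments, the auxiliary target distribution and the clustering loss of Equations (2)-(4) can be sketched directly in NumPy; the array shapes below are illustrative assumptions.

```python
import numpy as np

def soft_assignments(Z, mu):
    """Student's t-distribution similarity q_ij between latent features z_i
    (rows of Z) and cluster centers mu_j (rows of mu), with one degree of
    freedom, as in DEC (Equation 2)."""
    d2 = ((Z[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # ||z_i - mu_j||^2
    q = 1.0 / (1.0 + d2)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Auxiliary target p_ij = (q_ij^2 / f_j) / sum_j'(q_ij'^2 / f_j'),
    where f_j = sum_i q_ij is the soft cluster frequency (Equation 3)."""
    f = q.sum(axis=0)
    p = (q ** 2) / f
    return p / p.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
Z = rng.standard_normal((10, 4))   # 10 data points, latent size 4
mu = rng.standard_normal((3, 4))   # 3 cluster centers
q = soft_assignments(Z, mu)
p = target_distribution(q)
kl = np.sum(p * np.log(p / q))     # clustering loss L_c = KL(P || Q)
print(q.shape, np.allclose(q.sum(axis=1), 1.0))  # (10, 3) True
```

In training, `p` is recomputed periodically from the current `q` and held fixed as the target while the KL term is minimized, which sharpens high-confidence assignments.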


In [guo2017improved], the DEC is trained with joint learning of the clustering loss and the reconstruction loss. In joint training, the loss function of the neural network on spatio-temporal data is as follows,

$$L = \gamma_r L_r + \gamma_c L_c, \qquad (5)$$

where $\gamma_r$ is the weight of the mean square error term and $\gamma_c$ is the weight of the clustering loss term, for the given number of clusters $K$. Minimizing the loss function in Equation (5) results in learning the latent feature representation and the output clusters. The model receives input data $x_{s,t}$; the target data are $p_{ij}$ for the clustering layer and $x_{s,t}$ for the decoder's output. For DEC, the values of $\gamma_r$ and $\gamma_c$ represent the importance of each loss term: a higher value of $\gamma_c$ reduces the clustering loss, while a higher value of $\gamma_r$ better keeps the structure of the autoencoder's latent features [guo2017improved].

IV Spatial Deep Embedded Clustering

Here, we describe the proposed method for spatio-temporal clustering of traffic data. Algorithm 1 gives the procedure for finding spatio-temporal clusters in traffic flow data.

A deep embedded clustering (DEC) model receives data points $x_{s,t}$ for all locations $s$ and time stamps $t$. The encoder transforms each data point to its latent feature representation $z_{s,t}$. Given the number of clusters $K$, k-means clustering on the latent feature representations finds the means of the clusters of latent features. The cluster means obtained by k-means clustering are stored in the clustering layer.

Data points with high temporal similarity are close to each other in the latent feature space, as examined in Fig. 7. Hence, each cluster includes data points with high temporal similarity. However, the clusters should not only represent data points with high temporal similarity, but should also consider data points in spatial neighborhoods. If a cluster represents data points of locations far from each other, or scattered over a geographical area, it is not a desired cluster. Hence, our objective is to obtain clusters with both temporal similarity and spatial closeness. In the rest of this section, we describe our modification to deep embedded clustering and introduce Spatial-DEC, whose architecture is presented in Fig. 3. The objective is to modify the DEC's loss function so that if two locations $s$ and $s'$ are close to (or far from) each other, then their latent representations $z_{s,t}$ and $z_{s',t}$ are also close to (or far from) each other.

Fig. 3: The architecture of the Spatial-DEC. The input data has target values for the decoder's output, the clustering layer and the latent feature layer.

First, we make the latent feature representations conditional on the prior geographical location. The proposed model needs the location indices as input data, because it maps the data points to the latent feature space based on both their time series values and their location indices. For the given $S$ sensors, we generate a one-hot encoding of the locations: location $s$ has the input vector $l_s \in \{0,1\}^{S}$, in which the $s$-th element is one and the rest of the values are zero. This one-hot encoding is computed in Algorithm 1. The S-DEC receives $(x_{s,t}, l_s)$ as input data, and the encoder outputs latent feature representations $z_{s,t}$. Given a time series segment and its location, the encoder outputs a low-dimensional representation of the data.
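A minimal sketch of the one-hot location encoding and the combined encoder input; concatenating the segment and the location vector is an assumption about how the two inputs are joined before the encoder.

```python
import numpy as np

S = 6  # number of sensors
# One-hot location encodings: row s has a 1 in position s, zeros elsewhere.
L = np.eye(S)

w = 12
x = np.random.rand(w)            # a traffic flow segment at location 2
inp = np.concatenate([x, L[2]])  # encoder input: segment plus location code
print(inp.shape)  # (18,)
```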

Next, we add a loss function to the DEC and propose Spatial-DEC (S-DEC), whose latent features are constructed given prior geographical information. We add a spatial loss term, $L_s$, to the latent feature layer. The encoder's inputs are $x_{s,t}$ and $l_s$, and its output is $z_{s,t}$. The encoder's output from the last training step is stored in $\bar{z}_{s,t}$; in other words, the model uses $\bar{z}_{s,t}$ as the target value for the latent feature layer. In Algorithm 1, once the DEC obtains the target distribution $P$ for the clustering layer, we also obtain $\bar{z}$ as the target value for the latent feature layer.

To implement the new loss function effectively, we first change the size of the input data. The goal is to have input data and target data for each pair of locations $s$ and $s'$. We repeat each data point, a row in the input matrix, $S$ times; the new training data are stored in $\tilde{X}$. Moreover, we reshape the latent features of the last training step, $\bar{Z}$: each block of $S$ rows of $\bar{Z}$ is repeated $S$ times, and the reshaped latent features are represented with $\tilde{Z}$. The Spatial-DEC takes $\tilde{X}$ as input data and $\tilde{Z}$ as the target for the latent feature layer. After changing the size of the input and target data, for any given $x_{s,t}$ and its encoder output $z_{s,t}$, there are target data $\bar{z}_{s',t}$ for all locations $s'$. This modification allows the model to control the distance of the latent features at location $s$ with respect to all locations $s'$. The loss function optimizes the distance of $z_{s,t}$ to all previously obtained latent features at the same time stamp, represented with $\bar{z}_{s',t}$ for all sensors $s'$. The loss function should increase (decrease) the distance of $z_{s,t}$ and $\bar{z}_{s',t}$ if locations $s$ and $s'$ are far from (close to) each other. In the rest of this section, we define the weight matrix which controls the distance of latent features.
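The repetition of inputs and targets described above can be sketched with NumPy for a single time stamp; the row ordering (inputs repeated element-wise, targets tiled block-wise) is an assumption consistent with pairing every $s$ with every $s'$.

```python
import numpy as np

S, w, d = 3, 12, 4                 # sensors, segment length, latent size
rng = np.random.default_rng(2)
X_t = rng.random((S, w))           # one segment per location at time stamp t
Z_bar = rng.random((S, d))         # latent features from the last training step

# Pair every input x_s with every stored latent target z̄_{s'}:
X_rep = np.repeat(X_t, S, axis=0)  # rows: s=0 (S times), s=1 (S times), ...
Z_rep = np.tile(Z_bar, (S, 1))     # rows: s'=0..S-1, repeated S times

print(X_rep.shape, Z_rep.shape)    # (9, 12) (9, 4)
```

Row $k = s \cdot S + s'$ of the repeated arrays then holds the pair $(x_s, \bar{z}_{s'})$, which is exactly what the pairwise spatial loss needs.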

We define a weight value for the loss function, represented with $a_{s,s'}$, which represents the distance of two locations $s$ and $s'$. A transportation network can be represented with a graph, whose adjacency matrix represents the distances of locations. Here, without loss of generality, we assume that all locations are on a line. We define $A$ as the weight matrix representing the spatial distance of locations, where $a_{s,s'}$ represents the distance of locations $s$ and $s'$: if the value is close to +1, the two locations are close to each other, and if the value is close to -1, they are far from each other. Given $a_s$ as the $s$-th row of $A$, we build the diagonal matrix of the elements of $a_s$, in which all off-diagonal elements are zero, and we obtain the weights used in training by aggregating these matrices over all locations along the first dimension; this calculation appears in Algorithm 1. The spatial loss term is as follows,

$$L_s = \sum_{t=1}^{T} \sum_{s=1}^{S} \sum_{s'=1}^{S} a_{s,s'} \, \| z_{s,t} - \bar{z}_{s',t} \|^2. \qquad (6)$$
In a back-propagation method, the gradients of Equation (6), along with the gradient of the autoencoder's loss function, are propagated through the neural network. We only describe the back-propagation for Equation (6), and refer the reader to [xie2016unsupervised] for further theoretical analysis of gradient propagation for DEC. Given $L_s$ as the spatial loss function, the gradient of Equation (6) is as follows,

$$\frac{\partial L_s}{\partial z_{s,t}} = 2 \sum_{s'=1}^{S} a_{s,s'} \, (z_{s,t} - \bar{z}_{s',t}).$$
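The weight matrix $A$ and the spatial loss $L_s$ of Equation (6) can be sketched as follows. The linear weight function below is an assumed example, since any function mapping close pairs near +1 and far pairs near -1 can be used.

```python
import numpy as np

def spatial_weights(S):
    """Assumed weight matrix A for S locations on a line: a_{s,s'} is +1
    for a location and itself, decaying linearly to -1 for the two most
    distant locations (the exact function is a design choice)."""
    idx = np.arange(S)
    dist = np.abs(idx[:, None] - idx[None, :])
    return 1.0 - 2.0 * dist / (S - 1)

def spatial_loss(Z, Z_bar, A):
    """L_s = sum_{s,s'} a_{s,s'} ||z_s - z̄_{s'}||^2 at one time stamp.
    Minimizing pulls latent features of nearby locations together
    (positive a) and pushes distant ones apart (negative a)."""
    d2 = ((Z[:, None, :] - Z_bar[None, :, :]) ** 2).sum(axis=2)
    return np.sum(A * d2)

rng = np.random.default_rng(3)
S, d = 6, 4
Z = rng.standard_normal((S, d))    # current latent features
A = spatial_weights(S)
loss = spatial_loss(Z, Z, A)       # Z_bar = Z for this toy check
print(A[0, 0], A[0, 5])  # 1.0 -1.0
```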

 1: procedure Clustering(X, w, K, A)
 2:     generate data points x_{s,t} with a sliding window of size w
 3:     build one-hot location encodings l_s for all locations s
 4:     build the autoencoder (encoder and decoder)
 5:     pretrain the autoencoder with the reconstruction loss only
 6:     compute latent features z for all data points
 7:     initialize cluster centers with k-means on the latent features
 8:     build the clustering layer with the cluster centers
 9:     build the S-DEC model with loss terms L_r, L_c and L_s
10:     for each training iteration do
11:         store the latent features of the last training step in Z̄
12:         build the repeated inputs X̃ and latent targets Z̃
13:         compute the target distribution P with Equation (3)
14:         for each mini-batch do
15:             update the weights by back-propagating the joint loss
16:     return the trained Spatial-DEC model for spatio-temporal clustering of traffic data
Algorithm 1 Spatio-temporal clustering with S-DEC

The model finds the gradient with respect to the encoder's output $z_{s,t}$. In a stochastic gradient descent algorithm, this gradient is propagated to update the weights of the neural network. The loss function is similar to that of [ren2019semi], which uses pairwise distances among clusters and applies the model to clustering images for unsupervised learning.

Here we describe why the value of $a_{s,s'}$ directly affects the structure of the latent features and the clustering model. The encoder's output for a given input data point is $z_{s,t}$. The encoder's output from the last training step at the same time stamp is stored in $\bar{z}_{s',t}$. Given the data point at location $s$, the model considers a target value $\bar{z}_{s',t}$ for all $s'$. Since the neural network minimizes the loss function, the value of $a_{s,s'}$ controls the distance given the last estimation of $\bar{z}_{s',t}$ for all $s'$. If $a_{s,s'}$ has a positive value, then the loss value is positive for the distance of $z_{s,t}$ and $\bar{z}_{s',t}$; training the neural network with such a loss value reduces the distance of the latent features of $s$ and $s'$. On the other hand, a negative value of $a_{s,s'}$ increases the distance of the latent features of $s$ and $s'$. In Section V-B, we validate the effect of the new loss function on sample data. The weight of the spatial loss function, given in Equation (6), is $\gamma_s$. In the experimental results, $\gamma_s$, $\gamma_c$ and $\gamma_r$ are the weights of the spatial loss (Equation 6), the clustering loss (Equation 4) and the reconstruction loss (Equation 1), respectively.

In Algorithm 1, we describe the procedure for finding spatio-temporal clusters. Lines 2-9 are the preprocessing of a DEC model, which includes pretraining an autoencoder, k-means initialization, and building a DEC model. In line 5, the autoencoder is pretrained with only the reconstruction loss active. In line 11, the model finds the values of the latent features from the last pretraining step, and then obtains the targets for the clustering layer and the latent feature layer. Line 13 is the function in Equation (3), introduced in DEC. Line 14 generates the mini-batches of input and output data.

Lastly, we analyze the computational time of the model. In traffic flow data, we have a large number of data points, $N = S \times T$, the product of the number of locations and the total number of time stamps. A k-means clustering method finds the clusters over a number of iterative steps; in each step, it needs to calculate DTW distances, which takes $O(w^2)$ time for time series segments of length $w$. The DEC model maps time series segments of length $w$ to a lower dimension $d$ (in our experiments, $w = 12$). A DEC model finds the k-means initialization on a sub-sample of the data points, and trains over a number of epochs; with mini-batch gradient descent we expect to train the model in fewer than 100 epochs based on our experiments. In each step, a DEC model applies back-propagation, whose computational time depends on the size of the neural network.

V Experimental Results

Here we present the results of clustering traffic flow data. The deep learning model is implemented with Keras. We use a fully-connected autoencoder with seven layers, each with a ReLU activation function and a dropout rate of 0.2. A batch size of 288, one day with 5-min time stamps, and the Adam optimizer are selected.

We compare the performance of three models: k-means on the latent features of an autoencoder, DEC, and Spatial-DEC. We also have three loss terms, the spatial loss, the clustering loss and the reconstruction loss, with corresponding weights $\gamma_s$, $\gamma_c$ and $\gamma_r$, respectively. For k-means with an autoencoder, only the reconstruction loss is used ($\gamma_s = \gamma_c = 0$). For DEC, the clustering and reconstruction losses are used ($\gamma_s = 0$). For Spatial-DEC, all three loss weights are non-zero.

V-A Traffic data

Traffic flow data are obtained from PeMS [californiapems]. Traffic flow data are gathered from main-line loop detector sensors every 30 seconds and aggregated to every 5 minutes. The data are for the US-101 South highway and the I-280 South and I-680 South highways in the Bay Area of California, which include 26 and 16 main-line sensors, respectively, as illustrated in Fig. 4. We report the average of the values over these selected data. We select the data for the first five months of 2016; the models are trained on the first three months and evaluated on the next two months. In a preprocessing step, we re-scale the data and subtract each time window of size $w$ from its mean value. A time window of size 12, i.e. one hour, is selected. In the model, we assume that the sensors are on one line along a highway, and the average of the results for these two highway groups is presented.
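The preprocessing step can be sketched as follows. Min-max re-scaling is an assumption here (the exact re-scaling used in the paper is not shown); the window-mean subtraction follows the description above.

```python
import numpy as np

def preprocess(flow, w=12):
    """Re-scale a flow series and center every window of size w on its
    mean. Min-max scaling to [0, 1] is an assumed choice of re-scaling."""
    flow = (flow - flow.min()) / (flow.max() - flow.min())  # re-scale
    n = len(flow) // w
    segs = flow[:n * w].reshape(n, w)
    return segs - segs.mean(axis=1, keepdims=True)  # subtract window mean

flow = np.random.rand(288) * 500          # one day of 5-min flow values
segs = preprocess(flow)
print(segs.shape, np.allclose(segs.mean(axis=1), 0.0))  # (24, 12) True
```

Centering each one-hour window on its own mean makes the clustering focus on the shape of the flow pattern rather than its absolute level.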

Fig. 4: The main-line loop detector sensors on highways in the Bay Area, California. The black boxes are the sensors on I-680-S and I-280-S, and the red boxes are the sensors on US-101-S.

V-B Validation of the spatial loss function

In Spatial-DEC, we use the spatial loss function introduced in Equation (6). The loss function decreases (increases) the distance between latent feature representations if two locations are close to (far from) each other. Here, we examine the correctness of the model by visualizing the latent feature representations. We consider the first six successive sensors on the highway US-101-S.

Fig. 5: The scatter plot of latent feature representations for the sample data. Fig. (a) represents the latent features of an autoencoder without the spatial loss. Fig. (b) represents the latent features with a spatial loss weight of 1.0, and Fig. (c) with a weight of 10.0.

We obtain spatial weights that represent the distances of locations on a line. Any function with the following property can be used to obtain the weights: if two locations are close (far apart), their weight should be close to +1 (-1). We assume that there are six locations on one line and use a distance function that is nonzero for nearby locations and zero beyond a cutoff distance; throughout the experiments, we notice that this better stabilizes the clusters.
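A sketch of such a weighting is shown below. The exact function used in the paper is not reproduced here; the linear decay and the cutoff of two hops are illustrative choices satisfying the stated property (weights near +1 for close locations, decreasing toward -1, and zero for far-apart pairs).

```python
import numpy as np

def spatial_weights(n_locations=6, cutoff=2):
    """Pairwise spatial weights for locations ordered on a line.

    Illustrative choice: w = 1 - |i - j| gives +1 for a location with
    itself, 0 at distance 1, -1 at distance 2, and weights are zeroed
    beyond the cutoff, which the authors note stabilizes the clusters."""
    idx = np.arange(n_locations)
    d = np.abs(idx[:, None] - idx[None, :])   # |i - j| on the line
    w = 1.0 - d                                # linear decay with distance
    w[d > cutoff] = 0.0                        # drop far-apart pairs
    return w

W = spatial_weights()
print(W[0])   # weights of location one, e.g. [ 1.  0. -1.  0.  0.  0.]
```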

We train Spatial-DEC on a sample of the data. Fig. 5 represents the latent features in two dimensions for different values of the spatial loss weight, set to 0, 1.0, and 10.0 in Figs. 5(a), 5(b), and 5(c), respectively. With a weight of 0, the model is an autoencoder without the spatial loss, and the data points are scattered in the latent feature space without any prior geographical information. With a weight of 1.0, shown in Fig. 5(b), the order of locations from 1 to 6 is preserved in the latent feature space. With a spatial loss weight of 10.0, the latent features become completely separable: the data points are mapped into the latent space in their corresponding order of distances, from top left to bottom right. In the rest of the implementations, we use a weight of 1.0, as it remains comparable with DEC and k-means in terms of temporal similarity while still finding latent features with spatial closeness. We also notice that early stopping of the training prevents the data points from becoming completely separable as in Fig. 5(c), and instead yields results like Fig. 5(b).

Clustering evaluation (percentage results)

Models         Sum of square error to cluster means   Connectivity   Dis-connectivity
k-means        0.27                                   0.19           0.21
k-means-DTW    0.28                                   0.17           0.19
DEC            0.22                                   0.19           0.18
Spatial-DEC    0.23                                   0.45           0.11

TABLE I: The comparison of clustering models

V-C Analysis of temporal clusters

After pretraining the autoencoder, the first step in training Spatial-DEC is to initialize the clusters with k-means clustering. To obtain an appropriate number of clusters, we use the Elbow method: the optimum value is reached when the reduction in inertia, the sum of squared distances of data points to their cluster centers, becomes linear, as represented in Fig. 6, where we find 80 to be the best number of clusters.

Fig. 6: The Elbow method finds the best value for the number of clusters represented with a black vertical line.
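The Elbow procedure can be sketched as follows. The k-means implementation and the three-group synthetic data are stand-ins, not the paper's latent features; on the real data the elbow appears at 80 clusters rather than 3.

```python
import numpy as np

def kmeans_inertia(X, k, n_iter=20, seed=0):
    """Plain k-means; returns the inertia (sum of squared distances of
    data points to their nearest cluster center)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d.min(1).sum()

rng = np.random.default_rng(1)
# Synthetic 4-d features with 3 well-separated groups (stand-in data).
X = np.concatenate([rng.normal(c, 0.1, size=(50, 4)) for c in (0.0, 1.0, 2.0)])
inertias = {k: kmeans_inertia(X, k) for k in range(1, 7)}
# The "elbow" is where the drop in inertia becomes roughly linear
# (k = 3 for this synthetic data).
```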

To show that the latent features are directly related to temporal features, the latent feature representation of one sensor's traffic states is shown for five weekdays in Fig. 7. A t-distributed stochastic neighbor embedding (TSNE) [maaten2008visualizing] is used to represent the latent features in two dimensions, with a perplexity of 40, 300 iterations, and a learning rate of 500. The color of each data point represents the hour of the day; one day is grouped into 10 color bands. This visualization shows that the data points are distinguishable based on their time stamps and that the latent features preserve the temporal properties of the data.

Fig. 7: TSNE representation of autoencoder’s latent features.

To represent temporal similarity in traffic flow data, dynamic time warping (DTW) has been broadly used [lv2020temporal]. In our problem, the warping window size can range from 1 to 12, the size of the time series segments, where a smaller value reduces the computational time. Comparing the Rand index across warping window sizes is one method to obtain the best value [dau2018optimizing]. We notice that there is no significant change in the clusters obtained by k-means with the DTW distance function when we reduce the warping window size from 12 to 6; hence, we select six as the warping window. To show that the latent feature space preserves temporal distances, for every pair of data points we compute the Euclidean distance between their latent features and the DTW distance between their time series, and measure the correlation between the two. For latent feature sizes from 1 to 10, the correlation ranges from 0.9 to 0.98, with the maximum of 0.98 reached at a latent feature size of 4. Hence, we select a latent feature size of 4 in our analysis and conclude that the latent feature space preserves the temporal similarity of the data points.
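The comparison above can be sketched as follows. The DTW implementation uses a Sakoe-Chiba window of 6 as in the text; the segments are random stand-ins, and the "latent" features are simply the first four coordinates, so the correlation here will be far weaker than the 0.98 reported for a trained latent space.

```python
import numpy as np

def dtw(a, b, window=6):
    """DTW distance with a Sakoe-Chiba warping window, as used for the
    12-step traffic flow segments (window = 6)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])

rng = np.random.default_rng(0)
segs = rng.random((30, 12))          # stand-in time series segments
Z = segs[:, :4]                      # stand-in 4-d "latent" features
dtw_d, euc_d = [], []
for i in range(len(segs)):
    for j in range(i + 1, len(segs)):
        dtw_d.append(dtw(segs[i], segs[j]))
        euc_d.append(np.linalg.norm(Z[i] - Z[j]))
r = np.corrcoef(dtw_d, euc_d)[0, 1]  # pairwise-distance correlation
```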

In the rest of this section, we analyze the clusters obtained by k-means, DEC, and Spatial-DEC. Unlike supervised learning, in unsupervised learning there is no clear approach to evaluating clusters. Hence, we describe the properties that we expect to see in the clusters and define evaluation metrics based on them. We expect clusters that are compact and include data points with high temporal similarity. A feature-based clustering of time series data can improve time series forecasting performance [bandara2020forecasting]. Our clustering models are feature-based, as the clustering method is applied to the latent feature representation of the data points. Hence, we define a temporal similarity measure as follows:

$$ S = \frac{1}{N} \sum_{k} \sum_{x_i \in C_k} \mathrm{DTW}(x_i, \mu_k)^2 $$

where $C_k$ is the set of members of cluster $k$, $\mu_k$ is the mean of the cluster, and $x_i$ is the $i$-th element of the cluster. We consider the element whose latent feature is closest to the mean of the cluster as $\mu_k$, i.e., the medoid of the cluster.
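The measure can be sketched as below; this is one interpretation of the definition, with the medoid chosen in latent space and squared distances of members' time series to the medoid's series averaged over all points. The paper uses DTW as the distance; plain Euclidean distance is used here as a stand-in to keep the sketch short.

```python
import numpy as np

def temporal_sse(series, latent, labels, dist=None):
    """Cluster compactness: the medoid of each cluster is the member whose
    latent feature is closest to the cluster's latent mean; the score
    averages squared distances of members' series to the medoid's series.
    `dist` defaults to Euclidean as a stand-in for DTW."""
    if dist is None:
        dist = lambda a, b: np.linalg.norm(a - b)
    total, n = 0.0, len(series)
    for k in np.unique(labels):
        idx = np.where(labels == k)[0]
        mean_z = latent[idx].mean(0)
        medoid = idx[np.argmin([np.linalg.norm(latent[i] - mean_z) for i in idx])]
        total += sum(dist(series[i], series[medoid]) ** 2 for i in idx)
    return total / n

rng = np.random.default_rng(2)
series = rng.random((20, 12))              # stand-in segments
latent = rng.random((20, 4))               # stand-in latent features
labels = rng.integers(0, 3, size=20)       # stand-in cluster assignments
score = temporal_sse(series, latent, labels)   # lower = more compact
```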

Fig. 8: Comparison of implemented clustering models based on sum of square distance of clusters.

In Fig. 8, we compare the compactness of the clusters using this DTW-based measure, which captures compactness in terms of temporal similarity. A more compact cluster better represents data points with temporal similarity. The implemented DEC and Spatial-DEC have similar temporal similarity; this is important because it shows that while Spatial-DEC finds more connected clusters, it does not significantly reduce the temporal similarity within the clusters. The third model is k-means, applied to the latent features of the time series.

V-D Analysis of spatial clusters

Here, we evaluate the spatial clusters obtained by Spatial-DEC. We define a spatial cluster as a set of locations which all have the same assigned temporal cluster; in other words, for a given time stamp $t$, all locations of one spatial cluster have the same assigned cluster. In traffic flow data analysis, a spatial cluster represents road segments with similar traffic flow patterns at a given time stamp. We define a connected spatial cluster $\mathrm{CSC}_{l,t}$ as the set of location indices which have the same assigned cluster as location $l$ at time stamp $t$ and are all neighbors of one another. We define spatial connectivity as an evaluation metric for the analysis of spatial clusters: it measures the total size of connected spatial clusters. If the size of the connected spatial cluster of location $l$ is one, the assigned cluster of location $l$ differs from the assigned clusters of its neighbors; such a clustering output is not desirable, as it cannot show the similarity of locations. A desired spatial cluster should have high temporal similarity and high spatial connectivity. We also note that high spatial connectivity can reduce temporal similarity, because the cluster then includes larger road segments, which can have lower temporal similarity. Spatial connectivity is defined as follows:

$$ \mathrm{Connectivity} = \frac{1}{LT} \sum_{t=1}^{T} \sum_{l=1}^{L} \frac{|\mathrm{CSC}_{l,t}|}{L} $$

where $|\mathrm{CSC}_{l,t}|$ is the size of the connected spatial cluster of location $l$ at time stamp $t$, and $L$ and $T$ are the numbers of locations and time stamps.

On the other hand, if a spatial cluster includes location indices that are dis-connected over the geographical area, the cluster is not desirable. We therefore define a spatial dis-connectivity metric as follows. For each location $l$ and time stamp $t$, we define $\mathrm{DSC}_{l,t}$ as the set of location indices which have the same assigned temporal cluster as $l$ but are not in $\mathrm{CSC}_{l,t}$. The spatial dis-connectivity is obtained as

$$ \mathrm{Disconnectivity} = \frac{1}{LT} \sum_{t=1}^{T} \sum_{l=1}^{L} \frac{|\mathrm{DSC}_{l,t}|}{L}. $$
Fig. 9 represents the spatial connectivity and dis-connectivity of the obtained clusters. A higher connectivity value shows that data points of closer locations are assigned to the same temporal clusters, which is more desirable. Conversely, a lower dis-connectivity value indicates that the clusters are not dis-connected over the geographical area. The figure shows that Spatial-DEC significantly increases the connectivity and decreases the dis-connectivity of the clusters.
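The connectivity and dis-connectivity metrics can be computed from a matrix of cluster assignments as sketched below; this reflects one reading of the definitions above, with locations ordered on a line and both scores normalized by the number of locations.

```python
import numpy as np

def connectivity_metrics(assign):
    """Spatial connectivity and dis-connectivity for locations on a line.

    `assign[t, l]` is the cluster of location l at time t.  CSC(l, t) is
    the maximal run of neighboring locations sharing l's cluster at time
    t; DSC(l, t) are same-cluster locations outside that run.  Both
    scores are averaged sizes, normalized by the number of locations."""
    T, L = assign.shape
    conn = disc = 0.0
    for t in range(T):
        for l in range(L):
            lo = l
            while lo > 0 and assign[t, lo - 1] == assign[t, l]:
                lo -= 1
            hi = l
            while hi < L - 1 and assign[t, hi + 1] == assign[t, l]:
                hi += 1
            csc = hi - lo + 1                       # |CSC(l, t)|
            same = np.sum(assign[t] == assign[t, l])
            conn += csc / L
            disc += (same - csc) / L                # |DSC(l, t)| / L
    return conn / (T * L), disc / (T * L)

# Toy assignment: two connected segments at t=0, a dis-connected one at t=1.
A = np.array([[0, 0, 0, 1, 1, 1],
              [0, 1, 1, 0, 0, 1]])
c, d = connectivity_metrics(A)
```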

Fig. 9: Comparison of implemented clustering models based on spatial connectivity and dis-connectivity.

Overall, Fig. 8 and Table I show that the clusters of Spatial-DEC are more compact than those of k-means in terms of temporal similarity; in other words, the data points of one cluster have higher temporal similarity.

We define a spatial metric for each time stamp based on the connectivity of the obtained clusters, and show that its mean over all time stamps for Spatial-DEC is significantly higher than for DEC. The null hypothesis is that the means for Spatial-DEC and DEC are equal; the alternative hypothesis is that they are not. Since the metric can be represented by a normal distribution with positive mean for both DEC and Spatial-DEC, we apply a t-test on the values obtained by DEC and Spatial-DEC. The p-value is 0.0012, so we can reject the null hypothesis at the chosen significance level. This shows that the increase in the connectivity of the spatial clusters is statistically significant.
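An illustrative two-sample t-test in the spirit of this comparison is sketched below. The per-time-stamp metric values are synthetic stand-ins (drawn around the connectivity levels in Table I), not the paper's measurements.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Synthetic per-time-stamp spatial metric values for the two models.
metric_sdec = rng.normal(loc=0.45, scale=0.05, size=60)   # Spatial-DEC
metric_dec = rng.normal(loc=0.19, scale=0.05, size=60)    # DEC

# Two-sample t-test of the null hypothesis that the means are equal.
t_stat, p_value = stats.ttest_ind(metric_sdec, metric_dec)
# A p-value below the chosen significance level rejects the null
# hypothesis, i.e. the difference in means is statistically significant.
```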

V-E Analysis of traffic flow clusters

Here we visualize and further analyze the clusters of traffic flow data. Spatio-temporal clusters show how road segments are similar over time periods, as represented in Fig. 10. The figure shows the time stamps of one day on the y-axis and the location indices on the x-axis, where each color represents the assigned cluster. To better visualize the clusters and represent their similarities, we only consider 8 clusters. Areas with the same color have temporal similarity; the figure shows how the 26 locations are similar over time periods.

Fig. 10: Spatio-temporal clusters obtained by Spatial-DEC

The above representations show spatio-temporal and spatial clusters. Our clusters are based on temporal similarity: if a data point is far from the means of the clusters, it rarely occurs in the temporal domain, or it is an anomaly. Here, we visualize such an example to clarify this interpretation. In Fig. 11, the heatmap of the distances of data points from the centers of the clusters is represented, and we can see that a portion of the values are far from the centers: the time stamps close to 45 and the first 4 locations have light values. Regardless of the reason for the anomaly, we look at the traffic flow values of the first four locations in Fig. 12. The area close to time stamp 45 includes a large reduction in traffic flow values. This could be the result of an accident; however, in this paper, we do not analyze the reasons behind anomalies or the performance of anomaly detection. These analyses show the different applications and the importance of having spatial, temporal, and spatio-temporal clusters in a transportation network.

Fig. 11: The heatmap of the distances of data points to the centers of their assigned clusters. The light colors represent data points far from the cluster centers and potential anomalies in the data.
Fig. 12: Representation of traffic flow for the first four selected sensors. The anomaly is marked with a box, as there is a large reduction in the flow.

Vi Conclusion and Future Work

Spatio-temporal clustering is an important method for transportation systems. One of the challenging problems is to find spatio-temporal similarities in a transportation network; the problem definition for traffic flow data is presented in Section II. Finding dynamic clusters of locations in a transportation network, illustrated in Section V-E, is necessary for analyzing traffic congestion propagation and for improving traffic flow prediction and missing data imputation. Moreover, finding temporal patterns in traffic flow data enables more efficient prediction and detection of anomalies, also illustrated in Section V-E. While these applications are important in transportation systems, there are few studies in the literature that develop deep learning models for spatio-temporal clustering of traffic flow data.

The increasing availability of traffic data requires further development of clustering models for complex and high-dimensional data. In this paper, we propose Spatial-DEC, a variation of Deep Embedded Clustering, to obtain spatio-temporal clusters, and illustrate its performance in finding dense and compact temporal clusters in Section V-C and spatially connected clusters in Section V-D. The contributions of this work lie both in the model architecture, whose validity is examined in Section V-B, and in the evaluation metrics defined for spatial and temporal clusters in Sections V-C and V-D. The proposed model uses the loss function introduced in Eq. 6 to find spatio-temporal clusters. Such a model can be useful not only for traffic data but also for other spatio-temporal problems, such as those in environmental science and smart-city domains. We also consider a graph structure for the latent feature representation, which can be further studied in the development of deep learning models for spatio-temporal data.