# Evolutionary Clustering of Streaming Trajectories

The widespread deployment of smartphones and location-enabled, networked in-vehicle devices renders it increasingly feasible to collect streaming trajectory data of moving objects. The continuous clustering of such data can enable a variety of real-time services, such as identifying representative paths or common moving trends among objects in real-time. However, little attention has so far been given to the quality of clusters – for example, it is beneficial to smooth short-term fluctuations in clusters to achieve robustness to exceptional data. We propose the notion of evolutionary clustering of streaming trajectories, abbreviated ECO, that enhances streaming-trajectory clustering quality by means of temporal smoothing that prevents abrupt changes in clusters across successive timestamps. Employing the notions of snapshot and historical trajectory costs, we formalize ECO and then formulate ECO as an optimization problem and prove that ECO can be performed approximately in linear time, thus eliminating the iterative processes employed in previous studies. Further, we propose a minimal-group structure and a seed point shifting strategy to facilitate temporal smoothing. Finally, we present all algorithms underlying ECO along with a set of optimization techniques. Extensive experiments with two real-life datasets offer insight into ECO and show that it outperforms state-of-the-art solutions in terms of both clustering quality and efficiency.


## 1. Introduction

It is increasingly possible to equip moving objects with positioning devices that are capable of transmitting object positions to a central location in real time. Examples include people with smartphones and vehicles with built-in navigation devices or tracking devices. This scenario opens new opportunities for the real-time discovery of hidden mobility patterns. These patterns allow characterizing individual mobility for a certain time interval and enable a broad range of important services and applications such as route planning (Zeng et al., 2019; Wang et al., 2020), intelligent transportation management (Wang et al., 2021), and road infrastructure optimization (Wu et al., 2015).

As a typical movement pattern discovery approach, clustering aims to group a set of trajectories into comparatively homogeneous clusters in order to extract representative paths or movement patterns shared by moving objects. In a streaming setting, many methods have been proposed to cluster trajectories in real time (Jensen et al., 2007; Li et al., 2010; Yu et al., 2013a; Costa et al., 2014; Deng et al., 2015; Da Silva et al., 2016; Chen et al., 2019; Tang et al., 2012; Li et al., 2012). However, existing real-time clustering methods focus on the most recent data, achieving low computational cost at the expense of clustering quality (Xu et al., 2014). In streaming settings, clusterings should be robust to short-term fluctuations in the underlying trajectory data, which may be achieved by means of smoothing (Chi et al., 2007). An example illustrates this.

###### Example 1 ().

Figure 1 shows the trajectories of 12 moving objects at three timestamps. Traditional clustering algorithms return two clusters at the first timestamp, three clusters at the second timestamp, and, at the third timestamp, the same two clusters as at the first timestamp.

The underlying reason for this result is the unusual behavior of two of the objects at the second timestamp. Clearly, returning the same two stable clusters for all three timestamps is a more robust and better-quality result. A naive approach to eliminating the effect of the two objects' unusual behavior is to perform data cleaning before clustering. However, studies on two real-life datasets show that among the trajectories that cause abrupt changes in clusterings, 88.9% and 75.9% follow the speed constraint, while 97.8% and 96.1% are categorized as inliers (Ester et al., 1996). Moreover, in real-time applications, it is impractical to correct previous clusterings retroactively. Hence, it is difficult for existing cleaning techniques to facilitate smoothly shifting clustering sequences (Li et al., 2020a; Patil et al., 2018; Idrissov and Nascimento, 2012).

However, this problem can be addressed by applying evolutionary clustering (Kim and Han, 2009; Fenn et al., 2009; Chen et al., 2020; Chakrabarti et al., 2006; Chi et al., 2007; Gupta et al., 2011; Xu et al., 2014; Yin et al., 2021; Ma and Dong, 2017; Liu et al., 2020), where a good current clustering result is one that fits the current data well, while not deviating too much from the recent history of clusterings. Specifically, temporal smoothness is integrated into the measure of clustering quality (Chi et al., 2007). This way, evolutionary clustering is able to outperform traditional clustering, as it can reflect long-term trends while being robust to short-term variability. Put differently, applying evolutionary clustering to trajectories can mitigate the adverse effects of intermittent noise on clustering and present users with smooth and consistent movement patterns. In Example 1, a clustering with temporal consistency is obtained if the locations of the two deviating objects are smoothed at the second timestamp so that the two stable clusters are preserved. Motivated by this, we study evolutionary clustering of trajectories.

Existing evolutionary clustering studies target dynamic networks and are not suitable for trajectory applications, mainly for three reasons. First, the solutions are designed specifically for dynamic networks, which differ substantially from two-dimensional trajectory data. Second, the movement in trajectories is generally much faster than the evolution of dynamic networks, which renders the temporal smoothness used in existing studies too "strict" for trajectories. Third, existing studies often optimize the clustering quality iteratively at each timestamp (Kim and Han, 2009; Chakrabarti et al., 2006; Yin et al., 2021; Folino and Pizzuti, 2013; Liu et al., 2020, 2019), which is computationally costly and infeasible for large-scale trajectories.

We propose an efficient and effective method for evolutionary clustering of streaming trajectories (ECO). First, we adopt the idea of neighbor-based smoothing (Kim and Han, 2009) and develop a structure called minimal group that is summarized by a seed point in order to facilitate smoothing. Second, following existing studies (Chakrabarti et al., 2006; Yin et al., 2021; Xu et al., 2014; Folino and Pizzuti, 2013; Liu et al., 2020, 2019), we formulate ECO as an optimization problem that employs the new notions of snapshot cost and historical cost. The snapshot cost evaluates the true concept shift of clustering defined according to the distances between smoothed and original locations. The historical cost evaluates the temporal distance between locations at adjacent timestamps by the degree of closeness. Next, we prove that the proposed optimization function can be decomposed and that each component can be solved approximately in constant time. The effectiveness of smoothing is further improved by a seed point shifting strategy. Finally, we introduce a grid index structure and present algorithms for each component of evolutionary clustering along with a set of optimization techniques, to improve clustering performance. The paper’s main contributions are summarized as follows,

• We formalize the ECO problem. To the best of our knowledge, this is the first proposal for streaming trajectory clustering that takes temporal smoothness into account.

• We formulate ECO as an optimization problem, based on the new notions of snapshot cost and historical cost. We prove that the optimization problem can be solved approximately in linear time.

• We propose a minimal group structure to facilitate temporal smoothing and a seed point shifting strategy to improve clustering quality of evolutionary clustering. Moreover, we present all algorithms needed to enable evolutionary clustering, along with a set of optimization techniques.

• Extensive experiments on two real-life datasets show that ECO advances the state of the art in terms of both clustering quality and efficiency.

The rest of the paper is organized as follows. We present preliminaries in Section 2. We formulate the problem in Section 3 and derive its solution in Section 4. Section 5 presents the algorithms and optimization techniques. Section 6 covers the experimental study. Section 7 reviews related work, and Section 8 concludes and offers directions for future work.

## 2. Preliminaries

### 2.1. Data Model

###### Definition 1 ().

A GPS record is a pair ⟨t, l⟩, where t is a timestamp and l = (x, y) is a location, with x being a longitude and y being a latitude.

###### Definition 2 ().

A streaming trajectory o is an unbounded, ordered sequence of GPS records ⟨r1, r2, …⟩.

The GPS records of a trajectory may be transmitted to a central location in an unsynchronized manner. To prevent this from affecting subsequent processing, we adopt an existing approach (Chen et al., 2019) and discretize time into short intervals that are indexed by integers. We then map the timestamp of each GPS record to the index of the interval that the timestamp belongs to. In particular, we assume that the start time is 00:00:00 UTC, and we partition time into intervals of duration Δt. With Δt = 10 s, the time series 00:00:01, 00:00:12, 00:00:20, 00:00:31, 00:00:44 and 00:00:00, 00:00:13, 00:00:21, 00:00:31, 00:00:40 are both mapped to the sequence 0, 1, 2, 3, 4. We call such a sequence a discretized time sequence and call each discretized timestamp a time step dt. We use trajectory and streaming trajectory interchangeably.
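The mapping above can be sketched as follows (a minimal illustration with names of our own choosing; the paper only specifies the mapping, not an implementation):

```python
# Sketch of the time-discretization step: each raw timestamp is mapped to
# the index of the length-Δt interval it falls into, counted from a fixed
# start time. START and DELTA_T are assumed example values.
from datetime import datetime

START = datetime(2022, 1, 1, 0, 0, 0)  # assumed start time 00:00:00
DELTA_T = 10                           # interval duration in seconds (assumed)

def time_step(ts: datetime, start: datetime = START, delta_t: int = DELTA_T) -> int:
    """Map a raw timestamp to its discretized time step."""
    return int((ts - start).total_seconds() // delta_t)

# The two series from the text map to the same discretized sequence.
series_a = ["00:00:01", "00:00:12", "00:00:20", "00:00:31", "00:00:44"]
series_b = ["00:00:00", "00:00:13", "00:00:21", "00:00:31", "00:00:40"]

def steps(series):
    return [time_step(datetime.strptime("2022-01-01 " + s, "%Y-%m-%d %H:%M:%S"))
            for s in series]

print(steps(series_a))  # [0, 1, 2, 3, 4]
print(steps(series_b))  # [0, 1, 2, 3, 4]
```

Both series collapse to the same discretized time sequence, which is what makes unsynchronized records comparable.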

###### Definition 3 ().

A trajectory is active at time step dtk if it contains a GPS record whose timestamp is mapped to dtk.

###### Definition 4 ().

A snapshot is the set of trajectories that are active at a time step dtk.

Figure 1 shows three snapshots, each of which contains twelve trajectories. Given the start time 00:00:00 and Δt = 10 s, a record with timestamp 00:00:12 arrives at time step 1 because 00:00:12 is mapped to 1. The interval duration Δt is the default sample interval of the dataset. Since deviations between the default sample interval and the actual intervals are small (Li et al., 2020b), we can assume that each trajectory has at most one GPS record at each time step. If this is not the case for a trajectory, we simply keep its earliest GPS record at the time step, which simplifies the subsequent clustering. If a trajectory o is active at both the current time step dtk and the previous time step dtk−1, we use o.l and o.t to denote its location and timestamp at dtk, and o.~l and o.~t to denote those at dtk−1.

###### Definition 5 ().

An ε-neighbor set of a streaming trajectory o at time step dtk is Nk(o) = { o′ | d(o.l, o′.l) ≤ ε }, where d(⋅,⋅) is the Euclidean distance and ε is a distance threshold. |Nk(o)| is called the local density of o at dtk.

Figure 2 plots ε-neighbor sets at a time step from Figure 1.

### 2.2. DBSCAN

We adopt a well-known density-based clustering approach, DBSCAN (Ester et al., 1996), for clustering. DBSCAN relies on two parameters that characterize density and sparsity, i.e., a positive value ε and a positive integer minPts.

###### Definition 6 ().

A trajectory o is a core point w.r.t. ε and minPts if |Nk(o)| ≥ minPts.

###### Definition 7 ().

A trajectory o is density reachable from another trajectory o′ if a sequence of trajectories o1, …, on exists such that (i) o1 = o′ and on = o; (ii) o1, …, on−1 are core points; and (iii) oi+1 ∈ Nk(oi) for 1 ≤ i < n.

###### Definition 8 ().

A trajectory o is connected to another trajectory o′ if a trajectory o″ exists such that both o and o′ are density reachable from o″.

###### Definition 9 ().

A non-empty subset C of the trajectories in a snapshot is called a cluster if C satisfies the following conditions:

• Connectivity: ∀ o, o′ ∈ C, o is connected to o′;

• Maximality: ∀ o, o′, if o ∈ C and o′ is density reachable from o, then o′ ∈ C.

Definition 9 indicates that a cluster is formed by a set of core points and the points that are density reachable from them. Given ε and minPts, a trajectory o is an outlier if it is not in any cluster; o is a border point if o is not a core point and o ∈ Nk(o′), where o′ is a core point.
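Definitions 6-9 correspond to standard DBSCAN. As a minimal, self-contained sketch (a plain quadratic neighbor search with names of our own choosing, not the paper's optimized grid-based variant):

```python
# Minimal DBSCAN sketch over 2-D locations: core points spawn clusters,
# clusters expand via density reachability, and non-reachable points
# stay labeled -1 (outliers). Border points are reachable but not core.
from math import dist

def dbscan(points, eps, min_pts):
    """Return a list of cluster labels; -1 marks outliers."""
    labels = [None] * len(points)

    def neighbors(i):  # includes the point itself, as in |N_k(o)|
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:   # not a core point (may become border later)
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        while seeds:               # expand via density reachability
            j = seeds.pop()
            if labels[j] == -1:    # border point: reachable but not core
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is core: keep expanding from it
                seeds.extend(nbrs)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

The two tight groups become clusters 0 and 1, and the isolated point is an outlier.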

###### Definition 10 ().

A clustering result Ck is the set of clusters obtained from the snapshot at time step dtk.

###### Example 2 ().

In Figure 1, the first snapshot has two clusters. Further, the highlighted trajectories in Figure 2a are core points.

### 2.3. Evolutionary Clustering

Evolutionary clustering is the problem of producing a sequence of clusterings from streaming data, that is, one clustering for each snapshot. It takes the smoothness characteristics of streaming data into account to obtain high-quality clusterings (Chakrabarti et al., 2006). Specifically, two quality aspects are considered:

• High historical quality: the clustering Ck should be similar to the previous clustering Ck−1;

• High snapshot quality: Ck should reflect the true concept shift of the clustering, i.e., remain faithful to the data at each time step.

Evolutionary clustering uses a cost function that enables trade-offs between historical quality and snapshot quality at each time step  (Chakrabarti et al., 2006),

 (1)  Fk = SCk(Co, Ck) + α ⋅ TCk(Ck−1, Ck)

Fk is the sum of two terms: a snapshot cost SCk and a historical cost TCk. The snapshot cost captures the similarity between clustering Ck and the clustering Co that is obtained without smoothing. The smaller SCk is, the better the snapshot quality. The historical cost measures how similar clustering Ck and the previous clustering Ck−1 are. The smaller TCk is, the better the historical quality. Parameter α controls the trade-off between snapshot quality and historical quality.
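The role of α can be seen in a toy computation (the cost values below are made-up scalars, purely for illustration, not real clustering costs):

```python
# Toy illustration of the trade-off in Formula 1: with a small alpha the
# snapshot term dominates and the "faithful" candidate wins; with a large
# alpha the historically smooth candidate wins.
def total_cost(sc, tc, alpha):
    return sc + alpha * tc

# Candidate A fits the current data well but deviates from C_{k-1};
# candidate B is smoother but fits the current snapshot slightly worse.
cand_a = {"sc": 1.0, "tc": 4.0}
cand_b = {"sc": 2.0, "tc": 1.0}

for alpha in (0.1, 1.0):
    fa = total_cost(cand_a["sc"], cand_a["tc"], alpha)
    fb = total_cost(cand_b["sc"], cand_b["tc"], alpha)
    print(alpha, "A" if fa < fb else "B")  # 0.1 -> A, 1.0 -> B
```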

## 3. Problem Statement

We start by presenting two observations, based on which, we define the problem of evolutionary clustering of streaming trajectories.

### 3.1. Observations

#### Gradual evolutions of travel companions

As pointed out in a previous study (Tang et al., 2012), movement trajectories represent continuous and gradual location changes, rather than abrupt changes, implying that co-movements among trajectories also change only gradually over time. Co-movement may be caused by (i) the physical constraints of road networks and vehicles and (ii) close relationships among vehicles, e.g., they may belong to the same fleet or may target the same general destination (Tang et al., 2012).

#### Uncertainty of "border" points

Although movements captured by trajectories are not dramatic over short periods, border points are relatively more likely than core points to leave their current cluster at the next time step. This is validated by statistics from two real-life datasets: among the trajectories that shift to another cluster or become an outlier during the next time steps, 75.0% and 61.5%, respectively, are border points.

### 3.2. Problem Definition

#### Cost embedding

Existing evolutionary clustering studies generally perform temporal smoothing on the clustering result (Folino and Pizzuti, 2013; Chakrabarti et al., 2006; Chi et al., 2007; Yin et al., 2021). Specifically, they adjust the clustering iteratively so as to minimize Formula 1, which incurs very high cost. We instead adopt cost embedding (Kim and Han, 2009), which pushes the cost formula down from the clustering-result level to the data level, thus enabling flexible and efficient temporal smoothing. However, the existing cost embedding technique (Kim and Han, 2009) targets dynamic networks only. To apply cost embedding to trajectories, we propose a minimal group structure and new snapshot and historical cost functions.

#### Snapshot cost SCk

We first define the notion of an "adjustment" of a trajectory.

###### Definition 11 ().

An adjustment r(o) is the location of a trajectory o obtained through smoothing at dtk; r(o) = o.l if o is not smoothed. The set of adjustments in a snapshot is denoted as Θk.

In Figure 1, the smoothed location of a deviating trajectory at the second time step is an adjustment. According to Formula 1, the snapshot cost measures how similar the current clustering result Ck is to the original clustering result Co. Since we adopt cost embedding, which smooths trajectories at the data level, the snapshot cost of a trajectory o w.r.t. its adjustment r(o) at dtk, denoted as SCk(r(o)), is formulated as the deviation between r(o) and o.l at dtk:

 (2)  SCk(r(o)) = d(r(o), o.l)²  s.t.  d(r(o), o.~l) ≤ μ ⋅ (o.t − o.~t),

where μ is the speed constraint of the road network. Formula 2 requires that any adjustment follow the speed constraint. Obviously, the larger the distance between o.l and its adjustment r(o), the higher the snapshot cost.

#### Historical costTCk

As discussed in Section 2.3, one of the goals of evolutionary clustering is to smooth the change of clustering results across adjacent time steps. Since we push the smoothing down from the cluster level to the trajectory level, the problem becomes one of ensuring that each trajectory represents a smooth movement. According to the first observation in Section 3.1, gradual location changes lead to stable co-movement relationships among trajectories during short periods of time. Thus, similar to neighbor-based smoothing in dynamic communities (Kim and Han, 2009), it is reasonable to smooth the location of each trajectory at the current time step using its neighbors at the previous time step. However, the previous study (Kim and Han, 2009) smooths the distance between each pair of neighboring nodes. Simply applying this to trajectories may degrade the performance of smoothing if a "border" point is involved. Recall the second observation of Section 3.1, and assume that a trajectory o is smoothed according to a border point o′ in Figures 1 and 2. As o′ is a border point at dtk−1, it has a higher probability of leaving its cluster at dtk; using o′ to smooth o may thus result in o also leaving the cluster or being located at the border of the cluster at dtk. The first case may incur an abrupt change to the clustering, while the second case may degrade the intra-density of the cluster and increase the inter-density of the clusters in Ck. To tackle this problem, we model neighboring trajectories as minimal groups summarized by seed points.

###### Definition 12 ().

A seed point ~s summarizes a minimal group G(~s) = { o | d(o.l, ~s.l) ≤ δ } at dtk, where δ is a given distance parameter and ~s belongs to the seed point set at dtk. The cardinality of G(~s), |G(~s)|, must exceed a given threshold. Any trajectory o in G(~s) that is different from ~s is called a non-seed point.

Given the current time step dtk, we use ~s to denote the seed point of o's minimal group at dtk−1, while we use s to denote that at dtk.

###### Example 3 ().

In Figure 2b, there are two minimal groups. In Figure 3, there is only one minimal group before smoothing.

We propose to use the location of a seed point at dtk−1 to smooth the location of a non-seed point at dtk. To guarantee the effectiveness of smoothing, Definition 12 imposes two constraints when generating minimal groups: (i) a distance constraint, d(o.l, ~s.l) ≤ δ, and (ii) a cardinality constraint on minimal groups. Setting δ to a small value, the first constraint ensures that the members of a minimal group are close neighbors at dtk−1, which makes it very likely that trajectories in the same minimal group are in the same cluster. The second constraint avoids small neighbor sets, since using an "uncertain border" point as a "pivot" to smooth the movement of other trajectories may lead to an abrupt change between clusterings or a low-quality clustering (according to the quality metrics of traditional clustering). We present the algorithm for generating minimal groups in Section 5.2.
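A greedy sketch of minimal-group generation consistent with Definition 12 (the paper's actual algorithm appears in Section 5.2; this illustration uses names of our own choosing): pick a point with enough δ-neighbors as a seed, claim those neighbors as one minimal group, and repeat over unclaimed points.

```python
# Greedy minimal-group sketch: a group is a seed plus the unclaimed points
# within distance delta of it, kept only if its cardinality exceeds min_card.
from math import dist

def minimal_groups(points, delta, min_card):
    groups, claimed = [], set()
    for i, p in enumerate(points):
        if i in claimed:
            continue
        members = [j for j, q in enumerate(points)
                   if j not in claimed and dist(p, q) <= delta]
        if len(members) > min_card:      # cardinality must exceed the threshold
            groups.append((i, members))  # i acts as the seed point
            claimed.update(members)
    return groups

pts = [(0, 0), (0.5, 0), (0, 0.5), (0.4, 0.4), (9, 9)]
print(minimal_groups(pts, delta=1.0, min_card=2))  # [(0, [0, 1, 2, 3])]
```

The isolated point at (9, 9) ends up in no minimal group, so it would not be smoothed.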

Based on the above analysis, we formalize the historical cost of o w.r.t. its adjustment r(o) at dtk, denoted as TCk(r(o)), as follows.

 (3)  TCk(r(o)) = (⌈d(r(o), ~s.l)/δ⌉ − 1)²  s.t.  d(r(o), o.~l) ≤ μ ⋅ (o.t − o.~t),

Given the threshold δ, the larger the distance between r(o) and ~s.l, the higher the historical cost. Here, we use the degree of closeness (i.e., ⌈d(r(o), ~s.l)/δ⌉) instead of the exact distance to evaluate the historical cost, for two reasons. First, constraining the exact relative distance between any two trajectories during a time interval may be too restrictive, as it varies over time in most cases. Second, using the degree of closeness to constrain the historical cost is sufficient to obtain a smooth evolution of clusterings.
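The two per-trajectory costs can be sketched directly from Formulas 2 and 3 (variable names are ours: r is a candidate adjustment, o_l the current reported location, o_prev_l the previous location o.~l, and s_l the previous seed point's location):

```python
# Per-trajectory snapshot and historical costs, plus the speed-constraint
# feasibility check that both formulas share.
from math import dist, ceil

def feasible(r, o_prev_l, mu, dt_gap):
    """Speed constraint: r must be reachable from the previous location."""
    return dist(r, o_prev_l) <= mu * dt_gap

def snapshot_cost(r, o_l):
    return dist(r, o_l) ** 2                      # Formula 2

def historical_cost(r, s_l, delta):
    return (ceil(dist(r, s_l) / delta) - 1) ** 2  # Formula 3: degree of closeness

r, o_l, o_prev_l, s_l = (3.0, 0.0), (4.0, 0.0), (0.0, 0.0), (1.0, 0.0)
assert feasible(r, o_prev_l, mu=0.5, dt_gap=10)   # 3.0 <= 5.0
print(snapshot_cost(r, o_l))                      # 1.0
print(historical_cost(r, s_l, delta=1.0))         # (ceil(2/1) - 1)^2 = 1
```

Note that any r within δ of s_l yields a historical cost of 0, which is exactly the "degree of closeness" idea: only crossing a δ-ring boundary increases the cost.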

#### Total cost Fk

Formulas 2 and 3 give the snapshot cost and the historical cost for each trajectory o w.r.t. its adjustment r(o), respectively. However, the first measures a distance while the latter evaluates a degree of proximity. Thus, we normalize them to the range [0, 1):

 (4)  SCk(r(o)) = (d(r(o), o.l) / (4μ⋅Δt + δ))²  s.t.  d(r(o), o.~l) ≤ μ ⋅ (o.t − o.~t)
 (5)  TCk(r(o)) = ((⌈d(r(o), ~s.l)/δ⌉ − 1) / ((4μ⋅Δt + δ)/δ))²  s.t.  d(r(o), o.~l) ≤ μ ⋅ (o.t − o.~t),

where Δt is the duration of a time step. Clearly, SCk(r(o)) ≥ 0 and TCk(r(o)) ≥ 0. Thus, we only need to prove SCk(r(o)) < 1 and TCk(r(o)) < 1.

###### Lemma 1 ().

If d(r(o), o.~l) ≤ μ ⋅ (o.t − o.~t), then SCk(r(o)) < 1.

###### Proof.

According to our strategy of mapping original timestamps (Section 2.1), o.t − o.~t ≤ 2⋅Δt. Considering the speed constraint μ of the road network, d(o.~l, o.l) ≤ 2μ⋅Δt, and by the constraint of Formula 4, d(r(o), o.~l) ≤ μ ⋅ (o.t − o.~t) ≤ 2μ⋅Δt. Further, d(r(o), o.l) ≤ d(r(o), o.~l) + d(o.~l, o.l) ≤ 4μ⋅Δt due to the triangle inequality. Since 4μ⋅Δt < 4μ⋅Δt + δ, SCk(r(o)) < 1. ∎

It follows from Lemma 1 that SCk(r(o)) < 1 if d(r(o), o.~l) ≤ μ ⋅ (o.t − o.~t). However, d(o.l, o.~l) ≤ μ ⋅ (o.t − o.~t) does not necessarily hold for the raw data. To address this problem, we pre-process o.l according to o.~l so that it follows the speed constraint before conducting evolutionary clustering. The details are given in Section 4.3.

###### Lemma 2 ().

If d(r(o), o.~l) ≤ μ ⋅ (o.t − o.~t), then TCk(r(o)) < 1.

###### Proof.

We have d(r(o), o.~l) ≤ 2μ⋅Δt according to the proof of Lemma 1. Further, d(o.~l, ~s.l) ≤ δ according to Definition 12. Since d(r(o), ~s.l) ≤ d(r(o), o.~l) + d(o.~l, ~s.l) ≤ 2μ⋅Δt + δ < 4μ⋅Δt + δ, we get ⌈d(r(o), ~s.l)/δ⌉ − 1 < (4μ⋅Δt + δ)/δ, and thus TCk(r(o)) < 1. ∎

According to Lemma 2, we can derive TCk(r(o)) < 1, and thus both normalized costs fall in [0, 1). Letting π = 4μ⋅Δt + δ, the total cost is:

 (6)  Fk = ∑_{o, ~s ∈ Θk ∧ o ≠ ~s} (1/π²) ⋅ (d(r(o), o.l)² + α ⋅ (δ ⋅ (⌈d(r(o), ~s.l)/δ⌉ − 1))²)  s.t.  ∀o ∈ Θk (d(r(o), o.~l) ≤ μ ⋅ (o.t − o.~t)),

where π = 4μ⋅Δt + δ. Formula 6 indicates that we do not smooth the location of o at dtk if o is not summarized in any minimal group at dtk−1. This is in accordance with the basic idea that we conduct smoothing by exploring neighboring trajectories. We can now formulate our problem.
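Putting the pieces together, the total cost of a candidate set of adjustments can be evaluated in one pass (a sketch under our own notation, with the normalizer written as 4μ⋅Δt + δ and the time gap bounded by two time steps as in the proof of Lemma 1; trajectories without a previous minimal group contribute nothing):

```python
# One-pass evaluation of the total cost in Formula 6 for candidate
# adjustments. Each entry is (r, o_l, o_prev_l, seed_l_or_None).
from math import dist, ceil

def total_cost(adjustments, mu, delta, alpha, delta_t):
    pi = 4 * mu * delta_t + delta        # normalizer (assumed form)
    f = 0.0
    for r, o_l, o_prev_l, s_l in adjustments:
        # speed constraint; the time gap is at most two time steps
        assert dist(r, o_prev_l) <= 2 * mu * delta_t, "speed constraint violated"
        if s_l is None:                  # no previous minimal group: not smoothed
            continue
        sc = dist(r, o_l) ** 2
        tc = (delta * (ceil(dist(r, s_l) / delta) - 1)) ** 2
        f += (sc + alpha * tc) / pi ** 2
    return f

adj = [((1.0, 0.0), (1.5, 0.0), (0.0, 0.0), (0.5, 0.0)),
       ((5.0, 5.0), (5.0, 5.0), (4.0, 5.0), None)]
print(total_cost(adj, mu=1.0, delta=1.0, alpha=1.0, delta_t=1.0))  # 0.01
```

The first trajectory pays only a snapshot cost (it stays within δ of its seed, so the historical term is zero); the second has no previous minimal group and is skipped.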

###### Definition 13 ().

Given a snapshot, a set of previous minimal groups, a time duration Δt, a speed constraint μ, and parameters α, ε, δ, and minPts, evolutionary clustering of streaming trajectories (ECO) is to

• find a set of adjustments Θk such that Fk in Formula 6 is minimized;

• compute a set of clusters Ck over Θk.

Specifically, each adjustment of o is denoted as r(o) and is then used as the previous location of o (i.e., o.~l) at dtk+1 for evolutionary clustering.

###### Example 4 ().

Following Example 3, ECO first finds a set of adjustments at dtk. Then, it performs clustering over the adjustments and obtains the clustering result. Note that Figures 1 and 3 only show the adjusted locations, because the other locations are unchanged at dtk.

Clearly, the objective function in Formula 6 is neither continuous nor differentiable. Thus, computing the optimal adjustments using existing solvers involves iterative processes (Song et al., 2015) that are too expensive for online scenarios. We instead prove in Section 4 that Formula 6 can be solved approximately in linear time.

## 4. Solution

Given the current time step dtk, we start by decomposing Fk into per-minimal-group costs as follows,

 (7)  Fk = ∑_{~s ∈ Sk−1} fk(~s.l) = ∑_{~s ∈ Sk−1} ∑_{o ∈ Ω} (d(r(o), o.l)² + α ⋅ (δ ⋅ (⌈d(r(o), ~s.l)/δ⌉ − 1))²)  s.t.  ∀o ∈ Θk (d(r(o), o.~l) ≤ μ ⋅ (o.t − o.~t)),

where Ω is the minimal group summarized by ~s, r(o) is the adjustment of o at dtk, ~s is the seed point of o's minimal group at dtk−1, and ~s.l is the location of ~s. We omit the multiplier 1/π² from Formula 6 because μ, Δt, and δ are constants and do not affect the results.

### 4.1. Linear Time Solution

We show that Formula 7 can be solved approximately in linear time. However, Formula 7 uses each previous seed point ~s for smoothing, and such points may also exhibit unusual behaviors from dtk−1 to dtk. Moreover, ~s may not be active at dtk. We address these problems in Section 4.2 by proposing a seed point shifting strategy, and we assume here that ~s.l has already been smoothed.

###### Lemma 3 ().

Fk achieves the minimum value if each fk(~s.l) achieves the minimum value.

###### Proof.

To prove this, we only need to prove that the terms fk(~s.l) for different seed points ~s do not affect each other. This can be established easily, as minimal groups do not overlap. We omit the details due to space limitations. ∎

Lemma 3 implies that Formula 7 can be solved by minimizing each fk(~s.l) (~s ∈ Sk−1). Next, we further "push down" the cost shown in Formula 7 to each pair of r(o) and ~s.l.

 (8)  fk(r(o), ~s.l) = d(r(o), o.l)² + α ⋅ (δ ⋅ (⌈d(r(o), ~s.l)/δ⌉ − 1))²  s.t.  d(r(o), o.~l) ≤ μ ⋅ (o.t − o.~t)
###### Lemma 4 ().

fk(~s.l) achieves the minimum value if each fk(r(o), ~s.l) achieves the minimum value.

###### Proof.

The proof is straightforward, because the terms fk(r(o), ~s.l) for different trajectories o are independent of each other. ∎

According to Lemma 4, the problem is simplified to computing the minimum of fk(r(o), ~s.l) for each o given ~s.l. However, Formula 8 is still intractable, as its objective function is not continuous. We thus aim to transform it into a continuous function. Before doing so, we cover the case where the computation of r(o) w.r.t. a trajectory o can be skipped.

###### Lemma 5 ().

If d(o.l, ~s.l) ≤ δ, then fk(r(o), ~s.l) is minimized by the adjustment r(o) = o.l.

###### Proof.

Let r(o) = o.l be an adjustment of o. Given d(o.l, ~s.l) ≤ δ, the historical term ⌈d(r(o), ~s.l)/δ⌉ − 1 = 0. On the other hand, as r(o) = o.l, the snapshot cost d(r(o), o.l)² = 0. Thus, fk(r(o), ~s.l) = 0, which is the minimum, if d(o.l, ~s.l) ≤ δ. ∎

A previous study (Kim and Han, 2009) smooths the distance between each pair of neighboring nodes regardless of their relative distances. In contrast, Lemma 5 suggests that if a non-seed point remains close to its previous seed point at the current time step, smoothing can be skipped. This avoids over-smoothing trajectories that are already close.
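Lemma 5's skip rule reduces to a one-line check (a sketch with names of our own choosing):

```python
# Lemma 5 as code: if the non-seed point is already within delta of its
# previous seed point, the optimal adjustment is its original location
# (both cost terms are zero) and smoothing is skipped entirely.
from math import dist

def maybe_smooth(o_l, s_l, delta):
    if dist(o_l, s_l) <= delta:   # Lemma 5: cost is already 0
        return o_l                # keep the original location
    return None                   # otherwise a real search is needed

print(maybe_smooth((1.0, 0.0), (0.5, 0.0), delta=1.0))  # (1.0, 0.0)
print(maybe_smooth((5.0, 0.0), (0.0, 0.0), delta=1.0))  # None
```

In a stream where most trajectories move smoothly, this check lets most points bypass the per-pair minimization altogether.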

###### Definition 14 ().

A circle C(c, ρ) consists of the locations at distance ρ from a center c, where ρ is the radius.

###### Definition 15 ().

A segment connecting two locations l1 and l2 is denoted as seg(l1, l2). The intersection of a circle and a segment is denoted as C(c, ρ) ∩ seg(l1, l2).

Figure 3 shows a circle that contains several locations, together with its intersection with a segment.

###### Lemma 6 ().

.

###### Proof.

In Section 3.2, we constrain before smoothing, which implies that . Hence, . ∎

In Figure 4, given , .

#### Omitting the speed constraint

We first show that, without utilizing the speed constraint, an optimal adjustment of o that minimizes fk(r(o), ~s.l) can be derived in constant time. Based on this, we explain how to compute the constrained optimal adjustment from the unconstrained one.

###### Lemma 7 ().

.

###### Proof.

Let . First, we prove that . Two cases are considered, i.e., (i) and (ii) . For the first case, we can always find an adjustment such that . Hence, . However, we have due to . Thus, . For the second case, it is clear that . Thus, .

Second, we prove that . We can always find such that . Hence, . However, in this case due to . Thus, we have . ∎

In Figure 3, and due to . Lemma 7 indicates that if we ignore the speed constraint in Formula 8, we can search just on