1. Introduction
It is increasingly possible to equip moving objects with positioning devices capable of transmitting object positions to a central location in real time. Examples include people with smartphones and vehicles with built-in navigation or tracking devices. This scenario opens new opportunities for the real-time discovery of hidden mobility patterns. These patterns characterize individual mobility over a certain time interval and enable a broad range of important services and applications, such as route planning (Zeng et al., 2019; Wang et al., 2020), intelligent transportation management (Wang et al., 2021), and road infrastructure optimization (Wu et al., 2015).
As a typical movement-pattern discovery approach, clustering aims to group a set of trajectories into comparatively homogeneous clusters in order to extract representative paths or movement patterns shared by moving objects. In a streaming setting, many works have been proposed to cluster trajectories in real time (Jensen et al., 2007; Li et al., 2010; Yu et al., 2013a; Costa et al., 2014; Deng et al., 2015; Da Silva et al., 2016; Chen et al., 2019; Tang et al., 2012; Li et al., 2012). However, existing real-time clustering methods focus on the most recent data, achieving low computational cost at the expense of clustering quality (Xu et al., 2014). In streaming settings, clusterings should be robust to short-term fluctuations in the underlying trajectory data, which may be achieved by means of smoothing (Chi et al., 2007). An example illustrates this.
Example 1 ().
Figure 1 shows the trajectories of 12 moving objects at three timestamps, .
Traditional clustering algorithms return the two clusters and
at the first timestamp, the three clusters , , and at the second timestamp, and the same two clusters at the third timestamp as at the first timestamp.
The underlying reason for this result is the unusual behavior of objects and at the second timestamp. Clearly, returning the same two stable clusters at all three timestamps is a more robust and better-quality result. A naive approach to eliminating the effect of the two objects’ unusual behavior is to perform data cleaning before clustering. However, studies on two real-life datasets show that among the trajectories that cause mutations of the clusterings, 88.9% and 75.9% follow the speed constraint, while 97.8% and 96.1% are categorized as inliers (Ester et al., 1996). Moreover, in real-time applications, it is impractical to correct previous clusterings retroactively. Hence, it is difficult for existing cleaning techniques to facilitate smoothly shifting clustering sequences (Li et al., 2020a; Patil et al., 2018; Idrissov and Nascimento, 2012).
However, this problem can be addressed by applying evolutionary clustering (Kim and Han, 2009; Fenn et al., 2009; Chen et al., 2020; Chakrabarti et al., 2006; Chi et al., 2007; Gupta et al., 2011; Xu et al., 2014; Yin et al., 2021; Ma and Dong, 2017; Liu et al., 2020), where a good current clustering result is one that fits the current data well while not deviating too much from the recent history of clusterings. Specifically, temporal smoothness is integrated into the measure of clustering quality (Chi et al., 2007). This way, evolutionary clustering is able to outperform traditional clustering, as it can reflect long-term trends while being robust to short-term variability. Put differently, applying evolutionary clustering to trajectories can mitigate the adverse effects of intermittent noise on clustering and present users with smooth and consistent movement patterns. In Example 1, a clustering with temporal consistency is obtained if is smoothed to and is smoothed to at the second timestamp. Motivated by this, we study evolutionary clustering of trajectories.
Existing evolutionary clustering studies target dynamic networks and are not suitable for trajectory applications, mainly for three reasons. First, the solutions are designed specifically for dynamic networks, which differ substantially from two-dimensional trajectory data. Second, the movement in trajectories is generally much faster than the evolution of dynamic networks, which renders the temporal smoothness used in existing studies too "strict" for trajectories. Third, existing studies often optimize the clustering quality iteratively at each timestamp (Kim and Han, 2009; Chakrabarti et al., 2006; Yin et al., 2021; Folino and Pizzuti, 2013; Liu et al., 2020, 2019), which is computationally costly and infeasible for large-scale trajectories.
We propose an efficient and effective method for evolutionary clustering of streaming trajectories (ECO). First, we adopt the idea of neighbor-based smoothing (Kim and Han, 2009) and develop a structure called a minimal group, summarized by a seed point, in order to facilitate smoothing. Second, following existing studies (Chakrabarti et al., 2006; Yin et al., 2021; Xu et al., 2014; Folino and Pizzuti, 2013; Liu et al., 2020, 2019), we formulate ECO as an optimization problem that employs the new notions of snapshot cost and historical cost. The snapshot cost evaluates the true concept shift of the clustering, defined according to the distances between smoothed and original locations. The historical cost evaluates the temporal distance between locations at adjacent timestamps by the degree of closeness. Next, we prove that the proposed optimization function can be decomposed and that each component can be solved approximately in constant time. The effectiveness of smoothing is further improved by a seed point shifting strategy. Finally, we introduce a grid index structure and present algorithms for each component of evolutionary clustering, along with a set of optimization techniques, to improve clustering performance. The paper’s main contributions are summarized as follows.

We formalize the ECO problem. To the best of our knowledge, this is the first proposal for streaming trajectory clustering that takes temporal smoothness into account.

We formulate ECO as an optimization problem, based on the new notions of snapshot cost and historical cost. We prove that the optimization problem can be solved approximately in linear time.

We propose a minimal group structure to facilitate temporal smoothing and a seed point shifting strategy to improve the clustering quality of evolutionary clustering. Moreover, we present all algorithms needed to enable evolutionary clustering, along with a set of optimization techniques.

Extensive experiments on two real-life datasets show that ECO advances the state of the art in terms of both clustering quality and efficiency.
The rest of the paper is organized as follows. We present preliminaries in Section 2. We formulate the problem in Section 3 and derive its solution in Section 4. Section 5 presents the algorithms and optimization techniques. Section 6 covers the experimental study. Section 7 reviews related work, and Section 8 concludes and offers directions for future work.
2. Preliminaries
Notation  Description 

A trajectory  
The time step  
,  The location and timestamp of at 
,  A simplification of , at 
,  A simplification of , at 
A set of trajectories at  
An adjustment of  
The set of adjustments of  
A seed point of at the current time step  
A seed point of at the previous time step  
The set of seed points at  
A minimal group summarized by a seed point at  
The snapshot cost of a trajectory w.r.t. at  
The historical cost of a trajectory w.r.t. at  
A cluster  
The set of clusters obtained at  
2.1. Data Model
Definition 1 ().
A GPS record is a pair , where is a timestamp and is the location, with being a longitude and being a latitude.
Definition 2 ().
A streaming trajectory is an unbounded ordered sequence of GPS records, .
The GPS records of a trajectory may be transmitted to a central location in an unsynchronized manner. To prevent this from affecting subsequent processing, we adopt an existing approach (Chen et al., 2019) and discretize time into short intervals that are indexed by integers. We then map the timestamp of each GPS record to the index of the interval that the timestamp belongs to. In particular, we assume that the start time is 00:00:00 UTC, and we partition time into intervals of duration . Then the time series 00:00:01, 00:00:12, 00:00:20, 00:00:31, 00:00:44 and 00:00:00, 00:00:13, 00:00:21, 00:00:31, 00:00:40 are both mapped to the same discretized time sequence. We call such a sequence a discretized time sequence and call each discretized timestamp a time step dt. We use trajectory and streaming trajectory interchangeably.
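The mapping above can be sketched as follows. This is a minimal illustration under assumptions: the function name `to_time_step` is ours, and we use a 10-second interval duration, which is consistent with the example sequences.

```python
from datetime import datetime


def to_time_step(ts: datetime, start: datetime, interval_s: float) -> int:
    """Map a raw timestamp to the index of its discretization interval.

    `start` is the global start time (00:00:00 in the paper's example)
    and `interval_s` is the interval duration in seconds.
    """
    return int((ts - start).total_seconds() // interval_s)
```

With a 10-second interval, both example series map to the discretized sequence 0, 1, 2, 3, 4, so unsynchronized records land in the same time steps.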
Definition 3 ().
A trajectory is active at time step if it contains a GPS record such that .
Definition 4 ().
A snapshot is the set of trajectories that are active at time step .
Figure 1 shows three snapshots , , and , each of which contains twelve trajectories. Given the start time 00:00:00 and , arrives at because 00:00:12 is mapped to 1. For simplicity, we use in figures to denote . The interval duration is the default sample interval of the dataset. Since deviations between the default sample interval and the actual intervals are small (Li et al., 2020b), we can assume that each trajectory has at most one GPS record at each time step . If this is not the case for a trajectory , we simply keep ’s earliest GPS record at that time step. This simplifies the subsequent clustering. Thus, the GPS record of at is denoted as . If a trajectory is active at both and and the current time step is , then and are simplified as and , and and are simplified as and . At time step () in Figure 1, , 00:00:12, , and 00:00:22.
Definition 5 ().
A neighbor set of a streaming trajectory at time step is , where is the Euclidean distance and is a distance threshold. is called the local density of w.r.t. at .
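Definition 5 can be sketched as below. This is an illustration under assumptions: locations at one time step are stored as a dict from trajectory id to a 2-D point, and the names `neighbor_set` and `local_density` are ours. Whether a trajectory counts as its own neighbor is a convention; we exclude it here.

```python
import math


def neighbor_set(o_id, locations, eps):
    """Trajectories within distance eps of o_id at the current time step.

    locations: dict mapping trajectory id -> (x, y).
    """
    ox, oy = locations[o_id]
    return {tid for tid, (x, y) in locations.items()
            if tid != o_id and math.hypot(x - ox, y - oy) <= eps}


def local_density(o_id, locations, eps):
    """Local density of o_id: the size of its neighbor set."""
    return len(neighbor_set(o_id, locations, eps))
```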
2.2. DBSCAN
We adopt a well-known density-based clustering approach, DBSCAN (Ester et al., 1996). DBSCAN relies on two parameters to characterize density or sparsity: a positive distance threshold and minPts.
Definition 6 ().
A trajectory is a core point w.r.t. and minPts, if .
Definition 7 ().
A trajectory is density reachable from another trajectory , if a sequence of trajectories exists such that (i) and ; (ii) are core points; and (iii) .
Definition 8 ().
A trajectory is connected to another trajectory if a trajectory exists such that both and are density reachable from .
Definition 9 ().
A nonempty subset of trajectories of is called a cluster , if satisfies the following conditions:

Connectivity: , is connected to ;

Maximality: , if and is density reachable from , then .
Definition 9 indicates that a cluster is formed by a set of core points and their density reachable points. Given and minPts, is an outlier, if it is not in any cluster; is a border point, if and , where is a core point.
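Definitions 6–9 correspond to the standard DBSCAN procedure. The sketch below is a textbook implementation over 2-D points, not the paper's streaming variant; the names and the point representation are ours.

```python
import math


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch (Definitions 6-9).

    points: dict id -> (x, y). Returns dict id -> cluster label,
    with -1 for outliers. Border points join the cluster of a core
    point that density-reaches them.
    """
    def neighbors(p):
        px, py = points[p]
        return [q for q, (qx, qy) in points.items()
                if math.hypot(px - qx, py - qy) <= eps]

    labels = {}
    cluster = 0
    for p in points:
        if p in labels:
            continue
        nbrs = neighbors(p)
        if len(nbrs) < min_pts:
            labels[p] = -1          # tentatively an outlier
            continue
        cluster += 1                # p is a core point: start a cluster
        labels[p] = cluster
        frontier = [q for q in nbrs if q != p]
        while frontier:
            q = frontier.pop()
            if labels.get(q, -1) == -1:
                labels[q] = cluster  # border point or newly reached point
                q_nbrs = neighbors(q)
                if len(q_nbrs) >= min_pts:  # q is also core: keep expanding
                    frontier.extend(r for r in q_nbrs
                                    if labels.get(r, -1) == -1)
    return labels
```

Note that the neighborhood here includes the point itself, so the core-point test `len(nbrs) >= min_pts` follows the usual DBSCAN convention.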
Definition 10 ().
A clustering result is a set of clusters obtained from the snapshot .
2.3. Evolutionary Clustering
Evolutionary clustering is the problem of producing a sequence of clusterings from streaming data, that is, one clustering for each snapshot. It takes into account the smoothness characteristics of streaming data to obtain high-quality clusterings (Chakrabarti et al., 2006). Specifically, two quality aspects are considered:

High historical quality: clustering should be similar to the previous clustering ;

High snapshot quality: should reflect the true concept shift of clustering, i.e., remain faithful to the data at each time step.
Evolutionary clustering uses a cost function that enables trade-offs between historical quality and snapshot quality at each time step (Chakrabarti et al., 2006),
(1) 
is the sum of two terms: a snapshot cost () and a historical cost (). The snapshot cost captures the similarity between clustering and the clustering obtained without smoothing. The smaller is, the better the snapshot quality. The historical cost measures how similar clustering and the previous clustering are. The smaller is, the better the historical quality. The parameter controls the trade-off between snapshot quality and historical quality.
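As a sketch of this trade-off, the evolutionary clustering literature (Chakrabarti et al., 2006) commonly uses a linear combination of the two costs; the form and the name `alpha` below are that common convention, which may differ in detail from the paper's exact parameterization.

```python
def total_cost(snapshot_cost: float, historical_cost: float, alpha: float) -> float:
    """Common linear form of the evolutionary clustering cost.

    alpha in [0, 1]: alpha = 1 ignores history (traditional clustering),
    alpha = 0 only preserves the previous clustering.
    """
    return alpha * snapshot_cost + (1 - alpha) * historical_cost
```

Minimizing this cost trades fidelity to the current snapshot against similarity to the previous clustering.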
3. Problem Statement
We start by presenting two observations, based on which we define the problem of evolutionary clustering of streaming trajectories.
3.1. Observations
Gradual evolution of travel companions
As pointed out in a previous study (Tang et al., 2012), movement trajectories represent continuous and gradual location changes rather than abrupt changes, implying that co-movements among trajectories also change only gradually over time. Co-movement may be caused by (i) physical constraints of road networks and vehicles, and (ii) close relationships among vehicles, e.g., they may belong to the same fleet or may target the same general destination (Tang et al., 2012).
Uncertainty of "border" points
Although movements captured by trajectories are not dramatic over a short time, border points are relatively more likely than core points to leave their current cluster at the next time step. This is validated by statistics from two real-life datasets. Specifically, among the trajectories shifting to another cluster or becoming outliers during the next time steps, 75.0% and 61.5% are border points in the two datasets.
3.2. Problem Definition
Cost embedding
Existing evolutionary clustering studies generally perform temporal smoothing on the clustering result (Folino and Pizzuti, 2013; Chakrabarti et al., 2006; Chi et al., 2007; Yin et al., 2021). Specifically, they iteratively adjust so as to minimize Formula 1, which incurs very high cost. We adopt cost embedding (Kim and Han, 2009), which pushes the cost formula down from the clustering result level to the data level, thus enabling flexible and efficient temporal smoothing. However, the existing cost embedding technique (Kim and Han, 2009) targets dynamic networks only. To apply cost embedding to trajectories, we propose a minimal group structure as well as snapshot and historical cost functions.
Snapshot cost
We first define the notion of an "adjustment" of a trajectory.
Definition 11 ().
An adjustment is a location of a trajectory obtained through smoothing at . Here, if . The set of adjustments in is denoted as .
We simplify to if the context is clear. In Figure 1, is an adjustment of at . According to Formula 1, the snapshot cost measures how similar the current clustering result is to the original clustering result . Since we adopt cost embedding that smooths trajectories at the data level, the snapshot cost of a trajectory w.r.t. its adjustment at (denoted as ) is formulated as the deviation between and at :
(2) 
where is a speed constraint of the road network. Formula 2 requires that any adjustment must follow the speed constraint. Obviously, the larger the distance between and its adjustment , the higher the snapshot cost.
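A sketch of this snapshot cost is given below, under assumptions: the cost is the Euclidean deviation between the original location and its adjustment, and an adjustment is admissible only if it respects the road-network speed constraint relative to the previous location. All names (`snapshot_cost`, `orig`, `adj`, `prev`, `v_max`, `dt`) are ours, not the paper's.

```python
import math


def snapshot_cost(orig, adj, prev, v_max, dt):
    """Illustrative snapshot cost of an adjustment (cf. Formula 2).

    orig, adj, prev: (x, y) points; v_max: speed constraint; dt: time
    step duration. An adjustment violating the speed constraint gets
    infinite cost; otherwise the cost grows with the deviation from
    the original location.
    """
    if math.dist(adj, prev) > v_max * dt:
        return math.inf  # adjustment would violate the speed constraint
    return math.dist(adj, orig)
```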
Historical cost
As discussed in Section 2.3, one of the goals of evolutionary clustering is to smooth the change of clustering results across adjacent time steps. Since we push the smoothing down from the cluster level to the trajectory level, the problem becomes one of ensuring that each trajectory represents a smooth movement. According to the first observation in Section 3.1, gradual location changes lead to stable co-movement relationships among trajectories during short periods of time. Thus, similar to neighbor-based smoothing in dynamic communities (Kim and Han, 2009), it is reasonable to smooth the location of each trajectory at the current time step using its neighbors at the previous time step. However, the previous study (Kim and Han, 2009) smooths the distance between each pair of neighboring nodes. Simply applying this to trajectories may degrade the performance of smoothing if a "border" point is involved. Recall the second observation of Section 3.1 and assume that is smoothed according to at in Figures 1 and 2. As is a border point at with a higher probability of leaving the cluster at , using to smooth may result in also leaving or being located at the border of at . The first case may incur an abrupt change to the clustering, while the second case may degrade the intra-density of and increase the inter-density of clusters in . To tackle this problem, we model neighboring trajectories as minimal groups summarized by seed points.
Definition 12 ().
A seed point summarizes a minimal group at , where is a given parameter, and is the seed point set at . The cardinality of , , exceeds a parameter . Any trajectory o in that is different from is called a non-seed point. Note that if .
Given the current time step , we use to denote the seed point of at (i.e., ), and to denote the one at (i.e., ).
Example 3 ().
We propose to use the location of a seed point to smooth the location of a non-seed point at . To guarantee the effectiveness of smoothing, Definition 12 imposes two constraints when generating minimal groups: (i) and (ii) . By setting to a small value, the first constraint ensures that are close neighbors at . Specifically, we require because this makes it very likely that trajectories in the same minimal group are in the same cluster. The second constraint avoids small neighbor sets . Specifically, using an "uncertain border" point as a "pivot" to smooth the movement of other trajectories may lead to an abrupt change between clusterings or a low-quality clustering (according to the quality metrics of traditional clustering). We present the algorithm for generating minimal groups in Section 5.2.
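The two constraints of Definition 12 can be illustrated with a greedy sketch: pick a seed, gather the trajectories within a tight radius, and keep the group only if it is large enough. This is our assumed procedure for illustration; the paper's actual algorithm is deferred to Section 5.2 and may differ. The names `minimal_groups`, `eps_small`, and `min_size` are hypothetical.

```python
import math


def minimal_groups(locations, eps_small, min_size):
    """Greedy sketch of minimal-group formation (cf. Definition 12).

    locations: dict trajectory id -> (x, y) at the current time step.
    Returns dict seed id -> set of member ids. A candidate group is
    kept only if it has at least min_size members, so an isolated
    ("uncertain border") point never becomes a seed.
    """
    unassigned = set(locations)
    groups = {}
    for seed in sorted(locations):
        if seed not in unassigned:
            continue
        members = {t for t in unassigned
                   if math.dist(locations[t], locations[seed]) <= eps_small}
        if len(members) >= min_size:
            groups[seed] = members
            unassigned -= members
    return groups
```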
Based on the above analysis, we formalize the historical cost of w.r.t. its adjustment at , denoted as , as follows.
(3)  
where . Given the threshold , the larger the distance between and , the higher the historical cost. Here, we use the degree of closeness (i.e., ) instead of to evaluate the historical cost, for two reasons. First, constraining the exact relative distance between any two trajectories during a time interval may be too restrictive, as it varies over time in most cases. Second, using the degree of closeness to constrain the historical cost is sufficient to obtain a smooth evolution of clusterings.
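As a rough illustrative model of this degree-of-closeness idea (our assumption, not the paper's exact formula), the sketch below penalizes only the excess distance of an adjustment from its previous seed point beyond a closeness threshold `delta`, rather than the exact relative distance.

```python
import math


def historical_cost(adj, seed_prev, delta):
    """Illustrative historical cost of an adjustment (cf. Formula 3).

    adj: adjusted (x, y) location; seed_prev: the previous seed
    point's location; delta: closeness threshold. Staying within
    delta of the previous seed incurs no cost; drifting further
    away is penalized linearly.
    """
    return max(0.0, math.dist(adj, seed_prev) - delta)
```

This captures the intent of the second reason above: only the degree of closeness is constrained, so small relative movements within the threshold are free.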
Total cost
Formulas 2 and 3 give the snapshot cost and the historical cost of each trajectory w.r.t. its adjustment , respectively. However, the former measures a distance, while the latter evaluates a degree of proximity. Thus, we normalize them to a common range :
(4) 
(5)  
where and is the duration of a time step. Clearly, and . Thus, we only need to prove and .
Lemma 1 ().
If then .
Proof.
According to our strategy of mapping original timestamps (Section 2.1), . Considering the speed constraint of the road network, . Further, due to the triangle inequality. Thus, we get . Since , . ∎
It follows from Lemma 1 that if . However, does not necessarily hold. To address this problem, we preprocess according to so that it follows the speed constraint before conducting evolutionary clustering. The details are given in Section 4.3.
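This preprocessing can be illustrated as follows, under an assumed form (the details are deferred to Section 4.3): if the reported movement exceeds what the speed constraint allows in one time step, we snap the current location back onto the segment toward the previous location at the maximum reachable distance. The function name and signature are ours.

```python
import math


def enforce_speed_constraint(prev, cur, v_max, dt):
    """Assumed preprocessing sketch: clamp `cur` so that the movement
    from `prev` respects the speed constraint v_max over duration dt.

    prev, cur: (x, y) points. Returns `cur` unchanged if it already
    satisfies the constraint, else the point on segment prev->cur at
    distance v_max * dt from prev.
    """
    d = math.dist(prev, cur)
    limit = v_max * dt
    if d <= limit:
        return cur
    t = limit / d  # fraction of the segment that is reachable
    return (prev[0] + t * (cur[0] - prev[0]),
            prev[1] + t * (cur[1] - prev[1]))
```

After this step, every location satisfies the premise of Lemma 1, so the normalized snapshot cost stays in the intended range.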
Lemma 2 ().
If then .
According to Lemma 2, we can derive and thus . Letting , the total cost is:
(6)  
where . Formula 6 indicates that we do not smooth the location of at if is not summarized in any minimal group at . This is in accordance with the basic idea that we conduct smoothing by exploring the neighboring trajectories. We can now formulate our problem.
Definition 13 ().
Given a snapshot , a set of previous minimal groups , a time duration , a speed constraint , and parameters , , , minPts and , evolutionary clustering of streaming trajectories (ECO) is to

find a set of adjustments , such that ;

compute a set of clusters over .
Specifically, each adjustment of is denoted as and is then used as the previous location of (i.e. ) at for evolutionary clustering.
Example 4 ().
Clearly, the objective function in Formula 6 is neither continuous nor differentiable. Thus, computing the optimal adjustments using existing solvers involves iterative processes (Song et al., 2015) that are too expensive for online scenarios. We therefore prove in Section 4 that Formula 6 can be solved approximately in linear time.
4. Computation of Adjustments
Given the current time step , we start by decomposing at the granularity of minimal groups as follows,
(7)  
where , , is the adjustment of at , is the seed point of at , and is the location of at . We omit the multiplier from Formula 6 because , , and are constants and do not affect the results.
4.1. Linear Time Solution
We show that Formula 7 can be solved approximately in linear time. However, Formula 7 uses each previous seed point for smoothing, and such points may also exhibit unusual behaviors from to . Moreover, may not be in . We address these problems in Section 4.2 by proposing a seed point shifting strategy, and we assume here that has already been smoothed, i.e., .
Lemma 3 ().
achieves the minimum value if each achieves the minimum value.
Proof.
To prove this, we only need to prove that and do not affect each other. This is easily established, as we require . We omit the details due to space limitations. ∎
Lemma 3 implies that Formula 7 can be solved by minimizing each (). Next, we further "push down" the cost shown in Formula 7 to each pair of and .
(8)  
Lemma 4 ().
achieves the minimum value if each
achieves the minimum value.
Proof.
The proof is straightforward, because and are independent of each other. ∎
According to Lemma 4, the problem is simplified to computing () given . However, Formula 8 is still intractable, as its objective function is not continuous. We thus aim to transform it into a continuous function. Before doing so, we cover the case where the computation of w.r.t. a trajectory can be skipped.
Lemma 5 ().
If then .
Proof.
Let be an adjustment of . Given , . On the other hand, as , the snapshot cost . Thus, if . ∎
A previous study (Kim and Han, 2009) smooths the distance between each pair of neighboring nodes regardless of their relative distances. In contrast, Lemma 5 suggests that if a non-seed point remains close to its previous seed point at the current time step, smoothing can be skipped. This avoids over-smoothing close trajectories. Following Example 4, .
Definition 14 ().
A circle is given by , where is the center and is the radius.
Definition 15 ().
A segment connecting two locations and is denoted as . The intersection of a circle and a segment is denoted as .
Figure 3 shows a circle that contains , , and . Further, .
Lemma 6 ().
.
Proof.
In Section 3.2, we constrain before smoothing, which implies that . Hence, . ∎
In Figure 4, given , .
Omitting the speed constraint
We first show that without utilizing the speed constraint, an optimal adjustment of that minimizes can be derived in constant time. Based on this, we explain how to compute based on .
Lemma 7 ().
.
Proof.
Let .
First, we prove that
.
Two cases are considered, i.e., (i) and (ii) . For the first case, we can always find an adjustment , such that . Hence, . However, we have due to . Thus, .
For the second case, it is clear that . Thus, .
Second, we prove that . We can always find , such that . Hence,
. However, in this case due to .
Thus, we have
.
∎