1. Introduction
Discovering moving objects that travel together, i.e., traveling companions, is an interesting problem in many real-world applications. For example, in mobile advertising it has been shown that consumers who appear at the same location at the same time generally exhibit commonalities in their tastes (Ghose et al., 2019). Thus, detecting groups of consumers who walk together and targeting them with coupons in the same product category might increase the advertising response. Other applications that involve finding traveling companions include intelligent transportation systems, animal migration monitoring and public procession management.
Traveling companions can be discovered from the trajectories of moving objects. An object's trajectory is a sequence of its location points sorted chronologically. Traveling companions are thus moving objects whose trajectories are both spatially and temporally close. Existing methods for mining traveling companions can be broadly divided into two categories: methods based on pattern mining (Jeung et al., 2008; Li et al., 2010; Zheng et al., 2013) and methods based on representation learning (Yao et al., 2017; Li et al., 2018; Yao et al., 2019). Pattern-mining-based methods typically first define a particular movement pattern pertaining to traveling companions based on some similarity measure over their trajectories, and then develop specific algorithms to extract the predefined patterns. The similarity measures are often based on point-wise Euclidean distance and thereby require trajectories to be aligned by timestamp. However, real trajectories often contain many missing location points and have to be interpolated, which introduces additional measurement errors. Representation-learning-based methods, on the other hand, do not rely on pattern definitions or point-wise comparisons. Instead, they learn representations of the trajectories using machine/deep learning models and then cluster the representations to discover similar ones, which in turn represent similar trajectories. Existing models either require a certain amount of feature engineering, or specific labels for supervised learning. In practice, feature engineering is often problem-dependent and time-consuming, which adds extra overhead to representation learning itself, while trajectory labels are usually unavailable because they are difficult to collect and often raise ethical issues. Moreover, existing models focus mainly on learning spatial proximity between trajectories, so they can only extract objects with similar spatial features, which are not necessarily traveling companions.
To tackle the aforementioned issues, we develop in this paper an unsupervised model using autoencoders, which learns the representations directly from original trajectories. The original trajectories are not necessarily aligned along timestamp, and can be of various lengths, i.e., having different numbers of location points. The learned representations contain both spatial and temporal features of the original trajectories, and thereby can be clustered to discover traveling companions.
The model is inspired by work on text summarization (Chu and Liu, 2019), where paragraphs of similar topics are grouped and summarized. We thus propose to first group the original trajectories using the Sort-Tile-Recursive (STR) algorithm (Leutenegger et al., 1997). STR was originally designed to group spatial data for the bulk-loading construction of R-trees. We use it to group the trajectories so that trajectories within the same group already have a certain spatial and temporal proximity. Then we feed each group of trajectories independently into an attention-based encoder. The idea behind this is to encourage closer trajectories to learn more similar representations from each other. We then use an encoder-decoder structure to reconstruct the input trajectories, and a decoder-encoder structure to learn the similarity between the input trajectories, as we will show in Section 3. The similarity is computed between the mean encodings of the first encoder and the intermediate encodings of the second encoder. As we use global attentions to produce the encodings and a mean operation to aggregate a group of encodings, we call our model ATTNMEAN Autoencoder. Once we obtain the trajectory encodings using the trained model, we use DBSCAN (as commonly used in the literature (Yao et al., 2017; Li et al., 2018)) to cluster the encodings. The trajectories in the same cluster are then considered traveling companions.

2. Related Work
The problem of finding moving objects that travel together has been extensively studied over the past decade using data mining algorithms. Some representative studies include Flock (Al-Naymat et al., 2007), Convoy (Jeung et al., 2008), Swarm (Li et al., 2010) and Gathering (Zheng et al., 2013). The authors in (Jeung et al., 2008) define the convoy to describe a generic pattern of traveling companions of any shape. A convoy is a group of at least m moving objects that are density-connected with respect to a distance e during at least k consecutive time points. A simple algorithm for discovering convoys is to run a density-connected clustering algorithm at each time point and maintain convoy candidates that have at least k clusters during consecutive time points. Then, for each candidate, an intersection of its clusters is computed to test whether at least m objects are shared by all the clusters. To overcome the high computational complexity, the authors propose to first extract candidate convoys from simplified trajectories and then decide whether each candidate is indeed qualified in a refinement step. The swarm pattern (Li et al., 2010) further relaxes the constraints of convoys by defining traveling companions as moving objects that move within density-connected clusters for at least k time points that are possibly non-consecutive. The goal of that work is to discover all closed swarms, such that neither the object set nor the number of time points of a swarm can be enlarged. As the number of candidate closed swarms is prohibitively large, the authors propose two pruning strategies to effectively reduce the search space.
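To make the convoy search concrete, the following is a simplified sketch of the candidate-maintenance step described above. It assumes the density-connected clusters at each time point have already been computed; the function name and its naive intersection strategy (without the authors' simplification and refinement steps) are our own illustration, not the original algorithm.

```python
# Hypothetical sketch: maintain convoy candidates across time points.
# clusters_per_t[t] is a list of density-connected clusters (sets of object
# ids) at time point t; a convoy needs >= m shared objects over >= k
# consecutive time points.

def find_convoys(clusters_per_t, m, k):
    """Return object sets that stay density-connected for >= k consecutive steps."""
    results = set()
    candidates = {}                       # frozenset of objects -> current run length
    for clusters in clusters_per_t:
        next_candidates = {}
        # seed fresh candidates from this step's clusters, plus extend old ones
        seeds = list(candidates.items()) + [(frozenset(c), 0) for c in clusters]
        for objs, length in seeds:
            for c in clusters:
                shared = frozenset(objs & c)
                if len(shared) >= m:      # still enough shared objects
                    next_candidates[shared] = max(next_candidates.get(shared, 0),
                                                  length + 1)
        candidates = next_candidates
        results.update(s for s, length in candidates.items() if length >= k)
    return results
```

This brute-force version enumerates all surviving intersections, which is exactly the combinatorial blow-up that motivates the filter-and-refine strategy in the original work.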
Recently, representation learning of trajectories has drawn much attention because it requires very little feature engineering and simplifies similarity computation, since using representations naturally avoids the problem of point matching. Studies most related to our work include trajectory clustering (Yao et al., 2017) and trajectory similarity computation (Li et al., 2018; Yao et al., 2019; Zhang et al., 2019) via representation learning. The authors in (Yao et al., 2017) use a seq2seq autoencoder to learn trajectory representations for clustering tasks. They first extract trajectory features such as speed and rate of turn and transform them into a feature sequence that describes the movement pattern of the corresponding object. Then, they feed the feature sequences into their model to learn fixed-length representations. Later, the work in (Li et al., 2018) proposes an RNN-based model to learn trajectory representations for similarity computation. The model does not require feature extraction and learns directly from original trajectories represented by trajectory tokens. However, the model training is supervised and requires constructing training pairs by sampling from the original trajectories. Both studies in (Yao et al., 2017; Li et al., 2018) focus on finding trajectories with similar shapes, regardless of their actual timestamps. Other studies like (Zhang et al., 2019) propose to inject additional semantic information such as environmental constraints and trajectory activities into deep models, in order to obtain more accurate trajectory representations for similarity computation. Nonetheless, none of the existing models is designed to simultaneously learn spatial and temporal proximity between trajectories. As such, they cannot be directly applied to discovering traveling companions that are both spatially and temporally close.

3. The Models
3.1. Representation of Trajectory Points
Firstly, we divide the entire space into square cells of the same size. Then, following the idea in (Li et al., 2018), we use the skip-gram technique (Mikolov et al., 2013) to pretrain the cell representations (or tokens) such that spatially close cells have similar representations. Each location point of a trajectory can thus be represented by the token of the cell the point falls into.
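The discretization step can be sketched as follows. This is a minimal illustration assuming a regular grid over a bounding box; the function names, the grid origin and the column count are our own hypothetical choices. The resulting cell ids play the role of "words" that a skip-gram model would later embed.

```python
# Hypothetical sketch: map location points to grid-cell tokens.

def point_to_cell(x, y, x_min, y_min, cell_size, n_cols):
    """Map a location (x, y) in meters to the id of the grid cell it falls into."""
    col = int((x - x_min) // cell_size)
    row = int((y - y_min) // cell_size)
    return row * n_cols + col

def trajectory_to_tokens(points, x_min, y_min, cell_size, n_cols):
    """Represent a trajectory (list of (x, y) points) as a sequence of cell tokens."""
    return [point_to_cell(x, y, x_min, y_min, cell_size, n_cols) for x, y in points]
```

The token sequences produced this way would then be fed to a standard skip-gram implementation to pretrain the cell embeddings.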
Secondly, to capture temporal characteristics, we propose to use the positional encoding technique (Gehring et al., 2017) to inject time information into the above trajectory tokens. Positional encodings have been used in self-attention models to capture information about token positions in a sequence. As a trajectory is an ordered sequence of tokens, we can use the same mechanism to model the sequence. We compute the positional encodings as in Equations 1 and 2:

PE_(t, 2i) = sin(t / 10000^(2i/d))    (1)

PE_(t, 2i+1) = cos(t / 10000^(2i/d))    (2)

where t is the timestamp within the entire time duration of the data, d indicates the embedding size of a cell, and i ∈ [0, d/2). Each positional encoding has the same size as the above token, and is therefore added directly to it.
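The sinusoidal encodings above can be sketched in a few lines of numpy. This is a generic illustration of the standard formulation (assuming an even embedding size d); the row for timestamp t is what gets added to the cell token embedding.

```python
import numpy as np

# Sketch of sinusoidal positional encodings: sin on even dimensions,
# cos on odd dimensions, with wavelengths 10000^(2i/d).

def positional_encoding(max_t, d):
    """Return an array of shape (max_t, d); row t encodes timestamp t."""
    t = np.arange(max_t)[:, None]                 # timestamps 0..max_t-1
    i = np.arange(d // 2)[None, :]                # dimension-pair index
    angles = t / np.power(10000.0, 2.0 * i / d)   # t / 10000^(2i/d)
    pe = np.zeros((max_t, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```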
3.2. ATTNMEAN Autoencoder
In order to encourage trajectories to learn more from their neighbours, we propose to first roughly divide them into groups w.r.t. their spatio-temporal proximity, and then feed each group independently into the autoencoder. We use the Sort-Tile-Recursive (STR) algorithm (Leutenegger et al., 1997) to group the trajectories. The STR algorithm was originally designed to pack spatially-close objects into minimum bounding rectangles (MBRs) when constructing a bulk-loading R-tree index of the objects. Since each trajectory can be considered as an object in a three-dimensional space (t, x, y), where t is the time dimension and (x, y) are the two spatial dimensions, we apply the idea of STR to pack trajectories. Trajectories in the same MBR (group) are more likely to be traveling companions, and are thus fed as an independent group into the model.
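The STR-style packing can be sketched as a sort-and-tile pass over the three dimensions. This is a simplified illustration, assuming each trajectory is summarized by a single (t, x, y) point (e.g., the centroid of its points); the slab sizing is the standard STR heuristic, not the paper's exact implementation.

```python
import math

# Hypothetical sketch of STR-style packing in (t, x, y): sort by t, tile
# into slabs; within each slab sort by x, tile again; finally sort by y
# and cut runs into groups of at most `capacity` trajectories.

def str_group(items, capacity):
    """items: list of (t, x, y, trajectory_id); returns list of id groups."""
    if not items:
        return []
    n_groups = math.ceil(len(items) / capacity)
    s = math.ceil(n_groups ** (1.0 / 3.0))        # slabs per sorted dimension
    groups = []
    items = sorted(items, key=lambda it: it[0])   # sort by t
    slab_t = math.ceil(len(items) / s)
    for i in range(0, len(items), slab_t):
        slab = sorted(items[i:i + slab_t], key=lambda it: it[1])   # sort by x
        slab_x = math.ceil(len(slab) / s)
        for j in range(0, len(slab), slab_x):
            run = sorted(slab[j:j + slab_x], key=lambda it: it[2])  # sort by y
            for k in range(0, len(run), capacity):
                groups.append([it[3] for it in run[k:k + capacity]])
    return groups
```

Each returned group corresponds to one MBR and is fed to the model as an independent batch of neighbouring trajectories.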
The entire model is constructed from two autoencoders, as depicted in Figure 1. The encoder and decoder on the left-hand side constitute the first autoencoder, which is an ordinary LSTM autoencoder used to reconstruct the trajectories. On the right-hand side, the encodings are first averaged using a mean operation and then fed into a decoder. The decoded intermediate trajectories are in turn input into an encoder to obtain the encoded intermediate mean trajectories. We then force the encodings output by the left encoder to learn from each other by computing the similarity between them and the encodings produced by the right encoder. Thus, the autoencoder on the right-hand side is in fact a decoder-encoder structure. The two encoders and the two decoders have the same structure and share the same parameters, respectively.
In the two LSTM encoders, we additionally use a global attention mechanism to aggregate the hidden states of each step to form the output encodings. The resulting encoding contains more overall information about the entire trajectory and boosts the model's performance in our experiments. In particular, we first initialize a global attention vector v to calculate an attention score for each trajectory token, as shown in Equation 3:

a_i = exp(v^T h_i) / Σ_{j=1}^{n} exp(v^T h_j)    (3)

where h_i is the hidden representation of point i, a_i is the attention weight of this representation, and n is the length of each trajectory. Then, we use these attention scores to compute a weighted sum of the hidden state vectors of each trajectory, denoted by r:

r = Σ_{i=1}^{n} a_i h_i    (4)

where r is the final representation of the trajectory.
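The global-attention aggregation can be sketched in numpy as follows. This is an illustration of the pooling step only, assuming H holds the LSTM hidden states of one trajectory (one row per point) and v is the learned global attention vector; in the actual model v is trained jointly with the encoder.

```python
import numpy as np

# Sketch of global-attention pooling: softmax scores over the points of a
# trajectory, then a weighted sum of the hidden states.

def attention_pool(H, v):
    """Return the attention-weighted sum of the rows of H (shape (n, d))."""
    scores = H @ v                                # one score per point
    scores = scores - scores.max()                # numerical stability
    a = np.exp(scores) / np.exp(scores).sum()     # softmax attention weights
    return a @ H                                  # weighted sum of hidden states
```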
For optimization, we compute the trajectory reconstruction loss for the left component using mean squared error (MSE), as shown in Equation 5:

L_rec = (1/B) Σ_{i=1}^{B} ||T_i − T̂_i||²    (5)

where B is the batch size, T_i denotes the i-th trajectory and T̂_i is the corresponding reconstructed trajectory. Then we compute the representation similarity loss between the trajectory encodings produced by the two encoders using the average cosine distance, as shown in Equation 6:

L_sim = (1/B) Σ_{i=1}^{B} (1 − (r_i · r̃_i) / (||r_i|| ||r̃_i||))    (6)

where r_i denotes the aggregated encoding of trajectory i generated by the encoder in the left component and r̃_i denotes the corresponding intermediate trajectory encoding generated by the encoder in the right component.
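The two objectives can be sketched in numpy as follows. This is a minimal illustration assuming the batch tensors are already padded to a common shape; T and T_hat are the original and reconstructed trajectories, and R and R_tilde the paired encodings from the left and right encoders.

```python
import numpy as np

# Sketch of the two training objectives: batch MSE reconstruction loss
# and average cosine distance between paired encodings.

def reconstruction_loss(T, T_hat):
    """Mean squared error over a batch of trajectories."""
    return np.mean((T - T_hat) ** 2)

def similarity_loss(R, R_tilde):
    """Average cosine distance between paired encodings (rows)."""
    num = np.sum(R * R_tilde, axis=1)
    den = np.linalg.norm(R, axis=1) * np.linalg.norm(R_tilde, axis=1)
    return np.mean(1.0 - num / den)
```

In training, the two terms are minimized jointly, so each encoding must both reconstruct its own trajectory and stay close to its group's mean encoding.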
By minimizing both the reconstruction loss and the similarity loss, we force the encodings produced by the encoder to keep the distinctive features of their own original trajectories while also learning similar features from their neighbours in the same group. As we use global attentions and a mean operation inside and outside the encoder, we call this model the ATTNMEAN Autoencoder.
4. Performance Evaluation
4.1. The Dataset and Overall Settings
We obtained a trajectory dataset of passengers through our collaboration with an airport in Asia. The dataset contains 14,605 trajectories with about 719,507 points, and each trajectory has 20 to 120 location points. We use this data to find passengers who walk together.
Using the dataset, we compare ATTNMEAN with an LSTM autoencoder (LSTMAE), Convoy (Jeung et al., 2008), Swarm (Li et al., 2010), T2VEC (Li et al., 2018) and BFEA (Yao et al., 2017). For the two pattern-mining-based methods, Convoy and Swarm, we compare the number of extracted clusters with that discovered using our learned representations. To comply with the two algorithms, we perform linear interpolation on the dataset so that, where necessary, a synthetic point is generated every 10 seconds for each trajectory. For T2VEC and BFEA, we also inject positional encodings into their original input for a fair comparison. We use DBSCAN to cluster the learned encodings and compare the clustering performance. We only list the main results due to the page limit. More experimental results are available upon request.
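The 10-second resampling can be sketched as straightforward linear interpolation. This is our own illustration of the preprocessing step, assuming a trajectory is a list of (timestamp, x, y) tuples sorted by timestamp.

```python
# Hypothetical sketch: resample a trajectory to one point every `step`
# seconds by linear interpolation between the surrounding recorded points.

def resample(traj, step=10):
    """Return points at multiples of `step` seconds from the first timestamp."""
    out = []
    t = traj[0][0]
    i = 0
    while t <= traj[-1][0]:
        while traj[i + 1][0] < t:        # advance to the segment containing t
            i += 1
        (t0, x0, y0), (t1, x1, y1) = traj[i], traj[i + 1]
        w = (t - t0) / (t1 - t0) if t1 > t0 else 0.0
        out.append((t, x0 + w * (x1 - x0), y0 + w * (y1 - y0)))
        t += step
    return out
```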
4.2. Parameter Settings and Training Details
Group size and batch size. We set the group size and batch size to 64 and 8, respectively. The group size is the MBR capacity used in the STR algorithm.
Cell Size. We divide the entire airport space into square cells of 5 meters on each side, and obtain the corresponding cells.
Embedding Size. We set the embedding size of the cell representations to be 256. Therefore the positional encoding for each timestamp also has 256 dimensions.
Training Details. We use Adam stochastic optimization for training, with fixed learning and weight decay rates. The training process is terminated when the trajectory reconstruction loss and the representation similarity loss both converge. We observe that all three models converge after 20 epochs.
4.3. Main Results
We employ three metrics to evaluate the clustering results, namely, the Davies-Bouldin Index, the Silhouette Coefficient and weighted average entropy. A smaller Davies-Bouldin Index and a larger Silhouette Coefficient generally indicate better clustering performance. To calculate the weighted average entropy of each cluster, we use the nearest gate to each trajectory's last point as its label. The idea behind this is that traveling companions are more likely to walk to the same gate. For all these measurements, we discard all clusters of size one, i.e., trajectories that are considered to be traveling alone. We vary the distance parameter ε of DBSCAN from 0 to 0.2 in fixed increments and plot the three metrics for each ε. The results are presented in Figures 2(a), 2(b) and 2(c), respectively.
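The weighted average entropy metric can be sketched as follows. This is our own illustration of the computation described above, assuming each trajectory already carries its nearest-gate label and that singleton clusters are discarded first.

```python
import math
from collections import Counter

# Hypothetical sketch: weighted average entropy of gate labels over clusters.
# Lower is better: a cluster whose members all head to the same gate
# contributes zero entropy.

def weighted_avg_entropy(clusters):
    """clusters: list of label lists; returns size-weighted average entropy."""
    clusters = [c for c in clusters if len(c) > 1]   # discard singletons
    total = sum(len(c) for c in clusters)
    score = 0.0
    for c in clusters:
        counts = Counter(c)
        h = -sum((n / len(c)) * math.log2(n / len(c)) for n in counts.values())
        score += (len(c) / total) * h
    return score
```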
We observe in Figures 2(a) and 2(b) that, in general, ATTNMEAN has a smaller Davies-Bouldin Index and a larger Silhouette Coefficient for varying ε, suggesting that ATTNMEAN performs better than LSTMAE, T2VEC and BFEA under both internal and external evaluation criteria. We also observe in Figure 2(c) that ATTNMEAN produces smaller weighted average entropy than LSTMAE, T2VEC and BFEA for varying ε. This means that the clusters produced by ATTNMEAN have fewer distinct gate labels, and hence the clustered trajectories are more likely to be traveling companions.
In addition, we compare the number of clusters with at least two trajectories found by our two models with that extracted by Convoy and Swarm. The results are presented in Table 1. For Convoy and Swarm, we set k to 18 (i.e., at least 3 minutes), m to 2, and vary e in {3, 5}. For example, the first row of the table means that a convoy should have at least two trajectories with distance less than 3 meters at no fewer than 18 consecutive time points, where two consecutive time points have a 10-second offset. For our two models, we simply show the results for which they discover the largest number of clusters.
Algorithm   Parameters        Number of Clusters   Single trajectories
Convoy      k=18, m=2, e=3    308                  4629
Convoy      k=18, m=2, e=5    814                  3875
Swarm       k=18, m=2, e=3    756                  4086
Swarm       k=18, m=2, e=5    2431                 2710
LSTMAE      -                 786                  895
ATTNMEAN    -                 727                  1112
We observe that even under these relaxed parameter settings, Convoy and Swarm generate many single trajectories that do not belong to any cluster. By contrast, our models can group most trajectories into clusters.
4.4. The Effect of Positional Encodings
We also conduct an ablation study to show the effect of positional encodings. We observe in Figure 3 that ATTNMEAN generally performs better than ATTNMEAN without positional encodings for varying ε on all three metrics. This demonstrates that injecting positional encodings helps our model find trajectories that are both spatially and temporally close. Without positional encodings, the model tends to find trajectories with similar shapes in different time periods.
5. Conclusion and Future Work
In this work, we propose ATTNMEAN, an unsupervised deep representation model for the discovery of traveling companions. We first employ positional encoding and skip-gram techniques to embed the trajectories; the resulting trajectory token representations collectively encode the spatial and temporal information of the location points. Then we use STR to group the original trajectories to encourage them to learn from their neighbours. A double-autoencoder architecture with global attentions and a mean operation is used to construct the model. Experimental results show that ATTNMEAN learns overall better trajectory representations than LSTMAE, T2VEC and BFEA for discovering traveling companions. In the future, we plan to explore other mechanisms to fuse the two autoencoders and further improve ATTNMEAN.
References
Al-Naymat et al. (2007). Dimensionality reduction for long duration and complex spatio-temporal queries. In ACM Symposium on Applied Computing, pp. 393-397.

Chu and Liu (2019). MeanSum: a neural model for unsupervised multi-document abstractive summarization. In International Conference on Machine Learning, pp. 1223-1232.

Gehring et al. (2017). Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1243-1252.

Ghose et al. (2019). Mobile targeting using customer trajectory patterns. Management Science.

Jeung et al. (2008). Discovery of convoys in trajectory databases. Proceedings of the VLDB Endowment 1(1), pp. 1068-1080.

Leutenegger et al. (1997). STR: a simple and efficient algorithm for R-tree packing. In Proceedings of the 13th International Conference on Data Engineering, pp. 497-506.

Li et al. (2018). Deep representation learning for trajectory similarity computation. In 2018 IEEE 34th International Conference on Data Engineering (ICDE), pp. 617-628.

Li et al. (2010). Swarm: mining relaxed temporal moving object clusters. Proceedings of the VLDB Endowment 3(1-2), pp. 723-734.

Mikolov et al. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111-3119.

Yao et al. (2019). Computing trajectory similarity in linear time: a generic seed-guided neural metric learning approach. In 2019 IEEE International Conference on Data Engineering (ICDE), pp. 1358-1369.

Yao et al. (2017). Trajectory clustering via deep representation learning. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 3880-3887.

Zhang et al. (2019). Deep representation learning of activity trajectory similarity computation. In 2019 IEEE International Conference on Web Services (ICWS), pp. 312-319.

Zheng et al. (2013). Online discovery of gathering patterns over trajectories. IEEE TKDE 26(8), pp. 1974-1988.