Smartphones are equipped with multiple sensors that can record user behavior on the handset. With large-scale smartphone usage data, researchers are able to study human behavior in the real world. Since location is a crucial aspect of human behavior, mining mobile data to investigate human mobility has recently become a popular research topic.
Previous research in this field mainly focuses on discovering significant places or predicting transitions among them [do2014places], [baumann2018selecting], [mcinerney2013modelling]. However, this research neglects the data sampled at places where one stays for a relatively short time, for instance, in the middle of a transition. In contrast, we believe that these data are equally important for revealing human mobility patterns.
To characterize human mobility, we should recognize that the mobility data of even a single individual contain multiple patterns. For example, on workdays one typically goes to work or school during the day, while on weekends he/she may prefer to stay at home. As a result, mobility on weekends differs from mobility on workdays. Therefore, for each user, it is reasonable to describe the trajectory of each day and then discover the common mobility patterns shared by the trajectories of different days.
In our work, human mobility is recorded by the GPS module embedded in the smartphone. It should be emphasized that GPS data (longitudes and latitudes) are not evenly distributed in space, because one may stay longer at a significant place (e.g., home or workplace/school) than at a less significant place (e.g., a restaurant or a road). Thus, an appropriate description of human mobility treats the locations of an individual as a set of data points randomly distributed in space with different probabilities. Moreover, in practice, data collection may not be continuous, because the GPS module is sometimes turned off or not functioning. This gives rise to the issue of data sparsity. In general, human mobility GPS data have the following properties:
The data has latent structures.
The data is not evenly distributed in space.
The data is sparse and noisy.
These data characteristics prevent researchers from adopting some conventional methods. Therefore, in our work, we adopt a probabilistic approach to describe daily human mobility. Compared to the conventional methods, we believe our approach can extract more information from the original GPS data and reduce the impact of data sparsity. The approach presented in this paper aims at investigating this hypothesis and is structured into three steps, as shown in Fig. 1.
The first step of the method is to estimate the probability density of each day's trajectory. For this task, the Gaussian Mixture Model [reynolds2015gaussian] is a possible solution. However, the standard Gaussian Mixture Model requires the number of components to be set in advance, which is impractical because trajectory data can be statistically heterogeneous and a single fixed number of components is not appropriate for every daily trajectory. To handle this problem, we adopt the Infinite Gaussian Mixture Model [rasmussen2000infinite], in which a Dirichlet process prior is placed on the mixing weights of the components.
To measure the difference between mobility probability densities, an estimator of the Kullback-Leibler (KL) divergence [kullback1997information] is used. The KL divergence is an asymmetric measure, which means that the divergence from distribution p to distribution q is not the same as the divergence from distribution q to distribution p unless the two distributions are identical. We exploit this asymmetry to reveal the subordinate relationship of one trajectory to another.
Finally, we devise a clustering algorithm using the Infinite Gaussian Mixture Model with the Kullback-Leibler divergence to discover the mobility patterns existing in human mobility data. More importantly, compared to traditional methods, our clustering algorithm is automatic because it does not require the number of patterns to be set in advance.
The work presented in this paper has three main contributions:
For estimating the probability density of daily mobility, we illustrate that the Infinite Gaussian Mixture Model outperforms the Gaussian Mixture Model.
We show that the Kullback-Leibler divergence is an appropriate measure of the closeness of mobility probability densities.
We develop a clustering algorithm based on the Infinite Gaussian Mixture Model and the Kullback-Leibler divergence to find human mobility patterns.
The remainder of the paper is organized as follows. Section II surveys the related work. Section III formulates the problem tackled in this paper. Section IV describes the proposed method. Section V presents the experiments conducted on real user data and their results. Finally, Section VI concludes the paper and discusses future work.
II Related Work
In the literature, previous research on human mobility with mobile data, such as [lu2013approaching], [ye2012situation], [lin2014mining], [pirozmand2014human], [zheng2015trajectory], [cao2007discovery], has mainly focused on tasks such as extracting significant places, predicting the next place to be visited, predicting visit duration, or clustering trajectories.
A widespread topic is the prediction of human mobility from contextual information on smartphone usage, e.g., temporal information, application usage, call logs, WiFi status, and Cell ID. In [baumann2018selecting] and [do2014and], for instance, the researchers applied various machine learning techniques to prediction tasks such as next-time-slot location prediction and next-place prediction. In particular, they explored how different combinations of contextual features related to smartphone usage affect prediction accuracy. They also compared the predictive performance of individual models and generic models.
Another frequently used approach for such tasks is probabilistic modelling. By calculating the conditional probabilities between contextual features, [do2012contextual] developed contextual conditional models for next-place prediction and visit duration prediction. In [do2015probabilistic] and [peddemors2010predicting], the researchers presented probabilistic prediction frameworks based on kernel density estimation. [do2015probabilistic] utilized conditional kernel density estimation to predict mobility events, while [peddemors2010predicting] devised different kernels for different types of context information. In [mcinerney2013modelling], the authors developed a location Hierarchical Dirichlet Process (HDP) based approach to model heterogeneous location habits under data sparsity.
In addition, generative models can also be applied to predict human mobility. Such models include Naïve Bayes [muhlenbrock2004learning], [yu2017modeling], the Hidden Markov Model (HMM) [cho2016exploiting], and the Dynamic Bayesian Network (DBN) [etter2013go], [patterson2003inferring]. These generative models attempt to predict the future states of human behavior by computing state transition probabilities. However, as the number of states grows, the computational cost grows exponentially.
Among the other possible approaches, [zheng2009mining] proposed a Hypertext Induced Topic Search-based inference model for mining interesting locations and travel sequences from a large GPS dataset in a certain region. [do2014places] and [scellato2011nextplace] made use of nonlinear time series analysis of arrival times and residence times for location prediction.
For clustering user trajectories in particular, several methods exist. However, these conventional algorithms are not applicable to our objectives. For example, some researchers used K-means [jiang2012clustering], [ashbrook2003using], but K-means cannot handle trajectories with complex shapes or noisy data because it is based on Euclidean distance. Besides, it also requires prior knowledge of the number of clusters, which is not available in many cases.
DBSCAN [tang2015uncovering], [yu2017modeling], a density-based clustering technique, can deal with data of arbitrary shape and does not require the number of clusters in advance. However, it still requires a minimum number of points and a neighbourhood radius to be set in order to recognize core areas, and it treats non-core data points as noise. From our study, we argue that the trajectory parts with lower data density are also essential for describing human mobility trajectories. Finally, the grid searching algorithm [do2012contextual] focuses on detecting stay points within a set of square regions and fails to reveal mobility at a larger scale.
In this paper, we adopt a probabilistic point of view. In contrast to the aforementioned works, we aim at describing the daily trajectories by their probability densities. Moreover, to discover the common mobility patterns shared among these trajectories, we devise an automatic clustering algorithm. As opposed to traditional clustering algorithms, our method is able to exploit more information from the sparse and noisy original GPS data and does not require the number of clusters to be defined in advance.
III Problem Formulation
As stated in the introduction, our purpose is to discover the mobility patterns of each individual from his or her GPS location data.
As shown in Fig. 2, the mobility of one individual consists of many different trajectories (the data is from the MDC dataset; a detailed description is given in the experiments). We believe that one's daily mobility is rather regular and that common mobility patterns are shared among different daily trajectories. Intuitively, one tends to follow regular daily itineraries, for instance, home, workplace/school, home. Yet the daily itineraries may differ from day to day; for instance, on the way home, one may sometimes take a detour to shop in a supermarket. Hence, our objective is to discover all the potential daily mobility patterns from the location data.
We extract each day's trajectory from the whole dataset, as shown in Fig. 3. Fig. 3 reveals that a daily trajectory recorded by GPS is not evenly distributed in space and is even discontinuous in some areas. This may be caused by the data collecting procedure: on some days the collecting time range is relatively short (less than 24 hours, and sometimes only a few hours), which leads to the data sparsity problem.
In order to overcome this problem and exploit as much information as possible from the GPS data, we argue that a reasonable way to describe the daily trajectories is to estimate the probability density of the location data. The relationship among the trajectories can then be represented through their probability densities. As a result, we can discover all the mobility patterns of each user.
The tasks in this paper will be as follows:
Task 1: Estimate the probability density for mobility for each day. We will compare results of GMM and IGMM.
Task 2: Measure the closeness between different trajectories. We will use the KL divergence as the measure.
Task 3: Discover the similar mobility patterns among all the recorded daily trajectories. This can be regarded as a clustering problem.
Task 4: Compare the IGMM algorithm with the GMM based algorithms.
Task 5: Identify the minimum data length for discovering all mobility patterns.
IV Proposed Method
IV-A Estimate Daily Trajectories Probability Density
We assume that the GPS location data points are randomly distributed in space. Besides, the distribution of each day consists of an unknown number of heterogeneous sub-distributions. Therefore, it is reasonable to adopt Gaussian mixture models for estimating the probability density of daily mobility.
IV-A1 Gaussian Mixture Model
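A Gaussian Mixture Model (GMM) is composed of a fixed number $K$ of components. The probability distribution of a GMM can be described as follows:
$$p(x \mid \theta_1, \dots, \theta_K) = \sum_{k=1}^{K} \pi_k \, p(x \mid \theta_k) \tag{1}$$
where $x$ is the observable variable, $\pi_k$ is the assignment probability for each component, with $\sum_{k=1}^{K} \pi_k = 1$, and $\theta_k$ denotes the internal parameters of the base distribution.
Let $z_n$ be the latent variable indicating the category of sample $x_n$:
$$z_n \sim \mathrm{Categorical}(\pi_1, \dots, \pi_K) \tag{2}$$
where $z_n = [z_{n1}, \dots, z_{nK}]$, in which only one element $z_{nk} = 1$. It means that $x_n$ corresponds to $\theta_k$.
If the base distribution is a Gaussian, then:
$$p(x \mid \theta_k) = \mathcal{N}(x \mid \mu_k, \Lambda_k^{-1}) \tag{3}$$
where $\mu_k$ is the mean vector and $\Lambda_k$ is the precision matrix.
Therefore, an observable sample $x_n$ is drawn from the GMM according to:
$$x_n \mid z_n \sim \prod_{k=1}^{K} \mathcal{N}(x_n \mid \mu_k, \Lambda_k^{-1})^{z_{nk}} \tag{4}$$
As illustrated above, one crucial issue of the GMM is that the number of components $K$ must be defined in advance. This is difficult because the probability distribution of each day's mobility is not identical. Thus, a single fixed $K$ for all mobility GMMs is not suitable in our case.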
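As a minimal illustration of the mixture density described above, the following sketch evaluates a two-dimensional GMM at a location. It assumes diagonal covariances for brevity (the paper uses full precision matrices), and the weights, means and variances are hypothetical values chosen only for the example.

```python
import math

def gaussian_pdf_2d(x, mean, var):
    """Density of a 2-D Gaussian with diagonal covariance (an assumed simplification)."""
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    norm = 2.0 * math.pi * math.sqrt(var[0] * var[1])
    return math.exp(-0.5 * (dx * dx / var[0] + dy * dy / var[1])) / norm

def gmm_pdf(x, weights, means, variances):
    """Weighted sum of the K component densities; the weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(w * gaussian_pdf_2d(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Hypothetical day: a dense blob (long stay) and a diffuse blob (short visits).
weights = [0.7, 0.3]
means = [(0.0, 0.0), (3.0, 1.0)]
variances = [(0.1, 0.1), (1.0, 0.5)]
density_at_stay = gmm_pdf((0.0, 0.0), weights, means, variances)
density_far_away = gmm_pdf((10.0, 10.0), weights, means, variances)
```

The number of components is fixed by the length of `weights`, which is exactly the quantity that has to be chosen in advance for every day.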
IV-A2 Infinite Gaussian Mixture Model
Alternatively, we resort to the Infinite Gaussian Mixture Model (IGMM) [rasmussen2000infinite]. In contrast to the finite Gaussian Mixture Model, the IGMM uses a Dirichlet process (DP) prior and therefore does not need the number of components to be specified in advance. Fig. 4 presents the graphical structure of the Infinite Gaussian Mixture Model.
In Fig. 4, the nodes represent random variables: the shaded node is observable and the unshaded nodes are unobservable. The edges represent the conditional dependencies between variables. Variables inside a plate are drawn repeatedly.
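According to Fig. 4, the Dirichlet process can be depicted as:
$$G \sim \mathrm{DP}(\alpha, G_0), \qquad \theta_n \mid G \sim G, \qquad x_n \mid \theta_n \sim p(x_n \mid \theta_n), \qquad n = 1, \dots, N \tag{5}$$
where $G$ is a random measure with infinitely many atoms drawn from the base measure $G_0$, and $\lambda$ is the hyper-parameter of $G_0$. In our case, it is a series of Gaussian distributions. $\alpha$ is the concentration parameter, $N$ is the total number of samples, $\theta_n$ denotes the parameters of the base distribution, $x_n$ is the observable data generated from $\theta_n$, and $z_n$ is the latent variable that indicates the category of $x_n$.
Alternatively, $G$ can be written explicitly as follows:
$$G = \sum_{k=1}^{\infty} \pi_k \, \delta_{\theta_k} \tag{6}$$
where $\theta_k \sim G_0$, $\sum_{k=1}^{\infty} \pi_k = 1$, and $\delta$ is the Dirac function. The weights $\pi_k$ determine the proportions of the clusters, and $G_0$ is the prior of $\theta_k$, which determines the location of the clusters in space.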
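We choose the stick-breaking process (SBP) [sethuraman1994constructive] to implement the Dirichlet process as the prior for $\pi$. The stick-breaking process can be described as follows:
$$v_k \sim \mathrm{Beta}(1, \alpha), \qquad \pi_k = v_k \prod_{l=1}^{k-1} (1 - v_l) \tag{7}$$
Since $p(x \mid \theta_k)$ is Gaussian, $\theta_k = (\mu_k, \Lambda_k)$. Further, let $G_0$ be a Gaussian-Wishart distribution; then $(\mu_k, \Lambda_k) \sim G_0$. Therefore, similarly, an observable sample $x_n$ is drawn from the IGMM according to:
$$x_n \mid z_n \sim \prod_{k=1}^{\infty} \mathcal{N}(x_n \mid \mu_k, \Lambda_k^{-1})^{z_{nk}} \tag{8}$$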
Then, variational inference is used to fit the IGMM. Compared to Gibbs sampling or other Markov chain Monte Carlo (MCMC) methods, which consume a large amount of computing time, variational inference is relatively fast [blei2006variational]. The results will be demonstrated in the later experiments.
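The stick-breaking construction described above can be sketched in a few lines. This is only an illustration of how the mixing weights arise, not the variational inference procedure used in the paper; the truncation level and the value of the concentration parameter are assumptions made for the example.

```python
import random

def stick_breaking_weights(alpha, truncation, rng):
    """Draw truncated stick-breaking weights.

    Each v_k ~ Beta(1, alpha) breaks off a fraction of the remaining stick,
    so pi_k = v_k * prod_{l<k} (1 - v_l).
    """
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to a component
    for _ in range(truncation):
        v = rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights, remaining

rng = random.Random(0)
weights, leftover = stick_breaking_weights(alpha=1.0, truncation=20, rng=rng)
```

A small `alpha` tends to concentrate the mass on a few components, while a large `alpha` spreads it over many, which is how the model adapts the effective number of components to each day's data.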
IV-B Measure Daily Trajectories Similarities
The Kullback-Leibler (KL) divergence is a measure of the closeness between two distributions. For continuous variables, the KL divergence $D_{KL}(p \| q)$ is the expectation of the logarithmic difference between $p$ and $q$ with respect to the probability $p$, and vice versa:
$$D_{KL}(p \| q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx \tag{9}$$
$$D_{KL}(q \| p) = \int q(x) \log \frac{q(x)}{p(x)} \, dx \tag{10}$$
From (9) and (10), it can be seen that the KL divergence is non-negative and asymmetric. On many occasions, the asymmetry of the KL divergence is regarded as a drawback. In our methodology, on the contrary, we take advantage of this asymmetry to reveal the similarities among different trajectories, and therefore prefer the KL divergence to the Jensen-Shannon divergence, which is symmetric.
There is no closed form for the KL divergence defined in (9) and (10) when $p$ and $q$ are Gaussian Mixture Models. Instead, we resort to the Monte Carlo simulation method proposed in [hershey2007approximating]. The KL divergence can then be calculated by:
$$D_{MC}(p \| q) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{p(x_i)}{q(x_i)}, \qquad x_i \sim p \tag{11}$$
$$D_{MC}(q \| p) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{q(x_i)}{p(x_i)}, \qquad x_i \sim q \tag{12}$$
This method draws a large number of i.i.d. samples $x_i$ from the distribution $p$ to calculate $D_{MC}(p \| q)$ according to (11), and $D_{MC}(p \| q) \to D_{KL}(p \| q)$ as $n \to \infty$. The same holds for (10), which is implemented by (12). The results will be demonstrated in the later experiments. Furthermore, if we define a representative trajectory for a mobility pattern, we can decide whether a new trajectory belongs to this cluster by comparing it to the representative trajectory. To do so, we set a threshold with a lower bound and an upper bound for the KL divergence, which can then be used as the criterion for clustering mobility patterns.
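The Monte Carlo estimate described above can be sketched as follows. For brevity the sketch uses one-dimensional mixtures given as (weight, mean, standard deviation) triples, whereas the trajectories in the paper are two-dimensional; the two example densities and the sample size are hypothetical.

```python
import math
import random

def mixture_pdf(x, components):
    """Density of a 1-D Gaussian mixture given (weight, mean, std) triples."""
    return sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in components
    )

def sample_mixture(components, rng):
    """Pick a component according to its weight, then draw from its Gaussian."""
    weights = [w for w, _, _ in components]
    _, mean, std = rng.choices(components, weights=weights, k=1)[0]
    return rng.gauss(mean, std)

def kl_monte_carlo(p, q, n_samples, rng):
    """Estimate KL(p || q) as the mean of log p(x) - log q(x) over samples x ~ p."""
    total = 0.0
    for _ in range(n_samples):
        x = sample_mixture(p, rng)
        total += math.log(mixture_pdf(x, p)) - math.log(mixture_pdf(x, q))
    return total / n_samples

rng = random.Random(42)
wide = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]  # hypothetical: mass at two places
narrow = [(1.0, 0.0, 1.0)]                 # hypothetical: mass at one of them
kl_wide_narrow = kl_monte_carlo(wide, narrow, 20000, rng)
kl_narrow_wide = kl_monte_carlo(narrow, wide, 20000, rng)
```

Here `narrow` is covered by `wide`, so the divergence from `wide` to `narrow` comes out clearly larger than the reverse one, which is the kind of asymmetry the method relies on to detect that one trajectory is nearly a subset of another.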
IV-C Discover Mobility Patterns
As mentioned before, our task is to find the trajectories that are mutually similar. For this reason, we treat the different mobility patterns as different clusters whose members are the daily trajectories. Even so, unlike in conventional clustering methods, the trajectories within the same cluster cannot be treated as identically distributed, because the trajectory lengths differ. Hence, we need to devise an algorithm that clusters the trajectories based on the similarity of their distributions, with the aforementioned KL divergence as the closeness measure. Note that, due to the large data scale and the number of potential clusters, a highly accurate solution is sometimes intractable. Therefore, instead of pursuing a very accurate result, our purpose is to reach a relatively accurate result in a reasonable amount of computing time.
The first step of the clustering algorithm is to calculate the probability densities using the Infinite Gaussian Mixture Model. At this step, we create a list whose members are the probability densities of the daily trajectories. Then, the first cluster is created with one trajectory as its first member, which serves as the benchmark to be compared with the other trajectories.
Afterwards, we select another daily trajectory from the list and calculate the KL divergences in both directions, $D_{KL}(p \| q)$ and $D_{KL}(q \| p)$, between the benchmark and the new trajectory. The new trajectory is added to the current cluster if the minimum and the maximum of the two KL divergences are smaller than the lower bound and the upper bound of the threshold, respectively. Depending on which of the two divergences is smaller, the new trajectory may become the new benchmark for the current cluster. An alternative is to compute the probability density of the current cluster using all the data of the discovered trajectories; however, the computational cost would be massive.
This step is repeated until all the trajectories belonging to the current cluster have been discovered, which ends the iteration. Then, all the members of the current cluster are removed from the list, because we assume that each trajectory can be a member of only one mobility pattern. At the start of a new iteration, a new cluster is created, and the above steps are repeated until the list is empty. At that point, all the mobility patterns have been discovered.
|Number of data collecting days|
|Total GPS data (longitudes, latitudes)|
|Probability density for|
|Total mobility patterns|
|Discovered mobility pattern|
|Threshold for distinguishing patterns|
As can be seen, our algorithm is designed to discover the latent mobility patterns automatically, without prior knowledge of the number of existing patterns.
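The clustering procedure described in this section can be sketched as below. This is a simplified reading, not the authors' implementation: it makes a single pass over the remaining trajectories per cluster, and the direction of the benchmark update is an assumption (the new trajectory replaces the benchmark when the divergence from the benchmark to it is the smaller one). The helper `gaussian_kl` and the toy densities stand in for IGMM densities with a Monte Carlo KL estimate, and the thresholds are hypothetical.

```python
import math

def cluster_trajectories(densities, kl, lower, upper):
    """Greedy clustering sketch.

    densities: list of per-day density objects
    kl(p, q):  returns an estimate of KL(p || q)
    lower, upper: thresholds on the min and max of the two divergences
    """
    remaining = list(range(len(densities)))
    clusters = []
    while remaining:
        benchmark = remaining.pop(0)  # first unassigned day seeds a new cluster
        members = [benchmark]
        unassigned = []
        for idx in remaining:
            d_bq = kl(densities[benchmark], densities[idx])
            d_qb = kl(densities[idx], densities[benchmark])
            if min(d_bq, d_qb) < lower and max(d_bq, d_qb) < upper:
                members.append(idx)
                # Assumed rule: the candidate covers the benchmark, so it
                # becomes the new representative of the cluster.
                if d_bq < d_qb:
                    benchmark = idx
            else:
                unassigned.append(idx)
        clusters.append({"representative": benchmark, "members": members})
        remaining = unassigned
    return clusters

def gaussian_kl(p, q):
    """Closed-form KL between 1-D Gaussians given as (mean, std); toy stand-in."""
    (m1, s1), (m2, s2) = p, q
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2.0 * s2 ** 2) - 0.5

# Hypothetical daily densities: three days around 0 and two days around 10.
days = [(0.0, 1.0), (0.1, 1.2), (10.0, 1.0), (0.0, 0.9), (10.2, 1.1)]
clusters = cluster_trajectories(days, gaussian_kl, lower=0.5, upper=1.0)
```

In this toy run the five days fall into two clusters without the number of clusters being supplied, and under the assumed update rule the widest density in each cluster ends up as its representative.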
V Experiments and Results
V-A Dataset Description
We use the Mobile Data Challenge (MDC) dataset [kiukkonen2010towards], [laurila2012mobile] to validate our method. This dataset records comprehensive smartphone usage at a fine temporal granularity. The participants of the MDC dataset are up to nearly and the data collection campaign lasts more than months. This abundant information can thus be used to investigate individual mobility patterns.
For collecting individual location information, GPS-equipped smartphones are more practical than other means, such as stand-alone GPS devices, because they allow a larger group of participants to be recruited without affecting their daily life.
In our study, we attempt to find the trajectories that belong to the same mobility patterns. We therefore focus on the spatial information of the GPS records, namely the latitudes and longitudes, and do not consider the time-stamps of the data. Meanwhile, since we consider not only the significant places but all location records, we use the unlabeled data without any semantic information.
V-B Experimental Setup
In the conducted experiment, we randomly select users with sufficient data. Each user's data is segmented by day. Fig. 5 shows the number of data collecting days for each user. It can be seen that the data collecting days for most users are more than . With such an amount of data, we believe that it is possible to discover an individual's mobility patterns.
However, as illustrated in Fig. 6, the data length of each day varies from less than hours to hours. And most of them are less than hours. Hence, we should also be aware that some data can be missing because the GPS modules were turned off or were not functioning. This is one of the causes of the data sparsity problem. In the following part, we will show that our method can mitigate the impact of data sparsity.
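The per-day segmentation can be sketched as follows. The record layout (Unix time, latitude, longitude) and the use of UTC calendar days are assumptions made for the example; the actual MDC file format and the time zone used by the authors are not specified here, and local time would arguably be the more natural choice for daily routines.

```python
from collections import defaultdict
from datetime import datetime, timezone

def split_by_day(records):
    """Group (unix_time, latitude, longitude) records by UTC calendar day.

    Only the spatial coordinates are kept, since the time-stamps are used
    for segmentation and not for the density estimation itself.
    """
    days = defaultdict(list)
    for unix_time, lat, lon in records:
        day = datetime.fromtimestamp(unix_time, tz=timezone.utc).date()
        days[day].append((lat, lon))
    return dict(days)

# Hypothetical records: two fixes on one day and one on the next.
records = [
    (1262304000, 46.52, 6.63),  # 2010-01-01 00:00:00 UTC
    (1262307600, 46.53, 6.64),  # 2010-01-01 01:00:00 UTC
    (1262390400, 46.20, 6.14),  # 2010-01-02 00:00:00 UTC
]
daily = split_by_day(records)
```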
Table II summarizes the temporal information about the GPS data for conducting the experiments.
|Collecting days for all users||300.25 days||6005.0 days|
|Collecting hours per day for all users||6.93 hours||41595.0 hours|
|Collecting hours per day for each user||6.67 hours||2084.65 hours|
To test the performance of our method, we conduct the following experiments from different perspectives:
We compare the IGMM model with the GMM model on estimating the daily trajectories probability density.
We use the KL divergence to measure the closeness of different trajectories.
We test our method on each selected user data so as to find the daily mobility patterns for each individual.
We compare the results of the IGMM models to a series of fixed-number components GMM models.
We run the algorithm on datasets of varying length, with the aim of finding the minimum data length for discovering most mobility patterns of one individual.
V-C Experimental Results
V-C1 Probability Density Estimation
Fig. 7 and Fig. 8 show the density estimation results obtained by GMM and IGMM, respectively. It can be seen that, compared to the GMM, the result of the IGMM is smoother. This suggests that the IGMM is not tied to a preset number of components, infers more information from the original data, and is less influenced by data sparsity. That is to say, on the same dataset, the results of the IGMM have higher fidelity. Hence, in our approach, we chose the IGMM to estimate the probability density of daily mobility.
V-C2 Measuring Daily Trajectories Similarities
As shown in Fig. 9, we select daily trajectories from the data of one random user to present the KL divergences between different trajectories. Trajectory 1 is the baseline, and the other trajectories are compared against it.
|Trajectory 1||Trajectory 2||7.21||2.82|
|Trajectory 1||Trajectory 3||1.28||1.83|
|Trajectory 1||Trajectory 4||19.07||1269.47|
|Trajectory 1||Trajectory 5||3.08||996.17|
Let $p$ denote the density of Trajectory 1 and $q$ the density of the compared trajectory. Trajectory 2 is nearly a subset of Trajectory 1, and thus $D_{KL}(p \| q)$ is larger than $D_{KL}(q \| p)$. Both values are small, so Trajectory 2 and Trajectory 1 can be regarded as belonging to the same mobility pattern. Trajectory 3 is very similar to Trajectory 1, and $D_{KL}(p \| q)$ is almost equal to $D_{KL}(q \| p)$. Hence, they are also members of the same mobility pattern. Trajectory 4 shares a small part with Trajectory 1, but in general they are very different, and $D_{KL}(p \| q)$ and $D_{KL}(q \| p)$ are both very large. Therefore, it is reasonable to recognize Trajectory 4 and Trajectory 1 as different patterns. Trajectory 5 is totally different from Trajectory 1: $D_{KL}(p \| q)$ is small but $D_{KL}(q \| p)$ is very large, so they are not in the same pattern either. The trajectories in Fig. 9 and the results in Table III show that the KL divergence is able to capture the difference among trajectories and can serve as the criterion for clustering.
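To make the role of the two bounds concrete, the following sketch applies a min/max threshold test to the four pairs of values listed in the table above. The thresholds are hypothetical and chosen only for illustration; the paper's actual threshold values are not given in this section.

```python
# KL values copied from the table above: (compared trajectory, first value, second value).
pairs = [
    ("Trajectory 2", 7.21, 2.82),
    ("Trajectory 3", 1.28, 1.83),
    ("Trajectory 4", 19.07, 1269.47),
    ("Trajectory 5", 3.08, 996.17),
]

LOWER, UPPER = 5.0, 10.0  # hypothetical thresholds, not taken from the paper

def same_pattern(a, b, lower=LOWER, upper=UPPER):
    """Both directions must be small: min below `lower` and max below `upper`."""
    return min(a, b) < lower and max(a, b) < upper

decisions = {name: same_pattern(a, b) for name, a, b in pairs}
```

With these assumed bounds, Trajectories 2 and 3 are accepted and Trajectories 4 and 5 are rejected, matching the discussion above. Trajectory 5 shows why the upper bound matters: its smaller divergence alone would pass the lower bound.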
V-C3 Discovering Daily Mobility Patterns
We run our algorithm on the data of the users to discover their daily mobility patterns.
Discovered Patterns: Partial results for different randomly selected users are shown in Fig. 10. After clustering by our proposed algorithm, the data is split into different mobility patterns. Each cluster is composed of trajectories that are close to each other, even if they are not distributed with the same density in space. This shows that our methodology is able to find the different mobility patterns even with noisy data and discontinuous trajectories.
Fig. 10: Discovered mobility patterns from three randomly selected users. Different colors represent different days.
Fig. 11 shows that our methodology is not only able to identify the different patterns in the daily trajectories data but is also able to find the most representative trajectories for each mobility pattern.
Fig. 11: Representative trajectories for each discovered mobility pattern.
Number of Patterns and Trajectories: Fig. 12 shows the number of discovered mobility patterns for all the users in our experiments. We can see that the number of mobility patterns varies from to more than and most of them are about to . It can also be seen that the number of data collecting days is not proportional to the number of discovered mobility patterns, which indicates that the results rely more on individual behavior than on the data length.
Number of members for each pattern: Fig. 13 depicts the number of members of each discovered mobility pattern for all users. We can see that most mobility patterns consist of less than trajectories. And nearly of the patterns have only one trajectory, whereas few patterns have more than trajectories.
Note that the number of discovered patterns depends on the Kullback-Leibler divergence thresholds set in the clustering algorithm. When the thresholds are small, the condition for belonging to the same mobility pattern is stricter; naturally, more mobility patterns are discovered and each pattern has fewer members, and vice versa.
V-C4 Comparison to GMM
For comparison with the Infinite Gaussian Mixture Model, we use a group of Gaussian Mixture Models with different numbers of components to estimate the daily mobility probability densities in our proposed clustering algorithm. The metric we adopt to evaluate the results is the mean log-likelihood. The results in Table IV show that changing the fixed number of components of the Gaussian Mixture Models cannot enhance the clustering performance. On the contrary, the Infinite Gaussian Mixture Model can improve the clustering performance.
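The formula for the mean log-likelihood is not spelled out in this section. The usual definition, assumed here, for a fitted density $p$ and the $N$ GPS points $x_n$ it is evaluated on is:

```latex
\bar{\ell} = \frac{1}{N} \sum_{n=1}^{N} \log p(x_n)
```

A higher value means that the fitted density explains the observed points better.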
V-C5 Varying Data Length
To investigate how the data length, namely the number of days of data, affects the results, we use different data lengths, which vary from days to days. The results are shown in Fig. 14. It can be seen that, from -day data length to -day data length, the average number of discovered mobility patterns increases as the data length grows. However, when the data length is larger than days, the number of patterns changes only marginally. Thus, according to the results, we can say that, generally, a -day GPS dataset is large enough to discover most of the mobility patterns of an individual.
VI Conclusion and Perspective
In this work, we presented a probabilistic approach to discover human daily mobility patterns based on GPS data collected by smartphones.
In our approach, human daily mobility is considered as sets of probability distributions. The proposed approach is divided into three parts. The first step is to estimate the probability densities. We argue that the Infinite Gaussian Mixture Model is more appropriate for this task than the standard Gaussian Mixture Model, and this argument is supported by the experimental results. Further, in order to find similar trajectories, one needs to measure the closeness between trajectories. For this task, we chose the Kullback-Leibler divergence as the distance measure. According to the computational results from the selected trajectories, we validated, on test sets, that the KL divergence is able to measure the similarities among the trajectories. Finally, we devised a novel automatic clustering algorithm combining the advantages of both the IGMM and the KL divergence so as to discover human daily mobility patterns without knowing the number of clusters in advance.
For validation, we select random individual data from the MDC dataset to conduct the different experiments. The results show that our proposed approach can discern different mobility patterns and select the most representative trajectory for each mobility pattern from the GPS data. In addition, we compared the IGMM based algorithm with a group of GMM based algorithms with various fixed numbers of components; the results reveal that the IGMM performs better. Finally, testing our method on datasets of varying length suggests that a -day GPS is generally sufficient to discover most of the individual daily mobility patterns.
We are aware that human mobility is also a time-related behavior. Thus, as future work, we plan to take into account temporal information, for example, hour of day and day of week. Based on that, we will try to build a spatial-temporal probabilistic model to predict human mobility. In addition, for further study, we may exploit other smartphone usage information (e.g., application usage) in the dataset to obtain more knowledge about human behavior.
The research in this paper used the MDC Database made by Idiap Research Institute, Switzerland and owned by Nokia. The authors would like to thank the MDC team for providing the access to the database.