With the widespread use of LBSs (Location-based Services), synthesizing location traces plays an increasingly important role in analyzing spatial big data while protecting users' privacy. Although location synthesizers have been widely studied, existing synthesizers do not provide utility, privacy, or scalability sufficiently, hence are not practical for large-scale location traces. To overcome this issue, we propose a novel location synthesizer called PPMTF (Privacy-Preserving Multiple Tensor Factorization). We model various statistical features of the original traces by a transition-count tensor and a visit-count tensor. We simultaneously factorize these two tensors via multiple tensor factorization, and train factor matrices via posterior sampling. Then we synthesize traces from reconstructed tensors using the MH (Metropolis-Hastings) algorithm. We comprehensively evaluate the proposed method using two datasets. Our experimental results show that the proposed method preserves various statistical features, provides plausible deniability, and synthesizes large-scale location traces in practical time. The proposed method also significantly outperforms the state-of-the-art with regard to the utility, privacy, and scalability.READ FULL TEXT VIEW PDF
The prevalence of mobile devices and Location-Based Services (LBS)
The progress of location-based services has led to serious concerns on
Tensor factorization has been demonstrated as an efficient approach for
As network security monitoring grows more sophisticated, there is an
A point-of-interest (POI) recommendation system plays an important role ...
Given sparse multi-dimensional data (e.g., (user, movie, time; rating) f...
Accurately learning from user data while providing quantifiable privacy
LBSs (Location-based Services) have been used in a variety of applications such as POI (Point-of-Interest) search, route finding, games, and geo-social networking. Consequently, a large amount of location traces (time-series location trails) have been collected into the LBS provider. The LBS provider can provide these location traces (also called spatial big data [Shekhar_MobiDE12]) to a third party (or data analyst) to perform various types of geo-data analysis tasks; e.g., finding popular POIs [Zheng_WWW09], semantic annotation of POIs [Do_TMC13, Ye_KDD11], modeling human mobility patterns [Song_TMC06, Liu_CIKM13, Lichman_KDD14, Cranshaw_ICWSM12], road map inference [Liu_KDD12, Biagioni_TRB12].
Although such geo-data analysis is important for industry and society, there are some serious privacy issues. For example, users’ sensitive locations (e.g., homes, hospitals), profiles (e.g., age, profession) [Matsuo_IJCAI07, Zheng_LBSN09, Kawamoto_ESORICS19], or social relationship [Eagle_PNAS09, Bilogrevic_WPES13]
can be estimated from traces. Robbers or stalkers[please_rob_me, Voelcker_Spectrum06] might also exploit location data of victims. This kind of privacy is collectively called location privacy [privacy_for_LBS, Primault_CST19, Chatzikokolakis_FTPS17], and numerous studies have been made on this issue.
Synthesizing location traces [Bindschaedler_SP16, Shokri_PETS11, Chow_WPES09, Kato_SIGSPATIAL12, Suzuki_GIS10, You_MDM07, Kido_ICPS05, Herrmann_WPES13] is one of the most promising approaches to perform geo-data analysis while protecting users’ privacy. This approach first trains some generative model from the original traces (referred to as training traces). Then it generates synthetic traces (or dummy traces) using the trained generative model. The synthetic traces preserve some statistical features (e.g., population distribution, transition matrix) of the original traces, as these features are modeled by the generative model. Thus, based on the synthetic traces, a data analyst can perform a variety of geo-data analysis tasks explained above. In addition, the synthetic traces are (ideally) designed to provide little information about the original traces, and hence protect users’ privacy from a possibly malicious data analyst.
Synthetic traces can also be used for local obfuscation in LBSs. Specifically, a user (who does not trust even the LBS provider) can send synthetic traces as dummies along with her own trace to use an LBS (e.g., POI search, route finding). If the synthetic traces look to be actual traces, the adversary cannot find out which is the user’s actual trace among the traces sent to the LBS provider. Since this approach also sends the true locations (unlike adding noise, generalization, and deleting locations [Shokri_SP11]), it does not degrade the service quality of the LBS (at the cost of some communication overhead).
Ideally, a location synthesizer should satisfy the following three features: (i) high utility: it synthesizes traces that preserve various statistical features of the original traces; (ii) high privacy: it provides little information about the original traces; (iii) high scalability: it generates a large amount of traces within an acceptable time; e.g., within days or weeks at most. All of these features are vitally necessary for spatial big data analysis. They are also necessary for location obfuscation when the number of users is very large. Note that high utility is important in location obfuscation, because otherwise synthetic traces can be distinguished from real traces [Bindschaedler_SP16].
Although many location synthesizers [Bindschaedler_SP16, Shokri_PETS11, Chow_WPES09, Kato_SIGSPATIAL12, Suzuki_GIS10, You_MDM07, Kido_ICPS05, Herrmann_WPES13] have been studied, none of them are satisfactory in terms of the three features, which we explain in detail below.
Related work. Location privacy has been widely studied in the literature (see [Krumm_PUC09, privacy_for_LBS, Primault_CST19, Chatzikokolakis_FTPS17] for surveys) with numerous LPPMs (Location-Privacy Protection Mechanisms); e.g., adding noise (perturbation) [Andres_CCS13, Shokri_CCS12, Murakami_USENIX19], location generalization [Shokri_SP11, Gedik_TMC08, Chow_SIGKDD11], deleting locations [Shokri_SP11, Xue_ICDE13, Hoh_TMC10], and synthesizing traces (adding dummies) [Bindschaedler_SP16, Shokri_PETS11, Chow_WPES09, Kato_SIGSPATIAL12, Suzuki_GIS10, You_MDM07, Kido_ICPS05, Herrmann_WPES13]. Synthesizing traces is promising in terms of geo-data analysis and location obfuscation, as explained above.
Although Location synthesizers have been widely studied over a decade, Bindschaedler and Shokri [Bindschaedler_SP16] showed most of them do not satisfactorily preserve statistical features (especially, semantic features of human mobility, e.g., “many people spend night at home”), and do not provide high utility.
A synthetic location traces generator in [Bindschaedler_SP16] (denoted by SGLT) is a state-of-the-art location synthesizer. SGLT first trains semantic clusters, which group locations that are semantically similar (e.g., homes, malls) from training traces. Then it generates a synthetic trace from a training trace (called a seed trace) by replacing each location with all locations in the same cluster and then sampling a trace via the Viterbi algorithm. [Bindschaedler_SP16] showed that SGLT preserves semantic features explained above and hence provides high utility.
However, SGLT has an issue with scalability, which is essential for spatial big data analysis. Specifically, the running time of semantic clustering in SGLT is quadratic in the number of training users and cubic in the number of locations. Consequently, SGLT cannot be used for generating large-scale traces. For example, we show that when the numbers of users and locations are about and , respectively, SGLT would require over four years even using nodes of a supercomputer in parallel. In addition, SGLT uses plausible deniability using a semantic distance between two traces (introduced in [Bindschaedler_SP16]) as a privacy measure, and its relationship with DP (Differential Privacy) [Dwork_ICALP06, DP] is unclear.
Bindschaedler et al. [Bindschaedler_VLDB17] proposed a synthetic data generator (denoted by SGD) for any kind of data using a dependency graph. However, SGD was not applied to location traces, and its effectiveness for traces was unclear. We apply SGD to location traces, and show that it does not provide high utility.
Our contributions. We design a location synthesizer with high utility, privacy, and scalability. Specifically, we propose a novel approach called PPMTF (Privacy-Preserving Multiple Tensor Factorization). Our contributions are as follows:
We propose PPMTF for synthesizing traces. PPMTF models statistical features of training traces by two tensors: a transition-count tensor, which contains a transition matrix for each user, and visit-count tensor, which represents a time-dependent population distribution. PPMTF simultaneously factorizes the two tensors via MTF (Multiple Tensor Factorization) [Khan_ECMLPKDD14, Takeuchi_ICDM13] and trains factor matrices (parameters in our generative model) via posterior sampling [Wang_ICML15] to provide high utility and high privacy. To our knowledge, this work is the first to propose MTF in a privacy preserving way.
We propose a method to synthesize traces from the trained parameters using the MH (Metropolis-Hastings) algorithm [mlpp] to make the transition matrix consistent with the time-dependent population distribution.
We comprehensively show that the proposed method (denoted by Proposal) provides high utility, privacy, and scalability (for details, see below).
For utility, we show Proposal significantly outperforms the state-of-the-art in terms of the following statistical features.
(a) Time-dependent population distribution. The population distribution (i.e., distribution of visited locations) can be used for various geo-data analysis such as finding popular POIs [Zheng_WWW09] and semantic annotation of POIs [Do_TMC13, Ye_KDD11]. It can also be used to provide a user with the number of current visitors at a specific POI [Hu_SIGMOD12]. Some of these tasks are also based on the fact that the population distribution is inherently time-dependent; e.g., [Ye_KDD11] uses the population distribution for each hour as a feature; [Hu_SIGMOD12] considers a scenario where the number of current diners in a nearby restaurant is sent to the user.
(b) Transition matrix. The transition matrix is a main feature for modeling human movement patterns, and is used for predicting the next POI [Song_TMC06] or recommending POIs [Liu_CIKM13].
(c) Distribution of visit-fractions. For semantic annotation of POIs, a distribution of visit-fractions (or visit-counts) is a key feature. For example, [Do_TMC13] shows that many people spend about of the time at home and about of the time at work/school. [Ye_KDD11] shows that most users visit a hotel once or twice, while they may visit a restaurant for multiple times.
(d) Cluster-specific population distribution. At an individual level, a location distribution is different from user to user, and forms some clusters; e.g., those who commute by car, and those who often visit malls. The population distribution for such a cluster can be used for modeling human location patterns [Lichman_KDD14], road map inference [Liu_KDD12, Biagioni_TRB12], and smart cities [Cranshaw_ICWSM12].
We show that Proposal provides all of (a)-(d), whereas SGD does not consider user/cluster-specific features in a practical setting, and hence does not provide (c) nor (d). We emphasize that (a)-(d) are important also for location obfuscation, because synthetic traces that do not preserve the statistical features can be distinguished from real traces [Bindschaedler_SP16].
For a privacy measure, we use -PD (Plausible Deniability) in [Bindschaedler_VLDB17], which has a connection with DP. We show that Proposal provides -PD for reasonable and .
Regarding the scalability, for a larger number of users and a larger number of locations, Proposal’s time complexity is much smaller than SGLT’s complexity . [Bindschaedler_SP16] evaluated SGLT using training traces of only users. In this paper, we use the Foursquare dataset in [Yang_WWW19] (we use six cities; training users in total), and show Proposal generates the corresponding traces within hours (about times faster than SGLT), and can also deal with a million traces.
Note that Proposal does not outperform SGLT or SGD in all aspects; e.g., in our experiments, Proposal outperforms SGLT with respect to (a), but is worse than SGLT with respect to (b); SGD is more scalable than Proposal. However, our biggest advantage is that Proposal is the first to provide all of the utility ((a)-(d)), privacy, and scalability to our knowledge, which makes large-scale traces generation practical.
We also implemented Proposal with C++, and made it public [PPMTF]. The proposed method was also used as a part of the location synthesizer for anonymization competition [PWScup2019].
Let , , , and be the set of natural numbers, non-negative integers, real numbers, and non-negative real numbers, respectively. For , let . For a finite set , let be the set of all finite sequences of elements of . Let be the power set of , and
be the set of all probability distributions over.
We discretize locations by dividing the whole map into distinct regions or extracting frequently visited POIs (Point-of-Interests). Let be a finite set of discretized locations (i.e., regions or POIs), and be the -th location. We also discretize time into time instants (e.g., by partitioning it at a regular interval such as minutes or hour), and represent a time instant as a natural number. Let be a finite set of time instants under consideration.
In addition to the time instant, we introduce a time slot as a time resolution in geo-data analysis; e.g., if we want to compute the time-dependent population distribution for every hour, then the length of each time slot is one hour. We represent a time slot as a set of time instants. Formally, let be a finite set of time slots, and be the -th time slot. In Figure 1, , , , and . The time slot can be composed of either one time instant or multiple time instants (as in Figure 1). The time slot can also be composed of separated time instants; e.g., if we set the interval between two time instants to hour, and want to average the population distribution for every two hours over two days, then , and .
Next we formally define the notion of traces as follows. We refer to a pair of a location and a time instant as an event, and denote the set of all events by . Let be a finite set of all training users, and be the -th training user. Then we define each trace as a pair of a user and a finite sequence of events, and denote the set of all traces by . Each trace may be missing some events. Without loss of generality, we assume that each training user has provided a single training trace (if a user provides multiple temporally-separated traces, we can concatenate them into a single trace by regarding events between the traces as missing). Let be the finite set of all training traces, and be the -th training trace (i.e., training trace of ).
Figure 1 shows an example of training traces; e.g., , . We can also deal with a situation in which a user is in multiple regions in the same interval between two time instants. For example, assume that user visits at 7:00, at 7:10, and at 7:20. If we set the length of each time interval to minutes as in Figure 1, then the training trace of can be expressed (by rounding down minutes to a multiple of ) as follows: .
We train parameters of a generative model (e.g., semantic clusters in [Bindschaedler_SP16], MTF parameters in the proposed method) from training traces, and use the model to synthesize a trace. As with [Bindschaedler_SP16], we assume that, given a user ’s training trace , the trained generative model outputs a synthetic trace that resembles , while protecting the privacy of . We denote by a synthetic trace, and by a probabilistic generative model that, given a training trace , outputs a synthetic trace with probability .
Here we introduce DP (Differential Privacy) [Dwork_ICALP06, DP] and PD (Plausible Deniability) [Bindschaedler_SP16, Bindschaedler_VLDB17] as privacy measures.
Differential Privacy. We define the notion of neighboring data sets in the same way as [DP, Wang_ICML15, Liu_RecSys15] as follows. Let be two sets of training traces. We say and are neighboring if they differ from each other at most one trace and consist of the same number of traces, i.e., . For example, given a trace , in Figure 1 and are neighboring.
Then DP [Dwork_ICALP06, DP] is defined as follows:
Let . A randomized algorithm with domain provides -DP if for any neighboring and any ,
If , we abbreviate -DP as -DP.
Intuitively, -DP guarantees that an adversary who has observed the output of cannot determine, for any pair of and , whether it comes from or (i.e., a particular user’s trace is included in the training trace set) with a certain degree of confidence. As the privacy budget approaches to , and become almost equally likely, which means that a user’s privacy is strongly protected. -DP is a relaxation of -DP that allows for deviating from -DP with probability .
Plausible Deniability. The notion of PD was originally introduced by Bindschaedler and Shokri [Bindschaedler_SP16] to quantify how well a trace synthesized from a generative model provides privacy for the input training trace. However, PD in [Bindschaedler_SP16] was defined using a semantic distance between traces, and its relationship with DP was unclear. Later, Bindschaedler et al. [Bindschaedler_VLDB17] modified PD to clarify the relationship between PD and DP. In this paper, we use the definition of PD in [Bindschaedler_VLDB17]:
Let and . Let be a training user index. For a training trace set with , a synthetic trace output by a generative model with an input training trace is releasable with -PD if there exist at least distinct training user indexes such that for any ,
Note that [Bindschaedler_VLDB17] uses instead of ; i.e., -PD in Definition 2 is identical to -PD in [Bindschaedler_VLDB17], where .
Intuitively, -PD in Definition 2 guarantees that the input training trace is indistinguishable from at least other training traces out of other training traces. Note that -PD is totally different from -anonymity, as discussed in [Bindschaedler_SP16, Bindschaedler_VLDB17]. -anonymity is vulnerable to attribute inference [Machanavajjhala_ICDE06] or location inference [Shokri_SP11] by an adversary with some background knowledge, whereas -PD does not depend on the background knowledge [Bindschaedler_VLDB17].
PD is a privacy measure for an adversary who obtains synthetic traces, whereas the parameters of the generative model might also leak some information about the training trace set . However, it should be noted that this can be prevented by keeping secret the parameters of (or discarding the parameters after synthesizing traces). In fact, a location synthesizer in [Shokri_SP11] outputs only synthetic traces, and does not output the parameters of (i.e., semantic clusters) computed from . Similarly, we consider a location synthesizer that outputs only synthetic traces and keeps secret the parameters of (or discarding them after synthesizing traces).
We propose PPMTF (Privacy-Preserving Multiple Tensor Factorization) for synthesizing location traces. We first describe its overview (Section 3.1). We then explain the computation of two tensors from training traces (Section 3.2), the training of MTF parameters from the two tensors (Section 3.3), and the synthesis of traces from the MTF parameters (Section 3.4). We finally prove the DP (Differential Privacy) of PPMTF, and introduce the PD (Plausible Deniability) test (Section 3.5).
Proposed method. Figure 2 shows the overview of the proposed method. It is composed of the following five steps.
We compute a transition-count tensor and visit-count tensor from a training trace set (Sections 3.2).
The transition-count tensor is composed of the “User,” “Location,” and “Next Location” modes, and its ()-th element contains a transition-count of user from location to . In other words, this tensor represents the movement pattern of each training user in the form of transition-counts.
The visit-count tensor is composed of the “User,” “Location,” and “Time Slot” modes, and the ()-th element contains a visit-count of user at location in time slot . In other words, this tensor contains a histogram of visited locations for each user and each time slot.
We share the “User” and “Location” modes between the two tensors, and perform MTF (Multiple Tensor Factorization) [Khan_ECMLPKDD14, Takeuchi_ICDM13] for the two tensors. For a tensor factorization model, we use the CP decomposition [Cichocki_Wiley09] in the same way as [Khan_ECMLPKDD14, Takeuchi_ICDM13]. The CP decomposition factorizes a tensor into low-rank matrices called factor matrices along each mode. Since we share the “User” and “Location” modes, we factorize the two tensors into four factor matrices as in Figure 2 (Section 3.3).
Formally, let be the number of columns (factors) in each matrix, and , , , and be the factor matrices, which correspond to the “User,” “Location,” “Next Location,” and “Time Slot” mode, respectively. , , , and are the MTF parameters. Typically, the number of columns is much smaller than the numbers of users and locations; i.e., . In our experiments, we set as in [Murakami_TIFS16]. Let be the tuple of MTF parameters. We train MTF parameters from the two tensors via posterior sampling [Wang_ICML15]. The MTF parameters are used as parameters of our generative model.
Given a training trace , we compute a transition-probability matrix and visit-probability vector of the corresponding user for each time slot. We compute them from the MTF parameters via the MH (Metropolis-Hastings) algorithm [mlpp] (Section 3.4).
We generate a synthetic trace that resembles the training trace of user (Section 3.4).
We finally perform the PD test, which verifies whether is releasable with -PD (Section 3.5).
The main advantages of the proposed method are high utility, privacy, and scalability.
Utility. First, the proposed method achieves high utility by modeling statistical features of training traces using two tensors. Specifically, the transition-count tensor represents the movement pattern of each user in the form of transition-counts, whereas the visit-count tensor contains a histogram of visited locations for each user and time slot. Consequently, our synthetic traces preserve statistical features, e.g., a transition matrix, a time-dependent population distribution, and a distribution of visit-counts per location. These are key features for various geo-data analyses, e.g., modeling human movement patterns [Song_TMC06, Liu_CIKM13] or location patterns [Lichman_KDD14, Cranshaw_ICWSM12], finding popular POIs [Zheng_WWW09], and semantic annotation of POIs [Do_TMC13, Ye_KDD11].
Furthermore, the proposed method automatically finds users who have similar behavior (e.g., those who always stay in Manhattan, and those who often visit universities) and locations that are semantically similar (e.g., restaurants, parks), as factor matrices in tensor factorization represent clusters [Cichocki_Wiley09] (see [Murakami_TIFS16, Takeuchi_ICDM13] for examples of location data). Consequently, our synthetic traces preserve the mobility behavior of similar users and the semantics of similar locations. They are also useful for various geo-data analysis (e.g., modeling human location or movement patterns, road map inference [Liu_KDD12, Biagioni_TRB12], semantic annotation of POIs). In Section 4, we visualize how well similar users and similar locations are clustered.
The proposed method also addresses the sparseness of the tensors by sharing factor matrices and . In Appendix E, we show that the utility is improved by sharing and .
Privacy. Second, the proposed method achieves high privacy by training the MTF parameters via posterior sampling. As shown in [Wang_ICML15], posterior sampling-based Bayesian learning algorithms provide DP “for free” (without additional noise) by producing a sample from a posterior distribution given a dataset. We utilize this fact, and sample the MTF parameters from a posterior distribution given a training trace set .
A major issue in differentially private matrix/tensor factorization is that the privacy budget can be very large in practice. This follows from the fact that increases with increase in the number of observed elements per user, and with increase in the maximum value of elements. For example, Liu et al.[Liu_RecSys15] applied posterior sampling to matrix factorization, and reported that it needs to achieve high utility. It is important to note, however, that DP guarantees indistinguishability for all possible users including the one whose elements are completely different from normal users; e.g., a malicious user who provides a value completely opposite from what the model would predict for all elements, as pointed out in [Liu_RecSys15]. Obviously, it is not necessary for the input training trace to be indistinguishable from such anomaly traces.
Thus we use PD as a privacy measure in the same way as [Bindschaedler_SP16, Bindschaedler_VLDB17]. In other words, we guarantee that the input training trace is indistinguishable from some traces that would appear in practice. In our experiments, we show that the proposed method provides -PD for reasonable and .
Scalability. Finally, the proposed method achieves much higher scalability than the location synthesizer in [Bindschaedler_SP16]. Specifically, the time complexity of [Bindschaedler_SP16] (semantic clustering) is , which is very large for training traces with large and . On the other hand, the time complexity of the proposed method is , which is much smaller than the synthesizer in [Bindschaedler_SP16] (see Appendix B for details). In our experiments, we evaluate the running time of the proposed method and the synthesizer in [Bindschaedler_SP16], and show that our method can be applied to much larger-scale training datasets.
We now explain how to compute two tensors from a training trace set (i.e., step (i)) in detail.
The transition-count tensor contains a transition-count matrix for each user. Let be the transition-count tensor, and be its ()-th element. For example, in Figure 3, since two transitions from to are observed in the training trace of in Figure 1.
The visit-count tensor contains a histogram of visited locations for each user and each time slot. Let be the visit-count tensor, and be its ()-th element. For example, in Figure 3, as visits twice in (i.e., from time instant to ) in Figure 1.
Let . Typically, and are sparse; i.e., many elements are zeros In particular, can be extremely sparse for large , as its size is quadratic in .
For both tensors, we randomly delete positive elements of users who have provided much more positive elements than the average (i.e., outliers) in the same way as[Liu_RecSys15]. This is called trimming, and is effective for matrix completion [Keshavan_NIPS09]. It is also necessary to provide DP [Liu_RecSys15]. Similarly, we also set the maximum value of counts for each element, and truncate counts that exceed the maximum number.
Specifically, let be the maximum numbers of positive elements per user in and , respectively. Typically, and . For each user, if the number of positive elements in exceeds , then we randomly select elements from all positive elements, and delete the remaining positive elements. Similarly, we randomly delete extra positive elements in In addition, let be the maximum counts for each element in and , respectively. For each element, we truncate to if (resp. to if ).
In our experiments, we set (as in [Liu_RecSys15]), and , since the number of positive elements per user and the value of counts were less than and , respectively, in most cases.
After computing the tensors , we train the MTF parameters via posterior sampling (i.e., step (ii)). Below we describe our MTF model and the training.
Model. Let be the -th elements of , , , and , respectively. In addition, let and be the approximation of and by the proposed method, respectively, and and be their ()-th elements. Then and are given by:
where the matrices and are shared between and .
For MTF parameters , we assume a hierarchical Bayes model [Salakhutdinov_ICML08], since it outperforms the non-hierarchical one [Salakhutdinov_NIPS07] in terms of the model’s predictive accuracy. Figure 4 shows a graphical model of the proposed method. For the conditional distribution of the two tensors given the MTF parameters , we assume that each element (resp.
) is independently generated from a normal distribution with mean(resp. ) and precision (ranging over in our experiments later).
Here it is important to note that each element in contains a transition/visit count, and such a count is called implicit feedback [recommender_systems, Hu_ICDM08] (e.g., purchasing quantities in item recommendation). Unlike the explicit feedback dataset that contains negative values (e.g., integer ratings in ), zero elements in the implicit feedback dataset should be treated as s rather than missing elements to avoid overfitting [recommender_systems]. However, the treatment of all zero elements as s results in a high computational cost, especially for large tensors.
We overcome the computational issue by randomly sampling a small number of zero elements and treating them as s [recommender_systems, Pan_ICDM08]. Specifically, we randomly sample and zero elements for each user in and , respectively, where and (in our experiments, we set ). We treat the sampled zero elements as s, and the remaining zero elements as missing. Let (resp. ) be the indicator function that takes if (resp. ) is missing, and takes otherwise. Note that (resp. ) takes 1 at most (resp. ) elements for each user, where (resp. ) is the maximum number of positive elements per user in (resp. ).
Then the distribution can be written as follows:
where denotes the probability of in the normal distribution with mean and precision
Let be the -th rows of , , , and , respectively. Then, for the distribution of , we assume the multivariate normal distribution:
where , , are mean vectors, , , , are precision matrices, and , , , are called hyper-parameters. Let .
The hierarchical Bayes model assumes some distribution for the hyper-parameters. We assume each hyper-parameter follows a normal-Wishart distribution [prml]
, i.e., the conjugate prior of a multivariate normal distribution:
where , , and denotes the Wishart distribution with parameters and . ( and
are called a scale matrix and the number of degrees of freedom, respectively)., , , and are called hyper-hyper-parameters, and are determined in advance. In our experiments, we set , ,
to the identity matrix, andin the same way as [Salakhutdinov_ICML08].
Posterior sampling of . We train the MTF parameters based on the posterior sampling method [Wang_ICML15]. This method trains from by sampling from the posterior distribution . To sample from , we use the Gibbs sampling [mlpp], which samples each variable in turn, conditioned on the current values of the other variables.
Specifically, we sample , , , , , , , and in turn. We add the superscript “(t)” to these variables to denote the sampled values at the -th iteration. For initial values with “(0)”, we use a random initialization method [Albright_SAS14] that initializes each element as a random number in , as it is widely used. Then, we sample , , , , , , , and from the conditional distribution given the current values of the other variables, and iterate the sampling for a fixed number of times (see Appendix G for details).
The Gibbs sampling guarantees that the sampling distributions of approach to the posterior distributions as increases, and hence approximates sampled from the posterior distribution for large . In our experiments, we discarded the first samples as “burn-in”, and used as an approximation of . We also confirmed that the model’s predictive accuracy was converged within iterations.
After training the MTF parameters from the tensors , we generate synthetic traces via the MH (Metropolis-Hastings) algorithm [mlpp] (i.e., steps (iii) and (iv)). Here we explain how to, given a training trace , generate a synthetic trace that resembles of user .
Let be the set of transition-probability matrices, and be the set of -dimensional probability vectors (i.e., probability simplex). Given a transition-probability matrix and a probability vector , the MH algorithm modifies to so that the stationary distribution of is equal to . Note that is a conditional distribution called a proposal distribution, and is called a target distribution.
In the step (iii), given , we reconstruct the transition-count matrix and visit-count matrix of user (note that is included in ), and use the MH algorithm to make a transition-probability matrix of consistent with a visit-probability vector of for each time slot. Figure 5 shows its overview. Specifically, let and be the -th matrices in and , respectively (i.e., reconstructed transition-count matrix and visit-count matrix of user ). We first compute and from by (3). Then we compute a transition-probability matrix of user from by normalizing counts to probabilities. Similarly, we compute a visit-probability vector of user for each time slot from by normalizing counts to probabilities. Then, for each time slot , we modify to via the MH algorithm so that the stationary distribution of is equal to . In the subsequent step (iv), we generate a synthetic trace using .
Computing via MH. We first compute the -th matrix in from by (3), and then compute from by normalizing counts to probabilities as follows. We assign a very small positive value ( in our experiments) to elements in whose values are smaller than , and then normalize to so that the sum over each row in is . Since we assign () to elements with smaller values in , the transition-probability matrix is regular [mlpp]; i.e., it is possible to get from any location to any location in one step. This allows to be the stationary distribution of , as explained later in detail.
We then compute the -th matrix in from by (3). For each time slot , we assign () to elements with smaller values in , and then normalize the -th column of to so that the sum of is one.
We use as a proposal distribution and as a target distribution, and apply the MH algorithm to obtain a transition-probability matrix whose stationary distribution is . For and , we denote by the transition probability from to (i.e., the -th element of ). Similarly, given , we denote by the visit probability at . Then, the MH algorithm computes for as follows:
and computes as follows: . Note that is regular, as all elements in and are positive. Then the MH algorithm guarantees that is a stationary distribution of [mlpp].
Generating traces. After computing via the MH algorithm, we synthesize a trace of user as follows. We first randomly generate the first location at time slot from the visit-probability distribution . Then we randomly generate the subsequent location at time slot using the transition-probability matrix . Figure 6 shows an example of synthesizing a trace of user , where a location at time instant is randomly generated from the conditional distribution given the location at time instant .
The synthetic trace is generated in such a way that a visit probability in time slot is given by . In addition, the transition matrix is computed by using as a proposal distribution. Therefore, we can synthesize traces that preserve the statistical feature of training traces such as the time-dependent population distribution and the transition matrix.
Here we prove the DP of PPMTF, and describe the PD test.
DP of PPMTF. Since we train the MTF parameters via posterior sampling, the training algorithm provides DP for free (without additional noise).
Specifically, let be our training algorithm in the step (ii), which takes as input the training trace set and outputs the MTF parameters . Recall that the maximum counts in and are and , respectively, as defined in Section 3.2. Let satisfy that and for each . can be made small by iterating the sampling of until we find with small [Liu_RecSys15]. In addition, assume that is sampled from the exact posterior distribution . Then we obtain:
provides -DP, where
See Appendix C for the proof. Since a trace is synthesized from after outputs , all synthetic traces provide -DP thanks to the immunity to post-processing [DP].
However, can be very large in practice. Both the number of observed elements per user and the maximum value of elements with increase in the length of the training traces. For example, we set , , , and in our experiments, and consequently is larger than . We emphasize again that DP guarantees that the input training trace is indistinguishable from all possible traces including anomaly traces. Thus we use PD as a privacy measure, and perform the PD test as follows.
PD test. We perform the PD test to guarantee -PD in Definition 2 for synthetic traces.
Let be our generative model in the steps (iii) and (iv), which maps a training trace to a synthetic trace with probability . Let be a function that, given time instant , outputs an index of the location at time instant in ; e.g., in Figure 6. Furthermore, let be a function that, given time instant , outputs an index of the corresponding time slot; e.g., in Figure 6.
Recall that the first location in is randomly generated from , and the subsequent location at time instant is randomly generated from . Thus we obtain:
Hence, given , we can compute for any as follows: (i) compute for each time slot via the MH algorithm (as described in Section 3.4); (ii) compute . Then, we can verify whether is releasable with -PD by counting the number of training traces such that (2) holds for each other.
For example, we can perform the following PD test in [Bindschaedler_VLDB17]:
Let and . Given a training trace set , input training trace , and synthetic trace , output pass or fail as follows:
Let be a non-negative integer that satisfies:
Let be the number of traces such that:
If , then return pass, otherwise return fail.
It is easy to check by (2), (7), and (8) that if passes Privacy Test 1, then is releasable with -PD. In addition, -PD is guaranteed even if is not sampled from the exact posterior distribution (unlike -DP). The time complexity of Privacy Test 1 is linear in .
In this paper, we randomly select a subset of training traces from (as in [Bindschaedler_VLDB17]) to check faster whether or not. Specifically, we initialize to , and check (8) for each training trace in (increment if (8) holds). If , then we return pass (otherwise, return fail). Let be the set of training users corresponding to . Then the time complexity of this faster version of Privacy Test 1 is linear in (). A lower value of leads to a faster -PD test at the expense of fewer synthetic traces passing the test (and hence the loss of utility). In Section 4, we use the faster version of Privacy Test 1 with , , and (note that is considered to be reasonable in DP [Dwork_JPC09, DP_Li]). In Appendix F, we also analyze the effect of on the performance of the proposed method.
We can also use a randomized test [Bindschaedler_VLDB17], which adds Laplacian noise to , to provide -DP for a single synthetic trace generated from a randomly selected input training trace . However, and can be large for multiple synthetic traces from the same input training trace (as discussed in [Bindschaedler_VLDB17]). Thus we did not use the randomized test in our experiments.
We conducted experiments to show the utility, privacy, and scalability of the proposed method.
In our experiments, we used two publicly available datasets: the SNS-based people flow data [SNS_people_flow] and the Foursquare dataset in [Yang_WWW19]. The former is a relatively small-scale dataset with no missing events, and is used for comparing the proposed method with two state-of-the-art synthesizers [Bindschaedler_SP16, Bindschaedler_VLDB17]. The latter is recently published, and is one of the largest publicly available location datasets; e.g., much larger than [Yang_TIST16, Gowalla, Geolife, CRAWDAD]. Since the location synthesizer in [Bindschaedler_SP16] cannot be applied to this large-scale dataset (as shown in Section 4.4), we compare the proposed method with the synthesizer in [Bindschaedler_VLDB17].
SNS-based People Flow Data. The SNS-based people flow data [SNS_people_flow] (denoted by PF
) contains artificial traces around the Tokyo metropolitan area. The traces were generated from real geo-tagged tweets by interpolating locations every five minutes using railway and road information[Sekimoto_PC11].
In our experiments, we divided the Tokyo metropolitan area into regions; i.e., . Then we set the interval between two time instants to minutes, and extracted traces from 9:00 to 19:00 for users (each user has a single trace comprising events). We also set time slots to minutes long from 9:00 to 19:00. In other words, we assumed that each time slot is composed of one time instant; i.e., . We randomly divided the traces into train traces and testing traces; i.e., . The training traces were used for training generative models and synthesizing traces, whereas the testing traces were used for evaluating the utility.
Since the number of users is small in PF, we generated ten synthetic traces from each training trace (each synthetic trace is from 9:00 to 19:00) and averaged the utility and privacy results over the ten traces to stabilize the performance.
Foursquare Dataset. The Foursquare dataset (Global-scale Check-in Dataset with User Social Networks) [Yang_WWW19] (denoted by FS) is a large-scale real location dataset, which contains check-ins by users all over the world.
In our experiments, we selected six cities with a large number of check-ins and with a cultural diversity in the same way as [Yang_WWW19]: Istanbul (IST), Jakarta (JK), New York City (NYC), Kuala Lumpur (KL), San Paulo (SP), and Tokyo (TKY). For each city, we extracted the most popular POIs, whose number of visits from all users was the largest; i.e., . We set the interval between two time instants to hour (we rounded down minutes), and assigned every hours into one of time slots ( to AM), , ( to PM) in a cyclic manner; i.e., . For each city, we randomly selected of traces as training traces and used the remaining traces as testing traces. The number of users in IST, JK, NYC, KL, SP, and TKY was respectively: , , , , , and . There were many missing events in FS, as FS is a location check-in dataset. The number of transitions (temporally-continuous events) in the training traces of IST, JK, NYC, KL, SP, and TKY was respectively: , , , , , and .
In FS, we generated one synthetic trace for one day from each training trace, and evaluated the utility and privacy.
We evaluated the proposed method (Proposal), the synthetic location traces generator in [Bindschaedler_SP16] (SGLT), and the synthetic data generator in [Bindschaedler_VLDB17] (SGD).
In Proposal, we set the parameters , , , and , as explained in Section 3. For the hyper-hyper parameters, we set , , to the identity matrix, and (as in [Salakhutdinov_ICML08]). Then we set the precision to various values from to , and evaluated the utility and privacy for each value. We implemented Proposal with C++, and made it public [PPMTF].
In SGLT [Bindschaedler_SP16], we used the SGLT tool (C++) in [SGLT]. We set the location-removal probability to , the location merging probability to , and the randomization multiplication factor to in the same way as [Bindschaedler_SP16] (for details of the parameters in SGLT, see [Bindschaedler_SP16]). For the number of semantic clusters, we attempted various values: , , , or (as shown later, SGLT provided the best performance when or ). For each case, we set the probability of removing the true location in the input training trace to various values from to ( in [Bindschaedler_SP16]) to evaluate the trade-off between the utility and privacy.
In SGD [Bindschaedler_VLDB17], we trained the transition matrix for each time slot ( elements in total) and the visit-probability vector for the first time instant ( elements in total) from the training traces via maximum likelihood estimation. Note that the transition matrix and the visit-probability vector are common to all users. Then we generated a synthetic trace from an input training trace by copying the first events and generating the remaining events using the trained transition matrix. When , we randomly generated a location at the first time instant using the visit-probability vector. For more details of SGD, see Appendix D. We implemented SGD for location traces with C++, and made it public [PPMTF].
Utility. We evaluated the utility listed in Section 1 as follows.
(a) Time-dependent population distribution. We first computed a frequency distribution (-dimensional vector) of the testing traces and that of the synthetic traces for each time slot. Then we evaluated the average total variation between the two distributions over all time slots (denoted by TP-TV).
Note that frequently visited locations are especially important for some tasks [Zheng_WWW09, Do_TMC13]. Thus for each time slot, we also selected the top locations, whose frequencies in the testing traces were the largest, and regarded the absolute error for the remaining locations in TP-TV as (TP-TV-Top50).
(b) Transition matrix. We computed an average transition-probability matrix ( matrix) over all users and all time instances from the testing traces. Similarly, we computed an average transition-probability matrix from the synthetic traces. Since each row of the transition-probability matrix represents a conditional distribution, we evaluated the EMD (Earth Mover’s Distance) between the two conditional distributions over the -axis (longitude) and -axis (latitude), and averaged it over all rows (TM-EMD-X and TM-EMD-Y). TM-EMD-X and TM-EMD-Y represent how the two transition matrices differ over the -axis and -axis, respectively.
(c) Distribution of visit-fractions. Since we used POIs in FS (regions in PF), we evaluated how well the synthetic traces preserve a distribution of visit-fractions in FS. We first excluded testing traces whose number of events are small (less than ). Then, for each of the remaining traces, we computed a fraction of visits for each POI. Based on this, we computed a distribution of visit-fractions for each POI by dividing the fraction into bins as follows: