Privacy-Aware Location Sharing with Deep Reinforcement Learning

by Ecenaz Erdemir, et al.
Imperial College London

Location-based mobile applications have become widely popular. Despite their utility, these services raise privacy concerns since they require sharing location information with untrusted third parties. In this work, we study the privacy-utility trade-off in location sharing mechanisms. Existing approaches mainly focus on the privacy of a single shared location or on myopic location trace privacy; neither takes into account the temporal correlations between past and current locations. Although these methods preserve privacy at the current time, they may leak a significant amount of information at the trace level, as the adversary can exploit temporal correlations in a trace. We propose an information-theoretically optimal privacy-preserving location release mechanism that takes temporal correlations into account. We measure the privacy leakage by the mutual information between the user's true and released location traces. To tackle the history-dependent mutual information minimization, we reformulate the problem as a Markov decision process (MDP), and solve it using asynchronous actor-critic deep reinforcement learning (RL).




I Introduction

Fast advances in mobile devices and positioning technologies have fostered the development of many location-based services (LBSs), such as Google Maps, Uber, Foursquare and Tripadvisor. These services provide users with useful information about their surroundings, transportation services, friends’ activities, or nearby attraction points. Moreover, the integration of LBSs with social networks, such as Facebook, Twitter and YouTube, has rapidly increased indirect location sharing, e.g., via image or video sharing. However, location is among the most sensitive pieces of private information for users, since a malicious adversary can use it to derive users’ habits, health condition, social relationships, or religion. Therefore, location trace privacy has been an important concern in LBSs, and there is increasing pressure from consumers to keep their traces private against malicious attackers or untrusted service providers (SPs), while preserving a certain level of utility obtained from these applications.

This work was partially supported by the European Research Council (ERC) through project BEACON (No. 677854).

In the literature, a large body of research has focused on location-privacy protection mechanisms (LPPMs) against an untrusted service provider [1]. These methods can be categorized as spatial-location and temporal-location privacy preserving methods [2]. While the former focus on protecting a single location [3, 4, 5, 6, 7], the latter aim at providing location trace privacy [8, 9, 10]. Individual locations on a trace are highly correlated, and strategies focusing only on the privacy of the current location might reveal sensitive information about past or future locations.

Differential privacy, k-anonymity and information theoretic metrics are commonly used as privacy measures [3, 4, 5, 6, 7, 8, 9, 10]. By definition, differential privacy prevents the service provider from inferring the current location of the user, even if the SP has the knowledge of all the remaining locations. k-anonymity ensures that a location is indistinguishable from at least $k-1$ other location points. However, differential privacy and k-anonymity are meant to ensure the privacy of a single location, and they are shown not to be appropriate measures for location privacy in [11]. Instead, we treat the true and released location traces as random sequences, and measure the privacy leakage by mutual information [12].

In [7], the authors introduce location distortion mechanisms to keep the user’s trajectory private. Privacy is measured by the mutual information between the true and released traces and constrained by the average distortion for a specific distortion measure. The true trajectory is assumed to form a Markov chain. Due to the computational complexity of history-dependent mutual information optimization, the authors propose bounds which take only the current and one-step past locations into account. However, due to temporal correlations in the trajectory, the optimal distortion introduced at each time instance depends on the entire distortion and location history. Hence, the proposed bounds do not guarantee optimality.

In this work, similarly to [7], we consider the scenario in which the user follows a trajectory generated by a first-order Markov process, and periodically reports a distorted version of her location to an untrusted service provider. We assume that the true locations become available to the user in an online manner. We use the mutual information between the true and distorted location traces as a measure of privacy loss. For the privacy-utility trade-off, we introduce an online LPPM minimizing the mutual information while keeping the distortion below a certain threshold. Unlike [7], we consider location release policies which take the entire released location history into account, and show their optimality. To tackle the complexity, we exploit the Markovity of the true user trajectory and recast the problem as a Markov decision process (MDP). After identifying the structure of the optimal policy, we use the advantage actor-critic (A2C) deep reinforcement learning (RL) framework as a tool to evaluate our continuous state and action space MDP numerically.

II Problem Statement

We consider a user who shares her location with a service provider to gain utility through some location-based service. We denote the true location of the user at time $t$ by $X_t \in \mathcal{W}$, where $\mathcal{W}$ is the finite set of possible locations. We assume that the user trajectory $\{X_t\}_{t \geq 1}$ follows a first-order time-homogeneous Markov chain with transition probabilities $q_x(x_{t+1}|x_t)$, and initial probability distribution $p_{x_1}$. The user shares a distorted version of her location, denoted by $Y_t \in \mathcal{W}$, with the untrusted service provider due to privacy concerns. We assume that the user shares the distorted location in an online manner; that is, the released location at time $t$ does not depend on the future true locations; i.e., for any $1 \leq t < n$, $Y_t \rightarrow (X^t, Y^{t-1}) \rightarrow X_{t+1}^n$ form a Markov chain, where we have denoted the sequence $(X_{t+1}, \dots, X_n)$ by $X_{t+1}^n$, and the sequence $(X_1, \dots, X_t)$ by $X^t$ (and similarly for $Y^{t-1}$).
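The trajectory model above can be illustrated with a minimal sketch that samples a first-order, time-homogeneous Markov chain. The function name `sample_trajectory`, the three-location transition matrix and the uniform initial distribution are hypothetical and chosen only for illustration; they are not values from this paper.

```python
import random


def sample_trajectory(P, p0, n, rng):
    """Sample n locations from a first-order Markov chain.

    P[i][j] is the probability of moving from location i to location j,
    p0[i] is the probability of starting at location i.
    """
    states = list(range(len(p0)))
    x = rng.choices(states, weights=p0)[0]
    trajectory = [x]
    for _ in range(n - 1):
        # The next location depends only on the current one.
        x = rng.choices(states, weights=P[x])[0]
        trajectory.append(x)
    return trajectory


# Hypothetical 3-location example (illustrative numbers only).
P = [[0.8, 0.2, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.2, 0.8]]
p0 = [1 / 3, 1 / 3, 1 / 3]
trace = sample_trajectory(P, p0, 10, random.Random(0))
```

Passing the random generator explicitly keeps the sketch reproducible.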

Our goal is to characterize the trade-off between privacy and utility. We quantify privacy by the information leaked to the untrusted service provider, measured by the mutual information between the true and released location trajectories. The information leakage of the user’s location release strategy for a time period $n$ is given by

$I(X^n; Y^n) = \sum_{t=1}^{n} I(X^n; Y_t | Y^{t-1}) = \sum_{t=1}^{n} I(X^t; Y_t | Y^{t-1})$, (1)

where the first equality follows from the chain rule of mutual information, and the second from the Markov chain $Y_t \rightarrow (X^t, Y^{t-1}) \rightarrow X_{t+1}^n$.
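To make the leakage measure concrete, the following sketch computes the mutual information between a single true location and a single released location from their joint distribution. It covers only the single-letter case and is not the trace-level quantity studied here; the helper name `mutual_information` and the two toy joint distributions are assumptions made for illustration.

```python
import math


def mutual_information(joint):
    """Return I(X;Y) in bits for a joint pmf given as joint[x][y]."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # zero-probability pairs contribute nothing
                mi += p * math.log2(p / (p_x[i] * p_y[j]))
    return mi


# Released location independent of the true one: nothing is leaked.
independent = [[0.25, 0.25], [0.25, 0.25]]
# Released location equal to the true one: one full bit is leaked.
identical = [[0.5, 0.0], [0.0, 0.5]]
```

These two extremes bracket the trade-off: full privacy with no utility, and full utility with no privacy.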


Releasing distorted locations also reduces the utility received from the service provider. Therefore, the distortion applied by the user should be limited. The distortion between the true location $X_t$ and the released location $Y_t$ is measured by a specified distortion measure $d(X_t, Y_t)$ (e.g., Manhattan distance or Euclidean distance).
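As a small illustration of the two distortion measures mentioned above, the sketch below evaluates them on a rectangular grid whose cells are indexed row by row. The indexing convention and the helper names (`to_cell`, `manhattan`, `euclidean`, `average_distortion`) are assumptions for this example only.

```python
import math


def to_cell(index, width):
    """Map a location index to (row, column) on a grid with `width` columns."""
    return divmod(index, width)


def manhattan(x, y, width):
    (r1, c1), (r2, c2) = to_cell(x, width), to_cell(y, width)
    return abs(r1 - r2) + abs(c1 - c2)


def euclidean(x, y, width):
    (r1, c1), (r2, c2) = to_cell(x, width), to_cell(y, width)
    return math.hypot(r1 - r2, c1 - c2)


def average_distortion(true_trace, released_trace, width, measure):
    """Empirical average distortion between two traces of equal length."""
    total = sum(measure(x, y, width) for x, y in zip(true_trace, released_trace))
    return total / len(true_trace)
```

For example, on a grid with four columns, cells 0 and 5 are one row and one column apart, so their Manhattan distance is 2.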

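Our goal is to minimize the information leakage rate to the service provider while keeping the average distortion below a specified level for utility. The infinite-horizon optimization problem can be written as:

$\min_{\{q_t(y_t|x^t,y^{t-1})\}_{t=1}^{\infty}} \lim_{n \to \infty} \frac{1}{n} I^q(X^n; Y^n)$ (2)

such that $\lim_{n \to \infty} \mathbb{E}^q\left[\frac{1}{n} \sum_{t=1}^{n} d(X_t, Y_t)\right] \leq \bar{D}$, (3)

where $\bar{D}$ is the specified distortion constraint on the utility loss, $x_t$ and $y_t$ represent the realizations of $X_t$ and $Y_t$, and $q_t(y_t|x^t,y^{t-1})$ is a conditional probability distribution which represents the user’s randomized location release policy at time $t$. The expectation in (3) is taken over the joint probabilities of $X^n$ and $Y^n$, where the randomness stems from both the Markov process generating the true trajectory, and the random release mechanism $q_t$. The mutual information induced by policy $q$ is calculated using the joint probability distribution

$P^q(X^n = x^n, Y^n = y^n) = p_{x_1} q_1(y_1|x_1) \prod_{t=2}^{n} q_x(x_t|x_{t-1}) q_t(y_t|x^t,y^{t-1})$, (4)

where $q = \{q_t\}_{t=1}^{n}$, and $p_{x_1}$ and $q_x$ are the initial distribution and the transition probabilities of the true trajectory. In the next section, we characterize the structure of the optimal location release policy, and using this structure recast the problem as an MDP and evaluate it using deep RL.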

III Privacy-Utility Trade-off for Online Location Sharing

In this section, we analyze the privacy-utility trade-off of the user's location release mechanism under the notion of mutual information minimization with a distortion constraint. Moreover, we propose simplified location release policies that still preserve optimality.

By the definition of mutual information, the objective in (2) depends on the entire history of $X^t$ and $Y^{t-1}$. Therefore, the user must follow a history-dependent location release policy $q_h \in Q_H$, where a feasible set $Q_H$ satisfies $\sum_{y_t \in \mathcal{W}} q_t(y_t|x^t,y^{t-1}) = 1$. As a result of this strong history dependence, the computational complexity of the minimization problem increases exponentially with the length of the user trajectory. To tackle this problem, we introduce a class of simplified policies.

III-A Simplified Location Release Policies

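In this section we introduce a set of policies $Q_S \subseteq Q_H$ of the form $q_t(y_t|x_t,x_{t-1},y^{t-1})$, which sample the distorted location by considering only the last two true locations and the entire released location history. Hence, the joint distribution (4) induced by $q_s \in Q_S$, where $q_s = \{q_t(y_t|x_t,x_{t-1},y^{t-1})\}_{t=1}^{n}$, can be written as

$P^{q_s}(X^n = x^n, Y^n = y^n) = p_{x_1} q_1(y_1|x_1) \prod_{t=2}^{n} q_x(x_t|x_{t-1}) q_t(y_t|x_t,x_{t-1},y^{t-1})$. (5)

Next, we show that considering location release policies in the set $Q_S$ is without loss of optimality.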

Theorem 1.

In the minimization problem (2), there is no loss of optimality in restricting our location release policies to the set of simplified policies $Q_S$. Furthermore, the information leakage induced by any $q_s \in Q_S$ can be written as:

$I^{q_s}(X^n; Y^n) = \sum_{t=1}^{n} I^{q_s}(X_t, X_{t-1}; Y_t | Y^{t-1})$. (6)

The proof of Theorem 1 relies on the following lemmas and will be presented later.

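Lemma 1.

For any history-dependent policy $q \in Q_H$,

$I^q(X^n; Y^n) \geq \sum_{t=1}^{n} I^q(X_t, X_{t-1}; Y_t | Y^{t-1})$, (7)

with equality if and only if $q$ belongs to the set of simplified policies $Q_S$.

Proof. For any $q \in Q_H$,

$I^q(X^n; Y^n) = \sum_{t=1}^{n} I^q(X^n; Y_t | Y^{t-1})$ (8)

$= \sum_{t=1}^{n} I^q(X^t; Y_t | Y^{t-1})$ (9)

$\geq \sum_{t=1}^{n} I^q(X_t, X_{t-1}; Y_t | Y^{t-1})$, (10)

where (9) follows from (1), and (10) from the non-negativity of mutual information, since by the chain rule $I^q(X^t; Y_t | Y^{t-1}) = I^q(X_t, X_{t-1}; Y_t | Y^{t-1}) + I^q(X^{t-2}; Y_t | X_t, X_{t-1}, Y^{t-1})$ and the last term is non-negative. ∎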

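Lemma 2.

For any $q_h \in Q_H$, there exists a $q_s \in Q_S$ such that

$\sum_{t=1}^{n} I^{q_s}(X_t, X_{t-1}; Y_t | Y^{t-1}) = \sum_{t=1}^{n} I^{q_h}(X_t, X_{t-1}; Y_t | Y^{t-1})$. (11)

Proof. For any $q_h \in Q_H$, we choose the policy $q_s \in Q_S$ such that

$q_s(y_t|x_t,x_{t-1},y^{t-1}) = P^{q_h}(y_t|x_t,x_{t-1},y^{t-1})$, (12)

and we show that $P^{q_s}(x_t, x_{t-1}, y^t) = P^{q_h}(x_t, x_{t-1}, y^t)$. Then, $I^{q_s}(X_t, X_{t-1}; Y_t | Y^{t-1}) = I^{q_h}(X_t, X_{t-1}; Y_t | Y^{t-1})$ holds, which proves the statement in Lemma 2. The proof of the equality is derived by induction as follows,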


where (12) holds, and is for the initialization of the induction. ∎


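Proof of Theorem 1. Following Lemmas 1 and 2, for any $q_h \in Q_H$, there exists a $q_s \in Q_S$ such that

$I^{q_h}(X^n; Y^n) \geq \sum_{t=1}^{n} I^{q_h}(X_t, X_{t-1}; Y_t | Y^{t-1}) = \sum_{t=1}^{n} I^{q_s}(X_t, X_{t-1}; Y_t | Y^{t-1}) = I^{q_s}(X^n; Y^n)$. (14)

Hence, there is no loss of optimality in using the location release policies of the form $q_t(y_t|x_t,x_{t-1},y^{t-1})$, and the information leakage reduces to (6). ∎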

Fig. 1: Markov chain for the simplified location release policy.

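Restricting our attention to the user location release policies $q_s \in Q_S$, we can write the minimization problem (2) as

$\min_{q_s \in Q_S} \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} I^{q_s}(X_t, X_{t-1}; Y_t | Y^{t-1})$, (15)

subject to the distortion constraint (3). The location release strategy followed by the user is illustrated by the Markov chain in Fig. 1, where $Y^{t-1}$ denotes the released location history, i.e., $Y^{t-1} = (Y_1, \dots, Y_{t-1})$. That is, the user samples a distorted location, $Y_t$, at time $t$ by considering the current and previous true locations, $(X_t, X_{t-1})$, and the released location history, $Y^{t-1}$.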

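Minimization of the mutual information subject to a utility constraint can be converted into an unconstrained minimization problem using Lagrange multipliers. Since the distortion constraint is memoryless, we can integrate it into the additive objective function easily. Hence, the unconstrained minimization problem for the online location release privacy-utility trade-off can be rewritten as

$\min_{q_s \in Q_S} \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \left[ I^{q_s}(X_t, X_{t-1}; Y_t | Y^{t-1}) + \lambda \left( \mathbb{E}^{q_s}[d(X_t, Y_t)] - \bar{D} \right) \right]$, (16)

where $\lambda \geq 0$ is the Lagrange multiplier.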


III-B MDP Formulation

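Markovity of the user’s true location trace and the additive objective function in (16) allow us to represent the problem as an MDP with state $X_t$. However, the information leakage at time $t$ depends on $Y^{t-1}$, resulting in a state space that grows in time. Therefore, for a given policy $q$ and any realization $y^{t-1}$ of $Y^{t-1}$, we define a belief state $\beta_t$ as a probability distribution over the state space:

$\beta_t(x_{t-1}) = P^q(X_{t-1} = x_{t-1} | Y^{t-1} = y^{t-1})$. (17)

This represents the service provider’s belief on the user’s true location at the beginning of time instance $t$, i.e., after receiving the distorted location $y_{t-1}$ at the end of the previous time instance $t-1$. The MDP actions are defined as the probability distributions sampling the released location $Y_t = y_t$ at time $t$, and are determined by the randomized location release policies. The user’s action induced by a policy $q$ can be denoted by $a_t(y_t|x_t,x_{t-1}) = P^q(Y_t = y_t | X_t = x_t, X_{t-1} = x_{t-1}, \beta_t)$ [13, 14]. At each time $t$, the service provider updates its belief on the true location after observing the distorted location $y_t$ by

$\beta_{t+1}(x_t) = \frac{\sum_{x_{t-1}} \beta_t(x_{t-1}) q_x(x_t|x_{t-1}) a_t(y_t|x_t,x_{t-1})}{\sum_{x_t, x_{t-1}} \beta_t(x_{t-1}) q_x(x_t|x_{t-1}) a_t(y_t|x_t,x_{t-1})}$. (18)

We define the per-step information leakage of the user due to taking action $a_t$ at time $t$ as,

$l_t(x_t, x_{t-1}, a_t, y^t; q) := \log \frac{a_t(y_t|x_t,x_{t-1})}{P^q(y_t|y^{t-1})}$. (19)

The expectation of the $n$-step sum of (19) over the joint probability $P^q(X_t, X_{t-1}, Y^t)$ is equal to the mutual information expression in the original problem (15). Therefore, given belief and action probabilities, the average information leakage at time $t$ can be formulated as,

$\mathbb{E}^q[l_t] = \sum_{x_t, x_{t-1}, y_t} \beta_t(x_{t-1}) q_x(x_t|x_{t-1}) a_t(y_t|x_t,x_{t-1}) \log \frac{a_t(y_t|x_t,x_{t-1})}{\sum_{\hat{x}_t, \hat{x}_{t-1}} \beta_t(\hat{x}_{t-1}) q_x(\hat{x}_t|\hat{x}_{t-1}) a_t(y_t|\hat{x}_t,\hat{x}_{t-1})}$. (20)

We remark that the representation of the average distortion in terms of belief and action probabilities is straightforward due to its additive form. Similarly to (20), the average distortion at time $t$ can be written as,

$\mathbb{E}^q[d(x_t, y_t)] = \sum_{x_t, x_{t-1}, y_t} \beta_t(x_{t-1}) q_x(x_t|x_{t-1}) a_t(y_t|x_t,x_{t-1}) d(x_t, y_t)$. (21)

Finally, we can recast the original problem in (16) as a continuous state and action space MDP. Evaluation of the MDP relies on minimizing the objective

$C(\beta_t, a_t) = \mathbb{E}^q[l_t] + \lambda \left( \mathbb{E}^q[d(x_t, y_t)] - \bar{D} \right)$, (22)

where $\lambda$ is the Lagrange multiplier, at each time step $t$ for a trajectory of length $n$.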
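The per-step objective can be sketched numerically as follows, assuming the belief is a distribution over the previous true location and the action is a conditional distribution of the released location given the previous and current true locations. The function name `step_cost`, the array layout `action[i][j][k]`, the Hamming distortion and the two-location numbers are assumptions for illustration; the constant offset involving the distortion budget is omitted because it does not affect the minimizer.

```python
import math


def step_cost(belief, P, action, dist, lam):
    """Expected per-step leakage (in nats) plus lam times expected distortion.

    belief[i]       : probability that the previous true location was i
    P[i][j]         : transition probability from i to j
    action[i][j][k] : probability of releasing k given previous i, current j
    dist(j, k)      : distortion between true location j and release k
    """
    n, m = len(belief), len(action[0][0])
    # Marginal probability of each released location.
    p_y = [sum(belief[i] * P[i][j] * action[i][j][k]
               for i in range(n) for j in range(n)) for k in range(m)]
    leak = distortion = 0.0
    for i in range(n):
        for j in range(n):
            for k in range(m):
                w = belief[i] * P[i][j] * action[i][j][k]
                if w > 0:
                    leak += w * math.log(action[i][j][k] / p_y[k])
                    distortion += w * dist(j, k)
    return leak + lam * distortion, leak, distortion


P = [[0.9, 0.1], [0.2, 0.8]]
belief = [0.5, 0.5]
hamming = lambda x, y: 0.0 if x == y else 1.0
# Release the current location unchanged, or ignore it completely.
truthful = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(2)]
uniform = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(2)]
```

With these toy inputs the truthful action has zero distortion but leaks the full entropy of the next location, while the uniform action leaks nothing at the price of the largest distortion of the two.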

Finding optimal policies for continuous state and action space MDPs is a PSPACE-hard problem [15]. In practice, they can be solved by various finite-state MDP evaluation methods, e.g., value iteration, policy iteration and gradient-based methods. These are based on the discretization of the continuous belief states to obtain a finite state MDP [16]. While finer discretization of the belief reduces the loss from the optimal solution, it causes an increase in the state space and hence in the complexity of the problem. Therefore, we use a deep learning based method as a tool to numerically solve our continuous state and action space MDP problem.

III-C Advantage Actor-Critic (A2C) Deep RL

In RL, an agent discovers the best action to take in a particular state by receiving instantaneous rewards/costs from the environment [17]. On the other hand, in our problem, we have the knowledge of state transitions and the cost for every state-action pair without a need for interacting with the environment. We use advantage actor-critic deep RL (A2C-DRL) as a computational tool to numerically evaluate the optimal location release policies for our continuous state and action space MDP.

Fig. 2: RL for a known model.

To integrate the RL framework into our problem, we create an artificial environment which takes the user’s current action, $a_t$, as input, samples an observation $y_t$, and calculates the next state, $\beta_{t+1}$, using the Bayesian belief update (18). The instantaneous cost revealed by the environment is calculated by (22). The user receives the experience tuple $(\beta_t, a_t, C(\beta_t, a_t), \beta_{t+1})$ from the environment, and refines her policy accordingly. Fig. 2 illustrates the interaction between the artificial environment and the user, which is represented by the RL agent. The corresponding Bellman equation induced by the location release policy can be written as


where is the state-value function, is the updated belief state according to (18), represents action probability distributions, and is the cost-to-go function, i.e., the expected future cost induced by policy [18].
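One step of such an artificial environment can be sketched as below, assuming the belief is the posterior over the previous true location and the next belief is obtained by Bayes' rule after the released location is sampled. The name `env_step`, the array layout `action[i][j][k]` and the numbers in the roll-out are hypothetical; the cost computation is left out to keep the sketch short.

```python
import random


def env_step(belief, P, action, rng):
    """Sample a released location and return it with the updated belief.

    belief[i]       : probability that the previous true location was i
    P[i][j]         : transition probability from i to j
    action[i][j][k] : probability of releasing k given previous i, current j
    """
    n, m = len(belief), len(action[0][0])
    # joint[j][k]: probability that the current location is j and k is released.
    joint = [[sum(belief[i] * P[i][j] * action[i][j][k] for i in range(n))
              for k in range(m)] for j in range(n)]
    p_y = [sum(joint[j][k] for j in range(n)) for k in range(m)]
    y = rng.choices(range(m), weights=p_y)[0]
    # Bayes' rule: posterior over the current location given the release y.
    next_belief = [joint[j][y] / p_y[y] for j in range(n)]
    return y, next_belief


# Short roll-out with a fixed, hypothetical noisy release rule.
P = [[0.9, 0.1], [0.2, 0.8]]
noisy = [[[0.7, 0.3], [0.3, 0.7]] for _ in range(2)]
belief = [0.5, 0.5]
rng = random.Random(1)
history = []
for _ in range(5):
    y, belief = env_step(belief, P, noisy, rng)
    history.append((y, belief))
```

Because the model is known, no real user data is needed: the environment only propagates probabilities.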

RL methods can be divided into three groups: value-based, policy-based, and actor-critic [19]. Actor-critic methods combine the advantages of value-based (critic-only) and policy-based (actor-only) methods, such as low variance and the capability to produce continuous actions. The actor represents the policy structure, while the critic estimates the value function. In our setting, we parameterize the value function by the parameter vector $\theta$ as $V_\theta(\beta)$, and the stochastic policy by $\xi$ as $\pi_\xi(a|\beta)$. The difference between the right and the left hand side of (23) is called the temporal difference (TD) error, which represents the error between the critic’s estimate and the target differing by one step in time [20]. The TD error for the experience tuple $(\beta_t, a_t, C(\beta_t, a_t), \beta_{t+1})$ is estimated as

$\delta_t = C(\beta_t, a_t) + \gamma V_\theta(\beta_{t+1}) - V_\theta(\beta_t)$, (24)

where $C(\beta_t, a_t) + \gamma V_\theta(\beta_{t+1})$ is called the TD target, and $\gamma$ is a discount factor that we choose very close to 1 to approximate the Bellman equation in (23) for our infinite-horizon average cost MDP. To implement RL in the infinite-horizon problem, we take sample averages over independent finite trajectories, which are generated by experience tuples at each time $t$, via Monte Carlo roll-outs.

Instead of using value functions in the actor and critic updates, we use the advantage function to reduce the variance in policy gradient methods. The advantage can be approximated by the TD error. Hence, the critic is updated by gradient descent as:

$\theta \leftarrow \theta - \eta^c_t \nabla_\theta \ell_c(\theta)$, (25)

where $\ell_c(\theta)$ is the critic loss and $\eta^c_t$ is the learning rate of the critic at time $t$. The actor is updated similarly as,

$\xi \leftarrow \xi - \eta^a_t \nabla_\xi \ell_a(\xi)$, (26)

where $\ell_a(\xi)$ is the actor loss and $\eta^a_t$ is the actor’s learning rate. This method is called advantage actor-critic RL.
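The critic update can be illustrated with a deliberately simplified sketch in which a linear value function stands in for the critic DNN and the critic loss is assumed to be half the squared TD error with the target held fixed. The name `critic_step`, the linear parameterization and the default discount factor and learning rate are assumptions for illustration, not the configuration used in this work.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def critic_step(theta, belief, cost, next_belief, gamma=0.99, lr=0.1):
    """One semi-gradient TD(0) update for a linear critic V(b) = theta . b.

    Returns the updated parameters and the TD error, which also serves
    as the advantage estimate passed to the actor.
    """
    value = dot(theta, belief)
    td_target = cost + gamma * dot(theta, next_belief)
    delta = td_target - value
    # Gradient of 0.5 * delta**2 with respect to theta, target held fixed,
    # is -delta * belief, so gradient descent adds lr * delta * belief.
    new_theta = [th + lr * delta * b for th, b in zip(theta, belief)]
    return new_theta, delta


theta, delta = critic_step([0.0, 0.0], [1.0, 0.0], cost=1.0,
                           next_belief=[0.0, 1.0])
```

In a deep implementation the same TD error would be backpropagated through the critic network by an automatic differentiation library.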

In our deep-A2C implementation, we represent the actor and critic mechanisms by fully connected feed-forward deep neural networks (DNNs) with two hidden layers, as illustrated in Fig. 3. The critic DNN takes the current belief state $\beta_t$ over the location vector $\mathcal{W}$ of size $|\mathcal{W}|$ as input, and outputs the value of the belief state for the current action probabilities. The actor takes the belief state as input, and outputs the parameters $\alpha$ used for determining the action probabilities of the corresponding belief. Here, $\alpha$ are the concentration parameters which are used to generate a Dirichlet distribution to represent the action probabilities. The overall A2C deep RL algorithm for online LPPM is described in Algorithm 1.
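How concentration parameters turn into action probabilities can be sketched with the standard construction of a Dirichlet sample from normalized Gamma draws. The assumption that there is one Dirichlet distribution per pair of previous and current true locations, as well as the names `sample_dirichlet` and `sample_action` and the example parameters, are illustrative and not confirmed by the text.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha), all alpha > 0."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_action(alpha, rng):
    """Sample release probabilities action[i][j][k].

    alpha[i][j] holds the concentration parameters over released locations
    when the previous and current true locations are i and j.
    """
    return [[sample_dirichlet(params, rng) for params in prev] for prev in alpha]


rng = random.Random(0)
# Hypothetical actor output favouring release of the current location.
alpha = [[[5.0, 1.0], [1.0, 5.0]], [[5.0, 1.0], [1.0, 5.0]]]
action = sample_action(alpha, rng)
```

Larger concentration parameters make the sampled probabilities cluster more tightly around their mean, which gives the actor a way to reduce exploration as training progresses.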




Fig. 3: Critic (a) and actor (b) neural network structures.
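Initialize the DNNs with random weights $\theta$ and $\xi$
Initialize environment
for each episode do
       Initialize belief state $\beta_1$;
       for $t = 1, \dots, n$ do
             Sample action probability vector $a_t$ according to the current policy;
             Perform the action and calculate the cost in (22);
             Sample an observation $y_t$ and calculate the next belief state $\beta_{t+1}$ in (18);
             Set the TD target $C(\beta_t, a_t) + \gamma V_\theta(\beta_{t+1})$;
             Minimize the critic loss $\ell_c(\theta)$;
             Update the critic $\theta \leftarrow \theta - \eta^c_t \nabla_\theta \ell_c(\theta)$;
             Minimize the actor loss $\ell_a(\xi)$;
             Update the actor $\xi \leftarrow \xi - \eta^a_t \nabla_\xi \ell_a(\xi)$;
             Update the belief state $\beta_t \leftarrow \beta_{t+1}$
       end for
end for
Algorithm 1 Advantage actor-critic deep RL (A2C-DRL) algorithm for online LPPM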

IV Numerical Results

In this section, we evaluate the performance of the proposed LPPM policy for a simple grid-world example, and compare the results with the myopic Markovian location release mechanism proposed in [7].

In [7], an upper bound on the privacy-utility trade-off is given by a myopic policy as follows:


Exploiting the fact that (27) is similar to the rate-distortion function, the Blahut-Arimoto algorithm is used in [7] to minimize the conditional mutual information at each time step. The finite-horizon solution of the objective function (27) is obtained by applying alternating minimization sequentially. In our simulations, we obtained the average information leakage and distortion for this approach by normalizing for .
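For readers unfamiliar with the alternating minimization mentioned above, the following sketch implements the classical Blahut-Arimoto iteration for a single-letter rate-distortion problem. It is not the conditional, per-time-step variant applied in [7]; the function name `blahut_arimoto`, the `slope` parameter (the Lagrange multiplier on distortion) and the binary example are assumptions for illustration.

```python
import math


def blahut_arimoto(p_x, dist, slope, iters=200):
    """Return (rate in nats, distortion, test channel q[x][y]).

    p_x[x]     : source distribution
    dist[x][y] : distortion between source symbol x and reproduction y
    slope      : non-negative weight on distortion in the Lagrangian
    """
    n, m = len(p_x), len(dist[0])
    r = [1.0 / m] * m  # current guess of the output marginal
    for _ in range(iters):
        # Step 1: best test channel for the current output marginal.
        q = []
        for x in range(n):
            w = [r[y] * math.exp(-slope * dist[x][y]) for y in range(m)]
            z = sum(w)
            q.append([v / z for v in w])
        # Step 2: output marginal induced by that channel.
        r = [sum(p_x[x] * q[x][y] for x in range(n)) for y in range(m)]
    rate = sum(p_x[x] * q[x][y] * math.log(q[x][y] / r[y])
               for x in range(n) for y in range(m)
               if p_x[x] > 0 and q[x][y] > 0)
    distortion = sum(p_x[x] * q[x][y] * dist[x][y]
                     for x in range(n) for y in range(m))
    return rate, distortion, q


# Uniform binary source with Hamming distortion.
hamming = [[0.0, 1.0], [1.0, 0.0]]
rate, distortion, q = blahut_arimoto([0.5, 0.5], hamming, slope=1.0)
```

Sweeping `slope` from zero upwards traces the trade-off curve from zero rate towards zero distortion.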

We consider a simple scenario in which the region of interest is a grid-world, that is, =. The user’s trajectory forms a first-order Markov chain with the transition probability matrix . The user can start her movement at any square with equal probability . The Lagrangian multiplier $\lambda$ denotes the user’s choice of privacy-utility balance. We train two fully connected feed-forward DNNs, representing the actor and critic, using the ADAM optimizer [21]. Both networks contain two hidden layers with leaky-ReLU activation [22]. We obtain the corresponding privacy-utility trade-off by averaging the total information leakage and distortion over a time horizon of .

Fig. 4: Average information leakage as a function of the allowed average distortion for the myopic and proposed LPPM policies.

In Fig. 4, privacy-distortion trade-off curves are obtained assuming that and are Markov transition matrices with different correlation levels. In both cases, the user can move from any square to any square at each step; however, the probability of moving to a closer square is greater than that of taking a larger step to a more distant one. This is simply done by normalizing the transition probabilities in and by the Euclidean distances covered by the corresponding transitions. While represents a uniform trajectory with equal probabilities of going to equal distances, is more arbitrary, which leads the user to follow a certain path with high probability. That is, causes higher temporal correlations in the user trajectory compared to . Distortion is measured by the Euclidean distance between and . We train our DNNs for a time horizon in each episode, and over Monte Carlo roll-outs. Fig. 4 shows that, for the proposed LPPM obtained through deep RL leaks much less information than the myopic LPPM for the same distortion level, indicating the benefits of considering all the history when taking actions at each time instant. Their performances are the same for , since the user movement with uniform distribution does not contain temporal correlations between the past locations. Therefore, taking these correlations into account does not outperform the myopic policy.
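One way to build a transition matrix in which nearer squares are more likely than distant ones is sketched below. The exact normalization used in the experiments is not recoverable from the text, so the weighting `1 / (1 + distance)`, the function name `distance_transition_matrix` and the 3-by-3 grid size are assumptions for illustration only.

```python
import math


def distance_transition_matrix(width, height):
    """Row-stochastic matrix whose entries decay with Euclidean distance.

    Cells are indexed row by row; every cell is reachable from every cell.
    """
    cells = [(r, c) for r in range(height) for c in range(width)]
    matrix = []
    for r1, c1 in cells:
        # Assumed weighting: closer cells receive larger weights.
        weights = [1.0 / (1.0 + math.hypot(r1 - r2, c1 - c2))
                   for r2, c2 in cells]
        total = sum(weights)
        matrix.append([w / total for w in weights])
    return matrix


P = distance_transition_matrix(3, 3)
```

Sharpening the weights, for instance by raising them to a power, would concentrate probability on a few moves and so increase the temporal correlation of the trajectory.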

V Conclusions

We have studied the privacy-utility trade-off in LPPMs, using mutual information as a privacy measure. Having identified some properties of the optimal policy, we recast the problem as an MDP. Due to continuous state and action spaces, we used advantage actor-critic deep RL as a computational tool. Utilizing DNNs, we numerically evaluated the privacy-utility trade-off curve of the proposed location release policy. We compared the results with a myopic LPPM, and observed the effect of considering temporal correlations on information leakage-distortion performance. According to the simulation results, we have seen that the proposed LPPM policy provides significant privacy advantage, especially when the user trajectory has higher temporal correlations.


References

  • [1] V. Primault, A. Boutet, S. Ben Mokhtar, and L. Brunie, “The long road to computational location privacy: A survey,” IEEE Communications Surveys & Tutorials, pp. 1–1, 2018.
  • [2] C.-Y. Chow and M. F. Mokbel, “Trajectory privacy in location-based services and data publication,” SIGKDD Explorations, vol. 13, pp. 19–29, 2011.
  • [3] K. P. N. Puttaswamy, S. Wang, T. Steinbauer, D. Agrawal, A. E. Abbadi, C. Kruegel, and B. Y. Zhao, “Preserving location privacy in geosocial applications,” IEEE Transactions on Mobile Computing, vol. 13, no. 1, pp. 159–173, Jan 2014.
  • [4] R. Shokri, C. Troncoso, C. Diaz, J. Freudiger, and J.-P. Hubaux, “Unraveling an old cloak: k-anonymity for location privacy,” Proceedings of the ACM Conference on Computer and Communications Security, Sep. 2010.
  • [5] R. Shokri, G. Theodorakopoulos, C. Troncoso, J.-P. Hubaux, and J.-Y. Le Boudec, “Protecting location privacy: Optimal strategy against localization attacks,” ACM Conference on Computer and Communications Security, pp. 617–627, Oct. 2012.
  • [6] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, “Local privacy and statistical minimax rates,” in 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, Oct 2013, pp. 429–438.
  • [7] W. Zhang, M. Li, R. Tandon, and H. Li, “Online location trace privacy: An information theoretic approach,” IEEE Transactions on Information Forensics and Security, vol. 14, no. 1, pp. 235–250, Jan 2019.
  • [8] V. Bindschaedler and R. Shokri, “Synthesizing plausible privacy-preserving location traces,” in 2016 IEEE Symposium on Security and Privacy (SP), May 2016, pp. 546–563.
  • [9] W. Luo, Y. Lu, D. Zhao, and H. Jiang, “On location and trace privacy of the moving object using the negative survey,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 1, no. 2, pp. 125–134, April 2017.
  • [10] J. Hua, W. Tong, F. Xu, and S. Zhong, “A geo-indistinguishable location perturbation mechanism for location-based services supporting frequent queries,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 5, pp. 1155–1168, May 2018.
  • [11] R. Shokri, G. Theodorakopoulos, J. Le Boudec, and J. Hubaux, “Quantifying location privacy,” in 2011 IEEE Symposium on Security and Privacy, May 2011, pp. 247–262.
  • [12] I. Wagner and D. Eckhoff, “Technical privacy metrics: A systematic survey,” ACM Comput. Surv., vol. 51, no. 3, pp. 57:1–57:38, Jun. 2018.
  • [13] S. Li, A. Khisti, and A. Mahajan, “Information-theoretic privacy for smart metering systems with a rechargeable battery,” IEEE Transactions on Information Theory, vol. 64, no. 5, pp. 3679–3695, May 2018.
  • [14] G. Giaconi and D. Gündüz, “Smart meter privacy with renewable energy and a finite capacity battery,” in 2016 IEEE 17th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), July 2016, pp. 1–5.
  • [15] C. H. Papadimitriou and J. N. Tsitsiklis, “The complexity of Markov decision processes,” Mathematics of Operations Research, vol. 12, no. 3, pp. 441–450, 1987.
  • [16] N. Saldi, T. Linder, and S. Yuksel, “Approximations for partially observed Markov decision processes,” in Finite Approximations in Discrete-Time Stochastic Control.   Cham: Birkhäuser, 2018, ch. 5, pp. 99–124.
  • [17] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed.   The MIT Press, 2018.
  • [18] D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II, 3rd ed.   Athena Scientific, 2007.
  • [19] V. R. Konda and J. N. Tsitsiklis, “On actor-critic algorithms,” SIAM J. Control Optim., vol. 42, no. 4, pp. 1143–1166, Apr. 2003.
  • [20] I. Grondman, L. Busoniu, G. A. D. Lopes, and R. Babuska, “A survey of actor-critic reinforcement learning: Standard and natural policy gradients,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 6, pp. 1291–1307, Nov 2012.
  • [21] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2015.
  • [22] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.