Asymptotic Loss in Privacy due to Dependency in Gaussian Traces

09/27/2018, by Nazanin Takbiri et al., University of Massachusetts Amherst

Rapid growth of the Internet of Things (IoT) necessitates employing privacy-preserving techniques to protect users' sensitive information. Even when user traces are anonymized, statistical matching can be employed to infer sensitive information. In our previous work, we established the privacy requirements for the case where user traces are instantiations of discrete random variables and the adversary knows only the structure of the dependency graph, i.e., whether each pair of users is connected. In this paper, we consider the case where data traces are instantiations of Gaussian random variables and the adversary knows not only the structure of the graph but also the pairwise correlation coefficients. We establish the requirements on anonymization needed to thwart such statistical matching, which demonstrate the significant degree to which knowledge of the pairwise correlation coefficients further aids the adversary in breaking user anonymity.


I Introduction

The Internet of Things (IoT) enables users to share and access information on a large scale and provides many benefits for individuals (e.g., smart homes, healthcare) and industries (e.g., digital tracking, data collection, disaster management) by tuning the system to user characteristics based on (potentially sensitive) information about their activities. Thus, the use of IoT comes with a significant threat to users’ privacy: leakage of sensitive information.

Two main privacy-preserving techniques are anonymization [1] and obfuscation [2]: the former hides the mapping between data and users by replacing users' identification fields with pseudonyms, while the latter perturbs the user data so that the adversary observes false but plausible data. Although these methods have been widely studied, statistical inference can be applied to break the privacy of the users [3, 4]. Furthermore, achieving privacy using these methods comes with a cost: reducing the utility of the system to the user. Hence, it is crucial to consider the trade-off between privacy and utility when employing privacy-preserving techniques, and to seek to achieve privacy with minimal loss of functionality and usability [5, 6, 7]. Despite the growing interest in IoT privacy [8, 9], previous works do not offer theoretical guarantees on the trade-off between privacy and utility. The works of Shokri et al. [10, 1, 11] and Ma et al. [12] provide significant advances in the quantitative analysis of privacy; however, they are not based on a solid theoretical framework.

In [13, 14, 15], the data traces of different users are modeled as independent, and the asymptotic limits of user privacy are presented for the case when both anonymization and obfuscation are applied to users' time series of data. In [13, 14, 16], each user's data trace is governed by either: 1) independent and identically distributed (i.i.d.) samples of a Multinoulli (generalized Bernoulli) distribution; or 2) samples of a Markov chain. In [15], the case of independent users with Gaussian traces was addressed. However, the data traces of different users are dependent in many applications (e.g., among friends or relatives), and the adversary can potentially exploit such dependencies. In [17, 18], we extended the results of [13, 14] to the case where users are dependent and the adversary knows only the structure of the association graph, i.e., whether each pair of users is linked. As expected, knowledge of the dependency graph results in a significant degradation in privacy relative to the independent-user setting of [13, 14].

In this paper, we turn our attention to the case where user traces are i.i.d. (in time) samples of Gaussian random variables and the adversary knows not just the dependency graph but also the degree to which different users are correlated. Data points of users are continuous-valued, independent and identically distributed with respect to time, and the adversary knows the probability density function (pdf) of the data generated by all of the users. To preserve the privacy of users, anonymization is employed, i.e., the mapping between users and data sequences is randomly permuted for each block of consecutive user data points. We derive the number of observations per user that the adversary requires to break anonymity, in terms of the number of users and the size of the subgraph to which the user of interest belongs.

The rest of the paper is organized as follows. In Section II, we present the framework: system model, metrics, and definitions. In Section III, we present the construction and analysis. In Section IV, we discuss the results, and in Section V, we conclude.

II Framework

Consider a system with users. Denote by the data point of user at time , and by the vector containing the data points of user ,

To preserve the privacy of the users, anonymization is employed, i.e., the mapping between users and data sequences is randomly permuted. As shown in Figure 1, denote by the output of the anonymizer, which we term the “reported data point” of user at time . The permuted version of is

Fig. 1: Applying anonymization to the data point of user at time . denotes the actual data point of user at time , and denotes the reported data point of user at time .

There exists an adversary who wishes to break the anonymity, and thus the privacy, of the users. He/she observes the reported data points of the users at times , and combines them with his/her statistical knowledge to estimate the users' actual data points. For each set of time instances of user data points, a new anonymization mapping is employed.

II-A Models and Metrics

Data Points Model: Data points are independent and identically distributed (i.i.d.) with respect to time, i.e., , , is independent of . At time , the vector of user data points is drawn from a multivariate normal distribution; that is,

where is the mean vector, is the covariance matrix, is the covariance between users and , and the variances of the user data points are equal for all users. Following our previous work [17], the parameters of the distribution governing users' behavior are in turn drawn randomly. In particular, we assume the means are finite and are drawn independently from a continuous distribution , where for all in the support of

(1)

and the correlations are finite and are drawn independently from a continuous distribution , where for all in the support of

(2)

Note that the Cauchy–Schwarz inequality upper bounds the correlations.
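To make the data model concrete, the following is a minimal simulation sketch, not taken from the paper: it draws finite means from a continuous distribution, builds a unit-variance covariance matrix whose off-diagonal entries are bounded away from the Cauchy–Schwarz limit, and generates samples that are i.i.d. in time from the resulting multivariate normal distribution. All names (sample_model, n, m, rho_max) are illustrative, and the block structure imposed by the association graph (introduced next) is ignored for brevity.

```python
import numpy as np

def sample_model(n, m, rho_max=0.8, rng=None):
    """Illustrative sketch of the data model: n users, m i.i.d.-in-time samples each.

    Means are drawn i.i.d. from a continuous distribution; the covariance
    matrix has unit variances for all users and off-diagonal entries bounded
    in magnitude by rho_max < 1 (away from the Cauchy-Schwarz limit).
    """
    rng = np.random.default_rng(rng)
    mu = rng.uniform(-1.0, 1.0, size=n)      # finite means from a continuous law
    # Random correlation matrix, shrunk toward the identity so that the
    # result is positive semidefinite with unit diagonal.
    A = rng.standard_normal((n, n))
    C = A @ A.T
    d = np.sqrt(np.diag(C))
    R = C / np.outer(d, d)                   # correlation matrix, |entries| <= 1
    Sigma = (1.0 - rho_max) * np.eye(n) + rho_max * R
    # X[k, u] is the data point of user u at time k; rows are i.i.d. in time.
    X = rng.multivariate_normal(mu, Sigma, size=m)
    return mu, Sigma, X
```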

Association Graph: The dependencies between users are modeled by an association graph in which two users are connected if they are dependent. Denote by the association graph where is the set of nodes and is the set of edges. Also, denote by the correlation coefficient between users and . Observe

Similar to [17], the association graph consists of disjoint subgraphs , where each subgraph is connected and refers to a group of “friends” or “associates”. Let denote the number of nodes in , i.e., .

Fig. 2: The association graph consists of disjoint subgraphs , where is a connected graph on vertices.

Anonymization Model: As discussed in Section I, anonymization is employed to randomly permute the mapping between users and data sequences. We model the anonymization technique by a random permutation function on the set of users; the reported data points are then obtained by relabeling the users' traces according to this permutation.
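A minimal sketch of the anonymization step under the same illustrative setup, assuming the trace matrix X from the previous sketch (rows indexed by time, columns by users): a uniformly random permutation relabels the columns, and only the relabeled traces are released.

```python
import numpy as np

def anonymize(X, rng=None):
    """Randomly permute the user-to-trace mapping for one block of data.

    X has shape (m, n): X[k, u] is the data point of user u at time k.
    Returns the reported traces Y and the permutation Pi, where column u
    of Y is the trace of user Pi[u].  Pi is hidden from the adversary;
    only Y is observed.
    """
    rng = np.random.default_rng(rng)
    Pi = rng.permutation(X.shape[1])   # realization of the random permutation
    return X[:, Pi], Pi
```

In the model of this section, a fresh permutation of this kind is drawn for each new block of time instances.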

Adversary Model: The adversary knows the multivariate normal distribution from which the data points of the users are drawn. Therefore, in contrast to [17], the adversary knows both the structure of the association graph and the correlation coefficient for each pair of users. The adversary also knows the anonymization mechanism; however, he/she does not know the realization of the random permutation function.

The situation in which the user has no privacy is defined as follows [13]:

Definition 1.

User has no privacy at time , if and only if there exists an algorithm for the adversary to estimate perfectly as goes to infinity. In other words, as ,

where is the estimated value of by the adversary.

III Impact of Dependency on Privacy Using Anonymization

Here, we consider to what extent inter-user dependency limits privacy in the case where users' data points are governed by a Gaussian distribution.

First, we consider the ability of the adversary to fully reconstruct the structure of the association graph of the anonymized version of the data with arbitrarily small error probability.

Lemma 1.

If the adversary obtains anonymized observations, he/she can reconstruct , where , such that with high probability, for all ; iff . We write this statement as .

Proof.

From the observations, the adversary can calculate the empirical covariance for each pair of users and ,

(3)

where

(4)

Define the event

By Chebyshev’s inequality,

(5)

Define

Consider the following fact.

Fact 1.

If and are Gaussian random variables with finite means and variances, and , then the moment of is finite.

Proof.

The proof follows from the Cauchy–Schwarz inequality and the fact that the moments of and are finite. ∎
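To spell out the Cauchy–Schwarz step, and assuming Fact 1 concerns the moments of the product of the two Gaussian random variables (the exact quantity is lost in the statement above), the argument is:

```latex
% Assuming Fact 1 bounds the k-th moment of the product XY:
\mathbb{E}\!\left[|XY|^{k}\right]
  \le \sqrt{\mathbb{E}\!\left[X^{2k}\right]\,\mathbb{E}\!\left[Y^{2k}\right]} < \infty ,
```

since every moment of a Gaussian random variable with finite mean and variance is finite.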

By Fact 1, . Therefore, the Marcinkiewicz–Zygmund inequality [19] yields

(6)

where is a constant independent of .

Now let

Since is a convex function of when , Jensen’s inequality yields:

(7)

By (6) and (7),

Combined with (5),

(8)

where is finite when , following from Fact 1.

Define and ,

Since , we conclude

(9)

Similar to the approach leading to (9), we show that

(10)

For each pair of , the union bound yields

In addition, applying the union bound again and substituting the values of , and yields:

as . Thus, for all , with high probability we achieve

Consequently, (3) yields

Similarly, we can show that

Thus,

Now, if , then , and thus,

Similarly, if , then , and thus,

Consequently, the adversary can reconstruct the association graph of the anonymized version of the data with arbitrarily small error probability. ∎
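The reconstruction step of Lemma 1 can be illustrated by the following sketch (names are illustrative; the fixed threshold tau stands in for the vanishing threshold used in the proof): the adversary computes the empirical covariance of every pair of anonymized traces and declares an edge exactly when the estimate is bounded away from zero.

```python
import numpy as np

def reconstruct_graph(Y, tau):
    """Sketch of Lemma 1: rebuild the association graph of the anonymized data.

    Y has shape (m, n), with columns indexed by anonymized user labels.
    An edge is declared between anonymized users i and j when the magnitude
    of their empirical covariance exceeds the threshold tau.
    """
    m, n = Y.shape
    Y_bar = Y.mean(axis=0)                          # empirical mean of each column
    C_hat = (Y - Y_bar).T @ (Y - Y_bar) / m         # empirical covariances
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(C_hat[i, j]) > tau}
```

In the proof, the threshold shrinks with the number of observations so that, by the Chebyshev and Marcinkiewicz–Zygmund bounds, both missed and spurious edges occur with vanishing probability.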

Next, we demonstrate how the adversary can identify group among all of the groups. Note that this is the key step which speeds up the adversary’s algorithm relative to the case where user traces are independent.

Lemma 2.

If the adversary obtains anonymized observations and knows the structure of the association graph, he/she can identify group among all of the groups with arbitrarily small error probability.

Proof.

Note that there are at most groups of size which we denote . Without loss of generality, we assume the members of group are users .

By (4), for all members of group (), the empirical mean is:

(11)

For , define vectors of and with length :

and for , define

Also, define triangular arrays :

Let be the set of all permutations on elements; for , is a one-to-one mapping. By [18, Equation 6],

(12)

Next, we show when and is large enough,

  • ,

where .

First, we prove with high probability. Substituting in (8) and (9) yields:

(13)

and

(14)

By the union bound, as ,

(15)

Consequently, with high probability as .

Next, we show

Note that by (1) and (2), for all groups other than group ,

Similarly, for any ,

Thus, as , the union bound yields:

Since all the distances between ’s and are larger than , we show

By the union bound, for ,

as . Consequently, for all , the ’s are close to the ’s; thus, for large enough ,

Hence, the adversary can successfully identify group among all of the groups with arbitrarily small error probability. ∎
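A sketch of the group-identification step of Lemma 2, under illustrative assumptions: the adversary knows the true means of the members of group 1 and compares them, after sorting, with the empirical means of every candidate group of the right size in the reconstructed graph. Sorting both vectors is a stand-in for the minimization over within-group permutations that appears in the proof.

```python
import numpy as np

def identify_group(Y, candidate_groups, mu_group1):
    """Sketch of Lemma 2: find group 1 among candidate groups of the same size.

    Y has shape (m, n); candidate_groups is a list of index tuples (connected
    components of the reconstructed graph with the right size); mu_group1
    holds the known true means of group 1's members.
    """
    Y_bar = Y.mean(axis=0)
    target = np.sort(np.asarray(mu_group1))
    dists = [np.linalg.norm(np.sort(Y_bar[list(g)]) - target)
             for g in candidate_groups]
    return candidate_groups[int(np.argmin(dists))]
```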

Finally, we show that the adversary can identify all of the members of group with arbitrarily small error probability.

Lemma 3.

If the adversary obtains anonymized observations, and group is identified among all the groups, the adversary can identify user with arbitrarily small error probability.

Proof.

Define sets and around (See Figure 3):

where

Fig. 3: , sets and for case .

Next, we show that when and is large enough,

In other words, the adversary examines ’s which are defined according to (11) and chooses the only one that belongs to .

Substituting in (9) yields:

Thus, for large enough ,

Next, we show that when ,

By (1) and (2),

Therefore, the union bound yields:

Consequently, all ’s are outside of with high probability. Next, we prove is small. Observe:

hence, by the union bound, when is large enough,

Thus, if , there exists an algorithm for the adversary to successfully identify user among all the users. ∎
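A sketch of the final step (Lemma 3), with illustrative names: once group 1 has been located, the adversary computes the empirical mean of each trace in that group and returns the unique one falling within a small radius r of the known mean of user 1. In the proof, the radius shrinks with the number of observations so that, with high probability, exactly one trace qualifies.

```python
def identify_user(Y, group, mu_1, r):
    """Sketch of Lemma 3: pick out user 1 inside the identified group.

    Y has shape (m, n); group is the tuple of anonymized indices identified
    as group 1; mu_1 is the known mean of user 1.
    """
    Y_bar = Y.mean(axis=0)
    inside = [i for i in group if abs(Y_bar[i] - mu_1) < r]
    return inside[0] if len(inside) == 1 else None   # None: identification failed
```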

Next, we present Theorem 1, which follows from Lemmas 1, 2, and 3. In this theorem, we determine the required number of observations per user for the adversary to break the privacy of each user, in terms of the number of users and the size of the group to which the user of interest belongs.

Theorem 1.

If the adversary knows both the structure of the association graph and the correlation coefficients between users, and

  • , for any ;

then, user has no privacy at time .
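Theorem 1 can be read as the composition of Lemmas 1, 2, and 3. The following sketch strings together the illustrative functions given after those lemmas (connected_components is a small helper added here; the group size, the true means, and the thresholds tau and r are treated as known side information, as in the lemmas).

```python
def connected_components(edges, n):
    """Connected components of an undirected graph on n nodes given as an edge set."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in edges:
        parent[find(i)] = find(j)
    comps = {}
    for v in range(n):
        comps.setdefault(find(v), []).append(v)
    return [tuple(c) for c in comps.values()]

def break_anonymity(Y, group1_size, mu_group1, mu_1, tau, r):
    """Sketch of the adversary's end-to-end procedure behind Theorem 1."""
    E_hat = reconstruct_graph(Y, tau)                    # Lemma 1: rebuild the graph
    groups = connected_components(E_hat, Y.shape[1])
    candidates = [g for g in groups if len(g) == group1_size]
    group1 = identify_group(Y, candidates, mu_group1)    # Lemma 2: locate group 1
    return identify_user(Y, group1, mu_1, r)             # Lemma 3: locate user 1
```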

Lastly, in Theorem 2, we consider the case where the adversary knows only the association graph, but not necessarily the correlation coefficients between the users. Similar to the arguments leading to Theorem 1 and [17, Theorem 1], we show that if is significantly larger than , then the adversary can successfully break the privacy of the user of interest, i.e., he/she can find an algorithm to estimate the actual data points of the user with vanishingly small error probability.

Theorem 2.

If the adversary knows the structure of the association graph, and

  • , for any ;

then, user has no privacy at time .

IV Discussion

Here, we compare our results with previous work. When the users are independent, the adversary can break the privacy of each user if the number of the adversary's observations per user is [15] (Case 1 in Figure 4). However, when the users are dependent and the adversary knows their association graph (but not the correlation coefficients), each user has no privacy if (Theorem 2: Case 2 in Figure 4). Note that a smaller means more rapid changes of pseudonyms, which reduces utility. The number of per-user observations required for the adversary to break the privacy of each user is reduced further () when the adversary has more information, namely the correlation coefficients between users (Theorem 1: Case 3 in Figure 4). In other words, the more the adversary knows, the smaller must be, and we have characterized the loss of privacy incurred by these various degrees of knowledge of the dependency.

Fig. 4: Comparing the required number of observations per user for the adversary to break the privacy of each user for three cases: 1) independent users; 2) dependent users, where the adversary knows only the association graph; 3) dependent users, where the adversary knows both the association graph and the correlation coefficients between users.

V Conclusion

Many popular applications use traces of user data, e.g., users' location information or medical records, to offer various services to the users. However, revealing user information to such applications puts users' privacy at stake, as adversaries can infer sensitive private information about the users, such as their behaviors, interests, and locations. In this paper, anonymization is employed to protect users' privacy when the data traces of each user observed by the adversary are governed by i.i.d. Gaussian sequences and the data traces of different users are dependent. An association graph captures the dependency between users, and both the structure of this association graph and the nature of the dependency between users (the pairwise correlation coefficients) are known to the adversary. We show that dependency has a disastrous effect on the privacy of users. In comparison to the case in which the data traces of different users are independent, here we must use a stronger anonymization technique, drastically increasing the rate at which user pseudonyms are changed, which degrades system utility.

References

  • [1] R. Shokri, G. Theodorakopoulos, G. Danezis, J.-P. Hubaux, and J.-Y. Le Boudec, “Quantifying location privacy: the case of sporadic location exposure,” in International Symposium on Privacy Enhancing Technologies Symposium.   Springer, 2011, pp. 57–76.
  • [2] C. A. Ardagna, M. Cremonini, S. D. C. di Vimercati, and P. Samarati, “An obfuscation-based approach for protecting location privacy,” IEEE Transactions on Dependable and Secure Computing, vol. 8, no. 1, pp. 13–27, 2011.
  • [3] A. C. Polak and D. L. Goeckel, “Identification of wireless devices of users who actively fake their RF fingerprints with artificial data distortion,” IEEE Transactions on Wireless Communications, vol. 14, no. 11, pp. 5889–5899, Nov. 2015.
  • [4] A. Ukil, S. Bandyopadhyay, and A. Pal, “IoT-privacy: To be private or not to be private,” in IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS).   Toronto, ON, Canada: IEEE, 2014, pp. 123–124.
  • [5] G. Loukides and A. Gkoulalas-Divanis, “Utility-preserving transaction data anonymization with low information loss,” Expert systems with applications, vol. 39, no. 10, pp. 9764–9777, 2012.
  • [6] H. Lee, S. Kim, J. W. Kim, and Y. D. Chung, “Utility-preserving anonymization for health data publishing,” BMC medical informatics and decision making, vol. 17, no. 1, p. 104, 2017.
  • [7] M. Batet, A. Erola, D. Sánchez, and J. Castellà-Roca, “Utility preserving query log anonymization via semantic microaggregation,” Information Sciences, vol. 242, pp. 49–63, 2013.
  • [8] A. Ukil, S. Bandyopadhyay, and A. Pal, “IoT-privacy: To be private or not to be private,” in Computer Communications Workshops (INFOCOM WKSHPS), 2014 IEEE Conference on.   IEEE, 2014, pp. 123–124.
  • [9] H. Lin and N. W. Bergmann, “IoT privacy and security challenges for smart home environments,” Information, vol. 7, no. 3, p. 44, 2016.
  • [10] R. Shokri, G. Theodorakopoulos, J.-Y. Le Boudec, and J.-P. Hubaux, “Quantifying location privacy,” in 2011 IEEE Symposium on Security and Privacy.   IEEE, 2011, pp. 247–262.
  • [11] R. Shokri, G. Theodorakopoulos, C. Troncoso, J.-P. Hubaux, and J.-Y. Le Boudec, “Protecting location privacy: optimal strategy against localization attacks,” in Proceedings of the 2012 ACM conference on Computer and communications security.   ACM, 2012, pp. 617–627.
  • [12] Z. Ma, F. Kargl, and M. Weber, “A location privacy metric for V2X communication systems,” in IEEE Sarnoff Symposium (SARNOFF ’09).   IEEE, 2009, pp. 1–6.
  • [13] N. Takbiri, A. Houmansadr, D. L. Goeckel, and H. Pishro-Nik, “Limits of location privacy under anonymization and obfuscation,” in International Symposium on Information Theory (ISIT).   Aachen, Germany: IEEE, 2017, pp. 764–768.
  • [14] N. Takbiri, A. Houmansadr, D. L. Goeckel, and H. Pishro-Nik, “Matching anonymized and obfuscated time series to users’ profiles,” IEEE Transactions on Information Theory, accepted for publication, 2018, Available at https://arxiv.org/abs/1710.00197.
  • [15] K. Le, H. Pishro-Nik, and D. Goeckel, “Bayesian time series matching and privacy,” in 51st Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 2017.
  • [16] N. Takbiri, A. Houmansadr, D. Goeckel, and H. Pishro-Nik, “Fundamental limits of location privacy using anonymization,” in 51st Annual Conference on Information Science and Systems (CISS).   Baltimore, MD, USA: IEEE, 2017.
  • [17] N. Takbiri, A. Houmansadr, D. L. Goeckel, and H. Pishro-Nik, “Privacy of dependent users against statistical matching,” submitted to IEEE Transactions on Information Theory, Available at https://arxiv.org/abs/1710.00197.
  • [18] N. Takbiri, A. Houmansadr, D. L. Goeckel, and H. Pishro-Nik, “Privacy against statistical matching: Inter- user correlation,” in International Symposium on Information Theory (ISIT).   Vail, Colorado, USA: IEEE, 2018, pp. 1036–1040.
  • [19] Y. S. Chow and H. Teicher, Probability theory: independence, interchangeability, martingales.   Springer Science & Business Media, 2012.