I Introduction
The Internet of Things (IoT) enables users to share and access information on a large scale and provides many benefits for individuals (e.g., smart homes, healthcare) and industries (e.g., digital tracking, data collection, disaster management) by tuning the system to user characteristics based on (potentially sensitive) information about their activities. Thus, the use of IoT comes with a significant threat to users’ privacy: leakage of sensitive information.
Two main privacy-preserving techniques are anonymization [1] and obfuscation [2]: the former hides the mapping between data and users by replacing the identification fields of users with pseudonyms, while the latter perturbs the user data such that the adversary observes false but plausible data. Although these methods have been widely studied, statistical inference methods can be applied to break the privacy of the users [3, 4]. Furthermore, achieving privacy using these methods comes at a cost: reducing the utility of the system to the user. Hence, it is crucial to consider the tradeoff between privacy and utility when employing privacy-preserving techniques, and to seek to achieve privacy with minimal loss of functionality and usability [5, 6, 7]. Despite the growing interest in IoT privacy [8, 9], previous works do not offer theoretical guarantees on the tradeoff between privacy and utility. The works of Shokri et al. [10, 1, 11] and Ma et al. [12] provide significant advances in the quantitative analysis of privacy; however, they are not based on a solid theoretical framework.
In [13, 14, 15], the data traces of different users are modeled as independent, and the asymptotic limits of user privacy are presented for the case when both anonymization and obfuscation are applied to users’ time series of data. In [13, 14, 16]
, each user’s data trace is governed by either: 1) independent and identically distributed (i.i.d.) samples of a Multinoulli distribution (generalized Bernoulli distribution); or 2) Markov chain samples of a Multinoulli distribution. In
[15], the case of independent users with Gaussian traces was addressed. However, the data traces of different users are dependent in many applications (e.g., friends, relatives), and the adversary can potentially exploit such dependencies. In [17, 18], we extended the results of [13, 14] to the case where users are dependent and the adversary knows only the structure of the association graph, i.e., whether each pair of users is linked. As expected, knowledge of the dependency graph results in a significant degradation in privacy [13, 14]. In this paper, we turn our attention to the case where user traces are i.i.d. samples of Gaussian random variables and the adversary knows not just the dependency graph but also the degree to which different users are correlated. Data points of users are continuous-valued and i.i.d. with respect to time, and the adversary knows the probability density function (pdf) of the data generated by all of the users. To preserve the privacy of users, anonymization is employed, i.e., the mapping between users and data sequences is randomly permuted for each set of
consecutive user data points. We derive the minimum number of adversary observations per user that ensures privacy, with respect to the number of users and the size of the subgraph to which the user belongs.
II Framework
Consider a system with users. Denote by the data point of user at time , and by the vector containing the data points of user ,
To preserve the privacy of the users, anonymization is employed, i.e., the mapping between users and data sequences is randomly permuted. As shown in Figure 1, denote by the output of the anonymizer, which we term the “reported data point” of user at time . The permuted version of is
There exists an adversary who wishes to break the anonymity, and thus the privacy, of the users. He/she observes , which are the reported data points of users at times
, and combines them with his/her statistical knowledge to estimate the users’ actual data points. For each set of
time instances of user data points, a new anonymization mapping is employed.
II-A Models and Metrics
Data Points Model: Data points are independent and identically distributed (i.i.d.) with respect to time, i.e., , , is independent of . At time
, the vector of user data points is drawn from a multivariate normal distribution; that is,
where is the mean vector, is the covariance matrix, is the covariance between users and
, and the variances of the user data points are equal for all users. Following our previous work [17], the parameters of the distribution governing users’ behavior are in turn drawn randomly. In particular, we assume the means are finite and are drawn independently from a continuous distribution , where for all in the support of
(1)
and the correlations are finite and are drawn independently from a continuous distribution , where for all in the support of
(2) 
Note that the Cauchy–Schwarz inequality upper bounds the correlations.
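As a concrete illustration of this data model (a sketch under stated assumptions, not the authors' code), the following draws i.i.d. multivariate normal data vectors with equal per-user variances and nonzero correlation only within assumed "friend" groups; the group sizes, variance, in-group correlation, and the uniform range for the means are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 6, 1000                   # users, observations per user (illustrative)
groups = [[0, 1, 2], [3, 4, 5]]  # disjoint friend groups (assumed)
sigma2 = 1.0                     # common variance for all users

# Means drawn i.i.d. from a continuous distribution with bounded support.
mu = rng.uniform(-1.0, 1.0, size=n)

# Covariance: equal variances; nonzero correlation only within a group.
Sigma = np.eye(n) * sigma2
for g in groups:
    for u in g:
        for v in g:
            if u != v:
                Sigma[u, v] = 0.4 * sigma2  # assumed in-group correlation

# m i.i.d. draws of the n-user data vector; row k is the vector at time k.
X = rng.multivariate_normal(mu, Sigma, size=m)
```

Each row of `X` is one time instance of all users' data points; independence across time holds because the rows are drawn independently.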
Association Graph: The dependencies between users are modeled by an association graph in which two users are connected if they are dependent. Denote by the association graph where is the set of nodes and is the set of edges. Also, denote by the correlation coefficient between users and . Observe
Similar to [17], the association graph consists of disjoint subgraphs , where each subgraph is connected and refers to a group of “friends” or “associates”. Let denote the number of nodes in , i.e., .
Anonymization Model: As we discussed in Section I, anonymization is employed to randomly permute the mapping between users and data sequences. We model the anonymization technique by a random permutation function on the set of users. Then, ,
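A minimal sketch of this anonymization step, assuming traces are stored one user per row (the array sizes are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# X holds one trace per row: row u contains user u's data points.
X = np.arange(12, dtype=float).reshape(4, 3)  # 4 users, 3 time steps

perm = rng.permutation(X.shape[0])  # random permutation (unknown to the adversary)
Y = X[perm]                         # reported (anonymized) traces

# The adversary observes Y only; row order no longer reveals user identity.
```

Under the model above, a fresh permutation is drawn for each new set of time instances.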
Adversary Model: The adversary knows the multivariate normal distribution from which the data points of users are drawn. Therefore, in contrast to [17], the adversary knows both the structure of the association graph and the correlation coefficients for each pair of users . The adversary also knows the anonymization mechanism; however, he/she does not know the realization of the random permutation function.
The situation in which the user has no privacy is defined as follows [13]:
Definition 1.
User has no privacy at time , if and only if there exists an algorithm for the adversary to estimate perfectly as goes to infinity. In other words, as ,
where is the estimated value of by the adversary.
III Impact of Dependency on Privacy Using Anonymization
Here, we consider to what extent inter-user dependency limits privacy in the case where users’ data points are governed by a Gaussian distribution.
First, we consider the ability of the adversary to fully reconstruct the structure of the association graph of the anonymized version of the data with arbitrarily small error probability.
Lemma 1.
If the adversary obtains anonymized observations, he/she can reconstruct , where , such that with high probability, for all ; iff . We write this statement as .
Proof.
From the observations, the adversary can calculate the empirical covariance for each pair of user and user ,
(3) 
where
(4) 
Define the event
By Chebyshev’s inequality,
(5) 
Define
Consider the following fact.
Fact 1.
If and are Gaussian random variables with finite means and variances, and , then the moment of is finite.
Proof.
The proof follows from the Cauchy–Schwarz inequality and the fact that the moments of and are finite. ∎
By Fact 1, . Therefore, the Marcinkiewicz–Zygmund inequality [19] yields
(6) 
where is a constant independent of .
Now let
Since is a convex function of when , Jensen’s inequality yields:
(7) 
Combined with (5),
(8) 
where is finite when , following from Fact 1.
Define and ,
Since , we conclude
(9) 
Similar to the approach leading to (9), we show that
(10) 
For each pair of , the union bound yields
In addition, applying the union bound again and substituting the values of , and yields:
as . Thus, for all , with high probability we achieve
Consequently, (3) yields
Similarly, we can show that
Thus,
Now, if , then , and thus,
Similarly, if , then , and thus,
Consequently, the adversary can reconstruct the association graph of the anonymized version of the data with arbitrarily small error probability. ∎
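The graph-reconstruction step of Lemma 1 can be sketched numerically as follows (a hedged illustration, not the authors' implementation): compute the empirical covariance of every pair of anonymized traces and declare an edge whenever its magnitude clears a threshold. The group structure, the in-group correlation 0.5, the sample size, and the threshold 0.25 are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def reconstruct_graph(Y, threshold):
    """Declare an edge between anonymized users i, j when the magnitude of
    their empirical covariance exceeds the threshold."""
    n, m = Y.shape
    mu_hat = Y.mean(axis=1)
    C = (Y @ Y.T) / m - np.outer(mu_hat, mu_hat)  # empirical covariances
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(C[i, j]) > threshold}

# Illustrative data: users 0-2 correlated, users 3-5 correlated,
# and the two groups mutually independent.
Sigma = np.eye(6)
for g in ([0, 1, 2], [3, 4, 5]):
    for u in g:
        for v in g:
            if u != v:
                Sigma[u, v] = 0.5
Y = rng.multivariate_normal(np.zeros(6), Sigma, size=5000).T  # traces as rows

edges = reconstruct_graph(Y, threshold=0.25)
```

With enough observations per user, the empirical covariances concentrate around their true values, so thresholding separates linked from unlinked pairs, mirroring the concentration argument in the proof.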
Next, we demonstrate how the adversary can identify group among all of the groups. Note that this is the key step that speeds up the adversary’s algorithm relative to the case where user traces are independent.
Lemma 2.
If the adversary obtains anonymized observations and knows the structure of the association graph, he/she can identify group among all of the groups with arbitrarily small error probability.
Proof.
Note that there are at most groups of size which we denote . Without loss of generality, we assume the members of group are users .
By (4), for all members of group (), the empirical mean is:
(11) 
For , define vectors of and with length :
and for , define
Also, define triangular arrays :
Let be the set of all permutations on elements; for , is a one-to-one mapping. By [18, Equation 6],
(12) 
Next, we show when and is large enough,


,
where .
First, we prove with high probability. Substituting in (8) and (9) yields:
(13) 
and
(14) 
By the union bound, as ,
(15) 
Consequently, with high probability as .
Next, we show
Note that by (1) and (2), for all groups other than group ,
Similarly, for any ,
Thus, as , the union bound yields:
Since all the distances between ’s and are larger than , we show
By the union bound, for ,
as . Consequently, for all , ’s are close to ’s; thus, for large enough ,
Hence, the adversary can successfully identify group among all of the groups with arbitrarily small error probability. ∎
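The group-identification step of Lemma 2 can be sketched as follows (illustrative assumptions throughout: two candidate groups of size three with known means and unit-variance noise): the adversary compares the sorted empirical means of the anonymized traces against each candidate group's sorted means and picks the closest match.

```python
import numpy as np

rng = np.random.default_rng(3)

# Per-user means known to the adversary, partitioned into two groups (assumed).
groups = {"G1": np.array([-0.8, -0.3, 0.6]),
          "G2": np.array([0.1, 0.9, -0.5])}

# Anonymized traces of one group (here G1), rows in an unknown permuted order.
m = 4000
perm = rng.permutation(3)
Y = groups["G1"][perm, None] + rng.standard_normal((3, m))

# Sorting removes the unknown permutation; match against each candidate group.
emp = np.sort(Y.mean(axis=1))
best = min(groups, key=lambda g: np.max(np.abs(np.sort(groups[g]) - emp)))
```

Because the means are drawn from a continuous distribution, distinct groups have well-separated sorted mean vectors with probability one, so the closest match is correct once the empirical means have concentrated.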
Finally, we show that the adversary can identify all of the members of group with arbitrarily small error probability.
Lemma 3.
If the adversary obtains anonymized observations, and group is identified among all the groups, the adversary can identify user with arbitrarily small error probability.
Proof.
Next, we show that when and is large enough,
In other words, the adversary examines ’s which are defined according to (11) and chooses the only one that belongs to .
Next, we show that when ,
Therefore, the union bound yields:
Consequently, all ’s are outside of with high probability. Next, we prove is small. Observe:
hence, by the union bound, when is large enough,
Thus, if , there exists an algorithm for the adversary to successfully identify user among all the users. ∎
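The within-group identification step of Lemma 3 admits a similar sketch (again with assumed means and unit-variance noise): having isolated the group, the adversary attributes to the user of interest the anonymized trace whose empirical mean is closest to that user's known mean.

```python
import numpy as np

rng = np.random.default_rng(4)

# Known per-member means within the identified group (assumed values).
mus = np.array([-0.7, 0.2, 0.9])
perm = rng.permutation(3)   # unknown anonymization within the group
m = 4000
Y = mus[perm, None] + rng.standard_normal((3, m))

# Pick the anonymized row whose empirical mean is nearest the target mean.
target = mus[0]             # mean of the user of interest
est_row = int(np.argmin(np.abs(Y.mean(axis=1) - target)))
# est_row is the anonymized index now attributed to the user of interest
```

Since distinct members have distinct means (drawn from a continuous distribution), the nearest empirical mean identifies the right trace with high probability as the number of observations grows.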
Next, we present Theorem 1, which follows from Lemmas 1, 2, and 3. In this theorem, we determine the required number of observations per user for the adversary to break the privacy of each user, in terms of the number of users and the size of the group to which the user of interest belongs.
Theorem 1.
If the adversary knows both the structure of the association graph and the correlation coefficient between users, and

, for any ;
then, user has no privacy at time .
Lastly, in Theorem 2, we consider the case where the adversary knows only the association graph, but not necessarily the correlation coefficients between the users. Similar to the arguments leading to Theorem 1 and [17, Theorem 1], we show that if is significantly larger than , then the adversary can successfully break the privacy of the user of interest, i.e., he/she can find an algorithm to estimate the actual data points of the user with vanishingly small error probability.
Theorem 2.
If the adversary knows the structure of the association graph, and

, for any ;
then, user has no privacy at time .
IV Discussion
Here, we compare our results with previous work. When the users are independent, the adversary can break the privacy of each user if the number of the adversary’s observations per user is [15] (Case 1 in Figure 4). However, when the users are dependent and the adversary knows their association graph (but not the correlation coefficients), each user will have no privacy if (Theorem 2: Case 2 in Figure 4). Note that smaller means rapid changes in pseudonyms, which reduces the utility. The required number of per-user observations for the adversary to break the privacy of each user reduces further when the adversary has more information: the correlation coefficients between users (Theorem 1: Case 3 in Figure 4). In other words, the more the adversary knows, the smaller must be, and we have characterized the loss of privacy associated with various degrees of knowledge of the dependency in this paper.
V Conclusion
Many popular applications use traces of user data, e.g., users’ location information or medical records, to offer various services to the users. However, revealing user information to such applications puts users’ privacy at stake, as adversaries can infer sensitive private information about the users, such as their behaviors, interests, and locations. In this paper, anonymization is employed to protect users’ privacy when the data traces of each user observed by the adversary are governed by i.i.d. Gaussian sequences and the data traces of different users are dependent. An association graph captures the dependency between users, and both the structure of this association graph and the nature of the dependency between users are known to the adversary. We show that dependency has a disastrous effect on the privacy of users. In comparison to the case in which data traces of different users are independent, here we must use a stronger anonymization technique by drastically increasing the rate at which user pseudonyms are changed, which degrades system utility.
References
 [1] R. Shokri, G. Theodorakopoulos, G. Danezis, J.-P. Hubaux, and J.-Y. Le Boudec, “Quantifying location privacy: the case of sporadic location exposure,” in International Symposium on Privacy Enhancing Technologies. Springer, 2011, pp. 57–76.
 [2] C. A. Ardagna, M. Cremonini, S. D. C. di Vimercati, and P. Samarati, “An obfuscation-based approach for protecting location privacy,” IEEE Transactions on Dependable and Secure Computing, vol. 8, no. 1, pp. 13–27, 2011.
 [3] A. C. Polak and D. L. Goeckel, “Identification of wireless devices of users who actively fake their rf fingerprints with artificial data distortion,” IEEE Transactions on Wireless Communications, vol. 14, no. 11, pp. 5889–5899, Nov 2015.
 [4] A. Ukil, S. Bandyopadhyay, and A. Pal, “IoT-privacy: To be private or not to be private,” in IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). Toronto, ON, Canada: IEEE, 2014, pp. 123–124.
 [5] G. Loukides and A. Gkoulalas-Divanis, “Utility-preserving transaction data anonymization with low information loss,” Expert Systems with Applications, vol. 39, no. 10, pp. 9764–9777, 2012.
 [6] H. Lee, S. Kim, J. W. Kim, and Y. D. Chung, “Utilitypreserving anonymization for health data publishing,” BMC medical informatics and decision making, vol. 17, no. 1, p. 104, 2017.
 [7] M. Batet, A. Erola, D. Sánchez, and J. Castellà-Roca, “Utility preserving query log anonymization via semantic microaggregation,” Information Sciences, vol. 242, pp. 49–63, 2013.
 [8] A. Ukil, S. Bandyopadhyay, and A. Pal, “IoT-privacy: To be private or not to be private,” in IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE, 2014, pp. 123–124.
 [9] H. Lin and N. W. Bergmann, “IoT privacy and security challenges for smart home environments,” Information, vol. 7, no. 3, p. 44, 2016.
 [10] R. Shokri, G. Theodorakopoulos, J.-Y. Le Boudec, and J.-P. Hubaux, “Quantifying location privacy,” in 2011 IEEE Symposium on Security and Privacy. IEEE, 2011, pp. 247–262.
 [11] R. Shokri, G. Theodorakopoulos, C. Troncoso, J.-P. Hubaux, and J.-Y. Le Boudec, “Protecting location privacy: optimal strategy against localization attacks,” in Proceedings of the 2012 ACM Conference on Computer and Communications Security. ACM, 2012, pp. 617–627.
 [12] Z. Ma, F. Kargl, and M. Weber, “A location privacy metric for V2X communication systems,” in IEEE Sarnoff Symposium (SARNOFF’09). IEEE, 2009, pp. 1–6.
 [13] N. Takbiri, A. Houmansadr, D. L. Goeckel, and H. Pishro-Nik, “Limits of location privacy under anonymization and obfuscation,” in International Symposium on Information Theory (ISIT). Aachen, Germany: IEEE, 2017, pp. 764–768.
 [14] N. Takbiri, A. Houmansadr, D. L. Goeckel, and H. Pishro-Nik, “Matching anonymized and obfuscated time series to users’ profiles,” IEEE Transactions on Information Theory, accepted for publication, 2018. Available at https://arxiv.org/abs/1710.00197.
 [15] K. Le, H. Pishro-Nik, and D. Goeckel, “Bayesian time series matching and privacy,” in 51st Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 2017.
 [16] N. Takbiri, A. Houmansadr, D. Goeckel, and H. Pishro-Nik, “Fundamental limits of location privacy using anonymization,” in 51st Annual Conference on Information Science and Systems (CISS). Baltimore, MD, USA: IEEE, 2017.
 [17] N. Takbiri, A. Houmansadr, D. L. Goeckel, and H. Pishro-Nik, “Privacy of dependent users against statistical matching,” submitted to IEEE Transactions on Information Theory. Available at https://arxiv.org/abs/1710.00197.
 [18] N. Takbiri, A. Houmansadr, D. L. Goeckel, and H. Pishro-Nik, “Privacy against statistical matching: Inter-user correlation,” in International Symposium on Information Theory (ISIT). Vail, Colorado, USA: IEEE, 2018, pp. 1036–1040.
 [19] Y. S. Chow and H. Teicher, Probability theory: independence, interchangeability, martingales. Springer Science & Business Media, 2012.