Mixes aim at providing anonymity in communication networks by acting as routers that hide the correspondence between senders and receivers of messages. These anonymous communication channels operate by gathering the messages they receive, changing their appearance cryptographically and outputting them in batches, in what are called rounds of mixing. However, providing perfect anonymity through mixes is not possible in practice, due to constraints in the bandwidth of the communication channel and the delay tolerated by users. Because of this, an adversary observing the system in the long term may infer the frequency with which a certain sender communicates with a certain receiver by means of a disclosure attack [1, 2, 3, 4, 5]. One of these strategies, called the Least Squares Disclosure Attack (LSDA) [5, 6, 7], has been proven to outperform previous statistical variants [1, 2] while keeping its computational cost much lower than that of more sophisticated approaches [3, 4]. One advantage of LSDA is that it is particularly suitable for analysis, due to the availability of closed-form expressions for its prediction error in terms of the system parameters. Such performance analysis is of paramount importance since it helps the designer of mix-based anonymous communication systems to understand how to improve the protection of the users.
Previous works have characterized this prediction error under specific assumptions on the users' and mix behavior. However, these results have only been confirmed with computer-generated observations, and therefore it is not clear whether they apply in real-world scenarios. In this document, we delve into how users behave in reality. We gather data from real datasets of different nature, which we then use to show that previous analyses of the attack fall short when tested against real data. We analyze the hypotheses that are needed for the performance analysis of LSDA to be applicable in real-world scenarios and develop a new generalized closed-form expression for the attacker's error when estimating the relationships between users in mixes, which we then evaluate with real traffic. Real-world datasets have been used in other works to compare different disclosure attacks or to analyze the properties of real traffic [9, 10]. Our approach is different, as we are interested in understanding the effects of real-world user behavior on the performance of the least squares disclosure attack.
The document is structured as follows: we describe the least squares attack in the following section, together with the system model and notation we use in the paper. In Sect. III, we study the statistical properties of real-world behavior in our system. We carry out and evaluate a new performance analysis of LSDA in Sect. IV, and conclude in Sect. V.
II The Least Squares Disclosure Attack
The Least Squares Disclosure Attack (LSDA), introduced by Pérez-González and Troncoso in [5], estimates the intensity of the communication between each sender-receiver pair in a mix-based anonymous channel by solving a least squares problem. This intensity is represented by the transition probabilities $p_{j,i}$, which model the average probability that a message sent by sender $j$ is addressed to receiver $i$. These probabilities are commonly grouped per sender in the so-called sending profiles, $\mathbf{p}_j \doteq [p_{j,1}, \ldots, p_{j,M}]$, where $M$ denotes the number of receivers (we use $N$ for the number of senders). An attacker that observes the number of messages sent and received during $\rho$ communication rounds obtains the LSDA estimator by solving

$$\hat{\mathbf{P}} = \arg\min_{\mathbf{P}} \|\mathbf{Y} - \mathbf{X}\mathbf{P}\|_F^2 = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \qquad (1)$$

where $\hat{\mathbf{P}}$ is an $N \times M$ matrix containing the estimate of the transition probability $p_{j,i}$ in its $(j,i)$-th entry, $\mathbf{X}$ is a $\rho \times N$ matrix containing the number of messages sent by sender $j$ in round $r$, denoted $x_j^r$, in its $(r,j)$-th entry, and $\mathbf{Y}$ is a $\rho \times M$ matrix with the number of messages received by receiver $i$ in round $r$, denoted $y_i^r$, in its $(r,i)$-th entry. Figure 1 shows an example of the system and notation employed. The estimator in (1) was proven to be unbiased and asymptotically efficient, in the sense that its variance approaches zero as the length of the observation window $\rho$ increases [5, 7], in mix-based systems where all messages leave the mix in each round.
Denoting the $i$-th column of $\mathbf{Y}$ by $\mathbf{y}_i$, (1) can be decoupled into one least squares problem per receiver: the $i$-th column of $\hat{\mathbf{P}}$ equals

$$(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}_i \qquad (2)$$
This latter formulation is especially useful to carry out a performance analysis of the attack.
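The estimator in (1) can be sketched in a few lines of numpy. The profiles, input rates and system size below are hypothetical toy values used only to exercise the estimator; the multinomial recipient choice is one of the output models discussed later in this document.

```python
import numpy as np

def lsda_estimate(X, Y):
    """Least squares estimate of the transition probabilities, as in (1).

    X: (rho, N) array, X[r, j] = messages sent by sender j in round r.
    Y: (rho, M) array, Y[r, i] = messages received by receiver i in round r.
    Returns P_hat: (N, M) array, P_hat[j, i] estimates p_{j,i}.
    """
    # Solve min_P ||Y - X P||_F^2; lstsq is numerically safer than
    # explicitly forming (X^T X)^{-1} X^T Y.
    P_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return P_hat

# Toy system (hypothetical): 2 senders with known profiles, Poisson inputs,
# recipients chosen independently per message (multinomial outputs).
rng = np.random.default_rng(0)
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
rho = 5000
X = rng.poisson(5.0, size=(rho, 2)).astype(float)
Y = np.zeros((rho, 2))
for r in range(rho):
    for j in range(2):
        Y[r] += rng.multinomial(int(X[r, j]), P[j])

P_hat = lsda_estimate(X, Y)
```

With enough observed rounds, `P_hat` recovers the toy profiles closely, illustrating the unbiasedness discussed above.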
III Modeling Real-World Behavior
In this section, we study real-world user behavior from observations generated with real traffic, showing that previous performance analyses of LSDA are not valid in this scenario because the assumptions they are based on are rather unrealistic. We propose alternative hypotheses that are adequate to model real-world user behavior, which we then use in Sect. IV to assess the performance of the LSDA estimator.
III-A Generating real-world observations
In order to analyze real-world behavior, we generate observations by taking real traffic from datasets of different nature, whose users could have relied on mix-based systems to enhance their privacy, and anonymizing this traffic using different mix configurations. We work with three datasets, whose basic information is summarized in Table I:
Email: This dataset contains emails exchanged between different email addresses, extracted from the Enron corpus (http://www.cs.cmu.edu/~./enron/). Messages with multiple recipients are treated as different messages sent simultaneously, one for each recipient.
Location: This dataset contains location check-ins taken from the most active users of the Gowalla social networking website (http://snap.stanford.edu/data/loc-gowalla.html). Users checking in are considered the senders, while the locations form the set of receivers. We consider only the most active users for computational reasons: LSDA works with large matrices whose size grows with the number of senders and receivers of the system.
MailingList: We have processed the public mailing lists of Indymedia (http://lists.indymedia.org/), obtaining the messages posted by the most active senders. Each mailing list is considered a receiver, while users posting to these mailing lists are the senders.
Table I columns: Dataset | No. messages | Duration (hours) | Senders | Receivers
We anonymize the traces from these datasets using two types of mixes, which differ in the event that triggers the flushing of messages:
Threshold mix: this mix gathers messages until it has stored $t$ of them, and then forwards each one to its corresponding recipient.
Timed mix: this mix stores the messages it receives and, after a period of time $\tau$, outputs each one to its recipient.
To generate the adversary's observations, we choose values of $t$ and $\tau$ that provide an acceptable degree of anonymity while keeping the delay of messages under a reasonable bound. We adopt the following criteria: in the threshold mix, we fix the threshold $t$ and, in the timed mix, we select the value of $\tau$ so that a comparable number of messages is mixed on average per round, while also capping the delay that users must tolerate. This yields a different mixing period $\tau$, and a different average number of messages per round, for each of the Email, Location and MailingList datasets. The result of this anonymization is a set of observations $\mathbf{X}$ and $\mathbf{Y}$.
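The anonymization step above can be sketched as follows: a trace of timestamped messages is grouped into rounds either by a threshold or by a flush period, and the rounds are then turned into the adversary's observation matrices. The trace format `(time, sender, receiver)` and the helper names are our own; this is a minimal sketch, not the exact pipeline used in the experiments.

```python
from collections import defaultdict

def threshold_mix_rounds(messages, t):
    """Group a trace of (time, sender, receiver) messages into rounds of a
    threshold mix: the mix flushes every time t messages have accumulated."""
    rounds, batch = [], []
    for msg in sorted(messages):      # process in time order
        batch.append(msg)
        if len(batch) == t:
            rounds.append(batch)
            batch = []
    return rounds                     # a trailing partial batch is dropped

def timed_mix_rounds(messages, tau):
    """Group the same trace into rounds of a timed mix that flushes every
    tau time units, starting at the time of the first message."""
    msgs = sorted(messages)
    t0 = msgs[0][0]
    buckets = defaultdict(list)
    for msg in msgs:
        buckets[int((msg[0] - t0) // tau)].append(msg)
    return [buckets[k] for k in sorted(buckets)]  # empty rounds are skipped

def observations(rounds, senders, receivers):
    """Build the adversary's matrices X and Y as lists of per-round counts
    (rows: rounds; columns: messages sent per sender / received per receiver)."""
    X = [[sum(1 for _, s, _ in rnd if s == j) for j in senders] for rnd in rounds]
    Y = [[sum(1 for _, _, r in rnd if r == i) for i in receivers] for rnd in rounds]
    return X, Y

# Tiny hypothetical trace: two senders 'a', 'b' and two receivers 'u', 'v'.
trace = [(0.0, 'a', 'u'), (0.5, 'b', 'v'), (1.2, 'a', 'u'), (2.7, 'b', 'u')]
thr_rounds = threshold_mix_rounds(trace, 2)
tim_rounds = timed_mix_rounds(trace, 1.0)
X, Y = observations(thr_rounds, ['a', 'b'], ['u', 'v'])
```

The same trace thus yields different observations depending on the firing condition, which is exactly why $t$ and $\tau$ must be chosen jointly to balance anonymity and delay.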
III-B Modeling the input process
The input process, $x_j^r$, which models the number of messages from each user arriving at the mix in each round, is determined by the frequency with which users send messages and by the firing condition of the mix. Previous analyses [5, 6, 8] assume that the input process follows a multinomial distribution when the anonymization channel is a threshold mix, and that the number of messages each user sends to the mix can be independently modeled as a Poisson process when the channel is a timed mix.
In Fig. 8, we compare the histogram of the inputs $x_j^r$, obtained using the observations generated with our datasets, with the theoretical values given by the multinomial and Poisson models (in the threshold and the timed mixes, respectively). Here, the last bin of the histogram contains all occurrences of larger values. We conclude that the theoretical models fit the histogram for small numbers of messages $x_j^r$, but fail to capture the large values.
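The mismatch in the tail can be reproduced with a small sketch: fit a Poisson model to a set of per-round send counts by matching the mean, and compare model and empirical frequencies. The counts below are hypothetical, chosen to mimic bursty real traffic (many idle rounds, occasional bursts).

```python
import math
from collections import Counter

def poisson_pmf(k, lam):
    """Poisson probability mass function."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_fit_check(counts, k_max=5):
    """Fit a Poisson model by matching the sample mean, and return the
    (empirical, model) frequencies of sending 0..k_max messages per round."""
    lam = sum(counts) / len(counts)
    n = len(counts)
    hist = Counter(counts)
    empirical = [hist[k] / n for k in range(k_max + 1)]
    model = [poisson_pmf(k, lam) for k in range(k_max + 1)]
    return empirical, model

# Hypothetical bursty trace: 70 idle rounds, 20 rounds with one message,
# 10 rounds with a burst of five messages.
counts = [0] * 70 + [1] * 20 + [5] * 10
empirical, model = poisson_fit_check(counts)
```

The fitted Poisson model assigns almost no mass to the bursts that dominate the empirical tail, mirroring the behavior observed in Fig. 8.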
In the analysis in this document, we do not assume a specific distribution for $x_j^r$, but consider that it is a generic stationary process that satisfies

$$\mathrm{Cov}\{x_j^r, x_k^r\} = 0 \qquad (3)$$

for all senders $j$, $k$, except when $j = k$. These assumptions mean, in other words, that the participation of a user in a given round is uncorrelated with the participation of each other user in that round. We have validated these hypotheses by computing the different sample covariances from our datasets, as shown in Table II.
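The validation of the uncorrelated-participation hypothesis amounts to checking that off-diagonal sample covariances are small relative to the variances. A minimal numpy sketch, using hypothetical independent Poisson senders as the input matrix, is:

```python
import numpy as np

def max_offdiag_ratio(X):
    """Ratio of the largest off-diagonal sample covariance (between the
    per-round send counts of two different users) to the largest variance.
    X is (rho, N), one column per sender."""
    C = np.cov(X, rowvar=False)          # N x N sample covariance matrix
    off = C - np.diag(np.diag(C))
    return float(np.abs(off).max() / np.diag(C).max())

# Hypothetical inputs: three senders with independent Poisson send counts.
rng = np.random.default_rng(3)
X = rng.poisson([2.0, 5.0, 1.0], size=(2000, 3)).astype(float)
ratio = max_offdiag_ratio(X)
```

For truly uncorrelated senders this ratio converges to zero as the number of rounds grows; on real observations the analogous quantity measures how far the data departs from hypothesis (3).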
III-C Modeling the output process
A crucial point when carrying out a performance analysis of disclosure attacks on mixes is selecting a model for the distribution of the outputs $y_{j,i}^r$, which represents how users choose the recipients of their messages in each round; here $y_{j,i}^r$ denotes the number of messages sender $j$ sends to receiver $i$ in round $r$. A known property of this distribution, given by the definition of sending profiles, is that $\mathrm{E}\{y_{j,i}^r \,|\, x_j^r\} = x_j^r \, p_{j,i}$. However, this is true for many distributions. Every previous analysis of LSDA assumes that the choice of recipients is stationary and that $y_{j,i}^r$ follows a multinomial model, i.e.,

$$\Pr\{y_{j,1}^r, \ldots, y_{j,M}^r \,|\, x_j^r\} = \binom{x_j^r}{y_{j,1}^r, \ldots, y_{j,M}^r} \prod_{i=1}^{M} p_{j,i}^{\,y_{j,i}^r} \qquad (6)$$
This model is adequate in scenarios where users choose the recipients of each of their messages in each round independently. However, when users tend to focus on a single receiver in each round, (6) is not suitable to model the output distribution.
In this work, we assume two models for $y_{j,i}^r$ that are examples of how users can distribute their messages among the receivers while satisfying $\mathrm{E}\{y_{j,i}^r \,|\, x_j^r\} = x_j^r \, p_{j,i}$:
A multinomial model, given by (6), as an example of users that produce low-variance outputs.
A maximum variance model, in which sender $j$ addresses all of its $x_j^r$ messages in round $r$ to a single receiver chosen according to $\mathbf{p}_j$, i.e.,

$$\Pr\{y_{j,i}^r = x_j^r\} = p_{j,i}, \qquad \Pr\{y_{j,i}^r = 0\} = 1 - p_{j,i} \qquad (7)$$
When using these distributions, we are implicitly assuming that the choices of recipients of different senders within the same round are uncorrelated, and that the choice of recipients of the same user between rounds can be also considered uncorrelated. Our experiments in Sect. IV-2 confirm that the results we obtain with these approximations are accurate.
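The two output models can be contrasted with a short simulation. Both samplers below keep the same per-round mean $x_j^r \, p_{j,i}$, but differ sharply in variance; the round size and profile are hypothetical toy values.

```python
import numpy as np

rng = np.random.default_rng(1)

def multinomial_round(x_j, p_j):
    """Model (6): each of the x_j messages picks its recipient independently
    with probabilities p_j."""
    return rng.multinomial(x_j, p_j)

def max_variance_round(x_j, p_j):
    """Model (7): a single recipient is drawn with probabilities p_j and
    receives all x_j messages of the round."""
    out = np.zeros(len(p_j), dtype=int)
    out[rng.choice(len(p_j), p=p_j)] = x_j
    return out

# Hypothetical sender: 10 messages per round, two equally likely contacts.
p = np.array([0.5, 0.5])
a = np.array([multinomial_round(10, p) for _ in range(20000)])
b = np.array([max_variance_round(10, p) for _ in range(20000)])
```

Empirically, both samplers give the same mean number of messages per receiver, while the maximum variance model's per-receiver variance is roughly $x_j^r$ times larger, which is the behavior that separates (6) from (7) in the analysis of Sect. IV.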
To illustrate how users' behavior changes between scenarios, we have computed the average number of recipients each sender chooses in each round of the observations generated with our datasets, as a function of the number of messages sent. This is displayed in Table III. As a reference, we also report the average number of contacts per sender in each dataset. These results show that users in the Email dataset tend to spread their messages among their contacts, behaving close to (6), while users in Location and MailingList focus on a single recipient in each round, as in (7).
Table III: average number of recipients chosen per round, as a function of the number of messages sent ($x_j^r$).
IV Extended Performance Analysis of the Least Squares Disclosure Attack
We now assess the profiling accuracy of the Least Squares Disclosure Attack under the assumptions on the input and output processes proposed in the previous section, which we have validated with traffic from real-world scenarios. The profiling accuracy is measured as the Mean Squared Error (MSE) between the attacker's estimation of the sending profiles of the users and their real values, i.e., $\mathrm{MSE}_j \doteq \mathrm{E}\{\|\hat{\mathbf{p}}_j - \mathbf{p}_j\|^2\}$, where $\hat{\mathbf{p}}_j$ is the $j$-th row of $\hat{\mathbf{P}}$. This analysis generalizes previous ones [5, 6, 8, 7], accommodating different types of mixes and being able to model real-world behavior, at the expense of some accuracy.
IV-1 Theoretical approximation of the average MSE
Our goal is to obtain an approximation of the average $\mathrm{MSE}_j$ when using (1) to estimate the sending profiles, where this average is computed over all the realizations of $\mathbf{X}$ and $\mathbf{Y}$ obtained with the users' average behavior $\mathbf{P}$. For simplicity, we omit the conditioning on $\mathbf{P}$ in the derivations below.
For the analysis in this section, we introduce additional notation regarding the statistics of the input and output processes. We use $\mu_j$ to refer to the expected value of $x_j^r$ and $\sigma_j^2$ to its variance; the vector $\boldsymbol{\mu} \doteq [\mu_1, \ldots, \mu_N]^T$ contains all $\mu_j$ for each sender. Matrix $\mathbf{D}_{\mu}$ contains these values arranged in its main diagonal, i.e., $[\mathbf{D}_{\mu}]_{j,j} = \mu_j$ and, similarly, $[\mathbf{D}_{\sigma^2}]_{j,j} = \sigma_j^2$. We also use a parameter that is closely related to the variance of the outputs, together with its corresponding diagonal matrix. Finally, we define the uniformity of the sending profile of user $j$ as $u_j \doteq (1 - \|\mathbf{p}_j\|^2)/(1 - 1/M)$. The uniformity gives an idea of how random the behavior of a user is, and ranges from 0, when sender $j$ only has one contact, to 1, when this user sends messages to all the receivers with the same probability during the observation period. Note that $0 \leq u_j \leq 1$.
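The uniformity can be computed directly from a profile; a minimal helper follows (the exact normalization to $[0,1]$ is the one defined above, which should be treated as our reconstruction of the original definition).

```python
def uniformity(p):
    """Uniformity u_j of a sending profile p (probabilities summing to 1):
    0 when the sender has a single contact, 1 when all M receivers are
    equally likely. The [0, 1] normalization follows the reconstructed
    definition in the text."""
    M = len(p)
    return (1.0 - sum(q * q for q in p)) / (1.0 - 1.0 / M)

# A one-contact profile versus a perfectly uniform one over 3 receivers.
u_single = uniformity([1.0, 0.0, 0.0])
u_uniform = uniformity([1 / 3, 1 / 3, 1 / 3])
```

Profiles in between (a few dominant contacts plus occasional others) yield intermediate values, which is how the uniformity enters the MSE expressions below.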
We start the derivations by showing that the LSDA estimator is unbiased. This is straightforward from the fact that, given a matrix of input messages $\mathbf{X}$ and the average behavior of the senders $\mathbf{P}$, the expected value of the output is

$$\mathrm{E}\{\mathbf{Y} \,|\, \mathbf{X}\} = \mathbf{X}\mathbf{P} \qquad (8)$$

where the expectation is taken along all the possible assignments of the messages in $\mathbf{X}$ to the receivers, following $\mathbf{P}$. Using (8) together with (1), we get $\mathrm{E}\{\hat{\mathbf{P}}\} = \mathbf{P}$ (alternatively, $\mathrm{E}\{\hat{p}_{j,i}\} = p_{j,i}$). This property allows us to write, using the law of total variance,

$$\mathrm{Var}\{\hat{p}_{j,i}\} = \mathrm{E}\{\mathrm{Var}\{\hat{p}_{j,i} \,|\, \mathbf{X}\}\} \qquad (9)$$

since the conditional expectation $\mathrm{E}\{\hat{p}_{j,i} \,|\, \mathbf{X}\} = p_{j,i}$ does not depend on $\mathbf{X}$.
Since we have assumed that the input process is stationary, using the Law of Large Numbers and considering that the number of rounds observed, $\rho$, is large enough, we approximate

$$\frac{1}{\rho}\,\mathbf{X}^T\mathbf{X} \approx \mathbf{R}_x$$
where $\mathbf{R}_x$ is the autocorrelation matrix of the input process, i.e., an $N \times N$ symmetric matrix whose $(j,k)$-th element is $\mathrm{E}\{x_j^r x_k^r\}$. Using (3), we write this matrix as

$$\mathbf{R}_x = \mathbf{D}_{\sigma^2} + \boldsymbol{\mu}\boldsymbol{\mu}^T \qquad (12)$$
Therefore, when the number of rounds observed is large, (9) can be approximated by substituting $\mathbf{X}^T\mathbf{X}$ with $\rho\,\mathbf{R}_x$, which yields the expression in (13).
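The diagonal-plus-rank-one structure of (12) can be checked numerically. The sketch below uses hypothetical independent Poisson inputs (for which the variance equals the mean, so the cross-covariances in (3) vanish by construction) and compares the sample autocorrelation matrix against $\mathbf{D}_{\sigma^2} + \boldsymbol{\mu}\boldsymbol{\mu}^T$.

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 50000
mu = np.array([2.0, 5.0, 1.0])   # hypothetical per-sender mean send rates

# Independent Poisson inputs: sigma_j^2 = mu_j and Cov{x_j, x_k} = 0.
X = rng.poisson(mu, size=(rho, len(mu))).astype(float)

R_emp = X.T @ X / rho                       # sample autocorrelation (1/rho) X^T X
R_theory = np.diag(mu) + np.outer(mu, mu)   # D_sigma2 + mu mu^T, as in (12)
```

As $\rho$ grows, the sample matrix converges entry-wise to the theoretical one, which is the Law of Large Numbers step used in the approximation above.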
Finally, plugging (12) and (14) into (13) and performing the matrix multiplications, we obtain an approximation of the covariance matrix of the estimated profiles. Then, taking the $j$-th diagonal element of this matrix, which is $\mathrm{Var}\{\hat{p}_{j,i}\}$, adding this element along the receivers $i$, and further considering that the approximation holds for all senders $j$, we obtain the closed-form expression in (15) for the multinomial model.
Maximum variance model
Operating as explained before to obtain the MSE in the estimation of the sending profile of user $j$, we get the corresponding closed-form expression in (17).
The formulas (15) and (17) provide new insights into how LSDA's error depends on the system parameters. This error decreases with the number of observed rounds $\rho$, since it becomes easier for the attacker to estimate the behavior of the users as more observations are available. The variance of the input process decreases the estimation error of sender $j$'s profile: it is easier to separate the sending behavior of a user from the others when there are rounds where that user participates a lot as well as rounds where that user is not present. The MSE also increases with the contribution of all senders to the output variance, more strongly when users behave as in (7) than as in (6). The role of the uniformity of the profiles in the MSE is also very relevant: estimating the sending profiles is a much easier task when users only contact very few receivers (i.e., low $u_j$) than when they distribute their messages among a larger population (i.e., $u_j$ close to 1).
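The qualitative dependence on $\rho$ can be verified with a small Monte Carlo sketch: the empirical squared error of the estimator shrinks as more rounds are observed. The toy 2x2 system, input rates and seeds below are hypothetical.

```python
import numpy as np

def lsda_sq_error(rho, seed):
    """Squared error of the least squares profile estimate for a toy
    2-sender / 2-receiver system after rho observed rounds, with Poisson
    inputs and multinomial (model (6)) outputs."""
    rng = np.random.default_rng(seed)
    P = np.array([[0.9, 0.1], [0.2, 0.8]])
    X = rng.poisson(4.0, size=(rho, 2)).astype(float)
    Y = np.zeros((rho, 2))
    for r in range(rho):
        for j in range(2):
            Y[r] += rng.multinomial(int(X[r, j]), P[j])
    P_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return float(((P_hat - P) ** 2).sum())

# Averaged over a few seeds, the error shrinks as rho grows.
err_small = float(np.mean([lsda_sq_error(200, s) for s in range(5)]))
err_large = float(np.mean([lsda_sq_error(3200, s) for s in range(5)]))
```

The observed decay is consistent with the $1/\rho$ behavior predicted by (15); swapping the output sampler for the maximum variance model of (7) raises the error level without changing this trend.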
We now evaluate our formulas, applying LSDA to the anonymized traces of real traffic. For each dataset, mix configuration, and number of rounds observed $\rho$, up to the total number of rounds in the observations, we perform LSDA and compute the real $\mathrm{MSE}_j$. We then represent the average of those users that meet three conditions: they are among the most active users, they belong to the users that remain active for the largest number of rounds, and they start participating early in the observation period. We do this to avoid sporadic peaks in the average MSE, which are the result of estimating the sending profile of a user that barely participates in the system, and to be able to see the trend of the MSE with clarity.
Figure 15 shows this average for the Email, Location and MailingList datasets, together with the theoretical formulas in (15) and (17). We only plot the theoretical approximation that best suits each scenario: (15) in the Email experiments and (17) in the Location and MailingList experiments. We also plot the theoretical MSE from previous works, taken from the prior analyses of the threshold and the timed mix, respectively. We set the limits of the vertical axis to the same value in all figures to ease the comparison between them. These limits make early values of the MSE (low $\rho$) fall outside the plot, but allow us to see in more detail the performance of the attack for large values of $\rho$. We do this on purpose: we are predicting the asymptotic behavior of the attack, so the results for low values of $\rho$ are not significant in our evaluation.
We see that our approximations improve upon those given by previous work, especially in those scenarios where the multinomial model for the choice of recipients is not appropriate (Location and MailingList). We note that the number of rounds we can generate with the Email database in Fig. 15(d) is not large enough to appreciate this improvement, due to the spike we observe in that experiment at early values of $\rho$. This sudden increase of the MSE, as well as the one in Fig. 15(c), happens for two reasons. First, when the number of rounds observed is small, it is easier for the matrix $\mathbf{X}^T\mathbf{X}$ to be ill-conditioned, which results in a poor estimation of the sending profiles (cf. (1)), and therefore in a large MSE in that realization. This spike is not predicted by our theoretical formulas, since they approximate the average MSE. On the other hand, in the Email dataset, most of the users whose MSE we average start sending messages only after the adversary has observed a sizable fraction of the total number of rounds. This causes an increase in the average MSE at that point, as we are adding users to the average that have barely participated in the system. The average stabilizes as the number of rounds observed increases, since the number of users used for the computation of the average remains unchanged.
In all cases, the MSE decreases as the number of observed rounds increases, as predicted by our formulas, except for the spikes in Figs. 15(d) and 15(c) whose origin we have already explained. Due to these spikes, comparing the results of the experiments in the Email and MailingList datasets is not possible. However, we can see that the MSE in the experiments with Location is stable, and always larger in the threshold mix scenario (Fig. 15(b)). The reason for this is the following: the variance of the input process in a threshold mix is smaller than that in a timed mix for the same average number of messages sent per round. This is the case in the Location experiments, since the number of rounds we generate in the threshold and timed mix experiments is approximately the same. As predicted by our theoretical formulas, a system with lower input variance provides more protection against the LSDA attacker.
We have analyzed the effects of real-world user behavior on the performance of the least squares disclosure attack in mix-based anonymous communication systems, considering mixes that do not delay messages between communication rounds. To validate our work, we have obtained real traffic observations from three publicly available datasets of different nature: emails sent between the employees of a company, location check-ins from an online social network, and users' posts to mailing lists. By studying these data, we confirm that the hypotheses upon which former analyses of the least squares disclosure attack are based [5, 6, 7] are not adequate to model real-world behavior, and hence we formulate new ones. Based on these new assumptions, we develop a generalized performance analysis of the attack, which we validate with our datasets, confirming that it accurately models the estimation error of the attacker in the considered realistic scenarios. This analysis accommodates a wide variety of mix and user behaviors, and provides new insights into the statistics that affect the protection of the users: the variability in the participation of the users in the system contributes to the attacker's success, while the variability in the messages received by users worsens the attacker's estimation.
This work was partially funded by the Spanish Government and the ERDF under project TACTICA, by the Spanish Government under project COMPASS (TEC2013-47020-C2-1-R), by the Galician Regional Government and the ERDF under projects Consolidation of Research Units (GRC2013/009), REdTEIC (R2014/037) and AtlantTIC, and by the EU 7th Framework Programme (FP7/2007-2013) under grant agreements 610613 (PRIPARE) and 285901 (LIFTGATE).
-  G. Danezis, “Statistical disclosure attacks: Traffic confirmation in open environments,” in Proceedings of Security and Privacy in the Age of Uncertainty, Gritzalis, Vimercati, Samarati, and Katsikas, Eds., IFIP TC11. Athens: Kluwer, May 2003, pp. 421–426.
-  N. Mathewson and R. Dingledine, “Practical traffic analysis: Extending and resisting statistical disclosure,” in 4th Workshop on Privacy Enhancing Technologies, ser. LNCS, D. Martin and A. Serjantov, Eds., vol. 3424. Springer, 2004, pp. 17–34.
-  C. Troncoso, B. Gierlichs, B. Preneel, and I. Verbauwhede, “Perfect matching disclosure attacks,” in 8th Symposium on Privacy Enhancing Technologies, ser. LNCS, N. Borisov and I. Goldberg, Eds., vol. 5134. Springer-Verlag, 2008, pp. 2–23.
-  G. Danezis and C. Troncoso, “Vida: How to use Bayesian inference to de-anonymize persistent communications,” in 9th Privacy Enhancing Technologies Symposium, ser. LNCS, I. Goldberg and M. J. Atallah, Eds., vol. 5672. Springer, 2009, pp. 56–72.
-  F. Pérez-González and C. Troncoso, “Understanding statistical disclosure: A least squares approach,” in Privacy Enhancing Technologies - 12th Symposium, ser. LNCS, vol. 7384. Springer-Verlag, 2012, pp. 38–57.
-  F. Pérez-González and C. Troncoso, “A least squares approach to user profiling in pool mix-based anonymous communication systems,” in IEEE Workshop on Information Forensics and Security, 2012, pp. 115–120.
-  S. Oya, C. Troncoso, and F. Pérez-González, “Do dummies pay off? limits of dummy traffic protection in anonymous communications,” in 14th Symposium on Privacy Enhancing Technologies, 2014.
-  S. Oya, C. Troncoso, and F. Pérez-González, “Meet the family of statistical disclosure attacks,” in IEEE Global Conference on Signal and Information Processing, 2013, 4 pp.
-  G. Danezis and C. Troncoso, “You cannot hide for long: De-anonymization of real-world dynamic behaviour,” in Proceedings of the 12th ACM Workshop on Workshop on Privacy in the Electronic Society, ser. WPES ’13. ACM, 2013, pp. 49–60.
-  K. Malinka, P. Hanáček, and D. Cvrček, “Analyses of real email traffic properties,” Radioengineering, vol. 18, no. 4, p. 7, 2009.
-  J. Sherman and W. J. Morrison, “Adjustment of an inverse matrix corresponding to a change in one element of a given matrix,” The Annals of Mathematical Statistics, vol. 21, no. 1, pp. 124–127, 1950.