1 Introduction
Privacy-preserving data mining, also known as statistical disclosure control, inference control, or private data analysis [3, 1], seeks to protect statistical data so that they can be publicly released and mined while preserving privacy [2]. Related techniques are widely used in domains like medical informatics or opinion polling, where patients feel uncomfortable about divulging information on issues like their frequency of drug use or HIV status [4].
Stored personally identifiable information (PII), such as names, social security numbers (SSNs), and IP or MAC addresses, can easily be used to compromise users’ privacy; consider the recent Ashley Madison data disclosure [8]. The best way to preserve user privacy is therefore not to keep any PII. Approaches like this are effective in foiling attackers, but they make the task of maintaining user statistics very challenging. This paper shows how to maintain an accurate estimate of the number of unique users while protecting the privacy of each individual user by using statistical counting.
The work presented here builds on probabilistic counting, introduced by Flajolet and Martin [6], which estimates the number of distinct records/users (cardinality) without keeping the records. It is based on statistical analysis of the bit positions of hashed records. The position of the least significant bit set in each hashed record is stored in a register called a BITMAP. From the position of the lowest bit that is not set in the BITMAP, we obtain, after correction, an unbiased estimate of the cardinality of the user set. For example, the input records could be the users’ SSL certificates.
This paper makes three main contributions.

Probabilistic counting has been used to estimate the cardinality of a multiset when an exact count is unrealistic (too resource-intensive to store or count the elements) [11]. To the best of our knowledge, this is the first time the algorithm has been used to preserve privacy and anonymity.

We propose collision-included probabilistic counting (CIPC), which includes the hash collisions in the estimate. The birthday paradox [13] is used to determine the number of collisions. The results of our experiments show that adding collisions to the uncorrected estimate of the number of users gives a more accurate count than using a constant correction factor [6]. This also indicates that hash collisions are a major cause of bias in probabilistic counting.

We also provide an anonymity metric to measure the anonymity the proposed algorithm provides.
This paper was inspired by problems we faced when maintaining user privacy while collecting usage statistics for a censorship circumvention tool [9]. The tool is used by dissidents and journalists working in politically challenging regions in West Africa to circumvent DNS and IP address blocking by leveraging technologies developed for use by criminal botnet enterprises [12]. To quantify the number of unique users of the system, a secure method was needed to keep track of the statistics. To the best of our knowledge, this is the first time probabilistic counting has been applied to safeguarding user privacy.
The rest of the paper is organized as follows. In section 2 we present background on probabilistic counting and the hash function used in the proposed algorithm. We present the collision-included probabilistic counting (CIPC) algorithm in section 3, where experimental results are given along with a recommendation for selecting the proper register/BITMAP size. We conclude the paper and point out future work for this research in section 4.
2 Background
This section covers previous probabilistic counting work.
Probabilistic counting [6] estimates the number of distinct elements (cardinality) in a large collection of data that contain duplicates. It requires only a small amount of storage and few operations per element. The algorithm works as follows:
 Inputs:

A multiset of records $M$; a register/BITMAP of length $L$ initialized to $0$ (see Figure 1).
 Output:

The estimate of the number of distinct records in $M$, denoted by $n$.
 Step 1

Apply a hash function $h$ that maps each element of $M$ to an integer. This set of integers is uniformly distributed over the range $[0, 2^L - 1]$.
 Step 2

For each record $x \in M$, let $\rho(h(x))$ be the position of the least significant bit that is set in the binary representation of $h(x)$, with positions starting at $0$. Set the corresponding BITMAP position to $1$. That is, if $\rho(h(x)) = i$, then $\mathrm{BITMAP}[i] = 1$. For example, $\mathrm{BITMAP}[2]$ will be set to $1$ if the binary representation of $h(x)$ ends in $100$.
 Step 3

Let $R$ be the position of the rightmost zero in BITMAP. The estimated value of the number of unique records in $M$ is given by

$$n \approx \frac{2^R}{\varphi} \qquad (1)$$

where $\varphi \approx 0.77351$ is the correction factor.
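The steps above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the helper names (`hash_record`, `rho`, `build_bitmap`, `rightmost_zero`, `estimate`) are hypothetical, and SHA-256 truncated to $L$ bits is assumed as a stand-in for a uniform hash function. The constant 0.77351 is the correction factor from [6].

```python
import hashlib

PHI = 0.77351  # correction factor reported by Flajolet and Martin [6]


def hash_record(record: str, L: int) -> int:
    """Map a record to an integer in [0, 2**L - 1] (assumed hash: SHA-256, L <= 64)."""
    digest = hashlib.sha256(record.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << L)


def rho(y: int, L: int) -> int:
    """Position of the least significant set bit of y; L if y has no set bit."""
    if y == 0:
        return L
    return (y & -y).bit_length() - 1


def build_bitmap(records, L: int = 32) -> int:
    """Step 2: set BITMAP[rho(h(x))] for every record; duplicates change nothing."""
    bitmap = 0
    for record in records:
        pos = rho(hash_record(record, L), L)
        if pos < L:
            bitmap |= 1 << pos
    return bitmap


def rightmost_zero(bitmap: int, L: int) -> int:
    """Step 3: position R of the lowest bit of the BITMAP that is still zero."""
    for i in range(L):
        if not (bitmap >> i) & 1:
            return i
    return L


def estimate(records, L: int = 32) -> float:
    """Corrected estimate 2**R / PHI of the number of distinct records."""
    R = rightmost_zero(build_bitmap(records, L), L)
    return (1 << R) / PHI
```

Only the BITMAP is retained between records, so no record needs to be stored. A single BITMAP yields a coarse estimate (always a power of two divided by $\varphi$); Flajolet and Martin reduce the variance by averaging over several BITMAPs [6].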
The algorithm is based on statistical observations concerning the bits of hashed values. Let $D$ denote the set of distinct records in the multiset $M$. Since the hash value $h(x)$ is uniformly distributed over $[0, 2^L - 1]$, for each $x \in D$ the position $\rho(h(x))$ of its least significant set bit is $0$ with probability $1/2$, $1$ with probability $1/4$, $2$ with probability $1/8$, and so on. This can be generalized as the sequence of $k$ low-order zero bits followed by a one occurring with probability $2^{-(k+1)}$ [6]. So if we randomly select $x \in D$, the probability of $\mathrm{BITMAP}[k]$ being set to $1$ is $2^{-(k+1)}$. The BITMAP only depends on the least significant set bits (LSSBs) of distinct hashed values and not on the frequency of the values. If a record occurs more than once in the multiset $M$, it will be counted only once.
Let $n$ be the number of distinct records in $M$, i.e., $n = |D|$. It is expected that $\mathrm{BITMAP}[0]$ will be set about $n/2$ times, $\mathrm{BITMAP}[1]$ will be set approximately $n/4$ times, and so on. At the end of the execution, it is very likely that $\mathrm{BITMAP}[i] = 0$ for $i \gg \log_2 n$ and $\mathrm{BITMAP}[i] = 1$ for $i \ll \log_2 n$ [6]. An example is illustrated in Figure 2. Assuming there are $n$ distinct records, about $n/2$ records will have bit $0$ as their LSSB, that is, about half of the hash values are mapped to $\mathrm{BITMAP}[0]$; about $n/4$ records are mapped to bit position $1$, i.e., approximately a quarter of the records are mapped to $\mathrm{BITMAP}[1]$, and so on [7].
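The geometric distribution of LSSB positions can be checked exhaustively for a small register. The sketch below uses hypothetical helper names and an arbitrarily chosen register length of 10: it enumerates every 10-bit value and tallies the position of its least significant set bit. The value 0, which has no set bit, is skipped.

```python
from collections import Counter


def lssb_position(y: int) -> int:
    """Position of the least significant set bit of y (y > 0), counting from 0."""
    return (y & -y).bit_length() - 1


def lssb_distribution(L: int) -> dict:
    """Fraction of all L-bit values whose LSSB sits at each position k."""
    counts = Counter(lssb_position(y) for y in range(1, 1 << L))
    total = 1 << L
    return {k: counts[k] / total for k in range(L)}


if __name__ == "__main__":
    for k, fraction in lssb_distribution(10).items():
        print(k, fraction)
```

Position $k$ receives exactly a fraction $2^{-(k+1)}$ of the values, because the values whose LSSB is at position $k$ are the odd multiples of $2^k$.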
It was proposed in [6] to use the position $R$ of the rightmost zero in BITMAP as an indicator of $\log_2 n$, based on the assumption that hash values will be uniformly distributed over $[0, 2^L - 1]$. That is, if the rightmost zero of the BITMAP is at position $R$, the cardinality $n$ is estimated by $2^R$.
It is shown in [6, 5] that $2^R$ is a biased estimate of $n$ and that a correction factor $\varphi \approx 0.77351$ can be used as a simple correction. The unbiased estimate of the cardinality is then given by $n \approx 2^R / \varphi$, as in (1).
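As a worked illustration with hypothetical values (not taken from the paper's figures), suppose the four lowest BITMAP positions are set and position 4 is the rightmost zero, so that $R = 4$ and $\varphi \approx 0.77351$ as in [6]:

```latex
\begin{align*}
2^{R} &= 2^{4} = 16 && \text{(uncorrected estimate)} \\
\frac{2^{R}}{\varphi} &= \frac{16}{0.77351} \approx 20.68 && \text{(corrected estimate)}
\end{align*}
```

The correction scales the raw value up by a constant factor of about 1.29, whatever the value of $R$.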
3 Collision-Included Probabilistic Counting (CIPC)
The plot of $2^R$ is in Figure 3, where $R$ is the position of the rightmost zero in the BITMAP. It is not hard to see that $2^R$ underestimates $n$ for the most part, and that the gap between $2^R$ and $n$ gets larger as $n$ increases. According to (1), probabilistic counting uses a correction factor $\varphi$ to compensate for the deviation introduced by using $R$ as an indicator of $\log_2 n$. Although a rigorous calculation of $\varphi$ is provided in [6], the cause of the deviation is not discussed. In this section, we consider the effects of hash collisions and how including them improves the estimate. The experimental results show that our approach produces an estimate at least as good as the one obtained using $\varphi$, and a better one under certain circumstances. Moreover, our approach provides some insight into the possible cause of the estimation bias.
3.1 Collision Estimate using Birthday Paradox
Even with uniform hash functions, it is inevitable that more than one record will be mapped to the same hash value, which is known as a collision. Although the BITMAP is set to be big enough to hold all the records, the probability of collisions increases with the number of records, especially when $L$ is fixed. Therefore, we propose collision-included probabilistic counting (CIPC). We calculate the expected number of collisions and add it to $2^R$ to improve the results. The birthday paradox [14] is adopted to estimate the expected number of collisions of multiplication hashing.
Theorem 1.
If the position $R$ of the rightmost zero in BITMAP (ranks start at $0$) is used as an indicator of $\log_2 n$, under the assumption that the hash values are uniformly distributed, the expected value of $n$ including collisions is:
(2) 
where is the estimate of obtained using collisionincluded probabilistic counting (CIPC).
Since the hash values are uniformly distributed over , according to birthday paradox [10], the expected number of collisions is
(3) 
where is the number of collisions. Since is the estimate of without including the number of collisions, we have
(4) 
Substituting (3) into (4) yields
(5) 
Given that , solving (5) for gives us
(6) 
∎
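For reference, the textbook birthday-paradox expression for the expected number of collisions can be sketched as below. It assumes $n$ keys hashed uniformly and independently into $m$ buckets and counts a collision whenever a key lands in an already occupied bucket, so the expected count is $n - m\,(1 - (1 - 1/m)^n)$. The function name and the choice $m = 2^L$ with $L = 16$ are illustrative assumptions, and this expression is not necessarily the exact form used in Theorem 1.

```python
def expected_collisions(n: int, m: int) -> float:
    """Expected number of keys that land in an already occupied bucket
    when n keys are hashed uniformly and independently into m buckets."""
    expected_occupied = m * (1.0 - (1.0 - 1.0 / m) ** n)
    return n - expected_occupied


if __name__ == "__main__":
    L = 16  # hypothetical register length, giving m = 2**L possible hash values
    for n in (1_000, 10_000, 100_000):
        print(n, round(expected_collisions(n, 1 << L), 1))
```

For a fixed $m$, the expected number of collisions grows faster than linearly in $n$, which is consistent with the observation that the gap between the uncorrected estimate and the true cardinality widens as the number of records increases.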
Figure 4 shows a flowchart of CIPC used to estimate the number of distinct system users.
Table 1. Probabilistic counting (PC) versus CIPC: estimate, 95% confidence interval, and percent error.
PC estimate  95% CI (PC)       Percent Error (PC)  CIPC estimate  95% CI (CIPC)    Percent Error (CIPC)
558          (502, 614)        44.1506% (min)      579            (508, 649)       42.0687% (min)
2158         (1856, 2460)      56.8314%            2233           (1862, 2603)     55.3373%
4236         (3613, 4859)      57.6372%            4348           (3567, 5129)     56.514%
8869         (7580, 10158)     55.6515%            9236           (7662, 10810)    53.816%
10590        (10590, 10590)    64.6977%            11356          (11356, 11356)   62.1466%
18673        (16211, 21134)    53.3173%            19649          (16670, 22627)   50.8772%
19769        (17705, 21832)    60.4614%            20940          (18351, 23529)   58.1189%
17651        (11911, 23390)    70.5814% (max)      18283          (11083, 25483)   69.5277% (max)
32556        (27636, 37476)    53.4906%            33476          (27553, 39400)   52.1758%
35673        (30798, 40549)    55.4076%            37033          (30917, 43150)   53.7076%
34216        (27734, 40697)    61.98215%           35205          (27074, 43335)   60.8833%
34125        (28841, 39409)    65.8744%            35091          (28462, 41719)   64.9085%
Table 2. Probabilistic counting (PC) versus CIPC: estimate, 95% confidence interval, and percent error.
PC estimate  95% CI (PC)       Percent Error (PC)  CIPC estimate  95% CI (CIPC)    Percent Error (CIPC)
884          (769, 1000)       11.5191%            886            (749, 1024)      11.3326%
3530         (3097, 3962)      29.3954%            3498           (2971, 4024)     30.033%
6586         (5573, 7598)      34.1391%            6504           (5306, 7702)     34.954%
14872        (13048, 16695)    25.6398%            15000          (12802, 17197)   24.9991%
16878        (14796, 18961)    43.73699%           17414          (14873, 19954)   41.9531%
28536        (24627, 32444)    28.6599%            28344          (23577, 33111)   29.1381%
31772        (27465, 36078)    36.4558%            32248          (26922, 37574)   35.5024%
31772        (26664, 36879)    47.0465% (max)      32404          (26193, 38615)   45.9927% (max)
89838        (74745, 104931)   28.3403%            88414          (71114, 105714)  26.3064%
94758        (80114, 109403)   18.4484%            93213          (76323, 110103)  16.5168%
100233       (85382, 115083)   11.3702% (min)      99416          (82157, 116675)  10.4628% (min)
68839        (61495, 76183)    31.1605%            71098          (62014, 80182)   28.9014%
Table 3. Probabilistic counting (PC) versus CIPC: estimate, 95% confidence interval, and percent error.
PC estimate  95% CI (PC)       Percent Error (PC)  CIPC estimate  95% CI (CIPC)    Percent Error (CIPC)
1109         (919, 1299)       10.9853%            1034           (822, 1245)      3.4%
5395         (4446, 6344)      7.905%              5248           (4168, 6329)     4.9732%
10590        (8972, 12208)     5.9068%             10082          (8227, 11936)    0.8203%
22140        (18757, 25522)    10.7%               21374          (17511, 25238)   6.8742%
28241        (25109, 31374)    14.2658%            24377          (21443, 27312)   15.7221% (max)
44833        (37089, 52578)    12.0847%            38585          (31495, 45676)   3.53508%
54718        (46912, 62524)    9.437%              47616          (40421, 54812)   4.7661%
56898        (49628, 64169)    5.1683% (min)       57107          (48475, 65739)   4.8209%
77665        (64776, 90553)    10.95%              65704          (53884, 77524)   6.1366%
92844        (78001, 107688)   16.0562% (max)      79811          (66157, 93464)   0.2361% (min)
101317       (86236, 116398)   12.575%             87580          (73666, 101495)  2.68812%
110613       (95856, 125371)   10.6138%            111035         (93671, 128399)  11.0354%
3.2 Experiment Results
To evaluate the proposed algorithm, a set of experiments was performed for different values of $n$ and different register sizes $L$, using both probabilistic counting and CIPC. The same multiplication hash function was used in both algorithms.
In each experiment, we randomly picked $n$ distinct integers from a fixed range and used them as the records. For each value of $n$, the experiment was repeated several times using each algorithm. We then took the average as the final estimate of $n$.
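The experimental procedure for the probabilistic counting baseline can be sketched as follows. Everything specific here is an assumption made for illustration: the hash is a Knuth-style multiplicative hash with a 64-bit golden-ratio constant (the paper does not give its hash constants), and the function names, the record range `upper`, the number of repetitions, and the seed are hypothetical. The CIPC correction itself is not reproduced.

```python
import random
import statistics

PHI = 0.77351                      # correction factor from [6]
MASK64 = (1 << 64) - 1
GOLDEN_A = 0x9E3779B97F4A7C15      # assumed multiplier: odd 64-bit golden-ratio constant


def mult_hash(key: int, L: int) -> int:
    """Multiplicative hash: keep the top L bits of (key * A) mod 2**64 (1 <= L <= 64)."""
    return ((key * GOLDEN_A) & MASK64) >> (64 - L)


def fm_estimate(records, L: int) -> float:
    """Probabilistic counting estimate 2**R / PHI from a single BITMAP."""
    bitmap = 0
    for key in records:
        y = mult_hash(key, L)
        if y:
            bitmap |= y & -y       # y & -y isolates the least significant set bit
    R = 0
    while R < L and (bitmap >> R) & 1:
        R += 1                     # R ends at the position of the rightmost zero
    return (1 << R) / PHI


def run_experiment(n: int, L: int, repeats: int, upper: int, seed: int = 0) -> float:
    """Average the estimate over `repeats` draws of n distinct integers below `upper`."""
    rng = random.Random(seed)
    estimates = [fm_estimate(rng.sample(range(upper), n), L) for _ in range(repeats)]
    return statistics.mean(estimates)
```

A call such as `run_experiment(1000, 16, 30, 10**6)` returns the averaged estimate for 1000 distinct records. Fixing the seed makes a run repeatable, which helps when comparing two estimators on the same record sets.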
The results of the experiments are given in Tables 1, 2, and 3. Each table corresponds to a different register size ($L$). The columns showing a pair of values in parentheses contain the 95% confidence interval (CI) of the corresponding estimate. For easy and direct comparison of the results, we also calculated the Percent Error (PE) of each estimate $\hat{n}$ of the true cardinality $n$ using the formula
$$\mathrm{PE} = \frac{|\hat{n} - n|}{n} \times 100\% \qquad (7)$$
The lower the percent error (PE) is, the closer the estimate is to the true value. Based on this rule, we compare the performance of CIPC with probabilistic counting in terms of estimation accuracy.
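The percent error computation can be sketched as a small helper, assuming the usual definition of absolute deviation relative to the true value. The function name is hypothetical.

```python
def percent_error(estimate: float, true_value: float) -> float:
    """Percent error of an estimate relative to the (non-zero) true value."""
    if true_value == 0:
        raise ValueError("true_value must be non-zero")
    return abs(estimate - true_value) / abs(true_value) * 100.0
```

For instance, an estimate of 1100 for a true count of 1000 has a PE of 10%, and so does an estimate of 900: the measure does not distinguish overestimates from underestimates.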


As shown in Table 1, the percent error of CIPC is smaller than that of probabilistic counting for all tested $n$. This means CIPC always produces a more accurate estimate than probabilistic counting in our experiments. But the advantage is not significant (mostly 1 to 3 percentage points lower in PE).

As shown in Table 2, CIPC gives about the same results as probabilistic counting. No significant difference is observed between the estimates generated by the two algorithms. It is worth mentioning, however, that both the maximum and the minimum PE of CIPC are smaller than the corresponding values for probabilistic counting.

As shown in Table 3, CIPC provides significantly better estimation accuracy than probabilistic counting, except in two cases (the fifth and twelfth rows of Table 3). Half of the PE values of probabilistic counting are several times bigger than the corresponding PE values of CIPC.
Based on the results of our experiments, it is safe to say that CIPC outperforms probabilistic counting in general.
4 Conclusion
To accurately estimate the number of users of an anonymous system, we adapted the probabilistic counting algorithm to develop collision-included probabilistic counting (CIPC), which does not store identifiable information about system users. CIPC includes the hash collisions in the estimate, which gives a more accurate estimate than the original probabilistic counting. Based on simulation results, we recommend the register size used for Table 3, which gave the lowest percent errors.
In future work, we will explore the feasibility of probabilistic counting as an anonymity tool. We will investigate the information leakage of CIPC, that is, how much uncertainty the system provides and the probability of a user being identified by an attacker.
5 Acknowledgement
This material is based upon work sponsored by the National Science Foundation under Grants Nos. 1547164, 1544910 and 1643020. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
References

[1] B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Talwar. Privacy, accuracy, and consistency too: A holistic solution to contingency table release. In Proceedings of the Twenty-sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’07, pages 273–282, New York, NY, USA, 2007. ACM.
[2] J. Domingo-Ferrer. A survey of inference control methods for privacy-preserving data mining. In C. C. Aggarwal and P. S. Yu, editors, Privacy-Preserving Data Mining, volume 34 of Advances in Database Systems, pages 53–80. Springer, 2008.
[3] C. Dwork. Differential privacy: A survey of results. In Proceedings of the 5th International Conference on Theory and Applications of Models of Computation, TAMC’08, pages 1–19, Berlin, Heidelberg, 2008. Springer-Verlag.
[4] F. Esponda. Everything that is not important: Negative databases [research frontier]. Comp. Intell. Mag., 3(2):60–63, May 2008.
[5] S. Finch. Mathematical Constants. Cambridge University Press, 2003.
[6] P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209, 1985.
[7] J. Gama. Knowledge Discovery from Data Streams. Chapman & Hall/CRC, 1st edition, 2010.
[8] A. Greenberg. Was the Ashley Madison database leaked? http://www.wired.com/2015/07/hackbriefattackersspilluserdatacheatingsiteashleymadison/, 2015.
[9] O. Hambolu. Privacy preserving statistics. Master’s thesis, Clemson University, South Carolina, USA, 2014.
[10] M. Might. Counting hash collisions with the birthday paradox. http://matt.might.net/articles/countinghashcollisions/, July 2015.
[11] T. McMullen. It probably works. Queue, 13(8):80, 2015.
[12] J. Nazario and T. Holz. As the net churns: Fast-flux botnet observations. In 3rd International Conference on Malicious and Unwanted Software, MALWARE 2008, Alexandria, Virginia, USA, October 7–8, 2008, pages 24–31.
[13] P. S. Generalized birthday paradox. http://gdtr.wordpress.com/2013/01/13/generalizedbirthdayparadoxkeygenme3bydcoder/.
[14] D. Wagner. A generalized birthday problem. In CRYPTO, pages 288–303. Springer-Verlag, 2002.