Many data analysis applications must count the number of distinct elements in a large stream with repetitions. These applications include network monitoring [estan2003bitmap], online analytical processing [padmanabhan2003multi, shukla1996storage], query processing, and database management [whang1990linear]. A variety of algorithms, which we call cardinality estimators, have been developed to do this efficiently, with a small memory footprint. These algorithms include PCSA [flajolet1985probabilistic], LogLog [durand2003loglog], and HyperLogLog [flajolet2008hyperloglog]. They can all be parallelized and implemented using frameworks like MapReduce [dean2008mapreduce]. Indeed, their internal memory state, called a sketch, can be saved, and sketches from different data shards can be aggregated without information loss.
Using cardinality estimators, data owners can compute and store sketches over fine-grained data ranges, for example, daily. Data analysts or tools can then merge (or aggregate) existing sketches, which enables the subsequent estimation of the total number of distinct elements over arbitrary time ranges. This can be done without re-computing the entire sketch, or even accessing the original data.
Among cardinality estimators, HyperLogLog [flajolet2008hyperloglog] and its variant HyperLogLog++ [heule2013hyperloglog] are widely used in practical data processing and analysis tasks. Implementations exist for many widely used frameworks, including Apache Spark [sparkhll], Google BigQuery [bigqueryhll], Microsoft SQL Server [tsqlhll], and PostgreSQL [postgresqlhll]. The data these programs process is often sensitive. For example, they might estimate the number of distinct IP addresses that connect to a server [tschorsch2013algorithm], the number of distinct users that perform a particular action [ashok2014scalable], or the number of distinct devices that appeared in a physical location [monreale2013privacy]. To illustrate this point, we present an example of a location-based service, which we will use throughout the paper.
A location-based service gathers data about the places visited by the service’s users. For each place and day, the service stores a sketch counting the identifiers of the users who visited the place that day. This allows the service’s owners to compute useful statistics. For example, a data analyst can merge the sketches corresponding to the restaurants in a neighborhood over a month, and estimate how many distinct individuals visited a given restaurant during that month. The cost of such an analysis is proportional to the number of aggregated sketches. If the queries used raw data instead, then each query would require a pass over the entire dataset, which would be much more costly.
Note that the type of analysis to be carried out by the data analyst may not be known in advance. The data analysts should be able to aggregate arbitrarily many sketches, across arbitrary dimensions such as time or space. The relative precision of cardinality estimation should not degrade as sketches are aggregated.
Fine-grained location data is inherently sensitive: it is extremely re-identifiable [golle2009anonymity], and knowing an individual’s location can reveal private information. For example, it can reveal medical conditions (from visits to specialized clinics), financial information (from frequent visits to short-term loan shops), relationships (from regular co-presence), sexual orientation (from visits to LGBT community spaces), etc. Thus, knowing where a person has been reveals sensitive information about them.
As the above example suggests, an organization storing and processing location data should implement risk mitigation techniques, such as encryption, access controls, and access audits. The question arises: how should the sketches be protected? Could they be considered sufficiently aggregated to warrant weaker security requirements than the raw data?
To answer this question, we model a setting where a data owner stores sketches for cardinality estimation in a database. The attacker can access some of the stored sketches and any user statistics published, but not the raw data itself. This attacker model captures the insider risk associated with personal data collections where insiders of the service provider could gain direct access to the sketch database. In this paper, we use this insider risk scenario as the default attacker model when we refer to the attacker. In the discussion, we also consider a weaker attacker model, modeling an external attacker that accesses sketches via an analytics service provided by the data owner. In both cases, we assume that the attacker knows the cardinality estimator’s internals.
The attacker’s goal is to infer whether some user is in the raw data used to compute one of the accessible sketches. That is, she picks a target user, she chooses a sketch built from a stream of users, and she must guess whether her target is in this stream. The attacker has some prior knowledge of whether her target is in the stream, and examining the sketch gives her a posterior knowledge. The increase from prior to posterior knowledge determines her knowledge gain.
Consider Example 1. The attacker could be an employee trying to determine whether her partner visited a certain restaurant on a given day, or saw a medical practitioner. The attacker might initially have some suspicion about this (the prior knowledge), and looking at the sketches might increase this suspicion. A small, bounded increase in this suspicion might be deemed acceptable, but we do not want the attacker to be able to increase her knowledge too much.
We show that for all cardinality estimators that satisfy our aggregation requirement, sketches are almost as sensitive as raw data. Indeed, in this attacker model, the attacker can gain significant knowledge about the target by looking at the sketch. Our results are a natural consequence of the aggregation properties: to aggregate sketches without counting the same user twice, they must contain information about which users were previously added. The attacker can use this information, even if she does not know any other users in the sketch. Furthermore, adding noise either violates the aggregation property, or has no influence on the success of the attack. Thus, it is pointless to try and design privacy-preserving cardinality estimators: privacy and accurate aggregation are fundamentally incompatible.
To show the applicability of our analysis to real-world cardinality estimators, we quantify the privacy of HyperLogLog, the most widely used cardinality estimator algorithm. We show that for common parameters, the privacy loss is significant for most users. Consider, for example, a user among the 500 users associated with the largest privacy loss, in a sketch that contains 1000 distinct users. An attacker with an initial suspicion of 1% that this user is in the sketch can raise her level of certainty to over 31% after observing the sketch. If her initial estimate is 10%, her posterior exceeds 83%. If her prior is 50%, then her posterior is as high as 98%.
Our main contributions are:
We formally define a class of algorithms, which we call cardinality estimators, that count the number of distinct elements in a stream with repetitions, and can be aggregated arbitrarily many times (Section 3).
We give a definition of privacy that is well-suited to how cardinality estimators are used in practice (Section 4.2).
2 Previous work
Prior work on privacy-preserving cardinality estimators has been primarily focused on distributed user counting, for example to compute user statistics for anonymity networks like Tor. Each party $i$ is usually assumed to hold a set of data $S_i$, or a sketch built from $S_i$, and the goal is to compute the cardinality of $\bigcup_i S_i$, without allowing party $i$ to get too much information on the sets $S_j$ with $j \neq i$. The attacker model presumes honest-but-curious adversaries.
Tschorsch and Scheuermann [tschorsch2013algorithm] proposed a noise-adding mechanism for use in such a distributed context. In [melis2015efficient], each party encrypts their sketch, and sends it encrypted to a tally process, which aggregates them using homomorphic encryption. Ashok et al. [ashok2014scalable] propose a multiparty computation protocol based on Bloom filters to estimate cardinality without the need for homomorphic encryption, while Egert et al. [egert2015privately] show that Ashok et al.’s approach is vulnerable to attacks and propose a more secure variant of the protocol.
Our attacker model, based on insider risk, is fundamentally different to previously considered models: the same party is assumed to have access to a large number of sketches; they must be able to aggregate them and get good estimates.
Our privacy definition for cardinality estimators is inspired by differential privacy, first proposed by Dwork et al. [dwork2006calibrating]. Data-generating probability distributions were considered as a source of uncertainty in [rastogi2009relationship, duan2009privacy, zhou2009differential, bhaskar2011noiseless, kifer2012rigorous, bassily2013coupled, grining2017towards]; and some deterministic algorithms, which do not add noise, have been shown to preserve privacy under this assumption [bhaskar2011noiseless, bassily2013coupled, grining2017towards]. We explain in Section 4.2 why we need a custom privacy definition for our setup and how it relates to differential privacy [dwork2008differential] and Pufferfish privacy [kifer2012rigorous].
Our setting has some superficial similarities to the local differential privacy model, where the data aggregator is assumed to be the attacker. This model is often used to develop privacy-preserving systems to gather statistics [erlingsson2014rappor, fanti2016building, bassily2015local, bassily2017practical]. However, the setting and constraints of our work differ fundamentally from these protocols. In local differential privacy, each individual sends data to the server only once, and there is no need for deduplication or intermediate data storage. In our work, the need for intermediate sketches and unique counting leads to the impossibility result.
3 Cardinality estimators
In this section, we formally define cardinality estimators, and prove some of their basic properties.
Cardinality estimators estimate the number of distinct elements in a stream. The internal state of a cardinality estimator is called a sketch. Given a sketch, one can estimate the number of distinct elements that have been added to it (the cardinality).
Cardinality estimator sketches can also be aggregated: two sketches can be merged to produce another sketch, from which we can estimate the total number of distinct elements in the given sketches. This aggregation property makes sketch computation and aggregation embarrassingly parallel. The order and the number of aggregation steps do not change the final result, so cardinality estimation can be parallelized using frameworks like MapReduce.
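We now formalize the concept of a cardinality estimator. The elements of the multiset are assumed to belong to a large but finite set $\mathcal{U}$ (the universe).
A deterministic cardinality estimator is a tuple $\langle M_\emptyset, \mathrm{add}, \mathrm{estimate} \rangle$, where
$M_\emptyset$ is the empty sketch;
$\mathrm{add}(M, e)$ is the deterministic operation that adds the element $e$ to the sketch $M$ and returns an updated sketch;
$\mathrm{estimate}(M)$ estimates the number of distinct elements that have been added to the sketch $M$.
Furthermore, the $\mathrm{add}$ operation must satisfy the following axioms for all sketches $M$ and elements $e, e_1, e_2 \in \mathcal{U}$:
1. Idempotence: $\mathrm{add}(\mathrm{add}(M, e), e) = \mathrm{add}(M, e)$.
2. Commutativity: $\mathrm{add}(\mathrm{add}(M, e_1), e_2) = \mathrm{add}(\mathrm{add}(M, e_2), e_1)$.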
These axioms state that the add operation ignores duplicates and that the order in which elements are added is immaterial. Ignoring duplicates is a natural requirement for cardinality estimators. Ignoring order is required for this operation to be used in frameworks like MapReduce, or open-source equivalents like Hadoop or Apache Beam. Since handling large-scale datasets typically requires using such frameworks, we consider commutativity to be a hard requirement for cardinality estimators.
We denote by $M_{e_1, \ldots, e_n}$ the sketch obtained by adding $e_1, \ldots, e_n$ successively to $M_\emptyset$, and we denote by $\mathcal{M}$ the set of all sketches that can be obtained inductively from $M_\emptyset$ by adding elements from any subset of $\mathcal{U}$ in any order. Note that $\mathcal{M}$ is finite and of cardinality at most $2^{|\mathcal{U}|}$. Order and multiplicity do not influence sketches: we denote by $M_E$ the sketch obtained by adding all elements of a set $E$ (in any order) to $M_\emptyset$.
$\mathrm{add}(M_E, e) = M_E$ for all $e \in E$.
This follows directly from Properties 1 and 2. ∎
In practice, cardinality estimators also have a merge operation. We do not explicitly require the existence of this operation, since the idempotence and commutativity of the add operation ensure its existence as follows.
To merge two sketches $M$ and $M'$, choose some set of elements $E'$ such that $M'$ is the sketch obtained by adding all elements of $E'$ to the empty sketch. We define $\mathrm{merge}(M, M')$ to be the sketch obtained after adding all elements of $E'$ successively to $M$.
Note that this construction is not computationally tractable, even though in practical scenarios, the merge operation must be fast. This efficiency requirement is not necessary for any of our results, so we do not explicitly require it either.
The merge operation in Definition 4 is well-defined, i.e., it does not depend on the choice of the set $E'$ used to represent the second sketch. Furthermore, the merge operation is commutative and idempotent, and the set of sketches equipped with it forms a monoid with the empty sketch as neutral element.
The proof is given in Appendix A. Note that these properties of the merge operation are important for cardinality estimators: when aggregating different sketches, we must ensure that the result is the same no matter in which order the sketches are aggregated.
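To make the add axioms and the constructive definition of merge concrete, here is a minimal Python sketch. It assumes a toy estimator that keeps the `K` smallest hash values seen so far; the names `add`, `merge`, `merge_by_definition`, `sketch_of`, the value of `K`, and the sample user lists are all hypothetical choices made for illustration, not part of the formal definition.

```python
import hashlib

K = 4       # number of hashes kept per sketch (illustrative value)
EMPTY = ()  # the empty sketch

def h(element: str) -> int:
    """Deterministic 64-bit hash: equal elements always map to the same value."""
    return int.from_bytes(hashlib.sha256(element.encode()).digest()[:8], "big")

def add(sketch: tuple, element: str) -> tuple:
    """Return the sketch holding the K smallest distinct hashes seen so far."""
    return tuple(sorted(set(sketch) | {h(element)})[:K])

def sketch_of(elements) -> tuple:
    """Sketch obtained by adding the elements one by one to the empty sketch."""
    sketch = EMPTY
    for e in elements:
        sketch = add(sketch, e)
    return sketch

def merge_by_definition(sketch: tuple, other_elements) -> tuple:
    """Constructive merge: re-add a set of elements that generates the other sketch."""
    for e in other_elements:
        sketch = add(sketch, e)
    return sketch

def merge(a: tuple, b: tuple) -> tuple:
    """Practical merge that only looks at the two sketches."""
    return tuple(sorted(set(a) | set(b))[:K])

monday = ["alice", "bob", "carol", "dave", "erin"]
tuesday = ["bob", "frank", "grace", "heidi"]
m1, m2 = sketch_of(monday), sketch_of(tuesday)

# Both merges agree with sketching the union of the two days directly.
print(merge(m1, m2) == merge_by_definition(m1, tuesday) == sketch_of(monday + tuesday))
```

For this toy estimator, the practical `merge` never needs the original elements, whereas `merge_by_definition` does; the formal construction only has to exist, not to be efficient.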
Existing cardinality estimators also satisfy efficiency requirements: they have a low memory footprint, and and run in constant time. These additional properties are not needed for our results, so we omit them in our definition.
We now define precise cardinality estimators.
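Let $E$ be a set of cardinality $n$ taken uniformly at random in $\mathcal{U}$, and let $M_E$ be the sketch obtained by adding its elements to the empty sketch. The quality of a cardinality estimation algorithm is given by two metrics: its bias $\mathbb{E}[\mathrm{estimate}(M_E)] - n$, and its variance $\mathbb{V}[\mathrm{estimate}(M_E)]$.
Cardinality estimators used in practice are asymptotically unbiased: their bias becomes negligible compared to $n$ as $n$ grows. In the rest of this work, we assume that all cardinality estimators we consider are perfectly unbiased, so $\mathbb{E}[\mathrm{estimate}(M_E)] = n$. Cardinality estimators are often compared by their relative standard error (RSE), which is given by $\sqrt{\mathbb{V}[\mathrm{estimate}(M_E)]} / n$.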
A cardinality estimator is said to be precise if it is asymptotically unbiased and its relative standard error is bounded by a constant. In practice, we want the relative standard error to be less than a reasonably small constant, for example less than 10%.
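Bias and relative standard error can be measured empirically by repeating the estimation on many random sets. The Python sketch below does this for one simple estimator, chosen here only as an illustration: it keeps the `k` smallest hash values and returns `(k - 1) / u_k`, where `u_k` is the `k`-th smallest value. Hashes of a uniformly random set are simulated by uniform random numbers in `[0, 1)`; the function names, parameters, and seed are assumptions, not taken from the text.

```python
import heapq
import random
import statistics

def kmv_estimate(hashes, k):
    """Estimate the number of distinct values from the k smallest hashes in [0, 1)."""
    smallest = heapq.nsmallest(k, hashes)
    if len(smallest) < k:
        return float(len(smallest))  # fewer than k elements: the count is exact
    return (k - 1) / smallest[-1]

def empirical_bias_and_rse(n, k, trials, seed=0):
    """Simulate `trials` random sets of cardinality n; return (bias, RSE)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        # Hashing a uniformly random set is modeled as n independent uniform draws.
        hashes = [rng.random() for _ in range(n)]
        estimates.append(kmv_estimate(hashes, k))
    bias = statistics.fmean(estimates) - n
    rse = statistics.pstdev(estimates) / n
    return bias, rse

bias, rse = empirical_bias_and_rse(n=2000, k=64, trials=300)
print(f"bias = {bias:.1f}, relative standard error = {rse:.3f}")
```

With these illustrative parameters, the measured bias should be small relative to `n`, and the RSE should stay roughly constant as `n` grows, which is what the definition of a precise estimator asks for.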
We give a few examples of cardinality estimators, with their memory usage in bits and their relative standard error. As a first step, they all apply a hash function to the user identifiers. Conceptually, this step assigns each user a pseudorandom bitstring of 32 or 64 bits. Since hashing is deterministic, all occurrences of a user are mapped to the same bitstring.
K-Minimum Values [bar2002counting], with parameter $k$, maintains a list of the $k$ smallest hashes that have been added to the sketch. With -bit hashes, it has a memory usage of , and its unbiased estimator has an RSE of approximately [beyer2007synopses].
Probabilistic Counting with Stochastic Averaging (PCSA, also known as FM-sketches) [flajolet1985probabilistic] maintains a list of bit arrays. When an element is added to an FM-sketch, its hash is split into two parts. The first part determines which bit array is modified. The second part determines which bit is flipped to 1, depending on the number of consecutive zeroes at the beginning. With registers and -bit hashes, its memory usage is , and its RSE is approximately .
LogLog [durand2003loglog] maintains a tuple of registers. Like PCSA, it uses the first part of the hash to pick which register to modify. Then, each register stores the maximum number of consecutive zeroes observed so far. Using registers and a -bit hash, its memory usage is bits, and its standard error is approximately .
HyperLogLog [heule2013hyperloglog] has the same structure and add operation as LogLog, only its estimate operation is different. Using registers and a -bit hash, it has a memory usage of bits, and its standard error is approximately .
Bloom filters can also be used for cardinality estimation [papapetrou2010cardinality]. However, we could not find an expression of the standard error for a given memory usage in the literature.
All these cardinality estimators have null or negligible bias. Thus, their variance is essentially equal to their mean squared error. So the first four are precise, whereas we do not know if Bloom filters are.
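As an illustration of the register-based estimators listed above, here is a minimal HyperLogLog-style sketch in Python. It is a simplified sketch under stated assumptions, not a reference implementation: it uses 256 registers, a 64-bit hash taken from SHA-256, the standard raw estimator with its bias-correction constant, and only the small-range (linear counting) correction. The names `hash64`, `add`, `estimate`, and the `user-…` identifiers are hypothetical.

```python
import hashlib
import math

P = 8                 # number of hash bits used to pick a register (assumption)
K = 1 << P            # 256 registers
REST_BITS = 64 - P    # bits left to count leading zeroes in
EMPTY = (0,) * K      # the empty sketch

def hash64(element: str) -> int:
    """Deterministic 64-bit hash of an element."""
    return int.from_bytes(hashlib.sha256(element.encode()).digest()[:8], "big")

def add(sketch: tuple, element: str) -> tuple:
    """Update one register with the maximum rank observed so far."""
    value = hash64(element)
    index = value >> REST_BITS                  # first P bits pick the register
    rest = value & ((1 << REST_BITS) - 1)       # remaining bits
    rank = REST_BITS - rest.bit_length() + 1    # leading zeroes in `rest`, plus one
    if rank <= sketch[index]:
        return sketch                           # element is ignored: sketch unchanged
    return sketch[:index] + (rank,) + sketch[index + 1:]

def estimate(sketch: tuple) -> float:
    """Raw HyperLogLog estimate with the small-range correction only."""
    alpha = 0.7213 / (1 + 1.079 / K)            # constant valid for K >= 128
    raw = alpha * K * K / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * K and zeros > 0:
        return K * math.log(K / zeros)          # linear counting for small sets
    return raw

sketch = EMPTY
for i in range(10_000):
    sketch = add(sketch, f"user-{i}")
print(round(estimate(sketch)))
```

Because each register only stores a maximum, `add` is idempotent and commutative, and two sketches can be merged by taking the register-wise maximum.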
All examples above are deterministic cardinality estimators. For them and other deterministic cardinality estimators, bias and variance only come from the randomness in the algorithm’s inputs. We now define probabilistic cardinality estimators. Intuitively, these are algorithms that retain all the useful properties of deterministic cardinality estimators, but may flip coins during computation. We denote by the set of distributions over .
A probabilistic cardinality estimator is a tuple , where
is the empty sketch;
is the probabilistic operation that adds the element to the sketch and returns an updated sketch;
is the probabilistic operation that merges two sketches and ; and
estimates the number of unique elements that have been added to the sketch.
Both the add and merge operations can be extended to distributions of sketches. For a distribution of sketches and an element , denotes the distribution such that:
For two distributions of sketches and , denotes the distribution such that:
We want probabilistic cardinality estimators to have the same high-level properties as deterministic cardinality estimators: idempotence, commutativity, and the existence of a well-behaved merge operation. In the deterministic case, the idempotence and commutativity of the add operation were sufficient to show the existence of a merge operation with the desired properties. In the probabilistic case, this no longer holds. Instead, we require the following two properties.
For a set , let denote the sketch distribution obtained when adding elements of successively into . The mapping from to must be well-defined: it must be independent of the order in which we add elements, and possible repetitions. This requirement encompasses both idempotence and commutativity.
For two subsets and of , we require that
These requirements encompass the results of Lemma 5.
These properties, like in the deterministic case, are very strong. They impose that an arbitrary number of sketches can be aggregated without losing accuracy during the aggregation process. This requirement is however realistic in many practical contexts, where the same sketches are used for fine-grained analysis and for large-scale cardinality estimation. If this requirement is relaxed, and the cardinality estimator is allowed to return imprecise results when merging sketches, our negative results do not hold.
For example, Tschorsch and Scheuermann proposed a cardinality estimation scheme [tschorsch2013algorithm] which adds noise to sketches to make them satisfy privacy guarantees in a distributed context. However, their algorithm is not a probabilistic cardinality estimator according to our definition: noisy sketches can no longer be aggregated. Indeed, [tschorsch2013algorithm] explains that “combining many perturbed sketches quickly drives [noise] to exceedingly high values.” In our setting, aggregation is crucial, so we do not further consider their algorithm.
4 Modeling privacy
4.1 System and attacker model
Figure 1 shows our system and attacker model. The service provider collects sensitive data from many users over a long time span. The raw data is stored in a database. Over shorter time periods (e.g. an hour, a day, or a week), a cardinality estimator aggregates all data into a sketch. Sketches of all time periods are stored in a sketch database. Sketches from different times are aggregated into sketches of longer time spans, which are also stored in the database. Estimators compute user statistics from the sketches in the database, which are published. The service provider may also publish the sketches via an analytics service for other parties.
The attacker knows all algorithms used (those for sketching, aggregation, and estimation, including their configuration parameters such as the hash function and the number of buckets) and has access to the published statistics and the analytics service. She controls a small fraction of the users that produce user data. However, she can neither observe nor change the data of the other users. She also does not have access to the database containing the raw data.
In this work, we mainly consider an internal attacker who has access to the sketch database. For this internal attacker, the goal is to discover whether her target belongs to a given sketch. We then discuss how our results extend to weaker external attackers, who can only use the analytics service. We will see that for our main results, the attacker only requires access to one sketch. The possibility of using multiple sketches will only come up when discussing mitigation strategies in Section 8.1.
4.2 Privacy definition
We now present the privacy definition used in our main result. Given the system and attacker just described, our definition captures the impossibility for the attacker to gain significant positive knowledge about a given target. We explain this notion of knowledge gain, state assumptions on the attacker’s prior knowledge, and compare our privacy notion with other well-known definitions.
We define a very weak privacy requirement: reasonable definitions used for practical algorithms would likely be stronger. Working with a weak definition strengthens our negative result: if a cardinality estimator satisfying our weak privacy definition cannot be precise, then this is also the case for cardinality estimators satisfying a stronger definition.
In Section 6, we explore even weaker privacy definitions and prove similar negative results (although with a looser bound). In Section 7, we relax the requirement that every individual user must be protected according to the privacy definition. We show then that our theorem no longer holds, but that practical uses of cardinality estimators still cannot be considered privacy-preserving.
We model a possible attack as follows. The attacker has access to the identifier of a user $t$ (her target) and a sketch $M$ generated from a set of users $E$ unknown to her. The attacker wants to know whether $t \in E$. She initially has a prior knowledge of whether the target is in the sketch. As in Bayesian inference, this prior represents how much more likely, according to the attacker, the user is to be in the database than not. After looking at the sketch $M$, this knowledge changes and becomes her posterior knowledge.
We define privacy to capture that the attacker’s posterior knowledge should not increase too much. In other words, the attacker should not gain significant knowledge by seeing the sketch. This must hold for every possible sketch and every possible user . We show in Lemma 10 that the following definition bounds the positive knowledge gain of the attacker.
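A cardinality estimator satisfies $\varepsilon$-sketch privacy above cardinality $N$ if for every $n \geq N$, every sketch $M$, and every user $t \in \mathcal{U}$, the following inequality holds, where $M_E$ denotes the sketch built from the set $E$:
$$\mathbb{P}[M_E = M \mid t \in E] \leq e^{\varepsilon} \cdot \mathbb{P}[M_E = M \mid t \notin E].$$
Here, the probability is taken over:
a uniformly chosen set $E \in \mathcal{P}_n(\mathcal{U})$, where $\mathcal{P}_n(\mathcal{U})$ is the set of all possible subsets of $\mathcal{U}$ of cardinality $n$; and
the coin flips of the algorithm, for probabilistic cardinality estimators.
If a cardinality estimator satisfies this definition, then for any user $t$, the probability of observing $M$ if $t \in E$ is not much higher than the probability of observing $M$ if $t \notin E$. To give additional intuition on Definition 9, we now show that the parameter $\varepsilon$ effectively captures the attacker’s positive knowledge gain.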
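A cardinality estimator satisfies $\varepsilon$-sketch privacy above cardinality $N$ if and only if the following inequality holds for every $n \geq N$, $t \in \mathcal{U}$, and $M$ with $\mathbb{P}[M_E = M] > 0$:
$$\frac{\mathbb{P}[t \in E \mid M_E = M]}{\mathbb{P}[t \notin E \mid M_E = M]} \leq e^{\varepsilon} \cdot \frac{\mathbb{P}[t \in E]}{\mathbb{P}[t \notin E]}.$$
Bayes’ law can be used to derive one inequality from the other. We have
$$\frac{\mathbb{P}[t \in E \mid M_E = M]}{\mathbb{P}[t \notin E \mid M_E = M]} = \frac{\mathbb{P}[M_E = M \mid t \in E]}{\mathbb{P}[M_E = M \mid t \notin E]} \cdot \frac{\mathbb{P}[t \in E]}{\mathbb{P}[t \notin E]}.$$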
The equivalence between the definitions follows directly. ∎
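The effect of the privacy parameter on the attacker's certainty can be checked numerically. The short Python sketch below assumes that the bound is expressed on odds, that is, that the posterior odds of the target being in the set are at most $e^{\varepsilon}$ times the prior odds; the function name `max_posterior` and the sample values are hypothetical.

```python
import math

def max_posterior(prior: float, epsilon: float) -> float:
    """Largest posterior probability allowed when posterior odds <= e^epsilon * prior odds."""
    prior_odds = prior / (1 - prior)
    posterior_odds = math.exp(epsilon) * prior_odds
    return posterior_odds / (1 + posterior_odds)

for prior in (0.01, 0.1, 0.5):
    for epsilon in (0.1, 1.0, 3.0):
        bound = max_posterior(prior, epsilon)
        print(f"prior={prior:.2f}  epsilon={epsilon:.1f}  posterior <= {bound:.3f}")
```

A small parameter keeps the posterior close to the prior, while a large one allows the attacker to become almost certain even when starting from a weak suspicion.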
This definition has three characteristics which make it unusually weak. They correspond to an underapproximation of the attacker’s capabilities and goals.
- Uniform prior
The choice of distribution for $E$ implies that the elements of $E$ are uniformly distributed in $\mathcal{U}$. This corresponds to an attacker who has no prior knowledge about the data. In the absence of prior information about the elements of the set $E$, the attacker’s best approximation is the uniform distribution. In practice, a realistic attacker might have more information about the data, so a stronger privacy definition would model this prior knowledge by a larger family of probability distributions. More precisely, since the elements of $E$ are uniformly distributed in $\mathcal{U}$, the attacker’s prior knowledge is exactly $\mathbb{P}[t \in E] = n / |\mathcal{U}|$, where $n$ is the cardinality of $E$. A realistic attacker would likely have a larger prior knowledge about their target. However, any reasonable definition of privacy would also include the case where the attacker does not have more information on their target than on other users and, as such, would be stronger than $\varepsilon$-sketch privacy.
- Asymmetry
We only consider the positive information gain by the attacker. There is an upper bound on the probability that the target is in the set ($t \in E$) given the observed sketch $M$, but no lower bound. In other words, the attacker is allowed to deduce with absolute certainty that $t \notin E$. In practice, both positive and negative information gains may present a privacy risk. In our running example (see Example 1), deducing that a user did not spend the night at his apartment could be problematic.
- Minimum cardinality
We only require a bound on the information gain for cardinalities larger than a parameter $N$. In practice, $N$ could represent a threshold over which it is considered safe to publish sketches or to relax data protection requirements. Choosing a small $N$ strengthens the privacy definition, while choosing a large $N$ limits the utility of the data, as many smaller sketches cannot be published.
We emphasize again that these characteristics, which result in a very weak definition, make our notion of privacy well-suited to proving negative results. If satisfying our definition is impossible for an accurate cardinality estimator, then a stronger definition would similarly be impossible to satisfy. For example, any reasonable choice of distributions used to represent the prior knowledge of the attacker would include the uniform distribution.
We now compare our definition to two other notions: differential privacy [dwork2008differential] and Pufferfish privacy [kifer2012rigorous].
4.3 Relation to differential privacy
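Recall the definition of differential privacy: an algorithm $\mathcal{A}$ is $\varepsilon$-differentially private if and only if $\mathbb{P}[\mathcal{A}(D_1) = O] \leq e^{\varepsilon} \cdot \mathbb{P}[\mathcal{A}(D_2) = O]$ for any output $O$ and any databases $D_1$ and $D_2$ that only differ by one element. In our setup, with $M_E$ denoting the sketch built from the set $E$, this could be written as the two inequalities $\mathbb{P}[M_E = M \mid t \in E] \leq e^{\varepsilon} \cdot \mathbb{P}[M_E = M \mid t \notin E]$ and $\mathbb{P}[M_E = M \mid t \notin E] \leq e^{\varepsilon} \cdot \mathbb{P}[M_E = M \mid t \in E]$.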
Asymmetry, and minimum cardinality, are two obvious differences between our notion of privacy and differential privacy. But the major difference lies in the source of uncertainty. In differential privacy, the probabilities are taken over the coin flips of the algorithm. The attacker is implicitly assumed to know the algorithm’s input except for one user: the uncertainty comes entirely from the algorithm’s randomness. In our definition, the attacker has no prior knowledge of the input, so the uncertainty comes either entirely from the attacker’s lack of background knowledge (for deterministic cardinality estimators), or both from the attacker’s lack of background knowledge and the algorithm’s inherent randomness.
The notion of relying on the initial lack of knowledge of the attacker in a privacy definition is not new: it is for example made explicit in the definition of Pufferfish privacy, a generic framework for privacy definitions.
4.4 Relation to Pufferfish privacy
Pufferfish privacy [kifer2012rigorous] is a customizable framework for building privacy definitions. A Pufferfish privacy definition has three components: a set of potential secrets $\mathbb{S}$; a set of discriminative pairs $\mathbb{S}_{pairs} \subseteq \mathbb{S} \times \mathbb{S}$; and a set of data evolution scenarios $\mathbb{D}$.
$\mathbb{S}_{pairs}$ represents the facts we want the attacker to be unable to distinguish. In our case, we want to prevent the attacker from distinguishing between $t \in E$ and $t \notin E$: $\mathbb{S}_{pairs} = \{(t \in E, t \notin E) \mid t \in \mathcal{U}\}$. $\mathbb{D}$ represents what the possible distributions of the input data are. In our case, it is a singleton that only contains the uniform distribution.
Our definition is almost an instance of Pufferfish privacy. Like with differential privacy, the main difference is asymmetry.
The close link to Pufferfish privacy supports our proof of two fundamental properties of privacy definitions: transformation invariance and convexity [kifer2012rigorous]. Transformation invariance states that performing additional analysis of the output of the algorithm does not allow an attacker to gain more information, i.e., the privacy definition is closed under composition with probabilistic algorithms. Convexity states that if a data owner chooses randomly between two algorithms satisfying a privacy definition and generates the corresponding output, this procedure itself will satisfy the same privacy definition. These two properties act as sanity checks for our privacy definition.
$\varepsilon$-sketch privacy above cardinality $N$ satisfies transformation invariance and convexity.
The proof is similar to the proof of Theorem 5.1 in [kifer2012rigorous], proved in Appendix B of the same paper. ∎
5 Private cardinality estimators are imprecise
Let us return to our privacy problem: someone with access to a sketch wants to know whether a given individual belongs to the aggregated individuals in the sketch. Formally, given a target $t$ and a sketch $M$ built from a set $E$, the attacker must guess with high probability whether $t \in E$. In Section 5.1, we explain how the attacker can use a simple test to gain significant information if the cardinality estimator is deterministic. Then, in Section 5.2, we reformulate the main technical lemma in probabilistic terms, and prove an equivalent theorem for probabilistic cardinality estimators.
5.1 Deterministic case
Given a target $t$ and a sketch $M$ built from a set $E$, the attacker can perform the following simple attack to guess whether $t \in E$. She can try to add the target $t$ to the sketch $M$, and observe whether the sketch changes. In other words, she checks whether $\mathrm{add}(M, t) = M$. If the sketch changes, this means with certainty that $t \notin E$. Thus, Bayes’ law indicates that if $\mathrm{add}(M, t) = M$, then the probability of $t \in E$ cannot decrease.
How large is this increase? Intuitively, it depends on how likely it is that adding an element to a sketch does not change it if the element has not previously been added to the sketch. Formally, it depends on $\mathbb{P}[\mathrm{add}(M_E, t) = M_E \mid t \notin E]$, where $M_E$ denotes the sketch built from $E$.
If this probability is close to $0$, for example if the sketch is a list of all elements seen so far, then observing that $\mathrm{add}(M, t) = M$ will lead the attacker to believe with high probability that $t \in E$.
If this probability is close to $1$, it means that adding an element to a sketch often does not change it. The previous attack does not reveal much information. But then, it also means that many elements are ignored when they are added to the sketch, that is, the sketch does not change when adding the element. Intuitively, the accuracy of an estimator based solely on a sketch that ignores many elements cannot be very good.
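The attack can be simulated in a few lines of Python. The sketch below uses a LogLog-style register sketch with 64 registers and measures, over random sets, how often adding an absent target leaves the sketch unchanged; this rate is averaged over sketches, which is a simplification of the per-sketch analysis. It then applies Bayes' law to turn a prior into a posterior. All names and parameter values (`K`, `looks_present`, `unchanged_rate_for_absent_targets`, the universe size, the seed) are illustrative assumptions.

```python
import hashlib
import random

K = 64              # number of registers (illustrative)
REST_BITS = 64 - 6  # 6 hash bits select one of the 64 registers
EMPTY = (0,) * K

def hash64(user: int) -> int:
    return int.from_bytes(hashlib.sha256(str(user).encode()).digest()[:8], "big")

def add(sketch: tuple, user: int) -> tuple:
    """Register-wise maximum of the rank (leading zeroes plus one) of the hash."""
    value = hash64(user)
    index = value >> REST_BITS
    rest = value & ((1 << REST_BITS) - 1)
    rank = REST_BITS - rest.bit_length() + 1
    if rank <= sketch[index]:
        return sketch
    return sketch[:index] + (rank,) + sketch[index + 1:]

def sketch_of(users) -> tuple:
    sketch = EMPTY
    for user in users:
        sketch = add(sketch, user)
    return sketch

def looks_present(sketch: tuple, target: int) -> bool:
    """The attacker's test: does adding the target leave the sketch unchanged?"""
    return add(sketch, target) == sketch

def unchanged_rate_for_absent_targets(n, trials, universe_size=10**6, seed=0):
    """Fraction of trials in which a target NOT in the set passes the test."""
    rng = random.Random(seed)
    unchanged = 0
    for _ in range(trials):
        users = rng.sample(range(universe_size), n + 1)
        target, members = users[0], users[1:]
        if looks_present(sketch_of(members), target):
            unchanged += 1
    return unchanged / trials

def posterior(prior: float, rate: float) -> float:
    """Bayes' law: a present target always passes, an absent one passes with `rate`."""
    return prior / (prior + (1 - prior) * rate)

rate = unchanged_rate_for_absent_targets(n=200, trials=200)
print(f"absent targets passing the test: {rate:.2f}")
print(f"posterior for a 10% prior: {posterior(0.10, rate):.2f}")
```

The lower the measured rate, the more the test reveals; a rate close to 1 means the sketch ignores most new elements, which is exactly the regime in which accuracy suffers.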
We formalize this intuition in the following theorem.
An unbiased deterministic cardinality estimator that satisfies -sketch privacy above cardinality is not precise. Namely, its variance is at least , for any and , where
Note that if we were using differential privacy, this result would be trivial: no deterministic algorithm can ever be differentially private. However, this is not so obvious for our definition of privacy: prior work [bhaskar2011noiseless, bassily2013coupled, grining2017towards] shows that when the attacker is assumed to have some uncertainty about the data, even deterministic algorithms can satisfy the corresponding definition of privacy.
Figure 2 shows plots of the lower bound on the standard error of a cardinality estimator with -sketch privacy at two cardinalities (100 and 500). It shows that the standard error increases exponentially with the number of elements added to the sketch. This demonstrates that even if we require the privacy property for a large value of (500) and a large (which is generally less than ), the standard error of a cardinality estimator will become unreasonably large after 20,000 elements.
Proof of Theorem 12.
The proof consists of three steps, following the intuition previously given.
We show that a sketch , computed from a random set with an -sketch private estimator above cardinality , will ignore many elements after (Lemma 13).
We prove that if a cardinality estimator ignores a certain ratio of elements after adding elements, then it will ignore an even larger ratio of elements as increases (Lemma 14).
We conclude by proving that an unbiased cardinality estimator that ignores many elements must have a large variance (Lemma 15).
The theorem follows directly from these lemmas. ∎
Let . A deterministic cardinality estimator with -sketch privacy above cardinality satisfies for .
We first prove that such an estimator satisfies
We decompose the left-hand side of the inequality over all possible values of such that . If we call this set , we have:
where the first inequality is obtained directly from the definition of -sketch privacy.
Now, Lemma 3 gives , and finally . ∎
Let . Suppose a deterministic cardinality estimator satisfies for any . Then for any integer , it also satisfies , for .
First, note that if , and , then . This is a direct consequence of Lemma 5: , so:
We show next that when , generating a set uniformly randomly can be seen as generating independent sets in , then merging them. Indeed, generating such a set can be done as follows:
For , generate a set uniformly randomly. Let .
Count the number of elements appearing in multiple : . Generate a set uniformly randomly.
is then defined by . Step ensures that we used independent sets of cardinality to generate , and step ensures that has exactly elements.
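The two-step generation procedure can be sketched as follows. This is a minimal illustration, assuming the universe is the integer range `0 .. universe_size - 1`, that `n` is the cardinality of each independent set, that `k` is the number of sets, and that the correction set is drawn from the elements not yet chosen; all of these names are hypothetical.

```python
import random


def random_set_by_merging(universe_size: int, n: int, k: int, rng: random.Random) -> set:
    """Build a set of exactly k*n elements as a union of k independent size-n sets."""
    universe = range(universe_size)
    # Step 1: k independent, uniformly random sets of cardinality n.
    parts = [set(rng.sample(universe, n)) for _ in range(k)]
    union = set().union(*parts)
    # Step 2: elements drawn several times leave the union short of k*n elements,
    # so draw the missing number of elements among those not chosen yet.
    missing = k * n - len(union)
    remaining = [x for x in universe if x not in union]
    correction = set(rng.sample(remaining, missing))
    return union | correction


example = random_set_by_merging(universe_size=1000, n=50, k=4, rng=random.Random(1))
print(len(example))
```

The procedure treats all elements of the universe symmetrically, which is why the resulting set of cardinality `k * n` is uniformly distributed. The sketch requires `k * n` to be at most `universe_size`.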
Intuitively, each time we generate a set of cardinality uniformly at random in , we have one chance that will be ignored by (and thus by ). So can be ignored by with a certain probability because it was ignored by . Similarly, it can also be ignored because of , etc. Since the choice of is independent of the choice of elements in , we can rewrite:
using the hypothesis of the lemma. Thus:
Suppose a deterministic cardinality estimator satisfies for any and all . Then its variance for is at least .
The proof’s intuition is as follows. The hypothesis of the lemma requires that the cardinality estimator, on average, ignores a proportion of new elements added to a sketch (once elements have been added): the sketch is not changed when a new element is added. The best thing that the cardinality estimator can do, then, is to store all elements that it does not ignore, count the number of unique elements among these, and multiply this number by to correct for the elements ignored. It is well-known that estimating the size of a set based on the size of a uniform sample of sampling ratio has a variance of . Hence, our cardinality estimator has a variance of at least .
Formalizing this idea requires some additional technical steps. The full proof is given in Appendix B. ∎
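The sampling argument in this intuition can be checked numerically. The sketch below assumes a Bernoulli sampling model, in which each of `n` distinct elements is kept independently with probability `q`, and the estimator is the sample size divided by `q`. The names `n` and `q` are illustrative, and the closed form `n * (1 - q) / q` is the standard variance for this model rather than a restatement of the lemma's bound.

```python
from math import comb


def sampling_estimator_moments(n: int, q: float) -> tuple[float, float]:
    """Exact mean and variance of X / q, where X ~ Binomial(n, q)."""
    pmf = [comb(n, x) * q**x * (1 - q) ** (n - x) for x in range(n + 1)]
    mean = sum(p * x / q for x, p in enumerate(pmf))
    var = sum(p * (x / q - mean) ** 2 for x, p in enumerate(pmf))
    return mean, var


n, q = 200, 0.25
mean, var = sampling_estimator_moments(n, q)
# Compare the exact variance with the closed form n * (1 - q) / q.
print(mean, var, n * (1 - q) / q)
```

The estimator is unbiased, and its variance grows as the sampling ratio `q` shrinks, that is, as the estimator ignores a larger share of elements.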
5.2 Probabilistic case
Algorithms that add noise to their output, or more generally, are allowed to use a source of randomness, are often used in privacy contexts. As such, even though all cardinality estimators used in practical applications are deterministic, it is reasonable to hope that a probabilistic cardinality estimator could satisfy our very weak privacy definition. Unfortunately, this is not the case.
In the deterministic case, we showed that for any element , the probability that has an influence on a random sketch decreases exponentially with the sketch size. Or, equivalently, the distribution of sketches of size that do not contain is “almost the same” (up to a density of probability ) as the distribution of sketches of the same size, but containing .
The following lemma establishes the same result in the probabilistic setting. Instead of reasoning about the probability that an element is “ignored” by a sketch , we reason about the probability that has a meaningful influence on this sketch. We show that this probability decreases exponentially, even if is very high.
First, we prove a technical lemma on the structure that the operation imposes on the space of sketch distributions. Then, we find an upper bound on the “meaningful influence” of an element , when added to a random sketch of cardinality . We then use this upper bound, characterized using the statistical distance, to show that the estimator is as imprecise as in the deterministic case.
The operation defines a commutative and associative algebra on .
By the properties required from probabilistic cardinality estimators in Definition 8, the operation is commutative and associative on the family . By linearity of the operation, these properties are preserved for any linear combination of vectors . ∎
Suppose a cardinality estimator satisfies -sketch privacy above cardinality , and let . Let be the distribution of sketches obtained by adding uniformly random elements of into (or, equivalently, ). Then:
where is the statistical distance between probability distributions.
Let be the distribution of sketches obtained by adding , then uniformly random elements of into (or, equivalently, ). Then the definition of -sketch privacy gives that for every sketch , . So we can express as the sum of two distributions:
for a certain distribution .
First, we show that for a certain distribution . Indeed, to generate a sketch of cardinality that does not contain uniformly randomly, one can use the following process.
Generate random sketches of cardinality which do not contain , and merge them.
For all , denote by the probability that the sketches were generated with the elements in . There might be “collisions” between the sketches: if several sketches were generated using the same element, . When this happens, we need to “correct” the distribution, and add additional elements. Enumerating all the options, we denote , where is obtained by adding uniformly random elements in to . Thus, .
All these distributions are in : , , , etc. Thus:
Denoting and , this gives us:
Finally, we can compute :
Note that since , we have by idempotence, and:
Suppose that a cardinality estimator satisfies for any and all , . Then its variance for is at least .
The condition “” is equivalent to the condition of Lemma 15: with probability , the cardinality estimator “ignores” when a new element is added to a sketch. Just like in Lemma 15’s proof, we can convert this constraint into estimating the size of a set based on a sampling set. The best known estimator for this problem is deterministic, so allowing the cardinality estimator to be probabilistic does not help improve the optimal variance.
The same result as in Lemma 15 follows. ∎
An unbiased probabilistic cardinality estimator that satisfies -sketch privacy above cardinality is not precise. Namely, its variance is at least , for any and , where
Somewhat surprisingly, allowing the algorithm to add noise to the data seems to be pointless from a privacy perspective. Indeed, given the same privacy guarantee, the lower bound on accuracy is the same for deterministic and probabilistic cardinality estimators. This suggests that the constraints of these algorithms (idempotence and commutativity) require them to somehow keep a trace of who was added to the sketch (at least for some users), which is fundamentally incompatible with even weak notions of privacy.
6 Weakening the privacy definition
Our main result is negative: no cardinality estimator satisfying our privacy definition can maintain a good accuracy. Thus, it is natural to wonder whether our privacy definition is too strict, and if the result still holds for weaker variants.
In this section, we consider two weaker variants of our privacy definition: one allows a small probability of privacy loss, while the other averages the privacy loss across all possible outputs. We show that these natural relaxations do not help, as close variants of our negative result still hold.
6.1 Allowing a small probability of privacy loss
As Lemma 10 shows, -sketch differential privacy provides a bound on how much information the attacker can gain in the worst case. A natural relaxation is to accept a small probability of failure: requiring a bound on the information gain in most cases, and accepting a potentially unbounded information gain with low probability.
We introduce a new parameter, called , similar to the use of in the definition of -differential privacy: is -differentially private if and only if for any databases and that only differ by one element and any set of possible outputs, .
A cardinality estimator satisfies -sketch privacy above cardinality if for every , , and ,
Unfortunately, our negative result still holds for this variant of the definition. Indeed, we show that a close variant of Lemma 13 holds, and the rest follows directly.
Let . A cardinality estimator that satisfies -probabilistic sketch privacy above cardinality satisfies for .
An unbiased cardinality estimator that satisfies -sketch privacy above cardinality has a variance at least for any and , where . It is therefore not precise if .
6.2 Averaging the privacy loss
Instead of requiring that the attacker’s information gain is bounded by for every possible output, we could bound the average information gain. This is equivalent to accepting a larger privacy loss in some cases, as long as other cases have a lower privacy loss.
This intuition is captured by the use of Kullback-Leibler divergence, which is often used in similar contexts [rebollo2010optimized, rebollo2010t, diaz2002towards, dwork2010boosting]. In our case, we adapt it to maintain the asymmetry of our original privacy definition. First, we give a formal definition of the privacy loss of a user given output .
Given a cardinality estimator, the positive privacy loss of given output at cardinality is defined as
This privacy loss is never negative: this is equivalent to discarding the case where the attacker gains negative information. Now, we bound this average over all possible values of , given .
A cardinality estimator satisfies -sketch average privacy above cardinality if for every and , we have
It is easy to check that -sketch average privacy above cardinality is strictly weaker than -sketch privacy above cardinality . Unfortunately, this definition is also stronger than -sketch privacy above cardinality for certain values of and , and as such, Lemma 22 also applies. We prove this in the following lemma.
If a cardinality estimator satisfies -sketch average privacy above cardinality , then it also satisfies -sketch privacy above cardinality for any .
The proof is given in Appendix D. This lemma leads to a similar version of the negative result.
An unbiased cardinality estimator that satisfies -sketch average privacy above cardinality has a variance at least for any and , where . It is thus not precise.
Recall that all existing cardinality estimators satisfy our axioms and have a bounded accuracy. Thus, an immediate corollary is that for all cardinality estimators used in practice, there are some users for which the average privacy loss is very large.
This idea of averaging is similar to the idea behind Rényi differential privacy [mironov2017renyi]. The parameter of Rényi differential privacy determines the averaging method used (geometric mean, arithmetic mean, quadratic mean, etc.). Using KL-divergence corresponds to , while averages all possible values of . Increasing strengthens the privacy definition [mironov2017renyi, Prop. 9], so our negative result still holds.
7 Privacy loss of individual users
So far, we only considered definitions of privacy that give the same guarantees for all users. What if we allow certain users to have less privacy than others, or if we were to average the privacy loss across users instead of averaging over all possible outcomes for each user?
Such definitions would generally not be sufficiently convincing to be used in practice: one typically wants to protect all users, not just a majority of them. In this section, we show that even if we relax this requirement, cardinality estimators would in practice leak a significant amount of information.
7.1 Allowing unbounded privacy loss for some users
What happens if we allow some users to have unbounded privacy loss? We could achieve this by requiring the existence of a subset of users of density , such that every user in is protected by -sketch privacy above cardinality . In this case, a ratio of possible targets are not protected.
This approach only makes sense if the attacker cannot choose the target . For our attacker model, this might be realistic: suppose that the attacker wants to target just one particular person. Since all user identifiers are hashed before being passed to the cardinality estimator, this person will be associated to a hash value that the attacker can neither predict nor influence. Thus, although the attacker picks , the true target of the attack is , which the attacker cannot choose.
Unfortunately, this drastic increase in privacy risk for some users does not lead to a large increase in accuracy. Indeed, the best possible use of this ratio of users from an accuracy perspective would simply be to count exactly the users in a sample of sampling ratio .
Estimating the total cardinality based on this sample, similarly to the optimal estimator in the proof of Lemma 15, leads to a variance of . If is very small (say, ), this variance is too large for counting small values of (say, and ). This is not surprising: if of the values are ignored by the cardinality estimator, we cannot expect it to count values of on the order of thousands. But even this value of is larger than what is often used with -differential privacy, where typically, .
But in our running example, sketches must yield a reasonable accuracy both at small and large cardinalities, if many sketches are aggregated. This implicitly assumes that the service operates at a large scale, say with at least users. With , this means that thousands of users are not covered by the privacy property. This is unacceptable for most applications.
7.2 Averaging the privacy loss across users
Instead of requiring the same for every user, we could require that the average information gain by the attacker is bounded by . In this section, we take the example of HyperLogLog to show that accuracy is not incompatible with this notion of average privacy, but that cardinality estimators used in practice do not preserve privacy even if we average across all users.
First, we define this notion of average information gain across users.
Recall the definition of the positive privacy loss of given output at cardinality from Definition 24: The maximum privacy loss of at cardinality is defined as . A cardinality estimator satisfies -sketch privacy on average if we have, for all , .
In this definition, we accept that some users might have less privacy as long as the average user satisfies our initial privacy definition. Remark 28 is still relevant: we chose to average over all values of , but other averaging functions are possible and would lead to strictly stronger definitions.
We show that HyperLogLog satisfies this definition and we consider the value of for various parameters and their significance. Intuitively, a HyperLogLog cardinality estimator puts every element in a random bucket, and each bucket counts the maximum number of leading zeroes of elements added in this bucket. More details are given in Appendix E.
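The bucket-and-leading-zeroes mechanism can be sketched as follows. This is a toy illustration rather than a production HyperLogLog: the class name `ToyHLL`, the use of SHA-256 truncated to 64 bits, the parameter name `p` for the number of bucket bits, and the convention of storing the number of leading zeroes plus one are all assumptions, and the estimation step is omitted.

```python
import hashlib


class ToyHLL:
    """Registers of a HyperLogLog-style sketch with 2**p buckets (estimation omitted)."""

    def __init__(self, p: int = 4):
        self.p = p
        self.registers = [0] * (1 << p)

    def _bucket_and_rank(self, element: str) -> tuple[int, int]:
        x = int.from_bytes(hashlib.sha256(element.encode()).digest()[:8], "big")
        rest_bits = 64 - self.p
        bucket = x >> rest_bits                    # first p bits choose the bucket
        rest = x & ((1 << rest_bits) - 1)          # remaining bits
        rank = rest_bits - rest.bit_length() + 1   # leading zeroes in the rest, plus one
        return bucket, rank

    def add(self, element: str) -> None:
        bucket, rank = self._bucket_and_rank(element)
        self.registers[bucket] = max(self.registers[bucket], rank)

    def merge(self, other: "ToyHLL") -> "ToyHLL":
        merged = ToyHLL(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def ignores(self, element: str) -> bool:
        """True if adding the element would leave the sketch unchanged."""
        bucket, rank = self._bucket_and_rank(element)
        return self.registers[bucket] >= rank


shard_a, shard_b, both = ToyHLL(), ToyHLL(), ToyHLL()
for i in range(500):
    shard_a.add(f"user-{i}")
    both.add(f"user-{i}")
for i in range(300, 800):
    shard_b.add(f"user-{i}")
    both.add(f"user-{i}")
# Merging per-shard sketches gives the same registers as sketching all the data at once.
print(shard_a.merge(shard_b).registers == both.registers)
```

Because each register only keeps a maximum, adding an element twice has no effect, and an element with many leading zeroes is rarely ignored by a sketch that does not already contain it.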
HyperLogLog cardinality estimators have a parameter that determines their memory consumption, their accuracy, and, as we will see, their level of average privacy.
Assuming a sufficiently large , a HyperLogLog cardinality estimator of parameter satisfies -sketch privacy above cardinality on average where for ,
The assumption that the set of possible elements is very large and its consequences are explained in more detail in the proof of this theorem, given in Appendix E.
How does this positive result fit practical use cases? Figure 3 plots for three different HyperLogLog cardinality estimators. It shows two important results.
First, cardinality estimators used in practice do not preserve privacy. For example, the default parameter used for production pipelines at Google and on the BigQuery service [bigqueryhll] is . For this value of , an attacker can determine with significant accuracy whether a target was added to a sketch; not only in the worst case, but for the average user too. The average risk only becomes reasonable for , a threshold too large for most data analysis tasks.
Second, by sacrificing some accuracy, it is possible to obtain a reasonable average privacy. For example, a HyperLogLog sketch for which has a relative standard error of about , and an of about for . Unfortunately, even when the average risk is acceptable, some users will still be at a higher risk: users with a large number of leading zeroes are much more identifiable than the average. For example, if , there is a chance that at least one user has . For this user, , a very high value.
Our calculations yield only an approximation of that is an upper bound on the actual privacy loss in HyperLogLog sketches. However, these alarming results can be confirmed experimentally. We simulated , for uniformly random values of , using HyperLogLog sketches with the parameter , the default used for production pipelines at Google and on the BigQuery service [bigqueryhll]. For each cardinality , we generated 10,000 different random target values, and added each one to 1,000 HyperLogLog sketches of cardinality (generated from random values). For each target, we counted the number of sketches that ignored it.
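A scaled-down version of such a simulation could look as follows. It is a sketch under simplifying assumptions, not the code used for the experiment: hashes are modelled directly as a uniformly random bucket and a geometrically distributed rank, duplicates among random values are neglected, the same batch of sketches is reused for every target, and the parameter names (`p`, `n_targets`, `n_sketches`) are illustrative.

```python
import random


def random_hash(rng: random.Random, p: int, max_rank: int = 64) -> tuple[int, int]:
    """Bucket and rank of a uniformly random hash; the rank is geometric with ratio 1/2."""
    bucket = rng.randrange(1 << p)
    rank = 1
    while rank < max_rank and rng.random() < 0.5:
        rank += 1
    return bucket, rank


def random_sketch(rng: random.Random, p: int, cardinality: int) -> list:
    """Registers of a sketch built from `cardinality` random elements."""
    registers = [0] * (1 << p)
    for _ in range(cardinality):
        bucket, rank = random_hash(rng, p)
        registers[bucket] = max(registers[bucket], rank)
    return registers


def ignore_rates(p: int, cardinality: int, n_targets: int, n_sketches: int, seed: int = 0) -> list:
    """For each random target, the fraction of random sketches that ignore it (sorted)."""
    rng = random.Random(seed)
    sketches = [random_sketch(rng, p, cardinality) for _ in range(n_sketches)]
    rates = []
    for _ in range(n_targets):
        bucket, rank = random_hash(rng, p)
        ignored = sum(s[bucket] >= rank for s in sketches)
        rates.append(ignored / n_sketches)
    return sorted(rates)


rates = ignore_rates(p=6, cardinality=1000, n_targets=200, n_sketches=100)
# Minimum, 10th percentile, and maximum ignore rate across targets.
print(rates[0], rates[len(rates) // 10], rates[-1])
```

Reading off entries of the sorted list gives percentile values of the kind plotted for the experiment, although the small parameters used here will not reproduce the reported figures.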
Figure 4 plots some percentile values. For example, the all-targets curve (100th percentile) has a value of 33% at cardinality = 10,000. This means that each of the 10,000 random targets was ignored by at most 33% of the 1,000 random sketches of this cardinality, i.e., for all . In other words, an attacker observes with at least 67% probability a change when adding a random target to a random sketch that did not contain it. Similarly, the 10th-percentile at = 10,000 has a value of 3.8%. So 10% of the targets were ignored by at most 3.8% of the sketches, i.e., for 10% of all users . That is, for the average user , there is a 10% chance that a sketch with 10,000 elements changes with likelihood at least 96.2% when is first added.
For small cardinalities (), adding an element that has not yet been added to the sketch will almost certainly modify the sketch: an attacker observing that a sketch does not change after adding can deduce with near-certainty that was added previously.
Even for larger cardinalities, there is always a constant number of people with high privacy loss. For = 1,000, no target was ignored by more than 5.5% of the sketches. For = 10,000, 10% of the users were ignored by at most 3.8% of the sketches. Similarly, the 1st percentile at = 100,000 and the 1st permille at = 1,000,000 are 4.6% and 4.5%, respectively. In summary, across all cardinalities , at least 1,000 users have . For these users, the corresponding privacy loss is . Concretely, if the attacker initially believes that is 1%, this number grows to 15% after observing that . If it is initially 10%, it grows to 66%. And if it is initially 25%, it grows to 86%.
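The belief updates quoted above follow from Bayes' law. The sketch below assumes that a target absent from the sketch is ignored with probability 5.5% (the figure reported for these users) and that a target already present never changes the sketch; the function name and argument names are hypothetical.

```python
def posterior_after_no_change(prior: float, ignore_prob: float) -> float:
    """P[target in sketch | sketch unchanged], if absent targets are ignored w.p. ignore_prob."""
    # A target that is already in the sketch never changes it, so that likelihood is 1.
    return prior / (prior + (1 - prior) * ignore_prob)


for prior in (0.01, 0.10, 0.25):
    print(prior, round(posterior_after_no_change(prior, 0.055), 3))
```

Under these assumptions the three posteriors come out close to the 15%, 66%, and 86% quoted in the text; small differences stem from rounding of the ignore probability.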
8 Mitigation strategies
A corollary of Theorem 12 and of our analysis of Section 7.2 is that the cardinality estimators used in practice do not preserve privacy. How can we best protect cardinality estimator sketches against insider threats in realistic settings? Of course, classical data protection techniques are relevant: encryption, access controls, auditing of manual accesses, etc. But in addition to these best practices, cardinality estimators like HyperLogLog allow for specific risk mitigation techniques, which restrict the attacker’s capabilities.
8.1 Salting the hash function with a secret
As explained in Section 3, most cardinality estimators use a hash function as the first step of the operation: only depends on and the hash value . This hash can be salted with a secret value. This salt can be made inaccessible to humans, with access controls restricting access to production binaries compiled from trusted code. Thus, an adversary cannot learn all the relevant parameters of the cardinality estimator and can no longer add users to sketches. Of course, to avoid a salt reconstruction attack, a cryptographic hash function must be used.
The use of a salt does not hinder the usefulness of sketches: they can still be merged (for all cardinality estimators given as examples in Section 3) and the cardinality can still be estimated without accuracy loss. However, if an attacker gains direct access to a sketch with the aim of targeting a user and does not know the secret salt, then she cannot compute and therefore cannot compute . This prevents the previous obvious attack of adding to and observing whether the result is different.
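One way to salt the hash with a secret is to use a keyed hash. The sketch below uses HMAC-SHA256 from the Python standard library as an illustrative choice; the function name `salted_hash`, the truncation to 64 bits, and the example salts are assumptions rather than part of any specific deployment.

```python
import hashlib
import hmac


def salted_hash(secret_salt: bytes, user_id: str) -> int:
    """Keyed 64-bit hash of a user identifier, using HMAC-SHA256 as the keyed hash."""
    digest = hmac.new(secret_salt, user_id.encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")


salt = b"example-secret-salt"  # in practice, generated randomly and kept out of human reach
print(salted_hash(salt, "alice") == salted_hash(salt, "alice"))
print(salted_hash(salt, "alice") == salted_hash(b"another-salt", "alice"))
```

With the same salt, a user always maps to the same hash, so duplicates are still ignored and sketches remain mergeable. Without the salt, an attacker cannot compute the hash of a chosen target, and sketches built with different salts would treat the same user as two different users.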
However, this solution has two issues.
- Secret salt rotation
The secret salt must be the same for all sketches as otherwise sketches cannot be merged. Indeed, if a hash function is used to create a sketch and is used to create , then if for some that is added to both and , will be seen as a different user in and : the cardinality estimator no longer ignores duplicates. Good key management practices also recommend regularly rotating secret keys. In this context, changing the key requires recomputing all previously computed sketches. This requires keeping the original raw data, makes pipelines more complex, and can be computationally costly.
- Sketch intersection
For most cardinality estimators given as examples in Section 3, it is possible for an attacker to guess from a family of sketches (, …, ) for which the attacker knows that . For example, intersecting the lists stored in K-Minimum Values sketches can provide information on which hashes come from users that have been in all sketches. For HyperLogLog, one can use the leading zeroes in non-empty buckets to get partial information on the hash value of users who are in all sketches. Moreover, HyperLogLog++ [heule2013hyperloglog] has a sparse mode that stores full hashes when the sketch contains a small number of values; this makes intersection attacks even easier.
Intersection attacks are realistic, although they are significantly more complex than simply checking if . In our running example, sketches come from counting users across locations and time periods. If an internal attacker wants to target someone she knows, she can gather information about where they went using side channels like social media posts. This gives her a series of sketches that she knows her target belongs to, and from these, she can get information on and use it to perform an attack equivalent to checking whether .
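The intersection attack on K-Minimum Values sketches can be sketched as follows. The example is a toy under stated assumptions: the names `kmv_sketch` and `common_hashes` are hypothetical, the hash is unsalted SHA-256 so that the result can be checked, and the sketches are small enough that each one stores every hash (similar in spirit to a sparse mode).

```python
import hashlib


def h(user_id: str) -> int:
    """64-bit hash of a user identifier (illustrative choice of hash)."""
    return int.from_bytes(hashlib.sha256(user_id.encode()).digest()[:8], "big")


def kmv_sketch(users, k: int) -> frozenset:
    """K-Minimum Values sketch: the k smallest hashes among the users."""
    return frozenset(sorted({h(u) for u in users})[:k])


def common_hashes(sketches) -> frozenset:
    """Hashes stored in every sketch; each belongs to a user present in all of them."""
    return frozenset.intersection(*sketches)


# Three locations with disjoint crowds, plus one target who visited all of them.
sketches = [
    kmv_sketch([f"loc{j}-user{i}" for i in range(40)] + ["target"], k=64)
    for j in range(3)
]
leaked = common_hashes(sketches)
print(len(leaked), h("target") in leaked)
```

The attacker never needs to compute the target's hash: it is recovered from the sketches alone, and can then be compared against the contents of any other sketch. With larger sets, the target's hash only appears in the intersection if it is among the `k` smallest in every sketch, so the attack yields partial rather than certain information.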
Another possible risk mitigation technique is homomorphic encryption. Each sketch could be encrypted in a way that allows sketches to be merged, and new elements to be added; while ensuring that an attacker cannot do any operation without some secret key. Homomorphic encryption typically has significant overhead, so it is likely to be too costly for most use cases. Our impossibility results assume a computationally unbounded attacker; however, it is possible that an accurate sketching mechanism using homomorphic encryption could provide privacy against polynomial-time attackers. We leave this area of research for future work.
8.2 Using a restricted API
Using cardinality estimator sketches to perform data analysis tasks only requires access to two operations: and . So a simple option is to process the sketches over an API that only allows this type of operation. One option is to provide a SQL engine on a database, and only allow SQL functions that correspond to and over the column containing sketches. In the BigQuery SQL engine, this corresponds to allowing HLL_COUNT.MERGE and HLL_COUNT.EXTRACT functions, but not other functions over the column containing sketches [bigqueryhll]. Thus, the attacker cannot access the raw sketches.
Under this technique, an attacker who only has access to the API can no longer directly check whether . Since she does not have access to the sketch internals, she cannot perform the intersection attack described previously either. To perform the check, her easiest option is to impersonate her target within the service, interact with the service so that a sketch containing only her target is created in the sketch database, and compare the estimates obtained from and . Following the intuition given in Section 5.1, if these estimates are the same, then the target is more likely to be in the dataset. How much information the attacker gets this way depends on . We can increase this quantity by rounding the result of the operation, thus limiting the accuracy of the external attacker. This would make the attack described in this work slightly more difficult to execute, and less efficient. However, it is likely that the attack could be adapted, for example by repeating it multiple times with additional fake elements.
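A restricted API with rounded estimates could be sketched as follows. This is only an illustration: the class and method names are hypothetical, exact Python sets stand in for real sketches so that the example stays short, and the rounding granularity `round_to` is an arbitrary choice.

```python
class RestrictedSketchAPI:
    """Exposes only merge-then-estimate over stored sketches, with rounded results."""

    def __init__(self, sketches: dict, round_to: int = 100):
        self._sketches = sketches      # name -> sketch; Python sets stand in for sketches
        self._round_to = round_to

    def estimate_merged(self, names) -> int:
        """Estimate the cardinality of the merge of the named sketches, rounded."""
        merged = set()
        for name in names:
            merged |= self._sketches[name]
        # Round to a multiple of round_to to coarsen what a caller can observe.
        return self._round_to * round(len(merged) / self._round_to)


api = RestrictedSketchAPI({
    "monday": {f"user-{i}" for i in range(1234)},
    "target_only": {"target"},
})
print(api.estimate_merged(["monday"]), api.estimate_merged(["monday", "target_only"]))
```

Here both calls return the same rounded value whether or not the target is in the first sketch, so a single comparison is uninformative. Near a rounding boundary, however, one extra element can still flip the result, which is consistent with the remark that the attack could be adapted by adding fake elements.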
This risk mitigation technique can be combined with the previous one. The restricted API protects the sketches during normal use by data analysts, i.e., against external attackers. The hash salting mitigates the risk of manual access to the sketches, e.g., by internal attackers. This type of direct access is not needed for most data analysis tasks, so it can be monitored via other means.
9 Conclusion
We formally defined a class of cardinality estimator algorithms with an associated system and attacker model that captures the risks associated with processing personal data in cardinality estimator sketches. Based on this model, we proposed a privacy definition that expresses that the attacker cannot gain significant knowledge about a given target.
We showed that our privacy definition, which is strictly weaker than any reasonable definition used in practice, is incompatible with the accuracy and aggregation properties required for practical uses of cardinality estimators. We proved similar results for even weaker definitions, and we measured the privacy loss associated with the HyperLogLog cardinality estimator, commonly used in data analysis tasks.
Our results show that designing accurate privacy-preserving cardinality estimator algorithms is impossible, and that the cardinality estimator sketches used in practice should be considered as sensitive as raw data. These negative results are a consequence of the structure imposed on cardinality estimators: idempotence, commutativity, and existence of a well-behaved merge operation. This result shows a fundamental incompatibility between accurate aggregation and privacy. A natural question is to ask whether other sketching algorithms have similar incompatibilities, and which minimal axiomatizations lead to similar impossibility results.
Acknowledgments
The authors thank Jakob Dambon for providing insights which helped us prove Lemma 15, Esfandiar Mohammadi for numerous fruitful discussions, as well as Christophe De Cannière, Pern Hui Chia, Chao Li, Aaron Johnson and the anonymous reviewers for their helpful comments. This work was partially funded by Google, and done while Andreas Lochbihler was at ETH Zurich.
Appendix A Proof of Lemma 5
Let and be two sketches. If , we denote by the result of adding elements of successively to : .
Let and be two sets such that , and let be a set such that . Then . Using the two properties of the add function, we get . The same reasoning leads to : the merge function does not depend on the choice of .
In addition, note that