(Nearly) All Cardinality Estimators Are Differentially Private

03/29/2022
by   Charlie Dickens, et al.

We consider privacy in the context of streaming algorithms for cardinality estimation. We show that a large class of algorithms all satisfy ϵ-differential privacy, so long as (a) the algorithm is combined with a simple down-sampling procedure, and (b) the cardinality of the input stream is Ω(k/ϵ). Here, k is a certain parameter of the sketch that is always at most the sketch size in bits, but is typically much smaller. We also show that, even with no modification, algorithms in our class satisfy (ϵ, δ)-differential privacy, where δ falls exponentially with the stream cardinality. Our analysis applies to essentially all popular cardinality estimation algorithms, and substantially generalizes and tightens privacy bounds from earlier works.


1 Introduction

Cardinality estimation, or the distinct counting problem, is a fundamental data analysis task. Typical applications are found in network traffic monitoring [EVF03], query optimization [SAC79], and counting unique search engine queries [HNH13]. A key challenge is to perform this estimation in small space, say at most logarithmic in the number of distinct items in the input, while processing each data item quickly (ideally in constant time per item).

Typical approaches for solving this problem at scale are the fm sketch [FM85], sometimes referred to as PCSA (short for Probabilistic Counting with Stochastic Averaging), and its more practical variants such as hll [FFGM07] (we describe details of these algorithms later, in Section 2). These algorithms fall into a class that we call hash-based, order-invariant cardinality estimators. This class consists of all algorithms that satisfy the following two properties. The first is that the algorithm utilizes no information about each input identifier i other than a uniform random hash h(i) of i. The second is that the algorithm depends only on the set of observed hash values. This means the algorithm satisfies both permutation- and duplication-invariance. That is, the produced sketch does not depend on the ordering of the input stream, nor on the number of times any item is duplicated in the stream.

In addition to the FM and HLL sketches, other popular algorithms in this class include Bottom-k [BYJK02, CK07, BGH09] (also called MinCount or k-minimum values, KMV for short), and Adaptive Sampling [Fla90]. (Flajolet [Fla90] attributes the Adaptive Sampling algorithm to an unpublished 1984 communication of Wegman.) For example, the Bottom-k sketch hashes each input item to a value in the interval [0, 1], and stores the k smallest hash values observed. (In practice, hash values in a Bottom-k sketch consist of a finite number of bits, say 32 or 64, which are then interpreted as the binary representation of a number in the interval [0, 1].)
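For concreteness, the following is a minimal Python rendering of a Bottom-k sketch (our own illustration, not code from the paper); the hash below is a stand-in for the uniform random hash function assumed throughout.

import hashlib

class BottomK:
    """Minimal Bottom-k (KMV) sketch: retain the k smallest distinct hash values."""

    def __init__(self, k, seed=0):
        self.k = k
        self.seed = seed
        self.values = set()  # the (at most k) smallest distinct hash values seen so far

    def _hash(self, item):
        # Stand-in for a uniform random hash into [0, 1); duplicates of an item
        # hash identically, so the sketch is automatically duplication-invariant.
        digest = hashlib.blake2b(repr((self.seed, item)).encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") / 2**64

    def add(self, item):
        h = self._hash(item)
        if h in self.values:
            return
        if len(self.values) < self.k:
            self.values.add(h)
        elif h < max(self.values):
            self.values.remove(max(self.values))
            self.values.add(h)

    def estimate(self):
        # Standard KMV estimator: (k - 1) divided by the k-th smallest hash value.
        if len(self.values) < self.k:
            return float(len(self.values))
        return (self.k - 1) / max(self.values)

# Example usage: about 10,000 distinct items despite 100,000 stream entries.
sketch = BottomK(k=256)
for x in (i % 10_000 for i in range(100_000)):
    sketch.add(x)
print(round(sketch.estimate()))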

While research has historically focused on the accuracy, speed, and space usage of these sketches, recent work has examined their privacy guarantees. These privacy-preserving properties have grown in importance as companies have built tools that can grant an appropriate level of privacy to different people and scenarios. The tools may help satisfy users’ demand for better data stewardship, while also ensuring compliance with regulatory requirements. For example, a database may offer a baseline guarantee that individual aggregates are ϵ-differentially private (ϵ-DP for short) to a set of trustworthy individuals, but a public release of data may require all the statistics in the data to be jointly differentially private with a smaller privacy parameter.

One view, presented by Desfontaines, Lochbihler, and Basin [DLB19], is that cardinality estimators cannot simultaneously have good utility and preserve privacy. More precisely, any accurate sketch would allow an adversary to identify a user’s presence in the original data set. Hence, accurate sketches should be considered as sensitive as raw data when guarding against privacy violations.

However, the impossibility result of [DLB19] applies only to a setting wherein an adversary knows the hash functions used to generate the sketch and no additional noise unknown to the adversary is applied. In fact, in the setting considered in [DLB19], differential privacy is trivially impossible to achieve, because from the perspective of the adversary, the produced sketch is a deterministic function of the input. The main result of [DLB19] is that even weaker notions of privacy cannot be achieved in this context.

Other works have studied more realistic models where either the hashes are public, but private noise is added to the sketch [TS13, CDSKY20, MMNW11, vVT19], or the hashes are secret (i.e., not known to the adversary who is trying to “break” privacy), a setting that turns out to permit less noisy cardinality estimates. For example, Smith et al. [SSGT20] show that an hll-type sketch is differentially private (despite the title of [SSGT20] referencing Flajolet and Martin, they actually study an algorithm more closely related to LogLog and hll sketches, rather than the fm sketch; the former sketches can be thought of as lossy compressions of the latter), while [vVT19] modifies the Flajolet-Martin sketch using coordinated sampling, which is based on a private hash. Variants of both models are analyzed by Choi et al. [CDSKY20], who show (amongst other contributions) a result similar to [SSGT20], establishing that an hll-type sketch is differentially private. As with these prior works, our focus is the setting in which the hash functions are kept secret from the adversary.

A related problem, differentially private estimation of cardinalities under set operations, is studied by [PS21], but they assume the inputs to each sketch are already de-duplicated.

Our contributions.

We show that all hash-based, order-invariant cardinality estimators are ϵ-differentially private so long as the algorithm is combined with a simple down-sampling procedure. A detailed overview of specific sketches to which our results apply can be found in Section 4. As with earlier results, our analysis holds provided that a mild lower bound on the number of distinct items in the stream is satisfied, roughly Ω(k/ϵ), where k is a sketch parameter that can be thought of as the number of “buckets” used by the sketch. (More precisely, the required cardinality lower bound is on the order of k/ϵ for small ϵ and on the order of k for large ϵ.) For any hash-based, order-invariant algorithm, k is always upper bounded by the number of bits in the sketch, but is typically significantly smaller. For example, the number of buckets used by a Bottom-k sketch is k, while the number of bits in the sketch is larger by a factor equal to the number of bits in each hash value.

The stream cardinality lower bound needed to ensure privacy can be guaranteed to hold by inserting sufficiently many “phantom elements” into the stream when initializing the sketch. One can then subtract off the number of phantom elements from the estimate returned by the sketch, without violating privacy. This padding technique guarantees that the sketch always satisfies ϵ-differential privacy, even when run on inputs with very small cardinality. Of course, the insertion of “phantom elements” may increase the error of the sketch on such small-cardinality inputs. But such an increase in error is inherent, as these sketches are clearly non-private (e.g., by virtue of returning the exact answer) on small enough input streams.

We also show that, even with no modification, algorithms in our class satisfy (ϵ, δ)-differential privacy, where δ falls exponentially with the stream cardinality.

Our novel analysis has significant benefits. First, prior works on differentially private cardinality estimation have analyzed only specific sketches [TS13, vVT19, CDSKY20, SSGT20]. Moreover, many of the sketches analyzed (e.g., [TS13, SSGT20]), while reminiscent of sketches used in practice, in fact differ from practical sketches in important ways. For example, as we explain in detail later (Section 4.3), Smith et al. [SSGT20] analyze a variant of hll that has a much slower update time than hll itself.

In contrast to the literature, we analyze an entire class of sketches at once. Even when specialized to specific sketches, our error analysis improves upon prior work in many cases. For example, our analysis yields a tighter privacy bound for HLL than the one given in [CDSKY20]; see Section 4.3 for details. Crucially, the class of sketches we analyze captures many (in fact, almost all) of the sketches that, to our knowledge, are actually used in practice. This means that existing systems can be used in contexts requiring privacy either without modification (if streams are guaranteed to satisfy the mild cardinality lower bound we require), or via the simple pre-processing step described above if such cardinality lower bounds may not hold. Thus, existing data infrastructure can be easily modified to provide DP guarantees, and in fact existing data (and, as we show, sketches thereof) can be easily migrated to DP summaries.

There is one standard caveat: following prior works [SSGT20, CDSKY20] our privacy analysis assumes a perfectly random hash function. One can remove this assumption both in theory and practice by using a cryptographic hash function. This will yield a sketch that satisfies either a computational variant of differential privacy called SIM-CDP, or standard information-theoretic notions of differential privacy under the assumption that the hash function fools space-bounded computations. See [SSGT20, Section 2.3] for details.

2 Problem Definition

Let D = (x_1, x_2, …) denote a stream of samples, with each identifier coming from a large universe U (e.g., of size 2^64 or larger). The objective is to estimate the cardinality of D, i.e., the number of distinct identifiers appearing in the stream, using an algorithm that is given privacy parameters (ϵ, δ) and a space bound.

Definition 1 (Differential Privacy [DMNS06]).

A randomized algorithm A is (ϵ, δ)-differentially private (or (ϵ, δ)-DP for short) if for any pair of data sets D and D′ that differ in one record, and for all sets of outcomes Z in the range of A,

Pr[A(D) ∈ Z] ≤ e^ϵ · Pr[A(D′) ∈ Z] + δ.     (1)

In Equation (1), the probability is over the internal randomness of the algorithm A. For brevity, (ϵ, 0)-DP is referred to simply as ϵ-DP, or sometimes as pure ϵ-differential privacy.

Rather than analyzing any specific sketching algorithm, we analyze a natural class of randomized distinct counting sketches. Algorithms in this class operate in the following manner: each time a new stream item i arrives, i is hashed using some uniform random hash function h, and then h(i) is used to update the sketch, i.e., the update procedure depends only on h(i), and is otherwise independent of i. Our analysis applies to any such algorithm that depends only on the set of observed hash values. Equivalently, the sketch state is invariant both to the order in which stream items arrive, and to item duplication. (By duplication-invariance, we mean that the state of the sketch when run on any stream D is identical to its state when run on the “de-duplicated” stream in which each item appearing one or more times in D appears exactly once.) We call this class of algorithms hash-based, order-invariant cardinality estimators.

All distinct counting sketches of which we are aware that are invariant to permutations of the input data are included in this class. As we explain in Section 4, this includes MinCount, HLL, LPCA, and probabilistic counting. Note that for any hash-based, order-invariant cardinality estimator, the distribution of the sketch depends only on the cardinality of the stream.
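As a schematic of this class (our own illustration, not code from the paper), any such algorithm can be written against an interface in which the update sees only the hash of the item; whenever the resulting state is a function of the set of observed hash values, permutation- and duplication-invariance follow automatically.

import hashlib
from abc import ABC, abstractmethod

class HashBasedOrderInvariantSketch(ABC):
    """Schematic interface: the update depends only on h(item), and the state
    is assumed to depend only on the set of hash values observed so far."""

    def __init__(self, seed=0):
        self.seed = seed

    def _hash(self, item):
        # Stand-in for the uniform random hash h.
        d = hashlib.blake2b(repr((self.seed, item)).encode(), digest_size=8).digest()
        return int.from_bytes(d, "big") / 2**64

    def add(self, item):
        self._update(self._hash(item))  # the identifier itself is never inspected

    @abstractmethod
    def _update(self, hash_value):
        ...

    @abstractmethod
    def estimate(self):
        ...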

Definition 2 (Hash-Based, Order-invariant Cardinality Estimators).

Any sketching algorithm whose state depends only on the set of hash values of the stream items, computed using a uniform random hash function, is a hash-based, order-invariant cardinality estimator.

We denote a sketching algorithm run over data D with internal randomness h by A_h(D) (for hash-based algorithms, h specifies the hash function used by the algorithm). The state of the sketch, referred to as the sketch for short, is denoted by S(D, h) and is the representation of the data structure defined by algorithm A. The sample space from which the sketch is drawn depends on the specific sketching algorithm being used. Sketches are first initialized, and then items are inserted into the sketch through an add operation, which may or may not change the state of the sketch.

The size of the sketch is a crucial constraint, and we denote the space consumption in bits by m. For example, fm is a sketching algorithm whose state consists of k bitmaps, each of length L. Thus, its state occupies m = kL bits; a common value is L = 32, so that m = 32k. Further such examples are given in Section 4.

3 Hash-Based Order-Invariant Estimators are Private

3.1 Conditions Guaranteeing Privacy

Given a collection D = {x_1, …, x_n} of distinct identifiers and a sketching algorithm with internal randomness h, denote the resulting sketch by S(D, h). For i ∈ {1, …, n}, denote the set D \ {x_i} by D_{-i}. For all hash-based, order-invariant cardinality estimators, the distribution of the sketch depends only on the number of distinct elements in the input stream, and so without loss of generality we assume henceforth that the input is the set D of n distinct identifiers.

By definition, for an ϵ-differential privacy guarantee, we must show that for every possible sketch state s and every i,

e^{-ϵ} ≤ Pr_h[S(D, h) = s] / Pr_h[S(D_{-i}, h) = s] ≤ e^{ϵ}.     (2)
Overview of the privacy results.

The main result in our analysis bounds the privacy loss of a hash-based, order-invariant sketch in terms of just two sketch-specific quantities. Both quantities intuitively capture how sensitive the sketch is to the removal of a single item from, or the insertion of a single item into, the data stream.

The first quantity is k_R := |R(D, h)|, where R(D, h) denotes the set of items from D that would change the sketch if removed from the stream, when the internal randomness used by the sketch is h (see Equation (4) for details). As we show later, k_R is always bounded above by the number of bits in the sketch, but for most sketches it is much smaller. The second quantity is K_s, the probability over the algorithm’s internal randomness that adding one more item to the stream would change the sketch, conditioned on the sketch’s state prior to the addition being s (see Equation (10) for details).

The main sub-result in our analysis (Theorem 7) roughly states that the sketch is ϵ-DP so long as (a) K_s is not too close to 1 (specifically, so long as Condition (18) below holds for every reachable state s), and (b) the stream cardinality is larger than a certain threshold, which for small values of ϵ is on the order of k/ϵ.

We actually show that Property (a) must be satisfied by any ϵ-DP algorithm, if the algorithm works over data universes of unbounded size. Unfortunately, Property (a) does not directly hold for natural sketching algorithms. But we show (Section 3.2.2) that Property (a) can be generically satisfied by combining any hash-based, order-invariant sketching algorithm with a simple down-sampling procedure.

Overview of the analysis.

To establish the main result outlined above (Theorem 7), our basic strategy proceeds in two steps. First, as indicated above, we bound the quantity k_R, i.e., we show that for any fixed value of the sketching procedure’s internal randomness h, very few items actually affect the sketch. That is, there are very few items x_i whose removal from D would change the resulting sketch, in the sense that S(D_{-i}, h) ≠ S(D, h). In particular, we show that the number of such items is always at most the size of the sketch in bits (and typically is significantly smaller). Second, the order-invariance of the sketch generates a great deal of symmetry that we can exploit. Effectively, we show that for any item x_i, the probability that x_i affects the sketch is at most roughly k_R/n, or more precisely, at most the probability that a random subset of D of size k_R contains x_i. This is enough to establish Equation (2), provided that n is sufficiently larger than k_R (roughly by a factor of 1/ϵ).

The above two-step description is, however, a significant simplification of the full argument. A more detailed overview follows.

Intuitively, the first step of our analysis works as follows. Since a distinct counting sketch must perform a deduplication operation to ensure only distinct items are counted, it also allows for approximate set membership queries, albeit with very high error. In order to remain small (at most m bits), only information about a few items can be stored in the sketch at any given time (certainly at most m items, but for many sketches of interest, information about even fewer items is stored). Other than these at most m items, all remaining items from D do not affect the final state of the sketch. Thus, for any sketch outcome s,

(3)

While the conditional probability on the left hand side of Equation (3) is tantalizingly close to the Bayes ratio we wish to bound, namely Pr_h[S(D, h) = s] / Pr_h[S(D_{-i}, h) = s], the above reasoning only provides a lower bound on the ratio. It does not rule out the possibility that adding item x_i changes the sketch to s with high probability, in which case the right hand side may still be large.

In order to convert the Bayes ratio into a form where we can explicitly compute the relevant probabilities, we find some other item x_j whose inclusion does not change the sketch. Using symmetry, we can replace the item x_i in the denominator with x_j and obtain a probability conditioned on the resulting sketch state. A combinatorial argument further exploiting symmetry gives our final bound.

Details of the Analysis.

Let

R(D, h) := { x_i ∈ D : S(D \ {x_i}, h) ≠ S(D, h) }     (4)

denote the set of items which, when removed from the data set, change the sketch, and denote its cardinality by k_R := |R(D, h)|. Also, let j denote the smallest index amongst the remaining items in D whose removal does not change the sketch. If removing any item changes the sketch, i.e., if S(D_{-i}, h) ≠ S(D, h) for all i, then no such index exists; in this case, we define j to be a special symbol ⊥.

The following lemmas relate the state of a sketch over data D, S(D, h), to the states of the sketch when an item is omitted, S(D_{-i}, h).

Lemma 3.

Suppose n > k_R. Then j ≠ ⊥, and

(5)
Proof.

First, let us rewrite the probability of interest as a sum, conditioning on the value of the index j:

(6)

Next, we split the right hand side of Equation (6) into two distinct cases, as we will ultimately deal with each case separately:

(7)

Next, the summands are decomposed via conditional probabilities. Specifically, the right hand side of Equation (7) equals:

(8)

For any hash-based, order-invariant sketch, the distribution of the sketch depends only on the number of distinct elements in the input, and hence the conditional factor appearing in each summand of Equation (8) can be written in terms of the item x_j referred to in the statement of the lemma. Accordingly, we can rewrite Expression (8) as:

(9)

Clearly, j ≠ ⊥ whenever the number of items in the data set, n, exceeds k_R. Hence, if n > k_R, we obtain Equation (5) as desired. ∎

Lemma 4 states that k_R is always at most m, the size of the sketch in bits, though as explained in Section 4, k_R is much smaller than m for popular sketching algorithms.

Lemma 4.

For any distinct counting sketch with size in bits bounded by m, k_R ≤ m.

Proof.

Consider some data set D and sketch S(D, h). Recall that we denote the set of items whose removal would change the sketch by R(D, h). Consider any subset T ⊆ R(D, h). Then we claim that, for any x_i ∈ R(D, h), adding x_i to the sketch of T will change it if and only if x_i ∉ T. That is, if T ∘ x_i denotes the stream consisting of one occurrence of each item in T, followed by x_i, then S(T ∘ x_i, h) ≠ S(T, h) if and only if x_i ∉ T.

To see this, first observe that duplication-invariance of the sketching algorithm implies that if x_i ∈ T then S(T ∘ x_i, h) = S(T, h). Second, if x_i ∉ T, suppose by way of contradiction that S(T ∘ x_i, h) = S(T, h). Since T ⊆ D_{-i}, processing the remaining items of D_{-i} from this common state yields the same final state whether or not x_i was inserted after T. Yet by order-invariance of the sketching algorithm, S(D, h) is exactly the state reached by processing T, then x_i, then the remaining items of D_{-i}. We conclude that S(D_{-i}, h) = S(D, h), contradicting the assumption that x_i ∈ R(D, h).

The above means that for any fixed h, the sketch S(T, h) losslessly encodes the arbitrary subset T of R(D, h). Hence, the sketch requires at least |R(D, h)| bits to represent. Thus, any sketch with size bounded by m bits can have at most m items that affect the sketch. ∎
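In other words (our rendering of the counting step): as T ranges over the subsets of R(D, h), the membership tests above recover T from the state S(T, h), so the sketch must take at least 2^{|R(D, h)|} distinct values, while a state stored in m bits can take at most 2^m values; hence 2^m ≥ 2^{|R(D, h)|}, which gives k_R = |R(D, h)| ≤ m.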

Comparing Equations (2) and (5), we see that to establish ϵ-DP we must show the right hand side of Equation (5) lies in the interval [e^{-ϵ}, e^{ϵ}]. To do so, we define

K_s := Pr_h[ S(D ∘ x, h) ≠ s | S(D, h) = s ],     (10)

for an item x not appearing in D, to be the probability that adding a new item to a sketch in state s will change the sketch. Conceptually, it can be helpful to think of K_s as the probability that, when processing an as-yet-unseen item while the sketch is in state s, the item gets “sampled” by the sketch. For a sketch such as Bottom-k, which computes a sample of (hashes of) stream items, K_s is literally the probability that the new item is sampled by the sketch. Accordingly, we refer to K_s later in this manuscript as a “sampling probability”.
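As a concrete instance (our illustration, in the style of the Bottom-k sketch above), the sampling probability of a Bottom-k state is simply the k-th smallest stored hash value once the sketch is full:

def bottom_k_sampling_probability(stored_hashes, k):
    # K_s for a Bottom-k sketch whose state is the set `stored_hashes`:
    # a fresh, previously unseen item changes the sketch exactly when its
    # uniform hash in [0, 1) falls below the k-th smallest stored hash, so the
    # change probability equals that threshold (or 1 if the sketch is not yet full).
    if len(stored_hashes) < k:
        return 1.0
    return max(stored_hashes)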

Lemma 5.

Under the same assumptions as Lemmas 3 and 4,

(11)
Proof.

We begin by writing

(12)

To analyze Expression (12), we first focus on the conditional probability term. The relevant event occurs if and only if the items in question all lie in R(D, h) and the additional item does not. The first condition occurs with probability given by a ratio of binomial coefficients, namely the probability that a uniformly random subset of D of size k_R contains the items in question. Meanwhile, the second condition states that the additional item does not change the sketch, which occurs with probability exactly equal to the complement of the sampling probability, 1 − K_s (see Equation (10)).

By the above reasoning, the left hand side of Expression (11) equals:

(13)

Having established Lemma 5, we are finally in a position to derive a result showing that a hash-based, order-invariant sketch is ϵ-DP so long as the stream cardinality is large enough and K_s is not too close to 1.

Corollary 6.

Let

(14)

and let Σ denote the set of all possible states of a hash-based, order-invariant distinct counting sketching algorithm. When run on a stream of cardinality n, the sketch output by the algorithm satisfies ϵ-DP if

(15)

and, for all sketch states s ∈ Σ,

(16)

Furthermore, if the data stream consists of items from a universe of unbounded size, Condition (15) is necessarily satisfied by any sketching algorithm satisfying ϵ-DP.

Proof.

Since K_s ∈ [0, 1] for every sketch state s, it follows from Lemma 5, (15), and (16) that the right hand side of Equation (5) lies in the interval [e^{-ϵ}, e^{ϵ}], as required for an ϵ-DP guarantee.

For the necessity of Condition (15), note that if the universe of possible items is infinite, then for any sketch state s arising with positive probability, there exists an arbitrarily long sequence of distinct items that results in state s. One simply needs to search for a sequence of items which do not change the sketch. Combining Lemma 3 with Equation (11) in Lemma 5 therefore implies that

(17)

and hence Condition (15) must hold, as claimed. ∎

The above corollary may be difficult to apply directly, since the expectation in Condition (16) is often difficult to compute and depends on the unknown cardinality n. Our main result provides sufficient criteria to ensure that Condition (16) holds. The criteria are expressed in terms of a minimum cardinality n_min and a sketch-dependent constant k_max. This constant is a bound on the maximum number of items which change the sketch when removed. That is, for all input streams D and all h, |R(D, h)| ≤ k_max, where recall from Equation (4) that for a given input stream D, R(D, h) denotes the set of items whose absence changes the sketch. We derive k_max for a number of popular sketch algorithms in Section 4.

Theorem 7.

For a hash-based, order-invariant distinct counting sketch, let k_max denote an upper bound on |R(D, h)| over all input streams D and all h (recall that R(D, h) is defined in Equation (4)). The sketch output by the algorithm satisfies an ϵ-DP guarantee if

(18)

and the number of unique items in the input stream is strictly greater than

(19)
Proof.

We can upper bound the expectation on the right hand side of Condition (16) using k_max and the bound on K_s from Condition (18). Applying Corollary 6 and solving for n then gives the desired result. Specifically, by Corollary 6, the sketch satisfies ϵ-DP if:

(20)

Hence, the sketch satisfies ϵ-DP when run on a stream of cardinality n so long as:

(21)

Later, we explain how to modify existing sketching algorithms in a black-box way to ensure that Conditions (18) and (19) are satisfied. But for most sketching algorithms used in practice, if left unmodified there will be some sketch values s for which Condition (18) is not satisfied, i.e., for which K_s is too large. Let us call such sketch values “privacy-violating”. Fortunately, for practical sketching algorithms, privacy-violating sketch values only arise with tiny probability. The next theorem states that, so long as this probability is smaller than δ, the sketch satisfies (ϵ, δ)-DP without modification.

Theorem 8.

Let n_min be as in Theorem 7. Given a hash-based, order-invariant distinct counting sketch with bounded size, let V be the set of sketch states s for which Condition (18) fails. If the cardinality of the input stream is greater than n_min, then the sketch is (ϵ, δ)-differentially private, where δ = Pr[S(D, h) ∈ V].

Proof.

This trivially follows from Theorem 7. ∎

3.2 Constructing Private Sketches

3.2.1 Approximate Differential Privacy

Theorem 8 states that, when run on a stream with more than n_min distinct items, any hash-based, order-invariant algorithm used directly (see Algorithm 0(a)) satisfies (ϵ, δ)-differential privacy, where δ denotes the probability that the final sketch state is “privacy-violating”, i.e., violates Condition (18). Concrete bounds on δ as a function of the input cardinality n for specific practical sketching algorithms are given in Section 4; in all cases considered, δ falls exponentially quickly with n. In the remainder of this subsection, we provide intuition for why this is true and how we later establish the concrete bounds for specific sketches.

If s_t denotes the state of the sketch after processing the first t input items, we show that K_{s_t} is monotonically decreasing with t (see, for example, the proof of Corollary 11). We can prove the desired bound on δ by analyzing sketches in a manner similar to the coupon collector problem. Assuming a perfect, random hash function, the hash values of a universe of items define a probability space. For each sketch considered in Section 4, we identify events, or “coupons”, such that K_{s_t} is guaranteed to be small enough to satisfy Condition (18) after all events have occurred. A simple union bound then guarantees that the probability that a sketch fails to satisfy an ϵ-DP guarantee decreases exponentially as the cardinality grows.

As additional intuition for why unmodified sketches satisfy an (ϵ, δ)-DP guarantee when the cardinality is large, we note that the inclusion probability K_s is closely tied to the cardinality estimate in most sketching algorithms. For example, the cardinality estimators used in HLL and KMV are inversely proportional to the sampling probability K_s, i.e., the estimate is proportional to 1/K_s, while for LPCA and Adaptive Sampling, the cardinality estimators are monotonically decreasing with respect to K_s. Thus, for most sketching algorithms, when run on a stream of sufficiently large cardinality, the resulting sketch is privacy-violating only when the cardinality estimate is also inaccurate. The following theorem is useful when analyzing the privacy of such sketching algorithms, as it characterizes the probability of a “privacy violation” in terms of the probability that the sketch returns an estimate lower than some threshold T.
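To illustrate with Bottom-k/KMV (our notation, with t* standing for the bound on K_s required by Condition (18)): for a full sketch whose k-th smallest stored hash is v_(k), we have K_s = v_(k) and n̂ = (k − 1)/v_(k) = (k − 1)/K_s, so a privacy-violating state (K_s > t*) forces a small estimate, n̂ < (k − 1)/t*. Moreover, K_s > t* means that fewer than k of the n distinct items hashed below t*, so

Pr[privacy violation] = Pr[v_(k) > t*] = Pr[Binomial(n, t*) < k],

which decays exponentially in n once n exceeds k/t* by a constant factor, by a standard Chernoff bound.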

Theorem 9.

Recall from Equation (19) that n_min denotes the cardinality lower bound of Theorem 7. Let A be a sketching algorithm with estimator n̂. Suppose the estimate returned on a sketch in state s is a strictly decreasing function of K_s, so that n̂(s) = f(K_s) for some strictly decreasing function f. Then, if the stream cardinality exceeds n_min, the sketching algorithm is (ϵ, δ)-DP, where δ = Pr[n̂(S(D, h)) < T] and T is the estimate value corresponding under f to the bound on K_s required by Condition (18).

Proof.

Since f is invertible and decreasing, the sketch state is privacy-violating exactly when the estimate falls below T, so the claim follows from Theorem 8. ∎

3.2.2 Pure Differential Privacy

Theorem 7 guarantees that a sketch satisfies ϵ-DP if two conditions hold (Conditions (18) and (19)). The first condition requires that the “sampling probability” K_s of the sketching algorithm be sufficiently small regardless of the sketch’s state. Meanwhile, Condition (19) requires that the number of distinct items in the input stream be sufficiently large.

We observe that one can take any hash-based, order-invariant distinct counting sketching algorithm and turn it into one that satisfies these two conditions by adding a simple pre-processing step, which does two things. First, it “downsamples” the input stream by hashing each input item (using a different hash function than the one used by the algorithm being pre-processed), interpreting the hash values as numbers in [0, 1], and simply ignoring items whose hashes are larger than a downsampling rate p. This ensures that Condition (18) is satisfied, by simply discarding each input item with probability 1 − p. Second, it artificially adds items to the input stream to ensure that Condition (19) is satisfied (these items should also be downsampled as per the first modification). An unbiased estimate of the cardinality of the unmodified stream can then be easily recovered from the sketch via a post-processing correction. Note that for this estimate to be unbiased, these artificial items must be distinct from any items that appear in the “real” stream. Pseudocode for the modified algorithm, which is guaranteed to satisfy ϵ-DP, is given in Algorithm 0(c).

In settings where there is an a priori guarantee that the number of distinct stream items is greater than n_min, the addition of artificial items is not necessary to ensure ϵ-DP. Pseudocode for the resulting ϵ-DP algorithm is given in Algorithm 0(b).

function Base(items, …) [pseudocode body not recovered]
(a) Provides (ϵ, δ)-DP guarantee for sufficiently large n.
function DPSketchLargeSet(items, …) [pseudocode body not recovered]
(b) Provides ϵ-DP guarantee for n > n_min.
function DPSketchAnySet(items, …) [pseudocode body not recovered]
(c) Provides ϵ-DP guarantee for any n.
Algorithms 1: Differentially private cardinality estimation algorithms from black-box sketches. The initialization function creates a sketch in a black-box fashion. The output of the uniform random hash function used for downsampling is interpreted as a number in [0, 1]. Note that this hash function is chosen independently of the internal randomness of the black-box sketching procedure (in particular, the hash function used in pre-processing is independent of any hash function used by the black-box sketch). The cardinality estimate returned by sketch S is denoted n̂(S). DPInitSketch is given in Algorithm 1(a).
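As a hedged Python rendering of the DPSketchAnySet wrapper described in the text (all helper names are ours; the downsampling rate p and phantom count z are taken as inputs because their exact values are dictated by Conditions (18) and (19) and are sketch-specific):

import hashlib

def _uniform_hash(item, seed):
    # Independent uniform hash used only for downsampling (not the sketch's own hash).
    d = hashlib.blake2b(repr((seed, item)).encode(), digest_size=8).digest()
    return int.from_bytes(d, "big") / 2**64

def dp_sketch_any_set(items, make_sketch, p, z, seed=0):
    # Downsample-and-pad wrapper: keep each item with probability p and add z
    # phantom items drawn from a disjoint universe, then correct the estimate.
    # Assumes p is chosen so that Condition (18) holds and z is large enough
    # that the padded stream meets the cardinality bound of Condition (19).
    sketch = make_sketch()
    phantoms = [("phantom", j) for j in range(z)]  # disjoint from real identifiers
    for x in phantoms + list(items):
        if _uniform_hash(x, seed) < p:  # keep x with probability p
            sketch.add(x)
    # Unbiased post-processing correction: undo the downsampling, subtract the phantoms.
    estimate = sketch.estimate() / p - z
    return sketch, estimate

# Example usage with the BottomK sketch defined earlier (parameters are illustrative):
# sketch, est = dp_sketch_any_set(range(50_000), lambda: BottomK(k=256), p=0.5, z=20_000)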
Corollary 10.

The functions DPSketchLargeSet (Algorithm 0(b)) and DPSketchAnySet (Algorithm 0(c)) yield ϵ-DP distinct counting sketches, provided that the stream cardinality exceeds n_min for the former; no cardinality assumption is needed for the latter.

Proof.

Under their respective assumptions, DPSketchLargeSet and DPSketchAnySet satisfy Conditions (18) and (19) of Theorem 7. ∎

function DPInitSketch(…) [pseudocode body not recovered]
(a)
function DPInitSketchForMerge(…) [pseudocode body not recovered]
(b)
Algorithms 2: Initialization routines for generating ϵ-DP sketches. The routine that generates artificial items returns items guaranteed to come from a data universe disjoint from the universe over which stream items are drawn. In DPInitSketch, the binomial draw simulates inserting the required number of unique artificial items into the sketch with downsampling probability p.

3.2.3 Constructing ϵ-DP Sketches from Existing Sketches: Algorithm 3

As regulations change and new ones are added, existing data may need to be appropriately anonymized. However, if the data has already been sketched, the underlying data may no longer be available, and even if it is retained, it may be too costly to reprocess it all. Our theory allows these sketches to be directly converted into differentially private sketches when the sketch has a merge procedure.

That is, the algorithm assumes that it is possible to take a sketch of a stream D_1 and a sketch of a stream D_2, and “merge” them to get a sketch of the concatenation D_1 ∘ D_2 of the two streams. This is the case for most practical hash-based, order-invariant distinct counting sketches. Denote a merge operation between sketches S(D_1, h) and S(D_2, h) by S(D_1, h) ∪ S(D_2, h). Since merging requires the same randomness h to be used by both sketches, we will often suppress the dependency on h in the notation. The merge step is a property of the specific sketching algorithm used and operates on the sketch states, and so we also overload the notation to denote the merge of states s_1 and s_2 by s_1 ∪ s_2.
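As a concrete example of such a merge (our illustration), two Bottom-k states built with the same hash function merge by keeping the k smallest values in the union of their stored hash sets, which is exactly the Bottom-k state of the concatenated streams:

def merge_bottom_k(values_a, values_b, k):
    # Merge two Bottom-k sketch states (sets of stored hash values) built with
    # the *same* hash function: the k smallest distinct hashes of the concatenated
    # streams are the k smallest values in the union of the two stored sets.
    return set(sorted(values_a | values_b)[:k])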

In this setting, we think of the existing non-private sketch S as being converted to a sketch that satisfies ϵ-DP by Algorithm 3. Since the sketch S is already constructed, items cannot be first downsampled in the building phase the way they are in Algorithms 0(b) and 0(c). To achieve the stated privacy, Algorithm 3 constructs a noisily initialized sketch, S_0, which satisfies both the downsampling condition (Condition (18)) and the minimum stream cardinality requirement (Condition (19)), and returns the merged sketch S_0 ∪ S. As formalized in Corollary 11 below, the merged sketch is guaranteed to satisfy both requirements needed for a privacy guarantee.

function MakeDP(S, …) [pseudocode body not recovered; the routine calls DPInitSketchForMerge (Algorithm 1(b)) and returns the private sketch together with the associated cardinality estimate for the stream that S is a sketch of]
Algorithm 3: Turn an existing sketch into one with an ϵ-DP guarantee.
Corollary 11.

Regardless of the sketch provided as input to the function MakeDP (Algorithm 3), MakeDP yields an ϵ-DP distinct counting sketch.

Proof.

Given sketches with states s_0 and s, respectively, we claim that any item that does not modify s_0 also cannot modify the merged sketch s_0 ∪ s, by the order-invariance of the sketching algorithm. To see this, let D_0 and D respectively denote the streams that were processed by the two sketches, and consider an item x that does not appear in D_0 or D and whose insertion into D_0 would not change the sketch s_0. Since the state of the sketch is the same after processing D_0 ∘ x as it was after processing D_0, s_0 is also the sketch of D_0 ∘ x, where ∘ denotes stream concatenation. By order-invariance, s_0 ∪ s is also a sketch for (D_0 ∘ x) ∘ D. Also by order-invariance, s_0 ∪ s is a sketch for (D_0 ∘ D) ∘ x. Hence, we have shown that the insertion of x into D_0 ∘ D does not change the resulting sketch.

It follows that K_{s_0 ∪ s} ≤ K_{s_0}, and K_{s_0} satisfies the bound required by Condition (18) by the stopping condition of the loop in DPInitSketchForMerge (Algorithm 1(b)). Hence, MakeDP also satisfies Condition (18). The requirement on the number of artificial items inserted by DPInitSketchForMerge also ensures that s_0 ∪ s is a sketch of a stream satisfying Condition (19). Hence, Theorem 7 implies that the sketch returned by MakeDP satisfies ϵ-DP. Since the additional value that affects the estimate returned by MakeDP does not depend on the data, there is no additional privacy loss incurred by returning it. ∎

3.3 Utility

When processing a data set with n unique items, denote the expectation and variance of a sketch's estimator by E_n[n̂] and Var_n[n̂], respectively. We show that our algorithms all yield unbiased estimates. Furthermore, we show that for Algorithms 0(a)-0(c), if the base sketch satisfies a relative error guarantee (defined below), the DP sketches add no additional error asymptotically.

Establishing unbiasedness.

To analyze the expectation and variance of each algorithm's estimator n̂, note that each estimator uses a “base estimate” n̂_base from the base sketch and has the form

n̂ = n̂_base / p − z,     (22)

where z is the number of artificial items added and p is the downsampling probability. This allows us to express expectations and variances in terms of the variance of the base estimator.

Theorem 12.

Consider a base sketching algorithm with an unbiased estimator for the number of distinct items added to the base sketch. All three Algorithms 1(a)-(c) and Algorithm 3 yield unbiased estimators.

Proof.

Trivially, Algorithm 0(a) is unbiased by assumption, as it does not modify the base sketch. Given the downsampling, the number of distinct items added to the base sketch is Binomial(n + z, p), with expectation p(n + z). Since the base sketch's estimator is unbiased, E[n̂_base] = p(n + z). Algorithms 0(b), 0(c), and Algorithm 3 therefore all have expectation E[n̂] = E[n̂_base]/p − z = n. ∎

Bounding the variance.

First, observe that Theorem 12 yields a clean expression for the variance of our private algorithms.

Corollary 13.

The variance of the estimates produced by Algorithms 0(a)-0(c) and Algorithm 3 is given by

(23)
Proof.

This follows from the law of total variance and the fact that the estimators are unbiased. ∎

Let us say that the base sketch satisfies a relative-error guarantee if, with high probability, the estimate returned by the sketching algorithm when run on a stream of cardinality n is (1 ± c/√k) n for some constant c. Let n̂_base(n) denote the cardinality estimate when the base algorithm is run directly on a stream of cardinality n, as opposed to n̂_base, the cardinality estimate produced by the base sketch on the sub-sampled stream used in our private sketches DPSketchLargeSet (Algorithm 0(b)) and DPSketchAnySet (Algorithm 0(c)). The relative error guarantee is satisfied when Var[n̂_base(n)] = O(n²/k); this is an immediate consequence of Chebyshev's inequality.
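The Chebyshev step referenced above is the standard one (our rendering, using that the base estimator is unbiased): if Var[n̂_base(n)] ≤ c²n²/k, then for any t > 1,

Pr[ |n̂_base(n) − n| ≥ t · c · n/√k ] ≤ Var[n̂_base(n)] / (t²c²n²/k) ≤ 1/t²,

so the estimate is within O(n/√k) of n with high probability.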

When the number of artificially added items is constant, as in Algorithms 0(b) and 0(c), Corollary 13 provides a precise expression for the variance of the differentially private sketch. In Theorem 14 below, we use this expression to establish that the modifications of the base algorithm into an ϵ-DP sketch as per Algorithms 0(b) and 0(c) satisfy the exact same relative error guarantee asymptotically. In other words, the additional error due to any pre-processing (down-sampling and possibly adding artificial items) is insignificant for large cardinalities n.

Theorem 14.

Recall from Equation (19) that n_min denotes the minimum cardinality required by Theorem 7. Suppose n̂_base(n) satisfies a relative error guarantee for all n > n_min and for some constant c. Then Algorithms 0(b) and 0(c) satisfy

(24)
(25)

where z = 0 for Algorithm 0(b), and for Algorithm 0(c) z is the fixed number of artificial items determined by n_min.

Proof.

Let B denote the actual number of distinct items inserted into the base sketch. From Corollary 13, and since z is constant, the variance can be written in terms of Var[n̂_base] and the downsampling probability p, from which the claimed bounds follow; the expectation claim is immediate from Theorem 12.

In Algorithm 3, the number of artificial items added, z, is a random variable. We can ensure that the algorithm satisfies a utility guarantee if we can bound z with high probability. This is equivalent to showing that the base sketching algorithm satisfies an (ϵ, δ)-DP guarantee, since for any δ and any data set of sufficiently large cardinality, an (ϵ, δ)-DP guarantee bounds the probability that the loop in DPInitSketchForMerge must add many artificial items before its stopping condition is met; the last step follows from the definition of z for Algorithm 1(b).

We provide (ϵ, δ)-DP results for all the specific sketching algorithms considered in Section 4.

Corollary 15.

Assume that the conditions of Theorem 14 hold. Further assume the base sketching algorithm satisfies an (ϵ, δ(n)) privacy guarantee where δ(n) → 0 as n → ∞. For any given bound on the number of artificial items added, we say Algorithm 3 succeeded if that bound holds. Then, with probability at least 1 − δ(n), Algorithm 3 succeeds and satisfies the analogues of the guarantees (24) and (25), where convergence is in probability as n → ∞.

4 Example Hash-based, Order-Invariant Cardinality Estimators

Recall that the quantities of interest are the number of bins k used in the sketch, the size of the sketch in bits m, and the number of items whose absence changes the sketch, k_R. From Lemma 4 in Section 3 we know that k_R ≤ m, but for several common sketches delineated in this section, we can give a stronger bound showing that k_R ≤ k. The relationship between these parameters for various sketching algorithms is summarized in Table 1.

We remind the reader that, per Equation (14), and per Equation (19)