Collecting Telemetry Data Privately

12/05/2017 ∙ by Bolin Ding, et al. ∙ Microsoft

The collection and analysis of telemetry data from users' devices is routinely performed by many software companies. Telemetry collection leads to improved user experience but poses significant risks to users' privacy. Locally differentially private (LDP) algorithms have recently emerged as the main tool that allows data collectors to estimate various population statistics, while preserving privacy. The guarantees provided by such algorithms are typically very strong for a single round of telemetry collection, but degrade rapidly when telemetry is collected regularly. In particular, existing LDP algorithms are not suitable for repeated collection of counter data such as daily app usage statistics. In this paper, we develop new LDP mechanisms geared towards repeated collection of counter data, with formal privacy guarantees even after being executed for an arbitrarily long period of time. For two basic analytical tasks, mean estimation and histogram estimation, our LDP mechanisms for repeated data collection provide estimates with comparable or even the same accuracy as existing single-round LDP collection mechanisms. We conduct empirical evaluation on real-world counter datasets to verify our theoretical results. Our mechanisms have been deployed by Microsoft to collect telemetry across millions of devices.


1 Introduction

Collecting telemetry data to make more informed decisions is commonplace. In order to meet users' privacy expectations, and in view of tightening privacy regulations (e.g., the European GDPR), the ability to collect telemetry data privately is paramount. Counter data, e.g., daily app or system usage statistics reported in seconds, is a common form of telemetry. In this paper, we are interested in algorithms that preserve users' privacy in the face of continuous collection of counter data, are accurate, and scale to populations of millions of users.

Figure 1: Local Model of Differential Privacy

Recently, differential privacy TCC06 (DP) has emerged as the de facto standard for privacy guarantees. In the context of telemetry collection, one typically considers algorithms that exhibit differential privacy in the local model ccs:ErlingssonPK14 ; pvldb:HuYYDCYGZ15 ; popets:FantiPE16 ; focs:DuchiJW13 ; focs:BassilyST14 ; NIPS:BNST017 ; Arxiv:TKBWW , also called the randomized response model warner1965randomized , γ-amplification evfimievski2003limiting , or FRAPP agrawal2005framework . These are randomized algorithms that are invoked on a user's device to turn the user's private value into a response that is communicated to the data collector, and they have the property that the likelihood of any specific output of the algorithm varies little with the input, thus providing users with plausible deniability. The guarantees offered by locally differentially private algorithms, although very strong in a single round of telemetry collection, quickly degrade when data is collected over time. This is a very challenging problem that limits the applicability of DP in many contexts.

In telemetry applications, privacy guarantees need to hold in the face of continuous data collection. Recently, an influential paper ccs:ErlingssonPK14 proposed a framework based on memoization to tackle this issue. Their techniques allow one to extend single-round LDP algorithms to continual data collection and protect users whose values stay constant or change very rarely. The key limitation of the work of ccs:ErlingssonPK14 is that their approach cannot allow for even very small but frequent changes in users' private values, making it inappropriate for collecting counter data. In this paper, we address this limitation.

We design mechanisms with formal privacy guarantees in the face of continuous collection of counter data. These guarantees are particularly strong when a user's behavior remains approximately the same, varies slowly, or varies around a small number of values over the course of data collection.

Our results. Our contributions are threefold.

  • We give simple 1-bit and d-bit response mechanisms in the local model of DP for single-round collection of counter data for mean and histogram estimation, respectively. Our mechanisms are inspired by those in warner1965randomized ; nips:DuchiWJ13 ; focs:DuchiJW13 ; stoc:BassilyS15 , but allow for considerably simpler descriptions and implementations. Our experiments also demonstrate performance gains in concrete settings.

  • Our main technical contribution is a rounding technique called α-point rounding that borrows ideas from the approximation algorithms literature goemans2002single ; bansal2008improved , and allows memoization to be applied in the context of private collection of counters while avoiding substantial losses in accuracy or privacy. We give a rigorous definition of the privacy guarantees provided by our algorithms when data is collected continuously for an arbitrarily long period of time. We also present empirical findings related to our privacy guarantees.

  • Finally, our mechanisms have been deployed by Microsoft across millions of devices starting with Windows Insiders in Windows 10 Fall Creators Update to protect users’ privacy while collecting application usage statistics.

1.1 Preliminaries and problem formulation

In our setup, there are n users, and each user u at time t has a private (integer or real) counter with value x_{u,t} ∈ [0, m]. A data collector wants to collect these counter values {x_{u,t}}_{u∈[n]} at each time stamp t to do statistical analysis. For example, in telemetry analysis, understanding the mean and the distribution of counter values (e.g., app usage) is very important to IT companies.

Local model of differential privacy (LDP). Users do not necessarily trust the data collector and require formal privacy guarantees before they are willing to communicate their values. Hence, the more well-studied central DP model TCC06 ; dwork2014algorithmic , which first collects all users' data and then injects noise in the analysis step, is not applicable in our setup.

In this work, we adopt the local model of differential privacy, where each user randomizes his private data using a randomized algorithm (mechanism) A locally before sending it to the data collector.

Definition 1 (evfimievski2003limiting ; nips:DuchiWJ13 ; stoc:BassilyS15 ).

A randomized algorithm A is ε-locally differentially private (ε-LDP) if for any pair of values v and v' and any subset S of outputs of A, we have that

    Pr[A(v) ∈ S] ≤ e^ε · Pr[A(v') ∈ S].

LDP formalizes a type of plausible deniability: no matter what output is released, it is approximately equally likely to have come from one input as from any other. For alternative interpretations of differential privacy within the framework of hypothesis testing, we refer the reader to wasserman2010statistical ; focs:DuchiJW13 .

Statistical estimation problems. We focus on two estimation problems in this paper.

Mean estimation: For each time stamp t, the data collector wants to obtain an estimate σ̂_t of the mean σ_t = (1/n) · Σ_{u∈[n]} x_{u,t}. The error of a mean estimation algorithm is defined to be the worst case, over inputs, of |σ̂_t − σ_t|. In other words, we do worst-case analysis. We abuse notation and write σ_t for the mean computed on a fixed input {x_{u,t}}_{u∈[n]}.

Histogram estimation: Suppose the domain of counter values [0, m] is partitioned into k buckets (e.g., with equal widths), and a counter value x_{u,t} can be mapped to a bucket number v_{u,t} ∈ [k]. For each time stamp t, the data collector wants to estimate the frequency of each bucket v, namely h_t(v) = (1/n) · |{u : v_{u,t} = v}|, as ĥ_t(v). The error of a histogram estimation is measured by max_{v∈[k]} |ĥ_t(v) − h_t(v)|. Again, we do worst-case analysis of our algorithms.

1.2 Repeated collection and overview of privacy framework

Privacy leakage in repeated data collection.

Although LDP is a very strict notion of privacy, its effectiveness decreases if the data is collected repeatedly. If we collect the counter values of a user u for T time stamps by executing an ε-LDP mechanism independently at each time stamp, the sequence x_{u,1}, ..., x_{u,T} can only be guaranteed indistinguishable from another sequence of counter values, x'_{u,1}, ..., x'_{u,T}, by a factor of up to e^{Tε}, which is too large to be meaningful as T increases.
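
To see why the guarantee degrades linearly in T, note that independence lets the per-round bounds multiply; the following standard composition calculation (a textbook fact, not specific to this paper's mechanisms) makes this explicit:

```latex
% T independent invocations of an \epsilon-LDP mechanism A:
\frac{\Pr[A(x_{u,1})=o_1,\dots,A(x_{u,T})=o_T]}
     {\Pr[A(x'_{u,1})=o_1,\dots,A(x'_{u,T})=o_T]}
  \;=\; \prod_{t=1}^{T}\frac{\Pr[A(x_{u,t})=o_t]}{\Pr[A(x'_{u,t})=o_t]}
  \;\le\; e^{T\epsilon}.
```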

Hence, in applications such as telemetry, where data is collected continuously, privacy guarantees provided by an LDP mechanism for a single round of data collection are not sufficient. We formalize our privacy guarantee, which enhances LDP for repeated data collection, later in Section 3. Intuitively, we ensure that every user blends with a large set of other users who have very different behaviors. A similar philosophy can be found in Blowfish privacy sigmod:HeMD14 , which protects only a specified subset of pairs of neighboring databases to trade off privacy for utility. On a different but relevant line of work about the streaming model of DP innovations:DworkNPRY10 , the event-level private counting problem under continual observation is studied in soda:Dwork10 , with almost tight upper bounds on the error (polylogarithmic in T) in stoc:DworkNPR10 and icalp:ChanSS10 . pvldb:KellarisPXP14 proposes a weaker protection, covering events within windows of consecutive timestamps, to remove the dependency on the length T of the time period.

Figure 2: Privacy Framework

Our Privacy Framework and Guarantees. Our framework for repeated private collection of counter data follows a similar outline to the framework used in ccs:ErlingssonPK14 . Our framework for mean and histogram estimation has four main components:

1) Important building blocks for our overall solution are the 1-bit and d-bit mechanisms that provide ε-LDP guarantees and good accuracy for a single round of data collection (Section 2).

2) An α-point rounding scheme to randomly discretize users' private values prior to applying memoization (to conceal small changes), while keeping the expectation of the discretized values intact (Section 3).

3) Memoization of discretized values using the 1-bit mechanism to avoid privacy leakage from repeated data collection (Section 3). In particular, if the counter value of a user remains approximately consistent, then the user is guaranteed ε-differential privacy even after many rounds of data collection.

4) Finally, output perturbation ("instantaneous noise" in ccs:ErlingssonPK14 ) to protect against exposing the transition points caused by large changes in a user's behavior, and against attacks based on auxiliary information (Section 4).

We now describe these components in more detail, focusing predominantly on mean estimation. Later, in Section 5, we present our experimental results, and in Section 6 we discuss some details of the deployment in Windows 10.

2 Single-round LDP mechanisms for mean and histogram estimation

We first describe our 1-bit and d-bit LDP mechanisms for mean and histogram estimation. Our mechanisms are inspired by the works of Duchi et al. nips:DuchiWJ13 ; focs:DuchiJW13 ; corr:DuchiWJ16 and Bassily and Smith stoc:BassilyS15 . However, our mechanisms are tuned for more efficient communication (sending 1 bit or d bits per counter each time) and stronger protection in repeated data collection (introduced later in Section 3). To the best of our knowledge, the exact form of the mechanisms presented in this section was not known. Our algorithms yield accuracy gains in concrete settings (see Section 5) and are easy to understand and implement.

2.1 1-bit mechanism for mean estimation

Collection mechanism 1BitMean: When the collection of counter x_{u,t} at time t is requested by the data collector, each user u sends one bit b_{u,t}, which is independently drawn from the distribution:

    b_{u,t} = 1 with probability 1/(e^ε + 1) + (x_{u,t}/m) · (e^ε − 1)/(e^ε + 1), and b_{u,t} = 0 otherwise. (1)

Mean estimation. The data collector obtains the bits {b_{u,t}}_{u∈[n]} from the n users and estimates σ_t as

    σ̂_t = (m/n) · Σ_{u∈[n]} ( b_{u,t} · (e^ε + 1) − 1 ) / (e^ε − 1). (2)

The basic randomizer of stoc:BassilyS15 is equivalent to our 1-bit mechanism for the case when each user takes values only in {0, m}. The above mechanism can also be seen as a simplification of the multidimensional mean-estimation mechanism given in focs:DuchiJW13 . For 1-dimensional mean estimation, Duchi et al. focs:DuchiJW13 show that the Laplace mechanism is asymptotically optimal for the mini-max error. However, the communication cost per user of the Laplace mechanism is much higher (a real-valued rather than 1-bit report), and our experiments show it also leads to larger error compared to our 1-bit mechanism. We prove the following results for the above 1-bit mechanism.
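
For concreteness, here is a minimal Python sketch of the 1BitMean response (1) and the estimator (2); the function names and the simulation harness are ours, not from the paper, and the parameter values are illustrative only:

```python
import numpy as np

def one_bit_mean_response(x, m, eps, rng):
    # 1BitMean (1): send 1 w.p. 1/(e^eps+1) + (x/m)*(e^eps-1)/(e^eps+1)
    p = 1.0 / (np.exp(eps) + 1.0) + (x / m) * (np.exp(eps) - 1.0) / (np.exp(eps) + 1.0)
    return int(rng.random() < p)

def estimate_mean(bits, m, eps):
    # Estimator (2): unbiased, since E[(b*(e^eps+1)-1)/(e^eps-1)] = x/m
    bits = np.asarray(bits, dtype=float)
    return (m / len(bits)) * np.sum((bits * (np.exp(eps) + 1.0) - 1.0) / (np.exp(eps) - 1.0))

rng = np.random.default_rng(0)
m, eps, n = 3600, 1.0, 100_000
xs = rng.uniform(0, m, size=n)                    # private counters (seconds)
bits = [one_bit_mean_response(x, m, eps, rng) for x in xs]
print(xs.mean(), estimate_mean(bits, m, eps))     # the two values are close
```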

Theorem 1.

For single-round data collection, the mechanism 1BitMean in (1) preserves ε-LDP for each user. Upon receiving the n bits {b_{u,t}}_{u∈[n]}, the data collector can then estimate the mean of the counters from the n users as in (2). With probability at least 1 − δ, we have

    |σ̂_t − σ_t| ≤ (m/√(2n)) · ((e^ε + 1)/(e^ε − 1)) · √(log(2/δ)).

We establish a few lemmas first and then prove Theorem 1.

Lemma 1.

The algorithm 1BitMean preserves ε-DP for every user.

Proof: Observe that each user contributes only a single bit b_{u,t} to the data collector. By formula (1), the probability that b_{u,t} = 1 varies from 1/(e^ε + 1) to e^ε/(e^ε + 1) depending on the private value x_{u,t} ∈ [0, m]. Similarly, the probability that b_{u,t} = 0 varies from 1/(e^ε + 1) to e^ε/(e^ε + 1). Thus the ratios of the respective probabilities for different values of x_{u,t} can be at most e^ε.

Recall the definition of σ_t:

    σ_t = (1/n) · Σ_{u∈[n]} x_{u,t}. (3)
Lemma 2.

σ̂_t in Equation (2) is an unbiased estimator for σ_t.

Proof: Observe that E[b_{u,t}] = 1/(e^ε + 1) + (x_{u,t}/m) · (e^ε − 1)/(e^ε + 1), and hence E[(b_{u,t} · (e^ε + 1) − 1)/(e^ε − 1)] = x_{u,t}/m. The claim follows from the linearity of expectation: E[σ̂_t] = (1/n) · Σ_{u∈[n]} x_{u,t} = σ_t.

Lemma 3.

Let σ̂_t and σ_t be as in Equations (2) and (3), and let δ ∈ (0, 1) be arbitrary. We have

    Pr[ |σ̂_t − σ_t| ≥ (m/√(2n)) · ((e^ε + 1)/(e^ε − 1)) · √(log(2/δ)) ] ≤ δ. (4)

Proof: Clearly, for every u ∈ [n] we have b_{u,t} ∈ {0, 1}. Let

    S = Σ_{u∈[n]} b_{u,t}. (5)

Applying the Chernoff-Hoeffding bound (boucheron2013concentration, Theorem 2.8) to the n independent {0, 1}-random variables {b_{u,t}}_{u∈[n]}, for all a ≥ 0 we have

    Pr[ |S − E[S]| ≥ a ] ≤ 2 · exp(−2a²/n). (6)

By (2), (3), and Lemma 2,

    σ̂_t − σ_t = (m/n) · ((e^ε + 1)/(e^ε − 1)) · (S − E[S]). (7)

Combining (7) and (6) we conclude

    Pr[ |σ̂_t − σ_t| ≥ (m/n) · ((e^ε + 1)/(e^ε − 1)) · a ] ≤ 2 · exp(−2a²/n). (8)

Thus, setting a = √((n/2) · log(2/δ)), we obtain

    Pr[ |σ̂_t − σ_t| ≥ (m/√(2n)) · ((e^ε + 1)/(e^ε − 1)) · √(log(2/δ)) ] ≤ δ, (9)

which concludes the proof.

Proof [of Theorem 1]: For any δ ∈ (0, 1), set a = √((n/2) · log(2/δ)). Then, by Lemma 3, the stated error bound holds with probability at least 1 − δ. This fact combined with Lemmas 1 and 3 completes the proof.

2.2 d-bit mechanism for histogram estimation

Now we consider the problem of estimating the histogram of counter values over a discretized domain with k buckets, with ε-LDP guaranteed.

This problem has an extensive literature both in computer science and statistics, and dates back to the seminal work of Warner warner1965randomized ; we refer the reader to the following excellent papers icalp:HsuKR12 ; nips:DuchiWJ13 ; stoc:BassilyS15 ; kairouz2016discrete for more information. Recently, Bassily and Smith stoc:BassilyS15 gave asymptotically tight results for the problem in the worst-case model, building on the work of icalp:HsuKR12 . On the other hand, Duchi et al. nips:DuchiWJ13 introduce a mechanism by adapting Warner's classical randomized response mechanism warner1965randomized , which is shown to be optimal for the statistical mini-max regret if one does not care about the cost of communication.

The generic protocol introduced in stoc:BassilyS15 can be used to reduce the communication cost of Duchi et al.'s mechanism to 1 bit per user; this, however, only works for sufficiently small ε (stoc:BassilyS15 states a stricter constraint on ε, but one can optimize the parameters to loosen it). Another major technical component in stoc:BassilyS15 is the use of the Johnson-Lindenstrauss lemma to reduce the cost of communication. This component seems very difficult to use in practice, because it requires per-counter storage and/or per-collection running time that grow with the number of users. In our applications, n (the number of users) is of the order of millions, which makes their mechanism prohibitively expensive. kairouz2016discrete generalizes Warner's randomized response mechanism from binary to k-ary, which is close to optimal for large ε but sub-optimal for small ε.

Therefore, in order to have a smooth trade-off between accuracy and communication cost (as well as the ability to protect privacy in repeated data collection, which will be introduced in Section 3), we introduce a modified version of Duchi et al.'s mechanism nips:DuchiWJ13 based on subsampling of buckets.

Collection mechanism dBitFlip: Each user u randomly draws d bucket numbers without replacement from [k], denoted by j_1 < j_2 < ... < j_d. When the collection of the discretized bucket number v_{u,t} at time t is requested by the data collector, each user u sends a vector:

    b_{u,t} = [(j_1, b_{u,t}(j_1)), (j_2, b_{u,t}(j_2)), ..., (j_d, b_{u,t}(j_d))], where b_{u,t}(j_p) = 1 with probability e^{ε/2}/(e^{ε/2} + 1) if v_{u,t} = j_p, and with probability 1/(e^{ε/2} + 1) otherwise.

Under the same public coin model as in stoc:BassilyS15 , each user only needs to send the d bits b_{u,t}(j_1), ..., b_{u,t}(j_d) to the data collector, as j_1, ..., j_d can be generated using public coins.

Histogram estimation. The data collector estimates the histogram as: for each v ∈ [k],

    ĥ_t(v) = (k/(n·d)) · Σ_{u: v ∈ {j_1,...,j_d}} ( b_{u,t}(v) · (e^{ε/2} + 1) − 1 ) / (e^{ε/2} − 1). (10)

When d = k, dBitFlip is exactly the same as the mechanism in Duchi et al. nips:DuchiWJ13 . The privacy guarantee is straightforward. In terms of accuracy, the intuition is that for each bucket v, there are roughly n·d/k users responding with a 0-1 bit on v. We can prove the following result.
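
A minimal Python sketch of dBitFlip and the estimator (10) follows; as before, the names are ours, and the snippet assumes the numpy import from the earlier sketch:

```python
def d_bit_flip_response(bucket, k, d, eps, rng):
    # dBitFlip: sample d buckets without replacement; one noisy bit per sampled bucket
    sampled = rng.choice(k, size=d, replace=False)
    p_hi = np.exp(eps / 2) / (np.exp(eps / 2) + 1.0)   # Pr[bit = 1] if bucket matches
    p_lo = 1.0 / (np.exp(eps / 2) + 1.0)               # Pr[bit = 1] otherwise
    bits = rng.random(d) < np.where(sampled == bucket, p_hi, p_lo)
    return sampled, bits.astype(float)

def estimate_histogram(responses, k, d, eps, n):
    # Estimator (10): for each bucket v, sum the debiased bits of users who sampled v
    h = np.zeros(k)
    for sampled, bits in responses:
        h[sampled] += (bits * (np.exp(eps / 2) + 1.0) - 1.0) / (np.exp(eps / 2) - 1.0)
    return h * k / (n * d)
```

Note that changing a user's bucket value changes the response distribution of at most two of the d transmitted bits, each by a factor of at most e^{ε/2}, which is where the ε-LDP guarantee comes from.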

Theorem 2.

For single-round data collection, the mechanism dBitFlip preserves ε-LDP for each user. Upon receiving the d bits {b_{u,t}(j_p)}_{p∈[d]} from each user u, the data collector can then estimate the histogram as in (10). With probability at least 1 − δ, we have

    max_{v∈[k]} |ĥ_t(v) − h_t(v)| ≤ O( ((e^{ε/2} + 1)/(e^{ε/2} − 1)) · √( k · log(k/δ) / (n·d) ) ).

Proof: The privacy guarantee of our algorithm is straightforward from the construction: changing a user's value affects the response distribution of at most two of the d transmitted bits, each by a multiplicative factor of at most e^{ε/2}. To analyze the error bound for each v, let us consider the set U_v of users each of whom sends a bit on bucket v, and let n_v = |U_v|. Based on how each user chooses j_1, ..., j_d, we know E[n_v] = n·d/k, and n_v concentrates around its expectation.

Since U_v can be considered as a uniform random sample from the n users, we can show using Hoeffding's inequality that the fraction of users in U_v whose value falls in bucket v concentrates around h_t(v).

From (10) and, again, from Hoeffding's inequality, the noise introduced by the randomized bits concentrates as well.

Putting them together, and using the union bound and the triangle inequality, we obtain the error bound for a single bucket v. The bound on max_{v∈[k]} follows from the union bound over the k buckets.

3 Memoization for continual collection of counter data

One important concern regarding the use of ε-LDP algorithms (e.g., that of Section 2.1) to collect counter data pertains to the privacy leakage that may occur if we collect a user's data repeatedly (say, daily) and the user's private value x_{u,t} does not change or changes little. After a number of rounds, the data collector will have enough noisy reads to estimate x_{u,t} with high accuracy.

Memoization ccs:ErlingssonPK14 is a simple rule: at the account setup phase, each user pre-computes and stores his responses to the data collector for all possible values of the private counter. At data collection time, users do not use fresh randomness, but respond with the pre-computed responses corresponding to their current counter values. Memoization (to a certain degree) takes care of situations when the private value stays constant. Note that the use of memoization violates differential privacy: if memoization is employed, the data collector can easily distinguish a user whose value keeps changing from a user whose value is constant, no matter how small ε is. However, the privacy leakage is limited: when the data collector observes that a user's response has changed, this only indicates that the user's value has changed, but not what it was and not what it is.

As observed in (ccs:ErlingssonPK14, Section 1.3), using the memoization technique in the context of collecting counter data is problematic for the following reason. Often, from day to day, private values do not stay constant, but rather experience small changes (e.g., one can think of app usage statistics reported in seconds). Note that naively using memoization adds no additional protection for a user whose private value varies but stays approximately the same, as the data collector would observe many independent responses corresponding to it.

One naive way to fix the issue above is to use discretization: pick a large integer s (the segment size) that divides m; consider the partition of [0, m] into segments [i·s, (i+1)·s); and have each user report his value after rounding the true value to the mid-point of the segment that it belongs to. This approach takes care of the leakage caused by small changes to x_{u,t}, as users' values would now tend to stay within a single segment and thus trigger the same memoized response; however, the accuracy loss may be extremely large. For instance, in a population where every user's value sits just above some segment boundary i·s, after rounding every user would be responding based on the mid-point i·s + s/2, so the estimate of the mean would be off by roughly s/2.

In the following subsection, we present a better (randomized) rounding technique (termed α-point rounding) that has been previously used in the approximation algorithms literature goemans2002single ; bansal2008improved and rigorously addresses the issues discussed above. We first consider the mean estimation problem.

3.1 α-point rounding for mean estimation

The key idea of rounding is to discretize the domain where users' counters take their values. Discretization reduces the domain size, so that users who behave consistently take fewer distinct values, which allows us to apply memoization and obtain a strong privacy guarantee.

As we demonstrated above, discretization may be particularly detrimental to accuracy when users' private values are correlated. We propose addressing this issue by making the discretization rule independent across different users. This ensures that when (say) all users have the same value, some users round it up and some round it down, facilitating a smaller accuracy loss.

We are now ready to specify the algorithm that extends the basic algorithm 1BitMean and employs both α-point rounding and memoization. We assume that counter values range over the integers in [0, m].

  1. At the algorithm design phase, we specify an integer s (our discretization granularity). We assume that s divides m. We suggest setting s rather large, say s = m/2 or even s = m, depending on the particular application domain.

  2. At the setup phase, each user u independently at random picks a value α_u ∈ {0, 1, ..., s − 1} that is used to specify his rounding rule.

  3. User u invokes the basic algorithm 1BitMean with range [0, m] to compute and memoize 1-bit responses to the data collector for all values in the arithmetic progression

    A = {0, s, 2s, ..., m}. (11)
  4. Consider a user u with private value x_{u,t} who receives a data collection request. Let x_{u,t} ∈ [L, R), where L and R are the two neighboring elements of the arithmetic progression A. The user rounds the value to R if x_{u,t} + α_u ≥ R; otherwise, the user rounds the value to L. Let y_{u,t} denote the value of the user after rounding. In each round, user u responds with the memoized bit for the value y_{u,t}. Note that the rounding is always uniquely defined. (A sketch of the whole scheme appears below.)
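
The following Python sketch puts steps 1 through 4 together, reusing one_bit_mean_response from Section 2.1. The class name is ours, and the boundary convention in respond is one concrete reading of the rounding rule above; what matters is that it preserves the unbiasedness shown in Lemma 4 below:

```python
class MemoizedUser:
    """alpha-point rounding + permanent memoization on top of 1BitMean."""
    def __init__(self, m, s, eps, rng):
        assert m % s == 0
        self.m, self.s = m, s
        self.alpha = int(rng.integers(0, s))      # rounding offset, picked once forever
        # memoize one 1BitMean response bit per value in A = {0, s, 2s, ..., m}
        self.memo = {v: one_bit_mean_response(v, m, eps, rng)
                     for v in range(0, m + 1, s)}

    def respond(self, x):
        L = (x // self.s) * self.s                # left endpoint of x's segment
        if L == self.m:                           # x == m is already a grid point
            y = self.m
        else:
            y = L + self.s if x + self.alpha >= L + self.s else L
        return self.memo[y]                       # never uses fresh randomness
```

With this convention, Pr[y = R] = (x − L)/s, so the rounded value equals x in expectation over α_u.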

We now establish the properties of the algorithm above.

Lemma 4.

Define y_{u,t} as above. Then E[σ̂_t] = σ_t, where σ_t is defined by (3).

Proof: Let L ≤ x_{u,t} < R, where L and R are two neighboring elements of the progression A. Define a random variable Y as follows: let Y = R with probability (x_{u,t} − L)/s and Y = L with probability (R − x_{u,t})/s. Then, y_{u,t} is distributed as Y, and E[Y] = x_{u,t}. The proof of the lemma follows from the linearity of expectation and the fact that each memoized bit is distributed according to (1) applied to y_{u,t}.

Perhaps a bit surprisingly, using α-point rounding does not lead to any additional accuracy loss, independent of the choice of the discretization granularity s.

Theorem 3.

Independent of the value of the discretization granularity s, at any round of data collection, the algorithm above provides the same accuracy guarantees as given in Theorem 1.

Proof: It suffices to show that, independent of s, each output bit is still sampled according to the distribution given by formula (1). We use the notation of Lemma 4. By formula (1) and the definition of y_{u,t} we have:

    Pr[b_{u,t} = 1] = E_α[ 1/(e^ε + 1) + (y_{u,t}/m) · (e^ε − 1)/(e^ε + 1) ] = 1/(e^ε + 1) + (x_{u,t}/m) · (e^ε − 1)/(e^ε + 1),

which concludes the proof.

3.2 Privacy definition using permanent memoization

In what follows, we detail the privacy guarantees provided by an algorithm that employs α-point rounding and memoization in conjunction with the ε-DP 1-bit mechanism of Section 2.1, against a data collector that receives a very long stream of a user's responses to data collection events.

Let u be a user and x_{u,1}, ..., x_{u,T} be the sequence of u's private counter values. Given the user's rounding offset α_u, each x_{u,t} gets rounded to the corresponding value y_{u,t} in the set A (defined by (11)) according to the rule given in Section 3.1.

Definition 2.

Let U be the space of all sequences (y_1, ..., y_T) ∈ A^T, considered up to an arbitrary permutation of the elements of A. We define the behavior pattern B(u) of the user u to be the element of U corresponding to (y_{u,1}, ..., y_{u,T}). We refer to the number of distinct elements w in the sequence (y_{u,1}, ..., y_{u,T}) as the width of B(u).

We now discuss our notion of behavior pattern, using counters that carry daily app usage statistics as an example. Intuitively, users map to the same behavior pattern if they have the same number of different modes (approximate counter values) of using the app, and switch between these modes on the same days. For instance, one user who uses an app for 30 minutes on weekdays, 2 hours on weekends, and 6 hours on holidays, and another user who uses the app for 4 hours on weekdays, 10 minutes on weekends, and not at all on holidays will likely map to the same behavior pattern. Observe, however, that the mapping from actual private counter values to behavior patterns is randomized; thus there is a likelihood that some users with identical private usage profiles map to different behavior patterns. This is a positive feature of Definition 2 that increases entropy among users with the same behavior pattern.

The next theorem shows that the algorithm of Section 3.1 makes users with the same behavior pattern blend with each other from the viewpoint of the data collector (in the sense of differential privacy).

Theorem 4.

Consider users u_1 and u_2 with sequences of private counter values {x_{u_1,t}}_{t∈[T]} and {x_{u_2,t}}_{t∈[T]}. Assume that both u_1 and u_2 respond to data collection events using the algorithm presented in Section 3.1, that B(u_1) = B(u_2), and that the width of B(u_1) is w. Let s_{u_1} and s_{u_2} be the random sequences of responses generated by users u_1 and u_2; then for any binary string z in the response domain, we have:

    Pr[s_{u_1} = z] ≤ e^{wε} · Pr[s_{u_2} = z]. (12)

Proof: Let {y_{u_1,t}} and {y_{u_2,t}} be the sequences of u_1's and u_2's counter values after applying α-point rounding. Since the width of B(u_1) is w, the set {y_{u_1,t}}_{t∈[T]} contains w distinct elements p_1, ..., p_w. Similarly, the set {y_{u_2,t}}_{t∈[T]} contains w distinct elements q_1, ..., q_w. Note that the vectors s_{u_1} and s_{u_2} are each determined by w bits, namely u_1's (u_2's) memoized responses corresponding to the counter values {p_i} ({q_i}). By the ε-LDP property of the basic algorithm 1BitMean of Section 2.1, for all values of i and every bit b we have

    Pr[u_1's memoized response for p_i equals b] ≤ e^ε · Pr[u_2's memoized response for q_i equals b].

Thus the probability of observing some specific sequence of responses can increase by at most e^{wε} as we vary the inputs.

3.2.1 Setting parameters

The wε-LDP guarantee provided by Theorem 4 ensures that each user is indistinguishable from other users with the same behavior pattern (in the sense of LDP). The exact shape of behavior patterns is governed by the choice of the parameter s. Setting s very large, say s = m or s = m/2, reduces the number of possible behavior patterns and thus increases the number of users that blend by mapping to a particular behavior pattern. It also yields a stronger guarantee for blending within a pattern, since for all users the width w is then necessarily at most 2 (for s = m) or at most 3 (for s = m/2), and thus by Theorem 4 the likelihood of distinguishing users within a pattern is trivially at most e^{2ε} or e^{3ε}, respectively. At the same time, there are cases where one can justify using smaller values of s. In fact, consistent users, i.e., users whose private counter always lands in the vicinity of one of a small number of fixed values, enjoy a strong LDP guarantee within their patterns irrespective of s (provided it is not too small), and a smaller s may be advantageous for avoiding certain attacks based on auxiliary information, as the set of all possible values of a private counter that lead to a specific output bit is potentially more complex.

Finally, it is important to stress that the wε-LDP guarantee established in Theorem 4 is not a panacea; in particular, it is a weaker guarantee (provided in a much more challenging setting) than the ε-LDP guarantee across all users that we provide for a single round of data collection. In particular, while LDP across the whole population of users is resilient to any attack based on auxiliary information, LDP across a sub-population may be vulnerable to such attacks, and additional levels of protection may need to be applied. For instance, if the data collector observes that a user's response has changed, the data collector knows with certainty that the user's true counter value has changed. In the case of app usage telemetry, this implies that the app has been used on one of the days. This attack is partially mitigated by the output perturbation technique discussed in Section 4.

3.2.2 Experimental study

We use a real-world dataset of 3 million users with their daily usage of two apps (App A and App B), collected (in seconds) over a continuous period of 31 days, to demonstrate the mapping of users to behavior patterns in Figure 3. For each behavior pattern (Definition 2), we calculate its support as the number of users whose sequences fall in this pattern. All the patterns' supports are plotted in decreasing order, and we also calculate the percentage of users landing in patterns with support at least a given threshold. We vary the granularity s in permanent memoization from s = m (maximizing blending) to smaller values, and report the corresponding distributions of pattern supports in Figure 3.

Figure 3: Distribution of pattern supports for App A and B

It is not hard to see that, theoretically, for every behavior pattern there is a very large set of sequences of private counter values that may map to it (depending on α_u). Real data (Figure 3) provides evidence that users tend to be approximately consistent, and therefore simpler patterns, i.e., patterns that mostly stick to a single rounded value, correspond to larger sets of sequences obtained from a real population. In particular, for each app there is always one pattern (corresponding to having one fixed y_{u,t} across all 31 days) which blends the majority of the 3 million users. More complex behavior patterns, however, have fewer users mapping to them. In particular, there are always some lonely users (their number depending on s) who land in patterns with a support size of one or two. From the viewpoint of the data collector, such users can only be identified as having a complex and irregular behavior, but the actual nature of that behavior, by Theorem 4, remains uncertain.

3.3 Example

One specific example of a counter collection problem, identified in (ccs:ErlingssonPK14, Section 1.3) as non-suitable for the techniques presented in ccs:ErlingssonPK14 but easily solved using our methods, is to repeatedly collect age in days from a population of users. When we set the granularity s sufficiently large and apply the algorithm of Section 3.1, we can collect such data for many rounds with high accuracy. Each user necessarily responds with a sequence of bits of the form (b, ..., b, b', ..., b'), where b and b' are the memoized bits for two neighboring elements of the progression (11). Thus the data collector only gets to learn the transition point, i.e., the day when the user's age in days passes the randomly shifted segment boundary, which is safe from a privacy perspective as α_u is picked uniformly at random by the user.
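
A tiny simulation (reusing the MemoizedUser sketch above; the specific numbers are illustrative only) shows this structure directly:

```python
# A counter that grows by 1 per day, e.g., age in days.
rng = np.random.default_rng(1)
u = MemoizedUser(m=36500, s=36500, eps=1.0, rng=rng)   # s = m, so A = {0, m}
resp = [u.respond(age) for age in range(9000, 9365)]   # one year of collections
# resp is constant except for at most one switch, on the day the age
# crosses the user's randomly shifted boundary m - alpha_u.
```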

3.4 Continual collection for histogram estimation using permanent memoization

Since we discretize the range of counter values and map each user's value to a small number of buckets, α-point rounding is not needed for histogram estimation. The single-round LDP mechanism in Duchi et al. nips:DuchiWJ13 sends out a 0-1 random response for each of the k buckets: send 1 with probability e^{ε/2}/(e^{ε/2} + 1) if the counter value is in this bucket, and with probability 1/(e^{ε/2} + 1) if not. It is easy to see that this mechanism is ε-LDP. Each user could memoize a mapping f: [k] → {0, 1}^k by running this mechanism once for each v ∈ [k], and always respond with f(v) if the user's value is in bucket v. However, this memoization schema leads to very serious privacy leakage. There is a situation where one has auxiliary information that can deterministically correlate a user's value with the output produced by the algorithm: more concretely, if the data collector knows that the app usage value is in a bucket v and observes the output z = f(v) on some day, then whenever the user sends z again in the future, the data collector can infer that the bucket number is v with almost 100% probability.

To avoid such privacy leakage, we apply permanent memoization to our d-bit mechanism dBitFlip (Section 2.2). Each user runs dBitFlip once for each bucket number v ∈ [k] and memoizes the responses in a mapping f: [k] → {0, 1}^d. The user then always sends f(v) if his bucket number is v. This mechanism is denoted by dBitFlipPM, and the same estimator (10) can be used to estimate the histogram upon receiving the d-bit responses from every user. This scheme avoids the privacy leakage that arises due to memoization, because multiple buckets are (with high probability) mapped to the same response; this protection is the strongest when d = 1. Definition 2 about behavior patterns and Theorem 4 can be naturally generalized here to provide a similar privacy guarantee in repeated data collection.
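
A sketch of the memoized variant, under the same assumptions as the earlier snippets (note that the d sampled buckets are drawn once and reused for every possible value):

```python
def make_d_bit_flip_pm(k, d, eps, rng):
    # dBitFlipPM: run dBitFlip once per possible bucket value and memoize f(v)
    sampled = rng.choice(k, size=d, replace=False)      # fixed per user
    p_hi = np.exp(eps / 2) / (np.exp(eps / 2) + 1.0)
    p_lo = 1.0 / (np.exp(eps / 2) + 1.0)
    memo = {v: (sampled,
                (rng.random(d) < np.where(sampled == v, p_hi, p_lo)).astype(float))
            for v in range(k)}
    return lambda v: memo[v]                            # always replay the memoized bits

respond = make_d_bit_flip_pm(k=32, d=4, eps=1.0, rng=np.random.default_rng(2))
```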

4 Output Perturbation

One of the limitations of the memoization approach is that it does not protect the points in time where a user's behavior changes significantly. Consider a user who never uses an app for a long time, and then starts using it. When this happens, suppose the output produced by our algorithm changes from a memoized bit b to a different memoized bit b'. Then the data collector learns with certainty that the user's behavior changed (but not what the behavior was or what it became). Output perturbation is one possible mechanism for protecting the exact location of the points in time where a user's behavior has changed. As mentioned earlier, output perturbation was introduced in ccs:ErlingssonPK14 as a way to mitigate privacy leakage that arises due to memoization. The main idea behind output perturbation is to flip the output of the memoized response with a small probability γ. This ensures that the data collector will not be able to learn with certainty that the behavior of a user changed at certain time stamps.

Consider the mean estimation algorithm. Suppose b_{u,t} denotes the memoized response bit of user u at time t. With output perturbation, the user instead sends

    b'_{u,t} = 1 − b_{u,t} with probability γ, and b'_{u,t} = b_{u,t} with probability 1 − γ. (13)
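
In code, output perturbation is a one-line wrapper around the memoized response (a minimal sketch; γ is assumed to be at most 1/2):

```python
def output_perturb(memoized_bit, gamma, rng):
    # Eq. (13): flip the memoized bit independently with probability gamma
    return 1 - memoized_bit if rng.random() < gamma else memoized_bit
```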

Note that output perturbation is applied at each time stamp to the memoized responses. To see how output perturbation prevents the data collector from learning the exact points at which a user's behavior changed, we need to set up some notation. Fix an arbitrary time horizon T over which the counter data is collected. Let d_1 and d_2 be two vectors in {0, 1}^T, and let d(i) denote the ith coordinate of a vector d. Let U(d_1) and U(d_2) denote the outputs produced by our 1-bit algorithm with memoization, and let U_op(d_1) and U_op(d_2) denote the outputs produced by our 1-bit algorithm with memoization and output perturbation. Suppose the Hamming distance between d_1 and d_2 is at most r. Then:

Theorem 5.

Let z be a vector in {0, 1}^T. Then, Pr[U_op(d_1) = z] ≤ ((1 − γ)/γ)^r · Pr[U_op(d_2) = z].

Recall that in the output perturbation step, we flip each output bit independently with probability γ. This implies

    Pr[U_op(d) = z] = Π_{i=1}^{T} Pr[z(i) | d(i)],

where d(i) denotes the value of d at the ith coordinate. For an i for which d_1(i) = d_2(i), we have Pr[z(i) | d_1(i)] = Pr[z(i) | d_2(i)]; this is true since the probability used to flip the output bits is the same for both strings. Therefore,

    Pr[U_op(d_1) = z] / Pr[U_op(d_2) = z] = Π_{i: d_1(i) ≠ d_2(i)} Pr[z(i) | d_1(i)] / Pr[z(i) | d_2(i)]. (14)

Now notice that for an i for which d_1(i) ≠ d_2(i), we have Pr[z(i) | d_1(i)] / Pr[z(i) | d_2(i)] ≤ (1 − γ)/γ (for γ ≤ 1/2). Thus, the theorem follows from Eq. (14) and from our assumption that the Hamming distance between d_1 and d_2 is at most r.

The theorem implies that if the user's behavior changed at time t, then there is an interval of time around t within which the data collector cannot confidently tell at which point the behavior changed. Consider a user u and let x be a vector in {0, 1}^T that denotes the values taken by u in the interval [1, T]. Suppose the user's behavior remains constant up to time step t_1, changes at time t_1, and then remains constant again. Without loss of generality, let us assume that x(i) = 0 for all i < t_1, and x(i) = 1 for all i ≥ t_1. Consider the case when the output produced by our memoization changes at time t_1; that is, using the notation from the paragraph above, U(x)(i) takes one memoized value for i < t_1 and a different memoized value for i ≥ t_1. Without output perturbation, the data collector would be certain that the user's value changed at time t_1. With output perturbation, we claim that the data collector cannot confidently differentiate whether the user's behavior changed at time t_1 or at any other time t_2 with |t_1 − t_2| ≤ r, if r is sufficiently small. (Think of r as some small constant.) We argue as follows. Consider another pattern of user behavior x', with x'(i) = 0 for all i < t_2 and x'(i) = 1 for all i ≥ t_2. If |t_1 − t_2| ≤ r, then the Hamming distance of U(x) and U(x') is at most r. This is true for the following reason: consider the case t_2 < t_1; then only in the interval [t_2, t_1] can the outputs of the 1-bit mechanism with memoization differ for the strings x and x'. Thus, we conclude from Theorem 5 that Pr[U_op(x) = z] ≤ ((1 − γ)/γ)^r · Pr[U_op(x') = z]. The argument for the case t_2 > t_1 is exactly the same. Thus, output perturbation can help to protect the exact points in time where a user's behavior changes.

Consider a single round of data collection with the algorithm above.

Theorem 6.

Using output perturbation with a positive γ ≤ 1/2 in combination with the ε-DP 1BitMean algorithm is equivalent to invoking the 1BitMean algorithm with

    ε' = ln( ((1 − γ) · e^ε + γ) / (γ · e^ε + 1 − γ) ). (15)

Thus, for each round of data collection, with probability at least 1 − δ, the error of the mechanism presented above is at most (m/√(2n)) · ((e^{ε'} + 1)/(e^{ε'} − 1)) · √(log(2/δ)), where δ is an arbitrary constant between zero and one.

Proof: Observe that the distribution produced by combining output perturbation (13) with the ε-DP 1BitMean algorithm (1) is given by

    Pr[b'_{u,t} = 1] = γ + (1 − 2γ) · ( 1/(e^ε + 1) + (x_{u,t}/m) · (e^ε − 1)/(e^ε + 1) ). (16)

It remains to note that if in formula (1) we use ε' given by (15) instead of ε, then (1) yields the same distribution as (16). Now, to prove Theorem 6, we simply invoke Theorem 1.
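
The effective single-round ε' of (15) is easy to tabulate; a small sketch (assuming the form of (15) as reconstructed above):

```python
import math

def effective_eps(eps, gamma):
    # Eq. (15): effective epsilon of 1BitMean followed by output perturbation
    return math.log(((1 - gamma) * math.exp(eps) + gamma) /
                    (gamma * math.exp(eps) + (1 - gamma)))

for gamma in (0.0, 0.1, 1 / 3):
    print(gamma, effective_eps(1.0, gamma))  # gamma = 0 recovers eps; larger gamma shrinks it
```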

5 Empirical Evaluation

We compare our mechanisms (with permanent memoization) for mean and histogram estimation against previous mechanisms for one-time data collection. Note that all the mechanisms compared here provide a one-time ε-LDP guarantee; however, our mechanisms provide additional protection for each individual's privacy during repeated data collection (as introduced in Sections 3-4). The goal of these experiments is to show that our mechanisms, with such additional protection, are no worse than, or comparable to, the state-of-the-art LDP mechanisms in terms of estimation accuracy.

We first use the real-world dataset described in Section 3.2.2.

Figure 4: Comparison of mechanisms for mean estimation (real-world datasets)

Figure 5: Comparison of mechanisms for histogram estimation (real-world datasets)

Mean estimation. We implement our 1-bit mechanism (introduced in Section 2.1) with α-point randomized rounding and permanent memoization for repeated collection (Section 3), denoted by 1BitRRPM, and with output perturbation to enhance the protection for usage change (Section 4), denoted by 1BitRRPM+OP(γ). We compare them with the Laplace mechanism for LDP mean estimation in nips:DuchiWJ13 ; corr:DuchiWJ16 , denoted by Laplace. We vary the value of ε and the number of users n (by randomly picking subsets of all the 3 million users), and run all the mechanisms 3000 times on the 31-day usage data with three counters. Recall that the domain size is m (usage reported in seconds). The average absolute errors (in seconds) with one standard deviation (STD) are reported in Figure 4. 1BitRRPM is consistently better than Laplace, with smaller errors and narrower STDs. Even with a perturbation probability γ = 1/3, the two are comparable in accuracy. When γ = 1/3, output perturbation is equivalent to replacing the memoized bit with an unbiased coin flip with probability 2/3, independently on each day, to provide very strong protection on usage change; even in this case, 1BitRRPM+OP(1/3) gives us tolerable accuracy when the number of users is large.

Histogram estimation. We create k = 32 buckets with even widths to evaluate mechanisms for histogram estimation. We implement our d-bit mechanism (Section 2.2) with permanent memoization for repeated collection (Section 3.4), denoted by dBitFlipPM. In order to provide protection on usage change in repeated collection, we use small values of d (the protection is strongest when d = 1). We compare it with state-of-the-art one-time mechanisms for histogram estimation: BinFlip nips:DuchiWJ13 ; corr:DuchiWJ16 , KFlip kairouz2016discrete , and BinFlip+ (applying the generic protocol with 1-bit reports of stoc:BassilyS15 on BinFlip). When d = k, dBitFlipPM has the same accuracy as BinFlip. KFlip is sub-optimal for small ε kairouz2016discrete but has better performance when ε is large. In contrast, BinFlip+ has good performance when ε is small.

We repeat the experiment 3000 times and report the average histogram error (i.e., the maximum error across all bars in the histogram) with one standard deviation for the different algorithms in Figure 5, for varying ε and d, to confirm the above theoretical results. BinFlip (equivalently, 32BitFlipPM) has the best accuracy overall.

With enhanced privacy protection in repeated data collection, 4BitFlipPM is comparable to the one-time collection mechanism KFlip when ε is small, and 4BitFlipPM through 1BitFlipPM are better than BinFlip+ when ε is large.

On different data distributions. We have shown that the errors in mean and histogram estimation can be bounded (Theorems 1-2) in terms of ε and the number of users n, together with the number of buckets k and the number of bits d (applicable only to histograms). We now conduct additional experiments on synthetic datasets to verify that the empirical errors do not change much on different data distributions. Three types of distributions are considered: i) a constant distribution, i.e., each user keeps a fixed counter value all the time; ii) the uniform distribution on [0, m]; and iii) a normal distribution (with mean m/2), truncated to [0, m]. Three synthetic datasets are created by drawing samples from these three distributions. Results are plotted in Figures 6-7 for mean and histogram estimation, respectively, and are almost the same as those in Figures 4(a) and 5(a).

Figure 6: Comparison of mechanisms for mean estimation (synthetic datasets): (a) constant distribution, (b) uniform distribution, (c) normal distribution (truncated)

Figure 7: Comparison of mechanisms for histogram estimation (synthetic datasets): (a) constant distribution, (b) uniform distribution, (c) normal distribution (truncated)

6 Deployment

In earlier sections, we presented new LDP mechanisms geared towards repeated collection of counter data, with formal privacy guarantees even after being executed for a long period of time. Our mean estimation algorithm has been deployed by Microsoft starting with Windows Insiders in the Windows 10 Fall Creators Update. The algorithm is used to collect the number of seconds that a user has spent using a particular app. Data collection is performed every 6 hours. Memoization is applied across days, and output perturbation with a fixed γ is used. According to Theorem 6, this makes a single round of data collection satisfy ε'-DP for the ε' given by (15).

One important feature of our deployment is that collecting usage data for multiple apps from a single user leads only to a minor additional privacy loss that is independent of the actual number of apps. Intuitively, this happens because we are collecting active usage data, and the total number of seconds that a user can spend across multiple apps in 6 hours is bounded by an absolute constant that is independent of the number of apps.

Theorem 7.

Using the 1BitMean mechanism with a privacy parameter ε to simultaneously collect k counters x_1, ..., x_k, where each x_j satisfies x_j ∈ [0, m] and Σ_{j∈[k]} x_j ≤ m, preserves ε'-DP, where ε' is given by

(17)

Proof: For x ∈ [0, m] and a bit b ∈ {0, 1}, let p_b(x) denote the probability that the 1BitMean mechanism produces the output b on the input x, as given by (1). Let {x_j}_{j∈[k]} and {x'_j}_{j∈[k]} be two sets of arbitrary counter values. Here all x_j and x'_j are non-zero, and Σ_{j∈[k]} x_j ≤ m and Σ_{j∈[k]} x'_j ≤ m. Fix some output bits {b_j}_{j∈[k]}. We need to bound

    Π_{j∈[k]} p_{b_j}(x_j) / p_{b_j}(x'_j). (18)

The remaining steps bound this ratio using formula (1) together with the constraints Σ_{j∈[k]} x_j ≤ m and Σ_{j∈[k]} x'_j ≤ m.