Differentially Private Set Union

02/22/2020 · by Sivakanth Gopi, et al. · Microsoft

We study the basic operation of set union in the global model of differential privacy. In this problem, we are given a universe U of items, possibly of infinite size, and a database D of users. Each user i contributes a subset W_i ⊆ U of items. We want an (ε,δ)-differentially private algorithm which outputs a subset S ⊆ ∪_i W_i such that the size of S is as large as possible. The problem arises in countless real-world applications; it is particularly ubiquitous in natural language processing (NLP) applications as vocabulary extraction. For example, discovering words, sentences, n-grams etc., from private text data belonging to users is an instance of the set union problem. Known algorithms for this problem proceed by collecting a subset of items from each user, taking the union of such subsets, and disclosing the items whose noisy counts fall above a certain threshold. Crucially, in the above process, the contribution of each individual user is always independent of the items held by other users, resulting in a wasteful aggregation process, where some item counts happen to be way above the threshold. We deviate from the above paradigm by allowing users to contribute their items in a dependent fashion, guided by a policy. In this new setting, ensuring privacy is significantly more delicate. We prove that any policy which has certain contractive properties results in a differentially private algorithm. We design two new algorithms, one using Laplace noise and the other Gaussian noise, as specific instances of policies satisfying the contractive properties. Our experiments show that the new algorithms significantly outperform previously known mechanisms for the problem.


1 Introduction

Natural language models for applications such as suggested replies for e-mails and dialog systems rely on the discovery of n-grams and sentences Hu et al. (2014); Kannan et al. (2016); Chen et al. (2019); Deb et al. (2019). Words and phrases used for training come from individuals, who may be left vulnerable if personal information is revealed. For example, a model could generate a sentence or predict a word that can potentially reveal personal information of the users in the training set Carlini et al. (2019). Therefore, algorithms that allow the public release of the words, n-grams, and sentences obtained from users' text while preserving privacy are desirable. Additional applications of this problem include the release of search queries and keys in SQL queries Korolova et al. (2009); Wilson et al. (2020). While other privacy definitions are common in practice, guaranteeing differential privacy, introduced in the seminal work of Dwork et al. (2006), ensures users the strongest preservation of privacy. In this paper we consider user-level privacy.

Definition 1.1 (Differential Privacy Dwork and Roth (2014)).

A randomized algorithm A is (ε, δ)-differentially private if for any two neighboring databases D and D′, which differ in exactly the data pertaining to a single user, and for all sets S of possible outputs:

Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D′) ∈ S] + δ.

An algorithm satisfying differential privacy (DP) guarantees that its output does not change by much if a single user is either added to or removed from the dataset. Moreover, the guarantee holds regardless of how the output of the algorithm is used downstream. Therefore, items (e.g., n-grams) produced using a DP algorithm can be used in other applications without any privacy concerns. Since its introduction a decade ago Dwork et al. (2006), differential privacy has become the de facto notion of privacy in statistical analysis and machine learning, with a vast body of research work (see Dwork and Roth (2014) and Vadhan (2017) for surveys) and growing acceptance in industry. Differential privacy is deployed in many industries, including Apple Apple (2017), Google Erlingsson et al. (2014); Bittau et al. (2017), Microsoft Ding et al. (2017), Mozilla Avent et al. (2017), and the US Census Bureau Abowd (2016); Kuo et al. (2018).

The vocabulary extraction and n-gram discovery problems mentioned above, as well as many other commonly studied problems Korolova et al. (2009); Wilson et al. (2020), can be abstracted as a set union, which leads to the following problem.

Problem 1.1 (Differentially Private Set Union (DPSU)).

Let U be some universe of items, possibly of unbounded size. Suppose we are given a database D of users, where each user i has a subset W_i ⊆ U. We want an (ε, δ)-differentially private algorithm which outputs a subset S ⊆ ∪_i W_i such that the size of S is as large as possible.

Since the universe U of items can be unbounded, as in our motivating examples, it is not clear how to apply the exponential mechanism McSherry and Talwar (2007) to DPSU. Furthermore, even in the cases when U is bounded, implementing the exponential mechanism can be very inefficient. Existing algorithms for this problem Korolova et al. (2009); Wilson et al. (2020) (which do not study DPSU exactly as defined in this paper; their goal is to output approximate counts of as many items as possible) collect a bounded number of items from each user, build a histogram of these items, and disclose the items whose noisy counts fall above a certain threshold. In these algorithms, the contribution of each user is always independent of the identity of items held by other users, resulting in a wasteful aggregation process, where some items' counts could be far above the threshold. Since the goal is to release as large a set as possible, rather than to release accurate counts of each item, there could be more efficient ways to allocate the weight to users' items.

Figure 1: Size of the set output by our proposed algorithms Policy Laplace and Policy Gaussian compared to natural generalizations of previously known algorithms, for various values of the privacy parameters ε and δ.

We deviate from the previous methods by allowing users to contribute their items in a dependent fashion, guided by an update policy. In our algorithms, proving privacy is more delicate as some update policies can result in histograms with unbounded sensitivity. We prove a meta-theorem to show that update policies with certain contractive properties would result in differentially private algorithms. The main contributions of the paper are:

  • Guided by our meta-theorems, we introduce two new algorithms called Policy Laplace and Policy Gaussian for the DPSU problem. Both of them run in linear time and only require a single pass over the users’ data.

  • Using a Reddit dataset, we demonstrate that our algorithms significantly improve the size of DP set union even when compared to natural generalizations of the existing mechanisms for this problem (see Figure 1).

Our algorithms are being productized in industry to make a basic subroutine in an NLP application differentially private.

1.1 Baseline algorithms

To understand the DPSU problem better, let us start with the simplest case we can solve by known techniques. Define Δ₀ = max_i |W_i|. Suppose Δ₀ = 1. This special case can be solved using the algorithms in Korolova et al. (2009); Wilson et al. (2020). Their algorithm works as follows: construct a histogram on ∪_i W_i (the set of items in a database D) where the count of each item is the number of sets it belongs to. Then add Laplace noise or Gaussian noise to the count of each item. Finally, release only those items whose noisy histogram counts are above a certain threshold ρ. It is not hard to prove that if the threshold is set sufficiently high, then the algorithm is (ε, δ)-DP.

A straightforward extension of the histogram algorithm for Δ₀ > 1 is to upper bound the ℓ1-sensitivity by Δ₀ (and the ℓ2-sensitivity by √Δ₀), and then add an appropriate amount of Laplace noise (or Gaussian noise) based on the sensitivity. The threshold has to be set based on Δ₀. The Laplace-noise-based algorithm was also the approach considered in Korolova et al. (2009); Wilson et al. (2020). This approach has the following drawback. Suppose a significant fraction of users have sets of size smaller than Δ₀. Then constructing a histogram based on raw counts of the items results in wastage of the sensitivity budget. A user with |W_i| < Δ₀ can increment the counts of items in W_i by any vector of nonnegative weights, as long as one can ensure that the ℓ1-sensitivity is bounded by Δ₀ (or the ℓ2-sensitivity is bounded by √Δ₀ if adding Gaussian noise). Consider the following natural generalization of the Laplace and Gaussian mechanisms to create a weighted histogram of elements. A weighted histogram over a domain U is any map H : U → ℝ≥0. For an item u ∈ U, H[u] is called the weight of u. In the rest of the paper, the term histogram should be interpreted as weighted histogram. Each user i updates the weight of each item u ∈ W_i using the rule H[u] ← H[u] + 1/|W_i| (when adding Laplace noise) or H[u] ← H[u] + 1/√|W_i| (when adding Gaussian noise). It is not hard to see that the ℓ1-sensitivity (respectively, ℓ2-sensitivity) of this weighted histogram is still 1. Adding Laplace noise (calibrated to ℓ1-sensitivity) or Gaussian noise (calibrated to ℓ2-sensitivity) to each item of the weighted histogram, and releasing only those items above an appropriately calibrated threshold, leads to a differentially private output. We call these algorithms Weighted Laplace and Weighted Gaussian; they will be used as benchmarks to compare against our new algorithms.
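As a concrete sketch of the Weighted Laplace and Weighted Gaussian baselines, the snippet below implements the uniform-split update rule and the noisy-threshold release. The function names are ours, and the noise scale and threshold in the usage example are illustrative placeholders; calibrating them to (ε, δ) is discussed later in the paper.

```python
import math
import random
from collections import defaultdict

def weighted_histogram(user_sets, delta_0, p):
    """Baseline weighted histogram: each user splits a unit l_p budget
    uniformly over (at most delta_0 of) their items."""
    hist = defaultdict(float)
    for items in user_sets:
        if not items:
            continue
        items = random.sample(sorted(items), min(len(items), delta_0))
        # weight per item: 1/|W| for p = 1, 1/sqrt(|W|) for p = 2,
        # so each user's total l_p contribution is exactly 1
        w = 1.0 / len(items) if p == 1 else 1.0 / math.sqrt(len(items))
        for u in items:
            hist[u] += w
    return hist

def release_above_threshold(hist, rho, noise):
    """Add i.i.d. noise to each weight and keep items above threshold rho."""
    return {u for u, w in hist.items() if w + noise() > rho}

def laplace(scale):
    # difference of two Exp(mean=scale) variables is Laplace(scale)
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

# Weighted Laplace baseline on a toy database (lambda and rho not calibrated)
users = [{"the", "cat"}, {"the", "dog"}, {"the"}]
hist = weighted_histogram(users, delta_0=2, p=1)
released = release_above_threshold(hist, rho=1.2, noise=lambda: laplace(1.0))
```

Note how a user with a single item contributes a full unit of weight to it, while a user with Δ₀ items spreads the same budget thinly; the policy algorithms introduced below aim to spend that budget less wastefully.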

1.2 Our techniques

The Weighted Laplace and Weighted Gaussian mechanisms described above can be thought of as trying to solve the following variant of a Knapsack problem. Here each item is a bin, and we gain a profit of 1 if the total weight of the item in the constructed weighted histogram is more than the threshold. Each user can increment the weights of elements using an update policy, which is defined as follows.

Definition 1.2 (Update policy).

An update policy for a user holding a set W ⊆ U is a map 𝒰_W from weighted histograms to weighted histograms such that 𝒰_W(H)[u] = H[u] for every u ∉ W, i.e., 𝒰_W can only update the weights of items in W. The user updates H to 𝒰_W(H). Since W is typically understood from context, we will write 𝒰 instead of 𝒰_W for simplicity.

In this framework, the main technical challenge is the following:

How to design update policies such that the sensitivity of the resulting weighted histogram is small while maximizing the number of bins that are full?

Note that bounding the ℓp-sensitivity requires that ‖𝒰(H) − H‖_p ≤ c for some constant c, i.e., each user has an ℓp-budget of c and can increase the weights of items in their set by an ℓp-distance of at most c. By scaling, WLOG we can assume that c = 1. Note that having a larger value of Δ₀ should help in filling more bins, as users have more choice in how they can use their budget to increment the weights of items.

In this paper, we consider algorithms which iteratively construct the weighted histogram. That is, in our algorithms, we consider users in a random order, and each user updates the weighted histogram using the update policy. Algorithm 1 is a meta-algorithm for DP set union, and all our subsequent algorithms follow this framework.

  Input: D: Database of users where each user i has some subset W_i ⊆ U; ρ: threshold; Noise: noise distribution (Lap(λ) or N(0, σ²))
  Output: S: A subset of ∪_i W_i
  Build weighted histogram H supported over ∪_i W_i using Algorithm 2.
  S ← ∅ (empty set)
  for u ∈ support(H) do
     Ĥ[u] ← H[u] + Noise
     if Ĥ[u] > ρ then
        S ← S ∪ {u}
     end if
  end for
  Output S
Algorithm 1 High level meta algorithm for DP Set Union
  Input: D: Database of users where each user i has some subset W_i ⊆ U; Δ₀: maximum contribution parameter; hash: A random hash function which maps user ids into some large domain without collisions; 𝒰: Update policy for a user to update the weights of items in their set
  Output: H: A weighted histogram
  H ← ∅ (empty histogram)
  Sort users into the order u_1, u_2, …, u_n by sorting the hash values of their user ids under the hash function hash
  for i = 1 to n do
     W ← the set W_{u_i} of user u_i
     if |W| > Δ₀ then
        W ← Δ₀ items chosen randomly from W
     end if
     Update H using the update policy, i.e., H ← 𝒰_W(H)
  end for
  Output H
Algorithm 2 High level meta algorithm for building weighted histogram using a given update policy
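Assuming an update policy is any function that mutates the histogram given one user's (capped) item set, the two meta-algorithms above can be sketched in Python as follows; the function and variable names are ours, not the paper's.

```python
import hashlib
import random

def build_histogram(user_sets, delta_0, update_policy):
    """Algorithm 2 (sketch): build a weighted histogram with a given update policy.
    Users are processed in the order of the hashes of their ids, so the relative
    order of the common users is the same for two neighboring databases."""
    hist = {}
    order = sorted(user_sets, key=lambda uid: hashlib.sha256(uid.encode()).hexdigest())
    for uid in order:
        items = list(user_sets[uid])
        if len(items) > delta_0:
            items = random.sample(items, delta_0)  # cap each user's contribution
        update_policy(hist, items)
    return hist

def dp_set_union(user_sets, delta_0, update_policy, rho, noise):
    """Algorithm 1 (sketch): add noise to each weight, release items above rho."""
    hist = build_histogram(user_sets, delta_0, update_policy)
    return {u for u, w in hist.items() if w + noise() > rho}
```

Plugging in the independent rule `hist[u] += 1/len(items)` recovers the Weighted Laplace baseline; the policy algorithms developed below plug in ℓ1- or ℓ2-descent instead.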

If the update policy is such that it increments the weights of items independently of other users (as done in Weighted Laplace and Weighted Gaussian), then it is not hard to see that the sensitivity of the histogram can be bounded by 1, i.e., by the budget of each user. However, if some item is already way above the threshold, then it does not make much sense to waste the limited budget on that item. Ideally, users could follow a cleverer update policy that distributes their budget among their items based on the current weights.

Note that if a policy is such that the updates of a user depend on other users, it can be quite tricky to bound the sensitivity of the resulting weighted histogram. To illustrate this, consider for example the greedy update policy. Each user can use their budget of 1 to fill the bin that is closest to the threshold among the bins corresponding to their items. If an item has already reached the threshold, the user can spend their remaining budget incrementing the weight of the next bin that is closest to the threshold, and so on. From our Knapsack problem analogy, this seems to be a good way to maximize the number of bins filled. However, such a greedy policy can have very large sensitivity (see the appendix for an example), and hence won't lead to any reasonable DP algorithm. So, the main contribution of the paper is in exhibiting policies which help maximize the number of item bins that are filled while keeping the sensitivity low. In particular, we define a general class of ℓp-contractive update policies and show that they produce weighted histograms with bounded ℓp-sensitivity.

Definition 1.3 (ℓp-contractive update policy).

We say that an update policy 𝒰 is ℓp-contractive if there exists a subset S_inv (called the invariant subset for 𝒰) of pairs of weighted histograms which are at an ℓp-distance of at most 1, i.e.,

S_inv ⊆ {(H, H′) : ‖H − H′‖_p ≤ 1},

such that the following conditions hold.

  1. (Invariance) (H, H′) ∈ S_inv ⟹ (𝒰(H), 𝒰(H′)) ∈ S_inv. (Note that property (1) is a slightly weaker requirement than the usual notion of ℓp-contractivity, which requires ‖𝒰(H) − 𝒰(H′)‖_p ≤ ‖H − H′‖_p for all H, H′. Instead we require contraction only for pairs in S_inv.)

  2. ‖𝒰(H) − H‖_p ≤ 1 for all H.

Property (2) of Definition 1.3 requires that the update policy can change the histogram by an ℓp-distance of at most 1 (the budget of a user).

Theorem 1.1 (Contractivity implies bounded sensitivity).

Suppose 𝒰 is an update policy which is ℓp-contractive over some invariant subset S_inv. Then the histogram output by Algorithm 2 has ℓp-sensitivity bounded by 1.

We prove Theorem 1.1 in Section 3. Once we have bounded ℓp-sensitivity, we can get a DP Set Union algorithm with some additional technical work.

Theorem 1.2.

(Informal: Bounded sensitivity implies DP) For p ∈ {1, 2}, if the ℓp-sensitivity of the weighted histogram output by Algorithm 2 is bounded, then Algorithm 1 for DP Set Union can be made (ε, δ)-differentially private by appropriately choosing the noise distribution (Noise) and the threshold (ρ).

The main contribution of the paper is two new algorithms guided by Theorem 1.1. The first algorithm, which we call Policy Laplace, uses an update policy that is ℓ1-contractive. The second algorithm, which we call Policy Gaussian, uses an update policy that is ℓ2-contractive. Finally, we show that our algorithms significantly outperform the weighted update policies.

At a very high level, the role of contractivity in our algorithms is indeed similar to its role in the recent elegant work of Feldman et al. (2018). They show that if an iterative algorithm is contractive in each step, then adding Gaussian noise in each iteration will lead to strong privacy amplification. In particular, users who make updates early on will enjoy much better privacy guarantees. However, their framework is not applicable in our setting, because their algorithm requires adding noise to the count of every item in every iteration; this would lead to unbounded growth of the counts, and items which belong to only a single user could also get output, which violates privacy.

2 Preliminaries

Let 𝒟 denote the collection of all databases. We say that D, D′ ∈ 𝒟 are neighboring databases, denoted by D ∼ D′, if they differ in exactly one user.

Definition 2.1.

For f : 𝒟 → ℝ^k, the ℓp-sensitivity of f is defined as Δ_p(f) = sup_{D ∼ D′} ‖f(D) − f(D′)‖_p, where the supremum is over all neighboring databases D ∼ D′.

Proposition 2.1 (The Laplace Mechanism Dwork and Roth (2014)).

Given any function f : 𝒟 → ℝ^k, the Laplace Mechanism is defined as:

M_L(D, f, ε) = f(D) + (Y_1, …, Y_k),    (1)

where Δ₁ is the ℓ1-sensitivity of f and Y_1, …, Y_k are i.i.d. random variables drawn from Lap(Δ₁/ε).
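As a concrete sketch, the Laplace Mechanism amounts to the following; the helper names are ours, and a Laplace variable is sampled as the difference of two exponentials.

```python
import random

def laplace(scale):
    """Sample Lap(scale): the difference of two Exp(mean=scale) variables."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def laplace_mechanism(fx, l1_sensitivity, eps):
    """Release f(D) + (Y_1, ..., Y_k) with Y_i ~ Lap(Delta_1 / eps)."""
    scale = l1_sensitivity / eps
    return [v + laplace(scale) for v in fx]
```

Here `fx` is the already-computed vector f(D); the mechanism itself only needs the value and its ℓ1-sensitivity.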

Proposition 2.2 (Gaussian Mechanism Balle and Wang (2018)).

Let f be a function with ℓ2-sensitivity Δ₂. For any ε ≥ 0 and δ ∈ [0, 1], the Gaussian output perturbation mechanism M(D) = f(D) + N(0, σ²·I) is (ε, δ)-DP if and only if

Φ(Δ₂/(2σ) − εσ/Δ₂) − e^ε · Φ(−Δ₂/(2σ) − εσ/Δ₂) ≤ δ.

Definition 2.2.

We say that two distributions P and Q on a domain Ω are (ε, δ)-close to each other, denoted by P ≈_{ε,δ} Q, if for every S ⊆ Ω, we have

  1. P(S) ≤ e^ε · Q(S) + δ and

  2. Q(S) ≤ e^ε · P(S) + δ.

We say that two random variables X and Y are (ε, δ)-close to each other, denoted by X ≈_{ε,δ} Y, if their distributions are (ε, δ)-close to each other.

We will need the following lemmas, which are useful to prove (ε, δ)-DP.

Lemma 2.1.

Let P and Q be probability distributions over a domain Ω. If there exists an event E ⊆ Ω s.t. P(Ē) ≤ δ₁ and Q(Ē) ≤ δ₁, and P(S ∩ E) ≤ e^ε · Q(S ∩ E) + δ₂ and Q(S ∩ E) ≤ e^ε · P(S ∩ E) + δ₂ for every S ⊆ Ω, then P ≈_{ε, δ₁+δ₂} Q.

Proof.

Fix some subset S ⊆ Ω. Then

P(S) ≤ P(S ∩ E) + P(Ē) ≤ e^ε · Q(S ∩ E) + δ₂ + δ₁ ≤ e^ε · Q(S) + δ₁ + δ₂.

We now prove the other direction. It follows by the same argument with the roles of P and Q exchanged. ∎

We will also need the fact that if X ≈_{ε,δ} Y, then after post-processing they remain (ε, δ)-close.

Lemma 2.2.

If two random variables X and Y are (ε, δ)-close and A is any randomized algorithm, then A(X) ≈_{ε,δ} A(Y).

Proof.

Let A(X) = f(X, R) for some function f, where R denotes the random bits used by A. For any subset S of the possible outputs of A,

Pr[f(X, R) ∈ S] = E_R[Pr[f(X, R) ∈ S | R]] ≤ E_R[e^ε · Pr[f(Y, R) ∈ S | R] + δ] = e^ε · Pr[f(Y, R) ∈ S] + δ,

where the inequality uses X ≈_{ε,δ} Y for each fixed value of R. The other direction holds by symmetry. ∎

3 Contractivity implies DP algorithms

In this section, we show that if an update policy satisfies the contractive property of Definition 1.3, we can use it to develop a DPSU algorithm. First we show that contractivity of the update policy implies bounded sensitivity (Theorem 1.1), which in turn implies a DPSU algorithm by Theorem 1.2. We first recall the definitions of neighboring databases and sensitivity. Let 𝒟 denote the collection of all databases. We say that D, D′ ∈ 𝒟 are neighboring databases, denoted by D ∼ D′, if they differ in exactly one user.

Definition 3.1.

For f : 𝒟 → ℝ^k, the ℓp-sensitivity of f is defined as Δ_p(f) = sup_{D ∼ D′} ‖f(D) − f(D′)‖_p, where the supremum is over all neighboring databases D ∼ D′.

Proof of Theorem 1.1.

Let 𝒰 be an ℓp-contractive update policy with invariant subset S_inv. Consider two neighboring databases D and D′, where D′ has one extra user compared to D. Let H and H′ denote the histograms built by Algorithm 2 using the update policy 𝒰 when the databases are D and D′ respectively.

Say the extra user in D′ has position i in the global ordering given by the hash function. Let H_i and H′_i be the histograms after the first i − 1 (according to the global order given by the hash function hash) users' data is added to the histogram. Therefore H_i = H′_i. And the new user updates H′_i to 𝒰(H′_i). By property (2) in Definition 1.3 of an ℓp-contractive policy, ‖𝒰(H′_i) − H′_i‖_p ≤ 1. Since H_i = H′_i, we have (H_i, 𝒰(H′_i)) ∈ S_inv. The remaining users are now added to both histograms in the same order. Note that we are using the fact that the users are sorted according to some hash function and that they contribute in that order (this is also needed to claim that H_i = H′_i). Therefore, by property (1) in Definition 1.3 of an ℓp-contractive policy, we get (H, H′) ∈ S_inv. Since S_inv only contains pairs with ℓp-distance at most 1, we have ‖H − H′‖_p ≤ 1. Therefore the histogram built by Algorithm 2 using 𝒰 has ℓp-sensitivity at most 1. ∎

The above theorem implies that once we have a contractive update policy, we can appeal to Theorem 1.2 to design an algorithm for DPSU.

4 Policy Laplace algorithm

In this section we will present an ℓ1-contractive update policy called ℓ1-descent (Algorithm 3) and use it to obtain a DP Set Union algorithm called Policy Laplace (Algorithm 4).

4.1 ℓ1-descent update policy

The policy is described in Algorithm 3. We will set some cutoff Γ above the threshold ρ to use in the update policy. Once the weight of an item (H[u]) crosses the cutoff, we do not want to increase it further. In this policy, each user starts with a budget of 1. The user uniformly increases H[u] for each u ∈ W s.t. H[u] < Γ. Once some item's weight reaches Γ, the user stops increasing that item and keeps increasing the rest of the items until the budget of 1 is expended. To implement this efficiently, the items from each user are sorted based on their distance to the cutoff. Beginning with the item whose weight is closest to the cutoff (but still below it), say item u₁, we add the gap to the cutoff for item u₁ to each of the items below the cutoff. This repeats until the user's budget of 1 has been expended.

This policy can also be interpreted as gradient descent to minimize the ℓ1-distance between the current weighted histogram and the point (Γ, …, Γ), hence the name ℓ1-descent. Since the gradient vector is 1 in coordinates where the weight is below the cutoff and 0 in coordinates where the weight is at the cutoff, the ℓ1-descent policy is moving in the direction of the gradient until it has moved a total ℓ1-distance of at most 1.

  Input: H: Current histogram; W: A subset of U of size at most Δ₀; Γ: cutoff parameter
  Output: H: Updated histogram
  // Build gap dictionary
  gap ← {} // Empty dictionary
  for u ∈ W do
     if H[u] < Γ then
        // Gap to cutoff for items below cutoff
        gap[u] ← Γ − H[u]
     end if
  end for
  budget ← 1 // Each user gets a total budget of 1
  k ← |gap| // Number of items still under cutoff
  // Sort the items in increasing order of the gap to the cutoff
  // Let u_1, u_2, …, u_k be the sorted order
  for i = 1 to k do
     // Cost of increasing the weights of all remaining items by gap[u_i]
     cost ← gap[u_i] · (k − i + 1)
     if cost ≤ budget then
        for j = i to k do
           H[u_j] ← H[u_j] + gap[u_i]
           // Gap to cutoff is reduced by gap[u_i]
           gap[u_j] ← gap[u_j] − gap[u_i]
        end for
        budget ← budget − cost
        // u_i has reached the cutoff, so the number of items under the cutoff decreases by 1
     else
        for j = i to k do
           // Update item weights by as much as the remaining budget allows
           H[u_j] ← H[u_j] + budget/(k − i + 1)
        end for
        break
     end if
  end for
Algorithm 3 ℓ1-descent update policy
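Our reading of the ℓ1-descent policy can be transcribed into Python as follows; this is a sketch with our own variable names, and floating-point ties are ignored.

```python
def l1_descent_update(hist, items, gamma):
    """l1-descent policy: spend a unit l1 budget raising the weights of the
    user's below-cutoff items uniformly, freezing each item once it reaches
    the cutoff gamma."""
    gaps = {u: gamma - hist.get(u, 0.0) for u in items if hist.get(u, 0.0) < gamma}
    budget = 1.0
    order = sorted(gaps, key=gaps.get)            # items closest to the cutoff first
    for i, u in enumerate(order):
        remaining = order[i:]
        cost = gaps[u] * len(remaining)           # raise all remaining items by gaps[u]
        if cost <= budget:
            step = gaps[u]
            for v in remaining:
                hist[v] = hist.get(v, 0.0) + step
                gaps[v] -= step                   # u itself reaches the cutoff
            budget -= cost
        else:
            step = budget / len(remaining)        # spread what is left uniformly
            for v in remaining:
                hist[v] = hist.get(v, 0.0) + step
            break
    return hist
```

For example, a user holding an item at weight 9.9 with cutoff 10 first pushes it the remaining 0.1 (paying 0.1 per below-cutoff item), then spends the leftover budget on their other items.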

4.2 Policy Laplace

The Policy Laplace algorithm (Algorithm 4) for DPSU uses the framework of the meta algorithm in Algorithm 1 with the update policy in Algorithm 3. Since the added noise is Lap(λ), which is centered at 0, we want to set the cutoff Γ in the update policy to be sufficiently above the threshold ρ. Thus we pick Γ = ρ + αλ for some α > 0, with α tuned empirically in our experiments. The parameters are set so as to achieve (ε, δ)-DP, as shown in Theorem 4.1.

  Input: D: Database of users where each user i has some subset W_i ⊆ U; Δ₀: maximum contribution parameter; ε, δ: privacy parameters; α: parameter for setting cutoff
  Output: S: A subset of ∪_i W_i
  λ ← 1/ε // Noise parameter in Lap(λ)
  ρ ← 1 + λ ln(1/(2δ)) // Threshold parameter
  Γ ← ρ + αλ // Cutoff parameter for update policy
  Run Algorithm 1 with Noise = Lap(λ), threshold ρ, and the ℓ1-descent update policy in Algorithm 3 to output S.
Algorithm 4 Policy Laplace algorithm for DPSU
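Using the Laplace tail bound Pr[Lap(λ) > t] = ½·e^{−t/λ}, one plausible calibration of the Policy Laplace parameters is sketched below. The constants are our own derivation of the idea (an item of weight at most 1, held only by a single user, should survive thresholding with probability at most δ); the exact constants are those proved in Theorem 4.1.

```python
import math

def policy_laplace_params(eps, delta, alpha):
    """Sketch of a calibration for Policy Laplace (constants are our derivation,
    not necessarily the paper's): Lap(1/eps) noise on a histogram of
    l1-sensitivity 1, and a threshold rho high enough that an item of weight
    at most 1 is released with probability at most delta."""
    lam = 1.0 / eps                                   # noise scale
    rho = 1.0 + lam * math.log(1.0 / (2.0 * delta))   # Pr[1 + Lap(lam) > rho] = delta
    gamma = rho + alpha * lam                         # cutoff: alpha noise scales above rho
    return lam, rho, gamma
```

Note how ρ grows only logarithmically in 1/δ, so modestly small δ still leaves most of the cutoff region reachable by the update policy.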

4.3 Privacy analysis of Policy Laplace

In this section, we will prove that the Policy Laplace algorithm (Algorithm 4) is (ε, δ)-DP. By Theorem 1.2 and Theorem 1.1, it is enough to show that the ℓ1-descent policy (Algorithm 3) is ℓ1-contractive. For two histograms H, H′, we write H ≤ H′ if H[u] ≤ H′[u] for every item u. H ≥ H′ is defined similarly.

Lemma 4.1.

Let S_inv = {(H, H′) : H ≤ H′ and ‖H′ − H‖₁ ≤ 1}. Then the ℓ1-descent update policy is ℓ1-contractive over the invariant subset S_inv.

Proof.

Let 𝒰 denote the ℓ1-descent update policy.

We will first show property (2) of Definition 1.3. Let H be any weighted histogram. Clearly H ≤ 𝒰(H), as the new user will never decrease the weight of any item. Moreover, the total change to the histogram is at most the user's budget of 1 in ℓ1-distance. Therefore ‖𝒰(H) − H‖₁ ≤ 1.

We will now prove property (1) of Definition 1.3. Let (H, H′) ∈ S_inv, i.e., H ≤ H′ and ‖H′ − H‖₁ ≤ 1. A new user can increase H and H′ by at most 1 in ℓ1-distance. Let Γ be the cutoff parameter in Algorithm 3. Let W be the set of items of the new user; therefore only the items in W will change in H and H′. WLOG, we can assume that the user changes both H and H′ by exactly a total ℓ1-distance of 1. Otherwise, in at least one of them, all the items in W reach the cutoff Γ. If this happens only with H′, then clearly 𝒰(H)[u] ≤ Γ = 𝒰(H′)[u] for all below-cutoff u ∈ W. And it is easy to see that if this happens with H, then it should also happen with H′, in which case 𝒰(H)[u] = 𝒰(H′)[u] = Γ for the below-cutoff u ∈ W.

Imagine that at time t = 0, the user starts pushing mass continuously at a rate of 1 into both H and H′, until the entire mass of 1 is sent, which happens at time t = 1. The mass flow is equally split among all the items which haven't yet crossed the cutoff. Let H_t and H′_t be the histograms at time t, so that H_1 = 𝒰(H) and H′_1 = 𝒰(H′). We claim that H ≤ H′ implies H_t ≤ H′_t for all t. This is because the flow is split equally among items which didn't cross the cutoff, and there can only be more items in H_t which didn't cross the cutoff when compared to H′_t; so whenever H_t[u] = H′_t[u] < Γ, the flow into u under H_t is no larger than under H′_t. And at time t = 1, we have H_1 ≤ H′_1. Therefore, 𝒰(H)[u] ≤ 𝒰(H′)[u] for all u, and so 𝒰(H) ≤ 𝒰(H′).

We will now prove ℓ1-contraction. By the discussion above, the total mass sent to H′ is at most the total mass sent to H (either the total mass flow equals 1 for both, or all items in W reach the cutoff in H′ before this happens in H). Therefore

‖𝒰(H′) − 𝒰(H)‖₁ = Σ_u (𝒰(H′)[u] − 𝒰(H)[u])  (since 𝒰(H) ≤ 𝒰(H′))
= Σ_u (H′[u] − H[u]) + (mass sent to H′) − (mass sent to H)
≤ Σ_u (H′[u] − H[u])  (since the mass sent to H′ is at most the mass sent to H)
= ‖H′ − H‖₁  (since H ≤ H′).

Therefore ‖𝒰(H′) − 𝒰(H)‖₁ ≤ ‖H′ − H‖₁ ≤ 1 and 𝒰(H) ≤ 𝒰(H′), which proves property (1) of Definition 1.3. ∎

Lemma 4.2.

Suppose D and D′ are neighboring databases where D′ has one extra user compared to D. Let H and H′ denote the histograms built by the Policy Laplace algorithm (Algorithm 4) when the database is D and D′ respectively. Then H ≤ H′ and ‖H′ − H‖₁ ≤ 1.

Proof.

Say the extra user in D′ has position i in the global ordering given by the hash function. Let H_i and H′_i be the histograms after the first i − 1 (according to the global order) users' data is added to the histogram. Therefore H_i = H′_i. And the new user updates H′_i to 𝒰(H′_i). Since the total change by a user is at most 1 in ℓ1-distance and the update never decreases weights, (H_i, 𝒰(H′_i)) ∈ S_inv. The remaining users are now added to both histograms in the same order. Note that we are using the fact that the users are sorted according to some hash function and that they contribute in that order (this is also needed to claim that H_i = H′_i). Therefore, by the ℓ1-contraction property shown in Lemma 4.1, (H, H′) ∈ S_inv, i.e., H ≤ H′ and ‖H′ − H‖₁ ≤ 1. ∎

We now state a formal theorem which proves the privacy guarantee of the Policy Laplace algorithm.

Theorem 4.1.

The Policy Laplace algorithm (Algorithm 4) is (ε, δ)-DP when ρ ≥ 1 + λ ln(1/(2δ)).

Proof.

Suppose D and D′ are neighboring databases where D′ has one extra user compared to D. Let P and P′ denote the distribution of the output of the algorithm when the database is D and D′ respectively. We want to show that P ≈_{ε,δ} P′. Let E be the event that no item belonging only to the extra user appears in the output.

Claim 4.1. For every set S of possible outputs, P(S ∩ E) ≤ e^ε · P′(S ∩ E) and P′(S ∩ E) ≤ e^ε · P(S ∩ E).

Proof.

Let H and H′ be the histograms generated by the algorithm from databases D and D′ respectively, and let Ĥ and Ĥ′ be the histograms obtained by adding Lap(λ) noise to each entry of H and H′ respectively. For any possible output of Algorithm 4 in which no item unique to the extra user appears, membership in the output is determined by thresholding the noisy weights of the common items.

So, on the event E, P is obtained by post-processing Ĥ and P′ is obtained by post-processing Ĥ′, both restricted to the common support. Since post-processing only makes two distributions closer (Lemma 2.2), it is enough to show that the distributions of Ĥ and Ĥ′ are (ε, 0)-close to each other. By Lemma 4.2, H and H′ differ in ℓ1-distance by at most 1. Therefore Ĥ ≈_{ε,0} Ĥ′ by the properties of the Laplace mechanism (see Theorem 3.6 in Dwork and Roth (2014)). ∎

By Lemma 2.1 (applied with δ₁ = δ and δ₂ = 0), it is enough to show that Pr[Ē] ≤ δ. Let u be an item that belongs only to the extra user, so that H′[u] ≤ 1. For such an item,

Pr[H′[u] + Lap(λ) > ρ] ≤ (1/2) · e^{−(ρ−1)/λ}.    (2)

Thus for

ρ ≥ 1 + λ ln(1/(2δ)),

we have Pr[Ē] ≤ δ. Therefore the Policy Laplace algorithm (Algorithm 4) is (ε, δ)-DP. ∎

5 Policy Gaussian algorithm

In this section we will present an ℓ2-contractive update policy called ℓ2-descent (Algorithm 5) and use it to obtain a DP Set Union algorithm called Policy Gaussian (Algorithm 6).

5.1 ℓ2-descent update policy

Similar to the Laplace update policy, we will set some cutoff Γ above the threshold ρ, and once an item's count (H[u]) crosses the cutoff, we don't want to increase it further. In this policy, each user starts with a budget of 1. But now, the total change a user can make to the histogram can be at most 1 when measured in the ℓ2-norm (whereas in the Laplace update policy we used the ℓ1-norm to measure change). In other words, the sum of the squares of the changes that the user makes is at most 1. Since we want to get as close to the cutoff point (Γ, …, Γ) as possible, the user moves the counts vector (restricted to the set of items the user has) in the direction of the point (Γ, …, Γ) by an ℓ2-distance of at most 1. This update policy is presented in Algorithm 5.

This policy can also be interpreted as gradient descent to minimize the ℓ2-distance between the current weighted histogram and the point (Γ, …, Γ), hence the name ℓ2-descent. Since the gradient vector is in the direction of the line joining the current point and (Γ, …, Γ), the ℓ2-descent policy is moving the current histogram towards (Γ, …, Γ) by an ℓ2-distance of at most 1.

  Input: H: Current histogram; W: A subset of U of size at most Δ₀; Γ: cutoff parameter
  Output: H: Updated histogram
  v ← {} // Empty dictionary
  for u ∈ W do
     // v is the vector joining H (restricted to W) to the point (Γ, …, Γ)
     v[u] ← max(Γ − H[u], 0)
  end for
  // ℓ2-distance between H and (Γ, …, Γ)
  d ← sqrt(Σ_{u ∈ W} v[u]²)
  // If d ≤ 1, then the user moves H to the cutoff point. Else, move in the direction of v by an ℓ2-distance of 1
  if d ≤ 1 then
     for u ∈ W do
        H[u] ← H[u] + v[u]
     end for
  else
     for u ∈ W do
        H[u] ← H[u] + v[u]/d
     end for
  end if
Algorithm 5 ℓ2-descent update policy
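A Python transcription of our reading of the ℓ2-descent policy follows; treating items already at or above the cutoff as contributing zero to the direction vector is our assumption, and the names are ours.

```python
import math

def l2_descent_update(hist, items, gamma):
    """l2-descent policy: move the user's coordinates of the histogram toward
    the all-gamma point by an l2-distance of at most 1 (never past it)."""
    # direction vector from the current weights to the cutoff point,
    # ignoring items already at or above the cutoff
    v = {u: max(gamma - hist.get(u, 0.0), 0.0) for u in items}
    dist = math.sqrt(sum(x * x for x in v.values()))
    if dist == 0.0:
        return hist                               # everything already at the cutoff
    step = 1.0 if dist <= 1.0 else 1.0 / dist     # full move if within unit distance
    for u in items:
        hist[u] = hist.get(u, 0.0) + step * v[u]
    return hist
```

Unlike ℓ1-descent, all of a user's below-cutoff items move simultaneously, each in proportion to its remaining gap.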

5.2 Policy Gaussian

The Policy Gaussian algorithm (Algorithm 6) for DPSU uses the framework of the meta algorithm in Algorithm 1 with the Gaussian update policy (Algorithm 5). Since the added noise is N(0, σ²), which is centered at 0, we want to set the cutoff Γ in the update policy to be sufficiently above (but not too high above) the threshold ρ. Thus we pick Γ = ρ + ασ for some α > 0, with α tuned empirically in our experiments. The parameters are set so as to achieve (ε, δ)-DP, as shown in Theorem 5.1.

Φ is the cumulative distribution function of the standard Gaussian distribution and Φ⁻¹ is its inverse.

  Input: D: Database of users where each user i has some subset W_i ⊆ U; Δ₀: maximum contribution parameter; ε, δ: privacy parameters; α: parameter for setting cutoff
  Output: S: A subset of ∪_i W_i
  σ ← smallest σ satisfying Φ(1/(2σ) − εσ) − e^ε·Φ(−1/(2σ) − εσ) ≤ δ/2 // Standard deviation of the Gaussian noise (Proposition 2.2 with Δ₂ = 1)
  ρ ← 1 + σ·Φ⁻¹(1 − δ/2) // Threshold parameter
  Γ ← ρ + ασ // Cutoff parameter for update policy
  Run Algorithm 1 with Noise = N(0, σ²), threshold ρ, and the ℓ2-descent update policy in Algorithm 5 to output S.
Algorithm 6 Policy Gaussian algorithm for DPSU

To find σ, one can use binary search, because the left-hand side of the condition in Proposition 2.2 is a decreasing function of σ. An efficient and robust implementation of this binary search can be found in Balle and Wang (2018).
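A sketch of this binary search, instantiating Proposition 2.2 with Δ₂ = 1: `gaussian_delta` below is the Balle–Wang characterization, while the bracketing constants and iteration count are ours (their paper gives a numerically robust version).

```python
import math

def phi(x):
    """Standard Gaussian CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gaussian_delta(sigma, eps, sens=1.0):
    """Exact delta achieved by N(0, sigma^2) noise at privacy level eps,
    for a function of l2-sensitivity sens (Balle and Wang, 2018)."""
    a = sens / (2.0 * sigma)
    b = eps * sigma / sens
    return phi(a - b) - math.exp(eps) * phi(-a - b)

def calibrate_sigma(eps, delta, lo=1e-6, hi=1e6, iters=200):
    """Binary search for the smallest sigma with gaussian_delta <= delta,
    valid because gaussian_delta is decreasing in sigma."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if gaussian_delta(mid, eps) > delta:
            lo = mid
        else:
            hi = mid
    return hi
```

This yields a noticeably smaller σ than the classical sqrt(2·ln(1.25/δ))/ε calibration, which is why the analytic condition is worth the extra search.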

5.3 Privacy analysis of Policy Gaussian

In this section we will prove that the Policy Gaussian algorithm (Algorithm 6) is (ε, δ)-DP. By Theorem 1.2 and Theorem 1.1, it is enough to show ℓ2-contractivity of the ℓ2-descent update policy. We will need a simple plane geometry lemma for this.

Lemma 5.1.

Let A, B, C denote the vertices of a triangle in the Euclidean plane. If |AC| ≥ 1, let A′ be the point on the side AC which is at a distance of 1 from A, and if |AC| < 1, define A′ = C. B′ is defined similarly on the side BC. Then |A′B′| ≤ |AB|.

Proof of Lemma 5.1.
Figure 2: Geometric illustration of Lemma 5.1 when |AC|, |BC| ≥ 1. The lemma implies that |A′B′| ≤ |AB|.

Let us first assume that both |AC|, |BC| ≥ 1. Let θ be the angle at C, and let a = |A′C| = |AC| − 1 and b = |B′C| = |BC| − 1, as shown in Figure 2. Then by the cosine formula,

|AB|² − |A′B′|² = ((a+1)² + (b+1)² − 2(a+1)(b+1)·cos θ) − (a² + b² − 2ab·cos θ) = 2(a + b + 1)(1 − cos θ) ≥ 0    (since cos θ ≤ 1),

and hence |A′B′| ≤ |AB|.
Figure 3: Geometric explanation of Lemma 5.1 when |BC| < 1.

If |AC| < 1 as well, then A′ = B′ = C and the claim is trivially true. Suppose |AC| ≥ 1 > |BC|, so that B′ = C. Now |A′B′| = |A′C| = |AC| − 1. Let θ be the angle at C, as shown in Figure 3. Then by the cosine formula,

|AB|² = |AC|² + |BC|² − 2|AC||BC|·cos θ ≥ (|AC| − |BC|)²    (since cos θ ≤ 1),

so |AB| ≥ |AC| − |BC| ≥ |AC| − 1 = |A′B′|.

By symmetry, the claim is also true when |AC| < 1 ≤ |BC|. ∎

Lemma 5.2.

Let S_inv = {(H, H′) : ‖H′ − H‖₂ ≤ 1}. Then the ℓ2-descent update policy is ℓ2-contractive over the invariant set S_inv.

Proof.

Let 𝒰 denote the ℓ2-descent policy. Property (2) of Definition 1.3 follows easily, because each new user can only change a weighted histogram by an ℓ2-distance of at most 1.

We will now prove Property (1) of Definition 1.3. Let (H, H′) ∈ S_inv, i.e., ‖H′ − H‖₂ ≤ 1. Let