Differential privacy (DP) (CynthiaFKA06) provides a rigorous formulation of privacy and has been applied to many algorithmic and learning tasks that involve access to private and sensitive information. Notable applications include private data release (hardt2010simple), learning histograms (CynthiaFKA06), statistical estimation (DiakonikolasHS15; KamathLSU18; acharya2020differentially; KamathSU20; AcharyaCT19; AcharyaSZ18a), and machine learning (chaudhuri2011differentially; bassily2014private; mcmahan2017learning; DworkTTZ14; abadi2016deep).
In real applications, each user may contribute many data samples to a dataset. For example, one may have multiple health records in a hospital, or may type many words on their phone’s virtual keyboard. Naturally, users would hope that all of their private data is protected. To achieve this, private algorithms should guarantee user-level differential privacy.
However, most existing works assume that each user only contributes one data sample. Thus, an algorithm designed under this assumption can only be used to protect the privacy of each data sample but not the user. In other words, such algorithms achieve item-level privacy, but they cannot protect privacy at the user level and may not meet the increasing privacy concerns in most applications where users may contribute a lot of data. Therefore, there has been a growing interest in revisiting differential privacy in the user-level setting.
In this work, we study the problem of estimating histograms and discrete distributions under user-level privacy. Histogram estimation is a fundamental problem that arises in many real-world applications, such as the analysis of demographic data and user preferences. For example, chen2019federated used histogram estimation to compute unigram language models via federated learning (mcmahan17fedavg; kairouz2019advances). Beyond machine learning, federated analytics uses histogram estimation to support the Now Playing feature on Google’s Pixel phones, a tool that shows users what song is playing in the room around them (ramage2020fablog). There is abundant literature on histogram estimation in the context of item-level privacy, such as (hay2009boosting; Suresh2019differentially; xu2013differentially). Little is known, however, under user-level privacy.
Some other works (durfee2019practical; pmlr-v124-silva-carvalho20a; joseph2021joint) study the top-k selection problem under user-level differential privacy. However, these works differ from ours in several ways. First, our goal is to estimate the entire histogram, whereas these works focus on estimating only the most frequent elements. Further, they assume a bounded and known or norm on each user’s histogram. It is unclear how these algorithms can be applied when the bounded and known norm assumption is not satisfied.
User-level privacy is much more stringent than the item-level counterpart. Hence, many challenges arise even in the simple problem of estimating histograms. One challenge arises from the heterogeneity of histogram sizes. Since the number of samples each user contributes may potentially be unbounded, the sensitivity of the algorithm may be prohibitively large, which requires adding a large amount of noise. If we restrict user contribution to achieve low sensitivity, we may lose a large amount of useful data and suffer from bias. This dilemma is known as the bias-variance trade-off, which has been discussed in amin2019bounding in the context of empirical risk minimization.
In addition, users’ data may come from diverse distributions, especially in scenarios such as federated learning and analytics (kairouz2019advances). Therefore, we cannot leverage techniques from statistical estimation problems which heavily rely on the i.i.d. assumption (Liu2020learning). Algorithms designed specifically for the i.i.d. case may perform poorly on datasets that do not meet the i.i.d. assumption. If each user has just one item, the non-i.i.d. problem can be circumvented by equivalently assuming that each user’s data comes from some common mixture distribution. With multiple samples per user, it can be difficult to find a simple reduction to the i.i.d. case. Motivated by the need for algorithms that work well without any data assumptions, we ask the following question:
Can we design algorithms for histogram estimation with theoretical guarantees without making any assumptions on user data?
Somewhat surprisingly, despite many recent works in the area, this question has not been answered. We take a step towards answering it by designing algorithms that perform well in the heterogeneous setting where both the number of samples and the data distribution can be unknown and different across users (hence both need to be viewed as private information). We first design an algorithm without any distribution assumptions on users’ data. We then investigate whether we can improve the performance of our algorithm by drawing conclusions from settings where assumptions on the user data distributions are made. Our contributions are as follows.
Estimation with optimal clipping. We design an algorithm that estimates the aggregate histogram using an adaptive clipping strategy. Our algorithm provides a 2-approximation of the best clipping threshold in hindsight, taking into account the private estimation of the chosen clipping threshold.
Bias reduction. For estimating counts, we design a debiasing algorithm that significantly reduces the clipping bias. Assuming the user counts come from independent Poisson distributions with different means, we provide a complete proof of the gap between the clipped-and-debiased and the clipped-only algorithms. Although designed with a distribution assumption, the debiasing method performs well on both real and synthetic datasets, even when the underlying data generating mechanism is not a Poisson distribution.
The paper is organized as follows. In Section 2, we discuss related work. In Section 3, we introduce the definitions and problem formulation. In Section 4, we introduce our algorithms for the most general case with no distribution assumptions. In Section 5, we introduce our debiasing algorithm and its guarantees. In Section 6, we show the experimental results. In Section 7, we conclude with a discussion about how to extend our methods to the unknown-domain and federated settings.
2 Related work
Given its importance, user-level privacy has been studied by several works in the last decade. One of the primary motivations for user-level privacy is federated learning, where the goal is to learn a model at the server while keeping the raw data on edge devices such as cell phones (mcmahan17fedavg; kairouz2019advances). Ensuring privacy at the user level is a crucial concern in federated learning. Even though users do not send their original data, various works (phong2017privacy; wang2019beyond) have shown it is still possible to reconstruct users’ data if additional privacy-preserving mechanisms are not used. Therefore, user-level privacy has been studied for various machine learning tasks in the federated learning setup (mcmahan2017learning; mcmahan2018general; augenstein2019generative). Indeed, understanding the fundamental privacy-utility trade-offs under user-level privacy is one of the main challenges in federated learning (kairouz2019advances, Section 4.3.2).
Several works studied fundamental theoretical problems in user-level private learning. ghazi2021user studied PAC learnability under user-level privacy. levy2021learning studied high-dimensional distribution estimation and optimization and designed efficient algorithms. Both works require i.i.d. data and assume a fixed number of samples across users. There are several recent works closely related to user-level private histogram estimation. amin2019bounding studied the inherent bias and variance trade-off in bounding user contributions under user-level privacy for empirical risk minimization. Their analysis applies to estimating the total count of one symbol in the aggregate histogram. We extend their work to the setting of symbols.
esfandiari2021tight studied robust high dimensional mean estimation under user-level privacy, assuming i.i.d. data and a fixed number of samples across users. While their algorithm is robust to at most 49% of the samples being arbitrarily adversarial, their result still requires that the remaining samples are independent and identically distributed. cummings2021mean studied mean estimation of Bernoulli random variables, which can be viewed as a special case of histogram estimation when the domain size is 2. They allow different distributions and numbers of samples for different users. However, the number of samples of each user is known and fixed in advance. wilson2019differentially proposed differentially private SQL with bounded user contributions.
Several works (durfee2019practical; pmlr-v124-silva-carvalho20a) studied top-k selection or frequency estimation in a large alphabet. However, they require bounded or norm on each user’s histogram, and do not explore how that can be (optimally or near optimally) achieved when the requirement is not satisfied. Hence, their algorithms cannot be applied to the general case when the bound is not known beforehand.
3 Problem formulation
3.1 Definitions and notations
Differential privacy (DP) is studied in the central and local settings (CynthiaFKA06; kasiviswanathan2011can; duchi2013local). In this paper, we study the problem under the lens of central differential privacy, where the goal is to ensure the algorithm’s outcomes do not reveal too much information about any user’s data. We now define differential privacy, starting with the basic definition of neighboring datasets. Here, we assume the number of users is known and fixed and hence we use the replacement notion of neighboring datasets.
Let represent a dataset of users. Each consists of samples . Let be another dataset. We say and are neighboring (or adjacent) datasets if for some ,
Definition 3.2 (Differential privacy).
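A randomized mechanism $M$ with range $\mathcal{R}$ satisfies $(\varepsilon, \delta)$-differential privacy if for any two adjacent datasets $D, D'$ and for any subset of outputs $S \subseteq \mathcal{R}$, it holds that
$$\Pr[M(D) \in S] \le e^{\varepsilon} \Pr[M(D') \in S] + \delta.$$
If $\delta = 0$, then the privacy is also referred to as pure DP, and for simplicity we say that the algorithm satisfies $\varepsilon$-DP. If $\delta > 0$, we refer to it as approximate DP.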
Let denote the norm. For a vector and , define the clipping function,
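As an illustration of the clipping function, the following is a minimal sketch that assumes the common rescaling form, in which a vector is left unchanged when its norm is at most the threshold and is otherwise scaled down to have norm exactly equal to the threshold. The names `lp_norm`, `clip`, and `tau` are hypothetical, and only the ℓ1 and ℓ2 norms are handled.

```python
import math

def lp_norm(x, p):
    """Return the l_p norm of a sequence x for p = 1 or p = 2."""
    if p == 1:
        return sum(abs(v) for v in x)
    if p == 2:
        return math.sqrt(sum(v * v for v in x))
    raise ValueError("only p = 1 and p = 2 are handled in this sketch")

def clip(x, tau, p=2):
    """Rescale x so that its l_p norm is at most tau (assumed form of clipping)."""
    norm = lp_norm(x, p)
    if norm <= tau:
        return list(x)          # already within the threshold
    scale = tau / norm          # shrink factor in (0, 1)
    return [v * scale for v in x]

h = [3.0, 4.0]                  # toy user histogram with l_2 norm 5
clipped = clip(h, 2.5)          # rescaled to l_2 norm 2.5
```

Rescaling preserves the direction of the histogram, so the relative frequencies of a clipped user are kept while the overall contribution is bounded.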
3.2 Histogram estimation
We consider the following problem. There are users and user has a histogram over a discrete domain of size . Without loss of generality, we can assume the domain to be . Let be the size of histogram . The dataset is the collection of users’ histograms. The goal is to estimate some query subject to user-level differential privacy. Neighboring datasets differ by a single user (i.e., one histogram).
Estimation with optimal clipping.
We first consider the problem of estimating the population-level histogram, i.e., the sum of the histograms
We make no assumptions about the distribution and size of each user’s histogram. Given an differentially private algorithm whose output histogram is , we characterize its performance by the expected distance between the algorithm output and the true population-level histogram
We prove that the bias from clipping can be significantly reduced when ’s satisfy some mild distribution assumptions and show that the debiasing method provides improvements even on real datasets where these assumptions may not necessarily hold. Consider the special case of , which we refer to as count estimation. Let be a family of distributions over . For each user , is drawn independently from some distribution in with mean . ’s can be arbitrary and do not need to be equal.
In addition to the absolute error of counts , we also want to characterize the accuracy for estimating the mean . Let be an estimate of . We are interested in the expected square error
where the expectation is over the randomness of the algorithm and the dataset.
In this work we set to be the family of Poisson distributions since they arise in many applications. For example, they can be used to model the occurrences of a memoryless event in a fixed time window. Also, they are good approximations of the binomial distribution when is a constant (le1960approximation), and can be very useful when estimating the count of one element in a histogram over a very large domain (e.g. the count of a particular word).
4 Estimation with optimal clipping
We first consider the problem of estimating the population-level histogram. A standard strategy is to clip each user contribution either in or norm and add the corresponding amount of noise. For completeness, the details are shown in Algorithm 1.
There is a bias-variance trade-off in choosing the clipping threshold . If is small, then the noise magnitude (or variance) is small, but the clipped histogram would have large error (or bias). On the other hand, if is large, the clipped histogram would be more accurate (less bias), but the added noise would be large (high variance).
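The trade-off can be made concrete with a small sketch of a clip-and-noise estimator. This is not Algorithm 1 itself: it assumes ℓ1 clipping with Laplace noise, and it uses the fact that, under the replacement notion of neighboring datasets, the sum of histograms clipped to ℓ1 norm `tau` has ℓ1 sensitivity at most `2 * tau`. All function names and the toy data are hypothetical.

```python
import random

def clip_l1(h, tau):
    """Rescale histogram h so that its l_1 norm is at most tau."""
    norm = sum(abs(v) for v in h)
    if norm <= tau:
        return list(h)
    return [v * tau / norm for v in h]

def laplace(rng, scale):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def clipped_sum(histograms, tau):
    """Sum of the clipped user histograms (no noise)."""
    total = [0.0] * len(histograms[0])
    for h in histograms:
        for j, v in enumerate(clip_l1(h, tau)):
            total[j] += v
    return total

def private_histogram(histograms, tau, eps, rng):
    """Clipped sum plus Laplace noise calibrated to l_1 sensitivity 2 * tau."""
    scale = 2.0 * tau / eps
    return [t + laplace(rng, scale) for t in clipped_sum(histograms, tau)]

def clipping_bias(histograms, tau):
    """l_1 distance between the true sum and the clipped sum."""
    true_sum = [sum(col) for col in zip(*histograms)]
    return sum(abs(a - b) for a, b in zip(true_sum, clipped_sum(histograms, tau)))

users = [[1, 0, 2], [0, 5, 5], [40, 0, 0]]   # one heavy user dominates
estimate = private_histogram(users, tau=3.0, eps=1.0, rng=random.Random(0))
```

On this toy input the clipping bias shrinks as `tau` grows and vanishes once `tau` reaches the largest user norm, while the Laplace scale `2 * tau / eps` grows linearly in `tau`.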
The easiest choice of is to choose a fixed value, or some fixed quantile of the or norms of all histograms (which can be estimated privately). However, amin2019bounding showed that even in the 1-d case, such a strategy cannot perform well in general. We provide an accurate characterization of the best threshold that balances the bias and variance. For the Laplace estimator, the proof is similar to that of amin2019bounding and is omitted.
Let . Choosing as the top element in yields a 2-approximation
where the expectation is over the Laplace mechanism.
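To see why an order statistic is a natural threshold in the 1-d case, consider the surrogate bound "clipping bias plus expected noise magnitude", which is a convex piecewise-linear function of the threshold. The sketch below assumes the noise term grows as `noise_per_unit * tau`, where `noise_per_unit` is a hypothetical constant (for example, 2/ε for Laplace noise under replacement neighbors); the exact constant in the lemma may differ. Setting the subgradient to zero shows that the `ceil(noise_per_unit)`-th largest count minimizes the bound.

```python
import math

def error_bound(counts, tau, noise_per_unit):
    """Clipping bias plus an assumed linear noise term."""
    bias = sum(max(x - tau, 0.0) for x in counts)
    return bias + noise_per_unit * tau

def order_statistic_threshold(counts, noise_per_unit):
    """Return the ceil(noise_per_unit)-th largest count, a minimizer of the bound."""
    if noise_per_unit <= 0:
        raise ValueError("noise_per_unit must be positive")
    k = math.ceil(noise_per_unit)
    if k > len(counts):
        return 0.0              # noise dominates: clip everything to zero
    return float(sorted(counts, reverse=True)[k - 1])

counts = [1, 2, 2, 3, 5, 8, 40, 100]
noise_per_unit = 2.0            # e.g. 2 / eps with eps = 1 (assumption)
tau = order_statistic_threshold(counts, noise_per_unit)   # second largest count, 40.0
grid_best = min(error_bound(counts, t, noise_per_unit) for t in range(0, 101))
```

The slope of the bound at `tau` equals `noise_per_unit` minus the number of counts strictly above `tau`, so the bound stops decreasing exactly when at most `noise_per_unit` users remain above the threshold.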
Let . Let and . Choosing such that
Next we discuss how to find privately using an additional privacy budget of , and further provide complete guarantees in terms of excess error compared to . We emphasize that it only requires a very small extra privacy budget compared to the original to achieve good performance, as we will later show in the experiments. We focus on -DP because it is more practical, especially when is large.
Note that computing the optimal in a differentially private way is possible though difficult because the sensitivity of can be very large for some datasets. However, observe that by the Cauchy-Schwarz inequality,
Hence, if each user’s histogram has very few non-zero entries, then the sensitivity would be low. We observe this to be the case in practice. To illustrate, we plot the number of distinct symbols each user has for the Sentiment140 (sentiment140) and SNAP datasets in Figure 1. Observe that most users have fewer than samples, even though is large. Hence, we assume that each user’s histogram is at most sparse. Under this assumption, the sensitivity is upper bounded by .
We first note that can be written as a minimizer to a convex function,
where . Hence we can use techniques from differentially private convex optimization algorithms. We consider two such algorithms and provide their corresponding guarantees.
Choosing with DP-SGD
We first consider the DP-SGD algorithm (bassily2014private, Algorithm 1) to estimate by minimizing . Using bassily2014private, we have the following guarantee.
Let be an upper bound on and let be the output of bassily2014private. Assume that . Then, is upper bounded by
Choosing with output perturbation
We consider the second algorithm based on output perturbation (chaudhuri2011differentially), which ensures -DP and is good for small and . Here, we solve a regularized convex optimization problem and perturb the output to provide differential privacy. The algorithm is outlined in Algorithm 2.
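The idea of output perturbation can be sketched on a simplified 1-d stand-in for the threshold objective. This is not Algorithm 2: the objective, the constant `noise_per_unit`, the regularization weight `lam`, and the public bound `tau_max` are all assumptions made for illustration. In this stand-in, replacing one user changes the objective by a 1-Lipschitz function of the threshold, and the objective is `lam`-strongly convex, so the minimizer moves by at most `1 / lam`; Laplace noise with scale `1 / (lam * eps)` then gives pure ε-DP, up to the numerical accuracy of the solver.

```python
import random

def regularized_objective(counts, tau, noise_per_unit, lam):
    """Clipping bias + linear noise term + (lam / 2) * tau^2 regularizer."""
    bias = sum(max(x - tau, 0.0) for x in counts)
    return bias + noise_per_unit * tau + 0.5 * lam * tau * tau

def minimize_1d(f, lo, hi, iters=200):
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) <= f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def regularized_minimizer(counts, noise_per_unit, lam, tau_max):
    """Non-private minimizer over the data-independent range [0, tau_max]."""
    f = lambda t: regularized_objective(counts, t, noise_per_unit, lam)
    return minimize_1d(f, 0.0, tau_max)

def private_threshold(counts, noise_per_unit, lam, eps, tau_max, rng):
    """Perturb the minimizer with Laplace noise of scale (1 / lam) / eps."""
    tau_star = regularized_minimizer(counts, noise_per_unit, lam, tau_max)
    scale = 1.0 / (lam * eps)   # sensitivity 1 / lam divided by eps
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return min(max(tau_star + noise, 0.0), tau_max)   # post-processing only

counts = [1, 2, 2, 3, 5, 8, 40, 100]
neighbor = [1, 2, 2, 3, 5, 8, 40, 0]    # one user's count replaced
lam = 0.1
t1 = regularized_minimizer(counts, 2.0, lam, 200.0)
t2 = regularized_minimizer(neighbor, 2.0, lam, 200.0)
tau_private = private_threshold(counts, 2.0, lam, eps=1.0, tau_max=200.0,
                                rng=random.Random(0))
```

A larger `lam` lowers the sensitivity and hence the noise, but pulls the minimizer further from the unregularized optimum, which is the trade-off that the choice of regularization parameter has to balance.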
With appropriate parameters, the combined algorithm almost achieves a 2-approximation with respect to the best clipping threshold.
Algorithm 2 is differentially private. If is an upper bound on , setting yields an error
Comparing DP-SGD and Algorithm 2, we can see that DP-SGD has a better asymptotic dependence on , and Algorithm 2 has a better dependence on . Furthermore, DP-SGD provides an approximate DP guarantee and Algorithm 2 gives a pure DP guarantee. Finally, the time complexity of DP-SGD is typically ; however, it has recently been improved to with similar guarantees.
5 Bias reduction
In this section, we ask if the estimate obtained by Algorithm 1 can be improved by making assumptions on the histogram generating distribution. We prove that the bias from clipping can be significantly reduced when the user counts come from non-i.i.d. Poisson distributions and in Section 6, we show that the debiasing method provides improvements even when the counts do not come from Poisson distributions.
Our debiasing step is a post-processing procedure on the output of Algorithm 1 to reduce the clipping bias. Since this is a postprocessing step, it does not affect the differential privacy guarantees. In the rest of the section, we assume . Towards the end of the section, we discuss two ways of extending it to high dimensions and overview the pros and cons of both methods.
A general algorithm for estimating the count of a single item is given in Algorithm 3, which involves a preprocessing step on each user’s count denoted by and a postprocessing step on the differentially private output, denoted by . If we choose and , we recover the clipping algorithm.
We ask the following question: is it possible to find a post-processing function that gives better performance than the clipping algorithm even if the data generating distributions are distinct? It is not hard to see that if all users come from the same distribution i.e., for all , then such an algorithm would help. Our main contribution is to show that one can design an such that even if the users’ data come from different distributions, it can be debiased and the performance can be improved.
We first note that in many datasets, most symbols appear very few times. For example, in the Sentiment140 dataset, which contains counts for a total of roughly words distributed across users, the average counts of all words among the users are no more than two. Therefore we analyze the debiasing step when the ’s are small and prove the following result for our debiasing algorithm given in Algorithm 4.
Suppose . Let , and , where is the output of Algorithm 4. If , then
where . If we further assume that , then
The error consists of three terms. The first term is the error due to added noise, which is proportional to the clipping threshold . The second term is essentially the variance of the random variable , which is an inherent error due to the randomness in the counts ’s. The third term is a bias term which depends on the closeness of user distributions, characterized by , the variance of ’s. If the users’ distributions are similar, then we can expect the estimation error to be small. Note that the bias term has a rate. The detailed proof is provided in Appendix A.3. We provide a bound with a better dependence on in Appendix A.4.
Next we analyze the optimal choice of threshold after the debiasing step. Observe that to minimize the error upper bound in (1), roughly we want to choose
Therefore, when users’ distributions are similar, we can use a smaller clipping threshold. This implies that as long as
we can find a that ensures a squared error of , which matches the error for the i.i.d. case.
From (2), when for any user , we can choose
This choice of always guarantees error since when for all , .
In practice, we can also privately choose as the top count as suggested by Lemma 4.1.
So far we have characterized the effect of debiasing under heterogeneous data. We next show that under mild assumptions debiasing helps even if the data is not i.i.d. The formal result is stated in Theorem 5.2. The proof is in Appendix A.5.
This implies that for any fixed , under the assumptions stated in the theorem, with sufficiently large, there is always a constant gap between the two algorithms and debiasing helps even if the data is not i.i.d. This result justifies the choice of as the optimal quantile suggested by Lemma 4.1.
We argue that the assumption of is not too restrictive. It essentially requires that either ’s are sufficiently similar, or is sufficiently larger than . Indeed, if for all users , then ; if is sufficiently large, then is almost linear near and hence is close to .
As a specific example, set . If all , due to concavity of , we have . With some arithmetic, , and the first term in (3) is at least 0.0217. Note that this is the difference between squared errors. The gap between absolute errors could well be of order 0.1, which is significant considering that . This example shows that Algorithm 4 can achieve significant improvement even when the variance of s is constant.
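The role of concavity in this example can be checked numerically. The sketch below assumes that the debiasing map inverts the clipped Poisson mean, i.e. the function that sends a Poisson mean `lam` to E[min(X, tau)] for X ~ Poisson(`lam`) with an integer threshold `tau`; this is an illustrative reading and not necessarily the exact form of Algorithm 4. The names `clipped_poisson_mean`, `debias`, and `lam_max` are hypothetical, and noise is omitted so that only the bias is visible.

```python
import math

def clipped_poisson_mean(lam, tau):
    """E[min(X, tau)] for X ~ Poisson(lam), computed as sum_{k < tau} P(X > k)."""
    pmf = math.exp(-lam)        # P(X = 0)
    cdf = 0.0
    mean = 0.0
    for k in range(tau):
        cdf += pmf              # P(X <= k)
        mean += 1.0 - cdf       # P(X > k)
        pmf *= lam / (k + 1)    # P(X = k + 1)
    return mean

def debias(avg_clipped, tau, lam_max=50.0, iters=100):
    """Invert the increasing map lam -> E[min(X, tau)] by bisection on [0, lam_max]."""
    if avg_clipped <= 0.0:
        return 0.0
    if avg_clipped >= clipped_poisson_mean(lam_max, tau):
        return lam_max
    lo, hi = 0.0, lam_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if clipped_poisson_mean(mid, tau) < avg_clipped:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

tau = 2
lams = [0.5, 1.0, 2.0]                       # heterogeneous user means
true_mean = sum(lams) / len(lams)
clipped_avg = sum(clipped_poisson_mean(l, tau) for l in lams) / len(lams)
debiased = debias(clipped_avg, tau)
```

Since the clipped mean is concave and never exceeds `lam`, the noise-free debiased value lies between the clipped average and the true average of the means: debiasing moves the estimate in the right direction even for distinct user distributions, and the remaining gap is the Jensen gap caused by heterogeneity.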
Extension to . We now discuss two possible extensions to the general .
1. A natural extension to the entire histogram is to apply Algorithm 4 to each symbol in the histogram separately. To ensure differential privacy, we assign each symbol a privacy budget of by strong composition (kairouz2017composition). The main disadvantage is that when is large, clipping each coordinate separately may perform poorly compared to clipping the or norm of the entire histogram.
2. We can generalize Algorithm 4 to by replacing 1-d clipping with the high dimensional clipping functions as defined in Algorithm 1. Then choose a suitable function that essentially inverts the expectation of the clipped vector . However finding such inverse may be difficult in high dimensions as it likely involves non-convex optimization.
6 Experiments
We run experiments with two datasets: Sentiment140 (sentiment140), a Twitter dataset that contains user tweets, and SNAP (cho2011friendship), a social network dataset that contains the location information of check-ins by users. For Sentiment140, we parse each user’s tweets into words, and treat each word as an element. For SNAP, each element is a location, and each user has check-ins to multiple locations in the dataset.
In all experiments, the privacy budget for estimating is , and the budget for Algorithm 1 is . For DP-SGD with sparsity assumptions, we set (the choice is arbitrary, i.e., not a function of the underlying datasets, and has not been tuned) and clip each to when estimating . This introduces bias when the assumption is not satisfied for some users. However, if the percentage of such users is small, this effect can be negligible.
6.1 Estimation with optimal clipping
To conduct experiments on real life datasets, we calculate the top elements in the datasets, and only run experiments on these elements. We measure the relative loss of , defined as:
We evaluate different algorithms for estimating the clipping threshold for the Gaussian mechanism given in Algorithm 1. We compare the performance of the following methods: : the non-private clipping threshold given in Theorem 4.2. DP--quantile: inspired by the 2-approximation quantile in (amin2019bounding), we set to be the largest value of , where is given in Theorem 4.2. We estimate it by gradient descent with differential privacy, e.g. andrew2021differentially. : estimation of with the DP-SGD algorithm (Corollary 4.3). and : estimation of with output perturbation (Algorithm 2).
In Figure 2, we show the comparison of these threshold estimation algorithms with different choices of in . The results with both datasets are similar, but SNAP has much higher errors, possibly because the location information in SNAP is further from i.i.d. than the words in Sentiment140. Setting to DP--quantile is inspired by the 2-approximation quantile in (amin2019bounding) but has no theoretical support, and its errors are relatively high. For Algorithm 2, we run experiments with , and a fixed (see the Supplementary materials). Of all the algorithms, has similar performance to the true without differential privacy.
6.2 Bias reduction
In this section, we run experiments for Algorithm 3 with both synthetic datasets and words in Sentiment140. For the synthetic part, we generate users with from a Dirichlet distribution with parameter . Larger means that the s are closer. In this experiment, we set to the privately estimated top count among users, as discussed in amin2019bounding. Figure 4 shows that the debiased output of Algorithm 4 greatly reduces the error compared to the original output of Algorithm 1, especially for large (meaning the dataset is closer to i.i.d.).
| word | avg loss ± std | avg loss ± std |
7 Conclusion and discussion
We study the problem of estimating a population-level histogram under user-level privacy with heterogeneous data. We extend previous works by relaxing i.i.d. assumptions on the user data distribution and allowing for a different number of samples per user. We propose algorithms based on adaptive clipping without any distribution assumptions. We further show that the bias of clipping can be reduced under the special case of count estimation. Future directions include: (a) extending the bias reduction algorithm to high dimensional histograms and proving theoretical lower bounds, (b) designing similar clipping and debiasing strategies for histogram estimation over unknown domains, and (c) applying these methods to the federated setting. With respect to (c), we note that Algorithm 2 can be federated by having users only send the and norms of their local histograms to the server across rounds. This allows the server to run the output perturbation algorithm and (privately) learn the near-optimal clipping threshold. Combining this approach with secure aggregation (bonawitz2017practical) and a distributed DP (kairouz2021distributed; agarwal2021skellam) guarantee is left for future work.
Appendix A Detailed proofs of lemmas and theorems
A.1 Proof of Theorem 4.2
Recall that . We can upper bound the error as follows.
Equation 4 is convex with respect to . To optimize the upper bound on the error, we will take the sub-derivative with respect to and set it to zero. This gives us the following equation
Roughly we want to choose that satisfies the above equality. The precise value of is
minimizes the right hand side of (4), and it also makes the expected loss at most twice the optimal loss with this algorithm. Formally, suppose is the -norm that minimizes . Let and be the coordinate of , then:
This shows that yields a 2-approximation. ∎
A.2 Proof of Corollary 4.4
To estimate privately, one can use the output perturbation algorithm. For ease of analysis we consider the regularized problem. More precisely, let
Note that is -Lipschitz where . The goal is to minimize the following function
Let . We first compute the sensitivity of as a function of the dataset.
We first compute the sensitivity of . Consider a pair of neighboring datasets and which only differ by the th user.
Let be a histogram and defined similarly as with replaced by . Let and . Let and . Then,
Observe that is Lipschitz. Let .
Since is -strongly convex, we have
Combining the two parts,
Let be an upper bound on . Then
A.3 Proof of Theorem 5.1
Let . Then, when ,
Let , then we have . We first bound the error for estimating ,