Releasing seemingly innocuous functions of a data set can easily compromise the privacy of individuals, whether the functions are simple counts [52, 29]. To protect against such leaks, Dwork et al. proposed the notion of differential privacy (DP). Given some data from $n$ participants in a study, we say that a statistic of the data is differentially private if an attacker who already knows the data of $n-1$ participants cannot reliably determine from the statistic whether the $n$-th remaining participant is Alice or Bob. With the recent explosion of publicly available data, progress in machine learning and data science, and widespread public release of machine learning models and other statistical inferences, differential privacy has become an important standard and is widely adopted by both industry and government [32, 7, 20, 55].
The standard setting of DP assumes that each participant contributes a single data point to the dataset, and preserves privacy by “noising” the output in a way that is commensurate with the maximum contribution of a single example. This is not the situation faced in many applications of machine learning models, where users often contribute multiple samples to the model, for example when language and image recognition models are trained on the users’ own data, or in federated learning settings. As a result, current techniques either provide privacy guarantees that degrade with a user’s increased participation or naively add a substantial amount of noise, relying on the group property of differential privacy, which significantly harms the performance of the deployed model.
To remedy this issue, we consider user-level differential privacy. In order to make this setting precise, let us consider differential privacy in the most general way, which only requires specifying a dataset space $\mathcal{S}$ and a distance $\mathsf{d}$ on $\mathcal{S}$.
Definition 1 (Differential Privacy).
Let $\varepsilon, \delta \ge 0$. Let $\mathsf{A} : \mathcal{S} \to \Theta$ be a (potentially randomized) mechanism. We say that $\mathsf{A}$ is $(\varepsilon, \delta)$-DP with respect to $\mathsf{d}$ if for any measurable subset $O \subset \Theta$ and all $S, S' \in \mathcal{S}$ satisfying $\mathsf{d}(S, S') \le 1$,
$$\mathbb{P}(\mathsf{A}(S) \in O) \le e^{\varepsilon}\, \mathbb{P}(\mathsf{A}(S') \in O) + \delta.$$
If $\delta = 0$, we refer to this guarantee as pure differential privacy.
For a data space $\mathcal{Z}$, choosing $\mathcal{S} = \mathcal{Z}^n$ and $\mathsf{d}(S, S') = \sum_{i=1}^n \mathbb{1}\{z_i \neq z_i'\}$ recovers the canonical setting considered in most of the literature; we refer to this as item-level differential privacy. When we wish to guarantee privacy for users rather than individual samples, we instead assume a structured dataset into which each of $n$ users contributes $m \ge 1$ samples. This corresponds to $\mathcal{S} = (\mathcal{Z}^m)^n$ such that for $S = (S_1, \dots, S_n)$ and $S' = (S_1', \dots, S_n')$ in $\mathcal{S}$, we have
$$\mathsf{d}_{\mathrm{user}}(S, S') := \sum_{u=1}^n \mathbb{1}\{S_u \neq S_u'\},$$
which means that, in this setting, two datasets are neighboring if the contributions of at most one user differ. We henceforth refer to this setting as user-level differential privacy. Very recently, for the reasons outlined above, there has been increasing interest in user-level DP for applications such as estimating discrete distributions under user-level privacy constraints and bounding user contributions in ML models [6, 25]. Differentially private SQL with bounded user contributions has also been proposed. Since the stronger user-level privacy guarantee is desirable in practice, it has also been studied in the context of learning models via federated learning [49, 48, 58, 8].
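To make the two neighboring relations concrete, the following minimal sketch counts how far apart two datasets are at the item level and at the user level. The function names `item_distance` and `user_distance` are illustrative and not from the paper, and each user's contribution is assumed to be an ordered tuple of samples.

```python
from typing import Hashable, Sequence


def item_distance(s: Sequence[Hashable], s_prime: Sequence[Hashable]) -> int:
    """Item-level distance: number of positions where the samples differ."""
    if len(s) != len(s_prime):
        raise ValueError("datasets must contain the same number of samples")
    return sum(z != z_p for z, z_p in zip(s, s_prime))


def user_distance(S: Sequence[Sequence[Hashable]],
                  S_prime: Sequence[Sequence[Hashable]]) -> int:
    """User-level distance: number of users whose contribution differs at all."""
    if len(S) != len(S_prime):
        raise ValueError("datasets must contain the same number of users")
    return sum(tuple(s_u) != tuple(s_u_p) for s_u, s_u_p in zip(S, S_prime))


# Two users with three samples each; only the second user's data changes.
S = [(1, 2, 3), (4, 5, 6)]
S_prime = [(1, 2, 3), (7, 8, 9)]
flat = [z for s_u in S for z in s_u]
flat_prime = [z for s_u in S_prime for z in s_u]

print(user_distance(S, S_prime))        # 1: neighboring at the user level
print(item_distance(flat, flat_prime))  # 3: not neighboring at the item level
```

Replacing one user's whole contribution is a single step at the user level but up to $m$ steps at the item level, which is why a naive reduction through group privacy pays a factor that grows with $m$.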
In this paper, we tackle the problem of learning with (central) user-level privacy. In particular, we provide algorithms and analyses for the tasks of mean estimation, empirical risk minimization (ERM), stochastic convex optimization (SCO), and learning hypothesis classes with finite metric entropy. Our utility analyses assume that all users draw their samples i.i.d. from the same distribution, while our privacy guarantees hold for arbitrary data. We develop novel private mean estimators in high dimension with statistical and privacy error scaling with the (arbitrary) concentration radius rather than the range, and apply these to the statistical query setting [SQ; 41]. Our algorithms then rely on (privately) answering a sequence of adaptively chosen queries using users’ samples. Importantly, we show that for these tasks, the additional error due to privacy constraints decreases as $O(1/\sqrt{m})$, contrasting with the naive rate, independent of $m$, obtained using the group property of item-level DP. Interestingly, increasing $n$, the number of users, decreases the privacy cost at a faster rate.
To provide intuition into our techniques for user-level DP, consider the vignette of private mean estimation in one dimension. In this setting, each user $u \in [n]$ contributes a set of $m$ i.i.d. values $Z^{(u)}_1, \dots, Z^{(u)}_m$. If we summarize each user by their sample average $\bar{Z}^{(u)} = \frac{1}{m} \sum_{j=1}^m Z^{(u)}_j$, then we reduce the problem to item-level DP since each user now contributes a single datum. When the data distribution has favorable concentration properties (say, it is $\sigma^2$-sub-Gaussian), these averages are roughly concentrated in an interval of size $\sigma/\sqrt{m}$. In other words, the user-level sensitivity effectively decreases by a factor of $\sqrt{m}$ in comparison to a naive application of group privacy in the item-level setting. While the argument is straightforward for i.i.d. data, the same samples are almost always reused when answering adaptively chosen queries. This makes the data non-i.i.d. and breaks concentration. We circumvent these issues by considering families of queries and distributions with uniform concentration properties. We provide algorithms answering adaptively chosen queries which leverage these properties and apply them to solve a range of learning tasks.
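The vignette can be sketched in a few lines of Python. This is an illustrative simplification rather than the paper's estimator: it assumes the clipping interval (`center`, `radius`) is chosen independently of the data, whereas the paper localizes the interval privately. The names `laplace` and `user_level_mean` are hypothetical.

```python
import math
import random


def laplace(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def user_level_mean(users, center, radius, eps, rng):
    """eps-DP (user-level) mean estimate from per-user averages.

    Each user is summarized by the average of their samples, clipped to
    [center - radius, center + radius]. Replacing one user then moves the
    mean of the clipped averages by at most 2 * radius / n.
    Assumption: center and radius do not depend on the data.
    """
    n = len(users)
    lo, hi = center - radius, center + radius
    clipped = [min(hi, max(lo, sum(u) / len(u))) for u in users]
    sensitivity = (hi - lo) / n
    return sum(clipped) / n + laplace(sensitivity / eps, rng)


rng = random.Random(0)
n, m, sigma = 200, 100, 1.0
users = [[rng.gauss(1.0, sigma) for _ in range(m)] for _ in range(n)]

# User averages concentrate at scale sigma / sqrt(m) = 0.1, so a radius of 0.5
# around a rough guess of 0.8 clips almost nothing.
estimate = user_level_mean(users, center=0.8, radius=0.5, eps=1.0, rng=rng)
print(abs(estimate - 1.0) < 0.1)
```

The noise scale is proportional to `radius`, which can shrink like $\sigma/\sqrt{m}$ as users contribute more samples, instead of being proportional to the full range of a single sample.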
1.1 Our Contributions and Related Work
We provide a theoretical tool to construct estimators for tasks with user-level privacy constraints and apply it to a range of learning problems. We preview our results briefly below.
Private mean estimation and uniformly concentrated queries (Section 3)
In this section, we present our main technical tool. We show that for a random variable concentrated in an unknown interval of radius $\tau$ (made precise in Definition 2), we can privately estimate its mean with statistical and privacy error proportional to $\tau$ rather than the whole range $B$. Indeed, standard private mean estimation techniques add Laplace or Gaussian noise proportional to the worst-case range $B$ of the random variable. However, for well-concentrated distributions, it is often the case that $\tau \ll B$, resulting in needlessly large noise. When data is concentrated in $\ell_\infty$-norm, several papers show that one can achieve an error scaling with $\tau$ rather than $B$, either asymptotically, for Gaussian mean estimation [40, 38], for sub-Gaussian symmetric distributions [17, 16], or for distributions with bounded $k$-th moment. We propose a private mean estimator with error scaling with $\tau$ that works in arbitrary dimension when data is concentrated in $\ell_2$-norm and without additional assumptions. We present the estimator in Algorithm 3 and the analysis in Theorem 2. In Corollary 1, we apply this estimator to the task of mean estimation under user-level privacy constraints for random vectors bounded in norm, and achieve the optimal risk (up to logarithmic factors). Building on these results, we show in the statistical query framework that for uniformly concentrated queries (see Definition 3) it is possible to privately answer a sequence of adaptively chosen queries with privacy cost proportional to the concentration radius (see Corollary 3). An important feature of our algorithm is that, with high probability, it returns the sample mean of the query with exogenous noise. This makes the analysis significantly easier when considering stochastic gradient algorithms.
Our conclusions relate to the growing literature in adaptive data analysis; however, our results have a different aim and as such are hard to compare. A sequence of papers [24, 11, 26, 27] focus on adaptivity and seek to answer queries accurately—without privacy constraints—and achieve an error for each query scaling as . While these algorithms use techniques from differential privacy and their answers are -DP with , our work focuses more on the privacy cost for general with the additional assumption of uniform concentration.
Empirical risk minimization (Section 4)
An influential line of papers studies ERM under item-level privacy constraints [18, 42, 10]. Importantly, these papers assume arbitrary data, i.e., not necessarily i.i.d. samples from a data distribution. The exact analog of ERM in the user-level setting is consequently less interesting as, for $n$ data points $z_1, \dots, z_n$, in the worst case, each user $u$ contributes $m$ copies of $z_u$ and the problem reduces to the item-level setting. We thus consider the (related) problem of ERM when observing data points sampled i.i.d. from a distribution. Assuming some regularity (see Assumptions A1 and A2), the stochastic gradients are uniformly concentrated, and the results in Section 3 apply. Casting stochastic gradient methods as a sequence of adaptive queries, we establish convergence rates for ERM under user-level DP constraints for convex, strongly convex, and non-convex losses. We show that the cost of privacy decreases as $O(1/\sqrt{m})$ for the convex and non-convex cases and $O(1/m)$ for the strongly convex case as user contribution increases. We detail these results in Theorem 4.
Stochastic convex optimization (Section 5)
Under item-level DP (or equivalently, user-level DP with $m = 1$), a sequence of work [18, 10, 12, 13, 28] establishes the constrained minimax risk as $\tilde{\Theta}\big(GR\,(1/\sqrt{n} + \sqrt{d}/(n\varepsilon))\big)$ when the loss is $G$-Lipschitz and the parameter space has diameter less than $R$. In this paper, with the additional assumptions that the losses are individually smooth and the stochastic gradients are $\sigma^2$-sub-Gaussian, we prove an upper bound and a lower bound on the population risk. (We note that the smoothness requirement is mild: in the appropriate limit, keeping all other problem parameters fixed, it is a very weak assumption. More precisely, in that regime, by considering a smoothed version of the loss, for example using the Moreau envelope, our algorithm applied to the smoothed loss yields optimal rates for non-smooth losses.) We present precise statements in Theorems 5 and 6. Depending on the parameter regime, the privacy rates or the statistical rates match (in both cases up to logarithmic factors). We leave closing the gap outside of this regime to future work.
Function classes with finite metric entropy under pure DP (Section 6)
Our previous results only hold for approximate user-level DP. Turning to pure DP, we consider function classes with bounded range and finite metric entropy. We provide an estimator combining our mean estimation techniques with the private selection mechanism of . For a finite class of size , we achieve an excess risk of . We further prove a lower bound of . For SCO with -Lipschitz gradients and -bounded domain, a covering number argument implies an excess risk of under pure user-level DP constraints. While the statistical rate is optimal, we observe a gap of order for the privacy cost compared to the lower bound, which we leave as future work.
Limit of learning with a fixed number of users (Appendix A)
Finally, we resolve a conjecture from prior work and prove that with a fixed number of users, even in the limit $m \to \infty$ (i.e., each user has an infinite number of samples), we cannot reach zero error. In particular, we prove that for all the learning tasks we consider, the risk under user-level privacy constraints is bounded away from zero regardless of $m$. Note that this does not contradict the results above since they require a sufficiently large number of users $n$ to hold.
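Throughout this work, $d$ denotes the dimension, $n$ the number of users, and $m$ the number of samples per user. Generically, $\sigma$ will denote the sub-Gaussian parameter, $\tau$ the concentration radius and $\nu^2$ the variance of a random vector. We denote the optimization variable with $\theta \in \Theta \subset \mathbb{R}^d$ and use $z$ (or $Z$ when random) to denote the data sample supported on a space $\mathcal{Z}$. We denote by $P$ the data distribution and by $\ell : \Theta \times \mathcal{Z} \to \mathbb{R}$ the loss function. Gradients (denoted $\nabla$) are always taken with respect to the optimization variable $\theta$. For a convex set $\mathcal{C}$, $\Pi_{\mathcal{C}}$ denotes the Euclidean projection on $\mathcal{C}$, i.e., $\Pi_{\mathcal{C}}(y) := \operatorname{argmin}_{x \in \mathcal{C}} \|y - x\|_2$. We use $\mathsf{A}$ to refer to (possibly random) private mechanisms and $X^n$ as a shorthand for the dataset $(X_1, \dots, X_n)$.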
2.1 ERM and stochastic convex optimization
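Throughout this work, we assume that the parameter space $\Theta$ is closed, convex, and satisfies $\|\theta - \theta'\|_2 \le R$ for all $\theta, \theta' \in \Theta$. We also assume that the loss $\ell$ is $G$-Lipschitz, meaning that for all $z \in \mathcal{Z}$, for all $\theta, \theta' \in \Theta$, $|\ell(\theta; z) - \ell(\theta'; z)| \le G \|\theta - \theta'\|_2$. (It is straightforward to develop analogs of the results of Sections 3 and 4 for arbitrary norms, but we restrict our attention to the $\ell_2$ norm in this work for clarity.) We further consider the following assumptions.
Assumption A1. The function $\ell(\cdot; z)$ is $H$-smooth. In other words, the gradient $\nabla \ell(\theta; z)$ is $H$-Lipschitz in the variable $\theta$ for all $z \in \mathcal{Z}$.
Assumption A2. The random vector $\nabla \ell(\theta; Z)$ is $\sigma^2$-sub-Gaussian for all $\theta \in \Theta$ and $Z \sim P$. Equivalently, for all $v \in \mathbb{R}^d$, $\langle v, \nabla \ell(\theta; Z) \rangle$ is a $\sigma^2 \|v\|_2^2$-sub-Gaussian random variable, i.e.,
$$\mathbb{E}\Big[\exp\big(\langle v, \nabla \ell(\theta; Z) - \mathbb{E}[\nabla \ell(\theta; Z)] \rangle\big)\Big] \le \exp\Big(\frac{\|v\|_2^2 \sigma^2}{2}\Big).$$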
In this work, our rates often depend on the sub-Gaussianity and Lipschitz parameters and , and thus we define the shorthands and . Intuitively, the -Lipschitzness assumption bounds the gradient in a ball around (independently of ), while sub-Gaussianity implies that, for each , likely lies in . Generically, there is no ordering between and : consider a linear loss , depending on the problem, it can hold that (e.g., for ), (e.g., is truncated in a ball around , with ) or (e.g., ).
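We introduce the objectives that we consider in this work, namely empirical risk minimization (ERM) and stochastic convex optimization (SCO). For a collection of samples from $n$ users $S = (S_1, \dots, S_n)$, where each $S_u = (z^{(u)}_1, \dots, z^{(u)}_m) \in \mathcal{Z}^m$, we define the empirical objectives
$$L(\theta; S_u) := \frac{1}{m} \sum_{j=1}^m \ell(\theta; z^{(u)}_j) \quad \text{and} \quad L(\theta; S) := \frac{1}{n} \sum_{u=1}^n L(\theta; S_u). \tag{2}$$
In the user-level setting we wish to minimize $L(\theta; S)$ under user-level privacy constraints. Going beyond the empirical risk, we also solve SCO, i.e., minimizing a population convex objective when provided with i.i.d. samples. In the user-level setting, for a convex loss $\ell$ and a convex constraint set $\Theta$, we observe $S_1, \dots, S_n \overset{\mathrm{iid}}{\sim} P^m$ and wish to
$$\operatorname*{minimize}_{\theta \in \Theta} \; L(\theta; P) := \mathbb{E}_{Z \sim P}[\ell(\theta; Z)]. \tag{3}$$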
2.2 Uniform concentration of queries
We define concentration of a single query and uniform concentration of multiple queries as follows.
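A (random) sample $X^n = (X_1, \dots, X_n)$ supported on $[-B, B]^d$ is $(\tau, \gamma)$-concentrated (and we call $\tau$ the “concentration radius”) if there exists $x_0 \in [-B, B]^d$ such that with probability at least $1 - \gamma$,
$$\max_{i \in [n]} \|X_i - x_0\|_\infty \le \tau.$$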
Definition 3 (Uniform concentration of vector queries).
Let be a family of queries with bounded range. For , we say that is -uniformly-concentrated if with probability at least , we have
3 High Dimensional Mean Estimation and Uniformly Concentrated Queries
In this section, we study the cost of answering adaptively chosen queries privately when the class of queries exhibits the uniform concentration properties of Definition 3. In Section 3.1, we present a private mean estimator with privacy cost proportional to the concentration radius. In Section 3.2, we show that, under uniform concentration, we can also answer adaptively chosen queries with privacy cost proportional to the concentration radius instead of the whole range. Our theorems guarantee that the estimator is close, up to an exponentially small error, to a simple unbiased estimator with small noise. This formulation makes the analysis much simpler when applying the technique to learning problems. Furthermore, we show that these results translate readily into upper bounds on the estimator error, which we demonstrate by providing tight bounds on privately estimating the mean of bounded random vectors under user-level DP constraints (see Corollary 1).
3.1 Winsorized mean estimator: answering a single query privately
We consider the simple task of mean estimation, which is exactly equivalent to answering the single statistical query $\phi(z) = z$. We first focus on the scalar case ($d = 1$) and present our estimator in Algorithm 2. The algorithm uses a two-stage procedure, similar in spirit to earlier two-stage private estimators. In the first stage of this procedure, we use approximate median estimation, detailed in Algorithm 1, to privately estimate a crude interval in which the means lie. The second stage clips the mean around this interval, reducing the sensitivity from order $B$ to order $\tau$, and adds the appropriate Laplace noise. With high probability, we can recover the guarantee of the Laplace mechanism with smaller sensitivity since the samples are concentrated in a radius $\tau$. We present the formal guarantees of Algorithm 2 in Theorem 1 and defer its proof to Appendix B.1.
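The two-stage idea can be sketched as follows. This is not the paper's Algorithm 1 or 2: as a stand-in for private median estimation, the first stage below uses a Laplace-noised histogram with bins of width $2\tau$ (an assumption made for illustration), and the privacy budget is split evenly between the two stages. The names `winsorized_mean` and `laplace` are hypothetical.

```python
import math
import random


def laplace(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def winsorized_mean(xs, B, tau, eps, rng):
    """Two-stage eps-DP mean of scalars in [-B, B] concentrated at radius tau.

    Stage 1 (eps / 2): noisy histogram with bins of width 2 * tau. Replacing
    one point changes two counts by one each, so the L1 sensitivity is 2.
    Stage 2 (eps / 2): clip to within 3 * tau of the winning bin's midpoint
    and add Laplace noise scaled to the clipped range.
    """
    n = len(xs)
    width = 2.0 * tau
    k = max(1, math.ceil(2.0 * B / width))
    counts = [0] * k
    for x in xs:
        idx = min(k - 1, max(0, int((x + B) // width)))
        counts[idx] += 1
    noisy = [c + laplace(2.0 / (eps / 2.0), rng) for c in counts]
    best = max(range(k), key=noisy.__getitem__)
    center = -B + (best + 0.5) * width

    # If all points lie within tau of a common point, they lie within 3 * tau
    # of the midpoint of the most populated bin, so clipping changes nothing.
    lo, hi = center - 3.0 * tau, center + 3.0 * tau
    clipped = [min(hi, max(lo, x)) for x in xs]
    sensitivity = (hi - lo) / n
    return sum(clipped) / n + laplace(sensitivity / (eps / 2.0), rng)


rng = random.Random(0)
xs = [41.5 + 1.5 * rng.random() for _ in range(500)]  # range 1000, radius 1
estimate = winsorized_mean(xs, B=1000.0, tau=1.0, eps=1.0, rng=rng)
print(abs(estimate - sum(xs) / len(xs)) < 0.5)
```

In the second stage the Laplace scale is $6\tau / (n \cdot \varepsilon/2)$, which depends on $\tau$ rather than on the range $B$; the range only enters through the number of histogram bins.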
Mean estimation in -dimensional space.
In the general $d$-dimensional case, if $X^n$ is concentrated in $\ell_\infty$-norm, one simply applies Algorithm 2 to each dimension. However, when $X^n$ is concentrated in $\ell_2$-norm, naively upper bounding the $\ell_\infty$-norm by the $\ell_2$-norm will incur a superfluous $\sqrt{d}$ factor: if $\|X_i - x_0\|_2 \le \tau$, each coordinate of $X_i - x_0$ is possibly as large as $\tau$. To remedy this issue, we use the random rotation trick in [3, 54]. This guarantees that all coordinates have roughly the same range: for $x \in \mathbb{R}^d$, with high probability, $\|Ux\|_\infty = \tilde{O}(\|x\|_2 / \sqrt{d})$, where $U$ is the random rotation. We present this procedure in Algorithm 3 and its performance in Theorem 2.
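A common way to instantiate such a random rotation is a randomized Hadamard transform, $U = \frac{1}{\sqrt{d}} H D$ with $D$ a diagonal matrix of random signs. The sketch below assumes this instantiation and a dimension that is a power of two; the text above does not specify the rotation, and the names `fwht`, `rotate` and `unrotate` are illustrative.

```python
import math
import random


def fwht(v):
    """Unnormalized fast Walsh-Hadamard transform (length must be a power of 2)."""
    v = list(v)
    d = len(v)
    if d == 0 or d & (d - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    return v


def rotate(x, signs):
    """Apply U = H D / sqrt(d): flip signs, then a normalized Hadamard transform."""
    d = len(x)
    return [c / math.sqrt(d) for c in fwht([s * xi for s, xi in zip(signs, x)])]


def unrotate(y, signs):
    """Apply the inverse of U, which is D H / sqrt(d) because H H = d I."""
    d = len(y)
    return [s * c / math.sqrt(d) for s, c in zip(signs, fwht(y))]


rng = random.Random(0)
d = 16
signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]

# Worst case for coordinate-wise clipping: all the mass on one coordinate.
x = [4.0] + [0.0] * (d - 1)
y = rotate(x, signs)
print(max(abs(c) for c in x))  # 4.0
print(max(abs(c) for c in y))  # 1.0, i.e. ||x||_2 / sqrt(d)
```

After rotation, every coordinate has magnitude $\|x\|_2/\sqrt{d}$ in this example, so a scalar estimator can be applied coordinate by coordinate with a clipping radius smaller by a factor of $\sqrt{d}$, and `unrotate` maps the result back to the original basis.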
Let be the output of Algorithm 3. is -DP. Furthermore, if is -concentrated in -norm, there exists an estimator such that and
where and .
We present the proof of Theorem 2 in Appendix B.2. We are able to transfer both Theorem 1 and Theorem 2 into finite-sample estimation error bounds for various types of concentrated distributions and obtain near-optimal guarantees (see Appendix B.4 for an example in mean estimation of sub-Gaussian distributions). The next corollary characterizes the risk of mean estimation for distributions supported on a bounded domain with user-level DP guarantees (see Appendix B.3 for the proof).
Let be a distribution supported on with mean . Given , , consisting of i.i.d. samples from . There exists an -user-level DP algorithm such that, if for a numerical constant , we have (for precise log factors, see Appendix B.3)
Furthermore, this bound is optimal up to logarithmic factors.
3.2 Uniform concentration: answering many queries privately
The statistical query framework subsumes many learning algorithms. For example, we can easily express stochastic gradient methods for solving ERM in the language of SQ algorithms (see the beginning of Section 4). In the next theorem, we show that with a uniform concentration assumption we can answer a sequence of adaptively chosen queries with variance (or, equivalently, privacy cost) proportional to the concentration radius of the queries instead of the full range.
If is -uniformly concentrated, then for any sequence of (possibly adaptively chosen) queries , there exists an -DP algorithm , such that outputs satisfying , where
where and .
4 Empirical Risk Minimization with User-Level Differential Privacy
In this section, we present an algorithm to solve the ERM objective of (2) under user-level DP constraints. We apply the results of Section 3 by noting that the SQ framework encompasses stochastic gradient methods. Informally, one can sequentially choose queries $\phi_k(z) = \nabla \ell(\theta_k; z)$ and, for a stepsize $\eta$, update $\theta_{k+1} = \theta_k - \eta g_k$, where $g_k$ is the answer to the $k$-th query. For the results to hold, we require a uniform concentration result over the appropriate class of queries.
Uniform concentration of stochastic gradients
The class of queries for stochastic gradient methods is $\mathcal{Q} := \{\nabla \ell(\theta; \cdot) : \theta \in \Theta\}$. We prove that when assumptions A1 and A2 hold, $\mathcal{Q}$ is uniformly concentrated. The next proposition is a simplification of a prior result under the (stronger) assumption A1 that $\ell$ is uniformly $H$-smooth. The proof, which we defer to Appendix C.1, hinges on a covering number argument.
Stochastic gradient methods
We state classical convergence results for stochastic gradient methods for both convex and non-convex losses under smoothness. For a function , we assume access to a first-order stochastic oracle , i.e., a random mapping such that for all ,
We abstract optimization algorithms in the following way: an algorithm consists of an output set , a sub-routine that takes the last output and indicates the next point to query and a sub-routine that takes the previous output and a stochastic gradient and returns the next output. After steps, we call , which takes all the previous outputs and returns the final point. (See Algorithm 7 in Appendix C.2 for how to instantiate generic first-order optimization in this framework.) For example, the classical projected SGD algorithm with fixed stepsize corresponds to the following (with ):
We detail in Proposition 3 in Appendix C.2 standard convergence results for variations of (projected) stochastic gradient descent (SGD). (For convex functions, the algorithm is fixed-stepsize, averaged, projected SGD. For strongly convex functions, the algorithm consists of projected SGD with a fixed stepsize and non-uniform averaging followed by a single restart with decreasing stepsize. Finally, in the non-convex case, the Query and Update sub-routines are also projected SGD with fixed stepsize while Aggregate selects one of the past iterates uniformly at random.) We introduce this abstraction to forgo the details of each specific algorithm and instead focus on the privacy and utility guarantees. We also note the progress in recent years to improve convergence rates in the settings we consider (see, e.g., [44, 30, 31, 43, 4, 5]). While our privacy guarantees hold for these algorithms, and thus practitioners can benefit from the improvements, we do not focus on them as projected SGD already allows us to attain the minimax optimal rate for SCO (see Section 5).
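The Query / Update / Aggregate abstraction can be sketched as below for fixed-stepsize, averaged, projected SGD on a Euclidean ball. This is an illustrative reading of the abstraction, not the paper's Algorithm 7: the function names are hypothetical, and the Gaussian-noise quadratic oracle merely stands in for a privately answered gradient query.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def project_ball(theta: Vector, radius: float) -> Vector:
    """Euclidean projection onto the centered ball of the given radius."""
    norm = math.sqrt(sum(t * t for t in theta))
    if norm <= radius:
        return list(theta)
    return [t * radius / norm for t in theta]


def run_first_order(oracle: Callable[[Vector], Vector], theta0: Vector, steps: int,
                    query, update, aggregate) -> Vector:
    """Generic loop: pick a point, get a stochastic gradient, update, aggregate."""
    outputs = [list(theta0)]
    for _ in range(steps):
        point = query(outputs[-1])
        grad = oracle(point)
        outputs.append(update(outputs[-1], grad))
    return aggregate(outputs)


def make_projected_sgd(stepsize: float, radius: float):
    """Sub-routines for fixed-stepsize, averaged, projected SGD."""
    def query(last: Vector) -> Vector:
        return last  # query the gradient at the current iterate

    def update(last: Vector, grad: Vector) -> Vector:
        return project_ball([t - stepsize * g for t, g in zip(last, grad)], radius)

    def aggregate(outputs: List[Vector]) -> Vector:
        k = len(outputs)
        return [sum(o[i] for o in outputs) / k for i in range(len(outputs[0]))]

    return query, update, aggregate


def noisy_quadratic_oracle(target: Vector, noise_std: float, rng: random.Random):
    """Unbiased noisy gradient of 0.5 * ||theta - target||^2 (hypothetical oracle)."""
    def oracle(theta: Vector) -> Vector:
        return [t - c + rng.gauss(0.0, noise_std) for t, c in zip(theta, target)]
    return oracle


rng = random.Random(0)
oracle = noisy_quadratic_oracle([1.0, -2.0], noise_std=0.1, rng=rng)
query, update, aggregate = make_projected_sgd(stepsize=0.1, radius=5.0)
theta_hat = run_first_order(oracle, [0.0, 0.0], 2000, query, update, aggregate)
print([round(t, 1) for t in theta_hat])
```

Because the loop only interacts with the data through `oracle`, swapping in a private query-answering mechanism leaves the optimization logic untouched, which is the point of introducing the abstraction.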
We recall the ERM setting with user-level DP. We observe with for and wish to solve the constrained optimization problem with objective in (2)
Theorem 4 (Privacy and utility guarantees for ERM).
Recall that , assume (for precise log factors, see Appendix C.3) and let be the output of Algorithm 4. There exist variants of projected SGD (e.g., the ones we present in Proposition 3) such that, with probability greater than :
If for all is convex, then
If for all is -strongly-convex, then
Otherwise, defining the gradient mapping (in the unconstrained case $\Theta = \mathbb{R}^d$, this is equivalent to the notion of critical point, since the gradient mapping then reduces to the gradient itself), we have
Furthermore, for , Algorithm 4 instantiated with any first-order gradient algorithm is -user-level DP.
We present the proof in Appendix C.3. For the utility guarantees, the crux of the proof resides in Theorem 3: as well as ensuring small excess loss in expectation, the SQ algorithm produces with high probability a sample from the stochastic gradient oracle. When this happens for all steps, the analysis of stochastic gradient methods provides the desired regret. The privacy guarantees follow from the strong composition theorem.
Importantly, when the function exhibits (some) strong-convexity (which will be the case for any regularized objective), we are able to localize the optimal parameter—up to the privacy cost—in steps. This will be particularly important in Section 5.
Corollary 2 (Localization).
Let be the output of Algorithm 4 on the ERM problem of (2). Assume that is -strongly-convex for all , that and set and . For , it holds (a logarithmic dependence on is hiding in the result; since , we implicitly assume is polynomial in the stated parameters, which is satisfied when we later apply these results to regularized objectives)
5 Stochastic Convex Optimization with User-level Privacy
In this section we address the SCO task of (3) under user-level DP constraints. Our approach (which we show in Algorithm 5) solves a sequence of carefully regularized ERM problems, drawing on the guarantees of the previous section. Recall that and , and that is -smooth under assumption A1. In this section, we assume that is convex. We first present our results and state an upper and lower bound for SCO with user-level privacy constraints.
Theorem 5 (Phased ERM for SCO).
Theorem 6 (Lower bound for SCO).
5.1 Upper bound: minimizing a sequence of regularized ERM problems
We now present Algorithm 5, which achieves the upper bound of Theorem 5. It is similar in spirit to Phased ERM and EpochGD, in that at each round we minimize a regularized ERM problem with fresh samples and increased regularization, initializing each round from the final iterate of the previous round. This allows us to localize the optimum with exponentially increasing accuracy without blowing up our privacy budget. We solve each round using Algorithm 4 to guarantee privacy and obtain an approximate minimizer. We show that the guarantee in Corollary 2 is enough to achieve optimal rates. We provide the proof of Theorem 5 in Appendix D and present a sketch here.
Proof sketch of Theorem 5.
The privacy guarantee comes directly from the privacy guarantee of Algorithm 4 and the fact that are non-overlapping (since ).
The proof for utility is similar to the proof of Theorem 4.8 in . In round of Algorithm 5, we consider the true minimizer and the approximate minimizer . By stability , we can bound the generalization error of (see Proposition 4 in Appendix D) and, by Corollary 2, we can bound . We finally choose such that the assumptions of Corollary 2 hold and to minimize the final error. ∎
5.2 Lower bound: SCO is harder than Gaussian mean estimation
Theorem 6 holds for $(\varepsilon, \delta)$-user-level DP. Importantly, this is a setting for which lower bounds are generally more challenging (we provide a related lower bound for $\varepsilon$-user-level DP in Section 6.2). We present the proof in Appendix D.2 and provide a sketch here.
Proof sketch of Theorem 6.
As is often the case in privacy, the (constrained) minimax lower bound decomposes into a statistical rate and a privacy rate. The statistical rate is information-theoretically optimal (see, e.g., [45, 2]), thus we focus on the privacy rate.
We consider linear losses of the form (we truncate the data distribution appropriately so the losses remain individually Lipschitz). We show that optimizing over is harder than the mean estimation task for . Intuitively, attains its minimum at and finding provides a good estimate of (the direction of) . We make this formal in Proposition 5.
Next, for Gaussian mean estimation, we reduce, in Proposition 2, user-level DP to item-level DP with lower variance by having each user contribute their sample average (which is a sufficient statistic). We conclude with the results of  (see Proposition 6) by proving in Corollary 6 that estimating the direction of the mean with item-level privacy is hard. ∎
6 Function Classes with Bounded Metric Entropy under Pure DP
We consider the general task of learning a hypothesis class with finite metric entropy (i.e., such that there exists a finite