1 Introduction
Releasing seemingly innocuous functions of a data set can easily compromise the privacy of individuals, whether the functions are simple counts [35]
or complex machine learning models like deep neural networks
[52, 29]. To protect against such leaks, Dwork et al. proposed the notion of differential privacy (DP). Given some data from participants in a study, we say that a statistic of the data is differentially private if an attacker who already knows the data of all but one participant cannot reliably determine from the statistic whether the remaining participant is Alice or Bob. With the recent explosion of publicly available data, progress in machine learning and data science, and widespread public release of machine learning models and other statistical inferences, differential privacy has become an important standard and is widely adopted by both industry and government
[32, 7, 20, 55]. The standard setting of DP described in [21] assumes that each participant contributes a single data point to the dataset, and preserves privacy by “noising” the output in a way that is commensurate with the maximum contribution of a single example. This is not the situation faced in many applications of machine learning models, where users often contribute multiple samples to the model—for example, when language and image recognition models are trained on the users’ own data, or in federated learning settings [37]. As a result, current techniques either provide privacy guarantees that degrade with a user’s increased participation or naively add a substantial amount of noise, relying on the group property of differential privacy, which significantly harms the performance of the deployed model.
To remedy this issue, we consider user-level differential privacy. To make this setting precise, we state differential privacy in its most general form, which requires only specifying a dataset space $\mathcal{Z}$ and a distance $\rho$ on $\mathcal{Z}$.
Definition 1 (Differential Privacy).
Let $\varepsilon, \delta \ge 0$. Let $M : \mathcal{Z} \to \mathcal{O}$ be a (potentially randomized) mechanism. We say that $M$ is $(\varepsilon, \delta)$-DP with respect to $\rho$ if for any measurable subset $S \subset \mathcal{O}$ and all $z, z' \in \mathcal{Z}$ satisfying $\rho(z, z') \le 1$,
(1) $\qquad \mathbb{P}(M(z) \in S) \le e^{\varepsilon}\, \mathbb{P}(M(z') \in S) + \delta.$
If $\delta = 0$, we refer to this guarantee as pure differential privacy.
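As a concrete illustration of Definition 1, the following sketch (not one of the paper's algorithms; the counting query and all names are illustrative) releases a statistic under pure $\varepsilon$-DP via the standard Laplace mechanism, with noise calibrated to the query's sensitivity:

```python
import numpy as np

def laplace_mechanism(data, query, sensitivity, epsilon, rng=None):
    """Release query(data) with Laplace noise of scale sensitivity/epsilon.

    Satisfies pure epsilon-DP whenever `sensitivity` bounds how much
    query(data) can change between any two neighboring datasets.
    """
    rng = np.random.default_rng() if rng is None else rng
    true_value = query(data)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_value + noise

# Counting query: how many entries exceed a threshold (sensitivity 1).
data = np.array([0.2, 0.9, 0.4, 0.7, 0.8])
count_query = lambda x: float(np.sum(x > 0.5))
private_count = laplace_mechanism(data, count_query, sensitivity=1.0, epsilon=1.0)
```

For item-level DP, neighboring datasets differ in one sample; for user-level DP, one user's entire contribution, which is why the sensitivity bound becomes the central object in later sections.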
For a data space $\mathcal{X}$, choosing $\mathcal{Z} = \mathcal{X}^n$ and taking $\rho$ to be the Hamming distance recovers the canonical setting considered in most of the literature—we refer to this as item-level differential privacy. When we wish to guarantee privacy for users rather than individual samples, we instead assume a structured dataset into which each of $n$ users contributes $m$ samples. This corresponds to $\mathcal{Z} = (\mathcal{X}^m)^n$ such that for $z = (z_1, \ldots, z_n)$ and $z' = (z'_1, \ldots, z'_n)$, we have
$\rho(z, z') = \#\{i \in [n] : z_i \neq z'_i\},$
which means that, in this setting, two datasets are neighboring if at most one user's contribution differs. We henceforth refer to this setting as user-level differential privacy. Very recently, for the reasons outlined above, there has been increasing interest in user-level DP for applications such as estimating discrete distributions under user-level privacy constraints [47] and bounding user contributions in ML models [6, 25]. Differentially private SQL with bounded user contributions was proposed in [59]. Since the stronger user-level privacy guarantee is desirable in practice, it has also been studied in the context of learning models via federated learning [49, 48, 58, 8].
In this paper, we tackle the problem of learning with (central) user-level privacy. In particular, we provide algorithms and analyses for the tasks of mean estimation, empirical risk minimization (ERM), stochastic convex optimization (SCO), and learning hypothesis classes with finite metric entropy. Our utility analyses assume that all users draw their samples i.i.d. from the same distribution, while our privacy guarantees hold for arbitrary data. We develop novel private mean estimators in high dimension whose statistical and privacy errors scale with the (arbitrary) concentration radius rather than the range, and apply these to the statistical query setting [SQ; 41]. Our algorithms then rely on (privately) answering a sequence of adaptively chosen queries using users' samples. Importantly, we show that for these tasks, the additional error due to privacy constraints decreases polynomially in $m$, the number of samples per user, contrasting with the naive rate—independent of $m$—obtained using the group property of item-level DP. Interestingly, increasing $n$, the number of users, decreases the privacy cost at a faster rate.
To provide intuition into our techniques for user-level DP, consider the vignette of private mean estimation in one dimension. In this setting, each user contributes a set of $m$ i.i.d. values. If we summarize each user by their sample average, then we reduce the problem to item-level DP since each user now contributes a single datum. When the data distribution has favorable concentration properties—say, it is sub-Gaussian—these averages concentrate in an interval of width $O(\sigma/\sqrt{m})$ (up to logarithmic factors). In other words, the user-level sensitivity effectively decreases by a factor of $\sqrt{m}$ in comparison to a naive application of group privacy in the item-level setting. While the argument is straightforward for i.i.d. data, when answering adaptively chosen queries the same samples are almost always reused. This makes the data non-i.i.d. and breaks concentration. We circumvent these issues by considering families of queries and distributions with uniform concentration properties. We provide algorithms answering adaptively chosen queries that leverage these properties and apply them to solve a range of learning tasks.
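A quick numerical check of this vignette (illustrative only, with Gaussian data standing in for a generic sub-Gaussian distribution): the per-user averages occupy a far narrower interval than the raw samples, which is exactly the sensitivity reduction exploited above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, m = 1000, 100
samples = rng.normal(loc=0.0, scale=1.0, size=(n_users, m))

# Summarize each user by their sample average: one datum per user.
user_means = samples.mean(axis=1)

# Raw samples have O(1) spread, while the user averages concentrate
# at scale 1/sqrt(m): the effective sensitivity shrinks accordingly.
spread_samples = samples.std()
spread_means = user_means.std()
```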
1.1 Our Contributions and Related Work
We provide a theoretical tool to construct estimators for tasks with userlevel privacy constraints and apply it to a range of learning problems. We preview our results briefly below.
Private mean estimation and uniformly concentrated queries (Section 3)
In this section, we present our main technical tool. We show that for a random variable concentrated in an unknown interval of radius $\tau$ (made precise in Definition 2), we can privately estimate its mean with statistical and privacy error proportional to $\tau$ rather than the whole range. Indeed, standard private mean estimation techniques [23] add Laplace or Gaussian noise proportional to the worst-case range of the random variable. However, for well-concentrated distributions, the concentration radius $\tau$ is often much smaller than the range, resulting in needlessly large noise. When data is concentrated in norm, several papers show that one can achieve an error scaling with the concentration radius rather than the range, either asymptotically [53], for Gaussian mean estimation [40, 38], for sub-Gaussian symmetric distributions [17, 16], or for distributions with bounded $k$-th moment
[39]. We propose a private mean estimator with error scaling with $\tau$ that works in arbitrary dimension when data is concentrated in $\ell_2$ norm and without additional assumptions. We present the estimator in Algorithm 3 and the analysis in Theorem 2. In Corollary 1, we apply this estimator to the task of mean estimation under user-level privacy constraints for bounded random vectors, and achieve the minimax-optimal risk up to logarithmic factors. Building on these results, we show in the statistical query framework that for uniformly concentrated queries (see Definition 3) it is possible to privately answer a sequence of adaptively chosen queries with privacy cost proportional to the concentration radius (see Corollary 3). An important feature of our algorithm is that, with high probability, it returns the sample mean of the query with exogenous noise. This makes the analysis significantly easier when considering stochastic gradient algorithms.
Our conclusions relate to the growing literature on adaptive data analysis; however, our results have a different aim and as such are hard to compare. A sequence of papers [24, 11, 26, 27] focuses on adaptivity and seeks to answer queries accurately—without privacy constraints—while controlling the error of each query. While these algorithms use techniques from differential privacy and their answers are DP for some fixed privacy parameter, our work focuses on the privacy cost for general $(\varepsilon, \delta)$ with the additional assumption of uniform concentration.
Empirical risk minimization (Section 4)
An influential line of papers studies ERM under item-level privacy constraints [18, 42, 10]. Importantly, these papers assume arbitrary data, i.e., not necessarily i.i.d. samples from a data distribution. The exact analog of ERM in the user-level setting is consequently less interesting: in the worst case, each user contributes $m$ copies of the same data point and the problem reduces to the item-level setting. We thus consider the (related) problem of ERM when observing data points sampled i.i.d. from a distribution. Assuming some regularity (see Assumptions A1 and A2), the stochastic gradients are uniformly concentrated, and the results in Section 3 apply. Casting stochastic gradient methods as a sequence of adaptive queries, we establish convergence rates for ERM under user-level DP constraints for convex, strongly convex, and nonconvex losses. We show that the cost of privacy decreases polynomially in $m$ for the convex and nonconvex cases, and faster still in the strongly convex case, as user contribution increases. We detail these results in Theorem 4.
Stochastic convex optimization (Section 5)
Under item-level DP (or equivalently, user-level DP with $m = 1$), a sequence of work [18, 10, 12, 13, 28] establishes the constrained minimax risk when the loss is Lipschitz and the parameter space has bounded diameter. In this paper, with the additional assumptions that the losses are individually smooth² and the stochastic gradients are sub-Gaussian, we prove an upper bound and a lower bound on the population risk. We present precise statements in Theorems 5 and 6. In one regime the privacy rates match, and in another the statistical rates match (in both cases up to logarithmic factors). We leave closing the gap outside of these regimes to future work. (²We note that the results only require smooth losses. In the limit of large $m$—keeping all other problem parameters fixed—this is a very weak assumption: by considering a smoothed version of the loss (for example, using the Moreau envelope [33]), our algorithm yields optimal rates for nonsmooth losses.)
Function classes with finite metric entropy under pure DP (Section 6)
Our previous results only hold for approximate user-level DP. Turning to pure DP, we consider function classes with bounded range and finite metric entropy. We provide an estimator combining our mean estimation techniques with the private selection mechanism of [46]. For a finite class, we achieve an excess risk whose privacy cost decreases with $m$, and we further prove a nearly matching lower bound. For SCO with Lipschitz gradients and bounded domain, a covering-number argument implies an excess-risk bound under pure user-level DP constraints. While the statistical rate is optimal, we observe a gap in the privacy cost compared to the lower bound, which we leave as future work.
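For intuition on pure-DP selection over a finite class, a generic exponential-mechanism sketch is below. This is a simplified stand-in, not the specific private selection mechanism of [46], and the low-score-is-better convention is our assumption:

```python
import numpy as np

def exponential_mechanism(scores, epsilon, sensitivity, rng=None):
    """Select an index from a finite class, favoring low scores (losses).

    Satisfies pure epsilon-DP when `sensitivity` bounds how much any single
    score can change between neighboring datasets.
    """
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=float)
    logits = -epsilon * scores / (2.0 * sensitivity)
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(scores), p=probs))
```

With a large privacy budget, the mechanism concentrates on (near-)minimizers of the score; the user-level setting additionally shrinks the effective sensitivity of the scores by averaging within users.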
Limit of learning with a fixed number of users (Appendix A)
Finally, we resolve a conjecture of [6] and prove that with a fixed number of users, even in the limit $m \to \infty$ (i.e., each user has an infinite number of samples), we cannot reach zero error. In particular, we prove that for all the learning tasks we consider, the risk under user-level privacy constraints is bounded away from zero regardless of $m$. Note that this does not contradict the results above, since they require a sufficiently large number of users to hold.
2 Preliminaries
Notation.
Throughout this work, $d$ denotes the dimension, $n$ the number of users, and $m$ the number of samples per user. Generically, $\sigma$ will denote the sub-Gaussian parameter, $\tau$ the concentration radius, and $\nu^2$ the variance of a random vector. We denote the optimization variable by $\theta$ and use $x$ (or $X$ when random) to denote a data sample supported on a space $\mathcal{X}$. We denote by $P$ the data distribution and by $f$ the loss function. Gradients (denoted $\nabla$) are always taken with respect to the optimization variable $\theta$. For a convex set $\Theta$, $\Pi_\Theta$ denotes the Euclidean projection onto $\Theta$, i.e., $\Pi_\Theta(x) = \mathrm{argmin}_{\theta \in \Theta} \|\theta - x\|_2$. We use $M$ to refer to (possibly random) private mechanisms and $X_{1:n}$ as a shorthand for the dataset $(X_1, \ldots, X_n)$.
2.1 ERM and stochastic convex optimization
Assumptions.
Throughout this work, we assume that the parameter space $\Theta$ is closed, convex, and of bounded diameter: $\|\theta - \theta'\|_2 \le D$ for all $\theta, \theta' \in \Theta$. We also assume that the loss $f(\cdot\,; x)$ is $L$-Lipschitz, meaning that $|f(\theta; x) - f(\theta'; x)| \le L \|\theta - \theta'\|_2$ for all $\theta, \theta' \in \Theta$ and all $x \in \mathcal{X}$.³ We further consider the following assumptions. (³It is straightforward to develop analogs of the results of Sections 3 and 4 for arbitrary norms, but we restrict our attention to the $\ell_2$ norm in this work for clarity.)
Assumption A1.
The function $f(\cdot\,; x)$ is $H$-smooth. In other words, the gradient $\nabla f(\cdot\,; x)$ is $H$-Lipschitz in the variable $\theta$ for all $x \in \mathcal{X}$.
Assumption A2.
The random vector $\nabla f(\theta; X)$ is $\sigma$-sub-Gaussian for all $\theta \in \Theta$, where $X \sim P$. Equivalently, for every unit vector $u$, $\langle u, \nabla f(\theta; X) - \mathbb{E}[\nabla f(\theta; X)] \rangle$ is a $\sigma$-sub-Gaussian random variable, i.e.,
$\mathbb{E} \exp\big(\lambda \langle u, \nabla f(\theta; X) - \mathbb{E}[\nabla f(\theta; X)] \rangle\big) \le \exp(\lambda^2 \sigma^2 / 2) \quad \text{for all } \lambda \in \mathbb{R}.$
In this work, our rates often depend on the sub-Gaussianity and Lipschitz parameters $\sigma$ and $L$, for which we define convenient shorthands. Intuitively, the Lipschitzness assumption bounds the gradient uniformly (independently of $x$), while sub-Gaussianity implies that, for each fixed $\theta$, the gradient likely lies in a ball of radius $O(\sigma)$ around its mean. Generically, there is no ordering between $\sigma$ and $L$: for a linear loss $f(\theta; x) = \langle \theta, x \rangle$, depending on the distribution of $x$, it can hold that $\sigma \ll L$, $\sigma \approx L$, or $\sigma \gg L$.
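The distinction between the two parameters can be seen empirically. In this illustrative snippet (Gaussian data with a linear loss; all names are ours), the worst-case gradient norm, which drives the Lipschitz constant, is much larger than the per-direction spread that drives the sub-Gaussian parameter:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 20, 100_000
# Linear loss f(theta; x) = <theta, x>: its gradient in theta is x itself,
# so the Lipschitz constant tracks the worst-case gradient norm, while the
# sub-Gaussian parameter reflects the typical spread around the mean.
x = rng.normal(size=(N, d))

L_hat = np.linalg.norm(x, axis=1).max()   # worst case over the sample
sigma_hat = np.max(np.std(x, axis=0))     # per-direction spread
```

For isotropic Gaussian data, `L_hat` grows like $\sqrt{d}$ while `sigma_hat` stays $O(1)$, so rates that depend on the concentration radius rather than the worst-case range can be much sharper in high dimension.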
We introduce the objectives that we consider in this work, namely empirical risk minimization (ERM) and stochastic convex optimization (SCO). For a collection of samples from $n$ users $X_{1:n}$, where each $X_i = (x_{i,1}, \ldots, x_{i,m}) \in \mathcal{X}^m$, we define the empirical objective
(2) $\qquad \hat{f}(\theta) := \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} f(\theta; x_{i,j}).$
In the user-level setting, we wish to minimize $\hat{f}$ under user-level privacy constraints. Going beyond the empirical risk, we also solve SCO [51], i.e., minimizing a population convex objective when provided with i.i.d. samples. In the user-level setting, for a convex loss $f$ and a convex constraint set $\Theta$, we observe $X_{1:n}$ and wish to
(3) $\qquad \text{minimize}_{\theta \in \Theta}\ f_P(\theta) := \mathbb{E}_{x \sim P}[f(\theta; x)].$
2.2 Uniform concentration of queries
We define concentration of a single query and uniform concentration of multiple queries as follows.
Definition 2.
A (random) sample $X$ supported on $\mathcal{X}$ is $(\tau, \gamma)$-concentrated (and we call $\tau$ the “concentration radius”) if there exists $\mu$ such that, with probability at least $1 - \gamma$, $\|X - \mu\| \le \tau$.
Definition 3 (Uniform concentration of vector queries).
Let $\Phi$ be a family of queries with bounded range. For $\tau, \gamma > 0$, we say that $\Phi$ is uniformly $(\tau, \gamma)$-concentrated if, with probability at least $1 - \gamma$, the empirical mean of every query $\phi \in \Phi$ lies within distance $\tau$ of its expectation.
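The following helper (illustrative, not part of the paper) estimates the concentration radius of Definition 2 empirically from a cloud of points, e.g., the per-user averages of a single query:

```python
import numpy as np

def empirical_concentration_radius(points, gamma=0.01):
    """Smallest radius tau such that a (1 - gamma) fraction of the points
    lies within distance tau of their overall mean: an empirical proxy
    for the concentration radius of Definition 2."""
    points = np.asarray(points, dtype=float)
    center = points.mean(axis=0)
    dists = np.linalg.norm(points - center, axis=1)
    return float(np.quantile(dists, 1.0 - gamma))
```

In the algorithms below, the private noise scale is calibrated to such a radius rather than to the worst-case range of the data.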
3 High Dimensional Mean Estimation and Uniformly Concentrated Queries
In this section, we study the cost of answering adaptively chosen queries privately when the class of queries exhibits the uniform concentration property of Definition 3. In Section 3.1, we present a private mean estimator with privacy cost proportional to the concentration radius. In Section 3.2, we show that, under uniform concentration, we can also answer adaptively chosen queries with privacy cost proportional to the concentration radius instead of the whole range. Our theorems guarantee that, with failure probability exponentially small in $n$, the estimator coincides with a simple unbiased estimator with small noise. This formulation makes the analysis much simpler when applying the technique to learning problems. Furthermore, we show that we can painlessly translate these results into upper bounds on the estimation error, which we demonstrate by providing tight bounds on privately estimating the mean of bounded random vectors under user-level DP constraints (see Corollary 1).
3.1 Winsorized mean estimator: answering a single query privately
We consider the simple task of mean estimation, which is exactly equivalent to answering the single statistical query $\phi(x) = x$. We first focus on the scalar case ($d = 1$) and present our estimator in Algorithm 2. The algorithm uses a two-stage procedure, similar in spirit to those of [53], [40], and [39]. In the first stage of this procedure, we use the approximate median estimation of [26], detailed in Algorithm 1, to privately estimate a crude interval in which the user means lie. The second stage clips the means to this interval, reducing the sensitivity from the full range to a quantity proportional to the concentration radius $\tau$, and adds the appropriate Laplace noise. With high probability, we recover the guarantee of the Laplace mechanism with smaller sensitivity since the samples are concentrated in a radius $\tau$. We present the formal guarantees of Algorithm 2 in Theorem 1 and defer its proof to Appendix B.1.
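To make the two-stage idea concrete, here is a hedged sketch of the second stage only; the crude private center from stage one is passed in as `center`, the parameter names are ours, and this is not Algorithm 2 verbatim:

```python
import numpy as np

def winsorized_mean(user_means, center, tau, epsilon, rng=None):
    """Second stage of a two-stage estimator: clip the per-user means to
    [center - tau, center + tau], then release their average with Laplace
    noise scaled to the reduced sensitivity 2*tau / (n * epsilon).

    `center` stands in for the crude private interval estimate of stage one.
    """
    rng = np.random.default_rng() if rng is None else rng
    user_means = np.asarray(user_means, dtype=float)
    n = len(user_means)
    clipped = np.clip(user_means, center - tau, center + tau)
    # One user can move the clipped average by at most 2*tau / n.
    sensitivity = 2.0 * tau / n
    return clipped.mean() + rng.laplace(scale=sensitivity / epsilon)
```

When the per-user means are indeed concentrated at radius $\tau$ around `center`, clipping is inactive with high probability, so the output is the exact sample mean plus small exogenous noise.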
Theorem 1.
Compared to prior work [40, 38, 39], our algorithm enjoys a faster running time owing to the approximate median estimation algorithm of [26].
Mean estimation in $d$-dimensional space.
In the general $d$-dimensional case, if $X$ is concentrated in $\ell_\infty$ norm, one simply applies Algorithm 2 to each dimension. However, when $X$ is concentrated in $\ell_2$ norm, naively upper bounding the $\ell_\infty$ norm by the $\ell_2$ norm incurs a superfluous $\sqrt{d}$ factor: if $\|X\|_2 \le \tau$, each coordinate is possibly as large as $\tau$. To remedy this issue, we use the random rotation trick of [3, 54]. This guarantees that all coordinates have roughly the same range: with high probability, $\|UX\|_\infty \lesssim \|X\|_2 \sqrt{\log(d)/d}$, where $U$ is the random rotation. We present this procedure in Algorithm 3 and its performance in Theorem 2.
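A small simulation of the rotation trick (illustrative; the constructions in [3, 54] use fast Hadamard-type rotations, whereas here a dense random orthogonal matrix is drawn for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256
# A vector whose mass sits on one coordinate: the worst case for
# per-coordinate clipping without rotation.
x = np.zeros(d)
x[0] = 1.0

# Random orthogonal rotation via QR of a Gaussian matrix.
q, _ = np.linalg.qr(rng.normal(size=(d, d)))
rotated = q @ x

# The rotation preserves the l2 norm but spreads the mass out: every
# coordinate is now roughly ||x||_2 * sqrt(log(d) / d) instead of ||x||_2.
max_coord = np.abs(rotated).max()
```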
Theorem 2.
Let $\hat{\mu}$ be the output of Algorithm 3; $\hat{\mu}$ is $(\varepsilon, \delta)$-DP. Furthermore, if $X$ is $(\tau, \gamma)$-concentrated in $\ell_2$ norm, there exists an unbiased estimator $\tilde{\mu}$ with small noise such that $\hat{\mu}$ coincides with $\tilde{\mu}$ with high probability, satisfying
(4)
We present the proof of Theorem 2 in Appendix B.2. We are able to transfer both Theorem 1 and Theorem 2 into finite-sample estimation error bounds for various types of concentrated distributions and obtain near-optimal guarantees (see Appendix B.4 for an example in mean estimation of sub-Gaussian distributions). The next corollary characterizes the risk of mean estimation for distributions supported on a bounded domain with user-level DP guarantees (see Appendix B.3 for the proof).
Corollary 1.
Let $P$ be a distribution supported on a bounded domain with mean $\mu$. Given $n$ users, each holding $m$ i.i.d. samples from $P$, there exists an $(\varepsilon, \delta)$ user-level DP algorithm achieving the stated risk bound whenever $n$ exceeds a numerical constant (up to logarithmic factors⁵). Furthermore, this bound is optimal up to logarithmic factors. (⁵For precise log factors, see Appendix B.3.)
3.2 Uniform concentration: answering many queries privately
The statistical query framework subsumes many learning algorithms. For example, we can easily express stochastic gradient methods for solving ERM in the language of SQ algorithms (see the beginning of Section 4). In the next theorem, we show that with a uniform concentration assumption we can answer a sequence of adaptively chosen queries with variance—or, equivalently, privacy cost—proportional to the concentration radius of the queries instead of the full range.
Theorem 3.
If $\Phi$ is uniformly $(\tau, \gamma)$-concentrated, then for any sequence of (possibly adaptively chosen) queries in $\Phi$, there exists an $(\varepsilon, \delta)$-DP algorithm whose outputs, with high probability, equal the empirical query means with additive noise proportional to $\tau$.
4 Empirical Risk Minimization with User-Level Differential Privacy
In this section, we present an algorithm to solve the ERM objective of (2) under user-level DP constraints. We apply the results of Section 3 by noting that the SQ framework encompasses stochastic gradient methods. Informally, one can sequentially choose gradient queries and, for a stepsize $\eta$, update $\theta_{t+1} = \Pi_\Theta(\theta_t - \eta\, a_t)$, where $a_t$ is the answer to the $t$-th query. For the results to hold, we require a uniform concentration result over the appropriate class of queries.
Uniform concentration of stochastic gradients
The class of queries for stochastic gradient methods is $\{x \mapsto \nabla f(\theta; x) : \theta \in \Theta\}$. We prove that when Assumptions A1 and A2 hold, this class is uniformly concentrated. The next proposition is a simplification of the result of [50] under the (stronger) Assumption A1 that the loss is uniformly smooth. The proof, which we defer to Appendix C.1, hinges on a covering-number argument.
Stochastic gradient methods
We state classical convergence results for stochastic gradient methods for both convex and nonconvex losses under smoothness. For a function $F$, we assume access to a first-order stochastic oracle, i.e., a random mapping $g$ such that $\mathbb{E}[g(\theta)] = \nabla F(\theta)$ for all $\theta \in \Theta$.
We abstract optimization algorithms in the following way: an algorithm consists of an output set, a Query subroutine that takes the last output and indicates the next point to query, and an Update subroutine that takes the previous output and a stochastic gradient and returns the next output. After $T$ steps, we call an Aggregate subroutine, which takes all the previous outputs and returns the final point. (See Algorithm 7 in Appendix C.2 for how to instantiate generic first-order optimization in this framework.) For example, the classical projected SGD algorithm with fixed stepsize $\eta$ corresponds to Query returning the current iterate, Update performing the projected step $\theta \mapsto \Pi_\Theta(\theta - \eta g)$, and Aggregate averaging the iterates.
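The abstraction above can be sketched as follows. This is a toy, noiseless instantiation: the subroutine names follow the text, but the projection set, step size, and quadratic objective are all illustrative.

```python
import numpy as np

def project(x, radius=1.0):
    """Euclidean projection onto the l2 ball of the given radius."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else (radius / nrm) * x

def query(state):
    return state  # next point at which to request a stochastic gradient

def update(state, grad, eta=0.1):
    return project(state - eta * grad)  # projected gradient step

def aggregate(history):
    return np.mean(history, axis=0)  # averaged iterate

# Toy run on f(theta) = 0.5 * ||theta - b||^2, using exact gradients
# as a stand-in for the stochastic first-order oracle.
b = np.array([0.3, -0.2])
state, history = np.zeros(2), []
for _ in range(200):
    point = query(state)
    grad = point - b          # oracle answer; noiseless here for clarity
    state = update(state, grad)
    history.append(state)
theta_hat = aggregate(history)
```

In the private version, the oracle answer is replaced by the noisy query answer from Section 3, which is why the SQ formulation makes the utility analysis go through unchanged.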
We detail in Proposition 3 in Appendix C.2 standard convergence results for variations⁶ of (projected) stochastic gradient descent (SGD). We introduce this abstraction to forego the details of each specific algorithm and instead focus on the privacy and utility guarantees. We also note the progress in recent years to improve convergence rates in the settings we consider (see, e.g., [44, 30, 31, 43, 4, 5]). While our privacy guarantees hold for these algorithms—and thus practitioners can benefit from the improvements—we do not focus on them, as projected SGD already allows us to attain the minimax optimal rate for SCO (see Section 5). (⁶For convex functions, the algorithm is fixed-stepsize, averaged, projected SGD. For strongly convex functions, the algorithm consists of projected SGD with a fixed stepsize and nonuniform averaging, followed by a single restart with decreasing stepsize. Finally, in the nonconvex case, the Query and Update subroutines are also projected SGD with fixed stepsize, while Aggregate selects one of the past iterates uniformly at random.)
Algorithm
We recall the ERM setting with user-level DP. We observe $X_i \in \mathcal{X}^m$ for $i \in [n]$ and wish to solve the constrained optimization problem
(5) $\qquad \text{minimize}_{\theta \in \Theta}\ \hat{f}(\theta),$
with the objective $\hat{f}$ defined in (2).
We present our method in Algorithm 4 and provide utility and privacy guarantees in Theorem 4.
Theorem 4 (Privacy and utility guarantees for ERM).
Assume that $n$ is sufficiently large⁷ and let $\hat{\theta}$ be the output of Algorithm 4. There exist variants of projected SGD (e.g., the ones we present in Proposition 3) such that, with high probability: (a) if $f(\cdot\,; x)$ is convex for all $x$, the convex excess-risk bound holds; (b) if $f(\cdot\,; x)$ is strongly convex for all $x$, the strongly convex excess-risk bound holds; (c) otherwise, a bound on the norm of the gradient mapping⁸ holds. Furthermore, Algorithm 4 instantiated with any first-order gradient algorithm is $(\varepsilon, \delta)$ user-level DP. (⁷For precise log factors, see Appendix C.3.) (⁸In the unconstrained case, $\Theta = \mathbb{R}^d$, this is equivalent to the notion of a critical point, i.e., $\nabla \hat{f}(\theta) = 0$.)
We present the proof in Appendix C.3. For the utility guarantees, the crux of the proof resides in Theorem 3: as well as ensuring small excess loss in expectation, the SQ algorithm produces, with high probability, a sample from a stochastic gradient oracle with small variance. When this happens at every step, the analysis of stochastic gradient methods provides the desired regret. The privacy guarantees follow from the strong composition theorem of [22].
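For a sense of how composition aggregates the per-query privacy cost across the SGD steps, here is a small calculator implementing the standard advanced-composition bound (a generic textbook formula, not a computation specific to this paper's constants):

```python
import numpy as np

def advanced_composition(eps, delta, T, delta_prime):
    """Standard advanced composition: T adaptive runs of an (eps, delta)-DP
    mechanism are (eps_total, T*delta + delta_prime)-DP with
    eps_total = sqrt(2 T log(1/delta_prime)) * eps + T * eps * (e^eps - 1)."""
    eps_total = (np.sqrt(2 * T * np.log(1.0 / delta_prime)) * eps
                 + T * eps * np.expm1(eps))
    return eps_total, T * delta + delta_prime

# 100 adaptive queries, each 0.1-DP: well below the naive total of 10.0.
eps_total, delta_total = advanced_composition(0.1, 0.0, 100, 1e-5)
```

The sublinear $\sqrt{T}$ dependence is what lets the algorithm afford many adaptive gradient queries within a fixed overall budget.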
Importantly, when the function exhibits (some) strong convexity (which will be the case for any regularized objective), we are able to localize the optimal parameter, up to the privacy cost, after sufficiently many steps. This will be particularly important in Section 5.
Corollary 2 (Localization).
Let $\hat{\theta}$ be the output of Algorithm 4 on the ERM problem of (2). Assume that $f(\cdot\,; x)$ is strongly convex for all $x$, that $n$ is sufficiently large, and that the stepsize and number of iterations are set appropriately. Then the localization bound holds.⁹ (⁹A logarithmic dependence is hidden in the result; we implicitly assume the relevant quantities are polynomial in the stated parameters, which is satisfied when we later apply these results to regularized objectives.)
5 Stochastic Convex Optimization with User-Level Privacy
In this section, we address the SCO task of (3) under user-level DP constraints. Our approach (which we show in Algorithm 5) solves a sequence of carefully regularized ERM problems, drawing on the guarantees of the previous section. Recall that the loss is Lipschitz and, under Assumption A1, smooth. In this section, we additionally assume that $f(\cdot\,; x)$ is convex. We first present our results, stating an upper and a lower bound for SCO with user-level privacy constraints.
Theorem 5 (Phased ERM for SCO).
Theorem 6 (Lower bound for SCO).
In the regime discussed above, the upper bound matches the lower bound up to logarithmic factors. We present the algorithm and proof for Theorem 5 in Section 5.1. Theorem 6 is proved in Section 5.2.
5.1 Upper bound: minimizing a sequence of regularized ERM problems
We now present Algorithm 5, which achieves the upper bound of Theorem 5. It is similar in spirit to Phased ERM [28] and EpochGD [34], in that at each round we minimize a regularized ERM problem with fresh samples and increased regularization, initializing each round from the final iterate of the previous round. This allows us to localize the optimum with exponentially increasing accuracy without blowing up our privacy budget. We solve each round using Algorithm 4 to guarantee privacy and obtain an approximate minimizer. We show that the guarantee in Corollary 2 is enough to achieve optimal rates. We provide the proof of Theorem 5 in Appendix D and present a sketch here.
(6) 
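The round structure described above can be caricatured as follows. This is a heavily simplified, non-private stand-in for Algorithm 5 — a fresh batch per round, regularization doubled each round, warm start from the previous solution — and all parameter names are ours; in the paper, each inner solve runs privately via Algorithm 4.

```python
import numpy as np

def phased_erm(sample_batches, grad_loss, theta0, lam0=0.1, steps=100, eta=0.05):
    """Schematic phased-ERM loop: each round solves a regularized ERM on
    fresh samples, regularizing toward (and warm-starting from) the previous
    round's output, then doubles the regularization strength."""
    theta = np.asarray(theta0, dtype=float)
    lam = lam0
    for batch in sample_batches:   # disjoint batches: privacy composes trivially
        center = theta             # regularize toward the previous solution
        for _ in range(steps):
            g = np.mean([grad_loss(theta, z) for z in batch], axis=0)
            theta = theta - eta * (g + lam * (theta - center))
        lam *= 2.0                 # stronger regularization each round
    return theta
```

Each round's regularized objective is strongly convex, so Corollary 2 applies to the private inner solve, and the shrinking distance to the optimum keeps the per-round privacy noise from accumulating.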
Proof sketch of Theorem 5.
The privacy guarantee comes directly from the privacy guarantee of Algorithm 4 and the fact that the sample batches used in different rounds are non-overlapping.
The proof for utility is similar to the proof of Theorem 4.8 in [28]. In each round of Algorithm 5, we consider the true minimizer and the approximate minimizer of the regularized objective. By stability [14], we can bound the generalization error of the true minimizer (see Proposition 4 in Appendix D) and, by Corollary 2, we can bound the distance between the two minimizers. We finally choose the regularization parameters so that the assumptions of Corollary 2 hold and so as to minimize the final error. ∎
5.2 Lower bound: SCO is harder than Gaussian mean estimation
Theorem 6 holds for approximate user-level DP; importantly, this is a setting for which lower bounds are generally more challenging (we provide a related lower bound for pure user-level DP in Section 6.2). We present the proof in Appendix D.2 and provide a sketch here.
Proof sketch of Theorem 6.
As is often the case in privacy, the (constrained) minimax lower bound decomposes into a statistical rate and a privacy rate. The statistical rate is informationtheoretically optimal (see, e.g., [45, 2]), thus we focus on the privacy rate.
We consider (appropriately truncated) linear losses.¹⁰ We show that optimizing such losses over the constraint set is harder than the task of estimating the mean of the data distribution. Intuitively, the population objective attains its minimum at a point aligned with the mean, so finding an approximate minimizer provides a good estimate of (the direction of) the mean. We make this formal in Proposition 5. (¹⁰We truncate the data distribution appropriately so the losses remain individually Lipschitz.)
Next, for Gaussian mean estimation, we reduce, in Proposition 2, user-level DP to item-level DP with lower variance by having each user contribute their sample average (which is a sufficient statistic). We conclude with the results of [38] (see Proposition 6) by proving in Corollary 6 that estimating the direction of the mean with item-level privacy is hard. ∎
6 Function Classes with Bounded Metric Entropy under Pure DP
We consider the general task of learning a hypothesis class with finite metric entropy (i.e., such that there exists a finite cover under a certain norm) and bounded loss under pure user-level DP constraints.
For this setting, we present Algorithm 6, which we complement with an informationtheoretic lower bound. As in the previous sections, we consider a sample set