Many fundamental problems in machine learning reduce to the problem of minimizing the expected loss (also referred to as population loss) $\mathcal{L}(w) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(w, z)]$ for convex loss functions of $w$, given access to i.i.d. samples $z_1, \ldots, z_n$ from the data distribution $\mathcal{D}$. This problem arises in various settings, such as estimating the mean of a distribution, least squares regression, or minimizing a convex surrogate loss for a classification problem. It is commonly referred to as stochastic convex optimization (SCO) and has been the subject of extensive study in machine learning and optimization. In this work we study this problem with the additional constraint of differential privacy with respect to the samples [DMNS06].
A natural approach toward solving SCO is to minimize the empirical loss, which is referred to as empirical risk minimization (ERM). The problem of ERM with differential privacy (DP-ERM) has been well-studied, and asymptotically tight upper and lower bounds on excess loss (the difference between the achieved loss and the true minimum) are known [CM08, CMS11, JKT12, KST12, ST13, SCS13, DJW13, Ull15, JT14, BST14, TTZ15, STU17, WLK17, WYX17, INS19].
A standard approach for deriving bounds on the population loss is to appeal to uniform convergence of empirical loss to population loss, namely an upper bound on $\sup_{w} \left(\mathcal{L}(w) - \widehat{\mathcal{L}}(w)\right)$, where $\widehat{\mathcal{L}}$ denotes the empirical loss. This approach can be used to derive optimal bounds on the excess population loss in a number of special cases, such as regression for generalized linear models. However, in general, it leads to suboptimal bounds. It is known that there exist distributions over loss functions over $\mathbb{R}^d$ for which the best bound on uniform convergence is $\Omega(\sqrt{d/n})$ [Fel16]. In contrast, in the same setting, DP-ERM can be solved with excess loss of $\tilde{O}(\sqrt{d}/(\epsilon n))$ and the optimal excess population loss achievable without privacy is $O(\sqrt{1/n})$. As a result, in the high-dimensional settings often considered in modern ML (when $n = \Theta(d)$), bounds based on uniform convergence are $\Omega(1)$ and do not lead to meaningful bounds on population loss.
The first work to address the population loss for SCO with differential privacy (DP-SCO) is [BST14]. It gives bounds based on two natural approaches. The first approach is to use the generalization properties of differential privacy itself to bound the gap between the empirical and population losses [DFH15, BNS16], and thus derive bounds for SCO from bounds on ERM. This approach leads to a suboptimal bound [BST14, Sec. F]. (For clarity, in the introduction we focus on the dependence on $d$, $n$, and $\epsilon$ for $(\epsilon, \delta)$-DP; we suppress the dependence on $\delta$ and on parameters of the loss function such as the Lipschitz constant and the constraint set radius.) For the important case when $d = \Theta(n)$ and $\epsilon = \Theta(1)$, this results in a bound of $\Omega(n^{-1/4})$ on excess population loss. The second approach relies on generalization properties of stability to bound the gap between the empirical and population losses [BE02, SSSSS09]. Stability is ensured by adding a strongly convex regularizer to the empirical loss [SSSSS09]. This technique also yields a suboptimal bound on the excess population loss.
There are two natural lower bounds that apply to DP-SCO. The lower bound of $\Omega(\sqrt{1/n})$ for the excess loss of non-private SCO applies to DP-SCO. Further, it is not hard to show that lower bounds for DP-ERM translate to essentially the same lower bound for DP-SCO, leading to a lower bound of $\Omega(\sqrt{d}/(\epsilon n))$ (see Appendix C for the proof).
1.1 Our contribution
In this work, we address the gap between the known bounds for DP-SCO. Specifically, we show that the optimal rate of $O\left(\sqrt{1/n} + \sqrt{d}/(\epsilon n)\right)$ is achievable, matching the known lower bounds. In particular, we obtain the statistically optimal rate of $O(1/\sqrt{n})$ whenever $d = O(n)$. This is in contrast to the situation for DP-ERM, where the cost of privacy grows with the dimension for all $n$.
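As a quick check of when the privacy term is dominated, assume the optimal rate has the form $O(1/\sqrt{n} + \sqrt{d}/(\epsilon n))$, with the dependence on $\delta$, the Lipschitz constant, and the radius suppressed as elsewhere in the introduction. Comparing the two terms gives:

```latex
\frac{\sqrt{d}}{\epsilon n} \le \frac{1}{\sqrt{n}}
\iff \sqrt{d} \le \epsilon \sqrt{n}
\iff d \le \epsilon^{2} n .
```

So, for constant $\epsilon$, the privacy term is of lower order whenever $d = O(n)$, which is the regime in which the non-private statistical rate is recovered.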
In our first result we show that, under relatively mild smoothness assumptions, this rate is achieved by a variant of the standard noisy mini-batch SGD. The classical analyses for non-private SCO depend crucially on making only one pass over the dataset. However, single-pass noisy SGD is not sufficiently accurate, as we need a non-trivial amount of noise in each step to carry out the privacy analysis. We rely instead on generalization properties of uniform stability [BE02]. Unlike in [BST14], our analysis of stability is based on an extension of the recent stability analysis of SGD [HRS15, FV19] to noisy SGD. In this analysis, the stability parameter degrades with the number of passes over the dataset, while the empirical loss decreases as we make more passes. In addition, the batch size needs to be sufficiently large to ensure that the noise added for privacy is small. To satisfy all these constraints, the parameters of the scheme need to be tuned carefully. Specifically, we show that $\approx \min(n, n^2\epsilon^2/d)$ steps of SGD with a batch size of $\approx \max(\sqrt{\epsilon n}, 1)$ are sufficient to get all the desired properties.
Our second contribution is to show that the smoothness assumptions can be relaxed at essentially no additional increase in the rate. We use a general smoothing technique based on the Moreau-Yosida envelope operator that allows us to derive the same asymptotic bounds as in the smooth case. This operator cannot be implemented efficiently in general, but for algorithms based on gradient steps we exploit the well-known connection between the gradient step on the smoothed function and the proximal step on the original function. Thus our algorithm is equivalent to (stochastic, noisy, mini-batch) proximal descent on the unsmoothed function. We show that our analysis in the smooth case is robust to inaccuracies in the computation of the gradient. This allows us to show that a sufficient approximation to the proximal steps can be implemented in polynomial time given access to the gradients of the individual losses $\ell(\cdot, z)$.
Finally, we show that Objective Perturbation [CMS11, KST12] also achieves optimal bounds for DP-SCO. However, objective perturbation is only known to satisfy privacy under some additional assumptions, most notably, the Hessian having rank at most $1$ at all points in the domain. The generalization analysis in this case is based on the uniform stability of the solution to strongly convex ERM. Aside from extending the analysis of this approach to population loss, we show that it can lead to algorithms for DP-SCO that use only a near-linear number of gradient evaluations (whenever these assumptions hold). In particular, we give a variant of objective perturbation in conjunction with stochastic variance reduced gradient descent (SVRG) with only $O(n \log n)$ gradient evaluations. We remark that the known lower bounds for uniform convergence [Fel16] hold even under the additional assumptions invoked in objective perturbation. Finding algorithms with near-linear running time in the general setting of SCO is a natural avenue for future research.
Our work highlights the importance of uniform stability as a tool for analysis of this important class of problems. We believe it should have applications to other differentially private statistical analyses.
Differentially private empirical risk minimization (ERM) is a well-studied area spanning over a decade [CM08, CMS11, JKT12, KST12, ST13, SCS13, DJW13, Ull15, JT14, BST14, TTZ15, STU17, WLK17, WYX17, INS19]. Aside from [BST14] and work in the local model of DP [DJW13] these works focus on achieving optimal empirical risk bounds under privacy. Our work builds heavily on algorithms and analyses developed in this line of work while contributing additional insights.
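We use $\mathcal{W} \subset \mathbb{R}^d$ to denote the parameter space, which is assumed to be a convex, compact set. We denote by $M$ the radius of $\mathcal{W}$, i.e., $\|w\| \le M$ for all $w \in \mathcal{W}$. We use $\mathcal{Z}$ to denote an arbitrary data domain and $\mathcal{D}$ to denote an arbitrary distribution over $\mathcal{Z}$. We let $\ell: \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$ be a loss function that takes a parameter vector $w \in \mathcal{W}$ and a data point $z \in \mathcal{Z}$ as inputs and outputs a real value.
The empirical loss of $w \in \mathcal{W}$ w.r.t. loss $\ell$ and dataset $S = (z_1, \ldots, z_n)$ is defined as $\widehat{\mathcal{L}}(w; S) \triangleq \frac{1}{n} \sum_{i=1}^{n} \ell(w, z_i)$. The excess empirical loss of $w$ is defined as $\widehat{\mathcal{L}}(w; S) - \min_{\tilde{w} \in \mathcal{W}} \widehat{\mathcal{L}}(\tilde{w}; S)$. The population loss of $w \in \mathcal{W}$ with respect to a loss $\ell$ and a distribution $\mathcal{D}$ over $\mathcal{Z}$ is defined as $\mathcal{L}(w; \mathcal{D}) \triangleq \mathbb{E}_{z \sim \mathcal{D}}[\ell(w, z)]$. The excess population loss of $w$ is defined as $\mathcal{L}(w; \mathcal{D}) - \min_{\tilde{w} \in \mathcal{W}} \mathcal{L}(\tilde{w}; \mathcal{D})$.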
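To make the empirical quantities concrete, the following is a minimal sketch for one-dimensional mean estimation. The squared loss, the interval $[-M, M]$ standing in for the parameter space, and all function names are assumptions made for this illustration; they are not taken from the paper.

```python
from typing import Callable, Sequence


def empirical_loss(loss: Callable[[float, float], float],
                   w: float, dataset: Sequence[float]) -> float:
    """Average of loss(w, z) over the dataset."""
    return sum(loss(w, z) for z in dataset) / len(dataset)


def squared_loss(w: float, z: float) -> float:
    """Illustrative convex loss: 0.5 * (w - z)^2."""
    return 0.5 * (w - z) ** 2


def erm_squared_loss(dataset: Sequence[float], radius: float) -> float:
    """Empirical minimizer over [-radius, radius]: the clipped sample mean."""
    mean = sum(dataset) / len(dataset)
    return max(-radius, min(radius, mean))


def excess_empirical_loss(w: float, dataset: Sequence[float],
                          radius: float) -> float:
    """Empirical loss of w minus the minimum empirical loss over the domain."""
    w_star = erm_squared_loss(dataset, radius)
    return (empirical_loss(squared_loss, w, dataset)
            - empirical_loss(squared_loss, w_star, dataset))
```

The population counterparts replace the sample average by an expectation over the data distribution, which is generally unknown and can only be estimated.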
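Definition 2.1 (Uniform stability).
Let $\alpha > 0$. A (randomized) algorithm $\mathcal{A}: \mathcal{Z}^n \to \mathcal{W}$ is $\alpha$-uniformly stable (w.r.t. loss $\ell: \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$) if for any pair $S, S' \in \mathcal{Z}^n$ differing in at most one data point, we have
$$\sup_{z \in \mathcal{Z}} \; \mathbb{E}_{\mathcal{A}}\left[\ell(\mathcal{A}(S), z) - \ell(\mathcal{A}(S'), z)\right] \le \alpha,$$
where the expectation is taken only over the internal randomness of $\mathcal{A}$.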
We will use the following simple generalization property of stability that upper bounds the expectation of population loss. Our bounds on excess population loss can also be shown to hold (up to log factors) with high probability using the results from [FV19].
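Lemma 2.2 ([BE02]).
Let $\mathcal{A}: \mathcal{Z}^n \to \mathcal{W}$ be an $\alpha$-uniformly stable algorithm w.r.t. loss $\ell: \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$. Let $\mathcal{D}$ be any distribution over $\mathcal{Z}$, and let $S \sim \mathcal{D}^n$. Then,
$$\mathbb{E}_{S \sim \mathcal{D}^n, \mathcal{A}}\left[\mathcal{L}(\mathcal{A}(S); \mathcal{D}) - \widehat{\mathcal{L}}(\mathcal{A}(S); S)\right] \le \alpha.$$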
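Definition 2.3 (Smooth function).
Let $\beta > 0$. A differentiable function $f: \mathcal{W} \to \mathbb{R}$ is $\beta$-smooth over $\mathcal{W}$ if for every $w, v \in \mathcal{W}$ we have
$$f(v) \le f(w) + \langle \nabla f(w), v - w \rangle + \frac{\beta}{2} \|w - v\|^2.$$
In the sequel, whenever we attribute a property (e.g., convexity, Lipschitz property, smoothness, etc.) to a loss function $\ell$, we mean that for every data point $z \in \mathcal{Z}$ the loss $\ell(\cdot, z)$ possesses that property over $\mathcal{W}$.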
Stochastic Convex Optimization (SCO):
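Let $\mathcal{D}$ be an arbitrary (unknown) distribution over $\mathcal{Z}$, and $S = (z_1, \ldots, z_n)$ be a sequence of i.i.d. samples from $\mathcal{D}$. Let $\ell: \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$ be a convex loss function. A (possibly randomized) algorithm $\mathcal{A}$ for SCO uses the sample $S$ to generate an (approximate) minimizer $\widehat{w}_S$ for $\mathcal{L}(\cdot; \mathcal{D})$. We measure the accuracy of $\mathcal{A}$ by the expected excess population loss of its output parameter $\widehat{w}_S$, defined as:
$$\Delta\mathcal{L}(\mathcal{A}; \mathcal{D}) \triangleq \mathbb{E}\left[\mathcal{L}(\widehat{w}_S; \mathcal{D}) - \min_{w \in \mathcal{W}} \mathcal{L}(w; \mathcal{D})\right],$$
where the expectation is taken over the choice of $S \sim \mathcal{D}^n$ and any internal randomness in $\mathcal{A}$.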
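A randomized algorithm $\mathcal{A}$ is $(\epsilon, \delta)$-differentially private if, for any pair of datasets $S$ and $S'$ that differ in exactly one data point, and for all events $\mathcal{O}$ in the output range of $\mathcal{A}$, we have
$$\Pr[\mathcal{A}(S) \in \mathcal{O}] \le e^{\epsilon} \cdot \Pr[\mathcal{A}(S') \in \mathcal{O}] + \delta,$$
where the probability is taken over the random coins of $\mathcal{A}$. For meaningful privacy guarantees, the typical settings of the privacy parameters are $\epsilon < 1$ and $\delta \ll 1/n$.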
Differentially Private Stochastic Convex Optimization (DP-SCO):
An $(\epsilon, \delta)$-DP-SCO algorithm is an SCO algorithm that satisfies $(\epsilon, \delta)$-differential privacy.
3 Private SCO via Mini-batch Noisy SGD
In this section, we consider the setting where the loss $\ell$ is convex, Lipschitz, and smooth. We give a technique that is based on a mini-batch variant of the Noisy Stochastic Gradient Descent (NSGD) algorithm [BST14, ACG16] described in Figure 1.
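Since the algorithm box is not reproduced in this text, the following is a minimal sketch of a generic mini-batch noisy projected SGD of the kind described. Sampling with replacement, averaging the iterates, and projecting onto an $L_2$ ball are assumptions of this sketch, and all names are hypothetical. The step size, number of steps, batch size, and noise scale are left as caller-supplied parameters; their calibration to $(\epsilon, \delta)$ is not shown here.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]


def project_l2_ball(w: Vector, radius: float) -> Vector:
    """Euclidean projection onto the ball of the given radius."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= radius:
        return list(w)
    return [x * radius / norm for x in w]


def noisy_minibatch_sgd(grad: Callable[[Vector, Vector], Vector],
                        data: Sequence[Vector], dim: int, radius: float,
                        step_size: float, num_steps: int, batch_size: int,
                        noise_std: float, seed: int = 0) -> Vector:
    """Run noisy projected SGD and return the average of the iterates."""
    rng = random.Random(seed)
    w = [0.0] * dim
    running_sum = [0.0] * dim
    for _ in range(num_steps):
        # Sample a mini-batch uniformly with replacement (an assumption).
        batch = [data[rng.randrange(len(data))] for _ in range(batch_size)]
        avg_grad = [0.0] * dim
        for z in batch:
            g = grad(w, z)
            for j in range(dim):
                avg_grad[j] += g[j] / batch_size
        # Add independent Gaussian noise to the averaged gradient.
        noisy = [avg_grad[j] + rng.gauss(0.0, noise_std) for j in range(dim)]
        w = project_l2_ball(
            [w[j] - step_size * noisy[j] for j in range(dim)], radius)
        for j in range(dim):
            running_sum[j] += w[j]
    return [s / num_steps for s in running_sum]
```

Note that the noise is added to the averaged (batch-normalized) gradient, which is the normalization convention the privacy discussion in this section refers to.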
Theorem 3.1 (Privacy guarantee of $\mathcal{A}_{\mathsf{NSGD}}$).
Algorithm 1 is $(\epsilon, \delta)$-differentially private.
Proof. The proof follows from [ACG16, Theorem 1], which gives a tight privacy analysis for mini-batch NSGD via the Moments Accountant technique and privacy amplification via sampling. We note that the setting of the mini-batch size in Step 2 of Algorithm 1 satisfies the condition in [ACG16, Theorem 1] (we obtain here an explicit value for the universal constants in the aforementioned theorem in that reference). We also note that the Gaussian noise in [ACG16] is not normalized by the mini-batch size, and hence the noise variance reported in [ACG16, Theorem 1] is larger than our setting of $\sigma^2$ by a factor of $m^2$, where $m$ is the mini-batch size. ∎
The population loss attained by $\mathcal{A}_{\mathsf{NSGD}}$ is given by the next theorem.
Theorem 3.2 (Excess population loss of $\mathcal{A}_{\mathsf{NSGD}}$).
Let be any distribution over and let . Suppose . Let and . Then,
Before proving the above theorem, we first state and prove the following useful lemmas.
Lemma 3.3.
Let . Suppose the parameter set is convex and -bounded. For any the excess empirical loss of satisfies
where the expectation is taken with respect to the choice of the mini-batch (step 5) and the independent Gaussian noise vectors .
The following lemma is a simple extension of the results on uniform stability of GD methods that appeared in [HRS15] and [FV19, Lemma 4.3] to the case of mini-batch noisy SGD. For completeness, we provide a proof in Appendix A.
Lemma 3.4.
In , suppose where is the smoothness parameter of . Then, is -uniformly stable with .
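Proof of Theorem 3.2
where (1) follows from the fact that $\mathbb{E}_{S}\left[\min_{w \in \mathcal{W}} \widehat{\mathcal{L}}(w; S)\right] \le \min_{w \in \mathcal{W}} \mathcal{L}(w; \mathcal{D})$. Optimizing the above bound in $\eta$ and $T$ yields the values in the theorem statement for these parameters, as well as the stated bound on the excess population loss.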
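The generalization step behind a bound of this kind can be written out as follows. This is a sketch of the standard decomposition, not a reproduction of the elided display: $\widehat{w}$ denotes the output of an $\alpha$-uniformly stable algorithm on $S \sim \mathcal{D}^n$, and $\mathcal{L}$, $\widehat{\mathcal{L}}$ denote the population and empirical losses.

```latex
\begin{aligned}
\mathbb{E}\big[\mathcal{L}(\widehat{w};\mathcal{D})\big]
  - \min_{w\in\mathcal{W}} \mathcal{L}(w;\mathcal{D})
&\le \mathbb{E}\big[\mathcal{L}(\widehat{w};\mathcal{D})
      - \widehat{\mathcal{L}}(\widehat{w};S)\big]
   + \mathbb{E}\big[\widehat{\mathcal{L}}(\widehat{w};S)
      - \min_{w\in\mathcal{W}} \widehat{\mathcal{L}}(w;S)\big] \\
&\le \alpha
   + \mathbb{E}\big[\widehat{\mathcal{L}}(\widehat{w};S)
      - \min_{w\in\mathcal{W}} \widehat{\mathcal{L}}(w;S)\big].
\end{aligned}
```

The first inequality uses $\mathbb{E}_S[\min_w \widehat{\mathcal{L}}(w;S)] \le \min_w \mathbb{E}_S[\widehat{\mathcal{L}}(w;S)] = \min_w \mathcal{L}(w;\mathcal{D})$, and the second is the stability-to-generalization property of Lemma 2.2. The remaining term is the expected excess empirical loss, which is what the optimization analysis controls.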
4 Private SCO for Non-smooth Losses
In this section, we consider the setting where the convex loss is non-smooth. First, we show a generic reduction to the smooth case by employing the smoothing technique known as Moreau-Yosida regularization (a.k.a. Moreau envelope smoothing) [Nes05]. Given an appropriately smoothed version of the loss, we obtain the optimal population loss w.r.t. the original non-smooth loss function. Computing the smoothed loss via this technique is generally computationally inefficient. Hence, we move on to describe a computationally efficient algorithm for the non-smooth case with essentially optimal population loss. Our construction is based on an adaptation of our noisy SGD algorithm (Algorithm 1) that exploits some useful properties of Moreau-Yosida smoothing technique that stem from its connection to proximal operations.
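Definition 4.1 (Moreau envelope).
Let $f: \mathcal{W} \to \mathbb{R}$ be a convex function, and $\beta > 0$. The $\beta$-Moreau envelope of $f$ is a function $f_\beta: \mathcal{W} \to \mathbb{R}$ defined as
$$f_\beta(w) = \min_{v \in \mathcal{W}} \left( f(v) + \frac{\beta}{2} \|w - v\|^2 \right).$$
The Moreau envelope has a direct connection with the proximal operator of a function, defined below.
Definition 4.2 (Proximal operator).
The prox operator of $f: \mathcal{W} \to \mathbb{R}$ is defined as
$$\mathrm{prox}_f(w) = \arg\min_{v \in \mathcal{W}} \left( f(v) + \frac{1}{2} \|w - v\|^2 \right).$$
It follows that the Moreau envelope can be written as
$$f_\beta(w) = f\left(\mathrm{prox}_{f/\beta}(w)\right) + \frac{\beta}{2} \left\|w - \mathrm{prox}_{f/\beta}(w)\right\|^2.$$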
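As a concrete illustration, the following sketch works out the envelope for the one-dimensional absolute value $f(v) = |v|$ over the whole real line, where the prox step has a closed form (soft-thresholding). The choice of $f$, the unconstrained domain, and the parameterization $\min_v \left(f(v) + \frac{\beta}{2}(w - v)^2\right)$ are assumptions of this example.

```python
import math


def prox_abs(w: float, beta: float) -> float:
    """argmin_v |v| + (beta/2) * (w - v)^2, i.e. soft-thresholding at 1/beta."""
    shrunk = max(abs(w) - 1.0 / beta, 0.0)
    return math.copysign(shrunk, w)


def moreau_envelope_abs(w: float, beta: float) -> float:
    """Value of the beta-Moreau envelope of |.| at w, via the prox point."""
    v = prox_abs(w, beta)
    return abs(v) + 0.5 * beta * (w - v) ** 2


def moreau_gradient_abs(w: float, beta: float) -> float:
    """Gradient of the envelope, computed as beta * (w - prox point)."""
    return beta * (w - prox_abs(w, beta))
```

For this $f$ the envelope is the Huber function: quadratic ($\beta w^2 / 2$) for $|w| \le 1/\beta$ and linear ($|w| - 1/(2\beta)$) otherwise, so it lies below $|w|$ by at most $1/(2\beta)$ and its gradient never exceeds 1 in magnitude.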
The following lemma states some useful, known properties of Moreau envelope.
Lemma 4.3.
Let be a convex, -Lipschitz function, and let . The -Moreau envelope satisfies the following:
is convex, -Lipschitz, and -smooth.
The convexity and -smoothness together with properties 2 and 3 are fairly standard and the proof can be found in the aforementioned references. The fact that is -Lipschitz follows easily from property 3. We include the proof of this fact in Appendix B for completeness.
Let be a convex, -Lipschitz loss. For any let denote the -Moreau envelope of For a dataset let be the empirical risk w.r.t. the -smoothed loss. For any distribution , let denote the corresponding population loss. The following theorem asserts that, with an appropriate setting for running over the -smoothed losses yields the optimal population loss w.r.t. the original non-smooth loss .
Theorem 4.4 (Excess population loss for non-smooth losses via smoothing).
Computationally efficient algorithm (NSGD + Prox)
Computing the Moreau envelope of a function is computationally inefficient in general. However, by property 3 of Lemma 4.3, we note that evaluating the gradient of Moreau envelope at any point can be attained by evaluating the proximal operator of the function at that point. Evaluating the proximal operator is equivalent to minimizing a strongly convex function (see Definition 4.2). This can be approximated efficiently, e.g., via gradient descent. Since our algorithm (Algorithm 1) requires only sufficiently accurate gradient evaluations, we can hence use an efficient, approximate proximal operator to approximate the gradient of the smoothed losses. The gradient evaluations in will thus be replaced with such approximate gradients evaluated via the approximate proximal operator. The resulting algorithm, referred to as , will approximately minimize the smoothed empirical loss without actually computing the smoothed losses.
Definition 4.5 (Approximate operator).
We say that is an -approximate proximal operator of for a function if
Fact 4.6.
Let be -bounded. Let be convex, -Lipschitz function. Suppose . For all , there is -approximate such that for each , computing requires time that is equivalent to at most gradient evaluations.
This fact follows from the fact that where . This is minimization of -strongly convex and -Lipschitz function over , The Lipschitz constant follows from the fact that . Hence, one can run ordinary Gradient Descent to obtain an approximate minimizer. From a standard result on convergence of GD for strongly convex and Lipschitz functions [Bub15], in gradient steps we obtain an approximate satisfying , where . Since is -strongly convex, we get .
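The argument above can be sketched in one dimension as follows. The step-size schedule $2/(\beta(s+1))$ with weighted iterate averaging is a standard choice for strongly convex, Lipschitz objectives and is an assumption here, since the exact schedule is not visible in this text. The function names and the interval domain are likewise hypothetical.

```python
from typing import Callable


def clip(x: float, radius: float) -> float:
    """Projection onto the interval [-radius, radius]."""
    return max(-radius, min(radius, x))


def approx_prox(subgrad: Callable[[float], float], w: float, beta: float,
                radius: float, steps: int) -> float:
    """Approximately minimize f(v) + (beta/2) * (v - w)^2 over [-radius, radius]
    by projected subgradient descent with weighted averaging."""
    v = clip(w, radius)
    weighted_sum = 0.0
    for s in range(1, steps + 1):
        weighted_sum += s * v  # later iterates receive larger weight
        g = subgrad(v) + beta * (v - w)  # subgradient of the prox objective
        v = clip(v - (2.0 / (beta * (s + 1))) * g, radius)
    return 2.0 * weighted_sum / (steps * (steps + 1))


def abs_subgradient(v: float) -> float:
    """A subgradient of |v| (0 is chosen at v = 0)."""
    return 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)
```

An approximate gradient of the smoothed loss at `w` can then be formed as `beta * (w - approx_prox(...))`, so an error in the prox point translates into a gradient error scaled by `beta`.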
Description of :
The algorithm description follows exactly the same lines as except that: (i) the input loss is now non-smooth, and (ii) for each iteration , the gradient evaluation for each data point in the mini-batch is replaced with the evaluation of an approximate gradient of the smoothed loss . The approximate gradient, denoted as , is computed using an approximate proximal operator. Namely,
where . Here, we use a computationally efficient -approximate like the one in Fact 4.6 with set as
Note that the approximation error in the gradient , and that where is the Lipschitz constant of .
Running time of :
if we use the approximate proximal operator in Fact 4.6, then it is easy to see that requires a number of gradient evaluations that is a factor of more than , where That is, the total number of gradient evaluations is where is the mini-batch size.
We now argue that privacy, stability, and accuracy of the algorithm are preserved under the approximate proximal operator.
Note that to bound the sensitivity of the approximate gradient of the mini-batch, it suffices to bound the norm of the approximate gradient. From the discussion above, note that we have Thus, the sensitivity remains basically the same as in the case where the algorithm is run with the exact gradients. Hence, the same privacy guarantee holds as in .
Note that the approximation error in the gradient of the mini-batch (due to the approximate proximal operation) can be viewed as a fixed error term of magnitude at most that is added to the exact gradient of the smoothed loss. It is well-known and easy to see that the effect of this additional approximation error on the standard convergence bounds is that excess empirical loss may grow by at most the error times the diameter of the domain (e.g. [NB10, FGV15]). Hence, compared to the error bound error in Lemma 3.3, the bound we get incurs an additional term of . Clearly, this additional error is dominated by the other terms in the empirical loss bound in Lemma 3.3, and thus will have no significant impact on the final bound.
This easily follows from the following facts. First, note that the additional approximation error due to gradient approximation is . Second, the gradient update w.r.t. the exact gradient of the smoothed loss is non-expansive operation (which is the key fact in proving uniform stability of (stochastic) gradient methods [HRS15, FV19]), and hence the approximation error in the gradient is not going to be amplified by the gradient update step. Hence, for any trajectory of approximate gradient updates, the accumulated approximation error in the final output cannot exceed . This cannot increase the final uniform stability bound by more than an additive term of . Thus, we obtain basically the same bound in Lemma 3.4.
Putting these together, we have argued that is computationally efficient algorithm that achieves the optimal population loss bound in Theorem 4.4.
5 Private SCO via Objective Perturbation
In this section, we show that the technique known as objective perturbation [CMS11, KST12] can be used to attain optimal population loss for a large subclass of convex, smooth losses. In objective perturbation, the empirical loss is first perturbed by adding two terms: a noisy linear term and a regularization term. As shown in [CMS11, KST12], under some additional assumptions on the Hessian of the loss, an appropriate random perturbation ensures differential privacy. The excess empirical loss of this technique for smooth convex losses was originally analyzed in the aforementioned works, and was shown to be optimal by the lower bound in [BST14]. We revisit this technique and show that the regularization term added for privacy can be used to attain the optimal excess population loss by exploiting the stability-inducing property of regularization.
Assumption 5.1.
For all is twice-differentiable, and the rank of its Hessian at any is at most .
The regularization term as appears in is of different scaling than the one that appears in [KST12]. In particular, the regularization term in [KST12] is normalized by , whereas here it is not. Hence, whenever the results from [KST12] are used here, the regularization parameter in their statements should be replaced with . This presentation choice is more consistent with literature on regularization.
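To illustrate the structure of objective perturbation, here is a minimal one-dimensional sketch. It assumes a perturbed objective of the form $\widehat{\mathcal{L}}(w; S) + \langle G, w \rangle / n + \lambda \|w\|^2$ with Gaussian $G$, which is an assumption about the algorithm box not reproduced in this text. The squared loss and the function name are chosen only because they admit a closed-form minimizer. The noise scale is a caller-supplied parameter; its calibration to $(\epsilon, \delta)$ is not shown.

```python
import random
from typing import Sequence


def objective_perturbation_1d(data: Sequence[float], radius: float,
                              reg_lambda: float, noise_std: float,
                              seed: int = 0) -> float:
    """Minimize (1/n) * sum 0.5*(w - z_i)^2 + G*w/n + reg_lambda*w^2
    over [-radius, radius], where G is Gaussian with std noise_std."""
    rng = random.Random(seed)
    n = len(data)
    noise = rng.gauss(0.0, noise_std)
    mean = sum(data) / n
    # Setting the derivative (w - mean) + noise/n + 2*reg_lambda*w to zero.
    w_unconstrained = (mean - noise / n) / (1.0 + 2.0 * reg_lambda)
    # For a one-dimensional convex objective, the constrained minimizer
    # is the unconstrained one clipped to the interval.
    return max(-radius, min(radius, w_unconstrained))
```

In this sketch the regularizer is not divided by $n$, matching the scaling convention described in the remark above.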
The privacy guarantee of $\mathcal{A}_{\mathsf{ObjP}}$ is given in the following theorem, which follows directly from [KST12].
Theorem 5.2 (Privacy guarantee of $\mathcal{A}_{\mathsf{ObjP}}$, restatement of Theorem 2 in [KST12]).
Suppose that Assumption 5.1 holds and that the smoothness parameter satisfies . Then, is -differentially private.
We now state our main result for this section, showing that, with an appropriate setting for $\lambda$, $\mathcal{A}_{\mathsf{ObjP}}$ yields asymptotically optimal excess population loss.
Theorem 5.3 (Excess population loss of $\mathcal{A}_{\mathsf{ObjP}}$).
Let be any distribution over and let . Suppose that Assumption 5.1 holds. Suppose that is -bounded. In set Then, we have
To prove the above theorem, we use the following lemmas.
Lemma 5.4 (Excess empirical loss of $\mathcal{A}_{\mathsf{ObjP}}$, restatement of Theorem 26 in [KST12]).
Let . Under Assumption 5.1, the excess empirical loss of satisfies
where the expectation is taken over the Gaussian noise in .
The next lemma states the well-known stability property of regularized empirical risk minimization.
Lemma 5.5 ([SSBD14]).
Let be a convex, -Lipschitz loss, and let . Let . Let be an algorithm that outputs where Then, is -uniformly stable.
Proof of Theorem 5.3
Fix any realization of the noise vector . For every define Note that is -Lipschitz. For any dataset define Hence, the output of on input dataset can be written as . Define Thus, for any fixed by combining Lemma 5.5 with Lemma 2.2, we have On the other hand, note that for any dataset we always have since the linear term cancels out. Hence, the expected generalization error (w.r.t. ) satisfies
Now, by taking expectation over as well, we arrive at
where we assume (since otherwise we would have the trivial error).
Now, observe that:
where the second inequality follows from the fact that , and the last bound follows from combining (2) with Lemma 5.4. Optimizing this bound in yields the setting of in the theorem statement. Plugging that setting of into the bound yield the stated bound on the excess population loss.
A note on the rank assumption: While in this section we presented our result under the assumption that rank of is at most one, one can extend the analysis (by using similar argument in [INS19]) to a rank of without affecting the asymptotic population loss guarantees. In general, to ensure differential privacy to , one only needs the following assumption involving the Hessian of individual losses: for all and , rather than a constraint on the rank.
5.1 Oracle Efficient Objective Perturbation
The privacy guarantee of the standard objective perturbation technique is given only when the output is the exact minimizer [CMS11, KST12]. In practice, we usually cannot attain the exact minimizer, but rather obtain an approximate minimizer via efficient optimization methods. Hence, in this section we focus on providing a practical version of algorithm , called approximate objective perturbation (Algorithm ), that i) is -differentially private, ii) achieves nearly the same population loss as , and iii) only makes evaluations of the gradient at any and . The main idea in is to first obtain a that ensures is at most , and then perturb with Gaussian noise to “fuzz” the difference between and the true minimizer. In this work, we use Stochastic Variance Reduced Gradient Descent (SVRG) [JZ13, XZ14] as the optimization algorithm. This leads to a construction that requires near linear oracle complexity (i.e., number of gradient evaluations). In particular, achieves oracle complexity of and asymptotically optimal excess population loss.
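The two-stage idea (optimize to within a target distance of the minimizer, then add Gaussian noise to the output) can be sketched as follows. This sketch substitutes plain gradient descent for SVRG and assumes an unconstrained, $\mu$-strongly convex, differentiable objective, so that a gradient norm of at most $\mu \alpha$ certifies that the iterate is within distance $\alpha$ of the minimizer. All names and parameters are hypothetical, and the calibration of the output noise to $(\epsilon, \delta)$ is not shown.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def approx_minimize_then_perturb(grad_objective: Callable[[Vector], Vector],
                                 dim: int, strong_convexity: float,
                                 step_size: float, alpha: float,
                                 output_noise_std: float,
                                 max_steps: int = 100000,
                                 seed: int = 0) -> Vector:
    """Find a point within distance alpha of the minimizer, then add noise."""
    rng = random.Random(seed)
    w = [0.0] * dim
    for _ in range(max_steps):
        g = grad_objective(w)
        grad_norm = math.sqrt(sum(x * x for x in g))
        # Strong convexity gives ||w - w*|| <= ||grad|| / mu, so this
        # stopping rule certifies distance at most alpha to the minimizer.
        if grad_norm <= strong_convexity * alpha:
            break
        w = [w[j] - step_size * g[j] for j in range(dim)]
    else:
        raise RuntimeError("did not reach the target accuracy")
    # Gaussian output noise masks the gap to the exact minimizer.
    return [w[j] + rng.gauss(0.0, output_noise_std) for j in range(dim)]
```

A variance-reduced method such as SVRG would replace the inner loop to reduce the number of per-example gradient evaluations; the stopping criterion and the final noise addition would stay the same in spirit.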
Theorem 5.6 (Privacy guarantee of ).
Suppose that Assumption 5.1 holds and that the smoothness parameter satisfies . Then, Algorithm is -differentially private.
Let , and , where is the optimizer defined in Algorithm . Notice that one can compute from the tuple by simple post-processing. Furthermore, the algorithm that outputs is -differentially private by Theorem 5.2. In the following, we will bound in order to make differentially private, conditioned on the knowledge of .
As is -strongly convex,