# Differentially Private Empirical Risk Minimization: Efficient Algorithms and Tight Error Bounds

In this paper, we initiate a systematic investigation of differentially private algorithms for convex empirical risk minimization. Various instantiations of this problem have been studied before. We provide new algorithms and matching lower bounds for private ERM assuming only that each data point's contribution to the loss function is Lipschitz bounded and that the domain of optimization is bounded. We provide a separate set of algorithms and matching lower bounds for the setting in which the loss functions are known to also be strongly convex. Our algorithms run in polynomial time, and in some cases even match the optimal non-private running time (as measured by oracle complexity). We give separate algorithms (and lower bounds) for (ϵ,0)- and (ϵ,δ)-differential privacy; perhaps surprisingly, the techniques used for designing optimal algorithms in the two cases are completely different. Our lower bounds apply even to very simple, smooth function families, such as linear and quadratic functions. This implies that algorithms from previous work can be used to obtain optimal error rates, under the additional assumption that the contribution of each data point to the loss function is smooth. We show that simple approaches to smoothing arbitrary loss functions (in order to apply previous techniques) do not yield optimal error rates. In particular, optimal algorithms were not previously known for problems such as training support vector machines and the high-dimensional median.


## 1 Introduction

Convex optimization is one of the most basic and powerful computational tools in statistics and machine learning. It is most commonly used for empirical risk minimization (ERM): the data set $D=\{d_1,\ldots,d_n\}$ defines a convex loss function $L(\cdot\,;D)$ which is minimized over a convex set $\mathcal{C}$. When run on sensitive data, however, the results of convex ERM can leak sensitive information. For example, medians and support vector machine parameters can, in many cases, leak entire records in the clear (see “Motivation”, below).

In this paper, we provide new algorithms and matching lower bounds for differentially private convex ERM assuming only that each data point’s contribution to the loss function is Lipschitz and that the domain of optimization is bounded. This builds on a line of work started by Chaudhuri et al. [11].

#### Problem formulation.

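Given a data set $D=\{d_1,\ldots,d_n\}$ drawn from a universe $\mathcal{X}$, and a closed, convex set $\mathcal{C}$, our goal is to

$$\text{minimize}\quad L(\theta;D)=\sum_{i=1}^{n}\ell(\theta;d_i)\quad\text{over}\quad\theta\in\mathcal{C}.$$

The map $\ell$ defines, for each data point $d$, a loss function $\ell(\cdot\,;d)$ on $\mathcal{C}$. We will generally assume that $\ell(\cdot\,;d)$ is convex and $L$-Lipschitz for all $d\in\mathcal{X}$. One obtains variants on this basic problem by assuming additional restrictions, such as (i) that $\ell(\cdot\,;d)$ is $\Delta$-strongly convex for all $d\in\mathcal{X}$, and/or (ii) that $\ell(\cdot\,;d)$ is $\beta$-smooth for all $d\in\mathcal{X}$. Definitions of Lipschitz, strong convexity and smoothness are provided at the end of the introduction.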

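For example, given a collection of data points in $\mathbb{R}^p$, the Euclidean 1-median is a point in $\mathbb{R}^p$ that minimizes the sum of the Euclidean distances to the data points. That is, $\ell(\theta;d_i)=\|\theta-d_i\|_2$, which is 1-Lipschitz in $\theta$ for any choice of $d_i$. Another common example is the support vector machine (SVM): given a data point $d_i=(x_i,y_i)\in\mathbb{R}^p\times\{-1,1\}$, one defines a loss function $\ell(\theta;d_i)=\mathrm{hinge}(y_i\cdot\langle\theta,x_i\rangle)$, where $\mathrm{hinge}(z)=(1-z)_+$ (here $(1-z)_+$ equals $1-z$ for $z\le 1$ and 0, otherwise). The loss is $L$-Lipschitz in $\theta$ when $\|x_i\|_2\le L$.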
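The two example losses can be written down directly. The following is a minimal sketch using only the Python standard library; the function names `median_loss`, `hinge_loss` and `empirical_risk` are illustrative and do not come from the paper.

```python
import math

def median_loss(theta, d):
    """Euclidean 1-median loss: the distance from theta to the data point d."""
    return math.dist(theta, d)

def hinge_loss(theta, d):
    """SVM hinge loss (1 - y * <theta, x>)_+ for a labelled point d = (x, y)."""
    x, y = d
    margin = y * sum(t * xi for t, xi in zip(theta, x))
    return max(0.0, 1.0 - margin)

def empirical_risk(theta, data, loss):
    """L(theta; D): the sum (not the average) of the per-point losses."""
    return sum(loss(theta, d) for d in data)

# Two points on a line: every theta between them has total distance 2.
points = [[0.0, 0.0], [2.0, 0.0]]
risk_at_midpoint = empirical_risk([1.0, 0.0], points, median_loss)
```

Note that the empirical risk here is a sum over the data set, matching the problem formulation above, so its scale grows with the number of data points.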

Our formulation also captures regularized ERM, in which an additional (convex) function $r(\theta)$ is added to the loss function to penalize certain types of solutions; the loss function is then $r(\theta)+\sum_{i=1}^{n}\ell(\theta;d_i)$. One can fold the regularizer into the data-dependent functions by replacing $\ell(\theta;d_i)$ with $\tilde{\ell}(\theta;d_i)=\ell(\theta;d_i)+\frac{1}{n}r(\theta)$, so that $L(\theta;D)=\sum_{i}\tilde{\ell}(\theta;d_i)$. This folding comes at some loss of generality (since it may increase the Lipschitz constant), but it does not affect asymptotic results. Note that if $r$ is $n\Delta$-strongly convex, then every $\tilde{\ell}$ is $\Delta$-strongly convex.

We measure the success of our algorithms by the worst-case (over inputs) expected excess empirical risk, namely

$$\mathbb{E}\big(L(\hat{\theta};D)-L(\theta^{*};D)\big), \tag{1}$$

where $\hat{\theta}$ is the output of the algorithm, $\theta^{*}=\arg\min_{\theta\in\mathcal{C}}L(\theta;D)$ is the true minimizer, and the expectation is only over the coins of the algorithm. Expected risk guarantees can be converted to high-probability guarantees using standard amplification techniques (see Appendix D for details).

Another important measure of performance is an algorithm’s (excess) generalization error, where loss is measured with respect to the average over an unknown distribution from which the data are assumed to be drawn i.i.d. Our upper bounds on empirical risk imply upper bounds on generalization error (via uniform convergence and similar ideas); the resulting bounds are only known to be tight in certain ranges of parameters, however. Detailed statements may be found in Appendix F.

#### Motivation.

Convex ERM is used for fitting models ranging from simple least-squares regression to support vector machines, and its use may have significant implications for privacy. As a simple example, note that the Euclidean 1-median of a data set will typically be an actual data point, since the gradient of the loss function has discontinuities at each of the $d_i$. (Thinking about the one-dimensional median, where there is always a data point that minimizes the loss, is helpful.) Thus, releasing the median may well reveal one of the data points in the clear. A more subtle example is the support vector machine (SVM). The solution to an SVM program is often presented in its dual form, whose coefficients typically consist of a set of exact data points. Kasiviswanathan et al. [32] show how the results of many convex ERM problems can be combined to carry out reconstruction attacks in the spirit of Dinur and Nissim [16].

#### Differential privacy.

Differential privacy is a rigorous notion of privacy that emerged from a line of work in theoretical computer science and cryptography [19, 6, 21]. We say two data sets $D$ and $D'$ of size $n$ are neighbors if they differ in one entry (that is, $|D\,\triangle\,D'|=2$). A randomized algorithm $\mathcal{A}$ is $(\epsilon,\delta)$-differentially private (Dwork et al. [21, 20]) if, for all neighboring data sets $D$ and $D'$ and for all events $S$ in the output space of $\mathcal{A}$, we have

$$\Pr(\mathcal{A}(D)\in S)\le e^{\epsilon}\Pr(\mathcal{A}(D')\in S)+\delta.$$
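To make the inequality concrete, the sketch below checks it exhaustively for a mechanism with a finite output space. Randomized response on a single bit is used purely as an illustration (it is not an algorithm from this paper), and the helper names are hypothetical. The check relies on the fact that, over a finite output space, the worst event $S$ collects exactly the outcomes whose probability under one data set exceeds $e^{\epsilon}$ times the probability under the other.

```python
import math

def randomized_response(bit, epsilon):
    """Output distribution of randomized response on one private bit."""
    keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return {bit: keep, 1 - bit: 1.0 - keep}

def dp_violation(p, q, epsilon):
    """Smallest delta such that p(S) <= e^eps * q(S) + delta for every event S."""
    outcomes = set(p) | set(q)
    return sum(max(p.get(o, 0.0) - math.exp(epsilon) * q.get(o, 0.0), 0.0)
               for o in outcomes)

def is_dp(dist_d, dist_d_prime, epsilon, delta=0.0, tol=1e-12):
    """Check the definition in both directions for one pair of neighbours."""
    return (dp_violation(dist_d, dist_d_prime, epsilon) <= delta + tol and
            dp_violation(dist_d_prime, dist_d, epsilon) <= delta + tol)

# Neighbouring one-entry data sets: the private bit is 0 or 1.
on_zero = randomized_response(0, epsilon=1.0)
on_one = randomized_response(1, epsilon=1.0)
holds_at_1 = is_dp(on_zero, on_one, epsilon=1.0)      # mechanism run at epsilon = 1
holds_at_half = is_dp(on_zero, on_one, epsilon=0.5)   # stricter claim, same mechanism
```

The small tolerance `tol` only absorbs floating-point rounding; it is not part of the definition.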

Algorithms that satisfy differential privacy for $\epsilon<1$ and $\delta\ll 1/n$ provide meaningful privacy guarantees, even in the presence of side information. In particular, they avoid the problems mentioned in “Motivation” above. See Dwork [18], Kasiviswanathan and Smith [30], Kifer and Machanavajjhala [33] for discussion of the “semantics” of differential privacy.

#### Setting Parameters.

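We will aim to quantify the role of several basic parameters on the excess risk of differentially private algorithms: the size of the data set $n$, the dimension of the parameter space $p$, the Lipschitz constant $L$ of the loss functions, the diameter $\|\mathcal{C}\|_2$ of the constraint set and, when applicable, the strong convexity $\Delta$.

We may take $\|\mathcal{C}\|_2$ and $L$ to be 1 without loss of generality: We can set $\|\mathcal{C}\|_2=1$ by rescaling $\theta$ (replacing $\ell(\theta;d)$ by $\tilde{\ell}(\theta;d)=\ell(\theta\cdot\|\mathcal{C}\|_2;d)$ with $\theta\in\mathcal{C}/\|\mathcal{C}\|_2$); we can then set $L=1$ by rescaling the loss function (replacing $\ell$ by $\ell/L$). These two transformations change the excess risk by a factor of $L\|\mathcal{C}\|_2$. The parameter $\Delta$ cannot similarly be rescaled while keeping $L$ and $\|\mathcal{C}\|_2$ the same. However, we always have $\Delta\le 2L/\|\mathcal{C}\|_2$.

In the sequel, we thus focus on the setting where $L=1$ and $\|\mathcal{C}\|_2=1$. To convert excess risk bounds for $L=\|\mathcal{C}\|_2=1$ to the general setting, one can multiply the risk bounds by $L\|\mathcal{C}\|_2$, and replace $\Delta$ by $\Delta\|\mathcal{C}\|_2/L$.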
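As a worked check of this conversion, both rescalings can be applied in one step. Here $L$ is the Lipschitz constant, $\|\mathcal{C}\|_2$ the diameter of the constraint set and $\Delta$ the strong convexity parameter; the symbols $\tilde{\ell}$, $\tilde{\mathcal{C}}$, $\tilde{\theta}$ and $\tilde{\Delta}$ are notation introduced only for this sketch.

```latex
% Normalized loss on the rescaled domain.
\[
\tilde{\ell}(\tilde{\theta};d)=\frac{\ell(\|\mathcal{C}\|_2\,\tilde{\theta};d)}{L\,\|\mathcal{C}\|_2},
\qquad \tilde{\theta}\in\tilde{\mathcal{C}}=\mathcal{C}/\|\mathcal{C}\|_2,
\qquad \|\tilde{\mathcal{C}}\|_2=1 .
\]
% Lipschitz constant: the change of variables multiplies (sub)gradients by ||C||_2.
\[
\|\nabla\tilde{\ell}\|_2\le\frac{\|\mathcal{C}\|_2\cdot L}{L\,\|\mathcal{C}\|_2}=1 .
\]
% Strong convexity: curvature is multiplied by ||C||_2^2, then divided by L ||C||_2.
\[
\tilde{\Delta}=\frac{\|\mathcal{C}\|_2^{2}\,\Delta}{L\,\|\mathcal{C}\|_2}=\frac{\Delta\,\|\mathcal{C}\|_2}{L} .
\]
% Excess risk in the original problem, with \theta=\|\mathcal{C}\|_2\tilde{\theta}.
\[
\sum_{i=1}^{n}\ell(\theta;d_i)-\sum_{i=1}^{n}\ell(\theta^{*};d_i)
 = L\,\|\mathcal{C}\|_2\Big(\sum_{i=1}^{n}\tilde{\ell}(\tilde{\theta};d_i)-\sum_{i=1}^{n}\tilde{\ell}(\tilde{\theta}^{*};d_i)\Big).
\]
```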

### 1.1 Contributions

We give algorithms that significantly improve on the state of the art for optimizing non-smooth loss functions: for both the general case and strongly convex functions, we improve the excess risk bounds by a factor of $\sqrt{n}$, asymptotically. The algorithms we give for $(\epsilon,0)$- and $(\epsilon,\delta)$-differential privacy work on very different principles. We group the algorithms below by technique: gradient descent, exponential sampling, and localization.

For the purposes of this section, $\tilde{O}(\cdot)$ notation hides factors polynomial in $\log n$ and $\log(1/\delta)$. Detailed bounds are stated in Table 1.

#### Gradient descent-based algorithms.

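For $(\epsilon,\delta)$-differential privacy, we show that a noisy version of gradient descent achieves excess risk $\tilde{O}(\sqrt{p}/\epsilon)$. This matches our lower bound, $\Omega(\min(n,\sqrt{p}/\epsilon))$, up to logarithmic factors. (Note that every $\theta\in\mathcal{C}$ has excess risk at most $n$, so a lower bound of $n$ can always be matched.) For $\Delta$-strongly convex functions, a variant of our algorithm has risk $\tilde{O}\big(\frac{p}{\Delta n\epsilon^{2}}\big)$, which matches the lower bound $\Omega\big(\frac{p}{n\epsilon^{2}}\big)$ when $\Delta$ is bounded below by a constant (recall that $\Delta\le 2$ since $L=\|\mathcal{C}\|_2=1$).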

Previously, the best known risk bounds were for general convex functions and for -strongly convex functions (achievable via several different techniques (Chaudhuri et al. [11], Kifer et al. [34], Jain et al. [28], Duchi et al. [17])). Under the restriction that each data point’s contribution to the loss function is sufficiently smooth, objective perturbation [11, 34] also has risk (which is tight, since the lower bounds apply to smooth functions). However, smooth functions do not include important special cases such as medians and support vector machines. Chaudhuri et al. [11] suggest applying their technique to support vector machines by smoothing (“huberizing”) the loss function. We show in Appendix A that this approach still yields expected excess risk .

Although straightforward noisy gradient descent would work well in our setting, we present a faster variant based on stochastic gradient descent: At each step $t$, the algorithm samples a random point $d_i$ from the data set, computes a noisy version of $d_i$’s contribution to the gradient of $L$ at the current estimate $\tilde{\theta}_t$, and then uses that noisy measurement to update the parameter estimate. The algorithm is similar to algorithms that have appeared previously (Williams and McSherry [49] first investigated gradient descent with noisy updates; stochastic variants were studied by Jain et al. [28], Duchi et al. [17], Song et al. [46]). The novelty of our analysis lies in taking advantage of the randomness in the choice of $d_i$ (following Kasiviswanathan et al. [31]) to run the algorithm for many steps without a significant cost to privacy. Running the algorithm for $T=n^{2}$ steps gives the desired expected excess risk bound. Even nonprivate first-order algorithms (i.e., those based on gradient measurements) must learn information about the gradient at $\Omega(n^{2})$ points to get risk bounds that are independent of $n$ (this follows from “oracle complexity” bounds showing that a $1/\sqrt{T}$ convergence rate is optimal [39, 1]). Thus, the query complexity of our algorithm cannot be improved without using more information about the loss function, such as second derivatives.
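The update described above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: the constraint set is assumed to be a Euclidean ball centred at the origin, the noise level `sigma` is taken as an input (choosing it to meet a given privacy target is not done here), and all function and parameter names are hypothetical.

```python
import math
import random

def project_to_ball(theta, radius):
    """Euclidean projection onto the ball of the given radius centred at 0."""
    norm = math.sqrt(sum(v * v for v in theta))
    if norm <= radius:
        return list(theta)
    return [v * radius / norm for v in theta]

def noisy_sgd(data, grad, p, lipschitz, radius, sigma, steps, rng=None):
    """Noisy projected SGD on L(theta; D) = sum_i loss(theta; d_i).

    grad(theta, d) returns a (sub)gradient of one point's loss at theta.
    sigma is the per-coordinate standard deviation of the Gaussian noise.
    """
    rng = rng or random.Random()
    n = len(data)
    diameter = 2.0 * radius
    # Second-moment bound on the noisy gradient: n^2 L^2 + p sigma^2.
    g_bound = math.sqrt((n * lipschitz) ** 2 + p * sigma ** 2)
    theta = [0.0] * p
    for t in range(1, steps + 1):
        d = rng.choice(data)                       # one uniformly random point
        g = grad(theta, d)                         # its (sub)gradient
        noisy = [n * gi + rng.gauss(0.0, sigma) for gi in g]
        eta = diameter / (g_bound * math.sqrt(t))  # step size ~ ||C||_2 / (G sqrt(t))
        theta = project_to_ball(
            [th - eta * ni for th, ni in zip(theta, noisy)], radius)
    return theta

def median_grad(theta, d):
    """A subgradient of the one-dimensional median loss |theta - d|."""
    return [1.0 if theta[0] > d[0] else -1.0]

estimate = noisy_sgd([[1.0]] * 3, median_grad, p=1, lipschitz=1.0,
                     radius=2.0, sigma=0.0, steps=900, rng=random.Random(0))
```

Scaling the sampled gradient by `n` makes it an unbiased estimate of the gradient of the summed loss, which is why a single data point per step suffices. The call above uses `sigma=0.0`, so it is not private; it only exercises the update rule.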

The gradient descent approach does not, to our knowledge, allow one to get optimal excess risk bounds for $(\epsilon,0)$-differential privacy. The main obstacle is that “strong composition” of $(\epsilon,\delta)$-privacy (Dwork et al. [22]) appears necessary to allow a first-order method to run for sufficiently many steps.

#### Exponential Sampling-based Algorithms.

For $(\epsilon,0)$-differential privacy, we observe that a straightforward use of the exponential mechanism (sampling from an appropriately-sized net of points in $\mathcal{C}$, where each point $\theta$ has probability proportional to $\exp(-\epsilon L(\theta;D))$) has excess risk $\tilde{O}(p/\epsilon)$ on general Lipschitz functions, nearly matching the lower bound of $\Omega(p/\epsilon)$. (The bound would not be optimal for $(\epsilon,\delta)$-privacy because it scales as $p$, not $\sqrt{p}$.) This mechanism is inefficient in general since it requires construction of a net and an appropriate sampling mechanism.
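Sampling from a finite net in this way can be sketched in a few lines. The version below is the textbook exponential mechanism with an explicit sensitivity parameter and the usual factor of 2 in the exponent; that normalization is an assumption of this sketch and may differ from the constants used in the paper. The function and parameter names are illustrative.

```python
import math
import random

def exponential_mechanism(candidates, empirical_risk, epsilon, sensitivity,
                          rng=None):
    """Sample a candidate with probability proportional to
    exp(-epsilon * empirical_risk(candidate) / (2 * sensitivity))."""
    rng = rng or random.Random()
    scores = [-epsilon * empirical_risk(c) / (2.0 * sensitivity)
              for c in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]

# A one-dimensional net over [-1, 1] and a median-style loss on three points.
net = [i / 10.0 for i in range(-10, 11)]
data = [0.2, 0.3, 0.9]

def risk(theta):
    return sum(abs(theta - d) for d in data)

sample = exponential_mechanism(net, risk, epsilon=1.0, sensitivity=2.0,
                               rng=random.Random(0))
```

The cost of this approach is the size of the net, which grows exponentially with the dimension; that is the inefficiency the paragraph above refers to.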

We give a polynomial time algorithm that achieves the optimal excess risk, namely $O(p/\epsilon)$. Note that the achieved excess risk does not have any logarithmic factors; we show this using a “peeling”-type argument that is specific to convex functions. The idea of our algorithm is to sample efficiently from the continuous distribution on all points in $\mathcal{C}$ with density $\propto e^{-\epsilon L(\theta;D)}$. Although the distribution we hope to sample from is log-concave, standard techniques do not work for our purposes: existing methods converge only in statistical difference, whereas we require a multiplicative convergence guarantee to provide $(\epsilon,0)$-differential privacy. Previous solutions to this issue (Hardt and Talwar [25]) worked for the uniform distribution, but not for general log-concave distributions.

The problem comes from the combination of an arbitrary convex set $\mathcal{C}$ and an arbitrary (Lipschitz) loss function defining the target distribution. We circumvent this issue by giving an algorithm that samples from an appropriately defined distribution on a cube containing $\mathcal{C}$, such that (i) the algorithm outputs a point in $\mathcal{C}$ with constant probability, and (ii) the sampled distribution, conditioned on sampling from $\mathcal{C}$, is within multiplicative distance $O(\epsilon)$ from the correct distribution. We use, as a subroutine, the random walk on grid points of the cube of [2].

#### Localization: Optimal Algorithms for Strongly Convex Functions.

The exponential-sampling-based technique discussed above does not take advantage of strong convexity of the loss function. We show, however, that a novel combination of two standard techniques (the exponential mechanism and Laplace-noise-based output perturbation) does yield an optimal algorithm. Chaudhuri et al. [11] and [41] showed that strongly convex functions have low-sensitivity minimizers, and hence that one can release the minimum of a strongly convex function with Laplace noise (with total Euclidean length about $\rho=\frac{p}{\epsilon\Delta n}$ if each loss function is $\Delta$-strongly convex). Simply using this first estimate as a candidate output does not yield optimal utility in general; instead it gives a risk bound of roughly $\frac{p}{\epsilon\Delta}$.

The main insight is that this first estimate defines a small neighborhood $\mathcal{C}_0\subseteq\mathcal{C}$, of radius about $\rho$, that contains the true minimizer. Running the exponential mechanism in this small set improves the excess risk bound by a factor of about $1/\rho$ over running the same mechanism on all of $\mathcal{C}$. The final risk bound is then $\tilde{O}\big(\rho\,\frac{p}{\epsilon}\big)=\tilde{O}\big(\frac{p^{2}}{\Delta\epsilon^{2}n}\big)$, which matches the lower bound of $\Omega\big(\frac{p^{2}}{\epsilon^{2}n}\big)$ when $\Delta=\Omega(1)$. This simple “localization” idea is not needed for $(\epsilon,\delta)$-privacy, since the gradient descent method can already take advantage of strong convexity to converge more quickly.

#### Lower Bounds.

We use techniques developed to bound the accuracy of releasing 1-way marginals (due to Hardt and Talwar [25] for $(\epsilon,0)$- and Bun et al. [8] for $(\epsilon,\delta)$-privacy) to show that our algorithms have essentially optimal risk bounds. The instances that arise in our lower bounds are simple: the functions can be linear (or quadratic, for the case of strong convexity) and the constraint set can be either the unit ball or the hypercube. In particular, our lower bounds apply to the special case of smooth functions, demonstrating the optimality of objective perturbation [11, 34] in that setting. The reduction to lower bounds for 1-way marginals is not quite black-box; we exploit specific properties of the instances used by Hardt and Talwar [25], Bun et al. [8].

Finally, we provide a much stronger lower bound on the utility of a specific algorithm, the Huberization-based algorithm proposed by Chaudhuri et al. [11] for support vector machines. In order to apply their algorithm to nonsmooth loss functions, they proposed smoothing the loss function by Huberization, and then running their algorithm (which requires smoothness for the privacy analysis) on the resulting, modified loss functions. We show that for any setting of the Huberization parameters, there are simple, one-dimensional nonsmooth loss functions for which the algorithm has error . This bound justifies the effort we put into designing new algorithms for nonsmooth loss functions.

#### Generalization Error.

In Appendix F, we discuss the implications of our results for generalization error. Specifically, suppose that the data are drawn i.i.d. from a distribution , and let denote the expected loss of on unseen data from , that is .

For an important class of loss functions, called generalized linear models, the straightforward application of our algorithms gives generalization error where for the case of -differential privacy, and for -differential privacy (assuming ). This bound is tight: the is necessary even in the nonprivate setting, and the necessity of the term follows from our lower bounds on excess empirical risk (they are also lower bounds on generalization error).

For the case of general Lipschitz convex functions, a modification of our algorithms gives excess risk , where for -differential privacy and for differential privacy (that is, the generalization error bound is roughly the square root of the corresponding empirical error bound). The best known lower bound, however, is the same as for the special case of generalized linear models. The bounds match when (in which case no nontrivial generalization error is possible). However, for smaller values of there remains a gap that is polynomial in . Closing the gap is an interesting open problem.

### 1.2 Other Related Work

In addition to the previous work mentioned above, we mention several closely related works. A rich line of work seeks to characterize the optimal error of differentially private algorithms for learning and optimization (Kasiviswanathan et al. [31], Beimel et al. [3], Chaudhuri and Hsu [9], Beimel et al. [4, 5]). In particular, our results on $(\epsilon,0)$-differential privacy imply nearly-tight bounds on the “representation dimension” (Beimel et al. [5]) of convex Lipschitz functions.

Jain and Thakurta [27] gave dimension-independent expected excess risk bounds for the special case of “generalized linear models” with a strongly convex regularizer, assuming that $\mathcal{C}=\mathbb{R}^{p}$ (that is, unconstrained optimization). Kifer et al. [34], Smith and Thakurta [45] considered parameter convergence for high-dimensional sparse regression (where $p\gg n$). The settings of those papers are orthogonal to ours, though a common generalization would be interesting.

Efficient implementations of the exponential mechanism over infinite domains were discussed by Hardt and Talwar [25], Chaudhuri et al. [12] and Kapralov and Talwar [29]. The latter two works were specific to sampling (approximately) singular vectors of a matrix, and their techniques do not obviously apply here.

Differentially private convex learning in different models has also been studied: for example, Jain et al. [28], Duchi et al. [17], Smith and Thakurta [44] study online optimization, and Jain and Thakurta [26] study an interactive model tailored to high-dimensional kernel learning. Convex optimization techniques have also played an important role in the development of algorithms for “simultaneous query release” (e.g., the line of work emerging from Hardt and Rothblum [24]). We do not know of a direct connection between those works and our setting.

### 1.3 Additional Definitions

For completeness, we state a few additional definitions related to convex sets and functions.

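• $\ell:\mathcal{C}\to\mathbb{R}$ is $L$-Lipschitz (in the Euclidean norm) if, for all pairs $x,y\in\mathcal{C}$, we have $|\ell(x)-\ell(y)|\le L\|x-y\|_2$. A subgradient of a convex function $\ell$ at $x$, denoted $\partial\ell(x)$, is the set of vectors $z$ such that for all $y\in\mathcal{C}$, $\ell(y)\ge\ell(x)+\langle z,y-x\rangle$.

• $\ell$ is $\Delta$-strongly convex on $\mathcal{C}$ if, for all $x\in\mathcal{C}$, for all subgradients $z$ at $x$, and for all $y\in\mathcal{C}$, we have $\ell(y)\ge\ell(x)+\langle z,y-x\rangle+\frac{\Delta}{2}\|y-x\|_2^{2}$ (i.e., $\ell$ is bounded below by a quadratic function tangent at $x$).

• $\ell$ is $\beta$-smooth on $\mathcal{C}$ if, for all $x\in\mathcal{C}$, for all subgradients $z$ at $x$ and for all $y\in\mathcal{C}$, we have $\ell(y)\le\ell(x)+\langle z,y-x\rangle+\frac{\beta}{2}\|y-x\|_2^{2}$ (i.e., $\ell$ is bounded above by a quadratic function tangent at $x$). Smoothness implies differentiability, so the subgradient at $x$ is unique.

• Given a convex set $\mathcal{C}$, we denote its diameter by $\|\mathcal{C}\|_2$. We denote the projection of any vector $\theta\in\mathbb{R}^{p}$ to the convex set $\mathcal{C}$ by $\Pi_{\mathcal{C}}(\theta)=\arg\min_{x\in\mathcal{C}}\|\theta-x\|_2$.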

### 1.4 Organization of this Paper

Our upper bounds (efficient algorithms) are given in Sections 2, 3, and 4, whereas our lower bounds are given in Section 5. Namely, in Section 2, we give an efficient construction for $(\epsilon,\delta)$-differentially private algorithms for general convex loss as well as Lipschitz strongly convex loss. In Section 3, we discuss a pure $\epsilon$-differentially private algorithm for general Lipschitz convex loss and outline an efficient construction for such an algorithm. In Section 4, we discuss our localization technique and show how to construct efficient pure $\epsilon$-differentially private algorithms for Lipschitz strongly convex loss. We derive our lower bound for general Lipschitz convex loss in Section 5.1 and our lower bound for Lipschitz strongly convex loss in Section 5.2. In Section 6, we discuss a generic construction of an efficient algorithm for sampling (with a multiplicative distance guarantee) from a logconcave distribution over an arbitrary convex bounded set. As a by-product of our generic construction, we give the details of the construction of our efficient $\epsilon$-differentially private algorithm from Section 3.2.

The appendices contain proof details and supplementary material: Appendix A shows that smoothing a nonsmooth loss function in order to apply the objective perturbation technique of Chaudhuri et al. [11] can introduce significant additional error. Appendix B gives details on the application of localization in the setting of $(\epsilon,\delta)$-differential privacy. Appendix C provides additional details on the proofs of lower bounds. In Appendix D, we explain standard modifications that allow our algorithms to give high probability guarantees instead of expected risk guarantees. Finally, in Appendix F we discuss how our algorithms can be adapted to provide guarantees on generalization error, rather than empirical error.

## 2 Gradient Descent and Optimal (ϵ,δ)-differentially private Optimization

In this section we provide an algorithm (Algorithm 1) for computing $\theta^{priv}$ using a noisy stochastic variant of the classic gradient descent algorithm from the optimization literature [7]. Our algorithm (and the utility analysis) was inspired by the approach of Williams and McSherry [49].

All the excess risk bounds (1) in this section, and in the rest of this paper, are presented in expectation over the randomness of the algorithm. In Appendix D we provide a generic tool to translate the expectation bounds into high probability bounds, at the cost of an extra logarithmic factor in the inverse of the failure probability.

Note(1): The results in this section do not require the loss function to be differentiable. Although we present Algorithm 1 (and its analysis) using the gradient of the loss function $\ell(\theta;d)$ at $\theta$, the same guarantees hold if instead of the gradient, the algorithm is run with any sub-gradient of $\ell$ at $\theta$.

Note(2): Instead of using the stochastic variant in Algorithm 1, one can use the complete gradient (i.e., $\nabla L(\theta;D)=\sum_{i=1}^{n}\nabla\ell(\theta;d_i)$) in Step 5 and still have the same utility guarantee as Theorem 2.4. However, the running time goes up by a factor of $n$.

###### Theorem 2.1 (Privacy guarantee).

Algorithm 1 is $(\epsilon,\delta)$-differentially private.

###### Proof.

At any time step in Algorithm , fix the randomness due to sampling in Line 4. Let

be a random variable defined over the randomness of

and conditioned on (see Line 5 for a definition), where is the data point picked in Line 4. Denote to be the measure of the random variable induced on . For any two neighboring data sets and , define the privacy loss random variable [22] to be . Standard differential privacy arguments for Gaussian noise addition (see [34, 40]) will ensure that with probability (over the randomness of the random variables ’s and conditioned on the randomness due to sampling), for all . Now using the following lemma (Lemma 2.2 with and ) we ensure that over the randomness of ’s and the randomness due to sampling in Line 4 , w.p. at least , for all . While using Lemma 2.2, we ensure that the condition is satisfied.

###### Lemma 2.2 (Privacy amplification via sampling. Lemma 4 in [3]).

Over a domain of data sets $\mathcal{X}^{n}$, if an algorithm $\mathcal{A}$ is $\epsilon\le 1$ differentially private, then for any data set $D\in\mathcal{X}^{n}$, executing $\mathcal{A}$ on uniformly random $\gamma n$ entries of $D$ ensures $2\gamma\epsilon$-differential privacy.

To conclude the proof, we apply “strong composition” (Lemma 2.3) from [22]. With probability at least $1-\delta$, the total privacy loss is at most $\epsilon$. This concludes the proof.

###### Lemma 2.3 (Strong composition [22]).

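Let $\epsilon,\delta'>0$. The class of $\epsilon$-differentially private algorithms satisfies $(\epsilon',\delta')$-differential privacy under $k$-fold adaptive composition for $\epsilon'=\sqrt{2k\ln(1/\delta')}\,\epsilon+k\epsilon(e^{\epsilon}-1)$.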
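The way amplification and strong composition combine can be illustrated numerically. The sketch below assumes the amplification bound in the form $2\gamma\epsilon$ (valid for $\epsilon\le 1$, with $\gamma$ the sampled fraction) and the strong composition bound $\epsilon'=\sqrt{2k\ln(1/\delta')}\,\epsilon+k\epsilon(e^{\epsilon}-1)$ of Dwork et al. [22]; the function names and the example parameters are hypothetical and are not the settings used in the proof.

```python
import math

def amplified_epsilon(epsilon, gamma):
    """Privacy parameter after running an epsilon-DP step on a random
    gamma-fraction of the data, using the bound 2 * gamma * epsilon."""
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("the 2*gamma*epsilon bound is stated for epsilon <= 1")
    return 2.0 * gamma * epsilon

def strong_composition(epsilon, k, delta_prime):
    """epsilon' for k-fold adaptive composition of epsilon-DP steps."""
    return (math.sqrt(2.0 * k * math.log(1.0 / delta_prime)) * epsilon
            + k * epsilon * (math.exp(epsilon) - 1.0))

# One data point out of n per step (gamma = 1/n), repeated for n^2 steps.
n = 1000
per_step = amplified_epsilon(0.5, gamma=1.0 / n)
total = strong_composition(per_step, k=n * n, delta_prime=1e-6)
basic = per_step * n * n  # what plain (linear) composition would give
```

The point of the comparison is that the leading term of strong composition grows like $\sqrt{k}$ rather than $k$, which is what makes $n^{2}$ sampled steps affordable.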

In the following we provide the utility guarantees for Algorithm 1 under two different settings, namely, when the function $\ell$ is Lipschitz, and when the function $\ell$ is Lipschitz and strongly convex. In Section 5 we argue that these excess risk bounds are essentially tight.

###### Theorem 2.4 (Utility guarantee).

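Let $\sigma^{2}=O\!\left(\frac{L^{2}n^{2}\log(n/\delta)\log(1/\delta)}{\epsilon^{2}}\right)$. For $\theta^{priv}$ output by Algorithm 1 we have the following. (The expectation is over the randomness of the algorithm.)

1. Lipschitz functions: If we set the learning rate function $\eta(t)=\frac{\|\mathcal{C}\|_2}{\sqrt{t(n^{2}L^{2}+p\sigma^{2})}}$, then we have the following excess risk bound. Here $L$ is the Lipschitz constant of the loss function $\ell$.

 $$\mathbb{E}\big[L(\theta^{priv};D)-L(\theta^{*};D)\big]=O\!\left(\frac{L\|\mathcal{C}\|_2\log^{3/2}(n/\delta)\sqrt{p\log(1/\delta)}}{\epsilon}\right).$$
2. Lipschitz and strongly convex functions: If we set the learning rate function $\eta(t)=\frac{1}{\Delta nt}$, then we have the following excess risk bound. Here $L$ is the Lipschitz constant of the loss function $\ell$ and $\Delta$ is the strong convexity parameter.

 $$\mathbb{E}\big[L(\theta^{priv};D)-L(\theta^{*};D)\big]=O\!\left(\frac{L^{2}\log^{2}(n/\delta)\,p\log(1/\delta)}{n\Delta\epsilon^{2}}\right).$$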
###### Proof.

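Let $G_t=n\nabla\ell(\tilde{\theta}_t;d)+b_t$ in Line 5 of Algorithm 1. First notice that over the randomness of the sampling of the data entry $d$ from $D$, and the randomness of $b_t$, $\mathbb{E}[G_t]=\nabla L(\tilde{\theta}_t;D)$. Additionally, we have the following bound on $\mathbb{E}[\|G_t\|_2^{2}]$, where $\sigma^{2}$ is the variance of each coordinate of $b_t$.

$$\mathbb{E}\big[\|G_t\|_2^{2}\big]=n^{2}\,\mathbb{E}\big[\|\nabla\ell(\tilde{\theta}_t;d)\|_2^{2}\big]+2n\,\mathbb{E}\big[\langle\nabla\ell(\tilde{\theta}_t;d),b_t\rangle\big]+\mathbb{E}\big[\|b_t\|_2^{2}\big]\le n^{2}L^{2}+p\sigma^{2} \tag{2}$$

In the above expression we have used the fact that $b_t$ has mean zero and is independent of $\nabla\ell(\tilde{\theta}_t;d)$, so $\mathbb{E}[\langle\nabla\ell(\tilde{\theta}_t;d),b_t\rangle]=0$. Also, we have $\mathbb{E}[\|b_t\|_2^{2}]=p\sigma^{2}$. We can now directly use Lemma 2.5 to obtain the required error guarantee for Lipschitz convex functions, and Lemma 2.6 for Lipschitz and strongly convex functions.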

###### Lemma 2.5 (Theorem 2 from [43]).

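Let $F(\theta)$ (for $\theta\in\mathcal{C}$) be a convex function and let $\theta^{*}=\arg\min_{\theta\in\mathcal{C}}F(\theta)$. Let $\theta_1$ be any arbitrary point from $\mathcal{C}$. Consider the stochastic gradient descent algorithm $\theta_{t+1}=\Pi_{\mathcal{C}}\big[\theta_t-\eta(t)G_t(\theta_t)\big]$, where $\mathbb{E}[G_t(\theta_t)]=\nabla F(\theta_t)$, $\mathbb{E}[\|G_t\|_2^{2}]\le G^{2}$, and the learning rate function $\eta(t)=\frac{\|\mathcal{C}\|_2}{G\sqrt{t}}$. Then for any $T>1$, the following is true.

$$\mathbb{E}\big[F(\theta_T)-F(\theta^{*})\big]=O\!\left(\frac{\|\mathcal{C}\|_2\,G\log T}{\sqrt{T}}\right).$$

Using the bound from (2) in Lemma 2.5 (i.e., set $G=\sqrt{n^{2}L^{2}+p\sigma^{2}}$), and setting $T=n^{2}$ and the learning rate function $\eta(t)$ as in Lemma 2.5, gives us the required excess risk bound for Lipschitz convex functions. For Lipschitz and strongly convex functions we use the following result by [43].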
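The substitution can be spelled out as a short worked calculation. It is a sketch with constants suppressed, and it assumes $G^{2}=n^{2}L^{2}+p\sigma^{2}$ from (2), $T=n^{2}$, and a noise variance of the form $\sigma^{2}=O\big(L^{2}n^{2}\log(n/\delta)\log(1/\delta)/\epsilon^{2}\big)$. The last step uses $\log n\le\log(n/\delta)$ and assumes the noise term dominates (for instance $\epsilon\le 1$ and $\delta\le 1/e$).

```latex
\begin{align*}
\mathbb{E}\big[L(\theta^{priv};D)-L(\theta^{*};D)\big]
 &= O\!\left(\frac{\|\mathcal{C}\|_2\,G\log T}{\sqrt{T}}\right)
  = O\!\left(\frac{\|\mathcal{C}\|_2\sqrt{n^{2}L^{2}+p\sigma^{2}}\,\log n}{n}\right) \\
 &= O\!\left(\|\mathcal{C}\|_2\Big(L+\frac{\sqrt{p}\,\sigma}{n}\Big)\log n\right) \\
 &= O\!\left(L\|\mathcal{C}\|_2\log n\,\Big(1+\frac{\sqrt{p\log(n/\delta)\log(1/\delta)}}{\epsilon}\Big)\right) \\
 &= O\!\left(\frac{L\|\mathcal{C}\|_2\log^{3/2}(n/\delta)\sqrt{p\log(1/\delta)}}{\epsilon}\right).
\end{align*}
```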

###### Lemma 2.6 (Theorem 1 from [43]).

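Let $F(\theta)$ (for $\theta\in\mathcal{C}$) be a $\lambda$-strongly convex function and let $\theta^{*}=\arg\min_{\theta\in\mathcal{C}}F(\theta)$. Let $\theta_1$ be any arbitrary point from $\mathcal{C}$. Consider the stochastic gradient descent algorithm $\theta_{t+1}=\Pi_{\mathcal{C}}\big[\theta_t-\eta(t)G_t(\theta_t)\big]$, where $\mathbb{E}[G_t(\theta_t)]=\nabla F(\theta_t)$, $\mathbb{E}[\|G_t\|_2^{2}]\le G^{2}$, and the learning rate function $\eta(t)=\frac{1}{\lambda t}$. Then for any $T>1$, the following is true.

$$\mathbb{E}\big[F(\theta_T)-F(\theta^{*})\big]=O\!\left(\frac{G^{2}\log T}{\lambda T}\right).$$

Using the bound from (2) in Lemma 2.6 (i.e., set $G=\sqrt{n^{2}L^{2}+p\sigma^{2}}$), $\lambda=n\Delta$, and setting $T=n^{2}$ and the learning rate function $\eta(t)$ as in Lemma 2.6, gives us the required excess risk bound for Lipschitz and strongly convex functions. ∎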

Note: Algorithm 1 has a running time of $O(pn^{2})$, assuming that the gradient computation for $\ell$ takes time $O(p)$. Variants of Algorithm 1 have appeared in [49, 28, 17, 47]. The most relevant work in our current context is that of [47]. The main idea in [47] is to run stochastic gradient descent with gradients computed over small batches of disjoint samples from the data set (as opposed to the single sample used in Algorithm 1). The issue with that algorithm is that it cannot provide an excess risk guarantee which is $o(\sqrt{n})$, where $n$ is the number of data samples. One observation that we make is that if one removes the constraint of disjointness and uses the amplification lemma (Lemma 2.2), then one can ensure much tighter privacy guarantees for the same setting of parameters used in the paper.

## 3 Exponential Sampling and Optimal (ϵ,0)-private Optimization

In this section, we focus on the case of pure $\epsilon$-differential privacy and provide an optimal efficient algorithm for empirical risk minimization for the general class of convex and Lipschitz loss functions. The main building block of this section is the well-known exponential mechanism [37].

First, we show that a variant of the exponential mechanism is optimal. A major technical contribution of this section is to make the exponential mechanism computationally efficient which is discussed in Section 3.2.

### 3.1 Exponential Mechanism for Lipschitz Convex Loss

In this section we only deal with loss functions which are Lipschitz. We provide an $\epsilon$-differentially private algorithm (Algorithm 2) which achieves the optimal excess risk for arbitrary convex bounded sets.

###### Theorem 3.1 (Privacy guarantee).

Algorithm 2 is $\epsilon$-differentially private.

###### Proof.

First, notice that the distribution induced by the exponential weight function in step 2 of Algorithm 2 is the same if we use for some arbitrary point . Since is -Lipschitz, the sensitivity of is at most . The proof then follows directly from the analysis of exponential mechanism by [37]. ∎

In the following we prove the utility guarantee for Algorithm 2.

###### Theorem 3.2 (Utility guarantee).

Let $\theta^{priv}$ be the output of Algorithm 2. Then, we have the following guarantee on the expected excess risk. (The expectation is over the randomness of the algorithm.)

$$\mathbb{E}\big[L(\theta^{priv};D)-L(\theta^{*};D)\big]=O\!\left(\frac{pL\|\mathcal{C}\|_2}{\epsilon}\right).$$
###### Proof.

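Consider a differential cone $\Omega$ centered at $\theta^{*}$ (see Figure 1). We will bound the expected excess risk of $\theta^{priv}$ by $O\big(\frac{pL\|\mathcal{C}\|_2}{\epsilon}\big)$ conditioned on $\theta^{priv}\in\Omega$ for every differential cone. This immediately implies the above theorem by the properties of conditional expectation.

Let $\Gamma$ be a fixed threshold (to be set later) and let $R(\theta)=L(\theta;D)-L(\theta^{*};D)$ for the purposes of brevity. Let the marked sets $A_i$ in Figure 1 be defined as

$$A_i=\{\theta\in\Omega\cap\mathcal{C}:(i-1)\Gamma\le R(\theta)\le i\cdot\Gamma\}.$$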

Instead of directly computing the probability of being outside , we will analyze the probabilities for being in each of the ’s individually. This form of “peeling” arguments have been used for risk analysis of convex loss in the machine learning literature (e.g., see [48]) and will allow us to get rid of the extra logarithmic factor that would have otherwise shown up in the excess risk if we use the standard analysis of the exponential mechanism in [37].

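Since $\Omega$ is a differential cone and since $R$ is continuous on $\mathcal{C}$, it follows that within $\Omega$, $R(\theta)$ only depends on $\|\theta-\theta^{*}\|_2$. Therefore, let $r_1,r_2,\ldots$ be the distances of the set boundaries of $A_1,A_2,\ldots$ from $\theta^{*}$. (See Figure 1.) One can equivalently write each $A_i$ as follows:

$$A_i=\{\theta\in\Omega\cap\mathcal{C}:r_{i-1}<\|\theta-\theta^{*}\|_2\le r_i\}.$$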

The following claim is the key part of the proof.

###### Claim 3.3.

Convexity of for all implies that for all .

###### Proof.

Since by definition is the minimizer of within and is convex, we have for any such that . This directly implies the required bound. ∎

Now, the volume of the set is given by for some fixed constant . Hence,

 Vol(Ai)Vol(A2)=rpi−1rp1⋅(ri/ri−1)p−1(r2/r1)p−1≤rpi−1rp1≤(i−1)p.

where the last two inequalities follows from Claim 3.3. Hence, we get the following bound on the probability that the excess risk conditioned on (For brevity, we remove the conditioning sign from the probabilities below).

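$$\begin{aligned}
\Pr\left[R(\theta^{priv})\ge 4\Gamma\right]
&\le\frac{\Pr\left[\theta^{priv}\in\bigcup_{i=4}^{\infty}A_i\right]}{\Pr\left[\theta^{priv}\in A_2\right]}
\le\sum_{i=4}^{\infty}\frac{\mathrm{Vol}(A_i)}{\mathrm{Vol}(A_2)}\cdot e^{-\frac{\epsilon(i-3)\Gamma}{2L\|\mathcal{C}\|_2}}\\
&\le\sum_{i=4}^{\infty}(i-1)^{p}\cdot e^{-\frac{\epsilon(i-3)\Gamma}{2L\|\mathcal{C}\|_2}}
\le\frac{3^{p}\,e^{-\frac{\epsilon\Gamma}{2L\|\mathcal{C}\|_2}}}{1-2^{p}\,e^{-\frac{\epsilon\Gamma}{2L\|\mathcal{C}\|_2}}}
\end{aligned}$$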

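where the last inequality follows from the fact that for . Hence, for every $t>0$, if we choose $\Gamma=\frac{2L\|\mathcal{C}\|_2}{\epsilon}\left((p+1)\ln 3+t\right)$, then, conditioned on $\theta^{priv}$ lying in the cone, we get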

$$\Pr\left[R(\theta^{priv})\ge\frac{8L\|\mathcal{C}\|_2}{\epsilon}\left((p+1)\ln 3+t\right)\right]\le e^{-t}.$$

Since this is true for every , we have our required bound as a corollary.
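The step from the tail bound to the expectation bound can be sketched by integrating the tail. The shorthands $a$ and $b$ below are introduced only for this sketch, and the sketch assumes that the tail bound holds for every $t\ge 0$ and that $R(\theta^{priv})\ge 0$ (which holds because $\theta^{*}$ is the minimizer). With $a=\frac{8L\|\mathcal{C}\|_2}{\epsilon}$ and $b=(p+1)\ln 3$, the tail bound reads $\Pr[R(\theta^{priv})\ge a(b+t)]\le e^{-t}$, so

$$\begin{aligned}
\mathbb{E}\left[R(\theta^{priv})\right]
&=\int_{0}^{\infty}\Pr\left[R(\theta^{priv})\ge s\right]ds
\le ab+\int_{ab}^{\infty}\Pr\left[R(\theta^{priv})\ge s\right]ds\\
&=ab+a\int_{0}^{\infty}\Pr\left[R(\theta^{priv})\ge a(b+t)\right]dt
\le ab+a\int_{0}^{\infty}e^{-t}\,dt\\
&=a(b+1)=\frac{8L\|\mathcal{C}\|_2}{\epsilon}\left((p+1)\ln 3+1\right)=O\!\left(\frac{pL\|\mathcal{C}\|_2}{\epsilon}\right).
\end{aligned}$$

Since the same bound holds conditioned on each cone, averaging over the cones leaves it unchanged.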

### 3.2 Efficient Implementation of Algorithm $\mathcal{A}_{\text{exp-samp}}$ (Algorithm 2)

In this section, we give a high-level description of a computationally efficient construction of Algorithm 2. Our algorithm runs in polynomial time in and outputs a sample from a distribution that is arbitrarily close (in the multiplicative sense) to the distribution of the output of Algorithm 2.

Since we are interested in an efficient pure $\epsilon$-differentially private algorithm, we need an efficient sampler with a multiplicative distance guarantee. If we were instead interested in $(\epsilon,\delta)$ algorithms, efficient sampling with a total variation guarantee would have sufficed, which would have made our task much easier since we could have used one of the existing algorithms, e.g., [36]. In [25], it was shown how to sample efficiently with a multiplicative guarantee from the uniform distribution over a convex bounded set. However, what we want to achieve here is more general: to sample efficiently from any given logconcave distribution defined over a convex bounded set. To the best of our knowledge, this task has not been explicitly worked out before; nevertheless, all the ingredients needed to accomplish it are present in the literature, mainly in [2].

We highlight here the main ideas of our construction. Since this construction is not specific to our privacy problem and could be of independent interest, in this section we provide only a high-level description; all the details and the proof of our main result (Theorem 3.4 below) are deferred to Section 6.

###### Theorem 3.4.

There is an efficient version of Algorithm 2 that has the following guarantees.

1. Privacy: The algorithm is $\epsilon$-differentially private.

2. Utility: The output of the algorithm satisfies

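$$\mathbb{E}\left[\mathcal{L}(\theta^{priv};D)-\mathcal{L}(\theta^{*};D)\right]=O\!\left(\frac{pL\|\mathcal{C}\|_2}{\epsilon}\right).$$

3. Running time: Assuming $\mathcal{C}$ is in isotropic position (the case where $\mathcal{C}$ is not in isotropic position is discussed below), the algorithm runs in time

$$O\!\left(\|\mathcal{C}\|_2^{2}\,p^{3}n^{3}\max\left\{p\log\left(\|\mathcal{C}\|_2\,pn\right),\ \epsilon\|\mathcal{C}\|_2\,n\right\}\right).$$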

In fact, the running time of our algorithm depends on rather than . Namely, all the terms in the running time can be replaced with , however, we chose to write it in this less conservative way since all the bounds in this paper are expressed in terms of .

Before describing our construction, we first introduce some useful notation and discuss some preliminaries.

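For any two probability measures $\mu,\nu$ defined with respect to the same sample space $Q$, the relative (multiplicative) distance between $\mu$ and $\nu$, denoted as $\mathrm{Dist}_\infty(\mu,\nu)$, is defined as

$$\mathrm{Dist}_\infty(\mu,\nu)=\sup_{q\in Q}\left|\log\frac{d\mu(q)}{d\nu(q)}\right|,$$

where $\frac{d\mu(q)}{d\nu(q)}$ (resp., $\frac{d\nu(q)}{d\mu(q)}$) denotes the ratio of the two measures (more precisely, the Radon-Nikodym derivative).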
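To make the definition concrete, here is a minimal Python sketch for the special case of two distributions on a finite set. The finite setting, the function name `dist_inf`, and the dictionary representation are assumptions made for illustration; the definition above applies to general measures.

```python
import math


def dist_inf(mu, nu):
    """Relative (multiplicative) distance between two discrete distributions.

    mu and nu are dicts mapping outcomes to probabilities. Restricting to a
    finite sample space is an assumption made for this illustration.
    """
    worst = 0.0
    for q in set(mu) | set(nu):
        a, b = mu.get(q, 0.0), nu.get(q, 0.0)
        if a == 0.0 and b == 0.0:
            continue  # outcome has zero mass under both measures
        if a == 0.0 or b == 0.0:
            return math.inf  # the ratio is unbounded on this outcome
        worst = max(worst, abs(math.log(a / b)))
    return worst


mu = {"a": 0.5, "b": 0.5}
nu = {"a": 0.25, "b": 0.75}
d = dist_inf(mu, nu)  # max(|log 2|, |log(2/3)|) = log 2
```

This is the notion of closeness that pure differential privacy needs: if a sampler's output distribution is within relative distance $\eta$ of an $\epsilon$-differentially private distribution on every dataset, the sampler is $(\epsilon+2\eta)$-differentially private, whereas a total variation guarantee would only give an $(\epsilon,\delta)$-type statement.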

Assumptions: We assume that we can efficiently test whether a given point lies in $\mathcal{C}$ using a membership oracle. We also assume that we can efficiently optimize an efficiently computable convex function over a convex set. To do this, it suffices to have a projection oracle. We do not take into account the extra polynomial factor in the running time that is required to perform such operations, since this factor is highly dependent on the specific structure of the set $\mathcal{C}$.

When is not isotropic: In Theorem 3.4 and in the rest of this section, we assume that is in isotropic position. In particular, we assume that . However, if the convex set is not in isotropic position, fortunately, we know of efficient algorithms for placing it in isotropic position (for example, the algorithm of [35]). In such case, the first step of our algorithm would be to transform to an isotropic position (and apply the corresponding transformation to the loss function). This step takes time polynomial in with additional polylog factor in where is the diameter of the largest ball we can fit inside (See [35] and [23]). Specifically, if , then our set would be already in isotropic position. Then, we sample efficiently from the transformed set. Finally, we apply the inverse transformation to the output sample to obtain a sample from the desired distribution over in its original position. In the worst case, the isotropic transformation can amplify the diameter of by a factor of . Putting all this together, the running time in Theorem 3.4 above will pick up an extra factor of in the worst case if is not in isotropic position.

### 3.3 Our construction

Let denote the diameter of . The Minkowski’s norm of with respect to , denoted as , is defined as . We define for . Note that if and only if . Moreover, it is not hard to verify that is -Lipschitz.

We use the grid-walk algorithm of [2] for sampling from a logconcave distribution defined over a cube as a building block. Our construction is described as follows:

1. Enclose the set with a cube with edges of length .

2. Obtain a convex Lipschitz extension of the loss function over . This can be done efficiently using a projection oracle.

3. Define , for a specific choice of (See Section 6 for details).

4. Run the grid-walk algorithm of [2] with as the input weight function and as the input cube, and output a sample whose distribution is close, with respect to , to the distribution induced by on which is given by .

Now, note that what we have so far is an efficient procedure (let’s denote it by ) that outputs a sample from a distribution over that is close, with respect to $\mathrm{Dist}_\infty$, to the continuous distribution . We then argue that, due to the choices made for the values of the parameters above, outputs a sample in with probability at least . That is, the algorithm succeeds in outputting a sample from a distribution close to the right distribution on with probability at least . Hence, we can amplify the probability of success by repeating sufficiently many times, where fresh random coins are used by each time (specifically, iterations would suffice). If returns a sample in one of those iterations, then our algorithm terminates and outputs . Otherwise, it outputs a uniformly random sample from the unit ball (note that since is in isotropic position). We finally show that this termination condition can only change the distribution of the output sample by a constant factor sufficiently close to . Hence, we obtain our efficient algorithm referred to in Theorem 3.4.
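The repeat-then-fall-back control flow described above can be sketched as follows. This is only an illustration of the amplification step, not the construction of Section 6: `base_sampler`, `in_set`, `amplify`, and `max_iters` are hypothetical names, the base sampler stands in for the grid-walk procedure, and the number of iterations is left as a parameter because its value is not stated here.

```python
import math
import random


def sample_unit_ball(p, rng):
    """Uniform sample from the Euclidean unit ball in R^p."""
    g = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(x * x for x in g))
    radius = rng.random() ** (1.0 / p)  # radial law of a uniform ball point
    return [radius * x / norm for x in g]


def amplify(base_sampler, in_set, p, max_iters, rng):
    """Repeat a sampler with fresh randomness until it lands in the set.

    base_sampler(rng) returns a point and in_set(point) is a membership
    oracle; both are hypothetical stand-ins. If every attempt fails, fall
    back to a uniform point of the unit ball, as in the text.
    """
    for _ in range(max_iters):
        u = base_sampler(rng)
        if in_set(u):
            return u
    return sample_unit_ball(p, rng)
```

For example, with a toy base sampler that is uniform over the cube $[-2,2]^3$ and the unit ball as the target set, `amplify` returns either an accepted cube sample or the fallback point, and in both cases the output lies in the target set.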

## 4 Localization and Optimal Private Algorithms for Strongly Convex Loss

It is unclear how to get a direct variant of Algorithm 2 in Section 3 for Lipschitz and strongly convex losses that can achieve optimal excess risk guarantees. The issue in extending Algorithm 2 directly is that the convex set over which the exponential mechanism is defined is “too large” to provide tight guarantees.

We show a generic $\epsilon$-differentially private algorithm for minimizing Lipschitz strongly convex loss functions based on a combination of a simple pre-processing step (called the localization step) and any generic $\epsilon$-differentially private algorithm for Lipschitz convex loss functions. We carry out the localization step using a simple output perturbation algorithm, which ensures that the convex set over which the $\epsilon$-differentially private algorithm (in the second step) is run has diameter .

Next, we instantiate the generic $\epsilon$-differentially private algorithm in the second step with our efficient exponential mechanism of Section 3.1 (Algorithm 2) to obtain an algorithm with optimal excess risk bound (Theorem 4.3).

Note: The localization technique is not specific to pure $\epsilon$-differential privacy and extends naturally to the $(\epsilon,\delta)$ case, although this is not relevant in our current context since we already have a gradient-descent-based algorithm that achieves the optimal excess risk bound. We defer the details for the $(\epsilon,\delta)$ case to Appendix B.

Details of the generic algorithm: We first give a simple algorithm that carries out the desired localization step. The crux of the algorithm is the same as that of the output perturbation algorithm of [10, 11]. The high-level idea is to first compute $\theta^{*}=\arg\min_{\theta\in\mathcal{C}}\mathcal{L}(\theta;D)$ and add noise according to the sensitivity of $\theta^{*}$. The details of the algorithm are given in Algorithm 3.
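The noise-addition idea can be sketched in Python as below. This is a generic output perturbation sketch, not a transcription of Algorithm 3: the function names are hypothetical, the L2 sensitivity of the minimizer is taken as an input supplied by the caller, and any later step that builds the localized set is omitted. The noise has density proportional to $e^{-\epsilon\|b\|_2/\text{sensitivity}}$, which is sampled by drawing a uniform direction and a Gamma-distributed norm.

```python
import math
import random


def gamma_norm_noise(p, scale, rng):
    """Noise b in R^p with density proportional to exp(-||b||_2 / scale)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(x * x for x in g))
    # Radial density is r^(p-1) * exp(-r / scale): Gamma with shape p.
    radius = rng.gammavariate(p, scale)
    return [radius * x / norm for x in g]


def output_perturbation(theta_star, sensitivity, epsilon, rng):
    """Release the minimizer plus noise calibrated to its L2 sensitivity.

    `sensitivity` is an assumed upper bound on how far the minimizer can
    move in L2 norm when one data point changes; it is not derived here.
    """
    noise = gamma_norm_noise(len(theta_star), sensitivity / epsilon, rng)
    return [t + b for t, b in zip(theta_star, noise)]
```

Under this noise distribution, shifting the minimizer by at most `sensitivity` changes the output density by a factor of at most $e^{\epsilon}$, which is the pure $\epsilon$-differential privacy condition. The expected noise norm is $p\cdot\text{sensitivity}/\epsilon$.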

Having Algorithm 3 in hand, we now give a generic $\epsilon$-differentially private algorithm for minimizing over . Let denote any generic $\epsilon$-differentially private algorithm for optimizing over some arbitrary convex set . Algorithm 2 from Section 3.1 or its efficient version Algorithm 7 (see Theorem 3.4 and Section 6 for details) is an example of