Private Empirical Risk Minimization Beyond the Worst Case: The Effect of the Constraint Set Geometry

11/20/2014 ∙ by Kunal Talwar, et al. ∙ Google

Empirical Risk Minimization (ERM) is a standard technique in machine learning, where a model is selected by minimizing a loss function over a constraint set. When the training dataset consists of private information, it is natural to use a differentially private ERM algorithm, and this problem has been the subject of a long line of work starting with Chaudhuri and Monteleoni (2008). A private ERM algorithm outputs an approximate minimizer of the loss function, and its error can be measured as the difference from the optimal value of the loss function. When the constraint set is arbitrary, the required error bounds are fairly well understood [BST14]. In this work, we show that the geometric properties of the constraint set can be used to derive significantly better results. Specifically, we show that a differentially private version of Mirror Descent leads to error bounds of the form Õ(G_C/n) for a Lipschitz loss function, improving on the Õ(√(p)/n) bounds in Bassily, Smith and Thakurta (2014). Here p is the dimensionality of the problem, n is the number of data points in the training set, and G_C denotes the Gaussian width of the constraint set that we optimize over. We show similar improvements for strongly convex functions and for smooth functions. In addition, we show that when the loss function is Lipschitz with respect to the ℓ_1 norm and C is ℓ_1-bounded, a differentially private version of the Frank-Wolfe algorithm gives error bounds of the form Õ(1/n^{2/3}). This captures the important and common case of sparse linear regression (LASSO), when the data x_i satisfies |x_i|_∞ ≤ 1 and we optimize over the ℓ_1 ball. We show new lower bounds for this setting that, together with known bounds, imply that all our upper bounds are tight.

1 Introduction

A common task in supervised learning is to select the model that best fits the data. This is frequently achieved by selecting a loss function that associates a real-valued loss with each datapoint and model, and then selecting, from a class of admissible models, the model that minimizes the average loss over all data points in the training set. This procedure is commonly referred to as Empirical Risk Minimization (ERM).

The availability of large datasets containing sensitive information from individuals has motivated the study of learning algorithms that guarantee the privacy of individuals contributing to the database. A rigorous and by-now standard privacy guarantee is via the notion of differential privacy. In this work, we study the design of differentially private algorithms for Empirical Risk Minimization, continuing a long line of work initiated by [CM08] and continued in [CMS11, KST12, JKT12, ST13a, DJW13, JT14, BST14, Ull14].

As an example, suppose that the training dataset $D$ consists of $n$ pairs of data $d_i = (x_i, y_i)$ where $x_i \in \mathbb{R}^p$, usually called the feature vector, and $y_i \in \mathbb{R}$, the prediction. The goal is to find a “reasonable model” $\theta \in \mathbb{R}^p$ such that $y_i$ can be predicted from the model $\theta$ and the feature vector $x_i$. The quality of approximation is usually measured by a loss function $\mathcal{L}(\theta; d_i)$, and the empirical loss is defined as $\mathcal{L}(\theta; D) = \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(\theta; d_i)$. For example, in the linear model with squared loss, $\mathcal{L}(\theta; d_i) = (\langle \theta, x_i \rangle - y_i)^2$. Commonly, one restricts $\theta$ to come from a constraint set $\mathcal{C}$. This can account for additional knowledge about $\theta$, or can be helpful in avoiding overfitting and making the learning algorithm more stable. This leads to the constrained optimization problem of computing $\theta^* = \arg\min_{\theta \in \mathcal{C}} \mathcal{L}(\theta; D)$. For example, in the classical sparse linear regression problem, we set $\mathcal{C}$ to be the $\ell_1$ ball. Now our goal is to compute a model $\theta$ that is private with respect to changes in a single $d_i$ while having high quality, where the quality is measured by the excess empirical risk compared to the optimal model.

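Problem definition: Given a data set $D = \{d_1, \ldots, d_n\}$ of $n$ samples from a domain $\mathcal{D}$, a convex set $\mathcal{C} \subseteq \mathbb{R}^p$, and a convex loss function $\mathcal{L} : \mathcal{C} \times \mathcal{D} \to \mathbb{R}$, for any model $\theta$, define its excess empirical risk as

$$R(\theta; D) = \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(\theta; d_i) - \min_{\theta' \in \mathcal{C}} \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(\theta'; d_i). \tag{1}$$

We define the risk of a mechanism $\mathcal{A}$ on a data set $D$ as $R(\mathcal{A}; D) = \mathbb{E}[R(\mathcal{A}(D); D)]$, where the expectation is over the internal randomness of $\mathcal{A}$, and the risk $R(\mathcal{A}) = \max_{D} R(\mathcal{A}; D)$ is the maximum risk over all the possible data sets. Our objective is then to design a mechanism $\mathcal{A}$ which preserves $(\epsilon, \delta)$-differential privacy (Definition 2.1) and achieves as low risk as possible. We call the minimum achievable risk the privacy risk, defined as $\min_{\mathcal{A}} R(\mathcal{A})$, where the min is over all $(\epsilon, \delta)$-differentially private mechanisms $\mathcal{A}$.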
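To make the problem definition concrete, the following is a minimal sketch that evaluates the excess empirical risk for the squared loss. It assumes a one-dimensional model and the constraint set C = [-1, 1], chosen only so that the constrained minimizer has a closed form; the helper names and the toy data are hypothetical and not part of the paper.

```python
def empirical_loss(theta, data):
    """Average squared loss (theta * x - y)^2 over the data set."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)


def constrained_minimizer(data, radius=1.0):
    """Minimizer of the empirical loss over C = [-radius, radius].

    The loss is a convex quadratic in theta, so the constrained minimizer
    is the unconstrained one clipped to the interval.
    """
    sxx = sum(x * x for x, _ in data)
    sxy = sum(x * y for x, y in data)
    theta = sxy / sxx if sxx > 0 else 0.0
    return max(-radius, min(radius, theta))


def excess_risk(theta, data, radius=1.0):
    """Empirical loss of theta minus the best empirical loss over C."""
    best = constrained_minimizer(data, radius)
    return empirical_loss(theta, data) - empirical_loss(best, data)


# Hypothetical toy data set of (feature, prediction) pairs.
data = [(1.0, 0.5), (0.5, 0.2), (-1.0, -0.6)]
theta_star = constrained_minimizer(data)
print(excess_risk(theta_star, data))  # zero at the constrained minimizer
print(excess_risk(0.0, data))         # strictly positive away from it
```

A private mechanism would output a randomized model in place of `theta_star`, and its risk would be the expectation of `excess_risk` over that randomness.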

Previous work on private ERM has studied this problem under fairly general conditions. For convex loss functions that for every $d$ are 1-Lipschitz as functions from $(\mathbb{R}^p, \ell_2)$ to $\mathbb{R}$ (i.e. are Lipschitz in the first parameter with respect to the $\ell_2$ norm), and for $\mathcal{C}$ contained in the unit $\ell_2$ ball, [BST14] showed that the privacy risk is at most $\tilde{O}(\sqrt{p}/n)$. (Throughout the paper, we use $\tilde{O}$ to hide the polynomial factors in $1/\epsilon$, $\log(1/\delta)$, $\log n$, and $\log p$.) They also showed that this bound cannot be improved in general, even for the squared loss function. Similarly, they gave tight bounds under stronger assumptions on the loss functions (more details below).

In this work, we go beyond these worst-case bounds by exploiting properties of the constraint set $\mathcal{C}$. In the setting of the previous paragraph, we show that the $\sqrt{p}$ term in the privacy risk can be replaced by the Gaussian width of $\mathcal{C}$, defined as $G_{\mathcal{C}} = \mathbb{E}_{b \sim \mathcal{N}(0, I_p)}\left[\sup_{\theta \in \mathcal{C}} \langle \theta, b \rangle\right]$. Gaussian width is a well-studied quantity in convex geometry that captures the global geometry of $\mathcal{C}$ [Bal97]. For a $\mathcal{C}$ contained in the $\ell_2$ ball it is never larger than $\sqrt{p}$ and can be significantly smaller. For example, for the $\ell_1$ ball, the Gaussian width is only $\Theta(\sqrt{\log p})$. Similarly, we give improved bounds for other assumptions on the loss functions. These bounds are proved by analyzing a noisy version of the mirror descent algorithm [NY83, BT03].

In the simplest setting, when the loss function is convex and $L$-Lipschitz with respect to the $\ell_2$ norm on the parameter space, we get the following result. The precise bounds require a potential function that is tailored to the convex set $\mathcal{C}$. In the following, let $\|\mathcal{C}\|_2$ denote the $\ell_2$ radius of $\mathcal{C}$, and $G_{\mathcal{C}}$ denote the Gaussian width of $\mathcal{C}$.

Theorem 1.1 (Informal version).

There exists an -differentially private algorithm such that

In particular, , and if is a polytope with vertices, .

Similar improvements can be shown (Section 3.2) for other constraint sets, such as those bounding the grouped norm, interpolation norms, or the nuclear norm when the vector is viewed as a matrix. When one additionally assumes that the loss functions satisfy a strong convexity definition (Appendix A), we can get further improved bounds. Moreover, for smooth loss functions (Section 4), we can show that a simpler objective perturbation algorithm [CMS11, KST12] gives Gaussian-width dependent bounds similar to the ones above. Our work also implies Gaussian-width-dependent convergence bounds for the noisy (stochastic) mirror descent algorithm, which may be of independent interest.

The bounds based on mirror descent have a dependence on the $\ell_2$ Lipschitz constant. This constant might be too large for some problems. For example, for the popular sparse linear regression problem, one often assumes $x_i$ to have bounded $\ell_\infty$ norm, i.e. each entry of $x_i$, instead of $\|x_i\|_2$, is bounded. The $\ell_2$ Lipschitz constant is then polynomial in $p$ and leads to a loose bound. In these cases, it would be more beneficial to have a dependence on the $\ell_1$ Lipschitz constant. Our next contribution is to address this issue. We show that when $\mathcal{C}$ is the $\ell_1$ ball, one can get significantly better bounds using a differentially private version of the Frank-Wolfe algorithm. Let $\|\mathcal{C}\|_1$ denote the maximum $\ell_1$ radius of $\mathcal{C}$, and $\Gamma_{\mathcal{L}}$ the curvature constant for $\mathcal{L}$ (precise definition in Section 5).
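For orientation, here is a minimal sketch of the classical, non-private Frank-Wolfe iteration for the squared loss over the unit $\ell_1$ ball. It is not the private algorithm analyzed in the paper, whose vertex-selection step is randomized and is described in Section 5; the function names, the step-size schedule $2/(t+2)$, and the toy data are assumptions made for illustration.

```python
def grad_squared_loss(theta, data):
    """Gradient of (1/n) * sum_i (<theta, x_i> - y_i)^2."""
    n, p = len(data), len(theta)
    grad = [0.0] * p
    for x, y in data:
        residual = sum(t * xi for t, xi in zip(theta, x)) - y
        for j in range(p):
            grad[j] += 2.0 * residual * x[j] / n
    return grad


def frank_wolfe_l1(data, p, steps=200):
    """Non-private Frank-Wolfe over the unit l1 ball (illustrative only)."""
    theta = [0.0] * p
    for t in range(steps):
        grad = grad_squared_loss(theta, data)
        # A linear function over the l1 ball is minimized at a vertex
        # +/- e_j, where j is a coordinate maximizing |grad_j|.
        j = max(range(p), key=lambda i: abs(grad[i]))
        vertex = [0.0] * p
        vertex[j] = -1.0 if grad[j] > 0 else 1.0
        step = 2.0 / (t + 2.0)  # assumed standard step-size schedule
        theta = [(1.0 - step) * a + step * v for a, v in zip(theta, vertex)]
    return theta


# Hypothetical toy data generated by the sparse model theta = (0.5, 0, 0).
data = [((1.0, 0.0, 0.0), 0.5), ((0.0, 1.0, 0.0), 0.0),
        ((0.0, 0.0, 1.0), 0.0), ((1.0, 1.0, 0.0), 0.5)]
theta = frank_wolfe_l1(data, p=3)
print(theta)
```

Every iterate is a convex combination of vertices of the $\ell_1$ ball, so feasibility is maintained without a projection step, and each step only needs the coordinate of the gradient with the largest magnitude.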

Theorem 1.2.

If is a polytope with vertices, then there exists an -differentially private algorithm such that

In particular, for the sparse linear regression problem where each , we have that

Finally, we use the fingerprinting code lower bound technique developed in [BUV14] to show that the upper bound for the sparse linear regression problem, and hence the above result, is nearly tight.

Theorem 1.3.

For the sparse linear regression problem where , for and , any -differentially private algorithm must have

In Table 1 we summarize our upper and lower bounds. Combining our results with that of [BST14], in particular we show that all the bounds in this paper are essentially tight. The lower bound for the -norm case does not follow from [BST14], and we provide a new lower bound argument.

Previous work This work
Assumption Upper bound Lower bound Upper bound Lower bound
-Lipschitz w.r.t -norm and [BST14] [BST14] Mirror descent:
… and -smooth [CMS11] [BST14]
(for )
Frank-Wolfe:
Obj. pert:
-Lipschitz w.r.t -norm, , and curvature Frank-Wolfe:
Table 1: Upper and lower bounds for -differentially private ERM. denotes the number of corners in the convex set .(In general the dependence is on the Gaussian width of , generalizing or .) The curvature parameter is a weaker condition than smoothness, and is in particular bounded by the smoothness. Bounds ignore multiplicative dependence of and in the lower bounds, is considered as a constant. The lower bounds of [BST14] have the form .

Our results enlarge the set of problems for which privacy comes “for free”. Given samples from a distribution, suppose that is the empirical risk minimizer and is the differentially private approximate minimizer. Then the non-private ERM algorithm outputs and incurs expected (on the distribution) loss equal to the , where the generalization error term depends on the loss function, and on the number of samples . The differentially private algorithm incurs an additional loss of the privacy risk. If the privacy risk is asymptotically no larger than the generalization error, we can think of privacy as coming for free, since under the assumption of being large enough to make the generalization error small, we are also making large enough to make the privacy risk small. For many of the problems, by our work we get privacy risk bounds that are close to the best known generalization bounds for those settings. More concretely, in the case when the and the loss function is -Lipschitz in the -norm, the known generalization error bounds strictly dominate the privacy risk when [SSSSS09, Theorem 7]. In the case when is the -ball, and the loss function is the squared loss with and , the generalization error dominates the privacy risk when [BM03, Theorem 18].

1.1 Related work

In the following we distinguish between the two settings: i) the convex set is bounded in the -norm and the loss function is -Lipschitz in the -norm (call it the -setting for brevity), and ii) the convex set is bounded in the -norm and the loss function is -Lipschitz in the -norm (call it the -setting).

The -setting: In all the works on private convex optimization that we are aware of, either the excess risk guarantees depend polynomially on the dimensionality of the problem ($p$), or special structure of the loss is assumed (e.g., generalized linear model [JT14] or linear losses [DNPR10, ST13b]). Similar dependence is also present in the online version of the problem [JKT12, ST13c]. [BST14] recently showed that in the private ERM setting, this polynomial dependence on $p$ is in general unavoidable. In our work we show that one can replace this dependence on $p$ with the Gaussian width of the constraint set $\mathcal{C}$, which can be much smaller. We use the mirror descent algorithm of [BT03] as our building block.

The -setting: The only results in this setting that we are aware of are [KST12, ST13a, JT14, ST13b]. The first two works make certain assumptions about the instance (restricted strong convexity (RSC) and mutual incoherence). Under these assumptions, they obtain privacy risk guarantees that depend logarithmically on the dimension $p$, thus allowing the guarantees to be meaningful even when . In fact their bound of can be better than our tight bound of . However, these assumptions on the data are strong and may not hold in practice [Was12]. Our guarantees do not require any such data-dependent assumptions. The result of [JT14] captures the scenario when the constraint set is the probability simplex and the loss function is in the generalized linear model, but provides a worse bound of .

Effect of Gaussian width in risk minimization: For linear losses, i.e., when the loss functions are of the form $\mathcal{L}(\theta; d) = \langle \theta, d \rangle$, the notions of Rademacher complexity and Gaussian complexity are closely related to the concept of Gaussian width. One of the initial works that formalized this connection was [BM03]. In particular, they bound the excess generalization error by the Gaussian complexity of the constraint set, which is very similar to Gaussian width in the context of linear functions. Recently [CRPW12] showed that the Gaussian width of a constraint set is very closely related to the number of generic linear measurements one needs to perform to recover an underlying model.

[SZ13] analyzed the problem of noisy stochastic gradient descent (SGD) for general convex loss functions. Their empirical risk guarantees depend polynomially on the $\ell_2$-norm of the noise vector that gets added during the gradient computation in the SGD algorithm. As a corollary of our results we show that if the noise vector is sub-Gaussian (not necessarily spherical), the polynomial dependence on the $\ell_2$-norm of the noise can be replaced by a term depending on the Gaussian width of the set $\mathcal{C}$.

Analysis of noisy descent methods: The analysis of noisy versions of gradient descent and mirror descent algorithms has attracted interest for unrelated reasons [RRWN11, DJM13] when asynchronous updates are the source of noise. To our knowledge, this line of work does not take the geometry of the constraint set into account, and thus our results may be applicable to those settings as well.

We should note here that the notion of Gaussian width has been used by [NTZ13], and [DNT13] in the context of differentially private query release mechanisms but in the very different context of answering multiple linear queries over a database.

2 Background

2.1 Differential Privacy

The notion of differential privacy (Definition 2.1) is by now a de facto standard for statistical data privacy [DMNS06, Dwo06, Dwo08, Dwo09]. One reason differential privacy has become so popular is that it provides meaningful guarantees even in the presence of arbitrary auxiliary information. At a semantic level, the privacy guarantee ensures that an adversary learns almost the same thing about an individual independent of his presence or absence in the data set. The parameters $(\epsilon, \delta)$ quantify the amount of information leakage. For reasons beyond the scope of this work, $\epsilon \approx 0.1$ and $\delta = 1/n^{\omega(1)}$ are a good choice of parameters. Here $n$ refers to the number of samples in the data set.

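Definition 2.1.

A randomized algorithm $\mathcal{A}$ is $(\epsilon, \delta)$-differentially private ([DMNS06, DKM06]) if, for all neighboring data sets $D$ and $D'$ (i.e., they differ in one record, or equivalently, $d_H(D, D') = 1$) and for all events $S$ in the output space of $\mathcal{A}$, we have

$$\Pr[\mathcal{A}(D) \in S] \le e^{\epsilon} \Pr[\mathcal{A}(D') \in S] + \delta.$$

Here $d_H(D, D')$ refers to the Hamming distance.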
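As a small worked instance of Definition 2.1, the sketch below checks the defining inequality exhaustively for randomized response on a single bit, a textbook mechanism that is not used in this paper. The function names are hypothetical; the two inputs 0 and 1 play the role of neighboring data sets, and the check enumerates every event in the two-element output space.

```python
import math


def randomized_response_dist(bit, epsilon):
    """Output distribution of randomized response on one private bit.

    The true bit is reported with probability e^eps / (1 + e^eps).
    """
    keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return {bit: keep, 1 - bit: 1.0 - keep}


def satisfies_dp(dist_a, dist_b, epsilon, delta=0.0, tol=1e-12):
    """Check Pr[A(D) in S] <= e^eps * Pr[A(D') in S] + delta for all events S.

    The inequality is checked in both directions, since the neighboring
    relation is symmetric. `tol` absorbs floating-point rounding.
    """
    outcomes = sorted(dist_a)
    factor = math.exp(epsilon)
    for mask in range(1 << len(outcomes)):
        event = [o for i, o in enumerate(outcomes) if (mask >> i) & 1]
        pa = sum(dist_a[o] for o in event)
        pb = sum(dist_b[o] for o in event)
        if pa > factor * pb + delta + tol or pb > factor * pa + delta + tol:
            return False
    return True


on_zero = randomized_response_dist(0, epsilon=1.0)
on_one = randomized_response_dist(1, epsilon=1.0)
print(satisfies_dp(on_zero, on_one, epsilon=1.0))  # True
print(satisfies_dp(on_zero, on_one, epsilon=0.5))  # False
```

The second call fails because the likelihood ratio of the two output distributions is exactly $e^{1}$, which exceeds $e^{0.5}$.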

2.2 Bregman Divergence, Convexity, Norms, and Gaussian Width

In this section we review some of the concepts commonly used in convex optimization useful to the exposition of our algorithms. In all the definitions below we assume that the set is closed and convex.

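$\ell_q$-norm, $q \ge 1$: For $q \ge 1$, the $\ell_q$-norm for any vector $v \in \mathbb{R}^p$ is defined as $\|v\|_q = \left(\sum_{i=1}^{p} |v(i)|^q\right)^{1/q}$, where $v(i)$ is the $i$-th coordinate of the vector $v$.

Minkowski norm w.r.t. a set $\mathcal{C}$: For any vector $v \in \mathbb{R}^p$, the Minkowski norm (denoted by $\|v\|_{\mathcal{C}}$) w.r.t. a centrally symmetric convex set $\mathcal{C}$ is defined as follows.

$$\|v\|_{\mathcal{C}} = \min\{r \in \mathbb{R}^{+} : v \in r\mathcal{C}\}.$$

$L$-Lipschitz continuity w.r.t. norm $\|\cdot\|$: A function $f : \mathcal{C} \to \mathbb{R}$ is $L$-Lipschitz within a set $\mathcal{C}$ w.r.t. a norm $\|\cdot\|$ if the following holds.

$$\forall \theta_1, \theta_2 \in \mathcal{C}, \quad |f(\theta_1) - f(\theta_2)| \le L \cdot \|\theta_1 - \theta_2\|.$$

Convexity and $\Delta$-strong convexity w.r.t. norm $\|\cdot\|$: A function $f : \mathcal{C} \to \mathbb{R}$ is convex if

$$\forall \theta_1, \theta_2 \in \mathcal{C},\ \alpha \in [0, 1], \quad f(\alpha\theta_1 + (1 - \alpha)\theta_2) \le \alpha f(\theta_1) + (1 - \alpha) f(\theta_2).$$

A function $f : \mathcal{C} \to \mathbb{R}$ is $\Delta$-strongly convex within a set $\mathcal{C}$ w.r.t. a norm $\|\cdot\|$ if

$$\forall \theta_1, \theta_2 \in \mathcal{C},\ \alpha \in [0, 1], \quad f(\alpha\theta_1 + (1 - \alpha)\theta_2) \le \alpha f(\theta_1) + (1 - \alpha) f(\theta_2) - \frac{\Delta\,\alpha(1 - \alpha)}{2}\|\theta_1 - \theta_2\|^2.$$

Bregman divergence: For any differentiable convex function $\Psi$, the Bregman divergence defined by $\Psi$ is

$$B_{\Psi}(\theta_1, \theta_2) = \Psi(\theta_1) - \Psi(\theta_2) - \langle \nabla\Psi(\theta_2), \theta_1 - \theta_2 \rangle.$$

Notice that Bregman divergence is always non-negative, and convex in the first argument.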
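The definition of the Bregman divergence can be checked numerically on two standard potentials. The sketch below is illustrative only: the helper names are hypothetical, and the two potentials (half the squared Euclidean norm and negative entropy) are textbook choices, not necessarily the ones used in Section 3.2.

```python
import math


def bregman(psi, grad_psi, u, v):
    """B_psi(u, v) = psi(u) - psi(v) - <grad psi(v), u - v>."""
    inner = sum(g * (a - b) for g, a, b in zip(grad_psi(v), u, v))
    return psi(u) - psi(v) - inner


def half_sq_norm(t):
    """Potential 1: half the squared Euclidean norm."""
    return 0.5 * sum(x * x for x in t)


def half_sq_norm_grad(t):
    return list(t)


def neg_entropy(t):
    """Potential 2: negative entropy, defined for positive coordinates."""
    return sum(x * math.log(x) for x in t)


def neg_entropy_grad(t):
    return [math.log(x) + 1.0 for x in t]


u = [0.2, 0.3, 0.5]
v = [0.5, 0.25, 0.25]

# For potential 1 the divergence is half the squared Euclidean distance.
print(bregman(half_sq_norm, half_sq_norm_grad, u, v))
# For potential 2, on the probability simplex, it is the KL divergence.
print(bregman(neg_entropy, neg_entropy_grad, u, v))
```

Both values are non-negative, and both vanish when the two arguments coincide, in line with the remark above.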

-strong convexity w.r.t a function : A function is -strongly convex within a set w.r.t. a differentiable convex function if the following holds.

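Duality: The following duality property (Fact 2.2) of norms will be useful through the rest of this paper. Recall that for any pair of dual norms $\|\cdot\|$ and $\|\cdot\|_{*}$, and $x, y \in \mathbb{R}^p$, Hölder's inequality says that $|\langle x, y \rangle| \le \|x\| \cdot \|y\|_{*}$.

Fact 2.2.

The dual of the $\ell_q$ norm is the $\ell_{q'}$-norm such that $1/q + 1/q' = 1$. The dual of $\|\cdot\|_{\mathcal{C}}$ is $\|\cdot\|_{\mathcal{C}^{*}}$, where for any vector $v \in \mathbb{R}^p$, $\|v\|_{\mathcal{C}^{*}} = \max_{w \in \mathcal{C}} |\langle w, v \rangle|$.

Gaussian width of a set $\mathcal{C}$: Let $b \sim \mathcal{N}(0, I_p)$ be a Gaussian random vector in $\mathbb{R}^p$. The Gaussian width of a set $\mathcal{C}$ is defined as $G_{\mathcal{C}} = \mathbb{E}_{b}\left[\sup_{w \in \mathcal{C}} \langle b, w \rangle\right]$.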
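The Gaussian width can be estimated by Monte Carlo whenever the supremum has a closed form. The sketch below does this for the unit $\ell_2$ ball, where the supremum equals $\|b\|_2$, and for the unit $\ell_1$ ball, where it equals $\|b\|_\infty$. The function names, the dimension, and the number of trials are illustrative assumptions.

```python
import math
import random


def gaussian_width(support_fn, p, trials=2000, seed=0):
    """Monte Carlo estimate of E_b[sup_{w in C} <b, w>] for b ~ N(0, I_p).

    `support_fn(b)` must return sup_{w in C} <b, w> for the set C of interest.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        b = [rng.gauss(0.0, 1.0) for _ in range(p)]
        total += support_fn(b)
    return total / trials


def l2_ball_support(b):
    """sup over the unit l2 ball of <b, w> is the l2 norm of b."""
    return math.sqrt(sum(x * x for x in b))


def l1_ball_support(b):
    """sup over the unit l1 ball of <b, w> is the l-infinity norm of b."""
    return max(abs(x) for x in b)


p = 100
width_l2 = gaussian_width(l2_ball_support, p)
width_l1 = gaussian_width(l1_ball_support, p)
print(width_l2, math.sqrt(p))                     # close to sqrt(p)
print(width_l1, math.sqrt(2 * math.log(2 * p)))   # below sqrt(2 log(2p))
```

The gap between the two estimates grows with the dimension, which is the effect the width-dependent bounds exploit.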

3 Private Mirror Descent and the Geometry of C

In this section we introduce the well-established mirror descent algorithm [NY83] in the context of private convex optimization. Since mirror descent is designed to closely follow the geometry of the convex set $\mathcal{C}$, we get much tighter bounds than were previously known in the literature for a large class of interesting instantiations of the convex set $\mathcal{C}$. More precisely, using private mirror descent one can show that the privacy risk depends on the Gaussian width (see Section 2.2) instead of having any explicit dependence on the dimensionality $p$. The main technical contribution in the analysis of private (noisy) mirror descent is to express the expected potential drop in terms of the Gaussian width. (See the bound on the noise term in the proof of Theorem 3.2.)

3.1 Private Mirror Descent Method

In Algorithm 1 we define our private mirror descent procedure. The algorithm takes as input a potential function $\Psi$ that is chosen based on the constraint set $\mathcal{C}$. $B_{\Psi}$ refers to the Bregman divergence with respect to $\Psi$. (See Section 2.2.) If $\mathcal{L}$ is not differentiable at $\theta$, we use any sub-gradient at $\theta$ instead of $\nabla\mathcal{L}(\theta)$.

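Algorithm 1: Differentially Private Mirror Descent
0:  Data set: $D$, loss function: $\mathcal{L}$ (with $\ell_2$-Lipschitz constant $L$ for $\mathcal{L}$), privacy parameters: $(\epsilon, \delta)$, convex set: $\mathcal{C}$, potential function: $\Psi$, and learning rate: $\eta$.
1:  Set noise variance .
2:  Let $\theta_1$ be an arbitrary point in $\mathcal{C}$.
3:  for $t = 1$ to $T$ do
4:     , where $b_t \sim \mathcal{N}(0, \sigma^2 I_p)$.
5:  Output .
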
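The structure of a noisy mirror descent loop can be sketched as follows. This is a hedged illustration rather than a reproduction of Algorithm 1: it assumes the potential $\Psi(\theta) = \frac{1}{2}\|\theta\|_2^2$ and the unit $\ell_2$ ball as the constraint set, in which case the mirror step reduces to a Euclidean projection; it adds the Gaussian noise to the gradient before scaling by the learning rate; it returns the average iterate; and it takes the noise scale `sigma` as an input. The calibration of the noise variance to $(\epsilon, \delta)$ is not reproduced here, so the sketch by itself carries no privacy guarantee. All names are hypothetical.

```python
import math
import random


def project_l2_ball(theta, radius=1.0):
    """Euclidean projection onto the l2 ball of the given radius."""
    norm = math.sqrt(sum(t * t for t in theta))
    if norm <= radius:
        return list(theta)
    return [t * radius / norm for t in theta]


def noisy_mirror_descent(grad_fn, p, steps, eta, sigma, seed=0):
    """Noisy mirror descent with the Euclidean potential (assumed).

    grad_fn: returns a (sub-)gradient of the empirical loss at theta.
    sigma:   standard deviation of the Gaussian noise added per coordinate;
             it is NOT calibrated to any privacy parameters here.
    """
    rng = random.Random(seed)
    theta = [0.0] * p   # an arbitrary starting point in the constraint set
    average = [0.0] * p
    for _ in range(steps):
        grad = grad_fn(theta)
        noisy_grad = [g + rng.gauss(0.0, sigma) for g in grad]
        # With Psi = 0.5 * ||.||_2^2 the mirror step is a projected step.
        theta = project_l2_ball([t - eta * g for t, g in zip(theta, noisy_grad)])
        average = [a + t / steps for a, t in zip(average, theta)]
    return average


# Hypothetical loss 0.5 * ||theta - target||^2, whose gradient is theta - target.
target = [0.3, -0.2]


def grad_fn(theta):
    return [t - c for t, c in zip(theta, target)]


print(noisy_mirror_descent(grad_fn, p=2, steps=500, eta=0.1, sigma=0.0))
print(noisy_mirror_descent(grad_fn, p=2, steps=500, eta=0.1, sigma=1.0))
```

Other potentials change only the update line: the projected step is replaced by the minimizer of the linearized loss plus the Bregman divergence to the current iterate.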
Theorem 3.1 (Privacy guarantee).

Algorithm 1 is -differentially private.

The proof of this theorem is fairly straightforward and follows from the by-now standard privacy guarantee of the Gaussian mechanism [DKM06] and the strong composition theorem [DRV10]. For a detailed proof, we refer the reader to [BST14, Theorem 2.1]. To establish the utility guarantee in a general form, it will be useful to introduce a symmetric convex body $Q$ (and the norm $\|\cdot\|_Q$) w.r.t. which the potential function $\Psi$ is strongly convex. We will instantiate this theorem with various choices of $Q$ and $\Psi$ depending on $\mathcal{C}$ in Section 3.2. While this is relatively standard in mirror descent algorithms, the reader may find it somewhat counter-intuitive that $Q$ enters the algorithm only through the potential function $\Psi$, but plays an important role in the analysis and the resulting guarantee. In most of the cases, we will set $Q = \mathcal{C}$, and the reader may find it convenient to think of that case. Our proof of the theorem below closely follows the analysis of mirror descent from [ST10].

One can obtain stronger guarantees (typically, ) under strong convexity assumptions on the loss function. We defer the details of this result to Appendix A.

Theorem 3.2 (Utility guarantee).

Suppose that for any , the loss function is convex and -lipschitz with respect to the norm. Let be a symmetric convex set with Gaussian width and -diameter , and let be -strongly convex w.r.t. -norm chosen in Algorithm (Algorithm 1). If and for all , , then

Remark 1.

Notice that the bound above is scale invariant. For example, given an initial choice of the convex set , scaling may reduce but at the same time it will scale up the strong convexity parameter.

Proof of Theorem 3.2.

For the ease of notation we ignore the parameterization of on the data set and simply refer to as . To begin with, from a direct application of Jensen’s inequality, we have the following.

(2)

So it suffices to bound the R.H.S. of (2) in order to bound the excess empirical risk. In Claim 3.3, we upper bound the R.H.S. of (2) by a sequence of linear approximations of , thus “linearizing” our analysis.

Claim 3.3.

Let . For every , let be the sub-gradient of used in iteration of Algorithm (Algorithm 1). Then the convexity of the loss function implies that

Thus it suffices to bound in order to bound the privacy risk. By simple algebraic manipulation we have the following. (Recall that is the noise vector used in Algorithm .)

(3)

We next upper bound each of the terms , and in (3). By Holder’s inequality, we write

where we have used the A.M-G.M. inequality in the second step, and the triangle inequality in the third step. Taking expectations over the choice of , and using Jensen’s inequality and the definition of Gaussian width, we conclude that

(5)

We next proceed to bound the term in (3). By the definition of , it follows that

This implies that

(6)

One can write the term in (3) as follows.

(7)

Notice that since is independent of ,. Plugging the bounds (5),(6) and (7) in (3), we have the following.

(8)

In order to bound the term in (8), we use the assumption that is -strongly convex with respect to . This immediately implies that in (8) . Using this bound, summing over all -rounds, we have

(9)

In the above we used the following property of Bregman divergence: . We can prove this fact as follows. Let . By the generalized Pythagorean theorem [Rak09, Chapter 2], it follows that . The last inequality follows from the fact that Bregman divergence is always non-negative. Now since minimizes and is convex, it follows that . This immediately implies .

Setting and , and using (2) and Claim 3.3 we get the required bound. ∎

3.2 Instantiation of Private Mirror Descent to Various Settings of

In this section we discuss some of the instantiations of Theorem 3.2.

For arbitrary convex set with -diameter : Let (with some fixed ) and we choose the convex set to be the unit -ball in Theorem 3.2. Immediately, we obtain the following as a corollary.

(10)

This is a slight improvement over [BST14].

For the convex set being a polytope: Let be the convex hull of vectors such that for all , . Fact 3.4 will be very useful for choosing the correct potential function in Algorithm (Algorithm 1).

Fact 3.4 (From [SST11]).

For the convex set defined above, let be the convex hull of and . The Minkowski norm for any is given by . Additionally, let be a norm for any . Then the function is -strongly convex w.r.t. -norm.

We next state a claim that will be useful later.

Claim 3.5.

If , then the following is true for any :

Proof.

First notice that for any vector , . This follows from Holder’s inequality. Now setting , we get . For any , let be the vector of parameters corresponding to . From the above, we know that . And by definition, we know that . This completes the proof. ∎

Claim 3.5 implies that if and , then . Additionally due to Fact 3.4, is -strongly convex w.r.t. . With the above observations, and observing that , setting and as above, we immediately get the following corollary of Theorem 3.2. Notice that the bound does not have any explicit dependence on the dimensionality of the problem.

(11)

Notice that this result extends to the standard -dimensional probability simplex: . In this case, the only difference is that the term gets replaced by in (11). We remark that applying standard approaches of [JT14, ST13b] provides a similar bound only in the case of linear loss functions.

For grouped -norm: For a vector and a parameter , the grouped -norm defined as . If denotes the convex set centered at zero with radius one with respect to -norm, then it follows from union bound on each of the blocks of coordinates in that . In the following we propose the following choices of depending on the parameter . (These choices are based on [BTN13, Section 5.3.3].) For a given , divide the coordinates of into blocks, and denote each block as .

With this setting of one can show that . Plugging these bounds in Theorem 3.2, we get (12) as a corollary.

(12)

Similar bounds can be achieved for other forms of interpolation norms, e.g., -interpolation norms:
with . Notice that since the set is a subset of , where and , it follows that the Gaussian width . Additionally from [SST11] it follows that there exists a strongly convex function w.r.t. such that it is for . While using Theorem 3.2 in both of the above settings, we set the convex set .

For low-rank matrices: It is known that the non-private mirror descent extends immediately to matrices [BTN13]. In the following we show that this is also true for the private mirror descent algorithm in Algorithm 1. For the matrix setting, we assume $\theta \in \mathbb{R}^{p \times p}$ and the loss function is $L$-Lipschitz in the Frobenius norm $\|\cdot\|_F$. From [DTTZ14] it follows that if the noise vector in Algorithm 1 is replaced by a matrix with each entry drawn i.i.d. from $\mathcal{N}(0, \sigma^2)$ (with the standard deviation $\sigma$ being the same as in Algorithm 1), then the $(\epsilon, \delta)$-differential privacy guarantee holds. In the following we instantiate Theorem 3.2 for the class of real matrices with nuclear norm at most one. Call it the set $\mathcal{C}$. (For a matrix $\theta$, the nuclear norm $\|\theta\|_{nuc}$ refers to the sum of the singular values of $\theta$.) This class is the convex hull of rank one matrices with unit Euclidean norm. [CRPW12, Proposition 3.11] shows that the Gaussian width of $\mathcal{C}$ is $O(\sqrt{p})$. [BTN13, Section 5.2.3] showed that the function with is -strongly convex w.r.t. the $\|\cdot\|_{nuc}$-norm. Moreover, . Plugging these bounds in Theorem 3.2, we immediately get the following excess empirical risk guarantee.

(13)

3.3 Convergence Rate of Noisy Mirror Descent

In this section we analyze the excess empirical risk guarantees of Algorithm 1 as a purely noisy mirror descent algorithm, ignoring privacy considerations. Let us assume that the oracle that returns the gradient computation is noisy. In particular each of the (in Line 4 of Algorithm 1) is drawn independently from distributions which are mean zero and sub-Gaussian with variance , where is the covariance matrix. For example, this may be achieved by sampling a small number of ’s and averaging over the sampled values. Using the same proof technique of Theorem 3.2, and the observation that