Gradient Perturbation is Underrated for Differentially Private Convex Optimization

11/26/2019 ∙ by Da Yu, et al. ∙ Microsoft ∙ Sun Yat-sen University

Gradient perturbation, widely used for differentially private optimization, injects noise at every iterative update to guarantee differential privacy. Previous work first determines the noise level that satisfies the privacy requirement and then analyzes the utility of the noisy gradient updates as in the non-private case. In this paper, we explore how the privacy noise affects the optimization properties. We show that for differentially private convex optimization, the utility guarantees of both DP-GD and DP-SGD are determined by an expected curvature rather than the minimum curvature. The expected curvature represents the average curvature over the optimization path, which is usually much larger than the minimum curvature and hence helps us achieve a significantly improved utility guarantee. By using the expected curvature, our theory justifies the advantage of gradient perturbation over other perturbation methods and closes the gap between theory and practice. Extensive experiments on real-world datasets corroborate our theoretical findings.


1 Introduction

Machine learning has become a powerful tool for many practical applications. The training process often needs access to private data, e.g., in financial and medical applications. Recent work has shown that a model learned from training data may leak unintended information about individual records (fredrikson2015model; wu2016methodology; shokri2017membership; hitaj2017deep). Differential privacy (DP) (dwork2006our; dwork2006calibrating) has become the gold standard for privacy-preserving data analysis. It provides a provable privacy guarantee by ensuring that the influence of any individual record is negligible, and it has been deployed in real-world applications by large corporations and the U.S. Census Bureau (erlingsson2014rappor; APPLE; abowd2016challenge; ding2017collecting).

We study the fundamental problem that arises when differential privacy meets machine learning: the differentially private empirical risk minimization (DP-ERM) problem (chaudhuri2009privacy; chaudhuri2011differentially; kifer2012private; bassily2014differentially; talwar2015nearly; wu2017bolt; zhang2017efficient; wang2017differentially; smith2017interaction; jayaraman2018distributed; feldman2018privacy; iyengar2019towards; wang2019differentially). DP-ERM minimizes the empirical risk while guaranteeing that the output of the learning algorithm is differentially private with respect to the training data. Such a privacy guarantee provides strong protection against potential adversaries (hitaj2017deep; rahman2018membership). In order to guarantee privacy, it is necessary to introduce randomness into the algorithm. There are three common ways to do so, distinguished by when the noise is added: output perturbation, objective perturbation, and gradient perturbation.

Output perturbation (wu2017bolt; zhang2017efficient) first runs the learning algorithm exactly as in the non-private case and then adds noise to the output parameter. Objective perturbation (chaudhuri2011differentially; kifer2012private; iyengar2019towards) perturbs the objective (i.e., the empirical loss) and then releases the minimizer of the perturbed objective. Gradient perturbation (song2013stochastic; bassily2014differentially; abadi2016deep; wang2017differentially; lee2018concentrated; jayaraman2018distributed) perturbs each intermediate update. If each update is differentially private, the composition theorem of differential privacy ensures that the whole learning procedure is differentially private.
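To make the distinction concrete, the following Python sketch shows where each approach injects randomness. The noise scales (`sigma_out`, `b_scale`, `sigma_grad`) and the gradient/loss helpers are placeholders for illustration only; they are not the calibrations required for a formal privacy guarantee.

```python
import numpy as np

def output_perturbation(w_init, grad_fn, eta, T, sigma_out, rng):
    # Train exactly as in the non-private case, then perturb the final model.
    w = np.array(w_init, dtype=float)
    for _ in range(T):
        w -= eta * grad_fn(w)
    return w + rng.normal(0.0, sigma_out, size=w.shape)

def objective_perturbation(minimize_fn, loss_fn, dim, b_scale, rng):
    # Perturb the empirical loss with a random linear term, then release its minimizer.
    b = rng.normal(0.0, b_scale, size=dim)
    return minimize_fn(lambda w: loss_fn(w) + b.dot(w))

def gradient_perturbation(w_init, grad_fn, eta, T, sigma_grad, rng):
    # Perturb every intermediate gradient update (the approach studied in this paper).
    w = np.array(w_init, dtype=float)
    for _ in range(T):
        w -= eta * (grad_fn(w) + rng.normal(0.0, sigma_grad, size=w.shape))
    return w
```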

Gradient perturbation has several advantages over output/objective perturbation. First, gradient perturbation does not require strong assumptions on the objective, because it only needs to bound the sensitivity of each gradient update rather than that of the whole learning process. Second, gradient perturbation can release the noisy gradient at each iteration without weakening the privacy guarantee, because differential privacy is immune to post-processing (algofound); this makes it a more favorable choice for applications such as distributed optimization (rajkumar2012differentially; agarwal2018cpsgd; jayaraman2018distributed). Finally, gradient perturbation often achieves better empirical utility than output/objective perturbation for DP-ERM.

However, the existing theoretical utility guarantees for gradient perturbation are the same as, or strictly inferior to, those of the other perturbation methods, as shown in Table 1. This motivates us to ask:

“What is wrong with the theory for gradient perturbation? Can we justify the empirical advantage of gradient perturbation theoretically?”

We revisit the analysis of the gradient perturbation approach. Previous work (bassily2014differentially; wang2017differentially; jayaraman2018distributed) derives the utility guarantee of gradient perturbation in two steps: it first determines the noise variance at each step that meets the privacy requirement, and then derives the utility guarantee using the same convergence analysis as in the non-private case. However, the noise added to guarantee privacy naturally affects the optimization procedure, and this two-step approach does not exploit the interaction between the privacy noise and the optimization dynamics.

In this paper, we exploit the fact that the privacy noise affects the optimization procedure and establish new, much tighter utility guarantees for the gradient perturbation approach. Our contributions can be summarized as follows.

  • We introduce an expected curvature that accurately characterizes the optimization behavior when perturbation noise is added at each gradient update.

  • We establish the utility guarantees for DP-GD for both convex and strongly convex objectives based on the expected curvature rather than the usual minimum curvature.

  • We also establish the utility guarantees for DP-SGD for both convex and strongly convex objectives based on the expected curvature. To the best of our knowledge, this is the first work to remove the dependency on the minimum curvature for DP-ERM algorithms.

In the DP-ERM literature, there is a gap between the utility guarantee for non-strongly convex objectives and that for strongly convex objectives. By using the expected curvature, however, we show that some non-strongly convex objectives can achieve the same order of utility guarantee as strongly convex objectives, matching the empirical observation. This is because the expected curvature can be relatively large even for non-strongly convex objectives.

As mentioned earlier, prior to our work there was a mismatch between the theoretical guarantees and the empirical behavior of the gradient perturbation approach relative to the other two perturbation approaches. Our results theoretically justify the advantage of gradient perturbation and close this gap.

1.1 Paper Organization

The rest of this paper is organized as follows. Section 2 introduces the notation and the DP-ERM task. In Section 3, we introduce the expected curvature and establish the utility guarantees of both DP-GD and DP-SGD based on it; we then discuss the three perturbation approaches. We present extensive experiments in Section 4 and conclude in Section 5.

Authors Perturbation Algorithm Utility () Utility ()
chaudhuri2011differentially Objective N/A
zhang2017efficient Output GD
bassily2014differentially Gradient SGD
jayaraman2018distributed Gradient GD N/A
Ours Gradient GD
Ours Gradient SGD
Table 1: Expected excess empirical risk bounds under $(\epsilon, \delta)$-DP, where $n$ and $p$ are the number of samples and the number of parameters, respectively, and the remaining symbols denote the smoothness coefficient, the strong convexity coefficient, and the expected curvature, respectively (see Section 3.1). The convex column refers to convex but not strongly convex objectives. The Lipschitz constant is assumed to be 1, and constant and logarithmic factors are omitted for simplicity.

2 Preliminaries

We introduce notation and definitions in this section. Given a dataset $D = \{d_1, \ldots, d_n\}$, the objective function is defined as $F(w) = \frac{1}{n}\sum_{i=1}^{n} f(w; d_i)$, where $f(w; d_i)$ is the loss of model $w$ on the record $d_i$. For simplicity, we use $f_i(w)$ to denote $f(w; d_i)$. We use $\|\cdot\|$ to denote the $\ell_2$ norm of a vector, and $\mathcal{W}^*$ to denote the set of optimal solutions of $F$. Throughout this paper, we assume $\mathcal{W}^*$ is non-empty.

Definition 1 (Objective properties).

For any $w, w' \in \mathbb{R}^p$, a function $f$

  • is $L$-Lipschitz if $|f(w) - f(w')| \le L\,\|w - w'\|$.

  • is $\beta$-smooth if $\|\nabla f(w) - \nabla f(w')\| \le \beta\,\|w - w'\|$.

  • is convex if $f(w) \ge f(w') + \langle \nabla f(w'),\, w - w' \rangle$.

  • is $\mu$-strongly convex (or $\mu$-SC) if $f(w) \ge f(w') + \langle \nabla f(w'),\, w - w' \rangle + \frac{\mu}{2}\|w - w'\|^2$.

The strong convexity coefficient $\mu$ is a lower bound on the minimum curvature of the function over the domain.

We say that two datasets $D$ and $D'$ are neighboring datasets (denoted $D \sim D'$) if $D'$ can be obtained by arbitrarily modifying one record in $D$ (or vice versa). In this paper we consider $(\epsilon, \delta)$-differential privacy, defined as follows.

Definition 2 ($(\epsilon, \delta)$-DP (dwork2006our; dwork2006calibrating)).

A randomized mechanism $\mathcal{M}$ guarantees $(\epsilon, \delta)$-differential privacy if for any two neighboring input datasets $D \sim D'$ and for any subset of outputs $S$ it holds that $\Pr[\mathcal{M}(D) \in S] \le e^{\epsilon} \Pr[\mathcal{M}(D') \in S] + \delta$.

We note that $\delta$ can be viewed as the probability that the pure $\epsilon$-DP guarantee fails, so a meaningful setting requires $\delta$ to be very small. By definition, differential privacy controls the maximum influence that any individual record can have on the output. A smaller $\epsilon$ implies less information leakage but usually leads to worse utility; one can adjust $\epsilon$ to trade off privacy against utility.
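As a standard point of reference (a classical result about the Gaussian mechanism, not a contribution of this paper), the noise is typically calibrated to the $\ell_2$-sensitivity $\Delta_2$ of the released quantity: for $\epsilon \in (0, 1)$, releasing $q(D) + \mathcal{N}(0, \sigma^2 I)$ satisfies $(\epsilon, \delta)$-DP whenever

$$\sigma \;\ge\; \frac{\Delta_2 \sqrt{2\ln(1.25/\delta)}}{\epsilon}, \qquad \Delta_2 \;=\; \max_{D \sim D'} \|q(D) - q(D')\|_2 .$$

Gradient perturbation methods apply this kind of calibration to each gradient update and then account for the composition of the privacy loss over all iterations.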

DP-ERM requires that the output of the learning algorithm is differentially private with respect to the input dataset $D$. Let $w^*$ be one of the optimal solutions of $F$; the utility of a DP-ERM algorithm with output $w_{\mathrm{priv}}$ is measured by the expected excess empirical risk $\mathbb{E}[F(w_{\mathrm{priv}})] - F(w^*)$, where the expectation is taken over the algorithm randomness.

3 Main Results

In this section, we first define the expected curvature and explain why it reflects the average rather than the minimum curvature. We then use the expected curvature to improve the analysis of both DP-GD and DP-SGD.

3.1 Expected Curvature

In the non-private setting, the analysis of convex optimization relies on the strong convexity coefficient $\mu$, which is the minimum curvature over the domain and can be extremely small for some common objectives. Previous work on DP-ERM uses the same analysis as in the non-private case, so the resulting utility bounds rely on the minimum curvature. In our analysis, however, we avoid the dependency on the minimum curvature by exploiting how the privacy noise affects the optimization. With the perturbation noise, the curvature that the optimization path encounters in expectation is related to the average curvature instead of the minimum curvature. Definition 3 captures this average curvature, denoted $\nu$, with respect to Gaussian noise. We use $w^*$ to denote the optimal solution closest to the initial point.

Definition 3 (Expected curvature).

A convex function $F$ has expected curvature $\nu$ with respect to noise $z \sim \mathcal{N}(0, \sigma^2 I_p)$ if for any $w$ it holds that

(1)

where the expectation is taken with respect to $z$.

Claim 1.

If $F$ is $\mu$-strongly convex, we have $\nu \ge \mu$.

Proof.

It can be verified that $\nu \ge \mu$ always holds because of the definition of strong convexity. ∎

In fact, $\nu$ represents the average curvature and is usually much larger than $\mu$. Let $H$ be the Hessian matrix of $F$ evaluated at $w^*$, and let $w^\top$ denote the transpose of $w$. We use a Taylor expansion to approximate the left hand side of Eq. (1); for a convex objective, the Hessian matrix is positive semi-definite and its trace $\operatorname{tr}(H)$ is the sum of its eigenvalues. Expanding the right hand side of Eq. (1) in the same way and combining the two approximations, we can estimate the value of $\nu$ in Definition 3: for a relatively large noise variance, this estimate shows that $\nu$ is essentially the average curvature at $w^*$. A large variance is a reasonable setting because a meaningful differential privacy guarantee requires a non-trivial amount of noise.
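To illustrate the estimate, here is a sketch under an assumption about the form of Eq. (1), which is not reproduced above: suppose Definition 3 compares $\mathbb{E}_z[F(w^* + z)] - F(w^*)$ against $\frac{\nu}{2}\,\mathbb{E}_z[\|z\|^2]$ with $z \sim \mathcal{N}(0, \sigma^2 I_p)$. Since $\nabla F(w^*) = 0$, a second-order Taylor expansion gives

$$\mathbb{E}_z\big[F(w^* + z)\big] - F(w^*) \;\approx\; \tfrac{1}{2}\,\mathbb{E}_z\big[z^\top H z\big] \;=\; \tfrac{\sigma^2}{2}\operatorname{tr}(H), \qquad \mathbb{E}_z\big[\|z\|^2\big] \;=\; p\,\sigma^2,$$

so the largest admissible $\nu$ is about $\operatorname{tr}(H)/p$, the average eigenvalue of the Hessian, rather than its smallest eigenvalue $\mu$.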

The above analysis suggests that $\nu$ can be independent of $\mu$ and much larger than it. This is indeed true for many convex objectives. Take $\ell_2$-regularized logistic regression as an example. The objective is strongly convex only because of the $\ell_2$ regularizer, so the minimum curvature (the strong convexity coefficient) equals the regularization coefficient $\lambda$. Shamir et al. [1] analyze the optimal choice of $\lambda$ (Section 4.3 in [1]), and in practice the typical choice of $\lambda$ is even smaller. Figure 1 compares the minimum and average curvatures of regularized logistic regression during training: the average curvature is essentially unaffected by the regularization coefficient $\lambda$, whereas the minimum curvature collapses within the first few steps. Therefore, removing the dependence on the minimum curvature is a significant improvement. We also plot the curvatures for another dataset, KDDCup99, in Appendix C; they are similar to those in Figure 1.
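To make the contrast concrete, here is a sketch under the assumption that the objective is the standard $\ell_2$-regularized logistic loss $F(w) = \frac{1}{n}\sum_{i=1}^n \log\!\big(1 + e^{-y_i x_i^\top w}\big) + \frac{\lambda}{2}\|w\|^2$ with labels $y_i \in \{-1, +1\}$ (this specific form is our illustration, consistent with but not spelled out in the text). Its Hessian is

$$\nabla^2 F(w) \;=\; \frac{1}{n}\sum_{i=1}^{n} p_i\,(1 - p_i)\, x_i x_i^\top \;+\; \lambda I, \qquad p_i = \frac{1}{1 + e^{-x_i^\top w}},$$

so the minimum eigenvalue can be as small as $\lambda$ (in directions where the data provide no curvature), while the average eigenvalue $\operatorname{tr}(\nabla^2 F(w))/p = \frac{1}{np}\sum_i p_i(1-p_i)\|x_i\|^2 + \lambda$ is dominated by the data term and is essentially independent of $\lambda$.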

Perturbation noise is necessary to attain an expected curvature larger than the minimum curvature. We note that the expected curvature offers no advantage over $\mu$ when the training process does not involve perturbation noise (corresponding to $\sigma = 0$ in Definition 3). For example, objective/output perturbation cannot utilize this expected curvature condition because no noise is injected into their training process. Therefore, among the three perturbation methods, gradient perturbation is the only one that can leverage this effect of the noise.

We note that $\mu = 0$ does not necessarily lead to $\nu = 0$. A concrete example is given in Figure 2 (from negahban2012unified), which illustrates a loss function in the high-dimensional ($p > n$) setting, i.e., the restricted strongly convex scenario: the loss is curved in certain directions but completely flat in others. The average curvature of such an objective is always positive, but the worst-case (minimum) curvature is zero. Although some recent work shows that the utility guarantee for the high-dimensional DP-ERM task may not depend on the worst-case curvature (wang2019differentially), Figure 2 still provides a good illustration of the case $\mu = 0$. Moreover, as shown in Figure 1, the average curvature of logistic regression on the Adult dataset stays above zero during training even when the regularization coefficient is zero. As we will show later, a positive expected curvature over the optimization path is sufficient for our optimization analysis.

Figure 1: Curvatures of regularized logistic regression on the Adult dataset during training. Dot/cross symbols represent the average/minimum curvature, respectively.
Figure 2: Illustration of a generic loss function in the high-dimensional setting ($p > n$; Figure 3 in negahban2012unified).

3.2 Utility Guarantee of DP-GD Based on Expected Curvature

In this section we show that the expected curvature can be used to improve the utility bound of DP-GD (Algorithm 1).

  Input: Privacy parameters $\epsilon, \delta$; number of steps $T$; learning rate $\eta$; loss function $F$ with Lipschitz constant $L$.
  for $t = 1$ to $T$ do
     Compute the gradient $\nabla F(w_t)$.
     Update the parameter: $w_{t+1} = w_t - \eta\,(\nabla F(w_t) + z_t)$, where $z_t \sim \mathcal{N}(0, \sigma^2 I_p)$.
  end for
Algorithm 1 Differentially Private Gradient Descent (DP-GD)
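A minimal NumPy sketch of this loop is given below, treating the noise standard deviation `sigma` as already calibrated (e.g., via the composition result cited next) and `grad_F` as a user-supplied full-batch gradient oracle; both names are placeholders rather than notation from the paper.

```python
import numpy as np

def dp_gd(w0, grad_F, eta, T, sigma, rng=None):
    """Gradient descent in which every full-batch gradient is perturbed
    with isotropic Gaussian noise before the update is applied."""
    rng = rng or np.random.default_rng()
    w = np.array(w0, dtype=float)
    for _ in range(T):
        z = rng.normal(0.0, sigma, size=w.shape)  # privacy noise z_t
        w = w - eta * (grad_F(w) + z)             # noisy gradient step
    return w
```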

Algorithm 1 is $(\epsilon, \delta)$-DP if the noise level $\sigma$ is set as specified in (jayaraman2018distributed). Let $\{w_1, \ldots, w_T\}$ be the training path and let $\nu$ be the minimum expected curvature over this path. We now present the utility guarantee of DP-GD for the case $\nu > 0$.

Theorem 1 (Utility guarantee, $\nu > 0$).

Suppose $F$ is $L$-Lipschitz and $\beta$-smooth with expected curvature $\nu$. With the learning rate, the number of steps, and the noise level set as specified in the proof, we have

Proof.

All proofs in this paper are relegated to Appendix A. ∎

Remark 1.

Theorem 1 depends only on the expected curvature over the training path.

Unless otherwise specified, the expectation is taken over the algorithm randomness. Theorem 1 significantly improves upon the original analysis of DP-GD because of the arguments in Section 3.1. If $\nu = 0$, then the curvature is flat in all directions; one example is a linear function, which is used by bassily2014differentially to derive their utility lower bound. Such a simple function is rarely used as a loss function in practice. Nonetheless, we give the utility guarantee for the case $\nu = 0$ in Theorem 2.

Theorem 2 (Utility guarantee, $\nu = 0$).

Suppose $F$ is $L$-Lipschitz and $\beta$-smooth. With the learning rate, the number of steps, and the noise level set as specified in the proof, and letting $\bar{w}$ denote the average of the iterates, we have

We use parameter averaging to reduce the influence of the perturbation noise because the gradient update does not have a strong contraction effect when $\nu = 0$.

3.3 Utility Guarantee of DP-SGD Based on Expected Curvature

Stochastic gradient descent has become one of the most popular optimization methods because of its cheap per-iteration cost. In this section we show that the expected curvature can also improve the utility analysis of DP-SGD (Algorithm 2). When the objective is not smooth, the gradient in the update denotes an element of the subgradient set evaluated at the current iterate. Before stating our theorems, we introduce the moments accountant technique (Lemma 1), which is essential for establishing the privacy guarantee.

Lemma 1 (abadi2016deep).

There exist constants $c_1$ and $c_2$ so that, given the sampling probability $q$ (here $q = 1/n$) and the number of steps $T$, for any $\epsilon < c_1 q^2 T$, Algorithm 2 is $(\epsilon, \delta)$-differentially private for any $\delta > 0$ if we choose $\sigma \ge c_2\, q \sqrt{T \log(1/\delta)} / \epsilon$.

  Input: Dataset $D$; individual loss function $f(\cdot\,; d_i)$ with Lipschitz constant $L$; number of iterations $T$; learning rate $\eta$.
  for $t = 1$ to $T$ do
     Sample $i_t$ uniformly from $\{1, \ldots, n\}$. Compute the gradient $\nabla f(w_t; d_{i_t})$. Update the parameter: $w_{t+1} = w_t - \eta\,(\nabla f(w_t; d_{i_t}) + z_t)$, where $z_t \sim \mathcal{N}(0, \sigma^2 I_p)$.
  end for
Algorithm 2 Differentially Private Stochastic Gradient Descent (DP-SGD)
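A corresponding NumPy sketch is below; the noise scale `sigma` is again assumed to be calibrated (via Lemma 1), `grad_f(w, i)` is a placeholder for the per-example gradient, and gradient clipping is used to enforce the Lipschitz bound `L` as described in Section 4.

```python
import numpy as np

def dp_sgd(w0, grad_f, n, eta, T, sigma, L, rng=None):
    """One uniformly sampled example per step; its (clipped) gradient is
    perturbed with Gaussian noise before the update is applied."""
    rng = rng or np.random.default_rng()
    w = np.array(w0, dtype=float)
    for _ in range(T):
        i = rng.integers(n)                                 # uniform sampling
        g = np.asarray(grad_f(w, i), dtype=float)
        g *= min(1.0, L / (np.linalg.norm(g) + 1e-12))      # clip to norm L
        z = rng.normal(0.0, sigma, size=w.shape)            # privacy noise z_t
        w = w - eta * (g + z)
    return w
```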

For the case $\nu > 0$, Theorem 3 presents the utility guarantee of DP-SGD.

Theorem 3 (Utility guarantee, $\nu > 0$).

Suppose $F$ is $L$-Lipschitz with expected curvature $\nu$. Choose $\sigma$ based on Lemma 1 to guarantee $(\epsilon, \delta)$-DP. With the learning rate and the number of steps set as specified in the proof, we have

Remark 2.

Theorem 3 does not require the smoothness assumption.

Theorem 3 shows that the utility guarantee of DP-SGD also depends on $\nu$ rather than $\mu$. We set the number of steps following bassily2014differentially; this many iterations is necessary even for non-private SGD to reach the corresponding precision. We next show that, for a relatively coarse precision, the running time can be reduced significantly.

Theorem 4.

Suppose $F$ is $L$-Lipschitz with expected curvature $\nu$. Choose $\sigma$ based on Lemma 1 to guarantee $(\epsilon, \delta)$-DP. With the learning rate and the (reduced) number of steps set as specified in the proof, we have

We note that the analysis of bassily2014differentially, under the same reduced number of steps, still yields a bound that depends on the minimum curvature. Theorem 5 shows the utility guarantee for the case $\nu = 0$.

Theorem 5 (Utility guarantee, $\nu = 0$).

Suppose $F$ is $L$-Lipschitz, and assume the iterates remain in a bounded region. Choose $\sigma$ based on Lemma 1 to guarantee $(\epsilon, \delta)$-DP. Letting $\bar{w}$ denote the average of the iterates, and setting the learning rate and the number of steps as specified in the proof, we have

This utility guarantee can be derived from Theorem 2 in (shamir2013stochastic).

3.4 Discussion on the Three Perturbation Approaches

In this section, we briefly discuss the other two perturbation approaches and compare them with gradient perturbation.

Output perturbation (wu2017bolt; zhang2017efficient) perturbs the learned model after training: it adds noise to the model produced by the non-private learning process. The magnitude of the perturbation noise is proportional to the maximum influence one record can have on the learned model. Take gradient descent as an example. At each step, the gradient of the differing record pushes apart the two parameter sequences generated by neighboring datasets, and the maximum distance expansion is governed by the Lipschitz coefficient. At the same time, the gradients of the records shared by the two datasets shrink the parameter distance because of the contraction effect of the gradient update. The contraction effect depends on the smoothness and strong convexity coefficients; a smaller strong convexity coefficient leads to weaker contraction. The sensitivity of the output perturbation algorithm is an upper bound on the largest possible final distance between the two parameter sequences.
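The contraction effect can be made concrete with a standard fact (our illustration, not a statement taken from the paper): if $F$ is twice differentiable with Hessian eigenvalues in $[\mu, \beta]$ and the step size satisfies $\eta \le 1/\beta$, then the gradient descent map is $(1 - \eta\mu)$-Lipschitz,

$$\big\| \big(w - \eta \nabla F(w)\big) - \big(w' - \eta \nabla F(w')\big) \big\| \;\le\; (1 - \eta\mu)\,\|w - w'\|,$$

so as $\mu \to 0$ the contraction factor approaches $1$, the two parameter sequences can drift far apart over many steps, and output perturbation must add correspondingly large noise.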

Objective perturbation (chaudhuri2011differentially; kifer2012private; iyengar2019towards) perturbs the objective function before training. It requires the objective to be strongly convex to guarantee the uniqueness of the solution, so it first adds a regularization term to obtain strong convexity if the original objective is not strongly convex, and then perturbs the objective with a random linear term. The sensitivity of objective perturbation is the maximum change of the minimizer that one record can produce; chaudhuri2011differentially and kifer2012private use the largest and smallest eigenvalues of the objective's Hessian matrix (i.e., the smoothness and strong convexity coefficients) to upper bound this change.

In comparison, gradient perturbation is more flexible than output/objective perturbation. For example, to bound the sensitivity, gradient perturbation only requires the Lipschitz coefficient, which can easily be enforced by the gradient clipping technique. In contrast, both output and objective perturbation additionally need to compute the smoothness coefficient, which is hard for some common objectives such as softmax regression.

More critically, output/objective perturbation cannot utilize the expected curvature condition because their training processes contain no perturbation noise. Moreover, they have to account for the worst-case behavior of the learning algorithm: differential privacy makes a worst-case assumption on the query function, and output/objective perturbation treats the whole learning algorithm as a single query to the private dataset. This explains why their utility guarantees depend on the worst-case curvature of the objective.

4 Experiments

In this section, we evaluate the performance of DP-GD and DP-SGD on multiple real-world datasets. We use the benchmark datasets provided by iyengar2019towards. The objective functions are logistic regression for the binary datasets and softmax regression for the multi-class datasets.

Datasets. The benchmark datasets include two multi-class datasets (MNIST, Covertype) and five binary datasets, three of which are high dimensional (Gisette, Real-sim, RCV1). Following iyengar2019towards, we use a fixed fraction of the data for training and the rest for testing. Detailed descriptions of the datasets can be found in Appendix B.

Implementation details. We track Rényi differential privacy (RDP) (mironov2017renyi) and convert it to $(\epsilon, \delta)$-DP. The number of steps is chosen from a fixed grid for both DP-GD and DP-SGD. For DP-SGD, we use the moments accountant to track the privacy loss, with a fixed sampling ratio. The standard deviation of the added noise is set to the smallest value such that the privacy budget allows running the desired number of steps. We ensure each loss function is Lipschitz by clipping individual gradients; the method in goodfellow2015efficient allows us to clip individual gradients efficiently. The clipping threshold is fixed, with a separate choice for the high-dimensional datasets because of their sparse gradients. For DP-GD, the learning rate is chosen from a fixed grid, with a separate grid for the high-dimensional datasets. The learning rate of DP-SGD is twice that of DP-GD and is decayed at the middle of training. The privacy parameter $\delta$ and the regularization coefficient are set to fixed values. All reported numbers are averaged over 20 runs.
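As an illustration of the per-example clipping step (a generic sketch for the logistic loss; the threshold `C` and the helper name are placeholders, not the exact implementation used in the experiments):

```python
import numpy as np

def clipped_logreg_grad_sum(w, X, y, C):
    """Sum of per-example logistic-loss gradients, each clipped to l2 norm C.
    X: (n, p) feature matrix, y: labels in {-1, +1}."""
    margins = y * (X @ w)
    coeffs = -y / (1.0 + np.exp(margins))      # d loss_i / d margin_i
    grads = coeffs[:, None] * X                # per-example gradients, shape (n, p)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    grads *= np.minimum(1.0, C / np.maximum(norms, 1e-12))  # clip each row to norm C
    return grads.sum(axis=0)
```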

Baseline algorithms. The baselines include state-of-the-art objective and output perturbation algorithms. For objective perturbation, we use Approximate Minima Perturbation (AMP) (iyengar2019towards); for output perturbation, we use the algorithm in wu2017bolt (output perturbation SGD). We adopt the implementation and hyperparameters in iyengar2019towards for both algorithms. For multi-class classification tasks, wu2017bolt and iyengar2019towards divide the privacy budget evenly and train multiple binary classifiers, because their algorithms need to compute the smoothness coefficient before training and are therefore not directly applicable to softmax regression.

Experiment results. The validation accuracy of all evaluated algorithms at a fixed privacy level (set separately for the multi-class datasets) is presented in Table 2. We also plot the accuracy for varying $\epsilon$ in Figure 3. These results confirm our theory in Section 3: gradient perturbation achieves better performance than the other perturbation methods because it leverages the average curvature.

KDDCup99 Adult MNIST Covertype Gisette Real-sim RCV1
Non-private 99.1 84.8 91.9 71.2 96.6 93.3 93.5
AMP* 97.5 79.3 71.9 64.3 62.8 73.1 64.5
Out-SGD 98.1 77.4 69.4 62.4 62.3 73.2 66.7
DP-SGD 98.7 80.4 87.5 67.7 63.0 73.8 70.4
DP-GD 98.7 80.9 88.6 66.2 67.3 76.1 74.9
Table 2: Validation accuracy (in %) of each algorithm on various real-world datasets. The privacy parameter $\epsilon$ is set separately for the binary and multi-class datasets. *For the multi-class datasets MNIST and Covertype, we report the numbers from iyengar2019towards directly because of the long running time of AMP, especially on multi-class datasets.

Figure 3: Validation accuracy (in %) of each algorithm for varying $\epsilon$. NP denotes the non-private baseline. Detailed descriptions of the evaluated datasets can be found in Table 3.

5 Conclusion

In this paper, we show that the privacy noise actually helps the optimization analysis and can be used to improve the utility guarantees of both DP-GD and DP-SGD. Our results theoretically justify the empirical superiority of gradient perturbation over other methods and advance the state-of-the-art utility guarantees for DP-ERM algorithms. Experiments on real-world datasets corroborate our theoretical findings. In the future, it would be interesting to study how the expected curvature condition can be used to improve the utility guarantees of other gradient-perturbation-based algorithms.

References

Appendix A Proofs Related to DP-GD and DP-SGD

Proof of Theorem 1.

Let $\{w_1, \ldots, w_T\}$ be the path generated by the optimization procedure. Since each iterate contains Gaussian perturbation noise, Definition 3 gives us

Since $F$ is $\beta$-smooth, we have

Taking a linear combination of the above inequalities,

(2)

Let the solution error at step $t$ be measured by the distance to $w^*$. We have the following inequality between the errors at steps $t+1$ and $t$.

(3)

Taking expectation with respect to the perturbation noise, we have

(4)

Further taking expectation and using Eq. (2), we have

(5)

With the stated choices of the parameters,

(6)

Applying Eq. (6) recursively and taking expectations iteratively yields

(7)

The uniform privacy budget allocation scheme fixes the per-step noise level accordingly. Therefore,

(8)

With the stated choice of the number of steps, we have

(9)

The last inequality holds for the stated range of parameters.

Therefore, the expected solution error satisfies

(10)

Since $F$ is $\beta$-smooth, we have

(11)

Using Eq. (10) and Eq. (11), the expected excess risk satisfies

for the stated choice of parameters. The utility bound is minimized by the choice of the number of steps given in the theorem. ∎

Proof of Theorem 2.

The smoothness condition gives us

(12)

Taking expectation with respect to the noise and substituting the update rule,

(13)

Subtracting the optimal value on both sides and using convexity,

(14)

Substituting the stated learning rate,

(15)

Summing over the iterations and taking expectation,

(16)

Using convexity,

(17)

Choosing the number of steps as stated, we have

(18)

Proof of Theorems 3 and 4.

We start by giving a useful lemma.

Lemma 2.

With the stated choice of the learning rate, the expected solution error of the iterates in Algorithm 2 satisfies, for any step,

Proof of Lemma 2.

We have

(19)

Taking expectation with respect to the perturbation noise and the uniform sampling, we have

(20)

Further taking expectation and applying Definition 3,

(21)

We now prove the claim by induction. Substituting the base case into Eq. (21), Lemma 2 holds for the first step.

Assume the claim holds for step $t$; then

(22)

It is easy to check that Eq. (20) holds with $w^*$ replaced by an arbitrary point. Rearranging Eq. (20) and taking expectation, we have

(23)

Let the suffix length be arbitrarily chosen. Summing over the last iterations and using convexity to lower bound the sum of function values by the value at the averaged iterate,

(24)

Substituting and following the idea in shamir2013stochastic with an appropriate choice of the suffix length, we arrive at