1 Introduction
Machine learning has become a powerful tool for many practical applications. The training process often needs access to private datasets, e.g., in financial and medical applications. Recent work has shown that models learned from training data may leak unintended information about individual records (fredrikson2015model; wu2016methodology; shokri2017membership; hitaj2017deep). Differential privacy (DP) (dwork2006our; dwork2006calibrating) is the gold standard for privacy-preserving data analysis: it provides a provable privacy guarantee by ensuring that the influence of any individual record is negligible. It has been deployed in real-world applications by large corporations and the U.S. Census Bureau (erlingsson2014rappor; APPLE; abowd2016challenge; ding2017collecting).
We study a fundamental problem at the intersection of differential privacy and machine learning: differentially private empirical risk minimization (DP-ERM) (chaudhuri2009privacy; chaudhuri2011differentially; kifer2012private; bassily2014differentially; talwar2015nearly; wu2017bolt; zhang2017efficient; wang2017differentially; smith2017interaction; jayaraman2018distributed; feldman2018privacy; iyengar2019towards; wang2019differentially). DP-ERM minimizes the empirical risk while guaranteeing that the output of the learning algorithm is differentially private with respect to the training data. Such a guarantee provides strong protection against potential adversaries (hitaj2017deep; rahman2018membership). To guarantee privacy, it is necessary to introduce randomness into the algorithm, and there are three common ways to do so, classified by when the noise is added: output perturbation, objective perturbation, and gradient perturbation.
Output perturbation (wu2017bolt; zhang2017efficient) runs the learning algorithm as in the non-private case and then adds noise to the output parameters. Objective perturbation (chaudhuri2011differentially; kifer2012private; iyengar2019towards) perturbs the objective (i.e., the empirical loss) and then releases the minimizer of the perturbed objective. Gradient perturbation (song2013stochastic; bassily2014differentially; abadi2016deep; wang2017differentially; lee2018concentrated; jayaraman2018distributed) perturbs each intermediate gradient update; if each update is differentially private, the composition theorem of differential privacy ensures that the whole learning procedure is differentially private. A sketch contrasting the three approaches is given below.
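To make the taxonomy concrete, the following is a minimal sketch (ours, not code from any of the cited papers) of the three approaches on an $\ell_2$-regularized logistic loss. All step sizes, iteration counts, and noise scales (`sigma_out`, `sigma_obj`, `sigma_grad`) are illustrative placeholders, not calibrated privacy parameters.

```python
import numpy as np

def grad_loss(w, X, y, lam=0.0):
    """Gradient of the l2-regularized logistic loss; labels y are +/-1."""
    margins = y * (X @ w)
    coef = -y / (1.0 + np.exp(margins))   # derivative of log(1 + e^{-z})
    return X.T @ coef / len(y) + lam * w

def output_perturbation(w0, X, y, eta, T, sigma_out):
    w = w0.copy()
    for _ in range(T):                    # ordinary non-private training
        w -= eta * grad_loss(w, X, y)
    return w + np.random.normal(0.0, sigma_out, w.shape)  # noise once, at the end

def objective_perturbation(w0, X, y, eta, T, sigma_obj):
    b = np.random.normal(0.0, sigma_obj, w0.shape)        # random linear term
    w = w0.copy()
    for _ in range(T):                    # minimize F(w) + <b, w>/n
        w -= eta * (grad_loss(w, X, y) + b / len(y))
    return w

def gradient_perturbation(w0, X, y, eta, T, sigma_grad):
    w = w0.copy()
    for _ in range(T):                    # fresh noise at every update
        w -= eta * (grad_loss(w, X, y) +
                    np.random.normal(0.0, sigma_grad, w.shape))
    return w
```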
Gradient perturbation has several advantages over output and objective perturbation. First, it does not require strong assumptions on the objective, because it only needs to bound the sensitivity of each gradient update rather than that of the whole learning process. Second, it can release the noisy gradient at every iteration without damaging the privacy guarantee, since differential privacy is immune to post-processing (algofound); this makes it a favorable choice for applications such as distributed optimization (rajkumar2012differentially; agarwal2018cpsgd; jayaraman2018distributed). Finally, gradient perturbation often achieves better empirical utility than output and objective perturbation for DP-ERM.
However, the existing theoretical utility guarantees for gradient perturbation are the same as, or strictly inferior to, those of the other perturbation methods, as shown in Table 1. This motivates us to ask:
“What is wrong with the theory for gradient perturbation? Can we justify the empirical advantage of gradient perturbation theoretically?”
We revisit the analysis of the gradient perturbation approach. Previous work (bassily2014differentially; wang2017differentially; jayaraman2018distributed) derives the utility guarantee of gradient perturbation in two steps: first determine the noise variance at each step that meets the privacy requirement, then derive the utility guarantee using the same convergence analysis as in the non-private case. However, the noise added to guarantee privacy inevitably affects the optimization procedure, and this two-step approach fails to exploit the interaction between the privacy noise and the optimization.
In this paper, we exploit the fact that the privacy noise affects the optimization procedure and establish new, much tighter utility guarantees for the gradient perturbation approach. Our contributions can be summarized as follows.

- We introduce an expected curvature that accurately characterizes the optimization behavior when each gradient update is perturbed by noise.
- We establish utility guarantees for DP-GD for both convex and strongly convex objectives based on the expected curvature rather than the usual minimum curvature.
- We establish utility guarantees for DP-SGD for both convex and strongly convex objectives based on the expected curvature. To the best of our knowledge, this is the first work to remove the dependency on the minimum curvature for DP-ERM algorithms.
In the DP-ERM literature, there is a gap between the utility guarantees of non-strongly convex objectives and those of strongly convex objectives. Using the expected curvature, we show that some non-strongly convex objectives can achieve the same order of utility guarantee as strongly convex objectives, matching the empirical observation. This is because the expected curvature can be relatively large even for non-strongly convex objectives.
As mentioned earlier, prior to our work there was a mismatch between the theoretical guarantees and the empirical behavior of gradient perturbation relative to the other two perturbation approaches. Our results theoretically justify the advantage of gradient perturbation and close this mismatch.
1.1 Paper Organization
The rest of this paper is organized as follows. Section 2 introduces notation and the DP-ERM task. Section 3 introduces the expected curvature, establishes the utility guarantees of both DP-GD and DP-SGD based on it, and discusses the three perturbation approaches. Section 4 presents our experiments, and Section 5 concludes.
Table 1: Comparison of utility guarantees for DP-ERM algorithms.

| Authors | Perturbation | Algorithm | Utility (convex) | Utility (strongly convex) |
| --- | --- | --- | --- | --- |
| chaudhuri2011differentially | Objective | N/A | | |
| zhang2017efficient | Output | GD | | |
| bassily2014differentially | Gradient | SGD | | |
| jayaraman2018distributed | Gradient | GD | N/A | |
| Ours | Gradient | GD | | |
| Ours | Gradient | SGD | | |
2 Preliminary
We introduce notation and definitions in this section. Given a dataset $D = \{z_1, \ldots, z_n\}$, the objective function is defined as $F(w) = \frac{1}{n}\sum_{i=1}^{n} f(w; z_i)$, where $f(w; z_i)$ is the loss of model $w$ on record $z_i$. For simplicity, we use $f_i(w)$ to denote $f(w; z_i)$. We use $\|\cdot\|$ to denote the $\ell_2$ norm of a vector. We use $\mathcal{W}^*$ to denote the set of optimal solutions of $\min_w F(w)$ and assume throughout this paper that $\mathcal{W}^*$ is nonempty.

Definition 1 (Objective properties). For any $u, v$ in the domain, a function $f$
- is $G$-Lipschitz if $|f(u) - f(v)| \le G \|u - v\|$;
- is $L$-smooth if $\|\nabla f(u) - \nabla f(v)\| \le L \|u - v\|$;
- is convex if $f(u) \ge f(v) + \langle \nabla f(v), u - v \rangle$;
- is $\mu$-strongly convex (or $\mu$-SC) if $f(u) \ge f(v) + \langle \nabla f(v), u - v \rangle + \frac{\mu}{2}\|u - v\|^2$.

The strong convexity coefficient $\mu$ lower bounds the minimum curvature of the function over the domain.
We say that two datasets $D$ and $D'$ are neighboring datasets (denoted as $D \sim D'$) if $D'$ can be obtained by arbitrarily modifying one record in $D$ (or vice versa). In this paper we consider differential privacy defined as follows.
Definition 2 ($(\epsilon, \delta)$-DP (dwork2006our; dwork2006calibrating)). A randomized mechanism $M$ guarantees $(\epsilon, \delta)$-differential privacy if for any two neighboring input datasets $D \sim D'$ and for any subset $S$ of outputs it holds that

$$\Pr[M(D) \in S] \le e^{\epsilon}\, \Pr[M(D') \in S] + \delta.$$
We note that $\delta$ can be viewed as the probability that the original $\epsilon$-DP guarantee fails, and a meaningful setting requires $\delta \ll 1/n$. By definition, differential privacy controls the maximum influence that any individual record can have on the output. Smaller $\epsilon$ and $\delta$ imply less information leakage but usually lead to worse utility; one can adjust $\epsilon$ and $\delta$ to trade off between privacy and utility.

DP-ERM requires the output $w_{priv}$ to be $(\epsilon, \delta)$-differentially private with respect to the input dataset $D$. Let $w^*$ be one of the optimal solutions of $\min_w F(w)$. The utility of a DP-ERM algorithm is measured by the expected excess empirical risk $\mathbb{E}\left[F(w_{priv})\right] - F(w^*)$, where the expectation is taken over the randomness of the algorithm.
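All the perturbation schemes discussed in this paper ultimately rest on releasing a bounded-sensitivity quantity via the Gaussian mechanism. The snippet below is a minimal sketch using the classic calibration of Dwork and Roth (valid for $\epsilon \le 1$); it is background material, not code from this paper.

```python
import numpy as np

def gaussian_mechanism(value, l2_sensitivity, eps, delta):
    """Release `value` with (eps, delta)-DP, assuming neighboring datasets
    change `value` by at most `l2_sensitivity` in l2 norm (requires eps <= 1).
    """
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return value + np.random.normal(0.0, sigma, np.shape(value))
```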
3 Main Results
In this section, we first define the expected curvature and explain why it reflects the average rather than the minimum curvature. We then use the expected curvature to improve the analyses of both DP-GD and DP-SGD.
3.1 Expected Curvature
In the non-private setting, the analysis of convex optimization relies on the strong convexity coefficient $\mu$, which is the minimum curvature over the domain and can be extremely small for some common objectives. Previous work on DP-ERM uses the same analysis as in the non-private case, so the resulting utility bounds depend on the minimum curvature. In our analysis, however, we avoid this dependency by exploiting how the privacy noise affects the optimization. With perturbation noise, the expected curvature that the optimization path encounters is related to the average curvature instead of the minimum curvature. Definition 3 uses $\nu$ to capture this average curvature under Gaussian noise. We use $w^*$ to denote the optimal solution closest to the initial point.
Definition 3 (Expected curvature). A convex function $F$ has expected curvature $\nu$ with respect to noise $b$ if for any $w$ and $\tilde{w} = w + b$ where $b \sim \mathcal{N}(0, \sigma^2 \mathbf{I}_d)$, it holds that

$$\mathbb{E}\left[\langle \nabla F(\tilde{w}),\ \tilde{w} - w^* \rangle\right] \ \ge\ \nu\, \mathbb{E}\left[\|\tilde{w} - w^*\|^2\right], \quad (1)$$

where the expectation is taken with respect to $b$.
Claim 1. If $F$ is $\mu$-strongly convex, then $\nu \ge \mu$.

Proof. Strong convexity and $\nabla F(w^*) = 0$ imply $\langle \nabla F(\tilde{w}), \tilde{w} - w^* \rangle \ge \mu \|\tilde{w} - w^*\|^2$ pointwise; taking expectation over $b$ shows that Eq. (1) always holds with $\nu = \mu$. ∎
In fact, $\nu$ captures the average curvature and can be much larger than $\mu$. We use $b^{\top}$ to denote the transpose of $b$, and let $H = \nabla^2 F(w)$ be the Hessian matrix evaluated at $w$. A Taylor expansion of the left-hand side of Eq. (1) gives

$$\mathbb{E}\left[\langle \nabla F(\tilde{w}), \tilde{w} - w^* \rangle\right] \approx \langle \nabla F(w), w - w^* \rangle + \mathbb{E}\left[b^{\top} H b\right] = \langle \nabla F(w), w - w^* \rangle + \sigma^2\, \mathrm{tr}(H).$$

For a convex objective, the Hessian matrix is positive semidefinite and $\mathrm{tr}(H)$ is the sum of its eigenvalues. The right-hand side of Eq. (1) can be written as

$$\nu\, \mathbb{E}\left[\|\tilde{w} - w^*\|^2\right] = \nu \left( \|w - w^*\|^2 + \sigma^2 d \right).$$

Based on the above approximation, we can estimate the value of $\nu$ in Definition 3:

$$\nu \approx \frac{\langle \nabla F(w), w - w^* \rangle + \sigma^2\, \mathrm{tr}(H)}{\|w - w^*\|^2 + \sigma^2 d}.$$

For relatively large $\sigma$, this implies $\nu \approx \mathrm{tr}(H)/d$, i.e., the average curvature at $w$. Large variance is a reasonable setting because a meaningful differential privacy guarantee requires a nontrivial amount of noise.

The above analysis suggests that $\nu$ can be independent of, and much larger than, $\mu$. This is indeed true for many convex objectives. Take $\ell_2$-regularized logistic regression as an example. The objective is strongly convex only because of the $\ell_2$ regularizer, so the minimum curvature (the strong convexity coefficient) equals the regularization coefficient $\lambda$. Shamir et al. [1] show that the optimal choice of $\lambda$ vanishes as the sample size grows (Section 4.3 in [1]), and typical choices in practice are even smaller. Figure 2 compares the minimum and average curvatures of regularized logistic regression during training: the average curvature is essentially unaffected by the regularization term, whereas the minimum curvature drops to the order of $\lambda$ within the first few steps. Removing the dependence on the minimum curvature is therefore a significant improvement. We also plot the curvatures for another dataset, KDDCup99, in Appendix C; the resulting curves are similar to those in Figure 2.

Perturbation noise is necessary to attain $\nu > \mu$. We note that $\nu$ reduces to $\mu$ when the training process involves no perturbation noise (corresponding to $\sigma = 0$ in Definition 3). For example, objective and output perturbation cannot utilize the expected curvature condition because no noise is injected into their training processes. Among the three perturbation methods, gradient perturbation is therefore the only one that can leverage this effect of the noise.
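The estimate $\nu \approx \mathrm{tr}(H)/d$ can be checked numerically. The following is a hedged Monte-Carlo sketch of the two sides of Eq. (1); `grad_F` and `w_star` are assumed to be supplied by the caller (e.g., $w^*$ obtained from a long non-private run), and nothing here reproduces the paper's actual experiment code.

```python
import numpy as np

def estimate_expected_curvature(grad_F, w, w_star, sigma, n_samples=1000):
    """Estimate nu ~= E<grad F(w+b), w+b-w*> / E||w+b-w*||^2, b ~ N(0, sigma^2 I)."""
    num, den = 0.0, 0.0
    for _ in range(n_samples):
        b = np.random.normal(0.0, sigma, w.shape)
        w_tilde = w + b
        diff = w_tilde - w_star
        num += np.dot(grad_F(w_tilde), diff)   # left-hand side of Eq. (1)
        den += np.dot(diff, diff)              # right-hand side, without nu
    return num / den
```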
We note that $\nu > 0$ does not require $\mu > 0$. A concrete example appears in negahban2012unified, which illustrates a loss function in the high-dimensional ($d > n$) setting, i.e., the restricted strongly convex scenario: the loss is curved in certain directions but completely flat in others. The average curvature of such an objective is always positive even though the worst-case curvature is $0$. Although recent work shows that the utility guarantee for high-dimensional DP-ERM tasks may not depend on the worst-case curvature (wang2019differentially), this example still provides a good illustration of the case $\nu > 0$, $\mu = 0$. Moreover, as shown in Figure 2, the average curvature of logistic regression on the Adult dataset stays above $0$ throughout training even when the regularization coefficient is $0$. As we show later, a positive $\nu$ over the optimization path is sufficient for our optimization analysis.

3.2 Utility Guarantee of DP-GD Based on Expected Curvature
In this section we show that the expected curvature improves the utility bound of DP-GD (Algorithm 1); a minimal sketch of the algorithm is given below. Algorithm 1 is $(\epsilon, \delta)$-DP if the per-step noise variance is calibrated as in jayaraman2018distributed, i.e., $\sigma^2 = O\left(G^2 T \log(1/\delta) / (n^2 \epsilon^2)\right)$. Let $\{w_0, \ldots, w_T\}$ be the training path and let $\nu$ denote the minimum expected curvature over the path. We first present the utility guarantee of DP-GD for the case $\nu > 0$.
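Since the pseudocode of Algorithm 1 is not reproduced here, the sketch below shows the gradient perturbation loop it refers to; the noise is the per-step Gaussian calibration stated above, and all function names are ours, not the paper's.

```python
import numpy as np

def dp_gd(grad_F, w0, eta, T, sigma):
    """Sketch of DP-GD (Algorithm 1): full-batch gradient descent where every
    update is perturbed with isotropic Gaussian noise of standard deviation
    sigma, calibrated so that the composition of T releases is (eps, delta)-DP.
    """
    w = np.array(w0, dtype=float)
    for _ in range(T):
        g = grad_F(w)                                   # exact gradient of F
        g_noisy = g + np.random.normal(0.0, sigma, w.shape)
        w = w - eta * g_noisy                           # perturbed update
    return w
```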
Theorem 1 (Utility guarantee, $\nu > 0$). Suppose $F$ is $G$-Lipschitz and $L$-smooth with expected curvature $\nu > 0$. Set $\eta = 1/L$ and $T = \tilde{O}(L/\nu)$; then

$$\mathbb{E}[F(w_T)] - F(w^*) \le \tilde{O}\!\left(\frac{L\, d\, G^2 \log(1/\delta)}{\nu^2\, n^2 \epsilon^2}\right).$$

Proof. All proofs in this paper are relegated to Appendix A. ∎
Remark 1. Theorem 1 depends only on the expected curvature over the training path $\{w_0, \ldots, w_T\}$.

Unless otherwise specified, expectations are taken over the randomness of the algorithm. Theorem 1 significantly improves the original analysis of DP-GD because of the arguments in Section 3.1. If $\nu = 0$, the curvature is flat in all directions; one example is a linear function, which bassily2014differentially use to derive their utility lower bound. Such simple functions are rarely used as loss functions in practice. Nonetheless, we give the utility guarantee for the case $\nu = 0$ in Theorem 2.
Theorem 2 (Utility guarantee, $\nu = 0$). Suppose $F$ is $G$-Lipschitz and $L$-smooth. Set $\eta = 1/L$ and $T = O\!\left(L \|w_0 - w^*\|\, n\epsilon \big/ \big(G\sqrt{d \log(1/\delta)}\big)\right)$. Let $\bar{w} = \frac{1}{T}\sum_{t=0}^{T-1} w_t$; then

$$\mathbb{E}[F(\bar{w})] - F(w^*) \le \tilde{O}\!\left(\frac{G\, \|w_0 - w^*\| \sqrt{d \log(1/\delta)}}{n \epsilon}\right).$$

We use parameter averaging to reduce the influence of the perturbation noise because the gradient update does not have a strong contraction effect when $\nu = 0$.
3.3 Utility Guarantee of DP-SGD Based on Expected Curvature
Stochastic gradient descent has become one of the most popular optimization methods because of its cheap per-iteration cost. In this section we show that the expected curvature also improves the utility analysis of DP-SGD (Algorithm 2; a sketch is given below). When the objective is not smooth, $\nabla f_i(w)$ denotes an element of the subgradient set at $w$. Before stating our theorems, we introduce the moments accountant technique (Lemma 1), which is essential for establishing the privacy guarantee.
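As with Algorithm 1, the pseudocode of Algorithm 2 is not reproduced here; the hedged sketch below shows the sampled, clipped, and perturbed update it refers to. The noise scale `sigma * G` (noise multiplier times clipping norm) is our reading of the standard moments-accountant setup, not a detail taken from the paper.

```python
import numpy as np

def dp_sgd(grad_f, data, w0, G, T, sigma, eta, seed=0):
    """Sketch of DP-SGD (Algorithm 2) with minibatch size 1.
    grad_f(w, z): per-record (sub)gradient; each gradient is clipped to l2
    norm G so that every update has bounded sensitivity, then Gaussian noise
    with standard deviation sigma * G (sigma from Lemma 1) is added.
    eta(t): step-size schedule, e.g. eta(t) = 1 / (nu * t) for Theorem 3.
    """
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for t in range(1, T + 1):
        z = data[rng.integers(len(data))]   # uniform sampling, ratio 1/n
        g = grad_f(w, z)
        g_norm = np.linalg.norm(g)
        if g_norm > G:                      # clipping enforces G-Lipschitz
            g = g * (G / g_norm)
        w = w - eta(t) * (g + rng.normal(0.0, sigma * G, w.shape))
    return w
```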
Lemma 1 (abadi2016deep). There exist constants $c_1$ and $c_2$ such that, given a sampling probability $q = m/n$ and a number of steps $T$, for any $\epsilon < c_1 q^2 T$, Algorithm 2 is $(\epsilon, \delta)$-differentially private for any $\delta > 0$ if we choose

$$\sigma \ge c_2 \frac{q \sqrt{T \log(1/\delta)}}{\epsilon}.$$
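Transcribing Lemma 1 into code is straightforward; note that $c_2$ is an unspecified constant in the lemma, so the default below is a placeholder rather than a calibrated value.

```python
import numpy as np

def lemma1_noise_multiplier(q, T, eps, delta, c2=1.0):
    """sigma >= c2 * q * sqrt(T * log(1/delta)) / eps, as in Lemma 1.
    c2 = 1.0 is an arbitrary placeholder; a real deployment would instead
    search for the smallest sigma accepted by a moments/RDP accountant.
    """
    return c2 * q * np.sqrt(T * np.log(1.0 / delta)) / eps
```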
For the case $\nu > 0$, Theorem 3 presents the utility guarantee of DP-SGD.

Theorem 3 (Utility guarantee, $\nu > 0$). Suppose $F$ is $G$-Lipschitz with expected curvature $\nu > 0$. Choose $\sigma$ based on Lemma 1 to guarantee $(\epsilon, \delta)$-DP. Set $\eta_t = 1/(\nu t)$ and $T = n^2$; then

$$\mathbb{E}[F(w_T)] - F(w^*) \le \tilde{O}\!\left(\frac{d\, G^2 \log(1/\delta)}{\nu\, n^2 \epsilon^2}\right).$$
Remark 2. Theorem 3 does not require a smoothness assumption.

Theorem 3 shows that the utility guarantee of DP-SGD also depends on $\nu$ rather than $\mu$. We set $T = n^2$ following bassily2014differentially; note that $T = \Theta(n^2)$ is necessary even for non-private SGD to reach the corresponding precision. We next show that, for a relatively coarse precision, the running time can be reduced significantly.
Theorem 4. Suppose $F$ is $G$-Lipschitz with expected curvature $\nu > 0$. Choose $\sigma$ based on Lemma 1 to guarantee $(\epsilon, \delta)$-DP. Set $\eta_t = 1/(\nu t)$ and $T = O\!\left(n\epsilon/\sqrt{d \log(1/\delta)}\right)$. Suppose $n\epsilon \ge \sqrt{d \log(1/\delta)}$; then

$$\mathbb{E}[F(w_T)] - F(w^*) \le \tilde{O}\!\left(\frac{G^2 \sqrt{d \log(1/\delta)}}{\nu\, n \epsilon}\right).$$

We note that the analysis of bassily2014differentially attains a comparable precision only with a number of steps that still depends on the minimum curvature. Theorem 5 shows the utility for the case $\nu = 0$.
Theorem 5 (Utility guarantee, $\nu = 0$). Suppose $F$ is $G$-Lipschitz and assume $\|w_t - w^*\| \le D$ for all $t \le T$. Choose $\sigma$ based on Lemma 1 to guarantee $(\epsilon, \delta)$-DP. Let $T = n^2$ and set $\eta_t = D \big/ \big(G\sqrt{(1 + \sigma^2 d)\, t}\big)$; then

$$\mathbb{E}[F(w_T)] - F(w^*) \le \tilde{O}\!\left(\frac{G\, D \sqrt{d \log(1/\delta)}}{n \epsilon}\right).$$

This utility guarantee can be derived from Theorem 2 of shamir2013stochastic.
3.4 Discussion of the Three Perturbation Approaches

In this section, we briefly discuss the other two perturbation approaches and compare them with gradient perturbation.
Output perturbation (wu2017bolt; zhang2017efficient) perturbs the output of the learning algorithm after training: it adds noise to the model produced by a non-private learning process. The magnitude of the perturbation noise is proportional to the maximum influence a single record can have on the learned model. Take gradient descent run on two neighboring datasets as an example. At each step, the gradient of the differing record drives the two parameter vectors apart, and the maximum distance expansion is governed by the Lipschitz coefficient. At the same time, the gradients of the shared records shrink the parameter distance because of the contraction effect of the gradient update; this contraction depends on the smoothness and strong convexity coefficients, and a smaller strong convexity coefficient yields weaker contraction. The sensitivity of the output perturbation algorithm is an upper bound on the largest possible final distance between the two parameter vectors, as sketched below.
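A worked version of this recursion, in our notation and under standard assumptions (per-record losses convex and $L$-smooth, step size $\eta \le 1/L$, $\mu$-strong convexity), makes the dependence on $\mu$ explicit:

```latex
% One GD step on neighboring datasets D ~ D' that differ in a single record:
% shared records contract the distance by (1 - \eta\mu), while the differing
% record expands it by at most 2\eta G / n (its gradients are G-bounded).
\|w_{t+1} - w'_{t+1}\| \;\le\; (1 - \eta\mu)\,\|w_t - w'_t\| \;+\; \frac{2\eta G}{n}.
% Unrolling the recursion over T steps bounds the sensitivity:
\|w_T - w'_T\| \;\le\; \frac{2\eta G}{n} \sum_{k=0}^{T-1} (1 - \eta\mu)^k
               \;\le\; \frac{2G}{\mu n},
% so the noise that output perturbation must add scales like 1/\mu:
% weak contraction (small \mu) forces large noise.
```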
Objective perturbation (chaudhuri2011differentially; kifer2012private; iyengar2019towards) perturbs the objective function before training. It requires the objective to be strongly convex so that the minimizer is unique; if the original objective is not strongly convex, it first adds an $\ell_2$ regularizer. It then perturbs the objective with a random linear term. The sensitivity of objective perturbation is the maximum change of the minimizer that a single record can produce; chaudhuri2011differentially and kifer2012private bound this change using the largest and smallest eigenvalues of the objective's Hessian (i.e., the smoothness and strong convexity coefficients).
In comparison, gradient perturbation is more flexible than output and objective perturbation. To bound the sensitivity, gradient perturbation only needs the Lipschitz coefficient, which can always be enforced with the gradient clipping technique. Output and objective perturbation, in contrast, additionally need the smoothness coefficient, which is hard to compute for some common objectives such as softmax regression.
More critically, output and objective perturbation cannot utilize the expected curvature condition because their training processes contain no perturbation noise. Moreover, they must account for the worst-case behavior of the learning algorithm: differential privacy makes worst-case assumptions on the query function, and output/objective perturbation treat the whole learning algorithm as a single query to the private dataset. This explains why their utility guarantees depend on the worst-case (minimum) curvature of the objective.
4 Experiments
In this section, we evaluate the performance of DP-GD and DP-SGD on multiple real-world datasets, using the benchmark datasets provided by iyengar2019towards. The objective functions are logistic regression for the binary datasets and softmax regression for the multiclass datasets.

Datasets. The benchmark includes two multiclass datasets (MNIST, Covertype) and five binary datasets, three of which are high-dimensional (Gisette, Realsim, RCV1). Following iyengar2019towards, we use one part of each dataset for training and the rest for testing. Detailed descriptions of the datasets can be found in Appendix B.
Implementation details. We track Rényi differential privacy (RDP) (mironov2017renyi) and convert it to $(\epsilon, \delta)$-DP. The number of running steps is chosen from a small candidate set for both DP-GD and DP-SGD. For DP-SGD, we use the moments accountant to track the privacy loss, with a fixed sampling ratio. The standard deviation $\sigma$ of the added noise is set to the smallest value such that the privacy budget allows running the desired number of steps. We make each loss function Lipschitz by clipping individual gradients; the method of goodfellow2015efficient allows us to clip individual gradients efficiently (a sketch is given below). The clipping threshold is a fixed constant, chosen smaller for the high-dimensional datasets because of their sparse gradients. For DP-GD, the learning rate is chosen from a small candidate set (a separate set is used for the high-dimensional datasets). The learning rate of DP-SGD is twice that of DP-GD and is decayed once at the middle of training. The privacy parameter $\delta$ and the regularization coefficient $\lambda$ are fixed across all experiments. All reported numbers are averaged over 20 runs.

Baseline algorithms. The baselines are state-of-the-art objective and output perturbation algorithms. For objective perturbation, we use Approximate Minima Perturbation (AMP) (iyengar2019towards); for output perturbation, we use the algorithm of wu2017bolt (Output perturbation SGD). We adopt the implementation and hyperparameters of iyengar2019towards for both algorithms. For multiclass classification tasks, wu2017bolt and iyengar2019towards divide the privacy budget evenly and train multiple binary classifiers, because their algorithms need to compute the smoothness coefficient before training and therefore are not directly applicable to softmax regression.
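For reference, the following is a hedged sketch of per-example gradient clipping, the primitive used above to enforce the Lipschitz condition. We show a plain per-example loop for clarity; goodfellow2015efficient describes an efficient vectorized variant, which we do not reproduce here.

```python
import numpy as np

def clipped_mean_gradient(grad_f, w, batch, C):
    """Average of per-example gradients, each clipped to l2 norm at most C,
    so that one record changes the sum by at most 2C (bounded sensitivity).
    """
    total = np.zeros_like(w)
    for z in batch:
        g = grad_f(w, z)
        g_norm = np.linalg.norm(g)
        if g_norm > C:
            g = g * (C / g_norm)   # rescale, preserving direction
        total += g
    return total / len(batch)
```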
Experimental results. The validation accuracy of all evaluated algorithms under a fixed privacy budget $\epsilon$ (a larger budget is used for the multiclass datasets) is presented in Table 2. We also plot accuracy for varying $\epsilon$ in Figure 3. These results confirm our theory in Section 3: gradient perturbation achieves better performance than the other perturbation methods because it leverages the average curvature.
Table 2: Validation accuracy (%) of the evaluated algorithms.

| | KDDCup99 | Adult | MNIST | Covertype | Gisette | Realsim | RCV1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Non-private | 99.1 | 84.8 | 91.9 | 71.2 | 96.6 | 93.3 | 93.5 |
| AMP* | 97.5 | 79.3 | 71.9 | 64.3 | 62.8 | 73.1 | 64.5 |
| Out-SGD | 98.1 | 77.4 | 69.4 | 62.4 | 62.3 | 73.2 | 66.7 |
| DP-SGD | 98.7 | 80.4 | 87.5 | 67.7 | 63.0 | 73.8 | 70.4 |
| DP-GD | 98.7 | 80.9 | 88.6 | 66.2 | 67.3 | 76.1 | 74.9 |

*For the multiclass datasets MNIST and Covertype, we use the numbers reported in iyengar2019towards directly because of the long running time of AMP on these datasets.
5 Conclusion
In this paper, we show that the privacy noise actually helps the optimization analysis, and we use this fact to improve the utility guarantees of both DP-GD and DP-SGD. Our results theoretically justify the empirical superiority of gradient perturbation over other perturbation methods and advance the state-of-the-art utility guarantees for DP-ERM algorithms. Experiments on real-world datasets corroborate our theoretical findings. In the future, it would be interesting to use the expected curvature condition to improve the utility guarantees of other gradient perturbation based algorithms.
References
Appendix A Proofs Related to DP-GD and DP-SGD
Proof of Theorem 1.
Let $\{w_0, \ldots, w_T\}$ be the path generated by the optimization procedure and write $r_t = \|w_t - w^*\|^2$. Since each iterate $w_t$ with $t \ge 1$ contains Gaussian perturbation noise, Definition 3 applies along the path and gives

$$\mathbb{E}\left[\langle \nabla F(w_t), w_t - w^* \rangle\right] \ge \nu\, \mathbb{E}[r_t]. \quad (2)$$

The update rule is $w_{t+1} = w_t - \eta\left(\nabla F(w_t) + b_t\right)$ with $b_t \sim \mathcal{N}(0, \sigma^2 \mathbf{I}_d)$, hence

$$r_{t+1} = r_t - 2\eta \langle \nabla F(w_t) + b_t,\ w_t - w^* \rangle + \eta^2 \|\nabla F(w_t) + b_t\|^2. \quad (3)$$

Taking expectation with respect to $b_t$,

$$\mathbb{E}_{b_t}[r_{t+1}] = r_t - 2\eta \langle \nabla F(w_t), w_t - w^* \rangle + \eta^2 \|\nabla F(w_t)\|^2 + \eta^2 \sigma^2 d. \quad (4)$$

Since $F$ is $L$-smooth and convex with $\nabla F(w^*) = 0$, co-coercivity gives $\|\nabla F(w_t)\|^2 \le L \langle \nabla F(w_t), w_t - w^* \rangle$. Setting $\eta = 1/L$, combining with Eq. (2), and taking full expectation,

$$\mathbb{E}[r_{t+1}] \le \left(1 - \frac{\nu}{L}\right) \mathbb{E}[r_t] + \frac{\sigma^2 d}{L^2}. \quad (5)$$

Applying Eq. (5) iteratively over $T$ steps,

$$\mathbb{E}[r_T] \le \left(1 - \frac{\nu}{L}\right)^T r_0 + \frac{\sigma^2 d}{L \nu}. \quad (6)$$

The uniform privacy budget allocation scheme sets $\sigma^2 = O\left(G^2 T \log(1/\delta) / (n^2 \epsilon^2)\right)$. Therefore

$$\mathbb{E}[r_T] \le \left(1 - \frac{\nu}{L}\right)^T r_0 + O\!\left(\frac{d\, G^2\, T \log(1/\delta)}{L \nu\, n^2 \epsilon^2}\right). \quad (7)$$

Choosing $T = \tilde{O}(L/\nu)$ so that the first term is dominated by the second (using $1 - x \le e^{-x}$ for $x \in (0, 1)$) yields the expected solution error

$$\mathbb{E}[r_T] \le \tilde{O}\!\left(\frac{d\, G^2 \log(1/\delta)}{\nu^2\, n^2 \epsilon^2}\right). \quad (8)$$

Finally, since $F$ is $L$-smooth, $\mathbb{E}[F(w_T)] - F(w^*) \le \frac{L}{2}\, \mathbb{E}[r_T]$, which gives the bound in Theorem 1. ∎
Proof of Theorem 2.
As in the proof of Theorem 1, with $\eta = 1/L$ the expansion of $r_{t+1} = \|w_{t+1} - w^*\|^2$ together with co-coercivity gives

$$\mathbb{E}_{b_t}[r_{t+1}] \le r_t - \eta \langle \nabla F(w_t), w_t - w^* \rangle + \eta^2 \sigma^2 d. \quad (9)$$

By convexity, $F(w_t) - F(w^*) \le \langle \nabla F(w_t), w_t - w^* \rangle$, so rearranging Eq. (9) and taking full expectation,

$$\mathbb{E}[F(w_t)] - F(w^*) \le \frac{\mathbb{E}[r_t] - \mathbb{E}[r_{t+1}]}{\eta} + \eta \sigma^2 d. \quad (10)$$

Summing over $t = 0, \ldots, T-1$, dividing by $T$, and using convexity to bound $F(\bar{w})$ by the average of the $F(w_t)$,

$$\mathbb{E}[F(\bar{w})] - F(w^*) \le \frac{\|w_0 - w^*\|^2}{\eta T} + \eta \sigma^2 d. \quad (11)$$

Substituting $\eta = 1/L$ and $\sigma^2 = O\left(G^2 T \log(1/\delta)/(n^2\epsilon^2)\right)$ and choosing

$$T = O\!\left(\frac{L\, \|w_0 - w^*\|\, n \epsilon}{G \sqrt{d \log(1/\delta)}}\right) \quad (12)$$

to balance the two terms yields the bound in Theorem 2. ∎
Proof of Theorems 3 and 4.
We start by giving a useful lemma.

Lemma 2. Choose $\eta_t = 1/(\nu t)$. The expected solution error of Algorithm 2 satisfies, for any $t \ge 1$,

$$\mathbb{E}\left[\|w_t - w^*\|^2\right] \le \frac{4\left(G^2 + \sigma^2 d\, G^2\right)}{\nu^2 t}.$$

Proof of Lemma 2.
Let $g_t$ denote the sampled, clipped (sub)gradient plus the perturbation noise at step $t$, so that $w_{t+1} = w_t - \eta_t g_t$. We have

$$\|w_{t+1} - w^*\|^2 = \|w_t - w^*\|^2 - 2\eta_t \langle g_t, w_t - w^* \rangle + \eta_t^2 \|g_t\|^2. \quad (13)$$

Taking expectation with respect to the perturbation noise and the uniform sampling, and using $\mathbb{E}\|g_t\|^2 \le G^2 + \sigma^2 d\, G^2$,

$$\mathbb{E}\left[\|w_{t+1} - w^*\|^2\right] \le \mathbb{E}\left[\|w_t - w^*\|^2\right] - 2\eta_t\, \mathbb{E}\left[\langle \nabla F(w_t), w_t - w^* \rangle\right] + \eta_t^2 \left(G^2 + \sigma^2 d\, G^2\right). \quad (14)$$

Further taking expectation over the earlier perturbation noise and applying Definition 3,

$$\mathbb{E}\left[\|w_{t+1} - w^*\|^2\right] \le \left(1 - 2\eta_t \nu\right) \mathbb{E}\left[\|w_t - w^*\|^2\right] + \eta_t^2 \left(G^2 + \sigma^2 d\, G^2\right). \quad (15)$$

The lemma follows by induction on $t$: assuming the claim holds for $t$ and substituting $\eta_t = 1/(\nu t)$ into Eq. (15),

$$\mathbb{E}\left[\|w_{t+1} - w^*\|^2\right] \le \left(1 - \frac{2}{t}\right) \frac{4\left(G^2 + \sigma^2 dG^2\right)}{\nu^2 t} + \frac{G^2 + \sigma^2 dG^2}{\nu^2 t^2} \le \frac{4\left(G^2 + \sigma^2 dG^2\right)}{\nu^2 (t+1)}. \quad (16)$$ ∎

It is easy to check that Eq. (14) holds with $w^*$ replaced by an arbitrary $w$. Rearranging Eq. (14) and taking expectation, we have

$$\mathbb{E}[F(w_t)] - F(w) \le \frac{\mathbb{E}\|w_t - w\|^2 - \mathbb{E}\|w_{t+1} - w\|^2}{2\eta_t} + \frac{\eta_t}{2}\left(G^2 + \sigma^2 dG^2\right). \quad (17)$$

Let $w$ be chosen as in shamir2013stochastic. Summing Eq. (17) over the last $T/2$ iterations, using convexity, and substituting the bound of Lemma 2, we arrive at

$$\mathbb{E}[F(w_T)] - F(w^*) \le O\!\left(\frac{\left(G^2 + \sigma^2 dG^2\right) \log T}{\nu\, T}\right). \quad (18)$$

Plugging in $\sigma$ from Lemma 1 with $T = n^2$ proves Theorem 3; with $T = O\!\left(n\epsilon/\sqrt{d\log(1/\delta)}\right)$ it proves Theorem 4. ∎