1 Introduction
Privacy preservation is an important issue in machine learning. Nowadays, learning algorithms are often required to deal with sensitive data, which means that an algorithm must not only learn effectively from the data but also provide a certain level of guarantee on privacy preservation. Differential privacy [11] is a rigorous privacy definition for data analysis that provides meaningful guarantees regardless of what an adversary knows ahead of time about individuals' data. As a commonly used supervised learning method, Empirical Risk Minimization (ERM) also faces the challenge of simultaneously preserving privacy and learning. Differentially Private (DP) ERM with convex loss functions has been extensively studied in the last decade, starting from [8]. In this paper, we revisit this problem and present several improved results.

Problem Setting
Given a dataset $D = \{z_1, \dots, z_n\}$ from a data universe $\mathcal{X}$, and a closed convex set $\mathcal{C} \subseteq \mathbb{R}^d$, DP-ERM is to find

$w^{\mathrm{priv}} \approx \arg\min_{w \in \mathcal{C}} F(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, z_i) + r(w)$

with the guarantee of being differentially private. We refer to $\ell$ as the loss function; $r$ is some simple (non)smooth convex function called the regularizer. If the loss function is convex, the utility of the algorithm is measured by the expected excess empirical risk, i.e., $\mathbb{E}[F(w^{\mathrm{priv}})] - \min_{w \in \mathcal{C}} F(w)$. The expectation is over the coins of the algorithm.
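To make the setting concrete, the following sketch evaluates the regularized empirical risk $F(w)$ for $\ell_2$-regularized logistic regression. This is only an illustrative instance: the data are synthetic, and the function names and the regularization weight `lam` are our own choices, not notation from the paper.

```python
import numpy as np

def logistic_loss(w, x, y):
    # Per-example loss l(w, z) with z = (x, y), labels y in {-1, +1}.
    return np.log1p(np.exp(-y * np.dot(x, w)))

def empirical_risk(w, X, Y, lam):
    # F(w) = (1/n) * sum_i l(w, z_i) + r(w), with r(w) = (lam/2) * ||w||_2^2.
    losses = np.log1p(np.exp(-Y * (X @ w)))
    return losses.mean() + 0.5 * lam * np.dot(w, w)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
Y = np.sign(rng.standard_normal(100))
w = np.zeros(5)
print(empirical_risk(w, X, Y, lam=1e-3))  # log(2) at w = 0
```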
A number of approaches exist for this problem with convex loss functions, which can be roughly classified into three categories. The first type of approach perturbs the output of a non-DP algorithm: [8] first proposed the output perturbation approach, which was extended by [34]. The second type perturbs the objective function [8]; we refer to it as the objective perturbation approach. The third type perturbs the gradients in first-order optimization algorithms: [6] proposed the gradient perturbation approach and gave lower bounds on the utility for both general convex and strongly convex loss functions. Later, [28] showed that this bound can actually be broken by adding more restrictions on the convex domain of the problem.

As shown in the following tables¹, the output perturbation approach can achieve the optimal utility bound for the strongly convex case, but it cannot be generalized to the case with a non-smooth regularizer. The objective perturbation approach needs to obtain the exact optimal solution to ensure both differential privacy and utility, which is often intractable in practice, and it cannot achieve the optimal bound. The gradient perturbation approach can overcome all of these issues and is thus preferred in practice. However, its existing results are all based on Gradient Descent (GD) or Stochastic Gradient Descent (SGD), which are slow in general for large datasets. In the first part of this paper, we present algorithms with a tighter utility upper bound and less running time.

Almost all of the aforementioned results leave out the case where the loss function is non-convex. Recently, [34] studied this case and measured the utility by the gradient norm. In the second part of this paper, we generalize the expected excess empirical risk from the convex case to functions satisfying the Polyak-Lojasiewicz condition, and give a tighter utility upper bound than the one in [34]. Due to the space limit, we leave many details, proofs, and experimental studies to the supplement.

¹ Bounds and complexities ignore multiplicative dependence on certain constants.

2 Related Work
There is a long list of works on differentially private ERM in the last decade that attack the problem from different perspectives. [17], [30], and [2] investigated regret bounds in online settings. [20] studied regression in incremental settings. [32] and [31] explored the problem from the perspective of learnability and stability. We compare with the works most closely related to ours from the points of view of utility and gradient complexity (i.e., the number of calls to the first-order oracle). Table 1 gives the comparison for the case where the loss function is strongly convex and smooth. Our algorithm achieves the near-optimal bound with less gradient complexity than previous ones, and it is also robust to non-smooth regularizers.
Reference | Method | Utility Upper Bd. | Gradient Complexity | Non-smooth Regularizer?
[9, 8] | Objective Perturbation | — | N/A | No
[21] | Objective Perturbation | — | N/A | Yes
[6] | Gradient Perturbation | — | — | Yes
[34] | Output Perturbation | — | — | No
This paper | Gradient Perturbation | — | — | Yes

Table 1: Comparison for the strongly convex and smooth case.
Tables 2 and 3 show that for the non-strongly convex and high-dimensional cases, our algorithms outperform the other peer methods. In particular, we improve the gradient complexity while preserving the optimal bound for the non-strongly convex case; for the high-dimensional case, the gradient complexity is reduced as well. Note that [19] also considered the high-dimensional case via dimension reduction, but their method requires the optimal value in the dimension-reduced space; in addition, they considered loss functions under a different condition rather than $\ell_2$-norm Lipschitzness.
For non-convex problems under differential privacy, [15], [10], and [13] studied private SVD, and [14] investigated k-median clustering. [34] studied ERM with non-convex smooth loss functions; the authors defined the utility via the gradient norm and achieved a qualified utility via DP-SGD. In this paper, we use DP-GD and show that it has a tighter utility upper bound.
Reference | Method | Utility Upper Bd. | Gradient Complexity | Non-smooth Regularizer?
[21] | Objective Perturbation | — | N/A | Yes
[6] | Gradient Perturbation | — | — | Yes
[34] | Output Perturbation | — | — | No
This paper | Gradient Perturbation | — | — | Yes

Table 2: Comparison for the non-strongly convex case.
Reference | Method | Utility Upper Bd. | Gradient Complexity | Non-smooth Regularizer?
[28] | Gradient Perturbation | — | — | Yes
[28] | Objective Perturbation | — | N/A | No
[29] | Gradient Perturbation | — | — | Yes
This paper | Gradient Perturbation | — | — | No

Table 3: Comparison for the high-dimensional case.
3 Preliminaries
Notations: We let $[n]$ denote $\{1, 2, \dots, n\}$. Vectors are in column form. For a vector $v$, we use $\|v\|_2$ to denote its $\ell_2$ norm. In the gradient complexity notation, constant factors are omitted unless specified. $D = \{z_1, \dots, z_n\}$ is a dataset of $n$ individuals.

Definition 3.1 ($G$-Lipschitz Function over $\mathcal{C}$).
A loss function $\ell : \mathcal{C} \times \mathcal{X} \to \mathbb{R}$ is $G$-Lipschitz (under the $\ell_2$ norm) over $\mathcal{C}$ if for any $z \in \mathcal{X}$ and $w_1, w_2 \in \mathcal{C}$, we have $|\ell(w_1, z) - \ell(w_2, z)| \le G \|w_1 - w_2\|_2$.
Definition 3.2 ($L$-smooth Function over $\mathcal{C}$).
A loss function $\ell$ is $L$-smooth over $\mathcal{C}$ with respect to a norm $\|\cdot\|$ if for any $z \in \mathcal{X}$ and $w_1, w_2 \in \mathcal{C}$, we have

$\|\nabla \ell(w_1, z) - \nabla \ell(w_2, z)\|_* \le L \|w_1 - w_2\|,$

where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$. If $\ell$ is differentiable, this yields

$\ell(w_1, z) \le \ell(w_2, z) + \langle \nabla \ell(w_2, z), w_1 - w_2 \rangle + \frac{L}{2} \|w_1 - w_2\|^2.$
We say that two datasets $D, D'$ are neighbors if they differ in exactly one entry, denoted $D \sim D'$.

Definition 3.3 (Differential Privacy [11]).
A randomized algorithm $\mathcal{A}$ is $(\epsilon, \delta)$-differentially private if for all neighboring datasets $D \sim D'$ and for all events $S$ in the output space of $\mathcal{A}$, we have

$\Pr[\mathcal{A}(D) \in S] \le e^{\epsilon} \Pr[\mathcal{A}(D') \in S] + \delta;$

when $\delta = 0$, $\mathcal{A}$ is $\epsilon$-differentially private.
We will use the Gaussian mechanism [11] and the moments accountant [1] to guarantee differential privacy.

Definition 3.4 (Gaussian Mechanism).
Given any function $q : \mathcal{X}^n \to \mathbb{R}^d$, the Gaussian mechanism is defined as

$\mathcal{M}_G(D, q, \epsilon) = q(D) + Y,$

where $Y$ is drawn from the Gaussian distribution $\mathcal{N}(0, \sigma^2 I_d)$ with $\sigma \ge \sqrt{2 \ln(1.25/\delta)}\, \Delta_2(q)/\epsilon$. Here $\Delta_2(q)$ is the $\ell_2$-sensitivity of the function $q$, i.e., $\Delta_2(q) = \sup_{D \sim D'} \|q(D) - q(D')\|_2$. The Gaussian mechanism preserves $(\epsilon, \delta)$-differential privacy.

The moments accountant proposed in [1] is a method to accumulate the privacy cost which yields a tighter bound for $\epsilon$ and $\delta$. Roughly speaking, when we use the Gaussian mechanism on (stochastic) gradient descent, we can save a factor of $\sqrt{\log(T/\delta)}$ in the asymptotic bound of the standard deviation of the noise compared with the advanced composition theorem in [12].

Theorem 3.1 ([1]).
For a $G$-Lipschitz loss function, there exist constants $c_1$ and $c_2$ so that given the sampling probability $q = b/n$ and the number of steps $T$, for any $\epsilon < c_1 q^2 T$, a DP stochastic gradient algorithm with batch size $b$ that injects Gaussian noise with standard deviation $\sigma$ into the gradients (Algorithm 1 in [1]) is $(\epsilon, \delta)$-differentially private for any $\delta > 0$ if

$\sigma \ge c_2 \frac{q G \sqrt{T \log(1/\delta)}}{\epsilon}.$

4 Differentially Private ERM with Convex Loss Function
In this section we consider ERM with a (non)smooth regularizer², i.e.,

(1)  $\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, z_i) + r(w).$

The loss function $\ell(\cdot, z)$ is convex for every $z$. We define the proximal operator as

$\mathrm{prox}_r(w) = \arg\min_{u \in \mathbb{R}^d} \left\{ \frac{1}{2} \|u - w\|_2^2 + r(u) \right\},$

and denote $f(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, z_i)$.

² All of the algorithms and theorems in this section are applicable to a closed convex set $\mathcal{C}$ rather than $\mathbb{R}^d$.
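For example, when the regularizer is $r(u) = \lambda \|u\|_1$, a standard non-smooth choice, the proximal operator has the well-known soft-thresholding closed form (a generic fact, not specific to this paper):

```python
import numpy as np

def prox_l1(w, lam):
    # prox_{lam * ||.||_1}(w): coordinate-wise soft-thresholding.
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

# Entries with magnitude below lam are thresholded to zero.
print(prox_l1(np.array([3.0, -0.5, 0.2]), 1.0))
```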
4.1 Strongly convex case

We first consider the case where $F(\cdot)$ is strongly convex. Algorithm 1 is based on Prox-SVRG [33], which is much faster than SGD or GD. We will show that DP-SVRG is likewise faster than DP-SGD or DP-GD in terms of the time needed to achieve the near-optimal excess empirical risk bound.

Definition 4.1 (Strongly Convex).
A function $f$ is $\mu$-strongly convex with respect to the $\ell_2$ norm if for any $w_1, w_2$, there exists $\mu > 0$ such that

(2)  $f(w_1) \ge f(w_2) + \langle \partial f(w_2), w_1 - w_2 \rangle + \frac{\mu}{2} \|w_1 - w_2\|_2^2,$

where $\partial f(w_2)$ is any subgradient of $f$ at $w_2$.
Theorem 4.1.
In DP-SVRG (Algorithm 1), for $\epsilon$ below a threshold inherited from Theorem 3.1 (see Remark 4.1) and $\delta > 0$, the algorithm is $(\epsilon, \delta)$-differentially private if $\sigma^2$ is chosen as in

(3)

for some constant $c$.
Remark 4.1.
The constraint on $\epsilon$ in Theorems 4.1 and 4.3 comes from Theorem 3.1. It can be removed if the noise in (3) and (6) is amplified by an additional factor, at the cost of a corresponding extra factor in the utility bounds (5) and (7). In that case the guarantee of differential privacy follows from the advanced composition theorem and privacy amplification via sampling [6].
Theorem 4.2 (Utility guarantee).
Suppose that the loss function $\ell(\cdot, z)$ is convex, $G$-Lipschitz, and $L$-smooth over $\mathcal{C}$, and that $F(\cdot)$ is $\mu$-strongly convex w.r.t. the $\ell_2$ norm. In DP-SVRG (Algorithm 1), let $\sigma$ be as in (3). If one chooses the step size and the epoch length sufficiently large so that they satisfy inequality

(4)

then the following holds for the output:

(5)

where some insignificant logarithmic terms are hidden in the $\tilde{O}$ notation. The total gradient complexity is given in Table 1.
Remark 4.2.
We can further use acceleration methods to reduce the gradient complexity; see [25, 3].
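Algorithm 1 itself is given in the supplement. As a rough illustration of the structure being analyzed, the sketch below runs a Prox-SVRG loop in which each variance-reduced stochastic gradient is perturbed with Gaussian noise. The step size, epoch length, and noise scale here are arbitrary placeholders, not the calibrated choices of Theorems 4.1 and 4.2.

```python
import numpy as np

def dp_svrg(grad_i, prox, w0, n, sigma, eta=0.05, epochs=10, m=200, seed=0):
    # Sketch of a noisy Prox-SVRG loop: epoch-wise full gradients at a
    # snapshot point, plus variance-reduced, Gaussian-perturbed inner steps.
    rng = np.random.default_rng(seed)
    w_tilde = np.asarray(w0, dtype=float)
    for _ in range(epochs):
        full_g = np.mean([grad_i(w_tilde, i) for i in range(n)], axis=0)
        w = w_tilde.copy()
        for _ in range(m):
            i = rng.integers(n)
            v = grad_i(w, i) - grad_i(w_tilde, i) + full_g  # variance reduction
            v = v + rng.normal(0.0, sigma, size=w.shape)    # gradient perturbation
            w = prox(w - eta * v)
        w_tilde = w
    return w_tilde

# Toy least-squares instance (consistent system, so the minimizer is known).
rng = np.random.default_rng(1)
A = rng.standard_normal((40, 3))
w_star = np.array([1.0, -2.0, 0.5])
b = A @ w_star
grad_i = lambda w, i: (A[i] @ w - b[i]) * A[i]
w = dp_svrg(grad_i, lambda u: u, np.zeros(3), n=40, sigma=0.01)
print(np.round(w, 2))  # close to [1, -2, 0.5]
```

With the identity map as `prox`, this reduces to plain noisy SVRG; passing a soft-thresholding operator instead would handle an $\ell_1$ regularizer.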
4.2 Non-strongly convex case

In some cases, $F(\cdot)$ may not be strongly convex. For such cases, [5] recently showed that SVRG++ has lower gradient complexity than Accelerated Gradient Descent. Following the idea of DP-SVRG, we present DP-SVRG++ for the non-strongly convex case. Unlike the previous algorithm, this one can achieve the optimal utility bound.
Theorem 4.3.
In DP-SVRG++ (Algorithm 2), for $\epsilon$ below a threshold inherited from Theorem 3.1 (see Remark 4.1) and $\delta > 0$, the algorithm is $(\epsilon, \delta)$-differentially private if $\sigma^2$ is chosen as in

(6)

for some constant $c$.
Theorem 4.4 (Utility guarantee).
Suppose that the loss function $\ell(\cdot, z)$ is convex, $G$-Lipschitz, and $L$-smooth. In DP-SVRG++ (Algorithm 2), if $\sigma$ is chosen as in (6) and the number of iterations is sufficiently large, then the following holds for the output:

(7)

The gradient complexity is given in Table 2.
5 Differentially Private ERM for Convex Loss Function in High Dimensions

The utility bounds and gradient complexities in Section 4 depend on the dimensionality $d$. In the high-dimensional (i.e., $n \ll d$) case, such a dependence is undesirable. To alleviate this issue, one can often remove the dependence on the dimensionality by reformulating the problem so that the goal is to find the parameter in some closed centrally symmetric convex set $\mathcal{C}$ (such as an $\ell_1$-norm ball), i.e.,

(8)  $\min_{w \in \mathcal{C}} F(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, z_i),$

where the loss function is convex.

[28, 29] showed that the dimension-dependent term in (5) and (7) can be replaced by the Gaussian width of $\mathcal{C}$, which is never larger than $\sqrt{d}\,\|\mathcal{C}\|_2$ and can be significantly smaller in practice (for more details and examples, see [28]). In this section, we propose a faster algorithm achieving the same utility upper bound. We first give some definitions.
Definition 5.1 (Minkowski Norm).
The Minkowski norm (denoted by $\|\cdot\|_{\mathcal{C}}$) with respect to a centrally symmetric convex set $\mathcal{C} \subseteq \mathbb{R}^d$ is defined as follows: for any vector $v \in \mathbb{R}^d$,

$\|v\|_{\mathcal{C}} = \min \{ r \in \mathbb{R}^{+} : v \in r \mathcal{C} \}.$

The dual norm of $\|\cdot\|_{\mathcal{C}}$ is denoted by $\|\cdot\|_{\mathcal{C}^*}$; for any vector $v$, $\|v\|_{\mathcal{C}^*} = \max_{w \in \mathcal{C}} |\langle v, w \rangle|$.
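For instance, when $\mathcal{C}$ is the unit $\ell_1$ ball, the Minkowski norm is the $\ell_1$ norm and its dual is the $\ell_\infty$ norm; a quick numerical check of this standard fact:

```python
import numpy as np

v = np.array([1.0, -2.0, 0.5])
# For C = unit l1 ball: ||v||_C = ||v||_1 (smallest r with v in r*C),
# and the dual norm is ||v||_{C*} = max_{||w||_1 <= 1} |<v, w>| = ||v||_inf.
norm_C = np.abs(v).sum()
dual_norm_C = np.abs(v).max()
print(norm_C, dual_norm_C)  # 3.5 2.0
```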
The following lemma implies that every convex function that is $L$-smooth with respect to the $\ell_2$ norm is also smooth with respect to the norm $\|\cdot\|_{\mathcal{C}}$.

Lemma 5.1.
For any vector $v$, we have $\|v\|_2 \le \|\mathcal{C}\|_2 \|v\|_{\mathcal{C}}$, where $\|\mathcal{C}\|_2 = \sup_{w \in \mathcal{C}} \|w\|_2$ is the $\ell_2$ diameter of $\mathcal{C}$.
Definition 5.2 (Gaussian Width).
Let $g \sim \mathcal{N}(0, I_d)$ be a Gaussian random vector in $\mathbb{R}^d$. The Gaussian width of a set $\mathcal{C}$ is defined as $G_{\mathcal{C}} = \mathbb{E}_g [\sup_{w \in \mathcal{C}} \langle g, w \rangle]$.

Lemma 5.2 ([28]).
For $W = \sup_{w \in \mathcal{C}} \langle g, w \rangle$ where $g \sim \mathcal{N}(0, I_d)$, we have $\mathbb{E}[W^2] = O(G_{\mathcal{C}}^2 + \|\mathcal{C}\|_2^2)$.
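For intuition, the Gaussian width can be estimated by Monte Carlo: sample $g \sim \mathcal{N}(0, I_d)$ and average the supremum of $\langle g, w \rangle$ over $\mathcal{C}$. For the unit $\ell_1$ ball the supremum is just $\|g\|_\infty$, which grows only like $\sqrt{2 \log d}$, in contrast to the $\sqrt{d}$ growth for the $\ell_2$ ball. The snippet below illustrates this standard fact and is not part of the paper's algorithms.

```python
import numpy as np

def gaussian_width_l1_ball(d, trials=2000, seed=0):
    # G_C = E[ sup_{w in C} <g, w> ]; for the unit l1 ball this is E[ ||g||_inf ].
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((trials, d))
    return np.abs(g).max(axis=1).mean()

for d in (10, 100, 1000):
    # Monte Carlo estimate vs. the ~sqrt(2 log d) rate.
    print(d, gaussian_width_l1_ball(d), np.sqrt(2 * np.log(d)))
```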
Our algorithm DP-AccMD is based on the Accelerated Mirror Descent method, which was studied in [4, 23].

Theorem 5.3.
In DP-AccMD (Algorithm 3), for $\delta > 0$, the algorithm is $(\epsilon, \delta)$-differentially private if $\sigma^2$ is chosen as in

(9)

for some constant $c$.
Theorem 5.4 (Utility Guarantee).
Suppose the loss function $\ell(\cdot, z)$ is $G$-Lipschitz and $L$-smooth over $\mathcal{C}$. In DP-AccMD, let $\sigma$ be as in (9) and let $\omega$ be a function that is 1-strongly convex with respect to $\|\cdot\|_{\mathcal{C}}$. Then, for a suitable number of iterations, the expected excess empirical risk is bounded in terms of the Gaussian width $G_{\mathcal{C}}$. The total gradient complexity is given in Table 3.
6 ERM for General Functions

In this section, we consider non-convex loss functions with an objective function similar to before:

(10)  $\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, z_i).$

Theorem 6.1.
In DP-GD (Algorithm 4), for $\delta > 0$, the algorithm is $(\epsilon, \delta)$-differentially private if $\sigma^2$ is chosen as in

(11)

for some constant $c$.
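Algorithm 4 is given in the supplement. Structurally, DP-GD is full-batch gradient descent with a Gaussian perturbation added to every gradient. The sketch below illustrates the pattern on a toy quadratic; the step size, iteration count, and noise scale are chosen arbitrarily here rather than calibrated as in (11).

```python
import numpy as np

def dp_gd(grad_fn, w0, sigma, eta, T, seed=0):
    # Full-batch gradient descent with per-iteration Gaussian perturbation.
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for _ in range(T):
        noisy_grad = grad_fn(w) + rng.normal(0.0, sigma, size=w.shape)
        w = w - eta * noisy_grad
    return w

# Toy objective F(w) = 0.5 * ||w||^2, so grad F(w) = w and the minimizer is 0.
w = dp_gd(lambda v: v, np.ones(3), sigma=0.01, eta=0.1, T=200)
print(np.linalg.norm(w))  # small residual, driven by the injected noise
```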
6.1 Excess empirical risk for functions under the Polyak-Lojasiewicz condition

In this subsection, we consider the excess empirical risk in the case where the objective function satisfies the Polyak-Lojasiewicz condition. This topic has been studied in [18, 27, 26, 24, 22].

Definition 6.1 (Polyak-Lojasiewicz condition).
For a function $F$, denote $F^* = \min_{w} F(w)$. Then $F$ satisfies the $\mu$-Polyak-Lojasiewicz condition if there exists $\mu > 0$ such that for every $w$,

(12)  $\frac{1}{2} \|\nabla F(w)\|_2^2 \ge \mu \left( F(w) - F^* \right).$

Condition (12) guarantees that every critical point (i.e., every point where the gradient vanishes) is a global minimum. [18] shows that if $F$ is differentiable and $L$-smooth w.r.t. the $\ell_2$ norm, then we have the following chain of implications:

Strong Convexity ⟹ Essential Strong Convexity ⟹ Weak Strong Convexity ⟹ Restricted Secant Inequality ⟹ Polyak-Lojasiewicz Inequality ⟺ Error Bound
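A standard example from the literature (see [18]) of a function that is non-convex yet satisfies the Polyak-Lojasiewicz condition is $f(x) = x^2 + 3 \sin^2(x)$; gradient descent on it still converges linearly to the global minimum:

```python
import numpy as np

def f(x):
    # Non-convex: f''(x) = 2 + 6*cos(2x) is negative near x = pi/2.
    return x**2 + 3 * np.sin(x)**2

def grad_f(x):
    return 2 * x + 3 * np.sin(2 * x)

x = 2.0
gaps = []
for _ in range(100):
    x -= (1.0 / 8.0) * grad_f(x)  # step size 1/L with L = 8, since f'' <= 8
    gaps.append(f(x))             # f* = 0 at the global minimum x = 0

print(gaps[0], gaps[-1])  # the suboptimality gap shrinks to (near) zero
```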
Theorem 6.2.
Suppose that $F(\cdot)$ is $G$-Lipschitz and $L$-smooth over $\mathbb{R}^d$, and satisfies the Polyak-Lojasiewicz condition with parameter $\mu$. In DP-GD (Algorithm 4), let $\sigma$ be as in (11). Then, for a suitable number of iterations, the following holds:

(13)

where $\tilde{O}$ hides other (logarithmic) terms.

DP-GD achieves a near-optimal bound, since strongly convex functions are a special case of the class of functions satisfying the Polyak-Lojasiewicz condition. The lower bound for strongly convex functions was given in [6]; our result exceeds it by only a logarithmic multiplicative factor, so it is near-optimal in this sense.
6.2 Tighter upper bound for the (non)convex case

In [34], the authors considered (non)convex smooth loss functions and measured the utility by the gradient norm of the output. For their proposed algorithm, they gave an upper bound on this quantity. By using DP-GD (Algorithm 4), we can eliminate an extra factor in their bound.

Theorem 6.3.
Suppose that $F(\cdot)$ is $G$-Lipschitz and $L$-smooth. In DP-GD (Algorithm 4), let $\sigma$ be as in (11). Then, for a suitable number of iterations $T$, we have

(14)
Remark 6.1.
Although we could obtain an optimal bound via Theorem 3.1 using DP-SGD, there would then be a constraint on $\epsilon$. Moreover, the lower bound of the utility under this measure is still unknown; we leave it as an open problem.
7 Discussions

From the discussion in the previous sections, we know that when gradient perturbation is combined with linearly converging first-order methods, a near-optimal bound can be achieved with less gradient complexity. The remaining issue is whether the optimal bound can be obtained in this way. In Section 6.1, we considered functions satisfying the Polyak-Lojasiewicz condition and achieved a near-optimal utility bound. It would be interesting to know the bounds for functions satisfying other conditions (such as the general gradient-dominated functions of [24], or the quasi-convex, locally-Lipschitz functions of [16]) under the differential privacy model. For general non-smooth convex loss functions (such as the SVM hinge loss), we do not know whether the optimal bound is achievable with lower time complexity. Finally, for non-convex loss functions, proposing a more easily interpretable utility measure is another direction for future work.
References
 [1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM, 2016.

[2] N. Agarwal and K. Singh. The price of differential privacy for online learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 32–40. PMLR, 2017.
[3] Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200–1205. ACM, 2017.
[4] Z. Allen-Zhu and L. Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. In Proceedings of the 8th Innovations in Theoretical Computer Science Conference, ITCS '17, 2017.
[5] Z. Allen-Zhu and Y. Yuan. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In Proceedings of the 33rd International Conference on Machine Learning, ICML '16, 2016.
 [6] R. Bassily, A. Smith, and A. Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 464–473. IEEE, 2014.
 [7] M. Bun and T. Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pages 635–658. Springer, 2016.

[8] K. Chaudhuri and C. Monteleoni. Privacy-preserving logistic regression. In Advances in Neural Information Processing Systems, pages 289–296, 2009.
[9] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12:1069–1109, 2011.
 [10] K. Chaudhuri, A. Sarwate, and K. Sinha. Nearoptimal differentially private principal components. In Advances in Neural Information Processing Systems, pages 989–997, 2012.
 [11] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, volume 3876, pages 265–284. Springer, 2006.
 [12] C. Dwork, G. N. Rothblum, and S. Vadhan. Boosting and differential privacy. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pages 51–60. IEEE, 2010.

[13] C. Dwork, K. Talwar, A. Thakurta, and L. Zhang. Analyze Gauss: Optimal bounds for privacy-preserving principal component analysis. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 11–20. ACM, 2014.
[14] D. Feldman, A. Fiat, H. Kaplan, and K. Nissim. Private coresets. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pages 361–370. ACM, 2009.
 [15] M. Hardt and A. Roth. Beyond worstcase analysis in private singular vector computation. In Proceedings of the fortyfifth annual ACM symposium on Theory of computing, pages 331–340. ACM, 2013.
 [16] E. Hazan, K. Levy, and S. ShalevShwartz. Beyond convexity: Stochastic quasiconvex optimization. In Advances in Neural Information Processing Systems, pages 1594–1602, 2015.
 [17] P. Jain, P. Kothari, and A. Thakurta. Differentially private online learning. In COLT, volume 23, pages 24–1, 2012.
[18] H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.
 [19] S. P. Kasiviswanathan and H. Jin. Efficient private empirical risk minimization for highdimensional learning. In Proceedings of The 33rd International Conference on Machine Learning, pages 488–497, 2016.
 [20] S. P. Kasiviswanathan, K. Nissim, and H. Jin. Private incremental regression. In Proceedings of the 36th ACM SIGMODSIGACTSIGAI Symposium on Principles of Database Systems, pages 167–182. ACM, 2017.
[21] D. Kifer, A. Smith, and A. Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory (COLT), 2012.
[22] G. Li and T. K. Pong. Calculus of the exponent of the Kurdyka-Łojasiewicz inequality and its applications to linear convergence of first-order methods. arXiv preprint arXiv:1602.02915, 2016.
 [23] Y. Nesterov. Smooth minimization of nonsmooth functions. Mathematical programming, 103(1):127–152, 2005.
[24] Y. Nesterov and B. T. Polyak. Cubic regularization of Newton's method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
 [25] A. Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Advances in Neural Information Processing Systems, pages 1574–1582, 2014.
 [26] B. T. Polyak. Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics, 3(4):864–878, 1963.

[27] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314–323, 2016.
[28] K. Talwar, A. Thakurta, and L. Zhang. Private empirical risk minimization beyond the worst case: The effect of the constraint set geometry. arXiv preprint arXiv:1411.5417, 2014.
 [29] K. Talwar, A. Thakurta, and L. Zhang. Nearly optimal private lasso. In Advances in Neural Information Processing Systems, pages 3025–3033, 2015.
 [30] A. G. Thakurta and A. Smith. (nearly) optimal algorithms for private online learning in fullinformation and bandit settings. In Advances in Neural Information Processing Systems, pages 2733–2741, 2013.
 [31] Y.X. Wang, J. Lei, and S. E. Fienberg. Learning with differential privacy: Stability, learnability and the sufficiency and necessity of erm principle. Journal of Machine Learning Research, 17(183):1–40, 2016.
 [32] X. Wu, M. Fredrikson, W. Wu, S. Jha, and J. F. Naughton. Revisiting differentially private regression: Lessons from learning theory and their consequences. arXiv preprint arXiv:1512.06388, 2015.
 [33] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

[34] J. Zhang, K. Zheng, W. Mou, and L. Wang. Efficient private ERM for smooth objectives. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 3922–3928, 2017.
Appendix A Experiments
In this section, we validate our methods on the Covertype dataset³ using logistic regression. The dataset contains 581,012 samples with 54 features; we use 200,000 samples for training. We first compare our DP-SVRG algorithm with the DP-GD method in [34] for logistic regression with $\ell_2$-norm regularization, where the regularization coefficient is set to a small constant. We then compare our DP-SVRG++ algorithm with the DP-GD method in [34] for (unregularized) logistic regression. We evaluate the optimality gap and the running time for several settings of $\epsilon$ and $\delta$. From the figures, it is clear that our methods outperform the previous results in both cases.

³ https://archive.ics.uci.edu/ml/datasets/covertype
Appendix B Details and proofs
B.1 Using the advanced composition theorem to guarantee differential privacy

As noted, there are constraints on $\epsilon$ in Theorems 4.1 and 4.3, which come from Theorem 3.1 (see the proof below). For general $\epsilon$, we can simply amplify $\sigma$ by an additional factor. However, in this case the utility bounds (5) and (7) in Theorems 4.2 and 4.4 are also amplified by a corresponding factor (neglecting other terms); the guarantee of DP then follows from the advanced composition theorem and privacy amplification via sampling [6]. Below we show this. Consider the $i$-th query, in which the index is drawn by uniform sampling; there are $T$ compositions of these queries. By the advanced composition theorem, in order to guarantee $(\epsilon, \delta)$-differential privacy overall, each query must be differentially private with a suitably smaller privacy parameter. Now consider the query on the whole dataset (i.e., with no random sampling).

From the above, we can see that the sensitivity of the query on the whole dataset is bounded, so for a suitable $\sigma$ that query is differentially private. This implies that the subsampled query is differentially private as well, which follows from the lemma below (see Theorem 2.1 and Lemma 2.2 in [6]).

Lemma B.1.
If an algorithm $\mathcal{A}$ is $\epsilon$-differentially private, then for any $n$-element dataset $D$, executing $\mathcal{A}$ on uniformly random entries of $D$ ensures differential privacy with a privacy parameter scaled down by the sampling ratio.

Setting the per-query parameters accordingly, we can guarantee that the composition of $T$ queries is $(\epsilon, \delta)$-differentially private.
B.2 Proof of Theorems 4.1 and 4.3

Proof.
W.l.o.g., we assume the Lipschitz constant is normalized (otherwise we can rescale). The proofs of Theorem 4.1 and Theorem 4.3 are the same except for the iteration number (i.e., the number of queries). Let the entry in which the two neighboring datasets differ be the $n$-th one. Now consider the $i$-th query, in which the index is a uniform sample. This query can be thought of as the composition of two queries,

(15)

and

(16)

for some intermediate quantity. By Theorem 2.1 in [1], the privacy cost of the composition is controlled by the moments of these two queries; we now bound each of them.