
# Differentially Private Empirical Risk Minimization Revisited: Faster and More General

In this paper we study the differentially private Empirical Risk Minimization (ERM) problem in several settings. For smooth (strongly) convex loss functions with or without (non)-smooth regularization, we give algorithms that achieve either optimal or near-optimal utility bounds with lower gradient complexity than previous work. For ERM with a smooth convex loss function in the high-dimensional ($p \gg n$) setting, we give an algorithm that achieves the upper bound with lower gradient complexity than previous ones. Finally, we generalize the expected excess empirical risk from convex loss functions to non-convex ones satisfying the Polyak-Lojasiewicz condition and give a tighter upper bound on the utility than the one in [34].


## 1 Introduction

Privacy preservation is an important issue in learning. Learning algorithms are now often required to handle sensitive data, so an algorithm must not only learn effectively from the data but also provide a certain level of privacy guarantee. Differential privacy [11] is a rigorous privacy definition for data analysis that provides meaningful guarantees regardless of what an adversary knows ahead of time about an individual's data. As a commonly used supervised learning method, Empirical Risk Minimization (ERM) also faces the challenge of simultaneously preserving privacy and learning. Differentially Private (DP) ERM with convex loss functions has been extensively studied in the last decade, starting from [8]. In this paper, we revisit this problem and present several improved results.

#### Problem Setting

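Given a dataset $D = \{z_1, \dots, z_n\}$ from a data universe and a closed convex set $\mathcal{C} \subseteq \mathbb{R}^p$, DP-ERM is to find

$$x^* \in \arg\min_{x \in \mathcal{C}} F^r(x, D) = F(x, D) + r(x) = \frac{1}{n}\sum_{i=1}^{n} f(x, z_i) + r(x)$$

with the guarantee of being differentially private. We refer to $f$ as the loss function, and $r(\cdot)$ is some simple (non)-smooth convex function called the regularizer. If the loss function is convex, the utility of the algorithm is measured by the expected excess empirical risk, i.e., $\mathbb{E}[F^r(x^{\mathrm{priv}}, D)] - F^r(x^*, D)$, where $x^{\mathrm{priv}}$ denotes the output of the private algorithm. The expectation is over the coins of the algorithm.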
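As a rough illustration of the utility measure, the following Python sketch evaluates the regularized empirical risk and the excess empirical risk of a candidate output. The squared loss, the toy data, and the value `x_priv` are hypothetical choices made only for this example; in practice the excess risk would be averaged over the randomness of the private algorithm.

```python
def empirical_risk(x, data, loss, reg=lambda x: 0.0):
    # F^r(x, D) = (1/n) * sum_i loss(x, z_i) + reg(x)
    return sum(loss(x, z) for z in data) / len(data) + reg(x)

# Hypothetical one-dimensional example: squared loss, no regularizer.
data = [1.0, 2.0, 3.0]
squared_loss = lambda x, z: 0.5 * (x - z) ** 2

x_star = sum(data) / len(data)   # exact minimizer of the mean squared loss
x_priv = 2.3                     # stand-in for the output of a private algorithm

# Excess empirical risk of this single output.
excess = empirical_risk(x_priv, data, squared_loss) - empirical_risk(x_star, data, squared_loss)
```

For this quadratic objective the excess risk equals $\frac{1}{2}(x^{\mathrm{priv}} - x^*)^2$, which gives a quick sanity check on the computation.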

A number of approaches exist for this problem with convex loss functions, which can be roughly classified into three categories. The first type of approach perturbs the output of a non-DP algorithm: [8] first proposed the output perturbation approach, which was extended by [34]. The second type perturbs the objective function [8]; we refer to it as the objective perturbation approach. The third type perturbs gradients in first-order optimization algorithms. [6] proposed the gradient perturbation approach and gave lower bounds on the utility for both general convex and strongly convex loss functions. Later, [28] showed that this bound can actually be broken by adding more restrictions on the convex domain of the problem.

As shown in Tables 1-3 (bounds and complexities there ignore the multiplicative dependence on $\log(1/\delta)$), the output perturbation approach can achieve the optimal utility bound for the strongly convex case, but it cannot be generalized to the case with a non-smooth regularizer. The objective perturbation approach needs to obtain the optimal solution to ensure both differential privacy and utility, which is often intractable in practice, and it cannot achieve the optimal bound. The gradient perturbation approach can overcome all these issues and is thus preferred in practice. However, its existing results are all based on Gradient Descent (GD) or Stochastic Gradient Descent (SGD), which are generally slow for large datasets. In the first part of this paper, we present algorithms with tighter utility upper bounds and less running time. Almost none of the aforementioned results considered the case where the loss function is non-convex. Recently, [34] studied this case and measured the utility by the gradient norm. In the second part of this paper, we generalize the expected excess empirical risk from convex functions to functions satisfying the Polyak-Lojasiewicz condition, and give a tighter upper bound on the utility than the one given in [34]. Due to space limits, we leave many details, proofs, and experimental studies to the supplement.

## 2 Related Work

There is a long list of works on differentially private ERM in the last decade which attack the problem from different perspectives. [17], [30] and [2] investigated regret bounds in online settings. [20] studied regression in incremental settings. [32] and [31] explored the problem from the perspective of learnability and stability. We compare with the works most related to ours in terms of utility and gradient complexity (i.e., the number of calls to the first-order oracle). Table 1 gives the comparison for the case where the loss function is strongly convex and smooth. Our algorithm achieves a near-optimal bound with lower gradient complexity than previous ones. It is also robust to non-smooth regularizers.

Tables 2 and 3 show that for the non-strongly convex and high-dimensional cases, our algorithms outperform peer methods. In particular, we reduce the gradient complexity while preserving the optimal bound for the non-strongly convex case, and we also reduce the gradient complexity for the high-dimensional case. Note that [19] also considered the high-dimensional case via dimension reduction, but their method requires the optimal value in the dimension-reduced space, and they considered loss functions under a condition other than the norm-Lipschitz condition.

For non-convex problems under differential privacy, [15], [10] and [13] studied private SVD, and [14] investigated k-median clustering. [34] studied ERM with non-convex smooth loss functions. In [34], the authors defined the utility using the gradient norm (the quantity bounded in (14)) and achieved a qualified utility via DP-SGD. In this paper, we use DP-GD and show that it has a tighter utility upper bound.

## 3 Preliminaries

#### Notations:

We let $[n]$ denote $\{1, 2, \dots, n\}$. Vectors are in column form. For a vector $v$, we use $\|v\|_2$ to denote its $\ell_2$-norm. In the gradient complexity notation, some parameters are omitted unless specified. $D = \{z_1, \dots, z_n\}$ is a dataset of $n$ individuals.

###### Definition 3.1 (Lipschitz Function over θ).

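A loss function $f$ is $G$-Lipschitz (under the $\ell_2$-norm) over $\theta$ if for any data point $z$ and any $\theta_1, \theta_2$, we have $|f(\theta_1, z) - f(\theta_2, z)| \le G\|\theta_1 - \theta_2\|_2$.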

###### Definition 3.2 (L-smooth Function over θ).

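A loss function $f$ is $L$-smooth over $\theta$ with respect to the norm $\|\cdot\|$ if for any data point $z$ and any $\theta_1, \theta_2$, we have

$$\|\nabla f(\theta_1, z) - \nabla f(\theta_2, z)\|_* \le L\|\theta_1 - \theta_2\|,$$

where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$. If $f$ is differentiable, this yields

$$f(\theta_1, z) \le f(\theta_2, z) + \langle \nabla f(\theta_2, z), \theta_1 - \theta_2 \rangle + \frac{L}{2}\|\theta_1 - \theta_2\|^2.$$

We say that two datasets $D, D'$ are neighbors if they differ by only one entry, denoted as $D \sim D'$.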

###### Definition 3.3 (Differential Privacy [11]).

A randomized algorithm $\mathcal{A}$ is $(\epsilon, \delta)$-differentially private if for all neighboring datasets $D, D'$ and for all events $S$ in the output space of $\mathcal{A}$, we have

$$\Pr(\mathcal{A}(D) \in S) \le e^{\epsilon}\Pr(\mathcal{A}(D') \in S) + \delta.$$

When $\delta = 0$, $\mathcal{A}$ is $\epsilon$-differentially private.

We will use the Gaussian Mechanism [11] and the moments accountant [1] to guarantee $(\epsilon, \delta)$-DP.

###### Definition 3.4 (Gaussian Mechanism).

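Given any function $q$ mapping datasets to $\mathbb{R}^p$, the Gaussian Mechanism is defined as:

$$\mathcal{M}_G(D, q, \epsilon) = q(D) + Y,$$

where $Y$ is drawn from the Gaussian distribution $\mathcal{N}(0, \sigma^2 I_p)$ with $\sigma \ge \sqrt{2\ln(1.25/\delta)}\,\Delta_2(q)/\epsilon$. Here $\Delta_2(q)$ is the $\ell_2$-sensitivity of the function $q$, i.e., $\Delta_2(q) = \sup_{D \sim D'} \|q(D) - q(D')\|_2$. The Gaussian Mechanism preserves $(\epsilon, \delta)$-differential privacy.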
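The mechanism can be sketched in a few lines of Python. This is only an illustration: it assumes the classical noise calibration $\sigma = \sqrt{2\ln(1.25/\delta)}\,\Delta_2/\epsilon$ (valid for $0 < \epsilon < 1$), and the function names, the toy records, and the mean query are hypothetical.

```python
import math
import random

def gaussian_sigma(l2_sensitivity, eps, delta):
    # Classical calibration (an assumption here, valid for 0 < eps < 1):
    # sigma = sqrt(2 ln(1.25/delta)) * Delta_2 / eps.
    return math.sqrt(2.0 * math.log(1.25 / delta)) * l2_sensitivity / eps

def gaussian_mechanism(query_value, l2_sensitivity, eps, delta, rng):
    # Release q(D) + Y with Y ~ N(0, sigma^2 I), one independent draw per coordinate.
    sigma = gaussian_sigma(l2_sensitivity, eps, delta)
    return [v + rng.gauss(0.0, sigma) for v in query_value]

# Hypothetical query: the mean of n records, each with l2-norm at most 1.
# Replacing one record moves the mean by at most 2/n in l2-norm.
records = [[0.6, 0.8], [1.0, 0.0], [0.0, -1.0], [-0.6, 0.8]]
n = len(records)
true_mean = [sum(col) / n for col in zip(*records)]
noisy_mean = gaussian_mechanism(true_mean, 2.0 / n, eps=0.5, delta=1e-5,
                                rng=random.Random(0))
```

With only four records the noise dominates the signal; the sensitivity $2/n$ shrinks as the dataset grows, which is what makes the released mean useful for large $n$.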

The moments accountant proposed in [1] is a method to accumulate the privacy cost which has a tighter bound for $\epsilon$ and $\delta$. Roughly speaking, when we use the Gaussian Mechanism on (stochastic) gradient descent, we can save a factor of $\sqrt{\ln(T/\delta)}$ in the asymptotic bound on the standard deviation of the noise compared with the advanced composition theorem in [12].

###### Theorem 3.1 ([1]).

For a $G$-Lipschitz loss function, there exist constants $c_1$ and $c_2$ so that given the sampling probability $q = l/n$ and the number of steps $T$, for any $\epsilon < c_1 q^2 T$, a DP stochastic gradient algorithm with batch size $l$ that injects Gaussian noise with standard deviation $G\sigma$ to the gradients (Algorithm 1 in [1]) is $(\epsilon, \delta)$-differentially private for any $\delta > 0$ if

$$\sigma \ge c_2 \frac{q\sqrt{T\ln(1/\delta)}}{\epsilon}.$$
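To see how the noise level in this bound scales, the following Python sketch evaluates its right-hand side. The constant $c_2$ is not specified by the theorem, so the default `c2=1.0` below is only a placeholder, and the values of `n`, `T`, `eps` and `delta` are hypothetical.

```python
import math

def moments_accountant_sigma(q, steps, eps, delta, c2=1.0):
    # Right-hand side of the bound: c2 * q * sqrt(T * ln(1/delta)) / eps.
    # c2 is an unspecified absolute constant; 1.0 is a placeholder, not a value from [1].
    return c2 * q * math.sqrt(steps * math.log(1.0 / delta)) / eps

n, T = 10_000, 1_000
sigma_min = moments_accountant_sigma(q=1.0 / n, steps=T, eps=0.5, delta=1e-5)
```

The bound grows like $\sqrt{T}$ in the number of steps and shrinks linearly in the sampling probability $q$, so sampling one record out of a large dataset keeps the required noise small.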

## 4 Differentially Private ERM with Convex Loss Function

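In this section we consider ERM with a (non)-smooth regularizer (all of the algorithms and theorems in this section are also applicable to a closed convex set $\mathcal{C}$ rather than $\mathbb{R}^p$), i.e.,

$$\min_{x \in \mathbb{R}^p} F^r(x, D) = F(x, D) + r(x) = \frac{1}{n}\sum_{i=1}^{n} f(x, z_i) + r(x). \tag{1}$$

The loss function $f$ is convex for every $z$. We define the proximal operator as

$$\mathrm{prox}_r(y) = \arg\min_{x \in \mathbb{R}^p}\left\{\frac{1}{2}\|x - y\|_2^2 + r(x)\right\},$$

and denote $x^* = \arg\min_{x \in \mathbb{R}^p} F^r(x, D)$.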
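For a concrete instance of the proximal operator, take the non-smooth regularizer $r(x) = \lambda\|x\|_1$, whose proximal operator is coordinate-wise soft-thresholding. The Python sketch below is an illustration only; the choice of regularizer and the function names are assumptions made for this example.

```python
def prox_l1(y, lam):
    # prox of r(x) = lam * ||x||_1: shrink each coordinate toward zero by lam.
    def shrink(v):
        if v > lam:
            return v - lam
        if v < -lam:
            return v + lam
        return 0.0
    return [shrink(v) for v in y]

def prox_gradient_step(x, grad, eta, lam):
    # One proximal gradient step: prox_{eta * r}(x - eta * grad) with r = lam * ||.||_1.
    return prox_l1([xi - eta * gi for xi, gi in zip(x, grad)], eta * lam)

shrunk = prox_l1([3.0, -0.5, -2.0], lam=1.0)
```

Coordinates whose magnitude is below the threshold are set exactly to zero, which is why this regularizer produces sparse solutions.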

### 4.1 Strongly convex case

We first consider the case where $F^r(x, D)$ is $\mu$-strongly convex. Algorithm 1 is based on Prox-SVRG [33], which is much faster than SGD or GD. We will show that DP-SVRG is also faster than DP-SGD or DP-GD in terms of the time needed to achieve the near-optimal excess empirical risk bound.

###### Definition 4.1 (Strongly Convex).

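The function $f$ is $\mu$-strongly convex with respect to the norm $\|\cdot\|$ if there exists $\mu > 0$ such that for any $x, y$ in the domain of $f$,

$$f(y) \ge f(x) + \langle \partial f, y - x \rangle + \frac{\mu}{2}\|y - x\|^2, \tag{2}$$

where $\partial f$ is any subgradient of $f$ at $x$.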

###### Theorem 4.1.

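In DP-SVRG (Algorithm 1), for $\epsilon \le c_1 \frac{Tm}{n^2}$ with some constant $c_1$ and $\delta > 0$, it is $(\epsilon, \delta)$-differentially private if

$$\sigma^2 = c\,\frac{G^2 T m \ln(1/\delta)}{n^2\epsilon^2} \tag{3}$$

for some constant $c$.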
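Algorithm 1 itself is not reproduced in this text, so the following Python sketch is only a plausible reading of a noisy Prox-SVRG loop: each inner step uses the variance-reduced gradient $\nabla f(x, z_i) - \nabla f(\tilde{x}, z_i) + \nabla F(\tilde{x}, D)$ perturbed by $\mathcal{N}(0, \sigma^2 I_p)$ noise, as in the query analyzed in Appendix B, and each new snapshot is the average of the inner iterates, as in Prox-SVRG [33]. The constant `c=1.0`, the toy loss, and all function names are assumptions.

```python
import math
import random

def svrg_noise_std(G, T, m, n, eps, delta, c=1.0):
    # sigma from (3): sigma^2 = c G^2 T m ln(1/delta) / (n^2 eps^2).
    # c is an unspecified constant; c=1.0 is only a placeholder.
    return math.sqrt(c * G ** 2 * T * m * math.log(1.0 / delta)) / (n * eps)

def dp_prox_svrg(grad, prox, data, x0, eta, T, m, sigma, rng):
    # grad(x, z): gradient of f(., z) at x (a list); prox(y, eta): prox of eta * r at y.
    n, p = len(data), len(x0)
    x_tilde = list(x0)
    for _ in range(T):
        # Full gradient at the snapshot point.
        full = [sum(col) / n for col in zip(*(grad(x_tilde, z) for z in data))]
        x, avg = list(x_tilde), [0.0] * p
        for _ in range(m):
            z = data[rng.randrange(n)]  # uniform sample
            g_x, g_snap = grad(x, z), grad(x_tilde, z)
            # Variance-reduced gradient perturbed with N(0, sigma^2 I_p) noise.
            v = [g_x[j] - g_snap[j] + full[j] + rng.gauss(0.0, sigma) for j in range(p)]
            x = prox([x[j] - eta * v[j] for j in range(p)], eta)
            avg = [avg[j] + x[j] / m for j in range(p)]
        x_tilde = avg  # next snapshot: average of the inner iterates
    return x_tilde

# Toy problem (hypothetical): f(x, z) = 0.5 * ||x - z||^2 with no regularizer.
# The noise scale is illustrative only; this loss is not G-Lipschitz on all of R^p.
points = [[0.0, 1.0], [2.0, 3.0], [4.0, -1.0]]
grad = lambda x, z: [xi - zi for xi, zi in zip(x, z)]
identity_prox = lambda y, eta: y
sigma = svrg_noise_std(G=1.0, T=10, m=20, n=len(points), eps=0.5, delta=1e-5)
x_private = dp_prox_svrg(grad, identity_prox, points, [0.0, 0.0], 0.5, 10, 20,
                         sigma, random.Random(0))
x_exact = dp_prox_svrg(grad, identity_prox, points, [0.0, 0.0], 0.5, 10, 20,
                       0.0, random.Random(0))
```

Setting `sigma` to zero recovers a non-private variance-reduced run, which converges to the mean of the toy points and is a convenient way to check the update before adding noise.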

###### Remark 4.1.

The constraint on $\epsilon$ in Theorems 4.1 and 4.3 comes from Theorem 3.1. This constraint can be removed if the noise $\sigma$ in (3) and (6) is amplified by an extra factor, but a corresponding factor then appears in the utility bounds (5) and (7). In this case the guarantee of differential privacy follows from the advanced composition theorem and privacy amplification via sampling [6].

###### Theorem 4.2 (Utility guarantee).

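Suppose that the loss function $f(x, z)$ is convex, $G$-Lipschitz and $L$-smooth over $x$, and that $F^r(x, D)$ is $\mu$-strongly convex w.r.t. the $\ell_2$-norm. In DP-SVRG (Algorithm 1), let $\sigma$ be as in (3). If one chooses $\eta$ and a sufficiently large $m$ so that they satisfy the inequality

$$\frac{1}{\eta(1 - 8\eta L)\mu m} + \frac{8L\eta(m+1)}{m(1 - 8L\eta)} < \frac{1}{2}, \tag{4}$$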

then the following holds for ,

$$\mathbb{E}[F^r(\tilde{x}_T, D)] - F^r(x^*, D) \le \tilde{O}\left(\frac{p\log(n)\,G^2\log(1/\delta)}{n^2\epsilon^2\mu}\right), \tag{5}$$

where some insignificant logarithmic terms are hidden in the $\tilde{O}$-notation. The total gradient complexity is .

###### Remark 4.2.

We can further use some acceleration methods to reduce the gradient complexity; see [25] and [3].

### 4.2 Non-strongly convex case

In some cases, $F^r(x, D)$ may not be strongly convex. For such cases, [5] has recently shown that SVRG++ has lower gradient complexity than Accelerated Gradient Descent. Following the idea of DP-SVRG, we present the algorithm DP-SVRG++ for the non-strongly convex case. Unlike the previous one, this algorithm can achieve the optimal utility bound.

###### Theorem 4.3.

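In DP-SVRG++ (Algorithm 2), for $\epsilon \le c_1 \frac{2^T m}{n^2}$ with some constant $c_1$ and $\delta > 0$, it is $(\epsilon, \delta)$-differentially private if

$$\sigma^2 = c\,\frac{G^2\, 2^T m \ln(2/\delta)}{n^2\epsilon^2} \tag{6}$$

for some constant $c$.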

###### Theorem 4.4 (Utility guarantee).

Suppose that the loss function is convex, G-Lipschitz and L-smooth. In DP-SVRG++(Algorithm 2), if is chosen as in (6), , and is sufficiently large, then the following holds for ,

$$\mathbb{E}[F^r(\tilde{x}_T, D)] - F^r(x^*, D) \le O\left(\frac{G\sqrt{p\ln(1/\delta)}}{n\epsilon}\right). \tag{7}$$

The gradient complexity is .

## 5 Differentially Private ERM for Convex Loss Function in High Dimensions

The utility bounds and gradient complexities in Section 4 depend on the dimensionality $p$. In the high-dimensional (i.e., $p \gg n$) case, such a dependence is not very desirable. To alleviate this issue, we can usually get rid of the dependence on dimensionality by reformulating the problem so that the goal is to find the parameter in some closed centrally symmetric convex set $\mathcal{C} \subseteq \mathbb{R}^p$ (such as an $\ell_1$-norm ball), i.e.,

$$\min_{x \in \mathcal{C}} F(x, D) = \frac{1}{n}\sum_{i=1}^{n} f(x, z_i), \tag{8}$$

where the loss function is convex.

[28] and [29] showed that the $\sqrt{p}$ term in (5) and (7) can be replaced by the Gaussian width of $\mathcal{C}$, which is no larger than $O(\sqrt{p})$ and can be significantly smaller in practice (for more details and examples one may refer to [28]). In this section, we propose a faster algorithm to achieve the upper utility bound. We first give some definitions.

###### Definition 5.1 (Minkowski Norm).

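The Minkowski norm (denoted by $\|\cdot\|_{\mathcal{C}}$) with respect to a centrally symmetric convex set $\mathcal{C} \subseteq \mathbb{R}^p$ is defined as follows. For any vector $v \in \mathbb{R}^p$,

$$\|v\|_{\mathcal{C}} = \min\{r \in \mathbb{R}^+ : v \in r\mathcal{C}\}.$$

The dual norm of $\|\cdot\|_{\mathcal{C}}$ is denoted as $\|\cdot\|_{\mathcal{C}^*}$; for any vector $v \in \mathbb{R}^p$, $\|v\|_{\mathcal{C}^*} = \max_{w \in \mathcal{C}} |\langle w, v \rangle|$.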
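As a small worked example, take $\mathcal{C}$ to be the $\ell_1$-ball of radius $R$ (a hypothetical choice). Then the Minkowski norm is $\|v\|_1 / R$ and its dual norm is $R\|v\|_\infty$, which the Python sketch below computes; the function names are assumptions made for illustration.

```python
def minkowski_norm_l1_ball(v, radius):
    # Smallest r with v in r*C for C = {w : ||w||_1 <= radius}.
    return sum(abs(c) for c in v) / radius

def dual_norm_l1_ball(v, radius):
    # max over w in C of |<w, v>|, attained at a vertex of the l1 ball.
    return radius * max(abs(c) for c in v)

u, v = [3.0, -4.0], [1.0, 2.0]
norm_u = minkowski_norm_l1_ball(u, radius=2.0)   # (3 + 4) / 2
dual_v = dual_norm_l1_ball(v, radius=2.0)        # 2 * max(1, 2)
inner = sum(a * b for a, b in zip(u, v))         # 3 - 8
```

The pair satisfies the generalized Cauchy-Schwarz inequality $|\langle u, v \rangle| \le \|u\|_{\mathcal{C}}\,\|v\|_{\mathcal{C}^*}$, which is the property a mirror-descent style analysis relies on.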

The following lemma implies that every convex function which is $L$-smooth with respect to the $\ell_2$-norm is $L\|\mathcal{C}\|_2^2$-smooth with respect to the $\|\cdot\|_{\mathcal{C}}$ norm.

###### Lemma 5.1.

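For any vector $v$, we have $\|v\|_2 \le \|\mathcal{C}\|_2\,\|v\|_{\mathcal{C}}$, where $\|\mathcal{C}\|_2$ is the $\ell_2$-diameter and $\|\mathcal{C}\|_2 = \sup_{x, y \in \mathcal{C}} \|x - y\|_2$.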

###### Definition 5.2 (Gaussian Width).

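Let $b \sim \mathcal{N}(0, I_p)$ be a Gaussian random vector in $\mathbb{R}^p$. The Gaussian width of a set $\mathcal{C}$ is defined as $G_{\mathcal{C}} = \mathbb{E}_b\left[\sup_{w \in \mathcal{C}} \langle b, w \rangle\right]$.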
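To get a feel for how much smaller than $\sqrt{p}$ the Gaussian width can be, the following Python sketch estimates it by Monte Carlo for two hypothetical sets: the unit $\ell_1$-ball, where the supremum equals $\|b\|_\infty$, and the unit $\ell_2$-ball, where it equals $\|b\|_2$. The sample sizes and function names are assumptions for illustration.

```python
import math
import random

def gaussian_width_mc(support_fn, p, trials, rng):
    # Monte Carlo estimate of E_b[sup_{w in C} <b, w>], with b ~ N(0, I_p).
    total = 0.0
    for _ in range(trials):
        b = [rng.gauss(0.0, 1.0) for _ in range(p)]
        total += support_fn(b)
    return total / trials

support_l1_ball = lambda b: max(abs(c) for c in b)            # sup over the unit l1 ball
support_l2_ball = lambda b: math.sqrt(sum(c * c for c in b))  # sup over the unit l2 ball

p = 200
rng = random.Random(0)
width_l1 = gaussian_width_mc(support_l1_ball, p, trials=300, rng=rng)
width_l2 = gaussian_width_mc(support_l2_ball, p, trials=300, rng=rng)
```

For the $\ell_1$-ball the width grows only like $\sqrt{\log p}$, while for the $\ell_2$-ball it is close to $\sqrt{p}$, so restricting to an $\ell_1$-ball removes most of the dimension dependence.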

###### Lemma 5.2 ([28]).

For where , we have .

Our algorithm DP-AccMD is based on the Accelerated Mirror Descent method, which was studied in [4] and [23].

###### Theorem 5.3.

In DP-AccMD( Algorithm 3), for , it is -differentially private if

$$\sigma^2 = c\,\frac{G^2 T \ln(1/\delta)}{n^2\epsilon^2} \tag{9}$$

for some constant .

###### Theorem 5.4 (Utility Guarantee).

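Suppose the loss function $f(x, z)$ is $G$-Lipschitz and $L$-smooth over $x \in \mathcal{C}$. In DP-AccMD, let $\sigma$ be as in (9) and let $w$ be a function that is 1-strongly convex with respect to $\|\cdot\|_{\mathcal{C}}$. Then if

$$T^2 = O\left(\frac{L\|\mathcal{C}\|_2^2\sqrt{B_w(x^*, x_0)}\,n\epsilon}{G\sqrt{\ln(1/\delta)}\sqrt{G_{\mathcal{C}}^2 + \|\mathcal{C}\|_2^2}}\right),$$

we have

$$\mathbb{E}[F(y_T, D)] - F(x^*, D) \le O\left(\frac{\sqrt{B_w(x^*, x_0)}\sqrt{G_{\mathcal{C}}^2 + \|\mathcal{C}\|_2^2}\;G\sqrt{\ln(1/\delta)}}{n\epsilon}\right).$$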

The total gradient complexity is .

## 6 ERM for General Functions

In this section, we consider non-convex functions with similar objective function as before,

$$\min_{x \in \mathbb{R}^p} F(x, D) = \frac{1}{n}\sum_{i=1}^{n} f(x, z_i). \tag{10}$$

###### Theorem 6.1.

In DP-GD( Algorithm 4), for , it is -differentially private if

$$\sigma^2 = c\,\frac{G^2 T \ln(1/\delta)}{n^2\epsilon^2} \tag{11}$$

for some constant .
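Algorithm 4 is not reproduced in this text; the following Python sketch shows one plausible form of noisy gradient descent, in which every full-gradient step is perturbed by $\mathcal{N}(0, \sigma^2 I_p)$ noise with $\sigma$ taken from (11). The constant `c=1.0`, the toy objective, the step size, and the function names are assumptions made only for illustration.

```python
import math
import random

def gd_noise_std(G, T, n, eps, delta, c=1.0):
    # sigma from (11): sigma^2 = c G^2 T ln(1/delta) / (n^2 eps^2); c=1.0 is a placeholder.
    return math.sqrt(c * G ** 2 * T * math.log(1.0 / delta)) / (n * eps)

def dp_gd(grad_F, x0, eta, T, sigma, rng):
    # Noisy gradient descent: x <- x - eta * (grad F(x, D) + N(0, sigma^2 I_p)).
    x = list(x0)
    for _ in range(T):
        g = grad_F(x)
        x = [xi - eta * (gi + rng.gauss(0.0, sigma)) for xi, gi in zip(x, g)]
    return x

# Toy objective (hypothetical): F(x) = 0.5 * ||x - target||^2.
target = [1.0, -2.0]
grad_F = lambda x: [xi - ti for xi, ti in zip(x, target)]
sigma = gd_noise_std(G=1.0, T=50, n=10_000, eps=0.5, delta=1e-5)
x_private = dp_gd(grad_F, [0.0, 0.0], eta=0.5, T=50, sigma=sigma, rng=random.Random(0))
x_exact = dp_gd(grad_F, [0.0, 0.0], eta=0.5, T=50, sigma=0.0, rng=random.Random(0))
```

Because $\sigma$ scales like $\sqrt{T}/(n\epsilon)$, the perturbation becomes negligible for large $n$, and the private iterate stays close to the non-private one in this toy run.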

### 6.1 Excess empirical risk for functions under Polyak-Lojasiewicz condition

In this section, we consider the excess empirical risk in the case where the objective function satisfies the Polyak-Lojasiewicz condition. This topic has been studied in [18], [27], [26], [24] and [22].

###### Definition 6.1 ( Polyak-Lojasiewicz condition).

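For a function $F(\cdot)$, denote $\mathcal{X}^* = \arg\min_{x} F(x)$ and $F^* = \min_{x} F(x)$. Then $F$ satisfies the Polyak-Lojasiewicz condition if there exists $\mu > 0$ such that for every $x$,

$$\|\nabla F(x)\|^2 \ge 2\mu\,(F(x) - F^*). \tag{12}$$

(12) guarantees that every critical point (i.e., a point where the gradient vanishes) is a global minimum. [18] shows that if $F$ is differentiable and $L$-smooth w.r.t. the $\ell_2$-norm, then we have the following chain of implications:

Strong Convexity $\Rightarrow$ Essential Strong Convexity $\Rightarrow$ Weak Strong Convexity $\Rightarrow$ Restricted Secant Inequality $\Rightarrow$ Polyak-Lojasiewicz Inequality $\Leftrightarrow$ Error Bound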
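To illustrate that the condition covers non-convex functions, the following Python sketch numerically checks inequality (12) for $F(x) = x^2 + 3\sin^2(x)$, an example discussed in [18] with $\mu = 1/32$. The grid and its range are arbitrary choices, so this is a spot check rather than a proof.

```python
import math

def F(x):
    return x * x + 3.0 * math.sin(x) ** 2

def dF(x):
    # Derivative of F: 2x + 3 sin(2x).
    return 2.0 * x + 3.0 * math.sin(2.0 * x)

mu = 1.0 / 32.0
F_star = 0.0  # minimum value, attained at x = 0

# Check ||grad F(x)||^2 >= 2 * mu * (F(x) - F*) on a grid over [-10, 10].
grid = [i / 100.0 for i in range(-1000, 1001)]
pl_holds = all(dF(x) ** 2 >= 2.0 * mu * (F(x) - F_star) for x in grid)

# The second derivative 2 + 6 cos(2x) is negative at x = pi/2, so F is not convex.
curvature_at_half_pi = 2.0 + 6.0 * math.cos(math.pi)
```

The function has negative curvature in places yet no spurious critical points, which is exactly the situation in which the Polyak-Lojasiewicz condition is useful.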

###### Theorem 6.2.

Suppose that is G-Lipschitz, and L-smooth over , and satisfies the Polyak-Lojasiewicz condition. In DP-GD( Algorithm 4), let be as in (11) with . Then if the following holds

$$\mathbb{E}[F(x_T, D)] - F(x^*, D) \le O\left(\frac{G^2 p\log^2(n)\log(1/\delta)}{n^2\epsilon^2}\right), \tag{13}$$

where hides other terms.

DP-GD achieves a near-optimal bound, since strongly convex functions can be seen as a special case of the class of functions satisfying the Polyak-Lojasiewicz condition. The lower bound for strongly convex functions is given in [6]. Our result differs from it only by a logarithmic multiplicative term, so we achieve a near-optimal bound in this sense.

### 6.2 Tight upper bound for (non)-convex case

In [34], the authors considered (non)-convex smooth loss functions, measured the utility by the gradient norm, and proposed an algorithm with an upper bound on this quantity. By using DP-GD (Algorithm 4), we can remove a term from their bound.

###### Theorem 6.3.

Suppose that is G-Lipschitz, and L-smooth. In DP-GD( Algorithm 4), let be as in (11) with . Then when , we have

$$\mathbb{E}\left[\|\nabla F(x_m, D)\|^2\right] \le O\left(\frac{\sqrt{L}\,G\sqrt{p\log(1/\delta)}}{n\epsilon}\right). \tag{14}$$

###### Remark 6.1.

Although we can obtain the optimal bound by Theorem 3.1 using DP-SGD, there will be a constraint on $\epsilon$. Also, we still do not know the lower bound of the utility under this measure. We leave it as an open problem.

## 7 Discussions

From the discussion in previous sections, we know that when gradient perturbation is combined with linearly convergent first-order methods, a near-optimal bound with lower gradient complexity can be achieved. The remaining issue is whether the optimal bound can be obtained in this way. In Section 6.1, we considered functions satisfying the Polyak-Lojasiewicz condition and achieved a near-optimal bound on the utility. It will be interesting to know the bound for functions satisfying other conditions (such as general gradient-dominated functions [24], or quasi-convex and locally Lipschitz functions [16]) under the differential privacy model. For general non-smooth convex loss functions (such as SVM), we do not know whether the optimal bound is achievable with less time complexity. Finally, for non-convex loss functions, proposing a more easily interpretable measure of utility is another direction for future work.

## References

• [1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM, 2016.
• [2] N. Agarwal and K. Singh. The price of differential privacy for online learning. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 32–40, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
• [3] Z. Allen-Zhu. Katyusha: the first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200–1205. ACM, 2017.
• [4] Z. Allen-Zhu and L. Orecchia. Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent. In Proceedings of the 8th Innovations in Theoretical Computer Science, ITCS ’17, 2017.
• [5] Z. Allen-Zhu and Y. Yuan. Improved SVRG for Non-Strongly-Convex or Sum-of-Non-Convex Objectives. In Proceedings of the 33rd International Conference on Machine Learning, ICML ’16, 2016.
• [6] R. Bassily, A. Smith, and A. Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 464–473. IEEE, 2014.
• [7] M. Bun and T. Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pages 635–658. Springer, 2016.
• [8] K. Chaudhuri and C. Monteleoni. Privacy-preserving logistic regression. In Advances in Neural Information Processing Systems, pages 289–296, 2009.
• [9] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069–1109, 2011.
• [10] K. Chaudhuri, A. Sarwate, and K. Sinha. Near-optimal differentially private principal components. In Advances in Neural Information Processing Systems, pages 989–997, 2012.
• [11] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, volume 3876, pages 265–284. Springer, 2006.
• [12] C. Dwork, G. N. Rothblum, and S. Vadhan. Boosting and differential privacy. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pages 51–60. IEEE, 2010.
• [13] C. Dwork, K. Talwar, A. Thakurta, and L. Zhang. Analyze gauss: optimal bounds for privacy-preserving principal component analysis. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 11–20. ACM, 2014.
• [14] D. Feldman, A. Fiat, H. Kaplan, and K. Nissim. Private coresets. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 361–370. ACM, 2009.
• [15] M. Hardt and A. Roth. Beyond worst-case analysis in private singular vector computation. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 331–340. ACM, 2013.
• [16] E. Hazan, K. Levy, and S. Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex optimization. In Advances in Neural Information Processing Systems, pages 1594–1602, 2015.
• [17] P. Jain, P. Kothari, and A. Thakurta. Differentially private online learning. In COLT, volume 23, pages 24–1, 2012.
• [18] H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.
• [19] S. P. Kasiviswanathan and H. Jin. Efficient private empirical risk minimization for high-dimensional learning. In Proceedings of The 33rd International Conference on Machine Learning, pages 488–497, 2016.
• [20] S. P. Kasiviswanathan, K. Nissim, and H. Jin. Private incremental regression. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 167–182. ACM, 2017.
• [21] D. Kifer, A. Smith, and A. Thakurta. Private convex empirical risk minimization and high-dimensional regression. Journal of Machine Learning Research, 1(41):3–1, 2012.
• [22] G. Li and T. K. Pong. Calculus of the exponent of Kurdyka-Łojasiewicz inequality and its applications to linear convergence of first-order methods. arXiv preprint arXiv:1602.02915, 2016.
• [23] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical programming, 103(1):127–152, 2005.
• [24] Y. Nesterov and B. T. Polyak. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
• [25] A. Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Advances in Neural Information Processing Systems, pages 1574–1582, 2014.
• [26] B. T. Polyak. Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics, 3(4):864–878, 1963.
• [27] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for nonconvex optimization. In International conference on machine learning, pages 314–323, 2016.
• [28] K. Talwar, A. Thakurta, and L. Zhang. Private empirical risk minimization beyond the worst case: The effect of the constraint set geometry. arXiv preprint arXiv:1411.5417, 2014.
• [29] K. Talwar, A. Thakurta, and L. Zhang. Nearly optimal private lasso. In Advances in Neural Information Processing Systems, pages 3025–3033, 2015.
• [30] A. G. Thakurta and A. Smith. (nearly) optimal algorithms for private online learning in full-information and bandit settings. In Advances in Neural Information Processing Systems, pages 2733–2741, 2013.
• [31] Y.-X. Wang, J. Lei, and S. E. Fienberg. Learning with differential privacy: Stability, learnability and the sufficiency and necessity of erm principle. Journal of Machine Learning Research, 17(183):1–40, 2016.
• [32] X. Wu, M. Fredrikson, W. Wu, S. Jha, and J. F. Naughton. Revisiting differentially private regression: Lessons from learning theory and their consequences. arXiv preprint arXiv:1512.06388, 2015.
• [33] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
• [34] J. Zhang, K. Zheng, W. Mou, and L. Wang. Efficient private erm for smooth objectives. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 3922–3928, 2017.

## Appendix A Experiments

In this section, we validate our methods using the Covertype dataset and logistic regression. This dataset contains 581012 samples with 54 features. We use 200000 samples for training. We compare our DP-SVRG algorithm with the DP-GD method in [34] for logistic regression with $\ell_2$-norm regularization,

$$F^r(w, D) = \frac{1}{n}\sum_{i=1}^{n}\log\left(1 + \exp(1 + y_i w^T x_i)\right) + \frac{\lambda}{2}\|w\|^2,$$

where is set to be .

We also compare our DP-SVRG++ algorithm with the DP-GD method in [34] for logistic regression,

$$F^r(w, D) = \frac{1}{n}\sum_{i=1}^{n}\log\left(1 + \exp(1 + y_i w^T x_i)\right).$$

We evaluate the optimality gap and the running time for and .

From the figure, it is clear that our method outperforms the previous results in both cases.

## Appendix B Details and proofs

### B.1 Using the Advanced Composition Theorem to Guarantee (ϵ,δ)-Differential Privacy

As noted, there are constraints on $\epsilon$ in Theorem 4.1 and Theorem 4.3. These constraints come from Theorem 3.1 (see the proof below). For general $\epsilon$, we can instead amplify the noise $\sigma$ by an extra factor. In this case, however, a corresponding factor (neglecting other terms) appears in (5) and (7) in Theorems 4.2 and 4.4, and the guarantee of DP follows from the advanced composition theorem and privacy amplification via sampling [6]. We show this below. Consider the $i$-th query:

$$M_i = \nabla f(x^s_{t-1}, z_{i^s_t}) - \nabla f(\tilde{x}, z_{i^s_t}) + \frac{1}{n}\sum_{j=1}^{n}\nabla f(\tilde{x}, z_j) + \mathcal{N}(0, \sigma^2 I_p),$$

where is the uniform sampling. There are -compositions of these queries. By advanced composition theorem, we know that in order to guarantee the -differential private, we need -differential private in each for some constant . Now consider on the whole dataset (i.e., with no random sample).

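$$\tilde{M}_i = \sum_{j=1}^{n}\nabla f(x^s_{t-1}, z_j) - \sum_{j=1}^{n}\nabla f(\tilde{x}, z_j) + \frac{1}{n}\sum_{j=1}^{n}\nabla f(\tilde{x}, z_j) + \mathcal{N}(0, \sigma^2 I_p).$$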

From the above, we can see that the -sensitive of is . Thus if for some , will be -differential private. This implies that the query will be -differential private, which comes from the following lemma (see Theorem 2.1 and Lemma 2.2 in [6]).

###### Lemma B.1.

If an algorithm is -differentially private, then for any -element dataset , executing on uniformly random entries ensures -differential private.

Let and , that is and

$$\sigma^2 \ge c_2\,\frac{G\,T\log(T/\delta)\log(1/\delta)}{\epsilon^2 n^2}.$$

We can guarantee that composition of queries is -differential private.

### B.2 Proof of Theorems 4.1 and 4.3

###### Proof.

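W.l.o.g., we assume $G = 1$, i.e., $\|\nabla f(x, z)\|_2 \le 1$ (otherwise we can rescale $f$). The proofs of Theorem 4.1 and Theorem 4.3 are the same except for the iteration number (or number of queries). Let the entry in which $D$ and $D'$ differ be the $n$-th one. Now, consider the $i$-th query:

$$M_i = \nabla f(x^s_{t-1}, z_{i^s_t}) - \nabla f(\tilde{x}, z_{i^s_t}) + \frac{1}{n}\sum_{j=1}^{n}\nabla f(\tilde{x}, z_j) + u^s_t, \quad u^s_t \sim \mathcal{N}(0, \sigma^2 I_p),$$

where $i^s_t$ is a uniform sample. This query can be thought of as the composition of two queries:

$$M_{i,1} = \nabla f(x^s_{t-1}, z_{i^s_t}) - \nabla f(\tilde{x}, z_{i^s_t}) + \mathcal{N}(0, \sigma_1^2 I_p) \tag{15}$$

and

$$M_{i,2} = \nabla F(\tilde{x}, D) + \mathcal{N}(0, \sigma_2^2 I_p) = \frac{1}{n}\sum_{j=1}^{n}\nabla f(\tilde{x}, z_j) + \mathcal{N}(0, \sigma_2^2 I_p) \tag{16}$$

for some $\sigma_1, \sigma_2$. By Theorem 2.1 in [1] we have $\alpha_{M_i}(\lambda) \le \alpha_{M_{i,1}}(\lambda) + \alpha_{M_{i,2}}(\lambda)$. Now we bound $\alpha_{M_{i,1}}(\lambda)$ and $\alpha_{M_{i,2}}(\lambda)$.

For $M_{i,1}$, we can use Lemma 3 in [1] directly, where $q = \frac{1}{n}$. For some constant $c_1$ and any positive integer $\lambda \le \sigma_1^2\ln\frac{n}{\sigma_1}$, we have

$$\alpha_{M_{i,1}}(\lambda) \le \frac{c_1\lambda^2}{n^2\sigma_1^2} + O\left(\frac{\lambda^3}{n^3\sigma_1^3}\right). \tag{17}$$

For $M_{i,2}$, we use the relationship between the moments accountant and Rényi divergence. By Definition 2.1 in [7] we have:

$$\alpha_{M_{i,2}}(\lambda) = \lambda D_{\lambda+1}(P\|Q), \tag{18}$$

where $P = \mathcal{N}(\nabla F(\tilde{x}, D), \sigma_2^2 I_p)$ and $Q = \mathcal{N}(\nabla F(\tilde{x}, D'), \sigma_2^2 I_p)$. By Lemma 2.5 in [7], we have for some $c_1$:

$$\lambda D_{\lambda+1}(P\|Q) = \frac{\lambda(\lambda+1)\,\|\nabla F(\tilde{x}, D) - \nabla F(\tilde{x}, D')\|_2^2}{2\sigma_2^2} \le \frac{2\lambda(\lambda+1)}{n^2\sigma_2^2} \le \frac{c_1\lambda^2}{n^2\sigma_2^2}. \tag{19}$$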

Combining (17), (18) and (19</