Differentially Private Accelerated Optimization Algorithms

08/05/2020, by Nurdan Kuru et al. (Erasmus University Rotterdam, Rutgers University, Sabancı University)

We present two classes of differentially private optimization algorithms derived from well-known accelerated first-order methods. The first algorithm is inspired by Polyak's heavy ball method and employs a smoothing approach to decrease the accumulated noise on the gradient steps required for differential privacy. The second class of algorithms is based on Nesterov's accelerated gradient method and its recent multi-stage variant. We propose a noise-dividing mechanism for the iterations of Nesterov's method in order to improve the error behavior of the algorithm. Convergence rate analyses are provided for both the heavy ball and Nesterov's accelerated gradient methods with the help of dynamical system analysis techniques. Finally, we conclude with numerical experiments showing that the presented algorithms have advantages over well-known differentially private algorithms.


1 Introduction

In many real applications involving data analysis, the data owners and the data analyst may be different parties. In such cases, privacy of the data can be a major concern. Differential privacy promises to secure an individual's data while still revealing useful information about a population [11]. It is based on constructing a mechanism whose output stays probabilistically similar whenever a new item is added to, or an existing one is removed from, the dataset. Such mechanisms have been shown to ensure data privacy [12]. Differential privacy is used within various types of methods in machine learning, such as boosting, linear and logistic regression, and support vector machines [15, 9, 36, 44].

In this work, we consider the scenario where a data analyst performs analysis on a dataset owned by another party by solving an optimization problem with (stochastic) first-order methods for empirical risk minimization. There is in fact a large body of work on differentially private empirical risk minimization [10, 23, 5, 45]. We specifically focus on privacy-preserving gradient-based iterative algorithms, which are a popular choice for large-scale problems due to their scalability [1, 38, 43]. Our contributions concern two gradient-based stochastic accelerated algorithms, Polyak's heavy ball (HB) algorithm [33] and Nesterov's accelerated gradient (NAG) algorithm [30], as well as a recent variant of NAG [2].

Differential privacy can be achieved by adding carefully adjusted random noise to the input (the data), as in [19], to the output (some function of the data), as in [9], or to the iteration vectors of an iterative algorithm, as in [1, 32]. In this paper, we focus on the latter case in connection with gradient-based algorithms, where the iteration vectors are revealed at intermediate steps. This scenario is particularly relevant, for example, when the convergence of the algorithm must be assessed publicly, or when the available data are shared among multiple users. Although the intrinsic randomness in a stochastic gradient descent algorithm has been shown to provide some level of privacy in a recent study [22], the authors report high levels of privacy loss for most datasets. That is why most studies in the literature consider adding a suitable noise vector to the gradient at each step. However, this noise harms the performance of the algorithm to such an extent that it may even cause divergence. Therefore, the utility of a privacy-preserving algorithm is always a concern, as it is in our work.

There is a large body of work on improving the utility of gradient-based algorithms while preserving a given privacy 'budget' (a mathematical definition of this budget is given in Section 2). A well-known computational tool is, for example, subsampling, which is analyzed in a broader context in [5]. Norm clipping, that is, bounding the norm of the gradient by a threshold, is also used to control the amount of noise; see, for instance, [1, 32, 39]. Analytical developments are also present: the authors of [1] track higher moments of the privacy loss to obtain tighter estimates on it. Other forms of differential privacy are also employed to conduct tighter analyses of the privacy loss [14, 7, 27, 45].

Contributions: In this paper, we contribute to the existing literature on privacy preserving gradient-based algorithms by proposing, and providing a theoretical analysis of, differentially private versions of HB and NAG.

Our first algorithm is a variant of HB, which employs a smoothing approach with the help of information from previous iterations. We use this mechanism to improve the privacy level by taking a weighted average of the current and the previous noisy gradients. We give a convergence rate analysis using dynamical system analysis techniques for optimization algorithms [25, 21, 16]. Although this kind of analysis exists for the deterministic HB method [21], to the best of our knowledge, the case with noisy gradients has not been considered in the literature, except in [8], where the special case of quadratic objectives is studied for a particular choice of the stepsize and the momentum parameter (corresponding to the traditional choice of parameters in deterministic HB methods). By extending [8, Theorem 12], we give general results in terms of error bounds for any selection of the stepsize and momentum parameters.

The main motivation behind our error analysis is to shed light on how the free parameters of the algorithm, such as the stepsize and the momentum parameters, and the number of iterations affect the performance. In the typical stochastic optimization setting, the noise in the gradients is assumed to have a bounded variance that does not depend on the number of iterations. Therefore, the performance bounds obtained for the accuracy of momentum-based algorithms such as NAG or HB (measured in terms of expected suboptimality of the iterates) with constant parameters can improve monotonically as the number of iterations is increased (see, e.g., [24, 8, 2]). However, this is not necessarily the case for privacy-preserving versions of these algorithms, because each iteration causes some privacy loss and the amount of noise in the gradients has to be increased as the total number of iterations increases. Likewise, it is not clear how to set the stepsize and the momentum parameter for optimum performance of a privacy-preserving version of an algorithm because of the complex trade-off between the convergence rate and the additive error due to noise. We address such issues for the differentially private HB algorithm by providing performance bounds and error rates in terms of the number of iterations and the momentum parameters. We extend existing results from the literature [21, 8] to provide an analysis for general stepsize and momentum parameter choices for the HB method under noisy gradients, both for quadratic objectives and for smooth strongly convex objectives. In particular, tuning the stepsize and the momentum parameters to the desired privacy level allows us to achieve better accuracy in the privacy setting compared to the traditional choice of parameters previously used for the deterministic HB method.

Our second contribution regards differentially private versions of NAG [30]. NAG can be made differentially private by merely adding noise to its gradient calculations. However, how to distribute the privacy-preserving noise over the iterations for optimal performance has not been concretely addressed in the literature. This question can be reformulated as how to distribute a given, fixed privacy budget over the iterations of the algorithm. The relevance of this question is due to the fact that in each iteration a noisy gradient is revealed, causing privacy loss. We address this problem for the differentially private versions of NAG. In doing so, we exploit explicit bounds in [2] on the expected error of those algorithms when they are used with noisy gradients. Our findings show that distributing the privacy budget uniformly over the iterations, which corresponds to using the same variance for the privacy-preserving noise at all iterations, is not optimal in terms of accuracy.

We also consider a differentially private version of a recent variant of NAG, the multi-stage accelerated stochastic gradient (MASG) method, introduced in [2] to improve the error behavior. The method is tailored to deal with noisy gradients in NAG, hence it is quite relevant to our setting, in which noise is used to help preserve privacy. However, the authors did not consider differential privacy while designing their algorithm. Techniques similar to those used for NAG will be used for the error analysis of the differentially private version of MASG. Moreover, our novel scheme of optimally distributing the privacy budget over the iterations can also be applied to MASG in a similar manner.

We would like to mention the techniques for, and the scope of, the analysis of our proposed algorithms. By their nature, the proposed algorithms are stochastic, as the gradient vector is augmented with privacy-preserving noise at each iteration. There exist several studies that analyze the convergence of stochastic accelerated algorithms; for instance, see [26, 20, 35] for works related to stochastic HB, and [42, 41, 28] for a unified analysis of stochastic versions of the GD, NAG, and HB methods. We adopt a dynamical system representation approach that is commonly preferred for analyzing first-order optimization algorithms [25, 16, 21, 3, 2, 29, 28]. In this approach, the convergence rate is determined by the rate of decrease of a Lyapunov function of the state of the dynamical system induced by the algorithm.

Finally, we remark that the given results hold even when the noise that corrupts the gradient is merely uncorrelated with the state of the algorithm, provided that the noise variance can be bounded. The case of uncorrelated noise is evidently more general than the case of independent noise. In our setting, uncorrelatedness of the noise in the gradient is ensured by the noise being zero mean with a bounded variance conditioned on the state of the algorithm. Such characteristics of the gradient noise are quite relevant to differential privacy for two reasons: first, subsampling is a common technique in privacy-preserving algorithms, and the error due to subsampling has zero mean and a variance that typically depends on the current iterate of the algorithm; second, the variance of the privacy-preserving noise is adjusted by a so-called sensitivity function, which may depend on the state of the algorithm.

2 Preliminaries

A vast variety of problems in machine learning can be written as unconstrained optimization problems of the form

\min_{\theta \in \mathbb{R}^d} f(\theta),    (1)

where \theta is a parameter vector of dimension d. This paper concerns a data-oriented optimization problem, where the objective function f depends on a given dataset X = \{x_1, \dots, x_n\}. The objective function in (1) is a sum of functions that correspond to the contributions of the individual data points to the global objective. More specifically, we are interested in objective functions of the form

f(\theta) = \sum_{i=1}^{n} f_i(\theta),    (2)

where each f_i corresponds to the data point x_i, for i = 1, \dots, n. These problems arise in empirical risk minimization in the context of supervised learning [40]. Note that one could write f_i(\theta; x_i) in order to emphasize the dependency of f_i on x_i. However, for the sake of simplicity, we suppress x_i in the notation. In this paper, we further restrict our attention to the set of strongly convex and smooth (that is, with a Lipschitz continuous gradient) functions; see Definition A.1 in Appendix A.

Gradient-based methods are arguably the most popular methods for the optimization problem in (1). We define the gradient vectors \nabla f_i(\theta) for the additive functions f_i, so that the (full) gradient is given by

\nabla f(\theta) = \sum_{i=1}^{n} \nabla f_i(\theta).    (3)

The iterates of the basic gradient descent method for the solution of (1) are given by

\theta_{k+1} = \theta_k - \eta \nabla f(\theta_k),    (4)

where \eta > 0 is the (constant) learning rate. There are two well-known modifications of basic gradient descent: Polyak's heavy ball (HB) method [33] and Nesterov's accelerated gradient (NAG) method [30]. Both introduce a momentum parameter \gamma to improve upon the convergence of gradient descent. The update rule for HB at iteration k is given by

\theta_{k+1} = \theta_k - \eta \nabla f(\theta_k) + \gamma (\theta_k - \theta_{k-1}),    (5)

whereas the update rule for NAG at iteration k is simply

\theta_{k+1} = \theta_k + \gamma (\theta_k - \theta_{k-1}) - \eta \nabla f\!\left(\theta_k + \gamma (\theta_k - \theta_{k-1})\right).    (6)

There exist stochastic versions of these gradient-based methods that are employed when either the gradients are noisy or an exact calculation per iteration is too expensive. In the former case, \nabla f(\theta_k) is simply replaced by the noisy gradient, provided that the noisy gradient is an unbiased estimator of the true gradient. In the latter case, the computationally costly \nabla f(\theta_k) is replaced by a mini-batch estimator

\nabla f_{S}(\theta) = \frac{n}{b} \sum_{i \in S} \nabla f_i(\theta),    (7)

where S \subset \{1, \dots, n\} is a subset with |S| = b, formed by sampling without replacement so that \nabla f_{S}(\theta) is an unbiased estimator of \nabla f(\theta).
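To make these updates concrete, the following sketch implements the plain GD, HB, and NAG steps in (4)-(6) in Python; the function names and the toy quadratic objective are illustrative and not part of the original paper.

```python
import numpy as np

def gd_step(theta, grad, eta):
    # Basic gradient descent update (4)
    return theta - eta * grad(theta)

def hb_step(theta, theta_prev, grad, eta, gamma):
    # Polyak heavy-ball update (5): gradient step plus a momentum term
    return theta - eta * grad(theta) + gamma * (theta - theta_prev)

def nag_step(theta, theta_prev, grad, eta, gamma):
    # Nesterov accelerated gradient update (6): gradient taken at the extrapolated point
    y = theta + gamma * (theta - theta_prev)
    return y - eta * grad(y)

# Illustrative usage on the toy quadratic f(theta) = 0.5 * ||theta||^2
grad = lambda th: th
theta_prev = theta = np.array([1.0, -2.0])
for _ in range(50):
    theta, theta_prev = nag_step(theta, theta_prev, grad, eta=0.1, gamma=0.5), theta
print(theta)  # approaches the minimizer at the origin
```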

In our subsequent discussion, we will modify the steps of the gradient-based methods to obtain privacy-preserving updates. Our setting is as follows. The data holder makes public the iterates \theta_1, \dots, \theta_K for a total of K iterations. The algorithm is known with all its parameters (\eta and \gamma). If the data holder applies the related update of the method directly, the revealed iterates are deterministic functions of the data, which violates privacy. Therefore, due to privacy concerns, the iterates have to be randomized by using a noisy gradient.

Differential privacy quantifies the privacy level that one guarantees by such randomizations. A randomized algorithm A takes an input dataset X and returns the random output A(X). Such an algorithm can be associated with a function that maps a dataset X to a probability distribution P_X, such that the output A(X) is random with A(X) distributed according to P_X. For datasets X and X', let d(X, X') denote the Hamming distance between X and X'; this distance indicates the number of differing elements between the two datasets. A differentially private algorithm ensures that the distributions of A(X) and A(X') are “not much different” whenever d(X, X') = 1. This statement is formally expressed by [12] (see Definition A.2 in Appendix A).
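For reference, the \epsilon-differential privacy condition of [12] (Definition A.2 in Appendix A) can be stated in its standard form as follows, in the notation of the paragraph above.

```latex
% \epsilon-differential privacy: for every measurable output set S and every
% pair of datasets X, X' with Hamming distance d(X, X') = 1,
\Pr\!\left[\mathcal{A}(X) \in S\right] \;\le\; e^{\epsilon}\, \Pr\!\left[\mathcal{A}(X') \in S\right].
```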

Most existing differentially private methods perturb certain functions of the data with suitably chosen random noise. The amount of this noise is related to the sensitivity of the function, which is the maximum amount of change in the function when a single entity of the data is changed (see Definition A.3 in Appendix A). Many results in the literature provide differential privacy for iterative algorithms. Among those results, we will mainly use three, concerning the Laplace mechanism, composition, and subsampling. For ease of reference, the corresponding three theorems are given in Appendix A.
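As a reminder, the first of these results (the Laplace mechanism, Theorem A.1) is standardly stated as follows: adding Laplace noise calibrated to the L1-sensitivity of a function yields \epsilon-differential privacy.

```latex
% L1-sensitivity of a data-dependent function g, and the Laplace mechanism:
\Delta g = \max_{d(X, X') = 1} \bigl\| g(X) - g(X') \bigr\|_{1},
\qquad
\mathcal{A}(X) = g(X) + w, \quad w_j \overset{\mathrm{i.i.d.}}{\sim} \mathrm{Lap}\!\left(\Delta g / \epsilon\right),
% which releases g(X) in an \epsilon-differentially private way.
```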

When privacy is of concern for the optimization problem (1), one approach is to update the parameter at iteration k using a noisy (stochastic) gradient vector

\nabla f_{S_k}(\theta_k) + w_k = \frac{n}{b} \sum_{i \in S_k} \nabla f_i(\theta_k) + w_k,    (8)

where S_k contains the indices of the full (or sampled) data with size b, and w_k is a vector of independent noise terms having a Laplace distribution whose parameter is chosen suitably to provide the desired level of privacy. Although the privacy of an algorithm can be guaranteed in this way, its performance will be affected by the noise added at each iteration. In this paper, we analyze the trade-offs between accuracy and privacy in gradient-based algorithms, and propose accelerated algorithms with good performance under differential privacy noise.
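A minimal sketch of the noisy mini-batch gradient in (8); the helper name, the per-summand gradient callback, and the Laplace scale argument sigma are illustrative assumptions, with sigma to be calibrated as discussed in Section 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_minibatch_gradient(theta, grad_i, n, b, sigma):
    """Mini-batch gradient of f = sum_i f_i, perturbed with Laplace noise as in (8).

    grad_i(theta, i) is assumed to return the gradient of the i-th summand f_i;
    sigma is the Laplace scale added per coordinate for privacy.
    """
    S = rng.choice(n, size=b, replace=False)            # subsample without replacement
    g = (n / b) * sum(grad_i(theta, i) for i in S)      # unbiased estimate of the full gradient
    w = rng.laplace(loc=0.0, scale=sigma, size=theta.shape)
    return g + w
```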

3 Differentially Private Heavy Ball Algorithm

We start by investigating a differentially private version of the stochastic HB algorithm, which we will abbreviate as DP-SHB. The update rule of this algorithm operates on a dataset of size n with steps

\theta_{k+1} = \theta_k - \eta \left( \nabla f_{S_k}(\theta_k) + w_k \right) + \gamma \left( \theta_k - \theta_{k-1} \right),    (9)

where \gamma is the momentum parameter of HB, the S_k's are i.i.d. random subsamples of size b sampled without replacement, and the w_k's are independent random vectors having i.i.d. components drawn from a zero-mean Laplace distribution whose parameter \sigma_k is chosen to provide the desired level of privacy. Here, differential privacy of (9) is sought through the noisy gradient \nabla f_{S_k}(\theta_k) + w_k. The minimum value of the Laplace parameter required for \epsilon-differential privacy depends on the number of iterations K, the subsample size b, and the sensitivity at \theta_k, where the sensitivity function is defined as

s(\theta) = \max_{x, x'} \left\| \nabla f(\theta; x) - \nabla f(\theta; x') \right\|_{1},    (10)

with f(\theta; x) denoting the contribution of a single data point x to the objective. Observing (7), we see that changing one data item corresponds to the existence of a single pair of differing summands \nabla f(\theta; x) and \nabla f(\theta; x'). Hence, the change in \nabla f_{S}(\theta) is at most (n/b)\, s(\theta).

Consider the DP-SHB algorithm, where at iteration k we draw a subsample of size b from a dataset of size n, and add Laplace noise with parameter \sigma_k to the mini-batch estimator in (8). Then, using the result regarding the Laplace mechanism in Theorem A.1 and the privacy amplification result stated in Theorem A.3, the privacy leak at iteration k can be bounded in terms of the function

\phi(\epsilon; b, n) = \log\!\left( 1 + \frac{b}{n} \left( e^{\epsilon} - 1 \right) \right).    (11)

Note that, for b = n, i.e., under no subsampling, we end up with \phi(\epsilon; n, n) = \epsilon. The following proposition uses this fact and states the required amount of noise in order to have an \epsilon-differentially private algorithm after K iterations.

Proposition 3.1.

The DP-SHB algorithm in (9) leads to an \epsilon-differentially private algorithm if the parameter of the Laplace distribution for each component of the noise vector at iteration k is chosen as

(12)

where \theta_k is the output value at iteration k, n is the number of data points,

(13)

b is the subsample size, and K is the maximum number of iterations.

Proof.

Using the \sigma_k given in the proposition, the privacy loss in one iteration is \epsilon/K. Finally, we apply Theorem A.2 to conclude that the privacy loss after K iterations is \epsilon. ∎
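As an illustration of this kind of calibration, the sketch below splits the total budget evenly over the K iterations and inverts the amplification function (11) to obtain a Laplace scale; the helper name and the use of a single L1 sensitivity constant are assumptions for the example and not the paper's exact expression (12).

```python
import numpy as np

def laplace_scale_per_iteration(eps_total, K, n, b, sensitivity):
    """Per-iteration Laplace scale for an eps_total-DP run of K iterations.

    Assumes basic composition (each iteration gets eps_total / K of the budget)
    and the subsampling amplification phi(eps; b, n) = log(1 + (b/n)(e^eps - 1)).
    `sensitivity` is an assumed L1 bound on how much the released quantity can
    change when a single data item changes.
    """
    eps_iter = eps_total / K                                  # budget per released iterate
    # Invert phi to find the budget of the base (un-subsampled) Laplace mechanism.
    eps_base = np.log(1.0 + (n / b) * np.expm1(eps_iter))
    return sensitivity / eps_base                             # Laplace mechanism scale

# Example: total budget 1.0 over 50 iterations, subsampling 100 out of 1000 points.
print(laplace_scale_per_iteration(eps_total=1.0, K=50, n=1000, b=100, sensitivity=0.5))
```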

We are interested in DP-SHB because it lends itself to an interpretation quite relevant to the differential privacy setting. The noise used in differentially private versions of the gradient descent algorithm has to be higher as the number of iterations grows, i.e., the Laplace parameter needs to be larger for a larger K. This can be seen from equation (12). One way to reduce the required noise is to use a smoothed noisy gradient, where the smoothing is recursively performed on the past and the current gradient estimates. This is indeed how DP-SHB works. The update in (9) can be rewritten as

\theta_{k+1} = \theta_k - \eta\, g_k,    (14)

where g_k is a geometrically weighted average of all the gradients up to the current iteration, defined recursively as

g_k = \gamma\, g_{k-1} + \nabla f_{S_k}(\theta_k) + w_k,    (15)

with the initial condition g_{-1} = 0. We note that a similar smoothing strategy to that in DP-SHB, which combines mini-batching with a noise-adding mechanism for averaged gradients, has been used in [31], albeit in a different setting, namely for the purpose of private variational Bayesian inference.
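A minimal sketch of DP-SHB written in the smoothed-gradient form (14)-(15); the callback noisy_grad_fn stands for a privacy-preserving gradient such as the noisy mini-batch sketch above, and is an assumed interface rather than the paper's implementation.

```python
import numpy as np

def dp_shb(theta0, noisy_grad_fn, eta, gamma, K):
    """DP-SHB written in the smoothed-gradient form (14)-(15).

    noisy_grad_fn(theta) should return a privacy-preserving (Laplace-perturbed)
    mini-batch gradient, e.g. the noisy_minibatch_gradient sketch above.
    The vector g accumulates a geometrically weighted sum of past noisy gradients,
    so the noise injected at each step is damped by gamma over later iterations.
    """
    theta = np.array(theta0, dtype=float)
    g = np.zeros_like(theta)                    # initial condition g_{-1} = 0
    for _ in range(K):
        g = gamma * g + noisy_grad_fn(theta)    # recursion (15)
        theta = theta - eta * g                 # update (14)
    return theta
```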

3.1 Analysis of DP-SHB

For analyzing the convergence of DP-SHB, we first cast it as a dynamical system. We introduce the (random) variable

which accounts for the error due to subsampling. Using this definition, we can write

(16)

Then, the dynamical system representation of DP-SHB becomes

(17)

where is the identity matrix, denotes the Kronecker product, and the state vector and the system matrices , , and are given as

(18)

In our error analysis, we will consider both stochastic and deterministic versions of HB. In order to do that, we need a uniform bound (in ) for the conditional covariance of given (for the case without subsampling, we simply take ). Note that, due to independence of and conditional on , the conditional covariance of given satisfies

To handle the contribution to the overall noise by the privacy preserving noise , we make the following assumption.

Assumption 3.2 (Bounded sensitivity).

The sensitivity function defined in (10) is bounded in . That is, there exists a scalar constant such that

(19)

Assumption 3.2 is common in the differential privacy literature. For example, the logistic regression model, which we use in our numerical experiments in Section 5, easily admits such a bound. It turns out that Assumption 3.2 readily guarantees a bound on the variance of the subsampling noise. The next proposition formally shows this observation. The proof is given in Appendix B.1.

Proposition 3.3.

If Assumption 3.2 holds, the norm of the conditional covariance of is bounded for all uniformly in as

(20)

where is given in (12) and is an upper bound on the norm of the covariance of the error due to subsampling given by

(21)

Note that depends on the total number of iterations through , hence the subscript.

Before going into the detailed technical analysis, we find it useful to provide a sketch of it. Our purpose is to find an upper bound for the expected sub-optimality E[f(\theta_K)] - f(\theta^*), where \theta^* is the optimal solution of (1) and f(\theta^*) is the minimum value of f. The upper bound we will prove is of the form

\rho^K C + E_K

for some rate \rho \in (0,1), a non-negative C that is related to the initial point \theta_0, and a non-negative E_K. As we will show soon, this bound has interesting aspects in the DP setting. Note that, as an issue unique to the differential privacy context, the term E_K increases with the total number of iterations K. This is because, for a fixed privacy level \epsilon, the amount of noise required per iteration by (12) grows with K. Hence, increasing the number of iterations makes the first term \rho^K C smaller, but it leads to an increase in the second term E_K. This makes the analysis of DP-SHB fundamentally different from the analysis of standard SHB in the stochastic optimization literature (see, e.g., [8, 20, 18]), where the second term is scaled with a fixed noise variance parameter that does not change with the number of iterations.

For analysis purposes, we define such that for , we have . Also, for a symmetric positive-definite matrix and a positive scalar , we set the Lyapunov function

with . The following proposition, which is constructed in a similar vein as Proposition 4.6 in [3], allows us to obtain expected sub-optimality bounds depending on the parameters and as well as the noise level and a convergence rate . A proof is given in Appendix B.2.

Proposition 3.4.

Given , consider running DP-SHB algorithm with constant parameters and for iterations and with in Proposition 3.1 so that -differential privacy is satisfied. Suppose that Assumption 3.2 holds and there exists , a positive semi-definite symmetric matrix , and constants such that

(22)

where

the matrices are as in (18), and

Then, for all , we obtain

(23)

where is defined in (20), denotes the determinant of and we have the convention for the last factor.

As distinct from the approach in [3], which is developed for Nesterov's accelerated gradient method, the bound in (23) is constructed by adapting the results for the deterministic HB method [21] to the stochastic setting. We also note that the matrix inequality (22) can be solved numerically in practice by a simple grid search over the rate and the entries of the Lyapunov matrix (see, e.g., [21, 25, 8]). Therefore, the right-hand side of (23), which provides the performance bounds, can be computed numerically in practice.

3.2 Analysis of quadratic objective function case

In this section, we will present explicit bounds for a quadratic objective function in order to provide more insight into the interplay between \eta, \gamma, and the number of iterations K. We consider the following quadratic function

f(\theta) = \frac{1}{2}\, \theta^{\top} Q\, \theta + p^{\top} \theta + r,    (24)

where Q is symmetric positive definite, p is a column vector, and r is a scalar. For such a strongly convex quadratic objective function, an exact bound for the objective error can be presented.

To put this in a differential privacy context, we can assume that the parameters of f depend on some data X. For example, f is a sum of functions that are quadratic in \theta (hence f itself is quadratic in \theta), and the coefficients of the quadratic expression for each summand depend on the corresponding data point. We will assume that the sensitivity of f is such that the required DP noise satisfies \sigma_K^2 = c\,K for some constant c > 0. For simplicity, we assume that no subsampling is performed, i.e., b = n.

The optimal values for HB in the non-noisy setting have been given in [34] as \eta = 4/(\sqrt{L}+\sqrt{\mu})^2 and \gamma = \left((\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)\right)^2, where \kappa = L/\mu. However, those “optimal” values may not be the best selection of \eta and \gamma for DP-SHB. There are two reasons for this. First, due to privacy concerns, noise is inevitable in DP-SHB, and the presence of noise appears as a second additive term in the error bound; this second term is affected by the selection of (\eta, \gamma). Second, the amount of privacy-preserving noise increases with the total number of iterations. In general, the error bound is a sum of two terms: the first decreases with the convergence rate of the algorithm, and the second is due to the privacy-preserving noise. It will be shown that \eta and \gamma influence both the convergence rate and the multiplicative constant of the additive error due to noise. We will additionally see that a selection of the pair (\eta, \gamma) that improves the rate also increases the additive error term due to the presence of privacy-preserving noise. Therefore, there is a trade-off between the convergence rate and the additive noise term in our performance bounds, which is adjusted by the parameters \eta and \gamma. In that respect, the “optimal” \eta and \gamma of the non-noisy setting are typically not the best choice in the DP setting.

By adapting [8, Theorem 12], which is given for particular parameter choices, we present our result on the error bound for an arbitrary pair (\eta, \gamma). A proof is given in Appendix B.3.

Theorem 3.5.

Let be a quadratic function given in (24). Consider the iterates of the DP-SHB method, which is run for iterations with noisy gradients where and for some positive constant . If DP-SHB is run with parameters , then

(25)

where

with

’s being the eigenvalues of

. In (25), we have

(26)

where

and is given by

with being a sequence of scalar coefficients, provided that .

Note that in Theorem 3.5 we considered the case of uncorrelated noise with bounded variance, which generalizes the independent noise setting. To the best of our knowledge, such a result has not been shown before in the literature.

Numerical demonstration:

Here, we illustrate the effect of the algorithm parameters on the error bound given in Theorem 3.5. The dimension of the objective function is fixed, and Q is chosen as a diagonal matrix so that its eigenvalues appear directly on its diagonal.

For simplicity of the presentation, we fix the constant in front of K in the noise variance (\sigma_K^2 is a constant multiple of K, and the constant would not change the qualitative behavior of the plots, only shift the graphs by a constant factor on the logarithmic scale). With fixed stepsizes, the convergence rate in (26) versus the momentum parameter \gamma is plotted in Figure 1 for several values of \eta. As for the noise variance, we considered \sigma_K^2 proportional to K to represent a noise variance that increases with the total number of iterations. We repeated our experiments for two different noise levels, a smaller one (representing a less noisy, hence less private scenario) and a larger one (representing a more noisy, hence more private scenario). We observe that the \gamma value that is “optimal” in terms of the convergence rate (indicated in the bottom row of Figure 1) shows reliable performance.

Figure 1: DP-SHB performance for the quadratic objective function case

4 Differentially Private Accelerated Algorithms

In this section, we will investigate NAG in a differential privacy setting, and propose two ways to tailor it for improved performance under differential privacy.

In the following discussion, we will assume that Assumption 3.2 on the existence of an upper bound on the sensitivity holds, as in the previous section. Furthermore, we will assume that this upper bound is used while determining the parameter of the privacy-preserving Laplace noise, so that the parameter is independent of the current state. Using a state-independent sensitivity to determine the Laplace parameter is not uncommon, especially when it is hard to identify the sensitivity for every state. An example of this case can be found in Section 5, in particular the sensitivity bound in (36) for the logistic regression model, which is independent of the state \theta.

Recall the NAG update in (6). A straightforward differentially private version of NAG would be obtained by perturbing the gradient with the privacy-preserving noise, just as in the DP-SHB algorithm. The corresponding change in NAG would be

\theta_{k+1} = \theta_k + \gamma (\theta_k - \theta_{k-1}) - \eta \left( \nabla f_{S_k}\!\left(\theta_k + \gamma (\theta_k - \theta_{k-1})\right) + w_k \right),    (27)

where w_k has i.i.d. zero-mean Laplace components for k = 0, \dots, K-1, with the Laplace parameter chosen according to (13). The resulting algorithm will be referred to as DP-NAG.

In DP-NAG, the stepsize (hence the momentum parameter) and the DP noise parameter are taken to be constant. This raises the question of whether the performance of DP-NAG could be improved if we let these quantities depend on k, the iteration number. We propose two methods to improve the performance of DP-NAG while preserving the same level of privacy. The first method makes the DP variance parameter dependent on the iteration number, whereas the second varies the stepsize (hence the momentum parameter) across iterations.

4.1 NAG with optimized DP variance

We first present an error bound for NAG that uses noisy gradients. Let . The following theorem is adapted from [2, Theorem 2.3].

Theorem 4.1.

Let and suppose that Assumption 3.2 holds. Consider a stochastic version of the NAG algorithm that runs with a stepsize and the momentum parameter and uses noisy gradients for as in (8) with a subsampling size and for all . Then, for any , we have

(28)

Note that in (28), the relevant term is an upper bound on the norm of the covariance of the gradient estimator, and it simplifies when b = n, i.e., without subsampling. By starting the recursion in (28) at the last iteration and recursing backward until the first one, we end up with

It will prove useful later to express the error of NAG generically as

(29)

The in (29) can be identified as

In the DP framework, we have control over the noise parameters \sigma_k, subject to a constraint due to our privacy budget \epsilon. Suppose that we are committed to running the algorithm for a total of K iterations. When \sigma_k is used at iteration k, a certain privacy leak is incurred. Given a desired privacy level \epsilon, the sum of the per-iteration leaks must not exceed \epsilon, by Theorem A.2. Therefore, one question is, with K and \epsilon fixed, how we should arrange the \sigma_k so that the bound in (29) is optimized. Factoring our privacy budget into the picture, we have the following constrained optimization problem.

(30)

For general b, the constrained optimization problem is analytically intractable and needs a numerical solution. This is due to the non-linearity of the subsampling amplification. However, for the special case of b = n (no subsampling), the constraint in (30) simplifies, allowing for the following tractable result. (A proof is given in Appendix B.4.)

Proposition 4.2.

When b = n, the optimization problem in (30) is solved by

(31)

We can also express the solution (31) in terms of the privacy leak at each iteration.

Since the resulting \sigma_k is decreasing in k, the solution (31) suggests that the variance should start high and then be decreased. This means that the privacy budget should be distributed unevenly over the iterations: a larger part of the privacy budget should be spent on later rather than on early iterations.
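The closed form in (31) is not reproduced here, but its qualitative behavior can be illustrated with a small sketch. As a simplifying assumption (not the paper's exact bound), suppose the noise contribution to (29) is a geometrically discounted sum \sum_k \rho^{K-k} \sigma_k^2 and the per-iteration leak of the Laplace mechanism is proportional to 1/\sigma_k; a Lagrangian argument then gives \sigma_k \propto \rho^{-(K-k)/3}, which indeed starts high and decreases with k.

```python
import numpy as np

def sigma_schedule(K, rho, eps_total, sensitivity):
    """Decreasing noise schedule from the Lagrangian argument sketched above.

    Assumes the error bound penalizes the noise as sum_k rho**(K - k) * sigma_k**2
    and that iteration k leaks eps_k = sensitivity / sigma_k (Laplace, b = n),
    under the budget constraint sum_k eps_k = eps_total.
    """
    k = np.arange(1, K + 1)
    shape = rho ** (-(K - k) / 3.0)                     # sigma_k proportional to this
    scale = np.sum(sensitivity / shape) / eps_total     # rescale to meet the budget exactly
    return scale * shape

sigmas = sigma_schedule(K=20, rho=0.9, eps_total=1.0, sensitivity=0.5)
print(sigmas[:3], sigmas[-3:])   # starts high and decreases toward the final iteration
```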

Remark 4.1.

The solution in (31) for yields the optimum bound

(32)

which could further be optimized with respect to the number of iterations, provided that one has an accurate guess of the initial error. We note that increasing the number of iterations may degrade performance in the DP context, since the required noise per iteration increases, unlike in the deterministic setting, where performance may improve monotonically as the number of iterations grows.

Remark 4.2.

Although the result in Proposition 4.2 is valid only without subsampling, it can be used as a guide for arranging the \sigma_k's even under subsampling. Note that when the subsampling ratio and the per-iteration privacy leak are small, the amplified leak is approximately proportional to the un-amplified one, owing to the approximation e^{\epsilon} - 1 \approx \epsilon for small \epsilon.

4.2 Multi-stage NAG

An alternative way to improve the performance of NAG is to let the stepsize vary across iterations. In fact, the MASG algorithm of [2] was proposed with that motivation. The authors prove that MASG achieves the optimal rate in both its deterministic and stochastic versions.

In this paper, we present DP-MASG, a differentially private version of the MASG algorithm introduced in [2]. In order to study and improve the error behavior of the algorithm, we present an explicit bound on the objective error that accommodates an iteration-dependent noise variance parameter. We demonstrate that the approach of distributing the noise over the iterations can be applied to MASG as well.

The original MASG algorithm is a multistage accelerated algorithm that uses Nesterov's accelerated gradient method with a noisy full gradient. The total number of iterations is divided into stages with given stage lengths, and for each stage a different stepsize is used. For the optimal convergence rate, the stage lengths and the corresponding stepsizes are recommended in [2] as

(33)

where .

The MASG algorithm can easily be modified to be differentially private by adding Laplace noise to the gradient as in (20). We will refer to the resulting algorithm as DP-MASG. The selections in (33) for the stage lengths and the stepsizes were designed for a constant noise variance per iteration. In the following, we instead propose a new version that uses a variable noise variance parameter at iteration k, which can improve performance. The main idea is to rely on Proposition 4.2 to optimize over the \sigma_k's under the privacy budget constraint.

In order to study how the privacy noise can be optimally distributed over the iterations of DP-MASG, we provide an explicit bound that not only accommodates an iteration-dependent noise variance, but is also of the same form as (29), so that the noise variances can be optimized to minimize the bound. For MASG, the stepsize changes across stages; therefore, the recursion in (28) cannot be applied uniformly to all iterations. Instead, by Lemma 3.3 of [2], a factor of two appears when the algorithm transitions from one stage to the next. This leads to the following theorem.

Theorem 4.3.

Let . Consider the DP-MASG algorithm with stage lengths and step-sizes during those stages given as in (33), and with noisy gradients , where for . Then,

(34)

where is the stage that contains iteration , provided that for all .

Observing that the bound in (34) is of the same form as (29), the noise parameters can be optimized as in (30), but with the coefficients indicated by (34) as

Once again, the optimal \sigma_k's when b = n can be written in closed form as in (31). To show the effect of the algorithm parameters on the noise variance, we plot the optimum values in Figure 2 for a representative choice of the parameters, including the constant factor in front of the stepsize.

Figure 2: Optimal values for the multi-stage NAG algorithm.

5 Experimental results

Our experiments concern a regularized logistic regression problem. (The results are produced with the code at https://github.com/sibirbil/DPAccGradMethods.) The model has observations (a_i, y_i), i = 1, \dots, n, where a_i is a vector of covariates and y_i is a binary response whose conditional probability given a_i depends on a parameter vector \theta as follows:

Since the probability distribution of the a_i's does not depend on \theta, the (regularized) maximum likelihood problem is defined as determining

(35)

where . One can verify that for all , upon observing that, for all , and , we have

(36)

For the experiments to follow, we use synthetic data, and the regularization parameter is fixed. The feasible set is taken as the set of all real-valued vectors with norm less than or equal to a given radius; hence, Assumption 3.2 holds for this example. The strong convexity constant is set through the regularization term, and the smoothness constant L is estimated from the largest singular value of A, where A is the matrix whose i-th column is a_i.
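A minimal sketch of a regularized logistic regression objective of the form (35) and its per-sample gradient; the labels in {0, 1}, the averaging of the data term, and the regularization weight lam are assumptions for illustration, since the exact constants in (35)-(36) are not reproduced here.

```python
import numpy as np

def logistic_objective(theta, A, y, lam):
    """Regularized logistic regression loss (one possible form of (35)).

    A holds one covariate vector a_i per column, y holds binary labels in {0, 1},
    and lam is the regularization weight that makes the problem strongly convex.
    """
    z = A.T @ theta                                  # linear scores, one per observation
    loss = np.mean(np.log1p(np.exp(z)) - y * z)      # average negative log-likelihood
    return loss + 0.5 * lam * np.dot(theta, theta)

def per_sample_gradient(theta, a_i, y_i, lam):
    """Gradient of a single summand; its norm stays bounded when a_i and theta
    are bounded, which is what makes Assumption 3.2 hold in this example."""
    p = 1.0 / (1.0 + np.exp(-np.dot(a_i, theta)))    # predicted probability of y_i = 1
    return (p - y_i) * a_i + lam * theta
```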

In our experiments, we compared six differentially private algorithms. The first four, DP-GD, DP-NAG, DP-MASG, and DP-HB, are the straightforward differentially private versions of GD, NAG, MASG, and HB, respectively. The last two algorithms in the comparison are named DP-NAG-opt and DP-MASG-opt, which are the variants of DP-NAG and DP-MASG for which the privacy-preserving noise is distributed over the iterations according to Proposition 4.2.

The algorithms are compared across different values of b, K, and a stepsize scale, where b is the subsampling size, K is the number of iterations, and the scale determines the stepsize. For DP-MASG and DP-MASG-opt, the general stepsize formulation in (33), presented for the original versions, is preserved; however, the stepsizes are scaled by the same factor. We tried all combinations of these values and fixed the privacy level throughout the experiments.

For DP-NAG-opt and DP-MASG-opt, we also adjusted the given value of K as follows: with an initial guess of the initial error, we computed the bound in (32) for each candidate K, and we set the number of iterations to the value that gives the minimum bound. This procedure was detailed in Remark 4.1.

Figures 3 and 4 show, respectively for b = n (no subsampling) and for the subsampled case, the performances of the algorithms for the tried parameter values. Each subfigure shows the log-difference between the objective function evaluated at the current iterate and the objective function evaluated at the optimum solution. The optimum solution was found with a non-private NAG algorithm run for 1000 iterations and without subsampling. The plotted values are averages over 20 independent runs for each parameter combination. Trace plots of the iterates for the different values of K are plotted together in different colors. Note that in some cases the plots overlap, rendering some colors invisible.

Figure 3: Errors for various values of K and the stepsize scale, without subsampling; the top and bottom rows correspond to two different stepsize values.
Figure 4: Errors for various values of K and the stepsize scale, with subsampling; the top and bottom rows correspond to two different stepsize values.

Comparing DP-GD against the accelerated algorithms, we observe that the accelerated algorithms DP-HB, DP-NAG, and DP-NAG-opt outperform DP-GD. Furthermore, among the accelerated algorithms, the best results are obtained with DP-NAG-opt and DP-MASG-opt. The advantage of acceleration is more striking for the smaller stepsize, which represents a too-small choice. While DP-GD is dramatically slow with a too-small stepsize, the accelerated algorithms DP-HB, DP-NAG, and DP-NAG-opt seem to suffer less from that ill choice. With the larger stepsize, DP-GD recovers from slow convergence; however, the accelerated algorithms are still able to beat it. Our observations hold in both subsampling settings. The multistage algorithm DP-MASG is also prone to a small stepsize, but it recovers dramatically when the stepsize is chosen as recommended in the original work [2].

In all instances, we can see the advantage of the accelerated algorithms in the speed of convergence. However, when we compare the error levels that the algorithms reach for the same K, we see that DP-GD sometimes performs better than DP-NAG or DP-HB. See, for example, the lower half of Figure 3 (red line): while DP-GD converged more slowly than DP-HB and DP-NAG, it reached a smaller error level. However, if we conduct an overall comparison between DP-GD and DP-HB in terms of their best performances among all choices of K, we see that the best of DP-HB outperforms the best of DP-GD. This observation is repeated in our experiments and is suggestive of a general recommendation: the accelerated algorithms can promise faster convergence when used with a small number of iterations.

This general recommendation about the selection of K is further supported by the traces belonging to DP-NAG-opt and DP-MASG-opt, where K is re-adjusted according to (32). We can see from the subplots belonging to DP-NAG-opt and DP-MASG-opt that the re-adjustment prefers small K, and this selection indeed improves the performance. This further justifies the use of the optimized algorithms DP-NAG-opt and DP-MASG-opt, where the distribution of the privacy-preserving variance as well as the number of iterations are chosen automatically.

We also compare the NAG-based schemes and their multi-stage versions. When the stepsize is chosen properly, DP-NAG-opt and DP-MASG-opt perform very similarly and outperform the others. However, DP-NAG-opt seems more robust to a poor selection of the stepsize.

Finally, we compare the two sampling regimes, one corresponding to subsampling and the other to no subsampling. Firstly, we can see that, even when we subsample, optimizing the \sigma_k's and K according to Proposition 4.2 improves the performance of DP-NAG and DP-MASG significantly. (Recall that Proposition 4.2 holds under no subsampling, yet its use under subsampling was discussed in Remark 4.2.) Secondly, a comparison of Figures 3 and 4 as a whole shows that using the full data improves the performance of the accelerated algorithms, especially in the setting shown in the lower halves of the figures. However, the difference does not appear to be an order of magnitude. Since the additional randomness introduced by subsampling helps to decrease the required noise level for DP, and since using a sample instead of the full data at each iteration is faster in terms of per-iteration running time, many DP methods in the literature consider stochastic algorithms, which can improve the running time compared to deterministic algorithms. However, if the running time is not a concern for reaching a given privacy level, our experiments show that using the full data results in a smaller bound on the objective error.

6 Conclusions

In this paper, we presented two classes of differentially private optimization algorithms based on momentum averaging, derived from the heavy ball method and Nesterov's accelerated gradient method. We provided performance bounds for our algorithms for a given iteration budget while preserving a desired privacy level, depending on the choice of the parameters (the stepsize and the momentum). We showed that, for NAG, a homogeneous distribution of the privacy budget over all iterations, as typically done in the literature so far, is not the best strategy, and we proposed a method to improve it. Numerical experiments showed that the presented algorithms have advantages over their well-known straightforward versions.

Our analysis and methodology can be adapted to other forms of privacy to a certain extent. For this, the existence of a tractable formula for the noise parameter that satisfies a certain level of privacy is the key requirement. For example, a weaker form of differential privacy can be satisfied if the normal distribution is used for the privacy-preserving noise, and the required noise variance is well known [13]. Furthermore, provided there is no subsampling, the privacy loss can be optimally distributed over the iterations of DP-NAG using a closed-form formula as in Proposition 4.2, by exploiting the relation between the zero-concentrated differential privacy of [7] and differential privacy.

Our theoretical work formally investigates, for DP-HB, DP-NAG, and the multi-stage version of DP-NAG, the effect of the algorithm parameters on the error bound. For DP-NAG and its multi-stage version, we also provide explicit formulas for how the variance of the gradient noise should be tuned at each stage to preserve a given level of privacy, provided the total number of iterations is chosen. However, in our setup, tuning these parameters requires knowledge of the constants L and \mu. The Lipschitz constant L can often be estimated from data using line search techniques (see, e.g., [37, Alg. 2] or [6]). The strong convexity constant \mu may also be known in some cases; for instance, if a regularization term is added to a convex empirical risk minimization problem of the form (2), the strong convexity constant can be taken as the regularization weight. However, in general, \mu may not be known and may need to be estimated from data. As part of future work, it would be interesting to investigate whether restarting techniques developed for accelerated deterministic algorithms, such as [17], which do not require knowledge of the strong convexity constant a priori, can be adapted to the privacy setting.

Appendix A Definitions and Known Results

Definition A.1 (Strongly convex and smooth functions).

A continuously differentiable function is called strongly convex with modulus and L-smooth with a Lipschitz constant