Private Stochastic Convex Optimization: Optimal Rates in Linear Time

05/10/2020 ∙ by Vitaly Feldman, et al. ∙ Tel Aviv University 38

We study differentially private (DP) algorithms for stochastic convex optimization: the problem of minimizing the population loss given i.i.d. samples from a distribution over convex loss functions. A recent work of Bassily et al. (2019) has established the optimal bound on the excess population loss achievable given n samples. Unfortunately, their algorithm achieving this bound is relatively inefficient: it requires O(min{n^3/2, n^5/2/d}) gradient computations, where d is the dimension of the optimization problem. We describe two new techniques for deriving DP convex optimization algorithms both achieving the optimal bound on excess loss and using O(min{n, n^2/d}) gradient computations. In particular, the algorithms match the running time of the optimal non-private algorithms. The first approach relies on the use of variable batch sizes and is analyzed using the privacy amplification by iteration technique of Feldman et al. (2018). The second approach is based on a general reduction to the problem of localizing an approximately optimal solution with differential privacy. Such localization, in turn, can be achieved using existing (non-private) uniformly stable optimization algorithms. As in the earlier work, our algorithms require a mild smoothness assumption. We also give a linear-time algorithm achieving the optimal bound on the excess loss for the strongly convex case, as well as a faster algorithm for the non-smooth case.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Stochastic convex optimization (SCO) is the problem of minimizing the expected loss (also referred to as population loss) for convex loss functions of over some -dimensional convex body given access to i.i.d. samples from the data distribution . The performance of an algorithm for the problem is measured by bounding the excess (population) loss of a solution , that is the value

. This problem is central to numerous applications in machine learning and arises for example in least squares/logistic regression, or minimizing a convex surrogate loss for a classification problem. It also serves as the basis for the development of continuous optimization algorithms in the non-convex setting. In this work we study this problem with the constraint of differential privacy with respect to the set of samples

[DMNS06].

Placing a differential privacy constraint usually comes at a cost in terms of utility. In this case, it is measured by the excess population loss of the solution, for a given number of samples

. Additionally, runtime efficiency of an optimization method is crucial for modern applications on large high-dimensional datasets, and this is the primary reason for the popularity of stochastic gradient descent-based methods. This motivates the problem of understanding the trade-offs between computational efficiency, and excess population loss in the presence of privacy constraints.

Differentially private convex optimization is one of most well-studied problems in private data analysis [CM08, CMS11, JKT12, KST12, ST13, SCS13, DJW13, Ull15, JT14, BST14, TTZ15, STU17, WLK17, WYX17, INS19]. However, most of the prior work focuses on the easier problem of minimizing the empirical loss (referred to as empirical risk minimization (ERM)) for which tight upper and lower bounds on the excess loss are known in a variety of settings. Upper bounds for the differentially private ERM can be translated to upper bounds on the population loss by appealing to uniform convergence of empirical loss to population loss, namely an upper bound on . However, in general,111At the same time, uniform convergence suffices to derive optimal bounds on the excess population loss in a number of special cases, such as regression for generalized linear models. this approach leads to suboptimal bounds: it is known that there exist distributions over loss functions over for which the best bound on uniform convergence is [Fel16]. As a result, in the high-dimensional settings often considered in modern ML (when ), bounds based on uniform convergence are and do not lead to meaningful bounds on population loss.

The first work to address the population loss for SCO with differential privacy (DP-SCO) is [BST14] who give a bound of order [BST14, Sec. F].222For clarity, in the introduction we focus on the dependence on , and for -DP, suppressing the dependence on and on parameters of the loss function such as Lipschitz constant and the diameter of . For the most relevant case where and , this results in a bound of on excess population loss. More recent work of Bassily et al. [BFTT19] demonstrates the existence of an efficient algorithm that achieves a bound of , which is also shown to be tight. Notably, this bound is comparable to the non-private SCO bound of as long as . Their algorithm is based on solving the ERM via noisy stochastic gradient descent (SGD) [BST14] but requires relatively large batch sizes for the privacy analysis. As a result, their algorithm uses gradient computations. This is substantially less efficient than the optimal non-private algorithms for the problem which require only gradient evaluations. They also give a near-linear-time algorithm under an additional strong assumption that the Hessian of each loss function is rank-1 over the entire domain.

Along the other axis, several of the aforementioned works on private ERM [WLK17, WYX17, INS19] are geared towards finding computationally efficient algorithms for the problem, often at the cost of worse utility bounds.

We describe two new techniques for deriving linear-time algorithms that achieve the (asymptotically) optimal bounds on the excess population loss. Thus our results show that for the problem of Stochastic Convex Optimization, under mild assumptions, a privacy constraint come for free. For , there is no overhead in terms of either excess loss or the computational efficiency. When , the excess loss provably increases, but the optimal bounds can still be achieved without any computational overhead. Unlike the earlier algorithm [BFTT19] that solves the ERM and relies on uniform stability of the algorithm to ensure generalization, our algorithms directly optimize the population loss.

Formally, our algorithms satisfy the following bounds:

Theorem 1.1.

Let be a convex set of diameter and be a family of convex -Lipschitz and -smooth functions over . For every , there exists an algorithm that given a starting point , and returns a point . For all , uses evaluations of the gradient of and satisfies -RDP as long as , where is a universal constant. Further, if consists of samples drawn i.i.d. from a distribution over , then

where, for all , , and the expectation is taken over the random choice of and randomness of .

Our guarantees are stated in terms of Rényi differential privacy (RDP) [Mir17] for all orders and can also be equivalently stated as 0-mean -concentrated differential privacy (or -zCDP) [BS16]. Standard properties of RDP/zCDP imply that our algorithms satisfy -DP for all as long as . Thus for -DP our bound is

matching the tight bound in [BFTT19]. We now overview the key ideas and tools used in these techniques.

Snowball-SGD:

Our first algorithm relies on a one-pass noisy SGD with gradually growing batch sizes. Namely, at step out of the batch size is proportional to . We refer to SGD with such schedule of batch size as Snowball-SGD. The analysis of this algorithm relies on two tools. The first one is privacy amplification by iteration [FMTT18]. This privacy amplification technique ensures that for the purposes of analyzing the privacy guarantees of a point used at step one can effectively treat all the noise added at subsequent steps as also added to the gradient of the loss at . A direct application of this technique to noisy SGD results in different privacy guarantees for different points [FMTT18] and, as a result, the points used in the last steps will not have sufficient privacy guarantees. However, we show that by increasing the batch size in those steps we can achieve the optimal privacy guarantees for all the points.

A limitation of relying on this analysis technique is that the privacy guarantees apply only to the algorithm that outputs the last iterate of SGD. In contrast, the optimization guarantees usually apply to the average of all the iterates (see Section 6 for an example in which the privacy guarantees for the average iterate are much worse than those for the last iterate). Thus the second tool we rely on is the recent work of Jain et al. [JNN19] showing that, for an appropriate choice of step sizes in SGD, the last iterate has the (asymptotically) optimal excess population loss. (Without the special step sizes the last iterate has excess loss larger by a factor [SZ13, HLPR19].) See Section 3 for additional details of this approach.

No Privacy Amplification for the Average Iterate:

It is natural to ask if the last iterate analysis is really needed or if the average iterate itself be proven to have good privacy properties. In Section 6, we address this question and show that in general the average iterate can be very non-private even when the noise is sufficient to give strong privacy guarantees for the last iterate.

Localization:

Our second approach is based on an (implicit) reduction to an easier problem of localizing an approximate minimizer of the population loss. Specifically, the reduction is to a differentially private algorithm that given a point that is within distance from the minimizer of the loss, finds a point that is within distance from a point that approximately minimizes the loss. By iteratively using such a localizing algorithm with appropriately chosen parameters, a sufficiently good solution will be found after a logarithmic number of applications of the algorithm. Each application operates on its own subset of the dataset and thus this reduction preserves the privacy guarantees of the localizing algorithm.

A simple way to implement a localization algorithm is to start with non-private SCO algorithm whose output has optimal sensitivity. Namely, solutions produced by the algorithm on any two datasets that differ in one point are at distance on the order of (this property is also referred to as uniform stability in the parameter space). Given such an algorithm one can simply add Gaussian noise to the output. This is a standard approach to differentially private optimization referred to as output perturbation [CMS11, WLK17]. However, for the purposes of localization, we only need to be within of the solution output by the algorithm and so we can add much more noise than in the standard applications, thereby getting substantially better privacy guarantees.

We note that in order to ensure that the addition of Gaussian noise localizes the solution with probability at least

we would need to increase the noise variance by an additional

factor making the resulting rate suboptimal by a logarithmic factor. Thus, instead we rely on the fact that for algorithms based on SGD the bound on excess loss can be stated in terms of the second moment of the distance to the optimum.

We can now plug in existing uniformly stable algorithms for SCO. Specifically, it is known that under mild smoothness assumptions, one-pass SGD finds a solution that both achieves optimal bounds on the excess population loss and stability [HRS15, FV19]. This leads to the second algorithm satisfying the guarantees in Theorem 1.1. See Section 4 for additional details of this approach.

Non-smooth case:

Both of our algorithms require essentially the same and relatively mild smoothness assumption: namely that the smoothness parameter is at most (ignoring the scaling with and for simplicity focusing on the case when and ). Bassily et al. [BFTT19] show that optimal rates are still achievable even without this smoothness assumption. Their algorithm for the problem relies on using the prox operator instead of gradient steps which is known to be equivalent to gradient steps on the loss function smoothed via the Moreau-Yosida envelope. Unfortunately, computing the prox step with sufficient accuracy requires many gradient computations and very high accuracy is needed due to potential error accumulation. As a result, implementing the algorithm in [BFTT19] requires gradient computations.

Our reduction based technique gives an alternative and simpler way to deal with the non-smooth case. One can simply plug in a uniformly stable algorithms for SCO in the non-smooth case from [SSSSS10]. This algorithm relies on solving ERM with an added strongly convex term. In this case the analysis of the accuracy to which the ERM needs to be solved is straightforward. However achieving such accuracy with high probability requires gradient computations thus giving an algorithm for the non-smooth version of our problem. Improving this running time is a natural avenue for future work. We remark that finding a faster uniformly stable (non-private) SCO for the non-smooth case is an interesting problem in itself.

Strongly convex case:

When the loss functions are strongly convex, the optimal (non-private) excess population loss is of the order of rather than . The excess loss due to privacy is known to be . The best known upper bounds for this problem due to [BST14] are . We show a nearly linear time algorithm that has excess loss matching the known lower bounds. As in the convex case, when , privacy has virtually no additional cost in terms of utility or efficiency. We describe several approaches that achieve these bounds (up to, possibly, a logarithmic overhead). The first approach is based on a folklore reduction to the convex case which can then be used with any of our algorithms for the (non-strongly-convex) convex case. We also give two direct algorithms that rely on a new analysis of SGD with fixed step-size in the strongly convex case. The first algorithm uses iterative localization approach and the second one relies on privacy amplification by iteration.

2 Preliminaries

2.1 Convex Loss Minimization

Let be the domain of data sets, and be a distribution over . Let be a dataset drawn i.i.d. from . Let be a convex set denoting the space of all models. Let be a loss function, which is convex in its first parameter (the second parameter is a data point and dependence on this parameter can be arbitrary). The excess population loss of solution is defined as

In order to argue differential privacy we place certain assumptions on the loss function. To that end, we need the following two definitions of Lipschitz continuity and smoothness.

Definition 2.1 (-Lipschitz continuity).

A function is -Lipschitz continuous over the domain if the following holds for all : .

Definition 2.2 (-smoothness).

A function is -smooth over the domain if for all , .

2.2 Probability Measures

In this work, we will primarily be interested in the -dimensional Euclidean space endowed with the metric and the Lebesgue measure. We say a distribution is absolutely continuous with respect to if whenever for all measurable sets . We will denote this by .

Given two distributions and on a Banach space , one can define several notions of distance between them. The primary notion of distance we consider is Rényi divergence:

Definition 2.3 (Rényi Divergence [Rén61]).

Let and be measures with . The Rényi divergence of order between and is defined as

Here we follow the convention that . If , we define the Rényi divergence to be . Rényi divergence of orders is defined by continuity.

2.3 (Rényi ) Differential Privacy

The notion of differential privacy is by now a de facto standard for statistical data privacy [DMNS06, Dwo06, DR14].

Definition 2.4 ([Dmns06, Dkm06]).

A randomized algorithm is -differentially private (-DP) if, for all datasets and that differ in a single data element and for all events in the output space of , we have

Starting with Concentrated Differential Privacy [DR16]

, definitions that allow more fine-grained control of the privacy loss random variable have proven useful. The notions of zCDP 

[BS16], Moments Accountant [ACG16], and Rényi differential privacy (RDP) [Mir17] capture versions of this definition. This approach improves on traditional -DP accounting in numerous settings, often leading to significantly tighter privacy bounds as well as being applicable when the traditional approach fails [PAE17, PSM18].

Definition 2.5 ([Mir17]).

For and , a randomized algorithm is -Rényi differentially private, or -RDP if for all neighboring data sets and we have

The following two lemmas allow translating Rényi differential privacy to -differential privacy, and give a composition rule for RDP.

Lemma 2.6 ([Mir17, Bs16]).

If satisfies -Rényi differential privacy, then for all it also satisfies -DP. In particular, if satisfies -RDP for every then for all it also satisfies -DP.

The standard composition rule for Rényi differential privacy, when the outputs of all algorithms are revealed, takes the following form.

Lemma 2.7 ([Mir17]).

If are randomized algorithms satisfying, respectively, -RDP,…,-RDP, then their composition defined as is -RDP. Moreover, the ’th algorithm can be chosen on the basis of the outputs of .

2.4 Contractive Noisy Iteration

We start by recalling the definition of a contraction.

Definition 2.8 (Contraction).

For a Banach space , a function is said to be contractive if it is 1-Lipschitz. Namely, for all ,

A canonical example of a contraction is projection onto a convex set in the Euclidean space.

Proposition 2.9.

Let be a convex set in . Consider the projection operator:

The map is a contraction.

Another example of a contraction, which will be important in our work, is a gradient descent step for a smooth convex function. The following is a standard result in convex optimization [Nes04].

Proposition 2.10.

Suppose that a function is convex and -smooth. Then the function defined as:

is contractive as long as .

We will be interested in a class of iterative stochastic processes where we alternate between adding noise and applying some contractive map.

Definition 2.11 (Contractive Noisy Iteration (Cni)).

Given an initial random state , a sequence of contractive functions , and a sequence of noise distributions , we define the Contractive Noisy Iteration (CNI) by the following update rule:

where is drawn independently from . For brevity, we will denote the random variable output by this process after steps as .

As usual, we denote by the convolution of and , that is the distribution of the sum where we draw and independently.

Definition 2.12.

For a noise distribution over a Banach space we measure the magnitude of noise by considering the function that for , measures the largest Rényi divergence of order between and the same distribution

shifted by a vector of length at most

:

We denote the standard Gaussian distribution over

with variance by . By the well-known properties of Gaussians, for any , and , . This implies that in the Euclidean space, .

When and are sampled from and respectively, we will often abuse notation and write .

2.5 Privacy Amplification by Iteration

The main result in [FMTT18] states that

Theorem 2.13.

Let and denote the output of and . Let . Let be a sequence of reals and let . If for all , then

We now give a simple corollary of this general theorem for the case when the iterative processes differ in a single index and, in addition, the noise distribution with parameter ensures that Rényi divergence for a shift of scales as . As discussed above, this is exactly the case for Gaussian distribution.

Corollary 2.14.

Let and denote the output of and . Let . Assume that there exists such that for all , . For assume that there exists such that for every and , and , for some . Then

Proof.

We use Theorem 2.13 with for and . The resulting bound we get

3 DP SCO via Privacy Amplification by Iteration

We start by describing a general version of noisy SGD and analyze its privacy using the privacy amplification by iteration technique from [FMTT18]. Recall that in our problem we are given a family of convex loss functions over some convex set parameterized by , that is is convex and differentiable in the first parameter for every . Given a dataset , starting point , a number of steps , batch size parameters such that are positive integers and , step sizes , and noise scales the algorithm works as follows. Starting from perform the following update and , where is the average of loss functions for samples in batch , that is

is a freshly drawn sample from ; and , denotes the Euclidean projection to set . We refer to this algorithm as PNSGD and describe it formally in Algorithm 1. For a value we denote the fixed sequence of parameters of length by .

0:  Data set , a convex function in the first parameter, starting point , batch sizes , step sizes , noise parameters .
1:  for  do
2:     , where .
3:     , where is the -projection on .
4:  return  the final iterate .
Algorithm 1 Projected noisy stochastic gradient descent (PNSGD)

3.1 Privacy Guarantees for Noisy SGD

As in [FMTT18], the key property that allows us to treat noisy gradient descent as a contractive noisy iteration is the fact that for any convex function, a gradient step is contractive as long as the function satisfies a relatively mild smoothness condition (see Proposition 2.10). In addition, as is well known, for any convex set , the (Euclidean) projection to is contractive (see Proposition 2.9). Naturally, a composition of two contractive maps is a contractive map and therefore we can conclude that PNSGD is an instance of contractive noisy iteration. More formally, consider the sequence . In this sequence, is obtained from by first applying a contractive map that consists of projection to followed by the gradient step at and then addition of Gaussian noise of scale . Note that the final output of the algorithm is but it does not affect our analysis of privacy guarantees as it can be seen as an additional post-processing step.

More formally, for this algorithm we prove the following privacy guarantees.

Theorem 3.1.

Let be a convex set and be a family of convex -Lipschitz and -smooth functions over . Then, for every batch-size sequence , step-size sequence such that for all , noise parameters , , starting point , and , PNSGD satisfies -RDP, where

Proof.

For , let and be two arbitrary datasets that differ at index and let be the index of the batch in which -th example is used by PNSGD with batch-size sequence . Note that each is an average of -smooth, -Lipschitz convex functions and thus is itself -smooth, -Lipschitz and convex over . Thus, as discussed above, under the condition , the steps of PNSGD are a contractive noisy iteration. Specifically, on the dataset , the CNI is defined by the initial point , sequence of functions and sequence of noise distributions . Similarly, on the dataset , the CNI is defined in the same way with the exception of , where includes loss function for instead of . Namely, .

By our assumption, is -Lipschitz for every and and therefore

We can now apply Corollary 2.14 with . Note that and thus we obtain that

Maximizing this expression over all indices gives the claim. ∎

The important property of this analysis is that it allows for batch size to be used to improve the privacy guarantees. The specific batch size choice depends on the step sizes and noise rates. Next we describe the setting of these parameters that ensures convergence at the optimal rate.

3.2 Utility Guarantees for the Last Iterate of SGD

In order to analyze the performance of the noisy projected gradient descent algorithm we will use the convergence guarantees for the last iterate of SGD given in [SZ13, JNN19]. For the purpose of these results we let be an arbitrary convex function over for which we are given an unbiased stochastic (sub-)gradient oracle . That is for every , . Let PSGD denote the execution of the following process: starting from point , use the update for . Shamir and Zhang [SZ13] prove that the suboptimality of the last iterate of SGD with the step size being proportional to scales as . This variant of SGD relies on relatively large step sizes in the early iterates which would translate into a relatively strong assumption on smoothness in Theorem 3.1. However, it is known [Har19] that the analysis in [SZ13] also applies to the fixed step size scaling as (in fact, it is simpler and gives a slightly better constants in this case).

Theorem 3.2 ([Sz13]).

Let be a convex body of diameter , let be an arbitrary convex function over and let be an unbiased stochastic (sub-)gradient oracle for . Assume that for every , . For and , let denote the iterates produced by PSGD. Then

where and the expectation is taken over the randomness of .

Further, Jain et al. [JNN19] show that the factor can be eliminated by using faster decaying rates. Their step-size schedule is defined as follows.

Definition 3.3.

For an integer , let . For , let and let . For a constant , every and , we define . We denote the resulting sequence of step sizes by .

Jain et al. [JNN19] prove that the following guarantees hold for SGD with step sizes given by .

Theorem 3.4 ([Jnn19]).

Let be a convex body of diameter , let be an arbitrary convex function over and let be an unbiased stochastic (sub-)gradient oracle for . Assume that for every , . For and , let denote the iterates produced by PSGD. Then

where and the expectation is taken over the randomness of .

We remark that the results in [JNN19] are stated for an oracle that gives (sub)-gradients bounded by almost surely. This condition is necessary for the high-probability version of their result but a bound on the variance of suffices to upper bound . In addition, while the results are stated for a fixed gradient oracle, the same results hold when a different stochastic gradient oracle is used in step as long as all the oracles satisfy the assumptions (namely, and for all ).

3.3 Snowball-SGD

Finally we derive the privacy and utility guarantees for noisy SGD by calculating the batch sizes needed to ensure the privacy guarantees for the settings in Theorems 3.2 and 3.4. The sum of batch sizes in turn gives us the number of samples necessary to implement steps of these algorithms. The resulting batch sizes will be proportional to and we refer to such batch size schedule as Snowball-SGD.

Theorem 3.5.

Let be a convex set of diameter and be a family of convex -Lipschitz and -smooth functions over . For , , and all let , , , , If then for all , starting point , and , PNSGD satisfies -RDP. Further, if consists of samples drawn i.i.d. from a distribution , then and

where, for all , , and the expectation is taken over the random choice of and noise added by PNSGD.

Proof.

We first establish the privacy guarantees. By Theorem 3.1, all we need is to verify that for our choice of and we have for every ,

This implies that

where we used the fact that .

To establish the utility guarantees, we first note that for all ,

Thus for sampled i.i.d. from and index in batch , . In particular, for , and therefore each gives an independent sample from a stochastic gradient oracle for . Our setting of the noise scale ensures that for every

This implies that for our choice of parameters PNSGD can be seen as an execution PSGD with stochastic gradient oracles with variance upper-bounded by . Plugging this value in Theorem 3.2 gives our bound on the utility of the algorithm. To obtain the bound in terms of we note that , implies that and thus

Next, we give a differentially private version of the step-size schedule from [JNN19].

Theorem 3.6.

Let be a convex set of diameter and be a family of convex -Lipschitz and -smooth functions over . For , , and all let , , , , If then for all , starting point , and , PNSGD satisfies -RDP. Further, if consists of samples drawn i.i.d. from a distribution , then and

where, for all , , and the expectation is taken over the random choice of and noise added by PNSGD.

Proof.

The utility guarantees for this algorithm follow from the same argument as in the proof of Theorem 3.5 together with Theorem 3.4. As before, by Theorem 3.1, all we need to establish the privacy guarantees is to verify that for our choice of and we have for every ,

(1)

We first observe that for we have that

(2)

For , let be such that . Then we note that for

and therefore