On the Differentially Private Nature of Perturbed Gradient Descent

01/18/2021 ∙ by Thulasi Tholeti, et al. ∙ Indian Institute of Technology, Madras

We consider the problem of empirical risk minimization given a database, using the gradient descent algorithm. We note that the function to be optimized may be non-convex and may contain saddle points that impede the convergence of the algorithm. A perturbed gradient descent algorithm is typically employed to escape these saddle points. We show that this algorithm, which perturbs the gradient, inherently preserves the privacy of the data. We then employ the differential privacy framework to quantify the privacy thus achieved. We also analyse how the privacy changes with parameters such as the problem dimension and the distance between the databases.


I Introduction

Given the abundant amount of data openly available about various aspects of an individual, privacy has become one of the major concerns while handling data. Differential privacy is a privacy guarantee on preserving the privacy of an individual when a statistical database is publicly released [1]. When a differentially private mechanism is applied to a pair of databases that differ by a single record, an external agent should ideally not be able to identify the presence or absence of that record. Differential privacy quantifies the extent to which this guarantee is preserved. Differential privacy guarantees are now being provided in various problems such as online learning [2], Empirical Risk Minimization (ERM) [3, 4, 5], boosting [6], matrix factorization [7], etc. Typically, these mechanisms achieve privacy by adding noise or perturbation at the input, output or an intermediate step in the mechanism [8]. In this work, we quantify the privacy of ERM where the stochastic gradient descent updates are perturbed.

There has been renewed interest in the convergence of first-order iterative optimization methods for non-convex functions. It has been observed in [9] that in many non-convex problems such as tensor decomposition, dictionary learning, matrix retrieval, etc., the presence of saddle points impedes the convergence of the stochastic gradient descent (SGD) algorithm. It has also been suggested in [10] that local minima can be as good as global minima in high-dimensional neural networks, but that saddle points cause a bottleneck in convergence. Therefore, variants of the gradient descent algorithm have been proposed to accelerate the convergence of SGD in the presence of saddles. In [11], the authors suggest adding noise from the surface of the unit sphere to the gradient so as to escape saddle points. In [9], noise from a unit ball is added when the magnitude of the gradient is below a certain threshold. A modified version of [9] is also proposed in [12], where gradient descent and SGD are interleaved. In all the above works, we note the following pattern: a sample from an isotropic noise distribution is added to either the gradient or the iterate such that the resulting perturbation helps in escaping the saddle point.

As accelerating stochastic gradient descent in the presence of saddle points also involves perturbation, we hypothesize that it should also inherently provide some privacy guarantees. In this work, we quantify the privacy guarantees obtained by employing an algorithm that perturbs the gradient with the aim of escaping saddle points. Using the pattern observed in prior work on escaping saddle points, we provide a generic format of the Perturbed gradient descent (PrGD) algorithm. Our major contribution is identifying and quantifying the privacy provided by the PrGD algorithm. We also quantify the privacy obtained by adding noise from a $d$-dimensional ball which, to the best of our knowledge, has not been done before. In the forthcoming sections, we discuss some basic definitions from both the optimization and the differential privacy literature, describe the problem setting, and provide a generic algorithm to escape saddle points. We then provide the privacy guarantees of the algorithm and discuss the results.

II Definitions and background

Differential privacy was introduced to formally provide privacy guarantees. We initially define some terms regarding differential privacy from the seminal work [1]:

Definition 1 (Neighbouring databases).

Two databases $D$ and $D'$ are said to be neighbouring databases if they differ by a single entry. The maximum distance between neighbouring databases is denoted by $\Delta$:

$$\Delta = \max_{D, D' \ \text{neighbouring}} \| D - D' \|. \qquad (1)$$
Definition 2 ($(\epsilon, \delta)$-private mechanism).

A randomized mechanism $\mathcal{M}$ with range $\mathcal{R}$ is said to preserve $(\epsilon, \delta)$-privacy if, for all pairs of neighbouring databases $D$ and $D'$ and for any $S \subseteq \mathcal{R}$,

$$\Pr[\mathcal{M}(D) \in S] \le e^{\epsilon}\, \Pr[\mathcal{M}(D') \in S] + \delta. \qquad (2)$$

Note that when $\epsilon = 0$, we get $\delta$-privacy, which is formally defined as follows.

Definition 3 ($\delta$-private mechanism).

A randomized mechanism $\mathcal{M}$ with range $\mathcal{R}$ is said to preserve $\delta$-privacy if, for all pairs of neighbouring databases $D$ and $D'$ and for any $S \subseteq \mathcal{R}$,

$$\Pr[\mathcal{M}(D) \in S] \le \Pr[\mathcal{M}(D') \in S] + \delta. \qquad (3)$$

The privacy measure $\delta$ is also known as the total variation distance between the query outputs for neighbouring databases. Note that, for a given $\delta$, $\delta$-privacy is a stronger guarantee than $(\epsilon, \delta)$-privacy. We also note that a smaller value of $\delta$ implies greater privacy.
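To make the role of $\delta$ concrete, the following minimal sketch (purely illustrative; the two output distributions are hypothetical) computes the total variation distance between the output distributions of a mechanism on two neighbouring databases, which is the smallest $\delta$ for which a $(0, \delta)$ guarantee holds.

import numpy as np

# Hypothetical output distributions of a mechanism M over a discrete range,
# evaluated on two neighbouring databases D and D'.
p = np.array([0.50, 0.30, 0.20])   # Pr[M(D) = r]
q = np.array([0.45, 0.35, 0.20])   # Pr[M(D') = r]

# The smallest delta such that Pr[M(D) in S] <= Pr[M(D') in S] + delta for every S
# is the total variation distance between p and q.
delta = 0.5 * np.abs(p - q).sum()
print(delta)   # 0.05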

We now define the necessary terms in the optimization framework.

Definition 4 (Stationary points).

For a twice-differentiable function $f$, we say that $x$ is a first-order stationary point if $\nabla f(x) = 0$, and a second-order stationary point if $\nabla f(x) = 0$ and $\nabla^2 f(x) \succeq 0$.

The iterative gradient methods such as gradient descent, stochastic gradient descent, RMSProp, Adam, etc. are first-order methods that guarantee convergence to a first-order stationary point. For a convex objective, convergence to a first-order stationary point guarantees convergence to the minimum. However, for a non-convex function, a first-order stationary point may be a maximum, a minimum or a saddle point. It has been observed that the presence of saddle points greatly impedes the convergence of gradient descent [10].

Definition 5 (Strict saddle).

We say $x$ is a saddle point if it is a first-order stationary point but not a local minimum. Moreover, we say that a saddle point $x$ is strict if $\lambda_{\min}\!\left(\nabla^2 f(x)\right) < 0$.

A strict saddle implies that there is a direction of functional decrease and hence, there is a chance that the perturbation may help the iterate escape the saddle point.
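As an illustration (a toy example of ours, not taken from the paper), consider $f(x, y) = x^2 - y^2$: the origin is a first-order stationary point and the smallest Hessian eigenvalue there is $-2 < 0$, so the origin is a strict saddle. The sketch below checks this numerically.

import numpy as np

# Toy non-convex function f(x, y) = x^2 - y^2.
def grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

def hessian(p):
    return np.array([[2.0, 0.0], [0.0, -2.0]])

origin = np.zeros(2)
print(grad(origin))                                   # [0, 0]: first-order stationary
print(np.linalg.eigvalsh(hessian(origin)).min())      # -2 < 0: strict saddle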

In the forthcoming sections, we discuss the problem setting where we consider a non-convex objective function (this implies that there may be saddle points in play) and a typical perturbed gradient descent algorithm to minimize it. We then quantify the privacy guarantees provided by this algorithm.

III Problem setting

We consider the empirical risk minimization problem given a database. We denote the database as a set of $n$ points $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is the input and $y_i$ the corresponding output for $i = 1, \ldots, n$. We aim to learn the function $f$ from the given data; let the function be parameterized by $\theta \in \mathbb{R}^d$, which is tuned using an iterative optimization algorithm. The minimization objective can be written as a function of the inputs $x_i$'s and the parameters $\theta$. The problem is to minimize the empirical loss, which is given by

$$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f_{\theta}(x_i), y_i\bigr). \qquad (4)$$

Here, the loss function $\ell\bigl(f_{\theta}(x_i), y_i\bigr)$ is the loss incurred for predicting $f_{\theta}(x_i)$ when the given output is $y_i$. We do not assume convexity of the objective to be minimized. As the objective function may be non-convex, saddle points may arise [13], which significantly slow down the convergence of the optimizer. We make the following assumptions about the objective:

  1. The loss $\ell$ is bounded by $B$, is $\beta$-smooth and has a $\rho$-Lipschitz Hessian.

  2. All the saddle points are strict.

These assumptions are needed to prove the convergence of the perturbed gradient descent algorithm to a local minimum; they do not affect the privacy guarantees provided. In the process, the privacy of the database also needs to be preserved. Privacy is said to be compromised if an adversary, given a pair of neighbouring databases, can identify whether an individual entry belongs to the given database based on the output of the algorithm. In our problem, we consider a pair of databases with a maximum distance of $\Delta$ between their gradients. In the subsequent section, we analyze the privacy guarantees of the iterative optimization algorithm employed to minimize the objective in Eq. 4.
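For concreteness, a hypothetical instance of Eq. 4 (an illustrative example, not one used in the paper) is the squared-error loss $\ell(f_\theta(x), y) = (f_\theta(x) - y)^2$, for which
$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \bigl(f_\theta(x_i) - y_i\bigr)^2,$$
and this objective is generally non-convex in $\theta$ whenever $f_\theta$ is a nonlinear parameterization such as a neural network.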

IV Algorithm and guarantees

In this section, we list a perturbed version of the stochastic gradient descent algorithm that is employed to minimize a non-convex objective. Perturbation has previously been added to the gradient to escape saddle points in [11, 9, 14]; the perturbation added to the gradient is a sample from a unit ball. Also, from the work done in [8, 15, 16], we note that privacy can be preserved by adding noise or perturbation to the gradient values. Previous works on differential privacy such as [1] deal with the addition of Gaussian or Laplacian noise to the data and compute the resulting privacy. Even in the context of gradient descent, [16] deals with the addition of Gaussian noise to the gradient. However, to escape a saddle point, isotropic noise (especially noise sampled from a unit ball) is advocated in [9]. Therefore, in this work, we show that both privacy and faster convergence can be achieved in the presence of saddle points when a perturbation sampled from a unit ball is added to the gradient. A version of the perturbed gradient descent algorithm is presented below:

1:Input: Initial parameters $\theta_0$, step size $\eta$
2:for t = 1, 2, …, T do
3:     Choose a data point $(x_i, y_i)$ uniformly at random from the $n$ available data points
4:     Sample $\xi_t$ from the volume of a unit ball
5:     $\theta_t \leftarrow \theta_{t-1} - \eta \left( \nabla_{\theta}\, \ell\bigl(f_{\theta_{t-1}}(x_i), y_i\bigr) + \xi_t \right)$
6:end for
Algorithm 1 Perturbed gradient descent (PrGD)

Note that this is the typical format of any perturbed gradient descent algorithm [11, 15] where the distribution of the added noise (usually isotropic) varies. Our major contribution is to show that the PrGD algorithm also inherently provides privacy guarantees. We derive the privacy guarantees provided by Algorithm 1 below.
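For illustration, a minimal sketch of Algorithm 1 in Python is given below; the helper names, the toy squared-error loss and the step size are our own assumptions and not part of the paper.

import numpy as np

def sample_unit_ball(d, rng):
    # Sample uniformly from the volume of the d-dimensional unit ball:
    # uniform direction on the sphere times a radius with density proportional to r^(d-1).
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    return v * rng.uniform() ** (1.0 / d)

def prgd(grad_fn, theta0, data, eta=0.05, T=1000, seed=0):
    # Perturbed gradient descent: SGD where unit-ball noise is added to every gradient.
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    n, d = len(data), theta.size
    for _ in range(T):
        x_i, y_i = data[rng.integers(n)]                    # step 3: sample a data point
        xi = sample_unit_ball(d, rng)                       # step 4: unit-ball perturbation
        theta -= eta * (grad_fn(theta, x_i, y_i) + xi)      # step 5: perturbed update
    return theta

# Toy usage with the (hypothetical) loss l(theta; x, y) = (theta.x - y)^2 / 2.
rng = np.random.default_rng(1)
true_theta = np.array([1.0, -2.0])
data = [(x, float(x @ true_theta)) for x in rng.standard_normal((200, 2))]
grad_fn = lambda th, x, y: (th @ x - y) * x
print(prgd(grad_fn, theta0=np.zeros(2), data=data))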

IV-A Privacy guarantees

Let $g_t(x_i)$ denote the gradient for input $x_i$ at iteration $t$; for simplicity, we drop the subscripts and denote the gradients of two different inputs as $g$ and $g'$ respectively. With a slight overload of notation, we assume that the maximum difference between the gradients $g$ and $g'$ is also denoted by $\Delta$, i.e., $\|g - g'\| \le \Delta$. The farther the points, the easier it is for an adversary to distinguish between them. Therefore, the worst-case analysis is done for $\|g - g'\| = \Delta$.

Theorem 1.

Let $\Delta < 2$. Algorithm 1 provides $(0, \delta_{\text{total}})$-differential privacy guarantees where

$$\delta_{\text{total}} = \frac{T}{n}\, I_{\Delta^2/4}\!\left(\tfrac{1}{2}, \tfrac{d+1}{2}\right),$$

where $\Delta$ denotes the maximum gradient distance between two neighbouring databases and $I_x(a, b)$ is the regularized incomplete beta function defined in [17].

Proof.

We approach the quantification of privacy in three steps. We first characterize the privacy provided by the addition of a sample from the volume of a unit ball for a single step. Then, we incorporate the fact that a single time step uses only one of the $n$ available data points through random sampling. Finally, the composition theorem is used to characterize the cumulative privacy over the $T$ time steps.

  • Privacy guarantees for addition of noise sampled from the volume of a unit ball:
    To establish privacy guarantees, let us initially consider the privacy obtained at an iteration $t$ for a data sample $x_i$. As the noise is added from a unit ball, if $\Delta \ge 2$, the points can always be distinguished. Therefore, we assume that $\Delta < 2$.
    Let $Y = g + \xi$ and $Y' = g' + \xi$ with $\|g - g'\| = \Delta$. Here $\xi$ is a noise vector sampled from the volume of a unit ball, as demonstrated in Fig. 1. We aim to prove that $|\Pr[Y \in S] - \Pr[Y' \in S]| \le \delta$ for any $S \subseteq \mathbb{R}^d$.

    The volume of a unit ball in $d$ dimensions is given by

    $$V_d = \frac{\pi^{d/2}}{\Gamma\!\left(\frac{d}{2} + 1\right)}. \qquad (5)$$

    Fig. 1: Distribution of $Y$ and $Y'$

    Now, we find the volume of intersection of two hyperspheres with unit radii whose centres are located at a Euclidean distance of $\Delta$. This can be obtained as the sum of two hypersphere caps. The volume of a hypersphere cap is derived in [18] and is given by

    $$V_{\text{cap}} = \frac{1}{2}\, V_d\, I_{1 - a^2}\!\left(\tfrac{d+1}{2}, \tfrac{1}{2}\right),$$

    where $I_x(\cdot, \cdot)$ is the regularized incomplete beta function and $a$ is the difference between the radius of the sphere and the height of the cap; here, $a = \Delta/2$. The overlapping volume is twice the volume of the spherical cap, as shown in Fig. 1.

    Consider a set $S \subseteq \mathbb{R}^d$.

    The maximum difference between $\Pr[Y \in S]$ and $\Pr[Y' \in S]$ is obtained when $S$ is the non-overlapping volume of either one of the hyperspheres; this maximum value signifies the worst-case privacy. Hence, for a single step and a fixed data point,

    $$\delta = \frac{V_d - 2 V_{\text{cap}}}{V_d} = 1 - I_{1 - \Delta^2/4}\!\left(\tfrac{d+1}{2}, \tfrac{1}{2}\right).$$

    Using the identity $I_x(a, b) = 1 - I_{1-x}(b, a)$ of the regularized incomplete beta function [17], we rewrite the above expression as

    $$\delta = I_{\Delta^2/4}\!\left(\tfrac{1}{2}, \tfrac{d+1}{2}\right). \qquad (6)$$

    We analyse the effect of varying the dimension $d$ and the distance $\Delta$ on the privacy in the next subsection.

  • Effect of random sampling of data points on privacy:
    As each data point is sampled at random from a set of $n$ data points, we obtain improved privacy guarantees. According to the privacy amplification theorem employed in [15], the privacy guarantee offered at each step is now $\delta/n$.

  • Privacy over $T$ time steps:
    As the adversary can view multiple input-output pairs over the evolution of the algorithm, there is a compromise on the privacy of the database. This is characterized by the strong composition theorem [19]. The adaptive composition theorem can be applied when the adversary has information about the databases as well as the mechanism employed by the differentially private agent; in addition, the adversary is allowed to modify its future queries based on the outputs it sees, so the parameters of future queries are affected by the outputs of the differentially private agent. For mechanisms with $\epsilon = 0$, the adaptive composition of $k$ mechanisms, each providing $(0, \delta)$-privacy, results in a mechanism with privacy $(0, k\delta)$. A direct application of the adaptive composition theorem to our problem results in a $T$-fold composition of equivalent $(0, \delta/n)$-private mechanisms. The final guarantee that we get is

    $$\delta_{\text{total}} = \frac{T}{n}\, I_{\Delta^2/4}\!\left(\tfrac{1}{2}, \tfrac{d+1}{2}\right). \qquad (7)$$

    (A numerical evaluation of Eqs. 6 and 7 is sketched after the proof.)

Note: When noise sampled from the surface of a unit ball (i.e., the unit sphere) is added to the gradient instead of from the volume, as done in [11], we will not be able to provide privacy guarantees. This is because the output achieved after the addition of noise, i.e., $Y = g + \xi$, will be exactly a distance of 1 unit away from $g$. Therefore, the adversary can easily detect which of the two databases contributed to a specific output. Hence, we consider noise sampled from the volume of a unit ball. ∎
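The per-step guarantee of Eq. 6 and the overall guarantee of Eq. 7 can be evaluated numerically. The sketch below is illustrative; it assumes SciPy's betainc(a, b, x), which computes the regularized incomplete beta function $I_x(a, b)$.

from scipy.special import betainc

def delta_per_step(Delta, d):
    # Eq. 6: per-step (0, delta) guarantee for unit-ball noise; valid for Delta < 2.
    return betainc(0.5, (d + 1) / 2.0, Delta ** 2 / 4.0)

def delta_total(Delta, d, T, n):
    # Eq. 7: overall guarantee after T steps with uniform subsampling from n data points.
    return (T / n) * delta_per_step(Delta, d)

# Example: gradient sensitivity 0.5, dimension 3, 1000 steps, 10000 data points.
print(delta_per_step(0.5, 3))            # approx. 0.367
print(delta_total(0.5, 3, 1000, 10000))  # approx. 0.037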

IV-A1 Impact of $d$ and $\Delta$ on privacy

The privacy parameter $\delta$ varies from 0 to 1, where $\delta = 0$ corresponds to the case of maximum privacy (when the outputs from neighbouring databases are indistinguishable) and $\delta = 1$ corresponds to minimum privacy (where the outputs can surely be distinguished). In this section, we consider the expression for privacy obtained by adding noise sampled from a unit ball, as derived in Eq. 6, and analyse the effect of the parameters $d$ and $\Delta$ on the privacy metric $\delta$. Note that as the overall privacy derived in Eq. 7 is a scaled version of Eq. 6, the same trend applies to the overall privacy as well.

From [17], for a positive integer $m$, the regularized incomplete beta function admits the expansion

$$I_x(a, m) = x^{a} \sum_{j=0}^{m-1} \frac{\Gamma(a + j)}{\Gamma(a)\, j!}\, (1 - x)^{j}. \qquad (8)$$

Applying the above expansion to Eq. 6, under the assumption that $d$ is odd to ensure that $(d+1)/2$ is an integer, we have

$$\delta = \frac{\Delta}{2} \sum_{j=0}^{(d-1)/2} \frac{\Gamma\!\left(\tfrac{1}{2} + j\right)}{\Gamma\!\left(\tfrac{1}{2}\right)\, j!} \left(1 - \frac{\Delta^2}{4}\right)^{j}. \qquad (9)$$

Note that the assumption on $d$ is only made to study the trend and is not a requirement for Eq. 7 to hold. For a fixed $\Delta$, let us initially study the impact of the dimension $d$. Consider $d = 1$; this simplifies to the addition of uniform noise from the interval $[-1, 1]$. On substituting $d = 1$ in Eq. 9, we obtain $\delta = \Delta/2$, which corroborates the result obtained for $\delta$-privacy in the case of a uniform distribution in [20]. To analyse how $\delta$ varies with $d$, keep $\Delta$ fixed. For dimension $d$, the summation in Eq. 9 contains the terms $j = 0, 1, \ldots, (d-1)/2$. All of these terms are positive; hence, as $d$ increases, more terms get added to the summation and the value of $\delta$ increases for the same value of $\Delta$. Therefore, increasing the dimension increases $\delta$, thereby resulting in a decrease of privacy.

We then analyse how $\delta$ varies with the quantity $\Delta$. Note that $\Delta^2/4$ is increasing in $\Delta$ whereas $(1 - \Delta^2/4)^{j}$ is decreasing in $\Delta$. Therefore, we rely on the sign of the derivative to characterize whether $\delta$ is an increasing or a decreasing function of $\Delta$. Using the differentiation formulas for the regularized beta function in [17], we have

$$\frac{\partial \delta}{\partial \Delta} = \frac{\left(1 - \frac{\Delta^2}{4}\right)^{\frac{d-1}{2}}}{B\!\left(\tfrac{1}{2}, \tfrac{d+1}{2}\right)}, \qquad (10)$$

where $B(\cdot, \cdot)$ denotes the beta function as defined in [21]. We note that all the terms in Eq. 10 are positive for all values of $\Delta \in (0, 2)$. As the derivative is positive, we can conclude that $\delta$ is an increasing function of $\Delta$ for any given $d$. These observations are further confirmed by plotting the variation of $\delta$ with $\Delta$ for different dimensions in Fig. 2.

Fig. 2: Variation of privacy $\delta$ with $\Delta$ for different dimensions $d$
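The trends shown in Fig. 2 can be reproduced numerically; the short sketch below (illustrative, repeating the delta_per_step helper assumed earlier) evaluates Eq. 6 over a range of dimensions and distances.

import numpy as np
from scipy.special import betainc

def delta_per_step(Delta, d):
    # Eq. 6: per-step delta for unit-ball perturbation.
    return betainc(0.5, (d + 1) / 2.0, Delta ** 2 / 4.0)

# delta increases with the dimension d for a fixed Delta ...
print([round(delta_per_step(0.5, d), 3) for d in (1, 3, 5, 9)])
# ... and increases with Delta for a fixed dimension d.
print([round(delta_per_step(D, 3), 3) for D in np.linspace(0.2, 1.8, 5)])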

IV-B Convergence guarantees

We can guarantee the convergence of the SGD algorithm arbitrarily close to a local minimum by following the analysis in [11]. The result in [11] is reproduced here for convenience:

Theorem 2.

Suppose a function $f : \mathbb{R}^d \to \mathbb{R}$ that is strict saddle and has a stochastic gradient oracle where the noise $\xi$ satisfies $\mathbb{E}[\xi\xi^{\top}] = \sigma^2 I$ for some $\sigma > 0$. Further, suppose the function is bounded by $B$, is $\beta$-smooth and has a $\rho$-Lipschitz Hessian. Then there exists a threshold $\eta_{\max}$ so that, for any $\zeta > 0$ and for any step size $\eta \le \eta_{\max}/\max\{1, \log(1/\zeta)\}$, with probability at least $1 - \zeta$, in $\tilde{O}\bigl(\eta^{-2}\log(1/\zeta)\bigr)$ iterations, SGD outputs a point that is $\tilde{O}\bigl(\sqrt{\eta \log(1/(\eta\zeta))}\bigr)$-close to some local minimum $x^{\star}$.

The strict-saddle property and the equivalence of local and global minima are commonly encountered, especially when we deal with an over-parameterized shallow neural network [22]. Also, as the noise is sampled uniformly at random from the volume of a unit ball, it is isotropic; this automatically satisfies the requirement $\mathbb{E}[\xi\xi^{\top}] = \sigma^2 I$. Therefore, Algorithm 1 is guaranteed to output a point that is close to a local minimum.
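As a quick numerical sanity check of the isotropy claim (a Monte Carlo estimate of ours, not part of the paper), noise drawn uniformly from the volume of the $d$-dimensional unit ball has covariance $\sigma^2 I$ with $\sigma^2 = 1/(d+2)$:

import numpy as np

def sample_unit_ball(d, rng):
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    return v * rng.uniform() ** (1.0 / d)

d, rng = 4, np.random.default_rng(0)
samples = np.stack([sample_unit_ball(d, rng) for _ in range(200_000)])
print(np.round(samples.T @ samples / len(samples), 3))   # approx. (1 / (d + 2)) * I = 0.167 * I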

IV-C Discussion

The privacy guarantee provided is $\delta$-privacy (i.e., $(0, \delta)$) and not $\epsilon$-privacy, as the perturbation added is from a bounded distribution, the unit ball. The unit ball in a single dimension corresponds to the uniform distribution on $[-1, 1]$. Here, we quantified the privacy achieved by adding noise from a unit ball. However, if we wish to address the alternate problem, i.e., if a certain privacy guarantee is desired, the radius of the $d$-dimensional ball can be appropriately scaled to achieve it. The convergence guarantees of the algorithm still hold as long as the added noise is isotropic.
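One way to address this alternate problem (a sketch under our own rescaling of Eq. 6, not a procedure given in the paper) is to note that with a ball of radius $r$ the per-step guarantee becomes $\delta = I_{(\Delta/2r)^2}\!\left(\tfrac{1}{2}, \tfrac{d+1}{2}\right)$, so the radius achieving a target $\delta$ can be obtained by inverting the regularized incomplete beta function, e.g. with SciPy's betaincinv(a, b, y):

import numpy as np
from scipy.special import betainc, betaincinv

def ball_radius_for_delta(target_delta, Delta, d):
    # Radius r of the d-dimensional ball so that a single perturbed step is
    # (0, target_delta)-private for gradient sensitivity Delta (assumes 0 < target_delta < 1).
    x = betaincinv(0.5, (d + 1) / 2.0, target_delta)   # solves I_x(1/2, (d+1)/2) = target_delta
    return Delta / (2.0 * np.sqrt(x))

# Example: sensitivity 0.5, dimension 3, desired per-step delta of 0.05.
r = ball_radius_for_delta(0.05, Delta=0.5, d=3)
print(r, betainc(0.5, 2.0, (0.5 / (2 * r)) ** 2))      # second value approx. 0.05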

V Conclusion

This work aims to bring out the inherent privacy provided by the algorithm that perturbs the gradient for saddle-point escape in a non-convex setting; accordingly, the privacy factor $\delta$ for a typical saddle-point escape algorithm is derived. Our major contribution lies in quantifying the privacy achieved by adding noise randomly sampled from a $d$-dimensional ball, which has not been attempted before. We also analyze how the privacy changes with the dimension. Finally, we quantify the overall privacy obtained when the PrGD algorithm is applied to a database over $T$ time steps while providing convergence guarantees.

References

  • [1] C. Dwork, A. Roth et al., “The algorithmic foundations of differential privacy,” Foundations and Trends® in Theoretical Computer Science, vol. 9, no. 3–4, pp. 211–407, 2014.
  • [2] P. Jain, P. Kothari, and A. Thakurta, “Differentially private online learning,” in Conference on Learning Theory, 2012, pp. 24–1.
  • [3] D. Kifer, A. Smith, A. Thakurta, S. Mannor, N. Srebro, and R. C. Williamson, “Private Convex Empirical Risk Minimization and High-dimensional Regression,” 25th Annual Conference on Learning Theory, vol. 23, no. 25, pp. 1–40, 2012.
  • [4] J. Zhang, K. Zheng, W. Mou, and L. Wang, “Efficient private ERM for smooth objectives,” in Proceedings of the 26th International Joint Conference on Artificial Intelligence.   AAAI Press, 2017, pp. 3922–3928.
  • [5] S. Song, K. Chaudhuri, and A. D. Sarwate, “Stochastic gradient descent with differentially private updates,” 2013 IEEE Global Conference on Signal and Information Processing, GlobalSIP 2013 - Proceedings, no. 2, pp. 245–248, 2013.
  • [6] C. Dwork, G. N. Rothblum, and S. Vadhan, “Boosting and differential privacy,” in 2010 IEEE 51st Annual Symposium on Foundations of Computer Science.   IEEE, 2010, pp. 51–60.
  • [7] A. Berlioz, A. Friedman, M. A. Kaafar, R. Boreli, and S. Berkovsky, “Applying differential privacy to matrix factorization,” in Proceedings of the 9th ACM Conference on Recommender Systems.   ACM, 2015, pp. 107–114.
  • [8] A. D. Sarwate and K. Chaudhuri, “Signal processing and machine learning with differential privacy,” IEEE Signal Processing Magazine, vol. 30, no. 5, pp. 86–94, 2013.
  • [9] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan, “How to escape saddle points efficiently,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70.   JMLR. org, 2017, pp. 1724–1732.
  • [10] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun, “The loss surfaces of multilayer networks,” in Artificial Intelligence and Statistics, 2015, pp. 192–204.
  • [11] R. Ge, F. Huang, C. Jin, and Y. Yuan, “Escaping from saddle points—online stochastic gradient for tensor decomposition,” in Conference on Learning Theory, 2015, pp. 797–842.
  • [12] H. Daneshmand, J. Kohler, A. Lucchi, and T. Hofmann, “Escaping saddles with stochastic gradients,” in International Conference on Machine Learning, 2018, pp. 1163–1172.
  • [13] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization,” in Advances in neural information processing systems, 2014, pp. 2933–2941.
  • [14] K. Y. Levy, “The power of normalization: Faster evasion of saddle points,” arXiv preprint arXiv:1611.04831, 2016.
  • [15] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep learning with differential privacy,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security.   ACM, 2016, pp. 308–318.
  • [16] R. Bassily, A. Smith, and A. Thakurta, “Private empirical risk minimization: Efficient algorithms and tight error bounds,” in Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on.   IEEE, 2014, pp. 464–473.
  • [17] Wolfram Research, Inc. BetaRegularized. [Online]. Available: http://functions.wolfram.com/GammaBetaErf/BetaRegularized/
  • [18] S. Li, “Concise formulas for the area and volume of a hyperspherical cap,” Asian Journal of Mathematics and Statistics, vol. 4, no. 1, pp. 66–70, 2011.
  • [19] P. Kairouz, S. Oh, and P. Viswanath, “The composition theorem for differential privacy,” IEEE Transactions on Information Theory, vol. 63, no. 6, pp. 4037–4049, 2017.
  • [20] J. He and L. Cai, “Differential private noise adding mechanism: Basic conditions and its application,” Proceedings of the American Control Conference, pp. 1673–1678, 2017.
  • [21] Wolfram Research, Inc. Beta function. [Online]. Available: http://functions.wolfram.com/GammaBetaErf/Beta/02/
  • [22] S. S. Du and J. D. Lee, “On the power of over-parametrization in neural networks with quadratic activation,” in International Conference on Machine Learning, 2018, pp. 1328–1337.