1 Introduction
Consider the general problem of optimizing a function defined with respect to a dataset and a parameter. This general class of problems includes classical empirical risk minimization, amongst others, and is a basic problem in learning and optimization. We say that such a function is 1-sensitive in the dataset if changing one datapoint in the dataset can change the value of the function by at most 1, for any parameter value. Suppose that we want to solve an optimization problem like this subject to the constraint of differential privacy. The exponential mechanism provides a powerful, general-purpose, and often error-optimal method to solve this problem [MT07]. It requires no assumptions on the function other than that it is 1-sensitive (this is a minimal assumption for privacy: more generally, its guarantees are parameterized by the sensitivity of the function). Unfortunately, the exponential mechanism is generally infeasible to run: its implementation (and the implementation of related mechanisms, like “ReportNoisyMax” [DR14]) requires the ability to enumerate the parameter range, making it infeasible in most learning settings, despite its use in proving general information-theoretic bounds in private PAC learning [KLN+11]. When the objective is continuous, convex, and satisfies second-order conditions like strong convexity or smoothness, the situation is better: there are a number of algorithms available, including simple output perturbation [CMS11] and objective perturbation [CMS11, KST12, INS+19]. This partly mirrors the situation in nonprivate data analysis, in which convex optimization problems can be solved quickly and efficiently, and most nonconvex problems are NP-hard in the worst case.
In the nonprivate case, however, the worst-case complexity of optimization problems does not tell the whole story. For many nonconvex optimization problems, such as integer programming, there are fast heuristics that not only reliably succeed in optimizing functions deriving from real inputs, but can also certify their own success. In such settings, can we leverage these heuristics to obtain practical private optimization algorithms? In this paper, we give two novel analyses of objective perturbation algorithms that extend their applicability to 1-sensitive nonconvex problems (and more generally, bounded sensitivity functions). We also get new results for convex problems, without the need for second-order conditions like smoothness or strong convexity. Our first algorithm operates over a discrete parameter space, and requires no further assumptions beyond 1-sensitivity for either its privacy or accuracy analysis; that is, it is comparable in generality to the exponential mechanism. The second algorithm operates over a continuous parameter space, and requires only that the objective be Lipschitz-continuous in its second argument. Its privacy analysis does not require convexity. Its accuracy analysis does, but it does not require any second-order conditions. We implement our first algorithm to directly optimize classification error over a discrete set of linear functions on the Adult dataset, and find that it substantially outperforms private logistic regression.
1.1 Related work
Objective perturbation was first introduced by [CMS11], and analyzed for the special case of strongly convex functions. Its analysis was subsequently improved and generalized [KST12, INS+19] to apply to smooth convex functions, and to tolerate a small degree of error in the optimization procedure. Our paper is the first to give an analysis of objective perturbation without the assumption of convexity, and the first to give an accuracy analysis without making second-order assumptions on the objective function, even in the convex case. [CMS11] also introduced the related technique of output perturbation, which perturbs the exact optimizer of a strongly convex function.
The work most closely related to our first algorithm is [NRW18], who also give an “oracle-efficient” algorithm for nonconvex differentially private optimization, i.e., reductions from nonprivate optimization to private optimization. Their algorithm (“Report Separator Perturbed Noisy Max”, or RSPM) relies on an implicit perturbation of the optimization objective by augmenting the dataset with a random collection of examples drawn from a separator set. The algorithms which we introduce in this paper are substantially more general: because they directly perturb the objective, they do not rely on the existence of a small separator set for the class of functions in question. One of the contributions of our paper is the first experimental analysis of RSPM, in Section 5. [NRW18] also give a generic method to transform an algorithm (like ours) whose privacy analysis depends on the success of the optimization oracle into an algorithm whose privacy analysis does not depend on this, whenever the optimization heuristic can certify its success (integer program solvers have this property). Their method applies to the algorithms we develop in this paper. Our second algorithm crucially uses a stability result recently proven by [SN19] in the context of online learning.
2 Preliminaries
We first define a dataset, a loss function with respect to a dataset, and the two types of optimization oracles we will call upon. We then define differential privacy, and state basic properties.
A dataset is defined as a (multi)set of Lipschitz loss functions . For in a parameter space , the loss on dataset is defined to be
We will define two types of perturbed loss functions, and the corresponding oracles which are assumed to be able to optimize each type. These will be used in our discrete objective perturbation algorithm in Section 3 and our sampling based objective perturbation algorithm in Section 4 respectively.
Given a vector, we define the perturbed loss to be:
This is simply the loss function augmented with a linear term.
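As a minimal sketch of a linearly perturbed loss, the following Python snippet evaluates a dataset loss plus a linear term in the parameter. It assumes that the loss on a dataset is the sum of the per-record losses and that the linear term enters with a minus sign; both conventions, along with the names `dataset_loss`, `perturbed_loss`, and `eta`, are illustrative assumptions rather than the paper's notation.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
LossFn = Callable[[Vector], float]


def dataset_loss(dataset: Sequence[LossFn], theta: Vector) -> float:
    # Assumption: the loss on a dataset is the sum of its per-record losses.
    return sum(loss(theta) for loss in dataset)


def perturbed_loss(dataset: Sequence[LossFn], theta: Vector, eta: Vector) -> float:
    # Assumption: the linear perturbation term is subtracted from the loss.
    inner_product = sum(e * t for e, t in zip(eta, theta))
    return dataset_loss(dataset, theta) - inner_product


# Two toy absolute-value losses on a one-dimensional parameter.
toy_dataset = [lambda th: abs(th[0] - 1.0), lambda th: abs(th[0] + 2.0)]
example_value = perturbed_loss(toy_dataset, [0.5], [2.0])
```

With a zero perturbation vector the perturbed loss coincides with the unperturbed dataset loss, which is a quick sanity check on any oracle implementation.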
Let be the projection formally defined in Section 3, which informally maps a dimensional vector with norm at most to a unit vector in . Given a vector, we define the perturbed projected loss to be:
Definition 2.1 (Approximate Linear Optimization Oracle).
Given as input a dataset and a dimensional vector , an approximate linear optimization oracle returns such that
When we say is a linear optimization oracle.
Definition 2.2 (Approximate Projected Linear Optimization Oracle).
Given as input a dataset and a dimensional vector , an approximate projected linear optimization oracle returns such that
When we say is a projected linear optimization oracle. We remark that while it seems less natural to assume an oracle for the projected perturbed loss which involves the nonlinearity , in Section D.2 we show how we can linearize this term by introducing an auxiliary variable and introducing a convex constraint. This is ultimately how we implement this oracle in our experiments.
Definition 2.3.
A randomized algorithm is an minimizer for if for every dataset, with probability, it outputs such that:
Certain optimization routines will have guarantees only for discrete parameter spaces:
Definition 2.4 (Discrete parameter spaces).
A separated discrete parameter space is a discrete set such that for any pair of distinct vectors we have .
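A separation condition of this kind can be checked directly on a finite candidate set. The sketch below assumes Euclidean distance and a separation threshold named `gamma`; both the choice of norm and the parameter name are assumptions made for illustration.

```python
import itertools
import math


def min_pairwise_distance(points):
    # Smallest Euclidean distance between any two distinct points.
    return min(math.dist(p, q) for p, q in itertools.combinations(points, 2))


def is_separated(points, gamma):
    # True if every pair of distinct points is at least `gamma` apart.
    return min_pairwise_distance(points) >= gamma


# A 5 x 5 grid with spacing 0.5 in each coordinate.
grid = [(i * 0.5, j * 0.5) for i in range(-2, 3) for j in range(-2, 3)]
grid_is_half_separated = is_separated(grid, 0.5)
```

The pairwise check is quadratic in the number of points, so it is only practical for small candidate sets; for a regular grid the separation equals the grid spacing.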
Finally we define differential privacy.
We call two data sets neighbors (written as ) if can be derived from by replacing a single loss function with some other element of .
Definition 2.5 (Differential Privacy [DMN+06b, DKM+06a]).
Fix . A randomized algorithm is differentially private (DP) if for every pair of neighboring data sets , and for every event :
The Laplace distribution centered at with scale is the distribution with probability density function . We also make use of the exponential distribution which has density function if and otherwise.
3 Objective perturbation over a discrete decision space
In this section we give an objective perturbation algorithm that is differentially private for any nonconvex Lipschitz objective over a discrete decision space . We assume that each is Lipschitz over w.r.t. norm: that is for any , . Note that if takes values in , then we know is also Lipschitz due to the separation in .
Let be a bound on the maximum norm of any vector in . We will make use of a projection onto the unit sphere in one higher dimension. The projection function is defined as:
Note that for all , and also that for any , . This shows that while projecting to the dimensional sphere, can’t force points too much closer together than they start, which will be useful in the privacy proof.
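To illustrate the two properties just described, here is a sketch of one natural lifting onto the unit sphere in one higher dimension: scale the vector by the norm bound and append the coordinate that makes the result a unit vector. This particular formula, the function name `lift_to_sphere`, and the parameter name `radius` are assumptions; the paper's own projection may differ in detail.

```python
import math


def lift_to_sphere(theta, radius):
    # Hypothetical lifting: (theta / radius, sqrt(1 - ||theta||^2 / radius^2)).
    sq_norm = sum(t * t for t in theta)
    if sq_norm > radius * radius * (1.0 + 1e-12):
        raise ValueError("theta must have Euclidean norm at most radius")
    scaled = [t / radius for t in theta]
    # Clamp at zero to absorb floating-point error on the boundary.
    last = math.sqrt(max(0.0, 1.0 - sq_norm / (radius * radius)))
    return scaled + [last]


lifted = lift_to_sphere([0.3, -0.4], 1.0)
lifted_norm = math.sqrt(sum(c * c for c in lifted))
```

Under this lifting the first coordinates are a rescaled copy of the input, so the distance between two lifted points is at least the original distance divided by `radius`; the extra coordinate can only increase it.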
We first prove an accuracy bound for OPDisc, which follows from a simple tail bound on the random linear perturbation term.
Theorem 1 (Utility).
Algorithm 1 is an ()minimizer for with
Proof.
For we have the following tail bound:
Now let where each is a Gaussian random variable with variance . It follows that for . With probability we have:
Thus with probability ,
By symmetry with probability . Thus by a union bound, with probability .
Let be the output of Algorithm 1 and be the minimizer for . Then with probability : and . Combining these two bounds we get:
(1) 
The second inequality holds because is the minimizer for the regularized loss. ∎
We now prove OPDisc preserves DP. We defer the full proof to the Appendix.
Theorem 2.
Algorithm 1 is differentially private.
Proof Sketch.
For any realized noise vector , we write as the output. We first want to show that there exists a mapping such that is the parameter vector output on any neighboring dataset when the noise vector is realized as : that is, . If we can show that , then the probability of outputting any particular on input should be close to the corresponding probability, on input as desired.
Denote the set of noise vectors that induce output on dataset by . Define our mapping:
We now state key lemmas. First, Lemma 3 shows that our mapping preserves the minimizer even after switching to the adjacent dataset, so long as the minimizer is unique.
Lemma 3.
Fix any and any pair of neighboring datasets . Let be such that is the unique minimizer . Then . Hence:
Proof.
Let . Suppose that is the output on neighboring dataset when the noise vector is . We will derive a contradiction. Since is the unique minimizer on :
(2) 
Let be the index where and are different, such that and . Then . Now, write the loss function in terms of and rearrange terms:
Since is a unique minimizer for and , the term in the square brackets is positive. Hence:
Since are Lipschitz functions . Also, , by expanding and using . Substituting this becomes:
Since :
(3) 
This contradicts . ∎
Lemma 4 shows that the minimizer is unique with probability .
Lemma 4.
Fix any separated vector space . For every dataset there is a subset such that and for any :
Finally Lemma 5 shows that with high probability over the draw of , .
Lemma 5.
Let . Then there exists a set such that , and for all if denotes the probability density function of :
Finally, we focus on noise vectors in the set of , which has probability mass at least , and show that for any in that induces output solution on , the noise vector also induces on the neighbor . Then the differential privacy guarantee essentially follows from the bounded ratio result in Lemma 5. ∎
3.1 Comparing OPDisc and RSPM
While both OPDisc and the RSPM algorithm of [NRW18] require discrete parameter spaces, OPDisc is substantially more general in that it only requires that the loss functions be Lipschitz, whereas RSPM assumes the loss functions are bounded in (and hence Lipschitz over ) and assumes the existence of a small separator set (defined in the supplement). Nevertheless, we might hope that in addition to greater generality, OPDisc has comparable or superior accuracy for natural classes of learning problems. We show this is indeed the case for the fundamental task of privately learning discrete hyperplanes, where it is better by a linear factor in the dimension. We define the RSPM algorithm, for which we must define the notion of a separator set, in the supplement.
Theorem 6 (RSPM Utility [NRW18]).
Let be a discrete parameter space with a separator set of size . The Gaussian RSPM algorithm is an oracle-efficient minimizer for for:
Let be a discretization of , e.g. . Let be the subset of vectors in this discretization that lie within the unit Euclidean ball: . is separated since any two distinct differ in at least one coordinate by at least . Moreover admits a separator set of size (see the Appendix of [NRW18]). Since the loss functions and is separated, the loss functions are Lipschitz. By Theorem 6, RSPM has accuracy bound:
By Theorem 1 OPDisc has accuracy bound:
Thus, in this case OPDisc has an accuracy bound that is better by a factor of roughly .
4 Objective perturbation for Lipschitz functions
We now present an objective perturbation algorithm (paired with an additional output perturbation step), which applies to arbitrary parameter spaces. The privacy guarantee holds for (possibly nonconvex) Lipschitz loss functions, while the accuracy guarantee applies only if the loss functions are convex and bounded. Even in the convex case, this is a substantially more general statement than was previously known for objective perturbation: we don’t require any second order conditions like strong convexity or smoothness (or even differentiability). Our guarantees also hold with access only to an approximate optimization oracle.
We present the full algorithm in Algorithm 2. It 1) uses the approximate linear oracle (in Definition 2.1) to solve polynomially many perturbed optimization objectives, each with an independent random perturbation, and 2) perturbs the average of these solutions with Laplace noise.
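The two stages just described can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: a brute-force search over a finite candidate list stands in for the approximate linear optimization oracle, the perturbation coordinates are drawn from an exponential distribution, the linear term is subtracted from a summed loss, and the parameters `exp_rate` and `laplace_scale` are hypothetical placeholders that are not calibrated to any privacy guarantee.

```python
import random


def brute_force_oracle(dataset, eta, candidates):
    # Stand-in for the optimization oracle: exact search over a finite set.
    def objective(theta):
        loss = sum(record_loss(theta) for record_loss in dataset)
        return loss - sum(e * t for e, t in zip(eta, theta))
    return min(candidates, key=objective)


def sampled_objective_perturbation(dataset, candidates, dim, num_samples,
                                   exp_rate, laplace_scale, rng):
    # Stage 1: solve independently perturbed objectives and average the solutions.
    total = [0.0] * dim
    for _ in range(num_samples):
        eta = [rng.expovariate(exp_rate) for _ in range(dim)]
        theta = brute_force_oracle(dataset, eta, candidates)
        for j in range(dim):
            total[j] += theta[j]
    average = [t / num_samples for t in total]
    # Stage 2: output perturbation. The difference of two i.i.d. exponential
    # draws with rate 1 / laplace_scale is Laplace with scale laplace_scale.
    rate = 1.0 / laplace_scale
    return [a + rng.expovariate(rate) - rng.expovariate(rate) for a in average]


toy_data = [lambda th: abs(th[0] - 1.0)]
toy_candidates = [(0.0,), (1.0,), (2.0,)]
noisy_output = sampled_objective_perturbation(
    toy_data, toy_candidates, dim=1, num_samples=5,
    exp_rate=1.0, laplace_scale=0.1, rng=random.Random(0))
```

Averaging several perturbed solutions before adding Laplace noise is what lets the second stage use noise scaled to the sensitivity of the average rather than of a single oracle call.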
Before we proceed to our analysis, let us first introduce some relevant parameters. Let have diameter , and diameter . We assume that the loss functions are Lipschitz with respect to norm, and assume the loss functions are scaled to take values in . Our utility analysis requires convexity in the loss functions, and essentially follows from the high-probability bounds on the linear perturbation terms in the first stage and the output perturbation in the second stage.
Theorem 7 (Utility).
Assuming the loss functions are convex, Algorithm 2 is an minimizer for with
where is the approximation error of the oracle .
Proof.
For . By Theorem in [JAN17] which gives upper tail bounds for the sum of independent exponential random variables, we can conclude that with probability .
Then by Lipschitzness with respect to the norm, with probability :
We now focus on . By the convexity of the loss functions, we have:
Since each is bounded in (since each ) and independent, by Hoeffding’s Inequality (see Appendix) with probability :
So it suffices to show that is small. Fix . Now by definition of , for any , we have
hence
, hence:
Now by Jensen’s inequality, , where the last equality is by the variance of the exponential distribution. Putting it all together, with probability :
Plugging in the value of , and expanding we get the following long expression:
(4) 
The last step of equation 4 comes from replacing in the value of . Replacing back the values of results in:
Finally, note that by the choice of the parameter , the first term has order at most that of the second term, which gives our stated bound. ∎
The privacy analysis of this algorithm crucially depends on a stability lemma proven by [SN19] in the context of online learning, and does not require convexity (see Footnote 1).
Footnote 1: Compared to the bound in [SN19], our bound has an additional factor of 2, since our neighboring relationship in Definition 2.5 is defined via replacement, whereas in [SN19] stability is defined in terms of adding another loss function.
Lemma 8 (Stability lemma [SN19]).
For any pair of neighboring data sets . Let and be the output of an approximate oracle on datasets and respectively. Then,
From now on, let be a sequence of i.i.d. dimensional noise vectors, and let be the average output of calls to an approximate oracle.
Lemma 9.
If , for , then, with probability :
where the randomness is taken over the different runs of .
The next lemma combines Lemma 8 and Lemma 9 to get a high-probability sensitivity bound for the average output of the approximate oracle.
Lemma 10 (High Probability sensitivity).
For any pair of neighboring datasets , let , be the sample average after calls to an approximate oracle. Then, with probability over the random draws of ,
(5) 
Proof.
Theorem 11.
Algorithm 2 is differentially private.
Proof sketch.
Given a pair of neighboring data sets , we will condition on the set of noise vectors that satisfy the sensitivity bound (5), which occurs with probability at least . Then the privacy guarantee follows from the use of the Laplace mechanism. ∎
Proof.
Fix any two neighboring datasets . Here is the average of runs of with dataset and sequence of noise vectors . Let be a random dimensional noise vector . We can write the output of Algorithm 2 as a sum of two random variables:
Following Lemma 8, let and define the set as
where is the norm sensitivity bound from Lemma 10. Then, by the same lemma, the probability that is less than , where is sampled independently from the exponential distribution. For any event ,
(7) 
We can rewrite the joint probability as a conditional probability:
(8) 
(9) 
Therefore,
∎
5 Experiments
For our experiments we consider the problem of privately learning a linear threshold function to solve a binary classification task. Given a labeled data set where each and , the classification problem is to find a hyperplane that best separates the positive from the negative samples. A common approach is to optimize a convex surrogate loss function that approximates the classification loss. We use this approach (private logistic regression) as our baseline. In comparison, using our algorithm OPDisc, we instead try to directly optimize classification error over a discrete parameter space, using an integer program solver. Although this can be computationally expensive, we find that it is feasible for relatively small datasets (we use a balanced subset of the Adult dataset with roughly and features, after one-hot encoding of categorical features). In this setting, we find that OPDisc can substantially outperform private logistic regression. We remark that “small data” is the regime in which applying differential privacy is most challenging, and we view our approach as a promising way forward in this important setting.
Data description and preprocessing
We use the Adult dataset [LIC13], a common benchmark dataset derived from Census data. The classification task is to predict whether an individual earns over 50K per year. The dataset has records and 14 features that are a mix of both categorical and continuous attributes. The Adult dataset is unbalanced: only 7841 individuals have the (positive) label. To arrive at a balanced dataset (so that constant functions achieve 50% error), we take all positive individuals, and an equal number of negative individuals selected at random, for a total dataset size of 2 × 7841 = 15682. We encode categorical features with one-hot encodings, which increases the dimensionality of the dataset. We found it difficult to run our algorithm with more than 30 features, and so we take a subset of 7 features from the Adult dataset that are represented by real-valued features after one-hot encoding. We chose the subset of features to optimize the accuracy of our logistic regression baseline.
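The balancing step can be sketched as follows: keep every positive record and draw an equal number of negatives uniformly at random without replacement. The function name `balance_by_downsampling` and the label encoding are hypothetical choices, and the snippet operates on plain Python lists rather than on the actual Adult data files.

```python
import random


def balance_by_downsampling(features, labels, positive_label, rng):
    # Keep all positives and an equal-sized random subset of the negatives.
    positives = [i for i, y in enumerate(labels) if y == positive_label]
    negatives = [i for i, y in enumerate(labels) if y != positive_label]
    if len(negatives) < len(positives):
        raise ValueError("not enough negative records to balance the dataset")
    chosen = positives + rng.sample(negatives, len(positives))
    rng.shuffle(chosen)
    return [features[i] for i in chosen], [labels[i] for i in chosen]


# Toy example: 3 positive and 7 negative records.
toy_features = [[float(i)] for i in range(10)]
toy_labels = [1, 1, 1, -1, -1, -1, -1, -1, -1, -1]
balanced_x, balanced_y = balance_by_downsampling(
    toy_features, toy_labels, positive_label=1, rng=random.Random(0))
```

On the balanced output, any constant classifier is wrong on exactly half of the records, which matches the 50% baseline error mentioned above.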
Baseline: private logistic regression (LR).
We use as our baseline private logistic regression, which optimizes over the space of continuous halfspaces with the goal of minimizing the logistic loss function, given by . We implement a differentially private stochastic gradient descent (privateSGD) algorithm from [BST14, ACG+16], keeping track of privacy loss using the moments accountant method as implemented in the TensorFlow Privacy Library. The algorithm involves three parameters: gradient clip norm, minibatch size, and learning rate. For each setting of the target privacy parameters , we run a grid search to identify the triplet of parameters that gives the highest accuracy. To lower the variance of the accuracy, we also average over all the iterates in the run of privateSGD.
Implementation details for OPDisc and RSPM
For both OPDisc and RSPM, we encode each record as a loss function: . For both algorithms, we have separation parameter and constrain the weight vectors to have norm bounded by . In OPDisc, each coordinate can take values in the discrete set with , and we constrain the to be at most . In RSPM, we optimize over the set . OPDisc requires an approximate projected linear optimization oracle (Definition 2.2) and RSPM requires a linear optimization oracle (Definition 2.1). In the appendix, we show that the optimization problems can be cast as mixed-integer programs (MIPs), allowing us to implement the oracles via the Gurobi MIP solver. The Gurobi solver was able to solve each of the integer programs we passed it.
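For intuition about what the oracle optimizes, the sketch below minimizes classification error of a linear threshold function by exhaustive search over a small discrete grid. It is a toy stand-in for the mixed-integer programming formulation, not a reproduction of it: the 0-1 loss encoding (labels in {+1, -1}, ties counted as errors), the grid construction, and all function names are assumptions made for illustration.

```python
import itertools


def zero_one_loss(theta, x, y):
    # 1 if the linear threshold function misclassifies (x, y), else 0.
    score = sum(t * xi for t, xi in zip(theta, x))
    return 0 if y * score > 0 else 1


def classification_error(theta, data):
    # Number of misclassified records.
    return sum(zero_one_loss(theta, x, y) for x, y in data)


def discrete_grid(dim, step, max_index):
    # All vectors whose coordinates are integer multiples of `step`.
    values = [k * step for k in range(-max_index, max_index + 1)]
    return list(itertools.product(values, repeat=dim))


def brute_force_erm(data, candidates):
    # Exhaustive search; exponential in the dimension, for tiny problems only.
    return min(candidates, key=lambda theta: classification_error(theta, data))


toy_data = [((1.0, 0.0), 1), ((-1.0, 0.0), -1), ((2.0, 1.0), 1)]
candidates = discrete_grid(dim=2, step=0.5, max_index=2)
best_theta = brute_force_erm(toy_data, candidates)
```

The grid size grows as (2 · max_index + 1) raised to the dimension, which is why an integer-program solver is needed once the number of features is more than a handful.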
Empirical evaluation.
We evaluate our algorithms by their classification accuracy. The left side of Figure 0(a) plots the accuracy of OPDisc and our baseline (y-axis) as a function of the privacy parameter (x-axis), averaged over 15 runs. We fix for all three algorithms across all runs. The error bars report the empirical standard deviation. We see that both OPDisc and RSPM improve dramatically over the logistic regression baseline, showing that in small-data settings, it is possible to improve over the error/privacy tradeoff given by standard convex-surrogate approaches by appealing to nonconvex optimization heuristics. OPDisc also obtains consistently better error than RSPM. The algorithm OPDisc also has significantly lower variance in its error compared to the other two algorithms. The right side of Figure 0(a) gives a histogram of the runtime of our three methods over the course of our experiment. For both OPDisc and RSPM, the running time is dominated by an integer-program solver. We see that while our method frequently completes quite quickly (often even beating our logistic regression baseline!), it has high variance, and occasionally requires a long time to run. In our experiments, we were always able to eventually solve the necessary optimization problem, however.
References
[ACG+16] (2016) Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318.
[BST14] (2014) Private empirical risk minimization: efficient algorithms and tight error bounds. In 55th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2014, Philadelphia, PA, USA, October 18–21, 2014, pp. 464–473.
[CMS11] (2011) Differentially private empirical risk minimization. Journal of Machine Learning Research 12 (Mar), pp. 1069–1109.
[DKM+06a] (2006) Our data, ourselves: privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp. 486–503.
[DMN+06b] (2006) Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography, TCC’06, Berlin, Heidelberg, pp. 265–284. ISBN 3540327312, 9783540327318.
[DR14] (2014) The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9 (3–4), pp. 211–407.
[GKS93] (1993) Exact identification of read-once formulas using fixed points of amplification functions. SIAM Journal on Computing 22 (4), pp. 705–726.
[INS+19] (2019) Towards practical differentially private convex optimization.
[JAN17] (2017) Tail bounds for sums of geometric and exponential variables. arXiv e-prints, arXiv:1709.08157.
[KLN+11] (2011) What can we learn privately? SIAM Journal on Computing 40 (3), pp. 793–826.
[KST12] (2012) Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory, pp. 25–1.
[LIC13] (2013) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.
[MT07] (2007) Mechanism design via differential privacy. In FOCS, Vol. 7, pp. 94–103.
[NRW18] (2018) How to use heuristics for differential privacy. arXiv preprint arXiv:1811.07765.
[SN19] (2019) Online nonconvex learning: following the perturbed leader is optimal. arXiv preprint arXiv:1903.08110.
[SKS16] (2016) Efficient algorithms for adversarial contextual learning. CoRR abs/1602.02454.