Privacy has become a growing concern due to the massive increase in personal information stored in electronic databases, such as medical records, financial records, web search histories, and social network data. Machine learning can be employed to discover novel population-wide patterns; however, the results of such algorithms may reveal certain individuals’ sensitive information, thereby violating their privacy. Thus, an emerging challenge for machine learning is how to learn from datasets that contain sensitive personal information.
At first glance, it may appear that simple anonymization of private information is enough to preserve privacy. However, this is often not the case; even if obvious identifiers, such as names and addresses, are removed from the data, the remaining fields can still form unique “signatures” that can help re-identify individuals. Such attacks have been demonstrated by various works, and are possible in many realistic settings, such as when an adversary has side information (Sweeney, 1997; Narayanan and Shmatikov, 2008; Ganta et al., 2008), and when the data has structural properties (Backstrom et al., 2007), among others. Moreover, even releasing statistics on a sensitive dataset may not be sufficient to preserve privacy, as illustrated on genetic data (Homer et al., 2008; Wang et al., 2009). Thus, there is a great need for designing machine learning algorithms that also preserve the privacy of individuals in the datasets on which they train and operate.
In this paper we focus on the problem of classification, one of the fundamental problems of machine learning, when the training data consists of sensitive information of individuals. Our work addresses the empirical risk minimization (ERM) framework for classification, in which a classifier is chosen by minimizing the average, over the training data, of the loss the classifier incurs in predicting the label of each training data point. In this work, we focus on regularized ERM, in which there is an additional term in the optimization, called the regularizer, which penalizes the complexity of the classifier with respect to some metric. Regularized ERM methods are widely used in practice, for example in logistic regression and support vector machines (SVMs), and many also have theoretical justification in the form of generalization error bounds with respect to independently, identically distributed (i.i.d.) data (see Vapnik (1998) for further details).
For our privacy measure, we use a definition due to Dwork et al. (2006b), who have proposed a measure for quantifying the privacy risk associated with computing functions of sensitive data. Their $\epsilon_p$-differential privacy model is a strong, cryptographically-motivated definition of privacy that has recently received a significant amount of research attention for its robustness to known attacks, such as those involving side information (Ganta et al., 2008). Algorithms satisfying $\epsilon_p$-differential privacy are randomized; the output is a random variable whose distribution is conditioned on the data set. A statistical procedure satisfies $\epsilon_p$-differential privacy if changing a single data point does not shift the output distribution by too much. Therefore, from looking at the output of the algorithm, it is difficult to infer the value of any particular data point.
In this paper, we develop methods for approximating ERM while guaranteeing $\epsilon_p$-differential privacy. Our results hold for loss functions and regularizers satisfying certain differentiability and convexity conditions. An important aspect of our work is that we develop methods for end-to-end privacy; each step in the learning process can cause additional risk of privacy violation, and we provide algorithms with quantifiable privacy guarantees for training as well as parameter tuning. For training, we provide two privacy-preserving approximations to ERM. The first is output perturbation, based on the sensitivity method proposed by Dwork et al. (2006b). In this method noise is added to the output of the standard ERM algorithm. The second method is novel, and involves adding noise to the regularized ERM objective function prior to minimizing. We call this second method objective perturbation. We show theoretical bounds for both procedures; the theoretical performance of objective perturbation is superior to that of output perturbation for most problems. However, for our results to hold we require that the regularizer be strongly convex (ruling out $L_1$ regularizers), and we impose additional constraints on the loss function and its derivatives. In practice, these additional constraints do not affect the performance of the resulting classifier; we validate our theoretical results on data sets from the UCI repository.
In practice, parameters in learning algorithms are chosen via a holdout data set. In the context of privacy, we must guarantee the privacy of the holdout data as well. We exploit results from the theory of differential privacy to develop a privacy-preserving parameter tuning algorithm, and demonstrate its use in practice. Together with our training algorithms, this parameter tuning algorithm guarantees privacy to all data used in the learning process.
Guaranteeing privacy incurs a cost in performance; because the algorithms must cause some uncertainty in the output, they increase the loss of the output predictor. Because the $\epsilon_p$-differential privacy model requires robustness against all data sets, we make no assumptions on the underlying data for the purposes of making privacy guarantees. However, to quantify the impact of privacy constraints on the generalization error, we assume the data is i.i.d. according to a fixed but unknown distribution, as is standard in the machine learning literature. Although many of our results hold for ERM in general, we provide specific results for classification using logistic regression and support vector machines. Some of the former results were reported in Chaudhuri and Monteleoni (2008); here we generalize them to ERM, extend the results to kernel methods, and provide experiments on real datasets.
More specifically, the contributions of this paper are as follows:
We derive a computationally efficient algorithm for ERM classification, based on the sensitivity method due to Dwork et al. (2006b). We analyze the accuracy of this algorithm, and provide an upper bound on the number of training samples required by this algorithm to achieve a fixed generalization error.
We provide a general technique, objective perturbation, for providing computationally efficient, differentially private approximations to regularized ERM algorithms. This extends the work of Chaudhuri and Monteleoni (2008), which follows as a special case, and corrects an error in the arguments made there. We apply the general results on the sensitivity method and objective perturbation to logistic regression and support vector machine classifiers. In addition to privacy guarantees, we also provide generalization bounds for this algorithm.
For kernel methods with nonlinear kernel functions, the optimal classifier is a linear combination of kernel functions centered at the training points. This form is inherently non-private because it reveals the training data. We adapt a random projection method due to Rahimi and Recht (2007, 2008b) to develop privacy-preserving kernel-ERM algorithms. We provide theoretical results on generalization performance.
Because the holdout data is used in the process of training and releasing a classifier, we provide a privacy-preserving parameter tuning algorithm based on a randomized selection procedure (McSherry and Talwar, 2007) applicable to general machine learning algorithms. This guarantees end-to-end privacy during the learning procedure.
We validate our results using experiments on two datasets from the UCI Machine Learning repositories (Asuncion and Newman, 2007) and KDDCup (Hettich and Bay, 1999). Our results show that objective perturbation is generally superior to output perturbation. We also demonstrate the impact of end-to-end privacy on generalization error.
1.1 Related Work
There has been a significant amount of literature on the ineffectiveness of simple anonymization procedures. For example, Narayanan and Shmatikov (2008) show that a small amount of auxiliary information (knowledge of a few movie ratings, and approximate dates) is sufficient for an adversary to re-identify an individual in the Netflix dataset, which consists of anonymized data about Netflix users and their movie ratings. The same phenomenon has been observed in other kinds of data, such as social network graphs (Backstrom et al., 2007), search query logs (Jones et al., 2007) and others. Releasing statistics computed on sensitive data can also be problematic; for example, Wang et al. (2009) show that releasing $R^2$-values computed on high-dimensional genetic data can lead to privacy breaches by an adversary who is armed with a small amount of auxiliary information.
There has also been a significant amount of work on privacy-preserving data mining (Agrawal and Srikant, 2000; Evfimievski et al., 2003; Sweeney, 2002; Machanavajjhala et al., 2006), spanning several communities, that uses privacy models other than differential privacy. Many of the models used have been shown to be susceptible to composition attacks, attacks in which the adversary has some reasonable amount of prior knowledge (Ganta et al., 2008). Other work (Mangasarian et al., 2008) considers the problem of privacy-preserving SVM classification when separate agents have to share private data, and provides a solution that uses random kernels, but does not provide any formal privacy guarantee.
An alternative line of privacy work is in the Secure Multiparty Computation setting due to Yao (1982), where the sensitive data is split across multiple hostile databases, and the goal is to compute a function on the union of these databases. Zhan and Matwin (2007) and Laur et al. (2006) consider computing privacy-preserving SVMs in this setting, and their goal is to design a distributed protocol to learn a classifier. This is in contrast with our work, which deals with a setting where the algorithm has access to the entire dataset.
Differential privacy, the formal privacy definition used in our paper, was proposed by the seminal work of Dwork et al. (2006b), and has been used since in numerous works on privacy (Chaudhuri and Mishra, 2006; McSherry and Talwar, 2007; Nissim et al., 2007; Barak et al., 2007; Chaudhuri and Monteleoni, 2008; Machanavajjhala et al., 2008). Unlike many other privacy definitions, such as those mentioned above, differential privacy has been shown to be resistant to composition attacks (attacks involving side-information) (Ganta et al., 2008). Some follow-up work on differential privacy includes work on differentially-private combinatorial optimization, due to Gupta et al. (2010), and differentially-private contingency tables, due to Barak et al. (2007) and Kasiviswanathan et al. (2010). Wasserman and Zhou (2010) provide a more statistical view of differential privacy, and Zhou et al. (2009) provide a technique of generating synthetic data using compression via random linear or affine transformations.
Previous literature has also considered learning with differential privacy. One of the first such works is Kasiviswanathan et al. (2008), which presents a general, although computationally inefficient, method for PAC-learning finite concept classes. Blum et al. (2008) presents a method for releasing a database in a differentially-private manner, so that certain fixed classes of queries can be answered accurately, provided the class of queries has a bounded VC-dimension. Their methods can also be used to learn classifiers with a fixed VC-dimension – see Kasiviswanathan et al. (2008); however the resulting algorithm is also computationally inefficient. Some sample complexity lower bounds in this setting have been provided by Beimel et al. (2010). In addition, Dwork and Lei (2009) explore a connection between differential privacy and robust statistics, and provide an algorithm for privacy-preserving regression using ideas from robust statistics. However, their algorithm also requires a running time which is exponential in the data dimension, and is hence computationally inefficient.
This work builds on our preliminary work in Chaudhuri and Monteleoni (2008). We first show how to extend the sensitivity method, a form of output perturbation, due to Dwork et al. (2006b), to classification algorithms. In general, output perturbation methods alter the output of the function computed on the database, before releasing it; in particular the sensitivity method makes an algorithm differentially private by adding noise to its output. In the classification setting, the noise protects the privacy of the training data, but increases the prediction error of the classifier. Recently, independent work by Rubinstein et al. (2009) has reported an extension of the sensitivity method to linear and kernel SVMs. Their utility analysis differs from ours, and thus the analogous generalization bounds are not comparable. Because Rubinstein et al. use techniques from algorithmic stability, their utility bounds compare the private and non-private classifiers using the same value for the regularization parameter. In contrast, our approach takes into account how the value of the regularization parameter might change due to privacy constraints. In addition, we propose the objective perturbation method, in which noise is added to the objective function before optimizing over the space of classifiers. Both the sensitivity method and objective perturbation result in computationally efficient algorithms for our specific case. In general, our theoretical bounds on sample requirement are incomparable with the bounds of Kasiviswanathan et al. (2008) because of the difference between their setting and ours.
Our approach to privacy-preserving tuning uses the exponential mechanism of McSherry and Talwar (2007) by training classifiers with different parameters on disjoint subsets of the data and then randomizing the selection of which classifier to release. This bears a superficial resemblance to the sample-and-aggregate (Nissim et al., 2007) and V-fold cross-validation, but only in the sense that only a part of the data is used to train the classifier. One drawback is that our approach requires significantly more data in practice. Other approaches to selecting the regularization parameter could benefit from a more careful analysis of the regularization parameter, as in Hastie et al. (2004).
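We will use $\|\mathbf{x}\|$, $\|\mathbf{x}\|_{\infty}$, and $\|\mathbf{x}\|_{\mathcal{H}}$ to denote the $\ell_2$-norm, $\ell_\infty$-norm, and norm in a Hilbert space $\mathcal{H}$, respectively. For an integer $n$ we will use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. Vectors will typically be written in boldface and sets in calligraphic type. For a matrix $A$, we will use the notation $\|A\|_2$ to denote the $L_2$ norm of $A$.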
2.1 Empirical Risk Minimization
In this paper we develop privacy-preserving algorithms for regularized empirical risk minimization, a special case of which is learning a classifier from labeled examples. We will phrase our problem in terms of classification and indicate when more general results hold. Our algorithms take as input training data $\mathcal{D} = \{(\mathbf{x}_i, y_i) \in \mathcal{X} \times \mathcal{Y} : i = 1, 2, \ldots, n\}$ of $n$ data-label pairs. In the case of binary classification the data space $\mathcal{X} = \mathbb{R}^d$ and the label set $\mathcal{Y} = \{-1, +1\}$. We will assume throughout that $\mathcal{X}$ is the unit ball so that $\|\mathbf{x}_i\|_2 \le 1$.
We would like to produce a predictor $\mathbf{f} : \mathcal{X} \to \mathcal{Y}$. We measure the quality of our predictor on the training data via a nonnegative loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$.
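In regularized empirical risk minimization (ERM), we choose a predictor $\mathbf{f}$ that minimizes the regularized empirical loss:
$$J(\mathbf{f}, \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \ell(\mathbf{f}(\mathbf{x}_i), y_i) + \Lambda N(\mathbf{f}). \tag{1}$$
This minimization is performed over $\mathbf{f}$ in a hypothesis class $\mathcal{H}$. The regularizer $N(\cdot)$ prevents over-fitting. For the first part of this paper we will restrict our attention to linear predictors and with some abuse of notation we will write $\mathbf{f}(\mathbf{x}) = \mathbf{f}^T \mathbf{x}$.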
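As a minimal sketch of the regularized empirical loss for a linear predictor, the snippet below evaluates the objective with the logistic loss and the regularizer $\frac{1}{2}\|\mathbf{f}\|_2^2$. The function names, the toy data, and the choice of writing the loss as a function of the margin $y\,\mathbf{f}^T\mathbf{x}$ are illustrative assumptions, not part of the original text.

```python
import math

def logistic_loss(z):
    """Nonnegative loss log(1 + exp(-z)), evaluated at the margin z = y * f^T x."""
    # Numerically stable form of log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def regularized_erm_objective(f, data, lam):
    """J(f, D) = (1/n) * sum_i loss(y_i * f^T x_i) + lam * 0.5 * ||f||^2."""
    n = len(data)
    avg_loss = sum(
        logistic_loss(y * sum(fj * xj for fj, xj in zip(f, x))) for x, y in data
    ) / n
    regularizer = 0.5 * sum(fj * fj for fj in f)
    return avg_loss + lam * regularizer

# Hypothetical toy data: each x lies in the unit ball, each label is +1 or -1.
data = [((0.6, 0.0), 1), ((0.0, -0.8), -1), ((-0.5, 0.5), -1)]
print(regularized_erm_objective((0.0, 0.0), data, lam=0.1))  # equals log(2) at f = 0
```

At $\mathbf{f} = 0$ every margin is zero, so the objective reduces to $\log 2$ regardless of the data, which is a convenient sanity check.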
2.2 Assumptions on loss and regularizer
The conditions under which we can prove results on privacy and generalization error depend on analytic properties of the loss and regularizer. In particular, we will require certain forms of convexity (see Rockafellar and Wets (1998)).
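A function $H(\mathbf{f})$ over $\mathbf{f} \in \mathbb{R}^d$ is said to be strictly convex if for all $\alpha \in (0, 1)$, $\mathbf{f}$, and $\mathbf{g}$ ($\mathbf{f} \ne \mathbf{g}$),
$$H(\alpha \mathbf{f} + (1 - \alpha)\mathbf{g}) < \alpha H(\mathbf{f}) + (1 - \alpha) H(\mathbf{g}). \tag{2}$$
It is said to be $\lambda$-strongly convex if for all $\alpha \in (0, 1)$, $\mathbf{f}$, and $\mathbf{g}$,
$$H(\alpha \mathbf{f} + (1 - \alpha)\mathbf{g}) \le \alpha H(\mathbf{f}) + (1 - \alpha) H(\mathbf{g}) - \frac{1}{2}\lambda\alpha(1 - \alpha)\|\mathbf{f} - \mathbf{g}\|_2^2. \tag{3}$$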
A strictly convex function has a unique minimum; see Boyd and Vandenberghe (2004). Strong convexity plays a role in guaranteeing our privacy and generalization requirements. For our privacy results to hold we will also require that the regularizer $N(\cdot)$ and loss function $\ell(\cdot, \cdot)$ be differentiable functions of $\mathbf{f}$. This excludes certain classes of regularizers, such as the $L_1$-norm regularizer $N(\mathbf{f}) = \|\mathbf{f}\|_1$, and classes of loss functions such as the hinge loss $\ell_{\mathrm{SVM}}(\mathbf{f}^T\mathbf{x}, y) = (1 - y\,\mathbf{f}^T\mathbf{x})^{+}$. In some cases we can prove privacy guarantees for approximations to these non-differentiable functions.
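As a short worked example of strong convexity (the symbols $N$, $\mathbf{f}$, $\mathbf{g}$, and $\alpha$ are chosen here for illustration), the squared-norm regularizer $N(\mathbf{f}) = \frac{1}{2}\|\mathbf{f}\|_2^2$ is 1-strongly convex, and the defining inequality in fact holds with equality:

```latex
\begin{align*}
\alpha N(\mathbf{f}) + (1-\alpha) N(\mathbf{g}) - N(\alpha\mathbf{f} + (1-\alpha)\mathbf{g})
 &= \tfrac{1}{2}\Bigl[\alpha\|\mathbf{f}\|_2^2 + (1-\alpha)\|\mathbf{g}\|_2^2
    - \alpha^2\|\mathbf{f}\|_2^2 - (1-\alpha)^2\|\mathbf{g}\|_2^2
    - 2\alpha(1-\alpha)\,\mathbf{f}^T\mathbf{g}\Bigr] \\
 &= \tfrac{1}{2}\,\alpha(1-\alpha)\Bigl[\|\mathbf{f}\|_2^2 + \|\mathbf{g}\|_2^2 - 2\,\mathbf{f}^T\mathbf{g}\Bigr] \\
 &= \tfrac{1}{2}\,\alpha(1-\alpha)\,\|\mathbf{f} - \mathbf{g}\|_2^2 .
\end{align*}
```

Rearranging gives the strong-convexity inequality with $\lambda = 1$. By contrast, $\|\mathbf{f}\|_1$ is linear along any ray from the origin, so no $\lambda > 0$ can work for it.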
2.3 Privacy model
We are interested in producing a classifier in a manner that preserves the privacy of individual entries of the dataset $\mathcal{D}$ that is used in training the classifier. The notion of privacy we use is the $\epsilon_p$-differential privacy model, developed by Dwork et al. (2006b); Dwork (2006), which defines a notion of privacy for a randomized algorithm $\mathcal{A}(\mathcal{D})$. Suppose $\mathcal{A}(\mathcal{D})$ produces a classifier, and let $\mathcal{D}'$ be another dataset that differs from $\mathcal{D}$ in one entry (which we assume is the private value of one person). That is, $\mathcal{D}'$ and $\mathcal{D}$ have $n - 1$ points $(\mathbf{x}_i, y_i)$ in common. The algorithm $\mathcal{A}$ provides differential privacy if for any set $\mathcal{S}$, the likelihood that $\mathcal{A}(\mathcal{D}) \in \mathcal{S}$ is close to the likelihood that $\mathcal{A}(\mathcal{D}') \in \mathcal{S}$ (where the likelihood is over the randomness in the algorithm). That is, any single entry of the dataset does not affect the output distribution of the algorithm by much; dually, this means that an adversary, who knows all but one entry of the dataset, cannot gain much additional information about the last entry by observing the output of the algorithm.
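An algorithm $\mathcal{A}(\mathcal{B})$ taking values in a set $\mathcal{T}$ provides $\epsilon_p$-differential privacy if
$$\sup_{\mathcal{S}} \sup_{\mathcal{D}, \mathcal{D}'} \frac{\mu(\mathcal{S} \mid \mathcal{B} = \mathcal{D})}{\mu(\mathcal{S} \mid \mathcal{B} = \mathcal{D}')} \le e^{\epsilon_p}, \tag{4}$$
where the first supremum is over all measurable $\mathcal{S} \subseteq \mathcal{T}$, the second is over all datasets $\mathcal{D}$ and $\mathcal{D}'$ differing in a single entry, and $\mu(\cdot \mid \mathcal{B})$ is the conditional distribution (measure) on $\mathcal{T}$ induced by the output $\mathcal{A}(\mathcal{B})$ given a dataset $\mathcal{B}$. The ratio is interpreted to be 1 whenever the numerator and denominator are both 0.
Note that if $\mathcal{S}$ is a set of measure 0 under the conditional measures induced by $\mathcal{D}$ and $\mathcal{D}'$, the ratio is automatically 1. A more measure-theoretic definition is given in Zhou et al. (2009). An illustration of the definition is given in Figure 1.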
The following form of the definition is due to Dwork et al. (2006a).
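An algorithm $\mathcal{A}$ provides $\epsilon_p$-differential privacy if for any two datasets $\mathcal{D}$ and $\mathcal{D}'$ that differ in a single entry and for any set $\mathcal{S}$,
$$e^{-\epsilon_p}\, P(\mathcal{A}(\mathcal{D}') \in \mathcal{S}) \le P(\mathcal{A}(\mathcal{D}) \in \mathcal{S}) \le e^{\epsilon_p}\, P(\mathcal{A}(\mathcal{D}') \in \mathcal{S}), \tag{5}$$
where $\mathcal{A}(\mathcal{D})$ (resp. $\mathcal{A}(\mathcal{D}')$) is the output of $\mathcal{A}$ on input $\mathcal{D}$ (resp. $\mathcal{D}'$).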
From this definition, it is clear that the algorithm $\mathcal{A}(\mathcal{D})$ that outputs the minimizer of the ERM objective (1) does not provide $\epsilon_p$-differential privacy for any $\epsilon_p$. This is because an ERM solution is a linear combination of some selected training samples “near” the decision boundary. If $\mathcal{D}$ and $\mathcal{D}'$ differ in one of these samples, then the classifier will change completely, making the likelihood ratio in (5) infinite. Regularization helps by penalizing the norm of the change, but does not account for how the direction of the minimizer is sensitive to changes in the data.
Dwork et al. (2006b) also provide a standard recipe for computing privacy-preserving approximations to functions by adding noise with a particular distribution to the output of the function. We call this recipe the sensitivity method. Let $g(\mathcal{D})$ be a scalar function of $z_1, \ldots, z_n$, where $z_i \in \mathcal{D}$ corresponds to the private value of individual $i$; then the sensitivity of $g$ is defined as follows.
The sensitivity of a function $g$ is the maximum difference between the values of the function when one input changes. More formally, the sensitivity $S(g)$ of $g$ is defined as:
$$S(g) = \max_{i} \max_{z_1, \ldots, z_n, z_i'} \bigl| g(z_1, \ldots, z_{i-1}, z_i, z_{i+1}, \ldots, z_n) - g(z_1, \ldots, z_{i-1}, z_i', z_{i+1}, \ldots, z_n) \bigr|. \tag{6}$$
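To make the sensitivity method concrete for a scalar function, the following sketch releases the mean of values in $[0, 1]$ with Laplace noise of scale $S(g)/\epsilon_p$, which is the standard calibration for scalar queries. The function names, the example values, and the choice of the mean as the query are illustrative assumptions.

```python
import random

def laplace_noise(scale, rng):
    """Draw one sample from a zero-mean Laplace distribution with the given scale."""
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_mean(values, epsilon, rng):
    """Release the mean of values in [0, 1] using noise calibrated to its sensitivity."""
    n = len(values)
    # Changing one value in [0, 1] moves the mean by at most 1/n.
    sensitivity = 1.0 / n
    return sum(values) / n + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)
values = [0.2, 0.9, 0.4, 0.7, 0.1]  # hypothetical private values
print(private_mean(values, epsilon=0.5, rng=rng))
```

The noise scale shrinks as $n$ grows, because a single entry has less influence on the mean of a larger dataset.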
3 Privacy-preserving ERM
Here we describe two approaches for creating privacy-preserving algorithms from (1).
3.1 Output perturbation : the sensitivity method
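Algorithm 1 is derived from the sensitivity method of Dwork et al. (2006b), a general method for generating a privacy-preserving approximation to any function $A(\cdot)$. In this section the norm $\|\cdot\|$ is the $L_2$-norm unless otherwise specified. For the function $A(\mathcal{D}) = \operatorname{argmin}_{\mathbf{f}} J(\mathbf{f}, \mathcal{D})$, Algorithm 1 outputs a vector $A(\mathcal{D}) + \mathbf{b}$, where $\mathbf{b}$ is random noise with density
$$\nu(\mathbf{b}) = \frac{1}{\alpha} e^{-\beta \|\mathbf{b}\|}, \tag{7}$$
where $\alpha$ is a normalizing constant. The parameter $\beta$ is a function of $\epsilon_p$ and the $L_2$-sensitivity of $A(\cdot)$, which is defined as follows.
The $L_2$-sensitivity of a vector-valued function is defined as the maximum change in the $L_2$ norm of the value of the function when one input changes. More formally,
$$S(A) = \max_{i} \max_{z_1, \ldots, z_n, z_i'} \bigl\| A(z_1, \ldots, z_i, \ldots, z_n) - A(z_1, \ldots, z_i', \ldots, z_n) \bigr\|. \tag{8}$$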
The interested reader is referred to Dwork et al. (2006b) for further details. Adding noise to the output of $A(\cdot)$ has the effect of masking the effect of any particular data point. However, in some applications the sensitivity of the minimizer $\operatorname{argmin}_{\mathbf{f}} J(\mathbf{f}, \mathcal{D})$ may be quite high, which would require the sensitivity method to add noise with high variance.
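The following is a minimal sketch of output perturbation under stated assumptions: the loss is logistic, the regularizer is $\frac{1}{2}\|\mathbf{f}\|_2^2$, the minimizer is approximated by plain gradient descent, and the noise parameter is taken to be $\beta = n\Lambda\epsilon_p/2$, matching an $L_2$-sensitivity of $2/(n\Lambda)$. All function names and the toy data are hypothetical. A vector with density proportional to $e^{-\beta\|\mathbf{b}\|}$ can be drawn by sampling its norm from a Gamma distribution with shape $d$ and scale $1/\beta$ and its direction uniformly on the sphere.

```python
import math
import random

def sample_noise(d, beta, rng):
    """Draw b in R^d with density proportional to exp(-beta * ||b||_2)."""
    # The norm follows Gamma(shape=d, scale=1/beta); the direction is uniform.
    norm = rng.gammavariate(d, 1.0 / beta)
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    g_norm = math.sqrt(sum(v * v for v in g))
    return [norm * v / g_norm for v in g]

def erm_logistic(data, lam, steps=2000, lr=0.5):
    """Approximate the regularized logistic-regression minimizer by gradient descent."""
    n, d = len(data), len(data[0][0])
    f = [0.0] * d
    for _ in range(steps):
        grad = [lam * fj for fj in f]  # gradient of (lam / 2) * ||f||^2
        for x, y in data:
            margin = y * sum(fj * xj for fj, xj in zip(f, x))
            coeff = -y / (1.0 + math.exp(margin)) / n  # y * loss'(margin) / n
            for j in range(d):
                grad[j] += coeff * x[j]
        f = [fj - lr * gj for fj, gj in zip(f, grad)]
    return f

def output_perturbation(data, lam, epsilon, rng):
    """Return the (approximate) ERM minimizer plus noise with beta = n * lam * epsilon / 2."""
    n, d = len(data), len(data[0][0])
    beta = n * lam * epsilon / 2.0
    f_star = erm_logistic(data, lam)
    b = sample_noise(d, beta, rng)
    return [fj + bj for fj, bj in zip(f_star, b)]

# Hypothetical toy data with ||x|| <= 1 and labels in {-1, +1}.
data = [((0.6, 0.1), 1), ((0.2, -0.8), -1), ((-0.5, 0.5), -1), ((0.7, 0.6), 1)]
print(output_perturbation(data, lam=0.1, epsilon=1.0, rng=random.Random(0)))
```

The privacy analysis is stated for the exact minimizer; an iterative solver only approximates it, so this sketch illustrates the mechanism rather than a certified implementation.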
3.2 Objective perturbation
A different approach, first proposed by Chaudhuri and Monteleoni (2008), is to add noise to the objective function itself and then produce the minimizer of the perturbed objective. That is, we can minimize
$$J_{\mathrm{priv}}(\mathbf{f}, \mathcal{D}) = J(\mathbf{f}, \mathcal{D}) + \frac{1}{n} \mathbf{b}^T \mathbf{f}, \tag{9}$$
where $\mathbf{b}$ has density given by (7), with $\beta = \epsilon_p$. Note that the privacy parameter here does not depend on the sensitivity of the classification algorithm.
The algorithm requires a certain slack, $\log\left(1 + \frac{2c}{n\Lambda} + \frac{c^2}{n^2\Lambda^2}\right)$, in the privacy parameter. This is due to additional factors in bounding the ratio of the densities. The “If” statement in the algorithm is from having to consider two cases in the proof of Theorem 2, which shows that the algorithm is differentially private.
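The two cases can be sketched as a small parameter computation. This is an illustration under assumptions rather than a transcription of Algorithm 2: it assumes the slack is $\log(1 + 2c/(n\Lambda) + c^2/(n^2\Lambda^2))$, that when the slack exhausts the budget an extra regularization $\Delta$ is chosen so that $(1 + c/(n(\Lambda + \Delta)))^2 = e^{\epsilon_p/2}$, and that the remaining budget $\epsilon_p'$ is what calibrates the noise. The function names are hypothetical.

```python
import math

def objective_perturbation_params(n, lam, c, epsilon):
    """Return (epsilon_prime, delta): remaining privacy budget and extra regularization."""
    slack = math.log(1.0 + 2.0 * c / (n * lam) + (c / (n * lam)) ** 2)
    epsilon_prime = epsilon - slack
    if epsilon_prime > 0:
        delta = 0.0  # the slack fits within the budget; no extra regularization
    else:
        # Choose delta so that (1 + c / (n * (lam + delta)))**2 == exp(epsilon / 2).
        delta = c / (n * (math.exp(epsilon / 4.0) - 1.0)) - lam
        epsilon_prime = epsilon / 2.0
    return epsilon_prime, delta

def jacobian_factor(n, lam, c, delta):
    """Upper bound (1 + c / (n * (lam + delta)))**2 on the ratio of Jacobian determinants."""
    return (1.0 + c / (n * (lam + delta))) ** 2

for n, lam, c, eps in [(1000, 0.1, 0.25, 1.0), (100, 0.01, 0.25, 0.1)]:
    eps_prime, delta = objective_perturbation_params(n, lam, c, eps)
    # In both cases the Jacobian factor is at most exp(eps - eps_prime).
    print(eps_prime, delta, jacobian_factor(n, lam, c, delta) <= math.exp(eps - eps_prime) + 1e-12)
```

The first setting falls in the case $\Delta = 0$; the second has a small $n\Lambda$, so the slack exceeds $\epsilon_p$ and a positive $\Delta$ is needed.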
3.3 Privacy guarantees
In this section, we establish the conditions under which Algorithms 1 and 2 provide $\epsilon_p$-differential privacy. First, we establish guarantees for Algorithm 1.
3.3.1 Privacy Guarantees for Output Perturbation
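Theorem 1. If $N(\cdot)$ is differentiable and 1-strongly convex, and $\ell$ is convex and differentiable, with $|\ell'(z)| \le 1$ for all $z$, then Algorithm 1 provides $\epsilon_p$-differential privacy.
Proof. From Corollary 1, if the conditions on $N(\cdot)$ and $\ell$ hold, then the $L_2$-sensitivity of ERM with regularization parameter $\Lambda$ is at most $\frac{2}{n\Lambda}$. We observe that when we pick $\mathbf{b}$ from the distribution in Algorithm 1, for a specific vector $\mathbf{b}_0 \in \mathbb{R}^d$, the density at $\mathbf{b}_0$ is proportional to $e^{-\frac{n\Lambda\epsilon_p}{2}\|\mathbf{b}_0\|}$. Let $\mathcal{D}$ and $\mathcal{D}'$ be any two datasets that differ in the value of one individual. Then, for any $\mathbf{f}$,
$$\frac{g(\mathbf{f} \mid \mathcal{D})}{g(\mathbf{f} \mid \mathcal{D}')} = \frac{\nu(\mathbf{b}_1)}{\nu(\mathbf{b}_2)} = e^{-\frac{n\Lambda\epsilon_p}{2}\left(\|\mathbf{b}_1\| - \|\mathbf{b}_2\|\right)},$$
where $\mathbf{b}_1$ and $\mathbf{b}_2$ are the corresponding noise vectors chosen in Step 1 of Algorithm 1, and $g(\mathbf{f} \mid \mathcal{D})$ ($g(\mathbf{f} \mid \mathcal{D}')$ respectively) is the density of the output of Algorithm 1 at $\mathbf{f}$, when the input is $\mathcal{D}$ ($\mathcal{D}'$ respectively). If $\mathbf{f}_1$ and $\mathbf{f}_2$ are the solutions respectively to non-private regularized ERM when the input is $\mathcal{D}$ and $\mathcal{D}'$, then $\mathbf{b}_2 - \mathbf{b}_1 = \mathbf{f}_1 - \mathbf{f}_2$. From Corollary 1, and using a triangle inequality,
$$\bigl|\, \|\mathbf{b}_1\| - \|\mathbf{b}_2\| \,\bigr| \le \|\mathbf{b}_1 - \mathbf{b}_2\| = \|\mathbf{f}_2 - \mathbf{f}_1\| \le \frac{2}{n\Lambda}.$$
Moreover, by symmetry, the directions of $\mathbf{b}_1$ and $\mathbf{b}_2$ are uniformly distributed. Therefore, by construction, $\frac{\nu(\mathbf{b}_1)}{\nu(\mathbf{b}_2)} \le e^{\epsilon_p}$. The theorem follows. ∎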
The main ingredient of the proof of Theorem 1 is a result about the sensitivity of regularized ERM, which is provided below.
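Lemma 1. Let $G(\mathbf{f})$ and $g(\mathbf{f})$ be two functions of a vector $\mathbf{f}$, which are continuous, and differentiable at all points. Moreover, let $G(\mathbf{f})$ and $G(\mathbf{f}) + g(\mathbf{f})$ be $\lambda$-strongly convex. If $\mathbf{f}_1 = \operatorname{argmin}_{\mathbf{f}} G(\mathbf{f})$ and $\mathbf{f}_2 = \operatorname{argmin}_{\mathbf{f}} G(\mathbf{f}) + g(\mathbf{f})$, then
$$\|\mathbf{f}_1 - \mathbf{f}_2\| \le \frac{1}{\lambda} \max_{\mathbf{f}} \|\nabla g(\mathbf{f})\|.$$
Proof. Using the definition of $\mathbf{f}_1$ and $\mathbf{f}_2$, and the fact that $G$ and $g$ are continuous and differentiable everywhere,
$$\nabla G(\mathbf{f}_1) = \nabla G(\mathbf{f}_2) + \nabla g(\mathbf{f}_2) = 0.$$
Because $G$ is $\lambda$-strongly convex, $(\nabla G(\mathbf{f}_1) - \nabla G(\mathbf{f}_2))^T(\mathbf{f}_1 - \mathbf{f}_2) \ge \lambda\|\mathbf{f}_1 - \mathbf{f}_2\|^2$. Since $\nabla G(\mathbf{f}_1) - \nabla G(\mathbf{f}_2) = \nabla g(\mathbf{f}_2)$, the Cauchy-Schwarz inequality gives $\|\mathbf{f}_1 - \mathbf{f}_2\| \cdot \|\nabla g(\mathbf{f}_2)\| \ge \lambda\|\mathbf{f}_1 - \mathbf{f}_2\|^2$, and dividing both sides by $\lambda\|\mathbf{f}_1 - \mathbf{f}_2\|$ yields the claim. ∎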
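Corollary 1. If $N(\cdot)$ is differentiable and 1-strongly convex, and $\ell$ is convex and differentiable with $|\ell'(z)| \le 1$ for all $z$, then the $L_2$-sensitivity of $\operatorname{argmin}_{\mathbf{f}} J(\mathbf{f}, \mathcal{D})$ is at most $\frac{2}{n\Lambda}$.
Proof. Let $\mathcal{D} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ and $\mathcal{D}' = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n', y_n')\}$ be two datasets that differ in the value of the $n$-th individual. Moreover, we let $G(\mathbf{f}) = J(\mathbf{f}, \mathcal{D})$, $g(\mathbf{f}) = J(\mathbf{f}, \mathcal{D}') - J(\mathbf{f}, \mathcal{D})$, $\mathbf{f}_1 = \operatorname{argmin}_{\mathbf{f}} J(\mathbf{f}, \mathcal{D})$, and $\mathbf{f}_2 = \operatorname{argmin}_{\mathbf{f}} J(\mathbf{f}, \mathcal{D}')$. Finally, we set $g(\mathbf{f}) = \frac{1}{n}\left(\ell(y_n' \mathbf{f}^T \mathbf{x}_n') - \ell(y_n \mathbf{f}^T \mathbf{x}_n)\right)$.
We observe that due to the convexity of $\ell$, and 1-strong convexity of $N(\cdot)$, $G(\mathbf{f}) = J(\mathbf{f}, \mathcal{D})$ is $\Lambda$-strongly convex. Moreover, $G(\mathbf{f}) + g(\mathbf{f}) = J(\mathbf{f}, \mathcal{D}')$ is also $\Lambda$-strongly convex. Finally, due to the differentiability of $N(\cdot)$ and $\ell$, $G(\mathbf{f})$ and $g(\mathbf{f})$ are also differentiable at all points. We have:
$$\nabla g(\mathbf{f}) = \frac{1}{n}\left(y_n' \ell'(y_n' \mathbf{f}^T \mathbf{x}_n') \mathbf{x}_n' - y_n \ell'(y_n \mathbf{f}^T \mathbf{x}_n) \mathbf{x}_n\right).$$
As $y_i \in [-1, 1]$, $|\ell'(z)| \le 1$ for all $z$, and $\|\mathbf{x}_i\| \le 1$, for any $\mathbf{f}$, $\|\nabla g(\mathbf{f})\| \le \frac{1}{n}\left(\|\mathbf{x}_n\| + \|\mathbf{x}_n'\|\right) \le \frac{2}{n}$. The proof now follows by an application of Lemma 1 with $\lambda = \Lambda$. ∎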
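The sensitivity bound $2/(n\Lambda)$ can be checked numerically in one dimension. The sketch below is an illustration under assumptions (logistic loss, regularizer $\frac{1}{2}f^2$, hypothetical toy data): it finds the minimizer by bisection on the derivative of the objective, replaces one data point, and compares the change in the minimizer against the bound.

```python
import math

def erm_minimizer_1d(data, lam, iters=200):
    """Minimize (1/n) sum log(1 + exp(-y x f)) + (lam / 2) f^2 over scalar f by bisection."""
    n = len(data)

    def grad(f):
        # Derivative of the objective; it is strictly increasing in f.
        return sum(-y * x / (1.0 + math.exp(y * x * f)) for x, y in data) / n + lam * f

    # With |x| <= 1 the data term of the derivative has magnitude below 1,
    # so the root lies in [-1/lam, 1/lam].
    lo, hi = -1.0 / lam, 1.0 / lam
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if grad(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

lam = 0.5
data = [(0.9, 1), (-0.4, -1), (0.3, 1), (-0.7, 1)]  # hypothetical dataset D
neighbor = data[:-1] + [(1.0, -1)]                   # D' differs in the last entry
f1 = erm_minimizer_1d(data, lam)
f2 = erm_minimizer_1d(neighbor, lam)
bound = 2.0 / (len(data) * lam)
print(abs(f1 - f2), bound, abs(f1 - f2) <= bound)
```

A single pair of datasets cannot establish a worst-case bound, but the check is a quick way to catch a mis-specified constant.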
3.3.2 Privacy Guarantees for Objective Perturbation
In this section, we show that Algorithm 2 is $\epsilon_p$-differentially private. This proof requires stronger assumptions on the loss function than were required in Theorem 1. In certain cases, some of these assumptions can be weakened; for such an example, see Section 3.4.2.
Theorem 2. If $N(\cdot)$ is 1-strongly convex and doubly differentiable, and $\ell(\cdot)$ is convex and doubly differentiable, with $|\ell'(z)| \le 1$ and $|\ell''(z)| \le c$ for all $z$, then Algorithm 2 is $\epsilon_p$-differentially private.
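Proof. Consider an $\mathbf{f}_{\mathrm{priv}}$ output by Algorithm 2. We observe that given any fixed $\mathbf{f}_{\mathrm{priv}}$ and a fixed dataset $\mathcal{D}$, there always exists a $\mathbf{b}$ such that Algorithm 2 outputs $\mathbf{f}_{\mathrm{priv}}$ on input $\mathcal{D}$. Because $\ell$ is differentiable and convex, and $N(\cdot)$ is differentiable, we can take the gradient of the objective function and set it to $\mathbf{0}$ at $\mathbf{f}_{\mathrm{priv}}$. Therefore,
$$\mathbf{b} = -n\Lambda \nabla N(\mathbf{f}_{\mathrm{priv}}) - \sum_{i=1}^{n} y_i \ell'(y_i \mathbf{f}_{\mathrm{priv}}^T \mathbf{x}_i)\, \mathbf{x}_i - n\Delta\, \mathbf{f}_{\mathrm{priv}}. \tag{17}$$
Note that (17) holds because for any $\mathbf{f}$, $\nabla \ell(\mathbf{f}^T \mathbf{x}) = \ell'(\mathbf{f}^T \mathbf{x})\, \mathbf{x}$.
We claim that as $\ell$ is differentiable and $J(\mathbf{f}, \mathcal{D}) + \frac{\Delta}{2}\|\mathbf{f}\|^2$ is strongly convex, given a dataset $\mathcal{D}$, there is a bijection between $\mathbf{b}$ and $\mathbf{f}_{\mathrm{priv}}$. The relation (17) shows that two different $\mathbf{b}$ values cannot result in the same $\mathbf{f}_{\mathrm{priv}}$. Furthermore, since the objective is strictly convex, for a fixed $\mathbf{b}$ and $\mathcal{D}$, there is a unique $\mathbf{f}_{\mathrm{priv}}$; therefore the map from $\mathbf{b}$ to $\mathbf{f}_{\mathrm{priv}}$ is injective. The relation (17) also shows that for any $\mathbf{f}_{\mathrm{priv}}$ there exists a $\mathbf{b}$ for which $\mathbf{f}_{\mathrm{priv}}$ is the minimizer, so the map from $\mathbf{b}$ to $\mathbf{f}_{\mathrm{priv}}$ is surjective.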
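To show $\epsilon_p$-differential privacy, we need to compute the ratio $g(\mathbf{f}_{\mathrm{priv}} \mid \mathcal{D}) / g(\mathbf{f}_{\mathrm{priv}} \mid \mathcal{D}')$ of the densities of $\mathbf{f}_{\mathrm{priv}}$ under the two datasets $\mathcal{D}$ and $\mathcal{D}'$. This ratio can be written as (Billingsley, 1995)
$$\frac{g(\mathbf{f}_{\mathrm{priv}} \mid \mathcal{D})}{g(\mathbf{f}_{\mathrm{priv}} \mid \mathcal{D}')} = \frac{\mu(\mathbf{b} \mid \mathcal{D})}{\mu(\mathbf{b}' \mid \mathcal{D}')} \cdot \frac{\left|\det(\mathbf{J}(\mathbf{f}_{\mathrm{priv}} \to \mathbf{b} \mid \mathcal{D}))\right|^{-1}}{\left|\det(\mathbf{J}(\mathbf{f}_{\mathrm{priv}} \to \mathbf{b}' \mid \mathcal{D}'))\right|^{-1}},$$
where $\mathbf{J}(\mathbf{f}_{\mathrm{priv}} \to \mathbf{b} \mid \mathcal{D})$, $\mathbf{J}(\mathbf{f}_{\mathrm{priv}} \to \mathbf{b}' \mid \mathcal{D}')$ are the Jacobian matrices of the mappings from $\mathbf{f}_{\mathrm{priv}}$ to $\mathbf{b}$, and $\mu(\mathbf{b} \mid \mathcal{D})$ and $\mu(\mathbf{b}' \mid \mathcal{D}')$ are the densities of $\mathbf{b}$ given the output $\mathbf{f}_{\mathrm{priv}}$, when the datasets are $\mathcal{D}$ and $\mathcal{D}'$ respectively.
First, we bound the ratio of the Jacobian determinants. Let $\mathbf{b}^{(j)}$ denote the $j$-th coordinate of $\mathbf{b}$. From (17) we have
$$\mathbf{b}^{(j)} = -n\Lambda \nabla N(\mathbf{f}_{\mathrm{priv}})^{(j)} - \sum_{i=1}^{n} y_i \ell'(y_i \mathbf{f}_{\mathrm{priv}}^T \mathbf{x}_i)\, \mathbf{x}_i^{(j)} - n\Delta\, \mathbf{f}_{\mathrm{priv}}^{(j)}.$$
Given a dataset $\mathcal{D}$, the $(j, k)$-th entry of the Jacobian matrix $\mathbf{J}(\mathbf{f}_{\mathrm{priv}} \to \mathbf{b} \mid \mathcal{D})$ is
$$\frac{\partial \mathbf{b}^{(j)}}{\partial \mathbf{f}_{\mathrm{priv}}^{(k)}} = -n\Lambda \nabla^2 N(\mathbf{f}_{\mathrm{priv}})^{(j,k)} - \sum_{i=1}^{n} y_i^2 \ell''(y_i \mathbf{f}_{\mathrm{priv}}^T \mathbf{x}_i)\, \mathbf{x}_i^{(j)} \mathbf{x}_i^{(k)} - n\Delta\, \mathbf{1}(j = k),$$
where $\mathbf{1}(\cdot)$ is the indicator function. We note that the Jacobian is defined for all $\mathbf{f}_{\mathrm{priv}}$ because $N(\cdot)$ and $\ell$ are globally doubly differentiable.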
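Let $\mathcal{D}$ and $\mathcal{D}'$ be two datasets which differ in the value of the $n$-th item such that $\mathcal{D} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_{n-1}, y_{n-1}), (\mathbf{x}_n, y_n)\}$ and $\mathcal{D}' = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_{n-1}, y_{n-1}), (\mathbf{x}_n', y_n')\}$. Moreover, we define matrices $A$ and $E$ as follows:
$$A = n\Lambda \nabla^2 N(\mathbf{f}_{\mathrm{priv}}) + \sum_{i=1}^{n} y_i^2 \ell''(y_i \mathbf{f}_{\mathrm{priv}}^T \mathbf{x}_i)\, \mathbf{x}_i \mathbf{x}_i^T + n\Delta I_d,$$
$$E = -y_n^2 \ell''(y_n \mathbf{f}_{\mathrm{priv}}^T \mathbf{x}_n)\, \mathbf{x}_n \mathbf{x}_n^T + (y_n')^2 \ell''(y_n' \mathbf{f}_{\mathrm{priv}}^T \mathbf{x}_n')\, \mathbf{x}_n' \mathbf{x}_n'^T.$$
Then, $\mathbf{J}(\mathbf{f}_{\mathrm{priv}} \to \mathbf{b} \mid \mathcal{D}) = -A$, and $\mathbf{J}(\mathbf{f}_{\mathrm{priv}} \to \mathbf{b}' \mid \mathcal{D}') = -(A + E)$.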
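Let $\lambda_1(M)$ and $\lambda_2(M)$ denote the largest and second largest eigenvalues of a matrix $M$. As $E$ has rank at most 2, from Lemma 2,
$$\frac{\left|\det(\mathbf{J}(\mathbf{f}_{\mathrm{priv}} \to \mathbf{b}' \mid \mathcal{D}'))\right|}{\left|\det(\mathbf{J}(\mathbf{f}_{\mathrm{priv}} \to \mathbf{b} \mid \mathcal{D}))\right|} = \frac{|\det(A + E)|}{|\det A|} = \left|1 + \lambda_1(A^{-1}E) + \lambda_2(A^{-1}E) + \lambda_1(A^{-1}E)\lambda_2(A^{-1}E)\right|.$$
For a 1-strongly convex function $N$, the Hessian $\nabla^2 N(\mathbf{f}_{\mathrm{priv}})$ has eigenvalues at least 1 (Boyd and Vandenberghe, 2004). Since we have assumed $\ell$ is doubly differentiable and convex, any eigenvalue of $A$ is therefore at least $n\Lambda + n\Delta$; therefore, for $j = 1, 2$, $|\lambda_j(A^{-1}E)| \le \frac{|\lambda_j(E)|}{n(\Lambda + \Delta)}$. Applying the triangle inequality to the trace norm:
$$|\lambda_1(E)| + |\lambda_2(E)| \le \left|y_n^2 \ell''(y_n \mathbf{f}_{\mathrm{priv}}^T \mathbf{x}_n)\right| \cdot \|\mathbf{x}_n\|^2 + \left|(y_n')^2 \ell''(y_n' \mathbf{f}_{\mathrm{priv}}^T \mathbf{x}_n')\right| \cdot \|\mathbf{x}_n'\|^2.$$
Then upper bounds on $|y_i|$, $\|\mathbf{x}_i\|$, and $|\ell''(z)|$ yield
$$|\lambda_1(E)| + |\lambda_2(E)| \le 2c.$$
Therefore, $|\lambda_1(E)| \cdot |\lambda_2(E)| \le c^2$, and
$$\frac{|\det(A + E)|}{|\det A|} \le 1 + \frac{2c}{n(\Lambda + \Delta)} + \frac{c^2}{n^2(\Lambda + \Delta)^2} = \left(1 + \frac{c}{n(\Lambda + \Delta)}\right)^2.$$
We now consider two cases. In the first case, $\Delta = 0$, and by definition, in that case, $1 + \frac{2c}{n\Lambda} + \frac{c^2}{n^2\Lambda^2} \le e^{\epsilon_p - \epsilon_p'}$. In the second case, $\Delta > 0$, and in this case, by definition of $\Delta$, $\left(1 + \frac{c}{n(\Lambda + \Delta)}\right)^2 = e^{\epsilon_p/2} = e^{\epsilon_p - \epsilon_p'}$.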
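Next, we bound the ratio of the densities of $\mathbf{b}$. We observe that as $|\ell'(z)| \le 1$ for any $z$ and $|y_i|, \|\mathbf{x}_i\| \le 1$, for datasets $\mathcal{D}$ and $\mathcal{D}'$ which differ by one value,
$$\mathbf{b}' - \mathbf{b} = y_n \ell'(y_n \mathbf{f}_{\mathrm{priv}}^T \mathbf{x}_n)\, \mathbf{x}_n - y_n' \ell'(y_n' \mathbf{f}_{\mathrm{priv}}^T \mathbf{x}_n')\, \mathbf{x}_n'.$$
This implies that:
$$\bigl|\, \|\mathbf{b}\| - \|\mathbf{b}'\| \,\bigr| \le \|\mathbf{b} - \mathbf{b}'\| \le 2.$$
We can write:
$$\frac{\mu(\mathbf{b} \mid \mathcal{D})}{\mu(\mathbf{b}' \mid \mathcal{D}')} = \frac{\|\mathbf{b}\|^{d-1} e^{-\epsilon_p' \|\mathbf{b}\|/2} \cdot \frac{1}{\mathrm{surf}(\|\mathbf{b}\|)}}{\|\mathbf{b}'\|^{d-1} e^{-\epsilon_p' \|\mathbf{b}'\|/2} \cdot \frac{1}{\mathrm{surf}(\|\mathbf{b}'\|)}} = e^{\epsilon_p' (\|\mathbf{b}'\| - \|\mathbf{b}\|)/2} \le e^{\epsilon_p'},$$
where $\mathrm{surf}(x)$ denotes the surface area of the sphere in $d$ dimensions with radius $x$. Here the last step follows from the fact that $\mathrm{surf}(x) = \mathrm{surf}(1)\, x^{d-1}$, where $\mathrm{surf}(1)$ is the surface area of the unit sphere in $\mathbb{R}^d$.
Finally, we are ready to bound the ratio of densities:
$$\frac{g(\mathbf{f}_{\mathrm{priv}} \mid \mathcal{D})}{g(\mathbf{f}_{\mathrm{priv}} \mid \mathcal{D}')} = \frac{\mu(\mathbf{b} \mid \mathcal{D})}{\mu(\mathbf{b}' \mid \mathcal{D}')} \cdot \frac{\left|\det(\mathbf{J}(\mathbf{f}_{\mathrm{priv}} \to \mathbf{b}' \mid \mathcal{D}'))\right|}{\left|\det(\mathbf{J}(\mathbf{f}_{\mathrm{priv}} \to \mathbf{b} \mid \mathcal{D}))\right|} \le e^{\epsilon_p'} \cdot e^{\epsilon_p - \epsilon_p'} = e^{\epsilon_p}.$$
Thus, Algorithm 2 satisfies Definition 2. ∎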
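Lemma 2. If $A$ is full rank, and if $E$ has rank at most 2, then,
$$\frac{\det(A + E) - \det(A)}{\det(A)} = \lambda_1(A^{-1}E) + \lambda_2(A^{-1}E) + \lambda_1(A^{-1}E)\lambda_2(A^{-1}E),$$
where $\lambda_j(Z)$ is the $j$-th eigenvalue of matrix $Z$.
Proof. Note that $E$ has rank at most 2, so $A^{-1}E$ also has rank at most 2. Using the fact that $\lambda_i(I + A^{-1}E) = 1 + \lambda_i(A^{-1}E)$,
$$\frac{\det(A + E) - \det(A)}{\det(A)} = \det(I + A^{-1}E) - 1 = (1 + \lambda_1(A^{-1}E))(1 + \lambda_2(A^{-1}E)) - 1. \; ∎$$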
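The determinant identity for a rank-2 perturbation can be sanity-checked numerically. In the sketch below, which uses hypothetical matrices chosen for illustration, $A$ is diagonal (so its inverse is trivial) and $E = \mathbf{u}\mathbf{u}^T - \mathbf{v}\mathbf{v}^T$ has rank at most 2. Because $M = A^{-1}E$ has at most two nonzero eigenvalues, their sum is $\mathrm{tr}(M)$ and their product is $\frac{1}{2}(\mathrm{tr}(M)^2 - \mathrm{tr}(M^2))$, so no eigenvalue solver is needed.

```python
def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

a_diag = [2.0, 3.0, 5.0]                      # A = diag(2, 3, 5), full rank
u, v = [0.5, -0.2, 0.1], [0.3, 0.4, -0.6]
E = [[p - q for p, q in zip(ru, rv)] for ru, rv in zip(outer(u, u), outer(v, v))]

A = [[a_diag[i] if i == j else 0.0 for j in range(3)] for i in range(3)]
A_plus_E = [[A[i][j] + E[i][j] for j in range(3)] for i in range(3)]
M = [[E[i][j] / a_diag[i] for j in range(3)] for i in range(3)]  # M = A^{-1} E

trace = sum(M[i][i] for i in range(3))
trace_sq = sum(M[i][j] * M[j][i] for i in range(3) for j in range(3))  # tr(M^2)
eig_sum = trace                            # lambda_1 + lambda_2
eig_prod = 0.5 * (trace ** 2 - trace_sq)   # lambda_1 * lambda_2

lhs = (det3(A_plus_E) - det3(A)) / det3(A)
rhs = eig_sum + eig_prod
print(lhs, rhs)
```

The two printed values should agree up to floating-point rounding.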
3.4 Application to classification
In this section, we show how to use our results to provide privacy-preserving versions of logistic regression and support vector machines.
3.4.1 Logistic Regression
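One popular ERM classification algorithm is regularized logistic regression. In this case, $N(\mathbf{f}) = \frac{1}{2}\|\mathbf{f}\|^2$, and the loss function is $\ell_{\mathrm{LR}}(z) = \log(1 + e^{-z})$. Taking derivatives and double derivatives,
$$\ell_{\mathrm{LR}}'(z) = \frac{-1}{1 + e^{z}}, \qquad \ell_{\mathrm{LR}}''(z) = \frac{1}{(1 + e^{-z})(1 + e^{z})}.$$
Note that $\ell_{\mathrm{LR}}$ is continuous, differentiable and doubly differentiable, with $|\ell_{\mathrm{LR}}'(z)| \le 1$ and $\ell_{\mathrm{LR}}''(z) \le \frac{1}{4}$, so that $c \le \frac{1}{4}$.
The output of Algorithm 1 with $N(\mathbf{f}) = \frac{1}{2}\|\mathbf{f}\|^2$ and $\ell = \ell_{\mathrm{LR}}$ is an $\epsilon_p$-differentially private approximation to logistic regression. The output of Algorithm 2 with $N(\mathbf{f}) = \frac{1}{2}\|\mathbf{f}\|^2$, $c = \frac{1}{4}$, and $\ell = \ell_{\mathrm{LR}}$ is an $\epsilon_p$-differentially private approximation to logistic regression.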
We quantify how well the outputs of Algorithms 1 and 2 approximate (non-private) logistic regression in Section 4.
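As a quick numerical check of the derivative bounds used for logistic regression, the sketch below evaluates the first and second derivatives of $\log(1 + e^{-z})$ on a grid. The function names and grid are illustrative; the first derivative stays within $[-1, 0)$ and the second derivative peaks at $\frac{1}{4}$ at $z = 0$.

```python
import math

def lr_loss(z):
    """Logistic loss log(1 + exp(-z)) in a numerically stable form."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def lr_first_derivative(z):
    """Derivative of the logistic loss: -1 / (1 + exp(z))."""
    return -1.0 / (1.0 + math.exp(z))

def lr_second_derivative(z):
    """Second derivative of the logistic loss: 1 / ((1 + exp(-z)) * (1 + exp(z)))."""
    return 1.0 / ((1.0 + math.exp(-z)) * (1.0 + math.exp(z)))

grid = [i / 100.0 for i in range(-2000, 2001)]  # z in [-20, 20]
max_abs_first = max(abs(lr_first_derivative(z)) for z in grid)
max_second = max(lr_second_derivative(z) for z in grid)
print(max_abs_first <= 1.0, max_second)
```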
3.4.2 Support Vector Machines
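Another very commonly used classifier is $L_2$-regularized support vector machines. In this case, again, $N(\mathbf{f}) = \frac{1}{2}\|\mathbf{f}\|^2$, and
$$\ell_{\mathrm{SVM}}(z) = \max(0, 1 - z).$$
This hinge loss is continuous but not differentiable at $z = 1$, and thus it does not satisfy the conditions of Theorems 1 and 2.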
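There are two alternative solutions to this. First, we can approximate $\ell_{\mathrm{SVM}}$ by a different loss function, which is doubly differentiable, as follows (see also Chapelle (2007)):
$$\ell_s(z) = \begin{cases} 0 & \text{if } z > 1 + h \\ -\frac{(1 - z)^4}{16h^3} + \frac{3(1 - z)^2}{8h} + \frac{1 - z}{2} + \frac{3h}{16} & \text{if } |1 - z| \le h \\ 1 - z & \text{if } z < 1 - h \end{cases}$$
As $h \to 0$, this loss approaches the hinge loss. Taking derivatives, we observe that:
$$\ell_s'(z) = \begin{cases} 0 & \text{if } z > 1 + h \\ \frac{(1 - z)^3}{4h^3} - \frac{3(1 - z)}{4h} - \frac{1}{2} & \text{if } |1 - z| \le h \\ -1 & \text{if } z < 1 - h \end{cases} \qquad \ell_s''(z) = \begin{cases} 0 & \text{if } z > 1 + h \\ -\frac{3(1 - z)^2}{4h^3} + \frac{3}{4h} & \text{if } |1 - z| \le h \\ 0 & \text{if } z < 1 - h \end{cases}$$
Observe that this implies that $|\ell_s''(z)| \le \frac{3}{4h}$ for all $h$ and $z$. Moreover, $\ell_s$ is convex, as $\ell_s''(z) \ge 0$ for all $z$. Therefore, $\ell_s$ can be used in Theorems 1 and 2, which gives us privacy-preserving approximations to regularized support vector machines.
The output of Algorithm 1 with $N(\mathbf{f}) = \frac{1}{2}\|\mathbf{f}\|^2$ and $\ell = \ell_s$ is an $\epsilon_p$-differentially private approximation to support vector machines. The output of Algorithm 2 with $N(\mathbf{f}) = \frac{1}{2}\|\mathbf{f}\|^2$, $c = \frac{3}{4h}$, and $\ell = \ell_s$ is an $\epsilon_p$-differentially private approximation to support vector machines.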
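The following sketch implements one doubly differentiable smoothing of the hinge loss and checks its properties numerically. The quartic coefficients are an assumption chosen so that the middle piece matches the outer pieces in value and slope at $z = 1 \pm h$; the function names and the grid are illustrative.

```python
def smooth_hinge(z, h):
    """Smoothed hinge: 0 for z > 1 + h, 1 - z for z < 1 - h, a quartic in between."""
    u = 1.0 - z
    if z > 1.0 + h:
        return 0.0
    if z < 1.0 - h:
        return u
    return -u ** 4 / (16.0 * h ** 3) + 3.0 * u ** 2 / (8.0 * h) + u / 2.0 + 3.0 * h / 16.0

def smooth_hinge_d1(z, h):
    """First derivative of the smoothed hinge with respect to z."""
    u = 1.0 - z
    if z > 1.0 + h:
        return 0.0
    if z < 1.0 - h:
        return -1.0
    return u ** 3 / (4.0 * h ** 3) - 3.0 * u / (4.0 * h) - 0.5

def smooth_hinge_d2(z, h):
    """Second derivative; it vanishes outside |1 - z| <= h and peaks at z = 1."""
    u = 1.0 - z
    if abs(u) > h:
        return 0.0
    return -3.0 * u ** 2 / (4.0 * h ** 3) + 3.0 / (4.0 * h)

h = 0.5
grid = [i / 1000.0 for i in range(0, 2001)]  # z in [0, 2]
print(max(smooth_hinge_d2(z, h) for z in grid), 3.0 / (4.0 * h))
print(min(smooth_hinge_d2(z, h) for z in grid) >= 0.0)
```

The largest gap between this loss and the hinge loss occurs at $z = 1$ and equals $3h/16$, which is why the approximation tightens as $h$ shrinks.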
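The second solution is to use Huber loss, as suggested by Chapelle (2007), which is defined as follows:
$$\ell_{\mathrm{Huber}}(z) = \begin{cases} 0 & \text{if } z > 1 + h \\ \frac{1}{4h}(1 + h - z)^2 & \text{if } |1 - z| \le h \\ 1 - z & \text{if } z < 1 - h \end{cases}$$
Observe that Huber loss is convex and differentiable, and piecewise doubly-differentiable, with $c = \frac{1}{2h}$. However, it is not globally doubly differentiable, and hence the Jacobian in the proof of Theorem 2 is undefined for certain values of $\mathbf{f}$. However, we can show that in this case, Algorithm 2, when run with $c = \frac{1}{2h}$, satisfies Definition 3.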
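A small sketch of the Huber-style loss makes the constant $c = \frac{1}{2h}$ visible. The piecewise form below is an assumption (a quadratic on $|1 - z| \le h$ that meets the linear and zero pieces with matching slopes), and the function names are illustrative. The first derivative is continuous, while the second derivative jumps between $0$ and $\frac{1}{2h}$ at $z = 1 \pm h$.

```python
def huber_hinge(z, h):
    """Huber-style hinge: 0 for z > 1 + h, 1 - z for z < 1 - h, quadratic in between."""
    if z > 1.0 + h:
        return 0.0
    if z < 1.0 - h:
        return 1.0 - z
    return (1.0 + h - z) ** 2 / (4.0 * h)

def huber_hinge_d1(z, h):
    """First derivative; continuous everywhere, with values in [-1, 0]."""
    if z > 1.0 + h:
        return 0.0
    if z < 1.0 - h:
        return -1.0
    return -(1.0 + h - z) / (2.0 * h)

h = 0.5
# Inside |1 - z| <= h the slope of the derivative is the constant 1 / (2h).
second_inside = (huber_hinge_d1(1.1, h) - huber_hinge_d1(0.9, h)) / 0.2
# Outside that interval the derivative is constant, so the second derivative is 0.
second_outside = (huber_hinge_d1(0.3, h) - huber_hinge_d1(0.1, h)) / 0.2
print(second_inside, 1.0 / (2.0 * h), second_outside)
```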
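Let $G$ denote the map from $\mathbf{f}_{\mathrm{priv}}$ to $\mathbf{b}$ in (17) under $\mathcal{B} = \mathcal{D}$, and $H$ denote the map under $\mathcal{B} = \mathcal{D}'$. By definition, the probability $P(\mathbf{f}_{\mathrm{priv}} \in \mathcal{S} \mid \mathcal{B} = \mathcal{D}) = P_{\mathbf{b}}(\mathbf{b} \in G(\mathcal{S}))$.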
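Let $\mathbf{f}_{\mathrm{priv}}$ be the output of Algorithm 2 with $\ell = \ell_{\mathrm{Huber}}$, $c = \frac{1}{2h}$, and $N(\mathbf{f}) = \frac{1}{2}\|\mathbf{f}\|_2^2$. For any set $\mathcal{S}$ of possible values of $\mathbf{f}_{\mathrm{priv}}$, and any pair of datasets $\mathcal{D}$, $\mathcal{D}'$ which differ in the private value of one person $(\mathbf{x}_n, y_n)$,
$$e^{-\epsilon_p}\, P(\mathcal{S} \mid \mathcal{B} = \mathcal{D}') \le P(\mathcal{S} \mid \mathcal{B} = \mathcal{D}) \le e^{\epsilon_p}\, P(\mathcal{S} \mid \mathcal{B} = \mathcal{D}').$$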