PAC learning with stable and private predictions

11/24/2019 · by Yuval Dagan, et al.

We study binary classification algorithms for which the prediction on any point is not too sensitive to individual examples in the dataset. Specifically, we consider the notions of uniform stability (Bousquet and Elisseeff, 2001) and prediction privacy (Dwork and Feldman, 2018). Previous work on these notions shows how they can be achieved in the standard PAC model via simple aggregation of models trained on disjoint subsets of data. Unfortunately, this approach leads to a significant overhead in terms of sample complexity. Here we demonstrate several general approaches to stable and private prediction that either eliminate or significantly reduce the overhead. Specifically, we demonstrate that for any class C of VC dimension d there exists a γ-uniformly stable algorithm for learning C with excess error α using Õ(d/(αγ) + d/α^2) samples. We also show that this bound is nearly tight. For ϵ-differentially private prediction we give two new algorithms: one using Õ(d/(α^2ϵ)) samples and another one using Õ(d^2/(αϵ) + d/α^2) samples. The best previously known bounds for these problems are O(d/(α^2γ)) and O(d/(α^3ϵ)), respectively.


1 Introduction

For a domain X and label set Y = {0,1}, let M be a randomized learning algorithm that, given a dataset S ∈ (X × Y)^n, outputs a predictor h: X → Y. We consider algorithms for which, for every x ∈ X, the prediction M(S, x) is not too sensitive to individual examples in S. More formally, we say that M is an ε-differentially private prediction algorithm [DworkFeldman18] if for any pair of datasets S and S' that differ in a single example and every x ∈ X and y ∈ Y we have

Pr[M(S, x) = y] ≤ e^ε · Pr[M(S', x) = y].

The setting in which the multiplicative guarantee above is replaced with an additive one, namely |Pr[M(S, x) = y] − Pr[M(S', x) = y]| ≤ γ, is equivalent to γ-uniform stability [BousquettE02].¹

¹ Alternatively, uniform stability can be considered for algorithms that output a model f that gives the confidence level f(x) ∈ [0, 1] at x. It is easy to see [ElisseeffEP05] that our discussion and results extend to this setting by defining M(S, x) to be the Bernoulli random variable with bias f(x).
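
The two notions are closely related; the following one-line calculation (a restatement of a fact noted again in Section 2) shows that ε-private prediction implies (e^ε − 1)-uniform stability:

```latex
% eps-private prediction bounds the ratio of prediction probabilities;
% subtracting yields an additive (i.e., uniform-stability) bound:
\Pr[M(S,x)=y] - \Pr[M(S',x)=y]
  \;\le\; (e^{\varepsilon}-1)\,\Pr[M(S',x)=y]
  \;\le\; e^{\varepsilon}-1 .
```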

Stability is a classical approach to understanding generalization and proving generalization bounds [RogersWagner78, DevroyeW79, BousquettE02, ShwartzSSS10, FeldmanV:19]. In practice, stability-inducing methods such as bagging [breiman1996bagging] and regularization [BousquettE02, ShwartzSSS10] are used to improve accuracy and robustness to outliers.

Prediction privacy was defined by Dwork and Feldman [DworkFeldman18] to study privacy-preserving learning in the setting where the description of the learned model is not accessible to the (potentially adversarial) users. Instead, the users have access to the model through an interface: for an input point, the interface provides the value of the predictive model on that point. Bassily et al. [BassilyTT:18] consider the same setting while focusing on algorithms for answering multiple queries to the interface by aggregating multiple non-private learning algorithms. Many existing ML systems, such as query prediction in online search and credit scoring, are deployed in this way. At the same time, as demonstrated in a growing number of works, even such black-box exposure of a learned model presents significant privacy risks for the user data [ShokriSSS17, LongBG17, truex2018towards, carlini2019secret].

For comparison, the standard setting of privacy-preserving learning aims to ensure that the model itself is produced in a differentially private way. This approach preserves privacy even when a potential adversary has complete access to the description of the predictive model. The downside of this strong guarantee is that, for some learning problems, it is known to have substantial additional costs, both in terms of sample complexity and computation.

Another application of algorithms with private predictions is labeling public unlabeled data. Training on data labeled in this way yields a differentially private learning algorithm [HammCB16, PapernotAEGT17, papernot2018scalable, BassilyTT:18]. It is also easy to see that these notions of stability ensure some level of protection against targeted data poisoning attacks [Biggio:2012], in which an attacker can add examples to the dataset with the goal of changing the prediction on a point they choose.

Given the significant benefits of these notions of stability and privacy, it is natural to ask what the overhead of ensuring them is. We address this question in the classical context of learning a class of Boolean functions in the (agnostic) PAC learning framework [Valiant:84, Haussler:92, KearnsSS:94]. Namely, for a class C of Boolean functions over X, we say that M is an (α, β)-agnostic PAC learning algorithm for C if for every distribution P over X × Y, given a dataset S of n i.i.d. examples from P, M outputs a hypothesis h that with probability at least 1 − β over the choice of S (and the randomness of M) satisfies

err_P(h) ≤ min_{f∈C} err_P(f) + α,

where err_P(h) := Pr_{(x,y)∼P}[h(x) ≠ y]. Namely, the excess error is at most α. In the realizable setting it is additionally known that there exists f* ∈ C such that y = f*(x) for all (x, y) in the support of P.

A simple, general, and well-known approach to improving stability is averaging of models trained on disjoint subsets of the data. Alternatively, one can pick a random subset of the dataset and run the algorithm on that subset. Clearly, this technique improves stability by a factor of k at the expense of using k times more data. It was used in an influential work of Shalev-Shwartz et al. [ShwartzSSS10] to demonstrate that learnability of C implies the existence of a uniformly stable algorithm for learning C and, to the best of our knowledge, this is the only known technique for learning an arbitrary class of finite VC dimension with uniform stability. An unfortunate disadvantage of this technique is that it substantially increases the sample complexity. In particular, it uses O(d/(α^2γ)) samples to learn a class of VC dimension d with γ-uniform stability and excess error α (the dependence on β is logarithmic and thus we treat it as a small fixed constant). By the same argument, in the realizable case O(d/(αγ)) samples suffice.
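
To make the trick concrete, here is a minimal sketch (our illustration, not code from the paper) of the subsample-and-train wrapper; `base_learner` is a hypothetical stand-in for an arbitrary learning algorithm:

```python
# Sketch of the classical subsample-and-train trick: running a base learner
# on a random 1/k fraction of the data means any single example influences
# the output only with probability about 1/k, improving stability by a
# factor of k at a k-fold cost in sample complexity.
import random

def subsampled_learner(base_learner, dataset, k, rng=random):
    """Train `base_learner` on a uniformly random (1/k)-fraction of `dataset`.

    If replacing one example changes `base_learner`'s predictions by at most
    gamma, the wrapped learner's predictions change by roughly gamma/k, since
    the replaced example is even selected only with probability 1/k.
    """
    m = len(dataset) // k
    subset = rng.sample(dataset, m)  # random subset, without replacement
    return base_learner(subset)
```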

A somewhat more careful aggregation and analysis are needed to achieve privacy, since naive averaging does not improve the ε parameter. In the realizable case Õ(d/(αε)) samples are known to suffice (and this is tight), but in the agnostic case the best known bound was O(d/(α^3ε)) [DworkFeldman18, BassilyTT:18]. The additional factor of 1/α results from the need to add noise in the aggregation step while ensuring that it contributes at most α to the excess error.

For comparison, we note that ensuring that the entire model is produced with differential privacy cannot, in general, be done using a number of samples polynomial in d. For example, threshold functions on a line have VC dimension 1 but require an infinite number of samples to learn with differential privacy [FeldmanXiao15, AlonLMM19]. The only general approach for PAC learning with differential privacy is the technique of Kasiviswanathan et al. [KasiviswanathanLNRS11], which is based on the exponential mechanism [McSherryTalwar:07]. This approach leads to a sample complexity of O(log|C| · (1/(αε) + 1/α^2)) (in this bound log|C| can be replaced with the representation dimension of C [BeimelNS:13], which can sometimes be lower than log|C|).

1.1 Our contribution

We describe several new algorithms for agnostic PAC learning of arbitrary VC classes that significantly reduce the overheads of achieving stability and privacy of predictions. We first show a simple and natural algorithm that is γ-uniformly stable and is an α-agnostic PAC learner with the nearly optimal sample complexity of Õ(d/(αγ) + d/α^2).

Theorem 1.

For every class C of VC dimension d and every α, β, γ ∈ (0, 1), there exists a γ-uniformly stable (α, β)-agnostic PAC learning algorithm for C with sample complexity n = Õ(d/(αγ) + d/α^2) (where Õ hides factors logarithmic in d, 1/α, 1/β, and 1/γ).

This bound is tight up to poly-logarithmic factors. First, a lower bound of Ω(d/α^2) holds for any α-agnostic PAC algorithm (not necessarily stable). Second, we prove that Ω(d/(αγ)) samples are required even in the realizable setting. The proof of the lower bound is based on a similar bound for private prediction algorithms [DworkFeldman18, Theorem 2.2] and appears in Section 5.

One way to interpret this bound is that, for γ ≥ α, γ-uniform stability can be achieved essentially “for free”, that is, using asymptotically the same number of samples as is necessary for learning. In contrast, known general approaches for achieving stability do not give any non-trivial guarantees without increasing the sample complexity by at least a factor of 1/γ [ShwartzSSS10].

This uniformly stable algorithm easily implies the existence of a learning algorithm with ε-differentially private prediction with sample complexity Õ(d/(α^2ε)). This step can be done using the aggregation technique in [DworkFeldman18], but we give an even simpler conversion.

Corollary 1.

For every class C of VC dimension d and every α, β, ε ∈ (0, 1), there exists an (α, β)-agnostic PAC learning algorithm with ε-differentially private prediction that has sample complexity n = Õ(d/(α^2ε)).

This bound implies that for ε = Ω(1), ε-differentially private prediction can be achieved without (asymptotically) increasing the sample complexity.

One limitation of the bound in Corollary 1 is that it improves on learning with differential privacy only when ε is sufficiently large. Specifically, the sample complexity of ε-DP α-agnostic PAC learning is Θ̃(RDim(C) · (1/(αε) + 1/α^2)), where RDim(C) is the representation dimension of C [BeimelNS:13]. Thus Corollary 1 gives a better bound than the sample complexity of ε-DP learning only when ε is at least Ω̃(d/RDim(C)) (for most known classes of functions, RDim(C) = Õ(d) [FeldmanXiao15]). The only known result in which learning with ε-private prediction requires fewer samples than ε-DP learning for all ε is the algorithm for learning thresholds on a line with sample complexity Õ(1/(α^2ε)) [DworkFeldman18]. Note that the VC dimension of this class is 1, while its representation dimension is Θ(log|X|).

Our main technical result is the first general bound that, for a number of classes, improves on ε-DP α-agnostic PAC learning in all parameter regimes. Specifically, we describe an algorithm with sample complexity Õ(d^2/(αε) + d/α^2):

Theorem 2.

For every class C of VC dimension d and every α, β, ε ∈ (0, 1), there exists an (α, β)-agnostic PAC learning algorithm with ε-differentially private prediction that has sample complexity n = Õ(d^2/(αε) + d/α^2).

For comparison, the representation dimension of linear threshold functions over {0,1}^k, where d = k + 1, is larger than the VC dimension [FeldmanXiao15], and thus the first term in our bound improves on the corresponding term of the ε-DP bound whenever d^2 ≪ RDim(C). Another example is the class consisting of (the indicator functions of) lines on a plane over a finite field F_p. Its VC dimension is 2, while its representation dimension is Θ(log p) [FeldmanXiao15]. Thus our bound is better by a factor of Θ̃(log p). Although these gaps are typically small, we believe that there is an important conceptual difference between bounds in terms of the size of C and those depending on the VC dimension, as is apparent also in classical learning theory. In particular, the latter make it possible to argue about infinite hypothesis classes.

The best known lower bound for the problem is Ω̃(d/(αε) + d/α^2) [DworkFeldman18, Theorem 2.2]. Thus the bound in Theorem 2 is tight with respect to α and ε; however, there is a gap in the dependence on d. Closing this gap is an interesting open problem.

Overview of the techniques:

The algorithm we use to prove Theorem 1 combines two standard tools. We pick a random subset S' of S of size m = Θ̃(d/α), or approximately a γ fraction of S. Classical results in VC theory imply that S' is an α-net for C with high probability. We can therefore use it to define an α-cover H = C|_{S'} of C that has size at most (em/d)^d. We then use the exponential mechanism over H, as in [KasiviswanathanLNRS11]. The random choice of the subset ensures that each example has small influence on the choice of H, and the exponential mechanism ensures that each example has small influence on the final output function. We remark that similar uses of an exponential mechanism over a data-dependent cover have appeared in prior work on differentially private learning [ChaudhuriHsu:11, BeimelNS:13approx, beimel2015learning], but we are not aware of any prior use of these tools to ensure uniform stability.
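
The following minimal sketch (ours, not the paper's code; it assumes C is given as a finite list of Python functions, and all names are our own) shows the two steps, with the exponential mechanism sampling ĥ with probability proportional to exp(−ε₀ · n · err_S(h)/2):

```python
# Hedged sketch of the stable learner: subsample S', build the cover
# H = C|_{S'} by labeling pattern, then run the exponential mechanism on S.
import math
import random

def empirical_error(h, S):
    """Zero-one error of hypothesis h on dataset S of (x, y) pairs."""
    return sum(h(x) != y for x, y in S) / len(S)

def stable_learn(C, S, m, eps0, rng=random):
    S_prime = rng.sample(S, m)  # uniformly random subset of size m
    # Cover: keep one representative per labeling pattern on S'.
    cover = {}
    for h in C:
        pattern = tuple(h(x) for x, _ in S_prime)
        cover.setdefault(pattern, h)
    H = list(cover.values())
    # Exponential mechanism: Pr[h] proportional to exp(-eps0 * n * err_S(h) / 2).
    n = len(S)
    scores = [math.exp(-eps0 * n * empirical_error(h, S) / 2) for h in H]
    total = sum(scores)
    r, acc = rng.random() * total, 0.0
    for h, s in zip(H, scores):
        acc += s
        if r <= acc:
            return h
    return H[-1]  # guard against floating-point underflow
```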

The above stable algorithm can be converted to a differentially private prediction algorithm (Corollary 1). In order to do so, we run it with stability parameter γ = εα/2 and approximation parameter α/2 and, in addition, randomly flip the output label with probability α/2. This additional noise increases the error by at most α/2. At the same time, it ensures that each label is output with probability at least α/2. In particular, the additive guarantee of γ = εα/2 implies a multiplicative guarantee of at most 1 + ε ≤ e^ε.
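
The arithmetic behind the conversion is the following (the specific constants are our hedged choice; only their ratios matter):

```latex
% Flip the label with probability alpha/2, so every label has probability
% at least alpha/2; run the stable learner with gamma = eps*alpha/2. Then
% for datasets S, S' differing in one example:
\frac{\Pr[A(S,x)=y]}{\Pr[A(S',x)=y]}
  \;\le\; \frac{\Pr[A(S',x)=y]+\gamma}{\Pr[A(S',x)=y]}
  \;\le\; 1+\frac{\varepsilon\alpha/2}{\alpha/2}
  \;=\; 1+\varepsilon \;\le\; e^{\varepsilon}.
```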

Next, we describe the second differentially private algorithm, from Theorem 2. The first step again uses an exponential mechanism over a cover defined by a random subset S' of S. The hypothesis f output by this step is used to re-label S, and the result is used as a training set for the private prediction algorithm for the realizable case from [DworkFeldman18]. The bound on the sample complexity of this algorithm relies on a substantially more delicate analysis of the differential privacy of the exponential mechanism when the set of functions it is applied to changes, combined with the effect of privacy amplification by subsampling [KasiviswanathanLNRS11]. Although the final step of the algorithm and some elements of the proof are borrowed from the general technique of [beimel2015learning], new techniques and ingredients are required to derive the final bound. A more detailed outline of the analysis can be found in Section 4.1.

An alternative way to derive Corollary 1 is to use the general relabeling approach for converting a realizable PAC learning algorithm to an agnostic one [beimel2015learning]. The algorithm and the analysis resulting from this approach are more involved than our proof of Corollary 1 (we use elements of this approach in the proof of Theorem 2). This approach was communicated to us by Nissim and Stemmer [NissimStemmer18:pc] and was the starting point for this work. The details of this technique can be found in an (independent) work of Nandi and Bassily [NandiB19], which applies this result to answering multiple prediction queries with privacy.

2 Preliminaries

Notation:

We denote by X the domain and by Y = {0,1} the label set. We use C to denote the underlying hypothesis class of functions from X to Y, and d to denote the VC dimension of C. The dataset is denoted by S = ((x_1, y_1), …, (x_n, y_n)), and the underlying distribution is denoted by P. Define S_X = (x_1, …, x_n). For any k ∈ N, denote [k] = {1, …, k}. Given a hypothesis h, denote the expected zero-one error of h by err_P(h) = Pr_{(x,y)∼P}[h(x) ≠ y] and the empirical error of h by err_S(h) = (1/n) Σ_{i∈[n]} 1[h(x_i) ≠ y_i]. We use c, c', c'' to denote positive universal constants.

Agnostic PAC learning:

A learning algorithm A receives a training set S ∈ (X × Y)^n and outputs a hypothesis h: X → Y. We also consider randomized learning algorithms, and define the corresponding loss as err_P(A(S)) = E[err_P(h)], where the expectation is over the randomness of A (err_S(A(S)) is defined analogously). Given α, β ∈ (0, 1), we say that A is an (α, β)-agnostic PAC learner for C if for every distribution P over X × Y,

Pr_S[ err_P(A(S)) ≤ min_{f∈C} err_P(f) + α ] ≥ 1 − β.

Uniform stability and prediction privacy:

A γ-uniformly stable (or, γ-stable for brevity) learner is a learner whose prediction probabilities at every point change by at most an additive γ when one example in S is changed. Formally, for any S, S' ∈ (X × Y)^n which differ in at most one example and any x ∈ X and y ∈ Y,

|Pr[A(S, x) = y] − Pr[A(S', x) = y]| ≤ γ.

The notion of differentially private prediction [DworkFeldman18] is an application of the definition of differential privacy [DworkMNS:06] to learning in the setting where the only output exposed to (potentially adversarial) users is predictions on their points. Formally, given ε ≥ 0 and δ ≥ 0, we say that an algorithm A gives (ε, δ)-private prediction if for any S, S' that differ in a single example, every x ∈ X, and any Y' ⊆ Y,

Pr[A(S, x) ∈ Y'] ≤ e^ε · Pr[A(S', x) ∈ Y'] + δ.

Postprocessing guarantees of differential privacy imply that any learning algorithm that outputs a predictor with (ε, δ)-differential privacy also gives (ε, δ)-private prediction. We say that an algorithm gives ε-private prediction when δ = 0. Note that any ε-private algorithm is (e^ε − 1)-stable, and (0, δ)-private prediction is exactly the same as δ-stability.

Deterministic notation for randomized hypotheses:

Assume that a randomized learning algorithm outputs a hypothesis h. Throughout the formal analysis, instead of viewing the prediction of h on some point x as a random variable, we will describe it by its expectation. That is, we write h(x) = p instead of Pr[h(x) = 1] = p. Assume that a randomized algorithm outputs a random hypothesis h, where h equals h_i with probability p_i, for i ∈ [k]. Viewing hypotheses as functions from X to [0, 1], one can equivalently write h = Σ_{i∈[k]} p_i h_i. The expected loss can then be defined as err_P(h) = E_{(x,y)∼P}[|h(x) − y|].

Comparing functions (≼_ε):

For two functions f, g: (X × Y)^n × X → [0, 1], we write f ≼_ε g if for any S ∈ (X × Y)^n and x ∈ X, f(S, x) ≤ e^ε · g(S, x). Note that in this notation, using the deterministic notation for randomized hypotheses, the condition of ε-private prediction is equivalent to requiring that for all S, S' differing in a single example and all x, A(S, x) ≤ e^ε · A(S', x) and 1 − A(S, x) ≤ e^ε · (1 − A(S', x)).

2.1 Technical preliminaries

Nets for VC classes:

Fix α ∈ (0, 1), a hypothesis class C, and a distribution μ over X. Given a subset X' ⊆ X, we say that X' is an α-net for C with respect to μ if it satisfies the following: any f, g ∈ C that satisfy f(x) = g(x) for all x ∈ X' also satisfy Pr_{x∼μ}[f(x) ≠ g(x)] ≤ α. We say that X' is an α-net for C with respect to a dataset S if it is an α-net with respect to the uniform distribution over S_X. The following is a fundamental theorem in machine learning (see, e.g., [shalev2014understanding, Section 28.3]).

Lemma 1.

Let μ be a distribution over X, fix α, β ∈ (0, 1), and let X' be a set of m ≥ c(d log(1/α) + log(1/β))/α i.i.d. samples from μ. Then, with probability at least 1 − β, X' is an α-net for C with respect to μ. Furthermore, this holds also if the samples are selected without replacement (namely, re-sampling from μ until one gets m distinct elements).

Note that this theorem is usually stated for i.i.d. samples with replacement. However, since repetitions do not matter for the definition of nets, one can only gain from sampling without replacement.

The class of all possible labelings of a subset:

Given a subset X' ⊆ X, we create a hypothesis class C|_{X'} in the following manner: define an equivalence relation over C by f ∼ g if f(x) = g(x) for all x ∈ X'. Then C|_{X'} contains one representative from each equivalence class (chosen arbitrarily).
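
For a finite class, C|_{X'} is simply a deduplication by labeling pattern; the toy snippet below (our illustration, not from the paper) makes this concrete:

```python
# Deduplicate a finite hypothesis class by its labeling pattern on X_prime,
# keeping one representative per equivalence class.
def restrict(C, X_prime):
    representatives = {}
    for f in C:
        pattern = tuple(f(x) for x in X_prime)
        representatives.setdefault(pattern, f)
    return list(representatives.values())

# Example: five threshold functions on {0,1,2,3} induce 5 distinct labelings.
C = [lambda x, t=t: int(x >= t) for t in range(5)]
print(len(restrict(C, [0, 1, 2, 3])))  # prints 5
```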

The growth function and the Sauer-Shelah lemma:

Given a hypothesis class C, the growth function of C is defined as τ_C(m) = max_{X'⊆X, |X'|=m} |C|_{X'}|. The well-known Sauer-Shelah lemma states (see, e.g., [shalev2014understanding, Lemma 6.10]):

Lemma 2.

For any hypothesis class C of VC dimension d and any m ≥ d, τ_C(m) ≤ (em/d)^d.

Uniform convergence bounds:

The following is a standard uniform bound on the estimation error of a hypothesis class (see, e.g., [shalev2014understanding, Theorem 6.8]):

Lemma 3.

Fix α, β ∈ (0, 1) and assume that n ≥ c(d + log(1/β))/α^2. Then,

Pr_S[ ∀h ∈ C: |err_P(h) − err_S(h)| ≤ α ] ≥ 1 − β.

The exponential mechanism:

The exponential mechanism is a well-known ε-differentially private algorithm for selecting a candidate that approximately minimizes some objective of low sensitivity [McSherryTalwar:07]. Following [KasiviswanathanLNRS11], we apply it to a hypothesis class H, a dataset S, and a privacy parameter ε: it samples h ∈ H with probability proportional to exp(−εn · err_S(h)/2). It is ε-differentially private (in the usual sense of [DworkMNS:06]) and, in particular, it gives ε-private predictions. The utility guarantees for the expected output can, for example, be found in [SteinkeU17subg].

Lemma 4.

The exponential mechanism gives ε-private predictions. Additionally, for any α, β ∈ (0, 1) and any finite class H, if n ≥ c · log(|H|/β)/(εα) then

Pr_S[ err_S(ĥ) ≤ min_{h∈H} err_S(h) + α ] ≥ 1 − β,

where ĥ denotes the selected hypothesis.
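
The utility bound follows from a standard calculation; our hedged reconstruction of it (with OPT = min_{h∈H} err_S(h)) is:

```latex
% Any h with err_S(h) >= OPT + alpha is exponentially down-weighted
% relative to the best hypothesis:
\Pr[\hat h = h]
  \le \frac{e^{-\varepsilon n(\mathrm{OPT}+\alpha)/2}}{e^{-\varepsilon n\,\mathrm{OPT}/2}}
  = e^{-\varepsilon n\alpha/2},
\qquad
\Pr\big[\mathrm{err}_S(\hat h) \ge \mathrm{OPT}+\alpha\big]
  \le |H|\,e^{-\varepsilon n\alpha/2} \le \beta
\quad\text{once}\quad
n \ge \frac{2\ln(|H|/\beta)}{\varepsilon\alpha}.
```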

Privacy amplification by subsampling:

The following lemma is an adaptation of the standard privacy amplification-by-subsampling technique [KasiviswanathanLNRS11] to private prediction.

Lemma 5.

Let A be an algorithm operating on a sample of size m, and let A' be the following algorithm that receives a sample S of size n ≥ m:

  1. Select a uniformly random subset S' ⊆ S of size m.

  2. Run A on S'.

For any ε ≤ 1, if A gives ε-private prediction then A' gives O(εm/n)-private prediction.
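
Informally, the amplification comes from the fact that the differing example lands in S' with probability only m/n; a hedged sketch of the calculation:

```latex
% For S, T differing in one example and A' as above (informal sketch):
\Pr[A'(S,x)=y]
  \le \Big(1-\tfrac{m}{n}\Big)\Pr[A'(T,x)=y]
    + \tfrac{m}{n}\,e^{\varepsilon}\Pr[A'(T,x)=y]
  = \Big(1+\tfrac{m}{n}\big(e^{\varepsilon}-1\big)\Big)\Pr[A'(T,x)=y],
% hence A' is eps'-private with
\varepsilon' = \ln\!\Big(1+\tfrac{m}{n}\big(e^{\varepsilon}-1\big)\Big)
  = O\!\big(\varepsilon m/n\big)
\quad\text{for } \varepsilon \le 1.
```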

Private prediction in the realizable setting:

Dwork and Feldman [DworkFeldman18, Theorem 4.1] have shown that Õ(d/(αε)) samples suffice to learn with private prediction in the realizable setting (that is, when there exists f* ∈ C with zero error).

Lemma 6.

For any α, β, ε ∈ (0, 1) and any class C of VC dimension d, there exists an (α, β)-PAC learner with ε-private prediction for distributions realizable by C with sample complexity n = Õ(d/(αε)).

3 Uniform Stability of PAC Learning

In this section we describe the uniformly stable PAC learning algorithm, prove a (nearly) matching lower bound on its sample complexity and derive the corollary for private prediction.

For convenience, we start by restating Theorem 1.

Theorem 3 (Thm. 1 restated).

For every class C of VC dimension d and every α, β, γ ∈ (0, 1), there exists a γ-uniformly stable (α, β)-agnostic PAC learning algorithm for C with sample complexity n = Õ(d/(αγ) + d/α^2).

We outline the algorithm and its analysis below. The details of the proof appear in Section 3.1.

Proof outline:

The algorithm consists of two steps:

  • Given the dataset S of size n, randomly select a subset S' ⊆ S of size m = Θ̃(d/α), and create the hypothesis class H = C|_{S'_X}, as defined in Section 2.1 (H contains all classifications of C on S'). Note that H satisfies the following properties:

    1. With high probability,

      min_{h∈H} err_S(h) ≤ min_{f∈C} err_S(f) + α.      (1)

      This follows from the fact that |S'| = Θ̃(d/α), hence S' is an α-net for C with respect to the uniform distribution over S (Lemma 1).

    2. From the Sauer-Shelah lemma (Lemma 2), the cardinality of H is upper-bounded by (em/d)^d.

  • We run the exponential mechanism (from Section 2.1) on the set of hypotheses H, evaluated on the set of examples S, and denote the selected hypothesis by ĥ. We set the privacy parameter of the exponential mechanism to ε_0 = γ/4 to ensure that this step is (e^{ε_0} − 1)-stable, hence (γ/2)-stable. By Lemma 4, n ≥ c · log(|H|/β)/(ε_0 α) suffices to ensure that

    err_S(ĥ) ≤ min_{h∈H} err_S(h) + α.      (2)

The algorithm has excess empirical error of at most 2α: combining Eq. (1) and Eq. (2), we get that with high probability over the choice of S' and ĥ, err_S(ĥ) ≤ min_{f∈C} err_S(f) + 2α, and from uniform convergence we obtain that err_P(ĥ) ≤ min_{f∈C} err_P(f) + 4α with high probability over the choice of S, S', and ĥ. Next, we claim that the algorithm is γ-stable: assume that S and T are two training sets that differ in one example. This example has probability at most m/n ≤ γ/2 of appearing in S' and affecting the creation of H. Additionally, the exponential mechanism is (γ/2)-stable. Composing the stability bounds from both steps, we obtain that the algorithm is γ-stable, as desired.
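
Collecting the constraints gives the claimed sample complexity (a hedged accounting; the constants are ours):

```latex
% m large enough for an alpha-net, m/n <= gamma/2, and the exponential
% mechanism's utility requirement with eps_0 = gamma/4 and |H| <= (em/d)^d:
m = \tilde\Theta\!\Big(\frac{d}{\alpha}\Big),\qquad
n \ge \frac{2m}{\gamma} = \tilde\Theta\!\Big(\frac{d}{\alpha\gamma}\Big),\qquad
n \ge \frac{c\,\log(|H|/\beta)}{\varepsilon_0\,\alpha}
   = \tilde O\!\Big(\frac{d}{\alpha\gamma}\Big),
% and uniform convergence (Lemma 3) contributes the d/alpha^2 term:
n = \tilde O\!\Big(\frac{d}{\alpha\gamma} + \frac{d}{\alpha^{2}}\Big).
```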

The tightness of the upper bound is implied by the following lower bound.

Theorem 4.

Let C be a class of functions of VC dimension d. Fix α, γ ∈ (0, 1/8). Assume that there exists a learning algorithm A that is γ-stable and, for any distribution P over X × Y realizable by C, has expected error at most α, namely E_S[err_P(A(S))] ≤ α. Then n = Ω(d/(αγ)).

The proof follows the same structure as the proof of Theorem 2.2 in [DworkFeldman18] and appears in Section 5.

Prediction privacy via stability:

We now observe that any γ-stable (α/2, β)-PAC learning algorithm for C with γ = εα/2 can be converted to an (α, β)-PAC learning algorithm with ε-private prediction simply by flipping the predictions with probability α/2.

More formally, the algorithm consists of the following two steps:

  1. Run the γ-stable algorithm from Theorem 1 with γ = εα/2 and approximation parameter α/2, and let ĥ be the output hypothesis.

  2. Given a point x, predict ĥ(x) with probability 1 − α/2 and 1 − ĥ(x) with probability α/2.

Let A denote the obtained algorithm. Note that for any S, x, and y, Pr[A(S, x) = y] ≥ α/2. Additionally, A is (εα/2)-stable. It follows easily from the definition that A gives ε-private prediction (as 1 + (εα/2)/(α/2) = 1 + ε ≤ e^ε). Thus we obtain Corollary 1.
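
A minimal sketch of step 2 (our code; `stable_h` stands in for the hypothesis produced in step 1):

```python
# Randomized-response wrapper: flip the stable learner's binary prediction
# with probability alpha/2, so every label has probability at least alpha/2.
import random

def private_predict(stable_h, x, alpha, rng=random):
    y = stable_h(x)
    return y if rng.random() >= alpha / 2 else 1 - y
```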

3.1 Proof of Theorem 1

The algorithm receives as input a training set S and a point x, and outputs a prediction ŷ for x, defined below:

  1. Select a uniformly random subset S' ⊆ S of size m, where m is a parameter to be defined later (m should be thought of as Θ̃(d/α)).

  2. Create the hypothesis class H = C|_{S'_X} containing all possible labelings of S'_X, as defined in Section 2.1.

  3. Execute the exponential mechanism with privacy parameter ε_0 = γ/4 on the sample S to randomly select a hypothesis ĥ ∈ H, and output ŷ = ĥ(x).

We proceed with the formal definition. First, for any hypothesis h: X → {0,1} and any dataset T, define err_T(h) = (1/|T|) Σ_{(x,y)∈T} 1[h(x) ≠ y], viewing h as a function from X to {0,1}. Note that we do not require h ∈ C in this definition. Next, using the deterministic notation of Section 2, define

A(S, x) = E_{S'}[ Σ_{h∈H} Pr[ĥ = h] · h(x) ],

where ĥ is selected by the exponential mechanism applied to H with parameter ε_0 on S. The final prediction of the algorithm given x is the random label whose expectation is A(S, x).

We are ready to state the main lemma:

Lemma 7.

Let m_0(α, β) denote the smallest number that suffices for Lemma 1 to hold given parameters α and β (the minimal size required for a random set to be an α-net with probability 1 − β). Let n_0(α, β, ε_0, N) denote the sample complexity required for the ε_0-differentially private exponential mechanism to select an α-approximate minimizer with probability 1 − β from a hypothesis class of size N (Lemma 4). Assume that

m ≥ m_0(α, β),   n ≥ n_0(α, β, ε_0, (em/d)^d),   n ≥ 2m/γ,   and   ε_0 ≤ γ/4.

Then

Pr_S[ err_S(ĥ) ≤ min_{f∈C} err_S(f) + 2α ] ≥ 1 − 2β,

and A is γ-stable.

First, we prove Theorem 1 using Lemma 7, and then we prove this lemma.

Proof of Theorem 1.

Given α, β, and γ, we will show an algorithm with the desired sample complexity. It suffices to find an algorithm which is γ-stable and (4α, 2β)-agnostic PAC. To do so, we apply the above algorithm with parameter ε_0 = γ/4, and then select m as the smallest integer which satisfies the conditions of Lemma 7.

We apply Lemma 7 to get a bound on the empirical error of ĥ, and then apply uniform convergence to generalize to the data distribution. In particular, from Lemma 3, with probability at least 1 − 4β over the choice of S, S', and ĥ,

err_P(ĥ) ≤ err_S(ĥ) + α ≤ min_{f∈C} err_S(f) + 3α ≤ min_{f∈C} err_P(f) + 4α.

Lastly, we prove Lemma 7:

Proof of Lemma 7.

First we prove the approximation guarantee. From the condition that m ≥ m_0(α, β), the set S' is, with probability at least 1 − β, an α-net for C with respect to the uniform distribution over S; hence for any f ∈ C there exists h ∈ H with err_S(h) ≤ err_S(f) + α.

Since n ≥ n_0(α, β, ε_0, |H|), with probability at least 1 − β over the choice of ĥ,

err_S(ĥ) ≤ min_{h∈H} err_S(h) + α.

Thus,

Pr_S[ err_S(ĥ) ≤ min_{f∈C} err_S(f) + 2α ] ≥ 1 − 2β.

Now we prove the γ-stability. Let S be a sample and let T be obtained from S by replacing one example. Without loss of generality we can assume that T is obtained by removing (x_n, y_n) from S and adding (x'_n, y'_n). Our goal is to show that |A(S, x) − A(T, x)| ≤ γ for all x.

Condition first on the event that the differing example is not selected into S', so that the class H is the same for both samples. For any such fixed S', Lemma 4 implies that the prediction probabilities of the exponential mechanism on S and on T differ by a multiplicative factor of at most e^{ε_0}, hence by an additive term of at most e^{ε_0} − 1 ≤ γ/2, using the inequality e^z − 1 ≤ 2z for z ∈ [0, 1]. To conclude the proof, using the fact that the differing example is selected into S' with probability m/n ≤ γ/2, the total change in prediction probability is at most γ/2 + γ/2 = γ.

Remark 1.

It is possible to slightly improve the logarithmic factors in the bound of Theorem 1, replacing log τ_C(m) with d log(1/α). In order to do so, one has to replace C|_{S'} with an α-net of hypotheses, namely, a set H ⊆ C|_{S_X} which satisfies: for every f ∈ C there exists h ∈ H such that Pr_{x∼S_X}[f(x) ≠ h(x)] ≤ α.

There exists such a set of cardinality (c/α)^d. Using this set, one can relax the requirement on n from Lemma 7 to n ≥ n_0(α, β, ε_0, (c/α)^d), and the improved bound would follow.

4 PAC Learning with Prediction Privacy

In this section we describe our main technical result for private prediction. Our algorithm is the first general algorithm that can PAC learn an arbitrary class of VC dimension d with the (nearly) optimal dependence of the sample complexity on α and ε; however, the dependence on the dimension is quadratic. More formally, we prove the following upper bound on the sample complexity of the problem.

Theorem 5 (Theorem 2 restated).

For every class C of VC dimension d and every α, β, ε ∈ (0, 1), there exists an (α, β)-agnostic PAC learning algorithm with ε-differentially private prediction that has sample complexity n = Õ(d^2/(αε) + d/α^2).

We present the algorithm and the outline of its analysis below. Additional technical details are given in Section 4.2.

4.1 Overview of the algorithm and its analysis

The algorithm that achieves the claimed sample complexity is described below (with slight simplifications for clarity of exposition):

  • The first steps are similar to those of the stable algorithm: we draw a random subset S' ⊆ S and then select, using the exponential mechanism, a hypothesis f ∈ C|_{S'}. The only difference is that S' is now selected to be of size Θ̃(d^2/α) (rather than Θ̃(d/α)), and the privacy parameter of the exponential mechanism is now set to Θ(ε/d) (rather than Θ(γ)).

  • In the second step, we “privatize” f, as sketched in the code after this list:

    1. Let S_f denote the set of examples obtained by labeling the points in S with f, namely, S_f = ((x_1, f(x_1)), …, (x_n, f(x_n))).

    2. Feed S_f to the PAC learner for the realizable setting with private prediction (from Lemma 6), and denote the output hypothesis by h_f. The privacy guarantee of this realizable learner is set to Θ(ε/d) and the approximation guarantee is α. Note that given f, the hypothesis h_f is randomly selected (as the corresponding learner is randomized).

    3. Given a point x, predict h_f(x).

    Note that since S_f is realized by f, h_f is very close to f: informally speaking, Pr_{x∼S}[h_f(x) ≠ f(x)] ≤ α.
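
A hedged sketch of the privatization step (all names ours; `realizable_learner` stands in for the private realizable learner of Lemma 6, which we do not implement):

```python
# Relabel S with the hypothesis f selected by the exponential mechanism,
# then train the private realizable learner on the relabeled dataset.
def privatize(f, S, realizable_learner):
    S_f = [(x, f(x)) for x, _ in S]   # S_f is realizable by f (zero error)
    return realizable_learner(S_f)    # h_f: close to f, with private predictions
```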

We first establish that the algorithm is an (O(α), β)-PAC learning algorithm. As in the analysis of the stable algorithm, err_P(f) ≤ min_{g∈C} err_P(g) + O(α) with high probability. By the guarantees of the PAC learner for the realizable case, we get that err_P(h_f) ≤ err_P(f) + O(α).

Next, we explain why the algorithm gives O(ε)-private prediction. Denote by h_{S,S'} the random hypothesis output given a training set S and a fixed subset S': h_{S,S'} is obtained by first selecting f from C|_{S'} using the exponential mechanism and then learning h_f from S_f. Note that h_{S,S'} is a random variable, since the outputs of both the exponential mechanism and the private learner are random. For the analysis, we extend the definition of h_{S,S'} to sets S' which are not necessarily subsets of S. Our goal is to show that for any datasets S and T that differ in a single example, any x ∈ X and y ∈ Y,

Pr[h_{S,S'}(x) = y] ≤ e^ε · Pr[h_{T,T'}(x) = y],      (3)

where the probability is both over the selection of the random subsets S' ⊆ S and T' ⊆ T and over the random selections of f and h_f.

where the probability is both over the selections of the subsets and and over the random selections of and . We split Eq. (3) in two, first comparing with and then with : for any and ,

(4)
(5)

Eq. (5) follows from the fact that, for any fixed subset T', h_{·,T'} is the composition of two private algorithms: first, the exponential mechanism selects f privately, and then h_f(x) is a private prediction given any fixed f.

Next, we sketch the proof of Eq. (4). The terms on both sides of the inequality can be computed as follows: first S' and T' are randomly drawn as subsets of S and T, respectively; then the random predictions h_{S,S'}(x) and h_{S,T'}(x) are made. Due to this structure, we can use privacy amplification by subsampling (Lemma 5)² to compare h_{S,S'} with h_{S,T'}. In particular, we prove that h_{S,S'} is O(1)-private as a function of S', and since m/n = O(ε), the subsampling boosts the privacy by a factor of O(ε) and Eq. (4) follows. Hence, we are left with proving that h_{S,S'} is O(1)-private as a function of S':

Pr[h_{S,S'}(x) = y] ≤ c · Pr[h_{S,T'}(x) = y]      (6)

for (almost) any S' and T' that differ in one element.

² To prove Eq. (4), one cannot use Lemma 5 as stated. However, the proof in our case is analogous.

To prove Eq. (6), we fix S' and T' that differ in one element and create a matching between the elements of C|_{S'} and the elements of C|_{T'}, such that for any matched pair f ∈ C|_{S'} and g ∈ C|_{T'}, the following properties hold:

  • |err_S(f) − err_S(g)| ≤ α/d.

  • For any x and y, Pr[h_f(x) = y] ≤ c' · Pr[h_g(x) = y], where the probability is over the random selections of h_f and h_g.

These two properties imply that the output distribution of h_{S,S'} is within constant factors of that of h_{S,T'}: the first property ensures that the probability of selecting f from C|_{S'} via the exponential mechanism is within a constant factor of the probability of selecting g from C|_{T'} (for any matched f and g). The second property states that the prediction probabilities of h_f and h_g are within a constant factor of each other. Hence, Eq. (6) follows from these two properties, and it remains to describe why they hold.

  • First, we define the aforementioned matching between C|_{S'} and C|_{T'}: f is matched with g if and only if f(x) = g(x) for all x ∈ S' ∩ T'. Note that this is not a one-to-one matching, but any f can be matched with either one or two elements from C|_{T'} (and similarly, any g is matched with one or two elements from C|_{S'}).

  • We apply Lemma 1 on α-nets, substituting the net parameter α/d and the uniform distribution over S. We obtain that S' ∩ T' is an (α/d)-net for C with respect to the uniform distribution over S (with high probability over the random selection of S'). Any pair of matched hypotheses satisfies f(x) = g(x) for all x ∈ S' ∩ T'; hence, by the definition of α-nets, they satisfy

    Pr_{x∼S}[f(x) ≠ g(x)] ≤ α/d.      (7)

  • The first property, namely |err_S(f) − err_S(g)| ≤ α/d, follows immediately from Eq. (7).

  • For the second property, recall that h_f is the output of a learner with (ε/d)-private prediction, trained on S_f. Eq. (7) implies that the training sets S_f and S_g used to train h_f and h_g differ on at most (α/d)·n examples. Applying the privacy guarantee (α/d)·n times (referred to as group privacy), one obtains that Pr[h_f(x) = y] ≤ e^{(ε/d)·(α/d)·n} · Pr[h_g(x) = y] ≤ c' · Pr[h_g(x) = y], as required (the exponent is O(1) for our choice of parameters, as the calculation after this list shows). The second property holds, and the proof follows.
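
The exponent above is where the quadratic dependence on d arises; with our hedged reconstruction of the parameter choices:

```latex
% net parameter alpha/d, learner privacy eps/d, n = ~Theta(d^2/(alpha*eps)):
\frac{\varepsilon}{d}\cdot\frac{\alpha}{d}\cdot n
  \;=\; \frac{\varepsilon\alpha}{d^{2}}\cdot
        \tilde\Theta\!\Big(\frac{d^{2}}{\alpha\varepsilon}\Big)
  \;=\; \tilde\Theta(1),
\qquad\text{so}\qquad
e^{(\varepsilon/d)(\alpha/d)n} = O(1).
```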

4.2 Detailed proof

We will start by describing an algorithm which gives (ε, δ)-private prediction with a small additive term δ (see Section 2). We then convert it to an algorithm with (pure) ε-private prediction, as in the proof of Corollary 1.

The algorithm receives a training set S and a point x, and outputs a label ŷ for x. It is defined as follows:

  1. Select a subset S' ⊆ S of size m uniformly at random (m is a parameter to be defined later, and should be thought of as Θ̃(d^2/α)).

  2. Create the hypothesis class H = C|_{S'_X}, as defined in Section 2.1 (H contains all the distinct prediction patterns of hypotheses from C on S').

  3. Draw a random f ∈ H using the exponential mechanism with loss evaluated on S and privacy parameter ε_0, where ε_0 is a parameter to be defined later (and should be thought of as Θ(ε/d)).

  4. Define the dataset S_f as the predictions of f on S, S_f = ((x_1, f(x_1)), …, (x_n, f(x_n))) (where x_i is the i'th sample from S). Recall the realizable private learner from Lemma 6, instantiated with privacy parameter Θ(ε/d) and approximation error α (we will ensure that the sample size is sufficiently large to achieve these guarantees). Let h_f denote the learned hypothesis given the sample S_f.