In this paper, we revisit the problem of answering a sequence of classification queries in the agnostic PAC model under the constraint of (ε, δ)-differential privacy. An algorithm for this problem is given a private training dataset of n i.i.d. binary-labeled examples drawn from some unknown distribution D over X × Y, where
X denotes an arbitrary data domain (space of feature-vectors) and Y denotes a set of binary labels (e.g., Y = {0, 1}). The algorithm is also given as input some hypothesis class H of binary functions mapping X to Y. The algorithm accepts a sequence of classification queries given by a sequence of i.i.d. feature-vectors x̃_1, …, x̃_m drawn from the marginal distribution of D over X, denoted as D_X. Here, the feature-vectors defining the queries are not subject to any privacy constraint. The queries are also assumed to arrive one at a time, and the algorithm is required to answer the current query by predicting a label for it before seeing the next query. The goal is to answer up to a given number m of queries (which is a parameter of the problem) such that (i) the entire process of answering the queries is (ε, δ)-differentially private, and (ii) the average excess error in the predicted labels does not exceed some desired level α; specifically, (1/m) Σ_{i=1}^m Pr[ŷ_i ≠ y_i] ≤ γ + α, where ŷ_i is the predicted label, y_i is the corresponding (hidden) true label of the i-th query, and γ is the approximation error associated with H, i.e., the least possible true (population) error that can be attained by a hypothesis in H (see Section 2 for formal definitions).
One could argue that a more direct approach to differentially private classification would be to design a differentially private learner that, given a private training set as input, outputs a classifier that is safe to publish and can then be used to answer any number of classification queries. However, several pessimistic results either limit or eliminate the possibility of differentially private learning even for elementary problems such as one-dimensional thresholds [BNSV15, ALMM18]. It is therefore natural to study the problem of classification-query release under differential privacy as an alternative approach.
A recent formal investigation of this problem was carried out in [BTT18]. That work gives an algorithm based on a combination of two useful techniques from the literature on differential privacy, namely, the subsample-and-aggregate technique [NRS07, ST13] and the sparse-vector technique [DR14]. The algorithm of [BTT18] assumes oracle access to a generic, non-private (agnostic) PAC learner for H. In this work, we give non-trivial improvements over the results of [BTT18] in the agnostic PAC setting. More details on the comparison to [BTT18] are given in the “Related work” section below. Our improvements are in terms of the attainable accuracy guarantees and the associated private sample complexity bounds in the agnostic setting. These improvements are achieved by importing ideas and techniques from the literature (particularly, the elegant agnostic-to-realizable reduction of [BNS15]) to obtain an improved version of the construction that appeared in [BTT18].
In this work, we formally study algorithms for classification-query release under differential privacy in the agnostic PAC model. We focus on the sample complexity of such algorithms as a function of the privacy and accuracy parameters, as well as the number of queries to be answered.
We show that our construction provides significant improvements over the results of [BTT18] for the agnostic setting:
The error guarantees in [BTT18] involve a constant multiplicative blow-up in the approximation error γ associated with the given hypothesis class H. Using our construction, we give a standard excess error guarantee that does not involve such a blow-up.
We show that our construction can answer up to m queries with average excess error α using a private sample whose size scales with d, 1/α, and m (treating the privacy parameter ε as a constant, e.g., ε = 1), where d is the VC dimension of H. Note that this implies that a non-trivial number of queries can be answered with a private sample size that is essentially the same as the standard non-private sample complexity of agnostic PAC learning; i.e., that many queries can be answered with essentially no additional cost due to privacy.
Using recent results of [ABM19] on the sample complexity of semi-private learners (a notion introduced in [BNS13]), we show that our construction immediately leads to a universal private classification algorithm that can answer any number of classification queries using a private sample whose size is independent of the number of queries. This implies that in the high-accuracy regime (sufficiently small α), we can privately answer any number of classification queries with a private sample size that is essentially the same as the standard non-private sample complexity of agnostic PAC learning.
Our algorithm is a two-stage construction. In the first stage, the input training set is pre-processed once and for all via a relabeling procedure due to [BNS15], in which the original labels are replaced with labels generated by an appropriately chosen hypothesis h̃ in the given hypothesis class H. This step allows us to reduce the agnostic setting to a realizable one. In the second stage, we first sample a new training set from the empirical distribution of the relabeled set from the first stage, then feed it to the algorithm of [BTT18] together with other appropriately chosen input parameters.
Our results are most closely related to [BTT18]. In [BTT18], Bassily et al. provide formal accuracy guarantees for their algorithm in both the realizable and agnostic settings of the PAC model. However, the accuracy guarantees they provide for the agnostic setting are far from optimal. In particular, their guarantees involve a constant multiplicative blow-up in the approximation error γ, which would limit or eliminate the utility of their construction in scenarios where the approximation error is not negligible. In fact, in most typical scenarios in practice, the approximation error associated with the hypothesis (model) class is a non-negligible constant (e.g., the test error attained by some state-of-the-art neural networks on benchmark datasets is a noticeable constant fraction). Our improved construction avoids this blow-up in the approximation error.
The construction in [BTT18] can answer up to m queries with average excess error O(γ) + α (where γ is the approximation error) using a private sample whose size follows from [BTT18, Theorem 3.5]. Given our results discussed in the “Main results” section above, our sample complexity bound is tighter than that of [BTT18] by a multiplicative factor whose exact form depends on how the approximation error γ compares to the target accuracy α. Equivalently, for the same private sample size, our construction can answer correspondingly more queries than that of [BTT18].
Bassily et al. [BTT18] also extend their construction to obtain a semi-private learner that eventually produces a classifier. This is done by answering a sufficiently large number of queries and then applying a knowledge-transfer technique using the new training set formed by the set of answered queries. The output classifier can then be used to answer any subsequent queries; hence, their extended construction provides a universal private classification algorithm. Their private sample complexity bound for this task is given in [BTT18, Theorem 4.3]. Given our results in the “Main results” section above, our universal private classification algorithm yields a correspondingly tighter private sample complexity bound.
Other related works:
Reference [DF18] considers the problem of differentially private classification in the single-query setting, and gives upper bounds on the private sample complexity of that problem in the PAC model. Our results imply that the bound shown in [DF18] for the agnostic setting is sub-optimal. In the single-query setting (i.e., m = 1), our bound is essentially optimal, as it nearly matches the standard non-private sample complexity of the agnostic PAC model.
For classification tasks, we denote the space of feature vectors by X, the set of labels by Y = {0, 1}, and the data universe by X × Y. A function h : X → Y is called a hypothesis; it labels data points in the feature space by either 0 or 1. A set of hypotheses H is called a hypothesis class. The VC dimension of H is denoted by VC(H). We use D to denote a distribution defined over the space of feature vectors and labels X × Y, and D_X to denote the marginal distribution over X. A sample dataset of n i.i.d. draws from D is denoted by S = {(x_1, y_1), …, (x_n, y_n)}, where x_i ∈ X and y_i ∈ Y.
The expected error of a hypothesis h with respect to a distribution D over X × Y is defined by err(h; D) = Pr_{(x,y)∼D}[h(x) ≠ y]. The excess expected error of h is defined as err(h; D) − min_{h′∈H} err(h′; D).
The empirical error of a hypothesis h with respect to a labeled set S = {(x_1, y_1), …, (x_n, y_n)} is denoted by err̂(h; S) = (1/n) Σ_{i=1}^n 1{h(x_i) ≠ y_i}.
The problem of minimizing the empirical error on a dataset is known as Empirical Risk Minimization (ERM). We use ERM(S) to denote a hypothesis in H that minimizes the empirical error with respect to a dataset S, i.e., ERM(S) ∈ argmin_{h∈H} err̂(h; S).
The expected disagreement between a pair of hypotheses h_1 and h_2 with respect to a distribution D_X is defined as dis(h_1, h_2; D_X) = Pr_{x∼D_X}[h_1(x) ≠ h_2(x)].
The empirical disagreement between a pair of hypotheses h_1 and h_2 w.r.t. an unlabeled dataset T is defined as diŝ(h_1, h_2; T) = (1/|T|) Σ_{x∈T} 1{h_1(x) ≠ h_2(x)}.
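To make these definitions concrete, here is a minimal (and entirely non-private) sketch in Python. The callable-based hypothesis representation and the toy one-dimensional threshold class are our own illustration, not part of the paper's setup.

```python
# Illustrative sketch: empirical error, ERM over a finite class, and
# empirical disagreement. A hypothesis is modeled as a callable mapping
# a feature to a {0, 1} label.

def empirical_error(h, S):
    """Fraction of labeled examples (x, y) in S that h misclassifies."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def erm(H, S):
    """A hypothesis in the finite class H minimizing empirical error on S."""
    return min(H, key=lambda h: empirical_error(h, S))

def empirical_disagreement(h1, h2, T):
    """Fraction of unlabeled points in T on which h1 and h2 disagree."""
    return sum(1 for x in T if h1(x) != h2(x)) / len(T)

# Toy example: one-dimensional thresholds h_t(x) = 1{x >= t}.
H = [lambda x, t=t: int(x >= t) for t in (0.0, 0.5, 1.0)]
S = [(0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1)]
h_hat = erm(H, S)   # the threshold-0.5 hypothesis fits S perfectly
```

The same helpers are reused conceptually throughout the paper: the relabeling step selects a hypothesis with near-minimal empirical error, and covers are measured via disagreement.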
In the realizable setting of the PAC model, there exists h* ∈ H such that err(h*; D) = 0, i.e., the true labeling function is assumed to be in H. In this setting, the distribution D can be completely described by the marginal D_X and the hypothesis h*. Such a distribution is called realizable by H. Hence, for realizable distributions, the expected error of a hypothesis h will be denoted as err(h; D_X, h*).
Next, we define the notion of differential privacy. For any two datasets S and S′, we denote the symmetric difference between them by S △ S′.
A (randomized) algorithm A is (ε, δ)-differentially private if for all pairs of datasets S, S′ s.t. |S △ S′| ≤ 2, and every measurable set O of outcomes, we have, over the coin flips of A:
Pr[A(S) ∈ O] ≤ e^ε · Pr[A(S′) ∈ O] + δ.
When δ = 0, the guarantee is known as pure differential privacy, and is parameterized only by ε in this case.
We study private classification algorithms that take as input a private labeled dataset S ∈ (X × Y)^n and a sequence of m classification queries defined by unlabeled feature-vectors x̃_1, …, x̃_m from X (where m is an input parameter), and output a corresponding sequence of predictions, i.e., labels ŷ_1, …, ŷ_m. Here, we assume that the classification queries arrive one at a time, and the algorithm is required to generate a label for the current query before seeing and responding to the next query. The goal is: (i) after answering m queries, the algorithm should satisfy (ε, δ)-differential privacy, and (ii) the labels generated should be (α, β)-accurate with respect to a hypothesis class H: a notion of accuracy which we formally define shortly. We give a generic description of the above classification paradigm in Algorithm 1 below.
The algorithm invokes a generic classification procedure that, given the input private training set S and knowledge of the hypothesis class H, generates a label ŷ for an input query (feature-vector) x̃.
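The one-query-at-a-time paradigm of Algorithm 1 can be sketched as follows. Here `answer_queries` is just the outer loop; `classify` is a hypothetical stand-in for the generic classification procedure, and the toy majority-vote classifier is our own illustration with no privacy guarantee.

```python
# Minimal sketch of the generic online classification paradigm: queries
# arrive one at a time, and a label must be emitted for each query
# before the next one is seen.

def answer_queries(S, queries, classify):
    """Yield a label for each query using the private training set S."""
    for x in queries:          # queries may be any iterator: one at a time
        yield classify(S, x)   # must answer before seeing the next query

# Toy (non-private) stand-in classifier: always predict the majority
# label of the training set, ignoring the query point.
def majority_label(S, x):
    ones = sum(y for _, y in S)
    return int(2 * ones >= len(S))

S = [(0.1, 0), (0.7, 1), (0.9, 1)]
labels = list(answer_queries(S, [0.3, 0.8], majority_label))
```

The generator structure enforces the online constraint: each label is produced before the next query is pulled from the iterator.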
Definition 2.2 ((α, β, ε, δ, m)-Private Classification-Query Release Algorithm).
Let α, β, ε, δ ∈ (0, 1) and m ∈ ℕ. Let H be a hypothesis class. A randomized algorithm A (whose generic format is described in Algorithm 1) is said to be an (α, β, ε, δ, m)-private classification-query release (PCQR) algorithm for H if the following conditions hold:
For any sequence of m queries, A is (ε, δ)-differentially private with respect to its input dataset.
For every distribution D over X × Y, given a dataset S ∼ D^n and a sequence (x̃_1, y_1), …, (x̃_m, y_m) ∼ D^m (where the x̃_i's are the queried feature-vectors and the y_i's are their true hidden labels), A is (α, β)-accurate with respect to H, where our notion of (α, β)-accuracy is defined as follows: with probability at least 1 − β over the choice of S, the queries, and the internal randomness of A (Step 2 in Algorithm 1), we have (1/m) Σ_{i=1}^m 1{ŷ_i ≠ y_i} ≤ γ + α, where γ = min_{h∈H} err(h; D).
In the realizable setting, we have an analogous definition with γ = 0. In this case, we say that the algorithm is an (α, β, ε, δ, m)-PCQR algorithm for H in the realizable setting.
Definition 2.3 (α-cover for a hypothesis class).
A family of hypotheses H̃ is said to form an α-cover for a hypothesis class H with respect to a distribution D_X if for every h ∈ H there exists h̃ ∈ H̃ such that dis(h, h̃; D_X) ≤ α.
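As an illustration of this definition, the following (non-private) snippet checks the cover property for a finite class of one-dimensional thresholds, measuring disagreement under the empirical distribution on a fixed set of points. All names here are our own, not the paper's.

```python
# Illustrative check of the cover definition: H_cover covers H at radius
# gamma (w.r.t. the empirical distribution over xs) if every h in H has
# a representative within empirical disagreement gamma.

def is_cover(H_cover, H, xs, gamma):
    def dis(h1, h2):
        return sum(1 for x in xs if h1(x) != h2(x)) / len(xs)
    return all(any(dis(h, c) <= gamma for c in H_cover) for h in H)

xs = [0.1, 0.3, 0.5, 0.7, 0.9]
H = [lambda x, t=t: int(x >= t) for t in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)]
H_cover = [H[0], H[2], H[4]]   # every threshold is within one grid step
ok = is_cover(H_cover, H, xs, gamma=0.2)
```

Each omitted threshold differs from its nearest representative on exactly one of the five points, so the three representatives form a 0.2-cover here.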
2.1 Previous work on private classification-query release [BTT18]
In [BTT18], the authors give a construction of a PCQR algorithm that combines the subsample-and-aggregate framework [NRS07, ST13] with the sparse-vector technique [DR14]. Bassily et al. [BTT18] provide formal privacy and accuracy guarantees, with sample complexity bounds, for their construction in both the realizable and agnostic settings of the PAC model. Here, we briefly describe their algorithm (Algorithm 2 below) and restate its privacy and accuracy guarantees.
The input to the algorithm is a private labeled dataset S, a sequence of m classification queries x̃_1, …, x̃_m, and a generic non-private PAC learner for a hypothesis class H. The algorithm outputs a sequence of private labels ŷ_1, …, ŷ_m. The key idea is as follows: first, the algorithm arbitrarily splits S into k equal-sized sub-samples for an appropriately chosen k. Each of those sub-samples is used to train the non-private learner. Hence, we obtain an ensemble of k classifiers. Next, for each input query x̃_i, the votes of the ensemble are computed. The algorithm then applies the distance-to-instability test [ST13] to the difference between the largest vote count and the second-largest vote count. If the majority vote is sufficiently stable, the algorithm returns the majority vote as the predicted label for x̃_i; otherwise, it returns a random label. The sparse-vector framework is employed to efficiently manage the privacy budget over the m queries. In particular, by employing the sparse-vector technique, the privacy budget is only consumed by those queries for which the majority vote is not stable. The algorithm takes an input cut-off parameter T, which bounds the total number of “unstable” queries the algorithm can answer before it halts in order to ensure (ε, δ)-differential privacy.
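A hedged sketch of this subsample-and-aggregate idea follows. The noise distribution and the threshold below are illustrative placeholders rather than the exact sparse-vector calibration of [BTT18], and the toy "learner" returns a fixed threshold classifier regardless of its shard.

```python
import random

# Sketch: train an ensemble on disjoint shards of the private data, then
# answer a query by a noisy majority vote gated by a distance-to-
# instability test. Noise scale and threshold are illustrative only.

def train_ensemble(S, k, learner):
    shards = [S[i::k] for i in range(k)]          # k disjoint sub-samples
    return [learner(shard) for shard in shards]

def noisy_stable_majority(ensemble, x, eps, threshold, rng):
    votes = [h(x) for h in ensemble]
    top = max(votes.count(0), votes.count(1))
    second = len(votes) - top
    gap = top - second                             # distance to instability
    noisy_gap = gap + rng.expovariate(eps)         # illustrative noise
    if noisy_gap >= threshold:                     # stable: release majority
        return int(votes.count(1) >= votes.count(0)), True
    return rng.randrange(2), False                 # unstable: random label

rng = random.Random(0)
S = [(x / 10, int(x >= 5)) for x in range(10)]
learner = lambda shard: (lambda x, t=0.5: int(x >= t))  # toy "PAC learner"
ensemble = train_ensemble(S, k=5, learner=learner)
label, stable = noisy_stable_majority(ensemble, 0.9, eps=1.0,
                                      threshold=1.0, rng=rng)
```

Only unstable queries would consume privacy budget in the actual sparse-vector accounting; here that bookkeeping is omitted.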
Next, we restate the results of [BTT18] for the realizable and agnostic settings.
Lemma 2.4 (Realizable Setting: follows from Theorems 3.2 & 3.4 in [BTT18]).
Let α, β, ε, δ ∈ (0, 1). Let H be a hypothesis class with VC(H) = d. Suppose that the learner invoked by the algorithm is a PAC learner for H. Let D be any distribution over X × Y that is realizable by H. There is a setting of the cut-off parameter T such that the construction of [BTT18] is an (α, β, ε, δ, m)-PCQR algorithm for H in the realizable setting, where the private sample size is as given by [BTT18, Theorems 3.2 & 3.4].
In the agnostic setting, the accuracy guarantee of [BTT18] is not compatible with Definition 2.2; the accuracy guarantee therein has a sub-optimal dependency on the approximation error γ, where γ = min_{h∈H} err(h; D). In particular, their result entails a blow-up in γ by a constant factor (> 1). This significantly limits the applicability of this result in scenarios where γ is non-negligible. In fact, in practical scenarios, it is typical to have a non-negligible approximation error, i.e., a constant that does not depend on the sample size. For example, a class of neural networks may have a constant γ (i.e., test accuracy bounded away from 100%) while its excess error can still be small (for a large enough sample).
Lemma 2.5 (Agnostic Setting: follows from Theorems 3.2 & 3.5 in [BTT18]).
Let α, β, ε, δ ∈ (0, 1). Let H be a hypothesis class with VC(H) = d. Suppose the learner invoked by the algorithm is an agnostic PAC learner for H. Let D be any distribution over X × Y, and let γ = min_{h∈H} err(h; D). Let S ∼ D^n denote the input private sample. Let (x̃_1, y_1), …, (x̃_m, y_m) ∼ D^m, where the x̃_i's are the queried feature-vectors and the y_i's are their true (hidden) labels. Let ŷ_1, …, ŷ_m denote the output labels. There is a setting of the cut-off parameter T such that: 1) the algorithm is (ε, δ)-differentially private with respect to the input training set; 2) when the private sample is of sufficiently large size (as quantified in [BTT18, Theorem 3.5]), then with probability at least 1 − β over S, the queries, and the randomness in the algorithm, we have: (1/m) Σ_{i=1}^m 1{ŷ_i ≠ y_i} ≤ O(γ) + α.
3 Private Release of Classification Queries in the Agnostic PAC Setting
In this section, we give the main results of this paper. We give an improved construction of the private classification-query release algorithm of [BTT18] for the general agnostic setting. Our construction can privately answer up to m queries with excess classification error α given an input sample whose size is quantified in our main theorem below (where Õ(·) hides logarithmic factors). Treating ε as a constant, it follows that we can answer up to a non-trivial number of queries with a private sample whose size is essentially the same as the standard non-private sample complexity of agnostic PAC learning. Compared to the result of [BTT18] for the agnostic setting ([BTT18, Theorem 3.5]), our sample complexity bound is tighter by a multiplicative factor whose exact form depends on how the approximation error γ compares to the target accuracy α.
Our construction is made up of two phases. The first phase is a pre-processing phase in which the input private sample S is relabeled using a “good” hypothesis h̃ ∈ H. This phase is a reenactment of the elegant technique due to Beimel et al. [BNS15], which was called the LabelBoost procedure therein. In this phase, h̃ can be viewed as if it were the true labeling hypothesis, which allows us to reduce the agnostic setting to the realizable setting. By construction, h̃ is chosen so that it is close to the ERM hypothesis with respect to a subset of S. As the chosen input sample size is sufficiently large, h̃ is a good hypothesis, i.e., it attains low excess error.
Having reduced the problem to the realizable setting, in the second phase we invoke the techniques of [BTT18]. The relabeled training set S̃ is used to provide input training examples to the algorithm of [BTT18] (described in Section 2.1). Note that S̃ is no longer i.i.d., and hence we sample with replacement from S̃ to form a new dataset. The algorithm of [BTT18] then uses this new training set to privately generate labels for the sequence of classification queries. We need to carefully calibrate the privacy parameters of that algorithm according to the input sample size, the target accuracy guarantee, and the fact that its input is a re-sampled version of S̃ that may contain multiple copies of elements of S̃. Note also that the distribution of the input dataset is no longer the true distribution D but the empirical distribution of S̃. We give a careful analysis of the overall construction in which we show that this re-sampling step does not harm our desired accuracy guarantees.
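The two phases can be sketched schematically as follows. The helper names (`select_relabeler`, `answer_privately`) are hypothetical stand-ins for the relabeling procedure of [BNS15] and the subsample-and-aggregate answering scheme of [BTT18], respectively; neither toy component below is actually differentially private.

```python
import random

# Schematic of the two-phase construction: relabel the agnostic sample
# with a chosen hypothesis (Phase 1), resample with replacement from the
# relabeled set, then hand the result to a query-answering scheme
# (Phase 2).

def relabel_and_resample(S, select_relabeler, rng):
    h = select_relabeler(S)                     # "good" hypothesis from H
    relabeled = [(x, h(x)) for x, _ in S]       # replace labels by h(x)
    # Draw |S| points with replacement from the empirical distribution
    # of the relabeled set, so the next phase sees an i.i.d.-style input.
    return [rng.choice(relabeled) for _ in range(len(relabeled))]

def pipeline(S, queries, select_relabeler, answer_privately, rng):
    S_new = relabel_and_resample(S, select_relabeler, rng)
    return answer_privately(S_new, queries)     # Phase 2: answer queries

S = [(0.2, 1), (0.6, 0), (0.9, 1)]              # noisy (agnostic) labels
select = lambda _S: (lambda x: int(x >= 0.5))   # toy stand-in relabeler
answer = lambda S_new, qs: [int(2 * sum(y for _, y in S_new) >= len(S_new))
                            for _ in qs]        # toy majority answerer
out = pipeline(S, [0.1, 0.8], select, answer, random.Random(0))
```

Note that after Phase 1 the labels are, by construction, consistent with the chosen hypothesis, which is exactly the realizability property the second phase relies on.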
3.1 From the agnostic to the realizable setting: A generic reduction
In this section, we describe the pre-processing step mentioned earlier, which reduces the agnostic setting to the realizable setting. This is done by adapting the relabeling technique devised by Beimel et al. in [BNS15]; we refer to it here as the relabeling procedure (given by Algorithm 4 below). We briefly describe the algorithm below, and state its privacy and accuracy guarantees.
Given a private labeled dataset S as input, the procedure randomly chooses a subset S′ of S of an appropriately chosen size. Let S′_X denote the unlabeled version of S′, i.e., the set of feature-vectors appearing in S′. Given a hypothesis class H with VC(H) = d, the procedure generates the set of all dichotomies on S′_X that are realized by H. It then chooses a finite subset H̃ ⊂ H such that each such dichotomy is represented by one of the hypotheses in H̃. We note that H̃ forms a cover for H with respect to the empirical distribution over S′_X. Also note that, by Sauer’s lemma [Sau72], the size of H̃ is at most O(|S′|^d). Finally, the procedure chooses a hypothesis h̃ ∈ H̃ using the exponential mechanism with an appropriate privacy parameter and a score function given by the number of disagreements with the private labels. Then, h̃ is used to relabel S, and the procedure finally outputs the relabeled set.
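The relabeling step can be sketched as below. The dichotomy enumeration and the exponential mechanism are generic implementations of the named techniques; the privacy parameter `eps0` and the toy threshold class are illustrative assumptions, not the paper's exact calibration.

```python
import math, random

# Sketch: enumerate the dichotomies a finite class H induces on a set of
# unlabeled points, then select a representative via the exponential
# mechanism with score = number of disagreements with the private labels
# (lower score = better, hence the negative exponent).

def dichotomy_representatives(H, xs):
    """One hypothesis per distinct labeling of xs realized by H."""
    reps = {}
    for h in H:
        reps.setdefault(tuple(h(x) for x in xs), h)
    return list(reps.values())

def exponential_mechanism(candidates, score, eps0, sensitivity, rng):
    """Sample c with probability proportional to exp(-eps0*score(c)/(2*sens))."""
    weights = [math.exp(-eps0 * score(c) / (2 * sensitivity))
               for c in candidates]
    r = rng.random() * sum(weights)
    for c, w in zip(candidates, weights):
        r -= w
        if r <= 0:
            return c
    return candidates[-1]

rng = random.Random(0)
S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
xs = [x for x, _ in S]
H = [lambda x, t=t: int(x >= t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
reps = dichotomy_representatives(H, xs)
score = lambda h: sum(1 for x, y in S if h(x) != y)   # disagreement count
h_tilde = exponential_mechanism(reps, score, eps0=5.0, sensitivity=1, rng=rng)
relabeled = [(x, h_tilde(x)) for x in xs]
```

Since the score changes by at most 1 when one example changes, its sensitivity is 1, which is what the exponential mechanism's calibration uses.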
Let ε, δ ∈ (0, 1), and let A be an (ε, δ)-differentially private algorithm. Let B be the algorithm that, on an input dataset S, runs the relabeling procedure (Algorithm 4) on S and then invokes A on the resulting relabeled dataset. Then, B is (O(q·ε), O(q·δ))-differentially private, where q is the fraction of S sub-sampled in Step 2 of the procedure.
The proof follows from a straightforward combination of [BNS15, Lemma 4.1] and privacy amplification by sampling [KLN08, LQS12]. For completeness, we give an outline here. First, fix the randomness due to the sampling of the subset in Step 2 of the relabeling procedure. In this case, by [BNS15, Lemma 4.1], any algorithm that applies A to the outcome of the relabeling procedure is differentially private with parameters within constant factors of (ε, δ). Now we take into account the randomness due to the sampling in Step 2. By privacy amplification due to sampling [KLN08, LQS12], these parameters improve by a factor proportional to the sampling rate q, which yields the claimed guarantee. ∎
The following lemma establishes the accuracy of the hypothesis h̃ selected by the exponential mechanism in Step 4 of the relabeling procedure. In particular, the lemma asserts that the expected error of h̃ is close to that of the ERM hypothesis.
Let H be a hypothesis class with VC(H) = d, and let α, β ∈ (0, 1). Let D be an arbitrary distribution over X × Y, and let S be the input dataset to the relabeling procedure, where the sample size is sufficiently large (as a function of d, α, β, and the privacy parameter of the exponential mechanism). With probability at least 1 − β, the hypothesis h̃ (generated in Step 4 of the procedure) satisfies the following:
err(h̃; D) ≤ err(ĥ; D) + α, where ĥ is the ERM hypothesis w.r.t. the sample S′ generated in Step 2 of the procedure.
Note that the score function for the exponential mechanism counts the number of disagreements with the private labels, so its global sensitivity is 1. Now, by the standard accuracy guarantee of the exponential mechanism [MT07] (and the fact that it is instantiated here with an appropriate privacy parameter), with probability at least 1 − β/2, the score of h̃ exceeds the minimum score over H̃ by at most an additive term that is logarithmic in |H̃|/β.
Given the setting of |S′| in the lemma statement, this additive term is a vanishing fraction of |S′|. Using this together with Sauer’s Lemma [Sau72] to bound the size of H̃, it follows that the empirical error of h̃ is close to that of the ERM hypothesis ĥ.
Given the bound on VC(H) and the fact that |S′| is sufficiently large, a standard uniform convergence argument from learning theory [SSBD14] yields the following generalization error bounds: with probability at least 1 − β/2, the empirical errors of h̃ and ĥ are close to their respective expected errors, which completes the proof.
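For concreteness, the exponential-mechanism step can be written out as follows. These are the generic textbook bounds (with Δq denoting the score sensitivity, here 1), not necessarily the paper's exact constants.

```latex
% Score: number of disagreements with the private labels,
%   q(h) = \sum_{(x,y)} \mathbb{1}\{h(x) \neq y\}, \qquad \Delta q = 1.
% Standard utility of the exponential mechanism with parameter
% \varepsilon_0 [MT07]: with probability at least 1 - \beta,
q(\tilde{h}) \;\le\; \min_{h \in \tilde{H}} q(h)
   \;+\; \frac{2\,\Delta q}{\varepsilon_0}\,
         \ln\!\Big(\frac{|\tilde{H}|}{\beta}\Big).
% By Sauer's lemma, |\tilde{H}| \le (e\,|S'|/d)^{d}, hence
\ln|\tilde{H}| \;\le\; d\,\ln\!\big(e\,|S'|/d\big).
```

Dividing the first inequality by |S′| translates the additive term into an empirical-error bound, which is then converted to an expected-error bound via uniform convergence.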
3.2 A Private Classification-Query Release Algorithm
In this section, we describe the overall algorithm (Algorithm 5 below) that combines the two building blocks: the relabeling procedure and the algorithm of [BTT18]. As before, Algorithm 5 takes as input a private dataset S, the total number of queries m, and a sequence of classification queries x̃_1, …, x̃_m. In addition, it has oracle access to a non-private PAC learner for a hypothesis class H (in the realizable setting). Note that the dataset S̃ (output by the relabeling procedure) is relabeled using the hypothesis h̃. In order to ensure that the input to the next stage behaves like an i.i.d. sample, we sample |S̃| points with replacement from the empirical distribution of S̃ to form a new dataset. Next, we invoke the algorithm of [BTT18] in the realizable setting with this new dataset, the number of queries m, the query sequence, and the remaining input parameters. We set the cut-off parameter of the [BTT18] algorithm (the maximum number of allowable “unstable” queries) as a function of m and the accuracy parameter α. The privacy parameters passed to the [BTT18] algorithm are set to rescaled values that will be specified later; this is needed to ensure (ε, δ)-differential privacy for the entire construction. Finally, we output the sequence of private labels generated by the [BTT18] algorithm.
We formally state the main result of this paper in the following theorem.
Let H be a hypothesis class with VC(H) = d. For any α, β, ε, δ ∈ (0, 1), Algorithm 5 is an (α, β, ε, δ, m)-PCQR algorithm for H, where the private sample size n and the number of answerable queries m are as quantified in Lemma 3.5 below.
We prove the theorem via the following two lemmas, which establish the privacy and accuracy guarantees of our construction, respectively. ∎
Lemma 3.4 (Privacy Guarantee of Algorithm 5).
Algorithm 5 is (ε, δ)-differentially private (with respect to its input dataset).
In order to prove that Algorithm 5 is (ε, δ)-differentially private, it suffices to show that its second stage is differentially private with respect to its own input. Note that the input to the second stage is the dataset S̃ output by the relabeling procedure. Hence, it follows from Lemma 3.1 that if the second stage is differentially private with appropriately rescaled parameters, then Algorithm 5 is (ε, δ)-differentially private. Next, we show that the second stage is indeed differentially private with respect to S̃.
Let S̃ and S̃′ be neighboring datasets. W.l.o.g., assume that S̃ and S̃′ differ at index i. Let T denote the number of times the i-th index is sampled in the resampling step. Since the resampling draws |S̃| indices uniformly with replacement, a Chernoff bound implies that, with high probability, T = O(log(1/δ)).
Using the result in [BTT18, Theorem 3.1], the second-stage algorithm is differentially private with respect to its resampled input with the parameters it is instantiated with. Conditioned on the above bound on T, the notion of group privacy implies that the second stage, viewed as a function of S̃, is differentially private with parameters inflated by a factor of at most T. Combining this with the high-probability bound on the event T = O(log(1/δ)), and with the rescaling of the input privacy parameters described above, we conclude that the second stage is differentially private with respect to S̃, and hence Algorithm 5 is (ε, δ)-differentially private. ∎
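The two probabilistic facts invoked above can be stated explicitly (in their generic forms; the paper's exact constants may differ).

```latex
% Number of copies T of a fixed index among n uniform draws with
% replacement: T \sim \mathrm{Bin}(n, 1/n), so \mathbb{E}[T] = 1, and a
% Chernoff (Poisson-tail) bound gives, for t \ge 1,
\Pr[\,T \ge t\,] \;\le\; \Big(\frac{e}{t}\Big)^{t}
\quad\Longrightarrow\quad
\Pr\big[\,T \ge O(\log(1/\delta))\,\big] \;\le\; \delta .
% Group privacy: if an algorithm is (\varepsilon', \delta')-DP on
% neighboring inputs, then on inputs differing in T records it is
\big(\,T\varepsilon',\;\; T\,e^{T\varepsilon'}\delta'\,\big)\text{-DP}.
```

Setting ε′ and δ′ to the rescaled parameters passed to the second stage, and multiplying out the T = O(log(1/δ)) inflation, recovers the overall (ε, δ) budget.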
Lemma 3.5 (Accuracy Guarantee of Algorithm 5).
Let H be a hypothesis class with VC(H) = d. Let the oracle invoked by Algorithm 5 be a PAC learner for H (in the realizable setting). Let D be any distribution over X × Y. Let S ∼ D^n denote the input private sample, where n is sufficiently large
as a function of d, α, β, ε, δ, and m. Let y_1, …, y_m denote the corresponding true (hidden) labels for the queries x̃_1, …, x̃_m. Then, w.p. at least 1 − β (over the choice of S, the queries, and the randomness in