1 Introduction
Differential privacy (DP) [Dwork et al., 2006]
is one of the most popular approaches to addressing the privacy challenges in the era of artificial intelligence and big data. While differential privacy is certainly not a solution to all privacy-related problems, many would agree that it represents a gold standard and is a key enabler in many applications
[Machanavajjhala et al., 2008, Erlingsson et al., 2014, McMahan et al., 2018]. Recently, there has been an increasing demand for training machine learning and deep learning models with DP guarantees, which has motivated a growing body of research on this problem
[Kasiviswanathan et al., 2011, Chaudhuri et al., 2011, Bassily et al., 2014, Wang et al., 2015, Abadi et al., 2016]. In a nutshell, differentially private machine learning aims at providing formal privacy guarantees that provably nullify the risk of identifying individual data points in the training data, while still allowing the learned model to be deployed and to provide accurate predictions. Many of these methods work well in the low-dimensional regime where the model is small and the data is large. It remains a fundamental challenge, however, to avoid explicit dependence on the ambient dimension of the model and to develop practical methods for privately releasing deep learning models with a large number of parameters.
The “knowledge transfer” model of differentially private learning is a promising recent development [Papernot et al., 2017, 2018] which relaxes the problem by giving the learner access to a public unlabeled dataset. The main workhorse of this model is the Private Aggregation of Teacher Ensembles (PATE) framework:
The PATE Framework: Randomly partition the private dataset into splits. Train one “teacher” classifier on each split. Apply the “teacher” classifiers on public data and privately release their majority votes as pseudo-labels. Output the “student” classifier trained on the pseudo-labeled public data.
PATE achieves DP via the sample-and-aggregate scheme [Nissim et al., 2007] for releasing the pseudo-labels. Since the teachers are trained on disjoint splits of the private dataset, adding or removing one data point could affect only one of the teachers, hence limiting the influence of any single data point. The noise injected in the aggregation then “obfuscates” the output and yields provable privacy guarantees.
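As a concrete illustration, the pipeline above can be sketched in a few lines. The sketch below is our own minimal simulation, not code from the PATE papers: the “teachers” are toy stand-ins for models trained on disjoint splits, and the aggregation uses the standard Laplace noisy-max (one common instantiation of the sample-and-aggregate step).

```python
import math
import random
from collections import Counter

def sample_laplace(scale, rng):
    """Draw one sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def pate_pseudo_label(teachers, x, eps, rng):
    """Noisy-max aggregation over binary labels {0, 1}: perturb each
    vote count with Laplace(2/eps) noise and release the argmax."""
    votes = Counter(t(x) for t in teachers)
    noisy = {label: votes.get(label, 0) + sample_laplace(2.0 / eps, rng)
             for label in (0, 1)}
    return max(noisy, key=noisy.get)

rng = random.Random(0)
# Stand-in "teachers": each mimics a model trained on its own split,
# predicting the rule 1{x > 0} correctly about 90% of the time.
teachers = [
    (lambda x, good=rng.random() < 0.9:
        int(x > 0) if good else 1 - int(x > 0))
    for _ in range(50)
]
pseudo_labels = [pate_pseudo_label(teachers, x, eps=1.0, rng=rng)
                 for x in (-2.0, 3.0)]
```

Because roughly 45 of the 50 teachers agree on each query, the vote margin is far larger than the noise scale, so the released pseudo-labels match the majority vote with overwhelming probability.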
This approach is appealing in practice as it does not place any restrictions on the teacher classifiers, thus allowing any deep learning model to be used in a model-agnostic fashion. The competing alternative for differentially private deep learning, NoisySGD [Abadi et al., 2016], is not model-agnostic, and it requires significantly more tweaking and modification of the model to achieve comparable performance (e.g., on MNIST), if that is achievable at all.
There are a number of DP mechanisms that can be used to instantiate the PATE framework. The Laplace mechanism and the Gaussian mechanism are used in Papernot et al. [2017] and Papernot et al. [2018], respectively. This paper primarily considers the newer mechanism of Bassily et al. [2018a]
, which instantiates the PATE framework with a more data-adaptive scheme of private aggregation based on the Sparse Vector Technique (SVT). This approach allows PATE to privately label many examples while paying a privacy loss for only a small subset of them (see Algorithm 2 for details). Moreover, Bassily et al. [2018a] provide the first theoretical analysis of PATE, which shows that it is able to PAC-learn any hypothesis class with finite VC-dimension in the realizable setting. This is a giant leap from the standard differentially private learning models (without access to a public unlabeled dataset) because VC-classes are not privately learnable in general [Bun et al., 2015, Wang et al., 2016]. Bassily et al. [2018a] also establish a set of results in the agnostic learning setting, albeit less satisfying ones, as the excess risk, i.e., the error rate of the learned classifier relative to the optimal classifier, does not vanish as the number of data points increases. In this paper, we revisit the problem of model-agnostic private learning under the PATE framework with several new analytical and algorithmic tools from statistical learning theory, including the Tsybakov Noise Condition (TNC) [Mammen and Tsybakov, 1999], active learning [Hanneke, 2014], and the properties of voting classifiers. Our contributions are:

We show that PATE consistently learns any VC-class under TNC with fast rates and requires very few unlabeled public data points. When specializing to the realizable case, we show that the sample complexity bound (w.r.t. excess risk) of the SVT-based PATE is and for the private and public datasets respectively. The best known results [Bassily et al., 2018a] are (for private data) and (for public data).

We also analyze the standard Gaussian-mechanism-based PATE [Papernot et al., 2018] under TNC. In the realizable case, we obtain a sample complexity of and for the private and public datasets respectively, which matches the bound of [Bassily et al., 2018a] with a simpler and more practical algorithm that uses fewer public data points.

We show that PATE learning is inconsistent for agnostic learning in general and derive new learning bounds that compete against a sequence of limiting majority voting classifiers.

Finally, we propose a disagreement-based active learning algorithm to adaptively select which public data points to release. Under TNC, we show that active learning with the standard Gaussian mechanism (Algorithm 1) is able to match the learning bounds of the SVT-based method for private aggregation, modulo an additional dependence on the “disagreement coefficients”.
These results (summarized in Table 1) provide strong theoretical insight into how PATE works. Interestingly, our theory suggests that the Gaussian mechanism suffices, especially if we use active learning, and that it is better not to label all public data when the number of public data points is large. The remaining data points can be used for semi-supervised learning. These tricks have been independently proposed in empirical studies of PATE (see, e.g., semi-supervised learning [Papernot et al., 2017, 2018] and active learning [Zhao et al., 2019]), so our theory can be viewed as providing formal justification for these PATE variants, which are producing strong empirical results in deep learning with differential privacy.
2 Related Work
The literature on differentially private machine learning is enormous and it is impossible for us to provide an exhaustive discussion. Instead, we focus on a few closely related works and only briefly discuss other representative results in the broader theory of private learning.
2.1 Private Learning with an Auxiliary Public Dataset
The use of an auxiliary unlabeled public dataset was pioneered in empirical studies [Papernot et al., 2017, 2018], where PATE was proposed and shown to produce stronger results than NoisySGD in many regimes. Our work builds upon Bassily et al. [2018a]’s first analysis of PATE and substantially improves the theoretical underpinning. To the best of our knowledge, our results are new and we are the first to consider noise models and active learning for PATE.
Independently of our work, Alon et al. [2019] also studied the problem of private learning with access to an additional public dataset. Specifically, their result reveals an interesting “theorem of the alternatives”-type result that says either a VC-class is learnable without an auxiliary public dataset, or we need at least public data points, which essentially says that our sample complexity on the (unlabeled) public data points is optimal. They also provide an upper bound that says private data and public data are sufficient (assuming a constant privacy parameter ) to agnostically learn any class with VC-dimension to excess risk. Their algorithm, however, uses an explicit (distribution-independent) net construction due to Beimel et al. [2016] and the exponential mechanism for producing pseudo-labels, which cannot be implemented efficiently. Our contributions are complementary, as we focus on oracle-efficient algorithms that reduce to the learning bounds of ERM oracles (for passive learning) and active learning oracles. Our algorithms can therefore be implemented (and have been) in practice [Papernot et al., 2017, 2018]. Moreover, we show that under TNC, the inefficient construction is not needed and PATE is indeed consistent and enjoys faster rates. It remains an open problem how to achieve consistent private agnostic learning with only access to ERM oracles.
2.2 Privacypreserving prediction
There is another line of work [Dwork and Feldman, 2018] that focuses on the related problem of “privacy-preserving prediction”, which does not release the learned model (which we do), but instead privately answers one randomly drawn query (whereas we need to answer many queries, so as to train a model that can be released). While their technique can be used to obtain bounds in our setting, it often involves weaker parameters. More recent works under this model [see e.g., Dagan and Feldman, 2020, Nandi and Bassily, 2020] notably achieve consistent agnostic learning in this setting with rates comparable to those of Alon et al. [2019]. However, they rely on the same explicit net construction [Beimel et al., 2016], which renders their algorithms computationally inefficient in practice. In contrast, we analyze an oracle-efficient algorithm via a reduction to supervised learning (which is practically efficient if we believe supervised learning is easy).
2.3 Theory of Private Learning
More broadly, the learnability and sample complexity of private learning were studied under various models in Kasiviswanathan et al. [2011], Beimel et al. [2013, 2016], Chaudhuri and Hsu [2011], Bun et al. [2015], Wang et al. [2016], Alon et al. [2019]. VC-classes were shown to be learnable when either the hypothesis class or the data domain is finite [Kasiviswanathan et al., 2011]. Beimel et al. [2013] characterize the sample complexity of private learning in the realizable setting with a “dimension” that measures the extent to which we can construct a specific discretization of the hypothesis space that works for “all distributions” on data. Such a discretization is often not possible when and are both continuous. Specifically, the problem of learning threshold functions on , despite having a VC-dimension of , is not privately learnable [Chaudhuri and Hsu, 2011, Bun et al., 2015].
2.4 Weaker Private Learning Models
This setting of private learning was relaxed in various ways to circumvent the above artifact. These include protecting only the labels [Chaudhuri and Hsu, 2011, Beimel et al., 2016], leveraging prior knowledge with a prior distribution [Chaudhuri and Hsu, 2011], switching to the general learning setting with Lipschitz losses [Wang et al., 2016], relaxing the distribution-free assumption [Wang et al., 2016], and the setting we consider in this paper, where we assume the availability of an auxiliary public dataset [Bassily et al., 2018a, Alon et al., 2019]. Note that these settings are closely related to each other in that some additional information about the distribution of the data is needed.
2.5 Tsybakov Noise Condition and Statistical Learning Theory
The Tsybakov Noise Condition (TNC) [Mammen and Tsybakov, 1999, Tsybakov, 2004] is a natural and well-established condition in learning theory that has long been used in the analysis of passive as well as active learning [Boucheron et al., 2005]. The Tsybakov noise condition is known to yield better convergence rates for passive learning [Hanneke, 2014] and label savings for active learning [Zhang and Chaudhuri, 2014]. However, the contexts in which we use these techniques are different. For instance, while we make the TNC assumption, the purpose is not active learning, but rather to establish stability. When we apply active learning, it is for the synthetic learning problem with the privately released pseudo-labels, which does not actually satisfy TNC. To the best of our knowledge, we are the first to formally study noise models in the theory of private learning. Lastly, active learning was considered for PATE learning in [Zhao et al., 2019], which demonstrates the clear practical benefits of adaptively selecting what to label. We are, however, the first to provide a theoretical analysis with provable learning bounds.
3 Preliminaries
In this section, we introduce the notation and definitions, and discuss specific technical tools that we will use throughout this paper.
3.1 Symbols and Notations
We use to denote the set . Let denote the feature space, denote the label space, denote the sample space, and denote the space of datasets of unspecified size. A hypothesis (classifier) is a function mapping from to . A set of hypotheses is called a hypothesis class. The VC dimension of is denoted by . Also, let denote the distribution over , and denote the marginal distribution over . is the labeled private teacher dataset, and is the unlabeled public student dataset.
The expected risk of a hypothesis with respect to the distribution over is defined as , where is the indicator function, which equals 1 when the event is true and 0 otherwise. The empirical risk of a hypothesis with respect to a dataset is defined as . The best hypothesis is defined as , and the empirical risk minimizer (ERM) is defined as . is used to denote the aggregated classifier in the PATE framework, and denotes the privately aggregated one. The expected disagreement between a pair of hypotheses and with respect to the distribution is defined as . The empirical disagreement between a pair of hypotheses and with respect to a dataset is defined as . Throughout this paper, we use standard big-O notation; to improve readability, we use and to hide polylogarithmic factors.
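The empirical quantities above are simple averages and can be computed directly. The helper functions below (our own naming, with hypotheses represented as plain Python callables) match the definitions of empirical risk and empirical disagreement:

```python
def empirical_risk(h, data):
    """Fraction of labeled examples (x, y) that hypothesis h misclassifies."""
    return sum(h(x) != y for x, y in data) / len(data)

def empirical_disagreement(h1, h2, xs):
    """Fraction of (unlabeled) points on which h1 and h2 predict differently."""
    return sum(h1(x) != h2(x) for x in xs) / len(xs)

# A tiny hypothetical dataset and two hypotheses for illustration.
data = [(-1, 0), (-2, 0), (1, 1), (2, 1)]
h_good = lambda x: int(x > 0)    # consistent with the labels above
h_flip = lambda x: int(x <= 0)   # the opposite rule
```

On this toy dataset, `h_good` has empirical risk 0, `h_flip` has empirical risk 1, and the two hypotheses disagree everywhere.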
3.2 Differential Privacy and Private Learning
Now we formally introduce differential privacy.
Definition 1 (Differential Privacy [Dwork and Roth, 2014]).
A randomized algorithm M is (ε, δ)-DP (differentially private) if for every pair of neighboring datasets D ≃ D′ and for all measurable sets S of outputs: Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ.
The definition says that if an algorithm is DP, then no adversary can use the output of the algorithm to distinguish between two parallel worlds where an individual is in the dataset or not. The privacy loss parameters ε and δ quantify the strength of the DP guarantee; the closer they are to 0, the stronger the guarantee is.
The problem of DP learning aims at designing a randomized training algorithm that satisfies Definition 1. More often than not, the research question is about understanding the privacyutility tradeoffs and characterizing the Pareto optimal frontiers.
3.3 PATE and ModelAgnostic Private Learning
There are different ways we can instantiate the PATE framework to privately aggregate the teachers’ predicted labels. The simplest, described in Algorithm 1, uses the Gaussian mechanism to perturb the voting scores.
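A minimal sketch of this Gaussian-mechanism aggregation (our own function and parameter names; the noise calibration is schematic, not the paper's exact formula):

```python
import random

def gaussian_noisy_vote(vote_counts, sigma, rng):
    """Perturb each label's vote count with N(0, sigma^2) noise and
    return the label with the largest noisy count."""
    noisy = {label: count + rng.gauss(0.0, sigma)
             for label, count in vote_counts.items()}
    return max(noisy, key=noisy.get)

rng = random.Random(1)
# 100 teachers: 80 vote for label 1, 20 for label 0 -- a wide margin,
# so moderate noise (sigma = 5) essentially never flips the outcome.
released = [gaussian_noisy_vote({0: 20, 1: 80}, sigma=5.0, rng=rng)
            for _ in range(1000)]
flip_rate = released.count(0) / len(released)
```

The point of the sketch is the stability intuition: with a vote margin of 60 and noise standard deviation 5 per count, flipping the released label requires a roughly 8.5-sigma noise event, so the output is stable despite the perturbation.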
An alternative approach due to [Bassily et al., 2018a] uses the Sparse Vector Technique (SVT) in a non-trivial way to privately label substantially more data points in the cases when the teacher ensemble’s predictions are stable for most input data. The stability is quantified in terms of the margin function, defined as
(1)
which measures the absolute value of the difference between the numbers of votes received by the two labels (see Algorithm 2).
In both algorithms, the privacy budget parameters are taken as an input and the following privacy guarantee applies to all input datasets.
Careful readers may note slightly improved constants in the formula for calibrating privacy compared to when these methods were first introduced. We include the new proof, based on the concentrated differential privacy approach [Bun and Steinke, 2016], in Appendix A.
The key difference between the two private-aggregation mechanisms is that the standard PATE pays a unit privacy loss for every public data point labeled, while the SVT-based PATE essentially pays only for those queries where the voted answer from the teacher ensemble is close to being unstable (those with a small margin). Combining this intuition with the fact that the individual classifiers are accurate (by statistical learning theory, they are), one can show that the corresponding majority voting classifier is accurate with a large margin. These two critical observations of Bassily et al. [2018a] lead to the first learning-theoretic guarantees for SVT-based PATE. For completeness, we include this result with a concise new proof in Appendix A.
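The "pay only for unstable queries" idea can be sketched as follows. This is our own simplified caricature of the SVT-based aggregation (names, noise scales, and the stopping rule are illustrative; the paper's Algorithm 2 calibrates these precisely):

```python
import math
import random

def svt_label_queries(margins, threshold, eps, cutoff, rng):
    """Sparse-vector-style aggregation sketch: a query whose noisy margin
    clears a noisy threshold is answered with the (stable) majority vote;
    each query falling below the threshold consumes one unit of cutoff."""
    def lap(scale):
        u = rng.random() - 0.5
        return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

    noisy_threshold = threshold + lap(2.0 / eps)
    answers, paid = [], 0
    for m in margins:
        if m + lap(4.0 / eps) >= noisy_threshold:
            answers.append("stable")     # high margin: release majority vote
        else:
            answers.append("unstable")   # low margin: pay privacy for this one
            paid += 1
            if paid >= cutoff:
                break                    # cutoff budget exhausted; stop early
    return answers, paid

rng = random.Random(42)
margins = [200] * 20 + [2] + [210] * 10  # one unstable query among 31
answers, paid = svt_label_queries(margins, threshold=100, eps=1.0,
                                  cutoff=5, rng=rng)
```

With all but one margin far above the threshold, only a single query is charged against the cutoff, which is exactly the regime where SVT-based PATE labels many points cheaply.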
Lemma 3 (Adapted from Theorem 3.11 of Bassily et al. [2018b]).
Lemma 4 (Lemma 4.2 of Bassily et al. [2018b]).
If the classifiers are such that each of them makes at most mistakes on the data , then
Lemma 4 implies that if the individual classifiers are accurate (by statistical learning theory, they are), the corresponding majority voting classifier is not only nearly as accurate, but also has a sufficiently large margin to satisfy the conditions in Lemma 3.
Next, we state and provide a straightforward proof of the following results due to [Bassily et al., 2018b]. The results are already stated in the referenced work in the form of sample complexities, but we include a more direct analysis of the error bound and clarify a few technical subtleties.
Theorem 5 (Adapted from Theorems 4.6 and 4.7 of [Bassily et al., 2018b]).
We provide a self-contained proof of this result in the appendix (see Theorem 29). (The numerical constant might be improvable, and it is indeed worse than the result stated in Bassily et al. [2018a], but we present this version for the simplicity of the proof.)
Remark 6 (Error bounds when is sufficiently large).
Notice that we do not have to label all public data, so when we have a large number of public data points, we can afford to choose to be smaller so as to minimize the bound. That gives us an error bound of for the realizable case and of for the agnostic case. (These correspond to the sample complexity bound in Theorem 4.6 of [Bassily et al., 2018b] for realizable PAC learning with error , and the sample complexity bound in Theorem 4.7 of [Bassily et al., 2018b] for agnostic PAC learning with error . The privacy parameter is taken as a constant in these results.)
In Sections 4.1 and 4.2, we present a more refined theoretical analysis of the PATE with Passive Student Queries algorithm (PATE-PSQ, Algorithm 3), which uses the SVT-based Algorithm 2 as a subroutine. Our results provide stronger learning bounds and new theoretical insights under various settings. In Section 4.3, we propose a new active-learning-based method and show that we can obtain qualitatively the same theoretical gains while using the simpler (and often more practical) Gaussian-mechanism-based Algorithm 1 as the subroutine. For comparison, we also include an analysis of standard PATE (with the Gaussian mechanism) in Appendix A. Table 1 summarizes these technical results.
3.4 DisagreementBased Active Learning
We adopt the disagreement-based active learning algorithm, which comes with strong learning bounds (see, e.g., an excellent treatment of the subject in [Hanneke, 2014]). The exact algorithm, described in Algorithm 4, keeps updating a subset of the hypothesis class called the version space by collecting labels only from data points in a certain region of disagreement, and eliminates candidate hypotheses that are certifiably suboptimal.
Definition 7 (Region of disagreement [Hanneke, 2014]).
For a given hypothesis class , its region of disagreement is defined as the set of data points on which there exist two hypotheses in the class that disagree with each other,
The region of disagreement is the key concept of the disagreement-based active learning algorithm. It captures the region of data points over which the current version space is uncertain. The algorithm is fed a sequence of data points and runs in an online fashion: whenever a data point falls in this region, its label is queried, and any bad hypotheses are then removed from the version space.
The algorithm, as written, is not directly implementable, since it represents the version spaces explicitly, but there are practical implementations that avoid explicitly representing the version spaces via a reduction to supervised learning oracles.
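For a small finite class, the explicit version-space representation is harmless, and the loop above can be sketched directly (a CAL-style toy, with our own names; real implementations replace the explicit version space with supervised learning oracles as noted):

```python
def cal_active_learning(hypotheses, stream, oracle):
    """CAL-style loop over a finite hypothesis class: query a label only
    when the current version space disagrees on the point, then discard
    every hypothesis that got the queried label wrong."""
    version_space = list(hypotheses)
    queries = 0
    for x in stream:
        predictions = {h(x) for h in version_space}
        if len(predictions) > 1:            # x is in the region of disagreement
            y = oracle(x)
            queries += 1
            version_space = [h for h in version_space if h(x) == y]
    return version_space, queries

# Toy run: integer thresholds on a line; realizable labels from threshold 5.
thresholds = [lambda x, t=t: int(x >= t) for t in range(11)]
truth = lambda x: int(x >= 5)
stream = [0, 10, 2, 8, 4, 6, 5, 5, 3, 7]
vs, queries = cal_active_learning(thresholds, stream, truth)
```

On this stream, the loop queries 6 of the 10 points (the ones inside the shrinking disagreement region) and ends with the single correct threshold, illustrating the label savings.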
4 Main Results
In this section, we present our main results. Sections 4.1 and 4.2 provide new learning bounds for PATE-PSQ under noise models and in the agnostic setting. Section 4.3 introduces the PATE with Active Student Queries (PATE-ASQ) algorithm and analyzes its performance.
4.1 Improved Learning Bounds under TNC
Recall that our motivation is to analyze PATE in the cases when the best classifier does not achieve 0 error and the existing bound presented in Theorem 5 is vacuous if . The error bound of does not match the performance of even as and even if we output the voted labels without adding noise. This does not explain the empirical performance of Algorithm 3 reported in Papernot et al. [2017, 2018], which demonstrates that the retrained classifier from PATE can get quite close to the best non-private baselines even when the latter are far from perfect. For instance, on the Adult and SVHN datasets, the non-private baselines have accuracy 85% and 92.8%, and PATE achieves 83.7% and 91.6%, respectively.
To understand how PATE works in the regime where the best classifier obeys , we introduce a large family of learning problems that satisfy the so-called Tsybakov Noise Condition (TNC), under which we show that PATE is consistent with fast rates. To state TNC, we need to introduce a bit more notation. Let the label and define the regression function . The Tsybakov noise condition is defined in terms of the distribution of .
Definition 8 (Tsybakov noise condition).
The joint distribution of the data
satisfies the Tsybakov noise condition with parameter if there exists a universal constant such that for all . Note that when , the label is purely random, and when or , is a deterministic function of . The Tsybakov noise condition is essentially a reasonable “low noise” condition that does not require a uniform lower bound on for all . When the label noise is bounded for all , e.g., when with probability and with probability , the Tsybakov noise condition holds with . The case is also known as the Massart noise condition or the bounded noise condition in the statistical learning literature.
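A small simulation makes the bounded-noise (Massart) case concrete. In the toy setup below (entirely our own construction), labels equal the Bayes prediction flipped with a fixed probability eta < 1/2; in that case the excess risk of any classifier is exactly (1 - 2*eta) times its disagreement with the Bayes classifier, the linear relation characteristic of this noise regime:

```python
import random

rng = random.Random(2)
eta = 0.2                          # Massart flip probability
xs = [rng.uniform(-1.0, 1.0) for _ in range(200_000)]
bayes = lambda x: int(x > 0)       # Bayes optimal classifier for this toy problem
ys = [bayes(x) if rng.random() > eta else 1 - bayes(x) for x in xs]

def risk(f):
    """Empirical risk of classifier f on the noisy sample."""
    return sum(f(x) != y for x, y in zip(xs, ys)) / len(xs)

f = lambda x: int(x > 0.3)         # a deliberately suboptimal classifier
excess = risk(f) - risk(bayes)
disagreement = sum(f(x) != bayes(x) for x in xs) / len(xs)
# Bounded noise ties excess risk linearly to disagreement:
#   excess ~= (1 - 2*eta) * disagreement
```

With 200,000 samples, the empirical excess risk matches (1 - 2*eta) times the empirical disagreement up to small sampling error, which is the behavior the condition formalizes.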
For our purposes, it is more convenient to work with the following definition of TNC, which is equivalent to Definition 8 (see a proof in Bousquet et al. [2004, Definition 7]).
Lemma 9 (Equivalent definition of TNC).
We say that a distribution of satisfies the Tsybakov noise condition with parameter if and only if there exists such that, for every labeling function ,
(2) 
where is the Bayes optimal classifier.
In the remainder of this section, we assume that the Bayes optimal classifier is in the hypothesis class and work with the slightly weaker condition that requires (2) to hold only for hypotheses in the class, with replaced by the optimal classifier in the class. (This slightly different condition, which requires (2) to hold only for hypotheses in the class but with replaced by the optimal classifier, without assuming the Bayes classifier belongs to the class, is all we need. It is formally referred to as the Bernstein class condition by Hanneke [2014]. Somewhat confusingly, when the Tsybakov noise condition is referred to in more recent literature, it is in fact the Bernstein class condition: a slightly weaker but more opaque condition on both the hypothesis class and the data-generating distribution.)
We emphasize that the Tsybakov noise condition is not our invention. It has a long history in statistical learning theory, where it interpolates between the realizable setting and the agnostic setting. Specifically, problems satisfying TNC admit fast rates. For , the empirical risk minimizer achieves an excess risk of , which clearly interpolates between the realizable case of and the agnostic case of . Next, we give a novel analysis of Algorithm 3 under TNC. The analysis is simple but revealing, as it not only avoids the strong assumption that requires to be close to , but also achieves a family of fast rates that significantly improves the sample complexity of PATE learning, even in the realizable setting.
Theorem 10 (Utility guarantee of Algorithm 3 under TNC).
Assume the data distribution and the hypothesis class obey the Tsybakov noise condition with parameter . Then Algorithm 3 with
obeys that with probability at least :
Remark 11 (Bounded noise case).
When , the Tsybakov noise condition is implied by the bounded noise assumption, a.k.a. the Massart noise condition, where the labels are generated by the Bayes optimal classifier and then flipped with a fixed probability less than 1/2. Theorem 10 implies that the excess risk is bounded by , with , which implies a sample complexity upper bound of private data points and public data points. These results improve over the sample complexity bounds of Bassily et al. [2018a] in the stronger realizable setting, from and to and , respectively, for the private and public data.
There are two key observations behind the improvement. First, the teacher classifiers do not have to agree on the true labels as in Lemma 4; all they have to do is to agree on something for the majority of the data points. Conveniently, the Tsybakov noise condition implies that the teacher classifiers agree with the Bayes optimal classifier . Second, when the teachers agree with , the synthetic learning problem with the privately released pseudo-labels is nearly realizable. These intuitions are formalized in a few lemmas, which will be used in the proof of Theorem 10.
Lemma 12.
With probability over the training data of , assuming is the Bayes optimal classifier and the Tsybakov noise condition holds with parameter , there is a universal constant such that for all
Proof.
By the equivalent definition of the Tsybakov noise condition and then the learning bounds under TNC (Lemma 35),
∎
Lemma 13.
Under the conditions of Lemma 12, with probability , for all , the total number of mistakes made by a single teacher classifier with respect to can be bounded as:
Proof.
The number of mistakes made by with respect to is the empirical disagreement between and on the data points; therefore, by Bernstein’s inequality (Lemma 32),
∎
Using the above two lemmas we establish a bound on the number of examples where the differentially privately released labels differ from the prediction of .
Lemma 14.
Let Algorithm 3 be run with the number of teachers and the cutoff parameter chosen according to Theorem 10. Assume the conditions of Lemma 12. Then with high probability ( over the random coins of Algorithm 3 alone, conditioning on the high-probability events of Lemma 12 and Lemma 13), Algorithm 3 finishes all queries without exhausting the cutoff budget, and
The notation in the choices of and hides polynomial factors of , where is from Lemmas 12 and 13.
Proof.
Denote the bound from Lemma 13 by . By the same pigeonhole argument as in Lemma 4 (but with replaced by ), the number of queries that have margin smaller than is at most . The choice of
ensures that, with high probability over the Laplace random variables in Algorithm 2, in at least queries the answer , i.e.,
∎
Now we are ready to put everything together and prove Theorem 10.
Proof of Theorem 10.
Denote , where is the empirical average of the disagreements over the data points that the students have. (Note that in this case we could take since . We define this more generally so that later we can substitute with other label vectors that are not necessarily generated by any hypothesis in .) By the triangle inequality of the error,
(3) 
where the second line follows from the uniform Bernstein inequality (apply the first statement of Lemma 33 in Appendix C with ) and the third line is due to for non-negative .
By the triangle inequality, we have ; therefore
In the second line, we applied the fact that is the minimizer of ; in the third line, we applied the triangle inequality again; and the last line is true because , since is the minimizer and .
In light of the above analysis, it is clear that the improvement from our analysis under TNC is twofold: (1) we worked with the disagreement with respect to rather than ; (2) we used a uniform Bernstein bound rather than a uniform Hoeffding bound, which leads to the faster rate in terms of the number of public data points needed.
Remark 15 (Reduction to ERM).
The main challenge in the proof is to appropriately handle . Although we denote it as a classifier, it is in fact a vector that is defined only on rather than a general classifier that can take any input . Since we are using the SVT-based Algorithm 2, is only well-defined on the student dataset. Moreover, these privately released “pseudo-labels” are not independent, which makes it infeasible to invoke a generic learning bound such as Lemma 34. Our solution is to work with the empirical risk minimizer (ERM, rather than a generic PAC learner as a black box) and use uniform convergence (Lemma 33) directly. This is without loss of generality because all learnable problems are learnable by (asymptotic) ERM [Vapnik, 1995, Shalev-Shwartz et al., 2010].
4.2 Challenges and New Bounds under Agnostic Setting
In this section, we present a more refined analysis of the agnostic setting. We first argue that agnostic learning with Algorithm 3 will not be consistent in general, and that competing against the best classifier in does not seem to be the right comparison. The form of the pseudo-labels mandates that is aiming to fit a labeling function that is inherently a voting classifier. The literature on ensemble methods has taught us that the voting classifier is qualitatively different from the individual voters. In particular, the error rate of the majority voting classifier can be significantly better, about the same, or significantly worse than the average error rate of the individual voters. We illustrate this matter with two examples.
Example 16 (Voting fail).
Consider a uniform distribution on four points, each with label 1, and let the hypothesis class consist of the three classifiers whose evaluations are given in Figure 1. Check that the classification error of all three classifiers is 1/2. Also note that the empirical risk minimizer will be uniformly distributed over the three classifiers. The majority voting classifiers, learned with i.i.d. data sets, will perform significantly worse and converge to a classification error of 3/4 exponentially quickly as the number of classifiers grows.

            x1  x2  x3  x4  Error
  y          1   1   1   1   0
  h1         1   1   0   0   0.5
  h2         1   0   1   0   0.5
  h3         1   0   0   1   0.5
  majority   1   0   0   0   0.75
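Example 16 can be reproduced numerically. The simulation below (our own construction, mirroring the table in Figure 1) trains many ERM "teachers" on independent samples and shows that their majority vote errs on three of the four points:

```python
import random

# Four equiprobable points, all labeled 1, and three hypotheses each
# with true error 1/2 (the class from Example 16 / Figure 1).
points = [0, 1, 2, 3]
label = {p: 1 for p in points}
H = [
    lambda p: int(p in (0, 1)),   # h1: correct on x1, x2
    lambda p: int(p in (0, 2)),   # h2: correct on x1, x3
    lambda p: int(p in (0, 3)),   # h3: correct on x1, x4
]

def erm(sample, rng):
    """Return an empirical risk minimizer over H, breaking ties at random."""
    errs = [sum(h(p) != label[p] for p in sample) for h in H]
    best = min(errs)
    return rng.choice([h for h, e in zip(H, errs) if e == best])

rng = random.Random(3)
teachers = [erm([rng.choice(points) for _ in range(50)], rng)
            for _ in range(501)]

def majority(p):
    """Majority vote of the trained teachers on point p."""
    return int(sum(t(p) for t in teachers) > len(teachers) / 2)

voting_error = sum(majority(p) != label[p] for p in points) / len(points)
```

By symmetry, each teacher lands on h1, h2, or h3 with probability about 1/3, so on each of x2, x3, x4 only about a third of the teachers vote 1 and the majority predicts 0, yielding a voting error of 3/4 even though every individual teacher has error 1/2.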
This example illustrates that the PATE framework cannot consistently learn a VCclass in the agnostic setting in general. On a positive note, there are also cases where the majority voting classifier boosts the classification accuracy significantly, such as the following example.
Example 17 (Voting win).
If for all , where is a small constant, then by Hoeffding’s inequality the error of the majority vote goes to 0 exponentially as the number of teachers grows.
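The exponential decay is easy to see numerically. In the toy simulation below (our own setup), each teacher is independently correct with probability 1/2 + gamma on a given point, and we estimate the majority vote's error for growing ensemble sizes:

```python
import random

rng = random.Random(4)
gamma = 0.1    # each teacher is correct w.p. 1/2 + gamma, independently

def majority_error(n_teachers, trials=2000):
    """Monte Carlo estimate of the probability that the majority of
    n_teachers independent votes is wrong."""
    wrong = 0
    for _ in range(trials):
        correct_votes = sum(rng.random() < 0.5 + gamma
                            for _ in range(n_teachers))
        wrong += correct_votes <= n_teachers / 2
    return wrong / trials

errs = [majority_error(n) for n in (1, 25, 101)]
```

With gamma = 0.1, the estimated majority error drops from about 0.4 (one teacher) to roughly 0.15 (25 teachers) to a few percent (101 teachers), consistent with the Hoeffding bound exp(-2 * gamma^2 * n).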
These cases call for an alternative distributiondependent theory of learning that characterizes the performance of Algorithm 3 more accurately.
Next, we propose two changes to the learning paradigm. First, we need to go beyond and compare with the following infinite ensemble classifier
The classifier outputs the majority vote of infinitely many independent teachers, each trained on i.i.d. data points. As discussed earlier, this classifier can be better or worse than a single classifier trained on data points, one trained on all data points, or the optimal classifier in . Note that this classifier also changes as gets larger.
Considering different centers for the teacher classifiers to agree on is one of the key ideas of this paper. Figure 2 shows three kinds of centers for the teachers to agree on. In Bassily et al. [2018a], the center is the true label in the realizable setting. In Section 4.1, under TNC, we analyze the performance of PATE-PSQ, where the center is the optimal hypothesis . In the agnostic setting, as we will see, the natural center of agreement is , by definition.
Second, we define the expected margin for a classifier trained on i.i.d. samples to be
This quantity captures, for a fixed , how likely the teachers are to agree. For a fixed learning problem and a fixed number of i.i.d. data points each teacher is trained on, the expected margin is a function of alone. The larger it is, the more likely that the ensemble of teachers agrees on a prediction with high confidence. Note that, unlike in Example 17, we do not require the teachers to agree on the true label. Instead, the quantity measures the extent to which they agree with , which could output any label.
When the expected margin is bounded away from 0 at a point , the voting classifier outputs with probability converging exponentially to 1 as gets larger. On a technical level, this definition allows us to decouple the stability analysis from the accuracy of PATE, as the latter relies on how good is.
Definition 18 (Approximate high margin).
We say that a learning problem with i.i.d. samples satisfies the approximate high-margin condition if
This definition says that with high probability, except for data points, all other data points in the public dataset have an expected margin of at least . Observe that every learning problem has a value of that increases from to as we vary from to . The realizability assumption and the Tsybakov noise condition that we considered up to this point imply upper bounds on at a fixed (see more details in Remark 22). In Appendix E, we demonstrate that for the problem of linear classification on the Adult dataset (clearly an agnostic learning problem), the approximate high-margin condition is satisfied with a small and a large .
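The expected margin can be estimated empirically by training many teacher replicas and measuring how lopsided their votes are on each public point. The sketch below is our own toy setup (a noisy threshold problem with a crude grid-search ERM as the "teacher"), not the paper's Appendix E experiment:

```python
import random

rng = random.Random(5)

def train_teacher(m, rng):
    """Stand-in 'teacher': fits a threshold to m noisy samples of the
    rule y = 1{x > 0} with 10% label noise (an agnostic toy problem)."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(m)]
    ys = [int(x > 0) if rng.random() > 0.1 else 1 - int(x > 0) for x in xs]
    grid = [t / 10.0 for t in range(-10, 11)]      # crude ERM over a grid
    best = min(grid, key=lambda t: sum(int(x > t) != y
                                       for x, y in zip(xs, ys)))
    return lambda x, t=best: int(x > t)

teachers = [train_teacher(200, rng) for _ in range(200)]

def expected_margin(x):
    """Empirical analogue of the expected margin: normalized gap between
    the two labels' vote fractions among independently trained teachers."""
    ones = sum(t(x) for t in teachers) / len(teachers)
    return abs(2.0 * ones - 1.0)

public = [rng.uniform(-1.0, 1.0) for _ in range(500)]
low_margin_frac = sum(expected_margin(x) < 0.5 for x in public) / len(public)
```

Even though the problem is agnostic (10% label noise), only the small band of public points near the decision boundary has a low expected margin, which is the qualitative behavior the approximate high-margin condition captures.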
The following proposition shows that when a problem is approximately high-margin, there are choices of and under which the SVT-based PATE provably labels almost all data points with the output of .
Proposition 19.
Assume the learning problem with i.i.d. data points satisfies the approximate high-margin condition. Let Algorithm 2 be instantiated with parameters
then with high probability (over the randomness of the i.i.d. samples of the private dataset, the i.i.d. samples of the public dataset, and the randomized algorithm), Algorithm 2 finishes all rounds and the output is the same as for all but of the .
Proof.
By Bernstein’s inequality, with probability over the i.i.d. samples of the public data, the number of queries with is smaller than . is an upper bound on this quantity if we choose .
Conditioning on the above event, by Hoeffding’s inequality and a union bound, with probability over the i.i.d. samples of the private data (hence the i.i.d. teacher classifiers), for all queries with larger than , the realized margin (defined in (1)) obeys
It remains to check that under our choice of , for all except the (up to) exceptions.
By the tail of the Laplace distribution and a union bound, with probability , all Laplace random variables that perturb the distance to stability in Algorithm 7 are larger than and all