Revisiting Model-Agnostic Private Learning: Faster Rates and Active Learning

The Private Aggregation of Teacher Ensembles (PATE) framework is one of the most promising recent approaches to differentially private learning. Existing theoretical analysis shows that PATE consistently learns any VC-class in the realizable setting, but falls short of explaining its success in more general cases where the error rate of the optimal classifier is bounded away from zero. We fill in this gap by introducing the Tsybakov Noise Condition (TNC) and establishing stronger and more interpretable learning bounds. These bounds provide new insights into when PATE works and improve over existing results even in the narrower realizable setting. We also investigate the compelling idea of using active learning to save privacy budget. The novel components in the proofs include a more refined analysis of the majority voting classifier (which could be of independent interest) and an observation that the synthetic "student" learning problem is nearly realizable by construction under the Tsybakov noise condition.







1 Introduction

Differential privacy (DP) [Dwork et al., 2006]

is one of the most popular approaches towards addressing the privacy challenges in the era of artificial intelligence and big data. While differential privacy is certainly not a solution to all privacy-related problems, many would agree that it represents a gold standard and is a key enabler in many applications

[Machanavajjhala et al., 2008, Erlingsson et al., 2014, McMahan et al., 2018].

Recently, there has been an increasing demand for training machine learning and deep learning models with DP guarantees, which has motivated a growing body of research on this problem

[Kasiviswanathan et al., 2011, Chaudhuri et al., 2011, Bassily et al., 2014, Wang et al., 2015, Abadi et al., 2016].

In a nutshell, differentially private machine learning aims at providing formal privacy guarantees that provably nullify the risk of identifying individual data points in the training data, while still allowing the learned model to be deployed and to provide accurate predictions. Many of these methods work well in the low-dimensional regime where the model is small and the data is large. It remains a fundamental challenge, however, to avoid explicit dependence on the ambient dimension of the model and to develop practical methods for privately releasing deep learning models with large numbers of parameters.

The “knowledge transfer” model of differentially private learning is a promising recent development [Papernot et al., 2017, 2018] which relaxes the problem by giving the learner access to a public unlabeled dataset. The main workhorse of this model is the Private Aggregation of Teacher Ensembles (PATE) framework:

The PATE Framework: Randomly partition the private dataset into splits. Train one “teacher” classifier on each split. Apply the “teacher” classifiers on public data and privately release their majority votes as pseudo-labels. Output the “student” classifier trained on the pseudo-labeled public data.
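The four steps above can be sketched as follows. This is an illustrative sketch for binary labels, not the paper's implementation; `train_model`, the noise scale `sigma`, and all other names are placeholders.

```python
import numpy as np

def pate_pipeline(private_X, private_y, public_X, train_model, k=50, sigma=4.0,
                  rng=np.random.default_rng(0)):
    """Sketch of the PATE framework (hypothetical helper names)."""
    # 1. Randomly partition the private dataset into k splits.
    idx = rng.permutation(len(private_X))
    splits = np.array_split(idx, k)
    # 2. Train one "teacher" classifier on each split.
    teachers = [train_model(private_X[s], private_y[s]) for s in splits]
    # 3. Privately release noisy majority votes on public data as pseudo-labels.
    pseudo_labels = []
    for x in public_X:
        votes = np.bincount([t.predict(x) for t in teachers], minlength=2)
        noisy = votes + rng.normal(0.0, sigma, size=2)
        pseudo_labels.append(int(np.argmax(noisy)))
    # 4. Output the "student" trained on the pseudo-labeled public data.
    return train_model(public_X, np.array(pseudo_labels))
```

The privacy analysis concerns step 3 only: the student never touches the private data directly.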

PATE achieves DP via the sample-and-aggregate scheme [Nissim et al., 2007] for releasing the pseudo-labels. Since the teachers are trained on disjoint splits of the private dataset, adding or removing one data point could affect only one of the teachers, hence limiting the influence of any single data point. The noise injected in the aggregation will then be able to “obfuscate” the output and obtain provable privacy guarantees.

This approach is appealing in practice as it does not place any restrictions on the teacher classifiers, thus allowing any deep learning model to be used in a model-agnostic fashion. The competing alternative for differentially private deep learning, NoisySGD [Abadi et al., 2016], is not model-agnostic; it requires significantly more tweaking and modification of the model to achieve comparable performance (e.g., on MNIST), if that is achievable at all.

There are a number of DP mechanisms that can be used to instantiate the PATE framework. The Laplace mechanism and the Gaussian mechanism are used in Papernot et al. [2017] and Papernot et al. [2018] respectively. This paper primarily considers the newer mechanism of Bassily et al. [2018a], which instantiates the PATE framework with a more data-adaptive scheme of private aggregation based on the Sparse Vector Technique (SVT). This approach allows PATE to privately label many examples while paying a privacy loss for only a small subset of them (see Algorithm 2 for details). Moreover, Bassily et al. [2018a] provide the first theoretical analysis of PATE, which shows that it can PAC-learn any hypothesis class with finite VC-dimension in the realizable setting. This is a giant leap from the standard differentially private learning models (without access to a public unlabeled dataset), because VC-classes are not privately learnable in general [Bun et al., 2015, Wang et al., 2016]. Bassily et al. [2018a] also establish a set of results for the agnostic learning setting, albeit less satisfying ones, as the excess risk, i.e., the error rate of the learned classifier relative to the optimal classifier, does not vanish as the number of data points increases.

Columns: PATE (Gaussian mechanism) [Papernot et al., 2017 / this paper], PATE (SVT-based) [Bassily et al., 2018a / this paper], and PATE (active learning) [this paper]. Rows: the realizable setting, TNC, and the agnostic setting (with two choices of comparator); the active-learning variant is consistent under weaker conditions in the agnostic case. The rate entries appear in the precise statements in Sections 4.1, 4.2 and 4.3.

  • Results new to this paper are highlighted in blue. Hyperparameters are chosen optimally, and the number of public data points we privately label is chosen optimally (subsampling the available public data to run PATE) so as to minimize the risk bound.

  • The sizes of the private and public datasets, the privacy budget for DP (with the privacy parameters assumed to be in their typical ranges), the TNC parameter, the VC-dimension of the hypothesis class, and the disagreement coefficient [Hanneke, 2014] enter the bounds; poly-logarithmic factors in these quantities and in the failure probability are hidden.

  • Precise theorem statements of these results are found in Sections 4.1, 4.2 and 4.3. Results about PATE (Gaussian mechanism) can be found in Appendix B.

Table 1: Summary of our results: excess risk bounds for PATE algorithms.

In this paper, we revisit the problem of model-agnostic private learning under the PATE framework with several new analytical and algorithmic tools from statistical learning theory, including the Tsybakov Noise Condition (TNC) [Mammen and Tsybakov, 1999], active learning [Hanneke, 2014], and properties of voting classifiers.

Our contributions are:

  1. We show that PATE consistently learns any VC-classes under TNC with fast rates and requires very few unlabeled public data points. When specializing to the realizable case, we show that the sample complexity bound (w.r.t. -excess risk) of the SVT-based PATE is and for the private and public datasets respectively. The best previously known results [Bassily et al., 2018a] are (for private data) and (for public data).

  2. We also analyze the standard Gaussian-mechanism-based PATE [Papernot et al., 2018] under TNC. In the realizable case, we obtain a sample complexity of and for the private and public datasets respectively, which matches the bound of Bassily et al. [2018a] with a simpler and more practical algorithm that uses fewer public data points.

  3. We show that PATE learning is inconsistent for agnostic learning in general and derive new learning bounds that compete against a sequence of limiting majority voting classifiers.

  4. Finally, we propose a disagreement-based active learning algorithm to adaptively select which public data points to release. Under TNC, we show that active learning with the standard Gaussian mechanism (Algorithm 1) matches the learning bounds of the SVT-based method for private aggregation, modulo an additional dependence on the "disagreement coefficients".

These results (summarized in Table 1) provide strong theoretical insight into how PATE works. Interestingly, our theory suggests that the Gaussian mechanism suffices, especially if we use active learning, and that it is better not to label all public data when the number of public data points is large; the remaining data points can be used for semi-supervised learning. These tricks have been independently proposed in empirical studies of PATE (see, e.g., semi-supervised learning [Papernot et al., 2017, 2018] and active learning [Zhao et al., 2019]), so our theory can be viewed as providing formal justification for these PATE variants, which have produced strong empirical results in deep learning with differential privacy.

2 Related Work

The literature on differentially private machine learning is enormous, and it is impossible for us to provide an exhaustive discussion. Instead, we focus on a few closely related works and only briefly discuss other representative results in the broader theory of private learning.

2.1 Private Learning with an Auxiliary Public Dataset

The use of an auxiliary unlabeled public dataset was pioneered in empirical studies [Papernot et al., 2017, 2018], where PATE was proposed and shown to produce stronger results than NoisySGD in many regimes. Our work builds upon Bassily et al. [2018a]'s first analysis of PATE and substantially improves its theoretical underpinning. To the best of our knowledge, our results are new, and we are the first to consider noise models and active learning for PATE.

Independently of our work, Alon et al. [2019] also studied the problem of private learning with access to an additional public dataset. Specifically, they establish an interesting "theorem of the alternatives"-type result: either a VC-class is learnable without an auxiliary public dataset, or we need at least public data points, which essentially says that our sample complexity in the (unlabeled) public data points is optimal. They also provide an upper bound showing that private data and public data are sufficient (assuming a constant privacy parameter ) to agnostically learn any class with VC-dimension to -excess risk. Their algorithm, however, uses an explicit (distribution-independent) -net construction due to Beimel et al. [2016] and the exponential mechanism for producing pseudo-labels, and hence cannot be efficiently implemented. Our contributions are complementary, as we focus on oracle-efficient algorithms that reduce to ERM oracles (for passive learning) and active learning oracles. Our algorithms can therefore be implemented (and have been) in practice [Papernot et al., 2017, 2018]. Moreover, we show that under TNC the inefficient construction is not needed: PATE is indeed consistent and enjoys faster rates. It remains an open problem how to achieve consistent private agnostic learning with access only to ERM oracles.

2.2 Privacy-preserving prediction

There is another line of work [Dwork and Feldman, 2018] on the related problem of "privacy-preserving prediction", which does not release the learned model (as we do), but instead privately answers a single randomly drawn query (whereas we need to answer many queries, so as to train a model that can be released). While their techniques can be used to obtain bounds in our setting, doing so often involves weaker parameters. More recent works under this model [see, e.g., Dagan and Feldman, 2020, Nandi and Bassily, 2020] notably achieve consistent agnostic learning in this setting with rates comparable to those of Alon et al. [2019]. However, they rely on the same explicit -net construction [Beimel et al., 2016], which renders their algorithms computationally inefficient in practice. In contrast, we analyze an oracle-efficient algorithm via a reduction to supervised learning (which is practically efficient if we believe supervised learning is easy).

2.3 Theory of Private Learning

More broadly, the learnability and sample complexity of private learning were studied under various models in Kasiviswanathan et al. [2011], Beimel et al. [2013, 2016], Chaudhuri and Hsu [2011], Bun et al. [2015], Wang et al. [2016], Alon et al. [2019]. VC-classes were shown to be learnable when either the hypothesis class or the data domain is finite [Kasiviswanathan et al., 2011]. Beimel et al. [2013] characterize the sample complexity of private learning in the realizable setting via a "dimension" that measures the extent to which one can construct a specific discretization of the hypothesis space that works for "all distributions" over the data. Such a discretization is often impossible when and are both continuous. In particular, the problem of learning threshold functions on , despite its VC-dimension of , is not privately learnable [Chaudhuri and Hsu, 2011, Bun et al., 2015].

2.4 Weaker Private Learning Models

The setting of private learning has been relaxed in various ways to circumvent the above hardness results. These include protecting only the labels [Chaudhuri and Hsu, 2011, Beimel et al., 2016], leveraging prior knowledge with a prior distribution [Chaudhuri and Hsu, 2011], switching to the general learning setting with Lipschitz losses [Wang et al., 2016], relaxing the distribution-free assumption [Wang et al., 2016], and, as in this paper, assuming the availability of an auxiliary public dataset [Bassily et al., 2018a, Alon et al., 2019]. Note that these settings are closely related to each other in that some additional information about the distribution of the data is needed.

2.5 Tsybakov Noise Condition and Statistical Learning Theory

The Tsybakov Noise Condition (TNC) [Mammen and Tsybakov, 1999, Tsybakov, 2004] is a natural and well-established condition in learning theory that has long been used in the analysis of both passive and active learning [Boucheron et al., 2005]. It is known to yield better convergence rates for passive learning [Hanneke, 2014] and label savings for active learning [Zhang and Chaudhuri, 2014]. However, the contexts in which we use these techniques are different. For instance, while we assume TNC, the purpose is not active learning but rather to establish stability. When we do apply active learning, it is to the synthetic learning problem with privately released pseudo-labels, which does not itself satisfy TNC. To the best of our knowledge, we are the first to formally study noise models in the theory of private learning. Lastly, active learning was considered for PATE in [Zhao et al., 2019], which demonstrates the clear practical benefits of adaptively selecting what to label; we are nevertheless the first to provide a theoretical analysis with provable learning bounds.

3 Preliminaries

In this section, we introduce the notation and definitions, and discuss specific technical tools that we will use throughout this paper.

3.1 Symbols and Notations

We use to denote the set . Let denote the feature space, denote the label, to denote the sample space, and to denote the space of a dataset of unspecified size. A hypothesis (classifier) is a function mapping from to . A set of hypotheses is called the hypothesis class. The VC dimension of is denoted by . Also, let denote the distribution over , and denote the marginal distribution over . is the labeled private teacher dataset, and is the unlabeled public student dataset.

The expected risk of a certain hypothesis with respect to the distribution over is defined as , where is the indicator function which equals to when is true, otherwise. The empirical risk of a certain hypothesis with respect to a dataset is defined as . The best hypothesis is defined as , and the Empirical Risk Minimizer (ERM) is defined as . is used to denote the aggregated classifier in the PATE framework. denotes the privately aggregated one. The expected disagreement between a pair of hypotheses and with respect to the distribution is defined as . The empirical disagreement between a pair of hypotheses and with respect to a dataset is defined as . Throughout this paper, we use standard big notations; and to improve the readability, we use and to hide poly-logarithmic factors.
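For concreteness, the empirical risk and the empirical disagreement defined above can be computed with the following hypothetical NumPy helpers (each hypothesis is a vectorized function from features to labels):

```python
import numpy as np

def empirical_risk(h, X, y):
    """Empirical risk of h on the dataset (X, y): the fraction of
    points where h's prediction differs from the label."""
    return float(np.mean(h(X) != y))

def empirical_disagreement(h1, h2, X):
    """Empirical disagreement between h1 and h2 on X: the fraction of
    points where the two hypotheses predict different labels."""
    return float(np.mean(h1(X) != h2(X)))
```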

3.2 Differential Privacy and Private Learning

Now we formally introduce differential privacy.

Definition 1 (Differential Privacy [Dwork and Roth, 2014]).

A randomized algorithm is ()-DP (differentially private) if for every pair of neighboring datasets (denoted ) and for all :
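The elided display is the standard (ε, δ)-DP inequality, reproduced here in standard notation:

```latex
\Pr\big[\mathcal{A}(D) \in S\big] \;\le\; e^{\varepsilon}\,\Pr\big[\mathcal{A}(D') \in S\big] + \delta
\qquad \text{for all measurable sets } S \text{ and all neighboring } D \simeq D'.
```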

The definition says that if an algorithm is DP, then no adversary can use the output of to distinguish between two parallel worlds where an individual is in the dataset or not. are privacy loss parameters that quantify the strength of the DP guarantee. The closer they are to , the stronger the guarantee is.

The problem of DP learning aims at designing a randomized training algorithm that satisfies Definition 1. More often than not, the research question is about understanding the privacy-utility trade-offs and characterizing the Pareto optimal frontiers.

3.3 PATE and Model-Agnostic Private Learning

There are different ways to instantiate the PATE framework to privately aggregate the teachers' predicted labels. The simplest, described in Algorithm 1, uses the Gaussian mechanism to perturb the voting score.

Input: “Teachers” trained on disjoint subsets of the private data. “Nature” chooses an adaptive sequence of data points . Privacy parameters .

1:  Find such that .
2:  Nature chooses .
3:  for  do
4:     Output .
5:     Nature chooses adaptively (as a function of the output vector till time ).
6:  end for
Algorithm 1 Standard PATE [Papernot et al., 2018]
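A minimal sketch of Algorithm 1's per-query noisy-vote release for binary labels. The noise scale `sigma` is taken as input here rather than calibrated from the privacy parameters as in the paper; all helper names are hypothetical.

```python
import numpy as np

def gaussian_pate_labels(teachers, stream, sigma, rng=np.random.default_rng(0)):
    """Release a noisy majority vote for each query in the stream.
    `teachers` is a list of callables x -> {0, 1}; `sigma` must be
    calibrated to (epsilon, delta) separately."""
    out = []
    for x in stream:
        # Vote histogram over the two labels, perturbed by Gaussian noise.
        votes = np.bincount([t(x) for t in teachers], minlength=2).astype(float)
        out.append(int(np.argmax(votes + rng.normal(0.0, sigma, size=2))))
    return out
```

Note that every query consumes privacy budget here, which is exactly the weakness the SVT-based variant below addresses.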

An alternative approach due to Bassily et al. [2018a] uses the Sparse Vector Technique (SVT) in a nontrivial way to privately label substantially more data points in cases where the teacher ensemble's predictions are stable for most input data. The stability is quantified in terms of the margin function, defined as the absolute value of the difference between the numbers of votes for the two labels (see Algorithm 2).

Input: “Teacher” classifiers trained on disjoint subsets of the private data. “Nature” chooses an adaptive sequence of data points . Unstable cutoff , privacy parameters .

1:  Nature chooses .
2:  .
3:  .
4:  .
5:  for  do
6:     .
7:     .
8:     if  then
9:        Output .
10:     else
11:        Output .
12:        , break if .
13:        .
14:     end if
15:     Nature chooses adaptively (based on ).
16:  end for
Algorithm 2 SVT-based PATE [Bassily et al., 2018a]
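A minimal sketch of the SVT-based aggregation in Algorithm 2. The Laplace noise scales here are placeholders, not the calibrated values from the paper, and the names are hypothetical; the point is the control flow: stable queries are answered essentially for free, and only unstable queries count against the cutoff.

```python
import numpy as np

def svt_pate_labels(teachers, stream, cutoff_T, threshold, lap_scale,
                    rng=np.random.default_rng(0)):
    """SVT-style aggregation sketch. margin(x) = |#votes_1 - #votes_0|."""
    k = len(teachers)
    noisy_thresh = threshold + rng.laplace(0.0, lap_scale)  # noisy SVT threshold
    unstable = 0
    out = []
    for x in stream:
        votes = sum(t(x) for t in teachers)
        margin = abs(votes - (k - votes))
        if margin + rng.laplace(0.0, 2 * lap_scale) >= noisy_thresh:
            # Stable query: release the majority label.
            out.append(int(votes > k - votes))
        else:
            # Unstable query: abstain, pay budget, refresh the threshold.
            out.append(None)
            unstable += 1
            if unstable >= cutoff_T:
                break
            noisy_thresh = threshold + rng.laplace(0.0, lap_scale)
    return out
```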

In both algorithms, the privacy budget parameters are taken as input, and the following privacy guarantee applies to all input datasets.

Theorem 2.

Algorithm 1 and 2 are both ()-DP.

Careful readers may notice that the constants in the privacy-calibration formulas are slightly improved relative to when these methods were first introduced. We include the new proof, based on concentrated differential privacy [Bun and Steinke, 2016], in Appendix A.

The key difference between the two private-aggregation mechanisms is that standard PATE pays a unit privacy loss for every public data point labeled, while the SVT-based PATE essentially pays only for those queries where the voted answer from the teacher ensemble is close to being unstable (those with a small margin). Combining this intuition with the fact that the individual classifiers are accurate (by standard statistical learning theory, they are), one can show that the corresponding majority voting classifier is accurate with a large margin. These two critical observations of Bassily et al. [2018a] lead to the first learning-theoretic guarantees for SVT-based PATE. For completeness, we include this result with a concise new proof in Appendix A.

Lemma 3 (Adapted from Theorem 3.11 of Bassily et al. [2018b]).

If the classifiers and the sequence obey that there are at most of them such that for . Then with probability at least , Algorithm 2 finishes all queries and for all such that , the output of Algorithm 2 is .

Lemma 4 (Lemma 4.2 of Bassily et al. [2018b]).

If the classifiers obey that each of them makes at most mistakes on data , then

Lemma 4 implies that if the individual classifiers are accurate (by standard statistical learning theory, they are), the corresponding majority voting classifier is not only nearly as accurate, but also has a sufficiently large margin to satisfy the conditions of Lemma 3.
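The counting (pigeonhole) argument behind this kind of statement can be checked numerically: if the total number of teacher mistakes is small, only few points can have many dissenting teachers, hence few points can have a small margin. The helper below is purely illustrative, with hypothetical names.

```python
def low_margin_count_bound(mistakes_per_teacher, n, j):
    """Given each teacher's set of mistake indices on n points, return
    (a) the number of points on which at least j teachers dissent, and
    (b) the pigeonhole bound (total mistakes) / j that dominates it."""
    dissent = [0] * n
    total = 0
    for S in mistakes_per_teacher:
        total += len(S)
        for i in S:
            dissent[i] += 1
    actual = sum(c >= j for c in dissent)
    bound = total / j
    return actual, bound
```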

Next, we state and provide a straightforward proof of the following results due to [Bassily et al., 2018b]. The results are already stated in the referenced work in the form of sample complexities, but we include a more direct analysis of the error bound and clarify a few technical subtleties.

Theorem 5 (Adapted from Theorems 4.6 and 4.7 of [Bassily et al., 2018b]).


Let be the output of Algorithm 3 that uses Algorithm 2 for privacy aggregation. With probability at least (over the randomness of the algorithm and the randomness of all data points drawn iid), we have

for the realizable case, and

for the agnostic case.

We provide a self-contained proof of this result in the appendix (see Theorem 29). The numerical constant might be improvable (it is indeed worse than the one stated in Bassily et al. [2018a]), but we present this version for the simplicity of the proof.

Remark 6 (Error bounds when is sufficiently large).

Notice that we do not have to label all public data, so when we have a large number of public data points, we can afford to choose to be smaller so as to minimize the bound. That gives us a error bound for the realizable case and a error bound for the agnostic case. These correspond to the sample complexity bound in Theorem 4.6 of [Bassily et al., 2018b] for realizable PAC learning with error , and the sample complexity bound in Theorem 4.7 of [Bassily et al., 2018b] for agnostic PAC learning with error ; the privacy parameter is taken as a constant in these results.

In Sections 4.1 and 4.2, we present a more refined theoretical analysis of the PATE with Passive Student Queries algorithm (PATE-PSQ, Algorithm 3), which uses the SVT-based Algorithm 2 as a subroutine. Our results provide stronger learning bounds and new theoretical insights under various settings. In Section 4.3, we propose a new active-learning-based method and show that we can obtain qualitatively the same theoretical gains while using the simpler (and often more practical) Gaussian-mechanism-based Algorithm 1 as the subroutine. For comparison, we also include an analysis of standard PATE (with the Gaussian mechanism) in Appendix A. Table 1 summarizes these technical results.

Input: Labeled private teacher dataset , unlabeled public student dataset , unstable query cutoff , privacy parameters ; number of splits .

1:  Randomly and evenly split the teacher dataset into parts where .
2:  Train classifiers , one from each part .
3:  Call Algorithm 2 with parameters and to obtain pseudo-labels for the public dataset . (Alternatively, call Algorithm 1 with parameters )
4:  For those pseudo labels that are , assign them arbitrarily to .

Output: trained on pseudo-labeled student dataset.

Algorithm 3 PATE-PSQ

Input: A “data stream” sampled i.i.d. from distribution . A hypothesis class . An on-demand “labeling service” that outputs label when requested at time . Parameter .

1:  Initialize the version space .
2:  Initialize the selected dataset .
3:  Initialize “Current Output” to be any .
4:  Initialize “Counter”
5:  for  do
6:     if  then
7:        “Request for label” for and get back from the “labeling service”
8:        Update
9:        .
10:     end if
11:     if  then
12:        Update where where is a constant and
13:        Set “Current Output” to be any .
14:     end if
15:     if  then
16:        Break.
17:     end if
18:  end for

Output: Return “Current Output”.

Algorithm 4 Disagreement-Based Active Learning [Hanneke, 2014]

3.4 Disagreement-Based Active Learning

We adopt the disagreement-based active learning algorithm, which comes with strong learning bounds (see, e.g., an excellent treatment of the subject in [Hanneke, 2014]). The exact algorithm, described in Algorithm 4, keeps updating a subset of the hypothesis class called the version space: it collects labels only for data points that fall in a certain region of disagreement and eliminates candidate hypotheses that are certifiably suboptimal.

Definition 7 (Region of disagreement [Hanneke, 2014]).

For a given hypothesis class , its region of disagreement is defined as a set of data points over which there exists two hypotheses disagreeing with each other,

The region of disagreement is the key concept of disagreement-based active learning: it captures the set of data points over which the current version space is uncertain. The algorithm is fed a sequence of data points and runs in an online fashion; whenever a data point falls in this region, its label is queried, and any hypotheses shown to be bad are then removed from the version space.

The algorithm as written is not directly implementable, since it represents the version space explicitly; however, there are practical implementations that avoid explicit representation of the version space via a reduction to supervised learning oracles.
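A minimal, explicit-version-space sketch of the disagreement-based loop for a *finite* hypothesis class, specialized to the realizable case where exact consistency is the elimination rule (all names are hypothetical; a practical version would replace the explicit version space with a supervised-learning oracle):

```python
def dbal(stream, labels_oracle, H):
    """Disagreement-based active learning sketch.
    H: finite list of hypotheses (callables x -> {0, 1}).
    labels_oracle(t): returns the label of stream[t] on demand."""
    V = list(H)               # version space
    queried = []              # labeled points collected so far
    for t, x in enumerate(stream):
        preds = {h(x) for h in V}
        if len(preds) > 1:    # x lies in the region of disagreement
            y = labels_oracle(t)
            queried.append((x, y))
            # Keep only hypotheses with the best empirical error on the
            # queried points (here: exactly consistent ones, realizable case).
            errs = [sum(h(a) != b for a, b in queried) for h in V]
            best = min(errs)
            V = [h for h, e in zip(V, errs) if e == best]
    return V[0], len(queried)
```

Points outside the region of disagreement cost no labels, which is the source of the label savings.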

4 Main Results

In this section, we present our main results. Sections 4.1 and 4.2 provide new learning bounds for PATE-PSQ under noise models and in the agnostic setting. Section 4.3 introduces the PATE with Active Student Queries (PATE-ASQ) algorithm and analyzes its performance.

4.1 Improved Learning Bounds under TNC

Recall that our motivation is to analyze PATE in cases where the best classifier does not achieve zero error, and that the existing bound presented in Theorem 5 is vacuous if . The error bound of does not match the performance of even as , and even if we output the voted labels without adding noise. This does not explain the empirical performance of Algorithm 3 reported in Papernot et al. [2017, 2018], which demonstrates that the retrained classifier from PATE can get quite close to the best non-private baselines even when the latter are far from perfect. For instance, on the Adult and SVHN datasets, the non-private baselines have accuracy 85% and 92.8%, while PATE achieves 83.7% and 91.6%, respectively.

To understand how PATE works in the regime where the best classifier obeys , we introduce a large family of learning problems satisfying the so-called Tsybakov Noise Condition (TNC), under which we show that PATE is consistent with fast rates. To understand TNC, we need a bit more notation. Let the label and define the regression function . The Tsybakov noise condition is defined in terms of the distribution of .

Definition 8 (Tsybakov noise condition).

The joint distribution of the data

satisfies the Tsybakov noise condition with parameter if there exists a universal constant such that for all

Note that when , the label is purely random, and when or , is a deterministic function of . The Tsybakov noise condition is essentially a reasonable "low-noise" condition that does not require a uniform lower bound on for all . When the label noise is bounded for all , e.g., when with probability and with probability , the Tsybakov noise condition holds with . The case when is also known as the Massart noise condition or bounded noise condition in the statistical learning literature.
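For concreteness, one common parameterization of the condition, writing the regression function as η and the parameter as α ∈ [0,1] (the paper's exact convention may differ slightly), together with the equivalent "Bernstein class" form, is:

```latex
% TNC with parameter \alpha: for a universal constant C and all t \ge 0,
\Pr_{X}\!\left( \big|\eta(X) - \tfrac{1}{2}\big| \le t \right)
\;\le\; C\, t^{\frac{\alpha}{1-\alpha}} .

% Equivalent "Bernstein class" form (cf. Lemma 9): for every classifier h,
\Pr_{X}\big( h(X) \neq h^{*}(X) \big)
\;\le\; c\, \big( R(h) - R(h^{*}) \big)^{\alpha},
\qquad h^{*} \text{ the Bayes optimal classifier.}
```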

For our purposes, it is more convenient to work with the following characterization, which is equivalent to Definition 8 (see Bousquet et al. [2004, Definition 7] for a proof).

Lemma 9 (Equivalent definition of TNC).

We say that a distribution of satisfies the Tsybakov noise condition with parameter if and only if there exists such that, for every labeling function ,


where is the Bayes optimal classifier.

In the remainder of this section, we assume that the Bayes optimal classifier , and work with the slightly weaker condition that requires (2) to hold only for , with replaced by the optimal classifier . This condition is formally referred to as the Bernstein class condition by Hanneke [2014]. Somewhat confusingly, when the Tsybakov noise condition is referred to in more recent literature, it is in fact this Bernstein class condition that is meant: a slightly weaker but more opaque condition on both the hypothesis class and the data-generating distribution.

We emphasize that the Tsybakov noise condition is not our invention; it has a long history in statistical learning theory as a way to interpolate between the realizable setting and the agnostic setting. Specifically, problems satisfying TNC admit fast rates: for , the empirical risk minimizer achieves an excess risk of , which clearly interpolates between the realizable case of and the agnostic case of .
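The interpolation can be stated in standard notation (with VC-dimension d, sample size n, and Bernstein/TNC parameter α ∈ [0,1]; this is a well-known result, stated here for concreteness):

```latex
R(\hat{h}_{\mathrm{ERM}}) - R(h^{*})
\;=\; \tilde{O}\!\left( \Big( \tfrac{d}{n} \Big)^{\frac{1}{2-\alpha}} \right),
\qquad
\alpha = 1 \;\Rightarrow\; \tilde{O}\big( \tfrac{d}{n} \big)
\quad \text{(realizable-type rate)},
\qquad
\alpha = 0 \;\Rightarrow\; \tilde{O}\big( \sqrt{\tfrac{d}{n}} \big)
\quad \text{(agnostic rate)}.
```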

Next, we give a novel analysis of Algorithm 3 under TNC. The analysis is simple but revealing, as it not only avoids the strong assumption that requires to be close to , but also achieves a family of fast rates which significantly improves the sample complexity of PATE learning even for the realizable setting.

Theorem 10 (Utility guarantee of Algorithm 3 under TNC).

Assume the data distribution and the hypothesis class obey the Tsybakov noise condition with parameter . Then Algorithm 3 with

obeys that with probability at least :

Remark 11 (Bounded noise case).

When , the Tsybakov noise condition is implied by the bounded noise assumption, a.k.a. the Massart noise condition, in which the labels are generated by the Bayes optimal classifier and then flipped with a fixed probability less than . Theorem 10 implies that the excess risk is bounded by , with , which in turn implies a sample complexity upper bound of private data points and public data points. These results improve over the sample complexity bounds of Bassily et al. [2018a] even in the stronger realizable setting, from and to and for the private and public data respectively.

There are two key observations behind the improvement. First, the teacher classifiers do not have to agree on the labels as in Lemma 4; all they have to do is to agree on something for the majority of the data points. Conveniently, the Tsybakov noise condition implies that the teacher classifiers agree on the Bayes optimal classifier . Second, when the teachers agree on , the synthetic learning problem with the privately released pseudo-labels is nearly realizable. These intuitions can be formalized with a few lemmas, which will be used in the proof of Theorem 10.

Lemma 12.

Assume that is the Bayes optimal classifier and that the Tsybakov noise condition holds with parameter κ. Then, with probability over the training data of , there is a universal constant such that for all


This follows from the equivalent definition of the Tsybakov noise condition together with the learning bounds under TNC (Lemma 35). ∎

Lemma 13.

Under the conditions of Lemma 12, with probability , for all , the total number of mistakes made by any one teacher classifier with respect to can be bounded as:


The number of mistakes made by with respect to is the empirical disagreement between and on the data points; the claim therefore follows from Bernstein's inequality (Lemma 32). ∎

Using the above two lemmas we establish a bound on the number of examples where the differentially privately released labels differ from the prediction of .

Lemma 14.

Let Algorithm 3 be run with the number of teachers and the cut-off parameter chosen according to Theorem 10. Assume the conditions of Lemma 12. Then with high probability (over the random coins of Algorithm 3 alone, conditioning on the high-probability events of Lemmas 12 and 13), Algorithm 3 finishes all queries without exhausting the cut-off budget and

The notation in the choice of and hides polynomial factors of , where is from Lemmas 12 and 13.


Denote the bound from Lemma 13 by . By the same pigeonhole principle argument as in Lemma 4 (but with replaced by ), we have that the number of queries that have margin smaller than is at most . The choice of

ensures that, with high probability over the Laplace random variables in Algorithm 2, in at least queries the answer , i.e.,

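The mechanics of this step can be sketched in code. The following is an illustrative rendering of SVT-style noisy-margin screening, with hypothetical parameter names and constants, not a transcription of the paper's Algorithm 2:

```python
import math
import random

random.seed(1)

def laplace(scale):
    """Sample a centered Laplace variable via the inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def answer_queries(margins, labels, threshold, scale, cutoff):
    """Release the majority label only when the noisy vote margin clears the
    threshold; abort once more than `cutoff` queries look unstable."""
    answers, unstable = [], 0
    for margin, label in zip(margins, labels):
        if margin + laplace(scale) > threshold:
            answers.append(label)   # stable: release the majority label
        else:
            answers.append(None)    # unstable: spend one unit of cut-off budget
            unstable += 1
            if unstable > cutoff:
                break
    return answers, unstable

# 95 high-margin queries sail through; almost surely only the 5 low-margin
# queries trip the unstable counter, and the cutoff is never exhausted.
answers, unstable = answer_queries(
    margins=[400] * 95 + [2] * 5, labels=[1] * 100,
    threshold=100, scale=10, cutoff=20)
print(unstable)
```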
Now we are ready to put everything together and prove Theorem 10.

Proof of Theorem 10.

Denote , where is the empirical average of the disagreements over the data points that the students have. (Note that in this case we could take since . We define this more generally so that later we can substitute with other label vectors that are not necessarily generated by any hypothesis in .) By the triangle inequality of the error,


where the second line follows from the uniform Bernstein inequality (apply the first statement of Lemma 33 in Appendix C with ) and the third line is due to for non-negative .

By the triangle inequality, we have ; therefore

In the second line, we applied the fact that is the minimizer of ; in the third line, we applied the triangle inequality again; and the last line is true because is the minimizer and .

Recall that is the unstable cut-off in Algorithm 3. The proof is completed by invoking Lemma 14, which shows that the choice of is appropriate such that with high probability. ∎

In light of the above analysis, it is clear that the improvement of our analysis under TNC is twofold: (1) we work with the disagreement with respect to rather than ; (2) we use a uniform Bernstein bound rather than a uniform Hoeffding bound, which leads to the faster rate in terms of the number of public data points needed.
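The gap between the two concentration tools can be seen numerically. The sketch below compares standard textbook forms of the one-sided Hoeffding and Bernstein deviation widths for a Bernoulli mean (our own illustration; the exact constants in the paper's Lemma 33 may differ):

```python
import math

def hoeffding_width(n, delta):
    """One-sided Hoeffding deviation for n iid [0,1]-bounded variables."""
    return math.sqrt(math.log(1 / delta) / (2 * n))

def bernstein_width(n, delta, p):
    """Bernstein deviation for Bernoulli(p): variance-adaptive, hence much
    tighter when p (e.g., a disagreement rate) is close to 0."""
    log_term = math.log(1 / delta)
    return math.sqrt(2 * p * (1 - p) * log_term / n) + 2 * log_term / (3 * n)

n, delta = 10_000, 1e-3
print(hoeffding_width(n, delta))        # ~0.0186, independent of p
print(bernstein_width(n, delta, 0.5))   # comparable to Hoeffding at p = 1/2
print(bernstein_width(n, delta, 0.01))  # several times smaller near p = 0
```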

Remark 15 (Reduction to ERM).

The main challenge in the proof is to appropriately handle . Although we denote it as a classifier, it is in fact a vector that is defined only on rather than a general classifier that can take any input . Since we use the SVT-based Algorithm 2, is only well-defined on the student dataset. Moreover, these privately released “pseudo-labels” are not independent, which makes it infeasible to invoke a generic learning bound such as Lemma 34. Our solution is to work with the empirical risk minimizer (ERM, rather than a generic PAC learner as a black box) and to use uniform convergence (Lemma 33) directly. This is without loss of generality because all learnable problems are learnable by (asymptotic) ERM [Vapnik, 1995, Shalev-Shwartz et al., 2010].

4.2 Challenges and New Bounds under Agnostic Setting

In this section, we present a more refined analysis of the agnostic setting. We first argue that agnostic learning with Algorithm 3 will not be consistent in general, and that competing against the best classifier in is arguably not the right comparator. The form of the pseudo-labels mandates that is aiming to fit a labeling function that is inherently a voting classifier. The literature on ensemble methods has taught us that the voting classifier is qualitatively different from the individual voters. In particular, the error rate of the majority voting classifier can be significantly better than, about the same as, or significantly worse than the average error rate of the individual voters. We illustrate this matter with two examples.

Example 16 (Voting fail).

Consider the uniform distribution on the four points listed in Figure 1, and let the corresponding label be 1 on every point. Let the hypothesis class be the three classifiers h1, h2, h3 whose evaluations on the four points are given in Figure 1. Check that the classification error of each of the three classifiers is 1/2. Also note that the empirical risk minimizer will be a uniform distribution over {h1, h2, h3}. The majority voting classifier, learned with iid data sets, will perform significantly worse and converges to a classification error of 3/4 exponentially quickly as the number of classifiers goes to infinity.

         x1   x2   x3   x4   error
y         1    1    1    1   0
h1        1    1    0    0   0.5
h2        1    0    1    0   0.5
h3        1    0    0    1   0.5
majority  1    0    0    0   0.75
Figure 1: An example where the majority voting classifier is significantly worse than the best classifier in the class.
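The example can be verified with a short simulation (our own sketch). Since the three hypotheses tie in empirical risk, we model each teacher as an independent uniform draw from {h1, h2, h3}:

```python
import random

random.seed(0)

# Evaluations on (x1, x2, x3, x4); the true label is 1 at every point.
H = [(1, 1, 0, 0), (1, 0, 1, 0), (1, 0, 0, 1)]   # h1, h2, h3, each with error 1/2
y = (1, 1, 1, 1)

k = 1001                                          # number of teachers
teachers = [random.choice(H) for _ in range(k)]   # iid (tied) ERM draws
majority = tuple(int(sum(h[i] for h in teachers) > k / 2) for i in range(4))
error = sum(m != t for m, t in zip(majority, y)) / 4

print(majority, error)  # (1, 0, 0, 0) 0.75 with overwhelming probability
```

Only x1 gets a unanimous vote; on each of x2, x3, x4 a teacher votes 1 with probability 1/3, so the majority vote is wrong there.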

This example illustrates that the PATE framework cannot consistently learn a VC-class in the agnostic setting in general. On a positive note, there are also cases where the majority voting classifier boosts the classification accuracy significantly, such as the following example.

Example 17 (Voting win).

If the error probability of each teacher at every point is at most 1/2 - γ, where γ is a small positive constant, then by Hoeffding's inequality,

Thus the error of the majority vote goes to 0 exponentially as the number of teachers goes to infinity.
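In code, the Hoeffding bound behind this example reads as follows (the value of γ and the choices of k are illustrative):

```python
import math

def majority_error_bound(k, gamma):
    """If each of k independent teachers errs with probability at most
    1/2 - gamma at a point, Hoeffding gives
    P(majority vote errs) <= exp(-2 * k * gamma**2)."""
    return math.exp(-2 * k * gamma ** 2)

# The bound decays exponentially in the number of teachers k.
for k in (10, 100, 1000):
    print(k, majority_error_bound(k, gamma=0.1))
```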

These cases call for an alternative distribution-dependent theory of learning that characterizes the performance of Algorithm 3 more accurately.

Next, we propose two changes to the learning paradigm. First, we need to go beyond and compare with the following infinite ensemble classifier

The classifier outputs the majority voting result of infinitely many independent teachers, each trained on i.i.d. data points. As discussed earlier, this classifier can be better or worse than a single classifier trained on the same number of data points, than one trained on all data points, or than the optimal classifier in . Note that this classifier also changes as gets larger.

Considering different centers for teacher classifiers to agree on is one of the key ideas of this paper. Figure 2 shows three kinds of centers for teachers to agree on. In Bassily et al. [2018a], the center is the true label in the realizable setting. In Section 4.1 under TNC, we analyze the performance of PATE-PSQ, where the center is the optimal hypothesis . In the agnostic setting, as we will see, the natural center of agreement is by definition.

(a) The true label is the center in the realizable setting. (b) The best hypothesis is the center under TNC. (c) is our new construction for the agnostic setting.
Figure 2: Centers for teachers to agree on.

Second, we define the expected margin for a classifier trained on i.i.d. samples to be

This quantity captures, for a fixed , how likely the teachers are to agree. For a fixed learning problem and a fixed number of i.i.d. data points each teacher is trained upon, the expected margin is a function of alone. The larger it is, the more likely the ensemble of teachers agrees on a prediction with high confidence. Note that unlike in Example 17, we do not require the teachers to agree on the true label. Instead, the expected margin measures the extent to which they agree on , which could be any label.
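The expected margin at a point can be estimated by Monte Carlo over freshly trained teachers. The sketch below uses a toy distribution over threshold classifiers; the teacher model and all names are our own illustration:

```python
import random

random.seed(2)

def expected_margin(x, train_teacher, trials=500):
    """Monte-Carlo estimate of the expected (normalized) margin at x:
    how lopsided the vote of a freshly trained binary teacher is."""
    p = sum(train_teacher()(x) for _ in range(trials)) / trials
    return abs(2 * p - 1)

def train_teacher():
    """Toy teacher: a threshold classifier with a noisy cut-point near 0.5,
    standing in for training on a fresh private split."""
    t = 0.5 + random.gauss(0, 0.02)
    return lambda x: int(x >= t)

print(expected_margin(0.9, train_teacher))   # far from the boundary: ~1
print(expected_margin(0.5, train_teacher))   # at the decision boundary: ~0
```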

When the expected margin is bounded away from 0, the voting classifier outputs with probability converging exponentially to 1 as the number of teachers gets larger. On the technical level, this definition allows us to decouple the stability analysis from the accuracy of PATE, as the latter relies on how good is.

Definition 18 (Approximate high margin).

We say that a learning problem with i.i.d. samples satisfies the -approximate high-margin condition if

This definition says that, with high probability, except for data points, all other data points in the public dataset have an expected margin of at least . Observe that every learning problem has a that increases from to as we vary from to . The realizability assumption and the Tsybakov noise condition that we considered up to this point imply upper bounds on at a fixed (see more details in Remark 22). In Appendix E, we demonstrate that for the problem of linear classification on the Adult dataset, clearly an agnostic learning problem, the -approximate high-margin condition is satisfied with a small and a large .
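Operationally, checking the condition on one public sample amounts to counting low-margin points. A minimal sketch (the function name and all numbers are ours):

```python
def is_approx_high_margin(margins, t, gamma):
    """(t, gamma)-approximate high margin on one public sample: at most t
    points may have expected margin below gamma (mirrors Definition 18)."""
    return sum(m < gamma for m in margins) <= t

margins = [0.9, 0.8, 0.95, 0.1, 0.85]                  # illustrative values
print(is_approx_high_margin(margins, t=1, gamma=0.5))  # True: one low-margin point
print(is_approx_high_margin(margins, t=0, gamma=0.5))  # False
```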

The following proposition shows that when a problem is approximately high-margin, there are choices of and under which the SVT-based PATE provably labels almost all data points with the output of .

Proposition 19.

Assume the learning problem with i.i.d. data points satisfies -approximate high-margin condition. Let Algorithm 2 be instantiated with parameters

then with high probability (over the randomness of the i.i.d. samples of the private dataset, the i.i.d. samples of the public dataset, and the randomized algorithm), Algorithm 2 finishes all rounds and the output is the same as for all but of the .


By Bernstein's inequality, with probability over the iid samples of the public data, the number of queries with is smaller than . is an upper bound of the above quantity if we choose .

Conditioning on the above event, by Hoeffding’s inequality and a union bound, with probability over the iid samples of the private data (hence the iid teacher classifiers), for all queries with larger than , the realized margin (defined in (1)) obeys that

It remains to check that under our choice of , for all except the (up to) exceptions.

By the tail of the Laplace distribution and a union bound, with probability , all Laplace random variables that perturb the distance to stability in Algorithm 7 are larger than and all