# Private Hypothesis Selection

We provide a differentially private algorithm for hypothesis selection. Given samples from an unknown probability distribution P and a set of m probability distributions H, the goal is to output, in an ε-differentially private manner, a distribution from H whose total variation distance to P is comparable to that of the best such distribution (which we denote by α). The sample complexity of our basic algorithm is O(log m/α^2 + log m/(αε)), representing a minimal cost for privacy when compared to the non-private algorithm. We can also handle infinite hypothesis classes H by relaxing to (ε,δ)-differential privacy. We apply our hypothesis selection algorithm to give learning algorithms for a number of natural distribution classes, including Gaussians, product distributions, sums of independent random variables, piecewise polynomials, and mixture classes. Our hypothesis selection procedure allows us to generically convert a cover for a class to a learning algorithm, complementing known learning lower bounds which are in terms of the size of the packing number of the class. As the covering and packing numbers are often closely related, for constant α, our algorithms achieve the optimal sample complexity for many classes of interest. Finally, we describe an application to private distribution-free PAC learning.

## 1 Introduction

We consider the problem of hypothesis selection: given samples from an unknown probability distribution, select a distribution from some fixed set of candidates which is “close” to the unknown distribution in some appropriate distance measure. Such situations can arise naturally in a number of settings. For instance, we may have a number of different methods which work under various circumstances, which are not known in advance. One option is to run all the methods to generate a set of hypotheses, and pick the best from this set afterwards. Relatedly, an algorithm may branch its behavior based on a number of “guesses,” which will similarly result in a set of candidates, corresponding to the output at the end of each branch. Finally, if we know that the underlying distribution belongs to some (parametric) class, it is possible to essentially enumerate the class (also known as a cover) to create a collection of hypotheses. Observe that this last example is quite general, and this approach can give generic learning algorithms for many settings of interest.

This problem of hypothesis selection has been extensively studied (see, e.g., [Yatracos85, DevroyeL96, DevroyeL97, DevroyeL01]), resulting in algorithms with a sample complexity which is logarithmic in the number of hypotheses. Such a mild dependence is critical, as it facilitates sample-efficient algorithms even when the number of candidates may be large. These initial works have triggered a great deal of study into hypothesis selection with additional considerations, including computational efficiency, understanding the optimal approximation factor, adversarial robustness, and weakening access to the hypotheses (e.g., [MahalanabisS08, DaskalakisDS12b, DaskalakisK14, SureshOAJ14, AcharyaJOS14b, DiakonikolasKKLMS16, AcharyaJOS18, BousquetKM19]).

However, in modern settings of data analysis, data may contain sensitive information about individuals. Some examples of such data include medical records, GPS location data, or private message transcripts. As such, we would like to perform statistical inference in these settings without revealing significant information about any particular individual’s data. To this end, there have been many proposed notions of data privacy, but perhaps the gold standard is that of differential privacy [DworkMNS06]. Informally, differential privacy requires that, if a single datapoint in the dataset is changed, then the distribution over outputs produced by the algorithm should be similar (see Definition 2.4). Differential privacy has seen widespread adoption, including deployment by Apple [AppleDP17], Google [ErlingssonPK14], and the US Census Bureau [DajaniLSKRMGDGKKLSSVA17].

This naturally raises the question of whether one can perform hypothesis selection under the constraint of differential privacy, while maintaining a logarithmic dependence on the size of the cover. Such a tool would allow us to generically obtain private learning results for a wide variety of settings.

### 1.1 Results

Our main results answer this in the affirmative: we provide differentially private algorithms for selecting a good hypothesis from a set of distributions. The output distribution is competitive with the best distribution, and the sample complexity is bounded by the logarithm of the size of the set. The following is a basic version of our main result.

###### Theorem 1.1.

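Let $\mathcal{H} = \{H_1, \dots, H_m\}$ be a set of probability distributions. Let $D = \{X_1, \dots, X_n\}$ be a set of samples drawn independently from an unknown probability distribution $P$. There exists an $\varepsilon$-differentially private algorithm (with respect to the dataset $D$) which has the following guarantees. Suppose there exists a distribution $H^* \in \mathcal{H}$ such that $d_{\mathrm{TV}}(P, H^*) \le \alpha$. If $n = \Omega\left(\frac{\log m}{\alpha^2} + \frac{\log m}{\alpha\varepsilon}\right)$, then the algorithm will output a distribution $\hat{H} \in \mathcal{H}$ such that $d_{\mathrm{TV}}(P, \hat{H}) \le 7\alpha$ with probability at least . The running time of the algorithm is $O(nm^2)$.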

The sample complexity of this problem without privacy constraints is $O(\log m/\alpha^2)$, and thus the additional cost for $\varepsilon$-differential privacy is an additive $O(\log m/(\alpha\varepsilon))$. We consider this cost to be minimal; in particular, the dependence on $m$ is unchanged. Note that the running time of our algorithm is $O(nm^2)$; we conjecture it may be possible to reduce this, as has been done in the non-private setting [DaskalakisK14, SureshOAJ14, AcharyaJOS14b, AcharyaJOS18], though we have not attempted to perform this optimization. Regardless, our main focus is on the sample complexity rather than the running time, since any method for generic hypothesis selection requires $\Omega(m)$ time, thus precluding efficient algorithms when $m$ is large.

It is possible to improve the guarantees of this algorithm in two ways (Theorem 4.1). First, if the distributions are nicely structured, the former term in the sample complexity can be reduced from $O(\log m/\alpha^2)$ to $O(d/\alpha^2)$, where $d$ is a VC-dimension-based measure of the complexity of the collection of distributions. Second, if there are few hypotheses which are close to the true distribution, then we can pay only logarithmically in this number, as opposed to the total number of hypotheses. These modifications allow us to handle instances where $m$ may be very large (or even infinite), albeit at the cost of weakening to approximate differential privacy to perform the second refinement. A technical discussion of our methods is in Section 1.2, our basic approach is covered in Section 3, and the version with all the bells and whistles appears in Section 4.

From Theorem 1.1, we immediately obtain Corollary 1.2, which applies when $\mathcal{H}$ itself may not be finite, but admits a finite cover with respect to total variation distance.

###### Corollary 1.2.

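Suppose there exists an $\alpha$-cover $\mathcal{C}_\alpha$ of a set of distributions $\mathcal{H}$, and that we are given a set of samples $X_1, \dots, X_n \sim P$, where $d_{\mathrm{TV}}(P, \mathcal{H}) \le \alpha$. There exists an $\varepsilon$-differentially private algorithm (with respect to the input $X_1, \dots, X_n$) which outputs a distribution $H^* \in \mathcal{C}_\alpha$ such that with probability , as long as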

$$n = \Omega\left(\frac{\log|\mathcal{C}_\alpha|}{\alpha^2} + \frac{\log|\mathcal{C}_\alpha|}{\alpha\varepsilon}\right).$$

Informally, this says that if a hypothesis class has an $\alpha$-cover $\mathcal{C}_\alpha$, then there is a private learning algorithm for the class which requires $O(\log|\mathcal{C}_\alpha|)$ samples. Note that our algorithm works even if the unknown distribution is only close to the hypothesis class. This is useful when we may have model misspecification, or when we require adversarial robustness. (We also give an extension of this algorithm which gives guarantees in the semi-agnostic learning model; see Section 3.3 for details.) The requirements for this theorem to apply are minimal, and thus it generically provides learning algorithms for a wide variety of hypothesis classes. That said, in non-private settings, the sample complexity given by this method is rather lossy: as an extreme example, there is no finite-size cover of univariate Gaussian distributions with unbounded parameters, so this approach does not give a finite-sample algorithm, even though it is well known that $O(1/\alpha^2)$ samples suffice to estimate a Gaussian in total variation distance. In the private setting, our theorem incurs a cost which is somewhat necessary: in particular, it is folklore that any pure $\varepsilon$-differentially private learning algorithm must pay a cost which is logarithmic in the packing number of the class (for completeness, see Lemma 5.1). Due to the relationship between packing and covering numbers (Lemma 5.2), this implies that up to a constant factor relaxation in the learning accuracy, our results are tight (Theorem 5.3). Further discussion appears in Section 5.

Given Corollary 1.2, in Section 6, we derive new learning results for a number of classes. Our main applications are for Gaussian and product distributions. Informally, we obtain sample algorithms for learning a product distribution and a Gaussian with known covariance (Corollaries 6.3 and 6.10), and an algorithm for learning a Gaussian with unknown covariance (Corollary 6.11). These improve on recent results by Kamath, Li, Singhal, and Ullman [KamathLSU19] in two different ways. First, as mentioned before, our results are semi-agnostic, so we can handle the case where the distribution is only close to a product or Gaussian distribution. Second, our results hold for pure $\varepsilon$-differential privacy, which is a stronger notion than the zCDP guarantee considered in [KamathLSU19]. In this weaker model, they also obtained and sample algorithms, but the natural modifications to achieve $\varepsilon$-DP incur extra factors. (Footnote: Roughly, this is due to the fact that the Laplace and Gaussian mechanisms are based on $\ell_1$ and $\ell_2$ sensitivity, respectively, and that there is a $\sqrt{d}$-factor relationship between these two norms in the worst case.) [KamathLSU19] also showed lower bounds for Gaussian and product distribution estimation in the even weaker model of $(\varepsilon, \delta)$-differential privacy. Thus, our results show that the dimension dependence for these problems is unchanged for essentially any notion of differential privacy. In particular, our results show a previously unknown separation between mean estimation of product distributions and non-product distributions under pure $\varepsilon$-differential privacy; see Remark 6.4.

We also apply Theorem 4.1 to obtain algorithms for learning Gaussians under $(\varepsilon, \delta)$-differential privacy, with no bounds on the mean and variance parameters. More specifically, we provide algorithms for learning multivariate Gaussians with unknown mean and known covariance (Corollary 6.13), and univariate Gaussians with both unknown mean and variance (Corollary 6.15). For the former problem, we manage to avoid dependences which arise due to the application of advanced composition (similar to Remark 6.4).

To demonstrate the flexibility of our approach, we also give private learning algorithms for sums of independent random variables (Corollaries 6.20 and 6.22) and piecewise polynomials (Corollary 6.29). To the best of our knowledge, the former class of distributions has not been considered in the private setting, and we rely on covering theorems from the non-private literature. Private learning algorithms for the latter class, piecewise polynomials, have been studied by Diakonikolas, Hardt, and Schmidt [DiakonikolasHS15]. They provide sample- and time-efficient algorithms for histogram distributions (i.e., piecewise constant distributions), and claim similar results for general piecewise polynomials. Their method depends heavily on rather sophisticated algorithms for the non-private version of this problem [AcharyaDLS17]. In contrast, we can obtain comparable sample complexity bounds from just the existence of a cover and elementary VC dimension arguments, which we derive in a fairly self-contained manner.

We additionally give algorithms for learning mixtures of any coverable class (Corollary 6.32). In particular, this immediately implies algorithms for learning mixtures of Gaussians, product distributions, and all other classes mentioned above.

To conclude our applications, we discuss a connection to PAC learning (Corollary 6.34). It is known that the sample complexity of differentially private distribution-free PAC learning can be higher than that of non-private learning. However, this gap does not exist for distribution-specific learning, where the learning algorithm knows the distribution of (unlabeled) examples, as both sample complexities are characterized by VC dimension. Private hypothesis selection allows us to address an intermediate situation where the distribution of unlabeled examples is not known exactly, but is known to come (approximately) from a class of distributions. When this class has a small cover, we are able to recover sample complexity guarantees for private PAC learning which are comparable to the non-private case.

### 1.2 Techniques

Non-privately, most algorithms for hypothesis selection involve a tournament-style approach. We conduct a number of pairwise comparisons between distributions, which may either have a winner and a loser, or may be declared a draw. Intuitively, a distribution will be declared the winner of a comparison if it is much closer than the alternative to the unknown distribution, and a tie will be declared if the two distributions are comparably close. The algorithm will output any distribution which never loses a comparison. A single comparison between a pair of hypotheses requires $O(1/\alpha^2)$ samples, and a Chernoff plus union bound argument over the $O(m^2)$ possible comparisons increases the sample complexity to $O(\log m/\alpha^2)$. In fact, we can use uniform convergence arguments to reduce this sample complexity to $O(d/\alpha^2)$, where $d$ is the VC dimension of the sets (the “Scheffé” sets) defined by the subsets of the domain where the PDF of one distribution dominates another. Crucially, we must reuse the same set of samples for all comparisons to avoid paying polynomially in the number of hypotheses.

A private algorithm for this problem requires additional care. Since a single comparison is based on the number of samples which fall into a particular subset of the domain, the sensitivity of the underlying statistic is low, and thus privacy may seem easily achievable at first glance. However, the challenge comes from the fact that the same samples are reused for all pairwise comparisons, thus greatly increasing the sensitivity: changing a single datapoint could flip the result of every comparison! In order to avoid this pitfall, we instead carefully construct a score function for each hypothesis, namely, the minimum number of points that must be changed to cause the distribution to lose any comparison. For this to be a useful score function, we must show that the best hypothesis will win all of its comparisons by a large margin. We can then use the Exponential Mechanism [McSherryT07] to select a distribution with high score.

Further improvements can be made if we are guaranteed that the number of “good” hypotheses (i.e., those that have total variation distance from the true distribution bounded by ) is at most some parameter , and if we are willing to relax to approximate differential privacy. The parameter here is related to the doubling dimension of the hypothesis class with respect to total variation distance. If we randomly assign the hypotheses to buckets, with high probability, no bucket will contain more than one good hypothesis. We can identify a bucket containing a good hypothesis using a similar method based on the exponential mechanism as described above. Moreover, since we are likely to only have one “good” hypothesis in the chosen bucket, this implies a significant gap between the best and second-best scores in that bucket. This allows us to use stability-based techniques [DworkL09, SmithT13], and in particular the GAP-MAX algorithm of Bun, Dwork, Rothblum, and Steinke [BunDRS18], to identify an accurate distribution.

### 1.3 Related Work

Our main result builds on a long line of work on non-private hypothesis selection. One starting point for the particular style of approach we consider here is [Yatracos85], which was expanded on in [DevroyeL96, DevroyeL97, DevroyeL01]. Since then, there has been study into hypothesis selection under additional considerations, including computational efficiency, understanding the optimal approximation factor, adversarial robustness, and weakening access to the hypotheses [MahalanabisS08, DaskalakisDS12b, DaskalakisK14, SureshOAJ14, AcharyaJOS14b, DiakonikolasKKLMS16, AcharyaJOS18, BousquetKM19]. Our private algorithm examines the same type of problem, with the additional constraint of differential privacy.

There has recently been a great deal of interest in differentially private distribution learning. In the central model, most relevant are [DiakonikolasHS15], which gives algorithms for learning structured univariate distributions, and [KarwaV18, KamathLSU19], which focus on learning Gaussians and binary product distributions. [CaiWZ19] also studies private statistical parameter estimation. Privately learning mixtures of Gaussians was considered in [NissimRS07]. [BunNSV15] give an algorithm for learning distributions in Kolmogorov distance. Upper and lower bounds for learning the mean of a product distribution over the hypercube in -distance include [BlumDMN05, BunUV14, DworkMNS06, SteinkeU17a]. [AcharyaKSZ18] focuses on estimating properties of a distribution, rather than the distribution itself. [Smith11] gives an algorithm which allows one to estimate asymptotically normal statistics with optimal convergence rates, but no finite sample complexity guarantees. There has also been a great deal of work on distribution learning in the local model of differential privacy [DuchiJW13, WangHWNXYLQ16, KairouzBR16, AcharyaSZ19, DuchiR18, JosephKMW18, YeB18, GaboardiRS19].

Non-privately, there has been a significant amount of work on learning specific classes of distributions. The PAC-style formulation of the problem we consider originated in [KearnsMRRSS94]. While learning Gaussians and product distributions can be considered folklore at this point, some of the other classes we learn have enjoyed more recent study. For instance, learning sums of independent random variables was recently considered in [DaskalakisDS12b] toward the problem of learning Poisson Binomial Distributions (PBDs). Since then, there has been additional work on learning PBDs and various generalizations.

Piecewise polynomials are a highly-expressive class of distributions, and they can be used to approximate a number of other univariate distribution classes, including distributions which are multi-modal, concave, convex, log-concave, monotone hazard rate, Gaussian, Poisson, Binomial, and more. Algorithms for learning such classes are considered in a number of papers, including [DaskalakisDS12a, ChanDSS14a, ChanDSS14b, AcharyaDK15, AcharyaDLS17].

There has also been a great deal of work on learning mixtures of distribution classes, particularly mixtures of Gaussians. There are many ways the objective of such a problem can be defined, including clustering [Dasgupta99, DasguptaS00, AroraK01, VempalaW02, AchlioptasM05, ChaudhuriR08a, ChaudhuriR08b, KumarK10, AwasthiS12, RegevV17, HopkinsL18, DiakonikolasKS18b, KothariSS18], parameter estimation [KalaiMV10, MoitraV10, BelkinS10, HsuK13, AndersonBGRV14, BhaskaraCMV14, HardtP15, GeHK15, XuHM16, DaskalakisTZ17, AshtianiBHLMP18], proper learning [FeldmanOS06, FeldmanOS08, DaskalakisK14, SureshOAJ14, DiakonikolasKKLMS16, LiS17], and improper learning [ChanDSS14a]. Our work falls into the line on proper learning: the algorithm is given a set of samples from a mixture of Gaussians, and must output a mixture of Gaussians which is close in total variation distance.

### 1.4 Organization

We begin in Section 2 with preliminaries. In Section 3, we give a basic algorithm for private hypothesis selection, via the exponential mechanism. In Section 4, we extend this approach in two ways: by using VC dimension arguments to reduce the sample complexity for sets of hypotheses with additional structure, and by combining this with a GAP-MAX algorithm to achieve non-trivial guarantees for infinite hypothesis classes. Section 5 shows that our approach leads to algorithms which essentially match lower bounds for most distribution classes (in the constant $\alpha$ regime). We consider applications in Section 6: through a combination of arguments about covers and VC dimension, we derive algorithms for learning a number of classes of distributions, as well as describe an application to private PAC learning. Finally, we conclude in Section 7 with open questions.

## 2 Preliminaries

###### Definition 2.1.

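The total variation distance or statistical distance between $P$ and $Q$ is defined as

$$d_{\mathrm{TV}}(P, Q) = \max_{S \subseteq \Omega} \big(P(S) - Q(S)\big) = \frac{1}{2} \int_{x \in \Omega} |P(x) - Q(x)| \, dx = \frac{1}{2} \|P - Q\|_1 \in [0, 1].$$

Moreover, if $\mathcal{H}$ is a set of distributions over a common domain, we define $d_{\mathrm{TV}}(P, \mathcal{H}) = \inf_{H \in \mathcal{H}} d_{\mathrm{TV}}(P, H)$.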
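For distributions on a finite domain the integral becomes a sum, and the two expressions in the definition can be compared directly. The following is a minimal sketch under that assumption; the dictionary representation and the function names are illustrative and do not come from the paper.

```python
def tv_distance(p, q):
    """Total variation distance as half the L1 distance.

    p and q are dicts mapping outcomes to probabilities.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def tv_distance_via_event(p, q):
    """Same quantity as P(S) - Q(S) for the maximizing event S = {x : p(x) > q(x)}."""
    support = set(p) | set(q)
    return sum(p.get(x, 0.0) - q.get(x, 0.0)
               for x in support if p.get(x, 0.0) > q.get(x, 0.0))


fair = {"heads": 0.5, "tails": 0.5}
biased = {"heads": 0.8, "tails": 0.2}
print(tv_distance(fair, biased), tv_distance_via_event(biased, fair))
```

The maximizing event used in the second function is the same kind of set that reappears later as the Scheffé set of a pair of hypotheses.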

Throughout this paper, we consider packings and coverings of sets of distributions with respect to total variation distance.

###### Definition 2.2.

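A $\gamma$-cover of a set of distributions $\mathcal{H}$ is a set of distributions $\mathcal{C}_\gamma$, such that for every $H \in \mathcal{H}$, there exists some $P \in \mathcal{C}_\gamma$ such that $d_{\mathrm{TV}}(P, H) \le \gamma$.

A $\gamma$-packing of a set of distributions $\mathcal{H}$ is a set of distributions $\mathcal{P}_\gamma \subseteq \mathcal{H}$, such that for every pair of distributions $P, Q \in \mathcal{P}_\gamma$, we have that $d_{\mathrm{TV}}(P, Q) \ge \gamma$.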
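As a toy illustration of a cover, consider the class of all Bernoulli distributions, where the total variation distance between biases $a$ and $b$ is simply $|a - b|$. The sketch below builds a cover at scale `alpha` from a grid of biases and checks its covering radius numerically; the grid construction and function names are our own assumptions, chosen only to make the definition concrete.

```python
import math


def bernoulli_cover(alpha):
    """Biases of an alpha-cover of all Bernoulli distributions under TV distance.

    Centers sit at alpha, 3*alpha, 5*alpha, ... so that each bias in [0, 1]
    is within alpha of some center.
    """
    k = math.ceil(1.0 / (2.0 * alpha))
    return [min(1.0, (2 * i + 1) * alpha) for i in range(k)]


def covering_radius(cover, grid=10_000):
    """Largest distance from a bias on a fine grid to its nearest cover element."""
    return max(min(abs(i / grid - c) for c in cover) for i in range(grid + 1))


cover = bernoulli_cover(0.05)
print(len(cover), covering_radius(cover))
```

The cover has about $1/(2\alpha)$ elements, so its logarithm, which is what enters the sample complexity of Corollary 1.2, grows only like $\log(1/\alpha)$.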

In this paper, we present semi-agnostic learning algorithms.

###### Definition 2.3.

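An algorithm is said to be an $\alpha$-semi-agnostic learner for a class $\mathcal{H}$ if it has the following guarantees. Suppose we are given $X_1, \dots, X_n \sim P$, where $d_{\mathrm{TV}}(P, \mathcal{H}) = \mathrm{OPT}$. The algorithm must output some distribution $\hat{H}$ such that $d_{\mathrm{TV}}(P, \hat{H}) \le c \cdot \mathrm{OPT} + O(\alpha)$, for some constant $c \ge 1$. If $c = 1$, then the algorithm is said to be agnostic.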

Now we define differential privacy. We say that $D$ and $D'$ are neighboring datasets, denoted $D \sim D'$, if $D$ and $D'$ differ by at most one observation. Informally, differential privacy requires that the algorithm has close output distributions when run on any pair of neighboring datasets. More formally:

###### Definition 2.4 ([DworkMNS06]).

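A randomized algorithm $T : \mathcal{X}^* \to \mathcal{R}$ is $(\varepsilon, \delta)$-differentially private if for all $n \ge 1$, for all neighboring datasets $D, D' \in \mathcal{X}^n$, and for all events $S \subseteq \mathcal{R}$,

$$\Pr[T(D) \in S] \le e^{\varepsilon} \Pr[T(D') \in S] + \delta.$$

If $\delta = 0$, we say that $T$ is $\varepsilon$-differentially private.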

The exponential mechanism [McSherryT07] is a powerful $\varepsilon$-differentially private mechanism for selecting an approximately best outcome from a set of alternatives, where the quality of an outcome is measured by a score function relating each alternative to the underlying dataset. Letting $\mathcal{R}$ be the set of possible outcomes, a score function $q : \mathcal{X}^* \times \mathcal{R} \to \mathbb{R}$ maps each pair consisting of a dataset and an outcome to a real-valued score. The exponential mechanism instantiated with a dataset $D$, a score function $q$, and a privacy parameter $\varepsilon$ selects an outcome $r$ in $\mathcal{R}$ with probability proportional to $\exp\left(\varepsilon \, q(D, r) / (2\Delta(q))\right)$, where $\Delta(q)$ is the sensitivity of the score function defined as

$$\Delta(q) = \max_{r \in \mathcal{R},\, D \sim D'} \left| q(D, r) - q(D', r) \right|.$$

###### Theorem 2.5 ([McSherryT07]).

For any input dataset $D$, score function $q$ and privacy parameter $\varepsilon > 0$, the exponential mechanism is $\varepsilon$-differentially private, and with probability at least $1 - \beta$, selects an outcome $r \in \mathcal{R}$ such that

$$q(D, r) \ge \max_{r' \in \mathcal{R}} q(D, r') - \frac{2\Delta(q) \log(|\mathcal{R}|/\beta)}{\varepsilon}.$$
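The sampling rule and the utility bound of Theorem 2.5 can be sketched in a few lines. This is a minimal illustration for a finite outcome set whose scores have already been computed on the dataset; the function names and the `rng` parameter are our own, and the code uses only Python's standard library.

```python
import math
import random


def exponential_mechanism(scores, epsilon, sensitivity, rng=random):
    """Return an index r sampled with probability proportional to
    exp(epsilon * scores[r] / (2 * sensitivity))."""
    # Shifting all scores by the maximum leaves the sampling distribution
    # unchanged and avoids overflow in exp.
    top = max(scores)
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]


def utility_slack(num_outcomes, epsilon, sensitivity, beta):
    """Additive loss 2 * Delta(q) * log(|R| / beta) / epsilon from Theorem 2.5."""
    return 2.0 * sensitivity * math.log(num_outcomes / beta) / epsilon


scores = [3.0, 40.0, 38.0, 0.0]
print(exponential_mechanism(scores, epsilon=1.0, sensitivity=1.0))
print(utility_slack(len(scores), epsilon=1.0, sensitivity=1.0, beta=0.1))
```

The slack grows only logarithmically with the number of outcomes, which is what later allows a logarithmic dependence on the number of hypotheses.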

## 3 A First Method for Private Hypothesis Selection

In this section, we present our first algorithm for private hypothesis selection and obtain the following result.

See Theorem 1.1.

Note that the sample complexity bound above scales logarithmically with the size of the hypothesis class. In Section 4, we will provide a stronger result (which subsumes the present one as a special case) that can handle certain infinite hypothesis classes. For the sake of exposition, we begin in this section with the basic algorithm.

### 3.1 Pairwise Comparisons

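We first present a subroutine which compares two hypothesis distributions. Let $H$ and $H'$ be two distributions and consider the following set, which is called the Scheffé set:

$$\mathcal{W}_1 = \{x \in \mathcal{X} \mid H(x) > H'(x)\}.$$

Define $p_1 = H(\mathcal{W}_1)$, $p_2 = H'(\mathcal{W}_1)$, and $\tau = P(\mathcal{W}_1)$ to be the probability masses that $H$, $H'$, and $P$ place on $\mathcal{W}_1$, respectively. It follows that $p_1 > p_2$ and $p_1 - p_2 = d_{\mathrm{TV}}(H, H')$. (Footnote: For simplicity of our exposition, we will assume that we can evaluate the two quantities $p_1$ and $p_2$ exactly. In general, we can estimate these quantities to arbitrary accuracy, as long as we can evaluate the density of each point under $H$ and $H'$ and also draw samples from $H$ and $H'$.)

Now consider the following function of this ordered pair of hypotheses:

$$\Gamma(H, H', D) = \begin{cases} n & \text{if } p_1 - p_2 \le 6\alpha; \\ n \cdot \max\{0, \hat{\tau} - (p_2 + 3\alpha)\} & \text{otherwise,} \end{cases}$$

where $\hat{\tau}$ denotes the fraction of points of $D$ which lie in $\mathcal{W}_1$.

When the two hypotheses are sufficiently far apart (i.e., $d_{\mathrm{TV}}(H, H') > 6\alpha$), $\Gamma(H, H', D)$ is essentially the number of points one needs to change in $D$ to make $H'$ the winner.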
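The pairwise comparison can be made concrete for distributions on a finite domain. The sketch below assumes hypotheses are given as dictionaries of probabilities and that $\hat{\tau}$ is the fraction of sample points falling in the Scheffé set; the function names are illustrative.

```python
def scheffe_set(h, h_prime):
    """W1 = {x : H(x) > H'(x)} for distributions given as dicts on a finite domain."""
    return {x for x in set(h) | set(h_prime) if h.get(x, 0.0) > h_prime.get(x, 0.0)}


def gamma(h, h_prime, data, alpha):
    """Pairwise comparison value Gamma(H, H', D)."""
    n = len(data)
    w1 = scheffe_set(h, h_prime)
    p1 = sum(h.get(x, 0.0) for x in w1)        # mass H places on W1
    p2 = sum(h_prime.get(x, 0.0) for x in w1)  # mass H' places on W1
    if p1 - p2 <= 6 * alpha:
        return n  # hypotheses are too close to separate: treated as a draw
    tau_hat = sum(1 for x in data if x in w1) / n  # empirical mass of W1
    return n * max(0.0, tau_hat - (p2 + 3 * alpha))


h = {0: 0.9, 1: 0.1}
h_prime = {0: 0.1, 1: 0.9}
data = [0] * 90 + [1] * 10  # looks like a sample from h
print(gamma(h, h_prime, data, alpha=0.05), gamma(h_prime, h, data, alpha=0.05))
```

On this sample the hypothesis matching the data gets a large value, while the reversed ordered pair gets zero, matching the reading of $\Gamma$ as a margin of victory measured in data points.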

###### Lemma 3.1.

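Let $P$, $H$, $H'$ be distributions as above. With probability at least $1 - 2\exp(-2n\alpha^2)$ over the random draws of $D$ from $P^n$, $\hat{\tau}$ satisfies $|\tau - \hat{\tau}| < \alpha$, and if $d_{\mathrm{TV}}(P, H) \le \alpha$, then $\Gamma(H, H', D) > \alpha n$.

###### Proof.

By applying Hoeffding’s inequality, we know that with probability at least $1 - 2\exp(-2n\alpha^2)$, $|\tau - \hat{\tau}| < \alpha$. We condition on this event for the remainder of the proof. Consider the following two cases. In the first case, suppose that $p_1 - p_2 \le 6\alpha$. Then we know that $\Gamma(H, H', D) = n > \alpha n$. In the second case, suppose that $p_1 - p_2 > 6\alpha$. Since $d_{\mathrm{TV}}(P, H) \le \alpha$, we know that $|p_1 - \tau| \le \alpha$, and so $|p_1 - \hat{\tau}| < 2\alpha$. Since $p_1 > p_2 + 6\alpha$, we also have $\hat{\tau} > p_1 - 2\alpha > p_2 + 4\alpha$. It follows that $\Gamma(H, H', D) = n(\hat{\tau} - (p_2 + 3\alpha)) > \alpha n$. This completes the proof. ∎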

### 3.2 Selection via Exponential Mechanism

In light of the pairwise comparison defined above, we consider the following score function $S : \mathcal{H} \times \mathcal{X}^n \to \mathbb{R}$, such that for any $H_j \in \mathcal{H}$ and dataset $D$,

$$S(H_j, D) = \min_{H_k \in \mathcal{H}} \Gamma(H_j, H_k, D). \tag{1}$$

Roughly speaking, $S(H_j, D)$ is the minimum number of points required to change in $D$ in order for $H_j$ to lose at least one pairwise contest against a different hypothesis. When the hypothesis $H_j$ is very close to every other distribution, such that all pairwise contests return “Draw,” then the score will be $n$.

###### Lemma 3.2 (Privacy).

For any $\varepsilon > 0$ and collection of hypotheses $\mathcal{H}$, the algorithm PHS satisfies $\varepsilon$-differential privacy.

###### Proof.

First, observe that for any pair of hypotheses $H_j, H_k$, $\Gamma(H_j, H_k, \cdot)$ has sensitivity 1. As a result, the score function $S$ is also 1-sensitive. Then the result directly follows from the privacy guarantee of the exponential mechanism (Theorem 2.5). ∎
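Combining the score function (1) with the exponential mechanism gives the selection rule whose privacy is established above. The following is a minimal sketch for a finite domain, assuming hypotheses are dictionaries of probabilities and the empirical mass of each Scheffé set is computed from the sample; the function names are illustrative, and the direct double loop over hypotheses is the simplest implementation rather than an optimized one.

```python
import math
import random


def gamma(h, h_prime, data, alpha):
    """Pairwise comparison value Gamma(H, H', D) on a finite domain."""
    n = len(data)
    w1 = {x for x in set(h) | set(h_prime) if h.get(x, 0.0) > h_prime.get(x, 0.0)}
    p1 = sum(h.get(x, 0.0) for x in w1)
    p2 = sum(h_prime.get(x, 0.0) for x in w1)
    if p1 - p2 <= 6 * alpha:
        return float(n)
    tau_hat = sum(1 for x in data if x in w1) / n
    return n * max(0.0, tau_hat - (p2 + 3 * alpha))


def score(j, hypotheses, data, alpha):
    """S(H_j, D): the minimum of Gamma(H_j, H_k, D) over all H_k."""
    return min(gamma(hypotheses[j], hk, data, alpha) for hk in hypotheses)


def private_hypothesis_selection(hypotheses, data, alpha, epsilon, rng=random):
    """Sample an index with probability proportional to exp(epsilon * S / 2)."""
    scores = [score(j, hypotheses, data, alpha) for j in range(len(hypotheses))]
    top = max(scores)
    # The score has sensitivity 1, so the exponent is epsilon * S / 2;
    # subtracting the top score only rescales the weights.
    weights = [math.exp(epsilon * (s - top) / 2.0) for s in scores]
    return rng.choices(range(len(hypotheses)), weights=weights, k=1)[0]


coins = [{1: b, 0: 1.0 - b} for b in (0.1, 0.5, 0.9)]
sample = [1] * 1800 + [0] * 200
print(private_hypothesis_selection(coins, sample, alpha=0.05, epsilon=1.0))
```

Each score depends on the data only through counts of points in Scheffé sets, which is why changing one data point moves every score by at most one.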

###### Lemma 3.3 (Utility).

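Fix any $\alpha, \beta \in (0, 1)$. Suppose that there exists $H^* \in \mathcal{H}$ such that $d_{\mathrm{TV}}(P, H^*) \le \alpha$. Then with probability $1 - \beta$ over the sample $D$ and the algorithm PHS, we have that PHS outputs a hypothesis $\hat{H}$ such that $d_{\mathrm{TV}}(P, \hat{H}) \le 7\alpha$, as long as the sample size satisfies

$$n \ge \frac{2\ln(8m/\beta)}{\alpha^2} + \frac{2\ln(2m/\beta)}{\alpha\varepsilon}.$$

###### Proof.

First, consider the $m$ pairwise contests between $H^*$ and every candidate in $\mathcal{H}$. Let $\mathcal{W}$ be the collection of Scheffé sets. For any event $W$, let $\hat{P}(W)$ denote the empirical probability of event $W$ on the dataset $D$. By Lemma 3.1 and an application of the union bound, we know that with probability at least $1 - \beta/2$ over the draws of $D$, $|P(W) - \hat{P}(W)| < \alpha$ for all $W \in \mathcal{W}$ and $\Gamma(H^*, H_j, D) > \alpha n$ for all $H_j \in \mathcal{H}$. In particular, the latter event implies that $S(H^*, D) > \alpha n$.

Next, by the utility guarantee of the exponential mechanism (Theorem 2.5), we know that with probability at least $1 - \beta/2$, the output hypothesis satisfies

$$S(\hat{H}, D) \ge S(H^*, D) - \frac{2\ln(2m/\beta)}{\varepsilon} > \alpha n - \frac{2\ln(2m/\beta)}{\varepsilon}.$$

Then as long as $n \ge \frac{2\ln(2m/\beta)}{\alpha\varepsilon}$, we know that with probability at least $1 - \beta$, $S(\hat{H}, D) > 0$. Let us condition on this event, which implies that $\Gamma(\hat{H}, H^*, D) > 0$. We will now show that $d_{\mathrm{TV}}(\hat{H}, H^*) \le 6\alpha$, which directly implies that $d_{\mathrm{TV}}(\hat{H}, P) \le 7\alpha$ by the triangle inequality. Suppose to the contrary that $d_{\mathrm{TV}}(\hat{H}, H^*) > 6\alpha$. Then by the definition of $\Gamma$, $\Gamma(\hat{H}, H^*, D) = n \cdot \max\{0, \hat{P}(W) - (H^*(W) + 3\alpha)\}$, where $W = \{x \mid \hat{H}(x) > H^*(x)\}$. Since $\Gamma(\hat{H}, H^*, D) > 0$, we have $\hat{P}(W) > H^*(W) + 3\alpha$, which is a contradiction to the assumption that $|P(W) - \hat{P}(W)| < \alpha$, because $|P(W) - H^*(W)| \le d_{\mathrm{TV}}(P, H^*) \le \alpha$. ∎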
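To get a feel for the bound in Lemma 3.3, the short calculator below evaluates it numerically, reading the displayed requirement as $2\ln(8m/\beta)/\alpha^2 + 2\ln(2m/\beta)/(\alpha\varepsilon)$. The function name is ours and the parameter values are arbitrary examples.

```python
import math


def phs_sample_size(m, alpha, epsilon, beta):
    """Smallest integer n meeting the sample size requirement of Lemma 3.3."""
    statistical = 2.0 * math.log(8 * m / beta) / alpha ** 2
    privacy = 2.0 * math.log(2 * m / beta) / (alpha * epsilon)
    return math.ceil(statistical + privacy)


for m in (10 ** 3, 10 ** 6):
    print(m, phs_sample_size(m, alpha=0.1, epsilon=1.0, beta=0.1))
```

With these example parameters, increasing the number of hypotheses by a factor of a thousand raises the requirement by less than a factor of two, which is the practical meaning of a logarithmic dependence on $m$.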

### 3.3 Obtaining a Semi-Agnostic Algorithm

Theorem 1.1 shows that given a hypothesis class $\mathcal{H}$ and samples from an unknown distribution $P$, we can privately find a distribution $\hat{H} \in \mathcal{H}$ with $d_{\mathrm{TV}}(P, \hat{H}) = O(\alpha)$ provided that we know $d_{\mathrm{TV}}(P, \mathcal{H}) \le \alpha$. But what if we are not promised that $P$ is itself close to $\mathcal{H}$? We would like to design a private hypothesis selection algorithm for the more general semi-agnostic setting, where for any value of $\mathrm{OPT} = d_{\mathrm{TV}}(P, \mathcal{H})$, we are able to privately identify a distribution $\hat{H} \in \mathcal{H}$ with $d_{\mathrm{TV}}(P, \hat{H}) \le c \cdot \mathrm{OPT} + O(\alpha)$ for some universal constant $c$. Our goal will be to do this with sample complexity which is still logarithmic in $m$.

Our strategy for handling this more general setting is by a reduction to that of Theorem 1.1. We run that algorithm $T = O(\log(1/\alpha))$ times, doubling the choice of $\alpha$ in each run and producing a sequence of candidate hypotheses $H_1, \dots, H_T$. By the guarantees of Theorem 1.1, there is some candidate $H_t$ with $d_{\mathrm{TV}}(P, H_t) = O(\mathrm{OPT}) + O(\alpha)$. The remaining task is to approximately select the best candidate from $H_1, \dots, H_T$. This is done by implementing a private version of the Scheffé tournament which is itself semi-agnostic, but has a very poor (quadratic) dependence on the number of candidates $T$.

We prove the following result, which gives a semi-agnostic learner whose sample complexity is comparable to that of Theorem 1.1.

###### Theorem 3.4.

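Let $\alpha, \beta, \varepsilon \in (0, 1)$. Let $\mathcal{H}$ be a set of $m$ distributions and let $P$ be a distribution with $d_{\mathrm{TV}}(P, \mathcal{H}) = \mathrm{OPT}$. There is an $\varepsilon$-differentially private algorithm which takes as input $n$ samples from $P$ and with probability at least $1 - \beta$, outputs a distribution $\hat{H} \in \mathcal{H}$ with , as long as

$$n \ge O\left(\frac{\log(m/\beta) + \log\log(1/\alpha)}{\alpha^2} + \frac{\log m + \log^2(1/\alpha) \cdot \left(\log(1/\beta) + \log\log(1/\alpha)\right)}{\alpha\varepsilon}\right).$$

As discussed above, the algorithm relies on the following variant with a much worse dependence on $m$.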

###### Lemma 3.5.

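Let $\alpha, \beta, \varepsilon \in (0, 1)$. There is an $\varepsilon$-differentially private algorithm which takes as input $n$ samples from $P$ and with probability at least $1 - \beta$, outputs a distribution $\hat{H} \in \mathcal{H}$ with , as long as

$$n \ge O\left(\frac{\log(m/\beta)}{\alpha^2} + \frac{m^2 \log(m/\beta)}{\alpha\varepsilon}\right).$$

###### Proof sketch.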

We use a different variation of the Scheffé tournament which appears in [DevroyeL01]. Non-privately, the algorithm works as follows. For every pair of hypotheses $H, H' \in \mathcal{H}$ with Scheffé set $W_{H,H'}$, let $p_1$, $p_2$, and $\tau$ denote the probability masses of $H$, $H'$, and $P$ on $W_{H,H'}$, respectively. Moreover, let $\hat{\tau}$ denote the fraction of points in the input sample which lie in $W_{H,H'}$. We declare $H$ to be the winner of the pairwise contest between $H$ and $H'$ if $|p_1 - \hat{\tau}| < |p_2 - \hat{\tau}|$. Otherwise, we declare $H'$ to be the winner. The algorithm outputs the hypothesis which wins the most pairwise contests (breaking ties arbitrarily).

To make this algorithm $\varepsilon$-differentially private, we replace $\hat{\tau}$ in each pairwise contest with the $\left(\varepsilon/\binom{m}{2}\right)$-differentially private estimate $c_{H,H'} = \hat{\tau} + \mathrm{Lap}\left(\binom{m}{2}/(\varepsilon n)\right)$. By the composition guarantees of differential privacy, the algorithm as a whole is $\varepsilon$-differentially private.

The analysis of Devroye and Lugosi [DevroyeL01, Theorem 6.2] shows that the (private) Scheffé tournament outputs a hypothesis $\hat{H}$ with

$$d_{\mathrm{TV}}(\hat{H}, P) \le 9\,\mathrm{OPT} + 16 \max_{H, H' \in \mathcal{H}} \left| P(W_{H,H'}) - c_{H,H'} \right|.$$

Fix an arbitrary pair $H, H' \in \mathcal{H}$. A Chernoff bound shows that with probability at least as long as $n \ge O(\log(m/\beta)/\alpha^2)$. Moreover, properties of the Laplace distribution guarantee with probability at least as long as $n \ge O(m^2 \log(m/\beta)/(\alpha\varepsilon))$. The triangle inequality and a union bound over all pairs complete the proof. ∎
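The private tournament in this proof sketch can be illustrated on a finite domain. The sketch below assumes the privacy budget is split evenly across the pairwise contests and that each empirical fraction, which has sensitivity $1/n$, is perturbed with Laplace noise of matching scale; these choices, the dictionary representation, and the function names are assumptions made for illustration.

```python
import itertools
import random


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_scheffe_tournament(hypotheses, data, epsilon, rng=random):
    """Return the index of the hypothesis winning the most noisy contests."""
    m, n = len(hypotheses), len(data)
    num_pairs = m * (m - 1) // 2
    wins = [0] * m
    for i, j in itertools.combinations(range(m), 2):
        h, hp = hypotheses[i], hypotheses[j]
        w = {x for x in set(h) | set(hp) if h.get(x, 0.0) > hp.get(x, 0.0)}
        p1 = sum(h.get(x, 0.0) for x in w)
        p2 = sum(hp.get(x, 0.0) for x in w)
        tau_hat = sum(1 for x in data if x in w) / n
        # Each contest gets budget epsilon / num_pairs; tau_hat has sensitivity 1/n.
        noisy = tau_hat + laplace(num_pairs / (epsilon * n), rng)
        if abs(p1 - noisy) < abs(p2 - noisy):
            wins[i] += 1
        else:
            wins[j] += 1
    return max(range(m), key=lambda k: wins[k])


coins = [{1: b, 0: 1.0 - b} for b in (0.1, 0.5, 0.9)]
sample = [1] * 1800 + [0] * 200
print(private_scheffe_tournament(coins, sample, epsilon=1.0))
```

Because the noise scale grows with the number of pairs, the sample size needed to keep the noise below the target accuracy grows quadratically in the number of hypotheses, in line with the second term of the bound in Lemma 3.5.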

###### Proof of Theorem 3.4.

We now combine the private hypothesis selection algorithm of Theorem 1.1 with the expensive semi-agnostic learner of Lemma 3.5 to prove Theorem 3.4. Define sequences and for . For each , let denote the outcome of a run of Algorithm 2 using accuracy parameter and privacy parameter . Finally, use the algorithm of Lemma 3.5 to select a hypothesis from using accuracy parameter and privacy parameter .

Privacy of this algorithm follows immediately from composition of differential privacy. We now analyze its sample complexity guarantee. By Lemma 3.3, we have that all runs of Algorithm 2 succeed simultaneously with probability at least as long as

$$n \ge O\!\left(\frac{\log(m/\beta) + \log\log(1/\alpha)}{\alpha^2} + \frac{\log(m/\beta) + \log\log(1/\alpha)}{\alpha\varepsilon}\right).$$

Condition on this event occurring. Recall that success of run of Algorithm 2 means that if , then . Meanwhile, if , then we have . Hence, regardless of the value of , there exists a run such that . The algorithm of Lemma 3.5 is now, with probability at least , able to select a hypothesis with as long as

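$$n \ge O\!\left(\frac{\log(1/\beta) + \log\log(1/\alpha)}{\alpha^2} + \frac{\log^2(1/\alpha)\cdot\left(\log(1/\beta) + \log\log(1/\alpha)\right)}{\alpha\varepsilon}\right).$$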

This gives the asserted sample complexity guarantee. ∎

## 4 An Advanced Method for Private Hypothesis Selection

In Section 3, we provided a simple algorithm whose sample complexity grows logarithmically in the size of the hypothesis class. We now demonstrate that this dependence can be improved and, indeed, that we can handle infinite hypothesis classes, provided that their VC dimension is finite and that the cover has small doubling dimension.

To obtain this improved dependence on the hypothesis class size, we must make two improvements to the analysis and algorithm. First, rather than applying a union bound over all the pairwise contests to analyse the tournament, we use a uniform convergence bound in terms of the VC dimension of the Scheffé sets. Second, rather than use the exponential mechanism to select a hypothesis, we use a “GAP-MAX” algorithm [BunDRS18]. This takes advantage of the fact that, in many cases, even for infinite hypothesis classes, only a handful of hypotheses will have high scores. The GAP-MAX algorithm need only pay for the hypotheses that are close to optimal. To exploit this, we must move to a relaxation of pure differential privacy which is not subject to strong packing lower bounds (as we describe in Section 5). Specifically, we consider approximate differential privacy, although results with an improved dependence are also possible under various variants of concentrated differential privacy [DworkR16, BunS16, Mironov17, BunDRS18].

###### Theorem 4.1.

Let be a set of probability distributions on . Let be the VC dimension of the set of functions defined by where . There exists a -differentially private algorithm which has the following guarantee. Let be a set of private samples drawn independently from an unknown probability distribution . Let . Suppose there exists a distribution such that . If , then the algorithm will output a distribution such that with probability at least .

Alternatively, we can demand that the algorithm be -concentrated differentially private if .

Comparing Theorem 4.1 to Theorem 1.1, we see that the first (non-private) term is replaced by the VC dimension and the second (private) term is replaced by . Here is a measure of the “local” size of the hypothesis class ; its definition is similar to that of the doubling dimension of the hypothesis class under total variation distance.

We note that the term could be large, as the privacy failure probability should be cryptographically small. Thus our result includes statements for pure differential privacy (by using the other term in the minimum with ) and also concentrated differential privacy. Note that, since and can be upper-bounded by , this result supersedes the guarantees of Theorem 1.1.

### 4.1 VC Dimension

We begin by reviewing the definition of Vapnik-Chervonenkis (VC) dimension and its properties.

###### Definition 4.2 (VC dimension [VapnikC74]).

Let be a set of functions . The VC dimension of is defined to be the largest such that there exist and such that for all there exists such that .
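To make the definition concrete, here is a minimal brute-force sketch that computes the VC dimension of a finite family of 0/1-valued functions restricted to a finite domain. The function name `vc_dimension` and the representation of functions as Python callables are illustrative assumptions; the search is exponential and only meant for toy examples.

```python
import itertools


def vc_dimension(functions, domain):
    """Largest k such that some k points of `domain` are shattered.

    functions: list of callables mapping a point to 0 or 1.
    domain: finite list of points.
    """
    best = 0
    for k in range(1, len(domain) + 1):
        shattered = False
        for points in itertools.combinations(domain, k):
            # Collect every labelling of `points` realised by the family.
            patterns = {tuple(f(x) for x in points) for f in functions}
            if len(patterns) == 2 ** k:
                shattered = True
                break
        if not shattered:
            # Subsets of shattered sets are shattered, so if no set of
            # size k is shattered then no larger set can be either.
            break
        best = k
    return best


# Thresholds x -> 1[x >= t] on {0, 1, 2, 3} shatter single points only.
thresholds = [lambda x, t=t: int(x >= t) for t in range(5)]
# Intervals x -> 1[a <= x <= b] shatter pairs but no triple.
intervals = [lambda x, a=a, b=b: int(a <= x <= b)
             for a in range(4) for b in range(a, 4)]
```

Since shattering $k$ points requires $2^k$ distinct functions, the value returned for a finite family never exceeds the base-2 logarithm of the family's size.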

For our setting, we must extend the definition of VC dimension from function families to hypothesis classes.

###### Definition 4.3 (VC dimension of hypothesis class).

Let be a set of probability distributions on a space . For , define by . Define . We define the VC dimension of to be the VC dimension of .³

³ Here, for simplicity, we assume that each distribution is given by a density function . More generally, we define the VC dimension of to be the smallest such that there exists a function family of VC dimension with the property that, for all we have , where the supremum is over measurable with respect to both and . We ignore this technicality throughout.

The key property of VC dimension is the following uniform convergence bound, which we use in place of a union bound.

###### Theorem 4.4 (Uniform Convergence [Talagrand94]).

Let be a set of functions with VC dimension . Let be a distribution on . Then

$$\Pr_{D \leftarrow P^n}\left[\sup_{f \in \mathcal{F}} |f(D) - f(P)| \le \alpha\right] \ge 1 - \beta$$

whenever . Here and .
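As a small numerical illustration of uniform convergence, the sketch below measures the worst-case deviation over the class of threshold functions $x \mapsto 1[x \le t]$, which has VC dimension 1, for samples from the uniform distribution on $[0,1]$. It assumes the usual reading in which $f(D)$ is the empirical average of $f$ over the sample and $f(P)$ is its expectation under $P$; the choice of class, distribution, and function name are illustrative and not taken from the paper.

```python
import random


def sup_deviation_thresholds(sample):
    """sup over t of |(1/n) * #{x in sample : x <= t} - t|.

    Assumes the sample comes from Uniform[0, 1], so the true value of
    the threshold function at t is t itself.
    """
    xs = sorted(sample)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        # The empirical average jumps from i/n to (i+1)/n at x, so the
        # supremum is attained just before or exactly at a sample point.
        worst = max(worst, abs((i + 1) / n - x), abs(i / n - x))
    return worst


rng = random.Random(0)
for n in (100, 1000, 10000):
    sample = [rng.random() for _ in range(n)]
    # The deviation should shrink roughly like 1/sqrt(n).
    print(n, round(sup_deviation_thresholds(sample), 4))
```

A single bound on this supremum controls every threshold simultaneously, which is the role the theorem plays in place of a union bound over individual functions.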

It is immediate from Definition 4.2 that the VC dimension of a finite family $\mathcal{F}$ is at most $\log_2 |\mathcal{F}|$. Thus Theorem 4.4 subsumes the union bound used in the proof of Theorem 1.1.

The relevant application of uniform convergence for our algorithm is the following lemma (roughly the equivalent of Lemma 3.1), which says that good hypotheses have high scores, and bad hypotheses have low scores.

###### Lemma 4.5.

Let be a collection of probability distributions on with VC dimension .

Let be as in Equation 1, namely

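$$S(H, D) = \inf_{H' \in \mathcal{H}} \max\Big\{ \big|\{x \in D : H(x) > H'(x)\}\big| - n \cdot \Big(\Pr_{X \leftarrow H'}[H(X) > H'(X)] + 3\alpha\Big),\; n \cdot \mathbb{I}\big[d_{\mathrm{TV}}(H, H') \le 6\alpha\big] \Big\},$$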

where denotes the indicator function.
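The score function can be evaluated directly when the hypotheses are distributions on a finite domain. The following is a minimal sketch under that assumption; representing each hypothesis as a dictionary of point masses and the names `tv_distance` and `score` are choices made here for illustration.

```python
def tv_distance(h, h2):
    """Total variation distance between two finite distributions."""
    support = set(h) | set(h2)
    return 0.5 * sum(abs(h.get(x, 0.0) - h2.get(x, 0.0)) for x in support)


def score(h, hypotheses, sample, alpha):
    """Evaluate S(h, D) for a finite hypothesis class and sample D."""
    n = len(sample)
    best = float("inf")
    for h2 in hypotheses:
        # Scheffe set of (h, h2): points where h has strictly more mass.
        support = set(h) | set(h2)
        scheffe = {x for x in support if h.get(x, 0.0) > h2.get(x, 0.0)}
        count = sum(1 for x in sample if x in scheffe)
        mass_under_h2 = sum(h2.get(x, 0.0) for x in scheffe)
        first = count - n * (mass_under_h2 + 3 * alpha)
        # The indicator term makes every nearby competitor contribute n.
        second = n if tv_distance(h, h2) <= 6 * alpha else 0
        best = min(best, max(first, second))
    return best
```

For example, with `h_star = {0: 0.5, 1: 0.5}`, `h_far = {0: 0.9, 1: 0.1}`, a sample of 500 zeros and 500 ones, and `alpha = 0.05`, the hypothesis matching the data scores well above $\alpha n$, while the distant one scores zero.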

Let be a distribution on . Let and . Suppose there exists with . Then, with probability at least over , we have

• $S(H^*, D) > \alpha n$, and

• $S(H, D) = 0$ for all $H \in \mathcal{H}$ with $d_{\mathrm{TV}}(H, P) > 7\alpha$.

###### Proof.

For , define by . Note that and is the VC dimension of the function class . By Theorem 4.4, if , then

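$$\Pr_{D \leftarrow P^n}\left[\forall H, H' \in \mathcal{H} \;\; \Big| \big|\{x \in D : H(x) > H'(x)\}\big| - n \cdot \Pr_{X \leftarrow P}[H(X) > H'(X)] \Big| \le \alpha n \right] \ge 1 - \beta.$$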

We condition on this event happening.

In order to prove the first conclusion – namely, – it remains to show that, for all , we have either or

$$\big|\{x \in D : H^*(x) > H'(x)\}\big| - n \cdot \Big(\Pr_{X \leftarrow H'}[H^*(X) > H'(X)] + 3\alpha\Big) > \alpha n.$$

If , we are done, so assume . By the uniform convergence event we have conditioned on,

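$$\begin{aligned} \big|\{x \in D : H^*(x) > H'(x)\}\big| &\ge n \cdot \Big(\Pr_{X \leftarrow P}[H^*(X) > H'(X)] - \alpha\Big) \\ &\ge n \cdot \Big(\Pr_{X \leftarrow H^*}[H^*(X) > H'(X)] - d_{\mathrm{TV}}(P, H^*) - \alpha\Big) \\ &\ge n \cdot \Big(d_{\mathrm{TV}}(H^*, H') + \Pr_{X \leftarrow H'}[H^*(X) > H'(X)] - 2\alpha\Big) \\ &> n \cdot \Big(6\alpha + \Pr_{X \leftarrow H'}[H^*(X) > H'(X)] - 2\alpha\Big), \end{aligned}$$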

from which the desired conclusion follows.

In order to prove the second conclusion – namely, for all with – it suffices to show that one yields a score of zero for any with . In particular, we show that yields a score of zero for any such . That is, if , then and

$$\big|\{x \in D : H(x) > H^*(x)\}\big| - n \cdot \Big(\Pr_{X \leftarrow H^*}[H(X) > H^*(X)] + 3\alpha\Big) \le 0.$$

By the triangle inequality , as required. By the uniform convergence event we have conditioned on,

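$$\begin{aligned} \big|\{x \in D : H(x) > H^*(x)\}\big| &\le n \cdot \Big(\Pr_{X \leftarrow P}[H(X) > H^*(X)] + \alpha\Big) \\ &\le n \cdot \Big(\Pr_{X \leftarrow H^*}[H(X) > H^*(X)] + d_{\mathrm{TV}}(P, H^*) + \alpha\Big) \\ &\le n \cdot \Big(\Pr_{X \leftarrow H^*}[H(X) > H^*(X)] + 2\alpha\Big), \end{aligned}$$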

which completes the proof. ∎

### 4.2 GAP-MAX Algorithm

In place of the exponential mechanism for privately selecting a hypothesis we use the following algorithm that works under a “gap” assumption. That is, we assume that there is a gap between the highest score and the -th highest score. Rather than paying in sample complexity for the total number of hypotheses we pay for the number of high-scoring hypotheses .

This algorithm is based on the GAP-MAX algorithm of Bun, Dwork, Rothblum, and Steinke [BunDRS18]. However, we combine their GAP-MAX algorithm with the exponential mechanism to improve the dependence on the parameter .
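The quantity that the gap assumption depends on is simple to compute from a list of scores. The sketch below is not the GAP-MAX algorithm itself, and it provides no privacy; it only counts the hypotheses whose score lies within $5\alpha n$ of the maximum (the quantity $K(D, 5\alpha)$ of Theorem 4.6) and measures a score gap. Reading the gap as the difference between the highest and the $(k+1)$-th highest score is an assumption, and the function names are hypothetical.

```python
def num_near_optimal(scores, n, alpha):
    """Count hypotheses scoring within 5 * alpha * n of the best score."""
    top = max(scores)
    return sum(1 for s in scores if s >= top - 5 * alpha * n)


def gap_to_rank(scores, k):
    """Difference between the highest and the (k+1)-th highest score.

    Requires len(scores) > k.
    """
    ordered = sorted(scores, reverse=True)
    return ordered[0] - ordered[k]


# Example: four hypotheses scored on n = 1000 samples with alpha = 0.01.
example_scores = [250, 0, 0, 240]
k = num_near_optimal(example_scores, 1000, 0.01)  # two scores within 50 of 250
gap = gap_to_rank(example_scores, k)              # 250 minus third-highest
```

When only a few hypotheses are near-optimal, this count stays small even if the class itself is very large, which is the situation the gap-based selection is designed to exploit.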

###### Theorem 4.6.

Let and be arbitrary sets. Let have sensitivity at most 1 in its second argument – that is, for all and all differing in a single example, .

For and , define

$$K(D, 5\alpha) := \left| \left\{ H \in \mathcal{H} : S(H, D) \ge \sup_{H' \in \mathcal{H}} S(H', D) - 5\alpha n \right\} \right|.$$

Given parameters and , there exists a -differentially private randomized algorithm such that, for all and all ,

$$K(D, 5\alpha) \le k \implies \Pr\left[ S(M(D), D) \ge \sup_{H' \in \mathcal{H}} S(H', D) - \alpha n \right] \ge 1 - \beta$$

provided .

Furthermore, given and , there exists a -concentrated differentially private [BunS16] algorithm such that, for all and all ,

$$K(D, 5\alpha) \le k \implies \Pr\left[ S(M(D), D) \ge \sup_{H' \in \mathcal{H}} S(H', D) - \alpha n \right] \ge 1 - \beta$$

provided