1 Introduction
Hypothesis selection is a fundamental task in statistics, where a learner is getting a sample access to an unknown distribution on some, possibly infinite, domain , and wishes to output a distribution that is “close” to
. The problem was studied extensively over the last century and found many applications, most notably, in machine learning.
In this paper we study the hypothesis selection problem in the agnostic setting, where we assume a fixed finite class of reference distributions (see the discussion of the infinite case at the end of this section) which is known to the learner, and which may or may not contain (the setting where is assumed to be in is called the realizable setting). The goal of the learner is to output a distribution that is at least as close to as any of the distributions in in total variation distance (denoted here ).
The statistical performance of a learner is measured using two parameters, denoted and , where is the approximation factor of the algorithm and is its sample complexity. Specifically, we say that a class of distributions is learnable with sample complexity if there is a (possibly randomized) learner such that for every and every target distribution , upon receiving random samples from , the learner outputs a distribution satisfying
with probability at least
. For the discussion below, we think of as a small constant.
How good can a learner be?
A priori, it is not even clear that every class is learnable with finite sample complexity. Consider the following natural algorithm for hypothesis selection: estimate for every and output the that minimizes this quantity. While this algorithm clearly works (and even achieves an approximation factor of ), estimating for any requires samples from (see, e.g., [JHW18]). Thus, if the domain is infinite (say ), the sample complexity of this algorithm is not even finite. However, perhaps surprisingly, despite the impossibility of estimating the distance of from even one of the distributions , one can still find an approximate minimizer of the distances (even when is infinite!).
What are the smallest and for which any given class of distributions of size is learnable with sample complexity ? A seminal work by Yatracos [Yat85] (also see [DL96, DL97, DL01]) shows that any reference class of size is learnable with sample complexity . For the case of , Mahalanabis and Stefankovic [MS08] improve the approximation factor, constructing a learner. This was extended by the recent work of Bousquet, Kane, and Moran [BKM19] to give a approximation for any finite , using a very different scheme. A matching lower bound of on the approximation factor follows from the work of [CDSS14].
Although the work of [BKM19] obtains the optimal approximation factor for the agnostic hypothesis selection problem, the sample complexity of their scheme is , which is exponential in the sample complexity of Yatracos’s algorithm (we note that [BKM19] also provide sample complexity bounds, which can be better than their general bound for finite domains ). Deriving optimal learners with efficient sample complexity is left as the main open problem in their work. In this paper, we give a novel learner with (near) optimal sample complexity, getting the best of both worlds.
Density Estimation.
Hypothesis selection, and, in particular, Yatracos’s algorithm, found applications beyond learning finite classes. Specifically, it is used as a basic subroutine in density estimation tasks where the goal is to learn an infinite class of distributions, in the realizable or agnostic setting (in fact, learning infinite classes was a part of Yatracos’s original motivation). A popular method, where the reference class may be infinite, is the cover method (a.k.a. the skeleton method). In this method, one “covers” the class by a finite cover; that is, a subclass of distributions such that for every there exists with . It is often the case that even if is infinite, a finite net exists, and Yatracos’s agnostic learning algorithm can be applied to (see [DL01, Dia16] and references therein for many such examples).
While the minimal possible size of such a cover is often exponential in the natural parameters of the class , because Yatracos’s algorithm has polylogarithmic sample complexity, the obtained density estimation algorithm has a polynomial sample complexity. (One easy example of an exponential cover is when is the set of all convex combinations of fixed distributions , i.e., . The set is a cover of of exponential size (in ). Subexponential covers are not possible in this case. See Chapter in [DL01] for this example, and the rest of Chapter for more such examples.) Since many density estimation results follow the cover method, or other related methods that use Yatracos’s algorithm as a black box, our algorithm can imply an improvement for all of these results. (Another such method is the recent sample compression method by [ABDH20], used to obtain improved density algorithms for the mixtures of Gaussians problem.) (We mention a couple of such examples below, in Section 1.4).
We note that in the realizable setting for density estimation, where the distribution we wish to learn is in the infinite class of distributions we are considering (that is, ), one can typically get a better approximation factor by taking a finer cover (smaller ). By taking an cover of , the above method results in a distribution with . However, in the agnostic setting, even if we take a very small , the resulting may not be small as it is dominated by . By using the result of this paper in lieu of Yatracos’s learning algorithm, this distance can be made .
1.1 Our Results
We design a learner for the agnostic hypothesis selection problem with sample complexity whose dependence on both and is (near) optimal.
Theorem 1.
Let be a finite class of distributions and let . Then, is learnable with sample complexity^{7}^{7}7We use the standard notation that if there exists such that . . In particular, for constant ,
Our learner in Theorem 1 is deterministic, and, as is the case for [BKM19], it only makes statistical queries. That is, our learner can be implemented in the restricted model where instead of getting random samples from , the learner has access to an oracle that on a query answers with a value in (or, equivalently, on a query , where is a set, answers with ). Furthermore, our algorithm consists of only such rounds of queries, whereas the algorithm of [BKM19] consists of such rounds.
2 Proof Overview
In this section we overview the proofs and highlight some of the more technical arguments. We defer the full proof to the Appendix.
Let be a (known) finite reference class of distributions and let denote the target distribution to which we have sample access. Denote . Our goal is to use as few samples as possible from in order to find such that .
2.1 A Geometric Approach to Hypothesis Selection
Our starting point is the approximation algorithm of [BKM19]. In this subsection we describe our interpretation of their technique (some of the claims we make here are implicit in their paper).
The basic observation of [BKM19] is that it suffices to find a distribution which is (almost) at least as close to each of the ’s as ,
(1) 
Finding such a suffices, as by the triangle inequality, for every , and, in particular, for .
This suggests the following definitions: for a distribution , let denote the vector of all distances ; a vector is feasible if for some distribution (when we write for we mean ). With this notation, our goal is to find such that

, where is the all-one vector, and

is feasible.
Once such a vector is obtained, one can find a distribution satisfying , and consequently a approximation for the target distribution .
Let denote the set of all feasible vectors and note that it is convex and upward-closed. The approach of [BKM19] for finding a desired proceeds in rounds, where in round we find a vector that is closer to the feasible set, while maintaining the invariant that :

Let be the all-zero vector. Note that , so satisfies the above Item (i), but not Item (ii) (except in trivial cases).

For

If is feasible (that is, if , where denotes distance), then output a such that ().

Else, use samples from to derive such that , and is “closer” (in some measure, see below) to .

Selecting the new point .
The crux of this approach is the update step in which is computed given . Since , there exists a such that and (for instance, since there exists a coordinate such that , where is the unit vector). [BKM19] show how to find such a with few queries (discussed next), and they use this as their next point. However, since , their strategy may require rounds.
2.1.1 Implementing the Strategy
Violated tests.
We next explain how [BKM19] find the coordinate of that they wish to update. To this end, observe that whenever is not feasible there is a hyperplane separating the point from the set of feasible vectors, witnessing the fact that . We call a normal to such a hyperplane a “violated test” (here denotes the simplex of all probability vectors in ). For and , we denote the set of all violated tests witnessing the fact that is not feasible by
From a test to an updated point .
We next informally state a central lemma proved by [BKM19], showing how to convert any violated test to a new point (for a precise statement, see Lemma 12 in [BKM19] or Lemma 7 in this paper).
Lemma 2.
Using statistical queries (queries of the form for some set ), any can be converted to a point satisfying:

.

passes the test induced by : . This also implies that (as implies and implies ).
Proving the lemma.
While the proof of Lemma 2 is pretty short, it is tricky. For completeness, we will next give some intuition for it by showing how to construct for a specific (easy to handle) .
Assume that is not feasible and that . Denote . (Observe that this is the so-called Yatracos set which is used in Yatracos’s approximation algorithm and satisfies ). Use samples from to get an estimate of up to an additive term. Set for and for . Obtain from by setting .
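As a rough illustration of the Yatracos set mentioned above, the following sketch works over a small finite domain. It assumes the standard definition of the set as the domain points where one hypothesis puts more mass than another; the function names, the array representation of distributions, and the example numbers are all hypothetical and do not come from the paper. The sketch checks that the set attains the total variation distance between the two hypotheses, and then estimates the mass that a target distribution assigns to the set from samples.

```python
import numpy as np

def tv(p, q):
    """Total variation distance between two probability vectors."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def yatracos_set(q_i, q_j):
    """Boolean mask of the domain points where q_i puts more mass than q_j."""
    return np.asarray(q_i) > np.asarray(q_j)

def empirical_mass(samples, mask):
    """Fraction of the samples that fall in the set described by mask."""
    return float(np.mean(mask[np.asarray(samples)]))

# Two hypothetical hypotheses on the domain {0, 1, 2, 3}.
q1 = np.array([0.5, 0.3, 0.1, 0.1])
q2 = np.array([0.2, 0.2, 0.3, 0.3])
A = yatracos_set(q1, q2)

# The set attains the total variation distance: q1(A) - q2(A) = TV(q1, q2).
gap = q1[A].sum() - q2[A].sum()

# Estimate p(A) for a hypothetical target p using random samples.
rng = np.random.default_rng(0)
p = np.array([0.4, 0.3, 0.2, 0.1])
samples = rng.choice(len(p), size=20000, p=p)
estimate = empirical_mass(samples, A)  # approximates p(A) = 0.7
```

Only the single number p(A) is estimated here, which is why such a step can be phrased as a statistical query.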
Query/sample complexity.
For a general , the proof of the lemma is more involved and crucially relies on the Minimax Theorem. The point is computed as , where for every , is of the form , for some set and where is an approximation of to within an additive error of for some constant .
Computing requires statistical queries (the values of for all ’s), where each needs to be approximated to within an additive error of . While approximating each query separately requires samples, by a standard combination of Chernoff and union bound, all queries can be approximated using samples.
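The combination of a Chernoff-type bound with a union bound can be made concrete using Hoeffding's inequality. The sketch below is illustrative only: the symbols k (number of queries), eps (additive accuracy) and delta (failure probability), as well as the function name, are assumptions of this example, and the constants are those of Hoeffding's inequality rather than anything stated in the paper.

```python
import math

def samples_for_queries(k, eps, delta):
    """Number of samples sufficing to estimate k bounded queries to within eps,
    all simultaneously, with probability at least 1 - delta.

    Hoeffding's inequality bounds the failure probability of one query by
    2 * exp(-2 * m * eps**2); a union bound over the k queries then requires
    2 * k * exp(-2 * m * eps**2) <= delta.
    """
    return math.ceil(math.log(2 * k / delta) / (2 * eps ** 2))

one_query = samples_for_queries(1, 0.1, 0.05)
many_queries = samples_for_queries(1000, 0.1, 0.05)
```

With these hypothetical parameters, moving from one query to a thousand less than triples the number of samples, reflecting the logarithmic dependence on the number of queries.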
2.2 The Cutting-With-Margin Game: A Dual Perspective
Recall that we wish to find a rule for updating to a satisfying that will allow us to reach a feasible point after the minimum number of steps. We wish to define a measure of progress to help us choose our next . As discussed above, [BKM19] use the norm as their measure of progress, but this results in a slow convergence to a feasible point.
To find a better progress measure, we revisit Lemma 2, specifically Item 2, which shows that by updating using the test , it is not only that , but also . We interpret this as implying that the set of violated tests can shrink substantially between rounds. This suggests a new approach: instead of measuring progress by comparing the locations of and , we can take a “dual” view and compare the sizes of the sets and of violated tests that we still need to rule out (recall that if this set is empty, we have found a feasible point). We note that this “dual” view is lossy (and is not a dual in the standard sense) as the mapping may not be one-to-one.
The cutting-with-margin game.
Consider a sequence in which the point was produced from by selecting some and applying Lemma 2, and where is feasible. Denote . It can be shown that is convex for every , and that ( as is feasible). Furthermore, we are able to prove that is disjoint from an ball of radius around (see Lemma 9). Intuitively, this is because (Lemma 2, Item 2) implies that the generated not only passes the test induced by , but also passes all “similar” tests.
The above discussion gives rise to the cutting-with-margin game discussed in the introduction (see Section 1.2.1). Recall that this is a game between a player and an adversary, and it is played over a convex body known to both the player and the adversary. Let ; in every round of the game, the player selects a point and the adversary picks to be any convex set which is disjoint from the ball of radius around . The game ends when the set is empty. See illustration in Figure 1. Of course, the task is now to find a strategy that solves this game with the minimum number of rounds. Note that, in the language of this game, the strategy of [BKM19] selects an arbitrary in round . We will next show a strategy for selecting that will allow for a faster convergence.
2.3 Warm-up: Sample Complexity
So far, we reduced the hypothesis selection problem to solving the cutting-with-margin game. We next outline a solution for the cutting-with-margin game in rounds. Since the implementation of each round requires samples (see Section 2.1.1), this implies an algorithm for hypothesis selection with sample complexity.
First observe that an equivalent way of presenting the cutting-with-margin game lets the adversary pick in each round a halfspace which is disjoint from the ball of radius around , and the game continues with . This presentation is reminiscent of Grünbaum’s inequality [Grü60], which guarantees that if the player picks the centroid (which is a standard way of defining the “center” of a body) of then , where is the standard (Lebesgue) volume. While the centroid is an intuitive choice for our player, a counter strategy by the adversary will pick bodies that have small volumes but large diameters. Indeed, note that as long as the diameter of the body is greater than , the adversary can force at least one additional round. This shows that the volume is too crude a measure for our game. Ideally, we would have wanted to use a different “centroid” that satisfies an analogous property with respect to the diameter (say, ). Unfortunately, no such object exists.
The approach we take for designing our player stems from the observation that if the player could always pick a point
that is close to the uniform distribution
, then the game would have been solved in a few rounds. It is easiest to see why when using the “primal” point of view from Section 2.1: indeed, assume is separated from by a hyperplane perpendicular to . Then, since lies on the other side of that hyperplane, it follows that . So, when updating from to , the norm increases by at least (recall from Section 2.1 that in the [BKM19] strategy the norm increases by only in each round). Thus, since in the norm is bounded by , the total number of such steps is at most . Of course, this strategy is impossible, as if then a ball of radius is disjoint from , for all .
Entropy as a progress measure.
Inspired by the above intuition, our approach will be to set to be as “close” to as possible. Indeed, we select that maximizes the entropy function (here we view the point as a distribution). This corresponds to measuring the distance from the uniform distribution using divergence. The reason that the entropy function gives an efficient solution for our game boils down to the fact that it is (i) strongly convex w.r.t (as is evident by Pinsker’s Inequality), and (ii) bounded by over the simplex. Roughly speaking, strong convexity means that in every step the entropy drops by . This, combined with the fact that the entropy is bounded by , implies our solution for the cutting-with-margin game. (Given that, it is natural to look for a strongly convex function over the simplex that is bounded by . However, no such function exists.)
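The role of Pinsker's inequality here can be checked numerically. The sketch below is an illustration with hypothetical function names; it assumes natural logarithms and the standard identity that the KL divergence from a point of the simplex to the uniform distribution on n points equals log n minus the entropy of that point. It verifies, on random points of the simplex, that this entropy deficit is at least half the squared l1 distance from the uniform distribution.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, with the convention 0 * log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-(p[nz] * np.log(p[nz])).sum())

def entropy_deficit(p):
    """log(n) - H(p), which equals KL(p || uniform)."""
    return float(np.log(len(p))) - entropy(p)

def pinsker_lower_bound(p):
    """(1/2) * ||p - uniform||_1^2, the lower bound from Pinsker's inequality."""
    n = len(p)
    return 0.5 * float(np.abs(np.asarray(p) - 1.0 / n).sum()) ** 2

rng = np.random.default_rng(1)
points = rng.dirichlet(np.ones(8), size=1000)  # random points in the simplex
holds = all(entropy_deficit(p) >= pinsker_lower_bound(p) - 1e-12 for p in points)
```

In this form, a point that is far from uniform in l1 distance necessarily has noticeably lower entropy, which is the quantitative link between the margin of the game and the drop in entropy.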
As discussed in the introduction, entropy and divergence based strategies are often used in the context of optimization and regret minimization, basically for similar reasons (convexity and boundedness). However, our game is not defined by a cost function measuring the cost of each round separately, but rather, our “cost function” is the length of the game.
2.4 Near-Optimal Sample Complexity
In Section 2.3, we gave a hypothesis selection algorithm with samples, by solving the dual game. While this algorithm uses exponentially fewer samples than the one by [BKM19], it is still suboptimal. We next show how to obtain an algorithm with a near-optimal sample complexity of , by first improving the dependence on to (less involved), and then improving the dependence on to (one of the main technical contributions of this paper). Since the sample complexity of our resulting algorithm (almost) matches Yatracos’s, it can replace Yatracos’s algorithm in density estimation algorithms to obtain a better approximation factor, while keeping the same low sample complexity.
2.4.1 Optimal Dependence on
We revisit the basic observation from Section 2.1 that finding a distribution satisfying suffices in order to get a approximation for hypothesis selection (see Equation 1). We observe that it also suffices to find that only satisfies (recall that minimizes ) for exactly the same reason: . Thus, it suffices for our algorithm to maintain the invariant , instead of . This suggests that we can relax Item 1 in Lemma 2 and only require (in addition to ).
Due to the above, had we known , we would only shoot for a good approximation (to within ) of , which means that Lemma 2 can use only samples (to get a good approximation of ). But, we don’t know the identity of . The crucial observation here is that this does not matter. We can use the same samples to evaluate each of the statistical queries corresponding to each of the coordinates of . Of course, since we are using too few samples, some of these coordinates will not be well approximated. However, it is likely that each one by itself will, and, in particular, this will be the case for . In other words, since we only care about , we no longer have to pay for a costly union bound over all coordinates. (We also show that Item 2 in Lemma 2 still holds under this approximation using an averaging argument).
2.4.2 Optimal Dependence on
Recall that in each step of the cutting-with-margin game, the player picks a point , and the adversary sets by cutting away an ball of radius around . The algorithm we have so far uses samples from : every round uses samples and drops by (recall that, to begin with, the entropy is at most and we want it to drop to ).
To reduce the sample complexity, we move away from this “static” type of algorithm and design a “dynamic” algorithm whose number of samples per round may vary (but will never exceed ). The important property of the new algorithm is that if the algorithm samples more points from , then the adversary cuts away a larger ball around . Specifically, if points are sampled then the radius of the removed ball is , and if points are sampled then the radius of the removed ball will be . We will show that this coupling of the number of samples used in a step with the amount of progress made in that step (instead of using the maximum number of samples in every step and expecting the minimum progress) enables a win-win analysis which implies the desired saving in the sample complexity.
Bounding the radius of the removed ball.
To explain how this idea is implemented, we need to dive into the details of the algorithm. Recall that the algorithm aims to find a point such that , and for which . Assume that the current point satisfies (which means ) and that we aim at reducing the distance to, say, . That is, we want to get to a point such that , or, equivalently, . Recall from Section 2.1 that towards this, we pick a violated test which, by applying Lemma 2, yields the new point . Of course, the lemma uses samples from to compute this . As we soon see, in some cases it will be worthwhile for our algorithm to only compute a crude approximation of this using fewer samples. Part of the difficulty is to decide on the quality of this approximation without knowing .
Nevertheless, imagine for a moment that the algorithm does know this
and uses it as its next point. How much “progress” does this imply in the cutting-with-margin game? That is, how much smaller is compared to ? Denote . We next show that is disjoint from an ball of radius
(3)
around (we wish for to be as large as possible). Intuitively, if is small, it means that we have made progress in many coordinates (though the progress in each might be relatively small). Since we are getting close to in many directions, this should imply that passes many of the tests that were violated by , and thus that is much smaller.
More formally, let , Equation 3 follows from:
Here, the first inequality is due to Hölder’s Inequality. The second inequality is because (due to Lemma 2, Item 2) and because (since it holds that , while since it holds that ).
Our “win-win” strategy.
The take-home message from the above discussion is that:
If is small then is small.
We next show that this relation leads us to a “win-win” situation: if is large, it suffices to only crudely approximate , and we save on samples. However, if is small, is small and we made a lot of progress towards ruling out all violated tests.
To see the relation between and the number of samples required to approximate , first assume that is uniform over a set of coordinates of size (i.e., for every , either or ). Now, if is small then all nonzero coordinates of are large, and thus can be reasonably approximated with few samples. (In fact, the number of samples scales with ).
Slicing.
Of course, may not be uniform on a set. To deal with such ’s, we partition into many “slices” such that each is almost uniform over a set (specifically, for , each of the coordinates of is either or in ). We then try to identify a slice with a significant contribution to (recall that due to Lemma 2, Item 2). However, since is not known to the algorithm, we use samples to learn it “slice-by-slice”, starting by approximating , the slice containing the largest values and requiring the fewest samples to estimate, and continuing to the slices that require more samples, until reaching a “good” slice. We mention that this slice-searching process is equivalent to playing the dual game with different values.
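The slicing step can be pictured with a small sketch. The exact thresholds used in the paper are not recoverable from the text above, so the dyadic ranges below are an assumption made purely for illustration, as are the function name and the example vector. Each slice keeps the coordinates of a nonnegative vector whose values lie within a factor of two of each other and is zero elsewhere, so that the slices sum back to the original vector.

```python
import math
import numpy as np

def dyadic_slices(h):
    """Split a vector with entries in [0, 1] into slices that sum to h.

    Slice j keeps the coordinates of h lying in (2**-j, 2**-(j-1)] and is zero
    elsewhere, so the nonzero entries of one slice differ by a factor below 2.
    (Assumed dyadic thresholds; not taken from the paper.)
    """
    h = np.asarray(h, dtype=float)
    slices = {}
    for i, value in enumerate(h):
        if value <= 0:
            continue
        j = math.floor(-math.log2(value)) + 1
        slices.setdefault(j, np.zeros_like(h))[i] = value
    return slices

h = np.array([0.4, 0.3, 0.15, 0.1, 0.05])  # hypothetical test vector
parts = dyadic_slices(h)
order = sorted(parts)  # slices with the largest values come first
```

Processing the slices in increasing order of j mirrors the idea of starting from the slice with the largest values, which is the cheapest to estimate.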
3 Preliminaries
3.1 Notation
Let . For , we write if . We use to denote the standard inner product of and .
For , we denote by the norm. For and , let denote a ball of radius with respect to that is centered at ,
Let denote the simplex of probability vectors in ,
3.2 Definition of the Hypothesis Selection Problem
Let be a domain and let
denote the set of all probability distributions over
. We assume that either (i) is finite in which case is identified with the set of dimensional probability vectors, or (ii) in which case is the set of Borel probability measures.
Let be a set of distributions. We focus on the case where is finite and denote its size by . Let , we say that is learnable with sample complexity if there is a (possibly randomized) algorithm such that for every and every target distribution , if receives as input at least independent samples from then it outputs a distribution such that
with probability at least , where and is the total variation distance. We say that is properly learnable if it is learnable by a proper algorithm; namely an algorithm that always outputs .
Distance vectors and sets.
Let , and let be a distribution. The distance vector of relative to the ’s is the vector .
Following [BKM19], our algorithm is based on the next claim which shows that in order to find such that it suffices to find such that .
Lemma 3.
Let such that . Then .
Proof.
Follows directly by the triangle inequality; indeed, let be a minimizer of in . Then, . ∎
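The triangle-inequality argument can be checked on a toy example. The sketch below is a hypothetical illustration on a three-point domain; it assumes the condition of the lemma is that the distance vector of the output is dominated, coordinate by coordinate, by the distance vector of the target plus a slack eps, in which case the triangle inequality gives a distance to the target of at most twice the optimum plus eps. All names and numbers are made up for this example.

```python
import numpy as np

def tv(p, q):
    """Total variation distance between two probability vectors."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def distance_vector(dist, hypotheses):
    """Vector of total variation distances from dist to each hypothesis."""
    return np.array([tv(dist, q) for q in hypotheses])

hypotheses = [
    np.array([0.7, 0.2, 0.1]),
    np.array([0.2, 0.5, 0.3]),
    np.array([0.1, 0.1, 0.8]),
]
p = np.array([0.5, 0.3, 0.2])          # hypothetical target
v_p = distance_vector(p, hypotheses)
opt = float(v_p.min())                 # distance from p to the best hypothesis
eps = 0.05

# Every candidate whose distance vector is dominated by v_p + eps must be
# within 2 * opt + eps of the target; count any violations on random candidates.
rng = np.random.default_rng(2)
checked, violations = 0, 0
for cand in rng.dirichlet(np.ones(3), size=2000):
    if np.all(distance_vector(cand, hypotheses) <= v_p + eps):
        checked += 1
        violations += int(tv(cand, p) > 2 * opt + eps + 1e-12)
```

The count of violations stays at zero regardless of how the candidates are drawn, since the bound follows from the triangle inequality alone.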
Next, we explore which are of the form for some . For this we make the following definition. A vector is called a dominating distance vector if for some distribution . Define to be the set of all dominating distance vectors.
Claim 4.
is convex and upward-closed (recall that upward-closed means that whenever and then also ).
Proof.
That is upward-closed is trivial. Convexity follows since is convex in both of its arguments. ∎
3.3 Pythagorean Theorem for
We will use the following Pythagorean theorem for the divergence; the version here is taken from [PW15].
Lemma 5.
Let be a set, let be a convex set of distributions, and let be a distribution. Let . Then, for all it holds that
Proof.
If , then we are done. So, we can assume , which also implies that . For , form the convex combination . Since is the minimizer of , then
∎
If we view the picture above in the Euclidean setting, the “triangle” formed by , and (for in a convex set, is outside the set) is always obtuse, and is a right triangle only when the convex set has a “flat face”. In this sense, the divergence is similar to the squared Euclidean distance, and the above theorem is sometimes called the Pythagorean theorem.
An assumption.
Our analysis uses the Minimax Theorem for zero-sum games [vN28] for the same purpose that it was used in [BKM19]. Therefore, we will assume a setting (i.e., the domain and the class of distributions ) in which this theorem is valid. Alternatively, one could state explicit assumptions such as finiteness of or forms of compactness under which it is known that the Minimax Theorem holds. However, we believe that the presentation benefits from avoiding such explicit technical assumptions and simply assuming the Minimax Theorem as an “axiom” in the discussed setting.
4 A Geometric Game from Hypothesis Selection
We next describe a geometric game, called the ()-primal game. This game is between a player and an adversary, where is a given upward-closed and nonempty convex body, and is a margin parameter. Both and are known to both the player and the adversary. The game proceeds in rounds roughly as follows: the player starts at position and its goal is to get sufficiently close to as fast as possible. Let denote the position of the player in round ; if then the player wins the game. Else, the player picks a tangent hyperplane to which separates from (such a hyperplane must exist since ), announces it to the adversary, and the adversary picks the player’s next position to be any point such that and is close to the tangent hyperplane chosen by the player. The ()-primal game is formally described in Fig. 2. It uses the following notation:
In words, is the set of normals to hyperplanes separating from . Note that the assumption does not lose generality, because is upward-closed and therefore for any , , any hyperplane separating and has a normal of this form. (See Claim 5 in [BKM19] for a proof of this fact.) Thus, by the hyperplane separation theorem, if and only if . Also observe that since is convex, the set is convex for every .
The Primal Game
Let be a nonempty convex set which is upward closed.

Set and .

While (equivalently )

The player picks a normal to a hyperplane tangent to which separates from , and announces it to the adversary.

The adversary replies with a point whose every coordinate is at least as great as that of and is close to the hyperplane tangent to whose normal is , i.e.,
(4) 
Set .

Winning Strategies.
Let be a strategy^{16}^{16}16That is, in every round , the strategy provides a rule for picking