## 1 Introduction

Probably approximately correct learning (or *PAC* learning; Valiant, 1984) is a classic criterion for supervised learning, which has been the focus of much research in the past three decades. The objective in PAC learning is to produce a classifier that, with probability at least $1-\delta$, has error rate at most $\varepsilon$. To qualify as a PAC learning algorithm, it must satisfy this guarantee for all possible target concepts in a given family, under all possible data distributions. To achieve this objective, the learning algorithm is supplied with a number $m$ of i.i.d. training samples (data points), along with the corresponding correct classifications. One of the central questions in the study of PAC learning is determining the minimum number $\mathcal{M}(\varepsilon,\delta)$ of training samples necessary and sufficient such that there exists a PAC learning algorithm requiring at most $\mathcal{M}(\varepsilon,\delta)$ samples (for any given $\varepsilon$ and $\delta$). This quantity is known as the *sample complexity*.

Determining the sample complexity of PAC learning is a long-standing open problem. Upper and lower bounds have been established for decades, but they differ by a logarithmic factor. It has been widely believed that this logarithmic factor can be removed for certain well-designed learning algorithms, and attempting to prove this has been the subject of much effort. Simon (2015) very recently made an enormous leap forward toward resolving this issue. That work proposed an algorithm that classifies points based on a majority vote among classifiers trained on independent data sets. Simon proved that this algorithm achieves a sample complexity that reduces the logarithmic factor in the upper bound down to a very slowly-growing function. However, that work does not completely resolve the gap, so determining the optimal sample complexity remains open.

The present work resolves this problem by completely eliminating the logarithmic factor. The algorithm achieving this new bound is also based on a majority vote of classifiers. However, unlike Simon’s algorithm, here the voting classifiers are trained on data subsets specified by a recursive algorithm, with substantial overlaps among the data subsets the classifiers are trained on.

## 2 Notation

We begin by introducing some basic notation
essential to the discussion.
Fix a nonempty set $\mathcal{X}$, called the *instance space*;
we suppose $\mathcal{X}$ is equipped with a $\sigma$-algebra, defining
the measurable subsets of $\mathcal{X}$.
Also denote $\mathcal{Y} = \{-1,+1\}$, called the *label space*.
A *classifier* is any measurable function $h : \mathcal{X} \to \mathcal{Y}$.
Fix a nonempty set $\mathbb{C}$ of classifiers, called the *concept space*.
To focus the discussion on nontrivial cases,^{1}
we suppose $|\mathbb{C}| \geq 3$;
other than this, the results in this article will be valid for *any* choice of $\mathbb{C}$.

^{1}The sample complexities for $|\mathbb{C}| = 1$ and $|\mathbb{C}| = 2$ are already quite well understood
in the literature, the former having sample complexity $0$, and the latter having
sample complexity either $1$ or $\Theta\left(\frac{1}{\varepsilon}\mathrm{Log}\left(\frac{1}{\delta}\right)\right)$
(depending on whether the two classifiers are exact complements or not).

In the learning problem, there is a probability measure $\mathcal{P}$ over $\mathcal{X}$,
called the *data distribution*, and a sequence $X_1, X_2, \ldots$ of
independent $\mathcal{P}$-distributed random variables, called the
*unlabeled data*; for $m \in \mathbb{N}$, also define $X_{1:m} = \{X_1,\ldots,X_m\}$, and for completeness denote $X_{1:0} = \{\}$. There is also a special element of $\mathbb{C}$, denoted $f^\star$, called the
*target function*. For any sequence $\mathbf{x} = \{x_1,\ldots,x_k\}$ in $\mathcal{X}$, denote by $(\mathbf{x}, f^\star(\mathbf{x})) = \{(x_1, f^\star(x_1)),\ldots,(x_k, f^\star(x_k))\}$. For any probability measure $P$ over $\mathcal{X}$, and any classifier $h$, denote by $\operatorname{er}_P(h; f^\star) = P(x : h(x) \neq f^\star(x))$. A
*learning algorithm* $A$ is a map,^{2} mapping any sequence $S$ of points in $\mathcal{X} \times \mathcal{Y}$ (called a
*data set*), of any length $m \in \mathbb{N} \cup \{0\}$, to a classifier $h : \mathcal{X} \to \mathcal{Y}$ (not necessarily in $\mathbb{C}$).

^{2}We also admit randomized algorithms $A$, where the "internal randomness" of $A$ is assumed to be independent of the data. Formally, there is a random variable $R$ independent of $\{X_i\}_{i \in \mathbb{N}}$ such that the value $A(S)$ is determined by the input data $S$ and the value of $R$.

For any $\varepsilon, \delta \in (0,1)$, the
*sample complexity of $(\varepsilon,\delta)$-PAC learning*,
denoted $\mathcal{M}(\varepsilon,\delta)$,
is defined as the smallest $m \in \mathbb{N} \cup \{0\}$ for which
there exists a learning algorithm $A$ such that,
for every possible data distribution $\mathcal{P}$, $\forall f^\star \in \mathbb{C}$,
denoting $\hat{h} = A((X_{1:m}, f^\star(X_{1:m})))$,

$$\mathbb{P}\left( \operatorname{er}_{\mathcal{P}}\left(\hat{h}; f^\star\right) > \varepsilon \right) \leq \delta.$$

If no such $m$ exists, define $\mathcal{M}(\varepsilon,\delta) = \infty$.
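As a concrete illustration of this definition (not from the paper: the concept space, data distribution, target, and learner below are our own toy choices), one can estimate the failure probability $\mathbb{P}(\operatorname{er}_{\mathcal{P}}(\hat{h}; f^\star) > \varepsilon)$ of a sample-consistent learner for threshold classifiers by simulation:

```python
import random

def erm_threshold(sample):
    """Return a threshold consistent with a sample labeled by a rule x >= t
    (midpoint of the gap between the largest -1 point and smallest +1 point)."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == -1]
    return (max(neg, default=0.0) + min(pos, default=1.0)) / 2

def failure_rate(m, eps, trials=2000, target=0.3):
    """Monte Carlo estimate of P(er(h_hat) > eps) for Uniform[0,1] data: the
    error rate of threshold t_hat against target t* is exactly |t_hat - t*|."""
    fails = 0
    for _ in range(trials):
        sample = [(x, 1 if x >= target else -1)
                  for x in (random.random() for _ in range(m))]
        if abs(erm_threshold(sample) - target) > eps:
            fails += 1
    return fails / trials

random.seed(1)
# the failure probability (delta) shrinks as the sample size m grows, for fixed eps
print(failure_rate(m=10, eps=0.1), failure_rate(m=100, eps=0.1))
```

The smallest $m$ making this failure probability at most $\delta$ for every distribution and target, not merely this one, is what the sample complexity measures.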

The sample complexity is our primary object of study in this work.
We require a few additional definitions before proceeding.
Throughout, we use a natural extension of set notation to sequences:
for any finite sequences $S = \{s_1,\ldots,s_m\}$ and $T = \{t_1,\ldots,t_\ell\}$,
we denote by $S \sqcup T = \{s_1,\ldots,s_m,t_1,\ldots,t_\ell\}$
the concatenated sequence.
For any set $R$, we denote by $S \cap R$
the subsequence comprised of all $s_i$ for which $s_i \in R$.
Additionally, we write $s \in S$ to indicate $\exists i$ s.t. $s_i = s$,
and we write $S \subseteq R$
or $R \supseteq S$
to express that $s \in R$ for every $s \in S$.
We also denote $|S| = m$ (the length of the sequence).
For any $k \in \mathbb{N} \cup \{0\}$ and any sequence $S = \{(x_1,y_1),\ldots,(x_k,y_k)\}$ of points in $\mathcal{X} \times \mathcal{Y}$,
denote $\mathbb{C}[S] = \{h \in \mathbb{C} : \forall (x,y) \in S, h(x) = y\}$,
referred to as the set of classifiers *consistent* with $S$.
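For a small finite concept space, the consistent set $\mathbb{C}[S]$ can be enumerated directly. A minimal sketch (the threshold concept space and helper names here are our own illustration, not anything fixed by the paper):

```python
def make_threshold(t):
    """Classifier x -> +1 iff x >= t (a toy concept space over the integers)."""
    return lambda x: 1 if x >= t else -1

concepts = {t: make_threshold(t) for t in range(11)}  # eleven thresholds 0..10

def consistent(concepts, S):
    """Keys of all classifiers h in the space with h(x) == y for every (x, y) in S."""
    return [t for t, h in concepts.items() if all(h(x) == y for x, y in S)]

S = [(2, -1), (7, 1)]           # labels produced by any threshold t in {3, ..., 7}
print(consistent(concepts, S))  # → [3, 4, 5, 6, 7]
```

Note that the empty data set leaves the whole concept space consistent, matching the convention for $k = 0$.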

Following Vapnik and Chervonenkis (1971), we say a sequence $\{x_1,\ldots,x_k\}$ of points in $\mathcal{X}$
is *shattered* by $\mathbb{C}$ if $\forall y_1,\ldots,y_k \in \mathcal{Y}$, $\exists h \in \mathbb{C}$
such that $\forall i \in \{1,\ldots,k\}$, $h(x_i) = y_i$: that is, there are
$2^k$ distinct classifications of $\{x_1,\ldots,x_k\}$ realized by classifiers in $\mathbb{C}$.
The Vapnik-Chervonenkis dimension (or *VC dimension*)
of $\mathbb{C}$ is then defined as the largest integer $k$ for which
there exists a sequence $\{x_1,\ldots,x_k\}$ in $\mathcal{X}$ shattered by $\mathbb{C}$;
if no such largest $k$ exists, the VC dimension is said to be infinite.
We denote by $d$ the VC dimension of $\mathbb{C}$. This quantity is
of fundamental importance in characterizing the sample complexity of
PAC learning. In particular, it is well known that the sample complexity
is finite for any $\varepsilon, \delta \in (0,1)$ if and only if $d < \infty$
(Vapnik, 1982; Blumer, Ehrenfeucht, Haussler, and
Warmuth, 1989; Ehrenfeucht, Haussler, Kearns, and
Valiant, 1989).
For simplicity of notation, for the remainder of this article
we suppose $d < \infty$; furthermore, note that our assumption of
$|\mathbb{C}| \geq 3$ implies $d \geq 1$.
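For small finite classes over finite domains, the definition of shattering can be checked exhaustively; the following brute-force sketch (function names are our own) computes the VC dimension of the toy threshold class:

```python
from itertools import combinations

def is_shattered(points, concepts):
    """True if all 2^k labelings of `points` in {-1,+1} are realized by `concepts`."""
    realized = {tuple(h(x) for x in points) for h in concepts}
    return len(realized) == 2 ** len(points)

def vc_dimension(domain, concepts):
    """Largest k such that some k-subset of `domain` is shattered (brute force)."""
    d = 0
    for k in range(1, len(domain) + 1):
        if any(is_shattered(pts, concepts) for pts in combinations(domain, k)):
            d = k
        else:
            break
    return d

# Threshold classifiers realize only labelings (-1,...,-1,+1,...,+1), so d = 1:
# one point can be labeled both ways, but no pair admits the labeling (+1, -1).
thresholds = [lambda x, t=t: 1 if x >= t else -1 for t in range(11)]
print(vc_dimension(range(10), thresholds))  # → 1
```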

We adopt a common variation on big-O asymptotic notation, used in much of the
learning theory literature. Specifically, for functions $f, g : (0,1)^2 \to [0,\infty)$,
we let $f(\varepsilon,\delta) = O(g(\varepsilon,\delta))$ denote the assertion that
$\exists \varepsilon_0, \delta_0 \in (0,1)$ and $c \in (0,\infty)$ such that,
$\forall \varepsilon \in (0,\varepsilon_0)$, $\forall \delta \in (0,\delta_0)$,
$f(\varepsilon,\delta) \leq c\, g(\varepsilon,\delta)$; however, we also require that
the values $\varepsilon_0, \delta_0, c$ in this definition be
*numerical constants*, meaning that they are
*independent of $\mathbb{C}$ and $\mathcal{P}$*.
For instance, this means $c$ cannot depend on $d$.
We equivalently write $f(\varepsilon,\delta) = \Omega(g(\varepsilon,\delta))$ to assert
that $g(\varepsilon,\delta) = O(f(\varepsilon,\delta))$. Finally, we write $f(\varepsilon,\delta) = \Theta(g(\varepsilon,\delta))$
to assert
that both $f(\varepsilon,\delta) = O(g(\varepsilon,\delta))$ and $f(\varepsilon,\delta) = \Omega(g(\varepsilon,\delta))$ hold.
We also sometimes write $O(g(\varepsilon,\delta))$ in an expression,
as a place-holder for some function $f(\varepsilon,\delta)$ satisfying $f(\varepsilon,\delta) = O(g(\varepsilon,\delta))$:
for instance, a statement of the form $\mathcal{M}(\varepsilon,\delta) \leq O(g(\varepsilon,\delta))$ expresses that
$\exists f(\varepsilon,\delta) = O(g(\varepsilon,\delta))$ for which $\mathcal{M}(\varepsilon,\delta) \leq f(\varepsilon,\delta)$.
Also, for any value $z \geq 0$, define $\mathrm{Log}(z) = \ln(\max\{z, e\})$ and similarly $\mathrm{Log}_2(z) = \log_2(\max\{z, 2\})$.

As is commonly required in the learning theory literature, we adopt the assumption that the events appearing in probability claims below are indeed measurable. For our purposes, this comes into effect only in the application of classic generalization bounds for sample-consistent classifiers (Lemma 4.3 below). See Blumer, Ehrenfeucht, Haussler, and Warmuth (1989) and van der Vaart and Wellner (1996) for discussion of conditions on $\mathbb{C}$ sufficient for this measurability assumption to hold.

## 3 Background

Our objective in this work is to establish *sharp* sample complexity bounds.
As such, we should first review the known *lower bounds* on $\mathcal{M}(\varepsilon,\delta)$.
A basic lower bound of $\Omega\left(\frac{1}{\varepsilon}\mathrm{Log}\left(\frac{1}{\delta}\right)\right)$
was established by Blumer, Ehrenfeucht, Haussler, and
Warmuth (1989) for $\varepsilon < 1/2$ and $\delta < 1$.
A second lower bound of $\Omega\left(\frac{d}{\varepsilon}\right)$ was supplied by Ehrenfeucht, Haussler, Kearns, and
Valiant (1989),
for $\varepsilon \leq 1/8$ and $\delta \leq 1/100$.
Taken together, these results imply that, for any $\varepsilon \in (0, 1/8]$ and $\delta \in (0, 1/100]$,

$$\mathcal{M}(\varepsilon,\delta) = \Omega\left( \frac{1}{\varepsilon}\left( d + \mathrm{Log}\left(\frac{1}{\delta}\right) \right) \right). \tag{1}$$

This lower bound is complemented by classic *upper bounds*
on the sample complexity. In particular, Vapnik (1982) and Blumer, Ehrenfeucht, Haussler, and
Warmuth (1989)
established an upper bound of

$$\mathcal{M}(\varepsilon,\delta) = O\left( \frac{1}{\varepsilon}\left( d\, \mathrm{Log}\left(\frac{1}{\varepsilon}\right) + \mathrm{Log}\left(\frac{1}{\delta}\right) \right) \right). \tag{2}$$

They proved that this sample complexity bound is in fact achieved by any algorithm that returns a classifier $\hat{h} \in \mathbb{C}$ consistent with the training data,
also known as a *sample-consistent learning algorithm* (or *empirical risk minimization* algorithm).
A sometimes-better upper bound was established by Haussler, Littlestone, and Warmuth (1994):

$$\mathcal{M}(\varepsilon,\delta) = O\left( \frac{d}{\varepsilon}\, \mathrm{Log}\left(\frac{1}{\delta}\right) \right). \tag{3}$$

This bound is achieved by a modified variant of the *one-inclusion graph prediction algorithm*,
a learning algorithm also proposed by Haussler, Littlestone, and Warmuth (1994), which has been conjectured
to achieve the optimal sample complexity (Warmuth, 2004).

In very recent work, Simon (2015) produced a breakthrough insight. Specifically, by analyzing a learning algorithm based on a simple majority vote among classifiers consistent with distinct subsets of the training data, Simon (2015) established that, for any $K \in \mathbb{N}$,

$$\mathcal{M}(\varepsilon,\delta) = O\left( \frac{\sqrt{K}\, 2^{2K}}{\varepsilon}\left( d\, \mathrm{Log}^{(K)}\left(\frac{1}{\varepsilon}\right) + \mathrm{Log}\left(\frac{1}{\delta}\right) \right) \right), \tag{4}$$

where $\mathrm{Log}^{(K)}(x)$ is the $K$-times iterated logarithm: $\mathrm{Log}^{(0)}(x) = \max\{x, 1\}$ and $\mathrm{Log}^{(K)}(x) = \mathrm{Log}\left(\mathrm{Log}^{(K-1)}(x)\right)$.
In particular, one natural choice would be $K \approx \log^*\left(\frac{1}{\varepsilon}\right)$,^{3}
which (one can show) optimizes the asymptotic dependence on $\varepsilon$ in the bound,
yielding

$$\mathcal{M}(\varepsilon,\delta) = O\left( \frac{2^{2\log^*(1/\varepsilon)}\sqrt{\log^*(1/\varepsilon)}}{\varepsilon}\left( d + \mathrm{Log}\left(\frac{1}{\delta}\right) \right) \right).$$

^{3}The function $\log^*(x)$ is the iterated logarithm: the smallest $K \in \mathbb{N} \cup \{0\}$ for which $\mathrm{Log}^{(K)}(x) \leq 1$. It is an extremely slowly growing function of $x$.

In general, the entire form of the bound (4) is optimized (up to numerical constant factors) by choosing $K$ as an appropriate function of $\varepsilon$, $\delta$, and $d$. With either of these choices of $K$, there is a range of $\varepsilon$, $\delta$, and $d$ values for which the bound (4) is strictly smaller than both (2) and (3). However, this bound still does not quite match the form of the lower bound (1).
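To make these growth rates concrete, the iterated logarithm and $\log^*$ can be computed directly; a small sketch (helper names are our own), using the convention $\mathrm{Log}(z) = \ln(\max\{z, e\})$ from Section 2:

```python
import math

def iter_log(x, k):
    """K-times iterated logarithm: Log^(0)(x) = max(x, 1) and
    Log^(k)(x) = Log(Log^(k-1)(x)), where Log(z) = ln(max(z, e))."""
    v = max(x, 1.0)
    for _ in range(k):
        v = math.log(max(v, math.e))
    return v

def log_star(x):
    """Smallest k with Log^(k)(x) <= 1; an extremely slowly growing function."""
    k, v = 0, max(x, 1.0)
    while v > 1.0:
        v = math.log(max(v, math.e))
        k += 1
    return k

print(log_star(2.0 ** 1000))  # → 4
```

Even for astronomically large arguments, $\log^*$ stays in the single digits, which is why replacing the $\mathrm{Log}(1/\varepsilon)$ factor of (2) with an iterated logarithm is such a substantial improvement.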

There have also been many special-case analyses, studying restricted types of concept spaces for which the above gaps can be closed (e.g., Auer and Ortner, 2007; Darnstädt, 2015; Hanneke, 2015). However, these special conditions do not include many of the most commonly studied concept spaces, such as linear separators and multilayer neural networks. There have also been a variety of studies that, in addition to restricting to specific concept spaces, also introduce strong restrictions on the data distribution $\mathcal{P}$, and establish an upper bound of the same form as the lower bound (1) under these restrictions (e.g., Long, 2003; Giné and Koltchinskii, 2006; Bshouty, Li, and Long, 2009; Hanneke, 2009, 2015; Balcan and Long, 2013). However, there are many interesting classes and distributions for which these results do not imply any improvements over (2). Thus, in the present literature, there persists a gap between the lower bound (1) and the minimum of all of the known upper bounds (2), (3), and (4) applicable to the *general* case of an arbitrary concept space of a given VC dimension (under arbitrary data distributions).

In the present work, we establish a new upper bound for a novel learning algorithm,
which holds for *any* concept space , and which improves over all of the above general upper bounds
in its joint dependence on $\varepsilon$, $\delta$, and $d$. In particular, it is *optimal*, in the sense that it
matches the lower bound (1) up to numerical constant factors. This work thus
solves a long-standing open problem, by determining the precise form of the optimal sample complexity,
up to numerical constant factors.

## 4 Main Result

This section presents the main contributions of this work: a novel learning algorithm, and a proof that it achieves the optimal sample complexity.

### 4.1 Sketch of the Approach

The general approach used here builds on an argument of Simon (2015),
which itself has roots in the analysis of sample-consistent learning
algorithms by Hanneke (2009, Section 2.9.1). The essential idea
from Simon (2015) is that,
if we have two classifiers, $\hat{h}$ and $g$,
the latter of which is an element of $\mathbb{C}$ consistent with an i.i.d.
data set $S$ *independent* from $\hat{h}$,
then we can analyze the probability that they *both* make a mistake
on a random point by bounding the error rate of $\hat{h}$
under the distribution $\mathcal{P}$, and bounding the error rate of $g$
under the *conditional* distribution given
that $\hat{h}$ makes a mistake. In particular, it will either
be the case that $\hat{h}$ itself has small error rate, or else
(if $\hat{h}$ has error rate larger than our desired bound)
with high probability, the number of points in $S$
contained in the error region of $\hat{h}$ will be at least some number proportional to $|S| \operatorname{er}_{\mathcal{P}}(\hat{h}; f^\star)$;
in the latter case, we can bound the conditional
error rate of $g$ in terms of the number of such points via a classic generalization
bound for sample-consistent classifiers (Lemma 4.3 below).
Multiplying this bound on the conditional error rate of $g$
by the error rate of $\hat{h}$ results in a bound on the probability
they both make a mistake. More specifically, this argument yields a bound of the following form:
for an appropriate numerical constant $c_1$,
with probability at least $1 - \delta$, every $g \in \mathbb{C}[S]$ satisfies

$$\mathcal{P}\left( x : g(x) \neq f^\star(x), \hat{h}(x) \neq f^\star(x) \right) \leq \frac{c_1}{|S|}\left( d\, \mathrm{Log}\left(\frac{\operatorname{er}_{\mathcal{P}}(\hat{h}; f^\star)\, |S|}{d}\right) + \mathrm{Log}\left(\frac{1}{\delta}\right) \right).$$

The original analysis of Simon (2015) applied this reasoning repeatedly, in an inductive argument, thereby bounding the probability that several classifiers, each consistent with one of several independent training sets, all make a mistake on a random point. He then reasoned that the error rate of the majority vote of $K$ such classifiers can be bounded by the sum of these bounds over all subsets of $\lceil K/2 \rceil$ of these classifiers, since the majority vote classifier agrees with at least $\lceil K/2 \rceil$ of the constituent classifiers.
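For intuition, the following sketch illustrates this style of algorithm (it is in the spirit of Simon's majority vote over independently trained consistent classifiers, not its exact form; the threshold concept space and all parameters are our own choices):

```python
import random

def consistent_threshold(sample):
    """A classifier in C consistent with the sample: any threshold separating
    the -1 points (below) from the +1 points (above)."""
    neg = [x for x, y in sample if y == -1]
    pos = [x for x, y in sample if y == 1]
    return (max(neg, default=0.0) + min(pos, default=1.0)) / 2

def majority_vote(thresholds, x):
    """Majority over the individual threshold classifiers' +/-1 votes."""
    return 1 if sum(1 if x >= t else -1 for t in thresholds) >= 0 else -1

random.seed(0)
target = 0.5
data = [(x, 1 if x >= target else -1) for x in (random.random() for _ in range(300))]
splits = [data[i::3] for i in range(3)]             # three disjoint training sets
thresholds = [consistent_threshold(s) for s in splits]

test_points = [random.random() for _ in range(10000)]
err = sum(majority_vote(thresholds, x) != (1 if x >= target else -1)
          for x in test_points) / len(test_points)
print(f"majority-vote error ≈ {err:.4f}")
```

The key structural point, carried over to the present work, is that the voting classifiers are trained on separate pieces of the data, so that each is independent of the samples used to certify the others.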

In the present work, we also consider a simple majority vote of
a number of classifiers, but we alter the way the data is split up,
allowing significant overlaps among the subsamples. In particular,
each classifier is trained on considerably more data this way.
We construct these subsamples recursively, motivated by an inductive analysis of the sample complexity.
At each stage, we have a working set $S$ of i.i.d. data points, and another sequence $T$ of data points,
referred to as the *partially-constructed subsample*.
As a terminal case, if $|S|$ is smaller than a certain cutoff size, we generate a subsample $S \sqcup T$,
on which we will train a classifier $L(S \sqcup T)$.
Otherwise (for the nonterminal case), we use (roughly) a constant fraction of the points in $S$ to form a subsequence $S_0$,
and make three recursive calls to the algorithm, using $S_0$ as the working set in each call.
By an inductive hypothesis, for each of these three recursive calls, with probability $1 - \delta$,
the majority vote of the classifiers trained on subsamples generated by that call
has error rate at most $\frac{c}{|S_0|}\left( d + \mathrm{Log}\left(\frac{1}{\delta}\right) \right)$, for an appropriate numerical constant $c$.
These three majority vote classifiers, denoted $h_1$, $h_2$, $h_3$, will each play the role of $\hat{h}$
in the argument above.

With the remaining constant fraction of data points in $S$ (i.e., those not used to form $S_0$),
we divide them into three independent subsequences $S_1, S_2, S_3$.
Then for each of the three recursive calls $i \in \{1,2,3\}$, we provide as its partially-constructed
subsample (i.e., the "$T$" argument) a sequence $T_i$ disjoint from $S_i$;
specifically, for the $i$-th recursive call, we take $T_i = T \sqcup S_{i+1} \sqcup S_{i+2}$, with the subscripts taken modulo $3$ (in $\{1,2,3\}$).
Since the $T_i$ argument is retained within the partially-constructed subsample passed to each recursive call,
a simple inductive argument reveals that, for each $i \in \{1,2,3\}$ and each $i' \neq i$,
all of the classifiers trained on subsamples generated in the $i'$-th recursive
call are contained in $\mathbb{C}[S_i]$. Furthermore, since $S_i$ is not included in the argument to the $i$-th
recursive call, $h_i$ and $S_i$ are independent. Thus, by the argument discussed above, applied with
$\hat{h} = h_i$ and with $S_i$ in the role of $S$, we have that with probability at least $1 - \delta$,
for any $g$ trained on a subsample generated in recursive calls $i' \neq i$,
the probability that *both* $g$ *and* $h_i$ make a mistake on a random point is at most
$\frac{c_1}{|S_i|}\left( d\, \mathrm{Log}\left(\frac{\operatorname{er}_{\mathcal{P}}(h_i; f^\star)\, |S_i|}{d}\right) + \mathrm{Log}\left(\frac{1}{\delta}\right) \right)$.
Composing this with the aforementioned inductive hypothesis, recalling that $|S_i|$ and $|S_0|$ are both proportional to $|S|$,
and simplifying by a bit of calculus,
this is at most
$\frac{c_2}{|S|}\left( d + \mathrm{Log}\left(\frac{1}{\delta}\right) \right)$, for an appropriate numerical constant $c_2$.
By choosing the confidence arguments in these applications appropriately, the union bound implies that, with probability at least $1 - \delta$,
this holds for all choices of $i \in \{1,2,3\}$. Furthermore, by choosing the numerical constant sufficiently large, this bound is at most
$\frac{c}{|S|}\left( d + \mathrm{Log}\left(\frac{1}{\delta}\right) \right)$.

To complete the inductive argument, we then note that on any point $x$, the majority vote of all
of the classifiers (from all three recursive calls) must agree with at least one of the three classifiers
$h_1, h_2, h_3$, and, for each $i \in \{1,2,3\}$, must agree with at least a $1/4$ fraction of the classifiers trained on subsamples
generated in recursive calls $i' \neq i$.
Therefore, on any point $x$ for which
the majority vote makes a mistake,
with probability at least $\frac{1}{12}$, a uniform random choice
of $i \in \{1,2,3\}$, and of a classifier $g$ from recursive calls $i' \neq i$,
results in $h_i$ and $g$ that both make a mistake on $x$.
Applying this fact to a *random* point $X \sim \mathcal{P}$ (and invoking Fubini's theorem),
this implies that the error rate of the majority
vote is at most $12$ times the average (over choices of $i$ and $g$)
of the probabilities that $h_i$ and $g$ both make a mistake on $X$.
Combined with the above bound, this is at most $\frac{12\, c}{|S|}\left( d + \mathrm{Log}\left(\frac{1}{\delta}\right) \right)$.
The formal details are provided below.

### 4.2 Formal Details

For any $k \in \mathbb{N} \cup \{0\}$, and any $S \in (\mathcal{X} \times \mathcal{Y})^k$ with $\mathbb{C}[S] \neq \emptyset$,
let $L(S)$ denote an arbitrary classifier in $\mathbb{C}[S]$, entirely determined by $S$:
that is, $L$ is a fixed sample-consistent learning algorithm (i.e., empirical risk minimizer).
For any $N \in \mathbb{N}$ and sequence of data sets $\mathbf{S} = \{S_1, \ldots, S_N\}$,
denote $L(\mathbf{S}) = \{L(S_1), \ldots, L(S_N)\}$.
Also, for any values $y_1, \ldots, y_N \in \mathcal{Y}$,
define the majority function: $\mathrm{Maj}(y_1,\ldots,y_N) = \mathrm{sign}\left(\sum_{i=1}^{N} y_i\right)$ (with the convention $\mathrm{sign}(0) = +1$).
We also overload this notation, defining the *majority classifier* $\mathrm{Maj}(h_1,\ldots,h_N)(x) = \mathrm{Maj}(h_1(x),\ldots,h_N(x))$,
for any classifiers $h_1, \ldots, h_N$.
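The majority function and its overloaded classifier form translate directly into code (helper names are our own):

```python
def maj(values):
    """Majority function: sign of the sum of +/-1 values (ties resolved to +1)."""
    return 1 if sum(values) >= 0 else -1

def maj_classifier(classifiers):
    """Pointwise majority vote of a list of classifiers (the overloaded form)."""
    return lambda x: maj([h(x) for h in classifiers])

h = maj_classifier([lambda x: 1, lambda x: -1, lambda x: 1 if x > 0 else -1])
print(h(5), h(-5))  # → 1 -1
```

In the algorithm below the number of voters is a power of $3$, hence odd, so the tie-breaking convention never actually comes into play.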

Now consider the following recursive algorithm, which takes as input two
finite data sets, $S$ and $T$, satisfying $\mathbb{C}[S \sqcup T] \neq \emptyset$, and returns a *finite sequence of data sets*
(referred to as *subsamples* of $S \sqcup T$).

Algorithm: $A(S; T)$
1. If $|S| \leq 3$, Return $\{S \sqcup T\}$
2. Else, let $S_0$ denote the first $|S| - 3\lfloor |S|/4 \rfloor$ elements of $S$, and let $S_1, S_2, S_3$ denote the next three subsequences of $S$, of $\lfloor |S|/4 \rfloor$ elements each
3. Return $A(S_0; T \sqcup S_2 \sqcup S_3) \sqcup A(S_0; T \sqcup S_3 \sqcup S_1) \sqcup A(S_0; T \sqcup S_1 \sqcup S_2)$

The classifier used to achieve the new sample complexity bound below is simply the majority vote of the classifiers
obtained by applying $L$ to these subsamples.

In particular, a sample complexity of the form expressed on the right hand side is achieved by the algorithm that, given any data set $S$, returns the classifier $\mathrm{Maj}(L(A(S; \{\})))$.

Combined with (1), this immediately implies the following corollary.

The algorithm is expressed above as a recursive method for constructing a sequence of subsamples,
as this form is most suitable for the arguments in the proof below. However,
it should be noted that one can equivalently describe these constructed subsamples *directly*,
as the selection of which data points should be included in which subsamples
can be expressed as a simple function of the indices. To illustrate this, consider
the simplest case in which $S = \{(x_1,y_1),\ldots,(x_m,y_m)\}$ with $m = 4^{\ell}$
for some $\ell \in \mathbb{N}$: that is, $m$ is a power of $4$.
In this case, let $\hat{S}_1, \ldots, \hat{S}_N$ denote the sequence of labeled data sets
returned by $A(S; \{\})$, and note that since each recursive call reduces $|S|$
by a factor of $4$ while making $3$ recursive calls, we have $N = 3^{\ell}$.
First, note that $(x_1, y_1)$ is contained in *every* subsample $\hat{S}_i$.
For the rest, consider any $i \in \{1,\ldots,N\}$ and $t \in \{2,\ldots,m\}$,
and let us express $i - 1$ in its base-$3$ representation as $i - 1 = \sum_{j=0}^{\ell - 1} b_j 3^{j}$, where each $b_j \in \{0,1,2\}$,
and express $t - 1$ in its base-$4$ representation as $t - 1 = \sum_{j=0}^{\ell - 1} c_j 4^{j}$, where each $c_j \in \{0,1,2,3\}$.
Then it holds that $(x_t, y_t) \in \hat{S}_i$ if and only if the largest $j$ with $c_j \neq 0$
satisfies $b_j \neq c_j - 1$.
This kind of direct description of the subsamples is also possible when $m$ is not a power of $4$,
though a bit more complicated to express.

### 4.3 Proof of Theorem 4.2

The following classic result will be needed in the proof.
A bound of this type is implied by a theorem of Vapnik (1982);
the version stated here features slightly smaller constant factors,
obtained by
Blumer, Ehrenfeucht, Haussler, and
Warmuth (1989).^{4}

^{4}Specifically, it follows by combining their Theorem A2.1 and Proposition A2.1, setting the resulting expression equal to $\delta$ and solving for $\varepsilon$.

**Lemma 4.3** For any $\delta \in (0,1)$, $m \in \mathbb{N}$, $f^\star \in \mathbb{C}$, and any probability measure $P$ over $\mathcal{X}$, letting $X'_1, \ldots, X'_m$ be independent $P$-distributed random variables, with probability at least $1 - \delta$, every $h \in \mathbb{C}[\{(X'_t, f^\star(X'_t))\}_{t=1}^{m}]$ satisfies

$$\operatorname{er}_P\left(h; f^\star\right) \leq \frac{2}{m}\left( d \log_2\left(\frac{2 e m}{d}\right) + \log_2\left(\frac{2}{\delta}\right) \right).$$
We are now ready for the proof of Theorem 4.2.

Proof of Theorem 4.2 Fix any $\varepsilon, \delta \in (0,1)$ and probability measure $\mathcal{P}$ over $\mathcal{X}$, and for brevity, denote $Z_{a:b} = \{(X_t, f^\star(X_t))\}_{t=a}^{b}$, for each $a, b \in \mathbb{N}$. Also, for any classifier $h$, define the error region $\mathrm{ER}(h) = \{x \in \mathcal{X} : h(x) \neq f^\star(x)\}$.

We begin by noting that, for any finite sequences $S$ and $T$ of points in $\mathcal{X} \times \mathcal{Y}$, a straightforward inductive argument reveals that all of the subsamples $\hat{S}$ in the sequence returned by $A(S; T)$ satisfy $\hat{S} \subseteq S \sqcup T$ (since no additional data points are ever introduced in any step). Thus, if $\mathbb{C}[S \sqcup T] \neq \emptyset$, then $\mathbb{C}[\hat{S}] \supseteq \mathbb{C}[S \sqcup T]$, so that $\mathbb{C}[\hat{S}] \neq \emptyset$. In particular, this means that, in this case, each of these subsamples is a valid input to $L$, and thus $L(A(S; T))$ is a well-defined sequence of classifiers. Furthermore, since the recursive calls all have $T$ as a subsequence of their second arguments, and the terminal case (i.e., Step 1) includes this second argument in the constructed subsample, another straightforward inductive argument implies that every subsample $\hat{S}$ returned by $A(S; T)$ satisfies $T \subseteq \hat{S}$. Thus, in the case that $\mathbb{C}[S \sqcup T] \neq \emptyset$, by definition of $L$, we also have that every classifier $h$ in the sequence $L(A(S; T))$ satisfies $h \in \mathbb{C}[T]$.

Fix a sufficiently large numerical constant $c$. We will prove by induction that, for any $m \in \mathbb{N}$, for every $\delta \in (0,1)$, and every finite sequence $T$ of points in $\mathcal{X} \times \mathcal{Y}$ with $\mathbb{C}[Z_{1:m} \sqcup T] \neq \emptyset$, with probability at least $1 - \delta$, the classifier $\hat{h} = \mathrm{Maj}(L(A(Z_{1:m}; T)))$ satisfies

$$\operatorname{er}_{\mathcal{P}}\left(\hat{h}; f^\star\right) \leq \frac{c}{m}\left( d + \mathrm{Log}\left(\frac{1}{\delta}\right) \right). \tag{5}$$

First, as a base case, consider any $m$ with $m \leq 3$. In this case, fix any $\delta \in (0,1)$ and any sequence $T$ with $\mathbb{C}[Z_{1:m} \sqcup T] \neq \emptyset$. Also note that $m \leq c$. Thus, as discussed above, $\hat{h} = \mathrm{Maj}(L(A(Z_{1:m}; T)))$ is a well-defined classifier. We then trivially have

$$\operatorname{er}_{\mathcal{P}}\left(\hat{h}; f^\star\right) \leq 1 \leq \frac{c}{m}\left( d + \mathrm{Log}\left(\frac{1}{\delta}\right) \right),$$

so that (5) holds.

Now take as an inductive hypothesis that, for some $m \in \mathbb{N}$ with $m > 3$, for every $m' \in \mathbb{N}$ with $m' < m$, we have that for every $\delta \in (0,1)$ and every finite sequence $T$ in $\mathcal{X} \times \mathcal{Y}$ with $\mathbb{C}[Z_{1:m'} \sqcup T] \neq \emptyset$, with probability at least $1 - \delta$, (5) is satisfied (with $m'$ in place of $m$). To complete the inductive proof, we aim to establish that this remains the case with $m$ as well. Fix any $\delta \in (0,1)$ and any finite sequence $T$ of points in $\mathcal{X} \times \mathcal{Y}$ with $\mathbb{C}[Z_{1:m} \sqcup T] \neq \emptyset$. Note that $m > 3$, so that $\lfloor m/4 \rfloor \geq 1$, and hence the execution of $A(Z_{1:m}; T)$ returns in Step 3 (not Step 1). Let $S_0, S_1, S_2, S_3$ be as in the definition of the algorithm, with $S = Z_{1:m}$. Also denote $T_1 = T \sqcup S_2 \sqcup S_3$, $T_2 = T \sqcup S_3 \sqcup S_1$, $T_3 = T \sqcup S_1 \sqcup S_2$, and for each $i \in \{1,2,3\}$, denote $h_i = \mathrm{Maj}(L(A(S_0; T_i)))$, corresponding to the majority votes of classifiers trained on the subsamples from each of the three recursive calls in the algorithm.

Note that $S_0 \sqcup T_i \subseteq Z_{1:m} \sqcup T$ for each $i \in \{1,2,3\}$, which, together with the fact that $\mathbb{C}[Z_{1:m} \sqcup T] \neq \emptyset$, implies $\mathbb{C}[S_0 \sqcup T_i] \neq \emptyset$ for each $i \in \{1,2,3\}$. Thus, since $|S_0| < m$ as well, for each $i \in \{1,2,3\}$, $L(A(S_0; T_i))$ is a well-defined sequence of classifiers (as discussed above), so that $h_i$ is also well-defined. Therefore, by the inductive hypothesis (applied under the conditional distribution given $S_1$, $S_2$, $S_3$, which are independent of $S_0$), combined with the law of total probability, for each $i \in \{1,2,3\}$, there is an event $E_i$ of probability at least $1 - \delta/9$, on which

$$\operatorname{er}_{\mathcal{P}}\left(h_i; f^\star\right) \leq \frac{c}{|S_0|}\left( d + \mathrm{Log}\left(\frac{9}{\delta}\right) \right). \tag{6}$$

Next, fix any $i \in \{1,2,3\}$, and denote by $S_i \cap \mathrm{ER}(h_i)$ the subsequence of elements $(x, y)$ in $S_i$ for which $x \in \mathrm{ER}(h_i)$. Note that, since $h_i$ and $S_i$ are independent, the points of $S_i \cap \mathrm{ER}(h_i)$ are conditionally independent given $h_i$ and $|S_i \cap \mathrm{ER}(h_i)|$, each with conditional distribution $\mathcal{P}(\cdot \mid \mathrm{ER}(h_i))$ (if $|S_i \cap \mathrm{ER}(h_i)| > 0$). Thus, applying Lemma 4.3 under the conditional distribution given $h_i$ and $|S_i \cap \mathrm{ER}(h_i)|$, combined with the law of total probability, we have that on an event $E'_i$ of probability at least $1 - \delta/9$, if $|S_i \cap \mathrm{ER}(h_i)| > 0$, then every $g \in \mathbb{C}[S_i \cap \mathrm{ER}(h_i)]$ satisfies

$$\mathcal{P}\left( \mathrm{ER}(g) \,\middle|\, \mathrm{ER}(h_i) \right) \leq \frac{2}{|S_i \cap \mathrm{ER}(h_i)|}\left( d \log_2\left(\frac{2 e |S_i \cap \mathrm{ER}(h_i)|}{d}\right) + \log_2\left(\frac{18}{\delta}\right) \right).$$

Furthermore, as discussed above, each $i' \neq i$ and $g \in L(A(S_0; T_{i'}))$ have $g \in \mathbb{C}[T_{i'}]$, and $S_i \subseteq T_{i'}$, so that $\mathbb{C}[T_{i'}] \subseteq \mathbb{C}[S_i \cap \mathrm{ER}(h_i)]$. It follows that every such $g$ satisfies the above inequality as well. Thus, on the event $E'_i$, if $|S_i \cap \mathrm{ER}(h_i)| > 0$, then $\forall i' \neq i$, $\forall g \in L(A(S_0; T_{i'}))$,

$$\mathcal{P}\left( \mathrm{ER}(g) \cap \mathrm{ER}(h_i) \right) \leq \operatorname{er}_{\mathcal{P}}\left(h_i; f^\star\right) \cdot \frac{2}{|S_i \cap \mathrm{ER}(h_i)|}\left( d \log_2\left(\frac{2 e |S_i \cap \mathrm{ER}(h_i)|}{d}\right) + \log_2\left(\frac{18}{\delta}\right) \right). \tag{7}$$

Additionally, since $h_i$ and $S_i$ are independent, by a Chernoff bound (applied under the conditional distribution given $h_i$) and the law of total probability, there is an event $E''_i$ of probability at least $1 - \delta/9$, on which, if $\operatorname{er}_{\mathcal{P}}(h_i; f^\star) \geq \frac{c_3}{|S_i|}\mathrm{Log}\left(\frac{9}{\delta}\right)$ (for an appropriate numerical constant $c_3$), then

$$\left| S_i \cap \mathrm{ER}(h_i) \right| \geq \frac{1}{2}\, \operatorname{er}_{\mathcal{P}}\left(h_i; f^\star\right) |S_i|.$$

In particular, on $E''_i$, if $\operatorname{er}_{\mathcal{P}}(h_i; f^\star) \geq \frac{c_3}{|S_i|}\mathrm{Log}\left(\frac{9}{\delta}\right)$, then the above inequality implies $|S_i \cap \mathrm{ER}(h_i)| > 0$.

Combining this with (6) and (7), and noting that the right-hand side of (7) is nonincreasing in $|S_i \cap \mathrm{ER}(h_i)|$ over the relevant range (for any fixed $\delta$), we have that on $E_i \cap E'_i \cap E''_i$, if $\operatorname{er}_{\mathcal{P}}(h_i; f^\star) \geq \frac{c_3}{|S_i|}\mathrm{Log}\left(\frac{9}{\delta}\right)$, then every $g \in \bigcup_{i' \neq i} L(A(S_0; T_{i'}))$ has

$$\mathcal{P}\left( \mathrm{ER}(g) \cap \mathrm{ER}(h_i) \right) \leq \frac{4}{|S_i|}\left( d \log_2\left(\frac{e\, \operatorname{er}_{\mathcal{P}}(h_i; f^\star)\, |S_i|}{d}\right) + \log_2\left(\frac{18}{\delta}\right) \right), \tag{8}$$

where this last inequality is due to the technical lemma in Appendix A. Since $|S_i| = \lfloor m/4 \rfloor$ and $m > 3$, we have $|S_i| \geq m/7$. Plugging this relaxation into the above bound, combined with (6) and numerical calculation of the logarithmic factor, we find that the expression in (8) is less than

$$\frac{c_4}{m}\left( d + \mathrm{Log}\left(\frac{1}{\delta}\right) \right),$$

for an appropriate numerical constant $c_4$. Additionally, if instead $\operatorname{er}_{\mathcal{P}}(h_i; f^\star) < \frac{c_3}{|S_i|}\mathrm{Log}\left(\frac{9}{\delta}\right)$, then monotonicity of measures implies

$$\mathcal{P}\left( \mathrm{ER}(g) \cap \mathrm{ER}(h_i) \right) \leq \operatorname{er}_{\mathcal{P}}\left(h_i; f^\star\right) < \frac{c_3}{|S_i|}\mathrm{Log}\left(\frac{9}{\delta}\right) \leq \frac{c_4}{m}\left( d + \mathrm{Log}\left(\frac{1}{\delta}\right) \right),$$

again using the above lower bound on $|S_i|$ for this last inequality. Thus, regardless of the value of $\operatorname{er}_{\mathcal{P}}(h_i; f^\star)$, on the event $E_i \cap E'_i \cap E''_i$, we have $\forall i' \neq i$, $\forall g \in L(A(S_0; T_{i'}))$,

$$\mathcal{P}\left( \mathrm{ER}(g) \cap \mathrm{ER}(h_i) \right) \leq \frac{c_4}{m}\left( d + \mathrm{Log}\left(\frac{1}{\delta}\right) \right).$$

Now denote $\hat{h} = \mathrm{Maj}(L(A(Z_{1:m}; T)))$. By definition of $\mathrm{Maj}$, for any $x \in \mathrm{ER}(\hat{h})$, at least half of the classifiers $g$ in the sequence $L(A(Z_{1:m}; T))$ have $x \in \mathrm{ER}(g)$.
