1 Learning procedures and unrestricted procedures
In the standard setup in statistical learning theory, one is given a class of functions
defined on a probability space. The goal is to identify, or at least mimic, a function in
that is as close as possible to the unknown target random variable in some appropriate sense. If is distributed according to , then an obvious candidate for being considered “as close as possible to in ” is the function
it minimizes the average cost (relative to the squared loss) one has to pay for predicting instead of . In a more geometric language, minimizes the distance between and the class , and in what follows we implicitly assume that such a minimizer exists.
What makes the learner’s task a potentially difficult one is the limited information at his disposal: instead of knowing the distribution and the target random variable (which would make identifying a problem in approximation theory), both and are not known. Rather, the learner is given an independent sample , with each pair
distributed according to the joint distribution. Using the sample, the learner selects some function in , hoping that it is almost as good a prediction of as is. The selection is made via a learning procedure, which is a mapping .
If is the selection made by the procedure given the data, its excess risk is the conditional expectation
and the procedure’s success is measured through properties of . Since the information the learner has is limited, it is unlikely that can always be a good guess, and therefore ’s performance is measured using a probabilistic yardstick: the sample complexity; that is, for a given accuracy and a confidence parameter , the number of independent pairs that are needed to ensure that
The key question in learning theory is to identify a procedure that performs with the optimal sample complexity (an elusive term that will be clarified in what follows) for each learning problem. It stands to reason that the optimal sample complexity should depend on the right notion of statistical complexity of the class ; on some (minimal) global information on the target and underlying distribution ; and the required accuracy and confidence levels.
To put our results in context, let us begin by describing what we mean by optimal sample complexity and optimal procedure. These are minor modifications of notions introduced in , in which an optimal proper learning procedure was identified for (almost) any problem involving a convex class .
Before we dive into more technical details, let us fix some notation.
Throughout we denote absolute constants by or . Their values may change from line to line. or are constants that depend only on the parameter . means that there is an absolute constant such that and means that the constant depends only on the parameter . We write if and , while means that the equivalence constants depend only on .
All the functions we consider are square integrable on an appropriate probability space, though frequently we do not specify the space or the measure as those will be clear from the context. Thus, and , and we adopt a similar notation for other spaces.
For the sake of simplicity, we denote each learning problem, consisting of the given class of functions , an unknown underlying distribution , and an unknown target , by the triplet . It should be stressed that when we write “given the triplet ”, it does not mean that the learner has any additional information on or on . Still, this notation helps one to keep track of the fact that the sample complexity may change not only with but also with and .
We denote generic triplets by and . For a triplet we set and . The class consists of all the functions for and ; also, for , set .
1.1 Notions of optimality
The notion of optimality we use is based on a list of ‘obstructions’. These obstructions are, in some sense, trivial, and overcoming each one of them is something one would expect of any reasonable procedure—certainly from a procedure that deserves to be called optimal. On the other hand, overcoming each obstruction comes at a price: as we explain in what follows, a certain geometric obstruction forces one to consider procedures that need not be proper; and overcoming some trivial statistical obstructions requires a minimal number of sample points.
Let us describe the ‘trivial’ obstructions one may encounter and the minimal price one has to pay to overcome each one.
A geometric obstruction
In the standard (proper) learning model the procedure is only allowed to take values in the given class . At a first glance this restriction seems to be completely reasonable; after all, the learner’s goal is to find a function that mimics the behaviour of the best function in , and there is no apparent reason to look for such a function outside . However, a more careful consideration shows that this restriction comes at a high cost:
Let and fix an integer . Set to be a ‘noisy’ -perturbation of the midpoint , that is slightly closer to than to . Then, given samples , any proper procedure will necessarily make the wrong choice with probability ; that is, with probability at least , and on that event, the excess risk is .
In other words, by considering such targets, and given accuracy , the sample complexity of any learning procedure taking values in cannot be better than even if one is interested only in constant confidence.
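The gap between proper and unrestricted procedures is easy to reproduce numerically. The following toy instance is our own illustration (not the paper's exact construction): the class consists of the two constant functions f1 ≡ 1 and f2 ≡ −1, and the target is a small positive shift plus Gaussian noise, so it is slightly closer to f1. A proper learner that picks the empirically better of the two errs with constant probability, while the unrestricted midpoint f ≡ 0 beats both class members without looking at the data at all.

```python
import random
import statistics

random.seed(0)

def sample_Y(m, delta=0.05, sigma=1.0):
    # target Y = delta + noise: a 'noisy' perturbation slightly closer to f1
    return [delta + random.gauss(0.0, sigma) for _ in range(m)]

def proper_choice(ys):
    # proper learner: pick whichever of f1 = 1, f2 = -1 has smaller empirical risk
    r1 = statistics.fmean((1.0 - y) ** 2 for y in ys)
    r2 = statistics.fmean((-1.0 - y) ** 2 for y in ys)
    return 1.0 if r1 <= r2 else -1.0

def true_excess_risk(f, delta=0.05):
    # E(f - Y)^2 - min over {f1, f2}; the noise variance cancels in the difference
    def risk(v):
        return (v - delta) ** 2
    return risk(f) - min(risk(1.0), risk(-1.0))

m, trials = 25, 2000
wrong = sum(proper_choice(sample_Y(m)) == -1.0 for _ in range(trials)) / trials
print(f"proper learner picks the wrong function in ~{wrong:.0%} of trials")
print(f"excess risk of the wrong proper choice:  {true_excess_risk(-1.0):.3f}")
print(f"'excess risk' of the improper midpoint:  {true_excess_risk(0.0):.3f}")
```

The midpoint's value is negative: it outperforms the best function in the class, which is precisely why removing the properness restriction helps.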
Example 1.1 serves as a strong indication of a general phenomenon: there are seemingly simple problems, including ones involving classes with a finite number of functions (in this example, only two functions…), in which the sample complexity is significantly higher than what would be expected given the class’ size. The reason for such slow rates is that the ‘location’ of the target relative to the class is not ‘favourable’. (It is well understood that the target is in a favourable location when the set of functions in that ‘almost minimize’ the risk functional consists only of perturbations of the unique minimizer ; see  for more details.) In contrast, if happens to be convex then any target is in a favourable location and there is no geometric obstruction that forces slow rates; the same also holds for a general class in the case of independent additive noise—when for some and that is mean-zero and independent of .
Here we are interested in general triplets , making it impossible to guarantee that the unknown target is in a favourable location relative to . Therefore, to have any hope of addressing this obstruction, one must remove the restriction that the procedure is proper; instead we consider unrestricted procedures, that is, procedures that are allowed to take values outside .
A natural way of finding generic statistical obstructions is identifying the reasons why a statistical procedure may make mistakes. Roughly put, there are two sources of error:
Intrinsic errors: When is ‘rich’ close to the true minimizer , it is difficult to ‘separate’ class members with the limited data the learner has. In noise-free (realizable) problems this corresponds to having a large version space—the (random) subset of , consisting of all the functions that agree with the target on the given sample.
External errors: When the ‘noise level’ increases, that is, when is relatively far from , interactions between and class members can cause distortions. These interactions make functions that are close to indistinguishable and cause the procedure to make mistakes in the choices it makes.
Obviously, describing the effect each one of these sources of error has on the sample complexity is of the utmost importance. The “statistical obstructions” we refer to are defined for any class and underlying distribution , and are the result of the intrinsic and external factors in two specific collections of learning problems involving and (keeping in mind that the learner does not know ). The targets one considers are either:
Realizable targets; that is, targets of the form where ; or
Additive, independent gaussian noise; that is, targets of the form , where and is a centred gaussian random variable, independent of and with variance .
The idea is that an optimal statistical procedure must be able to address such simple problems, making them our choice of ‘trivial’ statistical obstructions. And, the sample complexity needed to overcome the intrinsic and external errors for targets as in or is a rather minimal ‘price’ one should be willing to pay when trying to address general prediction problems.
The first ‘trivial’ statistical obstruction we consider has to do with realizable problems. Since the learner has no information on the underlying distribution , there is no way of excluding the possibility that there are that are far from each other, and yet agree on a set of constant measure — say . Hence, given a sample of cardinality , there is a probability of at least that the two functions are indistinguishable on the sample. This trivial reason for having a version space with a large diameter sets the bar for the sample complexity at at least .
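To make the constant-probability claim explicit: if two functions agree everywhere except on a set of measure $1/m$, an i.i.d. sample of cardinality $m$ avoids that set entirely with probability

```latex
\[
\mathbb{P}\big(\text{no sample point falls in the disagreement set}\big)
=\Big(1-\frac{1}{m}\Big)^{m}\;\xrightarrow[m\to\infty]{}\;e^{-1},
\qquad\text{and}\qquad
\Big(1-\frac{1}{m}\Big)^{m}\ \ge\ \frac{1}{4}\ \ \text{for every } m\ge 2,
\]
```

and on that event the two functions are indistinguishable on the data, so no procedure can tell them apart.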
The introduction of the other trivial obstructions requires additional notation. It is not surprising that the resulting sample complexity has to do with localized Rademacher averages.
Let be the unit ball in , set and put . The star-shaped hull of a class and a function is given by
in other words, consists of the union of all the intervals whose end points are and . From here on we set
Note that is the set one obtains by taking , intersecting it with an ball centred at and of radius , and then shifting to .
For a triplet let
where are independent, symmetric -valued random variables that are independent of and the expectations are taken with respect to both and .
Intuitively, is the sample size needed to overcome ‘intrinsic’ errors while is the sample size one must have to overcome the ‘external’ ones. More accurately, one has the following:
[12, 18] There is an absolute constant for which the following holds. Under mild assumptions on and (the assumptions have to do with the continuity of the processes appearing in the definitions of and ; we refer to [12, 18] for more information on these lower bounds),
If for every realizable target , one has that with probability at least the diameter of the version space is at most , then the sample size is at least
where the supremum is taken with respect to all triplets involving the fixed class , the fixed (but unknown) distribution , and targets of the form , .
If is a learning procedure that performs with accuracy and confidence for any target of the form as in , then it requires a sample size of cardinality at least
where the supremum is taken with respect to all triplets involving the fixed class , the fixed (but unknown) distribution , and targets of the form , .
Claim 1.3 provides a lower bound on the sample complexity needed to overcome the trivial obstructions associated with and at a constant confidence level. When one is interested in a higher confidence level, one has the following:
 There is an absolute constant for which the following holds. Under mild assumptions on and , any learning procedure that performs with accuracy and confidence for any target of the form as in , requires a sample size of cardinality at least
where, as always, .
With the geometric obstruction and the trivial statistical obstructions in mind, a (seemingly wildly optimistic) notion of an optimal sample complexity and an optimal procedure is the following:
An unrestricted procedure is optimal if there are constants and such that for (almost) every triplet , the procedure performs with accuracy and confidence with sample complexity
At a first glance, this benchmark seems to be too good to be true. One source of optimism is  in which a (proper) procedure that attains (1.4) is established—the median-of-means tournament. However, tournaments are shown to be optimal only for problems involving convex classes (or general classes but for targets that consist of independent additive noise). The success of the tournament procedure introduced in  does not extend to more general learning problems; not only is it a proper procedure, its analysis uses the favourable location of the target in a strong way.
2 The main result in detail
As a first step in an accurate formulation of our main result, let us specify what we mean by “almost every triplet”.
For a class , let and assume that for every there exists such that for every ,
Equation (2.1) is a uniform integrability condition for , and as such it is only slightly stronger than a compactness assumption on : (2.1) holds for any individually, and the fact that is reasonably small allows the ‘cut-off’ points to be chosen uniformly for any .
An indication that Assumption 2.1 is rather minimal is norm equivalence: that there are constants and (which can be arbitrarily close to ) such that for every . An norm equivalence implies that Assumption 2.1 holds with depending only on and , and the standard proof is based on tail integration and Chebyshev's inequality.
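To spell out the standard proof just mentioned (a sketch in our own notation; we write the uniform integrability condition in the generic form $\mathbb{E}\,f^2\mathbf{1}_{\{|f|\ge R\}}\le\varepsilon\|f\|_{L_2}^2$, which may differ in inessential ways from the paper's exact normalization): if $\|f\|_{L_4}\le L\|f\|_{L_2}$, then by Cauchy–Schwarz and Chebyshev's inequality,

```latex
\[
\mathbb{E}\, f^2 \mathbf{1}_{\{|f|\ge R\}}
\;\le\; \big(\mathbb{E} f^4\big)^{1/2}\, \mathbb{P}\big(|f|\ge R\big)^{1/2}
\;\le\; L^2 \|f\|_{L_2}^2 \cdot \frac{\|f\|_{L_2}}{R},
\]
```

so taking $R \ge L^2\|f\|_{L_2}/\varepsilon$ makes the left-hand side at most $\varepsilon\|f\|_{L_2}^2$, with a cut-off point that depends only on $L$ and $\varepsilon$ — uniformly in $f$.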
Norm equivalence occurs frequently in statistical problems — for example, in linear regression, where the class in question consists of linear functionals in. It is standard to verify that
norm equivalence is satisfied for random vectors that are subgaussian; log-concave; of the form , where the ’s are independent copies of a symmetric, variance random variable that is bounded in for some ; and in many other situations (see, e.g., ).
While Assumption 2.1 is weaker than any norm equivalence, it is actually stronger than the small-ball condition which plays a central role in [17, 22, 16]. Indeed, a small-ball condition means that there are and such that
Invoking Assumption 2.1 for an arbitrary , it is evident that satisfies which, by the Paley-Zygmund Theorem, guarantees a small-ball condition for constants and that depend only on and .
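For completeness, the Paley–Zygmund step can be written out explicitly under a norm equivalence (the symbols here are ours): for $\theta\in(0,1)$ and any $f$ satisfying $\|f\|_{L_4}\le L\|f\|_{L_2}$,

```latex
\[
\mathbb{P}\big(|f|\ge \theta\,\|f\|_{L_2}\big)
= \mathbb{P}\big(f^2\ge \theta^2\,\mathbb{E}f^2\big)
\ \ge\ (1-\theta^2)^2\,\frac{(\mathbb{E}f^2)^2}{\mathbb{E}f^4}
\ \ge\ \frac{(1-\theta^2)^2}{L^4},
\]
```

which is a small-ball condition with constants depending only on $\theta$ and $L$. Assumption 2.1 replaces the $L_4$ bound by uniform integrability but leads to the same conclusion.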
The need for a slightly stronger assumption than (2.2) arises because a small-ball condition leads only to an isomorphic lower bound on quadratic forms: it implies that with high probability,
but the constant cannot be made arbitrarily close to . It turns out that proving that our procedure is optimal requires a version of (2.3) for a constant that can be taken close to . We show in what follows that Assumption 2.1 suffices for that.
Next, we need an additional parameter that gives information on the way the target interacts with the class .
For a triplet set
(recall that ).
In particular, for any ,
Equation (2.4) plays a significant role in what follows, and the following examples may help in giving a better understanding of it:
If for some and is a mean-zero, square-integrable random variable that is independent of then .
Let for some and as in . If for every , then . More generally, the same holds if the norm equivalence is true for and for every .
If for every , , then one may take .
The proofs of all these observations are completely standard and we omit them.
Before we formulate the main result, and for a reason that will become clear immediately, we need to outline a few preliminary details on the procedure we introduce.
The procedure receives as input a class and a sample , and returns a subset . The two crucial features of are that
It contains ; and
If then either is ‘very close’ to or alternatively, is much closer to than is.
Now, let be a triplet and fix an accuracy and a confidence level .
Given an integer , let be the set generated by the procedure after being given the class and the sample .
Set and let be the triplet .
For an integer let be the set generated by the procedure after being given and an independent sample .
Let be any function in .
Let satisfy Assumption 2.1. Then for every accuracy and confidence parameter we have that
provided that , where
and . The constants depend only on the uniform integrability function : if we set ; ; and , then and depend only on and .
We describe the procedure in detail in Section 3.
Just like in , Theorem 2.2 has striking consequences: it implies that all the statistical content of a learning problem associated with the triplet is actually coded in the ‘trivial’ sample complexity , which, by Claim 1.3 and Claim 1.4, corresponds to the bare minimum one requires to overcome the trivial obstacles at a constant accuracy level. And, once that minimal threshold is passed, the procedure requires only an additional sample whose cardinality is a lower bound on the sample complexity had the target been where and is a centred gaussian variable that is independent of .
Of course, is a random object, and to avoid having a data-dependent component in the sample complexity estimate one may simply take the largest sample complexity required for a set which satisfies that
since satisfies that condition. With that in mind, let us introduce the following notation.
For a triplet , let be the collection of all subsets of that contain . Set
to be all the triplets associated with such classes , the original distribution and the target .
Clearly, for Theorem 2.2 to hold it suffices that
and often this upper bound is not much worse than (2.2).
Although Assumption 2.1 is rather natural, it does not cover one of the main families of problems encountered in learning theory: when the class consists of functions uniformly bounded by some constant and the target is also bounded by the same constant.
While Theorem 2.2 is not directly applicable to this bounded framework, its proof actually is. In fact, because sums of iid bounded random variables exhibit a strong concentration phenomenon, the proof of a version of Theorem 2.2 that holds in the bounded framework is much simpler than in the general case we focus on here. Because our main interest is heavy-tailed problems, we only sketch the analogous result in the bounded framework in Appendix B.
To illustrate Theorem 2.2, let us present the following classical example, which has been studied extensively in the statistics literature.
2.1 Example – finite dictionaries
One of the most important questions in modern high dimensional statistics has to do with prediction problems involving finite classes of functions, or dictionaries. (In the statistics literature, this is called model-selection aggregation for a finite dictionary; aggregation problems of this type have been studied extensively over the years, and we refer the reader to [2, 3, 9, 13, 15, 23, 24, 26] for more information on the subject.) Because finite classes can never be convex, they fall outside the scope of , and the resulting prediction problems call for a totally different approach.
For the sake of simplicity let us illustrate the outcome of Theorem 2.2 by focusing on dictionaries in , i.e., for , let . Let be a centred random vector in and, as a working hypothesis (which can be relaxed further), assume that
There is a constant such that for every , .
The unknown target is of the form , where and is an unknown, mean-zero, square-integrable random variable that is independent of .
Clearly, the functions can be heavy-tailed, as the norm equivalence only implies that is slightly smaller than . Also, since is just square integrable, need not have any finite moment beyond the second one. This setup is totally out of reach for methods that exploit direct concentration arguments, and specifically, the results in [13, 24], which deal with dictionaries consisting of functions bounded by and targets that are bounded by , are not applicable here.
For any such triplet let be the minimizer in of the squared risk functional and set . Applying Theorem 2.2 for a given accuracy and confidence parameter , the procedure selects
the constants and from Theorem 2.2 depend only on , as does . Hence, if
with probability at least ; the maximum in (2.7) is with respect to all triplets
As it happens, it is straightforward to obtain an upper estimate on (2.7) that holds for any dictionary of cardinality . For example, one may show that if is a dictionary consisting of points, is -subgaussian (i.e., in addition to being centred it satisfies that for any and , ), and is as above, then
where depends only on . Hence, an upper estimate on (2.7) that holds for any such triplet and in particular for any is that
It is well known that (2.10) is the best possible sample complexity estimate that holds for all possible dictionaries with points. Note, however, that (2.10) is attained after two significant steps that may come at a cost: first, in (2.7) one replaces the triplet by the collection of triplets ; and second, (2.9) is a bound that holds for any dictionary of cardinality , completely disregarding the geometry of the given class. Hence, (2.10) is a ‘worst-case’ upper bound on the required sample complexity for the triplet . A better upper bound can be derived if one has more information on the structure of the dictionary, as its geometry is reflected in and .
Despite being suboptimal, (2.10) is actually a considerable improvement on the current state-of-the-art in such problems, established in . For example, let us compare the results from  to (2.10) in the case where is an -subgaussian random vector and for some and that is square-integrable and independent of .
 There is a procedure for which the following holds. If we set , then
provided that for some
and scales like up to logarithmic factors.
Both Theorem 2.7 and (2.10) deal with that is -subgaussian and , where can be heavy-tailed. Even if we take for granted that for some (which is not automatic; is assumed only to be square-integrable), it is clear that (2.10) is a much sharper estimate. Indeed, the clearest difference between Theorem 2.7 and (2.10) is the way the sample complexity scales with the confidence parameter : the former is polynomial in and the latter is logarithmic in .
The procedure from Theorem 2.7 is suboptimal because it is based on Empirical Risk Minimization
(ERM), and ERM-based procedures perform poorly when faced with heavy-tailed data. ERM does reasonably well only when there are almost no outliers and the few existing outliers are not very far from the ‘bulk’ of the data, but it does not cope well otherwise. Few and well-behaved outliers are to be expected only when the random variables involved have rapidly decaying tails (subgaussian), but when faced with data that is heavier tailed, like the ’s in the example, ERM is bound to fail. We refer the reader to  for a detailed discussion on ERM’s sub-optimality, and turn now to describe a procedure that overcomes these issues. Like in , the procedure is based on a median-of-means tournament—though a very different one from the tournament used in .
3 The procedure in detail
The procedure we introduce is denoted by and consists of two components, and .
– estimating distances
The procedure receives as input a class of functions and a sample . It has one tuning parameter, an integer .
For any pair of functions and a sample , set and let
where is the nonincreasing rearrangement of .
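The nonincreasing rearrangement suggests reading $\Phi$ as a trimmed-mean estimate of the distance between class members. The sketch below is a plausible rendering with hypothetical names; the trimming level `k` stands in for the tuning parameter, and the paper's exact truncation rule may differ:

```python
def phi(f, h, xs, k):
    """Hypothetical sketch of the distance-estimation component: average the
    values |f - h|(X_i) after discarding the k largest ones (the head of the
    nonincreasing rearrangement), making the estimate robust to a small
    number of outliers caused by heavy tails."""
    # nonincreasing rearrangement of (|f - h|(X_i))_{i=1..m}
    vals = sorted((abs(f(x) - h(x)) for x in xs), reverse=True)
    trimmed = vals[k:]  # drop the k largest values
    return sum(trimmed) / len(trimmed)
```

For instance, a single wild sample point does not distort the estimate: with `f(x) = x`, `h(x) = 0` and the sample `[1, 2, 3, 100]`, taking `k = 1` discards the value 100 and returns the mean of the remaining values.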
– comparing statistical performance of functions
The second component of the procedure receives as input a class ; a sample ; and all the outcomes for , which were computed using the sample .
has several tuning parameters, denoted by , and it is also given the required accuracy and confidence parameters and . Here and throughout this article, given the accuracy we set for a constant that is specified in what follows. We also show that all the tuning parameters (including ) depend only on the uniform integrability function at two values: we set , and for we set ; the tuning parameters depend only on and .
To define , let
We split into coordinate blocks which, without loss of generality, are assumed to be of equal size, denoted by .
For and let
Set if, for more than of the coordinate blocks , one has
It is a little easier to follow the meaning of Definition 3.2 if one thinks of as a tournament procedure, and Definition 3.2 as representing the outcome of a ‘home-and-away’ type match between any two elements in : the function wins its home match against if . Therefore, consists of all the functions in that have won all their home matches in the tournament. Note that it is possible to have both and .
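The majority-over-blocks rule behind a ‘home’ match can be sketched as follows (an assumed form with hypothetical names; the actual criterion also involves the estimated distances and the thresholds set by the tuning parameters):

```python
def wins_home_match(f, h, xs, ys, n_blocks):
    """Sketch: split the sample into n_blocks blocks and declare that f beats
    h if, on a majority of the blocks, the empirical squared loss of f is
    smaller than that of h."""
    block = len(xs) // n_blocks
    wins = 0
    for j in range(n_blocks):
        idx = range(j * block, (j + 1) * block)
        loss_f = sum((f(xs[i]) - ys[i]) ** 2 for i in idx)
        loss_h = sum((h(xs[i]) - ys[i]) ** 2 for i in idx)
        if loss_f < loss_h:
            wins += 1
    return wins > n_blocks / 2
```

Taking the number of blocks proportional to $\log(2/\delta)$ turns a constant-probability advantage on a single block into confidence $1-\delta$ via the majority vote over independent blocks.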
Therefore, the complete procedure is as follows:
Estimating distances between class members has nothing to do with the unknown target and does not require the “labels” . Thus, may be considered as a pre-processing step and can then be used for any target —simply by running for the class . In fact, there is nothing special about ; it may be replaced by any data-dependent procedure that satisfies as long as (recall that , the required accuracy level). This fact will be of use when we explore the bounded framework, in Appendix B.
Finally, there are some situations in which is not needed at all. For example, if is a random vector in with independent, mean-zero, variance random variables as coordinates then its covariance structure coincides with the standard Euclidean structure in . Thus, for any , and there is no need to estimate the distances between linear functionals .
4 Proof of Theorem 2.2
The proof of Theorem 2.2 is rather involved and technical, and requires some unpleasant ‘constant chasing’; unfortunately, that cannot be helped. We begin its presentation with a short road-map, outlining the argument.
The first component in the proof is a reduction step: identifying a sufficient (random) condition, which, once verified, implies that the procedure performs as expected. This reduction step is presented in Section 4.1 and Section 4.2.
The study of the random condition is the heart of the matter. To prove its validity with the desired confidence, one has to show that the quadratic and multiplier components of the excess risk functional are ‘regular’ in an appropriate sense once the sample size is large enough. Proving that is the topic of Section 4.3.
To ease some of the technical difficulty in the proofs of the random components from Section 4.3, it is helpful to keep in mind the following facts:
All the constants appearing in the proof are derived from the uniform integrability function at two different, well specified levels. Although we keep track of those constants, one should realize that since is known, they are just fixed numbers, and of limited significance to the understanding of what is going on.
The number of coordinate blocks used in the tournament is . The motivation behind this choice is simple: if a certain property holds for a single function on an individual block with constant probability, then by the independence of the blocks the probability that the property is satisfied by a majority of the blocks is exponential in . With our choice of , the resulting confidence is , which is precisely what we are looking for.
The choice of the sample size is made to ensure that one has ‘enough randomness’, leading to a regular behaviour of the random variables involved in the proof. We establish a quantitative estimate on the sample size that is needed for that regularity, but it is instructive to note as the proof progresses that the wanted control becomes more likely as increases.
4.1 A deterministic interlude
There is a feature that plays an important role in most unrestricted procedures: if one can find two almost minimizers of the risk that are far apart, their midpoint is much closer to than is. Each procedure looks for such functions and exploits their existence in a different way, but up to this point, all the methods that have been used to that end were based on empirical minimization. This deterministic interlude is a step towards an alternative path: finding, without resorting to ERM, a subset of the given class that consists of functions that are either very close to , or whose average with is significantly closer to than is.
Let be a triplet, fix and recall that . Let
be the hyperplane supporting the ball at . Observe that if happens to be convex then —the ‘positive side’ of , defined by the condition . Indeed, this follows from the characterization of the nearest point map onto a closed, convex subset of a Hilbert space.
Of course, need not be convex. Therefore, as a preliminary goal one would like to identify a subset of , containing and possibly other functions as well, as long as they satisfy the following:
If then is an almost minimizer of the risk, in the sense that .
If and then is significantly smaller than .
We call such a subset an essential part of the class, though it depends on the entire triplet:
Let be a triplet. For and , a subset is -essential if and for every ,
Observe that (4.1) amounts to