In this paper we pursue a tight characterization of the sample complexity of learning a classifier, under a particular data distribution, and using a particular learning rule.
Most learning theory work focuses on providing sample-complexity upper bounds that hold for a large class of distributions. For instance, standard distribution-free VC-dimension analysis shows that if one uses the Empirical Risk Minimization (ERM) learning rule, then the sample complexity of learning a classifier from a hypothesis class with VC-dimension $d$ is at most $O(d/\epsilon^2)$, where $\epsilon$ is the maximal excess classification error (Vapnik and Chervonenkis, 1971; Anthony and Bartlett, 1999). Such upper bounds can be useful for understanding the positive aspects of a learning rule. However, it is difficult to understand the deficiencies of a learning rule, or to compare different rules, based on upper bounds alone. This is because it is possible, and often the case, that the actual number of samples required to reach a low error, for a given data distribution using a given learning rule, is much lower than the sample-complexity upper bound. As a simple example, suppose that the support of a given distribution is restricted to a subset of the domain. If the VC-dimension of the hypothesis class, when restricted to this subset, is smaller than that of the full class, then learning with respect to this distribution will require fewer examples than the upper bound predicts.
Of course, some sample complexity upper bounds are known to be tight or to have an almost-matching lower bound. For instance, the VC-dimension upper bound is tight (Vapnik and Chervonenkis, 1974). This means that there exists some data distribution in the class covered by the upper bound, for which this bound cannot be improved. Such a tightness result shows that there cannot be a better upper bound that holds for this entire class of distributions. But it does not imply that the upper bound characterizes the true sample complexity for every specific distribution in the class.
The goal of this paper is to identify a simple quantity, a function of the distribution, that precisely characterizes the sample complexity of learning this distribution under a specific learning rule. We focus on the important hypothesis class of linear classifiers, and on the popular rule of margin-error minimization (MEM). Under this learning rule, a learner must always select a linear classifier that minimizes the margin error on the input sample.
The VC-dimension of the class of homogeneous linear classifiers in $\mathbb{R}^d$ is $d$ (Dudley, 1978). This implies a sample complexity upper bound of $\tilde{O}(d/\epsilon^2)$ using any MEM algorithm, where $\epsilon$ is the excess error relative to the optimal margin error. (This upper bound can be derived analogously to the result for ERM algorithms, with $\epsilon$ being the excess classification error; it can also be concluded from our analysis in Theorem 6 below.) We also have that the sample complexity of any MEM algorithm is at most $O(B^2/(\gamma^2\epsilon^2))$, where $B^2$ is the average squared norm of the data and $\gamma$ is the size of the margin (Bartlett and Mendelson, 2002). Both of these upper bounds are tight. For instance, there exists a distribution with average squared norm $B^2$ that requires as many as $c\,B^2/(\gamma^2\epsilon^2)$ examples to learn, for some universal constant $c$ (see, e.g., Anthony and Bartlett, 1999). However, the VC-dimension upper bound indicates, for instance, that if a distribution induces a large average norm but is supported by a low-dimensional sub-space, then the true number of examples required to reach a low error is much smaller than the norm-based bound. Thus, neither of these upper bounds fully describes the sample complexity of MEM for a specific distribution.
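For readability, the two classical bounds discussed above can be set side by side (a reconstruction in standard notation; here $d$ is the dimension, $B^2$ the average squared norm, $\gamma$ the margin, $\epsilon$ the excess error, and constants and logarithmic factors are suppressed):

```latex
% dimension-based (VC) bound and norm-based (Rademacher) bound for MEM
m(\epsilon,\gamma,D) \;\le\; \tilde{O}\!\left(\frac{d}{\epsilon^{2}}\right)
\qquad \text{and} \qquad
m(\epsilon,\gamma,D) \;\le\; O\!\left(\frac{B^{2}}{\gamma^{2}\epsilon^{2}}\right),
\qquad \text{where } B^{2} = \mathbb{E}_{X \sim D_X}\!\left[\|X\|^{2}\right].
```

Each bound wins in a different regime: the first when the data is low-dimensional but has large norm, the second when it is high-dimensional but norm-bounded.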
We obtain a tight distribution-specific characterization of the sample complexity of large-margin learning for a rich class of distributions. We present a new quantity, termed the margin-adapted dimension, and use it to provide a tighter distribution-dependent upper bound, and a matching distribution-dependent lower bound for MEM. The upper bound is universal, and the lower bound holds for a rich class of distributions with independent features.
The margin-adapted dimension refines both the dimension and the average norm of the data distribution, and can be easily calculated from the covariance matrix and the mean of the distribution. We denote this quantity, for a margin of $\gamma$, by $d_\gamma$. Our sample-complexity upper bound shows that $\tilde{O}(d_\gamma/\epsilon^2)$ examples suffice in order to learn any distribution with margin-adapted dimension $d_\gamma$ using a MEM algorithm with margin $\gamma$. We further show that for every distribution in a rich family of ‘light-tailed’ distributions—specifically, product distributions of sub-Gaussian random variables—the number of samples required for learning by minimizing the margin error is at least $\Omega(d_\gamma)$.
Denote by $m(\epsilon, \gamma, D)$ the number of examples required to achieve an excess error of no more than $\epsilon$ relative to the best possible $\gamma$-margin error for a specific distribution $D$, using a MEM algorithm. Our main result shows the following matching distribution-specific upper and lower bounds on the sample complexity of MEM:
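In display form, and assuming the notation above, the matching bounds can be written as follows (a reconstruction of Equation (1): the upper bound holds universally up to constants and logarithmic factors, while the lower bound is shown for the sub-Gaussian product family discussed above):

```latex
\Omega\!\left(d_{\gamma}\right) \;\le\; m(\epsilon,\gamma,D) \;\le\; \tilde{O}\!\left(\frac{d_{\gamma}}{\epsilon^{2}}\right). \tag{1}
```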
Our tight characterization, and in particular the distribution-specific lower bound on the sample complexity that we establish, can be used to compare large-margin ($\ell_2$-regularized) learning to other learning rules. We provide two such examples: we use our lower bound to rigorously establish a sample complexity gap between $\ell_1$ and $\ell_2$ regularization previously studied in Ng (2004), and to show a large gap between discriminative and generative learning on a Gaussian-mixture distribution. The tight bounds can also be used by active learning algorithms in which sample-complexity bounds are used to decide on the next label to query.
In this paper we focus only on large-margin classification. But in order to obtain the distribution-specific lower bound, we develop new tools that we believe can be useful for obtaining lower bounds for other learning rules as well. We provide several new results which we use to derive our main results. These include:
Linking the fat-shattering of a sample with non-negligible probability to the difficulty of learning using MEM.
Showing that for a convex hypothesis class, fat-shattering is equivalent to shattering with exact margins.
Linking the fat-shattering of a set of vectors with the eigenvalues of the Gram matrix of the vectors.
Providing a new lower bound on the smallest eigenvalue of a random Gram matrix generated by sub-Gaussian variables. This bound extends previous results in the analysis of random matrices.
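A minimal numerical illustration of the last point: for a few points with independent sub-Gaussian (here Rademacher, i.e. uniform ±1) coordinates in high dimension, the smallest eigenvalue of the Gram matrix is bounded well away from zero. The sketch below handles only the 2×2 case, where the eigenvalues have a closed form; the function name is ours.

```python
import random

def min_eig_2x2_gram(x1, x2):
    # Smallest eigenvalue of the Gram matrix [[<x1,x1>, <x1,x2>],
    #                                         [<x2,x1>, <x2,x2>]],
    # via the closed form for symmetric 2x2 matrices.
    a = sum(v * v for v in x1)
    c = sum(v * v for v in x2)
    b = sum(u * v for u, v in zip(x1, x2))
    return (a + c - ((a - c) ** 2 + 4 * b * b) ** 0.5) / 2

rng = random.Random(1)
d = 400
x1 = [rng.choice((-1.0, 1.0)) for _ in range(d)]
x2 = [rng.choice((-1.0, 1.0)) for _ in range(d)]
lam = min_eig_2x2_gram(x1, x2)
# For m = 2 rows the smallest eigenvalue is d - |<x1, x2>|, which for
# independent +-1 coordinates concentrates near d - O(sqrt(d)).
```
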
1.1 Paper Structure
We discuss related work on sample-complexity upper bounds in Section 2. We present the problem setting and notation in Section 3, and provide some necessary preliminaries in Section 4. We then introduce the margin-adapted dimension in Section 5. The sample-complexity upper bound is proved in Section 6. We prove the lower bound in Section 7. In Section 8 we show that any non-trivial sample-complexity lower bound for more general distributions must employ properties other than the covariance matrix of the distribution. We summarize and discuss implications in Section 9. Proofs omitted from the text are provided in Appendix A.
2 Related Work
As mentioned above, most work on “sample complexity lower bounds” is directed at proving that under some set of assumptions, there exists a data distribution for which one needs at least a certain number of examples to learn with required error and confidence (for instance Antos and Lugosi, 1998; Ehrenfeucht et al., 1988; Gentile and Helmbold, 1998). This type of lower bound does not, however, indicate much about the sample complexity of other distributions under the same set of assumptions.
For distribution-specific lower bounds, the classical analysis of Vapnik (1995, Theorem 16.6) provides not only sufficient but also necessary conditions for the learnability of a hypothesis class with respect to a specific distribution. The essential condition is that the metric entropy of the hypothesis class with respect to the distribution be sub-linear in the limit of an infinite sample size. In some sense, this criterion can be seen as providing a “lower bound” on learnability for a specific distribution. However, we are interested in finite-sample convergence rates, and would like those to depend on simple properties of the distribution. The asymptotic arguments involved in Vapnik’s general learnability claim do not lend themselves easily to such analysis.
Benedek and Itai (1991) show that if the distribution is known to the learner, a specific hypothesis class is learnable if and only if there is a finite $\epsilon$-cover of this hypothesis class with respect to the distribution. Ben-David et al. (2008) consider a similar setting, and prove sample complexity lower bounds for learning with any data distribution, for some binary hypothesis classes on the real line. Vayatis and Azencott (1999) provide distribution-specific sample complexity upper bounds for hypothesis classes with a limited VC-dimension, as a function of how balanced the hypotheses are with respect to the considered distributions. These bounds are not tight for all distributions, thus they also do not fully characterize the distribution-specific sample complexity.
As can be seen in Equation (1), we do not tightly characterize the dependence of the sample complexity on the desired error $\epsilon$ (as done, for example, in Steinwart and Scovel, 2007), thus our bounds are not tight for asymptotically small error levels. Our results are most significant if the desired error level is a constant well below chance but bounded away from zero. This is in contrast to classical statistical asymptotics, which are also typically tight, but are valid only for very small $\epsilon$. As was recently shown by Liang and Srebro (2010), the sample complexity for very small $\epsilon$ (in the classical statistical asymptotic regime) depends on quantities that can be very different from those that control the sample complexity for moderate error rates, which are more relevant for machine learning.
3 Problem Setting and Definitions
Consider a domain $\mathcal{X}$, and let $D$ be a distribution over $\mathcal{X} \times \{-1, +1\}$. We denote by $D_X$ the marginal distribution of $D$ on $\mathcal{X}$. The misclassification error of a classifier $h : \mathcal{X} \to \mathbb{R}$ on a distribution $D$ is
$$\ell_0(h, D) = \mathbb{P}_{(X,Y) \sim D}[Y h(X) \le 0].$$
The margin error of a classifier $h$ with respect to a margin $\gamma > 0$ on $D$ is
$$\ell_\gamma(h, D) = \mathbb{P}_{(X,Y) \sim D}[Y h(X) < \gamma].$$
For a given hypothesis class $\mathcal{H}$, the best achievable margin error on $D$ is
$$\ell^*_\gamma(\mathcal{H}, D) = \inf_{h \in \mathcal{H}} \ell_\gamma(h, D).$$
We usually write simply $\ell^*_\gamma(D)$ since $\mathcal{H}$ is clear from context.
A labeled sample is a (multi-)set $S = \{(x_i, y_i)\}_{i=1}^m \subseteq \mathcal{X} \times \{-1, +1\}$. Given $S$, we denote the set of its examples without their labels by $S_X = \{x_i\}_{i=1}^m$. We use $S$ also to refer to the uniform distribution over the elements in $S$. Thus the misclassification error of $h$ on $S$ is $\ell_0(h, S) = \frac{1}{m}\,|\{i : y_i h(x_i) \le 0\}|$, and the $\gamma$-margin error on $S$ is $\ell_\gamma(h, S) = \frac{1}{m}\,|\{i : y_i h(x_i) < \gamma\}|$.
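As an executable restatement of these empirical quantities, the following sketch computes the misclassification error and the $\gamma$-margin error of a linear classifier on a labeled sample (function names are ours; vectors are plain tuples):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def misclassification_error(w, sample):
    # Fraction of examples with y * <w, x> <= 0 (zero margin counts as an error).
    return sum(1 for x, y in sample if y * dot(w, x) <= 0) / len(sample)

def margin_error(w, sample, gamma):
    # Fraction of examples not classified correctly with margin at least gamma.
    return sum(1 for x, y in sample if y * dot(w, x) < gamma) / len(sample)
```

By construction, the misclassification error is at most the $\gamma$-margin error for any $\gamma > 0$.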
A learning algorithm is a function $\mathcal{A}$ that receives a training set as input and returns a function $\mathcal{A}(S) : \mathcal{X} \to \mathbb{R}$ for classifying objects in $\mathcal{X}$ into real values. The high-probability loss of an algorithm $\mathcal{A}$ with respect to samples of size $m$, a distribution $D$ and a confidence parameter $\delta \in (0,1)$ is the smallest $\epsilon$ such that, with probability at least $1 - \delta$ over samples $S \sim D^m$, $\ell_0(\mathcal{A}(S), D) \le \epsilon$.
In this work we investigate the sample complexity of learning using margin-error minimization (MEM). The relevant class of algorithms is defined as follows. A margin-error minimization (MEM) algorithm maps a margin parameter $\gamma > 0$ to a learning algorithm $\mathcal{A}_\gamma$, such that
$$\forall S, \quad \mathcal{A}_\gamma(S) \in \operatorname*{argmin}_{h \in \mathcal{H}} \ell_\gamma(h, S).$$
The distribution-specific sample complexity for MEM algorithms is the sample size required to guarantee low excess error for the given distribution. Formally, we have the following definition. [Distribution-specific sample complexity] Fix a hypothesis class $\mathcal{H}$. For $\epsilon, \delta \in (0,1)$, $\gamma > 0$, and a distribution $D$, the distribution-specific sample complexity, denoted by $m(\epsilon, \gamma, D, \delta)$, is the minimal sample size $m$ such that for any MEM algorithm $\mathcal{A}_\gamma$ and any $m' \ge m$, with probability at least $1 - \delta$ over $S \sim D^{m'}$,
$$\ell_0(\mathcal{A}_\gamma(S), D) \le \ell^*_\gamma(D) + \epsilon.$$
Note that we require that all possible MEM algorithms do well on the given distribution. This is because we are interested in the MEM strategy in general, and thus we study the guarantees that can be provided regardless of any specific MEM implementation. We sometimes omit $\delta$ and write simply $m(\epsilon, \gamma, D)$, to indicate that $\delta$ is assumed to be some fixed small constant.
In this work we focus on linear classifiers. For simplicity of notation, we assume a Euclidean space $\mathbb{R}^d$ for some integer $d$, although the results can be easily extended to any separable Hilbert space. For a real vector $x$, $\|x\|$ stands for the Euclidean norm. For a real matrix $X$, $\|X\|$ stands for the Euclidean operator norm.
Denote the unit ball in $\mathbb{R}^d$ by $\mathbb{B}_1^d$. We consider the hypothesis class of homogeneous linear separators, $\mathcal{W} = \{x \mapsto \langle w, x \rangle : w \in \mathbb{B}_1^d\}$. We often slightly abuse notation by using $w$ to denote the mapping $x \mapsto \langle w, x \rangle$.
We often represent sets of vectors in $\mathbb{R}^d$ using matrices. We say that $X \in \mathbb{R}^{m \times d}$ is the matrix of a set of $m$ vectors if the rows of the matrix are exactly the vectors in the set. For uniqueness, one may assume that the rows of $X$ are sorted according to an arbitrary fixed full order on vectors in $\mathbb{R}^d$. For a PSD matrix $M$, denote the largest eigenvalue of $M$ by $\lambda_{\max}(M)$ and the smallest eigenvalue by $\lambda_{\min}(M)$.
We use the $O$-notation as follows: $O(f)$ stands for a quantity bounded above by $a \cdot f + b$ for some constants $a, b \ge 0$; $\Omega(f)$ stands for a quantity bounded below by $a \cdot f - b$ for some constants $a, b \ge 0$; and $\tilde{O}(f)$ stands for a quantity bounded above by $f \cdot p(\log f) + b$ for some polynomial $p$ and some constant $b \ge 0$.
As mentioned above, for the hypothesis class of linear classifiers $\mathcal{W}$, one can derive a sample-complexity upper bound of the form $O(B^2/(\gamma^2\epsilon^2))$, where $B^2$ is the average squared norm of the data and $\epsilon$ is the excess error relative to the $\gamma$-margin loss. This can be achieved as follows (Bartlett and Mendelson, 2002). Let $\mathcal{Z}$ be some domain. The empirical Rademacher complexity of a class of functions $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{Z}}$ with respect to a set $S = \{z_1, \ldots, z_m\} \subseteq \mathcal{Z}$ is
$$\mathcal{R}(\mathcal{F}, S) = \mathbb{E}_\sigma\left[\sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^m \sigma_i f(z_i)\right],$$
where $\sigma_1, \ldots, \sigma_m$ are independent uniform $\{-1,+1\}$-valued variables. The average Rademacher complexity of $\mathcal{F}$ with respect to a distribution $D$ over $\mathcal{Z}$ and a sample size $m$ is $\mathcal{R}_m(\mathcal{F}, D) = \mathbb{E}_{S \sim D^m}[\mathcal{R}(\mathcal{F}, S)]$.
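For the class of norm-bounded linear functions, the supremum inside the empirical Rademacher complexity has a closed form, $\sup_{\|w\| \le 1} \frac{1}{m}\sum_i \sigma_i \langle w, x_i \rangle = \frac{1}{m}\|\sum_i \sigma_i x_i\|$ (by Cauchy–Schwarz), which makes a Monte Carlo estimate straightforward. A sketch under that standard identity (names are ours):

```python
import random

def empirical_rademacher_linear(xs, n_draws=2000, seed=0):
    # Monte Carlo estimate of E_sigma[(1/m) * || sum_i sigma_i x_i ||],
    # the empirical Rademacher complexity of {x -> <w, x> : ||w|| <= 1}.
    rng = random.Random(seed)
    m, d = len(xs), len(xs[0])
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(m)]
        s = [sum(sigma[i] * xs[i][j] for i in range(m)) for j in range(d)]
        total += sum(v * v for v in s) ** 0.5 / m
    return total / n_draws
```

For orthonormal points the norm $\|\sum_i \sigma_i x_i\|$ equals $\sqrt{m}$ for every sign pattern, so the estimate is exactly $\sqrt{m}/m = 1/\sqrt{m}$.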
Assume a hypothesis class $\mathcal{H}$ and a loss function $\ell$. For a hypothesis $h$, we introduce the function $f_h$, defined by $f_h(x, y) = \ell(h, (x, y))$. We further define the function class $\mathcal{F} = \{f_h : h \in \mathcal{H}\}$. Assume that the range of $\ell$ is in $[0, 1]$. For any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over the draw of samples of size $m$ according to $D$, every $h \in \mathcal{H}$ satisfies (Bartlett and Mendelson, 2002)
$$\mathbb{E}_{(X,Y) \sim D}[f_h(X, Y)] \le \frac{1}{m} \sum_{i=1}^m f_h(x_i, y_i) + 2\mathcal{R}_m(\mathcal{F}, D) + \sqrt{\frac{\ln(2/\delta)}{2m}}.$$
To get the desired upper bound for linear classifiers we use the ramp loss, which is defined as follows. For a number $r$, denote $[r]_+ = \max(0, r)$. The $\gamma$-ramp-loss of a labeled example $(x, y)$ with respect to a linear classifier $w$ is $\ell^{\mathrm{ramp}}_\gamma(w, (x, y)) = \min(1, [1 - y\langle w, x \rangle/\gamma]_+)$. Let $\gamma > 0$, and denote the class of ramp-loss functions by $\mathcal{F}^{\mathrm{ramp}}_\gamma = \{(x, y) \mapsto \ell^{\mathrm{ramp}}_\gamma(w, (x, y)) : w \in \mathbb{B}_1^d\}$.
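The $\gamma$-ramp-loss just defined is the hinge loss clipped to $[0, 1]$; a direct transcription (the function name is ours):

```python
def ramp_loss(w, x, y, gamma):
    # min(1, max(0, 1 - y<w,x>/gamma)): 0 when the margin is at least gamma,
    # 1 when the example is misclassified (margin <= 0), linear in between.
    m = y * sum(a * b for a, b in zip(w, x))
    return min(1.0, max(0.0, 1.0 - m / gamma))
```

This makes the sandwich property explicit: pointwise, the 0-1 error indicator is at most the ramp loss, which is at most the $\gamma$-margin error indicator.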
The ramp loss is upper-bounded by the margin loss and lower-bounded by the misclassification error. Therefore, the following result can be shown: for any MEM algorithm $\mathcal{A}_\gamma$, we have
Combining this with Proposition 4, we can conclude a sample complexity upper bound of $O(B^2/(\gamma^2\epsilon^2))$.
In addition to the Rademacher complexity, we will also use the classic notions of fat-shattering (Kearns and Schapire, 1994) and pseudo-shattering (Pollard, 1984), defined as follows. Let $\mathcal{F}$ be a set of functions $f : \mathcal{X} \to \mathbb{R}$, and let $\gamma > 0$. The set $S = \{x_1, \ldots, x_m\} \subseteq \mathcal{X}$ is $\gamma$-shattered by $\mathcal{F}$ with the witness $r = (r_1, \ldots, r_m) \in \mathbb{R}^m$ if for all $y \in \{-1, +1\}^m$ there is an $f \in \mathcal{F}$ such that $y_i(f(x_i) - r_i) \ge \gamma$ for all $i$. The $\gamma$-shattering dimension of a hypothesis class is the size of the largest set that is $\gamma$-shattered by this class. We say that a set is $\gamma$-shattered at the origin if it is $\gamma$-shattered with the zero vector as a witness.
Let $\mathcal{F}$ be a set of functions $f : \mathcal{X} \to \mathbb{R}$, and let $S = \{x_1, \ldots, x_m\} \subseteq \mathcal{X}$. The set $S$ is pseudo-shattered by $\mathcal{F}$ with the witness $r = (r_1, \ldots, r_m) \in \mathbb{R}^m$ if for all $y \in \{-1, +1\}^m$ there is an $f \in \mathcal{F}$ such that $\mathrm{sign}(f(x_i) - r_i) = y_i$ for all $i$. The pseudo-dimension of a hypothesis class is the size of the largest set that is pseudo-shattered by this class.
5 The Margin-Adapted Dimension
When considering learning of linear classifiers using MEM, the dimension-based upper bound and the norm-based upper bound are both tight in the worst-case sense, that is, they are the best bounds that rely only on the dimensionality or only on the norm, respectively. Nonetheless, neither is tight in a distribution-specific sense: if the average norm is unbounded while the dimension is small, then there can be an arbitrarily large gap between the true distribution-dependent sample complexity and the bound that depends on the average norm. If the converse holds, that is, the dimension is arbitrarily large while the average norm is bounded, then the dimensionality bound is loose.
Seeking a tight distribution-specific analysis, one simple approach to tighten these bounds is to consider their minimum, which is proportional to $\min(d, B^2/\gamma^2)/\epsilon^2$. Trivially, this minimum is an upper bound on the sample complexity as well. However, this simple combination is also not tight: consider a distribution in which there are a few directions with very high variance, but the combined variance in all other directions is small (see Figure 1). We will show that in such situations the sample complexity is characterized not by the minimum of dimension and norm, but by the sum of the number of high-variance dimensions and the average squared norm in the other directions. This behavior is captured by the margin-adapted dimension, which we presently define using the following auxiliary definition.
Let $b > 0$ and let $k$ be a positive integer. A distribution $D_X$ over $\mathbb{R}^d$ is $(b, k)$-limited if there exists a sub-space $V \subseteq \mathbb{R}^d$ of dimension $d - k$ such that $\mathbb{E}_{X \sim D_X}[\|P_V X\|^2] \le b$, where $P_V$ is an orthogonal projection onto $V$.
[Margin-adapted dimension] The margin-adapted dimension of a distribution $D_X$, denoted by $d_\gamma(D_X)$, is the minimum $k$ such that the distribution is $(\gamma^2 k, k)$-limited.
We sometimes drop the argument of $d_\gamma(D_X)$ when it is clear from context. It is easy to see that for any distribution over $\mathbb{R}^d$ with average squared norm $B^2$, $d_\gamma \le \min(d, B^2/\gamma^2)$. Moreover, $d_\gamma$ can be much smaller than this minimum. For example, consider a random vector with mean zero and statistically independent coordinates, such that the variance of the first coordinate is large, while the combined variance of all remaining coordinates is at most $\gamma^2$. Then $\min(d, B^2/\gamma^2)$ can be arbitrarily large, but $d_\gamma = 1$.
$d_\gamma$ can be calculated from the uncentered covariance matrix $\mathbb{E}[XX^\top]$ as follows: let $\lambda_1 \ge \cdots \ge \lambda_d$ be the eigenvalues of this matrix. Then
$$d_\gamma = \min\Big\{k : \sum_{i=k+1}^d \lambda_i \le \gamma^2 k\Big\}.$$
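Assuming the characterization $d_\gamma = \min\{k : \sum_{i > k} \lambda_i \le \gamma^2 k\}$, with $\lambda_1 \ge \cdots \ge \lambda_d$ the eigenvalues of the uncentered covariance matrix, the margin-adapted dimension is straightforward to compute (the function name is ours):

```python
def margin_adapted_dimension(eigs, gamma):
    # eigs: eigenvalues of the uncentered covariance matrix, in any order.
    lam = sorted(eigs, reverse=True)
    for k in range(len(lam) + 1):
        # Keep the k largest-variance directions; require the residual
        # variance (sum of remaining eigenvalues) to be at most gamma^2 * k.
        if sum(lam[k:]) <= gamma ** 2 * k:
            return k
    return len(lam)
```

For example, with eigenvalues $(9, 1, 1, 1)$ and $\gamma = 1$ the residual variance after two directions is $2 \le \gamma^2 \cdot 2$, so $d_\gamma = 2$, even though both $d = 4$ and $B^2/\gamma^2 = 12$ are larger.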
A quantity similar to $d_\gamma$ was studied previously in Bousquet (2002). The eigenvalues of the empirical covariance matrix were used to provide sample complexity bounds, for instance in Schölkopf et al. (1999). However, $d_\gamma$ generates a different type of bound, since it is defined based on the eigenvalues of the distribution and not of the sample. We will see that for small finite samples, the latter can be quite different from the former.
Finally, note that while we define the margin-adapted dimension for a finite-dimensional space for ease of notation, the same definition carries over to an infinite-dimensional Hilbert space. Moreover, $d_\gamma$ can be finite even if some of the eigenvalues are infinite, implying a distribution with unbounded covariance.
6 A Distribution-Dependent Upper Bound
In this section we prove an upper bound on the sample complexity of learning with MEM, using the margin-adapted dimension. We do this by providing a tighter upper bound for the Rademacher complexity of the ramp-loss class $\mathcal{F}^{\mathrm{ramp}}_\gamma$. We bound this Rademacher complexity for any $(b, k)$-limited distribution, using covering numbers, defined as follows.
Let $(\mathcal{Z}, \|\cdot\|)$ be a normed space. An $\epsilon$-covering of a set $F \subseteq \mathcal{Z}$ with respect to the norm is a set $C \subseteq \mathcal{Z}$ such that for any $f \in F$ there exists a $g \in C$ such that $\|f - g\| \le \epsilon$. The covering number for given $\epsilon$, $F$ and $\|\cdot\|$ is the size of the smallest such $\epsilon$-covering, and is denoted by $N(\epsilon, F, \|\cdot\|)$. Let $S$ be a finite set of size $m$. For a function $f$, the $L_2(S)$ norm of $f$ is $\|f\|_{L_2(S)} = \sqrt{\frac{1}{m}\sum_{z \in S} f(z)^2}$. Thus, we consider covering numbers of the form $N(\epsilon, \mathcal{F}, L_2(S))$.
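Covering numbers are rarely computed exactly, but a greedy construction gives a quick two-sided handle: the greedily chosen centers form an $\epsilon$-cover (so their count upper-bounds the covering number), and they are pairwise more than $\epsilon$ apart, i.e. an $\epsilon$-packing. A sketch for finite point sets under an arbitrary metric (names are ours):

```python
def greedy_cover(points, eps, dist):
    # Scan the points, keeping each one that is farther than eps from all
    # previously kept centers. The kept centers form an eps-cover of the
    # input and are pairwise > eps apart (an eps-packing).
    centers = []
    for p in points:
        if all(dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers
```
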
The empirical Rademacher complexity of a function class can be bounded by the covering numbers of the same function class via a chaining argument (Mendelson, 2002, Lemma 3.7).
To bound the covering number of the ramp-loss class, we will restate its functions as sums of two functions, each selected from a function class with bounded complexity. The first function class will be bounded because of the norm bound on the sub-space used in Definition 5, and the second function class will have a bounded pseudo-dimension. However, the second function class will depend on the choice of the first function in the sum. Therefore, we require the following lemma, which provides an upper bound on covering numbers for such sums of functions. We use the notion of the Hausdorff distance between two sets $A, B \subseteq \mathcal{Z}$, defined as $\Delta(A, B) = \max\{\sup_{a \in A} \inf_{b \in B} \|a - b\|,\ \sup_{b \in B} \inf_{a \in A} \|a - b\|\}$.
Let be a normed space. Let be a set, and let be a mapping from objects in to sets of objects in . Assume that is -Lipschitz with respect to the Hausdorff distance on sets, that is, assume that
Let . Then
For any set , denote by a minimal -covering for with respect to , so that . Let such that . There is a such that . In addition, by the Lipschitz assumption there is a such that . Lastly, there is a such that . Therefore
Thus the set is a cover of . The size of this cover is at most .
The following lemma provides us with a useful class of mappings which are -Lipschitz with respect to the Hausdorff distance, as required in Lemma 6. The proof is provided in Appendix A.2. Let be a function and let be a function class over some domain . Let be the mapping defined by
Then is -Lipschitz with respect to the Hausdorff distance. The function class induced by the mapping above preserves the pseudo-dimension of the original function class, as the following lemma shows. The proof is provided in Appendix A.3.
Let be a function and let be a function class over some domain . Let be defined as in Equation (6). Then the pseudo-dimension of is at most the pseudo-dimension of .
Equipped with these lemmas, we can now provide the new bound on the Rademacher complexity of in the following theorem. The subsequent corollary states the resulting sample-complexity upper bound for MEM, which depends on . Let be a distribution over , and assume is -limited. Then
In this proof all absolute constants are assumed to be positive and are denoted by $c$ or $c_i$ for some integer $i$. Their values may change from line to line or even within the same line.
Consider the distribution which results from drawing and emitting . It too is -limited, and . Therefore, we assume without loss of generality that for all drawn from , . Accordingly, we henceforth omit the argument from and write simply .
Following Definition 5, let be an orthogonal projection onto a sub-space of dimension such that . Let be the complementary sub-space to . For a set , denote .
We would like to use Equation (5) to bound the Rademacher complexity of . Therefore, we will bound for . Note that
Shifting by a constant and negating do not change the covering number of a function class. Therefore, is equal to the covering number of . Moreover, let
Then , thus it suffices to bound . To do that, we show that satisfies the assumptions of Lemma 6 for the normed space . Define
Let be the mapping defined by
We now proceed to bound the two covering numbers on the right hand side. First, consider . By Lemma 6, the pseudo-dimension of is the same as the pseudo-dimension of , which is exactly , the dimension of . The covering number of can be bounded by the pseudo-dimension of as follows (see, e.g., Bartlett, 2006, Theorem 3.1):
where are independent standard normal variables. The right-hand side can be bounded as follows:
To finalize the proof, we plug this inequality into Equation (5) to get
In the last inequality we used the fact that . Setting we get
Taking expectation over both sides, and noting that , we get
Corollary (Sample complexity upper bound)
Let be a distribution over . Then
By Proposition 4, we have
By definition of , is -limited. Therefore, by Theorem 6,
We conclude that
Bounding the second right-hand term by , we conclude that .
One should note that a similar upper bound can be obtained much more easily under a uniform upper bound on the eigenvalues of the uncentered covariance matrix. (This has been pointed out to us by an anonymous reviewer of this manuscript.) An upper bound under sub-Gaussianity assumptions can be found in Sabato et al. (2010).
However, such an upper bound would not capture the fact that a finite dimension implies a finite sample complexity, regardless of the size of the covariance. If one wants to estimate the sample complexity, then large covariance matrix eigenvalues imply that more examples are required to estimate the covariance matrix from a sample. However, these examples need not be labeled. Moreover, estimating the covariance matrix is not necessary to achieve the sample complexity, since the upper bound holds for any margin-error minimization algorithm.
7 A Distribution-Dependent Lower Bound
The new upper bound presented in Corollary 6 can be tighter than both the norm-only and the dimension-only upper bounds. But does the margin-adapted dimension characterize the true sample complexity of the distribution, or is it just another upper bound? To answer this question, we first need tools for deriving sample-complexity lower bounds. Section 7.1 relates fat-shattering to a lower bound on the sample complexity. In Section 7.2 we use this result to relate the smallest eigenvalue of a Gram matrix to a lower bound on the sample complexity. In Section 7.3 the family of sub-Gaussian product distributions is presented. We prove a sample-complexity lower bound for this family in Section 7.4.
7.1 A Sample Complexity Lower Bound Based on Fat-Shattering
The ability to learn is closely related to the probability of a sample to be shattered, as evident in Vapnik’s formulations of learnability as a function of the $\epsilon$-entropy (Vapnik, 1995). It is well known that the maximal size of a shattered set dictates a sample-complexity upper bound. In the theorem below, we show that for some hypothesis classes it also implies a lower bound. The theorem states that if a sample drawn from a data distribution is fat-shattered with non-negligible probability, then MEM can fail to learn a good classifier for this distribution. (In contrast, the average Rademacher complexity cannot be used to derive general lower bounds for MEM algorithms, since it is related to the rate of uniform convergence of the entire hypothesis class, while MEM algorithms choose low-error hypotheses; see, e.g., Bartlett et al., 2005.) This holds not only for linear classifiers, but more generally for all symmetric hypothesis classes. Given a domain $\mathcal{X}$, we say that a hypothesis class $\mathcal{H}$ is symmetric if for all $h \in \mathcal{H}$, we have $-h \in \mathcal{H}$ as well. This clearly holds for the class of linear classifiers $\mathcal{W}$.
Let be some domain, and assume that is a symmetric hypothesis class. Let be a distribution over . If the probability that a sample of size drawn from is -shattered at the origin by is at least , then for all . Let . We show a MEM algorithm such that
thus proving the desired lower bound on .
Assume for simplicity that is even (otherwise replace with ). Consider two sets , each of size , such that is -shattered at the origin by . Then there exists a hypothesis such that the following holds:
For all , .
For all , .
For all , .
It follows that . In addition, let . Then . Moreover, we have due to the symmetry of . On each point in , at least one of and predicts the wrong sign. Thus . It follows that for at least one of , we have . Denote the set of hypotheses with a high misclassification error by
We have just shown that if is -shattered by then at least one of the following holds: (1) or (2) .
Now, consider a MEM algorithm such that whenever possible, it returns a hypothesis from . Formally, given the input sample , if , then . It follows that
The last inequality follows from the argument above regarding and . The last expression is simply half the probability that a sample of size from is shattered. By assumption, this probability is at least . The claimed lower bound follows.
As a side note, it is interesting to observe that Theorem 7.1 does not hold in general for non-symmetric hypothesis classes. For example, assume that the domain is , and the hypothesis class is the set of all functions that label a finite number of points in by and the rest by . Consider learning using MEM, when the distribution is uniform over , and all the labels are . For any and , a sample of size is -shattered at the origin with probability . However, any learning algorithm that returns a hypothesis from the hypothesis class will incur zero error on this distribution. Thus, shattering alone does not suffice to ensure that learning is hard.
7.2 A Sample Complexity Lower Bound with Gram-Matrix Eigenvalues
We now return to the case of homogeneous linear classifiers, and link high-probability fat-shattering to properties of the distribution. First, we present an equivalent and simpler characterization of fat-shattering for linear classifiers. We then use it to provide a sufficient condition for the fat-shattering of a sample, based on the smallest eigenvalue of its Gram matrix.
Let $X$ be the matrix of a set $S$ of size $m$ in $\mathbb{R}^d$. The set $S$ is $\gamma$-shattered at the origin by $\mathcal{W}$ if and only if $XX^\top$ is invertible and for all $y \in \{-1, +1\}^m$, $\gamma^2\, y^\top (XX^\top)^{-1} y \le 1$. To prove Theorem 7.2 we require two auxiliary lemmas. The first lemma, stated below, shows that for convex function classes, $\gamma$-shattering can be substituted with shattering with exact margins of $\gamma$. Let
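For a two-point set this criterion is easy to check directly. Assuming the condition takes the form "$XX^\top$ invertible and $\gamma^2 y^\top (XX^\top)^{-1} y \le 1$ for every sign vector $y$" (which matches the minimum-norm interpolation argument: the smallest $w$ with $\langle w, x_i \rangle = \gamma y_i$ has squared norm $\gamma^2 y^\top (XX^\top)^{-1} y$), the $m = 2$ case can be sketched as follows (the function name is ours):

```python
def gamma_shattered_at_origin(x1, x2, gamma):
    # Gram matrix G = [[a, b], [b, c]] of the two points.
    a = sum(v * v for v in x1)
    c = sum(v * v for v in x2)
    b = sum(u * v for u, v in zip(x1, x2))
    det = a * c - b * b
    if det <= 0:
        return False  # G is singular: the points are linearly dependent.
    # For y in {-1,+1}^2, y^T G^{-1} y = (c - 2*b*(y1*y2) + a) / det,
    # so it suffices to check the worse of the two values of y1*y2.
    worst = max((c - 2 * b * s + a) / det for s in (-1.0, 1.0))
    return gamma ** 2 * worst <= 1.0
```

For two orthonormal points the test reduces to $2\gamma^2 \le 1$, i.e. $\gamma \le 1/\sqrt{2}$, matching the fact that $w = \gamma(y_1 x_1 + y_2 x_2)$ is the smallest vector realizing margins of exactly $\gamma$.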