# Sample Complexity Result for Multi-category Classifiers of Bounded Variation

We control the probability of the uniform deviation between the empirical and generalization performances of multi-category classifiers by an empirical $L_1$-norm covering number when these performances are defined on the basis of the truncated hinge loss function. The only assumption made on the functions implemented by multi-category classifiers is that they are of bounded variation (BV). For such classifiers, we derive a sample size estimate sufficient for these performances to be close with high probability. In particular, we are interested in the dependency of this estimate on the number $C$ of classes. To this end, we first upper bound the fat-shattering dimension (the scale-sensitive version of the VC-dimension) of sets of BV functions defined on $\mathbb{R}^d$, which gives an $O(1/\epsilon^d)$ bound as the scale $\epsilon$ goes to zero. Secondly, we provide a sharper decomposition result for the fat-shattering dimension in terms of $C$, which for sets of BV functions gives an improvement from $O(C^{d/2+1})$ to $O(C\ln^2(C))$. This improvement then propagates to the sample complexity estimate.

## 1 Introduction

In the VC framework [42], both for binary and multi-category classification tasks, when only minimal assumptions are made on the predictive model, the (optimal) way one controls the uniform convergence of the empirical performance to the generalization one depends on the loss function on which these performances are based. The choice of the loss function leads to an upper bound involving one of several capacity measures, the quantities characterizing the rate of the uniform convergence. The seminal work dealt with the standard indicator loss function [43], leading to bounds involving the VC-dimension as a capacity measure. This was improved in [12] via the Rademacher complexity, since the latter capacity measure can be upper bounded in terms of the VC-dimension. Classifiers implementing real-valued functions offer a richer setting for assessing classification performance, since it can be defined based on a family of margin loss functions which fall into two classes: the margin indicator loss function and those that are Lipschitz continuous [27]. A generalization bound on the basis of the margin indicator loss function was first obtained in [9], and extended to the multi-class case in [22, 23]. These bounds are in terms of the empirical $L_\infty$-norm covering number as a capacity measure. For Lipschitz continuous margin loss functions, analogous bounds for general function classes, both in the binary and the multi-class setting, involve the Rademacher complexity as a capacity measure [27, 30, 31, 23, 37]. In this paper, for an instance of the above-mentioned loss functions, we are interested in the multi-class extension of the uniform Glivenko-Cantelli result, Lemma 10 combined with Lemma 11 of [11], a result controlled by an empirical $L_1$-norm covering number.

The main (and the sole) assumption we make in this paper regarding predictive models is that the functions they implement are of bounded variation. According to Helly’s selection theorem, the space of bounded variation functions (the BV space) can be compactly embedded in the $L_1$-space [8, 3]. In this sense it is relevant to study the uniform convergence over sets of this space of functions via an $L_1$-norm covering number. Also, the BV space contains other interesting classes of functions such as absolutely continuous functions, Lipschitz continuous functions, as well as the Sobolev space $W^{1,1}$ (and mainly has applications in image processing tasks [7]). The functions implemented by most classifiers (such as support vector machines [16, 4] and nearest neighbours [28]) can be said to be of bounded variation, and thus the assumption we make is not too restrictive.

In our extension, we closely follow the combinatorial method of Pollard [38], by which he derived the rate of the uniform convergence for the classical Glivenko-Cantelli problem via an empirical $L_1$-norm approximation of the function set. We then translate this result into a sample complexity estimate (as in [11]), i.e., for fixed $\epsilon$ and $\delta$, a minimum sample size sufficient for the uniform deviation between the empirical and true means to be at most $\epsilon$ with probability at least $1-\delta$. The VC theory relies on the independent (and identical) distribution assumption, but it still applies if one replaces the independence assumption with the asymptotic independence one, the condition satisfied by the so-called mixing processes [15, 46]. Following [33], we also extend the mentioned results to such processes, where instead of the sample size we now deal with the number of independent blocks (or efficient sample size).

Regardless of the setting, the main focus of the present work is on elaborating the dependency of a sample complexity estimate on the number $C$ of classes. The first step towards this goal is the estimation of the metric entropy of sets of the BV space. Such a bound exists in the $L_1$-norm [19], and can be shown to hold with respect to the $L_1(P)$-norm for all probability measures $P$ with Lebesgue densities. Instead, we upper bound the fat-shattering dimension of sets of the BV space, which can then be substituted in metric entropy bounds for general function classes [2, 10, 35, 37] that hold for any probability measure on the domain of the mentioned functions. We obtain a bound scaling as $O(\epsilon^{-d})$ as $\epsilon \to 0$. To make explicit the dependency on the number of classes, we appeal to a particular kind of bound, the decomposition of a capacity measure, which allows one to upper bound a capacity measure of a composite class by those of its basic classes. Since we are dealing with two capacity measures, the covering number (or the metric entropy) and the fat-shattering dimension, and since they are related to each other via a combinatorial bound (or the metric entropy bound), one can perform the decomposition at the level of either of them. Decomposition results exist for covering numbers and the fat-shattering dimension [23, 17]. The main contribution of this paper is a new efficient decomposition result for the fat-shattering dimension which scales with $C$ as $O(C\ln^2(C))$. This is an improvement over that in [17] applied in the multi-class setting, where the dependency on $C$ worsens with the growth rate of the fat-shattering dimension of basic classes; in particular, for sets of the BV space it is $O(C^{d/2+1})$. This decomposition leads to a new, dimension-free (i.e., not depending on the sample size $n$) metric entropy bound with a sharper dependency on $C$ compared to Corollary 1 in [37]. 
The application of this result gives a sample complexity estimate with a dependency improving upon a obtained based on the decomposition of the -norm metric entropies with , and with . For that in , our bound gives a comparable result.

The rest of the paper is organized as follows. In Section 2 we introduce the theoretical background. Section 3 is dedicated to upper bounding the probability of the uniform deviation between empirical and generalization performances of the classifiers of interest by an empirical $L_1$-norm covering number. Section 4 discusses metric entropy bounds for sets of the BV space, and for such sets derives a new upper bound on the fat-shattering dimension. In Section 5 we introduce an efficient decomposition of the fat-shattering dimension, based on which we derive a sample complexity estimate and compare it with the estimate obtained via the decomposition of the empirical $L_p$-norm metric entropy. Conclusions and ongoing research are given in Section 6. Finally, the case of mixing processes is addressed in the Appendix.

## 2 Preliminaries

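We consider $C$-category pattern classification problems with finite $C$. Each object is represented by its description $x \in \mathcal{X}$ and the categories $y$ belong to the set $\mathcal{Y} = \{1, \dots, C\}$. Let $Z = (X, Y)$ be a random pair with values in $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, distributed according to an unknown probability measure $P$. The only information about $P$ is given by an $n$-sample $(Z_i)_{1 \leqslant i \leqslant n}$ made up of independent copies of $Z$.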

The classifiers considered in the present manuscript are defined based on classes $\mathcal{G}$ of functions mapping $\mathcal{X}$ to a hypercube in $\mathbb{R}^C$ and a decision rule which, for each $g = (g_k)_{1 \leqslant k \leqslant C} \in \mathcal{G}$ and for each $x \in \mathcal{X}$, returns either the index of the component function whose value is the highest or a dummy category in case of ties. The classification performance of such classifiers can be assessed based on functions computing the difference between two component functions,

$$\forall (x,y) \in \mathcal{Z}, \quad f_g(x,y) = \frac{1}{2}\left(g_y(x) - \max_{k \neq y} g_k(x)\right),$$

and a margin loss function which penalizes all values below some margin $\gamma$. The performance of most well-known classifiers such as neural networks [4], support vector machines [16], nearest neighbours [28] and boosting methods [39] can be studied in this margin framework. The margin loss function used here is the (parametrized) truncated hinge loss defined as

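$$\phi_\gamma(t) = \begin{cases} 1, & t \leqslant 0, \\ 1 - \dfrac{t}{\gamma}, & t \in (0, \gamma], \\ 0, & t > \gamma, \end{cases}$$

where $t \in \mathbb{R}$ and $\gamma > 0$ is the margin parameter, and which is $\frac{1}{\gamma}$-Lipschitz continuous. This loss function is “insensitive” to the values of its argument strictly below zero and above $\gamma$ in the sense that, if instead of the functions $f_g$ we use their truncated versions

$$f_{g,\gamma}(x,y) = \max\left(0, \min\left(\gamma, \frac{1}{2}\left(g_y(x) - \max_{k \neq y} g_k(x)\right)\right)\right),$$

where $(x,y) \in \mathcal{Z}$, it holds

$$\phi_\gamma(f_g(z)) = \phi_\gamma(f_{g,\gamma}(z)), \quad \forall z \in \mathcal{Z}.$$

We denote $\mathcal{F}_{\mathcal{G}} = \{f_g : g \in \mathcal{G}\}$, and $\mathcal{F}_{\mathcal{G},\gamma} = \{f_{g,\gamma} : g \in \mathcal{G}\}$ for fixed $\gamma$. This transition from $\mathcal{F}_{\mathcal{G}}$ to $\mathcal{F}_{\mathcal{G},\gamma}$ results in tighter upper bounds, since the co-domain is now shrunk to $[0, \gamma]$.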

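With these definitions at hand, we can now define the margin risk of every $g \in \mathcal{G}$ as

$$L_\gamma(g) = \mathbb{E}_Z\left[\phi_\gamma(f_{g,\gamma}(Z))\right],$$

and its empirical margin risk as

$$L_{\gamma,n}(g) = \frac{1}{n}\sum_{i=1}^n \phi_\gamma(f_{g,\gamma}(Z_i)).$$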
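The definitions above can be illustrated with a minimal sketch. It assumes that each example is given as a list of the $C$ component scores $g_k(x)$ together with a 0-based label; the function names are hypothetical and chosen only for this illustration.

```python
def margin_function(scores, y):
    """f_g(x, y) = 0.5 * (g_y(x) - max_{k != y} g_k(x)) for one example."""
    best_other = max(s for k, s in enumerate(scores) if k != y)
    return 0.5 * (scores[y] - best_other)


def truncate(t, gamma):
    """Truncation to [0, gamma], turning f_g into f_{g,gamma}."""
    return max(0.0, min(gamma, t))


def truncated_hinge(t, gamma):
    """phi_gamma(t): 1 for t <= 0, 1 - t/gamma on (0, gamma], 0 for t > gamma."""
    return min(1.0, max(0.0, 1.0 - t / gamma))


def empirical_margin_risk(score_rows, labels, gamma):
    """L_{gamma,n}(g): average truncated hinge loss over the sample."""
    losses = [
        truncated_hinge(truncate(margin_function(s, y), gamma), gamma)
        for s, y in zip(score_rows, labels)
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    rows = [[0.9, 0.1, 0.0], [0.2, 0.6, 0.2], [0.5, 0.4, 0.1]]
    labels = [0, 1, 1]  # 0-based class indices (an assumption of this sketch)
    print(empirical_margin_risk(rows, labels, gamma=0.5))
```

The truncation step does not change the loss value, which is the identity $\phi_\gamma(f_g(z)) = \phi_\gamma(f_{g,\gamma}(z))$ stated above; it only shrinks the range of the functions whose capacity is measured.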

We are interested in the rate of convergence of the empirical margin risk to the margin risk uniformly over $\mathcal{G}$. This rate is controlled by the capacity of a classifier: the higher the capacity, the slower the convergence. There are several families of capacity measures: the ones appearing in this work are covering/packing numbers [26] and the fat-shattering dimension [25]. They are defined below.

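We denote by $\mathcal{F}$ a class of real-valued functions on some metric space $\mathcal{T}$. Denote by $\bar{\mathcal{F}} \subseteq \mathcal{F}$ a (proper) $\epsilon$-net of $\mathcal{F}$ with respect to the metric $\rho$:

$$\forall f \in \mathcal{F}, \ \exists \bar{f} \in \bar{\mathcal{F}}, \quad \rho(f, \bar{f}) < \epsilon.$$

The covering number of $\mathcal{F}$, $\mathcal{N}(\epsilon, \mathcal{F}, \rho)$, is the smallest cardinality of $\epsilon$-nets of $\mathcal{F}$. Related to the covering number is the notion of packing number. A subset $\mathcal{F}' \subseteq \mathcal{F}$ is $\epsilon$-separated with respect to the metric $\rho$ if for any two distinct elements $f, f' \in \mathcal{F}'$, $\rho(f, f') \geqslant \epsilon$. The $\epsilon$-packing number of $\mathcal{F}$ is the maximal cardinality of its $\epsilon$-separated subsets. The (pseudo-)metric of interest in this work is the empirical one: for any $f, f' \in \mathcal{F}$ and $\mathbf{t}_n = (t_i)_{1 \leqslant i \leqslant n} \in \mathcal{T}^n$, define $d_{p,\mathbf{t}_n}$ as

$$d_{p,\mathbf{t}_n}(f, f') = \left(\frac{1}{n}\sum_{i=1}^n \left|f(t_i) - f'(t_i)\right|^p\right)^{\frac{1}{p}}, \quad \forall p \in [1, +\infty)$$

and $d_{\infty,\mathbf{t}_n}(f, f') = \max_{1 \leqslant i \leqslant n} \left|f(t_i) - f'(t_i)\right|$. Note that, since $d_{p,\mathbf{t}_n} \leqslant d_{q,\mathbf{t}_n}$ for any $1 \leqslant p \leqslant q \leqslant \infty$, there holds

$$\mathcal{N}(\epsilon, \mathcal{F}, d_{p,\mathbf{t}_n}) \leqslant \mathcal{N}(\epsilon, \mathcal{F}, d_{q,\mathbf{t}_n}). \tag{1}$$

We denote $\mathcal{N}_p(\epsilon, \mathcal{F}, n) = \sup_{\mathbf{t}_n \in \mathcal{T}^n} \mathcal{N}(\epsilon, \mathcal{F}, d_{p,\mathbf{t}_n})$, and similarly for packing numbers. The logarithm of the covering number is called the metric entropy.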
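As a small illustration of the empirical pseudo-metric and of $\epsilon$-nets, the following sketch represents each function of a finite class by its vector of values on the sample. The helper names are hypothetical, and the greedy construction only yields an upper bound on the covering number (its centers are also pairwise $\epsilon$-separated), not its exact value.

```python
def empirical_distance(f_vals, g_vals, p):
    """d_{p,t_n}(f, f') computed from the values of f and f' on the sample t_n."""
    diffs = [abs(a - b) for a, b in zip(f_vals, g_vals)]
    if p == float("inf"):
        return max(diffs)
    return (sum(d ** p for d in diffs) / len(diffs)) ** (1.0 / p)


def greedy_net(function_values, eps, p):
    """Indices of a proper eps-net built greedily.

    A function becomes a new center only if it is at distance >= eps from all
    current centers, so every function ends up strictly within eps of a center.
    """
    centers = []
    for i, vals in enumerate(function_values):
        if all(empirical_distance(vals, function_values[c], p) >= eps
               for c in centers):
            centers.append(i)
    return centers


if __name__ == "__main__":
    # Five threshold-like functions evaluated on a sample of four points.
    F = [[0, 0, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0], [1, 1, 1, 1]]
    for p in (1, 2, float("inf")):
        print(p, len(greedy_net(F, 0.3, p)))
```

On this toy class the net found for $p = 1$ is smaller than the one found for $p = \infty$, in line with Inequality (1).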

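For $\epsilon > 0$, a subset $\{t_1, \dots, t_n\}$ of $\mathcal{T}$ is said to be $\epsilon$-shattered by $\mathcal{F}$ if there is a witness $s: \mathcal{T} \to \mathbb{R}$ such that for any $(b_i)_{1 \leqslant i \leqslant n} \in \{-1, 1\}^n$, there is a function $f \in \mathcal{F}$ satisfying:

$$\forall i \in \{1, \dots, n\}, \quad b_i\left(f(t_i) - s(t_i)\right) \geqslant \epsilon.$$

The fat-shattering dimension of $\mathcal{F}$ at scale $\epsilon$, $d_{\mathcal{F}}(\epsilon)$, is the maximal cardinality of a subset of $\mathcal{T}$ $\epsilon$-shattered by $\mathcal{F}$, if such a maximum exists; otherwise $\mathcal{F}$ is said to have infinite fat-shattering dimension at scale $\epsilon$.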
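For a finite class and a given witness, the shattering condition can be checked by brute force. The sketch below is only an illustration under the assumption that functions and witness are given by their values on the candidate points; `is_shattered` is a hypothetical helper name, and the enumeration is exponential in the number of points.

```python
from itertools import product


def is_shattered(function_values, witness, eps):
    """Check whether the points are eps-shattered with the given witness.

    function_values: list of tuples (f(t_1), ..., f(t_n)), one per function.
    witness: tuple (s(t_1), ..., s(t_n)).
    """
    n = len(witness)
    for signs in product((-1, 1), repeat=n):
        realised = any(
            all(b * (f[i] - witness[i]) >= eps for i, b in enumerate(signs))
            for f in function_values
        )
        if not realised:
            return False  # this sign pattern is not realised at scale eps
    return True


if __name__ == "__main__":
    all_binary = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
    print(is_shattered(all_binary, (0.5, 0.5), 0.5))
    print(is_shattered(all_binary, (0.5, 0.5), 0.6))
```

Deciding the fat-shattering dimension itself would additionally require searching over subsets of points and over witnesses, which this sketch does not attempt.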

In this work, we make the regularity assumption on the classes of component functions that they are of bounded variation. To introduce the space of functions of bounded variation, we shall first give the definitions of Lebesgue and Sobolev spaces. For a broader account of these spaces the reader may consult [1, 21, 47, 6].

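Let $\mathcal{F}$ denote the set of all real-valued measurable functions on a measure space $(\mathcal{T}, \mathcal{A}, \mu)$. For all $p \in [1, \infty)$, $L_p(\mathcal{T}, \mathcal{A}, \mu)$ is the Lebesgue space of (equivalence classes of) $p$-summable functions $f \in \mathcal{F}$:

$$\|f\|_{L_p(\mu)} = \left(\int_{\mathcal{T}} |f(t)|^p \, d\mu(t)\right)^{\frac{1}{p}} < \infty,$$

and

$$L_\infty(\mathcal{T}, \mathcal{A}, \mu) = \left\{f \in \mathcal{F} : \|f\|_{L_\infty(\mu)} = \operatorname*{ess\,sup}_{t \in \mathcal{T}} |f(t)| < \infty\right\},$$

where $\operatorname*{ess\,sup}_{t \in \mathcal{T}} |f(t)| = \inf\{a \geqslant 0 : \mu(\{t \in \mathcal{T} : |f(t)| > a\}) = 0\}$. We abbreviate $L_p(\mu) = L_p(\mathcal{T}, \mathcal{A}, \mu)$, and denote the metric induced from the norm by $d_{L_p(\mu)}$. In the rest of the section we assume $\mu$ to be the Lebesgue measure, which in the case of Euclidean spaces coincides with the usual notions of length, area and volume. In such a case, we will drop the measure from the notation of the norm.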

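Let $\mathcal{T}$ be an open subset of $\mathbb{R}^d$. Let $k \in \mathbb{N}$ and let $\alpha = (\alpha_1, \dots, \alpha_d) \in \mathbb{N}^d$ be a multi-index with $|\alpha| = \sum_{i=1}^d \alpha_i$. For any $i \in \{1, \dots, d\}$, the partial derivatives are denoted by $D_i = \frac{\partial}{\partial t_i}$ and the higher order partial derivatives by

$$D^\alpha = \frac{\partial^{|\alpha|}}{\partial t_1^{\alpha_1} \cdots \partial t_d^{\alpha_d}}.$$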

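The gradient of a real-valued function $f$ on $\mathcal{T}$ is denoted by $Df = (D_1 f, \dots, D_d f)$. Denote by $C^k(\mathcal{T})$ the set of $k$-times continuously differentiable real-valued functions on $\mathcal{T}$, and abbreviate $C(\mathcal{T}) = C^0(\mathcal{T})$. Let $C_c^k(\mathcal{T}, \mathbb{R}^d)$ be the set of $k$-times continuously differentiable functions from $\mathcal{T}$ to $\mathbb{R}^d$ with compact support contained in $\mathcal{T}$. For a given function $f$ and a given multi-index $\alpha$, $f_w$ is called the $\alpha$-th weak derivative of $f$ if, for all compactly supported smooth test functions $\phi$ on $\mathcal{T}$,

$$\int_{\mathcal{T}} f(t) D^\alpha \phi(t) \, dt = (-1)^{|\alpha|} \int_{\mathcal{T}} \phi(t) f_w(t) \, dt.$$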

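Let $p \in [1, \infty]$. Denote by $W^{k,p}(\mathcal{T})$ the space of functions $f \in L_p(\mathcal{T})$ with $D^\alpha f$ (in the sense of weak derivative) in $L_p(\mathcal{T})$ for all $|\alpha| \leqslant k$, and with the norm defined through

$$\|f\|_{W^{k,p}}^p = \int_{\mathcal{T}} \sum_{0 \leqslant |\alpha| \leqslant k} |D^\alpha f(t)|^p \, dt = \sum_{0 \leqslant |\alpha| \leqslant k} \|D^\alpha f\|^p_{L_p(\mathcal{T})}$$

for $p \in [1, \infty)$ and

$$\|f\|_{W^{k,\infty}} = \max_{0 \leqslant |\alpha| \leqslant k} \|D^\alpha f\|_{L_\infty(\mathcal{T})}.$$

$W^{k,p}(\mathcal{T})$ is called the Sobolev space of integer order. Now we are ready to give the definition of the BV space.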

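A function $f$ on $\mathcal{T}$ is said to be of bounded variation if and only if it is in $L_1(\mathcal{T})$, and $Df = (D_1 f, \dots, D_d f)$ is a finite (vector) Radon measure (i.e., for any $i \in \{1, \dots, d\}$ and for every Borel set $B \subseteq \mathcal{T}$,

$$D_i f(B) = \sup_{\substack{K \subset B \\ K \text{ is compact}}} D_i f(K),$$

see page 256 in [20]), such that for all $i \in \{1, \dots, d\}$ and for all compactly supported continuously differentiable functions $\phi$ on $\mathcal{T}$

$$\int_{\mathcal{T}} f(t) D_i \phi(t) \, dt = -\int_{\mathcal{T}} \phi(t) D_i f(t) \, dt.$$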

Let with , then . Let and . The total variation of is defined as

or equivalently as

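$$|Df|(\mathcal{T}) = \sup\left\{\sum_{i=1}^d \int_{\mathcal{T}} \phi_i(t) D_i f(t) \, dt : \phi \in C_c(\mathcal{T}, \mathbb{R}^d), \ \|\phi\|_\infty \leqslant 1\right\}.$$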
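To give some intuition for the total variation, consider the one-dimensional case, where (for a sufficiently regular function) it coincides with the classical variation: the supremum over partitions of the sums of absolute increments. The sketch below approximates it on a uniform grid; this is an illustration under that one-dimensional assumption only, with a hypothetical helper name, and is not how the multivariate quantity $|Df|(\mathcal{T})$ would be computed in general.

```python
import math


def discrete_total_variation(f, a, b, num_points):
    """Sum of |f(t_{i+1}) - f(t_i)| over a uniform grid of [a, b].

    For a function of one variable, this is a lower approximation of its
    classical total variation on [a, b].
    """
    grid = [a + (b - a) * i / (num_points - 1) for i in range(num_points)]
    values = [f(t) for t in grid]
    return sum(abs(v1 - v0) for v0, v1 in zip(values, values[1:]))


if __name__ == "__main__":
    # t -> t^2 decreases by 1 and then increases by 1 on [-1, 1].
    print(discrete_total_variation(lambda t: t * t, -1.0, 1.0, 2001))
    # A sine wave over one period goes up 1, down 2, up 1.
    print(discrete_total_variation(lambda t: math.sin(2 * math.pi * t), 0.0, 1.0, 1001))
```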

The set of all functions of bounded variation on is denoted by . By we will denote the set of all bounded variation functions from to . By definition, for any , . It also holds that .

For the rest of the paper, for all , we let be a class with and of total variation . Clearly, . Also, to avoid measurability problems, we assume that all real-valued functions in this paper satisfy image-admissible Suslin condition [18].

## 3 Uniform convergence via an empirical L1-norm covering number

Following the combinatorial method of Pollard [38], we extend Lemma 10 combined with Lemma 11 of Bartlett and Long [11] to the multi-category setting. This gives a result where the scale of the covering number involves the margin parameter, due to the use of a margin loss function. Unlike Pollard, and as in [11], we do not eliminate the additional sample introduced in the proof; as a result, the exponential factor is reduced at the cost of making the covering number depend on $2n$ points. However, the latter has no impact when using dimension-free combinatorial bounds. The extension of this result to mixing processes is given in the Appendix.

###### Theorem 1.

Fix and . Then for any , there holds

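$$P^n\left(\sup_{g \in \mathcal{G}} \left(L_\gamma(g) - L_{\gamma,n}(g)\right) > \epsilon\right) \leqslant 2\,\mathcal{N}_1\left(\frac{\epsilon\gamma}{8}, \mathcal{F}_{\mathcal{G},\gamma}, 2n\right) \exp\left(-\frac{n\epsilon^2}{32}\right). \tag{2}$$
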
###### Proof sketch.

The proof is based on the following steps.

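1) Apply the symmetrization technique of Vapnik and Chervonenkis [43] to the left-hand side of (2) to bound it by

$$2P^{2n}\left\{\sup_{g \in \mathcal{G}} \left(\frac{1}{n}\sum_{i=1}^n \left(\phi_\gamma(f_{g,\gamma}(Z'_i)) - \phi_\gamma(f_{g,\gamma}(Z_i))\right)\right) \geqslant \frac{\epsilon}{2}\right\},$$

where $(Z'_i)_{1 \leqslant i \leqslant n}$ is a sequence of independent copies of $Z$, also called a “ghost” sample.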

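2) Approximate $\mathcal{F}_{\mathcal{G},\gamma}$ by its finite cover with respect to the empirical $L_1$-norm. Let $\bar{\mathcal{G}}$ be a subset of $\mathcal{G}$ so that $\{f_{\bar{g},\gamma} : \bar{g} \in \bar{\mathcal{G}}\}$ is an $\frac{\epsilon\gamma}{8}$-net of $\mathcal{F}_{\mathcal{G},\gamma}$ of minimal cardinality $\mathcal{N}\left(\frac{\epsilon\gamma}{8}, \mathcal{F}_{\mathcal{G},\gamma}, d_{1,\mathbf{z}_{2n}}\right)$, i.e., for all $g \in \mathcal{G}$, there exists $\bar{g} \in \bar{\mathcal{G}}$ such that

$$\frac{1}{2n}\sum_{i=1}^n \left(\left|f_{\bar{g},\gamma}(z_i) - f_{g,\gamma}(z_i)\right| + \left|f_{\bar{g},\gamma}(z'_i) - f_{g,\gamma}(z'_i)\right|\right) < \frac{\epsilon\gamma}{8}.$$

At this step we make use of the $\frac{1}{\gamma}$-Lipschitz property of $\phi_\gamma$:

$$\begin{aligned} &\frac{1}{2n}\sum_{i=1}^n \left|\phi_\gamma(f_{\bar{g},\gamma}(z_i)) - \phi_\gamma(f_{g,\gamma}(z_i))\right| + \frac{1}{2n}\sum_{i=1}^n \left|\phi_\gamma(f_{\bar{g},\gamma}(z'_i)) - \phi_\gamma(f_{g,\gamma}(z'_i))\right| \\ &\qquad \leqslant \frac{1}{2n\gamma}\sum_{i=1}^n \left(\left|f_{\bar{g},\gamma}(z_i) - f_{g,\gamma}(z_i)\right| + \left|f_{\bar{g},\gamma}(z'_i) - f_{g,\gamma}(z'_i)\right|\right) < \frac{\epsilon}{8}. \end{aligned}$$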

On the other hand,

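$$\frac{1}{n}\sum_{i=1}^n \left(\phi_\gamma(f_{\bar{g},\gamma}(z_i)) - \phi_\gamma(f_{\bar{g},\gamma}(z'_i))\right) + \frac{1}{n}\sum_{i=1}^n \left(\phi_\gamma(f_{g,\gamma}(z'_i)) - \phi_\gamma(f_{g,\gamma}(z_i))\right) < \frac{\epsilon}{4}.$$

It follows that

$$\frac{1}{n}\sum_{i=1}^n \left(\phi_\gamma(f_{g,\gamma}(z'_i)) - \phi_\gamma(f_{g,\gamma}(z_i))\right) \geqslant \frac{\epsilon}{2} \implies \frac{1}{n}\sum_{i=1}^n \left(\phi_\gamma(f_{\bar{g},\gamma}(z'_i)) - \phi_\gamma(f_{\bar{g},\gamma}(z_i))\right) > \frac{\epsilon}{4}.$$

Hence the probability in step (1) is bounded by

$$P^{2n}\left\{\max_{\bar{g} \in \bar{\mathcal{G}}} \left(\frac{1}{n}\sum_{i=1}^n \left(\phi_\gamma(f_{\bar{g},\gamma}(Z'_i)) - \phi_\gamma(f_{\bar{g},\gamma}(Z_i))\right)\right) > \frac{\epsilon}{4}\right\}. \tag{3}$$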

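3) For each $i$, $\phi_\gamma(f_{\bar{g},\gamma}(Z_i))$ and $\phi_\gamma(f_{\bar{g},\gamma}(Z'_i))$ admit the same distribution, and thus the difference in (3) is a symmetric random variable, which allows one to perform a second symmetrization by introducing independent Bernoulli random variables $\sigma_1, \dots, \sigma_n$ taking values in $\{-1, 1\}$ with equal probability. Then (3) is equal to

$$\int_{\mathcal{Z}^{2n}} P_{\sigma_n}\left(\max_{\bar{g} \in \bar{\mathcal{G}}} \frac{1}{n}\sum_{i=1}^n \sigma_i\left(\phi'_{\bar{g}}(z'_i) - \phi'_{\bar{g}}(z_i)\right) > \frac{\epsilon}{4}\right) dP^{2n}(\mathbf{z}_{2n}) \tag{4}$$

where $\phi'_{\bar{g}} = \phi_\gamma \circ f_{\bar{g},\gamma}$.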

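4) Focusing on the integrand, apply the union bound and Hoeffding’s inequality (Theorem 2 in [24]) to upper bound the quantity (4) by

$$\exp\left(-\frac{n\epsilon^2}{32}\right) \int_{\mathcal{Z}^{2n}} \mathcal{N}\left(\frac{\epsilon\gamma}{8}, \mathcal{F}_{\mathcal{G},\gamma}, d_{1,\mathbf{z}_{2n}}\right) dP^{2n}(\mathbf{z}_{2n}).$$

Finally, the claimed bound follows from the fact that the expected value of the covering number is at most

$$\sup_{\mathbf{z}_{2n} \in \mathcal{Z}^{2n}} \mathcal{N}\left(\frac{\epsilon\gamma}{8}, \mathcal{F}_{\mathcal{G},\gamma}, d_{1,\mathbf{z}_{2n}}\right).$$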
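To see how a bound of the form (2) translates into a sample size, the following sketch assumes that the metric entropy $\ln \mathcal{N}_1(\epsilon\gamma/8, \mathcal{F}_{\mathcal{G},\gamma}, 2n)$ has been upper bounded by a number `log_cover` that does not depend on $n$ (a dimension-free bound). Under that assumption, requiring the right-hand side of (2) to be at most $\delta$ is equivalent to $n \geqslant \frac{32}{\epsilon^2}\left(\ln\frac{2}{\delta} + \texttt{log\_cover}\right)$. The function names and the numerical values are hypothetical.

```python
import math


def theorem1_rhs(n, eps, log_cover):
    """Right-hand side of (2), 2 * N * exp(-n * eps^2 / 32), computed via logs."""
    return math.exp(math.log(2.0) + log_cover - n * eps ** 2 / 32.0)


def sufficient_sample_size(eps, delta, log_cover):
    """Smallest integer n for which the right-hand side of (2) is at most delta,
    assuming log_cover bounds the metric entropy uniformly in n."""
    return math.ceil(32.0 / eps ** 2 * (math.log(2.0 / delta) + log_cover))


if __name__ == "__main__":
    eps, delta, log_cover = 0.1, 0.05, 50.0  # illustrative values only
    n = sufficient_sample_size(eps, delta, log_cover)
    print(n, theorem1_rhs(n, eps, log_cover))
```

The sample size grows linearly in the metric entropy, which is why the dependency of the entropy on the number of classes carries over directly to the sample complexity estimate.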

## 4 Bounds on the metric entropy and the fat-shattering dimension of sets of the BV space

The straightforward way to estimate the $L_1$-norm metric entropy of sets of the BV space is to appeal to Theorem 10.1.2 in [6] which states the following. Let . For any and for any , there exists a function , such that and Let and suppose is the ball containing all satisfying the above conditions for each . Suppose that is an -net of with respect to the -norm. Then, for any function there is a function , such that

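$$\int_{\mathcal{T}} \left|f_\epsilon(t) - \bar{f}(t)\right| dt < \epsilon.$$

By the triangle inequality, on the other hand,

$$\int_{\mathcal{T}} \left|f(t) - \bar{f}(t)\right| dt \leqslant \int_{\mathcal{T}} \left|f(t) - f_\epsilon(t)\right| dt + \int_{\mathcal{T}} \left|f_\epsilon(t) - \bar{f}(t)\right| dt < 2\epsilon.$$

This implies that $\bar{\mathcal{F}}$ is a $2\epsilon$-net of $\mathcal{F}$. Then, according to Theorem 5.2 of [14], the upper bound on the metric entropy of subsets of Sobolev spaces, it holds

$$\ln \mathcal{N}(\epsilon, \mathcal{F}, d_{L_1}) \leqslant K\left(\frac{2}{\epsilon}\right)^d,$$

where $K$ is a constant whose dependency on the parameters of the function class is not known in explicit form, which is clearly a downside.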

Recently, [19] derived an upper bound on the $L_1$-norm metric entropy of sets of the BV space thanks to Poincaré type inequalities [6], with explicit constants. However, in view of Inequality (2), we need a weighted $L_1$-norm metric entropy estimate of the function class of interest (in fact, any weighted $L_p$-norm works thanks to Inequality (1)). Using just the mentioned theorem, one can attempt this as follows. Let be of total variation . If we assume

to be a family of probability distributions

on with the Lebesgue density satisfying where , then from Theorem 3.1 in [19] based on Hölder’s inequality it follows that:

###### Corollary 1.

Fix . Then for any ,

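$$\ln \mathcal{N}(\epsilon, \mathcal{F}, d_{L_1(P_{\mathcal{T}})}) \leqslant K M \left(\sqrt{d}\, A V K_P\right)^d d K_P^2 \left(\frac{1}{\epsilon}\right)^d,$$

where $K$ is an absolute constant.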

Now, we need the empirical version of the bound, holding for the empirical measure $\frac{1}{n}\sum_{i=1}^n \delta_{T_i}$, a linear combination of Dirac measures supported on random variables $T_i$, $1 \leqslant i \leqslant n$, taking values in $\mathcal{T}$ and distributed independently according to $P_{\mathcal{T}}$. One could do it based on the argument in Lemma 3 in [10]. Proceeding as in the proof of Theorem 1 and using Lemma 2 of [10], there holds

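$$P^n_{\mathcal{T}}\left(\sup_{f, \bar{f} \in \mathcal{F}} \left(\frac{1}{n}\sum_{i=1}^n \left|(f - \bar{f})(t_i)\right| - \int_{\mathcal{T}} \left|(f - \bar{f})(t)\right| dP_{\mathcal{T}}\right) > \frac{\epsilon}{2}\right) \leqslant 2\,\mathcal{N}_1^2\left(\frac{\epsilon}{32}, \mathcal{F}, n\right) \exp\left(-\frac{n\epsilon^2}{64}\right),$$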

where . Applying the combinatorial bound in [35] (any other bound for general function classes such as Lemma 3.5 in [2] could have been used, but this bound provides a better dependency on ),

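$$\mathcal{N}(\epsilon, \mathcal{F}, d_{2,\mathbf{t}_n}) \leqslant \left(\frac{7M}{\epsilon}\right)^{20\, d_{\mathcal{F}}\left(\frac{\epsilon}{96}\right)}, \tag{5}$$

to the right-hand side gives

$$P^n_{\mathcal{T}}\left(\sup_{f, \bar{f} \in \mathcal{F}} \left(\frac{1}{n}\sum_{i=1}^n \left|(f - \bar{f})(t_i)\right| - \int_{\mathcal{T}} \left|(f - \bar{f})(t)\right| dP_{\mathcal{T}}\right) > \frac{\epsilon}{2}\right) \leqslant 2\left(\frac{224M}{\epsilon}\right)^{40\, d_{\mathcal{F}}\left(\frac{\epsilon}{3072}\right)} \exp\left(-\frac{n\epsilon^2}{64}\right).$$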

Now, upper bound the right-hand side of the above inequality by . Let be an -net of with respect to the -norm. Then, for

where is arbitrarily small, for almost all points , and for any , there exists such that

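$$\frac{1}{n}\sum_{i=1}^n \left|f(t_i) - \bar{f}(t_i)\right| - \int_{\mathcal{T}} \left|f(t) - \bar{f}(t)\right| dP_{\mathcal{T}} < \frac{\epsilon}{2},$$

implying

$$\frac{1}{n}\sum_{i=1}^n \left|f(t_i) - \bar{f}(t_i)\right| < \epsilon.$$

Then, an $\frac{\epsilon}{2}$-net of $\mathcal{F}$ with respect to the $L_1(P_{\mathcal{T}})$-norm is an $\epsilon$-net of $\mathcal{F}$ with respect to the empirical $L_1$-norm. More precisely,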

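$$\ln \mathcal{N}(\epsilon, \mathcal{F}, d_{1,\mathbf{t}_n}) \leqslant K M \left(\sqrt{d}\, A V K_P\right)^d d K_P^2 \left(\frac{2}{\epsilon}\right)^d, \tag{6}$$

where the left-hand side is the metric entropy of $\mathcal{F}$ with respect to the metric $d_{1,\mathbf{t}_n}$. This bound, however, is inefficient in the sense that it holds (almost surely) only for large values of $n$ and only for distributions with Lebesgue densities.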

In fact, bounding the fat-shattering dimension of $\mathcal{F}$, then combining it with any metric entropy bound for general function classes (for instance the bound (5)) in an empirical metric, would result in a dedicated (to the BV space) metric entropy bound. In the following theorem, we bound the fat-shattering dimension of sets of the BV space.

###### Theorem 2.

Fix . Then,

$$d_{\mathcal{F}}(\epsilon) \leqslant \left(\frac{2A\sqrt{Vd}}{\epsilon}\right)^d. \tag{7}$$

###### Proof.

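According to the fundamental result on line integrals (see, for instance, [45]) and Hölder’s inequality, for any $t_1, t_2 \in \mathcal{T}$, there holds

$$\begin{aligned} f(t_2) - f(t_1) &\leqslant \int_0^1 \left\langle Df(t_1 + \beta(t_2 - t_1)), t_2 - t_1 \right\rangle d\beta \\ &\leqslant \left(\int_0^1 \left\|Df(t_1 + \beta(t_2 - t_1))\right\|_2^2 d\beta\right)^{\frac{1}{2}} \left(\int_0^1 \left\|t_2 - t_1\right\|_2^2 d\beta\right)^{\frac{1}{2}} \\ &\leqslant \left\|t_2 - t_1\right\|_2 \left(\int_0^1 \left\|Df(t_1 + \beta(t_2 - t_1))\right\|_2^2 d\beta\right)^{\frac{1}{2}}, \end{aligned}$$

where $\|\cdot\|_2$ is the standard Euclidean norm. Then,

$$\begin{aligned} \int_0^1 \left\|Df(t_1 + \beta(t_2 - t_1))\right\|_2^2 d\beta &\leqslant \int_{\mathcal{T}} \left\|Df(t)\right\|_2^2 dt = \int_{\mathcal{T}} \left\langle Df(t), Df(t) \right\rangle dt \\ &\leqslant \sup_{\phi \in C_c(\mathcal{T}, \mathbb{R}^d)} \int_{\mathcal{T}} \left\langle Df(t), \phi(t) \right\rangle dt \leqslant |Df|(\mathcal{T}) \leqslant V. \end{aligned}$$

Combining with the above inequality yields

$$f(t_2) - f(t_1) \leqslant \left\|t_2 - t_1\right\|_2 \sqrt{V}. \tag{8}$$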

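Now, suppose that $S = \{t_1, \dots, t_n\}$ is a set of maximal cardinality $\epsilon$-shattered by $\mathcal{F}$. Re-arrange the indices so that $s(t_1) \leqslant s(t_2) \leqslant \dots \leqslant s(t_n)$. Since $S$ is $\epsilon$-shattered by $\mathcal{F}$, there exists a function $f$ in $\mathcal{F}$ satisfying $f(t_1) - s(t_1) \leqslant -\epsilon$ and $f(t_i) - s(t_i) \geqslant \epsilon$ for all $i > 1$. Then, since $s(t_i) \geqslant s(t_1)$,

$$\forall t_i \in S \setminus \{t_1\}, \quad f(t_i) - f(t_1) \geqslant s(t_i) - s(t_1) + 2\epsilon \geqslant 2\epsilon.$$

Thus, for any point $t_i$ with $i \in \{1, \dots, n-1\}$ there exists a function $f$ in $\mathcal{F}$ for which $f(t_i) - s(t_i) \leqslant -\epsilon$ and $f(t_j) - s(t_j) \geqslant \epsilon$ for all $j > i$. Consequently,

$$\forall t_j \in S \setminus \{t_1, \dots, t_i\}, \quad f(t_j) - f(t_i) \geqslant 2\epsilon.$$

From these inequalities and from (8) it follows that $S$ is $\frac{2\epsilon}{\sqrt{V}}$-separated with respect to the Euclidean metric $d_2$. This implies that the fat-shattering dimension of $\mathcal{F}$ is at most the packing number of its domain:

$$n \leqslant \mathcal{M}\left(\mathcal{T}, \frac{2\epsilon}{\sqrt{V}}, d_2\right).$$

By the volume comparison argument, it follows that

$$\mathcal{M}(\mathcal{T}, \epsilon, d_2) \leqslant \frac{\left|\mathcal{T} + \frac{\epsilon}{2}B\right|}{\left|\frac{\epsilon}{2}B\right|} \leqslant \frac{\left|\sqrt{d}AB + \frac{\epsilon}{2}B\right|}{\left|\frac{\epsilon}{2}B\right|} = \left(1 + \frac{2\sqrt{d}A}{\epsilon}\right)^d,$$

where $|\cdot|$ denotes the volume of a set. Combining the two bounds gives $n \leqslant \left(1 + \frac{A\sqrt{Vd}}{\epsilon}\right)^d$, which is at most $\left(\frac{2A\sqrt{Vd}}{\epsilon}\right)^d$ whenever $\epsilon \leqslant A\sqrt{Vd}$. This is the desired result. ∎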
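The two bounds used in the proof can be evaluated numerically to get a feel for their growth. The sketch below simply codes the volume-comparison bound on the packing number and the bound (7); the parameter values are arbitrary illustrations and the function names are hypothetical.

```python
import math


def packing_bound(eps, A, d):
    """Volume-comparison bound (1 + 2 * sqrt(d) * A / eps)^d on the eps-packing number."""
    return (1.0 + 2.0 * math.sqrt(d) * A / eps) ** d


def fat_shattering_bound(eps, A, V, d):
    """Bound (7): (2 * A * sqrt(V * d) / eps)^d."""
    return (2.0 * A * math.sqrt(V * d) / eps) ** d


if __name__ == "__main__":
    A, V, d = 1.0, 1.0, 2  # illustrative values only
    for eps in (0.2, 0.1, 0.05):
        intermediate = packing_bound(2.0 * eps / math.sqrt(V), A, d)
        print(eps, intermediate, fat_shattering_bound(eps, A, V, d))
```

Halving the scale multiplies the bound (7) by $2^d$, which is the $O(\epsilon^{-d})$ growth discussed in the introduction.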

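Now, substituting Inequality (7) in the combinatorial bound (5) yields:

$$\ln \mathcal{N}(\epsilon, \mathcal{F}, d_{2,\mathbf{t}_n}) \leqslant 20\left(\frac{198A\sqrt{Vd}}{\epsilon}\right)^d \ln\left(\frac{7M}{\epsilon}\right). \tag{9}$$

In fact, one could use any metric entropy result for general function classes, such as Lemma 3.5 in [2], which depends on the sample size $n$:

$$\ln \mathcal{N}(\epsilon, \mathcal{F}, d_{\infty,\mathbf{x}_n}) \leqslant d_{\mathcal{F}}\left(\frac{\epsilon}{4}\right) \log_2\left(\frac{2Men}{d_{\mathcal{F}}\left(\frac{\epsilon}{4}\right)\epsilon}\right) \ln\left(\frac{16M^2 n}{\epsilon^2}\right). \tag{10}$$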

Notice that in both cases we have an additional logarithmic factor of , compared to the dedicated bound, Inequality (6), which at first sight might seem to be a drawback, particularly, for the latter bound displaying a as . However, it will prove to be useful when elaborating the dependency on the number of classes which is addressed in the upcoming section.

###### Remark 1.

Our bound on the fat-shattering dimension can be extended in a straightforward way to the space on general metric spaces called doubling spaces, i.e., metric spaces where each ball can be covered by a finite number of balls of half the radius. This is possible thanks to the work of [36] extending the functions to doubling spaces, and the packing number bound of [29] for sets of the mentioned spaces. In this case, the fat-shattering dimension will grow as a as , where denotes the doubling dimension of , and is equal to .

## 5 Decomposition of capacity measures and sample complexity estimate

In this section we estimate the sample size sufficient for the probability in (2) to be at most $\delta$, with the emphasis on making explicit the dependency of this estimate on the number of classes. The latter is possible thanks to a decomposition result, which upper bounds a capacity measure of $\mathcal{F}_{\mathcal{G},\gamma}$ by that of the component classes. This can be done either at the level of the metric entropy or the fat-shattering dimension, since the former is related to the latter via combinatorial bounds (as has been seen in the preceding section).

### 5.1 Metric Entropy

For general function classes, the decomposition result for metric entropies in the -norm was provided in [17] and extended to all -norms in Lemma 1 in [23]. In the context of this work, it takes the following form:

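$$\ln \mathcal{N}(\epsilon, \mathcal{F}_{\mathcal{G},\gamma}, d_{p,\mathbf{z}_n}) \leqslant C \ln \mathcal{N}\left(\frac{\epsilon}{C^{\frac{1}{p}}}, \mathcal{G}_0, d_{p,\mathbf{x}_n}\right). \tag{11}$$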
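To see how the scale $\epsilon / C^{1/p}$ in (11) affects the dependency on $C$, here is a short worked computation. It assumes, purely for illustration, a component-class entropy bound of the form $K\epsilon^{-d}$ with a hypothetical constant $K$ that does not depend on $C$ or $\epsilon$; this growth rate matches the $O(\epsilon^{-d})$ behaviour discussed in Section 4 up to constants and logarithmic factors.

```latex
% Assumption (illustrative): \ln \mathcal{N}(\epsilon, \mathcal{G}_0, d_{p,\mathbf{x}_n}) \leqslant K \epsilon^{-d}.
\begin{align*}
\ln \mathcal{N}(\epsilon, \mathcal{F}_{\mathcal{G},\gamma}, d_{p,\mathbf{z}_n})
  &\leqslant C \, K \left(\frac{C^{1/p}}{\epsilon}\right)^{d}
   = K \, C^{\,1 + d/p} \, \epsilon^{-d}, \\
p = 1 &: \quad C^{\,d+1}, \qquad
p = 2 : \quad C^{\,d/2+1}, \qquad
p = \infty : \quad C .
\end{align*}
```

Under this assumption the exponent of $C$ decreases from $d+1$ to $1$ as $p$ goes from $1$ to $\infty$, and the case $p = 2$ reproduces the $O(C^{d/2+1})$ rate quoted in the abstract.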

The important part in this bound to pay attention to is the scale of the metric entropy of the component class on the right-hand side which depends on the number of classes: the ”worst“ case is giving dependency and the ”optimal“ case corresponds to in which case the dependency on vanishes. Applying it to the metric entropy bound (6) with , or to (9) with , would yield a result scaling with as a . For problems involving a large number of classes and high dimensional input spaces, this is quite a prohibitive dependency. Now, for , we can use the metric entropy bounds established in Corollary 1 in [37] which when applied to sets of gives results scaling with as

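$$\ln \mathcal{N}(\epsilon, \mathcal{F}_{\mathcal{G},\gamma}, d_{p,\mathbf{z}_n}) \leqslant 2C \log_2^d(2C) \left(\frac{60\sqrt{dV}A}{\epsilon}\right)^d \ln\left(\frac{30\,e\,n \log_2(2C) M}{\epsilon}\right),$$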

and slightly worse for the dimension-free one, but still an improvement over the cases . Now, contrast it with the extreme case for which we have