# Learning the Hypotheses Space from data Part II: Convergence and Feasibility

In part I we proposed a structure for a general Hypotheses Space $\mathcal{H}$, the Learning Space $\mathbb{L}(\mathcal{H})$, which can be employed to avoid overfitting when estimating in a complex space with a relative shortage of examples. We also presented the U-curve property, which can be exploited to select a Hypotheses Space without exhaustively searching $\mathbb{L}(\mathcal{H})$. In this paper, we carry our agenda further by showing the consistency of a model selection framework based on Learning Spaces, in which one selects from data the Hypotheses Space on which to learn. The method developed in this paper adds to the state of the art in model selection by extending Vapnik-Chervonenkis Theory to random Hypotheses Spaces, i.e., Hypotheses Spaces learned from data. In this framework, one estimates a random subspace $\hat{\mathcal{M}} \in \mathbb{L}(\mathcal{H})$ which converges with probability one to a target Hypotheses Space $\mathcal{M}^\star \in \mathbb{L}(\mathcal{H})$ with desired properties. As the convergence implies asymptotically unbiased estimators, we have a consistent framework for model selection, showing that it is feasible to learn the Hypotheses Space from data. Furthermore, we show that the generalization errors of learning on $\hat{\mathcal{M}}$ are smaller than those we commit when learning on $\mathcal{H}$, so it is more efficient to learn on a subspace learned from data.

## 1 Introduction

In part I of this paper, we proposed the Learning Space $\mathbb{L}(\mathcal{H})$ of a general Hypotheses Space $\mathcal{H}$ and discussed how it can be employed to avoid overfitting when estimating in a complex space with a relative shortage of examples, aided by a U-curve property, which enables the selection of a Hypotheses Space without exhaustively searching $\mathbb{L}(\mathcal{H})$. Carrying further the agenda of part I, we now study the theoretical feasibility of learning a Hypotheses Space from data and propose a consistent model selection framework based on Learning Spaces.

The fundamental result in Machine Learning which guarantees the feasibility of approximating a target hypothesis $h^\star \in \mathcal{H}$ by an $\hat{h} \in \mathcal{H}$, estimated from a finite sample $\mathcal{D}_N$, is the Vapnik-Chervonenkis Theory (see [10, 11, 15, 12, 13, 14]), which treats two types of generalization errors:

$$P\left(\sup_{h \in \mathcal{H}} \left|L_{\mathcal{D}_N}(h) - L(h)\right| > \epsilon\right) \quad \text{and} \quad P\left(L(\hat{h}) - L(h^\star) > \epsilon\right) \tag{1}$$

for $\epsilon > 0$, which we call type I and type II generalization errors, respectively. In (1), $L(h)$ is the out-of-sample error and $L_{\mathcal{D}_N}(h)$ is the in-sample error of an $h \in \mathcal{H}$. If type I generalization error is small, then we are able to estimate the out-of-sample error of any hypothesis in $\mathcal{H}$ with great precision and confidence, while if type II generalization error is small, then we are confident that the estimated hypothesis $\hat{h}$ approximates a target $h^\star$ well.

The VC bounds for (1) are based on the concept of VC dimension, which is a measure of the complexity of $\mathcal{H}$ and, in binary classification problems, of the power of the hypotheses in $\mathcal{H}$ in classifying instances of the input variable $X$ into the categories of the output variable $Y$. For fixed $\epsilon$ and sample size $N$, the bounds for these generalization errors are monotonically increasing functions of the VC dimension: the smaller the Hypotheses Space is, in the VC dimension sense, the better the generalization on it, so we may regulate the VC dimension of the Hypotheses Space in order to generalize better. This may be accomplished by selecting a proper subset $\mathcal{M} \subset \mathcal{H}$ on which to learn. However, when we restrict the learning to such a subspace we commit another two errors, which we call types III and IV generalization errors, that are respectively

$$P\left(L(h^\star_{\mathcal{M}}) - L(h^\star) > \epsilon\right) \quad \text{and} \quad P\left(L(\hat{h}_{\mathcal{M}}) - L(h^\star) > \epsilon\right)$$

for $\epsilon > 0$, in which $h^\star_{\mathcal{M}}$ is a target hypothesis of $\mathcal{M}$ and $\hat{h}_{\mathcal{M}}$ is the hypothesis of $\mathcal{M}$ estimated from $\mathcal{D}_N$. If type III error is small, then we are confident that a target hypothesis of $\mathcal{M}$ approximates a target hypothesis of $\mathcal{H}$ well. If type IV error is small, then we are confident that the estimated hypothesis of $\mathcal{M}$ approximates a target of $\mathcal{H}$ well. If both types III and IV generalization errors are small, then it is feasible to learn on $\mathcal{M}$ instead of on $\mathcal{H}$, as types I and II errors are at least as small on $\mathcal{M}$ as they are on $\mathcal{H}$, for $\mathcal{M} \subset \mathcal{H}$ so $d_{VC}(\mathcal{M}) \leq d_{VC}(\mathcal{H})$.

If there is no prior information about the target $h^\star$ which allows us to restrict $\mathcal{H}$ to an $\mathcal{M} \subset \mathcal{H}$ with $h^\star \in \mathcal{M}$, so that type III generalization error is zero and type IV reduces to type II, it may not be possible to restrict $\mathcal{H}$ beforehand and still estimate a good hypothesis relative to $h^\star$. However, due to the Learning Space structure of $\mathcal{H}$ proposed in part I, we may restrict $\mathcal{H}$ to a random subset $\hat{\mathcal{M}}$ which is determined by a U-curve algorithm (see [2, 6, 8, 9]) from sample $\mathcal{D}_N$ and possibly a validation sample. Such a subset is random because it depends on sample $\mathcal{D}_N$: $\hat{\mathcal{M}}$ is learned from data.

In this paper, we propose a framework for model selection to learn from data a subspace $\hat{\mathcal{M}}$ of $\mathcal{H}$ on which to apply a learning algorithm. On this subspace, types I and II generalization errors are smaller than on $\mathcal{H}$, so that fewer examples are needed, and types III and IV generalization errors are asymptotically zero, i.e., tend to zero when the sample size increases. We show that, in the proposed approach, all generalization errors converge to zero when $N \to \infty$, and that $\hat{\mathcal{M}}$ converges almost surely to a target subspace $\mathcal{M}^\star$ of $\mathcal{H}$, which is the subspace in $\mathbb{L}(\mathcal{H})$ with least VC dimension that contains a target hypothesis $h^\star$. We say that the approach is consistent if it satisfies these two properties. Our approach does not demand the specification of a Hypotheses Space a priori, but rather introduces the learning of a Hypotheses Space from data among those in $\mathbb{L}(\mathcal{H})$, so all prior information is embedded in $\mathbb{L}(\mathcal{H})$.

The concept of target Hypotheses Space is central in our approach. As is the case in all optimization problems, there must be an optimal solution to the model selection problem which satisfies certain conditions. In the proposed approach, the optimal solution is the subspace in $\mathbb{L}(\mathcal{H})$ with least VC dimension which contains a target hypothesis. From the point of view of generalization errors, this subspace provides the best circumstances in $\mathbb{L}(\mathcal{H})$ to learn hypotheses with a fixed sample of size $N$. On the one hand, type III generalization error is zero and type IV reduces to type II. On the other hand, the VC dimension is minimal under these constraints, so types I and II generalization errors are smallest. Now, having defined the target (optimal) subspace in $\mathbb{L}(\mathcal{H})$, one needs to develop procedures to consistently estimate it.

The structured, data-driven selection of a Hypotheses Space based on the Learning Space extends the state of the art in Machine Learning. The standard procedure to select a Hypotheses Space consists of defining a priori a set of subspaces, and then selecting one of them based on a loss function. This process does not take advantage of a more complex structure of a Hypotheses Space, and is performed by an exhaustive search on this set of subspaces (see [1, Chapter 4]). Also, even when there is some structure among the candidate subspaces, as is the case in the Structural Risk Minimization (SRM) Inductive Principle [11, Chapter 4], there is a need for the specification of tuning parameters such as confidence, and for a combinatorial algorithm which may not be computable if the number of candidates is too great. Our method does not demand the specification of tuning parameters and, due to the U-curve property presented in part I, we may avoid an exhaustive search of $\mathbb{L}(\mathcal{H})$ in order to select a subspace, so the method is practical in a handful of cases, even though it is still combinatorial.

Following this Introduction, Section 2 presents an overview of the concepts related to the VC theory which are used in this paper. In Section 3, we propose a framework to learn Hypotheses Spaces from data, and show that it is consistent, in the sense that the generalization errors converge to zero and $\hat{\mathcal{M}}$ converges to a target subspace when $N \to \infty$. Finally, in the Discussion, we present the main contributions of this paper and future research perspectives.

## 2 Vapnik-Chervonenkis Theory and Learning Spaces

In this section, we discuss the main concepts related to VC Theory and present an overview of the Learning Space. As in part I, we focus on the case where the output variable is binary and the loss function is the simple one, so the risk is the classification error. However, much of the theory may be extended to other cases, in a Statistical Learning framework. The notation used here is defined in part I.

### 2.1 Vapnik-Chervonenkis Theory

The VC Theory justifies learning on Hypotheses Spaces with infinite cardinality, and is also the cornerstone of the learning machine developed in this paper. The main concepts of the VC theory are the shatter coefficients and VC dimension of a Hypotheses Space. In this section, we briefly present these concepts and the main results related to them. For a comprehensive presentation of the VC theory see [1, 4, 10, 11, 15, 12, 13, 14]. The shatter coefficients and VC dimension of a Hypotheses Space are defined as follows.

###### Definition 1

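Suppose that , and . The $N$-th shatter coefficient of $\mathcal{H}$ is defined as

$$S(\mathcal{H}, N) = \max_{(x_1, \dots, x_N) \in \mathcal{X}^N} \left|\left\{\left(h(x_1), \dots, h(x_N)\right) : h \in \mathcal{H}\right\}\right|$$

in which $|\cdot|$ is the cardinality of a set. The VC dimension of a Hypotheses Space $\mathcal{H}$ is the greatest integer $k \geq 1$ such that $S(\mathcal{H}, k) = 2^k$ and is denoted by $d_{VC}(\mathcal{H})$. If $S(\mathcal{H}, k) = 2^k$ for all integers $k \geq 1$, we denote $d_{VC}(\mathcal{H}) = \infty$.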
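For a finite Hypotheses Space over a finite domain, Definition 1 can be checked by brute force. The sketch below is only illustrative: the function names, the threshold classifiers and the four-point domain are assumptions made for this example, not objects taken from the paper.

```python
from itertools import combinations


def shatter_coefficient(hypotheses, domain, n):
    """Largest number of distinct labelings of n domain points induced by `hypotheses`.

    Repeated points never add labelings, so it suffices to scan subsets of
    distinct points; for n above the domain size the whole domain is used.
    """
    k = min(n, len(domain))
    best = 0
    for points in combinations(domain, k):
        labelings = {tuple(h(x) for x in points) for h in hypotheses}
        best = max(best, len(labelings))
    return best


def vc_dimension(hypotheses, domain):
    """Greatest k with S(H, k) = 2^k, restricted to the given finite domain."""
    d = 0
    for k in range(1, len(domain) + 1):
        if shatter_coefficient(hypotheses, domain, k) != 2 ** k:
            break
        d = k
    return d


# Hypothetical example: threshold classifiers h_t(x) = 1{x >= t} on four points.
domain = [0, 1, 2, 3]
thresholds = [lambda x, t=t: int(x >= t) for t in range(5)]

print(shatter_coefficient(thresholds, domain, 2))
print(vc_dimension(thresholds, domain))
```

Thresholds can label any single point both ways but can never label a smaller point 1 and a larger point 0, which is why no pair of points is shattered in this example.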

The shatter coefficients and VC dimension measure the richness of hypotheses in $\mathcal{H}$, or the power of the hypotheses in $\mathcal{H}$ in classifying instances with values in $\mathcal{X}$ into the categories $\{0, 1\}$. Therefore, the VC dimension is a measure of the complexity of $\mathcal{H}$, and is a useful tool for bounding the so-called generalization errors, which, in classical VC theory, are of two types:

$$P\left(\sup_{h \in \mathcal{H}} \left|L_{\mathcal{D}_N}(h) - L(h)\right| > \epsilon\right) \tag{2}$$

and

$$P\left(L(\hat{h}) - L(h^\star) > \epsilon\right) \tag{3}$$

for $\epsilon > 0$ fixed. We call (2) type I generalization error and (3) type II generalization error. On the one hand, if type I generalization error is small for small $\epsilon$, then we are confident that we can properly estimate the loss of each hypothesis in $\mathcal{H}$ by the in-sample error, i.e., we can generalize the in-sample error to out-of-sample instances. On the other hand, if type II error is small for small $\epsilon$, then we are confident that the estimated hypothesis is as good as a target hypothesis of $\mathcal{H}$, i.e., we can properly approximate target hypotheses. Further discussions about generalization errors may be found in [1, 10].

In the remainder of this section, we present the classical VC bounds for types I and II generalization errors, given $\epsilon$ and $N$, which we seek to extend to a random subspace $\hat{\mathcal{M}}$ in later sections, asserting the feasibility of learning (on) $\hat{\mathcal{M}}$. For simplicity, we assume throughout this paper that the output variable is binary and $\ell$ is the simple loss function. However, the results may be applied to cases in which bounds for types I and II generalization errors are available, with the necessary modifications (see for example the results in [11, Chapter 3]). Proofs for all results of this section may be found in [4, 10]. The VC bounds are based on a generalization of the Glivenko-Cantelli Theorem [4, Theorem 12.4] and Hoeffding's inequality [5].

A first bound for type II generalization error may be obtained without much effort for the case in which $|\mathcal{H}| < \infty$ and $L(h^\star) = 0$.

###### Proposition 1 (Vapnik and Chervonenkis [12] as cited in [4])

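Assume that $|\mathcal{H}| < \infty$ and $L(h^\star) = 0$. Then, for every $N$ and $\epsilon > 0$,

$$P\left(L(\hat{h}) > \epsilon\right) \leq |\mathcal{H}| e^{-N\epsilon}$$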
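Setting the right-hand side of Proposition 1 to a confidence level gives the sample size that the bound asks for. In the worked example below, the symbol $\delta$ and the values $|\mathcal{H}| = 1000$ and $\epsilon = \delta = 0.05$ are assumptions introduced only for illustration.

$$
|\mathcal{H}| e^{-N\epsilon} \leq \delta
\iff
N \geq \frac{1}{\epsilon} \log \frac{|\mathcal{H}|}{\delta},
\qquad
\frac{1}{0.05} \log \frac{1000}{0.05} = 20 \log(20000) \approx 198.1
\;\Rightarrow\; N \geq 199.
$$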

With little work we may obtain a bound for the generalization errors when $|\mathcal{H}| < \infty$ by applying Hoeffding's inequality and the union bound.

###### Proposition 2

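Assume that $|\mathcal{H}| < \infty$. Then, for every $N$ and $\epsilon > 0$,

$$P\left(\sup_{h \in \mathcal{H}} \left|L_{\mathcal{D}_N}(h) - L(h)\right| > \epsilon\right) \leq 2 |\mathcal{H}| e^{-2N\epsilon^2}$$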
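The bound of Proposition 2 is simple enough to evaluate and invert numerically. The following is a minimal sketch; the function names, the confidence level `delta` and the numerical values are assumptions chosen for illustration.

```python
import math


def hoeffding_union_bound(num_hypotheses, n, eps):
    """Right-hand side of Proposition 2: 2 |H| exp(-2 N eps^2)."""
    return 2 * num_hypotheses * math.exp(-2 * n * eps ** 2)


def sample_size_for(num_hypotheses, eps, delta):
    """Smallest N for which the bound above is at most delta."""
    return math.ceil(math.log(2 * num_hypotheses / delta) / (2 * eps ** 2))


# Illustrative values only.
n = sample_size_for(num_hypotheses=1000, eps=0.05, delta=0.05)
print(n, hoeffding_union_bound(1000, n, 0.05))
```

Because the bound depends on $|\mathcal{H}|$ only through its logarithm, multiplying the number of hypotheses by ten adds roughly $\log(10) / (2\epsilon^2)$ examples to the requirement.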

A first distribution-free bound for the generalization errors on general Hypotheses Spaces may be obtained by an extension of the Glivenko-Cantelli Theorem based on the $N$-th shatter coefficient.

###### Proposition 3 (Vapnik and Chervonenkis [15] as cited in [4])

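Let $\mathcal{H}$ be a general Hypotheses Space. Then, for every $N$ and $\epsilon > 0$,

$$P\left(\sup_{h \in \mathcal{H}} \left|L_{\mathcal{D}_N}(h) - L(h)\right| > \epsilon\right) \leq 8 S(\mathcal{H}, N) e^{-N\epsilon^2/32}$$

and

$$P\left(L(\hat{h}) - L(h^\star) > \epsilon\right) \leq 8 S(\mathcal{H}, N) e^{-N\epsilon^2/128}$$

Bounds based on the VC dimension instead of the $N$-th shatter coefficient may also be obtained for the generalization errors of Proposition 3, and tighter bounds for the generalization errors in Propositions 1 to 4 may be found in [4, Chapter 12].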

###### Proposition 4 (Vapnik [10, Theorem 4.4])

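Let $\mathcal{H}$ be a general Hypotheses Space with $d_{VC}(\mathcal{H}) < \infty$. Then, for every $N$ and $\epsilon > 0$,

$$P\left(\sup_{h \in \mathcal{H}} \left|L_{\mathcal{D}_N}(h) - L(h)\right| > \epsilon\right) \leq 4 \exp\left\{\left[\frac{d_{VC}(\mathcal{H})}{N}\left(1 + \log\frac{2N}{d_{VC}(\mathcal{H})}\right) - \left(\epsilon - 1/N\right)^2\right] N\right\}$$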

and

The classical VC theory presented above may be applied to a given Hypotheses Space without any further work by making the obvious modifications. However, it cannot be applied in its current form to a random subset $\hat{\mathcal{M}}$ of $\mathcal{H}$. In fact, the main objective of this paper is to show the convergence to zero of the generalization errors of learning on $\hat{\mathcal{M}}$, random and learned from data, when $N \to \infty$. With respect to types I and II generalization errors, this means showing bounds analogous to Propositions 1 to 4 for type I generalization error, with the supremum taken over $\hat{\mathcal{M}}$, and for type IV generalization error, which is type II with $\hat{h}_{\hat{\mathcal{M}}}$ in place of $\hat{h}$. This is accomplished in the next sections.

### 2.2 Learning Spaces

In order to learn from data a subspace on which to apply a learning machine, we define a structure on $\mathcal{H}$, namely a Learning Space, whose properties allow the consistent learning of such a subset. The Learning Space was proposed in part I and is as follows.

###### Definition 2

Let be a general Hypotheses Space and be a finite subset of the powerset of , . We say that the poset is a Learning Space if

• and implies

We define the VC dimension of as

$$d_{VC}(\mathbb{L}(\mathcal{H})) := \max_{i \in \mathcal{J}} d_{VC}(\mathcal{M}_i)$$

A Learning Space is a poset of subspaces of $\mathcal{H}$ which covers $\mathcal{H}$ and has a constraint on the VC dimension of related sets. By (i), it may be seen as a structuring of $\mathcal{H}$ that we may take advantage of to select a subspace on which to apply a learning machine. Given , may be constructed by a lattice isomorphism from a poset to that preserves the partial ordering . The Learning Space is said to be a Lattice Learning Space if is a complete lattice (for a definition of complete lattice see [3]), in which is the infimum operator, is the supremum operator, is the least subset and is the greatest subset of (see part I for more details).

The main feature of $\mathbb{L}(\mathcal{H})$ is that it may be employed to avoid overfitting when estimating $h^\star$, and also to select a subspace on which to apply a learning machine. Indeed, $\hat{\mathcal{M}}$ may be selected among the subspaces in $\mathbb{L}(\mathcal{H})$ whose estimated target hypotheses have the lowest estimated out-of-sample error. These subspaces are local and global minima of dense chains, as follows.

###### Definition 3

A sequence is called a dense chain of if, and only if, for all in which means distance on the Hasse diagram .

###### Definition 4

The subspace is:

• a weak local minimum of a dense chain of if , in which ;

• a strong local minimum of if it is a weak local minimum of all dense chains of which contain it;

• a global minimum of a dense chain if ;

• a global minimum of if .

In order to select $\hat{\mathcal{M}}$ we need only identify all weak, or all strong, local minima of $\mathbb{L}(\mathcal{H})$, as $\hat{\mathcal{M}}$ is desired to be one of them: a global minimum. A routine to find weak/strong local minima may be implemented via a U-curve algorithm (see part I for more details).

### 2.3 Estimating the Target Hypotheses Space

The structuring of a Hypotheses Space by a Learning Space brings upon two paradigms to the Machine Learning Theory regarding the Hypotheses Space on which to apply a learning machine. First, a Learning Space may be a tool to reduce a priori the Hypotheses Space to an . Indeed, we may have prior information about that permits us to reduce our Hypotheses Space, e.g., that . Nevertheless, this reduction from to may be now done by selecting a sub-Learning Space , filtering according to some criteria.

The class may also be selected based on the complexity we desire to have. A manner of choosing is to limit the VC dimension of the subspaces in . Such limiting value may be determined by the sample size available and the desirable error and confidence by inverting the bound of Proposition 4.

###### Proposition 5

If for ,

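$$d_{VC}(\mathcal{M}) \geq 2N \exp\left\{-\frac{N}{d_{VC}(\mathcal{M})}\left[\frac{1}{N}\log\left(\frac{\delta}{4}\right) + \left(\epsilon - 1/N\right)^2\right] + 1\right\}$$

then

$$P\left(\sup_{h \in \mathcal{M}} \left|L(h) - L_{\mathcal{D}_N}(h)\right| \leq \epsilon\right) \geq 1 - \delta.$$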
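The inversion behind Proposition 5 can also be carried out numerically: scan the VC dimension and keep the largest value for which the bound of Proposition 4 stays below $\delta$. The sketch below assumes that bound exactly as stated above; the function names and numerical values are illustrative assumptions.

```python
import math


def vc_bound(d_vc, n, eps):
    """Right-hand side of Proposition 4 for VC dimension d_vc and sample size n."""
    exponent = (d_vc / n) * (1 + math.log(2 * n / d_vc)) - (eps - 1 / n) ** 2
    return 4 * math.exp(exponent * n)


def largest_admissible_vc_dimension(n, eps, delta):
    """Largest d in {1, ..., n} with vc_bound(d, n, eps) <= delta, or 0 if none.

    The bound increases with d for d < 2n, so the scan stops at the first
    failure.
    """
    best = 0
    for d in range(1, n + 1):
        if vc_bound(d, n, eps) > delta:
            break
        best = d
    return best


# Illustrative values only.
d_max = largest_admissible_vc_dimension(n=10_000, eps=0.1, delta=0.05)
print(d_max, vc_bound(d_max, 10_000, 0.1))
```

A return value of 0 signals that, for the given sample size, precision and confidence, the bound cannot be met even by a subspace of VC dimension one.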

The downside of choosing $\mathcal{M}$ based on Proposition 5 is that we do not know the approximation error

$$L(h^\star_{\mathcal{M}}) - L(h^\star)$$

so that choosing $\mathcal{M}$ a priori based on $N$ and the desired complexity of the Hypotheses Space may not be feasible. Even though types I and II generalization errors are small on $\mathcal{M}$, i.e., $L_{\mathcal{D}_N}(h)$ is close to $L(h)$ for $h \in \mathcal{M}$, and $L(\hat{h}_{\mathcal{M}})$ is close to $L(h^\star_{\mathcal{M}})$, $L(h^\star_{\mathcal{M}})$ may be much greater than $L(h^\star)$, so there is considerable loss in precision (bias) when we reduce our Hypotheses Space from $\mathcal{H}$ to $\mathcal{M}$. In other words, the VC theory may be applied to bound the error between the estimated and target hypotheses of $\mathcal{M}$, but does not guarantee that the target hypotheses of $\mathcal{M}$ are as good as the target of $\mathcal{H}$. Although an SRM algorithm may mitigate this issue, it is still dependent on a tuning parameter, which should be chosen a priori.

One way of ensuring that, at least when "$N$ is large", the error, which we call type III generalization error,

$$P\left(L(h^\star_{\hat{\mathcal{M}}}) - L(h^\star) > \epsilon\right) \tag{4}$$

is small for $\epsilon > 0$, is to choose a random subspace $\hat{\mathcal{M}}$ of $\mathcal{H}$ based on data, instead of constraining $\mathcal{H}$ beforehand. This brings us to the second paradigm: by the structuring of $\mathcal{H}$ we may learn from data the Hypotheses Space on which to apply a learning machine by a procedure free of tuning parameters such as the error and confidence levels. However, we need to ensure that it is feasible to learn Hypotheses Spaces this way, in the sense that it is possible to bound not only types I and II generalization errors, but also, and perhaps most importantly, type III generalization error on $\hat{\mathcal{M}}$; this is the topic of the next section.

## 3 The Feasibility of Learning Hypotheses Spaces from Data

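In this section, we bound types I and II generalization errors when learning on a random subspace $\hat{\mathcal{M}}$ of $\mathcal{H}$, defined on an underlying probability space, which is as follows. Let $\mathcal{D}_N$ be a sample of size $N$, $\mathbb{L}(\mathcal{H})$ a Learning Space of $\mathcal{H}$ and $\hat{P}$ a consistent estimator of $P$, independent of $\mathcal{D}_N$ (we will show that these two conditions are not necessary; we assume them at this point to make the ideas clearer). Suppose that we have a measurable learning machine $\mathbb{M}_{\mathbb{L}(\mathcal{H})}$, dependent on the Learning Space $\mathbb{L}(\mathcal{H})$, satisfying

$$\omega \in \Omega \xrightarrow{\;(\mathcal{D}_N, \hat{P})\;} \left(\mathcal{D}_N(\omega), \hat{P}(\omega)\right) \xrightarrow{\;\mathbb{M}_{\mathbb{L}(\mathcal{H})}\;} \hat{\mathcal{M}}(\omega) \tag{5}$$

which is such that, given $\mathcal{D}_N$ and $\hat{P}$, it learns a subspace $\hat{\mathcal{M}} \in \mathbb{L}(\mathcal{H})$. Note from diagram (5) that $\hat{\mathcal{M}}$ is a measurable $\mathbb{L}(\mathcal{H})$-valued function as it is the composition of measurable functions, i.e., $\hat{\mathcal{M}} = \mathbb{M}_{\mathbb{L}(\mathcal{H})}(\mathcal{D}_N, \hat{P})$. Even though $\hat{\mathcal{M}}$ depends on $N$ through $\mathcal{D}_N$ and $\hat{P}$, we drop the subscript $N$ to ease notation.

The main feature of $\mathbb{M}_{\mathbb{L}(\mathcal{H})}$ which allows the learning of Hypotheses Spaces is that type III generalization error is asymptotically zero:

$$P\left(L(h^\star_{\hat{\mathcal{M}}}) - L(h^\star) > \epsilon\right) \xrightarrow{N \to \infty} 0 \tag{6}$$

for all $\epsilon > 0$, which is equivalent to

$$P\left(\hat{\mathcal{M}} \cap h^\star = \emptyset\right) \xrightarrow{N \to \infty} 0$$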

as .

We first study types I and II generalization errors obtained when learning on a random subset $\hat{\mathcal{M}}$ of $\mathcal{H}$. Theorem 1 extends Proposition 1 when learning on an $\hat{\mathcal{M}}$ in the case in which $|\mathcal{H}| < \infty$ and $L(h^\star) = 0$. In this scenario, learning on $\hat{\mathcal{M}}$ may be less efficient than learning on $\mathcal{H}$ when we cannot guarantee with probability 1 that $L(h^\star_{\hat{\mathcal{M}}}) = 0$; otherwise, learning on $\hat{\mathcal{M}}$ is at least as efficient as learning on $\mathcal{H}$. In what follows, $E$ means expectation under $P$.

###### Theorem 1

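Let $\hat{\mathcal{M}}$ be a random subset learned by $\mathbb{M}_{\mathbb{L}(\mathcal{H})}$, and assume that $|\mathcal{H}| < \infty$ and $L(h^\star) = 0$. Then, for all $N$ and $\epsilon > 0$,

$$P\left(L(\hat{h}_{\hat{\mathcal{M}}}) > \epsilon\right) \leq P\left(L(h^\star_{\hat{\mathcal{M}}}) > 0\right) + P\left(L(h^\star_{\hat{\mathcal{M}}}) = 0\right) E\left(|\hat{\mathcal{M}}| \,\middle|\, L(h^\star_{\hat{\mathcal{M}}}) = 0\right) e^{-N\epsilon}.$$

The bounds for types I and II generalization errors of Propositions 2, 3 and 4 may be tightened by taking expectations of functions of $\hat{\mathcal{M}}$. All bounds of Theorem 2 for $\hat{\mathcal{M}}$ are at least as good as the respective bounds for a Hypotheses Space of VC dimension $d_{VC}(\mathbb{L}(\mathcal{H}))$, so that, in general, regarding types I and II generalization errors, learning on $\hat{\mathcal{M}}$ is at least as efficient as learning on $\mathcal{H}$.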

###### Theorem 2

Let be a random subset learned by . Then

 P( suph∈^M∣∣LDN(h)−L(h)∣∣>ϵ)≤

and

 P( L(^h^M)−L(h⋆^M)>ϵ)

All quantities above on the right-hand side of the inequalities are lesser than the same expressions but with and in place of and , respectively. This shows that the sample complexity needed to learn on is at most that of . This implies that this complexity is at most that of , but may be much less if . We conclude that types I and II generalization errors on are lesser than that on and the sample complexity needed to learn on is that of , and not of . Therefore, even if , it may be feasible to learn on it, if we restrict the learning to such that the expectations on the right hand side of inequalities in Theorem 2 tend to zero as . However, even when these inequalities guarantee the consistency of regarding types I and II generalization errors, it is still necessary to check that types III and IV generalization errors are small to attest the feasibility of learning on .

We now study the error

$$P\left(L(\hat{h}_{\hat{\mathcal{M}}}) - L(h^\star) > \epsilon\right) \tag{7}$$

which, when small, makes it feasible to learn on $\hat{\mathcal{M}}$ instead of on $\mathcal{H}$. This error, which we call type IV generalization error, is the one we commit when learning on $\hat{\mathcal{M}}$ instead of on $\mathcal{H}$, as, by Theorem 2, types I and II generalization errors are, in general, smaller on $\hat{\mathcal{M}}$, so the loss incurred by learning on $\hat{\mathcal{M}}$ is due to type III, and especially type IV, generalization errors. Types II, III and IV generalization errors are represented in Figure 1.

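Since type IV generalization error may be bounded by the following triangle inequality in terms of types II and III generalization errors

$$P\left(L(\hat{h}_{\hat{\mathcal{M}}}) - L(h^\star) > \epsilon\right) \leq P\left(L(\hat{h}_{\hat{\mathcal{M}}}) - L(h^\star_{\hat{\mathcal{M}}}) > \epsilon/2\right) + P\left(L(h^\star_{\hat{\mathcal{M}}}) - L(h^\star) > \epsilon/2\right) \tag{8}$$

we need only show that type III generalization error (4) tends to zero as $N \to \infty$ to guarantee that (7) is asymptotically zero, as the first term on the right-hand side of (8) tends to zero as $N \to \infty$ by Theorem 2 if $d_{VC}(\mathbb{L}(\mathcal{H})) < \infty$.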
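Inequality (8) follows from a decomposition of the excess error and the union bound. The steps are written out here using only quantities already defined:

$$
L(\hat{h}_{\hat{\mathcal{M}}}) - L(h^\star)
= \underbrace{\left[L(\hat{h}_{\hat{\mathcal{M}}}) - L(h^\star_{\hat{\mathcal{M}}})\right]}_{A}
+ \underbrace{\left[L(h^\star_{\hat{\mathcal{M}}}) - L(h^\star)\right]}_{B},
$$

$$
\{A + B > \epsilon\} \subset \{A > \epsilon/2\} \cup \{B > \epsilon/2\}
\;\Rightarrow\;
P(A + B > \epsilon) \leq P(A > \epsilon/2) + P(B > \epsilon/2).
$$

The inclusion holds because $A \leq \epsilon/2$ and $B \leq \epsilon/2$ together force $A + B \leq \epsilon$.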

From the bounds of Theorems 1 and 2, and (8), we observe a pattern on the behaviour of types I, II, III and IV generalization errors which is depicted in Figure 1. On the one hand, as one may expect, types I and II generalization errors on are at most as great as on when , which is natural as with probability one and, therefore, is at most as complex as , but may be much simpler. Actually, the complexity of is bounded above by that of a Hypotheses Space of VC dimension . On the other hand, type III, and consequently type IV, generalization errors on may be too great, making it unfeasible to learn on .

Therefore, if guarantees a type III generalization error small, and consequently a small type IV generalization error, then learning on is as good as learning on , although learning on is more efficient, as types I and II generalization errors are smaller on it, and less examples are needed. Hence, there is a tradeoff between

• the complexity of needed to obtain small types III and IV generalization errors;

• types I and II generalization errors obtained when restricting the learning to .

As greater the complexity of , the smaller we expect types III and IV generalization errors to be; and greater types I and II generalization errors to be. Nevertheless, the complexity of may be too great in the sense that it is not worth the effort to learn on : one would rather learn on anyhow and guarantee zero types III and IV generalization errors. Therefore, this tradeoff must be regarded when learning Hypotheses Spaces: is it worth to learn on a simpler subspace at the cost of types III and IV generalization errors? With this tradeoff in mind, we seek to develop learning machines that guarantee asymptotically zero types III and IV generalization errors, which an implementation is possible due to the U-curve property.

### 3.1 A Machine to Learn Hypotheses Spaces

A suitable learning machine to learn Hypotheses Spaces needs to satisfy (6). Furthermore, the Hypotheses Space learned by $\mathbb{M}_{\mathbb{L}(\mathcal{H})}$ should be as simple as it can be under the restriction that it is consistent, i.e., types III and IV generalization errors are asymptotically zero. The simpler nature of $\hat{\mathcal{M}}$ ensures smaller types I and II generalization errors by Theorem 2, so we need only ensure small types III and IV generalization errors when the sample size is great enough. With this desired consistency in mind, the target Hypotheses Space of the learning machine to be developed, i.e., the Hypotheses Space that we seek to approximate by $\hat{\mathcal{M}}$, is, in a sense, the smallest subspace which contains a target hypothesis, as follows.

Define in $\mathbb{L}(\mathcal{H})$ the equivalence relation given by

$$\mathcal{M}_i \sim \mathcal{M}_j \text{ if, and only if, } d_{VC}(\mathcal{M}_i) = d_{VC}(\mathcal{M}_j) \text{ and } L(h^\star_i) = L(h^\star_j)$$

for $\mathcal{M}_i, \mathcal{M}_j \in \mathbb{L}(\mathcal{H})$: two subspaces in $\mathbb{L}(\mathcal{H})$ are equivalent if they have the same VC dimension and their target hypotheses are equally good. In this framework, let

$$\mathcal{L}^\star = \arg\min_{\mathcal{M}_i \in \mathbb{L}(\mathcal{H})/\sim} L(h^\star_i)$$

be the equivalence classes which contain a target hypothesis of $\mathcal{H}$, so that we define the target subspace as

$$\mathcal{M}^\star = \arg\min_{\mathcal{M}_i \in \mathcal{L}^\star} d_{VC}(\mathcal{M}_i)$$

which may be seen, in some sense, as the class of the smallest subspaces in $\mathbb{L}(\mathcal{H})$ that are not disjoint with $h^\star$.

The target class will have only one element if is unique and an additional condition is satisfied by . Define for each the set as the union of the subsets in which are properly contained in . Then is unique if implies , i.e., subtracting the hypotheses which are in related subspaces of lesser VC dimension, the subspaces with same are disjoint. This condition is satisfied for example by the Partition Lattice Learning Space on which each subspace with is an union of subspaces with VC dimension exactly one unity less. Indeed, in this case, , as it is either the subspace of the constant functions if is constant or , in which . There may be other conditions on and which guarantee the uniqueness of in some cases of interest. Nonetheless, an unique does not imply that the class has only one element as there may be two subspaces in with same minimum VC dimension which contain . The condition above is distribution-free, i.e., does not depend on .

We will develop a learning machine such that

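$$\hat{\mathcal{M}} = \mathbb{M}_{\mathbb{L}(\mathcal{H})}(\mathcal{D}_N, \hat{P}) \xrightarrow{N \to \infty} \mathcal{M}^\star \text{ with probability 1} \tag{9}$$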

in which . Not only (9) implies (6), but it also implies that

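$$E\left(F(\hat{\mathcal{M}})\right) \xrightarrow{N \to \infty} F(\mathcal{M}^\star)$$

by the Dominated Convergence Theorem, in which $F$ is any measurable real-valued function, as the domain of $F$ is finite. The convergence of $\hat{\mathcal{M}}$ ensures that the functions of $\hat{\mathcal{M}}$ on the right-hand side of the inequalities of Theorems 1 and 2 tend to the same functions evaluated at $\mathcal{M}^\star$ when $N \to \infty$. From this it follows that the learning machine is consistent.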

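A learning machine which satisfies (9) may be defined by mimicking the definition of $\mathcal{M}^\star$, but employing the in-sample and estimated out-of-sample errors instead of the out-of-sample error. Define in $\mathbb{L}(\mathcal{H})$ the equivalence relation given by

$$\mathcal{M}_i \mathrel{\hat{\sim}} \mathcal{M}_j \text{ if, and only if, } d_{VC}(\mathcal{M}_i) = d_{VC}(\mathcal{M}_j) \text{ and } \hat{L}(\hat{h}_i) = \hat{L}(\hat{h}_j)$$

for $\mathcal{M}_i, \mathcal{M}_j \in \mathbb{L}(\mathcal{H})$, which is a random, measurable equivalence relation. Recall that the representative (estimated) hypothesis $\hat{h}_i$ of an $\mathcal{M}_i$ is obtained by empirical risk minimization, while $\hat{L}$ is a consistent estimator of the out-of-sample error, commonly given by the expectation of a loss function under an empirical measure generated by a validation sample. This is done to avoid overfitting and to obtain a regularized estimation procedure via a U-curve problem (see part I for more details). Let

$$\hat{\mathcal{L}} = \arg\min_{\mathcal{M}_i \in \mathbb{L}(\mathcal{H})/\hat{\sim}} \hat{L}(\hat{h}_i)$$

be the subset of $\mathbb{L}(\mathcal{H})$ containing all global minima of $\mathbb{L}(\mathcal{H})$ by Definition 4. Then, $\mathbb{M}_{\mathbb{L}(\mathcal{H})}$ selects

$$\hat{\mathcal{M}} = \arg\min_{\mathcal{M}_i \in \hat{\mathcal{L}}} d_{VC}(\mathcal{M}_i). \tag{10}$$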

By selecting $\hat{\mathcal{M}}$ this way, we get to learn on relatively simple Hypotheses Spaces, which is more efficient, yielding smaller types I and II generalization errors. Indeed, selecting $\hat{\mathcal{M}}$ in this manner ensures that it has the smallest VC dimension under the constraint that it is a global minimum of $\mathbb{L}(\mathcal{H})$ and, as the quantities inside the expectations in Theorem 2 are monotonically increasing functions of the VC dimension, for fixed $\epsilon$ and $N$, we tend to have smaller expectations, thus tighter bounds. Furthermore, this choice of $\hat{\mathcal{M}}$ is in accordance with the paradigm of selecting the simplest model that properly expresses reality, which in this case is represented by the fact that $\hat{\mathcal{M}}$ is the simplest global minimum. Observe that if the Learning Space satisfies a U-curve property, then (10) may be computed via a U-curve algorithm (see part I for more details).
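Rule (10) can be sketched as a two-stage minimization over a list of candidate subspaces: first keep the candidates with the lowest estimated out-of-sample error, then keep those with the least VC dimension. The sketch below scans all candidates exhaustively, which is an assumption made for simplicity and not the U-curve search of part I; the `Candidate` record, the tolerance `tol` and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    vc_dim: int
    est_error: float  # estimated out-of-sample error of the ERM hypothesis


def select_subspace(candidates, tol=0.0):
    """Return the candidates of least VC dimension among the error minimizers."""
    best_error = min(c.est_error for c in candidates)
    minimizers = [c for c in candidates if c.est_error <= best_error + tol]
    least_dim = min(c.vc_dim for c in minimizers)
    return [c for c in minimizers if c.vc_dim == least_dim]


# Hypothetical candidates.
candidates = [
    Candidate("M1", 1, 0.30),
    Candidate("M2", 2, 0.10),
    Candidate("M3", 3, 0.10),
    Candidate("M4", 2, 0.10),
]
chosen = select_subspace(candidates)
print([c.name for c in chosen])
```

The function returns a list because several subspaces may tie on both criteria, which mirrors the fact that the selected class need not have a single element.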

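We now present conditions which ensure that $\hat{\mathcal{M}}$ converges to $\mathcal{M}^\star$ with probability 1. In order to find $\mathcal{M}^\star$ in $\mathbb{L}(\mathcal{H})$, we do not need to know $L(h^\star_i)$ exactly for all $i \in \mathcal{J}$, i.e., we do not need $\hat{L}(\hat{h}_i) = L(h^\star_i)$. We argue that it suffices to have $\hat{L}(\hat{h}_i)$ close enough to $L(h^\star_i)$ so one can figure out that $\mathcal{M}^\star$ contains a target hypothesis, even if one does not know its error for sure. How close is close enough depends on $P$, i.e., is not distribution-free, and is given by the maximum discrimination error (MDE) of $\mathbb{L}(\mathcal{H})$ under $P$, defined as

$$\epsilon^\star = \epsilon^\star(\mathbb{L}(\mathcal{H}), P) := \min_{\substack{\mathcal{M} \in \mathbb{L}(\mathcal{H})/\sim \\ L(h^\star_{\mathcal{M}}) \neq L(h^\star_{\mathcal{M}^\star})}} L(h^\star_{\mathcal{M}}) - L(h^\star_{\mathcal{M}^\star})$$

The MDE is the minimum difference between the out-of-sample error of a global target hypothesis and that of a target of a subspace in $\mathbb{L}(\mathcal{H})$ which does not contain a global target. In other words, it is the difference between the error of the best subspace and that of the second best. The MDE is greater than zero if there exists at least one $\mathcal{M} \in \mathbb{L}(\mathcal{H})$ such that $L(h^\star_{\mathcal{M}}) \neq L(h^\star)$, i.e., there is a subset in $\mathbb{L}(\mathcal{H})$ which does not contain a target hypothesis. If $L(h^\star_{\mathcal{M}}) = L(h^\star)$ for all $\mathcal{M} \in \mathbb{L}(\mathcal{H})$, then type III generalization error is zero, and type IV reduces to type II, so by Theorems 1 and 2 it is feasible to learn (on) $\hat{\mathcal{M}}$. Thus we assume $\epsilon^\star > 0$.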
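When the out-of-sample errors of the target hypotheses are known, as in a simulation study, the MDE is the gap between the smallest error and the next distinct value. The following sketch assumes such errors are given in a dictionary keyed by subspace name; the names and values are hypothetical.

```python
def maximum_discrimination_error(target_errors):
    """Gap between the best target error and the next distinct one.

    Returns None when every subspace attains the best error, the case the
    text sets aside by assuming a positive MDE.
    """
    best = min(target_errors.values())
    gaps = [err - best for err in target_errors.values() if err > best]
    return min(gaps) if gaps else None


# Hypothetical out-of-sample errors of the target hypothesis of each subspace.
errors = {"M1": 0.25, "M2": 0.125, "M3": 0.125, "M4": 0.5}
print(maximum_discrimination_error(errors))
```

In practice these errors are unknown, so the MDE serves as a quantity for analysis rather than something the learning machine computes.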

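The terminology MDE is used because we can show that if we are able to estimate $L(\hat{h}_i)$ by $\hat{L}(\hat{h}_i)$ within $\epsilon^\star/4$, and have $L(\hat{h}_i)$ within $\epsilon^\star/4$ of $L(h^\star_i)$, for all $i \in \mathcal{J}$ simultaneously, with high probability, then $\hat{\mathcal{M}} = \mathcal{M}^\star$ with high probability, i.e., we can discriminate $\mathcal{M}^\star$ from all other $\mathcal{M} \in \mathbb{L}(\mathcal{H})$ with high probability. We show that a given constant times $\epsilon^\star$ is the greatest error one can commit when estimating $L(\hat{h}_i)$ by $\hat{L}(\hat{h}_i)$, and can have between $L(\hat{h}_i)$ and $L(h^\star_i)$, in order for $\hat{\mathcal{M}}$ to be equal to $\mathcal{M}^\star$. This is the result of the next lemma.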

###### Lemma 1

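Suppose that there exists $\delta > 0$ such that

$$P\left(\sup_{i \in \mathcal{J}} \left|L(\hat{h}_i) - \hat{L}(\hat{h}_i)\right| < \epsilon^\star/4, \; \sup_{i \in \mathcal{J}} \left|L(\hat{h}_i) - L(h^\star_i)\right| < \epsilon^\star/4\right) \geq 1 - \delta. \tag{11}$$

Then

$$P\left(\hat{\mathcal{M}} = \mathcal{M}^\star\right) \geq 1 - \delta. \tag{12}$$

From Lemma 1 one sees that the independence between $\mathcal{D}_N$ and $\hat{P}$ is not a necessary condition to guarantee the consistency of the learning machine. Nevertheless, from this lemma and the independence between $\mathcal{D}_N$ and $\hat{P}$, the almost sure convergence of $\hat{\mathcal{M}}$ follows immediately in a quite general framework. We show that if $\hat{L}$ is a uniformly consistent estimator of $L$ and type II generalization error tends to zero for all $\mathcal{M}_i \in \mathbb{L}(\mathcal{H})$, then $\hat{\mathcal{M}}$ converges to $\mathcal{M}^\star$ with probability 1 when $N \to \infty$.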

###### Theorem 3

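Assume that $\mathcal{D}_N$ and $\hat{P}$ are independent. If, for all $\epsilon > 0$,

$$\lim_{N \to \infty} P\left(\sup_{h \in \mathcal{H}} \left|L(h) - \hat{L}(h)\right| > \epsilon\right) = 0 \quad \text{and} \quad \lim_{N \to \infty} P\left(\sup_{i \in \mathcal{J}} \left|L(\hat{h}_i) - L(h^\star_i)\right| > \epsilon\right) = 0 \tag{13}$$

then

$$\hat{\mathcal{M}} = \mathbb{M}_{\mathbb{L}(\mathcal{H})}(\mathcal{D}_N, \hat{P}) \xrightarrow{N \to \infty} \mathcal{M}^\star \text{ with probability 1}.$$

If $d_{VC}(\mathbb{L}(\mathcal{H})) < \infty$, then

$$\lim_{N \to \infty} P\left(\sup_{i \in \mathcal{J}} \left|L(\hat{h}_i) - L(h^\star_i)\right| > \epsilon\right) = 0$$

for all $\epsilon > 0$ by Proposition 4, so that $\hat{\mathcal{M}}$ will converge to $\mathcal{M}^\star$ with probability 1 if

$$\lim_{N \to \infty} P\left(\sup_{h \in \mathcal{H}} \left|L(h) - \hat{L}(h)\right| > \epsilon\right) = 0 \tag{14}$$

for all $\epsilon > 0$. Now, if

$$\hat{L}(h) = \frac{1}{M_N} \sum_{k=1}^{M_N} \ell\left(h(\tilde{X}_k), \tilde{Y}_k\right)$$

in which $\{(\tilde{X}_k, \tilde{Y}_k)\}_{k=1}^{M_N}$ is a validation sample, independent of $\mathcal{D}_N$, then (14) follows from Proposition 4 if $d_{VC}(\mathcal{H}) < \infty$.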
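The estimator $\hat{L}$ above is an average of losses over the validation sample. A minimal sketch for the simple (0-1) loss follows, assuming hypotheses are callables and the validation sample is a list of `(x, y)` pairs; the sample and classifier shown are hypothetical.

```python
def validation_error(h, validation_sample):
    """Empirical mean of the simple (0-1) loss of h over (x, y) pairs."""
    if not validation_sample:
        raise ValueError("validation sample must be non-empty")
    mistakes = sum(1 for x, y in validation_sample if h(x) != y)
    return mistakes / len(validation_sample)


# Hypothetical validation sample and classifier.
sample = [(0, 0), (1, 0), (2, 1), (3, 1)]
h = lambda x: int(x >= 1)
print(validation_error(h, sample))
```

Keeping the validation pairs separate from the training sample is what makes this average an honest estimate of the out-of-sample error of a hypothesis fitted on the training data.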

###### Corollary 1

If $\hat{L}$ is the empirical mean of $\ell$ under a validation sample, independent of $\mathcal{D}_N$, whose size increases to infinity with $N$, then the learning machine is consistent.

From the results above, assuming that $P$ is fixed, it follows that the Learning Space plays an important role in the rate of convergence of $P(\hat{\mathcal{M}} = \mathcal{M}^\star)$ to 1, by means of $\epsilon^\star$. If the MDE of $\mathbb{L}(\mathcal{H})$ under $P$ is great, then we need less precision when estimating $L(\hat{h}_i)$ by $\hat{L}(\hat{h}_i)$ and $L(h^\star_i)$ by $L(\hat{h}_i)$ in order for $\hat{\mathcal{M}}$ to be equal to $\mathcal{M}^\star$, so fewer examples are needed to learn $\mathcal{M}^\star$. Also, the sample complexity to learn $\mathcal{M}^\star$ is that of the most complex subspace in $\mathbb{L}(\mathcal{H})$, as the supremum inside each probability is over $\mathcal{J}$, which implies that it is at most the complexity of a subspace with VC dimension $d_{VC}(\mathbb{L}(\mathcal{H}))$, which may be smaller than that of $\mathcal{H}$. Therefore, one must embed into $\mathbb{L}(\mathcal{H})$ all prior information about $h^\star$ and/or $P$, seeking to increase $\epsilon^\star$ and decrease $d_{VC}(\mathbb{L}(\mathcal{H}))$. Under this approach to model selection, the modelling of $\mathbb{L}(\mathcal{H})$ and the development of families of Learning Spaces which solve classes of problems are important topics for future research.

From a practical point of view, the split of a sample between training () and validation () may be guided by Theorem 3: for fixed , find the sample sizes for which we may take as bounds for the quantities inside the limits in (13) such that the lower bound for is greatest (see (18) for such a bound). Observe that is a bound for a type I-kind generalization error, while is a bound for a type II-kind generalization error. Therefore, we may choose the ratio based on the bounds available for types I and II generalization errors for and each .

By the deductions above, it follows that both types III and IV generalization errors tend to zero if conditions (13) are satisfied, as $\hat{\mathcal{M}} = \mathcal{M}^{\star}$ implies that types III and IV errors are zero. However, we are not able, by making use of the bounds provided by VC theory and extended to in this paper, to find a bound for these generalization errors which does not depend on , and thus on and . In other words, we have established a distribution-free convergence of $\hat{\mathcal{M}}$ to $\mathcal{M}^{\star}$, but are not able at this time to provide a distribution-free rate for such convergence, although we know that it is exponential depending on if .

## 4 Discussion

In this paper, we proposed a data-driven, systematic, structured and consistent approach to model selection, establishing bounds for generalization errors and the convergence of the estimated random Hypotheses Space $\hat{\mathcal{M}}$ to the target Hypotheses Space $\mathcal{M}^{\star}$. We introduced the maximum discrimination error (MDE) and highlighted the importance of properly embedding all prior information into the Learning Space. In this section, we discuss some ramifications and perspectives of the proposed approach.

The asymptotic behaviour of learning Hypotheses Spaces under the Learning Space approach reveals an important property of Machine Learning which may be the key to understanding why learning machines work even when there is a lack of data. Consider the following example. Let $\mathcal{H}$ be the space of all binary functions with domain , , and $\mathbb{L}(\mathcal{H})$ be the Partition Lattice Learning Space (see part I for more details). Define

$$\mathbb{L}_1(\mathcal{H}) \coloneqq \left\{\mathcal{M} \in \mathbb{L}(\mathcal{H}) : d_{VC}(\mathcal{M}) \leq 2\right\}$$

as the subspaces in $\mathbb{L}(\mathcal{H})$ with VC dimension less than or equal to two, which is a Learning Space with . From the results of this paper and part I we have the following properties when:

• Learning on $\mathbb{L}(\mathcal{H})$: the weak U-curve property is satisfied, so we do not have to exhaustively search $\mathbb{L}(\mathcal{H})$, whose size is the Bell number, and the sample complexity needed to have low types I and II generalization errors is at most that of a subspace of VC dimension .

• Learning on $\mathbb{L}_1(\mathcal{H})$: we have in principle to exhaustively search $\mathbb{L}_1(\mathcal{H})$, which has size . However, the sample complexity needed to have low types I and II generalization errors is at most that of a subspace of VC dimension 2 by Theorem 2.

From these properties we conclude that:

1. Types I and II generalization errors are smaller on $\mathbb{L}_1(\mathcal{H})$ and, as the MDE and the target subspace are the same in $\mathbb{L}(\mathcal{H})$ and $\mathbb{L}_1(\mathcal{H})$, the rates of convergence of types III and IV generalization errors are in principle the same in both Learning Spaces.

2. Although we lose in types I and II generalization errors when learning on $\mathbb{L}(\mathcal{H})$, we gain in computational efficiency, as we do not have to exhaustively search it due to the U-curve property. If the sample size is large enough so that generalization errors on $\mathbb{L}(\mathcal{H})$ are acceptable, learning on it is better, as it requires less computational power.

These properties are evidence that a lack of experimental data may be mitigated by a large computational capacity. On the one hand, if we have few examples but high computational power, we may learn on $\mathbb{L}_1(\mathcal{H})$, as the sample complexity is that of a Hypotheses Space with VC dimension 2, so fewer examples are needed. On the other hand, if we have a sufficient number of examples, then we may learn on $\mathbb{L}(\mathcal{H})$, which demands less computational power due to the U-curve property. As in both cases the MDE and the target space are the same, we do not lose regarding types III and IV generalization errors when learning on either of them, so we need either computational power or a sufficiently large sample to decrease types I and II generalization errors.
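The sizes of the two search spaces in this example can be computed directly. The sketch below assumes a finite domain with $n$ points and that, in the Partition Lattice Learning Space, the VC dimension of a subspace equals the number of blocks of its partition, so that the subspaces with VC dimension at most two correspond to partitions with at most two blocks. Both are assumptions made for illustration (see part I for the precise construction), and the function names are hypothetical.

```python
from functools import lru_cache


def bell_number(n: int) -> int:
    """Number of partitions of an n-point set, via the Bell triangle."""
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Number of partitions of an n-point set into exactly k blocks."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def partitions_with_at_most_blocks(n: int, max_blocks: int) -> int:
    """Number of partitions of an n-point set with at most max_blocks blocks."""
    return sum(stirling2(n, k) for k in range(1, max_blocks + 1))


n = 10
print(bell_number(n))                        # 115975 partitions in total
print(partitions_with_at_most_blocks(n, 2))  # 512 with at most two blocks
```

For $n = 10$ the restricted space is far smaller than the full lattice, yet its size still doubles with each added point ($2^{n-1}$ partitions with at most two blocks), which is why searching it exhaustively becomes expensive as the domain grows.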

The proposed framework may also be applied to try to reduce the approximation error, as follows. Let $\mathcal{H}^{\star}$ be the set of all functions with domain , the support of , and image , which is possibly a Hypotheses Space with infinite VC dimension. Denote

$$h_{Bayes} = \arg\min_{h \in \mathcal{H}^{\star}} L(h)$$

so that $L(h_{Bayes})$ is the Bayes error, the least error we commit when classifying instances with values . Note that $h_{Bayes}$ may or may not be in $\mathcal{H}$, and when it is not, we commit the error

$$L(h^{\star}) - L(h_{Bayes})$$

which is called the approximation error (see [4, Chapter 12]). This error is in general not controllable and, in order to decrease it, one must increase , which in turn increases the risk of overfitting if the sample size is not large enough. This scenario is depicted in Figure 2 (a).
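For reference, the approximation error is one of the two terms in the standard decomposition of the excess error of a learned hypothesis. Writing $\hat{h}$ for a hypothesis estimated from data and $h^{\star}$ for a minimizer of $L$ in the Hypotheses Space (notation assumed here for illustration), we have

$$L(\hat{h}) - L(h_{Bayes}) = \underbrace{L(\hat{h}) - L(h^{\star})}_{\text{estimation error}} + \underbrace{L(h^{\star}) - L(h_{Bayes})}_{\text{approximation error}}.$$

The first term is of the kind controlled by the type II generalization error and shrinks with the sample size, while the second depends only on the choice of the Hypotheses Space, which is why enlarging the space is the only way to reduce it.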

However, with the method presented in this paper and part I, we may increase $\mathcal{H}$ while mitigating the risk of overfitting, so we expect to be able to reduce the approximation error. In a perfect scenario, one would expect the scheme presented in Figure 2 (b): we choose a highly complex $\mathcal{H}$ so that it contains $h_{Bayes}$, but we actually learn on a relatively simple subspace which also contains $h_{Bayes}$. Even if $h_{Bayes}$ is not in our complex $\mathcal{H}$, which we actually do not know, we may expect the approximation error to be small, as, in principle, we chose an $\mathcal{H}$ more complex than we would have if we were not able to control overfitting or search .

The considerations about the Bayes error, the extended learning model under a Learning Space, the U-curve property and the feasibility of learning the Hypotheses Space with as few a priori choices as possible, i.e., without tuning parameters, are the main contributions of this sequence of papers. We strongly believe that the Learning Space framework for model selection may be applied to a number of established learning models in order to obtain optimal or suboptimal solutions, which may improve their performance. Also, the approach may be used as a regularization procedure which may be audited, by tracing a U-curve algorithm, in order to explain why a Hypotheses Space was selected, which is quite useful in the field of Machine Learning interpretability. Finally, we believe that the consequences of the sample size versus computational power paradigm should be further investigated to improve our knowledge about the high performance of black box algorithms.

There are multiple perspectives for future research. From a theoretical standpoint, one could try to find distribution-free bounds for types III and IV generalization errors in some cases of interest, as when $\hat{\mathbb{P}}$ is the empirical measure generated by a validation sample, which depend only on . Furthermore, it would be interesting to show a variant of Theorem 3 when there is dependence between $\hat{\mathbb{P}}$ and $\mathcal{D}_N$. Observe that this is most often the case in practice, when cross-validation and resampling methods are used instead of an independent validation sample. Regarding U-curve algorithms, a general multi-purpose implementation is lacking, as they are implemented only for specific cases (see [7]). Also, there is a lot of ground to break in the direction of developing families of Learning Spaces which solve a class of problems, showing the existence of the U-curve property for other Learning Spaces, and developing optimal and suboptimal U-curve algorithms. The extension to the learning model proposed here may be a tool to understand and enhance the ever-growing niche of high-performance and computationally demanding learning applications.

## Appendix A Proof of results

#### Proof:

[Proof of Theorem 1] If , then is equal to

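$$\mathbb{P}\left(L(\hat{h}_{\hat{\mathcal{M}}}) > \epsilon \,\middle|\, L(h^{\star}_{\hat{\mathcal{M}}}) = 0\right) \mathbb{P}\left(L(h^{\star}_{\hat{\mathcal{M}}}) = 0\right) + \mathbb{P}\left(L(\hat{h}_{\hat{\mathcal{M}}}) > \epsilon,\ L(h^{\star}_{\hat{\mathcal{M}}}) > 0\right)$$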

which is less than