# Measurability Aspects of the Compactness Theorem for Sample Compression Schemes

It was proved in 1998 by Ben-David and Litman that a concept space has a sample compression scheme of size d if and only if every finite subspace has a sample compression scheme of size d. In the compactness theorem, measurability of the hypotheses of the created sample compression scheme is not guaranteed; at the same time, measurability of the hypotheses is a necessary condition for learnability. In this thesis we discuss when a sample compression scheme, created from compression schemes on finite subspaces via the compactness theorem, has measurable hypotheses. We show that if X is a standard Borel space with a d-maximum and universally separable concept class C, then (X,C) has a sample compression scheme of size d with universally Borel measurable hypotheses. Additionally, we introduce a new variant of compression scheme called a copy sample compression scheme.

### 1.1 Vapnik-Chervonenkis Dimension

We begin with the definitions of a concept space and the VC dimension associated to a concept space.

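A concept space is a pair $(X,\mathcal{C})$ consisting of a set $X$ equipped with a set $\mathcal{C}$ of subsets of $X$. $X$ is referred to as the domain, and $\mathcal{C}$ is referred to as the concept class. For a subset $A$ of $X$, denote

$$\mathcal{C} \sqcap A = \{C \cap A : C \in \mathcal{C}\},$$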

and we say that is a subspace of if and .

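[[vapnik:264]] We say that a subset $A$ of $X$ is shattered by $\mathcal{C}$ if $\mathcal{C} \sqcap A = 2^A$.

[[vapnik:264]] The Vapnik-Chervonenkis dimension or VC-dimension of $(X,\mathcal{C})$ (denoted $\mathrm{VC}(X,\mathcal{C})$, or $\mathrm{VC}(\mathcal{C})$ when $X$ is understood) is

$$\mathrm{VC}(\mathcal{C}) = \sup\{|A| : A \subseteq X,\ A \text{ is finite},\ A \text{ is shattered by } \mathcal{C}\}.$$

In particular if the value is infinite, we say $\mathrm{VC}(\mathcal{C}) = \infty$.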
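
For a finite domain, the two definitions above can be checked by brute force. The following is a minimal sketch, not part of the original text: the function names `restrict`, `is_shattered` and `vc_dimension` are hypothetical, concepts are modelled as Python sets, and the class is assumed to be nonempty.

```python
from itertools import combinations


def restrict(concepts, A):
    """Return the trace C ⊓ A = {C ∩ A : C in concepts} as a set of frozensets."""
    A = frozenset(A)
    return {frozenset(C) & A for C in concepts}


def is_shattered(concepts, A):
    """A is shattered when the trace on A contains all 2^|A| subsets of A."""
    A = frozenset(A)
    return len(restrict(concepts, A)) == 2 ** len(A)


def vc_dimension(domain, concepts):
    """Brute-force VC dimension of a nonempty concept class on a finite domain."""
    domain = list(domain)
    best = 0  # the empty set is shattered by any nonempty class
    for k in range(1, len(domain) + 1):
        if any(is_shattered(concepts, A) for A in combinations(domain, k)):
            best = k
        else:
            break  # every subset of a shattered set is shattered, so stop here
    return best


# Initial segments {y : y <= x} of {1, 2, 3, 4}, together with the empty set.
domain = [1, 2, 3, 4]
segments = [set()] + [set(range(1, x + 1)) for x in domain]
print(vc_dimension(domain, segments))
```

The search is exponential in the size of the domain, so it is only meant for sanity-checking small examples such as the ones that follow.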

The following are some elementary or well known examples of VC dimension which can be found in every text on statistical learning.

###### Example

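Let $X$ be any infinite set and $\mathcal{C} = 2^X$. Then clearly $\mathrm{VC}(\mathcal{C}) = \infty$ because every (finite) $A \subseteq X$ has $\mathcal{C} \sqcap A = 2^A$ and so is shattered.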

###### Example

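Let $X$ be any totally ordered set with at least two elements, and let

$$\mathcal{C} = \{I_x : x \in X\} \cup \{\emptyset\},$$

where $I_x = \{y \in X : y \le x\}$ is an initial segment of $X$. For any $x, y \in X$ where $x \neq y$, without loss of generality $y < x$, we have

$$\mathcal{C} \sqcap \{x,y\} = \{\emptyset, \{y\}, \{x,y\}\},$$

hence $\{y\}$ is shattered, however $\{x\} \notin \mathcal{C} \sqcap \{x,y\}$ and so $\{x,y\}$ is not shattered. Therefore $\mathrm{VC}(\mathcal{C}) = 1$.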

###### Example

Let $X = \mathbb{R}^2$ and

$$\mathcal{C} = \{[a,b] \times [c,d] : a,b,c,d \in \mathbb{R}\}.$$

Clearly shatters . Now let

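$$A = \{(a_1,a_2), (b_1,b_2), (c_1,c_2), (d_1,d_2), (e_1,e_2)\}$$

be given. Without loss of generality $(a_1,a_2)$ is the leftmost point, $(b_1,b_2)$ is the highest point, $(c_1,c_2)$ is the rightmost point, and $(d_1,d_2)$ is the lowest point. Whenever

$$\{(a_1,a_2), (b_1,b_2), (c_1,c_2), (d_1,d_2)\} \subseteq [a,b] \times [c,d] \in \mathcal{C},$$

we have

$$(e_1,e_2) \in [a_1,c_1] \times [d_2,b_2] \subseteq [a,b] \times [c,d],$$

and so

$$\{(a_1,a_2), (b_1,b_2), (c_1,c_2), (d_1,d_2)\} \notin \mathcal{C} \sqcap A.$$

Hence no set of five points is shattered. Therefore $\mathrm{VC}(\mathcal{C}) = 4$.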

Unless otherwise specified, from now on we will consider to be our concept space, and .

[[vapnik:264]] The $n$-th shatter coefficient of $\mathcal{C}$ is defined to be

$$s(\mathcal{C},n) = \max\{|\mathcal{C} \sqcap A| : A \subseteq X,\ |A| = n\}.$$

Note that .

###### Notation

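Let $\binom{n}{\le d}$ denote

$$\binom{n}{\le d} = \sum_{i=0}^{d} \binom{n}{i}.$$

[Sauer-Shelah Lemma [MR0307902]] Let $\mathrm{VC}(\mathcal{C}) = d$ and $n \ge d \ge 1$. Then

$$s(\mathcal{C},n) \le \binom{n}{\le d} \le \left(\frac{en}{d}\right)^{d}.$$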
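
Both inequalities of the lemma can be checked numerically on a small class. The sketch below is an illustration only: `binom_le` and `shatter_coefficient` are hypothetical helper names, and the class used is the set of initial segments of a six-point chain together with the empty set, which has VC dimension 1.

```python
from itertools import combinations
from math import comb, e


def binom_le(n, d):
    """The sum of the binomial coefficients C(n, 0) + ... + C(n, d)."""
    return sum(comb(n, i) for i in range(d + 1))


def shatter_coefficient(domain, concepts, n):
    """Largest trace size |C ⊓ A| over all n-point subsets A of a finite domain."""
    return max(
        len({frozenset(C) & frozenset(A) for C in concepts})
        for A in combinations(domain, n)
    )


# Initial segments of {1, ..., 6} plus the empty set: a class of VC dimension d = 1.
domain = list(range(1, 7))
segments = [set()] + [set(range(1, x + 1)) for x in domain]
d = 1

for n in range(1, len(domain) + 1):
    s_n = shatter_coefficient(domain, segments, n)
    # Columns: n, s(C, n), binom(n, <= d), (en/d)^d rounded to two decimals.
    print(n, s_n, binom_le(n, d), round((e * n / d) ** d, 2))
```

For this class the first inequality holds with equality for every $n$, which is exactly the $d$-maximum property introduced in the next section.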

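We can consider $\mathcal{C}$ as a "function class"; a family of $\{0,1\}$-valued functions on $X$: Let

$$F_{\mathcal{C}} = \{\chi_C : C \in \mathcal{C}\}$$

where $\chi_C$ is the indicator function of $C$ on $X$. Similarly, if $F$ is a family of $\{0,1\}$-valued functions on $X$ we can get a concept class

$$\mathcal{C}_F = \{C \in 2^X : \chi_C = f \text{ for some } f \in F\}.$$

Define shattering for a function class as follows: $A$ is shattered by $F$ if $\{f|_A : f \in F\} = \{0,1\}^A$. We can see that $\mathcal{C}$ shatters $A$ iff $F_{\mathcal{C}}$ shatters $A$, and $F$ shatters $A$ iff $\mathcal{C}_F$ shatters $A$, so the two notions are equivalent.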
In the future we will consider concepts as functions, but will still use set relations and operations on concepts, which will have the obvious meaning; for instance will be the same as , the same as support support, the same as , etc.

### 1.2 Maximum and Maximal Classes

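The following definitions are due to [welzl87rangespaces]. Let $d \in \mathbb{N}$. A concept class $\mathcal{C}$ is $d$-maximum if for every finite $A \subseteq X$,

$$|\mathcal{C} \sqcap A| = \binom{|A|}{\le d}.$$

A concept class $\mathcal{C}$ is $d$-maximal if $\mathrm{VC}(\mathcal{C}) = d$, and for any $A \in 2^X \setminus \mathcal{C}$ we have $\mathrm{VC}(\mathcal{C} \cup \{A\}) > d$.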

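Note that if $\mathcal{C}$ is $d$-maximum, then $\mathrm{VC}(\mathcal{C}) = d$ because for finite $A \subseteq X$, if $|A| = d$ then

$$|\mathcal{C} \sqcap A| = \binom{d}{\le d} = 2^d = 2^{|A|},$$

so $A$ is shattered, and if $|A| > d$ then

$$|\mathcal{C} \sqcap A| = \binom{|A|}{\le d} < 2^{|A|},$$

so $A$ is not shattered.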
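
On a finite domain the $d$-maximum condition can be tested directly from the definition. This is a minimal sketch under that finiteness assumption; `binom_le` and `is_d_maximum` are hypothetical names, not notation from the text.

```python
from itertools import combinations
from math import comb


def binom_le(n, d):
    """The sum of the binomial coefficients C(n, 0) + ... + C(n, d)."""
    return sum(comb(n, i) for i in range(d + 1))


def is_d_maximum(domain, concepts, d):
    """Check |C ⊓ A| == binom(|A|, <= d) for every subset A of a finite domain."""
    domain = list(domain)
    for k in range(len(domain) + 1):
        for A in combinations(domain, k):
            trace = {frozenset(C) & frozenset(A) for C in concepts}
            if len(trace) != binom_le(k, d):
                return False
    return True


domain = [1, 2, 3, 4]
# Initial segments {y : y <= x} together with the empty set.
segments = [set()] + [set(range(1, x + 1)) for x in domain]
print(is_d_maximum(domain, segments, 1))
# A class with too few concepts fails already on a one-point subset.
print(is_d_maximum(domain, [set(), {1}], 1))
```

Checking every subset is redundant for finite domains once the cardinality criterion proved later in this section is available, but it mirrors the definition most directly.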

As a consequence of Zorn's Lemma every concept class of VC dimension $d$ is contained in a $d$-maximal concept class.

Maximum does not necessarily imply maximal and vice versa. Also note that if $(X,\mathcal{C})$ is $d$-maximum, any subspace of $(X,\mathcal{C})$ is $d$-maximum as well, but this is not necessarily the case for $d$-maximal.

###### Example ([Floyd95samplecompression])

Let ,
. It is easy to check $\mathcal{C}$ is $2$-maximal but not $2$-maximum since

$$|\mathcal{C}| = 10 < 11 = \binom{4}{\le 2}.$$

###### Example

Let and . For any finite, without loss of generality with , we have that

$$|\mathcal{C} \sqcap A| = |\{\emptyset, \{x_1\}, \{x_1,x_2\}, \dots, \{x_1,\dots,x_n\}\}| = |A| + 1 = \binom{|A|}{\le 1},$$

thus $\mathcal{C}$ is $1$-maximum. However, $\mathcal{C}$ is not $1$-maximal since . Note that any concept space $(X,\mathcal{C})$ where $X$ is totally ordered with no minimal element, and where $\mathcal{C}$ is the set of all initial segments, is $1$-maximum. This is also the case if $X$ has at least two elements, where $\mathcal{C}$ is the set of all initial segments and the empty set.

###### Remark

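If $X$ is finite, then $d$-maximum implies $d$-maximal.
If $\mathcal{C}$ is $d$-maximum, then any $A \in 2^X \setminus \mathcal{C}$ has

$$|\mathcal{C} \cup \{A\}| = |\mathcal{C}| + 1 = \binom{|X|}{\le d} + 1 > \binom{|X|}{\le d},$$

hence by Sauer's Lemma $\mathrm{VC}(\mathcal{C} \cup \{A\}) > d$, and therefore $\mathcal{C}$ is $d$-maximal.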

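[[welzl87rangespaces]] Let $(X,\mathcal{C})$ be finite with VC-dimension $d$. For $x_0 \in X$, there are at most $\binom{|X|-1}{\le d-1}$ sets $C \in \mathcal{C}$ such that $x_0 \in C$ and $C \setminus \{x_0\} \in \mathcal{C}$.

###### Proof.

Let $Y = X \setminus \{x_0\}$, and

$$\mathcal{C}' = \{C \in \mathcal{C} : x_0 \in C \text{ and } C \setminus \{x_0\} \in \mathcal{C}\}.$$

Suppose

$$|\mathcal{C}'| > \binom{|X|-1}{\le d-1}.$$

Then

$$|\mathcal{C}' \sqcap Y| = |\mathcal{C}'| > \binom{|X|-1}{\le d-1} = \binom{|Y|}{\le d-1},$$

thus by Sauer's Lemma $\mathrm{VC}(\mathcal{C}' \sqcap Y) \ge d$. Let $x_1, \dots, x_d$ be points in $Y$ shattered by $\mathcal{C}' \sqcap Y$, and let $A = \{x_0, x_1, \dots, x_d\}$. Now by the definition of $\mathcal{C}'$, for each $B \subseteq \{x_1,\dots,x_d\}$ there is $C \in \mathcal{C}'$ such that $C \cap \{x_1,\dots,x_d\} = B$, and both $C$ and $C \setminus \{x_0\}$ belong to $\mathcal{C}$, hence

$$\mathcal{C} \sqcap A \supseteq 2^{\{x_1,\dots,x_d\}} \cup \{B \cup \{x_0\} : B \in 2^{\{x_1,\dots,x_d\}}\} = 2^A,$$

so the set $A$ of $d+1$ points is shattered by $\mathcal{C}$, contradicting $\mathrm{VC}(\mathcal{C}) = d$. ∎
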

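[[welzl87rangespaces]] Let $(X,\mathcal{C})$ be finite with VC-dimension $d$. The concept space $(X,\mathcal{C})$ is $d$-maximum if and only if

$$|\mathcal{C}| = \binom{|X|}{\le d}.$$

###### Proof.

If $(X,\mathcal{C})$ is $d$-maximum then by the definition

$$|\mathcal{C}| = \binom{|X|}{\le d}.$$

For the converse, we will use induction on $|X|$.
If $|X| = d$, then $\mathcal{C}$ is $d$-maximum and

$$|\mathcal{C}| = 2^d = \binom{d}{\le d}.$$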

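Assume the statement of the theorem is true for all $X$ where $|X| = n$, and let $(X,\mathcal{C})$ have $|X| = n+1$. Let $x_0 \in X$ and let $Y = X \setminus \{x_0\}$. By the induction hypothesis, it suffices to show that

$$|\mathcal{C} \sqcap Y| = \binom{n}{\le d}.$$

By lemma 1.2.7, $\mathcal{C}' = \{C \in \mathcal{C} : x_0 \in C \text{ and } C \setminus \{x_0\} \in \mathcal{C}\}$ has size at most $\binom{n}{\le d-1}$. Define

$$\pi : \mathcal{C} \setminus \mathcal{C}' \to \mathcal{C} \sqcap Y \quad\text{by}\quad \pi(C) = C \cap Y.$$

We will show $\pi$ is injective. Suppose there is $C_1 \neq C_2$ in $\mathcal{C} \setminus \mathcal{C}'$ such that

$$\pi(C_1) = C_1 \cap Y = C_2 \cap Y = \pi(C_2).$$

If $x_0 \in C_1 \cap C_2$, then

$$C_1 = (C_1 \cap Y) \cup \{x_0\} = (C_2 \cap Y) \cup \{x_0\} = C_2,$$

and if $x_0 \notin C_1 \cup C_2$, then

$$C_1 = C_1 \cap Y = C_2 \cap Y = C_2,$$

so without loss of generality $x_0 \in C_1 \setminus C_2$. We get that

$$C_1 \setminus \{x_0\} = C_1 \cap Y = C_2 \cap Y = C_2 \in \mathcal{C},$$

hence $C_1 \in \mathcal{C}'$, a contradiction, therefore $\pi$ is injective. Finally,

$$|\mathcal{C} \sqcap Y| \ge |\mathcal{C} \setminus \mathcal{C}'| = |\mathcal{C}| - |\mathcal{C}'| \ge \binom{n+1}{\le d} - \binom{n}{\le d-1} = \binom{n}{\le d}.$$

The reverse inequality $|\mathcal{C} \sqcap Y| \le \binom{n}{\le d}$ follows from Sauer's Lemma. ∎
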

### 1.3 Concepts as Relations

In this section we will look at concept spaces defined as a relation on a pair of sets. This will allow us to characterize useful notions of embeddings for concept spaces as found in [Ben-david98combinatorialvariability]. It will also allow us to define the dual concept space of a concept space.

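We can define a concept class $\mathcal{C}$ on a domain $X$ via a relation $R \subseteq X \times Y$ for some set $Y$, by $\mathcal{C} = \{C_y : y \in Y\}$ where $C_y = \{x \in X : (x,y) \in R\}$. Similarly given $(X,\mathcal{C})$, the corresponding space in the form $(X,Y,R)$ is $(X,\mathcal{C},\in)$. A subclass of $(X,Y,R)$ is $(X',Y',R')$ where $X' \subseteq X$, $Y' \subseteq Y$, and $R' = R \cap (X' \times Y')$. This is convenient for defining the idea of a dual to a concept space as follows: Given a concept space $(X,Y,R)$, the dual concept space of $(X,Y,R)$, denoted

$$(X,Y,R)^*,$$

is

$$(Y,X,R^*), \quad\text{where } R^* = \{(y,x) : (x,y) \in R\}.$$

The dual concept space of a space represented as $(X,\mathcal{C})$ can be thought of as

$$(\mathcal{C}, \{\{C \in \mathcal{C} : x \in C\} : x \in X\}).$$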
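
The relational view and the dual are easy to experiment with on finite sets. The sketch below is illustrative only: the names `concept_class` and `dual` and the sample relation are assumptions made for the example, with $R$ stored as a set of pairs $(x,y)$.

```python
def concept_class(X, Y, R):
    """Concepts C_y = {x in X : (x, y) in R}, indexed by y in Y."""
    return {y: frozenset(x for x in X if (x, y) in R) for y in Y}


def dual(X, Y, R):
    """Dual space (Y, X, R*), where R* reverses every pair of R."""
    return Y, X, {(y, x) for (x, y) in R}


X = {1, 2, 3}
Y = {"a", "b"}
R = {(1, "a"), (2, "a"), (2, "b")}

# Concepts of the original space, indexed by the elements of Y.
print(concept_class(X, Y, R))
# Concepts of the dual space: each x in X becomes the set {y : x in C_y}.
print(concept_class(*dual(X, Y, R)))
```

Taking the dual twice returns the original triple, which reflects the symmetry between points and concepts in the relational presentation.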

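[[Ben-david98combinatorialvariability]] Let $(X,Y,R)$, $(X',Y',R')$ be concept spaces. An embedding from $(X,Y,R)$ to $(X',Y',R')$ is a function $\pi = (\pi_1,\pi_2) : X \times Y \to X' \times Y'$, with $\pi((x,y)) = (\pi_1(x), \pi_2(y))$, such that for every

$$(x,y) \in X \times Y, \quad (x,y) \in R \text{ iff } \pi((x,y)) \in R'.$$

A generalized embedding from $(X,Y,R)$ to $(X',Y',R')$ is a function $\pi$ as above and a function $\tau : X \to \{0,1\}$ such that for every $(x,y) \in X \times Y$,

$$\text{if } \tau(x) = 0 \text{ then } (x,y) \in R \text{ iff } \pi((x,y)) \in R',$$
$$\text{if } \tau(x) = 1 \text{ then } (x,y) \in R \text{ iff } \pi((x,y)) \notin R'.$$

$(X,Y,R)$ is weakly (generalized) embeddable in $(X',Y',R')$ if every finite subclass of $(X,Y,R)$ is (generalized) embeddable in $(X',Y',R')$.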

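The above notions induce preorders on any set of concept spaces; if there exists an embedding or generalized embedding from $(X,Y,R)$ to $(X',Y',R')$, we will denote that

$$(X,Y,R) \preceq_{emb} (X',Y',R')$$

or

$$(X,Y,R) \preceq_{gemb} (X',Y',R')$$

respectively.
If $(X,Y,R)$ is weakly embeddable in $(X',Y',R')$, or weakly generalized embeddable in $(X',Y',R')$, we will denote that

$$(X,Y,R) \preceq_{wemb} (X',Y',R')$$

or

$$(X,Y,R) \preceq_{wgemb} (X',Y',R')$$

respectively.

Let us say that $(X,Y,R)$ and $(X',Y',R')$ are bi-embeddable if $(X,Y,R) \preceq_{emb} (X',Y',R')$ and $(X',Y',R') \preceq_{emb} (X,Y,R)$.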

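A concept space $(X,Y,R)$ may have some redundant points in $X$ and $Y$ as far as $R$ is concerned, but we can reduce it to its essential information by setting:

$$x \sim x' \text{ in } X \text{ iff } \forall y \in Y,\ (x,y) \in R \iff (x',y) \in R,$$
$$y \sim y' \text{ in } Y \text{ iff } \forall x \in X,\ (x,y) \in R \iff (x,y') \in R,$$
$$R_\sim = \{([x]_\sim, [y]_\sim) \in X/{\sim} \times Y/{\sim} : (x,y) \in R\}.$$

The space $(X/{\sim}, Y/{\sim}, R_\sim)$ separates the points of $X/{\sim}$ and $Y/{\sim}$, and is bi-embeddable with $(X,Y,R)$ via the quotient map for $(X,Y,R) \preceq_{emb} (X/{\sim}, Y/{\sim}, R_\sim)$, and mapping each equivalence class to its (choose any) representative for $(X/{\sim}, Y/{\sim}, R_\sim) \preceq_{emb} (X,Y,R)$.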

(1) .

(2) .

###### Notation

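In the proof of the next proposition and throughout the further text we use the notation $A \triangle B$ for the symmetric difference of two sets; i.e. $A \triangle B = (A \setminus B) \cup (B \setminus A)$.

[[Ben-david98combinatorialvariability]] If $(X,Y,R) \preceq_{wgemb} (X',Y',R')$ then $\mathrm{VC}(X,Y,R) \le \mathrm{VC}(X',Y',R')$.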

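###### Proof.

Let $A$ be a finite subset of $X$ that is shattered, let

$$B = \{b_D \in Y : D \subseteq A,\ C_{b_D} \cap A = D\},$$

and let $\pi = (\pi_1,\pi_2)$, $\tau$ be the generalized embedding from $(A, B, R \cap (A \times B))$ into $(X',Y',R')$. $\pi_2$ is injective because for $b_1 \neq b_2 \in B$, there exists $x \in A \cap (C_{b_1} \triangle C_{b_2})$. Without loss of generality $x \in C_{b_1} \setminus C_{b_2}$. We have:

$$\text{if } \tau(x) = 0, \text{ then } x \in C_{b_1} \text{ implies } \pi_1(x) \in C_{\pi_2(b_1)} \text{ and } x \notin C_{b_2} \text{ implies } \pi_1(x) \notin C_{\pi_2(b_2)};$$
$$\text{if } \tau(x) = 1, \text{ then } x \in C_{b_1} \text{ implies } \pi_1(x) \notin C_{\pi_2(b_1)} \text{ and } x \notin C_{b_2} \text{ implies } \pi_1(x) \in C_{\pi_2(b_2)}.$$

In either case $\pi_1(x) \in C_{\pi_2(b_1)} \triangle C_{\pi_2(b_2)}$ and so $\pi_2(b_1) \neq \pi_2(b_2)$. This also shows that $b \mapsto C_{\pi_2(b)} \cap \pi_1(A)$ is injective on $B$, hence

$$2^{|A|} \ge |\{C_{\pi_2(b)} \cap \pi_1(A) : b \in B\}| \ge |\{C_b : b \in B\}| = 2^{|A|}$$

and therefore $\pi_1(A)$ is shattered in $(X',Y',R')$. ∎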

 log2(VC(X,Y,R))−1

###### Proof.

Since , it suffices to show the first inequality. Let be a set of cardinality . One has via . Noting that is embeddable in any class of the same or greater VC-dimension, , and thus . Therefore and so . ∎

if and only if .

### 2.1 Introduction of Sample Compression Schemes

Sample compression schemes, introduced by Littlestone and Warmuth ([Littlestone86relatingdata]), are naturally arising algorithms which learn concepts by compressing finite samples of concepts to subsets of size at most $d$.

The following notations will be used in the definitions of sample compression schemes, and throughout the text.

###### Notation

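For a set $X$ let

$$[X]^{<\infty} = \{A \subseteq X : |A| < \infty\},$$

let

$$\mathcal{C}|A = \{C|A : C \in \mathcal{C}\},$$

where $A \subseteq X$ and $C|A$ is the function $C$ restricted to the domain $A$, and let

$$\mathcal{C}|[X]^{<\infty} = \bigcup_{A \in [X]^{<\infty}} \mathcal{C}|A.$$

We can similarly define

$$[X]^{\le d}, \quad \mathcal{C}|[X]^{\le d}, \quad [X]^{=d}, \quad\text{and}\quad \mathcal{C}|[X]^{=d}.$$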

###### Notation

For two functions $f$, $g$ with $\mathrm{dom}(g) \subseteq \mathrm{dom}(f)$, let

$$g \sqsubseteq f$$

be the notation for $f$ extending $g$.

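For $d \in \mathbb{N}$, an unlabelled sample compression scheme of size $d$ on $(X,\mathcal{C})$ is a function

$$\mathcal{H} : [X]^{\le d} \to 2^X$$

with the property that

$$\forall f \in \mathcal{C}|[X]^{<\infty},\ \exists \sigma \in [\mathrm{dom}(f)]^{\le d} \text{ such that } f \sqsubseteq \mathcal{H}(\sigma).$$

A labelled sample compression scheme of size $d$ on $(X,\mathcal{C})$ is a function

$$\mathcal{H} : \mathcal{C}|[X]^{\le d} \to 2^X$$

with the property that

$$\forall f \in \mathcal{C}|[X]^{<\infty},\ \exists g \in \mathcal{C}|[X]^{\le d} \text{ such that } g \sqsubseteq f \sqsubseteq \mathcal{H}(g).$$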

We will call the range of a sample compression scheme the hypothesis class and denote it by .

###### Example

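Let $X$ be any totally ordered set, and let $\mathcal{C}$ be the set of all initial segments $I_x = \{y \in X : y \le x\}$ of $X$. Defining

$$\mathcal{H} : \{x\} \mapsto I_x,\ \emptyset \mapsto \emptyset, \quad\text{and}\quad \mathcal{H}' : \{x\} \mapsto I_x \setminus \{x\},\ \emptyset \mapsto X,$$

we will show $\mathcal{H}$ and $\mathcal{H}'$ are unlabelled sample compression schemes of size $1$ on $(X,\mathcal{C})$. Given a sample $f \in \mathcal{C}|[X]^{<\infty}$, if $f \equiv 0$ on its domain then $\emptyset \in [\mathrm{dom}(f)]^{\le 1}$ and $f \sqsubseteq \mathcal{H}(\emptyset) = \emptyset$. Otherwise $x_f = \max\{x \in \mathrm{dom}(f) : f(x) = 1\}$ exists, and so

$$\{x_f\} \in [\mathrm{dom}(f)]^{\le 1}, \quad f \sqsubseteq \mathcal{H}(\{x_f\}) = I_{x_f}.$$

Thus $\mathcal{H}$ is a sample compression scheme of size $1$ on $(X,\mathcal{C})$.
Similarly for $\mathcal{H}'$, if $f \equiv 1$ on its domain then $\emptyset \in [\mathrm{dom}(f)]^{\le 1}$ and $f \sqsubseteq \mathcal{H}'(\emptyset) = X$. Otherwise $x_f = \min\{x \in \mathrm{dom}(f) : f(x) = 0\}$ exists, and so

$$\{x_f\} \in [\mathrm{dom}(f)]^{\le 1}, \quad f \sqsubseteq \mathcal{H}'(\{x_f\}) = I_{x_f} \setminus \{x_f\}.$$

Therefore $\mathcal{H}'$ is also a sample compression scheme of size $1$ on $(X,\mathcal{C})$.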
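
The two schemes of this example can be verified exhaustively on a small chain. The following is a sketch under assumptions made only for illustration: the domain is the hypothetical five-point chain $0 < 1 < \dots < 4$, samples are stored as dictionaries from points to labels, and all function names are invented for the example.

```python
from itertools import chain, combinations

X = list(range(5))                                # hypothetical chain 0 < 1 < ... < 4
concepts = [frozenset(range(x + 1)) for x in X]   # initial segments I_x = {y : y <= x}


def H(sigma):
    """Reconstruction: {x} -> I_x, empty set -> empty concept."""
    return frozenset(range(max(sigma) + 1)) if sigma else frozenset()


def H_prime(sigma):
    """Reconstruction: {x} -> I_x minus {x}, empty set -> X."""
    return frozenset(range(max(sigma))) if sigma else frozenset(X)


def compress(sample):
    """Keep the largest point labelled 1, or nothing if every label is 0."""
    ones = [x for x, label in sample.items() if label == 1]
    return frozenset({max(ones)}) if ones else frozenset()


def compress_prime(sample):
    """Keep the smallest point labelled 0, or nothing if every label is 1."""
    zeros = [x for x, label in sample.items() if label == 0]
    return frozenset({min(zeros)}) if zeros else frozenset()


def consistent(hypothesis, sample):
    """True when the hypothesis extends the sample, i.e. f ⊑ hypothesis."""
    return all((x in hypothesis) == bool(label) for x, label in sample.items())


def all_samples():
    """Every restriction of a concept to a subset of X, as a dict point -> label."""
    subsets = chain.from_iterable(combinations(X, k) for k in range(len(X) + 1))
    for D in subsets:
        for C in concepts:
            yield {x: int(x in C) for x in D}


ok = all(
    consistent(H(compress(f)), f) and consistent(H_prime(compress_prime(f)), f)
    for f in all_samples()
)
print(ok)
```

The compression maps `compress` and `compress_prime` make explicit the choice of $\sigma$ that the definition only asserts to exist; the definition itself requires only the reconstruction functions.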

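If $(X,\mathcal{C})$ has an unlabelled compression scheme of size $d$, then $(X,\mathcal{C})$ has a labelled compression scheme of size $d$.

###### Proof.

Let $(X,\mathcal{C})$ have an unlabelled compression scheme $\mathcal{H}$ of size $d$. For every $f \in \mathcal{C}|[X]^{<\infty}$ there is $\sigma \in [\mathrm{dom}(f)]^{\le d}$ such that $f \sqsubseteq \mathcal{H}(\sigma)$, and so any function $\mathcal{H}' : \mathcal{C}|[X]^{\le d} \to 2^X$ where $\mathcal{H}'(g) = \mathcal{H}(\mathrm{dom}(g))$ will be a labelled compression scheme of size $d$: take $g = f|\sigma$. ∎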
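
One way to read this proof is that a labelled scheme can simply ignore the labels and reuse the unlabelled reconstruction on the domain of the compressed sample. The sketch below illustrates that reading; `to_labelled` is a hypothetical helper, and the unlabelled scheme `H` is the initial-segment scheme on an assumed five-point chain.

```python
def to_labelled(H):
    """Turn an unlabelled scheme H into a labelled one by forgetting the labels."""
    def H_labelled(g):
        # g is a labelled sample given as a dict point -> label; only its domain is used.
        return H(frozenset(g))
    return H_labelled


def H(sigma):
    """Hypothetical unlabelled scheme of size 1 for initial segments of 0 < 1 < ... < 4."""
    return frozenset(range(max(sigma) + 1)) if sigma else frozenset()


H_lab = to_labelled(H)
f = {0: 1, 2: 1, 3: 0}   # a sample of the concept {0, 1, 2}
g = {2: 1}               # labelled sub-sample kept by the scheme, g ⊑ f
hypothesis = H_lab(g)
print(all((x in hypothesis) == bool(label) for x, label in f.items()))
```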

From now on we will only be dealing with unlabelled sample compression schemes unless otherwise mentioned.

[[Ben-david98combinatorialvariability]] If and has a (labelled or unlabelled) sample compression scheme of size , then also has a sample compression scheme of size and of the same type. If has a sample compression scheme of size , then every subspace has a sample compression scheme of size .

### 2.2 Compactness Theorem

[Compactness Theorem, Ben-David and Litman [Ben-david98combinatorialvariability]] A concept space $(X,\mathcal{C})$ has a sample compression scheme of size $d$ if and only if every finite subspace of $(X,\mathcal{C})$ has a sample compression scheme of size $d$.

The compactness theorem is true for both types of sample compression schemes and similarly for all forms of extended sample compression schemes given in a following section. We will provide the proof of the theorem for unlabelled sample compression schemes. The proof we provide is simpler and more direct than the proof in [Ben-david98combinatorialvariability] which is based on the Compactness Theorem of Predicate Logic. We use an approach with ultralimits, normally used in Analysis. (For preliminary information on filters and ultrafilters, see appendix A.2)

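###### Proof.

Necessity: By corollary 2.1.7, if $(X,\mathcal{C})$ has a sample compression scheme of size $d$, every (finite) subspace of $(X,\mathcal{C})$ has a sample compression scheme of size $d$.

Sufficiency: For all $B \in [X]^{<\infty}$ denote the sample compression scheme of size $d$ for $(B, \mathcal{C} \sqcap B)$ as $\mathcal{H}_B$. Let $\mathcal{U}$ be an ultrafilter on $[X]^{<\infty}$ containing the filter base

$$\{\{B \in [X]^{<\infty} : F \subseteq B\} : F \in [X]^{<\infty}\}.$$

Define $\mathcal{H} : [X]^{\le d} \to 2^X$ as

$$\mathcal{H}(\sigma)(x) = 1 \iff \{B \in [X]^{<\infty} : \sigma \cup \{x\} \subseteq B,\ \mathcal{H}_B(\sigma)(x) = 1\} \in \mathcal{U}.$$

Note for given $\sigma$ and $x$, $\mathcal{H}(\sigma)(x)$ is defined as the ultralimit of the net $(\mathcal{H}_B(\sigma)(x))_B$ of zeros and ones along $\mathcal{U}$.

We will show $\mathcal{H}$ is a sample compression scheme of size $d$ on $(X,\mathcal{C})$. Let $f \in \mathcal{C}|[X]^{<\infty}$, and denote $D = \mathrm{dom}(f)$. Note that

$$\forall B \in [X]^{<\infty},\ D \subseteq B, \text{ we have } f \in (\mathcal{C} \sqcap B)|[X]^{<\infty}, \text{ and so } \exists \sigma_B \in [D]^{\le d} \text{ such that } f \sqsubseteq \mathcal{H}_B(\sigma_B). \tag{1}$$

We have that $[D]^{\le d}$ is finite so let $[D]^{\le d} = \{\sigma_1, \dots, \sigma_m\}$. For $i = 1, \dots, m$ letting

$$\mathcal{S}_i = \{B \in [X]^{<\infty} : D \subseteq B,\ f \sqsubseteq \mathcal{H}_B(\sigma_i)\},$$

by (1) we see that

$$\bigcup_{i=1}^{m} \mathcal{S}_i = \{B \in [X]^{<\infty} : D \subseteq B\} \in \mathcal{U},$$

thus, by a property of ultrafilters, there is $i_0$ such that $\mathcal{S}_{i_0} \in \mathcal{U}$. Let $x \in D$ and let

$$\mathcal{S}^t_{i_0} = \{B \in [X]^{<\infty} : D \subseteq B,\ \mathcal{H}_B(\sigma_{i_0})(x) = t\} \quad (\text{where } t \in \{0,1\}).$$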

We have

 f(x)=1 ⇒ ∀B∈\mSi0, \mHB(σi0)(x)=1 ⇒ \mSi0⊆\mS1i0⊆{B∈[X]<∞:σi0⊆B, \mHB(σi0)(x)=1}∈U ⇒ \mH(σi0)(x)=1; f(x)=0 ⇒ ∀B∈\mSi0, \mHB(σi0)(x)=0 ⇒ \mSi0⊆\mS0i0⊆{B∈[X]<∞:σi0⊆B, \mHB(σi0)(x)=0}∈U ⇒ {B∈