# A General Memory-Bounded Learning Algorithm

In an era of big data there is a growing need for memory-bounded learning algorithms. In the last few years researchers have investigated what cannot be learned under memory constraints. In this paper we focus on the complementary question of what can be learned under memory constraints. We show that if a hypothesis class fulfills a combinatorial condition defined in this paper, there is a memory-bounded learning algorithm for this class. We prove that certain natural classes fulfill this combinatorial property and thus can be learned under memory constraints.

## 1 Introduction

We are amidst the big data era, which has made the design of learning algorithms that require a limited amount of memory crucial. Many works (Shamir, 2014; Steinhardt et al., 2016; Raz, 2016; Kol et al., 2017; Moshkovitz and Moshkovitz, 2017a, b; Raz, 2017; Garg et al., 2017; Beame et al., 2017) have discussed the limitations of bounded-memory learning. In this paper we explore what can be properly learned with bounded-memory algorithms. Specifically, we suggest a general bounded-memory learning algorithm for the case where the examples are sampled from the uniform distribution. We also apply this algorithm to some natural hypothesis classes.

Our general algorithm is first applied to discrete threshold functions, , where the domain is a discretization of the segment and each hypothesis corresponds to a number and There is a simple learning algorithm for this class: save in memory the largest example with label
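A minimal sketch of this simple algorithm, under two assumptions that the text above does not confirm: the label in question is 1, and a threshold hypothesis labels an example 1 exactly when the example does not exceed the threshold. The function name `learn_threshold` is hypothetical.

```python
def learn_threshold(stream):
    """Keep only the largest example labeled 1 (assumed convention).

    `stream` is an iterable of (example, label) pairs. The memory used
    is a single example, regardless of how many pairs are observed.
    Returns None if no example with label 1 was seen.
    """
    best = None
    for x, label in stream:
        if label == 1 and (best is None or x > best):
            best = x
    return best
```

Only one example is held at any time, which is what makes this a bounded-memory procedure for the threshold class.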

As a sanity check, we show that indeed this class can be learned using our general algorithm. The second class we consider is equal-piece classifiers, which is a generalization of the class but now each hypothesis is defined by a few numbers and It is unknown how to generalize the simple algorithm for to the class ; moreover, to the best of our knowledge it remains unclear how to properly learn this class with bounded memory. However, we show that our general algorithm is also applicable for this class.

The third class we consider is decision lists. This class was introduced by Rivest (1987), who also described a learning algorithm for this class. Unfortunately, it is not a bounded-memory algorithm because it saves all the given labeled examples in memory. Nevo and El-Yaniv (2002); Dhagat and Hellerstein (1994) provided learning algorithms for this class but the number of examples used is polynomial only under the assumption that the number of alternations in the decision list is constant. Klivans and Servedio (2006) presented an algorithm that uses a super-polynomial number of examples if the length of the decision list is linear in the number of variables. Long and Servedio (2007) also limited themselves to the uniformly distributed examples scenario but they considered improper learning of this class. Our general algorithm provides a proper bounded-memory learning algorithm.

### 1.1 Intuition for the General Algorithm

This paper suggests a general bounded-memory learning algorithm in the case where the examples are sampled from the uniform distribution. We also describe a combinatorial condition, called separability, that suffices for the correctness of the algorithm. All three of our applications are hypothesis classes that satisfy separability.

The general bounded-memory algorithm builds upon a basic general learning algorithm which we describe briefly. Let be a family of Boolean hypotheses over domain The fundamental theorem of statistical learning implies that learning with accuracy and constant confidence is possible after observing labeled examples and using memory bits by saving all examples in memory. This can be done by maintaining the version space; i.e., all hypotheses that are consistent with the labeled examples seen so far, which means that is a (preferably small) set that contains the correct hypothesis. As long as there are hypotheses in the version space with an error larger than , a counting argument implies that there is a large set of examples that is able to reduce the size of the version space substantially. The basic learning algorithm that simply keeps the version space in memory uses a large amount of memory. Either one stores bits in memory, indicating which hypothesis is in the version space, or one stores all the examples seen so far.
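The basic (memory-hungry) learner described above can be sketched as follows. This is an illustration only: the representation of hypotheses as a dictionary of Python callables and the name `version_space_learn` are assumptions, not part of the paper.

```python
def version_space_learn(hypotheses, stream):
    """Maintain the version space explicitly.

    `hypotheses` maps a hypothesis name to a callable x -> {0, 1}.
    `stream` is an iterable of (example, label) pairs.
    Every hypothesis that disagrees with an observed label is removed,
    so the returned set contains exactly the consistent hypotheses.
    """
    version_space = set(hypotheses)
    for x, label in stream:
        version_space = {
            name for name in version_space if hypotheses[name](x) == label
        }
    return version_space
```

Storing `version_space` explicitly costs one bit per hypothesis, which is the memory cost the bounded-memory algorithm is designed to avoid.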

In this work we show that in many scenarios the basic learning algorithm can be implemented using bounded memory. The difficulty is that each example can eliminate a different subset of the version space and thus all the labeled examples must be saved in memory. The rationale for the current work is that in many situations there is a large subset of examples that would eliminate the same large part of the version space. Thus, it suffices to only save a few bits of information about the example (indicating that it belongs to the subset ) rather than the whole example.

### 1.2 Informal Summary of our Results

The results are informally summarized below.

1. We introduce the combinatorial condition of separability for hypothesis classes.

2. We present a general memory-bounded proper learning algorithm in the case where the examples are sampled from the uniform distribution and the realizability assumption holds. We prove the correctness of this algorithm in the case where the classes satisfy the separability condition.

3. We prove that discrete threshold functions satisfy separability.

4. We prove that equal-piece classifiers satisfy separability, and thus can be learned with a proper bounded-memory algorithm.

5. We prove that decision lists satisfy separability, and thus can be learned with a proper bounded-memory algorithm.

### 1.3 Organization

In Section 2 we formally present the notion of separability. In Section 3 we present the general bounded-memory algorithm and in Section 4 we show that this algorithm can be used to properly learn the three classes presented above with bounded memory. The technical proofs are presented in the Appendix.

## 2 Separable Classes

In what follows we fix a bipartite graph . The density between sets of vertices $T$ and $S$ is $d(T,S)=\frac{e(T,S)}{|T||S|}$, where $e(T,S)$ is the number of edges with a vertex in $T$ and a vertex in $S$. Two vertices are $\epsilon$-close if , where denotes the set of neighbors of vertex . An $\epsilon$-ball with center $h$ is the set

$$B_h(\epsilon)=\{h'\in A \mid h' \text{ and } h \text{ are } \epsilon\text{-close}\}.$$

We next define the weak separability of a vertex subset and a whole graph.

###### Definition 1 (weak-separable).

Let . We say that is -weak-separable if for every vertex we have .
We say that is -separable if there are subsets and with , such that

###### Definition 2 ((α,ϵ)-separable graph).

We say that a bipartite graph is -separable if any that is -weak-separable is also -separable.

A hypothesis class over domain can be represented as a bipartite-graph in the following way. The vertices are the hypotheses and the examples , and the edges connect every hypothesis to the examples if and only if We call the appropriate bipartite graph the hypotheses graph of

###### Definition 3 ((α,ϵ)-separable class).

A hypothesis class is -separable if its hypotheses graph is -separable.
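The graph view of a hypothesis class can be made concrete with a small sketch. Two conventions here are assumptions, because the corresponding formulas are missing from the text above: an edge joins a hypothesis to an example when the hypothesis labels it 1, and two hypotheses are treated as close when their neighbor sets differ on at most an `eps` fraction of the domain. All function names are hypothetical.

```python
def build_graph(hypotheses, domain):
    """Map each hypothesis index to its neighbor set {x : h(x) = 1}."""
    return {
        i: frozenset(x for x in domain if h(x) == 1)
        for i, h in enumerate(hypotheses)
    }


def density(graph, T, S):
    """Edge density between hypothesis indices T and examples S:
    (number of edges between T and S) / (|T| * |S|)."""
    S = set(S)
    edges = sum(len(graph[h] & S) for h in T)
    return edges / (len(T) * len(S))


def are_close(graph, h1, h2, eps, domain_size):
    """Assumed closeness test: the neighbor sets differ on at most
    eps * domain_size examples."""
    return len(graph[h1] ^ graph[h2]) <= eps * domain_size
```

With these helpers, the density difference between two sets of hypotheses on a common set of examples, which is the quantity the separability definitions reason about, is `abs(density(graph, T1, S) - density(graph, T0, S))`.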

## 3 A Bounded Memory Algorithm

An -bounded memory learning algorithm is one that uses at most labeled examples sampled from the uniform distribution, bits of memory, and returns a hypothesis that is -close to the correct hypothesis with probability at least

In Algorithm 1 we present the general bounded-memory proper learning algorithm. The algorithm uses the following subroutines:

• Is-close — tests whether is -close to the correct hypothesis (Algorithm 2 in the Appendix)

• Estimate — estimates up to an additive error of , where is the correct hypothesis (Algorithm 3 in the Appendix)

The correctness of the subroutines is stated in the following technical claims and proved in the Appendix.

###### Claim 4.

Let be a bipartite graph. For any that is -separable there are with , with and with such that implies and implies .

###### Claim 5.

There is an algorithm such that for any hypothesis , for any and for any integer it uses labeled examples and

• if is -close to the correct hypothesis, then with probability at least the algorithm returns True.

• if is not -close to the correct hypothesis, then with probability at least the algorithm returns False.
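A test of this kind can be sketched by counting disagreements on fresh labeled examples. This is not the paper's Algorithm 2 (which appears in the Appendix); the acceptance rule, the `threshold` parameter and the name `is_close_test` are assumptions made for illustration.

```python
def is_close_test(h, labeled_examples, threshold):
    """Accept h when its empirical disagreement rate is at most `threshold`.

    `labeled_examples` is an iterable of (example, label) pairs whose
    labels come from the correct hypothesis. Only two counters are kept,
    so the examples themselves never need to be stored.
    """
    seen = 0
    disagreements = 0
    for x, label in labeled_examples:
        seen += 1
        if h(x) != label:
            disagreements += 1
    return seen > 0 and disagreements <= threshold * seen
```

Because only counters are stored, the memory cost is logarithmic in the number of examples drawn.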

###### Claim 6.

Denote by the correct hypothesis. There is an algorithm such that for any set of examples with for any and for any integer the algorithm uses labeled examples and returns with with probability at least .
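An estimator in the spirit of this claim can be sketched by sampling. The assumption here, not confirmed by the text above, is that the estimated quantity is the fraction of examples in the given set that the correct hypothesis labels 1; the name `estimate_density` is hypothetical and the paper's own procedure is Algorithm 3 in the Appendix.

```python
def estimate_density(labeled_examples, S):
    """Estimate the fraction of examples in S that carry label 1.

    `labeled_examples` is an iterable of (example, label) pairs drawn
    from the domain; only samples that land in S are counted.
    Returns None when no sample lands in S.
    """
    S = set(S)
    in_S = 0
    positives = 0
    for x, label in labeled_examples:
        if x in S:
            in_S += 1
            positives += label
    if in_S == 0:
        return None
    return positives / in_S
```

As with the closeness test, the estimator keeps two counters rather than the samples themselves.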

The next theorem proves the correctness of the algorithm (the details appear in the Appendix); we will omit the symbol for simplicity.

###### Theorem 7.

Let be an -separable hypothesis class and denote . Then for any integer there is a -bounded memory algorithm for .

## 4 Applications

In this section we prove that Discrete Threshold Functions (in Section 4.1), Equal-Piece Classifiers (in Section 4.2), and Decision Lists (in Section 4.3) are separable. This implies, using Theorem 7, that they are properly learnable with bounded memory.

### 4.1 Threshold functions

The class of threshold functions in is and The class of discrete thresholds is defined similarly but over a discrete domain of size with and

###### Theorem 8.

For any , the class is -separable.

The proof of the theorem appears in the Appendix. Using Theorem 7, with we can deduce the following corollary.

###### Corollary 9.

For any there is a -bounded memory learning algorithm for .

### 4.2 Equal-Piece Classifiers

Each hypothesis in corresponds to a (disjoint) union of intervals each of length exactly , that is, and if and only if is inside one of these intervals. More formally, the examples are the numbers as defined in Section 4.1 and the hypotheses correspond to the parameters with and they define the intervals

$$[a^h_1,\, a^h_1+p],\ [a^h_2,\, a^h_2+p],\ \ldots,\ [a^h_k,\, a^h_k+p].$$

An example has if and only if there is such that

Note that the class is quite complex since it is easy to verify that it has a VC-Dimension of at least (partition into consecutive equal parts and take one point from each part — this set is shattered by ).
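The definition above translates directly into a membership check. In this sketch a hypothesis is represented by the list of its interval start points together with the common length `p`; the function name `equal_piece_predict` is hypothetical.

```python
def equal_piece_predict(starts, p, x):
    """Label x with 1 if it lies in some interval [a, a + p], else 0.

    `starts` holds the start points of the k intervals of the
    hypothesis, and `p` is the length shared by all intervals.
    """
    return int(any(a <= x <= a + p for a in starts))
```

For instance, with `starts = [0.1, 0.5]` and `p = 0.2` the hypothesis labels points inside `[0.1, 0.3]` or `[0.5, 0.7]` with 1 and all other points with 0.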

###### Theorem 10.

For any with and the class is -separable.

###### Proof.

Fix and that is -weak-separable. To prove that is -separable we will show that is -weak-separable.

Assume by contradiction that is not -weak-separable. We will show that there is a set with and with such that and ; thus, in particular which will prove the claim. We will in fact prove that there is an open interval of length such that the sets and satisfy We call such a separating. This will prove that is -separable since for we have that (from the assumption in the claim regarding the upper bound on ), , and For ease of notation we replace by from now on.

Next, we will prove something even stronger by showing that there is a sequence and a sequence of sets such that for all if there is no separating in the “window”

$$W_i := [u_i,\, u_i+p-i\alpha]$$

and there is no separating in the previous windows either, then the following four properties are satisfied:

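1. the current set of hypotheses remains large:

$$|T_i| \geq (1-2\alpha i)|T|$$
2. the start of the current window is far enough to the right:

$$u_i \geq ip-\alpha\sum_{j=1}^{i} j$$
3. every $h_1, h_2 \in T_i$ are similar up to the current window:

$$|\{x\in X\cap[0,u_i] \mid h_1(x)\neq h_2(x)\}| \leq \alpha\sum_{j=1}^{i+2} j\,|X|$$
4. no hypothesis has an endpoint in the current window: for all $h \in T_i$ and $k$ it holds that

$$a^h_k+p \notin [u_i,\, u_i+p-i\alpha].$$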

Assume by contradiction that there is no separating . Hence, we deduce that the four properties hold for all windows By Property 2, for , it holds that Fix By Property 1 we know that

$$|T_\ell| \;\geq\; |T|\left(1-\frac{6\alpha}{p}\right) \;>\; |T|\cdot\frac{2}{3} \;>\; \alpha|T| \qquad \left(\text{since } \alpha<\tfrac{p}{18}\right)$$

Take any Since , Property 3 implies that (since ), which is a contradiction to weak-separability of .

To complete the proof we prove by induction on that if there are no separating in the windows up to then there are that have the previous four properties.

Induction basis: Since and , Properties 1 and 2 hold. Since , Property 3 holds. There is no endpoint in the interval (since the length of each interval in any hypothesis in is ) and since we can assume that for each and it holds that thus proving Property 4 holds.

Induction step: By the induction hypothesis there is no separating in the full range of . To use it we intuitively move a small sliding-window of length in the current window . For each such we can calculate the number of hypotheses in that contain ,

$$c^I_1=\left|\{h\in T_i \mid \exists k,\ I\subseteq[a^h_k,\, a^h_k+p]\}\right|$$

and the number of hypotheses in that do not intersect ,

$$c^I_0=\left|\{h\in T_i \mid \forall k,\ I\cap[a^h_k,\, a^h_k+p]=\emptyset\}\right|.$$

Note that is separating if and only if and Thus, our assumption is that for every either or Observe that by Property 4 there are no endpoints of in , which immediately implies that can only increase as we slide within . We next consider two cases depending on whether there is with or not. In each case we need to define and prove that the four properties hold for which will complete the proof.

• Case 1: there is no with , i.e., is always smaller than as we slide Define

$$T_{i+1}=\{h\in T_i \mid \forall I\subseteq W_i\ \forall k,\ I\not\subseteq[a^h_k,\, a^h_k+p]\} \quad\text{and}\quad u_{i+1}=u_i+|W_i|$$

Property 1 holds since by definition of and by Property 1 of the induction hypothesis

$$|T_{i+1}| > |T_i|-\alpha|T| \geq (1-2\alpha i)|T|-\alpha|T| \geq (1-2(i+1)\alpha)|T|.$$

From the definition of and the induction hypothesis

$$u_{i+1} = u_i+p-i\alpha \;\geq\; ip-\alpha\sum_{j=1}^{i} j+p-i\alpha \;\geq\; (i+1)p-\alpha\sum_{j=1}^{i+1} j$$

Property 2 holds. Before we prove that Properties 3 and 4 hold we prove the following auxiliary claim.

###### Claim 11.

For each and it holds that .

###### Proof.

Assume by contradiction that there is and with . This means that there is with which implies that and From Property 4 there is no end point in the current window ; hence , which is a contradiction to the definition of with

To prove that Property 3 holds for : we get that for each by Claim 11 and the induction hypothesis we have that

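$$\begin{aligned}
|\{x\in X\cap[0,u_{i+1}] \mid h_1(x)\neq h_2(x)\}| &= |\{x\in X\cap[0,u_i] \mid h_1(x)\neq h_2(x)\}| \\
&\quad + |\{x\in X\cap(u_i,\, u_{i+1}-\alpha] \mid h_1(x)\neq h_2(x)\}| \\
&\quad + |\{x\in X\cap(u_{i+1}-\alpha,\, u_{i+1}] \mid h_1(x)\neq h_2(x)\}| \\
&\leq \alpha\sum_{j=1}^{i+2} j\,|X|+\alpha \;\leq\; \alpha\sum_{j=1}^{i+3} j\,|X|
\end{aligned}$$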

To prove that Property 4 holds for : if there is an endpoint in then, since the length of is smaller than , its start point is before . This contradicts Claim 11.

• Case 2: there is with . Since increases as we slide , we focus on the first sliding-window such that . Since there is no separating in we get that There are again two cases depending on whether is at the beginning of or not.

• Case 2.1: if we define

$$T_{i+1}=\{h\in T_i \mid \exists k,\ I\cap[a^h_k,\, a^h_k+p]\neq\emptyset\} \quad\text{and}\quad u_{i+1}=u_i+p+\alpha.$$

To prove that Property 1 holds for : follows from the induction hypothesis and the fact that and

To prove that Property 2 holds for : follows from the induction hypothesis and the definition of

Before we prove that Properties 3 and 4 hold we need the following auxiliary claim.

###### Claim 12.

For each and it holds that .

###### Proof.

For each there is such that . Hence and Since there is no end point in the current window we have that To sum up, we have , which proves the claim.

To prove that Property 3 holds for : note that is equal to

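$$\begin{aligned}
&|\{x\in X\cap[0,u_i] \mid h_1(x)\neq h_2(x)\}| \\
&\quad + |\{x\in X\cap(u_i,\, u_i+\alpha) \mid h_1(x)\neq h_2(x)\}| \\
&\quad + |\{x\in X\cap[u_i+\alpha,\, u_i+p-i\alpha] \mid h_1(x)\neq h_2(x)\}| \\
&\quad + |\{x\in X\cap(u_i+p-i\alpha,\, u_i+p+\alpha] \mid h_1(x)\neq h_2(x)\}| \\
&\leq \alpha\sum_{j=1}^{i+2} j,
\end{aligned}$$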

where the inequality follows from the induction hypothesis and Claim 12.

To prove Property 4 holds for : if there is an with an end-point in

$$[u_{i+1},\, u_{i+1}+|W_{i+1}|]=[u_{i+1},\, u_{i+1}+p-(i+1)\alpha]$$

then its start-point is in

$$[u_{i+1}-p,\, u_{i+1}-(i+1)\alpha]=[u_i+\alpha,\, u_i+p-i\alpha]\subseteq[u_i,\, u_i+|W_i|],$$

which is a contradiction to Claim 12 and Property 4 for

• Case 2.2: If we define

$$T_{i+1}=\{h\in T_i \mid \exists k.\ a^h_k\in I\} \quad\text{and}\quad u_{i+1}=i_2+p.$$

To prove that Property 1 holds for : Since is the first sliding window with there are at most of the hypotheses in that start before . Since there are at most of the hypotheses in that start after . In other words there are at least hypotheses that intersect ; i.e., . By Property 1 for we have

To prove that Property 2 holds for : simply note that .

Before we prove that Properties 3, 4 hold we need the following auxiliary claims.

###### Claim 13.

For each and it holds that .

###### Proof.

Since for each it holds that there is such that then it holds that . Hence,

###### Claim 14.

For each and it holds that .

###### Proof.

From Property 4 for we know that there is no end point in the current window and specifically in (because ). Since for each it holds that there is such that then the claim follows. ∎

To prove that Property 3 holds for : take and note that

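$$\begin{aligned}
|\{x\in X\cap[0,u_{i+1}] \mid h_1(x)\neq h_2(x)\}| &= |\{x\in X\cap[0,u_i] \mid h_1(x)\neq h_2(x)\}| \\
&\quad + |\{x\in X\cap(u_i,\, i_1) \mid h_1(x)\neq h_2(x)\}| \\
&\quad + |\{x\in X\cap[i_1,\, i_2) \mid h_1(x)\neq h_2(x)\}| \\
&\quad + |\{x\in X\cap[i_2,\, i_1+p) \mid h_1(x)\neq h_2(x)\}| \\
&\quad + |\{x\in X\cap[i_1+p,\, u_{i+1}] \mid h_1(x)\neq h_2(x)\}|
\end{aligned}$$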

By the induction assumption the first term is at most , from Claims 13,14 the second and fourth term are equal to , the third and the fifth terms are at most each because the lengths of and are of size This means that we have proven that Property 3 holds for window

To prove Property 4 holds for : note that if there is a hypothesis with an end point in the window , its corresponding start point is in By construction of there is no with a start point in the interval .

Using Theorem 7, with we can deduce the following corollary.

###### Corollary 15.

For any with there is a learning algorithm for that is

$$\left(\frac{k\cdot\log|H_{EP;p}|}{\alpha^3},\ \frac{\log|H_{EP;p}|}{\alpha^2},\ 0.1,\ \epsilon\right)\text{-bounded memory}$$

### 4.3 Decision Lists

A decision list is a function defined over Boolean inputs of the following form:

$$\text{if } \ell_1 \text{ then } b_1 \text{ else if } \ell_2 \text{ then } b_2 \text{ else } \ldots \text{ if } \ell_k \text{ then } b_k \text{ else } b_{k+1},$$

where are literals over the Boolean variables and are bits in We say that the -th level in the last expression is the part “” and the literal leads to the bit . Given some assignment to the Boolean variables we say that the literal is true if it is true under this assignment. Note that there is no need to use the same variable twice in a decision list. In particular, we can assume without loss of generality that . Denote the set of all decision lists over Boolean inputs by
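Evaluating a decision list on an assignment can be sketched as follows. The encoding is an assumption made for illustration: a literal is a pair `(variable_index, value)` that is true when the variable takes that value, so `(2, 0)` stands for the negation of the third variable. The name `evaluate_decision_list` is hypothetical.

```python
def evaluate_decision_list(levels, default, assignment):
    """Return the bit of the first level whose literal is true.

    `levels` is a list of ((variable_index, value), bit) pairs in order,
    `default` is the final bit used when no literal is true, and
    `assignment` is a sequence of 0/1 values for the Boolean variables.
    """
    for (var, value), bit in levels:
        if assignment[var] == value:
            return bit
    return default
```

The first true literal decides the output, which is why reordering consecutive literals that lead to the same bit does not change the function.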

###### Theorem 16.

For any the class is -separable.

###### Proof.

Fix and that is -weak-separable. To prove that is -separable we will show that is -weak-separable. To show that we will find two literals and with that have the following properties:

1. For all hypotheses in :

• appears at level

• is in a lower level than

2. Similarly for all hypotheses in :

• appears at level

• is in a lower level than

3. There is a level and a bit such that all hypotheses in

1. are identical up to level

2. leads to the same value in levels to

Note that for any decision list permuting consecutive literals that all lead to the same bit creates an equivalent decision list; thus, when we write “identical decision lists”, we mean identical up to this kind of permutation.

The correctness of the last three properties will finish the proof since we can take to consist of all the assignments where the literals and are true. In this case it holds that , the disjoint subsets are large (i.e., ). To bound from below, we partition into two parts: all assignments such that at least one of the literals are true (recall that level is defined in Item 3), and . Assume without loss of generality that bit defined in Item 3b is equal to . Note that

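$$\begin{aligned}
|d(T_1,S)-d(T_0,S)| &= \left|\sum_{a\in S}\left(\frac{e(T_1,a)}{|T_1||S|}-\frac{e(T_0,a)}{|T_0||S|}\right)\right| \\
&= \left|\sum_{a\in S_1}\left(\frac{e(T_1,a)}{|T_1||S|}-\frac{e(T_0,a)}{|T_0||S|}\right)+\sum_{a\in S_2}\left(\frac{e(T_1,a)}{|T_1||S|}-\frac{e(T_0,a)}{|T_0||S|}\right)\right| \\
&= \left|\sum_{a\in S_2}\left(\frac{e(T_1,a)}{|T_1||S|}-\frac{e(T_0,a)}{|T_0||S|}\right)\right| \\
&= \sum_{a\in S_2}\frac{e(T_1,a)}{|T_1||S|} \\
&\geq 2^{-\max\{i_0,i_1\}+1} \geq \epsilon,
\end{aligned}$$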

where the third equality follows from Item 3a, the fourth equality follows from Item 3b, and the first inequality follows from Item 2 since for each assignment that is false in all literals that appear before level we have that and these assignments constitute a fraction out of the assignments in . The last inequality follows from Items 1,2.

To prove that there are literals and subsets as desired in , we will prove by induction on level that if there are no such literals up to level , then there is a subset with , a bit and such that for all hypotheses in

• are identical up to level