1 Introduction
We are amidst the big data era, which had made the design of learning algorithms that require limited amount of memory crucial. Many works (Shamir, 2014; Steinhardt et al., 2016; Raz, 2016; Kol et al., 2017; Moshkovitz and Moshkovitz, 2017a, b; Raz, 2017; Garg et al., 2017; Beame et al., 2017)
have discussed the limitations of boundedmemory learning. In this paper we explore what can be properly learned with boundedmemory algorithms. Specifically, we suggest a general boundedmemory learning algorithm for the case where the examples are sampled from the uniform distribution. We also apply this algorithm to some natural hypothesis classes.
Our general algorithm is first applied to discrete threshold functions, , where the domain is a discretization of the segment and each hypothesis corresponds to a number and There is a simple learning algorithm for this class: save in memory the largest example with label
As a sanity check, we show that indeed this class can be learned using our general algorithm. The second class we consider is equalpiece classifiers,
, which is a generalization of the class but now each hypothesis is defined by a few numbers and It is unknown how to generalize the simple algorithm for to the class ; moreover, to the best of our knowledge it remains unclear how to properly learn this class with boundedmemory. However we show that our general algorithm is also applicable for this class.The third class we consider is decision lists. This class was introduced by Rivest (1987), who also described a learning algorithm for this class. Unfortunately, it is not a boundedmemory algorithm because it saves all the given labeled examples in memory. Nevo and ElYaniv (2002); Dhagat and Hellerstein (1994) provided learning algorithms for this class but the number of examples used is polynomial only under the assumption that the number of alternations in the decision list is constant. Klivans and Servedio (2006) presented an algorithm that uses a superpolynomial number of examples if the length of the decision list is linear in the number of variables. Long and Servedio (2007) also limited themselves to the uniformly distributed examples scenario but they considered improper learning of this class. Our general algorithm provides a proper boundedmemory learning algorithm.
1.1 Intuition for the General Algorithm
This paper suggests a general boundedmemory learning algorithm in the case where the examples are sampled from the uniform distribution. We also describe a combinatorial condition, called separability, that suffices for the correctness of the algorithm. All of our three applications are hypothesis classes that satisfy separability.
The general boundedmemory algorithm builds upon a basic general learning algorithm which we describe briefly. Let be a family of Boolean hypotheses over domain The fundamental theorem of statistical learning implies that learning with accuracy and constant confidence is possible after observing labeled examples and using memory bits by saving all examples in memory. This can be done by maintaining the version space; i.e., all hypotheses that are consistent with the labeled examples seen so far, which means that is a (preferably small) set that contains the correct hypothesis. As long as there are hypotheses in the version space with an error larger than , a counting argument implies that there is a large set of examples that is able to reduce the size of the version space substantially. The basic learning algorithm that simply keeps the version space in memory uses a large amount of memory. Either one stores bits in memory, indicating which hypothesis is in the version space, or one stores all the examples seen so far.
In this work we show that in many scenarios the basic learning algorithm can be implemented using bounded memory. The difficulty is that each example can eliminate a different subset of the version space and thus all the labeled examples must be saved in memory. The rationale for the current work is that in many situations there is a large subset of examples that would eliminate the same large part of the version space. Thus, it suffices to only save a few bits of information about the example (indicating that it belongs to the subset ) rather than the whole example.
1.2 Informal Summary of our Results
The results are informally summarized below.

We introduce the combinatorial condition of separability for hypothesis classes.

We present a general memorybounded proper learning algorithm in the case where the examples are sampled from the uniform distribution and the realizability assumption holds. We prove the correctness of this algorithm in the case where the classes satisfy the separability condition.

We prove that discrete threshold functions satisfy separability.

We prove that equalpiece classifiers satisfy separability, and thus can be learned with a proper and bounded memory algorithm.

We prove that decision lists satisfy separability, and thus can be learned with a proper and bounded memory algorithm.
1.3 Organization
In Section 2 we formally present the notion of separability. In Section 3 we present the general boundedmemory algorithm and in Section 4 we show that this algorithm can be used to properly learn the three classes presented above with boundedmemory. The technical proofs are presented in the Appendix.
2 Separable Classes
In what follow we fix a bipartite graph . The density between sets of vertices and is , where is the number of edges with a vertex in and a vertex in . Two vertices are close if , where denotes the set of neighbors of vertex . An ball with center is the set
We next define the weak separability of a vertex subset and a whole graph.
Definition 1 (weakseparable).
Let . We say that is weakseparable if for every vertex we have .
We say that is separable if there are subsets and with , such that
Definition 2 (separable graph).
We say that a bipartite graph is separable if any that is weakseparable is also separable.
A hypothesis class over domain can be represented as a bipartitegraph in the following way. The vertices are the hypotheses and the examples , and the edges connect every hypothesis to the examples if and only if We call the appropriate bipartite graph the hypotheses graph of
Definition 3 (separable class).
A hypothesis class is separable if its hypotheses graph is separable.
3 A Bounded Memory Algorithm
An bounded memory learning algorithm is one that uses at most labeled examples sampled from the uniform distribution, bits of memory, and returns a hypothesis that is
close to the correct hypothesis with probability at least
In Algorithm 1 we present the general boundedmemory proper learning algorithm. The algorithm uses the following subroutines:

Isclose — tests whether is close to the correct hypothesis (Algorithm 2 in the Appendix)
The correctness of the subroutines is stated in the following technical claims and proved in the Appendix.
Claim 4.
Let be a bipartite graph. For any that is separable there are with , with and with such that implies and implies .
Claim 5.
There is an algorithm such that for any hypothesis , for any and for any integer it uses labeled examples and

if is close to the correct hypothesis, then with probability at least the algorithm returns True.

if is not close to the correct hypothesis, then with probability at least the algorithm returns False.
Claim 6.
Denote by the correct hypothesis. There is an algorithm such that for any set of examples with for any and for any integer the algorithm uses labeled examples and returns with with probability at least .
The next theorem proves the correctness of the algorithm (the details appear in the Appendix); we will omit the symbol for simplicity.
Theorem 7.
Let be separable hypothesis class and denote . Then for any integer there is a bounded memory algorithm for .
4 Applications
In this section we prove that Discrete Threshold Functions (in Section 4.1), EqualPiece Classifiers (in Section 4.2), and Decision Lists (in Section 4.3) are separable. This implies, using Theorem 7, that they are properly learnable with bounded memory.
4.1 Threshold functions
The class of threshold functions in is and The class of discrete thresholds is defined similarly but over discrete domain of size with and
Theorem 8.
For any , the class is separable.
The proof of the theorem appears in the Appendix. Using Theorem 7, with we can deduce the following corollary.
Corollary 9.
For any there is a bounded memory learning algorithm for .
4.2 EqualPiece Classifiers
Each hypothesis in corresponds to a (disjoint) union of intervals each of length exactly , that is, and if and only if is inside one of these intervals. More formally, the examples are the numbers as defined in Section 4.1 and the hypotheses correspond to the parameters with and they define the intervals
An example has if and only if there is such that
Note that the class is quite complex since it is easy to verify that it has a VCDimension of at least (partition into consecutive equal parts and take one point from each part — this set is shattered by ).
Theorem 10.
For any with and the class is separable.
Proof.
Fix and that is weakseparable. To prove that is separable we will show that is weakseparable.
Assume by contradiction that is not weakseparable. We will show that there is a set with and with such that and ; thus, in particular which will prove the claim. We will in fact prove that there is an open interval of length such that the sets and satisfying We call such a separating. This will prove that is separable since for we have that (from the assumption in the claim regarding the upper bound on ), , and For ease of notation we replace by from now on.
Next, we will prove something even stronger by showing that there is a sequence and a sequence of sets such that for all if there is no separating in the “window”
and there is no separating in the previous windows either, then the following four properties are satisfied:



every is similar up to the current window:

no hypothesis has an endpoint in the current window: for all and it holds that
Assume by contradiction that there is no separating . Hence, we deduce that the four properties hold for all windows By Property 2, for , it holds that Fix By Property 1 we know that
Take any Since , Property 3 implies that (since ), which is a contradiction to weakseparability of .
To complete the proof we prove by induction on that if there are no separating in the windows up to then there are that have the previous four properties.
Induction basis: Since and , Properties 1 and 2 hold. Since , Property 3 holds. There is no endpoint in the interval (since the length of each interval in any hypothesis in is ) and since we can assume that for each and it holds that thus proving Property 4 holds.
Induction step: By the induction hypothesis there is no separating in the full range of . To use it we intuitively move a small slidingwindow of length in the current window . For each such we can calculate the number of hypotheses in that contain ,
and the number of hypotheses in that do not intersect ,
Note that is separating if and only if and Thus, our assumption is that for every either or Observe that by Property 4 there are no endpoints of in , which immediately implies that can only increase as we slide within . We next consider two cases depending on whether there is with or not. In each case we need to define and prove that the four properties hold for which will complete the proof.

Case 1: there is no with , i.e., is always smaller than as we slide Define
From the definition of and the induction hypothesis
Property 2 holds. Before we prove that Properties 3 and 4 hold we prove the following auxiliary claim.
Claim 11.
For each and it holds that .
Proof.
Assume by contradiction that there is and with . This means that there is with which implies that and From Property 4 there is no end point in the current window ; hence , which is a contradiction to the definition of with ∎

Case 2: there is with . Since increases as we slide , we focus on the first slidingwindow such that . Since there is no separating in we get that There are again two cases depending on whether is at the beginning of or not.

Case 2.1: if we define
To prove that Property 1 holds for : follows from the induction hypothesis and the fact that and
To prove that Property 2 holds for : follows from the induction hypothesis and the definition of
Claim 12.
For each and it holds that .
Proof.
For each there is such that . Hence and Since there is no end point in the current window we have that To sum up, we have , which proves the claim.
∎

Case 2.2: If we define
To prove that Property 1 holds for : Since is the first sliding window with there are at most of the hypotheses in that start before . Since there are at most of the hypotheses in that start after . In other words there are at least hypotheses that intersect ; i.e., . By Property 1 for we have
To prove that Property 2 holds for : simply note that .
Claim 13.
For each and it holds that .
Proof.
Since for each it holds that that there is such that then it holds that . Hence, ∎
Claim 14.
For each and it holds that .
Proof.
From Property 4 for we know that there is no end point in the current window and specifically in (because ). Since for each it holds that there is such that then the claim follows. ∎
To prove that Property 3 holds for : take and note that
By the induction assumption the first term is at most , from Claims 13,14 the second and fourth term are equal to , the third and the fifth terms are at most each because the lengths of and are of size This means that we have proven that Property 3 holds for window
To prove Property 4 holds for : note that if there is a hypothesis with an end point in the window , its corresponding start point is in By construction of there is no with a start point in the interval .
∎
Using Theorem 7, with we can deduce the following corollary.
Corollary 15.
For any with there is a learning algorithm for that is
4.3 Decision Lists
A decision list is a function defined over Boolean inputs of the following form:
where are literals over the Boolean variables and are bits in We say that the th level in the last expression is the part “” and the literal leads to the bit . Given some assignment to the Boolean variables we say that the literal is true if it is true under this assignment. Note that there is no need to use the same variable twice in a decision list. In particular, we can assume without loss of generality that . Denote the set of all decision lists over Boolean inputs by
Theorem 16.
For any the class is separable.
Proof.
Fix and that is weakseparable. To prove that is separable we will show that is weakseparable. To show that we will find two literals and with that have the following properties :

For all hypotheses in :

leads to

appears at level

is in a lower level than


Similarly for all hypotheses in :

leads to

appear at level

is in a lower level than


There is a level and a bit such that all hypotheses in

are identical up to level

leads to the same value in levels to

Note that for any decision list permuting consecutive literals that all lead to the same bit creates an equivalent decision list; thus, when we write “identical decision lists”, we mean identical up to this kind of permutation.
The correctness of the last three properties will finish the proof since we can take to consist of all the assignments where the literals and are true. In this case it holds that , the disjoint subsets are large (i.e., ). To bound from below, we partition into two parts: all assignments such that at least one of the literals are true (recall that level is defined in Item 3), and . Assume without loss of generality that bit defined in Item 3b is equal to . Note that
where the third equality follows from Item 3a, the fourth equality follows from Item 3b, and the first inequality follows from Item 2 since for each assignment that is false in all literals that appear before level we have that and these assignments constitute a fraction out of the assignments in . The last inequality follows from Items 1,2.
To prove that there are literals and subsets as desired in , we will prove by induction on level that if there are not literals up to level , then there is a subset with , a bit and such that for all hypotheses in

are identical up to level
Comments
There are no comments yet.