We are in the midst of the big data era, which has made the design of learning algorithms that require a limited amount of memory crucial. Many works (Shamir, 2014; Steinhardt et al., 2016; Raz, 2016; Kol et al., 2017; Moshkovitz and Moshkovitz, 2017a, b; Raz, 2017; Garg et al., 2017; Beame et al., 2017)
have discussed the limitations of bounded-memory learning. In this paper we explore what can be properly learned with bounded-memory algorithms. Specifically, we suggest a general bounded-memory learning algorithm for the case where the examples are sampled from the uniform distribution. We also apply this algorithm to some natural hypothesis classes.
Our general algorithm is first applied to discrete threshold functions, where the domain is a discretization of a segment and each hypothesis corresponds to a threshold. There is a simple learning algorithm for this class: save in memory the largest example seen so far with a given label.
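As a rough illustration of why this simple algorithm needs only constant memory, consider the following sketch. The convention for the threshold direction is elided above, so we assume hypothetically that a hypothesis with threshold t labels an example x with 1 if and only if x <= t; under that assumption it suffices to remember a single number.

```python
def learn_threshold(examples):
    """Constant-memory learner for discrete thresholds (a sketch).

    Assumes the hypothetical convention h_t(x) = 1 iff x <= t, so it
    suffices to remember the largest positively labeled example.
    """
    best = None  # largest example seen so far with label 1
    for x, label in examples:
        if label == 1 and (best is None or x > best):
            best = x
    # Return the learned threshold (0 if no positive example was seen).
    return best if best is not None else 0

# Usage: examples labeled by a hidden threshold t = 5.
sample = [(3, 1), (7, 0), (5, 1), (9, 0), (2, 1)]
```

Note that the memory used is a single domain element, independent of the number of examples.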
As a sanity check, we show that this class can indeed be learned using our general algorithm. The second class we consider is equal-piece classifiers, which generalize the threshold class: each hypothesis is now defined by a few numbers. It is unknown how to generalize the simple threshold algorithm to this class; moreover, to the best of our knowledge, it remains unclear how to properly learn this class with bounded memory. However, we show that our general algorithm is also applicable to this class.
The third class we consider is decision lists. This class was introduced by Rivest (1987), who also described a learning algorithm for it. Unfortunately, it is not a bounded-memory algorithm, because it saves all the given labeled examples in memory. Nevo and El-Yaniv (2002) and Dhagat and Hellerstein (1994) provided learning algorithms for this class, but the number of examples used is polynomial only under the assumption that the number of alternations in the decision list is constant. Klivans and Servedio (2006) presented an algorithm that uses a super-polynomial number of examples if the length of the decision list is linear in the number of variables. Long and Servedio (2007) also restricted themselves to uniformly distributed examples, but they considered improper learning of this class. Our general algorithm provides a proper bounded-memory learning algorithm.
1.1 Intuition for the General Algorithm
This paper suggests a general bounded-memory learning algorithm for the case where the examples are sampled from the uniform distribution. We also describe a combinatorial condition, called separability, that suffices for the correctness of the algorithm. All three of our applications are hypothesis classes that satisfy separability.
The general bounded-memory algorithm builds upon a basic general learning algorithm, which we describe briefly. Consider a family of Boolean hypotheses over a domain. The fundamental theorem of statistical learning implies that learning with a given accuracy and constant confidence is possible by saving all the labeled examples in memory. This can be done by maintaining the version space, i.e., the set of all hypotheses that are consistent with the labeled examples seen so far; this (preferably small) set always contains the correct hypothesis. As long as the version space contains a hypothesis whose error is too large, a counting argument implies that there is a large set of examples each of which reduces the size of the version space substantially. The basic learning algorithm that simply keeps the version space in memory therefore uses a large amount of memory: either one stores a bit per hypothesis, indicating whether it is in the version space, or one stores all the examples seen so far.
In this work we show that in many scenarios the basic learning algorithm can be implemented using bounded memory. The difficulty is that each example can eliminate a different subset of the version space, and thus all the labeled examples must apparently be saved in memory. The rationale for the current work is that in many situations there is a large subset of examples all of which eliminate the same large part of the version space. Thus, it suffices to save only a few bits of information about each example (indicating that it belongs to this subset) rather than the whole example.
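The basic (memory-heavy) version-space strategy described above can be sketched as follows; the toy threshold hypotheses and examples here are illustrative stand-ins, not the paper's construction.

```python
def version_space_learn(hypotheses, labeled_examples):
    """Basic learner: keep every hypothesis consistent with all examples.

    Memory scales with the number of hypotheses (or examples), which is
    exactly the cost the bounded-memory algorithm avoids.
    """
    version_space = list(hypotheses)
    for x, label in labeled_examples:
        # Each example eliminates the hypotheses that disagree with it.
        version_space = [h for h in version_space if h(x) == label]
    return version_space

# Usage with toy threshold hypotheses h_t(x) = 1 iff x <= t.
hyps = [lambda x, t=t: int(x <= t) for t in range(10)]
data = [(4, 1), (6, 0)]
survivors = version_space_learn(hyps, data)
```

In this toy run, the example (4, 1) forces t >= 4 and (6, 0) forces t < 6, so only the thresholds t = 4 and t = 5 survive.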
1.2 Informal Summary of our Results
The results are informally summarized below.
We introduce the combinatorial condition of separability for hypothesis classes.
We present a general bounded-memory proper learning algorithm for the case where the examples are sampled from the uniform distribution and the realizability assumption holds. We prove the correctness of this algorithm for classes that satisfy the separability condition.
We prove that discrete threshold functions satisfy separability.
We prove that equal-piece classifiers satisfy separability, and thus can be learned with a proper and bounded memory algorithm.
We prove that decision lists satisfy separability, and thus can be learned with a proper and bounded memory algorithm.
In Section 2 we formally present the notion of separability. In Section 3 we present the general bounded-memory algorithm and in Section 4 we show that this algorithm can be used to properly learn the three classes presented above with bounded-memory. The technical proofs are presented in the Appendix.
2 Separable Classes
In what follows we fix a bipartite graph . The density between sets of vertices and is , where is the number of edges with one vertex in and one vertex in . Two vertices are -close if , where denotes the set of neighbors of vertex . An -ball with center is the set
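On a small bipartite graph represented as a neighbor map, these notions can be computed as follows. The exact normalizations are elided in the text above, so the edge-count normalization and the symmetric-difference form of closeness used here are assumptions.

```python
def density(A, B, neighbors):
    """Edge density between vertex sets A and B: e(A, B) / (|A| * |B|)."""
    edges = sum(1 for a in A for b in B if b in neighbors[a])
    return edges / (len(A) * len(B))

def are_close(u, v, neighbors, eps, n):
    """Hypothetical closeness test: the neighbor sets of u and v differ
    on at most an eps fraction of the n right-hand vertices."""
    return len(neighbors[u] ^ neighbors[v]) <= eps * n

def ball(center, vertices, neighbors, eps, n):
    """The eps-ball around `center`: all vertices close to it."""
    return {v for v in vertices if are_close(center, v, neighbors, eps, n)}

# Tiny example: three left vertices with neighbor sets among 5 right vertices.
nb = {'h1': {1, 2}, 'h2': {1, 2, 3}, 'h3': {5}}
```

Here density({'h1', 'h2'}, {1, 2, 3}, nb) counts 5 of the 6 possible edges, and with eps = 0.2 over n = 5 right vertices, 'h2' (but not 'h3') falls in the ball around 'h1'.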
We next define the weak separability of a vertex subset and a whole graph.
Definition 1 (weak-separable).
Let . We say that is -weak-separable if for every vertex we have .
We say that is -separable if there are subsets and with , such that
Definition 2 (-separable graph).
We say that a bipartite graph is -separable if any that is -weak-separable is also -separable.
A hypothesis class over a domain can be represented as a bipartite graph in the following way. The vertices are the hypotheses on one side and the examples on the other, and an edge connects a hypothesis to an example if and only if the hypothesis evaluates to 1 on that example. We call the resulting bipartite graph the hypotheses graph of the class.
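For concreteness, the hypotheses graph of a toy class can be built directly from this definition, assuming (as hedged above) that an edge corresponds to the hypothesis outputting 1 on the example.

```python
def hypotheses_graph(hypotheses, examples):
    """Build the hypotheses graph as a dict: hypothesis index -> set of
    examples on which that hypothesis evaluates to 1."""
    return {i: {x for x in examples if h(x) == 1}
            for i, h in enumerate(hypotheses)}

# Toy thresholds h_t(x) = 1 iff x <= t over examples 0..4.
graph = hypotheses_graph([lambda x, t=t: int(x <= t) for t in range(3)],
                         range(5))
```

In this toy graph the neighbor set of the threshold t = 2 is exactly {0, 1, 2}.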
Definition 3 (-separable class).
A hypothesis class is -separable if its hypotheses graph is -separable.
3 A Bounded Memory Algorithm
An -bounded-memory learning algorithm is one that uses at most labeled examples sampled from the uniform distribution and bits of memory, and returns a hypothesis that is -close to the correct hypothesis with probability at least .
In Algorithm 1 we present the general bounded-memory proper learning algorithm. The algorithm uses the following subroutines:
Is-close — tests whether is -close to the correct hypothesis (Algorithm 2 in the Appendix)
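Although the paper's Is-close subroutine is deferred to the Appendix, a standard Monte Carlo sketch of such a closeness test under the uniform distribution looks as follows. The sample size, the acceptance rule, and the oracle below are illustrative assumptions, not the paper's actual parameters.

```python
import random

def is_close(h, sample_oracle, eps, m):
    """Estimate whether hypothesis h errs on at most an eps fraction of
    the uniform domain, using m random labeled examples (a sketch)."""
    disagreements = 0
    for _ in range(m):
        x, label = sample_oracle()
        if h(x) != label:
            disagreements += 1
    # Accept when the empirical error is below eps (illustrative rule).
    return disagreements / m < eps

def oracle():
    """Hypothetical oracle: a uniform example from 0..9 labeled by the
    correct hypothesis x <= 5."""
    x = random.randrange(10)
    return x, int(x <= 5)
```

The memory used is a single counter, regardless of how many examples are drawn.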
The correctness of the subroutines is stated in the following technical claims and proved in the Appendix.
Let be a bipartite graph. For any that is -separable there are with , with and with such that implies and implies .
There is an algorithm such that for any hypothesis , for any and for any integer it uses labeled examples and
if is -close to the correct hypothesis, then with probability at least the algorithm returns True.
if is not -close to the correct hypothesis, then with probability at least the algorithm returns False.
Denote by the correct hypothesis. There is an algorithm such that for any set of examples with for any and for any integer the algorithm uses labeled examples and returns with with probability at least .
The next theorem proves the correctness of the algorithm (the details appear in the Appendix); we will omit the symbol for simplicity.
Let be a -separable hypothesis class and denote . Then for any integer there is a -bounded memory algorithm for .
In this section we prove that Discrete Threshold Functions (in Section 4.1), Equal-Piece Classifiers (in Section 4.2), and Decision Lists (in Section 4.3) are separable. This implies, using Theorem 7, that they are properly learnable with bounded memory.
4.1 Threshold functions
The class of threshold functions in is and The class of discrete thresholds is defined similarly, but over a discrete domain of size with and
For any , the class is -separable.
The proof of the theorem appears in the Appendix. Using Theorem 7, with we can deduce the following corollary.
For any there is a -bounded memory learning algorithm for .
4.2 Equal-Piece Classifiers
Each hypothesis in corresponds to a (disjoint) union of intervals, each of length exactly ; that is, and if and only if is inside one of these intervals. More formally, the examples are the numbers defined in Section 4.1, and the hypotheses correspond to the parameters with ; they define the intervals
An example has if and only if there is such that
Note that the class is quite complex: it is easy to verify that it has VC dimension at least (partition the domain into consecutive equal parts and take one point from each part; this set is shattered by the class).
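An equal-piece hypothesis can be evaluated as follows; the closed-open interval convention and the parameter names are assumptions, since the formal parameters are elided above.

```python
def equal_piece(starts, length):
    """Return an equal-piece hypothesis: 1 exactly on the union of the
    intervals [s, s + length) for the given disjoint start points
    (a hypothetical closed-open convention)."""
    def h(x):
        return int(any(s <= x < s + length for s in starts))
    return h

# A hypothesis with two pieces of equal length 0.2.
h = equal_piece(starts=[0.1, 0.5], length=0.2)
```

For instance, this hypothesis labels 0.15 and 0.6 with 1, and 0.4 (which lies between the two pieces) with 0.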
For any with and the class is -separable.
Fix and that is -weak-separable. To prove that is -separable we will show that is -weak-separable.
Assume for contradiction that is not -weak-separable. We will show that there is a set with and with such that and ; thus, in particular , which will prove the claim. We will in fact prove that there is an open interval of length such that the sets and satisfy . We call such an interval separating. This will prove that is -separable, since for we have that (from the assumption in the claim regarding the upper bound on ), , and . For ease of notation we replace by from now on.
Next, we will prove something even stronger by showing that there is a sequence and a sequence of sets such that for all , if there is no separating interval in the “window”
and no separating interval in any of the previous windows, then the following four properties are satisfied:
every is similar up to the current window:
no hypothesis has an endpoint in the current window: for all and it holds that
Take any . Since , Property 3 implies that (since ), contradicting the weak-separability of .
To complete the proof, we show by induction on that if there is no separating interval in the windows up to , then there are that satisfy the four properties above.
Induction basis: Since and , Properties 1 and 2 hold. Since , Property 3 holds. There is no endpoint in the interval (since the length of each interval in any hypothesis in is ), and since we can assume that for each and it holds that ; thus Property 4 holds.
Induction step: By the induction hypothesis there is no separating interval in the full range of . To use this, we intuitively move a small sliding window of length within the current window . For each such we can calculate the number of hypotheses in that contain ,
and the number of hypotheses in that do not intersect ,
Note that is separating if and only if and . Thus, our assumption is that for every , either or . Observe that by Property 4 there are no endpoints of in , which immediately implies that can only increase as we slide within . We next consider two cases, depending on whether there is with or not. In each case we need to define and prove that the four properties hold for , which will complete the proof.
Case 1: there is no with ; i.e., is always smaller than as we slide . Define
From the definition of and the induction hypothesis
For each and it holds that .
Assume for contradiction that there is and with . This means that there is with , which implies that and . From Property 4 there is no endpoint in the current window ; hence , a contradiction to the definition of with . ∎
Case 2: there is with . Since increases as we slide , we focus on the first sliding window such that . Since there is no separating interval in , we get that . There are again two cases, depending on whether is at the beginning of or not.
Case 2.1: if we define
To prove that Property 1 holds for : follows from the induction hypothesis and the fact that and
To prove that Property 2 holds for : follows from the induction hypothesis and the definition of
For each and it holds that .
For each there is such that . Hence and . Since there is no endpoint in the current window, we have that . To sum up, we have , which proves the claim.
To prove that Property 3 holds for : note that is equal to
where the inequality follows from the induction hypothesis and Claim 12.
Case 2.2: If we define
To prove that Property 1 holds for : since is the first sliding window with , at most of the hypotheses in start before . Since , at most of the hypotheses in start after . In other words, at least hypotheses intersect ; i.e., . By Property 1 for we have
To prove that Property 2 holds for : simply note that .
For each and it holds that .
Since for each there is such that , it holds that . Hence, . ∎
For each and it holds that .
From Property 4 for we know that there is no endpoint in the current window , and in particular in (because ). Since for each there is such that , the claim follows. ∎
To prove that Property 3 holds for : take and note that
By the induction assumption the first term is at most ; from Claims 13 and 14 the second and fourth terms are equal to ; and the third and fifth terms are at most each, because the lengths of and are . This proves that Property 3 holds for window
To prove that Property 4 holds for : note that if there is a hypothesis with an endpoint in the window , its corresponding start point is in . By construction of there is no with a start point in the interval .
Using Theorem 7, with we can deduce the following corollary.
For any with there is a learning algorithm for that is
4.3 Decision Lists
A decision list is a function defined over Boolean inputs of the following form:
where are literals over the Boolean variables and are bits in . We say that the -th level in the last expression is the part “” and that the literal leads to the bit . Given some assignment to the Boolean variables, we say that a literal is true if it is true under this assignment. Note that there is no need to use the same variable twice in a decision list; in particular, we can assume without loss of generality that . Denote the set of all decision lists over Boolean inputs by
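A decision list of this form can be evaluated directly; the encoding below (a list of levels, each a literal together with the bit it leads to, plus a default bit) is an illustrative representation, not the paper's.

```python
def eval_decision_list(levels, default, assignment):
    """Evaluate a decision list: scan the levels in order and output the
    bit of the first level whose literal is true under the assignment.

    Each level is (var_index, negated, bit); the literal is the variable
    x_var or its negation. Falls through to `default` if no literal fires.
    """
    for var, negated, bit in levels:
        value = assignment[var]
        if value != negated:  # the literal at this level is true
            return bit
    return default

# "if x0 then 1, else if not x1 then 0, else 1" as a decision list.
dl = [(0, False, 1), (1, True, 0)]
```

For example, under the assignment x0 = 0, x1 = 0, the first literal is false, the second (not x1) is true, and the output is 0.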
For any the class is -separable.
Fix and that is -weak-separable. To prove that is -separable we will show that is -weak-separable. To do so, we will find two literals and with that have the following properties:
For all hypotheses in :
appears at level
is in a lower level than
Similarly for all hypotheses in :
appears at level
is in a lower level than
There is a level and a bit such that all hypotheses in
are identical up to level
leads to the same value in levels to
Note that for any decision list permuting consecutive literals that all lead to the same bit creates an equivalent decision list; thus, when we write “identical decision lists”, we mean identical up to this kind of permutation.
Establishing the last three properties will finish the proof, since we can take to consist of all the assignments under which the literals and are true. In this case it holds that , and the disjoint subsets are large (i.e., ). To bound from below, we partition into two parts: all assignments such that at least one of the literals is true (recall that level is defined in Item 3), and . Assume without loss of generality that the bit defined in Item 3b is equal to . Note that
where the third equality follows from Item 3a, the fourth equality follows from Item 3b, and the first inequality follows from Item 2, since for each assignment that falsifies all literals appearing before level we have that , and these assignments constitute a fraction of the assignments in . The last inequality follows from Items 1 and 2.
To prove that there are literals and subsets as desired in , we will prove by induction on the level that if there are no such literals up to level , then there is a subset with , a bit , and such that for all hypotheses in
are identical up to level