Time-Space Tradeoffs for Learning from Small Test Spaces: Learning Low Degree Polynomial Functions

08/08/2017 ∙ by Paul Beame, et al. ∙ University of Washington

We develop an extension of recently developed methods for obtaining time-space tradeoff lower bounds for problems of learning from random test samples to handle the situation where the space of tests is significantly smaller than the space of inputs, a class of learning problems that is not handled by prior work. This extension is based on a measure of how matrices amplify the 2-norms of probability distributions that is more refined than the 2-norms of these matrices. As applications that follow from our new technique, we show that any algorithm that learns m-variate homogeneous polynomial functions of degree at most d over F_2 from evaluations on randomly chosen inputs either requires space Ω(mn) or 2^Ω(m) time where n=m^Θ(d) is the dimension of the space of such functions. These bounds are asymptotically optimal since they match the tradeoffs achieved by natural learning algorithms for the problems.


1 Introduction

The question of how efficiently one can learn from random samples is a problem of longstanding interest. Much of this research has been focussed on the number of samples required to obtain good approximations. However, another important parameter is how many of these samples need to be kept in memory in order to learn successfully. There has been a line of work improving the memory efficiency of learning algorithms, and the question of the limits of such improvement has begun to be tackled relatively recently. Shamir [15] and Steinhardt, Valiant, and Wager [17] both obtained constraints on the space required for certain learning problems and in the latter paper, the authors asked whether one could obtain strong tradeoffs for learning from random samples that yield a superlinear threshold for the space required for efficient learning. In a breakthrough result, Ran Raz [13] showed that even given exact information, if the space of a learning algorithm is bounded by a sufficiently small quadratic function of the input size, then the parity learning problem given exact answers on random samples requires an exponential number of samples even to learn an unknown parity function approximately.

More precisely, in the problem of parity learning, an unknown x ∈ {0,1}^n is chosen uniformly at random, and a learner tries to learn x from a stream of samples (a_1, b_1), (a_2, b_2), … where each a_t is chosen uniformly at random from {0,1}^n and b_t = a_t · x (mod 2). With high probability O(n) uniformly random samples suffice to span {0,1}^n and one can solve parity learning using Gaussian elimination with O(n^2) space. Alternatively, an algorithm with only O(n) space can wait for a specific basis of vectors a_t to appear (for example the standard basis) and store the resulting values; however, this takes 2^O(n) time. Ran Raz [13] showed that either Ω(n^2) space or 2^Ω(n) time is essential: even if the space is bounded by n^2/25, 2^Ω(n) queries are required to learn x correctly with any probability that is 2^-o(n). In follow-on work, [9] showed that the same lower bound applies even if the input x is sparse.
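The Gaussian-elimination learner mentioned above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: vectors in {0,1}^n are encoded as integer bit masks, and the names `parity`, `draw_samples`, and `learn_parity` are hypothetical helpers that do not come from the cited work. The learner keeps at most n reduced equations, which is where the quadratic space bound comes from.

```python
import random


def parity(a, x):
    """Inner product of bit masks a and x over F_2."""
    return bin(a & x).count("1") & 1


def draw_samples(x, n, count, rng):
    """Draw `count` samples (a, b) with a uniform in {0,1}^n and b = <a, x>."""
    samples = []
    for _ in range(count):
        a = rng.getrandbits(n)
        samples.append((a, parity(a, x)))
    return samples


def learn_parity(samples, n):
    """Recover x from samples by Gaussian elimination over F_2.

    Keeps at most n reduced equations of n + 1 bits each, i.e. O(n^2) bits.
    Returns None if the samples do not span {0,1}^n.
    """
    pivots = {}  # leading bit position -> (reduced vector, reduced label)
    for a, b in samples:
        # Eliminate stored leading bits from the new equation, highest first.
        for bit in sorted(pivots, reverse=True):
            if (a >> bit) & 1:
                row, label = pivots[bit]
                a ^= row
                b ^= label
        if a:  # the equation is independent of those stored so far
            pivots[a.bit_length() - 1] = (a, b)
        if len(pivots) == n:
            break
    if len(pivots) < n:
        return None
    # Back-substitution: each stored row only involves bits at or below its pivot.
    x = 0
    for bit in sorted(pivots):
        row, label = pivots[bit]
        if label ^ parity(row & ~(1 << bit), x):
            x |= 1 << bit
    return x
```

A call such as `learn_parity(draw_samples(secret, 16, 320, random.Random(0)), 16)` should return `secret`, since that many uniformly random samples span the space except with negligible probability.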

We can view the unknown x as a (homogeneous) linear function over F_2, and, from this perspective, parity learning learns a linear Boolean function from evaluations over uniformly random inputs. A natural generalization asks if a similar lower bound exists when we learn higher order polynomials with bounded space.

For example, consider homogeneous quadratic functions over F_2. Let and , which we identify with the space of quadratic polynomials in or, equivalently, the space of upper triangular Boolean matrices. Given an input , the learning algorithm receives a stream of sample pairs where (or equivalently when is viewed as a matrix). A learner tries to learn with a stream of samples where is chosen uniformly at random from and .

Given and , we can also view evaluating as computing where we can interpret as an element of . For randomly chosen , the vectors almost surely span and hence we only need to store samples of the form and apply Gaussian elimination to determine . This time, we only need bits to store each sample for a total space bound of . An alternative algorithm using space and time would be to look for a specific basis. One natural example is the basis consisting of the upper triangular parts of

We show that this tradeoff between space or time is inherently required to learn with probability .

Another view of the problem of learning homogeneous quadratic functions (or indeed any low degree polynomial learning problem) is to consider it as parity learning with a smaller sample space of tests. That is, we still want to learn with samples such that , but now is not chosen uniformly at random from ; instead, we choose uniformly at random and set to be the upper triangular part of . Then the size of the space of tests is which is and hence is much smaller than the size space .
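The reduction to parity learning over a small test space can be illustrated with a short Python sketch. It rests on our own reading of the text: an m-variate homogeneous quadratic is stored as its n = m(m+1)/2 upper triangular coefficients, and a test point a in {0,1}^m is lifted to the flattened upper triangular part of a a^T. The function names are hypothetical.

```python
from itertools import combinations_with_replacement, product


def lift(a):
    """Flattened upper triangular part of a a^T: entries a[i] * a[j], i <= j."""
    m = len(a)
    return tuple(a[i] & a[j] for i, j in combinations_with_replacement(range(m), 2))


def evaluate_quadratic(coeffs, a):
    """Evaluate over F_2 the quadratic whose upper triangular coefficients,
    listed in the same (i, j) order as lift, are `coeffs`."""
    return sum(c & t for c, t in zip(coeffs, lift(a))) % 2


def count_tests(m):
    """Return (number of distinct lifted tests, size of the full parity test space)."""
    n = m * (m + 1) // 2
    lifted = {lift(a) for a in product((0, 1), repeat=m)}
    return len(lifted), 2 ** n
```

For m = 4, `count_tests` reports 16 lifted tests against 1024 vectors in the full space, so only a vanishing fraction of the parity tests is ever available to the learner.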

Note that this is the dual problem to that considered by [9] whose lower bound applied when the unknown is sparse, and the tests are sampled from the whole space. That is, the space of possible inputs is much smaller than the space of possible tests.

The techniques in [13, 9] were based on fairly ad-hoc simulations of the original space-bounded learning algorithm by a restricted form of linear branching program for which one can measure progress at learning using the dimension of the consistent subspace. More recent papers of Moshkovitz and Moshkovitz [11, 12] and Raz [14] consider more general tests and use a measure of progress based on 2-norms. While the method of [11] is not strong enough to reproduce the bound in [13] for the case of parity learning, the methods of [14] and later [12] reproduce the parity learning bound and more.

In particular, [14] considers an arbitrary space of inputs and an arbitrary sample space of tests and defines a matrix, indexed by tests and inputs, that has distinct columns; each entry indicates the outcome of applying the corresponding test to the corresponding input. The bound is governed by the (expectation) matrix norm of this matrix, which is a function of its largest singular value, and the progress is analyzed by bounding the impact of applying the matrix to probability distributions with small expectation 2-norm. This method works fine if the space of tests is at least as large as the space of inputs, but it fails completely if the space of tests is much smaller than the space of inputs, which is precisely the situation for learning quadratic functions. Indeed, none of the prior approaches work in this case.

In our work we define a property of matrices that allows us to refine the notion of the largest singular value and extend the method of [14] to the cases that and, in particular, to prove time-space tradeoff lower bounds for learning homogeneous quadratic functions over . This property, which we call the norm amplification curve of the matrix on the positive orthant, analyzes more precisely how grows as a function of for probability vectors on . The key reason that this is not simply governed by the singular values is that such not only have fixed norm, they are also on the positive orthant, which can contain at most one singular vector. We give a simple condition on the 2-norm amplification curve of that is sufficient to ensure that there is a time-space tradeoff showing that any learning algorithm for with success probability at least for some either requires space or time .

For any fixed learning problem given by a matrix , the natural way to express the amplification curve at any particular value of yields an optimization problem given by a quadratic program with constraints on , and , and with objective function that seems difficult to solve. Instead, we relax the quadratic program to a semi-definite program where we replace by a positive semidefinite matrix with the analogous constraints. We can then obtain an upper bound on the amplification curve by moving to the SDP dual and evaluating the dual objective at a particular Laplacian determined by the properties of .

For matrices associated with low degree polynomials over F_2, the property required to bound their amplification curves corresponds precisely to properties of the weight distribution of Reed-Muller codes over F_2. In the case of quadratic polynomials, we can analyze this weight distribution exactly. In the case of higher degree polynomials, bounds on the weight distribution of such codes proven by Kaufman, Lovett, and Porat [8] are sufficient to obtain the properties we need to give strong enough bounds on the norm amplification curves to yield the time-space tradeoffs for learning for all degrees that are .

Our new method extends the potential reach of time-space tradeoff lower bounds for learning problems to a wide array of natural scenarios where the sample space of tests is smaller than the sample space of inputs. Low degree polynomials with evaluation tests are just some of the natural examples. Our bound shows that if the 2-norm amplification curve for has the required property, then to achieve learning success probability for of at least for some , either space or time is required. This kind of bound is consistent even with what we know for very small sample spaces of tests: for example, if is the space of linear functions over and is the standard basis then, even for exact identification, space and time are necessary and sufficient by a simple coupon-collector analysis.
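The coupon-collector behaviour referred to above can be made concrete with a small Python sketch. It assumes, as our own reading of the example, that each test is drawn uniformly from the m standard basis vectors and that exact identification requires seeing every one of them at least once; the function names are hypothetical.

```python
from fractions import Fraction
import random


def expected_draws(m):
    """Exact expected number of uniform draws needed to see all m tests: m * H_m."""
    return m * sum(Fraction(1, k) for k in range(1, m + 1))


def draws_until_all_seen(m, rng):
    """Simulate one run; the learner needs one stored bit per basis vector."""
    seen = set()
    draws = 0
    while len(seen) < m:
        seen.add(rng.randrange(m))
        draws += 1
    return draws
```

Averaging `draws_until_all_seen` over many runs should approach `expected_draws(m)`, which grows like m ln m.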

Thus far, we have assumed that the outcome of each random test in is one of two values. We also sketch how to extend the approach to multivalued outcomes. (We note that, though the mixing condition of [11, 12] does not hold in the case of small sample spaces of tests, [11, 12] do apply in the case of multivalued outcomes.)

Independent of the specific applications to learning from random examples that we obtain, the measure of matrices that we introduce, the 2-norm amplification curve on the positive orthant, seems likely to have significant applications in other contexts outside of learning.

Related work:

Independently of our work, Garg, Raz, and Tal [6] have proven closely related results to ours. The fundamental techniques are similarly grounded in the approach of [14] though their method is based on viewing the matrices associated with learning problems as 2-source extractors rather than on bounding the SDP relaxations of their 2-norm amplification curves. They use this for a variety of applications including the polynomial learning problems we focus on here.

1.1 Branching programs for learning

Following Raz [14], we define the learning problem as follows. Given two non-empty sets, a set X of possible inputs, with a uniformly random prior distribution, and a set A of tests and a matrix M: A × X → {-1, 1}, a learner tries to learn an input x ∈ X given a stream of samples (a_1, b_1), (a_2, b_2), … where for every t, a_t is chosen uniformly at random from A and b_t = M(a_t, x). Throughout this paper we use the notation that n = log_2 |X| and m = log_2 |A|.

For example, parity learning is the special case of this learning problem where M(a, x) = (-1)^(a·x).

Again following Raz [13], the time and space of a learner are modelled simultaneously by expressing the learner’s computation as a layered branching program: a finite rooted directed acyclic multigraph with every non-sink node having outdegree , with one outedge for each with and that leads to a node in the next layer. Each sink node is labelled by some which is the learner’s guess of the value of the input .

The space used by the learning branching program is the logarithm (base 2) of the maximum number of nodes in any layer and the time is the length of the longest path from the root to a sink.
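As a small illustration of these two resource measures, the following Python sketch computes them from the layer widths of a layered program. It assumes that space is the base-2 logarithm of the maximum layer width, as in Raz's branching program model, and that all sinks lie in the last layer, so that the longest root-to-sink path has one edge per layer transition. The function name is hypothetical.

```python
import math


def space_and_time(layer_widths):
    """layer_widths[i] is the number of nodes in layer i, root layer first.

    Returns (space in bits, time in steps) under the assumptions stated above.
    """
    space = math.log2(max(layer_widths))
    time = len(layer_widths) - 1
    return space, time
```

For example, a program with layer widths 1, 4, 16, 16 uses 4 bits of space and 3 steps of time.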

The samples given to the learner, which are based on the uniformly randomly chosen tests and an input, determine a (randomly chosen) computation path in the branching program. When we consider computation paths we include the input in their description.

The (expected) success probability of the learner is the probability, for a uniformly random input, that a random computation path on that input reaches a sink node whose label equals that input.

1.2 Progress towards identification

Following [11, 14] we measure progress towards identifying the input x using the “expectation 2-norm” over the uniform distribution: For any finite set S and f: S → R, define ||f||_2 = (E_s[f(s)^2])^(1/2), where the expectation is over s chosen uniformly at random from S.

Define Δ_X to be the space of probability distributions on X. Consider the two extremes for the expectation 2-norm of elements of Δ_X: If P is the uniform distribution on X, then ||P||_2 = 2^-n. This distribution represents the learner’s knowledge of the input x at the start of the branching program. On the other hand if P is a point distribution on any x, then ||P||_2 = 2^(-n/2).
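The two extremes can be checked numerically. The sketch below assumes that the expectation 2-norm of a function is the square root of the average of its squared values over a uniformly random point, which is our reading of the definition above, and it represents a distribution on 2^n inputs as a plain list of probabilities. The function name is hypothetical.

```python
import math


def expectation_2norm(f):
    """Square root of the average of f(x)^2 over a uniformly random x."""
    return math.sqrt(sum(v * v for v in f) / len(f))


n = 4
size = 2 ** n
uniform = [1 / size] * size            # no knowledge of the input
point = [1.0] + [0.0] * (size - 1)     # the input is fully identified
```

Here `expectation_2norm(uniform)` equals 2^-n and `expectation_2norm(point)` equals 2^(-n/2), so the norm grows by a factor of 2^(n/2) between knowing nothing and knowing the input exactly.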

For each node in the branching program, there is an induced probability distribution on , which represents the distribution on conditioned on the fact that the computation path passes through . It represents the learner’s knowledge of at the time that the computation path has reached . Intuitively, the learner has made significant progress towards identifying the input if is much larger than , say .

The general idea will be to argue that for any fixed node in the branching program that is at a layer that is , the probability over a randomly chosen computation path that is the first node on the path for which the learner has made significant progress is . Since by assumption of correctness the learner makes significant progress with at least probability, there must be at least such nodes and hence the space must be .

Given that we want to consider the first vertex on a computation path at which significant progress has been made, it is natural to truncate a computation path at a vertex if significant progress has already been made there (and then one should not count any path through that vertex towards the progress at some subsequent node). Following [14], for technical reasons we will also truncate the computation path in other circumstances.

Definition 1.1.

We define probability distributions and the -truncation of the computation paths inductively as follows:

  • If is the root, then is the uniform distribution on .

  • (Significant Progress) If then truncate all computation paths at . We call vertex significant in this case.

  • (High Probability) Truncate the computation paths at for all inputs for which . Let be the set of such inputs.

  • (High Bias) Truncate any computation path at if it follows an outedge of with label for which . That is, we truncate the paths at if the outcome of the next sample for is too predictable in that it is highly biased towards or given the knowledge that the path was not truncated previously and arrived at .

  • If is not the root then define to be the conditional probability distribution on over all computation paths that have not previously been truncated and arrive at .

For an edge of the branching program, we also define a probability distribution , which is the conditional probability distribution on induced by the truncated computation paths that pass through edge .

With this definition, it is no longer immediate from the assumption of correctness that the truncated path reaches a significant node with at least probability. However, we will see that a single assumption about the matrix will be sufficient to prove both that this holds and that the probability is that the path reaches any specific node at which significant progress has been made.

2 Norm amplification by matrices on the positive orthant

By definition, for ,

Observe that for , the value is precisely the expected bias of the answer along a uniformly random outedge of (i.e., the advantage in predicting the outcome of the randomly chosen test).

If we have not learned the input , we would not expect to be able to predict the outcome of a typical test; moreover, since any path that would follow a high bias test is truncated, it is essential to argue that remains small at any node where there has not been significant progress.

In [14], was bounded using the matrix norm given by

where the numerator is an expectation -norm over and the denominator is an expectation -norm over . Thus

where is the largest singular value of and is a normalization factor.

In the case of the matrix associated with parity learning, and all the singular values are equal to so . With this bound, if is not a node of significant progress then and hence which is and hence small.
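The claim that all singular values coincide for parity learning can be verified directly on a small instance. The sketch below assumes the usual ±1 encoding of outcomes, with the entry for test a and input x equal to (-1)^(a·x), and checks that the columns of the matrix are pairwise orthogonal with squared length 2^n. The function names are hypothetical.

```python
def parity_matrix(n):
    """Rows indexed by tests a, columns by inputs x, both n-bit masks."""
    size = 1 << n
    return [
        [-1 if bin(a & x).count("1") & 1 else 1 for x in range(size)]
        for a in range(size)
    ]


def column_gram(M):
    """Matrix of inner products between all pairs of columns of M."""
    cols = list(zip(*M))
    return [[sum(u * v for u, v in zip(c1, c2)) for c2 in cols] for c1 in cols]
```

Since the Gram matrix is 2^n times the identity, every singular value equals 2^(n/2), so the single largest singular value already captures how this matrix acts on any vector.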

However, in the case of learning quadratic functions over , the largest singular value of the matrix is still (the uniform distribution on is a singular vector) and so . But in that case, when is we conclude that is at most which is much larger than 1 and hence a useless bound on .
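The observation that the uniform distribution is a singular vector can be checked on a tiny quadratic instance. The sketch below is built on our own assumptions: outcomes are encoded as ±1, an input is the vector of n = m(m+1)/2 upper triangular coefficients, and a test a in {0,1}^m acts through the lifted vector of products a[i] * a[j] with i <= j. The function names are hypothetical.

```python
from itertools import combinations_with_replacement, product


def lift(a):
    """Flattened upper triangular part of a a^T."""
    m = len(a)
    return tuple(a[i] & a[j] for i, j in combinations_with_replacement(range(m), 2))


def quadratic_matrix(m):
    """Rows: tests a in {0,1}^m. Columns: coefficient vectors x in {0,1}^n."""
    n = m * (m + 1) // 2
    tests = list(product((0, 1), repeat=m))
    inputs = list(product((0, 1), repeat=n))
    return [
        [1 - 2 * (sum(c & t for c, t in zip(x, lift(a))) % 2) for x in inputs]
        for a in tests
    ]


def row_gram(M):
    """Matrix of inner products between all pairs of rows of M."""
    return [[sum(u * v for u, v in zip(r1, r2)) for r2 in M] for r1 in M]
```

For m = 2 the matrix has 4 rows and 8 columns. Its rows are pairwise orthogonal with squared length 2^n, and only the all-zero test has a nonzero row sum, so the all-ones direction (a multiple of the uniform distribution) is mapped onto a single coordinate with gain 2^(n/2). The singular values therefore cannot distinguish this direction from the rest, even though the matrix is far from square.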

Indeed, the same kind of problem occurs in using the method of [14] for any learning problem for which is : If is a child of the root of the branching program at which the more likely outcome of a single randomly chosen test is remembered, then . However, in this case and so . It follows that and when is the derived upper bound on at nodes where will be larger than 1 and therefore useless.

We need a more precise way to bound as a function of than the single number . To do this we will need to use the fact that – it has a fixed norm and (more importantly) it is non-negative.

Definition 2.1.

Let be a matrix. The 2-norm amplification curve of is a map given by

In other words, for and , whenever is at most , is at most .

3 Theorems

Our lower bound for learning quadratic functions will be in two parts. First, we modify the argument of [14] to use the function instead of :

Theorem 3.1.

Let , , and assume that . If has for some fixed constant , then there are constants depending only on and such that any algorithm that solves the learning problem for with success probability at least either requires space at least or time at least .

(We could write the statement of the theorem to apply to all and by replacing each occurrence of in the lower bounds with . When , we can use to bound which yields the bound given in [14].)

We then analyze the amplification properties of the matrix associated with learning quadratic functions over .

Theorem 3.2.

Let be the matrix for learning (homogeneous) quadratic functions over . Then for all .

The following corollary is then immediate.

Corollary 3.3.

Let be a positive integer and . For some , any algorithm for learning quadratic functions over that succeeds with probability at least requires space or time .

This bound is tight since it matches the resources used by the learning algorithms for quadratic functions given in the introduction up to constant factors in the space bound and in the exponent of the time bound.

We obtain similar bounds for all low degree polynomials over .

Theorem 3.4.

Let and . Let be the matrix for learning (homogeneous) functions of degree at most over . Then there is a constant depending on such that for all .

Again we have the following immediate corollary which is also asymptotically optimal for constant degree polynomials.

Corollary 3.5.

Fix some integer . There is a such that for positive integers and , any algorithm for learning polynomial functions of degree at most over that succeeds with probability at least requires space or time .

For the case of learning larger degree polynomials where the degree can depend on the number of variables, we can derive the following somewhat weaker lower bound whose proof we only sketch.

Theorem 3.6.

There are constants such that for positive integer and , any algorithm for learning polynomial functions of degree at most over that succeeds with probability at least requires space or time .

We prove Theorem 3.1 in the next section. In Section 5 we give a semidefinite programming relaxation of that provides a strategy for bounding the norm amplification curve and in Section 6 we give the applications of that method to the matrices for learning low degree polynomials. Finally, in Section 7 we sketch how to extend the framework to learning problems for which the tests have multivalued rather than simply binary outcomes.

4 Lower Bounds over Small Sample Spaces

In this section we prove Theorem 3.1. Let be the value given in the statement of the theorem. To do this we define several positive constants that will be useful:

  • ,

  • ,

  • ,

  • , and

  • .

Let be a learning branching program for with length at most and success probability at least .

We will prove that must have space . We first apply the -truncation procedure given in Definition 1.1 to yield and for all vertices in .

The following simple technical lemmas are analogues of ones proved in [14], though we structure our argument somewhat differently. The first uses the bound on the amplification curve of in place of its matrix norm.

Lemma 4.1.

Suppose that vertex in is not significant. Then

Proof.

Since is not significant . By definition of ,

Therefore, by Markov’s inequality,

Lemma 4.2.

Suppose that vertex in is not significant. Then

Proof.

Since is not significant,

Therefore, by Markov’s inequality,

Lemma 4.3.

The probability, over uniformly random and uniformly random computation path in on input , that the truncated version of reaches a significant vertex of is at least .

Proof.

Let be chosen uniformly at random from and consider the truncated path . will not reach a significant vertex of only if one of the following holds:

  1. is truncated at a vertex where .

  2. is truncated at a vertex because the next edge of is labelled by where .

  3. ends at a leaf that is not significant.

By Lemma 4.2, for each vertex on , conditioned on the truncated path reaching , the probability that is at most . Similarly, by Lemma 4.1, for each on the path, conditioned on the truncated path reaching , the probability that is at most . Therefore, since has length at most , the probability that is truncated at for either reason is at most since and .

Finally, if reaches a leaf that is not significant then, conditioned on arriving at , the probability that the input equals the label of is at most . Now

since is not significant, so we have and the probability that is correct conditioned on the truncated path reaching a leaf vertex that is not significant is less than since .

Since is correct with probability at least and these three cases in which does not reach a significant vertex account for correctness at most , which is much less than half of , must reach a significant vertex with probability at least . ∎

The following lemma is the key to the proof of the theorem.

Lemma 4.4.

Let be any significant vertex of . There is an such that for a uniformly random chosen from and a uniformly random computation path , the probability that its truncation ends at is at most .

The proof of Lemma 4.4 requires a delicate progress argument and is deferred to the next subsection. We first show how Lemmas 4.3 and 4.4 immediately imply Theorem 3.1.

Proof of Theorem 3.1.

By Lemma 4.3, for chosen uniformly at random from and the truncation of a uniformly random computation path on input , ends at a significant vertex with probability at least . On the other hand, by Lemma 4.4, for any significant vertex , the probability that ends at is at most . Therefore the number of significant vertices must be at least and since has length at most , there must be at least significant vertices in some layer. Hence requires space . ∎

4.1 Progress towards significance

In this section we prove Lemma 4.4 showing that for any particular significant vertex a random truncated path reaches only with probability . For each vertex in let denote the probability over a random input , that the truncation of a random computation path in on input visits and for each edge in let denote the probability over a random input , that the truncation of a random computation path in on input traverses .

Since is a levelled branching program, the vertices of may be divided into disjoint sets for where is the length of and is the set of vertices at distance from the root, and disjoint sets of edges for where consists of the edges from to . For each vertex , note that by definition we only have

since some truncated paths may terminate at .

For each , since the truncated computation path visits at most one vertex and at most one edge at level , we obtain a sub-distribution on in which the probability of is and a corresponding sub-distribution on in which the probability of is . We write and to denote random selection from these sub-distributions, where the outcome corresponds to the case that no vertex (respectively no edge) is selected.

Fix some significant vertex . We consider the progress that a truncated path makes as it moves from the start vertex to . We measure the progress at a vertex as

Clearly . We first see that starts out at a tiny value.

Lemma 4.5.

If is the start vertex of then .

Proof.

By definition, is the uniform distribution on . Therefore

since is a probability distribution on . On the other hand, since is significant, . The lemma follows immediately. ∎

Since the truncated path is randomly chosen, the progress towards the significant vertex after a given number of steps is a random variable. Following [14], we show that not only is the increase in the expected value of this random variable in each step very small, its higher moments also increase at a very small rate. Define

where we extend and define . We will show that for , is still , which will be sufficient to prove Lemma 4.4.

Therefore, Lemma 4.4, and hence Theorem 3.1, will follow from the following lemma.

Lemma 4.6.

For every with ,

Proof of Lemma 4.4 from Lemma 4.6.

By definition of and Lemma 4.5 we have . By Lemma 4.6, for every with ,

In particular, for every ,

Now fix to be the level of the significant node . Every truncated path that reaches will have contribution times its probability of occurring to . Therefore the truncation of a random computation path reaches with probability at most for and sufficiently large, which proves the lemma. ∎

We now focus on the proof of Lemma 4.6. Because depends on the sub-distribution over and depends on the sub-distribution over , it is natural to consider the analogous quantities based on the sub-distribution over the set of edges that join and . We can extend the definition of to edges of , where we write

Then define

Intuitively, there is no gain of information in moving from elements to elements of . More precisely, we have the following lemma:

Lemma 4.7.

For all , .

Proof.

Note that for , since the truncated paths that follow some edge are precisely those that reach , by definition, . Since the same applies separately to the set of truncated paths for each input that reach , for each we have

Therefore,

i.e., . Since , by the convexity of the map we have

Therefore

Therefore, to prove Lemma 4.6 it suffices to prove that the same statement holds with replaced by ; that is,

is the disjoint union of the out-edges for vertices , so it suffices to show that for each ,

(1)

Since any truncated path that follows must also visit , we can write . Moreover, both and have the same denominator and therefore, by definition, inequality (1), and hence Lemma 4.6, follows from the following lemma.

Lemma 4.8.

For ,

Before we prove Lemma 4.8, following [14], we first prove two technical lemmas, the first relating the distributions for and edges and the second upper bounding .

Lemma 4.9.

Suppose that is not significant and has and label . Then for , only if and , in which case

where .

Proof.

If then by definition of truncation we also will have . Therefore, since , is not a high bias edge – that is, – and hence

Let be the event that both and and define