1 Introduction
In this work, we study the number of samples needed for learning under noise and memory constraints. The study of the resources needed for learning under memory constraints was initiated by Shamir [Sha14] and Steinhardt, Valiant and Wager [SVW16], and has been studied in the streaming setting. In addition to being a natural question in learning theory and complexity theory, lower bounds in this model also have direct applications to bounded-storage cryptography [Raz16, VV16, KRT17, TT18, GZ19, JT19, DTZ20, GZ21]. [SVW16] conjectured that any algorithm for learning parities of size $n$ (that is, learning an unknown $x \in \{0,1\}^n$ from a stream of random linear equations in $n$ variables over $\mathbb{F}_2$) requires either a memory of size $\Omega(n^2)$ or an exponential number of samples. This conjecture was proven in [Raz16], and in follow-up works it was generalized to learning sparse parities in [KRT17] and to more general learning problems in [Raz17, MM17, MT17, GRT18, BGY18, DS18, MM18, SSV19, GRT19, DKS19, GRZ20].
In this work, we extend this line of work to noisy Boolean function learning problems. In particular, we consider the well-studied problem of learning parity under noise (LPN). In this problem, a learner wants to learn an unknown $x \in \{0,1\}^n$ from independent and uniformly random linear equations in $x$ over $\mathbb{F}_2$, where the right-hand sides are obtained by independently flipping the evaluation of an unknown parity function with probability $\frac{1}{2}-\varepsilon$. Learning Parity with Noise (LPN) is a central problem in Learning and Coding Theory (often referred to as decoding random linear codes) and has been extensively studied. Even without memory constraints, coming up with algorithms for the problem has proven to be challenging, and the current state of the art for solving the problem is still the celebrated work of Blum, Kalai and Wasserman [BKW03] that runs in time $2^{O(n/\log n)}$. Over time, the hardness of LPN (and its generalization to non-binary finite fields) has been used as a starting point in several hardness results [KKMS08, FGKP09] and in constructing cryptographic primitives [Ale03]. On the other hand, lower bounds for the problem are known only in restricted models such as Statistical Query Learning [Kea98] (the SQ model does not seem to distinguish between the noisy and noiseless variants of parity learning and yields the same lower bound in both cases).
Learning under noise is at least as hard as learning without noise, and thus the memory-sample lower bounds for parity learning [Raz16] hold for learning parity under noise too. It is natural to ask: can we get better space lower bounds for learning parities under noise? In this work, we strengthen the memory lower bound to $\Omega(n^2/\varepsilon)$ for parity learning with noise.
Our results actually extend to a broad class of learning problems under noise. As in [Raz17] and follow-up works, we represent a learning problem using a matrix. Let $X$, $A$ be two finite sets (where $X$ represents the concept class that we are trying to learn and $A$ represents the set of possible samples). Let $M : A \times X \to \{-1,1\}$ be a matrix. The matrix $M$ represents the following learning problem with error parameter $\varepsilon$ ($0 < \varepsilon \le \frac{1}{2}$): An unknown element $x \in X$ was chosen uniformly at random. A learner tries to learn $x$ from a stream of samples, $(a_1,b_1),(a_2,b_2),\ldots$, where for every $i$, $a_i \in A$ is chosen uniformly at random and $b_i = M(a_i,x)$ with probability $\frac{1}{2}+\varepsilon$ (and $b_i = -M(a_i,x)$ with probability $\frac{1}{2}-\varepsilon$).
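To make the sampling model concrete, here is a minimal Python sketch of the sample stream for the learning problem given by a matrix $M$; the function name, the dictionary encoding of $M$, and the parameter names are our own illustration, not from the paper.

```python
import random

def noisy_stream(M, A, x, eps, m, rng=random):
    """Generate m samples (a_i, b_i) for the learning problem of matrix M:
    a_i is uniform over A, and b_i = M(a_i, x) with probability 1/2 + eps
    (and is flipped with probability 1/2 - eps).  M is encoded as a dict
    mapping (a, x) -> +1/-1."""
    samples = []
    for _ in range(m):
        a = rng.choice(A)
        b = M[(a, x)]
        if rng.random() >= 0.5 + eps:  # flip with probability 1/2 - eps
            b = -b
        samples.append((a, b))
    return samples
```

Setting `eps = 1/2` recovers the noiseless stream of [Raz17, GRT18], while as `eps` tends to 0 each sample carries less and less information about $x$.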
Our Results
We use the extractor-based characterization of the matrix $M$ to prove our lower bounds, as done in [GRT18]. Our main result can be stated as follows (Corollary 2): Assume that $k, \ell, r$ are such that any submatrix of $M$ of at least $2^{-k} \cdot |A|$ rows and at least $2^{-\ell} \cdot |X|$ columns has a bias of at most $2^{-r}$. Then, any learning algorithm for the learning problem corresponding to $M$ with error parameter $\varepsilon$ requires either a memory of size at least $\Omega\!\left(\frac{k \cdot \ell}{\varepsilon}\right)$, or at least $2^{\Omega(r)}$ samples. Thus, we get an extra factor of $\frac{1}{\varepsilon}$ in the space lower bound for all the bounds on learning problems that [GRT18] imply, some of which are as follows (see [GRT18] for details on why the corresponding matrices satisfy the extractor-based property):

Parities with noise: A learner tries to learn $x = (x_1, \ldots, x_n) \in \{0,1\}^n$, from (a stream of) random linear equations over $\mathbb{F}_2$ which are correct with probability $\frac{1}{2}+\varepsilon$ and flipped with probability $\frac{1}{2}-\varepsilon$. Any learning algorithm requires either a memory of size $\Omega(n^2/\varepsilon)$ or an exponential number of samples.

Sparse parities with noise: A learner tries to learn $x \in \{0,1\}^n$ of sparsity $\ell$, from (a stream of) random linear equations over $\mathbb{F}_2$ which are correct with probability $\frac{1}{2}+\varepsilon$ and flipped with probability $\frac{1}{2}-\varepsilon$. Any learning algorithm requires:

Assuming $\ell \le n^{0.9}$: either a memory of size $\Omega(n \cdot \ell^{0.99}/\varepsilon)$ or $\ell^{\Omega(\ell)}$ samples.

Assuming $\ell \le n/2$: either a memory of size $\Omega(n \cdot \ell/\varepsilon)$ or $2^{\Omega(\ell)}$ samples.


Learning from noisy sparse linear equations: A learner tries to learn $x \in \{0,1\}^n$, from (a stream of) random sparse linear equations, of sparsity $\ell$, over $\mathbb{F}_2$, which are correct with probability $\frac{1}{2}+\varepsilon$ and flipped with probability $\frac{1}{2}-\varepsilon$. Any learning algorithm requires:

Assuming $\ell \le n^{0.9}$: either a memory of size $\Omega(n \cdot \ell^{0.99}/\varepsilon)$ or $\ell^{\Omega(\ell)}$ samples.

Assuming $\ell \le n/2$: either a memory of size $\Omega(n \cdot \ell/\varepsilon)$ or $2^{\Omega(\ell)}$ samples.


Learning from noisy low-degree equations: A learner tries to learn $x \in \{0,1\}^n$, from (a stream of) random multilinear polynomial equations of degree at most $d$, over $\mathbb{F}_2$, which are correct with probability $\frac{1}{2}+\varepsilon$ and flipped with probability $\frac{1}{2}-\varepsilon$. For the range of $d$ handled by [GRT18], we obtain the memory bound of [GRT18] strengthened by a factor of $\frac{1}{\varepsilon}$, with the same bound on the number of samples.

Low-degree polynomials with noise: A learner tries to learn an $n$-variate multilinear polynomial $p$ of degree at most $d$ over $\mathbb{F}_2$, from (a stream of) random evaluations of $p$ over $\mathbb{F}_2^n$, which are correct with probability $\frac{1}{2}+\varepsilon$ and flipped with probability $\frac{1}{2}-\varepsilon$. For the range of $d$ handled by [GRT18], we obtain the memory bound of [GRT18] strengthened by a factor of $\frac{1}{\varepsilon}$, with the same bound on the number of samples.
Techniques
Our proof follows the proofs of [Raz17, GRT18] very closely and builds on them. We extend the extractor-based result of [GRT18] to the noisy case; a straightforward adaptation of its proof gives the stronger lower bound for the noisy case (which reflects the strength of the current techniques). The main contribution of this paper is not a technical one, but establishing stronger space lower bounds for the well-studied problem of learning parity with noise using the current techniques.
Discussion and Open Problem
Let us look at a space upper bound for the problem of learning parity with noise, that is, a learner tries to learn $x \in \{0,1\}^n$ from a stream of samples of the form $(a_i, b_i)$, where $a_i \in \{0,1\}^n$ is chosen uniformly at random and $b_i = a_i \cdot x$ with probability $\frac{1}{2}+\varepsilon$ and $b_i = a_i \cdot x \oplus 1$ with probability $\frac{1}{2}-\varepsilon$ (here, $a_i \cdot x$ represents the inner product of $a_i$ and $x$ over $\mathbb{F}_2$, that is, $a_i \cdot x = \sum_{j=1}^{n} a_{i,j} \cdot x_j \pmod{2}$).
Upper Bound:
Consider the following algorithm: Store the first $m = O(n/\varepsilon^2)$ samples. Check, for every $x' \in \{0,1\}^n$, whether for at least a $\frac{1}{2}+\frac{\varepsilon}{2}$ fraction of the stored samples $(a_i, b_i)$, $b_i$ agrees with $a_i \cdot x'$. Output the first $x'$ that satisfies the check. In expectation, $x$ agrees with $b_i$ for a $\frac{1}{2}+\varepsilon$ fraction of the samples, whereas every $x' \ne x$, in expectation, agrees with $b_i$ for only half the samples. Therefore, for large enough $m = O(n/\varepsilon^2)$, using a Chernoff bound and a union bound over all $2^n$ candidates, with high probability ($1 - 2^{-\Omega(n)}$) over the samples, $x'$ satisfies the check if and only if $x' = x$, and the algorithm outputs the correct answer under this event. The algorithm uses $O(n/\varepsilon^2)$ samples and $O(n^2/\varepsilon^2)$ bits of space.
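The exhaustive-search upper bound above can be sketched in Python as follows; the function name and the $\{0,1\}$-valued right-hand-side convention ($b_i = \langle a_i, x \rangle \bmod 2$ when correct) are our own illustration, and the usage example below feeds it noiseless samples.

```python
import itertools

def learn_parity_noisy(samples, n, eps):
    """Output the first x' in {0,1}^n that agrees with at least a
    1/2 + eps/2 fraction of the stored samples; with m = O(n/eps^2)
    stored samples this identifies x with high probability."""
    m = len(samples)
    for x_cand in itertools.product([0, 1], repeat=n):
        agree = sum(1 for a, b in samples
                    if sum(ai * xi for ai, xi in zip(a, x_cand)) % 2 == b)
        if agree >= (0.5 + eps / 2) * m:
            return x_cand
    return None
```

Storing the samples dominates the space: $O(n/\varepsilon^2)$ samples of $n+1$ bits each, i.e. $O(n^2/\varepsilon^2)$ bits, matching the bound stated above.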
In this paper, we prove that any algorithm that learns parity with noise from a stream of samples (as defined above) requires $\Omega(n^2/\varepsilon)$ bits of space or an exponential number of samples. Improving the lower bound to match the upper bound (or vice versa) is a fascinating open problem, and we conjecture that the upper bound is tight. As each sample gives at most $O(\varepsilon^2)$ bits of information about $x$, we can at least show that a learning algorithm requires $\Omega(n/\varepsilon^2)$ samples to learn $x$ (which corresponds to $O(n^2/\varepsilon^2)$ bits of space if each sample is stored).
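The $O(\varepsilon^2)$ bound on the information per sample comes from a standard entropy estimate; a short version of the calculation (our gloss, not taken from the paper) is:

```latex
% Conditioned on a_i, the bit b_i is correct with probability 1/2 + eps, so
I(b_i ; x \mid a_i) \;\le\; 1 - H\!\left(\tfrac{1}{2} + \varepsilon\right)
  \;=\; \frac{2\varepsilon^2}{\ln 2} + O(\varepsilon^4) \;=\; O(\varepsilon^2),
% where H is the binary entropy function; hence recovering the n bits of x
% requires Omega(n / eps^2) samples.
```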
Conjecture 1.1.
Any learner that tries to learn $x \in \{0,1\}^n$ from a stream of samples of the form $(a_i, b_i)$, where $a_i \in \{0,1\}^n$ is chosen uniformly at random and $b_i = a_i \cdot x$ with probability $\frac{1}{2}+\varepsilon$ and $b_i = a_i \cdot x \oplus 1$ with probability $\frac{1}{2}-\varepsilon$, requires either $\Omega(n^2/\varepsilon^2)$ bits of memory or $2^{\Omega(n)}$ samples.
The proof of the conjecture, if true, would likely require new technical insights (beyond extractor-based techniques) into proving time-space (or memory-sample) lower bounds for learning problems.
Outline of the Paper
2 Preliminaries
Denote by $U_X$ the uniform distribution over $X$. Denote by $\log$ the logarithm to base 2. For a random variable $Z$ and an event $E$, we denote by $\mathbb{P}_Z$ the distribution of the random variable $Z$, and we denote by $\mathbb{P}_{Z|E}$ the distribution of the random variable $Z$ conditioned on the event $E$.
Viewing a Learning Problem, with error $\varepsilon$, as a Matrix
Let $X$, $A$ be two finite sets of size larger than 1. Let $n = \log_2 |X|$ and $n' = \log_2 |A|$.
Let $M : A \times X \to \{-1,1\}$ be a matrix. The matrix $M$ corresponds to the following learning problem with error parameter $\varepsilon$ ($0 < \varepsilon \le \frac{1}{2}$). There is an unknown element $x \in X$ that was chosen uniformly at random. A learner tries to learn $x$ from samples $(a, b)$, where $a \in A$ is chosen uniformly at random, and $b = M(a,x)$ with probability $\frac{1}{2}+\varepsilon$ and $b = -M(a,x)$ with probability $\frac{1}{2}-\varepsilon$. That is, the learning algorithm is given a stream of samples, $(a_1,b_1),(a_2,b_2),\ldots$, where each $a_i$ is uniformly distributed over $A$, and $b_i = M(a_i,x)$ with probability $\frac{1}{2}+\varepsilon$ and $b_i = -M(a_i,x)$ with probability $\frac{1}{2}-\varepsilon$.
Norms and Inner Products
Let $p \ge 1$. For a function $f : X \to \mathbb{R}$, denote by $\|f\|_p$ the $L_p$ norm of $f$, with respect to the uniform distribution over $X$, that is:
$$\|f\|_p = \left( \underset{x \sim U_X}{\mathbb{E}} \left[ |f(x)|^p \right] \right)^{1/p}.$$
For two functions $f, g : X \to \mathbb{R}$, define their inner product with respect to the uniform distribution over $X$ as
$$\langle f, g \rangle = \underset{x \sim U_X}{\mathbb{E}} \left[ f(x) \cdot g(x) \right].$$
For a matrix $M : A \times X \to \{-1,1\}$ and a row $a \in A$, we denote by $M_a : X \to \{-1,1\}$ the function corresponding to the $a$-th row of $M$. Note that for a function $f : X \to \mathbb{R}$, we have $\langle M_a, f \rangle = \frac{(M \cdot f)_a}{|X|}$. Here, $M \cdot f$ represents the matrix multiplication of $M$ with $f$, viewed as a vector indexed by $X$.
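The identity $\langle M_a, f \rangle = (M \cdot f)_a / |X|$ follows directly from the definition of the inner product as an expectation; a quick NumPy sanity check on a random small matrix (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A_size, X_size = 4, 8
M = rng.choice([-1, 1], size=(A_size, X_size))  # matrix M : A x X -> {-1, 1}
f = rng.random(X_size)                          # arbitrary function f : X -> R

for a in range(A_size):
    inner = np.mean(M[a] * f)                   # <M_a, f> w.r.t. uniform dist. on X
    assert np.isclose(inner, (M @ f)[a] / X_size)
```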
$L_2$-Extractors and $L_\infty$-Extractors
Definition 2.1.
$L_2$-Extractor: Let $X$, $A$ be two finite sets. A matrix $M : A \times X \to \{-1,1\}$ is a $(k, \ell)$-$L_2$-Extractor with error $2^{-r}$, if for every non-negative $f : X \to \mathbb{R}$ with $\frac{\|f\|_2}{\|f\|_1} \le 2^{\ell}$ there are at most $2^{-k} \cdot |A|$ rows $a$ in $A$ with
$$\frac{\left| \langle M_a, f \rangle \right|}{\|f\|_1} \ge 2^{-r}.$$
Let $X$ be a finite set. We denote a distribution over $X$ as a function $P : X \to [0,1]$ such that $\sum_{x \in X} P(x) = 1$. We say that a distribution $P$ has min-entropy $k$ if for all $x \in X$, we have $P(x) \le 2^{-k}$.
Definition 2.2.
$L_\infty$-Extractor: Let $X$, $A$ be two finite sets. A matrix $M : A \times X \to \{-1,1\}$ is a $(k, \ell)$-$L_\infty$-Extractor with error $2^{-r}$ if for every distribution $P_A$ over $A$ with min-entropy at least $\log_2 |A| - k$ and every distribution $P_X$ over $X$ with min-entropy at least $\log_2 |X| - \ell$,
$$\left| \underset{a \sim P_A,\; x \sim P_X}{\mathbb{E}} \left[ M(a,x) \right] \right| \le 2^{-r}.$$
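For intuition, the bias condition in Definition 2.2 can be checked by brute force on toy matrices. The sketch below (our own helper, feasible only for tiny $M$) searches over flat distributions, i.e. distributions uniform over a subset, which are the extreme points of the min-entropy constraints:

```python
import itertools
import numpy as np

def max_bias(M, k, l):
    """Brute-force the worst-case bias of M over flat distributions:
    uniform over row subsets of size |A| / 2^k and column subsets of
    size |X| / 2^l (sizes assumed to be powers of two)."""
    A_size, X_size = M.shape
    ra, cx = A_size >> k, X_size >> l
    best = 0.0
    for rows in itertools.combinations(range(A_size), ra):
        for cols in itertools.combinations(range(X_size), cx):
            best = max(best, abs(M[np.ix_(rows, cols)].mean()))
    return best

# Inner-product matrix of parities on 2 bits: M[a, x] = (-1)^{<a, x>}
n = 2
M = np.array([[(-1) ** bin(a & x).count("1") for x in range(2 ** n)]
              for a in range(2 ** n)])
```

Note that the all-ones row $a = 0$ already forces bias 1 once the min-entropy is reduced, which is one reason extractor-type statements for parities exclude the trivial row.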
Branching Program for a Learning Problem
In the following definition, we model the learner for the learning problem that corresponds to the matrix $M$ by a branching program, as done in previous papers starting with [Raz16].
Definition 2.3.
Branching Program for a Learning Problem: A branching program of length $m$ and width $d$, for learning, is a directed (multi-)graph with vertices arranged in $m+1$ layers containing at most $d$ vertices each. In the first layer, that we think of as layer 0, there is only one vertex, called the start vertex. A vertex of out-degree 0 is called a leaf. All vertices in the last layer are leaves (but there may be additional leaves). Every non-leaf vertex in the program has $2 \cdot |A|$ outgoing edges, labeled by elements $(a, b) \in A \times \{-1,1\}$, with exactly one edge labeled by each such $(a, b)$, and all these edges going into vertices in the next layer. Each leaf $v$ in the program is labeled by an element $\tilde{x}(v) \in X$, that we think of as the output of the program on that leaf.
Computation-Path: The samples $(a_1,b_1),\ldots,(a_m,b_m)$ that are given as input define a computation-path in the branching program, by starting from the start vertex and following at step $i$ the edge labeled by $(a_i, b_i)$, until reaching a leaf. The program outputs the label $\tilde{x}(v)$ of the leaf $v$ reached by the computation-path.
Success Probability: The success probability of the program is the probability that $\tilde{x} = x$, where $\tilde{x}$ is the element that the program outputs, and the probability is over $x, a_1, \ldots, a_m, b_1, \ldots, b_m$ (where $x$ is uniformly distributed over $X$, $a_1, \ldots, a_m$ are uniformly distributed over $A$, and for every $i$, $b_i = M(a_i,x)$ with probability $\frac{1}{2}+\varepsilon$ and $b_i = -M(a_i,x)$ with probability $\frac{1}{2}-\varepsilon$).
A learning algorithm, using $m$ samples and a memory of $s$ bits, can be modeled as a branching program of length $m$ and width $2^s$ (the lower bound holds for randomized learning algorithms as well, because a branching program is a non-uniform model of computation, and we can fix a good randomness for the computation without affecting the width). Thus, we will focus on proving width-length tradeoffs for any branching program that learns an extractor-based learning problem with noise; such tradeoffs translate into memory-sample tradeoffs for the learning algorithms.
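Definition 2.3 can be made concrete with a toy simulator; everything below (the class name and the transition encoding) is our own illustration, not notation from the paper:

```python
class BranchingProgram:
    """Toy simulator for Definition 2.3.  Vertices in each layer are indexed
    0..width-1; delta maps (layer, vertex, a, b) -> vertex index in the next
    layer; labels maps a final vertex index to the program's guess for x."""
    def __init__(self, delta, labels, length):
        self.delta, self.labels, self.length = delta, labels, length

    def run(self, samples):
        state = 0                                  # the start vertex in layer 0
        for t, (a, b) in enumerate(samples[: self.length]):
            state = self.delta[(t, state, a, b)]   # follow the edge labeled (a, b)
        return self.labels[state]                  # output the label of the leaf
```

A width-$2^s$ program corresponds to an algorithm with $s$ bits of memory: the vertex index within a layer is exactly the memory state after reading the samples so far.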
3 Overview of the Proof
The proof adapts the extractor-based time-space lower bound of [GRT18] to the noisy case; [GRT18], in turn, built on [Raz17], which gave a general technique for proving memory-sample lower bounds. We recall the arguments of [Raz17, GRT18] for convenience.
Assume that $M : A \times X \to \{-1,1\}$ is a $(k,\ell)$-$L_2$-extractor with error $2^{-r}$, and let $n = \log_2 |X|$. Let $B$ be a branching program for the noisy learning problem that corresponds to the matrix $M$. We want to prove that $B$ has length at least $2^{\Omega(r)}$ or width at least $2^{\Omega(k \cdot \ell / \varepsilon)}$ (that is, any learning algorithm solving the learning problem corresponding to $M$ with error parameter $\varepsilon$ requires either $\Omega(k \cdot \ell / \varepsilon)$ memory or an exponential number of samples). Assume for a contradiction that $B$ is of length $2^{c r}$ and width $2^{c \cdot k \cdot \ell / \varepsilon}$, where $c$ is a small constant.
We define the truncated-path, $\mathcal{T}$, to be the same as the computation-path of $B$, except that it sometimes stops before reaching a leaf. Roughly speaking, $\mathcal{T}$ stops before reaching a leaf if certain "bad" events occur. Nevertheless, we show that the probability that $\mathcal{T}$ stops before reaching a leaf is negligible, so we can think of $\mathcal{T}$ as almost identical to the computation-path.
For a vertex $v$ of $B$, we denote by $E_v$ the event that $\mathcal{T}$ reaches the vertex $v$. We denote by $\Pr(v)$ the probability for $E_v$ (where the probability is over $x, a_1, \ldots, a_m, b_1, \ldots, b_m$), and we denote by $\mathbb{P}_{x|v}$ the distribution of the random variable $x$ conditioned on the event $E_v$. Similarly, for an edge $e$ of the branching program $B$, let $E_e$ be the event that $\mathcal{T}$ traverses the edge $e$. Denote $\Pr(e) = \Pr(E_e)$, and $\mathbb{P}_{x|e} = \mathbb{P}_{x|E_e}$.
A vertex $v$ of $B$ is called significant if
$$\left\| \mathbb{P}_{x|v} \right\|_2 > 2^{\ell'} \cdot 2^{-n}$$
(for a suitable parameter $\ell' = \Theta(\ell)$, fixed in the formal proof).
Roughly speaking, this means that conditioned on the event that $\mathcal{T}$ reaches the vertex $v$, a non-negligible amount of information is known about $x$. In order to guess $x$ with a non-negligible success probability, $\mathcal{T}$ must reach a significant vertex. Lemma 4.1 shows that the probability that $\mathcal{T}$ reaches any significant vertex is negligible, and thus the main result follows.
To prove Lemma 4.1, we show that for every fixed significant vertex $s$, the probability that $\mathcal{T}$ reaches $s$ is at most $2^{-\Omega(k \cdot \ell / \varepsilon)}$ (which is smaller than one over the number of vertices in $B$). Hence, we can use a union bound to prove the lemma.
The proof that the probability that $\mathcal{T}$ reaches $s$ is extremely small is the main part of the proof. To that end, we use the following functions to measure the progress made by the branching program towards reaching $s$.
Let $L_i$ be the set of vertices $v$ in layer $i$ of $B$, such that $\Pr(v) > 0$. Let $\Gamma_i$ be the set of edges $e$ from layer $i-1$ of $B$ to layer $i$ of $B$, such that $\Pr(e) > 0$. For a parameter $k' = \Theta(k/\varepsilon)$ (fixed in the formal proof), let
$$\mathcal{Z}_i = \sum_{v \in L_i} \Pr(v) \cdot \langle \mathbb{P}_{x|v}, \mathbb{P}_{x|s} \rangle^{k'}, \qquad \mathcal{Z}'_i = \sum_{e \in \Gamma_i} \Pr(e) \cdot \langle \mathbb{P}_{x|e}, \mathbb{P}_{x|s} \rangle^{k'}.$$
We think of $\mathcal{Z}_i$ as measuring the progress made by the branching program towards reaching a state with distribution similar to $\mathbb{P}_{x|s}$.
We show that each $\mathcal{Z}_i$ may only be negligibly larger than $\mathcal{Z}_{i-1}$. Hence, since it is easy to calculate that $\mathcal{Z}_0 = 2^{-2 n k'}$, it follows that $\mathcal{Z}_i$ is close to $2^{-2 n k'}$, for every $i$. On the other hand, if $s$ is in layer $i$ then $\mathcal{Z}_i$ is at least $\Pr(s) \cdot \langle \mathbb{P}_{x|s}, \mathbb{P}_{x|s} \rangle^{k'} = \Pr(s) \cdot \|\mathbb{P}_{x|s}\|_2^{2k'}$. Thus, $\Pr(s) \cdot \|\mathbb{P}_{x|s}\|_2^{2k'}$ cannot be much larger than $2^{-2 n k'}$. Since $s$ is significant, $\|\mathbb{P}_{x|s}\|_2 > 2^{\ell'} \cdot 2^{-n}$, and hence $\Pr(s)$ is at most $2^{-\Omega(\ell' k')} = 2^{-\Omega(k \cdot \ell / \varepsilon)}$.
The proof that $\mathcal{Z}_i$ may only be negligibly larger than $\mathcal{Z}_{i-1}$ is done in two steps: Claim 4.12 shows by a simple convexity argument that $\mathcal{Z}_i \le \mathcal{Z}'_i$. The hard part, done in Claim 4.10 and Claim 4.11, is to prove that $\mathcal{Z}'_i$ may only be negligibly larger than $\mathcal{Z}_{i-1}$.
For this proof, we define for every vertex $v$ the set of edges going out of $v$ that are "bad", in the sense that $\langle \mathbb{P}_{x|e}, \mathbb{P}_{x|s} \rangle$ is significantly larger than $\langle \mathbb{P}_{x|v}, \mathbb{P}_{x|s} \rangle$. Claim 4.10 shows that for every vertex $v$, the total contribution of the edges going out of $v$ to $\mathcal{Z}'_i$ may only be negligibly higher than the contribution of $v$ to $\mathcal{Z}_{i-1}$.
For the proof of Claim 4.10, which is the hardest proof in the paper, we follow [Raz17, GRT18] and consider the function $f = \mathbb{P}_{x|v} \cdot \mathbb{P}_{x|s}$. We first show how to bound $\|f\|_2$. We then consider two cases: If $\langle \mathbb{P}_{x|v}, \mathbb{P}_{x|s} \rangle$ is negligible, then $v$ is negligible and doesn't contribute much, and we show that for every edge $e$ going out of $v$, $\langle \mathbb{P}_{x|e}, \mathbb{P}_{x|s} \rangle$ is also negligible and doesn't contribute much. If $\langle \mathbb{P}_{x|v}, \mathbb{P}_{x|s} \rangle$ is non-negligible, we use the bound on $\|f\|_2$ and the assumption that $M$ is a $(k,\ell)$-$L_2$-extractor to show that for almost all edges $e$ going out of $v$, $\langle \mathbb{P}_{x|e}, \mathbb{P}_{x|s} \rangle$ is very close to $\langle \mathbb{P}_{x|v}, \mathbb{P}_{x|s} \rangle$. Only an exponentially small ($2^{-\Omega(k)}$) fraction of the edges are "bad" and give a significantly larger $\langle \mathbb{P}_{x|e}, \mathbb{P}_{x|s} \rangle$. In the noiseless case, any "bad" edge can increase $\langle \mathbb{P}_{x|e}, \mathbb{P}_{x|s} \rangle$ by a factor of 2 in the worst case, and hence [GRT18] raised $\langle \mathbb{P}_{x|v}, \mathbb{P}_{x|s} \rangle$ and $\langle \mathbb{P}_{x|e}, \mathbb{P}_{x|s} \rangle$ to the power of $\Theta(k)$, as it is the largest power for which the contribution of the "bad" edges is still small (as their fraction is $2^{-\Omega(k)}$). But in the noisy case, any "bad" edge can increase $\langle \mathbb{P}_{x|e}, \mathbb{P}_{x|s} \rangle$ by a factor of at most $(1 + 2\varepsilon)$ in the worst case, and thus, we can afford to raise $\langle \mathbb{P}_{x|v}, \mathbb{P}_{x|s} \rangle$ and $\langle \mathbb{P}_{x|e}, \mathbb{P}_{x|s} \rangle$ to the power of $\Theta(k/\varepsilon)$. This is where our proof differs from that of [GRT18].
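The $(1+2\varepsilon)$ factor comes from the likelihood ratio of a single noisy sample; a one-line calculation (our gloss of the argument above) is:

```latex
% Updating the distribution of x upon traversing an edge labeled (a, b):
\Pr[\, b \mid a, x' \,] \;=\; \tfrac{1}{2} + \varepsilon \cdot b \cdot M_a(x')
  \;\in\; \left[ \tfrac{1}{2} - \varepsilon,\; \tfrac{1}{2} + \varepsilon \right],
% so the relative reweighting of any x' is at most
\frac{\tfrac{1}{2} + \varepsilon}{\tfrac{1}{2}} \;=\; 1 + 2\varepsilon,
% whereas in the noiseless case Pr[b | a, x'] is 0 or 1, giving a factor of 2.
```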
This outline oversimplifies many details. To make the argument work, we force $\mathcal{T}$ to stop at significant vertices and whenever $\mathbb{P}_{x|v}(x)$ is large, that is, at significant values, as done in previous papers. And we force $\mathcal{T}$ to stop before traversing some edges that are so "bad" that their contribution to $\mathcal{Z}'_i$ is huge and they cannot be ignored. We show that the total probability that $\mathcal{T}$ stops before reaching a leaf is negligible.
4 Main Result
Theorem 1.
Let $0 < \varepsilon \le \frac{1}{2}$. Fix $c$ to be a sufficiently small positive constant. Let $X$, $A$ be two finite sets. Let $n = \log_2 |X|$. Let $M : A \times X \to \{-1,1\}$ be a matrix which is a $(k, \ell)$-$L_2$-extractor with error $2^{-r}$, for sufficiently large $k$, $\ell$ and $r$ (by "sufficiently large" we mean that $k$, $\ell$ and $r$ are larger than some constant that depends on $c$), where $\ell \le n$. Let
(1) 
Let $B$ be a branching program, of length at most $2^{r}$ and width at most $2^{c \cdot k \cdot \ell / \varepsilon}$, for the learning problem that corresponds to the matrix $M$ with error parameter $\varepsilon$. Then, the success probability of $B$ is at most $2^{-\Omega(r)}$.
Proof.
We recall the proof in [GRT18, Raz17] and adapt it to the noisy case. Let
(2) 
Our proof differs from that of [GRT18] starting with Claim 4.5, which allows us to set $k'$ to a larger value of $\Theta(k/\varepsilon)$, instead of the $\Theta(k)$ set in [GRT18]. Note that by the assumption that $k$, $\ell$ and $r$ are sufficiently large, we get that $k'$ and $\ell'$ are also sufficiently large. Since $\ell \le n$, we have $\ell' \le n$. Thus,
(3) 
Let $B$ be a branching program of length $m \le 2^{r}$ and width $d \le 2^{c \cdot k \cdot \ell / \varepsilon}$ (the width lower bound is vacuous for $m = o(n/\varepsilon^2)$, as regardless of the width, $\Omega(n/\varepsilon^2)$ samples are needed to learn) for the learning problem that corresponds to the matrix $M$ with error parameter $\varepsilon$. We will show that the success probability of $B$ is at most $2^{-\Omega(r)}$.
4.1 The Truncated-Path and Additional Definitions and Notation
We will define the truncated-path, $\mathcal{T}$, to be the same as the computation-path of $B$, except that it sometimes stops before reaching a leaf. Formally, we define $\mathcal{T}$, together with several other definitions and notations, by induction on the layers of the branching program $B$.
Assume that we have already defined the truncated-path $\mathcal{T}$ until it reaches layer $i$ of $B$. For a vertex $v$ in layer $i$ of $B$, let $E_v$ be the event that $\mathcal{T}$ reaches the vertex $v$. For simplicity, we denote by $\Pr(v)$ the probability for $E_v$ (where the probability is over $x, a_1, \ldots, a_m, b_1, \ldots, b_m$), and we denote by $\mathbb{P}_{x|v}$ the distribution of the random variable $x$ conditioned on the event $E_v$.
There will be three cases in which the truncated-path $\mathcal{T}$ stops on a non-leaf vertex $v$:

If $v$ is a, so called, significant vertex, where the $L_2$ norm of $\mathbb{P}_{x|v}$ is non-negligible. (Intuitively, this means that conditioned on the event that $\mathcal{T}$ reaches $v$, a non-negligible amount of information is known about $x$.)

If $\mathbb{P}_{x|v}(x)$ is non-negligible. (Intuitively, this means that conditioned on the event that $\mathcal{T}$ reaches $v$, the correct element $x$ could have been guessed with a non-negligible probability.)

If $|\langle M_{a_{i+1}}, \mathbb{P}_{x|v} \rangle|$ is non-negligible. (Intuitively, this means that $\mathcal{T}$ is about to traverse a "bad" edge, which is traversed with a non-negligibly higher or lower probability than the probability of traversal under the uniform distribution on $x$.)
Next, we describe these three cases more formally.
Significant Vertices
We say that a vertex $v$ in layer $i$ of $B$ is significant if
$$\left\| \mathbb{P}_{x|v} \right\|_2 > 2^{\ell'} \cdot 2^{-n}.$$
Significant Values
Even if $v$ is not significant, $\mathbb{P}_{x|v}$ may have relatively large values. For a vertex $v$ in layer $i$ of $B$, denote by $\mathrm{Sig}(v)$ the set of all $x' \in X$, such that,
$$\mathbb{P}_{x|v}(x') > 2^{3\ell'} \cdot 2^{-n}.$$
Bad Edges
For a vertex $v$ in layer $i$ of $B$, denote by $\mathrm{Bad}(v)$ the set of all $a \in A$, such that,
$$\frac{\left| \langle M_a, \mathbb{P}_{x|v} \rangle \right|}{\left\| \mathbb{P}_{x|v} \right\|_1} \ge 2^{-r}.$$
The Truncated-Path
We define $\mathcal{T}$ by induction on the layers of the branching program $B$. Assume that we have already defined $\mathcal{T}$ until it reaches a vertex $v$ in layer $i$ of $B$. The path $\mathcal{T}$ stops on $v$ if (at least) one of the following occurs:

$v$ is significant.

$x \in \mathrm{Sig}(v)$.

$a_{i+1} \in \mathrm{Bad}(v)$.

$v$ is a leaf.
Otherwise, $\mathcal{T}$ proceeds by following the edge labeled by $(a_{i+1}, b_{i+1})$ (same as the computation-path).
4.2 Proof of Theorem 1
Since $\mathcal{T}$ follows the computation-path of $B$, except that it sometimes stops before reaching a leaf, the success probability of $B$ is bounded (from above) by the probability that $\mathcal{T}$ stops before reaching a leaf, plus the probability that $\mathcal{T}$ reaches a leaf $v$ and $\tilde{x}(v) = x$.
The main lemma needed for the proof of Theorem 1 is Lemma 4.1, which shows that the probability that $\mathcal{T}$ reaches a significant vertex is at most $2^{-\Omega(k \cdot \ell / \varepsilon)}$.
Lemma 4.1.
The probability that $\mathcal{T}$ reaches a significant vertex is at most $2^{-\Omega(k \cdot \ell / \varepsilon)}$.
Lemma 4.1 is proved in Section 4.3. We will now show how the proof of Theorem 1 follows from that lemma.
Lemma 4.1 shows that the probability that $\mathcal{T}$ stops on a non-leaf vertex because of the first reason (i.e., that the vertex is significant) is small. The next two claims imply that the probabilities that $\mathcal{T}$ stops on a non-leaf vertex because of the second and third reasons are also small.
Claim 4.2.
If $v$ is a non-significant vertex of $B$ then
$$\Pr_{x' \sim \mathbb{P}_{x|v}}\left[\, x' \in \mathrm{Sig}(v) \,\right] \le 2^{-\ell'}.$$
Proof.
Since $v$ is not significant,
$$\sum_{x' \in X} \mathbb{P}_{x|v}(x')^2 = 2^{n} \cdot \left\| \mathbb{P}_{x|v} \right\|_2^2 \le 2^{2\ell'} \cdot 2^{-n}.$$
Hence, by Markov's inequality,
$$\Pr_{x' \sim \mathbb{P}_{x|v}}\left[\, \mathbb{P}_{x|v}(x') > 2^{3\ell'} \cdot 2^{-n} \,\right] \le \frac{2^{2\ell'} \cdot 2^{-n}}{2^{3\ell'} \cdot 2^{-n}} = 2^{-\ell'}.$$
Since conditioned on $E_v$, the distribution of $x$ is $\mathbb{P}_{x|v}$, we obtain
$$\Pr\left[\, x \in \mathrm{Sig}(v) \mid E_v \,\right] \le 2^{-\ell'}. \qquad \square$$
Claim 4.3.
If $v$ is a non-significant vertex of $B$ then
$$\Pr_{a \sim U_A}\left[\, a \in \mathrm{Bad}(v) \,\right] \le 2^{-k}.$$
Proof.
Since $v$ is not significant, $\left\| \mathbb{P}_{x|v} \right\|_2 \le 2^{\ell'} \cdot 2^{-n}$. Since $\mathbb{P}_{x|v}$ is a distribution, $\left\| \mathbb{P}_{x|v} \right\|_1 = 2^{-n}$. Thus,
$$\frac{\left\| \mathbb{P}_{x|v} \right\|_2}{\left\| \mathbb{P}_{x|v} \right\|_1} \le 2^{\ell'} \le 2^{\ell}.$$
Since $M$ is a $(k,\ell)$-$L_2$-extractor with error $2^{-r}$, there are at most $2^{-k} \cdot |A|$ elements $a \in A$ with
$$\frac{\left| \langle M_a, \mathbb{P}_{x|v} \rangle \right|}{\left\| \mathbb{P}_{x|v} \right\|_1} \ge 2^{-r}.$$
The claim follows since $a$ is uniformly distributed over $A$ and since $\ell' \le \ell$ (Equation (1)). ∎
We can now use Lemma 4.1, Claim 4.2 and Claim 4.3 to prove that the probability that $\mathcal{T}$ stops before reaching a leaf is small. Lemma 4.1 shows that the probability that $\mathcal{T}$ reaches a significant vertex, and hence stops because of the first reason, is at most $2^{-\Omega(k \cdot \ell / \varepsilon)}$. Assuming that $\mathcal{T}$ doesn't reach any significant vertex (in which case it would have stopped because of the first reason), Claim 4.2 shows that in each step, the probability that $\mathcal{T}$ stops because of the second reason is at most $2^{-\ell'}$. Taking a union bound over the $m$ steps, the total probability that $\mathcal{T}$ stops because of the second reason is at most $m \cdot 2^{-\ell'}$. In the same way, assuming that $\mathcal{T}$ doesn't reach any significant vertex (in which case it would have stopped because of the first reason), Claim 4.3 shows that in each step, the probability that $\mathcal{T}$ stops because of the third reason is at most $2^{-k}$. Again, taking a union bound over the $m$ steps, the total probability that $\mathcal{T}$ stops because of the third reason is at most $m \cdot 2^{-k}$. Thus, the total probability that $\mathcal{T}$ stops (for any reason) before reaching a leaf is at most $2^{-\Omega(k \cdot \ell / \varepsilon)} + m \cdot 2^{-\ell'} + m \cdot 2^{-k}$, which is at most $2^{-\Omega(r)}$ by Equation (1) and the bound on $m$.
Recall that if $\mathcal{T}$ doesn't stop before reaching a leaf, it just follows the computation-path of $B$. Recall also that by Lemma 4.1, the probability that $\mathcal{T}$ reaches a significant leaf is at most $2^{-\Omega(k \cdot \ell / \varepsilon)}$. Thus, to bound (from above) the success probability of $B$ by $2^{-\Omega(r)}$, it remains to bound the probability that $\mathcal{T}$ reaches a non-significant leaf $v$ with $\tilde{x}(v) = x$. Claim 4.4 shows that for any non-significant leaf $v$, conditioned on the event that $\mathcal{T}$ reaches $v$, the probability that $\tilde{x}(v) = x$ is at most $2^{3\ell'} \cdot 2^{-n}$, which completes the proof of Theorem 1.
Claim 4.4.
If $v$ is a non-significant leaf of $B$ then
$$\Pr\left[\, \tilde{x}(v) = x \mid E_v \,\right] \le 2^{3\ell'} \cdot 2^{-n}.$$
Proof.
This completes the proof of Theorem 1. ∎
4.3 Proof of Lemma 4.1
Proof.
We need to prove that the probability that $\mathcal{T}$ reaches any significant vertex is at most $2^{-\Omega(k \cdot \ell / \varepsilon)}$. Let $s$ be a significant vertex of $B$. We will bound from above the probability that $\mathcal{T}$ reaches $s$, and then use a union bound over all significant vertices of $B$. Interestingly, the upper bound on the width of $B$ is used only in the union bound.
The Distributions $\mathbb{P}_{x|v}$ and $\mathbb{P}_{x|e}$
Recall that for a vertex $v$ of $B$, we denote by $E_v$ the event that $\mathcal{T}$ reaches the vertex $v$. For simplicity, we denote by $\Pr(v)$ the probability for $E_v$ (where the probability is over $x, a_1, \ldots, a_m, b_1, \ldots, b_m$), and we denote by $\mathbb{P}_{x|v}$ the distribution of the random variable $x$ conditioned on the event $E_v$.
Similarly, for an edge $e$ of the branching program $B$, let $E_e$ be the event that $\mathcal{T}$ traverses the edge $e$. Denote $\Pr(e) = \Pr(E_e)$ (where the probability is over $x, a_1, \ldots, a_m, b_1, \ldots, b_m$), and $\mathbb{P}_{x|e} = \mathbb{P}_{x|E_e}$.
Claim 4.5.
For any edge $e$ of $B$, labeled by $(a, b)$, that goes out of a vertex $v$, such that $\Pr(e) > 0$: for any $x' \notin \mathrm{Sig}(v)$,
$$\mathbb{P}_{x|e}(x') = c_e^{-1} \cdot \mathbb{P}_{x|v}(x') \cdot \left( \tfrac{1}{2} + \varepsilon \cdot b \cdot M_a(x') \right),$$
and $\mathbb{P}_{x|e}(x') = 0$ for $x' \in \mathrm{Sig}(v)$,
where $c_e$ is a normalization factor that satisfies,
$$\left| c_e - \tfrac{1}{2} \right| \le 2^{-r}.$$
Proof.
Let $e$ be an edge of $B$, labeled by $(a, b)$, that goes out of a vertex $v$, and such that $\Pr(e) > 0$. Since $\Pr(e) > 0$, the vertex $v$ is not significant (as otherwise $\mathcal{T}$ always stops on $v$ and hence $\Pr(e) = 0$). Also, since $\Pr(e) > 0$, we know that $a \notin \mathrm{Bad}(v)$ (as otherwise $\mathcal{T}$ never traverses $e$ and hence $\Pr(e) = 0$).
If $\mathcal{T}$ reaches $v$, it traverses the edge $e$ if and only if: $x \notin \mathrm{Sig}(v)$ (as otherwise $\mathcal{T}$ stops on $v$) and $a_{i+1} = a$, $b_{i+1} = b$. Therefore, by Bayes' rule, for any $x' \notin \mathrm{Sig}(v)$,
$$\mathbb{P}_{x|e}(x') = c_e^{-1} \cdot \mathbb{P}_{x|v}(x') \cdot \left( \tfrac{1}{2} + \varepsilon \cdot b \cdot M_a(x') \right),$$
where $c_e$ is a normalization factor, given by
$$c_e = \sum_{x' \notin \mathrm{Sig}(v)} \mathbb{P}_{x|v}(x') \cdot \left( \tfrac{1}{2} + \varepsilon \cdot b \cdot M_a(x') \right).$$
Since $v$ is not significant, by Claim 4.2,
$$\Pr_{x' \sim \mathbb{P}_{x|v}}\left[\, x' \in \mathrm{Sig}(v) \,\right] \le 2^{-\ell'}.$$
Since $a \notin \mathrm{Bad}(v)$,
$$\frac{\left| \langle M_a, \mathbb{P}_{x|v} \rangle \right|}{\left\| \mathbb{P}_{x|v} \right\|_1} < 2^{-r},$$
and hence for every $b \in \{-1, 1\}$,
$$\left| \sum_{x' \in X} \mathbb{P}_{x|v}(x') \cdot \varepsilon \cdot b \cdot M_a(x') \right| \le \varepsilon \cdot 2^{-r}.$$
Hence, by the union bound,
$$\left| c_e - \tfrac{1}{2} \right| \le \varepsilon \cdot 2^{-r} + 2^{-\ell'} \le 2^{-r}$$
(where the last inequality follows since $\ell' \ge r + 1$, by Equation (1)). ∎
Bounding the Norm of
We will show that $\left\| \mathbb{P}_{x|v} \right\|_2$ cannot be too large. Towards this, we will first prove that for every edge $e$ of $B$ that is traversed by $\mathcal{T}$ with probability larger than zero, $\left\| \mathbb{P}_{x|e} \right\|_2$ cannot be too large.
Claim 4.6.
For any edge $e$ of $B$, such that $\Pr(e) > 0$,
$$\left\| \mathbb{P}_{x|e} \right\|_2 \le 2^{\ell' + 2} \cdot 2^{-n}.$$
Proof.
Let $e$ be an edge of $B$, labeled by $(a, b)$, that goes out of a vertex $v$, and such that $\Pr(e) > 0$. Since $\Pr(e) > 0$, the vertex $v$ is not significant (as otherwise $\mathcal{T}$ always stops on $v$ and hence $\Pr(e) = 0$).