1 Introduction
Arguably one of the most fundamental problems in machine learning and learning theory, going back to the Perceptron Algorithm [Ros58], is the problem of learning halfspaces, or Linear Threshold Functions (LTFs): Fix $w \in \mathbb{R}^d$ and $\theta \in \mathbb{R}$; an LTF is a function $f_{w,\theta} \colon \mathbb{R}^d \to \{\pm 1\}$ such that $f_{w,\theta}(x) = 1$ if $\langle w, x \rangle \geq \theta$ and $f_{w,\theta}(x) = -1$ otherwise. The associated learning problem is as follows: We observe samples $(x, y)$, where the example $x$ is drawn from a fixed but unknown distribution $D_x$ and the label $y$ is a possibly noisy version of $f_{w,\theta}(x)$. Letting $D$ denote the joint distribution of $(x, y)$, the goal is to output a hypothesis $h$ such that the misclassification error $\Pr_{(x,y) \sim D}[h(x) \neq y]$ is minimized. For the purposes of this paper, we consider binary labels $y \in \{\pm 1\}$. In this work we make progress on a central question in the field: identifying under which types of noise achieving small misclassification error is possible. On a conceptual level, we show that as soon as only very few of the labels are flipped with some probability $\eta$, it is likely to be computationally hard to achieve error better than $\eta$, even if the optimal error is much smaller than this.
Realizable Case, Random Classification Noise, and Agnostic Model
In the noiseless case, also called the realizable case, it holds that $y = f_{w,\theta}(x)$ for all samples. In this setting it is well-known that linear programming can achieve misclassification error at most $\epsilon$ efficiently, i.e., in time polynomial in $d$ and $1/\epsilon$, and reliably, i.e., with probability close to 1, corresponding to Valiant's PAC model [Val84]. When considering noisy labels, the two most well-studied models are Random Classification Noise (RCN) [AL88] and the agnostic model [Hau92, KSS94]. In the former, each sample is generated by first drawing $x \sim D_x$ and then setting $y = f_{w,\theta}(x)$ with probability $1 - \eta$ and $y = -f_{w,\theta}(x)$ with probability $\eta$, for some $\eta < 1/2$. It can be shown that in this model the information-theoretically optimal misclassification error is $\eta$, and it is known how to efficiently find an LTF achieving misclassification error arbitrarily close to this [BFKV98]. However, one clear drawback is that the assumption that the magnitude of the noise is uniform across all examples is unrealistic. At the other extreme, in the agnostic model, no assumption whatsoever is placed on the joint distribution $D$. It is now believed that it is computationally hard to output any hypothesis that achieves error even slightly better than $1/2$. This holds even when the information-theoretically optimal misclassification error is a function going to zero as the ambient dimension goes to infinity [Dan16]. This is based on a reduction from a problem widely believed to be computationally intractable.
A More Realistic Yet Computationally Tractable Noise Model
Given the above results, a natural question to ask is whether there exists a more realistic noise model in which it is still computationally tractable to achieve non-trivial guarantees. A promising candidate is the so-called Massart noise model, which is defined as follows:
Definition 1.1.
Let $D_x$ be a distribution over $\mathbb{R}^d$ and let $f$ be an LTF. For $\eta < 1/2$, we say that a distribution $D$ over $\mathbb{R}^d \times \{\pm 1\}$ satisfies the Massart noise condition with respect to the hypothesis $f$ and to the marginal distribution $D_x$ if there exists a function $\eta(\cdot) \colon \mathbb{R}^d \to [0, \eta]$ such that samples are generated as follows: First, $x \sim D_x$ is drawn, and then we output $(x, y)$, where $y = f(x)$ with probability $1 - \eta(x)$ and $y = -f(x)$ with probability $\eta(x)$, i.e., $\eta(x)$ is the flipping probability.
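To make the sampling process concrete, here is a minimal sketch of a Massart-noise sample generator. The function name and the Gaussian marginal are illustrative choices only; the model allows an arbitrary marginal and an arbitrary flipping-probability function bounded by the noise level.

```python
import random

def sample_massart(d, w, eta_fn, n, rng):
    """Draw n samples (x, y): x is standard Gaussian in R^d (an illustrative
    choice of marginal), the clean label is the halfspace sign(<w, x>), and
    the label is flipped independently with probability eta_fn(x) <= eta."""
    samples = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        clean = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
        y = -clean if rng.random() < eta_fn(x) else clean
        samples.append((x, y))
    return samples

rng = random.Random(0)
# A constant flipping probability of 0.1 recovers Random Classification Noise.
data = sample_massart(3, [1.0, 0.0, 0.0], lambda x: 0.1, 1000, rng)
```

Setting `eta_fn` to a constant recovers RCN; a Massart adversary may instead choose the flipping probability adaptively per point, subject only to the uniform upper bound.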
In the problem of learning halfspaces in the Massart noise model, we observe samples from an unknown distribution $D$ satisfying the Massart noise condition for some known bound $\eta$, and the goal is to output a hypothesis $h$ minimizing the misclassification error $\Pr_{(x,y) \sim D}[h(x) \neq y]$.
Note that the marginal distribution $D_x$, the true hypothesis $f$, and the flipping probability function $\eta(\cdot)$ are all unknown.
The model was proposed in [MN06] (we note that [RS94, Slo96] introduced an equivalent model called "malicious misclassification noise"). Note that if $\eta(x) = \eta$ for all $x$, we recover the Random Classification Noise model. As previously mentioned, the information-theoretically optimal error in the RCN model is equal to $\eta$. However, in the more general case of Massart noise, the information-theoretically optimal error is equal to $\mathrm{OPT} = \mathbb{E}_{x \sim D_x}[\eta(x)]$,
which can potentially be much smaller than $\eta$. Information-theoretically, it was shown in [MN06] that for $\eta$ bounded away from $1/2$, a number of samples polynomial in $d$ and $1/\epsilon$ suffices to achieve misclassification error $\mathrm{OPT} + \epsilon$, and that this is tight up to constants. More generally, if the target halfspace is replaced by an unknown Boolean function in a class of bounded VC dimension, a comparable number of samples suffices to achieve error $\mathrm{OPT} + \epsilon$. (We remark that previous works on algorithmic aspects of the Massart model stated a larger sample complexity; while this is correct, it follows from [MN06] that the larger bound is only necessary in part of the parameter regime.)
However, until recently, algorithmic results were only known under the assumption that the marginal distribution of the examples belongs to some known class, e.g., is uniform or log-concave [ABHU15, ABHZ16, ZLC17], or even more general classes [DKTZ20]. Under no assumption on the marginal distribution, [DGT19] was the first work to provide an efficient (improper) learning algorithm outputting a hypothesis $h$ (which is not a halfspace) such that $\Pr_{(x,y) \sim D}[h(x) \neq y] \leq \eta + \epsilon$, with time and sample complexities polynomial in $d$ and $1/\epsilon$. Building on this, [CKMY20] provided an efficient (proper) learning algorithm with the same error guarantee whose output is itself a halfspace. We remark that the sample complexity of both of the above works depends on the bit complexity of points in the support of $D_x$, although this is information-theoretically not necessary; this assumption was recently removed in [DKT21]. Further, the above works rely on an additional assumption; see [DKK21b] for a quasi-polynomial algorithmic result without it, but under a Gaussian marginal.
On the other hand, until very recently, no matching computational lower bounds were known, and it remained an open question whether it is possible to efficiently achieve error guarantees better than $\eta + \epsilon$, potentially going all the way down to $\mathrm{OPT} + \epsilon$. This question is especially intriguing since the above algorithmic results imply that non-trivial guarantees can be achieved in the Massart noise model, which is much more realistic than RCN. The question then becomes whether there are any computational limits at all in this model. As we will see, such limits do indeed exist, at least when restricting to the class of Statistical Query algorithms.
Statistical Query Algorithms and Known Lower Bounds.
Statistical Query (SQ) algorithms do not have access to actual samples from the (unknown) distribution; rather, they are allowed to query expectations of bounded functions over the underlying distribution, and these queries return the correct value up to some accuracy $\tau$. Since every such query can be simulated using samples from the distribution, this is a restriction of Valiant's PAC model. Note that a simple Chernoff bound shows that, in order to simulate a query of accuracy $\tau$, a number of samples on the order of $1/\tau^2$ is sufficient. Hence, SQ algorithms making $q$ queries of accuracy $\tau$ can be taken as a proxy for algorithms using roughly $q/\tau^2$ samples and running in time polynomial in $q$ and $1/\tau$. The SQ model was originally introduced by [Kea98]; see [Fel16] for a survey. Note that it has also found applications outside of PAC learning; see, e.g., [KLN11, FGV21] for examples.
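As an illustration of the simulation argument, the following sketch answers an SQ query by an empirical average over roughly $1/\tau^2$ samples. The oracle name and the toy distribution are ours; the Chernoff–Hoeffding bound guarantees accuracy $\tau$ with constant probability, and a logarithmic factor more samples boosts this to high probability.

```python
import random

def simulated_sq_oracle(query, sampler, tau, rng):
    """Estimate E[query(x, y)] for a [0, 1]-bounded query function within
    tolerance tau, using an empirical mean over O(1 / tau^2) samples."""
    n = int(4.0 / tau ** 2)
    return sum(query(*sampler(rng)) for _ in range(n)) / n

# Toy distribution: x uniform on {0, 1}, noiseless label y = x.
def sampler(rng):
    x = 1 if rng.random() < 0.5 else 0
    return x, x

rng = random.Random(1)
# Query the expected label; its true value is 0.5.
est = simulated_sq_oracle(lambda x, y: float(y), sampler, 0.05, rng)
```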
Intriguingly, [Kea98] shows that any concept class that is PAC learnable in the realizable case using an SQ algorithm can also be learned in the PAC model under Random Classification Noise. Further, almost all known learning algorithms are either SQ or SQ-implementable, except for those based on Gaussian elimination, e.g., algorithms for learning parities with noise [Kea98, BKW03]. One clear advantage of this framework is that it is possible to prove unconditional lower bounds. This proceeds via the so-called SQ dimension, first introduced in [BFJ94] and later refined in [FGR17, Fel17]. Although we will not use it explicitly, the lower bounds in this paper are also based on this parameter. See [DK21] and the references therein for more detail.
For learning halfspaces under Massart noise, [CKMY20] initiated the study of computational lower bounds. The authors proved that achieving error arbitrarily close to $\mathrm{OPT}$ requires super-polynomially many queries. While this shows that obtaining optimal error is hard, it does not rule out the possibility of an efficient (SQ) algorithm achieving a constant-factor approximation. More recently, [DK21] proved that achieving error better than a constant multiple of $\eta$ requires queries of super-polynomial accuracy or super-polynomially many queries. This holds even when $\eta$ is a constant and $\mathrm{OPT}$ goes to zero as the ambient dimension becomes large. This rules out any constant-factor approximation algorithm, and also rules out efficient algorithms achieving error close to $\mathrm{OPT}$. Further, for $\eta$ close to $1/2$, the authors show that achieving error better than $1/2 - c$, for some term $c$ that is constant in the dimension but depends on $\eta$ and goes to 0 as $\eta$ goes to $1/2$, also requires super-polynomial resources in the SQ framework. For the special case of Gaussian marginals, [DKK21b] shows that achieving error $\mathrm{OPT} + \epsilon$ requires queries of super-polynomial accuracy or super-polynomially many queries. However, as with [CKMY20], this result only applies to exact learning.
As can be seen, the best previously known lower bounds are a constant factor away from the best known algorithmic guarantees, but they do not match yet. In the present work, we close this gap by showing that the algorithmic guarantees are in fact tight, at least in the SQ framework. More precisely, we show that for arbitrary $\eta < 1/2$, any SQ algorithm that achieves error better than $\eta$ either requires a super-polynomial number of queries or requires queries of super-polynomial accuracy. Further, as in [DK21], the result holds even when $\mathrm{OPT}$ goes to zero as a function of the ambient dimension and $\eta$ is a constant arbitrarily close to $1/2$.
1.1 Results
The following theorem is our main result (see Theorem 4.1 for a more detailed version):
Theorem 1.2 (Informal version).
Let $d$ be sufficiently large and $\eta < 1/2$ be arbitrary. There exists no SQ algorithm that learns $d$-dimensional halfspaces in the Massart noise model to error better than $\eta$, unless it makes super-polynomially many queries or queries of super-polynomial accuracy.
This holds even if the optimal halfspace achieves error that vanishes with the dimension $d$, and even if we assume that all flipping probabilities are either $0$ or $\eta$.
Some remarks are in order:

As we mentioned earlier, this lower bound matches the guarantees that are achievable in polynomial time [DGT19, CKMY20]. Moreover, since these algorithms can be implemented in the SQ learning model, this completely characterizes the error guarantees that are efficiently achievable in the SQ framework for the class of halfspaces under Massart noise. Further, this also suggests that improving over this guarantee with efficient non-SQ algorithms might be hard.

For the special case of $\eta$ arbitrarily close to $1/2$, the result implies that handling Massart noise is as hard as the much more general agnostic model, again for the class of halfspaces and in the SQ model. Namely, it is hard to achieve error better than that of a random hypothesis. Note that even though $\eta$ close to $1/2$ means that there can be examples with (almost) completely random labels, the fact that $\mathrm{OPT}$ can be made to go to zero implies that only a vanishing fraction of examples are of this kind. We remark that Daniely gave a similar SQ lower bound for the agnostic model [Dan16].

The fact that hardness still holds even if for all $x$ the flipping probability $\eta(x)$ is either $0$ or $\eta$, and even if $\mathrm{OPT}$ is very small, implies that achieving error better than $\eta$
remains hard even if an overwhelming fraction of the samples have no noise in their labels. In light of the previous point, this implies that even if the overwhelming majority of the points have no noise but the labels of just very few are (almost) random, outputting a hypothesis which does better than randomly classifying the points is SQ-hard.

The case where $\mathrm{OPT}$ is extremely small is computationally easy. This follows since, with high probability, there is a subset of the observed samples in which no labels were flipped and which is sufficiently large to apply algorithms designed for the realizable case. Hence, for values of $\mathrm{OPT}$ only slightly smaller than allowed by Theorem 1.2, achieving optimal misclassification error is possible in polynomial time.
As a consequence of the above theorem, we immediately obtain strong hardness results for a more challenging noise model, namely the Tsybakov noise model [MT99, Tsy04], which is defined as follows: Fix the model parameters. Samples are generated as in the Massart model, but the flipping probabilities are not uniformly bounded by some constant; rather, they need to satisfy the following tail condition:
It is known that, information-theoretically, polynomially many samples suffice to learn halfspaces up to misclassification error $\mathrm{OPT} + \epsilon$ in this model [H14, Chapter 3]. On the other hand, algorithmic results are only known when restricting the marginal distribution to belong to a fixed class of distributions (e.g., log-concave or even more general [DKK21a]). Nevertheless, we claim that our hardness result about Massart noise implies that it is SQ-hard to achieve error even slightly better than trivial in the Tsybakov model. Indeed, consider the following choice of parameters: let $\eta$ be a constant bounded away from $1/2$, and
Then the Massart condition together with the condition that and implies the Tsybakov condition. To see this note that for we obtain that
and for we have
Hence, by Theorem 1.2 (or Theorem 4.1), achieving error better than this requires queries of accuracy better than the inverse of any polynomial, or at least super-polynomially many queries, even though the noise is small. Similarly, in the regime where the noise level approaches $1/2$, it is hard to achieve error better than trivial in the sense above. This stands in strong contrast to the fact that, information-theoretically, a polynomial number of samples suffices to achieve optimal misclassification error. That is, even if the fraction of large flipping probabilities decreases very fast as they approach $1/2$, learning in the model remains hard.
Lastly, we would like to mention that we closely follow the techniques developed in [DK21] (and previous works cited therein). At the heart of their work, one needs to design two distributions matching many moments of the standard Gaussian (and satisfying some additional properties). The main difference in our work lies in how exactly we construct these distributions, which eventually leads to the tight result.
2 Techniques
In this section, we outline the techniques used to prove Theorem 1.2. On a high level, we closely follow the approach of [DK21]. First, note that for suitable $d$, any degree-$k$ polynomial over $\mathbb{R}^n$ can be viewed as a linear function over $\mathbb{R}^d$ (we use an embedding whose component functions are the monomials of degree at most $k$). Hence, any lower bound against learning polynomial threshold functions (PTFs) over $\mathbb{R}^n$ would yield a lower bound against learning halfspaces over $\mathbb{R}^d$. Further, with an appropriate choice of $n$ and $k$ relative to $d$, an exponential lower bound against learning PTFs over $\mathbb{R}^n$ would yield a super-polynomial lower bound against learning halfspaces over $\mathbb{R}^d$.
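The embedding mentioned above can be sketched as follows: a point in $\mathbb{R}^n$ is mapped to the vector of all its monomials of degree at most $k$, so that every degree-$k$ polynomial becomes a linear function of the image. The function name below is ours.

```python
from itertools import combinations_with_replacement
from math import prod

def monomial_embedding(x, k):
    """Map x in R^n to the vector of all monomials of x of degree <= k.
    Any degree-k polynomial in x is then a linear function of this vector,
    so a PTF over R^n becomes an LTF over the embedded space."""
    feats = [1.0]  # the degree-0 monomial
    for deg in range(1, k + 1):
        for idx in combinations_with_replacement(range(len(x)), deg):
            feats.append(prod(x[i] for i in idx))
    return feats

# Example: p(x) = x0^2 - 2*x0*x1 + 3 as a linear function of the embedding.
phi = monomial_embedding([2.0, -1.0], 2)   # [1, x0, x1, x0^2, x0*x1, x1^2]
coeffs = [3.0, 0.0, 0.0, 1.0, -2.0, 0.0]
value = sum(c * f for c, f in zip(coeffs, phi))  # equals p(2, -1) = 11
```

The dimension of the embedded space is the number of monomials of degree at most $k$, which governs the parameter trade-off described above.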
One key step of the SQ hardness result in [DK21] is to construct two specific one-dimensional distributions and show that a mixture of their high-dimensional embeddings is SQ-hard to distinguish from a certain null distribution (the null distribution is the one where the example is a standard Gaussian and the label is independent of the example). The authors then argue that any algorithm that learns Massart PTFs up to sufficiently small error can be used to distinguish these distributions from the null distribution. We follow a similar proof strategy. The main difference lies in how we construct the two hard distributions (in a simpler way), which allows us to obtain the optimal lower bound of $\eta$. In fact, we will show that two simple modifications of the standard Gaussian distribution will work.
Both the distributions constructed in [DK21] and the ones that we construct have the following common structure: Let $v$ be a fixed but unknown unit vector, let $p \in (0, 1)$, and let $A$ and $B$ be two one-dimensional distributions. Define $D_{A,v}$ (respectively, $D_{B,v}$) as the distribution over $\mathbb{R}^d$ that is equal to $A$ (respectively, $B$) in the direction of $v$ and equal to a standard Gaussian on the orthogonal complement. Then, define the distribution $D$ over $\mathbb{R}^d \times \{\pm 1\}$ as follows: With probability $1 - p$, draw $x \sim D_{A,v}$ and return $(x, 1)$, and with probability $p$, draw $x \sim D_{B,v}$ and return $(x, -1)$. The goal is to output a hypothesis $h$ minimizing the misclassification error $\Pr_{(x,y) \sim D}[h(x) \neq y]$. It is easy to see that one of the constant functions $1$ or $-1$ achieves error $\min\{p, 1 - p\}$. The question is whether it is possible to achieve error better than this.
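A sampler for this common structure might look as follows; the function and parameter names, and the mixture weight `p`, are our notation for the construction just described.

```python
import random

def sample_hidden_direction(A_sampler, B_sampler, v, p, rng):
    """With probability 1-p, draw t from the one-dimensional distribution A
    and label +1; with probability p, draw t from B and label -1.  The point
    equals t along the hidden unit direction v and is standard Gaussian on
    the orthogonal complement."""
    if rng.random() < p:
        t, y = B_sampler(rng), -1
    else:
        t, y = A_sampler(rng), 1
    g = [rng.gauss(0.0, 1.0) for _ in range(len(v))]
    proj = sum(vi * gi for vi, gi in zip(v, g))
    # Replace the component of g along v by t, keeping the rest Gaussian.
    x = [gi + (t - proj) * vi for gi, vi in zip(g, v)]
    return x, y

rng = random.Random(2)
v = [1.0, 0.0, 0.0]
# Degenerate A and B (point masses at +5 and -5) make the structure visible.
x, y = sample_hidden_direction(lambda r: 5.0, lambda r: -5.0, v, 0.5, rng)
```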
Roughly speaking (this sweeps some details under the rug; see Section 4 for the full statements), the authors of [DK21] show the following hardness result: Suppose the first $m$ moments of $A$ and $B$ match those of a standard Gaussian up to a small additive error, and their chi-squared divergence with respect to the standard Gaussian is not too large. Then every SQ algorithm outputting a hypothesis achieving misclassification error slightly smaller than $\min\{p, 1-p\}$ must either make queries of very high accuracy or must make very many queries. Hence, if we can choose $p$ to be a small constant multiple of $\eta$, we get an exponential lower bound as desired. The authors then proceed to construct distributions satisfying the moment conditions and such that the resulting distribution $D$ corresponds to a Massart PTF. In this paper, we construct distributions satisfying the moment conditions with $p$ essentially equal to $\eta$. However, the chi-squared divergence will be too large to apply the hardness result of [DK21] in a black-box way. To remedy this, we show that its proof can be adapted to also work in this regime. Further, by choosing the parameters slightly differently, the reduction still works. In the following, we briefly describe our construction. We give a more detailed comparison with [DK21] in Section 2.1.
Let $\eta$ be the bound of the Massart model and fix the mixture weight $p$. We will show that we can choose $A$ and $B$ satisfying the moment conditions above in such a way that $D$ corresponds to a Massart PTF with noise bound $\eta$. Note that this will directly imply Theorem 1.2 via the previously outlined reduction. We partition $\mathbb{R}$ into three regions such that the following conditions hold:

for ,

for ,

for all .
Suppose that the relevant region can be written as the union of finitely many intervals, and hence there is a low-degree polynomial which is non-negative on it and non-positive on its complement. We claim that $D$ is a Massart PTF for the polynomial threshold function defined as
Let be the marginal distribution of on . Then this means that for all , such that it needs to hold that
Indeed, consider such that . Since and it follows that . On a high level, this is because none of the samples with label lie in this region. Similarly, the same holds for such that . Now consider such that and . Since and it follows
(2.1) 
Note that in our construction it will actually hold that for all . Hence, it even holds that for all .
Our work crucially departs from [DK21] in our choice of $A$ and $B$ satisfying Items 1 to 3 and the moment-matching condition. In fact, giving a very clean construction will turn out to be essential for achieving the tightest possible lower bound. On a high level, $A$ will be equal to an appropriate multiple of the standard Gaussian distribution on periodically spaced intervals of small size, and 0 otherwise. $B$ will be equal to $A$ for inputs of large magnitude; for smaller inputs, we will slightly displace the intervals.
Concretely, let the spacing be chosen appropriately and consider the corresponding infinite union of intervals.
Denote by $\varphi$ the pdf of a standard Gaussian distribution. We define the unnormalized measures $A$ and $B$ as restrictions of a multiple of $\varphi$ to these intervals, with $B$ using slightly displaced intervals near the origin.
Clearly, the total probability mass of the two is the same. It can be shown that it is close to one, so for the sake of this exposition assume that it is exactly one and that $A$ and $B$ are in fact probability distributions (see Section 4.1 for all details). Further, consider the induced labeled distribution $D$. It is not hard to verify that $A$ and $B$ together with this choice of regions satisfy Items 1 to 3. Hence, our final distribution will satisfy the Massart condition. Moreover, since labels are flipped only on a small part of the space, it follows that $\mathrm{OPT}$ is very small as well. (Note that here the relevant region is a union of infinitely many intervals; it is straightforward to adapt the previous discussion to this case.)
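As a small numerical illustration of the distribution $A$, the following evaluates the Gaussian density restricted to periodically spaced intervals and rescaled so that the total mass stays close to one. The spacing and interval half-width below are placeholder values; the paper's parameters are chosen to satisfy the moment-matching conditions.

```python
import math

def gaussian_pdf(t):
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

def density_A(t, spacing=0.5, half_width=0.05):
    """Unnormalized density: a constant multiple of the standard Gaussian
    pdf on intervals of the given half-width around integer multiples of
    the spacing, and 0 elsewhere.  The factor spacing / (2 * half_width)
    keeps the total mass approximately 1."""
    k = round(t / spacing)
    if abs(t - k * spacing) <= half_width:
        return gaussian_pdf(t) * spacing / (2.0 * half_width)
    return 0.0
```

The distribution $B$ would agree with this for inputs of large magnitude and use displaced intervals near the origin; the exact normalization is deferred to Section 4.1.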
The fact that the moments of $A$ match those of a standard Gaussian will follow from the fact that $A$ is obtained by only slightly modifying it. This part is similar to [DK21]. Note that $B$ is equal to $A$ for inputs of sufficiently large magnitude, and for smaller inputs it is obtained by slightly displacing the intervals of $A$. Hence, it will follow that its first moments match those of $A$ (and hence also those of a standard Gaussian) up to a small error. In Section 4, we will show that we can choose the parameters such that the first moments of $A$ and $B$ match those of a standard Gaussian up to an additive error small enough for our purposes.
2.1 Comparison with [DK21]
The key property that allowed us to achieve the sharp lower bound of $\eta$ was the exact relation between $A$ and $B$ on the overlap of their supports. Indeed, if we only had this relation up to some constant factor, the resulting distribution would no longer be Massart (cf. Eq. 2.1), and the only way to still make it so is to increase the mixture weight, which in turn degrades the resulting lower bound. More precisely, with only a constant-factor relation, the upper bound in Eq. 2.1 weakens accordingly, and basic manipulations show that the Massart condition then forces a larger mixture weight, which means that the lower bound that we get from the distinguishing problem is at best a constant factor smaller than $\eta$.
While our construction avoids this issue because we can ensure the exact relation (in fact, we will have equality of $A$ and $B$ on the overlap), it is unclear whether the same can be achieved using the construction of [DK21], or a slight modification of it. In their work, the supports of $A$ and $B$ also consist of unions of intervals, but these increase in size as we move away from the origin. The intervals of $A$ are disjoint from those of $B$ for inputs of small magnitude, but they start to overlap when the magnitude becomes large. On each interval the distribution is also a constant multiple of $\varphi$; however, their specific choice makes exact computations difficult, and the authors only show two-sided bounds in which the hidden constants can be smaller than 1. (We remark that the authors do not work with distributions directly but with unnormalized measures; normalizing them does not change the construction but makes the comparison easier.) We note, however, that the moment bounds the authors use for their distributions are very similar to the ones we use for our distribution $B$.
On a more technical level, we cannot directly apply the hardness result [DK21, Proposition 3.8] that the authors used. Suppose the first $m$ moments of $A$ and $B$ match those of a standard Gaussian up to a small additive error, and the chi-squared divergence of $A$ and $B$ with respect to the standard Gaussian is bounded.
Then this result says that every SQ algorithm achieving misclassification error better than trivial must either make queries of very high accuracy or must make very many queries. Since in our construction we need to choose the interval spacing sufficiently small to match many moments — which in turn increases the divergence — the divergence will be too large for the above. On the flip side, the proof of [DK21, Proposition 3.8] can readily be adapted (in fact, this is already implicit in the proof) so that the same conclusion also holds
under a relaxed divergence condition, at the cost of an arbitrarily small loss in the parameters, where the hidden constant is independent of the other quantities we define. It will turn out that we can choose the parameters so that the above yields the desired bounds. See Section 4 and Appendix A for an in-depth discussion.
3 Preliminaries
For two functions , we will write if . Similarly, we will write if .
All logarithms will be to the base .
We will use $\mathcal{N}(0,1)$ to denote the one-dimensional standard Gaussian distribution. We will denote its pdf by $\varphi$, and with a slight abuse of notation we will also refer to a standard Gaussian random variable by the same symbol. For two probability distributions $P$ and $Q$, we denote their chi-squared divergence by $\chi^2(P, Q)$.
For an unnormalized positive measure, we denote its total measure by the corresponding norm.
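For a finite support, the chi-squared divergence used above can be computed directly; a minimal sketch (the helper name is ours):

```python
def chi_squared_divergence(p, q):
    """Chi-squared divergence of p from q over a common finite support:
    sum of (p_i - q_i)^2 / q_i, which equals (sum of p_i^2 / q_i) - 1 for
    probability vectors.  Requires q_i > 0 wherever p_i > 0."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

same = chi_squared_divergence([0.5, 0.5], [0.5, 0.5])  # identical: 0.0
far = chi_squared_divergence([1.0, 0.0], [0.5, 0.5])   # point mass vs uniform
```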
4 Hardness Result
In this section, we will prove the full version of Theorem 1.2. Concretely, we will show that
Theorem 4.1.
Let and be such that is at least a sufficiently large constant. There exists a parameter for which there is no SQ algorithm that learns the class of halfspaces on with Massart noise using at most queries of accuracy and which achieves misclassification error that is better than . This holds even if the optimal halfspace has misclassification error that is as small as and all flipping probabilities are either 0 or .
Note that this choice of parameters satisfies the assumption of the theorem, and we recover Theorem 1.2. As previously mentioned, the setting is the same as in [DK21], except that we achieve a lower bound of $\eta$.
We will prove Theorem 4.1 by reducing it to the following classification problem, which was introduced in [DK21], and then applying a lower bound that was proved in the same reference.
Definition 4.2 (Hidden Direction Classification Problem).
Let $A$ and $B$ be two probability distributions over $\mathbb{R}$, let $p \in (0, 1)$, and let $v$
be a unit vector in
$\mathbb{R}^d$. Let $D_{A,v}$ (respectively $D_{B,v}$) be the distribution that is equal to $A$ (respectively $B$) in the direction of $v$ and equal to a standard Gaussian in its orthogonal complement. Consider the distribution $D$ on $\mathbb{R}^d \times \{\pm 1\}$ defined as follows: With probability $1 - p$, draw $x \sim D_{A,v}$ and output $(x, 1)$, and with probability $p$, draw $x \sim D_{B,v}$ and return $(x, -1)$. The Hidden Direction Classification Problem is the following: Given sample access to $D$ for a fixed but unknown $v$, output a hypothesis (approximately) minimizing the misclassification error. A misclassification error of $\min\{p, 1-p\}$ can trivially be achieved by one of the constant functions $1$ or $-1$. The following lemma shows that, in the SQ framework, one cannot do better if the distributions $A$ and $B$ (approximately) match many moments of the standard Gaussian distribution. Its proof is analogous to that of Proposition 3.8 in [DK21]. We will give a more detailed discussion in Appendix A.
Lemma 4.3 (Adaptation of Proposition 3.8 in [DK21]).
Let and . Let be probability distributions on such that their first moments agree with the first moments of up to error at most and such that and are finite. Denote and assume that . Then, any SQ algorithm which, given access to for a fixed but unknown , outputs a hypothesis such that
must either make queries of accuracy better than or make at least queries.
The goal is now to find distributions satisfying the conditions of Lemma 4.3 and such that the distribution $D$ corresponds to the Massart noise model. To this end, consider the distributions and unions of intervals given by the following proposition, which we will prove in Section 4.1.
Proposition 4.4.
Let and let be an integer. Define and let . If , there exist probability distributions on and two unions of intervals such that

and ,

(a) on , on , and (b) for all we have ,

for all the first moments of and match those of a standard Gaussian within additive error ,

at most a fraction of the measure (respectively ) lies outside (respectively ) ,

and .
Although $D$ will not correspond to a Massart distribution when considering only halfspaces, it will turn out to work when considering polynomial threshold functions, i.e., functions of the form $\mathrm{sign}(p(x))$ for some polynomial $p$. Further, we will be able to choose the parameters such that Lemma 4.3 will correspond to a super-polynomial lower bound in terms of the dimension.
Unless explicitly indicated by a subscript, in what follows the asymptotic notation will only hide universal constants independent of the ones we define throughout the section. Fix a unit vector $v$ and let the parameters be such that
for a sufficiently large constant . Further, let
for a sufficiently large constant , so that
We would like to find and such that we can represent degree polynomials over as halfspaces over . It is sufficient to have
To this end, for and sufficiently large constants, consider
and
Notice that since
it follows that
Hence,
where the last inequality follows since and by choosing to be large enough with respect to and . Let
Further, let
and consider the probability distributions defined by Proposition 4.4 for our settings of and . Also, let be the corresponding unions of intervals. Let
so that . As we will shortly see, this is a Massart polynomial threshold function. In order to obtain a Massart halfspace, we will embed it into the higher-dimensional space.
Let
and define
where is a multiindex and . Furthermore, let
be the linear embedding of into that is obtained by appending zeros. We will embed into using the embedding defined as
The hard distribution is as follows: Draw and return . The next lemma shows that this distribution satisfies the Massart property with respect to the class of halfspaces.
Lemma 4.5.
The probability distribution is a Massart halfspace with .
Proof.
Let and consider the function such that
Since is a union of intervals, can be written as , where for some degree polynomial . Now since , there is a linear function such that for all it holds that . This in turn implies that there is a linear function such that for all we have .
Note that only if for some . Furthermore,

For satisfying , we have with probability 1.

For satisfying and , we have , and
Hence, this corresponds to a Massart distribution for the halfspace . Furthermore, the flipping probability function satisfies
Therefore, for all and
The second-to-last inequality follows from Item 2 and Item 4 of Proposition 4.4. Indeed, we have
∎
Second, any hypothesis for predicting from can be turned into one predicting from and vice versa. Hence, it is enough to show that the former is SQ-hard. Consider the setting of Lemma 4.3 with and given by Proposition 4.4. First, by Item 5 we know that and . Hence,
Further, let
By Proposition 4.4 we know that the first moments of and match those of a standard Gaussian up to additive error
where the equality follows from the fact that .
We claim that by choosing large enough we get . Indeed, since and , we have
Hence, by choosing large enough, we get
It follows that both and match the moments of a standard Gaussian up to additive error at most . This, in addition to the fact that , imply that the parameter in Lemma 4.3 is equal to
By choosing and recalling that , we get
which implies that
for sufficiently large (and hence sufficiently small ). Therefore, we can choose the parameter in Lemma 4.3 to be equal to .
Next, we claim that the parameter
of Lemma 4.3 is at least . In fact, recalling that , , and , we obtain
By choosing the constant in the definition of large enough, we conclude that .
Hence, by Lemma 4.3 any SQ algorithm that outputs a hypothesis such that
must either make queries of accuracy better than or make at least queries. Since
Theorem 4.1 follows.
4.1 Hard Distributions
In this section, we will construct the onedimensional momentmatching distributions. Concretely, we will show the following proposition:
Proposition 4.6 (Restatement of Proposition 4.4).
Let and let be an integer. Define and let . If , there exist probability distributions on and two unions of intervals such that

and ,

(a) on , on , and (b) for all we have ,

for all the first moments of and match those of a standard Gaussian within additive error ,

at most a fraction of the measure (respectively ) lies outside (respectively ) ,

and .
Our construction will be based on the measure of density
where is the standard Gaussian measure (and by abuse of notation, its density). Let
be the probability distribution obtained by normalizing the measure . We define
and
Since , Item 1 and Item 2 of Proposition 4.4 clearly hold.
In order to show Item 4, we bound the measure of outside by . Indeed, it then follows that
In order to do so, we will upper bound the measure of outside and will lower bound the total measure . We have:
where we used the fact that is decreasing for and that . As for , we have
(4.1)  
where we also used the fact that is decreasing for , and that . Now since , we deduce that
Next, we will bound the chi-squared divergence.
Lemma 4.7.
Let be defined as above, then and .
Proof.
For we get:
The term corresponding to the first sum is less than or equal to , and for the second sum we notice that
implying that
Putting everything together yields the claim. ∎
Lastly, we show that the moments match up to the desired error. We start with .
Lemma 4.8.
Let . For the distribution defined as above and all it holds that