Agnostic Learning of Monomials by Halfspaces is Hard

12/03/2010 · by Vitaly Feldman, et al. · IBM · Georgia Institute of Technology · Harvard University · Carnegie Mellon University

We prove the following strong hardness result for learning: Given a distribution of labeled examples from the hypercube such that there exists a monomial consistent with a $(1-\epsilon)$ fraction of the examples, it is NP-hard to find a halfspace that is correct on a $(1/2+\epsilon)$ fraction of the examples, for arbitrary constant $\epsilon > 0$. In learning theory terms, weak agnostic learning of monomials is hard, even if one is allowed to output a hypothesis from the much bigger concept class of halfspaces. This hardness result subsumes a long line of previous results, including two recent hardness results for the proper learning of monomials and halfspaces. As an immediate corollary of our result we show that weak agnostic learning of decision lists is NP-hard. Our techniques are quite different from previous hardness proofs for learning. We define distributions on positive and negative examples for monomials whose first few moments match. We use the invariance principle to argue that regular halfspaces (all of whose coefficients have small absolute value relative to the total $\ell_2$ norm) cannot distinguish between distributions whose first few moments match. For highly non-regular halfspaces, we use a structural lemma from recent work on fooling halfspaces to argue that they are "junta-like" and one can zero out all but the top few coefficients without affecting the performance of the halfspace. The top few coefficients form the natural list decoding of a halfspace in the context of dictatorship tests/Label Cover reductions. We note that unlike previous invariance principle based proofs, which are only known to give Unique-Games hardness, we are able to reduce from a version of the Label Cover problem that is known to be NP-hard. This has inspired follow-up work on bypassing the Unique Games conjecture in some optimal geometric inapproximability results.


1 Introduction

Boolean conjunctions (or monomials), decision lists, and halfspaces are among the most basic concept classes in learning theory. They have long been known to be efficiently PAC learnable when the given examples are guaranteed to be consistent with a function from any of these concept classes [44, 7, 41]. However, in practice data is often noisy or too complex to be consistently explained by a simple concept. A common practical approach to such problems is to find a predictor in a certain space of hypotheses that best fits the given examples. A general model for learning that addresses this scenario is the agnostic learning model [22, 27]. An agnostic learning algorithm for a class of functions $\mathcal{C}$ using a hypothesis space $\mathcal{H}$ is required to perform the following task: given examples drawn from some unknown distribution, the algorithm must find a hypothesis in $\mathcal{H}$ that classifies the examples nearly as well as is possible by a hypothesis from $\mathcal{C}$. The algorithm is said to be a proper learning algorithm if $\mathcal{H} = \mathcal{C}$.

In this work we address the complexity of agnostic learning of monomials by algorithms that output a halfspace as a hypothesis. Learning methods that output a halfspace, such as Perceptron [42], Winnow [36], and Support Vector Machines [45], as well as most boosting algorithms, are well-studied in theory and widely used in practical prediction systems. These classifiers are often applied to labeled data sets which are not linearly separable. Hence it is of great interest to determine the classes of problems that can be solved by such methods in the agnostic setting. In this work we demonstrate a strong negative result on agnostic learning by halfspaces. We prove that non-trivial agnostic learning of even the relatively simple class of monomials by halfspaces is an NP-hard problem.

Theorem 1.1.

For any constant $\epsilon > 0$, it is NP-hard to find a halfspace that correctly labels a $(1/2+\epsilon)$-fraction of a given set of examples over $\{0,1\}^n$, even when there exists a monomial that agrees with a $(1-\epsilon)$-fraction of the examples.

Note that this hardness result is essentially optimal, since it is trivial to find a hypothesis with agreement rate $1/2$: output either the function that is always $0$ or the function that is always $1$, whichever agrees with more of the examples. Also note that Theorem 1.1 measures agreement of a halfspace and a monomial with the given set of examples, rather than the probability of agreement with an example drawn randomly from an unknown distribution. Uniform convergence results based on the VC dimension imply that these settings are essentially equivalent (see for example [22, 27]).

The class of monomials is a subset of the class of decision lists, which in turn is a subset of the class of halfspaces. Therefore our result immediately implies an optimal hardness result for proper agnostic learning of decision lists.

Previous work

Before describing the details of the prior body of work on hardness results for learning, we note that our result subsumes all these results with just one exception (the hardness of learning monomials by $t$-CNFs [34]). This is because we obtain the optimal inapproximability factor and allow learning of monomials by the much richer class of halfspaces.

The results of this paper are noteworthy in the broader context of hardness of approximation. Previously, hardness proofs based on the invariance principle were only known to give Unique-Games hardness. In this work, we are able to harness invariance principles to show an NP-hardness result by working with a version of Label Cover whose projection functions are only required to be unique-on-average. This could be one potential approach to revisit the many strong inapproximability results conditioned on the Unique Games conjecture (UGC), with an eye towards bypassing the UGC assumption. Such a goal was achieved for some geometric problems recently [21]; see Section 2.3.

Agnostic learning of monomials, decision lists and halfspaces has been studied in a number of previous works. Proper agnostic learning of a class of functions $\mathcal{C}$ is equivalent to the ability to come up with a function in $\mathcal{C}$ which has the optimal agreement rate with the given set of examples, and is also referred to as the Maximum Agreement problem for the class of functions $\mathcal{C}$.

The Maximum Agreement problem for halfspaces is equivalent to the so-called Hemisphere problem and has long been known to be NP-complete [24, 17]. Amaldi and Kann [1] showed that Maximum Agreement for halfspaces is NP-hard to approximate within a factor of 262/261. This was later improved by Ben-David et al. [5] and Bshouty and Burroughs [9] to approximation factors of 418/415 and 85/84, respectively. An optimal inapproximability result was established independently by Guruswami and Raghavendra [20] and Feldman et al. [15], showing NP-hardness of approximating the Maximum Agreement problem for halfspaces within a factor of $2-\epsilon$ for every constant $\epsilon > 0$. The reduction in [15] requires examples with real-valued coordinates, whereas the proof in [20] also works for examples drawn from the Boolean hypercube.

The Maximum Agreement problem for monotone monomials was shown to be NP-hard by Angluin and Laird [2], and NP-hardness for general monomials was shown by Kearns and Li [28]. NP-hardness of approximating the maximum agreement within a factor of 770/767 was shown by Ben-David et al. [5]. The factor was subsequently improved to 59/58 by Bshouty and Burroughs [9]. Finally, Feldman et al. [14, 15] showed a tight inapproximability result, namely that it is NP-hard to distinguish between instances where a $(1-\epsilon)$-fraction of the labeled examples are consistent with some monomial and instances where every monomial is consistent with at most a $(1/2+\epsilon)$-fraction of the examples. Recently, Khot and Saket [34] proved a similar hardness result even when a $t$-CNF is allowed as the output hypothesis, for an arbitrary constant $t$ (a $t$-CNF is a conjunction of clauses, each of which has at most $t$ literals; a monomial is thus a $1$-CNF).

For the concept class of decision lists, APX-hardness (that is, hardness of approximation within some constant factor) of the Maximum Agreement problem was shown by Bshouty and Burroughs [9]. As mentioned above, our result subsumes all these results with the exception of [34].

A number of hardness of approximation results are also known for the complementary problem of minimizing disagreement for each of the above concept classes [27, 23, 3, 8, 14, 15]. Another well-known piece of evidence for the hardness of agnostic learning of monomials is that even non-proper agnostic learning of monomials would give an algorithm for learning DNF, a major open problem in learning theory [35]. Further, Kalai et al. [25] proved that even agnostic learning of halfspaces with respect to the uniform distribution implies learning of parities with random classification noise, a long-standing open problem in learning theory and coding theory.

Monomials, decision lists and halfspaces are known to be efficiently learnable in the presence of the more benign random classification noise [2, 26, 29, 10, 6, 12]. Simple online algorithms like Perceptron and Winnow learn halfspaces when the examples can be separated with a significant margin (as is the case if the examples are consistent with a monomial), and are known to be robust to a very mild amount of adversarial noise [16, 4, 18]. Our result implies that these positive results will not hold when the adversarial noise rate is $\epsilon$ for any constant $\epsilon > 0$.

Kalai et al. [25] gave the first non-trivial algorithm for agnostic learning of monomials, running in time $2^{\tilde{O}(\sqrt{n})}$. They also gave a breakthrough result for agnostic learning of halfspaces with respect to the uniform distribution on the hypercube up to any constant accuracy (and analogous results for a number of other settings). Their algorithms output linear thresholds of parities as hypotheses. In contrast, our hardness result is for algorithms that output a halfspace (which is a linear threshold of single variables).

Organization of the paper:

We sketch the idea of our proof in Section 2. We introduce some probabilistic and analytic tools in Section 3. In Section 4 we define the dictatorship test, which is an important gadget for the hardness reduction. For the purpose of illustration, we also show why this dictatorship test already suffices to prove Theorem 1.1 assuming the Unique Games Conjecture [30]. In Section 5, we describe a reduction from a variant of the Label Cover problem to prove Theorem 1.1 under the assumption that $\mathrm{P} \neq \mathrm{NP}$.

Notation:

We use $0$ to encode “False” and $1$ to encode “True”. For a predicate $P$, we denote by $\mathbb{1}[P]$ the indicator function of whether $P$ holds; i.e., $\mathbb{1}[P] = 1$ when $P$ is true and $\mathbb{1}[P] = 0$ when $P$ is false.

For $w = (w_1, \dots, w_n) \in \mathbb{R}^n$, $\theta \in \mathbb{R}$, and $x \in \{0,1\}^n$, a halfspace is a Boolean function of the form $h(x) = \mathbb{1}\left[\sum_{i=1}^{n} w_i x_i \geq \theta\right]$; a monomial (conjunction) is a function of the form $\bigwedge_{i \in S} \ell_i$, where $S \subseteq [n]$ and $\ell_i$ is a literal of $x_i$, which can represent either $x_i$ or its negation; a disjunction is a function of the form $\bigvee_{i \in S} \ell_i$. One special case of a monomial is the function $f(x) = x_i$ for some $i \in [n]$, also referred to as the $i$-th dictator function.
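To fix ideas, the following minimal Python sketch implements these three function classes (the function names and encodings are ours, for illustration only):

```python
from typing import Sequence

def halfspace(w: Sequence[float], theta: float, x: Sequence[int]) -> int:
    """Indicator 1[sum_i w_i * x_i >= theta] for x in {0,1}^n."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

def monomial(pos: set, neg: set, x: Sequence[int]) -> int:
    """Conjunction of literals: x_i for i in pos, NOT x_i for i in neg."""
    return 1 if all(x[i] == 1 for i in pos) and all(x[i] == 0 for i in neg) else 0

def disjunction(pos: set, neg: set, x: Sequence[int]) -> int:
    """Disjunction of the same kinds of literals."""
    return 1 if any(x[i] == 1 for i in pos) or any(x[i] == 0 for i in neg) else 0

# Every monomial is itself a halfspace: the AND of x_1,...,x_k equals
# 1[x_1 + ... + x_k >= k], and the i-th dictator function is
# monomial({i}, set(), x) == x[i].
```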

2 Proof Overview

We prove Theorem 1.1 by exhibiting a reduction from the $t$-Label Cover problem, which is a particular variant of the Label Cover problem. The $t$-Label Cover problem is defined as follows:

Definition 2.1.

For positive integers $t$, $R$ and $L$ with $R \geq L$, an instance $\mathcal{L}$ of $t$-Label Cover consists of a $t$-uniform connected (multi-)hypergraph with vertex set $V$ and an edge multiset $E$, together with label sets $[R]$ and $[L]$ and a set of projection functions. Every hyperedge $e = (v_1, \dots, v_t) \in E$ is associated with a $t$-tuple of projection functions $(\pi^e_1, \dots, \pi^e_t)$ where $\pi^e_i : [R] \to [L]$.

A vertex labeling is an assignment of labels to vertices, $A : V \to [R]$. A labeling $A$ is said to strongly satisfy an edge $e = (v_1, \dots, v_t)$ if $\pi^e_i(A(v_i)) = \pi^e_j(A(v_j))$ for every $i, j \in [t]$. A labeling $A$ weakly satisfies edge $e$ if $\pi^e_i(A(v_i)) = \pi^e_j(A(v_j))$ for some $i \neq j$.

The goal in Label Cover is to find a vertex labeling that satisfies as many edges (projection constraints) as possible.
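Operationally, strong and weak satisfaction of a single hyperedge can be checked as in the sketch below (representing the labeling as a dictionary and the projections as callables is our assumption, not the paper's notation):

```python
def strongly_satisfies(labeling, edge, projections):
    """edge: tuple of vertices (v_1,...,v_t); projections[i] maps the
    label of v_i into the common range. Strong satisfaction requires
    that all projected labels agree."""
    projected = [projections[i](labeling[v]) for i, v in enumerate(edge)]
    return all(p == projected[0] for p in projected)

def weakly_satisfies(labeling, edge, projections):
    """Weak satisfaction: at least two distinct endpoints agree
    after projection."""
    projected = [projections[i](labeling[v]) for i, v in enumerate(edge)]
    return any(projected[i] == projected[j]
               for i in range(len(projected))
               for j in range(i + 1, len(projected)))
```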

2.1 Hardness assuming the Unique Games conjecture

For the sake of clarity, we first sketch the proof of Theorem 1.1 with a reduction from the $t$-Unique Label Cover problem, which is a special case of $t$-Label Cover where $R = L$ and all the projection functions are bijections. The following inapproximability result [33] for $t$-Unique Label Cover is equivalent to the Unique Games Conjecture of Khot [30].

Conjecture 2.2.

For every constant $\gamma > 0$ and every positive integer $t$, there exists an integer $R_0$ such that for all positive integers $R \geq R_0$, given an instance $\mathcal{L}$ of $t$-Unique Label Cover with label set $[R]$ it is NP-hard to distinguish between,

  • strongly satisfiable instances: there exists a labeling $A$ that strongly satisfies a $(1-\gamma)$ fraction of the edges in $E$.

  • almost unsatisfiable instances: there is no labeling that weakly satisfies a $\gamma$ fraction of the edges.

Given an instance $\mathcal{L}$ of $t$-Unique Label Cover, we will produce a distribution $\mathcal{D}$ over labeled examples such that the following holds: if $\mathcal{L}$ is a strongly satisfiable instance, then there is a disjunction that agrees with the label on a randomly chosen example with probability at least $1-\epsilon$, while if $\mathcal{L}$ is an almost unsatisfiable instance then no halfspace agrees with the label on a random example from $\mathcal{D}$ with probability more than $1/2+\epsilon$. Clearly, such a reduction implies Theorem 1.1 assuming the Unique Games Conjecture, but with disjunctions in place of conjunctions. De Morgan’s law and the fact that the negation of a halfspace is a halfspace then imply that the statement is also true for monomials (we use disjunctions only for convenience).

Let $\mathcal{L}$ be an instance of $t$-Unique Label Cover on hypergraph $(V, E)$ with a set of labels $[R]$. The examples we generate will have $|V| \cdot R$ coordinates, i.e., belong to $\{0,1\}^{V \times [R]}$. These coordinates are to be thought of as one block of $R$ coordinates for every vertex $v \in V$. We will index the coordinates of an example $x$ as $x_{(v,r)}$ for $v \in V$ and $r \in [R]$.

For every labeling $A$ of the instance, there is a corresponding disjunction over $\{0,1\}^{V \times [R]}$ given by,

$$f_A(x) = \bigvee_{v \in V} x_{(v, A(v))}.$$

Thus, using a label $A(v) = r$ for a vertex $v$ is encoded as including the literal $x_{(v,r)}$ in the disjunction. Notice that an arbitrary halfspace over $\{0,1\}^{V \times [R]}$ need not correspond to any labeling at all. The idea would be to construct a distribution on examples which ensures that any halfspace agreeing with at least a $(1/2+\epsilon)$ fraction of random examples somehow corresponds to a labeling of $\mathcal{L}$ weakly satisfying a constant fraction of the edges in $E$.
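A sketch of this encoding, representing examples as dictionaries keyed by (vertex, label) pairs (a representation chosen here for illustration):

```python
def labeling_to_disjunction(labeling: dict):
    """Return the disjunction f_A(x) = OR over v of x_(v, A(v)), as a
    predicate on examples given as dicts keyed by (vertex, label)."""
    literals = {(v, r) for v, r in labeling.items()}
    def f_A(x: dict) -> int:
        return 1 if any(x[vr] == 1 for vr in literals) else 0
    return f_A
```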

Fix an edge $e = (v_1, \dots, v_t)$. For the sake of exposition, let us assume $\pi^e_i$ is the identity permutation for every $i \in [t]$. The general case is not any more complicated.

For the edge $e$, we will construct a distribution $\mathcal{D}_e$ on examples with the following properties:

  • All coordinates $x_{(w,r)}$ for a vertex $w \notin \{v_1, \dots, v_t\}$ are fixed to be zero. Restricted to these examples, the halfspace can be written as $h(x) = \mathbb{1}\big[\sum_{i \in [t]} \sum_{r \in [R]} w_{(v_i,r)}\, x_{(v_i,r)} \geq \theta\big]$.

  • For any label $r \in [R]$, the labeling $A(v_1) = \cdots = A(v_t) = r$ strongly satisfies the edge $e$. Hence, the corresponding disjunction needs to have agreement at least $1-\epsilon$ with the examples from $\mathcal{D}_e$.

  • There exists a decoding procedure that, given a halfspace $h$, outputs a labeling for $v_1, \dots, v_t$ such that, if $h$ has agreement at least $1/2+\epsilon$ with the examples from $\mathcal{D}_e$, then the labeling weakly satisfies the edge $e$ with non-negligible probability.

For conceptual clarity, let us rephrase the above requirement as a testing problem. Given a halfspace $h$, consider a randomized procedure that samples a labeled example $(x, b)$ from the distribution $\mathcal{D}_e$, and accepts if $h(x) = b$. This amounts to a test that checks if the function $h$ corresponds to a consistent labeling. Further, let us suppose the halfspace is given by $h(x) = \mathbb{1}\big[\sum_{i \in [t]} \sum_{r \in [R]} w_{(v_i,r)}\, x_{(v_i,r)} \geq \theta\big]$. Define the linear function $\ell_i : \{0,1\}^R \to \mathbb{R}$ as $\ell_i(y) = \sum_{r \in [R]} w_{(v_i,r)}\, y_r$. Then, writing $x^{(i)}$ for the block of coordinates of $x$ corresponding to $v_i$, we have $h(x) = \mathbb{1}\big[\sum_{i \in [t]} \ell_i(x^{(i)}) \geq \theta\big]$.

For a halfspace corresponding to a labeling $A$, we will have $\ell_i(y) = y_{A(v_i)}$, a dictator function. Thus, in the intended solution every linear function associated with the halfspace is a dictator function.

Now, let us again restate the above testing problem in terms of these linear functions. For succinctness, we write $\ell_i$ for the linear function associated with the vertex $v_i$. We need a randomized procedure that does the following:

Given linear functions $\ell_1, \dots, \ell_t$, it queries the functions at one point each (say $x^{(1)}, \dots, x^{(t)}$ respectively), and accepts if the value $\mathbb{1}\big[\sum_{i \in [t]} \ell_i(x^{(i)}) \geq \theta\big]$ matches the label of the sampled example.

The procedure must satisfy,

  • (Completeness) If each of the linear functions $\ell_1, \dots, \ell_t$ is the $r$’th dictator function for some $r \in [R]$, then the test accepts with probability at least $1-\epsilon$.

  • (Soundness) If the test accepts with probability at least $1/2+\epsilon$, then at least two of the linear functions are close to the same dictator function.

A testing problem of the above nature is referred to as dictatorship testing, and is a recurring theme in hardness of approximation.

Notice that the notion of a linear function being close to a dictator function is not formally defined yet. In most applications, a function is said to be close to a dictator if it has influential coordinates. It is easy to see that this notion is not sufficient by itself here. For example, consider a linear function of the form $\ell(x) = \epsilon x_0 + x_1 + \cdots + x_n$ with a threshold $\theta$ that the partial sum $x_1 + \cdots + x_n$ frequently comes close to: although the coordinate $x_0$ has little influence on the value of the linear function, it has significant influence on the halfspace $\mathbb{1}[\ell(x) \geq \theta]$.

We resolve this problem by using the notion of the critical index (Definition 3.1) that was introduced in [43] and has found numerous applications in the analysis of halfspaces [37, 40, 13]. Roughly speaking, given a linear function $\ell$, the idea is to recursively delete its influential coordinates until there are none left. The total number of coordinates so deleted is referred to as the critical index of $\ell$. Let $c(\ell)$ denote the critical index of $\ell$, and let $H(\ell)$ denote the set of the $c(\ell)$ largest coordinates of $\ell$ in absolute value. The linear function $\ell$ is said to be close to the $j$’th dictator function for every $j$ in $H(\ell)$. A function is far from every dictator if it has critical index $0$, i.e., no influential coordinate to delete.

An important issue is that the critical index of a linear function can be much larger than the number of influential coordinates, and cannot be appropriately bounded. In other words, a linear function can be close to a large number of dictator functions as per the definition above. To counter this, we employ a structural lemma about halfspaces that was used in the recent work on fooling halfspaces with limited independence [13]. Using this lemma, we are able to prove that if the critical index is large, then one can in fact zero out the coordinates of $\ell$ outside its $K$ largest coordinates, for some large enough constant $K$, and the agreement of the halfspace only changes by a negligible amount! Thus, we first carry out this zeroing operation for all linear functions with large critical index.
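In code, the zeroing operation itself is simple; a sketch (the name `zero_out_tail` is ours) that keeps only the $K$ largest-magnitude coefficients:

```python
def zero_out_tail(w, K):
    """Keep the K largest coefficients of w by absolute value; zero the
    rest. The structural lemma asserts that for halfspaces whose weight
    vector has large critical index, this truncation changes the
    acceptance probability only negligibly."""
    top = sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)[:K]
    keep = set(top)
    return [wi if i in keep else 0.0 for i, wi in enumerate(w)]
```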

We now describe the above construction and analysis of the dictatorship test in some more detail. It is convenient to think of the $t$ queries $x^{(1)}, \dots, x^{(t)}$ as the rows of a $t \times R$ matrix with $\{0,1\}$ entries. Henceforth, we will refer to such matrices and their rows and columns.

We construct two distributions $\mathcal{D}_0, \mathcal{D}_1$ on $\{0,1\}^t$ such that for $z$ sampled from $\mathcal{D}_b$, we have $z_1 \vee z_2 \vee \cdots \vee z_t = b$ with high probability, for $b \in \{0,1\}$ (this will ensure the completeness of the reduction, i.e., certain disjunctions pass with high probability). Further, the distributions will be carefully chosen to have matching first four moments. This will be used in the soundness analysis, where we will use an invariance principle to infer structural properties of halfspaces that pass the test with probability noticeably greater than $1/2$.

We define the distribution $\mathcal{D}_b^{(R)}$ on $t \times R$ matrices by sampling the $R$ columns independently according to $\mathcal{D}_b$, and then perturbing each bit with a small probability $\rho$. We define the following test (or equivalently, distribution on examples): given a halfspace $h$ on $\{0,1\}^{t \times R}$, with probability $1/2$ we check $h(x) = 1$ for a sample $x$ from $\mathcal{D}_1^{(R)}$, and with probability $1/2$ we check $h(x) = 0$ for a sample $x$ from $\mathcal{D}_0^{(R)}$.
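A sketch of one round of this test, with the column distributions passed in as sampler functions and `rho` standing in for the small perturbation probability (all names are ours):

```python
import random

def run_test(h, sample_D0, sample_D1, R, rho=0.01):
    """One round of the test. h maps a t x R 0/1 matrix to {0,1};
    sample_Db() returns one column in {0,1}^t drawn from D_b."""
    b = random.randint(0, 1)
    sample = sample_D1 if b == 1 else sample_D0
    cols = [sample() for _ in range(R)]
    # Perturb each bit independently with small probability rho.
    cols = [[bit ^ (random.random() < rho) for bit in col] for col in cols]
    x = [list(row) for row in zip(*cols)]   # transpose: x[i][r] = row i, col r
    return h(x) == b                        # accept iff h reproduces the label
```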

Completeness: By construction, each of the $R$ disjunctions $f_r(x) = \bigvee_{i \in [t]} x_{(i,r)}$ for $r \in [R]$ passes the test with probability at least $1-\epsilon$ (here $x_{(i,r)}$ denotes the entry in the $i$’th row and $r$’th column of $x$).

Soundness: For the soundness analysis, suppose $h(x) = \mathbb{1}\big[\sum_{i,r} w_{(i,r)}\, x_{(i,r)} \geq \theta\big]$ is a halfspace that passes the test with probability at least $1/2+\epsilon$. The halfspace can be written in two ways by expanding the inner product along rows and columns, i.e.,

$$\sum_{i \in [t]} \sum_{r \in [R]} w_{(i,r)}\, x_{(i,r)} \;=\; \sum_{i \in [t]} \ell_i(x^{(i)}) \;=\; \sum_{r \in [R]} c_r(x_{(r)}),$$

where $\ell_i$ is the linear function determined by the $i$’th row of coefficients and $c_r$ is the linear function determined by the $r$’th column. Let us denote by $w^{(i)}$ the vector of coefficients of $\ell_i$.

First, let us see why each of the linear functions $\ell_i$ must be close to some dictator. Note that eventually we need to show that two of the linear functions are close to the same dictator.

Suppose each of the linear functions is not close to any dictator. In other words, for each $i \in [t]$, no single coordinate of the vector $w^{(i)}$ is too large (i.e., contains more than a $\tau$-fraction of the $\ell_2$ mass of the vector $w^{(i)}$). Clearly, this implies that no single column of the matrix of coefficients is too large.

Recall that the halfspace is given by $h(x) = \mathbb{1}\big[\sum_{r \in [R]} c_r(x_{(r)}) \geq \theta\big]$. Here $\sum_r c_r$ is a degree-$1$ polynomial into which we are substituting values from the two product distributions $\mathcal{D}_0^{(R)}$ and $\mathcal{D}_1^{(R)}$. Further, the distributions $\mathcal{D}_0$ and $\mathcal{D}_1$ have matching moments up to degree four by design. Using the invariance principle, the distribution of $\sum_r c_r(x_{(r)})$ is roughly the same, whether $x$ is sampled from $\mathcal{D}_0^{(R)}$ or from $\mathcal{D}_1^{(R)}$. Thus, by the invariance principle, the halfspace is unable to distinguish between the distributions $\mathcal{D}_0^{(R)}$ and $\mathcal{D}_1^{(R)}$ with a noticeable advantage, contradicting the assumption that it passes the test with probability at least $1/2+\epsilon$.

Further, suppose no two linear functions are close to the same dictator, i.e., the sets $H(\ell_i)$ are pairwise disjoint. In this case, we condition on the values of the coordinates in $H(\ell_i)$ for each $i \in [t]$. Since the sets $H(\ell_i)$ are pairwise disjoint, this conditions at most one value in each column. Therefore, the conditional distributions on each column in the cases $\mathcal{D}_0$ and $\mathcal{D}_1$ still have matching first three moments. We thus apply the invariance principle using the fact that after deleting the coordinates in $H(\ell_i)$, all the remaining coefficients of the weight vector $w^{(i)}$ are small (by the definition of critical index), and again conclude that the halfspace cannot pass the test with noticeable advantage. This implies that $H(\ell_i) \cap H(\ell_j) \neq \emptyset$ for some two rows $i \neq j$, and finishes the proof of the soundness claim.

The above consistency-enforcing test almost immediately yields the Unique Games hardness of weak learning of disjunctions by halfspaces via standard methods.

2.2 Extending to NP-hardness

To prove NP-hardness as opposed to hardness assuming the Unique Games conjecture, we reduce a version of Label Cover to our problem. This requires a more complicated consistency check, and we have to overcome several additional technical obstacles in the proof.

The main obstacle encountered in transferring the dictatorship test to a Label Cover-based hardness is one that commonly arises for several other problems. Specifically, the projection constraint on an edge $e$ maps a large set of labels $S$ for a vertex $v$ to a single label for its neighbor. While composing the Label Cover constraint with the dictatorship test, all labels in $S$ have to be necessarily equivalent. In several settings including this work, this requires the coordinates corresponding to labels in $S$ to be mostly identical! However, on making the coordinates corresponding to $S$ identical, the prover corresponding to $v$ can determine the identity of the edge $e$, thus completely destroying the soundness of the composition. In fact, the natural extension of the Unique Games-based reduction for MaxCut [32] to a corresponding Label Cover hardness fails primarily for this reason.

Unlike MaxCut or other Unique Games-based reductions, in our case the soundness of the dictatorship test is required to hold only against a specific class of functions, i.e., halfspaces. Harnessing this fact, we execute the reduction starting from a Label Cover instance whose projections are unique on average. More precisely, a smooth Label Cover instance (introduced in [31]) is one in which, for every vertex $v$ and every pair of distinct labels $i, j$, the labels $i$ and $j$ project to the same label with only a tiny probability over the choice of a random edge $e$ incident on $v$. Technically, we express the error term in the invariance principle as a certain fourth moment of the coefficients of the halfspace, and use the smoothness to bound this error term for most edges of the Label Cover instance.

2.3 Bypassing the Unique Games conjecture

Unlike previous invariance principle based proofs, which are only known to give Unique-Games hardness, we are able to reduce from a version of the Label Cover problem, based on unique-on-average projections, that can be shown to be NP-hard. It is of great interest to find other applications where a weak uniqueness property like the smoothness condition mentioned above can be used to convert a Unique-Games hardness result into an unconditional NP-hardness result. Indeed, inspired by the success of this work in avoiding the UGC assumption and using some of our methods, follow-up work has managed to bypass the Unique Games conjecture in some optimal geometric inapproximability results [21]. To the best of our knowledge, the results of [21] are the first NP-hardness proofs showing a tight inapproximability factor that is related to fundamental parameters of Gaussian space, and among the small handful of results where the optimality of a non-trivial semidefinite programming based algorithm is shown under the assumption $\mathrm{P} \neq \mathrm{NP}$. We hope that this paper has thus opened an avenue to convert at least some of the many tight Unique-Games hardness results to NP-hardness results.

3 Preliminaries

In this section, we define two important tools used in our analysis: (i) the critical index, and (ii) the invariance principle.

3.1 Critical Index

The notion of critical index was first introduced by Servedio [43] and plays an important role in the analysis of halfspaces in [37, 40, 13].

Definition 3.1.

Given any real vector $w = (w_1, \dots, w_n)$, reorder the coordinates by decreasing absolute value, i.e., $|w_1| \geq |w_2| \geq \cdots \geq |w_n|$, and denote $\sigma_k = \big(\sum_{i=k}^{n} w_i^2\big)^{1/2}$. For $\tau > 0$, the $\tau$-critical index of the vector $w$ is defined to be the smallest index $k \geq 0$ such that $|w_{k+1}| \leq \tau\, \sigma_{k+1}$. If no such $k$ exists (i.e., $|w_{k+1}| > \tau\, \sigma_{k+1}$ for all $0 \leq k < n$), the $\tau$-critical index is defined to be $\infty$. The vector $w$ is said to be $\tau$-regular if the $\tau$-critical index is $0$.
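A direct transcription of this definition as a Python sketch (following the convention above, where the critical index counts the number of coordinates to delete):

```python
import math

def critical_index(w, tau):
    """tau-critical index: sort |w| in decreasing order and return the
    smallest k such that |w_{k+1}| <= tau * sqrt(sum_{i >= k+1} w_i^2).
    Returns len(w) if no such k exists (standing in for 'infinite')."""
    a = sorted((abs(v) for v in w), reverse=True)
    tail = sum(v * v for v in a)        # sum of squares of a[k:], k = 0
    for k in range(len(a)):
        if a[k] <= tau * math.sqrt(tail):
            return k
        tail -= a[k] * a[k]
    return len(a)
```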

A simple observation from [13] is that if the critical index of a sequence is large then the sequence must contain a geometrically decreasing subsequence.

Lemma 3.2.

(Lemma from [13]) Given a vector $w$ with $|w_1| \geq |w_2| \geq \cdots \geq |w_n|$, if the $\tau$-critical index of the vector is larger than $\ell$, then for any $i < j \leq \ell$,

$$\sigma_j \leq (1-\tau^2)^{(j-i)/2}\, \sigma_i.$$

In particular, if $j - i \geq (2/\tau^2) \ln 2$ then $\sigma_j \leq \sigma_i/2$.

For a $\tau$-regular weight vector, the following lemma bounds the probability that its weighted sum falls into a small interval under certain distributions on the points. The proof is in Appendix B.

Lemma 3.3.

Let $w \in \mathbb{R}^n$ be a $\tau$-regular vector with $\|w\|_2 = 1$, and let $0 < \rho < 1$. Let $\mathcal{D}$ be a distribution over $\{0,1\}^n$. Define a distribution $\tilde{\mathcal{D}}$ on $\{0,1\}^n$ as follows: to generate $x$ from $\tilde{\mathcal{D}}$, first sample $y$ from $\mathcal{D}$ and then, independently for each $i$, define,

$$x_i = \begin{cases} y_i & \text{with probability } 1-\rho,\\ \text{a uniformly random bit} & \text{with probability } \rho.\end{cases}$$

Then for any interval $I \subseteq \mathbb{R}$ of length $\lambda$, we have $\Pr_{x \sim \tilde{\mathcal{D}}}\big[\sum_i w_i x_i \in I\big] \leq C_\rho \cdot (\lambda + \tau)$, where $C_\rho$ is a constant depending polynomially on $1/\rho$.

Intuitively, by the Berry-Esseen Theorem, $\sum_i w_i x_i$ is close to a Gaussian distribution if each $x_i$ is a uniformly random bit; therefore we can bound the probability that $\sum_i w_i x_i$ falls into the interval $I$. In the above lemma, each $x_i$ has probability $\rho$ of being a random bit, so roughly a $\rho$ fraction of the $x_i$ are random bits, and we can similarly bound the probability that $\sum_i w_i x_i$ falls into the interval $I$.
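This intuition can be checked numerically; a small Monte Carlo sketch (with the base point $y$ fixed to all-zeros and all parameters chosen arbitrarily for illustration):

```python
import random

def hit_probability(w, rho, lo, hi, trials=100_000):
    """Estimate Pr[sum_i w_i x_i in [lo, hi]] when, independently, each
    x_i is a uniform random bit with probability rho and otherwise
    equals y_i = 0 (all-zeros base point, for simplicity)."""
    hits = 0
    for _ in range(trials):
        s = sum(wi * random.randint(0, 1)
                for wi in w if random.random() < rho)
        hits += (lo <= s <= hi)
    return hits / trials

w = [0.05] * 400          # a regular vector with ||w||_2 = 1
print(hit_probability(w, rho=0.5, lo=4.975, hi=5.025))   # small probability
```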

Definition 3.4.

For a vector $w$ and an integer $K$, define the set of indices $H_K(w)$ as the set of indices of the $K$ biggest coordinates of $w$ by absolute value. Supposing the $\tau$-critical index of $w$ is $c$, define the set of indices $H(w) = H_c(w)$. In other words, $H(w)$ is the set of indices whose deletion makes the vector $w$ $\tau$-regular.

Definition 3.5.

For a vector $w$ and a subset of indices $S$, define the vector $w_{\setminus S}$ as the vector that agrees with $w$ outside $S$ and is zero on $S$:

$$(w_{\setminus S})_i = \begin{cases} 0 & i \in S,\\ w_i & i \notin S.\end{cases}$$

As suggested by Lemma 3.2, a weight vector with a large critical index has a geometrically decreasing subsequence. The following two lemmas use this fact to bound the probability that the weighted sum of a geometrically decreasing sequence of weights falls into a small interval. First, we restate Claim 5.7 from [13] here.

Lemma 3.6.

[Claim 5.7, [13]] Let $w_1, \dots, w_k$ be such that $w_k \neq 0$ and $|w_i| \geq 2|w_{i+1}|$ for $1 \leq i \leq k-1$. Then for any (half-open) interval $I$ of length $|w_k|$, there is at most one point $x \in \{0,1\}^k$ such that $\sum_{i=1}^{k} w_i x_i \in I$.
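The lemma is easy to verify exhaustively for small $k$; a sketch with a weight vector decreasing at ratio exactly $2$:

```python
from itertools import product

w = [16.0, 8.0, 4.0, 2.0, 1.0]     # |w_i| >= 2 |w_{i+1}|, |w_k| = 1
sums = sorted(sum(wi * xi for wi, xi in zip(w, x))
              for x in product((0, 1), repeat=len(w)))
# Distinct points have weighted sums at least |w_k| apart, so any
# half-open interval of length |w_k| contains at most one of them.
assert all(b - a >= 1.0 for a, b in zip(sums, sums[1:]))
print(min(b - a for a, b in zip(sums, sums[1:])))   # prints 1.0
```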

Lemma 3.7.

Let $w_1, \dots, w_k$ be such that $w_k \neq 0$ and $|w_i| \geq 2|w_{i+1}|$ for $1 \leq i \leq k-1$. Let $\mathcal{D}$ be a distribution over $\{0,1\}^k$. Define a distribution $\tilde{\mathcal{D}}$ on $\{0,1\}^k$ as follows: to generate $x$ from $\tilde{\mathcal{D}}$, sample $y$ from $\mathcal{D}$ and, independently for each $i$, set

$$x_i = \begin{cases} y_i & \text{with probability } 1-\rho,\\ \text{a uniformly random bit} & \text{with probability } \rho.\end{cases}$$

Then for any interval $I$ of length $|w_k|$ we have

$$\Pr_{x \sim \tilde{\mathcal{D}}}\Big[\sum_{i=1}^{k} w_i x_i \in I\Big] \leq (1-\rho/2)^k.$$

Proof.

By Lemma 3.6, we know that for the interval $I$, there is at most one point $x^* \in \{0,1\}^k$ such that $\sum_i w_i x^*_i \in I$. If no such $x^*$ exists then clearly the probability is zero. On the other hand, suppose there exists such an $x^*$; then $\sum_i w_i x_i \in I$ only if $x = x^*$ holds.

Conditioned on any fixing of the bits $y$, every bit $x_i$ is, independently with probability $\rho$, a uniformly random bit. Therefore, for every fixing of $y$ and for each $i$, with probability at least $\rho/2$, $x_i$ is not equal to $x^*_i$. Therefore, $\Pr_{x \sim \tilde{\mathcal{D}}}[x = x^*] \leq (1-\rho/2)^k$. ∎

3.2 Invariance Principle

While invariance principles have been established in various settings [39, 11, 38], we restate a version of the principle well suited for our application. We present a self-contained proof of it in Appendix C.

Definition 3.8.

A function $\psi : \mathbb{R} \to \mathbb{R}$ for which fourth-order derivatives exist everywhere on $\mathbb{R}$ is said to be $B$-bounded if $|\psi^{(4)}(x)| \leq B$ for all $x \in \mathbb{R}$.

Definition 3.9.

Two ensembles of random variables $X = (X_1, \dots, X_m)$ and $Y = (Y_1, \dots, Y_m)$ are said to have matching moments up to degree $d$ if for every multi-set $S$ of elements from $[m]$ with $|S| \leq d$, we have $\mathbb{E}\big[\prod_{i \in S} X_i\big] = \mathbb{E}\big[\prod_{i \in S} Y_i\big]$.

Theorem 3.10.

(Invariance Principle) Let $X = (X^{(1)}, \dots, X^{(R)})$ and $Y = (Y^{(1)}, \dots, Y^{(R)})$ be families of ensembles of random variables with $X^{(r)} = (X^{(r)}_1, \dots, X^{(r)}_t)$ and $Y^{(r)} = (Y^{(r)}_1, \dots, Y^{(r)}_t)$, satisfying the following properties:

  • For each $r \in [R]$, the random variables in the ensembles $X^{(r)}, Y^{(r)}$ have matching moments up to degree three. Further, all the random variables in $X$ and $Y$ are bounded in absolute value by $1$.

  • The ensembles $X^{(1)}, \dots, X^{(R)}$ are all independent of each other; similarly the ensembles $Y^{(1)}, \dots, Y^{(R)}$ are independent of each other.

Given a set of vectors $w^{(1)}, \dots, w^{(R)} \in \mathbb{R}^t$, define the linear function $\ell$ as

$$\ell(X) = \sum_{r \in [R]} \langle w^{(r)}, X^{(r)} \rangle.$$

Then for a $B$-bounded function $\psi$ we have

$$\Big|\mathbb{E}\big[\psi(\ell(X))\big] - \mathbb{E}\big[\psi(\ell(Y))\big]\Big| \leq C_t \cdot B \cdot \sum_{r \in [R]} \|w^{(r)}\|_2^4,$$

where $C_t$ is a constant depending only on $t$. Further, define the spread function corresponding to the ensembles and the linear function $\ell$ as follows,

$$S_\ell(\lambda) = \max_{\substack{I \subseteq \mathbb{R} \\ |I| = \lambda}} \max\Big(\Pr\big[\ell(X) \in I\big],\ \Pr\big[\ell(Y) \in I\big]\Big);$$

then for all $\theta \in \mathbb{R}$ and all $\lambda > 0$,

$$\Big|\Pr\big[\ell(X) \geq \theta\big] - \Pr\big[\ell(Y) \geq \theta\big]\Big| \leq O\Big(\frac{C_t}{\lambda^4}\Big) \sum_{r \in [R]} \|w^{(r)}\|_2^4 + O\big(S_\ell(\lambda)\big).$$

Roughly speaking, the second part of the theorem states that the indicator function $x \mapsto \mathbb{1}[x \geq \theta]$ can be thought of as $O(1/\lambda^4)$-bounded, with an additional error parameter $S_\ell(\lambda)$.
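For intuition, the first part of the theorem can be observed numerically. The sketch below compares uniform $\pm 1$ bits against standard Gaussians, which match moments up to degree three (we ignore the boundedness requirement here purely for illustration):

```python
import random, statistics

def psi(v, lam=1.0):
    """A fixed smooth test function; any function with a bounded
    fourth derivative would do."""
    return 1.0 / (1.0 + (v / lam) ** 2)

n, trials = 200, 50_000
w = [n ** -0.5] * n                      # a regular vector, ||w||_2 = 1

def expectation(sampler):
    return statistics.fmean(
        psi(sum(wi * sampler() for wi in w)) for _ in range(trials))

ex = expectation(lambda: random.choice((-1, 1)))   # X_i uniform on {-1,1}
ey = expectation(lambda: random.gauss(0.0, 1.0))   # Y_i standard Gaussian
print(abs(ex - ey))                                # close to 0 for regular w
```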

4 Construction of the Dictatorship Test

In this section we describe the construction of the dictatorship test, which will be the key ingredient in the hardness reduction from $t$-Unique Label Cover.

4.1 Distributions $\mathcal{D}_0$ and $\mathcal{D}_1$

The dictatorship test is based on the following two distributions $\mathcal{D}_0$ and $\mathcal{D}_1$ defined on $\{0,1\}^t$.

Lemma 4.1.

For every $\delta > 0$ and all large enough $t$, there exist two probability distributions $\mathcal{D}_0$, $\mathcal{D}_1$ on $\{0,1\}^t$ such that for $b \in \{0,1\}$,

$$\Pr_{z \sim \mathcal{D}_b}\big[z_1 \vee z_2 \vee \cdots \vee z_t = b\big] \geq 1 - \delta,$$

while the distributions have matching moments up to degree four, i.e., for every multi-set $S$ of coordinates with $|S| \leq 4$,

$$\mathbb{E}_{z \sim \mathcal{D}_0}\Big[\prod_{i \in S} z_i\Big] = \mathbb{E}_{z \sim \mathcal{D}_1}\Big[\prod_{i \in S} z_i\Big].$$

Proof.

For $b = 1$, take $\mathcal{D}_1$ to be the following mixture distribution, with weights $q_0, q_1, \dots, q_4$ and biases $p_1, p_2, p_3, p_4$:

  1. with probability $q_0$, randomly set exactly one of the bits to be 1 and all the others to be 0;

  2. with probability $q_1$, independently set every bit to be 1 with probability $p_1$;

  3. with probability $q_2$, independently set every bit to be 1 with probability $p_2$;

  4. with probability $q_3$, independently set every bit to be 1 with probability $p_3$;

  5. with probability $q_4$, independently set every bit to be 1 with probability $p_4$.

The distribution $\mathcal{D}_0$ is defined to be the following mixture distribution, with weights $q'_0, q'_1, \dots, q'_4$ and biases $p'_1, p'_2, p'_3, p'_4$ to be specified later:

  1. with probability $q'_0$, set every bit to be zero;

  2. with probability $q'_1$, independently set every bit to be 1 with probability $p'_1$;

  3. with probability $q'_2$, independently set every bit to be 1 with probability $p'_2$;

  4. with probability $q'_3$, independently set every bit to be 1 with probability $p'_3$;

  5. with probability $q'_4$, independently set every bit to be 1 with probability $p'_4$.

From the definition of $\mathcal{D}_1$, we know that $\Pr_{z \sim \mathcal{D}_1}[z_1 \vee \cdots \vee z_t = 1] \geq q_0$, and from the definition of $\mathcal{D}_0$ that $\Pr_{z \sim \mathcal{D}_0}[z_1 \vee \cdots \vee z_t = 0] \geq q'_0$; both are at least $1-\delta$ once the remaining mixture weights are chosen sufficiently small.

It remains to determine each of the parameters. Notice that, since both distributions are invariant under permutations of the coordinates, the moment matching conditions can be expressed as a linear system over the mixture weights as follows: for each degree $k \in \{1,2,3,4\}$ and any $k$ distinct coordinates,

$$q_0 \cdot \frac{\mathbb{1}[k=1]}{t} + \sum_{j=1}^{4} q_j\, p_j^{k} \;=\; \sum_{j=1}^{4} q'_j\, (p'_j)^{k}.$$

We then show that such a linear system has a feasible solution, i.e., one in which all the weights are nonnegative and sum to one.

To prove this, we apply Cramer’s rule to solve for the weights explicitly. With some calculation using basic linear algebra, we obtain closed-form expressions for the weights, and for large enough $t$ all of these quantities are nonnegative, as required.
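A candidate solution of this system can be sanity-checked numerically. The sketch below uses arbitrary illustrative biases and weights, not the parameters of the actual proof, and solves the degree-$k$ constraints for the remaining weights of $\mathcal{D}_0$:

```python
import numpy as np

t = 8
# Illustrative parameters only, NOT the paper's actual choices.
p = [0.1, 0.2, 0.3, 0.4]                 # biases of D_1's Bernoulli parts
q = [0.9, 0.025, 0.025, 0.025, 0.025]    # mixture weights of D_1 (sum to 1)

def moment_D1(k):
    """E[z_{i_1} ... z_{i_k}] for k distinct coordinates under D_1; the
    'exactly one random bit' part contributes 1/t only when k = 1."""
    return q[0] * (1.0 / t if k == 1 else 0.0) + \
           sum(qj * pj ** k for qj, pj in zip(q[1:], p))

# Moment matching: sum_j q'_j (p'_j)^k = moment_D1(k) for k = 1..4.
# With the biases p'_j fixed, this is a Vandermonde-type linear system.
pp = [0.15, 0.25, 0.35, 0.45]            # biases of D_0's Bernoulli parts
A = np.array([[ppj ** k for ppj in pp] for k in range(1, 5)])
b = np.array([moment_D1(k) for k in range(1, 5)])
qp = np.linalg.solve(A, b)               # candidate weights q'_1..q'_4
print(qp, 1.0 - qp.sum())                # feasibility needs q' >= 0, sum <= 1
```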