1.1 Background and Problem Definition
Halfspaces are Boolean functions $h_{\mathbf{w}} : \mathbb{R}^d \to \{\pm 1\}$ of the form $h_{\mathbf{w}}(\mathbf{x}) = \mathrm{sign}(\langle \mathbf{w}, \mathbf{x} \rangle)$, where $\mathbf{w} \in \mathbb{R}^d$ is the associated weight vector. (The function $\mathrm{sign} : \mathbb{R} \to \{\pm 1\}$ is defined as $\mathrm{sign}(t) = 1$ if $t \geq 0$ and $\mathrm{sign}(t) = -1$ otherwise.)
otherwise.) The problem of learning an unknown halfspace with a margin condition (in the sense that no example is allowed to lie too close to the separating hyperplane) is as old as the field of machine learning — starting with Rosenblatt’s Perceptron algorithm[Ros58] — and has arguably been one of the most influential problems in the development of the field, with techniques such as SVMs [Vap98] and AdaBoost [FS97] coming out of its study.
In this paper, we study the problem of learning $\gamma$-margin halfspaces in the agnostic PAC model [Hau92, KSS94]. Specifically, there is an unknown distribution $D$ on $\mathbb{B}_d \times \{\pm 1\}$, where $\mathbb{B}_d$ is the unit ball on $\mathbb{R}^d$, and the learning algorithm $A$ is given as input a training set of i.i.d. samples drawn from $D$. The goal of $A$ is to output a hypothesis whose error rate is competitive with the $\gamma$-margin error rate of the optimal halfspace. In more detail, the error rate (misclassification error) of a hypothesis $h : \mathbb{B}_d \to \{\pm 1\}$ (with respect to $D$) is $\mathrm{err}_{0\text{-}1}^D(h) = \Pr_{(\mathbf{x}, y) \sim D}[h(\mathbf{x}) \neq y]$. For $\gamma \in (0, 1)$, the $\gamma$-margin error rate of a halfspace $h_{\mathbf{w}}$ with $\|\mathbf{w}\|_2 = 1$ is $\mathrm{err}_{\gamma}^D(\mathbf{w}) = \Pr_{(\mathbf{x}, y) \sim D}[y \langle \mathbf{w}, \mathbf{x} \rangle \leq \gamma]$. We denote by $\mathrm{OPT}_{\gamma}^D = \min_{\|\mathbf{w}\|_2 = 1} \mathrm{err}_{\gamma}^D(\mathbf{w})$ the minimum $\gamma$-margin error rate achievable by any halfspace. We say that $A$ is an $\alpha$-agnostic learner, $\alpha \geq 1$, if it outputs a hypothesis $h$ that with probability at least $1 - \delta$ satisfies $\mathrm{err}_{0\text{-}1}^D(h) \leq \alpha \cdot \mathrm{OPT}_{\gamma}^D + \epsilon$. (For $\alpha = 1$, we obtain the standard notion of agnostic learning.) If the hypothesis $h$ is itself a halfspace, we say that the learning algorithm is proper. This work focuses on proper learning algorithms.
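These definitions amount to simple empirical averages over a sample. The following is a minimal sketch on toy data (the dataset, the vector `w`, and both helper names are ours, for illustration):

```python
def zero_one_error(w, S):
    """Empirical misclassification error of the halfspace h_w(x) = sign(<w, x>)."""
    sign = lambda t: 1 if t >= 0 else -1
    return sum(1 for (x, y) in S
               if sign(sum(wi * xi for wi, xi in zip(w, x))) != y) / len(S)

def margin_error(w, S, gamma):
    """Empirical gamma-margin error: fraction of points with y*<w, x> <= gamma."""
    return sum(1 for (x, y) in S
               if y * sum(wi * xi for wi, xi in zip(w, x)) <= gamma) / len(S)

# Toy sample in the unit ball of R^2; the halfspace w = (1, 0) labels by sign(x_1).
S = [((0.9, 0.1), 1), ((-0.8, 0.2), -1), ((0.05, 0.3), 1), ((0.5, -0.4), 1)]
w = (1.0, 0.0)
print(zero_one_error(w, S))     # 0.0: every point is on the correct side
print(margin_error(w, S, 0.1))  # 0.25: the third point violates the 0.1-margin
```

Note that a point can be correctly classified yet still count toward the margin error, which is exactly the gap the agnostic guarantee has to bridge.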
1.2 Related and Prior Work
In this section, we summarize the prior work that is directly related to the results of this paper. First, we note that the sample complexity of our learning problem (ignoring computational considerations) is well-understood. In particular, the ERM that minimizes the number of $\gamma$-margin errors over the training set (subject to a unit norm constraint) is known to be an agnostic learner ($\alpha = 1$), assuming the sample size is sufficiently large. Specifically, $\Theta(1/(\gamma^2 \epsilon^2))$ samples are known to be sufficient and necessary for this learning problem (see, e.g., [BM02, McA03]). (To avoid clutter in the expressions, we will henceforth assume that the failure probability $\delta$ is a small universal constant; recall that one can always boost the confidence probability with an $O(\log(1/\delta))$ multiplicative overhead in the sample complexity.) In the realizable case ($\mathrm{OPT}_{\gamma}^D = 0$), i.e., if the data is linearly separable with margin $\gamma$, the ERM rule above can be implemented in polynomial time using the Perceptron algorithm. The non-realizable setting ($\mathrm{OPT}_{\gamma}^D > 0$) is much more challenging computationally.
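As a concrete illustration of the realizable case, here is a minimal Perceptron sketch (the toy dataset and pass cap are ours; the classic analysis bounds the number of mistakes by $1/\gamma^2$ when all $\|\mathbf{x}\|_2 \leq 1$ and the data is separable with margin $\gamma$ by a unit vector):

```python
def perceptron(S, max_passes=1000):
    """Classic Perceptron: repeatedly add y*x for any misclassified (x, y)
    until no mistakes remain (assumes the sample is linearly separable)."""
    d = len(S[0][0])
    w = [0.0] * d
    for _ in range(max_passes):
        mistake = False
        for x, y in S:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistake = True
        if not mistake:
            break
    return w

# A separable toy sample in the unit ball of R^2.
S = [((0.9, 0.1), 1), ((-0.8, 0.2), -1), ((0.3, 0.6), 1), ((-0.2, -0.7), -1)]
w = perceptron(S)
assert all(y * (w[0] * x[0] + w[1] * x[1]) > 0 for x, y in S)
```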
The agnostic version of our problem ($\alpha = 1$) was first considered in [BS00], who gave a proper learning algorithm with runtime $\mathrm{poly}(d) \cdot (1/\epsilon)^{O(1/\gamma^2)}$. It was also shown in [BS00] that agnostic proper learning with runtime $\mathrm{poly}(d, 1/\gamma, 1/\epsilon)$ is NP-hard. A question left open by their work was characterizing the computational complexity of proper learning as a function of the margin parameter $\gamma$.
Subsequent works focused on improper learning. The case $\alpha = 1$ was studied in [SSS09, SSS10], who gave a learning algorithm with sample complexity exponential in $1/\gamma$ and computational complexity polynomial in $d$ and the sample size. The increased sample complexity is inherent in their approach, as their algorithm works by solving a convex program over an expanded feature space. [BS12] gave an $\alpha$-agnostic learning algorithm for all $\alpha \geq 1$, whose sample and computational complexity improve as $\alpha$ increases. (We note that the Perceptron algorithm is known to achieve $\alpha = \Theta(1/\gamma)$ [Ser01]. Prior to [BS12], [LS11] gave a polynomial time algorithm achieving a weaker approximation guarantee.) [BS12] posed as an open question whether their upper bounds for improper learning can be achieved with a proper learner.
A related line of work on algorithmic robust statistics has given polynomial time robust estimators for a range of learning tasks. Specifically, [KLS09, ABL17, DKS18, DKK19] obtained efficient PAC learning algorithms for halfspaces with malicious noise [Val85, KL93], under the assumption that the uncorrupted data comes from a “tame” distribution, e.g., Gaussian or isotropic log-concave. It should be noted that the class of $\gamma$-margin distributions considered in this work is significantly broader and can be far from satisfying the structural properties required in the aforementioned works.
A growing body of theoretical work has focused on adversarially robust learning (e.g., [BLPR19, MHS19, DNV19, Nak19]). In adversarially robust learning, the learner seeks to output a hypothesis with small $\gamma$-robust misclassification error, which for a hypothesis $h$ and a norm $\|\cdot\|$ is typically defined as $\Pr_{(\mathbf{x}, y) \sim D}[\exists \mathbf{z} : \|\mathbf{z} - \mathbf{x}\| \leq \gamma \text{ and } h(\mathbf{z}) \neq y]$. Notice that when $h$ is a halfspace and $\|\cdot\|$ is the Euclidean norm, the $\gamma$-robust misclassification error coincides with the $\gamma$-margin error in our context. (It should be noted that most of the literature on adversarially robust learning focuses on the $\ell_\infty$-norm.) However, the objectives of the two learning settings are slightly different: in adversarially robust learning, the learner would like to output a hypothesis with small $\gamma$-robust misclassification error, whereas in our context the learner only has to output a hypothesis with small zero-one misclassification error. Nonetheless, as we point out in Remark 1.3, our algorithms can be adapted to provide guarantees in line with the adversarially robust setting as well.
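The coincidence between robust error and margin error for halfspaces can be checked directly: for a unit-norm $\mathbf{w}$, the worst-case $\ell_2$ perturbation of $\mathbf{x}$ within radius $\gamma$ is $\mathbf{z} = \mathbf{x} - y\gamma\mathbf{w}$. A small sketch (the toy points are ours; we use strict inequalities to sidestep measure-zero boundary ties):

```python
def margin_err_point(w, x, y, gamma):
    """Does (x, y) violate the gamma-margin condition (strictly)?"""
    return y * sum(wi * xi for wi, xi in zip(w, x)) < gamma

def robust_err_point(w, x, y, gamma):
    """For unit w, the worst z within l2-distance gamma of x is z = x - y*gamma*w."""
    z = [xi - y * gamma * wi for wi, xi in zip(w, x)]
    sign = lambda t: 1 if t >= 0 else -1
    return sign(sum(wi * zi for wi, zi in zip(w, z))) != y

w = (0.6, 0.8)  # unit norm
pts = [((0.5, 0.5), 1), ((0.1, -0.2), 1), ((-0.3, -0.1), -1), ((0.2, 0.1), -1)]
for x, y in pts:
    assert margin_err_point(w, x, y, 0.25) == robust_err_point(w, x, y, 0.25)
```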
1.3 Our Contributions
We study the complexity of proper $\alpha$-agnostic learning of $\gamma$-margin halfspaces on the unit ball. Our main result nearly characterizes the complexity of constant factor approximations to this problem:
There is an algorithm that uses $\tilde{O}(1/(\gamma^2 \epsilon^2))$ samples, runs in time $2^{\tilde{O}(1/\gamma^2)} \cdot \mathrm{poly}(d, 1/\epsilon)$, and is a constant factor agnostic proper learner for $\gamma$-margin halfspaces with high constant confidence probability. Moreover, assuming the Randomized Exponential Time Hypothesis, any proper learning algorithm that achieves any constant factor approximation has runtime $2^{(1/\gamma)^{2 - o(1)}}$.
The reader is referred to Theorems 2.4 and 3.1 for detailed statements of the upper and lower bounds, respectively. A few remarks are in order. First, we note that the specific constant approximation ratio in the above theorem statement is not inherent: our algorithm achieves $\alpha = 1 + \delta$, for any constant $\delta > 0$, with the constant in the exponent of the runtime depending on $\delta$. The runtime of our algorithm significantly improves upon the runtime of the best known agnostic proper learner [BS00], achieving a fixed polynomial dependence on $1/\epsilon$, independent of $\gamma$. This gain in runtime comes at the expense of losing a small constant factor in the error guarantee. It is natural to ask whether there exists a $1$-agnostic proper learner matching the runtime of our Theorem 1.1. In Theorem 3.2, we establish a computational hardness result implying that such an improvement is unlikely.
The runtime dependence of our algorithm on the margin parameter scales as $2^{\tilde{O}(1/\gamma^2)}$ (which is nearly best possible for proper learners), as opposed to $2^{\tilde{O}(1/\gamma)}$ in the best known improper learning algorithms [SSS09, BS12]. In addition to the interpretability of proper learning, we note that the sample complexity of our algorithm is quadratic in $1/(\gamma\epsilon)$ (which is information-theoretically optimal), as opposed to exponential in $1/\gamma$ for known improper learners. Moreover, for moderate values of the parameters, our algorithm may be faster than known improper learners, as it only uses spectral methods and ERM, as opposed to convex optimization. Finally, we note that the lower bound part of Theorem 1.1 implies a computational separation between proper and improper learning for our problem.
In addition, we explore the complexity of $\alpha$-agnostic learning for large $\alpha$. The following theorem summarizes our results in this setting:
There is an algorithm that uses $\tilde{O}(1/(\gamma^2 \epsilon^2))$ samples, runs in time exponential in $\tilde{O}(1/(\alpha\gamma)^2)$ (up to $\mathrm{poly}(d, 1/\epsilon)$ factors), and is an $\alpha$-agnostic proper learner for $\gamma$-margin halfspaces with high constant confidence probability. Moreover, assuming NP $\neq$ RP and the Sliding Scale Conjecture, there exists an absolute constant $c > 0$, such that no $(1/\gamma)^c$-agnostic proper learner runs in $\mathrm{poly}(d, 1/\gamma, 1/\epsilon)$ time.
The reader is referred to Theorem 2.7 for the upper bound and Theorem 3.3 for the lower bound. In summary, we give an $\alpha$-agnostic proper learning algorithm with runtime exponential in $\tilde{O}(1/(\alpha\gamma)^2)$, as opposed to $\tilde{O}(1/\gamma^2)$, and we show that substantially improving on this approximation-versus-runtime tradeoff is computationally hard. (Assuming only NP $\neq$ RP, we can rule out polynomial time $\alpha$-agnostic proper learning for some constant $\alpha > 1$.)
While not stated explicitly in the subsequent analysis, our algorithms (with a slight modification to the associated constant factors) not only give a halfspace with zero-one loss at most $\alpha \cdot \mathrm{OPT}_{\gamma}^D + \epsilon$, but this guarantee holds for the $(\gamma/2)$-margin error of the output hypothesis as well (here the constant $1/2$ can be replaced by any constant less than one, with an appropriate increase to the algorithm’s running time). Thus, our learning algorithms also work in the adversarially robust setting (under the Euclidean norm) with a small loss in the “robustness parameter” (margin) from the one used to compute the optimum (i.e., $\gamma$) to the one used to measure the error of the output hypothesis (i.e., $\gamma/2$).
1.4 Our Techniques
Overview of Algorithms.
For the sake of this intuitive explanation, we provide an overview of our algorithms when the underlying distribution is explicitly known. The finite sample analysis of our algorithms follows from standard generalization bounds (see Section 2).
Our constant factor approximation algorithm relies on the following observation: Let $\mathbf{w}^*$ be the optimal weight vector. The assumption that $|\langle \mathbf{w}^*, \mathbf{x} \rangle|$ is large for almost all points $\mathbf{x}$ (by the margin property) implies a relatively strong condition on $\mathbf{w}^*$, which will allow us to find a relatively small search space containing a near-optimal solution. A first idea is to consider the second moment matrix $\mathbf{M} = \mathbf{E}_{(\mathbf{x}, y) \sim D}[\mathbf{x} \mathbf{x}^T]$ and note that $(\mathbf{w}^*)^T \mathbf{M} \mathbf{w}^* = \mathbf{E}[\langle \mathbf{w}^*, \mathbf{x} \rangle^2]$ is large. This in turn implies that $\mathbf{w}^*$ has a large component on the subspace spanned by the eigenvectors corresponding to the largest eigenvalues of $\mathbf{M}$. This idea suggests a basic algorithm that computes a net over unit-norm weight vectors on this subspace and outputs the best answer. This basic algorithm, whose runtime is slower than that of our main algorithm, is analyzed in Section 2.1.
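This observation is easy to verify numerically: on a toy margin distribution, $\mathbf{w}^*$ has Rayleigh quotient at least $\gamma^2$ on the second moment matrix, and the top eigenvector of that matrix aligns with $\mathbf{w}^*$. A sketch using power iteration (the sampling scheme and all constants are ours, for illustration):

```python
import random, math
random.seed(0)

def dot(u, v): return sum(a * b for a, b in zip(u, v))

# Draw points in the unit ball of R^3, all satisfying a margin condition
# with respect to w* = e_1 (so |x_1| > gamma for every sample).
w_star, gamma = (1.0, 0.0, 0.0), 0.3
X = []
while len(X) < 500:
    x = [random.uniform(-1, 1) for _ in range(3)]
    if dot(x, x) <= 1 and abs(x[0]) > gamma:
        X.append(x)

# Empirical second moment matrix M = E[x x^T].
M = [[sum(x[i] * x[j] for x in X) / len(X) for j in range(3)] for i in range(3)]

# The margin forces w*^T M w* = E[<w*, x>^2] >= gamma^2.
assert dot(w_star, [dot(row, w_star) for row in M]) >= gamma**2

# Top eigenvector of M by power iteration.
v = [1.0, 1.0, 1.0]
for _ in range(200):
    v = [dot(row, v) for row in M]
    n = math.sqrt(dot(v, v))
    v = [vi / n for vi in v]
print(abs(dot(v, w_star)))  # close to 1: the top eigenvector aligns with w*
```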
To obtain our faster constant factor approximation algorithm (establishing the upper bound part of Theorem 1.1), we use a refinement of the above idea. Instead of trying to guess the projection of $\mathbf{w}^*$ onto the space of large eigenvectors all at once, we will do so in stages. In particular, it is not hard to see that $\mathbf{w}^*$ has a non-trivial projection onto the subspace spanned by the eigenvectors corresponding to the top few eigenvalues of $\mathbf{M}$. If we guess this projection, we will have some approximation to $\mathbf{w}^*$, but unfortunately not a sufficiently good one. However, we note that the difference between $\mathbf{w}^*$ and our current hypothesis will have a large average squared inner product with the misclassified points. This suggests an iterative algorithm that in the $i$-th iteration considers the second moment matrix $\mathbf{M}_i$ of the points not correctly classified by the current hypothesis $\mathbf{w}_i$, guesses a vector $\mathbf{v}_i$ in the space spanned by the top few eigenvalues of $\mathbf{M}_i$, and sets $\mathbf{w}_{i+1} = \mathbf{w}_i + \mathbf{v}_i$. This procedure can be shown to produce a candidate set of weight vectors of bounded cardinality, one of which has the desired misclassification error. This algorithm and its analysis are given in Section 2.2.
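The key invariant behind the stages can be illustrated numerically: on a realizable toy instance, the points that the current hypothesis fails to classify with margin $\gamma/2$ load heavily on the direction $\mathbf{w}^* - \mathbf{w}$. A sketch (the instance and constants are ours, for illustration):

```python
import random
random.seed(1)

def dot(u, v): return sum(a * b for a, b in zip(u, v))

w_star, gamma = (1.0, 0.0), 0.3
w = (0.0, 1.0)                       # current (bad) hypothesis

# Realizable gamma-margin data in the unit ball: y = sign(x_1), |x_1| > gamma.
S = []
while len(S) < 400:
    x = (random.uniform(-1, 1), random.uniform(-1, 1))
    if dot(x, x) <= 1 and abs(x[0]) > gamma:
        S.append((x, 1 if x[0] >= 0 else -1))

# Points the current hypothesis fails to classify with margin gamma/2.
bad = [(x, y) for (x, y) in S if y * dot(w, x) <= gamma / 2]

# Each such point has y<w* - w, x> > gamma/2, so u = w* - w has Rayleigh
# quotient at least (gamma/2)^2 / ||u||^2 on their second moment matrix.
u = [a - b for a, b in zip(w_star, w)]
rayleigh = sum(dot(u, x)**2 for x, _ in bad) / (len(bad) * dot(u, u))
assert rayleigh >= (gamma / 2)**2 / dot(u, u)
```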
Our general $\alpha$-agnostic algorithm (the upper bound in Theorem 1.2) relies on approximating the Chow parameters of the target halfspace $f$, i.e., the quantities $\mathbf{E}_{(\mathbf{x}, y) \sim D}[f(\mathbf{x}) \, x_i]$, $i = 1, \ldots, d$. A classical result [Cho61] shows that the exact values of the Chow parameters of a halfspace (over any distribution) uniquely define the halfspace. Although this fact is not very useful under an arbitrary distribution, the margin assumption implies a strong approximate identifiability result (Lemma 2.10). Combining this with an algorithm of [DDFS14], we can efficiently compute an approximation to the halfspace given an approximation to its Chow parameters. In particular, if we can approximate the Chow parameters to small $\ell_2$-error, we can approximate the halfspace to correspondingly small misclassification error.
A naive approach to approximate the Chow parameters would be via the empirical Chow parameters, namely $\frac{1}{|S|} \sum_{(\mathbf{x}, y) \in S} y \, \mathbf{x}$. In the realizable case, this quantity indeed corresponds to the vector of Chow parameters. Unfortunately however, this method does not work in the agnostic case, as it can introduce an error on the order of $\mathrm{OPT}_{\gamma}^D$. To overcome this obstacle, we note that in order for a small fraction of errors to introduce a large error in the empirical Chow parameters, it must be the case that there is some direction in which many of these erroneous points introduce a large error. If we can guess some such direction that correlates well with the error and also guess the correct projection of our Chow parameters onto this direction, we can correct a decent fraction of the error between the empirical and true Chow parameters. We show that by making the correct guesses a bounded number of times, we can reduce the empirical error sufficiently so that it can be used to find an accurate hypothesis. Once again, we can compute a hypothesis for each sequence of guesses and return the best one. See Section 2.3 for a detailed analysis.
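The empirical Chow parameters are a one-line computation. The following sketch illustrates the realizable case, where they do align with the target direction (the toy instance is ours, for illustration):

```python
import math, random
random.seed(2)

def dot(u, v): return sum(a * b for a, b in zip(u, v))

# Realizable gamma-margin data labeled by the unit vector w*.
w_star, gamma = (0.8, 0.6), 0.2
S = []
while len(S) < 1000:
    x = (random.uniform(-1, 1), random.uniform(-1, 1))
    if dot(x, x) <= 1 and abs(dot(w_star, x)) > gamma:
        S.append((x, 1 if dot(w_star, x) >= 0 else -1))

# Empirical Chow vector (1/|S|) * sum_{(x, y) in S} y * x.
chow = [sum(y * x[i] for x, y in S) / len(S) for i in range(2)]

# Cosine of the angle between the Chow vector and w* (w* has unit norm).
cos = dot(chow, w_star) / math.sqrt(dot(chow, chow))
print(cos)  # close to 1: the empirical Chow vector points along w*
```

In the agnostic case, flipping an $\mathrm{OPT}_{\gamma}^D$ fraction of the labels above can tilt this vector by a comparable amount, which is exactly the error the guessing procedure corrects.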
Overview of Computational Lower Bounds.
Our hardness results are shown via two reductions. These reductions take as input an instance of a computationally hard problem and produce a distribution $D$ on labeled examples. If the starting instance is a YES instance of the original problem, then $\mathrm{OPT}_{\gamma}^D$ is small for an appropriate value of $\gamma$. On the other hand, if the starting instance is a NO instance of the original problem, then $\mathrm{OPT}^D$ is large, where $\mathrm{OPT}^D$ denotes the minimum (zero-one) error rate achievable by any halfspace. As a result, if there is a “too fast” ($\alpha$-)agnostic proper learner for $\gamma$-margin halfspaces, then we would also get a “too fast” algorithm for the original problem, which would violate the corresponding complexity assumption.
To understand the margin parameter we can achieve, we need to first understand the problems we start with. For our reductions, the original problems can be viewed in the following form: select $k$ items from a universe of $n$ items so as to satisfy certain “local constraints”. For instance, in our first construction, the reduction is from the $k$-Clique problem: Given a graph $G$ and an integer $k$, the goal is to determine whether $G$ contains a $k$-clique as a subgraph. For this problem, the items correspond to the vertices of $G$ and the “local” constraints are that every pair of selected vertices induces an edge.
Roughly speaking, our reduction produces a distribution on labeled examples in dimension $n$, with the $i$-th dimension corresponding to the $i$-th item. The “ideal” solution $\mathbf{w}$ in the YES case is to set $w_i$ to a fixed non-zero value iff item $i$ is selected and set $w_i = 0$ otherwise. In our reductions, the local constraints are expressed using “sparse” sample vectors (i.e., vectors with only a constant number of non-zero coordinates all having the same magnitude). For example, in the case of $k$-Clique, the constraints can be expressed as follows: For every non-edge $(i, j)$, we must have $\langle \mathbf{w}, \mathbf{e}_i + \mathbf{e}_j \rangle \leq \tau$ for a suitable threshold $\tau$, where $\mathbf{e}_i$ and $\mathbf{e}_j$ denote the $i$-th and $j$-th vectors in the standard basis. A main step in both of our proofs is to show that the reduction still works even when we “shift” the right hand side by a small amount. For instance, in the case of $k$-Clique, it is possible to show that, even if we slightly decrease the threshold, the correctness of the construction remains, and we also get the added benefit that the constraints are now satisfied with a margin by our ideal solution in the YES case.
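The constraint vectors in the $k$-Clique reduction can be sketched as follows (a toy graph and illustrative normalizations of ours, not the paper's exact construction):

```python
import math
from itertools import combinations

# Toy instance: does the 5-vertex graph below contain a 3-clique?
n, k = 5, 3
edges = {(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)}
non_edges = [(i, j) for i, j in combinations(range(n), 2) if (i, j) not in edges]

# For every non-edge (i, j), emit the sparse sample ((e_i + e_j)/sqrt(2), -1):
# the constraint <w, e_i + e_j> small forbids selecting both endpoints.
samples = []
for i, j in non_edges:
    x = [0.0] * n
    x[i] = x[j] = 1 / math.sqrt(2)
    samples.append((x, -1))

# Ideal YES-case solution: w_i = 1/sqrt(k) iff vertex i is in the clique {0, 1, 2}.
clique = {0, 1, 2}
w = [1 / math.sqrt(k) if i in clique else 0.0 for i in range(n)]

# Every constraint is satisfied: each non-edge touches at most one clique vertex,
# so <w, x> <= 1/sqrt(2k), comfortably below the threshold 1/sqrt(k).
for x, _ in samples:
    assert sum(wi * xi for wi, xi in zip(w, x)) <= 1 / math.sqrt(k) + 1e-9
```

Note that if both endpoints of a non-edge were selected, the inner product would double and violate the constraint, which is what makes the construction sound.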
In the case of $k$-Clique, the above idea yields a reduction to 1-agnostic learning of $\gamma$-margin halfspaces where the margin $\gamma$ depends polynomially on $1/k$ and the dimension is $n$. As a result, if there is an $f(1/\gamma) \cdot \mathrm{poly}(d)$-time algorithm for the latter for some function $f$, then there also exists a $g(k) \cdot \mathrm{poly}(n)$-time algorithm for $k$-Clique for some function $g$. The latter statement is considered unlikely, as it would break a widely-believed hypothesis in the area of parameterized complexity.
Ruling out $\alpha$-agnostic learners, for $\alpha > 1$, is slightly more complicated, since we need to produce a “gap” of a factor of $\alpha$ between $\mathrm{OPT}_{\gamma}^D$ in the YES case and $\mathrm{OPT}^D$ in the NO case. To create such a gap, we appeal to the PCP Theorem [AS98, ALM98], which can be thought of as an NP-hardness proof of the following “gap version” of 3SAT: given a 3CNF formula as input, distinguish between the case that the formula is satisfiable and the case that the formula is not even $(1 - \delta)$-satisfiable for some constant $\delta > 0$ (in other words, for any assignment to the variables, at least a $\delta$ fraction of the clauses are unsatisfied). Moreover, further strengthened versions of the PCP Theorem [Din07, MR10] actually imply that this Gap-3SAT problem cannot even be solved in time $2^{n^{1 - o(1)}}$, where $n$ denotes the number of variables in the formula, assuming the Exponential Time Hypothesis (ETH), which states that the exact version of 3SAT cannot be solved in time $2^{o(n)}$. Once again, (Gap-)3SAT can be viewed in the form of “item selection with local constraints”. However, the number of selected items is now equal to $n$, the number of variables of the formula. With a similar line of reasoning as above, the margin we get is now on the order of $1/\sqrt{n}$. As a result, if there is, say, a $2^{(1/\gamma)^{2 - o(1)}}$-time $\alpha$-agnostic proper learner for $\gamma$-margin halfspaces (for an appropriate $\alpha$) that is too fast, then there is a $2^{n^{1 - o(1)}}$-time algorithm for Gap-3SAT, which would violate ETH.
Unfortunately, the above described idea only gives a gap (i.e., approximation ratio $\alpha$) that is only slightly larger than one, because the gap that we start with in the Gap-3SAT problem is already pretty small. To achieve larger gaps, our actual reduction starts from a generalization of 3SAT, called constraint satisfaction problems (CSPs), whose gap problems remain hard even for very large gaps. This concludes the outline of the main intuitions behind our reductions. The detailed proofs are given in Section 3.
1.5 Notation
For $n \in \mathbb{N}$, we denote $[n] = \{1, \ldots, n\}$. We will use small boldface characters for vectors and capital boldface characters for matrices. For a vector $\mathbf{x} \in \mathbb{R}^d$ and $i \in [d]$, $x_i$ denotes the $i$-th coordinate of $\mathbf{x}$, and $\|\mathbf{x}\|_2 = (\sum_{i=1}^d x_i^2)^{1/2}$ denotes the $\ell_2$-norm of $\mathbf{x}$. We will use $\langle \mathbf{x}, \mathbf{y} \rangle$ for the inner product between $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$. For a matrix $\mathbf{M} \in \mathbb{R}^{d \times d}$, we will denote by $\|\mathbf{M}\|_2$ its spectral norm and by $\mathrm{tr}(\mathbf{M})$ its trace. Let $\mathbb{B}_d$ be the unit ball and $\mathbb{S}_{d-1}$ be the unit sphere in $\mathbb{R}^d$.
An origin-centered halfspace is a Boolean-valued function of the form $h_{\mathbf{w}}(\mathbf{x}) = \mathrm{sign}(\langle \mathbf{w}, \mathbf{x} \rangle)$, where $\mathbf{w} \in \mathbb{R}^d$. (Note that we may assume w.l.o.g. that $\|\mathbf{w}\|_2 = 1$.) Let $\mathcal{H}$ denote the class of all origin-centered halfspaces on $\mathbb{R}^d$. Finally, we use $\mathbf{e}_i$ to denote the $i$-th standard basis vector, i.e., the vector whose $i$-th coordinate is one and whose remaining coordinates are zero.
2 Efficient Proper Agnostic Learning of Halfspaces with a Margin
2.1 Warm-Up: Basic Algorithm
In this subsection, we present a basic algorithm that achieves a constant factor approximation and whose runtime is slower than that of our main algorithm. Despite its slow runtime, this algorithm serves as a warm-up for our more sophisticated constant factor approximation algorithm in the next subsection.
We begin by establishing a basic structural property of this setting which motivates our basic algorithm, starting with the following simple claim:
Let $\mathbf{M} = \mathbf{E}_{(\mathbf{x}, y) \sim D}[\mathbf{x} \mathbf{x}^T]$ and let $\mathbf{w}^*$ be a unit vector such that $\mathrm{err}_{\gamma}^D(\mathbf{w}^*) = \mathrm{OPT}_{\gamma}^D$. Then, we have that $\|\mathbf{M}\|_2 \geq \gamma^2 (1 - \mathrm{OPT}_{\gamma}^D)$.
By assumption, $\Pr_{(\mathbf{x}, y) \sim D}[|\langle \mathbf{w}^*, \mathbf{x} \rangle| > \gamma] \geq 1 - \mathrm{OPT}_{\gamma}^D$, which implies that $\mathbf{E}[\langle \mathbf{w}^*, \mathbf{x} \rangle^2] \geq \gamma^2 (1 - \mathrm{OPT}_{\gamma}^D)$. The claim follows from the fact that $\mathbf{E}[\langle \mathbf{v}, \mathbf{x} \rangle^2] = \mathbf{v}^T \mathbf{M} \mathbf{v}$, for any $\mathbf{v} \in \mathbb{R}^d$, and the definition of the spectral norm. ∎
Claim 2.1 allows us to obtain an approximation to the optimal halfspace by projecting $\mathbf{w}^*$ on the space of large eigenvalues of $\mathbf{M}$. We will need the following terminology: For $\lambda > 0$, let $V_{\lambda}$ be the space spanned by the eigenvectors of $\mathbf{M}$ whose eigenvalues have magnitude at least $\lambda$, and let $V_{\lambda}^{\perp}$ be its complement. Let $\mathrm{proj}_V(\mathbf{w})$ denote the projection of the vector $\mathbf{w}$ on the subspace $V$. Then, we have the following:
Let $\lambda > 0$ and $\mathbf{w}' = \mathrm{proj}_{V_{\lambda}}(\mathbf{w}^*)$. Then, we have that $\mathrm{err}_{\gamma/2}^D(\mathbf{w}') \leq \mathrm{OPT}_{\gamma}^D + 4\lambda/\gamma^2$.
Let $\mathbf{w}^* = \mathbf{w}' + \mathbf{w}^{\perp}$, where $\mathbf{w}^{\perp} = \mathrm{proj}_{V_{\lambda}^{\perp}}(\mathbf{w}^*)$. Observe that for any $(\mathbf{x}, y)$, if $y \langle \mathbf{w}^*, \mathbf{x} \rangle > \gamma$ then $y \langle \mathbf{w}', \mathbf{x} \rangle > \gamma/2$, unless $|\langle \mathbf{w}^{\perp}, \mathbf{x} \rangle| > \gamma/2$. Hence, $\mathrm{err}_{\gamma/2}^D(\mathbf{w}') \leq \mathrm{OPT}_{\gamma}^D + \Pr[|\langle \mathbf{w}^{\perp}, \mathbf{x} \rangle| > \gamma/2]$. By definition of $V_{\lambda}^{\perp}$ and $\mathbf{w}^{\perp}$, we have that $\mathbf{E}[\langle \mathbf{w}^{\perp}, \mathbf{x} \rangle^2] \leq \lambda$. By Markov’s inequality, we thus obtain $\Pr[|\langle \mathbf{w}^{\perp}, \mathbf{x} \rangle| > \gamma/2] \leq 4\lambda/\gamma^2$, completing the proof of the lemma. ∎
Motivated by Lemma 2.2, the idea is to enumerate over candidate vectors in $V_{\lambda}$, for a suitable $\lambda$, and output a vector with smallest empirical $\gamma/2$-margin error. To turn this into an actual algorithm, we work with a finite sample set and enumerate over an appropriate cover of the space $V_{\lambda}$. The pseudocode is given in Algorithm 1.
First, we analyze the runtime of our algorithm. The SVD of the empirical second moment matrix can be computed in time polynomial in $d$ and the sample size. Note that $V_{\lambda}$ has dimension at most $1/\lambda$. This follows from the fact that $\mathbf{M}$ is PSD and its trace is at most $1$, where we used that $\|\mathbf{x}\|_2 \leq 1$ with probability $1$ over $D$. Therefore, the unit sphere of $V_{\lambda}$ has a sufficiently fine cover of size $2^{O((1/\lambda) \log(1/\gamma))}$ that can be computed in output polynomial time.
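In the plane, such a cover of the unit sphere is easy to construct explicitly. The following sketch (the spacing formula is a standard chord-length calculation, not taken from the paper) builds a $\delta$-cover of the unit circle and checks the covering property; in dimension $k$, analogous nets have size $(O(1/\delta))^k$.

```python
import math, random

def sphere_cover_2d(delta):
    """A delta-cover of the unit circle: every unit vector is within
    l2-distance delta of some cover point; size is O(1/delta)."""
    # Adjacent cover points at angle 2*pi/N; a point at most pi/N away in
    # angle is at chord distance 2*sin(pi/(2N)) <= delta for this N.
    N = math.ceil(math.pi / (2 * math.asin(delta / 2)))
    return [(math.cos(2 * math.pi * i / N), math.sin(2 * math.pi * i / N))
            for i in range(N)]

cover = sphere_cover_2d(0.2)
random.seed(3)
for _ in range(100):
    t = random.uniform(0, 2 * math.pi)
    u = (math.cos(t), math.sin(t))
    assert min(math.dist(u, c) for c in cover) <= 0.2
```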
We now prove correctness. The main idea is to apply Lemma 2.2 for the empirical distribution combined with the following statistical bound:
Let $S$ be a multiset of sufficiently many i.i.d. samples from $D$, and let $\widehat{D}$ be the empirical distribution on $S$. Then, with probability at least $1 - \delta$ over $S$, simultaneously for all unit vectors $\mathbf{w}$ and all margins, if the empirical margin error of $\mathbf{w}$ is small, then the corresponding population margin error (at a slightly smaller margin) is small as well.
We proceed with the formal proof. First, we claim that with probability at least $1 - \delta$ over $S$, we have that $\mathrm{err}_{\gamma}^{\widehat{D}}(\mathbf{w}^*) \leq \mathrm{OPT}_{\gamma}^D + \epsilon$. To see this, note that $|S| \cdot \mathrm{err}_{\gamma}^{\widehat{D}}(\mathbf{w}^*)$ can be viewed as a sum of Bernoulli random variables with expectation $\mathrm{OPT}_{\gamma}^D$. Hence, the claim follows by a Chernoff bound. By an argument similar to that of Lemma 2.2, we have that the projection of $\mathbf{w}^*$ onto the large-eigenvalue subspace of the empirical second moment matrix has small empirical $\gamma/2$-margin error. Indeed, we can write $\mathbf{w}^* = \mathbf{w}' + \mathbf{w}^{\perp}$, where $\mathbf{w}' = \mathrm{proj}_{V_{\lambda}}(\mathbf{w}^*)$, and follow the same argument.
2.2 Main Algorithm: Near-Optimal Constant Factor Approximation
In this section, we establish the following theorem, which gives the upper bound part of Theorem 1.1:
Fix $\gamma, \epsilon > 0$. There is an algorithm that uses $\tilde{O}(1/(\gamma^2 \epsilon^2))$ samples, runs in time $2^{\tilde{O}(1/\gamma^2)} \cdot \mathrm{poly}(d, 1/\epsilon)$, and is a constant factor agnostic proper learner for $\gamma$-margin halfspaces with high constant confidence probability.
Our algorithm in this section produces a finite set of candidate weight vectors and outputs the one with the smallest empirical $\gamma/2$-margin error. For the sake of this intuitive description, we will assume that the algorithm knows the distribution $D$ in question supported on $\mathbb{B}_d \times \{\pm 1\}$. By assumption, there is a unit vector $\mathbf{w}^*$ so that $\mathrm{err}_{\gamma}^D(\mathbf{w}^*) = \mathrm{OPT}_{\gamma}^D$.
We note that if a hypothesis defined by a vector $\mathbf{w}$ has $\gamma/2$-margin error substantially larger than $\mathrm{OPT}_{\gamma}^D$, then there must be a large number of points correctly classified with $\gamma$-margin by $\mathbf{w}^*$, but not correctly classified with $\gamma/2$-margin by $\mathbf{w}$. For all of these points $(\mathbf{x}, y)$, we must have that $y \langle \mathbf{w}^* - \mathbf{w}, \mathbf{x} \rangle > \gamma/2$. This implies that the margin-misclassified points of $\mathbf{w}$ have a large covariance in the $\mathbf{w}^* - \mathbf{w}$ direction. In particular, we have:
Let $\mathbf{w}$ be such that $\mathrm{err}_{\gamma/2}^D(\mathbf{w})$ is sufficiently larger than $\mathrm{OPT}_{\gamma}^D$. Let $D'$ be $D$ conditioned on $y \langle \mathbf{w}, \mathbf{x} \rangle \leq \gamma/2$. Let $\mathbf{M}' = \mathbf{E}_{(\mathbf{x}, y) \sim D'}[\mathbf{x} \mathbf{x}^T]$. Then $(\mathbf{w}^* - \mathbf{w})^T \mathbf{M}' (\mathbf{w}^* - \mathbf{w}) = \Omega(\gamma^2)$.
We claim that with probability at least a positive constant over $D'$ we have that $y \langle \mathbf{w}^*, \mathbf{x} \rangle > \gamma$ and $y \langle \mathbf{w}, \mathbf{x} \rangle \leq \gamma/2$. To see this, we first note that $y \langle \mathbf{w}, \mathbf{x} \rangle \leq \gamma/2$ holds by definition of $D'$. Moreover, since the $\gamma/2$-margin error of $\mathbf{w}$ is sufficiently larger than $\mathrm{OPT}_{\gamma}^D$, a union bound shows that the $D'$-probability that additionally $y \langle \mathbf{w}^*, \mathbf{x} \rangle > \gamma$ is at least a positive constant. Therefore, with constant probability over $D'$ we have that $y \langle \mathbf{w}^* - \mathbf{w}, \mathbf{x} \rangle > \gamma/2$, which implies that $\langle \mathbf{w}^* - \mathbf{w}, \mathbf{x} \rangle^2 > \gamma^2/4$. Thus, $(\mathbf{w}^* - \mathbf{w})^T \mathbf{M}' (\mathbf{w}^* - \mathbf{w}) = \mathbf{E}_{(\mathbf{x}, y) \sim D'}[\langle \mathbf{w}^* - \mathbf{w}, \mathbf{x} \rangle^2] = \Omega(\gamma^2)$, completing the proof. ∎
Claim 2.5 says that $\mathbf{w}^* - \mathbf{w}$ has a large component on the large eigenvalues of $\mathbf{M}'$. Building on this claim, we obtain the following result:
Let $\mathbf{w}$ and $\mathbf{M}'$ be as in Claim 2.5. There exists a positive integer $t$ so that if $W$ is the span of the top $t$ eigenvectors of $\mathbf{M}'$, we have that the projection of $\mathbf{w}^* - \mathbf{w}$ onto $W$ has non-trivially large squared norm.
Note that the matrix is PSD and let be its set of eigenvalues. We will denote by the space spanned by the eigenvectors of corresponding to eigenvalues of magnitude at least . Let be the dimension of , i.e., the number of with . Since is supported on the unit ball, for , we have that . Since is PSD, we have that and we can write
where the last equality follows by changing the order of the summation and the integration. If the projection of onto the -th eigenvector of has -norm , we have that
where the first inequality uses Claim 2.5, the first equality follows by the Pythagorean theorem, and the last equality follows by changing the order of the summation and the integration.
Lemma 2.6 suggests a method for producing an approximation to $\mathbf{w}^*$, or more precisely a vector that has empirical $\gamma/2$-margin error within our target bound. We start by describing a non-deterministic procedure, which we will then turn into an actual algorithm.
The method proceeds in a sequence of stages. At stage $i$, we have a hypothesis weight vector $\mathbf{w}_i$. (At stage $0$, we start with $\mathbf{w}_0 = \mathbf{0}$.) At any stage $i$, if the $\gamma/2$-margin error of $\mathbf{w}_i$ is small enough, then $\mathbf{w}_i$ is a sufficiently good hypothesis. Otherwise, we consider the matrix $\mathbf{M}_i = \mathbf{E}_{(\mathbf{x}, y) \sim D_i}[\mathbf{x} \mathbf{x}^T]$, where $D_i$ is $D$ conditioned on $y \langle \mathbf{w}_i, \mathbf{x} \rangle \leq \gamma/2$. By Lemma 2.6, we know that for some positive integer value $t_i$, the projection of $\mathbf{w}^* - \mathbf{w}_i$ onto the span $W_i$ of the top $t_i$ eigenvectors of $\mathbf{M}_i$ has non-trivially large squared norm.
Let $\mathbf{v}_i$ be this projection. We set $\mathbf{w}_{i+1} = \mathbf{w}_i + \mathbf{v}_i$. Since $\mathbf{v}_i$ and $(\mathbf{w}^* - \mathbf{w}_i) - \mathbf{v}_i$ are orthogonal, we have
$\|\mathbf{w}^* - \mathbf{w}_{i+1}\|_2^2 = \|\mathbf{w}^* - \mathbf{w}_i\|_2^2 - \|\mathbf{v}_i\|_2^2 \leq \|\mathbf{w}^* - \mathbf{w}_i\|_2^2 - c, \qquad (3)$
where the inequality uses the fact that $\|\mathbf{v}_i\|_2^2 \geq c$ for a positive quantity $c$ (as follows from Lemma 2.6). Let $N$ be the total number of stages. We can write
$0 \leq \|\mathbf{w}^* - \mathbf{w}_N\|_2^2 = \|\mathbf{w}^* - \mathbf{w}_0\|_2^2 - \sum_{i=0}^{N-1} \|\mathbf{v}_i\|_2^2 \leq 1 - N c,$
where the first inequality uses that squared norms are non-negative and $\|\mathbf{w}^* - \mathbf{w}_0\|_2 = \|\mathbf{w}^*\|_2 = 1$, the second step notes the telescoping sum, and the third uses (3). We thus have that $N \leq 1/c$. Therefore, the above procedure terminates after at most $1/c$ stages at some $\mathbf{w}_N$ with small $\gamma/2$-margin error.
We now describe how to turn the above procedure into an actual algorithm. Our algorithm tries to simulate the above described procedure by making appropriate guesses. In particular, we start by guessing a sequence of positive integers $t_0, t_1, \ldots$ whose sum is bounded; if the sum is at most $k$, this can be done in $2^k - 1$ ways. Next, given this sequence, our algorithm guesses the vectors $\mathbf{v}_i$ over all stages in order. In particular, given $\mathbf{w}_i$, the algorithm computes the matrix $\mathbf{M}_i$ and the subspace $W_i$, and guesses the projection $\mathbf{v}_i$ of $\mathbf{w}^* - \mathbf{w}_i$ onto $W_i$, which then gives $\mathbf{w}_{i+1}$. Of course, we cannot expect our algorithm to guess $\mathbf{v}_i$ exactly (as there are infinitely many points in $W_i$), but we can guess it to within small $\ell_2$-error, by taking an appropriate net. This involves an additional guess of bounded size in each stage. In total, our algorithm makes exponentially many (in roughly $1/\gamma^2$) different guesses.
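The first guessing step is a small combinatorial enumeration: sequences of positive integers with bounded sum are exactly compositions, and there are $2^k - 1$ of them with sum at most $k$. A sketch (the function name is ours):

```python
def sequences_with_sum_at_most(k):
    """All sequences of positive integers whose sum is at most k. There are
    2^(s-1) compositions of each s, so 2^k - 1 sequences in total."""
    out = []
    def rec(prefix, remaining):
        if prefix:
            out.append(list(prefix))
        for t in range(1, remaining + 1):
            prefix.append(t)
            rec(prefix, remaining - t)
            prefix.pop()
    rec([], k)
    return out

print(len(sequences_with_sum_at_most(4)))  # 15 = 2^4 - 1
```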
We note that the sample version of our algorithm is essentially identical to the idealized version described above, by replacing the distribution by its empirical version and leveraging Fact 2.3.
The pseudo-code is given in Algorithm 2 below.