The Hardest Halfspace

02/05/2019 ∙ by Alexander A. Sherstov, et al.

We study the approximation of halfspaces h:{0,1}^n→{0,1} in the infinity norm by polynomials and rational functions of any given degree. Our main result is an explicit construction of the "hardest" halfspace, for which we prove polynomial and rational approximation lower bounds that match the trivial upper bounds achievable for all halfspaces. This completes a lengthy line of work started by Myhill and Kautz (1961). As an application, we construct a communication problem that achieves essentially the largest possible separation, of O(n) versus 2^-Ω(n), between the sign-rank and discrepancy. Equivalently, our problem exhibits a gap of log n versus Ω(n) between the communication complexity with unbounded versus weakly unbounded error, improving quadratically on previous constructions and completing a line of work started by Babai, Frankl, and Simon (FOCS 1986). Our results further generalize to the k-party number-on-the-forehead model, where we obtain an explicit separation of log n versus Ω(n/4^k) for communication with unbounded versus weakly unbounded error. This gap is a quadratic improvement on previous work and matches the state of the art for number-on-the-forehead lower bounds.

1. Introduction

Representations of Boolean functions by real polynomials play a central role in theoretical computer science. The notion of approximating a Boolean function $f\colon\{0,1\}^n\to\{-1,+1\}$ pointwise by polynomials of given degree has been particularly fruitful. Formally, let $E(f,d)$ denote the minimum error in an infinity-norm approximation of $f$ by a real polynomial of degree at most $d$:

$$E(f,d)=\min_{p\,:\,\deg p\le d}\ \max_{x\in\{0,1\}^n}|f(x)-p(x)|.$$

This quantity clearly ranges between $0$ and $1$ for any function $f\colon\{0,1\}^n\to\{-1,+1\}$. In more detail, we have $0=E(f,n)\le E(f,n-1)\le\cdots\le E(f,0)\le1$, where the first equality holds because any such $f$ is representable exactly by a polynomial of degree at most $n$. The study of the polynomial approximation of Boolean functions dates back to the pioneering work in the 1960s by Myhill and Kautz [59] and Minsky and Papert [57]. This line of research has grown remarkably over the decades, with numerous connections discovered to other subjects in theoretical computer science. Lower bounds for polynomial approximation have complexity-theoretic applications, whereas upper bounds are a tool in algorithm design. In the former category, polynomial approximation has enabled significant progress in circuit complexity [17, 10, 48, 49, 73, 15], quantum query complexity [13, 1, 7, 23], and communication complexity [20, 65, 22, 73, 75, 66, 52, 26, 70, 15, 79, 78]. On the algorithmic side, polynomial approximation underlies many of the strongest results obtained to date in computational learning [82, 45, 44, 37, 61, 8], differentially private data release [84, 25], and algorithm design in general [55, 36, 72].
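The exact-representation claim above can be checked by hand on small examples. The following is a minimal sketch, not part of the original text: it computes the multilinear expansion of a function on $\{0,1\}^n$ by Möbius inversion over subsets and confirms that the resulting polynomial of degree at most $n$ agrees with the function everywhere. The helper names and the choice of a 3-bit majority with range $\{-1,+1\}$ (with $-1$ read as "true") are illustrative assumptions.

```python
from itertools import product

def multilinear_coefficients(f, n):
    """Return {S: c_S} such that f(x) = sum_S c_S * prod_{i in S} x_i on {0,1}^n."""
    coeffs = {}
    for mask in product((0, 1), repeat=n):
        S = tuple(i for i in range(n) if mask[i])
        # Moebius inversion: c_S = sum over T subset of S of (-1)^(|S|-|T|) * f(1_T)
        total = 0
        for sub in product((0, 1), repeat=len(S)):
            x = [0] * n
            for bit, i in zip(sub, S):
                x[i] = bit
            total += (-1) ** (len(S) - sum(sub)) * f(tuple(x))
        coeffs[S] = total
    return coeffs

def evaluate(coeffs, x):
    """Evaluate the multilinear polynomial at a point x in {0,1}^n."""
    return sum(c for S, c in coeffs.items() if all(x[i] for i in S))

def majority(x):
    # Hypothetical example function; range {-1,+1}, with -1 meaning "true".
    return -1 if 2 * sum(x) > len(x) else 1

n = 3
coeffs = multilinear_coefficients(majority, n)
max_error = max(abs(majority(x) - evaluate(coeffs, x))
                for x in product((0, 1), repeat=n))
degree = max(len(S) for S, c in coeffs.items() if c != 0)
print(max_error, degree)
```

Because all arithmetic is on integers, the reported error is exact rather than a floating-point estimate.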

1.1. The hardest halfspace

Myhill and Kautz’s work [59] six decades ago, and many of the papers that followed [59, 58, 81, 62, 16, 33, 76, 77, 83], focused on halfspaces. Also known as a linear threshold function, a halfspace is any function $h\colon\{0,1\}^n\to\{-1,+1\}$ representable as $h(x)=\operatorname{sgn}(a_1x_1+a_2x_2+\cdots+a_nx_n-\theta)$ for some fixed reals $a_1,a_2,\ldots,a_n,\theta$. The fundamental question taken up in this line of research is: how well can halfspaces be approximated by polynomials of given degree? An early finding, due to Muroga [58], was the upper bound

(1.1)

for every halfspace $h$ in $n$ variables. In words, every halfspace can be approximated pointwise by a linear polynomial to error just barely smaller than the trivial bound of $1$. Many authors pursued matching lower bounds on $E(h,1)$ for specific halfspaces $h$, culminating in an explicit construction by Håstad [33] that matches Muroga’s bound (1.1).

The study of for proved to be challenging. For a long time, essentially the only result was the lower bound due to Beigel [16], where is the so-called odd-max-bit halfspace. Paturi [62] proved the incomparable lower bound , where is the majority function on bits. Much later, the bound was obtained in [76] for an explicit halfspace. This fragmented state of affairs persisted until the question was resolved completely in [77], with an existence proof of a halfspace such that for This result is clearly as strong as one could hope for, since it essentially matches Muroga’s upper bound for approximation by linear polynomials. The work in [77] further determined the minimum error, denoted , to which this can be approximated by a degree- rational function, showing that this quantity too is as large for as it can be for any halfspace. Explicitly constructing a halfspace with these properties is our main technical contribution:

Theorem 1.1.

There is an algorithm that takes as input an integer $n\ge1$, runs in time polynomial in $n$, and outputs a halfspace $h_n\colon\{0,1\}^n\to\{-1,+1\}$ with

where is an absolute constant.

Classic bounds for the approximation of the sign function imply that for any the lower bounds in Theorem 1.1 are essentially the best possible for any halfspace on variables (see Sections 5.1 and 5.2 for details). Thus, the construction of Theorem 1.1 is the “hardest” halfspace from the point of view of approximation by polynomials and rational functions.

Theorem 1.1 is not a de-randomization of the existence proof in [77], which incidentally we are still unable to de-randomize. Rather, it is based on a new and simpler approach, presented in detail at the end of this section. Given the role that halfspaces play in theoretical computer science, we see Theorem 1.1 as answering a basic question of independent interest. In addition, Theorem 1.1 has applications to communication complexity and computational learning, which we now discuss.

1.2. Discrepancy vs. sign-rank

Consider the standard model of randomized communication [50], which features players Alice and Bob and a Boolean function $F\colon X\times Y\to\{-1,+1\}$. On input $(x,y)\in X\times Y$, Alice and Bob receive the arguments $x$ and $y$, respectively. Their objective is to compute $F$ on any given input with minimal communication. To this end, each player privately holds an unlimited supply of uniformly random bits, which he or she can use in deciding what message to send at any given point in the protocol. The cost of a protocol is the total number of bits exchanged by Alice and Bob in a worst-case execution. The $\epsilon$-error randomized communication complexity of $F$, denoted $R_\epsilon(F)$, is the least cost of a protocol that computes $F$ with probability of error at most $\epsilon$ on every input.

Our interest in this paper is in communication protocols with error probability close to that of random guessing, $1/2$. There are two standard ways to define the complexity of a function $F$ in this setting, both inspired by probabilistic polynomial time for Turing machines [31]:

$$\operatorname{UPP}(F)=\inf_{0\le\epsilon<1/2}R_\epsilon(F)$$

and

$$\operatorname{PP}(F)=\inf_{0\le\epsilon<1/2}\left\{R_\epsilon(F)+\log_2\frac{1}{\frac12-\epsilon}\right\}.$$

The former quantity, introduced by Paturi and Simon [63], is called the communication complexity of $F$ with unbounded error, in reference to the fact that the error probability can be arbitrarily close to $1/2$. The latter quantity, proposed by Babai et al. [11], includes an additional penalty term that depends on the error probability. We refer to $\operatorname{PP}(F)$ as the communication complexity of $F$ with weakly unbounded error. For all functions $F\colon\{0,1\}^n\times\{0,1\}^n\to\{-1,+1\}$, one has the trivial bounds $\operatorname{UPP}(F)\le\operatorname{PP}(F)\le n+2$. These two complexity measures give rise to corresponding complexity classes in communication complexity theory, defined in the seminal paper of Babai et al. [11]. Formally, $\mathsf{UPP}$ is the class of families $\{F_n\}_{n=1}^{\infty}$ of communication problems $F_n\colon\{0,1\}^n\times\{0,1\}^n\to\{-1,+1\}$ whose unbounded-error communication complexity is at most polylogarithmic in $n$. Its counterpart $\mathsf{PP}$ is defined analogously for the complexity measure $\operatorname{PP}$.

These two models of large-error communication are synonymous with two central notions in communication complexity: sign-rank and discrepancy, defined formally in Sections 2.8 and 2.9. In more detail, Paturi and Simon [63] proved that the communication complexity of any problem with unbounded error is characterized up to an additive constant by the sign-rank of its communication matrix, Analogously, Klauck [40, 41] showed that the communication complexity of any problem with weakly unbounded error is essentially characterized in terms of the discrepancy of . Discrepancy and sign-rank enjoy a rich mathematical life [54, 71, 74, 56] outside communication complexity, which further motivates the study of and as fundamental complexity classes.

Communication with weakly unbounded error is by definition no more powerful than unbounded-error communication, and for twenty years after the paper of Babai et al. [11] it was unknown whether this containment is proper. Buhrman et al. [22] and the author [71] answered this question in the affirmative, independently and with unrelated techniques. These papers exhibited functions with an exponential gap between communication complexity with unbounded error versus weakly unbounded error: in both works, versus in [22] and in [71]. In complexity-theoretic notation, these results show that . A simpler alternate proof of the result of Buhrman et al. [22] was given in [75] using the pattern matrix method. More recently, Thaler [83] exhibited another, remarkably simple communication problem with communication complexity and

To summarize, the strongest explicit separation of communication complexity with unbounded versus weakly unbounded error prior to our work was the separation of  versus  from twelve years ago [71]. The existence of a communication problem with a quadratically larger gap, of  versus , follows from the work in [77]. This state of affairs parallels other instances in communication complexity, such as the versus question in multiparty communication [14], where the best existential separations are much stronger than the best explicit ones. There is considerable interest in communication complexity in explicit separations because they provide a deeper and more complete understanding of the complexity classes, whereas the lack of a strong explicit separation indicates a basic gap in our knowledge. As an application of Theorem 1.1, we obtain:

Theorem 1.2.

There is a communication problem defined by

(1.2)

for some explicitly given reals such that

Moreover,

Theorem 1.2 gives essentially the strongest possible separation of the communication classes and , improving quadratically on previous constructions and matching the previous nonconstructive separation. Another compelling aspect of the theorem is the simple form (1.2) of the communication problem in question. The last two bounds in Theorem 1.2 state that has sign-rank at most and discrepancy , which is essentially the strongest possible separation. The best previous construction [71] achieved sign-rank and discrepancy .

We further generalize Theorem 1.2 to the number-on-the-forehead -party model, the standard formalism of multiparty communication. Analogous to two-party communication, the -party model has its own classes and of problems solvable efficiently by protocols with unbounded error and weakly unbounded error, respectively. Their formal definitions can be found in Section 2.8. In this setting, we prove:

Theorem 1.3.

There is a -party communication problem defined by

for some explicitly given reals such that

Theorem 1.3 gives essentially the strongest possible explicit separation of the -party communication complexity classes and for up to parties, where is an arbitrary constant. The previous best explicit separation [27, 80] of these classes was quadratically weaker, with communication complexity for unbounded error and for weakly unbounded error. The communication lower bound in Theorem 1.3 reflects the state of the art in the area, in that the strongest lower bound for any explicit communication problem to date is due to Babai et al. [12].

1.3. Computational learning

A sign-representing polynomial for a given function $f\colon\{0,1\}^n\to\{-1,+1\}$ is any real polynomial $p$ such that $f(x)=\operatorname{sgn}p(x)$ for all $x\in\{0,1\}^n$. The minimum degree of a sign-representing polynomial for $f$ is called the threshold degree of $f$, denoted $\deg_\pm(f)$. Clearly $\deg_\pm(f)\le n$ for every Boolean function $f$ on $n$ variables. The reader can further verify that sign-representation is equivalent to pointwise approximation with error strictly less than, but arbitrarily close to, the trivial error of $1$. Sign-representing polynomials are appealing from a learning standpoint because they immediately lead to efficient learning algorithms. Indeed, any function of threshold degree $d$ is by definition the sign of a linear combination of $N=\binom{n}{0}+\binom{n}{1}+\cdots+\binom{n}{d}$ monomials and can thus be viewed as a halfspace in $N$ dimensions. As a result, $f$ can be PAC learned [86] under arbitrary distributions in time polynomial in $N$ using a variety of halfspace learning algorithms.
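The reduction from threshold degree to halfspace learning can be made concrete with a small sketch. The code below is illustrative only: it expands an input into the values of all monomials of degree at most $d$ and checks that a function of threshold degree $2$ (two-bit XOR, with $-1$ read as "true") becomes a halfspace over those features. The feature ordering and the particular weight vector are assumptions chosen for the example.

```python
from itertools import combinations, product
from math import comb

def monomial_features(x, d):
    """Values of all multilinear monomials of degree at most d at x in {0,1}^n."""
    n = len(x)
    feats = []
    for k in range(d + 1):
        for S in combinations(range(n), k):
            feats.append(1 if all(x[i] for i in S) else 0)
    return feats

def sgn(t):
    return (t > 0) - (t < 0)

def xor(x):
    # Range {-1,+1}, with -1 meaning "true" (an assumed convention).
    return -1 if x[0] != x[1] else 1

n, d = 2, 2
# Number of monomials of degree at most d, i.e. the dimension of the halfspace.
N = sum(comb(n, k) for k in range(d + 1))

# p(x) = 1 - 2*x1 - 2*x2 + 4*x1*x2 sign-represents XOR.
# Weight order matches monomial_features: constant, x1, x2, x1*x2.
weights = [1, -2, -2, 4]

agree = all(
    sgn(sum(w * phi for w, phi in zip(weights, monomial_features(x, d)))) == xor(x)
    for x in product((0, 1), repeat=n)
)
print(N, agree)
```

Any halfspace learner run on the expanded features then learns the original function, at a cost polynomial in the number of features.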

The study of sign-representing polynomials started fifty years ago with the seminal monograph of Minsky and Papert [57], who examined the threshold degree of several common functions. Since then, the threshold degree approach has yielded the fastest known PAC learning algorithms for notoriously hard concept classes, including DNF formulas [45] and AND-OR trees [8]. Conspicuously absent from this list of success stories is the concept class of intersections of halfspaces. While solutions are known to several restrictions of this learning problem [18, 51, 87, 9, 44, 46, 43], no algorithm has been discovered for PAC learning the intersection of even two halfspaces in time faster than Known hardness results, on the other hand, only apply to polynomially many halfspaces or to proper learning, e.g., [19, 3, 47, 39].

This state of affairs has motivated a quest to determine the threshold degree of the intersection of two halfspaces [57, 61, 42, 76, 77]. Prior to our work, the best lower bound was for an explicit intersection of two halfspaces [76], complemented by a tight but highly nonconstructive lower bound [77]. Using Theorem 1.1, we prove:

Theorem 1.4.

There is an explicitly given halfspace such that

The symbol above stands for the intersection of two copies of on disjoint sets of variables. In other words, Theorem 1.4 constructs an explicit intersection of two halfspaces whose threshold degree is asymptotically maximal, While the nonconstructive lower bound of [77] already ruled out the threshold degree approach as a way to learn intersections of halfspaces, we see Theorem 1.4 as contributing a key qualitative piece of the puzzle. Specifically, it constructs a small and simple family of intersections of two halfspaces that are off-limits to all known algorithmic approaches (namely, the family obtained by applying to different subsets of the variables ).

1.4. Proof overview

Our solution has two main components: the construction of a sparse set of integers that appear random modulo and the univariatization of a multivariate Boolean function. We describe each of these components in detail.

Discrepancy of integer sets.

Let $m\ge2$ be a given integer. Key to our work is the notion of $m$-discrepancy, which quantifies the pseudorandomness or aperiodicity modulo $m$ of any given multiset of integers. It is largely unrelated to the notion of discrepancy in communication complexity (Section 1.2). Formally, the $m$-discrepancy of a nonempty multiset $Z$ of integers is defined as

$$\operatorname{disc}_m(Z)=\max_{k=1,2,\ldots,m-1}\left|\frac{1}{|Z|}\sum_{z\in Z}\omega^{kz}\right|,$$

where $\omega$ is a primitive $m$-th root of unity. This fundamental quantity arises in combinatorics and theoretical computer science, e.g., [30, 69, 2, 38, 64, 5]. The identity $1+\omega+\omega^2+\cdots+\omega^{m-1}=0$ for any $m$-th root of unity $\omega\ne1$ implies that the set $\{0,1,2,\ldots,m-1\}$ achieves the smallest possible $m$-discrepancy: $\operatorname{disc}_m(\{0,1,2,\ldots,m-1\})=0$. Much sparser sets with small $m$-discrepancy can be shown to exist using the probabilistic method (Fact 3.3 and Corollary 3.4). Specifically, one easily verifies for any constant $\epsilon>0$ the existence of a set with $m$-discrepancy at most $\epsilon$ and cardinality $O(\log m)$, an exponential improvement in sparsity compared to the trivial set $\{0,1,2,\ldots,m-1\}$. We are aware of two efficient constructions of sparse sets with small $m$-discrepancy, due to Ajtai et al. [2] and Katz [38]. The approach of Ajtai et al. is elementary except for an appeal to the prime number theorem, whereas Katz’s construction relies on deep results in number theory. Neither work appears to directly imply the kind of optimal de-randomization that we require, namely, an algorithm that runs in time polynomial in $\log m$ and produces a multiset of cardinality $O(\log m)$ with $m$-discrepancy bounded away from $1$. We obtain such an algorithm by adapting the approach of Ajtai et al. [2].
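For experimentation, the discrepancy of a small multiset can be computed directly. The sketch below assumes the definition to be the largest normalized exponential sum, $\max_{k=1,\ldots,m-1}\bigl|\frac{1}{|Z|}\sum_{z\in Z}\omega^{kz}\bigr|$ with $\omega=e^{2\pi i/m}$; the function name is hypothetical, and the brute-force loop over all $k$ takes time linear in $m$, so it is only suitable for small moduli and is not the efficient construction discussed in the text.

```python
import cmath

def m_discrepancy(Z, m):
    """Brute-force m-discrepancy of a nonempty multiset Z of integers.

    Assumed definition: max over k = 1..m-1 of |(1/|Z|) * sum_{z in Z} w^(k*z)|,
    where w = exp(2*pi*i/m). The value does not depend on which primitive
    m-th root of unity is used, because k ranges over all nonzero residues.
    """
    Z = list(Z)
    if not Z or m < 2:
        raise ValueError("need a nonempty multiset and a modulus m >= 2")
    worst = 0.0
    for k in range(1, m):
        s = sum(cmath.exp(2j * cmath.pi * ((k * z) % m) / m) for z in Z)
        worst = max(worst, abs(s) / len(Z))
    return worst

full_set = m_discrepancy(range(7), 7)       # the trivial set {0, 1, ..., m-1}
constant = m_discrepancy([3, 3, 3], 7)      # a maximally "periodic" multiset
two_points = m_discrepancy([0, 1], 4)
print(full_set, constant, two_points)
```

The trivial set gives a value of zero up to rounding, while a multiset concentrated on a single residue gives the largest possible value of one.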

The centerpiece of the construction of Ajtai et al. [2] is what the authors call the iteration lemma, stated in this paper as Theorem 3.6. Its role is to reduce the construction of a sparse set with small -discrepancy to the construction of sparse sets with small -discrepancy, for primes Ajtai et al. [2] proved their iteration lemma for prime, but we show that their argument readily generalizes to arbitrary moduli . By applying the iteration lemma in a recursive manner, one reaches smaller and smaller primes. The authors of [2] continue this recursive process until they reach primes so small that the trivial construction can be considered sparse. We proceed differently and terminate the recursion after just two stages, at which point the input size is small enough for brute force search based on the probabilistic method. The final set that we construct has size logarithmic in and -discrepancy a small constant, as opposed to the superlogarithmic size and discrepancy in the work of Ajtai et al. [2].

We note that this modified approach additionally gives the first explicit circulant expander on vertices of degree which is optimal and improves on the previous best degree bound of due to Ajtai et al. [2]. Background on circulant expanders, and the details of our expander construction, can be found in Section 5.6.

Univariatization.

We now describe the second major component of our proof. Consider a halfspace in Boolean variables where the coefficients can be assumed without loss of generality to be integers. Then the linear form ranges in the discrete set , for some integer proportionate to the magnitude of the coefficients. As a result, one can approximate to any given error by approximating the sign function to on This approach works for both rational approximation and polynomial approximation. We think of it as the black-box approach to the approximation of because it uses the linear form rather than the individual bits. There is no reason to expect that the black-box construction is anywhere close to optimal. Indeed, there are halfspaces [76, Section 1.3] that can be approximated to arbitrarily small error by a rational function of degree  but require a black-box approximant of degree . Surprisingly, we are able to construct a halfspace with exponentially large coefficients for which the black-box approximant is essentially optimal. As a result, tight lower bounds for the rational and polynomial approximation of follow immediately from the univariate lower bounds for approximating the sign function on . The role of is to reduce the multivariate problem taken up in this work to a well-understood univariate question, hence the term univariatization.

The construction of this halfspace involves several steps. First, we study the probability distribution of the weighted sum $z_1X_1+z_2X_2+\cdots+z_nX_n$ modulo $m$, where $z_1,z_2,\ldots,z_n$ are given integers and the bits $X_1,X_2,\ldots,X_n\in\{0,1\}$ are chosen uniformly at random. We show that the distribution is exponentially close to uniform whenever the multiset $\{z_1,z_2,\ldots,z_n\}$ has $m$-discrepancy bounded away from $1$. For the next step, fix any multiset $\{z_1,z_2,\ldots,z_n\}$ with small $m$-discrepancy and consider the linear map $L\colon\{0,1\}^n\to\mathbb{Z}_m$ given by $L(x)=z_1x_1+z_2x_2+\cdots+z_nx_n$. At this point in the proof, we know that for uniformly random $X\in\{0,1\}^n$, the probability distribution of $L(X)$ is exponentially close to uniform. This implies that the characteristic functions of $L^{-1}(0),L^{-1}(1),\ldots,L^{-1}(m-1)$ have approximately the same Fourier spectrum up to degree $cn$, for some constant $c>0$. We substantially strengthen this conclusion by proving that there are probability distributions $\mu_0,\mu_1,\ldots,\mu_{m-1}$, supported on $L^{-1}(0),L^{-1}(1),\ldots,L^{-1}(m-1)$, respectively, such that the Fourier spectra of $\mu_0,\mu_1,\ldots,\mu_{m-1}$ are exactly the same up to degree $cn$. Our proof relies on a general tool from [77, Theorem 4.1], proved there using the Gershgorin circle theorem.
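The first step, that a weighted sum of uniformly random bits is close to uniform modulo $m$ when the weights are well spread, can be observed numerically. The following sketch is an illustration under assumed parameters ($m=5$ and weights cycling through $1,2,3,4$), not the paper's construction: it computes the exact distribution of $\sum_i z_iX_i \bmod m$ by dynamic programming with rational arithmetic and reports its largest deviation from the uniform distribution.

```python
from fractions import Fraction

def sum_distribution_mod(zs, m):
    """Exact distribution of (z_1 X_1 + ... + z_n X_n) mod m for uniform bits X_i."""
    dist = [Fraction(0)] * m
    dist[0] = Fraction(1)
    for z in zs:
        new = [Fraction(0)] * m
        for r, p in enumerate(dist):
            new[r] += p / 2               # case X_i = 0
            new[(r + z) % m] += p / 2     # case X_i = 1
        dist = new
    return dist

def distance_from_uniform(dist):
    """Largest pointwise deviation from the uniform distribution."""
    m = len(dist)
    return max(abs(p - Fraction(1, m)) for p in dist)

m = 5
base = [1, 2, 3, 4]                       # hypothetical weights, spread over Z_5
short = sum_distribution_mod(base * 2, m)   # n = 8 bits
long_ = sum_distribution_mod(base * 6, m)   # n = 24 bits
d_short = distance_from_uniform(short)
d_long = distance_from_uniform(long_)
print(float(d_short), float(d_long))
```

In this example the deviation shrinks geometrically as more bits are added, which is the qualitative behavior the paragraph above describes.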

As our final step, we use to construct a halfspace in terms of whose approximation by rational functions and polynomials gives corresponding approximants for the sign function on the discrete set . More generally, for any tuple , we define an associated halfspace and prove a lower bound on in terms of the discrepancy of the multiset Combining this result with the efficient construction of an integer set with small -discrepancy for , we obtain an explicit halfspace whose approximation by polynomials and rational functions is equivalent to the univariate approximation of the sign function on . Theorem 1.1 now follows by appealing to known lower bounds for the polynomial and rational approximation of the sign function. To obtain the exponential separation of communication complexity with unbounded versus weakly unbounded error (Theorem 1.2), we use the pattern matrix method [73, 75] to “lift” the lower bound of Theorem 1.1 to a discrepancy bound. Finally, our result on the threshold degree of the intersection of two halfspaces (Theorem 1.4) works by combining the rational approximation lower bound of Theorem 1.1 with a structural result from [76] on the sign-representation of arbitrary functions of the form

A key technical contribution of this paper is the identification of $m$-discrepancy as a pseudorandom property that is weak enough to admit efficient de-randomization and strong enough to allow the univariatization of the corresponding halfspace. The previous, existential result in [77] used a completely different and more complicated pseudorandom property based on affine shifts of the Fourier transform on $\mathbb{Z}_2^n$, which we have not been able to de-randomize. Apart from the construction of a low-discrepancy set, our proof is simpler and more intuitive than the existential proof in [77].

2. Preliminaries

We start with a review of the technical preliminaries. The purpose of this section is to make the paper as self-contained as possible, and comfortably readable by a broad audience. The expert reader should therefore skim this section for notation or skip it altogether.

2.1. Notation

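There are two common arithmetic encodings for the Boolean values: the traditional encoding $\text{false}\leftrightarrow0$, $\text{true}\leftrightarrow1$ and the Fourier-motivated encoding $\text{false}\leftrightarrow1$, $\text{true}\leftrightarrow-1$. Throughout this manuscript, we use the former encoding for the domain of a Boolean function and the latter for the range. With this convention, Boolean functions are mappings $\{0,1\}^n\to\{-1,+1\}$ for some $n$. For Boolean functions $f\colon\{0,1\}^n\to\{-1,+1\}$ and $g\colon\{0,1\}^m\to\{-1,+1\}$, we let $f\circ g$ denote the coordinatewise composition of $f$ with $g$. Formally, $f\circ g\colon(\{0,1\}^m)^n\to\{-1,+1\}$ is given by

$$(f\circ g)(x_1,x_2,\ldots,x_n)=f\left(\frac{1-g(x_1)}{2},\frac{1-g(x_2)}{2},\ldots,\frac{1-g(x_n)}{2}\right),\tag{2.1}$$

where the linear map on the right-hand side serves the purpose of switching between the distinct arithmetizations for the domain versus range. A partial function $f$ on a set $X$ is a function whose domain of definition, denoted $\operatorname{dom}f$, is a nonempty proper subset of $X$. We generalize coordinatewise composition to partial Boolean functions $f$ and $g$ in the natural way. Specifically, $f\circ g$ is the Boolean function given by (2.1), with domain the set of all inputs $(x_1,x_2,\ldots,x_n)\in(\operatorname{dom}g)^n$ for which $\left(\frac{1-g(x_1)}{2},\ldots,\frac{1-g(x_n)}{2}\right)\in\operatorname{dom}f$.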
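A short sketch may help fix the composition convention. It assumes that the composition feeds each block $x_i$ to $g$ and converts the $\{-1,+1\}$ output back to a $\{0,1\}$ input for $f$ through the map $t\mapsto(1-t)/2$, so that $-1$ ("true") becomes $1$. The function names and the AND/OR example are illustrative assumptions.

```python
def compose(f, g):
    """Coordinatewise composition of Boolean functions with domain {0,1}^* and range {-1,+1}.

    Assumed convention: (f o g)(x_1, ..., x_n) = f((1 - g(x_1))/2, ..., (1 - g(x_n))/2).
    """
    def h(*blocks):
        # (1 - g)/2 sends +1 (false) to 0 and -1 (true) to 1.
        return f(tuple((1 - g(block)) // 2 for block in blocks))
    return h

def and_(x):
    return -1 if all(x) else 1

def or_(x):
    return -1 if any(x) else 1

and_of_ors = compose(and_, or_)
print(and_of_ors((0, 1), (1, 0)))   # both blocks contain a 1
print(and_of_ors((0, 0), (1, 1)))   # the first block is all zeros
```

Here `and_of_ors` takes one tuple per block, mirroring the way the composed function takes $n$ inputs to $g$.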

We use the following two versions of the sign function:

For a subset we let denote the restriction of the sign function to A halfspace for us is any Boolean function given by

for some reals The majority function is the halfspace defined by

Some authors define only for odd, in which case the tiebreaker term can be omitted.

The complement and the power set of a set are denoted as usual by and , respectively. The symmetric difference of sets and is Throughout this manuscript, we use brace notation as in to specify multisets rather than sets. The cardinality of a finite multiset is defined as the total number of element occurrences in , with each element counted as many times as it occurs. The equality and subset relations on multisets are defined analogously, with the number of element occurrences taken into account. For example, but . Similarly, but

The infinity norm of a function is denoted For real-valued functions and and a nonempty finite subset of their domain, we write

We will often use this notation with a nonempty proper subset of the domain of and We let and stand for the natural logarithm of and the logarithm of to base respectively. The binary entropy function is given by and is strictly increasing on The following bound is well known [35, p. 283]:

(2.2)
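As a sanity check on the entropy notation, the sketch below evaluates the binary entropy function and tests a standard estimate numerically. It assumes that the bound in question is the familiar inequality $\sum_{i=0}^{k}\binom{n}{i}\le2^{H(k/n)\,n}$ for $0\le k\le n/2$; that reading is an assumption, as are the function names.

```python
from math import comb, log2

def binary_entropy(p):
    """H(p) = -p log2 p - (1-p) log2 (1-p), with H(0) = H(1) = 0."""
    if p in (0, 1):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def entropy_bound_holds(n, k):
    """Check sum_{i<=k} C(n,i) <= 2^(H(k/n) n), compared on a log scale."""
    lhs = sum(comb(n, i) for i in range(k + 1))
    return log2(lhs) <= binary_entropy(k / n) * n + 1e-9

all_hold = all(entropy_bound_holds(n, k)
               for n in range(1, 41)
               for k in range(n // 2 + 1))

# H is strictly increasing on [0, 1/2].
grid = [i / 200 for i in range(101)]
increasing = all(binary_entropy(a) < binary_entropy(b)
                 for a, b in zip(grid, grid[1:]))
print(all_hold, increasing)
```

The comparison is done in logarithms to avoid forming very large powers of two as floating-point numbers.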

For a complex number we denote the real part, imaginary part, and complex conjugate of as usual by and respectively. We typeset the imaginary unit in boldface to distinguish it from the index variable .

For an arbitrary integer and a positive integer , recall that denotes the unique element of that is congruent to modulo For an integer the symbols and refer to the ring of integers modulo and the multiplicative group of integers modulo respectively. For a multiset of integers, we adopt the standard notation

(2.3)
(2.4)
(2.5)
(2.6)

Note that the multisets in (2.3)–(2.6) each have cardinality the same as the original set . We often use these shorthands in combination, as in

For a logical condition we use the Iverson bracket

The following concentration inequality, due to Hoeffding [34], is well-known.

Fact 2.1 (Hoeffding’s Inequality).

Let

be independent random variables with

Let

Then

In Fact 2.1 and throughout this paper, we typeset random variables using capital letters.

2.2. Number-theoretic preliminaries

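For positive integers $a$ and $b$ that are relatively prime, $(a^{-1})_b$ denotes the multiplicative inverse of $a$ modulo $b$. The following fact is well-known and straightforward to verify; cf. [2].

Fact 2.2.

For any positive integers $a$ and $b$ that are relatively prime,

$$\frac{(a^{-1})_b}{b}+\frac{(b^{-1})_a}{a}-\frac{1}{ab}\in\mathbb{Z}.\tag{2.7}$$

Proof.

We have $a(a^{-1})_b+b(b^{-1})_a-1\equiv a(a^{-1})_b-1\equiv0\pmod b$, and analogously $a(a^{-1})_b+b(b^{-1})_a-1\equiv0\pmod a$. Thus, $a(a^{-1})_b+b(b^{-1})_a-1$ is divisible by both $a$ and $b$. Since $a$ and $b$ are relatively prime, we conclude that $a(a^{-1})_b+b(b^{-1})_a-1$ is divisible by $ab$, which is equivalent to (2.7). ∎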
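The divisibility argument in the proof is easy to test exhaustively for small inputs. The sketch below assumes the fact takes the form $\frac{(a^{-1})_b}{b}+\frac{(b^{-1})_a}{a}-\frac{1}{ab}\in\mathbb{Z}$, where $(a^{-1})_b$ is the inverse of $a$ modulo $b$. It uses Python's built-in three-argument `pow` with exponent $-1$ (available from Python 3.8) and exact rational arithmetic; the helper name is hypothetical.

```python
from fractions import Fraction
from math import gcd

def fact_2_2_value(a, b):
    """Return (a^{-1} mod b)/b + (b^{-1} mod a)/a - 1/(ab) as an exact fraction."""
    inv_a_mod_b = pow(a, -1, b)   # modular inverse; requires gcd(a, b) == 1
    inv_b_mod_a = pow(b, -1, a)
    return Fraction(inv_a_mod_b, b) + Fraction(inv_b_mod_a, a) - Fraction(1, a * b)

# Check every coprime pair in a small range.
always_integer = all(
    fact_2_2_value(a, b).denominator == 1
    for a in range(1, 41)
    for b in range(1, 41)
    if gcd(a, b) == 1
)
example = fact_2_2_value(3, 5)    # 2/5 + 2/3 - 1/15
print(always_integer, example)
```

For $a=3$ and $b=5$ both inverses equal $2$, and the three terms sum to exactly $1$.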

Recall that the prime counting function for a real argument evaluates to the number of prime numbers less than or equal to In what follows, it will be clear from the context whether refers to or the prime counting function. The asymptotic growth of the latter is given by the prime number theorem, which states that Many explicit bounds on are known, such as the following theorem of Rosser [68].

Fact 2.3 (Rosser).

For

The number of distinct prime divisors of a natural number $n$ is denoted $\nu(n)$. We will need the following first-principles bound on $\nu(n)$, which is asymptotically tight for infinitely many $n$.

Fact 2.4.

The number of distinct prime divisors of $n$ obeys

$$(\nu(n)+1)!\le n.\tag{2.8}$$

In particular,

$$\nu(n)\le(1+o(1))\frac{\ln n}{\ln\ln n}.\tag{2.9}$$

Proof.

An integer $n$ has by definition $\nu(n)$ distinct prime divisors. Letting $p_i$ denote the $i$-th prime, we have

$$n\ge p_1p_2\cdots p_{\nu(n)}\ge(\nu(n)+1)!\ge\left(\frac{\nu(n)+1}{\mathrm{e}}\right)^{\nu(n)+1},$$

where the second step uses the trivial estimate $p_i\ge i+1$. The second step in this derivation settles (2.8), whereas the last step settles (2.9). ∎
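The factorial bound can be checked by brute force for small integers. The sketch below assumes the bound reads $(\nu(n)+1)!\le n$, where $\nu(n)$ counts distinct prime divisors; it computes $\nu(n)$ by trial division, and the function name and the search range are illustrative choices.

```python
from math import factorial

def distinct_prime_divisors(n):
    """Number of distinct primes dividing n, by trial division."""
    count, p = 0, 2
    while p * p <= n:
        if n % p == 0:
            count += 1
            while n % p == 0:
                n //= p
        p += 1
    if n > 1:
        count += 1           # whatever remains is a single prime factor
    return count

bound_holds = all(
    factorial(distinct_prime_divisors(n) + 1) <= n
    for n in range(1, 5001)
)
# Integers where the bound holds with equality in this range.
tight = [n for n in range(1, 5001)
         if factorial(distinct_prime_divisors(n) + 1) == n]
print(bound_holds, tight)
```

Products of the first few primes, such as $30$ and $210$, come close to the bound, which matches the remark that it is asymptotically tight for infinitely many integers.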

2.3. Matrix analysis

For an arbitrary set such as or the symbol denotes the family of matrices with entries in . The symbols and stand for the order-identity matrix and the matrix of all ones, respectively. When the dimensions of the matrix are clear from the context, we omit the subscripts and write simply or The shorthand refers to the diagonal matrix with entries on the diagonal:

For a matrix recall that its complex conjugate is given by . The transpose and conjugate transpose of are denoted and

respectively. The conjugation, transpose, and conjugate transpose operations apply as a special case to vectors, which we view as matrices with a single column. We use the familiar matrix norms

and Again, these definitions carry over to vectors as a special case. A matrix is called unitary if

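A circulant matrix is any matrix $C$ of the form

$$C=\begin{bmatrix}c_0&c_1&c_2&\cdots&c_{n-1}\\c_{n-1}&c_0&c_1&\cdots&c_{n-2}\\c_{n-2}&c_{n-1}&c_0&\cdots&c_{n-3}\\\vdots&\vdots&\vdots&\ddots&\vdots\\c_1&c_2&c_3&\cdots&c_0\end{bmatrix}\tag{2.10}$$

for some $c_0,c_1,\ldots,c_{n-1}$. Thus, every row of $C$ is obtained by a circular shift of the previous row one entry to the right. We let $\operatorname{circ}(c_0,c_1,\ldots,c_{n-1})$ denote the right-hand side of (2.10). In this notation, $I=\operatorname{circ}(1,0,0,\ldots,0)$ and $J=\operatorname{circ}(1,1,\ldots,1)$.

The eigenvalues and eigenvectors of a circulant matrix are well-known and straightforward to determine. For the reader’s convenience, we include the short derivation below in Fact 2.5 and Corollary 2.6.

Fact 2.5.

Let $C=\operatorname{circ}(c_0,c_1,\ldots,c_{n-1})$ be a circulant matrix. Then for every $n$-th root of unity $\omega$, the vector

$$(1,\omega,\omega^2,\ldots,\omega^{n-1})^{\mathsf T}\tag{2.11}$$

is an eigenvector of $C$ with eigenvalue $\sum_{k=0}^{n-1}c_k\omega^k$.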

Proof.

Let denote the vector in (2.11). Then for

where the third step uses

As a corollary to Fact 2.5, one recovers the full complement of eigenvalues for any circulant matrix and furthermore learns that is unitarily similar to a diagonal matrix. In the statement below, recall that a primitive -th root of unity is any generator, such as for the multiplicative group of the roots of .

Corollary 2.6.

Let be a circulant matrix. Let be a primitive -th root of unity. Then the matrix

is unitary and satisfies

(2.12)

In particular, the eigenvalues of counting multiplicities, are

Proof.

For , we have

where the second step is valid because is primitive and in particular . We conclude that

(2.13)

Fact 2.5 implies that

which in light of (2.13) is equivalent to (2.12). ∎
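The eigenvector claim for circulant matrices can be verified numerically without any linear algebra library. The sketch below assumes the row convention in which each row is the previous row shifted one entry to the right, so that entry $(j,k)$ equals $c_{(k-j)\bmod n}$, and checks that $(1,\omega,\ldots,\omega^{n-1})$ is an eigenvector with eigenvalue $\sum_k c_k\omega^k$ for every $n$-th root of unity $\omega$. The function names are hypothetical.

```python
import cmath

def circulant(c):
    """Circulant matrix whose first row is c and whose rows shift right."""
    n = len(c)
    return [[c[(k - j) % n] for k in range(n)] for j in range(n)]

def check_eigenpairs(c, tol=1e-9):
    """Verify C v = lambda v for v = (1, w, ..., w^(n-1)) and all n-th roots w."""
    n = len(c)
    C = circulant(c)
    for t in range(n):
        w = cmath.exp(2j * cmath.pi * t / n)
        v = [w ** k for k in range(n)]
        lam = sum(c[k] * w ** k for k in range(n))
        Cv = [sum(C[j][k] * v[k] for k in range(n)) for j in range(n)]
        if any(abs(Cv[j] - lam * v[j]) > tol for j in range(n)):
            return False
    return True

small = circulant([1, 2, 3])
ok = check_eigenpairs([1, 2, 3, 4, 5])
print(small, ok)
```

Taking $\omega$ over all $n$ roots of unity yields $n$ linearly independent eigenvectors, which is why the eigenvalues listed in the corollary account for the whole spectrum.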

2.4. Polynomial approximation

Recall that the total degree of a multivariate real polynomial $p\colon\mathbb{R}^n\to\mathbb{R}$, denoted $\deg p$, is the largest degree of any monomial of $p$. We use the terms “degree” and “total degree” interchangeably in this paper. Let $f\colon X\to\mathbb{R}$ be a given function with domain $X\subseteq\mathbb{R}^n$. For any $d\ge0$, define

$$E(f,d)=\inf_{p}\ \sup_{x\in X}|f(x)-p(x)|,$$

where the infimum is over real polynomials $p$ of degree at most $d$. In words, $E(f,d)$ is the least error in a pointwise approximation of $f$ by a polynomial of degree no greater than $d$. The $\epsilon$-approximate degree of $f$ is the minimum degree of a polynomial that approximates $f$ pointwise within $\epsilon$:

$$\deg_\epsilon(f)=\min\{d:E(f,d)\le\epsilon\}.$$

In this overview, we focus on the polynomial approximation of the sign function. We start with an elementary construction of an approximant due to Buhrman et al. [21].

Fact 2.7 (Buhrman et al.).

For any and the sign function can be approximated on pointwise to within by a polynomial of degree

The degree upper bound in Fact 2.7 is not tight. Indeed, a quadratically stronger bound of follows in a straightforward manner from Jackson’s theorem in approximation theory [67, Theorem 1.4]. Our applications do not benefit from this improvement, however, and we opt for the construction of Buhrman et al.  [21] because of its striking simplicity. For the reader’s convenience, we provide their short proof below.

Proof (adapted from Buhrman et al.).

For a positive integer consider the degree- univariate polynomial

In words, is the probability of observing at least as many heads as tails in a sequence of independent coin flips, each coming up heads with probability By Hoeffding’s inequality (Fact 2.1) for sufficiently large the polynomial sends and similarly As a result, the shifted and scaled polynomial approximates the sign function pointwise on within
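The coin-flipping construction in this proof lends itself to a quick numerical experiment. The sketch below makes several assumptions that are not taken from the text: the polynomial is $B_N(t)=\sum_{i\ge N/2}\binom{N}{i}t^i(1-t)^{N-i}$, the shifted and scaled approximant is $2B_N((1+x)/2)-1$, the target is the sign function on $[-1,-\epsilon]\cup[\epsilon,1]$, and the comparison bound $2e^{-\epsilon^2N/2}$ comes from the one-sided form of Hoeffding's inequality. All names and parameter values are illustrative.

```python
from math import comb, exp

def heads_majority_probability(N, t):
    """Probability of at least as many heads as tails in N flips with heads probability t."""
    return sum(comb(N, i) * t ** i * (1 - t) ** (N - i)
               for i in range((N + 1) // 2, N + 1))

def sign_approximant(N, x):
    """Degree-N polynomial in x obtained by shifting and scaling the probability above."""
    return 2 * heads_majority_probability(N, (1 + x) / 2) - 1

def worst_error(N, eps, steps=50):
    """Largest error against sgn on a grid of [-1, -eps] and [eps, 1]."""
    worst = 0.0
    for j in range(steps + 1):
        x = eps + j * (1 - eps) / steps
        worst = max(worst,
                    abs(sign_approximant(N, x) - 1),
                    abs(sign_approximant(N, -x) + 1))
    return worst

N, eps = 201, 0.3
observed = worst_error(N, eps)
hoeffding = 2 * exp(-eps ** 2 * N / 2)   # assumed form of the Hoeffding-based bound
print(observed, hoeffding)
```

For a fixed $\epsilon$ the error decays exponentially in $N$, so a target error $\delta$ is reached with degree on the order of $\epsilon^{-2}\log(1/\delta)$ in this sketch.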

On the lower bounds side, Paturi proved that low-degree polynomials cannot approximate the majority function well. He in fact obtained analogous results for all symmetric functions, but the special case of majority will be sufficient for our purposes.

Theorem 2.8 (Paturi).

For some constant and all integers

The constant in Paturi’s theorem can be replaced by any other in His result is of interest to us because along with Fact 2.7, it implies a lower bound for the approximation of the sign function on the discrete set of points for any

Proposition 2.9.

For all positive integers and

Proof.

Abbreviate and fix a polynomial of degree at most that approximates the sign function on within . Fact 2.7 gives a polynomial of degree that sends and Then the composition of these two approximants obeys

This in turn gives an approximant for the majority function on bits:

In view of Paturi’s lower bound for the majority function (Theorem 2.8), the approximant must have degree But this composition is a polynomial in of degree