Recent decades have witnessed a paradigm shift in the notion of what constitutes an “efficient algorithm” in algorithms and complexity theory. Motivated both by practical applications and theoretical considerations, the traditional gold standard of linear time as the ultimate benchmark for algorithmic efficiency has given way to the notion of sublinear-time and sublinear-query algorithms, as introduced by Blum and Kannan [BlumKannan:89] and Blum, Luby and Rubinfeld [BLR93]. The study of sublinear algorithms is flourishing, with deep connections to many other areas including PCPs, hardness of approximation, and streaming algorithms (see e.g. the surveys [Rubinfeld:06survey, Goldreich17book, Fischer, R00]).
The current paper is at the confluence of two different lines of research in the area of sublinear algorithms:
The first strand of work deals with sublinear algorithms to approximately compute (numerical-valued) functions on various combinatorial objects. Example problems of this sort include (i) estimating the weight of a minimum spanning tree [ChazelleRT05]; (ii) approximating the minimum vertex cover size in a graph [PARNAS2007183]; and (iii) approximating the number of $k$-cliques in an undirected graph [EdenRonSeshadri]. We note that for the first two of these results, the number of local queries that are made to the input combinatorial object is completely independent of its size.
The second strand of work is on property testing of Boolean-valued functions. Given a class $\mathcal{C}$ of Boolean-valued functions, a testing algorithm for $\mathcal{C}$ is a query-efficient procedure which, given oracle access to an arbitrary Boolean-valued function $f$, distinguishes between the two cases that (i) $f$ belongs to the class $\mathcal{C}$, versus (ii) $f$ is $\varepsilon$-far from every function in $\mathcal{C}$. Flagship results in this area include algorithms for linearity testing [BLR93], testing of low-degree polynomials [RS96, jutpatrudzuc04], junta testing [FKRSS03, Blaisstoc09], and monotonicity testing [GGLRS, KhotMinzer15]. Here too, for the first three of these properties, the query complexity of the testing algorithms depends only on the accuracy parameter $\varepsilon$ and is completely independent of the ambient dimension $n$ of the function $f$.
In recent years, a nascent line of work has emerged at the intersection of these two strands, where the high-level goal is to approximately compute various numerical parameters of Boolean-valued functions. As an example, building on the work of Kothari et al. [KothariNOW14], Neeman [Neeman2013surface] gave an algorithm to approximate the “surface area” of a Boolean-valued function on , which is a fundamental measure of its complexity [KOS:08]. The [Neeman2013surface] algorithm has a query complexity of if the target surface area is , which is completely independent of the ambient dimension $n$. Fitting the same motif is the work of Ron et al. [RonWinfluence], who studied the problem of approximating the “total influence” (or equivalently, “average sensitivity”) of a Boolean function. They showed that the optimal query complexity to approximate the influence of an arbitrary $n$-variable Boolean function to constant relative error is , and that this can be strengthened to essentially for monotone functions. More recently, in closely related work, Rubinfeld and Vasilyan [DBLP:conf/approx/RubinfeldV19] have given a constant-query algorithm to approximate the “noise sensitivity” of a Boolean function.
We note that each of the above three numerical parameters — surface area, total influence, and noise sensitivity — is essentially a measure of the “smoothness” of the Boolean function in question. In contrast, in this work we are interested in the sumset size, which has a rather different flavor and, as discussed below, is intimately connected to the subspace structure of the function.
Let $A \subseteq \mathbb{F}_2^n$ be an arbitrary subset (which may of course be viewed as a Boolean function by considering its $\{0,1\}$-valued characteristic function). One of the most fundamental operations on such a set $A$ is to consider the sumset $A + A$, defined as
$$A + A := \{a + a' : a, a' \in A\}.$$
Here ‘$+$’ is the group operation in $\mathbb{F}_2^n$. Note that for an affine subspace $A$ we have that $|A + A| = |A|$, and the converse (the only sets for which $|A + A| = |A|$ are affine subspaces) is also easily seen to hold. In fact, something significantly stronger is true: the celebrated Freiman–Ruzsa theorem [freiman1973foundations, AST_1999__258__323_0, apde.2012.5.627] states that if $|A + A| \le K|A|$, then $A$ is contained inside an affine subspace $H$ such that $|H| \le O_K(1) \cdot |A|$. Thus, the value of $|A + A|$ vis-a-vis $|A|$ can be seen as a measure of the “subspace structure” of $A$.
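As a small illustration of the sumset operation, the following sketch encodes vectors of $\mathbb{F}_2^n$ as Python integers, so that the group operation becomes bitwise XOR. The helper names `sumset` and `span` and the example sets are assumptions made for this illustration; they do not come from the paper.

```python
def sumset(A):
    """Return A + A over F_2^n (vectors are ints, addition is XOR)."""
    return {a ^ b for a in A for b in A}

def span(generators):
    """Linear span over F_2 of the given vectors."""
    S = {0}
    for g in generators:
        S |= {s ^ g for s in S}
    return S

n = 4
H = span([0b0011, 0b0101])             # a 2-dimensional linear subspace
coset = {h ^ 0b1000 for h in H}        # an affine subspace (a coset of H)
A = {0b0001, 0b0010, 0b0100, 0b1000}   # the four standard basis vectors

print(len(H), len(sumset(H)))          # subspace: the sumset does not grow
print(len(coset), len(sumset(coset)))  # affine subspace: same size
print(len(A), len(sumset(A)))          # little subspace structure: sumset grows
```

In this toy example the subspace and its coset have a sumset of the same size as the set itself, while the four basis vectors, which have little subspace structure, produce a sumset of 7 points.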
1.1 The Question We Consider
For $A \subseteq \mathbb{F}_2^n$, we define $\mathrm{Vol}(A) := |A|/2^n$ to be the normalized size or volume of $A$. This paper is motivated by the following basic algorithmic problem about sumsets:
Sumset size estimation (naive formulation): Given black-box oracle access to a set (via its characteristic function ), can we estimate the while making only “few” oracle calls to ?
At first glance this seems to be a difficult problem, since to confirm that a given point does not belong to we must verify that at least one of for each of the pairs satisfying . Indeed, for the above naive problem formulation, any algorithm must make queries even to distinguish between the two extreme cases that (i.e. ) versus . To see this, suppose that is a uniform random subset of many elements from . It is clear that any algorithm will need queries to distinguish such an from the empty set, and an easy calculation shows that such a random will with extremely high probability have
This simple example already shows that some care must be taken to formulate the “right” version of the sumset size estimation problem. This situation is analogous to the surface area testing problem that was studied in [KothariNOW14, Neeman2013surface]: In that setting, given oracle access to any set , by adding a measure zero set to (which is undetectable by an algorithm with oracle access to ) it is possible to “blow up” the surface area of to an arbitrarily large value. Thus the goal in [KothariNOW14, Neeman2013surface] is to find a value such that for a set that is “close to .” Note that for surface area, it may be possible to dramatically increase the surface area of a set either by adding a small subset of new points or removing a small subset of existing points from . In contrast, for sumset size it is clear that removing points from can never cause the sumset size to increase, and moreover adding a small (random) collection of points to can always cause to become extremely close to 1. Hence for our sumset size estimation problem we only allow subsets of as the permissible “close to ” sets.
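The random-subset example behind the naive formulation can be simulated directly. The sketch below is only an illustration under assumed parameters: the dimension `n = 16` and the set size `8 * 2**(n // 2)` are choices made here, not values taken from the paper. Vectors are encoded as integers and addition is XOR.

```python
import random

def sumset(A):
    """Return A + A over F_2^n (vectors are ints, addition is XOR)."""
    return {a ^ b for a in A for b in A}

def volume(S, n):
    """Normalized size |S| / 2^n."""
    return len(S) / 2 ** n

n = 16
rng = random.Random(0)               # fixed seed, for reproducibility only
m = 8 * 2 ** (n // 2)                # assumed size: a constant times 2^(n/2)
A = set(rng.sample(range(2 ** n), m))

vol_A = volume(A, n)                 # small: A is a sparse set
vol_sum = volume(sumset(A), n)       # close to 1 for a random set of this size
print(vol_A, vol_sum)
```

Each nonzero point has roughly `m**2 / 2**(n + 1)` expected representations as a sum of two elements of the random set, which is why a set of volume about 3% already has a sumset of volume close to 1 in this simulation.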
We thus arrive at the following formulation of our problem:
Sumset size estimation: Given black-box oracle access to a set and an accuracy parameter , compute to additive accuracy for some subset which has
Given the importance of sumsets in additive combinatorics, we feel that it is natural to investigate algorithmic questions dealing with basic properties of sumsets; estimating the size of a sumset is a natural algorithmic question of this sort. We further remark that while there is no direct technical connection to the present work, the path which led us to the sumset size estimation problem originated in an effort to develop a query-efficient algorithm for convexity testing (i.e. testing whether a subset $S \subseteq \mathbb{R}^n$ is convex versus far from convex, where the standard Normal distribution $N(0, I_n)$ provides the underlying distance measure on $\mathbb{R}^n$). In particular, the recent characterization by Shenfeld and van Handel of equality cases for the Ehrhard–Borell inequality (see Theorem 1.2 of [rvh-equality]) implies that a closed symmetric set $K \subseteq \mathbb{R}^n$ is convex if and only if the Gaussian volume of $K$ equals the Gaussian volume of $\tfrac{1}{2}(K + K)$. We believe that a robust version of this theorem might be useful for convexity testing; this naturally motivates a Gaussian space version of the sumset size estimation question, where now the Minkowski sum of sets in $\mathbb{R}^n$ plays the role of sumsets over $\mathbb{F}_2^n$. We hope that the ideas and ingredients in the current work may eventually be of use for the Gaussian space Minkowski sum size estimation problem, and perhaps ultimately for convexity testing.
1.3 Our Main Result
Our main result is an algorithm for the sumset size estimation problem which makes only constantly many queries, independent of the ambient dimension $n$. We state our main result informally below:
Given oracle access to any set and an error parameter , there is an algorithm making queries to with the following guarantee: with high probability, the algorithm outputs a value such that for some set such that .
In fact, as we describe in more detail later, our algorithm does more than just approximate the volume of : it outputs a high-accuracy approximate oracle for the set , given which it is trivially easy to approximate by random sampling. (As we will see, our algorithm also outputs an exact oracle for the set .) Later we will give a formal definition of what it means to “output an oracle” for a set ; informally, it means we give a description of an oracle algorithm (which uses a black-box oracle to ) which, on any input , (i) determines whether , and (ii) makes few invocations to the oracle for . We further note that the running time of our algorithm is linear in (note that even writing down an -bit string as a query input to takes linear time).
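The remark that a volume is easy to approximate by random sampling once an oracle is available can be made concrete as follows. This is a minimal sketch under stated assumptions: the function name `estimate_volume`, the stand-in oracle, and the use of Hoeffding's inequality to pick the sample size are illustrative choices made here, and the oracle below is an exact membership test rather than the approximate oracle produced by the paper's algorithm.

```python
import math
import random

def estimate_volume(oracle, n, eps, delta, rng):
    """Estimate Pr_x[oracle(x) = 1] to within +/- eps, with probability
    at least 1 - delta, from uniform samples x in F_2^n (Hoeffding bound)."""
    m = math.ceil(math.log(2 / delta) / (2 * eps ** 2))
    hits = sum(oracle(rng.getrandbits(n)) for _ in range(m))
    return hits / m

n = 40
# Stand-in oracle: membership in a codimension-2 subspace, of volume 1/4.
oracle = lambda x: int((x & 0b11) == 0)
est = estimate_volume(oracle, n, eps=0.05, delta=1e-6, rng=random.Random(1))
print(est)
```

The number of samples depends only on `eps` and `delta`, not on `n`, which matches the dimension-independent flavor of the query bounds discussed above.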
1.4 Technical Overview
1.4.1 A Conceptual Overview of the Algorithm
In this subsection we give a technical overview of our algorithm. At a high level, our approach is based on the structure versus randomness paradigm that has proven to be very influential in additive combinatorics [TaoVu:06] and property testing. Our algorithm relies on two main ingredients, which we describe below.
To explain the key ingredients we need the notion of quasirandomness from additive combinatorics. For a set , we say is -quasirandom if each non-empty Fourier coefficient , satisfies where we are viewing as a characteristic function over the domain
The definition of the Fourier transform extends to the more general setting in which is a characteristic function whose domain is some coset (of size ) of . This is done by identifying with via a homomorphism; we give details later in Section 2.4.
The first ingredient is the following: Let be a linear subspace of , and let , be subsets of cosets and respectively. Suppose that both and are at least , and that both and are -quasirandom (viewed as characteristic functions whose domains are the cosets and respectively). Our first ingredient is the simple but useful observation that if , then the set (which is easily seen to be a subset of the coset ) must be almost the entire coset (see Section 3.3).
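A standard Fourier-analytic way to reason about sumsets is through convolution: a point lies in the sumset of two sets exactly when the convolution of their indicator functions is positive there, and the Fourier coefficients of a convolution are the products of the individual coefficients. The sketch below checks both facts by brute force on a toy example. The normalization (expectations over uniform inputs, following the conventions of [odonnell-book]) and all helper names are assumptions of this illustration, not the paper's statement of the first ingredient.

```python
def chi(alpha, x):
    """Parity character (-1)^<alpha, x> on F_2^n (vectors are ints)."""
    return -1 if bin(alpha & x).count("1") % 2 else 1

def fourier(f, n):
    """All Fourier coefficients E_x[f(x) chi_alpha(x)] of a table f."""
    N = 1 << n
    return [sum(f[x] * chi(a, x) for x in range(N)) / N for a in range(N)]

def convolve(f, g, n):
    """(f * g)(x) = E_y[f(y) g(x + y)], with addition being XOR."""
    N = 1 << n
    return [sum(f[y] * g[x ^ y] for y in range(N)) / N for x in range(N)]

n = 4
N = 1 << n
A, B = {1, 2, 4, 8}, {0, 3, 5}
f = [int(x in A) for x in range(N)]
g = [int(x in B) for x in range(N)]

h = convolve(f, g, n)
support = {x for x in range(N) if h[x] > 0}
sumset_AB = {a ^ b for a in A for b in B}
f_hat, g_hat, h_hat = fourier(f, n), fourier(g, n), fourier(h, n)
print(support == sumset_AB)
```

Since the coefficient of the convolution at the zero vector is the product of the two densities, small nonzero coefficients (quasirandomness) leave little room for the convolution to vanish, which is the intuition behind the observation above.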
The second ingredient is Green’s well-known “regularity lemma” for Boolean functions [Green:05]. To explain this, for any set , subspace of , and coset , let be the intersection of with the coset . Roughly speaking, Green’s regularity lemma shows that for any , there is a subspace of codimension at most such that the following holds: With probability over a uniform random choice of cosets , the set is -quasirandom (viewed as a subset of the coset ). Moreover, the proof of the regularity lemma gives an iterative procedure to identify ; very roughly speaking, until the procedure terminates, at each stage it identifies a vector such that is large, and sets to be the span of the vectors identified so far.
With these two ingredients in place, we are ready to explain (at least at a qualitative level; we defer discussion of how to achieve the desired query complexity to the next subsection) the algorithm for simulating an oracle to . First, we run the algorithmic version of Green’s regularity lemma; having done so, we have a subspace and we know that for most cosets , the set is -quasirandom. Let be the codimension of and let be a set of many coset representatives for the cosets of . Let be the subset consisting of those coset representatives for which the set (i) is -quasirandom and (ii) has density at least when viewed as a subset of (where is some carefully chosen parameter that we do not specify here). We note that given any coset , condition (ii) can be checked using simple random sampling. Condition (i) is equivalent to checking that the set has no Fourier coefficient larger than . This can be done using the celebrated Goldreich-Levin algorithm [goldreich-levin]. (To be more accurate, this requires a slight adaptation of the Goldreich-Levin algorithm because the domain here is a coset rather than the more familiar domain for Goldreich-Levin.) Thus, at this point our algorithm has determined the set
The set is defined to be
i.e. is obtained from by removing for each , or equivalently, “zeroing out” on every coset where either is not -quasirandom or has density smaller than . (Since the algorithm knows and , it is clear from this definition of that, as mentioned after the informal theorem statement given earlier, the algorithm can simulate an exact oracle for the set .) Turning to , we have that
where the last line follows from Section 3.3 (that we informally stated as the first ingredient mentioned above). As above, since the algorithm knows and , it is clear from that the algorithm can simulate an approximate oracle for .
1.4.2 Achieving Constant Query Complexity
The above description essentially gives the high level description of our algorithm, at least at a conceptual level. However, there is a significant caveat, which arises when we consider the query complexity of the algorithm. Our goal is to achieve query complexity , but explicitly obtaining a description of the subspace necessarily requires a number of queries that scales at least linearly in ; indeed, even explicitly describing a single vector in requires bits of information (and thus this many queries). Similarly, obtaining an explicit description of even a single vector would be prohibitively expensive using only constantly many queries. To circumvent these obstacles and achieve constant (rather than linear or worse) query complexity, we need to develop “implicit” versions of the procedures described above.
As an example, we recall that the standard Goldreich-Levin algorithm, given oracle access to any set , outputs a list of parity functions such that the Fourier coefficient is “large” (roughly, at least ) for each . However, explicitly outputting the label of even a single parity would require bits of information. To avoid this, we slightly modify the standard Goldreich-Levin procedure to show that with queries, we can output oracles to the parity functions . In turn, each such oracle can be computed on any point with just many queries to the set ; thus, we have implicit access to the parity functions rather than explicit descriptions of the parities. In the language of coding theory, this amounts to an analysis showing that the Goldreich-Levin algorithm can be used to achieve constant-query “local list correction” of the Hadamard code. We view this as essentially folklore [Sudan21]; it is implicit in a number of previous works [sudtrevad01, KS13], but the closest explicit statements we have been able to find in the literature essentially say that Goldreich-Levin is a constant-query local list decoder (rather than local list corrector) for the Hadamard code.
With an “implicit” version of the Goldreich-Levin algorithm in hand, we show how to carefully use this implicit Goldreich-Levin to obtain an “implicit” algorithmic version of Green’s regularity lemma. This implicit version is sufficient to carry out the steps mentioned above with overall constant query complexity. We hope that the implicit (query-efficient) versions of these algorithms may be useful in other settings beyond the current work.
1.5 Related Work
As noted earlier, our sumset size estimation problem has a similar flavor to the work of [KothariNOW14, Neeman2013surface] on testing surface area, but the technical details are entirely different.
We note that for any invertible affine transformation , we have that (but clearly this need not hold for noninvertible affine transformations). Starting with the influential paper of Kaufman and Sudan [KaufmanSudan:08], a number of works have studied the testability of affine-invariant properties; see e.g. [bhattacharyya2015unified, bhattacharyya2013testing, hatami2013estimating, yoshida2014characterization, hatami2016general, bhattacharyya2013guest]. These works consider properties that are invariant under all affine transformations (not just invertible ones), which makes them inapplicable to our setting. However, we note that there are thematic similarities between the approaches in those works and our approach (in particular, the use of the “structure versus randomness” paradigm).
2 Preliminaries
In this section, we set notation and briefly recall preliminaries from additive combinatorics and Fourier analysis of Boolean functions. Given arbitrary , we define
We will sometimes identify a set with its indicator function , defined as
for . When for some coset , we similarly identify with its indicator function . We write to denote the vector with a in the position and everywhere else. The function denotes an exponential tower of ’s of height and the function denotes its inverse.
2.1 Analysis of Boolean Functions
Our notation and terminology follow [odonnell-book]. We will view the vector space of functions as a real inner product space, with inner product . It is easy to see that the collection of parity functions where forms an orthonormal basis for this vector space. In particular, every function can be uniquely expressed by its Fourier transform, given by
The real number is called the Fourier coefficient of on , and the collection of all Fourier coefficients of is called the Fourier spectrum of . We recall Parseval’s and Plancherel’s formulas: for all , we have
It follows that . Given , their convolution is the function defined by
2.2 Subspaces and Functions on Subspaces
Throughout this subsection, let and let be a linear subspace of codimension (so ). We can write
for some linearly independent collection of vectors .
A coset , which is an affine subspace or, equivalently, a “translate” for some , can be expressed as a set of the form
for some ; we will often identify with the vector . Note that if , then .
Any coset of is affinely isomorphic to a copy of , and this lets us define the Fourier transform of a function restricted to a coset . More formally, consider the function defined as . Its Fourier spectrum is indexed by the elements of ; in particular, for each we have
We can alternatively restrict a function to a coset , but treat it as a function on that takes value 0 on all points in ; this viewpoint will be notationally cleaner to work with going forward so we elaborate on it here. We define the function as
The Fourier coefficients of and are related by the following simple fact.
Let , be as in Equation 5, and let be a coset of . Let be a collection of coset representatives for (so every vector in has a unique representation as for some ). For any , we have
For ease of notation we first consider the case that . Suppose that is given by
where we write . We may take to be the set of all vectors in whose last coordinates are all 0, and we note that
2.3 Parity Decision Trees
We will only need the notion of a “nonadaptive” parity decision tree:
[nonadaptive parity decision tree] A nonadaptive parity decision tree is a representation of a function . It consists of a rooted binary tree of depth with leaves, so every root-to-leaf path has length exactly . Each internal node at depth is labeled by a vector corresponding to the parity function , and the vectors are linearly independent. (Having all nodes at level be labeled with the same vector is the sense in which the tree is “nonadaptive.”) The outgoing edges of each internal node are labeled and , and the leaves of are labeled by functions (which are restrictions of ). The size of is the number of leaf nodes of .
In more detail, a root-to-leaf path can be written as where we follow the outgoing edge from the internal node , with . On an input , the parity decision tree follows the root-to-leaf path and outputs the value of the function associated to the leaf at .
Note that given and a subspace of codimension as in Equation 5, we can associate a natural parity decision tree in which each level- internal node is labeled by and each leaf node (corresponding to some coset of ) is labeled by .
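The correspondence between the leaves of a nonadaptive parity decision tree and the cosets of a subspace can be sketched as follows. The names `parity`, `leaf_of`, and `leaves`, and the two example label vectors, are hypothetical choices for this illustration; vectors are encoded as integers.

```python
def parity(alpha, x):
    """The parity <alpha, x> over F_2 (vectors are ints)."""
    return bin(alpha & x).count("1") & 1

def leaf_of(x, labels):
    """Root-to-leaf path followed by x: one parity bit per level."""
    return tuple(parity(a, x) for a in labels)

def leaves(labels, n):
    """Group all points of F_2^n by the leaf they reach."""
    groups = {}
    for x in range(1 << n):
        groups.setdefault(leaf_of(x, labels), set()).add(x)
    return groups

n = 4
labels = [0b0011, 0b0100]   # two linearly independent level labels
groups = leaves(labels, n)
H = groups[(0, 0)]          # points on which every labeled parity is 0
print(sorted(H))
```

Because every node at a given level queries the same parity, the leaf reached by an input is determined by a fixed list of parities, and each leaf is a translate of the subspace `H` on which all of these parities vanish.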
2.4 Quasirandomness and Green’s Regularity Lemma
The following definition of quasirandomness has been well-studied as a notion of pseudorandomness in additive combinatorics; we refer the interested reader to [Chung1992] for more details.
[-quasirandomness] We say that is -quasirandom if
[-quasirandom when restricted to coset] Let , as in Equation 5, and let be a coset of . We say that is -quasirandom if
where is as defined in Section 2.2.
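For small dimensions, quasirandomness can be checked by brute force. The sketch below assumes the reading suggested by the introduction, namely that a function is quasirandom with parameter `eps` when every Fourier coefficient at a nonzero vector has absolute value at most `eps`; whether the paper's inequality is strict, and the helper names used here, are assumptions of this illustration.

```python
def fourier_coefficient(f, alpha, n):
    """E_x[f(x) * (-1)^<alpha, x>] over uniform x in F_2^n."""
    N = 1 << n
    total = 0
    for x in range(N):
        sign = -1 if bin(alpha & x).count("1") % 2 else 1
        total += f(x) * sign
    return total / N

def max_nonzero_coefficient(f, n):
    """Largest |Fourier coefficient| over all nonzero alpha."""
    return max(abs(fourier_coefficient(f, a, n)) for a in range(1, 1 << n))

def is_quasirandom(f, n, eps):
    return max_nonzero_coefficient(f, n) <= eps

n = 4
# Indicator of a codimension-1 subspace: highly structured.
subspace = lambda x: int((x & 1) == 0)
# Indicator of {x : x1*x2 + x3*x4 = 1}, built from a bent function.
bent = lambda x: ((x & 1) & ((x >> 1) & 1)) ^ (((x >> 2) & 1) & ((x >> 3) & 1))

print(max_nonzero_coefficient(subspace, n), max_nonzero_coefficient(bent, n))
```

The subspace indicator has a nonzero coefficient of magnitude 1/2, while the set built from the bent function has all nonzero coefficients of magnitude 1/8, so only the latter is quasirandom for a threshold such as 0.2.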
In the two definitions above, the function of interest will often be the indicator of a subset . We next state Green’s regularity lemma for Boolean functions, which is analogous to Szemerédi’s celebrated graph regularity lemma [Szemeredi:78].
[Green’s regularity lemma in ] Let and . There exists a subspace with cosets such that
the codimension of is at most ; and
for all but -fraction of cosets of , the function is -quasirandom.
In Section 4, we will see the proof of Green’s regularity lemma (in the course of providing a constructive and highly query-efficient version of the lemma).
2.5 The Goldreich–Levin Theorem
Given query access to a function , the Goldreich–Levin algorithm [goldreich-levin] allows us to find all linear (parity) functions that are well-correlated with (equivalently, it allows us to find all the “significant” Fourier coefficients of ). More formally, we have the following result.
[Goldreich–Levin algorithm] Let be arbitrary and let be fixed. There is an algorithm that, given query access to , outputs a subset of size such that with probability at least , we have
if , then ; and
if , then .
Furthermore, runs in time and makes queries to .
2.6 Oracles and Oracle Machines
As stated in the introduction, the outputs of our algorithmic procedures—Algorithms 2 and 1—will be oracles to the indicator functions of specific subsets of . We first recall the definition of a probabilistic oracle machine:
Let . A randomized algorithm with black-box query access to , denoted , is said to be a probabilistic oracle machine for if for any input , the algorithm outputs a bit that satisfies
where the probability is taken over the internal coin tosses of . The query complexity of the machine is the number of oracle calls made by to and the running time of the machine is the number of time steps it takes in the worst case (counting each oracle call as a single time step).
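Of course, the 2/3 in the above definition can be upgraded to $1 - \delta$ at a cost of increasing the query complexity by a factor of $O(\log(1/\delta))$, by running the machine independently that many times and taking a majority vote. We next define what it means for an algorithm to “output an (approximate) oracle” for a function.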
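The standard way to boost the success probability of a probabilistic oracle machine is to repeat it and take a majority vote. The following sketch simulates this with a toy machine that errs independently with probability 1/3 on each call; the names `amplify` and `noisy`, the target function, and the repetition count are all assumptions made for illustration.

```python
import random

def amplify(machine, reps):
    """Wrap a randomized 0/1-valued machine in a majority vote over reps runs."""
    def amplified(x):
        votes = sum(machine(x) for _ in range(reps))
        return int(2 * votes > reps)   # reps is odd, so there are no ties
    return amplified

rng = random.Random(2)
truth = lambda x: x & 1                                  # toy target function
noisy = lambda x: truth(x) ^ int(rng.random() < 1 / 3)   # errs w.p. 1/3 per call
reliable = amplify(noisy, reps=301)

print([reliable(x) for x in range(8)])
```

By a Chernoff-type bound, the failure probability of the majority vote decays exponentially in the number of repetitions, which is where a logarithmic overhead in the inverse failure probability comes from.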
Let be two functions . An algorithm with query access to , denoted by , is said to output a -oracle for the function if it outputs a representation of a probabilistic oracle machine for which the following hold:
We have (i.e. );
The query complexity of is at most and the running time of is at most .
If , then we say that is an exact oracle for .
3 A Query-Inefficient Version of the Main Result
In this section, we prove a query-inefficient “non-implicit” version of our main result, which has a polynomial query complexity dependence on the ambient dimension . In particular, we will prove the following theorem.
[Main result, query-inefficient version] Let be an arbitrary subset, and let . Given query access to , there exists an algorithm that makes queries to and does a time computation and outputs with probability at least :
A -oracle to the indicator function of where ; and
A -oracle to the indicator function of the sumset .
We start by recording a corollary of Green’s regularity lemma in Section 3.1, which (informally), given an arbitrary set , establishes the existence of a “structured” set capturing “almost all” of . Section 3.2 then presents a procedure—ConstructDT—that constructs an exact oracle to this structured set , giving item (1) of the above theorem. In Section 3.3, we present a procedure—Simulate-Sumset—that constructs an approximate oracle to the sumset , giving item (2).
3.1 Partitioning Arbitrary Sets into Dense Quasirandom Cosets
Green’s regularity lemma in says that given an arbitrary set and an error parameter , we can partition into (independent of ) many sets such that is “random-like” on almost all of these sets. Moreover, all these sets have a convenient structure: they are cosets of a common subspace of constant codimension.
We will use the following easy consequence of Green’s lemma:
Given and , there exists a subspace of codimension at most and a set such that
For any coset , either or ; and
is -quasirandom for all cosets .
Let be the subspace of of codimension at most guaranteed to exist by Section 2.4, and let be an enumeration of the cosets of where . We know from Section 2.4 that for all but -fraction of , the function is -quasirandom.
Define disjoint subsets , where each , as follows:
If is not -quasirandom, then ;
If , set ;
Otherwise, set .
We now define as
We clearly have and that is -quasirandom for all . (Note that is trivially -quasirandom.) ∎
Informally, Section 3.1 modifies to obtain a structured set that contains “most” of and has either empty or “large” intersection with all of the cosets guaranteed to exist by Green’s regularity lemma. Furthermore, is “random-like” on all (as opposed to almost all) of these cosets.
3.2 A Constructive Regularity Lemma via the Goldreich–Levin Theorem
In this section, we make Section 3.2 constructive via the Goldreich–Levin algorithm. The procedure ConstructDT presented in Algorithm 1 closely follows the structure of Green’s original proof of the regularity lemma itself [Green:05].
Initialize the decision tree to contain no internal nodes and one leaf labelled by . Define
At each stage of growing , do the following:
Let denote the cosets corresponding to the leaves of the decision tree at the current stage. The leaf node is labelled by the function .
For each coset , call
For each non-empty , for each , estimate up to additive error with confidence . If the estimate is less than , then remove from .
If for at least -fraction of the , go to Step 3.
Let the collection of labels of all internal nodes be . For each non-empty :
Choose . Check if the collection is linearly independent.
If so, then add to and split all nodes at the current stage on .
Repeat Step 2.
For each leaf node—say, corresponding to the coset —estimate up to an additive error of with confidence .
If , set the function associated to the leaf node to be the identically- function.
Else set it to be the identically- function.
Define the oracle to be the function
Let be an arbitrary subset. Given query access to and , the procedure ConstructDT described in Algorithm 1:
Makes queries to and does a time computation; and
With probability outputs a deterministic -oracle for where is as in Section 3.1.
We note that the procedure ConstructDT makes queries to the oracle in the course of running the Goldreich–Levin algorithm.
We first argue that Step 2 in the procedure ConstructDT terminates; this essentially follows from Green’s original proof of the regularity lemma in . In particular, suppose, at the current stage, the subspace given by the internal nodes of the parity decision tree is , and let denote the cosets corresponding to the leaves. Consider the potential function
where we recall that . Note that . Informally, captures the “expected imbalance” of restricted to the leaf nodes of the tree at the current stage.
Lemma 2.2 of [Green:05] (alternatively, see [ROD-green-regularity]) states if there exists a leaf node and a parity such that , then upon splitting all nodes at the current level on the parity —with being the subspace corresponding to the resulting tree—we have
It follows that if the condition in Line 2(d) of ConstructDT does not hold, then after Step 2(e), the value of increases by at least . It follows that Step 2 can be repeated at most times.
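The energy-increment idea behind this termination argument can be checked numerically in the simplest case. The sketch below assumes that the potential is the average, over the leaves of the current tree, of the squared density of the set in the corresponding coset; this mean-square-density potential is the usual choice in regularity arguments, but it is an assumption here and may differ from the paper's exact normalization. For the first split, the gain equals the square of the Fourier coefficient of the parity that was split on.

```python
def parity(alpha, x):
    """The parity <alpha, x> over F_2 (vectors are ints)."""
    return bin(alpha & x).count("1") & 1

def fourier_coefficient(A, alpha, n):
    """Fourier coefficient of the indicator of A at alpha."""
    return sum((-1) ** parity(alpha, x) for x in A) / (1 << n)

def potential(A, labels, n):
    """Average over leaves of (density of A in the leaf's coset)^2.
    Leaves have equal size when the labels are linearly independent."""
    counts = {}
    for x in range(1 << n):
        leaf = tuple(parity(a, x) for a in labels)
        size_hits = counts.setdefault(leaf, [0, 0])
        size_hits[0] += 1
        size_hits[1] += x in A
    return sum((hits / size) ** 2 for size, hits in counts.values()) / len(counts)

n = 4
A = {0, 2, 4, 6, 8, 10, 1, 3}        # 6 even points and 2 odd points
alpha = 0b0001
before = potential(A, [], n)         # trivial tree: a single leaf
after = potential(A, [alpha], n)     # split every leaf on the parity alpha
gain = after - before
print(before, after, gain)
```

Since a potential of this kind is bounded between 0 and 1 and each split on a large coefficient increases it by a fixed amount, only boundedly many splits can occur.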
Next, note that the Goldreich–Levin call in Step 2(b) makes at most queries to over the run of ConstructDT, and each call to Step 2(e) and Step 3 makes and many queries (via a standard application of the Chernoff bound). The overall query complexity of ConstructDT follows. The runtime is similarly clear.
Note that we run the Goldreich–Levin algorithm in Step 2(b) on the function as opposed to . It follows from Section 2.2 that is -quasirandom if and only if is -quasirandom (where is the number of cosets at a particular stage of the algorithm). We also note that given query access to , we can simulate query access to by checking whether an input belongs to the coset by querying it on the parity decision tree .
In the pruning procedure in Step 2(e), the size of each is at most . A union bound over the Goldreich–Levin and estimation procedures implies that with probability , the function computed by indicates whether a point is in a coset for which is -quasirandom and also . It follows that is an exact oracle for ; it also clearly makes exactly query to on any input. ∎
3.3 Approximately Simulating Sumsets
Note that Section 3.1 asserts, for arbitrary , the existence of a structured subset (which is “almost all of ”) and a subspace such that is -quasirandom for all . The following lemma indicates why such a decomposition is useful towards our goal of (approximately) simulating sumsets.
Let be arbitrary and let be a subspace. Suppose, for ,
are -quasirandom (in the sense of Section 2.4); and
for some .
Then we have
For ease of notation, define the -quasirandom functions and as
Consider and note that . From Equation 4, we have that
Note that and