1 Introduction
Recent decades have witnessed a paradigm shift in the notion of what constitutes an "efficient algorithm" in algorithms and complexity theory. Motivated by both practical applications and theoretical considerations, the traditional gold standard of linear time as the ultimate benchmark of algorithmic efficiency has given way to sublinear-time and sublinear-query algorithms, as introduced by Blum and Kannan [BlumKannan:89] and Blum, Luby and Rubinfeld [BLR93]. The study of sublinear algorithms is flourishing, with deep connections to many other areas including PCPs, hardness of approximation, and streaming algorithms (see e.g. the surveys [Rubinfeld:06survey, Goldreich17book, Fischer, R00]).
The current paper is at the confluence of two different lines of research in the area of sublinear algorithms:

The first strand of work deals with sublinear algorithms to approximately compute (numerical-valued) functions on various combinatorial objects. Example problems of this sort include (i) estimating the weight of a minimum spanning tree [ChazelleRT05]; (ii) approximating the minimum vertex cover size in a graph [PARNAS2007183]; and (iii) approximating the number of cliques in an undirected graph [EdenRonSeshadri]. We note that for the first two of these results, the number of local queries made to the input combinatorial object is completely independent of its size.

The second strand of work is on property testing of Boolean-valued functions. Given a class $\mathcal{C}$ of Boolean-valued functions, a testing algorithm for $\mathcal{C}$ is a query-efficient procedure which, given oracle access to an arbitrary Boolean-valued function $f$, distinguishes between the two cases that (i) $f$ belongs to the class $\mathcal{C}$, versus (ii) $f$ is far from every function in $\mathcal{C}$. Flagship results in this area include algorithms for linearity testing [BLR93], testing of low-degree polynomials [RS96, jutpatrudzuc04], junta testing [FKRSS03, Blaisstoc09], and monotonicity testing [GGLRS, KhotMinzer15]. Here too, for the first three of these properties, the query complexity of the testing algorithms depends only on the accuracy parameter and is completely independent of the ambient dimension of the function $f$.
In recent years, a nascent line of work has emerged at the intersection of these two strands, where the high-level goal is to approximately compute various numerical parameters of Boolean-valued functions. As an example, building on the work of Kothari et al. [KothariNOW14], Neeman [Neeman2013surface] gave an algorithm to approximate the "surface area" of a Boolean-valued function on $\mathbb{R}^n$, which is a fundamental measure of its complexity [KOS:08]. The [Neeman2013surface] algorithm has a query complexity that depends only on the target surface area and is completely independent of the ambient dimension $n$. Fitting the same motif is the work of Ron et al. [RonWinfluence], who studied the problem of approximating the "total influence" (or equivalently, "average sensitivity") of a Boolean function. They determined the optimal query complexity for approximating the influence of an arbitrary $n$-variable Boolean function to constant relative error, and showed that this bound can be substantially improved for monotone functions. More recently, in closely related work, Rubinfeld and Vasilyan [DBLP:conf/approx/RubinfeldV19] have given a constant-query algorithm to approximate the "noise sensitivity" of a Boolean function.
We note that each of the above three numerical parameters (surface area, total influence, and noise sensitivity) is essentially a measure of the "smoothness" of the Boolean function in question. In contrast, in this work we are interested in the sumset size, which has a rather different flavor and, as discussed below, is intimately connected to the subspace structure of the function.
Sumsets.
Let $A \subseteq \mathbb{F}_2^n$ be an arbitrary subset (which may of course be viewed as a Boolean function by considering its $\{0,1\}$-valued characteristic function $1_A$). One of the most fundamental operations on such a set is to consider the sumset $A + A := \{a + a' : a, a' \in A\}$, where "$+$" is the group operation in $\mathbb{F}_2^n$. Note that for an affine subspace $A$ we have $|A + A| = |A|$, and the converse (the only sets $A$ for which $|A + A| = |A|$ are affine subspaces) is also easily seen to hold. In fact, something significantly stronger is true: the celebrated Freiman–Ruzsa theorem [freiman1973foundations, AST_1999__258__323_0, apde.2012.5.627] states that if $|A + A| \leq K|A|$, then $A$ is contained inside an affine subspace whose size is at most a constant (depending only on $K$) times $|A|$. Thus, the value of $|A + A|$ vis-à-vis $|A|$ can be seen as a measure of the "subspace structure" of $A$.
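The dichotomy above is easy to verify computationally. The following small sketch (our own illustration, not from the paper) encodes vectors of $\mathbb{F}_2^n$ as $n$-bit integers, so that the group operation "$+$" is bitwise XOR, and checks on $\mathbb{F}_2^3$ that $|A + A| = |A|$ exactly for affine subspaces.

```python
# Illustrative check: over F_2^n (n-bit integers under XOR), a set A
# satisfies |A + A| = |A| exactly when A is an affine subspace.

def sumset(A):
    """A + A = {a + a' : a, a' in A}, with "+" realized as XOR."""
    return {a ^ b for a in A for b in A}

def is_affine_subspace(A):
    # A is affine iff its translate through any of its points is XOR-closed.
    a0 = next(iter(A))
    V = {a ^ a0 for a in A}             # translate so that 0 lies in V
    return all(x ^ y in V for x in V for y in V)

# An affine coset in F_2^3: the subspace {000, 011, 101, 110} shifted by 100.
A = {0b100 ^ v for v in (0b000, 0b011, 0b101, 0b110)}
assert len(sumset(A)) == len(A) and is_affine_subspace(A)

# A non-affine set: its sumset is strictly larger.
B = {0b000, 0b001, 0b010, 0b100}
assert len(sumset(B)) > len(B) and not is_affine_subspace(B)
```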
1.1 The Question We Consider
For $A \subseteq \mathbb{F}_2^n$, we define $\mathrm{vol}(A) := |A|/2^n$ to be the normalized size or volume of $A$. This paper is motivated by the following basic algorithmic problem about sumsets:
Sumset size estimation (naive formulation): Given black-box oracle access to a set $A \subseteq \mathbb{F}_2^n$ (via its characteristic function $1_A$), can we estimate $\mathrm{vol}(A + A)$ while making only "few" oracle calls to $1_A$?
At first glance this seems to be a difficult problem, since to confirm that a given point $z$ does not belong to $A + A$, we must verify that for each of the $2^n$ pairs $(x, y)$ satisfying $x + y = z$, at least one of $x, y$ does not belong to $A$. Indeed, for the above naive problem formulation, any algorithm must make exponentially many queries even to distinguish between the two extreme cases that $A = \emptyset$ (i.e. $\mathrm{vol}(A + A) = 0$) versus $\mathrm{vol}(A + A) \approx 1$. To see this, suppose that $A$ is a uniform random subset of roughly $2^{n/2}$ many elements of $\mathbb{F}_2^n$. It is clear that any algorithm will need exponentially many queries to distinguish such an $A$ from the empty set, and an easy calculation shows that such a random $A$ will, with extremely high probability, have $\mathrm{vol}(A + A)$ extremely close to 1.
This simple example already shows that some care must be taken to formulate the "right" version of the sumset size estimation problem. This situation is analogous to the surface area testing problem that was studied in [KothariNOW14, Neeman2013surface]: in that setting, given oracle access to any set $A$, by adding a measure-zero set to $A$ (which is undetectable by an algorithm with oracle access to $A$) it is possible to "blow up" the surface area of $A$ to an arbitrarily large value. Thus the goal in [KothariNOW14, Neeman2013surface] is to find a value that is close to the surface area of some set that is "close to $A$." Note that for surface area, it may be possible to dramatically increase the surface area of a set either by adding a small set of new points or by removing a small set of existing points from $A$. In contrast, for sumset size it is clear that removing points from $A$ can never cause the sumset size to increase, and moreover adding a small (random) collection of points to $A$ can always cause $\mathrm{vol}(A + A)$ to become extremely close to 1. Hence for our sumset size estimation problem we only allow subsets of $A$ as the permissible "close to $A$" sets.
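The sparse-random-set obstruction described above is easy to observe empirically. The following quick simulation (ours; the parameters are illustrative) draws a uniform random set of only about $\sqrt{2^n}$ points, which is far too sparse for a query algorithm to tell apart from the empty set, and confirms that its sumset nevertheless has volume essentially 1.

```python
# Simulation: a random sparse A in F_2^16 has a sumset covering almost everything.

import random

random.seed(0)
n = 16
N = 1 << n
m = 4 * (1 << (n // 2))        # |A| = 1024: a vanishing 1/64 fraction of F_2^16
A = random.sample(range(N), m)

covered = set()
for a in A:
    for b in A:
        covered.add(a ^ b)     # "+" in F_2^n is XOR

assert len(A) / N < 0.02       # A itself is very sparse...
assert len(covered) / N > 0.99 # ...yet A + A has volume essentially 1
```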
We thus arrive at the following formulation of our problem:
Sumset size estimation: Given black-box oracle access to a set $A \subseteq \mathbb{F}_2^n$ and an accuracy parameter $\epsilon > 0$, compute $\mathrm{vol}(A' + A')$ to additive accuracy $\pm \epsilon$ for some subset $A' \subseteq A$ that contains all but an $\epsilon$ fraction of $A$.
1.2 Motivation
Given the importance of sumsets in additive combinatorics, we feel that it is natural to investigate algorithmic questions dealing with basic properties of sumsets; estimating the size of a sumset is a natural algorithmic question of this sort. We further remark that while there is no direct technical connection to the present work, the path which led us to the sumset size estimation problem originated in an effort to develop a query-efficient algorithm for convexity testing (i.e. testing whether a subset of $\mathbb{R}^n$ is convex versus far from convex, where the standard Normal distribution provides the underlying distance measure on $\mathbb{R}^n$). In particular, the recent characterization by Shenfeld and van Handel of the equality cases of the Ehrhard–Borell inequality (see Theorem 1.2 of [rvhequality]) implies that a closed symmetric set $K \subseteq \mathbb{R}^n$ is convex if and only if the Gaussian volume of $\frac{1}{2}(K + K)$ equals the Gaussian volume of $K$. We believe that a robust version of this theorem might be useful for convexity testing; this naturally motivates a Gaussian-space version of the sumset size estimation question, where the Minkowski sum of sets in $\mathbb{R}^n$ plays the role of sumsets over $\mathbb{F}_2^n$. We hope that the ideas and ingredients in the current work may eventually be of use for the Gaussian-space Minkowski sum size estimation problem, and perhaps ultimately for convexity testing.

1.3 Our Main Result
Our main result is an algorithm for the sumset size estimation problem which makes only constantly many queries, independent of the ambient dimension $n$. We state our main result informally below:
Given oracle access to any set $A \subseteq \mathbb{F}_2^n$ and an error parameter $\epsilon > 0$, there is an algorithm, making a number of queries to $1_A$ that depends only on $\epsilon$, with the following guarantee: with high probability, the algorithm outputs a value $v$ such that $|v - \mathrm{vol}(A' + A')| \leq \epsilon$ for some set $A' \subseteq A$ that contains all but an $\epsilon$ fraction of $A$.
In fact, as we describe in more detail later, our algorithm does more than just approximate the volume of $A' + A'$: it outputs a high-accuracy approximate oracle for the set $A' + A'$, given which it is trivially easy to approximate the volume by random sampling. (As we will see, our algorithm also outputs an exact oracle for the set $A'$.) Later we will give a formal definition of what it means to "output an oracle" for a set $S$; informally, it means we give a description of an oracle algorithm (which uses a black-box oracle to $A$) which, on any input $x \in \mathbb{F}_2^n$, (i) determines whether $x \in S$, and (ii) makes few invocations to the oracle for $A$. We further note that the running time of our algorithm is linear in $n$ (note that even writing down an $n$-bit string as a query input to $A$ takes linear time).
1.4 Technical Overview
1.4.1 A Conceptual Overview of the Algorithm
In this subsection we give a conceptual overview of our algorithm. At a high level, our approach is based on the structure versus randomness paradigm that has proven to be very influential in additive combinatorics [TaoVu:06] and property testing. Our algorithm relies on two main ingredients, which we describe below.
To explain the key ingredients we need the notion of quasirandomness from additive combinatorics. For a set $A \subseteq \mathbb{F}_2^n$ and a parameter $\gamma > 0$, we say $A$ is $\gamma$-quasirandom if each nonzero Fourier coefficient of $1_A$ satisfies $|\widehat{1_A}(S)| \leq \gamma$, where we are viewing $1_A$ as a characteristic function over the domain $\mathbb{F}_2^n$. The definition of the Fourier transform extends to the more general setting in which $1_A$ is a characteristic function whose domain is some coset (of size $2^{n-k}$) of a subspace $H \leq \mathbb{F}_2^n$ of codimension $k$. This is done by identifying the coset with $\mathbb{F}_2^{n-k}$ via a homomorphism; we give details later in Section 2.4.

The first ingredient is the following: let $H$ be a linear subspace of $\mathbb{F}_2^n$, and let $A_1, A_2$ be subsets of cosets $H + x_1$ and $H + x_2$ respectively. Suppose that the densities of $A_1$ and $A_2$ within their cosets are both at least $\delta$, and that both $A_1$ and $A_2$ are quasirandom with a suitable parameter relative to $\delta$ (viewed as characteristic functions whose domains are the cosets $H + x_1$ and $H + x_2$ respectively). Our first ingredient is the simple but useful observation that, under these conditions, the set $A_1 + A_2$ (which is easily seen to be a subset of the coset $H + x_1 + x_2$) must be almost the entire coset (see Section 3.3).
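This first ingredient can be checked numerically. The following toy verification (our own code, with illustrative parameters) uses the fact that random dense sets are quasirandom with high probability: it confirms via a fast Walsh–Hadamard transform that every nontrivial Fourier coefficient of a random density-$0.1$ set is tiny, and that the sumset of two such sets is the entire group.

```python
# Toy check: dense quasirandom sets in F_2^12 have a sumset covering the group.

import random

random.seed(1)
n, delta = 12, 0.1
N = 1 << n

def fourier(vals):
    # Fast Walsh-Hadamard transform, normalized so that the entry indexed by S
    # equals E_x[vals[x] * (-1)^{<S, x>}], i.e. the Fourier coefficient at S.
    coeffs = list(vals)
    h = 1
    while h < N:
        for i in range(0, N, 2 * h):
            for j in range(i, i + h):
                u, v = coeffs[j], coeffs[j + h]
                coeffs[j], coeffs[j + h] = u + v, u - v
        h *= 2
    return [c / N for c in coeffs]

# Random sets of density ~delta; such sets are quasirandom w.h.p.
A = {x for x in range(N) if random.random() < delta}
B = {x for x in range(N) if random.random() < delta}

# Every nontrivial Fourier coefficient of 1_A is tiny compared to its density.
coeffs = fourier([1.0 if x in A else 0.0 for x in range(N)])
assert max(abs(c) for c in coeffs[1:]) < 0.05 < coeffs[0]

# ...and the sumset of two such dense quasirandom sets is the whole group.
AB = {a ^ b for a in A for b in B}
assert len(AB) == N
```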
The second ingredient is Green's well-known "regularity lemma" for Boolean functions [Green:05]. To explain this, for any set $A \subseteq \mathbb{F}_2^n$, subspace $H$ of $\mathbb{F}_2^n$, and coset $H + x$, let $A_{H+x}$ be the intersection of $A$ with the coset $H + x$. Roughly speaking, Green's regularity lemma shows that for any $\gamma > 0$, there is a subspace $H$ of codimension depending only on $\gamma$ such that the following holds: with probability $1 - \gamma$ over a uniform random choice of coset $H + x$, the set $A_{H+x}$ is quasirandom (viewed as a subset of the coset $H + x$). Moreover, the proof of the regularity lemma gives an iterative procedure to identify $H$; very roughly speaking, until the procedure terminates, at each stage it identifies a vector whose corresponding Fourier coefficient (of a suitable restriction of $1_A$) is large, and takes $H$ to be the subspace cut out by the vectors identified so far.

With these two ingredients in place, we are ready to explain (at least at a qualitative level; we defer discussion of how to achieve the desired query complexity to the next subsection) the algorithm for simulating an oracle to $A' + A'$. First, we run the algorithmic version of Green's regularity lemma; having done so, we have a subspace $H$ and we know that for most cosets $H + x$, the set $A_{H+x}$ is quasirandom. Let $k$ be the codimension of $H$ and let $X$ be a set of $2^k$ many coset representatives for the cosets of $H$. Let $X' \subseteq X$ be the subset consisting of those coset representatives $x$ for which the set $A_{H+x}$ (i) is quasirandom and (ii) has density at least $\delta$ when viewed as a subset of $H + x$ (where $\delta$ is some carefully chosen parameter that we do not specify here). We note that given any coset, condition (ii) can be checked using simple random sampling. Condition (i) is equivalent to checking that the set $A_{H+x}$ has no large nonzero Fourier coefficient. This can be done using the celebrated Goldreich–Levin algorithm [goldreichlevin].^1 Thus, at this point our algorithm has determined the set $X'$.

^1 To be more accurate, this requires a slight adaptation of the Goldreich–Levin algorithm, because the domain here is a coset rather than the more familiar domain $\mathbb{F}_2^n$ for Goldreich–Levin.
The set $A'$ is defined to be

$A' := \bigcup_{x \in X'} A_{H + x},$

i.e. $A'$ is obtained from $A$ by removing $A_{H+x}$ for each $x \in X \setminus X'$, or equivalently, by "zeroing out" $A$ on every coset where $A_{H+x}$ either is not quasirandom or has density smaller than $\delta$. (Since the algorithm knows $H$ and $X'$, it is clear from this definition of $A'$ that, as mentioned after the informal theorem statement given earlier, the algorithm can simulate an exact oracle for the set $A'$.) Turning to $A' + A'$, we have that

(1) $A' + A' = \bigcup_{x, x' \in X'} \big(A_{H+x} + A_{H+x'}\big) \approx \bigcup_{x, x' \in X'} \big(H + x + x'\big),$

where the approximate equality (each sumset on the left is almost the entire corresponding coset on the right) follows from Section 3.3, the result that we informally stated as the first ingredient mentioned above. As above, since the algorithm knows $H$ and $X'$, it is clear from this that the algorithm can simulate an approximate oracle for $A' + A'$.
1.4.2 Achieving Constant Query Complexity
The above description gives a high-level account of our algorithm, at least at a conceptual level. However, there is a significant caveat, which arises when we consider the query complexity of the algorithm. Our goal is to achieve query complexity independent of the ambient dimension $n$, but explicitly obtaining a description of the subspace $H$ necessarily requires a number of queries that scales at least linearly in $n$; indeed, even explicitly describing a single vector in $\mathbb{F}_2^n$ requires $n$ bits of information (and thus this many queries). Similarly, obtaining an explicit description of even a single vector would be prohibitively expensive using only constantly many queries. To circumvent these obstacles and achieve constant (rather than linear or worse) query complexity, we need to develop "implicit" versions of the procedures described above.
As an example, we recall that the standard Goldreich–Levin algorithm, given oracle access to any set $A \subseteq \mathbb{F}_2^n$, outputs a list of parity functions $\chi_{v_1}, \ldots, \chi_{v_m}$ such that the Fourier coefficient $\widehat{1_A}(v_i)$ is "large" for each $i$. However, explicitly outputting the label of even a single parity would require $n$ bits of information. To avoid this, we slightly modify the standard Goldreich–Levin procedure to show that with constantly many queries, we can output oracles to the parity functions $\chi_{v_1}, \ldots, \chi_{v_m}$. In turn, each such oracle can be computed on any point with just constantly many queries to the set $A$; thus, we have implicit access to the parity functions rather than explicit descriptions of the parities. In the language of coding theory, this amounts to an analysis showing that the Goldreich–Levin algorithm can be used to achieve constant-query "local list correction" of the Hadamard code. We view this as essentially folklore [Sudan21]; it is implicit in a number of previous works [sudtrevad01, KS13], but the closest explicit statements we have been able to find in the literature essentially say that Goldreich–Levin is a constant-query local list decoder (rather than local list corrector) for the Hadamard code.
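The engine behind such implicit access is the classical two-query self-correction of the Hadamard code. The following sketch (our own construction; the hidden parity label and noise rate are hypothetical) shows how an oracle that agrees with some parity $\chi_a$ on 90% of inputs can be used to evaluate $\chi_a$ at any point with a number of queries independent of $n$.

```python
# Local correction of the Hadamard code: by linearity, chi_a(x^r) ^ chi_a(r)
# equals chi_a(x) for every r, so averaging over random r corrects noise.

import random

random.seed(2)
n = 10
N = 1 << n
a = 0b1011001101          # hidden parity label (hypothetical)

def chi(a, x):            # the parity <a, x> mod 2
    return bin(a & x).count("1") & 1

corrupted = set(random.sample(range(N), N // 10))   # corrupt a 10% fraction
def noisy(x):
    return chi(a, x) ^ (1 if x in corrupted else 0)

def corrected(x, trials=101):
    # Each trial queries noisy() at the two uniformly distributed points
    # x ^ r and r; each query errs w.p. 0.1, so a trial errs w.p. <= 0.2.
    # A majority vote over the trials drives the error down exponentially.
    votes = sum(noisy(x ^ r) ^ noisy(r)
                for r in (random.randrange(N) for _ in range(trials)))
    return 1 if 2 * votes > trials else 0

assert all(corrected(x) == chi(a, x) for x in random.sample(range(N), 50))
```

Note that each call to `corrected` makes $2 \cdot 101$ queries regardless of $n$; this is the sense in which the parity oracle is "implicit."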
With an "implicit" version of the Goldreich–Levin algorithm in hand, we show how to carefully use it to obtain an "implicit" algorithmic version of Green's regularity lemma. This implicit version is sufficient to carry out the steps mentioned above with overall constant query complexity. We hope that the implicit (query-efficient) versions of these algorithms may be useful in other settings beyond the current work.
1.5 Related Work
As noted earlier, our sumset size estimation problem has a similar flavor to the work of [KothariNOW14, Neeman2013surface] on testing surface area, but the technical details are entirely different.
We note that for any invertible affine transformation $L : \mathbb{F}_2^n \to \mathbb{F}_2^n$, we have $|L(A) + L(A)| = |A + A|$ (but clearly this need not hold for non-invertible affine transformations). Starting with the influential paper of Kaufman and Sudan [KaufmanSudan:08], a number of works have studied the testability of affine-invariant properties; see e.g. [bhattacharyya2015unified, bhattacharyya2013testing, hatami2013estimating, yoshida2014characterization, hatami2016general, bhattacharyya2013guest]. These works consider properties that are invariant under all affine transformations (not just invertible ones), which makes them inapplicable to our setting. However, we note that there are thematic similarities between the approaches in those works and our approach (in particular, the use of the "structure versus randomness" paradigm).
2 Preliminaries
In this section, we set notation and briefly recall preliminaries from additive combinatorics and Fourier analysis of Boolean functions. Given arbitrary , we define
We will sometimes identify a set $A \subseteq \mathbb{F}_2^n$ with its indicator function $1_A : \mathbb{F}_2^n \to \{0,1\}$, defined as $1_A(x) = 1$ if $x \in A$ and $1_A(x) = 0$ otherwise, for $x \in \mathbb{F}_2^n$. When $A \subseteq C$ for some coset $C$, we similarly identify $A$ with its indicator function $1_A : C \to \{0,1\}$. We write $e_i$ to denote the vector with a 1 in the $i$-th position and 0 everywhere else. The function $\mathrm{tower}(h)$ denotes an exponential tower of 2's of height $h$, and the function $\log^*$ denotes its inverse.
2.1 Analysis of Boolean Functions
Our notation and terminology follow [odonnellbook]. We will view the vector space of functions $f : \mathbb{F}_2^n \to \mathbb{R}$ as a real inner product space, with inner product $\langle f, g \rangle := \mathbf{E}_{x}[f(x)\, g(x)]$, where $x$ is uniform over $\mathbb{F}_2^n$. It is easy to see that the collection of parity functions $\chi_S(x) := (-1)^{S \cdot x}$, where $S$ ranges over $\mathbb{F}_2^n$, forms an orthonormal basis for this vector space. In particular, every function $f$ can be uniquely expressed via its Fourier transform, given by

(2) $f(x) = \sum_{S \in \mathbb{F}_2^n} \widehat{f}(S)\, \chi_S(x), \quad \text{where } \widehat{f}(S) := \langle f, \chi_S \rangle.$

The real number $\widehat{f}(S)$ is called the Fourier coefficient of $f$ on $S$, and the collection of all Fourier coefficients of $f$ is called the Fourier spectrum of $f$. We recall Parseval's and Plancherel's formulas: for all $f, g : \mathbb{F}_2^n \to \mathbb{R}$, we have

(3) $\langle f, f \rangle = \sum_{S} \widehat{f}(S)^2 \quad \text{and} \quad \langle f, g \rangle = \sum_{S} \widehat{f}(S)\, \widehat{g}(S).$

It follows that $\sum_S \widehat{f}(S)^2 \leq 1$ for $\{0,1\}$-valued $f$. Given $f, g : \mathbb{F}_2^n \to \mathbb{R}$, their convolution is the function $f * g$ defined by

$(f * g)(x) := \mathbf{E}_{y}[f(y)\, g(x + y)],$

which satisfies

(4) $\widehat{f * g}(S) = \widehat{f}(S) \cdot \widehat{g}(S) \quad \text{for all } S \in \mathbb{F}_2^n.$
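The convolution identity (4) can be verified by brute force on a small domain. The following check (our own verification code, with arbitrary test functions) computes both sides of the identity directly from the definitions above on $\mathbb{F}_2^4$.

```python
# Brute-force numerical check of the convolution identity
# \hat{f*g}(S) = \hat{f}(S) * \hat{g}(S) over F_2^4.

import random

n = 4
N = 1 << n

def chi(S, x):
    return (-1) ** bin(S & x).count("1")

def fhat(f, S):
    # Fourier coefficient: E_x[f(x) chi_S(x)]
    return sum(f[x] * chi(S, x) for x in range(N)) / N

random.seed(3)
f = [random.choice([-1.0, 1.0]) for _ in range(N)]
g = [random.uniform(-1, 1) for _ in range(N)]

# (f * g)(x) = E_y[f(y) g(x + y)], where "+" is XOR in F_2^n.
conv = [sum(f[y] * g[x ^ y] for y in range(N)) / N for x in range(N)]

for S in range(N):
    assert abs(fhat(conv, S) - fhat(f, S) * fhat(g, S)) < 1e-9
```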
2.2 Subspaces and Functions on Subspaces
Throughout this subsection, let $n \geq 1$ and let $H$ be a linear subspace of $\mathbb{F}_2^n$ of codimension $k$ (so $|H| = 2^{n-k}$). We can write

(5) $H = \{x \in \mathbb{F}_2^n : \gamma_1 \cdot x = \cdots = \gamma_k \cdot x = 0\}$

for some linearly independent collection of vectors $\gamma_1, \ldots, \gamma_k \in \mathbb{F}_2^n$.
A coset of $H$, which is an affine subspace or, equivalently, a "translate" $H + y$ for some $y \in \mathbb{F}_2^n$, can be expressed as a set of the form

$C_b = \{x \in \mathbb{F}_2^n : \gamma_1 \cdot x = b_1, \ldots, \gamma_k \cdot x = b_k\}$

for some $b \in \mathbb{F}_2^k$; we will often identify the coset $C_b$ with the vector $b$. Note that if $y \in C_b$, then $C_b = H + y$.
Any coset of $H$ is affinely isomorphic to a copy of $\mathbb{F}_2^{n-k}$, and this lets us define the Fourier transform of a function restricted to a coset $C$ of $H$. More formally, fixing such an isomorphism $\pi : \mathbb{F}_2^{n-k} \to C$, consider the function $f_C : \mathbb{F}_2^{n-k} \to \mathbb{R}$ defined as $f_C(x) := f(\pi(x))$. Its Fourier spectrum is indexed by the elements of $\mathbb{F}_2^{n-k}$; in particular, for each $S \in \mathbb{F}_2^{n-k}$ we have

(6) $\widehat{f_C}(S) = \mathbf{E}_{x \in \mathbb{F}_2^{n-k}}\big[f_C(x)\, \chi_S(x)\big].$
We can alternatively restrict a function $f$ to a coset $C$ but treat it as a function on all of $\mathbb{F}_2^n$ that takes value 0 on all points outside $C$; this viewpoint will be notationally cleaner to work with going forward, so we elaborate on it here. We define the function $f_{|C} : \mathbb{F}_2^n \to \mathbb{R}$ as

(7) $f_{|C}(x) := f(x)$ if $x \in C$, and $f_{|C}(x) := 0$ otherwise.
The Fourier coefficients of $f_C$ and $f_{|C}$ are related by the following simple fact.

Let $H$ and $\gamma_1, \ldots, \gamma_k$ be as in Equation 5, and let $C$ be a coset of $H$. Let $R$ be a collection of coset representatives for $H$ (so every vector in $\mathbb{F}_2^n$ has a unique representation as $r + h$ with $r \in R$ and $h \in H$). Then for any $S \in \mathbb{F}_2^n$, the coefficient $\widehat{f_{|C}}(S)$ equals, up to sign, $2^{-k}$ times a Fourier coefficient of $f_C$.

Proof.
For ease of notation we first consider the case that $H$ is the span of the last $n - k$ coordinate vectors, i.e. Equation 5 holds with $\gamma_i = e_i$ for $i \in [k]$, where we write $x = (x_1, \ldots, x_n)$. We may take $R$ to be the set of all vectors in $\mathbb{F}_2^n$ whose last $n - k$ coordinates are all 0, and we note that $|R| = 2^k$. Write $C = H + c$, and for $S \in \mathbb{F}_2^n$ write $S = (S_1, S_2)$ with $S_1 \in \mathbb{F}_2^k$ and $S_2 \in \mathbb{F}_2^{n-k}$. We have

$\widehat{f_{|C}}(S) = \frac{1}{2^n} \sum_{x \in C} f(x)\, \chi_S(x) = \frac{\chi_S(c)}{2^n} \sum_{h \in H} f(c + h)\, \chi_{S_2}(h),$

where we have abused notation in the last expression and viewed $h$ as an element of $\mathbb{F}_2^{n-k}$ (via its last $n - k$ coordinates, as $h_1 = \cdots = h_k = 0$ for $h \in H$). By Equation 6 this gives us

$\widehat{f_{|C}}(S) = \frac{\chi_S(c)}{2^k}\, \widehat{f_C}(S_2).$

The result in the general case follows by applying an invertible linear transformation mapping $H$ to the span of the last $n - k$ coordinate vectors (see Exercise 3.1 of [odonnellbook]). ∎

2.3 Parity Decision Trees
We will only need the notion of a "nonadaptive" parity decision tree:

[nonadaptive parity decision tree] A nonadaptive parity decision tree $T$ is a representation of a function $f : \mathbb{F}_2^n \to \mathbb{R}$. It consists of a rooted binary tree of depth $k$ with $2^k$ leaves, so every root-to-leaf path has length exactly $k$. Each internal node at depth $i$ is labeled by a vector $\gamma_i \in \mathbb{F}_2^n$ corresponding to the parity function $\chi_{\gamma_i}$, and the vectors $\gamma_1, \ldots, \gamma_k$ are linearly independent. (Having all nodes at level $i$ be labeled with the same vector $\gamma_i$ is the sense in which the tree is "nonadaptive.") The outgoing edges of each internal node are labeled 0 and 1, and the leaves of $T$ are labeled by functions (which are restrictions of $f$). The size of $T$ is the number of leaf nodes of $T$.

In more detail, a root-to-leaf path can be written as $(b_1, \ldots, b_k)$, where at the depth-$i$ internal node we follow the outgoing edge labeled $b_i \in \{0,1\}$. On an input $x$, the parity decision tree follows the root-to-leaf path $(\gamma_1 \cdot x, \ldots, \gamma_k \cdot x)$ and outputs the value at $x$ of the function associated to the leaf reached.

Note that given $f : \mathbb{F}_2^n \to \mathbb{R}$ and a subspace $H$ of codimension $k$ as in Equation 5, we can associate a natural parity decision tree in which each depth-$i$ internal node is labeled by $\gamma_i$ and each leaf node (corresponding to some coset $C$ of $H$) is labeled by the restriction of $f$ to $C$.
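Since all depth-$i$ nodes share the label $\gamma_i$, a nonadaptive tree is fully described by the list of parities plus a table assigning a leaf function to each outcome vector (i.e. to each coset). The following minimal sketch (ours; the labels are hypothetical) evaluates such a tree.

```python
# Evaluating a nonadaptive parity decision tree: the outcome tuple of the
# shared parities identifies the coset of x, and that leaf's function is applied.

def parity(gamma, x):
    return bin(gamma & x).count("1") & 1

def eval_tree(gammas, leaf_fns, x):
    # leaf_fns maps the tuple of parity outcomes (the coset of x) to a function.
    outcomes = tuple(parity(g, x) for g in gammas)
    return leaf_fns[outcomes](x)

# Depth k = 2 over F_2^3: the labels 110 and 011 are linearly independent, so
# the four leaves correspond to the four cosets of a codimension-2 subspace.
gammas = [0b110, 0b011]
leaf_fns = {
    (0, 0): lambda x: 0, (0, 1): lambda x: 1,
    (1, 0): lambda x: 1, (1, 1): lambda x: 0,
}
assert eval_tree(gammas, leaf_fns, 0b101) == 0   # 101 lies in the (1, 1) coset
```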
2.4 Quasirandomness and Greenβs Regularity Lemma
The following definition of quasirandomness has been well-studied as a notion of pseudorandomness in additive combinatorics; we refer the interested reader to [Chung1992] for more details.

[quasirandomness] We say that $f : \mathbb{F}_2^n \to \mathbb{R}$ is $\gamma$-quasirandom if $|\widehat{f}(S)| \leq \gamma$ for every nonzero $S \in \mathbb{F}_2^n$.

[quasirandom when restricted to a coset] Let $H$ and $\gamma_1, \ldots, \gamma_k$ be as in Equation 5, and let $C$ be a coset of $H$. We say that $f$ is $\gamma$-quasirandom on $C$ if

$|\widehat{f_C}(S)| \leq \gamma$ for every nonzero $S \in \mathbb{F}_2^{n-k}$,

where $f_C$ is as defined in Section 2.2.
In the two definitions above, the function $f$ of interest will often be the indicator of a subset $A \subseteq \mathbb{F}_2^n$. We next state Green's regularity lemma for Boolean functions, which is analogous to Szemerédi's celebrated graph regularity lemma [Szemeredi:78].

[Green's regularity lemma in $\mathbb{F}_2^n$] Let $f : \mathbb{F}_2^n \to \{0,1\}$ and $\gamma > 0$. There exists a subspace $H \leq \mathbb{F}_2^n$ such that

the codimension of $H$ is at most $k(\gamma)$, a quantity (of tower type in $1/\gamma$) that is independent of $n$; and

for all but a $\gamma$ fraction of the cosets $C$ of $H$, the restricted function $f_C$ is $\gamma$-quasirandom.
In Section 4, we will see the proof of Green's regularity lemma (in the course of providing a constructive and highly query-efficient version of the lemma).
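The energy-increment argument behind the lemma can be run by brute force on a toy instance. The following sketch (our own code; the threshold and instance are illustrative, and no tower-type bound is tracked) repeatedly finds a coset on which the restriction of $1_A$ has a large nontrivial Fourier coefficient and refines the partition by that parity, stopping once every coset is quasirandom.

```python
# Brute-force toy run of the energy-increment procedure behind the regularity
# lemma, over F_2^6 with an illustrative quasirandomness threshold gamma.

import random

random.seed(5)
n, gamma = 6, 0.15
N = 1 << n
A = set(random.sample(range(N), N // 3))
f = [1.0 if x in A else 0.0 for x in range(N)]

def dot(a, b):
    return bin(a & b).count("1") & 1

def span(vecs):
    S = {0}
    for v in vecs:
        S |= {s ^ v for s in S}
    return S

parities = []                   # the parities cutting out the subspace H
while True:
    sp = span(parities)
    # Group points by their parity pattern, i.e. by coset of H.
    cosets = {}
    for x in range(N):
        cosets.setdefault(tuple(dot(g, x) for g in parities), []).append(x)
    # Search for a coset with a nontrivial Fourier coefficient above gamma
    # (coefficients at S in span(parities) are the "trivial" density terms).
    found = None
    for C in cosets.values():
        for S in range(N):
            if S in sp:
                continue
            coef = sum(f[x] * (-1) ** dot(S, x) for x in C) / len(C)
            if abs(coef) > gamma:
                found = S
                break
        if found:
            break
    if found is None:
        break                   # every coset is gamma-quasirandom: done
    parities.append(found)      # refine: codimension grows by one

print(len(parities))            # codimension of the final subspace (<= n)
```

Each appended parity lies outside the span of the previous ones, so the loop terminates after at most $n$ refinements; the point of the lemma is that the true bound is independent of $n$.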
2.5 The Goldreich–Levin Theorem
Given query access to a function $f : \mathbb{F}_2^n \to \{-1,1\}$, the Goldreich–Levin algorithm [goldreichlevin] allows us to find all linear (parity) functions that are well-correlated with $f$ (equivalently, it allows us to find all the "significant" Fourier coefficients of $f$). More formally, we have the following result.

[Goldreich–Levin algorithm] Let $f : \mathbb{F}_2^n \to \{-1,1\}$ be arbitrary and let $\tau > 0$ be fixed. There is an algorithm $\mathrm{GL}$ that, given query access to $f$, outputs a subset $L \subseteq \mathbb{F}_2^n$ of size $O(1/\tau^2)$ such that with probability at least $9/10$, we have

if $|\widehat{f}(S)| \geq \tau$, then $S \in L$; and

if $S \in L$, then $|\widehat{f}(S)| \geq \tau/2$.

Furthermore, $\mathrm{GL}$ runs in time $\mathrm{poly}(n, 1/\tau)$ and makes $\mathrm{poly}(n, 1/\tau)$ queries to $f$.
2.6 Oracles and Oracle Machines
As stated in the introduction, the outputs of our algorithmic procedures (Algorithms 2 and 1) will be oracles to the indicator functions of specific subsets of $\mathbb{F}_2^n$. We first recall the definition of a probabilistic oracle machine:

Let $g : \mathbb{F}_2^n \to \{0,1\}$. A randomized algorithm $M$ with black-box query access to a function $f$, denoted $M^f$, is said to be a probabilistic oracle machine for $g$ if for any input $x \in \mathbb{F}_2^n$, the algorithm outputs a bit $M^f(x)$ that satisfies

$\Pr[M^f(x) = g(x)] \geq 2/3,$

where the probability is taken over the internal coin tosses of $M$. The query complexity of the machine is the number of oracle calls made by $M$ to $f$, and the running time of the machine is the number of time steps it takes in the worst case (counting each oracle call as a single time step).
Of course, the 2/3 in the above definition can be upgraded to $1 - \delta$ at the cost of increasing the query complexity by a factor of $O(\log(1/\delta))$. We next define what it means for an algorithm to "output an (approximate) oracle" for a function.
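The amplification step just mentioned is the standard majority-vote trick; the following generic sketch (our own code, with a toy 2/3-correct machine standing in for a real oracle machine) illustrates it.

```python
# Majority-vote amplification: repeating a 2/3-correct probabilistic machine
# O(log(1/delta)) times upgrades its per-input success probability to 1 - delta.

import random
from collections import Counter

def amplify(machine, reps):
    def amplified(x):
        votes = Counter(machine(x) for _ in range(reps))
        return votes.most_common(1)[0][0]
    return amplified

random.seed(4)
# A toy machine that computes the constant-1 function correctly w.p. 2/3:
noisy = lambda x: 1 if random.random() < 2 / 3 else 0
boosted = amplify(noisy, 301)       # error probability drops exponentially in reps
assert all(boosted(x) == 1 for x in range(20))
```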
Let $g, h$ be two functions from $\mathbb{F}_2^n$ to $\{0,1\}$. An algorithm with query access to $f$, denoted by $\mathcal{A}^f$, is said to output an oracle for the function $g$ (with error $\eta$, query complexity $q$, and running time $t$) if it outputs a representation of a probabilistic oracle machine $M^f$ for some function $h$ for which the following hold:

$h$ is $\eta$-close to $g$ (i.e. $\Pr_x[h(x) \neq g(x)] \leq \eta$);

the query complexity of $M^f$ is at most $q$ and the running time of $M^f$ is at most $t$.

If $\eta = 0$, then we say that $M^f$ is an exact oracle for $g$.
3 A Query-Inefficient Version of the Main Result
In this section, we prove a query-inefficient, "non-implicit" version of our main result, which has a polynomial query complexity dependence on the ambient dimension $n$. In particular, we will prove the following theorem.

[Main result, query-inefficient version] Let $A \subseteq \mathbb{F}_2^n$ be an arbitrary subset, and let $\epsilon > 0$. Given query access to $1_A$, there exists an algorithm that makes a number of queries to $1_A$ polynomial in $n$ (for any fixed $\epsilon$), performs a comparable amount of computation, and with high probability outputs:

An exact oracle to the indicator function of $A'$, where $A' \subseteq A$ is the structured subset guaranteed by Section 3.1; and

An approximate oracle to the indicator function of the sumset $A' + A'$.
In Section 4, we will present an "implicit" version of Section 3 that makes only constantly many queries, independent of the ambient dimension $n$, and thereby prove our main result.
We start by recording a corollary of Green's regularity lemma in Section 3.1, which (informally), given an arbitrary set $A$, establishes the existence of a "structured" set $A' \subseteq A$ capturing "almost all" of $A$. Section 3.2 then presents a procedure, ConstructDT, that constructs an exact oracle to this structured set $A'$, giving item (1) of the above theorem. In Section 3.3, we present a procedure, SimulateSumset, that constructs an approximate oracle to the sumset $A' + A'$, giving item (2).
3.1 Partitioning Arbitrary Sets into Dense Quasirandom Cosets
Green's regularity lemma in $\mathbb{F}_2^n$ says that given an arbitrary set $A \subseteq \mathbb{F}_2^n$ and an error parameter $\gamma$, we can partition $\mathbb{F}_2^n$ into boundedly many (independent of $n$) sets such that $A$ is "random-like" on almost all of these sets. Moreover, all these sets have a convenient structure: they are cosets of a common subspace of constant codimension.
We will use the following easy consequence of Green's lemma:

Given $A \subseteq \mathbb{F}_2^n$ and $\gamma > 0$, there exists a subspace $H$ of codimension at most $k(\gamma)$ and a set $A' \subseteq A$ such that

$A'$ contains all but a small (controlled by $\gamma$) fraction of $A$;

for any coset $C$ of $H$, the intersection $A' \cap C$ is either empty or has density at least a threshold $\delta$ in $C$; and

the restriction of $1_{A'}$ to $C$ is quasirandom for all cosets $C$ of $H$.
Proof.
Let $H$ be the subspace of $\mathbb{F}_2^n$ of codimension at most $k(\gamma)$ guaranteed to exist by Section 2.4, and let $C_1, \ldots, C_m$ be an enumeration of the cosets of $H$, where $m = 2^{\mathrm{codim}(H)}$. We know from Section 2.4 that for all but a $\gamma$ fraction of the cosets $C_i$, the restriction of $1_A$ to $C_i$ is quasirandom.

Define disjoint subsets $B_1, \ldots, B_m$, where each $B_i \subseteq A \cap C_i$, as follows:

If the restriction of $1_A$ to $C_i$ is not quasirandom, then $B_i := \emptyset$;

If the density of $A$ in $C_i$ is below the threshold $\delta$, set $B_i := \emptyset$;

Otherwise, set $B_i := A \cap C_i$.

We now define $A'$ as

(8) $A' := B_1 \cup \cdots \cup B_m.$

We clearly have $A' \subseteq A$, and the restriction of $1_{A'}$ to each coset is quasirandom. (Note that the empty set is trivially quasirandom.) ∎
Informally, Section 3.1 modifies $A$ to obtain a structured set $A'$ that contains "most" of $A$ and has either empty or "large" intersection with each of the cosets guaranteed to exist by Green's regularity lemma. Furthermore, $A'$ is "random-like" on all (as opposed to almost all) of these cosets.
3.2 A Constructive Regularity Lemma via the Goldreich–Levin Theorem
In this section, we make Section 3.1 constructive via the Goldreich–Levin algorithm. The procedure ConstructDT presented in Algorithm 1 closely follows the structure of Green's original proof of the regularity lemma itself [Green:05].

Algorithm 1: ConstructDT

1. Initialize the decision tree $T$ to contain no internal nodes and one leaf labelled by $1_A$.

2. At each stage of growing $T$, do the following:

(a) Let $C_1, \ldots, C_m$ denote the cosets corresponding to the leaves of the decision tree at the current stage; the $i$-th leaf node is labelled by the restriction of $1_A$ to $C_i$.

(b) For each coset $C_i$, call the Goldreich–Levin algorithm on the restricted function to obtain a list $L_i$ of candidate significant Fourier coefficients.

(c) For each nonempty $L_i$, for each $S \in L_i$, estimate the magnitude of the corresponding Fourier coefficient up to a suitable additive error and confidence. If the estimate is below the quasirandomness threshold, then remove $S$ from $L_i$.

(d) If $L_i = \emptyset$ for at least a $1 - \gamma$ fraction of the $i$'s, go to Step 3.

(e) Otherwise, let the collection of labels of all internal nodes be $\Gamma$. For each nonempty $L_i$: choose $S \in L_i$ and check whether the collection $\Gamma \cup \{S\}$ is linearly independent. If so, then add $S$ to $\Gamma$ and split all nodes at the current stage on the parity $S$. Repeat Step 2.

3. For each leaf node, say corresponding to the coset $C_i$, estimate the density of $A$ in $C_i$ up to a suitable additive error and confidence.

(a) If $L_i$ is nonempty or the estimated density is below the threshold, set the function associated to the leaf node to be the identically-0 function.

(b) Else set it to be the identically-1 function.

4. Define the oracle for $1_{A'}$ to be the function $x \mapsto T(x) \cdot 1_A(x)$, where $T(x)$ denotes the bit at the leaf reached by $x$.
Let $A \subseteq \mathbb{F}_2^n$ be an arbitrary subset. Given query access to $1_A$ and a parameter $\gamma > 0$, the procedure ConstructDT described in Algorithm 1:

Makes a number of queries to $1_A$ that is polynomial in $n$ for any fixed $\gamma$, and performs a comparable amount of computation; and

With high probability outputs a deterministic oracle for $1_{A'}$, where $A'$ is as in Section 3.1.

We note that the procedure ConstructDT makes queries to the oracle $1_A$ in the course of running the Goldreich–Levin algorithm.
Proof.
We first argue that Step 2 in the procedure ConstructDT terminates; this essentially follows from Green's original proof of the regularity lemma in $\mathbb{F}_2^n$. In particular, suppose that at the current stage the subspace given by the internal nodes of the parity decision tree is $H$, and let $C_1, \ldots, C_m$ denote the cosets corresponding to the leaves. Consider the potential function

$\mathcal{E}(H) := \mathbf{E}_i\big[(\text{density of } A \text{ in } C_i)^2\big],$

and note that $0 \leq \mathcal{E}(H) \leq 1$. Informally, $\mathcal{E}(H)$ captures the "expected imbalance" of $A$ restricted to the leaf nodes of the tree at the current stage.

Lemma 2.2 of [Green:05] (alternatively, see [RODgreenregularity]) states that if there exist a leaf node and a parity whose corresponding Fourier coefficient is large in magnitude, then upon splitting all nodes at the current level on that parity, with $H'$ being the subspace corresponding to the resulting tree, the value $\mathcal{E}(H')$ exceeds $\mathcal{E}(H)$ by an additive increment depending only on $\gamma$. It follows that if the condition in Line 2(d) of ConstructDT doesn't hold, then after Step 2(e), the value of $\mathcal{E}$ increases by at least this increment, and hence Step 2 can be repeated at most boundedly many times.

Next, note that the Goldreich–Levin calls in Step 2(b) make polynomially many queries to $1_A$ over the run of ConstructDT, and each estimation in Step 2(c) and Step 3 makes boundedly many queries (via a standard application of the Chernoff bound). The overall query complexity of ConstructDT follows. The runtime is similarly clear.

Note that we run the Goldreich–Levin algorithm in Step 2(b) on the restriction of $1_A$ to the coset $C_i$ viewed as a function on all of $\mathbb{F}_2^n$ (zeroed out off $C_i$), as opposed to the function on the coset itself. It follows from Section 2.2 that the one is quasirandom if and only if the other is, with the threshold rescaled by $m$ (where $m$ is the number of cosets at the given stage of the algorithm). We also note that given query access to $1_A$, we can simulate query access to the zeroed-out restriction by checking whether an input belongs to the coset $C_i$, which we can do by querying it on the parity decision tree $T$.

In the pruning procedure in Step 2(c), the size of each $L_i$ is bounded independently of $n$. A union bound over the Goldreich–Levin and estimation procedures implies that with high probability, the function computed by $T$ indicates whether a point is in a coset on which the restriction of $1_A$ is quasirandom and has density at least the threshold. It follows that the output is an exact oracle for $1_{A'}$; it also clearly makes exactly one query to $1_A$ on any input. ∎
3.3 Approximately Simulating Sumsets
Note that Section 3.1 asserts, for an arbitrary $A \subseteq \mathbb{F}_2^n$, the existence of a structured subset $A' \subseteq A$ (which is "almost all" of $A$) and a subspace $H$ such that the restriction of $1_{A'}$ to $C$ is quasirandom for every coset $C$ of $H$. The following lemma indicates why such a decomposition is useful towards our goal of (approximately) simulating sumsets.
Let $A_1, A_2 \subseteq \mathbb{F}_2^n$ be arbitrary and let $H \leq \mathbb{F}_2^n$ be a subspace. Suppose that, for $i \in \{1, 2\}$,

$A_i$ is contained in a coset $C_i$ of $H$ and is quasirandom (in the sense of Section 2.4); and

the density of $A_i$ in $C_i$ is at least $\delta$ for some $\delta > 0$.

Then we have

(9) $|A_1 + A_2| \geq (1 - \eta)\,|C_1 + C_2|$ for a small $\eta$ determined by $\delta$ and the quasirandomness parameter.
Proof.
For ease of notation, define the quasirandom functions $f_1$ and $f_2$ to be the indicators of $A_1$ and $A_2$, viewed as functions on all of $\mathbb{F}_2^n$ that vanish outside $C_1$ and $C_2$ respectively (as in Section 2.2). Consider the convolution $f_1 * f_2$, and note that $(f_1 * f_2)(x) > 0$ if and only if $x \in A_1 + A_2$. From Equation 4, we have that

(10) $\widehat{f_1 * f_2}(S) = \widehat{f_1}(S) \cdot \widehat{f_2}(S) \quad \text{for all } S \in \mathbb{F}_2^n.$
Note that and