Orthogonal range searching in the semigroup model is one of the most fundamental data structure problems in computational geometry. In this problem, we are given an input set of points to store in a data structure, where each point is associated with a weight from a semigroup, and the goal is to compute the (semigroup) sum of the weights of all the points inside an axis-aligned box given at query time. Disallowing the “inverse” operation makes the data structure very versatile, as it is then applicable to a wide range of situations (from computing a weighted sum to computing the maximum or minimum inside the query). In fact, the semigroup variant is the primary way the family of range searching problems is introduced (see the survey).
Here, we focus only on static data structures. We use the convention that the query time refers to the worst-case number of semigroup additions required to produce the query answer, and space refers to the number of semigroup sums stored by the data structure. By storage, we mean space not counting the space used by the input itself. We can thus talk about data structures with sublinear storage; e.g., with 0 storage the data structure has to use the input weights only, leading to a worst-case query time that is linear in the input size.
1.1 Previous Results
Orthogonal range searching is a fundamental problem with a very long history. The problem we study is also very interesting from a lower bound point of view, where the goal is to understand the fundamental barriers and limitations of performing basic data structure operations. Such a lower bound approach was initiated by Fredman in the early 1980s in a series of very influential papers (e.g., see [9, 10, 11]). Among his significant results was the lower bound [10, 11] showing that a sequence of insertions, deletions, and queries requires super-linear time to run.
Arguably, the most surprising result of these early efforts was given by Andrew Yao who, in 1982, showed that even in one dimension, the static case of the problem contains a very non-trivial, albeit small, barrier. In one dimension, the problem essentially boils down to adding numbers: store an input array of numbers in a data structure such that we can add up the numbers between any two indices given at query time. The only restriction is that we should use only additions and not subtractions (otherwise, the problem is easily solved using prefix sums). Yao’s significant result was that with linear storage, answering queries requires $\Omega(\alpha(n))$ additions, where $\alpha(\cdot)$ is the inverse Ackermann function. This bound implies that if one insists on using linear storage, the query bound cannot be reduced to constant, but even a minuscule amount of extra storage can reduce the query bound to constant. Furthermore, using a bit less than linear storage will once again yield a more natural (and optimal) bound. Despite its strangeness, it turns out there are data structures that can match the exact lower bound (see also the follow-up work). After Tarjan’s famous analysis of the union-find problem, this was the second independent appearance of the inverse Ackermann function in the history of algorithms and data structures.
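Yao's matching data structure is intricate; as a simpler baseline illustrating the model (additions only, never subtractions), the following sketch stores O(n) semigroup sums in a balanced tree and answers any interval query with O(log n) additions. This is only an illustration of the model, not Yao's optimal structure.

```python
# A baseline "additions only" interval-sum structure: a segment tree
# storing O(n) precomputed semigroup sums; each query is answered by
# combining O(log n) stored sums, never subtracting.  We assume a
# commutative semigroup (the query combines pieces out of order).

class SegTree:
    def __init__(self, values, op):
        self.op = op
        self.n = len(values)
        size = 1
        while size < self.n:
            size *= 2
        self.size = size
        self.tree = [None] * (2 * size)
        for i, v in enumerate(values):
            self.tree[size + i] = v
        for i in range(size - 1, 0, -1):
            l, r = self.tree[2 * i], self.tree[2 * i + 1]
            self.tree[i] = l if r is None else (r if l is None else op(l, r))

    def query(self, lo, hi):
        """Semigroup sum of values[lo..hi], inclusive, using only additions."""
        res = None
        lo += self.size
        hi += self.size + 1
        while lo < hi:
            if lo & 1:
                res = self.tree[lo] if res is None else self.op(res, self.tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                res = self.tree[hi] if res is None else self.op(res, self.tree[hi])
            lo >>= 1
            hi >>= 1
        return res
```

Because only additions are used, the same structure works for idempotent semigroups such as (max, min) as well as for ordinary sums.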
Despite the previous attempts, the problem is still open even in two dimensions. At the moment, building range trees [6, 7] on top of the 1D structures is the only known way to obtain two- or higher-dimensional results. In 2D, for instance, range trees yield several trade-offs in which each saving in storage costs a corresponding factor in the query bound, for any constant number of levels, and the same holds in $d$ dimensions. The space complexity can be reduced by any factor at the price of increasing the query bound by a corresponding factor. Also, strangely, once the storage is asymptotically large enough, the inverse Ackermann term in the query bound disappears. Nonetheless, a surprising result of Chazelle shows that the reverse is not true: with $S(n)$ units of storage, the query bound must obey $Q(n) = \Omega(\log n / \log(S(n)/n))$, which implies that using polylogarithmic extra storage only reduces the query bound by a $\log\log n$ factor. Once again, using a range tree with large fan-out, one can build a data structure that uses $O(n \log^{\varepsilon} n)$ storage, for any positive constant $\varepsilon$, and achieves the query bound of $O(\log n)$. This, however, leaves a very natural and important open problem: is Chazelle’s lower bound the only barrier? Is it possible to achieve linear space and logarithmic query time?
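The range-tree construction referred to here can be sketched as follows (an illustrative implementation, not the paper's data structure): a balanced tree over the x-coordinates whose every node stores a 1D structure (a segment tree) over the y-coordinates of its points. A query decomposes into O(log n) canonical x-nodes and costs O(log n) additions in each, giving O(n log n) stored sums and an O(log^2 n) query bound.

```python
# A sketch of a 2D range tree for semigroup range searching.
# Only additions are ever used, never subtractions.

import bisect

def build1d(ys, ws, op):
    """Segment tree over the (sorted) y-values with weights ws."""
    def build(i, j):                    # node covering indices [i, j)
        if j - i == 1:
            return (ws[i], None, None)
        m = (i + j) // 2
        L, R = build(i, m), build(m, j)
        return (op(L[0], R[0]), L, R)
    return (ys, build(0, len(ys)), op)

def query1d(t, ylo, yhi):
    """Semigroup sum over stored pairs with ylo <= y <= yhi (or None)."""
    ys, root, op = t
    lo, hi = bisect.bisect_left(ys, ylo), bisect.bisect_right(ys, yhi)
    if lo >= hi:
        return None
    def go(node, i, j):
        if lo <= i and j <= hi:        # canonical node: one stored sum
            return node[0]
        m = (i + j) // 2
        parts = [p for p in ((go(node[1], i, m) if lo < m else None),
                             (go(node[2], m, j) if hi > m else None))
                 if p is not None]
        return parts[0] if len(parts) == 1 else op(parts[0], parts[1])
    return go(root, 0, len(ys))

def build2d(points, op):
    """points: list of (x, y, w).  Balanced tree over x with 1D trees."""
    def build(pts):                     # pts sorted by x
        byy = sorted((y, w) for (_, y, w) in pts)
        node = {'xlo': pts[0][0], 'xhi': pts[-1][0],
                'assoc': build1d([y for y, _ in byy], [w for _, w in byy], op)}
        if len(pts) > 1:
            m = len(pts) // 2
            node['left'], node['right'] = build(pts[:m]), build(pts[m:])
        return node
    return (build(sorted(points)), op)

def query2d(t, xlo, xhi, ylo, yhi):
    """Semigroup sum over all points in [xlo,xhi] x [ylo,yhi] (or None)."""
    root, op = t
    out = []
    def go(node):
        if node['xhi'] < xlo or node['xlo'] > xhi:
            return                      # disjoint in x
        if xlo <= node['xlo'] and node['xhi'] <= xhi:
            s = query1d(node['assoc'], ylo, yhi)   # canonical x-node
            if s is not None:
                out.append(s)
            return
        go(node['left']); go(node['right'])
    go(root)
    res = None
    for s in out:
        res = s if res is None else op(res, s)
    return res
```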
Idempotence and random point sets.
A semigroup is idempotent if for every element $x$, we have $x + x = x$. All the previous lower bounds are in fact valid for idempotent semigroups. Furthermore, Chazelle’s lower bound uses a uniform (or randomly placed) set of points, which shows the lower bound does not require pathological or fragile input constructions. His lower bound also holds for dominance ranges, i.e., $d$-dimensional boxes of the form $(-\infty, q_1] \times \cdots \times (-\infty, q_d]$. These little perks result in a very satisfying statement: the problem is still difficult even when the semigroup is “nice” (idempotent), the point set is “nice” (uniformly placed), and the queries are simple (dominance queries).
1.2 Our Results
We show a lower bound on the query bound of any data structure as a function of its storage. This is the first improvement to the storage-time trade-off curve for the problem since Chazelle’s result in 1990. It also shows that Chazelle’s lower bound is not the only barrier. Observe that our lower bound is strong at a different corner of the parameter space compared to Chazelle’s: ours is strongest when the storage is small, whereas Chazelle’s is strongest when the storage is large. Furthermore, we keep most of the desirable properties of Chazelle’s lower bound: ours also holds for idempotent semigroups and uniformly placed point sets. However, we have to consider more complicated queries than just dominance queries, which ties in to our second main result. We show that our analysis is tight: given a “uniformly placed” point set and an idempotent semigroup, we can construct a data structure with matching storage and query bounds. As a corollary, we obtain an almost complete understanding of orthogonal range searching with respect to a uniformly placed point set in an idempotent semigroup.
Our results, and especially our lower bound, require significantly new ideas. To surpass Chazelle’s lower bound, we need to go beyond dominance queries, which requires wrestling with the complications that ideas such as range trees can introduce. Furthermore, in our case, the data structure can actually improve the query time by some factor by spending a comparable factor of extra space. This means we are extremely sensitive to how the data structure can “use” its space. As a result, we need to capture the limits of how intelligently the data structure can spend its budget of “space” throughout various subproblems.
It is natural to conjecture that a uniformly randomly placed point set should be the most difficult point set for orthogonal queries. Because of this, we conjecture that our lower bounds are almost tight. This opens up a few very interesting open problems; see Section 5.
The Model of Computation.
Let the input be a set of points with weights from a semigroup. Our model of computation is the same as the one used by the previous lower bounds. There has been quite some work dedicated to building a proper model for lower bounds in the semigroup model; we will not delve into those details and only mention the final consequences of those efforts. The data structure stores a number of sums, where each sum is the semigroup sum of the weights of a subset of the input points. With a slight abuse of notation, we will use the same symbol to refer both to the sum and to the underlying subset. The number of stored sums is the space complexity of the data structure. If a sum contains only one point, we call it a singleton, and the storage is the number of stored sums that are not singletons. Now, consider a query range and the subset of input points it contains. The query algorithm must find stored subsets whose union is exactly this subset. For a given query, the smallest number of stored subsets needed is the query bound of the query, and the query bound of the data structure is the worst-case query bound over all queries. Observe that the model does not disallow covering a point more than once, and for idempotent semigroups this poses no problem; all the known lower bounds allow covering a point inside the query multiple times. However, if the semigroup is not idempotent, then covering a point more than once could lead to incorrect results. Since data structures must work for general semigroups, they ensure that the chosen subsets are disjoint.
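The model in miniature: the sketch below (a hypothetical helper, not from the paper) takes the stored family of subsets and computes, by brute force on tiny inputs, the smallest number of stored subsets whose union is exactly the set of points in the query, with overlaps allowed in the idempotent case and disjointness enforced otherwise.

```python
# Brute-force "query bound" of a stored family of sums, per the model:
# a query is answered by stored subsets whose union equals the query set.
# Exponential time -- intended only for tiny illustrative instances.

from itertools import combinations

def query_bound(stored, q, idempotent=True):
    """Minimum number of stored subsets covering exactly q, or None."""
    q = frozenset(q)
    # Only subsets fully inside the query can be used.
    usable = [frozenset(s) for s in stored if frozenset(s) <= q]
    for k in range(1, len(usable) + 1):
        for choice in combinations(usable, k):
            union = frozenset().union(*choice)
            if union != q:
                continue
            # General semigroups additionally require disjointness.
            if idempotent or sum(len(s) for s in choice) == len(q):
                return k
    return None
```

For example, with stored sums {1}, {2}, {3}, {1,2}, {2,3}, the query {1,2,3} can be answered with two overlapping sums in the idempotent case, and with two disjoint sums ({1,2} and {3}) in the general case.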
Definitions and Notations.
A $d$-dimensional dominance query is determined by one point $q$ and is defined as the range $(-\infty, q_1] \times \cdots \times (-\infty, q_d]$.
We call a point set of size $n$ well-distributed if the following properties hold: (i) the set is contained in the $d$-dimensional unit cube; (ii) the volume of any rectangle that contains $k$ points of the set is at least $ck/n$, for some constant $c$ that only depends on the dimension; (iii) any rectangle of volume $v$ contains $O(vn + 1)$ points of the set.
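For intuition, the uniform grid is the canonical example of a well-distributed set. The sketch below (using an assumed form of property (iii): a rectangle of volume v contains O(vn + 1) points) checks the bound empirically for the sqrt(n) x sqrt(n) grid in the unit square.

```python
# The uniform grid of n points in the unit square, and an empirical
# check that a rectangle of width w and height h (volume w*h) contains
# at most (10w + 1)(10h + 1) grid points when n = 100 (m = 10), which
# is the O(vn + 1) shape of the well-distribution property.

import random

def grid(n):
    """~n points at the centers of an m x m grid of cells, m = sqrt(n)."""
    m = int(round(n ** 0.5))
    step = 1.0 / m
    return [((i + 0.5) * step, (j + 0.5) * step)
            for i in range(m) for j in range(m)]

def points_in(pts, rect):
    """Count points inside the closed rectangle (x0, x1, y0, y1)."""
    x0, x1, y0, y1 = rect
    return sum(1 for (x, y) in pts if x0 <= x <= x1 and y0 <= y <= y1)
```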
3 The Lower Bound
This section is devoted to the proof of our main theorem which is the following.
Let the input be a well-distributed set of $n$ points in the unit cube. Any data structure that answers the multi-sided queries described below must satisfy the claimed trade-off between its storage and its query bound.
Let the ambient domain be the unit cube. Throughout this section, the input point set is a well-distributed set of points in the unit cube, and we consider a data structure that answers semigroup orthogonal range searching queries on it.
3.1 Definitions and Setup
We consider queries that have two boundaries in dimensions 1 to $d-1$ but only have an upper bound in dimension $d$. For simplicity, we rename the axes so that the $d$-th axis is singled out and the remaining $d-1$ axes come first. Thus, each query is the product of $d-1$ bounded intervals and one interval that is unbounded from below. The corner of the query that is extremal in all the bounded dimensions is called the dot of the query. For every $i$, $1 \le i \le d-1$, the line segment that connects the dot to the point obtained by moving the dot to the opposite boundary in dimension $i$ is called the $i$-th marker of the query.
The trees.
For each dimension $i$, $1 \le i \le d-1$, we define a balanced binary tree of height $h$ as follows. Informally, we cut the unit cube into congruent boxes with hyperplanes perpendicular to the $i$-th axis, and these boxes form the leaves of the tree. To be more specific, every node of the tree is assigned a box. The root is assumed to have depth 0 and is assigned the whole unit cube. For every internal node, we divide its box into two congruent “left” and “right” boxes with a hyperplane perpendicular to the $i$-th axis; the left box is assigned to the left child and the right box to the right child. We do not do this if the box’s volume has dropped below a fixed threshold; such nodes become the leaves of the tree. Observe that all the trees, over all dimensions, have the same height $h$. The volume of the box of a node at depth $j$ is $2^{-j}$.
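The construction of these trees can be sketched as follows (call the tree for axis i "T_i" here; the stopping threshold of volume 1/n, and the tie-breaking at the threshold, are assumptions):

```python
# A sketch of T_i: repeatedly halve the unit cube [0,1]^d along axis i;
# a node whose box has volume below the assumed threshold 1/n becomes a
# leaf.  Since every split halves the volume regardless of the axis,
# every T_i has the same height h = Theta(log n).

def build_tree(n, axis, d, box=None, depth=0):
    """Node = (box, depth, left, right); box = list of [lo, hi] per axis."""
    if box is None:
        box = [[0.0, 1.0] for _ in range(d)]
    vol = 1.0
    for a, b in box:
        vol *= (b - a)
    if vol < 1.0 / n:                  # leaf: volume below threshold
        return (box, depth, None, None)
    lo, hi = box[axis]
    mid = (lo + hi) / 2.0
    lbox = [list(iv) for iv in box]; lbox[axis] = [lo, mid]
    rbox = [list(iv) for iv in box]; rbox[axis] = [mid, hi]
    return (box, depth,
            build_tree(n, axis, d, lbox, depth + 1),
            build_tree(n, axis, d, rbox, depth + 1))

def height(node):
    return 0 if node[2] is None else 1 + max(height(node[2]), height(node[3]))

def leaves(node):
    return 1 if node[2] is None else leaves(node[2]) + leaves(node[3])
```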
Embedding the problem in the plane.
The next idea is to represent each tree geometrically. Let $h$ be the height of the trees. For each $i$, $1 \le i \le d-1$, we define a representative diagram, which is an axis-aligned decomposition of the unit (planar) square in a coordinate system where the horizontal axis is the $i$-th axis and the vertical axis is an auxiliary axis. As the first step of the decomposition, we cut the square into $h+1$ equal-sized sub-rectangles using horizontal lines. Next, we further divide each sub-rectangle into small regions and assign every node of the tree to one of these regions. This is done as follows. The root is assigned the topmost sub-rectangle as its region. Assume a node has been assigned a rectangle as its region. We create a vertical cut starting from the middle point of the lower boundary of that region all the way down to the bottom of the diagram. The children of the node are assigned the two rectangles that lie immediately below its region. See Figure 1.
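The node-to-region map of this diagram admits a closed form (assuming, per the description above, h+1 equal strips and dyadic vertical cuts): a node at depth j with left-to-right index k gets the part of strip j above the k-th dyadic interval. A sketch:

```python
# Regions of the representative diagram: strip j (counted from the top)
# crossed with the k-th dyadic interval of width 2^-j.  `locate` is the
# inverse map used later to sample a random node via a uniform point.

def region(depth, index, height):
    """Region of the index-th node (0-based, left to right) at `depth`
    in a tree of the given height, as (x0, x1, y0, y1) in [0,1]^2."""
    strip = 1.0 / (height + 1)
    y1 = 1.0 - depth * strip           # strips are numbered from the top
    y0 = y1 - strip
    width = 1.0 / (1 << depth)
    return (index * width, (index + 1) * width, y0, y1)

def locate(x, y, height):
    """The (depth, index) of the node whose region contains (x, y)."""
    strip = 1.0 / (height + 1)
    depth = min(int((1.0 - y) / strip), height)
    index = min(int(x * (1 << depth)), (1 << depth) - 1)
    return depth, index
```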
Placing the Sums.
Consider a semigroup sum stored by the data structure. Our lower bound will apply to idempotent semigroups, which means that, without loss of generality, we can assume our semigroup is idempotent. As a result, we can assume that each stored sum has the same shape as a query: let the box of the sum be the smallest box that is unbounded from below (along the last axis) and contains all the points of the sum. If the box contains an input point that is not part of the sum, we can simply add that point to the sum; any query that can use the sum must contain the box, which means adding the point can only improve things. Each sum is placed in one node of the tree of every one of the first $d-1$ dimensions. The details of this placement are as follows.
A node of the $i$-th tree stores a sum if the $i$-th marker of the sum intersects the node’s splitting hyperplane, with the node being the highest one with this property. Geometrically, this is equivalent to the following: we place the sum at a node if the node’s region is the lowest region that fully contains the marker (or, to be precise, the projection of the marker onto the plane of the $i$-th representative diagram). For example, in Figure 1 (right), the sum is placed at the indicated node since its marker, the green line segment, is completely inside the node’s region, with that node being the lowest one with this property. Remember that each sum is placed at some node in each of the $d-1$ trees (i.e., it is placed $d-1$ times in total).
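Restricted to one dimension, this placement rule is the classic "deepest dyadic interval containing a segment" computation; a sketch (hypothetical helper name):

```python
# Placement of a marker segment [a, b] in one tree: descend through the
# dyadic intervals of [0, 1] as long as the segment fits entirely on one
# side of the current cut; stop at the first cut the segment crosses.
# This returns the highest node whose splitting hyperplane meets the
# segment -- equivalently, the deepest dyadic interval containing it.

def place(a, b, height):
    """Deepest (depth, index) dyadic node of [0,1] containing [a, b]."""
    depth, lo, hi, index = 0, 0.0, 1.0, 0
    while depth < height:
        mid = (lo + hi) / 2.0
        if b <= mid:
            hi = mid; index = 2 * index            # fits in the left half
        elif a >= mid:
            lo = mid; index = 2 * index + 1        # fits in the right half
        else:
            break                                  # crosses the cut: stop
        depth += 1
    return depth, index
```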
Notations and difficult queries.
We will adopt the convention that random variables are denoted in bold math font. The difficult query is a multi-sided query chosen randomly as follows. The last coordinate of the query is chosen uniformly in $[0,1]$. To choose the remaining coordinates, we do the following for each of the first $d-1$ dimensions. We place a random point uniformly inside the representative diagram (i.e., choose both of its coordinates uniformly in $[0,1]$), and let the sampled node be the node whose region contains this point. The query coordinate in that dimension is the coordinate of the left boundary of the sampled node’s box, and we record the depth of the sampled node. See Figure 1 (right). Note that such a query is equivalent to a dominance query, defined by the dot, inside the box of the sampled nodes. To simplify the presentation and to avoid redefining these concepts, we will reserve the notation introduced in this paragraph to represent only the concepts introduced here.
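The sampling procedure can be sketched as follows (the helper name and the exact encoding of the node are assumptions; the essential point is that the node is read off from a uniform point in the diagram and the query coordinate is the left boundary of its box):

```python
# Sampling the difficult query: for each of the first d-1 axes, drop a
# uniform point in the representative diagram, take the node whose
# region contains it (depth = strip from the top, index = dyadic
# interval), and use the left boundary of its box as the query
# coordinate.  The last coordinate is a uniform upper bound.

import random

def sample_query(d, height, rng=random):
    coords = []
    for _ in range(d - 1):
        x, y = rng.random(), rng.random()
        strip = 1.0 / (height + 1)
        depth = min(int((1.0 - y) / strip), height)
        index = min(int(x * (1 << depth)), (1 << depth) - 1)
        left_boundary = index / (1 << depth)   # left boundary of box(v_i)
        coords.append((depth, index, left_boundary))
    x_bar = rng.random()                       # uniform bound on last axis
    return coords, x_bar
```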
A necessary condition for being able to use a sum to answer the query is that, for every dimension, the sum is stored in the subtree of the sampled node of the corresponding tree.
Due to how we have placed the sums, any sum stored at a strict ancestor of the sampled node contains at least one point that lies outside the query; since every usable sum must be entirely contained inside the query, such sums cannot be used to answer it. ∎
Consider a query. We now define its subproblems. A subproblem is represented by an array of integral indices, one per tree. The state of a subproblem of a query could either be undefined, or it could refer to covering a particular subset of the points inside the query. In particular, a subproblem is undefined if for some dimension there is no node with the following properties: the node has the prescribed depth, and it has a right sibling whose box contains the query point. See Figure 2. However, if such nodes exist for all dimensions, then the subproblem is well-defined and it refers to the problem of covering all the points inside the corresponding region; observe that this is equivalent to covering all the points of that region whose last coordinate is at most the query’s upper bound. Further observe that for such a node to exist, it needs to pass two checks: (Check I) the prescribed depth must be admissible, as otherwise there are no candidate nodes at that depth, and (Check II) the node at the prescribed depth must have a right sibling whose box contains the query point. These nodes are called the defining nodes of the subproblem. Thus, the sampled nodes determine, for each subproblem, random variables that are either undefined or nodes of the corresponding trees. Clearly, the distribution of each such variable is independent of the corresponding variables in the other dimensions, as it only depends on the sample in its own dimension.
Consider a well-defined subproblem of a query and its defining nodes. To solve the subproblem (i.e., to cover the points inside it), the data structure can use a sum only if, for every dimension, we either have case (i), where the sum is stored on the path connecting the defining node to the sampled node, or case (ii), where the sum is stored in the subtree of the sampled node. If a sum violates both of these conditions in some dimension, then it cannot be used to answer the subproblem. See Figure 2.
First, we use the argument of Observation 1 applied to the subproblem: the sum must be stored in the subtree of the defining node. Let $w$ be the node that stores the sum. If $w$ is in the subtree of the sampled node, then we are done (case (ii)). Otherwise, let $z$ be the least common ancestor of $w$ and the sampled node. If $z = w$, then $w$ lies on the connecting path and we are done again; otherwise, $w$ belongs to the subtree of one child of $z$ while the sampled node belongs to the subtree of the other child. By our placement rules, this implies that the sum lies entirely outside the region of the subproblem, and thus it cannot be used to answer the subproblem. ∎
3.2 The Main Lemma
In this subsection, we prove a main lemma which is the heart of our lower bound proof. To describe this lemma, we first need the following notation. Consider a well-defined subproblem of a query. As discussed, this subproblem corresponds to covering all the points in a region whose last coordinate is below that of the dot; thus, the subproblem can be represented as the problem of covering all the points inside a box whose sides correspond to the boundaries of the relevant slabs. Let $k$ be a parameter. Consider the topmost part of this box, chosen such that it contains $k$ points; as our point set is well-distributed, this implies that the volume of this part is $\Theta(k/n)$. We call this region the $k$-top box. The $k$-top is then the problem of covering all the points inside the $k$-top box of the subproblem. With a slight abuse of notation, we will use the same name to refer also to the set of points inside the $k$-top box. If there are not enough points in the $k$-top box, the $k$-top is undefined; otherwise, it is well-defined. These notions of course also depend on the query, but we will not write this dependency explicitly, as it would clutter the notation. Furthermore, observe that when the query is random, the $k$-top becomes a random variable which is either undefined or is some subset of $k$ points.
Extensions of sums.
Due to technical issues, we slightly extend the set of points each sum covers. Consider a sum that can be used to answer the subproblem. By Observation 1, in each of the first $d-1$ trees, the sum is placed either in the subtree of the defining node or on the path connecting it to the sampled node. We extend the range of the sum (i.e., its projection onto the corresponding axis) to include the left and the right boundary of the node along that dimension. We do this for all of the first $d-1$ dimensions to obtain the extension of the sum. We allow the data structure to cover any point of the subproblem using this extension.
[The Main Lemma] Consider a subproblem of a random query. Let the size parameter of the $k$-top be a small enough constant fraction of the storage per point. Let the critical set consist of the sums whose extension (i) is contained inside the query, and (ii) covers at least a large enough constant number of points from the $k$-top. With constant probability, the subproblem and the $k$-top are well-defined. Furthermore, conditioned on both of these being well-defined, with good probability the nodes will be sampled such that the following holds: the expected number of times the points of the $k$-top are covered by the critical sums is small, where the expectation is over the remaining random choices.
Let us give some intuition on what this lemma says and why it is critical for our lower bound. For simplicity, assume we sample the first group of coordinates as the first step and the remaining coordinates as the last step. The lemma implies that if we focus on one particular subproblem, the sums in the data structure cannot cover too many points; to see this, consider the following. The lemma first says that after the first step, with positive constant probability, the subproblem and its $k$-top are well-defined. Furthermore, there is a very high chance that our random choices will “lock us” into a “doomed” state after sampling the nodes. Then, when considering the remaining random choices, sums that each cover many points in total cover only a very small fraction of the points. As a result, we will need many sums to cover the points inside the $k$-top of the subproblem. Summing these bounds over all possible subproblems will create a lot of harmonic sums, which will eventually lead to our lower bound. There is, however, one very big technical issue that we will deal with later: a sum can cover very few points from each subproblem but from very many subproblems! Without solving this technical issue, we only get a bound that offers no improvement over Chazelle’s lower bound. Thus, while solving this technical issue is important, it is nonetheless clear that the lemma we prove in this section is also very critical.
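For concreteness, the harmonic sums alluded to here are of the elementary form

```latex
\sum_{j=1}^{h} \frac{1}{j} \;=\; \ln h + O(1) \;=\; \Theta(\log h),
```

so a family of $h$ scales whose $j$-th member contributes on the order of $1/j$ sums to $\Theta(\log h)$ in total; this is the shape of the accounting that the lemma feeds into.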
As this subsection is devoted to the proof of the above lemma, we will assume throughout that we are considering a fixed subproblem, and thus its indices are fixed.
3.2.1 Notation and Setup
By Observation 2, only a particular set of sums can be used to answer the subproblem of a query. Consider a sum that can be used to answer the subproblem of some query. By the observation, the sum must satisfy either case (i) or case (ii) for every tree. Over all the $d-1$ dimensions, this describes $2^{d-1}$ different cases. This means that we can partition the sums into equivalence classes such that any two sums in the same class satisfy the same case of Observation 2 in every dimension. Since $d$ is a constant, it suffices to show that our lemma holds when considering only the sums of one particular equivalence class, since summing the bounds over all equivalence classes will yield the lemma. Furthermore, w.l.o.g. and by renaming the axes, we can assume that there exists a fixed value $t$, $0 \le t \le d-1$, such that every sum in the class satisfies case (i) in the first $t$ dimensions and case (ii) in the remaining ones. Note that if $t = 0$, then we have no instances of case (i), and if $t = d-1$, we have no instances of case (ii).
The probability distribution of subproblems.
To proceed, we need to understand the distribution of the subproblems. This is done by the following observation.
Consider a subproblem of a random query, defined by the random variables introduced above. We can make the following observations.
(i) The depth of the sampled node is distributed uniformly among the $h+1$ possible depths.
(ii) With some probability, the defining node will be undefined because it fails Check I.
(iii) If Check I does not fail, then Check II succeeds with probability exactly 1/2.
(iv) For a fixed subproblem, the probability distribution of the defining node is as follows: with some probability, it is undefined; otherwise, it is a node sampled in the following way: sample a random depth uniformly among the admissible depths and select a node uniformly among all the nodes at that depth that have a right sibling.
(i) follows directly from our definition: first, each coordinate of the query point is chosen independently of the other coordinates, and second, the sampled node is obtained by placing a random point inside the diagram, which by construction implies that the depth of the node is a uniform random integer among the possible depths. (ii) follows directly from (i): with some probability, the sampled depth is too small, which implies that Check I fails. (iii) We need two observations: first, the relevant depth is never that of the root, and second, at any depth of the tree other than the root’s, exactly half the nodes have a right sibling. (iv) is simply a consequence of parts (i)-(iii). ∎
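As a quick sanity check of this observation, a Monte Carlo sketch (using the strip-based sampling assumed above, with h+1 equally likely depths):

```python
# Empirically check two claims: (1) dropping a uniform point into the
# diagram makes the depth of its region uniform on {0, ..., h};
# (2) at every depth below the root, exactly half the nodes (the
# odd-indexed ones) have a right sibling.

import random
from collections import Counter

def sampled_depth(height, rng):
    # The diagram has height+1 equal strips, numbered from the top.
    y = rng.random()
    return min(int((1.0 - y) * (height + 1)), height)

h = 4
rng = random.Random(42)
counts = Counter(sampled_depth(h, rng) for _ in range(50000))
for j in range(h + 1):
    assert abs(counts[j] / 50000 - 1.0 / (h + 1)) < 0.01   # ~uniform

for j in range(1, h + 1):
    right_siblings = sum(1 for idx in range(1 << j) if idx % 2 == 1)
    assert right_siblings == (1 << j) // 2
```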
Observe that, w.l.o.g., we can assume that we first generate dimensions 1 to $t$ of the query, then dimensions $t+1$ to $d-1$, and then the last coordinate. A partial query is one where only the first $t$ dimensions have been generated. This is equivalent to only sampling the random diagram points for those dimensions. To be more specific, assume we have fixed the sampled nodes of the first $t$ trees. Then the partial query is the random query conditioned on these choices, in which the first $t$ coordinates are known (not random). Thus, we can still talk about the subproblem of a partial query; it could be that the subproblem is already known to be undefined (this happens when one of the fixed defining nodes is undefined), but otherwise it is defined by the fixed defining nodes and the remaining random variables; these latter random variables could later turn out to be undefined, thus rendering the subproblem of the query undefined.
After sampling a partial query, we can then talk about eligible sums: a sum is eligible if it could potentially be used to answer the subproblem once the full query has been generated. Note that the emphasis is on answering the subproblem. This means there are multiple ways for a sum to be ineligible: if the subproblem is already known to be undefined, then there are no eligible sums. Otherwise, the defining nodes of the first $t$ dimensions are fixed. In this case, if it is already known that the sum lies outside the query, or that it cannot cover any points from the subproblem, then it becomes ineligible. The final and most important case of ineligibility is when the sum is placed, in one of the first $t$ trees, at a node in the subtree of the sampled node. If this happens, even though the sum can potentially be used to answer the subproblem, it can do so only as a member of a different equivalence class, as the reader should remember that in the first $t$ dimensions we only consider sums stored on the path that connects the defining node to the sampled node. If a sum passes all these tests, then it is eligible. Clearly, once the final query is generated, the critical set is a subset of the eligible sums.
Given a partial query, and considering a fixed subproblem, we define the potential function to be the number of eligible sums.
To prove the above lemma, we need the following definitions and observations. Consider the subproblem of a partial query together with its corresponding nodes. In the representative diagram, the Type I region is defined as a rectangular region whose bottom and left boundaries are the same as the bottom and the left boundaries of the node’s region, whose right boundary is the right boundary of its sibling’s region, and whose top boundary is the top boundary of the defining node’s region. See Figure 4.
Consider a partial query and assume the nodes that correspond to its subproblem exist. A necessary condition for a sum to be eligible is that, for each of the first $t$ dimensions, the region of the node storing the sum lies inside the Type I region.
As the defining node exists, we can identify its sibling and its ancestor at the relevant depth. If the region of the node storing the sum is not inside the Type I region, then we have a few cases:
The region is to the left of the left boundary: in this case, the sum cannot contain any point from the subtree of the defining node, so clearly it cannot be used to answer the subproblem.
The region is to the right of the right boundary: observe that the query satisfies Check II, which means the corresponding coordinate of the query lies within the sibling’s box. Thus, in this case, it follows that the sum reaches outside the query region and so cannot be used to answer the query.
The region is below the lower boundary: this violates the assumption that the sum is eligible; in particular, it implies that the sum is stored in the subtree of the sampled node.
The region is above the top boundary: this violates Observation 1, as it implies that the sum is stored at a node that is not in the subtree of the defining node.
If the query does not have the subproblem, then the potential is zero and thus there is nothing left to prove. So, in the rest of the proof, we will assume the subproblem is defined.
Consider an eligible sum and assume it has been placed at certain nodes of the first $t$ trees, with certain depths. We now focus on the distribution of the sampled nodes instead, using Observation 4: for the sum to be eligible, it is necessary that each sampled node is selected at an admissible depth, as otherwise the storing node’s region will lie either below or above the Type I region. Furthermore, by Observation 4, it follows that for every admissible depth there exists exactly one node of that depth for which the storing node’s region lies inside the Type I region. Consider one such candidate node per tree. By Observation 3, the probability that the sampled nodes coincide with these candidates is the product of the corresponding per-tree probabilities. Note that this event also uniquely determines the defining nodes. Furthermore, in this case, the depths of the defining nodes are determined as well, which means the contribution of the sum to the expected value claimed in the lemma can be computed.
Summing this over all the admissible choices yields the contribution of a single sum to the expected value; summing over all eligible sums then yields the lemma. ∎
By the above lemma, we expect only a few eligible sums for a random partial query. Let the first “bad” event be the event that the nodes are sampled such that the number of eligible sums is large. By Markov’s inequality and Lemma 3, this bad event has small probability.
Now, fix the outcomes of the first $t$ dimensions. In the rest of the proof, we will assume these values are fixed and we are going to generate the rest of the query. Next, we define another potential function.
The second potential is defined as follows. First, for every tuple of candidate nodes, define its count to be the number of eligible sums placed at those nodes. Given the fixed nodes and non-negative integer offsets, we define the offset count as the sum of the counts over all tuples of nodes that are descendants of the fixed nodes, lying the prescribed number of levels deeper in their respective trees. We define the potential function as a weighted sum of these offset counts.
Having fixed the nodes, the expected value of the potential is bounded, where the expectation is taken over the random choices of the remaining dimensions, and the potential is defined to be zero if any of the defining nodes is undefined.
We revisit the definition of the potential function and observe that we can look at it from a different angle. This potential is defined on tuples of vertices. We first initialize the potential of every tuple to its own count. Then, every tuple “dispatches” some potential to other tuples in the following way: a tuple dispatches potential to the tuple in which each node is replaced by its ancestor a prescribed number of levels higher. This is done for all non-negative offsets, and it is clear that, by rearranging the terms in the sum, this gives the same sum that was used to define the potential.
Observe that the total amount of potential dispatched from a single tuple is bounded by a constant times its own count, as the dispatch weights form a convergent geometric series. Thus, the total amount of potential is bounded by a constant times the total count, where the last step follows from the definition of the counts, as they account for all the eligible sums.
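The bookkeeping above can be made concrete under one assumption (not fixed by the text here): that the dispatch weight from a tuple to the tuple of ancestors $j_1, \dots, j_t$ levels higher is $2^{-(j_1+\cdots+j_t)}$. Then the total potential dispatched from a single tuple $w = (w_1, \dots, w_t)$ with count $\eta(w)$ is

```latex
\eta(w)\sum_{j_1 \ge 0} \cdots \sum_{j_t \ge 0} 2^{-(j_1 + \cdots + j_t)}
  \;=\; \eta(w)\prod_{i=1}^{t}\sum_{j \ge 0} 2^{-j}
  \;=\; 2^{t}\,\eta(w),
```

so, summing over all tuples, the total potential is $O\!\left(\sum_w \eta(w)\right)$ for constant $t$, i.e., at most a constant times the number of eligible sums.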
In other words, the total amount of potential is no more than a constant times the number of eligible sums. However, remember that the vertices are not sampled uniformly. Thus, to evaluate the expected value claimed in the lemma, we need to consider the exact distribution of the sampled nodes, using Observation 3. Thus,
Now we define the second bad event to be the event that the potential is large. By Markov’s inequality and Lemma 4, this event has small probability.
3.3 Proof of the main lemma.
Remember that we will focus on one equivalence class of sums. Observe that the summation in the lemma counts how many times a point of the $k$-top is covered by extensions of sums that cover at least a constant number of points of the $k$-top, and this only takes into account the remaining random choices, as the other nodes have been fixed. As a result, the summation is a random variable that depends only on those choices. To make this clear, let the candidate set include all the sums that can be part of the critical set over all the remaining random choices; the critical set is then a random subset of the candidate set. Observe that every candidate sum is stored at some node on the path from the defining node to the sampled node in each of the first $t$ dimensions, and in the corresponding subtree in the remaining dimensions. Since the $k$-top has exactly $k$ points, we can label them from one to $k$ under some global ordering of the points (e.g., lexicographic ordering). For each point, let its counter be the number of candidate sums containing it. Then, we can do the following rewriting:
By linearity of expectation,