Derandomization of Cell Sampling

Since 1989, the best known lower bound on static data structures has been Siegel's classical cell sampling lower bound. Siegel showed an explicit problem with $n$ inputs and $m$ possible queries such that every data structure that answers queries by probing $t$ memory cells requires space $s \ge \Omega(n \cdot (m/n)^{1/t})$. In this work, we improve this bound to $s \ge \Omega(n \cdot (m/n)^{1/(t-1)})$ for all $t \ge 2$. For the case of $t = 2$, we show a tight lower bound, resolving an open question repeatedly posed in the literature. Specifically, we give an explicit problem such that any data structure that probes $t = 2$ memory cells requires space $s > m - o(m)$.

1 Introduction

For a field $\mathbb{F}$, a static data structure problem with $n$ inputs and $m$ possible queries is given by a function $f \colon \mathbb{F}^n \to \mathbb{F}^m$. A static data structure (in the cell probe model) consists of two algorithms. The preprocessing algorithm takes an input $x \in \mathbb{F}^n$ and preprocesses it into $s$ memory cells. The query algorithm takes an index $i \in [m]$, then (non-adaptively) probes at most $t$ of these memory cells, and has to compute the $i$-th coordinate of $f(x)$. Here we assume that each input, memory cell, and query answer is an element of the field $\mathbb{F}$. We remark that in the cell probe model both the preprocessing and query algorithms are computationally unbounded.

Every data structure problem admits two trivial solutions:

• $s = m$ and $t = 1$, where in the preprocessing stage one precomputes the answers to all $m$ queries. (This solution uses prohibitively large space.)

• $s = n$ and $t = n$, where one does not use preprocessing, but rather just stores the input. (This solution uses prohibitively large query time.)
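
The two trivial solutions can be made concrete with a small sketch. It assumes a hypothetical toy problem $f(x) = Mx$ over the integers modulo a small prime; the names `P`, `f`, `PrecomputeAll`, and `StoreInput` are illustrative and do not come from the paper.

```python
# Assumed toy setting: the field F_7 and a linear problem f(x) = M x.
P = 7


def f(matrix, x):
    """Evaluate all m queries of the toy problem on input x."""
    return [sum(a * b for a, b in zip(row, x)) % P for row in matrix]


class PrecomputeAll:
    """Trivial solution with space s = m and query time t = 1."""

    def __init__(self, matrix, x):
        self.cells = f(matrix, x)  # one memory cell per query

    def query(self, i):
        # Returns (answer, number of probed cells).
        return self.cells[i], 1


class StoreInput:
    """Trivial solution with space s = n and query time t = n."""

    def __init__(self, matrix, x):
        self.matrix = matrix       # the problem itself is fixed, not stored data
        self.cells = list(x)       # one memory cell per input element

    def query(self, i):
        row = self.matrix[i]
        answer = sum(a * c for a, c in zip(row, self.cells)) % P
        return answer, len(self.cells)
```

Both classes answer every query correctly; they differ only in how many cells they store and how many they probe.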

A simple counting argument shows that a random data structure problem requires either almost trivial space $s \approx m$ or almost trivial query time $t \approx n$. The main challenge in this area is to prove a lower bound for an explicit problem. The best known explicit lower bound was proven by Siegel [Sie89] in 1989, and his technique was further developed in [Pǎt08, PTW10, Lar12]. This technique is now called cell sampling, and it will be discussed in greater detail later in this section. For an explicit problem, cell sampling gives us a lower bound of

 $$s \ge \tilde{\Omega}\left(n \cdot \left(\frac{m}{n}\right)^{1/t}\right).$$

In particular, when the number of queries $m$ is polynomially larger than $n$, Siegel's result provides a problem that requires logarithmic query time for linear space. Alas, for super-linear space, this best known lower bound only gives a trivial bound on the query time. It is a major challenge in this area to improve on Siegel's bound.

While for the case of $t = 1$, every non-trivial problem with $m$ queries requires essentially trivial space, even the case of $t = 2$ is not well understood. The cell sampling technique for $t = 2$ gives a lower bound of $s \ge \tilde{\Omega}(\sqrt{mn})$, but this is still far from the optimal bound of $s > m - o(m)$. Only recently, for the binary field $\mathbb{F}_2$, Viola [Vio19] proved a strong lower bound on the space complexity for the case of $t = 2$. Moreover, Viola [Vio19] showed that a better understanding of high lower bounds on the space complexity, even for low values of $t$, will lead to resolving a long-standing open problem in circuit complexity.

Our results.

In this work, we further develop the cell sampling technique and improve its bound to

 $$s \ge \tilde{\Omega}\left(n \cdot \left(\frac{m}{n}\right)^{1/(t-1)}\right).$$

We also show that this bound is tight for cell sampling. On the one hand, this new bound does not improve asymptotic lower bounds on the query time $t$ for any value of $s$. On the other hand, for every fixed value of $t$, the new bound gives an asymptotically stronger lower bound on $s$. Furthermore, this bound essentially resolves the question for the case of $t = 2$: for every field, and every number of queries $m$, we give an explicit problem such that any data structure that probes $t = 2$ memory cells requires memory $s > m - o(m)$. This improves on the bound of Viola [Vio19], and answers a question posed by Rao [Rao20].
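
To get a feel for the gap between the two bounds, the following sketch evaluates both expressions numerically. It ignores the constants and polylogarithmic factors hidden in the $\tilde{\Omega}$ notation, and the function names are illustrative only.

```python
def cell_sampling_bound(n, m, t):
    """Classical bound n * (m/n)^(1/t), without hidden factors."""
    if t < 1:
        raise ValueError("t must be at least 1")
    return n * (m / n) ** (1.0 / t)


def improved_bound(n, m, t):
    """Improved bound n * (m/n)^(1/(t-1)), without hidden factors."""
    if t < 2:
        raise ValueError("the improved bound is stated for t >= 2")
    return n * (m / n) ** (1.0 / (t - 1))


if __name__ == "__main__":
    n, m = 2 ** 10, 2 ** 20
    for t in (2, 3, 4):
        print(t, cell_sampling_bound(n, m, t), improved_bound(n, m, t))
```

For $t = 2$ the improved expression equals $m$ itself, while the classical one equals $\sqrt{mn}$; for larger $t$ the improved expression at $t$ coincides with the classical one at $t - 1$.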

Theorem 1.

Fix a finite field  and a parameter .

1. There exists an explicit problem with $n$ inputs and $m$ queries such that every static data structure solving it with query time $t = 2$ requires space $s > m - o(m)$.

2. For every $t \ge 3$, there exists an explicit problem with $n$ inputs and $m$ queries such that every static data structure solving it with query time $t$ requires space

 $$s \ge \Omega\left(n \cdot \left(\frac{m}{n}\right)^{1/(t-1)} \cdot \frac{1}{2^{t} \log(n) \log(m)}\right).$$

Theorem 1 follows from the following result about hypergraphs, together with standard techniques. In Theorem 2, we prove that every dense enough hypergraph contains a small dense subgraph.

Theorem 2.

Let be a multigraph with vertices and edges for some . There exists a set of size vertices spanning at least edges.

Let be an integer, and be a -hypergraph with vertices and hyperedges. Let be a parameter such that . If

 $$m \ge 3s \left(\frac{2^{t+1} \cdot s \cdot \log(s)}{k}\right)^{t-2}, \qquad (1)$$

then there exists a subset $S$ of vertices of size $|S| \le k$ that spans at least $|S| + \frac{k}{2^{t} \log(s)}$ hyperedges.

We are now ready to prove the main theorem of the paper by applying Theorem 2. We will prove our data structure lower bound for $k$-wise independent functions. It is well known that the parity check matrix of a linear code with distance greater than $k$ is $k$-wise independent. Therefore, one can define a $k$-wise independent data structure problem as the problem of multiplying an input vector $x \in \mathbb{F}^n$ by a fixed parity check matrix of a code with a large distance. In particular, for fields of size at least $m$, one can achieve $n$-wise independence by taking the Vandermonde matrix. For smaller fields, one can take rate-optimal linear codes [MS77], whose parity check matrices achieve the largest possible degree of independence [CGH85].
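
The Vandermonde construction can be checked by brute force on a small instance. The sketch below is an illustration under stated assumptions: it works over the integers modulo a prime `p`, treats each row of the matrix as one query, and interprets $n$-wise independence as "every $n$ rows are linearly independent". The helper names are hypothetical.

```python
from itertools import combinations


def vandermonde(points, n, p):
    """Rows (1, a, a^2, ..., a^(n-1)) mod p, one row per evaluation point a."""
    return [[pow(a, j, p) for j in range(n)] for a in points]


def rank_mod_p(rows, p):
    """Rank of a matrix over F_p via Gaussian elimination (p must be prime)."""
    rows = [[v % p for v in r] for r in rows]
    ncols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(ncols):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], -1, p)  # modular inverse, Python 3.8+
        rows[rank] = [(v * inv) % p for v in rows[rank]]
        for r in range(len(rows)):
            if r != rank and rows[r][col]:
                factor = rows[r][col]
                rows[r] = [(a - factor * b) % p
                           for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def every_k_rows_independent(matrix, k, p):
    """True if every k-subset of rows has full rank k over F_p."""
    return all(rank_mod_p([matrix[i] for i in idx], p) == k
               for idx in combinations(range(len(matrix)), k))
```

With `p = 7`, six distinct evaluation points and `n = 3`, every triple of rows has rank 3, which matches the requirement that the field has at least $m$ elements.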

Proof of Theorem 1.

Consider a data structure for a $k$-wise independent problem. For such a problem, in order to answer any $k$-tuple of queries, one needs to read at least $k$ memory cells.

For $t = 2$, we construct a multigraph with $s$ vertices corresponding to the memory cells of the data structure, and $m$ edges, each corresponding to the pair of memory cells read for a query. If the space $s$ is smaller than the claimed bound, then using the first part of Theorem 2 we get a small set of queries that depends on fewer memory cells than there are queries in the set, which contradicts the assumption on $k$-wise independence.

For $t \ge 3$, we construct a $t$-uniform hypergraph on $s$ vertices, where the vertices correspond to the memory cells of the data structure, and the hyperedges correspond to the $t$-tuples of memory cells read for each query. By the second part of Theorem 2, if the space $s$ is smaller than the claimed bound, there exists a small set of queries that can be answered by reading fewer memory cells than there are queries in the set, contradicting the assumption on $k$-wise independence. ∎
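
The reduction in the proof can be illustrated on tiny instances. The sketch below represents each query by the tuple of memory cells it probes and searches exhaustively for a smallest set of cells that spans more queries than it has cells. This brute force is only an illustration of the object the proof looks for, not the argument of Theorem 2, and the function names are hypothetical.

```python
from itertools import combinations


def spanned_edges(cells, probes):
    """Number of queries whose probed cells all lie inside `cells`."""
    chosen = set(cells)
    return sum(1 for probe in probes if set(probe) <= chosen)


def smallest_overfull_set(num_cells, probes):
    """Smallest set S of cells spanning more than |S| queries, or None.

    Exhaustive search; exponential in num_cells, so only for toy inputs.
    """
    for size in range(1, num_cells + 1):
        for cells in combinations(range(num_cells), size):
            if spanned_edges(cells, probes) > size:
                return cells
    return None
```

For example, if six queries probe all six pairs among four cells, the four cells together span six queries, so some five of those queries are answered from only four cells.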

Comparison to the cell sampling bound.

The classical cell sampling technique can be viewed as a slightly weaker version of Theorem 2. In the cell sampling argument, one picks $k$ random vertices and proves that with non-zero probability they span at least $k$ hyperedges. This way, each $t$-hyperedge is spanned with probability roughly $(k/s)^t$, and the expected number of spanned edges is roughly $m \cdot (k/s)^t$. This leads to the lower bound of $s \ge \Omega(k \cdot (m/k)^{1/t})$, which is weaker than the bound of $s \ge \tilde{\Omega}(k \cdot (m/k)^{1/(t-1)})$ from Theorem 2.
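
The sampling step can be sketched as follows. The code assumes that the sample is a uniformly random $k$-subset of the $s$ vertices and that each hyperedge consists of $t$ distinct vertices, in which case the expected number of spanned hyperedges is exactly $m \binom{k}{t} / \binom{s}{t}$, which is at most $m (k/s)^t$. All function names are illustrative.

```python
import random
from fractions import Fraction
from math import comb


def count_spanned(sample, hyperedges):
    """Number of hyperedges whose vertices all belong to the sample."""
    chosen = set(sample)
    return sum(1 for edge in hyperedges if set(edge) <= chosen)


def expected_spanned(s, m, k, t):
    """Exact expectation for a uniformly random k-subset of s vertices."""
    return Fraction(m * comb(k, t), comb(s, t))


def sample_cells(s, k, seed=0):
    """Uniformly random k-subset of range(s); the seed is only for repeatability."""
    rng = random.Random(seed)
    return rng.sample(range(s), k)
```

If the expectation is at least $k$, some sample of $k$ cells spans at least $k$ queries, which is the starting point of the cell sampling argument.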

The following proposition shows that the bound of Theorem 2 is essentially tight, which poses a barrier on further improvements using this technique.

Proposition 3.

Let $G$ be a $t$-uniform hypergraph with $s$ vertices and $m$ edges sampled uniformly at random. If $m \le \frac{s^{t-1}}{e^{5t-5} \cdot k^{t-2}}$, then with high probability $G$ does not have a set of $k$ vertices spanning at least $k$ hyperedges.

Proof.

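By the union bound over all $k$-subsets of vertices, and all $k$-subsets of edges, we have that the probability that the hypergraph has $k$ vertices spanning $k$ hyperedges is at most

 $$\binom{s}{k} \cdot \binom{m}{k} \cdot \left(\frac{\binom{k}{t}}{\binom{s}{t}}\right)^{k}.$$

Using the inequalities $\left(\frac{a}{b}\right)^{b} \le \binom{a}{b} \le \left(\frac{ea}{b}\right)^{b}$, we have that this probability is bounded from above by

 $$\left(\frac{se}{k}\right)^{k} \cdot \left(\frac{me}{k}\right)^{k} \cdot \left(\frac{ke}{s}\right)^{tk} = \left(\frac{e^{t+2} \, m \, k^{t-2}}{s^{t-1}}\right)^{k} \le \left(\frac{1}{e^{4t-7}}\right)^{k} \le e^{-k}.$$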
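
As a numeric sanity check of the first step of this calculation, the sketch below compares the exact union bound with its simplified form, both in natural-logarithm scale to avoid overflow. The parameter values in the usage note are arbitrary choices for illustration, and the function names are hypothetical.

```python
from math import comb, log


def log_union_bound(s, m, k, t):
    """ln of C(s,k) * C(m,k) * (C(k,t) / C(s,t))^k."""
    return (log(comb(s, k)) + log(comb(m, k))
            + k * (log(comb(k, t)) - log(comb(s, t))))


def log_simplified_bound(s, m, k, t):
    """ln of (e^(t+2) * m * k^(t-2) / s^(t-1))^k."""
    return k * ((t + 2) + log(m) + (t - 2) * log(k) - (t - 1) * log(s))


if __name__ == "__main__":
    s, m, k, t = 1000, 100, 10, 3
    print(log_union_bound(s, m, k, t), log_simplified_bound(s, m, k, t))
```

For instance, with `s = 1000`, `m = 100`, `k = 10`, `t = 3`, the simplified expression is already negative in log scale, so the probability of finding such a set is below one.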

Notation.

All logarithms in this paper are to the base two. By $P_k$ and $C_k$ we denote the path and the cycle graphs on $k$ vertices, respectively. By multigraphs we mean graphs that may contain parallel edges and (possibly parallel) self-loops. The degree of a vertex is the number of incident edges, and a self-loop adds two to the degree. For a multigraph $G$ and a subset $S$ of its vertices, $G[S]$ denotes the subgraph of $G$ induced on the vertices $S$.

By $t$-hypergraphs we mean hypergraphs where each edge contains at most $t$ distinct vertices, and where parallel edges are allowed. A $t$-uniform hypergraph is a $t$-hypergraph where each edge contains exactly $t$ vertices. We say that a set of vertices $S$ spans a hyperedge $e$ if all the vertices of $e$ belong to $S$.

2 Proof of Theorem 2

We will need the following auxiliary claim, which shows that every vertex of a cubic graph is the starting point of a small tadpole graph.

Claim 4.

Let be a multigraph, and let . Suppose that and for all . Then contains a path and a cycle such that , , , and .

In Claim 4, a cycle of length 1 means a self-loop, and a cycle of length 2 corresponds to a pair of parallel edges between two vertices. In particular, if the starting vertex has a self-loop, then we can take the trivial path consisting of this vertex alone, and the self-loop as the cycle.

Proof.

In order to prove the claim, we run the Breadth First Search algorithm starting at the given vertex. Since all other vertices have degree at least three, the algorithm will encounter a cycle within the first logarithmically many levels of the BFS tree. Combining this cycle with the (possibly empty) path from the root vertex to the cycle, we get a path connected to a cycle, as required. ∎
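
The BFS argument can be sketched in code. The function below is an illustration under stated assumptions: the multigraph is given as a list of vertex pairs (parallel edges and self-loops allowed), and the search stops at the first non-tree edge it meets. It returns the tree path from the root to the cycle together with the cycle, and it does not attempt to certify the length bounds of Claim 4. The name `find_tadpole` is hypothetical.

```python
from collections import deque


def find_tadpole(num_vertices, edges, root):
    """Return (path, cycle) as vertex lists, or None if no cycle is reachable.

    `path` starts at `root` and ends at a vertex of `cycle`.
    A cycle of length 1 is a self-loop; length 2 is a pair of parallel edges.
    """
    adj = [[] for _ in range(num_vertices)]
    for eid, (u, w) in enumerate(edges):
        adj[u].append((eid, w))
        if u != w:
            adj[w].append((eid, u))

    parent = {root: None}        # BFS tree: vertex -> parent vertex
    parent_edge = {root: None}   # BFS tree: vertex -> id of the tree edge used

    def chain(v):
        """Vertices on the tree path from v up to the root."""
        out = []
        while v is not None:
            out.append(v)
            v = parent[v]
        return out

    queue = deque([root])
    while queue:
        u = queue.popleft()
        for eid, w in adj[u]:
            if eid == parent_edge[u]:
                continue  # do not walk back along the tree edge itself
            if w not in parent:
                parent[w], parent_edge[w] = u, eid
                queue.append(w)
                continue
            # Non-tree edge: it closes a cycle through the lowest common ancestor.
            up, down = chain(u), chain(w)
            on_down = set(down)
            meet = next(v for v in up if v in on_down)
            cycle = up[:up.index(meet) + 1] + down[:down.index(meet)][::-1]
            path = chain(meet)[::-1]
            return path, cycle
    return None
```

For a root attached by a single edge to a triangle, the function returns the one-edge path to the triangle and the triangle itself.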

We are now ready to prove the first part of Theorem 2.

Lemma 5.

For every and every multigraph with vertices and edges, there exists a set of vertices spanning at least edges.

Proof.

We iteratively apply the following operations to the graph as long as at least one of them is applicable.

• If the graph contains a vertex of degree one, then we remove this vertex and the incident edge from the graph. In this case, we remove one edge and one vertex.

• If the graph contains a vertex whose only incident edge is its self-loop, then we remove this vertex with the self-loop. Again, we remove one edge and one vertex.

• If the graph contains a long path of vertices of degree two, then we remove all vertices of degree two belonging to this path together with all their incident edges. In this case, we remove one more edge than vertices.

Note that none of these operations decreases the average degree of the graph, as the resulting graph has $s'$ vertices and $m'$ edges such that $m'/s' \ge m/s$. Each of the remaining vertices of degree two belongs to a short path of such vertices; we contract each such path into an edge, and obtain a graph of minimum degree three. (Note that such a contraction may create a self-loop, in case the endpoints of the path are the same vertex.)
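
The first two pruning operations can be sketched as follows. This is a simplified illustration that represents the multigraph as a vertex set and an edge list; it deliberately omits the third operation and the contraction step, and the name `prune` is hypothetical.

```python
def prune(vertices, edges):
    """Repeatedly remove a vertex that has exactly one incident edge.

    This covers vertices of degree one and vertices whose only incident
    edge is a self-loop. Each step removes one vertex and one edge, so the
    difference (number of edges - number of vertices) is unchanged.
    """
    vertices = set(vertices)
    edges = list(edges)
    changed = True
    while changed:
        changed = False
        incident = {v: [] for v in vertices}
        for idx, (u, w) in enumerate(edges):
            incident[u].append(idx)
            if w != u:
                incident[w].append(idx)
        for v, idxs in incident.items():
            if len(idxs) == 1:
                del edges[idxs[0]]
                vertices.remove(v)
                changed = True
                break  # rebuild the incidence lists before the next removal
    return vertices, edges
```

On a complete graph on four vertices with a pendant path of two extra vertices attached, the pendant path is stripped and the complete graph remains.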

We apply Claim 4 to the resulting graph and an arbitrary vertex, and get a short cycle $C$. (Here, we ignore the path guaranteed by Claim 4.) Next, we consider the following two cases:

• If is a connected component in , then since all vertices have degrees of at least 3, the vertices of must span at least edges. Let be the vertices of the subgraph , where all contracted edges are expanded back into the vertices and edges of . Since each expanded edge adds the same number of vertices and edges, we have that spans at least edges. Then the set satisfies the required property as .

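• Otherwise, we contract the cycle $C$ into a new vertex $v$, and consider the graph obtained in this way. Since $C$ is not a connected component, it follows that $v$ is not an isolated vertex in the contracted graph, and hence $\deg(v) \ge 1$. We now apply Claim 4 to the contracted graph and the vertex $v$, and get a path starting at $v$ and a second cycle in the contracted graph, satisfying the bounds of the claim.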

Recalling that the vertex in corresponds to in , it is easy to verify that the set of vertices has vertices, and spans at least edges in . By expanding the contracted edges, we again have a set of at most vertices spanning at least edges.

This completes the proof of Lemma 5. ∎

The following lemma generalizes Lemma 5 by finding a small induced subgraph with a large gap between the number of vertices and the number of edges.

Lemma 6.

Let be a multigraph with and . If and , then there is a subset of vertices of size at most that spans at least edges.

Proof.

The proof is by induction on . The base case of follows from Lemma 5 with . Next we assume that the lemma is true for , and prove it for . By the induction hypothesis there is a subset of vertices of size at most such that spans at least edges.

If spans edges, then we are done. Otherwise, must span exactly edges. Consider a graph obtained from by contracting into a new vertex and removing the edges with both ends in . Then has vertices and edges. In particular, . Therefore, by Lemma 5, has a subset of vertices of size at most such that spans at least edges. We remark that may or may not contain the vertex .

By taking we obtain a set of vertices spanning at least edges. ∎

We now finish the proof of Theorem 2.

Proof.

The first part of the theorem is proven in Lemma 5. For the second part of the theorem, without loss of generality, we assume that the hypergraph is $t$-uniform. Indeed, if an edge has fewer than $t$ vertices, then we extend this edge with arbitrary vertices, and the theorem statement for the new hypergraph will imply the statement for the original one.

The proof of the second part of the theorem is by induction on $t$. For the base case of $t = 2$, the statement follows immediately from Lemma 6. Indeed, for $t = 2$ the bound (1) implies that $m \ge 3s$, and by Lemma 6 we get the desired conclusion.

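For the induction step, let us prove the statement of the theorem for $t$, assuming that it holds for $t - 1$. Let $\ell = \frac{k}{2^{t+1} \log(s)}$. Since $2^{t+1} \log(s) \ge 2$, we have that $\ell \le k/2$. Note that by the assumption on $m$ in (1) there exists a subset $T$ of $\ell$ vertices such that the number of hyperedges touching them is at least $\ell m / s$. Indeed, since $\sum_{v} \deg(v) = tm$, it must be the case that there is a set $T$ of $\ell$ vertices such that $\sum_{v \in T} \deg(v) \ge \ell t m / s$. And since each hyperedge is counted in the sum at most $t$ times, it follows that the number of edges adjacent to $T$ is at least $\ell m / s$.

Associate each hyperedge $e$ touching $T$ with some vertex $v_e \in e \cap T$. That is, if $e$ contains a unique vertex in $T$, then we associate $e$ with this vertex, and if there is more than one such vertex, then we choose one arbitrarily.

Define the hypergraph $G'$ on the same vertex set, whose hyperedges are the sets $e \setminus \{v_e\}$ for the hyperedges $e$ touching $T$. Note that $G'$ has $s$ vertices and the number of its hyperedges (of size at most $t - 1$) is at least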

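 $$\frac{\ell m}{s} \ge 3s \left(\frac{2^{t+1} \cdot s \cdot \log(s)}{k}\right)^{t-2} \cdot \frac{\ell}{s} = 3s \left(\frac{2^{t+1} \cdot s \cdot \log(s)}{k}\right)^{t-2} \cdot \frac{k}{2^{t+1} \cdot s \cdot \log(s)} = 3s \left(\frac{2^{t} \cdot s \cdot \log(s)}{k/2}\right)^{t-3} \ge 3s \left(\frac{2^{t} \cdot s \cdot \log(s)}{k - \ell}\right)^{t-3},$$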

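where the last inequality uses $\ell \le k/2$. Therefore, we can apply the induction hypothesis to this $(t-1)$-hypergraph with $k - \ell$ being the bound on the size of the guaranteed set. We get that it has a subset $S^*$ of vertices of size $|S^*| \le k - \ell$ that spans at least $|S^*| + \frac{k - \ell}{2^{t-1} \log(s)}$ hyperedges. Define the set $S$ as the union of $S^*$ and the $\ell$ vertices chosen above. Therefore, $|S| \le |S^*| + \ell \le k$, and since the number of hyperedges spanned by $S$ in the original hypergraph is at least the number of hyperedges spanned by $S^*$ in the new hypergraph, it follows that $S$ spans at least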

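 $$|S^*| + \frac{k - \ell}{2^{t-1} \log(s)} \ge |S| - \ell + \frac{k - \ell}{2^{t-1} \log(s)} \ge |S| - \frac{k}{2^{t+1} \log(s)} + \frac{3k}{2^{t+1} \log(s)} \ge |S| + \frac{k}{2^{t} \log(s)}$$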

edges, as required. ∎
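
The arithmetic of the induction step can be double-checked with exact rational numbers. The sketch below assumes $s$ is a power of two, so that $\log(s)$ is an integer, and takes $\ell = k / (2^{t+1} \log(s))$ without rounding; it verifies the two displayed chains for concrete parameters. The function names are illustrative.

```python
from fractions import Fraction


def bound(t, s, log_s, k):
    """Right-hand side of (1): 3s * (2^(t+1) * s * log(s) / k)^(t-2)."""
    return 3 * s * (Fraction(2 ** (t + 1) * s * log_s) / k) ** (t - 2)


def check_induction_step(t, log_s, k):
    """Return (density_ok, gain_ok) for s = 2^log_s and t >= 3."""
    s = 2 ** log_s
    k = Fraction(k)
    ell = k / (2 ** (t + 1) * log_s)

    # Density chain: bound_t(k) * ell / s equals bound_{t-1}(k/2),
    # which is at least bound_{t-1}(k - ell) because k - ell >= k/2.
    reduced = bound(t, s, log_s, k) * ell / s
    density_ok = (reduced == bound(t - 1, s, log_s, k / 2)
                  and reduced >= bound(t - 1, s, log_s, k - ell))

    # Gain chain: (k - ell) / (2^(t-1) log s) - ell >= k / (2^t log s).
    gain = (k - ell) / (2 ** (t - 1) * log_s) - ell
    gain_ok = gain >= k / (2 ** t * log_s)
    return density_ok, gain_ok
```

Calling `check_induction_step(3, 10, 100)` checks the step from $t - 1 = 2$ to $t = 3$ with $s = 2^{10}$ and $k = 100$.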

References

• [CGH85] Benny Chor, Oded Goldreich, Johan Håstad, Joel Friedman, Steven Rudich, and Roman Smolensky. The bit extraction problem or $t$-resilient functions. In FOCS 1985, pages 396–407, 1985.
• [Lar12] Kasper Green Larsen. Higher cell probe lower bounds for evaluating polynomials. In FOCS 2012, pages 293–301, 2012.
• [MS77] Florence Jessie MacWilliams and Neil James Alexander Sloane. The theory of error-correcting codes. Elsevier, 1977.
• [Pǎt08] Mihai Pǎtraşcu. Unifying the landscape of cell-probe lower bounds. In FOCS 2008, pages 434–443, 2008.
• [PTW10] Rina Panigrahy, Kunal Talwar, and Udi Wieder. Lower bounds on near neighbor search via metric expansion. In FOCS 2010, pages 805–814, 2010.
• [Rao20] Anup Rao. Personal communication, 2020.
• [Sie89] Alan Siegel. On universal classes of fast high performance hash functions, their time-space tradeoff, and their applications. In FOCS 1989, pages 20–25, 1989.
• [Vio19] Emanuele Viola. Lower bounds for data structures with space close to maximum imply circuit lower bounds. Theory of Computing, 15(1):1–9, 2019.