Tight Lower Bound for Comparison-Based Quantile Summaries

05/09/2019 ∙ by Graham Cormode, et al. ∙ University of Warwick

Quantiles, such as the median or percentiles, provide concise and useful information about the distribution of a collection of items, drawn from a linearly ordered universe. We study data structures, called quantile summaries, which keep track of all quantiles, up to an error of at most ε. That is, an ε-approximate quantile summary first processes a stream of items and then, given any quantile query 0 < ϕ < 1, returns an item from the stream, which is a ϕ'-quantile for some ϕ' = ϕ ± ε. We focus on comparison-based quantile summaries that can only compare two items and are otherwise completely oblivious of the universe. The best such deterministic quantile summary to date, by Greenwald and Khanna (ACM SIGMOD '01), stores at most O(1/ε · log εN) items, where N is the number of items in the stream. We prove that this space bound is optimal by providing a matching lower bound. Our result thus rules out the possibility of constructing a deterministic comparison-based quantile summary in space f(ε) · o(log N), for any function f that does not depend on N. As a consequence, our result also implies a lower bound for randomized algorithms.

1 Introduction

The streaming model of computation is a useful abstraction to understand the complexity of working with large volumes of data, too large to conveniently store. A number of results are known for this model, and effective algorithms are known for many basic functions, such as finding frequent items, computing the number of distinct items, and measuring the empirical entropy of the data. Typically, in the streaming model we allow just one pass over the data and must use a small amount of memory, that is, sublinear in the data size. While computing sums, averages, or counts is trivial with constant memory, finding the median, quartiles, percentiles and their generalizations, quantiles, presents a challenging task. Indeed, four decades ago, Munro and Paterson [14] showed that finding the exact median in p passes over the data requires Ω(N^{1/p}) memory, where N is the number of items in the stream. They also provide a p-pass algorithm for selecting the k-th smallest item in space N^{1/p} · log^{2−2/p} N, and an O(log N)-pass algorithm running in space O(log² N).

Thus, either large space or a large number of passes is necessary for finding the exact median. For this reason, subsequent research has mostly been concerned with the computation of approximate quantiles, which are often sufficient for applications. Namely, for a given precision guarantee ε > 0 and a query ϕ ∈ [0, 1], instead of finding the ϕ-quantile, i.e., the ⌊ϕN⌋-th smallest item, we allow the algorithm to return a ϕ'-quantile for ϕ' ∈ [ϕ − ε, ϕ + ε]. In other words, when queried for the k-th smallest item (where 1 ≤ k ≤ N), the algorithm may return the k'-th smallest item for some k' ∈ [k − εN, k + εN]. Such an item is called an ε-approximate ϕ-quantile.
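The definition above can be illustrated with a minimal Python sketch. The function names `rank` and `is_approx_quantile` are hypothetical, and the check assumes distinct items and compares ranks against ϕN without rounding, which is one possible reading of the definition.

```python
def rank(stream, item):
    """1-based position of `item` in the sorted order of `stream` (distinct items assumed)."""
    return 1 + sum(1 for x in stream if x < item)


def is_approx_quantile(stream, item, phi, eps):
    """Check whether the rank of `item` is within eps*N of the target rank phi*N."""
    n = len(stream)
    return abs(rank(stream, item) - phi * n) <= eps * n


stream = [17, 3, 42, 8, 25, 11, 30, 5, 19, 36]  # N = 10, arbitrary arrival order

# Item 19 has rank 6; the target rank for the median is 5, and eps*N = 2.
near_median = is_approx_quantile(stream, 19, phi=0.5, eps=0.2)

# Item 36 has rank 9, which is more than eps*N = 2 away from the target rank 5.
far_from_median = is_approx_quantile(stream, 36, phi=0.5, eps=0.2)
```

Here `near_median` is true and `far_from_median` is false: any item of rank between 3 and 7 is an acceptable answer to the median query at this precision.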

More precisely, we are interested in a data structure, called an ε-approximate quantile summary, that processes a stream of items from a linearly ordered universe in a single pass. Then, it returns an ε-approximate ϕ-quantile for any query ϕ ∈ [0, 1]. We optimize the space used by the quantile summary, measured in words, where a word can store any item or an integer with O(log N) bits (that is, counters, pointers, etc.). (Footnote 1: Hence, if b bits are sufficient for storing an item, then the space complexity in bits is at most max(b, O(log N)) times the space complexity in words.)

We do not assume that items are drawn from a particular distribution, but rather focus on data-independent solutions with worst-case guarantees. Quantile summaries are a valuable tool, since they immediately provide solutions for a range of related problems: estimating the cumulative distribution function; answering rank queries; constructing equi-depth histograms (where the number of items in each bucket must be approximately equal); performing Kolmogorov-Smirnov statistical tests [9]; and balancing parallel computations [16].

Note that offline, with random access to the whole data set, we can design an ε-approximate quantile summary with storage cost just O(1/ε). We simply select the 0-quantile (the smallest item), the 2ε-quantile, the 4ε-quantile, …, and the 1-quantile, and arrange them in a sorted array. Moreover, this is optimal, since there cannot be an interval I ⊆ [0, 1] of size more than 2ε such that there is no ϕ-quantile for any ϕ ∈ I in the quantile summary.
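The offline construction can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names are hypothetical, and storing each selected item together with its rank is a simplifying assumption made here so that queries are easy to answer.

```python
import math


def build_offline_summary(data, eps):
    """Select items whose ranks are spaced about 2*eps*N apart, plus the maximum."""
    s = sorted(data)
    n = len(s)
    step = max(1, math.floor(2 * eps * n))
    ranks = list(range(1, n + 1, step))
    if ranks[-1] != n:
        ranks.append(n)  # always keep the largest item
    return [(r, s[r - 1]) for r in ranks]


def query(summary, n, phi):
    """Return the stored item whose rank is closest to the target rank phi*N."""
    target = phi * n
    return min(summary, key=lambda pair: abs(pair[0] - target))[1]


n, eps = 1000, 0.05
data = [(37 * i) % n for i in range(n)]  # a permutation of 0..999; value v has rank v + 1
summary = build_offline_summary(data, eps)

# Largest rank error over a grid of queries phi = 0, 0.005, ..., 1.
worst_error = max(
    abs((query(summary, n, q / 200) + 1) - (q / 200) * n) for q in range(201)
)
```

With ε = 0.05 and N = 1000 the summary holds 11 items, and since consecutive stored ranks are at most 2εN apart, every target rank is within εN = 50 of some stored rank.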

Building on the work of Munro and Paterson [14], Manku, Rajagopalan, and Lindsay [11] designed a (streaming) quantile summary which uses space O(1/ε · log² εN), although it relies on advance knowledge of the stream length N. Then, shaving off one log factor, Greenwald and Khanna [4] gave an ε-approximate quantile summary, which needs just O(1/ε · log εN) words and does not require any advance information about the stream. Both of these deterministic algorithms work for any universe with a linear ordering, as they only need to compare items. We call such an algorithm comparison-based.

The question of whether one can design a 1-pass deterministic algorithm that runs in constant space for a constant ε has been open for a long time, as highlighted by the first author in 2006 [1]. Following the above discussion, there is a trivial lower bound of Ω(1/ε) that holds even offline. This was the best known lower bound until 2010, when Hung and Ting [7] proved that a deterministic comparison-based algorithm needs space Ω(1/ε · log 1/ε).

We significantly improve upon that result by showing that any deterministic comparison-based data structure providing ε-approximate quantiles needs to use Ω(1/ε · log εN) memory on the worst-case input stream. Our lower bound thus matches Greenwald and Khanna’s result, up to a constant factor, and in particular, it rules out an algorithm running in space f(ε) · o(log N), for any function f that does not depend on N. It also follows that a comparison-based data structure with o(1/ε · log εN) memory must fail to provide an ε-approximate ϕ-quantile for some ϕ ∈ [0, 1]. Using a standard reduction (appending more items to the end of the stream), this implies that there is no deterministic comparison-based streaming algorithm that returns an ε-approximate median and uses o(1/ε · log εN) memory. Applying a known reduction, this yields a lower bound of Ω(1/ε · log log 1/δ) for any randomized comparison-based algorithm. We refer to Section 6 for a discussion of this and other corollaries of our result.

1.1 Overview and Comparison to the previous bound [7]

Let D be a deterministic comparison-based quantile summary. From a high-level point of view, we prove the space lower bound for D by constructing two streams π and ϱ satisfying two opposing constraints: On one hand, the behavior of D on these streams is the same, implying that the memory states after processing π and ϱ are the same, up to an order-preserving renaming of the stored items. For this reason, π and ϱ are called indistinguishable. On the other hand, the adversary introduces as much uncertainty as possible. Namely, it makes the difference between the rank of a stored item with respect to (w.r.t.) π and the rank of the next stored item w.r.t. ϱ as large as possible, where the rank of an item w.r.t. a stream σ is its position in the ordering of σ. If this uncertainty, which we call the “gap”, is too large, then D fails to provide an ε-approximate ϕ-quantile for some ϕ ∈ [0, 1]. The crucial part of our lower bound proof is to construct the two streams in a way that yields a good trade-off between the number of items stored by the algorithm and the largest gap introduced.

While the previous lower bound of Ω(1/ε · log 1/ε) [7] is in the same computational model, and also works by creating indistinguishable streams with as much uncertainty as possible, our approach is substantially different. Mainly, the construction by Hung and Ting [7] is inherently sequential as it works in iterations and appends items in each iteration to the streams constructed (and moreover, up to new streams are created from each former stream in each iteration). Thus, their construction produces (a large number of) indistinguishable streams of length . Furthermore, having the number of iterations equal to the number of items appended during each iteration (up to a constant factor) is crucial for the analysis in [7].

In contrast, our construction is recursive and produces just two indistinguishable streams of length Θ(1/ε · 2^k) for any k ≥ 1. For k = log_2 1/ε, our lower bound of Ω(1/ε · k) implies the previous lower bound of Ω(1/ε · log 1/ε), and hence for higher k, our lower bound is strictly stronger than the previous one.

Organization of the paper. In Section 2, we start by describing the formal computational model in which our lower bound holds and formally stating our result. In Section 3, we introduce indistinguishable streams, and in Section 4 we describe our construction. Then, Section 5 shows the crucial inequality between the space and the largest gap (the uncertainty), which implies the lower bound. Finally, in Section 6 we give corollaries of the construction and discuss related open problems.

1.2 Related Work

The Greenwald-Khanna algorithm [4] is generally regarded as the best deterministic quantile summary. The space bound of O(1/ε · log εN) follows from a somewhat involved proof, and it has been questioned whether this approach could be simplified or improved. Our work answers this second question in the negative. For a known universe U of bounded size, Shrivastava et al. [15] designed a quantile summary using O(1/ε · log |U|) words. Note that their algorithm is not comparison-based and so the result is incomparable to the upper bound of O(1/ε · log εN).

If we tolerate randomization and relax the requirement for worst-case error guarantees, it is possible to design quantile summaries with space close to 1/ε. After a sequence of improvements [12, 2, 10, 3], Karnin, Lang, and Liberty [8] designed a randomized comparison-based quantile summary with space bounded by O(1/ε · log log 1/δ), where δ is the probability of not returning an ε-approximate ϕ-quantile for some ϕ. The algorithm relies on careful sampling and uses the Greenwald-Khanna summary as a subroutine. They also provide a reduction to transform the deterministic lower bound into a randomized lower bound of Ω(1/ε · log log 1/δ), implying optimality of their approach in its model. We discuss further how the deterministic and randomized lower bounds relate in Section 6.

Luo et al. [10] compared quantile summaries experimentally and also provided a simple randomized algorithm with good practical performance. Their paper studies not only streaming algorithms for insertion-only streams (i.e., the cash register model), but also for turnstile streams, in which items may depart. Note that any algorithm for turnstile streams inherently relies on the bounded size of the universe. We refer the interested reader to the survey of Greenwald and Khanna [5] for a description of both deterministic and randomized algorithms, together with algorithms for turnstile streams, the sliding window model, and distributed algorithms.

Other results arise when relaxing the requirement for correctness under adversarial order to assuming that the input arrives in a random order. For random-order streams, Guha and McGregor [6] studied algorithms for exact and approximate selection of quantiles. Among other things, they gave an algorithm for finding the exact ϕ-quantile in space polylog(N) using O(log log N) passes over a random-order stream, while with polylog(N) memory we need to do Ω(log N / log log N) passes on the worst-case stream. The Shifting Sands algorithm [13] reduces the magnitude of the error from O(N^{1/2}) to O(N^{1/3}). Since our lower bound relies on carefully constructing an adversarial input sequence, it does not apply to this random-order model.

2 Computational Model

We present our lower bounds in a comparison-based model of computation, in line with prior work, most notably that of Hung and Ting [7]. We assume that the items forming the input stream are drawn from a linearly ordered universe U, about which the algorithm has no further information. The only allowed operations on items are to perform an equality test and a comparison of two given items. This specifically rules out manipulations which try to combine multiple items into a single storage location, or replace a group of items with an “average” representative. We assume that the universe is unbounded and continuous in the sense that any non-empty open interval contains an unbounded number of items. Our proof relies on this property to draw new elements falling between any previously observed pair. An example of such a universe is a large enough set of long incompressible strings, ordered lexicographically (where the continuous assumption may be achieved by making the strings even longer).
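To make the model concrete, the following Python sketch wraps items so that only a comparison and an equality test are exposed. The class `OpaqueItem` and the helper `fresh_item_between` are hypothetical names used for illustration; rational numbers stand in for the continuous universe, and only the adversary (not the summary) is assumed to look at the hidden key.

```python
import bisect
from fractions import Fraction


class OpaqueItem:
    """An item of the universe that supports only `<` and `==`."""

    __slots__ = ("_key",)

    def __init__(self, key):
        self._key = key  # hidden from the summary; used only by the adversary

    def __lt__(self, other):
        return self._key < other._key

    def __eq__(self, other):
        return self._key == other._key


def fresh_item_between(a, b):
    """Adversary-side helper: a new item strictly inside the open interval (a, b)."""
    return OpaqueItem((a._key + b._key) / 2)


# A sorted item array, maintained using comparisons only.
items = [OpaqueItem(Fraction(0)), OpaqueItem(Fraction(1))]
for _ in range(5):
    # The universe is continuous, so a new item always fits between two neighbours.
    new_item = fresh_item_between(items[0], items[1])
    bisect.insort(items, new_item)  # insort relies only on `<`
```

After the loop the array holds seven distinct items in increasing order, each new one squeezed between the current two smallest.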

Let D be a deterministic data structure for processing a stream of items, i.e., a sequence of items arriving one by one. We make the following assumptions about the memory contents of D. The memory used by D will contain some items from the stream, each considered to occupy one memory cell, and some other information which could include lower and upper bounds on the ranks of stored items, counters, etc. However, we assume that the memory does not contain the result of any function applied on any items from the stream, apart from a comparison, the equality test and the trivial function f(x) = x (since other functions are prohibited by our model). Thus, we can partition the memory state into a pair M = (I, G), where I is the item array for storing some items from the input, indexed from 1, and there are no items stored in the general memory G.

We give our lower bound on the memory size only in terms of |I|, the number of items stored, and ignore the size of G. For simplicity, we assume without loss of generality that the contents of I are sorted non-decreasingly, i.e., I[1] ≤ I[2] ≤ ⋯ ≤ I[|I|]. If this were not the case, we could equivalently apply an in-place sorting algorithm after processing each item, while the information potentially encoded in the former ordering of I can be retained in G, whose size we do not measure. Finally, we can assume that the minimum and maximum elements of the input stream are always maintained, with at most a constant additional storage space.

Summarizing, we have the following definition.

Definition 1.

We say that a quantile summary D is comparison-based if the following holds after processing every item in the input stream:

  1. D does not apply any function on any items from the stream, apart from a comparison, the equality test, and the identity function f(x) = x.

  2. The memory of D is divided into the item array I, which stores only items that have already occurred in the stream (sorted non-decreasingly), and general memory G, which does not contain any item. Furthermore, once an item is removed from I, it cannot be added back to I, unless it appears in the stream again.

  3. Given the i-th item a_i from the input stream, the computation of D is determined solely by the results of comparisons between a_i and I[j], for j = 1, …, |I|, the number |I| of items stored, and the contents of the general memory G.

  4. Given a quantile query 0 ≤ ϕ ≤ 1, its computation is determined solely by the number of items stored (|I|), and the contents of the general memory G. Moreover, D can only return one of the items stored in I.

We are now ready to state our main result formally.

Theorem 2.

For any 0 < ε < 1/16, there is no deterministic comparison-based ε-approximate quantile summary which stores o(1/ε · log εN) items on any input stream of length N.

Fix the approximation guarantee ε and assume for simplicity that 1/ε is an integer. Let D be a deterministic comparison-based ε-approximate quantile summary. We show that for any integer k ≥ 1, data structure D needs to store at least Ω(1/ε · k) items from some input stream of length N_k := 1/ε · 2^k (thus, we have k = log_2 εN_k).

Notation and conventions. We assume that D starts with an empty memory state M_∅ = (I_∅, G_∅) with |I_∅| = 0. For an item a, let D(M, a) be the resulting memory state after processing item a if the memory state was M before processing a. Moreover, for a stream σ = σ_1, …, σ_N, let D(M, σ) = D(… D(D(M, σ_1), σ_2) …, σ_N) be the memory state after processing stream σ. For brevity, we use (I_σ, G_σ) = D(M_∅, σ), or just I_σ for the item array after processing stream σ.

When referring to an order of a set of items, we always mean the non-decreasing order. For an item a in stream σ, let rank_σ(a) be the rank of a in the order of σ, i.e., the position of a in the ordering of σ. In our construction, all items in each of the streams will be distinct, thus rank_σ(a) is well-defined and equal to one more than the number of items that are strictly smaller than a.

3 Indistinguishable Streams

We start by defining an equivalence of memory states, which captures their equality up to renaming stored items. Then, we give the definition of indistinguishable streams.

Definition 3.

Two memory states (I_1, G_1) and (I_2, G_2) are said to be equivalent if (i) |I_1| = |I_2|, i.e., the number of items stored is the same, and (ii) G_1 = G_2.

Definition 4.

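We say that two streams π = π_1 … π_N and ϱ = ϱ_1 … ϱ_N of length N are indistinguishable for D if (1) the final memory states (I_π, G_π) and (I_ϱ, G_ϱ) are equivalent, and (2) for any 1 ≤ i ≤ |I_π| = |I_ϱ|, there exists 1 ≤ j ≤ N such that both I_π[i] = π_j and I_ϱ[i] = ϱ_j.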

We remark that condition (2) is implied by (1) if the positions of stored items in the stream are retained in the general memory, but we make this property explicit as we shall use it later. In the following, let π and ϱ be two indistinguishable streams with N items. Note that, after D processes one of π and ϱ and receives a quantile query 0 ≤ ϕ ≤ 1, D must return the i-th item of array I for some i, regardless of whether the stream was π or ϱ. This follows, since D can make its decisions based only on the values in G, which are identical in both cases, and on operations on values in I, which are indistinguishable under the comparison-based model.

For any k ≥ 1, our general approach is to recursively construct two streams π_k and ϱ_k of length N_k that satisfy two constraints set in opposition to each other: They are indistinguishable for D, but at the same time, for some j, the rank of I_{π_k}[j] in stream π_k and the rank of I_{ϱ_k}[j+1] in stream ϱ_k are as different as possible; we call this difference the “gap”. The latter constraint is captured by the following definition.

Definition 5.

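We define the largest gap between indistinguishable streams π and ϱ (for D) as

gap(π, ϱ) = max_{1 ≤ i < |I_π|} ( rank_ϱ(I_ϱ[i+1]) − rank_π(I_π[i]) ).

As we assume that I_ϱ is sorted, I_ϱ[i+1] is the next stored item after I_ϱ[i] in the ordering of I_ϱ. We will also ensure that rank_π(I_π[i]) ≤ rank_ϱ(I_ϱ[i]) for any 1 ≤ i ≤ |I_π|. Hence, gap(π, ϱ) ≥ 1. We also have that gap(π, ϱ) ≥ gap(π, π), which follows, since for any 1 ≤ i < |I_π| it holds that rank_ϱ(I_ϱ[i+1]) ≥ rank_π(I_π[i+1]).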
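As a small worked example of the largest gap, the Python sketch below compares the rank of each stored item in one stream with the rank of the next stored item in the other stream and takes the maximum difference. The two streams, the stored positions, and the function names are made up for illustration; the example assumes a hypothetical summary that has already discarded the third item (20) before the last two items arrive, so that both streams produce the same comparison outcomes.

```python
def rank(stream, item):
    """1-based position of `item` in the sorted order of `stream` (distinct items assumed)."""
    return 1 + sum(1 for x in stream if x < item)


def largest_gap(pi, rho, stored_positions):
    """Largest difference between the rank in `rho` of the (i+1)-th stored item
    and the rank in `pi` of the i-th stored item."""
    item_array_pi = sorted(pi[j] for j in stored_positions)
    item_array_rho = sorted(rho[j] for j in stored_positions)
    return max(
        rank(rho, item_array_rho[i + 1]) - rank(pi, item_array_pi[i])
        for i in range(len(item_array_pi) - 1)
    )


# Both streams agree on the first three items and then diverge.
pi = [10, 40, 20, 11, 12]
rho = [10, 40, 20, 31, 32]

# 0-based stream positions of the items kept in the item array (the same in both streams).
stored = [0, 1, 3]

gap_pi_rho = largest_gap(pi, rho, stored)
gap_pi_pi = largest_gap(pi, pi, stored)
```

The item arrays are [10, 11, 40] for the first stream and [10, 31, 40] for the second. The stored item at position 3 has rank 2 in the first stream but rank 3 in the second, and the largest gap evaluates to 3 in both calls.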

Lemma 6.

If D is an ε-approximate quantile summary, then gap(π, ϱ) ≤ 2εN.

Proof.

Suppose that gap(π, ϱ) > 2εN. We show that D fails to provide an ε-approximate ϕ-quantile for some 0 ≤ ϕ ≤ 1, which is a contradiction. Namely, because gap(π, ϱ) > 2εN, there is 1 ≤ i < |I_π| such that rank_ϱ(I_ϱ[i+1]) − rank_π(I_π[i]) > 2εN. Let ϕ be such that ϕ · N = ½ · (rank_ϱ(I_ϱ[i+1]) + rank_π(I_π[i])), i.e., ϕ · N is in the middle of the “gap”. Since streams π and ϱ are indistinguishable and D is comparison-based, given query ϕ, D must return the j-th item of item array I for some j, regardless of whether the stream is π or ϱ. Observe that if j ≤ i and the input stream is π, item I_π[j] is not an ε-approximate ϕ-quantile of items in π. Otherwise, when j > i, then item I_ϱ[j] is not an ε-approximate ϕ-quantile of stream ϱ. In either case, we get a contradiction. ∎

As the minimum and maximum elements of stream π are in I_π, it holds that gap(π, ϱ) ≥ gap(π, π) ≥ N / |I_π|, thus the number of stored items is at least N / gap(π, ϱ) ≥ 1/(2ε), where the last inequality is by Lemma 6. This gives an initial lower bound of Ω(1/ε) space. Our construction of adversarial inputs for D in the next section increases this bound.

4 Construction of Indistinguishable Streams

Our construction of the two indistinguishable streams is recursive. Below, we give an adversarial procedure AdvStrategy for generating items in streams π and ϱ, while making the gap as large as possible and ensuring that they are indistinguishable. The procedure AdvStrategy takes as input the level of recursion k and the indistinguishable streams π and ϱ constructed so far. It also takes two open intervals (ℓ_π, r_π) and (ℓ_ϱ, r_ϱ) of the universe such that so far there is no item from interval (ℓ_π, r_π) in stream π and similarly, there is no item from interval (ℓ_ϱ, r_ϱ) in stream ϱ. Recall that we consider the universe of items to be continuous, namely, that we can generate sufficiently many items within both the intervals.

The strategy for k = 1 is trivial: We just send arbitrary 2/ε items from (ℓ_π, r_π) to π and any 2/ε items from (ℓ_ϱ, r_ϱ) to ϱ, ordered in the same way for both streams. For k > 1, we first use AdvStrategy recursively for level k − 1. From the resulting memory states, we find the largest gap inside the intervals, define two new intervals on the extreme parts of the gap, and use the procedure recursively for k − 1 again in these new intervals. The remainder of this section specifies this procedure in more detail.

Notation. For an item a in stream σ, let next(σ, a) be the next item in the ordering of σ, i.e., the smallest item in σ that is larger than a (we will not need next(σ, a) when a is the largest item in σ). Similarly, for an item b in stream σ, let prev(σ, b) be the previous item in the ordering of σ (undefined for the smallest item in σ). Note that next(σ, a) or prev(σ, b) may well not be stored by D.

For an interval (ℓ, r) of items and an array I of items, we use I^{(ℓ, r)} to denote the restriction of I to (ℓ, r), enclosed by ℓ and r. That is, I^{(ℓ, r)} is the array of items ℓ, I[i], I[i+1], …, I[j], r, where i and j are the minimal and maximal indexes of an item in I that belongs to interval (ℓ, r), respectively. Items in I^{(ℓ, r)} are again sorted and indexed from 1. Recall also that by our convention, I_σ is the item array after processing some stream σ.

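1: Input: Integer k ≥ 1, streams π and ϱ, intervals (ℓ_π, r_π) and (ℓ_ϱ, r_ϱ) of items
2: Output: Streams ππ″ and ϱϱ″, where π″ and ϱ″ are substreams with N_k items from (ℓ_π, r_π) and (ℓ_ϱ, r_ϱ), respectively
3: if k = 1 then
4:     Append any 2/ε items from interval (ℓ_π, r_π) in their order to π
5:     Append any 2/ε items from interval (ℓ_ϱ, r_ϱ) in their order to ϱ
6:     return streams π and ϱ
7: else
8:     (π′, ϱ′) ← AdvStrategy(k − 1, π, ϱ, (ℓ_π, r_π), (ℓ_ϱ, r_ϱ))
9:     I′_π ← I_{π′}^{(ℓ_π, r_π)} and I′_ϱ ← I_{ϱ′}^{(ℓ_ϱ, r_ϱ)}
10:     i ← argmax_{1 ≤ i < |I′_π|} ( rank_{ϱ′}(I′_ϱ[i+1]) − rank_{π′}(I′_π[i]) )    ▷ Position of the largest gap in intervals (ℓ_π, r_π) and (ℓ_ϱ, r_ϱ)
11:     (α_π, β_π) ← (I′_π[i], next(π′, I′_π[i]))    ▷ New interval for π
12:     (α_ϱ, β_ϱ) ← (prev(ϱ′, I′_ϱ[i+1]), I′_ϱ[i+1])    ▷ New interval for ϱ
13:     return AdvStrategy(k − 1, π′, ϱ′, (α_π, β_π), (α_ϱ, β_ϱ))
Pseudocode 1 Adversarial procedure AdvStrategy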

Figure 1: An illustration of the largest gap, determined in line 10, and of the new intervals and for and , defined in lines 11 and 12, respectively. The items in the streams are real numbers and we depict them on the real line, the top one for and the bottom one for . Each item is represented either by a short line segment if it is stored in the item array, or by a cross otherwise. In this example, there are items in both streams, and we have so that the largest gap can be of size at most . The ranks of stored items are and w.r.t. stream , and and w.r.t. stream , thus the gap of size of is between the first and second stored items, and also between the second and third stored items; the latter is depicted in the figure. Note that we look for the largest gap only in the current intervals.

Pseudocode 1 gives the formal description of the adversarial strategy. See Figure 1 for an illustration and Appendix A for an example of the construction with . Note that and are the item arrays of for and restricted to the current intervals and , respectively. Below, we show that the streams constructed are indeed indistinguishable and that the procedure is well-defined, namely that , which is needed for the definition of the gap in line 10. We first give some observations. The initial call of the strategy for some integer is , where stands for the empty stream and and are the minimum and maximum items in , respectively. Observe that the recursion tree of this call has leaves which correspond to calling the strategy for and that the items are appended to streams only in the leaves, namely, items to each stream in each leaf. It follows that the number of items appended during any execution of AdvStrategy() is . Note that for a general recursive call, the streams and at input time may already contain some items. We remark that items in each of and as constructed are distinct within the streams (but the two streams may share some items, which does not affect our analysis). Also, the behavior of a comparison-based quantile summary may be different when processing items appended during the recursive call in line 8 and when processing items from the call in line 13. The reason is that the computation of is influenced also by items outside the intervals, i.e., by items in streams and that are from other branches of the recursion tree.

Observe that, after processes one of streams and , the “largest gap” in the intervals and is between item of stream and item of , and that we recursively insert new items to the leftmost segment of the gap for stream and to the rightmost segment for stream , where a segment is an open interval between two items that are consecutive in the ordering of the stream. It follows that before the recursive call in line 8, there is so far no item from interval in stream and similarly, there is no item from interval in stream .

Another crucial property is that, after processes one of streams and , for any and it holds that , where the minimum over an empty set is defined as . This holds by the definition of the new intervals in lines 11 and 12.

We now prove that the strategy is well-defined and that the streams constructed are indistinguishable. In the next section, we go on to analyze the space used by the algorithm. We use the following Lemma derived from [7] (which is a simple consequence of the facts that is comparison-based and the memory states and are equivalent).

Lemma 7 (Implied by Lemma 2 in [7]).

Suppose that streams and are indistinguishable for and let and be the corresponding item arrays after processing and , respectively. Let be any two items such that . Then the streams and are indistinguishable.

Lemma 8.

Consider an execution of for and let and be the returned streams. Suppose that streams and are indistinguishable. Then (i) if , then (as defined in line 9), implying that index , determined in line 10, is well-defined in this execution, and (ii) the streams and are indistinguishable.

Proof.

The proof is by induction on . In the base case , we use the fact that the items from the corresponding intervals are appended in their order and that for any and . Thus, applying Lemma 7 for each pair of appended items, we get that and are indistinguishable.

Now consider . By applying the inductive hypothesis for the recursive call in line 8, streams and are indistinguishable. We first show (i), i.e., . Let and be the items in streams and , respectively. Condition (2) in Definition 4 implies that for any (where and are the full item arrays), there exists such that both and . As only the last items of and of are from intervals and , respectively, we get that the restricted item arrays and must have the same size.

To show (ii), we apply the inductive hypothesis for the recursive call in line 13 and get that streams are indistinguishable. ∎

Our final observation is that for any , we have that . The proof follows by the induction on (similarly to Lemma 8) and by the definition of the new intervals in lines 10-12, namely, by the fact that the new interval for is in the leftmost segment of the largest gap, while the new interval for is in the rightmost segment.

5 Space-Gap Inequality

In this section, we analyze the space used by data structure . We again proceed inductively and define to be the maximum size of the item array restricted to during the execution of on stream , where . For simplicity, we use .

We prove a lower bound for that depends on the largest gap between the restricted item arrays for and for . We enhance the definition of the gap to take the restriction of the intervals into account.

Definition 9.

For indistinguishable streams and and intervals and , let and be the substreams of and consisting of items from intervals and , respectively. Moreover, let and be the restricted item arrays after processing and , respectively. We define the largest gap between and in intervals and as

Note that the ranks are with respect to substreams and , and that the largest gap is always at least one, supposing that the ranks of stored items are not smaller for than for . We again have . Also, since the restricted item arrays are enclosed by interval boundaries, the following simple bound holds:

(1)

where . The following lemma (proved below) shows a stronger inequality between the space and the largest gap.

Lemma 10.

Consider an execution of . Let and be the returned streams, and let . Then, for , the following space-gap inequality holds with :

S_k ≥ c · (log_2 g + 1) · N_k / g    (2)

We remark that we do not optimize the constant c. Note that the right-hand side (RHS) of (2) is non-increasing for integer g ≥ 1, as (log_2 g + 1)/g is decreasing for g ≥ 2 and equals 1 for g ∈ {1, 2}.
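The monotonicity remark can be checked numerically. The sketch below assumes that the dependence of the right-hand side of (2) on the gap g is the factor (log_2 g + 1)/g; this form is a reading of the inequality rather than something the code derives, and the function name `gap_factor` is hypothetical.

```python
import math


def gap_factor(g):
    """Assumed dependence of the right-hand side of (2) on the gap g."""
    return (math.log2(g) + 1) / g


values = [gap_factor(g) for g in range(1, 1025)]

# Non-increasing over all integers g >= 1 in the tested range ...
non_increasing = all(values[i] >= values[i + 1] for i in range(len(values) - 1))

# ... and strictly decreasing from g = 2 onwards.
strictly_decreasing_from_two = all(
    values[i] > values[i + 1] for i in range(1, len(values) - 1)
)
```

Under this assumption the factor equals 1 at both g = 1 and g = 2 and shrinks afterwards, so the bound is weakest when the gap is as large as allowed.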

We first observe that Theorem 2 directly follows from Lemma 10, and then our subsequent work will be in proving this space-gap inequality. Indeed, consider any integer and let be the streams of length obtained from the strategy. Let . Since and are indistinguishable by Lemma 8, we have by Lemma 6. Since the RHS of (2) is decreasing for and , it becomes the smallest for . Thus, by Lemma 10, the memory used is at least

Proof of Lemma 10.

The proof is by induction on . First, observe that (2) holds almost trivially if . Indeed, we have , and by the bound in (1), . Similarly, if , then (2) holds, since the RHS of (2) is at most and . (Footnote 2: Note, however, that we cannot use Lemma 6 to show , since the largest gap has size bounded by times the length of or , which can be much larger than .) We thus assume that , which immediately implies the base case of the induction, since because .

Consider . We refer to streams , item arrays and , and intervals and with the same meaning as in Pseudocode 1. Additionally, we use the following notation:

  • Let be the substreams constructed during the recursive call in line 8. Let be the size of (or, equivalently, of ), and let be the largest gap in the input intervals after processes one of streams and .

  • Let and be the item arrays restricted to the new intervals after processes streams and , respectively. Let be the size of , and let be the largest gap in the new intervals. Let and be the substreams constructed during the recursive call in line 13.

  • Let and be the item arrays restricted to the input intervals after processes streams and , respectively.

  • Finally, let and be the substreams of and , restricted to intervals and , respectively (that is, and consist of the items appended during the considered execution).

We remark that notation abbreviates the restriction to intervals and , while notation implicitly denotes the restriction to the new intervals and . Note that , , and , and similarly for streams , , and . We now show a crucial relation between the gaps.

Claim 11.

Proof.

Let be , i.e., the position of the largest gap in the arrays and . Let be such that (and ). To prove the claim, it is sufficient to show

(3)

as the difference on the LHS is taken into account in the definition of . We have , by the definitions of and . Furthermore, for any items and