 # Nearly Optimal Distinct Elements and Heavy Hitters on Sliding Windows

We study the distinct elements and ℓ_p-heavy hitters problems in the sliding window model, where only the most recent n elements in the data stream form the underlying set. We first introduce the composable histogram, a simple twist on the exponential (Datar et al., SODA 2002) and smooth histograms (Braverman and Ostrovsky, FOCS 2007) that may be of independent interest. We then show that the composable histogram, along with a careful combination of existing techniques to track either the identity or frequency of a few specific items, suffices to obtain algorithms for both distinct elements and ℓ_p-heavy hitters that are nearly optimal in both n and ϵ. Applying our new composable histogram framework, we provide an algorithm that outputs a (1+ϵ)-approximation to the number of distinct elements in the sliding window model and uses $O\left(\frac{1}{\epsilon^2}\log n\log\frac{1}{\epsilon}\log\log n+\frac{1}{\epsilon}\log^2 n\right)$ bits of space. For ℓ_p-heavy hitters, we provide an algorithm using space $O\left(\frac{1}{\epsilon^p}\log^2 n\left(\log\log n+\log\frac{1}{\epsilon}\right)\right)$ for $0<p\le 2$, improving upon the best-known algorithm for ℓ_2-heavy hitters (Braverman et al., COCOON 2014), which has space complexity $O\left(\frac{1}{\epsilon^4}\log^3 n\right)$. We also show complementing nearly optimal lower bounds of $\Omega\left(\frac{1}{\epsilon}\log^2 n+\frac{1}{\epsilon^2}\log n\right)$ for distinct elements and $\Omega\left(\frac{1}{\epsilon^p}\log^2 n\right)$ for ℓ_p-heavy hitters, both tight up to $O(\log\log n)$ and $O\left(\log\frac{1}{\epsilon}\right)$ factors.


## 1 Introduction

The streaming model has emerged as a popular computational model to describe large data sets that arrive sequentially. In the streaming model, each element of the input arrives one-by-one and algorithms can only access each element once. This implies that any element that is not explicitly stored by the algorithm is lost forever. While the streaming model is broadly useful, it does not fully capture the situation in domains where data is time-sensitive such as network monitoring [Cor13, CG08, CM05b] and event detection in social media [OMM14]. In these domains, elements of the stream appearing more recently are considered more relevant than older elements. The sliding window model was developed to capture this situation [DGIM02]. In this model, the goal is to maintain computation on only the most recent elements of the stream, rather than on the stream in its entirety. We call the most recent elements active and the remaining elements expired. Any query is performed over the set of active items (referred to as the current window) while ignoring all expired elements.
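As a point of reference for the model just described, the following is a minimal sketch of an exact, linear-space baseline that stores every active element. It is not an algorithm from this paper; the class name `ExactWindow` and its methods are hypothetical names chosen for illustration. It only makes the notions of active and expired elements concrete.

```python
from collections import Counter, deque


class ExactWindow:
    """Exact baseline (hypothetical helper): stores all n active elements."""

    def __init__(self, n):
        self.n = n                # window length
        self.active = deque()     # the most recent n elements, oldest first
        self.freq = Counter()     # frequencies of the active elements only

    def update(self, x):
        self.active.append(x)
        self.freq[x] += 1
        if len(self.active) > self.n:
            # The oldest element expires and no longer counts toward queries.
            old = self.active.popleft()
            self.freq[old] -= 1
            if self.freq[old] == 0:
                del self.freq[old]

    def distinct(self):
        # Number of items with nonzero frequency in the current window.
        return len(self.freq)


w = ExactWindow(n=3)
for x in [1, 2, 2, 3, 3, 3]:
    w.update(x)
print(w.distinct())  # the window is [3, 3, 3]
```

This baseline uses space linear in the window length, which is what the sublinear-space algorithms discussed in this paper aim to avoid.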

The problem of identifying the number of distinct elements is one of the foundational problems in the streaming model.

###### Problem 1 (Distinct elements)

Given an input $S$ of elements in $[m]$, output the number of items $i$ whose frequency $f_i$ satisfies $f_i > 0$.

The objective of identifying heavy hitters, also known as frequent items, is also one of the most well-studied and fundamental problems.

###### Problem 2 (ℓp-heavy hitters)

Given parameters and an input of elements in , output all items whose frequency satisfies and no item for which , where . (The parameter is typically assumed to be at least for some fixed constant .)
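To make the problem statement concrete, here is a brute-force sketch that computes heavy hitters exactly on a fully stored window. The thresholds in this copy of the problem statement are missing, so the code assumes the common convention that an item is heavy when its frequency is at least $\epsilon\,(F_p)^{1/p}$ with $F_p=\sum_i f_i^p$; that convention, and the function name `exact_heavy_hitters`, are assumptions for illustration only.

```python
from collections import Counter


def exact_heavy_hitters(window, eps, p):
    """Brute force (hypothetical helper): items with f_i >= eps * (F_p)^(1/p)."""
    freq = Counter(window)
    # Assumed definition: F_p is the sum of p-th powers of the frequencies.
    norm = sum(f ** p for f in freq.values()) ** (1.0 / p)
    return {item for item, f in freq.items() if f >= eps * norm}


window = [1] * 9 + [2, 3, 4, 5, 6, 7, 8]
heavy = exact_heavy_hitters(window, eps=0.5, p=2)
print(heavy)
```

Here item 1 has frequency 9, while $F_2 = 81 + 7 = 88$ gives a threshold of about 4.69, so only item 1 qualifies.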

In this paper, we study the distinct elements and heavy hitters problems in the sliding window model. We show almost tight results for both problems, using several clean tweaks to existing algorithms. In particular, we introduce the composable histogram, a modification to the exponential histogram [DGIM02] and smooth histogram [BO07], that may be of independent interest. We detail our results and techniques in the following section, but defer complete proofs to the full version of the paper [BGL18].

### 1.1 Our Contributions

#### Distinct elements.

An algorithm storing bits in the insertion-only model was previously provided [KNW10]. Plugging the algorithm into the smooth histogram framework of [BO07] yields a space complexity of bits. We improve this significantly as detailed in the following theorem.

###### Theorem 1

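Given $\epsilon > 0$, there exists an algorithm that, with probability at least , provides a $(1+\epsilon)$-approximation to the number of distinct elements in the sliding window model, using $O\left(\frac{1}{\epsilon^2}\log n\log\frac{1}{\epsilon}\log\log n+\frac{1}{\epsilon}\log^2 n\right)$ bits of space.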

A known lower bound is $\Omega\left(\frac{1}{\epsilon^2}+\log n\right)$ bits [AMS99, IW03] for insertion-only streams, which also applies to sliding windows since that model is strictly more difficult. We give a lower bound for distinct elements in the sliding window model, showing that our algorithm is nearly optimal, up to $\log\log n$ and $\log\frac{1}{\epsilon}$ factors, in both $n$ and $\epsilon$.

###### Theorem 2

Let . Any one-pass sliding window algorithm that returns a $(1+\epsilon)$-approximation to the number of distinct elements with probability requires $\Omega\left(\frac{1}{\epsilon}\log^2 n+\frac{1}{\epsilon^2}\log n\right)$ bits of space.

#### ℓp-heavy hitters.

We first recall in Lemma 16 a condition that allows the reduction from the problem of finding the $\ell_p$-heavy hitters for to the problem of finding the $\ell_2$-heavy hitters. An algorithm of [BCI17] allows us to maintain an estimate of $F_2$. However, observe in Problem 2 that an estimate for $F_2$ is only part of the problem. We must also identify which elements are heavy. First, we show how to use tools from [BCIW16] to find a superset of the heavy hitters. This alone is not enough since we may return false positives (elements such that ). By keeping a careful count of the elements (shown in Section 4), we are able to remove these false positives and obtain the following result, where we have set :

###### Theorem 3

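Given $\epsilon>0$ and $0<p\le 2$, there exists an algorithm in the sliding window model that, with probability at least , outputs all indices for which , and reports no indices for which . The algorithm has space complexity (in bits) $O\left(\frac{1}{\epsilon^p}\log^2 n\left(\log\log n+\log\frac{1}{\epsilon}\right)\right)$.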

Finally, we obtain a lower bound for $\ell_p$-heavy hitters in the sliding window model, showing that our algorithm is nearly optimal (up to $\log\log n$ and $\log\frac{1}{\epsilon}$ factors) in both $n$ and $\epsilon$.

###### Theorem 4

Let and . Any one-pass streaming algorithm that returns the $\ell_p$-heavy hitters in the sliding window model with probability requires $\Omega\left(\frac{1}{\epsilon^p}\log^2 n\right)$ bits of space.

More details are provided in Section 4 and Section 5.

By standard amplification techniques any result that succeeds with probability can be made to succeed with probability while multiplying the space and time complexities by . Therefore Theorem 1 and Theorem 15 can be taken with regard to any positive probability of failure.
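The amplification step mentioned above is commonly realized by taking the median of independent repetitions of a numerical estimator. The following is a minimal sketch of that idea, not code from the paper. The function name `amplify` is hypothetical, and the deterministic `outcomes` cycle is a stand-in for a randomized estimator that is correct two times out of three (an illustrative success rate, assumed here).

```python
import itertools
import statistics


def amplify(run_once, repetitions):
    """Median of several runs of an estimator that is usually, but not always, accurate."""
    return statistics.median(run_once() for _ in range(repetitions))


# Deterministic stand-in: the "estimator" returns the right answer (100)
# on two out of every three calls and a wild outlier otherwise.
outcomes = itertools.cycle([100, 100, 5000])
estimate = amplify(lambda: next(outcomes), repetitions=9)
print(estimate)
```

Because a majority of the runs are accurate, the median lands on an accurate value even though individual runs fail. For set-valued outputs such as heavy hitters, a per-item majority vote plays the same role.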

See Table 1 for a comparison between our results and previous work.

### 1.2 Our Techniques

We introduce a simple extension of the exponential and smooth histogram frameworks, which use several instances of an underlying streaming algorithm. In contrast with the existing frameworks, where different sketches are maintained separately, we observe in Section 2 that when the underlying algorithm has certain guarantees, we can store these sketches more efficiently.

#### Sketching Algorithms

Consider the sliding window model, where elements eventually expire. A very simple (but wasteful) algorithm is to simply begin a new instance of the insertion-only algorithm upon the arrival of each new element (Figure 1). The smooth histogram of [BO07], summarized in Algorithm 1, shows that storing only instances suffices.

Algorithm 1 may delete indices for either of two reasons. The first (Lines 9-10) is that the index simply expires from the sliding window. The second (Lines 7-8) is that the indices immediately before () and after () are so close that they can be used to approximate .
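Since Algorithm 1 is not reproduced here, the two deletion rules can be illustrated with a simplified sketch of a smooth histogram. This is an assumption-laden toy, not the paper's Algorithm 1: the underlying "sketch" `ExactDistinct` stores a full set (so it saves no space), the class and method names are hypothetical, and `beta` plays the role of the smoothness parameter for the distinct-count function.

```python
class ExactDistinct:
    """Stand-in for an insertion-only sketch; stores the whole set (illustration only)."""

    def __init__(self):
        self.seen = set()

    def update(self, x):
        self.seen.add(x)

    def value(self):
        return len(self.seen)


class SmoothHistogram:
    def __init__(self, n, beta, make_sketch=ExactDistinct):
        self.n, self.beta, self.make_sketch = n, beta, make_sketch
        self.t = 0
        self.instances = []  # (start_time, sketch) pairs, oldest first

    def update(self, x):
        self.t += 1
        self.instances.append((self.t, self.make_sketch()))
        for _, sketch in self.instances:
            sketch.update(x)
        # Rule 1: delete instances sandwiched between two instances whose
        # values are within a (1 - beta) factor of each other.
        i = 0
        while i < len(self.instances) - 2:
            vi = self.instances[i][1].value()
            j = len(self.instances) - 1
            while j > i + 1 and self.instances[j][1].value() < (1 - self.beta) * vi:
                j -= 1
            del self.instances[i + 1:j]
            i += 1
        # Rule 2: delete the oldest instance once the next one also starts
        # at or before the beginning of the window (the old one has expired).
        window_start = self.t - self.n + 1
        while len(self.instances) > 1 and self.instances[1][0] <= window_start:
            self.instances.pop(0)

    def query(self):
        # The oldest kept instance covers a superset of the window.
        return self.instances[0][1].value()


stream = [(i * i * i + 3 * i) % 50 for i in range(300)]
hist = SmoothHistogram(n=20, beta=0.25)
for x in stream:
    hist.update(x)
print(hist.query(), len(set(stream[-20:])))
```

In this toy, the reported value never undercounts the window and overshoots by at most a $1/(1-\beta)$ factor, because the distinct-count function loses at most the elements that separated two neighboring instances when the ones between them were deleted.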

For the distinct elements problem (Section 3), we first claim that a well-known streaming algorithm [BJK02] provides a -approximation to the number of distinct elements at all points in the stream. Although this algorithm is suboptimal for insertion-only streams, we show that it is amenable to the conditions of a composable histogram (Theorem 6). Namely, we show there is a sketch of this algorithm that is monotonic over suffixes of the stream, and thus there exists an efficient encoding that efficiently stores for each , which allows us to reduce the space overhead for the distinct elements problem.

For -heavy hitters (Section 4), we show that the norm algorithm of [BCI17] also satisfies the sketching requirement. Thus, plugging this into Algorithm 1 yields a method to maintain an estimate of . Algorithm 2 uses this subroutine to return the identities of the heavy hitters. However, we would still require that all instances succeed since even instances that fail adversarially could render the entire structure invalid by tricking the histogram into deleting the wrong information (see [BO07] for details). We show that the norm algorithm of [BCI17] actually contains additional structure that only requires the correctness of instances, thus improving our space usage.

### 1.3 Lower Bounds

#### Distinct elements.

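To show a lower bound of $\Omega\left(\frac{1}{\epsilon}\log^2 n+\frac{1}{\epsilon^2}\log n\right)$ for the distinct elements problem, we show in Theorem 19 a lower bound of $\Omega\left(\frac{1}{\epsilon}\log^2 n\right)$ and we show in Theorem 22 a lower bound of $\Omega\left(\frac{1}{\epsilon^2}\log n\right)$. We first obtain a lower bound of $\Omega\left(\frac{1}{\epsilon}\log^2 n\right)$ by a reduction from the IndexGreater problem, where Alice is given a string and each has bits so that has bits in total. Bob is given integers and and must determine whether or .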

Given an instance of the IndexGreater problem, Alice splits the data stream into blocks of size and further splits each block into pieces of length , padding the remainder of each block with zeros if necessary. For each , Alice encodes by inserting the elements into piece of block . Thus, the number of distinct elements in each block is much larger than the sum of the number of distinct elements in the subsequent blocks. Furthermore, the location of the distinct elements in block encodes , so that Bob can recover and compare it with .

We then obtain a lower bound of by a reduction from the GapHamming problem. In this problem, Alice and Bob receive length- bitstrings and , which have Hamming distance either at least or at most , and must decide whether the Hamming distance between and is at least . Recall that for , a -approximation can differentiate between at least and at most . We use this idea to show a lower bound of by embedding instances of GapHamming into the stream. As in the previous case, the number of distinct elements corresponding to each instance is much larger than the sum of the number of distinct elements for the remaining instances, so that a -approximation to the number of distinct elements in the sliding window solves the GapHamming problem for each instance.

#### Heavy hitters.

To show a lower bound on the problem of finding -heavy hitters in the sliding window model, we give a reduction from the AugmentedIndex problem. Recall that in the AugmentedIndex problem, Alice is given a length- string (which we write as ) while Bob is given an index , as well as , and must output the symbol of the string, . To encode for , Alice creates a data stream with the invariant that the heavy hitters in the suffix encode . Specifically, the heavy hitters in the suffix will be concentrated in the substream and the identities of each heavy hitter in gives a bit of information about the value of . To determine , Bob expires the elements so all that remains in the sliding window is , whose heavy hitters encode .

### 1.4 Related Work

The study of the distinct elements problem in the streaming model was initiated by Flajolet and Martin [FM83] and developed by a long line of work [AMS99, GT01, BJK02, DF03, FFGM07]. Kane, Nelson, and Woodruff [KNW10] give an optimal algorithm, using $O\left(\frac{1}{\epsilon^2}+\log n\right)$ bits of space, for providing a $(1+\epsilon)$-approximation to the number of distinct elements in a data stream, with constant probability. Blasiok [Bla18] shows that to boost this probability up to $1-\delta$ for a given $0<\delta<1$, the standard approach of running $O\left(\log\frac{1}{\delta}\right)$ independent instances is actually sub-optimal and gives an optimal algorithm that uses $O\left(\frac{\log\delta^{-1}}{\epsilon^2}+\log n\right)$ bits of space.

The $\ell_1$-heavy hitters problem was first solved by Misra and Gries, who give a deterministic streaming algorithm using $O\left(\frac{1}{\epsilon}\log n\right)$ space [MG82]. Other techniques include the CountMin sketch [CM05a], sticky sampling [MM12], lossy counting [MM12], sample and hold [EV03], multi-stage Bloom filters [CFM09], sketch-guided sampling [KX06], and CountSketch [CCF04]. Among the numerous applications of the heavy hitters problem are network monitoring [DLM02, SW04], denial of service prevention [EV03, BAE07, CKMS08], moment estimation [IW05], $\ell_p$-sampling [MW10], finding duplicates [GR09], iceberg queries [FSG98], and entropy estimation [CCM10, HNO08].
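For readers unfamiliar with the deterministic algorithm of Misra and Gries mentioned above, the following is a minimal sketch of its usual textbook form for insertion-only streams. The function name `misra_gries` and the parameter `k` (the algorithm keeps at most `k - 1` counters) are illustrative choices, not notation from this paper.

```python
def misra_gries(stream, k):
    """Keep at most k - 1 counters; any item with frequency > len(stream) / k survives."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement every counter and drop the empty ones.
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters


stream = [1] * 6 + [2, 3, 4]
summary = misra_gries(stream, k=3)
print(summary)
```

Each stored count underestimates the true frequency by at most `len(stream) / k`, which is why a frequent item such as 1 above is guaranteed to remain in the summary.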

A stronger notion of “heavy hitters” is the $\ell_2$-heavy hitters. This is stronger than the $\ell_1$-guarantee since if then (and so ). Thus any algorithm that finds the $\ell_2$-heavy hitters will also find all items satisfying the $\ell_1$-guarantee. In contrast, consider a stream that has for some and for all other elements in the universe. Then the $\ell_2$-heavy hitters algorithm will successfully identify for some constant , whereas an algorithm that only provides the $\ell_1$-guarantee requires , and therefore space for identifying . Moreover, the $\ell_2$-guarantee is the best we can do in polylogarithmic space, since for $p>2$ it has been shown that identifying $\ell_p$-heavy hitters requires $\Omega\left(n^{1-2/p}\right)$ bits of space [CKS03, BJKS04].

The most fundamental data stream setting is the insertion-only model where elements arrive one-by-one. In the insertion-deletion model, a previously inserted element can be deleted (each stream element is assigned $+1$ or $-1$, generalizing the insertion-only model where only $+1$ is used). Finally, in the sliding window model, a length $n$ is given and the stream consists only of insertions; points expire after $n$ insertions, meaning that (unlike the insertion-deletion model) the deletions are implicit. Letting be the stream, at time the frequency vector is built from the window as the active elements, whereas items are expired. The objective is to identify and report the “heavy hitters”, namely, the items for which is large with respect to .

Table 2 shows prior work for -heavy hitters in the various streaming models. A retuning of in [TZ12] solves the problem of -heavy hitters in bits of space. More recently, [BCIW16] presents an -heavy hitters algorithm using space. This algorithm is further improved to an space algorithm in [BCI17], which is optimal.

In the insertion-deletion model, CountSketch is space optimal [CCF04, JST11], but the update time per arriving element is improved by [LNNT16]. Thus in some sense, the $\ell_2$-heavy hitters problem is completely understood in all regimes except the sliding window model. We provide a nearly optimal algorithm for this setting, as shown in Table 2.

We now turn our attention to the sliding window model. The pioneering work by Datar et al. [DGIM02] introduced the exponential histogram as a framework for estimating statistics in the sliding window model. Among the applications of the exponential histogram are quantities such as count, sum of positive integers, average, and norms. Numerous other significant works include improvements to count and sum [GT02], frequent itemsets [CWYM06], frequency counts and quantiles [AM04, LT06], rarity and similarity [DM02], variance and $k$-medians [BDMO03] and other geometric problems [FKZ05, CS06]. Braverman and Ostrovsky [BO07] introduced the smooth histogram as a framework that extends to smooth functions. [BO07] also provides sliding window algorithms for frequency moments, geometric mean and longest increasing subsequence. The ideas presented by [BO07] also led to a number of other results in the sliding window model [CMS13, BLLM15, BOR15, BLLM16, CNZ16, ELVZ17, BDUZ18]. In particular, Braverman et al. [BGO14] provide an algorithm that finds the $\ell_2$-heavy hitters in the sliding window model with for some constant , using $O\left(\frac{1}{\epsilon^4}\log^3 n\right)$ bits of space, improving on results by [HT08]. [BEFK16] also implements and provides empirical analysis of algorithms finding heavy hitters in the sliding window model. Significantly, these data structures consider insertion-only data streams for the sliding window model; once an element arrives in the data stream, it remains until it expires. It remains a challenge to provide a general framework for data streams that might contain elements “negative” in magnitude, or even strict turnstile models. For a survey on sliding window algorithms, we refer the reader to [Bra16].

## 2 Composable Histogram Data Structure Framework

We first describe a data structure which improves upon smooth histograms for the estimation of functions with a certain class of algorithms. This data structure provides the intuition for the space bounds in Theorem 1. Before describing the data structure, we need the definition of a smooth function.

###### Definition 5

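[BO07] A function $f$ is $(\alpha,\beta)$-smooth if it has the following properties:

- Monotonicity: $f(A)\ge f(B)$ for $B\subseteq A$ ($B$ is a suffix of $A$).
- Polynomial boundedness: There exists $c>0$ such that $f(A)\le n^c$.
- Smoothness: For any $0<\epsilon<1$, there exists $0<\beta\le\alpha<1$, so that if $B\subseteq A$ and $(1-\beta)f(A)\le f(B)$, then $(1-\alpha)f(A\cup C)\le f(B\cup C)$ for any adjacent $C$.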

We emphasize a crucial observation made in [BO07]. Namely, for , is a -smooth function while for , is a -smooth function.

Given a data stream and a function , let represent applied to the substream . Furthermore, let represent the data structure used to approximate .

###### Theorem 6

Let be an -smooth function so that for some constant . Suppose that for all :

1. There exists an algorithm that maintains at each time a data structure which allows it to output a value so that

   $$\Pr\left[\left|\hat{f}(1,t)-f(1,t)\right|\le\frac{\epsilon}{2}f(1,t),\ \text{for all } 0\le t\le n\right]\ge 1-\delta.$$
2. There exists an algorithm which, given and , can compute . Moreover, suppose storing uses bits of space.

Then there exists an algorithm that provides a -approximation to on the sliding window, using bits of space.

We remark that the first condition of Theorem 6 is called “strong tracking” and is well motivated by [BDN17].

## 3 Distinct Elements

We first show that a well-known streaming algorithm that provides a -approximation to the number of distinct elements actually also provides strong tracking. Although this algorithm uses bits of space and is suboptimal for insertion-only streams, we show that it is amenable to the conditions of Theorem 6. Thus, we describe a few modifications to this algorithm to provide a -approximation to the number of distinct elements in the sliding window model.

Define $\mathrm{lsb}(x)$ to be the 0-based index of the least significant bit of a non-negative integer $x$ in binary representation. For example, and where we assume . Let and be a random hash function. Let so that is an unbiased estimator for . Moreover, for such that , the standard deviation of is . Let be a pairwise independent random hash function with . Let be the expected number of non-empty bins after balls are thrown at random into bins so that .
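The least-significant-bit function used for subsampling can be sketched as follows. The name `lsb`, the `width` parameter, and the convention of returning `width` for the input 0 are assumptions made for this illustration, since the corresponding example values are missing from this copy of the text.

```python
def lsb(x, width=32):
    """0-based index of the lowest set bit of a non-negative integer x."""
    if x == 0:
        return width  # assumed convention for the all-zero input
    # x & -x isolates the lowest set bit; its bit_length is the index plus one.
    return (x & -x).bit_length() - 1


print(lsb(12), lsb(1), lsb(8))

# Among the integers 1..1024, exactly a 1/8 fraction have lsb at least 3,
# which is the geometric decay that makes lsb of a hash value a sampling level.
level3 = sum(1 for x in range(1, 1025) if lsb(x) >= 3)
print(level3)
```

Applying `lsb` to a uniformly random hash value places an item at level $j$ or higher with probability about $2^{-j}$, which is the property the subsampling argument relies on.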

###### Fact 7

Blasiok provides an optimal algorithm for a constant factor approximation to the number of distinct elements with strong tracking.

###### Theorem 8

[Bla18] There is a streaming algorithm that, with probability , reports a -approximation to the number of distinct elements in the stream after every update and uses bits of space.

Thus we define an algorithm that provides a -approximation to the number of distinct elements in the stream after every update, using bits of space.

Since we can specifically track up to distinct elements, let us consider the case where the number of distinct elements is . Given access to to output an estimate , which is a -approximation to the number of distinct elements, we can determine an integer for which . Then the quantity provides both strong tracking as well as a -approximation to the number of distinct elements:

###### Lemma 9

[Bla18] The median of estimators is a -approximation at all times for which the number of distinct elements is , with constant probability.

Hence, it suffices to maintain for each , provided access to to find , and parallel repetitions are sufficient to decrease the variance.

Indeed, a well-known algorithm for maintaining simply keeps a table of bits. For , row of the table corresponds to . Specifically, the bit in entry of corresponds to if for all and corresponds to if there exists some such that . Therefore, the table maintains , so then Lemma 9 implies that the table also gives a -approximation to the number of distinct elements at all times, using bits of space and access to . Then the total space is after again using parallel repetitions to decrease the variance.

Naïvely using this algorithm in the sliding window model would give a space usage dependency of . To improve upon this space usage, consider maintaining tables for substreams where . Let represent the table corresponding to substream . Since is a suffix of , then the support of the table representing is a subset of the support of the table representing . That is, if the entry of is one, then the entry of is one, and similarly for each . Thus, instead of maintaining tables of bits corresponding to each of the , it suffices to maintain a single table where each entry represents the ID of the last table containing a bit of one in the entry. For example, if the entry of is zero but the entry of is one, then the entry for is . Hence, is a table of size , with each entry having size bits, for a total space of bits. Finally, we need bits to maintain the starting index for each of the tables represented by . Again using a number of repetitions, the space usage is .

Since this table is simply a clever encoding of the tables used in the smooth histogram data structure, correctness immediately follows. We emphasize that the improvement in space follows from the idea of Theorem 6. That is, instead of storing a separate table for each instance of the algorithm in the smooth histogram, we instead simply keep the difference between each instance.
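The encoding described above can be sketched as follows, under the stated nesting property: the bit table of a later suffix has support contained in the bit table of an earlier one. The function names `compose` and `decode` are hypothetical, and the tables are tiny hand-made examples rather than outputs of the actual algorithm.

```python
def compose(tables):
    """Merge nested bit tables into one table of IDs.

    tables[0] belongs to the longest suffix; each later table has support
    contained in the previous one. Entry (r, c) of the result is the 1-based
    ID of the last table with a one at (r, c), or 0 if no table has a one.
    """
    rows, cols = len(tables[0]), len(tables[0][0])
    combined = [[0] * cols for _ in range(rows)]
    for j, table in enumerate(tables, start=1):
        for r in range(rows):
            for c in range(cols):
                if table[r][c]:
                    combined[r][c] = j
    return combined


def decode(combined, j):
    """Recover the bit table of the j-th suffix (1-based) from the merged table."""
    return [[1 if entry >= j else 0 for entry in row] for row in combined]


t1 = [[1, 1], [1, 0]]
t2 = [[1, 0], [1, 0]]
t3 = [[0, 0], [1, 0]]
merged = compose([t1, t2, t3])
print(merged)
```

With $k$ tables, each merged entry needs about $\log(k+1)$ bits instead of $k$ separate bits, which is the source of the saving discussed in the text.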

Finally, observe that each column in is monotonically decreasing. This is because is a subset of . Alternatively, if an item has been sampled to level , it must have also been sampled to level . Instead of using bits per entry, we can efficiently encode the entries for each column in with the observation that each column is monotonically decreasing.

Proof of Theorem 1:   Since the largest index of is and has rows, the number of possible columns is , which can be encoded using bits. Correctness follows immediately from Lemma 9 and the fact that the estimator is monotonic. Again we use bits to maintain the starting index for each of the tables represented by . As has columns and accounting again for the repetitions to decrease the variance, the total space usage is bits.

## 4 ℓp Heavy Hitters

Subsequent analysis by Berinde et al. [BICS10] proved that many of the classic $\ell_1$-heavy hitter algorithms not only revealed the identity of the heavy hitters, but also provided estimates of their frequencies. Let be the vector whose largest entries are instead set to zero. Then an algorithm that, for each heavy hitter , outputs a quantity such that is said to satisfy the -tail guarantee. Jowhari et al. [JST11] show that an algorithm that finds the $\ell_2$-heavy hitters and satisfies the tail guarantee can also find the $\ell_p$-heavy hitters. Thus, we first show results for $\ell_2$-heavy hitters and then use this property to prove results for $\ell_p$-heavy hitters.

To meet the space guarantees of Theorem 15, we describe an algorithm, Algorithm 2, that only uses the framework of Algorithm 1 to provide a -approximation of the norm of the sliding window. We detail the other aspects of Algorithm 2 in the remainder of the section.

Recall that Algorithm 1 partitions the stream into a series of “jump-points” where increases by a constant multiplicative factor. The oldest jump point is before the sliding window and initiates the active window, while the remaining jump points are within the sliding window. Therefore, it is possible for some items to be reported as heavy hitters after the first jump point, even though they do not appear in the sliding window at all! For example, if the active window has norm , and the sliding window has norm , all instances of a heavy hitter in the active window can appear before the sliding window even begins. Thus, we must prune the list containing all heavy hitters to avoid the elements with low frequency in the sliding window.

To account for this, we begin a counter for each element immediately after the element is reported as a potential heavy hitter. However, the counter must be sensitive to the sliding window, and so we attempt to use a smooth-histogram to count the frequency of each element reported as a potential heavy hitter. Even though the count function is smooth, the necessity to track up to heavy hitters prevents us from being able to -approximate the count of each element. Fortunately, a constant approximation of the frequency of each element suffices to reject the elements whose frequency is less than . This additional data structure improves the space dependency to .

### 4.1 Background for Heavy Hitters

We now introduce concepts from [BCIW16, BCI17] to show the conditions of Theorem 6 apply, first describing an algorithm from [BCI17] that provides a good approximation of at all times.

###### Theorem 10 (Remark 8 in [BCI17])

For any and , there exists a one-pass streaming algorithm that outputs at each time a value so that

$$\Pr\left[\left|\hat{F}_2^{(t)}-F_2^{(t)}\right|\le\epsilon F_2^{(t)},\ \text{for all } 0\le t\le n\right]\ge 1-\delta,$$

and uses bits of space and update time.

The algorithm of Theorem 10 is a modified version of the AMS estimator [AMS99] as follows. Given vectors of 6-wise independent Rademacher (i.e. uniform $\pm 1$) random variables, let , where is the frequency vector at time . Then [BCI17] shows that is a reasonably good estimator for $F_2$. By keeping , we can compute from these sketches. Hence, the conditions of Theorem 6 are satisfied for , so Algorithm 1 can be applied to estimate the $\ell_2$ norm. One caveat is that naïvely, we still require the probability of failure for each instance of to be at most for the data structure to succeed with probability at least . We show in Appendix A that it suffices to only require the probability of failure for each instance of to be at most , thus incurring only additional space rather than . We now refer to a heavy hitter algorithm from [BCI17] that is space optimal up to factors.
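The reason sketches of this kind compose is linearity: the inner product of a fixed sign vector with the frequency vector of a substream equals the difference of the inner products taken at the two endpoints. The following minimal sketch illustrates this property only. It is not the algorithm of [BCI17]: it draws fully random signs from a seeded generator instead of 6-wise independent ones, and the names `AMSSketch`, `snapshot`, and `f2_estimate` are hypothetical.

```python
import random


class AMSSketch:
    def __init__(self, universe, rows, seed=0):
        rng = random.Random(seed)
        # Fully random +/-1 signs for simplicity (assumption; the text uses
        # 6-wise independent Rademacher variables).
        self.signs = [[rng.choice((-1, 1)) for _ in range(universe)]
                      for _ in range(rows)]
        self.z = [0] * rows  # z[j] = <signs[j], frequency vector so far>

    def update(self, item):
        for j, row in enumerate(self.signs):
            self.z[j] += row[item]

    def snapshot(self):
        return list(self.z)


def f2_estimate(z):
    """Average of squared inner products, the classic AMS-style estimate of F_2."""
    return sum(v * v for v in z) / len(z)


full = AMSSketch(universe=8, rows=4, seed=1)
for x in [0, 1, 1, 2]:
    full.update(x)
at_s = full.snapshot()              # sketch of the prefix
for x in [3, 3, 5]:
    full.update(x)
# Sketch of the substream after the snapshot, obtained by subtraction.
suffix_sketch = [zt - zs for zt, zs in zip(full.z, at_s)]

fresh = AMSSketch(universe=8, rows=4, seed=1)
for x in [3, 3, 5]:
    fresh.update(x)
print(suffix_sketch == fresh.z)
```

The subtraction reproduces exactly the sketch that a fresh instance with the same signs would have computed on the suffix alone, which is the kind of composability the second condition of Theorem 6 asks for.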

###### Theorem 11 (Theorem 11 in [BCI17])

For any and , there exists a one-pass streaming algorithm, denoted , that with probability at least , returns a set of -heavy hitters containing every -heavy hitter and an approximate frequency for every item returned satisfying the -tail guarantee. The algorithm uses bits of space and has update time and retrieval time.

Observe that Theorem 10 combined with Theorem 6 already yields a prohibitively expensive dependency on . Thus, we can only afford to set to some constant in Theorem 10 and have a constant approximation to in the sliding window.

At the conclusion of the stream, the data structure of Theorem 6 has another dilemma: either it reports the heavy hitters for a set of elements that is a superset of the sliding window or it reports the heavy hitters for a set of elements that is a subset of the sliding window. In the former case, we can report a number of unacceptable false positives, elements that are heavy hitters for but may not appear at all in the sliding window. In the latter case, we may entirely miss a number of heavy hitters, elements that are heavy hitters for the sliding window but arrive before begins. Therefore, we require a separate smooth histogram to track the counter of specific elements.

###### Theorem 12

For any , there exists an algorithm, denoted , that outputs a -approximation to the frequency of a given element in the sliding window model, using bits of space.

The algorithm follows directly from Theorem 6 and the observation that is -smooth.

### 4.2 ℓ2-Heavy Hitters Algorithm

We now prove Theorem 15 using Algorithm 2. We detail our -heavy hitters algorithm in full, using and -heavy hitters to refer to the -heavy hitters problem with parameter .

###### Lemma 13

Any element with frequency is output by Algorithm 2.

Proof :   Since the $\ell_2$ norm is a smooth function, there exists a smooth histogram which is an -estimation of the $\ell_2$ norm of the sliding window by Theorem 6. Thus, . With probability , any element whose frequency satisfies must have and is reported by in Step 4.

Since is instantiated along with , the sliding window may begin either before or after reports each heavy hitter. If the sliding window begins after the heavy hitter is reported, then all instances are counted by . Thus, the count of estimated by is at least , and so Step 12 will output .

On the other hand, the sliding window may begin before the heavy hitter is reported. Recall that the algorithm identifies and reports an element when it becomes an -heavy hitter with respect to the estimate of . Hence, there are at most instances of an element appearing in the active window before it is reported by . Since , any element whose frequency satisfies must have and therefore must have at least instances appearing in the stream after it is reported by . Thus, the count of estimated by is at least , and so Step 12 will output .

###### Lemma 14

No element with frequency is output by Algorithm 2.

Proof :   If is output by Step 12, then . By the properties of and , , where the last inequality comes from the fact that .

###### Theorem 15

Given , there exists an algorithm in the sliding window model (Algorithm 2) that with probability at least outputs all indices for which , and reports no indices for which . The algorithm has space complexity (in bits) .

Proof :   By Lemma 13 and Lemma 14, Algorithm 2 outputs all elements with frequency at least and no elements with frequency less than . We now proceed to analyze the space complexity of the algorithm. Step 1 uses Algorithm 1 in conjunction with the routine to maintain a -approximation to the -norm of the sliding window. By requiring the probability of failure to be in Theorem 10 and observing that in Theorem 6 suffices for a -approximation, it follows that Step 1 uses